Lecture 2.1 Outer Alignment: Reward Misspecification | October 3rd, 2023

Markov Decision Process

At time t, the environment is in a state S. The agent takes an action A and receives a scalar reward R. The goal is to choose actions that lead to the highest cumulative reward; the reward is used as a training signal to tune the neural network.
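
A minimal sketch of this agent-environment loop (assuming the gymnasium package and its CartPole-v1 environment, which are stand-ins not mentioned in the lecture):

    # Agent-environment loop of an MDP: observe state, act, receive reward.
    # Sketch only; gymnasium and CartPole-v1 are placeholders.
    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)           # initial state S
    total_reward = 0.0

    for t in range(200):
        action = env.action_space.sample()  # placeholder policy: random action A
        obs, reward, terminated, truncated, info = env.step(action)  # next state, scalar reward R
        total_reward += reward              # this scalar is the training signal a learner would use
        if terminated or truncated:
            break

    env.close()
    print("episode return:", total_reward)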

RL Challenges

  • Sparse and delayed rewards (CoastRunners game)
  • Partial observability / exploration vs. exploitation
  • Non-stationarity (dynamic environments)
  • Sim-to-real transfer (humans evaluate whether the desired outcome is achieved, rather than relying on the reward signal)
  • Sample efficiency / computational cost

AI Alignment

“How can we make sure the thing does what we want it to do?”

(Outer) misalignment examples:

  • Unintended “solutions” to the deployed problem (CoastRunners)
  • Flawed reward design (not accounting for bias in training data)

Differentiating undesirable from desirable novel solutions is difficult: human evaluations are limited.

→ How do we capture the human concept of a given task in a reward function?
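
A toy illustration of the gap between a proxy reward and the human concept of the task, in the spirit of the CoastRunners example (all numbers and names below are made up):

    # Toy reward misspecification: the proxy pays for hitting score targets,
    # but the human concept of the task is "finish the race".
    def proxy_reward(state):
        return 10.0 * state["targets_hit"]             # what the agent is optimized for

    def true_objective(state):
        return 1.0 if state["finished_race"] else 0.0  # what the designer wanted

    looping = {"targets_hit": 50, "finished_race": False}   # circles respawning targets forever
    finishing = {"targets_hit": 5, "finished_race": True}   # simply finishes the race

    print(proxy_reward(looping), true_objective(looping))       # 500.0  0.0
    print(proxy_reward(finishing), true_objective(finishing))   #  50.0  1.0
    # An optimizer prefers the looping behaviour even though it fails the real task.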

LLMs

  • Supervised learning: abstracts human language based on correlations and statistical patterns in the training data
  • Can generate text based on statistical likelihood and patterns in the data.
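
A rough sketch of generation by statistical likelihood: the model assigns a score (logit) to every token in its vocabulary, and the next token is sampled from the resulting probability distribution (the tiny vocabulary and logits below are invented for illustration):

    import numpy as np

    vocab = ["the", "cat", "sat", "mat", "."]
    logits = np.array([2.0, 0.5, 1.2, 0.3, -1.0])   # hypothetical model output
    rng = np.random.default_rng(0)

    def sample_next_token(logits, temperature=1.0):
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                        # softmax -> likelihood over tokens
        return rng.choice(len(logits), p=probs)     # sample according to patterns learned from data

    print(vocab[sample_next_token(logits)])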

Initial “raw” LLM → two sample responses → human eval → Safe LLM (aligned with RLHF).
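
A common ingredient of that pipeline is a preference (reward) model trained on the human comparisons between the two sampled responses. Below is a minimal sketch of a pairwise (Bradley-Terry style) preference loss; the feature tensors and the linear scorer are placeholders for a real LLM-based reward model:

    # Pairwise preference loss for a reward model, as used in RLHF pipelines.
    # Shapes and the scoring network are made up; a real setup scores
    # (prompt, response) pairs with an LLM backbone.
    import torch
    import torch.nn.functional as F

    reward_model = torch.nn.Linear(128, 1)          # placeholder scorer

    def preference_loss(chosen_feats, rejected_feats):
        r_chosen = reward_model(chosen_feats)       # reward for the human-preferred response
        r_rejected = reward_model(rejected_feats)   # reward for the rejected response
        # maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected)
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    chosen = torch.randn(8, 128)                    # fake features, batch of 8 comparisons
    rejected = torch.randn(8, 128)
    loss = preference_loss(chosen, rejected)
    loss.backward()                                 # gradients pull the model toward human preferences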


Challenges with RLHF:

  • RLHF degrades model output accuracy
  • Human preferences could be influenced by bad actors / only an estimation (how do we determine proper human preference?)
  • Neglects inner misalignment

Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” Proxy objectives (e.g. a scalar reward) lead to optimization toward the proxy rather than the true goal.

  1. Goal: Examine students
    • Proxy metric: Test scores → can lead to studying for the test rather than for comprehension
  2. Goal: Win a democratic election
    • Proxy metric: Votes → can lead to over-claimed promises
  3. Goal: Find a good function approximation
    • Proxy metric: Deviation between the data points and the fitted function → overfitting (see the sketch after this list)
  4. Goal: Align AI systems with human preferences
    • Proxy metric: Preference model learned from human feedback → sycophancy, model degradation, insufficient alignment (worse generalization)
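
A quick numerical illustration of example 3 (data invented for this sketch): minimizing deviation on the training points alone favours a high-degree polynomial that fits them almost exactly but approximates the underlying function poorly elsewhere.

    # Goodhart in miniature: the proxy metric (deviation on the training points)
    # can be driven to ~0 by a high-degree polynomial while the true goal
    # (a good function approximation) suffers.
    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 8)
    y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(8)
    x_test = np.linspace(0, 1, 200)
    y_test = np.sin(2 * np.pi * x_test)             # the function we actually want to approximate

    for degree in (3, 7):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)  # proxy metric
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)     # true goal
        print(degree, round(train_err, 4), round(test_err, 4))
    # The degree-7 fit wins on the proxy (near-zero training error) but
    # typically approximates the true function worse than the degree-3 fit.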

Research Question: How is model over-optimization affected by model size?


Goal: Understand the amount of overfitting and how it scales for safe model optimization

  • Increasing the number of human annotator samples reduces overfitting (see the sketch below)
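
One way to study this empirically (a sketch of the general recipe, not the exact experimental setup from the lecture): optimize a policy against a proxy reward model, use a held-out “gold” reward model as a stand-in for true human preference, and track both scores as optimization proceeds, e.g. against the KL divergence from the initial policy. The curves below are invented purely to illustrate the qualitative picture of over-optimization:

    # Illustrative over-optimization curves (made-up functional forms):
    # the proxy reward keeps rising with more optimization, while the gold
    # reward (stand-in for true human preference) peaks and then degrades.
    import numpy as np

    kl = np.linspace(0, 50, 11)                              # amount of optimization (KL from initial policy)
    proxy_reward = 0.9 * np.sqrt(kl)                         # hypothetical: proxy keeps improving
    gold_reward = np.sqrt(kl) * (1.0 - 0.12 * np.sqrt(kl))   # hypothetical: peaks, then drops

    for d, p, g in zip(kl, proxy_reward, gold_reward):
        print(f"KL={d:5.1f}  proxy={p:5.2f}  gold={g:5.2f}")
    # More human annotator samples (a better proxy) would shift the gold-reward
    # peak to the right, i.e. reduce overfitting to the proxy.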