Lecture 1 Llama 2 | October 3rd, 2023

Large Language Models (LLMs)

Challenges:

  • Statistical learning takes a lot of data
  • Humans have multisensory information
  • Memory augmentation (remembering)
  • Often doesn’t know what it doesn’t know
  • Interpretability → large black boxes

Llama 2

DeepMind Chinchilla → focused on compute-optimal scaling of both data and parameters

  • Llama 2 takes a new approach: emphasize data quality while keeping model size comparatively limited
  • Fast inference paradigm → a decoder-only model that relies on Rotary Positional Embedding (RoPE)
    • Positions are encoded with sine and cosine functions of different frequencies; RoPE applies them as rotations to the query/key vectors rather than as additive position embeddings (see the sketch after this list)
    • SwiGLU instead of ReLU
  • Byte Pair Encoding (BPE) tokenization, so unseen words can still be represented from subword pieces
  • Trains with the AdamW optimizer, which improves downstream-task performance after finetuning
  • But how do we make the model useful for everyone? → fine-tune chat models for different domains
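
The two architectural choices above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names, tensor shapes, the interleaved-pair rotation convention, and the base frequency of 10000 are assumptions made for the sketch.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotary Positional Embedding: rotate each (even, odd) pair of dimensions
    of the query/key vectors x by an angle that depends on the token position.

    x:         (seq_len, d) array with d even
    positions: (seq_len,) integer token positions
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,) frequencies
    angles = positions[:, None] * inv_freq[None, :]       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # split dims into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                    # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def swiglu(x, W_gate, W_up):
    """SwiGLU feed-forward gate: silu(x @ W_gate) * (x @ W_up).
    (The down-projection that normally follows is omitted for brevity.)"""
    gate = x @ W_gate
    return (gate / (1.0 + np.exp(-gate))) * (x @ W_up)    # silu(z) = z * sigmoid(z)
```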

Pretraining

  • 4k context length
  • Multi-head attention (MHA) → grouped-query attention (GQA); see the sketch below
    • Standard MHA runs out of memory at 1024, since keys and values must be cached for every head
  • Conscious choice to focus on English as the dataset language

  • Supervised finetuning (SFT) was bootstrapped with public datasets and third-party vendor annotations
  • Finetuning the model with temporal perception allows it to keep track of timelines over the course of a conversation
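
A minimal sketch of grouped-query attention, where several query heads share one key/value head. The names and shapes are assumed for illustration, and the causal mask is omitted for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Attention in which groups of query heads share a single key/value head,
    shrinking the KV cache relative to standard multi-head attention.

    q: (n_heads, seq, d_head)    k, v: (n_kv_heads, seq, d_head)
    """
    n_heads, seq, d_head = q.shape
    group = n_heads // n_kv_heads                          # query heads per KV head
    k = np.repeat(k, group, axis=0)                        # align KV heads with query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ v                                     # (n_heads, seq, d_head)
```

With n_kv_heads equal to n_heads this reduces to ordinary multi-head attention; with n_kv_heads = 1 it is multi-query attention.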

    Short Intermission to talk about Attention

    Attention allows a model to weigh the importance of each input in a sequence relative to the others. The mechanism produces weights that determine how much each element contributes to the final output. For example, in a sentence-translation task, the words “I” and “apple” might carry more weight in deriving the meaning of the sentence “I ate an apple” than less significant words such as “an” and “ate”. Let $h_t$ be the word representation (either an embedding or the concatenated hidden state(s) of an RNN) of dimension $K_w$ for the word at position $t$ in a sentence padded to a length of $M$.
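
Continuing that notation, here is a minimal sketch of the weighting step. The scoring function is an assumed dot product against a learned query vector w; the notes stop before specifying one.

```python
import numpy as np

def attention_weights(H, w):
    """Score each word representation h_t against a query vector w and
    return softmax weights alpha_t plus the weighted summary vector.

    H: (M, K_w) matrix whose rows are the word representations h_1 .. h_M
    w: (K_w,) query/context vector (a learned parameter in practice)
    """
    scores = H @ w                             # one relevance score per position
    alpha = np.exp(scores - scores.max())      # softmax, numerically stabilized
    alpha /= alpha.sum()
    summary = alpha @ H                        # weighted sum of the h_t
    return alpha, summary
```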

Reinforcement Learning from Human Feedback (RLHF)

Reward Model

  • Reward model trained on binary preference data collected from human annotators
  • Ranking loss over preference pairs (written out after this list), in which the reward model $r_\theta$ is optimized
  • Rejection sampling → sample K outputs per prompt with temperature scaling and keep the one the reward model scores highest
  • The policy language model is trained with a combination of PPO and rejection-sampling finetuning
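
The preference loss referenced above is most likely the standard binary ranking objective; writing $y_c$ for the response the annotator chose and $y_r$ for the rejected one given prompt $x$:

$$\mathcal{L}_{\text{ranking}} = -\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)$$

where $\sigma$ is the logistic sigmoid, so minimizing the loss pushes $r_\theta$ to score the preferred response higher. (Llama 2 additionally subtracts a margin term inside the sigmoid that grows with how strongly the annotator preferred $y_c$.)

For the rejection-sampling step, a minimal sketch; `generate` and `reward` are hypothetical stand-ins for the policy's sampler and the reward model:

```python
def rejection_sample(prompt, generate, reward, k=8, temperature=1.0):
    """Draw k candidate responses and keep the one the reward model scores highest.

    generate(prompt, temperature) -> str    # hypothetical policy sampler
    reward(prompt, response)      -> float  # hypothetical reward-model score
    """
    candidates = [generate(prompt, temperature) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))
```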

Evaluation and Safety

  • How do we build benchmarks for evaluation, given that conversations and interactions can take many different paths?
  • Human evaluation by side-by-side comparison is costly and time-consuming
  • Built-in system prompts often flag false positives

Explicit Measures

  • Removing personal information (PII) from the data and assigning toxicity scores to documents
  • Separate reward models for safety (to prevent unintended or unsafe outputs)
  • Context distillation for safety with adversarial prompts
  • Red-teaming to test robustness