Lecture 1: Llama 2 | October 3rd, 2023
Large Language Models (LLMs)
Challenges:
- Statistical learning takes a lot of data
- Humans have multisensory information
- Memory augmentation (remembering)
- Often doesn’t know what it doesn’t know
- Interpretability → large black boxes
Llama 2
DeepMind Chinchilla → focused on scaling both data and parameters
- Llama 2 takes a new approach, emphasizing data quality while keeping the quantity of data comparatively limited
- Fast inference paradigm → decoder-only model that relies on Rotary Positional Embedding (RoPE)
- RoPE rotates pairs of embedding dimensions by sine and cosine functions of different frequencies, with the rotation angle encoding the token's position in the sequence (see the RoPE sketch after this list)
- SwiGLU activation instead of ReLU in the feed-forward layers (see the SwiGLU sketch after this list)
- Byte Pair Encoding (BPE) tokenization, so unseen words can still be represented as sequences of known subword tokens
- Trained with the AdamW optimizer, which improves downstream task performance after finetuning
- But how do we make the model useful for everyone? → fine-tune chat models for different domains
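The rotation behind RoPE is easy to sketch directly. Below is a minimal NumPy illustration; the function name, the base of 10000, and the toy shapes are assumptions for illustration, not Llama 2's actual implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Minimal sketch: each pair of dimensions (2j, 2j+1) is rotated by an
    angle pos * theta_j, where theta_j = base**(-2j/dim), so the dot
    product of two rotated vectors depends on their relative position.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE expects an even embedding dimension"
    # One frequency per pair of dimensions.
    theta = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    angles = np.arange(seq_len)[:, None] * theta[None, :]  # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    # 2-D rotation applied to each (even, odd) pair.
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Rotate toy query/key vectors before computing attention scores
# (shapes are illustrative, not Llama 2's actual head dimensions).
rng = np.random.default_rng(0)
q = rope(rng.standard_normal((8, 64)))
k = rope(rng.standard_normal((8, 64)))
scores = q @ k.T  # these scores now depend on relative token positions
```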
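Likewise, a minimal sketch of a SwiGLU feed-forward block, assuming the common formulation down(SiLU(x·W_gate) ⊙ (x·W_up)); the weight names and toy sizes are illustrative, not Llama 2's actual code or dimensions:

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block with a SwiGLU gate instead of a plain ReLU MLP.

    x: (seq_len, d_model); w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model).
    The gate path goes through SiLU and is multiplied elementwise with the
    up-projection before being projected back down to d_model.
    """
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy shapes chosen for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff, seq = 64, 172, 8
y = swiglu_ffn(rng.standard_normal((seq, d_model)),
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_model, d_ff)),
               rng.standard_normal((d_ff, d_model)))
```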
Pretraining
- 4k context length
- Multi-head attention (MHA) → grouped-query attention (GQA); see the GQA sketch after this list
- MHA runs out of memory at 1024
- Consciously chose to focus on English as the dataset language
- Supervised Finetuning (SFT) was bootstrapped with public datasets and data from third-party vendors
- Finetuning the model with temporal perception helps it keep track of timelines within a conversation
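A minimal sketch of grouped-query attention, assuming the standard formulation in which a group of query heads shares a single key/value head; the head counts, shapes, and lack of masking or KV caching are simplifications for illustration, not Llama 2's actual implementation:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_query_heads, n_kv_heads):
    """Grouped-query attention for one sequence (no masking, no KV cache).

    q: (n_query_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head).
    Each group of n_query_heads // n_kv_heads query heads shares one K/V
    head, shrinking the KV cache relative to full multi-head attention.
    """
    group = n_query_heads // n_kv_heads
    d_head = q.shape[-1]
    outputs = []
    for h in range(n_query_heads):
        kv = h // group                                  # shared K/V head index
        scores = q[h] @ k[kv].T / np.sqrt(d_head)        # (seq, seq)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        outputs.append(weights @ v[kv])
    return np.stack(outputs)                             # (n_query_heads, seq, d_head)

# 8 query heads sharing 2 K/V heads (illustrative sizes, not Llama 2's).
rng = np.random.default_rng(0)
out = grouped_query_attention(rng.standard_normal((8, 16, 32)),
                              rng.standard_normal((2, 16, 32)),
                              rng.standard_normal((2, 16, 32)),
                              n_query_heads=8, n_kv_heads=2)
```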
Short Intermission to talk about Attention
Attention allows models to weigh the importance of a particular input in a sequence relative to others. The mechanism produces weights that determine how much each element contributes to the final output. For example, in a sentence translation task, the words “I” and “apple” might carry more weight in deriving the meaning of the sentence “I ate an apple”, than less significant words such as “an” and “ate”. Let
x_i be the word representation (either an embedding or a concatenation of hidden states of an RNN) of dimension d for the word at position i in a sentence padded to a length of n.
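To make the weighting concrete, here is a minimal NumPy sketch of scaled dot-product attention over such representations; the learned projections Wq, Wk, Wv and the toy dimensions are illustrative assumptions, not anything specific to Llama 2:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V.

    Row i of the softmax output holds the weights saying how much each
    other position contributes to the output at position i.
    """
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # (n, n) attention weights
    return weights @ V, weights

# "I ate an apple": 4 toy token vectors of dimension d = 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```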
Reinforcement Learning from Human Feedback (RLHF)
Reward Model
- Reward is based on binary preference data from human annotators
- The reward model r_θ is trained with a binary ranking loss, L_ranking = -log(σ(r_θ(x, y_c) - r_θ(x, y_r) - m(r))), where y_c is the chosen response, y_r the rejected one, and m(r) a margin term; r_θ is optimized to score preferred responses higher (a sketch of this loss follows after this list)
- Rejection Sampling → sample k candidate responses using temperature scaling and keep the one the reward model scores highest
- The policy language model is optimized with a combination of rejection sampling finetuning and PPO
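A minimal NumPy sketch of that ranking loss, assuming the margin formulation written above; the scalar rewards and example numbers are made up for illustration:

```python
import numpy as np

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Binary ranking loss for training the reward model.

    r_chosen / r_rejected are the scalar rewards the model assigns to the
    human-preferred and rejected responses for the same prompt; the optional
    margin pushes the reward gap wider for strongly preferred answers.
    """
    z = r_chosen - r_rejected - margin
    return -np.log(1.0 / (1.0 + np.exp(-z)))  # -log(sigmoid(z))

# Example: chosen response scored 1.2, rejected 0.3, with a small margin.
print(ranking_loss(1.2, 0.3, margin=0.5))  # loss shrinks as the reward gap grows
```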
Evaluation and Safety
- How do we build benchmarks for evaluation, given that conversations and interactions can take many different paths?
- Human evaluation via side-by-side comparison is costly and time-consuming
- Built-in system prompts often flag false positives
Explicit Measures
- Removing personally identifiable information (PII) and requiring toxicity scores for documents
- Separate reward models for helpfulness and safety (preventing unintended responses)
- Context distillation for safety with adversarial prompts
- Red-teaming to test robustness