Lecture 1 Llama 2 | October 3rd, 2023

Large Language Models (LLMs)

Challenges:

  • Statistical learning takes a lot of data
  • Humans have multisensory information
  • Memory augmentation (remembering)
  • Often doesn’t know what it doesn’t know
  • Interpretability → large black boxes

Llama 2

DeepMind Chinchilla → focused on compute-optimal scaling of both data and parameters

  • Llama 2 takes a new approach: emphasize data quality while keeping model size comparatively limited
  • Fast inference paradigm → a decoder-only model that relies on Rotary Positional Embedding (RoPE)
    • Positions are encoded with sine and cosine functions of different frequencies; RoPE applies them as rotations to the query/key vectors rather than as additive position embeddings (see the sketch after this list)
    • SwiGLU instead of ReLU
  • Byte Pair Encoding (BPE) tokenization, so unseen words can still be represented from subword pieces
  • Trains with the AdamW optimizer, which improves downstream-task performance after finetuning
  • But how do we make the model useful for everyone? → fine-tune chat models for different domains
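
The two architectural choices above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names, tensor shapes, the interleaved-pair rotation convention, and the base frequency of 10000 are assumptions made for the sketch.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotary Positional Embedding: rotate each (even, odd) pair of dimensions
    of the query/key vectors x by an angle that depends on the token position.

    x:         (seq_len, d) array with d even
    positions: (seq_len,) integer token positions
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,) frequencies
    angles = positions[:, None] * inv_freq[None, :]       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # split dims into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                    # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def swiglu(x, W_gate, W_up):
    """SwiGLU feed-forward gate: silu(x @ W_gate) * (x @ W_up).
    (The down-projection that normally follows is omitted for brevity.)"""
    gate = x @ W_gate
    return (gate / (1.0 + np.exp(-gate))) * (x @ W_up)    # silu(z) = z * sigmoid(z)
```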

Pretraining

  • 4k context length
  • Multi-head attention (MHA) → grouped-query attention (GQA); see the sketch below
    • Standard MHA runs out of memory at 1024, since keys and values must be cached for every head
  • Conscious choice to focus on English as the dataset language

  • Supervised finetuning (SFT) was bootstrapped with public datasets and third-party vendor annotations
  • Finetuning the model with temporal perception allows it to keep track of timelines over the course of a conversation
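
A minimal sketch of grouped-query attention, where several query heads share one key/value head. The names and shapes are assumed for illustration, and the causal mask is omitted for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Attention in which groups of query heads share a single key/value head,
    shrinking the KV cache relative to standard multi-head attention.

    q: (n_heads, seq, d_head)    k, v: (n_kv_heads, seq, d_head)
    """
    n_heads, seq, d_head = q.shape
    group = n_heads // n_kv_heads                          # query heads per KV head
    k = np.repeat(k, group, axis=0)                        # align KV heads with query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ v                                     # (n_heads, seq, d_head)
```

With n_kv_heads equal to n_heads this reduces to ordinary multi-head attention; with n_kv_heads = 1 it is multi-query attention.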

    Short Intermission to talk about Attention

    Attention allows a model to weigh the importance of each input in a sequence relative to the others. The mechanism produces weights that determine how much each element contributes to the final output. For example, in a sentence-translation task, the words “I” and “apple” might carry more weight in deriving the meaning of the sentence “I ate an apple” than less significant words such as “an” and “ate”. Let $h_t$ be the word representation (either an embedding or the concatenated hidden state(s) of an RNN) of dimension $K_w$ for the word at position $t$ in a sentence padded to a length of $M$.
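
Continuing that notation, here is a minimal sketch of the weighting step. The scoring function is an assumed dot product against a learned query vector w; the notes stop before specifying one.

```python
import numpy as np

def attention_weights(H, w):
    """Score each word representation h_t against a query vector w and
    return softmax weights alpha_t plus the weighted summary vector.

    H: (M, K_w) matrix whose rows are the word representations h_1 .. h_M
    w: (K_w,) query/context vector (a learned parameter in practice)
    """
    scores = H @ w                             # one relevance score per position
    alpha = np.exp(scores - scores.max())      # softmax, numerically stabilized
    alpha /= alpha.sum()
    summary = alpha @ H                        # weighted sum of the h_t
    return alpha, summary
```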

Reinforcement Learning from Human Feedback (RLHF)

Reward Model

  • Reward model trained on binary preference data collected from human annotators
  • Ranking loss over preference pairs (written out after this list), in which the reward model $r_\theta$ is optimized
  • Rejection sampling → sample K outputs per prompt with temperature scaling and keep the one the reward model scores highest
  • The policy language model is trained with a combination of PPO and rejection-sampling finetuning
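
The preference loss referenced above is most likely the standard binary ranking objective; writing $y_c$ for the response the annotator chose and $y_r$ for the rejected one given prompt $x$:

$$\mathcal{L}_{\text{ranking}} = -\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)$$

where $\sigma$ is the logistic sigmoid, so minimizing the loss pushes $r_\theta$ to score the preferred response higher. (Llama 2 additionally subtracts a margin term inside the sigmoid that grows with how strongly the annotator preferred $y_c$.)

For the rejection-sampling step, a minimal sketch; `generate` and `reward` are hypothetical stand-ins for the policy's sampler and the reward model:

```python
def rejection_sample(prompt, generate, reward, k=8, temperature=1.0):
    """Draw k candidate responses and keep the one the reward model scores highest.

    generate(prompt, temperature) -> str    # hypothetical policy sampler
    reward(prompt, response)      -> float  # hypothetical reward-model score
    """
    candidates = [generate(prompt, temperature) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))
```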

Evaluation and Safety

  • How do we build benchmarks for evaluation, given that conversations and interactions can take many different paths?
  • Human evaluation by side-by-side comparison is costly and time-consuming
  • Built-in system prompts often flag false positives

Explicit Measures

  • Removing personal information (PII) from the data and assigning toxicity scores to documents
  • Separate reward models for safety (to prevent unintended or unsafe outputs)
  • Context distillation for safety with adversarial prompts
  • Red-teaming to test robustness