The Evolution of Meta’s LLaMA Model

Nathan Bailey
7 min read · Aug 24, 2024


Following my previous blog, which overviewed the evolution of OpenAI’s GPT model, this blog will overview Meta’s LLaMA (Large Language Model Meta AI) family of models.

This blog aims to summarize the evolution of the Llama model, looking at the academic paper for each of the three iterations of the model. Unlike the GPT models, Llama is open-source, meaning all of the model’s details and weights are publicly available!

Llama 1 (LLaMA: Open and Efficient Foundation Language Models, 2023) [1]

The ethos behind the original Llama model was to create a series of language models smaller than the current state-of-the-art, but which achieve higher levels of performance by being trained for longer.

The reasoning behind this is that, at a given level of performance, the preferred language model is the one that is cheapest at inference, not the one that is cheapest to train. So, even though the model is trained for longer, a smaller model gives a lower inference cost. The result of this approach is a family of language models called LLaMA that outperform GPT-3 and other state-of-the-art models whilst being significantly smaller.

One key difference with the training approach used in this paper is that Llama is trained only on publicly available data. They create a dataset from a range of publicly available sources; it comprises 1.4 trillion tokens with a vocabulary size of 32K tokens. This contrasts with GPT-3, where the dataset used for training was not made available to the public.

The Llama model is a standard transformer-based language model with the following changes:

  • Pre-normalisation is used as in GPT-3: the layer normalisation is applied to the input of each transformer sub-layer instead of the output, with RMSNorm used as the normalising function.
  • Instead of the ReLU activation function, the SwiGLU activation is used to improve performance (a minimal code sketch of RMSNorm and SwiGLU follows this list).
  • Rotary Positional Embeddings (RoPE) are used instead of absolute positional embeddings.
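To make the first two changes concrete, below is a minimal PyTorch sketch of an RMSNorm layer and a SwiGLU feed-forward block. The class names and dimensions are illustrative assumptions, not Meta’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales activations by their RMS."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate instead of a plain ReLU MLP."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x W_gate) gates the parallel projection x W_up.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```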

Four variants of the model of differing sizes were created:

  • LLaMA-7B — 32 attention heads, 32 transformer layers, trained on 1.0T tokens.
  • LLaMA-13B — 40 attention heads, 40 transformer layers, trained on 1.0T tokens.
  • LLaMA-33B — 52 attention heads, 60 transformer layers, trained on 1.4T tokens.
  • LLaMA-65B — 64 attention heads, 80 transformer layers, trained on 1.4T tokens.

The standard training process for language models was followed: the model was trained to predict the next token given a sequence of input tokens (a minimal sketch of this objective is below). The models were then evaluated in zero-shot and few-shot modes on a range of tasks.
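As a rough illustration (not Meta’s actual training code), the next-token-prediction loss can be written as a cross-entropy between the model’s logits and the input sequence shifted by one position:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) model outputs; tokens: (batch, seq) input ids.
    Each position predicts the token that follows it."""
    # Drop the last logit (nothing follows it) and the first token
    # (it has no preceding context), then flatten for cross-entropy.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```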

The tasks and results are summarised below:

  • Common Sense Reasoning — LLaMA-13B outperforms GPT-3, and LLaMA-65B outperforms the state-of-the-art similar-size models.
  • Closed-book Question Answering — LLaMA-13B outperforms GPT-3, and LLaMA-65B outperforms the state-of-the-art similar-size models.
  • Reading Comprehension — LLaMA-13B outperforms GPT-3, and LLaMA-65B outperforms the state-of-the-art similar-size models.
  • Mathematical Reasoning — LLaMA-13B outperforms GPT-3, and LLaMA-65B matches the state-of-the-art similar-size models but fails to outperform the PaLM model fine-tuned on mathematical data.
  • Code Generation — LLaMA-13B outperforms GPT-3, and LLaMA-65B outperforms the state-of-the-art similar-size models.

The notable result is that LLaMA-13B outperforms GPT-3 whilst being 10x smaller, and that the largest model, LLaMA-65B, is competitive with two other leading LLMs, Chinchilla-70B and PaLM-540B.

Llama 2 (Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023) [2]

As expected, Llama 2 is an updated version of Llama 1. The Llama 2 family consists of models of sizes 7B, 13B and 70B parameters.

The models are trained on a new mix of data from publicly available sources and the resulting dataset now consists of 40% more tokens, bringing the total size to 2 trillion tokens but keeping the vocabulary size the same. More robust data cleaning is also performed compared to the dataset used for Llama 1.

The model architecture remains mostly the same as in Llama 1, with a few slight modifications:

  • The context length was doubled from 2048 to 4096 tokens.
  • Grouped-query attention (GQA) was used. GQA sits between multi-head and multi-query attention: the query heads are divided into N groups, and each group shares a single key and value head rather than every query head having its own. When converting an existing multi-head attention layer to GQA, the shared key and value projections for a group can be obtained by mean-pooling the key and value heads being merged into that group. [5] [6]
  • Grouped-query attention reduces the compute and memory overhead of maintaining a separate key and value vector per head (in particular, it shrinks the key/value cache at inference). By altering the number of groups, GQA can tune the trade-off between speed and model accuracy (a rough sketch of the computation follows this list).
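Below is a rough PyTorch sketch of the grouped-query attention computation. The function name and tensor shapes are illustrative assumptions, not Meta’s implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_groups):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_groups, seq, d).
    Each group of query heads shares a single key/value head."""
    batch, n_q_heads, seq, d = q.shape
    heads_per_group = n_q_heads // n_groups
    # Repeat each shared key/value head so it lines up with its query heads.
    k = k.repeat_interleave(heads_per_group, dim=1)
    v = v.repeat_interleave(heads_per_group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Example: 8 query heads sharing 2 key/value heads (4 query heads per group).
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v, n_groups=2)  # shape (1, 8, 16, 64)
```

Setting the number of groups equal to the number of query heads recovers standard multi-head attention, while a single group recovers multi-query attention.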

There were no changes to the number of attention heads or the number of transformer layers used in each model compared to the Llama 1 models.

Llama 2 was pre-trained in the same way as Llama 1; that is, the models were trained to predict the next token given a sequence of input tokens.

The performance of the pre-trained Llama 2 models was evaluated over a few different categories of benchmarks:

  • Code
  • Commonsense Reasoning
  • World Knowledge
  • Reading Comprehension
  • Math
  • Popular Aggregated Benchmarks

The key takeaways are that Llama 2 models outperform Llama 1 models as well as all other open-source models. Llama 2–70B is close to GPT-3.5 but far from GPT-4; it is also on par with or better than PaLM-540B but far from PaLM-2-L.

The most interesting contribution of this paper was the development of chat models called Llama 2-Chat. These are fine-tuned Llama 2 models, optimized for conversational use cases similar to ChatGPT.

The key difference between the Llama 2-Chat model and the vanilla Llama 2 model is the addition of a post-training step.

The post-training step fine-tunes a Llama 2 model using supervised fine-tuning and reinforcement learning from human feedback (RLHF).

Supervised fine-tuning fine-tunes the model by training it on publicly available instruction tuning data. This type of data consists of a prompt (data) and an answer (label).

Then they use RLHF to align the model’s behaviour with human preferences. In RLHF, human preference labels are collected over the LLM’s outputs and used to train a reward model. This reward model then provides the reward signal in a reinforcement learning loop that trains the LLM to align its outputs with human preferences (a sketch of the reward-model training loss follows).
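As an illustration, a reward model of this kind is typically trained with a pairwise ranking loss over preference pairs; the minimal sketch below is a generic formulation, not necessarily the paper’s exact loss.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: pushes the reward of the human-preferred
    response above the reward of the rejected response."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scalar rewards for a batch of three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_ranking_loss(r_chosen, r_rejected))
```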

Llama 3.1 (The Llama 3 Herd of Models, 2024) [3]

Llama 3.1 (or just Llama 3) is the latest family of Llama models. Three models are released with model parameter sizes of 8B, 70B and 405B.

Llama 3.1 models are based on the same architecture as Llama 2.

  • Grouped-query attention is used as in Llama 2.
  • Context length is increased from 4K tokens to 128K tokens.
  • Llama 3–8B has 32 attention heads and 32 transformer layers.
  • Llama 3–70B has 64 attention heads and 80 transformer layers.
  • Llama 3–405B has 128 attention heads and 126 transformer layers.

The most significant changes compared to Llama 2 come from the training data. The training dataset increases from 2 trillion tokens to 15.6 trillion tokens, and the vocabulary size increases from 32K to 128K tokens. Moreover, data quality is improved by employing more careful pre-processing and curation pipelines for the pre-training data.

They also carefully determine the proportion of data sources in the pre-training dataset (data mix). 50% of the tokens correspond to general knowledge, 25% mathematical and reasoning tokens, 17% code tokens and 8% multilingual tokens.
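Expressed as a simple configuration (an illustrative representation, not Meta’s actual config), the reported data mix looks like this:

```python
# Sampling proportions for the reported Llama 3 pre-training data mix.
data_mix = {
    "general_knowledge": 0.50,
    "math_and_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}
assert abs(sum(data_mix.values()) - 1.0) < 1e-9
```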

Llama 3 models are pre-trained in the same way as Llama 1 and 2 models. That is they are trained to predict the next token given a sequence of input tokens.

For these models, the context length is slowly increased during pre-training: they start at 4K tokens, move to 8K tokens, and in the final pre-training stage train with a context length of 128K tokens. During pre-training on the final set of tokens from the training data, they linearly anneal the learning rate to 0 while maintaining the 128K context length.

The pre-trained Llama 3 405B model significantly outperforms open-source models and is comparable to GPT-4 and Gemini Ultra.

As with Llama 2, there is a post-training step to optimize the models for conversational use cases.

Human preference labels are first collected over the model’s outputs and used to train a reward model.

The model is then fine-tuned by training it on instruction-tuning data as before. In Llama 2 this data was gathered from public sources; in Llama 3, it is built from the prompts in the human-annotated data. K responses are sampled for each prompt and scored by the reward model, and the best one is kept as the desired response to that prompt (a form of rejection sampling; a small sketch follows). This data is then combined with additional synthetic data to produce a complete instruction-tuning dataset.
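A small sketch of this best-of-K selection is below; generate and reward_model are hypothetical stand-ins for the language model and the trained reward model.

```python
def best_of_k(prompt, generate, reward_model, k=4):
    """Sample k responses to a prompt and keep the one the reward model
    scores highest (rejection sampling)."""
    responses = [generate(prompt) for _ in range(k)]
    scores = [reward_model(prompt, response) for response in responses]
    return responses[scores.index(max(scores))]
```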

They then apply direct preference optimization (DPO) instead of RLHF. DPO differs from RLHF in that it eliminates the need for a reward model during preference optimization: it uses a binary cross-entropy loss to push the LLM towards human-preferred responses over dispreferred ones [4].
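A minimal sketch of the DPO loss is below, operating on sequence log-probabilities from the policy being trained and a frozen reference model; beta is the usual DPO temperature hyperparameter and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Binary cross-entropy on the implicit reward margin between the
    preferred and dispreferred responses (no explicit reward model)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) -
                     (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```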

Conclusions

I hope this blog has helped to summarise the evolution of Meta’s Llama model. What stood out to me was how quickly a model family can evolve and match state-of-the-art models despite being open source!
