The Evolution of OpenAI’s GPT Model

Nathan Bailey
7 min read · Aug 8, 2024


Large Language Models (LLMs) can be confusing, but this blog can help! I recently went on a journey to understand the building blocks behind LLMs and how they work. You can find some of my previous work via the following links.

This blog aims to summarise the evolution of OpenAI’s Generative Pre-Trained Transformer (GPT) model, taking a look at the academic paper behind each of its four main iterations. Unfortunately, as the GPT model advanced, details of the model, its training process and its training datasets became increasingly closely guarded by OpenAI. However, the fundamental model architecture, training and inference strategies can still be understood and explained using the earlier literature.

Without further ado, let us dive into the first paper which proposes the first GPT model.

GPT 1.0 (Improving Language Understanding by Generative Pre-Training, 2018) [1]

This paper is the most technically detailed of the four, as it proposes the architecture behind the GPT model. Somewhat surprisingly, this architecture remained relatively unchanged as the model evolved. As we will see, the model mainly just became larger, which eventually allowed it to perform some tasks with human-level accuracy.

The paper outlines a transformer-based model called the Generative Pre-Trained Transformer (GPT). It consists of an embedding layer, which produces an embedding for each token in the input sequence, followed by 12 transformer blocks applied to these embeddings. A final feed-forward (linear) layer with a softmax activation then predicts the probabilities of the next token.

Each transformer block outputs a 768-dimensional vector and contains 12 attention heads. This means that the keys, queries and values fed to each attention head have a dimension of 64 (768 / 12).
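
To make these dimensions concrete, below is a minimal PyTorch sketch of the forward pass described above. It is my own simplification (the class name, the use of nn.TransformerEncoderLayer with a causal mask, and the rounded vocabulary size of 40,000 BPE tokens are assumptions), not the original implementation.

```python
import torch
import torch.nn as nn

# Dimensions described above: 12 blocks, 768-dim hidden state, 12 heads,
# so each head sees 768 / 12 = 64-dimensional queries, keys and values.
VOCAB, CONTEXT, D_MODEL, N_LAYERS, N_HEADS = 40_000, 512, 768, 12, 12

class MiniGPT(nn.Module):
    """Illustrative sketch of the GPT 1.0 forward pass, not the original code."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_MODEL)    # token embeddings
        self.pos_emb = nn.Embedding(CONTEXT, D_MODEL)  # learned position embeddings
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=4 * D_MODEL, batch_first=True
        )  # post-norm by default, matching the original GPT block layout
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)       # next-token logits

    def forward(self, tokens):                         # tokens: (batch, seq)
        seq = tokens.shape[1]
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(seq)
        x = self.blocks(x, mask=mask)                  # 12 masked self-attention blocks
        return self.lm_head(x)                         # softmax is applied in the loss

logits = MiniGPT()(torch.randint(0, VOCAB, (1, 16)))   # shape: (1, 16, VOCAB)
```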

I explain the implementation of a GPT model in more detail in the following blog, but for completeness, the GPT model is shown visually below.

GPT Model [5]

Training the GPT model is a two-fold process. First, it is trained in a language modelling context on unlabelled sequences from a dataset; they use the BookCorpus dataset, which consists of over 7,000 unpublished books. The language modelling objective aims to correctly predict the next word in a sentence given the preceding words.

Hence, the following likelihood is maximised.

Language Modelling Likelihood Function [1]
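
For reference, the likelihood from the paper can be written as follows, where the u_i are the tokens of the unlabelled corpus U, k is the context window size and Θ are the model parameters:

```latex
L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```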

Then, the model was fine-tuned for specific tasks using labelled data. These tasks were: Natural Language Inference, Question Answering, Semantic Similarity and Text Classification.

To achieve this, the softmax layer was replaced with one that outputs the probabilities over all the possible classes for the task. As is standard for supervised learning, the probability of predicting the correct label is maximised.

Fine-Tuning Likelihood Function [1]
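
Written out, with C the labelled dataset and each example consisting of input tokens x^1, …, x^m and label y, this is:

```latex
L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
```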

They found it best to combine the two objective functions during fine-tuning, as this helped the model to learn by improving generalisation and accelerating convergence.
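
In the paper, the combined objective simply adds the language modelling loss (computed on the fine-tuning corpus) as an auxiliary term, weighted by λ:

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```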

Overall, they found that the GPT model improves on the state-of-the-art on 9 of the 12 datasets used. In other words, it makes gains on every one of the four tasks, but it does not beat the state-of-the-art on every dataset within each task.

GPT 2.0 (Language Models are Unsupervised Multitask Learners, 2019) [2]

As is to be expected, GPT 2.0 is an extension of the GPT 1.0 model. Specifically, the paper focuses on zero-shot learning. That is, they train the model using the same language modelling objective function as before and then apply the trained model to downstream tasks without any fine-tuning.

A new dataset, WebText, was collected to perform the training; it consists of over 8 million web pages. This contrasts with the BookCorpus dataset used for GPT 1.0, which consists of only around 7,000 books. The new dataset is both larger and more diverse than BookCorpus.

In terms of model architecture, they made a few changes compared to the GPT 1.0 model:

  • Layer normalisation was moved to the input of each sub-block in the transformer block. So instead of being applied after the attention and feed-forward layers, it is applied before them (a pre-norm arrangement; see the sketch after this list).
  • An additional layer normalisation layer is applied after the final self-attention block.
  • The vocabulary size was expanded to 50,257 tokens.
  • The context size was increased from 512 to 1024 tokens. Context size is the number of tokens that the model can process in a single input, covering both the input and the generated output. So, to predict the next word in a sentence, the model can take in at most 1023 preceding tokens.
  • Compared to GPT 1.0, GPT 2.0 uses 48 transformer layers rather than 12. The dimensionality of the model is increased from 768 to 1600; that is, each transformer block now outputs a vector of dimension 1600. Each transformer block now contains 25 attention heads, each processing queries, keys and values with a dimension of 64.
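
To illustrate the first change, here is a minimal PyTorch sketch of a pre-norm transformer block. The class name and the use of nn.MultiheadAttention are my own simplifications rather than OpenAI's implementation, and the additional layer normalisation from the second bullet would sit after the final block, outside this module.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Sketch of a GPT 2.0 style pre-norm transformer block (illustrative only)."""
    def __init__(self, d_model=1600, n_heads=25):  # largest GPT 2.0: 1600 dims, 25 heads
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, attn_mask=None):
        # Normalise *before* each sub-layer and then add the residual,
        # instead of normalising after the sub-layer as in GPT 1.0.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x

out = PreNormBlock()(torch.randn(1, 8, 1600))  # shape: (1, 8, 1600)
```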

When tested on language modelling (i.e. the task it was trained on), the GPT 2.0 model achieves better results than the state-of-the-art on 7 out of 8 datasets.

The model is then evaluated on several downstream tasks, including: Classification, Predicting Words in Sentences, Summarisation, Reading Comprehension, Translation and Question Answering.

Because zero-shot learning was used, task-specific prompts or conditioning were employed for each task. For example, for summarisation, they provide the article followed by “TL;DR:” to induce a summary.

Another example is translation: to translate, say, an English sentence into French, they provided a few example pairs in the form “English sentence = French sentence”, then provided the English sentence to be translated followed by “=”, leaving the model to complete the French side. A sketch of this prompt construction is shown below.
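
As a rough illustration, the conditioning described above can be built as plain strings. The helper names and exact formatting here are assumptions; the prompts used in the GPT 2.0 evaluations may differ slightly.

```python
# Hypothetical helpers mirroring the prompt patterns described above.
def summarisation_prompt(article: str) -> str:
    return f"{article}\nTL;DR:"                      # induces a summary continuation

def translation_prompt(pairs: list[tuple[str, str]], sentence: str) -> str:
    shots = "\n".join(f"{en} = {fr}" for en, fr in pairs)
    return f"{shots}\n{sentence} ="                  # the model completes the French side

print(translation_prompt([("the cat sat down", "le chat s'est assis")], "I like coffee"))
```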

For most tasks, it cannot outperform models specifically trained for the given task, but it shows promising results for future work to expand on.

GPT 3.0 (Language Models are Few-Shot Learners, 2020) [3]

GPT 3.0 uses the same architecture as GPT 2.0 but adds alternating dense and sparse attention patterns in the layers of the transformer.

In sparse attention, attention scores are only computed for a localised subset of the keys for a given query. This reduces the number of attention scores that need to be computed, lowering the computational burden.
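
The principle can be sketched with a simple mask comparison. The actual GPT 3.0 pattern follows the Sparse Transformer and alternates between the two across layers; this snippet only shows how a local window shrinks the number of query-key pairs.

```python
import torch

seq_len, window = 8, 3
i = torch.arange(seq_len).unsqueeze(1)       # query positions (rows)
j = torch.arange(seq_len).unsqueeze(0)       # key positions (columns)

dense_causal = j <= i                        # attend to every earlier token
local_causal = (j <= i) & (i - j < window)   # attend only to the last `window` tokens

print(dense_causal.sum().item())  # 36 query-key pairs to score
print(local_causal.sum().item())  # 21 query-key pairs to score
```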

The context size is increased from 1024 to 2048 tokens, and the model uses 96 transformer layers with 96 attention heads in each layer. The query, key and value dimensions are increased from 64 to 128, giving an output dimension of 12,288 for each transformer block.
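
Pulling together the scale figures quoted in this post (largest configuration of each model), the number of heads times the per-head dimension always gives the model width; the snippet below is purely illustrative.

```python
# Scale figures quoted in this post (largest configuration of each model).
configs = {
    "GPT-1": dict(layers=12, d_model=768,   heads=12, head_dim=64,  context=512),
    "GPT-2": dict(layers=48, d_model=1600,  heads=25, head_dim=64,  context=1024),
    "GPT-3": dict(layers=96, d_model=12288, heads=96, head_dim=128, context=2048),
}
for name, cfg in configs.items():
    assert cfg["heads"] * cfg["head_dim"] == cfg["d_model"]  # heads x head_dim = width
    print(name, cfg)
```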

The dataset used to train the model in a language modelling context is also expanded. The new dataset includes an expanded version of WebText and a filtered version of Common Crawl; the filtered Common Crawl portion alone is 570GB, more than a 10x increase over the 40GB WebText dataset used for GPT 2.0.

Importantly, they clarify what is meant by zero-shot learning by defining the following terms:

  • Fine Tuning — Updates the weights of the pre-trained model by training on a labelled dataset for a given task.
  • Few Shot — The model is given a few demonstrations of the task to be performed, e.g. “English sentence = French sentence”. K examples of context and completion are given, followed by one final context that the model is expected to complete (see the prompt sketch after this list). No weight updates are performed.
  • One Shot — The same as Few Shot, but with K = 1.
  • Zero Shot — No examples are given; only a natural language description of the task at hand is provided to the model.
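
A toy example of how these settings differ in practice; the prompt builder and the English-to-French word pairs are only illustrative, and GPT 3.0's actual evaluation prompts vary by benchmark.

```python
# Hypothetical K-shot prompt builder for the settings defined above.
def k_shot_prompt(description: str, demos: list[tuple[str, str]], query: str, k: int) -> str:
    shots = [f"{x} => {y}" for x, y in demos[:k]]
    return "\n".join([description, *shots, f"{query} =>"])

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(k_shot_prompt("Translate English to French:", demos, "peppermint", k=0))  # zero-shot
print(k_shot_prompt("Translate English to French:", demos, "peppermint", k=1))  # one-shot
print(k_shot_prompt("Translate English to French:", demos, "peppermint", k=2))  # few-shot
```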

On the task of language modelling, GPT 3.0 improves on the state-of-the-art on 2 datasets but fails to improve on the other 2.

On specific downstream tasks, GPT 3.0 produces results comparable to the state-of-the-art on some tasks and datasets. It provides a large improvement on question answering compared to GPT 2.0 and beats the state-of-the-art on one dataset.

It also achieves comparable results on some language translation tasks; for example, it beats the state-of-the-art among unsupervised approaches on French-to-English and German-to-English translation.

Overall, GPT 3.0 provides a step forward in few-, one- and zero-shot results, showing the increasing potential of the model to generalise incredibly well to a range of tasks given that it is only trained in a language modelling context.

GPT 4.0 (GPT-4 Technical Report, 2023) [4]

Unfortunately, there is not much public information available on GPT 4.0, nor on its predecessor GPT 3.5, as the models are closed source.

External analyses have estimated that the GPT 4.0 model has around 1.8 trillion parameters, a huge increase from the 175 billion of GPT 3.0.

Based on the technical report published on GPT 4.0, it outperforms existing language models, including GPT 3.5, on a range of benchmarks.

The model was evaluated on exams, both multiple-choice and free-response. It exhibits human-level performance on the majority of the exams and outperforms GPT 3.5. Notably, it passes the bar exam with a score in the top 10% of test takers.

A new addition compared to GPT 3.0/3.5 is visual input. GPT is now multi-modal, accepting prompts containing both images and text and producing text outputs based on those inputs.

Conclusions

I hope this blog has helped summarise the evolution of OpenAI’s Generative Pre-Trained Transformer. It shows how quickly LLMs have evolved: from needing fine-tuning to perform well on tasks, to scoring in the top 10% on the bar exam with no task-specific fine-tuning!


Nathan Bailey

MSc AI and ML Student @ ICL. Ex ML Engineer @ Arm, Ex FPGA Engineer @ Arm + Intel, University of Warwick CSE Graduate, Climber. https://www.nathanbaileyw.com