Implementing a GPT Model in PyTorch

Nathan Bailey
9 min read · Jul 13, 2024


Attention Head [1]

Following on from my previous blog on transformers, I decided to implement a small GPT model in PyTorch. This blog details the implementation of such a model.

The full code for this project can be found on GitHub.

Dataset

The dataset we will be using for this project is the wine reviews dataset (https://www.kaggle.com/datasets/zynicide/wine-reviews). This is a simple dataset containing reviews of wines from different countries. We will create a GPT model that aims to generate a new wine review given the prompt “wine review : ”.

Some examples from this dataset can be seen below:

{'points': '87', 'title': 'Nicosia 2013 Vulkà Bianco (Etna)', 'description': "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.", 'taster_name': "Kerin O'Keefe", 'taster_twitter_handle': '@kerinokeefe', 'price': None, 'designation': 'Vulkà Bianco', 'variety': 'White Blend', 'region_1': 'Etna', 'region_2': None, 'province': 'Sicily & Sardinia', 'country': 'Italy', 'winery': 'Nicosia'}

{'points': '87', 'title': 'Quinta dos Avidagos 2011 Avidagos Red (Douro)', 'description': "This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.", 'taster_name': 'Roger Voss', 'taster_twitter_handle': '@vossroger', 'price': 15, 'designation': 'Avidagos', 'variety': 'Portuguese Red', 'region_1': None, 'region_2': None, 'province': 'Douro', 'country': 'Portugal', 'winery': 'Quinta dos Avidagos'}

Before we use this dataset in our model, we will need to perform some preprocessing on it. Our dataset is in the form of a JSON file. We can write a small snippet of code to filter this data to produce a complete string that will comprise a single training example.

class WineDataset(torch.utils.data.Dataset):
    """Custom Dataset for the Wine Reviews."""

    def __init__(
        self, path_to_file: Path | str, max_length: int, vocab_size: int
    ):
        """Init variables."""
        super().__init__()
        self.max_length = max_length + 1
        self.vocab_size = vocab_size

        with open(path_to_file, encoding="utf-8") as json_file:
            wine_data = json.load(json_file)

        # Keep only entries that have all four fields and join them into a
        # single training string.
        filtered_data = [
            f"wine review : {country} : {province} : {variety} : {description}"
            for x in wine_data
            if all(
                (
                    country := x.get("country"),
                    province := x.get("province"),
                    variety := x.get("variety"),
                    description := x.get("description"),
                )
            )
        ]

wine review : Italy : Sicily & Sardinia : White Blend : Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn’t overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.

wine review : Portugal : Douro : Portuguese Red : This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It’s already drinkable, although it will certainly be better from 2016.

Next, we process these strings to prepare them for tokenization. As a small refresher, tokenization is the process of splitting text into individual units. Here we use word tokenization, that is, each word becomes a single token. To prepare for this, we take each string, pad the punctuation with spaces so that each punctuation character becomes a separate word, and convert it to lowercase.

    def prepare_string(self, text: str) -> list[str]:
        """Prepare a string."""
        # Pad punctuation (and newlines) with spaces so each punctuation
        # character becomes its own word, then collapse repeated spaces.
        text = re.sub(f"([{string.punctuation}, '\n'])", r" \1 ", text)
        text = re.sub(" +", " ", text)
        text_list = text.lower().split()
        return text_list

        # In __init__: split each filtered string into a list of words.
        text_dataset = [self.prepare_string(x) for x in filtered_data]

During the process of tokenization, we also create a vocabulary, which is the set of distinct words in the training set. We will create this manually and sort it in order of the most common to the least common words. Creating a vocabulary when using word tokenization can result in a very large set, so we also put a maximum limit on the size of the resulting vocab.

To create the vocabulary, we count the occurrences of each word in the dataset and sort them so that the most common words appear first. Because we cap the size of the vocabulary, we reserve index 1 for a token representing any unknown word (<unk>). We also reserve index 0 for the padding/end-of-sentence (eos) token, which here is simply the empty string.

    def create_vocab(self, texts: list[list[str]]) -> dict[str, int]:
        """Create a vocab from the dataset."""
        counter = Counter(token for tokens in texts for token in tokens)
        counter_sorted = dict(
            sorted(counter.items(), key=lambda item: item[1], reverse=True)
        )
        vocab = [word for (word, _) in counter_sorted.items()]
        # Index 0 is the padding/eos token, index 1 is the unknown token.
        vocab.insert(0, "")
        vocab.insert(1, "<unk>")
        vocab = vocab[: self.vocab_size]
        vocab_dict = {word: idx for idx, word in enumerate(vocab)}
        return vocab_dict

Some of the resulting vocabulary can be seen below.

{'': 0, '<unk>': 1, ':': 2, ',': 3, '.': 4, 'and': 5, 'the': 6, 'wine': 7, 'a': 8, 'of': 9}

We can now convert our sentences to integers, ready for input to the model. Each word is mapped to its index in the vocabulary; if a word is not present, it is mapped to 1, the unknown token. We then pad the sentences with the padding/eos token (0) so that they all have the same length, which allows the model to process a batch of sentences at once.

    def prepare_inputs(self, text: list[str]) -> list[int]:
        """Tokenize the text."""
        # Map each word to its vocabulary index, falling back to 1 (<unk>).
        tokenized_text = [self.vocab.get(word, 1) for word in text]
        if len(tokenized_text) >= self.max_length:
            return tokenized_text[: self.max_length]
        # Pad shorter sentences with 0 (the padding/eos token).
        len_to_pad = self.max_length - len(tokenized_text)
        tokenized_text += [0 for _ in range(len_to_pad)]
        return tokenized_text

        # In __init__: tokenize and pad every sentence in the dataset.
        self.train_dataset = [
            self.prepare_inputs(text) for text in text_dataset
        ]

We set up the rest of the dataset class in the classic PyTorch way, as shown below. Since we are predicting the next word in a sentence, when returning an element of the dataset our input is the sentence minus its final word, and our target is the sentence shifted one word to the left.

    def __len__(self) -> int:
        return len(self.train_dataset)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        text = self.train_dataset[idx]
        # Input: all tokens but the last; target: the same tokens shifted by one.
        return torch.LongTensor(text[:-1]), torch.LongTensor(text[1:])
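As a usage sketch, the dataset can then be wrapped in a standard DataLoader. The file name, sequence length, vocabulary size and batch size below are illustrative placeholders rather than values taken from the original repository.

MAX_LEN = 80  # illustrative sequence length
VOCAB_SIZE = 10000  # illustrative vocabulary cap

dataset = WineDataset(
    path_to_file="winemag-data-130k-v2.json",  # downloaded Kaggle JSON, adjust as needed
    max_length=MAX_LEN,
    vocab_size=VOCAB_SIZE,
)
trainloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

x, y = dataset[0]
print(x.shape, y.shape)  # torch.Size([80]) torch.Size([80]): the target is the input shifted by one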

Model Building Blocks

To create our GPT model, we will need a few building blocks. First, we create the embedding layer. This layer takes in our sentences and embeds them in a higher-dimensional space. Transformer-based models require both token and position embeddings: we embed the tokens themselves as well as the positions of those tokens in the sentence.

We can achieve this using two separate embedding layers. The first is the token embedding layer, which is applied to the incoming tokens. The second is the position embedding layer, which takes in the position indices of the tokens in the sentence.

The outputs of these two layers are added together to produce the final embedding.

class TokenAndPositionEmbedding(torch.nn.Module):
    """Token and positioning embedding layer for a sequence."""

    def __init__(
        self, max_len_input: int, vocab_size: int, embed_dim: int
    ) -> None:
        """Init variables and layers."""
        super().__init__()
        self.token_emb = torch.nn.Embedding(
            num_embeddings=vocab_size, embedding_dim=embed_dim
        )
        self.position_emb = torch.nn.Embedding(
            num_embeddings=max_len_input, embedding_dim=embed_dim
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward Pass."""
        len_input = x.size()[1]
        positions = torch.arange(start=0, end=len_input, step=1).to(
            torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        )
        position_embedding = self.position_emb(positions)
        token_embedding = self.token_emb(x)
        return token_embedding + position_embedding
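As a quick shape check (reusing the illustrative MAX_LEN and VOCAB_SIZE constants from the earlier sketch, with an arbitrary embedding dimension of 256), the layer maps a batch of token indices to a batch of embedding vectors:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
embedding = TokenAndPositionEmbedding(
    max_len_input=MAX_LEN, vocab_size=VOCAB_SIZE, embed_dim=256
).to(device)
tokens = torch.randint(0, VOCAB_SIZE, (4, MAX_LEN)).to(device)  # batch of 4 sequences
print(embedding(tokens).shape)  # torch.Size([4, 80, 256])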

For the attention part of our model, we will need a causal mask, so that a position cannot use future tokens when predicting the next token. We can create this using the following function.

Note that PyTorch expects an inverted mask, where True marks the positions a query is not allowed to attend to; we obtain this with the logical_not function.

def create_attention_mask(
    key_length: int,
    query_length: int,
    dtype: torch.dtype,
) -> torch.Tensor:
    """Create a causal mask for the multi-head attention layer."""
    i = torch.arange(query_length)[:, None]
    j = torch.arange(key_length)
    # True where a query position may attend to a key position (no look-ahead).
    mask = i >= j - key_length + query_length
    # PyTorch expects True at positions that should be masked out, so invert.
    mask = torch.logical_not(mask)
    mask = mask.to(dtype)
    return mask

An example mask is shown below. The queries make up the rows, so the first query, i.e. predicting the word that comes after the first word in the sentence, can only attend to that first token.

Example Mask (before the logical not operator)
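Since the figure shows the mask before inversion (True on and below the diagonal), it is worth printing what the function actually returns. For a sequence of four tokens, the inverted mask has True at the future positions each query must ignore:

mask = create_attention_mask(key_length=4, query_length=4, dtype=torch.bool)
print(mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])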

Lastly, we create the transformer block, shown below. It consists of a multi-head attention layer, two layer-normalization layers and two feed-forward layers.


class TransformerBlock(torch.nn.Module):
    """Transformer Block Layer."""

    def __init__(
        self,
        num_heads: int,
        key_dim: int,
        embed_dim: int,
        ff_dim: int,
        mask_function: Callable[[int, int, torch.dtype], torch.Tensor],
        dropout_rate: float = 0.1,
    ) -> None:
        """Init variables and layers."""
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            kdim=key_dim,
            vdim=key_dim,
            batch_first=True,
        )
        self.dropout_1 = torch.nn.Dropout(p=dropout_rate)
        self.layer_norm_1 = torch.nn.LayerNorm(
            normalized_shape=embed_dim, eps=1e-6
        )
        self.ffn_1 = torch.nn.Linear(
            in_features=embed_dim, out_features=ff_dim
        )
        self.ffn_2 = torch.nn.Linear(
            in_features=ff_dim, out_features=embed_dim
        )
        self.dropout_2 = torch.nn.Dropout(p=dropout_rate)
        self.layer_norm_2 = torch.nn.LayerNorm(
            normalized_shape=embed_dim, eps=1e-6
        )
        self.mask_function = mask_function

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """Forward Pass."""
        seq_len = inputs.size()[1]
        mask = self.mask_function(seq_len, seq_len, torch.bool).to(
            torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        )
        attention_output, _ = self.attn(
            query=inputs, key=inputs, value=inputs, attn_mask=mask
        )
        attention_output = self.dropout_1(attention_output)
        out1 = self.layer_norm_1(inputs + attention_output)
        ffn_1 = F.relu(self.ffn_1(out1))
        ffn_2 = self.ffn_2(ffn_1)
        ffn_output = self.dropout_2(ffn_2)
        output = self.layer_norm_2(out1 + ffn_output)
        return output

Model

Using these building blocks, we can create a small GPT model, featuring our embedding layer, a single transformer block and an output feed-forward layer.

class GPTModel(torch.nn.Module):
    """GPT Model Class."""

    def __init__(
        self,
        max_len_input: int,
        vocab_size: int,
        embed_dim: int,
        feed_forward_dim: int,
        num_heads: int,
        key_dim: int,
    ) -> None:
        """Init Function."""
        super().__init__()
        self.embedding_layer = TokenAndPositionEmbedding(
            max_len_input=max_len_input,
            vocab_size=vocab_size,
            embed_dim=embed_dim,
        )
        self.transformer = TransformerBlock(
            num_heads=num_heads,
            key_dim=key_dim,
            embed_dim=embed_dim,
            ff_dim=feed_forward_dim,
            mask_function=create_attention_mask,
        )
        self.output_layer = torch.nn.Linear(embed_dim, vocab_size)

    def forward(self, input_tensor: torch.Tensor) -> torch.Tensor:
        """Forward Pass."""
        embedding = self.embedding_layer(input_tensor)
        transformer_output = self.transformer(embedding)
        output = self.output_layer(transformer_output)
        return output
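To sanity-check the model, we can run a dummy batch through it. The hyperparameter values below are illustrative placeholders (reusing MAX_LEN, VOCAB_SIZE and device from the earlier sketches); the values used for training may differ (see the GitHub repository).

EMBED_DIM = 256
N_HEADS = 2
KEY_DIM = 256
FF_DIM = 256

model = GPTModel(
    max_len_input=MAX_LEN,
    vocab_size=VOCAB_SIZE,
    embed_dim=EMBED_DIM,
    feed_forward_dim=FF_DIM,
    num_heads=N_HEADS,
    key_dim=KEY_DIM,
).to(device)

dummy_batch = torch.randint(0, VOCAB_SIZE, (4, MAX_LEN)).to(device)
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 80, 10000]): one set of vocab logits per position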

Training

The training loop for this model is fairly simple: we feed a batch of sentences into the model. For each sentence in the batch, we receive an output matrix in which each row Yi holds the logits over the entire vocabulary for the token that follows the input token Xi.

We transpose this output so that the class dimension comes second, the form the cross-entropy loss expects, before performing our backward pass.

def train_network(
    model: torch.nn.Module,
    vocab: dict[str, int],
    num_epochs: int,
    optimizer: torch.optim.Optimizer,
    loss_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
    trainloader: torch.utils.data.DataLoader,
    device: torch.device,
) -> None:
    """Train the Network."""
    print("Training Started")
    for epoch in range(1, num_epochs + 1):
        sys.stdout.flush()
        train_loss = []
        model.train()
        for batch in trainloader:
            optimizer.zero_grad()
            x = batch[0].to(device)
            y = batch[1].to(device)
            outputs = model(x)
            # Transpose to (batch, vocab, seq_len) for the cross-entropy loss.
            loss = loss_function(torch.transpose(outputs, 2, 1), y)
            loss.backward()
            optimizer.step()
            train_loss.append(loss.item())
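Wiring everything together might look roughly like the following, reusing the dataset, DataLoader, model and device from the earlier sketches. The optimizer choice and learning rate are illustrative, and the call assumes the dataset stores its vocabulary as self.vocab (as implied by prepare_inputs).

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_function = torch.nn.CrossEntropyLoss()

train_network(
    model,
    vocab=dataset.vocab,  # assumes __init__ stores the vocab dict as self.vocab
    num_epochs=300,
    optimizer=optimizer,
    loss_function=loss_function,
    trainloader=trainloader,
    device=device,
)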

At the end of each epoch, we generate new text from the prompt “wine review :” to monitor training. This is achieved with a simple function that builds a review word by word: we sample a word, append it to the sentence, and feed the sentence back into the model to sample the next word. This continues until we produce an end-of-sentence (eos) token or reach a maximum number of generated words.

def generate_text(
    model: torch.nn.Module,
    start_prompt: str,
    max_tokens: int,
    temp: float,
    vocab: dict[str, int],
    device: torch.device,
) -> str:
    """Function to Generate Text During Training."""

    def sample_from(probs: np.ndarray, temp: float) -> int:
        """Sample a token from the list of probabilities."""
        probs = probs ** (1 / temp)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs)

    with torch.no_grad():
        vocab_words = list(vocab.keys())
        # If not found return the unknown token (1)
        start_tokens = [vocab.get(x, 1) for x in start_prompt.split()]
        sample_token = None
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = torch.LongTensor([start_tokens]).to(device)
            y = model(x)
            # Sample from the distribution over the final position only.
            sample_token = sample_from(
                torch.nn.functional.softmax(y[0][-1], dim=0).cpu().numpy(),
                temp,
            )
            start_tokens.append(sample_token)
            start_prompt += f" {vocab_words[sample_token]}"
    return f"\nGenerated text:\n{start_prompt}\n"

def train_network(...):  # continued from above
    ...
    for epoch in range(1, num_epochs + 1):
        ...
        # At the end of each epoch, print the loss and a sample generation.
        model.eval()
        print(f"Epoch: {epoch}, Training Loss: {np.mean(train_loss):.4f}")
        generated_text = generate_text(
            model,
            "wine review",
            max_tokens=80,
            temp=1.0,
            vocab=vocab,
            device=device,
        )
        print(generated_text)

Results

We train the network for 300 epochs, using the cross-entropy loss function. The loss graph is shown below.

Train Loss Graph

After training for 300 epochs, we can see an example output for the prompt: “wine review : us”

Generated text:

wine review : us : california : pinot noir : this flavor combines plump cherry , strawberry , raspberry and cola with lots of zesty textural complexity for the freshness and deep mouthfeel . while it ‘ s is too ripe for balance . the touches of <unk> — accented oak , this will be a lovely pairing with wide range of foods .

This generated response is very good: apart from the one unknown token, we have generated a convincing wine review!

[1] Generative Deep Learning, David Foster


Nathan Bailey

MSc AI and ML Student @ ICL. Ex ML Engineer @ Arm, Ex FPGA Engineer @ Arm + Intel, University of Warwick CSE Graduate, Climber. https://www.nathanbaileyw.com