Transformers Explained: The Holy Grail of GenAI

Nathan Bailey
13 min read · Jul 12, 2024


Following on from my previous blog on Diffusion models, I decided to research the holy grail of building blocks for generative models: the transformer.

The full code for this blog can be found on GitHub and was adapted from [1]. The majority of information in this blog was sourced from [1] too, which is an excellent book on Generative Deep Learning.

Architecture

Transformer models have taken over from RNN models in the context of text generation. RNN models only allow the network to process sequences one token at a time. Transformers, on the other hand, can exploit parallelism, allowing them to be trained on massive datasets.

The transformer model is built around the attention mechanism (the attention head). An attention head can decide where in the input sequence to pull information from; in other words, it pays more attention to some words and less attention to (or ignores) others.

In this blog, we will focus on the task of predicting the next word in a sequence; the architecture of the attention mechanism and the transformer will be viewed in this context.

Contrasting an attention head with a recurrent layer: a recurrent layer builds up a generic hidden state based on the inputs at each timestep. However, only some of the words captured in this hidden vector are directly relevant to the task (predicting the next word); some words should not influence the next word at all. Attention heads, the building block of the transformer model, can decide where to pull information from and can pick and choose how to combine information from nearby words.

For an attention head, we have a few inputs:

  • Query (Q) — A representation of the current task at hand. In the context of our task, we take the embedding of the word we want to predict the next word after and multiply it by a weight matrix Wq to change its dimensionality.
  • Key Vectors (K) — Representations of each word in the sentence. We pass each token embedding through a weight matrix Wk to change its dimensionality to match that of the query.
  • Value Vectors (V) — Representations of each word in the sentence. These follow the same process as the key vectors: we pass each token embedding through a weight matrix Wv to change its dimensionality. Importantly, the resulting dimensionality does not have to match that of the key or query.

In all 3 cases here, the multiplication of the vector by a weight matrix is analogous to using a dense/linear layer.

Each key is compared to the query using a dot product between each pair of vectors (each key and the query). The higher the dot product, the more that key is allowed to contribute to the output of the attention head. These dot products are then scaled, and the softmax function is applied so that the contributions sum to 1. Note that these are not probabilities; the softmax simply ensures that the values lie between 0 and 1 and collectively sum to 1. These values are called attention weights.

Ideally, the attention weights should be close to zero for tokens that have little influence on the output and largest for the inputs that have the most influence. We do not want them to be negative, to avoid the situation where one weight becomes large and another compensates by becoming negative. We also want them to sum to 1 so that if an output pays more attention to one input, it is at the expense of another. This is why the softmax function is applied. [2]

The final output is the sum of the attention weights multiplied by the value vectors, which gives an output the size of the value vectors. This is called a context vector. We can see this process diagrammatically below, for the task of predicting the word coming after the word "too".

Attention Head [1]

The formula for this is also shown: Attention(Q, K, V) = softmax(QKᵀ / √Dk) V.

Attention Head Formula [1]
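
To make the computation concrete, here is a minimal sketch with made-up numbers (one query, three keys/values, Dk = 4); the matrices are illustrative only:

import numpy as np

query = np.array([[1.0, 0.0, 1.0, 0.0]])            # shape (1, 4)
keys = np.random.normal(size=(3, 4))                # one key per token
values = np.random.normal(size=(3, 2))              # value dimension can differ

# Dot product of the query with every key, scaled by sqrt(Dk).
scores = query @ keys.T / np.sqrt(keys.shape[-1])   # shape (1, 3)
# Softmax turns the scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
# The context vector is the weighted sum of the value vectors.
context = weights @ values                          # shape (1, 2)
print(weights, context)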

We scale the dot products because the gradients of the softmax function become small when its inputs are large in magnitude. To prevent this, we seek out a suitable scaling. If the elements of the query and key vectors are independent random numbers with zero mean and unit variance, then each element-wise product has unit variance, and since Var(X+Y) = Var(X) + Var(Y) for independent variables, the resulting dot product has variance Dk (the dimensionality of the keys). Therefore, a suitable scaling is the square root of Dk, which is the standard deviation [2].
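
As a quick sanity check of this argument, the sketch below draws query and key elements with zero mean and unit variance and shows that the raw dot products have a variance of roughly Dk, while the scaled dot products have roughly unit variance:

import numpy as np

d_k = 64
q = np.random.normal(size=(100_000, d_k))  # unit-variance query elements
k = np.random.normal(size=(100_000, d_k))  # unit-variance key elements

dots = (q * k).sum(axis=-1)
print(dots.var())                   # roughly d_k (about 64)
print((dots / np.sqrt(d_k)).var())  # roughly 1 after scaling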

Multiple Heads

There might be multiple patterns of attention that are relevant at the same time. We can combine attention heads, where each head learns a distinct attention mechanism so the layer as a whole can learn more complex relationships.

Each attention head has independent learnable parameters that calculate the query, key and value vectors. We take the outputs from the attention heads and concatenate them together. We can then use a linear layer to project the concatenated output to a desired size, as shown below.

Masking

One of the benefits of the attention head and the transformer architecture is that it enables operation on every query in the sentence at once. That is, each token in a sequence acts as a target value for the sequence of previous tokens and as an input value for subsequent tokens.

We can batch the vectors together in a matrix, but we need to apply a mask to the query and key dot product to avoid information from future words leaking through. We can use a causal mask for this, which is shown below.

This mask sets to zero all the attention weights that correspond to any later tokens in the sequence. E.g. for the query elephant, we set the weights of the tokens (tried, to, get …) to zero.

We can implement a multi-head attention layer in Python. First, we implement the attention scores. We take in the queries, keys and values, we also take in the causal mask explained above.

import tensorflow as tf
from tensorflow import keras


class DotProductAttention(keras.layers.Layer):
    """
    Custom Dot Product Attention Layer.
    Takes the scaled dot product between queries and keys.
    """

    def call(
        self,
        queries: tf.Tensor,
        keys: tf.Tensor,
        values: tf.Tensor,
        d_k: int,
        mask: tf.Tensor | None = None,
    ) -> tf.Tensor:
        """Forward Pass."""
        # Scaled dot product between the queries and the keys.
        scores = tf.matmul(queries, keys, transpose_b=True) / tf.sqrt(
            tf.cast(d_k, dtype=tf.float32)
        )
        if mask is not None:
            # Add a large negative value to the masked-out positions so that
            # their attention weights become ~0 after the softmax.
            scores += -1e9 * tf.cast(
                tf.where(tf.cast(mask, dtype=tf.uint8) == 0, 1, 0), tf.float32
            )
        attention_weights = keras.backend.softmax(scores)
        attention_output = tf.matmul(attention_weights, values)
        return attention_output

We compute the dot product of the query and the keys and scale these by the square root of the dimension of the keys.

Then we apply the mask. Here we assume a boolean mask, with values of true corresponding to attention weights that should be kept and values of false corresponding to attention weights that should be masked out. We can implement a causal mask using the function shown below.

def causal_attention_mask(
    batch_size: int,
    key_length: int,
    query_length: int,
    dtype: tf.dtypes.DType,
) -> tf.Tensor:
    """
    Create a causal mask for
    the multi-head attention layer.
    """
    # Query position i may attend to key position j only if j <= i.
    i = tf.range(query_length)[:, None]
    j = tf.range(key_length)
    mask = i >= j - key_length + query_length
    mask = tf.cast(mask, dtype)
    mask = tf.reshape(mask, [1, query_length, key_length])
    # Tile the (1, query_length, key_length) mask across the batch dimension.
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
        0,
    )
    return tf.tile(mask, mult)

An example mask is shown below. The queries make up the rows, so using the earlier example, the mask for the query "the" is the first row of the matrix. As we move down the rows (through the queries), more of the previous tokens in the sentence are allowed to be used.

Example Causal Mask (query and key length of 10)

We convert the mask to binary values and then invert it. We multiply the inverted values by a large negative number and add them to the scores. When these large negative scores enter the softmax function, the corresponding attention weights become 0, masking out those positions.

We finally multiply the output of the softmax by the value vectors, giving us our attention output.

We can then incorporate this into a multi-head attention layer. We take in the dimensions of our keys, values and final output, and the number of heads that we require. We can then construct the linear layers that will perform the dimensionality changes on our inputs.

class MultiHeadAttention(keras.layers.Layer):
    """Multi Head Attention Layer Class."""

    def __init__(
        self,
        num_heads: int,
        key_dim: int,
        output_dim: int | None = None,
        value_dim: int | None = None,
    ) -> None:
        """Init Variables and Layers."""
        super().__init__()
        self.attention = DotProductAttention()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.value_dim = value_dim
        self.output_dim = output_dim

        if self.value_dim is None:
            self.value_dim = key_dim

        if self.output_dim is None:
            self.output_dim = key_dim

        self.w_q = keras.layers.Dense(self.key_dim)
        self.w_k = keras.layers.Dense(self.key_dim)
        self.w_v = keras.layers.Dense(self.value_dim)
        self.w_o = keras.layers.Dense(self.output_dim)

We also construct a helper function to reshape the inputs. This function takes an input of size (N, S, D), where N is the batch size, S is the sequence length and D is the embedding size of each word in the sequence. We split this into shape (N, S, num_heads, X), where X = D/num_heads, and then transpose it to (N, num_heads, S, X) so that each head can process its slice independently.

def reshape_tensor(self, x: tf.Tensor, flag: bool = True) -> tf.Tensor:
    """Reshape inputs to allow them to be processed by multiple heads."""
    if flag:
        # (N, S, D) -> (N, S, num_heads, D/num_heads) -> (N, num_heads, S, D/num_heads)
        input_shape = tf.shape(x)
        x = tf.reshape(
            x, shape=(input_shape[0], input_shape[1], self.num_heads, -1)
        )
        x = tf.transpose(x, perm=(0, 2, 1, 3))
        return x

    # Reverse: (N, num_heads, S, X) -> (N, S, num_heads, X) -> (N, S, num_heads * X)
    x = tf.transpose(x, perm=(0, 2, 1, 3))
    input_shape = tf.shape(x)
    x = tf.reshape(x, shape=(input_shape[0], input_shape[1], -1))
    return x

As we can see from the forward pass, this reshape function is applied to the outputs of the query, key and value layers. This allows us to use a single linear layer for all the heads; equivalently, we are splitting each weight matrix into num_heads individual matrices and applying each one to the inputs.

def call(
    self,
    query: tf.Tensor,
    value: tf.Tensor,
    key: tf.Tensor,
    mask: tf.Tensor | None = None,
) -> tf.Tensor:
    """Forward Pass."""
    query = self.reshape_tensor(self.w_q(query))
    key = self.reshape_tensor(self.w_k(key))
    value = self.reshape_tensor(self.w_v(value))
    output = self.attention(query, key, value, self.key_dim, mask=mask)
    output = self.reshape_tensor(output, flag=False)
    output = self.w_o(output)
    return output

We apply the attention layer to the reshaped outputs of the query, key and value layers. The per-head outputs are then concatenated back together by applying the reverse reshape and passed through the output layer to give us our final output.
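
To check that the shapes line up, here is a minimal usage sketch of the layers defined above. The batch size, sequence length and dimensions are arbitrary, and the (1, seq_len, seq_len) mask relies on broadcasting across the batch and head dimensions of the score tensor:

batch_size, seq_len, embed_dim = 2, 10, 64

# Random embeddings standing in for a real batch of sequences.
dummy_inputs = tf.random.normal((batch_size, seq_len, embed_dim))

# Causal mask of shape (1, seq_len, seq_len); it broadcasts over the
# (batch, heads, seq, seq) attention scores.
mask = causal_attention_mask(1, seq_len, seq_len, tf.bool)

mha = MultiHeadAttention(num_heads=4, key_dim=64, output_dim=embed_dim)
output = mha(dummy_inputs, dummy_inputs, dummy_inputs, mask=mask)
print(output.shape)  # (2, 10, 64)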

Transformer Block

The transformer block below consists of a multi-head attention layer, layer normalization and feed-forward layers. Skip connections are also applied, which help avoid the vanishing gradient problem.

Transformer Block

Layer normalization provides stability to the training process. Given a vector created from a single query, we calculate the mean and standard deviation across that vector and use them to normalise it. In reality, we have multiple queries and multiple sequences forming a batch; the sentence length equals the number of queries, since each word acts as a query. So, for each query output, we calculate a mean and standard deviation and normalise that output using them.

Given an input tensor of shape [B, L, D], for each sample in the batch (B) and each position in the sequence (L), we compute one mean and one standard deviation across the D dimension, i.e. each is calculated over D values. We then normalise each position in the sequence using its own mean and standard deviation.

Layer Normalization [1]
Layer Normalization Formula
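
As a quick check of this (with arbitrary tensor sizes), normalising each position over the feature dimension by hand matches Keras' LayerNormalization with its default scale and offset:

x = tf.random.normal((2, 5, 8))  # (batch B, sequence length L, features D)

# One mean and variance per (sample, position), computed over the D dimension.
mean = tf.reduce_mean(x, axis=-1, keepdims=True)
variance = tf.math.reduce_variance(x, axis=-1, keepdims=True)
manual = (x - mean) / tf.sqrt(variance + 1e-6)

# Keras' layer (with default gamma=1, beta=0) produces the same result.
layer_norm = keras.layers.LayerNormalization(epsilon=1e-6)
print(tf.reduce_max(tf.abs(manual - layer_norm(x))))  # ~0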

A set of dense layers is applied at the end of the transformer block; this allows the block to extract higher-level features as we go deeper into the network.

We can implement a transformer block in Keras as shown below.

We set up all the layers in the init function, opting to use Keras' built-in multi-head attention layer here. In the forward pass, we simply pass the input through the layers as shown in the diagram above.

from collections.abc import Callable


class TransformerBlock(keras.layers.Layer):
    """Transformer Block Layer."""

    def __init__(
        self,
        num_heads: int,
        key_dim: int,
        embed_dim: int,
        ff_dim: int,
        mask_function: Callable[
            [int, int, int, tf.dtypes.DType], tf.Tensor
        ] = causal_attention_mask,
        dropout_rate: float = 0.1,
    ) -> None:
        """Init variables and layers."""
        super().__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.embed_dim = embed_dim
        self.ff_dim = ff_dim
        self.dropout_rate = dropout_rate
        self.attn = keras.layers.MultiHeadAttention(
            num_heads, key_dim, output_shape=embed_dim
        )
        self.dropout_1 = keras.layers.Dropout(self.dropout_rate)
        self.layer_norm_1 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.ffn_1 = keras.layers.Dense(self.ff_dim, activation="relu")
        self.ffn_2 = keras.layers.Dense(self.embed_dim)
        self.dropout_2 = keras.layers.Dropout(self.dropout_rate)
        self.layer_norm_2 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.mask_function = mask_function

    def call(self, inputs: tf.Tensor) -> tf.Tensor:
        """Forward Pass."""
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        # Causal mask so each position only attends to itself and earlier tokens.
        mask = self.mask_function(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.attn(
            query=inputs, value=inputs, key=inputs, attention_mask=mask
        )
        attention_output = self.dropout_1(attention_output)
        # Skip connection + layer norm around the attention sub-layer.
        out1 = self.layer_norm_1(inputs + attention_output)
        ffn_1 = self.ffn_1(out1)
        ffn_2 = self.ffn_2(ffn_1)
        ffn_output = self.dropout_2(ffn_2)
        # Skip connection + layer norm around the feed-forward sub-layer.
        output = self.layer_norm_2(out1 + ffn_output)
        return output
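
As a quick sanity check of the causal masking (the sizes below are arbitrary), changing the later tokens of the input should leave the block's outputs at earlier positions unchanged:

block = TransformerBlock(num_heads=4, key_dim=32, embed_dim=64, ff_dim=128)
x = tf.random.normal((1, 10, 64))
# Replace the last 5 positions with different random embeddings.
x_changed = tf.concat([x[:, :5], tf.random.normal((1, 5, 64))], axis=1)
out, out_changed = block(x), block(x_changed)
# Outputs at the first 5 positions are unaffected by the change (difference ~0).
print(tf.reduce_max(tf.abs(out[:, :5] - out_changed[:, :5])))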

Positional Encoding

In the attention layer, given a query and a set of keys, we compute the dot product between the query and each key in parallel. This is opposed to a recurrent network, where, for a given query, we would have to feed in all the keys in order, timestep by timestep, before predicting the next token. This parallelism is a strength of the transformer, but it means there is now no inherent ordering of the keys.

We can solve this problem using positional encoding when creating the inputs to the transformer model. We encode each token using a token embedding as before, but now we also encode the position of the token using a positional embedding.

The token embedding is created by passing the tokens through an embedding layer. The positional embedding is created by passing the indexes of the sentence through an embedding layer.

We add the 2 embeddings together to produce our final embedding.

This can easily be implemented in Keras. The layer takes in the vocabulary size and the maximum length a sentence can take. We then construct two embedding layers, one for the token embedding and one for the position embedding.

class TokenAndPositionEmbedding(keras.layers.Layer):
    """Token and positioning embedding layer for a sequence."""

    def __init__(
        self, max_len_input: int, vocab_size: int, embed_dim: int
    ) -> None:
        """Init variables and layers."""
        super().__init__()
        self.max_len = max_len_input
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.token_emb = keras.layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.pos_emb = keras.layers.Embedding(
            input_dim=max_len_input, output_dim=embed_dim
        )

During the forward pass of the layer, we embed the input sentence using the token and position embedding layers, passing the indices of the sentence into the position embedding layer.

def call(self, x: tf.Tensor) -> tf.Tensor:
    """Forward Pass."""
    # Positions 0 .. sequence_length - 1, embedded and added to the token embeddings.
    len_input = tf.shape(x)[-1]
    positions = tf.range(start=0, limit=len_input, delta=1)
    positions = self.pos_emb(positions)
    x = self.token_emb(x)
    return x + positions
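
A quick usage sketch of this layer is shown below; the vocabulary size, maximum length and token ids are made up for illustration:

embedding_layer = TokenAndPositionEmbedding(
    max_len_input=80, vocab_size=10_000, embed_dim=64
)
dummy_tokens = tf.constant([[2, 45, 7, 912, 3]])  # one tokenised sentence (made-up ids)
print(embedding_layer(dummy_tokens).shape)        # (1, 5, 64)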

GPT Model

To conclude this blog, we present the architecture of the GPT model. It consists of an embedding layer, stacked transformer blocks and a dense layer. The dense layer outputs a probability distribution over the entire vocabulary. The model below is shown for a single query.

In reality, during training the input into the block is still a sequence of tokens. However, the model can compute each token acting as a query in parallel, as we saw with the causal mask. We therefore get an output matrix of probabilities, where each row i is a probability distribution over the vocabulary obtained using token Xi as the query. That is, the output distribution for token Xi gives the probabilities for the next token Xi+1, given all the previous tokens X1, X2, …, Xi.

This is a key advantage of transformer-based models. Unlike recurrent neural networks (RNNs) that process tokens one by one, transformer models process all input tokens in parallel. For a given input sequence x1…xn, the model performs a single forward pass through all its layers, producing outputs for all positions (all tokens as queries) at once.

We also make use of parallelism and process a batch of sequences at once. This requires the sequences to be of the same length, which we achieve by padding them.

During inference, we generate one token at a time in an autoregressive manner. We input a sequence of tokens and get the probability for the next token, append it to the sequence and repeat the process, like in an RNN.

GPT Model [1]
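
A minimal sketch of this autoregressive loop is shown below. The greedy argmax choice and the index_to_word lookup are illustrative assumptions rather than the exact code from the repository (which could, for example, sample from the distribution instead):

import numpy as np

def generate_text(
    model: keras.models.Model,
    start_tokens: list[int],
    max_new_tokens: int,
    index_to_word: dict[int, str],
) -> str:
    """Greedy autoregressive generation sketch (assumed helper, not repo code)."""
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        # The model returns a distribution over the vocabulary for every
        # position; we only need the distribution at the last position.
        probs = model.predict(np.array([tokens]), verbose=0)[0, -1]
        tokens.append(int(np.argmax(probs)))  # greedy; sampling also works
    return " ".join(index_to_word[token] for token in tokens)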

Experiment

We implement a small GPT model as shown below, consisting of one transformer block. We then train this model to predict the next words for wine review sentences [3]. Some examples of the training sentences are also shown below.

def create_gpt_model(
    max_len_input: int,
    vocab_size: int,
    embed_dim: int,
    feed_forward_dim: int,
    num_heads: int,
    key_dim: int,
) -> keras.models.Model:
    """Create GPT Model."""
    inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
    x = TokenAndPositionEmbedding(max_len_input, vocab_size, embed_dim)(
        inputs
    )
    x = TransformerBlock(
        num_heads,
        key_dim,
        embed_dim,
        feed_forward_dim,
        mask_function=causal_attention_mask,
    )(x)
    outputs = keras.layers.Dense(vocab_size, activation="softmax")(x)
    gpt_model = keras.models.Model(inputs=inputs, outputs=outputs)
    return gpt_model

wine review : Italy : Sicily & Sardinia : White Blend : Aromas include tropical fruit , broom , brimstone and dried herb . The palate isn ' t overly expressive , offering unripened apple , citrus and dried sage alongside brisk acidity . 

wine review : Portugal : Douro : Portuguese Red : This is ripe and fruity , a wine that is smooth while still structured . Firm tannins are filled out with juicy red berry fruits and freshened with acidity . It ' s already drinkable , although it will certainly be better from 2016 .
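
For reference, a minimal sketch of how such a model might be compiled and trained is shown below. The hyperparameter values and the train_ds dataset (assumed to yield pairs of input tokens and next-token targets, i.e. the sequence shifted by one) are illustrative assumptions, not the exact configuration used in the repository:

model = create_gpt_model(
    max_len_input=80,    # assumed maximum sequence length
    vocab_size=10_000,   # assumed vocabulary size
    embed_dim=256,
    feed_forward_dim=256,
    num_heads=2,
    key_dim=256,
)
# Each output position is a distribution over the vocabulary, trained against
# the next token, so we use sparse categorical cross-entropy.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(train_ds, epochs=20)  # train_ds: assumed (tokens[:-1], tokens[1:]) pairs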

Below, we can see the output of the GPT model, after training it for 20 epochs, for the prompt "wine review : germany". We can see that it creates a very convincing wine review!

generated text:
wine review : germany : mosel : riesling : this is a remarkably minerally riesling , boasting aromas of ripe yellow peach and apple , with a hint of petrol . it ' s round and lush on the palate , with a lemony acidity that lingers on the finish .

Conclusion

This blog outlined the architecture of the transformer model, including its key building block, the attention head layer. We showed this transformer model applied in the context of a GPT model and showed that when trained on a dataset, it is very effective in generating convincing data.

  1. Generative Deep Learning, David Foster
  2. Deep Learning: Foundations and Concepts, Christopher M Bishop
  3. https://www.kaggle.com/datasets/zynicide/wine-reviews

