Almost every model that matters in modern language work is a transformer. BERT, GPT, T5, LLaMA, the lot. They look intimidating from the outside, but the architecture is built from a small number of parts that each do one understandable job, and once you have assembled them by hand the whole thing stops being magic. This article builds a transformer from the ground up in PyTorch, from token embeddings through attention to a full encoder-decoder, and explains why each piece is there.
The shortcut, and why we will ignore it
PyTorch can hand you a complete transformer in a single call. You name the embedding size, the number of attention heads, and how many encoder and decoder layers to stack, and it builds the rest.
    import torch.nn as nn

    model = nn.Transformer(
        d_model=512,
        nhead=8,
        num_encoder_layers=6,
        num_decoder_layers=6,
    )
This is genuinely what you reach for in production. But it is a sealed box, and the point of this article is to open it. Everything below rebuilds the same machine bolt by bolt, because understanding the parts is what lets you debug, modify, and reason about transformers rather than just instantiate them.
Turning tokens into vectors
A model cannot read the word “dog.” It can only process numbers, and the first step turns each token, represented as an integer ID, into a dense vector the network can work with. That is an embedding layer, which is really just a learnable lookup table with one row per vocabulary word and one column per dimension.
    import math
    import torch.nn as nn

    class InputEmbeddings(nn.Module):
        def __init__(self, vocab_size, d_model):
            super().__init__()
            self.d_model = d_model
            self.embedding = nn.Embedding(vocab_size, d_model)

        def forward(self, x):
            return self.embedding(x) * math.sqrt(self.d_model)
The one detail worth flagging is that multiplication by the square root of the model dimension. It is a small trick from the original paper that scales the embedding values upward so they are not drowned out by the positional signal added next. Without it the meaningful word information would be tiny relative to the position information, like music too quiet to hear over background noise.
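As a quick check of the shapes and the scaling, the sketch below repeats the class and pushes a toy batch through it. The vocabulary size, model dimension, and token IDs are made-up values chosen only for illustration.

```python
import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)

torch.manual_seed(0)
embed = InputEmbeddings(vocab_size=100, d_model=16)   # toy sizes (assumption)
token_ids = torch.tensor([[5, 42, 7], [1, 0, 99]])    # (batch=2, seq_len=3)

out = embed(token_ids)              # scaled embeddings
raw = embed.embedding(token_ids)    # plain table lookup, no scaling

print(out.shape)                         # torch.Size([2, 3, 16])
print(torch.allclose(out, raw * 4.0))    # True, since sqrt(16) = 4
```

Each integer ID becomes one row of the table, so the output gains a trailing dimension of size d_model.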
Giving the model a sense of order
Attention, the core mechanism, has no built-in notion of sequence. To a pure attention layer, “dog bites man” and “man bites dog” look identical, which is clearly a problem. Positional encoding fixes it by adding a unique signature to each position before the sequence enters the network. The classic scheme uses sine and cosine waves at different frequencies, so that every dimension of the positional vector oscillates at its own speed and any two positions end up with a distinct combined fingerprint across all the dimensions.
    import math
    import torch
    import torch.nn as nn

    class PositionalEncoding(nn.Module):
        def __init__(self, d_model, max_seq_length):
            super().__init__()
            pe = torch.zeros(max_seq_length, d_model)
            position = torch.arange(0, max_seq_length).float().unsqueeze(1)
            div_term = torch.exp(
                torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
            )
            # Even columns get sine, odd columns get cosine.
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
            # Stored as a buffer: saved and moved with the model, never trained.
            self.register_buffer('pe', pe.unsqueeze(0))

        def forward(self, x):
            # x: (batch, seq_len, d_model); slice the table to the actual length.
            return x + self.pe[:, :x.size(1)]
Because these encodings are fixed mathematics rather than something the model learns, they are stored with register_buffer. That keeps them attached to the model, so they get saved, loaded, and moved to the GPU along with everything else, while making sure the optimizer never tries to update them.
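The effect of register_buffer can be checked directly. The sketch below repeats the class with toy sizes (an assumption) and inspects where the encoding table ends up: it should appear in the state dict and in the buffers, but not among the trainable parameters.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length).float().unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

pos_enc = PositionalEncoding(d_model=8, max_seq_length=10)  # toy sizes

param_names = [name for name, _ in pos_enc.named_parameters()]
buffer_names = [name for name, _ in pos_enc.named_buffers()]
state_keys = list(pos_enc.state_dict().keys())

print(param_names)    # []      -> nothing for the optimizer to update
print(buffer_names)   # ['pe']  -> follows the model across devices
print(state_keys)     # ['pe']  -> saved and loaded with the model

# Adding the encoding to zeros exposes the raw signature of each position.
out = pos_enc(torch.zeros(2, 4, 8))
print(out[0, 0])      # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
```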
Attention, the heart of the thing
Attention lets every token build its output as a weighted average over every token in the sequence, itself included, where the weights reflect how relevant each token is to it. The mechanism rests on three projections of the input, called query, key, and value. The intuition is a small conversation: the query is the question a token is asking, the key is what each token advertises about itself, and the value is what it actually contributes. You compare every query against every key to get relevance scores, turn those into weights with a softmax, and take the weighted sum of values. Written compactly, attention is softmax(Q K^T / sqrt(d_k)) V, where d_k is the head dimension.
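The formula can be sketched as a standalone function before any heads or learned projections are involved. The function name and the random toy tensors are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.size(-1)
    # Compare every query with every key: (batch, q_len, k_len).
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    # Each row of weights is a probability distribution over the keys.
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors.
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
q = torch.randn(1, 4, 8)   # (batch, seq_len, d_k), toy sizes
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape)          # torch.Size([1, 4, 8])
print(weights.sum(dim=-1))   # every row sums to 1
```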
Multi-head attention runs this several times in parallel with different learned projections, which is like having eight analysts read the same document, each trained to notice something different, one tracking grammar, another meaning, another references back to earlier words. Each head works on its own slice of the embedding, and their conclusions are stitched back together at the end.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super().__init__()
            self.num_heads = num_heads
            self.d_model = d_model
            self.head_dim = d_model // num_heads
            self.query_linear = nn.Linear(d_model, d_model, bias=False)
            self.key_linear = nn.Linear(d_model, d_model, bias=False)
            self.value_linear = nn.Linear(d_model, d_model, bias=False)
            self.output_linear = nn.Linear(d_model, d_model)

        def compute_attention(self, query, key, value, mask=None):
            # Scaled dot-product scores: (..., q_len, k_len).
            scores = torch.matmul(query, key.transpose(-2, -1)) / (self.head_dim ** 0.5)
            if mask is not None:
                # Positions where mask == 0 get zero weight after the softmax.
                scores = scores.masked_fill(mask == 0, float('-inf'))
            weights = F.softmax(scores, dim=-1)
            return torch.matmul(weights, value)

        # The head-splitting, head-combining, and forward methods are omitted here.
Two parts of the implementation are worth understanding even though the full class has more plumbing. The division by the square root of the head dimension keeps the scores from growing large as the dimensionality rises, which would otherwise push the softmax into a saturated region where gradients vanish. And the mask, when present, writes negative infinity into forbidden positions, because the softmax then turns those into zero weight, effectively blinding a token to places it is not allowed to look. The surrounding code splits the projected tensors so all heads compute in parallel through the batch dimension, then recombines them, with a final linear layer mixing the heads’ outputs.
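One way to fill in that plumbing is sketched below. The helper names split_heads and combine_heads, the divisibility check, and the forward signature are assumptions, chosen so that the module can be called as self_attn(x, x, x, mask) the way the later layers do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads
        self.query_linear = nn.Linear(d_model, d_model, bias=False)
        self.key_linear = nn.Linear(d_model, d_model, bias=False)
        self.value_linear = nn.Linear(d_model, d_model, bias=False)
        self.output_linear = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.transpose(1, 2)

    def combine_heads(self, x, batch_size):
        # (batch, heads, seq, head_dim) -> (batch, seq, d_model)
        x = x.transpose(1, 2).contiguous()
        return x.view(batch_size, -1, self.d_model)

    def compute_attention(self, query, key, value, mask=None):
        scores = torch.matmul(query, key.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:
            # The mask must broadcast to (batch, heads, q_len, k_len).
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, value)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        q = self.split_heads(self.query_linear(query), batch_size)
        k = self.split_heads(self.key_linear(key), batch_size)
        v = self.split_heads(self.value_linear(value), batch_size)
        attended = self.compute_attention(q, k, v, mask)
        return self.output_linear(self.combine_heads(attended, batch_size))

torch.manual_seed(0)
mha = MultiHeadAttention(d_model=16, num_heads=4)   # toy sizes
x = torch.randn(2, 5, 16)
causal = torch.tril(torch.ones(5, 5)).bool()        # True = allowed to look

out = mha(x, x, x, causal)
print(out.shape)   # torch.Size([2, 5, 16])

# With a causal mask, changing the last token must not affect earlier outputs.
x_changed = x.clone()
x_changed[:, -1] += 1.0
out_changed = mha(x_changed, x_changed, x_changed, causal)
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-6))   # True
```

Moving the heads into their own dimension next to the batch lets a single matmul handle all of them at once, which is why no explicit loop over heads appears.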
The per-token transform
After attention has let tokens share information across the sequence, a feed-forward sublayer transforms each token independently. It is a plain two-layer network: expand the vector to a wider hidden size, typically four times the model dimension, apply a ReLU, then project back down. The division of labour is clean. Attention is the teamwork where tokens consult each other, and the feed-forward network is each token going off alone to think about what it just learned. There is no cross-token communication here at all.
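A minimal sketch of that sublayer follows. The class name and constructor arguments match how it is used in the encoder layer; the attribute names fc1 and fc2 are assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand to the wider hidden size
        self.fc2 = nn.Linear(d_ff, d_model)   # project back down
        self.relu = nn.ReLU()

    def forward(self, x):
        # nn.Linear acts on the last dimension, so every token is transformed alone.
        return self.fc2(self.relu(self.fc1(x)))

torch.manual_seed(0)
ff = FeedForwardSubLayer(d_model=16, d_ff=64)   # d_ff = 4 * d_model
x = torch.randn(2, 5, 16)

full = ff(x)                # whole sequence at once
first_only = ff(x[:, :1])   # first token on its own

print(full.shape)                                            # torch.Size([2, 5, 16])
print(torch.allclose(full[:, :1], first_only, atol=1e-6))    # True
```

The last check is the "no cross-token communication" claim in code: a token's output is the same whether or not its neighbours are present.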
Wrapping sublayers so deep stacks can train
A transformer is many layers deep, and deep networks are hard to train because the signal and its gradients tend to fade as they pass through. The fix appears around every sublayer in the architecture and follows one fixed recipe: run the sublayer, apply dropout, add the original input back, and normalise.
    class EncoderLayer(nn.Module):
        def __init__(self, d_model, num_heads, d_ff, dropout):
            super().__init__()
            self.self_attn = MultiHeadAttention(d_model, num_heads)
            self.ff_sublayer = FeedForwardSubLayer(d_model, d_ff)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, src_mask):
            attn_output = self.self_attn(x, x, x, src_mask)
            x = self.norm1(x + self.dropout(attn_output))
            ff_output = self.ff_sublayer(x)
            x = self.norm2(x + self.dropout(ff_output))
            return x
That x + ... is the residual connection, and it is what makes depth possible. By adding the original input back, you ensure that even if a sublayer learns nothing useful, the original signal still passes through, so gradients have a short path back to the early layers. Layer normalisation then keeps the activations in a stable range. The pattern to memorise is simply LayerNorm(x + Dropout(Sublayer(x))), and it wraps both the attention and the feed-forward parts.
In a self-attention block the query, key, and value all come from the same input, which is why you see x passed three times. That is the token consulting its own sequence.
Stacking into a full encoder
The encoder is then a short pipeline. Embed the tokens, stamp them with positions, and pass them through a stack of identical encoder layers, each one giving every token a richer, more context-aware representation. After several layers the vector for “bank” in “river bank” looks genuinely different from “bank” in “bank account,” because the model has resolved the ambiguity by attending to the surrounding words.
    self.layers = nn.ModuleList([
        EncoderLayer(d_model, num_heads, d_ff, dropout)
        for _ in range(num_layers)
    ])
The use of nn.ModuleList rather than a plain Python list is not cosmetic. A normal list would hide those layers from PyTorch’s parameter tracking, so the optimizer would never see their weights and they would silently fail to train. ModuleList registers them properly.
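The difference is easy to see by counting parameters. The two toy classes below are hypothetical and use nn.Linear as a stand-in for the encoder layers, so the example stays self-contained.

```python
import torch.nn as nn

class PlainListStack(nn.Module):
    def __init__(self, d_model, num_layers):
        super().__init__()
        # A plain Python list: PyTorch does not register these layers.
        self.layers = [nn.Linear(d_model, d_model) for _ in range(num_layers)]

class ModuleListStack(nn.Module):
    def __init__(self, d_model, num_layers):
        super().__init__()
        # nn.ModuleList registers every layer as a submodule.
        self.layers = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_layers)]
        )

def count_params(module):
    return sum(p.numel() for p in module.parameters())

plain_count = count_params(PlainListStack(8, 3))
tracked_count = count_params(ModuleListStack(8, 3))

print(plain_count)     # 0   -> the optimizer would receive nothing
print(tracked_count)   # 216 -> 3 layers * (8*8 weights + 8 biases)
```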
Bodies and heads
A useful way to think about a transformer is that the encoder is a general-purpose understanding engine, the body, and the task-specific part bolted on top is the head. For sentiment analysis the head can be as simple as a single linear layer mapping the hidden representation to two scores, positive and negative, followed by a log-softmax for numerically stable training. Swapping the head while keeping the body, a classifier for one task, a regressor for another, a language model head for a third, is exactly how transfer learning works in practice.
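A sketch of such a head is below. The class name is hypothetical, and mean-pooling over the sequence is an assumption about how the per-token vectors get reduced to one vector per example; taking the first token's vector is another common choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super().__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, d_model) from the encoder body.
        pooled = hidden_states.mean(dim=1)     # assumption: mean-pool the tokens
        return F.log_softmax(self.fc(pooled), dim=-1)

torch.manual_seed(0)
hidden = torch.randn(4, 10, 16)   # stand-in for encoder output
head = ClassifierHead(d_model=16, num_classes=2)

log_probs = head(hidden)
labels = torch.tensor([0, 1, 1, 0])
loss = F.nll_loss(log_probs, labels)   # nll_loss expects log-probabilities

print(log_probs.shape)              # torch.Size([4, 2])
print(log_probs.exp().sum(dim=-1))  # each row sums to 1
```

Pairing log_softmax with nll_loss is equivalent to applying cross-entropy to the raw scores, which is where the numerical stability comes from.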
Generating text means hiding the future
Decoders produce tokens one at a time, left to right. During training, though, you feed the whole target sequence at once for efficiency, which creates a cheating risk: the token at position two must not be allowed to see position three, or it would simply copy the answer. The causal mask enforces this. It is a lower-triangular boolean matrix that marks future positions as off limits.
    import torch

    seq_length = 3
    # True on and below the diagonal (visible), False above it (the future).
    tgt_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
Fed into the attention mask, the forbidden upper triangle becomes negative infinity in the scores and therefore zero weight after the softmax, making each position blind to everything it has not yet generated. A decoder layer is then structurally identical to an encoder layer, with the single crucial difference that its self-attention uses this causal mask. That one change is what turns an understanding block into a generative one. Add a final linear layer mapping each position’s hidden state to a score over the whole vocabulary, and you have GPT in miniature: at every position the model is answering, given everything so far, what is the most likely next word.
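The effect of the mask is easiest to see on uniform scores. In this sketch every score is set to zero, an artificial choice, so that the only thing shaping the weights is the mask itself.

```python
import torch
import torch.nn.functional as F

seq_length = 3
tgt_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()

# Uniform scores: without a mask every position would get weight 1/3.
scores = torch.zeros(1, seq_length, seq_length)
masked_scores = scores.masked_fill(tgt_mask == 0, float('-inf'))
weights = F.softmax(masked_scores, dim=-1)

print(weights[0])
# tensor([[1.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000],
#         [0.3333, 0.3333, 0.3333]])
```

Row i holds the weights for position i: the first token can only see itself, the second splits its attention over the first two, and so on.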
Letting the decoder read the source
For sequence-to-sequence tasks like translation or summarisation, the decoder needs to consult the encoder’s output, and that is the job of cross-attention, a second attention block inserted in the middle of the decoder layer.
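    # x: decoder input, y: encoder output
    def forward(self, x, y, tgt_mask, cross_mask):
        # 1. Causally masked self-attention over the target so far.
        self_attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_output))
        # 2. Cross-attention: queries from the decoder, keys and values from the encoder.
        cross_attn_output = self.cross_attn(x, y, y, cross_mask)
        x = self.norm2(x + self.dropout(cross_attn_output))
        # 3. Per-token feed-forward transform.
        ff_output = self.ff_sublayer(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x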
The detail that makes cross-attention click is where the query, key, and value come from. The query is the decoder’s current state, what it is trying to figure out, while the keys and values come from the encoder’s output, what the source sequence said. This is precisely how a translation model reads the French source while writing the English target: each step of generating English queries the encoded French to decide which source words to lean on. Note that a decoder layer with cross-attention now has three sublayers, self-attention, cross-attention, and feed-forward, each wrapped in its own residual and normalisation.
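The shapes make the split between query and key/value concrete. To keep this sketch short it uses PyTorch's built-in nn.MultiheadAttention rather than the hand-built class, and the sequence lengths and sizes are toy values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 16, 4
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

decoder_state = torch.randn(2, 5, d_model)    # 5 target positions so far
encoder_output = torch.randn(2, 7, d_model)   # 7 source positions

# Query from the decoder; key and value from the encoder.
attended, weights = cross_attn(
    query=decoder_state, key=encoder_output, value=encoder_output
)

print(attended.shape)   # torch.Size([2, 5, 16]) -> one vector per target position
print(weights.shape)    # torch.Size([2, 5, 7])  -> target-by-source weights
```

The output length follows the query, so the decoder keeps its own sequence length, while each row of the weight matrix spreads one target position's attention across the source tokens.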
The whole machine
Assemble an encoder and a cross-attention decoder and you have the complete original transformer. The encoder reads the source once and compresses it into a rich representation of every source token. The decoder then generates the target one token at a time, attending both to its own previous outputs through causally masked self-attention and to the full encoder output through cross-attention. Three masks coordinate what each attention block may see: a padding mask in the encoder so it ignores filler tokens, the causal-plus-padding mask in the decoder’s self-attention, and a padding mask on the cross-attention so it skips padded positions in the encoder output.
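The three masks can be built from the token IDs alone. In this sketch the helper names, the padding ID of 0, and the four-dimensional mask shapes are assumptions; the shapes are chosen so the masks broadcast against attention scores of shape (batch, heads, query_len, key_len).

```python
import torch

PAD_ID = 0   # assumption: token ID reserved for padding

def make_padding_mask(token_ids, pad_id=PAD_ID):
    # (batch, seq) -> (batch, 1, 1, seq); True marks a real token.
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2)

def make_causal_mask(seq_length):
    # (1, 1, seq, seq); True on and below the diagonal.
    return torch.tril(torch.ones(seq_length, seq_length)).bool().unsqueeze(0).unsqueeze(0)

src = torch.tensor([[7, 8, 9, 0, 0]])   # source with two padding tokens
tgt = torch.tensor([[4, 5, 6, 0]])      # target with one padding token

src_mask = make_padding_mask(src)                                   # encoder self-attention
tgt_mask = make_padding_mask(tgt) & make_causal_mask(tgt.size(1))   # decoder self-attention
cross_mask = src_mask                                               # decoder cross-attention

print(src_mask[0, 0, 0])
# tensor([ True,  True,  True, False, False])
print(tgt_mask[0, 0])
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True, False]])
```

The cross-attention mask reuses the source padding mask because its keys and values are the encoder outputs, so the padded positions to skip are the source's.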
Three families from the same parts
Everything above collapses into three architectures you will meet constantly. An encoder on its own, topped with a classifier head, is the BERT family, used for classification, named-entity recognition, and producing embeddings. A decoder on its own, topped with a language model head, is the GPT and LLaMA family, used for text generation and chat. The two together, joined by cross-attention, is the T5 and BART family, used for translation and summarisation. They are not three different inventions but three arrangements of the same handful of components you just built: embeddings, positional encodings, attention, feed-forward sublayers, and the residual-and-norm wrapper that holds it all together.
Build these pieces once by hand and nn.Transformer stops being a black box. You will know exactly what every argument controls and what is happening inside when you run it.
See you soon.