Transformer · Attention · Deep Learning · NLP

How the Transformer Architecture Actually Works

A practical walkthrough of the Transformer — self-attention, multi-head attention, positional encoding, and a hands-on PyTorch implementation you can actually learn from.

Published 2026-02-12 · 15 min read

The Paper That Changed Everything

Back in 2017, Vaswani et al. dropped a paper called "Attention Is All You Need" and it genuinely changed the game. Before that, we were stuck with RNNs and LSTMs for NLP tasks. They worked, sure, but they had a fundamental problem: they processed tokens one at a time. That meant long-range dependencies were hard to capture, and you couldn't really parallelize training across sequence positions.

The Transformer fixed both of these problems in one shot by ditching recurrence entirely. Instead of crawling through a sequence step by step, it looks at all positions at once and computes relevance scores between every pair of tokens. That's a huge deal — it unlocks massive parallelism during training and lets the model pick up on dependencies no matter how far apart they are in the sequence.

Today, virtually every state-of-the-art language model is built on the Transformer. BERT, GPT-4, LLaMA — all of them. If you're working in modern AI, understanding this architecture isn't optional. In this article, we'll walk through each core component, build up the mathematical intuition, and tie it all together with a working PyTorch implementation.

Self-Attention: Where the Magic Happens

Self-attention (or scaled dot-product attention, if you want to be formal about it) is what lets each token in a sequence "look at" every other token. Here's how it works: for each input token, the model computes three vectors — a Query (Q), a Key (K), and a Value (V). You get these by multiplying the input embedding by three learned weight matrices: W_Q, W_K, and W_V.

The attention score between two positions is just the dot product of one position's query with the other's key. The full formula is: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. That division by sqrt(d_k) is important — without it, the dot products can get really large, which pushes the softmax into regions where gradients basically vanish.

Here's a nice way to think about it: each query is asking "which tokens in this sequence matter most to me?" The keys provide the answers, and the attention weights determine how much each value contributes to the output at that position. What makes this so powerful is that it's completely dynamic — the model focuses on different parts of the input depending on context. That's way more flexible than fixed convolution windows or the step-by-step processing of recurrent models.

python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention mechanism."""

    def __init__(self, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        d_k = query.size(-1)

        # Compute attention scores: (batch, heads, seq_len, seq_len)
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

        # Apply mask (e.g., for causal / padding masks)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))

        # Normalize with softmax and apply dropout
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Weighted sum of values
        output = torch.matmul(attn_weights, value)
        return output, attn_weights

Scaled dot-product attention — the core building block
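The mechanics above can be sanity-checked with toy tensors. The shapes here (batch of 2, sequence length 4, d_k of 8) are arbitrary, chosen just to make the check quick:

```python
import math

import torch
import torch.nn.functional as F

# Toy check of Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
# with arbitrary shapes: batch=2, seq_len=4, d_k=8.
torch.manual_seed(0)
q = torch.randn(2, 4, 8)
k = torch.randn(2, 4, 8)
v = torch.randn(2, 4, 8)

scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
weights = F.softmax(scores, dim=-1)
out = weights @ v

print(out.shape)            # torch.Size([2, 4, 8])
print(weights.sum(dim=-1))  # every row of attention weights sums to 1
```

Since PyTorch 2.0, `F.scaled_dot_product_attention` performs this same computation with fused kernels, but spelling it out once makes the formula concrete.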

Multi-Head Attention: Learning Different Relationships in Parallel

A single attention head can only capture one type of relationship between tokens. That's pretty limiting. Multi-head attention solves this by running multiple attention operations in parallel, each with its own set of learned projections. The original Transformer uses 8 heads, and each one operates on a 64-dimensional slice of the 512-dimensional model space.

For each head i, you compute head_i = Attention(X W_Q_i, X W_K_i, X W_V_i). Then you concatenate all the head outputs and push them through a final linear layer: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O. In practice, different heads end up specializing naturally — some learn syntactic patterns, others pick up semantic similarity, and others focus on positional relationships. It's quite elegant.

python
class MultiHeadAttention(nn.Module):
    """Multi-head attention with configurable number of heads."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"

        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        # Linear projections for Q, K, V and output
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

        self.attention = ScaledDotProductAttention(dropout)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        batch_size = query.size(0)

        # Project and reshape: (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_k)
        q = self.w_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Apply attention across all heads in parallel
        attn_output, _ = self.attention(q, k, v, mask)

        # Concatenate heads and project: (batch, n_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.d_model)
        )

        return self.w_o(attn_output)

Multi-head attention — letting the model learn diverse patterns
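The trickiest part of the implementation is the reshape dance between `view` and `transpose`. A standalone check (using the paper's d_model=512 and 8 heads; batch and length are arbitrary) confirms the split and merge are exact inverses:

```python
import torch

# Illustrates the head split/merge reshapes used above.
batch, seq_len, d_model, n_heads = 2, 10, 512, 8
d_k = d_model // n_heads  # 64, as in the original paper

x = torch.randn(batch, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_k)
heads = x.view(batch, seq_len, n_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 64])

# Merge back: the round trip recovers the original tensor exactly
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))  # True
```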

Positional Encoding: Teaching Order to an Orderless Model

Here's the thing about Transformers: since they process all positions in parallel, they have zero sense of order on their own. The word "cat" in position 1 looks the same as "cat" in position 50. So we need to inject positional information explicitly. The original paper does this with sinusoidal encodings — each position gets a vector of sine and cosine values at different frequencies. Specifically, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

What makes this clever is that the encoding at any fixed offset can be expressed as a linear function of the encoding at the current position. So the model can learn to attend by relative position, not just absolute. In practice, many modern Transformers have moved to learned positional embeddings or fancier schemes like Rotary Position Embedding (RoPE), but the sinusoidal approach is still a great starting point and works surprisingly well.

python
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""

    def __init__(self, d_model: int = 512, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float)
            * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)

        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)

Sinusoidal positional encoding — giving the model a sense of order
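The table can also be built directly from the two formulas at a deliberately tiny size (d_model=8, five positions, chosen just for display) so the pattern is visible:

```python
import math

import torch

# Builds the sinusoidal table straight from PE(pos, 2i) and PE(pos, 2i+1).
d_model, max_len = 8, 5
position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                     * (-math.log(10000.0) / d_model))

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Position 0 is sin(0)=0 in even slots and cos(0)=1 in odd slots
print(pe[0])  # tensor([0., 1., 0., 1., 0., 1., 0., 1.])
# Every position gets a distinct vector
print(torch.unique(pe, dim=0).shape[0])  # 5
```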

Feed-Forward Networks and Residual Connections

Each Transformer layer has a position-wise feed-forward network (FFN) that's applied independently at every position. It's two linear transformations with a non-linearity sandwiched in between: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2. That max(0, ·) is just ReLU; modern implementations, including the code below, often swap in a smoother activation like GELU. The inner dimension is 2048 in the original — four times the model dimension. This expand-and-contract pattern lets the model project into a richer space for feature extraction before mapping back down.

Both the attention and feed-forward sub-layers are wrapped in residual connections: the output is LayerNorm(x + SubLayer(x)). Why does this matter? Residual connections let gradients flow straight through during backpropagation, which is essential for training deep networks without vanishing gradients. Layer normalization keeps the activations well-behaved by normalizing across the feature dimension.

You'll often see modern Transformers use Pre-LayerNorm instead — applying normalization before the sub-layer rather than after. This turns out to help a lot with training stability in very deep models. Together, residual connections, layer normalization, and the feed-forward expansion give each layer the ability to progressively refine the representations that attention has built.

python
class TransformerEncoderLayer(nn.Module):
    """Single Transformer encoder layer with pre-layer normalization."""

    def __init__(
        self,
        d_model: int = 512,
        n_heads: int = 8,
        d_ff: int = 2048,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        # Pre-norm self-attention with residual
        normed = self.norm1(x)
        x = x + self.dropout(self.self_attn(normed, normed, normed, mask))

        # Pre-norm feed-forward with residual
        normed = self.norm2(x)
        x = x + self.feed_forward(normed)

        return x


class TransformerEncoder(nn.Module):
    """Stack of Transformer encoder layers."""

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        n_heads: int = 8,
        n_layers: int = 6,
        d_ff: int = 2048,
        max_len: int = 5000,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len, dropout)
        self.layers = nn.ModuleList(
            [TransformerEncoderLayer(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)]
        )
        self.norm = nn.LayerNorm(d_model)
        self.d_model = d_model

    def forward(self, src: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        x = self.embedding(src) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)

        for layer in self.layers:
            x = layer(x, mask)

        return self.norm(x)

Encoder layer and full encoder stack — the complete picture
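To see the same stack run end to end without repeating the classes above, PyTorch's built-in encoder modules (which mirror this structure, including a `norm_first` pre-LayerNorm option) give a quick shape check at the paper's base configuration:

```python
import torch
import torch.nn as nn

# End-to-end shape check with PyTorch's built-in encoder, configured to match
# the base model above: d_model=512, 8 heads, 6 layers, d_ff=2048, pre-norm.
vocab_size, d_model, n_heads, n_layers, d_ff = 10000, 512, 8, 6, 2048

embedding = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=d_ff,
                                   batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

tokens = torch.randint(0, vocab_size, (2, 16))  # (batch=2, seq_len=16)
out = encoder(embedding(tokens))
print(out.shape)  # torch.Size([2, 16, 512])
```

The vocabulary size and sequence length here are arbitrary; the point is that the output keeps the input's (batch, seq_len, d_model) shape, which is what lets layers stack.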

The Full Forward Pass: How It All Fits Together

Let's trace through what actually happens when data flows through a Transformer encoder. First, each token gets converted into a dense embedding vector. Then we add positional encodings to give the model order information. These representations pass through a stack of identical layers — each one runs multi-head self-attention, then a feed-forward network, with residual connections and layer normalization at every step.

For encoder-decoder setups (like the original paper's machine translation model), the decoder adds a cross-attention sub-layer that attends to the encoder's output. It also uses masked self-attention to prevent peeking at future tokens — you need that for autoregressive generation. Modern decoder-only models like GPT use just this masked self-attention variant, and it turns out that's enough to do incredible things.
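That causal mask is simple to build: a lower-triangular matrix, fed through the same `mask == 0` convention the attention code above uses:

```python
import torch
import torch.nn.functional as F

# Causal (look-ahead) mask for seq_len=4: position i may attend only to j <= i.
seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)

# Row 0 can only see itself; no row puts weight on a future position
print(weights[0])                                   # tensor([1., 0., 0., 0.])
print(torch.triu(weights, diagonal=1).abs().sum())  # tensor(0.)
```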

Info

To put the scale in perspective: the original Transformer had 6 encoder layers, 6 decoder layers, 8 attention heads, and a model dimension of 512 — roughly 65 million parameters total. GPT-3 scales that same fundamental architecture to 96 layers, 96 heads, and a model dimension of 12288, hitting 175 billion parameters. Same blueprint, wildly different scale.
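The 65 million figure roughly checks out with a back-of-envelope count. This sketch ignores biases and LayerNorm parameters, and assumes the paper's shared BPE vocabulary of about 37,000 tokens:

```python
# Rough parameter count for the original base Transformer
# (weight matrices only; biases and LayerNorms omitted).
d_model, d_ff, vocab = 512, 2048, 37000

attn = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
ffn = 2 * d_model * d_ff      # W_1, W_2

enc_layer = attn + ffn         # self-attention + FFN
dec_layer = 2 * attn + ffn     # adds cross-attention
embeddings = vocab * d_model   # shared input/output embedding

total = 6 * enc_layer + 6 * dec_layer + embeddings
print(f"{total / 1e6:.1f}M")  # ~63M; biases etc. bring it close to 65M
```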

From BERT to GPT to the LLMs We Use Today

The Transformer spawned two major paradigms. BERT uses only the encoder stack with bidirectional attention, trained with masked language modeling. It's great at understanding tasks — classification, named entity recognition, question answering. GPT takes the opposite approach: decoder-only with causal (left-to-right) attention, trained to predict the next token. It shines at generation, and when you scale it up enough, it starts showing emergent abilities that nobody explicitly trained it for.

What really kicked off the LLM revolution was a key insight: Transformers scale beautifully. Kaplan et al. at OpenAI showed that model performance improves as a predictable power law when you increase model size, dataset size, and compute. That predictability gave researchers the confidence to bet big on training larger and larger models — and it paid off.

Modern innovations have refined the original design without changing what makes it work. RoPE encodes relative position directly into the attention computation for better length generalization. Grouped Query Attention (GQA) shares key-value heads across multiple query heads to cut memory bandwidth during inference. FlashAttention reformulates attention to be IO-aware, dramatically boosting GPU utilization. SwiGLU activations replace ReLU in the feed-forward layers for better performance. But here's what's remarkable: the fundamental building blocks — attention, residual connections, layer normalization, feed-forward networks — haven't changed since 2017.

"The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution." — Vaswani et al., Attention Is All You Need (2017)

What You Should Take Away from This

  • Self-attention computes relevance scores between all positions in a sequence, replacing RNNs' sequential processing with fully parallel computation.
  • Multi-head attention lets the model capture different types of relationships simultaneously — syntax, semantics, position — by running multiple attention operations in parallel.
  • Positional encodings are essential because attention itself has no concept of order. Without them, your model can't tell the difference between "dog bites man" and "man bites dog."
  • Residual connections and layer normalization are what make it possible to stack dozens (or hundreds) of layers without training falling apart.
  • The architecture scales predictably — the same design that works at 65 million parameters extends to hundreds of billions, powering the most capable AI systems we have today.
  • Modern improvements like RoPE, GQA, FlashAttention, and SwiGLU make things faster and better, but the core architecture has been remarkably stable since 2017.

The Transformer is one of those rare innovations that genuinely reshapes an entire field. Once you understand its components — scaled dot-product attention, positional encoding, the residual feed-forward structure — you have the foundation to work with any modern language model. Whether you're fine-tuning BERT for classification, building RAG pipelines, or training models from scratch, this is the single most valuable piece of architectural knowledge in AI today.