Lecture 20

Attention and Transformers

The Attention Mechanism

Motivation: Different parts of the input relate to different parts of the output, and sometimes these important relationships are far apart, as in machine translation. Attention lets the model dynamically compute which parts of the input matter for each output.

Why Attention Was Needed (Long-Range Dependency Problem)

RNNs compress an entire input sequence into a single hidden vector, which makes capturing long-range dependencies difficult.
During backpropagation, gradients must pass through many time steps, causing vanishing/exploding gradients.

Attention solves this by directly referencing the entire input sequence when predicting each output, instead of relying on hidden states to store all information.

Key insight: Attention creates a constant path length between any two positions, unlike the sequential dependency in RNNs.

Hard Attention

Hard attention makes a binary 0/1 decision about where to attend. It asks: “Is this input important to this prediction or not?”

Soft Attention

Rather than a binary decision, we assign a continuous weight between 0 and 1 for each input.

Soft Attention vs. RNN for Image Captioning

RNNs: every output word is generated from the same compressed summary of the input, so distant or fine-grained details are easily lost.

Soft Attention: at each output step $i$, the model scores every input feature $x_j$ against the current feature $x_i$ and normalizes the scores with a softmax:

\[a_{i,j} = \mathrm{softmax}_j(x_i^\top x_j)\]

and the final attention output is:

\[A_i = \sum_j a_{i,j} x_j.\]
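
A small worked example with hypothetical inputs: take $x_1 = (1, 0)$, $x_2 = (0, 1)$, and $x_3 = (1, 1)$. For position $i = 1$, the dot products are $x_1^\top x_1 = 1$, $x_1^\top x_2 = 0$, and $x_1^\top x_3 = 1$, so

\[a_{1,\cdot} = \mathrm{softmax}(1, 0, 1) \approx (0.42,\ 0.16,\ 0.42)\]

and

\[A_1 \approx 0.42\,(1,0) + 0.16\,(0,1) + 0.42\,(1,1) = (0.84,\ 0.58).\]

Inputs more similar to $x_1$ receive larger weights, and the output is a weighted average of the inputs rather than any single one.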

Aside: CNNs can be viewed as a form of hard attention. As a filter slides over the image, the region inside the filter receives attention weight 1 and the rest of the image receives weight 0.

Why Attention Reduces the Need for Recurrence

Because each output can attend directly to every input position, long-range information no longer has to be carried step by step through a recurrent hidden state, which is what makes it possible to drop the recurrence entirely.

Self Attention

Motivation: Can we get rid of the sequential RNN component entirely? Since attention already connects every position in the sequence to every other, do we still need to process the sequence step by step?

Basic Self Attention

Main procedure:

  1. Derive Attention Weights: Compute a similarity score between the current input and every other input in the sequence.
  2. Normalize: Apply a softmax so the weights sum to one.
  3. Compute: Form the attention output as the weighted sum of the inputs, using the normalized weights.

Computing Weights: We use a dot product to compute the attention weights. The dot product behaves like cosine similarity but is sensitive to the vectors' magnitudes. Since the projections will be learned anyway (not in this basic version, but in the next), normalizing the vectors is unnecessary in practice.
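
A minimal sketch of this parameter-free procedure in PyTorch (the function and variable names are illustrative, not from the lecture):

```python
import torch
import torch.nn.functional as F

def basic_self_attention(x):
    """Basic self-attention with no learnable parameters.

    x: tensor of shape (seq_len, d), one embedding per token.
    Returns a tensor of the same shape in which each row is a
    weighted average of all input rows.
    """
    # 1. Derive attention weights: dot-product similarity between every pair of tokens.
    scores = x @ x.T                     # (seq_len, seq_len)
    # 2. Normalize: softmax each row so the weights sum to one.
    weights = F.softmax(scores, dim=-1)  # (seq_len, seq_len)
    # 3. Compute: weighted sum of the inputs.
    return weights @ x                   # (seq_len, d)

x = torch.randn(5, 8)                    # 5 tokens, 8-dimensional embeddings
print(basic_self_attention(x).shape)     # torch.Size([5, 8])
```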

Learnable Self Attention

The basic version has no learnable parameters. To fix this, we add three trainable weight matrices that are multiplied with the input embeddings to produce three vectors per token: Query, Key, and Value.

  1. Query ($Q$): Represents what the token is “asking” of the rest of the sequence.
  2. Key ($K$): Describes how the token “advertises” the information it holds.
  3. Value ($V$): The actual content shared if other tokens pay attention to it.

The Process: For every token, the model compares its Query to the Keys of all tokens in the sequence, producing a similarity score for each.

These scores are normalized to act as weights. Each token builds a new representation by taking a weighted blend of the Value vectors from all tokens. Every position becomes a learned mixture of information pulled from everywhere else, with the mixing proportions determined by how relevant the model thinks each other token is.

Why this works: Because $Q$, $K$, and $V$ are trainable, the model learns its own notion of “relevant context.” Early layers may focus on nearby words; deeper layers may learn to link pronouns to nouns or relate the start/end of sentences. This structure emerges from adjusting those weight matrices to reduce training loss, turning self-attention into a flexible, learned mechanism for combining information across a sequence.
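
A minimal sketch of a single learnable self-attention head in PyTorch, assuming linear $Q$/$K$/$V$ projections and the $1/\sqrt{d}$ scaling from the original Transformer paper (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention with trainable Query, Key, and Value projections."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)  # Query: what the token asks for
        self.W_k = nn.Linear(d_in, d_out, bias=False)  # Key: what the token advertises
        self.W_v = nn.Linear(d_in, d_out, bias=False)  # Value: the content it shares

    def forward(self, x):
        # x: (seq_len, d_in)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # Compare each Query with every Key; scale by sqrt(d) to keep the
        # softmax from saturating when the dimensionality is large.
        scores = q @ k.T / k.shape[-1] ** 0.5     # (seq_len, seq_len)
        weights = F.softmax(scores, dim=-1)
        # Each new representation is a weighted blend of the Value vectors.
        return weights @ v                         # (seq_len, d_out)

attn = SelfAttention(d_in=8, d_out=16)
print(attn(torch.randn(5, 8)).shape)  # torch.Size([5, 16])
```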


The Transformer

The Transformer was originally proposed for machine translation. It consists of two stacks side by side, an encoder and a decoder, each built from $N$ identical layers.

Components

Multi-Head Attention: instead of computing a single attention function, the model runs $H$ attention heads in parallel, each with its own learned projections of $Q$, $K$, and $V$:

\[\text{head}_h = \mathrm{Attention}(Q W_h^Q,\; K W_h^K,\; V W_h^V)\]

The final output concatenates all heads and applies the output projection $W^O$:

\[\text{MHA}(Q,K,V) = \text{Concat}(\text{head}_1,\dots,\text{head}_H)\, W^O.\]
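
A minimal PyTorch sketch of multi-head attention as defined above. It uses one combined projection per input, which is equivalent to applying the $H$ per-head matrices $W_h^Q$, $W_h^K$, $W_h^V$ separately; the batch-first layout and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """MHA(Q, K, V) = Concat(head_1, ..., head_H) W^O with scaled dot-product heads."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # output projection W^O

    def _split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        B, T, _ = x.shape
        return x.view(B, T, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model); in self-attention all three are the same tensor.
        q = self._split_heads(self.W_q(q))
        k = self._split_heads(self.W_k(k))
        v = self._split_heads(self.W_v(v))
        # Scaled dot-product attention within each head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v                              # (batch, heads, seq_q, d_head)
        # Concatenate the heads and apply the output projection.
        B, _, T_q, _ = heads.shape
        out = heads.transpose(1, 2).reshape(B, T_q, -1)  # (batch, seq_q, d_model)
        return self.W_o(out)

mha = MultiHeadAttention(d_model=32, num_heads=4)
x = torch.randn(2, 10, 32)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 32])
```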

Transformer Tricks (NLP)


© 2025 University of Wisconsin — STAT 453 Lecture Notes