Lecture 19
Sequence Learning with RNNs
Motivation: Why Text Modeling Is Challenging
Modeling text sequences is difficult due to:
- Variable sequence length
- Long-range dependencies
- Need for generative models
- Large data requirements
Historical Approaches
Bag-of-Words (BOW)
Represents text as unordered word counts.
Problem: Word order is lost
Example: “NOT good” and “good NOT” become identical vectors.
Hidden Markov Models (HMMs)
Use the Markov assumption:
\[P(Y_n \mid X_1,\ldots,X_n) = P(Y_n \mid X_n)\]
Problem: Only captures one-step dependencies.
Convolutional Neural Networks (CNNs)
Capture local patterns using filters.
Limitation: Require fixed-length input; cannot naturally handle long sequences.
Recurrent Neural Networks
RNNs address sequential dependence using recurrent connections.

The Recurrence Equation
\[h^{(t)} = \sigma(W_{hx}x^{(t)} + W_{hh}h^{(t-1)} + b_h)\]
where:
- $h^{(t)}$: Hidden state
- $x^{(t)}$: Input at time $t$
- $W_{hx}, W_{hh}$: Input and recurrent weights
- $\sigma$: typically `tanh`
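A minimal sketch of this update in PyTorch; the dimensions and random weights below are purely illustrative, not from the lecture:

```python
import torch

# One recurrence step: h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
hidden_size, input_size = 4, 3
W_hx = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    return torch.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)

x_t = torch.randn(input_size)
h_prev = torch.zeros(hidden_size)
h_t = rnn_step(x_t, h_prev)   # new hidden state, shape (hidden_size,)
```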
Unrolling Through Time
Unrolling converts an RNN into a deep network with depth equal to sequence length.
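As a sketch, unrolling is simply a loop that re-applies the same cell (same weights) at every time step; here using PyTorch's `nn.RNNCell` with illustrative sizes:

```python
import torch
import torch.nn as nn

# Unrolling: the depth of the computation graph equals the sequence length T.
batch, T, input_size, hidden_size = 2, 5, 3, 4
cell = nn.RNNCell(input_size, hidden_size, nonlinearity="tanh")

x = torch.randn(batch, T, input_size)   # a batch of sequences
h = torch.zeros(batch, hidden_size)     # initial hidden state h^{(0)}

hidden_states = []
for t in range(T):                      # one "layer" per time step
    h = cell(x[:, t, :], h)             # same weights reused at every step
    hidden_states.append(h)
```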
RNN Architectures

- Many-to-One: sentiment classification
- One-to-Many: image captioning
- Many-to-Many:
  - synchronous: video captioning
  - delayed: machine translation
Training RNNs: Backpropagation Through Time (BPTT)
Training RNNs involves unrolling through time and applying the chain rule.
Gradient Expression
\[\frac{\partial L^{(t)}}{\partial W_{hh}} = \frac{\partial L^{(t)}}{\partial y^{(t)}} \frac{\partial y^{(t)}}{\partial h^{(t)}} \Biggl( \sum_{k=1}^t \frac{\partial h^{(t)}}{\partial h^{(k)}} \frac{\partial h^{(k)}}{\partial W_{hh}} \Biggr)\]
Key term:
\[\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}}\]
A product of many Jacobians → unstable.
Vanishing and Exploding Gradients
- If the Jacobian norms are consistently < 1 → vanishing gradients
- If the Jacobian norms are consistently > 1 → exploding gradients
Consequences:
- Loss of long-range memory
- Training instability
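A toy numeric illustration of why the product matters, treating each Jacobian factor as a scalar (the values are made up for illustration):

```python
# Backpropagating over many steps multiplies many per-step factors together.
def product_of_factors(factor, steps):
    return factor ** steps

print(product_of_factors(0.9, 50))   # ~0.005 -> gradient vanishes
print(product_of_factors(1.1, 50))   # ~117   -> gradient explodes
```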
Remedies
- Gradient clipping
- Truncated BPTT
- LSTM / GRU architectures
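A rough sketch of the first two remedies in a PyTorch training loop; the model, dummy loss, and hyperparameters are placeholders, not from the lecture:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=3, hidden_size=4, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(2, 100, 3)                      # (batch, seq_len, features)

# Truncated BPTT: split the long sequence into chunks and detach the hidden
# state between chunks so gradients only flow within each chunk.
h = None
for chunk in torch.split(x, 20, dim=1):
    out, h = model(chunk, h)
    h = h.detach()                              # cut the graph here
    loss = out.pow(2).mean()                    # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: rescale gradients whose global norm exceeds 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```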
Long Short-Term Memory (LSTM)
LSTMs solve gradient issues by introducing a cell state $c_t$ that acts as a memory highway.

LSTM Gate Equations
\[\begin{aligned} f_t &= \sigma(W_f[x_t, h_{t-1}] + b_f) \\ i_t &= \sigma(W_i[x_t, h_{t-1}] + b_i) \\ g_t &= \tanh(W_g[x_t, h_{t-1}] + b_g) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ o_t &= \sigma(W_o[x_t, h_{t-1}] + b_o) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}\]
Interpretation
- Forget gate: remove irrelevant memory
- Input gate + candidate: add new information
- Output gate: expose memory to next layer
- Cell state: preserves long-term gradients
The forget gate controls which information is remembered and which is forgotten. It can reset the cell state:
\[f_{t} = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_{f})\]
The sigmoid activation has a range of $(0, 1)$. The forget gate output $f_{t}$ is multiplied element-wise with the previous cell state; where $f_{t}$ is close to 0, the corresponding entries of the cell state are “forgotten.”
The input gate controls what new information is added to the cell state:
\[i_{t} = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_{i})\]
The input node (candidate)
\[g_{t} = \tanh(W_{gx} x_t + W_{gh} h_{t-1} + b_{g})\]
is multiplied element-wise with the input gate, and the product is then added element-wise to the cell state.
The output gate determines how the hidden state is updated:
\[o_{t} = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_{o})\]
The new hidden state is the element-wise product of $\tanh(c_t)$ and $o_{t}$.
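The gate equations can be transcribed almost line-for-line into code. The sketch below uses random, untrained weights and illustrative sizes:

```python
import torch

input_size, hidden_size = 3, 4

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = torch.cat([x_t, h_prev])                 # [x_t, h_{t-1}]
    f = torch.sigmoid(W["f"] @ z + b["f"])       # forget gate
    i = torch.sigmoid(W["i"] @ z + b["i"])       # input gate
    g = torch.tanh(W["g"] @ z + b["g"])          # input node / candidate
    c = f * c_prev + i * g                       # cell state update
    o = torch.sigmoid(W["o"] @ z + b["o"])       # output gate
    h = o * torch.tanh(c)                        # new hidden state
    return h, c

W = {k: torch.randn(hidden_size, input_size + hidden_size) for k in "figo"}
b = {k: torch.zeros(hidden_size) for k in "figo"}
h, c = lstm_step(torch.randn(input_size), torch.zeros(hidden_size),
                 torch.zeros(hidden_size), W, b)
```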
Why LSTMs help with vanishing gradients
A key reason LSTMs work better than vanilla RNNs is how the cell state $c_t$ is updated. Recall the cell update:
\[c_t = f_t \odot c_{t-1} + i_t \odot g_t.\]
If we look at the gradient of the loss $L$ with respect to an earlier cell state $c_{t-k}$, the dominant term is a product of forget gates:
\[\frac{\partial L}{\partial c_{t-k}} \approx \frac{\partial L}{\partial c_t} \prod_{j=t-k+1}^{t} f_j.\]
- When the forget gates $f_j$ are close to 1, the product stays close to 1 and gradients can flow back over many time steps.
- When the model decides some information is no longer needed, it can set $f_j$ closer to 0, actively “forgetting” that part of the state.
This is very different from a vanilla RNN, where the hidden state is repeatedly multiplied by a weight matrix and passed through a nonlinearity. In that case, the Jacobian often has eigenvalues much smaller (or larger) than 1, leading to vanishing or exploding gradients.
The LSTM cell therefore provides an almost linear highway for gradients along the $c_t$ path, with the gates controlling when to preserve information and when to reset it.
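A toy comparison of the two gradient paths over 50 steps (the factor values are made up for illustration):

```python
# The LSTM cell-state path multiplies forget-gate values, which the model can
# keep near 1; a vanilla RNN multiplies Jacobian factors that are often well
# below 1.
k = 50
lstm_cell_path = 0.97 ** k       # ~0.22: gradient signal largely preserved
vanilla_rnn_path = 0.50 ** k     # ~9e-16: gradient signal effectively gone
print(lstm_cell_path, vanilla_rnn_path)
```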
Preparing Text for RNNs
Step 1 — Build Vocabulary
Include special tokens: `<unk>`, `<pad>`, `<bos>`, `<eos>`.
Step 2 — Convert Text to Indices
Pad sequences to equal length for batching.
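A minimal sketch of steps 1–2 in plain Python; the tiny corpus and token ordering are made-up examples:

```python
corpus = [["the", "movie", "was", "good"], ["not", "good"]]

# Step 1: build the vocabulary, starting with the special tokens.
specials = ["<pad>", "<unk>", "<bos>", "<eos>"]
vocab = {tok: idx for idx, tok in enumerate(specials)}
for sentence in corpus:
    for word in sentence:
        vocab.setdefault(word, len(vocab))

# Step 2: convert text to indices and pad to a common length.
def encode(sentence, max_len):
    ids = [vocab["<bos>"]] + [vocab.get(w, vocab["<unk>"]) for w in sentence]
    ids.append(vocab["<eos>"])
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return ids

max_len = max(len(s) for s in corpus) + 2       # room for <bos>/<eos>
batch = [encode(s, max_len) for s in corpus]
```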
Step 3 — One-Hot Encoding (conceptual)
Not used in modern practice due to high dimensionality.
Step 4 — Word Embeddings
Learned dense vectors:
- Low dimension (e.g., 300-D)
- Similar words → similar embeddings
- Implemented via `nn.Embedding`
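A minimal sketch of an embedding lookup for a padded batch; the vocabulary size, dimensionality, and indices are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, pad_idx = 1000, 300, 0
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)

token_ids = torch.tensor([[2, 5, 7, 0, 0],   # (batch, seq_len), 0 = <pad>
                          [3, 9, 4, 6, 0]])
vectors = embedding(token_ids)               # (batch, seq_len, embed_dim)
```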
RNNs for Generative Modeling
RNNs model the autoregressive distribution:
$P_{\theta}(X) = \prod_{t} P_{\theta}(X_t \mid X_{<t})$
Training objective:
$\max_{\theta} \sum_i \sum_t \log P_{\theta}(X_{i,t} \mid X_{i,<t})$
Used in classic language modeling.
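A sketch of this objective as next-token prediction with a shifted target sequence; all sizes and module choices below are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 1000, 64, 128
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 20))   # (batch, seq_len)
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens <= t

hidden, _ = rnn(embed(inputs))                   # (batch, seq_len-1, hidden_size)
logits = head(hidden)                            # per-step distribution over the vocab
# Average negative log-likelihood of the next token, i.e. -log P(X_t | X_<t).
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
```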
Many-to-One Word RNN Example
- Build vocabulary
- Convert text → indices
- Convert indices → embeddings
- Feed sequence into RNN
- Use final hidden state for classification
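A compact sketch of this pipeline, assuming illustrative hyperparameters and padding index 0:

```python
import torch
import torch.nn as nn

class WordRNNClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=300, hidden_size=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        _, h_final = self.rnn(embedded)           # h_final: (1, batch, hidden_size)
        return self.classifier(h_final[-1])       # logits from the final hidden state

model = WordRNNClassifier()
logits = model(torch.randint(0, 1000, (4, 12)))   # 4 padded sequences of length 12
```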
Summary
| Concept | Description |
|---|---|
| RNN | Maintains hidden state to model sequential data |
| BPTT | Unrolls through time; gradient instability |
| LSTM | Gates + memory cell enable long-term learning |
| Gradient Clipping | Prevents gradient explosion |
| Truncated BPTT | Limits backpropagation depth |
| Embeddings | Dense word representations |
| Applications | NLP, speech, translation, time series |