Lecture 19
Sequence Learning with RNNs
1. Motivation
Traditional models like Bag-of-Words and CNNs cannot fully capture sequential dependencies in text or time series. RNNs are introduced to handle variable-length sequences and preserve temporal order.
- Bag-of-Words: Ignores order, fixed length.
- HMM: Captures sequential dependency, but limited capacity.
- CNN: Captures local context but fixed-size window.
- RNN: Processes sequences step-by-step, retaining contextual information through hidden states.
2. RNN Architecture
At each time step t:
- Input: $x_t$ (current token/feature)
- Hidden state: $h_t$, updated from the previous state and the current input: $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
- Output: $y_t$, computed from the hidden state, e.g. $y_t = W_{hy} h_t + b_y$
A minimal unrolled implementation of this recurrence is sketched below.
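The following sketch unrolls the recurrence by hand, assuming PyTorch; the dimensions and the weight names (W_xh, W_hh, b_h) are illustrative choices, not taken from the lecture.

```python
import torch

# Minimal sketch of one vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
input_dim, hidden_dim = 8, 16

W_xh = torch.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b_h = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Compute the next hidden state from the current input and the previous state."""
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll over a toy sequence of length 5, starting from a zero hidden state.
h_t = torch.zeros(hidden_dim)
for x_t in torch.randn(5, input_dim):
    h_t = rnn_step(x_t, h_t)

print(h_t.shape)  # torch.Size([16])
```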
Variants of Sequence Tasks
- Many-to-One: Sentiment classification (sequence → label)
- One-to-Many: Image captioning (vector → sequence)
- Many-to-Many: Translation, video tagging (sequence → sequence)
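Assuming PyTorch, the sketch below shows how the output of a single recurrent layer is sliced differently for these variants; the layer sizes and the two linear heads are arbitrary illustration choices.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)      # batch of 4 sequences, 10 steps, 8 features each
outputs, h_n = rnn(x)          # outputs: (4, 10, 16), h_n: (1, 4, 16)

# Many-to-one: use only the final hidden state to predict a single label.
logits_many_to_one = nn.Linear(16, 2)(h_n[-1])    # (4, 2)

# Many-to-many: apply a classifier to every time step's output.
logits_many_to_many = nn.Linear(16, 5)(outputs)   # (4, 10, 5)

print(logits_many_to_one.shape, logits_many_to_many.shape)
```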
3. Training with Backpropagation Through Time (BPTT)
RNNs are trained via BPTT, where gradients are propagated across time steps.
Loss Function
\[L = \sum_t \text{loss}(y_t, \hat{y}_t)\]
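A minimal sketch of this summed loss, assuming PyTorch and a per-step cross-entropy as the loss term (the lecture does not fix a particular loss); the prediction and target tensors are toy values.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

seq_len, num_classes = 10, 5
predictions = torch.randn(seq_len, num_classes)      # y_hat_t for each time step (toy values)
targets = torch.randint(0, num_classes, (seq_len,))  # y_t for each time step (toy values)

# L = sum over time steps of the per-step loss.
L = sum(criterion(predictions[t].unsqueeze(0), targets[t].unsqueeze(0))
        for t in range(seq_len))
print(L.item())
```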
Problems
- Vanishing gradients: Long-term dependencies are lost.
- Exploding gradients: Instability in training.
Remedies
- Gradient clipping
- Truncated BPTT (limit propagation length)
- Use LSTM/GRU architectures.
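The sketch below combines the first two remedies, assuming PyTorch: the long input is split into chunks of 25 steps (truncated BPTT, detaching the hidden state at each boundary) and gradients are rescaled with clip_grad_norm_. The model, data, and chunk length are placeholders for illustration.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 2)
optimizer = torch.optim.SGD(list(model.parameters()) + list(head.parameters()), lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 100, 8)        # one long batch: 4 sequences of 100 steps
y = torch.randint(0, 2, (4,))     # a single label per sequence (toy targets)

h = None
for chunk in x.split(25, dim=1):  # truncated BPTT: backprop through at most 25 steps
    optimizer.zero_grad()
    out, h = model(chunk, h)
    loss = criterion(head(h[-1]), y)
    loss.backward()
    # Gradient clipping: rescale gradients whose total norm exceeds 1.0.
    torch.nn.utils.clip_grad_norm_(list(model.parameters()) + list(head.parameters()), 1.0)
    optimizer.step()
    h = h.detach()                # cut the graph so gradients stop at the chunk boundary
```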
4. LSTM (Long Short-Term Memory)
Designed to handle long-term dependencies by introducing gates and a memory cell.
Components
- Cell state (c_t): Long-term memory highway.
- Forget gate (f_t): Decides what to erase.
- Input gate (i_t): Decides what new information to add.
- Output gate (o_t): Decides what to output.
Key Equations
\[\begin{aligned} f_t &= \sigma(W_f[x_t, h_{t-1}] + b_f) \\ i_t &= \sigma(W_i[x_t, h_{t-1}] + b_i) \\ g_t &= \tanh(W_g[x_t, h_{t-1}] + b_g) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ o_t &= \sigma(W_o[x_t, h_{t-1}] + b_o) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}\]
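A minimal sketch mapping these equations onto code, assuming PyTorch; the weight matrices act on the concatenation $[x_t, h_{t-1}]$ as above, while the dimensions are illustrative.

```python
import torch

input_dim, hidden_dim = 8, 16
concat_dim = input_dim + hidden_dim

W_f, b_f = torch.randn(hidden_dim, concat_dim) * 0.1, torch.zeros(hidden_dim)
W_i, b_i = torch.randn(hidden_dim, concat_dim) * 0.1, torch.zeros(hidden_dim)
W_g, b_g = torch.randn(hidden_dim, concat_dim) * 0.1, torch.zeros(hidden_dim)
W_o, b_o = torch.randn(hidden_dim, concat_dim) * 0.1, torch.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM step: gate the old cell state, add new content, expose part of it."""
    z = torch.cat([x_t, h_prev])            # [x_t, h_{t-1}]
    f_t = torch.sigmoid(W_f @ z + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ z + b_i)      # input gate
    g_t = torch.tanh(W_g @ z + b_g)         # candidate cell content
    c_t = f_t * c_prev + i_t * g_t          # new cell state (long-term memory)
    o_t = torch.sigmoid(W_o @ z + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)             # new hidden state
    return h_t, c_t

h, c = torch.zeros(hidden_dim), torch.zeros(hidden_dim)
for x_t in torch.randn(5, input_dim):
    h, c = lstm_step(x_t, h, c)
```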
5. Many-to-One Word RNNs
Task Definition
- Input: A sequence of words (sentence or text).
- Output: A fixed-size vector used to predict a single class label (e.g., sentiment).
Data Processing Pipeline
- Step 1: Build Vocabulary
  - Construct a word-index dictionary from the training corpus.
  - Include special tokens: <unk> for unknown words, <pad> for sequence padding.
- Step 2: Convert Text to Indices
  - Replace each word with its index from the vocabulary.
  - Pad shorter sequences with <pad> to achieve uniform length.
  - Keep track of the true sequence lengths for batching.
- Step 3: Convert Indices to One-Hot (for illustration)
  - Each word index is represented as a one-hot vector of size |V|.
  - This is only conceptual; real implementations use embeddings instead.
- Step 4: Convert One-Hot to Embeddings
  - Multiply the one-hot vector by an embedding matrix to obtain dense representations.
  - Each row of the embedding matrix corresponds to a word vector.
These steps are illustrated in the code sketch after this list.
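A compact sketch of the full pipeline, assuming PyTorch; the toy corpus, layer sizes, and variable names are illustrative. Note that nn.Embedding performs the one-hot-times-matrix product of Steps 3-4 implicitly as an index lookup.

```python
import torch
import torch.nn as nn

corpus = [["the", "movie", "was", "great"], ["terrible", "film"]]

# Step 1: build the vocabulary, including the special tokens.
vocab = {"<pad>": 0, "<unk>": 1}
for sentence in corpus:
    for word in sentence:
        vocab.setdefault(word, len(vocab))

# Step 2: convert words to indices, pad to a common length, remember true lengths.
max_len = max(len(s) for s in corpus)
indices = [[vocab.get(w, vocab["<unk>"]) for w in s] + [vocab["<pad>"]] * (max_len - len(s))
           for s in corpus]
lengths = [len(s) for s in corpus]  # would feed nn.utils.rnn.pack_padded_sequence to skip padding
batch = torch.tensor(indices)       # shape: (2, max_len)

# Steps 3-4: the embedding layer replaces the explicit one-hot x embedding-matrix product.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16, padding_idx=vocab["<pad>"])
rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
classifier = nn.Linear(32, 2)       # e.g. positive vs. negative sentiment

embedded = embedding(batch)          # (2, max_len, 16)
_, (h_n, _) = rnn(embedded)          # h_n: (1, 2, 32), the final hidden state per sequence
logits = classifier(h_n[-1])         # (2, 2): one label prediction per input sequence
```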
6. Summary
| Concept | Description |
|---------|-------------|
| RNN | Models sequence dependency via hidden states |
| BPTT | Backpropagation through time; may cause vanishing/exploding gradients |
| LSTM | Uses gates to manage information flow |
| Gradient Clipping | Prevents gradient explosion |
| Truncated BPTT | Limits gradient flow length |
| Application | NLP, speech, translation, time series prediction |