Lecture 19

Sequence Learning with RNNs

1. Motivation

Traditional models such as Bag-of-Words discard word order entirely, and CNNs capture only local context within a fixed window; neither fully models long-range sequential dependencies in text or time series. RNNs are introduced to handle variable-length sequences while preserving temporal order.


2. RNN Architecture

At each time step t, the hidden state is updated from the current input and the previous hidden state (weights are shared across all time steps):

\[\begin{aligned} h_t &= \tanh(W_h[x_t, h_{t-1}] + b_h) \\ \hat{y}_t &= W_y h_t + b_y \end{aligned}\]
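As a concrete sketch, the recurrence above can be unrolled in a few lines (a minimal illustration; the tensor sizes and names here are toy choices, not from the lecture):

```python
import torch

def rnn_step(x_t, h_prev, W_h, b_h):
    """One recurrence step: h_t = tanh(W_h [x_t, h_{t-1}] + b_h)."""
    concat = torch.cat([x_t, h_prev], dim=-1)   # [x_t, h_{t-1}]
    return torch.tanh(concat @ W_h.T + b_h)     # next hidden state h_t

# Unroll over a toy sequence: batch 1, input dim 4, hidden dim 3.
torch.manual_seed(0)
W_h = torch.randn(3, 4 + 3)        # maps concat(input, hidden) -> hidden
b_h = torch.zeros(3)
h = torch.zeros(1, 3)              # initial hidden state h_0
for x_t in torch.randn(5, 1, 4):   # 5 time steps, same shared weights
    h = rnn_step(x_t, h, W_h, b_h)
```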

Variants of Sequence Tasks

  • One-to-many: a single input produces a sequence (e.g., image captioning).
  • Many-to-one: a sequence produces a single output (e.g., sentiment classification; see Section 5).
  • Many-to-many: a sequence maps to another sequence (e.g., machine translation, part-of-speech tagging).


3. Training with Backpropagation Through Time (BPTT)

RNNs are trained via BPTT: the recurrence is unrolled into a feedforward graph over all time steps, and gradients are propagated backward through every step and into the shared weights.

Loss Function

\(L = \sum_t \text{loss}(y_t, \hat{y}_t)\)

Problems

  • Vanishing gradients: repeated multiplication by small Jacobians shrinks gradients over long spans, making long-range dependencies hard to learn.
  • Exploding gradients: repeated multiplication by large Jacobians can make gradients blow up and destabilize training.

Remedies

  • Gradient clipping: rescale gradients whose norm exceeds a threshold, preventing explosion (see the sketch after this list).
  • Truncated BPTT: limit how many time steps gradients flow back through.
  • Gated architectures such as the LSTM (Section 4), which mitigate vanishing gradients.
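A minimal sketch of gradient clipping during a BPTT update (the model and loss here are placeholders for illustration):

```python
import torch

model = torch.nn.RNN(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 5, 4)        # (batch, time, features)
out, _ = model(x)
loss = out.pow(2).mean()        # placeholder loss, just for the demo

loss.backward()                 # BPTT: gradients flow back through time
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# an optimizer step would follow here
```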


4. LSTM (Long Short-Term Memory)

Designed to handle long-term dependencies by introducing gates and a memory cell.

Components

  • Forget gate \(f_t\): decides what to discard from the previous cell state.
  • Input gate \(i_t\): decides how much of the candidate \(g_t\) to write.
  • Candidate \(g_t\): new content proposed for the cell state.
  • Cell state \(c_t\): long-term memory, updated additively rather than by repeated squashing.
  • Output gate \(o_t\): decides how much of the cell state to expose as the hidden state \(h_t\).

Key Equations

\[\begin{aligned} f_t &= \sigma(W_f[x_t, h_{t-1}] + b_f) \\ i_t &= \sigma(W_i[x_t, h_{t-1}] + b_i) \\ g_t &= \tanh(W_g[x_t, h_{t-1}] + b_g) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ o_t &= \sigma(W_o[x_t, h_{t-1}] + b_o) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}\]
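The equations translate directly into code. Below is a sketch of a single LSTM step (not torch.nn.LSTM itself; the fused weight matrix W and the sizes are illustrative):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps concat(x_t, h_{t-1}) to 4*hidden units."""
    z = torch.cat([x_t, h_prev], dim=-1) @ W.T + b
    f, i, g, o = z.chunk(4, dim=-1)        # one slice per gate
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)                      # candidate values
    c = f * c_prev + i * g                 # additive cell-state update
    h = o * torch.tanh(c)                  # gated exposure of the cell
    return h, c

# Toy usage: input dim 4, hidden dim 3.
torch.manual_seed(0)
W = torch.randn(4 * 3, 4 + 3)
b = torch.zeros(4 * 3)
h = c = torch.zeros(1, 3)
for x_t in torch.randn(5, 1, 4):
    h, c = lstm_step(x_t, h, c, W, b)
```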

5. Many-to-One Word RNNs

Task Definition

Read an entire word sequence and predict a single label for it, e.g., sentiment classification of a sentence. Only the final hidden state is fed to the output layer; a minimal model sketch follows.
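A minimal many-to-one classifier as a sketch (the layer sizes and class count are illustrative assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

class ManyToOneRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, time) integer word indices
        emb = self.embed(token_ids)         # (batch, time, embed_dim)
        _, h_last = self.rnn(emb)           # final hidden state: (1, batch, hidden)
        return self.out(h_last.squeeze(0))  # one logit vector per sequence
```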

Data Processing Pipeline

  1. Step 1 — Build Vocabulary
    • Construct a word-index dictionary from the training corpus.
    • Include special tokens: <unk> for unknown words, <pad> for sequence padding.
  2. Step 2 — Convert Text to Indices
    • Replace each word with its index from the vocabulary.
    • Pad shorter sequences with <pad> to achieve uniform length.
    • Keep track of the true sequence lengths for batching.
  3. Step 3 — Convert Indices to One-Hot (for illustration)
    • Each word index is represented as a one-hot vector of size |V|.
    • This is only conceptual; real implementations use embeddings instead.
  4. Step 4 — Convert One-Hot to Embeddings
    • Multiply the one-hot vector by an embedding matrix to obtain dense representations.
    • Each row of the embedding matrix corresponds to a word vector. (An end-to-end sketch of Steps 1–4 follows this list.)
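The four steps above, end to end, in a short sketch (the corpus, tokenizer, and embedding size are toy placeholders):

```python
import torch
import torch.nn as nn

corpus = ["the movie was great", "the plot was thin"]

# Step 1: build a word-index vocabulary with the special tokens.
vocab = {"<pad>": 0, "<unk>": 1}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

# Step 2: words -> indices, pad to uniform length, keep true lengths.
def encode(sentence, max_len):
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]
    length = len(ids)
    ids += [vocab["<pad>"]] * (max_len - length)
    return ids, length

max_len = max(len(s.split()) for s in corpus)
encoded = [encode(s, max_len) for s in corpus]
batch = torch.tensor([ids for ids, _ in encoded])   # (batch, time)
lengths = torch.tensor([n for _, n in encoded])     # true lengths

# Step 3: one-hot vectors of size |V| (conceptual only).
one_hot = nn.functional.one_hot(batch, num_classes=len(vocab)).float()

# Step 4: one-hot times the embedding matrix == an embedding lookup.
embedding = nn.Embedding(len(vocab), 8, padding_idx=vocab["<pad>"])
dense_via_matmul = one_hot @ embedding.weight   # explicit product
dense_via_lookup = embedding(batch)             # what real code does
assert torch.allclose(dense_via_matmul, dense_via_lookup)
```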

6. Summary

| Concept | Description |
|---------|-------------|
| RNN | Models sequence dependency via hidden states |
| BPTT | Backpropagation through time; may cause vanishing/exploding gradients |
| LSTM | Uses gates to manage information flow |
| Gradient Clipping | Prevents gradient explosion |
| Truncated BPTT | Limits gradient flow length |
| Application | NLP, speech, translation, time series prediction |