Lecture 22
Unsupervised Training of LLMs
1.1 Unsupervised Training of LLMs (Overview)
Goal
Train large language models (LLMs) using maximum likelihood estimation (MLE) on raw text—no labels, no supervision, no explicit objectives beyond predicting the next token.
Context (From GPT-1 → GPT-4)
LLMs have grown dramatically in:
- Scale: ~117M parameters (GPT-1) → reportedly over 1T parameters
- Context length: 512 → 128k tokens
- Depth & width: 12 → 96+ layers; 12 → 96+ heads
- Embedding dimension: 768 → over 12k
- Vocab: 40k → 50k+
- Training data: 5GB BookCorpus → private 13T tokens (~50 TB)
Modern models often tokenize image patches alongside text, enabling multimodality. Mixture-of-Experts architectures further boost scale, and pretraining is now typically followed by reinforcement learning for alignment.
1.2 MLE Training of Language Models
Directed Probabilistic Graphical Model
LLMs use an autoregressive factorization, which can be visualized as a directed graphical model where each token depends on all previous tokens.
MLE Objective
LLMs are trained to maximize the log-likelihood of observed sequences:
\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathbb{E}_{x\sim P_{\text{data}}}[\log P_\theta(x)]\]They model sequences autoregressively, factorizing the joint probability as:
\[P_\theta(X) = \prod_{i} \prod_{t} P_\theta(X_{i,t} \mid X_{i, < t})\]where (i) indexes sequences in the dataset and (t) indexes token positions within each sequence.
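To make the objective concrete, here is a minimal PyTorch-style sketch of the per-batch MLE loss: the standard next-token cross-entropy over shifted targets. The `model` is a hypothetical autoregressive LM returning per-position logits; this is an illustrative sketch, not a production training loop.

```python
import torch
import torch.nn.functional as F

def next_token_nll(model, input_ids):
    """Average negative log-likelihood of a batch of token sequences.

    input_ids: LongTensor of shape (batch, seq_len).
    model(input_ids) is assumed to return logits of shape
    (batch, seq_len, vocab_size), one distribution per position.
    """
    logits = model(input_ids)                 # (B, T, V)
    # Predict token t+1 from the prefix x_{<=t}: shift targets left by one.
    shift_logits = logits[:, :-1, :]          # predictions for positions 1..T-1
    shift_targets = input_ids[:, 1:]          # ground-truth next tokens
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
    # Minimizing this cross-entropy maximizes sum_t log P_theta(x_t | x_{<t}).
    return loss
```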
MLE → Emergent Understanding
The model directly fits the data distribution without introducing structured latent variables or explicitly controlling entropy. However, a key insight emerges from this simple objective:
“The simplest way to predict the next token is to understand what happened throughout the context.”
For example, to predict the word “is” in “The capital of France ___ Paris.”, the model must:
- Resolve subject-verb agreement (singular subject requires singular verb)
- Recognize a factual structure (this is a statement about geography)
- Know the domain (geography, world capitals)
Unlike earlier n-gram language models (e.g., Brants et al., 2007), modern LLMs model relationships between token embeddings via the Transformer architecture, enabling them to capture long-range dependencies at scale.
What MLE Does Well
- Directly models empirical sequences
- Efficient and scalable
- Allows training on extremely large corpora
But MLE Has Hidden Problems
- Implicit low-entropy bias → overconfident predictions
- Mode collapse / degeneration / memorization
- Neural text degeneration (Holtzman et al., 2020), e.g.:
  - repeated phrases (“The man said the man said…”)
  - generic completions
- Lack of diversity unless sampling is carefully tuned (see the sampling sketch below)
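Decoding strategy therefore matters as much as the training objective. Below is a minimal sketch of nucleus (top-p) sampling in the spirit of Holtzman et al. (2020), which truncates the low-probability tail of the next-token distribution before sampling; `probs` and the cutoff `p` are illustrative placeholders.

```python
import torch

def nucleus_sample(probs, p=0.9):
    """Sample a token from the smallest set of tokens whose cumulative
    probability exceeds p (nucleus / top-p sampling).

    probs: 1-D tensor over the vocabulary, summing to 1.
    """
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    # Keep tokens whose preceding cumulative mass is below p
    # (the top token is always kept).
    keep = cumulative - sorted_probs < p
    keep[0] = True
    nucleus_probs = sorted_probs * keep
    nucleus_probs = nucleus_probs / nucleus_probs.sum()   # renormalize
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_idx[choice].item()
```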
2. Scale & Emergent Capabilities
2.1 Emergent Capabilities in LLMs
What Happens as We Scale Training?
As parameters, data, and compute grow, qualitatively new capabilities appear that smaller models do not exhibit; these are called emergent capabilities.
Examples of Emergence
- In-Context Learning : Model learns to perform tasks from examples provided in the prompt.
- Chain-of-Thought Reasoning : Model generates intermediate reasoning steps.
- Factual structure understanding: To fill in “The capital of France ___ Paris”, model must:
- Track subject-verb agreement;
- Understand the underlying fact structure;
- Recognize task domain (geography).
Hypotheses for Emergence
- Scale increases representational capacity
- MLE forces models to capture global dependencies
- Large training corpora contain implicit demonstrations of reasoning
- Transformers act as meta-learners over massive data distributions
2.2 Challenges in Scaling Unsupervised Training
Data Filtering Issues
At trillion-token scale, filtering toxic, low-quality, or duplicated text becomes extremely difficult.
Filtering stages (example from RefinedWeb):
- URL filtering
- Text extraction
- Language identification
- Repetition removal
- Document-wise filtering
- Line-wise corrections
- Fuzzy deduplication
- Exact deduplication
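As a toy illustration of the final stage, here is a minimal exact-deduplication sketch that hashes whitespace-normalized documents and keeps only the first occurrence. Real pipelines such as RefinedWeb additionally rely on fuzzy (e.g. MinHash-based) deduplication, which is not shown here.

```python
import hashlib

def exact_dedup(documents):
    """Drop byte-identical documents (after whitespace normalization)
    by keeping only the first occurrence of each content hash."""
    seen = set()
    unique_docs = []
    for doc in documents:
        normalized = " ".join(doc.split())          # collapse whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs
```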
Parallel Training
Training trillion-parameter models requires:
- Expert parallelism
- Distributed optimization
- Fault tolerance
- Training frameworks such as DeepSpeed / Megatron-LM
Types of Parallel Training
- Data Parallel: each worker holds a full replica of the parameters; gradients are averaged across workers before each update.
- Distributed Data Parallel: multi-process variant in which gradients are all-reduced across workers, keeping parameter updates synchronized across ranks (see the sketch below).
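A minimal sketch of the distributed data parallel pattern with PyTorch's `DistributedDataParallel`; the model, dataloader, and hyperparameters are placeholders, and it assumes the process group is launched externally (e.g. via `torchrun`) with one GPU per rank.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_ddp(model, dataloader, lr=1e-4):
    # Each rank holds a full replica of the parameters; during backward(),
    # gradients are all-reduced (averaged) across ranks, so every replica
    # applies the same update and the ranks stay synchronized.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=lr)

    for input_ids in dataloader:                    # (batch, seq_len) token ids
        input_ids = input_ids.to(local_rank)
        logits = ddp_model(input_ids)[:, :-1, :]    # next-token predictions
        targets = input_ids[:, 1:]
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                             # gradient all-reduce happens here
        optimizer.step()
```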
LLM360 Initiative
Open efforts like LLM360 / TxT360 aim to make complete training pipelines observable and reproducible for education and research.
3. Challenges of MLE-Based Unsupervised Training
3.1 MLE as KL Divergence
Maximum likelihood training of a language model is equivalent to minimizing the KL divergence from the data distribution to the model:
\[\begin{aligned} \hat{\theta}_{\text{MLE}} &= \arg\max_\theta \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_\theta(x)\big] \\ &= \arg\min_\theta \mathrm{KL}\big(P_{\text{data}} \,\|\, P_\theta\big) \end{aligned}\]because
\[\mathrm{KL}\big(P_{\text{data}} \,\|\, P_\theta\big) = \mathbb{E}_{x \sim P_{\text{data}}}\big[-\log P_\theta(x)\big] + \text{const}\]where the constant is the negative entropy of (P_{\text{data}}), which does not depend on (\theta).
Implications of this KL direction
- The model must place probability mass on all observed sequences, even if some are incoherent or low-quality.
- This can lead to:
  - Repetition and genericity, e.g. outputs like “The man said the man said…”.
  - Poor calibration on out-of-distribution prompts.
  - Memorization of rare patterns in the data.
  - Strong dependence on the sampling / decoding strategy.
- Entropy and confidence therefore play an important role in the model's behavior.
3.2 What MLE Does Not Provide
The MLE objective only cares about matching the joint distribution of tokens. It does not explicitly encode:
- Semantics (meaning, truthfulness, usefulness).
- Task goals (what we want the model to accomplish).
- Rewards or preferences (which outputs are better for users).
So, a pure MLE-trained LM is a very good next-token predictor, but not necessarily aligned with human intents or downstream tasks.
3.3 Entropy and Confidence
For an autoregressive LM at time step (t), the token-level entropy is
\[\begin{aligned} H_t &= -\sum_v P(x_t = v \mid x_{<t}) \log P(x_t = v \mid x_{<t}) \\ &= \mathbb{E}_{v}\big[-\log P(x_t = v \mid x_{<t})\big] \end{aligned}\]
- Low entropy ⇒ the distribution is very peaked ⇒ high confidence.
- High entropy ⇒ the distribution is more spread out ⇒ lower confidence / more diversity.
A key design question:
Do we want the model to be low-entropy or high-entropy in its predictions?
Low entropy improves log-likelihood but can cause over-confidence and degenerate text; high entropy improves diversity but can hurt accuracy.
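As an illustration, here is a minimal sketch of computing the per-position entropy (H_t) from a model's logits; the logits are assumed to have shape (batch, seq_len, vocab_size).

```python
import torch.nn.functional as F

def token_entropy(logits):
    """Per-position entropy of the predicted next-token distribution.

    logits: tensor of shape (batch, seq_len, vocab_size).
    Returns a (batch, seq_len) tensor: low values indicate peaked,
    confident predictions; high values indicate spread-out, diverse ones.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)
```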
3.4 Connection to Variational Inference
Recall the variational inference decomposition:
\[\begin{aligned} \log p(x \mid \theta) &= \mathbb{E}_{z \sim q}\big[\log p(x, z \mid \theta)\big] \\ &\quad + H(q) \\ &\quad + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x, \theta)\big) \end{aligned}\]This yields the Evidence Lower Bound (ELBO):
\[\begin{aligned} \log p(x \mid \theta) &\ge \mathbb{E}_{z \sim q}\big[\log p(x, z \mid \theta)\big] \\ &\quad + H(q) \end{aligned}\]
- Maximizing the ELBO explicitly rewards higher entropy (H(q)) of the approximate posterior (q(z \mid x)).
By contrast, in autoregressive MLE we directly maximize (\mathbb{E}_{x \sim P_{\text{data}}}[\log P_\theta(x)]) without an explicit entropy term for (P_\theta(x)). This often drives the model toward lower-entropy distributions.
3.5 MLE, Entropy, and Degeneration
Formally, the MLE objective is
\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_\theta(x)\big]\]Properties:
- It optimizes the data likelihood only, without explicitly controlling the entropy of (P_\theta(x)).
- In practice, this frequently pushes (P_\theta(x)) toward low-entropy, highly peaked distributions.
- Consequences:
- Mode collapse / degeneration: the model focuses on a few high-probability patterns.
- Memorization: reproduces specific training sequences.
- With deterministic decoding (e.g. large-beam beam search), this can produce extremely repetitive, unnatural text (“neural text degeneration”).
3.6 Mitigating Low-Entropy Problems
A common strategy is to modify the objective so we do not purely optimize log-likelihood.
3.6.1 Entropy-regularized MLE
Add an explicit entropy bonus to the training objective:
\[\hat{\theta} = \arg\max_{\theta} \left[ \mathbb{E}_{x \sim P_{\mathrm{data}}} \log P_{\theta}(x) + \lambda \sum_{t} H\!\left(P_{\theta}(x_t \mid x_{<t})\right) \right]\]where (\lambda > 0) controls how strongly we encourage higher-entropy token distributions.
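A minimal sketch of this entropy-regularized objective as a training loss (negative log-likelihood minus (\lambda) times the mean token entropy); `model`, `input_ids`, and the value of `lam` are illustrative placeholders.

```python
import torch.nn.functional as F

def entropy_regularized_loss(model, input_ids, lam=0.01):
    """Minimizing this loss maximizes E[log P_theta(x)] + lam * sum_t H_t
    (up to averaging constants): cross-entropy minus an entropy bonus."""
    logits = model(input_ids)[:, :-1, :]            # predictions for tokens 1..T-1
    targets = input_ids[:, 1:]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1))
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()  # mean token entropy
    return nll - lam * entropy                      # bonus pushes toward higher entropy
```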
3.6.2 Label Smoothing
Instead of one-hot targets, use smoothed labels for classification:
\[\tilde{y}_\varepsilon = \begin{cases} 1 - \varepsilon, & y = 1, \\ \dfrac{\varepsilon}{V - 1}, & y = 0. \end{cases}\]
- (V): vocabulary size
- (\varepsilon): small smoothing constant (e.g. 0.1)
Effect: discourages the model from assigning probability 1 to any token, implicitly increasing entropy and improving calibration.
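A minimal sketch of the smoothed cross-entropy matching the formula above (the true token receives 1 - (\varepsilon), every other token (\varepsilon/(V-1))). Note that PyTorch's built-in `label_smoothing` argument to `F.cross_entropy` implements a slight variant that spreads (\varepsilon) over all V classes.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, targets, epsilon=0.1):
    """Cross-entropy against smoothed targets: the true token receives
    probability 1 - epsilon, every other token epsilon / (V - 1).

    logits: (N, V) unnormalized scores; targets: (N,) token indices.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, epsilon / (vocab_size - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - epsilon)  # put 1 - eps on target
    return -(smooth * log_probs).sum(dim=-1).mean()
```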
3.6.3 Contrastive and Preference-Based Objectives
Other families of objectives also combat degeneration:
- Contrastive losses (e.g. noise-contrastive, InfoNCE): compare good vs bad sequences to shape the distribution.
- Scheduled sampling: expose the model to its own predictions during training to reduce exposure bias.
- Risk minimization / utility / preference-based losses: optimize directly for downstream task performance or human preferences (e.g. RLHF, ranking losses).
These augment or replace pure MLE so that the model is not just a low-entropy next-token predictor, but better aligned with human-desired behavior.
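As one concrete instance of a preference-based loss, here is a minimal sketch of a pairwise (Bradley-Terry style) ranking objective of the kind used in reward modeling for RLHF; the scores would come from, e.g., a reward model or sequence log-probabilities, and the names are illustrative.

```python
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred, score_rejected):
    """-log sigmoid(s_pref - s_rej), averaged over the batch.

    Pushes the model to assign higher scores to human-preferred sequences
    than to rejected ones, shaping behavior beyond pure likelihood.
    score_preferred, score_rejected: tensors of shape (batch,).
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```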
References
- Radford et al. (2018) — Improving Language Understanding by Generative Pre-Training (GPT-1)
- Brants et al. (2007) — Large Language Models in Machine Translation
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models
- Understanding LLMs (2024) — Overview of the LLM training-to-inference pipeline
- LLM360 / TxT360