Lecture 22
Unsupervised Training of LLMs
1.1 Unsupervised Training of LLMs (Overview)
Goal
Train large language models (LLMs) using maximum likelihood estimation (MLE) on raw text—no labels, no supervision, no explicit objectives beyond predicting the next token.
Context (From GPT-1 → GPT-4)
LLMs have grown dramatically in:
- Scale: ~117M parameters (GPT-1) → reportedly over 1T parameters
- Context length: 512 → 128k tokens
- Depth & width: 12 → 96+ layers; 12 → 96+ heads
- Embedding dimension: 768 → over 12k
- Vocab: 40k → 50k+
- Training data: 5GB BookCorpus → private 13T tokens (~50 TB)
Modern models often tokenize image patches alongside text, enabling multimodality. Mixture-of-Experts architectures further boost scale, and pretraining is now typically followed by reinforcement learning for alignment.
1.2 MLE Training of Language Models
Directed Probabilistic Graphical Model
LLMs use an autoregressive factorization, which can be visualized as a directed graphical model where each token depends on all previous tokens.
MLE Objective
LLMs are trained to maximize the log-likelihood of observed sequences:
\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathbb{E}_{x\sim P_{\text{data}}}[\log P_\theta(x)]\]They model sequences autoregressively, factorizing the joint probability as:
\[P_\theta(X) = \prod_{i} \prod_{t} P_\theta(X_{i,t} \mid X_{i, < t})\]where (i) indexes sequences in the dataset and (t) indexes token positions within each sequence.
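To make the objective concrete, here is a minimal PyTorch-style sketch of the per-batch MLE loss: the standard next-token cross-entropy over shifted targets. The `model` is a hypothetical autoregressive LM returning per-position logits; this is an illustrative sketch, not a production training loop.

```python
import torch
import torch.nn.functional as F

def next_token_nll(model, input_ids):
    """Average negative log-likelihood of a batch of token sequences.

    input_ids: LongTensor of shape (batch, seq_len).
    model(input_ids) is assumed to return logits of shape
    (batch, seq_len, vocab_size), one distribution per position.
    """
    logits = model(input_ids)                 # (B, T, V)
    # Predict token t+1 from the prefix x_{<=t}: shift targets left by one.
    shift_logits = logits[:, :-1, :]          # predictions for positions 1..T-1
    shift_targets = input_ids[:, 1:]          # ground-truth next tokens
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
    # Minimizing this cross-entropy maximizes sum_t log P_theta(x_t | x_{<t}).
    return loss
```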
MLE → Emergent Understanding
The model directly fits the data distribution without introducing structured latent variables or explicitly controlling entropy. However, a key insight emerges from this simple objective:
“The simplest way to predict the next token is to understand what happened throughout the context.”
For example, to predict the word “is” in “The capital of France ___ Paris.”, the model must:
- Resolve subject-verb agreement (singular subject requires singular verb)
- Recognize a factual structure (this is a statement about geography)
- Know the domain (geography, world capitals)
Unlike earlier n-gram language models (e.g., Brants et al., 2007), modern LLMs model relationships between token embeddings via the Transformer architecture, enabling them to capture long-range dependencies at scale.
What MLE Does Well
- Directly models empirical sequences
- Efficient and scalable
- Allows training on extremely large corpora
But MLE Has Hidden Problems
- Implicit low-entropy bias → overconfident predictions
- Mode collapse / degeneration / memorization
- Neural text degeneration (Holtzman et al., 2020), e.g.:
  - repeated phrases (“The man said the man said…”)
  - generic completions
- Lack of diversity unless sampling is carefully tuned (see the sampling sketch below)
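Decoding strategy therefore matters as much as the training objective. Below is a minimal sketch of nucleus (top-p) sampling in the spirit of Holtzman et al. (2020), which truncates the low-probability tail of the next-token distribution before sampling; `probs` and the cutoff `p` are illustrative placeholders.

```python
import torch

def nucleus_sample(probs, p=0.9):
    """Sample a token from the smallest set of tokens whose cumulative
    probability exceeds p (nucleus / top-p sampling).

    probs: 1-D tensor over the vocabulary, summing to 1.
    """
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    # Keep tokens whose preceding cumulative mass is below p
    # (the top token is always kept).
    keep = cumulative - sorted_probs < p
    keep[0] = True
    nucleus_probs = sorted_probs * keep
    nucleus_probs = nucleus_probs / nucleus_probs.sum()   # renormalize
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_idx[choice].item()
```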
2. Scale & Emergent Capabilities
2.1 Emergent Capabilities in LLMs
What Happens as We Scale Training?
As parameters, data, and compute grow, qualitatively new capabilities appear that smaller models do not exhibit; these are called emergent capabilities.
Examples of Emergence
- In-Context Learning : Model learns to perform tasks from examples provided in the prompt.
- Chain-of-Thought Reasoning : Model generates intermediate reasoning steps.
- Factual structure understanding: To fill in “The capital of France ___ Paris”, model must:
- Track subject-verb agreement;
- Understand the underlying fact structure;
- Recognize task domain (geography).
Hypotheses for Emergence
- Scale increases representational capacity
- MLE forces models to capture global dependencies
- Large training corpora contain implicit demonstrations of reasoning
- Transformers act as meta-learners over massive data distributions
2.2 Challenges in Scaling Unsupervised Training
Data Filtering Issues
At trillion-token scale, filtering toxic, low-quality, or duplicated text becomes extremely difficult.
Filtering stages (example from RefinedWeb):
- URL filtering
- Text extraction
- Language identification
- Repetition removal
- Document-wise filtering
- Line-wise corrections
- Fuzzy deduplication
- Exact deduplication
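As a toy illustration of the final stage, here is a minimal exact-deduplication sketch that hashes whitespace-normalized documents and keeps only the first occurrence. Real pipelines such as RefinedWeb additionally rely on fuzzy (e.g. MinHash-based) deduplication, which is not shown here.

```python
import hashlib

def exact_dedup(documents):
    """Drop byte-identical documents (after whitespace normalization)
    by keeping only the first occurrence of each content hash."""
    seen = set()
    unique_docs = []
    for doc in documents:
        normalized = " ".join(doc.split())          # collapse whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs
```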
Parallel Training
Training trillion-parameter models requires:
- Expert parallelism
- Distributed optimization
- Fault tolerance
- Training frameworks such as DeepSpeed / Megatron-LM
Types of Parallel Training
- Data Parallel: each worker holds a full replica of the parameters; gradients are averaged across workers before each update.
- Distributed Data Parallel: multi-process variant in which gradients are all-reduced across workers, keeping parameter updates synchronized across ranks (see the sketch below).
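A minimal sketch of the distributed data parallel pattern with PyTorch's `DistributedDataParallel`; the model, dataloader, and hyperparameters are placeholders, and it assumes the process group is launched externally (e.g. via `torchrun`) with one GPU per rank.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_ddp(model, dataloader, lr=1e-4):
    # Each rank holds a full replica of the parameters; during backward(),
    # gradients are all-reduced (averaged) across ranks, so every replica
    # applies the same update and the ranks stay synchronized.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=lr)

    for input_ids in dataloader:                    # (batch, seq_len) token ids
        input_ids = input_ids.to(local_rank)
        logits = ddp_model(input_ids)[:, :-1, :]    # next-token predictions
        targets = input_ids[:, 1:]
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                             # gradient all-reduce happens here
        optimizer.step()
```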
LLM360 Initiative
Open efforts like LLM360 / TxT360 aim to make complete training pipelines observable and reproducible for education and research.
3. Challenges of MLE-Based Unsupervised Training
3.1 MLE as KL Divergence
Maximum likelihood training of a language model is equivalent to minimizing the KL divergence from the data distribution to the model:
\[\begin{aligned} \hat{\theta}_{\text{MLE}} &= \arg\max_\theta \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_\theta(x)\big] \\ &= \arg\min_\theta \mathrm{KL}\big(P_{\text{data}} \,\|\, P_\theta\big) \end{aligned}\]because
\[\mathrm{KL}\big(P_{\text{data}} \,\|\, P_\theta\big) = \mathbb{E}_{x \sim P_{\text{data}}}\big[-\log P_\theta(x)\big] + \text{const}\]where the constant is the negative entropy of (P_{\text{data}}), which does not depend on (\theta).
Implications of this KL direction
- The model must place probability mass on all observed sequences, even if some are incoherent or low-quality.
- This can lead to:
  - Repetition and genericity, e.g. outputs like “The man said the man said…”.
  - Poor calibration on out-of-distribution prompts.
  - Memorization of rare patterns in the data.
  - Strong dependence on the sampling / decoding strategy.
- Entropy and confidence therefore play an important role in the model's behavior.
3.2 What MLE Does Not Provide
The MLE objective only cares about matching the joint distribution of tokens. It does not explicitly encode:
- Semantics (meaning, truthfulness, usefulness).
- Task goals (what we want the model to accomplish).
- Rewards or preferences (which outputs are better for users).
So, a pure MLE-trained LM is a very good next-token predictor, but not necessarily aligned with human intents or downstream tasks.
3.3 Entropy and Confidence
For an autoregressive LM at time step (t), the token-level entropy is
\[\begin{aligned} H_t &= -\sum_v P(x_t = v \mid x_{<t}) \log P(x_t = v \mid x_{<t}) \\ &= \mathbb{E}_{v}\big[-\log P(x_t = v \mid x_{<t})\big] \end{aligned}\]
- Low entropy ⇒ the distribution is very peaked ⇒ high confidence.
- High entropy ⇒ the distribution is more spread out ⇒ lower confidence / more diversity.
A key design question:
Do we want the model to be low-entropy or high-entropy in its predictions?
Low entropy improves log-likelihood but can cause over-confidence and degenerate text; high entropy improves diversity but can hurt accuracy.
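As an illustration, here is a minimal sketch of computing the per-position entropy (H_t) from a model's logits; the logits are assumed to have shape (batch, seq_len, vocab_size).

```python
import torch.nn.functional as F

def token_entropy(logits):
    """Per-position entropy of the predicted next-token distribution.

    logits: tensor of shape (batch, seq_len, vocab_size).
    Returns a (batch, seq_len) tensor: low values indicate peaked,
    confident predictions; high values indicate spread-out, diverse ones.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)
```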
3.4 Connection to Variational Inference
Recall the variational inference decomposition:
\[\begin{aligned} \log p(x \mid \theta) &= \mathbb{E}_{z \sim q}\big[\log p(x, z \mid \theta)\big] \\ &\quad + H(q) \\ &\quad + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x, \theta)\big) \end{aligned}\]This yields the Evidence Lower Bound (ELBO):
\[\begin{aligned} \log p(x \mid \theta) &\ge \mathbb{E}_{z \sim q}\big[\log p(x, z \mid \theta)\big] \\ &\quad + H(q) \end{aligned}\]
- Maximizing the ELBO explicitly rewards higher entropy (H(q)) of the approximate posterior (q(z \mid x)).
By contrast, in autoregressive MLE we directly maximize (\mathbb{E}_{x \sim P_{\text{data}}}[\log P_\theta(x)]) without an explicit entropy term for (P_\theta(x)). This often drives the model toward lower-entropy distributions.
3.5 MLE, Entropy, and Degeneration
Formally, the MLE objective is
\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_\theta(x)\big]\]Properties:
- It optimizes the data likelihood only, without explicitly controlling the entropy of (P_\theta(x)).
- In practice, this frequently pushes (P_\theta(x)) toward low-entropy, highly peaked distributions.
- Consequences:
- Mode collapse / degeneration: the model focuses on a few high-probability patterns.
- Memorization: reproduces specific training sequences.
- With deterministic decoding (e.g. large-beam beam search), this can produce extremely repetitive, unnatural text (“neural text degeneration”).
3.6 Mitigating Low-Entropy Problems
A common strategy is to modify the objective so we do not purely optimize log-likelihood.
3.6.1 Entropy-regularized MLE
Add an explicit entropy bonus to the training objective:
\[\hat{\theta} = \arg\max_{\theta} \left[ \mathbb{E}_{x \sim P_{\mathrm{data}}} \log P_{\theta}(x) + \lambda \sum_{t} H\!\left(P_{\theta}(x_t \mid x_{<t})\right) \right]\]where (\lambda > 0) controls how strongly we encourage higher-entropy token distributions.
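A minimal sketch of this entropy-regularized objective as a training loss (negative log-likelihood minus (\lambda) times the mean token entropy); `model`, `input_ids`, and the value of `lam` are illustrative placeholders.

```python
import torch.nn.functional as F

def entropy_regularized_loss(model, input_ids, lam=0.01):
    """Minimizing this loss maximizes E[log P_theta(x)] + lam * sum_t H_t
    (up to averaging constants): cross-entropy minus an entropy bonus."""
    logits = model(input_ids)[:, :-1, :]            # predictions for tokens 1..T-1
    targets = input_ids[:, 1:]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1))
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()  # mean token entropy
    return nll - lam * entropy                      # bonus pushes toward higher entropy
```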
3.6.2 Label Smoothing
Instead of one-hot targets, use smoothed labels for classification:
\[\tilde{y}_\varepsilon = \begin{cases} 1 - \varepsilon, & y = 1, \\ \dfrac{\varepsilon}{V - 1}, & y = 0. \end{cases}\]
- (V): vocabulary size
- (\varepsilon): small smoothing constant (e.g. 0.1)
Effect: discourages the model from assigning probability 1 to any token, implicitly increasing entropy and improving calibration.
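A minimal sketch of the smoothed cross-entropy matching the formula above (the true token receives 1 - (\varepsilon), every other token (\varepsilon/(V-1))). Note that PyTorch's built-in `label_smoothing` argument to `F.cross_entropy` implements a slight variant that spreads (\varepsilon) over all V classes.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, targets, epsilon=0.1):
    """Cross-entropy against smoothed targets: the true token receives
    probability 1 - epsilon, every other token epsilon / (V - 1).

    logits: (N, V) unnormalized scores; targets: (N,) token indices.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, epsilon / (vocab_size - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - epsilon)  # put 1 - eps on target
    return -(smooth * log_probs).sum(dim=-1).mean()
```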
3.6.3 Contrastive and Preference-Based Objectives
Other families of objectives also combat degeneration:
- Contrastive losses (e.g. noise-contrastive, InfoNCE): compare good vs bad sequences to shape the distribution.
- Scheduled sampling: expose the model to its own predictions during training to reduce exposure bias.
- Risk minimization / utility / preference-based losses: optimize directly for downstream task performance or human preferences (e.g. RLHF, ranking losses).
These augment or replace pure MLE so that the model is not just a low-entropy next-token predictor, but better aligned with human-desired behavior.
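As one concrete instance of a preference-based loss, here is a minimal sketch of a pairwise (Bradley-Terry style) ranking objective of the kind used in reward modeling for RLHF; the scores would come from, e.g., a reward model or sequence log-probabilities, and the names are illustrative.

```python
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred, score_rejected):
    """-log sigmoid(s_pref - s_rej), averaged over the batch.

    Pushes the model to assign higher scores to human-preferred sequences
    than to rejected ones, shaping behavior beyond pure likelihood.
    score_preferred, score_rejected: tensors of shape (batch,).
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```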
References
- Radford et al. (2018) — Improving Language Understanding by Generative Pre-Training (GPT-1)
- Brants et al. (2007) — Large Language Models in Machine Translation
- Holtzman et al. (2020) — The Curious Case of Neural Text Degeneration
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models
- Understanding LLMs (2024) — Overview of the LLM training-to-inference pipeline
- LLM360 / TxT360