Lecture 22

Unsupervised Training of LLMs

1.1 Unsupervised Training of LLMs (Overview)

Goal

Train large language models (LLMs) using maximum likelihood estimation (MLE) on raw text—no labels, no supervision, no explicit objectives beyond predicting the next token.

Context (From GPT-1 → GPT-4)

LLMs have grown dramatically in model size, dataset size, and training compute.

Modern models also treat non-text inputs such as image patches as tokens, enabling multimodality. Mixture-of-Experts architectures further boost scale. Training now also includes reinforcement learning for alignment after pretraining.

1.2 MLE Training of Language Models

Directed Probabilistic Graphical Model

LLMs use an autoregressive factorization, which can be visualized as a directed graphical model where each token depends on all previous tokens.

MLE Objective

LLMs are trained to maximize the log-likelihood of observed sequences:

\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathbb{E}_{x\sim P_{\text{data}}}[\log P_\theta(x)]\]

They model sequences autoregressively, factorizing the joint probability as:

\[P_\theta(X) = \prod_{i} \prod_{t} P_\theta(X_{i,t} \mid X_{i, < t})\]

where (i) indexes sequences in the dataset and (t) indexes token positions within each sequence.
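Below is a minimal PyTorch-style sketch of how this per-token objective is typically computed in practice. The logits here are random stand-ins for a real Transformer's outputs, and the batch and vocabulary sizes are arbitrary illustrative choices.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: per-token MLE loss for an autoregressive LM.
# The "model output" here is random; in practice a Transformer would
# produce one logit vector per position, conditioned on earlier tokens.
vocab_size = 100
batch, seq_len = 2, 8
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # observed sequences x_{i,1..T}
logits = torch.randn(batch, seq_len, vocab_size)         # stand-in for P_theta(. | x_{<t})

# Predict token t from positions < t: shift logits and targets by one.
pred_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
targets = tokens[:, 1:]           # ground-truth next tokens

# Negative log-likelihood, averaged over all (sequence, position) pairs.
nll = F.cross_entropy(pred_logits.reshape(-1, vocab_size), targets.reshape(-1))
print(f"average per-token NLL: {nll.item():.3f}")  # minimizing this = MLE
```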

MLE → Emergent Understanding

The model directly fits the data distribution without introducing structured latent variables or explicitly controlling entropy. However, a key insight emerges from this simple objective:

“The simplest way to predict the next token is to understand what happened throughout the context.”

For example, to predict the word “is” in “The capital of France ___ Paris.”, the model must track subject-verb agreement, understand the underlying fact structure, and recognize the task domain (geography).

Unlike earlier n-gram language models (e.g., Brants et al., 2007), modern LLMs model relationships between token embeddings via the Transformer architecture, enabling them to capture long-range dependencies at scale.

What MLE Does Well

But MLE Has Hidden Problems

Scale & Emergent Capabilities

2.1 Emergent Capabilities in LLMs

What Happens as We Scale Training?

Figure 1. Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
Figure 2. Smooth improvements in overall loss can lead to sharp “emergent” jumps in task performance. An ability is called emergent if it is not present in smaller models but appears in larger models (Wei et al., 2022).
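As a rough numeric illustration of the power-law form mentioned in Figure 1, the snippet below evaluates a hypothetical curve of the form L(N) = (N_c / N)^α as model size grows. The constants are illustrative placeholders, not a fit to any real training runs.

```python
import numpy as np

# Toy illustration of a power-law scaling curve L(N) = (Nc / N)^alpha,
# the functional form reported by empirical scaling-law studies.
# The constants below are illustrative placeholders only.
Nc, alpha = 8.8e13, 0.076
model_sizes = np.array([1e8, 1e9, 1e10, 1e11, 1e12])  # number of parameters

loss = (Nc / model_sizes) ** alpha
for n, l in zip(model_sizes, loss):
    print(f"N = {n:>8.0e} params  ->  predicted loss {l:.3f}")
# Loss falls smoothly (a straight line on a log-log plot) as N grows.
```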

Examples of Emergence

  1. In-Context Learning: Model learns to perform tasks from examples provided in the prompt (see the prompt sketch after this list).
  2. Chain-of-Thought Reasoning: Model generates intermediate reasoning steps.
  3. Factual structure understanding: To fill in “The capital of France ___ Paris”, the model must:
    • Track subject-verb agreement;
    • Understand the underlying fact structure;
    • Recognize task domain (geography).
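To make the in-context learning example above concrete, here is a hypothetical few-shot prompt. The task, labels, and reviews are made up for illustration; no model weights are updated when the LM completes it.

```python
# A hypothetical few-shot prompt illustrating in-context learning: the task
# (sentiment labeling) is specified purely through examples in the prompt.
prompt = """\
Review: The movie was fantastic. Sentiment: positive
Review: I wasted two hours of my life. Sentiment: negative
Review: The soundtrack alone is worth the ticket. Sentiment:"""

print(prompt)  # a pretrained LM is expected to continue with " positive"
```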

Hypotheses for Emergence

2.2 Challenges in Scaling Unsupervised Training

Data Filtering Issues

At trillion-token scale, filtering toxic, low-quality, or duplicated text becomes extremely difficult.

Filtering stages (example from RefinedWeb): URL filtering, text extraction, language identification, repetition and quality filtering, and fuzzy plus exact deduplication.

Parallel Training

Training trillion-parameter models requires distributing both the data and the model itself across many accelerators.

Types of Parallel Training

  • Data parallelism: each device holds a full model copy and processes a different slice of the batch.
  • Tensor (model) parallelism: individual weight matrices are split across devices.
  • Pipeline parallelism: different layers of the model are placed on different devices.

LLM360 Initiative

Open efforts like LLM360 / TxT360 aim to make complete training pipelines observable and reproducible for education and research.

3. Challenges of MLE-Based Unsupervised Training

3.1 MLE as KL Divergence

Maximum likelihood training of a language model is equivalent to minimizing the KL divergence from the data distribution to the model:

\[\begin{aligned} \hat{\theta}_{\text{MLE}} &= \arg\max_\theta \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_\theta(x)\big] \\ &= \arg\min_\theta \mathrm{KL}\big(P_{\text{data}} \,\|\, P_\theta\big) \end{aligned}\]

because

\[\mathrm{KL}\big(P_{\text{data}} \,\|\, P_\theta\big) = \mathbb{E}_{x \sim P_{\text{data}}}\big[-\log P_\theta(x)\big] + \text{const}\]
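A quick numeric sanity check of this identity, using a toy four-symbol "vocabulary" in place of full sequences (the two distributions are arbitrary illustrative values):

```python
import numpy as np

# Check that KL(P_data || P_theta) equals expected NLL minus the
# (theta-independent) entropy of the data, on a toy 4-symbol alphabet.
p_data  = np.array([0.5, 0.25, 0.15, 0.10])
p_theta = np.array([0.4, 0.30, 0.20, 0.10])

expected_nll = -(p_data * np.log(p_theta)).sum()   # E_{x~P_data}[-log P_theta(x)]
data_entropy = -(p_data * np.log(p_data)).sum()    # H(P_data), constant in theta
kl = (p_data * np.log(p_data / p_theta)).sum()     # KL(P_data || P_theta)

print(np.isclose(kl, expected_nll - data_entropy))  # True
# Since H(P_data) does not depend on theta, minimizing expected NLL
# (i.e., MLE) is the same as minimizing this forward KL.
```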

Implications of this KL direction

Because the expectation is taken under (P_{\text{data}}), the model is heavily penalized whenever it assigns low probability to text that actually occurs. This forward KL is therefore mass-covering: (P_\theta) is pushed to spread probability over every mode of the data, even if that means also assigning probability to implausible sequences.


3.2 What MLE Does Not Provide

The MLE objective only cares about matching the joint distribution of tokens. It does not explicitly encode alignment with human intent, usefulness for specific downstream tasks, or any control over the entropy and calibration of the model's predictions.

So, a pure MLE-trained LM is a very good next-token predictor, but not necessarily aligned with human intents or downstream tasks.


3.3 Entropy and Confidence

For an autoregressive LM at time step (t), the token-level entropy is

\[\begin{aligned} H_t &= -\sum_v P(x_t = v \mid x_{<t}) \log P(x_t = v \mid x_{<t}) \\ &= \mathbb{E}_{v}\big[-\log P(x_t = v \mid x_{<t})\big] \end{aligned}\]

A key design question:

Do we want the model to be low-entropy or high-entropy in its predictions?

Low entropy improves log-likelihood but can cause over-confidence and degenerate text; high entropy improves diversity but can hurt accuracy.
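The sketch below computes the token-level entropy (H_t) from a vector of hypothetical logits, and uses a softmax temperature as a simple stand-in for moving between the low- and high-entropy regimes discussed above.

```python
import torch
import torch.nn.functional as F

# Token-level entropy H_t of a next-token distribution, with a temperature
# knob as an illustrative way to move between peaked and flat distributions.
logits = torch.tensor([4.0, 2.0, 1.0, 0.5, 0.1])  # hypothetical logits over a tiny vocab

for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    entropy = -(probs * probs.log()).sum()          # H_t = -sum_v p(v) log p(v)
    print(f"T = {temperature:>3}: max prob {probs.max().item():.2f}, "
          f"entropy {entropy.item():.3f} nats")
# Lower temperature -> peaked (low-entropy, "confident") distribution;
# higher temperature -> flatter (high-entropy, more diverse) distribution.
```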


3.4 Connection to Variational Inference

Recall the variational inference decomposition:

\[\begin{aligned} \log p(x \mid \theta) &= \mathbb{E}_{z \sim q}\big[\log p(x, z \mid \theta)\big] \\ &\quad + H(q) \\ &\quad + \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x, \theta)\big) \end{aligned}\]

This yields the Evidence Lower Bound (ELBO):

\[\begin{aligned} \log p(x \mid \theta) &\ge \mathbb{E}_{z \sim q}\big[\log p(x, z \mid \theta)\big] \\ &\quad + H(q) \end{aligned}\]

By contrast, in autoregressive MLE we directly maximize (\mathbb{E}_{x \sim P_{\text{data}}}[\log P_\theta(x)]) without an explicit entropy term for (P_\theta(x)). This often drives the model toward lower-entropy distributions.
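As a small sanity check on the decomposition above, the snippet below uses a toy model with a single observed (x) and a three-valued latent (z) (all probabilities are arbitrary illustrative numbers) and verifies that (\log p(x)) equals the ELBO plus the KL gap.

```python
import numpy as np

# Check log p(x) = E_q[log p(x,z)] + H(q) + KL(q || p(z|x)) on a toy model
# with one observed x and a discrete latent z in {0, 1, 2}.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])  # likelihood p(x | z) for the observed x
q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational distribution q(z)

p_xz = p_z * p_x_given_z                 # joint p(x, z)
log_px = np.log(p_xz.sum())              # exact evidence log p(x)
p_z_given_x = p_xz / p_xz.sum()          # exact posterior p(z | x)

elbo = (q * np.log(p_xz)).sum() - (q * np.log(q)).sum()  # E_q[log p(x,z)] + H(q)
kl = (q * np.log(q / p_z_given_x)).sum()                  # KL(q || p(z|x)) >= 0

print(np.isclose(log_px, elbo + kl))  # True: ELBO is a lower bound, gap = KL
```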


3.5 MLE, Entropy, and Degeneration

Formally, the MLE objective is

\[\hat{\theta}_{\text{MLE}} = \arg\max_\theta \mathbb{E}_{x \sim P_{\text{data}}}\big[\log P_\theta(x)\big]\]

Properties:

  • The objective contains no term that rewards entropy in (P_\theta), so nothing prevents the learned conditionals from becoming ever more peaked.
  • Highly peaked (low-entropy) conditionals are exactly what maximizes likelihood on predictable text, but at decoding time they can produce over-confident, repetitive, degenerate output.


3.6 Mitigating Low-Entropy Problems

A common strategy is to modify the objective so we do not purely optimize log-likelihood.

3.6.1 Entropy-regularized MLE

Add an explicit entropy bonus to the training objective:

\[\hat{\theta} = \arg\max_{\theta} \left[ \mathbb{E}_{x \sim P_{\mathrm{data}}} \log P_{\theta}(x) + \lambda \sum_{t} H\!\left(P_{\theta}(x_t \mid x_{<t})\right) \right]\]

where (\lambda > 0) controls how strongly we encourage higher-entropy token distributions.
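A minimal PyTorch sketch of this objective, assuming `logits` are next-token logits and `lam` plays the role of (\lambda); the function name and the toy data are illustrative.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits, targets, lam=0.01):
    """Cross-entropy minus an entropy bonus: -log P(target) - lam * H(P).

    logits: (batch, vocab) next-token logits; targets: (batch,) token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, targets)                         # MLE term (negated)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()  # H(P_theta(x_t | x_<t))
    return nll - lam * entropy  # minimizing this maximizes log-likelihood + lam * H

# Toy usage with random logits and targets.
logits = torch.randn(4, 100)
targets = torch.randint(0, 100, (4,))
print(entropy_regularized_loss(logits, targets).item())
```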

3.6.2 Label Smoothing

Instead of one-hot targets, use smoothed labels for classification:

\[\tilde{y}_v = \begin{cases} 1 - \varepsilon, & \text{if } v \text{ is the target token}, \\ \dfrac{\varepsilon}{V - 1}, & \text{otherwise}, \end{cases}\]

where (V) is the vocabulary size and (\varepsilon \in (0, 1)) is the smoothing parameter.

Effect: discourages the model from assigning probability 1 to any token, implicitly increasing entropy and improving calibration.
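A small sketch of label smoothing in PyTorch. The explicit construction below uses the (\varepsilon / (V-1)) split from the equation above, while the built-in `label_smoothing` argument of `F.cross_entropy` spreads (\varepsilon / V) over all tokens (including the target), so the two losses are close but not identical.

```python
import torch
import torch.nn.functional as F

# Label smoothing: replace the one-hot target with (1 - eps) on the true
# token and eps / (V - 1) on every other token.
V, eps = 10, 0.1
target = torch.tensor([3])
logits = torch.randn(1, V)

smoothed = torch.full((1, V), eps / (V - 1))  # eps spread over the V-1 wrong tokens
smoothed[0, target] = 1.0 - eps               # most mass stays on the true token

loss_manual = -(smoothed * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
loss_builtin = F.cross_entropy(logits, target, label_smoothing=eps)
print(loss_manual.item(), loss_builtin.item())  # close, but PyTorch uses eps/V, not eps/(V-1)
```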

3.6.3 Contrastive and Preference-Based Objectives

Other families of objectives also combat degeneration:

  • Unlikelihood training, which explicitly pushes probability mass away from undesired tokens (e.g., repetitions);
  • Contrastive objectives, which require good continuations to score higher than negative ones;
  • Preference-based fine-tuning (e.g., RLHF or DPO), which optimizes the model toward human-preferred outputs.

These augment or replace pure MLE so that the model is not just a low-entropy next-token predictor, but better aligned with human-desired behavior.

References

  • Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large Language Models in Machine Translation. EMNLP-CoNLL.
  • Wei, J., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.


© 2025 University of Wisconsin — STAT 453 Lecture Notes