Lecture 25

Alignment, explainability, and open directions in LLM research

Phases of Model Training

Why Explainability Matters

Fairness & Sensitive Features

Key Example:

Interpretability Approaches

1. Post-hoc Explanations

LIME (Local Interpretable Model-agnostic Explanations) selects an explanation $\xi(x)$ by fitting an interpretable surrogate model in the neighborhood of the input $x$:

\[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]

where $f$ is the original model, $g$ is an interpretable local model drawn from a class $G$ of candidate explanations (e.g., sparse linear models), $\pi_x$ is a proximity weight function around sample $x$, and $\Omega(g)$ is a complexity penalty that keeps $g$ simple.
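A minimal sketch of this idea in Python, assuming a Gaussian perturbation scheme, an exponential proximity kernel for $\pi_x$, and a Ridge penalty standing in for $\Omega(g)$; this is an illustration of the objective above, not the official lime library.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(f, x, n_samples=500, sigma=0.75, alpha=1.0, rng=None):
    """Fit a weighted linear surrogate g around x, in the spirit of the LIME
    objective: pi_x plays the role of the proximity kernel, and the Ridge
    penalty stands in for the complexity term Omega(g)."""
    rng = np.random.default_rng(rng)
    # Perturb the input with Gaussian noise to sample the neighborhood of x.
    X_pert = x + rng.normal(scale=0.5, size=(n_samples, x.shape[0]))
    y_pert = f(X_pert)                                # black-box predictions
    # Exponential kernel: nearby perturbations get larger weight.
    dists = np.linalg.norm(X_pert - x, axis=1)
    pi_x = np.exp(-(dists ** 2) / (sigma ** 2))
    g = Ridge(alpha=alpha)
    g.fit(X_pert, y_pert, sample_weight=pi_x)
    return g.coef_                                    # local feature attributions

# Example: explain a nonlinear black box around a specific point.
black_box = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2
print(local_surrogate(black_box, np.array([0.0, 1.0])))  # roughly [1.0, 2.0]
```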

The Shapley value attribution for feature $i$ (the basis of SHAP) averages the feature's marginal contribution over all subsets $S$ of the remaining features $N \setminus \{i\}$:

\[\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]\]

Integrated Gradients attributes the prediction of a differentiable model $F$ by accumulating its gradient along the straight-line path from a baseline $x'$ to the input $x$:

\[IG_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\bigl(x' + \alpha(x - x')\bigr)}{\partial x_i}\, d\alpha\]
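A minimal NumPy sketch of both attributions, assuming a toy linear model, a zero baseline, and feature "removal" by reverting to the baseline; it is not the SHAP or Captum implementation. For a linear model both methods reduce to $w_i (x_i - x'_i)$, which makes a convenient sanity check.

```python
import itertools
import math
import numpy as np

w = np.array([1.0, -2.0, 0.5])     # assumed weights of a toy linear model
x = np.array([3.0, 1.0, 2.0])      # input to explain
x_base = np.zeros_like(x)          # baseline x'

def f(mask):
    """Model value when features in `mask` take their input value and all
    other features are held at the baseline (the coalition value f(S))."""
    z = np.where(mask, x, x_base)
    return float(w @ z)

def shapley_values():
    # Exact Shapley values by enumerating every subset S of the other features.
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                mask = np.zeros(n, dtype=bool)
                mask[list(S)] = True
                weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                with_i = mask.copy(); with_i[i] = True
                phi[i] += weight * (f(with_i) - f(mask))
    return phi

def integrated_gradients(steps=50):
    # Riemann-sum approximation of the path integral; the gradient of a
    # linear model is w everywhere, so the result is exact here.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([w for _ in alphas])      # grad of f is w at every point
    return (x - x_base) * grads.mean(axis=0)

print("Shapley:", shapley_values())            # [ 3. -2.  1.]
print("IG:     ", integrated_gradients())      # [ 3. -2.  1.]
```

Both attributions sum to $f(x) - f(x')$, the completeness property that SHAP and Integrated Gradients share.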

2. Transparency by Design

3. Mechanistic Interpretability

Caveats

Summary: Interpretability Methods Comparison

| Method | Type | Scope | Pros | Cons |
|---|---|---|---|---|
| LIME | Post-hoc | Local | Model-agnostic, intuitive | Instability, sampling dependent |
| SHAP | Post-hoc | Local/Global | Theoretically grounded, consistent | Computationally expensive |
| Integrated Gradients | Post-hoc | Local | Axiomatic, gradient-based | Requires differentiable model |

Scaling Laws vs. Interpretability

Information Acquisition

What should we measure?

Modularity

Swappable, testable components

Value Alignment

What did the model learn to optimize?

| Concept | Definition | Key Question | Risk if Violated |
|---|---|---|---|
| Outer Alignment | Objective matches human values | Is the loss function correct? | Optimizing wrong goal |
| Inner Alignment | Model robustly optimizes objective | Does model generalize safely? | Unexpected behavior in new settings |

Inner Alignment in Practice: Jagged Performance

Even if a model performs well on average, it may fail catastrophically on specific sub-tasks. This “jagged performance” is a key sign of poor inner alignment: the model has not truly learned a robust mechanism that generalizes safely.

Examples of jagged performance:

This connects directly to the question of whether the model truly optimizes the intended objective across all conditions.
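One hedged sketch of how this can be measured in practice: report per-slice accuracy alongside the overall average, so that catastrophic sub-task failures are not hidden by aggregate scores. The slice names and outcomes below are made up purely for illustration.

```python
from collections import defaultdict

# Illustrative per-example records: (sub-task slice, correct?) pairs.
results = [
    ("2-digit addition", True), ("2-digit addition", True),
    ("3-digit addition", True), ("3-digit addition", False),
    ("date arithmetic", False), ("date arithmetic", False),
]

by_slice = defaultdict(list)
for slice_name, correct in results:
    by_slice[slice_name].append(correct)

overall = sum(c for _, c in results) / len(results)
per_slice = {s: sum(v) / len(v) for s, v in by_slice.items()}

print(f"average accuracy: {overall:.2f}")           # looks fine in aggregate
for s, acc in sorted(per_slice.items(), key=lambda kv: kv[1]):
    print(f"{s:>18}: {acc:.2f}")                     # worst slices surface first
```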

Risks of Misalignment

Open Challenges & Takeaways