Lecture 25

Alignment, explainability, and open directions in LLM research

Phases of Model Training

Why Explainability Matters

Fairness & Sensitive Features

Key Example:

Interpretability Approaches

1. Post-hoc Explanations

LIME (Local Interpretable Model-agnostic Explanations) selects an explanation $\xi(x)$ by fitting an interpretable surrogate model in the neighborhood of the input $x$:

\[\xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\]

where $f$ is the original model, $g$ is an interpretable local model drawn from a class $G$ of candidate explanations (e.g., sparse linear models), $\pi_x$ is a proximity weight function around sample $x$, and $\Omega(g)$ is a complexity penalty that keeps $g$ simple.
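A minimal sketch of this idea in Python, assuming a Gaussian perturbation scheme, an exponential proximity kernel for $\pi_x$, and a Ridge penalty standing in for $\Omega(g)$; this is an illustration of the objective above, not the official lime library.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(f, x, n_samples=500, sigma=0.75, alpha=1.0, rng=None):
    """Fit a weighted linear surrogate g around x, in the spirit of the LIME
    objective: pi_x plays the role of the proximity kernel, and the Ridge
    penalty stands in for the complexity term Omega(g)."""
    rng = np.random.default_rng(rng)
    # Perturb the input with Gaussian noise to sample the neighborhood of x.
    X_pert = x + rng.normal(scale=0.5, size=(n_samples, x.shape[0]))
    y_pert = f(X_pert)                                # black-box predictions
    # Exponential kernel: nearby perturbations get larger weight.
    dists = np.linalg.norm(X_pert - x, axis=1)
    pi_x = np.exp(-(dists ** 2) / (sigma ** 2))
    g = Ridge(alpha=alpha)
    g.fit(X_pert, y_pert, sample_weight=pi_x)
    return g.coef_                                    # local feature attributions

# Example: explain a nonlinear black box around a specific point.
black_box = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2
print(local_surrogate(black_box, np.array([0.0, 1.0])))  # roughly [1.0, 2.0]
```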

The Shapley value attribution for feature $i$ (the basis of SHAP) averages the feature's marginal contribution over all subsets $S$ of the remaining features $N \setminus \{i\}$:

\[\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]\]

Integrated Gradients attributes the prediction of a differentiable model $F$ by accumulating its gradient along the straight-line path from a baseline $x'$ to the input $x$:

\[IG_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\bigl(x' + \alpha(x - x')\bigr)}{\partial x_i}\, d\alpha\]
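A minimal NumPy sketch of both attributions, assuming a toy linear model, a zero baseline, and feature "removal" by reverting to the baseline; it is not the SHAP or Captum implementation. For a linear model both methods reduce to $w_i (x_i - x'_i)$, which makes a convenient sanity check.

```python
import itertools
import math
import numpy as np

w = np.array([1.0, -2.0, 0.5])     # assumed weights of a toy linear model
x = np.array([3.0, 1.0, 2.0])      # input to explain
x_base = np.zeros_like(x)          # baseline x'

def f(mask):
    """Model value when features in `mask` take their input value and all
    other features are held at the baseline (the coalition value f(S))."""
    z = np.where(mask, x, x_base)
    return float(w @ z)

def shapley_values():
    # Exact Shapley values by enumerating every subset S of the other features.
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                mask = np.zeros(n, dtype=bool)
                mask[list(S)] = True
                weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                with_i = mask.copy(); with_i[i] = True
                phi[i] += weight * (f(with_i) - f(mask))
    return phi

def integrated_gradients(steps=50):
    # Riemann-sum approximation of the path integral; the gradient of a
    # linear model is w everywhere, so the result is exact here.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([w for _ in alphas])      # grad of f is w at every point
    return (x - x_base) * grads.mean(axis=0)

print("Shapley:", shapley_values())            # [ 3. -2.  1.]
print("IG:     ", integrated_gradients())      # [ 3. -2.  1.]
```

Both attributions sum to $f(x) - f(x')$, the completeness property that SHAP and Integrated Gradients share.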

2. Transparency by Design

3. Mechanistic Interpretability

Caveats

Summary: Interpretability Methods Comparison

| Method | Type | Scope | Pros | Cons |
|---|---|---|---|---|
| LIME | Post-hoc | Local | Model-agnostic, intuitive | Instability, sampling dependent |
| SHAP | Post-hoc | Local/Global | Theoretically grounded, consistent | Computationally expensive |
| Integrated Gradients | Post-hoc | Local | Axiomatic, gradient-based | Requires differentiable model |

Scaling Laws vs. Interpretability

Information Acquisition

What should we measure?

Modularity

Swappable, testable components

Value Alignment

What did the model learn to optimize?

| Concept | Definition | Key Question | Risk if Violated |
|---|---|---|---|
| Outer Alignment | Objective matches human values | Is the loss function correct? | Optimizing wrong goal |
| Inner Alignment | Model robustly optimizes objective | Does model generalize safely? | Unexpected behavior in new settings |

Inner Alignment in Practice: Jagged Performance

Even if a model performs well on average, it may fail catastrophically on specific sub-tasks. This “jagged performance” is a key sign of poor inner alignment: the model has not truly learned a robust mechanism that generalizes safely.

Examples of jagged performance:

This connects directly to the question of whether the model truly optimizes the intended objective across all conditions.
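One hedged sketch of how this can be measured in practice: report per-slice accuracy alongside the overall average, so that catastrophic sub-task failures are not hidden by aggregate scores. The slice names and outcomes below are made up purely for illustration.

```python
from collections import defaultdict

# Illustrative per-example records: (sub-task slice, correct?) pairs.
results = [
    ("2-digit addition", True), ("2-digit addition", True),
    ("3-digit addition", True), ("3-digit addition", False),
    ("date arithmetic", False), ("date arithmetic", False),
]

by_slice = defaultdict(list)
for slice_name, correct in results:
    by_slice[slice_name].append(correct)

overall = sum(c for _, c in results) / len(results)
per_slice = {s: sum(v) / len(v) for s, v in by_slice.items()}

print(f"average accuracy: {overall:.2f}")           # looks fine in aggregate
for s, acc in sorted(per_slice.items(), key=lambda kv: kv[1]):
    print(f"{s:>18}: {acc:.2f}")                     # worst slices surface first
```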

Risks of Misalignment

Open Challenges & Takeaways