Lecture 25

Alignment, explainability, and open research directions in modern machine learning, with a focus on large language models and system-level reliability.

Key Takeaways

Logistics


Learning Goals

By the end of this lecture, you should be able to:


The LLM Training and Usage Pipeline

Modern Large Language Models (LLMs) progress through distinct stages, from broad pattern learning to task-specific adaptation:

  1. Random Model: The initialized architecture before training.
  2. Pre-Training: Self-supervised training (e.g., next-token prediction) on massive datasets (e.g., Common Crawl) to learn general patterns.
  3. Fine-Tuning: Alignment using in-domain data via Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF).
  4. In-Context Learning: At inference time, prompts and examples guide behavior without updating weights.

Key Observation:
The same trained model can behave very differently depending on context. Pre-training and fine-tuning update the model’s parameters, while in-context learning changes only how the model is used at inference time; all three shape its behavior (see the sketch below).
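
As a concrete illustration of the final stage, here is a minimal sketch of few-shot in-context learning, using Hugging Face transformers with GPT-2 purely as a stand-in model (the lecture does not prescribe a specific library): the prompt supplies worked examples, and no weights are updated.

```python
# Minimal sketch of in-context learning: behavior is steered by the prompt
# alone, with no gradient updates to the pretrained weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in pretrained LM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()                                        # inference only

# Few-shot prompt: worked examples followed by a new query.
prompt = (
    "Review: The movie was wonderful. Sentiment: positive\n"
    "Review: I hated every minute. Sentiment: negative\n"
    "Review: A delightful surprise. Sentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```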


Why Explainability Matters

Models trained on large-scale data are rarely interpretable to humans on their own. Explainability is critical for:

Common Confusions


Fairness and Sensitive Features

Removing sensitive attributes like race or gender from the training data does not ensure invariance, because correlated proxy features (e.g., zip code) can still encode them.

Strategies for Invariance:

  1. Remove the feature: Often insufficient due to correlations.
  2. Train then clean: Train with all features, then remove learned components post-hoc.
  3. Test-time blinding: Drop the feature only during inference.
  4. Modified loss functions: Penalize prediction dependence on sensitive attributes (see the sketch below).
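
As one possible instantiation of strategy 4, the PyTorch sketch below adds a covariance penalty between predictions and a binary sensitive attribute. The specific penalty is an illustrative assumption, not a method taken from the lecture.

```python
import torch
import torch.nn.functional as F

def fairness_penalized_loss(logits, targets, sensitive, lam=1.0):
    """Binary cross-entropy plus a penalty on the covariance between the
    predicted probabilities and a binary sensitive attribute. Driving the
    covariance toward zero discourages predictions from tracking the
    sensitive feature (one simple way to realize strategy 4)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    cov = ((probs - probs.mean()) * (sensitive - sensitive.mean())).mean()
    return bce + lam * cov.abs()

# Toy usage with random data (illustrative only).
logits = torch.randn(16, requires_grad=True)          # model outputs
targets = torch.randint(0, 2, (16,)).float()          # task labels
sensitive = torch.randint(0, 2, (16,)).float()        # sensitive attribute
loss = fairness_penalized_loss(logits, targets, sensitive)
loss.backward()  # gradients now include the fairness penalty
```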

The History of Interpretability

Interpretability Categories

| Type | Core Idea | Main Limitation |
|------|-----------|-----------------|
| Post-hoc | Explain predictions after training | Often lacks fidelity |
| Transparent | Interpretable by design | Limited flexibility |
| Mechanistic | Reverse-engineer internals | Hard to scale |

2016: Setting the Stage

2017–2020: Fragmentation

| Methodology | Examples | Description |
|-------------|----------|-------------|
| Post-hoc | LIME, SHAP, Integrated Gradients | Industry standard; explain after training |
| Transparency | GAMs, Monotonic Nets | Niche; common in healthcare/tabular data |
| Mechanistic | Circuits, probing | Technically deep, rarely user-facing |

Cracks in Post-Hoc Explanations


Interpretability Approaches

1. Post-hoc Explanations

These methods attempt to explain a black-box model after it has been trained. While tools like LIME, SHAP, and Integrated Gradients became industry standards, they face significant criticism regarding their fidelity.
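
To make the post-hoc idea concrete, here is a simplified from-scratch sketch of Integrated Gradients for a differentiable PyTorch model (illustrative only; production tooling such as Captum or SHAP provides more careful implementations).

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate Integrated Gradients: average the gradient of the model's
    output along a straight path from a baseline to the input, then scale by
    (input - baseline). Returns a per-feature attribution tensor."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    # Interpolation points between the baseline and the input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)      # shape: (steps, *x.shape)
    path.requires_grad_(True)
    outputs = model(path).sum()
    grads = torch.autograd.grad(outputs, path)[0]  # gradient at each point
    avg_grad = grads.mean(dim=0)
    return (x - baseline) * avg_grad

# Toy usage: attribute a tiny linear model's prediction to its 4 input features.
model = torch.nn.Linear(4, 1)
x = torch.randn(4)
print(integrated_gradients(model, x))
```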

Cracks in Post-Hoc Explanations:

2. Transparency by Design

Instead of explaining a black box, use inherently interpretable models.
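
As a minimal sketch of this idea, the following fits a GAM-style model in scikit-learn: each feature enters through its own spline basis followed by an additive linear model, so each feature’s shape function can be inspected directly. The data and hyperparameters are illustrative assumptions, not taken from the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

# Toy data: y depends nonlinearly on x0 and linearly on x1.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(500)

# GAM-style model: per-feature spline basis + additive linear model.
gam = make_pipeline(
    SplineTransformer(n_knots=8, degree=3),  # expands each feature separately
    Ridge(alpha=1e-3),
)
gam.fit(X, y)

# Because the model is additive, each feature's learned shape function can be
# read off directly by varying one feature while holding the other fixed.
grid = np.linspace(-3, 3, 5)
probe = np.column_stack([grid, np.zeros_like(grid)])
print(gam.predict(probe))  # shape function for feature 0 (up to a constant)
```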

3. Mechanistic Interpretability
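
Mechanistic interpretability tries to reverse-engineer the model’s internal computations (e.g., circuits, probing of hidden states). As a minimal probing sketch, assuming hidden activations have already been extracted from a trained model, a linear probe tests whether a concept is linearly decodable from them (the data below is random and purely illustrative).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assume `activations` is an (n_examples, hidden_dim) array of hidden states
# extracted from a trained model, and `concept_labels` marks whether each
# example expresses some concept of interest. Random data stands in here.
rng = np.random.default_rng(0)
activations = rng.standard_normal((1000, 64))
concept_labels = (activations[:, :3].sum(axis=1) > 0).astype(int)  # toy concept

X_train, X_test, y_train, y_test = train_test_split(
    activations, concept_labels, test_size=0.2, random_state=0
)

# A linear probe: if a simple classifier recovers the concept from the
# activations, the representation encodes it (though establishing a causal
# role requires further tests, e.g., interventions).
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```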

Summary: Interpretability Methods Comparison

| Method | Type | Scope | Adoption |
|--------|------|-------|----------|
| LIME, SHAP, IG | Post-hoc | Local/Global | Industry standard: popular and convincing-looking, but no fidelity guarantee |
| GAMs, Monotonic Nets | Transparency | Global | Niche: primarily used in healthcare and tabular settings |
| Circuits, Probing | Mechanistic | Internal | Research: technically deep, rarely user-facing |

Scaling Laws vs. Interpretability

“Scale is all you need?” The scaling-laws results of Kaplan et al. (2020) suggest that language-modeling performance improves smoothly, following power laws, as we increase three factors:

  1. Model Size (Parameters)
  2. Dataset Size (Tokens)
  3. Compute (PF-days)

For each factor $x$, the empirical loss follows a power law, $L(x) = (x/x_0)^{-\alpha}$, provided performance is not bottlenecked by the other two factors. However, as models scale, they become harder to interpret.
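
For intuition, here is a tiny numerical sketch of this power-law form; the constants are made up for illustration and are not the fitted values from the paper.

```python
import numpy as np

def power_law_loss(x, x0, alpha):
    """Scaling-law form L(x) = (x / x0)^(-alpha): loss falls smoothly as the
    scale factor x (parameters, tokens, or compute) grows."""
    return (x / x0) ** (-alpha)

# Illustrative constants only (not fitted values from the paper).
params = np.array([1e7, 1e8, 1e9, 1e10, 1e11])
print(power_law_loss(params, x0=1e4, alpha=0.08))  # loss shrinks with scale
```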


A System Design View of Interpretability

Interpretability is a system-level property, not just a debugging tool. Just as basketball analytics moved from individual player stats to lineup net rating, interpretability helps us optimize the human–AI system as a whole.

The three main benefits are:

  1. Information Acquisition
  2. Value Alignment
  3. Modularity

1. Information Acquisition

What should we measure? Predictive models often treat measurements as fixed, but in biomedicine and other fields, measurement is active and costly. Interpretability helps identify which measurements are actually driving risk.

Case Study: Severe Maternal Morbidity (SMM)
Using a GAM to predict SMM revealed that the BabySize-MaternalHeight Ratio was the single most important feature, ranking above preeclampsia.

2. Value Alignment

What did the model learn to optimize? We must connect probabilistic objectives (loss functions) to value-based objectives (human goals).

Jagged Performance & Misalignment
Current AI exhibits a “jagged frontier”: it is remarkably capable in some areas yet fails at specific tasks (e.g., it introduces bugs when coding). This suggests a lack of robust inner alignment.

The “Goodhart’s Law” of Biomarkers:
“When a biomarker is used to guide treatment decisions, it ceases to predict outcomes.”

3. Modularity

Swappable, testable components. Interpretability allows us to connect component-level performance to system-level performance. If we can reverse-engineer models or build them with concept-based components, we can test and swap individual parts (like “modules”) as better versions become available.

Open Challenges & Takeaways

As Ilya Sutskever noted, “It’s back to the age of research again, just with big computers”.


Final Takeaway:
Scaling delivers performance, but interpretability, alignment, and system-level thinking determine whether AI systems are safe, useful, and beneficial in the real world.