Lecture 25
Alignment, explainability, and open research directions in modern machine learning, with a focus on large language models and system-level reliability.
Key Takeaways
- Modern AI research is shifting from raw performance to alignment, interpretability, and system-level reliability.
- Post-hoc explainability tools are widely used but have serious fidelity and robustness limitations.
- Scaling laws explain why larger models work better, but they do not guarantee safety or alignment.
- Interpretability benefits not only users, but also system designers, by improving measurement, modularity, and value alignment.
- Many core challenges (alignment, reasoning, data limits, economic impact) remain open research problems.
Logistics
- Project Final Report: Due Friday, December 12th. Submit via Canvas.
- Final Exam: December 17th, 5:05–7:05 PM in Science 180. A study guide has been released.
Learning Goals
By the end of this lecture, you should be able to:
- Explain why alignment and explainability are central problems in modern AI.
- Distinguish between post-hoc, transparent, and mechanistic interpretability.
- Describe the difference between outer alignment and inner alignment.
- Understand how system design interacts with interpretability.
- Identify major open research problems in alignment and interpretability.
The LLM Training and Usage Pipeline
Modern Large Language Models (LLMs) progress through distinct stages, from broad pattern learning to task-specific adaptation:
- Random Model: The initialized architecture before training.
- Pre-Training: Self-supervised training on massive unlabeled corpora (e.g., Common Crawl) to learn general patterns.
- Fine-Tuning: Alignment using in-domain data via Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF).
- In-Context Learning: At inference time, prompts and examples guide behavior without updating weights.
Key Observation:
The same trained model can behave very differently depending on context. Pre-training and fine-tuning change the model's parameters, whereas in-context learning changes only how the model is used at inference time; no weights are updated.
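As a concrete illustration of the last two stages, here is a minimal sketch contrasting in-context learning (frozen weights, only the prompt changes) with supervised fine-tuning (gradient updates to the weights). It assumes the Hugging Face `transformers` library and the small public `gpt2` checkpoint, which are illustrative choices rather than anything specified in lecture; the point is the mechanics, not output quality.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# --- In-context learning: weights stay frozen; only the prompt changes. ---
prompt = (
    "Translate English to French.\n"
    "sea otter -> loutre de mer\n"
    "cheese ->"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))

# --- Supervised fine-tuning: same architecture, but gradients update weights. ---
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
batch = tokenizer("cheese -> fromage", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss
loss.backward()
optimizer.step()  # parameters change here; in-context learning never does this
```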
Why Explainability Matters
Models trained on large-scale data are rarely naturally interpretable to humans. Explainability is critical for:
- Safety and trust
- Debugging and model validation
- Regulatory and ethical compliance
- Understanding system-level behavior beyond accuracy
Common Confusions
- Explainability ≠ Accuracy: A highly accurate model can still be unsafe or untrustworthy.
- Post-hoc explanations ≠ true understanding: Plausible explanations may not reflect the model’s actual computation.
- Dropping sensitive features ≠ fairness: Bias can persist through correlated variables.
Fairness and Sensitive Features
Removing sensitive attributes such as race or gender from the training data does not make predictions invariant to them, because correlated proxy features remain.
Strategies for Invariance:
- Remove the feature: Often insufficient due to correlations.
- Train then clean: Train with all features, then remove learned components post-hoc.
- Test-time blinding: Drop the feature only during inference.
- Modified loss functions: Penalize prediction dependence on sensitive attributes (see the sketch after this list).
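A minimal sketch of the modified-loss strategy, assuming a PyTorch binary classifier: the usual cross-entropy is augmented with a penalty on the covariance between predictions and the sensitive attribute. The covariance penalty and the `lambda_fair` weight are illustrative choices, not a method prescribed in the lecture.

```python
import torch
import torch.nn as nn

def fairness_penalized_loss(logits, targets, sensitive, lambda_fair=1.0):
    """Standard BCE plus a penalty on |cov(prediction, sensitive attribute)|."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, targets)
    preds = torch.sigmoid(logits)
    # Decorrelation penalty: pushes predictions toward (linear) independence
    # from the sensitive attribute, even when correlated proxies remain as inputs.
    cov = torch.mean((preds - preds.mean()) * (sensitive - sensitive.mean()))
    return bce + lambda_fair * cov.abs()

# Usage on a dummy batch.
logits = torch.randn(32, requires_grad=True)
targets = torch.randint(0, 2, (32,)).float()
sensitive = torch.randint(0, 2, (32,)).float()  # e.g., a binary group indicator
loss = fairness_penalized_loss(logits, targets, sensitive)
loss.backward()
```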
The History of Interpretability
Interpretability Categories
| Type | Core Idea | Main Limitation |
|---|---|---|
| Post-hoc | Explain predictions after training | Often lacks fidelity |
| Transparent | Interpretable by design | Limited flexibility |
| Mechanistic | Reverse-engineer internals | Hard to scale |
2016: Setting the Stage
- The Mythos of Model Interpretability: Interpretability is invoked when evaluation metrics are imperfect proxies for the objectives we actually care about (Lipton, 2016).
- Evaluation Modes: Application-grounded, human-grounded, and functionally-grounded (Doshi-Velez & Kim, 2017).
2017–2020: Fragmentation
| Methodology | Examples | Description |
|---|---|---|
| Post-hoc | LIME, SHAP, Integrated Gradients | Industry standard; explain after training |
| Transparency | GAMs, Monotonic Nets | Niche, common in healthcare/tabular data |
| Mechanistic | Circuits, probing | Technically deep, rarely user-facing |
Cracks in Post-Hoc Explanations
- Insensitivity: Saliency maps may remain unchanged under weight randomization (Adebayo et al., 2018).
- Vulnerability: LIME and SHAP can be easily fooled (Slack et al., 2020).
- Plausibility vs. Faithfulness: Explanations may look reasonable but misrepresent computation (Jacovi & Goldberg, 2020).
- High-Stakes Critique: In safety-critical settings, post-hoc methods may be insufficient (Rudin, 2019).
Interpretability Approaches
1. Post-hoc Explanations
These methods attempt to explain a black-box model after it has been trained. While tools like LIME, SHAP, and Integrated Gradients became industry standards, they face significant criticism regarding their fidelity.
- LIME (Local Interpretable Model-agnostic Explanations): Approximates model behavior locally with a simpler surrogate model: \(\xi(x) = \operatorname*{arg\,min}_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g)\). Critique: LIME and SHAP explanations can be adversarially fooled (Slack et al., 2020).
- SHAP (SHapley Additive exPlanations): Uses Shapley values to assign per-feature contribution scores: \(\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]\)
- Integrated Gradients: An axiomatic attribution method that integrates gradients along a straight path from a baseline \(x'\) to the input \(x\): \(IG_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i}\, d\alpha\)
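Below is a minimal numerical sketch of Integrated Gradients, approximating the path integral above with a Riemann sum. It assumes a differentiable PyTorch model with a scalar output; the zero baseline, step count, and toy linear model are illustrative choices.

```python
import torch

def integrated_gradients(f, x, baseline=None, steps=50):
    """Approximate IG_i(x) = (x_i - x'_i) * integral of dF/dx_i along the path."""
    if baseline is None:
        baseline = torch.zeros_like(x)          # x' = 0 baseline
    total_grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        f(point).sum().backward()               # scalar output assumed
        total_grads += point.grad
    avg_grads = total_grads / steps
    return (x - baseline) * avg_grads           # attribution per input dimension

# Usage on a toy linear model: attributions should recover roughly w * x,
# and their sum matches f(x) - f(baseline) (the completeness axiom).
w = torch.tensor([2.0, -1.0, 0.5])
f = lambda z: z @ w
x = torch.tensor([1.0, 2.0, 3.0])
print(integrated_gradients(f, x))
```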
Cracks in Post-Hoc Explanations:
- Insensitivity: Saliency maps can be insensitive to model weights (Adebayo et al. 2018).
- Plausibility vs. Faithfulness: There is a gap between an explanation looking reasonable to a human and it actually reflecting the model’s computation (Jacovi & Goldberg 2020).
- High-Stakes: Rudin (2019) argues for abandoning post-hoc methods entirely in high-stakes settings.
2. Transparency by Design
Instead of explaining a black box, use inherently interpretable models.
- Generalized Additive Models (GAMs): Decompose complex outcomes into a sum of univariate functions: $F(x) = \beta_0 + f_1(x_1) + \cdots + f_r(x_r)$ (a backfitting sketch follows this list).
- Pros: Components can be individually visualized.
- Cons: Niche application, often used in healthcare/tabular data rather than unstructured data.
- Monotonic Nets: Constrain variables to affect predictions in only one direction.
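A minimal GAM sketch fitted by backfitting, assuming simple low-degree polynomial smoothers for each univariate component; the synthetic data, smoother choice, and iteration count are illustrative. Each fitted `f_j` can be evaluated and plotted on its own, which is exactly the "individually visualized" property noted above.

```python
import numpy as np

def fit_gam(X, y, degree=3, n_iter=20):
    """Fit F(x) = beta0 + f_1(x_1) + ... + f_r(x_r) by backfitting."""
    _, r = X.shape
    beta0 = y.mean()
    coefs = [np.zeros(degree + 1) for _ in range(r)]
    for _ in range(n_iter):
        for j in range(r):
            # Partial residual: what the other components have not yet explained.
            others = sum(np.polyval(coefs[k], X[:, k]) for k in range(r) if k != j)
            resid = y - beta0 - others
            coefs[j] = np.polyfit(X[:, j], resid, degree)  # univariate smoother
    return beta0, coefs

def predict_gam(beta0, coefs, X):
    return beta0 + sum(np.polyval(c, X[:, j]) for j, c in enumerate(coefs))

# Toy additive data: y = sin(x1) + x2^2 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)
beta0, coefs = fit_gam(X, y)
print("training MSE:", np.mean((predict_gam(beta0, coefs, X) - y) ** 2))
```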
3. Mechanistic Interpretability
- Focuses on circuits, probing, and feature geometry.
- Reverse Engineering: Extracting components from trained models.
- Concept-Based: Attempting to incorporate human-understandable concepts from the start of training.
Summary: Interpretability Methods Comparison
| Method | Type | Scope | Adoption |
|---|---|---|---|
| LIME, SHAP, IG | Post-Hoc | Local/Global | Industry Standard: Popular, look convincing, but don’t guarantee fidelity. |
| GAMs, Monotonic Nets | Transparency | Global | Niche: Primarily used in healthcare and tabular settings. |
| Circuits, Probing | Mechanistic | Internal | Research: Technically deep, rarely user-facing. |
Scaling Laws vs. Interpretability
“Scale is all you need”? Kaplan et al. (2020) show that language modeling loss improves smoothly, following power laws, as we scale up three factors:
- Model Size (Parameters)
- Dataset Size (Tokens)
- Compute (PF-days)
For each factor $x$, empirical loss follows a power-law relationship $L(x) = (x/x_0)^{-\alpha}$, provided performance is not bottlenecked by the other two factors. However, as models scale, they become less interpretable.
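A minimal sketch of how such a power law can be fit in practice: take logs and run a linear regression of log-loss on log-scale. The synthetic data below is illustrative (the constants are roughly the magnitudes Kaplan et al. report for model size), not measurements from a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "model size vs. loss" measurements following a noisy power law.
x = np.array([1e6, 1e7, 1e8, 1e9, 1e10])            # e.g., parameter counts
true_alpha, true_x0 = 0.076, 8.8e13                  # assumed Kaplan-style constants
loss = (x / true_x0) ** (-true_alpha) * np.exp(0.01 * rng.normal(size=x.size))

# log L = -alpha * log x + alpha * log x0, so a line fit recovers both constants.
slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
alpha_hat = -slope
x0_hat = np.exp(intercept / alpha_hat)
print(f"alpha ~ {alpha_hat:.3f}, x0 ~ {x0_hat:.2e}")
```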
A System Design View of Interpretability
Interpretability is a system-level property, not just a debugging tool. Just as basketball analytics moved from individual player stats to lineup net rating, interpretability shifts attention from individual model accuracy to optimizing the human–AI system as a whole.
The three main benefits are:
- Information Acquisition
- Value Alignment
- Modularity
1. Information Acquisition
What should we measure? Predictive models often treat measurements as fixed, but in biomedicine and other fields, measurement is active and costly. Interpretability helps identify which measurements are actually driving risk.
Case Study: Severe Maternal Morbidity (SMM). A GAM trained to predict SMM revealed that the BabySize-MaternalHeight Ratio was the most important feature, ranking above preeclampsia.
- Insight: We just happened to have maternal height data available; interpretability forced the question: “What should we actually be measuring clinically?”.
2. Value Alignment
What did the model learn to optimize? We must connect probabilistic objectives (loss functions) to value-based objectives (human goals).
- Outer Alignment: Is the loss function we train on actually aligned with human goals?
- Inner Alignment: Given that loss, does the trained model’s internal representation faithfully implement that goal, even off-distribution?
Jagged Performance and Misalignment:
Current AI exhibits a “jagged frontier”: it is unbelievably intelligent in some areas but fails at specific tasks (e.g., coding bugs). This suggests a lack of robust inner alignment.
The “Goodhart’s Law” of Biomarkers:
“When a biomarker is used to guide treatment decisions, it ceases to predict outcomes.”
- Example: In pneumonia patients, high Blood Urea Nitrogen (BUN) predicts mortality. Creatinine, however, shows a U-shaped risk curve in which low levels (usually a sign of health) are associated with higher risk, likely because those patients had low muscle mass (frailty).
- Risk: If a data-driven system learns that kidney failure treatments improve survival, it might learn to “put everyone into kidney failure” to minimize the predicted loss. That choice is optimal under the loss function but misaligned with patient health.
3. Modularity
Swappable, testable components. Interpretability allows us to connect component-level performance to system-level performance. If we can reverse-engineer models or build them with concept-based components, we can test and swap individual parts (like “modules”) as better versions become available.
Open Challenges & Takeaways
As Ilya Sutskever noted, “It’s back to the age of research again, just with big computers”.
- Economic Impact: Models show impressive eval performance but lack commensurate real-world economic impact.
- Jaggedness: Models repeat bugs and have uneven capabilities.
- Emotional/Value Functions: Humans use emotions as robust value functions to guide generalization. AI currently lacks this mechanism.
- Data Walls: Pre-training scales uniformly but is hitting data limits. RL consumes more compute but needs better efficiency via value functions.
- Verifiable Rewards: Scaling RL requires rewards that can be verified at scale.
- Symbolic Reasoning: Combining LLMs with symbolic reasoning and graphical models remains an open problem.
Final Takeaway:
Scaling delivers performance, but interpretability, alignment, and system-level thinking determine whether AI systems are safe, useful, and beneficial in the real world.