Lecture 25
Alignment, explainability, and open directions in LLM research
Phases of Model Training
- Pre-training: The model is first trained on massive unlabeled datasets with a self-supervised objective to learn general patterns (e.g., “train on everything available”).
- Fine-tuning: The model is then refined for specific goals using supervised learning, reinforcement learning, or instruction-following tasks.
- In-context learning: During usage, prompts and examples in the input guide the model to produce outputs adapted to user intent.
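A minimal illustration of in-context learning: the prompt below (a hypothetical sentiment task, not from the lecture) packs labeled examples into the input so the model infers the pattern without any weight updates.

```python
# Hypothetical few-shot prompt: the labeled examples are part of the input,
# so the model adapts its output without any parameter updates.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and it just works."
Sentiment:"""

# The prompt would be sent to a language model (API call omitted here);
# the model is expected to continue with " Positive".
print(few_shot_prompt)
```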
Why Explainability Matters
- Models trained on massive datasets are rarely interpretable to humans by default.
- Making models understandable is crucial for safety, trust, debugging, and meeting regulatory or ethical standards.
Fairness & Sensitive Features
- Merely dropping sensitive features like race from data does not eliminate their influence—biases can be encoded via correlated variables.
- Preferred approaches:
  - Train with sensitive features, then “remove” their effects post-hoc.
  - Use loss functions that explicitly encourage invariance to sensitive features (e.g., penalizing the model if predictions vary by race); a minimal sketch appears after the example below.
Key Example:
- If a healthcare model is trained after removing race, it may still encode race via other variables (like ZIP code, income), hiding bias rather than eliminating it.
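As one concrete (hypothetical) instance of the second approach, the sketch below adds a demographic-parity-style penalty that discourages the model's average prediction from differing across groups defined by a sensitive attribute. The synthetic data, group labels, and penalty weight `lam` are illustrative assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data: X are non-sensitive features, s is a binary sensitive attribute.
n = 512
X = torch.randn(n, 5)
s = torch.randint(0, 2, (n,))                      # sensitive group (e.g., 0/1)
y = (X[:, 0] + 0.5 * torch.randn(n) > 0).float()   # outcome

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
lam = 1.0  # strength of the invariance penalty (a tuning choice)

for step in range(200):
    logits = model(X).squeeze(1)
    task_loss = bce(logits, y)
    # Invariance penalty: the mean prediction should not depend on the group.
    p = torch.sigmoid(logits)
    gap = p[s == 0].mean() - p[s == 1].mean()
    loss = task_loss + lam * gap.pow(2)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final prediction gap between groups: {gap.abs().item():.3f}")
```

This encodes one particular notion of invariance (equal average predictions); other fairness criteria would use different penalty terms.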
Interpretability Approaches
1. Post-hoc Explanations
- LIME (Local Interpretable Model-agnostic Explanations): Approximates model behavior with simpler models locally (a toy local-surrogate sketch appears after this list). The optimization objective is:

  $$\xi(x) = \arg\min_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)$$

  where $f$ is the original model, $g$ is an interpretable local model drawn from a class $G$, $\pi_x$ is a proximity weight function around sample $x$, $\mathcal{L}(f, g, \pi_x)$ measures how poorly $g$ approximates $f$ in the neighborhood weighted by $\pi_x$, and $\Omega(g)$ is a complexity penalty.
- SHAP (SHapley Additive exPlanations): Uses Shapley values from cooperative game theory to assign contribution scores to features. The Shapley value for feature $i$ is:

  $$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}\,\big[f(S \cup \{i\}) - f(S)\big]$$

  where $F$ is the full feature set and $f(S)$ is the model's expected output when only the features in $S$ are known.
- Integrated Gradients: An axiomatic attribution method for neural networks. For input $x$ and baseline $x'$, the attribution to feature $i$ is:

  $$\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial f\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha$$
- These methods help interpret what influenced a specific prediction, but may not capture global model logic.
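To make the LIME objective concrete, here is a minimal sketch of the idea (not the reference implementation): sample perturbations around $x$, weight them by a proximity kernel playing the role of $\pi_x$, and fit a regularized linear surrogate $g$ whose coefficients serve as local feature attributions. The black-box function, kernel width, and sample counts are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Black-box model f (an arbitrary nonlinear function standing in for a real model).
def f(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * X[:, 2]

x = np.array([1.0, -0.5, 2.0])      # instance to explain
sigma = 0.5                         # kernel width (a tuning choice)

# 1) Sample perturbations around x.
Z = x + rng.normal(scale=0.3, size=(2000, x.size))

# 2) Proximity weights pi_x(z): closer perturbations count more.
dists = np.linalg.norm(Z - x, axis=1)
weights = np.exp(-(dists ** 2) / (sigma ** 2))

# 3) Fit an interpretable local surrogate g (regularized linear model);
#    the Ridge penalty plays the role of the complexity term Omega(g).
g = Ridge(alpha=1.0)
g.fit(Z - x, f(Z), sample_weight=weights)

print("local attributions (surrogate coefficients):", g.coef_)
```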
2. Transparency by Design
- Use inherently interpretable models:
  - Additive Models / Generalized Additive Models (GAMs): Model output is a sum of simple functions of each feature. Easy to visualize and analyze each feature’s impact (a small sketch appears after this list).
  - Monotonic Nets: Guarantee certain variables affect predictions only in one direction.
- Pros: Clear, component-wise logic.
- Cons: May underperform compared to black-box models on complex tasks.
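A minimal additive-model sketch (an illustrative construction, not the lecture's): expanding each feature into spline basis functions and fitting a linear model on top gives a GAM-like model whose prediction decomposes into one learned curve per feature.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

# Spline basis per feature + linear model => the prediction is a sum of
# one learned 1-D function per feature (an additive model).
splines = SplineTransformer(n_knots=8, degree=3)
gam = make_pipeline(splines, Ridge(alpha=1.0)).fit(X, y)

# Approximate per-feature effect: evaluate the fitted pipeline along one
# feature while holding the other at a reference value (here, zero).
grid = np.linspace(-2, 2, 5)
probe = np.column_stack([grid, np.zeros_like(grid)])
print("approx. effect of feature 0 along a grid:", np.round(gam.predict(probe), 2))
```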
3. Mechanistic Interpretability
- Analyze what neural network units/layers are “doing” internally (e.g., finding circuits, probing activations); a probing sketch appears after this list.
- Allows understanding high-level model “motifs.”
- Still an active research area with many challenges.
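One common probing-style tool is a linear probe on internal activations. The sketch below is a toy setup with made-up data and a randomly initialized network: it captures a hidden layer with a PyTorch forward hook, then checks whether a simple linear classifier can read a property off those activations.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Toy network (weights are random here; in practice you would probe a trained model).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# Capture the hidden-layer activations with a forward hook.
captured = {}
def hook(module, inputs, output):
    captured["hidden"] = output.detach()
model[1].register_forward_hook(hook)   # hook the ReLU output

# Synthetic inputs carrying a binary property we want to probe for.
X = torch.randn(1000, 10)
prop = (X[:, 0] > 0).long()            # property: sign of the first input feature

with torch.no_grad():
    model(X)                           # forward pass fills captured["hidden"]

# Linear probe: can the property be decoded linearly from the activations?
H = captured["hidden"].numpy()
probe = LogisticRegression(max_iter=1000).fit(H, prop.numpy())
print("probe accuracy:", probe.score(H, prop.numpy()))
```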
Caveats
- All interpretability methods make simplifying assumptions; adversarial examples can break explanations.
- The complexity of modern models means explanations are always partial or approximate.
Summary: Interpretability Methods Comparison
| Method | Type | Scope | Pros | Cons |
|---|---|---|---|---|
| LIME | Post-hoc | Local | Model-agnostic, intuitive | Instability, sampling dependent |
| SHAP | Post-hoc | Local/Global | Theoretically grounded, consistent | Computationally expensive |
| Integrated Gradients | Post-hoc | Local | Axiomatic, gradient-based | Requires differentiable model |
Scaling Laws vs. Interpretability
- Scaling Laws: Increasing model size, data, and compute (e.g., across the GPT series) improves performance predictably on many tasks; loss typically falls as a power law rather than linearly (a toy power-law fit appears after this list).
- Challenge: As models scale, their decisions become harder to interpret for humans.
- System-level Design: Breaking up AI systems into interpretable modules/components can support both performance and understanding.
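A small worked example of the power-law form, with made-up loss numbers that are purely illustrative: fit $L(N) = a N^{-b} + c$ to loss-versus-parameter-count points and extrapolate.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (model size, validation loss) points, purely illustrative.
N = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = np.array([4.8, 4.0, 3.4, 2.95, 2.6])

def power_law(N, a, b, c):
    # Scaling-law form: loss falls as a power of model size, toward a floor c.
    return a * N ** (-b) + c

params, _ = curve_fit(power_law, N, loss, p0=(10.0, 0.1, 2.0), maxfev=10000)
a, b, c = params
print(f"fit: loss ~= {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
print("predicted loss at N=1e11:", round(power_law(1e11, *params), 2))
```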
Information Acquisition
What should we measure?
- Predictive models often take measurements as fixed.
- In practice, measurement is active and costly.
- Interpretability can highlight missing but valuable information.
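A hedged sketch of treating measurement as an active, costly choice (toy data, hypothetical measurement names and costs): rank candidate measurements by estimated mutual information with the outcome per unit cost, and acquire the best one next.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Toy dataset: three candidate measurements with different informativeness.
n = 1000
lab_test = rng.normal(size=n)
zip_income = rng.normal(size=n)
noise_var = rng.normal(size=n)
y = (lab_test + 0.3 * zip_income + 0.5 * rng.normal(size=n) > 0).astype(int)

X = np.column_stack([lab_test, zip_income, noise_var])
names = ["lab_test", "zip_income", "noise_var"]
costs = np.array([5.0, 1.0, 1.0])   # hypothetical acquisition costs

mi = mutual_info_classif(X, y, random_state=0)   # information each measurement carries
value_per_cost = mi / costs
for name, m, v in zip(names, mi, value_per_cost):
    print(f"{name:10s}  MI~{m:.3f}  value/cost~{v:.3f}")

print("measure next:", names[int(np.argmax(value_per_cost))])
```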
Modularity
Swappable, testable components
- Interpretability connects component-level performance to system-level performance.
- Each component has a clear job; new versions can be tested and adopted independently.
- Baking in modularity keeps systems understandable for longer: even as the models themselves grow more complex, the individual components remain understandable.
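A small sketch of the modularity idea in Python (component names and behaviors are hypothetical): components share a narrow interface, so each can be tested and swapped independently while system-level behavior stays a simple composition.

```python
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str) -> list[str]: ...

class Summarizer(Protocol):
    def summarize(self, docs: list[str]) -> str: ...

# Two interchangeable retriever implementations.
class KeywordRetriever:
    def retrieve(self, query: str) -> list[str]:
        return [f"doc matching '{query}'."]

class DummyRetriever:
    def retrieve(self, query: str) -> list[str]:
        return ["fixed doc A.", "fixed doc B."]

class FirstSentenceSummarizer:
    def summarize(self, docs: list[str]) -> str:
        return " / ".join(d.split(".")[0] for d in docs)

def answer(query: str, retriever: Retriever, summarizer: Summarizer) -> str:
    # System-level behavior is just the composition of well-defined components.
    return summarizer.summarize(retriever.retrieve(query))

# Component-level test: check the retriever's contract in isolation.
assert len(DummyRetriever().retrieve("anything")) == 2

print(answer("kidney failure treatment", KeywordRetriever(), FirstSentenceSummarizer()))
```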
Value Alignment
What did the model learn to optimize?
- Connect probabilistic objectives to value-based objectives.
- Outer alignment: Is the objective/loss function well-chosen and reflective of what we truly value?
- Inner alignment: Does the model optimize the given objective robustly, including in new or unexpected settings?
| Concept | Definition | Key Question | Risk if Violated |
|---|---|---|---|
| Outer Alignment | Objective matches human values | Is the loss function correct? | Optimizing wrong goal |
| Inner Alignment | Model robustly optimizes objective | Does model generalize safely? | Unexpected behavior in new settings |
Inner Alignment in Practice: Jagged Performance
Even if a model performs well on average, it may fail catastrophically on specific sub-tasks. This “jagged performance” is a key sign of poor inner alignment: the model has not truly learned a robust mechanism that generalizes safely.
Examples of jagged performance:
- Excellent benchmark performance but unexpected mistakes on rare or adversarial cases
- Strong overall accuracy but isolated “holes” where behavior breaks unpredictably
This connects directly to the question of whether the model truly optimizes the intended objective across all conditions.
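One simple diagnostic for jagged performance, sketched here with synthetic evaluation results: compute the metric per data slice rather than only in aggregate, and flag slices that fall far below the overall score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic evaluation results: overall accuracy looks fine,
# but one slice (e.g., a rare or adversarial subset) is broken.
slices = np.array(["common"] * 900 + ["rare"] * 80 + ["adversarial"] * 20)
correct = np.concatenate([
    rng.random(900) < 0.95,   # common cases: ~95% accurate
    rng.random(80) < 0.90,    # rare cases: ~90% accurate
    rng.random(20) < 0.20,    # adversarial cases: mostly wrong
])

overall = correct.mean()
print(f"overall accuracy: {overall:.3f}")
for name in np.unique(slices):
    acc = correct[slices == name].mean()
    flag = "  <-- hole" if acc < overall - 0.15 else ""
    print(f"  {name:12s} accuracy: {acc:.3f}{flag}")
```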
Risks of Misalignment
- Example: Medical models have learned that patients who received aggressive treatment for kidney failure had better survival, so a naive model might effectively recommend putting patients into kidney failure. The statistical pattern is accurate, but acting on it blindly would be unsafe.
- Different intended uses (risk prediction, benchmarking, treatment) require different notions of alignment.
- When a specific measurement becomes a target, it ceases to be a good measure (Goodhart's law).
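A tiny numerical illustration of that last point, with synthetic numbers: when candidates are selected by a proxy score instead of by true value, pushing harder on the proxy (larger candidate pools) leaves more and more of the true value on the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each candidate has a true value v and a proxy score = v + exploitable noise.
# Selecting harder (bigger pools) against the proxy inflates the exploitable
# part as much as the true value, so the measure stops tracking what we want.
for pool_size in [10, 100, 1000, 10000]:
    picked_true, best_true = [], []
    for _ in range(200):                         # repeat trials to average out noise
        v = rng.normal(size=pool_size)           # true value
        proxy = v + rng.normal(size=pool_size)   # proxy = true value + exploitable term
        picked_true.append(v[np.argmax(proxy)])  # what proxy-optimization selects
        best_true.append(v.max())                # what we actually wanted
    print(f"pool={pool_size:6d}  true value of proxy pick: {np.mean(picked_true):.2f}"
          f"   best available: {np.mean(best_true):.2f}")
```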
Open Challenges & Takeaways
- Reinforcement Learning (RL): Powerful but hard to keep robust/align, especially as task complexity grows.
- Emotional Value Functions: Human emotions act as value signals that guide decisions across many contexts. Current AI lacks this; rewards are defined externally and must be redefined whenever the problem being solved changes. What if we could instill emotion-like value functions that provide stable rewards across situations, without retraining or redefining the problem?
- Continual Learning: Current paradigm retrains static models, while human-like intelligence involves continuous adaptation.
- Formalizing Alignment: Need precise frameworks for aligning models with varying real-world values and use-cases.
- Ethics & Responsibility: Decisions made by future AI practitioners will have major, possibly societal-level impacts.
- Economic & Societal Impact: Superhuman models will reshape industries and power dynamics; open debate and responsible innovation are essential.
- Jagged Performance: Models show sizable differences in utility depending on the task they are executing. We must determine whether this jaggedness is a bug or a feature. If it is a bug, we must figure out how to smooth it out; if it is a feature, alignment becomes even more important for maximizing utility on the tasks we want to accomplish.
- Optimal Information Collection: The case study demonstrates the value of data previously assumed to be unrelated. This highlights the tension between the low efficiency of gathering as much data as possible about everything and the high efficiency of gathering only information presumed to be pertinent. Finding an optimal balance between the two is key.