Structuring context improves inference, not only in classical statistical graphical models but also within foundation models. This project investigates how explicit contextual structure can make foundation models more efficient, modular, and interpretable.
We focus on making large models practical for real-world deployment by aligning their internal mechanisms with the same principles that make classical models statistically efficient: modularity, conditional independence, and context-aware adaptation.
This work supports:
- Structured contextual inference, by building architectural and training constraints that reflect known structure.
- Modular reasoning, by decomposing complex predictions into composable parts.
- Deployment readiness, through faster, lower-latency models that retain contextual sensitivity.
Recent work includes FastCache (Liu et al., 2025), Memory-Keyed Attention (Liu et al., 2025), and the ongoing development of a framework for efficient, composable LLMs.
References
2025
FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation
Dong Liu, Jiayi Zhang, Yifan Li, and 3 more authors
CVPR Another Brick in the AI Wall: Building Practical Solutions from Theoretical Foundations (CVPR BASE 2025), 26–28 Aug 2025
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model’s internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden-state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with the best generation quality among the compared caching methods, as measured by FID and t-FID. A code implementation of FastCache is available on GitHub.
@article{liu2025fastcache,
  title   = {FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation},
  author  = {Liu, Dong and Zhang, Jiayi and Li, Yifan and Yu, Yanxuan and Lengerich, Ben and Wu, Ying Nian},
  journal = {CVPR Another Brick in the AI Wall: Building Practical Solutions from Theoretical Foundations (CVPR BASE 2025)},
  year    = {2025}
}
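To make the timestep-level caching idea in the FastCache abstract above concrete, here is a minimal sketch, assuming a PyTorch-style DiT block: a wrapper reuses a block's cached output when the hidden states have changed little since the previous denoising step, and otherwise recomputes. The simple threshold test stands in for the paper's hypothesis-testing decision rule, and the names (`CachedBlock`, `tau`, `approx`) are illustrative, not the released FastCache API.

```python
import torch
import torch.nn as nn


class CachedBlock(nn.Module):
    """Wraps a transformer block with a reuse-or-recompute cache (illustrative sketch)."""

    def __init__(self, block: nn.Module, hidden_dim: int, tau: float = 0.05):
        super().__init__()
        self.block = block          # the wrapped DiT transformer block
        self.tau = tau              # relative-change threshold (stand-in for the statistical test)
        self.approx = nn.Linear(hidden_dim, hidden_dim)  # learnable linear approximation
        self.prev_in = None         # hidden states seen at the previous timestep
        self.prev_out = None        # cached block output from the previous timestep

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        if self.prev_in is not None:
            # Relative change of the hidden states between consecutive denoising steps.
            delta = (h - self.prev_in).norm() / (self.prev_in.norm() + 1e-8)
            if delta.item() < self.tau:
                # Change is insignificant: skip the block and correct the cached output
                # with a cheap learnable linear term instead of a full forward pass.
                return self.prev_out + self.approx(h - self.prev_in)
        out = self.block(h)         # significant change: run the full block and refresh the cache
        self.prev_in, self.prev_out = h.detach(), out.detach()
        return out
```

The sketch keeps one cache per block and omits batching concerns, classifier-free guidance, and the saliency-based token selection described in the abstract; it only illustrates the reuse-or-recompute structure.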
MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
Dong Liu, Yanxuan Yu, Xuhong Wang, and 2 more authors
ICML Long Context Foundation Models (LCFM), 26–28 Aug 2025
As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches—local, session, and long-term—and learns to route attention across them dynamically. We further introduce FastMKA, a broadcast-routed variant that fuses memory sources before attention computation for enhanced efficiency. Experiments across different sequence lengths show that MKA improves perplexity over MHA and MLA, while FastMKA achieves comparable accuracy to MLA with up to 4× faster training and 40% lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
@article{liu2025mka,
  title   = {MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning},
  author  = {Liu, Dong and Yu, Yanxuan and Wang, Xuhong and Lengerich, Ben and Wu, Ying Nian},
  journal = {ICML Long Context Foundation Models (LCFM)},
  year    = {2025}
}
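As a companion to the MKA entry above, the following is a minimal sketch, assuming PyTorch, of attention routed across multiple KV caches: each memory level (local, session, long-term) is attended to separately, and a learned, query-dependent gate mixes the per-level reads. The class and names (`MultiLevelAttention`, `router`) are illustrative rather than the paper's implementation; multi-head splitting, cache updates, and the FastMKA broadcast variant are omitted.

```python
import torch
import torch.nn as nn


class MultiLevelAttention(nn.Module):
    """Single-head attention over several KV caches with a learned router (illustrative sketch)."""

    def __init__(self, dim: int, num_levels: int = 3):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.router = nn.Linear(dim, num_levels)  # per-query routing weights over cache levels
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, caches: list) -> torch.Tensor:
        # x: (batch, q_len, dim); caches: [local, session, long_term], each (batch, mem_len, dim)
        q = self.q_proj(x)
        reads = []
        for mem in caches:
            k, v = self.kv_proj(mem).chunk(2, dim=-1)
            attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            reads.append(attn @ v)                          # attention read from this cache level
        gates = torch.softmax(self.router(q), dim=-1)       # (batch, q_len, num_levels)
        stacked = torch.stack(reads, dim=-1)                # (batch, q_len, dim, num_levels)
        return (stacked * gates.unsqueeze(-2)).sum(dim=-1)  # gate-weighted mix across levels


# Usage sketch with random tensors standing in for real hidden states and caches.
if __name__ == "__main__":
    attn = MultiLevelAttention(dim=64)
    x = torch.randn(2, 8, 64)
    caches = [torch.randn(2, n, 64) for n in (8, 32, 128)]  # local, session, long-term memories
    print(attn(x, caches).shape)  # torch.Size([2, 8, 64])
```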