Structuring context improves inference—not only in statistical graphical models, but also within foundation models. This project investigates how explicit contextual structure can make foundation models more efficient, modular, and interpretable.
We focus on making large models practical for real-world deployment by aligning their internal mechanisms with the same principles that make classical models statistically efficient: modularity, conditional independence, and context-aware adaptation.
This work supports:
Structured contextual inference by building architectural and training constraints that reflect known structure.
Modular reasoning by decomposing complex predictions into composable parts.
Deployment readiness through lower-latency, memory-efficient models that retain contextual sensitivity.
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative denoising process and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model’s internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden-state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through a learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with the best generation quality among competing cache methods as measured by FID and t-FID. The FastCache implementation is available on GitHub at this https URL.
@article{liu2025fastcache,
  title   = {FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation},
  author  = {Liu, Dong and Zhang, Jiayi and Li, Yifan and Yu, Yanxuan and Lengerich, Ben and Wu, Ying Nian},
  journal = {CVPR Another Brick in the AI Wall: Building Practical Solutions from Theoretical Foundations (CVPR BASE 2025)},
  year    = {2025}
}
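To make the cache-or-compute decision concrete, here is a minimal PyTorch sketch of a timestep-level hidden-state cache with a learnable linear correction for skipped blocks. The wrapper class, the relative-change statistic, and the fixed threshold are illustrative assumptions standing in for the paper's hypothesis-testing rule; this is not the released FastCache implementation.

```python
import torch
import torch.nn as nn

class CachedBlockSketch(nn.Module):
    """Illustrative wrapper: skip a DiT block when its input barely changed
    since the last cached timestep, and patch the cached output with a
    learnable linear correction instead of recomputing the block."""

    def __init__(self, block: nn.Module, hidden_dim: int, threshold: float = 0.05):
        super().__init__()
        self.block = block
        self.threshold = threshold                        # stand-in for the hypothesis-test decision rule
        self.approx = nn.Linear(hidden_dim, hidden_dim)   # learnable linear approximation
        self.prev_hidden = None
        self.prev_output = None

    def forward(self, hidden: torch.Tensor, *block_args) -> torch.Tensor:
        if self.prev_hidden is not None and self.prev_hidden.shape == hidden.shape:
            # Relative change across timesteps; small values indicate redundancy.
            delta = (hidden - self.prev_hidden).norm() / (self.prev_hidden.norm() + 1e-6)
            if delta.item() < self.threshold:
                # Cache hit: reuse the cached output plus a cheap linear correction.
                return self.prev_output + self.approx(hidden - self.prev_hidden)
        # Cache miss: run the full block and refresh the cache.
        out = self.block(hidden, *block_args)
        self.prev_hidden, self.prev_output = hidden.detach(), out.detach()
        return out
```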
MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
Dong Liu, Yanxuan Yu, Xuhong Wang, and 2 more authors
ICML Long Context Foundation Models (LCFM), 26–28 Aug 2025
As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce FastMKA, a broadcast-routed variant that fuses memory sources before attention computation for enhanced efficiency. Experiments across different sequence lengths show that MKA improves perplexity over MHA and MLA, while FastMKA achieves accuracy comparable to MLA with up to 4× faster training and 40% lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
@article{liu2025mka,
  title   = {MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning},
  author  = {Liu, Dong and Yu, Yanxuan and Wang, Xuhong and Lengerich, Ben and Wu, Ying Nian},
  journal = {ICML Long Context Foundation Models (LCFM)},
  year    = {2025}
}
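As a rough illustration of routing attention across multiple memory levels, the sketch below attends separately over local, session, and long-term KV caches and mixes the per-level outputs with query-dependent routing weights. The class name, single-head formulation, and softmax router are assumptions made for clarity, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryKeyedAttentionSketch(nn.Module):
    """Single-head illustration: attend over several KV caches (e.g. local,
    session, long-term) and combine the per-level outputs with learned,
    query-dependent routing weights."""

    def __init__(self, dim: int, num_levels: int = 3):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_levels)  # one routing logit per memory level
        self.scale = dim ** -0.5

    def forward(self, x, kv_caches):
        # x: (batch, q_len, dim); kv_caches: list of (keys, values) pairs,
        # each of shape (batch, mem_len, dim) for one memory level.
        q = self.q_proj(x)
        per_level = []
        for keys, values in kv_caches:
            attn = torch.softmax(q @ keys.transpose(-2, -1) * self.scale, dim=-1)
            per_level.append(attn @ values)                   # (batch, q_len, dim)
        stacked = torch.stack(per_level, dim=-2)              # (batch, q_len, levels, dim)
        weights = F.softmax(self.router(x), dim=-1).unsqueeze(-1)
        return (weights * stacked).sum(dim=-2)                # routed mix of memory levels

# Example: three memory levels of increasing length.
mka = MemoryKeyedAttentionSketch(dim=64)
x = torch.randn(2, 8, 64)
caches = [(torch.randn(2, n, 64), torch.randn(2, n, 64)) for n in (16, 64, 256)]
out = mka(x, caches)  # (2, 8, 64)
```

A FastMKA-style variant would instead fuse the memory sources into a single KV set before the attention call, trading per-level routing for a single attention pass and the lower latency reported above.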
PiKV: KV Cache Management System for MoE Architecture
As large-scale language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV Routing to reduce token-to-KV access, and PiKV Scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV Compression modules into the caching pipeline for acceleration.
@inproceedings{liu2025pikv,
  title     = {Pi{KV}: {KV} Cache Management System for MoE Architecture},
  author    = {Liu, Dong and Yu, Yanxuan and Lengerich, Ben and Wu, Ying Nian and Wang, Xuhong},
  booktitle = {ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models},
  year      = {2025},
  url       = {https://openreview.net/forum?id=hHoK1kBPd9}
}
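The sketch below illustrates the expert-sharded idea on a single device: each expert keeps its own KV shard, a token only touches the shards of the experts it is routed to, and a simple relevance score drives eviction. The class and method names and the scoring/eviction policy are assumptions for illustration; the actual PiKV system shards across GPUs and adds dedicated routing and compression modules.

```python
import torch

class ExpertShardedKVCacheSketch:
    """Single-device illustration of expert-sharded KV storage: one KV shard
    per expert, so a token only reads/writes the shards of the experts it is
    routed to, and low-relevance entries are evicted first."""

    def __init__(self, num_experts: int, max_entries_per_expert: int = 1024):
        self.shards = {e: {"k": [], "v": [], "score": []} for e in range(num_experts)}
        self.max_entries = max_entries_per_expert

    def append(self, expert_id: int, k: torch.Tensor, v: torch.Tensor, score: float):
        shard = self.shards[expert_id]
        shard["k"].append(k)
        shard["v"].append(v)
        shard["score"].append(score)   # e.g. an estimate of query relevance
        if len(shard["k"]) > self.max_entries:
            # Scheduling sketch: drop the least query-relevant entry when full.
            drop = shard["score"].index(min(shard["score"]))
            for field in ("k", "v", "score"):
                shard[field].pop(drop)

    def gather(self, expert_ids):
        # Only the routed experts' shards are read.
        keys = [torch.stack(self.shards[e]["k"]) for e in expert_ids if self.shards[e]["k"]]
        values = [torch.stack(self.shards[e]["v"]) for e in expert_ids if self.shards[e]["v"]]
        return keys, values
```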