Toward Efficient Inference for Mixture of Experts
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
MoEUT: Mixture-of-Experts Universal Transformers
Mirage: A Multi-Level Superoptimizer for Tensor Programs
Inference-Time Scaling for Generalist Reward Modeling
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Context Parallelism for Scalable Million-Token Inference
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Venn: Resource Management Across Federated Learning Jobs
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Balancing Pipeline Parallelism with Vocabulary Parallelism
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
Marconi: Prefix Caching for the Era of Hybrid LLMs
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions
TurboAttention: Efficient Attention Approximation for High Throughputs LLMs
A Practical Cross-Layer Approach for ML-Driven Storage Placement in Warehouse-Scale Computers
Scaling Deep Learning Training with MPMD Pipeline Parallelism
LServe: Efficient Long-Sequence LLM Serving with Unified Sparse Attention
VoLUT: Efficient Volumetric Streaming Enhanced by LUT-Based Super-Resolution
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
Forget the Data and Fine-Tuning! Just Fold the Network to Compress
Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
You Only Prune Once: Designing Calibration-Free Model Compression with Policy Learning
Dynamic Diffusion Transformer
TypedThinker: Typed Thinking Improves Large Language Model Reasoning
FlashMask: Efficient and Rich Mask Extension of FlashAttention
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Robust and Secure Code Watermarking for Large Language Models via ML/Crypto Codesign
BitsAI-CR: Automated Code Review via LLM in Practice
[Final Comment]: The variable name 'radious' is a typo. Please change it to 'radius'. [Review Summary]: Typo detected - recommend renaming 'radious' to 'radius'.
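
A minimal Python sketch of the change the review comment above suggests; the circle_area function and its surrounding code are hypothetical, invented only to illustrate the 'radious' -> 'radius' rename and are not taken from the BitsAI-CR paper:

import math

# Before the review: the parameter name is misspelled as 'radious'.
# def circle_area(radious: float) -> float:
#     return math.pi * radious ** 2

# After applying the suggested rename 'radious' -> 'radius':
def circle_area(radius: float) -> float:
    # Area of a circle with the given radius.
    return math.pi * radius ** 2

print(circle_area(2.0))  # prints 12.566370614359172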