PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Venn: Resource Management Across Federated Learning Jobs
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Balancing Pipeline Parallelism with Vocabulary Parallelism
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking
Marconi: Prefix Caching for the Era of Hybrid LLMs
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions
TurboAttention: Efficient Attention Approximation for High Throughputs LLMs
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
A Practical Cross-Layer Approach for ML-Driven Storage Placement in Warehouse-Scale Computers
Scaling Deep Learning Training with MPMD Pipeline Parallelism
LServe: Efficient Long-Sequence LLM Serving with Unified Sparse Attention
VoLUT: Efficient Volumetric Streaming Enhanced by LUT-Based Super-Resolution
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
Forget the Data and Fine-Tuning! Just Fold the Network to Compress
Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
You Only Prune Once: Designing Calibration-Free Model Compression with Policy Learning
Dynamic Diffusion Transformer
TypedThinker: Typed Thinking Improves Large Language Model Reasoning
FlashMask: Efficient and Rich Mask Extension of FlashAttention
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
SmolLM2: When Smol Goes Big - Data-Centric Training of a Small Language Model
Robust and Secure Code Watermarking for Large Language Models via ML/Crypto Codesign
BitsAI-CR: Automated Code Review via LLM in Practice
[Final Comment]: The variable name 'radious' is a typo. Please change it to 'radius'. [Review Summary]: Typo detected - recommend changing 'radious' to 'radius'.
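A minimal, hypothetical C++ sketch of the kind of rename such a review comment targets (the function name and constant are illustrative, not taken from the BitsAI-CR paper):

```cpp
// Hypothetical snippet: the misspelled parameter `radious` becomes `radius`.
constexpr double kPi = 3.14159265358979323846;

// Before: double CircleArea(double radious) { return kPi * radious * radious; }
double CircleArea(double radius) { return kPi * radius * radius; }
```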
Qwen2.5-1M Technical Report
Humanity's Last Exam
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
Qwen2 Technical Report
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
DeepSeek-VL: Towards Real-World Vision-Language Understanding
How to Train Data-Efficient LLMs
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Qwen Technical Report
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
DeepSeek-V3 Technical Report
Qwen2.5 Technical Report
Fast State Restoration in LLM Serving with HCache
Compressed Context Memory For Online Language Model Interaction
A Hardware Evaluation Framework for Large Language Model Inference
TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
Taipan: Efficient and Expressive State Space Language Models with Selective Attention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
AIOS: LLM Agent Operating System
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Block Transformer: Global-to-Local Language Modeling for Fast Inference
FLAME: Factuality-Aware Alignment for Large Language Models
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
Rethinking Optimization and Architecture for Tiny Language Models
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Cascade Speculative Drafting for Even Faster LLM Inference
Distributed Inference and Fine-tuning of Large Language Models Over The Internet
Gated Linear Attention Transformers with Hardware-Efficient Training
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
SparQ Attention: Bandwidth-Efficient LLM Inference
Improving alignment of dialogue agents via targeted human judgements
Language Models are General-Purpose Interfaces
OPT: Open Pre-trained Transformer Language Models
CBQ: Cross-Block Quantization for Large Language Models
SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Gemma: Open Models Based on Gemini Research and Technology
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Abseil Tip 234: Passing by Value, Pointer, or Reference
The following is a Korean translation of "Tip of the Week #234: Passing by Value, Pointer, or Reference."
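Only the tip's title appears above; as a hedged sketch of the conventions that title names, assuming the usual Google/Abseil guidance (sink parameters by value, read-only parameters by const reference, mutable outputs by pointer):

```cpp
#include <string>
#include <utility>
#include <vector>

// Sink parameter: take by value and move into place.
class Widget {
 public:
  explicit Widget(std::string name) : name_(std::move(name)) {}

 private:
  std::string name_;
};

// Read-only, potentially large: pass by const reference.
int CountNonEmpty(const std::vector<std::string>& items) {
  int n = 0;
  for (const auto& s : items) {
    if (!s.empty()) ++n;
  }
  return n;
}

// Mutable output: Google style passes it by pointer, which makes the
// mutation visible at the call site (AppendGreeting(&s)).
void AppendGreeting(std::string* out) { out->append("hello"); }
```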
Abseil Tip 232: When to Use auto in Variable Declarations
The following is a Korean translation of "Tip of the Week #232: When to Use auto in Variable Declarations."
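Again, only the title is given above; a brief hedged sketch of the trade-off the title suggests, when auto clarifies a declaration versus when spelling the type is clearer:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

void Demo(const std::map<std::string, std::vector<int>>& index) {
  // auto helps when the type is obvious from the right-hand side...
  auto counts = std::make_unique<std::vector<int>>();

  // ...or when the explicit type (a long iterator name) is pure noise.
  for (auto it = index.begin(); it != index.end(); ++it) {
    // Writing the type explicitly can be clearer when it is not
    // apparent from the initializer.
    const std::vector<int>& values = it->second;
    counts->push_back(static_cast<int>(values.size()));
  }
}
```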
Abseil Tip 231: Between Here and There - A Few Easily Overlooked Algorithms
The following is a Korean translation of "Tip of the Week #231: Between Here and There - A Few Easily Overlooked Algorithms."
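The tip's own algorithm list is not reproduced above, so the picks below are illustrative assumptions rather than the tip's examples; they show two standard algorithms that are easy to overlook in favor of hand-written loops:

```cpp
#include <algorithm>
#include <vector>

// std::rotate moves the first element to the back in a single pass,
// instead of an erase-then-push_back pair.
std::vector<int> RotateFrontToBack(std::vector<int> v) {
  if (!v.empty()) {
    std::rotate(v.begin(), v.begin() + 1, v.end());
  }
  return v;
}

// std::binary_search exploits sortedness instead of a linear std::find.
bool ContainsSorted(const std::vector<int>& sorted, int x) {
  return std::binary_search(sorted.begin(), sorted.end(), x);
}
```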