Jaehun's Blog

For Efficient AI



Categories

Restarting the blog..

10-31

Starting the blog..

01-03

ElementaryOS touchpad: using it like a Mac with touch gestures (Loki, Juno)

02-11

Using gestures with the Logitech MX Anywhere 2S on Ubuntu

02-11

Fast compression on Ubuntu with parallel gzip (pigz)

02-11

Converting PDFs to images on Ubuntu

02-11

Installing OpenCV 3.4 with Python 3 on Ubuntu 16.04

02-11

When Jupyter Notebook runs a different Python

02-11

inplace_swap

02-11

ElementaryOS touchpad: using it like a Mac with touch gestures (Loki, Juno)

02-11

Searching Docker tags

02-11

Setting up GitLab with Docker

02-11

Changing apt-get sources

02-11

Installing nvidia-docker on Ubuntu Linux and frequently used commands

01-08

Installing Zsh and oh-my-zsh on Ubuntu Linux

01-03

Brief paper summary: TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18)

02-12

Paper summary: LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation (CGO 04)

02-12

Paper summary: Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation (ICLR 2020)

02-12

Paper summary: NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning (CGO 20)

02-12

Brief paper summary: Fast and Effective Orchestration of Compiler Optimizations (Zhelong Pan, Rudolf Eigenmann; Purdue University; CGO'06)

02-12

Brief paper summary: End-to-End Deep Learning of Optimization Heuristics (PACT 17)

02-12

LLVM loop unroll and jam pass and view-cfg

02-12

LLVM (clang) build and install (ubuntu 18.04)

02-12

Brief paper summary: TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18)

02-12

Paper summary: Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation (ICLR 2020)

02-12

Paper summary: NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning (CGO 20)

02-12

Brief paper summary: Fast and Effective Orchestration of Compiler Optimizations (Zhelong Pan, Rudolf Eigenmann; Purdue University; CGO'06)

02-12

Brief paper summary: End-to-End Deep Learning of Optimization Heuristics (PACT 17)

02-12

Brief paper summary: DARTS: Differentiable Architecture Search (ICLR 2019)

02-12

LLVM loop unroll and jam pass and view-cfg

02-12

Brief paper summary: DARTS: Differentiable Architecture Search (ICLR 2019)

02-12

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

05-17

RODIMUS*: BREAKING THE ACCURACY-EFFICIENCY TRADE-OFF WITH EFFICIENT ATTENTIONS

05-17

Mixture Compressor for Mixture-of-Experts LLMs Gains More

05-17

An Empirical Study of Qwen3 Quantization

05-12

Gemma 3 Technical Report

05-12

Gemini Embedding: Generalizable Embeddings from Gemini

05-12

Seesaw: High-throughput LLM Inference via Model Re-sharding

05-12

MELODI: Exploring Memory Compression for Long Contexts

05-12

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

04-16

Toward Efficient Inference for Mixture of Experts

04-16

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

04-14

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

04-14

SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

04-14

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

04-13

MoEUT: Mixture-of-Experts Universal Transformers

04-13

Mirage: A Multi-Level Superoptimizer for Tensor Programs

04-13

Inference-Time Scaling for Generalist Reward Modeling

04-07

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

04-07

FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS

04-07

LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

04-07

SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations

04-02

AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

04-02

Context Parallelism for Scalable Million-Token Inference

03-31

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

03-31

SELF-DATA DISTILLATION FOR RECOVERING QUALITY IN PRUNED LARGE LANGUAGE MODELS

03-25

PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

03-25

TRAINING ULTRA LONG CONTEXT LANGUAGE MODEL WITH FULLY PIPELINED DISTRIBUTED TRANSFORMER

03-24

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

03-24

On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions

03-24

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

03-18

Venn: Resource Management Across Federated Learning Jobs

03-18

DIFFSERVE: EFFICIENTLY SERVING TEXT-TO-IMAGE DIFFUSION MODELS WITH QUERY-AWARE MODEL SCALING

03-17

Balancing Pipeline Parallelism with Vocabulary Parallelism

03-17

AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution

03-17

EFFICIENT LLM INFERENCE USING DYNAMIC INPUT PRUNING AND CACHE-AWARE MASKING

03-12

Marconi: Prefix Caching for the Era of Hybrid LLMs

03-12

LAVA: LIFETIME-AWARE VM ALLOCATION WITH LEARNED DISTRIBUTIONS AND ADAPTATION TO MISPREDICTIONS

03-11

TurboAttention: Efficient Attention Approximation for High Throughputs LLMs

03-11

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

03-10

A PRACTICAL CROSS-LAYER APPROACH FOR ML-DRIVEN STORAGE PLACEMENT IN WAREHOUSE-SCALE COMPUTERS

03-10

Scaling Deep Learning Training with MPMD Pipeline Parallelism

03-10

LSERVE: EFFICIENT LONG-SEQUENCE LLM SERVING WITH UNIFIED SPARSE ATTENTION

03-06

VOLUT: EFFICIENT VOLUMETRIC STREAMING ENHANCED BY LUT-BASED SUPER-RESOLUTION

03-06

ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments

03-06

Forget the Data and Fine-Tuning! Just Fold the Network to Compress

03-04

Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences

03-04

Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

02-25

HEXGEN-2: DISAGGREGATED GENERATIVE INFERENCE OF LLMS IN HETEROGENEOUS ENVIRONMENT

02-25

You Only Prune Once: DESIGNING CALIBRATION-FREE MODEL COMPRESSION WITH POLICY LEARNING

02-25

Dynamic Diffusion Transformer

02-25

TypedThinker: Typed Thinking Improves Large Language Model Reasoning

02-24

FlashMask: Efficient and Rich Mask Extension of FlashAttention

02-24

LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

02-17

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

02-13

Robust and Secure Code Watermarking for Large Language Models via ML/Crypto Codesign

02-13

BitsAI-CR: Automated Code Review via LLM in Practice

02-13

Qwen2.5-1M Technical Report

02-12

Humanity's Last Exam

02-12

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

02-12

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

02-12

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

02-11

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

02-11

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

02-11

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

02-10

Qwen2 Technical Report

02-10

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

02-10

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

02-09

DeepSeek-VL: Towards Real-World Vision-Language Understanding

02-09

How to Train Data-Efficient LLMs

02-09

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

02-07

DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

02-07

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

02-05

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

02-05

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

02-05

Qwen Technical Report

02-04

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

02-04

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

02-03

DeepSeek-V3 Technical Report

01-21

Qwen2.5 Technical Report

01-21

Fast State Restoration in LLM Serving with HCache

01-21

Compressed Context Memory For Online Language Model Interaction

01-21

A Hardware Evaluation Framework for Large Language Model Inference

01-21

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

01-20

DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs

01-20

TAIPAN: EFFICIENT AND EXPRESSIVE STATE SPACE LANGUAGE MODELS WITH SELECTIVE ATTENTION

01-20

SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs

01-20

AIOS: LLM Agent Operating System

01-20

SANA: EFFICIENT HIGH-RESOLUTION IMAGE SYNTHESIS WITH LINEAR DIFFUSION TRANSFORMERS

01-15

Block Transformer: Global-to-Local Language Modeling for Fast Inference

01-15

FLAME: Factuality-Aware Alignment for Large Language Models

01-15

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

01-15

Rethinking Optimization and Architecture for Tiny Language Models

01-15

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

01-02

Cascade Speculative Drafting for Even Faster LLM Inference

01-02

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

01-02

Gated Linear Attention Transformers with Hardware-Efficient Training

01-02

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

01-02

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

12-31

SparQ Attention: Bandwidth-Efficient LLM Inference

12-31

Improving alignment of dialogue agents via targeted human judgements

12-31

Language Models are General-Purpose Interfaces

12-31

OPT: Open Pre-trained Transformer Language Models

12-31

CBQ: Cross-Block Quantization for Large Language Models

12-30

SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion

12-30

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

12-26

Gemma: Open Models Based on Gemini Research and Technology

12-26

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

12-26

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

12-26

The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction

12-26

SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION

12-24

Gemma 2: Improving Open Language Models at a Practical Size

12-24

The Llama 3 Herd of Models

12-24

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

12-24

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

12-24

Communication Compression for Tensor Parallel LLM Inference

12-23

Context Parallelism for Scalable Million-Token Inference

12-23

SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile

12-23

FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs

12-23

Large Concept Models: Language Modeling in a Sentence Representation Space

12-20

Sharing and Throughput-oriented Token Batching

12-20

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

12-20

Star Attention: Efficient LLM Inference over Long Sequences

12-20

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

12-20

SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference

12-20

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

12-20

Byte Latent Transformer: Patches Scale Better Than Tokens

12-19

Memory Layers at Scale

12-19

Efficient Memory Management for Large Language Model Serving with PagedAttention

12-19

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

12-19

GSPMD: General and Scalable Parallelization for ML Computation Graphs

12-19

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

12-19

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

12-18

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

12-18

Fast Inference of Mixture-of-Experts Language Models with Offloading

12-18

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

12-18

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

12-18

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

12-18

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

12-17

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

12-17

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

12-17

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

12-17

Fast and Effective Weight Update for Pruned Large Language Models

12-17

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

12-16

MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

12-16

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

12-16

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

12-16

INFERFLOW: AN EFFICIENT AND HIGHLY CONFIGURABLE INFERENCE ENGINE FOR LARGE LANGUAGE MODELS

12-16

TP-Aware Dequantization

12-15

Break the Sequential Dependency of LLM Inference Using LOOKAHEAD DECODING

12-15

Decoding Speculative Decoding

12-15

Efficient Prompt Caching via Embedding Similarity

12-15

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

12-15

RelayAttention for Efficient Large Language Model Serving with Long System Prompts

12-14

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

12-14

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

12-14

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

12-14

Hydragen: High-Throughput LLM Inference with Shared Prefixes

12-14

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

12-14

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Model

12-13

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

12-13

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

12-13

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

12-13

Token Merging: Your ViT But Faster

12-13

Fast Transformer Decoding: One Write-Head is All You Need

12-13

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

12-12

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

12-12

LLM Inference Unveiled: Survey and Roofline Model Insights

12-12

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

12-12

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

12-12

Scaling Instruction-Finetuned Language Models

12-12

GLM-130B: An Open Bilingual Pre-trained Model

12-12

Benchmarks as Limits to Arbitrage: Understanding the Low-Volatility Anomaly

12-10

High Idiosyncratic Volatility and Low Returns: International and Further U.S. Evidence

12-10

CHAI: Clustered Head Attention for Efficient LLM Inference

12-10

QAQ: Quality Adaptive Quantization for LLM KV Cache

12-10

Transformers are Multi-State RNNs

12-10

Compressed Context Memory For Online Language Model Interaction

12-10

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

12-10

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

12-10

Galactica: A Large Language Model for Science

12-10

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

12-10

Momentum Strategies

12-09

Mixed Precision Quantization

12-09

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

12-09

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

12-09

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

12-09

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

12-09

DeepCache: Accelerating Diffusion Models for Free

12-09

Improving Language Understanding by Generative Pre-Training

12-08

QAQ: Quality Adaptive Quantization for LLM KV Cache

12-08

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

12-08

PaLM 2 Technical Report

12-08

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

12-06

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

12-06

FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

12-06

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

12-06

Fast Inference from Transformers via Speculative Decoding

12-06

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

12-06

HIERARCHICAL CONTEXT MERGING: BETTER LONG CONTEXT UNDERSTANDING FOR PRE-TRAINED LLMS

12-05

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

12-05

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

12-05

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

12-05

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

12-05

MELTing point: Mobile Evaluation of Language Transformers

12-05

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

12-05

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

12-05

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

12-05

CORM: Cache Optimization with Recent Message for Large Language Model Inference

12-04

Retrieval Head Mechanistically Explains Long-Context Factuality

12-04

SnapKV: LLM Knows What You are Looking for Before Generation

12-04

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

12-04

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

12-04

Toward Inference-optimal Mixture-of-Expert Large Language Models

12-04

Mistral 7B

12-04

Llama 2: Open Foundation and Fine-Tuned Chat Models

12-04

MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases

12-03

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

12-03

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

12-03

RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

12-03

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

12-03

Tree-based Speculative Inference and Verification

12-03

Fast Inference from Transformers via Speculative Decoding

12-03

KV Cache Compression

12-02

PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

12-02

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

12-02

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

12-02

You Only Cache Once: Decoder-Decoder Architectures for Language Models

12-02

LoCoCo: Dropping In Convolutions for Long Context Compression

11-29

Loki: Low-rank Keys for Efficient Sparse Attention

11-29

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

11-29

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

11-29

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

11-29

Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters

11-28

CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling

11-28

A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression

11-28

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

11-28

Effectively Compress KV Heads for LLM

11-28

MODEL TELLS YOU WHERE TO MERGE: ADAPTIVE KV CACHE MERGING FOR LLMS ON LONG-CONTEXT TASKS

11-27

Efficient Sparse Attention needs Adaptive Token Release

11-27

Benchmark of Long Context Capable Approaches

11-27

LOOK-M:Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

11-27

Dynamic Discriminative Operations (D2O) for Efficient Generative Inference of Large Language Models

11-27

Pruning in Transformer Decoder

11-26

Keep the Cost Down: A Review on Methods to Optimize LLM’s KV Cache Consumption.

11-26

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

11-26

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

11-26

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

11-26

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

11-25

Post-Training Sparse Attention with Double Sparsity

11-25

NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time

11-25

Palu: Compressing KV-Cache with Low-Rank Projection

11-25

ThinK: Thinner Key Cache by Query-Driven Pruning

11-25

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads

11-21

InfiniPot: Infinite Context Processing on Memory-Constrained LLMs

11-21

KV-COMPRESS: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

11-21

Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

11-21

TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning

11-21

DUOATTENTION: EFFICIENT LONG-CONTEXT LLM INFERENCE WITH RETRIEVAL AND STREAMING HEADS

11-20

TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training

11-20

TIDALDECODE: FAST AND ACCURATE LLM DECODING WITH POSITION PERSISTENT SPARSE ATTENTION

11-20

SPARSEVLM: VISUAL TOKEN SPARSIFICATION FOR EFFICIENT VISION-LANGUAGE MODEL INFERENCE

11-20

SWIFTKV: FAST PREFILL-OPTIMIZED INFERENCE WITH KNOWLEDGE-PRESERVING MODEL TRANSFORMATION

11-20

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

11-20

A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference

11-19

SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction

11-19

In-context KV-Cache Eviction for LLMs via Attention-Gate

11-19

Prompt Compression for Large Language Models: A Survey

11-19

Textbooks Are All You Need

11-19

Scaling Laws for Neural Language Models

11-19

Squeezed Attention: Accelerating Long Context Length LLM Inference

11-18

Recycled Attention: Efficient inference for long-context language models

11-18

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

11-18

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

11-18

MagicPIG: LSH Sampling for Efficient LLM Generation

11-18

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

11-14

Learning Transferable Visual Models From Natural Language Supervision

11-14

HART Efficient Visual Generation with Hybrid Autoregressive Transformer

11-14

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

11-14

The CoT Collection Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

11-14

VILA-U a Unified Foundation Model Integrating Visual Understanding and Generation

11-13

Condition-Aware Neural Network for Controlled Image Generation

11-13

DistriFusion Distributed Parallel Inference for High-Resolution Diffusion Models

11-13

VILA On Pre-training for Visual Language Models

11-13

FastComposer Tuning-Free Multi-Subject Image Generation with Localized Attention

11-13

ShadowKV KV Cache in Shadows for High-Throughput Long-Context LLM Inference

11-12

Query-Efficient Correlation Clustering with Noisy Oracle

11-12

LiteMoE Customizing On-device LLM Serving via Proxy Submodel Tuning

11-12

LaRS Latent Reasoning Skills for Chain-of-Thought Reasoning

11-12

Batch Calibration Rethinking Calibration for In-Context Learning and Prompt Engineering

11-12

Scientific Beta Multi-Beta Multi-Strategy Indices Implementing Multi-Factor Equity Portfolios with Smart Factor Indices

11-11

Foundations of Factor Investing

11-11

RAG4ITOps A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance

11-11

MagicPIG LSH Sampling for Efficient LLM Generation

11-11

EPIC Efficient Position-Independent Context Caching for Serving Large Language Models

11-11

ELICIT LLM Augmentation via External In-Context Capability

11-11

COMET: Towards Practical W4A4KV4 LLMs Serving

11-11

The Cross-Section of Expected Stock Returns

11-10

Portfolio Selection

11-10

Capital asset prices A theory of market equilibrium under conditions of risk

11-10

MInference 1.0 Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

11-10

HYSYNTH Context-Free LLM Approximation for Guiding Program Synthesis

11-10

DynamoLLM Designing LLM Inference Clusters for Performance and Energy Efficiency

11-10

Can Graph Learning Improve Planning in LLM-based Agents?

11-10

ALPINE Unveiling the Planning Capability of Autoregressive Learning in Language Models

11-10

Transformers are Multi-State RNNs

11-07

Meta Large Language Model Compiler Foundation Models of Compiler Optimization

11-07

Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning

11-07

Efficient Streaming Language Models with Attention Sinks

11-07

Model Tells You What to Discard Adaptive KV Cache Compression for LLMs

11-07

BUZZ Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

11-06

KVSharer Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

11-06

FLUX Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

11-06

Don't Look Twice Faster Video Transformers with Run-Length Tokenization

11-06

CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

11-06

Magicoder Empowering Code Generation with OSS-Instruct

11-05

SpotServe Serving Generative Large Language Models on Preemptible Instances

11-05

Optimal Kernel Orchestration for Tensor Programs with Korch

11-05

KernelGPT Enhanced Kernel Fuzzing via Large Language Models

11-05

Efficient Generative LLM Inference Using Phase Splitting

11-05

SpecExec Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

11-04

Sequoia Scalable, Robust, and Hardware-aware Speculative Decoding

11-04

Memory Bounds for the Experts Problem

11-04

Helix Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

11-04

GraphPipe Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

11-04

Reasoning over Public and Private Data in Retrieval-Based Systems

11-03

MEGABYTE Predicting Million-byte Sequences with Multiscale Transformers

11-03

Reasoning over Public and Private Data in Retrieval-Based Systems

11-03

Breaking the Curse of Quality Saturation with User-Centric Ranking

11-03

Teola Towards End-to-End Optimization of LLM-based Applications

11-01

Quest Query-Aware Sparsity for Efficient Long-Context LLM Inference

11-01

What Matters in Transformers? Not All Attention is Needed Fusion

11-01

KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

11-01

FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU

11-01

Prompt Cache Modular Attention Reuse for Low-Latency Inference

10-31

Better & Faster Large Language Models via Multi-token Prediction

10-31

Keyformer KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

10-31

CacheBlend Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

10-31

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

10-31

Frequently used Python script patterns

02-12

Abseil Tip 234 값, 포인터, 참조로 전달하기

12-24

Abseil Tip 232 변수 선언 시 auto를 언제 사용할 것인가

12-24

Abseil Tip 231 여기와 저기 사이 – 간과되기 쉬운 몇 가지 알고리즘

12-24

Abseil Tip 229 템플릿 메타프로그래밍을 위한 순위 기반 오버로드

12-18

Abseil Tip 227 빈 컨테이너와 부호 없는 정수 연산 주의하기

12-18

Abseil Tip 224 vector.at() 사용 피하기

12-18

Abseil Tip 197 Reader Lock은 드물게 사용해야 합니다

12-18

Abseil Tip 3 문자열 연결과 operator+ vs. StrCat()

12-17

Abseil Tip 218 FTADLE로 확장 지점 설계하기

12-17

Abseil Tip 215 AbslStringify()를 사용한 사용자 정의 타입 문자열화

12-17

Abseil Tip 198 태그 타입(Tag Types)

12-17

Abseil Tip 18 Substitute를 활용한 문자열 포맷팅

12-17

Abseil Tip 124 absl::StrFormat()

12-17

Abseil Tip 188 스마트 포인터를 함수 매개변수로 사용할 때 주의하세요

12-15

Abseil Tip 187 std::unique_ptr Must Be Moved

12-15

Abseil Tip 186 함수는 무명 네임스페이스에 두는 것을 선호하세요

12-15

Abseil Tip 76 absl::Status 사용하기

12-14

Abseil Tip 181 StatusOr 값 접근하기

12-14

Abseil Tip 165 초기화 구문을 포함한 if와 switch 문 사용하기

12-14

Abseil Tip 116 함수 인자에서 참조 사용 시 주의사항

12-14

Abseil Tip 5 사라지는 객체의 함정

12-12

Abseil Tip 177 할당 가능성과 데이터 멤버 타입

12-12

Abseil Tip 176 출력 매개변수 대신 반환 값을 선호하세요

12-12

Abseil Tip 175 C++14와 C++17의 리터럴 상수 변경 사항

12-12

Abseil Tip 173 옵션 구조체로 인수 래핑하기

12-12

Abseil Tip 172 지정 초기화자(Designated Initializers)

12-12

Abseil Tip 171 Sentinel 값 피하기

12-12

Abseil Tip 163 std::optional 매개변수 전달하기

12-12

Abseil Tip 140 상수(Constant) 처리 안전한 관용구

12-12

Abseil Tip 168 inline 변수

12-10

Abseil Tip 166 복사가 복사가 아닐 때

12-10

Abseil Tip 161 좋은 지역 변수와 나쁜 지역 변수

12-10

Abseil Tip 146 기본 초기화와 값 초기화

12-10

Abseil Tip 132 Avoid Redundant Map Lookups

12-10

Abseil Tip 108 std::bind를 피하세요

12-10

Abseil Tip 182 정수형 변수를 초기화하세요!

12-09

Abseil Tip 180 Dangling References(유효하지 않은 참조) 피하기

12-09

Abseil Tip 158 Abseil 연관 컨테이너와 contains()

12-09

Abseil Tip 147 Exhaustive switch 문을 책임감 있게 사용하기

12-09

Abseil Tip 90 Retired Flags(사용 중단된 플래그)

12-08

Abseil Tip 45 플래그를 피하라, 특히 라이브러리 코드에서

12-08

Abseil Tip 103 플래그는 전역 변수입니다

12-08

Abseil Tip 153 using-directives를 사용하지 마세요

12-06

Abseil Tip 152 AbslHashValue과 함께

12-06

Abseil Tip 144 연관 컨테이너에서의 이종 조회(Heterogeneous Lookup)

12-06

Abseil Tip 136 Unordered Containers

12-06

Abseil Tip 24 복사, 축약

12-05

Abseil Tip 149 Object Lifetimes vs = delete

12-05

Abseil Tip 148 Overload Sets

12-05

Abseil Tip 117 복사 생략과 값으로 전달하기

12-05

Abseil Tip 143 C++11 삭제된 함수 (= delete)

12-04

Abseil Tip 120 반환 값은 건드리지 마세요

12-04

Abseil Tip 11 반환 정책

12-04

Abseil Tip 93 absl::Span 사용하기

12-03

Abseil Tip 61 기본 멤버 초기화 (Default Member Initializers)

12-03

Abseil Tip 141 bool로의 암시적 변환에 주의하라

12-03

Abseil Tip 134 make_unique와 private 생성자

12-03

Abseil Tip 88 초기화 방법 =, (), 그리고 {}

12-02

Abseil Tip 59 튜플 연결하기

12-02

Abseil Tip 142 다중 매개변수 생성자와 explicit

12-02

Abseil Tip 36 새로운 Join API

11-29

Abseil Tip 3 문자열 연결과 operator+ vs. StrCat()

11-29

Abseil Tip 10 문자열 분리, 골치 아프지 않게

11-29

Abseil Tip 74 위임 생성자와 상속 생성자

11-28

Abseil Tip 42 초기화 메서드보다 팩토리 함수를 선호하세요

11-28

Abseil Tip 131 Special 멤버 함수와 = default

11-28

Abseil Tip 130 네임스페이스 이름 지정

11-26

Abseil Tip 123 absl::optional과 std::unique_ptr

11-26

Abseil Tip 119 using 선언과 네임스페이스 별칭 사용하기

11-26

Abseil Tip 99 비멤버 인터페이스 에티켓

11-25

Abseil Tip 126 make_unique는 새로운 new입니다

11-25

Abseil Tip 109 함수 선언에서 의미 있는 const 사용

11-25

Abseil Tip 65 제자리에 넣기

11-20

Abseil Tip 49 인자 기반 탐색

11-20

Abseil Tip 112 emplace vs. push_back

11-20

Abseil Tip 135 계약을 테스트하라, 구현을 테스트하지 마라

11-18

Abseil Tip 107 참조 수명 연장

11-18

Abseil Tip 101 반환 값, 참조 및 수명

11-18

Abseil Tip 86 클래스(enum class)를 활용한 열거형

11-14

Abseil Tip 77 임시 객체, 이동, 복사

11-14

Abseil Tip 64 Raw 문자열 리터럴

11-14

Abseil Tip 55 이름 개수 세기와 unique_ptr

11-13

Abseil Tip 122 테스트 픽스처, 명확성, 그리고 데이터 흐름

11-13

Abseil Tip 1 string_view의 활용 방법과 이점

11-12

Transformers are Multi-State RNNs

12-10

Galactica: A Large Language Model for Science

12-10

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

12-10

Improving Language Understanding by Generative Pre-Training

12-08

PaLM 2 Technical Report

12-08

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

12-06

Fast Inference from Transformers via Speculative Decoding

12-06

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

12-06

CHAI: Clustered Head Attention for Efficient LLM Inference

12-10

QAQ: Quality Adaptive Quantization for LLM KV Cache

12-10

Compressed Context Memory For Online Language Model Interaction

12-10

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

12-10

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

12-10

Mixed Precision Quantization

12-09

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

12-09

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

12-09

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

12-09

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

12-09

DeepCache: Accelerating Diffusion Models for Free

12-09

QAQ: Quality Adaptive Quantization for LLM KV Cache

12-08

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

12-08

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

12-06

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

12-06

FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

12-06

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

12-06

FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

12-06

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

12-06

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

02-10

DeepSeek-V3 Technical Report

01-21

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

02-09

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

02-11

Qwen2 Technical Report

02-10

Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

02-25

HEXGEN-2: DISAGGREGATED GENERATIVE INFERENCE OF LLMS IN HETEROGENEOUS ENVIRONMENT

02-25

You Only Prune Once: DESIGNING CALIBRATION-FREE MODEL COMPRESSION WITH POLICY LEARNING

02-25

Dynamic Diffusion Transformer

02-25

TypedThinker: Typed Thinking Improves Large Language Model Reasoning

02-24

FlashMask: Efficient and Rich Mask Extension of FlashAttention

02-24

Context Parallelism for Scalable Million-Token Inference

03-31

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

03-31

DIFFSERVE: EFFICIENTLY SERVING TEXT-TO-IMAGE DIFFUSION MODELS WITH QUERY-AWARE MODEL SCALING

03-17

AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution

03-17

EFFICIENT LLM INFERENCE USING DYNAMIC INPUT PRUNING AND CACHE-AWARE MASKING

03-12

Marconi: Prefix Caching for the Era of Hybrid LLMs

03-12

LAVA: LIFETIME-AWARE VM ALLOCATION WITH LEARNED DISTRIBUTIONS AND ADAPTATION TO MISPREDICTIONS

03-11

TurboAttention: Efficient Attention Approximation for High Throughputs LLMs

03-11

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

03-10

A PRACTICAL CROSS-LAYER APPROACH FOR ML-DRIVEN STORAGE PLACEMENT IN WAREHOUSE-SCALE COMPUTERS

03-10

Scaling Deep Learning Training with MPMD Pipeline Parallelism

03-10

LSERVE: EFFICIENT LONG-SEQUENCE LLM SERVING WITH UNIFIED SPARSE ATTENTION

03-06

VOLUT: EFFICIENT VOLUMETRIC STREAMING ENHANCED BY LUT-BASED SUPER-RESOLUTION

03-06

ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments

03-06

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

05-17

RODIMUS*: BREAKING THE ACCURACY-EFFICIENCY TRADE-OFF WITH EFFICIENT ATTENTIONS

05-17
류재훈

447 posts
24 categories
1 tag
RSS
E-mail · LinkedIn
© 2025 류재훈
Powered by Jekyll
Theme - NexT.Mist