Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality 12-24
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration 12-20
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache 12-18
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models 12-17
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference 12-17
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving 12-16
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design 12-15
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference 12-14
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time 12-13
FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning 12-12
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization 12-12
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression 12-10
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference 12-09
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters 11-28
Dynamic Discriminative Operations (D2O) for Efficient Generative Inference of Large Language Models 11-27
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction 11-21
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection 11-18
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration 11-18
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning 11-14
Scientific Beta Multi-Beta Multi-Strategy Indices: Implementing Multi-Factor Equity Portfolios with Smart Factor Indices 11-11
RAG4ITOps: A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance 11-11
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning 11-07
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference 11-06
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices 11-04
GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism 11-04
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches 11-01
Paper Notes: Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation (ICLR 2020) 02-12
Brief Paper Notes: Fast and Effective Orchestration of Compiler Optimizations (Zhelong Pan, Rudolf Eigenmann; Purdue University; CGO'06) 02-12