Benchmark of Long Context Capable Approaches
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Dynamic Discriminative Operations (D2O) for Efficient Generative Inference of Large Language Models
Abseil Tip 130: Namespace Naming
Tip of the Week #130: Namespace Naming
Abseil Tip 123: absl::optional and std::unique_ptr
Tip of the Week #123: absl::optional and std::unique_ptr
Abseil Tip 119: Using Declarations and Namespace Aliases
Tip of the Week #119: Using Declarations and Namespace Aliases
Pruning in Transformer Decoder
Keep the Cost Down: A Review on Methods to Optimize LLM’s KV Cache Consumption.
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
PQCache: Product Quantization-based KVCache for Long Context LLM Inference
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Abseil Tip 99: Nonmember Interface Etiquette
Abseil Tip 126: make_unique is the new new
Abseil Tip 109: Meaningful const in Function Declarations
Korean translation
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Post-Training Sparse Attention with Double Sparsity
NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
Palu: Compressing KV-Cache with Low-Rank Projection
ThinK: Thinner Key Cache by Query-Driven Pruning
Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads
InfiniPot: Infinite Context Processing on Memory-Constrained LLMs
KV-COMPRESS: Paged KV-Cache Compression with Variable Compression Rates per Attention Head
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction
TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning
Abseil Tip 65: Putting Things in their Place
Korean translation
Abseil Tip 49: Argument-Dependent Lookup
Korean translation
Abseil Tip 112: emplace vs. push_back
Korean translation
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction
In-context KV-Cache Eviction for LLMs via Attention-Gate
Prompt Compression for Large Language Models: A Survey
Textbooks Are All You Need
Scaling Laws for Neural Language Models
Abseil Tip 135: Test the Contract, not the Implementation
Tip of the Week #135: Test the Contract, not the Implementation
Abseil Tip 107: Reference Lifetime Extension
The following is a Korean translation of "Tip of the Week #107: Reference Lifetime Extension":
Abseil Tip 101: Return Values, References, and Lifetimes
Tip of the Week #101: Return Values, References, and Lifetimes
Squeezed Attention: Accelerating Long Context Length LLM Inference
Recycled Attention: Efficient inference for long-context language models
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
MagicPIG: LSH Sampling for Efficient LLM Generation
Abseil Tip 86: Enumerating with Class (enum class)
Tip of the Week #86: Enumerating with Class (enum class)
Abseil Tip 77: Temporaries, Moves, and Copies
Tip of the Week #77: Temporaries, Moves, and Copies
Abseil Tip 64: Raw String Literals
Tip of the Week #64: Raw String Literals
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Paper: https://arxiv.org/abs/2201.11903
Learning Transferable Visual Models From Natural Language Supervision
Paper: https://arxiv.org/abs/2103.00020
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
Paper: https://arxiv.org/abs/2410.10812
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Paper: https://arxiv.org/abs/2410.10733v2
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
Paper: https://arxiv.org/abs/2305.14045
Abseil Tip 55: Name Counting and unique_ptr
Tip of the Week #55: Name Counting and unique_ptr
Abseil Tip 122: Test Fixtures, Clarity, and Dataflow
Tip of the Week #122: Test Fixtures, Clarity, and Dataflow
VILA-U: A Unified Foundation Model Integrating Visual Understanding and Generation
Paper: https://arxiv.org/abs/2409.04429
Condition-Aware Neural Network for Controlled Image Generation
Paper: https://arxiv.org/abs/2404.01143
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Paper: https://arxiv.org/abs/2402.19481
VILA: On Pre-training for Visual Language Models
Paper: https://arxiv.org/abs/2312.07533
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
Paper: https://arxiv.org/abs/2305.10431
Abseil Tip 1: How to Use string_view and Its Benefits
Abseil Tip #1: How to Use string_view and Its Benefits
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Paper: https://arxiv.org/abs/2410.21465
Query-Efficient Correlation Clustering with Noisy Oracle
Paper: https://arxiv.org/abs/2402.01400
LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning
Paper: https://dl.acm.org/doi/10.1145/3666025.3699355
LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning
Paper: https://aclanthology.org/2024.findings-emnlp.206/
Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering
Paper: https://arxiv.org/abs/2309.17249
Scientific Beta Multi-Beta Multi-Strategy Indices: Implementing Multi-Factor Equity Portfolios with Smart Factor Indices
Paper: https://conferences.pionline.com/uploads/conference_admin/ERI_Scientific_Beta_Publication_Scientific_Beta_Multi-Beta_Multi-Strategy_Indices_Equity_Portfolios.pdf
Foundations of Factor Investing
Paper: https://www.msci.com/documents/1296102/1336482/Foundations_of_Factor_Investing.pdf
RAG4ITOps: A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance
Paper: https://arxiv.org/abs/2410.15805v1
MagicPIG: LSH Sampling for Efficient LLM Generation
Paper: https://arxiv.org/abs/2410.16179
EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
Paper: https://arxiv.org/abs/2410.15332
ELICIT: LLM Augmentation via External In-Context Capability
Paper: https://arxiv.org/abs/2410.09343
COMET: Towards Practical W4A4KV4 LLMs Serving
Paper: https://arxiv.org/abs/2410.12168
The Cross-Section of Expected Stock Returns
Paper: https://www.jstor.org/stable/2329112
Portfolio Selection
Paper: https://www.jstor.org/stable/2975974
Capital Asset Prices: A Theory of Market Equilibrium under Conditions of Risk
Paper: https://www.jstor.org/stable/2977928
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Paper: https://arxiv.org/abs/2407.02490
HYSYNTH: Context-Free LLM Approximation for Guiding Program Synthesis
Paper: https://arxiv.org/abs/2405.15880v2
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
Paper: https://arxiv.org/abs/2408.00741
Can Graph Learning Improve Planning in LLM-based Agents?
Paper: https://arxiv.org/abs/2405.19119
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Paper: https://arxiv.org/abs/2405.09220
Transformers are Multi-State RNNs
Paper: https://arxiv.org/abs/2401.06104
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization
Paper: https://arxiv.org/abs/2407.02524
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Paper: https://www.usenix.org/system/files/osdi24-zhai.pdf
Efficient Streaming Language Models with Attention Sinks
Paper: https://arxiv.org/abs/2309.17453
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Paper: https://arxiv.org/abs/2310.01801
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
Paper: https://arxiv.org/abs/2410.23079
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
Paper: https://arxiv.org/abs/2410.18517
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
Paper: https://arxiv.org/abs/2406.06858
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Paper: https://openreview.net/pdf/e7782b237ab632c467717143b2b7ef283d71c282.pdf
CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs
Paper: https://i.cs.hku.hk/~cwu/papers/hphu-eurosys24.pdf
Magicoder: Empowering Code Generation with OSS-Instruct
Paper: https://arxiv.org/abs/2312.02120
SpotServe: Serving Generative Large Language Models on Preemptible Instances
Paper: https://arxiv.org/abs/2311.15566
Optimal Kernel Orchestration for Tensor Programs with Korch
Paper: https://arxiv.org/abs/2406.09465
KernelGPT: Enhanced Kernel Fuzzing via Large Language Models
Paper: https://arxiv.org/abs/2401.00563
Efficient Generative LLM Inference Using Phase Splitting
Paper: https://arxiv.org/abs/2311.18677v2
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Paper: https://arxiv.org/abs/2406.02532