SPARSEVLM: VISUAL TOKEN SPARSIFICATION FOR EFFICIENT VISION-LANGUAGE MODEL INFERENCE
SWIFTKV: FAST PREFILL-OPTIMIZED INFERENCE WITH KNOWLEDGE-PRESERVING MODEL TRANSFORMATION
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction
In-context KV-Cache Eviction for LLMs via Attention-Gate
Prompt Compression for Large Language Models: A Survey
Textbooks Are All You Need
Scaling Laws for Neural Language Models
Abseil Tip 135: Test the Contract, Not the Implementation
Tip of the Week #135: Test the Contract, Not the Implementation
Abseil Tip 107: Reference Lifetime Extension
Abseil Tip 101: Return Values, References, and Lifetimes
Tip of the Week #101: Return Values, References, and Lifetimes
Squeezed Attention: Accelerating Long Context Length LLM Inference
Recycled Attention: Efficient inference for long-context language models
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
MagicPIG: LSH Sampling for Efficient LLM Generation
Abseil Tip 86: Enumerations with enum class
Korean translation
Abseil Tip 77: Temporary Objects, Moves, and Copies
Tip of the Week #77: Temporary Objects, Moves, and Copies
Abseil Tip 64: Raw String Literals
Korean translation
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Paper: https://arxiv.org/abs/2201.11903
Learning Transferable Visual Models From Natural Language Supervision
Paper: https://arxiv.org/abs/2103.00020
HART Efficient Visual Generation with Hybrid Autoregressive Transformer
Paper: https://arxiv.org/abs/2410.10812
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Paper: https://arxiv.org/abs/2410.10733v2
The CoT Collection Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
Paper: https://arxiv.org/abs/2305.14045
Abseil Tip 55: Name Counting and unique_ptr
Korean translation
Abseil Tip 122: Test Fixtures, Clarity, and Dataflow
Korean translation
VILA-U a Unified Foundation Model Integrating Visual Understanding and Generation
Paper: https://arxiv.org/abs/2409.04429
Condition-Aware Neural Network for Controlled Image Generation
Paper: https://arxiv.org/abs/2404.01143
DistriFusion Distributed Parallel Inference for High-Resolution Diffusion Models
Paper: https://arxiv.org/abs/2402.19481
VILA On Pre-training for Visual Language Models
Paper: https://arxiv.org/abs/2312.07533
FastComposer Tuning-Free Multi-Subject Image Generation with Localized Attention
Paper: https://arxiv.org/abs/2305.10431
Abseil Tip 1: How to Use string_view and Its Benefits
Abseil Tip #1: How to Use string_view and Its Benefits
ShadowKV KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Paper: https://arxiv.org/abs/2410.21465
Query-Efficient Correlation Clustering with Noisy Oracle
Paper: https://arxiv.org/abs/2402.01400
LiteMoE Customizing On-device LLM Serving via Proxy Submodel Tuning
Paper: https://dl.acm.org/doi/10.1145/3666025.3699355
LaRS Latent Reasoning Skills for Chain-of-Thought Reasoning
Paper: https://aclanthology.org/2024.findings-emnlp.206/
Batch Calibration Rethinking Calibration for In-Context Learning and Prompt Engineering
Paper: https://arxiv.org/abs/2309.17249
Scientific Beta Multi-Beta Multi-Strategy Indices Implementing Multi-Factor Equity Portfolios with Smart Factor Indices
Paper: https://conferences.pionline.com/uploads/conference_admin/ERI_Scientific_Beta_Publication_Scientific_Beta_Multi-Beta_Multi-Strategy_Indices_Equity_Portfolios.pdf
Foundations of Factor Investing
Paper: https://www.msci.com/documents/1296102/1336482/Foundations_of_Factor_Investing.pdf
RAG4ITOps A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance
Paper: https://arxiv.org/abs/2410.15805v1
MagicPIG LSH Sampling for Efficient LLM Generation
Paper: https://arxiv.org/abs/2410.16179
EPIC Efficient Position-Independent Context Caching for Serving Large Language Models
Paper: https://arxiv.org/abs/2410.15332
ELICIT LLM Augmentation via External In-Context Capability
Paper: https://arxiv.org/abs/2410.09343
COMET Towards Practical W4A4KV4 LLMs Serving
Paper: https://arxiv.org/abs/2410.12168
The Cross-Section of Expected Stock Returns
Paper: https://www.jstor.org/stable/2329112
Portfolio Selection
Paper: https://www.jstor.org/stable/2975974
Capital asset prices A theory of market equilibrium under conditions of risk
Paper: https://www.jstor.org/stable/2977928
MInference 1.0 Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Paper: https://arxiv.org/abs/2407.02490
HYSYNTH Context-Free LLM Approximation for Guiding Program Synthesis
Paper: https://arxiv.org/abs/2405.15880v2
DynamoLLM Designing LLM Inference Clusters for Performance and Energy Efficiency
Paper: https://arxiv.org/abs/2408.00741
Can Graph Learning Improve Planning in LLM-based Agents?
Paper: https://arxiv.org/abs/2405.19119
ALPINE Unveiling the Planning Capability of Autoregressive Learning in Language Models
Paper: https://arxiv.org/abs/2405.09220
Transformers are Multi-State RNNs
Paper: https://arxiv.org/abs/2401.06104
Meta Large Language Model Compiler Foundation Models of Compiler Optimization
Paper: https://arxiv.org/abs/2407.02524
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Paper: https://www.usenix.org/system/files/osdi24-zhai.pdf
Efficient Streaming Language Models with Attention Sinks
Paper: https://arxiv.org/abs/2309.17453
Model Tells You What to Discard Adaptive KV Cache Compression for LLMs
Paper: https://arxiv.org/abs/2310.01801
BUZZ Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
Paper: https://arxiv.org/abs/2410.23079
KVSharer Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
Paper: https://arxiv.org/abs/2410.18517
FLUX Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
Paper: https://arxiv.org/abs/2406.06858
Don't Look Twice Faster Video Transformers with Run-Length Tokenization
Paper: https://openreview.net/pdf/e7782b237ab632c467717143b2b7ef283d71c282.pdf
CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs
Paper: https://i.cs.hku.hk/~cwu/papers/hphu-eurosys24.pdf
Magicoder Empowering Code Generation with OSS-Instruct
Paper: https://arxiv.org/abs/2312.02120
SpotServe Serving Generative Large Language Models on Preemptible Instances
Paper: https://arxiv.org/abs/2311.15566
Optimal Kernel Orchestration for Tensor Programs with Korch
Paper: https://arxiv.org/abs/2406.09465
KernelGPT Enhanced Kernel Fuzzing via Large Language Models
Paper: https://arxiv.org/abs/2401.00563
Efficient Generative LLM Inference Using Phase Splitting
Paper: https://arxiv.org/abs/2311.18677v2
SpecExec Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Paper: https://arxiv.org/abs/2406.02532
Sequoia Scalable, Robust, and Hardware-aware Speculative Decoding
Paper: https://arxiv.org/abs/2402.12374
Memory Bounds for the Experts Problem
Paper: https://arxiv.org/abs/2204.09837
Helix Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
Paper: https://arxiv.org/abs/2406.01566v1
GraphPipe Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
My take: I have read a few papers on pipeline parallelism, and this one seems like the most interesting of them. Building on the fairly natural idea of topology-aware pipeline parallelism over the computation graph, it appears to deliver solid results. That said, the implementation overhead looks like it could be significant.
Reasoning over Public and Private Data in Retrieval-Based Systems
My take: Not a field I know well, but I picked this paper because it caught my eye while browsing Meta's publications. (I don't know the area well enough to ask good questions about it, heh.)
MEGABYTE Predicting Million-byte Sequences with Multiscale Transformers
My take: Patch-based parallel computation, global-local interaction through cross-attention, and a tokenizer-free(?) design; it looks like an intriguing paper.
Breaking the Curse of Quality Saturation with User-Centric Ranking
My take: Not a field I know well, but I picked this paper because it caught my eye while browsing Meta's publications. (I don't know the area well enough to ask good questions about it, heh.)
Teola Towards End-to-End Optimization of LLM-based Applications
Paper: https://arxiv.org/abs/2407.00326
Quest Query-Aware Sparsity for Efficient Long-Context LLM Inference
Paper: https://arxiv.org/abs/2406.10774
What Matters in Transformers? Not All Attention is Needed
Paper: https://arxiv.org/abs/2406.15786v1
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Paper: https://arxiv.org/abs/2407.01527v1
FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU
Paper: https://arxiv.org/abs/2303.06865
Prompt Cache Modular Attention Reuse for Low-Latency Inference
Paper: https://arxiv.org/abs/2311.04934
Better & Faster Large Language Models via Multi-token Prediction
Paper: https://arxiv.org/abs/2404.19737
Keyformer KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Paper: https://arxiv.org/abs/2403.09054
CacheBlend Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
Paper: https://arxiv.org/abs/2405.16444
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Paper: https://arxiv.org/abs/2403.02310
Restarting the blog..
Looking at the date I first created this blog, it was back when I was hard at work in graduate school in Pohang. Around the time I was writing my first paper things got so hectic that the blog, which I had only been updating now and then, ground to a halt.
Frequently Used Python Script Patterns
Since I personally use Python more often than the shell, here is a quick summary of the patterns I use most often.
glob, os
glob is a library for Unix-style pathname pattern expansion. Combined with the os library, it makes it easy to find or rename files. It is a library I use a lot when parsing experiment results.
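A minimal sketch of this pattern (the results/ directory, the *.log extension, and the renaming rule below are made-up examples, not taken from any particular experiment):

import glob
import os

# Recursively collect every log file under a hypothetical results/ directory.
for path in sorted(glob.glob("results/**/*.log", recursive=True)):
    # Example transformation: normalize the extension to .txt.
    new_path = os.path.splitext(path)[0] + ".txt"
    print(f"{path} -> {new_path}")
    os.rename(path, new_path)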
Brief Paper Summary: TVM An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18)
Paper Summary: LLVM A Compilation Framework for Lifelong Program Analysis & Transformation (CGO 04)
Paper Summary: Chameleon Adaptive Code Optimization for Expedited Deep Neural Network Compilation (ICLR 2020)
Paper Summary: NeuroVectorizer End-to-End Vectorization with Deep Reinforcement Learning (CGO 20)
Brief Paper Summary: Fast and Effective Orchestration of Compiler Optimizations (Zhelong Pan, Rudolf Eigenmann; Purdue University; CGO '06)
Brief Paper Summary: End-to-End Deep Learning of Optimization Heuristics (PACT 17)
Brief Paper Summary: DARTS DIFFERENTIABLE ARCHITECTURE SEARCH (ICLR 2019)
LLVM loop unroll and jam pass and view-cfg
For a graduate compiler course, I did a term project training an ML model to decide whether to apply unroll-and-jam. As the name suggests, the unroll-and-jam pass is a loop optimization: it unrolls an outer loop and jams the resulting copies of the inner loop body together, increasing parallelism in the innermost loop body so that limited hardware resources are utilized better. If I remember correctly, it is applied starting at O2, but watching it through opt's debug output, it fires less often than you might expect. Looking at the LLVM code, it mostly reuses the loop-unroll and loop-fusion passes and only adds checks on top, which is why the unroll-and-jam pass often does not run even when the DA, LCSSA, and loop-simplify prerequisites are satisfied.
LLVM (clang) build and install (ubuntu 18.04)
clone llvm repo
git clone -b llvmorg-10.0.0 https://github.com/llvm/llvm-project.git llvm10
configure
Using Ninja can cut the compile time considerably.
cd llvm10
mkdir build
cd build
cmake -DLLVM_ENABLE_PROJECTS="clang;" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -G "Ninja" ../llvm
ninja