EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

Posted on 2025-01-02 | In paper-review, with-gpt

Paper link

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Posted on 2024-12-31 | In paper-review, with-gpt

Paper link

SparQ Attention: Bandwidth-Efficient LLM Inference

Posted on 2024-12-31 | In paper-review, with-gpt

Paper link

Improving alignment of dialogue agents via targeted human judgements

Posted on 2024-12-31 | In paper-review, with-gpt

Paper link

Language Models are General-Purpose Interfaces

Posted on 2024-12-31 | In paper-review, with-gpt

Paper link

OPT: Open Pre-trained Transformer Language Models

Posted on 2024-12-31 | In paper-review, with-gpt

Paper link

CBQ: Cross-Block Quantization for Large Language Models

Posted on 2024-12-30 | In paper-review, with-gpt

Paper link

SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion

Posted on 2024-12-30 | In paper-review, with-gpt

Paper link

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Posted on 2024-12-26 | In paper-review, with-gpt

Paper link

Gemma: Open Models Based on Gemini Research and Technology

Posted on 2024-12-26 | In paper-review, with-gpt

Paper link

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Posted on 2024-12-26 | In paper-review, with-gpt

Paper link

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Posted on 2024-12-26 | In paper-review, with-gpt

Paper link

The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction

Posted on 2024-12-26 | In paper-review, with-gpt

Paper link

Abseil Tip 234 Pass by Value, by Pointer, or by Reference

Posted on 2024-12-24 | In cpp, abseil

Below is a Korean translation of “Tip of the Week #234: Pass by Value, by Pointer, or by Reference”.

Abseil Tip 232 When to Use auto for Variable Declarations

Posted on 2024-12-24 | In cpp, abseil

Below is a Korean translation of “Tip of the Week #232: When to Use auto for Variable Declarations”.

Abseil Tip 231 Between Here and There – Some Easily Overlooked Algorithms

Posted on 2024-12-24 | In cpp, abseil

Below is a Korean translation of “Tip of the Week #231: Between Here and There – Some Easily Overlooked Algorithms”.

SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION

Posted on 2024-12-24 | In paper-review, with-gpt

Paper link

Gemma 2: Improving Open Language Models at a Practical Size

Posted on 2024-12-24 | In paper-review, with-gpt

Paper link

The Llama 3 Herd of Models

Posted on 2024-12-24 | In paper-review, with-gpt

Paper link

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Posted on 2024-12-24 | In paper-review, with-gpt

Paper link

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Posted on 2024-12-24 | In paper-review, with-gpt

Paper link

Communication Compression for Tensor Parallel LLM Inference

Posted on 2024-12-23 | In paper-review, with-gpt

Paper link

Context Parallelism for Scalable Million-Token Inference

Posted on 2024-12-23 | In paper-review, with-gpt

Paper link

SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile

Posted on 2024-12-23 | In paper-review, with-gpt

Paper link

FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs

Posted on 2024-12-23 | In paper-review, with-gpt

Paper link

Large Concept Models: Language Modeling in a Sentence Representation Space

Posted on 2024-12-20 | In paper-review, with-gpt

Paper link

Sharing and Throughput-oriented Token Batching

Posted on 2024-12-20 | In paper-review, with-gpt

Paper link

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Posted on 2024-12-20 | In paper-review, with-gpt

Paper link

Star Attention: Efficient LLM Inference over Long Sequences

Posted on 2024-12-20 | In paper-review, with-gpt

Paper link

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Posted on 2024-12-20 | In paper-review, with-gpt

Paper link

SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference

Posted on 2024-12-20 | In paper-review, with-gpt

Paper link

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

Posted on 2024-12-20 | In paper-review, with-gpt

Paper link

Byte Latent Transformer: Patches Scale Better Than Tokens

Posted on 2024-12-19 | In paper-review, with-gpt

Paper link

Memory Layers at Scale

Posted on 2024-12-19 | In paper-review, with-gpt

Paper link

Efficient Memory Management for Large Language Model Serving with PagedAttention

Posted on 2024-12-19 | In paper-review, with-gpt

Paper link

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Posted on 2024-12-19 | In paper-review, with-gpt

Paper link

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Posted on 2024-12-19 | In paper-review, with-gpt

Paper link

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Posted on 2024-12-19 | In paper-review, with-gpt

Paper link

Abseil Tip 229 Ranked Overloads for Template Metaprogramming

Posted on 2024-12-18 | In cpp, abseil

Title: “Tip of the Week #229: Ranked Overloads for Template Metaprogramming”

Abseil Tip 227 Be Careful with Empty Containers and Unsigned Integer Arithmetic

Posted on 2024-12-18 | In cpp, abseil

Title: “Tip of the Week #227: Be Careful with Empty Containers and Unsigned Integer Arithmetic”

Abseil Tip 224 Avoid vector.at()

Posted on 2024-12-18 | In cpp, abseil

Title: “Tip of the Week #224: Avoid vector.at()”

Abseil Tip 197 Reader Locks Should Be Used Rarely

Posted on 2024-12-18 | In cpp, abseil

Title: “Tip of the Week #197: Reader Locks Should Be Used Rarely”

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Posted on 2024-12-18 | In paper-review, with-gpt

Paper link

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Posted on 2024-12-18 | In paper-review, with-gpt

Paper link

Fast Inference of Mixture-of-Experts Language Models with Offloading

Posted on 2024-12-18 | In paper-review, with-gpt

Paper link

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

Posted on 2024-12-18 | In paper-review, with-gpt

Paper link

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Posted on 2024-12-18 | In paper-review, with-gpt

Paper link

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Posted on 2024-12-18 | In paper-review, with-gpt

Paper link

Abseil Tip 3 String Concatenation and operator+ vs. StrCat()

Posted on 2024-12-17 | In cpp, abseil

Title: “Tip of the Week #3: String Concatenation and operator+ vs. StrCat()”

Abseil Tip 218 Designing Extension Points with FTADLE

Posted on 2024-12-17 | In cpp, abseil

Title: “Tip of the Week #218: Designing Extension Points with FTADLE”

Abseil Tip 215 Stringifying Custom Types with AbslStringify()

Posted on 2024-12-17 | In cpp, abseil

Title: “Tip of the Week #215: Stringifying Custom Types with AbslStringify()”

Abseil Tip 198 Tag Types

Posted on 2024-12-17 | In cpp, abseil

Below is a Korean translation of “Tip of the Week #198: Tag Types”.

Abseil Tip 18 String Formatting with Substitute

Posted on 2024-12-17 | In cpp, abseil


Abseil Tip 124 absl::StrFormat()

Posted on 2024-12-17 | In cpp, abseil

Title: “Tip of the Week #124: absl::StrFormat()”

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

Posted on 2024-12-17 | In paper-review, with-gpt

Paper link

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Posted on 2024-12-17 | In paper-review, with-gpt

Paper link

FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Posted on 2024-12-17 | In paper-review, with-gpt

Paper link

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

Posted on 2024-12-17 | In paper-review, with-gpt

Paper link

Fast and Effective Weight Update for Pruned Large Language Models

Posted on 2024-12-17 | In paper-review, with-gpt

Paper link

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Posted on 2024-12-16 | In paper-review, with-gpt

Paper link

MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Posted on 2024-12-16 | In paper-review, with-gpt

Paper link

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Posted on 2024-12-16 | In paper-review, with-gpt

Paper link

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Posted on 2024-12-16 | In paper-review, with-gpt

Paper link

INFERFLOW: AN EFFICIENT AND HIGHLY CONFIGURABLE INFERENCE ENGINE FOR LARGE LANGUAGE MODELS

Posted on 2024-12-16 | In paper-review, with-gpt

Paper link

Abseil Tip 188 Be Careful with Smart Pointer Function Parameters

Posted on 2024-12-15 | In cpp, abseil

Originally posted: December 10, 2020, as Tip of the Week #188

Abseil Tip 187 std::unique_ptr Must Be Moved

Posted on 2024-12-15 | In cpp, abseil

Originally posted: November 5, 2020, as Tip of the Week #187

Abseil Tip 186 Prefer to Put Functions in the Unnamed Namespace

Posted on 2024-12-15 | In cpp, abseil

Originally posted: November 5, 2020, as Tip of the Week #186

TP-Aware Dequantization

Posted on 2024-12-15 | In paper-review, with-gpt

Paper link

Break the Sequential Dependency of LLM Inference Using LOOKAHEAD DECODING

Posted on 2024-12-15 | In paper-review, with-gpt

Paper link

Decoding Speculative Decoding

Posted on 2024-12-15 | In paper-review, with-gpt

Paper link

Efficient Prompt Caching via Embedding Similarity

Posted on 2024-12-15 | In paper-review, with-gpt

Paper link

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Posted on 2024-12-15 | In paper-review, with-gpt

Paper link

Abseil Tip 76 Using absl::Status

Posted on 2024-12-14 | In cpp, abseil

Tip of the Week #76: Using `absl::Status`

Abseil Tip 181 Accessing the Value of a StatusOr

Posted on 2024-12-14 | In cpp, abseil

Tip of the Week #181: Accessing the Value of a `StatusOr<T>`

Abseil Tip 165 Using if and switch Statements with Initializers

Posted on 2024-12-14 | In cpp, abseil

Tip of the Week #165: Using `if` and `switch` Statements with Initializers

Abseil Tip 116 Cautions When Using References in Function Arguments

Posted on 2024-12-14 | In cpp, abseil

Tip of the Week #116: Cautions When Using References in Function Arguments

RelayAttention for Efficient Large Language Model Serving with Long System Prompts

Posted on 2024-12-14 | In paper-review, with-gpt

Paper link

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

Posted on 2024-12-14 | In paper-review, with-gpt

Paper link

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

Posted on 2024-12-14 | In paper-review, with-gpt

Paper link

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Posted on 2024-12-14 | In paper-review, with-gpt

Paper link

Hydragen: High-Throughput LLM Inference with Shared Prefixes

Posted on 2024-12-14 | In paper-review, with-gpt

Paper link

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Posted on 2024-12-14 | In paper-review, with-gpt

Paper link

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Posted on 2024-12-13 | In paper-review, with-gpt

Paper link

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

Posted on 2024-12-13 | In paper-review, with-gpt

Paper link

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Posted on 2024-12-13 | In paper-review, with-gpt

Paper link

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Posted on 2024-12-13 | In paper-review, with-gpt

Paper link

Token Merging: Your ViT But Faster

Posted on 2024-12-13 | In paper-review, with-gpt

Paper link

Fast Transformer Decoding: One Write-Head is All You Need

Posted on 2024-12-13 | In paper-review, with-gpt

Paper link

Abseil Tip 5 The Pitfall of Disappearing Objects