Jekyll2023-07-31T12:23:11+00:00https://deepmi.me/atom.xmlJaehun’s Blog류재훈LLVM (clang) build and install (ubuntu 18.04)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/2021/02/12/llvm-clang<h1 id="clone-llvm-repo">clone llvm repo</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone -b llvmorg-10.0.0 https://github.com/llvm/llvm-project.git llvm10
</code></pre></div></div>
<h1 id="configure">configure</h1>
<p>Using Ninja can significantly shorten compile times.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd llvm10
mkdir build
cd build
cmake -DLLVM_ENABLE_PROJECTS="clang" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -G "Ninja" ../llvm
ninja
</code></pre></div></div>
<p>Since you end up recompiling often while writing passes, I recommend using ccache and ninja to speed up builds (-DLLVM_CCACHE_BUILD=ON).
If ninja is not available, switch to -G "Unix Makefiles".
Note that with the Debug build type, linking consumes a large amount of RAM.
If you want to install the built clang to a specific location, use the -DCMAKE_INSTALL_PREFIX option.</p>류재훈LLVM loop unroll and jam pass and view-cfg2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/2021/02/12/llvm-unrollandjam<p><img src="/assets/images/llvm.jpeg" alt="" />
For a graduate compiler course term project, I trained an ML model to decide whether to apply unroll and jam.
As its name suggests, the unroll-and-jam pass is a loop optimization: it unrolls an outer loop and jams (fuses) the copies of the inner loop, increasing the parallelism of the innermost loop body and thus the utilization of limited resources.
If I remember correctly, it is enabled at -O2 and above, but judging from opt's debug output it is used less often than you might expect.
Looking at the LLVM code, the pass mostly reuses the loop-unroll and loop-fusion logic and only adds legality checks, so it often does not fire even when the da, lcssa, and loop-simplify prerequisites are satisfied.</p>
<p>After much trial and error, what I found is that adding the optimizations below makes the pass actually run.</p>
<p>Below is a matrix-computation example on which loop unroll and jam should be performed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define N 256
#define size 1024
int A[size][size];
int B[size][size];
int C[size][size];
void matmul() {
  int i, j, k;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        C[i][j] += A[k][i] * B[j][k];
}
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clang -Xclang -disable-O0-optnone -emit-llvm matmul.c -S
opt -stats -debug -loop-unroll-and-jam -allow-unroll-and-jam -unroll-and-jam-count=2 matmul.ll
</code></pre></div></div>
<p>Even with the commands above, the optimization is not applied; I wasted a lot of time figuring this out.</p>
<p>Adding the passes below makes unroll and jam run.</p>
<p><code class="language-plaintext highlighter-rouge">-mem2reg -simplifycfg -loop-rotate -instcombine</code></p>
<p>I also needed to inspect the CFG while doing this, and the following option proved useful.</p>
<p>Running opt -view-cfg matmul.ll produces the CFG below; xdot must be installed to view it (on Ubuntu: sudo apt-get install xdot).
<img src="/assets/images/unroll_and_jam.png" alt="" /></p>
<p>One conclusion of the project is that loop unroll and jam is, as marked in the diagram below, just one of the target-independent simple loop optimizations, so whether this single pass fires turns out to have less impact on overall optimization performance than expected. (Unfortunately..)
A more careful phase ordering, perhaps using methods studied in NAS, would likely yield further gains.
<img src="https://releases.llvm.org/8.0.0/tools/polly/docs/_images/LLVM-Passes-all.png" alt="" /></p>류재훈Brief paper notes DARTS DIFFERENTIABLE ARCHITECTURE SEARCH (ICLR 2019)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/nas/ml/paper-review/2021/02/12/paper-DARTS<h1 id="제목">Title</h1>
<p>DARTS: DIFFERENTIABLE ARCHITECTURE SEARCH</p>
<h1 id="저자">Authors</h1>
<p>Hanxiao Liu, Karen Simonyan, Yiming Yang</p>
<h1 id="motivation">Motivation</h1>
<p>Existing NAS methods require enormous time and cost (2000 GPU days of reinforcement learning, 3150 GPU days of evolution). The paper attributes this partly to the discrete search domain, which leads to a large number of architecture evaluations being required. Continuous parameters such as filter sizes had been learned before, but this paper aims to learn the blocks and the graph topology as well.</p>
<h1 id="contribution">Contribution</h1>
<ul>
<li>Replaces NAS over a discrete, non-differentiable search space (previously explored with RL or GA) with a continuous architecture representation trained by gradient descent via bilevel optimization.</li>
<li>Extensive experiments on image classification and language modeling (strong results)</li>
<li>Reduces search time compared with prior methods</li>
<li>Shows the approach is transferable across CNNs and RNNs</li>
</ul>
<h1 id="continuous-relaxation-and-optimization">CONTINUOUS RELAXATION AND OPTIMIZATION</h1>
<p><img src="/assets/images/darts1.PNG" alt="" /></p>
<p>The figure above and the equation below show how the operations are made continuous.
The operation on the edge between nodes $i$ and $j$ is chosen via a softmax over $\alpha$, as in the equation below; the figure above makes this intuitive.</p>
<p><img src="/assets/images/darts2.PNG" alt="" /></p>
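<p>To make the relaxation concrete, here is a tiny sketch (not the paper's code; the candidate operations are stand-ins) of the softmax-weighted mixed operation on one edge:</p>

```python
import numpy as np

def softmax(a):
    # numerically stable softmax over the architecture parameters alpha
    e = np.exp(a - a.max())
    return e / e.sum()

# hypothetical candidate operations on an edge (i, j)
ops = [
    lambda x: x,                 # identity / skip-connect
    lambda x: np.zeros_like(x),  # the "zero" op
    lambda x: 2.0 * x,           # stand-in for a conv op
]

def mixed_op(x, alpha):
    # o_bar(x) = sum_k softmax(alpha)_k * o_k(x) -- the continuous relaxation
    w = softmax(alpha)
    return sum(wk * op(x) for wk, op in zip(w, ops))

x = np.ones(4)
alpha = np.array([0.0, 0.0, 0.0])  # uniform weights -> average of the ops
print(mixed_op(x, alpha))          # each element: (1 + 0 + 2) / 3 = 1.0
```

<p>With uniform $\alpha$ the mixed op is just the average of the candidate ops; training shifts the weights toward the better ops, which the later discretization step picks out.</p>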
<p>With the building block defined above, we now train the weights and derive the final architecture.
This is done with the bilevel optimization below.
<img src="/assets/images/dart3.PNG" alt="" /></p>
<p><img src="/assets/images/darts4.PNG" alt="" /></p>
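<p>A toy, first-order sketch of the alternating updates for the bilevel problem above (the quadratic losses and step size are invented for illustration; the real objectives are the network's training and validation losses):</p>

```python
# Toy bilevel problem: the inner (training) loss is minimized over w,
# the outer (validation) loss over alpha; all functions are made up.
def l_train(w, alpha):
    return (w - alpha) ** 2

def l_val(w, alpha):
    return (w + alpha - 3.0) ** 2

w, alpha, lr = 0.0, 0.0, 0.1
for _ in range(500):
    # inner update: one gradient step on the training loss w.r.t. w
    w -= lr * 2 * (w - alpha)
    # first-order approximation: gradient of the validation loss w.r.t.
    # alpha with the current w treated as a constant (no second-order term)
    alpha -= lr * 2 * (w + alpha - 3.0)

print(round(w, 3), round(alpha, 3))  # both approach 1.5
```

<p>At the fixed point both losses are zero; the exact (second-order) update differentiates through the inner step instead of treating $w$ as constant.</p>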
<h1 id="approximate-architecture-gradient">APPROXIMATE ARCHITECTURE GRADIENT</h1>
<p>Personally this looks like a design choice, and since follow-up papers revisit it, it does not seem critical.
The bilevel optimization form above is reminiscent of the MAML formulation. This paper likewise rewrites the objective to reduce computation, including a first-order approximation (there is a trade-off, so the choice depends on the situation).</p>
<h1 id="deriving-discrete-architectures">DERIVING DISCRETE ARCHITECTURES</h1>
<p>To derive a discrete architecture, only the top-k strongest operations are kept (the zero op is excluded)</p>
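<p>A minimal sketch of this selection step (op names and $\alpha$ values are made up; here k = 1 per edge):</p>

```python
OPS = ["zero", "skip", "conv3x3", "conv5x5"]

def strongest_op(alpha):
    # pick the operation with the largest architecture weight on this edge,
    # excluding the "zero" op as in the paper
    idx = [i for i, name in enumerate(OPS) if name != "zero"]
    best = max(idx, key=lambda i: alpha[i])
    return OPS[best]

# zero has the largest weight, but it is excluded -> conv3x3 wins
print(strongest_op([2.0, 0.1, 1.5, 0.3]))  # conv3x3
```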
<h1 id="results">Results</h1>
<p>Compared with NASNET-A (2000 GPU days), AmoebaNet-A (3150 GPU days), and ENAS (0.5 GPU day), DARTS is considerably more time-efficient at a matched parameter count.
<img src="/assets/images/darts5.PNG" alt="" /></p>
<h2 id="cifar10">CIFAR-10</h2>
<p><img src="/assets/images/darts5.PNG" alt="" /></p>
<h2 id="ptb">PTB</h2>
<p><img src="/assets/images/darts6.PNG" alt="" /></p>
<h2 id="imagenet-in-the-mobile-setting">ImageNet in the mobile setting</h2>
<p><img src="/assets/images/darts7.PNG" alt="" /></p>
<h1 id="references">references</h1>
<p><a href="https://openreview.net/pdf?id=S1eYHoC5FX">paper</a>
<a href="https://github.com/quark0/darts.git">official code</a></p>류재훈Brief paper notes End-to-End Deep Learning of Optimization Heuristics (PACT 17)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-End-to-EndDeepLearning-of-pptimization-heuristics<h1 id="제목">Title</h1>
<p>End-to-End Deep Learning of Optimization Heuristics</p>
<h1 id="저자">Authors</h1>
<p>Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather</p>
<h1 id="motivation">Motivation</h1>
<p>Existing ML-based compiler optimization approaches require feature engineering by human experts.</p>
<h1 id="contribution">Contribution</h1>
<p>Using the proposed Source Rewriter & Language model, compiler optimization is performed directly on raw program code; in addition, transfer learning enables training even with a small number of programs.</p>
<h1 id="references">references</h1>
<p>https://ieeexplore.ieee.org/document/1281665</p>류재훈Brief paper notes Fast and Effective Orchestration of Compiler Optimizations (Zhelong Pan, Rudolf Eigenmann; Purdue University; CGO'06)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-Fast%20and%20EffectiveOrchestrationofCompilerOptimizations<h1 id="제목">Title</h1>
<p>Fast and Effective Orchestration of Compiler Optimizations</p>
<h1 id="저자">Authors</h1>
<p>Zhelong Pan, Rudolf Eigenmann</p>
<h1 id="motivation">Motivation</h1>
<p>Compile-time optimizations improve program performance overall, but some techniques cause performance degradation.
Insufficient information about the input program and the target architecture limits how accurate a purely compile-time model can be.</p>
<h1 id="contribution">Contribution</h1>
<ul>
<li>Proposes the Combined Elimination (CE) algorithm, combining the strengths of the existing Batch Elimination (BE) and Iterative Elimination (IE).</li>
<li>Shows improved performance even over Optimization Space Exploration (OSE) and Statistical Selection (SS).</li>
<li>Evaluates on a large set of realistic programs, yielding realistic results.
<h1 id="content">Content</h1>
</li>
<li>Exhaustive Search =>O(2^n)</li>
<li>Batch Elimination => O(n)
<ul>
<li>Removes, in one batch, every optimization whose Relative Improvement Percentage (RIP) is negative</li>
</ul>
</li>
<li>Iterative Elimination => O(n^2)
<ul>
<li>Repeatedly removes the single optimization with the most negative RIP</li>
</ul>
</li>
<li>Combined Elimination => O(n^2)
<ul>
<li>Repeatedly removes all optimizations showing a negative RIP</li>
</ul>
</li>
<li>Optimization Space Exploration(OSE) => O(n^3)
<ul>
<li>The basic idea of the pruning algorithm is to iteratively find better optimization combinations by merging the beneficial ones.</li>
</ul>
</li>
<li>Statistical Selection (SS) =>O(n^2)
<ul>
<li>It uses a statistical method to identify the performance effect of the optimization options. The options with positive effects are turned on, while the ones with negative effects are turned off in the final version, in an iterative fashion
<h1 id="results">Results</h1>
</li>
</ul>
</li>
</ul>
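<p>The elimination idea can be sketched as follows; this is a toy illustration, not the paper's implementation: the program runtime is mocked, and an option counts as harmful when disabling it yields a positive RIP against the current baseline.</p>

```python
def rip(t_base, t_exp):
    # Relative Improvement Percentage of an experimental run vs. a baseline
    return (t_base - t_exp) / t_base * 100.0

def combined_elimination(options, run):
    """Measure each enabled option's effect; disable ALL options with a
    harmful effect in one pass, re-measure, and repeat until none remain."""
    enabled = set(options)
    while True:
        t_base = run(enabled)
        harmful = []
        for opt in sorted(enabled):
            t = run(enabled - {opt})  # time with this one option flipped off
            if rip(t_base, t) > 0:    # faster without it -> opt is harmful
                harmful.append(opt)
        if not harmful:
            return enabled
        enabled -= set(harmful)

# mock "execution time": O2 helps, O3 hurts, unroll hurts only with O3
def run(enabled):
    t = 10.0
    if "O2" in enabled: t -= 2.0
    if "O3" in enabled: t += 1.0
    if "unroll" in enabled and "O3" in enabled: t += 0.5
    return t

print(sorted(combined_elimination({"O2", "O3", "unroll"}, run)))  # ['O2']
```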
<p><img src="/assets/images/fe1.png" alt="" />
<img src="/assets/images/fe2.png" alt="" /></p>
<h1 id="references">references</h1>
<p>https://arxiv.org/abs/1802.04799</p>류재훈Paper notes NeuroVectorizer End-to-End Vectorization with Deep Reinforcement Learning (CGO 20)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-NeuroVectorizer<p><img src="/assets/images/nv1.png" alt="" /></p>
<h1 id="제목">Title</h1>
<p>NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning</p>
<h1 id="저자">Authors</h1>
<p>Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Sophia Shao, Krste Asanovic, Ion Stoica</p>
<h1 id="motivation">Motivation</h1>
<p>Compilers today use fixed-cost models, based on heuristics, to make vectorization decisions on loops. However, these models are unable to capture the data dependencies, the computation graph, or the organization of instructions.
Vectorization is critical to the performance of compute-intensive workloads on modern computers.</p>
<h1 id="contribution">Contribution</h1>
<p>A comprehensive data set of more than 10,000 synthetic loop examples.
An end-to-end deep reinforcement learning (RL) based auto loop-vectorization method</p>
<h1 id="개인적인-느낌">Personal impression</h1>
<p>Honestly, the search space is so small that I doubt how meaningful the result is.</p>
<h1 id="the-proposed-framework-architecture">The Proposed Framework Architecture</h1>
<p><img src="/assets/images/nv2.png" alt="" /></p>
<h1 id="code-embedding">Code Embedding</h1>
<ul>
<li>Code2vec(Embedding Network) represents a code snippet as a single fixed-length code vector, which can be used to predict the semantic properties of the snippet.</li>
<li>This vector captures many characteristics of the code, such as semantic similarities, combinations, and analogies</li>
</ul>
<p><img src="/assets/images/nv3.png" alt="" />
A code snippet and its predicted labels as computed by code2vec
<a href="https://arxiv.org/pdf/1803.09473.pdf">reference</a>
<img src="/assets/images/nv4.png" alt="" />
The architecture of the path-attention network. A fully connected layer learns to combine the embeddings of
each path-context with itself; attention weights are learned over the combined context vectors and used to
compute a code vector. The code vector is used to predict the label.
<a href="https://arxiv.org/pdf/1803.09473.pdf">reference</a></p>
<h1 id="automatic-vectorization-example">Automatic Vectorization Example</h1>
<p><img src="/assets/images/nv5.png" alt="" /></p>
<h1 id="the-rl-environment-definition">The RL Environment Definition</h1>
<p><img src="/assets/images/nv6.png" alt="" />
where baseline is the execution time when compiled with the currently implemented baseline cost model in LLVM and RL is the execution time when compiled with the injected pragmas by the RL agent
<img src="/assets/images/nv7.png" alt="" />
where MAX_VF and MAX_IF are respectively the maximum
VF and IF supported by the underlying architecture</p>
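<p>My reading of the reward above is the relative improvement over the baseline, (baseline − RL) / baseline; under that assumption the penalty of −9 mentioned later corresponds exactly to running ten times slower than the baseline:</p>

```python
def reward(baseline, rl):
    # relative speedup of the RL-chosen pragmas over the LLVM baseline
    # (my reading of the formula in the figure, not code from the paper)
    return (baseline - rl) / baseline

print(reward(1.0, 0.5))   # 0.5: twice as fast as the baseline
print(reward(1.0, 10.0))  # -9.0: ten times slower, the paper's penalty value
```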
<h1 id="dataset-description">Dataset Description</h1>
<p><img src="/assets/images/nv8.png" alt="" />
To speed up the training, and make it more efficient,
we built a dataset that includes loops only. We built generators that generate more than 10,000 synthetic loop examples automatically from the LLVM vectorization test-suite.</p>
<h1 id="handling-long-compilation-time">Handling Long Compilation Time</h1>
<ul>
<li>During training, some of the programs took a long time to compile, mainly when the agent was trying to vectorize more than plausible</li>
<li>giving a penalty reward of −9 (equivalent to assuming it takes ten times the execution time of the baseline) so that the agent will learn not to overestimate the vectorization and avoid it
<h1 id="resultsreward-mean-and-training-loss-for-different-action-space-definitions">Results:Reward mean and training loss for different action space definitions</h1>
<p><img src="/assets/images/nv9.png" alt="" /></p>
<h1 id="resultsthe-performance-of-the-proposed-vectorizer">Results:The performance of the proposed vectorizer</h1>
<p><img src="/assets/images/nv10.png" alt="" />
The performance is normalized to the baseline(VF = 4, IF =
2)</p>
<h1 id="resultsnormalized-average-performance-of-supervised-fcnn-and-deep-rl">Results:Normalized average performance of supervised FCNN and deep RL</h1>
<p><img src="/assets/images/nv11.png" alt="" /></p>
<h1 id="resultsthe-performance-of-the-proposed-vectorizer-on">Results:The performance of the proposed vectorizer on MiBench</h1>
<p><img src="/assets/images/nv12.png" alt="" />
Mibench compared to Polly and the baseline cost model</p>
</li>
</ul>
<h1 id="references">references</h1>
<p>https://arxiv.org/abs/1909.13639</p>류재훈Paper notes Chameleon Adaptive Code Optimization for Expedited Deep Neural Network Compilation (ICLR 2020)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-chameleon<p><img src="/assets/images/chameleon1.jpg" alt="" /></p>
<h1 id="제목">Title</h1>
<p>Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation</p>
<h1 id="저자">Authors</h1>
<p>Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, Hadi Esmaeilzadeh</p>
<h1 id="motivation">Motivation</h1>
<p>Current approaches are oblivious to the patterns in the design space of schedules available for exploitation, which causes inefficient search and can even converge to suboptimal solutions.
Current solutions that rely on greedy sampling leave significant fractions of the candidate configurations redundant across iterations (long compilation time).</p>
<h1 id="contribution">Contribution</h1>
<ul>
<li>Devising an <strong>Adaptive Exploration</strong> module that utilizes reinforcement learning to adapt to unseen design space of new networks to reduce search time yet achieve better performance.</li>
<li>Proposing an <strong>Adaptive Sampling</strong> algorithm that utilizes clustering to adaptively reduce the number of costly hardware measurements
<h1 id="개인적인-생각">Personal take</h1>
<p>A clean, well-motivated paper: use RL to explore well and sample efficiently in order to cut optimization time.</p>
<h1 id="overall-design">Overall design</h1>
<p><img src="/assets/images/chameleon2.png" alt="" /></p>
</li>
</ul>
<h1 id="adaptive-exploration">Adaptive Exploration</h1>
<ul>
<li>TVM leverages simulated annealing, which relies on the stochastic guarantees of its random walks (requiring numerous iterations of exploration) and is thus insufficient to keep up with disruptive innovations in neural networks</li>
<li>Adaptive Exploration, based Reinforcement Learning ,is concerned with learning to maximize reward given an environment by making good exploration and exploitation tradeoffs</li>
<li>These networks not only learn the dependencies among the different knobs of the design space (which are interrelated), helping the module navigate the design space, but also learn the potential gains of modifications to the configurations.</li>
</ul>
<h1 id="learning-procedure">Learning procedure</h1>
<p><img src="/assets/images/chameleon3.png" alt="" /></p>
<h1 id="adaptive-sampling--reducing-number-of-costly-hardware-measurements">Adaptive Sampling : Reducing number of costly hardware measurements</h1>
<p><img src="/assets/images/chameleon4.png" alt="" /></p>
<ul>
<li>we observe that the candidate configurations are clustered in subregions of the design space</li>
<li>Adaptive Sampling iterates over different numbers of clusters (k-means), examining the resulting centroids and the L2 loss.</li>
<li>Selecting the number of centroids entails an important tradeoff, chosen at the knee of the L2-loss vs. performance-degradation curve
<h1 id="improving-candidate-configurations-using-sample-synthesis">Improving candidate configurations using sample synthesis</h1>
</li>
<li>Many of the automated approaches for black-box optimization are prone to invalid configurations</li>
<li>These invalid configurations not only blow the chances for better exploration but also leads to an extra optimization time overhead to reset the physical hardware for the subsequent hardware measurement</li>
<li>When our compiler runs into redundant samples, the proposed synthesis method analyzes the candidate samples to determine the most probable (most frequent = mode function) non-invalid choice for each knob to come up with a new configuration
</li>
<li>
<img src="/assets/images/chameleon5.png" alt="" />
When redundant samples appear, each knob of the synthesized configuration is set to its most probable (most frequent = mode function) valid value
<h1 id="evaluation">Evaluation</h1>
<p><img src="/assets/images/chameleon6.png" alt="" />
Task Index => layer order
<img src="/assets/images/chameleon7.png" alt="" />
Overall, the observation is that CHAMELEON's Adaptive Exploration requires 2.88× fewer search steps than simulated annealing to find a good solution.
<img src="/assets/images/chameleon8.png" alt="" />
<img src="/assets/images/chameleon9.png" alt="" />
<img src="/assets/images/chameleon10.png" alt="" /></p>
</li>
</ul>
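<p>The sample-synthesis step described above (take the most frequent valid value per knob) can be sketched like this; the knob names, candidate configurations, and validity rule are all made up:</p>

```python
from statistics import mode

# hypothetical candidate schedule configurations: one value per knob
candidates = [
    {"tile": 8,  "unroll": 4, "vectorize": 1},
    {"tile": 8,  "unroll": 2, "vectorize": 1},
    {"tile": 16, "unroll": 4, "vectorize": 0},
]

def synthesize(candidates, is_valid):
    # per knob, take the mode (most frequent value) among the candidates,
    # restricted to values the hardware/compiler accepts
    config = {}
    for knob in candidates[0]:
        values = [c[knob] for c in candidates if is_valid(knob, c[knob])]
        config[knob] = mode(values)
    return config

# made-up validity rule: every sampled value happens to be acceptable
print(synthesize(candidates, lambda knob, v: True))
```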
<h1 id="references">references</h1>
<p>https://openreview.net/forum?id=rygG4AVFvH</p>
<h1 id="project-page">Project Page</h1>
<p>https://github.com/anony-sub/chameleon</p>류재훈Paper notes LLVM A Compilation Framework for Lifelong Program Analysis & Transformation (CGO 04)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/paper-review/2021/02/12/paper-llvm<h1 id="제목">Title</h1>
<p>LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation</p>
<h1 id="저자">Authors</h1>
<p>Chris Lattner, Vikram Adve</p>
<h1 id="개인적으로-느끼는-논문의-insight">The paper's insight, as I see it</h1>
<p>By introducing the concept of lifelong program analysis, it performs whole-program optimization everywhere except the front end, with an SSA-based IR and machine-independent optimization.
The concepts presented do not exactly match today's LLVM, but it is impressive nonetheless..</p>
<h1 id="motivation">Motivation</h1>
<ul>
<li>Multiple-stages of analysis & transformation</li>
<li>compile-time, link-time, install-time, run-time, idle-time</li>
<li>Use aggressive interprocedural optimizations</li>
<li>Gather and exploit end-user profile information</li>
<li>Tune the application to the user’s hardware
<h1 id="contributions">Contributions</h1>
</li>
<li>A persistent, rich code representation
<ul>
<li>Enables analysis & optimization throughout lifetime</li>
</ul>
</li>
<li>Offline native code generation
<ul>
<li>Must be able to generate high-quality code statically</li>
</ul>
</li>
<li>Profiling & optimization in the field
<ul>
<li>Adapt to the end-user’s usage patterns</li>
</ul>
</li>
<li>Language independence
<ul>
<li>No runtime, object model, or exception semantics</li>
</ul>
</li>
<li>Uniform whole-program optimization
<ul>
<li>Allow optimization across languages and runtime
<h1 id="instruction-set">Instruction Set</h1>
</li>
</ul>
</li>
<li>Avoids machine specific constraints</li>
<li>Infinite set of typed virtual registers
<ul>
<li>In SSA form</li>
<li>Includes support for phi functions</li>
<li>This allows flow insensitive algorithm to gain benefits of flow sensitive without expensive Data Flow analysis</li>
</ul>
</li>
<li>Avoids same code for multiple instructions (overloaded opcodes)</li>
<li>Exceptions mechanism based on two instructions invoke and unwind
<h1 id="llvm-compiler-architecture">LLVM Compiler Architecture</h1>
<p><img src="/assets/images/llvm1.png" alt="" />
<strong>This strategy provides the 5 benefits</strong></p>
</li>
<li>Some limitations
<ul>
<li>Language specific optimizations must be performed on frontend</li>
<li>Benefit to languages like Java(JVM) requiring sophisticated runtime systems?</li>
</ul>
</li>
<li>Front-end compiler
<ul>
<li>Translate source code to LLVM representation</li>
<li>Perform language specific optimizations</li>
<li>Need not perform SSA construction at this time</li>
<li>Invoke LLVM passes for global inter procedural optimization at module level</li>
</ul>
</li>
<li>Linker/Interprocedure Optimizer
<ul>
<li>Various analyses occur
<ul>
<li>Points-to analysis</li>
<li>Mod/Ref analysis</li>
<li>Dead global elimination, dead argument elimination, constant, propagation, array bounds check, etc</li>
<li>Can be sped up by adding inter-procedural summaries</li>
</ul>
</li>
</ul>
</li>
<li>Native Code Generation
<ul>
<li>JIT or Offline</li>
<li>Currently supports Sparc V9 and x86 architectures</li>
</ul>
</li>
<li>Reoptimizers
<ul>
<li>Identifies frequently run code and ‘hotspots’</li>
<li>Performs additional optimizations, thus native code generation can be performed ahead of time</li>
<li>Idle-time reoptimizer</li>
</ul>
</li>
</ul>
<h1 id="resultshow-do-high-level-features-map-onto-llvm">Results:How do high-level features map onto LLVM?</h1>
<p><img src="/assets/images/llvm2.png" alt="" />
The table shows that many of these programs (164, 176,
179, 181, 183, 186, & 256) are surprisingly type-safe, despite
the fact that the programming language does not enforce
type-safety.
<img src="/assets/images/llvm3.png" alt="" />
The figure shows that LLVM code is about the same size
as native executables for SPARC, and is roughly 25% larger
on average for x86
<img src="/assets/images/llvm4.png" alt="" />
DGE (aggressive Dead Global variable and function Elimination), DAE (aggressive Dead Argument Elimination), inline (a function integration pass), DSA (Data Structure Analysis), and GCC (time to compile the programs with the gcc 3.3 compiler at -O3, provided as a reference point)</p>
<h1 id="references">references</h1>
<p>https://ieeexplore.ieee.org/document/1281665</p>류재훈Brief paper notes TVM An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-tvm<p><img src="/assets/images/tvm1.png" alt="" /></p>
<h1 id="제목">Title</h1>
<p>TVM: An Automated End-to-End Optimizing Compiler for Deep Learning</p>
<h1 id="tvm">TVM?</h1>
<p>This paper describes TVM, one of the best-known machine-learning compilers. It is now maintained under Apache; it performs target-independent optimization through a graph-level IR and
target-dependent optimization through autotuning, and via LLVM and VTA it supports not only CPUs and GPUs but also FPGAs as backends.</p>
<h1 id="저자">Authors</h1>
<p>Tianqi Chen and Thierry Moreau, University of Washington; Ziheng Jiang, University of Washington, AWS; Lianmin Zheng, Shanghai Jiao Tong University; Eddie Yan, Haichen Shen, and Meghan Cowan, University of Washington; Leyuan Wang, UC Davis, AWS; Yuwei Hu, Cornell; Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy</p>
<h1 id="motivation">Motivation</h1>
<p>As diverse hardware back-ends (GPU, FPGA, ASIC) proliferate, the compiler optimizations suited to each architecture inevitably differ, and hand-tuning by human experts does not scale.</p>
<h1 id="contribution">Contribution</h1>
<p>The paper presents an ML-driven method for lowering high-level ML graph operations into executable code suited to a specific device. It reads largely as a blueprint of the overall framework and as a typical DL-compiler architecture, so I did not find the contribution itself very striking.</p>
<h1 id="content">Content</h1>
<p>Graph-level modification & hardware-aware optimization</p>
<ul>
<li>Operator Fusion
<ul>
<li>Combines many small ops</li>
</ul>
</li>
<li>Constant Folding
<ul>
<li>Pre-computes static graphs</li>
</ul>
</li>
<li>Static Memory Planning Pass
<ul>
<li>Pre-allocates memory for needed tensors</li>
</ul>
</li>
<li>Data Layout Transformations
<ul>
<li>Optimize data storage for each backend</li>
</ul>
</li>
<li>Uses ML for the cost model
<ul>
<li>Predicts costs with XGBoost from features extracted from the query
<h1 id="references">references</h1>
<p>https://arxiv.org/abs/1802.04799</p>
<h1 id="project-page">Project Page</h1>
<p>https://tvm.apache.org</p>
</li>
</ul>
</li>
</ul>류재훈Frequently used Python script patterns2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/python/2021/02/12/python_script<p>Since I personally use Python more often than the shell, here is a quick summary of patterns I use frequently</p>
<h2 id="globos">glob,os</h2>
<p>glob is a Unix-style pathname pattern-expansion library. Combined with the os library, it makes finding and manipulating files easy.
I use it a lot when parsing experiment results.</p>
<p>Example: when both a png file and an npy file exist under the result folder, operate on the npy file</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import glob
import os
...
for name in glob.glob('result/**/*.png', recursive=True):
    np_name = name.replace('png', 'npy')
    if os.path.exists(np_name):
        np.load(np_name)
        ....
</code></pre></div></div>
<h2 id="subprocess">subprocess</h2>
<p>subprocess runs shell commands from Python.
I use it a lot when running experiments while sweeping parameters.
Example: run experiments while varying parameters A and B.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import subprocess
basetext='python3 text.py --numA _numA --numB _numB 2>&1'
for numA in [30,40,50]:
    for numB in [1.0,1.5,2.0]:
        text=basetext
        text=text.replace('_numA',f'{numA}')
        text=text.replace('_numB',f'{numB}')
        proc = subprocess.Popen(text, shell=True, executable='/bin/bash')
        proc.communicate()
</code></pre></div></div>
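<p>A variant of the sweep above using itertools.product instead of hand-written nested loops (text.py and the flags are the same placeholders as above):</p>

```python
import itertools

# enumerate all (numA, numB) combinations instead of nesting loops by hand
cmds = [
    f'python3 text.py --numA {a} --numB {b} 2>&1'
    for a, b in itertools.product([30, 40, 50], [1.0, 1.5, 2.0])
]
print(len(cmds))  # 9
print(cmds[0])    # python3 text.py --numA 30 --numB 1.0 2>&1
# each entry can then be executed with:
#   subprocess.run(cmd, shell=True, executable='/bin/bash')
```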
<h2 id="schedule">schedule</h2>
<p>schedule is Python's crontab.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import schedule
import time

def job():
    print("Do Job...!!!")

schedule.every(10).minutes.do(job)
schedule.every().hour.do(job)
schedule.every().day.at("10:30").do(job)
schedule.every(5).to(10).minutes.do(job)
schedule.every().monday.do(job)
schedule.every().wednesday.at("13:15").do(job)
schedule.every().minute.at(":17").do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
</code></pre></div></div>류재훈