Jekyll2023-07-31T12:23:11+00:00https://deepmi.me/atom.xmlJaehun’s Blog류재훈LLVM (clang) build and install (ubuntu 18.04)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/2021/02/12/llvm-clang<h1 id="clone-llvm-repo">clone llvm repo</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone -b llvmorg-10.0.0 https://github.com/llvm/llvm-project.git llvm10
</code></pre></div></div>
<h1 id="configure">configure</h1>
<p>Using Ninja can significantly shorten compile times.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd llvm10
mkdir build
cd build
cmake -DLLVM_ENABLE_PROJECTS="clang" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -G "Ninja" ../llvm
ninja
</code></pre></div></div>
<p>Since you end up recompiling often while writing passes, I recommend using ccache and ninja to speed up builds (-DLLVM_CCACHE_BUILD=ON).
If ninja is not available, switch to -G "Unix Makefiles".
Note that with the Debug build type, linking consumes a large amount of RAM.
If you want to install the built clang to a specific location, use the -DCMAKE_INSTALL_PREFIX option.</p>류재훈LLVM loop unroll and jam pass and view-cfg2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/2021/02/12/llvm-unrollandjam<p><img src="/assets/images/llvm.jpeg" alt="" />
For a graduate compiler course term project, I trained an ML model to decide whether to apply unroll and jam.
As its name suggests, the unroll-and-jam pass is a loop optimization: it unrolls an outer loop and jams (fuses) the copies of the inner loop, increasing the parallelism of the innermost loop body and thus the utilization of limited resources.
If I remember correctly, it is enabled at -O2 and above, but judging from opt's debug output it is used less often than you might expect.
Looking at the LLVM code, the pass mostly reuses the loop-unroll and loop-fusion logic and only adds legality checks, so it often does not fire even when the da, lcssa, and loop-simplify prerequisites are satisfied.</p>
<p>After much trial and error, what I found is that adding the optimizations below makes the pass actually run.</p>
<p>Below is a matrix-computation example on which loop unroll and jam should be performed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define N 256
#define size 1024
int A[size][size];
int B[size][size];
int C[size][size];
void matmul() {
  int i, j, k;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        C[i][j] += A[k][i] * B[j][k];
}
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clang -Xclang -disable-O0-optnone -emit-llvm matmul.c -S
opt -stats -debug -loop-unroll-and-jam -allow-unroll-and-jam -unroll-and-jam-count=2 matmul.ll
</code></pre></div></div>
<p>Even with the commands above, the optimization is not applied; I wasted a lot of time figuring this out.</p>
<p>Adding the passes below makes unroll and jam run.</p>
<p><code class="language-plaintext highlighter-rouge">-mem2reg -simplifycfg -loop-rotate -instcombine</code></p>
<p>I also needed to inspect the CFG while doing this, and the following option proved useful.</p>
<p>Running opt -view-cfg matmul.ll produces the CFG below; xdot must be installed to view it (on Ubuntu: sudo apt-get install xdot).
<img src="/assets/images/unroll_and_jam.png" alt="" /></p>
<p>One conclusion of the project is that loop unroll and jam is, as marked in the diagram below, just one of the target-independent simple loop optimizations, so whether this single pass fires turns out to have less impact on overall optimization performance than expected. (Unfortunately..)
A more careful phase ordering, perhaps using methods studied in NAS, would likely yield further gains.
<img src="https://releases.llvm.org/8.0.0/tools/polly/docs/_images/LLVM-Passes-all.png" alt="" /></p>류재훈Brief paper notes DARTS DIFFERENTIABLE ARCHITECTURE SEARCH (ICLR 2019)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/nas/ml/paper-review/2021/02/12/paper-DARTS<h1 id="제목">Title</h1>
<p>DARTS: DIFFERENTIABLE ARCHITECTURE SEARCH</p>
<h1 id="저자">Authors</h1>
<p>Hanxiao Liu, Karen Simonyan, Yiming Yang</p>
<h1 id="motivation">Motivation</h1>
<p>Existing NAS methods require enormous time and cost (2000 GPU days of reinforcement learning, 3150 GPU days of evolution). The paper attributes this partly to the discrete search domain, which leads to a large number of architecture evaluations being required. Continuous parameters such as filter sizes had been learned before, but this paper aims to learn the blocks and the graph topology as well.</p>
<h1 id="contribution">Contribution</h1>
<ul>
<li>Replaces NAS over a discrete, non-differentiable search space (previously explored with RL or GA) with a continuous architecture representation trained by gradient descent via bilevel optimization.</li>
<li>Extensive experiments on image classification and language modeling (strong results)</li>
<li>Reduces search time compared with prior methods</li>
<li>Shows the approach is transferable across CNNs and RNNs</li>
</ul>
<h1 id="continuous-relaxation-and-optimization">CONTINUOUS RELAXATION AND OPTIMIZATION</h1>
<p><img src="/assets/images/darts1.PNG" alt="" /></p>
<p>The figure above and the equation below show how the operations are made continuous.
The operation on the edge between nodes $i$ and $j$ is chosen via a softmax over $\alpha$, as in the equation below; the figure above makes this intuitive.</p>
<p><img src="/assets/images/darts2.PNG" alt="" /></p>
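<p>To make the relaxation concrete, here is a tiny sketch (not the paper's code; the candidate operations are stand-ins) of the softmax-weighted mixed operation on one edge:</p>

```python
import numpy as np

def softmax(a):
    # numerically stable softmax over the architecture parameters alpha
    e = np.exp(a - a.max())
    return e / e.sum()

# hypothetical candidate operations on an edge (i, j)
ops = [
    lambda x: x,                 # identity / skip-connect
    lambda x: np.zeros_like(x),  # the "zero" op
    lambda x: 2.0 * x,           # stand-in for a conv op
]

def mixed_op(x, alpha):
    # o_bar(x) = sum_k softmax(alpha)_k * o_k(x) -- the continuous relaxation
    w = softmax(alpha)
    return sum(wk * op(x) for wk, op in zip(w, ops))

x = np.ones(4)
alpha = np.array([0.0, 0.0, 0.0])  # uniform weights -> average of the ops
print(mixed_op(x, alpha))          # each element: (1 + 0 + 2) / 3 = 1.0
```

<p>With uniform $\alpha$ the mixed op is just the average of the candidate ops; training shifts the weights toward the better ops, which the later discretization step picks out.</p>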
<p>With the building block defined above, we now train the weights and derive the final architecture.
This is done with the bilevel optimization below.
<img src="/assets/images/dart3.PNG" alt="" /></p>
<p><img src="/assets/images/darts4.PNG" alt="" /></p>
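<p>A toy, first-order sketch of the alternating updates for the bilevel problem above (the quadratic losses and step size are invented for illustration; the real objectives are the network's training and validation losses):</p>

```python
# Toy bilevel problem: the inner (training) loss is minimized over w,
# the outer (validation) loss over alpha; all functions are made up.
def l_train(w, alpha):
    return (w - alpha) ** 2

def l_val(w, alpha):
    return (w + alpha - 3.0) ** 2

w, alpha, lr = 0.0, 0.0, 0.1
for _ in range(500):
    # inner update: one gradient step on the training loss w.r.t. w
    w -= lr * 2 * (w - alpha)
    # first-order approximation: gradient of the validation loss w.r.t.
    # alpha with the current w treated as a constant (no second-order term)
    alpha -= lr * 2 * (w + alpha - 3.0)

print(round(w, 3), round(alpha, 3))  # both approach 1.5
```

<p>At the fixed point both losses are zero; the exact (second-order) update differentiates through the inner step instead of treating $w$ as constant.</p>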
<h1 id="approximate-architecture-gradient">APPROXIMATE ARCHITECTURE GRADIENT</h1>
<p>Personally this looks like a design choice, and since follow-up papers revisit it, it does not seem critical.
The bilevel optimization form above is reminiscent of the MAML formulation. This paper likewise rewrites the objective to reduce computation, including a first-order approximation (there is a trade-off, so the choice depends on the situation).</p>
<h1 id="deriving-discrete-architectures">DERIVING DISCRETE ARCHITECTURES</h1>
<p>To derive a discrete architecture, only the top-k strongest operations are kept (the zero op is excluded)</p>
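<p>A minimal sketch of this selection step (op names and $\alpha$ values are made up; here k = 1 per edge):</p>

```python
OPS = ["zero", "skip", "conv3x3", "conv5x5"]

def strongest_op(alpha):
    # pick the operation with the largest architecture weight on this edge,
    # excluding the "zero" op as in the paper
    idx = [i for i, name in enumerate(OPS) if name != "zero"]
    best = max(idx, key=lambda i: alpha[i])
    return OPS[best]

# zero has the largest weight, but it is excluded -> conv3x3 wins
print(strongest_op([2.0, 0.1, 1.5, 0.3]))  # conv3x3
```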
<h1 id="results">Results</h1>
<p>Compared with NASNET-A (2000 GPU days), AmoebaNet-A (3150 GPU days), and ENAS (0.5 GPU day), DARTS is considerably more time-efficient at a matched parameter count.
<img src="/assets/images/darts5.PNG" alt="" /></p>
<h2 id="cifar10">CIFAR-10</h2>
<p><img src="/assets/images/darts5.PNG" alt="" /></p>
<h2 id="ptb">PTB</h2>
<p><img src="/assets/images/darts6.PNG" alt="" /></p>
<h2 id="imagenet-in-the-mobile-setting">ImageNet in the mobile setting</h2>
<p><img src="/assets/images/darts7.PNG" alt="" /></p>
<h1 id="references">references</h1>
<p><a href="https://openreview.net/pdf?id=S1eYHoC5FX">paper</a>
<a href="https://github.com/quark0/darts.git">official code</a></p>류재훈Brief paper notes End-to-End Deep Learning of Optimization Heuristics (PACT 17)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-End-to-EndDeepLearning-of-pptimization-heuristics<h1 id="제목">Title</h1>
<p>End-to-End Deep Learning of Optimization Heuristics</p>
<h1 id="저자">Authors</h1>
<p>Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather</p>
<h1 id="motivation">Motivation</h1>
<p>Existing ML-based compiler optimization approaches require feature engineering by human experts.</p>
<h1 id="contribution">Contribution</h1>
<p>Using the proposed Source Rewriter & Language model, compiler optimization is performed directly on raw program code; in addition, transfer learning enables training even with a small number of programs.</p>
<h1 id="references">references</h1>
<p>https://ieeexplore.ieee.org/document/1281665</p>류재훈Brief paper notes Fast and Effective Orchestration of Compiler Optimizations (Zhelong Pan, Rudolf Eigenmann; Purdue University; CGO'06)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-Fast%20and%20EffectiveOrchestrationofCompilerOptimizations<h1 id="제목">Title</h1>
<p>Fast and Effective Orchestration of Compiler Optimizations</p>
<h1 id="저자">Authors</h1>
<p>Zhelong Pan, Rudolf Eigenmann</p>
<h1 id="motivation">Motivation</h1>
<p>Compile-time optimizations improve program performance overall, but some techniques cause performance degradation.
Insufficient information about the input program and the target architecture limits how accurate a purely compile-time model can be.</p>
<h1 id="contribution">Contribution</h1>
<ul>
<li>Proposes the Combined Elimination (CE) algorithm, combining the strengths of the existing Batch Elimination (BE) and Iterative Elimination (IE).</li>
<li>Shows improved performance even over Optimization Space Exploration (OSE) and Statistical Selection (SS).</li>
<li>Evaluates on a large set of realistic programs, yielding realistic results.
<h1 id="content">Content</h1>
</li>
<li>Exhaustive Search =>O(2^n)</li>
<li>Batch Elimination => O(n)
<ul>
<li>Removes, in one batch, every optimization whose Relative Improvement Percentage (RIP) is negative</li>
</ul>
</li>
<li>Iterative Elimination => O(n^2)
<ul>
<li>Repeatedly removes the single optimization with the most negative RIP</li>
</ul>
</li>
<li>Combined Elimination => O(n^2)
<ul>
<li>Repeatedly removes all optimizations showing a negative RIP</li>
</ul>
</li>
<li>Optimization Space Exploration(OSE) => O(n^3)
<ul>
<li>The basic idea of the pruning algorithm is to iteratively find better optimization combinations by merging the beneficial ones.</li>
</ul>
</li>
<li>Statistical Selection (SS) =>O(n^2)
<ul>
<li>It uses a statistical method to identify the performance effect of the optimization options. The options with positive effects are turned on, while the ones with negative effects are turned off in the final version, in an iterative fashion
<h1 id="results">Results</h1>
</li>
</ul>
</li>
</ul>
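<p>The elimination idea can be sketched as follows; this is a toy illustration, not the paper's implementation: the program runtime is mocked, and an option counts as harmful when disabling it yields a positive RIP against the current baseline.</p>

```python
def rip(t_base, t_exp):
    # Relative Improvement Percentage of an experimental run vs. a baseline
    return (t_base - t_exp) / t_base * 100.0

def combined_elimination(options, run):
    """Measure each enabled option's effect; disable ALL options with a
    harmful effect in one pass, re-measure, and repeat until none remain."""
    enabled = set(options)
    while True:
        t_base = run(enabled)
        harmful = []
        for opt in sorted(enabled):
            t = run(enabled - {opt})  # time with this one option flipped off
            if rip(t_base, t) > 0:    # faster without it -> opt is harmful
                harmful.append(opt)
        if not harmful:
            return enabled
        enabled -= set(harmful)

# mock "execution time": O2 helps, O3 hurts, unroll hurts only with O3
def run(enabled):
    t = 10.0
    if "O2" in enabled: t -= 2.0
    if "O3" in enabled: t += 1.0
    if "unroll" in enabled and "O3" in enabled: t += 0.5
    return t

print(sorted(combined_elimination({"O2", "O3", "unroll"}, run)))  # ['O2']
```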
<p><img src="/assets/images/fe1.png" alt="" />
<img src="/assets/images/fe2.png" alt="" /></p>
<h1 id="references">references</h1>
<p>https://arxiv.org/abs/1802.04799</p>류재훈Paper notes NeuroVectorizer End-to-End Vectorization with Deep Reinforcement Learning (CGO 20)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-NeuroVectorizer<p><img src="/assets/images/nv1.png" alt="" /></p>
<h1 id="제목">Title</h1>
<p>NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning</p>
<h1 id="저자">Authors</h1>
<p>Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Sophia Shao, Krste Asanovic, Ion Stoica</p>
<h1 id="motivation">Motivation</h1>
<p>Compilers today use fixed-cost models, based on heuristics, to make vectorization decisions on loops. However, these models are unable to capture the data dependencies, the computation graph, or the organization of instructions.
Vectorization is critical to the performance of compute-intensive workloads on modern computers.</p>
<h1 id="contribution">Contribution</h1>
<p>A comprehensive data set of more than 10,000 synthetic loop examples.
An end-to-end deep reinforcement learning (RL) based auto loop-vectorization method</p>
<h1 id="개인적인-느낌">Personal impression</h1>
<p>Honestly, the search space is so small that I doubt how meaningful the result is.</p>
<h1 id="the-proposed-framework-architecture">The Proposed Framework Architecture</h1>
<p><img src="/assets/images/nv2.png" alt="" /></p>
<h1 id="code-embedding">Code Embedding</h1>
<ul>
<li>Code2vec(Embedding Network) represents a code snippet as a single fixed-length code vector, which can be used to predict the semantic properties of the snippet.</li>
<li>This vector captures many characteristics of the code, such as semantic similarities, combinations, and analogies</li>
</ul>
<p><img src="/assets/images/nv3.png" alt="" />
A code snippet and its predicted labels as computed by code2vec
<a href="https://arxiv.org/pdf/1803.09473.pdf">reference</a>
<img src="/assets/images/nv4.png" alt="" />
The architecture of the path-attention network. A fully connected layer learns to combine the embeddings of
each path-context with itself; attention weights are learned over the combined context vectors and used to
compute a code vector. The code vector is used to predict the label.
<a href="https://arxiv.org/pdf/1803.09473.pdf">reference</a></p>
<h1 id="automatic-vectorization-example">Automatic Vectorization Example</h1>
<p><img src="/assets/images/nv5.png" alt="" /></p>
<h1 id="the-rl-environment-definition">The RL Environment Definition</h1>
<p><img src="/assets/images/nv6.png" alt="" />
where baseline is the execution time when compiled with the currently implemented baseline cost model in LLVM and RL is the execution time when compiled with the injected pragmas by the RL agent
<img src="/assets/images/nv7.png" alt="" />
where MAX_VF and MAX_IF are respectively the maximum
VF and IF supported by the underlying architecture</p>
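<p>My reading of the reward above is the relative improvement over the baseline, (baseline − RL) / baseline; under that assumption the penalty of −9 mentioned later corresponds exactly to running ten times slower than the baseline:</p>

```python
def reward(baseline, rl):
    # relative speedup of the RL-chosen pragmas over the LLVM baseline
    # (my reading of the formula in the figure, not code from the paper)
    return (baseline - rl) / baseline

print(reward(1.0, 0.5))   # 0.5: twice as fast as the baseline
print(reward(1.0, 10.0))  # -9.0: ten times slower, the paper's penalty value
```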
<h1 id="dataset-description">Dataset Description</h1>
<p><img src="/assets/images/nv8.png" alt="" />
To speed up the training, and make it more efficient,
we built a dataset that includes loops only. We built generators that generate more than 10,000 synthetic loop examples automatically from the LLVM vectorization test-suite.</p>
<h1 id="handling-long-compilation-time">Handling Long Compilation Time</h1>
<ul>
<li>During training, some of the programs took a long time to compile, mainly when the agent was trying to vectorize more than plausible</li>
<li>giving a penalty reward of −9 (equivalent to assuming it takes ten times the execution time of the baseline) so that the agent will learn not to overestimate the vectorization and avoid it
<h1 id="resultsreward-mean-and-training-loss-for-different-action-space-definitions">Results:Reward mean and training loss for different action space definitions</h1>
<p><img src="/assets/images/nv9.png" alt="" /></p>
<h1 id="resultsthe-performance-of-the-proposed-vectorizer">Results:The performance of the proposed vectorizer</h1>
<p><img src="/assets/images/nv10.png" alt="" />
The performance is normalized to the baseline(VF = 4, IF =
2)</p>
<h1 id="resultsnormalized-average-performance-of-supervised-fcnn-and-deep-rl">Results:Normalized average performance of supervised FCNN and deep RL</h1>
<p><img src="/assets/images/nv11.png" alt="" /></p>
<h1 id="resultsthe-performance-of-the-proposed-vectorizer-on">Results:The performance of the proposed vectorizer on MiBench</h1>
<p><img src="/assets/images/nv12.png" alt="" />
Mibench compared to Polly and the baseline cost model</p>
</li>
</ul>
<h1 id="references">references</h1>
<p>https://arxiv.org/abs/1909.13639</p>류재훈Paper notes Chameleon Adaptive Code Optimization for Expedited Deep Neural Network Compilation (ICLR 2020)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-chameleon<p><img src="/assets/images/chameleon1.jpg" alt="" /></p>
<h1 id="제목">Title</h1>
<p>Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation</p>
<h1 id="저자">Authors</h1>
<p>Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, Hadi Esmaeilzadeh</p>
<h1 id="motivation">Motivation</h1>
<p>Current approaches are oblivious to the patterns in the design space of schedules available for exploitation, which causes inefficient search and can even converge to suboptimal solutions.
Current solutions that rely on greedy sampling leave significant fractions of the candidate configurations redundant across iterations (long compilation time).</p>
<h1 id="contribution">Contribution</h1>
<ul>
<li>Devising an <strong>Adaptive Exploration</strong> module that utilizes reinforcement learning to adapt to unseen design space of new networks to reduce search time yet achieve better performance.</li>
<li>Proposing an <strong>Adaptive Sampling</strong> algorithm that utilizes clustering to adaptively reduce the number of costly hardware measurements
<h1 id="개인적인-생각">Personal take</h1>
<p>A clean, well-motivated paper: use RL to explore well and sample efficiently in order to cut optimization time.</p>
<h1 id="overall-design">Overall design</h1>
<p><img src="/assets/images/chameleon2.png" alt="" /></p>
</li>
</ul>
<h1 id="adaptive-exploration">Adaptive Exploration</h1>
<ul>
<li>TVM leverages simulated annealing, which relies on the stochastic guarantees of its random walks (requiring numerous iterations of exploration) and is thus insufficient to keep up with disruptive innovations in neural networks</li>
<li>Adaptive Exploration, based Reinforcement Learning ,is concerned with learning to maximize reward given an environment by making good exploration and exploitation tradeoffs</li>
<li>These networks not only learn the dependencies among the different knobs of the design space (which are interrelated), helping the module navigate the design space, but also learn the potential gains of modifications to the configurations.</li>
</ul>
<h1 id="learning-procedure">Learning procedure</h1>
<p><img src="/assets/images/chameleon3.png" alt="" /></p>
<h1 id="adaptive-sampling--reducing-number-of-costly-hardware-measurements">Adaptive Sampling : Reducing number of costly hardware measurements</h1>
<p><img src="/assets/images/chameleon4.png" alt="" /></p>
<ul>
<li>we observe that the candidate configurations are clustered in subregions of the design space</li>
<li>Adaptive Sampling iterates over different numbers of clusters (k-means), examining the resulting centroids and the L2 loss.</li>
<li>Selecting the number of centroids entails an important tradeoff, chosen at the knee of the L2-loss vs. performance-degradation curve
<h1 id="improving-candidate-configurations-using-sample-synthesis">Improving candidate configurations using sample synthesis</h1>
</li>
<li>Many of the automated approaches for black-box optimization are prone to invalid configurations</li>
<li>These invalid configurations not only blow the chances for better exploration but also leads to an extra optimization time overhead to reset the physical hardware for the subsequent hardware measurement</li>
<li>When our compiler runs into redundant samples, the proposed synthesis method analyzes the candidate samples to determine the most probable (most frequent = mode function) non-invalid choice for each knob to come up with a new configuration
</li>
<li>
<img src="/assets/images/chameleon5.png" alt="" />
When redundant samples appear, each knob of the synthesized configuration is set to its most probable (most frequent = mode function) valid value
<h1 id="evaluation">Evaluation</h1>
<p><img src="/assets/images/chameleon6.png" alt="" />
Task Index => layer order
<img src="/assets/images/chameleon7.png" alt="" />
Overall, the observation is that CHAMELEON's Adaptive Exploration requires 2.88× fewer search steps than simulated annealing to find a good solution.
<img src="/assets/images/chameleon8.png" alt="" />
<img src="/assets/images/chameleon9.png" alt="" />
<img src="/assets/images/chameleon10.png" alt="" /></p>
</li>
</ul>
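<p>The sample-synthesis step described above (take the most frequent valid value per knob) can be sketched like this; the knob names, candidate configurations, and validity rule are all made up:</p>

```python
from statistics import mode

# hypothetical candidate schedule configurations: one value per knob
candidates = [
    {"tile": 8,  "unroll": 4, "vectorize": 1},
    {"tile": 8,  "unroll": 2, "vectorize": 1},
    {"tile": 16, "unroll": 4, "vectorize": 0},
]

def synthesize(candidates, is_valid):
    # per knob, take the mode (most frequent value) among the candidates,
    # restricted to values the hardware/compiler accepts
    config = {}
    for knob in candidates[0]:
        values = [c[knob] for c in candidates if is_valid(knob, c[knob])]
        config[knob] = mode(values)
    return config

# made-up validity rule: every sampled value happens to be acceptable
print(synthesize(candidates, lambda knob, v: True))
```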
<h1 id="references">references</h1>
<p>https://openreview.net/forum?id=rygG4AVFvH</p>
<h1 id="project-page">Project Page</h1>
<p>https://github.com/anony-sub/chameleon</p>류재훈Paper notes LLVM A Compilation Framework for Lifelong Program Analysis & Transformation (CGO 04)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/paper-review/2021/02/12/paper-llvm<h1 id="제목">Title</h1>
<p>LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation</p>
<h1 id="저자">Authors</h1>
<p>Chris Lattner, Vikram Adve</p>
<h1 id="개인적으로-느끼는-논문의-insight">The paper's insight, as I see it</h1>
<p>By introducing the concept of lifelong program analysis, it performs whole-program optimization everywhere except the front end, with an SSA-based IR and machine-independent optimization.
The concepts presented do not exactly match today's LLVM, but it is impressive nonetheless..</p>
<h1 id="motivation">Motivation</h1>
<ul>
<li>Multiple-stages of analysis & transformation</li>
<li>compile-time, link-time, install-time, run-time, idle-time</li>
<li>Use aggressive interprocedural optimizations</li>
<li>Gather and exploit end-user profile information</li>
<li>Tune the application to the user’s hardware
<h1 id="contributions">Contributions</h1>
</li>
<li>A persistent, rich code representation
<ul>
<li>Enables analysis & optimization throughout lifetime</li>
</ul>
</li>
<li>Offline native code generation
<ul>
<li>Must be able to generate high-quality code statically</li>
</ul>
</li>
<li>Profiling & optimization in the field
<ul>
<li>Adapt to the end-user’s usage patterns</li>
</ul>
</li>
<li>Language independence
<ul>
<li>No runtime, object model, or exception semantics</li>
</ul>
</li>
<li>Uniform whole-program optimization
<ul>
<li>Allow optimization across languages and runtime
<h1 id="instruction-set">Instruction Set</h1>
</li>
</ul>
</li>
<li>Avoids machine specific constraints</li>
<li>Infinite set of typed virtual registers
<ul>
<li>In SSA form</li>
<li>Includes support for phi functions</li>
<li>This allows flow insensitive algorithm to gain benefits of flow sensitive without expensive Data Flow analysis</li>
</ul>
</li>
<li>Avoids same code for multiple instructions (overloaded opcodes)</li>
<li>Exceptions mechanism based on two instructions invoke and unwind
<h1 id="llvm-compiler-architecture">LLVM Compiler Architecture</h1>
<p><img src="/assets/images/llvm1.png" alt="" />
<strong>This strategy provides the 5 benefits</strong></p>
</li>
<li>Some limitations
<ul>
<li>Language specific optimizations must be performed on frontend</li>
<li>Benefit to languages like Java(JVM) requiring sophisticated runtime systems?</li>
</ul>
</li>
<li>Front-end compiler
<ul>
<li>Translate source code to LLVM representation</li>
<li>Perform language specific optimizations</li>
<li>Need not perform SSA construction at this time</li>
<li>Invoke LLVM passes for global inter procedural optimization at module level</li>
</ul>
</li>
<li>Linker/Interprocedure Optimizer
<ul>
<li>Various analyses occur
<ul>
<li>Points-to analysis</li>
<li>Mod/Ref analysis</li>
<li>Dead global elimination, dead argument elimination, constant, propagation, array bounds check, etc</li>
<li>Can be sped up by adding inter-procedural summaries</li>
</ul>
</li>
</ul>
</li>
<li>Native Code Generation
<ul>
<li>JIT or Offline</li>
<li>Currently supports Sparc V9 and x86 architectures</li>
</ul>
</li>
<li>Reoptimizers
<ul>
<li>Identifies frequently run code and ‘hotspots’</li>
<li>Performs additional optimizations, thus native code generation can be performed ahead of time</li>
<li>Idle-time reoptimizer</li>
</ul>
</li>
</ul>
<h1 id="resultshow-do-high-level-features-map-onto-llvm">Results:How do high-level features map onto LLVM?</h1>
<p><img src="/assets/images/llvm2.png" alt="" />
The table shows that many of these programs (164, 176,
179, 181, 183, 186, & 256) are surprisingly type-safe, despite
the fact that the programming language does not enforce
type-safety.
<img src="/assets/images/llvm3.png" alt="" />
The figure shows that LLVM code is about the same size
as native executables for SPARC, and is roughly 25% larger
on average for x86
<img src="/assets/images/llvm4.png" alt="" />
DGE (aggressive Dead Global variable and function Elimination), DAE (aggressive Dead Argument Elimination), inline (a function integration pass), DSA (Data Structure Analysis), and GCC (time to compile the programs with the gcc 3.3 compiler at -O3, provided as a reference point)</p>
<h1 id="references">references</h1>
<p>https://ieeexplore.ieee.org/document/1281665</p>류재훈Brief paper notes TVM An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18)2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/compiler/ml/paper-review/2021/02/12/paper-tvm<p><img src="/assets/images/tvm1.png" alt="" /></p>
<h1 id="제목">Title</h1>
<p>TVM: An Automated End-to-End Optimizing Compiler for Deep Learning</p>
<h1 id="tvm">TVM?</h1>
<p>This paper describes TVM, one of the best-known machine-learning compilers. It is now maintained under Apache; it performs target-independent optimization through a graph-level IR and
target-dependent optimization through autotuning, and via LLVM and VTA it supports not only CPUs and GPUs but also FPGAs as backends.</p>
<h1 id="저자">Authors</h1>
<p>Tianqi Chen and Thierry Moreau, University of Washington; Ziheng Jiang, University of Washington, AWS; Lianmin Zheng, Shanghai Jiao Tong University; Eddie Yan, Haichen Shen, and Meghan Cowan, University of Washington; Leyuan Wang, UC Davis, AWS; Yuwei Hu, Cornell; Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy</p>
<h1 id="motivation">Motivation</h1>
<p>As diverse hardware back-ends (GPU, FPGA, ASIC) proliferate, the compiler optimizations suited to each architecture inevitably differ, and hand-tuning by human experts does not scale.</p>
<h1 id="contribution">Contribution</h1>
<p>The paper presents an ML-driven method for lowering high-level ML graph operations into executable code suited to a specific device. It reads largely as a blueprint of the overall framework and as a typical DL-compiler architecture, so I did not find the contribution itself very striking.</p>
<h1 id="content">Content</h1>
<p>Graph-level modification & hardware-aware optimization</p>
<ul>
<li>Operator Fusion
<ul>
<li>Combines many small ops</li>
</ul>
</li>
<li>Constant Folding
<ul>
<li>Pre-computes static graphs</li>
</ul>
</li>
<li>Static Memory Planning Pass
<ul>
<li>Pre-allocates memory for needed tensors</li>
</ul>
</li>
<li>Data Layout Transformations
<ul>
<li>Optimize data storage for each backend</li>
</ul>
</li>
<li>Uses ML for the cost model
<ul>
<li>Predicts costs with XGBoost from features extracted from the query
<h1 id="references">references</h1>
<p>https://arxiv.org/abs/1802.04799</p>
<h1 id="project-page">Project Page</h1>
<p>https://tvm.apache.org</p>
</li>
</ul>
</li>
</ul>류재훈Frequently used Python script patterns2021-02-12T00:00:00+00:002021-02-12T00:00:00+00:00https://deepmi.me/python/2021/02/12/python_script<p>Since I personally use Python more often than the shell, here is a quick summary of patterns I use frequently</p>
<h2 id="globos">glob,os</h2>
<p>glob is a Unix-style pathname pattern-expansion library. Combined with the os library, it makes finding and manipulating files easy.
I use it a lot when parsing experiment results.</p>
<p>Example: when both a png file and an npy file exist under the result folder, operate on the npy file</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import glob
import os
...
for name in glob.glob('result/**/*.png', recursive=True):
    np_name = name.replace('png', 'npy')
    if os.path.exists(np_name):
        np.load(np_name)
        ....
</code></pre></div></div>
<h2 id="subprocess">subprocess</h2>
<p>subprocess runs shell commands from Python.
I use it a lot when running experiments while sweeping parameters.
Example: run experiments while varying parameters A and B.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import subprocess
basetext='python3 text.py --numA _numA --numB _numB 2>&1'
for numA in [30,40,50]:
    for numB in [1.0,1.5,2.0]:
        text=basetext
        text=text.replace('_numA',f'{numA}')
        text=text.replace('_numB',f'{numB}')
        proc = subprocess.Popen(text, shell=True, executable='/bin/bash')
        proc.communicate()
</code></pre></div></div>
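<p>A variant of the sweep above using itertools.product instead of hand-written nested loops (text.py and the flags are the same placeholders as above):</p>

```python
import itertools

# enumerate all (numA, numB) combinations instead of nesting loops by hand
cmds = [
    f'python3 text.py --numA {a} --numB {b} 2>&1'
    for a, b in itertools.product([30, 40, 50], [1.0, 1.5, 2.0])
]
print(len(cmds))  # 9
print(cmds[0])    # python3 text.py --numA 30 --numB 1.0 2>&1
# each entry can then be executed with:
#   subprocess.run(cmd, shell=True, executable='/bin/bash')
```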
<h2 id="schedule">schedule</h2>
<p>schedule is Python's crontab.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import schedule
import time

def job():
    print("Do Job...!!!")

schedule.every(10).minutes.do(job)
schedule.every().hour.do(job)
schedule.every().day.at("10:30").do(job)
schedule.every(5).to(10).minutes.do(job)
schedule.every().monday.do(job)
schedule.every().wednesday.at("13:15").do(job)
schedule.every().minute.at(":17").do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
</code></pre></div></div>류재훈