项目作者: upenn-acg

项目描述 :
GPU Drano Static Analysis for GPU programs.
高级语言: C++
项目地址: git://github.com/upenn-acg/gpudrano-static-analysis_v1.0.git
创建时间: 2018-03-01T22:30:11Z
项目社区:https://github.com/upenn-acg/gpudrano-static-analysis_v1.0

开源协议:Other

下载


GPU Drano

GPU Drano is a static analysis tool for GPU Programs. One of the analyses in
GPU Drano is for finding
uncoalesced memory accesses
in CUDA code. The other analysis supported by GPU Drano is analysis to prove
block-size independence of GPU programs.

Modern GPUs bundle threads into warps. All threads in a warp perform operations
in lockstep. Memory accesses to different memory locations can be coalesced into
a single load/store if the memory is adjacent or close enough in memory. When
the memory accessed by a warp is far apart, multiple load/stores are required to
complete the memory transaction, and we say the access is uncoalesced.

A GPU kernel is said to be block-size independent, if modifying the block-size
while keeping the total number of threads same, does not break the functionality
of the program. This is essential for correct block-size tuning in GPU programs,
which is often used to improve program performance.

We have also implemented a dynamic analysis to identify uncoalesced accesses, it’s
available in this repository.

Details

Implementation

GPU Drano is implemented as a compiler pass for LLVM using Google’s open source CUDA
implementation: gpucc.
Therefore, Drano is tightly coupled with LLVM.

Requirements

Drano requires LLVM version 6.0 or later. It also requires CUDA toolkit (version 7.5
or later) from NVIDIA. It has been tested on Ubuntu 16.04 LTS, but should work with
most existing Linux systems.

The static analysis itself does not require a GPU. However to execute the programs
and to run dynamic anlaysis, an NVIDIA GPU and compatible drivers are required. Check
NVIDIA’s system requirements
for more information.

Algorithm and Performance

The details of the algorithm and the design choices can be found in following papers:
1) GPUDrano: Detecting Uncoalesced Accesses in GPU Programs.
2) Block-size Independence for GPU Programs.

Installation

The script installnrun.sh included with the project details the steps required
to build and execute GPU Drano on an Ubuntu system. To run the script,

  • Download GPU Drano and set ROOT_DIR to the path to the downloaded folder.
  • Run sh installnrun.sh

This would automatically install GPU Drano on a Linux system. We further
describe the installation steps for installing the uncoalesced access analysis
here. The installation for block-size independence analysis (also refered to as
block-size invariance analysis) can be done similarly.

Setup LLVM

1) Get LLVM source:

Ensure subversion is installed. Download the newest version of LLVM:

  1. svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

2) Get Clang source:

Change your current working directory to llvm/tools/ and check out clang
from the svn repository:

  1. svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

3) Add GPU Drano to LLVM:

Copy GPU Drano’s src/ folder into the directory llvm/lib/Transforms/UncoalescedAnalysis
in your source code:

  1. cp -r src/abstract-execution/* llvm/lib/Transforms/UncoalescedAnalysis
  2. cp -r src/uncoalesced-analysis/* llvm/lib/Transforms/UncoalescedAnalysis

This will create a folder called llvm/lib/Transforms/UncoalescedAnalysis/. We must
register our pass with the LLVM build system, cmake. Therefore, append
add_subdirectory(UncoalescedAnalysis) to llvm/lib/Transforms/CMakeLists.txt

Sample CMakeLists.txt file:

  1. $> more CMakeLists.txt
  2. add_subdirectory(Utils)
  3. add_subdirectory(Instrumentation)
  4. add_subdirectory(InstCombine)
  5. add_subdirectory(Scalar)
  6. add_subdirectory(IPO)
  7. add_subdirectory(Vectorize)
  8. add_subdirectory(Hello)
  9. add_subdirectory(ObjCARC)
  10. add_subdirectory(Coroutines)
  11. add_subdirectory(UncoalescedAnalysis)

4) Build LLVM and GPU Drano:

From the root directory of Drano, create a build/ directory. Then,
change directory to the build/ directory. Ensure CMake is installed on
the system. Execute the following commands here:

  1. cmake ../llvm
  2. make

That’s cmake with the path to LLVM directory (../llvm). CMake configures
LLVM for your system. It should generate several files in your current working
directory (build/). The command make builds LLVM and Drano.

Ideally, use make -j N where N is your number of cores to build with in
parallel as llvm takes a long time to build.

Notice: LLVM build takes a large amount of memory to compile, specifically
at the linking step. If you have < 8 gigabytes of RAM, LLVM may fail to build.
If so, you can rerun make without the -j option. This will only recompile the
failed parts of the build, and still save more time than not using -j from
the start.

5) Install LLVM and GPU Drano
Perform sudo make install. This should install the libraries and binaries in the
default location (or the specified location). If installed locally, this guide
assumes bash can find the command in it’s PATH.

The above is a quick start guide. If you’re unfamiliar with LLVM you may find all
details for installation at: http://llvm.org/docs/GettingStarted.html

Setup NVIDIA drivers, toolkit and SDK

The script installnrun.sh briefly describes the process to build NVIDIA drivers,
toolkit and the SDK. The script has been commented out to avoid the risk of overriding
existing NVIDIA drivers. Note that we only need the toolkit and the SDK to run
the static analysis. We also need drivers and a working GPU to execute dynamic analysis
and CUDA programs themselves.

1) Install the NVIDIA drivers, CUDA toolkit and SDK:
Please refer to the NVIDIA installtion guide
for instructions.

In summary, if you use a recent and popular Linux distro you should be able
to use the automatic download tool to install the required drivers and sdk.

If you already have the required drivers and SDK installed you may skip this step.
It’s probably not a good idea to override your existing drivers as this might render
your system unusable.

2) You may require g++-multilib to install necessary libraries required by clang.

Verifying CUDA + LLVM installation

This step is optional. A simple CUDA program like “hello world” should now compile
using clang or clang++:

  1. clang -x cuda helloWorld.cu

The -x cuda option explicitly states the language. You may omit it as well and
clang will infer this as a CUDA program.

Notice: Your clang installation may need not be able to find several internal
LLVM functions, if so, you may need to include -lcudart. You may also need to
point to the location of the cudart.so.

Example:

  1. clang -x cuda -L /usr/local/cuda/targets/x86_64-linux/lib/ -lcudart helloWorld.cu

The path to your cudart.so may be different depending on your system.

GPU Drano binary

LLVM should have generated a shared object .so file, called LLVMUncoalescedAnalysis.so,
under the build/lib/ directory.

Running GPU Drano

The LLVM complier clang generates separate device (the gpu kernel code) and
host (code run on CPU) IR files when compiling CUDA programs. GPU Drano analyzes
the LLVM IR files for device code.

As an example, let’s analyze the Rodinia kernel code for the gaussian benchmark.
We change directory into rodinia_3.1/cuda/gaussian/.

First we have clang generate LLVM IR files for the code we are interested in:

  1. clang++ -S -g -emit-llvm gaussian.cu

Notice we compile with debug symbol -g, to keep the debug information about
source code locations in the generated IR. This is used to point source code locations
with potential uncoalesced accesses from LLVM IR.

The compilation generates two files:

  1. gaussian-cuda-nvptx64-nvidia-cuda-sm_20.ll
  2. gaussian.ll

We can then run the static analysis through LLVM’s opt by specifying the path to
the GPU Drano binary and specifying pass -interproc-uncoalesced-analysis to be run. This
pass is an interprocedural analysis to detect uncoalesced accesses. It starts with
the analysis of the top-most function in the call-graph and then proceeds with the
analysis of its callees in a topological order. While analyzing a specific callee,
it considers the join of the call contexts of all its callers. To run an intraprocedural
analysis (that assumes all initial function arguments are independent of thread’s id),
specify pass -uncoalesced-analysis to be run, instead of -interproc-uncoalesced-analysis.

  1. opt -load ../../../build/lib/LLVMUncoalescedAnalysis.so -instnamer -interproc-uncoalesced-analysis < gaussian-cuda-nvptx64-nvidia-cuda-sm_20.ll > /dev/null 2> gpuDranoResults.txt

Notice opt reads the IR file from standard input. opt writes it’s own uninteresting
output to standard out so we redirection it to /dev/null GPU Drano’s output is
written to standard error which may be redirected to a file.

To generate verbose analysis results (LLVM IR annotated with analysis info), run
opt with an additional -debug-only=uncoalesced-analysis flag.

The following command can be used to run the block-size independence analysis on a program.

  1. opt -load ../../../build/lib/LLVMBlockSizeInvarianceAnalysis.so - -instnamer -always-inline -interproc-bsize-invariance-analysis < gaussian-cuda-nvptx64-nvidia-cuda-sm_20.ll > /dev/null 2> gpuDranoResults.txt

Understanding GPU Drano’s output

The generated results for uncoalesced access analysis reports all accesses that
might be potentially uncoalesced in each of the GPU kernels. For example, here
are the results for the analysis of
gaussian.cu.

  1. Analysis Results:
  2. Function: _Z4Fan1PfS_ii
  3. Uncoalesced accesses: #2
  4. -- gaussian.cu:295:59
  5. -- gaussian.cu:295:61
  6. Analysis Results:
  7. Function: _Z4Fan2PfS_S_iii
  8. Uncoalesced accesses: #4
  9. -- gaussian.cu:312:38
  10. -- gaussian.cu:312:35
  11. -- gaussian.cu:312:35
  12. -- gaussian.cu:317:23

Each result item points to a potentially uncoalesced access in the source code.
For example, access to m_cuda at line 295, column 59 in gaussian.cu at method
Fan1() is uncoalesced.

Similarly, generated results for block-size independence analysis identify all
kernels that are block-size independent!

Running GPU Drano Uncoalesced Access Analysis on Rodinia Benchmark Suite

Rodinia is popular benchmark suite of GPU programs consisting of 22 programs from
different domains. We analyzed the suite and found 111 real uncoalesced accesses
using static analysis. To reproduce the results, here are the steps involved:

1) Update CUDA and Drano configuration in rodinia_3.1/common/make.config. Set
CUDA_DIR and SDK_DIR with NVIDIA toolkit and SDK paths. Update opt alias
with path to GPU Drano binary LLVMUncoalescedAnalysis.so.

2) Go to benchmarks directory rodinia_3.1/cuda.

3) Compile benchmarks:

  1. sh compile.sh

4) Run GPU Drano on benchmarks (takes about 2 hours):

  1. sh run-analysis.sh

Analysis of each benchmark generates results in its respective folder in log-files
named log_<filename>.

5) Summarize results:

  1. ./summarize-results.sh

Running GPU Drano Block-size Independence Analysis on Nvidia CUDA SDK 8.0 Samples

Nvidia provides a set of CUDA samples which can be used for various applications.
We analyzed the suite to identify block-size independent kernels in the samples. To
reproduce the results, the steps are:

1) Update HOME variable in NVIDIA_CUDA-8.0_Samples/common/drano.mk to the root
GPU Drano directory.

2) Install OpenGL (required for compiling a few benchmarks). On Ubuntu, following
command can be used:

  1. sudo apt-get install freeglut3-dev

3) Go to the directory NVIDIA_CUDA-8.0_Samples

4) Compile benchmarks:

  1. sh compile.sh

5) Run GPU Drano on benchmarks (takes about 2 hours):

  1. sh run-analysis.sh

Analysis of each benchmark generates results in its respective folder in log-files
named log_<filename>.

6) Summarize results:

  1. ./summarize-bsize-independence.sh