Project author: tactcomplabs

Project description:
Memory system characterization benchmarks using atomic operations
Language: C++
Project URL: git://github.com/tactcomplabs/circustent.git
Created: 2019-10-28T23:03:59Z
Project community: https://github.com/tactcomplabs/circustent

License: Apache License 2.0


CircusTent: Atomic Memory Operation System Benchmarks

Overview

The CircusTent infrastructure is designed to provide users and architects
the ability to discover the relevant performance of a target system
architecture’s memory subsystem using atomic memory operations. Atomic
memory operations have traditionally been considered high-latency or
low-performance given the difficulty of implementing them.
Nevertheless, atomic operations are widely used across parallel programming
constructs for synchronization primitives and to promote concurrency.
Prior to the creation of CircusTent, the architecture and programming
model communities had little ability to quantify the performance of
atomics at varying scales of a system architecture.

The CircusTent infrastructure is designed to be a modular benchmark
platform consisting of a frontend and backend infrastructure.
The frontend infrastructure defines the various benchmark types and
standard benchmark algorithms as well as providing the command line
execution interface. The backend provides one or more implementations
of the standard algorithms using various programming models.

Building From Source

Prerequisites

The following packages/utilities are required to build CircusTent from source:

  • CMake 3.4.3+
  • C++ Compiler
  • C Compiler

Optional packages include:

  • RPM tools to build RPMs
  • Debian package tools to build DEBs
  • Backend-specific libraries

Building

The following steps are generic build instructions. You may need to
modify these steps for your target system and compiler.

  1. Clone the CircusTent repository
    1. git clone https://github.com/tactcomplabs/circustent.git
  2. Setup your build tree
    1. cd circustent
    2. mkdir build
    3. cd build
  3. Execute CMake to generate the makefiles (where XXX refers to the backend that you want to enable)

    1. cmake -DENABLE_XXX=ON -DCT_CFLAGS="..." -DCT_CXXFLAGS="..." -DCT_LINKER_FLAGS="..." ../

    Note that it will most often be necessary to pass the compiler specific flags needed for your chosen backend
    implementation to the CMake infrastructure via the CT_CFLAGS, CT_CXXFLAGS, and CT_LINKER_FLAGS options as shown above.

  4. Execute the build

    1. make

    The circustent binary will reside in ./src/CircusTent/

  5. (Optional) Install the build

    1. make install

Build Options

The following are additional build options supported by the CircusTent CMake script:

  • CC : Utilize the target C compiler
  • CXX : Utilize the target C++ compiler
  • -DCMAKE_C_FLAGS : Set the standard C compiler flags
  • -DCMAKE_CXX_FLAGS : Set the standard C++ compiler flags
  • -DCMAKE_INSTALL_PREFIX : installation target (make install)
  • -DCIRCUSTENT_BUILD_RPM : Builds an RPM package
  • -DCIRCUSTENT_BUILD_DEB : Builds a DEB package
  • -DCIRCUSTENT_BUILD_TGZ : Builds a TGZ package
  • -DBUILD_ALL_TESTING : Builds the test infrastructure (developers only)

Algorithm Descriptions

The following contains brief descriptions of each candidate algorithm. For each algorithm,
we apply one or more of the following atomics:

  • Fetch and Add (ADD)
  • Compare and Exchange (CAS)

The algorithmic descriptions below do not specify the size of the data values
used, and the CircusTent software does not derive bandwidth. However,
we strongly suggest that implementors use 64-bit values for the source
and index portions of the benchmark.

The following table lists the core benchmarks and the number of
atomic operations each performs per iteration, which is vital to calculating
accurate GAMS values across platforms.

Benchmark   Number of AMOs
RAND        1
STRIDE1     1
STRIDEN     1
PTRCHASE    1
CENTRAL     1
SG          4
SCATTER     3
GATHER      3

RAND

Performs atomic updates at randomly selected locations. The index array (IDX),
which holds randomly generated indices, is read with stride 1, and each index
selects an element of the source value array (ARRAY). Every index in IDX must be
within the bounds of ARRAY. Standard C linear congruential methods are
sufficient for generating the indices.

    for( i=0; i<iters; i++ ){
      AMO(ARRAY[IDX[i]])
    }
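
The RAND loop above can be sketched in C++ as follows. This is a single-threaded illustration only: the function names, the choice of fetch-and-add with an increment of 1, and the specific linear congruential constants are assumptions for the example, not taken from the CircusTent sources.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fill an index array with pseudo-random indices in [0, array_len) using a
// 64-bit linear congruential generator (constants chosen for illustration).
std::vector<std::uint64_t> make_random_idx(std::size_t iters,
                                           std::uint64_t array_len,
                                           std::uint64_t seed) {
    std::vector<std::uint64_t> idx(iters);
    std::uint64_t state = seed;
    for (auto& v : idx) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        v = (state >> 33) % array_len;  // keep every index within bounds
    }
    return idx;
}

// RAND with the ADD atomic: one AMO per iteration on ARRAY[IDX[i]].
void rand_add(std::vector<std::atomic<std::uint64_t>>& array,
              const std::vector<std::uint64_t>& idx,
              std::uint64_t iters) {
    for (std::uint64_t i = 0; i < iters; i++) {
        array[idx[i]].fetch_add(1, std::memory_order_relaxed);
    }
}
```

Because every iteration adds exactly 1 somewhere in the array, the sum of all elements after the run equals the iteration count, which gives a simple correctness check.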

STRIDE1

Performs a stride-1 atomic update using only a source array (ARRAY).

    for( i=0; i<iters; i++ ){
      AMO(ARRAY[i])
    }

STRIDEN

Performs a stride-N atomic update using only a source array (ARRAY).
The user must specify the respective stride of the operation.

    for( i=0; i<iters; i+=stride ){
      AMO(ARRAY[i])
    }
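
A minimal C++ sketch of the strided loops, following the pseudocode literally, is shown below. The function name stride_n_add and the increment of 1 are assumptions for the example; STRIDE1 corresponds to calling it with a stride of 1. The sketch assumes stride >= 1 and an array of at least iters elements.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// STRIDEN with the ADD atomic, written exactly as in the pseudocode:
// the loop index advances by `stride` and stops once it reaches `iters`.
void stride_n_add(std::vector<std::atomic<std::uint64_t>>& array,
                  std::uint64_t iters,
                  std::uint64_t stride) {
    for (std::uint64_t i = 0; i < iters; i += stride) {
        array[i].fetch_add(1, std::memory_order_relaxed);
    }
}
```

With the loop bound written this way, a stride of 3 over 10 iterations touches elements 0, 3, 6 and 9 only.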

PTRCHASE

Performs a pointer chase operation across an index array. This implies
that the index used in iteration i+1 is the value returned by iteration i. This algorithm
only utilizes the index array (IDX). All index values must be valid within the
scope of the index array.

    for( i=0; i<iters; i++ ){
      start = AMO(IDX[start])
    }
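
The dependency chain in PTRCHASE can be sketched in C++ as follows. The example uses a fetch-and-add of 0 so that the index array is left unchanged while each value is still read atomically; that operand, like the function name ptrchase_add, is an assumption made for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// PTRCHASE with the ADD atomic: each iteration atomically fetches IDX[start]
// and uses the fetched value as the index for the next iteration.
// Returns the final position in the chain.
std::uint64_t ptrchase_add(std::vector<std::atomic<std::uint64_t>>& idx,
                           std::uint64_t start,
                           std::uint64_t iters) {
    for (std::uint64_t i = 0; i < iters; i++) {
        // Adding 0 keeps the chain intact; fetch_add returns the old value.
        start = idx[start].fetch_add(0, std::memory_order_relaxed);
    }
    return start;
}
```

Since every load depends on the previous result, the iterations cannot overlap, which is why this pattern exposes atomic latency rather than throughput.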

CENTRAL

Performs an atomic operation on a single value from all PEs. This is a deliberate
hot-spot action that is designed to immediately stress system and network
interconnects.

    for( i=0; i<iters; i++ ){
      AMO(ARRAY[0])
    }
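
A hedged sketch of the CENTRAL hot spot using C++11 threads is shown below. Mapping each PE to one std::thread, the function name central_add, and the increment of 1 are assumptions for the example; depending on the toolchain, linking may require a threading flag such as -pthread.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// CENTRAL with the ADD atomic: every PE (here, one thread per PE) performs
// `iters` atomic updates on the same shared location.
// Returns the final value of the hot-spot location.
std::uint64_t central_add(unsigned pes, std::uint64_t iters) {
    std::atomic<std::uint64_t> hot{0};
    std::vector<std::thread> workers;
    for (unsigned p = 0; p < pes; p++) {
        workers.emplace_back([&hot, iters] {
            for (std::uint64_t i = 0; i < iters; i++) {
                hot.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return hot.load();
}
```

Because the updates are atomic, the final value is always pes x iters regardless of how the threads interleave.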

SG

Performs a scatter and a gather operation. The source values for the scatter,
gather and the final values are all fetched atomically. As with the other
algorithms, the source array and index array must be valid.

    for( i=0; i<iters; i++ ){
      src = AMO(IDX[i])
      dest = AMO(IDX[i+1])
      val = AMO(ARRAY[src])
      AMO(ARRAY[dest], val) // ARRAY[dest] = val
    }
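
The four atomics of one SG iteration can be sketched in C++ as follows. This is an illustration under stated assumptions: the three fetches use a fetch-and-add of 0, and the final update uses a fetch-and-add of val, which matches the "ARRAY[dest] = val" comment only when the destination element starts at zero. The function name sg_add is hypothetical, and IDX must hold at least iters+1 entries.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// SG with the ADD atomic: four AMOs per iteration.
void sg_add(std::vector<std::atomic<std::uint64_t>>& array,
            std::vector<std::atomic<std::uint64_t>>& idx,
            std::uint64_t iters) {
    const auto relaxed = std::memory_order_relaxed;
    for (std::uint64_t i = 0; i < iters; i++) {
        std::uint64_t src  = idx[i].fetch_add(0, relaxed);      // gather index
        std::uint64_t dest = idx[i + 1].fetch_add(0, relaxed);  // scatter index
        std::uint64_t val  = array[src].fetch_add(0, relaxed);  // gathered value
        array[dest].fetch_add(val, relaxed);                    // scatter update
    }
}
```

Note that the destination of iteration i is the source of iteration i+1, so a value can propagate along the chain of indices.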

SCATTER

Performs the scatter portion of an SG operation. As with the other
algorithms, the source array and index array must be valid.

    for( i=0; i<iters; i++ ){
      dest = AMO(IDX[i+1])
      val = AMO(ARRAY[i])
      AMO(ARRAY[dest], val) // ARRAY[dest] = val
    }

GATHER

Performs the gather portion of an SG operation. As with the other
algorithms, the source array and index array must be valid.

    for( i=0; i<iters; i++ ){
      dest = AMO(IDX[i+1])
      val = AMO(ARRAY[dest])
      AMO(ARRAY[i], val) // ARRAY[i] = val
    }
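
The SCATTER and GATHER loops can be sketched side by side in C++ as follows, with three atomics per iteration each. As in the pseudocode, the fetched index is named dest in both loops even though GATHER reads from it. The function names and the use of fetch-and-add (adding 0 for the fetches, adding val for the final update, which equals an assignment only when the target element starts at zero) are assumptions for this example.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

using AtomicVec = std::vector<std::atomic<std::uint64_t>>;

// SCATTER: read ARRAY[i] with stride 1, write to the indexed location.
void scatter_add(AtomicVec& array, AtomicVec& idx, std::uint64_t iters) {
    const auto relaxed = std::memory_order_relaxed;
    for (std::uint64_t i = 0; i < iters; i++) {
        std::uint64_t dest = idx[i + 1].fetch_add(0, relaxed);
        std::uint64_t val  = array[i].fetch_add(0, relaxed);
        array[dest].fetch_add(val, relaxed);
    }
}

// GATHER: read from the indexed location, write to ARRAY[i] with stride 1.
void gather_add(AtomicVec& array, AtomicVec& idx, std::uint64_t iters) {
    const auto relaxed = std::memory_order_relaxed;
    for (std::uint64_t i = 0; i < iters; i++) {
        std::uint64_t dest = idx[i + 1].fetch_add(0, relaxed);
        std::uint64_t val  = array[dest].fetch_add(0, relaxed);
        array[i].fetch_add(val, relaxed);
    }
}
```

The two loops differ only in which side of the copy is indexed, so comparing their results on one platform isolates the cost of indexed reads versus indexed writes.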

Backends

OMP

  • CMake Build Flag: -DENABLE_OMP=ON
  • Implementation Language: C++ & C using GNU intrinsics
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
  • Utilizes __ATOMIC_RELAXED where appropriate
  • Intrinsic documentation: GNU Atomics
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes
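
For readers unfamiliar with the GNU atomic intrinsics mentioned above, the following sketch shows the two builtins with __ATOMIC_RELAXED ordering on unsigned 64-bit values. The wrapper names gnu_fetch_add and gnu_cas are hypothetical, and the code assumes a GCC- or Clang-compatible compiler; it is not copied from the CircusTent backend.

```cpp
#include <cstdint>

// Fetch-and-add via the GNU builtin; returns the value held before the add.
inline std::uint64_t gnu_fetch_add(std::uint64_t* addr, std::uint64_t value) {
    return __atomic_fetch_add(addr, value, __ATOMIC_RELAXED);
}

// Compare-and-exchange via the GNU builtin; returns true if *addr held
// `expected` and was replaced by `desired`, false otherwise.
inline bool gnu_cas(std::uint64_t* addr,
                    std::uint64_t expected,
                    std::uint64_t desired) {
    return __atomic_compare_exchange_n(addr, &expected, desired,
                                       /*weak=*/false,
                                       __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}
```

Unlike std::atomic, these builtins operate on plain integer storage, which suits benchmarks that allocate ARRAY and IDX as ordinary buffers.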

OMP with Target Offloading

  • CMake Build Flags: -DENABLE_OMP_TARGET=ON
  • Implementation Language: C++ & C
  • Users may define a particular $OMP_DEFAULT_DEVICE; otherwise, the default device is utilized
  • Maps the provided PEs argument to OpenMP teams, wherein the specified number of iterations is executed by each team. Iterations for a given team are workshared using thread and vector level parallelism based on the behavior of the user’s OpenMP implementation and compiler.
  • In order to preserve the intended memory access pattern, the PTRCHASE kernels utilize only teams level parallelism.
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
Benchmark Supported?
RAND_ADD yes
RAND_CAS no
STRIDE1_ADD yes
STRIDE1_CAS no
STRIDEN_ADD yes
STRIDEN_CAS no
PTRCHASE_ADD yes
PTRCHASE_CAS no
CENTRAL_ADD yes
CENTRAL_CAS no
SG_ADD yes
SG_CAS no
SCATTER_ADD yes
SCATTER_CAS no
GATHER_ADD yes
GATHER_CAS no

OpenSHMEM

  • CMake Build Flag: -DENABLE_OPENSHMEM=ON
  • Users must specify the OpenSHMEM compiler wrappers alongside the CMake command as follows:
    1. CC=oshcc CXX=oshcxx cmake -DENABLE_OPENSHMEM=ON ../
  • Implementation Language: C++ and C using SHMEM functions and symmetric heap
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
  • Target PEs for all benchmarks except PTRCHASE are initialized in a stride-1 ring pattern. This implies
    that for PE N, the target PE is N+1. All benchmarks except PTRCHASE target a single destination PE for each iteration
  • The PTRCHASE benchmark utilizes randomly generated target PEs for each iteration
  • For benchmark values that don’t require atomic access to indices, we utilize SHMEM_GET operations to
    fetch the index for a given iteration (e.g., RAND_ADD, RAND_CAS)
  • Tested with OSSS-UCX: OpenSHMEM Reference Implementation
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes
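
The stride-1 ring pattern described in the bullets above can be sketched as a one-line mapping. The wrap-around from the last PE back to PE 0 is an assumption (the text only states that PE N targets PE N+1), and ring_target is a hypothetical helper name.

```cpp
// Stride-1 ring: PE N targets PE N+1. The modulo wrap for the last PE is
// assumed here so that every PE has a valid target.
inline int ring_target(int my_pe, int num_pes) {
    return (my_pe + 1) % num_pes;
}
```

With four PEs, this maps 0 to 1, 1 to 2, 2 to 3, and 3 back to 0, so each PE receives traffic from exactly one neighbor.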

MPI

  • CMake Build Flag: -DENABLE_MPI=ON
  • Users must specify the MPI compiler wrapper alongside the CMake command as follows:
    1. CC=mpicc CXX=mpicxx cmake -DENABLE_MPI=ON ../
  • Implementation Language: C++ and C using MPI-3 functions and one-sided operations
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
  • Target PEs for all benchmarks except PTRCHASE are initialized in a stride-1 ring pattern. This implies
    that for PE N, the target PE is N+1. All benchmarks except PTRCHASE target a single destination PE for each iteration
  • The PTRCHASE benchmark utilizes randomly generated target PEs for each iteration
  • For benchmark values that don’t require atomic access to indices, we utilize MPI_Get operations to
    fetch the index for a given iteration (e.g., RAND_ADD, RAND_CAS)
  • Tested with OpenMPI
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes

xBGAS

  • CMake Build Flag: -DENABLE_XBGAS=ON
  • Implementation Language: C++ and C using xBGAS functions
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
  • Target PEs for all benchmarks except PTRCHASE are initialized in a stride-1 ring pattern. This implies
    that for PE N, the target PE is N+1. All benchmarks except PTRCHASE target a single destination PE for each iteration
  • The PTRCHASE benchmark utilizes randomly generated target PEs for each iteration
  • For benchmark values that don’t require atomic access to indices, we utilize XBGAS_GET operations to
    fetch the index for a given iteration (e.g., RAND_ADD, RAND_CAS)
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes

OpenACC

  • CMake Build Flags: -DENABLE_OPENACC=ON
  • Implementation Language: C++ & C
  • Users may define $ACC_DEVICE_TYPE and/or $ACC_DEVICE_ID to set
    the target device type and ID, respectively. However, since these values
    may be overridden or ignored by your OpenACC implementation, we recommend
    that users verify that the selected device matches the desired one by checking
    the CircusTent output messages printed during device initialization.
  • Maps the provided PEs argument to OpenACC gangs, wherein the specified number of iterations is executed by each gang. Iterations for a given gang are workshared using worker and vector level parallelism based on the behavior of the user’s OpenACC implementation and compiler.
  • In order to preserve the intended memory access pattern, the PTRCHASE kernels utilize only gang level parallelism.
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
Benchmark Supported?
RAND_ADD yes
RAND_CAS no
STRIDE1_ADD yes
STRIDE1_CAS no
STRIDEN_ADD yes
STRIDEN_CAS no
PTRCHASE_ADD yes
PTRCHASE_CAS no
CENTRAL_ADD yes
CENTRAL_CAS no
SG_ADD yes
SG_CAS no
SCATTER_ADD yes
SCATTER_CAS no
GATHER_ADD yes
GATHER_CAS no

Pthreads

  • CMake Build Flags: -DENABLE_PTHREADS=ON
  • Implementation Language: C++ & C using GNU intrinsics
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
  • Utilizes __ATOMIC_RELAXED where appropriate
  • Intrinsic documentation: GNU Atomics
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes

OpenCL

  • CMake Build Flags: -DENABLE_OPENCL=ON
  • Implementation Language: C++ & C with OpenCL extensions
  • Users must define both $OCL_TARGET_PLATFORM_NAME and $OCL_TARGET_DEVICE_NAME to set
    the OpenCL target platform and device, respectively
  • Utilizes unsigned 64-bit integers (cl_ulong) for the ARRAY and IDX values
  • Utilizes OpenCL API-level atomic operations
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes

C++ Standard Threads & Atomics

  • CMake Build Flags: -DENABLE_CPP_STD=ON
  • Implementation Language: C++11
  • Utilizes unsigned 64-bit integers for the ARRAY and IDX values
  • Utilizes C++11 standard library threads and atomic operations
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes

CUDA

  • CMake Build Flag: -DENABLE_CUDA=ON
  • Implementation Language: CUDA C/C++
  • Utilizes unsigned 64-bit integers
  • Utilizes CUDA API-level atomic operations
  • The desired target device can be set with $CUDA_VISIBLE_DEVICES; otherwise, the default CUDA-enabled device will be used
  • In lieu of a PEs parameter, requires specification of CUDA-specific parallel resources as used in the kernel launch configuration:
    • --blocks : number of thread blocks
    • --threads: number of threads per block
  • Sample Execution:
    1. circustent -b RAND_ADD -m 1024 -i 1000 --blocks 100 --threads 512
Benchmark Supported?
RAND_ADD yes
RAND_CAS yes
STRIDE1_ADD yes
STRIDE1_CAS yes
STRIDEN_ADD yes
STRIDEN_CAS yes
PTRCHASE_ADD yes
PTRCHASE_CAS yes
CENTRAL_ADD yes
CENTRAL_CAS yes
SG_ADD yes
SG_CAS yes
SCATTER_ADD yes
SCATTER_CAS yes
GATHER_ADD yes
GATHER_CAS yes

Execution Parameters

Backend Independent Parameters

The following list details the current set of command line options common to all CircusTent backends:

  • --bench BENCH : specifies the target benchmark to run
  • --memsize BYTES : sets the size of the memory array to allocate in bytes (general rule is 1/2 of physical memory)
  • --iters ITERATIONS : sets the number of algorithmic iterations per PE. Total iterations = (PEs x ITERATIONS)
  • --stride STRIDE : sets the stride (in elements) for the target algorithm. Not all algorithms require the stride to be specified. If this value is not required, the algorithm will ignore it.
  • --help : prints the help menu
  • --list : prints a list of the target benchmarks

In addition to the options above, backends not explicitly listed below also utilize the "pes" command line option as shown.

  • --pes PEs : sets the number of parallel execution units (threads, ranks, etc.)

CUDA Parameters

When utilizing the CUDA backend, users must explicitly define the number of thread blocks and threads per block to use during kernel execution as follows (Note that the CUDA backend does not accept a PEs argument):

  • --blocks THREAD_BLOCKS : Sets the number of thread blocks
  • --threads THREADS_PER_BLOCK : Sets the number of threads per block

Sample Execution

The following are examples of running benchmarks with CircusTent:

  1. Print the help menu
    1. circustent --help
  2. List the benchmark algorithms
    1. circustent --list
  3. Execute the RAND_ADD algorithm using 1024 bytes of memory, 2 PEs and 1000 iterations
    1. circustent -b RAND_ADD -m 1024 -p 2 -i 1000
  4. Execute the SCATTER_CAS algorithm using 16GB of memory, 24 PEs and 20,000,000 iterations
    1. circustent -b SCATTER_CAS -m 16488974000 -p 24 -i 20000000

Interpreting the Results

For each of the target benchmarks, CircusTent prints two relevant
performance values. First, the wallclock runtime of the target algorithm
is printed in seconds. Note that very small problems may finish faster
than the timing variables can resolve. If
the timing is not printed correctly, increase the number of iterations
per PE. An example of the timing printout is as follows:

  1. Timing (secs) : 0.340783

The second metric printed is the number of billions of atomic
operations per second, or GAMS (Giga AMOs/sec). This metric derives
the total, parallel number of atomic operations performed in the given
time window. This value can be used to compare platforms based upon
the number of parallel atomics that can realistically be performed using the
target algorithm. It is derived uniquely for each algorithm, as the total
number of atomics performed is equivalent to (NUM_PEs x NUM_ITERATIONS x NUM_AMOs_PER_ITER).
An example of the GAMS printout is as follows:

  1. Giga AMOs/sec (GAMS) : 4.22556
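
The GAMS derivation described above can be written out as a small C++ helper. This is a sketch of the stated formula only; the function name gams and its parameter names are assumptions for the example, and the per-iteration AMO count comes from the table in the Algorithm Descriptions section.

```cpp
#include <cstdint>

// GAMS = (NUM_PEs x NUM_ITERATIONS x NUM_AMOs_PER_ITER) / seconds / 1e9
double gams(std::uint64_t pes,
            std::uint64_t iters_per_pe,
            std::uint64_t amos_per_iter,
            double seconds) {
    // Convert to double before multiplying to avoid 64-bit overflow on
    // very large runs.
    const double total_amos = static_cast<double>(pes) *
                              static_cast<double>(iters_per_pe) *
                              static_cast<double>(amos_per_iter);
    return total_amos / seconds / 1.0e9;
}
```

For example, one PE running 1,000,000,000 iterations of SG (4 AMOs per iteration) in 2 seconds performs 4 billion atomics, giving 2.0 GAMS.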

A sample result set from executing the OpenMP (OMP) implementation
on a modern, dual-socket Intel Xeon system is depicted as follows.
For each of these benchmarks, we utilized the following execution parameters:

  • Memsize = 16488974000
  • Iterations = 20000000
  • PEs = 1 - 24
  • Stride (StrideN) = 9

[Figure: GAMS]
[Figure: TIMING]

Adding New Atomic Implementations

See the developer documentation.

Contributing

All contributions must be made via documented pull requests. Pull requests will be tested
using the CircusTent development infrastructure in order to ensure correctness and
code stability. Pull requests may be initially denied for one or more of the following
reasons (violations will be documented in pull request comments):

  • Code lacks sufficient documentation
  • Code inhibits/breaks existing functionality
  • Code does not follow existing stylistic guidelines
  • Benchmark implementation violates benchmark rules
  • Benchmark implementation cannot be proven to exist (no test systems exist)

License

CircusTent is licensed under an Apache-style license; see the LICENSE file for details.

Authors

Acknowledgments

  • None at this time