Memory system characterization benchmarks using atomic operations
The CircusTent infrastructure is designed to give users and architects
the ability to characterize the performance of a target system
architecture’s memory subsystem using atomic memory operations. Atomic
memory operations have traditionally been considered high latency or
low performance given the difficulty of implementing them.
Nevertheless, atomic operations are widely used across parallel programming
constructs for synchronization primitives and to promote concurrency.
Prior to the creation of CircusTent, the architecture and programming
model communities had little ability to quantify the performance of
atomics at varying scales of a system architecture.
The CircusTent infrastructure is designed to be a modular benchmark
platform consisting of a frontend and backend infrastructure.
The frontend infrastructure defines the various benchmark types and
standard benchmark algorithms as well as providing the command line
execution interface. The backend provides one or more implementations
of the standard algorithms using various programming models.
The following packages/utilities are required to build CircusTent from source:
Optional packages include:
The following steps are generic build instructions. You may need to
modify these steps for your target system and compiler.
git clone https://github.com/tactcomplabs/circustent.git
cd circustent
mkdir build
cd build
Execute CMake to generate the makefiles (where XXX refers to the backend that you want to enable)
cmake -DENABLE_XXX=ON -DCT_CFLAGS="..." -DCT_CXXFLAGS="..." -DCT_LINKER_FLAGS="..." ../
Note that it will most often be necessary to pass the compiler-specific flags needed for your chosen backend
implementation to the CMake infrastructure via the CT_CFLAGS, CT_CXXFLAGS, and CT_LINKER_FLAGS options as shown above.
Execute the build
make
The circustent binary will reside in ./src/CircusTent/
(Optional) Install the build
make install
The following are additional build options supported by the CircusTent CMake script
The following contains brief descriptions of each candidate algorithm. For each algorithm,
we apply one or more of the following atomics:
The algorithmic descriptions below do not specify the size of the data values
implemented. The CircusTent software does not derive bandwidth. However,
we highly suggest that implementors utilize 64-bit values for the source
and index portions of the benchmark.
The following table presents all the core benchmarks and the number of
atomic operations performed per iteration for each (which is vital to calculating
accurate GAMS values across platforms).
Benchmark | Number of AMOs |
---|---|
RAND | 1 |
STRIDE1 | 1 |
STRIDEN | 1 |
PTRCHASE | 1 |
CENTRAL | 1 |
SG | 4 |
SCATTER | 3 |
GATHER | 3 |
Performs a stride-1 atomic update using an index array with randomly generated
indices and a source value array. The index array (IDX) must contain valid indices
within the bounds of the source value array (ARRAY). Utilizing standard-C
linear congruential methods is sufficient.
for( i=0; i<iters; i++ ){
  AMO(ARRAY[IDX[i]])
}
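As a minimal sketch of how the RAND kernel could be expressed in C11, the following uses `atomic_fetch_add` as the AMO and a 64-bit linear congruential generator to populate IDX. The helper names (`fill_random_indices`, `rand_add`), the LCG constants, and the increment of 1 are illustrative assumptions and are not taken from the CircusTent sources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: fill IDX with pseudo-random indices in [0, elems)
 * using a 64-bit linear congruential generator (Knuth's MMIX constants). */
static void fill_random_indices(uint64_t *idx, uint64_t iters,
                                uint64_t elems, uint64_t seed) {
    assert(elems > 0);
    uint64_t state = seed;
    for (uint64_t i = 0; i < iters; i++) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        idx[i] = (state >> 33) % elems; /* use the higher-quality upper bits */
    }
}

/* RAND-style kernel with an ADD atomic: one fetch-and-add per iteration. */
static void rand_add(_Atomic uint64_t *array, uint64_t elems,
                     const uint64_t *idx, uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++) {
        assert(idx[i] < elems);              /* IDX must stay within ARRAY */
        atomic_fetch_add(&array[idx[i]], 1); /* AMO(ARRAY[IDX[i]]) */
    }
}
```

Because every iteration adds exactly 1, the sum over ARRAY after the kernel equals the iteration count, which gives a cheap correctness check.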
Performs a stride-1 atomic update using only a source array (ARRAY).
for( i=0; i<iters; i++ ){
  AMO(ARRAY[i])
}
Performs a stride-N atomic update using only a source array (ARRAY).
The user must specify the stride of the operation.
for( i=0; i<iters; i+=stride ){
  AMO(ARRAY[i])
}
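To illustrate the compare-and-swap flavor of an AMO, the sketch below applies the stride-N loop with C11 `atomic_compare_exchange_strong`. The function name `striden_cas` and the choice of swapping 0 for 1 are assumptions made for illustration only; the actual backends may use different operands.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* STRIDEN-style kernel with a CAS atomic.
 * Returns the number of atomic operations issued by this sketch. */
static uint64_t striden_cas(_Atomic uint64_t *array, uint64_t iters,
                            uint64_t stride) {
    assert(stride > 0); /* a zero stride would never terminate */
    uint64_t ops = 0;
    for (uint64_t i = 0; i < iters; i += stride) {
        uint64_t expected = 0;
        /* AMO(ARRAY[i]): swap in 1 only if the element still holds 0 */
        atomic_compare_exchange_strong(&array[i], &expected, 1);
        ops++;
    }
    return ops;
}
```

With the loop written this way, only every `stride`-th element below `iters` is touched, so the elements in between remain unchanged.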
Performs a pointer chase operation across an index array. This implies
that the index used by iteration i+1 is the value returned by the i-th operation. This algorithm
only utilizes the index array (IDX). All index values must be valid within the
bounds of the index array.
for( i=0; i<iters; i++ ){
  start = AMO(IDX[start])
}
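The dependency between iterations can be sketched in C11 as follows. Here the AMO is assumed to be a fetch-and-add of zero, so that each step atomically reads the next index without altering the chain; that operand, like the function name `ptrchase`, is an assumption made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* PTRCHASE-style kernel: each atomic returns the index used by the next one.
 * Returns the final position in the chain. */
static uint64_t ptrchase(_Atomic uint64_t *idx, uint64_t elems,
                         uint64_t start, uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++) {
        assert(start < elems);                   /* stay inside IDX */
        start = atomic_fetch_add(&idx[start], 0); /* start = AMO(IDX[start]) */
    }
    return start;
}
```

Since each load depends on the previous result, the iterations cannot overlap, which is what makes this pattern sensitive to atomic latency rather than throughput.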
Performs an atomic operation on a single value from all PEs. This is a deliberate
hot-spot access pattern designed to immediately stress system and network
interconnects.
for( i=0; i<iters; i++ ){
  AMO(ARRAY[0])
}
Performs a scatter and a gather operation. The source values for the scatter,
gather and the final values are all fetched atomically. As with the other
algorithms, the source array and index array must be valid.
for( i=0; i<iters; i++ ){
  src = AMO(IDX[i])
  dest = AMO(IDX[i+1])
  val = AMO(ARRAY[src])
  AMO(ARRAY[dest], val) // ARRAY[dest] = val
}
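A minimal C11 sketch of the four atomics per SG iteration is shown below. It assumes the three fetches are fetch-and-add operations with a zero operand and that the final update is an atomic exchange, matching the `ARRAY[dest] = val` comment above; the real backends may implement the last step differently. Note that IDX needs at least `iters + 1` entries because each iteration reads `IDX[i+1]`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* SG-style kernel: four atomic operations per iteration. */
static void sg_sketch(_Atomic uint64_t *array, uint64_t elems,
                      _Atomic uint64_t *idx, uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t src  = atomic_fetch_add(&idx[i], 0);     /* AMO 1 */
        uint64_t dest = atomic_fetch_add(&idx[i + 1], 0); /* AMO 2 */
        assert(src < elems && dest < elems);
        uint64_t val  = atomic_fetch_add(&array[src], 0); /* AMO 3 */
        atomic_exchange(&array[dest], val);               /* AMO 4 */
    }
}
```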
Performs the scatter portion of an SG operation. As with the other
algorithms, the source array and index array must be valid.
for( i=0; i<iters; i++ ){
  dest = AMO(IDX[i+1])
  val = AMO(ARRAY[i])
  AMO(ARRAY[dest], val) // ARRAY[dest] = val
}
Performs the gather portion of an SG operation. As with the other
algorithms, the source array and index array must be valid.
for( i=0; i<iters; i++ ){
  dest = AMO(IDX[i+1])
  val = AMO(ARRAY[dest])
  AMO(ARRAY[i], val) // ARRAY[i] = val
}
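The SCATTER and GATHER loops differ only in which side of the copy is indexed through IDX. The following C11 sketch places them side by side; as an assumption for illustration, fetches are modeled as fetch-and-add of zero and the final write as an atomic exchange, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* SCATTER: read ARRAY[i] directly, write through the index array. */
static void scatter_sketch(_Atomic uint64_t *array, uint64_t elems,
                           _Atomic uint64_t *idx, uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t dest = atomic_fetch_add(&idx[i + 1], 0);
        assert(i < elems && dest < elems);
        uint64_t val = atomic_fetch_add(&array[i], 0);
        atomic_exchange(&array[dest], val); /* ARRAY[dest] = val */
    }
}

/* GATHER: read through the index array, write ARRAY[i] directly. */
static void gather_sketch(_Atomic uint64_t *array, uint64_t elems,
                          _Atomic uint64_t *idx, uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t dest = atomic_fetch_add(&idx[i + 1], 0);
        assert(i < elems && dest < elems);
        uint64_t val = atomic_fetch_add(&array[dest], 0);
        atomic_exchange(&array[i], val);    /* ARRAY[i] = val */
    }
}
```

Each loop body issues three atomics, which matches the per-iteration AMO count listed for SCATTER and GATHER in the table of core benchmarks.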
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | no |
STRIDE1_ADD | yes |
STRIDE1_CAS | no |
STRIDEN_ADD | yes |
STRIDEN_CAS | no |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | no |
CENTRAL_ADD | yes |
CENTRAL_CAS | no |
SG_ADD | yes |
SG_CAS | no |
SCATTER_ADD | yes |
SCATTER_CAS | no |
GATHER_ADD | yes |
GATHER_CAS | no |
CC=oshcc CXX=oshcxx cmake -DENABLE_OPENSHMEM=ON ../
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
CC=mpicc CXX=mpicxx cmake -DENABLE_MPI=ON ../
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | no |
STRIDE1_ADD | yes |
STRIDE1_CAS | no |
STRIDEN_ADD | yes |
STRIDEN_CAS | no |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | no |
CENTRAL_ADD | yes |
CENTRAL_CAS | no |
SG_ADD | yes |
SG_CAS | no |
SCATTER_ADD | yes |
SCATTER_CAS | no |
GATHER_ADD | yes |
GATHER_CAS | no |
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
--blocks: number of thread blocks
--threads: number of threads per block
circustent -b RAND_ADD -m 1024 -i 1000 --blocks 100 --threads 512
Benchmark | Supported? |
---|---|
RAND_ADD | yes |
RAND_CAS | yes |
STRIDE1_ADD | yes |
STRIDE1_CAS | yes |
STRIDEN_ADD | yes |
STRIDEN_CAS | yes |
PTRCHASE_ADD | yes |
PTRCHASE_CAS | yes |
CENTRAL_ADD | yes |
CENTRAL_CAS | yes |
SG_ADD | yes |
SG_CAS | yes |
SCATTER_ADD | yes |
SCATTER_CAS | yes |
GATHER_ADD | yes |
GATHER_CAS | yes |
The following list details the current set of command line options common to all CircusTent backends:
In addition to the options above, backends not explicitly listed below also utilize the “pes” command line option as shown.
When utilizing the CUDA backend, users must explicitly define the number of thread blocks and threads per block to use during kernel execution as follows (Note that the CUDA backend does not accept a PEs argument):
The following are various examples of utilizing CircusTent for benchmarks
circustent --help
circustent --list
circustent -b RAND_ADD -m 1024 -p 2 -i 1000
circustent -b SCATTER_CAS -m 16488974000 -p 24 -i 20000000
For each of the target benchmarks, CircusTent prints two relevant
performance values. First, the wallclock runtime of the target algorithm
is printed in seconds. Note that running very small problems with very small
wallclock runtimes may exceed the lower bound of the timing variables. If
you experience issues in printing the timing, increase the number of iterations
per PE. An example of the timing printout is as follows:
Timing (secs) : 0.340783
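For readers who want to reproduce a comparable wallclock measurement around their own kernel, the sketch below converts two `struct timespec` samples (for example from C11 `timespec_get`) into seconds. The helper name `elapsed_secs` is hypothetical, and this is not necessarily how CircusTent measures time internally.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: seconds elapsed between two timespec samples. */
static double elapsed_secs(struct timespec start, struct timespec end) {
    /* tv_nsec is always in [0, 1e9) for a valid sample */
    assert(start.tv_nsec >= 0 && start.tv_nsec < 1000000000L);
    assert(end.tv_nsec >= 0 && end.tv_nsec < 1000000000L);
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

A typical use is to call `timespec_get(&t0, TIME_UTC)` before the kernel and `timespec_get(&t1, TIME_UTC)` after it; if the result is zero or implausibly small, the run is too short for the timer and the iteration count should be raised.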
The second metric that is printed is the number of billions of atomic
operations per second, or GAMS (Giga AMOs/sec). This metric derives
the total, parallel number of atomic operations performed in the given
time window. This value can be utilized to compare platforms based upon
the number of parallel atomics that can be realistically performed using the
target algorithm. This is derived uniquely for each algorithm, as the total
number of atomics performed is equivalent to (NUM_PEs x NUM_ITERATIONS x NUM_AMOs_PER_ITER).
An example of the GAMS printout is as follows:
Giga AMOs/sec (GAMS) : 4.22556
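The GAMS arithmetic described above can be sketched in a few lines of C. The function name `gams` and its parameter names are illustrative; only the formula (PEs x iterations x AMOs per iteration, divided by the runtime and by 10^9) comes from the description.

```c
#include <assert.h>
#include <stdint.h>

/* Billions of atomic operations per second for one benchmark run. */
static double gams(uint64_t pes, uint64_t iters, uint64_t amos_per_iter,
                   double secs) {
    assert(secs > 0.0); /* a zero runtime means the run was too short to time */
    double total_amos = (double)pes * (double)iters * (double)amos_per_iter;
    return total_amos / secs / 1e9;
}
```

For instance, a SCATTER run with 24 PEs and 20000000 iterations per PE performs 24 x 20000000 x 3 = 1,440,000,000 atomics, so a hypothetical runtime of 0.5 seconds would correspond to 2.88 GAMS.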
A sample result set from executing the OpenMP (OMP) implementation
on a modern, dual-socket Intel Xeon system is depicted as follows.
For each of these benchmarks, we utilized the following execution parameters:
See the developer documentation.
All contributions must be made via documented pull requests. Pull requests will be tested
using the CircusTent development infrastructure in order to ensure correctness and
code stability. Pull requests may be initially denied for one or more of the following
reasons (violations will be documented in pull request comments):
CircusTent is licensed under an Apache-style license; see the LICENSE file for details.