Optimized sample code for SHA256 and SHA512 using C intrinsics
This sample code package is an optimized version of SHA256 and SHA512.
The code was written by Nir Drucker and Shay Gueron, AWS Cryptographic Algorithms Group.
While C code is easier to maintain and review, the performance obtained by compilation (e.g., with gcc-9 and clang-9) is often lower than that of hand-written assembly code (e.g., the code in this example). This sample code is made publicly available to help compiler designers understand this use case by reviewing the code and its generated assembly. We hope this information will improve compilers’ ability to generate efficient assembly.
This sample code provides testing binaries but no shared or static libraries. This is because the code is designed to be used for benchmarking purposes only and not in final products.
The x86-64 AVX code is based on the paper:
The code version that uses Intel SHA Extensions instructions is based on the following reference:
This project is licensed under the Apache-2.0 License.
This package requires
To build, first create a working directory:
mkdir build
cd build
Then run CMake and compile:
cmake -DCMAKE_BUILD_TYPE=Release ..
make
Additional CMake compilation flags:
To clean, remove the build directory. Note that a “clean” is required prior to compilation with modified flags.
To format (clang-format-9 or above is required):
make format
To use clang-tidy (clang-tidy-9 is required):
CC=clang-9 cmake -DCMAKE_C_CLANG_TIDY="clang-tidy-9;--fix-errors;--format-style=file" ..
make
Before committing code, please test it using tests/pre-commit-script.sh
This will run all the sanitizers and also clang-format and clang-tidy (requires clang-9 to be installed).
The package was compiled and tested with gcc-9 and clang-9 in 64-bit mode.
Tests were run on a Linux (Ubuntu 18.04.4 LTS) OS on x86-64 and AARCH64 machines.
Compilation on other platforms may require some adjustments.
When using the TEST_SPEED flag, the performance measurements are reported in processor cycles (per single core). The results are obtained using the following methodology. Each measured function was isolated and run 25 times (warm-up), followed by 100 iterations that were clocked and averaged. To minimize the effect of background tasks running on the system, every experiment was repeated 10 times, and the minimum result is reported.
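The methodology above can be sketched in C as follows. This is a minimal illustration, not the package's actual measurement code: the names `measure_min_avg`, `now_ns`, `bench_fn_t`, and `dummy_workload` are hypothetical, and a POSIX monotonic nanosecond clock is assumed as a portable stand-in for the processor cycle counter that the package reports.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define WARMUP_ITERS   25  /* untimed warm-up runs per experiment */
#define MEASURED_ITERS 100 /* timed runs that are averaged */
#define REPEATS        10  /* experiments; the minimum average is kept */

/* Hypothetical signature for the function under test. */
typedef void (*bench_fn_t)(void *ctx);

/* Assumption: nanoseconds from a monotonic clock replace a cycle counter. */
static uint64_t now_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Returns the minimum, over REPEATS experiments, of the average time
 * per call measured across MEASURED_ITERS calls after a warm-up. */
static uint64_t measure_min_avg(bench_fn_t fn, void *ctx)
{
  assert(fn != NULL);

  uint64_t best = UINT64_MAX;
  for (int r = 0; r < REPEATS; r++) {
    /* Warm-up: fill caches and branch predictors, not timed. */
    for (int i = 0; i < WARMUP_ITERS; i++) {
      fn(ctx);
    }

    /* Timed section: clock all iterations together, then average. */
    uint64_t start = now_ns();
    for (int i = 0; i < MEASURED_ITERS; i++) {
      fn(ctx);
    }
    uint64_t avg = (now_ns() - start) / MEASURED_ITERS;

    /* Keep the minimum to reduce the effect of background tasks. */
    if (avg < best) {
      best = avg;
    }
  }
  return best;
}

/* Hypothetical workload: counts how many times it was invoked. */
static void dummy_workload(void *ctx)
{
  volatile uint64_t *calls = (volatile uint64_t *)ctx;
  (*calls)++;
}
```

Calling `measure_min_avg(dummy_workload, &calls)` with a `uint64_t calls = 0` invokes the workload 10 × (25 + 100) times in total. Taking the minimum rather than the mean across experiments is a common way to discard runs disturbed by interrupts or other system activity.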
The library reports results only for code that is supported by the OS/compiler. It also compares the results of the C code with intrinsics to the assembly code of OpenSSL commit 13c5d744 (see here for more details).
A benchmark example is found here.