Project author: zz-systems

Project description: SIMD abstraction layer
Language: C++
Project URL: git://github.com/zz-systems/zacc.git
Created: 2016-04-25T06:03:24Z
Project community: https://github.com/zz-systems/zacc

License: MIT License


ZACC

Abstract

ZACC is a human-readable and extensible computation abstraction layer.
Using ZACC and ZACC build, you are able to write and compile code once and execute it on target machines, unleashing their potential.

ZACC is still under development, in step with the development of cacophony.

Feel free to report issues and bugs to the issue tracker on GitHub.

Documentation

Design goals

There are a few SIMD libraries available, such as Eigen or Agner Fog’s vector class library, all following the same goal: accelerate your algorithms by using SIMD instructions.

ZACC implementation had these goals:

  • Coding as if you would write vanilla C++.
    std::cout << (zint(32) % 16) << std::endl; prints [0, 0, 0, 0] if SSE extensions are used.
  • DRY.
    Write once, run faster everywhere
  • Runtime feature selection.
    The dispatcher checks the system features and selects the best fitting implementation.
  • Easy integration.
    ZACC offers cmake scripts to build your project.
  • Portability.
    ZACC accelerated projects should be able to run on any OS and any processor.
  • Speed.
    Although ZACC may not be the most highly optimized library in the world, speed combined with great usability is a high priority.
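
To make the first goal concrete, here is a minimal sketch of what "lane-wise" arithmetic means for the zint(32) % 16 example. The int4 type and to_string helper are hypothetical stand-ins written in plain C++; they are not ZACC's zint and say nothing about how ZACC implements it. The sketch only assumes four 32-bit lanes, as in one SSE register.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical stand-in for a 4-lane integer vector (one 128-bit SSE register
// holds four 32-bit integers). This is NOT ZACC's zint, only an illustration.
struct int4
{
    std::array<int, 4> lanes;

    // Broadcast one scalar to every lane, like zint(32) in the text.
    explicit int4(int value) : lanes{{value, value, value, value}} {}

    // Lane-wise remainder against a scalar divisor.
    int4 operator%(int divisor) const
    {
        int4 result(0);
        for (std::size_t i = 0; i < lanes.size(); ++i)
            result.lanes[i] = lanes[i] % divisor;
        return result;
    }
};

// Format the lanes as "[a, b, c, d]".
std::string to_string(const int4 &v)
{
    std::ostringstream out;
    out << "[";
    for (std::size_t i = 0; i < v.lanes.size(); ++i)
        out << (i ? ", " : "") << v.lanes[i];
    out << "]";
    return out.str();
}
```

With this stand-in, to_string(int4(32) % 16) yields "[0, 0, 0, 0]": the scalar is broadcast to every lane and the operator is applied to each lane independently.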

Features

  • Linear algebra support
  • Arithmetic operations
  • Conditional operations
  • Rounding operations
  • Standard functions like abs, min, max, etc…
  • Trigonometric functions (sin, cos, tan)
  • Platform detection
  • Runtime dispatching
  • Kernel infrastructure
  • Extended algorithms (STL-compatible)
  • Uses vanilla C++14

Integration

The project is available as a git submodule or as a release from the GitHub project page.

If you choose the submodule route, add it with git submodule add https://github.com/zz-systems/zacc.git

CMake is required in your project to use ZACC and the ZACC build system.

Usage

To execute an accelerated algorithm, you need a kernel interface, a kernel implementation and an entrypoint.

Mandelbrot kernel interface

The kernel interface is the connection between the vectorized code in satellite assemblies and the main application.
The separation is necessary, because the kernel implementation uses vector types, which must not appear in the main application and are hidden in satellite assemblies.

The vital function mapping for the dispatcher is provided by system::kernel_interface<_KernelInterface> (The dispatcher relies on operator()(...) overloads).

Three methods are already mapped; you have to declare them in the interface and implement them in the kernel:

  • run(output_container &output)
  • run(const input_container &input, output_container &output)
  • configure(any argument...)

You can extend or change the mappings with your custom implementation.
Also, you need to specify the input and output container types and provide a name for the kernel.

Below is an example Mandelbrot kernel interface, available in the examples.

    #include <vector>
    #include "zacc.hpp"
    #include "math/matrix.hpp"
    #include "util/algorithm.hpp"
    #include "system/entrypoint.hpp"
    #include "system/kernel_interface.hpp"

    using namespace zacc;
    using namespace math;

    struct __mandelbrot
    {
        // container types used by the dispatcher
        using output_container = std::vector<int>;
        using input_container  = std::vector<int>;

        // name under which the kernel is registered
        static constexpr auto kernel_name() { return "mandelbrot"; }

        virtual void configure(vec2<int> dim, vec2<float> cmin, vec2<float> cmax, size_t max_iterations) = 0;
        virtual void run(output_container &output) = 0;
    };

    using mandelbrot = system::kernel_interface<__mandelbrot>;
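
The idea of mapping operator()(...) overloads onto interface methods can be sketched in plain C++. Everything below (fill_interface, callable_interface, fill_kernel) is hypothetical and only illustrates the pattern; it is not how system::kernel_interface is actually implemented.

```cpp
#include <utility>
#include <vector>

// Hypothetical interface, shaped like the one above but with plain scalars.
struct fill_interface
{
    using output_container = std::vector<int>;

    virtual ~fill_interface() = default;
    virtual void configure(int value) = 0;
    virtual void run(output_container &output) = 0;
};

// Hypothetical wrapper: maps call-operator overloads onto the interface
// methods, so a caller only ever needs to invoke operator().
template <typename Interface>
struct callable_interface : Interface
{
    using output_container = typename Interface::output_container;

    // operator()(output) forwards to run(output)
    void operator()(output_container &output) { this->run(output); }

    // any other argument list forwards to configure(...)
    template <typename... Args>
    void operator()(Args &&... args)
    {
        this->configure(std::forward<Args>(args)...);
    }
};

// A trivial kernel: fills the output with the configured value.
struct fill_kernel : callable_interface<fill_interface>
{
    int _value = 0;

    void configure(int value) override { _value = value; }

    void run(output_container &output) override
    {
        for (auto &element : output)
            element = _value;
    }
};
```

Calling kernel(7) on a fill_kernel reaches configure(7), and kernel(output) with a std::vector<int> reaches run(output), because the non-template overload wins when both match equally well.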

Mandelbrot kernel implementation

Now that you have specified the kernel interface, you may want to write the implementation.
Keep in mind that the built-in C++ if/else does not work with vector types, because each lane may need a different branch. You need to rethink the control flow and use branchless arithmetic.
Nonetheless, the implementation does not differ much from the canonical Mandelbrot implementation and is able to use the SSE2, SSE3, SSE4, FMA, AVX and AVX2 features of the host processor.
All of this works without touching intrinsics directly.

Write once, run faster everywhere :)

    #include "zacc.hpp"
    #include "math/complex.hpp"
    #include "math/matrix.hpp"
    #include "util/algorithm.hpp"
    #include "system/kernel.hpp"
    #include "../interfaces/mandelbrot.hpp"

    using namespace zacc;
    using namespace math;

    DISPATCHED struct mandelbrot_kernel : system::kernel<mandelbrot>,
                                          allocatable<mandelbrot_kernel, arch>
    {
        vec2<zint> _dim;
        vec2<zfloat> _cmin;
        vec2<zfloat> _cmax;
        size_t _max_iterations;

        virtual void configure(vec2<int> dim, vec2<float> cmin, vec2<float> cmax, size_t max_iterations) override
        {
            _dim = dim;
            _cmax = cmax;
            _cmin = cmin;
            _max_iterations = max_iterations;
        }

        virtual void run(mandelbrot::output_container &output) override
        {
            // populate output container
            zacc::generate<zint>(std::begin(output), std::end(output), [this](auto i)
            {
                // compute 2D-position from 1D-index
                auto pos = reshape<vec2<zfloat>>(make_index<zint>(zint(i)), _dim);

                // map the pixel position into the complex plane
                zcomplex<zfloat> c(_cmin.x + pos.x / zfloat(_dim.x - 1) * (_cmax.x - _cmin.x),
                                   _cmin.y + pos.y / zfloat(_dim.y - 1) * (_cmax.y - _cmin.y));
                zcomplex<zfloat> z = 0;

                bfloat done = false;
                zint iterations = 0;

                for (size_t j = 0; j < _max_iterations; j++)
                {
                    // done when magnitude is >= 2 (or square magnitude is >= 4)
                    done = done || z.sqr_magnitude() >= 4.0;

                    // compute next complex if not done
                    z = z
                        .when(done)
                        .otherwise(z * z + c);

                    // increment if not done
                    iterations = iterations
                        .when(done)
                        .otherwise(iterations + 1);

                    // break once every lane is done
                    if (is_set(done))
                        break;
                }

                return iterations;
            });
        }
    };
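
The when(...).otherwise(...) pattern in the kernel can be pictured as a per-lane select. The sketch below emulates it with plain arrays; select, count_until, int_lanes and mask_lanes are hypothetical names, the width of four lanes is an assumption, and real SIMD code would use a single blend instruction instead of a loop.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t lane_count = 4; // assumed width: four 32-bit lanes

using int_lanes  = std::array<int, lane_count>;
using mask_lanes = std::array<bool, lane_count>;

// Per-lane select: keep `current` where the mask is set, otherwise take `next`.
// Mirrors the shape of current.when(mask).otherwise(next).
int_lanes select(const mask_lanes &mask, const int_lanes &current, const int_lanes &next)
{
    int_lanes result{};
    for (std::size_t i = 0; i < lane_count; ++i)
    {
        // arithmetic blend instead of if/else: m is 1 or 0
        const int m = static_cast<int>(mask[i]);
        result[i] = m * current[i] + (1 - m) * next[i];
    }
    return result;
}

// Count upwards in every lane; each lane stops on its own once it reaches its
// limit, and the loop exits early only when all lanes are done.
int_lanes count_until(const int_lanes &limit, int max_iterations)
{
    int_lanes iterations{}; // all lanes start at zero
    mask_lanes done{};      // all lanes start as "not done"

    for (int j = 0; j < max_iterations; ++j)
    {
        int_lanes incremented{};
        bool all_done = true;

        for (std::size_t i = 0; i < lane_count; ++i)
        {
            done[i] = done[i] || iterations[i] >= limit[i];
            incremented[i] = iterations[i] + 1;
            all_done = all_done && done[i];
        }

        // increment only the lanes that are not done yet
        iterations = select(done, iterations, incremented);

        if (all_done)
            break;
    }
    return iterations;
}
```

The loop has the same structure as the Mandelbrot iteration: the done mask is sticky, finished lanes keep their value, and the early exit is taken only when no lane has work left.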

Entrypoint

The so-called entrypoint is the low-level interface between the main application and vectorized implementations.
Kernels are created and destroyed through this interface.

entrypoint.hpp

Here you declare your available kernel ‘constructors’ and ‘destructors’.
The convention is {kernel_name}_create_instance() and {kernel_name}_delete_instance(entrypoint *).

    #include "{your_application_name}_arch_export.hpp"
    #include "system/entrypoint.hpp"

    extern "C"
    {
        {your_application_name}_ARCH_EXPORT zacc::system::entrypoint *mandelbrot_create_instance();
        {your_application_name}_ARCH_EXPORT void mandelbrot_delete_instance(zacc::system::entrypoint *instance);
    }

entrypoint.cpp

Here you implement your available kernel ‘constructors’ and ‘destructors’.
Usually, simply instantiating and deleting a kernel is sufficient, but more complex logic can be introduced.

    #include "entrypoint.hpp"
    #include "system/arch.hpp"
    #include "kernels/mandelbrot.hpp"

    // create mandelbrot kernel instance
    zacc::system::entrypoint *mandelbrot_create_instance()
    {
        return new zacc::examples::mandelbrot_kernel<zacc::arch::types>();
    }

    // destroy mandelbrot kernel instance
    void mandelbrot_delete_instance(zacc::system::entrypoint* instance)
    {
        if(instance != nullptr)
            delete instance;
    }
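
On the consuming side, a create/delete pair like this can be wrapped in an owning handle so the matching delete function is always called. The sketch below is a plain C++ illustration: entrypoint, demo_kernel, demo_create_instance, demo_delete_instance and kernel_handle are hypothetical stand-ins, not ZACC types, and the sketch does not show how ZACC's dispatcher manages kernel lifetime.

```cpp
#include <memory>

// Hypothetical base type standing in for zacc::system::entrypoint.
struct entrypoint
{
    virtual ~entrypoint() = default;
    virtual int id() const = 0;
};

// Hypothetical kernel living behind the entrypoint.
struct demo_kernel : entrypoint
{
    int id() const override { return 42; }
};

// C linkage keeps the symbol names unmangled, so they can be found by name
// following the {kernel_name}_create_instance convention.
extern "C" entrypoint *demo_create_instance() { return new demo_kernel(); }
extern "C" void demo_delete_instance(entrypoint *instance) { delete instance; }

// Owning handle that always releases through the matching delete function.
using kernel_handle = std::unique_ptr<entrypoint, void (*)(entrypoint *)>;

kernel_handle make_demo_kernel()
{
    return kernel_handle(demo_create_instance(), &demo_delete_instance);
}
```

Pairing each create function with its own delete function matters when the kernel lives in a separate library: the memory is then freed by the same module that allocated it.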

Execution

Here you need to create a dispatcher for your kernel and configure / invoke the kernel.
The kernel invocation happens inside the dispatcher, which acts as a proxy.
The dispatcher offers the following methods:

  • dispatch_some(...) - dispatch on all available architectures (e.g. kernel configuration)
  • dispatch_one(...) - dispatch on the best available architecture (e.g. kernel execution)

    #include <vector>
    #include "../interfaces/mandelbrot.hpp"
    #include "system/kernel_dispatcher.hpp"
    #include "math/matrix.hpp"

    // mandelbrot config:
    vec2<int> dimensions = {2048, 2048};
    vec2<float> cmin = {-2, -2};
    vec2<float> cmax = { 2, 2 };
    size_t max_iterations = 2048;

    // get kernel dispatcher
    auto dispatcher = system::make_dispatcher<mandelbrot>();

    // configure kernel on all available architectures
    dispatcher.dispatch_some(dimensions, cmin, cmax, max_iterations);

    // prepare output
    std::vector<int> result(dimensions.x * dimensions.y);

    // run on the best available architecture
    dispatcher.dispatch_one(result);
    ...
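
The difference between "all available" and "best available" can be sketched without any ZACC code. The branch struct and the two helper functions below are hypothetical, and the available flags are hard-coded here, whereas a real dispatcher would fill them in from CPU feature detection.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one compiled branch of a kernel.
struct branch
{
    std::string name;     // e.g. "scalar", "sse2", "avx2"
    int priority;         // higher means more capable
    bool available;       // would come from CPU feature detection
    int configured_value; // stands in for the kernel configuration
};

// Apply an action to every branch the machine can run (cf. dispatch_some).
// Returns the number of branches visited.
template <typename Action>
std::size_t for_each_available(std::vector<branch> &branches, Action action)
{
    std::size_t visited = 0;
    for (auto &candidate : branches)
    {
        if (!candidate.available)
            continue;
        action(candidate);
        ++visited;
    }
    return visited;
}

// Pick the most capable available branch (cf. dispatch_one).
// Returns nullptr if no branch is available.
branch *select_best(std::vector<branch> &branches)
{
    branch *best = nullptr;
    for (auto &candidate : branches)
    {
        if (candidate.available && (best == nullptr || candidate.priority > best->priority))
            best = &candidate;
    }
    return best;
}
```

Configuring every available branch first means that whichever branch is later selected for execution already has its settings, which is one plausible reason for splitting configuration and execution into two dispatch calls.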

Build system

Prerequisites

    # add zacc targets
    add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/dependencies/zacc)

    # use zacc build system
    include(${CMAKE_CURRENT_SOURCE_DIR}/dependencies/zacc/cmake/zacc.shared.cmake)

    # add include lookup directories
    include_directories(
        ${CMAKE_CURRENT_SOURCE_DIR}/dependencies/zacc/include
    )

Library target

Defines a shared/dynamic library with dispatcher and kernel implementations in additional libraries.

    # your shared library which aggregates the branches
    zacc_add_dispatched_library(_your_library_
        # your library entrypoint
        ENTRYPOINT ${CMAKE_SOURCE_DIR}/_your_library_entrypoint.cpp

        # additional includes
        INCLUDES ${CMAKE_SOURCE_DIR}/include ${CMAKE_SOURCE_DIR}/dependencies/zacc/include

        # branches to build for
        BRANCHES "${branches}"

        # your main library source
        SOURCES
            ${CMAKE_SOURCE_DIR}/_your_library_.cpp
    )

Executable target

Defines a main application with dispatcher and kernel implementations in additional libraries.

    zacc_add_dispatched_executable(_your_application_
        # branches to build for
        BRANCHES "${branches}"

        # additional includes
        INCLUDES
            ${PROJECT_SOURCE_DIR}/include

        # your kernel entrypoint
        ENTRYPOINT
            ${PROJECT_SOURCE_DIR}/_your_application_entrypoint.cpp

        # your main application sources
        SOURCES
            ${PROJECT_SOURCE_DIR}/_your_application_.cpp
    )

Unit test target

Defines unit test targets using GoogleTest.

    # unit testing your implementation on all branches

    # find the test main (you may provide your own implementation)
    file(GLOB ZACC_TEST_MAIN "${PROJECT_SOURCE_DIR}/*/zacc/*/test_main.cpp")

    # find the test entry point (you may provide your own implementation)
    file(GLOB ZACC_TEST_ENTRYPOINT "${PROJECT_SOURCE_DIR}/*/zacc/*/test_entry_point.cpp")

    zacc_add_dispatched_tests(_your_tests_
        # test main. used to skip the tests if the processing unit is not
        # capable of running a particular featureset
        TEST_MAIN ${ZACC_TEST_MAIN}

        # gtest main
        TEST_ENTRYPOINT ${ZACC_TEST_ENTRYPOINT}

        # branches to build for
        BRANCHES "${branches}"

        # additional include directories
        INCLUDES ${CMAKE_SOURCE_DIR}/include

        # your test sources
        SOURCES
            ${_your_test_files_here}
    )

Current state

  • In development!
  • Used in cacophony - a coherent noise library

Tested hardware:

| Processor | Highest featureset |
|---|---|
| AMD FX-8350 | AVX1 |
| Intel Core i7 6500U | AVX2 + FMA |
| Intel Core i7 7700K | AVX2 + FMA |
| Intel Xeon E5-2697 v3 | AVX2 + FMA |
| Intel Xeon E5-2680 v3 | AVX2 + FMA |
| Intel Xeon E5-2680 v2 | AVX1 |
| Intel Xeon X5570 | SSE4.1 |

Tested operating systems

  • Mac OS X Sierra / High Sierra
  • Linux
  • Windows 10

Architecture support

| Featureset | State |
|---|---|
| x87 FPU | :white_check_mark: scalar |
| SSE2 | :white_check_mark: |
| SSE3 | :white_check_mark: |
| SSE3 + SSSE3 | :white_check_mark: |
| SSE4.1 | :white_check_mark: |
| SSE4.1 + FMA3 | :white_check_mark: |
| SSE4.1 + FMA4 | :white_check_mark: |
| AVX1 | :no_entry: Integer vector emulation faulty. |
| AVX1 + FMA3 | :no_entry: Integer vector emulation faulty. |
| AVX2 | :white_check_mark: |
| AVX512 | :no_entry: In development, can’t be tested yet* |
| ARM NEON | :no_entry: Not implemented yet |
| GPGPU | :no_entry: Not implemented yet** |
| FPGA | :no_entry: Not implemented yet*** |

*For AVX512, access to a Xeon Phi accelerator or a modern Xeon CPU is necessary.

**Some work is already done for the OpenCL implementation. Some macros or C++ code postprocessing may be introduced.

***Same starting issues as for the GPGPU feature; code generation is a separate topic.

Compiler support

| Compiler | State |
|---|---|
| GCC 5 | :white_check_mark: |
| GCC 6 | :white_check_mark: |
| GCC 7 | :white_check_mark: |
| Clang 3.9 | :no_entry: Not compilable |
| Clang 4.0 | :white_check_mark: |
| LLVM version 8.1.0 | :no_entry: Not compilable |
| LLVM version 9.0.0 | :white_check_mark: |
| Clang-cl | :white_check_mark: |
| MSVC | :no_entry: Not supported* |

*MSVC is not supported because ZACC requires fine-grained compile options and MSVC's C++ implementation is non-conforming.
Clang-cl, which is binary compatible with MSVC, is used instead (work in progress).

Supported data types

| C++ scalar type | ZACC vector type | State |
|---|---|---|
| signed int8 | zint8, zbyte | :white_check_mark: Partially emulated. |
| signed int16 | zint16, zshort | :white_check_mark: |
| signed int32 | zint32, zint | :white_check_mark: |
| signed int64 | zint64, zlong | :no_entry: Not implemented yet |
| float16 | zfloat16 | :no_entry: Not implemented yet |
| float32 | zfloat, zfloat32 | :white_check_mark: |
| float64 | zdouble, zfloat64 | :white_check_mark: |
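
As a rough guide to how many elements such vector types can process at once, the lane count follows from the register width divided by the element width. The lanes helper below is a hypothetical illustration and assumes a vector type fills exactly one hardware register; emulated types may be laid out differently.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit into a register of the given width.
// Assumes 8-bit bytes and that the vector occupies exactly one register.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits)
{
    return register_bits / (sizeof(T) * 8);
}

// 128-bit SSE registers
static_assert(lanes<std::int32_t>(128) == 4, "four 32-bit integers per SSE register");
static_assert(lanes<std::int16_t>(128) == 8, "eight 16-bit integers per SSE register");

// 256-bit AVX registers
static_assert(lanes<std::int32_t>(256) == 8, "eight 32-bit integers per AVX register");
```

This is consistent with the earlier example, where a 32-bit integer vector prints four values when SSE extensions are used.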

License

The library is licensed under the MIT License:

Copyright © 2015-2018 Sergej Zuyev

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Execute unit tests

To compile and run the tests, execute:

    $ make zacc.tests.all
    $ ctest
    --------------------------------------------------------------------
        Start 1: ci.zacc.tests.scalar
    1/8 Test #1: ci.zacc.tests.scalar ............. Passed 0.01 sec
        Start 2: ci.zacc.tests.sse.sse2
    2/8 Test #2: ci.zacc.tests.sse.sse2 ........... Passed 0.01 sec
        Start 3: ci.zacc.tests.sse.sse3
    3/8 Test #3: ci.zacc.tests.sse.sse3 ........... Passed 0.01 sec
        Start 4: ci.zacc.tests.sse.sse41
    4/8 Test #4: ci.zacc.tests.sse.sse41 .......... Passed 0.01 sec
        Start 5: ci.zacc.tests.sse.sse41.fma3
    5/8 Test #5: ci.zacc.tests.sse.sse41.fma3 ..... Passed 0.01 sec
        Start 6: ci.zacc.tests.sse.sse41.fma4
    6/8 Test #6: ci.zacc.tests.sse.sse41.fma4 ..... Passed 0.00 sec
        Start 7: ci.zacc.tests.avx
    7/8 Test #7: ci.zacc.tests.avx ................ Passed 0.01 sec
        Start 8: ci.zacc.tests.avx2
    8/8 Test #8: ci.zacc.tests.avx2 ............... Passed 0.01 sec

    100% tests passed, 0 tests failed out of 8

    Total Test time (real) = 0.11 sec