Performance patches and build fixes for Elbrus (e2k) architecture.
This is my personal repository so that patches won’t get lost.
Elbrus 2000 (aka e2k) is a 64-bit little-endian architecture.
The compiler is mostly GCC compatible (defines __GNUC__) and uses the EDG frontend.
uname -m returns e2k
if(${CMAKE_SYSTEM_PROCESSOR} STREQUAL "e2k")
#if defined(__e2k__)
If __LCC__ = 125 and __LCC_MINOR__ = 9, then it’s “LCC 1.25.09”.
__iset__ is the instruction set version (less than 3 is obsolete, 6 is the latest at the moment).
The compiler enables MMX to AVX2 support by default. Pass -mno-avx (-mno-sse4.2) if the code depends on the presence of macros (e.g. #if defined(__AVX2__)).
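The version macros described above can be turned into a readable string. This is a minimal sketch: the helper names format_lcc_version and compiler_description are hypothetical, and splitting __LCC__ into major and minor parts by dividing by 100 is an assumption generalized from the single "LCC 1.25.09" example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: builds "LCC 1.25.09" from lcc = 125, lcc_minor = 9.
 * Assumption: __LCC__ encodes major * 100 + minor. */
static int format_lcc_version(char *buf, size_t size, int lcc, int lcc_minor)
{
    return snprintf(buf, size, "LCC %d.%d.%02d",
                    lcc / 100, lcc % 100, lcc_minor);
}

/* Describes the compiler this file is built with. */
static const char *compiler_description(char *buf, size_t size)
{
#if defined(__LCC__) && defined(__LCC_MINOR__)
    format_lcc_version(buf, size, __LCC__, __LCC_MINOR__);
#else
    snprintf(buf, size, "not LCC");
#endif
    return buf;
}
```

Calling compiler_description() with a 32-byte buffer is enough for either branch.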
x86intrin.h first)
Use compile-time CPU detection and select the best SIMD level up to SSE4.1.
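As an illustration of compile-time SIMD selection, the sketch below picks an SSE4.1 path when the compiler defines __SSE4_1__ and falls back to scalar code otherwise. The function max_i32 and the SIMD_NAME macro are hypothetical names chosen for this example, not part of any patch in this repository.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE4_1__)
#include <smmintrin.h>          /* SSE4.1: _mm_max_epi32 */
#define SIMD_NAME "SSE4.1"
#else
#define SIMD_NAME "scalar"
#endif

/* Returns the largest element of v; n must be at least 1. */
static int32_t max_i32(const int32_t *v, size_t n)
{
    int32_t best = v[0];
    size_t i = 0;
#if defined(__SSE4_1__)
    if (n >= 4) {
        /* Keep four running maxima, one per 32-bit lane. */
        __m128i acc = _mm_loadu_si128((const __m128i *)v);
        for (i = 4; i + 4 <= n; i += 4)
            acc = _mm_max_epi32(acc,
                                _mm_loadu_si128((const __m128i *)(v + i)));
        int32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, acc);
        for (int k = 0; k < 4; k++)
            if (lanes[k] > best)
                best = lanes[k];
    }
#endif
    /* Scalar tail, or the whole array when SSE4.1 is not enabled. */
    for (; i < n; i++)
        if (v[i] > best)
            best = v[i];
    return best;
}
```

The selection happens entirely in the preprocessor, so no runtime CPU check is involved.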
#include <x86intrin.h>
uint64_t time = __rdtsc();
// same: unsigned aux; uint64_t time = __rdtscp(&aux);
_Pragma("name") - to use from macros.
Use before the loop:
#pragma ivdep - ignore data dependencies inside the loop
#pragma unroll(n) - unroll the loop n times
Using the restrict keyword is good for performance, but note that LCC ignores it if you’re using vector load/store intrinsics such as _mm_load_si128(). For code with vector intrinsics, use #pragma ivdep.
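The loop pragmas can be wrapped in macros through _Pragma so that other compilers never see them. The following is a sketch under assumptions: the macro names DO_PRAGMA, LOOP_IVDEP and LOOP_UNROLL and both functions are hypothetical, and the pragmas are enabled only when __LCC__ is defined.

```c
#include <stddef.h>

#if defined(__LCC__)
#define DO_PRAGMA(x)   _Pragma(#x)            /* stringify, then apply */
#define LOOP_IVDEP     DO_PRAGMA(ivdep)
#define LOOP_UNROLL(n) DO_PRAGMA(unroll(n))
#else
#define LOOP_IVDEP                            /* expands to nothing */
#define LOOP_UNROLL(n)
#endif

/* The caller must guarantee that dst does not overlap a or b in a way
 * that creates a loop-carried dependency; ivdep tells LCC to assume so. */
static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    LOOP_IVDEP
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Asks LCC to unroll the loop four times. */
static void scale_array(float *v, float k, size_t n)
{
    LOOP_UNROLL(4)
    for (size_t i = 0; i < n; i++)
        v[i] *= k;
}
```

On compilers other than LCC the macros vanish, so the same source builds without unknown-pragma warnings.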
Instead of makecontext(ctx, ...), use makecontext_e2k(ctx, ...), which returns a negative integer on error. It allocates extra resources that must be freed with freecontext_e2k(ctx).
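A portable wrapper around the context calls might look like the sketch below. It assumes makecontext_e2k takes the same arguments as makecontext and is declared in ucontext.h on e2k, which matches the description above but is not verified here; run_coroutine_once, co_entry and the 64 KiB stack size are hypothetical choices for the example.

```c
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static int co_ran = 0;

/* Coroutine body; returning from it resumes uc_link (main_ctx). */
static void co_entry(void)
{
    co_ran = 1;
}

/* Runs co_entry once on its own stack. Returns 0 on success, -1 on error. */
static int run_coroutine_once(void)
{
    size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;
    if (getcontext(&co_ctx) < 0) {
        free(stack);
        return -1;
    }
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = stack_size;
    co_ctx.uc_link = &main_ctx;
#if defined(__e2k__)
    /* Assumed signature: same as makecontext, but returns < 0 on error. */
    if (makecontext_e2k(&co_ctx, co_entry, 0) < 0) {
        free(stack);
        return -1;
    }
#else
    makecontext(&co_ctx, co_entry, 0);
#endif
    int rc = swapcontext(&main_ctx, &co_ctx);
#if defined(__e2k__)
    freecontext_e2k(&co_ctx);   /* release the extra e2k resources */
#endif
    free(stack);
    return rc;
}
```

The important part is the pairing: every successful makecontext_e2k call gets a matching freecontext_e2k once the context is no longer needed.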
Use __asm__ __volatile__ ("nop") or _mm_pause() for a short delay.
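A typical place for such a delay is a bounded spin-wait. The sketch below uses the nop form because it needs no header; cpu_relax and spin_until_set are hypothetical names.

```c
/* Short delay between polls; _mm_pause() could be used instead. */
static inline void cpu_relax(void)
{
    __asm__ __volatile__ ("nop");
}

/* Polls *flag until it becomes non-zero or max_spins polls have failed.
 * Returns the number of delays that were executed. */
static unsigned spin_until_set(volatile int *flag, unsigned max_spins)
{
    unsigned spins = 0;
    while (!*flag && spins < max_spins) {
        cpu_relax();
        spins++;
    }
    return spins;
}
```

A return value equal to max_spins means the flag was never observed as set, so the caller can fall back to a blocking wait.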
The GNU C function __clear_cache(char *begin, char *end) works correctly since LCC 1.25.18 and LCC 1.26.04.
It is available in earlier versions, but does nothing.
If inlining is crucial to performance, use __attribute__((__always_inline__)) inline rather than just inline, because LCC may decide not to inline large or complicated functions.
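One common way to apply this is a small macro, so the attribute is written once. This is a sketch: FORCE_INLINE and rotl32 are hypothetical names, and the fallback branch for compilers without __GNUC__ is an assumption.

```c
#include <stdint.h>

#if defined(__GNUC__)   /* LCC defines __GNUC__ as well */
#define FORCE_INLINE __attribute__((__always_inline__)) inline
#else
#define FORCE_INLINE inline
#endif

/* Rotates x left by r bits; r is reduced modulo 32. */
static FORCE_INLINE uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31;
    return (x << r) | (x >> ((32 - r) & 31));
}
```

The macro is most useful on hot helpers that are called from inner loops, where a missed inline costs a call per iteration.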
The GNU C extension Labels as Values is available in LCC, but it performs worse than a simple switch/case.
The GNU C vector extension is also available in LCC, but it is poorly implemented and its performance is very bad.
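For dispatch loops that would otherwise use Labels as Values (computed goto), the switch/case form looks like the sketch below. The tiny stack machine, its opcodes and run_bytecode are hypothetical and exist only to show the dispatch pattern; there are no stack bounds checks.

```c
/* Hypothetical opcodes for a minimal stack machine. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Executes code until OP_HALT and returns the top of the stack.
 * Returns -1 on an unknown opcode. */
static int run_bytecode(const int *code)
{
    int stack[16];
    int sp = 0;

    for (;;) {
        switch (*code++) {
        case OP_PUSH:               /* next word is the immediate */
            stack[sp++] = *code++;
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return sp > 0 ? stack[sp - 1] : 0;
        default:
            return -1;
        }
    }
}
```

When porting an interpreter that offers both dispatch styles, selecting the switch variant for e2k builds follows the note above.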