Performance of vector element sum using float vs bfloat16 as the storage type.
Comparison of vector element sum using various data types.
For all experiments, each approach is run on a range of vector sizes, and
each approach is repeated 5 times per size to obtain a stable time measurement. The
experiments are done with guidance from Prof. Dip Sankar Banerjee and
Prof. Kishore Kothapalli.
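
The repeated-measurement scheme described above could be sketched in C++ roughly as follows. This is only an illustration: the helper name `averageTimeMs` and its signature are hypothetical, not taken from the experiment's source, and it assumes that averaging wall-clock time over the repeats is an acceptable measure.

```cpp
#include <chrono>

// Hypothetical helper: run fn() `repeat` times and return the
// average wall-clock time per run, in milliseconds.
template <class F>
double averageTimeMs(F fn, int repeat = 5) {
  using clock = std::chrono::steady_clock;  // monotonic, suited to interval timing
  auto start = clock::now();
  for (int i = 0; i < repeat; ++i) fn();
  auto stop = clock::now();
  std::chrono::duration<double, std::milli> total = stop - start;
  return total.count() / repeat;
}
```

When timing a sum this way, the result of each run should be stored or printed; otherwise an optimizing compiler may remove the computation entirely.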
In this experiment (float-vs-bfloat16, main), we compare the performance of
finding the sum of a vector of numbers, with each number stored as either
float or bfloat16. While I expected the bfloat16 method to be a clear
winner because of its reduced memory bandwidth requirement, it is only
slightly faster. This is possibly because memory loads are always 32-bit
anyway. The slight advantage of bfloat16 could be because its smaller size
allows more of the data to be retained in cache for a longer period of time.
Note that neither approach makes use of SIMD instructions, which are
available on most modern hardware.
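
The two storage types compared above can be illustrated with a minimal scalar C++ sketch. It assumes bfloat16 is stored as the upper 16 bits of a 32-bit float in a `uint16_t`, converted by plain truncation (no rounding), with accumulation done in float. All names here (`bfloat16_bits`, `toBfloat16`, `fromBfloat16`, `sumFloat`, `sumBfloat16`) are hypothetical, and the experiment's actual code may differ.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed storage type: the upper 16 bits of an IEEE-754 binary32 float.
using bfloat16_bits = std::uint16_t;

// float -> bfloat16 by dropping the low 16 mantissa bits (truncation, no rounding).
inline bfloat16_bits toBfloat16(float x) {
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof(u));  // well-defined way to read the float's bits
  return static_cast<bfloat16_bits>(u >> 16);
}

// bfloat16 -> float by placing the 16 stored bits in the upper half.
inline float fromBfloat16(bfloat16_bits b) {
  std::uint32_t u = static_cast<std::uint32_t>(b) << 16;
  float x;
  std::memcpy(&x, &u, sizeof(x));
  return x;
}

// Sum with float storage (4 bytes per element).
inline float sumFloat(const std::vector<float>& x) {
  float a = 0.0f;
  for (float v : x) a += v;
  return a;
}

// Sum with bfloat16 storage (2 bytes per element); each element is
// widened to float before being added, so the accumulator stays 32-bit.
inline float sumBfloat16(const std::vector<bfloat16_bits>& x) {
  float a = 0.0f;
  for (bfloat16_bits v : x) a += fromBfloat16(v);
  return a;
}
```

In this sketch the bfloat16 vector occupies half the memory of the float vector, but every element costs an extra shift before the add. That may be one more reason the measured gain is small.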