PROSAGA码农传奇 - EAAS/NFV - High-level GPU programming in C++

0# 取之 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
  <P>
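    If you are looking for higher-dimensional containers and the ability to pass and manipulate them in kernel code, I have spent the past few years developing the
    <a href="https://github.com/BaderLab/ecuda" rel="nofollow">
      ecuda
    </A>
     API to support my own scientific research projects (so it has been put through its paces). Hopefully it fills a needed niche. Here is a brief example of how it is used (C++11 features are used here, but ecuda also works with pre-C++11 compilers):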
  </p>
   <pre>
    <code>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <vector>

#include <ecuda/ecuda.hpp>

// kernel function
__global__
void calcColumnSums(
    typename ecuda::matrix<double>::const_kernel_argument mat,
    typename ecuda::vector<double>::kernel_argument vec
)
{
    const std::size_t t = threadIdx.x;
    auto col = mat.get_column(t);
    vec[t] = ecuda::accumulate( col.begin(), col.end(), static_cast<double>(0) );
}

int main( int argc, char* argv[] )
{
    // allocate 1000x1000 hardware-aligned device memory matrix
    ecuda::matrix<double> deviceMatrix( 1000, 1000 );

    // generate random values row-by-row and copy to matrix
    std::vector<double> hostRow( 1000 );
    for( std::size_t i = 0; i < 1000; ++i ) {
        for( double& x : hostRow ) x = static_cast<double>(rand())/static_cast<double>(RAND_MAX);
        ecuda::copy( hostRow.begin(), hostRow.end(), deviceMatrix[i].begin() );
    }

    // allocate device memory for column sums
    ecuda::vector<double> deviceSums( 1000 );

    // launch one block of 1000 threads, one thread per column
    CUDA_CALL_KERNEL_AND_WAIT(
        calcColumnSums<<<1,1000>>>( deviceMatrix, deviceSums )
    );

    // copy column sums to host and print
    std::vector<double> hostSums( 1000 );
    ecuda::copy( deviceSums.begin(), deviceSums.end(), hostSums.begin() );

    std::cout << "SUMS =";
    for( const double& x : hostSums ) std::cout << " " << std::fixed << x;
    std::cout << std::endl;

    return 0;
}

</code>
  </pre>
  <P>
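    I wrote it to be as intuitive as possible (usually as simple as replacing std:: with ecuda::). If you know the STL, then ecuda should do what you would logically expect a CUDA-based C++ extension to do.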
  </p>
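  <P>
    When checking the kernel's output, a host-only reference for the same column-sum computation can be handy. The sketch below is not part of ecuda: columnSumsReference is a hypothetical helper name, and it assumes the matrix is held row-major in a flat std::vector on the host.
  </p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only helper (not part of ecuda).
// Sums each column of a rows x cols matrix stored row-major in a flat vector.
std::vector<double> columnSumsReference(const std::vector<double>& mat,
                                        std::size_t rows, std::size_t cols)
{
    std::vector<double> sums(cols, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sums[c] += mat[r * cols + c];  // element (r, c)
        }
    }
    return sums;
}
```

  <P>
    Keeping a host copy of the random rows while filling the device matrix would allow the values copied back from the device to be compared against this reference, preferably with a small tolerance rather than exact equality.
  </p>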
</DIV>

1# 無口君 | 2019-08-31 10-32

2# 昵称不能为空 | 2019-08-31 10-32

3# 昵称为空呵呵 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
  <P>
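    There are many high-level libraries dedicated to GPGPU programming. Since they rely on CUDA and/or OpenCL, they have to be chosen wisely (a CUDA-based program will not run on AMD GPUs unless it goes through a pre-processing step with a project such as
    <a href="http://code.google.com/p/gpuocelot/" rel="noreferrer">
      <strong>
        gpuocelot
      </strong>
    </A>
    ).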
  </p>
  <H2>
    CUDA
  </H2>
  <P>
    You can find some examples of CUDA libraries on NVIDIA's
    <a href="https://developer.nvidia.com/technologies/Libraries" rel="noreferrer">
      website
    </A>
    .
  </p>
  <UL>
    <LI>
      <a href="http://thrust.github.io/" rel="noreferrer">
        <strong>
          Thrust
        </strong>
      </A>
      : the official description speaks for itself
    </LI>
  </UL>
  <BLOCKQUOTE>
    <P>
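      Thrust is a parallel algorithms library which resembles the C++ Standard
  Template Library (STL). Thrust's high-level interface greatly enhances
  programmer productivity while enabling performance portability between
  GPUs and multicore CPUs. Interoperability with established
  technologies (such as CUDA, TBB, and OpenMP) facilitates integration
  with existing software.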
    </p>
  </BLOCKQUOTE>
  <P>
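    As
    <a href="https://stackoverflow.com/a/16478427/1043187">
      @Ashwin
    </A>
     pointed out, the STL-like syntax of Thrust makes it a widely chosen library when developing CUDA programs. A quick look at the examples shows the kind of code you will be writing if you decide to use this library. NVIDIA's website presents the
    <a href="https://developer.nvidia.com/thrust" rel="noreferrer">
      key features
    </A>
     of this library. A
    <a href="http://nvidia.fullviewmedia.com/gtc2012/0515-A3-S0602.html" rel="noreferrer">
      video presentation
    </A>
     (from GTC 2012) is also available.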
  </p>
  <UL>
    <LI>
      <a href="http://nvlabs.github.io/cub/" rel="noreferrer">
        <strong>
          CUB
        </strong>
      </A>
      : the official description tells us:
    </LI>
  </UL>
  <BLOCKQUOTE>
    <P>
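      CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model. It is a flexible library of cooperative threadblock primitives and other utilities for CUDA kernel programming.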
    </p>
  </BLOCKQUOTE>
  <P>
    It provides device-wide, block-wide and warp-wide parallel primitives such as parallel sort, prefix scan, reduction, histogram, etc.
  </p>
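  <P>
    As a reference for what three of these primitives compute (not how CUB implements or exposes them), here is a sequential host-side sketch. The function names are hypothetical and nothing below uses the CUB API; a GPU library would produce the same results cooperatively across threads.
  </p>

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix scan: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> inclusiveScanReference(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Reduction: combines all elements into a single value (here, a sum).
int reduceReference(const std::vector<int>& in)
{
    return std::accumulate(in.begin(), in.end(), 0);
}

// Histogram: counts[b] is the number of samples equal to b, for 0 <= b < numBins.
// Samples outside that range are ignored.
std::vector<std::size_t> histogramReference(const std::vector<int>& samples,
                                            std::size_t numBins)
{
    std::vector<std::size_t> counts(numBins, 0);
    for (int s : samples) {
        if (s >= 0 && static_cast<std::size_t>(s) < numBins) {
            ++counts[static_cast<std::size_t>(s)];
        }
    }
    return counts;
}
```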
  <P>
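    It is open-source and available on
    <a href="https://github.com/NVlabs/cub" rel="noreferrer">
      GitHub
    </A>
    . It is not high-level from an implementation point of view (you develop in CUDA kernels), but it provides high-level algorithms and routines.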
  </p>
  <UL>
    <LI>
      <a href="https://github.com/tqchen/mshadow" rel="noreferrer">
        <strong>
          mshadow
        </strong>
      </A>
      : a lightweight CPU/GPU matrix/tensor template library in C++/CUDA.
    </LI>
  </UL>
  <P>
    This library is mainly used for machine learning, and relies on
    <a href="https://github.com/tqchen/mshadow/wiki/Expression%20Template" rel="noreferrer">
      expression templates
    </A>
    .
  </p>
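  <P>
    For readers unfamiliar with the technique, here is a minimal, generic sketch of expression templates in plain C++. It is not mshadow's code: the names Expr, AddExpr and Vec are hypothetical and chosen only for illustration. The idea is that operator+ builds a lightweight object describing the sum, and the element-wise loop runs only once, at assignment, without temporary vectors.
  </p>

```cpp
#include <cstddef>
#include <vector>

// CRTP base: every expression type E derives from Expr<E>.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node representing "lhs + rhs"; it stores references and computes nothing yet.
template <typename L, typename R>
struct AddExpr : Expr<AddExpr<L, R>> {
    const L& lhs;
    const R& rhs;
    AddExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    double eval(std::size_t i) const { return lhs.eval(i) + rhs.eval(i); }
};

// Concrete vector: a leaf of the expression tree.
struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double eval(std::size_t i) const { return data[i]; }

    // Assigning an expression evaluates the whole tree in a single loop.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& src = e.self();
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = src.eval(i);
        }
        return *this;
    }
};

// Builds the expression node instead of computing a temporary vector.
template <typename L, typename R>
AddExpr<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return AddExpr<L, R>(l.self(), r.self());
}
```

  <P>
    With these definitions, an assignment such as d = a + b + c (all of type Vec and the same length) is evaluated in one pass over the elements. Because the nodes hold references, the expression should be consumed within the statement that builds it.
  </p>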
  <UL>
    <LI>
      <a href="http://eigen.tuxfamily.org/" rel="noreferrer">
        <strong>
          Eigen
        </strong>
      </A>
      : support for CUDA with the new Tensor class
      <a href="http://eigen.tuxfamily.org/index.php?title=3.3#Experimental_CUDA_support" rel="noreferrer">
        was added in version 3.3
      </A>
      . It is used by Google in
      <a href="https://www.tensorflow.org/" rel="noreferrer">
        TensorFlow
      </A>
      , and is still experimental.
    </LI>
  </UL>
  <BLOCKQUOTE>
    <P>
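      Starting from Eigen 3.3, it is now possible to use Eigen's objects and algorithms within CUDA kernels. However, only a subset of features is supported, to make sure that no dynamic allocation is triggered within a CUDA kernel.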
    </p>
  </BLOCKQUOTE>
  <H2>
    OpenCL
  </H2>
  <P>
    Note that
    <a href="http://en.wikipedia.org/wiki/OpenCL" rel="noreferrer">
      OpenCL
    </A>
     does more than GPGPU computing, since it supports heterogeneous platforms (multi-core CPUs, GPUs, etc.).
  </p>
  <UL>
    <LI>
      <a href="http://www.openacc-standard.org" rel="noreferrer">
        <strong>
          OpenACC
        </strong>
      </A>
      : this project provides OpenMP-like support for GPGPU. A large part of the programming is done implicitly by the compiler and the run-time API. You can find a
      <a href="http://www.openacc.org/sites/default/files/Heat-Conduction-C--Sample-V2.pdf" rel="noreferrer">
        sample code
      </A>
       on their website.
    </LI>
  </UL>
  <BLOCKQUOTE>
    <P>
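      The OpenACC Application Program Interface describes a collection of
  compiler directives to specify loops and regions of code in standard
  C, C++ and Fortran to be offloaded from a host CPU to an attached
  accelerator, providing portability across operating systems, host CPUs
  and accelerators.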
    </p>
  </BLOCKQUOTE>
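  <P>
    To give a feel for the directive style, here is a small SAXPY sketch. The function name saxpy and its signature are illustrative assumptions, not taken from the OpenACC samples. The pragma asks an OpenACC-enabled compiler to offload the loop and describes the data movement; a compiler without OpenACC support simply ignores the directive and runs the loop on the host, so the result is the same either way.
  </p>

```cpp
#include <cstddef>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for every element of y.
// Assumes x has at least as many elements as y.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    const std::size_t n = y.size();
    const float* xp = x.data();
    float* yp = y.data();

    // copyin: xp is only read on the accelerator.
    // copy:   yp is transferred to the accelerator and back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

  <P>
    Whether the loop actually runs on a GPU depends on the compiler and its flags; the source code itself stays ordinary C++.
  </p>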
  <UL>
    <LI>
      <a href="https://github.com/HSA-Libraries/Bolt" rel="noreferrer">
        <strong>
          Bolt
        </strong>
      </A>
      : an open-source library with an STL-like interface.
    </LI>
  </UL>
  <BLOCKQUOTE>
    <P>
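      Bolt is a C++ template library optimized for heterogeneous computing.
  Bolt is designed to provide high-performance library implementations
  for common algorithms such as scan, reduce, transform, and sort. The
  Bolt interface was modeled on the C++ Standard Template Library (STL).
  Developers familiar with the STL will recognize many of the Bolt APIs
  and customization techniques.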
    </p>
  </BLOCKQUOTE>
  <UL>
    <LI>
      <P>
        <a href="http://boostorg.github.io/compute/" rel="noreferrer">
          <strong>
            Boost.Compute
          </strong>
        </A>
        : as
        <a href="https://stackoverflow.com/a/16441065/1043187">
          @Kyle Lutz
        </A>
         said, Boost.Compute provides an STL-like interface for OpenCL. Note that this is not an official Boost library (yet).
      </p>
    </LI>
    <LI>
      <P>
        <a href="http://skelcl.uni-muenster.de/" rel="noreferrer">
          <strong>
            SkelCL
          </strong>
        </A>
         "is a library providing high-level abstractions for alleviated programming of modern parallel heterogeneous systems". This library relies on
        <a href="http://en.wikipedia.org/wiki/Skeleton_%28computer_programming%29" rel="noreferrer">
          skeleton programming
        </A>
        , and you can find more information in their
        <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6008967" rel="noreferrer">
          research papers
        </A>
        .
      </p>
    </LI>
  </UL>
  <H2>
    CUDA + OpenCL
  </H2>
  <UL>
    <LI>
      <a href="http://arrayfire.com" rel="noreferrer">
        <strong>
          ArrayFire
        </strong>
      </A>
       is an open-source (formerly proprietary) GPGPU programming library. It first targeted CUDA, but now supports OpenCL as well. You can check the
      <a href="http://arrayfire.org/docs/examples.htm" rel="noreferrer">
        examples
      </A>
       available online. NVIDIA's website provides a good
      <a href="https://developer.nvidia.com/accelereyes-arrayfire" rel="noreferrer">
        summary
      </A>
       of its key features.
    </LI>
  </UL>
  <H2>
    Complementary information
  </H2>
  <P>
    Although this is not really in the scope of this question, the same sort of support also exists for other programming languages:
  </p>
  <UL>
    <LI>
      <strong>
        Python
      </strong>
      :
      <a href="https://developer.nvidia.com/pycuda" rel="noreferrer">
        <strong>
          PyCUDA
        </strong>
      </A>
       for CUDA,
      <a href="http://srossross.github.io/Clyther/" rel="noreferrer">
        <strong>
          Clyther
        </strong>
      </A>
       and
      <a href="http://mathema.tician.de/software/pyopencl" rel="noreferrer">
        <strong>
          PyOpenCL
        </strong>
      </A>
       for OpenCL. There is a
      <a href="https://stackoverflow.com/questions/5957554/python-gpu-programming">
        dedicated StackOverflow question
      </A>
       for this.
    </LI>
    <LI>
      <strong>
        Java
      </strong>
      :
      <a href="http://www.jcuda.org/" rel="noreferrer">
        <strong>
          JCuda
        </strong>
      </A>
       for CUDA, and for OpenCL you can check this
      <a href="https://stackoverflow.com/questions/2633483/best-approach-for-gpgpu-cuda-opencl-in-java">
        other question
      </A>
      .
    </LI>
  </UL>
  <P>
    If you need to do linear algebra (for instance) or other specific operations, dedicated math libraries are also available for CUDA and OpenCL (e.g.
    <a href="http://viennacl.sourceforge.net/viennacl-about.html" rel="noreferrer">
      <strong>
        ViennaCL
      </strong>
    </A>
    ,
    <a href="https://developer.nvidia.com/cublas" rel="noreferrer">
      <strong>
        CUBLAS
      </strong>
    </A>
    ,
    <a href="http://icl.cs.utk.edu/magma/index.html" rel="noreferrer">
      <strong>
        MAGMA
      </strong>
    </A>
    , etc.).</p>
  <P>
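    Also note that using these libraries does not prevent you from doing some low-level operations if you need to perform a very specific computation.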
  </p>
  <P>
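    Finally, we can mention the future of the C++ standard library. There has been extensive work to add parallelism support. This is
    <a href="https://github.com/cplusplus/parallelism-ts" rel="noreferrer">
      still a technical specification
    </A>
    , and GPUs are not explicitly mentioned AFAIK (although NVIDIA's Jared Hoberock, developer of Thrust, is directly involved), but the will to make this a reality is definitely there.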
  </p>
</DIV>

4# VIP | 2019-08-31 10-32

5# polo | 2019-08-31 10-32

<div class="post-text" itemprop="text">
  <P>
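    The cpp-opencl project provides a way to make programming GPUs easy for the developer. It allows you to implement data parallelism on a GPU directly in C++ instead of using OpenCL.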
  </p>
  <P>
    Please see
    <a href="http://dimitri-christodoulou.blogspot.com/2014/02/implement-data-parallelism-on-gpu.html" rel="nofollow">
      http://dimitri-christodoulou.blogspot.com/2014/02/implement-data-parallelism-on-gpu.html
    </A>
  </p>
  <P>
    And the source code:
    <a href="https://github.com/dimitrs/cpp-opencl" rel="nofollow">
      https://github.com/dimitrs/cpp-opencl
    </A>
  </p>
  <P>
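    Please see the example below. The code in the parallel_for_each lambda function is executed on the GPU, and all the rest is executed on the CPU. More specifically, the square function is executed both on the CPU (via a call to std::transform) and on the GPU (via a call to compute::parallel_for_each).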
  </p>
   <pre>
    <code>
#include <algorithm>   // std::transform
#include <vector>
#include <stdio.h>
#include "ParallelForEach.h"

template<class T>
T square(T x)
{
    return x * x;
}

void func() {
    std::vector<int> In {1,2,3,4,5,6};
    std::vector<int> OutGpu(6);
    std::vector<int> OutCpu(6);

    // Runs on the GPU: the lambda (and square) is compiled for the device
    compute::parallel_for_each(In.begin(), In.end(), OutGpu.begin(), [](int x){
        return square(x);
    });

    // Runs on the CPU: the same function through the standard algorithm
    std::transform(In.begin(), In.end(), OutCpu.begin(), [](int x) {
        return square(x);
    });

    //
    // Do something with OutCpu and OutGpu ...
    //
}

int main() {
    func();
    return 0;
}

</code>
  </pre>
</DIV>

6# Colin | 2019-08-31 10-32

7# 青阳 | 2019-08-31 10-32