Introduction
When learning a new language, we usually start with the simplest Hello World. For OpenCL C and CUDA C, the GPU is there to do parallel computation, so the simplest program of all, one-dimensional vector addition, has become the Hello World of these two special-purpose languages.
Update (April 17): after changing part of the code, the program runs successfully.
The failed attempt on the Mac
As is well known, Apple has abandoned OpenCL in its latest systems (both macOS and iOS) and moved to Metal.
Here is Apple's official description:
Metal is the graphics and compute technology built into macOS, iOS, and Apple tvOS. It lets the main processor (CPU) and the graphics processor (GPU) work together more efficiently, delivering smoother graphics performance in games and speeding up high-performance media applications.
On macOS High Sierra, the bundled OpenCL version is 1.2. Because of the version, I initially modified a few things under /Library/Framework/OpenCL.Framework (I disabled the rootless mechanism and forced the changes as root). The result: eleven hours spent reinstalling macOS...
Since the framework could not be changed, I went ahead with the 1-D vector addition test anyway. Unfortunately, it failed every time at kernel creation. My initial guess is that this is still related to the contents of OpenCL.Framework (I don't want to reinstall again, so I am only posting the results and will not test on the Mac any further; if I find time I will test on other platforms).
The wrong results
This is where the problem showed up: the value of kernel was still 0x0000000000000000 by the time the kernel arguments were set and the kernel was launched.
In other words, the kernel was never created successfully, so every element of the final result came out as 0.
Source code (kernel source and C++ host source)
Note: change ./Vadd.cl to the path of your own kernel source.
Vadd.cl
__kernel void vecadd(__global const float* A,
__global const float* B,
__global float* C){
int id = get_global_id(0);
C[id] = A[id] + B[id];
}
cl_test.cpp
#include <iostream>
#include <string>
#include <cstring> // strcpy
#include <unistd.h>
#include <time.h>
#include "OpenCL/opencl.h"
#define ARRAY_SIZE 6 // vector length
void process_CPU();
void process_GPU();
bool GetFileData(const char* fname,std::string& str);
float array_1[ARRAY_SIZE] = {1.0f,2.0f,3.1f,4.2f,5.5f,7.9f};
float array_2[ARRAY_SIZE] = {2.3f,3.3f,6.7f,11.5f,13.5f,8.9f};
float array_result[ARRAY_SIZE]; // result vector; the length is known, so declare it directly
float array_result_gpu[ARRAY_SIZE];
int main(int argc, const char** argv) {
process_CPU();
process_GPU();
return 0;
}
void process_CPU(){
std::cout<<"-----------Start CPU process------------\nResults:"<<std::endl;
clock_t start = clock();
for(int i = 0;i < ARRAY_SIZE; i++){
array_result[i] = array_1[i] + array_2[i];
std::cout<<array_result[i]<<" ";
}
double time_consume = (double)((clock_t)clock() - start) / CLOCKS_PER_SEC;
std::cout<<"\n------------End CPU process-------------\nTime consume(s):"
<<time_consume<<std::endl;
}
void process_GPU(){
std::cout<<"----------Start GPU process-------------"<<std::endl;
// Query the platform
cl_int status; // OpenCL API calls return cl_int error codes
cl_platform_id platform_id;
// Get the platform object
status = clGetPlatformIDs(1,&platform_id,NULL);
if(status != CL_SUCCESS){
std::cout<<"ERROR:failed to find any platform."<<std::endl;
return ;
}
// Get the device
cl_device_id devices;
// only one cl_device_id is provided, so ask for at most 1
clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &devices, NULL);
// Create the context
cl_context context;
context = clCreateContext(NULL,1,&devices,NULL,NULL,NULL);
// Create the command queue
cl_command_queue queue;
queue = clCreateCommandQueue(context,devices,CL_QUEUE_PROFILING_ENABLE,NULL);
// Create three CL memory objects and copy the host arrays to GPU memory explicitly:
// copy the first array
cl_mem cl_array_1;
cl_array_1 = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
ARRAY_SIZE*sizeof(cl_float),
(void *)array_1,NULL);
// the second
cl_mem cl_array_2;
cl_array_2 = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
ARRAY_SIZE*sizeof(cl_float),
(void *)array_2,NULL);
// and the third
cl_mem cl_array_result;
cl_array_result = clCreateBuffer(context,CL_MEM_WRITE_ONLY,
ARRAY_SIZE*sizeof(cl_float),
(void *)array_result_gpu,NULL);
// Write the input buffers to the device
clEnqueueWriteBuffer(queue,cl_array_1,1,0,
ARRAY_SIZE*sizeof(cl_float),
array_1,0,0,0);
clEnqueueWriteBuffer(queue,cl_array_2,1,0,
ARRAY_SIZE*sizeof(cl_float),
array_2,0,0,0);
// Create the program object
cl_program program;
std::string code_file;
if(GetFileData("./Vadd.cl",code_file) == false)return ;
char* buf_code = new char[code_file.size()+1]; // +1 for the terminating NUL
strcpy(buf_code,code_file.c_str());
program = clCreateProgramWithSource(context,1,(const char**)&buf_code,NULL,NULL);
// Build the program
clBuildProgram(program,1,&devices,NULL,NULL,NULL);
// Create the kernel
cl_kernel kernel;
kernel = clCreateKernel(program,"vecadd",NULL); // the name must match the kernel function in Vadd.cl
// Set the arguments and launch the kernel:
clSetKernelArg(kernel,0,sizeof(cl_mem),&cl_array_1);
clSetKernelArg(kernel,1,sizeof(cl_mem),&cl_array_2);
clSetKernelArg(kernel,2,sizeof(cl_mem),&cl_array_result);
size_t globalWorkSize[1];
globalWorkSize[0] = ARRAY_SIZE;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
// Read back the result
clEnqueueReadBuffer(queue, cl_array_result, CL_TRUE, 0, ARRAY_SIZE*sizeof(cl_float),
array_result_gpu, 0, NULL, NULL);
// Print the result:
std::cout<<"Results:"<<std::endl;
for(int i = 0; i < ARRAY_SIZE;i++){
std::cout<<array_result_gpu[i]<<" ";
}
std::cout<<"\n-----------END-----------"<<std::endl;
}
bool GetFileData(const char* fname,std::string& str){
FILE* fp = fopen(fname,"r");
if(fp==NULL){
printf("ERROR:File opened failed.\n");
return false;
}
// fgetc returns EOF after the last byte; testing feof() before reading
// would append a bogus character to the string
int ch;
while((ch = fgetc(fp)) != EOF){
str += (char)ch;
}
fclose(fp);
return true;
}
Reference build command (on the Mac you can link the framework directly):
clang++ cl_test.cpp -o cl_test -framework OpenCL
Some personal thoughts
Apple's announcement that its future systems will not support OpenCL seals OpenCL's unhappy fate; where can OpenCL go without its own sponsor behind it? The only mature option left on the market is CUDA, and CUDA requires NVIDIA hardware. What about other GPUs? Leaving AMD aside (it has the muscle to build its own stack), many embedded devices (including the vast majority of Android phones) rely on OpenCL for GPU acceleration. Ordinary devices today really are hard to use for GPU computing; perhaps a better framework dedicated to GPU acceleration will emerge in the future.
Update (April 17)
// This program implements a vector addition using OpenCL
// System includes
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
// OpenCL includes
#include <CL/cl.h>
// OpenCL kernel to perform an element-wise
// add of two arrays
const char* programSource =
"__kernel void vecadd(__global int *A,__global int *B,__global int *C)\n"
"{\n"
"    // Get the work-item unique ID \n"
"int idx = get_global_id(0); \n"
"    // Add the corresponding locations of \n"
"    // 'A' and 'B', and store the result in 'C'.\n"
" C[idx] = A[idx] + B[idx]; \n"
"}\n"
;
int main() {
// This code executes on the OpenCL host
// Host data
int *A = NULL; // Input array
int *B = NULL; // Input array
int *C = NULL; // Output array
// Elements in each array
const int elements = 2048;
// Compute the size of the data
size_t datasize = sizeof(int)*elements;
// Allocate space for input/output data
A = (int*)malloc(datasize);
B = (int*)malloc(datasize);
C = (int*)malloc(datasize);
// Initialize the input data
for(int i = 0; i < elements; i++) {
A[i] = i;
B[i] = i;
}
clock_t start = clock();
// Use this to check the output of each API call
cl_int status;
//-----------------------------------------------------
// STEP 1: Discover and initialize the platforms
//-----------------------------------------------------
cl_uint numPlatforms = 0;
cl_platform_id *platforms = NULL;
// Use clGetPlatformIDs() to retrieve the number of
// platforms
status = clGetPlatformIDs(0, NULL, &numPlatforms);
// Allocate enough space for each platform
platforms =
(cl_platform_id*)malloc(
numPlatforms*sizeof(cl_platform_id));
// Fill in platforms with clGetPlatformIDs()
status = clGetPlatformIDs(numPlatforms, platforms,
NULL);
//-----------------------------------------------------
// STEP 2: Discover and initialize the devices
//-----------------------------------------------------
cl_uint numDevices = 0;
cl_device_id *devices = NULL;
// Use clGetDeviceIDs() to retrieve the number of
// devices present
status = clGetDeviceIDs(
platforms[0],
CL_DEVICE_TYPE_CPU,
0,
NULL,
&numDevices);
// Allocate enough space for each device
devices =
(cl_device_id*)malloc(
numDevices*sizeof(cl_device_id));
// Fill in devices with clGetDeviceIDs()
status = clGetDeviceIDs(
platforms[0],
CL_DEVICE_TYPE_CPU,
numDevices,
devices,
NULL);
//-----------------------------------------------------
// STEP 3: Create a context
//-----------------------------------------------------
cl_context context = NULL;
// Create a context using clCreateContext() and
// associate it with the devices
context = clCreateContext(
NULL,
numDevices,
devices,
NULL,
NULL,
&status);
//-----------------------------------------------------
// STEP 4: Create a command queue
//-----------------------------------------------------
cl_command_queue cmdQueue;
// Create a command queue using clCreateCommandQueue(),
// and associate it with the device you want to execute
// on
cmdQueue = clCreateCommandQueue(
context,
devices[0],
0,
&status);
//-----------------------------------------------------
// STEP 5: Create device buffers
//-----------------------------------------------------
cl_mem bufferA; // Input array on the device
cl_mem bufferB; // Input array on the device
cl_mem bufferC; // Output array on the device
// Use clCreateBuffer() to create a buffer object (d_A)
// that will contain the data from the host array A
bufferA = clCreateBuffer(
context,
CL_MEM_READ_ONLY,
datasize,
NULL,
&status);
// Use clCreateBuffer() to create a buffer object (d_B)
// that will contain the data from the host array B
bufferB = clCreateBuffer(
context,
CL_MEM_READ_ONLY,
datasize,
NULL,
&status);
// Use clCreateBuffer() to create a buffer object (d_C)
// with enough space to hold the output data
bufferC = clCreateBuffer(
context,
CL_MEM_WRITE_ONLY,
datasize,
NULL,
&status);
//-----------------------------------------------------
// STEP 6: Write host data to device buffers
//-----------------------------------------------------
// Use clEnqueueWriteBuffer() to write input array A to
// the device buffer bufferA
status = clEnqueueWriteBuffer(
cmdQueue,
bufferA,
CL_FALSE,
0,
datasize,
A,
0,
NULL,
NULL);
// Use clEnqueueWriteBuffer() to write input array B to
// the device buffer bufferB
status = clEnqueueWriteBuffer(
cmdQueue,
bufferB,
CL_FALSE,
0,
datasize,
B,
0,
NULL,
NULL);
//-----------------------------------------------------
// STEP 7: Create and compile the program
//-----------------------------------------------------
// Create a program using clCreateProgramWithSource()
cl_program program = clCreateProgramWithSource(
context,
1,
(const char**)&programSource,
NULL,
&status);
// Build (compile) the program for the devices with
// clBuildProgram()
status = clBuildProgram(
program,
numDevices,
devices,
NULL,
NULL,
NULL);
//-----------------------------------------------------
// STEP 8: Create the kernel
//-----------------------------------------------------
cl_kernel kernel = NULL;
// Use clCreateKernel() to create a kernel from the
// vector addition function (named "vecadd")
kernel = clCreateKernel(program, "vecadd", &status);
//-----------------------------------------------------
// STEP 9: Set the kernel arguments
//-----------------------------------------------------
// Associate the input and output buffers with the
// kernel
// using clSetKernelArg()
status = clSetKernelArg(
kernel,
0,
sizeof(cl_mem),
&bufferA);
status |= clSetKernelArg(
kernel,
1,
sizeof(cl_mem),
&bufferB);
status |= clSetKernelArg(
kernel,
2,
sizeof(cl_mem),
&bufferC);
//-----------------------------------------------------
// STEP 10: Configure the work-item structure
//-----------------------------------------------------
// Define an index space (global work size) of work
// items for
// execution. A workgroup size (local work size) is not
// required,
// but can be used.
size_t globalWorkSize[1];
// There are 'elements' work-items
globalWorkSize[0] = elements;
//-----------------------------------------------------
// STEP 11: Enqueue the kernel for execution
//-----------------------------------------------------
// Execute the kernel by using
// clEnqueueNDRangeKernel().
// 'globalWorkSize' is the 1D dimension of the
// work-items
status = clEnqueueNDRangeKernel(
cmdQueue,
kernel,
1,
NULL,
globalWorkSize,
NULL,
0,
NULL,
NULL);
//-----------------------------------------------------
// STEP 12: Read the output buffer back to the host
//-----------------------------------------------------
// Use clEnqueueReadBuffer() to read the OpenCL output
// buffer (bufferC)
// to the host output array (C)
clEnqueueReadBuffer(
cmdQueue,
bufferC,
CL_TRUE,
0,
datasize,
C,
0,
NULL,
NULL);
// Verify the output
bool result = true;
for(int i = 0; i < elements; i++) {
if(C[i] != i+i) {
result = false;
break;
}
}
if(result) {
printf("Output is correct\n");
} else {
printf("Output is incorrect\n");
}
double time_consume = (double)((clock_t)clock() - start) / CLOCKS_PER_SEC;
printf("Time consume(s): %f",time_consume);
//-----------------------------------------------------
// STEP 13: Release OpenCL resources
//-----------------------------------------------------
// Free OpenCL resources
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmdQueue);
clReleaseMemObject(bufferA);
clReleaseMemObject(bufferB);
clReleaseMemObject(bufferC);
clReleaseContext(context);
// Free host resources
free(A);
free(B);
free(C);
free(platforms);
free(devices);
}
You can change CL_DEVICE_TYPE_CPU in the two clGetDeviceIDs calls to see how different devices perform. The available device types:
CL_DEVICE_TYPE_ALL
// select every device that supports OpenCL
CL_DEVICE_TYPE_CPU
// CPU only
CL_DEVICE_TYPE_GPU
// GPU only
The program's output:
// CL_DEVICE_TYPE_ALL
Output is correct
Time consume(s): 9.128000
// CL_DEVICE_TYPE_GPU
Output is correct
Time consume(s): 7.775000
// CL_DEVICE_TYPE_CPU
Output is correct
Time consume(s): 5.408000
Why the first attempt failed
In general, Intel CPUs from the 7th generation onward support OpenCL (as of 2020 the newest version is OpenCL 2.2). Comparing the two versions of the code carefully, the problem was back in the device initialization: the two functions clGetPlatformIDs and clGetDeviceIDs, and, most importantly, the two mallocs that reserve enough space for the platforms and the devices. I missed these details the first time on the Mac, which is why kernel creation kept failing.