Introduction
When learning a new language, we usually start with the simplest Hello World. For OpenCL C and CUDA C, the GPU is there to do parallel computation, so the simplest program of all, one-dimensional vector addition, has become the Hello World of these two special-purpose languages.
Update (April 17): after changing part of the code, the program runs successfully.
The failed attempt on the Mac
As is well known, Apple has abandoned OpenCL in its latest systems (both macOS and iOS) and moved to Metal.
Here is Apple's official description:
Metal is the graphics and compute technology built into macOS, iOS, and Apple tvOS. It lets the main processor (CPU) and the graphics processor (GPU) work together more efficiently, delivering smoother graphics performance in games and speeding up high-performance media applications.
On macOS High Sierra, the bundled OpenCL version is 1.2. Because of the version, I initially modified a few things under /Library/Framework/OpenCL.Framework (I disabled the rootless mechanism and forced the changes as root). The result: eleven hours spent reinstalling macOS...
Since the framework could not be changed, I went ahead with the 1-D vector addition test anyway. Unfortunately, it failed every time at kernel creation. My initial guess is that this is still related to the contents of OpenCL.Framework (I don't want to reinstall again, so I am only posting the results and will not test on the Mac any further; if I find time I will test on other platforms).
The wrong results
This is where the problem showed up: the value of kernel was still 0x0000000000000000 by the time the kernel arguments were set and the kernel was launched.
In other words, the kernel was never created successfully, so every element of the final result came out as 0.
Source code (kernel source and C++ host source)
Note: change ./Vadd.cl to the path of your own kernel source.
Vadd.cl
__kernel void vecadd(__global const float* A,
__global const float* B,
__global float* C){
int id = get_global_id(0);
C[id] = A[id] + B[id];
}
cl_test.cpp
#include <iostream>
#include <string>
#include <cstring> // strcpy
#include <unistd.h>
#include <time.h>
#include "OpenCL/opencl.h"
#define ARRAY_SIZE 6 // vector length
void process_CPU();
void process_GPU();
bool GetFileData(const char* fname,std::string& str);
float array_1[ARRAY_SIZE] = {1.0f,2.0f,3.1f,4.2f,5.5f,7.9f};
float array_2[ARRAY_SIZE] = {2.3f,3.3f,6.7f,11.5f,13.5f,8.9f};
float array_result[ARRAY_SIZE]; // result vector; the length is known, so declare it directly
float array_result_gpu[ARRAY_SIZE];
int main(int argc, const char** argv) {
process_CPU();
process_GPU();
return 0;
}
void process_CPU(){
std::cout<<"-----------Start CPU process------------\nResults:"<<std::endl;
clock_t start = clock();
for(int i = 0;i < ARRAY_SIZE; i++){
array_result[i] = array_1[i] + array_2[i];
std::cout<<array_result[i]<<" ";
}
double time_consume = (double)((clock_t)clock() - start) / CLOCKS_PER_SEC;
std::cout<<"\n------------End CPU process-------------\nTime consume(s):"
<<time_consume<<std::endl;
}
void process_GPU(){
std::cout<<"----------Start GPU process-------------"<<std::endl;
// Query the platform
cl_int status; // OpenCL API calls return cl_int error codes
cl_platform_id platform_id;
// Get the platform object
status = clGetPlatformIDs(1,&platform_id,NULL);
if(status != CL_SUCCESS){
std::cout<<"ERROR:failed to find any platform."<<std::endl;
return ;
}
// Get the device
cl_device_id devices;
// only one cl_device_id is provided, so ask for at most 1
clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &devices, NULL);
// Create the context
cl_context context;
context = clCreateContext(NULL,1,&devices,NULL,NULL,NULL);
// Create the command queue
cl_command_queue queue;
queue = clCreateCommandQueue(context,devices,CL_QUEUE_PROFILING_ENABLE,NULL);
// Create three CL memory objects and copy the host arrays to GPU memory explicitly:
// copy the first array
cl_mem cl_array_1;
cl_array_1 = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
ARRAY_SIZE*sizeof(cl_float),
(void *)array_1,NULL);
// the second
cl_mem cl_array_2;
cl_array_2 = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
ARRAY_SIZE*sizeof(cl_float),
(void *)array_2,NULL);
// and the third
cl_mem cl_array_result;
cl_array_result = clCreateBuffer(context,CL_MEM_WRITE_ONLY,
ARRAY_SIZE*sizeof(cl_float),
(void *)array_result_gpu,NULL);
// Write the input buffers to the device
clEnqueueWriteBuffer(queue,cl_array_1,1,0,
ARRAY_SIZE*sizeof(cl_float),
array_1,0,0,0);
clEnqueueWriteBuffer(queue,cl_array_2,1,0,
ARRAY_SIZE*sizeof(cl_float),
array_2,0,0,0);
// Create the program object
cl_program program;
std::string code_file;
if(GetFileData("./Vadd.cl",code_file) == false)return ;
char* buf_code = new char[code_file.size()+1]; // +1 for the terminating NUL
strcpy(buf_code,code_file.c_str());
program = clCreateProgramWithSource(context,1,(const char**)&buf_code,NULL,NULL);
// Build the program
clBuildProgram(program,1,&devices,NULL,NULL,NULL);
// Create the kernel
cl_kernel kernel;
kernel = clCreateKernel(program,"vecadd",NULL); // the name must match the kernel function in Vadd.cl
// Set the arguments and launch the kernel:
clSetKernelArg(kernel,0,sizeof(cl_mem),&cl_array_1);
clSetKernelArg(kernel,1,sizeof(cl_mem),&cl_array_2);
clSetKernelArg(kernel,2,sizeof(cl_mem),&cl_array_result);
size_t globalWorkSize[1];
globalWorkSize[0] = ARRAY_SIZE;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
// Read back the result
clEnqueueReadBuffer(queue, cl_array_result, CL_TRUE, 0, ARRAY_SIZE*sizeof(cl_float),
array_result_gpu, 0, NULL, NULL);
// Print the result:
std::cout<<"Results:"<<std::endl;
for(int i = 0; i < ARRAY_SIZE;i++){
std::cout<<array_result_gpu[i]<<" ";
}
std::cout<<"\n-----------END-----------"<<std::endl;
}
bool GetFileData(const char* fname,std::string& str){
FILE* fp = fopen(fname,"r");
if(fp==NULL){
printf("ERROR:File opened failed.\n");
return false;
}
// fgetc returns EOF after the last byte; testing feof() before reading
// would append a bogus character to the string
int ch;
while((ch = fgetc(fp)) != EOF){
str += (char)ch;
}
fclose(fp);
return true;
}
Reference build command (on the Mac you can link the framework directly):
clang++ cl_test.cpp -o cl_test -framework OpenCL
Some personal thoughts
Apple's announcement that its future systems will not support OpenCL seals OpenCL's unhappy fate; where can OpenCL go without its own sponsor behind it? The only mature option left on the market is CUDA, and CUDA requires NVIDIA hardware. What about other GPUs? Leaving AMD aside (it has the muscle to build its own stack), many embedded devices (including the vast majority of Android phones) rely on OpenCL for GPU acceleration. Ordinary devices today really are hard to use for GPU computing; perhaps a better framework dedicated to GPU acceleration will emerge in the future.
Update (April 17)
// This program implements a vector addition using OpenCL
// System includes
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
// OpenCL includes
#include <CL/cl.h>
// OpenCL kernel to perform an element-wise
// add of two arrays
const char* programSource =
"__kernel void vecadd(__global int *A,__global int *B,__global int *C)\n"
"{\n"
"    // Get the work-item unique ID \n"
"int idx = get_global_id(0); \n"
"    // Add the corresponding locations of \n"
"    // 'A' and 'B', and store the result in 'C'.\n"
" C[idx] = A[idx] + B[idx]; \n"
"}\n"
;
int main() {
// This code executes on the OpenCL host
// Host data
int *A = NULL; // Input array
int *B = NULL; // Input array
int *C = NULL; // Output array
// Elements in each array
const int elements = 2048;
// Compute the size of the data
size_t datasize = sizeof(int)*elements;
// Allocate space for input/output data
A = (int*)malloc(datasize);
B = (int*)malloc(datasize);
C = (int*)malloc(datasize);
// Initialize the input data
for(int i = 0; i < elements; i++) {
A[i] = i;
B[i] = i;
}
clock_t start = clock();
// Use this to check the output of each API call
cl_int status;
//-----------------------------------------------------
// STEP 1: Discover and initialize the platforms
//-----------------------------------------------------
cl_uint numPlatforms = 0;
cl_platform_id *platforms = NULL;
// Use clGetPlatformIDs() to retrieve the number of
// platforms
status = clGetPlatformIDs(0, NULL, &numPlatforms);
// Allocate enough space for each platform
platforms =
(cl_platform_id*)malloc(
numPlatforms*sizeof(cl_platform_id));
// Fill in platforms with clGetPlatformIDs()
status = clGetPlatformIDs(numPlatforms, platforms,
NULL);
//-----------------------------------------------------
// STEP 2: Discover and initialize the devices
//-----------------------------------------------------
cl_uint numDevices = 0;
cl_device_id *devices = NULL;
// Use clGetDeviceIDs() to retrieve the number of
// devices present
status = clGetDeviceIDs(
platforms[0],
CL_DEVICE_TYPE_CPU,
0,
NULL,
&numDevices);
// Allocate enough space for each device
devices =
(cl_device_id*)malloc(
numDevices*sizeof(cl_device_id));
// Fill in devices with clGetDeviceIDs()
status = clGetDeviceIDs(
platforms[0],
CL_DEVICE_TYPE_CPU,
numDevices,
devices,
NULL);
//-----------------------------------------------------
// STEP 3: Create a context
//-----------------------------------------------------
cl_context context = NULL;
// Create a context using clCreateContext() and
// associate it with the devices
context = clCreateContext(
NULL,
numDevices,
devices,
NULL,
NULL,
&status);
//-----------------------------------------------------
// STEP 4: Create a command queue
//-----------------------------------------------------
cl_command_queue cmdQueue;
// Create a command queue using clCreateCommandQueue(),
// and associate it with the device you want to execute
// on
cmdQueue = clCreateCommandQueue(
context,
devices[0],
0,
&status);
//-----------------------------------------------------
// STEP 5: Create device buffers
//-----------------------------------------------------
cl_mem bufferA; // Input array on the device
cl_mem bufferB; // Input array on the device
cl_mem bufferC; // Output array on the device
// Use clCreateBuffer() to create a buffer object (d_A)
// that will contain the data from the host array A
bufferA = clCreateBuffer(
context,
CL_MEM_READ_ONLY,
datasize,
NULL,
&status);
// Use clCreateBuffer() to create a buffer object (d_B)
// that will contain the data from the host array B
bufferB = clCreateBuffer(
context,
CL_MEM_READ_ONLY,
datasize,
NULL,
&status);
// Use clCreateBuffer() to create a buffer object (d_C)
// with enough space to hold the output data
bufferC = clCreateBuffer(
context,
CL_MEM_WRITE_ONLY,
datasize,
NULL,
&status);
//-----------------------------------------------------
// STEP 6: Write host data to device buffers
//-----------------------------------------------------
// Use clEnqueueWriteBuffer() to write input array A to
// the device buffer bufferA
status = clEnqueueWriteBuffer(
cmdQueue,
bufferA,
CL_FALSE,
0,
datasize,
A,
0,
NULL,
NULL);
// Use clEnqueueWriteBuffer() to write input array B to
// the device buffer bufferB
status = clEnqueueWriteBuffer(
cmdQueue,
bufferB,
CL_FALSE,
0,
datasize,
B,
0,
NULL,
NULL);
//-----------------------------------------------------
// STEP 7: Create and compile the program
//-----------------------------------------------------
// Create a program using clCreateProgramWithSource()
cl_program program = clCreateProgramWithSource(
context,
1,
(const char**)&programSource,
NULL,
&status);
// Build (compile) the program for the devices with
// clBuildProgram()
status = clBuildProgram(
program,
numDevices,
devices,
NULL,
NULL,
NULL);
//-----------------------------------------------------
// STEP 8: Create the kernel
//-----------------------------------------------------
cl_kernel kernel = NULL;
// Use clCreateKernel() to create a kernel from the
// vector addition function (named "vecadd")
kernel = clCreateKernel(program, "vecadd", &status);
//-----------------------------------------------------
// STEP 9: Set the kernel arguments
//-----------------------------------------------------
// Associate the input and output buffers with the
// kernel
// using clSetKernelArg()
status = clSetKernelArg(
kernel,
0,
sizeof(cl_mem),
&bufferA);
status |= clSetKernelArg(
kernel,
1,
sizeof(cl_mem),
&bufferB);
status |= clSetKernelArg(
kernel,
2,
sizeof(cl_mem),
&bufferC);
//-----------------------------------------------------
// STEP 10: Configure the work-item structure
//-----------------------------------------------------
// Define an index space (global work size) of work
// items for
// execution. A workgroup size (local work size) is not
// required,
// but can be used.
size_t globalWorkSize[1];
// There are 'elements' work-items
globalWorkSize[0] = elements;
//-----------------------------------------------------
// STEP 11: Enqueue the kernel for execution
//-----------------------------------------------------
// Execute the kernel by using
// clEnqueueNDRangeKernel().
// 'globalWorkSize' is the 1D dimension of the
// work-items
status = clEnqueueNDRangeKernel(
cmdQueue,
kernel,
1,
NULL,
globalWorkSize,
NULL,
0,
NULL,
NULL);
//-----------------------------------------------------
// STEP 12: Read the output buffer back to the host
//-----------------------------------------------------
// Use clEnqueueReadBuffer() to read the OpenCL output
// buffer (bufferC)
// to the host output array (C)
clEnqueueReadBuffer(
cmdQueue,
bufferC,
CL_TRUE,
0,
datasize,
C,
0,
NULL,
NULL);
// Verify the output
bool result = true;
for(int i = 0; i < elements; i++) {
if(C[i] != i+i) {
result = false;
break;
}
}
if(result) {
printf("Output is correct\n");
} else {
printf("Output is incorrect\n");
}
double time_consume = (double)((clock_t)clock() - start) / CLOCKS_PER_SEC;
printf("Time consume(s): %f",time_consume);
//-----------------------------------------------------
// STEP 13: Release OpenCL resources
//-----------------------------------------------------
// Free OpenCL resources
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmdQueue);
clReleaseMemObject(bufferA);
clReleaseMemObject(bufferB);
clReleaseMemObject(bufferC);
clReleaseContext(context);
// Free host resources
free(A);
free(B);
free(C);
free(platforms);
free(devices);
}
You can change CL_DEVICE_TYPE_CPU in the two clGetDeviceIDs calls to see how different devices perform. The available device types:
CL_DEVICE_TYPE_ALL
// select every device that supports OpenCL
CL_DEVICE_TYPE_CPU
// CPU only
CL_DEVICE_TYPE_GPU
// GPU only
The program's output:
// CL_DEVICE_TYPE_ALL
Output is correct
Time consume(s): 9.128000
// CL_DEVICE_TYPE_GPU
Output is correct
Time consume(s): 7.775000
// CL_DEVICE_TYPE_CPU
Output is correct
Time consume(s): 5.408000
Why the first attempt failed
In general, Intel CPUs from the 7th generation onward support OpenCL (as of 2020 the newest version is OpenCL 2.2). Comparing the two versions of the code carefully, the problem was back in the device initialization: the two functions clGetPlatformIDs and clGetDeviceIDs, and, most importantly, the two mallocs that reserve enough space for the platforms and the devices. I missed these details the first time on the Mac, which is why kernel creation kept failing.