Workflow for performing calculations on the GPU
1. Start from an ordinary C function (this version runs on the CPU)
void add_arrays(const float* inA,
                const float* inB,
                float* result,
                int length)
{
    for (int index = 0; index < length; index++)
    {
        result[index] = inA[index] + inB[index];
    }
}
2. Convert the C function to Metal Shading Language (MSL)
kernel void add_arrays(device const float* inA,
                       device const float* inB,
                       device float* result,
                       uint index [[thread_position_in_grid]])
{
    // the for-loop is replaced with a collection of threads, each of which
    // calls this function.
    result[index] = inA[index] + inB[index];
}
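To make the change from loop to threads concrete, here is a minimal CPU-only sketch in plain C. It is not Metal code: `add_arrays_thread` mirrors the kernel body, and `emulate_dispatch` is a hypothetical helper that stands in for the GPU by calling the body once per grid position. A real GPU runs these calls in parallel and in no guaranteed order.

```c
#include <assert.h>
#include <stddef.h>

/* CPU version from step 1: one call walks the whole array. */
void add_arrays_loop(const float *inA, const float *inB,
                     float *result, int length)
{
    for (int index = 0; index < length; index++) {
        result[index] = inA[index] + inB[index];
    }
}

/* Kernel body from step 2 as a plain C function: one call, one element. */
void add_arrays_thread(const float *inA, const float *inB,
                       float *result, unsigned int index)
{
    result[index] = inA[index] + inB[index];
}

/* Hypothetical stand-in for the GPU: invokes the kernel body once for
 * every position in a 1D grid. Sequential here, parallel on a GPU. */
void emulate_dispatch(const float *inA, const float *inB,
                      float *result, unsigned int gridSize)
{
    assert(inA != NULL && inB != NULL && result != NULL);
    for (unsigned int index = 0; index < gridSize; index++) {
        add_arrays_thread(inA, inB, result, index);
    }
}
```

Calling `add_arrays_loop(a, b, r1, n)` and `emulate_dispatch(a, b, r2, n)` on the same inputs should fill `r1` and `r2` with identical values, because each element is computed independently of the others.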
3. Find a GPU device (MTLDevice)
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
4. Initialize the Metal objects
MetalAdder* adder = [[MetalAdder alloc] initWithDevice:device]; // Use it to manage the objects needed to communicate with Metal
5. Get a reference to the Metal function
The Metal function lives in the app's default Metal library, so use the MTLDevice to get the MTLLibrary, then use the MTLLibrary to get the MTLFunction (the Metal function).
- (instancetype) initWithDevice: (id<MTLDevice>) device
{
    self = [super init];
    if (self)
    {
        _mDevice = device;
        NSError* error = nil;
        // Load the shader files with a .metal file extension in the project
        id<MTLLibrary> defaultLibrary = [_mDevice newDefaultLibrary];
        if (defaultLibrary == nil)
        {
            NSLog(@"Failed to find the default library.");
            return nil;
        }
        id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];
        if (addFunction == nil)
        {
            NSLog(@"Failed to find the adder function.");
            return nil;
        }
        // Note: the official documentation shows no closing braces at this point,
        // presumably because the method has not ended yet (steps 6 and 7 still
        // use the local variables addFunction and error).
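6. Prepare a Metal pipeline
An MTLFunction is not executable code by itself; a pipeline converts the function into code the GPU can actually run. In Metal, a pipeline is represented by a pipeline state object, and the function is compiled when the pipeline state is created.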
_mAddFunctionPSO = [_mDevice newComputePipelineStateWithFunction: addFunction error:&error];
7. Create a command queue
To send commands to the GPU, you need a command queue.
_mCommandQueue = [_mDevice newCommandQueue];
8. Create the buffers and load data
Metal manages memory through MTLResource objects, which are created from the MTLDevice instance. The buffers created here are MTLBuffer objects; MTLBuffer is a protocol that inherits from MTLResource.
_mBufferA = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared]; // MTLResourceStorageModeShared makes the buffer accessible to both the CPU and the GPU
_mBufferB = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
_mBufferResult = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
[self generateRandomFloatData:_mBufferA];
[self generateRandomFloatData:_mBufferB];
- (void) generateRandomFloatData: (id<MTLBuffer>) buffer
{
    float* dataPtr = buffer.contents;
    for (unsigned long index = 0; index < arrayLength; index++)
    {
        dataPtr[index] = (float)rand()/(float)(RAND_MAX);
    }
}
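The listing uses `arrayLength` and `bufferSize` without showing where they are defined. A minimal sketch of how they would relate, assuming each buffer holds `arrayLength` values of type `float`: the length passed to `newBufferWithLength:` is a byte count, so it would be the element count times `sizeof(float)`. The helper names `float_buffer_size` and `fill_random_floats` are hypothetical and only mirror the logic above on a plain pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: a buffer stores `count` floats, so its length in bytes
 * (what bufferSize would be for arrayLength elements) is: */
size_t float_buffer_size(size_t count)
{
    return count * sizeof(float);
}

/* Same fill as generateRandomFloatData, applied to a plain pointer.
 * Every value lands in the range [0, 1]. */
void fill_random_floats(float *data, size_t count)
{
    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        data[i] = (float)rand() / (float)(RAND_MAX);
    }
}
```

With shared storage, `buffer.contents` gives the CPU a pointer into the same memory the GPU reads, which is why a plain pointer fill like this is enough to load the inputs.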
9. Create a command buffer
id<MTLCommandBuffer> commandBuffer = [_mCommandQueue commandBuffer];
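10. Create a command encoder
To write commands into the command buffer, you need a command encoder for the kind of commands you want to encode; here we use a compute command encoder.
It encodes a compute pass, which holds a list of commands. Each compute command causes the GPU to create a grid of threads to execute.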
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
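To encode a command, you make a series of method calls on the encoder. Some of them set state, such as the pipeline state object (PSO) or the arguments passed to the pipeline. After those state changes, you encode a command that executes the pipeline. The encoder writes all of the state changes and command parameters into the command buffer.
11. Set the pipeline state and argument data
First set the pipeline state object that the command should execute, then set the data that the add_arrays function operates on. The index passed to atIndex: corresponds to the position of the argument in add_arrays. The offset is a byte offset into the buffer, so a single buffer can supply several arguments by binding it at different offsets.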
[computeEncoder setComputePipelineState:_mAddFunctionPSO];
[computeEncoder setBuffer:_mBufferA offset:0 atIndex:0];
[computeEncoder setBuffer:_mBufferB offset:0 atIndex:1];
[computeEncoder setBuffer:_mBufferResult offset:0 atIndex:2];
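The remark about one buffer with different offsets can be sketched on the CPU in plain C. This is not the Metal API: `argument_at_offset` is a hypothetical helper that resolves a byte offset into a backing store to the pointer a kernel argument would see, and `add_arrays_packed` assumes the two inputs and the result are packed back to back in one allocation. Metal places its own alignment requirements on offsets, so check the documentation before relying on a particular layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for binding a buffer at a byte offset:
 * returns the address the kernel argument would start at. */
float *argument_at_offset(void *base, size_t byteOffset)
{
    assert(base != NULL);
    assert(byteOffset % sizeof(float) == 0); /* keep float alignment */
    return (float *)((char *)base + byteOffset);
}

/* One allocation laid out as [inA | inB | result], `count` floats each. */
void add_arrays_packed(float *storage, size_t count)
{
    const float *inA = argument_at_offset(storage, 0);
    const float *inB = argument_at_offset(storage, count * sizeof(float));
    float *result = argument_at_offset(storage, 2 * count * sizeof(float));

    for (size_t i = 0; i < count; i++) {
        result[i] = inA[i] + inB[i];
    }
}
```

For example, with `float s[6] = {1, 2, 10, 20, 0, 0}` and `count = 2`, the last two elements hold the sums of the first two pairs after the call.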
12. Specify the thread count and organization
Metal can create 1D, 2D, or 3D grids. The data in this example is a 1D array, so the grid size passed is (dataSize x 1 x 1).
MTLSize gridSize = MTLSizeMake(arrayLength, 1, 1);
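13. Specify the threadgroup size
Metal subdivides the grid into smaller grids called threadgroups. Each threadgroup runs independently and can be dispatched to a different processing element on the GPU, which speeds up processing. You need to decide how large each threadgroup is.
NSUInteger threadGroupSize = _mAddFunctionPSO.maxTotalThreadsPerThreadgroup; // the largest number of threads per threadgroup that this pipeline state allows
if (threadGroupSize > arrayLength)
{
    threadGroupSize = arrayLength;
}
MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);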
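A small C sketch of the arithmetic behind this step, under the assumption of a 1D grid: clamp the group size as the listing does, then see how a grid of `arrayLength` threads divides into full threadgroups plus a possible partial one. `GridSplit`, `clamp_group_size`, and `split_grid` are hypothetical names used only for illustration; Metal does this division itself when a grid size in threads is dispatched, and handling a partial final group depends on GPU support for nonuniform threadgroup sizes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t fullGroups;       /* groups holding exactly groupSize threads */
    size_t remainderThreads; /* threads in a final partial group, 0 if none */
} GridSplit;

/* Same clamp as the listing: never use a group larger than the data. */
size_t clamp_group_size(size_t maxThreadsPerGroup, size_t arrayLength)
{
    return (maxThreadsPerGroup > arrayLength) ? arrayLength
                                              : maxThreadsPerGroup;
}

/* Divide a 1D grid of arrayLength threads into groups of groupSize. */
GridSplit split_grid(size_t arrayLength, size_t groupSize)
{
    assert(groupSize > 0);
    GridSplit split;
    split.fullGroups = arrayLength / groupSize;
    split.remainderThreads = arrayLength % groupSize;
    return split;
}
```

For instance, 5000 elements with a limit of 1024 threads per group gives 4 full groups and a final group of 904 threads, while 100 elements with the same limit collapses to a single group of 100.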
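14. Encode the compute command that executes the threads
[computeEncoder dispatchThreads:gridSize
          threadsPerThreadgroup:threadgroupSize];
The same encoder can encode multiple compute commands into the pass without repeating redundant steps; for example, set the pipeline state once, then set arguments and encode a command for each set of buffers.
15. End the compute pass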
[computeEncoder endEncoding];
16. Commit the command buffer to execute its commands
[commandBuffer commit];
Metal executes the commands asynchronously; once they have finished, the command buffer is marked as completed.
17. Wait for the calculation to complete
[commandBuffer waitUntilCompleted];
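This method blocks until the calculation completes. Alternatively, register a handler on the command buffer with addCompletedHandler(_:) (before committing it), or check the status property to find out whether it has completed.
18. Read the results from the buffer
The sample reads the results back and recomputes them on the CPU to check that the GPU produced the right values.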
- (void) verifyResults
{
    float* a = _mBufferA.contents;
    float* b = _mBufferB.contents;
    float* result = _mBufferResult.contents;
    for (unsigned long index = 0; index < arrayLength; index++)
    {
        if (result[index] != (a[index] + b[index]))
        {
            printf("Compute ERROR: index=%lu result=%g vs %g=a+b\n",
                   index, result[index], a[index] + b[index]);
            assert(result[index] == (a[index] + b[index]));
        }
    }
    printf("Compute results as expected\n");
}
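The same check can be tried without Metal. The sketch below is a plain C variant that counts mismatching elements instead of asserting on the first one; `count_mismatches` is a hypothetical helper, not part of the sample. It keeps the exact equality comparison used above, which is reasonable for a single float addition but may be too strict for more involved kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many elements of result differ from a[i] + b[i]. */
size_t count_mismatches(const float *a, const float *b,
                        const float *result, size_t count)
{
    assert(a != NULL && b != NULL && result != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < count; i++) {
        /* Store the sum in a float so it is rounded to float precision
         * before the comparison. */
        float expected = a[i] + b[i];
        if (result[i] != expected) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A return value of 0 corresponds to the "Compute results as expected" case in verifyResults.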