Professional CUDA C Programming
Code download: http://www.wrox.com/WileyCDA/WroxTitle/Professional-CUDA-C-Programming.productCd-1118739329,descCd-DOWNLOAD.html
Dynamic Parallelism
So far, every kernel has been launched from the host, with the GPU working entirely under CPU control. CUDA Dynamic Parallelism allows a GPU kernel to create and launch new kernels directly on the device. It makes recursive algorithms easier to implement and understand, and because launch configurations can be decided by device threads at run time, it also reduces the data transfer and execution control that would otherwise have to pass between host and device. With dynamic parallelism, the decision of how many blocks and grids to create on the GPU can be deferred until run time, letting the GPU's hardware scheduler and load balancer adapt dynamically to data-driven decisions or workloads.
Nested Execution
The syntax for launching a kernel on the device is exactly the same as launching it from the host. Kernel executions are classified into two types: parent and child. A parent thread, parent block, or parent grid can launch a new grid, the child grid. A child grid must complete before its parent is considered complete; in other words, the parent must wait for all of its children to finish. When a parent launches a child grid, the child is not guaranteed to begin executing until the parent explicitly synchronizes on it. Parent and child share the same global and constant memory, but they have distinct shared and local memory. Accordingly, there are only two points at which the child's view of global memory is guaranteed to be fully consistent with the parent's: when the child starts and when the child completes. All global memory operations performed by the parent before the launch are visible to the child, while the child's global memory operations become visible to the parent only after the parent synchronizes.
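These visibility rules can be made concrete with a minimal sketch (the kernel names and the data array are illustrative, not from the book; device-side cudaDeviceSynchronize() is valid for the CUDA versions used here, though it has since been deprecated):

__global__ void childKernel(int *data)
{
    // The parent's write to data[0] happened before this grid was
    // launched, so it is guaranteed to be visible here.
    data[1 + threadIdx.x] = data[0] + 1;
}

__global__ void parentKernel(int *data)
{
    if (threadIdx.x == 0)
    {
        data[0] = 42;                  // visible to the child at launch
        childKernel<<<1, 4>>>(data);
        cudaDeviceSynchronize();       // after this, the child's writes to
                                       // data[1..4] are visible to the parent
    }
}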
Nested Hello World on the GPU
To better understand dynamic parallelism, we rewrite the hello world program. The host launches a parent grid whose single block has 8 threads. Thread 0 of the parent launches child grid_1 with half as many threads as the parent grid (4 threads); thread 0 of child grid_1 then launches child grid_2 (2 threads); and thread 0 of child grid_2 launches child grid_3 (1 thread).
- parent grid with only one block
__global__ void nestedHelloWorld(int const iSize, int iDepth)
{
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid,
           blockIdx.x);

    // condition to stop recursive execution
    if (iSize == 1) return;

    // reduce block size to half
    int nthreads = iSize >> 1;

    // thread 0 launches child grid recursively
    if (tid == 0 && nthreads > 0)
    {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}
Compile:
$ nvcc -arch=sm_35 -rdc=true nestedHelloWorld.cu -o nestedHelloWorld -lcudadevrt
-lcudadevrt links against the device runtime library, and -rdc=true generates relocatable device code, which is required for dynamic parallelism.
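If the device code is spread over several source files, the equivalent two-step compile-and-link sequence looks like this (the file names are placeholders; -dc compiles with relocatable device code):

$ nvcc -arch=sm_35 -dc kernelA.cu kernelB.cu
$ nvcc -arch=sm_35 kernelA.o kernelB.o -o app -lcudadevrt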
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter03$ nvcc -arch=sm_35 -rdc=true nestedHelloWorld.cu -o nestedHelloWorld -lcudadevrt
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter03$ ./nestedHelloWorld
./nestedHelloWorld Execution Configuration: grid 1 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
You can use nvvp to observe how the parent and child grids execute:
nvvp ./nestedHelloWorld
Note: blue segments indicate execution, blank segments indicate waiting. The parent grid nestedHelloWorld executes once, and nestedHelloWorld is launched three more times as children. Reading from the bottom row upward: the bottom row is the depth=3 call; only when it completes can the depth=2 call finish, only then can the depth=1 call finish, and only then can the parent grid finish.
- parent grid with 2 blocks
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter03$ ./nestedHelloWorld 2
./nestedHelloWorld Execution Configuration: grid 2 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
Recursion=0: Hello World from thread 0 block 1
Recursion=0: Hello World from thread 1 block 1
Recursion=0: Hello World from thread 2 block 1
Recursion=0: Hello World from thread 3 block 1
Recursion=0: Hello World from thread 4 block 1
Recursion=0: Hello World from thread 5 block 1
Recursion=0: Hello World from thread 6 block 1
Recursion=0: Hello World from thread 7 block 1
-------> nested execution depth: 1
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
Recursion=3: Hello World from thread 0 block 0
Looking at this output, the first thing to notice is that every child block has ID 0. The parent grid now has two blocks, but every child grid still has only one block, because the child launch hardcodes a single block:
nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
Note: dynamic parallelism is supported only on devices of compute capability 3.5 and above. A kernel launched through dynamic parallelism cannot execute on a different physical device. The maximum nesting depth is 24, but in practice kernels are limited by memory resources, including the extra memory needed to synchronize parent and child grids.
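Both of these reservations can be adjusted from the host before the first launch; here is a minimal sketch using the runtime's limit API (the values 8 and 4096 are arbitrary examples):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // reserve enough backing store to synchronize 8 levels of nesting
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);

    // allow up to 4096 child grids to be queued before any has run
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    size_t depth, pending;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("sync depth: %zu, pending launches: %zu\n", depth, pending);
    return 0;
}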
Nested Reduction
Reduction can be described quite naturally as a recursive process.
// Recursive Implementation of Interleaved Pair Approach
int cpuRecursiveReduce(int *data, int const size)
{
    // stop condition
    if (size == 1) return data[0];

    // renew the stride
    int const stride = size / 2;

    // in-place reduction
    for (int i = 0; i < stride; i++)
    {
        data[i] += data[i + stride];
    }

    // call recursively
    return cpuRecursiveReduce(data, stride);
}
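A quick host-side check of this function (a sketch; the array contents are arbitrary, and the size must be a power of two since each step halves it):

#include <stdio.h>

int cpuRecursiveReduce(int *data, int const size);  // defined above

int main()
{
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};   // sum is 36
    printf("sum = %d\n", cpuRecursiveReduce(data, 8));
    // note: the reduction is destructive; data[] now holds partial sums
    return 0;
}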
Dynamic parallelism version: the parent grid has many blocks, but every child grid is launched by thread 0 of its parent block and has only one block. The first step is still to convert the global memory address g_idata into a block-local address. Then an if test checks the stop condition; if it holds, the result is written back to global memory and the kernel returns. Otherwise, half of the threads perform an in-place reduction, followed by a block-level synchronization to guarantee all partial sums are complete. Thread 0 then launches a child grid with a single block and half the current number of threads.
__global__ void gpuRecursiveReduce(int *g_idata, int *g_odata,
                                   unsigned int isize)
{
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // nested invocation
    int istride = isize >> 1;
    if (istride > 1 && tid < istride)
    {
        // in place reduction
        idata[tid] += idata[tid + istride];
    }

    // sync at block level
    __syncthreads();

    // nested invocation to generate child grids
    if (tid == 0)
    {
        gpuRecursiveReduce<<<1, istride>>>(idata, odata, istride);

        // sync all child grids launched in this block
        cudaDeviceSynchronize();
    }

    // sync at block level again
    __syncthreads();
}
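The kernel leaves one partial sum per block in g_odata, so the host still has to add the partials together. A minimal launch sketch matching the run below (names and initialization are my own, error checking omitted):

int size = 1 << 20;                                    // 1048576 elements
dim3 block(512), grid((size + block.x - 1) / block.x); // 2048 blocks

int *h_idata = (int *)malloc(size * sizeof(int));
int *h_odata = (int *)malloc(grid.x * sizeof(int));
for (int i = 0; i < size; i++) h_idata[i] = 1;         // expected sum: 1048576

int *d_idata, *d_odata;
cudaMalloc(&d_idata, size * sizeof(int));
cudaMalloc(&d_odata, grid.x * sizeof(int));
cudaMemcpy(d_idata, h_idata, size * sizeof(int), cudaMemcpyHostToDevice);

gpuRecursiveReduce<<<grid, block>>>(d_idata, d_odata, block.x);
cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost);

// the kernel produced one partial sum per block; finish on the host
int gpu_sum = 0;
for (int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];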
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter03$ nvcc -arch=sm_35 -rdc=true nestedReduce.cu -o nestedReduce
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter03$ ./nestedReduce
./nestedReduce starting reduction at device 0: GeForce GT 740M array 1048576 grid 2048 block 512
cpu reduce elapsed 0.002892 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.002178 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested elapsed 0.733954 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
From these results, 2048 blocks are launched initially. Since each block performs 8 recursions, 2048 × 8 = 16384 child blocks are created, and __syncthreads is also called 16384 times; both are major causes of the poor performance.
當(dāng)一個child grid被調(diào)用后两踏,他看到的memory是和parent完全一樣的败京,因為child只需要parent的一部分?jǐn)?shù)據(jù),block在每個child grid的啟動前的同步操作是不必要的梦染,修改后:
__global__ void gpuRecursiveReduceNosync(int *g_idata, int *g_odata,
                                         unsigned int isize)
{
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // nested invoke
    int istride = isize >> 1;
    if (istride > 1 && tid < istride)
    {
        idata[tid] += idata[tid + istride];
        if (tid == 0)
        {
            gpuRecursiveReduceNosync<<<1, istride>>>(idata, odata, istride);
        }
    }
}
Results:
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter03$ nvcc -arch=sm_35 -rdc=true nestedReduceNosync.cu -o nestedReduceNosync
hym@hym-ThinkPad-Edge-E440:~/CodeSamples/chapter03$ ./nestedReduceNosync
./nestedReduceNosync starting reduction at device 0: GeForce GT 740M array 1048576 grid 2048 block 512
cpu reduce elapsed 0.002918 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.002182 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested elapsed 0.733726 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.030162 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
These results show that gpu nestedNosyn improves performance considerably, but it is still slower than the neighbored-pair implementation. The next change adds a parameter iDim to the kernel: because each recursive call halves the child block size, the parent's blockDim must be passed to the child grid so that every thread can compute the correct global memory offset. Note that all idle threads are now removed. Compared with the previous implementations, in which half of the threads went idle at each step, this frees half of the compute resources.
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride,
                                    int const iDim)
{
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;

    // stop condition
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];

    // nested invocation to generate child grids
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                                                        iStride / 2, iDim);
    }
}
Called from main as follows:
gpuRecursiveReduce2<<<grid, block.x / 2>>>(d_idata, d_odata, block.x / 2,block.x);
ccit@ccit:~/hym/CodeSamples/chapter03$ ./nestedReduce2
./nestedReduce2 starting reduction at device 0: Tesla K80 array 1048576 grid 2048 block 512
cpu reduce elapsed 0.002539 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.001015 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested elapsed 0.250117 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.024537 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested2 elapsed 0.001025 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
==25190== Profiling application: ./nestedReduce2
==25190== Profiling result:
Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 92.61%  11.9872s             1           16384  731.60us  3.3280us  285.05ms  gpuRecursiveReduce(int*, int*, unsigned int)
  7.34%  950.18ms             1           16384  57.990us  2.8480us  40.780ms  gpuRecursiveReduceNosync(int*, int*, unsigned int)
  0.04%  5.6049ms             4               -  1.4012ms  1.3760ms  1.4362ms  [CUDA memcpy HtoD]
  0.01%  723.10us             1               8  80.343us  31.839us  143.71us  gpuRecursiveReduce2(int*, int*, int, int)
  0.00%  538.30us             1               0  538.30us  538.30us  538.30us  reduceNeighbored(int*, int*, unsigned int)
  0.00%  18.271us             4               -  4.5670us  4.1920us  5.2150us  [CUDA memcpy DtoH]
Analysis: gpu nested2 is effectively launched as <<<2048, 256>>>. The modified program creates only 8 child grids, compared with 16384 before, which removes a great deal of overhead. But I found a strange result in my experiments: the program runs correctly on the Tesla K80 yet produces incorrect results on my GT 740M, and I have not yet found the cause.
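If I were to dig further, one plausible first step (an assumption about debugging strategy on my part, not a verified diagnosis) would be to check the device runtime's error state right after each nested launch, since a silently failed child launch would leave partial sums unreduced; it would also be worth scrutinizing that nothing in gpuRecursiveReduce2 orders the other parent blocks' in-place additions before the child grid launched by block 0 reads their data, which could plausibly behave differently across devices. A sketch of the error check at the launch site:

// inside gpuRecursiveReduce2, replacing the plain launch (a sketch)
if (threadIdx.x == 0 && blockIdx.x == 0)
{
    gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                                                    iStride / 2, iDim);

    // device-side launch failures (for example, exceeding the pending
    // launch count) surface here instead of aborting the kernel
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("child launch failed: %s\n", cudaGetErrorString(err));
}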