(CUDA) Memory (Part 2)

This post was migrated from CSDN:
http://blog.csdn.net/mounty_fsc/article/details/51092925

This part contains notes on the CUDA C Programming Guide [1].

1 Device Memory

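"Device memory" is used here as the umbrella term for the memory types discussed in the following sections (shared memory, global memory, and so on).
It can be allocated either as linear memory or as CUDA arrays.
CUDA arrays are optimized for texture fetching; see Texture Memory.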

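For linear memory, the following functions are typically used:

Function	Description
cudaMalloc()	Allocates linear memory on the device
cudaMemcpy()	Copies data between host and device memory
cudaMallocPitch()	2D; the returned pitch must be used when accessing elements
cudaMemcpy2D()	2D
cudaMalloc3D()	3D
cudaMemcpy3D()	3D
cudaFree()	Frees device memory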
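
As a minimal sketch of the basic linear-memory workflow (allocate, copy in, launch, copy out, free), the following adds two vectors on the device. The kernel `VecAdd` and the helper `RunVecAdd` are hypothetical names introduced only for illustration, and most error checking is left out for brevity.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: c[i] = a[i] + b[i]
__global__ void VecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element vectors on the device and
// returns true if every result matches the same sum computed on the host.
bool RunVecAdd(int n)
{
    size_t size = n * sizeof(float);
    float* h_a = (float*)malloc(size);
    float* h_b = (float*)malloc(size);
    float* h_c = (float*)malloc(size);
    for (int i = 0; i < n; ++i) {
        h_a[i] = (float)i;
        h_b[i] = 2.0f * i;
    }

    // Allocate linear memory on the device
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Copy the inputs from host memory to device memory
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up
    VecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back; this cudaMemcpy waits for the kernel to finish
    cudaError_t err = cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_c[i] == h_a[i] + h_b[i]);

    // Release device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return ok;
}
```

Calling `RunVecAdd(1 << 20)` from `main()` is enough to exercise the whole path.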

cudaMallocPitch example

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

2 Shared Memory

2.1 Without Shared Memory

Each element of C is computed by one thread.
Global memory accesses: each element of A is read B.width times, and each element of B is read A.height times.
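
The version without shared memory can be sketched as follows. This is a minimal illustration, not the exact listing from the guide: the names `SimpleMatrix`, `MatMulNaiveKernel` and `MatMulNaive` are hypothetical, and a bounds check is added so that dimensions need not be multiples of the block size. Every thread reads a full row of A and a full column of B directly from global memory, which is where the access counts above come from.

```cuda
#include <cuda_runtime.h>

// Hypothetical row-major matrix type without a stride field:
// M(row, col) = M.elements[row * M.width + col]
typedef struct {
    int width;
    int height;
    float* elements;
} SimpleMatrix;

// Each thread computes one element of C = A * B using global memory only
__global__ void MatMulNaiveKernel(SimpleMatrix A, SimpleMatrix B, SimpleMatrix C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= C.height || col >= C.width)
        return;

    float Cvalue = 0;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

// Host wrapper: copies A and B to the device, runs the kernel, copies C back
void MatMulNaive(const SimpleMatrix A, const SimpleMatrix B, SimpleMatrix C)
{
    SimpleMatrix d_A = A, d_B = B, d_C = C;
    size_t sizeA = (size_t)A.width * A.height * sizeof(float);
    size_t sizeB = (size_t)B.width * B.height * sizeof(float);
    size_t sizeC = (size_t)C.width * C.height * sizeof(float);

    cudaMalloc(&d_A.elements, sizeA);
    cudaMalloc(&d_B.elements, sizeB);
    cudaMalloc(&d_C.elements, sizeC);
    cudaMemcpy(d_A.elements, A.elements, sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B.elements, B.elements, sizeB, cudaMemcpyHostToDevice);

    // x covers the columns of C, y covers its rows; round the grid up
    dim3 dimBlock(16, 16);
    dim3 dimGrid((C.width + dimBlock.x - 1) / dimBlock.x,
                 (C.height + dimBlock.y - 1) / dimBlock.y);
    MatMulNaiveKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    cudaMemcpy(C.elements, d_C.elements, sizeC, cudaMemcpyDeviceToHost);

    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
```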

2.2 With Shared Memory

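In essence, each element of C is still computed by one thread. Only the viewpoint changes: each block computes one sub-matrix Csub, and each thread in the block computes one element of Csub.
The strategy is that the threads of a block read each required value from global memory only once, stage it in shared memory, and reuse it from there.
Global memory accesses: each element of A is read (B.width / block_size) times, and each element of B is read (A.height / block_size) times.
The shared memory example code is given in the Related Code section at the end.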
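
As a rough worked example of these counts, assume square 1024 x 1024 matrices and block_size = 16 (values chosen only for illustration):

```latex
\begin{aligned}
\text{without shared memory:}\quad
  & \text{reads per element of } A = B.\mathrm{width} = 1024 \\
\text{with shared memory:}\quad
  & \text{reads per element of } A
    = \frac{B.\mathrm{width}}{\mathrm{block\_size}}
    = \frac{1024}{16} = 64 \\
\text{total loads (A and B), without:}\quad
  & 2 \cdot 1024^{3} \approx 2.1 \times 10^{9} \\
\text{total loads (A and B), with:}\quad
  & \frac{2 \cdot 1024^{3}}{16} \approx 1.3 \times 10^{8}
\end{aligned}
```

The same factor applies to B, since A.height equals B.width under this assumption.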

3 Page-Locked Host Memory

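Page-locked (also called pinned) host memory is distinct from the pageable host memory allocated by malloc(). Paging is an operating-system policy under which only part of the data may be resident in physical memory at a given time.
Page-locked host memory is a scarce resource, so allocating it is more likely to fail than allocating pageable memory.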

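Related functions:

Function	Description
cudaHostAlloc()	Allocates page-locked host memory
cudaFreeHost()	Frees page-locked host memory
cudaHostRegister()	Page-locks a range of memory allocated by malloc()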
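
The sketch below exercises these functions under simple assumptions: one buffer comes from cudaHostAlloc(), a second buffer comes from malloc() and is pinned afterwards with cudaHostRegister(). The helper name `PinnedRoundTrip` is hypothetical. cudaHostUnregister(), which is not listed in the table above, is the runtime call that undoes cudaHostRegister().

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: sends n floats to the device from a cudaHostAlloc()
// buffer and reads them back into a malloc() buffer that was page-locked
// with cudaHostRegister(). Returns true if the data survives the round trip.
bool PinnedRoundTrip(int n)
{
    size_t size = n * sizeof(float);

    float* d_buf = NULL;
    if (cudaMalloc(&d_buf, size) != cudaSuccess)
        return false;

    // (1) Allocate page-locked host memory directly
    float* h_pinned = NULL;
    if (cudaHostAlloc((void**)&h_pinned, size, cudaHostAllocDefault)
            != cudaSuccess) {
        cudaFree(d_buf);
        return false;
    }
    for (int i = 0; i < n; ++i)
        h_pinned[i] = (float)i;
    cudaMemcpy(d_buf, h_pinned, size, cudaMemcpyHostToDevice);

    // (2) Page-lock memory that was allocated with malloc()
    float* h_paged = (float*)malloc(size);
    bool registered =
        (cudaHostRegister(h_paged, size, cudaHostRegisterDefault)
            == cudaSuccess);
    cudaError_t err =
        cudaMemcpy(h_paged, d_buf, size, cudaMemcpyDeviceToHost);

    bool ok = registered && (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_paged[i] == (float)i);

    // Undo the registration before handing the buffer back to free()
    if (registered)
        cudaHostUnregister(h_paged);
    free(h_paged);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return ok;
}
```

Because page-locked memory is scarce, checking the return value of cudaHostAlloc() and cudaHostRegister(), as done here, is more important than for ordinary allocations.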

Categories:

Name	Flags
Portable Memory	cudaHostAllocPortable, cudaHostRegisterPortable
Write-Combining Memory	cudaHostAllocWriteCombined
Mapped Memory	cudaHostAllocMapped, cudaHostRegisterMapped
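
To illustrate the mapped category, here is a minimal sketch in which a kernel works directly on page-locked host memory through a device-side alias obtained with cudaHostGetDevicePointer(), so no explicit cudaMemcpy() is needed. The names `Scale` and `MappedScale` are hypothetical, and the code assumes device 0 is the current device. With contexts created by the runtime API on recent toolkits, mapping is normally available by default; on older setups a call to cudaSetDeviceFlags(cudaDeviceMapHost) before any other CUDA call may be needed.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiplies each element by factor
__global__ void Scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical helper: doubles n floats held in mapped page-locked host
// memory. Returns true if the device supports mapping and every element
// was updated in place.
bool MappedScale(int n)
{
    // Assumption: device 0 is the current device
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess
            || !prop.canMapHostMemory)
        return false;

    // Page-locked host memory that is also mapped into the device
    // address space
    float* h_data = NULL;
    if (cudaHostAlloc((void**)&h_data, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i)
        h_data[i] = (float)i;

    // Device-side pointer that refers to the same host buffer
    float* d_alias = NULL;
    cudaError_t err = cudaHostGetDevicePointer((void**)&d_alias, h_data, 0);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        Scale<<<blocks, threads>>>(d_alias, n, 2.0f);
        // No copy back: wait for the kernel, then read h_data directly
        err = cudaDeviceSynchronize();
    }

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_data[i] == 2.0f * i);

    cudaFreeHost(h_data);
    return ok;
}
```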

4 Texture Memory

5 Surface Memory

Related Code

Shared memory example code:

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride; 
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}

// Thread block size (defined before its first use in GetSubMatrix)
#define BLOCK_SIZE 16

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                         + BLOCK_SIZE * col];
    return Asub;
}

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);

    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    // dim3 is ordered (x, y): x runs along the columns of C and
    // y along its rows, so the width comes first here
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    // blockIdx.y selects the block row and blockIdx.x the block
    // column, matching the (x, y) grid layout chosen in MatMul()
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

[1] NVIDIA, CUDA C Programming Guide.

Last edited: 2017.12.03 06:49:25
