1) CPU architecture
- Pipelining
- Branch Prediction
- Superscalar
- Out-of-Order Execution
- Memory Hierarchy
- Vector Operations
- Multi-core
What is a CPU?
- Executes instructions, processes data
- Provides additional complex functions
- Contains many transistors
What is an instruction?
For example:
arithmetic: add r3, r4 -> r4
memory access: load [r4] -> r7
control: jz end
Optimization objective:
execution time = instructions * cycles/instruction * seconds/cycle
CPI (clock cycles per instruction) & clock period
The two factors are not independent; sometimes an instruction set with higher-CPI instructions reduces the total number of instructions.
Desktop Programs
Lightly threaded
Lots of branches
Lots of memory accesses
Most desktop programs deal with data movement rather than numeric computation.
Moore's Law
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
What do we do with our transistor budget?
An 8-core processor contains about 2.2 billion transistors; most of the die is devoted to I/O and storage (caches) rather than computation.
Pipelining
Several steps involved in executing an instruction:
Fetch -> Decode -> Execute -> Memory -> Writeback
These steps can be separated into different pipeline stages.
Pros
- Instruction level parallelism (ILP)
- Significantly reduced clock period.
Cons
- Slight latency & area increase (pipeline latches)
- Dependencies
- How to handle branches
- Pipeline stage lengths must be balanced (the slowest stage sets the clock period)
Bypassing
If two instructions are dependent (for example, an ADD must wait for a preceding SUB to traverse the whole pipeline and write back R7), bypassing forwards R7 to the later instruction directly, without the wait.
Stalls
If a load has not finished, the pipeline must stall and wait for it.
Branch
Branch Prediction
Guess what instruction comes next
Based on branch history
Example: two-level predictor with global history
- Maintain a history table of outcomes for M successive branches
- Compare with the past N results (history register)
- Sandy Bridge employs a 32-bit history register
Modern predictors achieve > 90% accuracy
Pros:
Raises performance and energy efficiency
Cons:
Area increase
Potential fetch stage latency increase
Predication
Replace branches with conditional instructions
Avoids branch predictor
- Avoid area penalty, misprediction penalty
GPUs also use predication
Increase IPC
- Baseline IPC is limited to 1 instruction/clock
- Superscalar: increase the width of the pipeline
Superscalar
Peak IPC is N (for N-way superscalar)
Scheduling
xor r1, r2 -> r3
add r3, r4 -> r4
sub r5, r2 -> r3
addi r3, 1 -> r1
xor and add: Read-After-Write (RAW)
sub and addi: RAW
xor and sub: Write-After-Write (WAW)
Register Renaming
xor r1, r2 -> r6
add r6, r4 -> r7
sub r5, r2 -> r8
addi r8, 1 -> r9
xor and sub can now execute in parallel
Out-of-Order (OoO) Execution
Execute instructions out of program order, whenever their inputs are ready
Fetch -> Decode -> Rename -> Dispatch -> Issue -> Register-Read -> Execute -> Memory -> Writeback -> Commit
Reorder Buffer
Issue Queue/Scheduler
Pros:
IPC close to the ideal
Cons:
Area increase
Power cost
Modern Desktop/Mobile In-order CPUs
- Intel Atom
- ARM Cortex-A8
- Qualcomm Scorpion
Modern Desktop/Mobile OoO CPUs
- Intel Pentium Pro and onwards
- ARM Cortex-A9
- Qualcomm Krait
Memory Hierarchy
Caching
Keep data as close to the processor as possible.
- Temporal locality
- Spatial locality
CPU parallelism
- Instruction-Level Parallelism (ILP)
- Data-Level Parallelism (vectors)
- Thread-Level Parallelism (TLP)
Vectors Motivation
for (int i = 0; i < N; i++)
    A[i] = B[i] + C[i];
Single Instruction, Multiple Data (SIMD)
// in parallel
A[i]   = B[i]   + C[i]
A[i+1] = B[i+1] + C[i+1]
A[i+2] = B[i+2] + C[i+2]
A[i+3] = B[i+3] + C[i+3]
A[i+4] = B[i+4] + C[i+4]
x86 vector extensions
- SSE2
- AVX
Thread-Level Parallelism
Programmers can create and destroy threads.
Programmers or the OS can schedule them.
Multicore
Locks, Coherence and Consistency
- Multiple threads access the same data
- Coherence: which value of a single location is the correct one
- Consistency: in what order do memory operations become visible
Power Wall
Raising the CPU clock frequency increases power consumption, so power density cannot grow without bound.