NVIDIA CUDA Learning Note 1

1) CPU Architecture

  • Pipelining
  • Branch Prediction
  • Superscalar
  • Out-of-Order Execution
  • Memory Hierarchy
  • Vector Operation
  • Multi-core

What is a CPU?

  • Executes instructions and processes data
  • Provides additional, more complex functions
  • Contains many transistors

What is an instruction?

For example:
Arithmetic: add r3,r4 -> r4
Memory access (load/store): load [r4] -> r7
Control: jz end

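Optimization Objective:

Minimize the time per instruction: cycles/instruction * seconds/cycle

That is, CPI (clock cycles per instruction) multiplied by the clock period.
The two factors are not independent. Sometimes accepting a higher CPI reduces the number of instructions a program needs, for example when each instruction does more work.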
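The trade-off can be made concrete with a little arithmetic. The sketch below is only an illustration: the instruction counts, CPI values, and the 1 GHz clock are made-up numbers, and `execution_time` is a hypothetical helper, not something from the original notes.

```python
def execution_time(instruction_count, cpi, clock_hz):
    """Seconds = instructions * (cycles / instruction) * (seconds / cycle)."""
    return instruction_count * cpi / clock_hz


# Hypothetical design A: many simple instructions, 1 cycle each.
simple = execution_time(instruction_count=1_000_000, cpi=1.0, clock_hz=1e9)

# Hypothetical design B: fewer, more complex instructions with a higher CPI.
complex_ = execution_time(instruction_count=600_000, cpi=1.5, clock_hz=1e9)

print(f"simple:  {simple:.6f} s")
print(f"complex: {complex_:.6f} s")
```

With these assumed numbers the design with the higher CPI still finishes sooner, because the drop in instruction count outweighs the extra cycles per instruction.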

Desktop Programs

Lightly threaded
Lots of branches
Lots of memory accesses
Most desktop programs spend their time moving data rather than doing numeric computation.

Moore's Law

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
What do we do with our transistor budget?


An 8-core processor contains about 2.2 billion transistors, and most of the chip is devoted to I/O and storage rather than computation.

Pipelining

Several steps involved in executing an instruction:
Fetch -> Decode -> Execute -> Memory -> Writeback
These steps can be split into separate pipeline stages, so that several instructions are in flight at once, each in a different stage.
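A rough way to see the benefit is to count stage-length time steps for a stream of instructions. This is a minimal sketch assuming an ideal five-stage pipeline with no stalls; the function names are hypothetical.

```python
def unpipelined_steps(num_instructions, num_stages=5):
    """Each instruction runs through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_steps(num_instructions, num_stages=5):
    """Ideal pipeline: the first instruction fills the pipeline, then one
    instruction completes per step (assumes no stalls or flushes)."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


n = 100
print(unpipelined_steps(n), pipelined_steps(n))
```

For long instruction streams the ideal speedup approaches the number of stages, which is why dependencies and branches (which break the ideal case) matter so much.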


Pros

  • Instruction level parallelism (ILP)
  • Significantly reduced clock period.

Cons

  • Slight latency and area increase (pipeline latches)
  • Dependencies between instructions
  • Branches must be handled
  • Alleged pipeline lengths

Bypassing

If two instructions are dependent, for example an ADD that needs the R7 produced by a preceding SUB, the ADD would normally have to wait until the SUB has gone through the pipeline and written R7 back. Bypassing (forwarding) passes R7 directly to the later instruction without that wait.

Stalls

If a load has not finished and a later instruction needs its result, the pipeline must stall until the data arrives.
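The effect of bypassing and of load stalls can be sketched with a tiny model. It assumes the classic textbook five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) in which the register file can be written and read in the same cycle; real processors differ, and `stall_cycles` is a hypothetical helper.

```python
def stall_cycles(producer, distance, forwarding):
    """Stall cycles for a consumer that is `distance` instructions after
    the instruction producing its operand.

    producer:   "alu" (result known after Execute) or "load" (after Memory)
    forwarding: whether bypass paths feed results straight into Execute
    """
    if not forwarding:
        gap = 3          # consumer's Decode must wait for producer's Writeback
    elif producer == "load":
        gap = 2          # loaded value only exists after the Memory stage
    else:
        gap = 1          # ALU result is bypassed right after Execute
    return max(0, gap - distance)


print(stall_cycles("alu", 1, forwarding=False))   # back-to-back, no bypass
print(stall_cycles("alu", 1, forwarding=True))    # back-to-back, with bypass
print(stall_cycles("load", 1, forwarding=True))   # load followed by its user
```

Under these assumptions bypassing removes the stall between dependent ALU instructions, while a load followed immediately by its consumer still costs a stall cycle.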

Branch

Branch Prediction

Guess which instruction comes next
Based on branch history
Example: two-level predictor with global history

  • Maintain a history table of all outcomes for M successive branches
  • Compare with the past N results (history register)
  • Sandy Bridge employs a 32-bit history register

Modern predictors achieve more than 90% accuracy.
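A two-level predictor with global history can be sketched as follows. This is a simplified model, not the design of any real processor: it assumes a small global history register that indexes a table of 2-bit saturating counters, and the class and function names are hypothetical.

```python
class TwoLevelPredictor:
    """Global history register indexing a table of 2-bit counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                           # last N outcomes as bits
        self.counters = [1] * (1 << history_bits)  # start weakly not-taken

    def predict(self):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self.history] >= 2

    def update(self, taken):
        # Train the counter selected by the current history...
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # ...then shift the real outcome into the history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def accuracy(predictor, outcomes):
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch that is taken twice and then falls through, repeated.
pattern = [True, True, False] * 200
print(accuracy(TwoLevelPredictor(), pattern))
```

Because the recent history uniquely identifies the position inside the repeating pattern, the predictor only mispredicts during warm-up on this input.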

Pros:
Raises performance and energy efficiency
Cons:
Area increase
Potential fetch stage latency increase

Predication

Replace branches with conditional instructions
Avoids the branch predictor

  • Avoids the area penalty and the misprediction penalty

GPUs also use predication.
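The idea behind predication can be illustrated by rewriting a branch as straight-line arithmetic. Python does not expose conditional-move or predicated instructions, so this is only a sketch of the data flow: both function names are hypothetical, and the predicated version computes a 0/1 predicate and uses it to select the result instead of jumping.

```python
def clamp_branchy(x, limit):
    """Version with a control-flow branch."""
    if x > limit:
        return limit
    return x


def clamp_predicated(x, limit):
    """Branch-free version: the comparison yields a predicate (0 or 1)
    that selects between the two candidate results."""
    p = int(x > limit)
    return p * limit + (1 - p) * x


for value in (3, 10, 25):
    print(value, clamp_branchy(value, 10), clamp_predicated(value, 10))
```

Both candidates are always evaluated in the predicated form, which is the usual cost of trading a possible misprediction for a fixed amount of extra work.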

Increase IPC

  • A scalar pipeline is limited to an IPC of at most 1 instruction per clock
  • Superscalar: increase the width of the pipeline

Superscalar

Peak IPC is N (for N-way superscalar)


Scheduling

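xor r1,r2 -> r3
add r3,r4 -> r4

sub r5,r2 -> r3
addi r3,1 -> r1

xor and add: Read-After-Write (RAW), because add reads the r3 that xor writes
sub and addi: RAW, because addi reads the r3 that sub writes
xor and sub: Write-After-Write (WAW), because both write r3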

Register Renaming

xor r1,r2 -> r6
add r6,r4 -> r7

sub r5,r2 -> r8
addi r8,1 -> r9

With the destinations renamed, xor and sub no longer write the same register, so they can execute in parallel.
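The renaming step itself can be sketched in a few lines. This is a simplified model: it assumes an unlimited supply of fresh registers starting at r6 (real hardware draws from a finite free list of physical registers), and the `rename` function and its tuple format are hypothetical.

```python
def rename(instructions, first_free=6):
    """Give every destination a fresh register and rewrite later sources.

    Each instruction is a tuple (opcode, [sources], destination).
    """
    mapping = {}            # architectural register -> latest renamed register
    next_free = first_free
    renamed = []
    for op, sources, dest in instructions:
        # Sources read the most recent name; immediates pass through unchanged.
        new_sources = [mapping.get(s, s) for s in sources]
        new_dest = f"r{next_free}"
        next_free += 1
        mapping[dest] = new_dest
        renamed.append((op, new_sources, new_dest))
    return renamed


program = [
    ("xor", ["r1", "r2"], "r3"),
    ("add", ["r3", "r4"], "r4"),
    ("sub", ["r5", "r2"], "r3"),
    ("addi", ["r3", "1"], "r1"),
]

for op, sources, dest in rename(program):
    print(f"{op} {','.join(sources)} -> {dest}")
```

Only the true RAW dependences survive renaming; the WAW conflict on r3 disappears because each write now has its own destination.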

Out-of-Order(OoO) Execution

Execute instructions in a different order from program order, as soon as their operands are ready
Fetch -> Decode -> Rename -> Dispatch -> Issue ->
Register-Read -> Execute -> Memory -> Writeback ->
Commit

Reorder Buffer
Issue Queue/Scheduler

Pros:
IPC close to the ideal
Cons:
Area increase
Power cost
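How superscalar width and out-of-order issue interact can be sketched with a toy scheduler. It is an illustrative model only: it assumes already-renamed instructions (so only RAW dependences remain), a one-cycle latency for every operation, and oldest-first selection; `issue_cycles` is a hypothetical helper.

```python
def issue_cycles(instructions, width=2):
    """Return the cycle (1-based) in which each instruction issues.

    An instruction may issue once all its sources are ready, even if an
    older instruction is still waiting, up to `width` instructions per cycle.
    """
    ready_cycle = {}   # register -> first cycle its value can be consumed
    issued = {}        # cycle -> instructions already issued in that cycle
    cycles = []
    for op, sources, dest in instructions:
        cycle = max([ready_cycle.get(s, 1) for s in sources] + [1])
        while issued.get(cycle, 0) >= width:
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        ready_cycle[dest] = cycle + 1   # assumed 1-cycle execution latency
        cycles.append(cycle)
    return cycles


renamed_program = [
    ("xor", ["r1", "r2"], "r6"),
    ("add", ["r6", "r4"], "r7"),
    ("sub", ["r5", "r2"], "r8"),
    ("addi", ["r8", "1"], "r9"),
]

print(issue_cycles(renamed_program, width=2))
print(issue_cycles(renamed_program, width=1))
```

In the two-wide case sub issues alongside xor, ahead of the older add that is still waiting for r6, so the four instructions need two issue cycles instead of four.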

Modern Desktop/Mobile In-order CPUs
  • Intel Atom
  • ARM Cortex-A8
  • Qualcomm Scorpion
Modern Desktop/Mobile OoO CPUs
  • Intel Pentium Pro and onwards
  • ARM Cortex-A9
  • Qualcomm Krait

Memory Hierarchy

Caching

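Keep data as close to the processor as possible, exploiting:

  • Temporal locality: recently used data is likely to be used again soon
  • Spatial locality: data near recently used data is likely to be used soon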
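The two kinds of locality can be demonstrated with a small cache model. This sketch assumes a direct-mapped cache with 64 lines of 64 bytes each; those sizes, the access patterns, and the `hit_rate` function are illustrative assumptions rather than a description of any real cache.

```python
def hit_rate(addresses, num_lines=64, line_size=64):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        block = addr // line_size      # which memory block the byte lives in
        index = block % num_lines      # which cache line that block maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fetch the block into the cache
    return hits / len(addresses)


sequential = [4 * i for i in range(4096)]     # walk an array of 4-byte items
strided = [4096 * i for i in range(4096)]     # jump 4 KiB on every access
repeated = [0, 64, 128] * 100                 # reuse the same three blocks

print(hit_rate(sequential), hit_rate(strided), hit_rate(repeated))
```

The sequential walk benefits from spatial locality (one miss brings in a whole line of neighbours), the repeated pattern benefits from temporal locality, and the large stride defeats both.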

CPU Parallelism

  • Instruction-Level Parallelism (ILP) extraction
  • Data-Level Parallelism (Vectors)
  • Thread-Level Parallelism (TLP)

Vectors Motivation

for (int i = 0; i < N; i++)
    A[i] = B[i] + C[i];

Single Instruction, Multiple Data (SIMD)
// in parallel
A[i] = B[i] + C[i]
A[i+1] = B[i+1] + C[i+1]
A[i+2] = B[i+2] + C[i+2]
A[i+3] = B[i+3] + C[i+3]
A[i+4] = B[i+4] + C[i+4]
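The loop structure that a vectorized version takes can be sketched as below. Plain Python lists do not run as SIMD instructions, so this only models the shape of the computation: an assumed vector width of four lanes, one "vector" operation per group of lanes, and a scalar loop for the leftover elements. The `vector_add` name is hypothetical.

```python
def vector_add(b, c, lanes=4):
    """Add two equal-length lists in groups of `lanes` elements."""
    n = len(b)
    a = [0] * n
    full = n - n % lanes                 # elements covered by whole vectors
    for i in range(0, full, lanes):
        # Stands in for one vector instruction working on `lanes` elements.
        a[i:i + lanes] = [x + y for x, y in zip(b[i:i + lanes], c[i:i + lanes])]
    for i in range(full, n):             # scalar remainder loop
        a[i] = b[i] + c[i]
    return a


b = list(range(10))
c = [10 * x for x in range(10)]
print(vector_add(b, c))
```

The main loop now runs roughly N / lanes iterations, which is where the data-level speedup comes from on hardware that executes each group as a single instruction.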

x86 Vector Extensions

  • SSE2
  • AVX

Thread-Level Parallelism

Programmers can create and destroy threads.
Either the programmer or the OS can schedule (dispatch) them.

Multicore

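Locks, Coherence and Consistency

  • Problem: multiple threads access the same data
  • Coherence: when several cores hold copies of the same data, which copy is correct
  • Consistency: in what order memory operations become visible to other threads, i.e. when the data is correct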
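The role of a lock when several threads update the same data can be sketched with Python's standard `threading` module. This is a minimal illustration: the thread count, the iteration count, and the `locked_count` function are arbitrary choices for the example.

```python
import threading


def locked_count(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may perform the read-modify-write.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(locked_count())
```

The lock serializes the update so no increment is lost; coherence and consistency are the hardware-level guarantees that make such a lock work across cores.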

Power Wall

Raising the CPU clock frequency raises power consumption, so frequency and power density cannot keep increasing without limit.

CPUs are optimized for serial programs.
