1) CPU architecture
- Pipelining
- Branch Prediction
- Superscalar
- Out-of-Order Execution
- Memory Hierarchy
- Vector Operations
- Multi-core
What is a CPU?
- Executes instructions, processes data
- Provides additional complex functions
- Contains many transistors
What is an instruction?
For example:
arithmetic: add r3, r4 -> r4
memory access: load [r4] -> r7
control: jz end
Optimization objective:
execution time = instructions * cycles/instruction * seconds/cycle
CPI (clock cycles per instruction) & clock period
The two factors are not independent; sometimes an instruction set with higher-CPI instructions reduces the total number of instructions.
Desktop Programs
Lightly threaded
Lots of branches
Lots of memory accesses
Most desktop programs deal with data movement rather than numeric computation.
Moore's Law
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
What do we do with our transistor budget?
An 8-core processor contains about 2.2 billion transistors; most of the die is devoted to I/O and storage (caches) rather than computation.
Pipelining
Several steps involved in executing an instruction:
Fetch -> Decode -> Execute -> Memory -> Writeback
These steps can be separated into different pipeline stages.
Pros
- Instruction level parallelism (ILP)
- Significantly reduced clock period.
Cons
- Slight latency & area increase (pipeline latches)
- Dependencies
- How to handle branches
- Pipeline stage lengths must be balanced (the slowest stage sets the clock period)
Bypassing
If two instructions are dependent (for example, an ADD must wait for a preceding SUB to traverse the whole pipeline and write back R7), bypassing forwards R7 to the later instruction directly, without the wait.
Stalls
If a load has not finished, the pipeline must stall and wait for it.
Branch
Branch Prediction
Guess what instruction comes next
Based on branch history
Example: two-level predictor with global history
- Maintain a history table of outcomes for M successive branches
- Compare with the past N results (history register)
- Sandy Bridge employs a 32-bit history register
Modern predictors achieve > 90% accuracy
Pros:
Raises performance and energy efficiency
Cons:
Area increase
Potential fetch stage latency increase
Predication
Replace branches with conditional instructions
Avoids branch predictor
- Avoid area penalty, misprediction penalty
GPUs also use predication
Increase IPC
- Baseline IPC is limited to 1 instruction/clock
- Superscalar: increase the width of the pipeline
Superscalar
Peak IPC is N (for N-way superscalar)
Scheduling
xor r1, r2 -> r3
add r3, r4 -> r4
sub r5, r2 -> r3
addi r3, 1 -> r1
xor and add: Read-After-Write (RAW)
sub and addi: RAW
xor and sub: Write-After-Write (WAW)
Register Renaming
xor r1, r2 -> r6
add r6, r4 -> r7
sub r5, r2 -> r8
addi r8, 1 -> r9
xor and sub can now execute in parallel
Out-of-Order (OoO) Execution
Execute instructions out of program order, whenever their inputs are ready
Fetch -> Decode -> Rename -> Dispatch -> Issue -> Register-Read -> Execute -> Memory -> Writeback -> Commit
Reorder Buffer
Issue Queue/Scheduler
Pros:
IPC close to the ideal
Cons:
Area increase
Power cost
Modern Desktop/Mobile In-order CPUs
- Intel Atom
- ARM Cortex-A8
- Qualcomm Scorpion
Modern Desktop/Mobile OoO CPUs
- Intel Pentium Pro and onwards
- ARM Cortex-A9
- Qualcomm Krait
Memory Hierarchy
Caching
Keep data as close to the processor as possible.
- Temporal locality
- Spatial locality
CPU parallelism
- Instruction-Level Parallelism (ILP)
- Data-Level Parallelism (vectors)
- Thread-Level Parallelism (TLP)
Vectors Motivation
for (int i = 0; i < N; i++)
    A[i] = B[i] + C[i];
Single Instruction, Multiple Data (SIMD)
// in parallel
A[i]   = B[i]   + C[i]
A[i+1] = B[i+1] + C[i+1]
A[i+2] = B[i+2] + C[i+2]
A[i+3] = B[i+3] + C[i+3]
A[i+4] = B[i+4] + C[i+4]
x86 vector extensions
- SSE2
- AVX
Thread-Level Parallelism
Programmers can create and destroy threads.
Programmers or the OS can schedule them.
Multicore
Locks, Coherence and Consistency
- Multiple threads access the same data
- Coherence: which value of a single location is the correct one
- Consistency: in what order do memory operations become visible
Power Wall
Raising the CPU clock frequency increases power consumption, so power density cannot grow without bound.