Linux性能分析工具-perf ????

轉(zhuǎn)載自：https://www.cnblogs.com/arnoldlu/p/6241297.html
系統(tǒng)級性能優(yōu)化通常包括兩個階段：性能剖析（performance profiling）和代碼優(yōu)化。

性能剖析的目標是尋找性能瓶頸痢法，查找引發(fā)性能問題的原因及熱點代碼。

代碼優(yōu)化的目標是針對具體性能問題而優(yōu)化代碼或編譯選項，以改善軟件性能。

在性能剖析階段蚊夫，需要借助于現(xiàn)有的profiling工具，如perf等。在代碼優(yōu)化階段往往需要借助開發(fā)者的經(jīng)驗姻政，編寫簡潔高效的代碼，甚至在匯編級別合理使用各種指令岂嗓，合理安排各種指令的執(zhí)行順序汁展。

perf是一款Linux性能分析工具。Linux性能計數(shù)器是一個新的基于內(nèi)核的子系統(tǒng)，它提供一個性能分析框架食绿，比如硬件（CPU侈咕、PMU(Performance Monitoring Unit)）功能和軟件(軟件計數(shù)器、tracepoint)功能器紧。
通過perf耀销，應用程序可以利用PMU、tracepoint和內(nèi)核中的計數(shù)器來進行性能統(tǒng)計品洛。它不但可以分析制定應用程序的性能問題（per thread）树姨，也可以用來分析內(nèi)核的性能問題摩桶，當然也可以同事分析應用程序和內(nèi)核桥状，從而全面理解應用程序中的性能瓶頸。

使用perf蝌衔，可以分析程序運行期間發(fā)生的硬件事件疾掰，比如instructions retired徊都、processor clock cycles等；也可以分析軟件時間士飒，比如page fault和進程切換。

perf是一款綜合性分析工具蔗崎，大到系統(tǒng)全局性性能酵幕，再小到進程線程級別，甚至到函數(shù)及匯編級別缓苛。

perf提供了十八般武器芳撒，可以拿大刀大卸八塊，也可以拿起手術(shù)刀細致分析未桥。

1. 背景知識

1.1 tracepoints

tracepoints是散落在內(nèi)核源碼中的一些hook笔刹，它們可以在特定的代碼被執(zhí)行到時觸發(fā)，這一特定可以被各種trace/debug工具所使用冬耿。

perf將tracepoint產(chǎn)生的時間記錄下來舌菜，生成報告，通過分析這些報告亦镶，條有人緣便可以了解程序運行期間內(nèi)核的各種細節(jié)日月，對性能癥狀做出準確的診斷。

這些tracepint的對應的sysfs節(jié)點在/sys/kernel/debug/tracing/events目錄下缤骨。

1.2 硬件特性之cache

內(nèi)存讀寫是很快的山孔，但是還是無法和處理器指令執(zhí)行速度相比。為了從內(nèi)存中讀取指令和數(shù)據(jù)荷憋，處理器需要等待台颠，用處理器時間來衡量，這種等待非常漫長。cache是一種SRAM串前，讀寫速度非程崩铮快，能和處理器相匹配荡碾。因此將常用的數(shù)據(jù)保存在cache中谨读，處理器便無需等待，從而提高性能坛吁。cache的尺寸一般都很小劳殖，充分利用cache是軟件調(diào)優(yōu)非常重要部分。

2. 主要關(guān)注點

基于性能分析拨脉，可以進行算法優(yōu)化（空間復雜度和時間復雜度權(quán)衡）哆姻、代碼優(yōu)化（提高執(zhí)行速度、減少內(nèi)存占用）玫膀。

評估程序?qū)τ布Y源的使用情況矛缨，例如各級cache的訪問次數(shù)、各級cache的丟失次數(shù)帖旨、流水線停頓周期箕昭、前端總線訪問次數(shù)等。

評估程序?qū)Σ僮飨到y(tǒng)資源的使用情況解阅，系統(tǒng)調(diào)用次數(shù)落竹、上下文切換次數(shù)、任務遷移次數(shù)货抄。

事件可以分為三種：

Hardware Event由PMU部件產(chǎn)生述召，在特定的條件下探測性能事件是否發(fā)生以及發(fā)生的次數(shù)。比如cache命中碉熄。
Software Event是內(nèi)核產(chǎn)生的事件桨武，分布在各個功能模塊中，統(tǒng)計和操作系統(tǒng)相關(guān)性能事件锈津。比如進程切換呀酸，tick數(shù)等。
Tracepoint Event是內(nèi)核中靜態(tài)tracepoint所觸發(fā)的事件琼梆，這些tracepoint用來判斷程序運行期間內(nèi)核的行為細節(jié)性誉，比如slab分配器的分配次數(shù)等。

3. perf的使用

perf --help之后可以看到perf的二級命令茎杂。

| 序號 | 命令 | 作用 |
| 1 | annotate | 解析perf record生成的perf.data文件错览，顯示被注釋的代碼。 |
| 2 | archive | 根據(jù)數(shù)據(jù)文件記錄的build-id煌往，將所有被采樣到的elf文件打包倾哺。利用此壓縮包轧邪，可以再任何機器上分析數(shù)據(jù)文件中記錄的采樣數(shù)據(jù)。 |
| 3 | bench | perf中內(nèi)置的benchmark羞海，目前包括兩套針對調(diào)度器和內(nèi)存管理子系統(tǒng)的benchmark忌愚。 |
| 4 | buildid-cache | 管理perf的buildid緩存，每個elf文件都有一個獨一無二的buildid却邓。buildid被perf用來關(guān)聯(lián)性能數(shù)據(jù)與elf文件硕糊。 |
| 5 | buildid-list | 列出數(shù)據(jù)文件中記錄的所有buildid。 |
| 6 | diff | 對比兩個數(shù)據(jù)文件的差異腊徙。能夠給出每個符號（函數(shù)）在熱點分析上的具體差異简十。 |
| 7 | evlist | 列出數(shù)據(jù)文件perf.data中所有性能事件。 |
| 8 | inject | 該工具讀取perf record工具記錄的事件流撬腾，并將其定向到標準輸出螟蝙。在被分析代碼中的任何一點，都可以向事件流中注入其它事件时鸵。 |
| 9 | kmem | 針對內(nèi)核內(nèi)存（slab）子系統(tǒng)進行追蹤測量的工具 |
| 10 | kvm | 用來追蹤測試運行在KVM虛擬機上的Guest OS胶逢。 |
| 11 | list | 列出當前系統(tǒng)支持的所有性能事件厅瞎。包括硬件性能事件饰潜、軟件性能事件以及檢查點。 |
| 12 | lock | 分析內(nèi)核中的鎖信息和簸，包括鎖的爭用情況彭雾，等待延遲等。 |
| 13 | mem | 內(nèi)存存取情況 |
| 14 | record | 收集采樣信息锁保，并將其記錄在數(shù)據(jù)文件中薯酝。隨后可通過其它工具對數(shù)據(jù)文件進行分析。 |
| 15 | report | 讀取perf record創(chuàng)建的數(shù)據(jù)文件爽柒，并給出熱點分析結(jié)果吴菠。 |
| 16 | sched | 針對調(diào)度器子系統(tǒng)的分析工具。 |
| 17 | script | 執(zhí)行perl或python寫的功能擴展腳本浩村、生成腳本框架做葵、讀取數(shù)據(jù)文件中的數(shù)據(jù)信息等。 |
| 18 | stat | 執(zhí)行某個命令心墅，收集特定進程的性能概況酿矢，包括CPI、Cache丟失率等怎燥。 |
| 19 | test | perf對當前軟硬件平臺進行健全性測試瘫筐，可用此工具測試當前的軟硬件平臺是否能支持perf的所有功能。 |
| 20 | timechart | 針對測試期間系統(tǒng)行為進行可視化的工具 |
| 21 | top | 類似于linux的top命令铐姚，對系統(tǒng)性能進行實時分析策肝。 |
| 22 | trace | 關(guān)于syscall的工具。 |
| 23 | probe | 用于定義動態(tài)檢查點。 |

全局性概況：

perf list查看當前系統(tǒng)支持的性能事件之众；

perf bench對系統(tǒng)性能進行摸底篇梭；

perf test對系統(tǒng)進行健全性測試；

perf stat對全局性能進行統(tǒng)計酝枢；

全局細節(jié)：

perf top可以實時查看當前系統(tǒng)進程函數(shù)占用率情況恬偷；

perf probe可以自定義動態(tài)事件；

特定功能分析：

perf kmem針對slab子系統(tǒng)性能分析帘睦；

perf kvm針對kvm虛擬化分析袍患；

perf lock分析鎖性能；

perf mem分析內(nèi)存slab性能竣付；

perf sched分析內(nèi)核調(diào)度器性能诡延；

perf trace記錄系統(tǒng)調(diào)用軌跡；

最常用功能perf record古胆，可以系統(tǒng)全局肆良，也可以具體到某個進程，更甚具體到某一進程某一事件逸绎；可宏觀惹恃，也可以很微觀。

pref record記錄信息到perf.data棺牧；

perf report生成報告巫糙；

perf diff對兩個記錄進行diff；

perf evlist列出記錄的性能事件颊乘；

perf annotate顯示perf.data函數(shù)代碼参淹；

perf archive將相關(guān)符號打包，方便在其它機器進行分析乏悄；

perf script將perf.data輸出可讀性文本浙值；

可視化工具perf timechart

perf timechart record記錄事件；

perf timechart生成output.svg文檔檩小；

3.0 perf引入的overhead

perf測試不可避免的會引入額外負荷开呐，有三種形式：

counting：內(nèi)核提供計數(shù)總結(jié)，多是Hardware Event识啦、Software Events负蚊、PMU計數(shù)等。相關(guān)命令perf stat颓哮。

sampling：perf將事件數(shù)據(jù)緩存到一塊buffer中家妆，然后異步寫入到perf.data文件中。使用perf report等工具進行離線分析冕茅。

bpf：Kernel 4.4+新增功能伤极，可以提供更多有效filter和輸出總結(jié)蛹找。

counting引入的額外負荷最小哨坪；sampling在某些情況下會引入非常大的負荷庸疾；bpf可以有效縮減負荷。

針對sampling当编，可以通過掛在建立在RAM上的文件系統(tǒng)來有效降低讀寫I/O引入的負荷届慈。

mkdir /tmpfs

mount -t tmpfs tmpfs /tmpfs

3.1 perf list

perf list不能完全顯示所有支持的事件類型，需要sudo perf list忿偷。

同時還可以顯示特定模塊支持的perf事件：hw/cache/pmu都是硬件相關(guān)的金顿；tracepoint基于內(nèi)核的ftrace；sw實際上是內(nèi)核計數(shù)器鲤桥。

hw/hardware顯示支持的硬件事件相關(guān)揍拆，如：

al@al-System-Product-Name:~/perf$ sudo perf list hardware

List of pre-defined events (to be used in -e):

branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]

sw/software顯示支持的軟件事件列表：

al@al-System-Product-Name:~/perf$ sudo perf list sw

List of pre-defined events (to be used in -e):

alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]

cache/hwcache顯示硬件cache相關(guān)事件列表：

al@al-System-Product-Name:~/perf$ sudo perf list cache

List of pre-defined events (to be used in -e):

L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-stores [Hardware cache event]
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
node-load-misses [Hardware cache event]
node-loads [Hardware cache event]

pmu顯示支持的PMU事件列表：

al@al-System-Product-Name:~/perf$ sudo perf list pmu

List of pre-defined events (to be used in -e):

branch-instructions OR cpu/branch-instructions/ [Kernel PMU event]
branch-misses OR cpu/branch-misses/ [Kernel PMU event]
cache-misses OR cpu/cache-misses/ [Kernel PMU event]
cache-references OR cpu/cache-references/ [Kernel PMU event]
cpu-cycles OR cpu/cpu-cycles/ [Kernel PMU event]
instructions OR cpu/instructions/ [Kernel PMU event]
msr/aperf/ [Kernel PMU event]
msr/mperf/ [Kernel PMU event]
msr/tsc/ [Kernel PMU event]
stalled-cycles-backend OR cpu/stalled-cycles-backend/ [Kernel PMU event]
stalled-cycles-frontend OR cpu/stalled-cycles-frontend/ [Kernel PMU event]

tracepoint顯示支持的所有tracepoint列表，這個列表就比較龐大：

al@al-System-Product-Name:~/perf$ sudo perf list tracepoint

List of pre-defined events (to be used in -e):

alarmtimer:alarmtimer_cancel [Tracepoint event]
alarmtimer:alarmtimer_fired [Tracepoint event]
alarmtimer:alarmtimer_start [Tracepoint event]
alarmtimer:alarmtimer_suspend [Tracepoint event]
block:block_bio_backmerge [Tracepoint event]
block:block_bio_bounce [Tracepoint event]
block:block_bio_complete [Tracepoint event]
block:block_bio_frontmerge [Tracepoint event]
block:block_bio_queue [Tracepoint event]

…

3.2 perf top

默認情況下perf top是無法顯示信息的茶凳，需要sudo perf top或者echo -1 > /proc/sys/kernel/perf_event_paranoid(在Ubuntu16.04嫂拴，還需要echo 0 > /proc/sys/kernel/kptr_restrict)。

即可以正常顯示perf top如下：

第一列：符號引發(fā)的性能事件的比例贮喧，指占用的cpu周期比例筒狠。

第二列：符號所在的DSO(Dynamic Shared Object)，可以是應用程序塞淹、內(nèi)核窟蓝、動態(tài)鏈接庫罪裹、模塊饱普。

第三列：DSO的類型。[.]表示此符號屬于用戶態(tài)的ELF文件状共，包括可執(zhí)行文件與動態(tài)鏈接庫套耕；[k]表述此符號屬于內(nèi)核或模塊。

第四列：符號名峡继。有些符號不能解析為函數(shù)名冯袍，只能用地址表示。

image

關(guān)于perf top界面常用命令如下：

h：顯示幫助碾牌，即可顯示詳細的幫助信息康愤。

UP/DOWN/PGUP/PGDN/SPACE：上下和翻頁。

a：annotate current symbol舶吗，注解當前符號征冷。能夠給出匯編語言的注解，給出各條指令的采樣率誓琼。

d：過濾掉所有不屬于此DSO的符號检激。非常方便查看同一類別的符號肴捉。

P：將當前信息保存到perf.hist.N中。

perf top常用選項有：

-e <event>：指明要分析的性能事件叔收。

-p <pid>：Profile events on existing Process ID (comma sperated list). 僅分析目標進程及其創(chuàng)建的線程齿穗。

-k <path>：Path to vmlinux. Required for annotation functionality. 帶符號表的內(nèi)核映像所在的路徑。

-K：不顯示屬于內(nèi)核或模塊的符號饺律。

-U：不顯示屬于用戶態(tài)程序的符號窃页。

-d <n>：界面的刷新周期，默認為2s复濒，因為perf top默認每2s從mmap的內(nèi)存區(qū)域讀取一次性能數(shù)據(jù)腮出。

-g：得到函數(shù)的調(diào)用關(guān)系圖。

perf top --call-graph [fractal]芝薇，路徑概率為相對值胚嘲，加起來為100%，調(diào)用順序為從下往上洛二。

perf top --call-graph graph馋劈，路徑概率為絕對值，加起來為該函數(shù)的熱度晾嘶。

image

3.3 perf stat

perf stat用于運行指令妓雾，并分析其統(tǒng)計結(jié)果。雖然perf top也可以指定pid垒迂，但是必須先啟動應用才能查看信息械姻。

perf stat能完整統(tǒng)計應用整個生命周期的信息。

命令格式為：

perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
perf stat [-e <EVENT> | --event=EVENT] [-a] — <command> [<options>]

下面簡單看一下perf stat 的輸出：

al@al-System-Product-Name:~/perf$ sudo perf stat
^C
Performance counter stats for 'system wide':

  40904.820871      cpu-clock (msec)          #    5.000 CPUs utilized          
         18,132      context-switches          #    0.443 K/sec                  
          1,053      cpu-migrations            #    0.026 K/sec                  
          2,420      page-faults               #    0.059 K/sec                  
  3,958,376,712      cycles                    #    0.097 GHz                      (49.99%)
    574,598,403      stalled-cycles-frontend   #   14.52% frontend cycles idle     (49.98%)
  9,392,982,910      stalled-cycles-backend    #  237.29% backend cycles idle      (50.00%)
  1,653,185,883      instructions              #    0.42  insn per cycle         
                                               #    5.68  stalled cycles per insn  (50.01%)
    237,061,366      branches                  #    5.795 M/sec                    (50.02%)
     18,333,168      branch-misses             #    7.73% of all branches          (50.00%)

   8.181521203 seconds time elapsed

輸出解釋如下：

cpu-clock：任務真正占用的處理器時間机断，單位為ms楷拳。CPUs utilized = task-clock / time elapsed，CPU的占用率吏奸。

context-switches：程序在運行過程中上下文的切換次數(shù)欢揖。

CPU-migrations：程序在運行過程中發(fā)生的處理器遷移次數(shù)。Linux為了維持多個處理器的負載均衡奋蔚，在特定條件下會將某個任務從一個CPU遷移到另一個CPU她混。

CPU遷移和上下文切換：發(fā)生上下文切換不一定會發(fā)生CPU遷移，而發(fā)生CPU遷移時肯定會發(fā)生上下文切換泊碑。發(fā)生上下文切換有可能只是把上下文從當前CPU中換出坤按，下一次調(diào)度器還是將進程安排在這個CPU上執(zhí)行。

page-faults：缺頁異常的次數(shù)馒过。當應用程序請求的頁面尚未建立臭脓、請求的頁面不在內(nèi)存中，或者請求的頁面雖然在內(nèi)存中沉桌，但物理地址和虛擬地址的映射關(guān)系尚未建立時谢鹊，都會觸發(fā)一次缺頁異常算吩。另外TLB不命中，頁面訪問權(quán)限不匹配等情況也會觸發(fā)缺頁異常佃扼。

cycles：消耗的處理器周期數(shù)偎巢。如果把被ls使用的cpu cycles看成是一個處理器的，那么它的主頻為2.486GHz兼耀⊙怪纾可以用cycles / task-clock算出。

stalled-cycles-frontend：指令讀取或解碼的質(zhì)量步驟瘤运，未能按理想狀態(tài)發(fā)揮并行左右窍霞，發(fā)生停滯的時鐘周期。

stalled-cycles-backend：指令執(zhí)行步驟拯坟，發(fā)生停滯的時鐘周期但金。

instructions：執(zhí)行了多少條指令。IPC為平均每個cpu cycle執(zhí)行了多少條指令郁季。

branches：遇到的分支指令數(shù)冷溃。branch-misses是預測錯誤的分支指令數(shù)。

其他常用參數(shù)

-a, --all-cpus 顯示所有CPU上的統(tǒng)計信息
-C, --cpu <cpu> 顯示指定CPU的統(tǒng)計信息
-c, --scale scale/normalize counters
-D, --delay <n> ms to wait before starting measurement after program start
-d, --detailed detailed run - start a lot of events
-e, --event <event> event selector. use 'perf list' to list available events
-G, --cgroup <name> monitor event in cgroup name only
-g, --group put the counters into a counter group
-I, --interval-print <n>
print counts at regular interval in ms (>= 10)
-i, --no-inherit child tasks do not inherit counters
-n, --null null run - dont start any counters
-o, --output <file> 輸出統(tǒng)計信息到文件
-p, --pid <pid> stat events on existing process id
-r, --repeat <n> repeat command and print average + stddev (max: 100, forever: 0)
-S, --sync call sync() before starting a run
-t, --tid <tid> stat events on existing thread id
...

示例

前面統(tǒng)計程序的示例梦裂，下面看一下統(tǒng)計CPU信息的示例：

執(zhí)行sudo perf stat -C 0似枕，統(tǒng)計CPU 0的信息。想要停止后年柠，按下Ctrl+C終止凿歼。可以看到統(tǒng)計項一樣冗恨，只是統(tǒng)計對象變了答憔。

al@al-System-Product-Name:~/perf$ sudo perf stat -C 0
^C
Performance counter stats for 'CPU(s) 0':

   2517.107315      cpu-clock (msec)          #    1.000 CPUs utilized          
          2,941      context-switches          #    0.001 M/sec                  
            109      cpu-migrations            #    0.043 K/sec                  
             38      page-faults               #    0.015 K/sec                  
    644,094,340      cycles                    #    0.256 GHz                      (49.94%)
     70,425,076      stalled-cycles-frontend   #   10.93% frontend cycles idle     (49.94%)
    965,270,543      stalled-cycles-backend    #  149.86% backend cycles idle      (49.94%)
    623,284,864      instructions              #    0.97  insn per cycle         
                                               #    1.55  stalled cycles per insn  (50.06%)
     65,658,190      branches                  #   26.085 M/sec                    (50.06%)
      3,276,104      branch-misses             #    4.99% of all branches          (50.06%)

   2.516996126 seconds time elapsed

如果需要統(tǒng)計更多的項，需要使用-e派近，如：

| perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,dTLB-loads,dTLB-load-misses ls |

結(jié)果如下攀唯，關(guān)注的特殊項也納入統(tǒng)計。

al@al-System-Product-Name:~/perf$ sudo perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,dTLB-loads,dTLB-load-misses ls

Performance counter stats for 'ls':

      2.319422      task-clock (msec)         #    0.719 CPUs utilized          
              0      context-switches          #    0.000 K/sec                  
              0      cpu-migrations            #    0.000 K/sec                  
             89      page-faults               #    0.038 M/sec                  
      2,142,386      cycles                    #    0.924 GHz                    
        659,800      stalled-cycles-frontend   #   30.80% frontend cycles idle   
        725,343      stalled-cycles-backend    #   33.86% backend cycles idle    
      1,344,518      instructions              #    0.63  insn per cycle         
                                               #    0.54  stalled cycles per insn
  <not counted>      branches                                                    
  <not counted>      branch-misses                                               
  <not counted>      L1-dcache-loads                                             
  <not counted>      L1-dcache-load-misses                                       
  <not counted>      LLC-loads                                                   
  <not counted>      LLC-load-misses                                             
  <not counted>      dTLB-loads                                                  
  <not counted>      dTLB-load-misses                                           

   0.003227507 seconds time elapsed

3.4 perf bench

perf bench作為benchmark工具的通用框架渴丸，包含sched/mem/numa/futex等子系統(tǒng)，all可以指定所有另凌。

perf bench可用于評估系統(tǒng)sched/mem等特定性能谱轨。

perf bench sched：調(diào)度器和IPC機制。包含messaging和pipe兩個功能吠谢。

perf bench mem：內(nèi)存存取性能土童。包含memcpy和memset兩個功能。

perf bench numa：NUMA架構(gòu)的調(diào)度和內(nèi)存處理性能工坊。包含mem功能献汗。

perf bench futex：futex壓力測試敢订。包含hash/wake/wake-parallel/requeue/lock-pi功能。

perf bench all：所有bench測試的集合

3.4.1 perf bench sched all

測試messaging和pipi兩部分性能罢吃。

3.4.1.1 sched messaging評估進程調(diào)度和核間通信

sched message 是從經(jīng)典的測試程序 hackbench 移植而來楚午，用來衡量調(diào)度器的性能，overhead 以及可擴展性尿招。

該 benchmark 啟動 N 個 reader/sender 進程或線程對矾柜，通過 IPC(socket 或者 pipe) 進行并發(fā)的讀寫。一般人們將 N 不斷加大來衡量調(diào)度器的可擴展性就谜。

sched message 的用法及用途和 hackbench 一樣怪蔑，可以通過修改參數(shù)進行不同目的測試：

-g, --group <n> Specify number of groups

-l, --nr_loops <n> Specify the number of loops to run (default: 100)

-p, --pipe Use pipe() instead of socketpair()

-t, --thread Be multi thread instead of multi process

測試結(jié)果：

al@al-System-Product-Name:~/perf$ perf bench sched all

Running sched/messaging benchmark...

20 sender and receiver processes per group

10 groups == 400 processes run
 Total time: 0.173 [sec]
Running sched/pipe benchmark...

Executed 1000000 pipe operations between two processes
 Total time: 12.233 [sec]

  12.233170 usecs/op
       81744 ops/sec

使用pipe()和socketpair()對測試影響：

perf bench sched messaging

Running 'sched/messaging' benchmark:

20 sender and receiver processes per group

10 groups == 400 processes run
 Total time: 0.176 [sec]
perf bench sched messaging -p

Running 'sched/messaging' benchmark:

20 sender and receiver processes per group

10 groups == 400 processes run
 Total time: 0.093 [sec]

可見socketpair()性能要明顯低于pipe()。

3.4.1.2 sched pipe評估pipe性能

sched pipe 從 Ingo Molnar 的 pipe-test-1m.c 移植而來丧荐。當初 Ingo 的原始程序是為了測試不同的調(diào)度器的性能和公平性的缆瓣。

其工作原理很簡單，兩個進程互相通過 pipe 拼命地發(fā) 1000000 個整數(shù)虹统，進程 A 發(fā)給 B捆愁，同時 B 發(fā)給 A。因為 A 和 B 互相依賴窟却，因此假如調(diào)度器不公平昼丑，對 A 比 B 好，那么 A 和 B 整體所需要的時間就會更長夸赫。

al@al-System-Product-Name:~/perf$ perf bench sched pipe

Running 'sched/pipe' benchmark:

Executed 1000000 pipe operations between two processes
 Total time: 12.240 [sec]

  12.240411 usecs/op
       81696 ops/sec

3.4.2 perf bench mem all

該測試衡量不同版本的memcpy/memset/ 函數(shù)處理一個 1M 數(shù)據(jù)的所花費的時間菩帝，轉(zhuǎn)換成吞吐率棕叫。

al@al-System-Product-Name:~/perf$ perf bench mem all

Running mem/memcpy benchmark...

function 'default' (Default memcpy() provided by glibc)

Copying 1MB bytes ...
   1.236155 GB/sec.
..

3.4.3 perf bench futex

Futex是一種用戶態(tài)和內(nèi)核態(tài)混合機制川队，所以需要兩個部分合作完成，linux上提供了sys_futex系統(tǒng)調(diào)用单料，對進程競爭情況下的同步處理提供支持切平。

所有的futex同步操作都應該從用戶空間開始握础，首先創(chuàng)建一個futex同步變量，也就是位于共享內(nèi)存的一個整型計數(shù)器悴品。

當進程嘗試持有鎖或者要進入互斥區(qū)的時候禀综，對futex執(zhí)行"down"操作，即原子性的給futex同步變量減1苔严。如果同步變量變?yōu)?定枷，則沒有競爭發(fā)生，進程照常執(zhí)行届氢。

如果同步變量是個負數(shù)欠窒，則意味著有競爭發(fā)生，需要調(diào)用futex系統(tǒng)調(diào)用的futex_wait操作休眠當前進程退子。

當進程釋放鎖或者要離開互斥區(qū)的時候岖妄，對futex進行"up"操作型将，即原子性的給futex同步變量加1。如果同步變量由0變成1荐虐，則沒有競爭發(fā)生七兜，進程照常執(zhí)行。

如果加之前同步變量是負數(shù)缚俏，則意味著有競爭發(fā)生惊搏，需要調(diào)用futex系統(tǒng)調(diào)用的futex_wake操作喚醒一個或者多個等待進程。

al@al-System-Product-Name:~/perf$ perf bench futex all

Running futex/hash benchmark...

Run summary [PID 3806]: 5 threads, each operating on 1024 [private] futexes for 10 secs.

[thread 0] futexes: 0x4003d20 ... 0x4004d1c [ 4635648 ops/sec ]
[thread 1] futexes: 0x4004d30 ... 0x4005d2c [ 4611072 ops/sec ]
[thread 2] futexes: 0x4005e70 ... 0x4006e6c [ 4254515 ops/sec ]
[thread 3] futexes: 0x4006fb0 ... 0x4007fac [ 4559360 ops/sec ]
[thread 4] futexes: 0x40080f0 ... 0x40090ec [ 4636262 ops/sec ]

Averaged 4539371 operations/sec (+- 1.60%), total secs = 10

Running futex/wake benchmark...

Run summary [PID 3806]: blocking on 5 threads (at [private] futex 0x96b52c), waking up 1 at a time.

[Run 1]: Wokeup 5 of 5 threads in 0.0270 ms
[Run 2]: Wokeup 5 of 5 threads in 0.0370 ms

...

3.4 perf record

運行一個命令忧换，并將其數(shù)據(jù)保存到perf.data中恬惯。隨后，可以使用perf report進行分析亚茬。

perf record和perf report可以更精確的分析一個應用酪耳，perf record可以精確到函數(shù)級別。并且在函數(shù)里面混合顯示匯編語言和代碼刹缝。

創(chuàng)建一個fork.c文件用于測試：

[

復制代碼

](javascript:void(0); "復制代碼")

<pre style="margin: 0px 0px 0px 22px; white-space: pre-wrap; overflow-wrap: break-word; font-size: 12px !important; font-family: "Courier New" !important;">#include <stdio.h>

void test_little(void)
{ int i,j; for(i = 0; i < 30000000; i++)
j=i;
} void test_mdedium(void)
{ int i,j; for(i = 0; i < 60000000; i++)
j=i;
} void test_high(void)
{ int i,j; for(i = 0; i < 90000000; i++)
j=i;
} void test_hi(void)
{ int i,j; for(i = 0; i < 120000000; i++)
j=i;
} int main(void)
{ int i, pid, result; for(i = 0; i<2; i++) {
result = fork(); if(result>0)
printf("i=%d parent parent=%d current=%d child=%d\n", i, getppid(), getpid(), result); else printf("i=%d child parent=%d current=%d\n", i, getppid(), getpid()); if(i==0)
{
test_little();
sleep(1);
} else {
test_mdedium();
sleep(1);
}
}

pid = wait(NULL);
test_high();
printf("pid=%d wait=%d\n", getpid(), pid);
sleep(1);
pid = wait(NULL);
test_hi();
printf("pid=%d wait=%d\n", getpid(), pid); return 0;
}</pre>

[

復制代碼

](javascript:void(0); "復制代碼")

編譯fork.c文件gcc fork.c -o fork-g -O0碗暗，同時可以使用此方法分析是否選擇編譯優(yōu)化產(chǎn)生的結(jié)果。-g是只能callgraph功能梢夯，-O0是關(guān)閉優(yōu)化言疗。

常用選項

-e record指定PMU事件
--filter event事件過濾器
-a 錄取所有CPU的事件
-p 錄取指定pid進程的事件
-o 指定錄取保存數(shù)據(jù)的文件名
-g 使能函數(shù)調(diào)用圖功能
-C 錄取指定CPU的事件

sudo perf record -a -g ./fork：會在當前目錄生成perf.data文件。

sudo perf report --call-graph none結(jié)果如下,后面結(jié)合perf timechart分析.

image

上圖看上去比較亂颂砸，如果想只看fork產(chǎn)生的信息：

sudo perf report --call-graph none -c fork

可以看出只顯示了fork程序的相關(guān)符號及其占用率噪奄。

image

3.5 perf report

解析perf record產(chǎn)生的數(shù)據(jù)，并給出分析結(jié)果人乓。

常用參數(shù)：

-i 導入的數(shù)據(jù)文件名稱勤篮，如果沒有則默認為perf.data

-g 生成函數(shù)調(diào)用關(guān)系圖，
核要打開CONFIG_KALLSYMS色罚；用戶空間庫或者執(zhí)行文件需要帶符號信息(not stripped)碰缔，編譯選項需要加上-g。**

--sort 從更高層面顯示分類統(tǒng)計信息戳护，比如： pid, comm, dso, symbol, parent, cpu,socket, srcline, weight, local_weight.

執(zhí)行sudo perf report -i perf.data金抡，可以看出main函數(shù)所占百分比，以及funcA和funcB分別所占百分比姑尺。

在funcB執(zhí)行過程中竟终，還產(chǎn)生了apic timer，占用了一部分cpu資源切蟋。除此之外，占比基本上是1：10榆芦。

image

funcA和funcB的占比柄粹，基本符合預期喘鸟。那么進入longa，分析一下熱點驻右。

在C和匯編混合顯示界面什黑，可以看出for循環(huán)占用了69.92%，j=i賦值占用了30.08%堪夭。

image

根據(jù)之上描述愕把，可以看出top適合監(jiān)控整個系統(tǒng)的性能，stat比較適合單個程序的性能分析森爽，record/report更適合對程序進行更細粒度的分析恨豁。

注意點：

在使用perf report -g的時候，可能會提示Failed to open /lib/libpthread-0.9.33.2.so, continuing without symbols爬迟。

這時候通過file xxx查看橘蜜，如果提示xxxx stripped表示此文件不包含符號信息，需要xxxx not stripped文件付呕。

3.6 perf timechart

perf timechart是將之前的各種統(tǒng)計信息圖形化的一個工具计福。

perf timechart record <option> <command>用于記錄整個系統(tǒng)或者一個應用的事件，還可以加option記錄指定類型的事件徽职。

perf timechart用于將perf.data轉(zhuǎn)換成SVG格式的文件象颖，SVG可以通過Inkscape或者瀏覽器打開。

perf timechart record可以指定特定類型的事件：

-P：記錄power相關(guān)事件

-T：記錄任務相關(guān)事件

-I：記錄io相關(guān)事件

-g：記錄函數(shù)調(diào)用關(guān)系

perf timechart用于將perf timechart record錄取的perf.data轉(zhuǎn)換成output.svg姆钉。

-w調(diào)整輸出的svg文件長度说订，可以查看更多細節(jié)。

-p可以指定只查看某些進程輸出育韩，使用方式：sudo perf timechart -p test1 -p thermald

-o 指定輸出文件名

-i 指定待解析的文件名

-w 輸出SVG文件寬度

-P 只顯示power相關(guān)事件圖標

-T , --tasks-only 顯示task信息克蚂，不顯示處理器信息

-p 顯示指定進程名稱或者PID顯示

--symfs=<directory> 指定系統(tǒng)符號表路徑

-t, --topology 根據(jù)拓撲結(jié)構(gòu)對CPU進行分類

--highlight=<duration_nsecs|task_name> 對運行超過特定時間的task高亮

當線程太多影響svg解析速度的時候，可以通過-p指定特定線程進行分析筋讨。如果需要幾個線程埃叭，每個線程采用-p xxx。

sudo perf timechart record -T ./fork && sudo perf timechart –p fork

結(jié)果如下悉罕，可以看到相關(guān)task的名稱赤屋，開始時間/結(jié)束時間，每個時間點的狀態(tài)（Running/Idle/Deeper Idle/Deepest Idle/Sleeping/Waiting for Cpu /Blocked on IO）壁袄。

3.6.1 結(jié)合perf timechart和perf report分析函數(shù)占比

由perf report可知test_little类早、test_medium、test_high嗜逻、test_hi四個函數(shù)占比分別為3.84%涩僻、12.01%、22.99%、30.43%逆日。

有代碼可知如果以test_little為1單位嵌巷，那么test_medium就為2單位，test_high為3單位室抽，test_hi為4單位搪哪。

四個函數(shù)執(zhí)行次數(shù)分別為2、4坪圾、4晓折、4，所以四個函數(shù)每個單位對應CPU占比為：

test_little - 3.84%/2=1.9%

test_medium - 12.01%/4/2=1.5%

test_high - 22.99/4/3=1.9%

test_hi - 30.43%/4/4=1.9%

基本上符合預期兽泄。

image

記錄IO事件漓概，可以看到按應用分類的，Disk/Network/Sync/Poll/Error信息已日。以及每個應用數(shù)據(jù)吞吐量垛耳。

sudo perf timechart record -I && sudo perf timechart -w 1800。

記錄Power狀態(tài)事件飘千，可以看到不同之處在于Idle之類的狀態(tài)里面還有細分C/C2更詳細的顯示Power狀態(tài)堂鲜。

sudo perf timechart record -I && sudo perf timechart -w 1800

3.7 perf script

用于讀取perf record保存的裸trace數(shù)據(jù)。

使用方法：

   perf script [<options>]
   perf script [<options>] record <script> [<record-options>] <command>
   perf script [<options>] report <script> [script-args]
   perf script [<options>] <script> <required-script-args> [<record-options>] <command>
   perf script [<options>] <top-script> [script-args]

還可以編寫perl或者python腳本進行數(shù)據(jù)分析护奈。

3.8 perf lock

3.8.1 perf lock內(nèi)核配置

要使用此功能缔莲，內(nèi)核需要編譯選項的支持：CONFIG_LOCKDEP、CONFIG_LOCK_STAT霉旗。

CONFIG_LOCKDEP defines acquired and release events.

CONFIG_LOCK_STAT defines contended and acquired lock events.

CONFIG_LOCKDEP=y

CONFIG_LOCK_STAT=y

3.8.2 perf lock使用

分析內(nèi)核鎖統(tǒng)計信息痴奏。

鎖是內(nèi)核用于同步的方法，一旦加了鎖厌秒，其他加鎖的內(nèi)核執(zhí)行路徑就必須等待读拆，降低了并行。同時鸵闪，如果加鎖不正確還會造成死鎖檐晕。

因此對于內(nèi)核鎖進行分析是一項重要的調(diào)優(yōu)工作。

perf lock record：抓取執(zhí)行命令的lock事件信息到perf.data中

perf lock report：產(chǎn)生統(tǒng)計報告

perf lock script：顯示原始lock事件

perf lock info：

-k <value>：sorting key蚌讼，默認為acquired辟灰，還可以按contended、wait_total篡石、wait_max和wait_min來排序芥喇。

Name：內(nèi)核鎖的名字。

aquired：該鎖被直接獲得的次數(shù)凰萨，因為沒有其它內(nèi)核路徑占用該鎖继控，此時不用等待械馆。

contended：該鎖等待后獲得的次數(shù)，此時被其它內(nèi)核路徑占用湿诊，需要等待狱杰。

total wait：為了獲得該鎖瘦材，總共的等待時間厅须。

max wait：為了獲得該鎖，最大的等待時間食棕。

min wait：為了獲得該鎖朗和，最小的等待時間。

3.9 perf kmem

3.9.1 perf kmem介紹

perf kmem用于跟蹤測量內(nèi)核slab分配器事件信息簿晓。
比如內(nèi)存分配/釋放等眶拉。可以用來研究程序在哪里分配了大量內(nèi)存憔儿，或者在什么地方產(chǎn)生碎片之類的和內(nèi)存管理相關(guān)的問題忆植。

perf kmem和perf lock實際上都是perf tracepoint的子類，等同于perf record -e kmem:和perf record -e lock:谒臼。

但是這些工具在內(nèi)部隊員是數(shù)據(jù)進行了慧聰和分析朝刊，因此統(tǒng)計報表更具可讀性。

perf kmem record：抓取命令的內(nèi)核slab分配器事件

perf kmem stat：生成內(nèi)核slab分配器統(tǒng)計信息

選項：

--caller 顯示每個調(diào)用點統(tǒng)計信息

--alloc   顯示每次內(nèi)存分配事件

-s <key[,key2...]>, --sort=<key[,key2...]>

       Sort the output (default: frag,hit,bytes for slab and bytes,hit for page). Available sort keys are ptr, callsite, bytes, hit, pingpong, frag for slab and page, callsite, bytes, hit, order, migtype, gfp for page.
       This option should be preceded by one of the mode selection options - i.e. --slab, --page, --alloc and/or --caller.

-l <num>, 只顯示固定行數(shù)

--raw-ip

Print raw ip instead of symbol

--slab  分析slab分配器事件

--page  分析頁分配事件

--live

       Show live page stat. The perf kmem shows total allocation stat by default, but this option shows live (currently allocated) pages instead. (This option works with --page option only)

3.9.2 perf kmem使用

| sudo perf kmem record ls |

sudo perf kmem stat只顯示概要統(tǒng)計信息：

| SUMMARY (SLAB allocator)

Total bytes requested: 368,589
Total bytes allocated: 369,424
Total bytes wasted on internal fragmentation: 835
Internal fragmentation: 0.226028%
Cross CPU allocations: 0/2,256 |

sudo perf kmem --alloc --caller --slab stat顯示了更加詳細的分類信息：

Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag

proc_reg_open+32 | 64/64 | 40/40 | 1 | 0 | 37.500%
seq_open+34 | 384/192 | 272/136 | 2 | 0 | 29.167%
apparmor_file_alloc_security+5c | 608/32 | 456/24 | 19 | 1 | 25.000%
ext4_readdir+8bd | 64/64 | 48/48 | 1 | 0 | 25.000%
ext4_htree_store_dirent+3e | 896/68 | 770/59 | 13 | 0 | 14.062%
load_elf_phdrs+64 | 1024/512 | 896/448 | 2 | 0 | 12.500%
load_elf_binary+222 | 32/32 | 28/28 | 1 | 0 | 12.500%
anon_vma_prepare+11b | 1280/80 | 1152/72 | 16 | 0 | 10.000%
inotify_handle_event+75 | 73664/64 | 66758/58 | 1151 | 0 | 9.375%
do_execveat_common.isra.33+e5 | 2048/256 | 1920/240 | 8 | 1 | 6.250%
... | ... | ... | ... | ... | ...

Alloc Ptr | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag

0xffff8800ca4d86c0 | 192/192 | 136/136 | 1 | 0 | 29.167%
0xffff8801ea05aa80 | 192/192 | 136/136 | 1 | 0 | 29.167%
0xffff8801f6ad6540 | 96/96 | 68/68 | 1 | 0 | 29.167%
0xffff8801f6ad6f00 | 96/96 | 68/68 | 1 | 0 | 29.167%
0xffff880214e65e80 | 96/32 | 72/24 | 3 | 0 | 25.000%
0xffff8801f45ddac0 | 64/64 | 48/48 | 1 | 0 | 25.000%
0xffff8800ac4093c0 | 32/32 | 24/24 | 1 | 1 | 25.000%
0xffff8800af5a4260 | 32/32 | 24/24 | 1 | 0 | 25.000%
0xffff880214e651e0 | 32/32 | 24/24 | 1 | 0 | 25.000%
0xffff880214e65220 | 32/32 | 24/24 | 1 | 0 | 25.000%
0xffff880214e654e0 | 32/32 | 24/24 | 1 | 0 | 25.000%

SUMMARY (SLAB allocator)

Total bytes requested: 409,260
Total bytes allocated: 417,008
Total bytes wasted on internal fragmentation: 7,748
Internal fragmentation: 1.857998%
Cross CPU allocations: 0/2,833

該報告有三個部分：根據(jù) Callsite 顯示的部分蜈缤，所謂 Callsite 即內(nèi)核代碼中調(diào)用 kmalloc 和 kfree 的地方拾氓。

比如上圖中的函數(shù)proc_reg_open，Hit 欄為 1底哥，表示該函數(shù)在 record 期間一共調(diào)用了 kmalloc 一次咙鞍。

對于第一行 Total_alloc/Per 顯示為 1024/1024，第一個值 1024 表示函數(shù) proc_reg_open 總共分配的內(nèi)存大小趾徽，Per 表示平均值续滋。

比較有趣的兩個參數(shù)是 Ping-pong 和 Frag。Frag 比較容易理解孵奶，即內(nèi)部碎片疲酌。雖然相對于 Buddy System，Slab 正是要解決內(nèi)部碎片問題拒课，但 slab 依然存在內(nèi)部碎片徐勃，比如一個 cache 的大小為 1024，但需要分配的數(shù)據(jù)結(jié)構(gòu)大小為 1022早像，那么有 2 個字節(jié)成為碎片僻肖。Frag 即碎片的比例。

Ping-pong 是一種現(xiàn)象卢鹦，在多 CPU 系統(tǒng)中臀脏，多個 CPU 共享的內(nèi)存會出現(xiàn)”乒乓現(xiàn)象”劝堪。一個 CPU 分配內(nèi)存，其他 CPU 可能訪問該內(nèi)存對象揉稚，也可能最終由另外一個 CPU 釋放該內(nèi)存對象秒啦。而在多 CPU 系統(tǒng)中，L1 cache 是 per CPU 的搀玖，CPU2 修改了內(nèi)存余境，那么其他的 CPU 的 cache 都必須更新，這對于性能是一個損失灌诅。Perf kmem 在 kfree 事件中判斷 CPU 號芳来，如果和 kmalloc 時的不同，則視為一次 ping-pong猜拾，理想的情況下 ping-pong 越小越好即舌。Ibm developerworks 上有一篇講述 oprofile 的文章，其中關(guān)于 cache 的調(diào)優(yōu)可以作為很好的參考資料挎袜。

Callsite：內(nèi)核代碼中調(diào)用kmalloc和kfree的地方顽聂。

Total_alloc/Per：總共分配的內(nèi)存大小，平均每次分配的內(nèi)存大小盯仪。

Total_req/Per：總共請求的內(nèi)存大小紊搪，平均每次請求的內(nèi)存大小。

Hit：調(diào)用的次數(shù)磨总。

Ping-pong：kmalloc和kfree不被同一個CPU執(zhí)行時的次數(shù)嗦明，這會導致cache效率降低。

Frag：碎片所占的百分比蚪燕，碎片 = 分配的內(nèi)存 - 請求的內(nèi)存娶牌，這部分是浪費的。

有使用--alloc選項馆纳，還會看到Alloc Ptr诗良，即所分配內(nèi)存的地址。

后面則有根據(jù)被調(diào)用地點的顯示方式的部分鲁驶。

最后一個部分是匯總數(shù)據(jù)鉴裹，顯示總的分配的內(nèi)存和碎片情況，Cross CPU allocation 即 ping-pong 的匯總钥弯。

要分析--page事件径荔，需要在record的時候加上--page選項。sudo perf kmem record --page ls脆霎，使用sudo perf kmem stat --page查看結(jié)果：

| 0xee318 [0x8]: failed to process type: 68
error during process events: -22 |

3.10 perf sched

perf sched專門用于跟蹤/測量調(diào)度器总处，包括延時等。

perf sched record <command>：錄制測試過程中的調(diào)度事件

perf sched latency：報告線程調(diào)度延時和其他調(diào)度相關(guān)屬性

perf sched script：查看執(zhí)行過程中詳細的trace信息

perf sched replay：回放perf sched record錄制的執(zhí)行過程

perf sched map:用字符表示打印上下文切換

執(zhí)行sudo perf sched record ls后睛蛛，通過不同方式查看結(jié)果鹦马。

sudo perf sched latency胧谈，可以看到ls進程的Average delay/Maximum delay時間。各個 column 的含義如下： Task: 進程的名字和 pid Runtime: 實際運行時間 Switches: 進程切換的次數(shù) Average delay: 平均的調(diào)度延遲 Maximum delay: 最大延遲

這里最值得人們關(guān)注的是 Maximum delay荸频，一般從這里可以看到對交互性影響最大的特性：調(diào)度延遲菱肖，如果調(diào)度延遲比較大，那么用戶就會感受到視頻或者音頻斷斷續(xù)續(xù)的旭从。

Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |

TOTAL: | 2.623 ms | 5 |

sudo perf sched script能看到更詳細的sched信息稳强，包括sched_wakeup/sched_switch等等。每一列的含義依次是：進程名/pid/CPU ID/時間戳遇绞。

| perf 7801 [002] 5398.722314: sched:sched_stat_sleep: comm=perf pid=7806 delay=110095391 [ns]
perf 7801 [002] 5398.722316: sched:sched_wakeup: comm=perf pid=7806 prio=120 target_cpu=004
swapper 0 [004] 5398.722328: sched:sched_stat_wait: comm=perf pid=7806 delay=0 [ns]
swapper 0 [004] 5398.722333: sched:sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7806 next_prio=120
perf 7801 [002] 5398.722363: sched:sched_stat_runtime: comm=perf pid=7801 runtime=1255788 [ns] vruntime=3027478102 [ns]
perf 7801 [002] 5398.722364: sched:sched_switch: prev_comm=perf prev_pid=7801 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
perf 7806 [004] 5398.722568: sched:sched_wakeup: comm=migration/4 pid=27 prio=0 target_cpu=004
perf 7806 [004] 5398.722571: sched:sched_stat_runtime: comm=perf pid=7806 runtime=254732 [ns] vruntime=1979611107 [ns]
perf 7806 [004] 5398.722575: sched:sched_switch: prev_comm=perf prev_pid=7806 prev_prio=120 prev_state=R+ ==> next_comm=migration/4 next_pid=27 next_prio=0
migration/4 27 [004] 5398.722582: sched:sched_stat_wait: comm=perf pid=7806 delay=13914 [ns]
migration/4 27 [004] 5398.722586: sched:sched_migrate_task: comm=perf pid=7806 prio=120 orig_cpu=4 dest_cpu=2
swapper 0 [002] 5398.722595: sched:sched_stat_wait: comm=perf pid=7806 delay=0 [ns]
swapper 0 [002] 5398.722596: sched:sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7806 next_prio=120
migration/4 27 [004] 5398.722611: sched:sched_switch: prev_comm=migration/4 prev_pid=27 prev_prio=0 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
ls 7806 [002] 5398.723421: sched:sched_stat_sleep: comm=kworker/u12:3 pid=7064 delay=1226675 [ns]
ls 7806 [002] 5398.723423: sched:sched_wakeup: comm=kworker/u12:3 pid=7064 prio=120 target_cpu=003
swapper 0 [003] 5398.723432: sched:sched_stat_wait: comm=kworker/u12:3 pid=7064 delay=0 [ns]
swapper 0 [003] 5398.723434: sched:sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/u12:3 next_pid=7064 next_prio=120
kworker/u12:3 7064 [003] 5398.723441: sched:sched_stat_sleep: comm=/usr/bin/termin pid=2511 delay=80833386 [ns]
kworker/u12:3 7064 [003] 5398.723447: sched:sched_wakeup: comm=/usr/bin/termin pid=2511 prio=120 target_cpu=004
kworker/u12:3 7064 [003] 5398.723449: sched:sched_stat_runtime: comm=kworker/u12:3 pid=7064 runtime=29315 [ns] vruntime=846439549943 [ns]
kworker/u12:3 7064 [003] 5398.723451: sched:sched_switch: prev_comm=kworker/u12:3 prev_pid=7064 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
swapper 0 [004] 5398.723462: sched:sched_stat_wait: comm=/usr/bin/termin pid=2511 delay=0 [ns]
swapper 0 [004] 5398.723466: sched:sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=/usr/bin/termin next_pid=2511 next_prio=120
ls 7806 [002] 5398.723503: sched:sched_migrate_task: comm=perf pid=7801 prio=120 orig_cpu=2 dest_cpu=3
ls 7806 [002] 5398.723505: sched:sched_stat_sleep: comm=perf pid=7801 delay=1142537 [ns]
ls 7806 [002] 5398.723506: sched:sched_wakeup: comm=perf pid=7801 prio=120 target_cpu=003
ls 7806 [002] 5398.723508: sched:sched_stat_runtime: comm=ls pid=7806 runtime=920005 [ns] vruntime=3028398107 [ns]
swapper 0 [003] 5398.723508: sched:sched_stat_wait: comm=perf pid=7801 delay=0 [ns]
swapper 0 [003] 5398.723508: sched:sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7801 next_prio=120
ls 7806 [002] 5398.723510: sched:sched_switch: prev_comm=ls prev_pid=7806 prev_prio=120 prev_state=x ==> next_comm=swapper/2 next_pid=0 next_prio=120
/usr/bin/termin 2511 [004] 5398.723605: sched:sched_stat_runtime: comm=/usr/bin/termin pid=2511 runtime=162720 [ns] vruntime=191386139371 [ns]
/usr/bin/termin 2511 [004] 5398.723611: sched:sched_switch: prev_comm=/usr/bin/termin prev_pid=2511 prev_prio=120 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120 |

sudo perf sched map的好處在于提供了一個的總的視圖键袱，將成百上千的調(diào)度事件進行總結(jié)，顯示了系統(tǒng)任務在 CPU 之間的分布摹闽，假如有不好的調(diào)度遷移，比如一個任務沒有被及時遷移到 idle 的 CPU 卻被遷移到其他忙碌的 CPU褐健，類似這種調(diào)度器的問題可以從 map 的報告中一眼看出來付鹿。

星號表示調(diào)度事件發(fā)生所在的 CPU。

點號表示該 CPU 正在 IDLE蚜迅。

| *A0 5398.722333 secs A0 => perf:7806
*. A0 5398.722365 secs . => swapper:0
. *B0 5398.722575 secs B0 => migration/4:27
*A0 B0 5398.722597 secs
A0 *. 5398.722611 secs
A0 *C0 . 5398.723434 secs C0 => kworker/u12:3:7064
A0 *. . 5398.723451 secs
A0 . *D0 5398.723467 secs D0 => /usr/bin/termin:2511
A0 *E0 D0 5398.723509 secs E0 => perf:7801
*. E0 D0 5398.723510 secs
. E0 *. 5398.723612 secs |

perf sched replay 這個工具更是專門為調(diào)度器開發(fā)人員所設計舵匾，它試圖重放 perf.data 文件中所記錄的調(diào)度場景。很多情況下谁不，一般用戶假如發(fā)現(xiàn)調(diào)度器的奇怪行為坐梯，他們也無法準確說明發(fā)生該情形的場景，或者一些測試場景不容易再次重現(xiàn)刹帕，或者僅僅是出于“偷懶”的目的吵血，使用 perf replay，perf 將模擬 perf.data 中的場景偷溺，無需開發(fā)人員花費很多的時間去重現(xiàn)過去蹋辅，這尤其利于調(diào)試過程，因為需要一而再挫掏，再而三地重復新的修改是否能改善原始的調(diào)度場景所發(fā)現(xiàn)的問題侦另。

| run measurement overhead: 166 nsecs
sleep measurement overhead: 52177 nsecs
the run test took 999975 nsecs
the sleep test took 1064623 nsecs
nr_run_events: 11
nr_sleep_events: 581
nr_wakeup_events: 5
task 0 ( swapper: 0), nr_events: 11
task 1 ( swapper: 1), nr_events: 1
task 2 ( swapper: 2), nr_events: 1
task 3 ( kthreadd: 3), nr_events: 1
...
task 563 ( kthreadd: 7509), nr_events: 1
task 564 ( bash: 7751), nr_events: 1
task 565 ( man: 7762), nr_events: 1
task 566 ( kthreadd: 7789), nr_events: 1
task 567 ( bash: 7800), nr_events: 1
task 568 ( sudo: 7801), nr_events: 4
task 569 ( perf: 7806), nr_events: 8

1 : 25.887, ravg: 25.89, cpu: 1919.68 / 1919.68

2 : 27.994, ravg: 26.10, cpu: 2887.76 / 2016.49

3 : 26.403, ravg: 26.13, cpu: 2976.09 / 2112.45

4 : 29.400, ravg: 26.46, cpu: 1015.01 / 2002.70

5 : 26.750, ravg: 26.49, cpu: 2942.80 / 2096.71

6 : 27.647, ravg: 26.60, cpu: 3087.37 / 2195.78

7 : 31.405, ravg: 27.08, cpu: 2762.43 / 2252.44

8 : 23.770, ravg: 26.75, cpu: 2172.55 / 2244.45

9 : 26.952, ravg: 26.77, cpu: 2794.93 / 2299.50

10 : 30.904, ravg: 27.18, cpu: 973.26 / 2166.88 |

3.11 perf probe

Need to find vmlinux XXXXXXXXXXXXXXXXXX

可以自定義探測點。

Define new dynamic tracepoints.

使用例子

(1) Display which lines in schedule() can be probed

perf probe --line schedule

前面有行號的可以探測尉共，沒有行號的就不行了褒傅。

(2) Add a probe on schedule() function 12th line.

perf probe -a schedule:12

在schedule函數(shù)的12處增加一個探測點。

3.14 perf trace

perf trace類似于strace袄友，但增加了其他系統(tǒng)事件的分析殿托，比如pagefaults、task lifetime事件杠河、scheduling事件等碌尔。

下面的命令可以查看系統(tǒng)中已經(jīng)安裝的腳本：

<pre class="displaycode" style="margin: 0px 0px 0px 22px; white-space: pre-wrap; overflow-wrap: break-word; font-size: 1em;"> # perf trace -l
List of available trace scripts:
syscall-counts [comm] system-wide syscall counts
syscall-counts-by-pid [comm] system-wide syscall counts, by pid
failed-syscalls-by-pid [comm] system-wide failed syscalls, by pid
</pre>

比如 failed-syscalls 腳本浇辜，執(zhí)行的效果如下：

<pre class="displaycode" style="margin: 0px 0px 0px 22px; white-space: pre-wrap; overflow-wrap: break-word; font-size: 1em;"> # perf trace record failed-syscalls
^C[ perf record: Woken up 11 times to write data ]
[ perf record: Captured and wrote 1.939 MB perf.data (~84709 samples) ]

perf trace report failed-syscalls
perf trace started with Perl script \
/root/libexec/perf-core/scripts/perl/failed-syscalls.pl

failed syscalls, by comm: 

comm                    # errors 
--------------------  ---------- 
firefox                     1721 
claws-mail                   149 
konsole                       99 
X                             77 
emacs                         56 
[...] 

failed syscalls, by syscall: 

syscall                           # errors 
------------------------------  ---------- 
sys_read                              2042 
sys_futex                              130 
sys_mmap_pgoff                          71 
sys_access                              33 
sys_stat64                               5 
sys_inotify_add_watch                    4 
[...]</pre>

該報表分別按進程和按系統(tǒng)調(diào)用顯示失敗的次數(shù)。非常簡單明了唾戚，而如果通過普通的 perf record 加 perf report 命令柳洋，則需要自己手工或者編寫腳本來統(tǒng)計這些數(shù)字。

4. perf擴展應用

4.1 Flame Graph

FlameGraph是

1.抓取perf信息并轉(zhuǎn)換

perf record -F 99 -a -g -- sleep 60
perf script > out.perf

./stackcollapse-perf.pl out.perf > out.folded

./flamegraph.pl out.kern_folded > kernel.svg

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末叹坦，一起剝皮案震驚了整個濱河市熊镣，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌募书，老刑警劉巖绪囱，帶你破解...
沈念sama閱讀 217,907評論 6贊 506
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異莹捡，居然都是意外死亡鬼吵，警方通過查閱死者的電腦和手機，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,987評論 3贊 395
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門篮赢，熙熙樓的掌柜王于貴愁眉苦臉地迎上來齿椅，“玉大人，你說我怎么就攤上這事启泣』两牛” “怎么了？”我有些...
開封第一講書人閱讀 164,298評論 0贊 354
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵寥茫，是天一觀的道長遣蚀。經(jīng)常有香客問我，道長纱耻，這世上最難降的妖魔是什么芭梯？我笑而不...
開封第一講書人閱讀 58,586評論 1贊 293
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮膝迎，結(jié)果婚禮上粥帚，老公的妹妹穿的比我還像新娘。我一直安慰自己限次，他們只是感情好芒涡，可當我...
茶點故事閱讀 67,633評論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著卖漫，像睡著了一般费尽。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上羊始，一...
開封第一講書人閱讀 51,488評論 1贊 302
城市分裂傳說
那天旱幼，我揣著相機與錄音，去河邊找鬼突委。笑死柏卤，一個胖子當著我的面吹牛冬三，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播缘缚，決...
沈念sama閱讀 40,275評論 3贊 418
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼勾笆，長吁一口氣：“原來是場噩夢啊……” “哼！你這毒婦竟也來了桥滨？” 一聲冷哼從身側(cè)響起窝爪，我...
開封第一講書人閱讀 39,176評論 0贊 276
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤，失蹤者是張志新（化名）和其女友劉穎齐媒，沒想到半個月后蒲每，有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,619評論 1贊 314
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡喻括，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 37,819評論 3贊 336
?白月光啟示錄
正文我和宋清朗相戀三年邀杏，在試婚紗的時候發(fā)現(xiàn)自己被綠了。大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片双妨。...
茶點故事閱讀 39,932評論 1贊 348
活死人
序言：一個原本活蹦亂跳的男人離奇死亡淮阐，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出刁品，到底是詐尸還是另有隱情，我是刑警寧澤浩姥，帶...
沈念sama閱讀 35,655評論 5贊 346
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布挑随，位于F島的核電站，受9級特大地震影響勒叠，放射性物質(zhì)發(fā)生泄漏兜挨。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點故事閱讀 41,265評論 3贊 329
男人毒藥：我在死后第九天來索命
文/蒙蒙一眯分、第九天我趴在偏房一處隱蔽的房頂上張望拌汇。院中可真熱鬧，春花似錦弊决、人聲如沸噪舀。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,871評論 0贊 22
一樁弒父案飘诗，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽与倡。三九已至，卻和暖如春昆稿，著一層夾襖步出監(jiān)牢的瞬間纺座，已是汗流浹背。一陣腳步聲響...
開封第一講書人閱讀 32,994評論 1贊 269
情欲美人皮
我被黑心中介騙來泰國打工溉潭，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留净响，地道東北人少欺。一個月前我還...
沈念sama閱讀 48,095評論 3贊 370
代替公主和親
正文我出身青樓，卻偏偏與公主長得像馋贤，于是被迫代替她去往敵國和親赞别。傳聞我的和親對象是個殘疾皇子，可洞房花燭夜當晚...
茶點故事閱讀 44,884評論 2贊 354

Linux性能分析工具-perf ????

1. 背景知識

1.1 tracepoints

1.2 硬件特性之cache

2. 主要關(guān)注點

3. perf的使用

3.0 perf引入的overhead

3.1 perf list

3.2 perf top

3.3 perf stat

其他常用參數(shù)

示例

3.4 perf bench

3.4.1 perf bench sched all

3.4.1.1 sched messaging評估進程調(diào)度和核間通信

Running sched/messaging benchmark...

20 sender and receiver processes per group

10 groups == 400 processes run

Running sched/pipe benchmark...

Executed 1000000 pipe operations between two processes

Running 'sched/messaging' benchmark:

20 sender and receiver processes per group

10 groups == 400 processes run

Running 'sched/messaging' benchmark:

20 sender and receiver processes per group

10 groups == 400 processes run

3.4.1.2 sched pipe評估pipe性能

Running 'sched/pipe' benchmark:

Executed 1000000 pipe operations between two processes

3.4.2 perf bench mem all

Running mem/memcpy benchmark...

function 'default' (Default memcpy() provided by glibc)

Copying 1MB bytes ...

3.4.3 perf bench futex

Running futex/hash benchmark...

Running futex/wake benchmark...

3.4 perf record

常用選項

3.5 perf report

常用參數(shù)：

3.6 perf timechart

3.6.1 結(jié)合perf timechart和perf report分析函數(shù)占比

3.7 perf script

3.8 perf lock

3.8.1 perf lock內(nèi)核配置

3.8.2 perf lock使用

3.9 perf kmem

3.9.1 perf kmem介紹

3.9.2 perf kmem使用

| SUMMARY (SLAB allocator)

Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag

Alloc Ptr | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag

SUMMARY (SLAB allocator)

3.10 perf sched

Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |

TOTAL: | 2.623 ms | 5 |

1 : 25.887, ravg: 25.89, cpu: 1919.68 / 1919.68

2 : 27.994, ravg: 26.10, cpu: 2887.76 / 2016.49

3 : 26.403, ravg: 26.13, cpu: 2976.09 / 2112.45

4 : 29.400, ravg: 26.46, cpu: 1015.01 / 2002.70

5 : 26.750, ravg: 26.49, cpu: 2942.80 / 2096.71

6 : 27.647, ravg: 26.60, cpu: 3087.37 / 2195.78

7 : 31.405, ravg: 27.08, cpu: 2762.43 / 2252.44

8 : 23.770, ravg: 26.75, cpu: 2172.55 / 2244.45

9 : 26.952, ravg: 26.77, cpu: 2794.93 / 2299.50

10 : 30.904, ravg: 27.18, cpu: 973.26 / 2166.88 |

3.11 perf probe

perf probe --line schedule

perf probe -a schedule:12

3.14 perf trace

4. perf擴展應用

4.1 Flame Graph

推薦閱讀更多精彩內(nèi)容