Doris提供了一個(gè)圖形化的命令幫助用戶方便分析一個(gè)具體的查詢或者導(dǎo)入操作扶镀,在使用過(guò)程中的性能問(wèn)題黄绩,本文主要介紹如何使用改功能
1. 查詢計(jì)劃樹(shù)
SQL 是一個(gè)描述性語(yǔ)言,用戶通過(guò)一個(gè) SQL 來(lái)描述想獲取的數(shù)據(jù)。而一個(gè) SQL 的具體執(zhí)行方式依賴(lài)于數(shù)據(jù)庫(kù)的實(shí)現(xiàn)朴则。而查詢規(guī)劃器就是用來(lái)決定數(shù)據(jù)庫(kù)如何具體執(zhí)行一個(gè) SQL 的。 比如用戶指定了一個(gè) Join 算子钓简,則查詢規(guī)劃器需要決定具體的 Join 算法乌妒,比如是 Hash Join,還是 Merge Sort Join外邓;是使用 Shuffle 還是 Broadcast撤蚊;Join 順序是否需要調(diào)整以避免笛卡爾積;以及確定最終的在哪些節(jié)點(diǎn)執(zhí)行等等损话。
Doris 的查詢規(guī)劃過(guò)程是先將一個(gè) SQL 語(yǔ)句轉(zhuǎn)換成一個(gè)單機(jī)執(zhí)行計(jì)劃樹(shù)侦啸。
之后,查詢規(guī)劃器會(huì)根據(jù)具體的算子執(zhí)行方式、數(shù)據(jù)的具體分布匹中,將單機(jī)查詢計(jì)劃轉(zhuǎn)換為分布式查詢計(jì)劃夏漱。分布式查詢計(jì)劃是由多個(gè) Fragment 組成的,每個(gè) Fragment 負(fù)責(zé)查詢計(jì)劃的一部分顶捷,各個(gè) Fragment 直接會(huì)通過(guò) ExchangeNode 算子進(jìn)行數(shù)據(jù)的傳輸挂绰。
如上圖,我們將單機(jī)計(jì)劃分成了兩個(gè) Fragment:F1 和 F2服赎。兩個(gè) Fragment 之間通過(guò)一個(gè) ExchangeNode 節(jié)點(diǎn)傳輸數(shù)據(jù)葵蒂。
而一個(gè) Fragment 會(huì)進(jìn)一步的劃分為多個(gè) Instance。Instance 是最終具體的執(zhí)行實(shí)例重虑。劃分成多個(gè) Instance 有助于充分利用機(jī)器資源践付,提升一個(gè) Fragment 的執(zhí)行并發(fā)度。
查看查詢計(jì)劃
可以通過(guò)以下兩種命令查看一個(gè) SQL 的執(zhí)行計(jì)劃缺厉。
EXPLAIN GRAPH select ...;
EXPLAIN select ...;
其中第一個(gè)命令以圖形化的方式展示一個(gè)查詢計(jì)劃永高,這個(gè)命令可以比較直觀的展示查詢計(jì)劃的樹(shù)形結(jié)構(gòu),以及 Fragment 的劃分情況:
mysql> desc graph SELECT SUM(lo_revenue), d_year, p_brand
-> FROM lineorder, date, part, supplier
-> WHERE lo_orderdate = d_datekey
-> AND lo_partkey = p_partkey
-> AND lo_suppkey = s_suppkey
-> AND p_brand BETWEEN 'MFGR#2221'
-> AND 'MFGR#2228'
-> AND s_region = 'ASIA'
-> GROUP BY d_year, p_brand
-> ORDER BY d_year, p_brand;
+------------------------------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------------------------------+
| ┌────────────────┐ |
| │[12: ResultSink]│ |
| │[Fragment: 4] │ |
| │RESULT SINK │ |
| └────────────────┘ |
| │ |
| │ |
| ┌──────────────────────┐ |
| │[12: MERGING-EXCHANGE]│ |
| │[Fragment: 4] │ |
| └──────────────────────┘ |
| │ |
| │ |
| ┌────────────────────┐ |
| │[12: DataStreamSink]│ |
| │[Fragment: 0] │ |
| │STREAM DATA SINK │ |
| │ EXCHANGE ID: 12 │ |
| │ UNPARTITIONED │ |
| └────────────────────┘ |
| │ |
| │ |
| ┌─────────────┐ |
| │[8: TOP-N] │ |
| │[Fragment: 0]│ |
| └─────────────┘ |
| │ |
| │ |
| ┌────────────────────────────────┐ |
| │[7: AGGREGATE (update finalize)]│ |
| │[Fragment: 0] │ |
| └────────────────────────────────┘ |
| │ |
| │ |
| ┌───────────────────────────────┐ |
| │[6: HASH JOIN] │ |
| │[Fragment: 0] │ |
| │join op: INNER JOIN (BROADCAST)│ |
| └───────────────────────────────┘ |
| ┌───────────┴────────────────────────────────┐ |
| │ │ |
| ┌───────────────────────────────┐ ┌──────────────┐ |
| │[4: HASH JOIN] │ │[11: EXCHANGE]│ |
| │[Fragment: 0] │ │[Fragment: 0] │ |
| │join op: INNER JOIN (BROADCAST)│ └──────────────┘ |
| └───────────────────────────────┘ │ |
| ┌──────────┴─────────────────────┐ │ |
| │ │ ┌────────────────────┐ |
| ┌───────────────────────────────┐ ┌──────────────┐ │[11: DataStreamSink]│ |
| │[2: HASH JOIN] │ │[10: EXCHANGE]│ │[Fragment: 3] │ |
| │[Fragment: 0] │ │[Fragment: 0] │ │STREAM DATA SINK │ |
| │join op: INNER JOIN (BROADCAST)│ └──────────────┘ │ EXCHANGE ID: 11 │ |
| └───────────────────────────────┘ │ │ UNPARTITIONED │ |
| ┌──────────┴─────────┐ │ └────────────────────┘ |
| │ │ ┌────────────────────┐ ┌┘ |
| ┌─────────────────┐ ┌─────────────┐ │[10: DataStreamSink]│ │ |
| │[0: OlapScanNode]│ │[9: EXCHANGE]│ │[Fragment: 2] │ ┌─────────────────┐ |
| │[Fragment: 0] │ │[Fragment: 0]│ │STREAM DATA SINK │ │[5: OlapScanNode]│ |
| │TABLE: lineorder │ └─────────────┘ │ EXCHANGE ID: 10 │ │[Fragment: 3] │ |
| └─────────────────┘ │ │ UNPARTITIONED │ │TABLE: supplier │ |
| │ └────────────────────┘ └─────────────────┘ |
| ┌───────────────────┐ ┌┘ |
| │[9: DataStreamSink]│ │ |
| │[Fragment: 1] │ ┌─────────────────┐ |
| │STREAM DATA SINK │ │[3: OlapScanNode]│ |
| │ EXCHANGE ID: 09 │ │[Fragment: 2] │ |
| │ UNPARTITIONED │ │TABLE: part │ |
| └───────────────────┘ └─────────────────┘ |
| │ |
| │ |
| ┌─────────────────┐ |
| │[1: OlapScanNode]│ |
| │[Fragment: 1] │ |
| │TABLE: date │ |
| └─────────────────┘ |
+------------------------------------------------------------------------------------------------------------+
75 rows in set (0.04 sec)
從圖中可以看出提针,查詢計(jì)劃樹(shù)被分為了5個(gè) Fragment:0命爬、1碘菜、2垄分、3、4概作。如 OlapScanNode
節(jié)點(diǎn)上的 [Fragment: 0]
表示這個(gè)節(jié)點(diǎn)屬于 Fragment 0嗜价。每個(gè)Fragment之間都通過(guò) DataStreamSink 和 ExchangeNode 進(jìn)行數(shù)據(jù)傳輸艇抠。
圖形命令僅展示簡(jiǎn)化后的節(jié)點(diǎn)信息,如果需要查看更具體的節(jié)點(diǎn)信息久锥,如下推到節(jié)點(diǎn)上的過(guò)濾條件等家淤,則需要通過(guò)第二個(gè)命令查看更詳細(xì)的文字版信息:
mysql> desc SELECT SUM(lo_revenue), d_year, p_brand FROM lineorder, date, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228' AND s_region = 'ASIA' GROUP BY d_year, p_brand ORDER BY d_year, p_brand;
+--------------------------------------------------------------------------------------------------------------+
| Explain String |
+--------------------------------------------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:<slot 15> <slot 12> sum(`lo_revenue`) | <slot 13> <slot 10> `d_year` | <slot 14> <slot 11> `p_brand` |
| PARTITION: UNPARTITIONED |
| |
| RESULT SINK |
| |
| 12:MERGING-EXCHANGE |
| limit: 65535 |
| |
| PLAN FRAGMENT 1 |
| OUTPUT EXPRS: |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 12 |
| UNPARTITIONED |
| |
| 8:TOP-N |
| | order by: <slot 13> <slot 10> `d_year` ASC, <slot 14> <slot 11> `p_brand` ASC |
| | offset: 0 |
| | limit: 65535 |
| | |
| 7:AGGREGATE (update finalize) |
| | output: sum(`lo_revenue`) |
| | group by: `d_year`, `p_brand` |
| | cardinality=-1 |
| | |
| 6:HASH JOIN |
| | join op: INNER JOIN (BROADCAST) |
| | hash predicates: |
| | colocate: false, reason: Tables are not in the same group |
| | equal join conjunct: `lo_suppkey` = `s_suppkey` |
| | runtime filters: RF000[in_or_bloom] <- `s_suppkey` |
| | cardinality=0 |
| | |
| |----11:EXCHANGE |
| | |
| 4:HASH JOIN |
| | join op: INNER JOIN (BROADCAST) |
| | hash predicates: |
| | colocate: false, reason: Tables are not in the same group |
| | equal join conjunct: `lo_partkey` = `p_partkey` |
| | runtime filters: RF001[in_or_bloom] <- `p_partkey` |
| | cardinality=0 |
| | |
| |----10:EXCHANGE |
| | |
| 2:HASH JOIN |
| | join op: INNER JOIN (BROADCAST) |
| | hash predicates: |
| | colocate: false, reason: Tables are not in the same group |
| | equal join conjunct: `lo_orderdate` = `d_datekey` |
| | runtime filters: RF002[in_or_bloom] <- `d_datekey` |
| | cardinality=0 |
| | |
| |----9:EXCHANGE |
| | |
| 0:OlapScanNode |
| TABLE: lineorder |
| PREAGGREGATION: OFF. Reason: conjunct on `lo_orderdate` which is StorageEngine value column |
| runtime filters: RF000[in_or_bloom] -> `lo_suppkey`, RF001[in_or_bloom] -> `lo_partkey`, RF002[in_or_bloom] -> `lo_orderdate` |
| partitions=0/7 |
| rollup: null |
| tabletRatio=0/0 |
| tabletList= |
| cardinality=0 |
| avgRowSize=20.0 |
| numNodes=1 |
| |
| PLAN FRAGMENT 2 |
| OUTPUT EXPRS: |
| PARTITION: HASH_PARTITIONED: `default_cluster:demo`.`supplier`.`s_suppkey` |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 11 |
| UNPARTITIONED |
| |
| 5:OlapScanNode |
| TABLE: supplier |
| PREAGGREGATION: OFF. Reason: null |
| PREDICATES: `s_region` = 'ASIA' |
| partitions=0/1 |
| rollup: null |
| tabletRatio=0/0 |
| tabletList= |
| cardinality=0 |
| avgRowSize=20.0 |
| numNodes=1 |
| |
| PLAN FRAGMENT 3 |
| OUTPUT EXPRS: |
| PARTITION: HASH_PARTITIONED: `default_cluster:demo`.`part`.`p_partkey` |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 10 |
| UNPARTITIONED |
| |
| 3:OlapScanNode |
| TABLE: part |
| PREAGGREGATION: OFF. Reason: null |
| PREDICATES: `p_brand` >= 'MFGR#2221', `p_brand` <= 'MFGR#2228' |
| partitions=0/1 |
| rollup: null |
| tabletRatio=0/0 |
| tabletList= |
| cardinality=0 |
| avgRowSize=20.0 |
| numNodes=1 |
| |
| PLAN FRAGMENT 4 |
| OUTPUT EXPRS: |
| PARTITION: HASH_PARTITIONED: `default_cluster:demo`.`date`.`d_datekey` |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 09 |
| UNPARTITIONED |
| |
| 1:OlapScanNode |
| TABLE: date |
| PREAGGREGATION: OFF. Reason: null |
| partitions=0/1 |
| rollup: null |
| tabletRatio=0/0 |
| tabletList= |
| cardinality=0 |
| avgRowSize=8.0 |
| numNodes=1 |
+--------------------------------------------------------------------------------------------------------------+
127 rows in set (0.01 sec)
2. 查看查詢 Profile
用戶可以通過(guò)以下命令打開(kāi)會(huì)話變量
set enable_profile = true
然后執(zhí)行查詢,則 Doris 會(huì)產(chǎn)生該查詢的一個(gè) Profile奴拦。Profile 包含了一個(gè)查詢各個(gè)節(jié)點(diǎn)的具體執(zhí)行情況媒鼓,有助于我們分析查詢瓶頸届吁。
執(zhí)行完查詢后错妖,我們可以通過(guò)如下命令先獲取 Profile 列表:
mysql> show query profile "/"\G
*************************** 1. row ***************************
QueryId: 8676cf2c28ac4081-95ead4b1b734b0b3
User: root
DefaultDb: default_cluster:demo
SQL: SELECT SUM(lo_extendedprice*lo_discount) AS REVENUE FROM lineorder, date WHERE lo_orderdate = d_datekey AND d_weeknuminyear= 6 AND d_year = 1994 AND lo_discount BETWEEN 5 AND 7 AND lo_quantity BETWEEN 26 AND 35
QueryType: Query
StartTime: 2022-05-13 11:47:57
EndTime: 2022-05-13 11:47:57
TotalTime: 19ms
QueryState: EOF
1 row in set (0.01 sec)
這個(gè)命令會(huì)列出當(dāng)前保存的所有 Profile疚沐。每行對(duì)應(yīng)一個(gè)查詢。我們可以選擇我們想看的 Profile 對(duì)應(yīng)的 QueryId亮蛔,查看具體情況。
查看一個(gè)Profile分為3個(gè)步驟:
2.1 查看整體執(zhí)行計(jì)劃樹(shù)
這一步主要用于從整體分析執(zhí)行計(jì)劃,并查看每個(gè)Fragment的執(zhí)行耗時(shí)动遭。
mysql> show query profile "/26da5b3dbd5e42d8-8b5c39768b5b78f4"\G
*************************** 1. row ***************************
Fragments:
┌────────────────────────┐
│[-1: DataBufferSender] │
│Fragment: 0 │
│MaxActiveTime: 355.973us│
└────────────────────────┘
┌┘
│
┌─────────────────────┐
│[3: AGGREGATION_NODE]│
│Fragment: 0 │
└─────────────────────┘
│
│
┌───────────────────┐
│[2: HASH_JOIN_NODE]│
│Fragment: 0 │
└───────────────────┘
┌───────────┴────────────────────────────┐
│ │
┌───────────────────┐ ┌──────────────────┐
│[0: OLAP_SCAN_NODE]│ │[4: EXCHANGE_NODE]│
│Fragment: 0 │ │Fragment: 0 │
└───────────────────┘ └──────────────────┘
┌─────────┴─────────────────┐ │
│ │ │
┌─────────────────────────────────┐ ┌─────────────┐ ┌──────────────────────┐
│[RuntimeFilter:in_or_bloomfilter]│ │[OlapScanner]│ │[4: DataStreamSender] │
│Fragment: 0 │ │Fragment: 0 │ │Fragment: 1 │
└─────────────────────────────────┘ └─────────────┘ │MaxActiveTime: 4.219ms│
│ └──────────────────────┘
│ ┌┘
┌─────────────────┐ │
│[SegmentIterator]│ ┌───────────────────┐
│Fragment: 0 │ │[1: OLAP_SCAN_NODE]│
└─────────────────┘ │Fragment: 1 │
└───────────────────┘
│
┌─────────────┐
│[OlapScanner]│
│Fragment: 1 │
└─────────────┘
│
│
┌─────────────────┐
│[SegmentIterator]│
│Fragment: 1 │
└─────────────────┘
1 row in set (0.00 sec)
如上圖厘惦,每個(gè)節(jié)點(diǎn)都標(biāo)注了自己所屬的 Fragment,并且在每個(gè) Fragment 的 Sender節(jié)點(diǎn)宵蕉,標(biāo)注了該 Fragment 的執(zhí)行耗時(shí)(MaxActiveTime)。這個(gè)耗時(shí)羡玛,是Fragment下所有 Instance 執(zhí)行耗時(shí)中最長(zhǎng)的一個(gè)宗苍。這個(gè)有助于我們從整體角度發(fā)現(xiàn)最耗時(shí)的 Fragment稼稿。
2.2 查看具體 Fragment 下的 Instance 列表
比如我們發(fā)現(xiàn) Fragment 1 耗時(shí)最長(zhǎng),則可以繼續(xù)查看 Fragment 1 的 Instance 列表:
mysql> show query profile "/8676cf2c28ac4081-95ead4b1b734b0b3/1" ;
+-----------------------------------+-------------------+------------+
| Instances | Host | ActiveTime |
+-----------------------------------+-------------------+------------+
| 8676cf2c28ac4081-95ead4b1b734b0b5 | 172.28.7.230:9060 | 4.219ms |
+-----------------------------------+-------------------+------------+
1 row in set (0.00 sec)
這里展示了 Fragment 1 上所有的 1 個(gè) Instance 所在的執(zhí)行節(jié)點(diǎn)和耗時(shí)讳窟。
2.3 查看具體 Instance
我們可以繼續(xù)查看某一個(gè)具體的 Instance 上各個(gè)算子的詳細(xì) Profile:
mysql> show query profile "/8676cf2c28ac4081-95ead4b1b734b0b3/1/8676cf2c28ac4081-95ead4b1b734b0b5" ;
┌───────────────────────────────────────┐
│[4: DataStreamSender] │
│(Active: 16.473us, non-child: 0.09) │
│ - Counters: │
│ - BytesSent: 0.00 │
│ - IgnoreRows: 0 │
│ - LocalBytesSent: 0.00 │
│ - OverallThroughput: 0.0 /sec │
│ - PeakMemoryUsage: 10.66 KB │
│ - SerializeBatchTime: 0ns │
│ - UncompressedRowBatchSize: 0.00 │
└───────────────────────────────────────┘
│
│
┌─────────────────────────────────────┐
│[1: OLAP_SCAN_NODE] │
│(Active: 23.934us, non-child: 0.12) │
│ - Counters: │
│ - BatchQueueWaitTime: 225ns │
│ - BytesRead: 0.00 │
│ - NumDiskAccess: 0 │
│ - NumScanners: 0 │
│ - PeakMemoryUsage: 0.00 │
│ - RowsRead: 0 │
│ - RowsReturned: 0 │
│ - RowsReturnedRate: 0 │
│ - ScannerBatchWaitTime: 0ns │
│ - ScannerWorkerWaitTime: 0ns │
│ - TabletCount : 0 │
│ - TotalReadThroughput: 0.0 /sec│
└─────────────────────────────────────┘
┌┘
│
┌─────────────────────────────────┐
│[OlapScanner] │
│(Active: 0ns, non-child: 0.00) │
│ - Counters: │
│ - BlockConvertTime: 0ns │
│ - BlockFetchTime: 0ns │
│ - ReaderInitTime: 0ns │
│ - RowsDelFiltered: 0 │
│ - RowsPushedCondFiltered: 0│
│ - ScanCpuTime: 0ns │
│ - ScanTime: 0ns │
│ - ShowHintsTime_V1: 0ns │
└─────────────────────────────────┘
└┐
│
┌────────────────────────────────────┐
│[SegmentIterator] │
│(Active: 0ns, non-child: 0.00) │
│ - Counters: │
│ - BitmapIndexFilterTimer: 0ns │
│ - BlockLoadTime: 0ns │
│ - BlockSeekCount: 0 │
│ - BlockSeekTime: 0ns │
│ - BlocksLoad: 0 │
│ - CachedPagesNum: 0 │
│ - CompressedBytesRead: 0.00 │
│ - DecompressorTimer: 0ns │
│ - IOTimer: 0ns │
│ - IndexLoadTime_V1: 0ns │
│ - NumSegmentFiltered: 0 │
│ - NumSegmentTotal: 0 │
│ - RawRowsRead: 0 │
│ - RowsBitmapIndexFiltered: 0 │
│ - RowsBloomFilterFiltered: 0 │
│ - RowsConditionsFiltered: 0 │
│ - RowsKeyRangeFiltered: 0 │
│ - RowsStatsFiltered: 0 │
│ - RowsVectorPredFiltered: 0 │
│ - TotalPagesNum: 0 │
│ - UncompressedBytesRead: 0.00 │
│ - VectorPredEvalTime: 0ns │
└────────────────────────────────────┘
上圖展示了 Fragment 1 中渺杉,Instance 8676cf2c28ac4081-95ead4b1b734b0b5 的各個(gè)算子的具體 Profile。
通過(guò)以上3個(gè)步驟挪钓,我們可以逐步排查一個(gè)SQL的性能瓶頸是越。