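Is Spark SQL's optimizer cost-based?
It's an interesting question.
About two years ago, when I was first getting ready to do some work with spark-sql, I skimmed the handful of Spark papers, on the theory that you should sharpen your tools before starting the job.
I'm no great scholar and much of the material went over my head, but I do know a little about cost-based optimization (CBO), so my first instinct was to find out how spark-sql computes cost. In particular, Spark handles data geography completely differently from Teradata, the system I know, and I wanted to see how Spark goes about it.
After a long time with the paper, I found that the CBO part was written extremely vaguely, and several readings got me nowhere. If I couldn't follow the paper, I could at least read the code, so I dug through the source on GitHub for a while and found this: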
/**
* Abstract class for transforming [[LogicalPlan]]s into physical plans.
* Child classes are responsible for specifying a list of [[GenericStrategy]] objects that
* each of which can return a list of possible physical plan options.
* If a given strategy is unable to plan all
* of the remaining operators in the tree, it can call [[planLater]], which returns a placeholder
* object that will be filled in using other available strategies.
*
* TODO: RIGHT NOW ONLY ONE PLAN IS RETURNED EVER...
*       PLAN SPACE EXPLORATION WILL BE IMPLEMENTED LATER.
*
* @tparam PhysicalPlan The type of physical plan produced by this [[QueryPlanner]]
*/
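That's right: until version 2.2, released just a couple of days ago, Spark had not implemented CBO at all. Every optimization was rule-based, which is nothing like what the Spark SQL paper describes.
Anyone who has only read the paper and never looked at the code cannot know the correct answer to this question. So when I now come across blog posts about spark-sql, it is easy to tell how much the author really knows.
Next time I meet someone who claims to have mastered Spark, I plan to ask them this question :-)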
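To make the distinction concrete, here is a minimal toy sketch in Scala. It is not Spark's code: `Candidate`, `Strategy`, `PlannerSketch`, and the cost numbers are all hypothetical, invented only for illustration. `firstPlan` mimics the behaviour the TODO describes (take whichever plan comes first), while `cheapestPlan` shows roughly what a cost-based planner would do instead (compare estimated costs across every candidate).

```scala
// Toy model only. None of these types exist in Spark; the names and
// cost figures are assumptions made up for this illustration.
final case class Candidate(name: String, estimatedCost: Double)

object PlannerSketch {
  // A strategy maps a (here: textual) logical plan to candidate physical plans.
  type Strategy = String => Seq[Candidate]

  // "Only one plan is returned": take the first candidate that the
  // strategies produce, in the order the strategies are listed.
  def firstPlan(strategies: Seq[Strategy], logical: String): Option[Candidate] =
    strategies.flatMap(s => s(logical)).headOption

  // Cost-based choice: collect every candidate and keep the one with the
  // lowest estimated cost.
  def cheapestPlan(strategies: Seq[Strategy], logical: String): Option[Candidate] = {
    val all = strategies.flatMap(s => s(logical))
    if (all.isEmpty) None else Some(all.minBy(_.estimatedCost))
  }

  // Two hypothetical strategies with made-up costs.
  val sortMergeJoin: Strategy = _ => Seq(Candidate("SortMergeJoin", 120.0))
  val broadcastJoin: Strategy = _ => Seq(Candidate("BroadcastHashJoin", 35.0))
  val strategies: Seq[Strategy] = Seq(sortMergeJoin, broadcastJoin)
}
```

With the strategies in that order, `PlannerSketch.firstPlan(PlannerSketch.strategies, "a JOIN b")` picks the sort-merge candidate simply because it is listed first, whereas `cheapestPlan` picks the broadcast candidate because its estimated cost is lower. Nothing more is intended: a real optimizer would derive the costs from table statistics rather than from hard-coded numbers.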