Starting with Spark 1.5, Dynamic Allocation is available for Mesos coarse-grained mode and standalone mode.
It improves cluster resource utilization by removing executors that have gone idle.
1. Dynamic Resource Allocation
For standalone mode and Mesos coarse-grained mode, Spark manages executors dynamically; concretely, an executor that stays idle for a period of time is removed.
Dynamically requesting executors
If new tasks are pending and have been waiting longer than spark.dynamicAllocation.schedulerBacklogTimeout (default 1s), Spark starts requesting executors in rounds of 1, 2, 4, 8, ... (as far as executors are available); see the sketch below. The interval between rounds is controlled by spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (default: same as schedulerBacklogTimeout).
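As a rough sketch of this doubling policy (an illustration only, not Spark's actual implementation; `rampUp` and `needed` are made-up names):

```scala
// Illustration only, not Spark's source. Each round requests double the
// previous batch, capped by how many executors the pending tasks still need.
// The first round fires after schedulerBacklogTimeout; subsequent rounds are
// spaced by sustainedSchedulerBacklogTimeout.
def rampUp(needed: Int): Seq[Int] = {
  var remaining = needed
  var batch = 1
  val rounds = Seq.newBuilder[Int]
  while (remaining > 0) {
    val granted = math.min(batch, remaining) // never overshoot the backlog
    rounds += granted
    remaining -= granted
    batch *= 2 // exponential growth: 1, 2, 4, 8, ...
  }
  rounds.result()
}

println(rampUp(10)) // List(1, 2, 4, 3)
```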
Dynamically removing executors
An executor whose idle time exceeds spark.dynamicAllocation.executorIdleTimeout (default 60s) is removed, unless it holds cached data.
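You can see the cached-data exemption from a running spark-shell: executors holding blocks of the RDD below fall under spark.dynamicAllocation.cachedExecutorIdleTimeout (infinite by default) rather than the 60s idle timeout:

```scala
// Executors that hold blocks of this cached RDD are exempt from
// executorIdleTimeout; cachedExecutorIdleTimeout governs them instead.
val cached = sc.parallelize(1 to 1000000).cache()
cached.count() // materializes the cache on the executors
```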
2. Configuration
Add the following to conf/spark-defaults.conf:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
Start the external shuffle service (on every worker node):
sbin/start-shuffle-service.sh
Start the worker:
sbin/start-slave.sh -h hostname sparkURL
If the shuffle service is not started on some node, jobs that run there will fail with ExecutorLostFailure.
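With everything configured, a quick sanity check from a running application (for example the spark-shell started in section 3 below):

```scala
// Both should print "true" if conf/spark-defaults.conf was picked up.
println(sc.getConf.get("spark.dynamicAllocation.enabled"))
println(sc.getConf.get("spark.shuffle.service.enabled"))
```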
Related configuration
Parameter | Default | Description
---|---|---
spark.dynamicAllocation.executorIdleTimeout | 60s | Remove an executor once it has been idle for this long.
spark.dynamicAllocation.cachedExecutorIdleTimeout | infinity | Executors holding cached data are not removed by default.
spark.dynamicAllocation.maxExecutors | infinity | Maximum number of executors to use; by default, the maximum you requested.
spark.dynamicAllocation.minExecutors | 0 | Minimum number of executors to retain.
spark.dynamicAllocation.schedulerBacklogTimeout | 1s | Start requesting executors once tasks have been pending for longer than this.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout | schedulerBacklogTimeout | Interval between subsequent rounds of executor requests.
spark.dynamicAllocation.initialExecutors | spark.dynamicAllocation.minExecutors | Initial number of executors to request, e.g. when requesting again after all executors were removed.
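To show how a few of these combine, here is a sketch of a bounded setup built programmatically; the numbers are made up for the example:

```scala
import org.apache.spark.SparkConf

// Illustrative bounds: start with 4 executors, keep at least 2,
// never grow past 20, and reclaim idle executors after 30s.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "4")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "30s")
```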
3. Usage
Start a spark-shell with 5 executors, each using 2 cores:
bin/spark-shell --total-executor-cores 10 --executor-cores 2
If the shell stays idle for 60s, you will see messages like these in the terminal:
scala> 15/11/17 15:40:47 ERROR TaskSchedulerImpl: Lost executor 0 on spark047213: remote Rpc client disassociated
15/11/17 15:40:47 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark047213:50015] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/11/17 15:40:50 ERROR TaskSchedulerImpl: Lost executor 1 on spark047213: remote Rpc client disassociated
15/11/17 15:40:50 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark047213:49847] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
...
One gripe: after an executor is removed, you get a message saying the connection to it was lost, and of all things it is logged at ERROR level....
Then on the web UI you can see that the 10 cores that were in use are now in the left state.
Submit a job that needs only 2 cores:
sc.parallelize(1 to 2).count
You can see 2 cores register again and begin serving the job.
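Besides the automatic policy, SparkContext also exposes a developer API for scaling executors by hand, which can be handy for experiments like this one (the executor ids below are illustrative):

```scala
// DeveloperApi methods on SparkContext; both return true if the
// cluster manager acknowledged the request.
sc.requestExecutors(2)          // ask for 2 additional executors
sc.killExecutors(Seq("0", "1")) // remove specific executors by id
```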