Spark SQL job fails with an error
The error messages are as follows:
19/10/17 18:06:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e122_1568025476000_38356_01_000022 on host: node02. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
19/10/17 18:06:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
19/10/17 18:06:50 INFO BlockManagerMaster: Removal of executor 21 requested
19/10/17 18:06:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
19/10/17 18:06:50 INFO TaskSetManager: Starting task 1.1 in stage 28.3 (TID 46, node02, executor 4, partition 1, NODE_LOCAL, 5066 bytes)
19/10/17 18:06:51 INFO BlockManagerInfo: Added broadcast_23_piece0 in memory on node02:27885 (size: 106.5 KB, free: 5.2 GB)
19/10/17 18:06:51 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to xx.xx.xx.xx:30178
19/10/17 18:06:51 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 6 is 166 bytes
19/10/17 18:06:51 WARN TaskSetManager: Lost task 1.1 in stage 28.3 (TID 46, node02, executor 4): FetchFailed(null, shuffleId=6, mapId=-1, reduceId=166, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:165)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
)
Right after the error above, the following error appears:
19/10/17 17:10:32 WARN TaskSetManager: Lost task 0.0 in stage 82.0 (TID 1855, node01, executor 5): FetchFailed(BlockManagerId(1, node01, 26969, None), shuffleId=13, mapId=1, reduceId=10, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hostname/xx.xx.xx.xx:26969
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:776)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:737)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:899)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:935)
at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:313)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to node01/xx.xx.xx.xx:26969
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: node01/xx.xx.xx.xx:26969
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
... 2 more
)
Solution:
Increase executor memory: after raising spark.executor.memory from 10G to 20G, the exception no longer occurs.
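For reference, this setting must be in place before the executors are launched. A minimal Scala sketch of doing it programmatically (the application name is a placeholder; with the spark-sql or spark-submit CLI the equivalent is passing --conf spark.executor.memory=20g at launch time):

import org.apache.spark.sql.SparkSession

// Request 20g executors instead of the 10g ones that were being killed with exit code 137.
// "ExampleJob" is a placeholder; the same effect comes from --conf spark.executor.memory=20g
// on the spark-sql / spark-submit command line.
val spark = SparkSession.builder()
  .appName("ExampleJob")
  .config("spark.executor.memory", "20g")
  .getOrCreate()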
Root cause analysis
Shuffle consists of two parts: shuffle write and shuffle read.
The number of shuffle write partitions is determined by the number of partitions of the RDD in the preceding stage, while the number of shuffle read partitions is controlled by configuration parameters provided by Spark.
Shuffle write can be loosely understood as something like a saveAsLocalDiskFile operation: intermediate results are temporarily written, according to some partitioning rule, to the local disks of the executors that computed them.
During shuffle read, the number of partitions is again controlled by Spark configuration parameters. Predictably, if this parameter is set very low while the shuffle read volume is large, each task ends up having to process a huge amount of data. That can crash the executor JVM, which makes fetching the shuffle data fail and loses the executor; the "Failed to connect to host" error means exactly that (executor lost). Even when the JVM does not crash, this can still cause very long GC pauses.
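As a rough, purely hypothetical illustration of the effect: if a stage has to shuffle-read about 200 GB and spark.sql.shuffle.partitions is left at its default of 200, each reduce task must pull and process roughly 1 GB on top of its sort and join buffers, which on a 10G executor shared by several task slots is easily enough to trigger the exit code 137 kill seen above or very long GC; raising the partition count to 2000 would bring the per-task share down to around 100 MB.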
How to fix it
Once the cause is known, the problem is easy to address; tackle it from two angles: the amount of shuffle data and the number of partitions used to process that data.
- Reduce the amount of shuffle data
Consider whether a map-side join or broadcast join can be used to avoid producing a shuffle in the first place.
Filter out unneeded data before the shuffle; for example, if the raw data has 20 columns but only a few are needed, selecting just those columns already reduces the shuffle volume. (A sketch of both ideas follows this list.)
- Joins, group by, and similar operations on Spark SQL and DataFrames
The number of partitions is controlled by spark.sql.shuffle.partitions, which defaults to 200; raise it according to the shuffle volume and the complexity of the computation.
- RDD operations such as join, groupBy, and reduceByKey
The number of partitions used for shuffle read and reduce processing is controlled by spark.default.parallelism, which defaults to the total number of cores running the job (8 in Mesos fine-grained mode, the number of local cores in local mode); the official recommendation is 2-3 times the number of cores running the job. (See the configuration sketch after this list.)
- Increase executor memory
Raise spark.executor.memory to an appropriately larger value.
- Check for data skew
Have null keys been filtered out? Can abnormal data (a key with an unusually large amount of data) be handled separately? Consider changing the partitioning rule. (A salting sketch also follows this list.)
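A minimal sketch of the two shuffle-reduction ideas from the first item, assuming a SparkSession named spark and two placeholder tables big_table and small_dim with made-up column names:

import org.apache.spark.sql.functions.broadcast

// Column pruning: keep only the columns the job actually needs before the join,
// instead of shuffling all of the original columns.
val facts = spark.table("big_table").select("id", "amount")
val dim   = spark.table("small_dim").select("id", "name")

// Broadcast join: ship the small table to every executor so the large table
// is never shuffled for this join.
val joined = facts.join(broadcast(dim), Seq("id"))

Depending on the Spark version, the same hint can often be written directly in SQL as SELECT /*+ BROADCAST(d) */ ....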
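The configuration knobs from the list can be applied as follows; the values are examples only and should be sized to the actual shuffle volume and cluster:

// spark.sql.shuffle.partitions applies to SQL/DataFrame shuffles and can be changed at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "1000")  // default is 200

// spark.default.parallelism (RDD shuffles) and spark.executor.memory are static settings;
// pass them when the job is launched, for example:
//   --conf spark.default.parallelism=400   (roughly 2-3x the total executor cores)
//   --conf spark.executor.memory=20g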
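For the data-skew check in the last item, a common pattern is to drop null keys, inspect key frequencies, and salt the hot keys so their rows spread over several partitions. A rough sketch under the assumption that df and other are placeholder DataFrames joined on a column called key:

import org.apache.spark.sql.functions._

// 1. Drop null keys before the shuffle if they carry no business meaning.
val cleaned = df.filter(col("key").isNotNull)

// 2. Look for suspiciously hot keys.
cleaned.groupBy("key").count().orderBy(desc("count")).show(20)

// 3. Salting: append a random suffix on the skewed side and replicate the other side
//    across the same suffix range so matching rows still land in the same partition.
val salts = 10
val salted = cleaned.withColumn(
  "salted_key",
  concat(col("key").cast("string"), lit("_"), (rand() * salts).cast("int").cast("string")))
val replicated = other
  .withColumn("salt", explode(array((0 until salts).map(i => lit(i)): _*)))
  .withColumn("salted_key", concat(col("key").cast("string"), lit("_"), col("salt").cast("string")))
val skewJoined = salted.join(replicated, "salted_key")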