spark-sql job fails with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

A spark-sql job failed while running.

The error message is as follows:

19/10/17 18:06:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e122_1568025476000_38356_01_000022 on host: node02. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal

19/10/17 18:06:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
19/10/17 18:06:50 INFO BlockManagerMaster: Removal of executor 21 requested
19/10/17 18:06:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
19/10/17 18:06:50 INFO TaskSetManager: Starting task 1.1 in stage 28.3 (TID 46, node02, executor 4, partition 1, NODE_LOCAL, 5066 bytes)
19/10/17 18:06:51 INFO BlockManagerInfo: Added broadcast_23_piece0 in memory on node02:27885 (size: 106.5 KB, free: 5.2 GB)
19/10/17 18:06:51 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to xx.xx.xx.xx:30178
19/10/17 18:06:51 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 6 is 166 bytes
19/10/17 18:06:51 WARN TaskSetManager: Lost task 1.1 in stage 28.3 (TID 46, node02, executor 4): FetchFailed(null, shuffleId=6, mapId=-1, reduceId=166, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
    at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:165)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

)

Right after the error above, the following error appears:
19/10/17 17:10:32 WARN TaskSetManager: Lost task 0.0 in stage 82.0 (TID 1855, node01, executor 5): FetchFailed(BlockManagerId(1, node01, 26969, None), shuffleId=13, mapId=1, reduceId=10, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hostname/xx.xx.xx.xx:26969
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:776)
    at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:737)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:899)
    at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:935)
    at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:313)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to node01/xx.xx.xx.xx:26969
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: node01/xx.xx.xx.xx:26969
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    ... 2 more

)

Solution:

Increase executor memory. After raising spark.executor.memory from 10G to 20G, the exception no longer occurred.
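
For reference, a minimal submit-time sketch with the raised memory setting. The class and jar names are placeholders, and spark.yarn.executor.memoryOverhead is a commonly paired knob for container kills of this kind — an assumption here, not something the original post tuned:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --class com.example.MyJob \
  my-job.jar

The same options work with the spark-sql CLI, e.g. spark-sql --executor-memory 20G -f my_job.sql.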

Cause analysis

A shuffle has two halves: shuffle write and shuffle read.
The number of shuffle write partitions is determined by the partition count of the previous stage's RDD, while the number of shuffle read partitions is controlled by Spark configuration parameters.
Shuffle write can be loosely understood as a saveAsLocalDiskFile-style operation: a stage's intermediate results are temporarily written, according to a partitioning rule, to the local disks of the executors.
On the shuffle read side, if the configured partition count is small while the volume of shuffle data is large, each task has to process a very large amount of data. That can crash the executor JVM, so fetching the shuffle data fails and the executor is lost — which is exactly what the "Failed to connect to host" error means. Even when the JVM survives, the pressure can cause long GC pauses. Note that exit status 137 in the log above is 128 + SIGKILL: the container was killed from outside (typically by YARN or the OS for exceeding its memory limit), which is consistent with the memory-based fix.
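
To make that concrete with hypothetical numbers (not taken from this job): a stage that shuffle-reads 400 GB with the default spark.sql.shuffle.partitions = 200 hands each reduce task roughly 400 GB / 200 = 2 GB. A few such tasks running concurrently inside a 10 GB executor can exhaust the heap, whereas 1000 partitions would drop the per-task load to about 400 MB.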

How to fix

Once the cause is clear, the problem is easy to attack from two angles: the amount of shuffle data, and the number of partitions that process it.

  • Reduce the shuffle data
    Consider whether a map-side join or broadcast join can avoid the shuffle altogether (see the broadcast sketch after this list).
    Filter out unneeded data before the shuffle. For example, if the raw records carry 20 columns, select only the columns the job actually uses; the shuffle volume shrinks accordingly.
  • Spark SQL / DataFrame operations such as join and group by
    The partition count is controlled by spark.sql.shuffle.partitions, default 200. Raise it according to the shuffle volume and the complexity of the computation (see the partition sketch after this list).
  • RDD operations such as join, groupBy, and reduceByKey
    The shuffle read / reduce partition count is controlled by spark.default.parallelism. It defaults to the total number of cores running the job (8 in Mesos fine-grained mode, the local core count in local mode); the official recommendation is 2-3x the total core count.
  • Increase executor memory
    Raise spark.executor.memory appropriately, as in the submit sketch under "Solution" above.
  • Check for data skew
    Have null keys been filtered out? Can abnormal keys (a single key carrying a disproportionate amount of data) be handled separately? Consider changing the data partitioning rule (see the salting sketch after this list).
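
A minimal broadcast-join sketch in Scala — the table names, column names, and output table are hypothetical, chosen only to illustrate the pattern:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val facts = spark.table("db.fact_events") // hypothetical large table
    val dims  = spark.table("db.dim_users")   // hypothetical small table

    // broadcast() ships the small table to every executor, so joining it
    // against the large side needs no shuffle at all.
    val joined = facts.join(broadcast(dims), Seq("user_id"), "left_outer")

    // Project only the columns later steps need before any further shuffle,
    // to cut the shuffle write volume.
    joined.select("user_id", "event_time", "user_region")
      .write.mode("overwrite").saveAsTable("db.joined_result")

    spark.stop()
  }
}

In a plain spark-sql script the same effect comes from the /*+ BROADCAST(d) */ hint, or automatically when the small table stays under spark.sql.autoBroadcastJoinThreshold (10 MB by default).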
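
A sketch of raising both partition settings — the values are illustrative starting points, not tuned for any particular workload:

import org.apache.spark.sql.SparkSession

object ShufflePartitionSketch {
  def main(args: Array[String]): Unit = {
    // spark.default.parallelism governs RDD shuffles (join, groupBy,
    // reduceByKey) and must be fixed before the context starts.
    val spark = SparkSession.builder()
      .appName("shuffle-partition-sketch")
      .config("spark.default.parallelism", "400") // roughly 2-3x total cores
      .enableHiveSupport()
      .getOrCreate()

    // spark.sql.shuffle.partitions governs SQL/DataFrame shuffles and can be
    // changed per session at runtime (default: 200).
    spark.conf.set("spark.sql.shuffle.partitions", "800")

    // Every join / group by in this session now runs 800 reduce tasks, so
    // each task reads a smaller slice of the shuffle output.
    spark.sql("SELECT user_id, count(*) AS cnt FROM db.fact_events GROUP BY user_id")
      .show()

    spark.stop()
  }
}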
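
For the skew point, a common pattern is key salting. A hedged sketch, again with hypothetical table and column names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SkewSaltingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-salting-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val facts = spark.table("db.fact_events") // hypothetical skewed large table
    val dims  = spark.table("db.dim_users")   // hypothetical smaller table

    // Null join keys all hash to one reducer; drop them first if the
    // business logic allows.
    val cleaned = facts.filter(col("user_id").isNotNull)

    // Salt the large side: a random suffix spreads each hot key over N
    // reduce partitions instead of one.
    val N = 16
    val saltedFacts = cleaned.withColumn("salt", (rand() * N).cast("int"))

    // Replicate the other side over the same salt range so every
    // (key, salt) pair still finds its match.
    val saltedDims = dims.withColumn("salt", explode(array((0 until N).map(lit): _*)))

    saltedFacts.join(saltedDims, Seq("user_id", "salt"))
      .drop("salt")
      .write.mode("overwrite").saveAsTable("db.joined_no_skew")

    spark.stop()
  }
}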