【2019-01-04】Spark 程序在driver卡住

1.背景

  • 現(xiàn)象
    用戶在使用Spark時(shí)候晶密,任務(wù)正常40min結(jié)束。偶現(xiàn)driver卡住模她,UI上沒有新task產(chǎn)生稻艰,且2小時(shí)候后程序failed。由于客戶愿意無法獲取到driver詳細(xì)日志侈净。

  • 基本信息 :

組件 版本 模塊
Spark 1.5.1 dataframe.write.mode(SaveMode.Overwrite).parquet(path)
  • 核心業(yè)務(wù)代碼:
dataFrame.write.mode(SaveMode.Overwrite).parquet(path)
sqlContext.sql("load data inpath '"+path+"'+ overwrite into table "+ tableName)
  • hang著的drvier堆棧:
    at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)

2.分析

  • 分析對(duì)應(yīng)的application日志

發(fā)現(xiàn)在2019-01-16 15:53:11,748:

 WARN  | [driver-heartbeater] | Error sending message [message = Heartbeat(28,[Lscala.Tuple2;@2043010d,BlockManagerId(28, 10.180.181.196, 23640))] in 2 attempts | org.apache.spark.Logging$class.logWarning(Logging.scala:92)

最終executor和driver的心跳失效:

2019-01-16 15:53:20,101 | ERROR | [dispatcher-event-loop-43] | Driver 10.180.143.218:23214 disassociated! Shutting down. | org.apache.spark.Logging$class.logError(Logging.scala:75)

此外還有大量shuffle異常日志:

2019-01-16 14:12:01,709 | WARN  | [shuffle-server-0] | Exception in connection from /10.180.181.153:49936 | org.apache.spark.network.server.TransportChannelHandler.exceptionCaught(TransportChannelHandler.java:79)
io.netty.handler.codec.CorruptedFrameException: negative pre-adjustment length field: -6101251590885211657
        at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:377)
        at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:327)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
2019-01-16 14:12:01,710 | INFO  | [shuffle-server-0] | DIGEST41:Unmatched MACs | com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1481)
2019-01-16 14:12:01,710 | WARN  | [shuffle-server-0] | Exception in connection from /10.180.181.153:49936 | org.apache.spark.network.server.TransportChannelHandler.exceptionCaught(TransportChannelHandler.java:79)
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 505 Expected: 501
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:300)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 505 Expected: 501
        at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1489)
        at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)
        at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:149)
        at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:127)
        at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:102)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)

shuffle異常是否也是失敗的根因尊勿?

  • 多分析幾個(gè)節(jié)點(diǎn)的executor日志
    1.節(jié)點(diǎn):HQCDDSJAPP181008_26009,從14:13:17,33515:13:17,292間executor無任何日志打印

  • 進(jìn)一步分析driver日志

1.由于無法獲取到driver日志畜侦,沒法做更多的分析元扔。先排除推測機(jī)制的干擾。讓客戶關(guān)閉掉spark推測機(jī)制:spark.speculation

2.關(guān)閉掉推測機(jī)制后旋膳,任務(wù)運(yùn)行也失敗了摇展。啟動(dòng)executor失敗的次數(shù)達(dá)到上限

Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (150) reached) 

并且失敗都有大量的socket異常打印,executor和driver網(wǎng)絡(luò)通信中斷:

Caused by: java.io.IOException: Connection from 10.180.143.218/10.180.143.218:23781 closed
        at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
        at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)

還是懷疑是網(wǎng)絡(luò)的問題

  • 2019-01-21重新收集到了日志。

1.分析AM日志溺忧,發(fā)現(xiàn)AM日志從15點(diǎn)到之后都沒有任何打佑搅:

2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181113:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181132:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181077:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181045:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181080:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181140:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181164:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181103:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181104:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181033:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181109:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181046:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,500 | INFO  | [Reporter] | Received new token for : HQCDDSJAPP181182:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,505 | INFO  | [Reporter] | Received 29 containers from YARN, launching executors on 0 of them. | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,565 | INFO  | [dispatcher-event-loop-2] | Driver terminated or disconnected! Shutting down. 10.180.143.218:23073 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,567 | INFO  | [dispatcher-event-loop-2] | Unregistering ApplicationMaster with FAILED (diag message: Driver has been dead, no need to retry the AM.) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,573 | INFO  | [dispatcher-event-loop-2] | Waiting for application to be successfully unregistered. | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:394)
2019-01-21 17:25:56,675 | INFO  | [dispatcher-event-loop-2] | Deleting staging directory .sparkStaging/application_1547839891509_15295 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:58,197 | INFO  | [pool-1-thread-1] | Shutdown hook called | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

發(fā)現(xiàn)TID 13203 執(zhí)行了40多min,在stage 31鲁森。雖然stage31的某個(gè)task執(zhí)行慢祟滴,但是最終是執(zhí)行成功的。spark的shuffle對(duì)網(wǎng)絡(luò)性能要求比較高歌溉,準(zhǔn)備進(jìn)行如下shuffle調(diào)優(yōu)垄懂,避免單個(gè)task問題:
準(zhǔn)備進(jìn)行如下調(diào)整:調(diào)優(yōu)化shuffle參數(shù):
spark.shuffle.file.buffer=64k,
spark.reducer.maxSizeInFlight=96M
spark.network.timeout=300s
spark.rpc.askTimeout=300s
spark.shuffle.io.serverThreads=8

結(jié)論

1.部分task執(zhí)行慢痛垛,是由于shuffle性能影響草慧,調(diào)整shuffle參數(shù)規(guī)避。
spark.shuffle.file.buffer=64k匙头,
spark.reducer.maxSizeInFlight=96M
spark.network.timeout=300s
spark.rpc.askTimeout=300s
spark.shuffle.io.serverThreads=8
2.不排除網(wǎng)絡(luò)問題的影響漫谷,試圖調(diào)整os參數(shù),但是客戶生產(chǎn)
net.ipv4.tcp_keepalive_time= 15
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_intvl= 30
3.關(guān)閉sasl
spark.authenticate.enableSaslEncryption=false
spark.authenticate=false

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末蹂析,一起剝皮案震驚了整個(gè)濱河市舔示,隨后出現(xiàn)的幾起案子碟婆,更是在濱河造成了極大的恐慌,老刑警劉巖惕稻,帶你破解...
    沈念sama閱讀 218,682評(píng)論 6 507
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件竖共,死亡現(xiàn)場離奇詭異,居然都是意外死亡俺祠,警方通過查閱死者的電腦和手機(jī)公给,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,277評(píng)論 3 395
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來蜘渣,“玉大人淌铐,你說我怎么就攤上這事∷挝啵” “怎么了匣沼?”我有些...
    開封第一講書人閱讀 165,083評(píng)論 0 355
  • 文/不壞的土叔 我叫張陵狰挡,是天一觀的道長捂龄。 經(jīng)常有香客問我,道長加叁,這世上最難降的妖魔是什么倦沧? 我笑而不...
    開封第一講書人閱讀 58,763評(píng)論 1 295
  • 正文 為了忘掉前任,我火速辦了婚禮它匕,結(jié)果婚禮上展融,老公的妹妹穿的比我還像新娘。我一直安慰自己豫柬,他們只是感情好告希,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,785評(píng)論 6 392
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著烧给,像睡著了一般燕偶。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上础嫡,一...
    開封第一講書人閱讀 51,624評(píng)論 1 305
  • 那天指么,我揣著相機(jī)與錄音,去河邊找鬼榴鼎。 笑死伯诬,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的巫财。 我是一名探鬼主播盗似,決...
    沈念sama閱讀 40,358評(píng)論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢(mèng)啊……” “哼平项!你這毒婦竟也來了桥言?” 一聲冷哼從身側(cè)響起萌踱,我...
    開封第一講書人閱讀 39,261評(píng)論 0 276
  • 序言:老撾萬榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎号阿,沒想到半個(gè)月后并鸵,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,722評(píng)論 1 315
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡扔涧,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,900評(píng)論 3 336
  • 正文 我和宋清朗相戀三年园担,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片枯夜。...
    茶點(diǎn)故事閱讀 40,030評(píng)論 1 350
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡弯汰,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出湖雹,到底是詐尸還是另有隱情咏闪,我是刑警寧澤,帶...
    沈念sama閱讀 35,737評(píng)論 5 346
  • 正文 年R本政府宣布摔吏,位于F島的核電站鸽嫂,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏征讲。R本人自食惡果不足惜据某,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,360評(píng)論 3 330
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望诗箍。 院中可真熱鬧癣籽,春花似錦、人聲如沸滤祖。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,941評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽匠童。三九已至埂材,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間俏让,已是汗流浹背楞遏。 一陣腳步聲響...
    開封第一講書人閱讀 33,057評(píng)論 1 270
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留首昔,地道東北人寡喝。 一個(gè)月前我還...
    沈念sama閱讀 48,237評(píng)論 3 371
  • 正文 我出身青樓,卻偏偏與公主長得像勒奇,于是被迫代替她去往敵國和親预鬓。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,976評(píng)論 2 355