1. Background
Symptom
When the customer runs this Spark job, it normally finishes in about 40 minutes. Occasionally the driver hangs: no new tasks appear in the UI, and after 2 hours the application fails. Due to constraints on the customer's side, the detailed driver log could not be obtained. Basic information:
Component | Version | Module
---|---|---
Spark | 1.5.1 | dataframe.write.mode(SaveMode.Overwrite).parquet(path)
- Core business code:
dataFrame.write.mode(SaveMode.Overwrite).parquet(path)
sqlContext.sql("load data inpath '" + path + "' overwrite into table " + tableName)
- Stack trace of the hung driver:
at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
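This stack means the driver has submitted the write job and is parked in `JobWaiter.awaitResult` until every task of that job finishes; if scheduling stalls, the call never returns. A minimal sketch of a watchdog around the blocking action (hypothetical, not part of the customer's code; `writeWithTimeout` and its parameters are made up for illustration):

```scala
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical watchdog: run the blocking write in a worker thread and fail
// fast on a deadline instead of letting the driver sit for hours inside
// JobWaiter.awaitResult.
def writeWithTimeout(df: DataFrame, path: String, timeoutMin: Long): Unit = {
  val pool = Executors.newSingleThreadExecutor()
  val task = pool.submit(new Runnable {
    override def run(): Unit =
      df.write.mode(SaveMode.Overwrite).parquet(path) // blocking Spark action
  })
  try {
    task.get(timeoutMin, TimeUnit.MINUTES) // throws TimeoutException if the write hangs
  } finally {
    pool.shutdownNow() // interrupts the waiting thread only
  }
}
```

Note that interrupting the waiting thread does not cancel the running Spark job itself; that would additionally need something like `sparkContext.cancelJobGroup(...)`.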
2. Analysis
- Analyze the corresponding application log
At 2019-01-16 15:53:11,748 we see:
WARN | [driver-heartbeater] | Error sending message [message = Heartbeat(28,[Lscala.Tuple2;@2043010d,BlockManagerId(28, 10.180.181.196, 23640))] in 2 attempts | org.apache.spark.Logging$class.logWarning(Logging.scala:92)
Eventually the heartbeat between the executor and the driver fails:
2019-01-16 15:53:20,101 | ERROR | [dispatcher-event-loop-43] | Driver 10.180.143.218:23214 disassociated! Shutting down. | org.apache.spark.Logging$class.logError(Logging.scala:75)
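For context on these two lines: each executor periodically heartbeats the driver, and after repeated send failures it concludes the driver is gone and shuts itself down. A sketch of the knobs that govern this (standard config keys; the values shown are what I believe the 1.5 defaults to be, so treat them as assumptions):

```scala
import org.apache.spark.SparkConf

// Heartbeat-related settings behind the log lines above (values assumed defaults):
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s") // executor -> driver heartbeat period
  .set("spark.network.timeout", "120s")           // fallback timeout for network interactions
```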
In addition there are a large number of shuffle exception logs:
2019-01-16 14:12:01,709 | WARN | [shuffle-server-0] | Exception in connection from /10.180.181.153:49936 | org.apache.spark.network.server.TransportChannelHandler.exceptionCaught(TransportChannelHandler.java:79)
io.netty.handler.codec.CorruptedFrameException: negative pre-adjustment length field: -6101251590885211657
at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:377)
at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:327)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:230)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
2019-01-16 14:12:01,710 | INFO | [shuffle-server-0] | DIGEST41:Unmatched MACs | com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1481)
2019-01-16 14:12:01,710 | WARN | [shuffle-server-0] | Exception in connection from /10.180.181.153:49936 | org.apache.spark.network.server.TransportChannelHandler.exceptionCaught(TransportChannelHandler.java:79)
io.netty.handler.codec.DecoderException: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 505 Expected: 501
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:300)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.security.sasl.SaslException: DIGEST-MD5: Out of order sequencing of messages from server. Got: 505 Expected: 501
at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1489)
at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)
at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:149)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:127)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:102)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
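The "Unmatched MACs" and "Out of order sequencing" errors come from the DIGEST-MD5 SASL layer that encrypts shuffle traffic: frames that arrive corrupted or reordered fail the integrity check, and the connection is torn down. For reference, these are the standard settings that enable that layer (disabling them is revisited in the conclusion):

```scala
import org.apache.spark.SparkConf

// SASL settings implicated by the DIGEST-MD5 errors above:
val conf = new SparkConf()
  .set("spark.authenticate", "true")                      // authenticate Spark internal connections
  .set("spark.authenticate.enableSaslEncryption", "true") // SASL-encrypt the authenticated channels
```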
Is the shuffle exception also the root cause of the failure?
Analyze the executor logs of a few more nodes:
1. Node HQCDDSJAPP181008_26009: the executor printed no log at all between 14:13:17,335 and 15:13:17,292.
- Further analysis of the driver log
1. Since the driver log could not be obtained, no deeper analysis was possible. First, rule out interference from speculative execution: ask the customer to disable Spark speculation (spark.speculation).
2. With speculation disabled, the job still failed: the number of executor launch failures reached the upper limit:
Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (150) reached)
All of the failures come with a large number of socket exceptions; network communication between the executors and the driver was broken:
Caused by: java.io.IOException: Connection from 10.180.143.218/10.180.143.218:23781 closed
at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
We still suspect a network problem.
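For reference, the two settings exercised in this round of testing (standard Spark-on-YARN config keys; the 150 matches the exit reason above, the rest is illustrative):

```scala
import org.apache.spark.SparkConf

// Settings involved in this round of testing:
val conf = new SparkConf()
  .set("spark.speculation", "false")              // rule out speculative task relaunches
  .set("spark.yarn.max.executor.failures", "150") // the AM aborts once this many executors fail
```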
- Logs were re-collected on 2019-01-21.
1. Analyzing the AM log, we find a gap of more than two hours with no output after 15:07:
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181113:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181132:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181077:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181045:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181080:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181140:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181164:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181103:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181104:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181033:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181109:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,499 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181046:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,500 | INFO | [Reporter] | Received new token for : HQCDDSJAPP181182:26009 | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.populateNMTokens(AMRMClientImpl.java:372)
2019-01-21 15:07:46,505 | INFO | [Reporter] | Received 29 containers from YARN, launching executors on 0 of them. | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,565 | INFO | [dispatcher-event-loop-2] | Driver terminated or disconnected! Shutting down. 10.180.143.218:23073 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,567 | INFO | [dispatcher-event-loop-2] | Unregistering ApplicationMaster with FAILED (diag message: Driver has been dead, no need to retry the AM.) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:56,573 | INFO | [dispatcher-event-loop-2] | Waiting for application to be successfully unregistered. | org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:394)
2019-01-21 17:25:56,675 | INFO | [dispatcher-event-loop-2] | Deleting staging directory .sparkStaging/application_1547839891509_15295 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2019-01-21 17:25:58,197 | INFO | [pool-1-thread-1] | Shutdown hook called | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
We find that TID 13203 in stage 31 ran for more than 40 minutes. Although that one task in stage 31 was slow, it did eventually succeed. Spark shuffle is fairly demanding on network performance, so we plan the following shuffle parameter tuning to mitigate single slow tasks (see the sketch after this list):
spark.shuffle.file.buffer=64k
spark.reducer.maxSizeInFlight=96M
spark.network.timeout=300s
spark.rpc.askTimeout=300s
spark.shuffle.io.serverThreads=8
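These can be applied without touching the job code, e.g. as `--conf` flags on spark-submit or programmatically; a minimal sketch of the latter, with the values proposed above:

```scala
import org.apache.spark.SparkConf

// Planned shuffle tuning, applied programmatically:
val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "64k")     // larger map-side write buffer, fewer disk flushes
  .set("spark.reducer.maxSizeInFlight", "96m") // more shuffle data in flight per reducer
  .set("spark.network.timeout", "300s")        // wait longer before declaring a peer dead
  .set("spark.rpc.askTimeout", "300s")         // keep the RPC ask timeout in step
  .set("spark.shuffle.io.serverThreads", "8")  // more Netty server threads for shuffle transfers
```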
3. Conclusion
1. Some tasks ran slowly due to shuffle performance; work around this by tuning the shuffle parameters:
spark.shuffle.file.buffer=64k
spark.reducer.maxSizeInFlight=96M
spark.network.timeout=300s
spark.rpc.askTimeout=300s
spark.shuffle.io.serverThreads=8
2. The influence of a network problem cannot be ruled out. We intended to adjust the following OS parameters, but the customer's production...
net.ipv4.tcp_keepalive_time = 15
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_intvl = 30
3. Disable SASL:
spark.authenticate.enableSaslEncryption=false
spark.authenticate=false