Spark SQL job fails with an error
The error messages are as follows:
19/10/17 18:06:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e122_1568025476000_38356_01_000022 on host: node02. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
19/10/17 18:06:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
19/10/17 18:06:50 INFO BlockManagerMaster: Removal of executor 21 requested
19/10/17 18:06:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
19/10/17 18:06:50 INFO TaskSetManager: Starting task 1.1 in stage 28.3 (TID 46, node02, executor 4, partition 1, NODE_LOCAL, 5066 bytes)
19/10/17 18:06:51 INFO BlockManagerInfo: Added broadcast_23_piece0 in memory on node02:27885 (size: 106.5 KB, free: 5.2 GB)
19/10/17 18:06:51 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to xx.xx.xx.xx:30178
19/10/17 18:06:51 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 6 is 166 bytes
19/10/17 18:06:51 WARN TaskSetManager: Lost task 1.1 in stage 28.3 (TID 46, node02, executor 4): FetchFailed(null, shuffleId=6, mapId=-1, reduceId=166, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 6
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:165)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
)
Right after the error above, the following error appears:
19/10/17 17:10:32 WARN TaskSetManager: Lost task 0.0 in stage 82.0 (TID 1855, node01, executor 5): FetchFailed(BlockManagerId(1, node01, 26969, None), shuffleId=13, mapId=1, reduceId=10, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to hostname/xx.xx.xx.xx:26969
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:444)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:776)
at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:737)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:899)
at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:935)
at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:313)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to node01/xx.xx.xx.xx:26969
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: node01/xx.xx.xx.xx:26969
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
... 2 more
)
Solution:
Increase executor memory: after raising spark.executor.memory from 10G to 20G, the exception no longer occurs.
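For reference, this setting must be in place before the executors are launched. A minimal Scala sketch of doing it programmatically (the application name is a placeholder; with the spark-sql or spark-submit CLI the equivalent is passing --conf spark.executor.memory=20g at launch time):

import org.apache.spark.sql.SparkSession

// Request 20g executors instead of the 10g ones that were being killed with exit code 137.
// "ExampleJob" is a placeholder; the same effect comes from --conf spark.executor.memory=20g
// on the spark-sql / spark-submit command line.
val spark = SparkSession.builder()
  .appName("ExampleJob")
  .config("spark.executor.memory", "20g")
  .getOrCreate()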
Root cause analysis
Shuffle consists of two parts: shuffle write and shuffle read.
The number of shuffle write partitions is determined by the number of partitions of the RDD in the preceding stage, while the number of shuffle read partitions is controlled by configuration parameters provided by Spark.
Shuffle write can be loosely understood as something like a saveAsLocalDiskFile operation: intermediate results are temporarily written, according to some partitioning rule, to the local disks of the executors that computed them.
During shuffle read, the number of partitions is again controlled by Spark configuration parameters. Predictably, if this parameter is set very low while the shuffle read volume is large, each task ends up having to process a huge amount of data. That can crash the executor JVM, which makes fetching the shuffle data fail and loses the executor; the "Failed to connect to host" error means exactly that (executor lost). Even when the JVM does not crash, this can still cause very long GC pauses.
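As a rough, purely hypothetical illustration of the effect: if a stage has to shuffle-read about 200 GB and spark.sql.shuffle.partitions is left at its default of 200, each reduce task must pull and process roughly 1 GB on top of its sort and join buffers, which on a 10G executor shared by several task slots is easily enough to trigger the exit code 137 kill seen above or very long GC; raising the partition count to 2000 would bring the per-task share down to around 100 MB.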
How to fix it
Once the cause is known, the problem is easy to address; tackle it from two angles: the amount of shuffle data and the number of partitions used to process that data.
- Reduce the amount of shuffle data
Consider whether a map-side join or broadcast join can be used to avoid producing a shuffle in the first place.
Filter out unneeded data before the shuffle; for example, if the raw data has 20 columns but only a few are needed, selecting just those columns already reduces the shuffle volume. (A sketch of both ideas follows this list.)
- Joins, group by, and similar operations on Spark SQL and DataFrames
The number of partitions is controlled by spark.sql.shuffle.partitions, which defaults to 200; raise it according to the shuffle volume and the complexity of the computation.
- RDD operations such as join, groupBy, and reduceByKey
The number of partitions used for shuffle read and reduce processing is controlled by spark.default.parallelism, which defaults to the total number of cores running the job (8 in Mesos fine-grained mode, the number of local cores in local mode); the official recommendation is 2-3 times the number of cores running the job. (See the configuration sketch after this list.)
- Increase executor memory
Raise spark.executor.memory to an appropriately larger value.
- Check for data skew
Have null keys been filtered out? Can abnormal data (a key with an unusually large amount of data) be handled separately? Consider changing the partitioning rule. (A salting sketch also follows this list.)
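A minimal sketch of the two shuffle-reduction ideas from the first item, assuming a SparkSession named spark and two placeholder tables big_table and small_dim with made-up column names:

import org.apache.spark.sql.functions.broadcast

// Column pruning: keep only the columns the job actually needs before the join,
// instead of shuffling all of the original columns.
val facts = spark.table("big_table").select("id", "amount")
val dim   = spark.table("small_dim").select("id", "name")

// Broadcast join: ship the small table to every executor so the large table
// is never shuffled for this join.
val joined = facts.join(broadcast(dim), Seq("id"))

Depending on the Spark version, the same hint can often be written directly in SQL as SELECT /*+ BROADCAST(d) */ ....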
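The configuration knobs from the list can be applied as follows; the values are examples only and should be sized to the actual shuffle volume and cluster:

// spark.sql.shuffle.partitions applies to SQL/DataFrame shuffles and can be changed at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "1000")  // default is 200

// spark.default.parallelism (RDD shuffles) and spark.executor.memory are static settings;
// pass them when the job is launched, for example:
//   --conf spark.default.parallelism=400   (roughly 2-3x the total executor cores)
//   --conf spark.executor.memory=20g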
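For the data-skew check in the last item, a common pattern is to drop null keys, inspect key frequencies, and salt the hot keys so their rows spread over several partitions. A rough sketch under the assumption that df and other are placeholder DataFrames joined on a column called key:

import org.apache.spark.sql.functions._

// 1. Drop null keys before the shuffle if they carry no business meaning.
val cleaned = df.filter(col("key").isNotNull)

// 2. Look for suspiciously hot keys.
cleaned.groupBy("key").count().orderBy(desc("count")).show(20)

// 3. Salting: append a random suffix on the skewed side and replicate the other side
//    across the same suffix range so matching rows still land in the same partition.
val salts = 10
val salted = cleaned.withColumn(
  "salted_key",
  concat(col("key").cast("string"), lit("_"), (rand() * salts).cast("int").cast("string")))
val replicated = other
  .withColumn("salt", explode(array((0 until salts).map(i => lit(i)): _*)))
  .withColumn("salted_key", concat(col("key").cast("string"), lit("_"), col("salt").cast("string")))
val skewJoined = salted.join(replicated, "salted_key")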