Handling Common Spark Problems

1. The Spark Thrift Server reports the following error, while other access paths such as Hive and Spark SQL work normally

ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-akka.actor.default-dispatcher-379] shutting down ActorSystem [sparkDriverActorSystem]

java.lang.OutOfMemoryError: Java heap space

Cause: the Thrift Server's heap is too small.

Fix: restart the Thrift Server with a larger executor-memory value (it must not exceed the memory Spark has left; if it would, increase SPARK_WORKER_MEMORY in spark-env.sh and restart the Spark cluster):

start-thriftserver.sh --master spark://masterip:7077 --executor-memory 2g --total-executor-cores 4 --executor-cores 1 --hiveconf hive.server2.thrift.port=10050 --conf spark.dynamicAllocation.enabled=false

If the error persists after the executor memory has been increased, closer analysis shows the shortage is actually in the driver, not the executors. In standalone mode the driver gets 1 GB by default, so if the application reads a large amount of data when the driver starts, the driver can run out of memory.

Increase the driver memory in spark-defaults.conf:

spark.driver.memory 2g

And set the same value in spark-env.sh:

export SPARK_DRIVER_MEMORY=2g
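The driver size can also be passed at startup: --driver-memory is a standard spark-submit flag that start-thriftserver.sh forwards, so a variant of the command above would be:

start-thriftserver.sh --master spark://masterip:7077 --driver-memory 2g --executor-memory 2g --total-executor-cores 4 --executor-cores 1 --hiveconf hive.server2.thrift.port=10050 --conf spark.dynamicAllocation.enabled=false

(See problem 10 below, though: a SPARK_DAEMON_MEMORY setting in spark-env.sh can override this flag.)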

2. Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

This error type was introduced in JDK 6. It is thrown when the JVM spends a large amount of time in garbage collection while reclaiming very little memory.

Adding the JVM option -XX:-UseGCOverheadLimit turns off the limit on GC running time (the check is enabled by default).

Add the following to spark-defaults.conf:

spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit

spark.driver.extraJavaOptions -XX:-UseGCOverheadLimit
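Disabling the limit only suppresses the error, though; the JVM may still spend most of its time collecting. Where memory allows, combining the option with a larger heap is usually the sturdier fix (the sizes below are illustrative):

spark.executor.memory 4g

spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit

spark.driver.extraJavaOptions -XX:-UseGCOverheadLimit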

3. Spark error:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby

Cause:

When the NameNodes switched roles, Spark reported the error above. Analysis showed that spark.eventLog.dir in spark-defaults.conf hard-coded the IP of one specific NameNode, so after the failover the event log could no longer be read.

The application's checkpoint is also involved. On first startup the application created a new SparkContext, and that SparkContext was bound to the concrete NameNode IP.

On every later restart, the application code [StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)] recreates the StreamingContext from the existing checkpoint directory.

If checkpointDirectory exists, the context is rebuilt from the checkpoint data (including the stale NameNode IP). Only if the directory does not exist is functionToCreateContext called to create a fresh context.

Hence the exception above.

Fix:

On the test system:

1. Replace the fixed NameNode IP with the corresponding HA nameservice name.

2. Clear the application's checkpoint directory.

3. Restart the application; after a NameNode switchover the Spark application now works normally.
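A minimal sketch of the recovery pattern with the fix applied, assuming a hypothetical HA nameservice named mycluster; the key point is that the checkpoint path uses the nameservice URI, never a concrete NameNode IP:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// HA nameservice URI: a NameNode failover stays transparent to the application.
val checkpointDirectory = "hdfs://mycluster/spark/checkpoint/myapp"

def functionToCreateContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("MyStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... build the DStream pipeline here ...
  ssc.checkpoint(checkpointDirectory)
  ssc
}

// Rebuilds the context from checkpoint data if the directory exists;
// otherwise calls the factory function to create a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)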

4. Logging the details of every GC

Add to spark-defaults.conf:

spark.executor.extraJavaOptions -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -Xloggc:gc.log
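Note the relative path in -Xloggc:gc.log: each executor should then write its gc.log into its own working directory under the worker's work/ directory, which keeps the logs of different executors on the same node apart; an absolute path would make them all write to one file.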

5. Timeouts, reported as:

17/10/18 17:33:46 WARN TaskSetManager: Lost task 1393.0 in stage 382.0 (TID 223626, test-ssps-s-04): ExecutorLostFailure (executor 0 exited caused by one of the running tasks)

Reason: Executor heartbeat timed out after 173568 ms

17/10/18 17:34:02 WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(app-20171017115441-0012,List(8))] in 2 attempts

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout

at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)

Cause: network trouble or long GC pauses; the worker or executor did not receive the heartbeat of an executor or task in time.

Fix: raise spark.network.timeout, to 300s (5 min) or higher as the situation demands. The default is 120s, and the property governs the timeout for all network interactions. Note that the value takes a time suffix; in spark-defaults.conf:

spark.network.timeout 300s
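The same value can also be set per application at submit time:

--conf spark.network.timeout=300s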

6. Reading LZO files through the Spark Thrift Server fails with:

ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found

Add the following to spark-env.sh:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hadoop/hadoop-2.2.0/lib/native

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/hadoop/hadoop-2.2.0/lib/native

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/*:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/*

Then distribute the file to every node.

After restarting the Spark Thrift Server, queries ran normally.
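As an aside, SPARK_CLASSPATH has been deprecated since Spark 1.0 in favor of the classpath properties in spark-defaults.conf; an equivalent form, pointing at whichever directory actually holds the hadoop-lzo jar on the cluster, would be something like:

spark.driver.extraClassPath /home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*

spark.executor.extraClassPath /home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*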

7. A worker keeps failing to launch executors, stuck in an endless loop (launch -> fail -> launch -> fail ...)

Worker log:

Asked to launch executor app-20171024225907-0018/77347 for org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

worker.ExecutorRunner (Logging.scala:logInfo(58)) - Launch command: "/home/hadoop/jdk1.7.0_09/bin/java" ......

Executor app-20171024225907-0018/77345 finished with state EXITED message Command exited with code 53 exitStatus 53

Executor log:

ERROR [main] storage.DiskBlockManager (Logging.scala:logError(95)) - Failed to create local dir in . Ignoring this directory.

java.io.IOException: Failed to create a temp directory (under ) after 10 attempts!

Now look at spark-env.sh:

export SPARK_LOCAL_DIRS=/data/spark/data

A Spark local directory is configured, but it was never created on the machine, which triggers the error. Create it everywhere:

./drun "mkdir -p /data/spark/data"

./drun "chown -R hadoop:hadoop /data/spark"

Once the directories were created and the workers restarted, the error did not recur.
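Incidentally, drun above is presumably a site-local helper that runs a command on every node; an equivalent with plain ssh, assuming conf/slaves lists all worker hostnames:

for host in $(cat /home/hadoop/spark-1.6.1-bin-2.2.0/conf/slaves); do
  ssh "$host" "mkdir -p /data/spark/data && chown -R hadoop:hadoop /data/spark"
done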

8. Workers exiting unexpectedly

Worker log:

17/10/25 11:59:58 INFO worker.Worker: Master with url spark://10.10.10.82:7077 requested this worker to reconnect.

17/10/25 11:59:58 INFO worker.Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.

17/10/25 11:59:59 INFO worker.Worker: Successfully registered with master spark://10.10.10.82:7077

17/10/25 11:59:59 INFO worker.Worker: Worker cleanup enabled; old application directories will be deleted in: /home/hadoop/spark-1.6.1-bin-2.2.0/work

17/10/25 12:00:00 INFO worker.Worker: Retrying connection to master (attempt # 1)

17/10/25 12:00:00 INFO worker.Worker: Master with url spark://10.10.10.82:7077 requested this worker to reconnect.

17/10/25 12:00:00 INFO worker.Worker: Master with url spark://10.10.10.82:7077 requested this worker to reconnect.

17/10/25 12:00:00 INFO worker.Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.

17/10/25 12:00:00 INFO worker.Worker: Asked to launch executor app-20171024225907-0018/119773 for org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

17/10/25 12:00:00 INFO worker.Worker: Connecting to master 10.10.10.82:7077...

17/10/25 12:00:00 INFO spark.SecurityManager: Changing view acls to: hadoop

17/10/25 12:00:00 INFO worker.Worker: Connecting to master 10.10.10.82:7077...

17/10/25 12:00:00 INFO spark.SecurityManager: Changing modify acls to: hadoop

17/10/25 12:00:00 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)

17/10/25 12:00:01 ERROR worker.Worker: Worker registration failed: Duplicate worker ID

17/10/25 12:00:01 INFO worker.ExecutorRunner: Launch command: "/home/hadoop/jdk1.7.0_09/bin/java" "-cp" "/home/hadoop/spark-1.6.1-bin-2.2.0/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/*:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/*:/home/hadoop/spark-1.6.1-bin-2.2.0/conf/:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/spark-assembly-1.6.1-hadoop2.2.0.jar:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/datanucleus-core-3.2.10.jar:/home/hadoop/hadoop-2.2.0/etc/hadoop/" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=43546" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.10.10.80:43546" "--executor-id" "119773" "--hostname" "10.10.10.190" "--cores" "1" "--app-id" "app-20171024225907-0018" "--worker-url" "spark://Worker@10.10.10.190:55335"

17/10/25 12:00:02 INFO worker.ExecutorRunner: Killing process!

[... the same "Killing process!" line repeats 18 times between 12:00:02 and 12:00:05 ...]

17/10/25 12:00:05 INFO util.ShutdownHookManager: Shutdown hook called

17/10/25 12:00:06 INFO util.ShutdownHookManager: Deleting directory /data/spark/data/spark-ab442bb5-62e6-4567-bc7c-8d00534ba1a3

Recently, workers have been exiting abnormally on a regular basis. The worker log shows the master asking the worker to reconnect and re-register; after registering, the worker keeps connecting to the master and then reports an error:

ERROR worker.Worker: Worker registration failed: Duplicate worker ID

The worker then abruptly kills all of its executors and exits.

The affected machines all have 16 GB of RAM (the 32 GB nodes never showed the worker exits).

Checking the YARN configuration on those nodes showed 15 GB allocated to YARN, which raised the suspicion that YARN and Spark were peaking at the same time, leaving Spark unable to obtain resources and exiting.

To verify this, the NodeManager logs were examined: around the time of the incident some containers did kill themselves, but that alone proves little, since the same pattern shows up at other times too.

To see the node's actual YARN resource usage at the time of the problem, the only source was the log of the active ResourceManager. It showed that from 11:46:22 to 11:59:18 the problem node's YARN resource usage kept climbing.

So resource contention was indeed the cause. YARN's allocation was lowered to 12 GB / 10 cores, and the situation will be watched from here.
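Assuming the cap is applied through the standard NodeManager resource properties, 12 GB / 10 cores corresponds to the following in yarn-site.xml on the affected nodes:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>12288</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>10</value>
</property>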

9. Spark (4 DataNodes, 11 NodeManagers) not using the executors on non-DataNode machines

Symptom: on a big job with tens of thousands of tasks, only the executors on the 4 DataNode machines were used; the other 7 machines ran nothing, or almost nothing.

Analysis: when assigning tasks, the Spark driver prefers to place each task exactly on the node that holds its input data, avoiding network transfer. When a task misses that placement because the data node's resources happen to be fully taken, Spark waits for a while (controlled by spark.locality.wait, default 3s); once the wait expires, it falls back to a worse locality level.

Fix: set spark.locality.wait to 0 so that tasks are assigned immediately with no locality wait; the service must be restarted for the change to take effect.
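In spark-defaults.conf:

spark.locality.wait 0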

9揍愁、spark自帶參數(shù)不生效問(wèn)題

The Spark Thrift test system kept misbehaving even after the driver memory setting had been raised.

Typical symptom: connections to the Spark Thrift Server get no response, and after a while it reports OutOfMemoryError: Java heap space.

It turned out the driver memory setting never took effect: the environment file spark-env.sh sets SPARK_DAEMON_MEMORY=1024m, which overrides the

--driver-memory 4G

passed at startup, so the flag was silently ignored.
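The fix is to make the two settings agree: either raise SPARK_DAEMON_MEMORY in spark-env.sh to the intended driver size, or remove the line so the command-line flag wins. For example:

export SPARK_DAEMON_MEMORY=4g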

