1. Spark Thrift Server reports the error below, while other access paths such as Hive and Spark SQL work normally:
ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-akka.actor.default-dispatcher-379] shutting down ActorSystem [sparkDriverActorSystem]
java.lang.OutOfMemoryError: Java heap space
Cause: the Thrift Server's heap memory is too small.
Fix: restart the Thrift Server with a larger executor-memory (it must not exceed the memory Spark has left; if it does, raise SPARK_WORKER_MEMORY in spark-env.sh and restart the Spark cluster):
start-thriftserver.sh --master spark://masterip:7077 --executor-memory 2g --total-executor-cores 4 --executor-cores 1 --hiveconf hive.server2.thrift.port=10050 --conf spark.dynamicAllocation.enabled=false
If the error persists even after raising executor memory, closer analysis shows the executor is not the problem: it is the driver that is short of memory. In standalone mode the driver gets 1 GB by default, and if the application submitted through the driver reads too much data, the driver can run out of memory.
Increase the driver memory in spark-defaults.conf:
spark.driver.memory    2g
and set the same value in spark-env.sh:
export SPARK_DRIVER_MEMORY=2g
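For a per-launch alternative, the driver heap can also be passed on the command line when restarting the Thrift Server; a minimal sketch reusing the masterip/port from the example above:
start-thriftserver.sh --master spark://masterip:7077 \
  --driver-memory 2g --executor-memory 2g \
  --hiveconf hive.server2.thrift.port=10050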
2. Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
This error type was introduced in JDK 6. It is thrown when the GC spends a large amount of time reclaiming only a tiny amount of heap.
Adding the JVM option -XX:-UseGCOverheadLimit disables the limit on GC running time (the limit is enabled by default).
Add the following to spark-defaults.conf:
spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit
spark.driver.extraJavaOptions -XX:-UseGCOverheadLimit
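For a single application rather than cluster-wide defaults, the same options can be passed at submit time; a minimal sketch (the application jar and its arguments are omitted):
spark-submit --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
             --conf "spark.driver.extraJavaOptions=-XX:-UseGCOverheadLimit" ...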
3. Spark error:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
Cause:
During a NameNode failover, Spark reported the error above. Analysis showed that spark.eventLog.dir in spark-defaults.conf was hard-coded to one NameNode's IP, so after the failover the event log could no longer be read.
The application's checkpoint is also involved: when the application started for the first time, it created a new SparkContext that was bound to that specific NameNode IP.
On every later restart, the application code [StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)] rebuilds the StreamingContext from the existing checkpoint data: if checkpointDirectory exists, the context is restored from the checkpoint; only if the directory does not exist is functionToCreateContext called to create a new context.
Hence the exception above.
Fix (on the test system):
1. Replace the fixed NameNode IP with the HDFS nameservice (a sketch follows these steps).
2. Clear the application's checkpoint directory.
3. Restart the application and fail the NameNode over again; the Spark application now keeps working.
4. Capturing the details of every GC
Add the following to spark-defaults.conf:
spark.executor.extraJavaOptions -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -Xloggc:gc.log
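If the driver's GC behavior also needs to be examined, the same flags can be applied there; a sketch (the log file name is arbitrary):
spark.driver.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:driver-gc.log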
5问词、timeout,報(bào)錯(cuò):
17/10/18 17:33:46 WARN TaskSetManager: Lost task 1393.0 in stage 382.0 (TID 223626, test-ssps-s-04): ExecutorLostFailure (executor 0 exited caused by one of the running tasks)
Reason:Executor heartbeat timed out after 173568 ms
17/10/18 17:34:02 WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(app-20171017115441-0012,List(8))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
  at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
Cause: network problems or long GC pauses, so the heartbeat from the executor/task was not received in time.
Fix: raise spark.network.timeout; depending on the situation, 300s (5 min) or higher.
The default is 120s; this setting controls the timeout for all network interactions. In spark-defaults.conf:
spark.network.timeout 300s
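Related to this, spark.executor.heartbeatInterval (default 10s) should stay well below spark.network.timeout; if it has been raised, keep the gap, for example:
spark.executor.heartbeatInterval 20s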
6辰狡、通過(guò)sparkthriftserver讀取lzo文件報(bào)錯(cuò):
ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
Add the following to spark-env.sh:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/hadoop/hadoop-2.2.0/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/hadoop/hadoop-2.2.0/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/*:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/*
Distribute the change to all nodes.
After restarting the Spark Thrift Server, queries run normally.
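Note that SPARK_CLASSPATH is deprecated in Spark 1.x; an alternative is to set the class/library path properties in spark-defaults.conf. A sketch assuming the hadoop-lzo jar sits under the common/lib directory used above (adjust the paths to wherever the jar actually lives):
spark.driver.extraClassPath     /home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*
spark.executor.extraClassPath   /home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*
spark.executor.extraLibraryPath /home/hadoop/hadoop-2.2.0/lib/native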
7. The Spark worker keeps failing to launch executors and falls into a loop (launch -> fail -> launch -> fail ...)
Worker log:
Asked to launch executor app-20171024225907-0018/77347 for org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
worker.ExecutorRunner (Logging.scala:logInfo(58)) - Launch command: "/home/hadoop/jdk1.7.0_09/bin/java" ......
Executor app-20171024225907-0018/77345 finished with state EXITED message Command exited with code 53 exitStatus 53
Executor log:
ERROR [main] storage.DiskBlockManager (Logging.scala:logError(95)) - Failed to create local dir in . Ignoring this directory.
java.io.IOException: Failed to create a temp directory (under ) after 10 attempts!
Then check spark-env.sh:
export SPARK_LOCAL_DIRS=/data/spark/data
A Spark local directory is configured, but the directory had not been created on the machine, which triggered the error. Create it on every node:
./drun "mkdir -p /data/spark/data"
./drun "chown -R hadoop:hadoop /data/spark"
After creating the directories, the workers were restarted and the error did not recur.
8. Spark worker exits unexpectedly
Worker log:
17/10/25 11:59:58 INFO worker.Worker: Master with url spark://10.10.10.82:7077 requested this worker to reconnect.
17/10/25 11:59:58 INFO worker.Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
17/10/25 11:59:59 INFO worker.Worker: Successfully registered with master spark://10.10.10.82:7077
17/10/25 11:59:59 INFO worker.Worker: Worker cleanup enabled; old application directories will be deleted in: /home/hadoop/spark-1.6.1-bin-2.2.0/work
17/10/25 12:00:00 INFO worker.Worker: Retrying connection to master (attempt # 1)
17/10/25 12:00:00 INFO worker.Worker: Master with url spark://10.10.10.82:7077 requested this worker to reconnect.
17/10/25 12:00:00 INFO worker.Worker: Master with url spark://10.10.10.82:7077 requested this worker to reconnect.
17/10/25 12:00:00 INFO worker.Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
17/10/25 12:00:00 INFO worker.Worker: Asked to launch executor app-20171024225907-0018/119773 for org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
17/10/25 12:00:00 INFO worker.Worker: Connecting to master 10.10.10.82:7077...
17/10/25 12:00:00 INFO spark.SecurityManager: Changing view acls to: hadoop
17/10/25 12:00:00 INFO worker.Worker: Connecting to master 10.10.10.82:7077...
17/10/25 12:00:00 INFO spark.SecurityManager: Changing modify acls to: hadoop
17/10/25 12:00:00 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
17/10/25 12:00:01 ERROR worker.Worker: Worker registration failed: Duplicate worker ID
17/10/25 12:00:01 INFO worker.ExecutorRunner: Launch command: "/home/hadoop/jdk1.7.0_09/bin/java" "-cp" "/home/hadoop/spark-1.6.1-bin-2.2.0/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/*:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/*:/home/hadoop/spark-1.6.1-bin-2.2.0/conf/:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/spark-assembly-1.6.1-hadoop2.2.0.jar:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/spark-1.6.1-bin-2.2.0/lib/datanucleus-core-3.2.10.jar:/home/hadoop/hadoop-2.2.0/etc/hadoop/" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=43546" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.10.10.80:43546" "--executor-id" "119773" "--hostname" "10.10.10.190" "--cores" "1" "--app-id" "app-20171024225907-0018" "--worker-url" "spark://Worker@10.10.10.190:55335"
17/10/25 12:00:02 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:02 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:02 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:03 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:04 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:05 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:05 INFO worker.ExecutorRunner: Killing process!
17/10/25 12:00:05 INFO util.ShutdownHookManager: Shutdown hook called
17/10/25 12:00:06 INFO util.ShutdownHookManager: Deleting directory /data/spark/data/spark-ab442bb5-62e6-4567-bc7c-8d00534ba1a3
Recently, workers have frequently exited abnormally. From the worker log: the master asks the worker to reconnect and re-register; after registering, the worker keeps connecting to the master and then receives the error
ERROR worker.Worker: Worker registration failed: Duplicate worker ID
after which it suddenly kills all of its executors and exits.
The machines showing this problem all have 16 GB of RAM (none of the 32 GB nodes have had a worker exit).
The YARN configuration on these nodes allocates 15 GB to YARN, so the suspicion was that YARN and Spark hit peak load at the same time and Spark could not get the memory it needed, causing the exit.
To verify this, the NodeManager logs were checked: around the time of the incident some containers did kill themselves, but that alone proves little, since the same thing also happens at other times.
To see how much of the node's YARN resources were in use at the time, the ResourceManager log on the active RM was examined; it showed that between 11:46:22 and 11:59:18 the problem node's YARN resource usage stayed at its peak.
So resource contention was indeed the cause. YARN resources on these nodes were lowered to 12 GB / 10 cores (see the yarn-site.xml sketch below); the situation will be monitored further.
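A sketch of the corresponding yarn-site.xml change on the 16 GB nodes (these are standard YARN properties; the values match the adjustment above):
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>12288</value> <!-- 12 GB for YARN containers, leaving room for the Spark worker -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>10</value>
</property>
Restart the NodeManagers for the change to take effect.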
9. Spark (4 DataNodes, 11 NodeManagers) does not use executors on non-DataNode nodes
Problem: on large jobs with tens of thousands of tasks, only the executors on the 4 DataNodes were used; the other 7 machines ran few or no tasks.
Analysis: when the Spark driver assigns tasks, it tries to place each task exactly on the node that holds the data to be processed, to avoid network transfer. When a task cannot be scheduled because its data node's resources happen to be fully taken, Spark waits for a while (controlled by spark.locality.wait, default 3s); only after that wait does it fall back to a worse locality level.
Fix: set spark.locality.wait to 0 so tasks are assigned immediately without waiting (see the sketch below); the service must be restarted for this to take effect.
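A spark-defaults.conf sketch of the change (the per-level variants spark.locality.wait.process / .node / .rack inherit this value unless set explicitly):
spark.locality.wait 0s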
9揍愁、spark自帶參數(shù)不生效問(wèn)題
spark thrift測(cè)試系統(tǒng)經(jīng)常出問(wèn)題杀饵,調(diào)整了driver內(nèi)存參數(shù),依舊報(bào)問(wèn)題朽缎。
常見(jiàn)問(wèn)題狀態(tài):連接spark thrift無(wú)響應(yīng)谜悟,一會(huì)提示OutOfMemoryError: Java heap space
后來(lái)發(fā)現(xiàn)設(shè)置的driver內(nèi)存參數(shù)沒(méi)有生效,環(huán)境配置文件spark-env.sh設(shè)置了SPARK_DAEMON_MEMORY=1024m最筒,覆蓋了啟動(dòng)參數(shù)設(shè)置的
--driver-memory 4G
,導(dǎo)致參數(shù)設(shè)置沒(méi)生效蔚叨。
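A sketch of the fix under that observation: either remove SPARK_DAEMON_MEMORY from spark-env.sh so that --driver-memory applies, or raise it to the intended value, e.g. in spark-env.sh:
export SPARK_DAEMON_MEMORY=4g
and then restart the Spark Thrift Server.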