Build the Spark Docker image
Download spark-2.4.8-bin-hadoop2.7.tgz
Note: do not download the "without hadoop" Spark package here. Otherwise the built image will fail at runtime with missing classes, e.g. log4j.
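For example, from the Apache release archive (URL assumed from the standard archive layout):
wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz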
tar -xvf spark-2.4.8-bin-hadoop2.7.tgz
cd spark-2.4.8-bin-hadoop2.7
Edit the Spark Dockerfile
vim kubernetes/dockerfiles/spark/Dockerfile
On line 18, replace FROM openjdk:8-jdk-slim with FROM openjdk:8-jdk-slim-buster.
The reason: the default openjdk:8-jdk-slim base image is Debian 11, and the spark-py image built later depends on this base image. On Debian 11 the python3 package is Python 3.8 or newer, while Spark 2.4 does not support Python versions above 3.7, so PySpark fails with errors like "TypeError: an integer is required (got type bytes)".
Switching the base image to Debian 10 (buster) gives a python3 at version 3.7.
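If you prefer to script the edit instead of using vim, a sed one-liner like this should do it (a sketch; verify the Dockerfile afterwards):
sed -i 's|FROM openjdk:8-jdk-slim$|FROM openjdk:8-jdk-slim-buster|' kubernetes/dockerfiles/spark/Dockerfile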
Build the images
bin/docker-image-tool.sh -t v2.4.8 build
The apt-get sources are overseas, so the build can be slow; using a proxy is best.
Without a proxy you can switch to a China mirror instead (note: switching mirrors can introduce package dependency problems that make the spark-py image fail to build).
vim kubernetes/dockerfiles/spark/Dockerfile
Between lines 29 and 31, add:
ADD sources.list /etc/apt/sources.list
Then place the sources.list file in the spark-2.4.8-bin-hadoop2.7 directory.
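A sketch of such a sources.list, assuming Debian 10 (buster) and the Aliyun mirror (the exact mirror lines are an assumption; adjust to your environment):
cat > sources.list <<'EOF'
deb https://mirrors.aliyun.com/debian/ buster main contrib non-free
deb https://mirrors.aliyun.com/debian-security buster/updates main
deb https://mirrors.aliyun.com/debian/ buster-updates main contrib non-free
EOF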
When the build finishes there will be three images:
spark, spark-py, and spark-r.
Push all three to your image registry.
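docker-image-tool.sh can handle the tagging and pushing itself via its -r option and push subcommand; a sketch reusing the ACR registry that appears later in this doc (run docker login first):
bin/docker-image-tool.sh -r acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops -t v2.4.8 build
bin/docker-image-tool.sh -r acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops -t v2.4.8 push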
Spark with OSS support
To add OSS support, you can modify the Spark Dockerfile again:
vim kubernetes/dockerfiles/spark/Dockerfile
Put the following ADD lines for the JAR files right below the COPY data /opt/spark/data line:
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
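After adding these lines, rebuild with the same command as before:
bin/docker-image-tool.sh -t v2.4.8 build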
Alternatively, keep the original image as-is and write a separate Dockerfile that builds a new image on top of it:
FROM acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/spark-py:v2.4.8
RUN mkdir -p /opt/spark/jars
# To use OSS (reading OSS data or writing event logs to OSS), add the following JARs to the image
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
docker build -t ack-spark-oss:v2.4.8 .
docker tag
docker push
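A sketch of those two steps, reusing the registry path from the manifests below (adjust to your own registry):
docker tag ack-spark-oss:v2.4.8 acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-oss:v2.4.8
docker push acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-oss:v2.4.8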
Spark on ACK YAML
This YAML is what submits the Spark job.
The Scala/Java and Python variants differ slightly.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi
namespace: default
spec:
type: Scala
mode: cluster
image: "acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-2.4.5:v9"
imagePullPolicy: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "oss://qa-oss/spark-examples_2.11-2.4.8.jar"
sparkConf:
"spark.eventLog.enabled": "true"
"spark.eventLog.dir": "oss://qa-oss/spark-events"
"spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
"spark.hadoop.fs.oss.endpoint": "oss-cn-beijing-internal.aliyuncs.com"
"spark.hadoop.fs.oss.accessKeySecret": "OSd0RVN"
"spark.hadoop.fs.oss.accessKeyId": "LTADXrW"
sparkVersion: "2.4.5"
imagePullSecrets: [spark]
restartPolicy:
type: Never
driver:
cores: 2
coreLimit: "2"
memory: "3g"
memoryOverhead: "1g"
labels:
version: 2.4.5
serviceAccount: spark
annotations:
k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
k8s.aliyun.com/eci-image-cache: "true"
executor:
cores: 2
instances: 5
memory: "3g"
memoryOverhead: "1g"
labels:
version: 2.4.5
annotations:
k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
k8s.aliyun.com/eci-image-cache: "true"
If your image registry is public, the imagePullSecrets parameter is not needed.
If the registry requires authentication, imagePullSecrets must be set; the [spark] entry names a Kubernetes Secret containing the registry username and password.
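Such a Secret can be created with kubectl; a sketch where the server is taken from the image field above and the credentials are placeholders:
kubectl create secret docker-registry spark \
  --docker-server=acr-test01-registry.cn-beijing.cr.aliyuncs.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n default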
mainApplicationFile is the location of the job artifact. It can live on oss:// or hdfs://; with local:// the file must already be inside the image.
The sparkConf section configures spark-history event logging; delete it if you do not need it.
The Python variant below sets pythonVersion and drops mainClass, but is otherwise the same:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi
namespace: default
spec:
type: Python
mode: cluster
image: "acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-2.4.5:v9"
imagePullPolicy: Always
mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
sparkVersion: "2.4.5"
pythonVersion: "3"
imagePullSecrets: [spark]
restartPolicy:
type: Never
driver:
cores: 2
coreLimit: "2"
memory: "3g"
memoryOverhead: "1g"
labels:
version: 2.4.5
serviceAccount: spark
annotations:
k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
k8s.aliyun.com/eci-image-cache: "true"
executor:
cores: 2
instances: 5
memory: "3g"
memoryOverhead: "1g"
labels:
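Once saved (e.g. as spark-pi.yaml, a filename assumed here), the job can be submitted and inspected through the spark-operator CRD; the driver pod name follows the default <name>-driver pattern:
kubectl apply -f spark-pi.yaml
kubectl get sparkapplication -n default
kubectl describe sparkapplication spark-pi -n default
kubectl logs spark-pi-driver -n default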