This article walks through writing a WordCount program in Scala and submitting it to run on a Spark cluster. Before following along, you need a working Spark cluster; for setup, see: Spark on Yarn 環(huán)境搭建 (Spark on Yarn environment setup).
Main contents:
- 1. Create the project
- 2. Main program
- 3. Submit and run
Related articles:
1. Spark: computing Pi locally
2. Spark: WordCount on a cluster
3. SparkStreaming: reading data from Kafka
4. SparkStreaming: saving Kafka offsets in Redis
5. SparkStreaming: graceful shutdown
6. SparkStreaming: writing data to Kafka
7. Spark: building a word cloud from short reviews of 《西虹市首富》
1. Create a Maven project and add dependencies
A screenshot of the project structure:
The dependencies are as follows:
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<id>scala-compile-first</id>
<goals>
<goal>compile</goal>
</goals>
<configuration>
<includes>
<include>**/*.scala</include>
</includes>
</configuration>
</execution>
<execution>
<id>scala-test-compile</id>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
This brings in the Spark core dependency and the Scala compiler plugin.
2. Main program
ScalaWordCount.scala is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object ScalaWordCount {

  val LOG = LoggerFactory.getLogger("ScalaWordCount")

  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      LOG.error("Please supply the correct arguments: <input file> <output dir>")
      System.exit(1)
    }

    // Create the configuration
    val conf = new SparkConf()
      .setAppName("ScalaWordCount")

    // Create the SparkContext
    val sc = new SparkContext(conf)

    // WordCount: split lines into words, count each word,
    // and sort by count in descending order
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .repartition(1)
      .sortBy(_._2, false)
      .saveAsTextFile(args(1))

    // Stop the SparkContext
    sc.stop()
  }
}
The program takes 2 arguments: the first is the input file, the second is the output directory.
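The same transformation pipeline can be sketched on plain Scala collections, which is handy for sanity-checking the logic without a cluster (a sketch only; the real job runs on RDDs as above, where `reduceByKey` replaces the `groupBy`/`sum` step):

```scala
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Sample lines mirroring the words.txt used later in this article
    val lines = Seq("hello me", "hello you", "hello her")

    // Same shape as the RDD version: split into words, pair with 1,
    // sum per word, then sort by count descending
    val counts = lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
      .toSeq
      .sortBy(-_._2)

    // (hello,3) comes first; the order of the count-1 words is unspecified
    counts.foreach(println)
  }
}
```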
Run it with:
./bin/spark-submit \
--class me.jinkun.scala.wc.ScalaWordCount \
--master yarn \
--deploy-mode cluster \
--driver-memory 1G \
--executor-memory 1G \
--executor-cores 1 \
"/opt/soft-install/data/spark-yarn-1.0-SNAPSHOT.jar" \
"hdfs://hadoop1:9000/data/input/words.txt" "hdfs://hadoop1:9000/data/output/wc.txt"
Argument 1, hdfs://hadoop1:9000/data/input/words.txt, is the words.txt file on HDFS.
Argument 2, hdfs://hadoop1:9000/data/output/wc.txt, is the output directory on HDFS. Make sure this directory does not exist before running, otherwise the job will fail.
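This is because `saveAsTextFile` treats its argument as a directory to create and refuses to overwrite an existing one. On HDFS the old output is removed with `hdfs dfs -rm -r <path>`; for local-filesystem test runs, a small helper can clear it first (a sketch — `deleteRecursively` is a hypothetical helper, not part of Spark):

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.Comparator

object OutputCleaner {
  // Recursively delete a directory if it exists (local filesystem only)
  def deleteRecursively(dir: Path): Unit =
    if (Files.exists(dir)) {
      Files.walk(dir)
        .sorted(Comparator.reverseOrder()) // delete children before parents
        .forEach(p => Files.delete(p))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical local output path used only for this sketch
    deleteRecursively(Paths.get("/tmp/wc-output"))
  }
}
```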
Contents of words.txt:
hello me
hello you
hello her
Open the Maven Projects panel on the right side of IDEA, package the program into a jar, and run it.
View the generated files in the HDFS web UI.
The result is:
(hello,3)
(you,1)
(her,1)
(me,1)