Preface
I have been focusing on Spark development recently, and this post records part of my work and learning; I hope to exchange ideas and grow together with readers.
This article leans toward a hands-on case study; for framework internals and basic usage, please refer to the related articles listed in the references at the end of this post.
Posts on setting up the related ZooKeeper/Kafka/HBase/Hadoop cluster environments will follow.
Errors found after publication will be corrected promptly and new content added; subscriptions are welcome.
QQ: 86608625  WeChat: guofei1990123
Background
Kafka continuously ingests data collected by Flume or pushed from the real-time interfaces of business systems, and serves as a message buffer that provides reliable data to the downstream real-time computation framework. Since version 1.3, Spark supports two mechanisms for integrating with Kafka (the Receiver-based Approach and the Direct Approach); for details see the official documentation linked at the end of this article. HBase is used for storage.
Implementation Approach
- Implement a Kafka message producer simulator
- Spark Streaming pulls data from Kafka in real time using the Direct Approach
- Spark Streaming performs the business computation and writes the results to HBase
Local VM Cluster Environment
Because my machine has limited resources, the Hadoop/ZooKeeper/Kafka clusters are all co-located on hosts named hadoop1, hadoop2, and hadoop3; HBase runs as a single node on hadoop1.
Limitations
Due to my limited experience, the code has some design flaws; for example, the logic that writes the Spark Streaming results to HBase performs poorly (a per-partition variant is sketched right after the streaming class below). Suggestions are welcome so I can correct issues promptly.
Code
Kafka Message Simulator
package clickstream
import java.util.{Properties, Random, UUID}
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.codehaus.jettison.json.JSONObject
/**
 * Created by 郭飛 on 2016/5/31.
 */
object KafkaMessageGenerator {
private val random = new Random()
private var pointer = -1
private val os_type = Array(
"Android", "IPhone OS",
"None", "Windows Phone")
def click() : Double = {
random.nextInt(10)
}
def getOsType(): String = {
// cycle through the os_type array in order
pointer = (pointer + 1) % os_type.length
os_type(pointer)
}
def main(args: Array[String]): Unit = {
val topic = "user_events"
// Kafka broker list on the local VMs (not the ZooKeeper address)
val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
val props = new Properties()
props.put("metadata.broker.list", brokers)
props.put("serializer.class", "kafka.serializer.StringEncoder")
val kafkaConfig = new ProducerConfig(props)
val producer = new Producer[String, String](kafkaConfig)
while(true) {
// prepare event data
val event = new JSONObject()
event
.put("uid", UUID.randomUUID())//隨機(jī)生成用戶id
.put("event_time", System.currentTimeMillis.toString) //記錄時(shí)間發(fā)生時(shí)間
.put("os_type", getOsType) //設(shè)備類型
.put("click_count", click) //點(diǎn)擊次數(shù)
// produce event message
producer.send(new KeyedMessage[String, String](topic, event.toString))
println("Message sent: " + event)
Thread.sleep(200)
}
}
}
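As a side note, kafka.producer.Producer used above is the old Scala producer API that ships with Kafka 0.8; newer Kafka versions deprecate it in favor of org.apache.kafka.clients.producer.KafkaProducer. The following is only a minimal sketch of the same simulator on the new API, assuming the org.apache.kafka:kafka-clients dependency is added to the POM; the class name KafkaMessageGeneratorNewApi is just an illustrative placeholder.
package clickstream

import java.util.{Properties, UUID}
// assumes org.apache.kafka:kafka-clients is on the classpath
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaMessageGeneratorNewApi {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "hadoop1:9092,hadoop2:9092,hadoop3:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    while (true) {
      // same kind of payload as above, built as a plain JSON string here
      val event = s"""{"uid":"${UUID.randomUUID()}","event_time":"${System.currentTimeMillis}","click_count":1}"""
      producer.send(new ProducerRecord[String, String]("user_events", event))
      Thread.sleep(200)
    }
  }
}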
Spark Streaming Main Class
package clickstream
import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by 郭飛 on 2016/5/31.
*/
object PageViewStream {
def main(args: Array[String]): Unit = {
var masterUrl = "local[2]"
if (args.length > 0) {
masterUrl = args(0)
}
// Create a StreamingContext with the given master URL
val conf = new SparkConf().setMaster(masterUrl).setAppName("PageViewStream")
val ssc = new StreamingContext(conf, Seconds(5))
// Kafka configurations
// must match the topic written to by the Kafka message simulator above
val topics = Set("user_events")
// Kafka broker list on the local VMs
val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
val kafkaParams = Map[String, String](
"metadata.broker.list" -> brokers,
"serializer.class" -> "kafka.serializer.StringEncoder")
// Create a direct stream
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val events = kafkaStream.flatMap(line => {
val data = JSONObject.fromObject(line._2)
Some(data)
})
// Compute user click times
val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
userClicks.foreachRDD(rdd => {
rdd.foreachPartition(partitionOfRecords => {
partitionOfRecords.foreach(pair => {
// HBase configuration (note: building it once per record is expensive, see the sketch below)
val tableName = "PageViewStream"
val hbaseConf = HBaseConfiguration.create()
// the quorum takes host names only; the client port is set separately below
hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
hbaseConf.set("hbase.defaults.for.version.skip", "true")
// user id
val uid = pair._1
// click count
val click = pair._2
// assemble the Put
val put = new Put(Bytes.toBytes(uid))
put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
val StatTable = new HTable(hbaseConf, TableName.valueOf(tableName))
StatTable.setAutoFlush(false, false)
// client-side write buffer (3 MB)
StatTable.setWriteBufferSize(3*1024*1024)
StatTable.put(put)
// flush buffered puts to HBase
StatTable.flushCommits()
})
})
})
ssc.start()
ssc.awaitTermination()
}
}
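As noted in the Limitations section, the code above builds a new HBaseConfiguration and HTable for every single record, which is very slow and also forces everything else in the closure to be serializable. A minimal sketch of a per-partition variant of the write path, keeping the same HBase 0.96 HTable API (not the original author's code), would look roughly like this:
userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // create the configuration and table once per partition, on the executor,
    // so nothing HBase-related has to be serialized into the closure
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("hbase.defaults.for.version.skip", "true")
    val statTable = new HTable(hbaseConf, TableName.valueOf("PageViewStream"))
    statTable.setAutoFlush(false, false)
    statTable.setWriteBufferSize(3 * 1024 * 1024)
    partitionOfRecords.foreach { case (uid, click) =>
      val put = new Put(Bytes.toBytes(uid))
      put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
      statTable.put(put)
    }
    // flush the buffered puts once per partition and release the table
    statTable.flushCommits()
    statTable.close()
  })
})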
Maven POM File
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.guofei.spark</groupId>
<artifactId>RiskControl</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>RiskControl</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<!--Spark core 及 streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<!-- Spark整合Kafka-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<!-- 整合Hbase-->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>0.96.2-hadoop2</version>
<type>pom</type>
</dependency>
<!--Hbase依賴 -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.3</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
<version>3.6.6.Final</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>0.96.2-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.5</version>
</dependency>
<dependency>
<groupId>org.cloudera.htrace</groupId>
<artifactId>htrace-core</artifactId>
<version>2.01</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-jaxrs</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-xc</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.4</version>
</dependency>
<!-- Hadoop依賴包-->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>net.sf.json-lib</groupId>
<artifactId>json-lib</artifactId>
<version>2.4</version>
<classifier>jdk15</classifier>
</dependency>
<dependency>
<groupId>org.codehaus.jettison</groupId>
<artifactId>jettison</artifactId>
<version>1.1</version>
</dependency>
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-pool2</artifactId>
<version>2.2</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-make:transitive</arg>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
FAQ
- Maven fails to resolve json-lib
Failure to find net.sf.json-lib:json-lib:jar:2.3 in
http://repo.maven.apache.org/maven2 was cached in the local
repository
Solution:
http://stackoverflow.com/questions/4173214/maven-missing-net-sf-json-lib
<dependency>
<groupId>net.sf.json-lib</groupId>
<artifactId>json-lib</artifactId>
<version>2.4</version>
<classifier>jdk15</classifier>
</dependency> - 執(zhí)行Spark-Streaming程序報(bào)錯(cuò)
org.apache.spark.SparkException: Task not serializable
userClicks.foreachRDD(rdd => {
rdd.foreachPartition(partitionOfRecords => {
partitionOfRecords.foreach(
Every object referenced inside this block must be serializable; anything that is not (for example the HBase configuration or table) should be created inside foreachPartition, as in the per-partition sketch shown after the streaming class above.
})
})
})
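Besides creating such objects inside foreachPartition, another common workaround is to hold the non-serializable resource in a Scala object so that it is initialized lazily once per executor JVM and is never captured by the closure Spark ships to the executors. A minimal illustrative sketch, with the hypothetical name HBaseConfHolder:
object HBaseConfHolder {
  // initialized lazily, once per executor JVM; Hadoop's Configuration is not
  // java-serializable, so it must never travel inside a Spark closure
  lazy val conf = {
    val c = HBaseConfiguration.create()
    c.set("hbase.zookeeper.quorum", "hadoop1")
    c.set("hbase.zookeeper.property.clientPort", "2181")
    c
  }
}
// inside foreachPartition: val table = new HTable(HBaseConfHolder.conf, TableName.valueOf("PageViewStream"))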
- Maven packaging fails because dependency jars cannot be found
error:not found: object kafka
ERROR import kafka.javaapi.producer.Producer
Solution: on my local Windows 10 system, the Maven local repository path 用戶/郭飛/.m2/ contains Chinese characters; move the local repository to a path without non-ASCII characters.
References
- Spark Streaming programming guide
  http://spark.apache.org/docs/latest/streaming-programming-guide.html
- Spark Streaming + Kafka integration guide
  http://spark.apache.org/docs/latest/streaming-kafka-integration.html
- Spark Streaming + Flume integration guide
  http://spark.apache.org/docs/latest/streaming-flume-integration.html
- Spark Streaming custom receivers guide
  http://spark.apache.org/docs/latest/streaming-custom-receivers.html
- Official Spark Streaming Scala examples
  https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming
- 簡單之美 blog
  http://shiyanjun.cn/archives/1097.html