A Real-Time Clickstream Case with Kafka + Spark Streaming + HBase

Preface

I have been focusing on Spark development recently and am recording my work and learning along the way, hoping to trade notes and grow with other readers.
This article leans toward the hands-on case itself; for framework internals and basic usage, please see the material listed in the references at the end.
Posts on setting up the ZooKeeper/Kafka/HBase/Hadoop cluster environments will follow.
Errors found after publication will be corrected, and content added, promptly; feel free to subscribe.
QQ: 86608625  WeChat: guofei1990123

Background

Kafka continuously records data collected by Flume (or by real-time business interfaces) and, acting as a message buffer, provides a reliable feed to the downstream real-time computation framework. Since version 1.3, Spark has supported two mechanisms for integrating with Kafka: the Receiver-based Approach and the Direct Approach (see the official documentation linked in the references for details). HBase is used for storage.
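
The practical difference between the two is where offsets are tracked. Below is a minimal sketch of both calls, using the Spark 1.3 API from spark-streaming-kafka_2.10; ssc and the imports are as in the main class later in this article, and the consumer group id is a placeholder:

// Receiver-based: a long-running receiver consumes via ZooKeeper, which tracks offsets
val receiverStream = KafkaUtils.createStream(ssc,
  "hadoop1:2181,hadoop2:2181,hadoop3:2181", // ZooKeeper quorum
  "clickstream-group",                      // consumer group id (placeholder)
  Map("user_events" -> 1))                  // topic -> number of receiver threads

// Direct: Spark queries the brokers for offsets itself; no receiver needed
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "hadoop1:9092,hadoop2:9092,hadoop3:9092"),
  Set("user_events"))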

Implementation Approach

  1. Implement a Kafka message producer simulator
  2. Consume the Kafka data from Spark Streaming in real time via the Direct Approach
  3. Run the business computation in Spark Streaming and store the results in HBase

Local VM Cluster Environment

My machine has limited resources, so the Hadoop/ZooKeeper/Kafka clusters are co-located on three VMs named hadoop1, hadoop2, and hadoop3; HBase runs as a single node on hadoop1.
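
The streaming job below writes to an HBase table named PageViewStream with a column family Stat, and assumes that table already exists. A one-off creation sketch using the HBase 0.96 client API (run once against the cluster; the object name is mine, not part of the job):

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin

object CreatePageViewTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop1")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val admin = new HBaseAdmin(conf)
    if (!admin.tableExists("PageViewStream")) {
      val desc = new HTableDescriptor(TableName.valueOf("PageViewStream"))
      desc.addFamily(new HColumnDescriptor("Stat")) // column family used by the job
      admin.createTable(desc)
    }
    admin.close()
  }
}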

Known Limitations

My experience is limited and the design has some flaws; in particular, the logic that saves the Spark Streaming results to HBase opens a new table connection for every record and performs poorly (an improved sketch follows the Spark Streaming main class). Suggestions are welcome so I can fix things promptly.

Code Implementation

Kafka message simulator

package clickstream
import java.util.{Properties, Random, UUID}
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.codehaus.jettison.json.JSONObject

/**
  * Created by 郭飛 on 2016/5/31.
  */
object KafkaMessageGenerator {
  private val random = new Random()
  private var pointer = -1
  private val os_type = Array(
    "Android", "IPhone OS",
    "None", "Windows Phone")

  // random click count in [0, 9]; Int so the consumer's getInt("click_count") parses cleanly
  def click(): Int = {
    random.nextInt(10)
  }

  def getOsType(): String = {
    // cycle through the os_type array, wrapping at the end
    pointer = (pointer + 1) % os_type.length
    os_type(pointer)
  }

  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    // Kafka broker list on the local VMs (not a ZooKeeper address)
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while(true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", UUID.randomUUID())//隨機(jī)生成用戶id
        .put("event_time", System.currentTimeMillis.toString) //記錄時(shí)間發(fā)生時(shí)間
        .put("os_type", getOsType) //設(shè)備類型
        .put("click_count", click) //點(diǎn)擊次數(shù)

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}
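
With the 200 ms sleep, the simulator emits roughly five events per second. A sent message looks roughly like the line below (the values vary per message, and jettison does not guarantee key order):

Message sent: {"uid":"4cb0b0b6-3b98-4a9c-9c2d-1d0a8e0f3a7e","event_time":"1464681600000","os_type":"Android","click_count":7}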

Spark Streaming main class

package clickstream
import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by 郭飛 on 2016/5/31.
  */
object PageViewStream {
  def main(args: Array[String]): Unit = {
    var masterUrl = "local[2]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("PageViewStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events") // must match the topic the simulator produces to
    // Kafka broker list on the local VMs (not a ZooKeeper address)
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    // Parse each Kafka message value into a JSON event
    val events = kafkaStream.map { case (_, json) => JSONObject.fromObject(json) }
    // Aggregate click counts per user within each batch
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          // HBase configuration: the quorum takes ZooKeeper hosts; the client port is set separately
          val tableName = "PageViewStream"
          val hbaseConf = HBaseConfiguration.create()
          hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
          hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
          hbaseConf.set("hbase.defaults.for.version.skip", "true")
          // user id
          val uid = pair._1
          // click count
          val click = pair._2
          // assemble the row: rowkey = uid, column Stat:ClickStat = click count
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
          val StatTable = new HTable(hbaseConf, TableName.valueOf(tableName))
          StatTable.setAutoFlush(false, false)
          // client-side write buffer
          StatTable.setWriteBufferSize(3 * 1024 * 1024)
          StatTable.put(put)
          // flush the buffered put and release the table
          StatTable.flushCommits()
          StatTable.close()
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()

  }

}
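
As noted under Known Limitations, the write path above opens a new HTable for every record, which dominates the cost. A minimal improvement, assuming the same Stat:ClickStat layout, is to build the configuration and table once per partition and let the client buffer all puts before a single flush:

    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        // build non-serializable clients here, on the executor, once per partition
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
        hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
        val table = new HTable(hbaseConf, TableName.valueOf("PageViewStream"))
        table.setAutoFlush(false, false)
        table.setWriteBufferSize(3 * 1024 * 1024)
        partitionOfRecords.foreach { case (uid, click) =>
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
          table.put(put) // only fills the client-side write buffer
        }
        table.flushCommits() // one batched round trip per partition
        table.close()
      })
    })

A further step would be sharing a connection across batches, but the per-partition version already removes the per-record connection setup.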

Maven POM

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.guofei.spark</groupId>
  <artifactId>RiskControl</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>RiskControl</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <!--Spark core 及 streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <!-- Spark整合Kafka-->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>

    <!-- 整合Hbase-->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>0.96.2-hadoop2</version>
      <type>pom</type>
    </dependency>
    <!--Hbase依賴 -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.3.2</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.3</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>

    <dependency>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
      <version>2.5.0</version>
    </dependency>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
      <version>3.6.6.Final</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-protocol</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>3.4.5</version>
    </dependency>
    <dependency>
      <groupId>org.cloudera.htrace</groupId>
      <artifactId>htrace-core</artifactId>
      <version>2.01</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-jaxrs</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-xc</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.6.4</version>
    </dependency>

    <!-- Hadoop依賴包-->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>commons-configuration</groupId>
      <artifactId>commons-configuration</artifactId>
      <version>1.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-auth</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.4</version>
    </dependency>

    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>

    <dependency>
      <groupId>org.codehaus.jettison</groupId>
      <artifactId>jettison</artifactId>
      <version>1.1</version>
    </dependency>

    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-pool2</artifactId>
      <version>2.2</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-make:transitive</arg>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

FAQ

  1. Maven fails to resolve json-lib
    Failure to find net.sf.json-lib:json-lib:jar:2.3 in
    http://repo.maven.apache.org/maven2 was cached in the local
    repository
    Solution: declare the jdk15 classifier (see
    http://stackoverflow.com/questions/4173214/maven-missing-net-sf-json-lib):
    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>
  2. The Spark Streaming program fails with
    org.apache.spark.SparkException: Task not serializable
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(
          // every object referenced inside these closures must be serializable,
          // or be constructed inside the closure itself
        )
      })
    })
    The closures are serialized and shipped to the executors, which is why the HBase
    configuration and HTable are created inside the innermost closure (and, in the
    improved sketch after the Spark Streaming main class, once per partition).
  3. Maven packaging fails with missing dependencies
    error: not found: object kafka
    ERROR import kafka.javaapi.producer.Producer
    Solution: on the local Windows 10 machine, the Maven repository path 用戶/郭飛/.m2/
    contained non-ASCII characters; moving the local repository to an ASCII-only path
    fixed the resolution.

References

  1. Spark Streaming + Kafka Integration Guide (official documentation): http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html