Background
Kafka collects data in real time from data-collection tools such as Flume or from real-time business interfaces, and serves as a message-buffering component that provides reliable data to the upstream real-time computing framework. Since version 1.3, Spark has supported two mechanisms for integrating with Kafka (the Receiver-based Approach and the Direct Approach); see the official documentation links at the end of this article for details. HBase is used for data storage.
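For orientation, here is a minimal sketch contrasting the two integration styles against this cluster (the ssc StreamingContext and the "my-group" consumer group are placeholders; the Direct Approach is the one used in the code below):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based Approach: a receiver consumes through ZooKeeper,
// which also tracks the consumed offsets
val receiverStream = KafkaUtils.createStream(
  ssc, "hadoop1:2181,hadoop2:2181,hadoop3:2181", "my-group", Map("user_events" -> 1))

// Direct Approach: no receiver; Spark computes offset ranges per batch
// and reads from the brokers directly
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "hadoop1:9092,hadoop2:9092,hadoop3:9092"), Set("user_events"))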
Implementation Approach
Implement a Kafka message producer simulator
Use Spark Streaming with the Direct Approach to consume data from Kafka in real time
Perform the business computation in Spark Streaming and store the results in HBase
Local VM Cluster Configuration
Because the author's machine is resource-constrained, the Hadoop, ZooKeeper, and Kafka clusters are co-located on three hosts named hadoop1, hadoop2, and hadoop3; HBase runs as a single node on hadoop1.
Shortcomings and Limitations
Given the author's limited experience, the code has some design flaws. In particular, the logic that saves Spark Streaming results to HBase performs poorly, because it opens a fresh HBase connection for every record (a per-partition alternative is sketched right after the main class below). Feedback and corrections are welcome.
Code Implementation
Kafka Message Simulator
package clickstream

import java.util.{Properties, Random, UUID}

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.codehaus.jettison.json.JSONObject

/**
 * Created by 郭飛 on 2016/5/31.
 */
object KafkaMessageGenerator {
  private val random = new Random()
  private var pointer = -1
  private val os_type = Array("Android", "IPhone OS", "None", "Windows Phone")

  def click(): Double = {
    random.nextInt(10)
  }

  // Cycle through the device types in order
  def getOsType(): String = {
    pointer = pointer + 1
    if (pointer >= os_type.length) {
      pointer = 0
    }
    os_type(pointer)
  }

  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    // Kafka broker list on the local VM cluster (brokers, not the ZooKeeper quorum)
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", UUID.randomUUID())                        // random user id
        .put("event_time", System.currentTimeMillis.toString) // event timestamp
        .put("os_type", getOsType())                          // device type
        .put("click_count", click())                          // click count

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)
      Thread.sleep(200)
    }
  }
}
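To try the simulator end to end, the user_events topic must exist on the brokers (or be auto-created, if auto.create.topics.enable is on, which is the Kafka 0.8 default). Once started, it prints one message roughly every 200 ms, along the lines of (illustrative values): Message sent: {"uid":"e1b1...","event_time":"1464681600000","os_type":"Android","click_count":3}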
Spark Streaming Main Class
package clickstream

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by 郭飛 on 2016/5/31.
 */
object PageViewStream {
  def main(args: Array[String]): Unit = {
    var masterUrl = "local[2]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("PageViewStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configuration: the topic must match the one the simulator produces to
    val topics = Set("user_events")
    // Kafka broker list on the local VM cluster
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Parse each message value into a JSON object
    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute click counts per user
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          // HBase configuration (the quorum takes host names; the client port is set separately)
          val tableName = "PageViewStream"
          val hbaseConf = HBaseConfiguration.create()
          hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
          hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
          hbaseConf.set("hbase.defaults.for.version.skip", "true")

          // user id
          val uid = pair._1
          // click count
          val click = pair._2

          // assemble the row: row key = uid, column Stat:ClickStat = click count
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))

          val statTable = new HTable(hbaseConf, TableName.valueOf(tableName))
          statTable.setAutoFlush(false, false)
          // write buffer size
          statTable.setWriteBufferSize(3 * 1024 * 1024)
          statTable.put(put)
          // flush the buffered put
          statTable.flushCommits()
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
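As acknowledged in the shortcomings section, building the HBase configuration and table handle once per record is the main performance problem in this class. Below is a minimal per-partition sketch under the same assumptions (table PageViewStream with column family Stat, which must be created beforehand, e.g. with create 'PageViewStream', 'Stat' in the HBase shell); it opens one table handle per partition and flushes once at the end:

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // Build the configuration and table handle once per partition, not once per record
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("hbase.defaults.for.version.skip", "true")
    val statTable = new HTable(hbaseConf, TableName.valueOf("PageViewStream"))
    statTable.setAutoFlush(false, false)
    statTable.setWriteBufferSize(3 * 1024 * 1024)

    partitionOfRecords.foreach { case (uid, click) =>
      val put = new Put(Bytes.toBytes(uid))
      put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
      statTable.put(put)
    }

    // Flush the buffered puts once per partition, then release the handle
    statTable.flushCommits()
    statTable.close()
  })
})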
Maven POM File
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.guofei.spark</groupId>
  <artifactId>RiskControl</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>RiskControl</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <!-- Spark core, streaming, and Kafka integration (Scala 2.10, Spark 1.3.0) -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>

    <!-- HBase and its transitive requirements -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>0.96.2-hadoop2</version>
      <type>pom</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.3.2</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.3</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
      <version>2.5.0</version>
    </dependency>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
      <version>3.6.6.Final</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-protocol</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>3.4.5</version>
    </dependency>
    <dependency>
      <groupId>org.cloudera.htrace</groupId>
      <artifactId>htrace-core</artifactId>
      <version>2.01</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-jaxrs</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-xc</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.6.4</version>
    </dependency>

    <!-- Hadoop -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>commons-configuration</groupId>
      <artifactId>commons-configuration</artifactId>
      <version>1.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-auth</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.4</version>
    </dependency>

    <!-- JSON -->
    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jettison</groupId>
      <artifactId>jettison</artifactId>
      <version>1.1</version>
    </dependency>

    <!-- Redis -->
    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-pool2</artifactId>
      <version>2.2</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-make:transitive</arg>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
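A note on version alignment: the spark-streaming-kafka_2.10 artifact at 1.3.0 is built against the Kafka 0.8 line, which is why the simulator uses the old Scala producer API (kafka.producer.Producer / KeyedMessage); the 0.96.2-hadoop2 HBase artifacts are the Hadoop 2 builds, paired here with the 2.6.4 hadoop-* dependencies.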
FAQ
Maven error when importing json-lib
Failure to find net.sf.json-lib:json-lib:jar:2.3 in http://repo.maven.apache.org/maven2 was cached in the local repository
Solution (from http://stackoverflow.com/questions/4173214/maven-missing-net-sf-json-lib): depend on version 2.4 with the jdk15 classifier:

<dependency>
  <groupId>net.sf.json-lib</groupId>
  <artifactId>json-lib</artifactId>
  <version>2.4</version>
  <classifier>jdk15</classifier>
</dependency>
Error when running the Spark Streaming program
org.apache.spark.SparkException: Task not serializable
Solution: everything referenced inside the nested closures below is serialized and shipped to the executors, so every object used there must be serializable, or be created inside the closure so that it never leaves the executor. That is why the HBase configuration and table are constructed inside the innermost loop rather than on the driver:

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    partitionOfRecords.foreach(pair => {
      // every object referenced in this block must be serializable,
      // or be instantiated here, on the executor
    })
  })
})
Maven packaging fails because dependency jars cannot be resolved
error:not found: object kafka
ERROR import kafka.javaapi.producer.Producer
Solution: on the local Windows 10 machine, the Maven repository path 用戶/郭飛/.m2/ contains Chinese characters; moving the repository to an ASCII-only path fixes the resolution errors.
References
Spark Streaming official documentation
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming + Kafka integration official documentation
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Spark Streaming + Flume integration official documentation
http://spark.apache.org/docs/latest/streaming-flume-integration.html
Spark Streaming custom receivers official documentation
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
Spark Streaming official Scala examples
The 簡單之美 blog