1. Spark Streaming Concepts
Spark Streaming is a near-real-time data processing framework that supports scalable, high-throughput, fault-tolerant processing of live data streams. It can ingest data from sources such as Kafka and HDFS, process it, and save the results to HDFS, databases, and other sinks.
Spark Streaming receives live input data streams and divides the data into micro-batches, which are then processed by the Spark engine to produce the final result stream batch by batch.
二颜矿、基本操作
2.1初始化StreamingContext
Durations指定接收數(shù)據(jù)的延遲時間寄猩,多久觸發(fā)一次job
SparkConf conf = new SparkConf().setMaster("local").setAppName("alarmCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
2.2 Basic Operations
1: streamingContext.start() — start receiving and processing data
2: streamingContext.stop() — stop the streaming computation (a minimal lifecycle sketch follows below)
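As an illustration (not from the original code), here is a minimal sketch of the start/stop lifecycle; the socket source, host, port and timeout are placeholder assumptions:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class LifecycleSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("lifecycle");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // define the whole computation before start(); nothing can be added afterwards
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        lines.print();

        jssc.start();                              // start receiving data
        jssc.awaitTerminationOrTimeout(60_000);    // block for at most 60 s in this sketch
        // stop(false, true): keep the SparkContext, stop gracefully after in-flight batches finish
        jssc.stop(false, true);
    }
}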
2.3 Points to Note
1: Once the context has been started, no new streaming computations can be set up or added to it
2: Only one StreamingContext can be active in a JVM at the same time
2.4 DStream
A DStream (discretized stream) is a sequence of RDDs produced over time.
1: Each InputDStream is associated with a single receiver (file streams need no receiver); a receiver receives exactly one data stream, but a Spark Streaming application can create multiple input DStreams
2: Each receiver occupies one core, so the application must be allocated more cores than it has receivers; otherwise the data cannot all be processed (see the sketch after this list)
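A minimal sketch of this point (not from the original; hosts and ports are placeholders): two socket sources mean two receivers and therefore two cores, so the master is set to local[3] to leave one core for processing, and the two input streams are merged with union.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MultiReceiverSketch {
    public static void main(String[] args) throws InterruptedException {
        // 2 receivers occupy 2 cores, so allocate at least 3 (receivers + processing)
        SparkConf conf = new SparkConf().setMaster("local[3]").setAppName("multiReceiver");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // each input DStream gets its own receiver
        JavaDStream<String> streamA = jssc.socketTextStream("hostA", 9999);
        JavaDStream<String> streamB = jssc.socketTextStream("hostB", 9999);

        // union the two input streams into a single DStream for downstream processing
        JavaDStream<String> merged = streamA.union(streamB);
        merged.print();

        jssc.start();
        jssc.awaitTermination();
    }
}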
3. Reading Data from Kafka
Data is read from Kafka through KafkaUtils, and there are two ways to do it: createStream (receiver-based) and createDirectStream (direct).
3.1 createStream: the receiver-based approach
1: Kafka data is continuously received by Kafka receivers running on Spark workers/executors; this approach uses Kafka's high-level consumer API
2: The received data is stored in executor memory and also written to a WAL (Write Ahead Log); only after the data has been persisted to the log does the Kafka receiver update the offsets in ZooKeeper
3: The received data and the WAL location information are stored reliably and are used to re-read the data after a failure (a sketch of this approach follows below).
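A sketch of the receiver-based approach, assuming the older spark-streaming-kafka-0-8 artifact; the ZooKeeper address, group id, topic and checkpoint path are placeholders, and the WAL is switched on via spark.streaming.receiver.writeAheadLog.enable:
import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class ReceiverBasedSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("receiverBased")
                // persist received data to a write ahead log before acknowledging it
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint("/tmp/receiver-checkpoint"); // the WAL requires a checkpoint directory

        // topic -> number of receiver threads
        Map<String, Integer> topicMap = Collections.singletonMap("alarmTopic", 1);

        // high-level consumer API: the receiver tracks offsets in ZooKeeper
        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(jssc, "zkHost:2181", "alarmGroup", topicMap);

        messages.map(tuple -> tuple._2()).print();

        jssc.start();
        jssc.awaitTermination();
    }
}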
3.2 createDirectStream: the direct approach
With this approach the application manages offsets itself, for example through a checkpoint or a database.
A complete example, the SparkStreaming driver class:
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SparkStreaming {
    private static String CHECKPOINT_DIR = "/Users/dbq/Documents/checkpoint";

    public static void main(String[] args) throws InterruptedException {
        // initialize the StreamingContext with a 10-second batch interval
        SparkConf conf = new SparkConf().setMaster("local").setAppName("alarmCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        jssc.checkpoint(CHECKPOINT_DIR);

        // Kafka consumer configuration
        Map<String, Object> kafkaParams = new HashMap<>();
        // "metadata.broker.list" is a legacy (0.8) setting; the 0-10 consumer reads bootstrap.servers
        kafkaParams.put("metadata.broker.list", "172.*.*.6:9092,172.*.*.7:9092,172.*.*.8:9092");
        kafkaParams.put("bootstrap.servers", "172.*.*.6:9092,172.*.*.7:9092,172.*.*.8:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "alarmGroup");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", true);

        Collection<String> topics = Arrays.asList("alarmTopic");

        // create a direct stream: no receiver, Kafka partitions are read in parallel
        JavaInputDStream<ConsumerRecord<String, String>> messages =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
                );

        // extract the record values
        JavaDStream<String> lines = messages.map((Function<ConsumerRecord<String, String>, String>) record -> record.value());

        // write each micro-batch to a local file
        lines.foreachRDD((VoidFunction<JavaRDD<String>>) record -> {
            List<String> list = record.collect();
            for (int i = 0; i < list.size(); i++) {
                writeToFile(list.get(i));
            }
        });
        lines.print();

        jssc.start();
        jssc.awaitTermination();
        System.out.println("----------------end");
    }

    // write the result to a file; it could also be written to MongoDB, HDFS, etc.
    private synchronized static void writeToFile(String content) {
        String fileName = "/Users/dbq/Documents/result.txt";
        FileWriter writer = null;
        try {
            writer = new FileWriter(fileName, true);
            writer.write(content + " \r\n");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null) {
                    writer.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
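Note that the example above keeps enable.auto.commit set to true for simplicity, so Kafka itself tracks the offsets. To manage offsets yourself, as described in 3.2, one common pattern (a sketch based on the spark-streaming-kafka-0-10 integration guide, not part of the original code) is to set enable.auto.commit to false and commit the processed offset ranges explicitly; the OffsetRange values could equally be stored in a checkpoint or a database. The classes used come from org.apache.spark.streaming.kafka010 (HasOffsetRanges, OffsetRange, CanCommitOffsets), and the snippet would replace the foreachRDD block in the example:
messages.foreachRDD(rdd -> {
    // offset ranges covered by this micro-batch, one per Kafka partition
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    // ... process the batch here (e.g. the writeToFile loop above) ...

    // commit offsets back to Kafka only after processing succeeded;
    // alternatively persist offsetRanges to a database for custom bookkeeping
    ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
});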
4. Kafka Integration
The producer configuration class:
@Component // registered as a bean so it can be injected into PushMessageConfig below
public class KafkaProducerConfig {
    @Value("${spring.kafka.bootstrap-servers}")
    private String broker;
    @Value("${spring.kafka.producer.acks}")
    private String acks;
    @Value("${spring.kafka.producer.retries}")
    private Integer retries;
    @Value("${spring.kafka.producer.batch-size}")
    private Integer batchSize;
    @Value("${spring.kafka.producer.buffer-memory}")
    private long bufferMemory;

    public Map<String, Object> getConfig() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, broker);
        props.put(ProducerConfig.ACKS_CONFIG, acks);
        props.put(ProducerConfig.RETRIES_CONFIG, retries);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, batchSize);
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, bufferMemory);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        return props;
    }
}
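The @Value placeholders above map to entries like the following in application.properties; the broker list matches the Spark example, while the other values are illustrative assumptions, not from the original:
spring.kafka.bootstrap-servers=172.*.*.6:9092,172.*.*.7:9092,172.*.*.8:9092
spring.kafka.producer.acks=all
spring.kafka.producer.retries=3
spring.kafka.producer.batch-size=16384
spring.kafka.producer.buffer-memory=33554432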
The Kafka producer:
@Component
public class Producer {
    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    public void send(Message message) {
        kafkaTemplate.send("alarmTopic", JSONObject.toJSONString(message));
    }
}
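The Message type is not shown in the original; a hypothetical sketch of what it might look like (all field names are assumptions), serialized to JSON by fastjson's JSONObject.toJSONString:
// hypothetical Message POJO; field names are illustrative only
public class Message {
    private String alarmId;
    private String content;
    private long timestamp;

    public String getAlarmId() { return alarmId; }
    public void setAlarmId(String alarmId) { this.alarmId = alarmId; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
}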
Configuring the KafkaTemplate:
@Component
public class PushMessageConfig {
    @Autowired
    private PushProducerListener producerListener;
    @Autowired
    private KafkaProducerConfig producerConfig;

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        @SuppressWarnings({ "unchecked", "rawtypes" })
        ProducerFactory<String, String> factory = new DefaultKafkaProducerFactory<>(producerConfig.getConfig());
        KafkaTemplate<String, String> kafkaTemplate = new KafkaTemplate<>(factory, true);
        kafkaTemplate.setProducerListener(producerListener);
        kafkaTemplate.setDefaultTopic("alarmTopic");
        return kafkaTemplate;
    }
}
Configuring the producer listener:
@Component
public class PushProducerListener implements ProducerListener<String, String> {
    private Logger logger = LoggerFactory.getLogger(PushProducerListener.class);

    @Override
    public void onSuccess(String topic, Integer partition, String key, String value,
                          RecordMetadata recordMetadata) {
        // the record was successfully delivered to the message queue
        System.out.println("Sent successfully: " + value);
        logger.info("onSuccess. " + key + " : " + value);
    }

    @Override
    public void onError(String topic, Integer partition, String key, String value,
                        Exception exception) {
        logger.error("onError. " + key + " : " + value);
        logger.error("catching an error when sending data to mq.", exception);
        // sending to the message queue failed; handle the record locally instead
    }

    @Override
    public boolean isInterestedInSuccess() {
        // return true so onSuccess is invoked after a successful send; false disables the callback
        return true;
    }
}