Kafka Streams Introduction (Part 1)
Recently at work I have frequently needed to aggregate large volumes of heartbeat data and then compute different kinds of intermediate data sets from those aggregates for downstream processing.
My original approach was to write a number of jobs myself and coordinate their progress with ZooKeeper and the like. The problem with that model is that it takes a lot of code to guarantee data is not consumed twice, and when the program hits an exception some data can still be lost.
So I am looking for a mainstream stream-processing solution, one that also supports batch processing of historical data.
I compared Spark, Storm, and Kafka Streams. I have no hands-on big-data experience at all, but my impression is that the first two are considerably more mature, while Kafka Streams is newer and has relatively few resources available.
However, the first two are framework-level solutions. Taking Spark as an example, you generally need to deploy a dedicated Spark cluster (unless your organization already has one you can use), which we do not have here, and the hardware requirements for building one are high.
Kafka Streams, by contrast, is just a library: its only dependency is Kafka itself and its hardware footprint is small, so I decided to study it first and, if it proves feasible, put it into production.
Below is an excerpt of the simplest getting-started example (Confluent's WordCount example).
I will continue to flesh this out in later posts.
package io.confluent.examples.streams;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;
import java.util.Arrays;
import java.util.Properties;
import java.util.regex.Pattern;
/**
* Demonstrates, using the high-level KStream DSL, how to implement the WordCount program that
* computes a simple word occurrence histogram from an input text. This example uses lambda
* expressions and thus works with Java 8+ only.
* <p>
* In this example, the input stream reads from a topic named "TextLinesTopic", where the values of
* messages represent lines of text; and the histogram output is written to topic
* "WordsWithCountsTopic", where each record is an updated count of a single word, i.e. {@code word (String) -> currentCount (Long)}.
* <p>
* Note: Before running this example you must 1) create the source topic (e.g. via {@code kafka-topics --create ...}),
* then 2) start this example and 3) write some data to the source topic (e.g. via {@code kafka-console-producer}).
* Otherwise you won't see any data arriving in the output topic.
* <p>
* <br>
* HOW TO RUN THIS EXAMPLE
* <p>
* 1) Start ZooKeeper and Kafka. Please refer to the Kafka/Confluent QuickStart guide.
* <p>
* 2) Create the input and output topics used by this example.
* <pre>
* {@code
* $ bin/kafka-topics --create --topic TextLinesTopic \
* --zookeeper localhost:2181 --partitions 1 --replication-factor 1
* $ bin/kafka-topics --create --topic WordsWithCountsTopic \
* --zookeeper localhost:2181 --partitions 1 --replication-factor 1
* }</pre>
* Note: The above commands are for the Confluent Platform. For Apache Kafka it should be {@code bin/kafka-topics.sh ...}.
* <p>
* 3) Start this example application either in your IDE or on the command line.
* <p>
* If via the command line, please refer to the Packaging instructions for the Confluent Kafka Streams examples.
* Once packaged you can then run:
* <pre>
* {@code
* $ java -cp target/streams-examples-3.3.0-standalone.jar io.confluent.examples.streams.WordCountLambdaExample
* }</pre>
* 4) Write some input data to the source topic "TextLinesTopic" (e.g. via {@code kafka-console-producer}).
* The already running example application (step 3) will automatically process this input data and write the
* results to the output topic "WordsWithCountsTopic".
* <pre>
* {@code
* # Start the console producer. You can then enter input data by writing some line of text, followed by ENTER:
* #
* # hello kafka streams<ENTER>
* # all streams lead to kafka<ENTER>
* # join kafka summit<ENTER>
* #
* # Every line you enter will become the value of a single Kafka message.
* $ bin/kafka-console-producer --broker-list localhost:9092 --topic TextLinesTopic
* }</pre>
* 5) Inspect the resulting data in the output topic, e.g. via {@code kafka-console-consumer}.
* <pre>
* {@code
* $ bin/kafka-console-consumer --topic WordsWithCountsTopic --from-beginning \
* --new-consumer --bootstrap-server localhost:9092 \
* --property print.key=true \
* --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
* }</pre>
* You should see output data similar to below. Please note that the exact output
* sequence will depend on how fast you type the above sentences. If you type them
* slowly, you are likely to get each count update, e.g., kafka 1, kafka 2, kafka 3.
* If you type them quickly, you are likely to get fewer count updates, e.g., just kafka 3.
* This is because the commit interval is set to 10 seconds. Anything typed within
* that interval will be compacted in memory.
* <pre>
* {@code
* hello 1
* kafka 1
* streams 1
* all 1
* streams 2
* lead 1
* to 1
* join 1
* kafka 3
* summit 1
* }</pre>
* 6) Once you're done with your experiments, you can stop this example via {@code Ctrl-C}. If needed,
* also stop the Kafka broker ({@code Ctrl-C}), and only then stop the ZooKeeper instance ({@code Ctrl-C}).
*/
public class WordCountLambdaExample {

  public static void main(final String[] args) throws Exception {
    final String bootstrapServers = args.length > 0 ? args[0] : "localhost:9092";
    final Properties streamsConfiguration = new Properties();
    // Give the Streams application a unique name. The name must be unique in the Kafka cluster
    // against which the application is run.
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "wordcount-lambda-example-client");
    // Where to find Kafka broker(s).
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    // Specify default (de)serializers for record keys and for record values.
    streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    // Records should be flushed every 10 seconds. This is less than the default
    // in order to keep this example interactive.
    streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
    // For illustrative purposes we disable record caches.
    streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);

    // Set up serializers and deserializers, which we will use for overriding the default serdes
    // specified above.
    final Serde<String> stringSerde = Serdes.String();
    final Serde<Long> longSerde = Serdes.Long();

    // In the subsequent lines we define the processing topology of the Streams application.
    final KStreamBuilder builder = new KStreamBuilder();

    // Construct a `KStream` from the input topic "TextLinesTopic", where message values
    // represent lines of text (for the sake of this example, we ignore whatever may be stored
    // in the message keys).
    //
    // Note: We could also just call `builder.stream("TextLinesTopic")` if we wanted to leverage
    // the default serdes specified in the Streams configuration above, because these defaults
    // match what's in the actual topic. However we explicitly set the deserializers in the
    // call to `stream()` below in order to show how that's done, too.
    final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");

    final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    final KTable<String, Long> wordCounts = textLines
      // Split each text line, by whitespace, into words. The text lines are the record
      // values, i.e. we can ignore whatever data is in the record keys and thus invoke
      // `flatMapValues()` instead of the more generic `flatMap()`.
      .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
      // Count the occurrences of each word (record key).
      //
      // This will change the stream type from `KStream<String, String>` to `KTable<String, Long>`
      // (word -> count). In the `count` operation we must provide a name for the resulting KTable,
      // which will be used to name e.g. its associated state store and changelog topic.
      //
      // Note: no need to specify explicit serdes because the resulting key and value types match our default serde settings.
      .groupBy((key, word) -> word)
      .count("Counts");

    // Write the `KTable<String, Long>` to the output topic.
    wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");

    // Now that we have finished the definition of the processing topology we can actually run
    // it via `start()`. The Streams application as a whole can be launched just like any
    // normal Java application that has a `main()` method.
    final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);

    // Always (and unconditionally) clean local state prior to starting the processing topology.
    // We opt for this unconditional call here because this will make it easier for you to play around with the example
    // when resetting the application for doing a re-run (via the Application Reset Tool,
    // http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool).
    //
    // The drawback of cleaning up local state prior is that your app must rebuild its local state from scratch, which
    // will take time and will require reading all the state-relevant data from the Kafka cluster over the network.
    // Thus in a production scenario you typically do not want to clean up always as we do here but rather only when it
    // is truly needed, i.e., only under certain conditions (e.g., the presence of a command line flag for your app).
    // See `ApplicationResetExample.java` for a production-like example.
    streams.cleanUp();
    streams.start();

    // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams.
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }

}
The logic here is, roughly speaking, to continuously read the lines of text entered by users from the Kafka input topic TextLinesTopic:
final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");
Then each input line is split into individual words with a regular expression, and flatMapValues turns the line stream into a stream of words:
flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
Next, the records are grouped by word:
groupBy((key, word) -> word)
Finally, count materializes the grouped KStream into a KTable of per-word counts (backed by the state store named "Counts"), and the result can later be queried, for example via KSQL.
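Besides being written to the output topic, the local state store named in count("Counts") can also be read from inside the running application through Kafka Streams' interactive queries (available since Kafka 0.10.1). A minimal sketch, assuming the KafkaStreams instance from the example above has already been started and reached the RUNNING state (the helper class and method names are only for illustration):

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public final class CountsStoreQuery {

  // Print the current count for a single word by querying the local "Counts" store.
  public static void printCount(final KafkaStreams streams, final String word) {
    // Obtain a read-only view of the key-value store backing the KTable created by count("Counts").
    final ReadOnlyKeyValueStore<String, Long> counts =
      streams.store("Counts", QueryableStoreTypes.<String, Long>keyValueStore());
    // get() returns null if the word has not been counted yet.
    final Long count = counts.get(word);
    System.out.println(word + " -> " + (count == null ? 0L : count));
  }
}

For example, after entering the sample sentences from the Javadoc, CountsStoreQuery.printCount(streams, "kafka") should print kafka -> 3.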
One thing to remember: the Streams application only starts processing once it has been started:
streams.start();
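To push some test data into TextLinesTopic from Java instead of using kafka-console-producer (step 4 in the Javadoc above), a minimal producer sketch; the class name TextLinesProducer and the localhost:9092 broker address are only assumptions for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public final class TextLinesProducer {

  public static void main(final String[] args) {
    final Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    // Each line of text becomes the value of one Kafka message, just like with the console producer.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("TextLinesTopic", "hello kafka streams"));
      producer.send(new ProducerRecord<>("TextLinesTopic", "all streams lead to kafka"));
      producer.send(new ProducerRecord<>("TextLinesTopic", "join kafka summit"));
      producer.flush();
    }
  }
}

The resulting word counts then appear on WordsWithCountsTopic, as described in step 5 of the Javadoc.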