Introduction to Kafka

Apache Kafka is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:

  1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or an enterprise messaging system.
  2. It lets you store streams of records in a fault-tolerant way.
  3. It lets you process streams of records as they occur.

What is Kafka good for?

It is used for two broad classes of application:

  1. Building real-time streaming data pipelines that reliably move data between systems or applications
  2. Building real-time streaming applications that transform or react to streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First, a few concepts:

  • Kafka runs as a cluster on one or more servers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.

Kafka has four core APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

In Kafka, communication between clients and servers is done with a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
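
As a first illustration, here is a minimal sketch of publishing a record with the Java producer client. The broker address localhost:9092 and the topic name my-topic are assumptions made for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Each record carries a key and a value; the timestamp is filled in
        // automatically unless one is supplied explicitly.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello, kafka"));
        }
    }
}
```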

Topics and Logs

Let's first dive into the core abstraction Kafka provides for a stream of records: the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:


[figure: a partitioned log]

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. The records in the partitions are each assigned a sequential id number called the offset, which uniquely identifies each record within the partition.
The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
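
Retention can be configured per topic. Here is a sketch that uses the Java AdminClient to create a topic with a two-day retention period; the topic name, partition count, and replication factor are placeholder values:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2, and records kept for two days
            // (retention.ms = 2 * 24 * 60 * 60 * 1000 milliseconds).
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2)
                    .configs(Collections.singletonMap("retention.ms", "172800000"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```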



In fact, the only metadata retained on a per-consumer basis is the offset, or position, of that consumer in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but since the position is controlled by the consumer, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess past data, or skip ahead to the most recent record and start consuming from "now".
This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use the command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
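
To make that offset control concrete, here is a sketch of a consumer that takes explicit control of one partition and rewinds to the oldest available offset; the topic name and group id are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "replay-group");            // placeholder group name
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Take explicit control of one partition and reset to the oldest offset,
            // reprocessing data from the past.
            TopicPartition p0 = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singleton(p0));
            consumer.seekToBeginning(Collections.singleton(p0));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```
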
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
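
This layout can be inspected with the Java AdminClient. A sketch, assuming a topic named my-topic already exists:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
                    .all().get().get("my-topic");
            // Each partition reports its current leader and its full replica set.
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition %d: leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}
```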

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or according to some semantic partition function (say, based on some key in the record). More on the use of partitioning in a second!
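
As an illustration of a semantic partition function, here is a sketch of a custom partitioner for the Java client that routes records by hashing the key, so all records with the same key land in the same partition. The class name is invented; a real deployment would register it through the producer's partitioner.class setting:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Sketch of a semantic partition function: every record with the same key
// maps to the same partition, preserving per-key ordering.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // No key: fall back to partition 0 here; a fuller implementation
            // might round-robin instead to balance load.
            return 0;
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```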

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records are effectively load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record is broadcast to all the consumer processes.
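
A minimal sketch of a group member follows; the group name group-a and the topic name are placeholders. Running several copies of this program with the same group id spreads the topic's partitions across them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        // All instances sharing this group.id divide the topic's partitions.
        props.put("group.id", "group-a");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```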


[figure] A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances, so that each instance is the exclusive consumer of a "fair share" of the partitions at any point in time. This process of maintaining membership in the group is handled dynamically by the Kafka protocol. If new instances join the group, they take over some partitions from other members of the group; if an instance dies, its partitions are distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records, this can be achieved with a topic that has only one partition, though this means only one consumer process per consumer group.

Guarantees

At a high level, Kafka gives the following guarantees:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

Kafka as a Messaging System

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe, the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber: once one process reads the data, it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing, since every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two models. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka's model is that every topic has both these properties: it can scale processing and it is also multi-subscriber. There is no need to choose one or the other.

Kafka also has stronger ordering guarantees than a traditional messaging system.

A traditional queue retains records in order on the server, and if multiple consumers consume from the queue, the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order at different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this with the notion of an "exclusive consumer" that allows only one process to consume from a queue, but of course this means there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism, the partition, within topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group, so that each partition is consumed by exactly one consumer in the group. By doing this, we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions, this still balances the load over many consumer instances. Note, however, that there cannot be more consumer instances in a consumer group than partitions.

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait on acknowledgement, so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
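
In the Java producer this is a configuration choice. A sketch of durability-oriented settings, with placeholder broker and topic names:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Durability settings: a send is only acknowledged once the leader and
        // all in-sync replicas have the record; idempotence keeps retries from
        // duplicating or reordering it.
        props.put("acks", "all");
        props.put("enable.idempotence", "true");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on the returned future waits for that acknowledgement.
            producer.send(new ProducerRecord<>("my-topic", "key", "durable write")).get();
        }
    }
}
```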

The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing clients to control their read position, you can think of Kafka as a kind of special-purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations, Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing, such as computing aggregations off of streams or joining streams together.
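
As a sketch of the Streams API, the small application below reads a continual stream from one topic, applies a trivial transformation, and writes the result to another topic; the topic names and the filtering rule are invented for the example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SalesFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-filter");      // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read a continual stream from the input topic, transform it, and
        // write the result to an output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> sales = builder.stream("sales");
        sales.filter((key, value) -> value.length() > 10) // stand-in for real business logic
             .to("large-sales");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```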

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, and so on.

The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input and output, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual, but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively, a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka's use as a platform for streaming applications and for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat past and future data in the same way. A single application can process historical, stored data, and rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise, for streaming data pipelines, the combination of subscription and real-time events makes it possible to use Kafka for very low-latency pipelines; at the same time, the ability to store data reliably makes it possible to use it for critical data where delivery must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods for maintenance. The stream processing facilities make it possible to transform data as it arrives.

