Debezium for PostgreSQL to Kafka

This article discusses why read and write data models should be segregated and why event sourcing is useful for capturing detailed data changes. Both aspects are critical for data analysis in the big data world. We compare several candidate solutions and conclude that a change data capture (CDC) strategy is a natural fit for the CQRS pattern.

Context and Problem

To support business decision-making, we need fresh and accurate data that is available where and when we need it, often in real time.

However:

  • when business analysts run their analyses, the production databases are (or soon will be) overloaded;
  • process details (the transaction stream) that are valuable for analysis may already have been overwritten;
  • OLTP data models are often poorly suited to analytical queries.

We want an efficient solution that captures the detailed transaction stream and ingests the data into Hadoop for analysis.

State VS Stream

CQRS and Event Sourcing Pattern

CQRS-based systems use separate read and write data models, each tailored to relevant tasks and often located in physically separate stores.

Event sourcing: instead of storing only the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data.
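As a minimal illustration of the event-sourcing idea, the sketch below keeps an append-only list of events and derives the current state by replaying them. The `Event`, `EventStore` and `replay` names, and the bank-account domain, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account: str
    kind: str    # "deposit" or "withdraw" (hypothetical event types)
    amount: int


class EventStore:
    """Append-only log: events are only ever added, never updated or deleted."""

    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)

    def events(self):
        # Return an immutable copy so callers cannot rewrite history.
        return tuple(self._log)


def replay(events):
    """Fold the full event history into the current state (a read model)."""
    balances = {}
    for e in events:
        delta = e.amount if e.kind == "deposit" else -e.amount
        balances[e.account] = balances.get(e.account, 0) + delta
    return balances


store = EventStore()
store.append(Event("alice", "deposit", 100))
store.append(Event("alice", "withdraw", 30))
store.append(Event("alice", "deposit", 5))

current_state = replay(store.events())
```

The current balance is only a derived view. The individual deposits and withdrawals remain available for analysis, which is exactly the detail a state-only store would have overwritten.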

CQRS

Decouple: one team of developers can focus on the complex domain model that is part of the write model, and another team can focus on the read model and the user interfaces.

Ingest Solutions - dual writes

Dual Write

  • adds complexity to the business system
  • is less fault tolerant when the backend message queue is blocked or under maintenance
  • suffers from race conditions and consistency problems
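The race condition can be shown with a small deterministic simulation. This is only a sketch: the two arrival orders are hard-coded assumptions standing in for real network timing, and `apply_in_order` is a hypothetical helper that models last-write-wins semantics.

```python
def apply_in_order(writes):
    """Last write wins: return the final value after applying writes in order."""
    value = None
    for w in writes:
        value = w
    return value


# Two clients update the same key concurrently. Each performs a dual write:
# one write to the database and one to the message queue. Nothing coordinates
# the two systems, so they may observe the writes in different orders.
database_arrival_order = ["A", "B"]   # database sees client 1, then client 2
queue_arrival_order = ["B", "A"]      # queue sees client 2, then client 1

database_value = apply_in_order(database_arrival_order)
downstream_value = apply_in_order(queue_arrival_order)

# The two systems now disagree permanently, and neither reports an error.
diverged = database_value != downstream_value
```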

Business log

  • raises data sensitivity concerns
  • adds complexity to the business system

Ingest Solutions - database operations

Snapshot

  • data in the database is constantly changing, so the snapshot is already out-of-date by the time it’s loaded
  • even if you take a snapshot once a day, you still have one-day-old data in the downstream system
  • on a large database those snapshots and bulk loads can become very expensive

Data offload

  • brings operational complexity
  • cannot meet low-latency requirements
  • can’t handle delete operations
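To see why deletes are lost, assume that "data offload" here means a periodic, query-based incremental extraction keyed on a last-modified column. That reading is an assumption, and the `orders` table and `offload` helper below are hypothetical. The sketch uses an in-memory SQLite database in place of the production store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 10), (2, "new", 11)]
)


def offload(conn, since):
    """Return rows modified after `since` (query-based incremental extraction)."""
    cur = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (since,),
    )
    return cur.fetchall()


# First offload run copies everything and remembers the highest timestamp seen.
replica = {}
for row_id, status, ts in offload(conn, since=0):
    replica[row_id] = status
watermark = 11

# The source then changes: one row is updated, another is deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 12 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

# Second run: the update is picked up, but the deleted row simply does not
# appear in the query result, so the replica keeps a stale copy of it.
second_batch = offload(conn, since=watermark)
for row_id, status, ts in second_batch:
    replica[row_id] = status
```

After the second run the replica still contains order 2, although it no longer exists in the source.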

Ingest Solutions - capture data change

Process only the “diff” of changes:

  • write all your data to only one primary DB;
  • extract two things from that database:
      • a consistent snapshot, and
      • a real-time stream of changes.
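The combination of a consistent snapshot and a change stream can be sketched as follows. The log format, the position numbers and the helper names are all hypothetical. The point is only that the snapshot is tied to a known log position, and the consumer applies every change after that position.

```python
# Hypothetical source database, modelled as a write-ahead log of
# (position, operation, key, value) entries.
log = [
    (1, "upsert", "k1", "a"),
    (2, "upsert", "k2", "b"),
    (3, "upsert", "k1", "c"),
    (4, "delete", "k2", None),
    (5, "upsert", "k3", "d"),
]


def apply_change(state, op, key, value):
    if op == "delete":
        state.pop(key, None)
    else:
        state[key] = value


def take_snapshot(log, position):
    """Consistent snapshot: the table contents as of a known log position."""
    state = {}
    for pos, op, key, value in log:
        if pos <= position:
            apply_change(state, op, key, value)
    return state, position


def stream_changes(log, after_position):
    """Yield only the changes that happened after the snapshot position."""
    for entry in log:
        if entry[0] > after_position:
            yield entry


# Bootstrap the downstream replica from a snapshot, then follow the stream.
replica, snapshot_position = take_snapshot(log, position=2)
for pos, op, key, value in stream_changes(log, snapshot_position):
    apply_change(replica, op, key, value)

# For comparison: the state of the primary after all writes.
primary, _ = take_snapshot(log, position=5)
```

Because changes are applied in log order starting exactly where the snapshot ends, the replica converges to the primary's state, deletes included.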

Benefits:

  • is decoupled from the business system
  • achieves a latency of less than a second
  • the stream preserves the order of writes, so there are fewer race conditions
  • the pull strategy is robust to data corruption (the log can be replayed)
  • supports as many different data consumers as required

Ingest Solutions - wrapup

Looking at the data applications within the overall picture of the business applications, we will focus on the “capture changes to data” component.


Open Source for Postgres to Kafka

Sqoop
can only take full snapshots of a database and cannot capture an ongoing stream of changes. Also, the transactional consistency of its snapshots is not well supported. (Apache License)
pg_kafka
is a Kafka producer client inside a Postgres function, so we could potentially produce to Kafka from a trigger. (MIT license)
bottledwater-pg
is a change data capture (CDC) tool specifically for PostgreSQL to Kafka. (Apache License 2.0, from Confluent Inc.)
debezium-pg
is a change data capture tool for a variety of databases. (Apache License 2.0, from Red Hat)


Debezium for Postgres is comparatively better.

Debezium for Postgres Architecture

debezium/postgres-decoderbufs

  • manually build the output plugin
  • change the PostgreSQL configuration to preload the library file, then restart the PostgreSQL service

debezium/debezium

  • compile and package the dependent jar files

Kafka Connect

  • deploy a distributed Kafka Connect service
  • start a Debezium connector in Kafka Connect
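Starting a connector in a distributed Kafka Connect cluster is done by posting a JSON document to its REST interface. The sketch below only builds that request. The host names, credentials, connector name and database name are placeholders. The exact property set depends on the Debezium version: for example, `topic.prefix` is used by recent releases, while older ones used `database.server.name`. Check the list against the documentation for the version you deploy.

```python
import json
import urllib.request


def build_connector_request(connect_url, name, config):
    """Build (but do not send) the POST request that registers a connector."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: adjust hosts, credentials and names to your environment.
config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "decoderbufs",          # the output plugin built earlier
    "database.hostname": "pg.example.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "change-me",
    "database.dbname": "inventory",
    "topic.prefix": "pgserver1",           # version-dependent, see note above
}

request = build_connector_request("http://localhost:8083", "pg-cdc-connector", config)

# To actually submit the request against a running Kafka Connect worker:
#     urllib.request.urlopen(request)
```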

HBase connector

  • development work: implement an HBase connector for PG CDC events
  • start the HBase connector in Kafka Connect
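The core of such a connector is a mapping from change events to HBase mutations. The sketch below assumes the usual Debezium event envelope with `op`, `before` and `after` fields, already unwrapped from the Kafka message. A plain dictionary stands in for the HBase table, and the function names and the `id` key column are hypothetical. What `before` contains on a delete depends on the table's replica identity setting, so a real connector would more safely take the row key from the Kafka message key.

```python
def to_hbase_mutation(event, key_field="id"):
    """Translate one change event into a (kind, row_key, columns) tuple."""
    op = event["op"]
    if op in ("c", "u", "r"):      # create, update, snapshot read
        row = event["after"]
        columns = {name: str(value) for name, value in row.items()}
        return ("put", str(row[key_field]), columns)
    if op == "d":                  # delete: only the old row image is present
        row = event["before"]
        return ("delete", str(row[key_field]), {})
    raise ValueError(f"unsupported op: {op!r}")


def apply_mutation(table, mutation):
    """Apply a mutation to a dict that stands in for an HBase table."""
    kind, row_key, columns = mutation
    if kind == "put":
        table.setdefault(row_key, {}).update(columns)
    else:
        table.pop(row_key, None)


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "ann"}},
    {"op": "u", "before": {"id": 1, "name": "ann"}, "after": {"id": 1, "name": "anne"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "bob"}},
    {"op": "d", "before": {"id": 2, "name": "bob"}, "after": None},
]

table = {}
for event in events:
    apply_mutation(table, to_hbase_mutation(event))
```

Unlike the offload approach, the delete arrives as an explicit event, so the downstream table can remove the row.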

Spark Streaming

  • development work: implement data processing functions on top of Spark Streaming

Considerations

Reliability

For example:

  • detect data source exceptions or source relocation, and automatically or manually restart data capture tasks or redirect them to the new source;
  • monitor data quality and latency;

Scalability

  • monitor the load on the data source, and automatically or manually scale out data capture tasks;

Maintainability

  • a GUI for system monitoring, data quality checks, latency statistics, etc.;
  • a GUI for configuring the scale-out of data capture tasks

Other CDC solutions

Databus (LinkedIn): no native support for PG
Wormhole (Facebook): not open source
Sherpa (Yahoo!): not open source
BottledWater (Confluent): Postgres only
Maxwell: MySQL only
Debezium (Red Hat): good
Mongoriver: only for MongoDB
GoldenGate (Oracle): for Oracle and MySQL, free but not open source
Canal & Otter (Alibaba): for MySQL replication
