Distributed systems theory for the distributed systems engineer
適合 分布式系統(tǒng)工程師 的 分布式系統(tǒng)理論
Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.
Gwen Shapira曾在Cloudera做工程師妙黍,現(xiàn)在宣傳Kafka劣坊,他在Twitter問了以下問題睡互,使我有所思考。
I need to improve my proficiency in distributed systems theory. Where do I start? Any recommended books?
我想在分布式理論上有所提升虐先。應(yīng)該從哪開始?有推薦的書?
— Gwen (Chen) Shapira (@gwenshap) August 7, 2014
My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”,
我第一反應(yīng)是“可以看:FLP論文、paxos論文旷赖、Byzantine將軍論文”,
and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed.
我推薦的主要閱讀材料,如果你貿(mào)然去讀更卒,你至少要閱讀6個(gè)月才會(huì)有感覺等孵。
But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program).
由此可知,推薦一噸的理論論文讓你閱讀蹂空,這是了解分布式系統(tǒng)的錯(cuò)誤的方式俯萌。(除非你在讀博士)
Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context.
論文一般是深?yuàn)W果录、復(fù)雜的,而且需要一系列學(xué)習(xí)和豐富的經(jīng)驗(yàn)才能感覺到其貢獻(xiàn)咐熙、才能其放到對應(yīng)的場景(以理解和應(yīng)用)弱恒。
What good is requiring that level of expertise of engineers?
工程師了解分布式理論有什么好處?
And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory;
很不幸棋恼,幾乎沒有好的引導(dǎo)文章返弹,來總結(jié)、提煉蘸泻、場景化 分布式系統(tǒng)理論中的重要結(jié)論和想法琉苇;
particularly material that does so without condescending.
特別是 通俗易懂的引導(dǎo)文章 更沒有。
Considering that gap lead me to another interesting question:
考慮這樣的空白區(qū)域悦施,讓我想問另一個(gè)問題:
What distributed systems theory should a distributed systems engineer know?
一個(gè)分布式系統(tǒng)工程師應(yīng)該了解什么樣的分布式系統(tǒng)理論?
A little theory is, in this case, not such a dangerous thing.
這種情況下去团,了解一點(diǎn)點(diǎn)理論并不是壞事抡诞。
So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer.
我日常工作是一個(gè)分布式系統(tǒng)工程師,我認(rèn)為適合我的基本概念土陪,下面會(huì)給出這些基本概念昼汗。
Let me know what you think I missed!
你認(rèn)為我缺失的請告知我!
First steps 準(zhǔn)備
These four readings do a pretty good job of explaining what about building distributed systems is challenging.
下面四個(gè)讀物解釋了構(gòu)建分布式系統(tǒng)會(huì)遇到的困難。
Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections
這些讀物都勾勒了一些列 抽象而非技術(shù) 的困難鬼雀,分布式系統(tǒng)工程師必須要克服這些困難顷窒。這些讀物的后面章節(jié)有更詳細(xì)的研究。
Distributed Systems for Fun and Profit is a short book which tries to cover some of the basic issues in distributed systems including the role of time and different strategies for replication.
Distributed Systems for Fun and Profit 是一本小書源哩,它想覆蓋分布式系統(tǒng)中的一些基本問題鞋吉,包括 時(shí)鐘所起的作用、不同策略的復(fù)制励烦。
Notes on distributed systems for young bloods - not theory, but a good practical counterbalance to keep the rest of your reading grounded.
Notes on distributed systems for young bloods - 非理論谓着,而是一個(gè)很好的實(shí)踐,以讓你落到實(shí)處坛掠。
A Note on Distributed Systems - a classic paper on why you can’t just pretend all remote interactions are like local objects.
A Note on Distributed Systems - 一個(gè)經(jīng)典論文赊锚,關(guān)于 為什么你不能假裝所有遠(yuǎn)程交互像本地對象一樣。
The fallacies of distributed computing - 8 fallacies of distributed computing that set the stage for the kinds of things system designers forget.
The fallacies of distributed computing 分布式計(jì)算的8個(gè)錯(cuò)誤的推論屉栓,以提醒系統(tǒng)設(shè)計(jì)者舷蒲。
You should know about safety and liveness properties:
你應(yīng)該知道 安全 和 活力:
safety properties say that nothing bad will ever happen. For example, the property of never returning an inconsistent value is a safety property, as is never electing two leaders at the same time.
安全 說的是 永遠(yuǎn)不會(huì)發(fā)生壞事。比如友多,不返回不一致的值 是 一種 安全牲平, 同一時(shí)刻不會(huì)選出兩個(gè) 主節(jié)點(diǎn) 也是 一種 安全。
liveness properties say that something good will eventually happen. For example, saying that a system will eventually return a result to every API call is a liveness property, as is guaranteeing that a write to disk always eventually completes.
活力 說的是 好事情終究會(huì)發(fā)生夷陋。比如欠拾,對于每個(gè)api調(diào)用胰锌,一個(gè)系統(tǒng)終究會(huì)返回一個(gè)結(jié)果,這是一種 活力藐窄;保證一次寫磁盤最終總能結(jié)束资昧,這是一種 活力。
Failure and Time 失敗和時(shí)鐘
Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:
分布式系統(tǒng)工程師面對的許多困難可以歸結(jié)為以下兩個(gè)原因:
Processes may fail
進(jìn)程可能失敗
There is no good way to tell that they have done so
There is a very deep relationship between what, if anything, processes share about their knowledge of time, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented.
進(jìn)程間怎么共用時(shí)鐘荆忍、什么樣的失敗可以檢測格带、什么樣的算法和原語可以被正確實(shí)現(xiàn),這三者之間有很深的聯(lián)系刹枉。
Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.
一般情況下叽唱,我們假設(shè)不同節(jié)點(diǎn)絕對無法共用時(shí)鐘(時(shí)刻值或流過了多少時(shí)間)
You should know:
你應(yīng)該知道:
The (partial) hierarchy of failure modes: crash stop -> omission -> Byzantine. You should understand that what is possible at the top of the hierarchy must be possible at lower levels, and what is impossible at lower levels must be impossible at higher levels.
失敗模型的層次:節(jié)點(diǎn)崩潰后關(guān)機(jī) -> 節(jié)點(diǎn)崩潰后死機(jī)(經(jīng)過無限長時(shí)間后才響應(yīng)) -> 惡意節(jié)點(diǎn) (不遵守約定的規(guī)則) 。 各個(gè)層次間逐漸將限制放松微宝,你應(yīng)該知道這些限制.
How you decide whether an event happened before another event in the absence of any shared clock. This means Lamport clocks and their generalisation to Vector clocks, but also see the Dynamo paper.
兩個(gè)節(jié)點(diǎn)之間棺亭,沒有任何共用時(shí)鐘,你怎么確定一個(gè)節(jié)點(diǎn)上的一個(gè)事件和另一個(gè)節(jié)點(diǎn)上的另一個(gè)事件之間的先后順序. 這就要閱讀Lamport時(shí)鐘和更一般化的Vector時(shí)鐘, 也可以閱讀Dynamo論文.
How big an impact the possibility of even a single failure can actually have on our ability to implement correct distributed systems (see my notes on the FLP result below).
允許單節(jié)點(diǎn)失敗對實(shí)現(xiàn)正確的分布式系統(tǒng)有多大的沖擊蟋软?(見下面FLP結(jié)論處)
Different models of time: synchronous, partially synchronous and asynchronous
時(shí)鐘的不同模型:同步镶摘、部分同部 、 異步
That detecting failures is a fundamental problem, one that trades off accuracy and completeness - yet another safety vs liveness conflict. The paper that really set out failure detection as a theoretical problem is Chandra and Toueg’s ‘Unreliable Failure Detectors for Reliable Distributed Systems’. But there are several shorter summaries around - I quite like this random one from Stanford.
失敗檢測是一個(gè)基本問題岳守,失敗檢測可以平衡準(zhǔn)確度和完成度(如果能檢測到失敗了凄敢,則可以容許不那么準(zhǔn)確、沒完全做完)湿痢,失敗檢測也可以解決安全和活力間的沖突涝缝。把失敗檢測作為理論來研究的論文是 Chandra and Toueg’s ‘Unreliable Failure Detectors for Reliable Distributed Systems’. 不過也有一些簡短的總結(jié)-我特別喜歡this random one from Stanford.
The basic tension of fault tolerance 容錯(cuò)導(dǎo)致的基本矛盾
A system that tolerates some faults without degrading must be able to act as though those faults had not occurred.
一個(gè)系統(tǒng)容忍一些錯(cuò)誤而沒有降級 必須能當(dāng)成 就像這些錯(cuò)誤沒有發(fā)生過一樣。
This means usually that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption.
這意味著系統(tǒng)的一部分要冗余地工作(同樣的功能部署多個(gè)節(jié)點(diǎn))譬重,冗余是絕對必要的拒逮,冗余一般會(huì)帶來性能和資源的消耗。
This is the basic tension of adding fault tolerance to a system.
這就是給一個(gè)系統(tǒng)添加冗余的基本矛盾害幅。
You should know:
你應(yīng)該知道:
The quorum technique for ensuring single-copy serialisability. See Skeen’s original paper, but perhaps better is Wikipedia’s entry.
確保串行單復(fù)制的多數(shù)派技術(shù). 見 Skeen的原始論文, 不過或許更好的是 Wikipedia’s entry.
(多數(shù)派中有一個(gè)是主節(jié)點(diǎn),其余為從節(jié)點(diǎn)消恍,以主節(jié)點(diǎn)接收到的寫請求序列為準(zhǔn)[串行],主節(jié)點(diǎn)單方面的要求從們接受字節(jié)的寫請求序列[從節(jié)點(diǎn)不得反抗以现、不得有異議:從節(jié)點(diǎn)是非惡意的狠怨、遵守全局規(guī)則的、非拜占庭的])About 2-phase-commit, 3-phase-commit and Paxos, and why they have different fault-tolerance properties.
How eventual consistency, and other techniques, seek to avoid this tension at the cost of weaker guarantees about system behaviour. The Dynamo paper is a great place to start, but also Pat Helland’s classic Life Beyond Transactions is a must-read.
最終一致性、其他技術(shù) 以 對系統(tǒng)行為做更弱的保證 為代價(jià) 來 設(shè)法避開 此矛盾 . 可以看 Dynamo 論文 , 不過 必須要讀 Pat Helland的論文 經(jīng)典 Life Beyond Transactions .
Basic primitives 基本原語
There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:
在分布式系統(tǒng)中记盒,很少有約定的基本構(gòu)建塊憎蛤,更多的是處于形成中的基本構(gòu)建塊。有應(yīng)該知道下面的問題是什么,并且從哪能找到他們的解決方案:
Leader election (e.g. the Bully algorithm)
主節(jié)點(diǎn)選舉 (例如 Bully 算法)
Consistent snapshotting (e.g. this classic paper from Chandy and Lamport)
一致快照 (比如 這個(gè)來自 Chandy and Lamport的經(jīng)典論文 )
一致性 (見上面 2PC 俩檬、 Paxos 處)
Distributed state machine replication (Wikipedia is ok, Lampson’s paper is canonical but dry).
分布式狀態(tài)機(jī)復(fù)制 (看Wikipedia 就行, Lampson的 論文 是權(quán)威但是太枯燥了).
Broadcast - delivering messages to more than one node at once
-
廣播 - 同時(shí)發(fā)送消息給集群
Atomic broadcast - can you deliver a message to all nodes in a group, or none?
原子廣播 - 你能發(fā)送消息給一集群萎胰,使得要么集群中的所有節(jié)點(diǎn)都收到了這條信息、要么集群中全部節(jié)點(diǎn)都沒收到此消息?(這就是原子廣播)
Gossip (the classic paper)
Gossip (經(jīng)典論文)
Causal multicast (but also consider the enjoyable back-and-forth between Birman and Cheriton).
[因果廣播](https://www.cs.cornell.edu/courses/cs614/2003sp/papers/BSS91.pdf) (也可以看看 [Birman](https://www.cs.rice.edu/~alc/comp520/papers/Cheriton_Skeen.pdf)和[forth](https://www.cs.princeton.edu/courses/archive/fall07/cos518/papers/catocs-limits-response.pdf) ).
Chain replication (a neat way of ensuring consistency and ordering of writes by organizing nodes into a virtual linked list).
-
鏈?zhǔn)綇?fù)制 (將節(jié)點(diǎn)們放進(jìn)一個(gè)虛擬鏈表中棚辽,從而可以干凈的確保寫請求的一致性和順序 ).
The original paper
一系列改良 for read-mostly workloads
對讀請求占絕大多數(shù)的一系列改良
Fundamental Results 基礎(chǔ)結(jié)論
Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:
有些事實(shí)只需要主觀理解(不需要關(guān)注證明).
You can’t implement consistent storage and respond to all requests if you might drop messages between processes. This is the CAP theorem.
如果節(jié)點(diǎn)間可能丟失消息[:P]技竟,那么你不可能 既 實(shí)現(xiàn)一致性存儲(chǔ)[:C] 又 響應(yīng)所有時(shí)刻的請求[:A]. 這就是 CAP理論.
Consensus is impossible to implement in such a way that it both a) is always correct and b) always terminates if even one machine might fail in an asynchronous system with crash-* stop failures (the FLP result). The first slides - before the proof gets going - of my Papers We Love SF talk do a reasonable job of explaining the result, I hope. Suggestion: there’s no real need to understand the proof.
在一個(gè)異步系統(tǒng)中,一致性不可能以這樣一個(gè)途徑實(shí)現(xiàn):既a) 總是正確的 屈藐; 又b) 總是能結(jié)束 即使只有一個(gè)節(jié)點(diǎn)可能以 崩潰-*停止 失敗 (FLP結(jié)論). 在看證明之前榔组,看下我以簡明的方式解釋FLP結(jié)論的論文 Papers We Love SF talk . 建議: 沒有理解證明的需求.
(一個(gè)異步系統(tǒng)中,假設(shè)節(jié)點(diǎn)崩潰后停止而不是奔潰后又恢復(fù)联逻;1搓扯、要確保結(jié)果總是正確的,2包归、每次寫請求能夠在有限時(shí)間內(nèi)返回結(jié)果锨推。這兩點(diǎn)沒法同時(shí)滿足:這就是FLP結(jié)論)Consensus is impossible to solve in fewer than 2 rounds of messages in general.
一般地,只進(jìn)行少于2輪的消息傳遞公壤,不可能達(dá)成一致性 .
Atomic broadcast is exactly as hard as consensus - in a precise sense, if you solve atomic broadcast, you solve consensus, and vice versa. Chandra and Toueg prove this, but you just need to know that it’s true.
原子廣播和一致性爱态,二者的難度精確的相等。更直白的說境钟,如果你能解原子廣播,那么你也能解一致性俭识,反之亦然慨削。 Chandra 和 Toueg 證明了這一點(diǎn), 但是你只需要知道這個(gè)論斷是成立的。
Real systems 真實(shí)系統(tǒng)
The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:
最重要的套媚、應(yīng)該不斷重復(fù)的實(shí)踐是:讀新的缚态、真實(shí)的系統(tǒng)的描述,并評價(jià)他們設(shè)計(jì)的決定堤瘤。 下面是建議的系統(tǒng):
Google:
Not Google:
Postscript 結(jié)尾
If you tame all the concepts and techniques on this list, I’d like to talk to you about engineering positions working with the menagerie of distributed systems we curate at Cloudera.
如果你馴服了這個(gè)列表中的所有概念和技術(shù)玫芦,我很樂意和你聊聊Cloudera的分布式系統(tǒng)工程師職位。