Consistency Guarantees
The main point is to understand the stronger consistency models; once we have seen them, we will be in a better position to decide which one best fits our needs.
But while there is some overlap, they are mostly independent concerns: transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.
Note that a consistency model is not the same thing as an isolation level: isolation is mainly about race conditions between concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.
In other words, distributed consistency arises from reconciling the state of different replicas.
Linearizability
This is a consistency model, not an isolation level.
The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic.
The definition of linearizability: make clients believe there is only one copy of the data, even though there are many, and that all operations on it are atomic.
![[DDIA-Figure-9-2.png]]
Does "register" in a distributed system mean the object being operated on?
Judging from the explanation of the figure above, it seems to mean an individual key (like a primary key).
In this example, the register has two types of operations:
- read(x) ⇒ v means the client requested to read the value of register x, and the database returned the value v.
- write(x, v) ⇒ r means the client requested to set the register x to value v, and the database returned response r (which could be ok or error).
The definitions use arrow notation; programming languages presumably borrowed this mathematical way of expressing a function.
![[DDIA-Figure-9-3.png]]
This figure shows that once a read has returned the new value, all subsequent reads must also return the new value.
Figure 9-4 adds a new operation.
In figure9-4 we add a third type of operation besides read and write:
- cas(x, v_old, v_new) ⇒ r means the client requested an atomic compare-and-set operation (see "Compare-and-set" on page 245). If the current value of the register x equals v_old, it should be atomically set to v_new. If x ≠ v_old then the operation should leave the register unchanged and return an error. r is the database's response (ok or error).
![[DDIA-Figure-9-4.png]]
- First client B sent a request to read x, then client D sent a request to set x to 0, and then client A sent a request to set x to 1. Nevertheless, the value returned to B's read is 1 (the value written by A). This is okay: it means that the database first processed D's write, then A's write, and finally B's read. Although this is not the order in which the requests were sent, it's an acceptable order, because the three requests are concurrent. Perhaps B's read request was slightly delayed in the network, so it only reached the database after the two writes.
- Client B’s read returned 1 before client A received its response from the database, saying that the write of the value 1 was successful. This is also okay: it doesn’t mean the value was read before it was written, it just means the ok response from the database to client A was slightly delayed in the network.
- This model doesn’t assume any transaction isolation: another client may change a value at any time. For example, C first reads 1 and then reads 2, because the value was changed by B between the two reads. An atomic compare-and-set (cas) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).
- The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Alice and Bob in Figure 9-1.
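To pin down these register semantics, here is a minimal single-process sketch in Java (my own illustration, not the book's; in a real replicated database the hard part is making many copies behave like this one synchronized object):

```java
import java.util.Objects;

// A toy register with the three operation types from Figures 9-2 to 9-4.
// All methods synchronize on the same object, so every operation is atomic
// and there is literally "only one copy of the data".
public class ToyRegister<V> {
    private V value;

    public ToyRegister(V initial) { this.value = initial; }

    // read(x) => v
    public synchronized V read() { return value; }

    // write(x, v) => ok
    public synchronized void write(V v) { value = v; }

    // cas(x, vOld, vNew) => ok | error
    public synchronized boolean cas(V vOld, V vNew) {
        if (!Objects.equals(value, vOld)) {
            return false;   // concurrently changed: leave unchanged, return error
        }
        value = vNew;
        return true;        // atomically swapped
    }
}
```

On a single JVM this is essentially what `java.util.concurrent.atomic.AtomicReference#compareAndSet` already provides; linearizability is the demand that a replicated system preserve exactly these semantics.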
These paragraphs deserve repeated reading, and remind yourself: this is a consistency model, not an isolation level, so what is being discussed is whether the values the database returns are consistent.
So is cas (compare-and-set) just an operation for guaranteeing consistency, rather than a newly introduced operation?
It is a new operation; see my explanation below.
In other words, do you have to perform a cas before the next read, to make sure nobody updated the value in between? In Figure 9-3 client B performs two reads, while in Figure 9-4 a cas happens before the second read, and that is what ensures consistency?
Oh! The third bullet point answers my question: after any operation you can issue a cas, which checks whether your operation is consistent with the DB's state.
The book's exact words:
An atomic compare-and-set (cas) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).
D's cas failed because someone else had changed the value (client B's cas).
No, wait: cas is a new operation, not merely a check on whether your previous operation succeeded, because client B's final read fails to be linearizable due to client C's cas operation.
New concepts must come with clear definitions!
[[2021-12-27#一定要了解新的概念的定義]]
Linearizability itself is an abstract model, i.e. a model defined for consistency; it is completely different from an isolation level, so don't confuse the two.
Serializability
Serializability is an isolation property of transactions, where every transaction may read and write multiple objects (rows, documents, records)—see "Single-Object and Multi-Object Operations" on page 228. It guarantees that transactions behave the same as if they had executed in some serial order (each transaction running to completion before the next transaction starts). It is okay for that serial order to be different from the order in which transactions were actually run [12].
Serializability is a guarantee at the transaction level: when multiple transactions run concurrently, it guarantees that they behave the same as if they had executed in some serial order.
Linearizability, by contrast, is a consistency guarantee (a recency guarantee) on read and write operations at the level of a register (an individual object). It doesn't group operations together into transactions, so it can still suffer from write skew; in that sense linearizability is a weaker guarantee than serializability.
The biggest difference is that linearizability applies to a single object, while serializability can span multiple objects.
The text later mentions that linearizability is typically needed where coordination happens, i.e. multiple nodes rely on linearizable state to reach consensus? That's why ZooKeeper must use linearizable storage for its coordination operations.
Serializability is the stronger guarantee; the book also notes that serializable implementations are typically linearizable:
A database may provide both serializability and linearizability, and this combination is known as strict serializability or strong one-copy serializability (strong-1SR) [4, 13]. Implementations of serializability based on two-phase locking (see "Two-Phase Locking (2PL)" on page 257) or actual serial execution (see "Actual Serial Execution" on page 252) are typically linearizable.
Linearizability
Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn’t group operations together into transactions, so it does not prevent problems such as write skew (see “Write Skew and Phantoms” on page 246), unless you take additional measures such as materializing conflicts (see “Materializing conflicts” on page 251).
Writing down my own definitions really pays off: I can check my earlier understanding at any time and correct it promptly if something is wrong... While reading this passage I looked up my earlier definition of 2PL (two-phase locking).
[[DDIA Ch7 Transactions#Two phase locking]]
In Chapter 7 the author explains this well, so I built my own explanation on top of his: 2PL is like an object lock in Java, where every read and write must wait for the preceding operations to finish, unlike snapshot isolation, where writers never block readers and readers never block writers.
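A minimal sketch of that analogy (my own, with hypothetical names): during the growing phase the transaction only acquires locks, and at commit, the shrinking phase, it releases them all at once.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Strict two-phase locking, sketched: locks only accumulate while the
// transaction runs (growing phase) and are all released at commit/abort
// (shrinking phase). Unlike snapshot isolation, readers and writers
// block each other here.
public class TwoPhaseLockingTxn {
    private final Deque<Lock> held = new ArrayDeque<>();

    public void readLock(ReentrantReadWriteLock objectLock) {
        objectLock.readLock().lock();   // shared lock: waits if a writer holds it
        held.push(objectLock.readLock());
    }

    public void writeLock(ReentrantReadWriteLock objectLock) {
        objectLock.writeLock().lock();  // exclusive lock: waits for all readers/writers
        held.push(objectLock.writeLock());
    }

    public void commit() {
        while (!held.isEmpty()) {
            held.pop().unlock();        // shrinking phase: release everything
        }
    }
}
```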
Linearizability use cases
ZooKeeper needs the linearizability guarantee to make sure its distributed locks and leader election are correct.
Two Communication Channels
The Alice and Bob example in the book goes wrong precisely because there is an additional communication channel between them: Alice's mouth and Bob's ears.
Likewise, if your system has an extra communication channel, you need linearizable storage, otherwise things can go wrong; the figure below gives such an example.
![[DDIA-Figure-9-5.png]]
If this system's storage is not linearizable: when the message queue is faster than the file storage itself, the file storage may not have finished storing by the time the image resizer fetches, so the resizer sees the old data (or nothing at all) and writes back a resized version of the old data, leaving the file storage permanently inconsistent.
This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue. Without the recency guarantee of linearizability, race conditions between these two channels are possible. This situation is analogous to Figure 9-1, where there was also a race condition between two communication channels: the database replication and the real-life audio channel between Alice’s mouth and Bob’s ears.
Linearizability is not the only solution; read-your-own-writes would also work, it just makes things more complicated (linearizability is guaranteed by the DB, while read-your-own-writes seems to require special configuration, such as unique transaction IDs).
Implementing Linearizable Systems
First, consider whether each replication model can be made linearizable:
![[DDIA-不同replication是否Linearizable.png]]
CAP theorem
Thus, a better way of phrasing CAP would be either Consistent or Available when Partitioned
![[DDIA-CAP-Definition.png]]
The CAP theorem as formally defined [30] is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, or nodes that are alive but disconnected from each other). It doesn't say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [9, 40].
The root cause of concurrency problems is that causality gets violated
This passage of the book sums it up wonderfully, and explains that ordering, linearizability, and consensus all come up because concurrency can violate causality:
There are several reasons why ordering keeps coming up, and one of the reasons is that it helps preserve causality. We have already seen several examples over the course of this book where causality has been important:
- In “Consistent Prefix Reads” on page 165 (Figure 5-5) we saw an example where the observer of a conversation saw first the answer to a question, and then the question being answered. This is confusing because it violates our intuition of cause and effect: if a question is answered, then clearly the question had to be there first, because the person giving the answer must have seen the question (assuming they are not psychic and cannot see into the future). We say that there is a causal dependency between the question and the answer.
- A similar pattern appeared in Figure 5-9, where we looked at the replication between three leaders and noticed that some writes could “overtake” others due to network delays. From the perspective of one of the replicas it would look as though there was an update to a row that did not exist. Causality here means that a row must first be created before it can be updated.
- In “Detecting Concurrent Writes” on page 184 we observed that if you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. This happened before relationship is another expression of causality: if A happened before B, that means B might have known about A, or built upon A, or depended on A. If A and B are concurrent, there is no causal link between them; in other words, we are sure that neither knew about the other.
- In the context of snapshot isolation for transactions (“Snapshot Isolation and Repeatable Read” on page 237), we said that a transaction reads from a consistent snapshot. But what does “consistent” mean in this context? It means consistent with causality: if the snapshot contains an answer, it must also contain the question being answered [48]. Observing the entire database at a single point in time makes it consistent with causality: the effects of all operations that happened causally before that point in time are visible, but no operations that happened causally afterward can be seen. Read skew (non-repeatable reads, as illustrated in Figure 7-6) means reading data in a state that violates causality.
- Our examples of write skew between transactions (see “Write Skew and Phantoms” on page 246) also demonstrated causal dependencies: in Figure 7-8, Alice was allowed to go off call because the transaction thought that Bob was still on call, and vice versa. In this case, the action of going off call is causally dependent on the observation of who is currently on call. Serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)” on page 261) detects write skew by tracking the causal dependencies between transactions.
- In the example of Alice and Bob watching football (Figure 9-1), the fact that Bob got a stale result from the server after hearing Alice exclaim the result is a causality violation: Alice’s exclamation is causally dependent on the announcement of the score, so Bob should also be able to see the score after hearing Alice. The same pattern appeared again in “Cross-channel timing dependencies” on page 331 in the guise of an image resizing service.
Concurrency would mean that the timeline branches and merges again
Just like that figure in CSAPP!
causal order is not a total order
![[DDIA-Causal Order is not Total Order.png]]
This figure shows clearly that causal order is not a total order.
Concurrency would mean that the timeline branches and merges again—and in this case, operations on different branches are incomparable (i.e., concurrent). We saw this phenomenon in Chapter 5: for example, Figure 5-14 is not a straight-line total order, but rather a jumble of different operations going on concurrently. The arrows in the diagram indicate causal dependencies—the partial ordering of operations.
So what is the relationship between the causal order and linearizability? The answer is that linearizability implies causality
So linearizable systems are a subset of causally ordered ones (the smaller circle); in other words, as long as you are linearizable, you are automatically inside the causal-order set!
Causal consistency is achievable
It is still an active research area, but we can implement causal consistency without paying the price of linearizability.
The good news is that a middle ground is possible. Linearizability is not the only way of preserving causality—there are other ways too. A system can be causally consistent without incurring the performance hit of making it linearizable (in particular, the CAP theorem does not apply). In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures [2, 42].
In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that preserve causality, with performance and availability characteristics that are similar to those of eventually consistent systems [49, 50, 51].
The techniques for determining which operation happened before which other operation are similar to what we discussed in “Detecting Concurrent Writes” on page 184. That section discussed causality in a leaderless datastore, where we need to detect concurrent writes to the same key in order to prevent lost updates. Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key. Version vectors can be generalized to do this [54].
Version vectors can solve the causal consistency problem:
![[DDIA-Version-Vector.png]]
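A sketch of the core version-vector comparison (my own illustration, not the book's): each node has its own counter, and two versions are concurrent exactly when neither dominates the other.

```java
import java.util.HashMap;
import java.util.Map;

// Version vector: one counter per node. Version A causally precedes B if B
// dominates A; if neither dominates the other, the versions are concurrent.
public class VersionVector {
    private final Map<String, Long> counters = new HashMap<>();

    // Local write on this node: bump its own counter.
    public void increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    // true if every counter in `other` is <= the corresponding counter here
    public boolean dominates(VersionVector other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public boolean concurrentWith(VersionVector other) {
        return !this.dominates(other) && !other.dominates(this);
    }
}
```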
Lamport timestamp
The main idea of Lamport timestamps: on every request or response, a node attaches the maximum counter value it has seen so far, so every node keeps updating its own counter.
every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request.
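A minimal sketch of that rule (my own illustration): increment the counter for local events and sends, and on receipt take the max of the local counter and the attached one, plus one.

```java
// Lamport clock: timestamps are (counter, nodeId) pairs, compared by counter
// first with nodeId as a tiebreaker, giving a total order consistent with causality.
public class LamportClock {
    private long counter = 0;
    private final int nodeId;

    public LamportClock(int nodeId) { this.nodeId = nodeId; }

    // Local event, or about to send: increment and attach this counter.
    public synchronized long tick() {
        return ++counter;
    }

    // A request/response arrived carrying the sender's maximum counter.
    public synchronized long receive(long attachedCounter) {
        counter = Math.max(counter, attachedCounter) + 1;
        return counter;
    }
}
```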
Using Total Order Broadcast
Consensus services such as ZooKeeper and etcd implement total order broadcast.
Another way of looking at total order broadcast is that it is a way of creating a log (as in a replication log, transaction log, or write-ahead log): delivering a message is like appending to the log. Since all nodes must deliver the same messages in the same order, all nodes can read the log and see the same sequence of messages.
Isn't this a bit like the public ledger that Bitcoin/blockchains maintain?
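A toy sketch of the "delivering a message is appending to a log" view (mine; a real implementation such as Zab or Raft must also make the sequencer itself fault tolerant, which this deliberately ignores):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Total order broadcast as a shared log: a single sequencer assigns offsets,
// and every node's queue receives the same entries in the same order.
public class ToyTotalOrderBroadcast {
    private final List<BlockingQueue<String>> nodeQueues;
    private long nextOffset = 0;

    public ToyTotalOrderBroadcast(List<BlockingQueue<String>> nodeQueues) {
        this.nodeQueues = nodeQueues;
    }

    // "Append to the log": all nodes deliver (offset, msg) in offset order.
    public synchronized void broadcast(String msg) throws InterruptedException {
        long offset = nextOffset++;
        for (BlockingQueue<String> queue : nodeQueues) {
            queue.put(offset + ":" + msg);
        }
    }
}
```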
In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm.
So in the end it all has to be implemented with a consensus algorithm?
This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus [28, 67]. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!
You may have heard about the FLP result [68]—named after the authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. In a distributed system, we must assume that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms for achieving consensus. What is going on here?
What! That's amazing.
The answer is that the FLP result is proved in the asynchronous system model (see “System Model and Reality” on page 306), a very restrictive model that assumes a deterministic algorithm that cannot use any clocks or timeouts. If the algorithm is allowed to use timeouts, or some other way of identifying suspected crashed nodes (even if the suspicion is sometimes wrong), then consensus becomes solvable [67]. Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [69].
![[DDIA-Impossibility of Consensus.png]]
Atomic Commit
If a transaction executes on multiple nodes and they all need to commit, you need the atomic commit guarantee; otherwise some nodes succeeding while others fail would cause problems.
In these cases, it is not sufficient to simply send a commit request to all of the nodes and independently commit the transaction on each one. In doing so, it could easily happen that the commit succeeds on some nodes and fails on other nodes, which would violate the atomicity guarantee:
- Some nodes may detect a constraint violation or conflict, making an abort necessary, while other nodes are successfully able to commit.
- Some of the commit requests might be lost in the network, eventually aborting due to a timeout, while other commit requests get through.
- Some nodes may crash before the commit record is fully written and roll back on recovery, while others successfully commit.
Two-Phase Commit
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes
2PC is an algorithm for achieving atomic transaction commit across multiple nodes.
It is a classic algorithm in distributed systems.
[13, 35, 75]: three references to look at when I have time.
References
[13] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com.
http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
[35] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “Notes on Distributed Databases,” IBM Research, Research Report RJ2571(33471), July 1979.
http://domino.research.ibm.com/library/cyberdig.nsf/papers/A776EC17FC2FCE73852579F100578964/$File/RJ2571.pdf
[75] C. Mohan, Bruce G. Lindsay, and Ron Obermarck: “Transaction Management in the R* Distributed Database Management System,” ACM Transactions on Database Systems, volume 11, number 4, pages 378–396, December 1986. doi: 10.1145/7239.7266
https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf
Basic flow
The basic flow of 2PC is illustrated in Figure 9-9. Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).
![[DDIA-Figure-9-9.png]]
2PL and 2PC differ in purpose: 2PL is only about isolation, while 2PC is about getting multiple nodes to agree. Both have two phases: 2PL's phases are acquiring locks and releasing locks, while 2PC's are asking every node whether it can commit (phase one), then deciding commit or abort based on the responses (phase two).
2PC introduces a new component, the coordinator (transaction manager). How does it differ from the coordinator in a consensus algorithm?
A footnote says that atomic commit and consensus are reducible to each other (they are interchangeable):
xii. Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [70, 71]. Nonblocking atomic commit is harder than consensus—see “Three-phase commit” on page 359.
2PC has many subtle details. The main one: once all nodes say yes, the coordinator writes the commit decision to disk, then sends commit requests; if a request times out it must retry forever, because the decision to commit has already been made, and no matter which node goes down the coordinator must keep retrying.
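A sketch of the coordinator side as I read it (the `Participant` and `DurableLog` interfaces are hypothetical stand-ins for the network layer and the coordinator's write-ahead log):

```java
import java.util.List;

// Two-phase commit, coordinator side (sketch only).
public class TwoPhaseCommitCoordinator {
    interface Participant {
        boolean prepare(long txid);   // phase 1: "can you commit?" (yes/no vote)
        void commit(long txid);       // phase 2: may fail; must be retried forever
        void abort(long txid);
    }

    interface DurableLog { void append(String record); }   // hypothetical WAL

    private final List<Participant> participants;
    private final DurableLog log;

    public TwoPhaseCommitCoordinator(List<Participant> participants, DurableLog log) {
        this.participants = participants;
        this.log = log;
    }

    public void run(long txid) {
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare(txid)) { allYes = false; break; }   // phase 1
        }
        // The commit point: once this record is on disk, the decision is final.
        log.append((allYes ? "COMMIT " : "ABORT ") + txid);
        for (Participant p : participants) {
            if (allYes) {
                retryForever(() -> p.commit(txid));            // phase 2: no going back
            } else {
                p.abort(txid);
            }
        }
    }

    private void retryForever(Runnable action) {
        while (true) {
            try { action.run(); return; } catch (RuntimeException e) { /* keep retrying */ }
        }
    }
}
```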
Coordinator failure
If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain.
![[DDIA-Figure-9-10.png]]
Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.
So once the coordinator has decided to commit and written that to its WAL, it becomes a single point of failure: if it crashes at that moment, every node has to wait for it to recover before anyone can learn whether to commit. Hence three-phase commit.
Three-phase commit
3PC is a nonblocking atomic commit protocol, but it assumes a network with bounded delay and nodes with bounded response times.
In general, nonblocking atomic commit requires a perfect failure detector [67, 71]—i.e., a reliable mechanism for telling whether a node has crashed or not. In a network with unbounded delay a timeout is not a reliable failure detector, because a request may time out due to a network problem even if no node has crashed. For this reason, 2PC continues to be used, despite the known problem with coordinator failure.
XA transactions
因?yàn)榉植际较到y(tǒng)一般整合了不同的vendor(不同DB盯漂, message queue之類的)颇玷,所以需要一個(gè)protocol 來(lái)讓他們 implement 2PC.
X/Open XA (short for eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies [76, 77]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, DB2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).
XA is not a network protocol; it is merely a C API for interfacing with a transaction coordinator.
XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator. Bindings for this API exist in other languages; for example, in the world of Java EE applications, XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers using the Java Message Service (JMS) APIs.
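A hedged sketch of what this looks like from application code using the JTA API (the resource wiring is container-specific and omitted; `db` and `mq` are placeholder resources assumed to be XA-capable):

```java
import javax.jms.ConnectionFactory;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

// With XA-capable resources enlisted, UserTransaction.commit() makes the
// coordinator library run two-phase commit across the database and the broker.
public class XaSketch {
    public void transfer(UserTransaction utx, DataSource db, ConnectionFactory mq)
            throws Exception {
        utx.begin();
        try {
            // ... JDBC writes via `db` and JMS sends via `mq` both enlist
            //     in the same distributed transaction here ...
            utx.commit();   // phase 1 (prepare) + phase 2 (commit) across both
        } catch (Exception e) {
            utx.rollback(); // abort everywhere
            throw e;
        }
    }
}
```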
Seen this way, Java's support is really strong.
So the coordinator is actually just a library inside the application! Not a standalone service like ZooKeeper.
So it lives at the application level; when we build the application we ship the XA coordinator with it?
The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service).
The library runs on the application server; if the application process crashes, that server must be restarted, and the coordinator library reads the WAL records to recover.
If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction.
This reminds me that not just frontends but most applications are stateless even on the server side... all state comes from the DB.
- Many server-side applications are developed in a stateless model (as favored by HTTP), with all persistent state stored in a database,
The consensus problem is normally formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values. In the seat-booking example, when several customers are concurrently trying to buy the last seat, each node handling a customer request may propose the ID of the customer it is serving, and the decision indicates which one of those customers got the seat.
In effect, the consensus algorithm decides who gets to change the data.
In this formalism, a consensus algorithm must satisfy the following properties [25]
Uniform agreement
No two nodes decide differently.
Integrity
No node decides twice.
Validity
If a node decides value v, then v was proposed by some node.
Termination
Every node that does not crash eventually decides some value.
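The four properties above map naturally onto a tiny interface (my sketch, not from the book):

```java
// The consensus abstraction: each node may propose a value; the algorithm
// decides exactly one of the proposed values.
public interface Consensus<V> {
    // Validity: only proposed values can be decided.
    void propose(V value);

    // Uniform agreement: every node's decide() returns the same value.
    // Integrity: a node decides at most once.
    // Termination: every non-crashed node eventually returns from decide().
    V decide() throws InterruptedException;
}
```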
Consensus algorithms
The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR) [94, 95], Paxos [96, 97, 98, 99], Raft [22, 100, 101], and Zab [15, 21, 102].
Most of these algorithms actually don’t directly use the formal model described here (proposing and deciding on a single value, while satisfying the agreement, integrity, validity, and termination properties). Instead, they decide on a sequence of values, which makes them total order broadcast algorithms, as discussed previously in this chapter
One problem here: the single-leader structure resembles the coordinator in a consensus algorithm, but if the leader fails you need consensus to elect a new leader, and the consensus algorithm needs a leader... full circle.
How is this resolved?
Epoch numbering and quorums
All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number (called the ballot number in Paxos, view number in Viewstamped Replication, and term number in Raft) and guarantee that within each epoch, the leader is unique.
That is, a new number is introduced, and within each such number (epoch) there is guaranteed to be a unique leader.
If there is a conflict, e.g. the leader from an earlier epoch comes back and we get a split-brain situation, the leader with the higher epoch number wins.
Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered and monotonically increasing. If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.
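A minimal sketch of the "higher epoch number prevails" rule (my own): each node tracks the highest epoch it has seen and rejects anything stamped with a lower one.

```java
// Epoch numbers (ballot/view/term): monotonically increasing, so a node can
// always reject messages from a deposed leader of an earlier epoch.
public class EpochTracker {
    private long highestSeenEpoch = 0;

    // Returns true if a message stamped with `epoch` should be accepted.
    public synchronized boolean accept(long epoch) {
        if (epoch < highestSeenEpoch) {
            return false;               // stale leader from an earlier epoch
        }
        highestSeenEpoch = epoch;       // possibly a newer leader: follow it
        return true;
    }

    // Starting an election: propose a strictly higher epoch than any seen.
    public synchronized long nextEpoch() {
        return ++highestSeenEpoch;
    }
}
```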
Before the leader decides anything, it must collect votes from a quorum of nodes; only if a majority of nodes agree can the leader's proposal proceed.
This is a lot like politics.
Before a leader is allowed to decide anything, it must first check that there isn’t some other leader with a higher epoch number which might take a conflicting decision. How does a leader know that it hasn’t been ousted by another node? Recall “The Truth Is Defined by the Majority” on page 300: a node cannot necessarily trust its own judgment—just because a node thinks that it is the leader, that does not necessarily mean the other nodes accept it as their leader.
So compared with consensus algorithms, 2PC requires every node to participate, whereas consensus algorithms are fault tolerant (a majority suffices).
This voting process looks superficially similar to two-phase commit. The biggest differences are that in 2PC the coordinator is not elected, and that fault-tolerant consensus algorithms only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met. These differences are key to the correctness and fault tolerance of a consensus algorithm.
To understand this, it is helpful to briefly explore how a service like ZooKeeper is used. As an application developer, you will rarely need to use ZooKeeper directly, because it is actually not well suited as a general-purpose database. It is more likely that you will end up relying on it indirectly via some other project: for example, HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background.
ZooKeeper really is impressive.
One example in which the ZooKeeper/Chubby model works well is if you have several instances of a process or service, and one of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should take over. This is of course useful for single-leader databases, but it’s also useful for job schedulers and similar stateful systems.
Another example arises when you have some partitioned resource (database, message streams, file storage, distributed actor system, etc.) and need to decide which partition to assign to which node. As new nodes join the cluster, some of the partitions need to be moved from existing nodes to the new nodes in order to rebalance the load (see “Rebalancing Partitions” on page 209). As nodes are removed or fail, other nodes need to take over the failed nodes’ work.
These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and notifications in ZooKeeper. If done correctly, this approach allows the application to automatically recover from faults without human intervention. It’s not easy, despite the appearance of libraries such as Apache Curator [17] that have sprung up to provide higher-level tools on top of the ZooKeeper client API—but it is still much better than attempting to implement the necessary consensus algorithms from scratch, which has a poor success record.
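For a feel of what that looks like in practice, here is a hedged sketch using Curator's LeaderLatch recipe (the connection string and znode path are placeholders):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // All instances race on the same znode path; ZooKeeper's consensus
        // guarantees at most one of them holds leadership per epoch.
        LeaderLatch latch = new LeaderLatch(client, "/myapp/leader");
        latch.start();
        latch.await();                 // blocks until this instance is elected
        System.out.println("I am the leader now");
        // ... do leader-only work; close() relinquishes leadership ...
        latch.close();
        client.close();
    }
}
```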
Summary
This chapter looks at consistency and consensus from several angles, devoting a dedicated section to linearizability, one particular consistency model.
The main purpose of linearizability is to make replicated data look as if there were only one copy, with all operations on it atomic.
It is a model!
We also looked at causality, which determines the ordering of events in the system. Unlike linearizability, causality allows multiple events to be mingled together:
![[DDIA-Causal Order is not Total Order.png]]
Linearizability, in contrast, fixes the order of all operations: it is a total order.
Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging. Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.
So causality is a weaker guarantee / weaker consistency model.
Linearizability comes with a cost.
But sometimes guaranteeing causality alone does not make the application logic correct. For example, when a user registers a username, the node doesn't know whether another process is registering the same username concurrently; causality only gives the order of events, and even with Lamport timestamps you only learn the order after the fact, whereas username registration must be decided right now. That's where consensus comes in.
It then turns out that many problems are reducible to consensus, so once consensus is solved, many other problems are solved along with it.
These problems include:
Linearizable compare-and-set registers
The register needs to atomically decide whether to set its value, based on whether its current value equals the parameter given in the operation.
Atomic transaction commit
A database must decide whether to commit or abort a distributed transaction.
Total order broadcast
The messaging system must decide on the order in which to deliver messages.
Locks and leases
When several clients are racing to grab a lock or lease, the lock decides which one successfully acquired it.
Membership/coordination service
Given a failure detector (e.g., timeouts), the system must decide which nodes are alive, and which should be considered dead because their sessions timed out.
Uniqueness constraint
When several transactions concurrently try to create conflicting records with the same key, the constraint must decide which one to allow and which should fail with a constraint violation.
Consensus algorithms are usually implemented as total order broadcast; I should dig into this again.
A consensus algorithm also usually needs a leader (much like single-leader replication).
But if that leader fails, the whole system cannot make progress, which conflicts with our goal (the whole point of a distributed system is fault tolerance). When the leader goes down, there are three ways to handle it:
- Wait for the leader to recover, and accept that the system will be blocked in the meantime. Many XA/JTA transaction coordinators choose this option. This approach does not fully solve consensus because it does not satisfy the termination property: if the leader does not recover, the system can be blocked forever.
- Manually fail over by getting humans to choose a new leader node and reconfigure the system to use it. Many relational databases take this approach. It is a kind of consensus by “act of God”—the human operator, outside of the computer system, makes the decision. The speed of failover is limited by the speed at which humans can act, which is generally slower than computers.
- Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions [107].
The third option is the consensus algorithm approach; the rest of the chapter describes ZooKeeper.
Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus.
So consensus is usually only used in single-leader architectures? Perhaps because the cost of global consensus is too high for multi-leader or leaderless systems, just as Bitcoin transactions take a long time to settle.