Paxos Made simple 釋譯

原文:Paxos Made Simple
作者:Leslie Lamport
時(shí)間:01 Nov 2001


1 Introduction

The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers [5]. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm—the “synod” algorithm of [5]. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system—an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems [4].

2 The Consensus Algorithm

2.1 The Problem

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

  • Only a value that has been proposed may be chosen,
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been.

We won’t try to specify precise liveness requirements. However, the goal is to ensure that some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value.

當(dāng)多個(gè)process發(fā)起提案時(shí)喻圃,paxos確保只有一個(gè)提案值被選中(chosen)烹看,其他process可以習(xí)得(learn)這個(gè)選中值,從而確保多個(gè)process在最終達(dá)成一致

We let the three roles in the consensus algorithm be performed by three classes of agents: proposers, acceptors, and learners. In an implementation, a single process may act as more than one agent, but the mapping from agents to processes does not concern us here.

Assume that agents can communicate with one another by sending messages. We use the customary asynchronous, non-Byzantine model, in which:

? Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be re membered by an agent that has failed and restarted.

? Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.

網(wǎng)絡(luò)消息通信的唯一要求:消息不會(huì)被篡改(非拜占庭式消息)

2.2 Choosing a Value

The easiest way to choose a value is to have a single acceptor agent. A proposer sends a proposal to the acceptor, who chooses the first proposed value that it receives. Although simple, this solution is unsatisfactory because the failure of the acceptor makes any further progress impossible.

So, let’s try another way of choosing a value. Instead of a single acceptor, let’s use multiple acceptor agents. A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it. How large is large enough? To ensure that only a single value is chosen, we can let a large enough set consist of any majority of the agents. Because any two majorities have at least one acceptor in common, this works if an acceptor can accept at most one value. (There is an obvious generalization of a majority that has been observed in numerous papers, apparently starting with [3].).

In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. This suggests the requirement:

  • P1. An acceptor must accept the first proposal that it receives.
    P1:acceptor必須通過(accept)它收到的第一個(gè)提案

But this requirement raises a problem. Several values could be proposed by different proposers at about the same time, leading to a situation in which every acceptor has accepted a value, but no single value is accepted by a majority of them. Even with just two proposed values, if each is accepted by about half the acceptors, failure of a single acceptor could make it impossible to learn which of the values was chosen.

P1 and the requirement that a value is chosen only when it is accepted by a majority of acceptors imply that an acceptor must be allowed to accept more than one proposal. We keep track of the different proposals that an acceptor may accept by assigning a (natural) number to each proposal, so a proposal consists of a proposal number and a value. To prevent confusion, we require that different proposals have different numbers. How this is achieved depends on the implementation, so for now we just assume it. A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. In that case, we say that the proposal (as well as its value) has been chosen. We can allow multiple proposals to be chosen, but we must guarantee that all chosen proposals have the same value.

這段對(duì)理解paxos非常重要,建議各位多讀幾遍浪册。這里作者清晰的闡述了幾個(gè)概念:

  • 提案(proposal):提案(proposal)由編號(hào)(proposal number)和值(value)兩部分組成,{n, v}。
    注意:提案編號(hào)的唯一用途是確定提案值,提案編號(hào)本身沒有意義囱稽。
  • 選中(chosen):提案被多數(shù)acceptor接受(accept)即為選中(chosen)。

By induction on the proposal number, it suffices to guarantee:

  • P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
    P2. 如果被選中(chosen)的提案值為v二跋,那么編號(hào)更高的战惊、被選中(chosen)的提案值也必須是v。

Since numbers are totally ordered, condition P2 guarantees the crucial safety property that only a single value is chosen. To be chosen, a proposal must be accepted by at least one acceptor. So, we can satisfy P2 by satisfying:

  • P2a. If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.
    P2a. 如果被選中的提案值為v扎即,那么編號(hào)更高的吞获、被acceptor接受(accept)的提案值也必須是v。

We still maintain P1 to ensure that some proposal is chosen. Because communication is asynchronous, a proposal could be chosen with some particular acceptor c never having received any proposal. Suppose a new proposer “wakes up” and issues a higher-numbered proposal with a different value. P1 requires c to accept this proposal, violating P2a. Maintaining both P1 and P2a requires strengthening P2a to:

  • P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.
    P2b.如果value為v的提案被選中(chosen)谚鄙,那么編號(hào)更高的各拷、被proposer提出(issued)的提案值也必須是v。

Since a proposal must be issued by a proposer before it can be accepted by an acceptor, P2b implies P2a, which in turn implies P2.

To discover how to satisfy P2b, let’s consider how we would prove that it holds. We would assume that some proposal with number m and value v is chosen and show that any proposal issued with number n > m also has value v. We would make the proof easier by using induction on n, so we can prove that proposal number n has value v under the additional assumption that every proposal issued with a number in m . . (n ? 1) has value v, where i . . j denotes the set of numbers from i through j . For the proposal numbered m to be chosen, there must be some set C consisting of a majority of acceptors such that every acceptor in C accepted it. Combining this with the induction assumption, the hypothesis that m is chosen implies:

Every acceptor in C has accepted a proposal with number in m . . (n ? 1), and every proposal with number in m . . (n ? 1) accepted by any acceptor has value v.

Since any set S consisting of a majority of acceptors contains at least one member of C , we can conclude that a proposal numbered n has value v by ensuring that the following invariant is maintained:

  • P2c. For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
    P2c. 如果發(fā)起了一個(gè)提案{n, v}闷营,那么存在集合S(由半數(shù)以上的acceptor組成):集合S中所有acceptor沒有接受任何編號(hào)小于n的提案烤黍,或者集合S中存在acceptor接受了編號(hào)小于n的提案,在這些提案中傻盟,最大編號(hào)的提案值為v速蕊。

We can therefore satisfy P2b by maintaining the invariance of P2c. To maintain the invariance of P2c, a proposer that wants to issue a proposal numbered n must learn the highest-numbered proposal with number less than n, if any, that has been or will be accepted by each acceptor in some majority of acceptors. Learning about proposals already accepted is easy enough; predicting future acceptances is hard. Instead of trying to predict the future, the proposer controls it by extracting a promise that there won’t be any such acceptances. In other words, the proposer requests that the acceptors not accept any more proposals numbered less than n. This leads to the following algorithm for issuing proposals.

  1. A proposer chooses a new proposal number n and sends a request to each member of some set of acceptors, asking it to respond with:
    (a) A promise never again to accept a proposal numbered less than n, and
    (b) The proposal with the highest number less than n that it has accepted, if any.
    I will call such a request a prepare request with number n.
  2. If the proposer receives the requested responses from a majority of the acceptors, then it can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among the responses, or is any value selected by the proposer if the responders reported no proposals.
    A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request.

This describes a proposer’s algorithm. What about an acceptor? It can receive two kinds of requests from proposers: prepare requests and accept requests. An acceptor can ignore any request without compromising safety. So, we need to say only when it is allowed to respond to a request. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, iff it has not promised not to. In other words:

  • P1a. An acceptor can accept a proposal numbered n iff it has not responded to a prepare request having a number greater than n.
    P1a.一個(gè)acceptor可以接受一個(gè)編號(hào)為n的提案,只要它還未響應(yīng)任何編號(hào)大于n的prepare請(qǐng)求娘赴。

Observe that P1a subsumes P1.
We now have a complete algorithm for choosing a value that satisfies the required safety properties—assuming unique proposal numbers. The final algorithm is obtained by making one small optimization.

Suppose an acceptor receives a prepare request numbered n, but it has already responded to a prepare request numbered greater than n, thereby promising not to accept any new proposal numbered n. There is then no reason for the acceptor to respond to the new prepare request, since it will not accept the proposal numbered n that the proposer wants to issue. So we have the acceptor ignore such a prepare request.We also have it ignore a prepare request for a proposal it has already accepted.

With this optimization, an acceptor needs to remember only the highest numbered proposal that it has ever accepted and the number of the highest numbered prepare request to which it has responded. Because P2c must be kept invariant regardless of failures, an acceptor must remember this information even if it fails and then restarts. Note that the proposer can always abandon a proposal and forget all about it—as long as it never tries to issue another proposal with the same number.

P2C的不變性在acceptor異常规哲、重啟等狀態(tài)下得以保存,因此acceptor必須將該信息持久化

Putting the actions of the proposer and acceptor together, we see that the algorithm operates in the following two phases.

  • Phase 1.
    (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
    (b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal value(if any) that it has accepted.
  • Phase 2.
    (a) If the proposer receives a response to its prepare requests(numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
    (b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

提案由{編號(hào)n诽表,提案值v}兩部分組成唉锌,提案選定包含Prepare、Accept兩個(gè)階段竿奏。

  1. Prepare階段(確定提案編號(hào)n)
    (a). proposer選擇提案編號(hào)n袄简,發(fā)送給所有的acceptor
    (b). 如果acceptor之前已接受編號(hào)大于n的提案,拒絕該提案泛啸,并返回已接受的最大編號(hào)
    (c). 如果accepter之前沒有接受編號(hào)大于n的提案绿语,接受該提案;并且查找之前已接受的提案中是否存在提案值平痰,如果存在則返回編號(hào)最大的提案值汞舱。
  2. Accept階段(確定提案值)
    如果Prepare階段提案被半數(shù)以上acceptor接受伍纫,發(fā)起提案{n宗雇,v}。v來源如下:
    (a) 如果Prepare階段一個(gè)或者多個(gè)acceptor返回了提案值莹规;選擇編號(hào)最大的提案值做為v赔蒲。
    (b) 所有acceptor均返回任何提案值,由proposer自由選擇提案值。

A proposer can make multiple proposals, so long as it follows the algorithm for each one. It can abandon a proposal in the middle of the protocol at any time. (Correctness is maintained, even though requests and/or responses for the proposal may arrive at their destinations long after the proposal was abandoned.) It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one. Therefore, if an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, then it should probably inform the proposer, who should then abandon its proposal. This is a performance optimization that does not affect correctness.

2.3 Learning a Chosen Value

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. This allows learners to find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner—a number of responses equal to the product of the number of acceptors and the number of learners.

The assumption of non-Byzantine failures makes it easy for one learner to find out from another learner that a value has been accepted. We can have the acceptors respond with their acceptances to a distinguished learner, which in turn informs the other learners when a value has been chosen. This approach requires an extra round for all the learners to discover the chosen value. It is also less reliable, since the distinguished learner could fail. But it requires a number of responses equal only to the sum of the number of acceptors and the number of learners.

More generally, the acceptors could respond with their acceptances to some set of distinguished learners, each of which can then inform all the learners when a value has been chosen. Using a larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.

Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.

關(guān)于如何習(xí)得選中提案舞虱,作者漸進(jìn)的提出了幾種想法:

  1. 每次accept一個(gè)提案時(shí)欢际,通知所有的learner。
    缺點(diǎn):n個(gè)acceptor接受一個(gè)提案時(shí)廣播到n個(gè)learner矾兜,網(wǎng)絡(luò)消息總數(shù)為n*n损趋。
  2. 每次accpet一個(gè)提案時(shí),通知到一個(gè)主leanrer
    缺點(diǎn):主learner異常導(dǎo)致其他Learner學(xué)不到chosen值
  3. 每次accept一個(gè)提案時(shí)椅寺,通知到K個(gè)主learner
    缺點(diǎn):網(wǎng)絡(luò)丟包等原因仍舊可能導(dǎo)致選中值無法被習(xí)得浑槽。

由于消息傳輸是不可靠的,learner需要主動(dòng)向所有的acceptor發(fā)起learn請(qǐng)求返帕,習(xí)得chosen值桐玻。但由于acceptor中持久化的值僅僅保證是accepted值,并不一定是chosen值荆萤,因此镊靴,這里需要由learner充當(dāng)proposal發(fā)起一次提案,用于習(xí)得chosen值

這里如果暫時(shí)不能理解链韭,請(qǐng)?zhí)^繼續(xù)往下看偏竟。

2.4 Progress

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on.

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals. If the distinguished proposer can communicate successfully with a majority of acceptors, and if it uses a proposal with number greater than any already used, then it will succeed in issuing aproposal that is accepted. By abandoning a proposal and trying again if it learns about some request with a higher proposal number, the distinguished proposer will eventually choose a high enough proposal number.

If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.

如果多個(gè)proposer不停的發(fā)起更高編號(hào)的提案,prepare階段可能陷入死循環(huán)敞峭。作者提出了一種簡(jiǎn)單的解決方案:選擇一個(gè)主proposer苫耸,所有的提議由主proposer發(fā)起。和典型的主備集群不同的是儡陨,選主只是為了避免不沖突褪子。無主情況下,paxos依舊可以正常工作骗村。

2.5 The Implementation

The Paxos algorithm [5] assumes a network of processes. In its consensus algorithm, each process plays the role of proposer, acceptor, and learner. The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.

每個(gè)進(jìn)程同時(shí)扮演了proposer嫌褪、acceptor、learner胚股,在這些進(jìn)程中選擇一個(gè)主進(jìn)程笼痛,做為主proposer和主learner。主proposer和主learner在一臺(tái)機(jī)器上琅拌,省去了proposer和learner獨(dú)立接收acceptor消息缨伊,獨(dú)立判定chosen value的額外開銷。

The Paxos consensus algorithm is precisely the one described above, where requests and responses are sent as ordinary messages. (Response messages are tagged with the corresponding proposal number to prevent confusion.) Stable storage, preserved during failures, is used to maintain the information that the acceptor must remember.An acceptor records its intended response in stable storage beforeactually sending the response.

acceptor在返回響應(yīng)前必須做數(shù)據(jù)持久化进宝,以免數(shù)據(jù)丟失刻坊。

All that remains is to describe the mechanism for guaranteeing that no two proposals are ever issued with the same number. Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number. Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins phase 1 with a higher proposal number than any it has already used.

通過算法保證每臺(tái)機(jī)器之間編號(hào)唯一且嚴(yán)格遞增。如提案編號(hào)可以由{node id, serial id}兩部分組成党晋,每個(gè)節(jié)點(diǎn)node id保證不同谭胚,serial id遞增徐块。提案編號(hào)比較判定邏輯為:serial id更大的代表提案編號(hào)更大,serial id相同情況下灾而,node id更大的代表提案編號(hào)更大胡控。

3 Implementing a State Machine

A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state. For example, the clients of a distributed banking system might be tellers, and the state-machine state might consist of the account balances of all users. A withdrawal would be performed by executing a state machine command that decreases an account’s balance if and only if the balance is greater than the amount withdrawn, producing as output the old and new balances.

An implementation that uses a single central server fails if that server fails. We therefore instead use a collection of servers, each one independently implementing the state machine. Because the state machine is deterministic, all the servers will produce the same sequences of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.

To guarantee that all servers execute the same sequence of state machine commands, we implement a sequence of separate instances of the Paxos consensus algorithm, the value chosen by the i th instance being the i th state machine command in the sequence. Each server plays all the roles (proposer, acceptor, and learner) in each instance of the algorithm. For now, I assume that the set of servers is fixed, so all instances of the consensus algorithm use the same sets of agents.

狀態(tài)機(jī)是paxos連續(xù)確定提案值的消費(fèi)者,狀態(tài)機(jī)消費(fèi)順序和提案值確定順序一致旁趟。

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. Clients send commands to the leader, who decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm. It will usually succeed. It might fail because of failures, or because another server also believes itself to be the leader and has a different idea of what the 135th command should be. But the consensus algorithm ensures that at most one command can be chosen as the 135th one.

Paxos協(xié)議也建議選擇一個(gè)leader節(jié)點(diǎn)昼激,所有的提案都有l(wèi)eader發(fā)起,以此減少可能出現(xiàn)的提案沖突锡搜,進(jìn)而提升性能癣猾。
Paxos也采用了“主備”模式,那么它和典型的主備模式有何區(qū)別呢余爆?
答:Paxos中選舉的主只是為了提升性能纷宇,在出現(xiàn)雙主、甚至多主時(shí)依舊可以正常工作蛾方,并保證分布式系統(tǒng)的一致性

Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. Recall that, after completing phase 1 of the proposer’s algorithm, either the value to be proposed is determined or else the proposer is free to propose any value.

I will now describe how the Paxos state machine implementation works during normal operation. Later, I will discuss what can go wrong. I consider what happens when the previous leader has just failed and a new leader has been selected. (System startup is a special case in which no commands have yet been proposed.)

The new leader, being a learner in all instances of the consensus algorithm, should know most of the commands that have already been chosen. Suppose it knows commands 1–134, 138, and 139—that is, the values chosen in instances 1–134, 138, and 139 of the consensus algorithm. (We will see later how such a gap in the command sequence could arise.) It then executes phase 1 of instances 135–137 and of all instances greater than 139.(I describe below how this is done.) Suppose that the outcome of these executions determine the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The leader then executes phase 2 for instances 135 and 140, thereby choosing commands 135 and 140.

新的leader做為learner可以習(xí)得所有的選中值像捶,因此這里應(yīng)該是all the commands that have already been chosen。135-137桩砰、140+均無選中值拓春,需要發(fā)起提案確定。

The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1–135. However, it can’t execute commands 138–140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “noop” command that leaves the state unchanged. (It does this by executing phase 2 of instances 136 and 137 of the consensus algorithm.) Once these no-op commands have been chosen, commands 138–140 can be executed.

noop指的是No prepare operation亚隅,即該類操作跳過Prepare階段硼莽,直接執(zhí)行accept確定提案值。后面馬上會(huì)講煮纵,何種場(chǎng)景運(yùn)行執(zhí)行noop操作懂鸵。

Commands 1–140 have now been chosen. The leader has also completed phase 1 for all instances greater than 140 of the consensus algorithm, and it is free to propose any value in phase 2 of those instances. It assigns command number 141 to the next command requested by a client, proposing it as the value in phase 2 of instance 141 of the consensus algorithm. It proposes the next client command it receives as command 142, and so on.

The leader can propose command 142 before it learns that its proposed command 141 has been chosen. It’s possible for all the messages it sent in proposing command 141 to be lost, and for command 142 to be chosen before any other server has learned what the leader proposed as command 141. When the leader fails to receive the expected response to its phase 2 messages in instance 141, it will retransmit those messages. If all goes well, its proposed command will be chosen. However, it could fail first, leaving a gap in the sequence of chosen commands. In general, suppose a leader can get α commands ahead—that is, it can propose commands i + 1 through i +α after commands 1 through i are chosen. A gap of up to α?1 commands could then arise.

Gap的產(chǎn)生:在命令1到i被選中之后,leader預(yù)取α個(gè)命令同時(shí)發(fā)起提案行疏。那么最多可以產(chǎn)生α-1個(gè)間隔(最后一個(gè)成功匆光,之前的α-1個(gè)全部失敗)。

A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm—in the scenario above, for instances 135–137 and all instances greater than 139. Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers. In phase 1, an acceptor responds with more than a simple OK only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instances 135 and 140.) Thus, a server (acting as acceptor) can respond for all instances with a single reasonably short message. Executing these infinitely many instances of phase 1 therefore poses no problem.

Since failure of the leader and election of a new one should be rare events, the effective cost of executing a state machine command—that is, of achieving consensus on the command/value—is the cost of executing only phase 2 of the consensus algorithm. It can be shown that phase 2 of the Paxos consensus algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults [2]. Hence, the Paxos algorithm is essentially optimal.

This discussion of the normal operation of the system assumes that there is always a single leader, except for a brief period between the failure of the current leader and the election of a new one. In abnormal circumstances, the leader election might fail. If no server is acting as leader, then no new commands will be proposed. If multiple servers think they are leaders, then they can all propose values in the same instance of the consensus algorithm, which could prevent any value from being chosen. However, safety is preserved—two different servers will never disagree on the value chosen as the i th state machine command. Election of a single leader is needed only to ensure progress.

為何允許跳過Prepare階段酿联?

  • 如果當(dāng)前處于單主模式终息,只有一個(gè)propser發(fā)起提案,一次Prepare請(qǐng)求適用于后續(xù)所有的提案贞让。
  • 如果當(dāng)前處于無主模式周崭,沒有proposer發(fā)起提案。
  • 如果當(dāng)前處于多主模式喳张,多個(gè)proposer同時(shí)發(fā)起提案续镇,accept請(qǐng)求失敗重新觸發(fā)Prepare階段,由paxos協(xié)議本身保證一致性蹲姐。

If the set of servers can change, then there must be some way of determining what servers implement what instances of the consensus algorithm. The easiest way to do this is through the state machine itself. The current set of servers can be made part of the state and can be changed with ordinary state-machine commands. We can allow a leader to get α commands ahead by letting the set of servers that execute instance i + α of the consensus algorithm be specified by the state after execution of the i th state machine command. This permits a simple implementation of an arbitrarily sophisticated reconfiguration algorithm.

References

[1] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.
[2] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults—a tutorial. TechnicalReport MIT-LCS-TR-821, Laboratory for Computer Science, Massachusetts Institute Technology, Cambridge, MA, 02139, May 2001. also published in SIGACT News 32(2) (June 2001).
[3] Leslie Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978.
[4] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[5] Leslie Lamport. The part-time parliament. ACM Transactions on Com- puter Systems, 16(2):133–169, May 1998.


【轉(zhuǎn)載請(qǐng)注明】隨安居士. Paxos made simple 釋譯. 2017.11.11

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末磨取,一起剝皮案震驚了整個(gè)濱河市人柿,隨后出現(xiàn)的幾起案子柴墩,更是在濱河造成了極大的恐慌忙厌,老刑警劉巖,帶你破解...
    沈念sama閱讀 211,348評(píng)論 6 491
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件江咳,死亡現(xiàn)場(chǎng)離奇詭異逢净,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)歼指,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 90,122評(píng)論 2 385
  • 文/潘曉璐 我一進(jìn)店門爹土,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人踩身,你說我怎么就攤上這事胀茵。” “怎么了挟阻?”我有些...
    開封第一講書人閱讀 156,936評(píng)論 0 347
  • 文/不壞的土叔 我叫張陵琼娘,是天一觀的道長(zhǎng)。 經(jīng)常有香客問我附鸽,道長(zhǎng)脱拼,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 56,427評(píng)論 1 283
  • 正文 為了忘掉前任坷备,我火速辦了婚禮熄浓,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘省撑。我一直安慰自己赌蔑,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 65,467評(píng)論 6 385
  • 文/花漫 我一把揭開白布竟秫。 她就那樣靜靜地躺著惯雳,像睡著了一般。 火紅的嫁衣襯著肌膚如雪鸿摇。 梳的紋絲不亂的頭發(fā)上石景,一...
    開封第一講書人閱讀 49,785評(píng)論 1 290
  • 那天,我揣著相機(jī)與錄音拙吉,去河邊找鬼潮孽。 笑死,一個(gè)胖子當(dāng)著我的面吹牛筷黔,可吹牛的內(nèi)容都是我干的往史。 我是一名探鬼主播,決...
    沈念sama閱讀 38,931評(píng)論 3 406
  • 文/蒼蘭香墨 我猛地睜開眼佛舱,長(zhǎng)吁一口氣:“原來是場(chǎng)噩夢(mèng)啊……” “哼椎例!你這毒婦竟也來了挨决?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 37,696評(píng)論 0 266
  • 序言:老撾萬榮一對(duì)情侶失蹤订歪,失蹤者是張志新(化名)和其女友劉穎脖祈,沒想到半個(gè)月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體刷晋,經(jīng)...
    沈念sama閱讀 44,141評(píng)論 1 303
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡盖高,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 36,483評(píng)論 2 327
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了眼虱。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片喻奥。...
    茶點(diǎn)故事閱讀 38,625評(píng)論 1 340
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖捏悬,靈堂內(nèi)的尸體忽然破棺而出撞蚕,到底是詐尸還是另有隱情,我是刑警寧澤过牙,帶...
    沈念sama閱讀 34,291評(píng)論 4 329
  • 正文 年R本政府宣布甥厦,位于F島的核電站,受9級(jí)特大地震影響抒和,放射性物質(zhì)發(fā)生泄漏矫渔。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,892評(píng)論 3 312
  • 文/蒙蒙 一摧莽、第九天 我趴在偏房一處隱蔽的房頂上張望庙洼。 院中可真熱鬧,春花似錦镊辕、人聲如沸油够。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,741評(píng)論 0 21
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽石咬。三九已至,卻和暖如春卖哎,著一層夾襖步出監(jiān)牢的瞬間鬼悠,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,977評(píng)論 1 265
  • 我被黑心中介騙來泰國(guó)打工亏娜, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留焕窝,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 46,324評(píng)論 2 360
  • 正文 我出身青樓维贺,卻偏偏與公主長(zhǎng)得像它掂,于是被迫代替她去往敵國(guó)和親。 傳聞我的和親對(duì)象是個(gè)殘疾皇子溯泣,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 43,492評(píng)論 2 348

推薦閱讀更多精彩內(nèi)容