ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper is an open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, groups, and naming. It is designed to be easy to program to, and it uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
Design Goals
- ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace organized much like a standard file system. The namespace consists of data registers, called znodes, which are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in memory, which means ZooKeeper can achieve high throughput and low latency. The ZooKeeper implementation puts a premium on high-performance, highly available, strictly ordered access. The performance aspects mean it can be used in large distributed systems; the reliability aspects keep it from being a single point of failure; and the strict ordering means that sophisticated synchronization primitives can be implemented at the client.
- ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is replicated over a set of hosts called an ensemble.
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of the state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service is available. Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server (a minimal connection sketch follows this list).
- ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use this order to implement higher-level abstractions, such as synchronization primitives.
- ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
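As a rough illustration of the client/ensemble relationship described above, here is a minimal sketch using the standard Java binding. The host names (zk1, zk2, zk3) and the session timeout are assumptions made for the example; the comma-separated connect string is what lets the client library fail over to another server when its current TCP connection breaks.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Hypothetical ensemble of three servers; the client picks one and
        // reconnects to another if its TCP connection breaks.
        ZooKeeper zk = new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181",
                15000,                       // session timeout in milliseconds
                (WatchedEvent event) -> {    // default watcher: observe connection state changes
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });

        connected.await();                   // wait until the session is established
        System.out.println("Session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```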
Data model and the hierarchical namespace
The namespace provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's namespace is identified by a path.
Nodes and ephemeral nodes
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, and so on, so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode, and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
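A minimal sketch of how the version number in the stat structure supports a conditional update, assuming an already-connected handle zk and an existing znode /app/config (both hypothetical names): the write is rejected with a BadVersionException if another client changed the data in between.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdateExample {
    // Assumes 'zk' is an already-connected ZooKeeper handle and
    // '/app/config' is an existing znode (both hypothetical).
    static void bumpConfig(ZooKeeper zk) throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        byte[] current = zk.getData("/app/config", false, stat);  // a read returns the data plus its version

        byte[] updated = (new String(current) + ";updated").getBytes();
        try {
            // Conditional write: succeeds only if no one changed the znode
            // since we read version stat.getVersion().
            zk.setData("/app/config", updated, stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // Another client wrote first; re-read and retry, or give up.
        }
    }
}
```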
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active. When the session ends, the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
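A sketch of the session-bound lifetime of an ephemeral znode, assuming two already-connected handles zkA and zkB and a hypothetical parent znode /workers: the node created by session A is visible to session B only while session A is alive.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralExample {
    // Assumes zkA and zkB are two already-connected sessions and that the
    // parent znode /workers already exists (hypothetical names).
    static void demo(ZooKeeper zkA, ZooKeeper zkB) throws Exception {
        // Session A advertises itself with an ephemeral znode.
        zkA.create("/workers/worker-1", "host-a:9000".getBytes(),
                   ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Session B can see it while session A is alive.
        System.out.println(zkB.exists("/workers/worker-1", false) != null);  // true

        // When session A ends, ZooKeeper deletes the ephemeral znode.
        zkA.close();
        Thread.sleep(100);  // give session B's server a moment to apply the close
        System.out.println(zkB.exists("/workers/worker-1", false) != null);  // false
    }
}
```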
Conditional updates and watches
ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
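A small sketch of a one-shot watch, assuming a connected handle zk and a hypothetical znode /app/config: the watch is registered by the read, fires once when the znode changes, and has to be set again if further notifications are wanted.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchExample {
    // Assumes 'zk' is an already-connected handle and '/app/config'
    // is an existing znode (hypothetical names).
    static void watchConfig(ZooKeeper zk) throws Exception {
        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Triggered once; set a new watch if further notifications are needed.
                System.out.println("Watch fired: " + event.getType() + " on " + event.getPath());
            }
        };

        // Reading the data with a watcher registers a one-shot watch on the znode.
        byte[] data = zk.getData("/app/config", watcher, new Stat());
        System.out.println("Current config: " + new String(data));

        // Any later setData or delete on /app/config delivers one event to 'watcher'.
    }
}
```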
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients' view of the system is guaranteed to be up-to-date within a certain time bound.
Simple API
One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:
create: creates a node at a location in the tree
delete: deletes a node
exists: tests if a node exists at a location
get data: reads the data from a node
set data: writes data to a node
get children: retrieves a list of children of a node
sync: waits for data to be propagated
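The sketch below walks through most of these operations with the Java binding, under the assumption that a server is reachable at localhost:2181 and that the paths it uses (/demo, /demo/child) are free; it illustrates the call shapes rather than a recommended production pattern.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiTour {
    public static void main(String[] args) throws Exception {
        // Assumes a ZooKeeper server reachable at localhost:2181.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

        // create: add nodes to the tree.
        zk.create("/demo", "hello".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/demo/child", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: returns the stat structure, or null if the node is absent.
        Stat stat = zk.exists("/demo", false);

        // get data / set data: read and replace the node's data atomically.
        byte[] data = zk.getData("/demo", false, stat);
        System.out.println(new String(data));
        zk.setData("/demo", "hello again".getBytes(), stat.getVersion());

        // get children: list the node's children.
        System.out.println(zk.getChildren("/demo", false));   // [child]

        // delete: children first, then the parent (a znode with children cannot be deleted).
        zk.delete("/demo/child", -1);   // version -1 skips the version check
        zk.delete("/demo", -1);

        zk.close();
    }
}
```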
Implementation
ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server's database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with the leader.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms this into a transaction that captures the new state.
Uses
The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher-order operations, such as synchronization primitives, group membership, ownership, and so on. Some distributed applications have used it for: [tbd: add uses from white paper and video presentation.]
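As one example of such a higher-order operation, the sketch below outlines group membership built from ephemeral sequential znodes and a children watch, assuming a connected handle zk and a hypothetical group path /groups/workers that already exists. Each member announces itself by creating an ephemeral child; the live membership is simply the children list, and it shrinks automatically when a member's session dies.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembershipSketch {
    private static final String GROUP = "/groups/workers";  // hypothetical, assumed to exist

    // Join the group: an ephemeral sequential child that vanishes with our session.
    static String join(ZooKeeper zk, String memberInfo) throws Exception {
        return zk.create(GROUP + "/member-", memberInfo.getBytes(),
                         ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // Read the current membership and re-arm a watch so changes are noticed.
    static List<String> members(ZooKeeper zk) throws Exception {
        return zk.getChildren(GROUP, event -> {
            // A child was added or removed; list the members again to refresh the view.
            try {
                members(zk);
            } catch (Exception ignored) { }
        });
    }
}
```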
Performance
ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper development team at Yahoo! Research indicate that it is. It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers.
Reliability
To show the behavior of the system over time as failures are injected, we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio relative to our expected workloads.
There are a few important observations from this graph. First, if followers fail and recover quickly, ZooKeeper is able to sustain a high throughput despite the failure. Perhaps more importantly, the leader election algorithm allows the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200 ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.
Original text
ZooKeeper
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications the responsibility of implementing coordination services from scratch.
Design Goals
ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The name space consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper means it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
ZooKeeper Service
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.
ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
Data model and the hierarchical namespace
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.
ZooKeeper's Hierarchical Namespace
Nodes and ephemeral nodes
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
Conditional updates and watches
ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients' view of the system is guaranteed to be up-to-date within a certain time bound.
For more information on these, and how they can be used, see [tbd]
Simple API
One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:
create
creates a node at a location in the tree
delete
deletes a node
exists
tests if a node exists at a location
get data
reads the data from a node
set data
writes data to a node
get children
retrieves a list of children of a node
sync
waits for data to be propagated
For a more in-depth discussion on these, and how they can be used to implement higher level operations, please refer to [tbd]
Implementation
ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
ZooKeeper Components
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.
Uses
The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]
Performance
ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)
ZooKeeper Throughput as the Read-Write Ratio Varies
The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2Ghz Xeon and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. "Servers" indicate the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.
Benchmarks also indicate that it is reliable, too. Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:
Failure and recovery of a follower
Failure and recovery of a different follower
Failure of the leader
Failure and recovery of two followers
Failure of another leader
Reliability
To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.
Reliability in the Presence of Errors
There are a few important observations from this graph. First, if followers fail and recover quickly, ZooKeeper is able to sustain a high throughput despite the failure. But maybe more importantly, the leader election algorithm allows for the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.
The ZooKeeper Project
ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.
All users and developers are encouraged to join the community and contribute their expertise. See the Zookeeper Project on Apache for more information.