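Preface: Throughout Hadoop 1.x, the NameNode of an HDFS cluster was a single point of failure. The cluster had only one NameNode, which maintained all HDFS metadata; when the server hosting it went down or the service became unavailable, the whole HDFS cluster was unavailable, which severely limited the use of HDFS in production. Hadoop 2.0 introduced a High Availability (HA) solution, and after several releases of iteration it is now widely deployed in production.
Solution: run two NameNodes in the same HDFS cluster as an active/standby pair. One is the primary NameNode, in the Active state, and the other is the backup NameNode, in the Standby state. Only the Active NameNode serves reads and writes; the Standby NameNode follows the state of the Active NameNode and switches to Active when necessary.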
[NameNode HA architecture diagram]
ZKFC
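ZKFC, short for ZKFailoverController, runs as a separate process and controls NameNode failover. It monitors the health of the NameNode, and when it detects that the Active NameNode is abnormal, it runs an active/standby election through the ZooKeeper cluster to switch the Active and Standby states.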
HealthMonitor
Periodically calls the NameNode's HAServiceProtocol RPC interface (monitorHealth and getServiceStatus) to monitor the NameNode's health and report it to ZKFC.
ActiveStandbyElector
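Receives election requests from ZKFC and completes the active/standby election automatically through ZooKeeper. When the election finishes, it calls back ZKFC's failover methods to switch the NameNode between Active and Standby.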
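JournalNode cluster
The shared storage system, responsible for storing HDFS metadata (the edit log). The Active NameNode (which writes) and the Standby NameNode (which reads) synchronize metadata through it. During a failover, the new Active NameNode must make sure metadata synchronization is complete before it starts serving.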
[How ZKFC works]
On startup, ZKFailoverController initializes both the HealthMonitor and ActiveStandbyElector services and registers the corresponding callback methods with each of them:
private int doRun(String[] args) throws Exception {
  try {
    initZK(); // initialize the ActiveStandbyElector service
  } catch (KeeperException ke) {
    LOG.error("Unable to start failover controller. Unable to connect "
        + "to ZooKeeper quorum at " + zkQuorum + ". Please check the "
        + "configured value for " + ZK_QUORUM_KEY + " and ensure that "
        + "ZooKeeper is running.", ke);
    return ERR_CODE_NO_ZK;
  }
  ......
  try {
    initRPC();
    initHM(); // initialize the HealthMonitor service
    startRPC();
    mainLoop();
  } catch (Exception e) {
    LOG.error("The failover controller encounters runtime error: ", e);
    throw e;
  } finally {
    rpcServer.stopAndJoin();
    elector.quitElection(true);
    healthMonitor.shutdown();
    healthMonitor.join();
  }
  return 0;
}
I. State monitoring
HealthMonitor tracks two kinds of NameNode state: HealthMonitor.State and HealthMonitor.HAServiceStatus. It starts a thread that repeatedly calls the methods of the NameNode's HAServiceProtocol RPC interface to check the NameNode's state, and notifies ZKFailoverController of state changes through callbacks.
HealthMonitor.State includes:
INITIALIZING: The health monitor is still starting up;
SERVICE_NOT_RESPONDING: The service is not responding to health check RPCs;
SERVICE_HEALTHY: The service is connected and healthy;
SERVICE_UNHEALTHY: The service is running but unhealthy;
HEALTH_MONITOR_FAILED: The health monitor itself failed unrecoverably and can no longer provide accurate information;
HealthMonitor.HAServiceStatus includes:
INITIALIZING: The NameNode is starting up;
ACTIVE: The NameNode's current role is Active;
STANDBY: The NameNode's current role is Standby;
STOPPING: The NameNode has stopped running;
When HealthMonitor detects a change in the NameNode's health state or role state, ZKFC decides from that change whether an active/standby election is needed.
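The polling-and-callback pattern described above can be illustrated with a minimal sketch. This is not the Hadoop HealthMonitor class: SimpleHealthMonitor, its Callback interface, and the boolean probe that stands in for the monitorHealth RPC are all hypothetical names chosen for illustration, and the loop is driven by hand instead of by a background thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical, simplified stand-in for HealthMonitor (not the Hadoop class).
class SimpleHealthMonitor {
    enum State { INITIALIZING, SERVICE_NOT_RESPONDING, SERVICE_HEALTHY, SERVICE_UNHEALTHY }

    // Hypothetical callback type; a failover controller would register one of these.
    interface Callback {
        void enteredState(State newState);
    }

    private final Supplier<Boolean> healthCheck; // stands in for the monitorHealth RPC
    private final List<Callback> callbacks = new ArrayList<>();
    private State state = State.INITIALIZING;

    SimpleHealthMonitor(Supplier<Boolean> healthCheck) {
        this.healthCheck = healthCheck;
    }

    void addCallback(Callback cb) {
        callbacks.add(cb);
    }

    State getState() {
        return state;
    }

    // One iteration of the polling loop: probe the service, notify only on a change.
    void checkOnce() {
        State next;
        try {
            next = healthCheck.get() ? State.SERVICE_HEALTHY : State.SERVICE_UNHEALTHY;
        } catch (RuntimeException e) {
            // The probe itself failed, so treat the service as not responding.
            next = State.SERVICE_NOT_RESPONDING;
        }
        if (next != state) {
            state = next;
            for (Callback cb : callbacks) {
                cb.enteredState(next);
            }
        }
    }

    public static void main(String[] args) {
        boolean[] healthy = {true};
        SimpleHealthMonitor hm = new SimpleHealthMonitor(() -> healthy[0]);
        hm.addCallback(s -> System.out.println("entered " + s));
        hm.checkOnce();     // INITIALIZING -> SERVICE_HEALTHY, callback fires
        hm.checkOnce();     // no change, no callback
        healthy[0] = false;
        hm.checkOnce();     // SERVICE_HEALTHY -> SERVICE_UNHEALTHY, callback fires
    }
}
```

Reporting only transitions, not every probe result, keeps the controller's election logic from being re-triggered on every health check.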
II. Deciding whether to run an election
How changes in HealthMonitor.State lead to different follow-up actions:
/**
 * Check the current state of the service, and join the election
 * if it should be in the election.
 */
private void recheckElectability() {
  // Maintain lock ordering of elector -> ZKFC
  synchronized (elector) {
    synchronized (this) {
      boolean healthy = lastHealthState == State.SERVICE_HEALTHY;
      long remainingDelay = delayJoiningUntilNanotime - System.nanoTime();
      if (remainingDelay > 0) {
        if (healthy) {
          LOG.info("Would have joined master election, but this node is " +
              "prohibited from doing so for " +
              TimeUnit.NANOSECONDS.toMillis(remainingDelay) + " more ms");
        }
        scheduleRecheck(remainingDelay);
        return;
      }
      switch (lastHealthState) {
      case SERVICE_HEALTHY:
        // Call ActiveStandbyElector's joinElection to start an active/standby election
        elector.joinElection(targetToData(localTarget));
        if (quitElectionOnBadState) {
          quitElectionOnBadState = false;
        }
        break;
      case INITIALIZING:
        LOG.info("Ensuring that " + localTarget + " does not " +
            "participate in active master election");
        // Call ActiveStandbyElector's quitElection(false): delete the ephemeral
        // node created in ZK and leave the election, without fencing
        elector.quitElection(false);
        serviceState = HAServiceState.INITIALIZING;
        break;
      case SERVICE_UNHEALTHY:
      case SERVICE_NOT_RESPONDING:
        LOG.info("Quitting master election for " + localTarget +
            " and marking that fencing is necessary");
        // Call ActiveStandbyElector's quitElection(true): delete the ephemeral
        // node created in ZK and leave the election, marking that fencing is needed
        elector.quitElection(true);
        serviceState = HAServiceState.INITIALIZING;
        break;
      case HEALTH_MONITOR_FAILED:
        fatalError("Health monitor failed!");
        break;
      default:
        throw new IllegalArgumentException("Unhandled state:"
            + lastHealthState);
      }
    }
  }
}
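HAServiceStatus plays only a supporting role in state detection. When HAServiceStatus changes, ZKFC checks whether the HAServiceStatus returned by the NameNode matches the state ZKFC expects. If it does not, ZKFC calls ActiveStandbyElector's quitElection method to delete the ephemeral node it created in ZK and leave the election. The first mismatch is tolerated, because it may be caused by a transition that is still in progress.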
void verifyChangedServiceState(HAServiceState changedState) {
  synchronized (elector) {
    synchronized (this) {
      if (serviceState == HAServiceState.INITIALIZING) {
        if (quitElectionOnBadState) {
          LOG.debug("rechecking for electability from bad state");
          recheckElectability();
        }
        return;
      }
      if (changedState == serviceState) {
        serviceStateMismatchCount = 0;
        return;
      }
      if (serviceStateMismatchCount == 0) {
        // recheck one more time. As this might be due to parallel transition.
        serviceStateMismatchCount++;
        return;
      }
      // quit the election as the expected state and reported state
      // mismatches.
      LOG.error("Local service " + localTarget
          + " has changed the serviceState to " + changedState
          + ". Expected was " + serviceState
          + ". Quitting election marking fencing necessary.");
      delayJoiningUntilNanotime = System.nanoTime()
          + TimeUnit.MILLISECONDS.toNanos(1000);
      elector.quitElection(true);
      quitElectionOnBadState = true;
      serviceStateMismatchCount = 0;
      serviceState = HAServiceState.INITIALIZING;
    }
  }
}
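III. Active/standby election
ZKFC starts the NameNode active/standby election through ActiveStandbyElector's joinElection method. The process relies on ZooKeeper's write consistency and its ephemeral-node mechanism: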
a. When an election starts, the ActiveStandbyElector tries to create the ephemeral node /hadoop-ha/${dfs.nameservices}/ActiveStandbyElectorLock in ZooKeeper. ZooKeeper's write consistency guarantees that only one ActiveStandbyElector succeeds. The NameNode of the successful ActiveStandbyElector becomes the primary NameNode: the elector calls back ZKFC to switch that NameNode to Active. The NameNode of an ActiveStandbyElector that fails becomes a backup NameNode: the elector calls back ZKFC to switch that NameNode to Standby;
private void joinElectionInternal() {
  Preconditions.checkState(appData != null,
      "trying to join election without any app data");
  if (zkClient == null) {
    if (!reEstablishSession()) {
      fatalError("Failed to reEstablish connection with ZooKeeper");
      return;
    }
  }
  createRetryCount = 0;
  wantToBeInElection = true;
  createLockNodeAsync(); // create the ephemeral lock node
}
b. Whether or not it wins the election, every ActiveStandbyElector registers a Watcher with ZooKeeper to listen for state-change events on this node;
private void monitorLockNodeAsync() {
  if (monitorLockNodePending && monitorLockNodeClient == zkClient) {
    LOG.info("Ignore duplicate monitor lock-node request.");
    return;
  }
  monitorLockNodePending = true;
  monitorLockNodeClient = zkClient;
  zkClient.exists(zkLockFilePath, watcher, this, zkClient); // register the Watcher with ZooKeeper
}
c. If the HealthMonitor of the Active NameNode detects that the NameNode is abnormal, ZKFC deletes the ephemeral node ActiveStandbyElectorLock that it created in ZooKeeper. The Watcher registered by the Standby NameNode's ActiveStandbyElector then receives a NodeDeleted event for this node and immediately tries to create ActiveStandbyElectorLock again. If the creation succeeds, the Standby NameNode is elected as the Active NameNode.
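Steps a to c can be traced end to end with a small simulation. The sketch below deliberately avoids the ZooKeeper client: LockNode is a hypothetical in-memory stand-in for the ActiveStandbyElectorLock ephemeral node, and Elector is a hypothetical, heavily simplified counterpart of ActiveStandbyElector. It only illustrates the "first creator wins, everyone watches, retry on deletion" pattern under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an in-memory lock stands in for the ZooKeeper znode.
class ElectionSketch {

    // Stand-in for the ActiveStandbyElectorLock ephemeral node (not the ZooKeeper API).
    static class LockNode {
        private String owner; // null means the node does not exist
        private final List<Runnable> watchers = new ArrayList<>();

        // Like creating an ephemeral node: succeeds only if the node is absent.
        synchronized boolean tryCreate(String candidate) {
            if (owner != null) {
                return false;
            }
            owner = candidate;
            return true;
        }

        synchronized void watchDeletion(Runnable watcher) {
            watchers.add(watcher);
        }

        // Removes the node if 'candidate' owns it, then fires the one-shot watchers.
        void delete(String candidate) {
            List<Runnable> toNotify;
            synchronized (this) {
                if (!candidate.equals(owner)) {
                    return;
                }
                owner = null;
                toNotify = new ArrayList<>(watchers);
                watchers.clear();
            }
            for (Runnable w : toNotify) {
                w.run();
            }
        }

        synchronized String owner() {
            return owner;
        }
    }

    // Hypothetical, simplified counterpart of ActiveStandbyElector.
    static class Elector {
        private final String name;
        private final LockNode lock;
        private boolean wantToBeInElection;
        private boolean active;

        Elector(String name, LockNode lock) {
            this.name = name;
            this.lock = lock;
        }

        void joinElection() {
            wantToBeInElection = true;
            active = lock.tryCreate(name);           // step a: the winner becomes Active
            lock.watchDeletion(this::onLockDeleted); // step b: everyone watches the node
        }

        private void onLockDeleted() {
            if (wantToBeInElection) {
                joinElection();                      // step c: try to take the lock again
            }
        }

        void quitElection() {
            wantToBeInElection = false;
            active = false;
            lock.delete(name);
        }

        boolean isActive() {
            return active;
        }
    }

    public static void main(String[] args) {
        LockNode lock = new LockNode();
        Elector nn1 = new Elector("nn1", lock);
        Elector nn2 = new Elector("nn2", lock);
        nn1.joinElection();
        nn2.joinElection();
        System.out.println("nn1 active: " + nn1.isActive() + ", nn2 active: " + nn2.isActive());
        nn1.quitElection(); // simulates ZKFC deleting the lock after a failed health check
        System.out.println("nn1 active: " + nn1.isActive() + ", nn2 active: " + nn2.isActive());
    }
}
```

In a real deployment the node also disappears when the owner's ZooKeeper session expires, which this sketch does not model.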
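[Preventing split-brain]
In distributed systems, split-brain is also called the dual-active problem. ZooKeeper "false death", long garbage-collection pauses, or other causes can leave two NameNodes Active at the same time. Both can then serve clients, and data consistency can no longer be guaranteed. In production this is catastrophic, so it must be prevented with the built-in fencing mechanism.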
To implement fencing, after successfully creating the ephemeral node /hadoop-ha/${dfs.nameservices}/ActiveStandbyElectorLock, ActiveStandbyElector creates a second, persistent node /hadoop-ha/${dfs.nameservices}/ActiveBreadCrumb, which stores the address of the Active NameNode. When the Active NameNode closes its ZooKeeper session normally (the ephemeral node /hadoop-ha/${dfs.nameservices}/ActiveStandbyElectorLock is deleted along with the session), it also deletes the persistent node /hadoop-ha/${dfs.nameservices}/ActiveBreadCrumb. If the ActiveStandbyElector's ZooKeeper session is closed abnormally, however, /hadoop-ha/${dfs.nameservices}/ActiveBreadCrumb remains, because it is a persistent node. When the other NameNode (standby => active) wins the election, it notices the ActiveBreadCrumb node left behind by the previous Active NameNode and calls back ZKFailoverController to fence the old Active NameNode.
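① First, ZKFC tries to call the transitionToStandby method of the old Active NameNode's HAServiceProtocol RPC interface, to see whether it can be switched to Standby;
② If the transitionToStandby call fails, Hadoop's built-in fencing methods must be run. Hadoop currently provides two main fencing methods: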
sshfence: SSH to the Active NameNode and kill the process;
shell: run an arbitrary shell command to fence the Active NameNode;
Only after fencing has completed successfully does the winning ActiveStandbyElector call back ZKFC's becomeActive method to switch its NameNode to Active and start serving.
private boolean becomeActive() {
  assert wantToBeInElection;
  if (state == State.ACTIVE) {
    // already active
    return true;
  }
  try {
    Stat oldBreadcrumbStat = fenceOldActive(); // fence the old Active NameNode
    writeBreadCrumbNode(oldBreadcrumbStat); // update the Active NameNode address stored in ActiveBreadCrumb
    if (LOG.isDebugEnabled()) {
      LOG.debug("Becoming active for " + this);
    }
    appClient.becomeActive(); // the winning ActiveStandbyElector switches its NameNode to Active
    state = State.ACTIVE;
    return true;
  } catch (Exception e) {
    LOG.warn("Exception handling the winning of election", e);
    // Caller will handle quitting and rejoining the election.
    return false;
  }
}
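The fencing order described above (a graceful transitionToStandby first, then the configured fencing methods, and promotion only after one of them succeeds) can be sketched as follows. FencingSketch and its method signatures are hypothetical: the BooleanSupplier arguments stand in for the RPC call and for fencing methods such as sshfence, and the boolean flag stands in for finding an ActiveBreadCrumb node left by the previous Active NameNode. It is a sketch of the control flow only, not Hadoop's implementation.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the fencing control flow (not Hadoop code).
class FencingSketch {

    // Returns true only when the old Active NameNode is known to be out of service.
    static boolean fenceOldActive(BooleanSupplier transitionToStandby,
                                  List<BooleanSupplier> fencingMethods) {
        try {
            if (transitionToStandby.getAsBoolean()) {
                return true; // step 1: the graceful RPC worked, nothing else to do
            }
        } catch (RuntimeException e) {
            // The RPC failed (for example the old node is hung); fall through.
        }
        for (BooleanSupplier method : fencingMethods) {
            if (method.getAsBoolean()) {
                return true; // step 2: the first fencing method that succeeds is enough
            }
        }
        return false;
    }

    // The election winner may become Active only if no fencing is needed or fencing succeeded.
    static boolean becomeActive(boolean breadcrumbLeftBehind,
                                BooleanSupplier transitionToStandby,
                                List<BooleanSupplier> fencingMethods) {
        if (breadcrumbLeftBehind && !fenceOldActive(transitionToStandby, fencingMethods)) {
            return false; // stay Standby; the caller would quit and rejoin the election
        }
        return true;
    }

    public static void main(String[] args) {
        BooleanSupplier rpcFails = () -> false;
        List<BooleanSupplier> methods = List.of(() -> false, () -> true);
        System.out.println("promoted: " + becomeActive(true, rpcFails, methods));
    }
}
```

Refusing promotion when every fencing attempt fails trades availability for safety: the cluster may stay without an Active NameNode, but it cannot end up with two.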