HBase File Locality in HDFS (reposted)

One of the more ambiguous things in Hadoop is block replication: it happens automatically and you should not have to worry about it. HBase relies on it 100% to provide data safety as it stores its files in the distributed file system. While that works completely transparently, one of the more advanced questions asked is: how does this affect performance? This usually arises when the user starts writing MapReduce jobs against either HBase or Hadoop directly. Especially with larger data being stored in HBase, how does the system take care of placing the data close to where it is needed? This is referred to as data locality, and in the case of HBase using the Hadoop file system (HDFS) there may be doubts about how it works.

First let's see how Hadoop handles this. The MapReduce documentation advertises the fact that tasks run close to the data they process. This is achieved by breaking up large files in HDFS into smaller chunks, so-called blocks. That is also the reason why the block size in Hadoop is much larger than what you may know from operating systems and their file systems. The default setting is 64MB, but usually 128MB is chosen, or even larger when you are sure all your files exceed a single block in size. Each block maps to a task that processes the contained data. That also means larger block sizes equal fewer map tasks to run, as the number of mappers is driven by the number of blocks that need processing. Hadoop knows where blocks are located and runs the map tasks directly on one of the nodes that hosts them (replication means there are a few hosts to choose from). This is how it guarantees data locality during MapReduce.
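To make that arithmetic concrete, here is a small, self-contained sketch (not part of Hadoop itself; the file and block sizes are made up) of how the number of map tasks follows from the block size:

public class BlockCountExample {
  public static void main(String[] args) {
    long fileSize = 10L * 1024 * 1024 * 1024;  // a hypothetical 10GB file in HDFS
    long blockSize = 128L * 1024 * 1024;       // block size configured to 128MB

    // Hadoop splits the file into ceil(fileSize / blockSize) blocks and, with the
    // default input format, schedules one map task per block.
    long mappers = (fileSize + blockSize - 1) / blockSize;
    System.out.println("Map tasks needed: " + mappers);  // prints 80

    // Doubling the block size halves the number of mappers.
    long largerBlockSize = 256L * 1024 * 1024;
    System.out.println("With 256MB blocks: " + (fileSize + largerBlockSize - 1) / largerBlockSize);  // prints 40
  }
}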

Back to HBase. When you have arrived at that point with Hadoop and you now understand that it can process data locally, you start to question how this may work with HBase. If you have read my post on HBase's storage architecture you saw that HBase simply stores files in HDFS. It does so for the actual data files (HFile) as well as its log (WAL). And if you look into the code it simply uses FileSystem.create(Path path) to create these. When you then consider two access patterns, a) direct random access and b) MapReduce scanning of tables, you wonder whether care was taken that the HDFS blocks end up close to where they are read by HBase.
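To make the two patterns tangible, here is a minimal sketch using the HBase client API (the table and row names are made up for illustration); both calls end up at the same region server and read from the same HDFS-backed files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class AccessPatterns {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // a) direct random access: a single Get for one row
    Result row = table.get(new Get(Bytes.toBytes("row-12345")));

    // b) sequential access: a Scan over contiguous rows, as a MapReduce job
    //    over the table would issue per region
    ResultScanner scanner = table.getScanner(new Scan());
    for (Result r : scanner) {
      // process r ...
    }
    scanner.close();
    table.close();
  }
}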

One thing upfront: if you do not run Hadoop and HBase on the same cluster, but instead employ a separate Hadoop cluster as well as a stand-alone HBase cluster, then there is no data locality - and there can't be. That is equivalent to running a separate MapReduce cluster, where tasks could not be executed directly on the datanodes. It is imperative for data locality to have Hadoop (as in HDFS), MapReduce and HBase running on the same cluster. End of story.

OK, you have them all co-located on a single (hopefully larger) cluster? Then read on. How does Hadoop figure out where data is located as HBase accesses it? Remember the access patterns above: both go through a single piece of software called a RegionServer. Case a) uses random access patterns while b) scans all contiguous rows of a table, but both do so through the same API. As explained in my referenced post and mentioned above, HBase simply stores files, and those get distributed as replicated blocks across all data nodes of the HDFS. Now imagine you stop HBase after saving a lot of data and restart it subsequently. The region servers are restarted and are assigned a seemingly random set of regions. At this very point there is no data locality guaranteed - how could there be?

The most important factor is that HBase is not restarted frequently and that it performs housekeeping on a regular basis. These so-called compactions rewrite files as new data is added over time. All files in HDFS, once written, are immutable (for all sorts of reasons). Because of that, data is written into new files, and as their number grows HBase compacts them into another set of new, consolidated files. And here is the kicker: HDFS is smart enough to put the data where it is needed! How does that work, you ask? We need to take a deep dive into Hadoop's source code and see how the above FileSystem.create(Path path) that HBase uses works. We are running on HDFS here, so we are actually using DistributedFileSystem.create(Path path), which looks like this:
public FSDataOutputStream create(Path f) throws IOException {
  return create(f, true);
}
It returns an FSDataOutputStream, and that is created like so:
public FSDataOutputStream create(Path f, FsPermission permission,
    boolean overwrite, int bufferSize, short replication, long blockSize,
    Progressable progress) throws IOException {
  return new FSDataOutputStream(dfs.create(getPathName(f), permission,
      overwrite, replication, blockSize, progress, bufferSize), statistics);
}
It uses a DFSClient instance that is the "umbilical" cord connecting the client with the NameNode:
this.dfs = new DFSClient(namenode, conf, statistics);
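As a side note, the same call path can be exercised from any HDFS client. The following is a minimal sketch (path and payload are made up) of creating and writing a file the way HBase effectively does; on an hdfs:// file system the FileSystem handle is a DistributedFileSystem and the stream is backed by a DFSClient:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // a DistributedFileSystem when the default FS is hdfs://

    // Same entry point HBase uses for its HFiles and the WAL
    FSDataOutputStream out = fs.create(new Path("/example/testfile"));
    out.writeBytes("some payload");            // buffered client-side, shipped to the datanodes in the background
    out.close();
    fs.close();
  }
}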

What is returned, though, is a DFSClient.DFSOutputStream instance. As you write data into the stream, the DFSClient aggregates it into "packages" which are then written as blocks to the data nodes. This happens in DFSClient.DFSOutputStream.DataStreamer (please hang in there, we are close!) which runs as a daemon thread in the background. The magic unfolds now in a few hops on the stack. First, in the daemon's run() it gets the list of nodes to store the data on:

nodes = nextBlockOutputStream(src);
This in turn calls:
long startTime = System.currentTimeMillis();
lb = locateFollowingBlock(startTime);
block = lb.getBlock();
nodes = lb.getLocations();
We follow further down and see that locateFollowingBlock() calls:
return namenode.addBlock(src, clientName);

Here is where it all comes together. The name node is called to add a new block, and the src parameter indicates which file it is for, while clientName is the name of the DFSClient instance. I skip one more small method in between and show you the next bigger step involved:
public LocatedBlock getAdditionalBlock(String src, String clientName) throws IOException {
  INodeFileUnderConstruction pendingFile = checkLease(src, clientName);
  fileLength = pendingFile.computeContentSummary().getLength();
  blockSize = pendingFile.getPreferredBlockSize();
  clientNode = pendingFile.getClientNode();
  replication = (int) pendingFile.getReplication();
  // choose targets for the new block to be allocated.
  DatanodeDescriptor targets[] = replicator.chooseTarget(replication, clientNode, null, blockSize);
}
We are finally getting to the core of this code in the replicator.chooseTarget() call:

private DatanodeDescriptor chooseTarget(int numOfReplicas,
    DatanodeDescriptor writer, List<Node> excludedNodes, long blocksize,
    int maxNodesPerRack, List<DatanodeDescriptor> results) {
  if (numOfReplicas == 0 || clusterMap.getNumOfLeaves() == 0) {
    return writer;
  }
  int numOfResults = results.size();
  boolean newBlock = (numOfResults == 0);
  if (writer == null && !newBlock) {
    writer = (DatanodeDescriptor) results.get(0);
  }
  try {
    switch (numOfResults) {
    case 0:
      writer = chooseLocalNode(writer, excludedNodes, blocksize, maxNodesPerRack, results);
      if (--numOfReplicas == 0) {
        break;
      }
    case 1:
      chooseRemoteRack(1, results.get(0), excludedNodes, blocksize, maxNodesPerRack, results);
      if (--numOfReplicas == 0) {
        break;
      }
    case 2:
      if (clusterMap.isOnSameRack(results.get(0), results.get(1))) {
        chooseRemoteRack(1, results.get(0), excludedNodes, blocksize, maxNodesPerRack, results);
      } else if (newBlock) {
        chooseLocalRack(results.get(1), excludedNodes, blocksize, maxNodesPerRack, results);
      } else {
        chooseLocalRack(writer, excludedNodes, blocksize, maxNodesPerRack, results);
      }
      if (--numOfReplicas == 0) {
        break;
      }
    default:
      chooseRandom(numOfReplicas, NodeBase.ROOT, excludedNodes, blocksize, maxNodesPerRack, results);
    }
  } catch (NotEnoughReplicasException e) {
    FSNamesystem.LOG.warn("Not able to place enough replicas, still in need of " + numOfReplicas);
  }
  return writer;
}

Recall that we started with the DFSClient and created a file which was subsequently filled with data. As the blocks need writing out, the above code first checks whether that can be done on the same host that the client is on, i.e. the "writer". That is "case 0". In "case 1" the code tries to find a remote rack to hold a distant replica of the block. Lastly, it fills the list of required replicas with local machines or machines on another rack.
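Put differently, for the common case of a replication factor of three and a new block written by a region server, the choice boils down to the following simplified, standalone sketch (this is not the HDFS code; node and rack names are invented and the topology lookups are stubbed out):

import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
  // Simplified placement for one new block with replication = 3.
  static List<String> chooseTargets(String writerNode) {
    List<String> targets = new ArrayList<String>();
    // case 0: the local node - the datanode on the writer's (region server's) host
    targets.add(writerNode);
    // case 1: a node on a different rack, for rack-level fault tolerance
    targets.add("/rack2/datanode-7");
    // case 2: a second, different node on that same remote rack
    targets.add("/rack2/datanode-8");
    return targets;
  }

  public static void main(String[] args) {
    // e.g. [/rack1/datanode-3, /rack2/datanode-7, /rack2/datanode-8]
    System.out.println(chooseTargets("/rack1/datanode-3"));
  }
}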

So this means for HBase that, as long as the region server stays up long enough (which is the default), after a major compaction on all tables - which can be invoked manually or is triggered by a configuration setting - it has its files local on the same host. The data node that shares the same physical host has a copy of all the data the region server requires. Whether you are running a scan, a get, or any other use case, you can be sure to get the best performance.
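As an aside, if you do not want to wait for the periodic major compaction (governed by hbase.hregion.majorcompaction), it can be kicked off by hand. Here is a minimal sketch using the older HBaseAdmin client API - the table name is made up, and the same can be done from the HBase shell with major_compact 'mytable':

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CompactExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Rewrites all store files of the table into new, consolidated files;
    // the new blocks get a local replica on each hosting region server.
    admin.majorCompact("mytable");
  }
}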

Finally, a good overview of the HDFS design and data replication can be found here. Also note that the HBase team is working on redesigning how the Master assigns regions to servers. The plan is to improve it so that regions are deployed on the server where most of their blocks are. This will be particularly useful after a restart, because it would guarantee better data locality right off the bat. Stay tuned!
