HBase Compaction and Data Locality in Hadoop
1. HBase Compaction and Data Locality With Hadoop
In this Hadoop HBase tutorial on HBase Compaction and Data Locality with Hadoop, we will learn the whole concept of Minor and Major Compaction in HBase, the process by which HBase cleans itself, in detail. Also, we will see Data Locality with Hadoop, because data locality is the solution to data not being available to the Mapper.
So, let’s start HBase Compaction and Data Locality in Hadoop.
HBase Compaction and Data Locality in Hadoop
2. What is HBase Compaction?
As we know, HBase is a distributed data store optimized for read performance. That optimal read performance comes from having one file per column family, but during heavy writes it is not always possible to keep a single file per column family. Hence, to reduce the maximum number of disk seeks needed for a read, HBase tries to combine all the HFiles into one large HFile. This process is what we call compaction.
Do you know about HBase Architecture
In other words, compaction in HBase is the process by which HBase cleans itself, and this process comes in two types: Minor HBase Compaction and Major HBase Compaction.
a. HBase Minor Compaction
The process of combining a configurable number of smaller HFiles into one large HFile is what we call minor compaction. It is quite important, since without it, reading particular rows would need many disk reads and could reduce overall performance.
Here are the several processes involved in HBase minor compaction:
- It creates a bigger HFile by combining smaller HFiles.
- The new HFile still stores the delete markers; minor compaction does not purge them.
- It frees up space, making room to store more data.
- It uses merge sorting.
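The merge-sort step above can be sketched in a few lines. This is a minimal model, not HBase code: it assumes each HFile is simply a sorted list of `(row_key, value)` pairs, and uses Python's `heapq.merge` to perform the same k-way merge that compaction relies on.

```python
import heapq

def minor_compact(hfiles):
    """Merge several small, sorted 'HFiles' into one large sorted one.

    Each HFile is modeled as a sorted list of (row_key, value) tuples;
    heapq.merge performs a k-way merge sort without loading every file
    into memory at once, which is the idea behind minor compaction.
    """
    return list(heapq.merge(*hfiles, key=lambda kv: kv[0]))

# Three small "HFiles", each already sorted by row key
hfile1 = [("row1", "a"), ("row4", "d")]
hfile2 = [("row2", "b"), ("row5", "e")]
hfile3 = [("row3", "c")]

merged = minor_compact([hfile1, hfile2, hfile3])
print(merged)  # one large HFile, still sorted by row key
```

Note that nothing is dropped here: delete markers and old versions would ride along into the merged file, which is exactly the minor-compaction behavior described above.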
HBase Compaction
b. HBase Major compaction
The process of combining all the StoreFiles of a region into a single StoreFile is what we call HBase major compaction. It also deletes the removed and expired versions. As a process, it merges all StoreFiles into a single StoreFile and, by default, runs every 24 hours. However, the region will split into new regions after compaction if the new, larger StoreFile is greater than a certain size (defined by a property).
Have a look at HBase Commands
Well, HBase major compaction, the other kind of compaction, works as follows:
- The data present per column family in one region is accumulated into one HFile.
- All deleted or expired cells are deleted permanently during this process.
- It increases the read performance of the newly created HFile.
- It involves lots of disk I/O.
- It can cause traffic congestion.
- The major compaction process is also called the write amplification process.
- Hence, this process must be scheduled when network I/O demand is at a minimum.
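To make those bullet points concrete, here is a minimal sketch, again assuming HFiles are sorted lists of tuples rather than real HBase files. Each cell is modeled as `(row_key, value, timestamp, is_delete_marker)`; the function merges every StoreFile and permanently drops delete markers, the rows they mask, and cells older than a TTL.

```python
import heapq

def major_compact(storefiles, ttl_seconds, now):
    """Merge all StoreFiles of a region into one, dropping deleted and
    expired cells for good (a simplified model of major compaction)."""
    # Pass 1: collect the row keys covered by a delete marker.
    deleted = {cell[0] for sf in storefiles for cell in sf if cell[3]}
    # Pass 2: k-way merge sort, skipping deleted rows and expired cells.
    out = []
    for row, value, ts, is_delete in heapq.merge(*storefiles, key=lambda c: c[0]):
        if is_delete or row in deleted:
            continue                      # tombstone and masked cells are gone
        if now - ts > ttl_seconds:
            continue                      # expired version is gone
        out.append((row, value, ts, is_delete))
    return out

sf1 = [("row1", "a", 100, False), ("row3", "c", 100, False)]
sf2 = [("row2", "b", 100, False), ("row3", None, 150, True)]   # deletes row3
sf3 = [("row0", "old", 10, False)]                             # expired cell
print(major_compact([sf1, sf2, sf3], ttl_seconds=120, now=200))
# → [('row1', 'a', 100, False), ('row2', 'b', 100, False)]
```

Every input cell is read and rewritten to produce the single output file, which is why the process is I/O-heavy and described as write amplification.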
HBase Major Compaction
3. HBase Compaction Tuning
a. Short Description of HBase Compaction:
Now, to enhance the performance and stability of the HBase cluster, we can use some hidden HBase compaction configuration settings, like the ones below.
b. Disabling Automatic Major Compactions in HBase
Generally, HBase users ask for full control of major compaction events. The method to do that is to set hbase.hregion.majorcompaction to 0, which disables periodic automatic major compactions in HBase.
However, this does not offer 100% control of major compactions: minor compactions can still sometimes be promoted to major ones by HBase automatically. Luckily, we have another configuration option that helps in this case.
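In hbase-site.xml, that setting looks like the following (a configuration sketch; check the defaults for your HBase version):

```xml
<!-- hbase-site.xml: disable periodic automatic major compactions,
     so major compactions run only when triggered explicitly -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
```

With this in place, a major compaction can still be triggered on demand, for example with `major_compact 'table_name'` from the HBase shell.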
Let's take a tour of HBase Operations.
c. Maximum HBase Compaction Selection Size
Another option to control the compaction process in HBase is:
hbase.hstore.compaction.max.size (by default the value is set to LONG.MAX_VALUE)
In HBase 1.2+ we have as well:
hbase.hstore.compaction.max.size.offpeak
d. Off-peak Compactions in HBase
Further, we can use the off-peak configuration settings if our deployment has off-peak hours.
These HBase compaction configuration options must be set to enable off-peak compaction:
hbase.offpeak.start.hour= 0..23
hbase.offpeak.end.hour= 0..23
The compaction file ratio is 5.0 for off-peak hours (by default) and 1.2 for peak hours.
Both can be changed:
hbase.hstore.compaction.ratio
hbase.hstore.compaction.ratio.offpeak
The higher the file ratio value, the more aggressive (frequent) the compaction will be. So, for the majority of deployments, the default values are fine.
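Putting the off-peak settings together, a hypothetical hbase-site.xml fragment might look like this (the 11 PM to 7 AM window is only an example; the property names and the 5.0 ratio come from the text above):

```xml
<!-- hbase-site.xml: treat 11 PM to 7 AM as off-peak hours -->
<property>
  <name>hbase.offpeak.start.hour</name>
  <value>23</value>
</property>
<property>
  <name>hbase.offpeak.end.hour</name>
  <value>7</value>
</property>
<!-- compact more aggressively during that window -->
<property>
  <name>hbase.hstore.compaction.ratio.offpeak</name>
  <value>5.0</value>
</property>
```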
4. Data Locality in Hadoop
As we know, in Hadoop, datasets are stored in HDFS. A dataset is divided into blocks and stored among the data nodes of a Hadoop cluster. When a MapReduce job is executed against the dataset, individual Mappers process the blocks (input splits). When the data is not available to a Mapper on the same node, it has to be copied over the network from the data node that has the data to the data node that is executing the Mapper task. Keeping the computation close to the data, so that this copy is avoided where possible, is what we call data locality in Hadoop.
You can learn more about Data Locality in Hadoop
In Hadoop, there are 3 categories of data locality:
Data Locality in Hadoop
1. Data Local Data Locality
Data-local data locality is when the data is located on the same node as the Mapper working on it. In this case, the data is in very close proximity to the computation. This is the most preferable option.
2. Intra-Rack Data Locality
However, because of resource constraints, it is not always possible to execute the Mapper on the same node as the data. In that case, the Mapper executes on another node within the same rack as the node that has the data. This is what we call intra-rack data locality.
3. Inter-Rack Data Locality
Finally, there are cases when, because of resource constraints, we can achieve neither data locality nor intra-rack locality. At that point we need to execute the Mapper on a node in a different rack, and the data is then copied between racks from the node that has it to the node executing the Mapper. This is what we call inter-rack data locality, and it is the least preferable option.
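The three categories can be summarized as a simple classification rule. This is a toy sketch, not scheduler code: `rack_of` is a hypothetical mapping from node name to rack id, which a real Hadoop cluster derives from its rack-awareness configuration.

```python
def locality_level(data_node, mapper_node, rack_of):
    """Classify the data locality of a Mapper placement.

    rack_of maps a node name to its rack id (hypothetical cluster
    metadata standing in for Hadoop's rack awareness).
    """
    if data_node == mapper_node:
        return "data-local"     # best: no network copy at all
    if rack_of[data_node] == rack_of[mapper_node]:
        return "intra-rack"     # copy stays behind one rack switch
    return "inter-rack"         # worst: copy crosses rack switches

racks = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}
print(locality_level("node1", "node1", racks))  # data-local
print(locality_level("node1", "node2", racks))  # intra-rack
print(locality_level("node1", "node3", racks))  # inter-rack
```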
Let's learn the features and principles of Hadoop.
So, this was all about HBase Compaction and Data Locality in Hadoop. Hope you like our explanation.
5. Conclusion: HBase Compaction
Hence, in this Hadoop HBase tutorial on HBase Compaction and Data Locality, we have seen the cleaning process of HBase, that is, HBase compaction. Also, we have seen in detail the solution to data not being available to the Mapper: Apache Hadoop data locality. Hope it helps! Please share your experience through comments on our HBase Compaction explanation.
See also –
HBase Performance Tuning
For reference