Hadoop has now become a popular solution for today's big data needs. The design of Hadoop keeps several goals in mind: fault tolerance, handling of large datasets, data locality, and portability across heterogeneous hardware and software platforms. In this blog, we will explore the Hadoop Architecture in detail. Also, we will see a Hadoop Architecture diagram that helps you understand it better.
So, let’s explore Hadoop Architecture.
What is Hadoop Architecture?
Hadoop has a master-slave topology. In this topology, we have **one master node and multiple slave nodes**. The master node's function is to assign tasks to the various slave nodes and to manage resources. The slave nodes do the actual computing. Slave nodes store the real data, whereas the master stores the metadata, that is, data about data. What this metadata comprises, we will see in a moment.
Hadoop Application Architecture in Detail
Hadoop Architecture comprises three major layers. They are:-
- HDFS (Hadoop Distributed File System)
- YARN
- MapReduce
1. HDFS
HDFS stands for Hadoop Distributed File System. It provides the data storage layer of Hadoop. HDFS splits each data unit into smaller units called blocks and stores them in a distributed manner. It has two daemons running: one for the master node – NameNode – and one for the slave nodes – DataNode.
a. NameNode and DataNode
HDFS has a master-slave architecture. The daemon called NameNode runs on the master server. It is responsible for namespace management and regulates file access by clients. The DataNode daemon runs on slave nodes. It is responsible for storing the actual business data. Internally, a file gets split into a number of data blocks, which are stored on a group of slave machines. The NameNode manages modifications to the file system namespace – actions like opening, closing and renaming files or directories. The NameNode also keeps track of the mapping of blocks to DataNodes. The DataNodes serve read/write requests from the file system's clients. DataNodes also create, delete and replicate blocks on demand from the NameNode.
Java is the native language of HDFS. Hence one can deploy DataNode and NameNode on any machine that has Java installed. In a typical deployment, there is one dedicated machine running the NameNode, and all the other nodes in the cluster run DataNodes. The NameNode holds metadata such as the location of blocks on the DataNodes, and it arbitrates resources among the various competing DataNodes.
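To see this metadata in action, we can ask the NameNode for a file's block-to-DataNode mapping through the HDFS Java API. The following is a minimal sketch, assuming the Hadoop client libraries are on the classpath; the file path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask the NameNode for this file's metadata (hypothetical path).
        Path file = new Path("/user/data/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation maps one block to the DataNodes holding its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " lives on " + String.join(", ", block.getHosts()));
        }
    }
}
```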
You must read about Hadoop High Availability Concept
b. Block in HDFS
A block is nothing but the smallest unit of storage on a computer system. It is the smallest contiguous storage allocated to a file. In Hadoop, the default block size is 128 MB, and it is often configured to 256 MB.
One should select the block size very carefully. To explain why, let us take the example of a file which is 700 MB in size. If our block size is 128 MB, then HDFS divides the file into 6 blocks: five blocks of 128 MB and one block of 60 MB. What would happen if the block size were 4 KB? In HDFS, we deal with files whose sizes are on the order of terabytes to petabytes. With a 4 KB block size, we would have an enormous number of blocks. This, in turn, would create huge metadata and overload the NameNode. Hence we have to choose our HDFS block size judiciously.
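To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in Java; the numbers mirror the 700 MB example above, and nothing here touches a real cluster:

```java
public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 700L * 1024 * 1024;  // the 700 MB file from the example
        long blockSize = 128L * 1024 * 1024;  // 128 MB, the HDFS default

        long fullBlocks  = fileSize / blockSize;                  // 5 blocks of 128 MB
        long remainder   = fileSize % blockSize;                  // one 60 MB block left over
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);  // 6 blocks in total
        System.out.println(totalBlocks + " blocks at 128 MB");

        // The same file with a 4 KB block size needs 179,200 blocks,
        // each of which becomes an entry in the NameNode's metadata.
        System.out.println(fileSize / (4L * 1024) + " blocks at 4 KB");
    }
}
```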
c. Replication Management
To provide fault tolerance, HDFS uses a replication technique: it makes copies of the blocks and stores them on different DataNodes. The replication factor decides how many copies of each block get stored. It is 3 by default, but we can configure it to any required value.
The above figure shows how the replication technique works. Suppose we have a file of 1 GB; with a replication factor of 3, it will require 3 GB of total storage.
To maintain the replication factor, the NameNode collects a block report from every DataNode. Whenever a block is under-replicated or over-replicated, the NameNode adds or deletes replicas accordingly.
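For illustration, the replication factor can also be controlled from the HDFS Java API. This is a hedged sketch: the file path is hypothetical, and in practice `dfs.replication` usually lives in hdfs-site.xml rather than in code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);  // cluster-wide default for new files

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of one hot file (hypothetical path) to 5.
        // The NameNode sees the file as under-replicated and schedules new
        // copies on other DataNodes to close the gap.
        fs.setReplication(new Path("/user/data/hot-file.txt"), (short) 5);
    }
}
```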
d. What is Rack Awareness?
A rack contains many DataNode machines, and there are several such racks in a production cluster. HDFS follows a rack awareness algorithm to place the replicas of blocks in a distributed fashion. This rack awareness algorithm provides for low latency and fault tolerance. Suppose the configured replication factor is 3. The rack awareness algorithm will place the first replica on the local rack and keep the other two replicas on a different rack. It does not store more than two replicas on the same rack, if possible.
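HDFS learns which DataNode sits in which rack from an administrator-supplied mapping, commonly a script wired in through the `net.topology.script.file.name` property. A minimal sketch, with a hypothetical script path:

```java
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The script maps a DataNode's IP or hostname to a rack id such as
        // /dc1/rack1; HDFS uses those ids when placing replicas.
        conf.set("net.topology.script.file.name", "/etc/hadoop/topology.sh");
    }
}
```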
2. MapReduce
MapReduce is the data processing layer of Hadoop. It is a software framework that allows you to write applications for processing a large amount of data. MapReduce runs these applications in parallel on a cluster of low-end machines. It does so in a reliable and fault-tolerant manner.
A MapReduce job comprises a number of map tasks and reduce tasks. Each task works on a part of the data, which distributes the load across the cluster. The function of the map tasks is to load, parse, transform and filter data. Each reduce task works on a subset of the output from the map tasks. The reduce task applies grouping and aggregation to this intermediate data from the map tasks.
The input file for a MapReduce job lives on HDFS. The InputFormat decides how to divide the input file into input splits. An input split is nothing but a byte-oriented view of a chunk of the input file. This input split gets loaded by a map task. The map task runs on the node where the relevant data is present, so the data does not need to move over the network and gets processed locally.
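On the driver side, the InputFormat and input path are wired up on the Job object. A short sketch with a hypothetical path; TextInputFormat is the default and is set explicitly here only for clarity:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // TextInputFormat makes roughly one split per block; records are lines,
        // keys are byte offsets, and values are the line contents.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/data/input"));
    }
}
```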
i. Map Task
The map task runs in the following phases:-
a. RecordReader
The **RecordReader** transforms the input split into records. It parses the data into records but does not parse the records themselves. It provides the data to the mapper function in key-value pairs. Usually, the key is the positional information and the value is the data that comprises the record.
b. Map
In this phase, the mapper, which is a user-defined function, processes the key-value pairs from the RecordReader. It produces zero or more intermediate key-value pairs.
The decision of what the key-value pair will be lies with the mapper function. The key is usually the data on which the reducer function performs the grouping operation, and the value is the data that gets aggregated to obtain the final result in the reducer function.
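Here is a minimal word-count-style mapper as a sketch of such a user-defined function; the class and field names are ours, chosen for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: the line's byte offset (from the RecordReader).
// Input value: the line itself. Output: one (word, 1) pair per token.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit an intermediate key-value pair
            }
        }
    }
}
```

The word is the key the reducer will group on, and the count of 1 is the value it will aggregate.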
c. Combiner
The combiner is actually a localized reducer which groups the data in the map phase. It is optional. The combiner takes the intermediate data from the mapper and aggregates it, within the small scope of a single mapper. In many situations, this decreases the amount of data that needs to move over the network. For example, moving (Hello World, 1) three times consumes more network bandwidth than moving (Hello World, 3) once. A combiner can provide a significant performance gain with no drawbacks. However, the combiner is not guaranteed to execute, so it cannot be a required part of the overall algorithm.
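Wiring in a combiner is one line in the job driver. The sketch below reuses the sum-style WordReducer shown in the Reduce section further down; this is safe only because addition is associative and commutative:

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // The framework may run the combiner zero, one, or many times per
        // mapper, so the job must produce correct results without it.
        job.setCombinerClass(WordReducer.class);
    }
}
```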
d. Partitioner
The partitioner pulls the intermediate key-value pairs from the mapper and splits them into shards, one shard per reducer. By default, the partitioner fetches the hashcode of the key and performs a modulus operation by the number of reducers: key.hashCode() % (number of reducers). This distributes the keyspace evenly over the reducers. It also ensures that identical keys from different mappers end up in the same reducer. The partitioned data from each map task gets written to the local file system, where it waits for the reducer to pull it.
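For reference, the default behaviour described above can be written out as a custom Partitioner. This is essentially what the stock HashPartitioner does: mask off the sign bit so a negative hash code cannot produce a negative partition index, then take the modulus by the number of reducers:

```java
import org.apache.hadoop.mapreduce.Partitioner;

public class ModuloPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Same key, same reducer - regardless of which mapper emitted it.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```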
ii. Reduce Task
The various phases of the reduce task are as follows:
i. Shuffle and Sort
The reducer starts with the shuffle and sort step. This step downloads the data written by the partitioner to the machine where the reducer is running, and sorts the individual data pieces into one large data list. The purpose of this sort is to collect the equivalent keys together, so that we can iterate over them easily in the reduce task. This phase is not customizable; the framework handles everything automatically. However, the developer has control over how the keys get sorted and grouped, through a comparator object.
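Those hooks are exposed on the Job object. A minimal sketch, using Text's built-in comparator purely as a stand-in for a custom one:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ComparatorSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // The sort comparator orders keys within each reducer's input;
        // the grouping comparator decides which keys share one reduce() call.
        job.setSortComparatorClass(Text.Comparator.class);
        job.setGroupingComparatorClass(Text.Comparator.class);
    }
}
```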
ii. Reduce
The reducer performs the reduce function once per key grouping. The framework passes the function the key and an iterator object containing all the values pertaining to that key.
We can write the reducer to filter, aggregate and combine data in a number of different ways. Once the reduce function finishes, it gives zero or more key-value pairs to the OutputFormat. Like the map function, the reduce function changes from job to job, as it is the core logic of the solution.
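Continuing the word-count sketch from the Map section, here is the matching reducer. It is called once per key group and sums the grouped values:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();        // aggregate the grouped values
        }
        total.set(sum);
        context.write(word, total);    // zero or more pairs go to the OutputFormat
    }
}
```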
iii. OutputFormat
This is the final step. It takes the key-value pairs from the reducer and writes them to the file through a RecordWriter. By default, it separates the key and value with a tab, and each record with a newline character. We can customize it to provide a richer output format, but nonetheless the final data gets written to HDFS.
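Tying the pipeline together, here is a sketch of a complete word-count driver. It assumes the WordMapper and WordReducer classes sketched above, takes the input and output paths as arguments, and overrides the default tab separator only to show where that knob lives:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // TextOutputFormat separates key and value with a tab by default.
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(WordReducer.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```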
3. YARN
YARN, or Yet Another Resource Negotiator, is the resource management layer of Hadoop. The basic principle behind YARN is to separate the resource management and job scheduling/monitoring functions into separate daemons. In YARN, there is one global ResourceManager and a per-application ApplicationMaster. An application can be a single job or a DAG of jobs.
Inside the YARN framework, we have two daemons: ResourceManager and NodeManager. The ResourceManager arbitrates resources among all the competing applications in the system. The job of the NodeManager is to monitor resource usage by the containers and report it to the ResourceManager. The resources are CPU, memory, disk, network and so on.
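As a rough sketch of what these resources look like in configuration, each NodeManager advertises its capacity to the ResourceManager through properties like the ones below. The values are illustrative, and in a real cluster they live in yarn-site.xml:

```java
import org.apache.hadoop.conf.Configuration;

public class NodeManagerResources {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Memory and vcores this NodeManager offers for containers.
        conf.set("yarn.nodemanager.resource.memory-mb", "16384");  // 16 GB
        conf.set("yarn.nodemanager.resource.cpu-vcores", "8");     // 8 vcores
    }
}
```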
The ApplicationMaster negotiates resources with the ResourceManager and works with the NodeManagers to execute and monitor the job.
The ResourceManager has two important components – Scheduler and ApplicationManager.
i. Scheduler
The Scheduler is responsible for allocating resources to the various applications. It is a pure scheduler in that it does not track the status of applications, nor does it reschedule tasks that fail due to software or hardware errors. The Scheduler allocates resources based on the requirements of the applications.
ii. Application Manager
Following are the functions of the ApplicationManager:-
- Accepts job submissions.
- Negotiates the first container for executing the ApplicationMaster. A container incorporates elements such as CPU, memory, disk, and network.
- Restarts the ApplicationMaster container on failure.
Functions of ApplicationMaster:-
- Negotiates resource containers from the Scheduler.
- Tracks the resource container status.
- Monitors progress of the application.
We can scale YARN beyond a few thousand nodes through the YARN Federation feature. This feature enables us to tie multiple YARN clusters into a single massive cluster, allowing independent clusters to be clubbed together for a very large job.
iii. Features of YARN
YARN has the following features:-
a. Multi-tenancy
YARN allows a variety of access engines (open-source or proprietary) on the same Hadoop data set. These access engines can do batch processing, real-time processing, iterative processing and so on.
b. Cluster Utilization
With dynamic allocation of resources, YARN allows for good use of the cluster, as compared to the static map-reduce rules in earlier versions of Hadoop, which provided lower utilization of the cluster.
c. Scalability
Any data center's processing power keeps on expanding. YARN's ResourceManager focuses on scheduling and copes with the ever-expanding cluster, processing petabytes of data.
d. Compatibility
MapReduce programs developed for Hadoop 1.x can still run on YARN, without any disruption to the processes that already work.
Best Practices For Hadoop Architecture Design
i. Embrace Redundancy, Use Commodity Hardware
Many companies venture into Hadoop led by business users or an analytics group; the infrastructure folks pitch in later. These people often have no idea about Hadoop. The result is an over-sized cluster which increases the budget many fold. Hadoop was mainly created for availing cheap storage and deep data analysis. To achieve this, use JBOD, i.e. Just a Bunch Of Disks. Also, use a single power supply.
ii. Start Small and Keep Focus
Many projects fail because of their complexity and expense. To avoid this, start with a small cluster of nodes and add nodes as you go along. Start with a small project so that the infrastructure and development teams can understand the internal working of Hadoop.
iii. Create Procedure For Data Integration
One of the features of Hadoop is that it allows dumping the data first and defining the data structure later. We can get data in easily with tools such as Flume and Sqoop. But it is essential to create a data integration process. This includes various layers such as staging, naming standards, location etc. Make proper documentation of the data sources and where they live in the cluster.
iv. Use Compression Technique
Enterprises have a love-hate relationship with compression. There is a trade-off between performance and storage: although compression decreases the storage used, it decreases performance too. But Hadoop thrives on compression, which can reduce storage requirements by as much as 80%.
v. Create Multiple Environments
It is a best practice to build multiple environments for development, testing, and production. As **Apache Hadoop has a wide ecosystem**, different projects in it have different requirements. Hence there is a need for a non-production environment for testing upgrades and new functionalities.
Summary
Hence, in this Hadoop Application Architecture post, we saw that the design of the Hadoop Architecture is such that it recovers itself whenever needed. Its redundant storage structure makes it fault-tolerant and robust. We are able to scale the system linearly. The MapReduce part of the design works on the principle of data locality: the Map-Reduce framework moves the computation close to the data, thereby decreasing the network traffic which would otherwise have consumed major bandwidth for moving large datasets. Thus the overall architecture of Hadoop makes it an economical, scalable and efficient big data technology.
Hadoop Architecture is a very important topic for your Hadoop interview. We recommend you check the most-asked Hadoop interview questions once; you will get many questions from Hadoop Architecture.
Did you enjoy reading Hadoop Architecture? Do share your thoughts with us.