001 Hadoop Tutorial for Big Data Enthusiasts: The Optimal Way of Learning Hadoop

“Hadoop tutorial” is one of the most searched terms on the internet today. Do you know why? It is because Hadoop is the major framework for Big Data.

If you don’t know anything about Big Data, you are in major trouble. But don’t worry: I have something for you that is completely free, a series of 520+ Big Data tutorials. This free tutorial series will make you a master of Big Data in just a few weeks. I have also explained a little about Big Data in this blog.

“Hadoop is a technology to store massive datasets on a cluster of cheap machines in a distributed manner.” It was created by Doug Cutting and Mike Cafarella.

Doug Cutting’s child had given the name Hadoop to one of his toys, a yellow elephant. Doug then used the name for his open-source project because it was easy to spell and pronounce, and was not used elsewhere.

Interesting, right?

Hadoop Tutorial

Now, let’s begin our Hadoop tutorial with a basic introduction to Big Data.

What is Big Data?

Big Data refers to datasets that are too large and complex for traditional systems to store and process. The major problems of Big Data fall under three Vs: volume, velocity, and variety.

Do you know? Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand tweets, and upload 200,000 photos to Facebook.

Volume: Data is generated on the order of terabytes to petabytes. The largest contributor of data is social media. For instance, Facebook generates 500 TB of data every day, and Twitter generates 8 TB of data daily.

Velocity: Every enterprise has its own requirement for the time frame within which it has to process data. Many use cases, such as credit card fraud detection, have only a few seconds to process the data in real time and detect fraud. Hence there is a need for a framework capable of high-speed data computation.

Variety: Data from various sources comes in varied formats such as text, XML, images, audio, and video. Hence a Big Data technology should be capable of performing analytics on a variety of data.

Hope you have checked the free DataFlair Big Data tutorial series. Here is one more interesting article for you: Top Big Data Quotes by the Experts.

Why Was Hadoop Invented?

Let us discuss the shortcomings of the traditional approach that led to the invention of Hadoop.

1. Storage for Large Datasets

A conventional RDBMS cannot store huge amounts of data. Data storage in an RDBMS is also very expensive, because it incurs both hardware and software costs.

2. Handling Data in Different Formats

An RDBMS can store and manipulate data in a structured format. In the real world, however, we have to deal with structured, semi-structured, and unstructured data.

3. Data Generated at High Speed

Data pours out on the order of terabytes to petabytes every day. Hence we need a system that can process data in real time, within a few seconds. A traditional RDBMS fails to provide real-time processing at such speeds.

What is Hadoop?

Hadoop is the solution to the Big Data problems above. It is a technology to store massive datasets on a cluster of cheap machines in a distributed manner. In addition, it provides Big Data analytics through a distributed computing framework.

It is open-source software developed as a project of the Apache Software Foundation. Doug Cutting created Hadoop. In 2008, Yahoo gave Hadoop to the Apache Software Foundation. Major releases since then include version 1.0 in 2011 and version 2.0.6 in 2013. Hadoop comes in various flavors such as Cloudera, IBM BigInsights, MapR, and Hortonworks.

Prerequisites to Learn Hadoop

Familiarity with some basic Linux commands: Hadoop is usually set up on the Linux operating system, preferably Ubuntu, so one must know certain basic Linux commands. These commands are used for uploading files to HDFS, downloading files from HDFS, and so on.
Basic Java concepts: Folks who want to learn Hadoop can get started while simultaneously grasping the basic concepts of Java. We can also write map and reduce functions in other languages such as Python, Perl, C, and Ruby. This is possible via the streaming API, which supports reading from standard input and writing to standard output. Hadoop also has high-level abstraction tools like Pig and Hive, which do not require familiarity with Java.
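The streaming idea mentioned above (read lines from standard input, write lines to standard output) can be sketched in Python roughly as follows. This is a minimal illustration, not the tutorial's own code: the function names `map_lines` and `reduce_lines` are hypothetical, and the tab-separated `key<TAB>value` line format is assumed here as the convention between the two steps.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of adjacent lines that share a key.

    The input must already be sorted by key, so that all lines for
    one word sit next to each other.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() stands in for the sort between map and reduce.
mapped = list(map_lines(["Hadoop stores data", "hadoop processes data"]))
reduced = list(reduce_lines(sorted(mapped)))
```

In a real streaming job, each function would typically live in its own script that iterates over `sys.stdin` and prints every yielded line; the local dry run only imitates that pipeline in a single process.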

Hadoop consists of three core components:

Hadoop Distributed File System (HDFS): the storage layer of Hadoop.
MapReduce: the data processing layer of Hadoop.
YARN: the resource management layer of Hadoop.

Core Components of Hadoop

Let us understand these Hadoop components in detail.

1. HDFS

HDFS, short for Hadoop Distributed File System, provides distributed storage for Hadoop. HDFS has a master-slave topology.

The master is a high-end machine, whereas the slaves are inexpensive computers. Big Data files get divided into a number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. The metadata is stored on the master.
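To make the block idea concrete, here is a small sketch of how a file could be cut into blocks and spread over slave nodes. The 128 MB block size is an assumption (it is the usual default in Hadoop 2.x and later, but the text above does not state it), and the round-robin placement in `place_blocks` is a deliberately simplified, hypothetical stand-in for the real placement logic.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file would occupy."""
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_blocks(block_count, data_nodes):
    """Toy placement: block i goes to node i modulo the node count.

    The returned mapping plays the role of the metadata kept on the master.
    """
    return {i: data_nodes[i % len(data_nodes)] for i in range(block_count)}


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
metadata = place_blocks(len(blocks), ["slave1", "slave2"])
```

Under these assumptions a 300 MB file occupies two full blocks and one 44 MB block, and only the small `metadata` mapping needs to live on the master.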

HDFS has two daemons running for it: the NameNode and the DataNode.

NameNode: The NameNode performs the following functions:

The NameNode daemon runs on the master machine.
It is responsible for maintaining, monitoring, and managing the DataNodes.
It records the metadata of the files, such as the location of blocks, file size, permissions, and hierarchy.
The NameNode captures all changes to the metadata, such as deletion, creation, and renaming of files, in edit logs.
It regularly receives heartbeats and block reports from the DataNodes.

DataNode: The functions of the DataNode are as follows:

The DataNode runs on the slave machines.
It stores the actual business data.
It serves read and write requests from the user.
The DataNode does the groundwork of creating, replicating, and deleting blocks on the command of the NameNode.
Every 3 seconds, by default, it sends a heartbeat to the NameNode to report the health of HDFS.
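The heartbeat mechanism described above can be pictured with a tiny bookkeeping class. This is only a sketch: `HeartbeatMonitor` is a hypothetical name, time is passed in as plain numbers of seconds so the example stays deterministic, and the `dead_after` threshold is an illustrative parameter rather than a value taken from the text.

```python
HEARTBEAT_INTERVAL = 3  # seconds, the default interval named in the text


class HeartbeatMonitor:
    """Toy model of the master tracking the last heartbeat of each DataNode."""

    def __init__(self, dead_after):
        self.dead_after = dead_after  # seconds of silence before a node counts as dead
        self.last_seen = {}

    def record(self, node, now):
        """Store the time at which a heartbeat from `node` arrived."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the threshold."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.dead_after
        )


monitor = HeartbeatMonitor(dead_after=10)
monitor.record("datanode1", now=0)
monitor.record("datanode2", now=0)
monitor.record("datanode1", now=9)   # datanode1 keeps reporting
silent = monitor.dead_nodes(now=12)  # datanode2 has been quiet for 12 seconds
```

A node that keeps sending heartbeats is never reported, while a node that falls silent for longer than the threshold shows up in `dead_nodes`.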

Erasure Coding in HDFS

Up to Hadoop 2.x, replication was the only method for providing fault tolerance. Hadoop 3.0 introduces one more method, called erasure coding. Erasure coding provides the same level of fault tolerance but with lower storage overhead.

Erasure coding is commonly used in RAID (Redundant Array of Inexpensive Disks) storage. RAID implements erasure coding via striping: it divides the data into smaller units (bits, bytes, or blocks) and stores consecutive units on different disks. Hadoop calculates parity bits for each of these units, called cells. This process is called encoding. When certain cells are lost, Hadoop recomputes them by decoding, a process in which lost cells are recovered from the remaining original and parity cells.
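The encode and decode steps can be illustrated with the simplest possible parity scheme, a single XOR parity cell per stripe. This is a stand-in chosen for brevity, not the coding scheme the text attributes to Hadoop; production erasure codes use more parity cells and can survive more than one loss. The function names are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_cells(cells):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*cells))


def encode(data_cells):
    """Encoding: compute one parity cell for a stripe of data cells."""
    return xor_cells(data_cells)


def decode(cells, parity):
    """Decoding: rebuild the single missing cell, which is marked as None."""
    missing = [i for i, cell in enumerate(cells) if cell is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can rebuild exactly one lost cell")
    survivors = [cell for cell in cells if cell is not None]
    restored = list(cells)
    restored[missing[0]] = xor_cells(survivors + [parity])
    return restored


stripe = [b"had", b"oop", b"fs!"]
parity = encode(stripe)
recovered = decode([b"had", None, b"fs!"], parity)
```

Because XOR is its own inverse, combining the surviving cells with the parity cell yields exactly the lost cell.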

Erasure coding is mostly used for warm or cold data, which undergoes less frequent I/O access. The replication factor of an erasure-coded file is always one, and we cannot change it with the -setrep command. Under erasure coding, storage overhead is never more than 50%.

Under conventional Hadoop storage, the default replication factor is 3. This means 6 blocks get replicated into 6*3, i.e. 18 blocks, which gives a storage overhead of 200%. In contrast, with erasure coding there are 6 data blocks and 3 parity blocks, which gives a storage overhead of 50%.
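The arithmetic behind those two percentages can be written out as a short sketch. Overhead is taken here to mean extra storage divided by the size of the original data, which matches the figures in the text; the function names are hypothetical.

```python
def replication_overhead(replication_factor):
    """Extra storage, as a percentage of the original data, under replication."""
    return (replication_factor - 1) * 100.0


def erasure_coding_overhead(data_blocks, parity_blocks):
    """Extra storage, as a percentage of the original data, under erasure coding."""
    return parity_blocks / data_blocks * 100.0


stored_with_replication = 6 * 3     # 18 blocks on disk for 6 blocks of data
stored_with_erasure_coding = 6 + 3  # 9 blocks on disk for the same data

replicated = replication_overhead(3)    # 200.0
coded = erasure_coding_overhead(6, 3)   # 50.0
```

For the same 6 blocks of data, replication stores 18 blocks while the 6+3 layout stores 9, which is where the drop from 200% to 50% comes from.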

The File System Namespace

HDFS supports hierarchical file organization. One can create, remove, move, or rename a file. The NameNode maintains the file system namespace and records any changes to it. It also stores the replication factor of each file.

2. MapReduce

It is the data processing layer of Hadoop. It processes data in two phases:

Map phase: This phase applies business logic to the data. The input data gets converted into key-value pairs.

Reduce phase: The Reduce phase takes the output of the Map phase as input. It applies aggregation based on the key of the key-value pairs.

Hadoop MapReduce Working

MapReduce works in the following way:

The client specifies the input file for the Map function, which splits it into tuples.
The Map function defines a key and a value from the input. The output of the Map function is this key-value pair.
The MapReduce framework sorts the key-value pairs from the Map function.
The framework merges the tuples that have the same key.
The reducers get these merged key-value pairs as input.
Each reducer applies aggregate functions to the key-value pairs.
The output from the reducers gets written to HDFS.
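The steps above (map, sort, merge by key, reduce) can be imitated in a single Python process. This is a conceptual sketch only: `run_map_reduce` is a hypothetical helper, the "city,temperature" records are made-up sample data, and nothing here is distributed or written to HDFS.

```python
from itertools import groupby
from operator import itemgetter


def run_map_reduce(records, mapper, reducer):
    """Run the map, sort, merge, and reduce steps on an in-memory list."""
    # Map: every record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # Sort: order the pairs by key so equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Merge and reduce: group the values per key, then aggregate each group.
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


def temperature_mapper(record):
    """Turn a 'city,temperature' line into one (city, temperature) pair."""
    city, temperature = record.split(",")
    yield city, int(temperature)


def max_reducer(city, temperatures):
    """Aggregate all temperatures of one city into their maximum."""
    return max(temperatures)


readings = ["pune,31", "delhi,40", "pune,35", "delhi,38"]
hottest = run_map_reduce(readings, temperature_mapper, max_reducer)
```

Swapping in a different mapper and reducer changes the job without touching the framework function, which is the division of labour the list describes.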

3. YARN

YARN, short for Yet Another Resource Negotiator, has the following components:

Resource Manager

How the Resource Manager works:

The Resource Manager runs on the master node.
It knows the location of the slaves (rack awareness).
It is aware of how many resources each slave has.
The Resource Scheduler is one of the important services run by the Resource Manager.
The Resource Scheduler decides how resources get assigned to various tasks.
The Application Manager is another service run by the Resource Manager.
The Application Manager negotiates the first container for an application.
The Resource Manager keeps track of the heartbeats from the Node Managers.

Node Manager

It runs on the slave machines.
It manages containers. A container is nothing but a fraction of the Node Manager’s resource capacity.
The Node Manager monitors the resource utilization of each container.
It sends heartbeats to the Resource Manager.

Job Submitter

Job submission in YARN: the application startup process is as follows:

The client submits the job to the Resource Manager.
The Resource Manager contacts the Resource Scheduler and allocates a container.
The Resource Manager then contacts the relevant Node Manager to launch the container.
The container runs the Application Master.
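The startup sequence above can be mimicked with two toy classes. Everything here is hypothetical and simplified for illustration: the class and method names are not YARN APIs, memory is the only resource, and the "most free memory" rule in `schedule` is an assumed placeholder for a real scheduling policy.

```python
class NodeManager:
    """Toy slave-side agent that hands out slices of its memory as containers."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, container_id, memory_mb, role):
        """Reserve memory and remember what runs inside the container."""
        self.free_mb -= memory_mb
        self.containers.append((container_id, role))


class ResourceManager:
    """Toy master that picks a node and asks it to launch the first container."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_id = 1

    def schedule(self, memory_mb):
        """Assumed policy: the node with the most free memory that still fits."""
        candidates = [nm for nm in self.node_managers if nm.free_mb >= memory_mb]
        return max(candidates, key=lambda nm: nm.free_mb) if candidates else None

    def submit(self, app_name, am_memory_mb=1024):
        """Client entry point: allocate a container and start the Application Master."""
        node = self.schedule(am_memory_mb)
        if node is None:
            raise RuntimeError("no node has enough free memory")
        container_id = f"container_{self.next_id:04d}"
        self.next_id += 1
        node.launch(container_id, am_memory_mb, f"ApplicationMaster({app_name})")
        return container_id, node.name


node1 = NodeManager("node1", memory_mb=2048)
node2 = NodeManager("node2", memory_mb=4096)
rm = ResourceManager([node1, node2])
container_id, host = rm.submit("wordcount")
```

The point of the sketch is the order of events: the client talks only to the master, the master chooses a node, and the first container on that node hosts the Application Master.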

The basic idea of YARN is to split the tasks of resource management and job scheduling. It has one global Resource Manager and a per-application Application Master. An application can be either a single job or a DAG of jobs.

The Resource Manager’s job is to assign resources to the various competing applications. The Node Manager runs on the slave nodes. It is responsible for containers, monitors their resource utilization, and reports it to the Resource Manager.

The job of the Application Master is to negotiate resources from the Resource Manager. It also works with the Node Manager to execute and monitor the tasks.

Wait before scrolling further! This is the time to read about the top 15 Hadoop Ecosystem components.

Why Hadoop?

Let us now understand why Big Data Hadoop is so popular, and why Apache Hadoop captures more than 90% of the big data market.

Apache Hadoop is not only a storage system but a platform for data storage as well as processing. It is scalable (we can add more nodes on the fly) and fault-tolerant (even if a node goes down, its data is processed by another node).
The following characteristics of Hadoop make it a unique platform:

Flexibility to store and mine any type of data, whether structured, semi-structured, or unstructured. It is not bound by a single schema.
Excels at processing data of a complex nature. Its scale-out architecture divides workloads across many nodes. Another advantage is that its flexible file system eliminates ETL bottlenecks.
Scales economically: as discussed, it can be deployed on commodity hardware. Apart from this, its open-source nature guards against vendor lock-in.

What is Hadoop Architecture?

After understanding what Apache Hadoop is, let us now understand the Hadoop architecture in detail.

How Hadoop Works

Hadoop works in a master-slave fashion. There is one master node and n slave nodes, where n can be in the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes. In the Hadoop architecture, the master should be deployed on well-configured hardware, not just commodity hardware, because it is the centerpiece of the Hadoop cluster.

The master stores the metadata (data about data), while the slaves are the nodes that store the data itself. Data is stored across the cluster in a distributed manner. The client connects to the master node to perform any task. Now, in this Hadoop tutorial for beginners, we will discuss the different features of Hadoop in detail.

Hadoop Features

Here are the top Hadoop features that make it popular:

1. Reliability

In a Hadoop cluster, if any node goes down, it does not disable the whole cluster. Instead, another node takes the place of the failed node, and the cluster continues functioning as if nothing had happened. Hadoop has a built-in fault tolerance feature.

2. Scalable

Hadoop integrates with cloud-based services. If you install Hadoop on the cloud, you need not worry about scalability: you can easily procure more hardware and expand your Hadoop cluster within minutes.

3. Economical

Hadoop is deployed on commodity hardware, that is, on cheap machines. This makes Hadoop very economical. Also, as Hadoop is open-source software, there is no license cost.

4. Distributed Processing

In Hadoop, any job submitted by the client gets divided into a number of sub-tasks. These sub-tasks are independent of each other, so they execute in parallel, giving high throughput.
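The pattern of cutting one job into independent sub-tasks can be sketched on a single machine with a thread pool. This only illustrates the split-then-combine idea; the helper names are hypothetical, and a real cluster runs the sub-tasks on different nodes rather than in local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def split_job(numbers, parts):
    """Cut the input into at most `parts` chunks of nearly equal size."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division, at least 1
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def run_job(numbers, parts=4):
    """Sum the numbers by summing each chunk independently, then combining."""
    chunks = split_job(numbers, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # Each chunk is an independent sub-task; no chunk needs another's result.
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


total = run_job(list(range(1, 101)), parts=4)
```

Because no sub-task depends on another, the chunks can be processed in any order or all at once, and only the final combination step needs every partial result.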

5. Distributed Storage

Hadoop splits each file into a number of blocks. These blocks are stored in a distributed manner across the cluster of machines.

6. Fault Tolerance

Hadoop replicates every block of a file several times, depending on the replication factor, which is 3 by default. If any node goes down, the data on that node can be recovered, because copies of it are available on other nodes due to replication. This makes Hadoop fault tolerant.
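A small simulation shows why replication keeps data readable after a node failure. The placement rule below (consecutive nodes in a ring) is a hypothetical simplification chosen for clarity, not the placement policy of a real cluster, and the function names are made up for this sketch.

```python
def place_replicas(block_ids, nodes, replication_factor=3):
    """Toy placement: put each block on `replication_factor` consecutive nodes."""
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return {
        block: [nodes[(index + r) % len(nodes)] for r in range(replication_factor)]
        for index, block in enumerate(block_ids)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block for block, holders in placement.items()
        if any(holder not in failed_nodes for holder in holders)
    }


nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(["b0", "b1"], nodes)
after_one_failure = readable_blocks(placement, failed_nodes={"n2"})
after_three_failures = readable_blocks(placement, failed_nodes={"n1", "n2", "n3"})
```

With three replicas per block, losing a single node leaves every block readable; a block is lost only when all nodes holding its replicas fail together.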

Are you looking for more features? Here are the additional Hadoop features that make it special.

Hadoop Flavors

This section of the Hadoop tutorial talks about the various flavors of Hadoop.

Apache: the vanilla flavor, as the actual code resides in the Apache repositories.
Hortonworks: a popular distribution in the industry.
Cloudera: the most popular distribution in the industry.
MapR: it has rewritten HDFS, and its HDFS is faster compared to the others.
IBM: its proprietary distribution is known as BigInsights.

Databases provide native connectivity with Hadoop for fast data transfer. For example, to transfer data from Oracle to Hadoop, you need a connector.

All flavors are almost the same, and if you know one, you can easily work with the others as well.


Hadoop Future Scope

There is going to be a lot of investment in the Big Data industry in the coming years. According to a report by Forbes, 90% of global organizations will be investing in Big Data technology. Hence the demand for Hadoop resources will also grow. Learning Apache Hadoop can accelerate your career growth, and it also tends to increase your pay package.

There is a large gap between the supply of and demand for Big Data professionals. Skills in Big Data technologies continue to be in high demand, because companies grow as they try to get the most out of their data. Therefore, the salary package of Big Data professionals is quite high compared to professionals in other technologies.

Alice Hills, the managing director of Dice, has said that Hadoop jobs have seen a 64% increase from the previous year. It is evident that Hadoop is ruling the Big Data market and that its future is bright. The demand for Big Data analytics professionals is ever increasing, since data is nothing without the power to analyze it.

You must check the experts’ predictions for the future of Hadoop.

Summary: Hadoop Tutorial

To conclude this Hadoop tutorial, we can say that Apache Hadoop is the most popular and powerful big data tool. Hadoop stores huge amounts of data in a distributed manner and processes the data in parallel on a cluster of nodes. It provides a highly reliable storage layer (HDFS), a batch processing engine (MapReduce), and a resource management layer (YARN).

To summarize this Hadoop tutorial, here is a quick recap of all the topics we have discussed:

The concept of Big Data
Reason for Hadoop’s invention
Prerequisites to learn Hadoop
Introduction to Hadoop
Core components of Hadoop
Why Hadoop
Hadoop architecture
Features of Hadoop
Hadoop flavors
Future scope of Hadoop

Hope this Hadoop tutorial helped you. If you face any difficulty in understanding a Hadoop concept, comment below.

This is the right time to start your Hadoop learning with industry experts.

https://data-flair.training/blogs/hadoop-tutorial

