[Notes] Hadoop Overview

1. Origin

  • Hadoop is an Apache open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
  • Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug, who was working at Yahoo at the time and is now chief architect at Cloudera, named the project after his son's toy elephant, Hadoop.
  • The idea behind Hadoop is that instead of moving data to the computation, we move the computation to the data.
  • Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and the Google File System, respectively.
(Image source: w3cschool)

2. Basic Modules

(Image source: w3cschool)
  • Hadoop Common
    ---- contains the libraries and utilities needed by the other Hadoop modules.
  • HDFS (Hadoop Distributed File System)
    ---- is a distributed file system that stores data on commodity machines.
  • MapReduce
    ---- is a programming model for processing large data sets in parallel across many processes.
  • YARN
    ---- is a resource management platform responsible for managing compute resources in the cluster and using them to schedule users' applications.
  • Other applications built on top, such as Apache Pig, Apache Hive, HBase, and others.

End users can work with any of these applications through MapReduce Java code and can build many different kinds of systems on top of them, including streaming systems, by implementing map and reduce jobs that accomplish the task at hand.
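To make the map and reduce steps concrete, here is a minimal single-process sketch of the MapReduce programming model using the classic word-count example. It is plain Python rather than Hadoop's Java API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. A real job would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["move computation to data", "move data"]))
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one key, both steps can be spread across many processes, which is what lets the model scale.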

3. HDFS

(Image source: w3cschool)
  • A Hadoop instance typically has a single NameNode and a cluster of DataNodes that together form the HDFS cluster.
  • The secondary NameNode regularly connects to the primary NameNode and builds snapshots of the primary NameNode's directory information, which the system then saves to local or remote directories.
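The division of labor between the NameNode and the DataNodes can be sketched with a small in-memory model. This is an illustration only, not HDFS code: the class and function names are hypothetical, the tiny block size and replication factor are chosen for readability, and the round-robin placement is a simplifying assumption. The point it shows is that the NameNode holds only directory information (which blocks make up a file and where they live), while the DataNodes hold the bytes.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores the actual block contents (hypothetical model)."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes


@dataclass
class NameNode:
    """Stores metadata only: file name -> [(block id, [data node names])]."""
    directory: dict = field(default_factory=dict)

    def snapshot(self):
        """Copy the directory information, as a secondary NameNode might."""
        return {name: [(block_id, list(nodes)) for block_id, nodes in entries]
                for name, entries in self.directory.items()}


def put(name_node, data_nodes, filename, data, block_size=4, replication=2):
    """Split data into blocks and place each block on several data nodes."""
    entries = []
    for start in range(0, len(data), block_size):
        index = start // block_size
        block_id = f"{filename}#{index}"
        # Round-robin placement: an assumption made for this sketch.
        targets = [data_nodes[(index + r) % len(data_nodes)]
                   for r in range(replication)]
        for node in targets:
            node.blocks[block_id] = data[start:start + block_size]
        entries.append((block_id, [node.name for node in targets]))
    name_node.directory[filename] = entries


def get(name_node, data_nodes, filename):
    """Look up block locations in the NameNode, then read from DataNodes."""
    by_name = {node.name: node for node in data_nodes}
    return b"".join(by_name[nodes[0]].blocks[block_id]
                    for block_id, nodes in name_node.directory[filename])
```

A typical use would create one `NameNode` and a few `DataNode` objects, call `put` to store a file, and call `get` to reassemble it from the recorded block locations.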

4.

  • The typical MapReduce engine consists of a job tracker, to which client applications submit MapReduce jobs. The job tracker pushes work out to the available task tracker nodes in the cluster.
  • The goal is to keep the work as close to the data as possible, and as balanced as possible.
  • YARN is fully compatible with MapReduce.
  • We use Pig and Hive as high-level languages for querying the data, and we use Zookeeper as a coordination service at the bottom of this stack.
  • Cloudera offered an excellent QuickStart VM for trying out this stack.
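The placement goal mentioned in this section (work close to the data, and balanced across task trackers) can be sketched as a simple greedy assignment. This is a hypothetical illustration, not Hadoop's scheduler: `assign_tasks` and its arguments are invented names, and the policy is assumed to be "prefer a tracker that already stores the block, and break ties by picking the least-loaded tracker".

```python
def assign_tasks(block_locations, trackers):
    """Assign one task per block, preferring trackers that hold the block.

    block_locations: dict of block id -> list of tracker names storing it
    trackers: list of all available tracker names
    Returns a dict of tracker name -> list of assigned block ids.
    """
    assignments = {tracker: [] for tracker in trackers}
    for block_id, holders in block_locations.items():
        # Data locality: only consider trackers that already store the block.
        local = [tracker for tracker in holders if tracker in assignments]
        # Fall back to any tracker if no holder is available.
        candidates = local or trackers
        # Balance: pick the candidate with the fewest tasks so far.
        chosen = min(candidates, key=lambda tracker: len(assignments[tracker]))
        assignments[chosen].append(block_id)
    return assignments


if __name__ == "__main__":
    blocks = {"b0": ["t1", "t2"], "b1": ["t1", "t2"], "b2": ["t3"], "b3": []}
    print(assign_tasks(blocks, ["t1", "t2", "t3"]))
```

In this example `b0` and `b1` are stored on the same two trackers, so the balancing rule spreads them out, while `b3` has no local copy and simply goes to the least-loaded tracker.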

5. Hadoop Ecosystem Major Components

From Yiibai
  1. Sqoop stands for "SQL to Hadoop".
  • It is a straightforward command-line tool with several capabilities.
  • It lets us import individual tables or entire databases into HDFS.
  • It generates Java classes that let us interact with the data we imported.
  2. HBase is a key component of the Hadoop stack, as its design caters to applications that require really fast random access to significant data sets. HBase is based on Google's Bigtable, and it can handle massive data tables containing billions of rows and millions of columns.
  3. Pig is a high-level platform for creating MapReduce programs on Hadoop. Its scripting language is called Pig Latin, and it excels at describing data analysis problems as data flows.
    In Pig, you can embed code written in many different languages, such as JRuby, Jython, and Java. Conversely, you can execute Pig scripts from other languages.
  4. Hive: the Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed file storage.
    It provides a mechanism to project structure on top of this data and lets us use SQL-like queries to access the data stored in the data warehouse. This query language is called HiveQL.
  5. Oozie is a workflow scheduler system that manages Apache Hadoop jobs.
    Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by frequency or data availability. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box: Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, and many other specific jobs. It is a scalable, reliable, and quite extensible system.
  6. Zookeeper: Hadoop has a large zoo of wild animals, and they have to be kept in and kept organized somehow. That is roughly what Zookeeper does.
  • It provides operational services for the Hadoop cluster.
  • It provides a distributed configuration service, a synchronization service that can coordinate all these jobs, and a naming registry for the entire distributed system.
  • Distributed applications use Zookeeper to store and mediate updates to important configuration information on the cluster itself.
  7. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data.
  • It has a simple and very flexible architecture based on streaming data flows.
  • It uses a simple, extensible data model that allows for
    all kinds of online analytic applications.
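Since Oozie workflow jobs are described above as directed acyclic graphs, it may help to see how a DAG of actions turns into a valid execution order. The sketch below is not Oozie code or configuration. It is a plain Python topological sort (Kahn's algorithm), and the action names in the example are hypothetical. It assumes every action appears as a key in the workflow dictionary.

```python
from collections import deque


def run_order(workflow):
    """Return the actions in an order that respects every dependency.

    workflow: dict of action -> set of actions that must finish first.
    Every action is assumed to appear as a key.
    """
    remaining = {action: set(deps) for action, deps in workflow.items()}
    # Start with the actions that have no prerequisites.
    ready = deque(sorted(a for a, deps in remaining.items() if not deps))
    order = []
    while ready:
        action = ready.popleft()
        order.append(action)
        # Mark this action as done for everything that was waiting on it.
        for other, deps in remaining.items():
            if action in deps:
                deps.remove(action)
                if not deps:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return order


if __name__ == "__main__":
    workflow = {
        "sqoop_import": set(),
        "pig_clean": {"sqoop_import"},
        "hive_report": {"pig_clean"},
        "notify": {"hive_report", "pig_clean"},
    }
    print(run_order(workflow))
```

The cycle check is the reason the graph must be acyclic: if two actions wait on each other, no valid order exists and nothing could ever start.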

6. Spark

Although Hadoop captures the most attention for distributed data analytics, there are now a number of alternatives that offer interesting advantages over the traditional Hadoop platform. Spark is one of them.

  • Spark is a scalable data analytics platform that incorporates primitives for in-memory computing, and
    it can therefore offer performance advantages over traditional Hadoop's cluster storage approach.
    It is implemented in, and supports, the Scala language, and it provides a unique environment for data processing.

  • Spark is well suited to more complex kinds of analytics, and it is good at supporting machine learning libraries. By allowing users to load data into a cluster's memory and query it repeatedly, Spark fits machine learning applications, which often rely on iterative, in-memory computation.

  • Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone native Spark clusters, or you can run Spark on top of Hadoop YARN or Apache Mesos. For distributed storage, Spark can interface with a variety of storage systems, including HDFS, Amazon S3, or a custom solution that your organization is willing to invest in.
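The benefit of loading data into memory and querying it repeatedly can be illustrated without Spark itself. The sketch below is a pure Python simulation, not the Spark API: the `Dataset` class and `iterative_mean` function are hypothetical. It counts how often the underlying storage is read during an iterative computation, with and without keeping the data in memory.

```python
class Dataset:
    """Hypothetical stand-in for a distributed dataset; not the Spark API."""

    def __init__(self, loader):
        self._loader = loader  # function that reads the data from storage
        self._cached = None
        self.loads = 0         # how many times storage was read

    def _read(self):
        self.loads += 1
        return list(self._loader())

    def cache(self):
        """Read the data once and keep it in memory for later passes."""
        self._cached = self._read()
        return self

    def rows(self):
        """Return the rows from memory if cached, else re-read storage."""
        return self._cached if self._cached is not None else self._read()


def iterative_mean(dataset, steps=50, rate=0.5):
    """Estimate the mean by gradient descent, one full data pass per step."""
    estimate = 0.0
    for _ in range(steps):
        rows = dataset.rows()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= rate * gradient
    return estimate


if __name__ == "__main__":
    def source():
        return [1.0, 2.0, 3.0, 4.0]

    uncached = Dataset(source)
    cached = Dataset(source).cache()
    print(iterative_mean(uncached), uncached.loads)
    print(iterative_mean(cached), cached.loads)
```

Both runs reach the same estimate, but the uncached dataset reads storage once per iteration while the cached one reads it a single time. That difference is the kind of saving that matters for iterative machine learning workloads.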

Source: San Diego Supercomputer Center
