Study plan
- Big Data Specialization from the University of California, San Diego
- Hadoop: The Definitive Guide
This post
- Hadoop Platform and Application Framework, Week 1: **Hadoop Basics**
- Hadoop: The Definitive Guide, Chapter 1: Meet Hadoop
What is Hadoop?
Apache Hadoop is an open source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
What are the basic modules of the Hadoop framework?
- Hadoop Common: contains the libraries and utilities needed by the other Hadoop modules.
- Hadoop Distributed File System (HDFS): a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
  - Very large files: files in the gigabyte, terabyte, or petabyte range.
  - Streaming data access: write once, read many times.
  - Commodity hardware: HDFS does not need to run on highly reliable hardware, so cost is low but the node failure rate is high.
- Hadoop YARN (Yet Another Resource Negotiator): a resource management platform responsible for managing compute resources in the cluster and using them to schedule users and applications. The fundamental idea behind YARN (MapReduce 2.0) is to split the two major functions of the job tracker, resource management and job scheduling/monitoring, into two separate units.
- Hadoop MapReduce: a programming model for data processing.
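
To make the MapReduce programming model a little more concrete, here is a minimal single-process sketch of a word count. It is plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration, and the in-memory `shuffle` step stands in for the grouping that a real cluster would do across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (done in memory here, an assumption)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

The point of the sketch is the shape of the model: the map and reduce functions only ever see one record or one key at a time, which is what lets a framework spread them over many machines.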
What are the main components of the Hadoop ecosystem?
- Apache Sqoop: a tool for efficiently moving data between relational databases and HDFS.
- Apache HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computation using MapReduce and point queries (random reads).
- Apache Pig: a scripting language for exploring large data sets. It consists of two parts: Pig Latin, the language used to describe data flows, and the execution environment that runs Pig Latin programs.
- Apache Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates to MapReduce jobs, for querying the data.
- Apache Oozie: a workflow scheduler system for managing Apache Hadoop jobs.
- Apache Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data.
- Apache ZooKeeper: provides a distributed configuration service, a synchronization service, and a naming registry for the entire distributed system, so that Hadoop jobs can be kept in sync across the cluster.
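
The Hive entry above says that SQL-like queries are translated to MapReduce jobs. As a rough illustration of what that could look like for a query such as `SELECT year, MAX(temperature) FROM records GROUP BY year`, here is a small sketch in plain Python. The table name, column names, sample rows, and function names are all hypothetical, and this is not how Hive actually compiles queries; it only shows how a GROUP BY aggregate lines up with a map step (pick the key and value) and a reduce step (aggregate per key).

```python
from collections import defaultdict

# Hypothetical rows of a table records(year, temperature).
RECORDS = [(1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)]


def map_row(row):
    """Map: turn one row into a (group key, value) pair."""
    year, temperature = row
    return year, temperature


def max_temperature_by_year(rows):
    """Group mapped pairs by key, then reduce each group with max()."""
    grouped = defaultdict(list)
    for year, temperature in map(map_row, rows):
        grouped[year].append(temperature)
    return {year: max(temps) for year, temps in grouped.items()}


if __name__ == "__main__":
    print(max_temperature_by_year(RECORDS))
```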