1、Origin
- Hadoop is an Apache open source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware.
- Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution of the Nutch search engine project. Doug, who was working at Yahoo! at the time and is now chief architect at Cloudera, named the project after his son's toy elephant, Hadoop.
- The idea behind Hadoop is that instead of moving data to computation, we move computation to data.
- Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and the Google File System (GFS).
2、Basic Modules
- Hadoop Common
----contains libraries and utilities needed by other Hadoop modules.
- HDFS (Hadoop Distributed File System)
----is a distributed file system that stores data on commodity machines.
- MapReduce
----is a programming model that scales data processing across many machines.
- YARN
----is a resource management platform responsible for managing compute resources in the cluster and using them to schedule users' applications.
- other applications, like Apache Pig, Apache Hive, HBase, and others
For end users, MapReduce Java code gives access to any of these applications, and we can build many different kinds of systems, even streaming systems, by implementing map and reduce jobs to accomplish the task at hand.
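To make the map and reduce jobs concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API; the input and output paths passed on the command line are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // hypothetical input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // hypothetical output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```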
3、HDFS
- A Hadoop instance typically has a single NameNode and a cluster of DataNodes that together form the HDFS cluster.
- The secondary NameNode regularly connects to the primary NameNode and builds snapshots of the primary NameNode's directory information, which the system then saves to local and remote directories.
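As a small illustration of how a client talks to this NameNode/DataNode layout, here is a sketch using the standard HDFS Java FileSystem API; the file path is hypothetical, and the NameNode address is assumed to come from the usual Hadoop configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hdfs-hello.txt");   // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, HDFS");                 // data blocks land on DataNodes
    }
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());            // metadata lookups go through the NameNode
    }
    fs.close();
  }
}
```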
4、MapReduce and YARN
- The typical MapReduce engine consists of a JobTracker, to which client applications can submit MapReduce jobs, and this JobTracker typically pushes work out to all the available TaskTrackers in the cluster.
- It tries to keep the work as close to the data as possible, and as balanced as possible.
- YARN is completely compatible with MapReduce.
- We use Pig and Hive as high-level languages for querying the data, and we use ZooKeeper as a coordination service at the bottom of this stack.
- Cloudera provides a really excellent QuickStart VM for trying out this stack.
5、Hadoop Ecosystem Major Components
- Sqoop stands for SQL to Hadoop.
- It is a straightforward command line tool that has several different capabilities.
- It lets us import individual tables or entire databases into our HDFS.
- It generates Java classes that allow us to interact with the data we imported.
- HBase is a key component of the Hadoop stack, as its design caters to applications that require really fast random access to significant data sets. HBase is based on Google's Bigtable, and it can handle massive data tables combining billions of rows and millions of columns. (A minimal HBase client sketch appears after this list.)
- Pig is a scripting language; it's really a high-level platform for creating MapReduce programs using Hadoop. This language is called Pig Latin, and it excels at describing data analysis problems as data flows.
In Pig, you can actually embed code from many different languages, like JRuby, Jython, and Java. And conversely, you can execute Pig scripts in other languages.
- Hive, the Apache Hive data warehouse software, facilitates querying and managing large datasets residing in our distributed file storage.
It provides a mechanism to project structure on top of all of this data and allows us to use SQL-like queries to access the data that we have stored in this data warehouse. This query language is called HiveQL. (See the JDBC sketch after this list.)
- Oozie is a workflow scheduler system that manages all of our Apache Hadoop jobs.
Oozie workflow jobs are what we call DAGs, or Directed Acyclic Graphs. Oozie coordinator jobs are recurrent Oozie workflow jobs that are triggered by frequency or data availability. It's integrated with the rest of the Hadoop stack, supporting several different Hadoop jobs right out of the box: you can bring in Java MapReduce or streaming MapReduce, and you can run Pig, Hive, Sqoop, and many other specific jobs on the system itself. It's a very scalable, reliable, and quite extensible system.
- Hadoop has a large zoo of crazy wild animals, and we've got to keep them somehow organized. Well, that's kind of what ZooKeeper does.
- It provides operational services for the Hadoop cluster.
- It provides a distributed configuration service and a synchronization service, so it can synchronize all these jobs, and a naming registry for the entire distributed system. (See the ZooKeeper sketch after this list.)
- Distributed applications use ZooKeeper to store and mediate updates to important configuration information on the cluster itself.
- Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data.
- It has a simple and very flexible architecture based on streaming data flows.
- It uses a simple, extensible data model that allows us to apply all kinds of online analytic applications.
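As promised above, here is a minimal sketch of the fast random reads and writes HBase is designed for, using the standard HBase Java client API; the table name, column family, and values are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
      // Random write: one row keyed by user id.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Random read: fetch the row straight back by key.
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}
```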
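Since HiveQL is exposed over JDBC, a small sketch of running a SQL-like query from Java against HiveServer2 might look like the following; the connection URL, table, and columns are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hypothetical HiveServer2 endpoint and database.
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         // HiveQL: SQL-like structure projected over files in distributed storage.
         ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```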
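And a minimal sketch of the configuration/naming-registry idea with the ZooKeeper Java client; the ensemble address, znode path, and value are hypothetical.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigHello {
  public static void main(String[] args) throws Exception {
    // Hypothetical ZooKeeper ensemble; 3000 ms session timeout, no-op watcher.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

    // Publish a piece of configuration as a znode (assumes /app and /app/config already exist).
    zk.create("/app/config/batch.size", "64".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any node in the cluster can read the same value back.
    byte[] data = zk.getData("/app/config/batch.size", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}
```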
6、Spark
Although Hadoop captures the most attention for distributed data analytics, there are now a number of alternatives that provide some interesting advantages over the traditional Hadoop platform. Spark is one of them.
Spark is a scalable data analytics platform that incorporates primitives for in-memory computing and therefore offers some performance advantages over Hadoop's traditional cluster storage approach. It is implemented in, and supports, the Scala language, and provides a unique environment for data processing. Spark is really great for more complex kinds of analytics, and it's great at supporting machine learning libraries. By allowing users to load data into a cluster's memory and query it repeatedly, Spark is really well suited for machine learning applications, which oftentimes involve iterative, in-memory kinds of computation.
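A minimal sketch of that load-once, query-repeatedly pattern using Spark's Java API; the input path is hypothetical and the local master setting is just for illustration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkCacheDemo").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Hypothetical input; could equally be an HDFS or S3 path.
      JavaRDD<String> lines = sc.textFile("data/weblogs.txt");
      lines.cache();  // keep the data set in cluster memory for repeated queries

      long total = lines.count();                                    // first pass loads the data
      long errors = lines.filter(l -> l.contains("ERROR")).count();  // later passes hit the cache
      System.out.println(total + " lines, " + errors + " errors");
    }
  }
}
```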
Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone native Spark clusters, or you can run Spark on top of Hadoop YARN or via Apache Mesos. For distributed storage, Spark can interface with a variety of storage systems, including HDFS, Amazon S3, or a custom solution your organization is willing to invest in.