Environment
An Ubuntu 14.04 virtual machine.
Hadoop version: 2.6.0.
Adding a User
To keep Hadoop isolated from other software, create a new user hduser in a new group hadoop dedicated to running Hadoop:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
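The rest of this guide assumes the remaining steps are run as this new user (the sudo commands below additionally assume hduser has been granted sudo rights, or that they are run from an administrative account). A quick way to switch:
# switch to the dedicated Hadoop user before continuing
su - hduser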
Configuring Passwordless SSH Login
Hadoop uses SSH to manage its nodes, so passwordless SSH login has to be configured for the relevant remote machines as well as for the local machine.
First, generate an SSH key pair; the public key is written to ~/.ssh/id_rsa.pub by default:
ssh-keygen -t rsa -P ""
Append the public key to ~/.ssh/authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Try logging in to localhost over ssh; it should no longer ask for a password:
ssh localhost
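If ssh still asks for a password, overly loose permissions on the key files are a common culprit; a small fix worth trying (standard sshd behaviour, not specific to Hadoop):
# sshd ignores authorized_keys if the directory or file is group/world writable
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys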
Preparing Dependencies
Make sure a JDK is installed on the machine. To install one:
sudo apt-get update
sudo apt-get install default-jdk
Make sure ssh is installed on the node and that sshd is running. To install ssh:
sudo apt-get install ssh
Install rsync:
sudo apt-get install rsync
Installing Hadoop
Download the Hadoop release:
wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Extract the archive:
tar xfz hadoop-2.6.0.tar.gz
Set an environment variable for the Hadoop root path. This installation uses /usr/local/hadoop, and several later commands rely on this path:
export HADOOP_HOME=/usr/local/hadoop
Move the extracted directory to the Hadoop root path:
sudo mv hadoop-2.6.0 $HADOOP_HOME
Change the owner of the installation directory (recursively) to hduser:
sudo chown -R hduser $HADOOP_HOME
Configuring Hadoop
1. Setting the JAVA_HOME environment variable
The value to use for JAVA_HOME can be found with the following command:
update-alternatives --config java
On my machine it prints:
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
JAVA_HOME should therefore be the part of that path before /jre/bin/java, namely:
/usr/lib/jvm/java-7-openjdk-amd64
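If you would rather not read the prefix off by hand, it can be derived from the java binary itself; a minimal sketch, assuming /usr/bin/java is the alternatives symlink shown above:
# resolve the symlink and strip the trailing /jre/bin/java to obtain JAVA_HOME
readlink -f /usr/bin/java | sed 's:/jre/bin/java::'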
2. Configuring .bashrc
I am going to put the relevant paths into .bashrc so that they are loaded automatically whenever the user logs in. Edit .bashrc with vim:
vim ~/.bashrc
Add the following environment variables to .bashrc:
#HADOOP
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Apply the environment variables:
source ~/.bashrc
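A quick way to confirm the variables took effect (the hadoop command is only on the PATH if the exports above were applied):
# should print /usr/local/hadoop and the Hadoop version banner respectively
echo $HADOOP_HOME
hadoop version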
3. Configuring $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file to set JAVA_HOME:
sudo vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Find the JAVA_HOME variable in it and change it to:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
4. Configuring $HADOOP_HOME/etc/hadoop/core-site.xml
Edit the $HADOOP_HOME/etc/hadoop/core-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/core-site.xml
Between <configuration> and </configuration>, add the HDFS configuration (HDFS will listen on port 9000):
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
5. Configuring $HADOOP_HOME/etc/hadoop/yarn-site.xml
Edit the $HADOOP_HOME/etc/hadoop/yarn-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
Between <configuration> and </configuration>, add the following:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
6. Configuring $HADOOP_HOME/etc/hadoop/mapred-site.xml
The distribution ships a template at $HADOOP_HOME/etc/hadoop/mapred-site.xml.template; copy it to $HADOOP_HOME/etc/hadoop/mapred-site.xml first:
cp $HADOOP_HOME/etc/hadoop/mapred-site.xml{.template,}
Then edit the $HADOOP_HOME/etc/hadoop/mapred-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Between <configuration> and </configuration>, add the following:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
7. Preparing the data storage directories
Suppose the data is to be stored under /mnt/hdfs. For convenience, set that path as an environment variable:
export HADOOP_DATA_DIR=/mnt/hdfs
Create the NameNode and DataNode storage directories and change the owner of both to hduser:
sudo mkdir -p $HADOOP_DATA_DIR/namenode
sudo mkdir -p $HADOOP_DATA_DIR/datanode
sudo chown hduser /mnt/hdfs/namenode
sudo chown hduser /mnt/hdfs/datanode
8. Configuring $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Edit the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file:
sudo vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Between <configuration> and </configuration>, add the NameNode and DataNode configuration as follows:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/mnt/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/mnt/hdfs/datanode</value>
</property>
9. Formatting the HDFS filesystem
Format the HDFS filesystem with:
hdfs namenode -format
Starting Hadoop
Start HDFS:
start-dfs.sh
Start YARN:
start-yarn.sh
The HDFS and YARN web consoles listen on ports 50070 and 8088 by default, respectively.
If everything went well, jps will show the running Hadoop services; on my machine it prints:
29117 NameNode
29675 ResourceManager
29278 DataNode
30002 NodeManager
30123 Jps
29469 SecondaryNameNode
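If one of the daemons is missing from the jps output, its log file usually explains why; a sketch, assuming the default log directory and the hduser account used in this guide:
# Hadoop writes one log file per daemon under $HADOOP_HOME/logs
ls $HADOOP_HOME/logs/
tail -n 50 $HADOOP_HOME/logs/hadoop-hduser-namenode-*.log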
Running a Hadoop Job
The classic WordCount example below shows how to use Hadoop.
1. Preparing the program package
Below is the WordCount source code.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the HDFS input path, args[1] the HDFS output path.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Compile the code and package it into a jar:
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
wc.jar is the packaged Hadoop MapReduce program.
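To double-check that the classes actually made it into the archive (purely optional):
# list the jar contents; WordCount.class and its two inner classes should appear
jar tf wc.jar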
2. Preparing the input files
Our MapReduce program reads its input from HDFS and writes its output back to HDFS. For this article the input and output directories are wordcount/input and wordcount/output.
Create the input directory on HDFS:
hdfs dfs -mkdir -p wordcount/input
Prepare some text files as test data. The two files used in this article are shown below:
File 1: input1
The Apache? Hadoop? project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS?): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
File 2: input2
Apache Hadoop 2.6.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.4.1.
Here is a short overview of the major features and improvements.
Common
Authentication improvements when using an HTTP proxy server. This is useful when accessing WebHDFS via a proxy server.
A new Hadoop metrics sink that allows writing directly to Graphite.
Specification work related to the Hadoop Compatible Filesystem (HCFS) effort.
HDFS
Support for POSIX-style filesystem extended attributes. See the user documentation for more details.
Using the OfflineImageViewer, clients can now browse an fsimage via the WebHDFS API.
The NFS gateway received a number of supportability improvements and bug fixes. The Hadoop portmapper is no longer required to run the gateway, and the gateway is now able to reject connections from unprivileged ports.
The SecondaryNameNode, JournalNode, and DataNode web UIs have been modernized with HTML5 and Javascript.
YARN
YARN's REST APIs now support write/modify operations. Users can submit and kill applications through REST APIs.
The timeline store in YARN, used for storing generic and application-specific information for applications, supports authentication through Kerberos.
The Fair Scheduler supports dynamic hierarchical user queues, user queues are created dynamically at runtime under any specified parent-queue.
Copy the two files to wordcount/input:
hdfs dfs -copyFromLocal input* wordcount/input/
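It does no harm to verify the upload before running the job:
# both input files should be listed
hdfs dfs -ls wordcount/input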
3. Running the program
Run the program on Hadoop:
hadoop jar wc.jar WordCount wordcount/input wordcount/output
The results end up in wordcount/output. List the output directory:
hdfs dfs -ls wordcount/output
View the output:
hdfs dfs -cat wordcount/output/part-r-00000
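The result can also be pulled back to the local filesystem for offline inspection; a small sketch (the local file name is arbitrary):
# copy the reduce output out of HDFS into a local file
hdfs dfs -get wordcount/output/part-r-00000 ./wordcount-result.txt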