Hadoop Streaming Programming

Dong's Blog » Hadoop Streaming Programming
http://dongxicheng.org/mapreduce/hadoop-streaming-programming/

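1. Overview
Hadoop Streaming is a programming tool that ships with Hadoop. It lets users use any executable or script file as the Mapper and Reducer. For example:
Use ordinary shell commands as the mapper and reducer (cat as the mapper, wc as the reducer):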
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper cat \
        -reducer wc
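This article is organized as follows. Section 2 explains how Hadoop Streaming works, Section 3 describes how to use it, and Section 4 shows how to write Streaming programs, implementing the WordCount job in C++, C, shell script, and Python. Section 5 summarizes common problems. The end of the article gives a download location for the programs. (This article is based on Hadoop 0.20.2.)
(Note: if you work in C or C++, you can also use Hadoop Pipes; see the article "Hadoop Pipes Programming".)
For advanced Hadoop Streaming techniques, see the articles "Advanced Hadoop Streaming Programming" and "Hadoop Programming Examples".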
2. How Hadoop Streaming Works
The mapper and reducer read user data from standard input, process it line by line, and write the results to standard output. The Streaming tool creates a MapReduce job, submits it to the tasktrackers, and monitors the execution of the whole job.
If a file (an executable or a script) is used as the mapper, each mapper task launches that file as a separate process when the task initializes. While the mapper task runs, it splits its input into lines and feeds each line to the standard input of that process. At the same time, the mapper collects the process's standard output and converts each line into a key/value pair, which becomes the mapper's output. By default, the part of a line before the first tab is the key and the part after it (excluding the tab) is the value. If the line contains no tab, the whole line is the key and the value is null.
The reducer works in a similar way.
This is the basic communication protocol between the Map/Reduce framework and a streaming mapper/reducer.
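The default key/value rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Hadoop's own code; the function name `split_key_value` is hypothetical, and `None` stands in for the null value.

```python
def split_key_value(line, separator="\t"):
    """Split one line of mapper output into (key, value).

    Everything before the first separator is the key and the rest is the
    value; with no separator the whole line is the key and the value is
    None (standing in for null).
    """
    line = line.rstrip("\n")
    key, sep, value = line.partition(separator)
    if not sep:
        # No tab found: the whole line is the key.
        return line, None
    return key, value


for sample in ["hello\t1", "a\tb\tc", "no-tab-here"]:
    print(repr(sample), "->", split_key_value(sample))
```

Note that only the first tab matters: in "a\tb\tc" the key is "a" and the value keeps the second tab.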
3. Hadoop Streaming Usage
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar [options]
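options:
(1) -input: input file path
(2) -output: output file path
(3) -mapper: the user's own mapper program, either an executable or a script
(4) -reducer: the user's own reducer program, either an executable or a script
(5) -file: packages a file into the submitted job; this can be any input file the mapper or reducer needs, such as a configuration file or a dictionary
(6) -partitioner: user-defined partitioner program
(7) -combiner: user-defined combiner program (must be implemented in Java)
(8) -D: job properties (formerly set with -jobconf), including:
1) mapred.map.tasks: number of map tasks
2) mapred.reduce.tasks: number of reduce tasks
3) stream.map.input.field.separator / stream.map.output.field.separator: field separator for map task input/output; both default to \t
4) stream.num.map.output.key.fields: number of fields that make up the key in each map task output record
5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: field separator for reduce task input/output; both default to \t
6) stream.num.reduce.output.key.fields: number of fields that make up the key in each reduce task output record
In addition, Hadoop itself ships with some handy Mappers and Reducers:
(1) Hadoop aggregation
Aggregate provides a special reducer class and a special combiner class, together with a set of "aggregators" (such as "sum", "max", and "min") that aggregate a sequence of values. Users can use Aggregate to define a mapper plugin class that produces an "aggregatable item" for each key/value pair of the mapper's input. The combiner/reducer then aggregates these items with the appropriate aggregator. To use Aggregate, just specify "-reducer aggregate".
(2) Field selection (similar to 'cut' in Unix)
Hadoop's utility class org.apache.hadoop.mapred.lib.FieldSelectionMapReduce helps users process text data efficiently, much like the Unix "cut" tool. The map function of this class treats the input key/value pair as a list of fields. Users can specify the field separator (tab by default) and select any range of the field list (one or more fields) as the map output key or value. Likewise, the reduce function treats its input key/value pair as a list of fields, and users can select any range as the reduce output key or value.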
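To make the separator and key-field properties more concrete, here is a minimal Python model of how a line is divided when the key spans several fields. It illustrates the documented behaviour under the assumption that a line with too few separators becomes a key with an empty value; the function name `split_with_key_fields` is hypothetical and this is not Hadoop's implementation.

```python
def split_with_key_fields(line, separator="\t", num_key_fields=1):
    """Return (key, value), where the key spans the first num_key_fields fields.

    `separator` plays the role of stream.map.output.field.separator and
    `num_key_fields` the role of stream.num.map.output.key.fields.
    """
    fields = line.rstrip("\n").split(separator)
    if len(fields) <= num_key_fields:
        # Fewer separators than key fields: the whole line is the key.
        return separator.join(fields), ""
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


print(split_with_key_fields("a.b.c.d.e.f", separator=".", num_key_fields=4))
print(split_with_key_fields("a.b.c", separator=".", num_key_fields=4))
```

With the defaults (tab separator, one key field) this reduces to the basic protocol from Section 2.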
4. Implementing the Mapper and Reducer
This section tries to write the Mapper and Reducer in as many languages as possible, including Java, C, C++, shell script, and Python. (Beginners running their first program should be sure to read Section 5, "Common Problems and Solutions".)
Because Hadoop automatically feeds the data files to the standard input of the Mapper or Reducer, you should first know how each language reads standard input.
(1) Java:
See the examples bundled with Hadoop.
(2) C++:

    string key, value;
    while (cin >> key) {
        cin >> value;
        // ...
    }

(3) C:

    char buffer[BUF_SIZE];
    while (fgets(buffer, BUF_SIZE - 1, stdin)) {
        int len = strlen(buffer);
        /* ... */
    }

(4) Shell script:
Use pipes.
(5) Python script:

    import sys
    for line in sys.stdin:
        ...

To show how Hadoop Streaming programs are written in each language, the following uses WordCount as the example. The main task of the WordCount job is to count the occurrences of every string in the user's input data.
(1) C implementation

    /* mapper */
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    #define BUF_SIZE 2048

    int main(int argc, char *argv[]) {
        char buffer[BUF_SIZE];
        while (fgets(buffer, BUF_SIZE - 1, stdin)) {
            size_t len = strlen(buffer);
            if (len > 0 && buffer[len - 1] == '\n')
                buffer[len - 1] = '\0';    /* strip the trailing newline */

            /* emit "<word>\t1" for every space-separated token on the line */
            char *query = strtok(buffer, " ");
            while (query) {
                printf("%s\t1\n", query);
                query = strtok(NULL, " ");
            }
        }
        return 0;
    }

    /* --------------------------------------------------------------- */

    /* reducer */
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    #define BUFFER_SIZE 1024
    #define DELIM "\t\n"

    int main(int argc, char *argv[]) {
        char strLastKey[BUFFER_SIZE];
        char strLine[BUFFER_SIZE];
        int count = 0;

        *strLastKey = '\0';
        *strLine = '\0';

        while (fgets(strLine, BUFFER_SIZE - 1, stdin)) {
            char *strCurrKey = strtok(strLine, DELIM);
            char *strCurrNum = strtok(NULL, DELIM);
            if (strCurrKey == NULL || strCurrNum == NULL)
                continue;                  /* skip malformed lines */

            if (strLastKey[0] == '\0') {
                strcpy(strLastKey, strCurrKey);
            }
            if (strcmp(strCurrKey, strLastKey)) {
                /* the key changed: flush the count of the previous key */
                printf("%s\t%d\n", strLastKey, count);
                count = atoi(strCurrNum);
            } else {
                count += atoi(strCurrNum);
            }
            strcpy(strLastKey, strCurrKey);
        }
        if (strLastKey[0] != '\0')
            printf("%s\t%d\n", strLastKey, count);    /* flush the last count */
        return 0;
    }

(2) C++ implementation

    // mapper
    #include <stdio.h>
    #include <string>
    #include <iostream>
    using namespace std;

    int main() {
        string key;
        string value = "1";
        // emit "<word>\t1" for every whitespace-separated word
        while (cin >> key) {
            cout << key << "\t" << value << endl;
        }
        return 0;
    }

    // ---------------------------------------------------------------

    // reducer
    #include <string>
    #include <map>
    #include <iostream>
    #include <iterator>
    using namespace std;

    int main() {
        string key;
        int value;
        map<string, int> word2count;
        map<string, int>::iterator it;
        // add each incoming count to the running total for its word
        while (cin >> key >> value) {
            it = word2count.find(key);
            if (it != word2count.end()) {
                it->second += value;
            } else {
                word2count.insert(make_pair(key, value));
            }
        }
        // print every word with its total
        for (it = word2count.begin(); it != word2count.end(); ++it) {
            cout << it->first << "\t" << it->second << endl;
        }
        return 0;
    }

(3) Shell script implementation
Simple version, one word per line:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper cat \
        -reducer wc

Detailed version, allowing several words per line (written by Shi Jiangming):
mapper.sh

    #!/bin/bash
    while read LINE; do
        # $LINE is left unquoted on purpose so that it is split into words
        for word in $LINE
        do
            echo "$word 1"
        done
    done

reducer.sh

    #!/bin/bash
    count=0
    started=0
    word=""
    while read LINE; do
        newword=$(echo "$LINE" | cut -d ' ' -f 1)
        if [ "$word" != "$newword" ]; then
            # the word changed: print the count of the previous word
            [ $started -ne 0 ] && printf "%s\t%s\n" "$word" "$count"
            word=$newword
            count=1
            started=1
        else
            count=$(( count + 1 ))
        fi
    done
    printf "%s\t%s\n" "$word" "$count"

(4) Python script implementation

    #!/usr/bin/env python
    # mapper
    import sys

    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words; split() already drops empty strings
        words = line.split()
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))

    # ---------------------------------------------------------------

    #!/usr/bin/env python
    # reducer
    from operator import itemgetter
    import sys

    # maps words to their counts
    word2count = {}

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        try:
            # parse the input we got from mapper.py
            word, count = line.split()
            # convert count (currently a string) to int
            count = int(count)
            word2count[word] = word2count.get(word, 0) + count
        except ValueError:
            # the line was malformed or count was not a number,
            # so silently ignore/discard this line
            pass

    # sort the words lexicographically;
    #
    # this step is NOT required, we just do it so that our
    # final output will look more like the official Hadoop
    # word count examples
    sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

    # write the results to STDOUT (standard output)
    for word, count in sorted_word2count:
        print('%s\t%s' % (word, count))
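The Python reducer above keeps every word in a dictionary until all input has been read. Because the reducer's input arrives sorted by key, an alternative is to total each word as soon as its run of lines ends. The following is a minimal sketch of that idea, not the original author's code; the function name `reduce_sorted` is hypothetical and the input is assumed to be tab-delimited "word<TAB>count" lines.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (word, total) from 'word<TAB>count' lines already sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # Keep only well-formed pairs whose count is a non-negative integer.
    valid = ((p[0], int(p[1])) for p in pairs
             if len(p) == 2 and p[1].strip().isdigit())
    # groupby only merges adjacent equal keys, which is why sorted input matters.
    for word, group in groupby(valid, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


sorted_lines = ["a\t1\n", "a\t1\n", "b\t1\n", "bad line\n", "c\t2\n"]
for word, total in reduce_sorted(sorted_lines):
    print("%s\t%d" % (word, total))
```

In a real reducer script the same function could be fed `sys.stdin` instead of the in-memory list, so that only one word's running total is held at a time.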

5. Common Problems and Solutions
(1) The job always fails,
reporting that the program cannot be found, for example "Caused by: java.io.IOException: Cannot run program "/user/hadoop/Mapper": error=2, No such file or directory":
Specify these files with the -file option when submitting the job. For the examples above, use "-file Mapper -file Reducer" or "-file Mapper.py -file Reducer.py"; Hadoop then distributes the two files to every node automatically. For example:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper Mapper.py \
        -reducer Reducer.py \
        -file Mapper.py \
        -file Reducer.py
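When jobs are submitted from a script, it is easy to forget a -file option for one of the programs. As a rough sketch, the argument list can be assembled programmatically so that every script named as mapper or reducer is also shipped. The helper name `build_streaming_command` and the paths in the usage lines are assumptions for illustration; only the options listed in Section 3 are used.

```python
def build_streaming_command(hadoop_home, streaming_jar, input_dir, output_dir,
                            mapper, reducer, files=()):
    """Return the argument list for a Hadoop Streaming job submission."""
    cmd = [hadoop_home + "/bin/hadoop", "jar", streaming_jar,
           "-input", input_dir,
           "-output", output_dir,
           "-mapper", mapper,
           "-reducer", reducer]
    # Ship each local file with the job so every node can run it.
    for path in files:
        cmd += ["-file", path]
    return cmd


# Hypothetical installation path and jar name; adjust to the actual cluster.
cmd = build_streaming_command(
    "/opt/hadoop",
    "/opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar",
    "myInputDirs", "myOutputDir",
    "Mapper.py", "Reducer.py",
    files=["Mapper.py", "Reducer.py"])
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.call`; note that a wildcard such as hadoop-*-streaming.jar is expanded by the shell, not by `subprocess`, so the jar path has to be given in full here.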

(2) When writing a script, the first line must name the script interpreter; the default is the shell.
(3) How do I test a Hadoop Streaming program?
One advantage of Hadoop Streaming programs is that they are easy to test. For the WordCount example, you can run the following command to test locally:

    cat input.txt | python Mapper.py | sort | python Reducer.py

or

    cat input.txt | ./Mapper | sort | ./Reducer
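The same local test can also be imitated inside a single Python process, which is convenient for unit tests. The sketch below is an assumption-laden stand-in for the pipeline above: `map_words`, `reduce_counts`, and `run_local` are hypothetical names, and `sorted()` plays the part of the `sort` command (and of Hadoop's shuffle phase).

```python
def map_words(lines):
    """Mapper step: emit 'word<TAB>1' for every word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_counts(sorted_pairs):
    """Reducer step: sum the counts for each word."""
    counts = {}
    for pair in sorted_pairs:
        word, count = pair.split("\t", 1)
        counts[word] = counts.get(word, 0) + int(count)
    return sorted(counts.items())


def run_local(lines):
    """Chain map, sort, and reduce, like `cat | Mapper | sort | Reducer`."""
    mapped = list(map_words(lines))
    shuffled = sorted(mapped)  # stands in for `sort`
    return reduce_counts(shuffled)


result = run_local(["hello world", "hello hadoop"])
for word, count in result:
    print("%s\t%d" % (word, count))
```

Keeping the map and reduce logic in plain functions like this makes it possible to test them without a cluster and then call them from thin stdin/stdout wrappers.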

6. Program Download
The source code of the programs used in this article can be downloaded here.

Last edited: 2017.12.05 00:17:59
