Hadoop introduction

Example 2-3. Mapper for the maximum temperature example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than using built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.

The map() method also provides an instance of Context to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.

The reduce function is similarly defined using a Reducer, as illustrated in Example 2-4.
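Before looking at the reducer, a brief aside on the Writable wrappers just mentioned: they behave like thin, mutable boxes around the corresponding Java values. The following standalone sketch (not part of the book's example code) shows the round trip between the two:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {

  public static void main(String[] args) {
    LongWritable offset = new LongWritable(0L); // corresponds to a Java Long
    Text year = new Text("1950");               // like a Java String
    IntWritable temp = new IntWritable(-11);    // like a Java Integer

    System.out.println(offset.get());    // 0
    System.out.println(year.toString()); // 1950
    System.out.println(temp.get());      // -11

    // Writables are mutable, which lets Hadoop reuse instances during a job.
    temp.set(22);
    System.out.println(temp.get());      // 22
  }
}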
Example 2-4. Reducer for the maximum temperature example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature,
which we find by iterating through the temperatures and comparing each with a record of the highest found so far.
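To make the grouping concrete, here is a small standalone sketch (the key and values are illustrative) that runs the reducer's max-finding logic over the kind of input it would receive for a single key:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;

public class ReduceLogicDemo {

  public static void main(String[] args) {
    // For one key (say, the year "1949"), the framework passes the reducer
    // every temperature the mappers emitted for that key.
    List<IntWritable> values =
        Arrays.asList(new IntWritable(111), new IntWritable(78));

    // The same loop as in MaxTemperatureReducer.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    System.out.println("1949\t" + maxValue); // prints: 1949	111
  }
}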
The third piece of code runs the MapReduce job (see Example 2-5).
Example 2-5. Application to find the maximum temperature in the weather dataset

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A Job object forms the specification of the job and gives you control over how the job is
run. When we run this job on a Hadoop cluster, we will package the code into a JAR file
(which Hadoop will distribute around the cluster). Rather than explicitly specifying the
name of the JAR file, we can pass a class in the Job’s setJarByClass() method, which
Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this
class.
Having constructed a Job object, we specify the input and output paths. An input path is
specified by calling the static addInputPath() method on FileInputFormat, and it can be
a single file, a directory (in which case, the input forms all the files in that directory), or a
file pattern. As the name suggests, addInputPath() can be called more than once to use
input from multiple paths.
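For example, a job that reads several years of data might add one path per year. A fragment for the driver's main() method (the paths here are hypothetical):

// Each call accumulates another input; Hadoop reads all of them.
FileInputFormat.addInputPath(job, new Path("/ncdc/1901"));
FileInputFormat.addInputPath(job, new Path("/ncdc/1902"));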
The output path (of which there is only one) is specified by the static setOutputPath()
method on FileOutputFormat. It specifies a directory where the output files from the
reduce function are written. The directory shouldn’t exist before running the job because
Hadoop will complain and not run the job. This precaution is to prevent data loss (it can
be very annoying to accidentally overwrite the output of a long job with that of another).
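If you deliberately want to rerun a job into the same location, a common pattern is to remove the stale output directory first. A minimal standalone sketch, assuming the default Configuration and that the old output really is disposable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClearOutputDir {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path(args[0]);
    // Recursively delete the old output; only safe when it is disposable.
    if (fs.exists(output)) {
      fs.delete(output, true); // true = recursive
    }
  }
}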
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The setOutputKeyClass() and setOutputValueClass() methods control the output types
for the reduce function, and must match what the Reducer class produces. The map output
types default to the same types, so they do not need to be set if the mapper produces the
same types as the reducer (as it does in our case). However, if they are different, the map
output types must be set using the setMapOutputKeyClass() and
setMapOutputValueClass() methods.
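For instance, if a hypothetical mapper emitted LongWritable values while the reducer still produced Text and IntWritable, the driver would need these extra calls, shown here as a fragment:

// Map output types, which here differ from the reduce output types:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
// Reduce (final) output types, as before:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);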
The input types are controlled via the input format, which we have not explicitly set
because we are using the default TextInputFormat.
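Setting it explicitly is harmless and equivalent to the default (TextInputFormat lives in org.apache.hadoop.mapreduce.lib.input):

// Equivalent to the default behavior:
job.setInputFormatClass(TextInputFormat.class);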
After setting the classes that define the map and reduce functions, we are ready to run the
job. The waitForCompletion() method on Job submits the job and waits for it to finish.
The single argument to the method is a flag indicating whether verbose output is
generated. When true, the job writes information about its progress to the console.
The return value of the waitForCompletion() method is a Boolean indicating success
(true) or failure (false), which we translate into the program’s exit code of 0 or 1.
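Once compiled and packaged, the application is typically launched with the hadoop command, with HADOOP_CLASSPATH pointing at the application classes or JAR, along the lines of hadoop MaxTemperature <input path> <output path>; the exact invocation depends on your Hadoop installation.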
