Hive supports custom map and reduce scripts. Below I illustrate this with a simple word count example.
Writing such a script by hand in Java means dealing with System.in, System.out, and all the key/value plumbing yourself, which is tedious. Someone built a small framework that lets you write code in a style similar to Hadoop's map and reduce, so you only need to focus on the map and reduce logic. This framework now ships with Hive as $HIVE_HOME/lib/hive-contrib-2.3.0.jar; the contrib jar's name varies with the Hive version.
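To see what the framework saves you from, here is the raw streaming protocol written by hand: a mapper is just a program that reads lines from standard input and writes tab-separated key/value pairs to standard output. A minimal sketch in plain Java, with no hive-contrib dependency (the class and the mapLine helper are my own names, not part of any library):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class RawStreamingMapper {
    // Turn one input line into "word\t1" output lines.
    static List<String> mapLine(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(word + "\t1");
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hive feeds each row of the source table to stdin, one line at a time.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String pair : mapLine(line)) {
                System.out.println(pair);
            }
        }
    }
}
```

GenericMR wraps exactly this loop, handing you each line pre-split on tabs so you only write the body of map.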
Development tool: IntelliJ IDEA
JDK: 1.7
Hive: 2.3.0
Hadoop: 2.8.1
1. Writing the map and reduce classes
The map class:
import org.apache.hadoop.hive.contrib.mr.GenericMR;
import org.apache.hadoop.hive.contrib.mr.Mapper;
import org.apache.hadoop.hive.contrib.mr.Output;

public class WordCountMap {
    public static void main(String[] args) throws Exception {
        new GenericMR().map(System.in, System.out, new Mapper() {
            @Override
            public void map(String[] strings, Output output) throws Exception {
                for (String str : strings) {
                    // If the source file is tab-delimited, this split is unnecessary:
                    // the incoming strings array already holds each line's fields.
                    String[] words = str.split("\\W+");
                    for (String word : words) {
                        output.collect(new String[]{word, "1"});
                    }
                }
            }
        });
    }
}
The reduce class:
import java.util.Iterator;

import org.apache.hadoop.hive.contrib.mr.GenericMR;
import org.apache.hadoop.hive.contrib.mr.Output;
import org.apache.hadoop.hive.contrib.mr.Reducer;

public class WordCountReducer {
    public static void main(String[] args) throws Exception {
        new GenericMR().reduce(System.in, System.out, new Reducer() {
            @Override
            public void reduce(String key, Iterator<String[]> records, Output output) throws Exception {
                // Each record is a {word, count} pair; sum the counts for this key.
                int sum = 0;
                while (records.hasNext()) {
                    sum += Integer.parseInt(records.next()[1]);
                }
                output.collect(new String[]{key, String.valueOf(sum)});
            }
        });
    }
}
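The reducer relies on one guarantee: Hive sorts the mapper's output before handing it over, so all rows with the same key arrive consecutively. Written without GenericMR, the same summing logic becomes a key-change flush loop. A standalone sketch (class and method names are hypothetical, not from hive-contrib):

```java
import java.util.ArrayList;
import java.util.List;

public class RawStreamingReducer {
    // Input: sorted "word\tcount" lines (equal keys are adjacent after the sort);
    // output: one "word\ttotal" line per distinct word.
    static List<String> reduceLines(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        for (String line : sortedLines) {
            String[] parts = line.split("\t");
            if (currentKey != null && !currentKey.equals(parts[0])) {
                // Key changed: flush the running total for the previous key.
                out.add(currentKey + "\t" + sum);
                sum = 0;
            }
            currentKey = parts[0];
            sum += Integer.parseInt(parts[1]);
        }
        if (currentKey != null) {
            out.add(currentKey + "\t" + sum); // flush the last key
        }
        return out;
    }
}
```

GenericMR does this grouping for you and presents each key with an Iterator over its rows, which is why the reduce method above only has to sum.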
2. Exporting the jar
Next, export a jar that bundles hive-contrib-2.3.0; assume the exported jar is named wordcount.jar.
File -> Project Structure
Add an artifact (Artifacts)
Leave Main Class empty and click OK
Configure the jar contents
Build the jar (Build -> Build Artifacts)
三撩嚼、編寫hive sql
drop table if exists raw_lines;
-- create table raw_lines, reading all the lines under '/user/inputs', a path on your HDFS
create external table if not exists raw_lines(line string)
ROW FORMAT DELIMITED
stored as textfile
location '/user/inputs';
drop table if exists word_count;
-- create table word_count, the output table, whose contents will be written as text files under '/user/outputs' on your HDFS
create external table if not exists word_count(word string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
lines terminated by '\n' STORED AS TEXTFILE LOCATION '/user/outputs/';
-- add the mapper&reducer scripts as resources, please change your/local/path
-- must use "add file", not "add jar"; otherwise Hive cannot find the map and reduce main classes
add file your/local/path/wordcount.jar;
from (
from raw_lines
map raw_lines.line
--call the mapper here
using 'java -cp wordcount.jar WordCountMap'
as word, count
cluster by word) map_output
insert overwrite table word_count
reduce map_output.word, map_output.count
--call the reducer here
using 'java -cp wordcount.jar WordCountReducer'
as word,count;
Save this Hive SQL as wordcount.hql.
4. Running the Hive SQL
beeline -u [hiveserver] -n username -f wordcount.hql
A brief look at how Hive's custom map and reduce work internally:
Hive reads the text file and feeds it, line by line, into the mapper's standard input. The user-defined map program reads from stdin, processes each line, and writes its results to stdout in an agreed format (e.g. "key\tvalue"). Hive then sorts the mapper's output and streams the sorted lines into the reducer's standard input. The reducer reads from stdin, aggregates the rows, and writes the results to stdout; finally, Hive takes the reducer's output and loads it into the target table.
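That whole flow can be simulated locally in plain Java: map every line to "word\t1" pairs, sort the pairs (standing in for the sort Hive performs between the two stages), then sum consecutive rows sharing a key. A self-contained sketch, not using hive-contrib (names are my own):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class StreamingPipelineDemo {
    // Simulates Hive's map -> sort -> reduce streaming flow for word count.
    static List<String> wordCount(List<String> lines) {
        // Map phase: emit "word\t1" for every word on every input line.
        List<String> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) pairs.add(word + "\t1");
            }
        }
        // Sort stand-in: Hive sorts mapper output so equal keys are adjacent.
        Collections.sort(pairs);
        // Reduce phase: sum consecutive rows that share a key.
        List<String> out = new ArrayList<>();
        String key = null;
        int sum = 0;
        for (String pair : pairs) {
            String word = pair.split("\t")[0];
            if (key != null && !key.equals(word)) {
                out.add(key + "\t" + sum);
                sum = 0;
            }
            key = word;
            sum += 1;
        }
        if (key != null) out.add(key + "\t" + sum);
        return out;
    }
}
```

This mirrors what `map raw_lines.line using ... cluster by word` followed by `reduce ... using ...` does, with the two `using` commands replaced by in-process loops.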