1. Helper commands for functions
List the built-in functions: show functions;
Show a function's description: desc function abs;
Show a function's extended description: desc function extended concat;
2. A practical method for learning the built-in functions
Step 1: read through the whole output of show functions once, to build an overall picture of what exists
Step 2: use desc function extended function_name to look up how a particular function is used
Step 3: if you found a function this way but its usage is still unclear, search the web with a keyword
like "hive function function_name tutorial"
Step 4: if none of the above turns up a suitable function, combine existing functions, or write a user-defined function
3. Quick ways to test a built-in function
Option 1: call it directly, without a FROM clause, for example:
select concat('a','a')
Option 2: create a dual table, so you can write complete SQL:
1. create the dual table: create table dual(id string);
2. load a file (one line, containing a single space) into dual
3. select substr('huangbo',2,3) from dual;
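Note that Hive's substr is 1-based (and supports a negative start counting from the end), unlike Python's 0-based slicing. A minimal Python sketch of the documented semantics; the helper name hive_substr is made up for illustration:

```python
def hive_substr(s, start, length=None):
    """Emulate Hive substr: 1-based start; a negative start counts from the end."""
    if start < 0:
        start = len(s) + start + 1
    i = start - 1
    return s[i:i + length] if length is not None else s[i:]

print(hive_substr('huangbo', 2, 3))  # 'uan', like: select substr('huangbo',2,3) from dual;
print(hive_substr('huangbo', -3))    # 'gbo'
```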
4晃择、內(nèi)置函數(shù)列表
一、關(guān)系運(yùn)算
1. 等值比較: =
2. 等值比較:<=>
3. 不等值比較: <>和!=
4. 小于比較: <
5. 小于等于比較: <=
6. 大于比較: >
7. 大于等于比較: >=
8. 區(qū)間比較
9. 空值判斷: IS NULL
10. 非空判斷: IS NOT NULL
11. LIKE比較: LIKE
12. JAVA的LIKE操作: RLIKE
13. REGEXP操作: REGEXP
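The difference between = and <=> matters as soon as NULLs appear: plain = returns NULL if either operand is NULL, while <=> is NULL-safe and treats two NULLs as equal. A small Python sketch of these semantics, with None standing in for NULL (the function names are made up):

```python
def eq(a, b):
    """Plain '=': any NULL operand yields NULL (None here)."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """'<=>': two NULLs compare equal; never returns NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(eq(None, 1))               # None
print(null_safe_eq(None, None))  # True
```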
二也物、數(shù)學(xué)運(yùn)算
1. 加法操作: +
2. 減法操作: –
3. 乘法操作: *
4. 除法操作: /
5. 取余操作: %
6. 位與操作: &
7. 位或操作: |
8. 位異或操作: ^
9.位取反操作: ~
三、邏輯運(yùn)算
1. 邏輯與操作: AND列疗、&&
2. 邏輯或操作: OR滑蚯、||
3. 邏輯非操作: NOT、!
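Once NULLs are involved, AND and OR follow SQL three-valued logic: FALSE dominates AND, TRUE dominates OR, and NULL propagates otherwise. A Python sketch with None standing in for NULL (helper names are hypothetical):

```python
def and3(a, b):
    """Hive AND: FALSE wins over NULL; otherwise NULL propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    """Hive OR: TRUE wins over NULL; otherwise NULL propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

print(and3(True, None))   # None
print(or3(True, None))    # True
```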
IV. Complex type constructors
1. map
2. struct
3. named_struct
4. array
5. create_union
V. Complex type operators
1. Indexing an element of an array: A[n]
2. Looking up a value in a map: M[key]
3. Accessing a field of a struct: S.x
VI. Numeric functions
1. Rounding: round
2. Rounding to a given precision: round
3. Floor: floor
4. Ceiling: ceil
5. Ceiling: ceiling
6. Random number: rand
7. Natural exponential: exp
8. Base-10 logarithm: log10
9. Base-2 logarithm: log2
10. Logarithm: log
11. Power: pow
12. Power: power
13. Square root: sqrt
14. Convert to binary: bin
15. Convert to hexadecimal: hex
16. Convert from hexadecimal: unhex
17. Radix conversion: conv
18. Absolute value: abs
19. Positive modulus: pmod
20. Sine: sin
21. Arcsine: asin
22. Cosine: cos
23. Arccosine: acos
24. positive: positive
25. negative: negative
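Two of the less obvious entries can be sketched in Python: pmod returns a remainder that is non-negative for a positive divisor (unlike % in Java or C), and conv converts a number between radixes. The helpers below are illustrative approximations, not Hive's implementation:

```python
def pmod(a, b):
    """Hive pmod: remainder with the divisor's sign (non-negative for b > 0)."""
    return ((a % b) + b) % b

def conv(num, from_base, to_base):
    """Sketch of Hive conv for non-negative integers: re-express num in to_base."""
    n = int(str(num), from_base)
    digits = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    if n == 0:
        return '0'
    out = ''
    while n:
        n, r = divmod(n, to_base)
        out = digits[r] + out
    return out

print(pmod(-7, 3))         # 2
print(conv('ff', 16, 10))  # '255'
```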
VII. Collection functions
1. Size of a map: size
2. Size of an array: size
3. Test whether an array contains an element: array_contains
4. All values of a map: map_values
5. All keys of a map: map_keys
6. Sort an array: sort_array
VIII. Type conversion functions
1. Convert to binary: binary
2. Cast between primitive types: cast
IX. Date functions
1. UNIX timestamp to date string: from_unixtime
2. Current UNIX timestamp: unix_timestamp
3. Date string to UNIX timestamp: unix_timestamp
4. Date string in a given format to UNIX timestamp: unix_timestamp
5. Datetime to date: to_date
6. Extract the year: year
7. Extract the month: month
8. Extract the day: day
9. Extract the hour: hour
10. Extract the minute: minute
11. Extract the second: second
12. Week of year: weekofyear
13. Date difference: datediff
14. Add days to a date: date_add
15. Subtract days from a date: date_sub
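A Python sketch of two of these, from_unixtime and datediff. One stated assumption: the sketch uses UTC so the result is deterministic, whereas Hive interprets the timestamp in the session time zone:

```python
from datetime import datetime, timezone

def from_unixtime(ts, fmt='%Y-%m-%d %H:%M:%S'):
    """Sketch of Hive from_unixtime (UTC here; Hive uses the session time zone)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)

def datediff(end, start):
    """Hive datediff: whole days between two 'yyyy-MM-dd' dates (end - start)."""
    parse = lambda s: datetime.strptime(s, '%Y-%m-%d')
    return (parse(end) - parse(start)).days

print(from_unixtime(978300760))              # '2000-12-31 22:12:40'
print(datediff('2001-01-10', '2001-01-01'))  # 9
```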
X. Conditional functions
1. If: if
2. First non-NULL value: COALESCE
3. Conditional expression: CASE
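COALESCE and if are easy to mis-remember around NULLs: COALESCE returns the first non-NULL argument, and if(cond, a, b) falls through to b when the condition is NULL. A Python sketch with None standing in for NULL (helper names are made up):

```python
def coalesce(*args):
    """Hive COALESCE: first non-NULL argument, else NULL (None)."""
    for a in args:
        if a is not None:
            return a
    return None

def if_(cond, v_true, v_false):
    """Hive if(cond, a, b): a NULL condition selects the false branch."""
    return v_true if cond else v_false

print(coalesce(None, None, 'x'))  # 'x'
print(if_(None, 'a', 'b'))        # 'b'
```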
XI. String functions
1. ASCII code of the first character: ascii
2. Base64 encoding: base64
3. String concatenation: concat
4. Concatenation with a separator: concat_ws
5. Join an array into a string: concat_ws
6. Format a number as a string: format_number
7. Substring: substr, substring
8. Substring with length: substr, substring
9. Find a substring: instr
10. String length: length
11. Locate a substring: locate
12. Formatted output: printf
13. Parse a string into a map: str_to_map
14. Base64 decoding: unbase64(string str)
15. Uppercase: upper, ucase
16. Lowercase: lower, lcase
17. Trim whitespace: trim
18. Trim leading whitespace: ltrim
19. Trim trailing whitespace: rtrim
20. Regex replace: regexp_replace
21. Regex extract: regexp_extract
22. URL parsing: parse_url
23. JSON parsing: get_json_object
24. String of spaces: space
25. Repeat a string: repeat
26. Left-pad: lpad
27. Right-pad: rpad
28. Split a string: split
29. Find in a comma-separated set: find_in_set
30. Tokenize text into sentences and words: sentences
31. Top-K most frequent n-grams after tokenizing: ngrams
32. Top-K n-grams co-occurring with a given word: context_ngrams
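Two string functions worth sketching: concat_ws, which silently skips NULL arguments, and str_to_map, with its default ',' pair delimiter and ':' key-value delimiter. Python approximations of the documented behavior (not Hive's implementation):

```python
def concat_ws(sep, *parts):
    """Hive concat_ws: join the arguments, skipping NULL (None) elements."""
    return sep.join(str(p) for p in parts if p is not None)

def str_to_map(text, delim1=',', delim2=':'):
    """Hive str_to_map: 'k1:v1,k2:v2' -> map (default delimiters shown)."""
    return dict(pair.split(delim2, 1) for pair in text.split(delim1))

print(concat_ws('-', 'a', None, 'b'))  # 'a-b'
print(str_to_map('k1:v1,k2:v2'))       # {'k1': 'v1', 'k2': 'v2'}
```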
XII. Miscellaneous functions
1. Call a Java method: java_method
2. Call a Java method: reflect
3. Hash of a value: hash
XIII. XPath functions for parsing XML
1. xpath
2. xpath_string
3. xpath_boolean
4. xpath_short, xpath_int, xpath_long
5. xpath_float, xpath_double, xpath_number
XIV. Aggregate functions (UDAF)
1. Count: count
2. Sum: sum
3. Average: avg
4. Minimum: min
5. Maximum: max
6. Population variance: var_pop
7. Sample variance: var_samp
8. Population standard deviation: stddev_pop
9. Sample standard deviation: stddev_samp
10. Percentile: percentile
11. Percentile (array form): percentile
12. Approximate percentile: percentile_approx
13. Approximate percentile (array form): percentile_approx
14. Histogram: histogram_numeric
15. Collect distinct values into a set: collect_set
16. Collect values, keeping duplicates, into a list: collect_list
XV. Table-generating functions (UDTF)
1. Split an array into rows: explode(array)
2. Split a map into rows: explode(map)
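explode turns one row holding a collection into one output row per element. A Python generator sketch of the idea (the function names are illustrative):

```python
def explode(values):
    """Sketch of Hive explode(array): one output row per element."""
    for v in values:
        yield (v,)

def explode_map(m):
    """Sketch of Hive explode(map): one (key, value) row per entry."""
    for k, v in m.items():
        yield (k, v)

print(list(explode(['a', 'b'])))    # [('a',), ('b',)]
print(list(explode_map({'x': 1})))  # [('x', 1)]
```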
Hive user-defined functions (UDF)
When Hive's built-in functions cannot meet the needs of your processing logic, you can turn to user-defined functions.
Function type | Description |
---|---|
UDF | (user-defined function) operates on a single row and produces a single row as output, e.g. the mathematical and string functions; essentially a mapping: one input row, one output row |
UDAF | (user-defined aggregation function) takes multiple input rows and produces one output row, e.g. count, max; essentially an aggregation: many input rows, one output row |
UDTF | (user-defined table-generating function) takes one input row and produces multiple output rows, e.g. explode; essentially a fan-out: one input row, many output rows |
A simple UDF example
1. Write a simple Java class that extends org.apache.hadoop.hive.ql.exec.UDF and overloads the evaluate method:
package com.naixue.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public class ToLowerCase extends UDF {
    // evaluate must be public, and may be overloaded
    public String evaluate(String field) {
        return field.toLowerCase();
    }
    // different argument types dispatch to different evaluate overloads
    public int evaluate(int a, int b) {
        return a + b;
    }
}
2. Build the class into a jar and upload it to the server
3. Add the jar to Hive's classpath:
hive> add jar /home/bigdata/hivejar/udf.jar;
hive> list jars;
4. Create a temporary function bound to the class you developed:
hive> create temporary function tolowercase as
'com.naixue.hive.udf.ToLowerCase';
5. From this point on, the custom function can be used in HQL:
select tolowercase(name), age from student;
Developing a JSON-parsing UDF
The raw JSON data (rating.json) looks like this:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
.....
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
Now the data needs to be loaded into the Hive warehouse, and the final result should look like this:
movie | rate | timeStamp | uid |
---|---|---|---|
1193 | 5 | 978300760 | 1 |
How would you do it? (Hint: use the built-in get_json_object, or write a custom function.)
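As a sanity check, this is what parsing one such line should yield; plain Python's json module mirrors what get_json_object would extract field by field:

```python
import json

line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'

# Equivalent of get_json_object(line, '$.movie'), '$.rate', '$.timeStamp', '$.uid'
record = json.loads(line)
row = (record['movie'], record['rate'], record['timeStamp'], record['uid'])
print(row)  # ('1193', '5', '978300760', '1')
```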
Implementing it with Transform
Hive's TRANSFORM keyword lets you call a script of your own from inside SQL. It suits cases where Hive lacks the functionality you need but you don't want to write a UDF.
A concrete example:
JSON data:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
Requirement: convert the timeStamp value into a day-of-week number.
1. First load the rating.json file into a raw Hive table, rate_json:
create table rate_json(line string) row format delimited;
load data local inpath '/home/bigdata/rating.json' into table rate_json;
2. Create the rate table to hold the fields parsed out of the JSON:
create table rate(movie int, rate int, unixtime int, userid int) row format
delimited fields terminated by '\t';
Parse the JSON and store the result into rate:
insert into table rate select
get_json_object(line,'$.movie') as movie,
get_json_object(line,'$.rate') as rate,
get_json_object(line,'$.timeStamp') as unixtime,
get_json_object(line,'$.uid') as userid
from rate_json;
3茧跋、使用 transform+python 的方式去轉(zhuǎn)換 unixtime 為 weekday
先編輯一個(gè) python 腳本文件:weekday_mapper.py
vi weekday_mapper.py
代碼如下:
#!/bin/python
import sys
import datetime
for line in sys.stdin:
line = line.strip()
movie,rate,unixtime,userid = line.split('\t')
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
print '\t'.join([movie, rate, str(weekday), userid])
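The mapper's core logic can be checked locally without Hive. One caveat: utcfromtimestamp is used below to get a deterministic answer, while the script itself uses fromtimestamp and therefore follows the server's local time zone:

```python
import datetime

# One tab-separated input line, as the rate table would feed it to the script
line = '1193\t5\t978300760\t1'
movie, rate, unixtime, userid = line.strip().split('\t')

# isoweekday(): Monday=1 .. Sunday=7; 978300760 falls on Sunday 2000-12-31 UTC
weekday = datetime.datetime.utcfromtimestamp(float(unixtime)).isoweekday()
print('\t'.join([movie, rate, str(weekday), userid]))
```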
Save the file.
Then create the table that will hold the rows produced by the Python script, lastjsontable (it must exist before the insert below):
create table lastjsontable(movie int, rate int, weekday int, userid int) row
format delimited fields terminated by '\t';
Next, register the script file with Hive, then run the transform:
hive> add file /home/bigdata/weekday_mapper.py;
hive> insert into table lastjsontable select
transform(movie,rate,unixtime,userid)
using 'python weekday_mapper.py' as(movie,rate,weekday,userid) from rate;
Finally, check whether the data is correct:
select distinct(weekday) from lastjsontable;
Handling special delimiters in Hive
Background: how Hive reads data:
1. First, a concrete implementation of InputFormat (default: org.apache.hadoop.mapred.TextInputFormat) reads the file data and returns records one at a time (a record may be a physical line, or a "line" in your own logic)
2. Then a concrete implementation of SerDe (default: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) splits each returned record into fields
For the details of SerDe, see: https://cwiki.apache.org/confluence/display/Hive/SerDe
3. InputFormat and SerDe work together:
HDFS files -> InputFileFormat -> <key, value> -> Deserializer -> Row object
Row object -> Serializer -> <key, value> -> OutputFileFormat -> HDFS files
4. An example
By default, Hive only supports single-character field delimiters. Suppose the delimiter in the data file multi_delim.txt is multi-character, as shown below:
01||huangbo
02||xuzheng
03||wangbaoqiang
Note: if you use || as the delimiter, creating the table will not fail, but the data will not be parsed correctly. This is because Hive's default SerDe does not support multi-character delimiters; the supported delimiter type is a single char.
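The failure mode is easy to reproduce outside Hive: splitting on a single '|' (effectively what the default single-character delimiter does) injects an empty field, while splitting on the full two-character delimiter does not:

```python
# Single-char delimiter '|' splits on every pipe: three fields, one empty
naive = '01||huangbo'.split('|')
print(naive)    # ['01', '', 'huangbo']

# Splitting on the full two-character delimiter gives the intended two fields
correct = '01||huangbo'.split('||')
print(correct)  # ['01', 'huangbo']
```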
Using RegexSerDe to extract fields with a regular expression
Create the table:
drop table if exists multi_delim;
create table multi_delim(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)\\|\\|(.*)','output.format.string'='%1$s %2$s')
stored as textfile;
The data in multi_delim.txt:
01||huangbo
02||xuzheng
03||wangbaoqiang
Load the data:
load data local inpath '/home/bigdata/hivedata/multi_delim.txt' into table
multi_delim;
Using MultiDelimitSerDe to handle multi-character delimiters
Create the table:
drop table if exists multi_delim2;
CREATE TABLE multi_delim2 (id STRING, name STRING, city STRING) ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH
SERDEPROPERTIES("field.delim"="^|~");
Load the data:
load data local inpath '/home/bigdata/hivedata/multi_delim2.txt' into table
multi_delim2;
The data format of multi_delim2.txt:
1^|~huangbo^|~beijing
2^|~xuzheng^|~shanghai
3^|~wangbaoqiang^|~tianjin
Query the data:
select id, name, city from multi_delim2;
Expected result:
1	huangbo	beijing
2	xuzheng	shanghai
3	wangbaoqiang	tianjin
Solving the special-delimiter problem with a custom InputFormat
The idea: while the InputFormat reads each line, it replaces the multi-character delimiter in the data with Hive's default delimiter (Ctrl+A, i.e. \x01) or with some single-character substitute, so that during the SerDe step Hive can extract fields with an ordinary single-character delimiter.
The code of com.naixue.hive.delimit2.BiDelimiterInputFormat:
package com.naixue.hive.delimit2;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
public class BiDelimiterInputFormat extends TextInputFormat {
@Override
public RecordReader<LongWritable, Text> getRecordReader(InputSplit
genericSplit, JobConf job, Reporter reporter)throws IOException {
reporter.setStatus(genericSplit.toString());
BiRecordReader reader = new BiRecordReader(job,(FileSplit)genericSplit);
// MyRecordReader reader = new MyRecordReader(job,(FileSplit)genericSplit);
return reader;
}
}
The code of com.naixue.hive.delimit2.BiRecordReader:
package com.naixue.hive.delimit2;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.SplitCompressionInputStream;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
public class BiRecordReader implements RecordReader<LongWritable, Text> {
private static final Log LOG =
LogFactory.getLog(LineRecordReader.class.getName());
private CompressionCodecFactory compressionCodecs = null;
private long start;
private long pos;
private long end;
private LineReader in;
int maxLineLength;
private Seekable filePosition;
private CompressionCodec codec;
private Decompressor decompressor;
/**
* A class that provides a line reader from an input stream.
* @deprecated Use {@link org.apache.hadoop.util.LineReader} instead.
*/
@Deprecated
public static class LineReader extends org.apache.hadoop.util.LineReader {
LineReader(InputStream in) {
super(in);
}
LineReader(InputStream in, int bufferSize) {
super(in, bufferSize);
}
public LineReader(InputStream in, Configuration conf)
throws IOException {
super(in, conf);
}
}
public BiRecordReader(Configuration job, FileSplit split) throws IOException
{
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
compressionCodecs = new CompressionCodecFactory(job);
codec = compressionCodecs.getCodec(file);
// open the file and seek to the start of the split
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
if (isCompressedInput()) {
decompressor = CodecPool.getDecompressor(codec);
if (codec instanceof SplittableCompressionCodec) {
final SplitCompressionInputStream cIn =
((SplittableCompressionCodec) codec)
.createInputStream(fileIn, decompressor, start, end,
SplittableCompressionCodec.READ_MODE.BYBLOCK);
in = new LineReader(cIn, job);
start = cIn.getAdjustedStart();
end = cIn.getAdjustedEnd();
filePosition = cIn; // take pos from compressed stream
} else {
in = new LineReader(codec.createInputStream(fileIn,
decompressor), job);
filePosition = fileIn;
}
} else {
fileIn.seek(start);
in = new LineReader(fileIn, job);
filePosition = fileIn;
}
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
private boolean isCompressedInput() {
return (codec != null);
}
private int maxBytesToConsume(long pos) {
return isCompressedInput() ? Integer.MAX_VALUE : (int) Math.min(
Integer.MAX_VALUE, end - pos);
}
private long getFilePosition() throws IOException {
long retVal;
if (isCompressedInput() && null != filePosition) {
retVal = filePosition.getPos();
} else {
retVal = pos;
}
return retVal;
}
public BiRecordReader(InputStream in, long offset, long endOffset,
int maxLineLength) {
this.maxLineLength = maxLineLength;
this.in = new LineReader(in);
this.start = offset;
this.pos = offset;
this.end = endOffset;
this.filePosition = null;
}
public BiRecordReader(InputStream in, long offset, long endOffset,
Configuration job) throws IOException {
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
Integer.MAX_VALUE);
this.in = new LineReader(in, job);
this.start = offset;
this.pos = offset;
this.end = endOffset;
this.filePosition = null;
}
public LongWritable createKey() {
return new LongWritable();
}
public Text createValue() {
return new Text();
}
/** Read a line. */
public synchronized boolean next(LongWritable key, Text value)
throws IOException {
// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end) {
key.set(pos);
// key code: read one line, then replace the two-character delimiter
int newSize = in.readLine(value,
maxLineLength,Math.max(maxBytesToConsume(pos), maxLineLength));
String str = value.toString().replaceAll("\\|\\|", "\\|");
value.set(str);
pos += newSize;
if (newSize == 0) {
return false;
}
if (newSize < maxLineLength) {
return true;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos "
+ (pos - newSize));
}
return false;
}
/**
* Get the progress within the split
*/
public float getProgress() throws IOException {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (getFilePosition() - start)
/ (float) (end - start));
}
}
public synchronized long getPos() throws IOException {
return pos;
}
public synchronized void close() throws IOException {
try {
if (in != null) {
in.close();
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor);
}
}
}
}
Notes:
1. The code above uses only Hadoop's old API (org.apache.hadoop.mapred.*). Package the project into a jar, copy it into the lib directory of the Hive installation, restart Hive, and create the table with the following statements:
hive> create table new_bi(id string,name string) row format delimited fields
terminated by '|' stored as inputformat
'com.naixue.hive.delimit2.BiDelimiterInputFormat' outputformat
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> load data local inpath '/home/bigdata/bi.dat' into table new_bi;
hive> select * from new_bi;
OK
01 huangbo
02 xuzheng
03 wangbaoqiang
2. You also need to run add jar in Hive, so that the custom jar is shipped to the map tasks when HQL queries against this table execute:
hive> add jar /home/bigdata/apps/hive/lib/myinput.jar;