Hive ships with many built-in functions, such as case conversion and substring; the full list is in the official documentation. But the built-ins cannot cover every need, so Hive also lets us define our own functions.
1. How to try out the functions Hive provides
MySQL and Oracle let you select from a pseudo-table (dual), but Hive does not, so we can build one ourselves:
1) Create a table dual: create table dual(id string)
2) Create a local file dual.data whose content is a single space or a blank line
3) Load dual.data into the dual table
Now we can test, for example, a substring:
0: jdbc:hive2://localhost:10000> select substr('sichuan',1,3) from dual;
+------+--+
| _c0 |
+------+--+
| sic |
+------+--+
當(dāng)然也可以直接使用 select substr(‘sichuan’,1,3)缀匕,但是還是習(xí)慣用from dual;
2、自定義內(nèi)置函數(shù)
添加maven依賴
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-common -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-service -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-service</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
1)碰逸、大寫轉(zhuǎn)小寫
可以先創(chuàng)建java類繼承UDF,重載evaluate方法乡小。
/**
 * Convert upper case to lower case
 * @author 12706
 */
public class UpperToLowerCase extends UDF {
    /*
     * Overload evaluate.
     * It must be declared public.
     */
    public String evaluate(String word) {
        String lowerWord = word.toLowerCase();
        return lowerWord;
    }
}
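One caveat: Hive hands a SQL NULL to evaluate() as a Java null, so word.toLowerCase() above would throw a NullPointerException on NULL rows. Below is a minimal null-safe sketch of the same logic, written as plain standalone Java (without the Hive UDF base class) so it can be run directly:

```java
public class LowerCaseSketch {
    // Same logic as the UDF above, but guarded against SQL NULL,
    // which Hive passes into evaluate() as a Java null.
    public static String evaluate(String word) {
        return word == null ? null : word.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("BABY")); // baby
        System.out.println(evaluate(null));   // null
    }
}
```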
Build the jar and upload it to the Hadoop cluster (the jar here is named hive.jar).
0: jdbc:hive2://localhost:10000> select * from t5;
+--------+-----------+--+
| t5.id | t5.name |
+--------+-----------+--+
| 13 | BABY |
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
+--------+-----------+--+
13 rows selected (0.221 seconds)
Add the jar to Hive's classpath:
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
創(chuàng)建臨時(shí)函數(shù)饵史,指定完整類名
0: jdbc:hive2://localhost:10000> create temporary function tolower as 'com.scu.hive.UpperToLowerCase';
At this point the custom temporary function tolower() is available; test it by printing the name column of t5 in lower case:
0: jdbc:hive2://localhost:10000> select id,tolower(name) from t5;
+-----+-----------+--+
| id | _c1 |
+-----+-----------+--+
| 13 | baby |
| 1 | zhangsan |
| 2 | lisi |
| 3 | wangwu |
| 4 | furong |
| 5 | fengjie |
| 6 | aaa |
| 7 | bbb |
| 8 | ccc |
| 9 | ddd |
| 10 | eee |
| 11 | fff |
| 12 | ggg |
+-----+-----------+--+
根據(jù)電話號(hào)碼顯示歸屬地信息
java類
/**
 * Get the home location from the first three digits of the phone number
 * @author 12706
 */
public class PhoneNumParse extends UDF{
    static HashMap<String, String> phoneMap = new HashMap<String, String>();
    static {
        phoneMap.put("136", "beijing");
        phoneMap.put("137", "shanghai");
        phoneMap.put("138", "shenzhen");
    }
    public static String evaluate(int phoneNum) {
        String num = String.valueOf(phoneNum);
        String province = phoneMap.get(num.substring(0, 3));
        return province == null ? "foreign" : province;
    }
    // quick local test
    public static void main(String[] args) {
        String string = evaluate(136666);
        System.out.println(string);
    }
}
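A caveat with the int parameter: a real 11-digit mobile number such as 13666666666 overflows a Java int (maximum 2147483647); it only works here because the sample numbers are 7 digits. A sketch of the same lookup with a String parameter, as plain standalone Java; the prefix-to-city map is just the sample data from this walkthrough, not real carrier data:

```java
import java.util.HashMap;
import java.util.Map;

public class PhonePrefixSketch {
    // Sample prefix-to-city mappings from the text above (not real carrier data).
    private static final Map<String, String> PHONE_MAP = new HashMap<String, String>();
    static {
        PHONE_MAP.put("136", "beijing");
        PHONE_MAP.put("137", "shanghai");
        PHONE_MAP.put("138", "shenzhen");
    }

    // Look up the first three digits; fall back to "foreign" like the UDF above,
    // and guard against null or too-short input.
    public static String evaluate(String phoneNum) {
        if (phoneNum == null || phoneNum.length() < 3) {
            return "foreign";
        }
        String province = PHONE_MAP.get(phoneNum.substring(0, 3));
        return province == null ? "foreign" : province;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("13666666666")); // beijing
        System.out.println(evaluate("1555118"));     // foreign
    }
}
```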
Build the project into a jar and upload it to Linux. Note: if the new jar has the same name as the previous one, you must reconnect to the Hive server, otherwise the old jar will not be replaced; giving each jar a distinct name avoids this.
1) Edit the data file: vi prov.data
2) Create the table flow(phonenum int, flow int)
3) Load the file into the flow table
[root@mini1 ~]# vi prov.data;
1367788,1
1367788,10
1377788,80
1377788,97
1387788,98
1387788,99
1387788,100
1555118,99
0: jdbc:hive2://localhost:10000> create table flow(phonenum int,flow int)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.143 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/prov.data' into table flow;
INFO : Loading data to table myhive3.flow from file:/root/prov.data
INFO : Table myhive3.flow stats: [numFiles=1, totalSize=88]
No rows affected (0.316 seconds)
0: jdbc:hive2://localhost:10000> select * from flow;
+----------------+------------+--+
| flow.phonenum | flow.flow |
+----------------+------------+--+
| 1367788 | 1 |
| 1367788 | 10 |
| 1377788 | 80 |
| 1377788 | 97 |
| 1387788 | 98 |
| 1387788 | 99 |
| 1387788 | 100 |
| 1555118 | 99 |
+----------------+------------+--+
Add the jar to the classpath, create the temporary function, and test:
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
INFO : Added [/root/hive.jar] to class path
INFO : Added resources: [/root/hive.jar]
No rows affected (0.236 seconds)
0: jdbc:hive2://localhost:10000> create temporary function getprovince as 'com.scu.hive.PhoneNumParse';
No rows affected (0.038 seconds)
0: jdbc:hive2://localhost:10000> select phonenum,getprovince(phonenum),flow from flow;
+-----------+-----------+-------+--+
| phonenum | _c1 | flow |
+-----------+-----------+-------+--+
| 1367788 | beijing | 1 |
| 1367788 | beijing | 10 |
| 1377788 | shanghai | 80 |
| 1377788 | shanghai | 97 |
| 1387788 | shenzhen | 98 |
| 1387788 | shenzhen | 99 |
| 1387788 | shenzhen | 100 |
| 1555118 | foreign | 99 |
+-----------+-----------+-------+--+
Json數(shù)據(jù)解析UDF開發(fā)
有文件摔癣,內(nèi)容一部分如下,里面都是json串纬向,現(xiàn)在需要將它展示輸出到表中择浊,并解析對(duì)應(yīng)為4個(gè)字段。
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
The Java classes:
public class JsonParse extends UDF{
    // {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
    // output string: 1193    5    978300760    1
    public static String evaluate(String line) {
        ObjectMapper objectMapper = new ObjectMapper();
        // deserialize the JSON string into a Java bean
        String json = "";
        try {
            MovieRateBean bean = objectMapper.readValue(line, MovieRateBean.class);
            json = bean.toString();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return json;
    }
}
public class MovieRateBean {
    private String movie;
    private String rate; // rating
    private String timeStamp;
    private String uid;
    @Override
    public String toString() {
        return this.movie + "\t" + this.rate + "\t" + this.timeStamp + "\t" + this.uid;
    }
    // getters and setters omitted
}
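For lines as regular as this sample, what the ObjectMapper call does can also be pictured as a simple pattern match. The sketch below is plain standalone Java with no Jackson: it extracts the quoted values in order and joins them with tabs, producing the same string as MovieRateBean.toString(). It only holds for flat, well-formed one-line objects like these; the Jackson route in the UDF is the robust one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonLineSketch {
    // Matches "key":"value" pairs in a flat one-line JSON object.
    private static final Pattern FIELD = Pattern.compile("\"(\\w+)\":\"([^\"]*)\"");

    public static String evaluate(String line) {
        StringBuilder sb = new StringBuilder();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('\t');
            }
            sb.append(m.group(2)); // the value part of each pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "{\"movie\":\"1193\",\"rate\":\"5\",\"timeStamp\":\"978300760\",\"uid\":\"1\"}";
        // prints the four values separated by tabs
        System.out.println(evaluate(line));
    }
}
```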
Build the project into a jar and upload it to Linux.
Create the table json:
create table json(line string);
Load the file into the json table:
load data local inpath '/root/json.data' into table json;
0: jdbc:hive2://localhost:10000> select * from json limit 10;
+----------------------------------------------------------------+--+
| json.line |
+----------------------------------------------------------------+--+
| {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"} |
| {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"} |
| {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"} |
| {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"} |
| {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"} |
| {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"} |
| {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"} |
| {"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"} |
| {"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"} |
| {"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"} |
+----------------------------------------------------------------+--+
0: jdbc:hive2://localhost:10000> add jar /root/hive3.jar;
INFO : Added [/root/hive3.jar] to class path
INFO : Added resources: [/root/hive3.jar]
No rows affected (0.023 seconds)
0: jdbc:hive2://localhost:10000> create temporary function parsejson as 'com.scu.hive.JsonParse';
No rows affected (0.07 seconds)
0: jdbc:hive2://localhost:10000> select parsejson(line) from json limit 10;
+---------------------+--+
| _c0 |
+---------------------+--+
| 1193 5 978300760 1 |
| 661 3 978302109 1 |
| 914 3 978301968 1 |
| 3408 4 978300275 1 |
| 2355 5 978824291 1 |
| 1197 3 978302268 1 |
| 1287 5 978302039 1 |
| 2804 5 978300719 1 |
| 594 4 978302268 1 |
| 919 4 978301368 1 |
+---------------------+--+
到這里發(fā)現(xiàn)還有不足的地方,就是沒(méi)顯示字段师脂〉?祝可以使用函數(shù)來(lái)實(shí)現(xiàn)重寫建表來(lái)命名。
0: jdbc:hive2://localhost:10000> create table t_rating as
0: jdbc:hive2://localhost:10000> select split(parsejson(line),'\t')[0]as movieid,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[1] as rate,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[2] as timestring,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[3] as uid
0: jdbc:hive2://localhost:10000> from json limit 10;
0: jdbc:hive2://localhost:10000> select * from t_rating;
+-------------------+----------------+----------------------+---------------+--+
| t_rating.movieid | t_rating.rate | t_rating.timestring | t_rating.uid |
+-------------------+----------------+----------------------+---------------+--+
| 919 | 4 | 978301368 | 1 |
| 594 | 4 | 978302268 | 1 |
| 2804 | 5 | 978300719 | 1 |
| 1287 | 5 | 978302039 | 1 |
| 1197 | 3 | 978302268 | 1 |
| 2355 | 5 | 978824291 | 1 |
| 3408 | 4 | 978300275 | 1 |
| 914 | 3 | 978301968 | 1 |
| 661 | 3 | 978302109 | 1 |
| 1193 | 5 | 978300760 | 1 |
+-------------------+----------------+----------------------+---------------+--+
transform關(guān)鍵字使用
需求吃警,創(chuàng)建新表糕篇,內(nèi)容與t_rating表一致,但是第三個(gè)字段時(shí)間戳要改為輸出周幾酌心。
Hive的 TRANSFORM 關(guān)鍵字提供了在SQL中調(diào)用自寫腳本的功能
適合實(shí)現(xiàn)Hive中沒(méi)有的功能又不想寫UDF的情況拌消。
1、編寫python腳本(先看看機(jī)器有沒(méi)有python)安券,用來(lái)將表時(shí)間戳轉(zhuǎn)為周幾
2墩崩、加入編寫的py文件
3、創(chuàng)建新表侯勉,字段值為t_rating表傳入py函數(shù)后輸出的字段值
[root@mini1 ~]# python
Python 2.6.6 (r266:84292, Feb 21 2013, 23:54:59)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'hello';
hello
>>> quit()
[root@mini1 ~]# vi weekday_mapper.py;
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
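For reference, the same conversion in Java: java.time can turn a unix timestamp into an ISO weekday (Monday=1 ... Sunday=7). The time zone is an assumption; the Python script above implicitly uses the server's local zone (Asia/Shanghai in this walkthrough), which is why 978300760 (late Dec 31, 2000 UTC) lands on a Monday.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class WeekdaySketch {
    // ISO weekday (Mon=1 .. Sun=7) of a unix timestamp in the given zone.
    public static int isoWeekday(long unixtime, ZoneId zone) {
        return ZonedDateTime.ofInstant(Instant.ofEpochSecond(unixtime), zone)
                .getDayOfWeek().getValue();
    }

    public static void main(String[] args) {
        // Asia/Shanghai is assumed here to match the server's local time.
        ZoneId zone = ZoneId.of("Asia/Shanghai");
        System.out.println(isoWeekday(978300760L, zone)); // 1 (Monday)
        System.out.println(isoWeekday(978824291L, zone)); // 7 (Sunday)
    }
}
```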
Switch back to the Hive client:
0: jdbc:hive2://localhost:10000> add FILE /root/weekday_mapper.py;
0: jdbc:hive2://localhost:10000> create TABLE u_data_new as
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000> TRANSFORM (movieid, rate, timestring,uid)
0: jdbc:hive2://localhost:10000> USING 'python weekday_mapper.py'
0: jdbc:hive2://localhost:10000> AS (movieid, rate, weekday,uid)
0: jdbc:hive2://localhost:10000> FROM t_rating;
...
0: jdbc:hive2://localhost:10000> select * from u_data_new;
+---------------------+------------------+---------------------+-----------------+--+
| u_data_new.movieid | u_data_new.rate | u_data_new.weekday | u_data_new.uid |
+---------------------+------------------+---------------------+-----------------+--+
| 919 | 4 | 1 | 1 |
| 594 | 4 | 1 | 1 |
| 2804 | 5 | 1 | 1 |
| 1287 | 5 | 1 | 1 |
| 1197 | 3 | 1 | 1 |
| 2355 | 5 | 7 | 1 |
| 3408 | 4 | 1 | 1 |
| 914 | 3 | 1 | 1 |
| 661 | 3 | 1 | 1 |
| 1193 | 5 | 1 | 1 |
+---------------------+------------------+---------------------+-----------------+--+