Advanced Hive Operations
1. Creating tables with LIKE and AS; renaming tables; adding, modifying, and dropping columns
Copying table structure and data
Based on an existing table's structure, use the like keyword to create a new table with an identical structure:
hive> create table student_info2 like student_info;
OK
Time taken: 0.73 seconds
hive> show tables;
OK
employee
student_info
student_info2
student_school_info
student_school_info_external_partition
student_school_info_partition
Time taken: 0.15 seconds, Fetched: 6 row(s)
Based on an existing table, use the as keyword to create a table whose columns match the query result, inserting the query's result data into the new table at the same time:
create table student_info3 as select * from student_info;
Create a table with only the student_id and name columns:
create table student_info4 as select student_id,name from student_info;
Renaming a table
Rename student_info4 to student_id_name:
alter table student_info4 rename to student_id_name;
Adding columns
Add a gender column to student_info3. A newly added column is placed after all existing columns and before any partition columns. For data files written before the column was added, the new column shows as NULL when queried. To add several columns at once, separate them with commas.
hive> alter table student_info3 add columns(gender string comment 'gender');
OK
Time taken: 0.185 seconds
Dropping or modifying columns
To modify a column, redefine every column that should remain, rewriting the definition of the column to be changed:
hive> alter table student_info3 replace columns(student_id string,name string,age int,origin string,gender2 int);
OK
Time taken: 0.422 seconds
To drop a column, redefine every column that should remain and leave out the column to be dropped:
hive> alter table student_info3 replace columns(student_id string,name string,age int,origin string);
OK
Time taken: 0.529 seconds
2. Using bucketed tables
Creating a bucketed table
Rows are distributed into buckets by the hash of a chosen column. Create a bucketed table of student enrollment info:
Field       Type    Comment     Bucket column
student_id  string  student ID  yes
name        string  name        no
age         int     age         no
origin      string  college ID  no
create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Inserting data into a bucketed table
Insert data into the student_info_bucket table:
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
View the bucket files on HDFS:
[root@hadoop01 ~]# hadoop fs -ls /user/hive/warehouse/rel.db/student_info_bucket
Found 4 items
-rwxr-xr-x   3 hadoop supergroup         78 2018-01-24 19:33 /user/hive/warehouse/rel.db/student_info_bucket/000000_0
-rwxr-xr-x   3 hadoop supergroup         84 2018-01-24 19:33 /user/hive/warehouse/rel.db/student_info_bucket/000001_0
-rwxr-xr-x   3 hadoop supergroup         80 2018-01-24 19:33 /user/hive/warehouse/rel.db/student_info_bucket/000002_0
-rwxr-xr-x   3 hadoop supergroup         81 2018-01-24 19:33 /user/hive/warehouse/rel.db/student_info_bucket/000003_0
Notes:
Bucketed tables are generally not loaded with load: load only copies the file into the table's storage directory; Hive does not parse the data and split it by the bucketing column, so the whole file ends up in the table unbucketed. Instead, populate bucketed tables with insert ... select from another table. Creating a bucketed table only defines the table's layout; each session that performs a bucketed insert must first set
hive.enforce.bucketing=true;
to declare that this is a bucketed operation, and set the number of reducers equal to the number of buckets:
set mapreduce.job.reduces=4;
so that exactly as many files are produced as there are buckets. If the table is declared sorted by the bucketing column, the query feeding the insert must sort the rows by that column before they are written into the buckets; the bucketed table does not sort the rows within each bucket for you. When the sort column and the bucketing column are the same, use cluster by (col), which sorts ascending within each bucket by default; for a descending sort within buckets, combine distribute by (col) with sort by (col desc).
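The bucket assignment described above can be sketched in Python. This is an illustration only: Hive's real hash function differs, and the `assign_bucket` helper below is made up for the sketch.

```python
def assign_bucket(student_id: str, num_buckets: int = 4) -> int:
    """Illustrative stand-in for Hive's hash(col) % num_buckets scheme:
    each row lands in exactly one of num_buckets buckets."""
    # NOTE: not Hive's actual hash; a simple deterministic hash for the sketch.
    h = sum(ord(c) for c in student_id)
    return h % num_buckets

# Group some sample ids the way a 4-bucket insert splits rows
# into files 000000_0 .. 000003_0.
buckets = {}
for sid in ["1", "2", "3", "4", "5", "6", "7", "8"]:
    buckets.setdefault(assign_bucket(sid), []).append(sid)
```

With four buckets and four reducers, each reducer writes one file, which is why four files appear under the table directory.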
3. Exporting data
Use insert to export the student_info table to a specified local path:
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data'
row format delimited? fields terminated by '\t' select * from student_info;
[root@hadoop01 export_data]# cat 000000_0
1   xiaoming    20  11
2   xiaobai     21  31
3   zhangfei    22  44
4   likui       19  44
5   zhaoyun     21  13
6   zhangsan    20  11
7   lisi        19  11
8   wangwu      23  31
9   zhaofei     19  21
10  zhangyan    20  21
11  lihe        20  22
12  caoyang     17  32
13  lihao       19  32
14  zhaoming    21  50
15  zhouhong    18  51
16  yangshuo    23  33
17  xiaofei     24  13
18  liman       23  13
19  qianbao     20  13
20  sunce       21  41
A common way to export data to the local filesystem:
[hadoop@hadoop01 export_data]$ hive -e "select * from rel.student_info" > /home/hadoop/student_info_data.txt
[hadoop@hadoop01 ~]$ cat student_info_data.txt
1   xiaoming    20  11
2   xiaobai     21  31
3   zhangfei    22  44
4   likui       19  44
5   zhaoyun     21  13
6   zhangsan    20  11
7   lisi        19  11
8   wangwu      23  31
9   zhaofei     19  21
10  zhangyan    20  21
11  lihe        20  22
12  caoyang     17  32
13  lihao       19  32
14  zhaoming    21  50
15  zhouhong    18  51
16  yangshuo    23  33
17  xiaofei     24  13
18  liman       23  13
19  qianbao     20  13
20  sunce       21  41
The default field separator in the output is '\t'.
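Because the export uses '\t' as the field separator, the file can be read back with a simple split. A minimal sketch (the `parse_export_line` helper is hypothetical, and assumes the student_info column layout: id, name, age, origin):

```python
def parse_export_line(line: str):
    """Parse one line of Hive's tab-separated export into typed fields."""
    student_id, name, age, origin = line.rstrip("\n").split("\t")
    return int(student_id), name, int(age), origin

row = parse_export_line("1\txiaoming\t20\t11\n")
# row == (1, 'xiaoming', 20, '11')
```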
4. Join queries
Create two tables:
create table rel.a(
id int,
name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
create table rel.b(
id int,
name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Load the data:
hive> load data local inpath '/home/hadoop/apps/hive_test_data/a_join_data' into table a;
Loading data to table rel.a
Table rel.a stats: [numFiles=1, totalSize=61]
OK
Time taken: 1.79 seconds
hive> load data local inpath '/home/hadoop/apps/hive_test_data/b_join_data' into table b;
Loading data to table rel.b
Table rel.b stats: [numFiles=1, totalSize=38]
OK
Time taken: 0.562 seconds
join / inner join
The two tables are joined on id; only rows whose id values match are returned. join produces the same result as inner join.
select * from a join b on a.id=b.id;
which is equivalent to:
select * from a inner join b on a.id=b.id;
.....
OK
1   a   1   AA
2   b   2   BB
3   c   3   CC
6   f   6   FF
Time taken: 44.337 seconds, Fetched: 4 row(s)
full outer join / full join
The two tables are joined on id, and all rows from both tables are returned:
select * from a full outer join b on a.id=b.id;
OK
1   a   1   AA
2   b   2   BB
3   c   3   CC
4   d   NULL    NULL
5   e   NULL    NULL
6   f   6   FF
7   g   NULL    NULL
8   h   NULL    NULL
9   i   NULL    NULL
10  j   NULL    NULL
11  k   NULL    NULL
12  l   NULL    NULL
13  m   NULL    NULL
14  n   NULL    NULL
NULL    NULL    20  TT
NULL    NULL    21  UU
NULL    NULL    22  vv
left join
In a left join, every join key from the left table is kept; right-table columns are NULL where no match exists:
select * from a left join b on a.id=b.id;
OK
1   a   1   AA
2   b   2   BB
3   c   3   CC
4   d   NULL    NULL
5   e   NULL    NULL
6   f   6   FF
7   g   NULL    NULL
8   h   NULL    NULL
9   i   NULL    NULL
10  j   NULL    NULL
11  k   NULL    NULL
12  l   NULL    NULL
13  m   NULL    NULL
14  n   NULL    NULL
right join
In a right join, every join key from the right table is kept; left-table columns are NULL where no match exists:
select * from a right join b on a.id=b.id;
OK
1   a   1   AA
2   b   2   BB
3   c   3   CC
6   f   6   FF
NULL    NULL    20  TT
NULL    NULL    21  UU
NULL    NULL    22  vv
Time taken: 25.188 seconds, Fetched: 7 row(s)
left semi join
A left semi join implements query semantics similar to IN/EXISTS, outputting the matching rows' columns from the left table. Hive does not support the IN ... EXISTS subquery constructs found in relational databases, and there is currently no right semi join. For example:
select a.id, a.name from a where a.id in (select b.id from b);
The Hive equivalent is:
select a.id,a.name from a left semi join b on a.id = b.id;
OK
1   a
2   b
3   c
6   f
Time taken: 27.42 seconds, Fetched: 4 row(s)
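The semantics can be sketched in Python: a left semi join is a filter on the left table by key membership in the right table. The sample data mirrors tables a and b, and the `left_semi_join` helper is made up for illustration:

```python
def left_semi_join(left_rows, right_rows):
    """Keep left rows whose join key appears in the right table.
    Only left-table columns are returned, and each left row appears
    at most once even if the right table has duplicate keys."""
    right_keys = {rid for rid, _ in right_rows}
    return [(lid, name) for lid, name in left_rows if lid in right_keys]

a = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (6, "f")]
b = [(1, "AA"), (2, "BB"), (3, "CC"), (6, "FF"), (20, "TT")]
result = left_semi_join(a, b)
# result == [(1, 'a'), (2, 'b'), (3, 'c'), (6, 'f')]
```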
map side join
Map-side join uses the distributed cache to load the small table into every map task, so the join is completed on the map side; the map output does not have to be copied to a reduce stage for joining, which cuts the cost of transferring data between nodes. It is also one way to mitigate data skew in multi-table joins. In a multi-table join where only one table is large and the others are small, the join is converted into a map-only job; the run log will show the hint "Number of reduce tasks is set to 0 since there's no reduce operator". For example:
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
Total MapReduce CPU Time Spent: 1 seconds 320 msec
OK
1   a
2   b
3   c
6   f
Time taken: 25.538 seconds, Fetched: 4 row(s)
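The idea behind a map-side join can be sketched outside Hive: the small table becomes an in-memory hash map handed to every map task, so joining is a local lookup with no shuffle. A sketch under those assumptions, not Hive's implementation (the `map_side_join` helper is hypothetical):

```python
def map_side_join(big_rows, small_map):
    """Broadcast hash join: probe the in-memory map built from the small
    table row by row; matching rows are emitted with the joined column.
    No reduce/shuffle phase is involved."""
    return [(rid, name, small_map[rid]) for rid, name in big_rows if rid in small_map]

small_b = {1: "AA", 2: "BB", 3: "CC", 6: "FF"}   # small table, "broadcast" to mappers
big_a = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (6, "f")]
result = map_side_join(big_a, small_b)
# result == [(1, 'a', 'AA'), (2, 'b', 'BB'), (3, 'c', 'CC'), (6, 'f', 'FF')]
```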
5. Built-in Hive functions
Create a user score table:
create table rel.user_core_info(
user_id string,
age int,
gender string,
core int
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Load the data:
load data local inpath '/home/hadoop/apps/hive_test_data/user_core.txt' into table rel.user_core_info;
Conditional function: case when
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
If a equals b, return c; if a equals d, return e; otherwise return f.
For example:
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
OK
one
Time taken: 0.152 seconds, Fetched: 1 row(s)
Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
If a is TRUE, return b; if c is TRUE, return d; otherwise return e. For example:
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
OK
one
Time taken: 0.33 seconds, Fetched: 1 row(s)
Query the user score table for the maximum score in each age band:
select gender,
case when age<=20 then 'p0' when age>20 and age<=50 then 'p1' when age>=50 then 'p3' else 'p0' end,
max(core) max_core
from rel.user_core_info
group by gender,
case when age<=20 then 'p0' when age>20 and age<=50 then 'p1' when age>=50 then 'p3' else 'p0' end;
The result:
OK
female  p0  90
female  p1  95
female  p3  90
male    p0  80
male    p1  80
male    p3  80
Time taken: 28.461 seconds, Fetched: 6 row(s)
User-defined functions (UDF)
When Hive's built-in functions cannot meet your processing needs, consider a user-defined function (UDF). A UDF operates on a single data row and produces a single data row as output. Steps:
Write a Java class that extends UDF and overloads the evaluate method
Package it into a jar and upload it to the server
When using it, add the jar to Hive's classpath
hive>add jar /home/hadoop/apps/hive_test_data/HiveUdfPro-1.0-SNAPSHOT.jar;
Create a temporary function tied to the compiled Java class
hive>create temporary function age_partition as 'cn.chinahadoop.udf.AgePartitionFunction';
The custom function can then be used in HQL.
Create a new Maven project
The POM is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.yongliang.udf</groupId>
  <artifactId>HiveUdfPro</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>HiveUdfPro</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>1.2.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <executions>
          <execution>
            <id>default-compile</id>
            <phase>compile</phase>
            <goals>
              <goal>compile</goal>
            </goals>
            <configuration>
              <encoding>UTF-8</encoding>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Create a class extending UDF
package com.yongliang.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
/**
 * Created: 2018/1/27 15:35
 * A Hive UDF operates on a single data row and produces a single data row as output.
 * @author zhangyonglaing
 */
public class AgePartitionFunction extends UDF {
    public String evaluate(int age) {
        String partition = "p0";
        if (age <= 20) {
            partition = "p0";
        } else if (age > 20 && age <= 50) {
            partition = "p1";
        } else if (age > 50) {
            partition = "p2";
        }
        return partition;
    }
}
Package the project.
[Image: Hive UDF custom function]
Note:
If the build fails with the following error: Failed to execute goal on project hive-exec: Could not resolve dependencies for project org.apache.hive:hive-exec:jar:2.3.0: Could not find artifact org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde in alimaven (http://maven.aliyun.com/nexus/content/groups/public/) -> [Help 1]
[Image: error message]
manually download the jar pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar from:
https://public.nexus.pentaho.org/content/groups/omni/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/
and place it in your local Maven repository under org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde, then package again.
Add the jar to Hive's classpath:
hive> add jar /home/hadoop/apps/HiveUdfPro-1.0-SNAPSHOT.jar;
Added [/home/hadoop/apps/HiveUdfPro-1.0-SNAPSHOT.jar] to class path
Added resources: [/home/hadoop/apps/HiveUdfPro-1.0-SNAPSHOT.jar]
Create a temporary function tied to the Java class:
hive> create temporary function age_partition as 'com.yongliang.udf.AgePartitionFunction';
Use the custom function in HQL:
select gender,
age_partition(age),
max(core) max_core
from rel.user_core_info
group by gender,
age_partition(age);
The result:
OK
female  p0  90
female  p1  95
female  p2  90
male    p0  80
male    p1  80
male    p2  80