Data Warehouse: Hive Basics (5) Basic Hive Operations

1. Database operations

1.1 Creating a database

create database if not exists myhive; 

use myhive;

Note: the default storage location for Hive tables is specified by a property in hive-site.xml:

<name>hive.metastore.warehouse.dir</name> 
<value>/user/hive/warehouse</value>
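
As a rough illustration of how this property determines where table data ends up, the Python sketch below models the default directory layout. It assumes the commonly seen convention that a database gets a `<name>.db` directory under the warehouse directory and that tables in the built-in `default` database sit directly under it; the function name `default_table_path` and the table name `stu` are only illustrative.

```python
import posixpath

# Value of hive.metastore.warehouse.dir shown above
WAREHOUSE_DIR = "/user/hive/warehouse"


def default_table_path(database: str, table: str,
                       warehouse_dir: str = WAREHOUSE_DIR) -> str:
    """Return the default HDFS directory of a managed table (simplified model)."""
    if database == "default":
        # Tables in the built-in default database sit directly under the warehouse dir.
        return posixpath.join(warehouse_dir, table)
    # Other databases get a "<name>.db" directory unless created with LOCATION.
    return posixpath.join(warehouse_dir, f"{database}.db", table)


print(default_table_path("myhive", "stu"))
```

A database created with an explicit location, as in section 1.2, does not follow this layout.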

1.2 Creating a database at a specified location

create database myhive2 location '/myhive2';

1.3 Setting database key-value properties

A database can carry descriptive key-value properties, which are added when it is created:

create database foo with dbproperties ('owner'='itcast', 'date'='20190120');

View the database's key-value properties:

describe database extended foo;

Modify the database's key-value properties:

alter database foo set dbproperties ('owner'='itheima');

1.4 Viewing more detailed database information

desc database extended myhive2;

1.5 Dropping a database

Drop an empty database. If the database still contains tables, the statement fails with an error.

drop database myhive2;

Force-drop a database together with all the tables it contains:

drop database myhive cascade;

2. Database table operations

create [external] table [if not exists] table_name (
col_name data_type [comment 'column description'],
col_name data_type [comment 'column description'])
[comment 'table description']
[partitioned by (col_name data_type,...)]
[clustered by (col_name,col_name,...)
[sorted by (col_name [asc|desc],...)] into num_buckets buckets]
[row format row_format]
[stored as ....]
[location 'path of the table']

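Notes:

create table

Creates a table with the given name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option suppresses that exception.

external

Lets the user create an external table and, at creation time, specify a path (LOCATION) that points to the actual data. For an internal table, Hive moves the data into the path the data warehouse points to; for an external table, it only records where the data lives and leaves the data where it is. When a table is dropped, an internal table's metadata and data are deleted together, whereas an external table loses only its metadata and its data is kept.

comment

Specifies a comment. By default, Chinese text cannot be used in comments.

partitioned by

Declares a partitioned table. A table can have one or more partitions, and each partition is stored in its own directory.

clustered by

Within each table or partition, Hive can further organize the data into buckets, which are a finer-grained division of the data range. Bucketing is also organized on a particular column.

sorted by

Specifies the sort columns and the sort order.

row format

Specifies the field delimiter of the table files.

stored as

Specifies the storage format of the table files. Common formats are SEQUENCEFILE, TEXTFILE, and RCFILE. If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.

location

Specifies the storage path of the table files.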

3. Internal table operations

If the external keyword is not used when a table is created, the table is an internal table (managed table).

Hive column data types

Category	Type	Description	Literal example
Primitive	BOOLEAN	true/false	TRUE
	TINYINT	1-byte signed integer, -128 to 127	1Y
	SMALLINT	2-byte signed integer, -32768 to 32767	1S
	INT	4-byte signed integer	1
	BIGINT	8-byte signed integer	1L
	FLOAT	4-byte single-precision floating point	1.0
	DOUBLE	8-byte double-precision floating point	1.0
	DECIMAL	Arbitrary-precision signed decimal	1.0
	STRING	Variable-length string	"a", 'b'
	VARCHAR	Variable-length string	"a", 'b'
	CHAR	Fixed-length string	"a", 'b'
	BINARY	Byte array	Cannot be expressed
	TIMESTAMP	Timestamp, up to nanosecond precision	122327493795
	DATE	Date	'2016-03-29'
	INTERVAL	Time interval
Complex	ARRAY	Ordered collection of elements of the same type	array(1,2)
	MAP	Key-value pairs; keys must be primitive types, values can be any type	map('a',1,'b',2)
	STRUCT	Collection of fields whose types may differ	struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0)
	UNIONTYPE	A value that holds exactly one of a fixed set of types	create_union(1,'a',63)
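
To make the integer ranges in the table concrete, the following Python sketch picks the narrowest integer type whose signed range can hold a given value. The byte widths come from the table; the helper name `smallest_integer_type` is hypothetical and not part of Hive.

```python
# (type name, width in bytes) as listed in the table above
INTEGER_TYPES = [("TINYINT", 1), ("SMALLINT", 2), ("INT", 4), ("BIGINT", 8)]


def smallest_integer_type(value: int) -> str:
    """Return the narrowest integer type whose signed range contains value."""
    for name, size in INTEGER_TYPES:
        bits = size * 8
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        if low <= value <= high:
            return name
    raise ValueError(f"{value} does not fit in an 8-byte signed integer")


print(smallest_integer_type(127))    # TINYINT
print(smallest_integer_type(128))    # SMALLINT
print(smallest_integer_type(32768))  # INT
```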

Getting started with table creation:

-- Select the database created earlier
use myhive;

-- Create the student table
create table stu(id int,name string);

-- Insert one row (insert runs as a MapReduce job, so it is slow)
insert into stu values (1,"zhangsan");

-- Query the table contents
select * from stu;

Create a table and specify the field delimiter

create table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t';
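
The delimiter matters because Hive applies the table schema when the text files are read, not when they are written. The Python sketch below is a simplified model of how one line of a tab-delimited file could be mapped onto the `(id int, name string)` columns of `stu2`. It assumes that a value that cannot be parsed as an int, or a missing column, is returned as NULL (modelled here as `None`); the function name `parse_row` is illustrative only.

```python
from typing import List, Optional, Tuple


def parse_row(line: str, delimiter: str = "\t") -> Tuple[Optional[int], Optional[str]]:
    """Split one text line into the (id int, name string) columns of stu2."""
    fields: List[str] = line.rstrip("\n").split(delimiter)
    raw_id = fields[0] if len(fields) > 0 else None
    name = fields[1] if len(fields) > 1 else None  # missing column -> NULL
    try:
        parsed_id = int(raw_id) if raw_id is not None else None
    except ValueError:
        parsed_id = None  # unparsable int -> NULL instead of an error
    return parsed_id, name


print(parse_row("1\tzhangsan"))  # (1, 'zhangsan')
print(parse_row("abc\tlisi"))    # (None, 'lisi')
```

If the file used commas while the table declared '\t', every line would be read as a single field and the remaining columns would come back as NULL.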

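Create a table and specify where the table files are stored (if stu2 already exists from the previous statement, if not exists turns this statement into a no-op; drop stu2 first to try it)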

create table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' location '/user/stu2';

Create a table from a query result

create table stu3 as select * from stu2; -- copies both the structure and the contents into a new table

Create a table from an existing table's structure

create table stu4 like stu;

Show detailed table information

desc formatted stu2;

Drop a table

drop table stu4;

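4. External table operations

About external tables

Because an external table loads data that lives under some other HDFS path, Hive does not consider itself the sole owner of that data. When the Hive table is dropped, the data therefore stays in HDFS and is not deleted.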
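
The difference in drop behaviour can be pictured with a toy model. The Python sketch below is not Hive code: `ToyWarehouse`, its fields, and the paths used are all hypothetical, and the model only captures the rule stated above (dropping always removes metadata, but removes data only for internal tables).

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ToyWarehouse:
    """Toy model: metastore maps table name -> (path, is_external); hdfs holds paths."""
    metastore: Dict[str, Tuple[str, bool]] = field(default_factory=dict)
    hdfs: Set[str] = field(default_factory=set)

    def create_table(self, name: str, path: str, external: bool) -> None:
        self.metastore[name] = (path, external)
        self.hdfs.add(path)

    def drop_table(self, name: str) -> None:
        path, external = self.metastore.pop(name)  # metadata is always removed
        if not external:
            self.hdfs.discard(path)  # internal (managed) table: data is removed too


warehouse = ToyWarehouse()
warehouse.create_table("stu", "/user/hive/warehouse/myhive.db/stu", external=False)
warehouse.create_table("teacher", "/hypothetical/teacher", external=True)
warehouse.drop_table("stu")
warehouse.drop_table("teacher")
print(sorted(warehouse.hdfs))  # only the external table's path is left
```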

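When to use internal and external tables

Website logs collected each day are periodically written to HDFS text files. Extensive statistical analysis is done on top of an external table (the raw log table). The intermediate and result tables used along the way are stored as internal tables, and data enters them through SELECT + INSERT.

Worked example

Create external tables for teachers and students, and load data into them.

Create the teacher table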

create external table teacher (t_id string,t_name string) row format delimited fields terminated by '\t';

Create the student table

create external table student (s_id string,s_name string,s_birth string , s_sex string ) row format delimited fields terminated by '\t';

Load data

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Load data and overwrite the existing data

load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

Load data into a table from HDFS (the data must be uploaded to HDFS beforehand)

# Change to the data directory
cd /export/servers/hivedatas

# Create the target directory on HDFS and upload the file
hdfs dfs -mkdir -p /hivedatas
hdfs dfs -put techer.csv /hivedatas/

-- Load the data into the table (run in Hive)
load data inpath '/hivedatas/techer.csv' into table teacher;

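5. Partitioned table operations

A common idea in big data is divide and conquer: a large file is split into many small files, so each operation only has to touch a small file. Hive supports the same idea. Large data sets can be split by month or by day into smaller files that are stored in separate directories.

Syntax for creating a partitioned table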

create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns

create table score2 (s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

Load data into a partitioned table

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

Load data into a multi-partition table

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');
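
Since each partition is stored in its own directory, the partition values given in the load statement translate into nested `key=value` directories. The Python sketch below shows that mapping. The table directory `/user/hive/warehouse/myhive.db/score2` is an assumption (it presumes the table lives in the myhive database under the default warehouse path), and `partition_dir` is an illustrative helper, not a Hive function.

```python
from typing import List, Tuple


def partition_dir(table_dir: str, spec: List[Tuple[str, str]]) -> str:
    """Join key=value segments in the order the partition columns were declared."""
    segments = [f"{key}={value}" for key, value in spec]
    return "/".join([table_dir.rstrip("/")] + segments)


print(partition_dir(
    "/user/hive/warehouse/myhive.db/score2",  # assumed table directory
    [("year", "2018"), ("month", "06"), ("day", "01")],
))
```

A query that filters on the partition columns only has to read the matching directories, which is where the divide-and-conquer benefit comes from.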

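Combined query across partitions (using union all). Both branches of the example read month = '201806'; change the value in one branch to combine different partitions.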

select * from score where month = '201806' union all select * from score where month = '201806';

Show partitions

show partitions score;

Add a partition

alter table score add partition(month='201805');

Drop a partition

alter table score drop partition(month = '201806');

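6. Partitioned table exercise

A file named score.csv is stored on the cluster in the directory /scoredatas/month=201806. A new file is generated every day and placed in the directory for the corresponding date. Other users share the file as well, so it cannot be moved. Requirement: create the corresponding Hive table, load the data into it for statistical analysis, and make sure that dropping the table does not delete the data.

Data preparation: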

hdfs dfs -mkdir -p /scoredatas/month=201806 
hdfs dfs -put score.csv /scoredatas/month=201806/

Create an external partitioned table and specify the directory where the data files are stored

create external table score4(s_id string, c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t' location '/scoredatas';

Repair the table (establish the mapping between the table and the data files)

msck repair table score4;
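
Conceptually, the repair step has to find partition directories that exist on HDFS but are not yet registered for the table. The Python sketch below models only the discovery part: it scans a list of file paths and extracts the `key=value` directory names under the table location. It is a simplified illustration, not Hive's implementation, and `discover_partitions` is a hypothetical name.

```python
from typing import Dict, List


def discover_partitions(table_dir: str, paths: List[str]) -> List[Dict[str, str]]:
    """Collect distinct key=value partition specs from file paths under table_dir."""
    prefix = table_dir.rstrip("/") + "/"
    found: List[Dict[str, str]] = []
    for path in paths:
        if not path.startswith(prefix):
            continue
        # Drop the file name; keep only directory segments shaped like key=value.
        dirs = path[len(prefix):].split("/")[:-1]
        spec = dict(seg.split("=", 1) for seg in dirs if "=" in seg)
        if spec and spec not in found:
            found.append(spec)
    return found


print(discover_partitions("/scoredatas", ["/scoredatas/month=201806/score.csv"]))
```

This is why the data preparation step names the directory month=201806: the directory name has to match the partition column declared on the table.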

7. Bucketed table operations

Bucketing divides data into multiple files according to a specified column. It is the counterpart of partitioning in MapReduce.

Enable Hive's bucketing feature

set hive.enforce.bucketing=true;

Set the number of reducers

set mapreduce.job.reduces=3;

Create a bucketed table

create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

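Loading data into a bucketed table: neither hdfs dfs -put nor load data works for bucketed tables, so the data can only be loaded through insert overwrite.

Create an ordinary table, then use insert overwrite with a query to load the ordinary table's data into the bucketed table.

Create the ordinary table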

create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';

Load data into the ordinary table

load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;

Load data into the bucketed table through insert overwrite

insert overwrite table course select * from course_common cluster by(c_id);
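
The reason the rows must pass through a query is that each row's bucket is computed from the bucketing column, typically as a hash of the value modulo the number of buckets. The Python sketch below models this for a string column. It assumes a Java-style 31-based hash over the UTF-8 bytes, which is how older Hive versions are commonly described; newer versions may use a different hash function, so treat the concrete bucket numbers as illustrative. The `c_id` values and both function names are hypothetical.

```python
def legacy_string_hash(value: str) -> int:
    """31-based rolling hash over UTF-8 bytes, wrapped to a signed 32-bit int."""
    h = 0
    for b in value.encode("utf-8"):
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = (31 * h + signed) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_for(value: str, num_buckets: int = 3) -> int:
    """Clear the sign bit, then take the remainder by the bucket count."""
    return (legacy_string_hash(value) & 0x7FFFFFFF) % num_buckets


for c_id in ["01", "02", "03"]:  # hypothetical c_id values
    print(c_id, "-> bucket", bucket_for(c_id))
```

With 3 buckets and 3 reducers, each reducer writes the file for one bucket, which is why the reducer count above matches the bucket count.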

Modifying table structure

Rename:

alter table old_table_name rename to new_table_name;

Rename table score4 to score5

alter table score4 rename to score5;

Add or modify column information:

Show the table structure

desc score5;

Add columns

alter table score5 add columns (mycol string, mysco int);

Change a column

alter table score5 change column mysco mysconew int;

Drop the table

drop table score5;

Loading data into Hive tables

Insert data directly into a partitioned table

create table score3 like score;

insert into table score3 partition(month ='201807') values ('001','002',100);

Load data with load

load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

Load data with a query

create table score4 like score; 

insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;