Hive in Practice: A Data Warehouse Tool
Official language-manual links:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
I. DDL Operations
Data Definition Language (DDL)
(1) Databases
Create, view, drop, alter, and switch databases.
1. Create a database
create database myhive;
create database if not exists myhive;
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
In this syntax notation, parentheses mark a required choice, the vertical bar inside them means "or", and square brackets mark optional parts.
create database myhive_1 comment "myhive_1 database 1" location "/myhive_1";
// location specifies an HDFS directory
2. View databases
show databases; // list all databases
desc database [extended] myhive_1; // show a database's detailed properties
select current_database(); // show the database currently in use
3. Drop a database
Default behavior:
drop database myhive_1; // equivalent to: drop database myhive_1 RESTRICT;
drop database if exists myhive_1;
drop database myhive_1 cascade; // force-drops the database and everything in it -- use with extreme caution!
4. Alter a database
The supported options vary between Hive versions, and this is rarely used in practice.
5. Switch databases
use myhive;
(2) Tables
Create, drop, alter, and view tables, and inspect a table's detailed information.
1. View tables
show tables;
show tables in myhive; // list the tables of another database from the current one
show tables "stu*"; // match table names against a pattern
2. View a table's details
desc student;
desc formatted student; // formatted detailed information about the table
desc extended student; // extended detailed information about the table
show create table student_ptn; // show the complete CREATE TABLE statement
show partitions student_ptn; // list the table's partitions
Keyword reference
CREATE TABLE
: creates a table with the given name.
EXTERNAL
: creates an external table. When a table is dropped, a managed (internal) table loses both its metadata and its data, while an external table loses only its metadata; the data files remain.
PARTITIONED BY
: a Hive SELECT normally scans the entire table, which wastes time when only part of the data matters. Partitioning a table at creation time lets queries scan just the relevant subset.
LIKE
: copies the structure of an existing table without copying its data.
COMMENT
: attaches a description to a table or a column.
ROW FORMAT
: lets you specify a custom SerDe or use the built-in one. If ROW FORMAT is omitted, or ROW FORMAT DELIMITED is specified, the built-in SerDe is used.
STORED AS TEXTFILE | SEQUENCEFILE | RCFILE
: use STORED AS TEXTFILE for plain-text data; TEXTFILE is also the default (check it with set hive.default.fileformat). Use STORED AS SEQUENCEFILE when the data should be compressed. RCFILE is a hybrid row/columnar storage format.
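As a sketch of the STORED AS clause (the table name student_seq is made up for illustration):

```sql
-- Show the current default file format (normally TextFile):
set hive.default.fileformat;

-- Create a table whose data files are stored as SequenceFile:
create table student_seq(id int, name string)
row format delimited fields terminated by ","
stored as sequencefile;
```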
CLUSTERED BY
: any table or partition can be further organized into buckets, a finer-grained division of the data. Bucketing is done on a chosen column: Hive hashes the column value and takes the result modulo the number of buckets to decide which bucket a record lands in.
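The hash-modulo rule can be previewed with Hive's built-in hash() and pmod() functions; this is a rough sketch against the student table assuming 3 buckets (Hive's internal assignment may differ in edge cases such as negative hash codes):

```sql
-- Which of 3 buckets would each row's department roughly fall into?
-- pmod() keeps the result non-negative even when hash() is negative.
select department,
       pmod(hash(department), 3) as bucket_no
from student;
```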
LOCATION
: specifies the HDFS directory where the data files live. It can be set for both managed and external tables; if omitted, the default warehouse path is used.
3. Create tables
Bucketed table
Example: CLUSTERED BY (department) SORTED BY (age ASC, id DESC) INTO 3 BUCKETS
Create Table ... As Select, abbreviated CTAS
create table studentss like student; // copies the table structure, not the data
There are six kinds of CREATE TABLE:
(1) Create a managed (internal) table
create table student(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";
Managed table type: MANAGED_TABLE
(2) Create an external table
create external table student_ext_1(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";
Managed vs. external tables:
1. An external table is created with the keyword external.
2. An external table is usually given an explicit external location.
The difference when dropping:
Dropping a managed table deletes both metadata and data; dropping an external table deletes only the metadata.
Managed or external -- which to choose?
1. If the data already lives on HDFS, needs Hive for analysis, and may also be processed by other compute engines, use an external table.
2. If the data will be analyzed only by Hive, a managed table is fine.
// Pointing at a nonexistent external path: the table directory is created automatically
create external table student_ext_2(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_2";
// Pointing at an existing directory that already contains data:
// run in the Linux shell
// hadoop fs -mkdir -p /student_ext_3
// hadoop fs -put /home/bigdata/data/student.txt /student_ext_3
// then run in the Hive CLI
create external table student_ext_3(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_3";
(3) Create a partitioned table
// A partitioned table with a single partition column:
create table student_ptn(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field") row format delimited fields terminated by ",";
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn; // WRONG: the target partition must be specified
// Loading into a partition that does not exist yet creates the partition automatically
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn partition(city="beijing"); // correct
Note: the columns in partitioned by must not repeat columns already declared in the table schema.
A partition column is a virtual column: its value is stored in the metastore, not in the data files.
In queries, a partition column is used exactly like a regular column.
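For example, filtering on the virtual partition column reads only the matching partition's directory instead of scanning the whole table:

```sql
-- city behaves like an ordinary column in the select list and WHERE clause,
-- but this query only touches the beijing partition:
select id, name, city from student_ptn where city = 'beijing';
```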
// Loading into a partition that already exists
alter table student_ptn add partition (city="chongqing"); // creates an empty partition, no data change yet
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn partition(city="chongqing"); // INTO appends: loading the same file into a populated partition doubles the data
// A partitioned table with multiple partition columns:
create table student_ptn_date(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field", dt string) row format delimited fields terminated by ",";
// Loading data into a partition:
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing"); // WRONG: every partition column must be given a value
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-12'); // correct
// LOAD cannot target more than one partition spec at a time
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-14') partition(city="beijing" , dt='2012-12-13'); // WRONG
// ALTER TABLE ... ADD PARTITION, by contrast, accepts several partition specs
alter table student_ptn_date add partition(city="beijing", dt='2012-12-14') partition (city="beijing" , dt='2012-12-13'); // correct
alter table student_ptn_date add partition(city="chongqing", dt='2012-12-14') partition (city="chongqing" , dt='2012-12-13');
// List the partitions of a table
show partitions student_ptn;
show partitions student_ptn_date;
show partitions student; // error: student is not a partitioned table
(4) Create a bucketed table
// Create a bucketed table
create table student_bucket (id int, name string, sex string, age int, department string) clustered by (department) sorted by (age desc, id asc) into 3 buckets row format delimited fields terminated by ",";
// desc formatted student_bucket;
// Num Buckets: 3
// Bucket Columns: [department]
// Sort Columns: [Order(col:age, order:0), Order(col:id, order:1)]
Note: the columns in clustered by must be columns declared in the table schema.
The bucketing column and the sort column may differ, but both must be table columns.
Data must be loaded into a bucketed table through a query-based insert; a plain LOAD does not bucket the data.
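A minimal sketch of such a query-based insert. Here student_staging is a hypothetical source table with the same columns, and hive.enforce.bucketing applies to Hive versions before 2.0 (later versions always enforce bucketing):

```sql
-- Make INSERT honor the table's bucket definition (pre-2.0 Hive):
set hive.enforce.bucketing = true;

-- Rows are hashed on department into student_bucket's 3 buckets:
insert into table student_bucket
select id, name, sex, age, department from student_staging;
```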
(5) Create a table from a query result (CTAS)
// The general form:
create table ... as select ...
// An example query:
select department, count(*) as total from student group by department;
// The full CTAS statement:
create table dpt_count as select department, count(*) as total from student group by department;
(6) Create a table by copying an existing table's structure with LIKE
create table student_like like student;
4. Drop tables
drop table student;
drop table if exists student;
5. Alter tables
1) Rename a table
alter table student_like rename to studentss;
2) Modify columns
Add columns:
alter table student2 add columns (city string, dt string);
Drop a column:
alter table student2 drop columns (city); // ERROR: Hive has no DROP COLUMNS; use REPLACE COLUMNS instead
Replace the full column list:
alter table student2 replace columns (id int, name string, sex string, age int);
Change a column's definition:
alter table student2 change id newid string comment "new id";
Change a column's position:
alter table student2 change sex sex string first;
alter table student2 change name name string after sex;
3) Modify partitions
Add partitions:
alter table student_ptn add partition(city='tianjin') partition(city='shanghai');
Drop partitions:
alter table student_ptn drop partition(city='tianjin');
alter table student_ptn drop partition(city='tianjin'),partition(city='shanghai');
Change a partition's data directory:
alter table student_ptn partition(city="beijing") set location "/stu_beijing"; // ERROR: a bare path is rejected here
alter table student_ptn partition(city="beijing") set location "hdfs://bigdata02:8020/stu_beijing"; // correct: SET LOCATION needs a full URI
6. Truncate a table
truncate table student; // truncating removes only the table's data files, much like:
hadoop fs -rm -r /user/hive/warehouse/myhive.db/student/*
II. DML Operations
Data Manipulation Language (DML)
1. Loading data
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
(1) LOAD is a copy or a move: the data file ends up under the path Hive manages for the table.
With LOCAL the file is copied; without LOCAL it is moved.
// Load from an absolute local path:
load data local inpath "/home/bigdata/data/student.txt" into table student;
// Load from a relative local path:
load data local inpath "./student.txt" into table student;
load data local inpath './student.txt' overwrite into table student;
(OVERWRITE replaces the table's existing data)
Loading local data copies (uploads) the file.
// Load from a short-form HDFS path:
// hadoop fs -put /home/bigdata/data/student.txt /
load data inpath '/student.txt' into table student;
// Load from a fully qualified HDFS path:
load data inpath 'hdfs://bigdata02:9000/student.txt' into table student;
Loading data that is already on HDFS moves (cuts) the file into the table directory.
insert
insert into table student (id, name, sex, age, department) values (101,"huangbo","M",222,"IT");
Create a partitioned table:
create table student_ptn (id int, name string, sex string, age int) partitioned by (department string) row format delimited fields terminated by ",";
Single insert:
insert into table student_ptn partition (department = 'IS') select id,name,sex,age from student where department = 'IS';
insert into table student_ptn partition (department = 'CS') select id,name,sex,age from student where department = 'CS';
insert into table student_ptn partition (department = 'MA') select id,name,sex,age from student where department = 'MA';
Multiple insert:
from student
insert into table student_ptn partition (department = 'IS') select id,name,sex,age where department = 'IS'
insert into table student_ptn partition (department = 'CS') select id,name,sex,age where department = 'CS'
insert into table student_ptn partition (department = 'MA') select id,name,sex,age where department = 'MA';
The big win of a multiple insert is that several structurally identical statements are combined: the generated MapReduce job reads the source data only once.
Partition insert:
The partition must be created manually first.
alter table student add partition (city="zhengzhou");
load data local inpath '/student.txt' into table student partition(city='zhengzhou');
CTAS (create table ... as select ...) stores a query result directly into a newly created table.
The result is a managed (internal) table:
create table student as select id,name,age,department from mingxing;
Note: the new table's columns take the names, types, and comments of the columns in the select list exactly.
Limitations -- CTAS cannot create:
1. an external table
2. a partitioned table
3. a bucketed table
Bucket insert:
Create a bucketed table:
create table mingxing(id int, name string, sex string, age int, department string)
clustered by(id) sorted by(age desc) into 4 buckets
row format delimited fields terminated by ',';
// LOAD cannot be used to put data directly into a bucketed table
Insert the data:
insert into table mingxing select id,name,sex,age,department from mingxing2
distribute by id sort by age desc;
Note: the distribute by / sort by clauses in the query must match the table's bucketing and sorting definition.
2. Exporting data
Single export to a local directory:
insert overwrite local directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Multiple export to local directories:
from mingxing
insert overwrite local directory '/root/outputdata1' select id, name
insert overwrite local directory '/root/outputdata2' select id, name, age;
Short-form path export to HDFS:
insert overwrite directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Fully qualified path export to HDFS:
insert overwrite directory 'hdfs://bigdata02:9000/root/outputdata1' select id,name,sex,age,department from mingxing;
local: export to a local directory
overwrite: overwrite the target directory
3. Querying data
The basic SELECT syntax in Hive closely follows standard SQL: WHERE, DISTINCT, GROUP BY, ORDER BY, HAVING, LIMIT, and subqueries are all supported.
order by: global ordering (a single total sort)
sort by: local ordering within each reducer
Usually combined with bucketing:
distribute by id sort by age desc;
distribute by: only distributes rows across reducers (buckets)
When using distribute by, set the number of reduce tasks explicitly.
cluster by: distributes and sorts on the same column
cluster by age = distribute by age sort by age;
cluster by cannot be combined with sort by.
where, group by, distinct, having, case...when, ...
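Putting the pieces together, a sketch of a distribute by / sort by query with the reducer count set explicitly (the value 3 is arbitrary):

```sql
-- Run with 3 reducers; rows with the same department go to the same
-- reducer, and each reducer's output is sorted by age descending:
set mapreduce.job.reduces = 3;

select id, name, age, department
from student
distribute by department
sort by age desc;
```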
III. Hive Views
Like relational databases, Hive provides views; note, however, that Hive views differ from relational views in important ways:
1. Views are logical only; there are no materialized views.
2. A view can only be queried; Load/Insert/Update/Delete against it are not allowed.
3. Creating a view stores only a piece of metadata; the view's underlying query runs only when the view itself is queried.
1. Create a view
create view view_name as select * from carss;
create view carss_view as select * from carss limit 500;
2. View views
show tables; // lists views alongside tables
desc view_name; // show a specific view's information
desc carss_view;
3. Drop a view
drop view view_name;
drop view if exists carss_view;
4. Use a view
create view sogou_view as select * from sogou_table where rank > 3;
select count(distinct uid) from sogou_view;