Hive in Practice: A Data Warehouse Tool
Official language-manual links:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
I. DDL Operations
Data Definition Language (DDL)
(1) Databases
Create, view, drop, alter, and switch databases.
1. Create a database
create database myhive;
create database if not exists myhive;
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
In this syntax notation, parentheses mark a required choice, the vertical bar inside them means "or", and square brackets mark optional parts.
create database myhive_1 comment "myhive_1 database 1" location "/myhive_1";
// location specifies an HDFS directory
2. View databases
show databases; // list all databases
desc database [extended] myhive_1; // show a database's detailed properties
select current_database(); // show the database currently in use
3. Drop a database
Default behavior:
drop database myhive_1; // equivalent to: drop database myhive_1 RESTRICT;
drop database if exists myhive_1;
drop database myhive_1 cascade; // force-drops the database and everything in it -- use with extreme caution!
4. Alter a database
The supported options vary between Hive versions, and this is rarely used in practice.
5. Switch databases
use myhive;
(2) Tables
Create, drop, alter, and view tables, and inspect a table's detailed information.
1. View tables
show tables;
show tables in myhive; // list the tables of another database from the current one
show tables "stu*"; // match table names against a pattern
2. View a table's details
desc student;
desc formatted student; // formatted detailed information about the table
desc extended student; // extended detailed information about the table
show create table student_ptn; // show the complete CREATE TABLE statement
show partitions student_ptn; // list the table's partitions
Keyword reference
CREATE TABLE
: creates a table with the given name.
EXTERNAL
: creates an external table. When a table is dropped, a managed (internal) table loses both its metadata and its data, while an external table loses only its metadata; the data files remain.
PARTITIONED BY
: a Hive SELECT normally scans the entire table, which wastes time when only part of the data matters. Partitioning a table at creation time lets queries scan just the relevant subset.
LIKE
: copies the structure of an existing table without copying its data.
COMMENT
: attaches a description to a table or a column.
ROW FORMAT
: lets you specify a custom SerDe or use the built-in one. If ROW FORMAT is omitted, or ROW FORMAT DELIMITED is specified, the built-in SerDe is used.
STORED AS TEXTFILE | SEQUENCEFILE | RCFILE
: use STORED AS TEXTFILE for plain-text data; TEXTFILE is also the default (check it with set hive.default.fileformat). Use STORED AS SEQUENCEFILE when the data should be compressed. RCFILE is a hybrid row/columnar storage format.
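As a sketch of the STORED AS clause (the table name student_seq is made up for illustration):

```sql
-- Show the current default file format (normally TextFile):
set hive.default.fileformat;

-- Create a table whose data files are stored as SequenceFile:
create table student_seq(id int, name string)
row format delimited fields terminated by ","
stored as sequencefile;
```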
CLUSTERED BY
: any table or partition can be further organized into buckets, a finer-grained division of the data. Bucketing is done on a chosen column: Hive hashes the column value and takes the result modulo the number of buckets to decide which bucket a record lands in.
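The hash-modulo rule can be previewed with Hive's built-in hash() and pmod() functions; this is a rough sketch against the student table assuming 3 buckets (Hive's internal assignment may differ in edge cases such as negative hash codes):

```sql
-- Which of 3 buckets would each row's department roughly fall into?
-- pmod() keeps the result non-negative even when hash() is negative.
select department,
       pmod(hash(department), 3) as bucket_no
from student;
```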
LOCATION
: specifies the HDFS directory where the data files live. It can be set for both managed and external tables; if omitted, the default warehouse path is used.
3. Create tables
Bucketed table
Example: CLUSTERED BY (department) SORTED BY (age ASC, id DESC) INTO 3 BUCKETS
Create Table ... As Select, abbreviated CTAS
create table studentss like student; // copies the table structure, not the data
There are six kinds of CREATE TABLE:
(1) Create a managed (internal) table
create table student(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";
Managed table type: MANAGED_TABLE
(2) Create an external table
create external table student_ext_1(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";
Managed vs. external tables:
1. An external table is created with the keyword external.
2. An external table is usually given an explicit external location.
The difference when dropping:
Dropping a managed table deletes both metadata and data; dropping an external table deletes only the metadata.
Managed or external -- which to choose?
1. If the data already lives on HDFS, needs Hive for analysis, and may also be processed by other compute engines, use an external table.
2. If the data will be analyzed only by Hive, a managed table is fine.
// Pointing at a nonexistent external path: the table directory is created automatically
create external table student_ext_2(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_2";
// Pointing at an existing directory that already contains data:
// run in the Linux shell
// hadoop fs -mkdir -p /student_ext_3
// hadoop fs -put /home/bigdata/data/student.txt /student_ext_3
// then run in the Hive CLI
create external table student_ext_3(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_3";
(3) Create a partitioned table
// A partitioned table with a single partition column:
create table student_ptn(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field") row format delimited fields terminated by ",";
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn; // WRONG: the target partition must be specified
// Loading into a partition that does not exist yet creates the partition automatically
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn partition(city="beijing"); // correct
Note: the columns in partitioned by must not repeat columns already declared in the table schema.
A partition column is a virtual column: its value is stored in the metastore, not in the data files.
In queries, a partition column is used exactly like a regular column.
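For example, filtering on the virtual partition column reads only the matching partition's directory instead of scanning the whole table:

```sql
-- city behaves like an ordinary column in the select list and WHERE clause,
-- but this query only touches the beijing partition:
select id, name, city from student_ptn where city = 'beijing';
```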
// Loading into a partition that already exists
alter table student_ptn add partition (city="chongqing"); // creates an empty partition, no data change yet
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn partition(city="chongqing"); // INTO appends: loading the same file into a populated partition doubles the data
// A partitioned table with multiple partition columns:
create table student_ptn_date(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field", dt string) row format delimited fields terminated by ",";
// Loading data into a partition:
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing"); // WRONG: every partition column must be given a value
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-12'); // correct
// LOAD cannot target more than one partition spec at a time
load data local inpath "/home/bigdata/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-14') partition(city="beijing" , dt='2012-12-13'); // WRONG
// ALTER TABLE ... ADD PARTITION, by contrast, accepts several partition specs
alter table student_ptn_date add partition(city="beijing", dt='2012-12-14') partition (city="beijing" , dt='2012-12-13'); // correct
alter table student_ptn_date add partition(city="chongqing", dt='2012-12-14') partition (city="chongqing" , dt='2012-12-13');
// List the partitions of a table
show partitions student_ptn;
show partitions student_ptn_date;
show partitions student; // error: student is not a partitioned table
(4) Create a bucketed table
// Create a bucketed table
create table student_bucket (id int, name string, sex string, age int, department string) clustered by (department) sorted by (age desc, id asc) into 3 buckets row format delimited fields terminated by ",";
// desc formatted student_bucket;
// Num Buckets: 3
// Bucket Columns: [department]
// Sort Columns: [Order(col:age, order:0), Order(col:id, order:1)]
Note: the columns in clustered by must be columns declared in the table schema.
The bucketing column and the sort column may differ, but both must be table columns.
Data must be loaded into a bucketed table through a query-based insert; a plain LOAD does not bucket the data.
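A minimal sketch of such a query-based insert. Here student_staging is a hypothetical source table with the same columns, and hive.enforce.bucketing applies to Hive versions before 2.0 (later versions always enforce bucketing):

```sql
-- Make INSERT honor the table's bucket definition (pre-2.0 Hive):
set hive.enforce.bucketing = true;

-- Rows are hashed on department into student_bucket's 3 buckets:
insert into table student_bucket
select id, name, sex, age, department from student_staging;
```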
(5) Create a table from a query result (CTAS)
// The general form:
create table ... as select ...
// An example query:
select department, count(*) as total from student group by department;
// The full CTAS statement:
create table dpt_count as select department, count(*) as total from student group by department;
(6) Create a table by copying an existing table's structure with LIKE
create table student_like like student;
4. Drop tables
drop table student;
drop table if exists student;
5. Alter tables
1) Rename a table
alter table student_like rename to studentss;
2) Modify columns
Add columns:
alter table student2 add columns (city string, dt string);
Drop a column:
alter table student2 drop columns (city); // ERROR: Hive has no DROP COLUMNS; use REPLACE COLUMNS instead
Replace the full column list:
alter table student2 replace columns (id int, name string, sex string, age int);
Change a column's definition:
alter table student2 change id newid string comment "new id";
Change a column's position:
alter table student2 change sex sex string first;
alter table student2 change name name string after sex;
3) Modify partitions
Add partitions:
alter table student_ptn add partition(city='tianjin') partition(city='shanghai');
Drop partitions:
alter table student_ptn drop partition(city='tianjin');
alter table student_ptn drop partition(city='tianjin'),partition(city='shanghai');
Change a partition's data directory:
alter table student_ptn partition(city="beijing") set location "/stu_beijing"; // ERROR: a bare path is rejected here
alter table student_ptn partition(city="beijing") set location "hdfs://bigdata02:8020/stu_beijing"; // correct: SET LOCATION needs a full URI
6. Truncate a table
truncate table student; // truncating removes only the table's data files, much like:
hadoop fs -rm -r /user/hive/warehouse/myhive.db/student/*
II. DML Operations
Data Manipulation Language (DML)
1. Loading data
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
(1) LOAD is a copy or a move: the data file ends up under the path Hive manages for the table.
With LOCAL the file is copied; without LOCAL it is moved.
// Load from an absolute local path:
load data local inpath "/home/bigdata/data/student.txt" into table student;
// Load from a relative local path:
load data local inpath "./student.txt" into table student;
load data local inpath './student.txt' overwrite into table student;
(OVERWRITE replaces the table's existing data)
Loading local data copies (uploads) the file.
// Load from a short-form HDFS path:
// hadoop fs -put /home/bigdata/data/student.txt /
load data inpath '/student.txt' into table student;
// Load from a fully qualified HDFS path:
load data inpath 'hdfs://bigdata02:9000/student.txt' into table student;
Loading data that is already on HDFS moves (cuts) the file into the table directory.
insert
insert into table student (id, name, sex, age, department) values (101,"huangbo","M",222,"IT");
Create a partitioned table:
create table student_ptn (id int, name string, sex string, age int) partitioned by (department string) row format delimited fields terminated by ",";
Single insert:
insert into table student_ptn partition (department = 'IS') select id,name,sex,age from student where department = 'IS';
insert into table student_ptn partition (department = 'CS') select id,name,sex,age from student where department = 'CS';
insert into table student_ptn partition (department = 'MA') select id,name,sex,age from student where department = 'MA';
Multiple insert:
from student
insert into table student_ptn partition (department = 'IS') select id,name,sex,age where department = 'IS'
insert into table student_ptn partition (department = 'CS') select id,name,sex,age where department = 'CS'
insert into table student_ptn partition (department = 'MA') select id,name,sex,age where department = 'MA';
The big win of a multiple insert is that several structurally identical statements are combined: the generated MapReduce job reads the source data only once.
Partition insert:
The partition must be created manually first.
alter table student add partition (city="zhengzhou");
load data local inpath '/student.txt' into table student partition(city='zhengzhou');
CTAS (create table ... as select ...) stores a query result directly into a newly created table.
The result is a managed (internal) table:
create table student as select id,name,age,department from mingxing;
Note: the new table's columns take the names, types, and comments of the columns in the select list exactly.
Limitations -- CTAS cannot create:
1. an external table
2. a partitioned table
3. a bucketed table
Bucket insert:
Create a bucketed table:
create table mingxing(id int, name string, sex string, age int, department string)
clustered by(id) sorted by(age desc) into 4 buckets
row format delimited fields terminated by ',';
// LOAD cannot be used to put data directly into a bucketed table
Insert the data:
insert into table mingxing select id,name,sex,age,department from mingxing2
distribute by id sort by age desc;
Note: the distribute by / sort by clauses in the query must match the table's bucketing and sorting definition.
2. Exporting data
Single export to a local directory:
insert overwrite local directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Multiple export to local directories:
from mingxing
insert overwrite local directory '/root/outputdata1' select id, name
insert overwrite local directory '/root/outputdata2' select id, name, age;
Short-form path export to HDFS:
insert overwrite directory '/root/outputdata' select id,name,sex,age,department from mingxing;
Fully qualified path export to HDFS:
insert overwrite directory 'hdfs://bigdata02:9000/root/outputdata1' select id,name,sex,age,department from mingxing;
local: export to a local directory
overwrite: overwrite the target directory
3. Querying data
The basic SELECT syntax in Hive closely follows standard SQL: WHERE, DISTINCT, GROUP BY, ORDER BY, HAVING, LIMIT, and subqueries are all supported.
order by: global ordering (a single total sort)
sort by: local ordering within each reducer
Usually combined with bucketing:
distribute by id sort by age desc;
distribute by: only distributes rows across reducers (buckets)
When using distribute by, set the number of reduce tasks explicitly.
cluster by: distributes and sorts on the same column
cluster by age = distribute by age sort by age;
cluster by cannot be combined with sort by.
where, group by, distinct, having, case...when, ...
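Putting the pieces together, a sketch of a distribute by / sort by query with the reducer count set explicitly (the value 3 is arbitrary):

```sql
-- Run with 3 reducers; rows with the same department go to the same
-- reducer, and each reducer's output is sorted by age descending:
set mapreduce.job.reduces = 3;

select id, name, age, department
from student
distribute by department
sort by age desc;
```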
III. Hive Views
Like relational databases, Hive provides views; note, however, that Hive views differ from relational views in important ways:
1. Views are logical only; there are no materialized views.
2. A view can only be queried; Load/Insert/Update/Delete against it are not allowed.
3. Creating a view stores only a piece of metadata; the view's underlying query runs only when the view itself is queried.
1. Create a view
create view view_name as select * from carss;
create view carss_view as select * from carss limit 500;
2. View views
show tables; // lists views alongside tables
desc view_name; // show a specific view's information
desc carss_view;
3. Drop a view
drop view view_name;
drop view if exists carss_view;
4. Use a view
create view sogou_view as select * from sogou_table where rank > 3;
select count(distinct uid) from sogou_view;