1从橘、Hive操作演示
1.1.1?創(chuàng)建表并加載數(shù)據(jù)
第一步啟動(dòng)HDFS锁摔、YARN和Hive绸贡,啟動(dòng)完畢后創(chuàng)建Hive數(shù)據(jù)庫(kù)
hive>create database hive;
hive>show databases;
hive>use hive;
第二步創(chuàng)建內(nèi)部表
由于Hive使用了類似SQL的語(yǔ)法寨腔,所以創(chuàng)建內(nèi)部表的語(yǔ)句相對(duì)SQL只增加了行和字段分隔符辛友。
hive>CREATE
TABLE tab2(DT STRING,WEBSESSION STRING,WORD STRING,S_SEQ INT,C_SEQ
INT,WEBSITE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES
TERMINATED BY '\n' ;
第三步加載數(shù)據(jù)
數(shù)據(jù)文件可以從HDFS或者本地操作系統(tǒng)加載到表中斩例,如果加載HDFS文件使用LOAD DATA INPATH宠能,而加載本地操作系統(tǒng)文件使用LOAD DATA LOCAL INPATH命令。HIVE表保存的默認(rèn)路徑在${HIVE_HOME}/conf/hive-site.xml配置文件的hive.metastore.warehouse.dir屬性指定,當(dāng)創(chuàng)建表時(shí)會(huì)在hive.metastore.warehouse.dir指向的目錄下以表名創(chuàng)建一個(gè)文件夾很钓,在本演示中表默認(rèn)指向的是/user/hive/warehouse。
數(shù)據(jù)文件在本地操作系統(tǒng)將復(fù)制到表對(duì)應(yīng)的目錄中翅睛,而數(shù)據(jù)文件在HDFS中捅僵,數(shù)據(jù)文件將移動(dòng)到表對(duì)應(yīng)的目錄中,原來(lái)的路徑將不存在該文件掖看。
hive>LOAD DATA LOCAL INPATH '/home/hadoop/upload/sogou/tab2.txt' INTO TABLE tab2;
在/user/hive/warehouse/hive.db/tab2目錄下匣距,可以看到tab2.txt數(shù)據(jù)文件:
可以用count關(guān)鍵字查詢tab2.txt數(shù)據(jù)行數(shù)诈铛,查詢時(shí)會(huì)啟動(dòng)MapReduce進(jìn)行計(jì)算,Map的個(gè)數(shù)一般和數(shù)據(jù)分片個(gè)數(shù)對(duì)應(yīng)墨礁,在本查詢中有2個(gè)Map任務(wù)(數(shù)據(jù)文件有2個(gè)Block)幢竹,1個(gè)Reduce任務(wù)。
hive>select count(*) from tab2;
可以用like關(guān)鍵字進(jìn)行模糊查詢恩静,Map的個(gè)數(shù)一般和數(shù)據(jù)分片個(gè)數(shù)對(duì)應(yīng)焕毫。
hive>select count(*) from tab2 where WEBSITE like '%laopo%';
1.1.4?查詢結(jié)果排名第1,點(diǎn)擊次序排第2驶乾,其中URL包含google的數(shù)據(jù)
hive>select count(*) from tab2 where S_SEQ=1 and C_SEQ=2 and WEBSITE like '%google%';
1.2.1?創(chuàng)建表關(guān)聯(lián)數(shù)據(jù)
第一步在HDFS創(chuàng)建外部表存放數(shù)據(jù)目錄
$hadoop fs -mkdir -p /class5/tab1
$hadoop fs -ls /class5
第二步在Hive創(chuàng)建外部表邑飒,指定表存放目錄
hive>CREATE
EXTERNAL TABLE tab1(DT STRING,WEBSESSION STRING,WORD STRING,S_SEQ
INT,C_SEQ INT,WEBSITE STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/class5/tab1';
hive>show tables;
觀察一下創(chuàng)建表和外部表的區(qū)別,會(huì)發(fā)現(xiàn)創(chuàng)建外部表多了EXTERNAL關(guān)鍵字以及指定了表對(duì)應(yīng)存放文件夾LOCATION?‘/class5/tab1’
【注】在刪除表的時(shí)候级乐,內(nèi)部表將刪除表的元數(shù)據(jù)和數(shù)據(jù)文件疙咸;而刪除外部表的時(shí)候,僅僅刪除外部表的元數(shù)據(jù)风科,不刪除數(shù)據(jù)文件
第三步加載數(shù)據(jù)文件到外部表對(duì)應(yīng)的目錄中
創(chuàng)建Hive外部表關(guān)聯(lián)數(shù)據(jù)文件有兩種方式撒轮,一種是把外部表數(shù)據(jù)位置直接關(guān)聯(lián)到數(shù)據(jù)文件所在目錄上,這種方式適合數(shù)據(jù)文件已經(jīng)在HDFS存在贼穆,另外一種方式是創(chuàng)建表時(shí)指定外部表數(shù)據(jù)目錄题山,隨后把數(shù)據(jù)加載到該目錄下。以下將以第二種方式進(jìn)行演示:
$hadoop fs -copyFromLocal /home/hadoop/upload/sogou/tab1.txt? /class5/tab1/
$hadoop fs -ls /class5/tab1
$hadoop fs -tail /class5/tab1/tab1.txt
hive>select count(*) from tab1;
hive>select * from tab1?limit 10;
可以看出Hive會(huì)根據(jù)查詢不同任務(wù)決定是否生成Job故痊,獲取前10條并沒(méi)有生成Job顶瞳,而是得到數(shù)據(jù)后直接進(jìn)行顯示。
1.2.4?查詢結(jié)果排名第1愕秫,點(diǎn)擊次序排第2的數(shù)據(jù)
hive>select count(*) from tab1?where S_SEQ=1 and C_SEQ=2;
按照session號(hào)進(jìn)行歸組慨菱,并按照查詢次數(shù)進(jìn)行排序,最終顯示查詢次數(shù)最多的前10條戴甩。
hive>select WEBSESSION,count(WEBSESSION) as cw from tab1?group by WEBSESSION order by cw desc limit 10;
交易數(shù)據(jù)存放在該系列配套資源的/class5/saledata目錄下,在/home/hadoop/upload創(chuàng)建class5目錄用于存放本周測(cè)試數(shù)據(jù)
$cd /home/hadoop/upload
$mkdir class5
創(chuàng)建新文件夾后使用等恐,使用SSH Secure File Transfer工具上傳到/home/hadoop/upload/class5目錄下洲劣,如下圖所示:
2.1.2?在Hive創(chuàng)建數(shù)據(jù)庫(kù)和表
啟動(dòng)Hadoop集群,進(jìn)入Hive命令行操作界面课蔬,使用如下命令創(chuàng)建三張數(shù)據(jù)表:
ltbDate定義了日期的分類囱稽,將每天分別賦予所屬的月份、星期二跋、季度等屬性战惊,字段分別為日期、年月扎即、年吞获、月况凉、日、周幾各拷、第幾周刁绒、季度、旬烤黍、半月知市;
ltbStock定義了訂單表頭,字段分別為訂單號(hào)速蕊、交易位置嫂丙、交易日期;
ltbStockDetail文件定義了訂單明細(xì)规哲,該表和tbStock以交易號(hào)進(jìn)行關(guān)聯(lián)跟啤,字段分別為訂單號(hào)、行號(hào)唉锌、貨品隅肥、數(shù)量、金額:
hive>use hive;
hive>CREATE TABLE?tbDate(dateID
string,theyearmonth string,theyear string,themonth string,thedate
string,theweek string,theweeks
string,thequot string,thetenday string,thehalfmonth string) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;
hive>CREATE TABLE?tbStock(ordernumber STRING,locationid string,dateID string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' ;
hive>CREATE TABLE?tbStockDetail(ordernumber STRING,rownum int,itemid string,qty int,price int ,amount int) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;
從本地操作系統(tǒng)分別加載日期糊秆、交易信息和交易詳細(xì)信息表數(shù)據(jù)
hive>use hive;
hive>LOAD DATA LOCAL INPATH '/home/hadoop/upload/class5/saledata/tbDate.txt' INTO TABLE tbDate;
hive>LOAD DATA LOCAL INPATH '/home/hadoop/upload/class5/saledata/tbStock.txt' INTO TABLE tbStock;
hive>LOAD DATA LOCAL INPATH '/home/hadoop/upload/class5/saledata/tbStockDetail.txt' INTO TABLE tbStockDetail;
查看HDFS中相關(guān)SALEDATA數(shù)據(jù)庫(kù)中增加了三個(gè)文件夾武福,分別對(duì)應(yīng)三個(gè)表:
要計(jì)算所有訂單每年的總金額,首先需要獲取所有訂單的訂單號(hào)痘番、訂單日期和訂單金信息,然后把這些信息和日期表進(jìn)行關(guān)聯(lián)平痰,獲取年份信息汞舱,最后根據(jù)這四個(gè)列按年份歸組統(tǒng)計(jì)獲取所有訂單每年的總金額。
hive>use hive;
hive>select
c.theyear, sum(b.amount) from tbStock a,tbStockDetail b,tbDate c where
a.ordernumber=b.ordernumber and a.dateid=c.dateid group by c.theyear
order by c.theyear;
該算法分為兩步:
1.按照日期和訂單號(hào)進(jìn)行歸組計(jì)算宗雇,獲取所有訂單每天的銷售數(shù)據(jù)昂芜;
2.把第一步獲取的數(shù)據(jù)和日期表進(jìn)行關(guān)聯(lián)獲取的年份信息,然后按照年份進(jìn)行歸組赔蒲,使用Max函數(shù)泌神,獲取所有訂單每年最大金額訂單的銷售額。
//所有訂單每年最大金額訂單的銷售額
//第一步:
hive>use hive;
hive>select
a.dateid,a.ordernumber,sum(b.amount) as sumofamount from tbStock
a,tbStockDetail b where a.ordernumber=b.ordernumber group by
a.dateid,a.ordernumber;
//第二步:
hive>select
c.theyear,max(d.sumofamount) from tbDate c,(select
a.dateid,a.ordernumber,sum(b.amount) as sumofamount from tbStock
a,tbStockDetail b where a.ordernumber=b.ordernumber group by
a.dateid,a.ordernumber)
d??where c.dateid=d.dateid group by c.theyear sort by c.theyear;
運(yùn)行過(guò)程中創(chuàng)建兩個(gè)Job舞虱,分別為job_1437659442092_0004和job_1437659442092_0005欢际,運(yùn)行過(guò)程如下:
在YARN的資源管理器界面中可以看到如下界面:
其中job_1437659442092_0005運(yùn)行的具體情況如下:
整個(gè)計(jì)算過(guò)程使用了285秒,結(jié)果如下:
//所有訂單中季度銷售額前10位
hive>use hive;
hive>select
c.theyear,c.thequot,sum(b.amount) as sumofamount from tbStock
a,tbStockDetail b,tbDate c where a.ordernumber=b.ordernumber and
a.dateid=c.dateid group by c.theyear,c.thequot order by sumofamount
desc limit 10;
2008????1????5252819
2007????4????4613093
2007????1????4446088
2006????1????3916638
2008????2????3886470
2007????3????3870558
2007????2????3782235
2006????4????3691314
2005????1????3592007
2005????3????3304243
//列出銷售金額在100000以上的單據(jù)
hive>use hive;
hive>select
a.ordernumber,sum(b.amount) as sumofamount from tbStock a,tbStockDetail b
where a.ordernumber=b.ordernumber group by a.ordernumber having
sumofamount>100000;
//所有訂單中每年最暢銷貨品
第一步:
hive>use hive;
hive>select
c.theyear,b.itemid,sum(b.amount) as sumofamount from tbStock
a,tbStockDetail b,tbDate c where a.ordernumber=b.ordernumber and
a.dateid=c.dateid group by c.theyear,b.itemid;
第二步:
hive>select
d.theyear,max(d.sumofamount) as maxofamount from (select
c.theyear,b.itemid,sum(b.amount) as sumofamount from tbStock
a,tbStockDetail b,tbDate c where a.ordernumber=b.ordernumber and
a.dateid=c.dateid
group by c.theyear,b.itemid) d group by d.theyear ;
第三步:
hive>select distinct??e.theyear,e.itemid,f.maxofamount from (select c.theyear,b.itemid,sum(b.amount) as sumofamount from tbStock a,tbStockDetail b,tbDate c where a.ordernumber=b.ordernumber and a.dateid=c.dateid group by c.theyear,b.itemid) e , (select d.theyear,max(d.sumofamount) as maxofamount from (select c.theyear,b.itemid,sum(b.amount) as sumofamount from tbStock a,tbStockDetail b,tbDate c where a.ordernumber=b.ordernumber and a.dateid=c.dateid group by c.theyear,b.itemid) d group by d.theyear) f where e.theyear=f.theyear and e.sumofamount=f.maxofamount order by e.theyear;
2004????JY424420810101???53374
2005????24124118880102???56569
2006????JY425468460101???113684
2007????JY425468460101???70226
2008????E2628204040101???97981
2009????YL327439080102???30029
2010????SQ429425090101??4494