hive

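Hive is a SQL parsing engine: it translates SQL statements into MapReduce (MR) jobs and runs them on Hadoop, so that jobs can be developed quickly.

MySQL stores data, whereas Hive does not. A Hive table is purely logical: it is only a table definition, i.e. the table's metadata, while the actual data sits on Hadoop's disks (HDFS).

Hive workloads are read-heavy and write-light. Ordinary (non-transactional) Hive tables do not support rewriting or deleting individual rows; to delete data you have to drop the whole table.

When the text to be imported into Hive contains '\n', Hive breaks the record at every '\n', so one record is spread over several rows.
What can be done about it?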
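One common workaround, offered here as an assumption and not as the author's answer, is to clean each field before the file is loaded, so that '\n' only ever appears as the record terminator. The helper names `sanitize_field` and `to_hive_line`, and the choice of a space as the replacement, are hypothetical.

```python
def sanitize_field(value: str, replacement: str = " ") -> str:
    """Replace embedded line breaks so that one record stays on one line."""
    return (
        value.replace("\r\n", replacement)
        .replace("\n", replacement)
        .replace("\r", replacement)
    )


def to_hive_line(fields: list[str], delimiter: str = "\t") -> str:
    """Join sanitized fields into a single delimited text line (hypothetical helper)."""
    return delimiter.join(sanitize_field(f) for f in fields)


if __name__ == "__main__":
    record = ["42", "first line\nsecond line"]  # made-up record
    print(to_hive_line(record))
```

The cleaned lines can then be written to a file and loaded as usual; the original line breaks are lost, which is acceptable only if they carry no meaning.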

MapReduce in Hive


select word, count(*)
from (
select explode(split(sentence,' ')) as word from article_1
) t
group by word
 
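Explanation:
select explode(split(sentence,' ')) as word from article: the map step
explode(): turns each element of an array into a row of its own (one row in, many rows out)
split(sentence,' '): splits the content of the sentence column on spaces and returns an array of words
as word: names the newly generated column word
t: the alias of the derived table, which is a temporary table (syntactically, from must be followed by a table)
select word, count(*)
from () t
group by word
--
group by word: aggregates on word; this is the reduce step
count(*): counts the rows in each group (a count, not a sum)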
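The map and reduce roles described above can be imitated in a few lines of Python. This is only a sketch of the idea, not how Hive is implemented; the function names and the sample rows are made up.

```python
from collections import defaultdict


def map_phase(sentences):
    """Mimic explode(split(sentence, ' ')): emit one (word, 1) pair per word."""
    for sentence in sentences:
        for word in sentence.split(" "):
            yield word, 1


def reduce_phase(pairs):
    """Mimic group by word + count(*): add up the 1s for each word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


if __name__ == "__main__":
    rows = ["the man of property", "the man"]  # made-up sample rows
    print(reduce_phase(map_phase(rows)))
```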

Test:
select explode(split(sentence,' ')) as word from article limit 30

select word, count(1) as cnt
from (
select explode(split(sentence,' ')) as word from article
) t
group by word

Hive architecture
Data storage:
Hive data is stored as files under a designated HDFS directory
A Hive statement is turned into a query plan, which MapReduce then executes
Statement translation:
Parser: generates the abstract syntax tree
Syntax analyzer: validates the query
Logical plan generator (including the optimizer): generates the operator tree
Query plan generator: converts the plan into map-reduce tasks
User interfaces:
CLI: starting it also starts a copy of Hive
JDBC: a Hive client; users connect to Hive Server
WUI: access Hive through a browser

A Hive table is essentially a directory on Hadoop (HDFS).

Ways to create a table in Hive:
Internal (managed) table: create table ...
External table: create external table ... location 'hdfs_path' (the path must be a directory)

When data is loaded into an external table, it is not moved into Hive's own warehouse directory; in other words, Hive does not manage an external table's data itself, unlike an internal table.
Dropping an internal table deletes both its metadata and its data; dropping an external table deletes only the metadata and leaves the data in place.

============================
Hands-on

List databases:
show databases;

List tables:
show tables;

Create the database user_base_1:
CREATE DATABASE IF NOT EXISTS user_base_1;

MapReduce in Hive:
Code:
select word, count(1) as cnt
from (
select explode(split(sentence,' ')) as word from article
) t
group by word
order by cnt desc
limit 100

Notes:
1. order by is a global sort, so it can only run in a single reduce.
2. order by is a task of its own, so the code above launches two jobs: the first has a map and a reduce (the group by), the second does the sort in a single reduce.
3. The jobs are dependent: the second can only start after the first has finished.
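The dependency between the two jobs can be pictured as two functions, where the second consumes the complete output of the first. This is a sketch with made-up names and data, not Hive's actual execution code.

```python
def job1_count(words):
    """Job 1 (map + reduce): group by word and count."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return list(counts.items())


def job2_order_limit(word_counts, limit):
    """Job 2 (single reducer): global sort by cnt desc, then apply the limit."""
    ordered = sorted(word_counts, key=lambda wc: wc[1], reverse=True)
    return ordered[:limit]


if __name__ == "__main__":
    words = ["a", "b", "a", "c", "a", "b"]  # made-up input
    # Job 2 cannot start until job 1 has produced all (word, cnt) pairs.
    print(job2_order_limit(job1_count(words), limit=2))
```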

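SQL is cheap to write. Large companies usually have an internal web UI where you type SQL directly, with auto-completion, which is very convenient. Once you are used to Hive, there is no going back to writing MapReduce in Python.

SQL trains your data thinking and data-processing skills, and it needs regular practice.

Hive SQL is highly extensible: it supports user-defined functions in the form of UDF, UDAF and UDTF.

Hive architecture:
Think of it as compiling and running a C program.
Hive first compiles the statement and checks its syntax, checks that the metadata it refers to is valid, converts the Hive code into MapReduce tasks, runs those tasks on Hadoop, and finally produces the result data.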

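Partition
A Hive table is a directory, and a partition is a subdirectory of it. The benefit: partition by time or date, one partition per day, and each day's data goes into its own directory, so the data is split by date.
To query only yesterday's data, Hive only needs to read the directory for yesterday's date.
Bucket
With 10 buckets the data is split into 10 parts, so a 1/10 sample only needs one of them. Because the split happens during the shuffle, the parts may not be exactly equal in size.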

Creating a table only creates the metadata plus a directory named after the table on HDFS; the directory contains no data yet.
create table article(sentence string)
row format delimited fields terminated by '\n';

Load data from the local file system; this is roughly a hadoop fs -put of the local file into the table's directory under /hive/warehouse/badou.db
load data local inpath 'localpath' into table article;

View the data:
select * from article limit 3;

View the files in Hadoop:
 hadoop fs -ls /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/article_1

-rwxr-xr-x   3 root supergroup     632207 2019-03-15 22:27 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/article_1/The_Man_of_Property.txt

External table
create external table article_2(sentence string)
row format delimited fields terminated by '\n'
stored as textfile -- store as plain text
location '/data/ext';
No data for the new table appears under badou.db (because it is an external table)
The external data source is unchanged
drop table article_2;
-- The metadata is deleted, but the data under the HDFS path /data/ext still exists, much like removing a symbolic link

Creating a partitioned table
create table art_dt(sentence string)
partitioned by (dt string)
row format delimited fields terminated by '\n';

Insert data from one Hive table into a new (partitioned) table: take 100 rows from article and insert them into art_dt
insert overwrite table art_dt partition(dt='20190329')
select * from article limit 100;

The corresponding directory under the database in Hive's HDFS warehouse: badou.db/art_dt/dt=20190329

select * from art_dt  limit 10;
Analysis: this is a full lookup, i.e. it goes through all partitions of the table. For example, if there are only two partitions, it is equivalent to:
select * from art_dt  where dt between  '20190328' and '20190329' limit 10;
If the table has a very large number of partitions, the lookup becomes very slow.
If you know which partition the data is in, query that partition directly and it is far more efficient:
select * from art_dt  where dt = '20190329' limit 10;
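Why a filter on dt is cheap can be sketched as choosing directories by name before any file is read. The function `prune_partitions` and the directory list are hypothetical; the only fact relied on is that a partition directory is named `<column>=<value>`.

```python
def prune_partitions(partition_dirs, low, high):
    """Keep only the directories whose dt value falls in [low, high]."""
    selected = []
    for path in partition_dirs:
        # The last path component looks like "dt=20190329".
        key, _, value = path.rsplit("/", 1)[-1].partition("=")
        # yyyymmdd strings compare correctly as plain strings.
        if key == "dt" and low <= value <= high:
            selected.append(path)
    return selected


if __name__ == "__main__":
    dirs = [  # made-up listing of a partitioned table
        "badou.db/art_dt/dt=20190327",
        "badou.db/art_dt/dt=20190328",
        "badou.db/art_dt/dt=20190329",
    ]
    print(prune_partitions(dirs, "20190329", "20190329"))
```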

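How do partitions come about in practice, and what data are they used for?
Every day produces records of users browsing, clicking, favoriting and purchasing.
Store the data day by day, partitioned by day.
--
Distinguish by data source: app/m/pc
For example: logs/dt=20190329/type=app
is where the logs table keeps the app-side log data for the date 20190329:
logs/dt=20190329/type=app
logs/dt=20190329/type=m
logs/dt=20190329/type=pc
--
When the data volume is too large, besides splitting by day you can also split by the three client types.
A database table stores user attributes: age, gender, blog and so on.
New users sign up and existing users change their details every day; keeping a full copy in dt=20190328, dt=20190329 and every later day is far too redundant.
Solution:
Keep 7 overwritten snapshots: every day, overwrite that day's partition (for example dt=20190328) with the full user data accumulated up to that day.
Storing 7 partitions means 7 redundant copies, which protects against data loss (not loss from a machine dying, but loss from an operator mistake, which is a blame too heavy to carry). There is still redundancy, but only 7 copies: each day, delete the data from 7 days ago.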
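The daily clean-up in this scheme amounts to finding every dt value older than the newest 7 snapshots. A minimal sketch, assuming dt is formatted as yyyymmdd as in the examples above; the function name and the sample partitions are made up.

```python
from datetime import date, timedelta


def partitions_to_drop(existing, today, keep_days=7):
    """Return the dt values that fall outside the newest `keep_days` daily snapshots."""
    oldest_kept = today - timedelta(days=keep_days - 1)
    cutoff = oldest_kept.strftime("%Y%m%d")
    # yyyymmdd strings sort in date order, so a string comparison is enough.
    return sorted(dt for dt in existing if dt < cutoff)


if __name__ == "__main__":
    existing = [f"201903{day:02d}" for day in range(21, 30)]  # 20190321 .. 20190329
    print(partitions_to_drop(existing, date(2019, 3, 29)))
```

Each returned value would then be removed with a drop-partition statement on the snapshot table.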

Bucket

create table udata(
user_id string,
item_id string,
rating string,
`timestamp` string
) row format delimited fields terminated by '\t';
load data local inpath '/home/badou/Documents/u.data' into table udata;

# Print column names
set hive.cli.print.header=true;

bucket
A Hive table can be split into partitions, and a table or a partition can be divided further into buckets with 'CLUSTERED BY'; the data inside a bucket can be sorted with 'SORTED BY'.
sort by sorts within each reducer (and therefore within each bucket); order by is a global sort.
Use: data sampling

# Create the table
create table bucket_users (
user_id int,
item_id string,
rating string,
`timestamp` string
) clustered by(user_id) into 4 buckets;

# Insert data
# Because the data must be split into 4 buckets, bucketing has to be enforced; otherwise, depending on the data volume, only one reduce may be used
set hive.enforce.bucketing = true;

insert overwrite table bucket_users
select cast(user_id as int ) as user_id, item_id, rating, `timestamp` from udata;

# Check the result: the table now has 4 bucket files
$ hadoop fs -ls /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users

-rwxr-xr-x   3 root supergroup     466998 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000000_0
-rwxr-xr-x   3 root supergroup     497952 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000001_0
-rwxr-xr-x   3 root supergroup     522246 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000002_0
-rwxr-xr-x   3 root supergroup     491977 2019-03-29 09:06 /usr/local/src/apache-hive-1.2.2-bin/warehouse/badou.db/bucket_users/000003_0

# Sampling
The tablesample() clause
Format: tablesample(bucket x out of y)
For example, with 32 buckets, bucket 3 out of 16 means 32/16=2, so two buckets are read, starting from the 3rd: 3%16=3 and 19%16=3, so the result is the data of bucket 3 and bucket 19, which achieves the sampling.
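The arithmetic above can be checked with a small helper. It is a sketch that only covers the case described here, where y divides the number of buckets; `sampled_buckets` is a made-up name, and bucket numbers are 1-based as in the text.

```python
def sampled_buckets(total_buckets, x, y):
    """Return the 1-based bucket numbers read by tablesample(bucket x out of y)."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if total_buckets % y != 0:
        raise ValueError("this sketch only covers y dividing the bucket count")
    # Start at bucket x and take every y-th bucket after it.
    return list(range(x, total_buckets + 1, y))


if __name__ == "__main__":
    print(sampled_buckets(32, 3, 16))  # the example from the text
    print(sampled_buckets(4, 1, 4))    # the bucket_users table
```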
 
# View the data of any one bucket
select * from bucket_users tablesample(bucket 1 out of 4 on user_id);

# Count the rows in any one bucket
select count(*) from bucket_users tablesample(bucket 1 out of 4 on user_id);
Result: 23572 (out of 100000 rows in total)

select count(*) from bucket_users tablesample(bucket 2 out of 4 on user_id);
Result: 25159 (out of 100000 rows in total)

Bucketing goes through the partitioning (hash) step, so the split is not perfectly even.
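The unevenness is easy to reproduce. The sketch below assumes that, for an integer column, a row's bucket is simply the value modulo the number of buckets; that rule is an assumption about Hive's hashing, not something these notes state, and the user ids are made up. Since every row of one user lands in the same bucket, users with many rows make their bucket larger.

```python
from collections import Counter


def bucket_of(user_id, num_buckets=4):
    """Assumed rule for an int column: bucket index = user_id mod num_buckets."""
    return user_id % num_buckets


def rows_per_bucket(user_ids, num_buckets=4):
    """Count how many rows land in each bucket."""
    return Counter(bucket_of(u, num_buckets) for u in user_ids)


if __name__ == "__main__":
    # One entry per rating row; user 1 rated three items, user 4 rated two.
    ratings_user_ids = [1, 1, 1, 2, 3, 4, 4, 5]
    print(rows_per_bucket(ratings_user_ids))
```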

# Sample the data and insert it into a newly created table
$ create table tmp as select * from bucket_users tablesample(bucket 1 out of 4 on user_id);

hive join in MR


# Historical order-product behavior data
create table order_product_prior(
order_id string,
product_id string,
add_to_cart string,  -- added to cart
reordered string  -- repeat purchase
) row format delimited fields terminated by ',';
load data local inpath '/home/badou/Documents/data/order_data/order_products__prior.csv' into table order_product_prior;

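# Orders table
# order_number: the sequence number of the order
# eval_set: marks whether the row belongs to the training set or the test set
# order_dow: dow = day of week, the day the order was placed
# order_hour_of_day: the hour of the day the order was placed
# days_since_prior_order: how long since the previous order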
create table orders (
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string
) row format delimited fields terminated by ',';
load data local inpath '/home/badou/Documents/data/order_data/orders.csv' into table orders;

$ select * from order_product_prior limit 10;
order_id        product_id      add_to_cart_order       reordered
2       33120   1       1
2       28985   2       1
2       9327    3       0
2       45918   4       1
2       30035   5       0
2       17794   6       1
2       40141   7       1
2       1819    8       1
2       43668   9       0

$ select * from orders limit 10;
order_id        user_id eval_set        order_number    order_dow       order_hour_of_day       days_since_prior_order
2539329 1       prior   1       2       08
2398795 1       prior   2       3       07      15.0
473747  1       prior   3       3       12      21.0
2254736 1       prior   4       4       07      29.0
431534  1       prior   5       4       15      28.0
3367565 1       prior   6       2       07      19.0
550135  1       prior   7       1       09      20.0
3108588 1       prior   8       1       14      14.0
2295261 1       prior   9       1       16      0.0

Requirement: count how many products each user has bought
1. Number of products per order
select order_id, count(1) as prod_cnt 
from order_product_prior
group by order_id
order by prod_cnt desc
limit 30;

2. The user to product-count relationship
Bring each order's product count to its user with a join
table1: order_id  prod_cnt
table2: order_id user_id
table1 + table2 => order_id, user_id, prod_cnt

-- How many products (prod_cnt) this user bought in this order
select 
t2.order_id as order_id, 
t2.user_id as user_id,
t1.prod_cnt as prod_cnt 
from orders t2
join
(select order_id, count(1) as prod_cnt
from order_product_prior
group by order_id) t1
on t2.order_id=t1.order_id
limit 30;
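How a join like this runs as a MapReduce job can be sketched as a reduce-side join: the map side keys every record by order_id and tags it with its source, and the reduce side pairs up the records that share a key. This is an illustration with made-up rows and names; Hive may pick a different strategy (for instance a map-side join for a small table).

```python
from collections import defaultdict


def reduce_side_join(orders, order_counts):
    """Inner-join (order_id, user_id) with (order_id, prod_cnt) on order_id."""
    # Map phase: key each record by order_id and remember which table it came from.
    shuffled = defaultdict(lambda: {"orders": [], "counts": []})
    for order_id, user_id in orders:
        shuffled[order_id]["orders"].append(user_id)
    for order_id, prod_cnt in order_counts:
        shuffled[order_id]["counts"].append(prod_cnt)
    # Reduce phase: for each key, combine the records from both sides.
    joined = []
    for order_id in sorted(shuffled):
        group = shuffled[order_id]
        for user_id in group["orders"]:
            for prod_cnt in group["counts"]:
                joined.append((order_id, user_id, prod_cnt))
    return joined


if __name__ == "__main__":
    orders = [("o1", "u1"), ("o2", "u1"), ("o3", "u2")]  # made-up rows
    order_counts = [("o1", 9), ("o2", 3)]                 # made-up rows
    print(reduce_side_join(orders, order_counts))
```

Order o3 has no matching count, so an inner join drops it.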

3. Total number of products across all of this user's orders
select
user_id,
sum(prod_cnt) as sum_prod_cnt
from
(select
t2.order_id as order_id,
t2.user_id as user_id,
t1.prod_cnt as prod_cnt
from orders t2
join
(select order_id, count(1) as prod_cnt
from order_product_prior
group by order_id) t1
on t2.order_id=t1.order_id) t12
group by user_id
order by sum_prod_cnt desc
limit 30;

Shorthand:
select x from (select x from t1) join (select x from t2) on x
group by x
order by x
limit n

Production SQL can run to thousands of lines.
This is only the beginning.

Hive optimization

Merge small files to reduce the number of maps?

Increase the number of maps where appropriate?
set mapred.map.tasks = 10;

Map-side optimization is mostly about the number of files and comes up less often; the main work is on the reduce side, above all data skew.

Set the amount of data each reduce task processes
hive.exec.reducers.bytes.per.reducer
Adjust the number of reduces
Set the number of reducers
set mapred.reduce.tasks=10;
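When the reducer count is not fixed explicitly, the bytes-per-reducer setting drives an estimate of roughly input size divided by bytes per reducer, capped by a maximum. The sketch below shows only that arithmetic; the function is hypothetical, the cap is assumed to play the role of hive.exec.reducers.max, and the numbers are illustrative, not defaults.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Estimate: ceil(input / bytes_per_reducer), clamped to the range [1, max_reducers]."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    ten_gib = 10 * 1024**3
    per_reducer = 256 * 1024**2  # illustrative value, not necessarily your default
    print(estimate_reducers(ten_gib, per_reducer, max_reducers=1009))
```

Lowering the bytes-per-reducer value therefore raises the number of reducers for the same input.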
Cases that run in a single reduce
A global sort runs in a single reduce
Cartesian product:
select
t1.u1 as u1,
t2.u2 as u2
from
(select user_id as u1 from tmp) t1
join
(select user_id as u2 from tmp) t2;

A Cartesian product makes the data grow extremely fast and should be avoided wherever possible; it runs in a single reduce.
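The growth is simply the product of the two row counts, as this small sketch shows; `cross_join` is a made-up helper and the inputs are toy data.

```python
from itertools import product


def cross_join(left, right):
    """Pair every left row with every right row, as a join without an ON clause does."""
    return list(product(left, right))


if __name__ == "__main__":
    rows = cross_join(["u1", "u2", "u3"], ["u1", "u2", "u3", "u4"])
    print(len(rows))  # 3 * 4 pairs
```

If tmp holds the 23572 rows sampled earlier, the self-join above would produce 23572 × 23572 = 555,639,184 rows.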

