之前有寫過pandas和numpy的一些淺顯的使用,沒有系統(tǒng)的學(xué)習(xí)所有的API,基本上還是在夠用的基礎(chǔ)上增炭,以后如果需要更復(fù)雜的數(shù)據(jù)需求再進(jìn)一步研究吧烟阐。最近轉(zhuǎn)做后臺搬俊,之前在寫numpy使用的時候立了個flag說要做一些有產(chǎn)出的機器學(xué)習(xí)項目。現(xiàn)在確實做了文章的推薦模型的訓(xùn)練蜒茄,從Hive取數(shù)到模型輸出唉擂,不過最終沒能部署上線,是有些遺憾(公司部署環(huán)境對python不友好檀葛,可能需要spark模型上線)玩祟,可能之后上線之后再把上線的流程補上吧。項目告一段落最后也想說總結(jié)一下(其實是反過來在看的時候有些東西都忘了~~記錄一下)屿聋,最近跟數(shù)據(jù)打交道的比較多空扎,也結(jié)合之前訓(xùn)練數(shù)據(jù)提取的經(jīng)驗藏鹊,會寫一系列的文章。首先第一篇就先寫Hive吧转锈。知識不夠全面盘寡,不過我平時也夠用
首先第一步就是補上網(wǎng)上大神的參考文章
基本操作
Hive的基本操作與MYSQL類似,除了部分函數(shù)不支持以外黑忱,可以按照寫sql的思路來寫Hive sql宴抚。
新建
新建表:
CREATE TABLE `test`(
`id` bigint COMMENT '這是字段備注',
`value` string COMMENT '這是字段備注')
COMMENT '表名稱備注'
PARTITIONED BY (
`dt` string)
上面是一個創(chuàng)建表的語句,創(chuàng)建的是一個含有id和value字段的hive表甫煞,有dt分區(qū)菇曲。
分區(qū)就是存儲時候的文件夾名稱。因為hive表也是在HDFS之上的應(yīng)用抚吠,所以也是存儲到硬盤的常潮。
添加分區(qū):
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1'; //示例
ALTER TABLE table_name ADD IF NOT EXISTS PARTITION (dt='20130101') LOCATION '/user/hadoop/warehouse/table_name/dt=20130101'; //一次添加一個分區(qū)
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808' PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809'; //一次添加多個分區(qū)
添加新列:
ALTER TABLE table_name ADD COLUMNS (col_name STRING); //在所有存在的列后面,但是在分區(qū)列之前添加一列
刪除
刪除表:
DROP TABLE IF NOT EXISTS tablename;
刪除分區(qū):
ALTER TABLE login DROP IF EXISTS PARTITION (dt='2008-08-08');
ALTER TABLE page_view DROP IF EXISTS PARTITION (dt='2008-08-08', country='us');
修改
表的重命名:
ALTER TABLE table_name RENAME TO new_table_name
修改列:
CREATE TABLE test_change (a int, b int, c int);
// will change column a's name to a1
ALTER TABLE test_change CHANGE a a1 INT;
// will change column a's name to a1, a's data type to string, and put it after column b. The new table's structure is: b int, a1 string, c int
ALTER TABLE test_change CHANGE a a1 STRING AFTER b;
// will change column b's name to b1, and put it as the first column. The new table's structure is: b1 int, a string, c int
ALTER TABLE test_change CHANGE b b1 INT FIRST;
修改表屬性:
alter table table_name set TBLPROPERTIES ('EXTERNAL'='TRUE'); //內(nèi)部表轉(zhuǎn)外部表
alter table table_name set TBLPROPERTIES ('EXTERNAL'='FALSE'); //外部表轉(zhuǎn)內(nèi)部表
JOIN
select * from A join B on A.id = B.id; //取交集
select * from A left join/left outer join B on A.id = B.id; // A表為基準(zhǔn)表
select * from A right join/right outer join B on A.id = B.id; // B表為基準(zhǔn)表
select * from A full outer join B on A.id = B.id;//取并集
select * from A left semi join B on A.id = B.id; //可以用來取差集
窗口分析函數(shù)
統(tǒng)計函數(shù)
over(partition by...order by)窗口函數(shù)可以統(tǒng)計個數(shù)楷力,統(tǒng)計最大值喊式,最小值等信息。partition by語法可以看成是局部的group by萧朝,order by就是排序岔留。使用方法與order by 一樣。
另外隨機取數(shù)的時候可以使用order by rand()
函數(shù) | 語法 | 備注 |
---|---|---|
sum | sum() over(partition by .. order by) | 和 |
count | count() over(partition by .. order by) | 個數(shù) |
min | min() over(partition by .. order by) | 最小值 |
max | max() over(partition by .. order by) | 最大值 |
avg | avg() over(partition by .. order by) | 平均值 |
以上的函數(shù)也可以結(jié)合group by函數(shù)來計算检柬,但是沒有上述的靈活献联。在使用這些統(tǒng)計函數(shù)的時候有些小的trick,就是靈活的結(jié)合CASE WHEN函數(shù)何址,比如我們要統(tǒng)計前三個月的pv
SUM(CASE WHEN time > 3month TEHN 1 ELSE NULL) AS 3monthPV
SUM示例
準(zhǔn)備數(shù)據(jù):
cokieid | createtime | pv |
---|---|---|
cookie1 | 2015-04-10 | 1 |
cookie1 | 2015-04-11 | 5 |
cookie1 | 2015-04-12 | 7 |
cookie1 | 2015-04-13 | 3 |
cookie1 | 2015-04-14 | 2 |
cookie1 | 2015-04-15 | 4 |
cookie1 | 2015-04-16 | 4 |
SELECT cookieid,
createtime,
pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默認(rèn)為從起點到當(dāng)前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --從起點到當(dāng)前行里逆,結(jié)果同pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分組內(nèi)所有行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --當(dāng)前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --當(dāng)前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---當(dāng)前行+往后所有行
FROM lxw1234;
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------------------------------------
cookie1 2015-04-10 1 1 1 26 1 6 26
cookie1 2015-04-11 5 6 6 26 6 13 25
cookie1 2015-04-12 7 13 13 26 13 16 20
cookie1 2015-04-13 3 16 16 26 16 18 13
cookie1 2015-04-14 2 18 18 26 17 21 10
cookie1 2015-04-15 4 22 22 26 16 20 8
cookie1 2015-04-16 4 26 26 26 13 13 4
- pv1: 分組內(nèi)從起點到當(dāng)前行的pv累積,如用爪,11號的pv1=10號的pv+11號的pv, 12號=10號+11號+12號
- pv2: 同pv1
- pv3: 分組內(nèi)(cookie1)所有的pv累加
- pv4: 分組內(nèi)當(dāng)前行+往前3行原押,如,11號=10號+11號偎血, 12號=10號+11號+12號诸衔, 13號=10號+11號+12號+13號, 14號=11號+12號+13號+14號
- pv5: 分組內(nèi)當(dāng)前行+往前3行+往后1行颇玷,如署隘,14號=11號+12號+13號+14號+15號=5+7+3+2+4=21
- pv6: 分組內(nèi)當(dāng)前行+往后所有行,如磁餐,13號=13號+14號+15號+16號=3+2+4+4=13,14號=14號+15號+16號=2+4+4=10
- 如果不指定ROWS BETWEEN,默認(rèn)為從起點到當(dāng)前行;
- 如果不指定ORDER BY,則將分組內(nèi)所有值累加;
- 關(guān)鍵是理解ROWS BETWEEN含義,也叫做WINDOW子句:
- PRECEDING:往前
- FOLLOWING:往后
- CURRENT ROW:當(dāng)前行
- UNBOUNDED:起點诊霹,UNBOUNDED PRECEDING 表示從前面的起點羞延, UNBOUNDED FOLLOWING:表示到后面的終點
序列函數(shù)
序列函數(shù)主要是給數(shù)據(jù)排序
函數(shù) | 語法 | 備注 |
---|---|---|
NTILE | NTILE() over(partition by .. order by) | 平均分片,如果不能平均分脾还,默認(rèn)增加第一個分片的值 |
ROW_NUMBER | ROW_NUMBER() over(partition by .. order by) | 排列序號伴箩,依次排序 |
RANK | RANK() over(partition by .. order by) | 排序,相等留空位 |
DENSE_RANK | DENSE_RANK() over(partition by .. order by) | 排序鄙漏,相等不留空位 |
NTILE
SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, --分組內(nèi)將數(shù)據(jù)分成2片
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, --分組內(nèi)將數(shù)據(jù)分成3片
NTILE(4) OVER(ORDER BY createtime) AS rn3 --將所有數(shù)據(jù)分成4片
FROM lxw1234
ORDER BY cookieid,createtime;
cookieid day pv rn1 rn2 rn3
-------------------------------------------------
cookie1 2015-04-10 1 1 1 1
cookie1 2015-04-11 5 1 1 1
cookie1 2015-04-12 7 1 1 2
cookie1 2015-04-13 3 1 2 2
cookie1 2015-04-14 2 2 2 3
cookie1 2015-04-15 4 2 3 3
cookie1 2015-04-16 4 2 3 4
cookie2 2015-04-10 2 1 1 1
cookie2 2015-04-11 3 1 1 1
cookie2 2015-04-12 5 1 1 2
cookie2 2015-04-13 6 1 2 2
cookie2 2015-04-14 3 2 2 3
cookie2 2015-04-15 9 2 3 4
cookie2 2015-04-16 7 2 3 4
排序
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM lxw1234
WHERE cookieid = 'cookie1';
cookieid day pv rn1 rn2 rn3
--------------------------------------------------
cookie1 2015-04-12 7 1 1 1
cookie1 2015-04-11 5 2 2 2
cookie1 2015-04-15 4 3 3 3
cookie1 2015-04-16 4 3 3 4
cookie1 2015-04-13 3 5 4 5
cookie1 2015-04-14 2 6 5 6
cookie1 2015-04-10 1 7 6 7
rn1: 15號和16號并列第3, 13號排第5
rn2: 15號和16號并列第3, 13號排第4
rn3: 如果相等嗤谚,則按記錄值排序,生成唯一的次序怔蚌,如果所有記錄值都相等巩步,或許會隨機排吧。
條件函數(shù)
- IF函數(shù):IF
- 非空查找函數(shù):COALESCE
- 條件判斷函數(shù):CASE
函數(shù)名 | 語法 | 備注 |
---|---|---|
IF | if(boolvalue condition, T trueValue, T falseValueorNULL) | condition=True 返回trueValue |
COALESCE | COALESCE(T v1, T v2) | 返回參數(shù)中的第一個非空值桦踊;如果所有值為NULL則返回NULL |
CASE | CASE a WHEN b THEN c WHEN d THEN e ELSE f NED | 如果a=b返回c椅野;如果d=e返回f |
集合操作
test1
a | b | c |
---|---|---|
1 | 2 | 1 |
test2
a | c |
---|---|
22 | 55 |
交集(exist / in)=> LEFT SEMI JOIN
過濾掉test1的數(shù)據(jù)
select test_1.id, test_1.num from test_1 left semi join test_2 on (test_1.id = test_2.id);
a | c |
---|---|
1 | 1 |
差集(exist / in)=> LEFT OUTER JOIN .. is NULL
A差B的情況
select test_1.id, test_1.num from test_1 left outer join test_2 on (test_2.id = test_2.id) where test_2.num is null;
b |
---|
2 |