Hive開(kāi)窗函數(shù)

1. 介紹

普通聚合函數(shù)聚合的行集是組殴瘦,開(kāi)窗函數(shù)聚合的行集是窗口。
因此驱还,普通聚合函數(shù)每組(Group by)只有一個(gè)返回值涛碑,而開(kāi)窗函數(shù)則可以為窗口中的每行都返回一個(gè)值精堕。

1.1 基礎(chǔ)結(jié)構(gòu)

分析函數(shù)(如:sum(), max(), row_number()...) + 窗口子句(over函數(shù))

1.2 over函數(shù)

over(partition by [column_n] order by [column_m])先按照column_n分區(qū),相同的column_n分為一區(qū)锌唾,每個(gè)分區(qū)根據(jù)column_m排序(默認(rèn)升序)。

1.3 測(cè)試數(shù)據(jù)

建表并插入數(shù)據(jù)

-- 建表
create table student_scores(
  id int,
  studentId int,
  language int,
  math int,
  english int,
  classId string,
  departmentId string
);
-- 寫(xiě)入數(shù)據(jù)
insert into table student_scores values 
  (1,111,68,69,90,'class1','department1'),
  (2,112,73,80,96,'class1','department1'),
  (3,113,90,74,75,'class1','department1'),
  (4,114,89,94,93,'class1','department1'),
  (5,115,99,93,89,'class1','department1'),
  (6,121,96,74,79,'class2','department1'),
  (7,122,89,86,85,'class2','department1'),
  (8,123,70,78,61,'class2','department1'),
  (9,124,76,70,76,'class2','department1'),
  (10,211,89,93,60,'class1','department2'),
  (11,212,76,83,75,'class1','department2'),
  (12,213,71,94,90,'class1','department2'),
  (13,214,94,94,66,'class1','department2'),
  (14,215,84,82,73,'class1','department2'),
  (15,216,85,74,93,'class1','department2'),
  (16,221,77,99,61,'class2','department2'),
  (17,222,80,78,96,'class2','department2'),
  (18,223,79,74,96,'class2','department2'),
  (19,224,75,80,78,'class2','department2'),
  (20,225,82,85,63,'class2','department2');

1.4 窗口含義

select studentId,math,departmentId,classId,
-- 符合所有條件的行作為窗口夺英,這里符合department1的有9個(gè)
count(math) over() as count1,
-- 按照classId分組的所有行作為窗口
count(math) over(partition by classId) as count2,
-- 按照classId分組晌涕、按照math排序的所有行作為窗口
count(math) over(partition by classId order by math) as count3,
-- 按照classId分組、按照math排序痛悯,當(dāng)前行向前1行向后2行的行作為窗口
count(math) over(partition by classId order by math rows between 1 preceding and 2 following) as count4,
-- 按照classId分組余黎、按照math排序,當(dāng)前行向后所有行作為窗口
count(math) over(partition by classId order by math rows between current row and unbounded following) as count5
from student_scores where departmentId='department1';

結(jié)果:

studentId    math    departmentId    classId    count1    count2    count3    count4    count5
111          69      department1     class1     9         5         1         3         5
113          74      department1     class1     9         5         2         4         4
112          80      department1     class1     9         5         3         4         3
115          93      department1     class1     9         5         4         3         2
114          94      department1     class1     9         5         5         2         1
124          70      department1     class2     9         4         1         3         4
121          74      department1     class2     9         4         2         4         3
123          78      department1     class2     9         4         3         3         2
122          86      department1     class2     9         4         4         2         1

結(jié)果解析:

studentId=115载萌,
count1為departmentId=department1的行數(shù)為9惧财,
count2為分區(qū)class1中的行數(shù)5,
count3為分區(qū)class1中math<=93的行數(shù)4扭仁,
count4為分區(qū)class1中math值向前1行和向后2行的行數(shù)3垮衷,
count5為分區(qū)class1中當(dāng)前math值到class1分區(qū)結(jié)束的行數(shù)2。

上面可以看到:如果不指定ROWS BETWEEN乖坠,默認(rèn)統(tǒng)計(jì)窗口是從起點(diǎn)到當(dāng)前行
關(guān)鍵是ROWS BETWEEN搀突,也叫做window子句
PRECEDING:向前
FOLLOWING:向后
CURRENT ROW:當(dāng)前行
UNBOUNDED:無(wú)邊界熊泵,UNBOUNDED PRECEDING表示從最前面的起點(diǎn)開(kāi)始仰迁,UNBOUNDED FOLLOWING表示到最后面的終點(diǎn)

開(kāi)窗函數(shù)可以粗略地分為兩類:聚合開(kāi)窗函數(shù)和排序開(kāi)窗函數(shù)。

2. 聚合開(kāi)窗函數(shù)

2.1 sum函數(shù)

select studentId,math,departmentId,classId,
sum(math) over() as sum1,
sum(math) over(partition by classId) as sum2,
sum(math) over(partition by classId order by math) as sum3,
sum(math) over(partition by classId order by math rows between 1 preceding and 2 following) as sum4,
sum(math) over(partition by classId order by math rows between current row and unbounded following) as sum5
from student_scores where departmentId='department1';

-- 結(jié)果解析:類似count()函數(shù)

min()顽分,max()徐许,avg()都與count()類似

2.2 first_value開(kāi)窗函數(shù)

作用:返回分區(qū)中的第一個(gè)值

select studentId,math,departmentId,classId,
first_value(math) over(partition by classId order by math) as first_value1,
first_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as first_value2
from student_scores where departmentId='department1';

結(jié)果:

studentid    math    departmentid    classid    first_value1    first_value2
111          69      department1     class1     69              69
113          74      department1     class1     69              69
112          80      department1     class1     69              74
115          93      department1     class1     69              80
114          94      department1     class1     69              93
124          70      department1     class2     70              70
121          74      department1     class2     70              70
123          78      department1     class2     70              74
122          86      department1     class2     70              78

結(jié)果解析:

studentId=122,
first_value1為分區(qū)class2中按照math排序的第一個(gè)值70卒蘸,
first_value2為分區(qū)class2中按照math排序后當(dāng)前行向前1行向后2行區(qū)間的第一個(gè)值78雌隅。

2.3 last_vlaue開(kāi)窗函數(shù)

作用:返回分區(qū)最后一個(gè)值

2.4 lag開(kāi)窗函數(shù)

作用:LAG(col, n, DEFAULT)用于統(tǒng)計(jì)窗口內(nèi)向上第n行的值
col:列名
n:向上n行,[可選,默認(rèn)為1]
DEFAULT:當(dāng)向上n行為NULL時(shí)澄步,取默認(rèn)值冰蘑;如果不指定,則為NULL

select studentId,math,departmentId,classId,
lag(math,2,60) over(partition by classId order by math) as lag1,
lag(math,2) over(partition by classId order by math) as lag2
from student_scores where departmentId='department1';

結(jié)果:

studentid    math    departmentid    classid    lag1    lag2
111          69      department1     class1     60      NULL
113          74      department1     class1     60      NULL
112          80      department1     class1     69      69
115          93      department1     class1     74      74
114          94      department1     class1     80      80
124          70      department1     class2     60      NULL
121          74      department1     class2     60      NULL
123          78      department1     class2     70      70
122          86      department1     class2     74      74

結(jié)果解析:

studentId=113村缸,
lag1為分區(qū)class1按照math排序后當(dāng)前行向上2行的值NULL祠肥,但是設(shè)置了DEFUALT,所以為60梯皿,
lag2因?yàn)闆](méi)有設(shè)置DEFAULT仇箱,所以為NULL。

2.5 lead開(kāi)窗函數(shù)

作用:LEAD(col, n, DEFAULT)LAG相反东羹,用于統(tǒng)計(jì)窗口內(nèi)向下n行的值
col:列名
n:向下n行剂桥,[可選,默認(rèn)為1]
DEFAULT:當(dāng)向下n行為NULL時(shí)属提,取默認(rèn)值权逗;如果不指定,則為NULL

2.6 cume_dist開(kāi)窗函數(shù)

作用:計(jì)算某個(gè)窗口或分區(qū)中某個(gè)值的累積分布冤议。假定升序排序斟薇,則使用以下公式確定累積分布:(小于等于當(dāng)前值的行數(shù)) / (分區(qū)內(nèi)總行數(shù))

select studentId,math,departmentId,classId,
cume_dist() over(order by math) as cume_dist1,
cume_dist() over(order by math desc) as cume_dist2,
cume_dist() over(partition by classId order by math) as cume_dist3
from student_scores where departmentId='department1';

結(jié)果:

studentid    math    departmentid    classid    cume_dist1            cume_dist2            cume_dist3
111          69      department1     class1     0.1111111111111111    1.0                   0.2
113          74      department1     class1     0.4444444444444444    0.7777777777777778    0.4
112          80      department1     class1     0.6666666666666666    0.4444444444444444    0.6
115          93      department1     class1     0.8888888888888888    0.2222222222222222    0.8
114          94      department1     class1     1.0                   0.1111111111111111    1.0
124          70      department1     class2     0.2222222222222222    0.8888888888888888    0.25
121          74      department1     class2     0.4444444444444444    0.7777777777777778    0.5
123          78      department1     class2     0.5555555555555556    0.5555555555555556    0.75
122          86      department1     class2     0.7777777777777778    0.3333333333333333    1.0

結(jié)果解析:

studentId=115,
cume_dist1=小于等于93的行數(shù)8/總行數(shù)9=0.8888888888888888
cume_dist2=大于等于93的行數(shù)2/總行數(shù)9=0.2222222222222222
cume_dist3=class1分區(qū)內(nèi)小于等于93的行數(shù)4/總行數(shù)5=0.8

3. 排序開(kāi)窗函數(shù)

3.1 row_number開(kāi)窗函數(shù)

作用:row_number() over([partition by col1] [order by col2])開(kāi)窗函數(shù)是基于over子句中order by列的一個(gè)排名恕酸。在窗口或分區(qū)內(nèi)從1開(kāi)始排序堪滨,即使遇到col2相等時(shí),名次依舊增加蕊温。例如:有兩條記錄相等袱箱,但一個(gè)是第一,一個(gè)是第二义矛。

select id,studentId,language,math,english,classId,departmentId,
row_number() over (order by math) as row_number1
from student_scores;

結(jié)果:

id    studentid    language    math    english    classid    departmentid    row_number1
1     111          68          69      90         class1     department1     1
9     124          76          70      76         class2     department1     2
18    223          79          74      96         class2     department2     3
15    216          85          74      93         class1     department2     4
3     113          90          74      75         class1     department1     5
6     121          96          74      79         class2     department1     6
8     123          70          78      61         class2     department1     7
17    222          80          78      96         class2     department2     8
19    224          75          80      78         class2     department2     9
2     112          73          80      96         class1     department1     10
14    215          84          82      73         class1     department2     11
11    212          76          83      75         class1     department2     12
20    225          82          85      63         class2     department2     13
7     122          89          86      85         class2     department1     14
5     115          99          93      89         class1     department1     15
10    211          89          93      60         class1     department2     16
12    213          71          94      90         class1     department2     17
4     114          89          94      93         class1     department1     18
13    214          94          94      66         class1     department2     19
16    221          77          99      61         class2     department2     20

studentId=223, 216, 113, 121的math都為74发笔,但是row_number1的值是遞增的

3.2 rank開(kāi)窗函數(shù)

作用:rank() over([partition by col1] [order by col2]),當(dāng)遇到col2相等時(shí)凉翻,名次相同筐咧,但是下一個(gè)col2值的名次遞增N(N是重復(fù)的次數(shù))。例如:有兩條記錄是并列第一噪矛,下一個(gè)是第三量蕊,沒(méi)有第二。

select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as rank1
from student_scores;

結(jié)果:

id    studentid    language    math    english    classid    departmentid    rank1
1     111          68          69      90         class1     department1     1
9     124          76          70      76         class2     department1     2
18    223          79          74      96         class2     department2     3
15    216          85          74      93         class1     department2     3
3     113          90          74      75         class1     department1     3
6     121          96          74      79         class2     department1     3
8     123          70          78      61         class2     department1     7
17    222          80          78      96         class2     department2     7
19    224          75          80      78         class2     department2     9
2     112          73          80      96         class1     department1     9
14    215          84          82      73         class1     department2     11
11    212          76          83      75         class1     department2     12
20    225          82          85      63         class2     department2     13
7     122          89          86      85         class2     department1     14
5     115          99          93      89         class1     department1     15
10    211          89          93      60         class1     department2     15
12    213          71          94      90         class1     department2     17
4     114          89          94      93         class1     department1     17
13    214          94          94      66         class1     department2     17
16    221          77          99      61         class2     department2     20

從第3行到第6行艇挨,重復(fù)出現(xiàn)了4次残炮,所以第7行rank1為3+4=7

3.3 dense_rank開(kāi)窗函數(shù)

作用:dense_rank() over([partition by col1] [order by col2])rank類似,當(dāng)遇到col2相等時(shí)缩滨,名次同樣相等势就,不同的是泉瞻,下一個(gè)col2值的名次+1,而不是+N苞冯。

select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as dense_rank1
from student_scores;

結(jié)果:

id    studentid    language    math    english    classid    departmentid    dense_rank1
1     111          68          69      90         class1     department1     1
9     124          76          70      76         class2     department1     2
18    223          79          74      96         class2     department2     3
15    216          85          74      93         class1     department2     3
3     113          90          74      75         class1     department1     3
6     121          96          74      79         class2     department1     3
8     123          70          78      61         class2     department1     4
17    222          80          78      96         class2     department2     4
19    224          75          80      78         class2     department2     5
2     112          73          80      96         class1     department1     5
14    215          84          82      73         class1     department2     6
11    212          76          83      75         class1     department2     7
20    225          82          85      63         class2     department2     8
7     122          89          86      85         class2     department1     9
5     115          99          93      89         class1     department1     10
10    211          89          93      60         class1     department2     10
12    213          71          94      90         class1     department2     11
4     114          89          94      93         class1     department1     11
13    214          94          94      66         class1     department2     11
16    221          77          99      61         class2     department2     12

從第3行到第6行math值相等袖牙,dense_rank1為3,所以第7行dense_rank1為4

3.4 ntile開(kāi)窗函數(shù)

作用:ntile(N) over([partition by col1] [order by col2])舅锄,將分區(qū)中的數(shù)據(jù)按照順序劃分為N片鞭达,返回當(dāng)前片的值。
注1:如果切片分布不均勻皇忿,默認(rèn)增加第一個(gè)切片的分布
注2:ntile不支持ROWS BETWEEN

select id,studentId,language,math,english,classId,departmentId,
ntile(2) over(partition by departmentid order by math) as ntile1
from student_scores;

結(jié)果:

id    studentid    language    math    english    classid    departmentid    ntile1
1     111          68          69      90         class1     department1     1     
9     124          76          70      76         class2     department1     1     
6     121          96          74      79         class2     department1     1     
3     113          90          74      75         class1     department1     1     
8     123          70          78      61         class2     department1     1     
2     112          73          80      96         class1     department1     2     
7     122          89          86      85         class2     department1     2     
5     115          99          93      89         class1     department1     2     
4     114          89          94      93         class1     department1     2     
15    216          85          74      93         class1     department2     1     
18    223          79          74      96         class2     department2     1     
17    222          80          78      96         class2     department2     1     
19    224          75          80      78         class2     department2     1     
14    215          84          82      73         class1     department2     1     
11    212          76          83      75         class1     department2     1     
20    225          82          85      63         class2     department2     2     
10    211          89          93      60         class1     department2     2     
12    213          71          94      90         class1     department2     2     
13    214          94          94      66         class1     department2     2     
16    221          77          99      61         class2     department2     2     

3.5 percent_rank開(kāi)窗函數(shù)

作用:percent_rank() over([partition by col1] [order by col2])畴蹭,計(jì)算給定行的百分比排名。分組內(nèi)當(dāng)前行的RANK值-1/分組內(nèi)總行數(shù)-1鳍烁,可以用來(lái)計(jì)算超過(guò)了百分之多少的人叨襟。

select id,studentid,language,math,english,classid,departmentid,
percent_rank() over(order by math) as rn1 
from student_scores;

結(jié)果:

id    studentid    language    math    english    classid    departmentid    rn1
1     111          68          69      90         class1     department1     0.0
9     124          76          70      76         class2     department1     0.05263157894736842
18    223          79          74      96         class2     department2     0.10526315789473684
15    216          85          74      93         class1     department2     0.10526315789473684
3     113          90          74      75         class1     department1     0.10526315789473684
6     121          96          74      79         class2     department1     0.10526315789473684
8     123          70          78      61         class2     department1     0.3157894736842105
17    222          80          78      96         class2     department2     0.3157894736842105
19    224          75          80      78         class2     department2     0.42105263157894735
2     112          73          80      96         class1     department1     0.42105263157894735
14    215          84          82      73         class1     department2     0.5263157894736842
11    212          76          83      75         class1     department2     0.5789473684210527
20    225          82          85      63         class2     department2     0.631578947368421
7     122          89          86      85         class2     department1     0.6842105263157895
5     115          99          93      89         class1     department1     0.7368421052631579
10    211          89          93      60         class1     department2     0.7368421052631579
12    213          71          94      90         class1     department2     0.8421052631578947
4     114          89          94      93         class1     department1     0.8421052631578947
13    214          94          94      66         class1     department2     0.8421052631578947
16    221          77          99      61         class2     department2     1.0

第二行,rn1的值為(2-1)/(20-1)=0.05263157894736842
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末幔荒,一起剝皮案震驚了整個(gè)濱河市糊闽,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌爹梁,老刑警劉巖右犹,帶你破解...
    沈念sama閱讀 221,273評(píng)論 6 515
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場(chǎng)離奇詭異卫键,居然都是意外死亡傀履,警方通過(guò)查閱死者的電腦和手機(jī)虱朵,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 94,349評(píng)論 3 398
  • 文/潘曉璐 我一進(jìn)店門莉炉,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái),“玉大人碴犬,你說(shuō)我怎么就攤上這事絮宁。” “怎么了服协?”我有些...
    開(kāi)封第一講書(shū)人閱讀 167,709評(píng)論 0 360
  • 文/不壞的土叔 我叫張陵绍昂,是天一觀的道長(zhǎng)。 經(jīng)常有香客問(wèn)我偿荷,道長(zhǎng)窘游,這世上最難降的妖魔是什么? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 59,520評(píng)論 1 296
  • 正文 為了忘掉前任跳纳,我火速辦了婚禮忍饰,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘寺庄。我一直安慰自己艾蓝,他們只是感情好力崇,可當(dāng)我...
    茶點(diǎn)故事閱讀 68,515評(píng)論 6 397
  • 文/花漫 我一把揭開(kāi)白布。 她就那樣靜靜地躺著赢织,像睡著了一般亮靴。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上于置,一...
    開(kāi)封第一講書(shū)人閱讀 52,158評(píng)論 1 308
  • 那天茧吊,我揣著相機(jī)與錄音,去河邊找鬼俱两。 笑死饱狂,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的宪彩。 我是一名探鬼主播休讳,決...
    沈念sama閱讀 40,755評(píng)論 3 421
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼尿孔!你這毒婦竟也來(lái)了俊柔?” 一聲冷哼從身側(cè)響起,我...
    開(kāi)封第一講書(shū)人閱讀 39,660評(píng)論 0 276
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤活合,失蹤者是張志新(化名)和其女友劉穎雏婶,沒(méi)想到半個(gè)月后,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體白指,經(jīng)...
    沈念sama閱讀 46,203評(píng)論 1 319
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡留晚,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 38,287評(píng)論 3 340
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了告嘲。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片错维。...
    茶點(diǎn)故事閱讀 40,427評(píng)論 1 352
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖橄唬,靈堂內(nèi)的尸體忽然破棺而出赋焕,到底是詐尸還是另有隱情,我是刑警寧澤仰楚,帶...
    沈念sama閱讀 36,122評(píng)論 5 349
  • 正文 年R本政府宣布隆判,位于F島的核電站,受9級(jí)特大地震影響僧界,放射性物質(zhì)發(fā)生泄漏侨嘀。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,801評(píng)論 3 333
  • 文/蒙蒙 一捂襟、第九天 我趴在偏房一處隱蔽的房頂上張望咬腕。 院中可真熱鬧,春花似錦笆豁、人聲如沸郎汪。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 32,272評(píng)論 0 23
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)煞赢。三九已至抛计,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間照筑,已是汗流浹背吹截。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 33,393評(píng)論 1 272
  • 我被黑心中介騙來(lái)泰國(guó)打工, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留凝危,地道東北人波俄。 一個(gè)月前我還...
    沈念sama閱讀 48,808評(píng)論 3 376
  • 正文 我出身青樓,卻偏偏與公主長(zhǎng)得像蛾默,于是被迫代替她去往敵國(guó)和親懦铺。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 45,440評(píng)論 2 359

推薦閱讀更多精彩內(nèi)容