1. 介紹
普通聚合函數(shù)聚合的行集是組殴瘦,開(kāi)窗函數(shù)聚合的行集是窗口。
因此驱还,普通聚合函數(shù)每組(Group by)只有一個(gè)返回值涛碑,而開(kāi)窗函數(shù)則可以為窗口中的每行都返回一個(gè)值精堕。
1.1 基礎(chǔ)結(jié)構(gòu)
分析函數(shù)(如:sum(), max(), row_number()...) + 窗口子句(over函數(shù))
1.2 over函數(shù)
over(partition by [column_n] order by [column_m])
先按照column_n
分區(qū),相同的column_n
分為一區(qū)锌唾,每個(gè)分區(qū)根據(jù)column_m
排序(默認(rèn)升序)。
1.3 測(cè)試數(shù)據(jù)
建表并插入數(shù)據(jù)
-- 建表
create table student_scores(
id int,
studentId int,
language int,
math int,
english int,
classId string,
departmentId string
);
-- 寫(xiě)入數(shù)據(jù)
insert into table student_scores values
(1,111,68,69,90,'class1','department1'),
(2,112,73,80,96,'class1','department1'),
(3,113,90,74,75,'class1','department1'),
(4,114,89,94,93,'class1','department1'),
(5,115,99,93,89,'class1','department1'),
(6,121,96,74,79,'class2','department1'),
(7,122,89,86,85,'class2','department1'),
(8,123,70,78,61,'class2','department1'),
(9,124,76,70,76,'class2','department1'),
(10,211,89,93,60,'class1','department2'),
(11,212,76,83,75,'class1','department2'),
(12,213,71,94,90,'class1','department2'),
(13,214,94,94,66,'class1','department2'),
(14,215,84,82,73,'class1','department2'),
(15,216,85,74,93,'class1','department2'),
(16,221,77,99,61,'class2','department2'),
(17,222,80,78,96,'class2','department2'),
(18,223,79,74,96,'class2','department2'),
(19,224,75,80,78,'class2','department2'),
(20,225,82,85,63,'class2','department2');
1.4 窗口含義
select studentId,math,departmentId,classId,
-- 符合所有條件的行作為窗口夺英,這里符合department1的有9個(gè)
count(math) over() as count1,
-- 按照classId分組的所有行作為窗口
count(math) over(partition by classId) as count2,
-- 按照classId分組晌涕、按照math排序的所有行作為窗口
count(math) over(partition by classId order by math) as count3,
-- 按照classId分組、按照math排序痛悯,當(dāng)前行向前1行向后2行的行作為窗口
count(math) over(partition by classId order by math rows between 1 preceding and 2 following) as count4,
-- 按照classId分組余黎、按照math排序,當(dāng)前行向后所有行作為窗口
count(math) over(partition by classId order by math rows between current row and unbounded following) as count5
from student_scores where departmentId='department1';
結(jié)果:
studentId math departmentId classId count1 count2 count3 count4 count5
111 69 department1 class1 9 5 1 3 5
113 74 department1 class1 9 5 2 4 4
112 80 department1 class1 9 5 3 4 3
115 93 department1 class1 9 5 4 3 2
114 94 department1 class1 9 5 5 2 1
124 70 department1 class2 9 4 1 3 4
121 74 department1 class2 9 4 2 4 3
123 78 department1 class2 9 4 3 3 2
122 86 department1 class2 9 4 4 2 1
結(jié)果解析:
studentId=115载萌,
count1為departmentId=department1的行數(shù)為9惧财,
count2為分區(qū)class1中的行數(shù)5,
count3為分區(qū)class1中math<=93的行數(shù)4扭仁,
count4為分區(qū)class1中math值向前1行和向后2行的行數(shù)3垮衷,
count5為分區(qū)class1中當(dāng)前math值到class1分區(qū)結(jié)束的行數(shù)2。
上面可以看到:如果不指定ROWS BETWEEN
乖坠,默認(rèn)統(tǒng)計(jì)窗口是從起點(diǎn)到當(dāng)前行
關(guān)鍵是ROWS BETWEEN
搀突,也叫做window子句
。
PRECEDING
:向前
FOLLOWING
:向后
CURRENT ROW
:當(dāng)前行
UNBOUNDED
:無(wú)邊界熊泵,UNBOUNDED PRECEDING
表示從最前面的起點(diǎn)開(kāi)始仰迁,UNBOUNDED FOLLOWING
表示到最后面的終點(diǎn)
開(kāi)窗函數(shù)可以粗略地分為兩類:聚合開(kāi)窗函數(shù)和排序開(kāi)窗函數(shù)。
2. 聚合開(kāi)窗函數(shù)
2.1 sum函數(shù)
select studentId,math,departmentId,classId,
sum(math) over() as sum1,
sum(math) over(partition by classId) as sum2,
sum(math) over(partition by classId order by math) as sum3,
sum(math) over(partition by classId order by math rows between 1 preceding and 2 following) as sum4,
sum(math) over(partition by classId order by math rows between current row and unbounded following) as sum5
from student_scores where departmentId='department1';
-- 結(jié)果解析:類似count()函數(shù)
min()顽分,max()徐许,avg()都與count()類似
2.2 first_value開(kāi)窗函數(shù)
作用:返回分區(qū)中的第一個(gè)值
select studentId,math,departmentId,classId,
first_value(math) over(partition by classId order by math) as first_value1,
first_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as first_value2
from student_scores where departmentId='department1';
結(jié)果:
studentid math departmentid classid first_value1 first_value2
111 69 department1 class1 69 69
113 74 department1 class1 69 69
112 80 department1 class1 69 74
115 93 department1 class1 69 80
114 94 department1 class1 69 93
124 70 department1 class2 70 70
121 74 department1 class2 70 70
123 78 department1 class2 70 74
122 86 department1 class2 70 78
結(jié)果解析:
studentId=122,
first_value1為分區(qū)class2中按照math排序的第一個(gè)值70卒蘸,
first_value2為分區(qū)class2中按照math排序后當(dāng)前行向前1行向后2行區(qū)間的第一個(gè)值78雌隅。
2.3 last_vlaue開(kāi)窗函數(shù)
作用:返回分區(qū)最后一個(gè)值
2.4 lag開(kāi)窗函數(shù)
作用:LAG(col, n, DEFAULT)
用于統(tǒng)計(jì)窗口內(nèi)向上第n行的值
col
:列名
n
:向上n行,[可選,默認(rèn)為1]
DEFAULT
:當(dāng)向上n行為NULL時(shí)澄步,取默認(rèn)值冰蘑;如果不指定,則為NULL
select studentId,math,departmentId,classId,
lag(math,2,60) over(partition by classId order by math) as lag1,
lag(math,2) over(partition by classId order by math) as lag2
from student_scores where departmentId='department1';
結(jié)果:
studentid math departmentid classid lag1 lag2
111 69 department1 class1 60 NULL
113 74 department1 class1 60 NULL
112 80 department1 class1 69 69
115 93 department1 class1 74 74
114 94 department1 class1 80 80
124 70 department1 class2 60 NULL
121 74 department1 class2 60 NULL
123 78 department1 class2 70 70
122 86 department1 class2 74 74
結(jié)果解析:
studentId=113村缸,
lag1為分區(qū)class1按照math排序后當(dāng)前行向上2行的值NULL祠肥,但是設(shè)置了DEFUALT,所以為60梯皿,
lag2因?yàn)闆](méi)有設(shè)置DEFAULT仇箱,所以為NULL。
2.5 lead開(kāi)窗函數(shù)
作用:LEAD(col, n, DEFAULT)
與LAG
相反东羹,用于統(tǒng)計(jì)窗口內(nèi)向下n行的值
col
:列名
n
:向下n行剂桥,[可選,默認(rèn)為1]
DEFAULT
:當(dāng)向下n行為NULL時(shí)属提,取默認(rèn)值权逗;如果不指定,則為NULL
2.6 cume_dist開(kāi)窗函數(shù)
作用:計(jì)算某個(gè)窗口或分區(qū)中某個(gè)值的累積分布冤议。假定升序排序斟薇,則使用以下公式確定累積分布:(小于等于當(dāng)前值的行數(shù)) / (分區(qū)內(nèi)總行數(shù))
select studentId,math,departmentId,classId,
cume_dist() over(order by math) as cume_dist1,
cume_dist() over(order by math desc) as cume_dist2,
cume_dist() over(partition by classId order by math) as cume_dist3
from student_scores where departmentId='department1';
結(jié)果:
studentid math departmentid classid cume_dist1 cume_dist2 cume_dist3
111 69 department1 class1 0.1111111111111111 1.0 0.2
113 74 department1 class1 0.4444444444444444 0.7777777777777778 0.4
112 80 department1 class1 0.6666666666666666 0.4444444444444444 0.6
115 93 department1 class1 0.8888888888888888 0.2222222222222222 0.8
114 94 department1 class1 1.0 0.1111111111111111 1.0
124 70 department1 class2 0.2222222222222222 0.8888888888888888 0.25
121 74 department1 class2 0.4444444444444444 0.7777777777777778 0.5
123 78 department1 class2 0.5555555555555556 0.5555555555555556 0.75
122 86 department1 class2 0.7777777777777778 0.3333333333333333 1.0
結(jié)果解析:
studentId=115,
cume_dist1=小于等于93的行數(shù)8/總行數(shù)9=0.8888888888888888
cume_dist2=大于等于93的行數(shù)2/總行數(shù)9=0.2222222222222222
cume_dist3=class1分區(qū)內(nèi)小于等于93的行數(shù)4/總行數(shù)5=0.8
3. 排序開(kāi)窗函數(shù)
3.1 row_number開(kāi)窗函數(shù)
作用:row_number() over([partition by col1] [order by col2])
開(kāi)窗函數(shù)是基于over
子句中order by
列的一個(gè)排名恕酸。在窗口或分區(qū)內(nèi)從1開(kāi)始排序堪滨,即使遇到col2
相等時(shí),名次依舊增加蕊温。例如:有兩條記錄相等袱箱,但一個(gè)是第一,一個(gè)是第二义矛。
select id,studentId,language,math,english,classId,departmentId,
row_number() over (order by math) as row_number1
from student_scores;
結(jié)果:
id studentid language math english classid departmentid row_number1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 2
18 223 79 74 96 class2 department2 3
15 216 85 74 93 class1 department2 4
3 113 90 74 75 class1 department1 5
6 121 96 74 79 class2 department1 6
8 123 70 78 61 class2 department1 7
17 222 80 78 96 class2 department2 8
19 224 75 80 78 class2 department2 9
2 112 73 80 96 class1 department1 10
14 215 84 82 73 class1 department2 11
11 212 76 83 75 class1 department2 12
20 225 82 85 63 class2 department2 13
7 122 89 86 85 class2 department1 14
5 115 99 93 89 class1 department1 15
10 211 89 93 60 class1 department2 16
12 213 71 94 90 class1 department2 17
4 114 89 94 93 class1 department1 18
13 214 94 94 66 class1 department2 19
16 221 77 99 61 class2 department2 20
studentId=223, 216, 113, 121的math都為74发笔,但是row_number1的值是遞增的
3.2 rank開(kāi)窗函數(shù)
作用:rank() over([partition by col1] [order by col2])
,當(dāng)遇到col2
相等時(shí)凉翻,名次相同筐咧,但是下一個(gè)col2值的名次遞增N(N是重復(fù)的次數(shù))。例如:有兩條記錄是并列第一噪矛,下一個(gè)是第三量蕊,沒(méi)有第二。
select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as rank1
from student_scores;
結(jié)果:
id studentid language math english classid departmentid rank1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 2
18 223 79 74 96 class2 department2 3
15 216 85 74 93 class1 department2 3
3 113 90 74 75 class1 department1 3
6 121 96 74 79 class2 department1 3
8 123 70 78 61 class2 department1 7
17 222 80 78 96 class2 department2 7
19 224 75 80 78 class2 department2 9
2 112 73 80 96 class1 department1 9
14 215 84 82 73 class1 department2 11
11 212 76 83 75 class1 department2 12
20 225 82 85 63 class2 department2 13
7 122 89 86 85 class2 department1 14
5 115 99 93 89 class1 department1 15
10 211 89 93 60 class1 department2 15
12 213 71 94 90 class1 department2 17
4 114 89 94 93 class1 department1 17
13 214 94 94 66 class1 department2 17
16 221 77 99 61 class2 department2 20
從第3行到第6行艇挨,重復(fù)出現(xiàn)了4次残炮,所以第7行rank1為3+4=7
3.3 dense_rank開(kāi)窗函數(shù)
作用:dense_rank() over([partition by col1] [order by col2])
與rank
類似,當(dāng)遇到col2
相等時(shí)缩滨,名次同樣相等势就,不同的是泉瞻,下一個(gè)col2值的名次+1,而不是+N苞冯。
select id,studentId,language,math,english,classId,departmentId,
dense_rank() over (order by math) as dense_rank1
from student_scores;
結(jié)果:
id studentid language math english classid departmentid dense_rank1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 2
18 223 79 74 96 class2 department2 3
15 216 85 74 93 class1 department2 3
3 113 90 74 75 class1 department1 3
6 121 96 74 79 class2 department1 3
8 123 70 78 61 class2 department1 4
17 222 80 78 96 class2 department2 4
19 224 75 80 78 class2 department2 5
2 112 73 80 96 class1 department1 5
14 215 84 82 73 class1 department2 6
11 212 76 83 75 class1 department2 7
20 225 82 85 63 class2 department2 8
7 122 89 86 85 class2 department1 9
5 115 99 93 89 class1 department1 10
10 211 89 93 60 class1 department2 10
12 213 71 94 90 class1 department2 11
4 114 89 94 93 class1 department1 11
13 214 94 94 66 class1 department2 11
16 221 77 99 61 class2 department2 12
從第3行到第6行math值相等袖牙,dense_rank1為3,所以第7行dense_rank1為4
3.4 ntile開(kāi)窗函數(shù)
作用:ntile(N) over([partition by col1] [order by col2])
舅锄,將分區(qū)中的數(shù)據(jù)按照順序劃分為N片鞭达,返回當(dāng)前片的值。
注1:如果切片分布不均勻皇忿,默認(rèn)增加第一個(gè)切片的分布
注2:ntile
不支持ROWS BETWEEN
select id,studentId,language,math,english,classId,departmentId,
ntile(2) over(partition by departmentid order by math) as ntile1
from student_scores;
結(jié)果:
id studentid language math english classid departmentid ntile1
1 111 68 69 90 class1 department1 1
9 124 76 70 76 class2 department1 1
6 121 96 74 79 class2 department1 1
3 113 90 74 75 class1 department1 1
8 123 70 78 61 class2 department1 1
2 112 73 80 96 class1 department1 2
7 122 89 86 85 class2 department1 2
5 115 99 93 89 class1 department1 2
4 114 89 94 93 class1 department1 2
15 216 85 74 93 class1 department2 1
18 223 79 74 96 class2 department2 1
17 222 80 78 96 class2 department2 1
19 224 75 80 78 class2 department2 1
14 215 84 82 73 class1 department2 1
11 212 76 83 75 class1 department2 1
20 225 82 85 63 class2 department2 2
10 211 89 93 60 class1 department2 2
12 213 71 94 90 class1 department2 2
13 214 94 94 66 class1 department2 2
16 221 77 99 61 class2 department2 2
3.5 percent_rank開(kāi)窗函數(shù)
作用:percent_rank() over([partition by col1] [order by col2])
畴蹭,計(jì)算給定行的百分比排名。分組內(nèi)當(dāng)前行的RANK值-1/分組內(nèi)總行數(shù)-1鳍烁,可以用來(lái)計(jì)算超過(guò)了百分之多少的人叨襟。
select id,studentid,language,math,english,classid,departmentid,
percent_rank() over(order by math) as rn1
from student_scores;
結(jié)果:
id studentid language math english classid departmentid rn1
1 111 68 69 90 class1 department1 0.0
9 124 76 70 76 class2 department1 0.05263157894736842
18 223 79 74 96 class2 department2 0.10526315789473684
15 216 85 74 93 class1 department2 0.10526315789473684
3 113 90 74 75 class1 department1 0.10526315789473684
6 121 96 74 79 class2 department1 0.10526315789473684
8 123 70 78 61 class2 department1 0.3157894736842105
17 222 80 78 96 class2 department2 0.3157894736842105
19 224 75 80 78 class2 department2 0.42105263157894735
2 112 73 80 96 class1 department1 0.42105263157894735
14 215 84 82 73 class1 department2 0.5263157894736842
11 212 76 83 75 class1 department2 0.5789473684210527
20 225 82 85 63 class2 department2 0.631578947368421
7 122 89 86 85 class2 department1 0.6842105263157895
5 115 99 93 89 class1 department1 0.7368421052631579
10 211 89 93 60 class1 department2 0.7368421052631579
12 213 71 94 90 class1 department2 0.8421052631578947
4 114 89 94 93 class1 department1 0.8421052631578947
13 214 94 94 66 class1 department2 0.8421052631578947
16 221 77 99 61 class2 department2 1.0
第二行,rn1的值為(2-1)/(20-1)=0.05263157894736842