寫在前面
窗口函數(shù)在處理復(fù)雜需求時(shí)提供了一種更為簡(jiǎn)便的數(shù)據(jù)處理方式雷猪,在實(shí)際業(yè)務(wù)中應(yīng)用非常廣泛睛竣,也是面試官喜歡重點(diǎn)考察的知識(shí)點(diǎn)。
什么是窗口函數(shù)求摇?
窗口函數(shù)也被稱為聯(lián)機(jī)分析函數(shù)(OLAP,Online Anallytical Processing)或者分析函數(shù)(Analytic Function)射沟,窗口指對(duì)滿足條件的集合進(jìn)行計(jì)算,并對(duì)每一行數(shù)據(jù)返回分析結(jié)果与境,窗口函數(shù)的格式如下:
<窗口函數(shù)> OVER (partition by <用于分組的列名> order by <用于排序的列名> frame_clause)
1验夯、常用窗口函數(shù)
1) 聚合函數(shù):sum()、count()摔刁、max()挥转、min()、avg()
2) 排序函數(shù):row_number()簸搞、rank()扁位、dense_rank()
3) 分布函數(shù):percent_rank()准潭、cume_dist()
4) 平移函數(shù):lead()趁俊、lag()
5) 首尾函數(shù):first_val()、last_val()
2刑然、分區(qū)(partition by)
over中partition by類似group by對(duì)數(shù)據(jù)進(jìn)行分區(qū)寺擂,此時(shí),窗口函數(shù)會(huì)對(duì)每個(gè)分區(qū)單獨(dú)進(jìn)行分析泼掠,如果不指定partition by將會(huì)對(duì)整體數(shù)據(jù)進(jìn)行分析怔软。
3、排序(order by)
over中的order by對(duì)分區(qū)內(nèi)的數(shù)據(jù)進(jìn)行排序择镇,默認(rèn)為升序挡逼,當(dāng)order by某個(gè)字段中有重復(fù)值時(shí)會(huì)對(duì)重復(fù)值進(jìn)行求和,然后對(duì)所有數(shù)據(jù)進(jìn)行累加腻豌。
4家坎、窗口大小(frame_clause)
over中的frame_clause指對(duì)分區(qū)集合指定一個(gè)移動(dòng)窗口嘱能,當(dāng)指定了窗口大小后函數(shù)就不會(huì)在分區(qū)上進(jìn)行計(jì)算,而是基于窗口大小內(nèi)的數(shù)據(jù)進(jìn)行計(jì)算虱疏。窗口大小的格式如下:
rows frame_start
or
rows between frame_start and frame_end
其中惹骂,rows表示偏移的行數(shù)。frame_start表示窗口的起始位置做瞪,有三種選項(xiàng):
- UNBOUNDED PRECEDING,為默認(rèn)值对粪,表示從第一行開始。
- N PRECEDING,表示從前一行開始装蓬,前一行數(shù)據(jù)缺失則為0 著拭。
- CURRENT ROW,表示從當(dāng)前行開始矛物。
frame_end表示窗口的結(jié)束位置茫死,有三種選項(xiàng):
- CURRENT ROW為默認(rèn)值,表示從當(dāng)前行結(jié)束履羞。
- N FOLLOWING峦萎,表示當(dāng)前行后的第N行結(jié)束。
- UNBOUNDED FOLLOWING忆首,表示窗口到分區(qū)的最后一行結(jié)束爱榔。
sql中的默認(rèn)選項(xiàng)為:rows between UNBOUNDED PRECEDING AND CURRENT ROW,表示統(tǒng)計(jì)從第一行至當(dāng)前記錄行糙及。
rows between 1 PRECEDING AND 1 FOLLOWING详幽,表示當(dāng)前行和前一行及后面一行聚合,多用于近N月的數(shù)據(jù)統(tǒng)計(jì)浸锨。
rows between current row and UNBOUNDED FOLLOWING唇聘,表示當(dāng)前行及后面所有行。
為什么要使用窗口函數(shù)
在實(shí)際業(yè)務(wù)中我們經(jīng)常會(huì)遇到需要對(duì)數(shù)據(jù)結(jié)果進(jìn)行額外的統(tǒng)計(jì)柱搜,例如在計(jì)算各部門員工薪資后新增一列為公司整體薪資迟郎,又或者對(duì)各部門薪資水平進(jìn)行排序、計(jì)算占比等操作聪蘸,此時(shí)如果不使用窗口函數(shù)可能需要對(duì)表進(jìn)行多次的關(guān)聯(lián)才能實(shí)現(xiàn)宪肖,因此使用窗口函數(shù)可以大大簡(jiǎn)化代碼并提升代碼的讀寫性能。
如何使用窗口函數(shù)
首先根據(jù)窗口函數(shù)的定義我們可以知道健爬,窗口函數(shù)主要分為了聚合控乾、排序、分布娜遵、平移及首尾等類型蜕衡,對(duì)于每一種類型具體的應(yīng)用場(chǎng)景如下:
聚合函數(shù)
聚合函數(shù)也可以充當(dāng)窗口函數(shù),我們需要經(jīng)常對(duì)窗口下的數(shù)據(jù)集進(jìn)行聚合統(tǒng)計(jì)设拟,也是窗口函數(shù)中應(yīng)用較為廣泛的一類慨仿。例如鸽扁,我們需要統(tǒng)計(jì)A公司各部門下各員工的銷售,并對(duì)各部門的最大值镶骗、最小值桶现、平均值及計(jì)數(shù)等。
sql實(shí)現(xiàn)
select dept, name, salary,
sum(salary) over(partition by dept) as sum_salary, --各部門員工薪資求和
avg(salary) over(partition by dept) as avg_salary, --各部門員工薪資求平均
min(salary) over(partition by dept) as min_salary, --各部門員工薪資求最小
max(salary) over(partition by dept) as max_salary --各部門員工薪資求最大值
from data
python實(shí)現(xiàn)
import numpy as np
import pandas as pd
company=["A","B","C"]
data=pd.DataFrame({
"dept":[company[x] for x in np.random.randint(0,len(company),8)],
"name":["a","b","c","d","e","f","g","h"],
"salary":np.random.randint(10,30,8)
}
)
data['sum_salary'] = data.groupby('dept')['salary'].transform('sum')
data['min_salary'] = data.groupby('dept')['salary'].transform('min')
data['mean_salary'] = data.groupby('dept')['salary'].transform('mean')
data['max_salary'] = data.groupby('dept')['salary'].transform('max')
data
排序函數(shù)
排序函數(shù)常用于對(duì)分組集或者整體數(shù)據(jù)進(jìn)行排名鼎姊,例如我們需要對(duì)各部門員工薪資進(jìn)行排序骡和,排序函數(shù)又可以根據(jù)排序方式有以下分類:
1) row_number:對(duì)分組內(nèi)的數(shù)據(jù)進(jìn)行"同分不同級(jí)"方式排序,不存在序號(hào)并列的現(xiàn)象相寇,即使同分時(shí)排序也會(huì)不同慰于。
2) rank:對(duì)分組內(nèi)的數(shù)據(jù)進(jìn)行"同分同級(jí)且不緊密"方式排序,當(dāng)同分時(shí)序號(hào)相同唤衫,其它排序按正常排名進(jìn)行排序婆赠,即1,2,2,4,5。
3) dense_rank:對(duì)分組內(nèi)的數(shù)據(jù)進(jìn)行"同分同級(jí)且緊密"方式排序佳励,當(dāng)同分時(shí)序號(hào)相同休里,其它排序按下一排名進(jìn)行排序,即1,2,2,3,4赃承。
sql實(shí)現(xiàn)
select dept, name, salary,
row_number(salary) over(partition by dept order by salary desc) as row_number, --對(duì)各部門員工薪資按同分不同級(jí)方式排序
rank(salary) over(partition by dept order by salary desc) as rank, --對(duì)各部門員工薪資按同分同級(jí)且緊密方式方式排序
dense_rank(salary) over(partition by dept order by salary desc) as dense_rank --對(duì)各部門員工薪資按同分同級(jí)且不緊密方式方式排序
from data
python實(shí)現(xiàn)
import numpy as np
import pandas as pd
company=["A","B","C"]
data=pd.DataFrame({
"dept":[company[x] for x in np.random.randint(0,len(company),8)],
"name":["a","b","c","d","e","f","g","h"],
"salary":np.random.randint(10,15,8)
}
)
data['row_number'] = data.groupby('dept')['salary'].rank(ascending=False,method='first') #同分不同級(jí)
data['rank'] = data.groupby('dept')['salary'].rank(ascending=False,method='min') #"同分同級(jí)且不緊密"
data['dense_rank'] = data.groupby('dept')['salary'].rank(ascending=False,method='dense') #"同分同級(jí)且緊密"
data
分布函數(shù)
分布函數(shù)主要分為兩類:percent_rank()和cume_dist()妙黍。
percent_rank():指按照排名計(jì)算百分比,即該排名位于區(qū)間[0,1]的位置瞧剖,其中區(qū)間內(nèi)第一名為值0拭嫁,最后一名值為1。其具體公式為:
cume_dist():指區(qū)間內(nèi)大于等于當(dāng)前排名的行數(shù)占區(qū)間內(nèi)總函數(shù)的比例抓于。多用于判斷比當(dāng)前薪資做粤、得分高的用戶比例為多少。
sql實(shí)現(xiàn)
select dept, name, salary,
percent_rank(salary) over(partition by dept order by salary desc) as percent_rank,
cume_dist(salary) over(partition by dept order by salary desc) as cume_dist
from data
import numpy as np
import pandas as pd
company=["A","B","C"]
data=pd.DataFrame({
"dept":[company[x] for x in np.random.randint(0,len(company),8)],
"name":["a","b","c","d","e","f","g","h"],
"salary":np.random.randint(10,15,8)
}
)
# data.groupby('dept')['salary'].rank(ascending=False,method='first',pct=True)
data['percent_rank'] = (data.groupby('dept')['salary'].rank(ascending=False,method='min')-1) / \
(data.groupby('dept')['salary'].transform('count')-1) #如果分組只有一個(gè)記錄則數(shù)據(jù)為na
data['cume_dist'] = data.groupby('dept')['salary'].rank(ascending=False,method='first',pct=True) #可以結(jié)合排序函數(shù)的方法使用
data
平移函數(shù)
分布函數(shù)主要分為兩類:lead(列名,n)和lag(列名,n)捉撮,此函數(shù)多用于計(jì)算指標(biāo)同比怕品、環(huán)比。
lead(列名,n):獲取分區(qū)內(nèi)向下平移n行數(shù)據(jù)呕缭。
lag(列名,n):獲取分區(qū)內(nèi)向上平移n行數(shù)據(jù)堵泽。
sql實(shí)現(xiàn)
select dept, name, salary,
lead(salary,1) over(partition by dept order by salary desc ) as lead,
lag(salary,1) over(partition by dept order by salary desc) as lag
from data
import numpy as np
import pandas as pd
company=["A","B","C"]
data=pd.DataFrame({
"dept":[company[x] for x in np.random.randint(0,len(company),8)],
"name":["a","b","c","d","e","f","g","h"],
"salary":np.random.randint(10,30,8)
}
)
data['lead'] = data.sort_values(['dept','salary'],ascending=False).groupby('dept')['salary'].shift(-1) # 分區(qū)內(nèi)向下平移一個(gè)單位
data['lag'] = data.sort_values(['dept','salary'],ascending=False).groupby('dept')['salary'].shift(1) # 分區(qū)內(nèi)向上平移一個(gè)單位
data
首尾函數(shù)
分布函數(shù)主要分為兩類:first_val()和last_val()修己。
first_val():獲取分區(qū)內(nèi)第一行數(shù)據(jù)恢总。
last_val():獲取分區(qū)內(nèi)最后一行數(shù)據(jù)。
sql實(shí)現(xiàn)
select dept, name, salary,
first_val(salary) over(partition by dept order by salary desc ) as first_val,
# 由于窗口函數(shù)默認(rèn)的是第一行至當(dāng)前行睬愤,所以在使用last_val()函數(shù)時(shí)片仿,會(huì)出現(xiàn)分區(qū)內(nèi)最后一行和當(dāng)前行大小一致的情況,因此我們需要將分區(qū)偏移量改為第一行至最后一行尤辱。
last_val(salary) over(partition by dept order by salary desc rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) as last_val
from data
python實(shí)現(xiàn)
import numpy as np
import pandas as pd
company=["A","B","C"]
data=pd.DataFrame({
"dept":[company[x] for x in np.random.randint(0,len(company),8)],
"name":["a","b","c","d","e","f","g","h"],
"salary":np.random.randint(10,30,8)
}
)
data['first_val'] = data.groupby('dept')['salary'].transform('min')
data['last_val'] = data.groupby('dept')['salary'].transform('max')
data
Q&A
Q1:聚合函數(shù)和窗口函數(shù)的區(qū)別
區(qū)別:聚合函數(shù)是將多條數(shù)據(jù)聚合成一行數(shù)據(jù)砂豌,而窗口函數(shù)是為每一行數(shù)據(jù)返回一個(gè)結(jié)果厢岂。
聯(lián)系:都是對(duì)一組數(shù)據(jù)進(jìn)行分析,窗口函數(shù)可以使用聚合函數(shù)作為函數(shù)阳距。
當(dāng)需要對(duì)數(shù)據(jù)結(jié)果進(jìn)行額外的統(tǒng)計(jì)時(shí)塔粒,我們常常需要使用窗口函數(shù)。
Q2:SQL的執(zhí)行順序
SQL的書寫順序是SELECT筐摘、FROM卒茬、JOIN、ON咖熟、WHERE圃酵、GROUP BY、HAVING馍管、ORDER BY郭赐、LIMIT,其執(zhí)行順序見下圖:
在這里我們需要強(qiáng)調(diào)一下sql的執(zhí)行順序确沸,因?yàn)榇蠖鄶?shù)情況下我們是不需要太多考慮sql執(zhí)行順序的捌锭,但是由于窗口函數(shù)的執(zhí)行順序位于大多數(shù)字段之后,只位于字段ORDER BY之前罗捎,因此相當(dāng)于在執(zhí)行所有字段生成的臨時(shí)表基礎(chǔ)之上執(zhí)行的窗口函數(shù)的操作舀锨,舉個(gè)例子:
從上圖中我們可以看到,第二張圖無(wú)GROUP BY和第三張有GROUP BY時(shí)宛逗,最終數(shù)據(jù)的行數(shù)由7條記錄變?yōu)?條坎匿,就是因?yàn)榇翱诤瘮?shù)是基于GROUP BY 字段執(zhí)行之后的臨時(shí)表基礎(chǔ)上進(jìn)行的計(jì)算,因此在清楚SQL的執(zhí)行順序后我們就能很容易理解最后展現(xiàn)的結(jié)果了雷激。