Hive分析函數(shù)和窗口函數(shù)

Hive分析函數(shù)和窗口函數(shù)

在Hive 0.11之后支持的,掃描多個(gè)輸入的行計(jì)算每行的結(jié)果困曙。通常和OVER,PARTITION BY, ORDER BY, WINDOWING配合使用恰矩。和傳統(tǒng)的分組結(jié)果不一樣记盒,傳統(tǒng)的結(jié)果每組中只有一個(gè)結(jié)果。分析函數(shù)的結(jié)果會(huì)出現(xiàn)多次外傅,和每條記錄都連接輸出纪吮。

語法形式如下:

Function(arg1,....argn) OVER([PARTITION BY<...>] [ORDER BY<...>] [window_clause])

窗口函數(shù)

函數(shù)名 說明
FIRST_VALUE 取出分組內(nèi)排序后俩檬,截止到當(dāng)前行,第一個(gè)值
LAST_VALUE 取出分組內(nèi)排序后碾盟,截止到當(dāng)前行棚辽,最后一個(gè)值
LEAD(col, n, DEFAULT) 用于統(tǒng)計(jì)窗口內(nèi)往下第n行的值。第一個(gè)參數(shù)為列名巷疼,第二個(gè)參數(shù)為往下第n行(可選晚胡,默認(rèn)為1)灵奖,第三個(gè)參數(shù)為默認(rèn)值(當(dāng)往下第n行為NULL時(shí)嚼沿,取默認(rèn)值)
LAG(col,n,DEFAULT) 與lead相反,用于統(tǒng)計(jì)窗口內(nèi)往下第n個(gè)值瓷患。第一個(gè)參數(shù)為列名骡尽,第二個(gè)參數(shù)為往上第n行(可選,默認(rèn)為1)

OVER從句

  1. 使用標(biāo)準(zhǔn)的聚合函數(shù)COUNT,SUM,MIN,MAX,AVG
  2. 使用PARTITION BY語句擅编,使用一個(gè)或者多個(gè)原始數(shù)據(jù)類型的列
  3. 使用PARTITION BY與ORDER BY語句攀细,使用一個(gè)或者多個(gè)數(shù)據(jù)類型的分區(qū)或者拍序列
  4. 使用窗口規(guī)范,窗口規(guī)范支持一下格式:
(ROW | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING

當(dāng)ORDER BY后面缺少窗口從句條件爱态,窗口規(guī)范默認(rèn)是

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

當(dāng)ORDER BY和窗口從句都缺失谭贪,窗口規(guī)范默認(rèn)是:

ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

分析函數(shù)

函數(shù) 說明
ROW_NUMBER() 從1開始,按照順序锦担,生成分組內(nèi)記錄的序列俭识,比如,按照pv降序排列洞渔,生成分組內(nèi)每天的pv名次套媚,ROW_NUMBER()的應(yīng)用 場景非常多,再比如磁椒,獲取分組內(nèi)排序第一的記錄堤瘤,獲取一個(gè)session中的第一條refer等
RANK() 生成數(shù)據(jù)項(xiàng)在分組中的排名,排名相等會(huì)在名次中留下空位
DENSE_RANK() 生成數(shù)據(jù)項(xiàng)在分組中的排名浆熔,排名相等會(huì)在名次中不會(huì)留下空位
CUME_DIST() 小于等于當(dāng)前值的行數(shù)除以分組內(nèi)總行數(shù)本辐。比如,統(tǒng)計(jì)小于等于當(dāng)前薪水的人數(shù)所占總?cè)藬?shù)的比例
PERCENT_RANK() 分組內(nèi)當(dāng)前行的RANK值-1/分組內(nèi) 總行數(shù)-1
NTILE(n) 用于將分組數(shù)據(jù)按照順序切分成n片医增,返回當(dāng)前切片值师郑,如果切片不均勻,默認(rèn)增加第一個(gè)切片的分布调窍。NTILE不支持ROWS BETWEEN,比如NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PERCEDING AND CURRENT ROW)

Hive2.1.0及以后版本支持distinct

在聚合函數(shù)(sum, count, avg)中支持distinct宝冕,但是在order by或者 窗口限制中不支持。
conut(distinct a) over(partition by c)

Hive2.1.0以后支持在OVER從句中支持聚合函數(shù)

select rank() over(order by sum(b))

Hive2.2.0中在使用ORDER BY和窗口限制時(shí)支持distinct

count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)

通過實(shí)例深入理解窗口函數(shù)和分析函數(shù)

COUNT邓萨、SUM地梨、MIN菊卷、MAX、AVG案例分析


## 創(chuàng)建數(shù)據(jù)表
create table orders(
    user_id string,
    device_id string,
    user_type string,
    price float,
    sales int);

## 添加數(shù)據(jù)orders.txt
zhangsa test1   new     67.1    2
lisi    test2   old     43.32   1
wanger  test3   new     88.88   3
liliu   test4   new     66.0    1
tom     test5   new     54.32   1
tomas   test6   old     77.77   2
tomson  test7   old     88.44   3
tom1    test8   new     56.55   6
tom2    test9   new     88.88   5
tom3    test10  new     66.66   5

## 開窗函數(shù)案例
select
    user_id,
    user_type,
    sales,
    -- 默認(rèn)從起點(diǎn)到當(dāng)前所有重復(fù)行
    sum(sales) over(partition by user_type order by sales asc) as sales_1,
    -- 從起點(diǎn)到當(dāng)前所有重復(fù)行與sales_1結(jié)果相同
    sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2,
    -- 從起點(diǎn)到當(dāng)前行宝剖,結(jié)果與sale_1結(jié)果不同
    sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3,
    -- 當(dāng)前行加上往前3行
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4,
    -- 當(dāng)前范圍往上加3行
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5,
    -- 當(dāng)前行+往前3行+往后1行
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6,
    --
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7,
    -- 當(dāng)前行+之后所有行
    sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8,
    --
    sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9,
    -- 分組內(nèi)所有行
    sum(sales) over(partition by user_type) as sales_10
from
    orders
order by
    user_type,
    sales,
    user_id;

##上述查詢結(jié)果如下:

| user_id  | user_type  | sales  | sales_1  | sales_2  | sales_3  | sales_4  | sales_5  | sales_6  | sales_7  | sales_8  | sales_9  | sales_10  |
|----------|------------|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----------|
| liliu    | new        | 1      | 2        | 2        | 2        | 2        | 2        | 4        | 4        | 22       | 23       | 23        |
| tom      | new        | 1      | 2        | 2        | 1        | 1        | 2        | 2        | 4        | 23       | 23       | 23        |
| zhangsa  | new        | 2      | 4        | 4        | 4        | 4        | 4        | 7        | 7        | 21       | 21       | 23        |
| wanger   | new        | 3      | 7        | 7        | 7        | 7        | 7        | 12       | 7        | 19       | 19       | 23        |
| tom2     | new        | 5      | 17       | 17       | 17       | 15       | 15       | 21       | 21       | 11       | 16       | 23        |
| tom3     | new        | 5      | 17       | 17       | 12       | 11       | 15       | 16       | 21       | 16       | 16       | 23        |
| tom1     | new        | 6      | 23       | 23       | 23       | 19       | 19       | 19       | 19       | 6        | 6        | 23        |
| lisi     | old        | 1      | 1        | 1        | 1        | 1        | 1        | 3        | 3        | 6        | 6        | 6         |
| tomas    | old        | 2      | 3        | 3        | 3        | 3        | 3        | 6        | 6        | 5        | 5        | 6         |
| tomson   | old        | 3      | 6        | 6        | 6        | 6        | 6        | 6        | 6        | 3        | 3        | 6         |

注意

  1. 結(jié)果和ORDER BY相關(guān),默認(rèn)為升序
  2. 如果不指定ROWS BETWEEN,默認(rèn)為從起點(diǎn)到當(dāng)前行;
  3. 如果不指定ORDER BY洁闰,則將分組內(nèi)所有值累加;
  4. PRECEDING:往前
  5. FOLLOWING:往后
  6. CURRENT ROW:當(dāng)前行
  7. UNBOUNDED:無界限(起點(diǎn)或終點(diǎn))
  8. UNBOUNDED PRECEDING:表示從前面的起點(diǎn)
  9. UNBOUNDED FOLLOWING:表示到后面的終點(diǎn)
  10. 其他COUNT、AVG万细,MIN扑眉,MAX,和SUM用法一樣赖钞。

FIRST_VALUE和LAST_VALUE案例分析


select
    user_id,
    user_type,
    sales,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
    orders
order by
    user_type,
    sales;

##上述查詢結(jié)果如下:

| user_id | user_type | sales | row_num | max_sales_user | min_sales_user | curr_last_min_user | curr_last_max_user |
| ------- | --------- | ----- | ------- | -------------- | -------------- | ------------------ | ------------------ |
| tom     | new       | 1     | 1       | tom1           | tom            | tom                | liliu              |
| liliu   | new       | 1     | 2       | tom1           | tom            | tom                | liliu              |
| zhangsa | new       | 2     | 3       | tom1           | tom            | zhangsa            | zhangsa            |
| wanger  | new       | 3     | 4       | tom1           | tom            | wanger             | wanger             |
| tom3    | new       | 5     | 5       | tom1           | tom            | tom3               | tom2               |
| tom2    | new       | 5     | 6       | tom1           | tom            | tom3               | tom2               |
| tom1    | new       | 6     | 7       | tom1           | tom            | tom1               | tom1               |
| lisi    | old       | 1     | 1       | tomson         | lisi           | lisi               | lisi               |
| tomas   | old       | 2     | 2       | tomson         | lisi           | tomas              | tomas              |
| tomson  | old       | 3     | 3       | tomson         | lisi           | tomson             | tomson             |

LEAD與LAG


select
    user_id,
    device_id,
    sales,
    ROW_NUMBER() OVER(ORDER BY sales) AS row_num,
    lead(device_id) over (order by sales) as default_after_one_line,
    lag(device_id) over (order by sales) as default_before_one_line,
    lead(device_id,2) over (order by sales) as after_two_line,
    lag(device_id,2,'abc') over (order by sales) as before_two_line
from
    orders
order by
    sales;

上述查詢結(jié)果如下

| user_id  | device_id  | sales  | row_num  | default_after_one_line  | default_before_one_line  | after_two_line  | before_two_line  |
|----------|------------|--------|----------|-------------------------|--------------------------|-----------------|------------------|
| lisi     | test2      | 1      | 3        | test6                   | test4                    | test1           | test5            |
| liliu    | test4      | 1      | 2        | test2                   | test5                    | test6           | abc              |
| tom      | test5      | 1      | 1        | test4                   | NULL                     | test2           | abc              |
| zhangsa  | test1      | 2      | 5        | test7                   | test6                    | test3           | test2            |
| tomas    | test6      | 2      | 4        | test1                   | test2                    | test7           | test4            |
| wanger   | test3      | 3      | 7        | test10                  | test7                    | test9           | test1            |
| tomson   | test7      | 3      | 6        | test3                   | test1                    | test10          | test6            |
| tom2     | test9      | 5      | 9        | test8                   | test10                   | NULL            | test3            |
| tom3     | test10     | 5      | 8        | test9                   | test3                    | test8           | test7            |
| tom1     | test8      | 6      | 10       | NULL                    | test9                    | NULL            | test10           |

RANK腰素、ROW_NUMBER、DENSE_RANK


select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
orders;

##上述查詢結(jié)果如下

| user_id | user_type | sales |  r  | rn  | dr  |
| ------- | --------- | ----- | --- | --- | --- |
| tom1    | new       | 6     | 1   | 1   | 1   |
| tom3    | new       | 5     | 2   | 2   | 2   |
| tom2    | new       | 5     | 2   | 3   | 2   |
| wanger  | new       | 3     | 4   | 4   | 3   |
| zhangsa | new       | 2     | 5   | 5   | 4   |
| tom     | new       | 1     | 6   | 6   | 5   |
| liliu   | new       | 1     | 6   | 7   | 5   |
| tomson  | old       | 3     | 1   | 1   | 1   |
| tomas   | old       | 2     | 2   | 2   | 2   |
| lisi    | old       | 1     | 3   | 3   | 3   |

NTILE


select
    user_type,sales,
    --分組內(nèi)將數(shù)據(jù)分成2片
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    --分組內(nèi)將數(shù)據(jù)分成3片
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    --分組內(nèi)將數(shù)據(jù)分成4片
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    --將所有數(shù)據(jù)分成4片
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
    orders
order by
    user_type,
    sales;

##上述查詢結(jié)果如下
| user_type | sales | nt2 | nt3 | nt4 | all_nt4 |
| --------- | ----- | --- | --- | --- | ------- |
| new       | 1     | 1   | 1   | 1   | 1       |
| new       | 1     | 1   | 1   | 1   | 1       |
| new       | 2     | 1   | 1   | 2   | 2       |
| new       | 3     | 1   | 2   | 2   | 3       |
| new       | 5     | 2   | 2   | 3   | 4       |
| new       | 5     | 2   | 3   | 3   | 3       |
| new       | 6     | 2   | 3   | 4   | 4       |
| old       | 1     | 1   | 1   | 1   | 1       |
| old       | 2     | 1   | 2   | 2   | 2       |
| old       | 3     | 2   | 3   | 3   | 2       |

求取sale前20%的用戶ID

select
    user_id
from
(
    select
        user_id,
        NTILE(5) OVER(ORDER BY sales desc) AS nt
    from
        orders
)A
where nt=1;
## 結(jié)果如下
+----------+
| user_id  |
+----------+
| tom1     |
| tom3     |
+----------+

CUME_DIST雪营、PERCENT_RANK


select
    user_id,user_type,sales,
    --沒有partition,所有數(shù)據(jù)均為1組
    CUME_DIST() OVER(ORDER BY sales) AS cd1,
    --按照user_type進(jìn)行分組
    CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
    orders;

## 上述結(jié)果如下
+----------+------------+--------+------+----------------------+--+
| user_id  | user_type  | sales  | cd1  |         cd2          |
+----------+------------+--------+------+----------------------+--+
| liliu    | new        | 1      | 0.3  | 0.2857142857142857   |
| tom      | new        | 1      | 0.3  | 0.2857142857142857   |
| zhangsa  | new        | 2      | 0.5  | 0.42857142857142855  |
| wanger   | new        | 3      | 0.7  | 0.5714285714285714   |
| tom2     | new        | 5      | 0.9  | 0.8571428571428571   |
| tom3     | new        | 5      | 0.9  | 0.8571428571428571   |
| tom1     | new        | 6      | 1.0  | 1.0                  |
| lisi     | old        | 1      | 0.3  | 0.3333333333333333   |
| tomas    | old        | 2      | 0.5  | 0.6666666666666666   |
| tomson   | old        | 3      | 0.7  | 1.0                  |
+----------+------------+--------+------+----------------------+--+

select
    user_type,sales,
    --分組內(nèi)總行數(shù)
    SUM(1) OVER(PARTITION BY user_type) AS s,
    --RANK值
    RANK() OVER(ORDER BY sales) AS r,
    PERCENT_RANK() OVER(ORDER BY sales) AS pr,
    --分組內(nèi)
    PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
    orders;

## 上述結(jié)果如下
+------------+--------+----+-----+---------------------+---------------------+--+
| user_type  | sales  | s  |  r  |         pr          |         prg         |
+------------+--------+----+-----+---------------------+---------------------+--+
| new        | 1      | 7  | 1   | 0.0                 | 0.0                 |
| new        | 1      | 7  | 1   | 0.0                 | 0.0                 |
| new        | 2      | 7  | 4   | 0.3333333333333333  | 0.3333333333333333  |
| new        | 3      | 7  | 6   | 0.5555555555555556  | 0.5                 |
| new        | 5      | 7  | 8   | 0.7777777777777778  | 0.6666666666666666  |
| new        | 5      | 7  | 8   | 0.7777777777777778  | 0.6666666666666666  |
| new        | 6      | 7  | 10  | 1.0                 | 1.0                 |
| old        | 1      | 3  | 1   | 0.0                 | 0.0                 |
| old        | 2      | 3  | 4   | 0.3333333333333333  | 0.5                 |
| old        | 3      | 3  | 6   | 0.5555555555555556  | 1.0                 |
+------------+--------+----+-----+---------------------+---------------------+--+

參考博文

http://blog.csdn.net/scgaliguodong123_/article/details/60135385

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末弓千,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子献起,更是在濱河造成了極大的恐慌洋访,老刑警劉巖,帶你破解...
    沈念sama閱讀 211,376評(píng)論 6 491
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件谴餐,死亡現(xiàn)場離奇詭異姻政,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)岂嗓,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 90,126評(píng)論 2 385
  • 文/潘曉璐 我一進(jìn)店門汁展,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人摄闸,你說我怎么就攤上這事善镰。” “怎么了年枕?”我有些...
    開封第一講書人閱讀 156,966評(píng)論 0 347
  • 文/不壞的土叔 我叫張陵炫欺,是天一觀的道長。 經(jīng)常有香客問我熏兄,道長品洛,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 56,432評(píng)論 1 283
  • 正文 為了忘掉前任摩桶,我火速辦了婚禮桥状,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘硝清。我一直安慰自己辅斟,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 65,519評(píng)論 6 385
  • 文/花漫 我一把揭開白布芦拿。 她就那樣靜靜地躺著士飒,像睡著了一般查邢。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上酵幕,一...
    開封第一講書人閱讀 49,792評(píng)論 1 290
  • 那天扰藕,我揣著相機(jī)與錄音,去河邊找鬼芳撒。 笑死邓深,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的笔刹。 我是一名探鬼主播芥备,決...
    沈念sama閱讀 38,933評(píng)論 3 406
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼徘熔!你這毒婦竟也來了门躯?” 一聲冷哼從身側(cè)響起淆党,我...
    開封第一講書人閱讀 37,701評(píng)論 0 266
  • 序言:老撾萬榮一對情侶失蹤酷师,失蹤者是張志新(化名)和其女友劉穎,沒想到半個(gè)月后染乌,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體山孔,經(jīng)...
    沈念sama閱讀 44,143評(píng)論 1 303
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 36,488評(píng)論 2 327
  • 正文 我和宋清朗相戀三年荷憋,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了台颠。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 38,626評(píng)論 1 340
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡勒庄,死狀恐怖串前,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情实蔽,我是刑警寧澤荡碾,帶...
    沈念sama閱讀 34,292評(píng)論 4 329
  • 正文 年R本政府宣布,位于F島的核電站局装,受9級(jí)特大地震影響坛吁,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜铐尚,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,896評(píng)論 3 313
  • 文/蒙蒙 一拨脉、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧宣增,春花似錦玫膀、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,742評(píng)論 0 21
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽劳景。三九已至,卻和暖如春碉就,著一層夾襖步出監(jiān)牢的瞬間盟广,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,977評(píng)論 1 265
  • 我被黑心中介騙來泰國打工瓮钥, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留筋量,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 46,324評(píng)論 2 360
  • 正文 我出身青樓碉熄,卻偏偏與公主長得像桨武,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個(gè)殘疾皇子锈津,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 43,494評(píng)論 2 348

推薦閱讀更多精彩內(nèi)容