關於 Hive開窗函式
1、基本語法
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
[<window_expression>])
Function (arg1,..., argn) 可以是下面的函式:
Aggregate Functions: 聚合函式,比如:sum(...)、 max(...)、min(...)、avg(...)等.
Sort Functions: 資料排序函式, 比如 :rank(...)、row_number(...)等.
Analytics Functions: 統計和比較函式, 比如:lead(...)、lag(...)、 first_value(...)等.
2、前期準備:
2.1、建表語句:
CREATE TABLE IF NOT EXISTS temp.test (
`name` string COMMENT '姓名',
`dept_num` int COMMENT '編號',
`employee_id` int COMMENT 'id',
`salary` int COMMENT '工資',
`type` string COMMENT '崗位型別',
`start_date` date COMMENT '入職時間'
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED as TEXTFILE;
2.2、在本地建立test檔案:
name dept_num employee_id salary type start_date
Michael 1000 100 5000 full 2014-01-29
Will 1000 101 4000 full 2013-10-02
Wendy 1000 101 4000 part 2014-10-02
Steven 1000 102 6400 part 2012-11-03
Lucy 1000 103 5500 full 2010-01-03
Lily 1001 104 5000 part 2014-11-29
Jess 1001 105 6000 part 2014-12-02
Mike 1001 106 6400 part 2013-11-03
Wei 1002 107 7000 part 2010-04-03
Yun 1002 108 5500 full 2014-01-29
Richard 1002 109 8000 full 2013-09-01
2.3、將建立好的本地文上傳至hive庫中:
load data local inpath '/root/test' into table temp.test;
3、視窗聚合函式
3.1、查詢姓名、部門編號、工資以及部門人數
select `name`,`dept_num`,`salary`,
count(*) over (partition by dept_num) as cnt
from employee;
3.1.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
name dept_num salary cnt
Lucy 1000 5500 5
Steven 1000 6400 5
Wendy 1000 4000 5
Will 1000 4000 5
Michael 1000 5000 5
Mike 1001 6400 3
Jess 1001 6000 3
Lily 1001 5000 3
Richard 1002 8000 3
Yun 1002 5500 3
Wei 1002 7000 3
Time taken: 22.624 seconds, Fetched: 12 row(s)
3.2、查詢姓名、部門編號、工資以及每個部門的總工資,部門總工資按照降序輸出
select name,dept_num,salary,
sum(salary) over (partition by dept_num order by dept_num) as sum_dept_salary
from employee order by sum_dept_salary desc;
3.2.1、輸出結果為:
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
name dept_num salary sum_dept_salary
Michael 1000 5000 24900
Will 1000 4000 24900
Wendy 1000 4000 24900
Steven 1000 6400 24900
Lucy 1000 5500 24900
Wei 1002 7000 20500
Yun 1002 5500 20500
Richard 1002 8000 20500
Lily 1001 5000 17400
Jess 1001 6000 17400
Mike 1001 6400 17400
Time taken: 47.313 seconds, Fetched: 12 row(s)
4、視窗排序函式
4.1、查詢姓名、部門編號、工資、排名編號(按工資的多少排名)
select `name`,`dept_num`,`salary`,
row_number() over (order by salary desc ) rnum
from employee;
4.1.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 890 msec
OK
name dept_num salary rnum
Richard 1002 8000 1
Wei 1002 7000 2
Mike 1001 6400 3
Steven 1000 6400 4
Jess 1001 6000 5
Yun 1002 5500 6
Lucy 1000 5500 7
Lily 1001 5000 8
Michael 1000 5000 9
Wendy 1000 4000 10
Will 1000 4000 11
Time taken: 22.453 seconds, Fetched: 12 row(s)
4.2、查詢每個部門工資最高的兩個人的資訊(姓名、部門、薪水)
select name,dept_num,salary
from (
select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum
from employee) t1 where rnum <= 2;
4.2.1、輸出結果為:
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
name dept_num salary
Steven 1000 6400
Lucy 1000 5500
Mike 1001 6400
Jess 1001 6000
Richard 1002 8000
Wei 1002 7000
Time taken: 24.083 seconds, Fetched: 7 row(s)
4.3、查詢每個部門的員工工資排名資訊
select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum
from employee;
4.3.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
name dept_num salary rnum
Steven 1000 6400 1
Lucy 1000 5500 2
Michael 1000 5000 3
Wendy 1000 4000 4
Will 1000 4000 5
Mike 1001 6400 1
Jess 1001 6000 2
Lily 1001 5000 3
Richard 1002 8000 1
Wei 1002 7000 2
Yun 1002 5500 3
Time taken: 23.202 seconds, Fetched: 12 row(s)
4.4、使用rank函式進行排名
select `name`,`dept_num`,`salary`,
rank() over (order by salary desc) rank
from employee;
4.4.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
name dept_num salary rank
Richard 1002 8000 1
Wei 1002 7000 2
Mike 1001 6400 3
Steven 1000 6400 3
Jess 1001 6000 5
Yun 1002 5500 6
Lucy 1000 5500 6
Lily 1001 5000 8
Michael 1000 5000 8
Wendy 1000 4000 10
Will 1000 4000 10
Time taken: 21.547 seconds, Fetched: 12 row(s)
4.5、使用dense_rank進行排名
select `name`,`dept_num`,`salary`,
dense_rank() over (order by salary desc) rank
from employee;
4.5.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
name dept_num salary rank
Richard 1002 8000 1
Wei 1002 7000 2
Mike 1001 6400 3
Steven 1000 6400 3
Jess 1001 6000 4
Yun 1002 5500 5
Lucy 1000 5500 5
Lily 1001 5000 6
Michael 1000 5000 6
Wendy 1000 4000 7
Will 1000 4000 7
Time taken: 21.879 seconds, Fetched: 12 row(s)
4.6、使用percent_rank()進行排名
select name,dept_num,salary,
percent_rank() over (order by salary desc) rank
from employee;
4.6.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name dept_num salary rank
Richard 1002 8000 0.0
Wei 1002 7000 0.09090909090909091
Mike 1001 6400 0.18181818181818182
Steven 1000 6400 0.18181818181818182
Jess 1001 6000 0.36363636363636365
Yun 1002 5500 0.45454545454545453
Lucy 1000 5500 0.45454545454545453
Lily 1001 5000 0.6363636363636364
Michael 1000 5000 0.6363636363636364
Wendy 1000 4000 0.8181818181818182
Will 1000 4000 0.8181818181818182
Time taken: 22.401 seconds, Fetched: 12 row(s)
4.7、使用ntile進行資料分片排名
SELECT name,dept_num as deptno,salary,
ntile(4) OVER(ORDER BY salary desc) as ntile
FROM employee;
4.7.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name deptno salary ntile
Richard 1002 8000 1
Wei 1002 7000 1
Mike 1001 6400 1
Steven 1000 6400 2
Jess 1001 6000 2
Yun 1002 5500 2
Lucy 1000 5500 3
Lily 1001 5000 3
Michael 1000 5000 3
Wendy 1000 4000 4
Will 1000 4000 4
Time taken: 28.829 seconds, Fetched: 12 row(s)
5、視窗分析函式
5.1、統計小於等於當前工資的人數佔總人數的比例
SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary) as cume
FROM employee;
5.1.1、輸出結果為:
name deptno salary cume
Wendy 1000 4000 0.25
Will 1000 4000 0.25
Lily 1001 5000 0.4166666666666667
Michael 1000 5000 0.4166666666666667
Yun 1002 5500 0.5833333333333334
Lucy 1000 5500 0.5833333333333334
Jess 1001 6000 0.6666666666666666
Mike 1001 6400 0.8333333333333334
Steven 1000 6400 0.8333333333333334
Wei 1002 7000 0.9166666666666666
Richard 1002 8000 1.0
Time taken: 20.869 seconds, Fetched: 12 row(s)
5.2、統計小於等於當前工資的人數佔總人數的比例
SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary desc) as cume
FROM employee;
5.2.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 790 msec
OK
name dept_num salary cume
Richard 1002 8000 0.08333333333333333
Wei 1002 7000 0.16666666666666666
Mike 1001 6400 0.3333333333333333
Steven 1000 6400 0.3333333333333333
Jess 1001 6000 0.4166666666666667
Yun 1002 5500 0.5833333333333334
Lucy 1000 5500 0.5833333333333334
Lily 1001 5000 0.75
Michael 1000 5000 0.75
Wendy 1000 4000 0.9166666666666666
Will 1000 4000 0.9166666666666666
Time taken: 21.672 seconds, Fetched: 12 row(s)
5.3、統計小於等於當前工資的人數佔總人數的比例
SELECT name,dept_num,salary,
cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume
FROM employee;
5.3.1、輸出結果為:
Total MapReduce CPU Time Spent: 2 seconds 130 msec
OK
name dept_num salary cume
Wendy 1000 4000 0.4
Will 1000 4000 0.4
Michael 1000 5000 0.6
Lucy 1000 5500 0.8
Steven 1000 6400 1.0
Lily 1001 5000 0.3333333333333333
Jess 1001 6000 0.6666666666666666
Mike 1001 6400 1.0
Yun 1002 5500 0.3333333333333333
Wei 1002 7000 0.6666666666666666
Richard 1002 8000 1.0
Time taken: 22.055 seconds, Fetched: 12 row(s)
5.4、統計小於等於當前工資的人數佔總人數的比例
SELECT name,dept_num,salary,
lead(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;
5.4.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
name dept_num salary lead
Wendy 1000 4000 4000
Will 1000 4000 5000
Michael 1000 5000 5500
Lucy 1000 5500 6400
Steven 1000 6400 NULL
Lily 1001 5000 6000
Jess 1001 6000 6400
Mike 1001 6400 NULL
Yun 1002 5500 7000
Wei 1002 7000 8000
Richard 1002 8000 NULL
Time taken: 21.57 seconds, Fetched: 12 row(s)
5.5、統計小於等於當前工資的人數佔總人數的比例
SELECT name,dept_num,salary,
lag(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;
5.5.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
name dept_num salary lead
Wendy 1000 4000 NULL
Will 1000 4000 4000
Michael 1000 5000 4000
Lucy 1000 5500 5000
Steven 1000 6400 5500
Lily 1001 5000 NULL
Jess 1001 6000 5000
Mike 1001 6400 6000
Yun 1002 5500 NULL
Wei 1002 7000 5500
Richard 1002 8000 7000
Time taken: 21.423 seconds, Fetched: 12 row(s)
5.6、統計小於等於當前工資的人數佔總人數的比例
SELECT name,dept_num,salary,
first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval
FROM employee;
5.6.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 720 msec
OK
name dept_num salary fval
Wendy 1000 4000 4000
Will 1000 4000 4000
Michael 1000 5000 4000
Lucy 1000 5500 4000
Steven 1000 6400 4000
Lily 1001 5000 5000
Jess 1001 6000 5000
Mike 1001 6400 5000
Yun 1002 5500 5500
Wei 1002 7000 5500
Richard 1002 8000 5500
Time taken: 20.379 seconds, Fetched: 12 row(s)
5.7、統計小於等於當前工資的人數佔總人數的比例
SELECT name,dept_num,salary,
last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lval
FROM employee;
5.7.1、輸出結果為:
Total MapReduce CPU Time Spent: 1 seconds 770 msec
OK
name dept_num salary lval
Wendy 1000 4000 6400
Will 1000 4000 6400
Michael 1000 5000 6400
Lucy 1000 5500 6400
Steven 1000 6400 6400
Lily 1001 5000 6400
Jess 1001 6000 6400
Mike 1001 6400 6400
Yun 1002 5500 8000
Wei 1002 7000 8000
Richard 1002 8000 8000
Time taken: 21.649 seconds, Fetched: 12 row(s)
「其他文章」
- 清華大學團隊:人臉識別爆出巨大丑聞,15分鐘解鎖19款手機
- Kafka架構設計
- 你有多少個前男友?
- 這是什麼鬼文章
- CYNORA宣佈推出業界首個基於TADF深綠髮光體的裝置測試套件用於下一代OLED顯示
- Kubernetes 入門到進階實戰
- rsyslog日誌伺服器搭建
- Hystrix斷路器
- 計算機專業學習的一些思考和總結
- ThingJS官方示例(十一):基於資料向量及貼圖url開發OD線
- RTMP協議網際網路教育課堂直播點播系統EasyDSS獲取直播資訊優化設計方案介紹
- 關於 Hive開窗函式
- JVM學習系列學習七
- Spring cloud系列教程第五篇-Eureka單機版配置
- 做題基礎~~~tips
- Java第六次作業
- SSDCRM客戶關係管理系統釋出原則
- 邀請學霸分享經驗:FRM一級沒你想象中那麼難!
- 其他訊息中介軟體及場景應用(下3)
- 289頁初中級前端題助你拿下Offer