關於 Hive開窗函式

語言: CN / TW / HK

1、基本語法

    Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
    [<window_expression>])
Function (arg1,..., argn) 可以是下面的函式:
    Aggregate Functions: 聚合函式,比如:sum(...)、 max(...)、min(...)、avg(...)等.
    Sort Functions: 資料排序函式, 比如 :rank(...)、row_number(...)等.
    Analytics Functions: 統計和比較函式, 比如:lead(...)、lag(...)、 first_value(...)等.

2、前期準備:

2.1、建表語句:

CREATE TABLE IF NOT EXISTS temp.test (
                `name` string COMMENT '姓名',
                `dept_num` int COMMENT '編號',
                `employee_id` int COMMENT 'id',
                `salary` int COMMENT '工資',
                `type` string COMMENT '崗位型別',
                `start_date` date COMMENT '入職時間'
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED as TEXTFILE;

2.2、在本地建立test檔案:

name    dept_num    employee_id salary  type    start_date
Michael 1000        100         5000    full    2014-01-29
Will    1000        101         4000    full    2013-10-02
Wendy   1000        101         4000    part    2014-10-02
Steven  1000        102         6400    part    2012-11-03
Lucy    1000        103         5500    full    2010-01-03
Lily    1001        104         5000    part    2014-11-29
Jess    1001        105         6000    part    2014-12-02
Mike    1001        106         6400    part    2013-11-03
Wei     1002        107         7000    part    2010-04-03
Yun     1002        108         5500    full    2014-01-29
Richard 1002        109         8000    full    2013-09-01

2.3、將建立好的本地文上傳至hive庫中:

load data local inpath '/root/test' into table temp.test;

3、視窗聚合函式

3.1、查詢姓名、部門編號、工資以及部門人數

select `name`,`dept_num`,`salary`,
count(*) over (partition by dept_num) as cnt 
from employee;

3.1.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
name    dept_num    salary    cnt
Lucy    1000        5500      5
Steven  1000        6400      5
Wendy   1000        4000      5
Will    1000        4000      5
Michael 1000        5000      5
Mike    1001        6400      3
Jess    1001        6000      3
Lily    1001        5000      3
Richard 1002        8000      3
Yun     1002        5500      3
Wei     1002        7000      3
Time taken: 22.624 seconds, Fetched: 12 row(s)

3.2、查詢姓名、部門編號、工資以及每個部門的總工資,部門總工資按照降序輸出

select name,dept_num,salary,
sum(salary) over (partition by dept_num order by dept_num) as sum_dept_salary 
from employee order by sum_dept_salary desc;

3.2.1、輸出結果為:

Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
name    dept_num    salary    sum_dept_salary
Michael 1000        5000      24900
Will    1000        4000      24900
Wendy   1000        4000      24900
Steven  1000        6400      24900
Lucy    1000        5500      24900
Wei     1002        7000      20500
Yun     1002        5500      20500
Richard 1002        8000      20500
Lily    1001        5000      17400
Jess    1001        6000      17400
Mike    1001        6400      17400
Time taken: 47.313 seconds, Fetched: 12 row(s)

4、視窗排序函式

4.1、查詢姓名、部門編號、工資、排名編號(按工資的多少排名)

select `name`,`dept_num`,`salary`,
row_number() over (order by salary desc ) rnum 
from employee;

4.1.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 890 msec
OK
name    dept_num    salary    rnum
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      4
Jess    1001        6000      5
Yun     1002        5500      6
Lucy    1000        5500      7
Lily    1001        5000      8
Michael 1000        5000      9
Wendy   1000        4000      10
Will    1000        4000      11
Time taken: 22.453 seconds, Fetched: 12 row(s)

4.2、查詢每個部門工資最高的兩個人的資訊(姓名、部門、薪水)

select name,dept_num,salary 
from (
select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum 
from employee) t1 where rnum <= 2;

4.2.1、輸出結果為:

Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
name    dept_num    salary
Steven  1000        6400
Lucy    1000        5500
Mike    1001        6400
Jess    1001        6000
Richard 1002        8000
Wei     1002        7000
Time taken: 24.083 seconds, Fetched: 7 row(s)

4.3、查詢每個部門的員工工資排名資訊

select `name`,`dept_num`,`salary`,
row_number() over (partition by dept_num order by salary desc ) rnum 
from employee;

4.3.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 860 msec
OK
name    dept_num    salary    rnum
Steven  1000        6400      1
Lucy    1000        5500      2
Michael 1000        5000      3
Wendy   1000        4000      4
Will    1000        4000      5
Mike    1001        6400      1
Jess    1001        6000      2
Lily    1001        5000      3
Richard 1002        8000      1
Wei     1002        7000      2
Yun     1002        5500      3
Time taken: 23.202 seconds, Fetched: 12 row(s)

4.4、使用rank函式進行排名

select `name`,`dept_num`,`salary`,
rank() over (order by salary desc) rank
from employee;

4.4.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
name    dept_num    salary    rank
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      3
Jess    1001        6000      5
Yun     1002        5500      6
Lucy    1000        5500      6
Lily    1001        5000      8
Michael 1000        5000      8
Wendy   1000        4000      10
Will    1000        4000      10
Time taken: 21.547 seconds, Fetched: 12 row(s)

4.5、使用dense_rank進行排名

select `name`,`dept_num`,`salary`,
dense_rank() over (order by salary desc) rank
from employee;

4.5.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
name    dept_num    salary    rank
Richard 1002        8000      1
Wei     1002        7000      2
Mike    1001        6400      3
Steven  1000        6400      3
Jess    1001        6000      4
Yun     1002        5500      5
Lucy    1000        5500      5
Lily    1001        5000      6
Michael 1000        5000      6
Wendy   1000        4000      7
Will    1000        4000      7
Time taken: 21.879 seconds, Fetched: 12 row(s)

4.6、使用percent_rank()進行排名

select name,dept_num,salary,
percent_rank() over (order by salary desc) rank
from employee;

4.6.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name    dept_num    salary  rank
Richard 1002        8000    0.0
Wei     1002        7000    0.09090909090909091
Mike    1001        6400    0.18181818181818182
Steven  1000        6400    0.18181818181818182
Jess    1001        6000    0.36363636363636365
Yun     1002        5500    0.45454545454545453
Lucy    1000        5500    0.45454545454545453
Lily    1001        5000    0.6363636363636364
Michael 1000        5000    0.6363636363636364
Wendy   1000        4000    0.8181818181818182
Will    1000        4000    0.8181818181818182
Time taken: 22.401 seconds, Fetched: 12 row(s)

4.7、使用ntile進行資料分片排名

SELECT name,dept_num as deptno,salary,
ntile(4) OVER(ORDER BY salary desc) as ntile
FROM employee;

4.7.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 940 msec
OK
name    deptno  salary  ntile
Richard 1002    8000    1
Wei     1002    7000    1
Mike    1001    6400    1
Steven  1000    6400    2
Jess    1001    6000    2
Yun     1002    5500    2
Lucy    1000    5500    3
Lily    1001    5000    3
Michael 1000    5000    3
Wendy   1000    4000    4
Will    1000    4000    4
Time taken: 28.829 seconds, Fetched: 12 row(s)

5、視窗分析函式

5.1、統計小於等於當前工資的人數佔總人數的比例

SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary) as cume
FROM employee;

5.1.1、輸出結果為:

name    deptno  salary  cume
Wendy   1000    4000    0.25
Will    1000    4000    0.25
Lily    1001    5000    0.4166666666666667
Michael 1000    5000    0.4166666666666667
Yun     1002    5500    0.5833333333333334
Lucy    1000    5500    0.5833333333333334
Jess    1001    6000    0.6666666666666666
Mike    1001    6400    0.8333333333333334
Steven  1000    6400    0.8333333333333334
Wei     1002    7000    0.9166666666666666
Richard 1002    8000    1.0
Time taken: 20.869 seconds, Fetched: 12 row(s)

5.2、統計小於等於當前工資的人數佔總人數的比例

SELECT name,dept_num,salary,
cume_dist() OVER (ORDER BY salary desc) as cume
FROM employee;

5.2.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 790 msec
OK
name    dept_num    salary  cume
Richard 1002        8000    0.08333333333333333
Wei     1002        7000    0.16666666666666666
Mike    1001        6400    0.3333333333333333
Steven  1000        6400    0.3333333333333333
Jess    1001        6000    0.4166666666666667
Yun     1002        5500    0.5833333333333334
Lucy    1000        5500    0.5833333333333334
Lily    1001        5000    0.75
Michael 1000        5000    0.75
Wendy   1000        4000    0.9166666666666666
Will    1000        4000    0.9166666666666666
Time taken: 21.672 seconds, Fetched: 12 row(s)

5.3、統計小於等於當前工資的人數佔總人數的比例

SELECT name,dept_num,salary,
cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume
FROM employee;

5.3.1、輸出結果為:

Total MapReduce CPU Time Spent: 2 seconds 130 msec
OK
name    dept_num    salary  cume
Wendy   1000        4000    0.4
Will    1000        4000    0.4
Michael 1000        5000    0.6
Lucy    1000        5500    0.8
Steven  1000        6400    1.0
Lily    1001        5000    0.3333333333333333
Jess    1001        6000    0.6666666666666666
Mike    1001        6400    1.0
Yun     1002        5500    0.3333333333333333
Wei     1002        7000    0.6666666666666666
Richard 1002        8000    1.0
Time taken: 22.055 seconds, Fetched: 12 row(s)

5.4、統計小於等於當前工資的人數佔總人數的比例

SELECT name,dept_num,salary,
lead(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;

5.4.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 880 msec
OK
name    dept_num    salary  lead
Wendy   1000        4000    4000
Will    1000        4000    5000
Michael 1000        5000    5500
Lucy    1000        5500    6400
Steven  1000        6400    NULL
Lily    1001        5000    6000
Jess    1001        6000    6400
Mike    1001        6400    NULL
Yun     1002        5500    7000
Wei     1002        7000    8000
Richard 1002        8000    NULL
Time taken: 21.57 seconds, Fetched: 12 row(s)

5.5、統計小於等於當前工資的人數佔總人數的比例

SELECT name,dept_num,salary,
lag(salary,1) OVER (PARTITION BY dept_num ORDER BY salary) as lead
FROM employee;

5.5.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
name    dept_num    salary  lead
Wendy   1000        4000    NULL
Will    1000        4000    4000
Michael 1000        5000    4000
Lucy    1000        5500    5000
Steven  1000        6400    5500
Lily    1001        5000    NULL
Jess    1001        6000    5000
Mike    1001        6400    6000
Yun     1002        5500    NULL
Wei     1002        7000    5500
Richard 1002        8000    7000
Time taken: 21.423 seconds, Fetched: 12 row(s)

5.6、統計小於等於當前工資的人數佔總人數的比例

SELECT name,dept_num,salary,
first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval
FROM employee;

5.6.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 720 msec
OK
name    dept_num    salary  fval
Wendy   1000        4000    4000
Will    1000        4000    4000
Michael 1000        5000    4000
Lucy    1000        5500    4000
Steven  1000        6400    4000
Lily    1001        5000    5000
Jess    1001        6000    5000
Mike    1001        6400    5000
Yun     1002        5500    5500
Wei     1002        7000    5500
Richard 1002        8000    5500
Time taken: 20.379 seconds, Fetched: 12 row(s)

5.7、統計小於等於當前工資的人數佔總人數的比例

SELECT name,dept_num,salary,
last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lval
FROM employee;

5.7.1、輸出結果為:

Total MapReduce CPU Time Spent: 1 seconds 770 msec
OK
name    dept_num    salary  lval
Wendy   1000        4000    6400
Will    1000        4000    6400
Michael 1000        5000    6400
Lucy    1000        5500    6400
Steven  1000        6400    6400
Lily    1001        5000    6400
Jess    1001        6000    6400
Mike    1001        6400    6400
Yun     1002        5500    8000
Wei     1002        7000    8000
Richard 1002        8000    8000
Time taken: 21.649 seconds, Fetched: 12 row(s)