[大数据] Hive系列之三：开窗函数详解

开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 大数据 -> Hive系列之三：开窗函数详解 -> 正文阅读

[大数据]Hive系列之三：开窗函数详解

开窗函数简介

普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。

因此,普通的聚合函数每组(Group by)只返回一个值，而开窗函数则可为窗口中的每行都返回一个值。

简单理解，就是对查询的结果多出一列，这一列可以是聚合值，也可以是排序值。
开窗函数一般分为两类,聚合开窗函数和排序开窗函数。

准备数据

建立学生成绩表

0: jdbc:hive2://master:10000> use myhive;
OK
No rows affected (1.61 seconds)
0: jdbc:hive2://master:10000> create table stu_scores(
. . . . . . . . . . . . . . > id int,
. . . . . . . . . . . . . . > stu_id int,
. . . . . . . . . . . . . . > chinese int,
. . . . . . . . . . . . . . > math int,
. . . . . . . . . . . . . . > english int,
. . . . . . . . . . . . . . > class_id string,
. . . . . . . . . . . . . . > dept_Id string
. . . . . . . . . . . . . . > )row format delimited fields terminated by '/t';
OK
No rows affected (4.162 seconds)
0: jdbc:hive2://master:10000> show tables;
OK
+---------------------------+
|         tab_name          |
+---------------------------+
| bucket                    |
| bucket_com                |
| employee                  |
| mix_partition             |
| salary_dynamic_partition  |
| static_mix                |
| stu                       |
| stu_scores                |
+---------------------------+
8 rows selected (1.314 seconds)
0: jdbc:hive2://master:10000>

插入数据，数据样例如下stu_scores_input.txt

1       10001   69      70      71      C1      D1
2       10002   66      71      56      C1      D1
3       10003   60      72      57      C1      D1
4       10004   56      78      58      C1      D1
5       10005   98      79      59      C1      D1
6       10006   56      80      60      C2      D1
7       10007   100     81      61      C2      D1
8       10008   78      82      76      C2      D1
9       10009   90      80      77      C2      D1
10      10010   87      81      78      C2      D1
11      10011   66      82      79      C1      D1
12      10012   98      83      80      C1      D2
13      10013   97      84      81      C1      D2
14      10014   87      85      85      C1      D2
15      10015   98      94      86      C1      D2
16      10016   100     95      87      C2      D2
17      10017   68      96      88      C2      D2
18      10018   50      97      89      C1      D2
19      10019   42      98      90      C1      D2
20      10020   59      99      91      C1      D2

上传到hdfs

[hadoop@master sql_script]$ hdfs dfs -put stu_scores_input.txt

导入数据

load data inpath '/user/hadoop/stu_scores_input.txt' into table stu_scores;

聚合开窗函数

count开窗函数

执行下面脚本

select 
stu_id
,math
,dept_id
-- 以符合条件的所有行作为窗口
,count(math) over() as ct1
-- 以按class_id分组、按math排序的所有行作为窗口
,count(math) over(partition by class_id) as ct2
-- 以按class_id分组、按math排序的所有行作为窗口
,count(math) over(partition by class_id order by math) as ct3
-- 以按class_id分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口
,count(math) over(partition by class_id order by math rows between 1 preceding and 2 following) as ct4
from myhive.stu_scores
where dept_id = 'D1';

结果：

+---------+-------+----------+------+------+------+------+
| stu_id  | math  | dept_id  | ct1  | ct2  | ct3  | ct4  |
+---------+-------+----------+------+------+------+------+
| 10001   | 70    | D1       | 11   | 6    | 1    | 3    |
| 10002   | 71    | D1       | 11   | 6    | 2    | 4    |
| 10003   | 72    | D1       | 11   | 6    | 3    | 4    |
| 10004   | 78    | D1       | 11   | 6    | 4    | 4    |
| 10005   | 79    | D1       | 11   | 6    | 5    | 3    |
| 10011   | 82    | D1       | 11   | 6    | 6    | 2    |
| 10009   | 80    | D1       | 11   | 5    | 2    | 3    |
| 10006   | 80    | D1       | 11   | 5    | 2    | 4    |
| 10010   | 81    | D1       | 11   | 5    | 4    | 4    |
| 10007   | 81    | D1       | 11   | 5    | 4    | 3    |
| 10008   | 82    | D1       | 11   | 5    | 5    | 2    |
+---------+-------+----------+------+------+------+------+

解释：

在dept_id = 'D1'的条件下，对于stu_id = '10010'
ct1:统计所有行数位11
ct2:统计class_id = 'C2' 进行分组的个数5，
ct3:统计class_id = 'C2'，math <= 81的个数4,
ct4:10010所在行，向前（上）+1行，向后（下）+2行，统计math个数4

sum开窗函数

示例代码

select 
stu_id
,math
,class_id
-- 以符合条件的所有行作为窗口
,sum(math) over() as ct1
-- 以按class_id分组、按math排序的所有行作为窗口
,sum(math) over(partition by class_id) as ct2
-- 以按class_id分组、按math排序的所有行作为窗口
,sum(math) over(partition by class_id order by math) as ct3
-- 以按class_id分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口
,sum(math) over(partition by class_id order by math rows between 1 preceding and 2 following) as ct4
from myhive.stu_scores
where dept_id = 'D1';

结果：

+---------+-------+-----------+------+------+------+------+
| stu_id  | math  | class_id  | ct1  | ct2  | ct3  | ct4  |
+---------+-------+-----------+------+------+------+------+
| 10001   | 70    | C1        | 856  | 452  | 70   | 213  |
| 10002   | 71    | C1        | 856  | 452  | 141  | 291  |
| 10003   | 72    | C1        | 856  | 452  | 213  | 300  |
| 10004   | 78    | C1        | 856  | 452  | 291  | 311  |
| 10005   | 79    | C1        | 856  | 452  | 370  | 239  |
| 10011   | 82    | C1        | 856  | 452  | 452  | 161  |
| 10009   | 80    | C2        | 856  | 404  | 160  | 241  |
| 10006   | 80    | C2        | 856  | 404  | 160  | 322  |
| 10010   | 81    | C2        | 856  | 404  | 322  | 324  |
| 10007   | 81    | C2        | 856  | 404  | 322  | 244  |
| 10008   | 82    | C2        | 856  | 404  | 404  | 163  |
+---------+-------+-----------+------+------+------+------+

min开窗函数

示例代码：

0: jdbc:hive2://master:10000> SELECT 
. . . . . . . . . . . . . . > stu_id
. . . . . . . . . . . . . . > ,math
. . . . . . . . . . . . . . > ,class_id
. . . . . . . . . . . . . . > -- 以符合条件的所有行作为窗口
. . . . . . . . . . . . . . > ,MIN(math) OVER() AS ct1
. . . . . . . . . . . . . . > -- 以按class_id分组、按math排序的所有行作为窗口
. . . . . . . . . . . . . . > ,MIN(math) OVER(PARTITION BY class_id) AS ct2
. . . . . . . . . . . . . . > -- 以按class_id分组、按math排序的所有行作为窗口
. . . . . . . . . . . . . . > ,MIN(math) OVER(PARTITION BY class_id ORDER BY math) AS ct3
. . . . . . . . . . . . . . > -- 以按class_id分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口
. . . . . . . . . . . . . . > ,MIN(math) OVER(PARTITION BY class_id ORDER BY math ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) AS ct4
. . . . . . . . . . . . . . > FROM myhive.stu_scores
. . . . . . . . . . . . . . > WHERE dept_id = 'D1';

结果

+-----------+---------+-------------+--------+--------+--------+--------+
| stu_id    | math    | class_id    | ct1    | ct2    | ct3    | ct4    |
+-----------+---------+-------------+--------+--------+--------+--------+
| 10001     | 70      | C1          | 70     | 70     | 70     | 70     |
| 10002     | 71      | C1          | 70     | 70     | 70     | 70     |
| 10003     | 72      | C1          | 70     | 70     | 70     | 71     |
| 10004     | 78      | C1          | 70     | 70     | 70     | 72     |
| 10005     | 79      | C1          | 70     | 70     | 70     | 78     |
| 10011     | 82      | C1          | 70     | 70     | 70     | 79     |
| 10006     | 80      | C2          | 70     | 80     | 80     | 80     |
| 10009     | 80      | C2          | 70     | 80     | 80     | 80     |
| 10007     | 81      | C2          | 70     | 80     | 80     | 80     |
| 10010     | 81      | C2          | 70     | 80     | 80     | 81     |
| 10008     | 82      | C2          | 70     | 80     | 80     | 81     |
+-----------+---------+-------------+--------+--------+--------+--------+

max开窗函数

示例代码：

select 
stu_id
,math
,class_id
-- 以符合条件的所有行作为窗口
,max(math) over() as ct1
-- 以按class_id分组所有行作为窗口
,max(math) over(partition by class_id) as ct2
-- 以按class_id分组、按math排序的所有行作为窗口
,max(math) over(partition by class_id order by math) as ct3
-- 以按class_id分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口
,max(math) over(partition by class_id order by math rows between 1 preceding and 2 following) as ct4
from myhive.stu_scores
where dept_id = 'D1';

结果：

+-----------+---------+-------------+--------+--------+--------+--------+
| stu_id    | math    | class_id    | ct1    | ct2    | ct3    | ct4    |
+-----------+---------+-------------+--------+--------+--------+--------+
| 10001     | 70      | C1          | 82     | 82     | 70     | 72     |
| 10002     | 71      | C1          | 82     | 82     | 71     | 78     |
| 10003     | 72      | C1          | 82     | 82     | 72     | 79     |
| 10004     | 78      | C1          | 82     | 82     | 78     | 82     |
| 10005     | 79      | C1          | 82     | 82     | 79     | 82     |
| 10011     | 82      | C1          | 82     | 82     | 82     | 82     |
| 10006     | 80      | C2          | 82     | 82     | 80     | 81     |
| 10009     | 80      | C2          | 82     | 82     | 80     | 81     |
| 10007     | 81      | C2          | 82     | 82     | 81     | 82     |
| 10010     | 81      | C2          | 82     | 82     | 81     | 82     |
| 10008     | 82      | C2          | 82     | 82     | 82     | 82     |
+-----------+---------+-------------+--------+--------+--------+--------+

avg开窗函数

示例代码：

select 
stu_id
,math
,class_id
--以符合条件的所有行作为窗口
,avg(math) over() as ct1
-- 以按class_id分组所有行作为窗口
,avg(math) over(partition by class_id) as ct2
-- 以按class_id分组、按math排序的所有行作为窗口
,avg(math) over(partition by class_id order by math) as ct3
-- 以按class_id分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口
,avg(math) over(partition by class_id order by math rows between 1 preceding and 2 following) as ct4
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+-----------+-----------+-----------+-----------+
| stu_id   | math   | class_id   | ct1       | ct2       | ct3       | ct4       |
+----------+--------+------------+-----------+-----------+-----------+-----------+
| 10001    | 70     | C1         | 77.8182   | 75.3333   | 70.0000   | 71.0000   |
| 10002    | 71     | C1         | 77.8182   | 75.3333   | 70.5000   | 72.7500   |
| 10003    | 72     | C1         | 77.8182   | 75.3333   | 71.0000   | 75.0000   |
| 10004    | 78     | C1         | 77.8182   | 75.3333   | 72.7500   | 77.7500   |
| 10005    | 79     | C1         | 77.8182   | 75.3333   | 74.0000   | 79.6667   |
| 10011    | 82     | C1         | 77.8182   | 75.3333   | 75.3333   | 80.5000   |
| 10006    | 80     | C2         | 77.8182   | 80.8000   | 80.0000   | 80.3333   |
| 10009    | 80     | C2         | 77.8182   | 80.8000   | 80.0000   | 80.5000   |
| 10007    | 81     | C2         | 77.8182   | 80.8000   | 80.5000   | 81.0000   |
| 10010    | 81     | C2         | 77.8182   | 80.8000   | 80.5000   | 81.3333   |
| 10008    | 82     | C2         | 77.8182   | 80.8000   | 80.8000   | 81.5000   |
+----------+--------+------------+-----------+-----------+-----------+-----------+

位置开窗函数

位置开窗函数，也是聚合函数的一种l类别。

first_value开窗函数

函数功能：

返回分区中的第一个值。

示例代码：

select 
stu_id
,math
,class_id
-- 以符合条件的所有行作为窗口
,first_value(math) over() as ct1
-- 以按class_id分组的所有行作为窗口
,first_value(math) over(partition by class_id) as ct2
-- 以按class_id分组、按math排序的所有行作为窗口
,first_value(math) over(partition by class_id order by math) as ct3
-- 以按class_id分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口
,first_value(math) over(partition by class_id order by math rows between 1 preceding and 2 following) as ct4
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+-------+-------+-------+-------+
|stu_id    |math    |class_id    |ct1    |ct2    |ct3    |ct4    |
+----------+--------+------------+-------+-------+-------+-------+
| 10001    |  70    |C1          | 70    | 70    | 70    | 70    |
| 10002    |  71    |C1          | 70    | 70    | 70    | 70    |
| 10003    |  72    |C1          | 70    | 70    | 70    | 71    |
| 10004    |  78    |C1          | 70    | 70    | 70    | 72    |
| 10005    |  79    |C1          | 70    | 70    | 70    | 78    |
| 10011    |  82    |C1          | 70    | 70    | 70    | 79    |
| 10006    |  80    |C2          | 70    | 80    | 80    | 80    |
| 10009    |  80    |C2          | 70    | 80    | 80    | 80    |
| 10007    |  81    |C2          | 70    | 80    | 80    | 80    |
| 10010    |  81    |C2          | 70    | 80    | 80    | 81    |
| 10008    |  82    |C2          | 70    | 80    | 80    | 81    |
+----------+--------+------------+-------+-------+-------+-------+

last_value开窗函数

函数功能：

取分组内排序后，截止到当前行，最后一个值。

示例代码：

select 
stu_id
,math
,class_id
-- 以符合条件的所有行作为窗口
,last_value(math) over() as ct1
-- 以按class_id分组的所有行作为窗口
,last_value(math) over(partition by class_id) as ct2
-- 以按class_id分组、按math排序的所有行作为窗口
,last_value(math) over(partition by class_id order by math) as ct3
-- 以按class_id分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口
,last_value(math) over(partition by class_id order by math rows between 1 preceding and 2 following) as ct4
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+-------+-------+-------+-------+
|stu_id    |math    |class_id    |ct1    |ct2    |ct3    |ct4    |
+----------+--------+------------+-------+-------+-------+-------+
| 10001    |  70    |C1          | 82    | 82    | 70    | 72    |
| 10002    |  71    |C1          | 82    | 82    | 71    | 78    |
| 10003    |  72    |C1          | 82    | 82    | 72    | 79    |
| 10004    |  78    |C1          | 82    | 82    | 78    | 82    |
| 10005    |  79    |C1          | 82    | 82    | 79    | 82    |
| 10011    |  82    |C1          | 82    | 82    | 82    | 82    |
| 10006    |  80    |C2          | 82    | 81    | 80    | 81    |
| 10009    |  80    |C2          | 82    | 81    | 80    | 81    |
| 10007    |  81    |C2          | 82    | 81    | 81    | 82    |
| 10010    |  81    |C2          | 82    | 81    | 81    | 82    |
| 10008    |  82    |C2          | 82    | 81    | 82    | 82    |
+----------+--------+------------+-------+-------+-------+-------+

分析：

看上面结果可以发现，ct1值和预期一样。
ct2的返回值，在C1分组是正确的，但在C2分组中返回值为81而不是82。
ct3的返回值，则等于当前行的值。
ct4的值为当前行向下偏移2行的值。
可以说，输出结果其实不符合我们预期的结果，即返回分组中最后一行的值，为什么会这样呢？

1、首先，分析函数不是聚合函数。聚合函数将多个输入行折叠为一个输出行，而分析函数为每个输入行仅计算一个输出行。因此，您需要确保正在考虑每个输入行的输出是什么。
2、其次，分析函数在行的“窗口”上操作，该窗口是该行所属的“分区”的子集。输入行的分区由PARTITION BY子句确定，或者如果您希望分区是整个输入行集，则可以将其忽略。该窗口由ROWS子句提供，但是如果您未指定窗口(用户通常不指定)，则默认为整个分区(不应用任何顺序时)或该分区中行的集合。第一行到当前行(当存在ORDER BY时)。请注意，分区中每个输入行的窗口可能会有所不同!
3、回到LAST_VALUE。尽管上述默认窗口在许多情况下都是合理的(例如，计算累加和)，但使用LAST_VALUE时，它的效果极差。 LAST_VALUE函数返回窗口中最后一行的值，默认情况下，窗口中的最后一行是当前行。

因此，要解决该问题，需要明确指定LAST_VALUE的窗口是整个分区，而不仅仅是整个当前行。

# 可以进行如下操作，指定rows语句
SELECT x, LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table

# 示例脚本
select 
stu_id
,math
,class_id
,last_value(math) over() as ct1
,last_value(math) over(partition by class_id  order by math rows between unbounded preceding and unbounded following) as ct2
,last_value(math) over(partition by class_id order by math rows between 1 preceding and 2 following) as ct4
from myhive.stu_scores
where dept_id = 'D1'；
# 返回结果
+----------+--------+------------+-------+-------+-------+
|stu_id    |math    |class_id    |ct1    |ct2    |ct4    |
+----------+--------+------------+-------+-------+-------+
| 10001    |  70    |C1          | 82    | 82    | 72    |
| 10002    |  71    |C1          | 82    | 82    | 78    |
| 10003    |  72    |C1          | 82    | 82    | 79    |
| 10004    |  78    |C1          | 82    | 82    | 82    |
| 10005    |  79    |C1          | 82    | 82    | 82    |
| 10011    |  82    |C1          | 82    | 82    | 82    |
| 10006    |  80    |C2          | 82    | 82    | 81    |
| 10009    |  80    |C2          | 82    | 82    | 81    |
| 10007    |  81    |C2          | 82    | 82    | 82    |
| 10010    |  81    |C2          | 82    | 82    | 82    |
| 10008    |  82    |C2          | 82    | 82    | 82    |
+----------+--------+------------+-------+-------+-------+

需要特别注意order by和rows的运用，不同字段，返回结果是不同的

# 示例1,无rows，当前行即最后一行 注释# 和 -- 一样
select
  x,
  first_value(x) over (order by x asc) first_asc,
  first_value(x) over (order by x desc) first_desc,
  last_value(x) over (order by x asc) last_asc,
  last_value(x) over (order by x desc) last_desc
from
  (select 4 as x
	union all
  select 2 as x
	union all
  select 1 as x
	union all
  select 3 as x)t
  
# 结果 
+-+---------+----------+--------+---------+
|x|first_asc|first_desc|last_asc|last_desc|
+-+---------+----------+--------+---------+
|4|        1|         4|       4|        4|
|3|        1|         4|       3|        3|
|2|        1|         4|       2|        2|
|1|        1|         4|       1|        1|
+-+---------+----------+--------+---------+


# 示例2,有rows
select
  x,
  first_value(x) over (order by x asc) first_asc,
  first_value(x) over (order by x desc) first_desc,
  last_value(x) over (order by x asc rows between unbounded preceding and unbounded following) last_asc,
  last_value(x) over (order by x desc rows between unbounded preceding and unbounded following) last_desc
from
  (select 4 as x
	union all
  select 2 as x
	union all
  select 1 as x
	union all
  select 3 as x)t
  
# 结果
+-+---------+----------+--------+---------+
|x|first_asc|first_desc|last_asc|last_desc|
+-+---------+----------+--------+---------+
|4|        1|         4|       4|        1|
|3|        1|         4|       4|        1|
|2|        1|         4|       4|        1|
|1|        1|         4|       4|        1|
+-+---------+----------+--------+---------+

注意上面两个结果中last_asc和last_desc的值

lag开窗函数

函数功能：

lag(col,n,default) 用于统计窗口内往上第n个值,即取每个分区内某列的前面的第n个值。
    col:列名
    n:往上第n行
    default:往上第n行为NULL时候，取默认值,不指定则取NULL

示例代码：

# lag(math,2,0) 取math列当前行往上2行的值，如果为空用0补足。
select 
stu_id
,math
,class_id
,lag(math,2,0) over(partition by class_id  order by math rows between unbounded preceding and unbounded following) as lag_value
from myhive.stu_scores
where dept_id = 'D1'

结果：

+----------+--------+------------+-------------+
|stu_id    |math    |class_id    |lag_value    |
-----------+--------+------------+-------------+
| 10001    |  70    |C1          |        0    |
| 10002    |  71    |C1          |        0    |
| 10003    |  72    |C1          |       70    |
| 10004    |  78    |C1          |       71    |
| 10005    |  79    |C1          |       72    |
| 10011    |  82    |C1          |       78    |
| 10006    |  80    |C2          |        0    |
| 10009    |  80    |C2          |        0    |
| 10007    |  81    |C2          |       80    |
| 10010    |  81    |C2          |       80    |
| 10008    |  82    |C2          |       81    |
-----------+--------+------------+-------------+
# math列，第一行70、第二行71两个值，向上取2个值为null，结果为0
# math列，第三行72，向上取2个值为70

lead开窗函数

函数功能：

lead(col,n,default) 用于统计窗口内往下第n个值,即每个分区内某列的后n个值。
    col:列名
    n:往下第n行
    default:往下第n行为NULL时候，取默认值,不指定则取NULL

示例代码：

# lag(math,2,0) 取math列当前行向下2行的值，如果为空用0补足。
select 
stu_id
,math
,class_id
,lead(math,2,0) over(partition by class_id  order by math rows between unbounded preceding and unbounded following) as lag_value
from myhive.stu_scores
where dept_id = 'D1'

结果：

+----------+--------+------------+-------------+
|stu_id    |math    |class_id    |lag_value    |
+----------+--------+------------+-------------+
| 10001    |  70    |C1          |       72    |
| 10002    |  71    |C1          |       78    |
| 10003    |  72    |C1          |       79    |
| 10004    |  78    |C1          |       82    |
| 10005    |  79    |C1          |        0    |
| 10011    |  82    |C1          |        0    |
| 10006    |  80    |C2          |       81    |
| 10009    |  80    |C2          |       81    |
| 10007    |  81    |C2          |       82    |
| 10010    |  81    |C2          |        0    |
| 10008    |  82    |C2          |        0    |
+----------+--------+------------+-------------+

cume_dist开窗函数

函数功能：

cume_dist 返回小于等于当前值的行数/分组内总行数。
计算某个窗口或分区中某个值的累积分布。假定升序排序，则使用以下公式确定累积分布：
小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中，x 等于 order by 子句中指定的列的当前行中的值。

示例代码：

select 
stu_id
,math
,class_id
,round(cume_dist() over(order by math ),2) as cd1
,round(cume_dist() over(partition by class_id order by math),2) as cd2
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+--------+--------+
|stu_id    |math    |class_id    |cd1     |cd2     |
+----------+--------+------------+--------+--------+
| 10001    |  70    |C1          |0.09    |0.17    |
| 10002    |  71    |C1          |0.18    |0.33    |
| 10003    |  72    |C1          |0.27    | 0.5    |
| 10004    |  78    |C1          |0.36    |0.67    |
| 10005    |  79    |C1          |0.45    |0.83    |
| 10011    |  82    |C1          | 1.0    | 1.0    |
| 10006    |  80    |C2          |0.64    | 0.4    |
| 10009    |  80    |C2          |0.64    | 0.4    |
| 10007    |  81    |C2          |0.82    | 0.8    |
| 10010    |  81    |C2          |0.82    | 0.8    |
| 10008    |  82    |C2          | 1.0    | 1.0    |
+----------+--------+------------+--------+--------+

分析：

1、cd1中由于没有指定partition,所以是针对整个结果集进行统计，比如id为10001的math=70，则小于等于70的只有1行数据，整个结果集为11行，即1/11 = 0.09,其余返回结果同理可得。
2、cd2中指定了partition，所以计算返回结果区分class_id = 'C1' 和class_id = 'C2'两组进行分别计算。

排序开窗函数

row_number开窗函数

函数功能：

从1开始对分区内的数据排序。

示例代码：

select 
stu_id
,math
,class_id
,row_number() over(partition by class_id order by math) as rn
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+------+
|stu_id    |math    |class_id    |rn    |
+----------+--------+------------+------+
| 10001    |  70    |C1          | 1    |
| 10002    |  71    |C1          | 2    |
| 10003    |  72    |C1          | 3    |
| 10004    |  78    |C1          | 4    |
| 10005    |  79    |C1          | 5    |
| 10011    |  82    |C1          | 6    |
| 10006    |  80    |C2          | 1    |
| 10009    |  80    |C2          | 2    |
| 10007    |  81    |C2          | 3    |
| 10010    |  81    |C2          | 4    |
| 10008    |  82    |C2          | 5    |
+----------+--------+------------+------+

rank开窗函数

语法：

rank 开窗函数基于 over 子句中的 order by 确定一组值中一个值的排名。如果存在partition by ,则为每个分区组中的每个值排名。排名可能不是连续的。例如，如果两个行的排名为 1，则下一个排名为 3。

示例代码：

select 
stu_id
,math
,class_id
,row_number() over(partition by class_id order by math) as rn
,rank() over(partition by class_id order by math) as rk
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+------+------+
|stu_id    |math    |class_id    |rn    |rk    |
+----------+--------+------------+------+------+
| 10001    |  70    |C1          | 1    | 1    |
| 10002    |  71    |C1          | 2    | 2    |
| 10003    |  72    |C1          | 3    | 3    |
| 10004    |  78    |C1          | 4    | 4    |
| 10005    |  79    |C1          | 5    | 5    |
| 10011    |  82    |C1          | 6    | 6    |
| 10006    |  80    |C2          | 1    | 1    |
| 10009    |  80    |C2          | 2    | 1    |
| 10007    |  81    |C2          | 3    | 3    |
| 10010    |  81    |C2          | 4    | 3    |
| 10008    |  82    |C2          | 5    | 5    |
+----------+--------+------------+------+------+

# C2分组的排序，明显与rownumber不同

dense_rank开窗函数

函数功能：

dense_rank与rank有一点不同,当排名一样的时候,接下来的行是连续的。如两个行的排名为 1，则下一个排名为 2。

示例代码：

select 
stu_id
,math
,class_id
,row_number() over(partition by class_id order by math) as rn
,rank() over(partition by class_id order by math) as rk
,dense_rank() over(partition by class_id order by math) as drk
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+------+------+------+
|stu_id    |math    |class_id    |rn    |rk    |drk    |
+----------+--------+------------+------+------+------+
| 10001    |  70    |C1          | 1    | 1    | 1    |
| 10002    |  71    |C1          | 2    | 2    | 2    |
| 10003    |  72    |C1          | 3    | 3    | 3    |
| 10004    |  78    |C1          | 4    | 4    | 4    |
| 10005    |  79    |C1          | 5    | 5    | 5    |
| 10011    |  82    |C1          | 6    | 6    | 6    |
| 10006    |  80    |C2          | 1    | 1    | 1    |
| 10009    |  80    |C2          | 2    | 1    | 1    |
| 10007    |  81    |C2          | 3    | 3    | 2    |
| 10010    |  81    |C2          | 4    | 3    | 2    |
| 10008    |  82    |C2          | 5    | 5    | 3    |
+----------+--------+------------+------+------+------+

# 区别于rank() ,结果列drk中，在C2分组，相同分数，排名相同，下一名次连续统计。

percent_rank开窗函数

函数功能：

计算给定行的百分比排名。可以用来计算超过了百分之多少的人。
即：(当前行的rank值-1)/(分组内的总行数-1)

示例代码：

select 
stu_id
,math
,class_id
,row_number() over(partition by class_id order by math) as rn
,rank() over(partition by class_id order by math) as rk
,dense_rank() over(partition by class_id order by math) as drk
,percent_rank() over(partition by class_id order by math) as prk
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+------+------+------+------+
|stu_id    |math    |class_id    |rn    |rk    |drk   |prk   |
+----------+--------+------------+------+------+------+------+
| 10001    |  70    |C1          | 1    | 1    | 1    |0.0	 |
| 10002    |  71    |C1          | 2    | 2    | 2    |0.2	 |
| 10003    |  72    |C1          | 3    | 3    | 3    |0.4	 |
| 10004    |  78    |C1          | 4    | 4    | 4    |0.6	 |
| 10005    |  79    |C1          | 5    | 5    | 5    |0.8	 |
| 10011    |  82    |C1          | 6    | 6    | 6    |1.0	 |
| 10006    |  80    |C2          | 1    | 1    | 1    |0.0	 |
| 10009    |  80    |C2          | 2    | 1    | 1    |0.0	 |
| 10007    |  81    |C2          | 3    | 3    | 2    |0.5	 |
| 10010    |  81    |C2          | 4    | 3    | 2    |0.5	 |
| 10008    |  82    |C2          | 5    | 5    | 3    |1.0	 |
+----------+--------+------------+------+------+------+------+

# 根据rank()值即rk列进行计算,即（当前rk值-1）/(rk最大值-1)

ntile开窗函数

函数功能：

NTILE(n)，将每个分区内排序后的结果均分成N份。本质是将每个分区拆分成更小的分区。
如果切片不均匀，默认增加第一个切片的分布。
NTILE不支持ROWS BETWEEN。

示例代码：

# 按class_id分区后，再将每个分区拆分成2个、3个小分区
# ntile(2) 将每个class的math成绩分成高、低两部分
# ntile(3) 将每个class的math成绩分成高、中、低三部分。
select 
stu_id
,math
,class_id
,ntile(2) over(partition by class_id order by math) as nt1
,ntile(3) over(partition by class_id order by math) as nt2
from myhive.stu_scores
where dept_id = 'D1';

结果：

+----------+--------+------------+-------+-------+
|stu_id    |math    |class_id    |nt1    |nt2    |
+----------+--------+------------+-------+-------+
| 10001    |  70    |C1          |  1    |  1    |
| 10002    |  71    |C1          |  1    |  1    |
| 10003    |  72    |C1          |  1    |  2    |
| 10004    |  78    |C1          |  2    |  2    |
| 10005    |  79    |C1          |  2    |  3    |
| 10011    |  82    |C1          |  2    |  3    |
| 10006    |  80    |C2          |  1    |  1    |
| 10009    |  80    |C2          |  1    |  1    |
| 10007    |  81    |C2          |  1    |  2    |
| 10010    |  81    |C2          |  2    |  2    |
| 10008    |  82    |C2          |  2    |  3    |
+----------+--------+------------+-------+-------+