1、开窗函数的格式
FUNCTION_NAME([argument_list])
OVER (
[PARTITION BY window_partition,…]
[ORDER BY window_ordering, … [ASC|DESC]])
[ { ROWS | RANGE } BETWEEN frame_start AND frame_end ] );
FUNCTION_NAME:函数名称。如row_number()、sum()、first_value()等。
argument_list:函数的参数列表。
PARTITION BY:根据window_partition(分区字段)进行分区,该子句也被称为查询分区子句。类似于group by,都是将数据按照边界值进行分组。而OVER之前的函数在每一个分组之内进行,如果超出了分组,则函数会重新计算。
ORDER BY:将各个分区内的数据,根据window_ordering(排序字段)进行排序。ORDER BY子句会对输入的数据强制排序(窗口函数是SQL语句最后执行的函数,因此可以把SQL结果集想象成输入数据)。ORDER BY子句对于诸如row_number(),lead(),lag()等函数是必须的。如果数据无序,这些函数的结果就没有意义。
ROWS和RANGE分别表示选择前后几行、选择数据范围。
2、窗口范围图例
注释: PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点
UNBOUNDED PRECEDING 表示从前面的起点
UNBOUNDED FOLLOWING:表示到后面的终点 常用的范围: 01:ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 02:ROWS BETWEEN 3 PRECEDING AND CURRENT ROW 03:ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING,n PRECEDING m FOLLOWING:表示窗口的范围是[(当前行的行数)- n, (当前行的行数)+ m] row。 04:ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
3、连续登录问题:
数据准备: ±–±---------+ |id |login_date| ±–±---------+ |01 |2021-02-28| |01 |2021-03-01| |01 |2021-03-02| |01 |2021-03-04| |01 |2021-03-05| |01 |2021-03-06| |01 |2021-03-08| |02 |2021-03-01| |02 |2021-03-02| |02 |2021-03-03| |02 |2021-03-06| |03 |2021-03-06| ±–±---------+ 统计连续登录天数超过3天的用户,输出信息包括:用户id,登录天数,起始时间,结束时间; 方法1:
SELECT
t2.id,
count(1) as login_times,
min(t2.login_date) as start_date,
max(t2.login_date) as end_date
FROM
(
SELECT
t1.id,
t1.login_date,
date_sub(t1.login_date,rn) as diff_date
FROM
(
SELECT
id,
login_date,
row_number() over(partition by id order by login_date asc) as rn
FROM data
) t1
) t2
group by t2.id, t2.diff_date
having login_times >= 3;
+
|id |login_times|start_date|end_date |
+
| 01|3 |2021-02-28|2021-03-02|
| 01|3 |2021-03-04|2021-03-06|
| 02|3 |2021-03-01|2021-03-03|
+
方法2:
SELECT
id,
lag_login_date,
login_date,lead_login_date
FROM
(SELECT
id,
login_date,
lag(login_date,1,login_date) over(partition by id order by login_date) as lag_login_date,
lead(login_date,1,login_date) over(partition by id order by login_date) as lead_login_date
FROM data
) t1
where datediff(login_date,lag_login_date) =1 and datediff(lead_login_date,login_date) =1;
+
|id |lag_login_date|login_date|lead_login_date|
+
|01 |2018-02-28 |2018-03-01|2018-03-02 |
|01 |2018-03-04 |2018-03-05|2018-03-06 |
|02 |2018-03-01 |2018-03-02|2018-03-03 |
+
方法3:
SELECT
id,
lag_login_date,
login_date
FROM
(SELECT
id,
login_date,
lag(login_date,2,login_date) over(partition by id order by login_date) as lag_login_date,
FROM data
) t1
where datediff(login_date,lag_login_date) =2
+
|id |lag_login_date|login_date|
+
|01 |2018-02-28 |2018-03-02|
|01 |2018-03-04 |2018-03-06|
|02 |2018-03-01 |2018-03-03|
+
4、 限制时间段内登录次数问题:
有一张用户购买记录表.现在我们需要找出所有的特殊用户.特殊用户的定义如下: 在当前购买时间的近7天内(含当天)购买次数超过3次(含),且近7天的购买总金额超过1000的用户即为特殊用户. 数据准备: ±–±---------+ |user_id |lbuy_date|amount| ±–±---------±-----------+ |101 |2021-01-01|1000| |101 |2021-01-02|2000| |102|2021-10-01|10| |102 |2021-10-02|700| |102 |2021-10-07|200| |103 |2021-11-07|500| |103 |2021-11-08|500| |103 |2021-11-20|500| |104|2021-03-01|10| |104|2021-03-05|200| |104|2021-03-09|800| |104 |2021-03-09|800| ±–±---------+
实现:
select
distinct user_id
from
(
select
user_id
,buy_date
,count(1) over(PARTITION by user_id order by datediff(buy_date,'2021-01-01') RANGE between 6 PRECEDING and CURRENT row) as cnt
,sum(amount) over(PARTITION by user_id order by datediff(buy_date,'2021-01-01') RANGE between 6 PRECEDING and CURRENT row) as amount
from test.aaa001
)t1
where cnt>=3 and amount>1000
;
user_id
104
106
|