Number of users active on three consecutive days within the last seven days
First of all, many thanks to Dahua for the interview opportunity! Now to the point. Create the table:
create table uv_detail_daycount(
    mid int  -- user id; one row per user per active day
)
PARTITIONED BY (dt string);
Load the HDFS files into Hive with load.
[Figure: data file names]
The files contain nothing but user mid values:
[Figure: sample file contents]
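The mids listed under a date are the users who were active on that day. From the data above, the users active on three consecutive days within the last seven days are users 001 and 002, so the value of the metric for 2021-08-10 is 2.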
Implementation
Step 1: query the last seven days of data and, for each user, rank the rows by date in ascending order.
select
    mid,
    dt,
    rank() over(partition by mid order by dt) mid_dt_rank
from uv_detail_daycount
where dt >= date_add('2021-08-10', -6) and dt <= '2021-08-10';
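To make the output of the window function concrete, here is a minimal Python sketch of the same filter-and-rank logic. The helper name rank_by_user and the sample rows are made up for illustration and are not part of the original table; the sketch assumes at most one row per user per day.

```python
from collections import defaultdict
from datetime import date, timedelta

def rank_by_user(rows, end_day, window_days=7):
    """Mimic rank() over(partition by mid order by dt) on the last window_days days.

    rows: iterable of (mid, dt) pairs, assumed to hold at most one row per user per day.
    """
    start_day = end_day - timedelta(days=window_days - 1)
    per_user = defaultdict(list)
    for mid, dt in rows:
        if start_day <= dt <= end_day:  # same role as the WHERE clause
            per_user[mid].append(dt)
    ranked = []
    for mid in sorted(per_user):
        # Rank each user's dates from earliest to latest, starting at 1.
        for rnk, dt in enumerate(sorted(per_user[mid]), start=1):
            ranked.append((mid, dt, rnk))
    return ranked

# Hypothetical sample data.
rows = [
    (1, date(2021, 8, 3)),   # falls outside the window, which starts on 2021-08-04
    (1, date(2021, 8, 5)),
    (1, date(2021, 8, 6)),
    (2, date(2021, 8, 10)),
]
ranked = rank_by_user(rows, date(2021, 8, 10))
```

With this input, user 1 keeps two rows ranked 1 and 2, and user 2 keeps a single row ranked 1.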
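Step 2: subtract the rank from the date. For consecutive dates the date and the rank both grow by one per row, so every day in the same streak yields the same difference.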
with t1 as (
    select
        mid,
        dt,
        rank() over(partition by mid order by dt) mid_dt_rank
    from uv_detail_daycount
    where dt >= date_add('2021-08-10', -6) and dt <= '2021-08-10'
)
select
    mid,
    date_sub(dt, mid_dt_rank) date_diff
from t1;
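The date-minus-rank trick can be checked by hand on a small example. The following Python sketch uses a hypothetical helper, date_minus_rank, and made-up dates for a single user; it is only meant to show that consecutive days collapse onto one key.

```python
from datetime import date, timedelta

def date_minus_rank(active_days):
    """Return (dt, rank, dt minus rank days) for one user's distinct active days."""
    out = []
    for rnk, dt in enumerate(sorted(set(active_days)), start=1):
        out.append((dt, rnk, dt - timedelta(days=rnk)))
    return out

# Hypothetical user: active on Aug 4, 5, 6 (a streak) and again on Aug 9.
days = [date(2021, 8, 4), date(2021, 8, 5), date(2021, 8, 6), date(2021, 8, 9)]
keys = [key for _, _, key in date_minus_rank(days)]
# keys: 2021-08-03 three times (the streak), then 2021-08-05 (the isolated day)
```

The three streak days share the key 2021-08-03, while the isolated day gets a different key, which is what the grouping in the next step relies on.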
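Step 3: group by user and difference, then use HAVING to keep only the groups that contain at least 3 rows, i.e. streaks of three or more consecutive days.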
with t1 as (
    select
        mid,
        dt,
        rank() over(partition by mid order by dt) mid_dt_rank
    from uv_detail_daycount
    where dt >= date_add('2021-08-10', -6) and dt <= '2021-08-10'
),
t2 as (
    select
        mid,
        date_sub(dt, mid_dt_rank) date_diff
    from t1
)
select mid
from t2
group by mid, date_diff
having count(*) >= 3;
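Step 4: deduplicate by user id. Why can the same mid appear more than once? Within the seven-day window a user may log in on the first three days and again on the last three days, with a gap in between. Each streak then forms its own (mid, date_diff) group and passes the HAVING filter separately, so the mid is returned twice. The number of rows per mid is effectively the number of qualifying streaks, but the metric counts active users, so the duplicates have to be removed.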
with t1 as (
    select
        mid,
        dt,
        rank() over(partition by mid order by dt) mid_dt_rank
    from uv_detail_daycount
    where dt >= date_add('2021-08-10', -6) and dt <= '2021-08-10'
),
t2 as (
    select
        mid,
        date_sub(dt, mid_dt_rank) date_diff
    from t1
),
t3 as (
    select mid
    from t2
    group by mid, date_diff
    having count(*) >= 3
)
select mid
from t3
group by mid;
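As a cross-check of steps 3 and 4, here is a minimal Python sketch of the grouping, the HAVING filter, and the deduplication. The function streak_users and the sample rows are hypothetical; the sketch assumes each (mid, dt) pair marks one active day for that user.

```python
from collections import Counter
from datetime import date, timedelta

def streak_users(rows, end_day, window_days=7, min_streak=3):
    """Return (qualifying, deduped) for users with a streak of min_streak days.

    qualifying holds one entry per (mid, date_diff) group that passes the
    count >= min_streak filter, so a mid can repeat; deduped is the set of mids.
    """
    start_day = end_day - timedelta(days=window_days - 1)
    per_user = {}
    for mid, dt in rows:
        if start_day <= dt <= end_day:
            per_user.setdefault(mid, set()).add(dt)
    qualifying = []
    for mid, days in per_user.items():
        # Key each day by (date minus rank); a streak shares a single key.
        groups = Counter(
            dt - timedelta(days=rnk)
            for rnk, dt in enumerate(sorted(days), start=1)
        )
        qualifying.extend(mid for n in groups.values() if n >= min_streak)
    return qualifying, set(qualifying)

def d(day):
    return date(2021, 8, day)

# Hypothetical data: user 1 has two separate streaks, user 2 has none, user 3 has one.
rows = (
    [(1, d(x)) for x in (4, 5, 6, 8, 9, 10)]
    + [(2, d(x)) for x in (5, 7, 9)]
    + [(3, d(x)) for x in (6, 7, 8)]
)
qualifying, deduped = streak_users(rows, d(10))
```

Here user 1 shows up twice in qualifying because of the two streaks, but only once in deduped, so the final user count is 2.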
Step 5: format the final output:
with t1 as (
    select
        mid,
        dt,
        rank() over(partition by mid order by dt) mid_dt_rank
    from uv_detail_daycount
    where dt >= date_add('2021-08-10', -6) and dt <= '2021-08-10'
),
t2 as (
    select
        mid,
        date_sub(dt, mid_dt_rank) date_diff
    from t1
),
t3 as (
    select mid
    from t2
    group by mid, date_diff
    having count(*) >= 3
),
t4 as (
    select mid
    from t3
    group by mid
)
select
    '2021-08-10',                                             -- statistics date
    concat(date_add('2021-08-10', -6), ' to ', '2021-08-10'), -- seven-day window
    count(*)                                                  -- users with a 3-day streak
from t4;