hive sql 求所有用户和活跃用户的总数及平均年龄。(活跃用户指连续两天都有访问记录的用户)
有日志如下,请写出代码求得所有用户和活跃用户的总数及平均年龄。(活跃用户指连续两天都有访问记录的用户)
日期 用户 年龄
2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19
2019-02-17,test_4,35
2019-02-18,test_4,35
2019-02-22,test_4,35
2019-02-23,test_4,35
建表
create table users(day_time string,user_id string,age int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;
导入数据
load data local inpath '/opt/data/user.txt' into table users;
注意:这里都默认每个人的年龄是不变的,不考虑一个人的年龄增长的情况;
用到的知识点有hive的 over()开窗函数、row_number()函数、lead()函数
思路一
1. 求所有用户的总数和平均年龄
1)先求出每个用户的年龄
select user_id,avg(age) from users group by user_id;
user_id c1
test_1 23.0
test_2 19.0
test_3 39.0
test_4 35.0
2) 再求所有用户的总数和平均年龄
select
count(a.user_id) `用户数`,
avg(a.av) `所有用户的平均年龄`
from(
select user_id,avg(age) av from users group by user_id
) a;
用户数 所有用户的平均年龄
4 29.0
2.求活跃用户总数和平均年龄(活跃用户指连续两天都有访问记录的用户)
1) 因为一个用户每天可能登陆多次,计算活跃用户数时每天登陆一次就算今日活跃过了,所以要对每日登陆的用户去重
select distinct day_time,user_id,age from users;
day_time user_id age
2019-02-11 test_1 23
2019-02-11 test_2 19
2019-02-11 test_3 39
2019-02-12 test_2 19
2019-02-13 test_1 23
2019-02-15 test_2 19
2019-02-16 test_2 19
2019-02-17 test_4 35
2019-02-18 test_4 35
2019-02-22 test_4 35
2019-02-23 test_4 35
2) 这里开始到了活跃用户的计算逻辑了:
先用 row_number() over(partiton by _ order by _) 对用户进行分组,对登陆日期进行排序,排序完就会多出来一列(排序的列);
select
t1.user_id,t1.day_time,
row_number() over(partition by t1.user_id order by t1.day_time) rk,
t1.age
from(select distinct day_time,user_id,age from users) t1;
user_id day_time rk age
test_1 2019-02-11 1 23
test_1 2019-02-13 2 23
test_2 2019-02-11 1 19
test_2 2019-02-12 2 19
test_2 2019-02-15 3 19
test_2 2019-02-16 4 19
test_3 2019-02-11 1 39
test_4 2019-02-17 1 35
test_4 2019-02-18 2 35
test_4 2019-02-22 3 35
test_4 2019-02-23 4 35
再用时间列减去排序的列(rk),在连续登陆的情况下,每次相减的结果都相同;
select
t2.user_id,
t2.day_time,
t2.rk,
date_sub(t2.day_time,rk) ds,
age
from(
select
t1.user_id,
t1.day_time,
row_number() over(partition by t1.user_id order by t1.day_time) rk,
t1.age
from(select distinct day_time,user_id,age from users) t1
) t2;
user_id day_time rk ds age
test_1 2019-02-11 1 2019-02-10 23
test_1 2019-02-13 2 2019-02-11 23
test_2 2019-02-11 1 2019-02-10 19
test_2 2019-02-12 2 2019-02-10 19
test_2 2019-02-15 3 2019-02-12 19
test_2 2019-02-16 4 2019-02-12 19
test_3 2019-02-11 1 2019-02-10 39
test_4 2019-02-17 1 2019-02-16 35
test_4 2019-02-18 2 2019-02-16 35
test_4 2019-02-22 3 2019-02-19 35
test_4 2019-02-23 4 2019-02-19 35
3) 活跃用户的规则是连续2次登陆就算活跃用户,所以以用户id和ds列进行分组,统计组内的数据条数,当条数大于等于2就是活跃用户;可以顺便得到连续登陆的开始日期,结束日期,连续登陆的天数;
select
t3.user_id,
min(t3.day_time),
max(t3.day_time),
count(*),
avg(t3.age)
from(
select
t2.user_id,
t2.day_time,
t2.rk,
date_sub(t2.day_time,rk) ds,
age
from(
select
t1.user_id,
t1.day_time,
row_number() over(partition by t1.user_id order by t1.day_time) rk,
t1.age
from(select distinct day_time,user_id,age from users) t1
) t2
) t3 group by t3.user_id,t3.ds having count(*)>=2;
user_id 开始日期 结束日期 连续登陆天数 age
test_2 2019-02-11 2019-02-12 2 19.0
test_2 2019-02-15 2019-02-16 2 19.0
test_4 2019-02-17 2019-02-18 2 35.0
test_4 2019-02-22 2019-02-23 2 35.0
4)求活跃用户总数和平均年龄
第一步:因为同一个用户可以活跃多次(在不同时间段内都满足连续登陆2天),但只算一个活跃用户,因此要先去重,先求出活跃用户的id和对应的年龄(去重);
select
t4.user_id,
avg(t4.ag)
from(
select
t3.user_id,
min(t3.day_time),
max(t3.day_time),
count(*),
avg(t3.age) ag
from(
select
t2.user_id,
t2.day_time,
t2.rk,
date_sub(t2.day_time,rk) ds,
age
from(
select
t1.user_id,
t1.day_time,
row_number() over(partition by t1.user_id order by t1.day_time) rk,
t1.age
from(select distinct day_time,user_id,age from users) t1
) t2
) t3 group by t3.user_id,t3.ds having count(*)>=2
) t4 group by t4.user_id;
user_id age
test_2 19.0
test_4 35.0
连续登陆3天算活跃用户,就count(*)>=3;
连续登陆4天算活跃用户,就count(*)>=4;
连续登陆5天算活跃用户,就count(*)>=5;
以此逻辑类推;
第二步:计算活跃用户总数和平均年龄
select
count(t5.user_id) `活跃用户总数`,
avg(t5.ag2) `活跃用户平均年龄`
from(
select
t4.user_id,
avg(t4.ag) ag2
from(
select
t3.user_id,
min(t3.day_time),
max(t3.day_time),
count(*),
avg(t3.age) ag
from(
select
t2.user_id,
t2.day_time,
t2.rk,
date_sub(t2.day_time,rk) ds,
age
from(
select
t1.user_id,
t1.day_time,
row_number() over(partition by t1.user_id order by t1.day_time) rk,
t1.age
from(select distinct day_time,user_id,age from users) t1
) t2
) t3 group by t3.user_id,t3.ds having count(*)>=2
) t4 group by t4.user_id
) t5;
活跃用户总数 活跃用户平均年龄
2 27.0
最后一步:将这两个结果结合起来
select
b.`用户数` `所有用户数`,
b.`所有用户的平均年龄` `所有用户的平均年龄`,
0 `活跃用户总数`,
0 `活跃用户平均年龄`
from(
select
count(a.user_id) `用户数`,
avg(a.av) `所有用户的平均年龄`
from(
select user_id,avg(age) av from users group by user_id
) a
) b
union all
select
0 `所有用户数`,
0 `所有用户的平均年龄`,
t6.`活跃用户总数` `活跃用户总数`,
t6.`活跃用户平均年龄` `活跃用户平均年龄`
from(
select
count(t5.user_id) `活跃用户总数`,
avg(t5.ag2) `活跃用户平均年龄`
from(
select
t4.user_id,
avg(t4.ag) ag2
from(
select
t3.user_id,
min(t3.day_time),
max(t3.day_time),
count(*),
avg(t3.age) ag
from(
select
t2.user_id,
t2.day_time,
t2.rk,
date_sub(t2.day_time,rk) ds,
age
from(
select
t1.user_id,
t1.day_time,
row_number() over(partition by t1.user_id order by t1.day_time) rk,
t1.age
from(select distinct day_time,user_id,age from users) t1
) t2
) t3 group by t3.user_id,t3.ds having count(*)>=2
) t4 group by t4.user_id
) t5
) t6;
u1.所有用户数 u1.所有用户的平均年龄 u1.活跃用户总数 u1.活跃用户平均年龄
4 29.0 0 0.0
0 0.0 2 27.0
思路二:用hive里的函数lead()
day_time user_id age
2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19
2019-02-17,test_4,35
2019-02-18,test_4,35
2019-02-22,test_4,35
2019-02-23,test_4,35
求活跃用户总数和平均年龄(活跃用户指连续两天都有访问记录的用户)
1)select distinct user_id,day_time,age from users;
user_id day_time age
test_1 2019-02-11 23
test_2 2019-02-11 19
test_3 2019-02-11 39
test_2 2019-02-12 19
test_1 2019-02-13 23
test_2 2019-02-15 19
test_2 2019-02-16 19
test_4 2019-02-17 35
test_4 2019-02-18 35
test_4 2019-02-22 35
test_4 2019-02-23 35
2)
select
t1.user_id,
t1.day_time,
lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt,
t1.age
from(
select distinct user_id,day_time,age from users
) t1;
t1.user_id t1.day_time nt t1.age
test_1 2019-02-11 2019-02-13 23
test_1 2019-02-13 NULL 23
test_2 2019-02-11 2019-02-12 19
test_2 2019-02-12 2019-02-15 19
test_2 2019-02-15 2019-02-16 19
test_2 2019-02-16 NULL 19
test_3 2019-02-11 NULL 39
test_4 2019-02-17 2019-02-18 35
test_4 2019-02-18 2019-02-22 35
test_4 2019-02-22 2019-02-23 35
test_4 2019-02-23 NULL 35
3)
select
t2.user_id,
t2.age
from(
select
t1.user_id,
t1.day_time,
lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt,
t1.age
from(
select distinct user_id,day_time,age from users
) t1
) t2 where datediff(t2.nt,t2.day_time)=1;
t2.user_id t2.age
test_2 19
test_2 19
test_4 35
test_4 35
连续登陆2天就算活跃用户: 就用 lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt 然后 nt-day_time=1;
连续登陆3天就算活跃用户: 就用 lead(t1.day_time,2) over(partition by t1.user_id order by t1.day_time) nt 然后 nt-day_time=2;
连续登陆4天就算活跃用户: 就用 lead(t1.day_time,3) over(partition by t1.user_id order by t1.day_time) nt 然后 nt-day_time=3;
以此逻辑类推;
select
t3.user_id,
avg(t3.age) ag
from(
select
t2.user_id,
t2.age
from(
select
t1.user_id,
t1.day_time,
lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt,
t1.age
from(
select distinct user_id,day_time,age from users
) t1
) t2 where datediff(t2.nt,t2.day_time)=1
) t3 group by t3.user_id;
t3.user_id ag
test_2 19.0
test_4 35.0
5)
select
count(t6.user_id) `活跃用户总数`,
avg(t6.ag) `活跃用户平均年龄`
from(
select
t3.user_id,
avg(t3.age) ag
from(
select
t2.user_id,
t2.age
from(
select
t1.user_id,
t1.day_time,
lead(t1.day_time,1) over(partition by t1.user_id order by t1.day_time) nt,
t1.age
from(
select distinct user_id,day_time,age from users
) t1
) t2 where datediff(t2.nt,t2.day_time)=1
) t3 group by t3.user_id
) t6;
活跃用户总数 活跃用户平均年龄
2 27.0
6) 后面的union all 跟思路一 一样,省略;
|