0-业务场景
统计外电网各个设备的断电次数及断电时长,其中外电网正常状态为0,断电状态为1。统计断电次数,即是外电网的状态从0-1的变化次数。msg_time表示上报消息的时间,一分钟上报一次。
数据格式:设备ID - 时间戳 - 外电网状态。
类似的有统计开关量的使用次数,使用时长等
问题: (1)求断电次数
(2)求每次断电后断电时长(断电多久后被恢复,时间为分钟数要求保留一位小数)
(3)求每次断电后状态开始时间和结束时间
1-数据准备
vim opg.txt
1 2021-06-18 10:11:12 0
1 2021-06-18 10:12:10 1
1 2021-06-18 10:13:00 1
1 2021-06-18 10:14:01 1
1 2021-06-18 10:15:00 1
1 2021-06-18 10:16:02 0
1 2021-06-18 10:17:01 1
1 2021-06-18 10:18:00 1
1 2021-06-18 10:19:03 1
1 2021-06-18 10:20:04 0
2 2021-06-18 10:11:13 0
2 2021-06-18 10:11:00 0
2 2021-06-18 10:11:00 1
2 2021-06-18 10:11:00 1
2 2021-06-18 10:11:00 0
(2)建表
drop table opg;
create table opg(
sensor_id string,
msg_time string,
opg_state int
)
row format delimited fields terminated by '\t'
(3)加载数据
load data local inpath "/home/centos/dan_test/opg.txt" into table opg;
(4)数据展示
hive> select * from opg;
OK
1 2021-06-18 10:11:12 0
1 2021-06-18 10:12:10 1
1 2021-06-18 10:13:00 1
1 2021-06-18 10:14:01 1
1 2021-06-18 10:15:00 1
1 2021-06-18 10:16:02 0
1 2021-06-18 10:17:01 1
1 2021-06-18 10:18:00 1
1 2021-06-18 10:19:03 1
1 2021-06-18 10:20:04 0
2 2021-06-18 10:11:13 0
2 2021-06-18 10:12:00 0
2 2021-06-18 10:13:00 1
2 2021-06-18 10:14:00 1
2 2021-06-18 10:15:00 0
Time taken: 0.628 seconds, Fetched: 15 row(s)
2-数据分析
(1) 统计断电次数
-
目标:统计断电次数,及断电后的每次断电时长。断电次数:表示状态从0->1的次数,注意此题容易误解,有同学会统计1的个数认为断电次数,如果发生断电,则状态从0变为1,此时断电时间内一直会为1的状态,那么此时统计1的个数就会不准。 -
思路:统计0->1的个数,事实上我们可以借助窗口函数增加辅助列来帮我们计算,利用lag()函数取上一行的值作为辅助列,此时很容易看出0->1的变化。如下图: 上图中可以看到新增了一列opg_state_lag,代表上一行的opg_state数据,可以看出只要过滤出opg_state=1,opg_state_la=0的数据,再按设备分组统计即可。上图的SQL如下:
select sensor_id
,msg_time
,opg_state
,lag(opg_state,1,opg_state) over(partition by sensor_id) as opg_state_lag
from opg
所求结果为:
OK
1 2021-06-18 10:11:12 0 0
1 2021-06-18 10:12:10 1 0
1 2021-06-18 10:13:00 1 1
1 2021-06-18 10:14:01 1 1
1 2021-06-18 10:15:00 1 1
1 2021-06-18 10:16:02 0 1
1 2021-06-18 10:17:01 1 0
1 2021-06-18 10:18:00 1 1
1 2021-06-18 10:19:03 1 1
1 2021-06-18 10:20:04 0 1
2 2021-06-18 10:11:13 0 0
2 2021-06-18 10:12:00 0 0
2 2021-06-18 10:13:00 1 0
2 2021-06-18 10:14:00 1 1
2 2021-06-18 10:15:00 0 1
Time taken: 9.732 seconds, Fetched: 15 row(s)
最终的结果SQL如下:
select sensor_id
,count(1)
from(
select sensor_id
,msg_time
,opg_state
,lag(opg_state,1,opg_state) over(partition by sensor_id) as opg_state_lag
from opg
) t
where opg_state_lag=0 and opg_state=1
group by sensor_id
最终结果为:
OK
1 2
2 1
Time taken: 14.449 seconds, Fetched: 2 row(s)
(2)求断电时长
- 求每一次断电后断电时长:这里指的是每次断电后从第一次断电开始的时间到下一次恢复时的时间间隔。如下图所示所标记出来时间间隔即为断电时长,这里注意有的同学会误认为opg_state为1的状态持续的时间即为断电时长,这样求出是不准确,这里忽略了下次恢复时这一段时间间隔,容易被求错。
求解时,我们需要先标记断电开始的时间即0->1变化后最早时间,然后标记断电恢复后最早的时间(即1->0后最早时间),用后者减去前者即为断电时长。我们利用第一问得到的结果集,过滤出状态发生变化的数据即opg_state=1及opg_state_lag=0,和opg_state=0及opg_state_lag=1的数据。
select *
from(
select sensor_id
,msg_time
,opg_state
,lag(opg_state,1,opg_state) over(partition by sensor_id order by msg_time) as opg_state_lag
from opg
) t
where (opg_state_lag=0 and opg_state=1)or(opg_state_lag=1 and opg_state=0)
OK
1 2021-06-18 10:12:10 1 0
1 2021-06-18 10:16:02 0 1
1 2021-06-18 10:17:01 1 0
1 2021-06-18 10:20:04 0 1
2 2021-06-18 10:13:00 1 0
2 2021-06-18 10:15:00 0 1
Time taken: 8.082 seconds, Fetched: 6 row(s)
由于上述结果中按照时间进行升序排序,那么只要发生状态变化时,获取的时间即为最早的时间
考虑到时间的顺序性及状态的连续性,我们用上述结果集中按设备分组后,相邻两行的时间差值的绝对值即为断电时长。
计算时间差的中结果如下:
select sensor_id
,msg_time
,opg_state
,lag(msg_time,1,msg_time) over(partition by sensor_id order by msg_time) as msg_time_lag
from(
select sensor_id
,msg_time
,opg_state
,lag(opg_state,1,opg_state) over(partition by sensor_id order by msg_time) as opg_state_lag
from opg
) t
where (opg_state_lag=0 and opg_state=1)or(opg_state_lag=1 and opg_state=0)
OK
1 2021-06-18 10:12:10 1 2021-06-18 10:12:10
1 2021-06-18 10:16:02 0 2021-06-18 10:12:10
1 2021-06-18 10:17:01 1 2021-06-18 10:16:02
1 2021-06-18 10:20:04 0 2021-06-18 10:17:01
2 2021-06-18 10:13:00 1 2021-06-18 10:13:00
2 2021-06-18 10:15:00 0 2021-06-18 10:13:00
Time taken: 11.102 seconds, Fetched: 6 row(s)
然后针对每一行SQL用msg_time减去msg_time_lag的时间即为结果。
计算时间差用unix_timestamp() 返回当前的秒数。然后按照需求再进行转换,比如返回分钟则除以60即可,返回小时X/60/60,返回天X/60/60/24,本题返回分钟数。
unix_timestamp函数用法(输入为string形式,输出为bigint形式)
- unix_timestamp() 得到当前时间戳(秒数)
- 如果参数date满足yyyy-MM-dd HH:mm:ss形式,则可以直接unix_timestamp(string date) 得到参数对应的时间戳
- 如果参数date不满足yyyy-MM-dd HH:mm:ss形式,则我们需要指定date的形式,在进行转换
如输入时间为2021-06-18,那么后面就需要指定时间格式,因为标准的格式为yyyy-MM-dd HH:mm:ss形式否则返回NULL
如下例子:
hive> select unix_timestamp('2021-06-18 10:11:12');
OK
1623982272
Time taken: 0.258 seconds, Fetched: 1 row(s)
hive> select unix_timestamp('2021-06-18');
OK
NULL
Time taken: 0.342 seconds, Fetched: 1 row(s)
hive> select unix_timestamp('2021-06-18','yyyy-MM-dd');
OK
1623945600
Time taken: 0.782 seconds, Fetched: 1 row(s)
select to_utc_timestamp(1623982272358, 'GMT');
hive> select to_utc_timestamp(1623982272358, 'GMT');
OK
2021-06-18 10:11:12.358
Time taken: 0.468 seconds, Fetched: 1 row(s)
select current_timestamp()
,cast(current_timestamp() as double)*1000 as msg_time
hive> select current_timestamp()
> ,cast(current_timestamp() as float)*1000 as msg_time;
OK
2021-06-19 17:36:52.708 1.62409533E12
Time taken: 0.323 seconds, Fetched: 1 row(s)
where from_unixtime(cast(substr(msg_time,1,10) as bigint),'yyyy-MM-dd')='$lastday'
hive 保留小数位的方法:
select round(645342.875645342,2);
+
| _c0 |
+
| 645342.88 |
+
方法二
select cast(645342.875645342 as decimal(10,2));
+
| _c0 |
+
| 645342.88 |
+
上述保留小数位的方法推荐方法2,方法2准确且不出bug,round()函数有时经常会出现意想不到的bug.
最终的SQL如下:
select sensor_id
,cast((unix_timestamp(msg_time) -unix_timestamp(msg_time_lag))/60 as decimal(10,1)) as power_off_duration
from
( select *
,lag(msg_time,1,msg_time) over(partition by sensor_id order by msg_time) as msg_time_lag
from(
select sensor_id
,msg_time
,opg_state
,lag(opg_state,1,opg_state) over(partition by sensor_id order by msg_time) as opg_state_lag
from opg
) t
where (opg_state_lag=0 and opg_state=1)or(opg_state_lag=1 and opg_state=0)
) m
where opg_state_lag=1 and opg_state=0
结果如下:
OK
1 3.9
1 3.1
2 2
Time taken: 8.345 seconds, Fetched: 3 row(s)
(3)求每次断电后状态持续的开始时间和结束时间
- 需要对每次断电后(0->1变化后)将1的数据重新分成一组,然后再求出这一组的最大时间与最小时间。这样我们想到了按事件发生变化后分桶重排序的算法。
算法如下:
第一步:判定 当前行和上一行的第二列是否相等(是否发生某种事件),如果不等置为1,相等为0,作为flag列
第二步:累加 flag列,值作为分桶id
第三步:按照累加的值(分桶id)分组。
第四步:组内重排序
第一步:按照事件打标签(本题将数据发生变化的地方进行标记)
SQL如下:
select sensor_id
,msg_time
,opg_state
,case when lag(opg_state,1,opg_state) over(partition by sensor_id order by msg_time)!=opg_state
then 1
else 0 end as fag
from opg
这一步所示结果如下: 第二步:将flag列进行开窗累加。sum(flag) over(partition by sensor_id order by msg_time)
具体SQL如下:
select sensor_id
,msg_time
,opg_state
,sum(flag) over(partition by sensor_id order by msg_time) as bucket_id
from(
select sensor_id
,msg_time
,opg_state
,case when lag(opg_state,1,opg_state) over(partition by sensor_id order by msg_time)!=opg_state
then 1
else 0 end as flag
from opg
) t
结果为:
OK
1 2021-06-18 10:11:12 0 0
1 2021-06-18 10:12:10 1 1
1 2021-06-18 10:13:00 1 1
1 2021-06-18 10:14:01 1 1
1 2021-06-18 10:15:00 1 1
1 2021-06-18 10:16:02 0 2
1 2021-06-18 10:17:01 1 3
1 2021-06-18 10:18:00 1 3
1 2021-06-18 10:19:03 1 3
1 2021-06-18 10:20:04 0 4
2 2021-06-18 10:11:13 0 0
2 2021-06-18 10:12:00 0 0
2 2021-06-18 10:13:00 1 1
2 2021-06-18 10:14:00 1 1
2 2021-06-18 10:15:00 0 2
Time taken: 10.249 seconds, Fetched: 15 row(s)
结果如下所示: 第三步:过滤出opg_state=1的数据按bucket_id及sensor_id进行分组求出起始时间和结束时间具体SQL如下:
select sensor_id
,bucket_id
,min(msg_time) as start_time
,max(msg_time) as end_time
from(
select sensor_id
,msg_time
,opg_state
,flag
,sum(flag) over(partition by sensor_id order by msg_time) as bucket_id
from(
select sensor_id
,msg_time
,opg_state
,case when lag(opg_state,1,opg_state) over(partition by sensor_id order by msg_time)!=opg_state
then 1
else 0 end as flag
from opg
) t
) m
where opg_state=1
group by sensor_id,bucket_id
最终计算结果如下:
OK
1 1 2021-06-18 10:12:10 2021-06-18 10:15:00
1 3 2021-06-18 10:17:01 2021-06-18 10:19:03
2 1 2021-06-18 10:13:00 2021-06-18 10:14:00
Time taken: 22.05 seconds, Fetched: 3 row(s)
3-总结
本题实际上是实际业务场景经常需求问题,将其抽象出编成SQL面试题,主要考察基本SQL解决问题的能力,本题主要的知识点总结如下:
(1)分析函数lag()的使用和理解
(2)按事件发生变化后进行分桶的方法的应用
(3)时间差问题的求解思路
(4)需求理解能力及分析问题思维方法
|