数据定义、数据插入、数据查询实验跳过
目录
一.ELT原始数据处理
1.新建项目
2.添加项目依赖
3.文件内容?
1)数据抽取、拆分工具类函数?
2)继承Mapper
3)运行Runner?
4.打包
?编辑
?5.上传至hdfs
1)启动环境
?2)上传数据文件与jar包
3)数据上传至集群
4)运行jar包
?二.准备工作
1.准备hive环境
2.创建表
1)gulivideo_ori
2)gulivideo_user_ori?
3)?gulivideo_orc?
?4)gulivideo_user_orc
3.导入ELT后的数据
4.向ORC表插入数据?
?5.查看导入的数据结果
?三、业务分析
1.统计观看视频数Top10
?编辑?2.统计视频类热度
?3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
?4.统计视频观看数Top50所关联视频的所属类别Rank
?5.统计每个类别中的视频热度Top10,以Music为例
1)创建类别表
?2)向类别表中插入数据
?3)统计Music类别的Top10(也可以统计其他)
?6.统计每个类别中视频流量Top10,以Music例
7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频
?8.统计每个类别视频观看数Top10
一.ELT原始数据处理
1.新建项目

2.添加项目依赖

3.文件内容?
1)数据抽取、拆分工具类函数?

2)继承Mapper

3)运行Runner?


4.打包

?5.上传至hdfs
1)启动环境

?2)上传数据文件与jar包

3)数据上传至集群
[root@hadoop001 hive]#? hdfs dfs -put guiliVideo /

4)运行jar包
[root@hadoop001 hive]# yarn jar
?/root/hive/guli.video-1.0-SNAPSHOT.jar
com.guli.mapper.VideoETLRunner /guiliVideo/video/2008/0222 /guliout



?二.准备工作
1.准备hive环境
hive> SHOW DATABASES;
hive> SET hive.cli.print.current.db=true;
hive (default)> USE text;
hive (text)> set hive.cli.print.header=true;

2.创建表
1)gulivideo_ori
hive (text)> create table gulivideo_ori(
?????????? >???? videoId string,
?????????? >???? uploader string,
?????????? >???? age int,
?????????? >???? category array<string>,
?????????? >???? length int,
?????????? >???? views int,
?????????? >???? rate float,
?????????? >???? ratings int,
?????????? >???? comments int,
?????????? >???? relatedId array<string>)
?????????? > row format delimited fields terminated by "\t"
?????? ????> collection items terminated by "&"
?????????? > stored as textfile;

2)gulivideo_user_ori?
hive (text)> create table gulivideo_user_ori(
?????????? >???? uploader string,
?????????? >???? videos int,
?????????? >???? friends int)
?????????? > row format delimited fields terminated by "\t"
?????????? > stored as textfile;

3)?gulivideo_orc?
hive (text)> create table gulivideo_orc(
?????????? >???? videoId string,
?????????? >???? uploader string,
?????????? >???? age int,
?????????? >???? category array<string>,
?????????? >???? length int,
?????????? >???? views int,
?????????? >???? rate float,
?????????? >???? ratings int,
?????????? >???? comments int,
?????????? >???? relatedId array<string>)
?????????? > clustered by (uploader) into 8 buckets
?????????? > row format delimited fields terminated by "\t"
?????????? > collection items terminated by "&"
?????????? > stored as orc;

?4)gulivideo_user_orc
hive (text)> create table gulivideo_user_orc(
?????????? >???? uploader string,
?????????? >???? videos int,
?????????? >???? friends int)
?????????? > row format delimited fields terminated by "\t"
?????????? > stored as orc;

3.导入ELT后的数据
hive (text)> load data inpath "/guliout" into table gulivideo_ori;
hive (text)> load data inpath "/guiliVideo/user/2008/0903" into table gulivideo_user_ori;

4.向ORC表插入数据?
hive (text)> insert into table gulivideo_orc select * from gulivideo_ori;
hive (text)> insert into table gulivideo_user_orc select * from gulivideo_user_ori;



?5.查看导入的数据结果
hive (text)> select * from gulivideo_orc limit 2;
hive (text)> select * from gulivideo_user_orc limit 2;

?三、业务分析
1.统计观看视频数Top10
hive (text)> select videoId,uploader,age,category,length,views,rate,ratings,comments from gulivideo_orc order by views desc limit 10;

?2.统计视频类热度
hive (text)> select
?????????? >???? category_name as category,
?????????? >???? count(t1.videoId) as hot
?????????? > from (
?????????? >? ???select
?????????? >???????? videoId,
?????????? >???????? category_name
?????????? >???? from
?????????? >???????? gulivideo_orc lateral view explode(category) t_catetory as
?????????? > category_name) t1
?????????? > group by
?????????? >???? t1.category_name
?????????? > order by
?????????? >???? hot
?????????? > desc limit
?????????? >???? 10;


?3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
hive (text)> select
?????????? >???? category_name as category,
?????????? >???? count(t2.videoId) as hot_with_views
?????????? > from (
?????????? >???? select
?????????? >???????? videoId,
?????????? >???????? category_name
?????????? >???? from (
?????????? >???????? select
?????????? >???????????? *
?????????? >???????? from
?????????? >???? ????????gulivideo_orc
?????????? >???????? order by
?????????? >???????????? views
?????????? >???????????? desc limit
?????????? >???????????????? 20) t1 lateral view explode(category) t_catetory as
?????????? > category_name) t2
?????????? > group by
?? ????????>???? category_name
?????????? > order by
?????????? >???? hot_with_views
?????????? > desc;


?4.统计视频观看数Top50所关联视频的所属类别Rank

hive (text)> select
?????????? >???? category_name as category,
?????????? >???? count(t5.videoId) as hot
?????????? > from (
?????????? >???? select
?????????? >???????? videoId,
?????????? >???????? category_name
?????????? >???? from (
?????????? >???????? select
?????????? >???????? distinct(t2.videoId),
?????????? >???????? t3.category
?????????? >?????? ??from (
?????????? >???????????? select
?????????? >???????????????? explode(relatedId) as videoId
?????????? >???????????? from (
?????????? >???????????????? select
?????????? >???????????????????? *
?????????? >???????????????? from
?????????? >?????? ??????????????gulivideo_orc
?????????? >???????????????? order by
?????????? >???????????????????? views
?????????? >???????????????? desc limit
?????????? >???????????????????? 50) t1) t2
?????????? >???????? inner join
?????????? >???????????? gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view
?????????? > explode(category) t_catetory as category_name) t5
?????????? > group by
?????????? >???? category_name
?????????? > order by
?????????? >???? hot
?????????? > desc;


?5.统计每个类别中的视频热度Top10,以Music为例
1)创建类别表
hive (text)> create table gulivideo_category(
?????????? >???? videoId string,
?????????? >???? uploader string,
?????????? >???? age int,
?????????? >???? categoryId string,
?????????? >???? length int,
?????? ????>???? views int,
?????????? >???? rate float,
?????????? >???? ratings int,
?????????? >???? comments int,
?????????? >???? relatedId array<string>)
?????????? > row format delimited
?????????? > fields terminated by "\t"
?????????? > collection items terminated by "&"
?????????? > stored as orc;

?2)向类别表中插入数据
hive (text)> insert into table gulivideo_category
?????????? >???? select
?????????? >???????? videoId,
?????????? >???????? uploader,
?????????? >???????? age,
?????????? >???????? categoryId,
?????????? >???????? length,
?????????? >???????? views,
?????????? >???????? rate,
?????????? >???????? ratings,
?????????? >???????? comments,
?????????? >???????? relatedId
?????????? >?? ??from
?????????? >???????? gulivideo_orc lateral view explode(category) catetory as
?????????? > categoryId;


?3)统计Music类别的Top10(也可以统计其他)
hive (text)> select
?????????? >???? videoId,
?????????? >???? views
?????????? > from
?????????? >???? gulivideo_category
?????????? > where
?????????? >???? categoryId = "Music"
?????????? > order by
?????????? >???? views
?????????? > desc limit
?????????? >???? 10;


?6.统计每个类别中视频流量Top10,以Music例
hive (text)> select
?????????? >???? videoId,
?????????? >???? views,
?????????? >???? ratings
?????????? > from
?????????? >???? gulivideo_category
?????????? > where
?????????? >???? categoryId = "Music"
?????????? > order by
?????????? >???? ratings
?????????? > desc limit
?????????? >???? 10;


7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频
hive (text)> select
?????????? >???? t2.videoId,
?????????? >???? t2.views,
?????????? >???? t2.ratings,
?????????? >???? t1.videos,
?????????? >???? t1.friends
?????????? > from (
?????????? >???? select
?????????? >???????? *
?????????? >???? from
?????????? >???????? gulivideo_user_orc
?????????? >???? order by
?????????? >???????? videos desc
?????????? >???? limit
?????????? >???????? 10) t1
?????????? >???? join
?????????? >?? ??????gulivideo_orc t2
?????????? >???? on
?????????? >???????? t1.uploader = t2.uploader
?????????? >???? order by
?????????? >???????? views desc
?????????? >???? limit
?????????? >???????? 20;


?8.统计每个类别视频观看数Top10
hive (text)> select
?????????? >???? t1.*
?????????? > from (
?????????? >???? select
?????????? >???????? videoId,
?????????? >???????? categoryId,
?????????? >???????? views,
?????????? > row_number() over(partition by categoryId order by views desc)
?????????? > rank from gulivideo_category) t1
?????????? > where
?????????? >???? rank <= 10;


|