IT数码 购物 网址 头条 软件 日历 阅读 图书馆
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
图片批量下载器
↓批量下载图片,美女图库↓
图片自动播放器
↓图片自动播放器↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁
 
   -> 大数据 -> Hive高手之路:二、hive数据分析实验 -> 正文阅读

[大数据]Hive高手之路:二、hive数据分析实验

数据定义、数据插入、数据查询实验跳过

目录

一.ELT原始数据处理

1.新建项目

2.添加项目依赖

3.文件内容?

1)数据抽取、拆分工具类函数?

2)继承Mapper

3)运行Runner?

4.打包

?编辑

?5.上传至hdfs

1)启动环境

?2)上传数据文件与jar包

3)数据上传至集群

4)运行jar包

?二.准备工作

1.准备hive环境

2.创建表

1)gulivideo_ori

2)gulivideo_user_ori?

3)?gulivideo_orc?

?4)gulivideo_user_orc

3.导入ELT后的数据

4.向ORC表插入数据?

?5.查看导入的数据结果

?三、业务分析

1.统计观看视频数Top10

?编辑?2.统计视频类热度

?3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

?4.统计视频观看数Top50所关联视频的所属类别Rank

?5.统计每个类别中的视频热度Top10,以Music为例

1)创建类别表

?2)向类别表中插入数据

?3)统计Music类别的Top10(也可以统计其他)

?6.统计每个类别中视频流量Top10,以Music例

7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

?8.统计每个类别视频观看数Top10


一.ELT原始数据处理

1.新建项目

2.添加项目依赖

3.文件内容?

1)数据抽取、拆分工具类函数?

2)继承Mapper

3)运行Runner?

4.打包

?5.上传至hdfs

1)启动环境

?2)上传数据文件与jar包

3)数据上传至集群

[root@hadoop001 hive]#? hdfs dfs -put guiliVideo /

4)运行jar包

[root@hadoop001 hive]# yarn jar

?/root/hive/guli.video-1.0-SNAPSHOT.jar

com.guli.mapper.VideoETLRunner /guiliVideo/video/2008/0222 /guliout

?二.准备工作

1.准备hive环境

hive> SHOW DATABASES;

hive> SET hive.cli.print.current.db=true;

hive (default)> USE text;

hive (text)> set hive.cli.print.header=true;

2.创建表

1)gulivideo_ori

hive (text)> create table gulivideo_ori(

?????????? >???? videoId string,

?????????? >???? uploader string,

?????????? >???? age int,

?????????? >???? category array<string>,

?????????? >???? length int,

?????????? >???? views int,

?????????? >???? rate float,

?????????? >???? ratings int,

?????????? >???? comments int,

?????????? >???? relatedId array<string>)

?????????? > row format delimited fields terminated by "\t"

?????? ????> collection items terminated by "&"

?????????? > stored as textfile;

2)gulivideo_user_ori?

hive (text)> create table gulivideo_user_ori(

?????????? >???? uploader string,

?????????? >???? videos int,

?????????? >???? friends int)

?????????? > row format delimited fields terminated by "\t"

?????????? > stored as textfile;

3)?gulivideo_orc?

hive (text)> create table gulivideo_orc(

?????????? >???? videoId string,

?????????? >???? uploader string,

?????????? >???? age int,

?????????? >???? category array<string>,

?????????? >???? length int,

?????????? >???? views int,

?????????? >???? rate float,

?????????? >???? ratings int,

?????????? >???? comments int,

?????????? >???? relatedId array<string>)

?????????? > clustered by (uploader) into 8 buckets

?????????? > row format delimited fields terminated by "\t"

?????????? > collection items terminated by "&"

?????????? > stored as orc;

?4)gulivideo_user_orc

hive (text)> create table gulivideo_user_orc(

?????????? >???? uploader string,

?????????? >???? videos int,

?????????? >???? friends int)

?????????? > row format delimited fields terminated by "\t"

?????????? > stored as orc;

3.导入ELT后的数据

hive (text)> load data inpath "/guliout" into table gulivideo_ori;

hive (text)> load data inpath "/guiliVideo/user/2008/0903" into table gulivideo_user_ori;

4.向ORC表插入数据?

hive (text)> insert into table gulivideo_orc select * from gulivideo_ori;

hive (text)> insert into table gulivideo_user_orc select * from gulivideo_user_ori;

?5.查看导入的数据结果

hive (text)> select * from gulivideo_orc limit 2;

hive (text)> select * from gulivideo_user_orc limit 2;

?三、业务分析

1.统计观看视频数Top10

hive (text)> select videoId,uploader,age,category,length,views,rate,ratings,comments from gulivideo_orc order by views desc limit 10;

?2.统计视频类热度

hive (text)> select

?????????? >???? category_name as category,

?????????? >???? count(t1.videoId) as hot

?????????? > from (

?????????? >? ???select

?????????? >???????? videoId,

?????????? >???????? category_name

?????????? >???? from

?????????? >???????? gulivideo_orc lateral view explode(category) t_catetory as

?????????? > category_name) t1

?????????? > group by

?????????? >???? t1.category_name

?????????? > order by

?????????? >???? hot

?????????? > desc limit

?????????? >???? 10;

?3.统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

hive (text)> select

?????????? >???? category_name as category,

?????????? >???? count(t2.videoId) as hot_with_views

?????????? > from (

?????????? >???? select

?????????? >???????? videoId,

?????????? >???????? category_name

?????????? >???? from (

?????????? >???????? select

?????????? >???????????? *

?????????? >???????? from

?????????? >???? ????????gulivideo_orc

?????????? >???????? order by

?????????? >???????????? views

?????????? >???????????? desc limit

?????????? >???????????????? 20) t1 lateral view explode(category) t_catetory as

?????????? > category_name) t2

?????????? > group by

?? ????????>???? category_name

?????????? > order by

?????????? >???? hot_with_views

?????????? > desc;

?4.统计视频观看数Top50所关联视频的所属类别Rank

hive (text)> select

?????????? >???? category_name as category,

?????????? >???? count(t5.videoId) as hot

?????????? > from (

?????????? >???? select

?????????? >???????? videoId,

?????????? >???????? category_name

?????????? >???? from (

?????????? >???????? select

?????????? >???????? distinct(t2.videoId),

?????????? >???????? t3.category

?????????? >?????? ??from (

?????????? >???????????? select

?????????? >???????????????? explode(relatedId) as videoId

?????????? >???????????? from (

?????????? >???????????????? select

?????????? >???????????????????? *

?????????? >???????????????? from

?????????? >?????? ??????????????gulivideo_orc

?????????? >???????????????? order by

?????????? >???????????????????? views

?????????? >???????????????? desc limit

?????????? >???????????????????? 50) t1) t2

?????????? >???????? inner join

?????????? >???????????? gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view

?????????? > explode(category) t_catetory as category_name) t5

?????????? > group by

?????????? >???? category_name

?????????? > order by

?????????? >???? hot

?????????? > desc;

?5.统计每个类别中的视频热度Top10,以Music为例

1)创建类别表

hive (text)> create table gulivideo_category(

?????????? >???? videoId string,

?????????? >???? uploader string,

?????????? >???? age int,

?????????? >???? categoryId string,

?????????? >???? length int,

?????? ????>???? views int,

?????????? >???? rate float,

?????????? >???? ratings int,

?????????? >???? comments int,

?????????? >???? relatedId array<string>)

?????????? > row format delimited

?????????? > fields terminated by "\t"

?????????? > collection items terminated by "&"

?????????? > stored as orc;

?2)向类别表中插入数据

hive (text)> insert into table gulivideo_category

?????????? >???? select

?????????? >???????? videoId,

?????????? >???????? uploader,

?????????? >???????? age,

?????????? >???????? categoryId,

?????????? >???????? length,

?????????? >???????? views,

?????????? >???????? rate,

?????????? >???????? ratings,

?????????? >???????? comments,

?????????? >???????? relatedId

?????????? >?? ??from

?????????? >???????? gulivideo_orc lateral view explode(category) catetory as

?????????? > categoryId;

?3)统计Music类别的Top10(也可以统计其他)

hive (text)> select

?????????? >???? videoId,

?????????? >???? views

?????????? > from

?????????? >???? gulivideo_category

?????????? > where

?????????? >???? categoryId = "Music"

?????????? > order by

?????????? >???? views

?????????? > desc limit

?????????? >???? 10;

?6.统计每个类别中视频流量Top10,以Music例

hive (text)> select

?????????? >???? videoId,

?????????? >???? views,

?????????? >???? ratings

?????????? > from

?????????? >???? gulivideo_category

?????????? > where

?????????? >???? categoryId = "Music"

?????????? > order by

?????????? >???? ratings

?????????? > desc limit

?????????? >???? 10;

7.统计上传视频最多的用户Top10以及他们上传的观看次数在前20的视频

hive (text)> select

?????????? >???? t2.videoId,

?????????? >???? t2.views,

?????????? >???? t2.ratings,

?????????? >???? t1.videos,

?????????? >???? t1.friends

?????????? > from (

?????????? >???? select

?????????? >???????? *

?????????? >???? from

?????????? >???????? gulivideo_user_orc

?????????? >???? order by

?????????? >???????? videos desc

?????????? >???? limit

?????????? >???????? 10) t1

?????????? >???? join

?????????? >?? ??????gulivideo_orc t2

?????????? >???? on

?????????? >???????? t1.uploader = t2.uploader

?????????? >???? order by

?????????? >???????? views desc

?????????? >???? limit

?????????? >???????? 20;

?8.统计每个类别视频观看数Top10

hive (text)> select

?????????? >???? t1.*

?????????? > from (

?????????? >???? select

?????????? >???????? videoId,

?????????? >???????? categoryId,

?????????? >???????? views,

?????????? > row_number() over(partition by categoryId order by views desc)

?????????? > rank from gulivideo_category) t1

?????????? > where

?????????? >???? rank <= 10;

  大数据 最新文章
实现Kafka至少消费一次
亚马逊云科技:还在苦于ETL?Zero ETL的时代
初探MapReduce
【SpringBoot框架篇】32.基于注解+redis实现
Elasticsearch:如何减少 Elasticsearch 集
Go redis操作
Redis面试题
专题五 Redis高并发场景
基于GBase8s和Calcite的多数据源查询
Redis——底层数据结构原理
上一篇文章      下一篇文章      查看所有文章
加:2022-10-17 12:41:07  更:2022-10-17 12:41:52 
 
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁

360图书馆 购物 三丰科技 阅读网 日历 万年历 2025年4日历 -2025/4/22 8:55:29-

图片自动播放器
↓图片自动播放器↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  IT数码