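11 Hive in Practice
11.3.2 Installing the Tez Engine (for reference)
Tez is an execution engine for Hive that generally outperforms MapReduce (MR) on multi-stage queries. The reason is as follows.
When Hive runs a query on MR, suppose the query compiles into four MR jobs that depend on one another. In the accompanying figure, the green boxes are Reduce Tasks and the cloud shapes mark write barriers, where intermediate results must be persisted to HDFS. Tez can merge multiple dependent jobs into a single job, so HDFS is written only once and there are fewer intermediate nodes, which substantially improves the job's performance.
》》1: Copy the Tez package to the cluster and extract the tar archive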
[dev1@hadoop102 software]$ mkdir /opt/module/tez
[dev1@hadoop102 software]$ tar -zxvf /opt/software/tez-0.10.1-SNAPSHOT.tar.gz -C /opt/module/tez
》》2: Upload the Tez dependencies to HDFS
[dev1@hadoop102 software]$ hadoop fs -mkdir /tez
[dev1@hadoop102 software]$ hadoop fs -put /opt/software/tez-0.10.1-SNAPSHOT.tar.gz /tez
》》3: Create tez-site.xml
[dev1@hadoop102 software]$ vim $HADOOP_HOME/etc/hadoop/tez-site.xml
Add the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>tez.lib.uris</name>
<value>${fs.defaultFS}/tez/tez-0.10.1-SNAPSHOT.tar.gz</value>
</property>
<property>
<name>tez.use.cluster.hadoop-libs</name>
<value>true</value>
</property>
<property>
<name>tez.am.resource.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>tez.am.resource.cpu.vcores</name>
<value>1</value>
</property>
<property>
<name>tez.container.max.java.heap.fraction</name>
<value>0.4</value>
</property>
<property>
<name>tez.task.resource.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>tez.task.resource.cpu.vcores</name>
<value>1</value>
</property>
</configuration>
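A rough sketch of how the heap fraction in this file relates to container size. It assumes the JVM options carry no explicit -Xmx, in which case Tez derives the heap from the container memory multiplied by tez.container.max.java.heap.fraction. The helper name default_heap_mb is made up for illustration, and truncating to whole megabytes is an assumption.

```python
def default_heap_mb(container_mb, heap_fraction):
    """Approximate JVM heap (MB) derived from a container size and a
    heap fraction, assuming no explicit -Xmx was configured."""
    return int(container_mb * heap_fraction)

# Values taken from the tez-site.xml above: 1024 MB containers, fraction 0.4.
task_heap = default_heap_mb(1024, 0.4)
```

With these values the heap is well under half of the container, which leaves headroom for non-heap memory at the cost of a small heap.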
》》4: Modify the Hadoop environment variables
[dev1@hadoop102 software]$ vim $HADOOP_HOME/etc/hadoop/shellprofile.d/tez.sh
Add the Tez jar information:
hadoop_add_profile tez
function _tez_hadoop_classpath
{
hadoop_add_classpath "$HADOOP_HOME/etc/hadoop" after
hadoop_add_classpath "/opt/module/tez/*" after
hadoop_add_classpath "/opt/module/tez/lib/*" after
}
》》5: Change Hive's execution engine
[dev1@hadoop102 software]$ vim $HIVE_HOME/conf/hive-site.xml
Add:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
<property>
<name>hive.tez.container.size</name>
<value>1024</value>
</property>
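》》6: Resolve the logging jar conflict
[dev1@hadoop102 software]$ rm /opt/module/tez/lib/slf4j-log4j12-1.7.10.jar
11.1 Requirements
Compute the standard metrics and various TopN metrics for the iqiyi video site:
- Top 10 videos by view count
- Top 10 video categories by popularity
- The categories of the 20 most-viewed videos, and how many Top 20 videos each category contains
- A ranking of the categories of the videos related to the Top 50 most-viewed videos
- Top 10 videos by popularity within each category, using Music as the example
- Top 10 videos by view count within each category
- The Top 10 users by number of uploaded videos, and the top 20 of their uploaded videos by view count
11.2 Data Structure
》》1: Video table
》》2: User table
11.3 Preparation
11.3.1 Preparing the Tables
Tables to prepare: the raw data tables iqiyivideo_ori and iqiyivideo_user_ori, and the final tables iqiyivideo_orc and iqiyivideo_user_orc.
》》1: Create the raw data tables:
(1) iqiyivideo_ori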
create table iqiyivideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
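To make the two delimiters in this definition concrete, here is a minimal sketch of how one raw text line maps onto the columns: fields are split on tabs, and the two array columns are split on "&". The sample line is made up for illustration and is not taken from the real data set; parse_video_line is a hypothetical helper, not part of Hive.

```python
def parse_video_line(line):
    """Split one raw line the way the table definition describes:
    fields on tab, array items on '&'."""
    f = line.rstrip("\n").split("\t")
    return {
        "videoId": f[0],
        "uploader": f[1],
        "age": int(f[2]),
        "category": f[3].split("&"),
        "length": int(f[4]),
        "views": int(f[5]),
        "rate": float(f[6]),
        "ratings": int(f[7]),
        "comments": int(f[8]),
        # relatedId may be missing or empty; treat that as an empty array
        "relatedId": f[9].split("&") if len(f) > 9 and f[9] else [],
    }

# Hypothetical sample line (not real data).
sample = "v001\tuserA\t700\tMusic&Comedy\t120\t5000\t4.5\t30\t12\tv002&v003"
record = parse_video_line(sample)
```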
(2) iqiyivideo_user_ori
create table iqiyivideo_user_ori(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as textfile;
》》2: Create the tables stored as ORC with Snappy compression:
(1) iqiyivideo_orc
create table iqiyivideo_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");
(2)iqiyivideo_user_orc
create table iqiyivideo_user_orc(
uploader string,
videos int,
friends int)
stored as orc
tblproperties("orc.compress"="SNAPPY");
(3) Load data into the ori tables
load data local inpath "/opt/module/data/video" into table iqiyivideo_ori;
load data local inpath "/opt/module/data/user" into table iqiyivideo_user_ori;
(4) Insert data into the orc tables
insert into table iqiyivideo_orc select * from iqiyivideo_ori;
insert into table iqiyivideo_user_orc select * from iqiyivideo_user_ori;
11.4 Business Analysis
11.4.1 Top 10 Videos by View Count
Approach: use ORDER BY on the views column for a global sort, and limit the output to the first 10 rows.
Final code:
SELECT
videoId,
views
FROM
iqiyivideo_orc
ORDER BY
views DESC
LIMIT 10;
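The same "global sort, then limit" idea can be checked on a toy data set in plain Python. This is only a sketch of the query's semantics; the rows and the helper name top_n_by_views are invented for illustration.

```python
# Hypothetical (videoId, views) rows standing in for iqiyivideo_orc.
rows = [("v1", 50), ("v2", 900), ("v3", 300), ("v4", 120)]

def top_n_by_views(rows, n):
    """Rough equivalent of ORDER BY views DESC LIMIT n."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

top2 = top_n_by_views(rows, 2)
```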
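11.4.2 Top 10 Video Categories by Popularity
Approach:
(1) Count how many videos each category contains and show the 10 categories with the most videos.
(2) Group by category, then count the videoId values in each group.
(3) In the current table one video belongs to one or more categories, so before grouping by category the category array must be expanded into one row per category (explode); then count.
(4) Finally, sort by popularity and show the first 10 rows.
Final code: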
SELECT
t1.category_name ,
COUNT(t1.videoId) hot
FROM
(
SELECT
videoId,
category_name
FROM
iqiyivideo_orc
lateral VIEW explode(category) iqiyivideo_orc_tmp AS category_name
) t1
GROUP BY
t1.category_name
ORDER BY
hot DESC
LIMIT 10;
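A small sketch of what the explode-then-count steps do, using an in-memory list instead of the Hive table. The sample videos and the function name category_heat are assumptions made for the example.

```python
from collections import Counter

# Hypothetical rows: each video carries an array of categories.
videos = [
    {"videoId": "v1", "category": ["Music", "Comedy"]},
    {"videoId": "v2", "category": ["Music"]},
    {"videoId": "v3", "category": ["Sports"]},
]

def category_heat(videos, n):
    # LATERAL VIEW explode(category): one (videoId, category_name) row per array item
    exploded = [(v["videoId"], c) for v in videos for c in v["category"]]
    # GROUP BY category_name, COUNT(videoId)
    counts = Counter(c for _, c in exploded)
    # ORDER BY hot DESC LIMIT n
    return counts.most_common(n)

hot = category_heat(videos, 2)
```

A video with two categories contributes one row to each of them, which is why the counts can add up to more than the number of videos.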
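11.4.3 Categories of the 20 Most-Viewed Videos and the Number of Top 20 Videos in Each Category
Approach:
(1) Find all the fields of the 20 most-viewed videos, sorted in descending order.
(2) Expand the category array of these 20 rows (explode).
(3) Finally, query each category name and how many Top 20 videos it contains.
Final code: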
SELECT
t2.category_name,
COUNT(t2.videoId) video_sum
FROM
(
SELECT
t1.videoId,
category_name
FROM
(
SELECT
videoId,
views ,
category
FROM
iqiyivideo_orc
ORDER BY
views
DESC
LIMIT 20
) t1
lateral VIEW explode(t1.category) t1_tmp AS category_name
) t2
GROUP BY t2.category_name;
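The order of the steps matters here: the limit is applied before the explode, so only the top videos contribute categories. A toy sketch under that reading (sample rows and the name categories_of_top_videos are invented):

```python
from collections import Counter

# Hypothetical rows standing in for iqiyivideo_orc.
videos = [
    {"videoId": "v1", "views": 900, "category": ["Music", "Comedy"]},
    {"videoId": "v2", "views": 800, "category": ["Music"]},
    {"videoId": "v3", "views": 100, "category": ["Sports"]},
]

def categories_of_top_videos(videos, n):
    # Inner query: ORDER BY views DESC LIMIT n
    top = sorted(videos, key=lambda v: v["views"], reverse=True)[:n]
    # explode(category) on the top rows only, then GROUP BY / COUNT
    return Counter(c for v in top for c in v["category"])

result = categories_of_top_videos(videos, 2)
```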
11.4.4 Ranking the Categories of the Videos Related to the Top 50 Most-Viewed Videos
Code:
SELECT
t6.category_name,
t6.video_sum,
rank() over(ORDER BY t6.video_sum DESC ) rk
FROM
(
SELECT
t5.category_name,
COUNT(t5.relatedid_id) video_sum
FROM
(
SELECT
t4.relatedid_id,
category_name
FROM
(
SELECT
t2.relatedid_id ,
t3.category
FROM
(
SELECT
relatedid_id
FROM
(
SELECT
videoId,
views,
relatedid
FROM
iqiyivideo_orc
ORDER BY
views
DESC
LIMIT 50
)t1
lateral VIEW explode(t1.relatedid) t1_tmp AS relatedid_id
)t2
JOIN
iqiyivideo_orc t3
ON
t2.relatedid_id = t3.videoId
) t4
lateral VIEW explode(t4.category) t4_tmp AS category_name
) t5
GROUP BY
t5.category_name
ORDER BY
video_sum
DESC
) t6;
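Because this query nests six levels deep, it can help to walk through the same pipeline on a tiny data set: take the top videos, explode their related ids, join back to the video table, explode the categories, count, then rank. The sketch below is an illustration with invented rows and a hypothetical function name, not a translation of Hive's execution plan.

```python
from collections import Counter

# Hypothetical video table keyed by videoId.
videos = {
    "v1": {"views": 900, "category": ["Music"], "relatedId": ["v2", "v3"]},
    "v2": {"views": 800, "category": ["Music", "Comedy"], "relatedId": ["v3"]},
    "v3": {"views": 100, "category": ["Sports"], "relatedId": []},
}

def related_category_rank(videos, n):
    # t1: top n videos by views
    top = sorted(videos.values(), key=lambda v: v["views"], reverse=True)[:n]
    # t2: explode(relatedid); duplicates are kept, as in the query
    related_ids = [rid for v in top for rid in v["relatedId"]]
    # t4: inner join back to the video table; unmatched ids are dropped
    joined = [videos[rid] for rid in related_ids if rid in videos]
    # t5: explode(category), then GROUP BY category_name with COUNT
    counts = Counter(c for v in joined for c in v["category"])
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # t6: rank() gives tied rows the same rank and skips the following ranks
    ranked = []
    for pos, (name, total) in enumerate(ordered, start=1):
        rk = ranked[-1][2] if ranked and ranked[-1][1] == total else pos
        ranked.append((name, total, rk))
    return ranked

ranking = related_category_rank(videos, 2)
```

In this toy run v3 is related to both top videos, so it is counted twice, which matches the query because it does not deduplicate related ids.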
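11.4.5 Top 10 Videos by Popularity Within Each Category, Using Music as the Example
Approach:
(1) To find the Top 10 videos in the Music category, the Music rows must be located first, which requires expanding the category array. The expanded rows can be stored in a separate table.
(2) Insert the data into that expanded table.
(3) Compute the video popularity for the chosen category (Music).
The query here performs the expansion inline in a subquery rather than in a separate table. It computes the Top 10 for Music; other categories work the same way.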
SELECT
t1.videoId,
t1.views,
t1.category_name
FROM
(
SELECT
videoId,
views,
category_name
FROM iqiyivideo_orc
lateral VIEW explode(category) iqiyivideo_orc_tmp AS category_name
)t1
WHERE
t1.category_name = "Music"
ORDER BY
t1.views DESC
LIMIT 10;
11.4.6 Top 10 Videos by View Count Within Each Category
Final code:
SELECT
t2.videoId,
t2.views,
t2.category_name,
t2.rk
FROM
(
SELECT
t1.videoId,
t1.views,
t1.category_name,
rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC ) rk
FROM
(
SELECT
videoId,
views,
category_name
FROM iqiyivideo_orc
lateral VIEW explode(category) iqiyivideo_orc_tmp AS category_name
)t1
)t2
WHERE t2.rk <= 10;
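One detail worth seeing on a small example is how rank() treats ties: tied rows share a rank, so filtering on rk <= N can return more than N rows per category. The sketch below imitates rank() over (PARTITION BY category ORDER BY views DESC) in plain Python; the data and the name top_per_category are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows; v2 and v3 tie on views.
videos = [
    {"videoId": "v1", "views": 900, "category": ["Music", "Comedy"]},
    {"videoId": "v2", "views": 800, "category": ["Music"]},
    {"videoId": "v3", "views": 800, "category": ["Music"]},
    {"videoId": "v4", "views": 100, "category": ["Music"]},
]

def top_per_category(videos, n):
    # explode(category) and partition the rows by category_name
    groups = defaultdict(list)
    for v in videos:
        for c in v["category"]:
            groups[c].append((v["videoId"], v["views"]))
    out = []
    for c, rows in groups.items():
        rows.sort(key=lambda r: r[1], reverse=True)
        prev_views, prev_rk = None, 0
        for pos, (vid, views) in enumerate(rows, start=1):
            # rank(): ties reuse the previous rank, otherwise use the position
            rk = prev_rk if views == prev_views else pos
            prev_views, prev_rk = views, rk
            if rk <= n:
                out.append((vid, views, c, rk))
    return out

top2 = top_per_category(videos, 2)
```

Here Music yields three rows for n = 2 because v2 and v3 both get rank 2. If exactly N rows per category are wanted, row_number() is the usual alternative to rank().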
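11.4.7 Top 10 Users by Number of Uploaded Videos and Their Top 20 Videos by View Count
Approach:
(1) Find the 10 users who have uploaded the most videos.
(2) Join with the iqiyivideo_orc table to get all videos uploaded by these 10 users, then take the top 20 by view count.
Final code: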
SELECT
t2.videoId,
t2.views,
t2.uploader
FROM
(
SELECT
uploader,
videos
FROM iqiyivideo_user_orc
ORDER BY
videos
DESC
LIMIT 10
) t1
JOIN iqiyivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY
t2.views DESC
LIMIT 20;
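As a sanity check on what this query returns, the sketch below runs the same two steps on invented data: pick the top uploaders, join to their videos, then take one global top list by views. Note that the final limit applies across all selected users together, not per user. All names and rows are hypothetical.

```python
# Hypothetical (uploader, videos) rows standing in for iqiyivideo_user_orc.
users = [("alice", 30), ("bob", 20), ("carol", 5)]
# Hypothetical (videoId, views, uploader) rows standing in for iqiyivideo_orc.
videos = [
    ("v1", 100, "alice"),
    ("v2", 900, "bob"),
    ("v3", 999, "carol"),
    ("v4", 50, "alice"),
]

def top_videos_of_top_uploaders(users, videos, n_users, n_videos):
    # t1: ORDER BY videos DESC LIMIT n_users
    top_users = {u for u, _ in sorted(users, key=lambda r: r[1], reverse=True)[:n_users]}
    # JOIN ON t1.uploader = t2.uploader
    joined = [v for v in videos if v[2] in top_users]
    # ORDER BY views DESC LIMIT n_videos (one global list)
    return sorted(joined, key=lambda v: v[1], reverse=True)[:n_videos]

result = top_videos_of_top_uploaders(users, videos, 2, 2)
```

The most-viewed video overall (v3) is excluded because its uploader is not among the top uploaders.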
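Appendix: Common Errors and Solutions
》》1: If jobs hang after switching to the Tez engine, try adjusting the Capacity Scheduler's resource scheduling settings. In $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml, change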
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
to
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>1</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
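》》2: Cannot connect to the MySQL database
(1) The wrong driver package was copied: mysql-connector-java-5.1.27-bin.jar should be placed in /opt/module/hive/lib, but mysql-connector-java-5.1.27.tar.gz was placed there by mistake.
(2) The host in the user table was set to localhost instead of being changed to % for all entries.
》》3: Hive's default input format is CombineHiveInputFormat, which combines small files.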
hive (default)> set hive.input.format;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
Switching to HiveInputFormat makes the number of output files follow the number of partitions.
hive (default)> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
》》3: MapReduce programs cannot run
YARN may not have been started.
》》4: Starting the MySQL service reports "MySQL server PID file could not be found!"
Create hadoop102.pid under /var/lock/subsys/mysql and write 4396 into the file.
》》5: "service mysql status" reports "MySQL is not running, but lock file (/var/lock/subsys/mysql[FAILED])"
Solution: in /var/lib/mysql, create the file listed here and set its permissions to 777:
-rw-rw----. 1 mysql mysql 5 Dec 22 16:41 hadoop102.pid
》》6: JVM heap out of memory
Description: java.lang.OutOfMemoryError: Java heap space
Solution: add the following to yarn-site.xml
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1024m</value>
</property>
》》7: Virtual memory limit
Add the following configuration to yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
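For context, the check that this setting turns off compares a container's virtual memory use against its physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio. A minimal arithmetic sketch with the values used in this appendix (the function name is hypothetical):

```python
def vmem_limit_mb(container_mb, vmem_pmem_ratio):
    """Virtual memory ceiling (MB) for a container when the check is enabled."""
    return container_mb * vmem_pmem_ratio

# 2048 MB containers and a ratio of 2.1, as configured earlier in this appendix.
limit = vmem_limit_mb(2048, 2.1)
```

With these values the ceiling is about 4300 MB of virtual memory per 2048 MB container. Disabling the check removes that ceiling, so it is usually worth trying a larger ratio as well.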