IT数码 购物 网址 头条 软件 日历 阅读 图书馆
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
图片批量下载器
↓批量下载图片,美女图库↓
图片自动播放器
↓图片自动播放器↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁
 
   -> 大数据 -> 怎样通过explain执行计划,来优化SQL(以hive为例) -> 正文阅读

[大数据]怎样通过explain执行计划,来优化SQL(以hive为例)

适用场景

Hive SQL在执行之前会将SQL转换为MapReduce任务,因此需要了解具体的转换过程。可以通过explain关键字来查看具体的执行计划。通过执行计划能看到 SQL 程序转换成相应计算引擎的执行逻辑,从而能够实现更有针对性的优化(比如有些看似等价的SQL写法实则不等价)。

使用语法就是在查询语句之前加上explain关键字:

explain select * from ads.ads_grow_ad_material_statistic_df? where date = 20220609;

前置知识

一个 Hive 查询被转换为一个有向无环图DAG,包含一个或多个stage(阶段),不同stage之间会存在依赖关系。越复杂的查询通常会引入更多的stage,通常stage越多则需要越多时间来完成任务。

默认hive一次执行一个stage,一个stage可以是一个mapreduce任务,也可以是一个抽样阶段,或者是一个合并阶段,还可以是一个limit阶段。

我们拿一个查询简单举个例子:

-- 根据app_id和plan_id来查看收入
explain select app_id, plan_id, sum(pay_mnt) 
from ads.ads_grow_ad_material_statistic_df? 
where date = 20220609 
and material_id = 'ALL' 
group by app_id, plan_id;

explain得出的结果如下:

STAGE DEPENDENCIES:?????????????????????????????????????????????????????????????????????????????????????????? |
|?? Stage-1 is a root stage??????????????????????????????????????? ?????????????????????????????????????????????|
|?? Stage-0 depends on stages: Stage-1????????????????????????????????????????????????????????????????????????? |
|????????????????????????????????????????????????????????????????????????????????????????????? ?????????????????|
| STAGE PLANS:????????????????????????????????????????????????????????????????????????????????????????????????? |
|?? Stage: Stage-1????????????????????????????????????????????????????????????????????????????????????????????? |
|???? Map Reduce??????????????????????????????????????????????????????????????????????????????????????????????? |
|?????? Map Operator Tree:????????????????????????????????????????????????????????????????????????????????????? |
|?????????? TableScan??????????????? ???????????????????????????????????????????????????????????????????????????|
|???????????? alias: ads_grow_ad_material_statistic_df????????????????????????????????????????????????????????? |
|???????????? Statistics: Num rows: 262845 Data size: 7096850 Basic stats: COMPLETE Column stats: NONE????????? |
|???????????? Filter Operator?????????????????????????????????????????????????????????????????????????????????? |
|?????????????? predicate: (material_id = 'ALL') (type: boolean)???????????????????????????? ???????????????????|
|?????????????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE??????? |
|?????????????? Select Operator???????????????????????????????????????????????????????????????????????????????? |
|????? ???????????expressions: app_id (type: bigint), plan_id (type: bigint), pay_mnt (type: bigint)??????????? |
|???????????????? outputColumnNames: app_id, plan_id, pay_mnt?????????????????????????????????????????????????? |
|???????????????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE????? |
|???????????????? Group By Operator???????????????????????????????????????????????????????????????????????????? |
|?????????????????? aggregations: sum(pay_mnt)???????????????? ?????????????????????????????????????????????????|
|?????????????????? keys: app_id (type: bigint), plan_id (type: bigint)???????????????????????????????????????? |
|?????????????????? mode: hash???????????????????????????????????????????????????????????? ?????????????????????|
|?????????????????? outputColumnNames: _col0, _col1, _col2????????????????????????????????????????????????????? |
|?????????????????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE??? |
|??? ???????????????Reduce Output Operator????????????????????????????????????????????????????????????????????? |
|???????????????????? key expressions: _col0 (type: bigint), _col1 (type: bigint)?????????????????????????????? |
|???????????????????? sort order: ++??????????????????????????????????????????????????????????????????????????? |
|???????????????????? Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint)????????????????? |
|???????????????????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE? |
|???????????????????? value expressions: _col2 (type: bigint)?????????????????????????????????????????????????? |
|?????? Reduce Operator Tree:??????????????????????????????????????????????????????????? ???????????????????????|
|???????? Group By Operator???????????????????????????????????????????????????????????????????????????????????? |
|?????????? aggregations: sum(VALUE._col0)????????????????????????????????????????????????????????????????????? |
|? ?????????keys: KEY._col0 (type: bigint), KEY._col1 (type: bigint)??????????????????????????????????????????? |
|?????????? mode: mergepartial????????????????????????????????????????????????????????????????????????????????? |
|?????????? outputColumnNames: _col0, _col1, _col2????????????????????????????????????????????????????????????? |
|?????????? Statistics: Num rows: 65711 Data size: 1774205 Basic stats: COMPLETE Column stats: NONE???????????? |
|?????????? File Output Operator?????????????????????????? ?????????????????????????????????????????????????????|
|???????????? compressed: false???????????????????????????????????????????????????????????????????????????????? |
|???????????? Statistics: Num rows: 65711 Data size: 1774205 Basic stats: COMPLETE Column stats: NONE?????????? |
|???????????? table:??????????????????????????????????????????????????????????????????????????????????????????? |
|???????????????? input format: org.apache.hadoop.mapred.TextInputFormat??????????????????????????????????????? |
|???????????????? output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat???????????????????? |
|???????????????? serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe???????????????????????????????????? |
|??????????????????????????? ???????????????????????????????????????????????????????????????????????????????????|
|?? Stage: Stage-0????????????????????????????????????????????????????????????????????????????????????????????? |
|???? Fetch Operator???????????????????????????????????? ???????????????????????????????????????????????????????|
|?????? limit: -1?????????????????????????????????????????????????????????????????????????????????????????????? |
|?????? Processor Tree:????????????????????????????????????????????????????????????? ???????????????????????????|
|???????? ListSink??

我们将上述结果拆分看,先从最外层开始,包含两个大的部分:

先看第一部分 stage dependencies ,包含两个 stage,stage-1 是根 stage,是开始的 stage,包含了,而stage-0 依赖 stage-1,stage-1 执行完成后执行 stage-0。

  • stage dependencies:各个 stage 之间的依赖性
  • stage plan:各个 stage 的执行计划

再看第二部分 stage plan,其中stage1包含了这个job的大部分处理过程,会触发一个mapreduce。里面有一个 Map Reduce,一个 MR 的执行计划分为两个部分:

  • Map Operator Tree:MAP 端的执行计划树
  • Reduce Operator Tree:Reduce 端的执行计划树

这两个执行计划树里面包含这条 sql 语句的 operator:

Key

Desciption

Detail

TableScan

map 端第一个操作肯定是加载表,所以就是表扫描操作,其实就是from

tablescan以这个表作为输入,然后会产生一个只有字段number的输出

alias

表名称

从哪个表查

Statistics

  1. 表统计信息,包含表中数据条数,数据大小等;
  2. join 完成之后生成的数据条数,大小等;
  3. 分组聚合之后的数据条数,数据大小等

Select Operator

选取操作,常见的属性

expressions

需要的字段名称及字段类型

outputColumnNames

  1. 输出的列名称;
  2. 聚合之后输出列名;
  3. join 完成之后输出的字段

Group By Operator

分组聚合操作

有group by 时会触发该operator,会产生一个输出字段_col0,这是为临时字段按规则起的临时字段名

aggregations

显示聚合函数信息

mode

聚合模式:

  1. hash:随机聚合,就是 hash partition;
  2. partial:局部聚合;
  3. final:最终聚合

keys

  1. 分组的字段,如果没有分组,则没有此字段;
  2. 如果有join,也会作为join的条件字段;

Reduce Output Operator

输出到 reduce 操作,常见属性;

sort order

  1. 值为空不排序;
  2. 值为 + 正序排序,值为 - 倒序排序;
  3. 值为 +- 排序的列为两列,第一列为正序,第二列为倒序;

Filter Operator

过滤操作

对应where或having来触发operator

predicate

过滤条件,如 sql 语句中的 where id>=1,则此处显示 (id>= 1)

Map Join Operator

join 操作,常见的属性

condition map

join 方式 ,如 Inner Join 0 to 1 Left Outer Join0 to 2

File Output Operator

文件输出操作

compressed

是否压缩

table

表的信息,包含输入输出文件格式化方式,序列化方式等

Fetch Operator

?客户端获取数据操作

一般是Stage-0 中

limit

值为 -1 表示不限制条数,其他值为限制的条数

一般是Stage-0 中

实例

. join时,是否需要额外过滤掉 null 的值

我们在hive client中输入以下查询计划语句:

SELECT? game_dict.package_name
?????? ,game_dict.game_name
?????? ,temp.business
FROM
(
??? SELECT? package_name
?????????? ,business
??? FROM ods.ods_grow_ad_plan_info
??? WHERE date = 20220609
??? AND deli_busi = 3
) temp
JOIN
(
??? SELECT? package_name
?????????? ,game_name
??? FROM gamesdk.game_info_all
??? WHERE date = 20220609
) game_dict
ON temp.package_name = game_dict.package_name
LIMIT 10;

问题是:上面这条join会过滤掉package_name 为NULL的值吗?

通过explain来做下验证:(仅截取了部分输出信息)

TableScan?????????????????????????????????????????????????????????????????????????????????????????? |
alias: ods_grow_ad_plan_info????????????????????????????????????????????????????????????????????? |
?? Statistics: Num rows: 2593 Data size: 300879 Basic stats: COMPLETE Column stats: NONE???????????? ||???????????? Filter Operator?????????????????????????????????????????????????????????????????????????????????? |
?? predicate: ((deli_busi = 3) and package_name is not null) (type: boolean)????????????????????? ?|
?? Statistics: Num rows: 648 Data size: 75190 Basic stats: COMPLETE Column stats: NONE???????????? |
?? Select Operator???????????????????????????????????????????????????????????????????????????????? |
???? expressions: package_name (type: string), business (type: bigint)???????????????????????????? |
???? outputColumnNames: _col0, _col1?????????????????????????????????????????????????????????????? |
???? Statistics: Num rows: 648 Data size: 75190 Basic stats: COMPLETE Column stats: NONE?????????? |
??? HashTable Sink Operator?????????????????????????????????????????????????????????????????????? |
?????? keys:?????????????????????????????????????????????????????????????????????????????????????? |
???????? 0 _col0 (type: string)??????????????????????????? ????????????????????????????????????????|
???????? 1 _col0 (type: string)?????
...

从上述结果可以看到 predicate: package_name is not null,说明 join 时会自动过滤掉关联字段为 null 值的情况,但 left/right join 或 full join 是不会自动过滤 null 值的。

. 做了group by后,是否还需要对语句做排序?

SELECT? app_id
?????? ,plan_id
?????? ,SUM(pay_mnt)
FROM ads.ads_grow_ad_material_statistic_df
WHERE date = 20220609
AND material_id = 'ALL'
GROUP BY? app_id
???????? ,plan_id;

问:对app_id,plan_id做group by后,是否还需要对语句进行排序?

直接来看 explain 之后结果 (仅截取了部分输出信息):

TableScan?????????????????????????????????????????????????????????????????????????????????????????? |
??? alias: ads_grow_ad_material_statistic_df????????????????????????????????????????????????????????? |
??? Statistics: Num rows: 262845 Data size: 7096850 Basic stats: COMPLETE Column stats: NONE????????? |
??? Filter Operator?????????????????????????????????????????????????????????????????????????????????? |
??????? predicate: (material_id = 'ALL') (type: boolean)????????????????????????????????????????? ??????|
??????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE??????? |
??????? Select Operator???????????????????????????????????????????????????????????????????????????????? |
??????????? expressions: app_id (type: bigint), plan_id (type: bigint), pay_mnt (type: bigint)??????????? |
??????????? outputColumnNames: app_id, plan_id, pay_mnt?????????????????????????????????????????????????? |
??????????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE????? |
??????????? Group By Operator???????????????????????????????????????????????????????????????????????????? |
??????????????? aggregations: sum(pay_mnt)????????????????????????????????????????????????????????????????? |
????? ??????????keys: app_id (type: bigint), plan_id (type: bigint)???????????????????????????????????????? |
??????????????? mode: hash????????????????????????????????????????????????????????????????????????????????? |
??????????????? outputColumnNames: _col0, _col1, _col2????????????????????????????????????????????????????? |
??????????????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE??? |
??????????????? Reduce Output Operator??????????????????????????????????????? ??????????????????????????????||???????????????????? key expressions: _col0 (type: bigint), _col1 (type: bigint)?????????????????????????????? |
??????????????????? sort order: ++??????????????????????????????????????????????????????????????????????????? |
??????????????????? Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint)????????????????? |
??????????????????? Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE? |
??????????????????? value expressions: _col2 (type: bigint)
...???

分组主要看Group By Operator,里面的keys: app_id (type: bigint), plan_id (type: bigint) 表示是按照appid和planid来进行分组,再往下走有sort order:++,说明是按照两个字段进行正序排序的。

. 子查询和join哪个执行效率高?

有这样两个SQL:

-- SQL1
select? a.id, b.id
from (
??? select id
??? from ods.ods_grow_ad_plan_info
??? where date = 20220609
) a
join (
??? select id, caid
??? from ods.ods_grow_ad_create_info
??? where date = 20220609
) b
on a.id = b.caid;

-- SQL2
SELECT? a.id
FROM
(
??? SELECT? id
??? FROM ods.ods_grow_ad_plan_info
??? WHERE date = 20220609
) a
WHERE a.id IN ( SELECT id FROM ods.ods_grow_ad_create_info WHERE date = 20220609)

这两条 sql 语句输出的结果是一样的,但是哪条 sql 执行效率高?高的话高在哪里?

第一条的explain:

Map Reduce Local Work??????????????????????????????????????????????????????????????????????????????????? |
?? Alias -> Map Local Tables:???????????????????????????????????????????????????????????????????????????? |
???? a:ods_grow_ad_plan_info??????????? ??????????????????????????????????????????????????????????????????|
?????? Fetch Operator???????????????????????????????????????????????????????????????????????????????????? |
???????? limit: -1????????????????????????????????????????????????????????????? ??????????????????????????|
?? Alias -> Map Local Operator Tree:????????????????????????????????????????????????????????????????????? |
???? a:ods_grow_ad_plan_info????????????????????????????????????????????????????????????????????????????? |
?????? TableScan????????????????????????????????????????????????????????????????????????????????????????? |
???????? alias: ods_grow_ad_plan_info???????????????????????????????????????????????????????????????????? |
???????? Statistics: Num rows: 37609 Data size: 300879 Basic stats: COMPLETE Column stats: NONE?????????? |
???????? Filter Operator????????????????????????????????????????????????????????????????????????????????? |
?????????? predicate: id is not null (type: boolean)??????????????????????????????????????? ??????????????|
?????????? Statistics: Num rows: 18805 Data size: 150443 Basic stats: COMPLETE Column stats: NONE???????? |
?????????? Select Operator? ??????????????????????????????????????????????????????????????????????????????|
???????????? expressions: id (type: bigint)?????????????????????????????????????????????????????????????? |
???????????? outputColumnNames: _col0???????????????????????????????????????????????????????????????????? |
???????????? Statistics: Num rows: 18805 Data size: 150443 Basic stats: COMPLETE Column stats: NONE?????? |
???????????? HashTable Sink Operator????????????????????????????????????????????????????????????????????? |
?????????????? keys:??????????????????????????????????????????????????????????????????????????????????? ??|
???????????????? 0 _col0 (type: bigint)?????????????????????????????????????????????????????????????????? |
???????????????? 1 _col1 (type: bigint)
...

第二条的explain:

Map Reduce Local Work??????????????????????????????????????????????????????????? ?????????????????????????|
?? Alias -> Map Local Tables:????????????????????????????????????????????????????????????????????????????? |
???? sq_1:ods_grow_ad_create_info????????????????????????????????????????????????????????????????????????? |
?????? Fetch Operator????????????????????????????????????????????????????????????????????????????????????? |
???????? limit: -1???????????????????????????????????????????????????????????????????????????????????????? |
?? Alias -> Map Local Operator Tree:???????????? ??????????????????????????????????????????????????????????|
???? sq_1:ods_grow_ad_create_info????????????????????????????????????????????????????????????????????????? |
?????? TableScan?????????????????????????????????????????????????????????????????????? ????????????????????|
???????? alias: ods_grow_ad_create_info??????????????????????????????????????????????????????????????????? |
???????? Statistics: Num rows: 242672 Data size: 1941378 Basic stats: COMPLETE Column stats: NONE????????? |
???????? Filter Operator?????????????????????????????????????????????????????????????????????????????????? |
?????????? predicate: id is not null (type: boolean)?????????????????????????????????????????????????????? |
?????????? Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE???????? |
?????????? Select Operator???????????????????????????????????????????????????????????????????????????????? |
???????????? expressions: id (type: bigint)???????????????????????????????????????????????? ???????????????|
???????????? outputColumnNames: _col0????????????????????????????????????????????????????????????????????? ||???????????????? Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE?????? |
???????????? Group By Operator???????????????????????????????????????????????????????????????????????????? |
?????????????? keys: _col0 (type: bigint)????????????????????????????????????????????????????????????????? |
?????????????? mode: hash????????????????????????????? ????????????????????????????????????????????????????|
?????????????? outputColumnNames: _col0??????????????????????????????????????????????????????????????????? |
?????????????? Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE???? |
?????????????? HashTable Sink Operator???????????????????????????????????????????????????????????????????? |
???????????????? keys:???????????????????????????????????????????????????????????????????????????????????? |
?????????????????? 0 _col0 (type: bigint)????????????????????????????????????????????????????????????????? |
?????????????????? 1 _col0 (type: bigint)?
?...

可以看到第二条的执行相比第一条,多了一个Group By的Operator,经过了一个hash partition的随机聚合,因此可以用join来代替子查询来提高性能。

. 面试题:比较两个sql的效率

-- SQL1
SELECT
?a.id,
?b.user_name
FROM
?test1 a
JOIN test2 b ON a.id = b.id
WHERE
?a.id > 2;
?
-- SQL2
SELECT
?a.id,
?b.user_name
FROM
(SELECT * FROM test1 WHERE id > 2) a
JOIN test2 b ON a.id = b.id;

这两条 sql 语句输出的结果是一样的,但哪条 sql 执行效率高呢?

  • 可能是第一条 sql 执行效率高,因为第二条 sql 有子查询,子查询会影响性能;
  • 也可能是第二条 sql 执行效率高,因为先过滤之后,在进行 join 时的条数减少了,所以执行效率就高了。

explain两个sql后来发现(这里不贴全部了):除了表别名不一样,其他的执行计划完全一样,都是先进行 where 条件过滤,在进行 join 条件关联。说明 hive 底层会自动帮我们进行优化,所以这两条 sql 语句执行效率是一样的。

参考资料

1. 《Hive编程指南》

2. 《Hive实战》

3. 《Hive性能调优实战》

读懂 explain 的执行计划有利于我们优化 Hive SQL,同时也能提升我们对 SQL 的掌控力。如查看 stage 的依赖情况、排查数据倾斜、hive 调优等,小伙伴们阔以自行尝试。

  大数据 最新文章
实现Kafka至少消费一次
亚马逊云科技:还在苦于ETL?Zero ETL的时代
初探MapReduce
【SpringBoot框架篇】32.基于注解+redis实现
Elasticsearch:如何减少 Elasticsearch 集
Go redis操作
Redis面试题
专题五 Redis高并发场景
基于GBase8s和Calcite的多数据源查询
Redis——底层数据结构原理
上一篇文章      下一篇文章      查看所有文章
加:2022-06-16 21:45:45  更:2022-06-16 21:46:12 
 
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁

360图书馆 购物 三丰科技 阅读网 日历 万年历 2024年11日历 -2024/11/23 16:39:49-

图片自动播放器
↓图片自动播放器↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  IT数码