
[Big Data] Does the WITH ... AS statement really store data in memory? (Source code analysis)

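0. Preface

Several friends have asked me about this recently. Their questions fall into two groups:

1. Why didn't my query get any faster after I rewrote it with WITH ... AS?

2. My SQL won't finish. Would rewriting it with WITH ... AS help?

Almost every blog post online concludes that the WITH ... AS statement keeps its data in memory.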

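1. Hive SQL

Hive has a parameter:

hive.optimize.cte.materialize.threshold

By default this parameter is -1 (disabled). When it is enabled (set to a value greater than 0), for example 2, a WITH ... AS clause that is referenced 2 or more times is materialized as a table. The clause then runs only once, which improves efficiency.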
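As a rough illustration of the rule described above, the session below sets the threshold to 2 and explains two queries against the same CTE. It is a sketch under assumptions: it presumes a Hive session with a table test(id, source, channel), and the behaviour noted in the comments follows the description above rather than verified output.

```sql
-- Assumption: a Hive session with a table test(id, source, channel).
SET hive.optimize.cte.materialize.threshold=2;

-- atable is referenced twice, which reaches the threshold of 2, so per the
-- rule above it should be materialized and test scanned only once.
EXPLAIN
WITH atable AS (
    SELECT id, source, channel
    FROM test
)
SELECT source FROM atable WHERE channel = '直播'
UNION ALL
SELECT source FROM atable WHERE channel = '视频';

-- atable is referenced once, below the threshold, so it should be treated
-- as an ordinary inline subquery.
EXPLAIN
WITH atable AS (
    SELECT id, source, channel
    FROM test
)
SELECT source FROM atable WHERE channel = '直播';

-- Restore the default (materialization disabled).
SET hive.optimize.cte.materialize.threshold=-1;
```

Comparing the two plans is one way to see whether the reference count, rather than the mere presence of WITH ... AS, decides what happens.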

1.1 Test

-- '直播' = live stream, '视频' = video
EXPLAIN
WITH atable AS (
    SELECT id, source, channel
    FROM test
)
SELECT source FROM atable WHERE channel = '直播'
UNION ALL
SELECT source FROM atable WHERE channel = '视频'

Execution plan when the parameter is not set:


STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-0 depends on stages: Stage-1

STAGE PLANS:
 Stage: Stage-1
  Map Reduce
   Map Operator Tree:
     TableScan
      alias: test
      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
      Filter Operator
       predicate: (channel = '直播') (type: boolean)
       Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
       Select Operator
        expressions: source (type: string)
        outputColumnNames: _col0
        Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
        Union
         Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
         File Output Operator
          compressed: false
          Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          table:
            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
     TableScan
      alias: test
      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
      Filter Operator
       predicate: (channel = '视频') (type: boolean)
       Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
       Select Operator
        expressions: source (type: string)
        outputColumnNames: _col0
        Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
        Union
         Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
         File Output Operator
          compressed: false
          Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          table:
            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 Stage: Stage-0
  Fetch Operator
   limit: -1
   Processor Tree:
    ListSink

The execution plan shows that the test table is read twice.

After running set hive.optimize.cte.materialize.threshold=1, the execution plan becomes:


STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-6 depends on stages: Stage-1 , consists of Stage-3, Stage-2, Stage-4
 Stage-3
 Stage-0 depends on stages: Stage-3, Stage-2, Stage-5
 Stage-8 depends on stages: Stage-0
 Stage-2
 Stage-4
 Stage-5 depends on stages: Stage-4
 Stage-7 depends on stages: Stage-8

STAGE PLANS:
 Stage: Stage-1
  Map Reduce
   Map Operator Tree:
     TableScan
      alias: test
      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
      Select Operator
       expressions: id (type: int), source (type: string), channel (type: string)
       outputColumnNames: _col0, _col1, _col2
       Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
       File Output Operator
        compressed: false
        Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
        table:
          input format: org.apache.hadoop.mapred.TextInputFormat
          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          name: default.atable

 Stage: Stage-6
  Conditional Operator

 Stage: Stage-3
  Move Operator
   files:
     hdfs directory: true
     destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b/.hive-staging_hive_2021-03-24_20-17-38_169_5695913330535939856-1/-ext-10002

 Stage: Stage-0
  Move Operator
   files:
     hdfs directory: true
     destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b

 Stage: Stage-8
  Map Reduce
   Map Operator Tree:
     TableScan
      alias: atable
      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
      Filter Operator
       predicate: (channel = '直播') (type: boolean)
       Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
       Select Operator
        expressions: source (type: string)
        outputColumnNames: _col0
        Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
        Union
         Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
         File Output Operator
          compressed: false
          Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          table:
            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
     TableScan
      alias: atable
      Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
      Filter Operator
       predicate: (channel = '视频') (type: boolean)
       Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
       Select Operator
        expressions: source (type: string)
        outputColumnNames: _col0
        Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
        Union
         Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
         File Output Operator
          compressed: false
          Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          table:
            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 Stage: Stage-2
  Map Reduce
   Map Operator Tree:
     TableScan
      File Output Operator
       compressed: false
       table:
         input format: org.apache.hadoop.mapred.TextInputFormat
         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
         name: default.atable

 Stage: Stage-4
  Map Reduce
   Map Operator Tree:
     TableScan
      File Output Operator
       compressed: false
       table:
         input format: org.apache.hadoop.mapred.TextInputFormat
         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
         name: default.atable

 Stage: Stage-5
  Move Operator
   files:
     hdfs directory: true
     destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b/.hive-staging_hive_2021-03-24_20-17-38_169_5695913330535939856-1/-ext-10002

 Stage: Stage-7
  Fetch Operator
   limit: -1
   Processor Tree:
    ListSink

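The plan shows that the CTE has been materialized: test is scanned once in Stage-1 and written out as default.atable, and both branches of the union in Stage-8 then read atable instead of test.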

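1.2 Source code

In the source code, Hive checks this parameter while fetching metadata: it compares the configured threshold against the number of times the CTE is referenced.

2. Spark SQL

Spark does relatively little with CTEs. On the Spark side, I have not yet found any related optimization parameter.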

WITH atable AS (
    SELECT content_type,
           channel,
           channel_note,
           enter_method,
           enter_method_note
    FROM search_dw.dim_ecom_enter_channel_df
)
SELECT channel FROM atable WHERE content_type = '直播'
UNION ALL
SELECT channel FROM atable WHERE content_type = '视频'
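To see how Spark actually plans this query, one option is to prefix it with EXPLAIN and count how many times the source table is scanned in the physical plan, mirroring the Hive test in section 1.1. This is only a sketch: it assumes a Spark SQL session in which search_dw.dim_ecom_enter_channel_df is readable, and no plan output is shown here because it depends on the Spark version and the table format.

```sql
-- Assumption: run in a Spark SQL session (for example the spark-sql shell)
-- where search_dw.dim_ecom_enter_channel_df exists.
EXPLAIN
WITH atable AS (
    SELECT content_type,
           channel,
           channel_note,
           enter_method,
           enter_method_note
    FROM search_dw.dim_ecom_enter_channel_df
)
SELECT channel FROM atable WHERE content_type = '直播'
UNION ALL
SELECT channel FROM atable WHERE content_type = '视频';
```

If the source table appears in two separate scan nodes, the CTE body is being evaluated once per reference rather than being stored and reused.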

Reference: WeChat official account 数据仓库与Python大数据, 《with as 语句真的会把数据存内存嘛?(源码剖析)》

Added: 2021-09-29 10:21:27  Updated: 2021-09-29 10:21:45
 