IT数码 购物 网址 头条 软件 日历 阅读 图书馆
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
图片批量下载器
↓批量下载图片,美女图库↓
图片自动播放器
↓图片自动播放器↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁
 
   -> 大数据 -> Hive中LEFT OUTER JOIN和INNER JOIN连用时数据缺失问题 -> 正文阅读

[大数据]Hive中LEFT OUTER JOIN和INNER JOIN连用时数据缺失问题

山重水复疑无路,柳暗花明又一村。
——陆游《游山西村》


前言

JOIN连接是SQL常用的关联方式,但他们之前连用时可能会出现数据缺失的情况,本文分享生产中的bug案例,目前已有解决方案,具体原因为个人理解,如有错误,请各位小伙伴解答。


一、具体场景

Hive建表时,需要用到left outer join加上inner join,当他们连用时,发现数据缺失严重。

SELECT
    t1.arrange_date
   ,t2.fmaterialnumber                                                                                     AS new_material_code
   ,t1.fbillno                                                                                             AS bill_code
   ,MAX(t2.ftaxprice)                                                                                      AS price_unit
   ,SUM(t2.frealqty)                                                                                       AS sale_out_qty
   ,SUM(t2.fallamount_lc)                                                                                  AS sale_out_price_total_rmb
   ,IF((t1.arrange_date <= '2021-05-31' AND t1.fbillno IN ('OUSO20210319321696', 'OUSO20210322323127'))
       ,'Lumidept00165', t1.fsaledeptnumber)                                                               AS dept_code
   ,t1.salesman_code
   ,t2.fmainid                                                                                             AS main_id
   ,t1.system_source
   ,t1.fcustnumber
   ,t2.warehouse_code
   ,t1.fbilltypenumber
   ,t2.delivery_order_id
   ,t1.f_aqa_recaddr
   ,t1.fsaleorgid                                                                                          AS sales_organization
   ,SUM(t2.fcostamount_lc)                                                                                 AS cost_rmb
   ,t2.fsrcbillno
   ,t1.fbilltypename
   ,t2.fentryid
   ,t1.fbasecurrnumber                                                                                     AS currency_code
   ,t2.funitnumber                                                                                         AS unit
   ,SUM(t2.famount)                                                                                        AS sale_out_amt_total
   ,SUM(t2.famount_lc)                                                                                     AS sale_out_amt_total_rmb
   ,t1.fsettleorgid
FROM
(
    SELECT
        t3.fid
       ,t3.fbillno
       ,t3.system_source
       ,t3.arrange_date
       ,t3.fsaledeptnumber
       ,t3.salesman_code
       ,t3.fcustnumber
       ,t3.fbilltypenumber
       ,t3.fbilltypename
       ,t3.f_aqa_recaddr
       ,t3.fsaleorgid
       ,t3.fbasecurrnumber
       ,t4.fsettleorgid
    FROM
    (
        SELECT
            fid
           ,fbillno
           ,system_source
           ,arrange_date
           ,fsaledeptnumber
           ,salesman_code
           ,fcustnumber
           ,fbilltypenumber
           ,fbilltypename
           ,f_aqa_recaddr
           ,fsaleorgid
           ,fbasecurrnumber
        FROM tmp.xxx
        WHERE rn = 1
        AND is_delete IS NOT TRUE
    ) t3
    LEFT OUTER JOIN
    (
        SELECT
            fmainid
           ,fsettleorgid
        FROM tmp.yyy
        WHERE rn = 1
        AND is_delete IS NOT TRUE
    ) t4
    ON t3.fid = t4.fmainid
) t1
INNER JOIN
(
    SELECT
        fmainid
       ,ftaxprice
       ,frealqty
       ,fallamount_lc
       ,fmaterialnumber
       ,delivery_order_id
       ,warehouse_code
       ,fcostamount_lc
       ,fsrcbillno
       ,fentryid
       ,funitnumber
       ,famount
       ,famount_lc
    FROM tmp.zzz
    WHERE rn = 1
    AND is_delete IS NOT TRUE
) t2
ON t1.fid = t2.fmainid
GROUP BY
    t1.arrange_date
   ,t2.fmaterialnumber
   ,t1.fbillno
   ,t1.fsaledeptnumber
   ,t1.salesman_code
   ,t2.fmainid
   ,t1.system_source
   ,t1.fcustnumber
   ,t1.fbilltypenumber
   ,t2.delivery_order_id
   ,t1.f_aqa_recaddr
   ,t1.fsaleorgid
   ,t2.warehouse_code
   ,t2.fsrcbillno
   ,t1.fbilltypename
   ,t2.fentryid
   ,t1.fbasecurrnumber
   ,t2.funitnumber
   ,t1.fsettleorgid
;

在这里插入图片描述
在这里插入图片描述

二、问题分析思路

  1. 将逻辑分步跑,发现在inner join后数据开始不正确。
  2. 将inner join换为left outer join结果也依然是数据缺失。
  3. 网上博文启发,Inner join和Outer join一起使用的注意点,如下图所示。
  4. 查各关联条件发现为同一关联条件,但各表中关联条件的数据量并不一致,就像下图案例一样,因为inner join的关联条件宠物类型是需要有宠物才会有相应宠物类型,有一个人没有养宠物,那他实际上就不会有相应的宠物类型,这时关联会出现数据缺失。
    在这里插入图片描述
    在这里插入图片描述
    在这里插入图片描述
-- 每个人养宠物的情况
select
   t1.personname 
   ,t2.petname
from
(
    select
        personid 
        ,personname   
    from people
) t1
left outer join
(
    select
        ownerid 
        ,petname 
    from pets 
) t2
on t1.personid = t2.ownerid
;

在这里插入图片描述

-- 再匹配上相应宠物类型
select
    t3.personname
    ,t3.petname
    ,t4.pettype
from 
(
select
   t1.personname 
   ,t2.petname
   ,t2.pettypeid
from
(
    select
        personid 
        ,personname   
    from people
) t1
left outer join
(
    select
        ownerid 
        ,petname
        ,pettypeid 
    from pets 
) t2
on t1.personid = t2.ownerid
) t3
inner join 
(
    select
        pettypeid
        ,pettype 
    from pettypes
) t4
on t3.pettypeid = t4.pettypeid
;

在这里插入图片描述
在这里插入图片描述

三、解决方案

  1. 将left outer join提到外层
SELECT
    t3.arrange_date
   ,t3.new_material_code
   ,t3.bill_code
   ,t3.price_unit
   ,t3.sale_out_qty
   ,t3.sale_out_price_total_rmb
   ,t3.dept_code
   ,t3.salesman_code
   ,t3.main_id
   ,t3.system_source
   ,t3.fcustnumber
   ,t3.warehouse_code
   ,t3.fbilltypenumber
   ,t3.delivery_order_id
   ,t3.f_aqa_recaddr
   ,t3.sales_organization
   ,t3.cost_rmb
   ,t3.fsrcbillno
   ,t3.fbilltypename
   ,t3.fentryid
   ,t3.currency_code
   ,t3.unit
   ,t3.sale_out_amt_total
   ,t3.sale_out_amt_total_rmb
   ,t4.fsettleorgid
FROM
(
    SELECT
        t1.arrange_date
       ,t2.fmaterialnumber                                                                                     AS new_material_code
       ,t1.fbillno                                                                                             AS bill_code
       ,MAX(t2.ftaxprice)                                                                                      AS price_unit
       ,SUM(t2.frealqty)                                                                                       AS sale_out_qty
       ,SUM(t2.fallamount_lc)                                                                                  AS sale_out_price_total_rmb
       ,IF((t1.arrange_date <= '2021-05-31' AND t1.fbillno IN ('OUSO20210319321696', 'OUSO20210322323127'))
           ,'Lumidept00165', t1.fsaledeptnumber)                                                               AS dept_code
       ,t1.salesman_code
       ,t2.fmainid                                                                                             AS main_id
       ,t1.system_source
       ,t1.fcustnumber
       ,t2.warehouse_code
       ,t1.fbilltypenumber
       ,t2.delivery_order_id
       ,t1.f_aqa_recaddr
       ,t1.fsaleorgid                                                                                          AS sales_organization
       ,SUM(t2.fcostamount_lc)                                                                                 AS cost_rmb
       ,t2.fsrcbillno
       ,t1.fbilltypename
       ,t2.fentryid
       ,t1.fbasecurrnumber                                                                                     AS currency_code
       ,t2.funitnumber                                                                                         AS unit
       ,SUM(t2.famount)                                                                                        AS sale_out_amt_total
       ,SUM(t2.famount_lc)                                                                                     AS sale_out_amt_total_rmb
    FROM
    (
        SELECT
            fid
           ,fbillno
           ,system_source
           ,arrange_date
           ,fsaledeptnumber
           ,salesman_code
           ,fcustnumber
           ,fbilltypenumber
           ,fbilltypename
           ,f_aqa_recaddr
           ,fsaleorgid
           ,fbasecurrnumber
        FROM tmp.xxx
        WHERE rn = 1
        AND is_delete IS NOT TRUE
    ) t1
    INNER JOIN
    (
        SELECT
            fmainid
           ,ftaxprice
           ,frealqty
           ,fallamount_lc
           ,fmaterialnumber
           ,delivery_order_id
           ,warehouse_code
           ,fcostamount_lc
           ,fsrcbillno
           ,fentryid
           ,funitnumber
           ,famount
           ,famount_lc
        FROM tmp.yyy
        WHERE rn = 1
        AND is_delete IS NOT TRUE
    ) t2
    ON t1.fid = t2.fmainid
    GROUP BY
        t1.arrange_date
       ,t2.fmaterialnumber
       ,t1.fbillno
       ,t1.fsaledeptnumber
       ,t1.salesman_code
       ,t2.fmainid
       ,t1.system_source
       ,t1.fcustnumber
       ,t1.fbilltypenumber
       ,t2.delivery_order_id
       ,t1.f_aqa_recaddr
       ,t1.fsaleorgid
       ,t2.warehouse_code
       ,t2.fsrcbillno
       ,t1.fbilltypename
       ,t2.fentryid
       ,t1.fbasecurrnumber
       ,t2.funitnumber
) t3
LEFT OUTER JOIN
(
    SELECT
        fmainid
       ,fsettleorgid
    FROM tmp.zzz
    WHERE rn = 1
    AND is_delete IS NOT TRUE
) t4
ON t3.main_id = t4.fmainid
;
  1. 先形成临时表再inner join的话结果是正确的。
-- 形成临时表
CREATE TABLE IF NOT EXISTS tmp.aaa
AS
SELECT
    t3.fid
   ,t3.fbillno
   ,t3.system_source
   ,t3.arrange_date
   ,t3.fsaledeptnumber
   ,t3.salesman_code
   ,t3.fcustnumber
   ,t3.fbilltypenumber
   ,t3.fbilltypename
   ,t3.f_aqa_recaddr
   ,t3.fsaleorgid
   ,t3.fbasecurrnumber
   ,t4.fsettleorgid
FROM
(
    SELECT
        fid
       ,fbillno
       ,system_source
       ,arrange_date
       ,fsaledeptnumber
       ,salesman_code
       ,fcustnumber
       ,fbilltypenumber
       ,fbilltypename
       ,f_aqa_recaddr
       ,fsaleorgid
       ,fbasecurrnumber
    FROM tmp.xxx
    WHERE rn = 1
    AND is_delete IS NOT TRUE
) t3
LEFT OUTER JOIN
(
    SELECT
        fmainid
       ,fsettleorgid
    FROM tmp.yyy
    WHERE rn = 1
    AND is_delete IS NOT TRUE
) t4
ON t3.fid = t4.fmainid
;

--从临时表中取数据关联
SELECT
    t1.arrange_date
   ,t2.fmaterialnumber                                                                                     AS new_material_code
   ,t1.fbillno                                                                                             AS bill_code
   ,MAX(t2.ftaxprice)                                                                                      AS price_unit
   ,SUM(t2.frealqty)                                                                                       AS sale_out_qty
   ,SUM(t2.fallamount_lc)                                                                                  AS sale_out_price_total_rmb
   ,IF((t1.arrange_date <= '2021-05-31' AND t1.fbillno IN ('OUSO20210319321696', 'OUSO20210322323127'))
       ,'Lumidept00165', t1.fsaledeptnumber)                                                               AS dept_code
   ,t1.salesman_code
   ,t2.fmainid                                                                                             AS main_id
   ,t1.system_source
   ,t1.fcustnumber
   ,t2.warehouse_code
   ,t1.fbilltypenumber
   ,t2.delivery_order_id
   ,t1.f_aqa_recaddr
   ,t1.fsaleorgid                                                                                          AS sales_organization
   ,SUM(t2.fcostamount_lc)                                                                                 AS cost_rmb
   ,t2.fsrcbillno
   ,t1.fbilltypename
   ,t2.fentryid
   ,t1.fbasecurrnumber                                                                                     AS currency_code
   ,t2.funitnumber                                                                                         AS unit
   ,SUM(t2.famount)                                                                                        AS sale_out_amt_total
   ,SUM(t2.famount_lc)                                                                                     AS sale_out_amt_total_rmb
   ,t1.fsettleorgid
FROM
(
    SELECT
        fid
       ,fbillno
       ,system_source
       ,arrange_date
       ,fsaledeptnumber
       ,salesman_code
       ,fcustnumber
       ,fbilltypenumber
       ,fbilltypename
       ,f_aqa_recaddr
       ,fsaleorgid
       ,fbasecurrnumber
       ,fsettleorgid
       from tmp.aaa
) t1
INNER JOIN
(
    SELECT
        fmainid
       ,ftaxprice
       ,frealqty
       ,fallamount_lc
       ,fmaterialnumber
       ,delivery_order_id
       ,warehouse_code
       ,fcostamount_lc
       ,fsrcbillno
       ,fentryid
       ,funitnumber
       ,famount
       ,famount_lc
    FROM tmp.zzz
    WHERE rn = 1
    AND is_delete IS NOT TRUE
) t2
ON t1.fid = t2.fmainid
GROUP BY
    t1.arrange_date
   ,t2.fmaterialnumber
   ,t1.fbillno
   ,t1.fsaledeptnumber
   ,t1.salesman_code
   ,t2.fmainid
   ,t1.system_source
   ,t1.fcustnumber
   ,t1.fbilltypenumber
   ,t2.delivery_order_id
   ,t1.f_aqa_recaddr
   ,t1.fsaleorgid
   ,t2.warehouse_code
   ,t2.fsrcbillno
   ,t1.fbilltypename
   ,t2.fentryid
   ,t1.fbasecurrnumber
   ,t2.funitnumber
   ,t1.fsettleorgid
;

总结

山重水复疑无路,柳暗花明又一村。出自宋代陆游的《游山西村》,描绘的是当地路漫漫,长途跋涉以为前方没有路要放弃时,突然发现不远的地方有个村庄的场景,带有一种豁然开朗的喜悦。
当我们面临复杂的SQL关联时,因为SQL内部机制会导致结果和我们想象中的不一样,这时不妨换个思路,可能就会“柳暗花明又一村”了。

  大数据 最新文章
实现Kafka至少消费一次
亚马逊云科技:还在苦于ETL?Zero ETL的时代
初探MapReduce
【SpringBoot框架篇】32.基于注解+redis实现
Elasticsearch:如何减少 Elasticsearch 集
Go redis操作
Redis面试题
专题五 Redis高并发场景
基于GBase8s和Calcite的多数据源查询
Redis——底层数据结构原理
上一篇文章      下一篇文章      查看所有文章
加:2022-10-17 12:41:07  更:2022-10-17 12:43:05 
 
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁

360图书馆 购物 三丰科技 阅读网 日历 万年历 2024年11日历 -2024/11/19 12:14:10-

图片自动播放器
↓图片自动播放器↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  IT数码