在一张hive表中 存储了自身关联自身两键值 即: 当前id(newid) 和 原先id (oldid); 每一个 newid 和 oldid 两id有可能 关联本表中 其他的id
现有需求找到 最后newid 和 最早oldid 使其一一对应 网上有sqlserver 等关系型数据库的递归说明大抵就是利用with as 建立子表 自身调用自身实现递归,但是hive中不支持这种方式, 后来想使用datax 提交代码到sqlserver运行,运行即结果直接传输到建立的hive表中, 测试后发现,这种递归方式只能找到最初的 oldid; 发现表中存有 oolevel 第几个关联层级字段,这样就简单很多 最后选用python脚本+hsql的模式:
首先在hive中创建表
create table zltest.cycle_table(
ooOutputorderid string comment 'newid',
listid string comment '中间关联的id',
ooOperateorderid string comment 'oldid'
)row format delimited FIELDS TERMINATED BY '\u0001'
NULL DEFINED AS '' STORED AS Parquet;
这里后面面要到impala执行 最好使用 Parquet 格式
```python
from impala.dbapi import connect
import impala
from bitarray._bitarray import _bitarray, bitdiff, bits2bytes, _sysinfo
sql1="""
insert overwrite table zltest.cycle_table
select ooOutputorderid,
concat_ws(',',ooOperateorderid )listid,ooOperateorderid
from ods.oc_ocorderoperation
where oolevel =%s
group by ooOutputorderid,concat_ws(',',ooOperateorderid ),ooOperateorderid ;
"""
sql2="""
insert into table zltest.cycle_table
select d.ooOutputorderid, concat_ws(',',d.listid,b.ooOperateorderid)listid,b.ooOperateorderid
from zltest.cycle_table as d
inner join (select * from ods.oc_ocorderoperation where oolevel=%s) as b
on d.ooOperateorderid = b.ooOutputorderid
group by d.ooOutputorderid, concat_ws(',',d.listid,b.ooOperateorderid),b.ooOperateorderid
"""
sql3="""
insert into table zltest.cycle_table
select a.ooOutputorderid,
concat_ws(',',ooOperateorderid )listid,ooOperateorderid
from ods.oc_ocorderoperation a
inner join (
select a.ooOutputorderid
from ods.oc_ocorderoperation as a
left join ods.oc_ocorderoperation as b on a.ooOutputorderid = b.ooOperateorderid
where b.ooOperateorderid is null and a.oolevel=%s
) b on a.ooOutputorderid = b.ooOutputorderid
where a.oolevel =%s and b.ooOutputorderid is not null
group by a.ooOutputorderid,concat_ws(',',ooOperateorderid ),ooOperateorderid ;
"""
sql4="""
insert overwrite table zltest.cycle_table
select ooOutputorderid,listid, ooOperateorderid
from(
select *,
row_number()over(partition by ooOutputorderid order by length(listid) desc) row_num
from zltest.cycle_table
) as d where row_num =1;
"""
def connImpala():
conn = connect(host='XX.XX.XX.XX', port=21050)
cur = conn.cursor()
cur.execute("select max(oolevel)num from ods.oc_ocorderoperation;")
data_list=cur.fetchall()
for i in data_list:
data=int(i)
t=data
while(t>=0):
if(t==data):
print('首次插入数据')
cur.execute(sql1%(t))
else:
print('%s次插入有上层跳转数据'%(t))
cur.execute(sql2%(t))
print('%s次插入无上层跳转数据'%(t))
cur.execute(sql3%(t,t))
t-=1
cur.execute(sql4)
conn.close()
if __name__ == '__main__':
connImpala()
运行完成
当然,部分有时侯不有层级数 也可以这样处理 先执行一次:
insert overwrite table zltest.cycle_table
select t1.ooOutputorderid,concat_ws(',',t1.ooOperateorderid )listid,t2.ooOperateorderid
from(
select a.ooOutputorderid,a.ooOperateorderid
from ods.oc_ocorderoperation as a
left join ods.oc_ocorderoperation as b on a.ooOutputorderid = b.ooOperateorderid
where b.ooOperateorderid is null
) as t1
left join ods.oc_ocorderoperation t2 on t1.ooOperateorderid = t2.ooOutputorderid;
python循环 查看还有没有 有链式未关联完的
select count(t1.ooOperateorderid)num
from zltest.cycle_table t1
left join ods.oc_ocorderoperation t2 on t1.ooOperateorderid = t2.ooOutputorderid
where t2.ooOutputorderid is null;
python 脚本中判断在num值为0以前都要循环
insert into table zltest.cycle_table
select a.ooOutputorderid,concat_ws(',',a.listid,a.ooOperateorderid )listid,b.ooOperateorderid
from zltest.cycle_table a
inner join (
select t3.ooOperateorderid ,t3.ooOutputorderid
from ods.oc_ocorderoperation t3
left join (
select t1.ooOperateorderid
from zltest.cycle_table t1
left join ods.oc_ocorderoperation t2 on t1.ooOperateorderid = t2.ooOutputorderid
where t2.ooOutputorderid is null
) as t4 on t3.ooOperateorderid = t4.ooOperateorderid
where t4.ooOperateorderid is null
) as d on a.ooOperateorderid = d.ooOutputorderid;
|