Hive INSERT OVERWRITE PARTITION: notes on a dynamic partition pitfall

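A business requirement called for dynamically inserting time-partitioned data into a table, so I used the syntax shown further below (desensitized).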

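The conclusion first: with dynamic partitioning, Hive takes the partition value from the last column of the SELECT. I did not follow that rule, and the result was a string of strange errors.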

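When I found the Spark execution log under the Application, the job was stuck in the last stage, between repartitionAndSortWithinPartitions and mapPartitions. The task count on every executor was 0, yet each executor had shuffle read records. The tasks stayed in the Running state without reporting any error, but after 15 hours the progress bar for those 11 tasks was still at 0/11, so I had to kill the job. I confirmed that dynamic partitioning was enabled in nonstrict mode and that the partition limits were high enough.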

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 10000;
set hive.exec.max.dynamic.partitions.pernode = 10000;

drop table if exists db1.table1;
create table db1.table1
(
    id string,
    code string,
    pred_sell double
)
partitioned by (insert_date string)
STORED AS PARQUET
;

insert into table db1.table1
partition (insert_date)
select
a.id,
a.code,
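a.insert_date,       -- wrong position: third column, so it lands in pred_sell
b.sell as pred_sell  -- wrong position: last column, so it is used as the partition value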
from db1.table1 a join db1.table2 b
on a.id=b.id and a.code=b.code and cast(date_format(a.insert_date,'yyyyMMdd') as string) >= '${yesterday_time}';

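Going back to the code, I found that the last column in my SELECT was b.sell as pred_sell, while the partition clause named insert_date. Hive matches the SELECT columns to the table by position rather than by name, so the values of b.sell were used as partition values, and that is what left the job stuck at the repartition step.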
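One way to catch this kind of mismatch before launching the insert is to count how many partitions the last SELECT column would produce. The query below is only a sketch that reuses the tables and join condition from the script above; the alias partition_value and the output name partition_cnt are made up for illustration.

```sql
-- Sketch: count the distinct values of the LAST select column,
-- because that is the column Hive uses as the dynamic partition value.
select count(distinct t.partition_value) as partition_cnt
from (
    select cast(b.sell as string) as partition_value  -- last column of the faulty select
    from db1.table1 a join db1.table2 b
    on a.id = b.id and a.code = b.code
    and cast(date_format(a.insert_date, 'yyyyMMdd') as string) >= '${yesterday_time}'
) t;
```

If the count is far larger than the number of dates expected, the last column is probably not the intended partition column.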

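After that the problems piled up: the job ran for so long that it hit an RPC channel closed error, and then a Spark return code 3 error... In the end it came down to my own carelessness.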

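To sum up: the partition columns specified when creating a Hive table are automatically appended after the regular columns. When inserting with insert into ... partition ... select ... from table, make sure the column positions line up. In particular, the partition column must come after the regular columns, in the same order as in the table definition. The corrected script: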

drop table if exists db1.table1;
create table db1.table1
(
    id string,
    code string,
    pred_sell double
)
partitioned by (insert_date string)
STORED AS PARQUET
;

insert into table db1.table1
partition (insert_date)
select
a.id,
a.code,
b.sell as pred_sell,
a.insert_date  -- partition column goes last; no trailing comma before from
from db1.table1 a join db1.table2 b
on a.id=b.id and a.code=b.code and cast(date_format(a.insert_date,'yyyyMMdd') as string) >= '${yesterday_time}';
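A quick way to double-check the column order and the result is to look at the table metadata before and after the insert. This is a minimal sketch using standard Hive statements against the same db1.table1; what counts as a reasonable partition list is an assumption that depends on your data.

```sql
-- Before the insert: the partition column (insert_date) is listed
-- after the regular columns, which is the order the select must follow.
describe db1.table1;

-- After the insert: partition names should look like insert_date=<date>,
-- not like numeric values taken from b.sell.
show partitions db1.table1;
```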