开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 大数据 -> Hive的分区与分桶 -> 正文阅读

[大数据]Hive的分区与分桶

一、为什么要分区和分桶

1.1 为什么要分区

随着系统运行时间增长，表的数据量会越来越大，而hive查询时通常是全表扫描，这样将导致大量的不必要的数据被扫描，从而大大提高查询效率。从而引进分区技术，可以将用户的整个表的数据在存储时划分为多个子目录，从而在查询时可以指定查询条件（子目录以分区变量的值来命名）eg:filedate=‘20210810’。

1.2 为什么要分桶

单个分区或者表中的数据量越来越大，当分区不能更细粒的划分数据时，需要采用分桶技术将数据更细粒度的划分和管理。从而引进分桶技术，分桶更多地用来做数据抽样、JOIN操作。

分区是分目录，分桶是分文件。

二、分区

2.1 怎么分区

根据业务，通常按照年、月、日、地区等

2.2 分区的技术

PARTIONED BY(col_name data_type)

hive的分区字段使用的是表外字段。而mysql使用的是表内字段。
hive的分区名区分大小写
hive的分区本质是在表目录下面创建目录，但是该分区字段是一个伪列，不真实存在于数据中
一张表可以有一个或者多个分区，分区下面也可以有一个或者多个分区

2.3 分区的意义

可以让用户在做数据分析统计时缩小数据扫描的范围，提高查询效率。因为可以在select时指定要统计的分区，提高查询效率。eg:select * from bus_gps where filedate = 20210810;

2.4 创建分区表

一级：

create table if not exists consumers1(
uid int,
uname string,
uage int
)
PARTITIONED BY (country string)
row format delimited 
fields terminated by ','
;

分区表导入数据的方式：

load data local inpath '/usr/local/xxx' into table consumers1
partition(country='China'); #要指定分区

select * from part1 where country='China';

二级：

create table if not exists consumers2(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string)
row format delimited 
fields terminated by ','
;

load data local inpath '/usr/local/xxx' into table consumers2
partition(year='2021',month='08'); 

load data local inpath '/usr/local/xxx' into table consumers2
partition(year='2021',month=08); 

select * from consumers2 where year='2021',month='08';

三级：

create table if not exists consumers3(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string,day string)
row format delimited 
fields terminated by ','
;

load data local inpath '/usr/local/xxx' into table consumers3
partition(year='2021',month='08',day='10'); 

load data local inpath '/usr/local/xxx' into table consumers3
partition(year='2021',month=08,day='10'); 

select * from consumers3 where year='2021',month='08',day='10';

显示分区：

show partitions consumers3;

修改分区：

分区名不能使用命令修改（手动修改），文件名+元数据
增加单个分区：

alter table consumers1 add partition(country='india');

增加多个分区：

alter table consumers1 add partition(country='india') partition(country='korea') partition(country='America');

增加分区并设置数据：(location 跟在 partition之后，空格隔开)

alter table consumers1 add partition(country='Vietnam')
location '/user/hive/warehouse/xxx'
;

修改分区的存储路径：(hdfs的路径必须是全路径)

alter table consumers1 partition(country='Vietnam') set location ‘hdfs://hadoop01:9000/user/hive/warehouse/brz.db/part1/country=Vietnam’

删除分区：

alter table consumers1 drop partition(country='india');

alter table consumers1 drop partition(country='india'),partition(country='korea'),partition(country='America'); ##逗号隔开

分区类型：

静态分区:加载数据到指定的分区值
动态分区：数据未知，根据分区的值确定创建分区
混合分区：静态和动态都有

动态分区的属性：

set hive.exec.dynamic.partition=true;//(true/false)
set hive.exec.dynamic.partition.mode=strict;//(strict/nonstrict) 至少有一个静态的值
set hive.exec.dynamic.partitions=1000;//(分区最大数)
set hive.exec.max.dynamic.partitions.pernode=100

创建动态分区：

create table if not exists dy_consumers1(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string)
row format delimited 
fields terminated by ','
;

加载数据：（不能使用load加载数据）

load data local inpath '/usr/local/xxx' into table dy_consumers1
partition(year,month);

加载数据：（使用 insert into方式加载数据）

1、创建临时表

create table if not exists consumers_tmp(
uid int,
uname string,
uage int,
year string,
month string
)
row format delimited 
fields terminated by ','
;

2、insert into方式加载数据

insert into dy_consumers1 partition(year,month)
select * from consumers_tmp
;

混合：(注意列的个数的匹配)

create table if not exists dy_consumers2(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string)
row format delimited 
fields terminated by ','
;

#年份已设置成固定的2021，月份是动态的
insert into dy_consumers2 partition(year='2021',month)
select uid,uname,uage,month from consumers_tmp
;

设置hive的执行模式：

严格模式：

set hive.mapred.mode=nonstrict/strict

In strict mode, some risky queries are not allowed to run. They include:
Cartesian Product.
No partition being picked up for a query.
Comparing bigints and strings.
Comparing bigints and doubles.
Orderby without limit.

注意事项：

hive的分区使用的表外字段，分区字段是一个伪列但是可以查询过滤。
分区字段不建议使用中文
不太建议使用动态分区。因为动态分区将会使用mapreduce来查询数据，如果分区数量过多将导致namenode和yarn的资源瓶颈。所以建议动态分区前也尽可能之前预知分区数量。
分区属性的修改均可以使用手动元数据和hdfs的数据内容

三、分桶

3.1 怎么分桶

分区使用的是表外字段，分桶使用的是表内字段，分桶的字段必须在表中存在！

3.2 分桶的技术

[CLUSTERED BY (col_name, col_name, ...)?
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]?
关键字：BUCKET

clustered by (uid) – 指定分桶的字段
sorted by (uid desc) – 指定数据的排序规则，表示预期的数据就是以这里设置的字段以及排序规则来进行存储

3.3 分桶的意义

为了保存分桶查询结果的分桶结构（数据已经按照分桶字段进行了hash散列）
分桶表数据进行抽样和JOIN时可以提高MR程序效率

3.4 创建分桶表

方式一

create table if not exists buck1(
uid int,
uname string,
uage int
)
clustered by (uid) into 4 buckets
row format delimited
fields terminated by ','
;

加载数据：

load data local inpath '/usr/local/hive/test/3.txt' into table buck1
;

设置强制分桶属性：

set hive.enforce.bucketing=false/true

如果reduce的个数和分桶的个数不一致时，请手动设置reduce的个数：
设置reduce task的个数：

set mapreduce.job.reduces=4;

分桶查询测试：

select * from buck1 cluster by (uid);

方式二

创建临时表：

create table if not exists buck_temp(
uid int,
uname string,
uage int
)
row format delimited
fields terminated by ','
;

加载数据：

load data local inpath '/usr/local/hivedata/3.txt' into table buck_temp;

创建分桶表：

create table if not exists buck2(
uid int,
uname string,
uage int
)
clustered by (uid) into 4 buckets
row format delimited
fields terminated by ','
;

分桶表数据需要使用insert into 方式来加载：

insert overwrite table buck2
select uid,uname,uage from buc_temp
cluster by (uid)
;

方式三

创建分桶表，并且指定排序字段及排序规则

create table if not exists buck3(
uid int,
uname string,
uage int
)
clustered by (uid) 
sorted by (uid desc) into 4 buckets
row format delimited
fields terminated by ','
;

导入数据：

insert overwrite table buck3
select uid,uname,uage from buck_temp
cluster by (uid)
;

insert overwrite table buck3
select uid,uname,uage from buck_temp
distribute by (uid) sort by (uid asc)
;

insert overwrite table buck3
select uid,uname,uage from buck_temp
distribute by (uid) sort by (uid desc)
;

insert overwrite table buck3
select uid,uname,uage from buck_temp
distribute by (uid) sort by (uage desc)
;

3.5 对分桶表的查询

查询全部：

select * from buc3;
select * from buc3 tablesample(bucket 1 out of 1)

查询第几桶：

select * from buc3 tablesample(bucket 1 out of 4 on uid);
select * from buc3 tablesample(bucket 1 out of 2 on uid);

解释一下：

tablesample(bucket x out of y on uid)  
x：代表从第几桶开始查询
y：查询的总桶数，y可以是总的桶数的倍数或者因子；x不能大于y

不压缩不拉伸：1 out of 4
for 1 to 4
1 2 3 4 1 2 3 4
1       1 + 4

压缩： 1 out of 2
1 2 3 4
1 2 1 2 1 2
1   1 + 4/2 1+4/2+4/2

拉伸：1 out of 8
1 2 3 4 5 6 7 8
1 2 3 4

等于是重新对所有数据进行分桶：

例子：
查询uid为奇数：（tablesample一定要紧跟在表名之后）

select * from buc3 tablesample(bucket 2 out of 2 on uid);

查询：

select * from part_tmp limit 3;
select * from part_tmp tablesample(3 rows);
select * from part_tmp tablesample(13 percent);
select * from part_tmp tablesample(56B); ###k B M G T P

随机查询3条数据：

select * from part_tmp order by rand() limit 3;
select rand() as rand1,* from part_tmp;

四、分区与分桶联合使用

举例说明一下：按照性别进行分区（1男2女），在分区中按照uid的奇偶进行分桶：

1 gyy1 1
2 gyy2 2
3 gyy3 2
4 gyy4 1
5 gyy5 2
6 gyy6 1
7 gyy7 1
8 gyy8 2
9 gyy9 1
10 gyy10 1
11 gyy11 2
12 gyy12 1

1、创建临时表：

create table if not exists stu_temp(
uid int,
uname string,
usex int
)
row format delimited 
fields terminated by ' '
;

加载数据：

load data local inpath '/usr/local/hivedata/stu.dat' into table stu_temp;

创建分区分桶表：

create table if not exists stus(
uid int,
uname string
)
partitioned by (sex int)
clustered by (uid) into 2 buckets
row format delimited 
fields terminated by ' '
;

//insert into方式：

insert into table stus partition(sex)
select uid,uname,usex from stu_temp
cluster by (uid)
;

需求：
查询性别为女性的、并且学号为奇数的学生：

select *
from stus tablesample(bucket 2 out of 2 on uid)
where sex=2;

五、分区和分桶区别

分区：

分区表现为HDFS上表目录的子目录，数据按照分区存储在子目录中。
由于Hive实际是存储在HDFS上的抽象，Hive的一个分区名对应一个目录名，子分区名就是子目录名，并不是一个实际字段。
所以可以这样理解，当我们在插入数据的时候指定分区，其实就是新建一个目录或者子目录，或者在原有的目录上添加数据文件。
如果查询的 where 字句中包含分区条件，则直接从该分区去查找，而不是扫描整个表目录，合理的分区设计可以极大提高查询速度和性能。在hive中创建表时使用partitioned?by(partition_name string)定义分区

分桶：

分桶是相对分区进行更细粒度的划分。分桶表是在表或者分区表的基础上，进一步对表进行组织。
按照某字段的hash值除以桶的个数做取余运算的方式来分桶，保证了每个桶中都有数据，但每个桶中的数据条数不一定相等。例，按照name字段分为3个桶，就是对name属性值的hash值对3取摸，按照取模结果对数据分桶。如取模结果为0的数据记录存放到一个文件，取模为1的数据存放到一个文件，取模为2的数据存放到一个文件。在hive中创建表时使用clustered by(column_name) into N buckets定义分桶列及分桶个数N。

区别表：

	分区表	分桶表
字段?	表外字段（伪列），需要单独指定类型	表内字段，做hash算法后对分桶个数取模
表现形式	目录（文件夹）	文件
数量	可以不断增多	?一旦确定数量不变
关键字	partitioned by(partition_name string)	clustered by(column_name) into N buckets
粒度	较粗	?较细，可以对表进行先分区再分桶的划分策略
效率	避免全表扫描，根据分区列查询指定目录提高查询速度	分桶表数据进行抽样和JOIN时可以提高MR程序效率

大数据最新文章

实现Kafka至少消费一次

亚马逊云科技：还在苦于ETL？Zero ETL的时代

初探MapReduce

【SpringBoot框架篇】32.基于注解+redis实现

Elasticsearch：如何减少 Elasticsearch 集

Go redis操作

Redis面试题

专题五 Redis高并发场景

基于GBase8s和Calcite的多数据源查询

Redis——底层数据结构原理

加:2021-08-11 12:28:56 更:2021-08-11 12:30:49

360图书馆购物三丰科技阅读网日历万年历 2025年7日历

-2025/7/5 15:21:19-

图片自动播放器
↓图片自动播放器↓

TxT小说阅读器
↓语音阅读,小说下载,古典文学↓

一键清除垃圾
↓轻轻一点,清除系统垃圾↓

图片批量下载器
↓批量下载图片,美女图库↓

网站联系: qq:121756557 email:121756557@qq.com IT数码