[Big Data] Hive Syntax: Hive DDL, Partitioning and Bucketing


Hive syntax

Hive data types

Primitive data types

Numeric
int, float (also tinyint, smallint, bigint, double, decimal)
String
string
Date
date
Boolean
boolean (true/false)

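Complex data types

array: an ordered list of elements of one type
map: key-value pairs
struct: a record with named fields {int, string,…}
uniontype: a value that holds one of several declared types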
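
As a minimal sketch of how the complex types above appear in a table definition and in a query, the following uses a hypothetical table tb_complex_demo. The table name, its columns, and the chosen delimiters are assumptions for illustration and are not part of the original examples.

```sql
-- Hypothetical table mixing array, map and struct columns
create table tb_complex_demo
(
    name    string,
    hobbies array<string>,
    scores  map<string,int>,
    address struct<city:string, zip:string>
) row format delimited fields terminated by '\t'
    collection items terminated by ','
    map keys terminated by ':';

-- Array elements are read by index, map values by key, struct members with a dot
select name, hobbies[0], scores['math'], address.city
from tb_complex_demo;
```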

Data type conversion

Implicit conversion
Explicit conversion
select cast('100' as int);
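
The two kinds of conversion can be compared with a few standalone statements. This is only a sketch; the literal values are chosen for illustration.

```sql
-- Explicit conversion: a string that parses as a number becomes an int
select cast('100' as int);

-- A string that cannot be parsed yields NULL rather than an error
select cast('abc' as int);

-- Implicit conversion: the int operand is widened to match the fractional operand
select 1 + 2.5;
```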

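Reading and writing Hive files

Definition: how file data stored in Hadoop (HDFS) is mapped to Hive tables

# row format delimited: the default SerDe with delimiters you choose
delimited
# row format serde: a SerDe class you name yourself
serde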
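
For contrast with the delimited form used in the rest of this article, here is a sketch of the serde form. It assumes the OpenCSVSerde class that ships with Hive; the table tb_csv_demo and its columns are hypothetical. Note that this SerDe reads every column as a string.

```sql
-- Hypothetical table that parses quoted CSV files through a named SerDe class
create table tb_csv_demo
(
    id   string,
    name string
) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    with serdeproperties (
        "separatorChar" = ",",
        "quoteChar" = "\""
    )
    stored as textfile;
```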

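Read path: deserialization

Mapping file data onto a table:
Hive converts the SQL statement into a MapReduce job, locates the corresponding data files, splits each record according to the format declared for the table, and returns the data in the declared column layout.

Write path: serialization

Writing table data to files:
Rows are written to the file in the declared format.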

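Specifying the data format

The format specification governs how data is split on read and joined on write.

Between fields
zhangsan wangwu
Between collection elements
zhangsan, gender:boy-age:18
Between map keys and values
key : value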

Dropping and altering databases and tables


-- Force-drop a database that is not empty
drop database hive cascade;


-------------------------------------------

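-- Convert an external table into an internal (managed) table
alter table team_player
    set tblproperties ('EXTERNAL' = 'FALSE');

-- Dropping it now removes both the metadata and the data
drop table team_player;
-- Convert an internal table into an external table
alter table students_txt
    set tblproperties ('EXTERNAL' = 'TRUE');

desc formatted students_txt;

-- Drop the external table students_txt (only the metadata is removed)
drop table students_txt;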

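Hive DDL

① Create the table first and declare how its data is laid out; only then can the data be queried or modified.

-- Create the table columns and specify the field delimiter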
create table tb_archer
(
    id           int,
    name         string comment '英雄名称',
    hp_max       int,
    attack_max   int,
    defense_max  int,
    attack_range int,
    role_main    string,
    role_assist  string
) row format delimited fields terminated by '\t';

-- The field delimiter is set with: fields terminated by

-- After uploading the data file, view the table contents
select * from tb_archer;

----------------------------------------
-- Create a table that handles a complex data type (map)
create table tb_hot_hero_skin_price
(
    id         int,
    name       string,
    win_rate   int,
    skin_price map<string,int>
) row format delimited fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':';
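
To make the three delimiters concrete, here is a sketch of how one line of a data file would map onto tb_hot_hero_skin_price and how the map column can then be queried. The sample line and the key skin_a are hypothetical values, not taken from the original data.

```sql
-- A hypothetical line in the data file:
--   1,hero_a,53,skin_a:288-skin_b:888
-- ','  separates the four fields
-- '-'  separates the entries of the map
-- ':'  separates each key from its value

-- Look up one value by key and inspect the map as a whole
select name,
       skin_price['skin_a']   as skin_a_price,
       map_keys(skin_price)   as skin_names,
       size(skin_price)       as skin_count
from tb_hot_hero_skin_price;
```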

---------------------------------------
-- Create a table that uses the default field delimiter ('\001')
create table tb_team_ace_player
(
    id              int,
    team_name       string,
    ace_player_name string
);
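
When no row format is given, Hive falls back to its default delimiters. As a sketch, the definition below spells those defaults out explicitly; the table name tb_team_ace_player_explicit is hypothetical and exists only to show the equivalent clause.

```sql
-- Hypothetical table: same columns, with Hive's default delimiters written out
create table tb_team_ace_player_explicit
(
    id              int,
    team_name       string,
    ace_player_name string
) row format delimited fields terminated by '\001'
    collection items terminated by '\002'
    map keys terminated by '\003';
```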

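Hive table types

Internal (managed) tables
The table and its columns are created before any data file exists, and the data is then placed in the table's directory.

Any table created without the external keyword is an internal table.

Hive manages both the metadata and the data of an internal table, so dropping the table deletes both.

External tables
The data already exists in the warehouse and a table is defined over it afterwards. Because the data sits outside the default path, location is needed to point at its directory.

The keyword for an external table is external.

Hive manages only the metadata of an external table: dropping it removes the metadata but leaves the data files on HDFS.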
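
The difference in drop behaviour can be sketched with two throwaway tables. The names demo_managed and demo_external and the path /tmp/demo_external are hypothetical.

```sql
-- Hypothetical managed table: stored under the warehouse directory
create table demo_managed (id int);

-- Hypothetical external table: points at a directory chosen by the user
create external table demo_external (id int)
    location '/tmp/demo_external';

-- The "Table Type" row reports MANAGED_TABLE or EXTERNAL_TABLE
desc formatted demo_managed;
desc formatted demo_external;

-- Removes the metadata and the table directory with its files
drop table demo_managed;

-- Removes only the metadata; files under /tmp/demo_external stay on HDFS
drop table demo_external;
```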

Creating an external table


-- External table: the external keyword plus location to point at the data
create external table student_txt
(
    id   int,
    name string,
    sex  string,
    age  int,
    dept string
)
    row format delimited fields terminated by ','
location '/python';

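Partitioned tables

Partitioning splits a table's files into separate directories. A query can name the partition it needs and read only that directory.

Keyword: partitioned by (partition_column column_type)
A partition column must not have the same name as a column already defined in the table.

Depending on how data is loaded, partitioning is either static or dynamic.

Notes on partitioned tables

Partitioning is not required when creating a table; it is an optional optimization.
A partition column cannot be an existing column of the table; names must not repeat.
A partition column is a virtual column: its values are not stored in the underlying data files.
Hive supports multi-level partitioning, that is, partitioning further within a partition for finer granularity.

Loading data with static partitioning

load data local inpath '/root/<file path>' into table <table_name> partition (<partition_column> = '<partition_value>');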

-- Create the partitioned table
create table if not exists tb_hero_part(
    id           int,
    name         string comment '英雄名称',
    hp_max       int,
    attack_max   int,
    defense_max  int,
    attack_range int,
    role_main    string,
    role_assist  string
    )partitioned by (role string) row format delimited fields terminated by ',';
-----------------------------------------
-- View the table details
desc formatted tb_hero_part;
-- List the partitions that currently hold data
show partitions tb_hero_part;
------------------------------------------

-- Load each file into a manually named static partition
load data local inpath '/root/hero/archer.txt' into table tb_hero_part partition (role = 'sheshou');
load data local inpath '/root/hero/assassin.txt' into table tb_hero_part partition (role = 'cike');
load data local inpath '/root/hero/mage.txt' into table tb_hero_part partition (role = 'fashi');
load data local inpath '/root/hero/support.txt' into table tb_hero_part partition (role = 'fuzhu');
load data local inpath '/root/hero/tank.txt' into table tb_hero_part partition (role = 'tanke');
load data local inpath '/root/hero/warrior.txt' into table tb_hero_part partition (role = 'zhanshi');

-- Filter on the partition column
select count(*) from tb_hero_part where hp_max > 6000 and role_main='archer' and role='sheshou';
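
The effect of filtering on a partition column, as opposed to an ordinary column, can be sketched as follows. The directory names in the comments follow Hive's column=value convention and are shown relative to the table location.

```sql
-- Each static load above creates one directory per partition value, for example
--   <table location>/role=sheshou/archer.txt
--   <table location>/role=cike/assassin.txt
show partitions tb_hero_part;

-- role is a partition column: only the role=sheshou directory is read
select count(*) from tb_hero_part where role = 'sheshou';

-- role_main is an ordinary column: every partition directory is scanned
select count(*) from tb_hero_part where role_main = 'archer';

-- The partition column can be selected and grouped like any other column
select role, count(*) as hero_cnt
from tb_hero_part
group by role;
```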

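Loading data with dynamic partitioning

insert into table <table_name> partition (<partition_column>) select * from tmp_table;
Hive derives the partitions from the selected data and creates the matching partition directories and files automatically.

-- Enable dynamic partitioning (on by default in recent Hive releases)
set hive.exec.dynamic.partition=true;
-- Allow every partition column to be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;

-- Create the table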
create table t_all_hero_part_d
(
    id           int,
    name         string,
    hp_max       int,
    attack_max   int,
    defense_max  int,
    attack_range string,
    role_main    string,
    role_assist  string
) partitioned by (role string)
    row format delimited
        fields terminated by "\t";

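-- Load the partitions dynamically; the last column of the select supplies the value of the partition column role
insert into table t_all_hero_part_d partition (role)
select th.*, th.role_assist
from tb_heros as th;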
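
After a dynamic insert it is worth checking which partitions Hive actually created. The statements below are a sketch; the limit values in the two set commands are illustrative numbers, not recommendations.

```sql
-- One partition appears per distinct value of the last select column
show partitions t_all_hero_part_d;

-- Row count per generated partition
select role, count(*) as hero_cnt
from t_all_hero_part_d
group by role;

-- If an insert would create more partitions than Hive allows, the limits can be raised
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.max.dynamic.partitions.pernode=500;
```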

Multi-level partitioned tables


-- Create a table with multiple partition levels
create table test_student
(
    id   int,
    name string,
    sex  string,
    age  int,
    dept string
) partitioned by (year string, month string,day string) row format delimited fields terminated by ',';

-- Load data into one multi-level partition
load data local inpath '/root/hero/archer.txt' into table test_student partition (year = '2012', month = '01', day = '10');
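
The nested layout and a mixed static and dynamic insert can be sketched as follows. student_src is a hypothetical source table, assumed to contain month and day columns in addition to the five data columns.

```sql
-- Directory created by the load above, relative to the table location:
--   year=2012/month=01/day=10/archer.txt

-- List only the partitions under one year
show partitions test_student partition (year = '2012');

-- Filtering on the leading partition columns limits the scan to matching directories
select * from test_student where year = '2012' and month = '01';

-- Mixed insert: year is static, month and day come from the last two select columns
-- (student_src is a hypothetical table)
insert into table test_student partition (year = '2012', month, day)
select id, name, sex, age, dept, month, day
from student_src;
```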

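Dropping a partition

-- year is a string partition column, so the value is quoted; a partial spec drops every matching sub-partition
alter table test_student drop partition (year = '2012');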
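
Partitions can also be added and dropped individually by giving the full specification. A sketch, using illustrative partition values:

```sql
-- Register an empty partition ahead of loading data into it
alter table test_student add if not exists
    partition (year = '2013', month = '02', day = '01');

-- Drop exactly that partition; for a managed table its files are deleted as well
alter table test_student drop if exists
    partition (year = '2013', month = '02', day = '01');

-- Confirm the result
show partitions test_student;
```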

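Bucketed tables

Definition: the data is divided by the value of a column, which usually spreads rows more evenly than partitioning.

Hive hashes the bucketing column and takes the result modulo the number of buckets; rows with the same remainder go into the same bucket.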
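
The hash-then-modulo rule can be illustrated with Hive's built-in hash and pmod functions on the source table t_usa_covid19 that is created in the steps below. This is a sketch of the principle only: the hash function Hive uses internally for bucket assignment depends on the Hive version, so the numbers computed here may not match the actual bucket files.

```sql
-- Bucket number each row would get with 5 buckets under the hash-modulo rule
select state,
       pmod(hash(state), 5) as bucket_no
from t_usa_covid19
limit 10;

-- How evenly the rows spread over the 5 buckets
select pmod(hash(state), 5) as bucket_no,
       count(*)             as row_cnt
from t_usa_covid19
group by pmod(hash(state), 5);
```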

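Benefits of bucketing

Queries that filter on the bucketing column can avoid a full table scan.
Joins on the bucketing column make the MapReduce job more efficient, because only matching buckets are compared instead of every pair of rows.
Bucketed data can be sampled.

Bucketing steps

-- Create the source table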
create table if not exists t_usa_covid19
(
    count_date string,
    county     string,
    state      string,
    fips       int,
    cases      int,
    deaths     int
)
    row format delimited fields terminated by ',';


-- Load the data to be bucketed into t_usa_covid19 (done from the terminal)
-- Create the bucketed table
create table tb_usa_covid19_bucket_sort
(
    count_date string,
    county     string,
    state      string,
    fips       int,
    cases      int,
    deaths     int
) clustered by (state) sorted by (cases desc) into 5 buckets;

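-- Insert the data into the bucketed table
-- (Hive releases before 2.0 need set hive.enforce.bucketing=true; first)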
insert into tb_usa_covid19_bucket_sort
select *
from t_usa_covid19;
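
Once the bucketed table is filled, the benefits listed earlier can be tried out. The following is a sketch; the state value 'New York' is an assumed example and may not match the actual data.

```sql
-- The "Num Buckets" and "Bucket Columns" rows confirm the bucketing definition
desc formatted tb_usa_covid19_bucket_sort;

-- Sampling: read only the first of the 5 buckets
select *
from tb_usa_covid19_bucket_sort tablesample (bucket 1 out of 5 on state);

-- Filtering on the bucketing column
select *
from tb_usa_covid19_bucket_sort
where state = 'New York';
```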

Added: 2021-10-16 19:42:31 Updated: 2021-10-16 19:43:57
