开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 大数据 -> hive-3 -> 正文阅读

[大数据]hive-3

?级分区

?级分区指的是在?张表中有两个分区,创建和插?时都要指定两个分区名,最常?的就
是下?案例的年和?,创建的语法和流程都是和?级分区?样,只是多?个分区?已.

create table if not exists part2(
id int,
name string
)
partitioned by (year string,month string)
row format delimited fields terminated by ',';
load data local inpath '/opt/soft/data/user.txt' into table part2
partition(year='2018',month='03');
load data local inpath '/opt/soft/data/user.txt' into table part2
partition(year='2018',month='02');

三级分区

三级分区,以此类推,就是当前表中有三个分区,?般情况下要数据量?常?的情况下才会?到.

-- 创建?个三级分区
create table if not exists part3(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ',';
-- 给三级分区加载数据
load data local inpath '/opt/soft/data/user.txt' into table
part3 partition(year='2018',month='03',day='21');
load data local inpath '/opt/soft/data/user.txt' into table
part3 partition(year='2018',month='02',day='20');

-- 创建?个三级分区
create table if not exists part3(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ',';
-- 加载数据
insert into part3 partition(year='2021',month='07',day='08')
slelect
id,
name
from part1
where dt='20210708';
-- 行为数据中一定有时间 操作表
-- 维度数据一般没有时间 地区表，类别表
-- 分区表一般都有时间
-- show partitions part3;查看分区表
-- 主机地址：50070查看
select * from part3;查看part3
?
-- load 方式加载
load data local inpath '/home/part2' into table part3 partition(year='2021',month='07',day='08');
select * from part3;
-- 追加会重复数据
select * fom part3 where day='08';
-- 分区查询
-- 独立的按照三个条件过滤就用三级分区，否则没必要

测试是否区分??写

在hive中默认分区是不区分??写的,通过下?的案例可以来测试

select * fom part3 where DAY='08';
-- 测试是否区分大小写

创建?个测试区分??写的分区

create table if not exists part4(
id int,
name string
)
partitioned by (year string,month string,DAY string)
row format delimited fields terminated by ',';

进?分区??写的测试

load data local inpath '/opt/soft/data/user.txt' into table part4
partition(year='2018',month='03',day='AA');

查看分区

show partitions part1;
?
?
?

新增分区

有时候在创建表的时候没有创建具体的静态分区, 可以在后期通过添加分区来实现添加静态分区

-- 创建分区表
create table if not exists part5
(
 id int,
 name string
)
 partitioned by (dt string) row format delimited fields
terminated by ",";
-- 增加分区：
alter table part5 add partition(dt='2018-03-21');
alter table part5 add partition(dt='2018-03-20')
partition(dt='2018-03-17');
?
?
alter table part1 add partition(dt='20210709');
--新增单个分区
alter table part1 add partition(dt='20210709') partition(dt='20210709');
--新增多个分区

增加分区并设置数据

alter table part5 add partition(dt='2018-11-11') location
'/user/hive/warehouse/part1/dt=2019-08-08';
alter table part5 add partition(dt='2018-03-26') location
'/user/hive/warehouse/part2/dt=2018-03-20'
 partition(dt='2018-03-24') location
'/user/hive/warehouse/part3/dt=2018-03-21';
?
?
-- 增加分区并指定数据
alter table part1 add partition(dt='20210711') local '/input';
-- 显示NULL的原因是因为内容不匹配分隔符不匹配
-- 第一列是因为类型不匹配，后面是因为类型不匹配，分隔符不匹配
-- 所以分区列表里面没有
-- 因为是本地添加的所以他的目录是本地的hdfs：主机名：9000 /input
-- 其他的目录是hdfs：主机名：9000 /user/hive/warehouse/test.db/part1/dt=20210710

修改分区的hdfs的路径

（注意：location后接的hdfs路径需要写成完全路径）

alter table part5 partition(dt='2018-03-21') set location
'/user/hive/warehouse/qf1704.db/part1/dt=2018-03-21'; --错误使
?
alter table part5 partition(dt='2018-03-21') set location
'hdfs://leetom01:8020/user/hive/warehouse/qf1704.db/part1/dt=20
19-09-11';
?
-- 修改分区的hdfs路径（直接修改元数据）
alter table part1 partition(dt='20210711') set location '/out/01';
--也行
alter table part1 partition(dt='20210711') set location 'hdfs://lee01:9000/out/01';

删除分区

alter table part5 drop partition(dt='2018-03-21');
?
?
-- 修改为外部表
alter table part1 set tblproperties('EXTERNAL'='TRUE');
?
-- 删除分区（删除元数据，文件）
alter table part1 drop partition(d='20210709');-- 一个
alter table part1 drop partition(dt='20210709'),partition(dt='20210710'); -- 多个
-- 外部表在删除时只会删除元数据，不会删除hdfs中的分区数据，但是内部表会删除数据
?
?

删除多个分区

alter table part5 drop partition(dt='2018-03-
24'),partition(dt='2018-03-26');

分区类型详解

静态分区

加载数据到指定分区的值,新增分区或者加载分区时指定分区名.
加载数据到分区时明确指定分区值

动态分区

数据未知，根据分区的值来确定需要创建的分区。

加载分区时未指定分区值

历史数据，或者一段时间的数据，存放到一起后期需要将其分区

混合分区

静态和动态都有。

动态分区属性设置及示例

动态分区的属性：
hive.exec.dynamic.partition=true
-- 默认允许动态分区
hive.exec.dynamic.partition.mode=nonstrict
-- 默认为严格模式，用户必须指定一个静态分区
-- 非严格模式 所有分区均可为动态
hive.exec.max.dynamic.partitions=1000
-- 最大动态分区数量
hive.exec.max.dynamic.partitions.pernode=100
-- 单台服务器最大100个分区

创建动态分区表

-- 创建动态分区表
create table dy_part1(
id int,
name string
)
partitioned by (dt string)
row format delimited fields terminated by ',';
?
-- 创建表
create table dy_part(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ',';
?
create table dy_part3(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ',';
?
-- 动态加载
set hive.exec.dynamic.patition.mode=nonstrict;
insert into dy_part2 partition(year,month,day)
select 
id,
name,
year,
month,
day
from part3
;
set hive.exec.dynamic.patition.mode=nonstrict;
insert into dy_part3 partition(year,month,day)
select 
id,
name,
month,
day
from part3
where year='2021'
;

动态分区加载数据不能使?load?式加载

load data local inpath '/hivedata/user.csv' into table dy_part1
partition(dt);
?
-- load 不会产生作业

动态分区使?insert into的?式加载数据

先创建临时表：
create table temp_part(
id int,
name string,
dt string
)
row format delimited fields terminated by ',';
拷?data/student.txt到服务器上
导?数据：
load data local inpath '/hivedata/student.txt' into table
temp_part;
# 如果是严格模式,则不能导?,需要执?下?代码
set hive.exec.dynamic.partition.mode=nonstrict
insert into dy_part1 partition(dt)
select id,name,dt from temp_part;

混合分区示例

在?个分区表中同时有静态分区,同时也有动态分区,那么这个表的分区表可以称之为混合分区,下?演示下混合分区的流程. (注意列的个数匹配)

创建?个混合分区表

create table dy_part2(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ',';

创建临时表并加载数据

-- 创建分区表
create table temp_part2(
id int,
name string,
year string,
month string,
day string
)
row format delimited fields terminated by ',';
#加载数据 数据参考(data/student.txt?件)
load data local inpath '/opt/data/student.txt' into table
temp_part2;

导?数据到分区表

(注意：字段的对应)

错误?法：
insert into dy_part2 partition (year='2018',month,day)
select * from temp_part2;
正确?法：
insert into dy_part2 partition (year='2018',month,day)
select id,name,month,day from temp_part2;

设置hive为严格模式执?

#严格模式?少要有?个静态分区,?严格模式可以全是动态分区
set hive.mapred.mode=nonstrict/strict;
<name>hive.mapred.mode</name>
<value>nonstrict</value>
<description>
 The mode in which the Hive operations are being performed.
 In strict mode, some risky queries are not allowed to run.
They include:
 Cartesian Product.
 No partition being picked up for a query.
 Comparing bigints and strings.
 Comparing bigints and doubles.
 Orderby without limit.
</description>
?
set hive.mapred.mode=nonstrict/strict
-- 设置非严格/严格模式 ? 
-- 默认非严格
select * from part where dt='20210707';
-- 分区过滤
select * from part3 where day ='08'
-- 严格分区使用分区字段进行查询
-- 有风险的查询不允许执行
-- 如：
    1笛卡尔集
    2分区表不用分区字段过滤
    3bigint和string比较查询
    4bitint和double比较查询
    5orderby语句不带limit查询
    select * from flow2 order by up desc limit 2;
    如下语句不允许查询
    select * from flow2 order by up desc;
    select * from part3;

分区表注意事项

1. hive的分区使?的是表外字段，分区字段是?个==伪==列，但是分区字段是可以
做查询过滤。
2. 分区字段不建议使?中?
3. ?般不建议使?动态分区，因为动态分区会使?mapreduce来进?查询数据，
如果分区数据过多，导致 namenode 和 resourcemanager 的性能瓶颈。所以建议
在使?动态分区前尽可能预知分区数量。
4. 分区属性的修改都可以修改元数据和hdfs数据内容。
Hive分区和Mysql分区的区别
mysql分区字段?的是表内字段；?hive分区字段采?表外字段。
-- 分区表一般适用于行为数据

分桶

为什么要分桶

当单个的分区或者表的数据量过?，分区不能更细粒度的划分数据，就需要使?分桶技术将数据划分成更细的粒度

-- 分区没有办法把数据分的很细的情况用分桶

-- 分区的数据量仍然过大，使用分桶，分到更细的文件

-- 在区或表中创建更多文件

分桶技术

[CLUSTERED BY (COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN
COMMENT'],...)
[SORTED BY (COLUMNNAME [ASC|DESC])...] INTO NUM_BUCKETS BUCKETS]
-- 有多少个桶就有多少个文件
clustered by (columname columntpye [comment 'column comment'],...)[sorted by (columnname[asc|desc])...] into num_buckets buckets

关键字及其原理

==bucket==
分桶的原理:跟MR中的HashPartitioner原理?样:都是key的hash值取模reduce的数
量
MR中：按照key的hash值除以reduceTask取余
Hive中：按照分桶字段的hash值取模除以分桶的个数

分桶的意义

为了保存分桶查询的分桶结构（数据按照分桶字段进?保存hash散列）
分桶表的数据进?抽样和JOIN时可以提?查询效率,?般是?来抽样查询

分桶表的创建示例

分桶表创建流程:
1. ?先要创建带分桶定义的表（分桶表）
2. 然后创建?个临时表（普通表）
3. 从临时表中使?分桶查询将查询到的数据插?到分桶表

对普通表进?分桶查询

create table t_stu(
Sno int,
Sname string,
Sex string,
Sage int,
Sdept string)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/root/hivedata/students.txt' into table
t_stu;
select * from t_stu;
?
--操作
-- 文件
vi student.txt
1，cuihua1,1,23,1
-- 建表
create table t_stu(
sno int,
sname string,
sex string,
sage int,
sdept string
)
row format delimited fields terminated by ','
stored as testfile;
load data local inpath '/root/hivedata/students.txt' into table
t_stu;
select * from t_stu;

简单查询测试：----按照分桶查询
select * from t_stu cluster by (sno); 分桶的同时还会进?排序
select * from t_stu distribute by (sno) sort by (sage desc); distribute只进?分桶
sort by进?排序
注意:
默认只有?个reduce,所以默认的分桶也只有?个,看不到效果,这时我们可以将
reduce的个数设置成4个,再进?查询,分析结果.这?的分桶就类似于mr中的分区
设置reduce task的个数： set mapreduce.job.reduces=4;
查询时分开设置分桶和排序的?的是为了满??户的需求

创建分桶表

create table if not exists buc13(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
数据：从(data/buc1.csv加载如下格式数据)
id,name,age
-- 分桶的字段一定是表内字段不能用表外字段

创建临时表

create table if not exists temp_buc1(
id int,
name string,
age int
)
row format delimited fields terminated by ',';

分桶使?load?式加载数据不能体现分桶

load data local inpath '/opt/data/buc1.txt' into table buc1;
-- 能加载，但是不能体现分桶（使用load装载看不到分桶的效果，使用insertinto方式装载）

加载数据到临时表

load data local inpath '/hivedata/buc1.txt' into table temp_buc1;

使?分桶查询将数据导?到分桶表

insert overwrite table buc13
select id,name,age from temp_buc1
cluster by (id);

设置强制分桶的属性

<!-- 如果要分桶,就要打开分桶的强制模式 -->
set hive.enforce.bucketing=false/true
<name>hive.enforce.bucketing</name>
<value>false</value>
<description>
 Whether bucketing is enforced. If true, while inserting into
the table, bucketing is enforced.</description>

如果设置了reduces的个数和总桶数不? 样，请?动设置

set mapreduce.job.reduces=-1 #-1表示可以根据实际需要来定制reduces的数
量

创建指定排序字段的分桶表

create table if not exists buc8(
id int,
name string,
age int
)
clustered by (id)
sorted by (id) into 4 buckets
row format delimited fields terminated by ',';

导?数据

insert overwrite table buc6
select id,name,age from temp_buc1
distribute by (id) sort by (id asc);
和下?的语句效果?样
insert overwrite table buc8
select id,name,age from temp_buc1
cluster by (id);

-- insert into 
set hive.enforce.bucketing=ture;
insert overwrite table buc13
select sno,sname,sage from t_stu
cluster by (sno);
-----------如下等价于如上-------------------
set hive.enforce.bucketing=ture;
insert overwrite table buc13
select sno,sname,sage from t_stu
distributed by (sno) sort by sno asc;

--查询数据
select * from buc13;

分桶数据是：
1%4  1
2%4  2
3%4  3
4%4  0
5%4  1
6%4  2
7%4  3
8%4  0
9%4  1

分桶表查询案例

select * from buc3;
select * from buc3 tablesample(bucket 1 out of 1 on id);
-- 查询分桶全部数据
select * from buc13 tablesample(bucket 1 out of 1 on id);
-- 查询指定分桶数据--第一桶x%y  x查询的第几桶，y总共桶数
select * from buc13 tablesample(bucket 1 out of 4 on id);
-- 查询指定分桶数据--第二桶
select * from buc13 tablesample(bucket 2 out of 4 on id);
-- 查询指定分桶数据--把四桶压缩成两桶，第一桶是1，3桶第二桶是2，4桶
select * from buc13 tablesample(bucket 1 out of 2 on id);
-- 查询指定分桶数据--把四桶放大成八桶，1分二再查询
select * from buc13 tablesample(bucket 1 out of 8 on id);
-- 查询和分桶配合
select * from buc13 tablesample(bucket 1 out of 2 on id) where id>5;

查询第1桶的数据

默认有4桶
查询第?桶
select * from buc3 tablesample(bucket 1 out of 4 on sno); 
查询第?桶和第三桶
select * from buc3 tablesample(bucket 1 out of 2 on sno);
查询第?桶的前半部分
select * from buc3 tablesample(bucket 1 out of 8 on sno);
tablesample(bucket x out of y on id)
语法：
tablesample(bucket x out of y on sno) 注意：tablesample?定是紧
跟在表名之后
x:代表从第?桶开始查询
y:查询的总桶数，y可以是总桶数的倍数或者因?，x不能?于y
x表示从哪个bucket开始抽取。例如，table总bucket数为32，
tablesample(bucket 3 out of 16)，表示总共抽取（32/16=）2个bucket的数
据，分别为第3个bucket和第（3+16=）19个bucket的数据。
表?,表?,表三说明:
表中的含义是当x取对应的值时,m对应的就是所对应的桶
只记住表的结果即可.
当前的桶数 4 
当y=桶数 实例 (bucket 1 out of 4 on sno) 
m=sno%y m是余数
解释:当x=1,桶数=4,y=4, 只要sno对y取余=0我们就认为这条数据在第1桶
这?当x=1时,m=0,所以取的是第?桶
表?:
第?桶 1 2 3 4
x=1 m=0
x=2 	m=1
x=3 		m=2
x=4 			m=3
当y<桶数 我们称y为因? 实例 (bucket 1 out of 2 on sno) 取的
是第?桶和第三桶
解释:
?先使?n=sno%桶数+1 
再使?m=n判断,这?的m就是下?的m1,m2,m3....
这?当x=1时,m=m1和m3,所以取的是第?桶和第三桶
这?当x=2时,m=m2和m4,所以取的是第?桶和第四桶
表?:
第?桶 1 2 3 4
x=1 m1=x=1(x的数值) m3=1+4/2=3(m1+桶数/y值) 
m5=3+4/2=5(m2+桶数/y值)>4舍掉
x=2 m2=2 m4=m2+4/y值=4 
m6=m4+4/y值=6>4 舍掉
当y>桶数 我们称y为倍数 实例 (bucket 1 out of 8 on sno) 
m=sno%y m是余数
解释:当x=1,桶数=4,y=8, 只要sno对y取余=0我们就认为这条数据在第1桶
表三:
第?桶 1 2 3 4
x=1 m=0
x=2 m=1
x=3 m=2
x=4 m=3
x=5 m=0+4
x=6 m=1+4
x=7 m=2+4
x=8 m=3+4

查询

注意y可以不是总桶数的倍数,但是他会重新分桶,重新查询.

select * from buc3 tablesample(bucket 1 out of 5 on sno);

查询sno为奇数的数据
select * from buc3 tablesample(bucket 2 out of 2 on sno);
查询sno为偶数且age?于30的?
select * from buc3 tablesample(bucket 1 out of 2 on sno) where
age>30;
注意:这?会报错,talesample?定是紧跟在表名之后
select * from buc3 where age>30 tablesample(bucket 1 out of 2 on
sno);
注意:由于有编码问题:当我们写中?时要注意.编码不对查不出结果
其他查询知识:
select * from buc3 limit 3; 查出三?
select * from buc3 tablesample(3 rows); 查出三?
select * from buc3 tablesample(13 percent); 查出13%的内容,如果百分
?不够现实??,?少会显示??,如果百分?为0,显示第?桶
select * from buc3 tablesample(68B);k KB M G T P 查出68B包含的数
据,如果是0B,默认显示第?桶
要求随机抽取3?数据：
select * from t_stu order by rand() limit 3; 随机显示3条数据 
https://www.aboutyun.com/thread-27093-1-1.html

select * from buc3 tablesample(3 rows); 查出三?
select * from buc3 tablesample(10 percent); 查出10%
select * from buc3 tablesample(1k); 查出1k
--随机查出三行
select * from t_stu order by rand() limit 3;

分区分桶联合案例

案例查下

需求：按照性别分区（1男2?），在分区中按照id的奇偶进?分桶：
id,name,sex
数据请参考data/user.txt
建表：
create table if not exists stu(
id int,
name string
)
partitioned by (sex string)
clustered by (id) into 2 buckets
row format delimited fields terminated by ',';
建临时表：
create table if not exists stu_temp(
id int,
name string,
sex string
)
row format delimited fields terminated by ',';
-- 主要要把动态分区设为?严格模式,如下:
set hive.exec.dynamic.partition.mode=nonstrict;
-- 加载临时表的数据
load data local inpath '/opt/data/user.txt' into table stu_temp;
将数据导?到分区分桶表
insert overwrite table stu partition(sex)
select id,name,sex from stu_temp cluster by id ;
查询性别为?，学号为奇数的学?
select * from stu tablesample(bucket 2 out of 2 on id)
where sex = '2';

--分区分桶
create table if not exists buc13(
id int,
name string,
agt int
)
partitioned by(dt string)
clustered by(id) into 2 buckets
row format delimited fields terminated by ',';

分桶表总结

1、定义
 clustered by (id) ---指定分桶的字段
 sorted by (id asc|desc) ---指定数据的排序规则，表示咱们预期的数
据是以这种规则进?的排序
 
2、导?数据
 cluster by (id) ---指定getPartition以哪个字段来进?hash，并且
排序字段也是指定的字段，排序是以asc排列
 distribute by (id) ---- 指定getPartition以哪个字段来进?hash
 sort by (name asc | desc) ---指定排序字段
导数据时：
 insert overwrite table buc3
 select id,name,age from temp_buc1
 distribute by (id) sort by (id asc);
 和下?的语句效果?样
 insert overwrite table buc4
 select id,name,age from temp_buc1
 cluster by (id) ;

注意事项

分区使?的是表外字段，分桶使?的是表内字段
分桶更加?于细粒度的管理数据，更多的是使?来做抽样、join

查询语句基本语法

Select查询结构基本语法

下?是?个SQL查询语句的基本结构

select selection_list # 查询的列
from table # 要查询的表
join on # 连接的表
where # 查询条件
group by # 分组查询
having # 分组条件过滤
order by # 字段排序
sort by # 结果排序
limit # 限制结果数
union/union all # 合并表

sql语句的执?顺序

FROM
<left_table>
ON
<join_condition>
<join_type>
JOIN
<right_table>
WHERE
<where_condition>
GROUP BY
<group_by_list>
HAVING
<having_condition>
SELECT
DISTINCT
<select_list>
ORDER BY
<order_by_condition>
LIMIT
<limit_number>

----
explain
select  
sno,sname
from t_stu
group by sno,sname
limit 2
;

查询注意事项

尽量不要使??查询、尽量不要使? in not in
select * from aa1
where id in (select id from bb);
查询尽量避免join连接查询，但是这种操作咱们是永远避免不了的。
查询永远是?表驱动?表（永远是?结果集驱动?结果集）

数据库建模

关系型数据库最难的地?，就是建模（model）。
错综复杂的数据，需要建?模型，才能储存在数据库。所谓"模型"就是两样东?：实
体（entity）+ 关系（relationship）ER图。实体指的是那些实际的对象，带有??
的属性，可以理解成?组相关属性的容器。关系就是实体之间的联系，通常可以分
成"?对?"、"?对多"和"多对多"等类型。

---宽表，尽量放多的数据

Join的语法与特点

表之间关系

在关系型数据库??，每个实体有??的?张表（table），所有属性都是这张表的字
段（field），表与表之间根据关联字段"连接"（join）在?起。所以，表的连接是
关系型数据库的核?问题。
表的连接分成好?种类型。
内连接（inner join）
外连接（outer join）
左连接（left join）
右连接（right join）
全连接（full join）

所谓"连接"，就是两张表根据关联字段，组合成?个数据集。问题是，两张表的关联字
段的值往往是不?致的，如果关联字段不匹配，怎么处理？?如，表 A 包含张三和李
四，表 B 包含李四和王五，匹配的只有李四这?条记录。
很容易看出，?共有四种处理?法。
只返回两张表匹配的记录，这叫内连接（inner join）。
返回匹配的记录，以及表 A 多余的记录，这叫左连接（left join）。
返回匹配的记录，以及表 B 多余的记录，这叫右连接（right join）。
返回匹配的记录，以及表 A 和表 B 各?的多余记录，这叫全连接（full join）。

这四种连接，?可以分成两?类：内连接（inner join）表示只包含匹配的记录，外连
接（outer join）表示还包含不匹配的记录。所以，左连接、右连接、全连接都属于外
连接。
此外，还存在?种特殊的连接，叫做"交叉连接"（cross join），指的是表 A 和表
B 的不存在关联字段

join查询
left join和join这两个最常用
--小结果集驱动大结果集
表连接分成多种类型
内连接：交集
左连接：以左边为准
右连接：以右边为准
全连接：两表所有数据
--不要用笛卡尔积
create table if not exists u1(
id int,
name string,
age int)
row format delimited fields terminated by ',';
create table if not exists u2(
id int,
name string,
age int)
row format delimited fields terminated by ',';
load data local inpath '/home/u1 into table u1;
load data local inpath '/home/u1 into table u2;
在u1中出现过id，但是在u2未出现过id
select uu.*
from u1 uu
left join u2 uu2
on uu.id =uu2.id
where uu2.id is null;
-----
---在u1中出现过的id同时也在u2中出现的id

select uu.*
from u1 uu
left join u2 uu2
on uu.id =uu2.id
where uu2.id=uu.id;
select uu.*
from u1 uu
join u2 uu2
on uu.id =uu2.id
where uu2.id=uu.id;
---left semi join 左半开连接查询(select 字段和过滤字段都不能用右表字段)
select uu.*
from u1 uu
left semi join u2 uu2
on uu.id =uu2.id
where uu.id is not null;
--全外连接
full outer join
(互相进行连接，如果友一张表的数据连接不上来用null来代替)
select 
* 
from
u1
full outer join u2
on u1.id=u2.id

join 需要注意的
-- inner join 和outer join的区别
-- 分区字段对out join的on条件无效 对innerjoin的条件优先
1有inner join但是没有full inner join
2有fullouter join 但是没有outer join
所有的join连接 hive1.2.x只支持等值连接 （=和and）不支持！=，<,> ,<>,>=,<=,or
2.X版本的都支持


map-side.join
map端的场景：需要有小表
如果表中的所有小表，会将小表缓存在内存中，然后在map端进行查找，hive在map端查找时会减小整体查询量，从内存中读取缓存的小表数据，效率较快，还剩去大连数据传输和shuffle耗时

a（u string）3条数据
c(u string) 1亿条数据
=============
a的数据缓存到内存中（hashtable中）
map（）{
    hashtable h=3条数据
    --map阶段读取一亿条数据
   h.contain(c表的u)；
}

set hive.auto.convert.join=true;
开启小表
hive mapjoin.smalltable.filesize
--小表默认大小 25000000字节<==>23.84兆
select
*
from u1
on u1.id=u2.id;
mapjoin 从1.2.x默认就自动开启转换的，如果是以前的版本，用关键字来转换

select
/+MAPJOIN(d)/
e.*
from u1 d
join u2 e
on d.id = e.id
;


reduce join
是在ruduce端进行连接的查找关系，如果我们能够将某个大表的数据清洗成小表，建议做子查询然后来优化reducejoin

reducejoin 通常比mapjoin效率低很多

Join案例-语句特点

连接分类

类型：
left join
left outer join
right join
right outer join
inner join
full outer join
特殊类型：
left semi join

--左半开连接
--只能选左边的表的字段，不能选取右边表的字段用来查存在或不存在

建两张表

create table if not exists u1(
id int,
name string
)
row format delimited fields terminated by ',';
create table if not exists u2(
id int,
name string
)
row format delimited fields terminated by ',';

加载数据

load data local inpath '/opt/data/u1.txt' into table u1;
load data local inpath '/opt/data/u2.txt' into table u2;

内连接

使?关键字join 、 inner join 、多表?逗号分开

select *
from u1
join u2 on u1.id =u2.id;
select *
from u1
inner join u2 on u1.id =u2.id;
select *
from u1,u2
where u1.id=u2.id;

左连接

数据以左表的数据为准，左表存在的数据都查询出，右表的数据关联上就出来，关联不上以NULL代替

left join/left outer join/left semi join :
left join 从hive0.8版本开始有
left join 和 left outer join ?乎差不多
select *
from u1
left join u2 on u1.id =u2.id;
select *
from u1
left outer join u2 on u1.id =u2.id;
上?的 SQL 语句还可以加上where条件从句，对记录进?筛选，?如只返回表 A ?
?不匹配表 B 的记录。

右连接

以右表为准，来匹配左表信息，如果匹配不上，使?NULL来代替。

right join /right outer join
select *
from u1
right join u2 on u1.id =u2.id;
select *
from u1
right outer join u2 on u1.id =u2.id;
hive不?持right semi join：
这两个得到的结果也是?样的。

全连接

full outer join
(相互进?连接，如果有?张表的数据连接不上来使?NULL来代替)
select *
from u1
full outer join u2 on u1.id =u2.id;

Hive专有Join特点

left semi join

在hive中，有?种专有的join操作,left semi join,我们称之为半开连接。它是left join 的?种优化形式，只能查询左表的信息，主要?于解决hive中左表的数据是否存在的问题。相当于exists关键字的?法。

select *
from u1
left semi join u2 on u1.id =u2.id;
下?的写法是错误的,不能查找右表的信息
select u1.* ,u2.*
from u1
left semi join u2 on u1.id =u2.id;
select u1.*
from u1
where exists (select 1 from u2 where u2.id =u1.id);

?查询

# 演示员?表
create table emp
(
 EMPNO string,
 ENAME string,
 JOB string,
 MGR string,
 HIREDATE string,
 SAL string,
 COMM string,
 DEPTNO string
) row format delimited fields terminated by ',';
# 演示部?表
create table dept
(
 DEPTNO string,
 DNAME string,
 LOC string
) row format delimited fields terminated by ',';

导?数据:

load data local inpath "/root/hivetest/emp.txt" into table emp;
select * from emp;
load data local inpath "/root/hivetest/dept.txt" into table
dept;
select * from dept;

hive对?查询?持不是很友好，特别是 "="问题较多

不能使?=
select
e.*
from emp e
where e.deptno = (
select
d.deptno
from dept d
limit 1
);
可以使?in
select
e.*
from emp e
where e.deptno in (
select
d.deptno
from dept d
);

# inner join 和outer join的区别：
# 分区字段对outer join 中的on条件?效，对inner join 中的on条件有效
有inner join 但是没有full inner join
有full outer join但是没有outer join
所有join连接，只?持等值连接(= 和 and )。不?持 != 、 < 、> 、 <> 、>=、
<= 、or

map-side join

如果所?的表中有?表，将会把?表缓存在内存中，然后在map端进?连接查找。 hive在map端查找时会减?整体查询量，从内存中读取缓存的?表数据，效率较快，还省去?量数据传输和shuffle耗时

#注意使?下?属性打开Map-Join：
set hive.auto.convert.join=true
select
e.*
from u1 d
join u2 e
on d.id = e.id;

# 以前的?版本，需要添加(/+MAPJOIN(?表名)/)来标识该join为map端的join。
hive 0.7以后hive已经废弃，但是仍然管?：
select
/+MAPJOIN(d)/
e.*
from u1 d
join u2 e
on d.id = e.id
;
到底?表多?才会被转换为map-side join：
set hive.mapjoin.smalltable.filesize=25000000 约23.8MB
# 以前的?版本，需要添加(/+MAPJOIN(?表名)/)来标识该join为map端的join。
hive 0.7以后hive已经废弃，但是仍然管?：
select
/+MAPJOIN(d)/
e.*
from u1 d
join u2 e
on d.id = e.id
;
到底?表多?才会被转换为map-side join：
set hive.mapjoin.smalltable.filesize=25000000 约23.8MB

表达式别名

对有些表达式来说,查询的结果字段名?较难理解,这时候可以给表达式起?个别名,? 便查看结果的表头.使?as把复杂表达式命名为?个容易懂的别名

select
d.dname,
length(d.dname) as nameLength
from dept d ;

查询?句

where后不能跟聚合函数,但是可以使用普通函数来过滤
select * from u1 where length(name)>5;
group by：分组通常和聚合函数搭配使用，
查询的字段要么出现在groupby后面要么出现在聚合函数里
select 
id,
name,
age
from u1
groupby id ,name 
--以上语句不能执行
having是对分组以后的结果集进行过滤
limit语句限制从结果集中的数据条数，
# 将set hive.limit.optimize.enable=true 时，limit限制数据时就不会全盘扫描，?是根据限制的数量进?抽样。默认优化
同时还有两个配置项需要注意：
hive.limit.row.max.size 这个是控制最?的抽样数量100000十万默认值
hive.limit.optimize.limit.file 这个是抽样的最??件数量10个默认值
排序
orderby：全局排序，使用orderby之后都将会用一个reduce来跑最后数据--一个reduce且数据量大则效率很低
sortby ：局部排序，指保障每一个reduce结果有序 如果reduce个数是1则orderby和sortby一样
distributeby：用来确定用哪个列（字段）来分区，一般要写在sortby的前面，多用于排名函数
clusterby--destributebyid sortbyid 
--设置reduce个数不生效仍然会使用一个reduce
set mapreduce.job.reduces=3;
select
e.empno
from emp e
order by e.empno desc ;
只要使?order by ，reducer的个数将是1个。
--sortby
set mapreduce.job.reduces=3;

select
e.empno
from emp e
sort by e.empno desc ;
--新表
set mapreduce.job.reduces=3;
create table t_s
as
select
e.empno
from emp e
sort by e.empno desc ;
---
set mapreduce.job.reduces=3;
select
e.empno
from emp e
cluster by e.empno ;
---如上语句等于如下语句
set mapreduce.job.reduces=3;
select
e.empno
from emp e
distribute by e.empno sort by e.empno ;

Where语句特点

where后不能跟聚合函数

#下?代码不能正确执?
select
e.deptno,
count(e.deptno) ct
from emp e
where count(e.deptno) > 3
group by e.deptno;

Group By语句特点

group by:分组，通常和聚合函数搭配使? 查询的字段要么出现在group by 后?，要么出现在聚合函数??

select
e.deptno,
count(e.ename) ct
from emp e
group by e.deptno;

Having语句

Hiving是对分组以后的结果集进?过滤。

select
e.deptno,
count(e.deptno) ct
from emp e
group by e.deptno
having ct > 3;

Limit语句

限制从结果集中取数据的条数,?般?于分?

# 将set hive.limit.optimize.enable=true 时，limit限制数据时就不会全
盘扫描，?是根据限制的数量进?抽样。
同时还有两个配置项需要注意：
hive.limit.row.max.size 这个是控制最?的抽样数量
hive.limit.optimize.limit.file 这个是抽样的最??件数量
select
 e.deptno,
 count(e.deptno) ct
from emp e
group by e.deptno
having ct > 3
limit 1;

排序

order by

1.全局排序:引发全表扫描,reduce数量?个,不建议使?

sort by

默认分区内排序,当reduceTask数量是1时候,那么效果和order by ?样
?般和distribute by 搭配使?

distribute by

?来确定?哪个列(字段)来分区,?般要写在sort by的前?

cluster by

分区的列和分区内排序的列相同时,那么可以?cluster by deptno来代替 
distribute by deptno sort
cluster by ：兼有distribute by以及sort by的升序功能。
排序只能是升序排序（默认排序规则），不能指定排序规则为asc 或者desc。

排序详细区别

distribute by:根据by后的字段和reducetask个数，决定maptask的输出去往那
个reducertask。
默认使?查询的第?列的hash值来决定maptask的输出去往那个reducertask。如果
reducertask的个数为1时没有任何体现。
sort by:局部排序，只保证单个reducertask有顺序。
order by：全局排序，保证所有reducer中的数据都是有顺序。
如果reduser个数只有?个，两者都差不多。
两者都通常和 desc 、 asc 搭配。默认使?升序asc。
order by的缺点：
由于是全局排序，所以所有的数据会通过?个Reducer 进?处理，当数据结果较?的
时候，?个Reducer 进?处理?分影响性能。
注意事项：
当开启MR 严格模式的时候ORDER BY必须要设置 LIMIT ?句,要限制全局查询 ，否
则会报错
-- ?动设置reducer个数：
set mapreduce.job.reduces=3;
select
e.empno
from emp e
order by e.empno desc ;
只要使?order by ，reducer的个数将是1个。

表合并 Union

select d.id id,
d.name name
from u1 d
union
select 
d1.id id,
d1.name name
from u1 d1;

select 
tmp.*
from
((select d.id id,
d.name name
from u1 d)
union
(
select 
d1.id id,
d1.name name
from u1 d1)) tmp;
=========
select 
tmp.*
from
((select d.id id,
d.name name
from u1 d)
union all
(
select 
d1.id id,
d1.name name
from u1 d1)) tmp;

# 
a b c d e f 
t1
a b c 
1 2 3
2 2 3
t2
d e f 
3 3 2 
2 1 3
r1:
a b c d e f
1 2 3 3 3 2
2 2 3 2 1 3


1.连接的两个子句 的个数相同，不相同少的为null
select 
tmp.*
from
((select d.id id,
d.name name
from u1 d)
union
(
select 
d1.id id,
d1.name name
from u2 d1 
limit 1
)) tmp;
-- 上面有四条数据
-- 单个自居支持groupby sortby limit等，但是1.x的版本不支持上述语句

如果把两张表 结果联合在?起,可以?union,有下?两种?法
union ：将多个结果集合并，去重，排序(按照第?个字段排序,如果进?了分桶,
会按照分桶内排序)
union all ：将多个结果集合并，不去重，不排序。

select
d.deptno as deptno,
d.dname as dname
from dept d
union
select
e.deptno as deptno,
e.dname as dname
from dept e;
select
d.deptno as deptno,
d.dname as dname
from dept d
union all
select
d.deptno as deptno,
d.dname as dname
from dept d;
-- 单个union语句不?持：orderBy、clusterBy、distributeBy、sortBy、
limit
-- 单个union语句字段的个数要求相同，字段的顺序要求相同。

数据类型

array类型 操作：
zs 98,96,66
ls 78,97,89
--创建表
create table if not exists arr1(
name string,
score array<string>
)
row format delimited fields terminated by ' '
collection items terminated by ',';
-- 导入数据
load data local inpath '/home/s1' into table arr1;
-- 查询
select
name,
score[0]
from arr1
where name='ls';

--行转列
select 
explode(score) score 
from arr1
;
-- 带表的其他字段需要虚拟表
select 
name,
cj
from arr1
lateral view explode(score) score as cj
;
create table a(
content string)
location '/input'
;

--统计词频
select 
word,
1
from a
lateral view explode(split(content,' ')) content as word

另一个炸裂函数 posexplode带下标的

列转行
1使用groupby 搭配collect_set函数或者collect_list
collect_set:无序，去重
collect_list:有序，不去重
2使用union搭配groupby

select tmp.name,collect_list(tmp.cj)
from (select 
name,
cj
from arr1
lateral view explode(score) score as cj
) tmp group by tmp.name;


union方式
zs 98 0 0
zs 0 96 0
zs 0 0 66

ARRAY 有序的同类型的集合 array(1,2)
MAP key-value,key必须为原始
类型，value可以任意类型 map(‘a’:1,’b’:2)
STRUCT 字段集合,类型可以不同 struct(‘1’,1,1.0),
named_stract(‘col1’,’1’,’col2’,1,’clo3’,1.0)
UNION 在有限取值范围内的?个值 create_union(1,’a’,63)

特殊基本类型

java中有的?hive中没有的：
long
char
short
byte

简单数据类型创表案例

-- 创建?个函多种基本类型的表
create table if not exists bs1(
id1 tinyint,
id2 smallint,
id3 int,
id4 bigint,
sla float,
sla1 double,
isok boolean,
content binary,
dt timestamp
)
row format delimited fields terminated by ',';
-- 准备要插?的数据:
23, 12,342523,455345345,30000,600005,nihao, helloworld2 ,2017-
06-02 11:41:30
12, 13,342526,455345346,80000,100000, true, helloworld1, 2017-
06-02 11:41:30
load data local inpath '/opt/data/bs1.csv' into table bs1;

复杂的数据类型

复杂类型分为三种,分别是数组array,键值对map,和结构体struct

array : col array<基本类型> ,下标从0开始，越界不报错，以NULL代替
map : column map<string,string>
struct: col struct

array示例

-- 数据如下: 注意下?列之间是通过TAB来分隔的
zhangsan 78,89,92,96
lisi 67,75,83,94
# 注意terminated顺序,新建数组类型
create table if not exists arr1
(
 name string,
 score array<string>
) row format delimited fields terminated by '\t'
 collection items terminated by ',';
# 导?数据
load data local inpath '/root/hivetest/arr1.csv' into table
arr1;
# 查询：
select * from arr1;
select name,score[1] from arr1 where size(score) > 3;

列转?

就是把?列数据转化成多?

#原始数据：
zhangsan 90,87,63,76
#转化后数据
zhangsan 90
zhangsan 87
zhangsan 63
zhangsan 76

内嵌查询： explode:展开列表中的每个元素?成??

select explode(score) score from arr1;

lateral view : 虚拟表

侧视图的意义是配合explode，?个语句?成把单?数据拆解成多?后的数据结果
集。
解释:lateral view 会将explode?成的结果放到?个虚拟表中，然后这个虚拟表会和
当前表join,来达到数据聚合的?的。
结构解析:要进?聚合的虚拟表，lateral view explode(字段) 虚拟表名 as 虚拟表字段
名

select name,cj from arr1 lateral view explode(score) score as cj;

统计每个学?的总成绩：

select name,sum(cj) as totalscore from arr1 lateral view
explode(score) score as cj group by name;

?转列

就是把多?数据转化成??数据:

#原始数据：
zhangsan 90
zhangsan 87
zhangsan 63
zhangsan 76
#转化后数据
zhangsan 90,87,63,76

准备数据：

create table arr_temp1
as
select * from arr1 lateral view explode(score) score as cj;

collect_set函数：

它们都是将分组中的某列转为?个数组返回

create table if not exists arr3(
name string,
score array<string>
)
row format delimited fields terminated by ' '
collection items terminated by ',';

将数据写成array格式：

insert into arr3
select name,collect_set(cj) from arr_temp group by name;

map示例

-- 创建map表

create table if not exists map1(
name string,
score map<string,int>)
row format delimited fieds terminated by ' '
collection items terminated by ','
map keys terminated by ':';

--生成数据
vi /home/map
--装载数据
load data inpath '/home/map' into table map1;
--查询
select 
name,
score['chinese']
from map1;
--括号中如果是非表中字段则会显示为空（NULL）
--key不存在返回NULL
列转行
--数据展开（炸裂）
select explode(score) as (m_course,m_score) from map1;
--查询所有字段
select name,m_course from mpa1 lateral view explode(score) as m_course,m_score
--行转列
--创建表
create table map_temp(
name string,
score1 int,
score2 int,
score3 int
)
row format delimited fields terminated by ',';
map_temp.csv
load data local inapth '/home/map' into table map_temp;
--创建集合表
create table if not exists map2(
name string,
score map<string,int>)
row format delimited fieds terminated by ' '
collection items terminated by ','
map keys terminated by ':';
--插入数据（map函数）
insert overwrite map2   -- into插入/overwrite覆盖
select 
name,
map('chinese',score1,'math',score2,'english',score3)
from map_temp;
select array(1,2,3,5); 结果 [1,2,3,5]

name7 {"chinese":90,"math":75,"english":88}
create table map_temp_1(
name string,
score string
)
row format delimited fields terminated by '\t';
--查询英语成绩
struct类型
类似与数组，但是类型不一致
goudan beijing,beijing,changpingqu,shayanglu,2
jige honkong,honkong,wanzaiqu,shui,666,nice

create table if not exists str2(
uname string,
addr struct <
	province:string,
	city:string,
	xian:string,
	dadao:string>
	)
	row format delimited fields terminated by '\t'
	collection items terminated by ',';
导入数据
load data local inpath '/root/hivetest/struct.txt' into table str2;

查询数据
select uname,addr.province,addr.city,addr.xian from str2;

--复杂数据类型
map嵌套

-- 定义嵌套类型表
1 xdd wuxian:(300,300),gongjijin:1200,1500,shebao:300
2 lkq wuxian:(200,300),gongjijin:1000,1200,shebao:200

create table qt(
id int,
name string,
addr map<string,array<string>>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

--加载数据类型 有些数据可能会丢失
load data local inpath '/home/com' overwrite into table qt;
--定义嵌套数据类型表数据 --使用hive默认分割符
-- \001---cnotrol+V和control+A
-- \002---cnotrol+V和control+B
--指定多字符串为分割符
create table qt_sp1(
id int,
name string,
addr map<string,array<string>>)
row format delimited fields terminated by '^A';
load data local inpath '/home/sp' overwrite into table qt_sq1;
--指定\001为分割符
create table qt_sp2(
id int,
name string);
load data local inpath '/home/sp' overwrite into table qt_sq2;
--建表示直接使用\001
create table qt_sp3(
id int,
name string,
addr map<string,array<string>>)
row format delimited fields terminated by '\001';
collection items terminated by '\001';
map keys terminated by '\001';
load data local inpath '/home/sp' overwrite into table qt_sp3;

select qt_sp3
name,
addr["wuxian"][1]
from qt_sp3;

zhangsan chinese:90,math:87,english:63,nature:76
lisi chinese:60,math:30,english:78,nature:0
wangwu chinese:89,math:25,english:81,nature:9

创建map类型的表

create table if not exists map1(
name string,
score map<string,int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
map keys terminated by ':';

加载数据

load data local inpath '/root/hivetest/map1.txt' into table map1;

Map格式数据查询

#查询数学?于35分的学?的英语和?然成绩：
select
m.name,
m.score['english'] ,
m.score['nature']
from map1 m
where m.score['math'] > 35;

Map列转?

explode 展开数据

select explode(score) as (m_class,m_score) from map1;

Lateral view Lateral View和explode?起使?，它能够将??数据拆成多?数据，并在此基础上对拆分后的数据进?聚合。

select name,m_class,m_score from map1 lateral view explode(score)
score as m_class,m_score;

Map?转列

name7,38,75,66
name6,37,74,65
name5,36,73,64
name4,35,72,63
name3,34,71,62
name2,33,70,61
name1,32,69,60

创建临时表,并加载数据

create table map_temp(
name string,
score1 int,
score2 int,
score3 int
)
row format delimited fields terminated by ',';
# map_temp.csv
load data local inpath '/opt/data/map_temp.csv' into table
map_temp;

创建要导?数据Map表

create table if not exists map2(
name string,
score map<string,int>
)
row format delimited fields terminated by ' '
collection items terminated by ','
map keys terminated by ':';

导?数据：

insert into map2
select name,map('chinese',score1,'math',score2,'english',score3)
from map_temp;

struct

create table if not exists str2( 
uname string, 
addr struct <
 province:string,
 city:string,
 xian:string,
 dadao:string >
)
row format delimited fields terminated by '\t' 
collection items terminated by ',';
#导?数据：
load data local inpath '/root/hivetest/struct.txt' into table
str2;
查询数据：
select uname,addr.province,addr.city,addr.xian from str2;

复杂数据类型案例

uid uname belong tax addr
1 xdd ll,lw,lg,lm wuxian:300,gongjijin:1200,shebao:300 北京,?城
区,中南海
2 lkq lg,lm,lw,ll,mm wuxian:200,gongjijin:1000,shebao:200 河北,?
家庄,中?路

#查询：下属个数?于4个，公积??于1200，省份在河北的数据
create table if not exists tax2(
id int,
name string,
belong array<string>,
tax map<string,double>,
addr struct<province:string,city:string,road:string>
)
row format delimited fields terminated by ' '
collection items terminated by ','
map keys terminated by ':'
stored as textfile;
# 导?数据
load data local inpath '/root/hivetest/tax.txt' into table tax;
#查询：下属个数?于4个，公积??于1200，省份在河北的数据
select id,
name,
belong[0],
belong[1],
tax['wuxian'],
tax['shebao'],
addr.road
from tax
where size(belong) > 4 and
tax['gongjijin'] < 1200 and
addr.province = '河北';
结果:
2 lkq lg lw 2000.0 300.0 河北 ?家庄

嵌套数据类型

在Hive中,有时候数据结构?较复杂,这时候可以?到嵌套类型,就是?个类型??可以
放置另外?个类型.这时候为了避免数据发送混乱,?般可以使?Hive?带的分隔符,
hive共?持8个层级的分隔符，依次是：\001,\002,\003,...\008
下?可以简单演示??般分隔符来表示嵌套类型

-- map嵌套使?(数据为qt.txt)
uid uname belong tax addr
1 xdd wuxian:(300,300),gongjijin:1200,1500,shebao:300
2 lkq wuxian:(200,300),gongjijin:1000,1200,shebao:200
-- 定义嵌套类型数据表
create table qt(
id int,name string,
addr map<string,array<string>>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
-- 加载数据类型,有些数据可能会丢失
load data local inpath '/opt/data/qt.txt' overwrite into table
qt;
-- 定义嵌套类型数据表--使?hive默认分隔符 
写数据时
-- \001---cnotrol+V和control+A
-- \002---cnotrol+V和control+B
-- ...
建表时,直接使?\001
create table qt(
id int,name string,
addr map<string,array<string>>)
row format delimited fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003';

系统内置函数

函数----随机函数

查看函数：hive>show functions

select rand();

select rand(8);在范围内取值

--round保留小数点位数

select round(rand()*100,2)

--floor()向下取整

select floor(rand()*100)

--ceil()向上取整

select ceil(rand()*100)

--split()切分

select split(rand()*100,'\.')[0]

--substr()字符串 substting （）截取

select substr（rand()*100,0,2）

select substring（rand()*100,0,2）

--cast(value as 类型)类型转换

select cast(rand()*100 as int)

--concat(col1,col2,col3)拼接

select concat("1.0","2.0")

--concat_sa带分割符拼接

select concat_ws("_","1","2","3")

--if(条件表达式，"满足true执行","满足false执行")

select if(1==1 ,'true','false');

select if(1!=2,if(2==2,'女','妖'),'false')

--case when

--when后面写表达式

select

case

when sex=1 then '男'

when sex=2 then '女'

else '妖' end

from t_stu

--case 后面写字段

select

case

when 1 then '男'

when 2 then '女'

else '妖' end

from t_stu

--length()字符段长度

select length("abcde")

--size（）数组的长度

select size(array(1,2,5,6));

--size()map的长度

select size(map(1,2,5,6));

---lower()转小写

select lower("Acc");

--upper()转大写

--nvl(值1，值2)判空，如果值1为空则返回值2如果不为空，则返回值1

select nvl(123,666);

select nvl(NULL,666);

--isnull 或者isnotnull判空

select isnull(NULL)

日期时间函数

--时间戳转日期

select unix_timestamp()；获取时间戳

select unix_timestamp(1625810805);--默认日期时间

select unix_timestamp(1625810805，'yyyy-MM-dd hh');--转换为指定日期格式

--日期转时间戳

select unix_timestamp('2021-07-09 11：33：33')--默认格式为日期时分秒

select unix_timestamp('2021-07-09','yyyy-MM-dd')--指定日期格式并转换为时间戳

--获取日期

select data('2021-07-09 11：33：33')

--获取年

select year('2021-07-09 11：33：33')

--获取月

select month('2021-07-09 11：33：33')

--获取日

select day('2021-07-09 11：33：33')

--获取小时

select hour('2021-07-09 11：33：33')

--加天

select data_add('2021-07--09 11：33：33' ,1)加一天

select data_add('2021-07--09 11：33：33' ,-1)加-1天

--减一天

select date_sum(current_date,1)减一天

--计算时间差

select datediff('2021-07-09','2021-07-06')

--获取当前日期

select current_date()

--日期格式转换

select date_format(current_timestamp(),'yyyy-MM-dd hh:mm:ss');

select data_format(current_date(),'yyyyMMdd');

select date_format('2017-01-01','yyyy-MM-dd hh:mm:ss')

select date_format(20170101,'yyyy-MM-dd hh:mm:ss')不能这样转换需要标准的日期格式

窗口函数

class name 分值

1 zs 67

1 ls 72

1 ww 56

2 gg 88

2 hh 76

2 jj 92

日期销售

8 678.99

9 789.89

10 56.9

11 67.12

窗口函数

窗口是将整个数据集（表）划分成多个小数据集进行统计分析

窗口又分为物理窗口和逻辑窗口

窗口函数一般不和groupby搭配使用

窗口函数和聚合函数的区别

聚合函数是一个分组返回一个值；而窗口函数是每一行都要返回一个聚合值

举例

id price

1 12

1 15

1 22

2 23

2 13

按照id分组sum：

1 12+15+22

2 36

窗口函数

1 12

1 27

1 49

2 23

2 36

应用

某天某应用的累计，截至当前总的

函数查看

可以?下?两个命令查看Hive中的函数

-- 显示Hive中所有函数
show functions;
-- 查看某个函数的?法
desc function array;

?期函数

因为Hive的核?功能和海量数据统计分析,?在统计分析时?期时间是?个?常重要的维度, 所以?期函数在Hive使?中尤为重要.

-- 时间戳转?期
select from_unixtime(1505456567);
select from_unixtime(1505456567,'yyyyMMdd');
select from_unixtime(1505456567,'yyyy-MM-dd HH:mm:ss');
-- 获取当前时间戳
select unix_timestamp();
-- ?期转时间戳
select unix_timestamp('2017-09-15 14:23:00');
-- 计算时间差
select datediff('2018-06-18','2018-11-21');
-- 查询当?第?天
select dayofmonth(current_date);
-- ?末:
select last_day(current_date);
--当?第1天:
select date_sub(current_date,dayofmonth(current_date)-1);
--下个?第1天:
select 
add_months(date_sub(current_date,dayofmonth(current_date)-1),1);
-- 当前?期
select current_date
-- 字符串转时间（字符串必须为：yyyy-MM-dd格式）
select to_date('2017-01-01 12:12:12');
-- ?期、时间戳、字符串类型格式化输出标准时间格式：
select date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss');
select date_format(current_date(),'yyyyMMdd');
select date_format('2017-01-01','yyyy-MM-dd HH:mm:ss');

字符串函数

lower--（转?写）
select lower('ABC');
upper--（转?写）
select lower('abc');
length--（字符串?度，字符数）
select length('abc');
concat--（字符串拼接）
select concat("A", 'B');
concat_ws --（指定分隔符）a-b-c
select concat_ws('-','a' ,'b','c');
substr--（求?串）
select substr('abcde',3);

类型转换函数

cast(value as type) -- 类型转换
select cast('123' as int)+1;

数学函数

round --四舍五?((42.3 =>42))
select round(42.3);
ceil --向上取整(42.3 =>43)
select ceil(42.3);
floor --向下取整(42.3 =>42)
select floor(42.3);

判断为空函数

nvl（expr1，expr2）
#作?：将查询为Null值转换为指定值。
#若expr1为Null，则返回expr2，否则返回expr1。
select nvl(count,2);

窗?函数

简介

窗?函数?名开窗函数，属于分析函数的?种。?于解决复杂报表统计需求的功能强 ?的函数。窗?函数?于计算基于组的某种聚合值，它和聚合函数的不同之处是：对于每个组返回多?，?聚合函数对于每个组只返回??。开窗函数指定了分析函数 ?作的数据窗???，这个数据窗???可能会随着?的变化?变化。

默认的数据库的查询都是要么详细记录,要么聚合分析,如果要查询详细记录和聚合数据,必须要经过两次查询简单的说窗?函数对每条详细记录开?个窗?,进?聚合统计的查询默认mysql?版本没有?持,在最新的8.0版本中?持, Oracle和Hive中都?持窗? 函数

创建order表:

create table if not exists t_order
(
 name string,
 orderdate string,
 cost int
) row format delimited fields terminated by ',';

加载数据:

load data local inpath "/root/hivetest/order.txt" into table
t_order;

语法

over 开窗

使?窗?函数之前?般要通过over()进?开窗,简单可以写成==函数+over==简单的写法如下:

-- 1.不使?窗?函数
-- 查询所有明细
select * from t_order;
# 查询总量
select count(*) from t_order;
-- 2.使?窗?函数
select *, count(*) over() from t_order;

窗?函数是针对每??数据的.

如果over中没有参数,默认的是全部结果集

查询在2018年1?份购买过的顾客购买明细及总?数

select *,count(*) over ()
from t_order
where substring(orderdate,1,7) = '2018-01'

partition by?句

在over窗?中进?分区,对某?列进?分区统计,窗?的??就是分区的?? 需求:查看顾客的购买明细及?购买总额

select name, orderdate, cost, sum(cost) over (partition by
month(orderdate))
from t_order;

order by

order by?句会让输?的数据强制排序

select name, orderdate, cost, sum(cost) over (partition by
month(orderdate) order by orderdate)
from t_order;

Window?句

如果要对窗?的结果做更细粒度的划分,那么就使?window字句,常?的有下??个
PRECEDING：往前
FOLLOWING：往后
CURRENT ROW：当前?
UNBOUNDED：起点，UNBOUNDED PRECEDING 表示从前?的起点，
UNBOUNDED FOLLOWING：表示到后?的终点

?般window?句都是==rows==开头

select name,orderdate,cost,
 sum(cost) over() as sample1,--所有?相加
 sum(cost) over(partition by name) as sample2,--按name分组，
组内数据相加
 sum(cost) over(partition by name order by orderdate) as
sample3,--按name分组，组内数据累加
 sum(cost) over(partition by name order by orderdate rows
between UNBOUNDED PRECEDING and current row ) as sample4 ,--和
sample3?样,由起点到当前?的聚合
 sum(cost) over(partition by name order by orderdate rows
between 1 PRECEDING and current row) as sample5, --当前?和前??
?做聚合
 sum(cost) over(partition by name order by orderdate rows
between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前?和前边
??及后???
 sum(cost) over(partition by name order by orderdate rows
between current row and UNBOUNDED FOLLOWING ) as sample7 --当前?
及后?所有?
from t_order;

查看顾客到?前为?的购买总额

select name,
 t_order.orderdate,
 cost,
 sum(cost)
 over (partition by name order by orderdate rows
between UNBOUNDED PRECEDING and current row ) as allCount
from t_order;

序列函数

NTILE

?于将分组数据按照顺序切分成n?，返回当前切?值如果切?不均匀，默认增加第?个切?的分布

select name,orderdate,cost,
 ntile(3) over(partition by name) # 按照name进?分组,在分组内
将数据切成3份
from t_order;

LAG和LEAD函数

lag返回当前数据?的上??数据 lead返回当前数据?的下??数据

查询顾客上次购买的时间

select name,orderdate,cost,
 lag(orderdate,1,'1990-01-01') over(partition by name order
by orderdate ) as time1
from t_order;

取得顾客下次购买的时间

select name,orderdate,cost,
 lead(orderdate,1) over(partition by name order by
orderdate ) as time1
from t_order;

first_value和last_value

first_value 取分组内排序后，截?到当前?，第?个值
last_value 分组内排序后，截?到当前?，最后?个值

select name,orderdate,cost,
 first_value(orderdate) over(partition by name order by
orderdate) as time1,
 last_value(orderdate) over(partition by name order by
orderdate) as time2
from t_order

排名函数

row_number从1开始，按照顺序，?成分组内记录的序列,row_number()的值
不会存在重复,当排序的值相同时,按照表中记录的顺序进?排列
RANK() ?成数据项在分组中的排名，排名相等会在名次中留下空位
DENSE_RANK() ?成数据项在分组中的排名，排名相等会在名次中不会留下空位

row_number()：没有并列，相同名次依顺序排
rank()：有并列，相同名次空位
dense_rank()：有并列，相同名次不空位

create table if not exists stu_score(
dt string,
name string,
score int
)
row format delimited
fields terminated by ',';
load data local inpath '/opt/data/stu_score.txt' overwrite into
table stu_score;

----
select dt,name,score,
-- 没有并列，相同名次依顺序排
row_number() over(distribute by dt sort by score desc) rn,
-- rank()：有并列，相同名次空位
rank() over(distribute by dt sort by score desc) rn,
-- dense_rank()：有并列，相同名次不空位
dense_rank() over(distribute by dt sort by score desc) rn
from stu_score;

?定义函数

hive提供的自定义函数
UDF:user define function: 用户自定义函数，一行输入，一行输出。，类似普通函数入size，length
UDAF:user define aggregate function 用户自定义聚合函数，多行输入，一行输出，类似sum
UDTF:user define table-generate function 用户自定义表生成函数，一行输入，多行输出，类似炸裂

udf案例
继承udf
重写evaluate（）方法，该方法允许重载

package hive;
?
import org.apache.commons.lang.StringUtils;
?
/*
第一个自定义函数
 */
public class FirstFun extends UDF {
 ? ?//重写evaluate，名字不能错
 ? ?//方法返回值的类型
 ? ?//String a,String b方法的输入参数和类型
 ? ?public String evaluate(String a,String b){
 ? ? ? ?if(StringUtils.isEmpty(a)){
 ? ? ? ? ? ?return null;
 ? ? ?  }
 ? ? ? ?if(StringUtils.isEmpty(b)){
 ? ? ? ? ? ?return null;
 ? ? ?  }
 ? ? ? ?//数据正常
 ? ? ? ?
 ? ? ? ?return a+"_"+b;
 ?  }
 ? ?//重载
 ? ?public String evaluate(String a){
 ? ? ? ?if(StringUtils.isEmpty(a)){
 ? ? ? ? ? ?return null;
 ? ? ?  }
 ? ? ? ?
 ? ? ? ?//数据正常
?
 ? ? ? ?return a;
 ?  }
}
?

add jar /home/qf-2022-1.0.jar;
create temporary function my concat as "com.al.hive.FirstFun"

desc function myconcat;
select myconcat('cc','ad')

根据出生年月计算生日

package hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;

import java.util.Calendar;
import java.util.Date;

/*
根据年龄算生日
2000-07-11 21
2000-07-13 20
 */
public class BirthdayToAge extends UDF {
    int age=0;
    public int evaluate(String bithday){
        if (StringUtils.isEmpty(bithday)){
            return age;
        }
        //获取前的日期
        Calendar instance = Calendar.getInstance();
        instance.setTime(new Date());
        int nowYear = instance.get(Calendar.YEAR);
        int nowMonth = instance.get(Calendar.MONTH)+1;
        int nowDay = instance.get(Calendar.DAY_OF_MONTH);
        System.out.println(nowYear+" "+nowMonth+" "+nowDay);

        //获取生日的年月
        String[] split = bithday.split("-");
        int birthYear=Integer.parseInt(split[0]);
        int birthMonth=Integer.parseInt(split[1]);
        int birthDay=Integer.parseInt(split[2]);
        age=nowYear-birthYear;
        //然后比较
        if (nowMonth<birthMonth) {
            age-=1;
        } else if (nowMonth==birthMonth&&nowDay<=birthDay) {
            age-=1;

        }

        return age;
    }

    public static void main(String[] args) {
        System.out.println(new BirthdayToAge().evaluate("2000-07-11"));
        System.out.println(new BirthdayToAge().evaluate("2000-07-12"));
        System.out.println(new BirthdayToAge().evaluate("2000-07-13"));
    }
}

根据key找value

数据：
name==zs&age=18&sex=1&high=180
k2v("high") 180
k2v("age") 18
package hive;
?
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
?
import java.util.Calendar;
import java.util.Date;
?
/*
key 找value
找不到返回null
name==zs&age=18&sex=1&high=180
k2v("high") 180
k2v("age") 18
 */
public class Key2Value extends UDF {
 ? ?public String evaluate(String input,String key){
 ? ? ? ?if (StringUtils.isEmpty(input)||StringUtils.isEmpty(key)){
 ? ? ? ? ? ?return null;
 ? ? ?  }
 ? ? ? ?String[] split = input.split("&");
 ? ? ? ?for (String kv : split) {
 ? ? ? ? ? ?String[] kvs = kv.split("=");
 ? ? ? ? ? ?if(kvs[0].equals(key)){
 ? ? ? ? ? ? ? ?return kvs[1];
 ? ? ? ? ?  }
 ? ? ?  }
 ? ? ? ?return null;
 ?  }
?
 ? ?public static void main(String[] args) {
 ? ? ? ?System.out.println(new Key2Value().evaluate("name==zs&age=18&sex=1&high=180","sex1"));
 ?  }
}
?

json数据

{"name":"zs","age":18,"sex":1,"high":180}
json_value("high") 180
json_value("age") 18
?
package hive;
?
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONObject;
?
/*
key 找value
{"name":"zs","age":18,"sex":1,"high":180}
json_value("high") 180
json_value("age") 18
 */
public class Json2Value extends UDF {
 ? ?public String evaluate(String input,String key){
 ? ? ? ?if (StringUtils.isEmpty(input)||StringUtils.isEmpty(key)){
 ? ? ? ? ? ?return null;
 ? ? ?  }
 ? ? ? ?String res;
 ? ? ? ?try {
 ? ? ? ?//将字符串转换成json对象
 ? ? ? ?JSONObject jsonObject=new JSONObject(input);
 ? ? ? ?res = jsonObject.get(key).toString();
?
?
 ? ? ?  }catch (Exception e){
 ? ? ? ? ? ?//json解析任何异常返回空
 ? ? ? ? ? ?return null;
 ? ? ?  }
 ? ? ? ?return res;
 ?  }
?
 ? ?public static void main(String[] args) {
 ? ? ? ?System.out.println(new Json2Value().evaluate("{\"name\":\"zs\",\"age\":18,\"sex\":1,\"high\":180}","sex"));
 ?  }
}
?

udtf

把"k1:v1;k2:v2;k3:v3"类似的的字符串解析成每??多?,每??按照key:value
kv2clo("k1:v1;k2:v2;k3:v3")
k1:v1
k2:v2
k3:v3
package hive;
?
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
?
import java.util.ArrayList;
import java.util.List;
?
/*
表生成函数
 */
public class MyUDTF extends GenericUDTF {
 ? ?@Override
 ? ?public StructObjectInspector initialize(ObjectInspector[]
 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?argOIs) throws UDFArgumentException {
 ? ? ? ?//定义要输出列的名字的List,并且添加要输出的列名
 ? ? ? ?List<String> structFieldNames = new ArrayList<>();
 ? ? ? ?structFieldNames.add("key");
 ? ? ? ?structFieldNames.add("value");
// 定义要输出列的类型的List,并且添加要输出列的类型
 ? ? ? ?List<ObjectInspector> objectInspectorList = new
 ? ? ? ? ? ? ? ?ArrayList<>();
?
 ? ? ? ?Object ingObjectInspector;
 ? ? ? ?objectInspectorList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
?
 ? ? ? ?objectInspectorList.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
 ? ? ? ?return
 ? ? ? ? ? ? ? ?ObjectInspectorFactory.getStandardStructObjectInspector(structFieldNames, objectInspectorList);
 ?  }
 ? ?/*
 ? ?核心处理方法
 ? ? */
 ? ?@Override
 ? ?public void process(Object[] objects) throws HiveException {
 ? ? ? ?// 得到第?个参数,转化为字符串,类似于->name:zhang;age:30;address:shenzhen
 ? ? ? ?String insputString = objects[0].toString();
 ? ? ? ?// 把上述例?字符串按照分号;切分为数组
 ? ? ? ?String[] split = insputString.split(";");
 ? ? ? ?// s=name:zhang
 ? ? ? ?for (String s : split) {
 ? ? ? ? ? ?// 把每个切分后的key value分开
 ? ? ? ? ? ?String[] kvArray = s.split(":");
 ? ? ? ? ? ?// 如果产?多列可以将多个列的值放在?个数组中，然后将该数组传?到forward()函数。
 ? ? ? ? ? ?forward(kvArray);
 ? ? ?  }
?
 ?  }
 ? ?//关闭对象
 ? ?@Override
 ? ?public void close() throws HiveException {
?
 ?  }
}
?

hive>add jar /home/包名.jar
hive>create temporary function 自定义函数名 as '包路径.方法名'
hive>select 自定义函数名("参数")

UDAF案例

?定义函数来历

hive的内置函数满?不了所有的业务需求。
hive提供很多的模块可以?定义功能，?如：?定义函数、serde、输?输出格式等。

?定义函数分类

1. UDF：?户?定义函数，user defined function。?对?的输?输出。（最常?
的）。
2. UDTF：?户?定义表?成函数。user defined table-generate function.?对多
的输?输出。lateral view explode
3. UDAF：?户?定义聚合函数。user defined aggregate function。多对?的输
?输出 count sum max。

?定义函数实现

UDF格式

在pom.xml,加?以下maven的依赖包请查看 code/pom.xml

<dependency>
 <groupId>org.apache.hive</groupId>
 <artifactId>hive-exec</artifactId>
 <version>2.3.7</version>
</dependency>

定义UDF函数要注意下??点:

1. 继承 org.apache.hadoop.hive.ql.exec.UDF
2. 重写 evaluate ()，这个?法不是由接?定义的,因为它可接受的参数的个数,数据
类型都是不确定的。Hive会检查UDF,看能否找到和函数调?相匹配的evaluate()
?法

?定义函数第?个案例

public class FirstUDF extends UDF {
 public String evaluate(String str){
 //关于默认输出值是null，还是“”，这个要看需求具体定义，在这?先默
认定义为null，
 String result = null;
 //1、检查输?参数
 if (!StringUtils.isEmpty(str)){
 result = str.toUpperCase();
 }
 return result;
 }
 //调试?定义函数
 public static void main(String[] args){
 System.out.println(new
FirstUDF().evaluate("leetomedu"));
 }
}

函数加载?式

命令加载

这种加载只对本session有效

# 将编写的udf的jar包上传到服务器上.
# 并且将jar包添加到hive的class path中
# 进?到hive客户端,执?下?命令
 hive> add jar /opt/jar/udf.jar

-- 创建?个临时函数名,要跟上?hive在同?个session??：
hive> create temporary function toUP as 'com.qf.hive.FirstUDF';
-- 检查函数是否创建成功
hive> show functions;
-- 测试功能
hive> select toUp('abcdef');
-- 删除函数
hive> drop temporary function if exists toUP;

启动参数加载

(也是在本session有效，临时函数)

# 1、将编写的udf的jar包上传到服务器上
# 2、创建配置?件
[root@leetom01 hive]# vi ./hive-init
# 加?下?脚本
add jar /opt/jar/udf.jar;
create temporary function toup as 'com.qf.hive.FirstUDF';

# 3、启动hive的时候带上初始化?件：
 [root@leetom01 hive]# hive -i ./hive-init
 hive> select toup('abcdef')

配置?件加载

通过配置?件?式这种只要?hive命令?启动都会加载函数

# 1、将编写的udf的jar包上传到服务器上
# 2、在hive的安装?录的bin?录下创建?个配置?件，?件名：.hiverc
[root@leetom01 hive]# vi ./bin/.hiverc
add jar /hivedata/udf.jar;
create temporary function toup as 'com.qf.hive.FirstUDF';
3、启动hive
 [root@leetom01 hive]# hive

UDTF格式

UDTF是?对多的输?输出,实现UDTF需要完成下?步骤
1. 继承 org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
2. 重写 initlizer() 、 process() 、 close() 。
执?流程如下:
UDTF?先会调?initialize?法，此?法返回UDTF的返回?的信息（返回个数，
类型）。
初始化完成后，会调?process?法,真正的处理过程在process函数中，在
process中，每?次 forward() 调?产???；如果产?多列可以将多个列的值
放在?个数组中，然后将该数组传?到forward()函数。
最后 close() ?法调?，对需要清理的?法进?清理。

需求

把"k1:v1;k2:v2;k3:v3"类似的的字符串解析成每??多?,每??按照key:value
格式输出

源码

?定义函数如下

package com.qf.hive;
/**
* 定义?个UDTF的Hive?定义函数(?对多),默认要继承与GenericUDTF
*/
public class ParseMapUDTF extends GenericUDTF {
 //在initializez中初始化要输出字段的名称和类型
 @Override
 public StructObjectInspector initialize(ObjectInspector[]
argOIs) throws UDFArgumentException {
 //定义要输出列的名字的List,并且添加要输出的列名
 List<String> structFieldNames = new ArrayList<>();
 structFieldNames.add("key");
 structFieldNames.add("value");
// 定义要输出列的类型的List,并且添加要输出列的类型
 List<ObjectInspector> objectInspectorList = new
ArrayList<>();
 
objectInspectorList.add(PrimitiveObjectInspectorFactory.javaStr
ingObjectInspector);
 
objectInspectorList.add(PrimitiveObjectInspectorFactory.javaStr
ingObjectInspector);
 return
ObjectInspectorFactory.getStandardStructObjectInspector(structFi
eldNames, objectInspectorList);
 }
 // process?法?来处理输?的每?数据,每?数据处理调??次process,类
似于Mapper中的map?法
 @Override
 public void process(Object[] objects) throws HiveException {
// 得到第?个参数,转化为字符串,类似于-> 
name:zhang;age:30;address:shenzhen
 String insputString = objects[0].toString();
// 把上述例?字符串按照分号;切分为数组
 String[] split = insputString.split(";");
// s=name:zhang
 for (String s : split) {
// 把每个切分后的key value分开
 String[] kvArray = s.split(":");
 // 如果产?多列可以将多个列的值放在?个数组中，然后将该数组传?
到forward()函数。
 forward(kvArray);
 }
 }
 @Override
 public void close() throws HiveException {
 }
}

打包加载

对上述命令源?件打包为udtf.jar,拷?到服务器的/opt/jar/?录

在Hive客户端把udf.jar加?到hive中,如下:

hive> add jar /opt/jar/udtf.jar;

创建临时函数

在Hive客户端创建函数:

# 创建?个临时函数parseMap
hive> create temporary function parseMap as
'com.qf.hive.ParseMapUDTF';
# 查看函数是否加?
hive> show functions ;

测试临时函数

hive> select parseMap("name:zhang;age:30;address:shenzhen");

UDAF格式

?户?定义聚合函数。user defined aggregate function。多对?的输?输出 count
sum max。定义?个UDAF需要如下步骤:
1. UDF?定义函数必须是org.apache.hadoop.hive.ql.exec.UDAF的?类,并且包含
?个有多个嵌套的的实现了org.apache.hadoop.hive.ql.exec.UDAFEvaluator的
静态类。
2. 函数类需要继承UDAF类，内部类Evaluator实UDAFEvaluator接?。
3. Evaluator需要实现 init、iterate、terminatePartial、merge、terminate这?
个函数
这?个函数作?如下:

init 实现接?UDAFEvaluator的init函数
iterate 每次对?个新值进?聚集计算都会调?,计算函数要根据计
算的结果更新其内部状态，,map端调?
terminatePartial ?参数，其为iterate函数轮转结束后，返回轮转数
据，,map端调?
merge
接收terminatePartial的返回结果，进?数据merge操作，
其返回类型为boolean。
terminate 返回最终的聚集函数结果。 在reducer端调?

需求

计算?组整数的最?值

代码

package com.qf.hive;
/**
* 定义?个UDAF?定义函数类,默认要继承于UDAF类
*/
//给当前函数添加描述信息,?便在desc function?法时查看
@Description(name="maxInt",value = "Find Max Value" ,extended =
"Extended:Find Max Value for all Col")
public class MaxValueUDAF extends UDAF {
 //UDAF要求 并且包含?个或者多个嵌套的的实现了
 // org.apache.hadoop.hive.ql.exec.UDAFEvaluator的静态类。
 public static class MaxnumIntUDAFEvaluator implements
UDAFEvaluator {
 //在静态类内部定义?个返回值,作为当前UDAF最后的唯?返回值,因为返
回值要在hive调?,所以必须要使?序列化类型
 private IntWritable result;
 /**
 * 在初始化是把返回值设为null,避免和上次调?时混淆
 */
 @Override
 public void init() {
 result=null;
 }
 //定义?个函数iterate?来处理遍历多?时,每?值传进来是调?的函数
 public boolean iterate(IntWritable value) {
// 把遍历每?的值value传?,和result?较,如果?result?,那么
result就设置为value,否则result不变
 if (value == null) {
 return true;
 }
 //如果是第??数据,那么直接给result赋值为第??数据
 if (result == null) {
 result = new IntWritable(value.get());
 } else {
// 给result赋值result和value之间的最?值
 result.set(Math.max(result.get(),value.get()));
 }
 return true;
 }
 /**
 * 在map端进?并?执?后的结果
 * @return
 */
 public IntWritable terminatePartial() {
 return result;
 }
 /**
 * 接收terminatePartial的返回结果，进?数据merge操作，其返回类
型为boolean。
 * @param other
 * @return
 */
 public boolean merge(IntWritable other) {
 return iterate( other );
 }
 /**
 * 将最终的结果返回为Hive
 * @return
 */
 public IntWritable terminate() {
 return result;
 }
 }
}

注意:如果你要给??写的函数加上desc function后的说明,可以在?定义函数类上?加上下?的注解:

@Description(name = "maxValue",value = "Find Max Value",extended
= "Extended:Find Max Value for all Col")
hive> desc function maxInt;
hive> desc function extended maxInt;

打包加载

对上述命令源?件打包为udaf.jar,拷?到服务器的/opt/jar/?录

在Hive客户端把udf.jar加?到hive中,如下

hive> add jar /opt/jar/udaf.jar;

创建临时函数

在Hive客户端创建函数:

hive> create temporary function maxInt as
'com.qf.hive.MaxValueUDAF';
# 查看函数是否加?
hive> show functions ;

测试临时函数

-- 使?前?的任意?个有int类型字段的表进?测试
hive> select maxInt(id) from dy_part1;

导?导出

数据导?

从本地?件系统中导?hive表
从hdfs?件系统中导?hive表
从hive的?个表中导?到另?个表
直接将数据copy到hive表?录
location
克隆带数据
多表导?
CTAS(create table as select)
注,前?6中在Hive基础第?章中已经有演示,这?重点是演示第七条多表导?

创建数据源表

# 注意:hive不允许局部数据操作(增、删、改)。
CREATE TABLE text1(
uid int,
uname string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

数据内容

01,user1
02,user2
03,user3
04,user4
05,user5
06,user6
07,user7
08,user8

加载数据

load data local inpath '/opt/data/text1.csv' overwrite into table
text1;

创建两个?标表(字段可以不?样):

create table text2 (
uid int,
uname string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
create table text3 (
uname string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

从?个源表,导?两个?标表:

#:?次扫描源表，多次导?
from text1
insert into text2
select uid,uname
where uid < 2
insert into text3
select uname;

数据导出

导出分类

hive数据导出主要分为下?三种：
1. 从hive表中导出本地?件系统中(?录、?件)
2. 从hive表中导出hdfs?件系统中
3. hive表中导出到其它hive表中

导出到本地?录

-- 1、导出到本地?录
insert overwrite local directory '/hivedata/'
select * from text1;
-- 关键字：insert overwrite directory

导出到本地?件

[root@leetom01 hive]# hive -e 'select * from text1' >>
/hivedata/02.txt;
-- 如果对hive -e命令不熟悉 ,可以?hive -help查看帮助

导出到HDFS?录

insert overwrite directory '/hivedata/'
select * from text1;

注意:在?件读取/解析的?式指定ROW FORMAT

[ROW FORMAT row_format]

总述：HIVE?了两个类去读数据

?个类?于从?件中读取?条?条的记录（可能是??，可能是xml?件中的?个完整标
签）
?个类?于从上?读到的记录中切分出?个?个的字段（可能简单地按照分隔符切，也可
以对复杂结构进??定义切）

ROW FORMAT示例

ROW FORMAT :?什么INPUTFORMAT去读数据
DELIMITED :?普通的org.apache.hadoop.mapred.TextInputFormat去读数据
?，以回?符作为?分割
FIELDS TERMINATED BY ',' : 表示?什么SerDe类去解析??中的数据，默认?
#org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

序列化和反序列化

序列化

1. 序列化(serialize)是对象转化为字节序列的过程；
2. 反序列化(deserialize)是字节码恢复为对象的过程；
序列化的作?主要有两个
1. 对象的持久化；即把对象转换成字节码后保存?件；
2. 对象数据的传输；
Serializer/Deserializer合起来的缩写也叫做SerDe. SerDe允许Hive读取表中的数
据，并将其以任何?定义格式写回HDFS。 任何?都可以为??的数据格式编写??
的SerDe。

常?Serder

csv 、tsv 、json serder 、 regexp serder 等。
csv : 逗号分隔值
tsv : tab 分隔值
json : json格式的数据
regexp : 数据需要复合正则表达式

CSVSerde

在创建表的时候,如果数据是CSV格式,那么不???去指定列分隔符,直接? CSVSSerde即可,语法如下:

创建CSVSerde格式的表：

create table if not exists csv1(
uid int,
uname string,
age int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
stored as textfile;

直接加载数据到表中(不?指定列分隔符)

具体数据在data/cvs1.csv中
load data local inpath '/opt/data/csv1.csv' into table csv1;

也可以创建?个看指定csv格式分隔符的CSVSerde表

create table if not exists csv3(
uid int,
uname string,
age int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"qutoeChar"="'",
"escapeChar"="\\"
)
stored as textfile;

Json Serde

如果数据格式是Json格式,如果按照常规的思路我们要使??定义函数去解析, 那么在
这?使?JSonSerde就可以直接读取Json格式的?件,不需要做转化

1.如果是第三?jar包或者是??写的，就必须要先加载jar包：
(在?盘的?数据上课软件库2021中:链接:https://pan.baidu.com/s/1xG-c9wOyTmGeuEYAEh0XZg 提取码:chbg)
hive> add jar /hivedata/json-serde-1.3.8-jar-with-dependencies.jar;
hive> add jar /hivedata/json-udf-1.3.8-jar-with-dependencies.jar;
2.将包导?到hadoop的这个?录下:hadoop/share/hadoop/mapreduce/
3.同时导?到hive的lib?录下
4.执?set hive.exec.compress.output=false;
注意:如果使?的包不对(包括版本),或者执??式不对,有可能出现下?的错误
execution error,return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
org/apache/hadoop/hive/serde2/SerDe
?

create table if not exists js1(
pid int,
content string
)
row format serde "org.openx.data.jsonserde.JsonSerDe";
{"pid":1,"content":"this is pid of 1 content"}
{"pid":2,"content":"this is pid of 2 content"}
load data local inpath '/opt/data/js.json' into table js1;
select *
from js1;

# 复杂数据类型嵌套案例：
create table if not exists complex(
uid int,
uname string,
belong array<String>,
tax map<String,array<double>>
)
row format serde "org.openx.data.jsonserde.JsonSerDe";
{"uid":1,"uname":"zs","belong":["zs1","zs2","zs3"],"tax":
{"shebao":[220,280,300],"gongjijin":[600,1200,2400]}}
{"uid":2,"uname":"ls","belong":["ls1","ls2","ls3"],"tax":
{"shebao":[260,300,360],"gongjijin":[800,1600,2600]}}
load data local inpath '/opt/data/complex.json' into table
complex;
select *
from complex;
select
c.*
from complex c
where size(c.belong) = 3
and c.tax["gongjijin"][1] > 1200;

Hive的分隔符

在Hive中，建表时?般?来来指定字段分隔符和列分隔符。?般导?的?本数据字段分隔符多为逗号分隔符或者制表符（但是实际开发中?般不?着这种容易在?本内容中出现的的符号作为分隔符），当然也有?些别的分隔符，也可以?定义分隔符。有时候也会使?hive默认的分隔符来存储数据。在建表时通过下?语法来指定：

ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' -- 指定列分隔符
 LINES TERMINATED BY '\n' -- 指定?分隔符

默认情况下?分隔符是\n, 列分隔符?般是不容易在?本中出现的字符,开发中?部分 ?TAB,但有时候TAB也有可能重复,这时候可以使?Hive的默认分隔符的.

hive中默认使?^A(ctrl+A)作为列分割符，如果?户需要指定的话，等同于row format delimited fields terminated by '\001'，因为^A?进制编码体现为''\001'.所以如果使?默认的分隔符，可以什么都不加，也可以按照上?的指定加‘\001’为列分隔符，效果?样。

hive默认使?的?分隔符是'\n'分隔符，默认是通过row format delimited fields terminated by '\t'这个语句来指定不同的分隔符，但是hive不能够通过LINES TERMINATED BY '$$'来指定?分隔符，?前为?，hive的默认?分隔符仅?持‘\n’字符。

Hive?件存储

textfile 普通文本文件存储，不压缩

1hive默认格式，存储方式行存储

使用gzipbzip2压缩，压缩后不能split

sequencefile提供二进制存储，本事就压缩，不能使用load方式加载

以kv形式序列化到文件，行存储

none record block三种压缩方式record压缩率低，一般用block压缩

优势是文件和hadoopapi中的mapfile是相互兼容的

rcfile：rowcolumn hive提供行列混合存储，hive在该格式下降会尽量把附近的列和行的快尽量存储在一起仍然压缩，查询率较高

存储方式：数据按行分块，每块按列存储，结合了行存储和列存储的优点

首先，rcfile保证同一行的数据唯一同一节点，因此元祖重构的开销很低

其次，像列存储一样，rcfile能够利用列维度的数据压缩，并且能跳过不必要的列读取

orcfile优化后的rcfile

存储方式：数据按行分块，每块按列存储

压缩快：快速列存取

效率必rcfile是rcfile的改良版本

parquet 列式存取能够很好地压缩，有很好的查询性能（查询多列），支持有限的模式演进，但是写的速度比较慢这种文件方式主要是用cloudera impala hive spark上面的

数据仓库中越是底层的表月使用parquet或者orcfile等格式

文件格式例子

<property>
<name>hive.default.fileformat</name>
<value>TextFile</value>
<description>
</property>

使用文件格式

textfile:数据量较小（基础信息表，码表）

orc：数据量比较大（行为数据）

parquet：数据量大的时候可存储更多面向列查询（查询性能较好一些）（行为数据）

存储分类

Hive默认的数据?件存储格式为：TextFile
1. TextFile:普通的?本?件存储，不压缩。
（1）hive数据表的默认格式，存储?式：?存储 。
（2） 可使?Gzip,Bzip2等压缩算法压缩,压缩后的?件不?持split
2. Sequencefile:hive为?户提供的?进制存储，本身就压缩。不能?load?式加
载数据
（1）Hadoop API提供的?种?进制?件，以<key,value>的形式序列化到?件
中。存储?式：?存储。
（2）?持三种压缩选择：NONE，RECORD，BLOCK。Record压缩率低，?般
建议使?BLOCK压缩。
（3）优势是?件和hadoop api中的MapFile是相互兼容的
3. RCFile:hive提供?列混合存储，hive在该格式下，将会尽量把附近的?和列的块
尽量存储到?起。仍然压缩，查询效率较?。
（1）存储?式：数据按?分块，每块按列存储。结合了?存储和列存储的优
点：
（2）?先，RCFile 保证同??的数据位于同?节点，因此元组重构的开销很低
（3）其次，像列存储?样，RCFile 能够利?列维度的数据压缩，并且能跳过不
必要的列读取
4. ORCFile:优化后的rcfile?件
(1）存储?式：数据按?分块 每块按照列存储
(2）压缩快 快速列存取
(3) 效率?rcfile?,是rcfile的改良版本
5. Parquet:能够很好的压缩，有很好的查询性能，?持有限的模式演进。但是写
速度通常?较慢。这中?件格式主要是?Cloudera Impala上?的。

属性设置

hive中默认的?件类型配置?件TextFile

<name>hive.default.fileformat</name>
<value>TextFile</value>
<description>
 Expects one of [textfile, sequencefile, rcfile, orc].
 Default file format for CREATE TABLE statement. Users can
explicitly override it by CREATE TABLE ... STORED AS [FORMAT]

创建?个?本格式类型的表text1

drop table if exists text1;
create table if not exists text1(
uid int,
uname string
)
row format delimited fields terminated by ',';

加载数据如下(data/text1.txt):

01,user1
02,user2
03,user3
04,user4
05,user5
06,user6
07,user7
08,user8

-- 加载数据
load data local inpath '/opt/data/text1.csv' into table text1;

创建sequencefile

reate table if not exists seq1(
uid int,
uname string
)
row format delimited fields terminated by ','
stored as sequencefile;
-- 该?式不?：
load data local inpath '/opt/data/seq.csv' into table seq1;
-- 使?以下?式：
insert into table seq1
select uid,uname from text1;

创建rcfile

create table if not exists rc1(
uid int,
uname string
)
row format delimited fields terminated by ' '
stored as rcfile;
该?式不?：
load data local inpath '/hivedata/seq1' into table rc1;
使?inset into?式：
insert into table rc1
select uid,uname from text1;
create table seq2(
movie string,
rate string,
times string,
uid string
)
row format delimited fields terminated by ','
stored as sequencefile;
create table if not exists rc2(
movie string,
rate string,
times string,
uid string
)
row format delimited fields terminated by ','
stored as rcfile;
create table if not exists movieRate(
 movie string,
 rate string,
 times string,
 uid string
)
 row format delimited fields terminated by ',';
load data local inpath '/opt/data/movie.txt' into table
movieRate;
from movieRate
insert into table seq2
select *
insert into table rc2
select *;
-- 综合效率：是defaultCodec+rcfile较好

Hive视图

-- 创建试图（cvas）：
create view if not exists view1
as
select uid from text1 where uid <5;
-- 查看视图
show views;
show tables;
?
?
-- 查看视图
desc 视图名；
desc extended 视图名；
show create table 视图名；
--删除视图
drop view if exists 视图名;
注意
切记先删除视图对饮的表再查询视图
视图不能用insertinto或者load加载数据
视图是只读不能修改其结构，表相关属性
?
?

1. hive的视图简单理解为逻辑上的表
2. hive现?前只?持逻辑视图，不?持物化视图。

hive的视图意义

1. 对数据进?局部暴露(涉及隐私数据不暴露)。
2. 简化复杂查询。

创建视图

create view if not exists view1
as
select uid from movieRate where uid < 5;

查看视图

show tables;
show create table view1;
desc view1;

克隆视图

视图是否可以克隆：(hive-1.2.1暂时不?持)
create view view2 like view1;-- 不?持
create view if not exists v1 as select * from text1;

删除视图

drop view if exists v1;
# 注意：
#1、切忌先删除视图对应的表后再查询视图。
#2、视图是不能?insert into 或者load ?式来加载数据。
#3、视图是只读，不能修改其结构、表相关属性。

Hive的?志

默认日志在根目录下/tmp/root下面

?志是任何项?中调试或者排错的最重要的?式,所以知道?志的基本原理和基本配
置,对于调试代码有?常?的帮助.
Hive的?志默认是Log4j来实现,如果需要更详细的配置,也可以查看下hive/conf/中的
Log4j.properties 配置?件,可以做更详细的配置.
在hive/conf下?有?个?志类型的模版配置?件,默认都加了template后缀,去掉这个
后缀就??配置并有效.

?志分类

Hive的?志分为两类: 1. 系统?志，记录了hive的运来?情况，错误状况。 2. 查询?志(Job?志)，记录了?Hive 中job的执?的历史过知程

?志配置

修改?志的配置需要修改下?两个配置属性即可

hive的系统?志：

默认?录：/tmp/{user.name}
hive.log.dir={java.io.tmpdir}/{user.name}
hive.log.file=hive.log

hive的查询?志：

<name>hive.querylog.location</name>
 <value>{system:java.io.tmpdir}/${system:user.name}</value>
 <description>Location of Hive run time structured log
file</description>

Hive的压缩

压缩的优势

可以节约磁盘的空间，基于?本的压缩率可达40%+;

压缩可以增加吞吐量和性能量(减?载?内存的数据量),提??络传输性能

压缩注意事项

压缩?般?来处理IO密集型业务
压缩?般会占?CPU资源,所以不适合处理CPU密集型业务

压缩参数设值

因为Hive底层本质上是运?MapReduce程序,所以Hive的压缩其实类似前?讲过的 MapReduce程序压缩.下?是?部分参数设值

hive的map阶段压缩：
set hive.exec.compress.output=false;
set hive.exec.compress.intermediate=false;
hive.intermediate.compression.codec
hive.intermediate.compression.type
hive的reduce阶段压缩：
set hive.exec.compress.output=false;
set hive.exec.compress.intermediate=false;
hive.intermediate.compression.codec
hive.intermediate.compression.type

Hive的运??式

Hive的属性设置

1、hive-site.xml
2、hive通过命令?参数设置
3、hive通过cli端set设置

三种设置?式的区别

1、属性优先级别从上往下?次升?。
2、hive-site.xml是全局和永久的，其它两是临时和局部。
3、hive-site.xml适合所有属性配置，?后两个对于系统级别的属性不能配置。
?如启动所需的元数据库url、log配置等。

Hive有四类属性

1. hiveconf：可读可写 一般指hive 的默认配置文件中出现过的或者hive依赖的组件的默认配置
2. hivevar：?定义临时变量，可读可写，自己根据sql需要而定义的
3. system：可读可写
4. env：可读不可写

hivevar rw ?户?定义变量
hiveconf rw hive相关配置属性
system rw java定义的配置属性
env r Shell环境属性

--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to
hive
 commands. e.g. --hivevar A=B
hive -e #?次性执?命令
hive -f #执?sql
hive -S #静?模式
hive -i #初始化脚本
hive>set -v; ##查看hive相关的环境变量
hive -e "set" | grep current ##单个查找
hive -e "set" ##查找所有
-- 案例
hive --hivevar -limt=2 "select * from test.text1 limit ${hivevar:limt}"
--cli此时
set env:PWD;
set hivevar:abc='234';
set hivevar;
?
hive --hivevar limt=50 -f ./hi.hql
select * from test.text1 limit ${hivevar:limt}

Hive三种运??式

可以?三种?式执?HQL语句,并且都?在不同的场景下,三种如下:
1. 在hive的cli端运?：(开发测试，以及临时跑作业)
2. 通过命令? hive -e 'sql query';
3. 通过命令? hive -f /hql?件 (?产线)
下?是Hive的演示具体案例

#1 hive中直接执?
hive> select * from text1
#2.通过-e?式单次执?Sql语句
hive --database qf1701 -e 'select * from text1';
hive --database qf1701 --hivevar ls=2 --hiveconf tn=text1 -e
'select * from ${hiveconf:tn} limit ${hivevar:ls}';
#3.通过-f可以指定具体的要执?的Sql?件
hive -S --hivevar mapoutputdir=/home/hivedata/out/05 --hivevar
textoutdir=/home/hivedata/out/06 --hivevar limit=1 -f ./hql

三种?式注意事项：
1. ?个--hivevar 或者 --hiveconf 只能带?个参数
2. --hiveconf 或者 --hivevar 可以混合使?
3. --hiveconf 或 --hivevar 定义参数不能取消
?知识点:通过命令?beeline连接hive
1.打开hiveserver2服务器
2.给hadoop设置本地访问权限
将下?的配置添加到/hadop/etc/hadoop/core-site.xml

<property>
 <name>hadoop.proxyuser.root.hosts</name>
 <value>*</value>
 </property>
 <property>
 <name>hadoop.proxyuser.root.groups</name>
 <value>*</value>
</property>

3.执?beeline进?客户端
4.执? jdbc:hive2://leetom01:10000 连接服务器
5.正常的访问服务器

Hive企业级调优

调优原则已经在MR优化阶段已经有核?描述,优化Hive可以按照MR的优化思路
来执?

优化的主要考虑??:
1. 环境??：服务器的配置、容器的配置、环境搭建
2. 具体软件配置参数：
3. 代码级别的优化
?
?
代码级别优化
explain 和 explain extended
explain只对sql进行解释
explain extended除了解释还要生成语义树
explain extended
跟sql代码
主要看出：
执行stage是否多，多个之间知否完全依赖，尽量stage少且依赖少
查看首表的之心表名，如果不符合你的预期，就要修改sql
?

join优化

小表驱动大表
尽量用跟多的条件将无效数据过滤，不要参与join
注意hive 是否配置普通join转换成mapjoin，以及mapjoin小标文件大小的阀值设置

调优的主要原则:
20/80原则?常重要,简单的说80%的程序执?时间花费在20%的代码上,所以重点就先
要找到这关键的20%代码做重点优化.

explain 和 explain extended

explain ： 只有对hql语句的解释。 explain extended：对hql语句的解释，以及抽
象表达式树的?成。
当执?上?两个语句时.?般会分为多个stage,
stage 相当于?个job，?个stage可以是limit、也可以是?个?查询、也可以是
group by等。 hive默认?次只执??个stage，但是如果stage之间没有相互依
赖，将可以并?执?。 任务越复杂，hql代码越复杂，stage越多，运?的时间
?般越?。
explain代码测试:

explain select * from text1;
explain extended select * from text1;
explain extended
select
d.deptno as deptno,
d.dname as dname
from dept d
union all
select
d.dname as dname,
d.deptno as deptno
from dept d;

Join优化

hive的查询永远是?表(结果集)驱动?表(结果集)
hive中的on的条件只能是等值连接
注意:hive是否配置普通join转换成map端join、以及MapJoin?表?件??的阀值

limit的优化

LIMIT语句是?家经常使?到的，经常使?CLI的?户都会使?到。不过，在很多情况
下LIMIT语句还是需要执?整个查询语句，然后再返回部分结果的。因为这种情况通
常是浪费的，所以应该尽可能地避免出现这种情况。Hive 有?个配置属性可以开
启，当使?LIMTI语句时，其可以对源数据进?抽样:

hive.limit.optimize.enable=false (如果limit较多时建议开启)
hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.fetch.max=50000

本地模式

?多数的Hadoop Job是需要Hadoop提供的完整的可扩展性来处理?数据集的。不过，有时Hive的输?数据量是?常?的。在这种情况下，为查询触发执?任务的时间消耗可能会?实际job的执?时间要多得多。对于?多数这种情况，Hive 可以通过本地模式在单台机器上(或某些时候在单个进程中)处理所有的任务。对于?数据集，执 ?时间可以明显被缩短。 ?户可以按照如下这个例?中所演示的?式，在执?过程中临时启?本地模式

hive.exec.mode.local.auto=true
hive.exec.mode.local.auto.inputbytes.max=134217728
hive.exec.mode.local.auto.input.files.max=4

并?执?

Hive会将?个查询转化成?个或者多个阶段。这样的阶段可以是抽样阶段、合并阶段、limit 阶段，或者Hive执?过程中可能需要的其他阶段。默认情况下，Hive?次只会执??个阶段。

不过，某个特定的job可能包含众多的阶段，?这些阶段可能并?完全互相依赖的，也就是说有些阶段是可以并?执?的，这样可能使得整个job的执?时间缩短。不过，如果有更多的阶段可以并?执?，那么job可能就越快完成。通过设置参数 hive.exec.parallel值为true,就可以开启并发执?。不过，在共享集群中，需要注意下，如果job中并?执?的阶段增多，那么集群利?率就会增加:

默认 情况仅执行一个stage 如果stage之间没有依赖则一个job的多个stage可以并行执行
hive.exec.parallel=false (建议开启)
hive.exec.parallel.thread.number=8

严格模式

hive提供了?个严格模式，可以防??户执?那些可能产?意想不到的不好的影响的
查询。 通过设置属性hive.mapred.mode值为strict 可以禁?3种类型的查询。
1. 其?，对于分区表，除?WHEHRE语句中含有分区字段过滤条件来限制数据范
围,否则不允许执?。换句话说，就是?户不允许扫描所有分区。进?这个限制的
原因是，通常分区表都拥有?常?的数据集，?且数据增加迅速。没有进?分区
限制的查询可能会消耗令?不可接受的巨?资源来处理这个表
2. 其?，对于使?了ORDER BY语句的查询，要求必须使?LIMIT语句。因为
ORDER BY为了执?排序过程会将所有的结果数据分发到同?个reducer 中进?
处理，强制要求?户增加这个LIMIT语句可以防?reducer额外执?很??段时
间.
3. 其三，也就是最后?种情况，就是限制笛卡尔积的查询。对关系型数据库?常了
解的?户可能期望在执?JOIN查询的时候不使?ON语句?是使?WHERE语句，
这样关系型数据库的执?优化器就可以?效地将WHERE语句转化成那个ON语
句。不幸的是，Hive并不会执?这种优化，因此，如果表?够?，那么这个查询
就会出现不可控的情况,下?是系统中默认的严格模式要求:

hive.mapred.mode=nonstrict
?
如下的五类查询将会带来极大的风险，容易将资源耗尽
Cartesian Product.      --笛卡尔积
No partition being picked up for a query.  --分区表不带分区字段过滤数据
Orderby without limit.  --order by 不带limit
Comparing bigints and strings.  -- string不能和bigint作比较
Comparing bigints and doubles.  -- double不能和bigint作比较

mapper和reducer的个数

Hive通过将查询划分成?个或者多个MapReduce任务达到并?的?的。每个任务都可能具有多个mapper和reducer任务，其中?少有?些是可以并?执?的。确定最佳的mapper个数和reducer个数取决于多个变量，例如输?的数据量??以及对这些数据执?的操作类型等。 
保持平衡性是有必要的。如果有太多的mapper 或reducer 任务，就会导致启动阶段、调度和运?job过程中产?过多的开销;?如果设置的数量太少，那么就可能没有充分利?好集群内在的并?性.
不是mapper和redcuer个数越多越好，也不是越少越好。
?
mapper个数： 分片决定（分块->文件的大小或个数）一般不去手动设置mapper个数，但是全是小文件导致mapper偏多，需要提前或者是设置表小文件合并输入格式（hive.input.format=org.apach.hadoop.hive.ql.io.CombineHiveInputFormat）
也可以手动设置

?动设置：
set mapred.map.tasks=2;
reducer的个数(?动决定和?动设置)：
mapred.reduce.tasks=-1
hive.exec.reducers.max=1009
?
reduce阶段 group by、聚合
reducer个数：数据量决定，默认256M一个Reduce，默认一条hql最多1009个reducer
也可以手动设置
mapred.reduce.tasks=1; 根据分组分配到多个reducetask中，每个reduce几乎是均分组数据
如果发现reduce运行时间过长，可以手动设置多个reduce来处理

配置jvm重?

JVM重?是Hadop调优参数的内容，其对Hive的性能具有?常?的影响，特别是对于很难避免??件的场景或task特别多的场景，这类场景?多数执?时间都很短。 Hadop的默认配置通常是使?派?JVM来执?map和reduce任务的。这时JVM的启动过程可能会造成相当?的开销，尤其是执?的job包含有成百上千个task任务的情况。 JVM重?可以使得JVM实例在同?个job中重新使?N次。N的值可以在Hadoop 的 mapred-site.xml?件(位于HADOOP HOME/conf?录下)中进?设置:

?
每个jvm最多跑10个任务（默认是1个jvm跑一个task）
<property>
 <name>mapred.job.reuse.jvm.num.tasks</name>
 <value>10</value>
 <description>How many tasks to run per jvm. If set to -1，
there is nolimit.
 </description>
</property>

这个功能的?个缺点是，开启JVM重?将会-直占? 使?到的task资源，以便进?重 ?，直到任务完成后才能释放。如果某个“不平衡的"的job中有某?个reduce task执 ?的时间要?其他reduce task 消耗的时间多得多的话，那么保留的插槽就会?直空闲着却?法被其他的job使?，直到所有的task 都结束了才会释放. 这个功能的??个缺点是，开启JVM重?将会-直占? 使?到的task资源，以便进? 重?，直到任务完成后才能释放。如果某个“不平衡的"的job中有某?个reduce task 执?的时间要?其他rdcee task 消耗的时间多得多的话，那么保留的资源就会?直空闲着却?法被其他的job使?，直到所有的task 都结束了才会释放

mapreduce.job.jvm.numtasks=1 
mapred.job.reuse.jvm.num.tasks=1

数据倾斜

数据倾斜现象：
1、某个任务一直卡着不动。如果资源都没有申请到，这种不算倾斜，但是算资源不足
2、某个作业一直0%或者是99%等这样的现象。
?
数据倾斜造成原因
1、数据本身就倾斜（某一个key过多造成的，在join当中认为hive.skewjoin.key=1000000就是倾斜）
2、join或group by都容易造成
?
怎么解决数据的倾斜
1、设置对应属性
如果有groupby语句设置如下两个
hive.map.aggr=true
hive.groupby.skewindata=false(建议开启)
如果语句中出现join则可以设置如下属性
hive.optimize.skewjoin=false  （数据较大，建议开启）
?
2、如果是数据倾斜（抽样查询key）造成的，可以将倾斜的数据单独提取出来进行计算，然后再和非倾斜数据结果进行合并。
?
3、如果join的语句出现倾斜，你可以将出现的key加上一个范围的随机数，然后将连接的数量减少，避免倾斜，如果后续还有groupby则可以按照第二步做，大前提，不能影响业务

数据倾斜：由于key分布不均匀造成的数据向?个?向偏离的现象。譬如每个省的?
?数(?藏要?河南少很多),每天的交易数(双??就巨?)等
造成数据倾斜的?个原因
1. 本身数据就倾斜
2. join语句容易造成
3. count(distinct col) 很容易造成倾斜
4. group by 也可能会造成
解决?法:

找到造成数据倾斜的key，然后再通过hql语句避免。
hive.map.aggr=true
hive.groupby.skewindata=false (建议开启)
hive.optimize.skewjoin=false
 Whether to enable skew join optimization.
 The algorithm is as follows: At runtime, detect the keys with
a large skew. Instead of
 processing those keys, store them temporarily in an HDFS
directory. In a follow-up map-reduce
 job, process those skewed keys. The same key need not be
skewed for all the tables, and so,
 the follow-up map-reduce job (for the skewed keys) would be
much faster, since it would be a
 map-join.

索引

1. 索引可以?来加快含有GROUP BY语句的查询的计算速度.
2. Hive从0.8版本后增加?个bitmap的索引实现。bitmap索引?般在指定的列排?
后的值?较?时进?使?.相对于?较新,要评估后才能使?
3.hive的索引对hive查询效影响很小，反而是加入索引会导致写数据或者更新数据慢

分区

分区本质就是提升hive查询效率

动态分区INSERT语句可以通过简单的SELECT语句向分区表中创建很多新的分区。这是?个?常强?的功能，不过如果分区的个数?常得多，那么就会在系统中产??量的输出控制流。对于Hadoop来说，这个种情况并不是常?的使?场景，因此，其通常会?次创建很多的?件，然后会向这些?件中写??量的数据。跳出这些框框，Hive可以通过配置限制动态分区插?允许创建的分区数在1000个左右。虽然太多的分区对于表来说并不好，不过，通常还是将这个值设置的更?以便这些查询执?。具体动态分区细节可以参考动态分区?节进?温习.

推测执?

类似于Hadoop,在Hive中也提供了配置项?来做推测执?,具体要经过严格测试后才能推?.

优化中：1、hql的解释 2、join的优化 3、mapper和reducer个数 4、数据倾斜。核心减少无效数据量，内存驻留数据。

with语句

数据驻留在内存中，同时数据的生命周期是随hql语句执行完成则消失，一般作用于临时表，该临时表不需要手动创建，也不需要手动删除

with tmp1 as
(select
*
from text
)
select * from tmp1
?
with tmp1 as
(select
*
from text
)
inserrt into overwrite t1
select * from tmp1

多维分析函数

grouping sets() with cube with rollup

某天，谋省，某市的下单量和下单金额。查询某天、某省下单金额；某天的下单量
13 河北 石家庄 1 33 
13 河北 石家庄 2 35
13 河北 张家口 3 66
12 河北 张家口 4 88
grouping sets((天，省，市)，（天，省），（天）)
13 河北 石家庄 3 88
13 河北 张家口 3 66
12 河北 张家口 4 88
13 河北 NULL  6 154
13 河北 NULL  4 88
13 NULL NULL  6 154
12 NULL NULL  4 88
?
cube 2的维度次方种查询方式
天 省 市
天 省 -
天 - -
- - -
- 省 市
- 省 -
- - 市
天 - 市
?
rollup
天 - -
天 省 -
天 省 市

HBASE

hbase的来源

1、hdfs的数据读写延迟高
2、不能进行实时删除局部数据
3、hive的数据必须要指定的列或者字段，必须要格式化的数据
4、hbase来源于google的bigtable

hbase的定义

hbase是一个基于hadoop开源的，分布式的，多版本的，可扩展的，非关系型数据库，能够处理海量数据，（数十亿行，数百万列），面向列存储，面向稀疏存储（mysql面向行存储）

大数据最新文章

实现Kafka至少消费一次

亚马逊云科技：还在苦于ETL？Zero ETL的时代

初探MapReduce

【SpringBoot框架篇】32.基于注解+redis实现

Elasticsearch：如何减少 Elasticsearch 集

Go redis操作

Redis面试题

专题五 Redis高并发场景

基于GBase8s和Calcite的多数据源查询

Redis——底层数据结构原理

加:2021-07-14 10:59:11 更:2021-07-14 11:00:06

360图书馆购物三丰科技阅读网日历万年历 2025年12日历

-2025/12/1 22:51:26-

图片自动播放器
↓图片自动播放器↓

TxT小说阅读器
↓语音阅读,小说下载,古典文学↓

一键清除垃圾
↓轻轻一点,清除系统垃圾↓

图片批量下载器
↓批量下载图片,美女图库↓

网站联系: qq:121756557 email:121756557@qq.com IT数码