Hive
- Data warehouse layering
- Following the flow of data in and out, a data warehouse is typically split into three layers: the source data layer (ODS), the data warehouse layer (DW), and the data application layer (APP).
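The three layers can be sketched as separate Hive databases (the names `ods`, `dw`, and `app` below are an illustrative convention, not something Hive mandates):

```sql
-- One database per warehouse layer (layer names are a convention, not a Hive feature)
create database if not exists ods;   -- source data, loaded as-is
create database if not exists dw;    -- cleaned and modeled data
create database if not exists app;   -- aggregated data served to applications
```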

Introduction to Hive
- Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to database tables and provides an SQL-like query language.
- Hive can store very large data sets and can directly access files stored in Apache HDFS or in other storage systems.
- Hive supports three distributed execution engines: MapReduce, Spark, and Tez.
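In Hive itself, the engine used by the current session can be switched with the `hive.execution.engine` property (the chosen engine must already be installed and configured in the cluster):

```sql
-- Choose the execution engine for the current session: mr, tez, or spark
set hive.execution.engine=mr;
set hive.execution.engine=tez;
set hive.execution.engine=spark;
```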
Hive architecture

Hive execution engines

Installing and starting Hive
- cd /export/server/spark-2.3.0-bin-hadoop2.7/bin
- ./beeline
- !connect jdbc:hive2://node1:10000
- root
- press Enter directly for the password (empty password)
Hive database operations
1. Creating a database (default location)
create database if not exists myhive;
show databases;
Notes:
1: if not exists is optional; with it, the database is created only if it does not already exist (without it, creating an existing database raises an error).
2: Hive databases are stored under /user/hive/warehouse by default.
2. Creating a database at a specified path
create database myhive2 location '/myhive2';
show databases;
Notes:
1: location specifies the directory where the database is stored.
3. Viewing database details
desc database myhive;
4. Dropping a database
Dropping an empty database; if the database still contains tables, this raises an error:
drop database myhive;
Force-dropping a database together with all the tables inside it:
drop database myhive2 cascade;
5. Table creation syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[LOCATION hdfs_path]
6. Internal tables: creating a table
A table created without the external keyword is an internal table, also called a managed table. Internal tables are not suitable for sharing data with other systems.
create database mytest;
use mytest;
create table student(id int, name string);
show tables;
7. Internal tables: viewing the structure / dropping
View the table structure:
desc student;
desc formatted student;
Drop the table:
drop table student;
8. Inserting data
insert into student values(20, '杨华钟');
insert into student values(80, '长江大学');
9. Querying data
select * from student;
10. Loading data from files
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Notes:
LOCAL means the file is loaded from the local file system; without it, the path is read from HDFS.
Example 1: loading from the local file system
create table if not exists stu2(id int, name string)
row format delimited fields terminated by '\t';
load data local inpath '/export/data/hivedatas/stu.txt' into table stu2;
Example 2: loading from HDFS
create table if not exists stu3(id int, name string)
row format delimited fields terminated by '\t';
hadoop fs -mkdir -p /hivedatas
cd /export/data/hivedatas
hadoop fs -put stu.txt /hivedatas/
load data inpath '/hivedatas/stu.txt' into table stu3;
Characteristics of Hive internal tables
- Hive is a data warehouse built on top of Hadoop; the data stored in Hive actually lives on HDFS as plain files.
- Hive metadata records the characteristics of databases and tables, such as database names, storage paths, table names, column definitions, and table file locations. In this course, Hive's metadata is stored in a MySQL database.
- By default, an internal table's files are stored under /user/hive/warehouse/databasename.db/tablename/.
- When an internal table is dropped, both its data and its metadata are deleted. Internal tables are typically used as intermediate or temporary tables.
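A quick way to confirm whether a table is managed is `desc formatted` (using the `student` table from earlier; the exact output layout varies by Hive version):

```sql
-- The "Table Type" field in the output shows MANAGED_TABLE for internal tables
desc formatted student;
-- Dropping an internal table removes both the metadata and the HDFS files
drop table student;
```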
Hive external table operations
- A table created with the external keyword is an external table; external table data can be shared with other systems.
- # Create the student table
create external table student (sid string, sname string, sbirth string, ssex string) row format delimited fields terminated by '\t' location '/hive_table/student';
create external table teacher (tid string, tname string) row format delimited fields terminated by '\t' location '/hive_table/teacher';
Loading data
# Add data to the student table; the file /export/data/hivedatas/student.txt must exist locally first
load data local inpath '/export/data/hivedatas/student.txt' into table student;
# Add data to the teacher table, overwriting existing data; the file /export/data/hivedatas/teacher.txt must exist locally first
load data local inpath '/export/data/hivedatas/teacher.txt' overwrite into table teacher;
- Data can also be uploaded to HDFS first: hadoop fs -put student.txt /
load data inpath '/student.txt' into table student;
- Characteristics of external tables:
Dropping an external table deletes only the metadata; the files on HDFS are kept. External tables are typically used for shared data, which makes them safer.
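The behavior can be checked directly against the tables created above (the path follows the `location` clause of the teacher table):

```sql
-- Dropping an external table removes only the metadata ...
drop table teacher;
-- ... the data files are still on HDFS, e.g. from the shell:
--   hadoop fs -ls /hive_table/teacher
-- Re-creating the table at the same location makes the data queryable again
create external table teacher (tid string, tname string)
row format delimited fields terminated by '\t' location '/hive_table/teacher';
```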
Hive table operations: partitioned tables
- One of the most common ideas in big data is divide and conquer. A partitioned table corresponds to separate subdirectories on HDFS; each subdirectory holds all the data files of one partition.
- A partition can be thought of as a category: data of different types is placed in different directories. Hive supports both single-level and multi-level partitioned tables.
- Creating a single-level partitioned table:
CREATE TABLE IF NOT EXISTS score(sid string,cid string,sscore int)
partitioned by(month string) row format delimited fields terminated by '\t';
load data local inpath '/export/data/hivedatas/score.txt'
into TABLE score partition(month='202006');
- Creating a multi-level partitioned table:
CREATE TABLE IF NOT EXISTS score2(sid string,cid string,sscore int)
partitioned by(year string,month string,day string)
row format delimited fields terminated by '\t';
load data local inpath '/export/data/hivedatas/score.txt'
into TABLE score2 partition(year = '2020', month='06',day = '01');
load data local inpath '/export/data/hivedatas/score.txt'
into TABLE score2 partition(year = '2020', month='06',day = '02');
load data local inpath '/export/data/hivedatas/score.txt'
into TABLE score2 partition(year = '2020', month='07',day = '01');
show partitions score;
alter table score add partition(month='202008');
alter table score add partition(month='202009') partition(month='202010');
alter table score drop partition(month='202010');
Note on Spark SQL: Spark SQL does not support creating external tables with create external table; only non-external tables can be created that way. A table saved with write.option("path", "/some/path").saveAsTable is an external table. With an external table, data can be loaded directly and registered via Dataset.createOrReplaceTempView. If a table is registered with createGlobalTempView, it must be accessed with the database prefix, global_temp.tableName; otherwise Spark looks it up in the default database and fails with: Table or view 'tableName' not found in database 'default'.
Hive queries
select [distinct] select_expr, select_expr, ...
from table_reference
[where where_condition]
[group by col_list]
[having where_condition]
[order by col_list] ...
[limit number]
Explanation:
select: query keyword
distinct: remove duplicate rows
from: the table to query
where: filter condition
group by: group the rows
having: filter the grouped results
order by: sort the results
limit: limit the number of returned rows
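The clauses combine in the order given above; a sketch using the `score` table from earlier (the threshold and row count are arbitrary):

```sql
-- Average score per student, keeping only averages above 60,
-- sorted descending, first 3 rows
select sid, avg(sscore) as avgscore
from score
where sscore is not null
group by sid
having avgscore > 60
order by avgscore desc
limit 3;
```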
SELECT * FROM score;
SELECT sid, cid FROM score;
SELECT sid as stu_id, cid as course_id FROM score;
Aggregate functions
SELECT count(*) from score;
SELECT count(1)
FROM score;
SELECT max(sscore)
FROM score;
SELECT min(sscore)
FROM score;
SELECT sum(sscore)
FROM score;
SELECT avg(sscore)
FROM score
The where clause
SELECT *
FROM score where sscore>60;
SELECT * from score where sscore=80
SELECT * from score where sscore between 80 and 100
SELECT * from score where sscore is null
SELECT * from score where sscore in(80,90)
SELECT * from score where sscore=80 or sscore=90
SELECT *
FROM student
where sname like '李%'
SELECT *
FROM student
where sname like '_兰%'
SELECT * from score
where sscore>80 and sid ='01'
SELECT * from score
where sscore>80 or sid ='01'
SELECT * from score
where sid not between '01' and '02'
SELECT count(*) from score
where sid not between '01' and '02'
SELECT * from score
where sid not in('01','02')
SELECT count(*) from score
where sid not in('01','02')
Group-by queries
select sid ,avg(sscore) from score group by sid;
SELECT first(stu.sname),first(sc.sid), avg(sscore) from score sc,student stu
where sc.sid=stu.sid
group by sc.sid
SELECT first(stu.sname),first(sc.sid), avg(sscore) as avg from score sc,student stu
where sc.sid=stu.sid
group by sc.sid
order by avg desc
SELECT sid,max(sscore) from score
group by sid
The having clause
SELECT sid,avg(sscore) as avgscore
FROM score
group by sid
having avgscore>85
SELECT first(stu.sname), first(sc.sid),avg(sscore) as avgscore
FROM score sc,student stu
where sc.sid=stu.sid
group by sc.sid
having avgscore>85
order by sc.sid desc
Sorting
SELECT * FROM score
order by sscore asc
SELECT * FROM score
order by sscore desc
SELECT sid, avg(sscore) as avgscore FROM score
group by sid
order by avgscore desc
The limit clause
SELECT * FROM score
limit 3
Multi-table queries
SELECT * FROM teacher t, course c
where t.tid=c.cid
SELECT * FROM teacher t inner join course c
on t.tid=c.cid
SELECT * FROM teacher t join course c
on t.tid=c.cid
select * from teacher t left join course c on t.tid=c.cid
select * from teacher t right join course c on t.tid=c.cid
select * from teacher t full join course c on t.tid=c.cid
select sid from score
where sscore =(select max(sscore) from score)
select sid,(sscore) from score
where sscore=(select max(sscore) from score)
select sid from score s,(select max(sscore) max_score from score) b
where s.sscore = b.max_score;
Built-in functions
select round(3.1415926,4);
select rand();
select concat_ws(',','abc','def','gh');
select substr('abcde',3,2);
select substring('abcde',-2,2);
select substr(sbirth,1,4) from student;
select year(sbirth) from student;
select month(sbirth) from student;
select day(sbirth) from student;
select date_add('2021-02-01',10);
select date_sub('2021-02-01',10);
select sid,
case
when sscore>=60 then 'pass'
when sscore<60 then 'fail'
else 'other'
end as flag
from score
select sid, cid,
case
when sscore>=60 then 'pass'
when sscore<60 then 'fail'
else 'other'
end as flag
from score
order by sid
concat_ws(sep, str1, str2): concatenates strings with the given separator
collect_set(col): aggregates a column's values into a deduplicated array
create table emp(deptno int, ename string) row format delimited fields terminated by '\t';
load data local inpath '/export/data/hivedatas/emp.txt' into table emp;
select deptno, concat_ws('|', collect_set(ename)) as ems from emp group by deptno;
select collect_set(ename) from emp;
Window functions
rank() leaves gaps in the numbering after ties, dense_rank() does not, and row_number() assigns unique sequential numbers regardless of ties:
select user_id, createtime, pv,
rank() over(partition by user_id order by pv desc) as rn1,
dense_rank() over(partition by user_id order by pv desc) as rn2,
row_number() over(partition by user_id order by pv desc) as rn3
from user_access;
select user_id,createtime,pv,
sum(pv) over(partition by user_id order by createtime rows between unbounded preceding and current row) as pv2
from user_access;
select user_id,createtime,pv, sum(pv) over(partition by user_id order by createtime rows between 3 preceding and current row) as pv4
from user_access;
select user_id,createtime,pv, sum(pv) over(partition by user_id order by createtime rows between 3 preceding and 1 following) as pv5 from user_access;
select user_id,createtime,pv,
max(pv) over(partition by user_id order by createtime rows between unbounded preceding and current row) as pv2
from user_access;
select user_id,createtime,pv, max(pv) over(partition by user_id order by createtime rows between 3 preceding and current row) as pv4
from user_access;
select user_id,createtime,pv, max(pv) over(partition by user_id order by createtime rows between 3 preceding and 1 following) as pv5 from user_access;