一:hive作用
Hive是基于Hadoop的一个数据仓库工具,可以将结构化的数据文件映射为一张表,并提供类SQL查询功能。
Hive本质:将HQL转化成MapReduce程序
(1)Hive处理的数据存储在HDFS
(2)Hive分析数据底层的实现是MapReduce
(3)执行程序运行在Yarn上
1.1:Hive优缺点
1>Hive支持用户自定义函数,用户可以根据自己的需求来实现自己的函数;
缺点:
1)Hive的HQL表达能力有限
(1)迭代式算法无法表达
(2)数据挖掘方面不擅长,由于MapReduce数据处理流程的限制,效率更高的算法却无法实现。
2)Hive的效率比较低 (Hive调优比较困难,粒度较粗)
1.2:Hive和?数据库比较
从结构上来看,Hive?和数据库除了拥有类似的查询语言,再无类似之处
1.3:数据更新
Hive是针对数据仓库应用设计的,而数据仓库的内容是读多写少的。因此,Hive中不建议对数据的改写,所有的数据都是在加载的时候确定好的。
1.4:执行延迟
Hive?在查询数据的时候,由于没有索引,需要扫描整个表,因此延迟较高。
二:HIve使用
2.1:本地模式
set hive.exec.mode.local.auto=true;? //开启本地mr
注意:当切换成本地模式,可能回报一个 内存溢出 的错误,因为本地模式默认只给我们分配 256M 内存。可以通过修改 hive-env.sh.template?
2.2:一般操作
?2.3:客户端查看hdfs
?
?三:Hive中的数据类型
基本数据类型和集合数据类型
?
?四:Hive操作
load data local inpath '/opt/module/hive/datas/test.txt' into table 表名;
?Hdfs上创建一个库放到指定路径下
?
?
?4.1:外部表
?
?五:Hive操作表
?5.1:上传文件到HDFS下的表
hadoop fs -put student.txt /user/hive/warehouse/student
-- 查看当前表的详情????????????????desc student;
desc formatted student; -- 查看当前表的详细信息
?
?
?
?五:DML数据操作
?
?5.2:通过查询语句向表中插入数据(Insert)
?5.3:导入数据的方式
-- 3. 导入数据
--- 本地不覆盖导入
load data local inpath '/opt/module/hive-3.1.2/datas/ztudent.txt'
into table student;
--- 本地覆盖导入
load data local inpath '/opt/module/hive-3.1.2/datas/student1.txt'
overwrite
into table student;
--- HDFS导入
load data inpath '/student/ztudent.txt'
into table student;
--- 注意:从本地导入数据属于cp行为,如果从HDFS导入属于mv行为
1.3 通过查询语句向表中插入数据(Insert)
-- 1. 建表
create table stu1(id int, name string)
row format delimited fields terminated by '\t';
-- 2. 基本插入数据
insert into table stu1 values(1,'wangwu'),(2,'zhaoliu');
-- 3. 基本模式插入(根据单张表查询结果)
insert into table stu1 select id, name from student;
1.4 查询语句中创建表并加载数据(As Select)
create table stu2
as select id, name from student;
1.5 创建表时通过Location指定加载数据路径
create table stu3(id int, name string)
row format delimited fields terminated by '\t'
location '/student';
5.4:Insert数据导出(通常用于数据迁移)
2.1 Insert导出
--1. 将查询的结果导出到本地
insert overwrite local directory '/opt/module/hive-3.1.2/datas/export_data'
select * from student;
--2. 将查询的结果导出到本地(对数据格式约束)
insert overwrite local directory '/opt/module/hive-3.1.2/datas/export_data'
row format delimited fields terminated by "\t"
select * from student;
--3. 将查询的结果导出到HDFS上(没有local)
insert overwrite directory '/export_data'
row format delimited fields terminated by "\t"
select * from student;
2.2 Export导出到HDFS上 & Import数据到指定Hive表中
-- export导出到HDFS
export table student to '/export/student';
-- import 导入Hive的表中(导数据的过程中会根据元数据进行建表)
import table bucunzai from '/export/student';
-- 注意 Export & Import 使用场景:数据迁移的时候常用!!!
2.3 清除表中数据(Truncate)
注意:当清楚数据的表是内部表的时候才允许清楚,外部表不允许清除数据
truncate table student;
六:Hive函数
6.1:常用函数
HiveSQL语法有限制? ?
?6.2:SQL的执行顺序
?案例一:
?Having语句 (只能和 group by 结合使用) -- 计算每个部门最高薪水的那个哥们,并且薪资大于等于3000
使用group by,select后面的 字段只能是分组字段和 要求的值,如平均工资;
select
t1.deptno,
t2.ename,
t1.max_sal
from
(
select
deptno,
max(sal) as max_sal
from emp
group by deptno having max_sal >= 3000
) t1 join emp t2
on t1.deptno = t2.deptno
where t1.max_sal = t2.sal
;
?
?七:JOIN
(emp 可以看成左表 dept可以看成右表? 表连接的形式与key有关? ?)
emp 和 dept的交集数据:? ? ?join
emp的全部数据 和 dept匹配到的数据? ? :??left join?
dept 的全部数据 和 emp 匹配到的数据? ?:? ?right join? ?
获取emp独有的数据? ?left join? ??where d.deptno is null
获取dept独有的数据? ?right join? ??where e.deptno is null
获取 emp 和dept 的全部数据(全连接)? ?(? union 传统的方式? ??full join 进行全连接(Hive支? 持,传统SQL不支持,oracle中支持))
获取emp和dept各自独有的数据? ??full join? (
where e.deptno is null or d.deptno is null? ?)
?full join 进行全连接(Hive支持,传统SQL不支持,oracle中支持)
4.1 emp 和 dept的交集数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
join
dept d
on e.deptno = d.deptno;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| 7369 | SMITH | 20 | 20 | RESEARCH |
| 7499 | ALLEN | 30 | 30 | SALES |
| 7521 | WARD | 30 | 30 | SALES |
| 7566 | JONES | 20 | 20 | RESEARCH |
| 7654 | MARTIN | 30 | 30 | SALES |
| 7698 | BLAKE | 30 | 30 | SALES |
| 7782 | CLARK | 10 | 10 | ACCOUNTING |
| 7788 | SCOTT | 20 | 20 | RESEARCH |
| 7839 | KING | 10 | 10 | ACCOUNTING |
| 7844 | TURNER | 30 | 30 | SALES |
| 7876 | ADAMS | 20 | 20 | RESEARCH |
| 7900 | JAMES | 30 | 30 | SALES |
| 7902 | FORD | 20 | 20 | RESEARCH |
| 7934 | MILLER | 10 | 10 | ACCOUNTING |
+----------+----------+-----------+-----------+-------------+
4.2 emp的全部数据 和 dept匹配到的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
left join
dept d
on e.deptno = d.deptno;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| 7369 | SMITH | 20 | 20 | RESEARCH |
| 7499 | ALLEN | 30 | 30 | SALES |
| 7521 | WARD | 30 | 30 | SALES |
| 7566 | JONES | 20 | 20 | RESEARCH |
| 7654 | MARTIN | 30 | 30 | SALES |
| 7698 | BLAKE | 30 | 30 | SALES |
| 7782 | CLARK | 10 | 10 | ACCOUNTING |
| 7788 | SCOTT | 20 | 20 | RESEARCH |
| 7839 | KING | 10 | 10 | ACCOUNTING |
| 7844 | TURNER | 30 | 30 | SALES |
| 7876 | ADAMS | 20 | 20 | RESEARCH |
| 7900 | JAMES | 30 | 30 | SALES |
| 7902 | FORD | 20 | 20 | RESEARCH |
| 7934 | MILLER | 10 | 10 | ACCOUNTING |
| 6666 | MILLER | 60 | NULL | NULL |
+----------+----------+-----------+-----------+-------------+
4.3 dept 的全部数据 和 emp 匹配到的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
right join
dept d
on e.deptno = d.deptno;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| 7782 | CLARK | 10 | 10 | ACCOUNTING |
| 7839 | KING | 10 | 10 | ACCOUNTING |
| 7934 | MILLER | 10 | 10 | ACCOUNTING |
| 7369 | SMITH | 20 | 20 | RESEARCH |
| 7566 | JONES | 20 | 20 | RESEARCH |
| 7788 | SCOTT | 20 | 20 | RESEARCH |
| 7876 | ADAMS | 20 | 20 | RESEARCH |
| 7902 | FORD | 20 | 20 | RESEARCH |
| 7499 | ALLEN | 30 | 30 | SALES |
| 7521 | WARD | 30 | 30 | SALES |
| 7654 | MARTIN | 30 | 30 | SALES |
| 7698 | BLAKE | 30 | 30 | SALES |
| 7844 | TURNER | 30 | 30 | SALES |
| 7900 | JAMES | 30 | 30 | SALES |
| NULL | NULL | NULL | 40 | OPERATIONS |
+----------+----------+-----------+-----------+-------------+
4.4 获取emp独有的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
left join
dept d
on e.deptno = d.deptno
where d.deptno is null
;
+----------+----------+-----------+-----------+----------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+----------+
| 6666 | MILLER | 60 | NULL | NULL |
+----------+----------+-----------+-----------+----------+
4.5 获取dept独有的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
right join
dept d
on e.deptno = d.deptno
where e.deptno is null
;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| NULL | NULL | NULL | 40 | OPERATIONS |
+----------+----------+-----------+-----------+-------------+
4.6 获取 emp 和dept 的全部数据(全连接)
-- 实现方式一: union 传统的方式
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
left join
dept d
on e.deptno = d.deptno
union
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
right join
dept d
on e.deptno = d.deptno;
+------------+------------+-------------+------------+
| _u1.empno | _u1.ename | _u1.deptno | _u1.dname |
+------------+------------+-------------+------------+
| NULL | NULL | NULL | 40 |
| 6666 | MILLER | 60 | NULL |
| 7369 | SMITH | 20 | 20 |
| 7499 | ALLEN | 30 | 30 |
| 7521 | WARD | 30 | 30 |
| 7566 | JONES | 20 | 20 |
| 7654 | MARTIN | 30 | 30 |
| 7698 | BLAKE | 30 | 30 |
| 7782 | CLARK | 10 | 10 |
| 7788 | SCOTT | 20 | 20 |
| 7839 | KING | 10 | 10 |
| 7844 | TURNER | 30 | 30 |
| 7876 | ADAMS | 20 | 20 |
| 7900 | JAMES | 30 | 30 |
| 7902 | FORD | 20 | 20 |
| 7934 | MILLER | 10 | 10 |
+------------+------------+-------------+------------+
-- 方式二: full join 进行全连接(Hive支持,传统SQL不支持,oracle中支持)
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
full join
dept d
on e.deptno = d.deptno;
4.7 获取emp和dept各自独有的数据
select
e.empno,e.ename,e.deptno,d.deptno,d.dname
from emp e
full join
dept d
on e.deptno = d.deptno
where e.deptno is null or d.deptno is null
;
+----------+----------+-----------+-----------+-------------+
| e.empno | e.ename | e.deptno | d.deptno | d.dname |
+----------+----------+-----------+-----------+-------------+
| NULL | NULL | NULL | 40 | OPERATIONS |
| 6666 | MILLER | 60 | NULL | NULL |
+----------+----------+-----------+-----------+-------------+
八:Group By()Having语句 和 LIke
GROUP BY语句通常会和聚合函数一起使用,按照一个或者多个列结果进行分组,然后对每个组执行聚合操作。
?Having语句
?九:order by()? ?sort by()??Distribute By()?Cluster By()
set mapreduce.job.reduces=3;
set hive.exec.mode.local.auto=false;
运行结果
-- 小结:Hive中如果单独使用sort by 它也能进行局部排序,但是每个文件中的数据是随机进入,如果 -- 如果 将 distribute by + sort by 使用 ? 就完全和Hadoop中的HashPartitioner相吻合!!!
十:分区表(Hive中分区就是分目录)
Hive中没有索引的概念
分区表在查询的时候要加上分区字段
?针对庞大的数据集,存储的时候考虑用分区表表存储,只要按照一定规律 -- ? ? ? 进行分区,将来查询的时候就可以使用分区字段结合查询条件小范围匹配数据 -- ? ? ? 最终避免全表扫描,从而提升查询效率。
即可以将表的分区字段看成表的普通字段?
-- 查看分区详情 show partitions dept_par;
-- 准备数据 导入
load data [local] inpath '数据的path' [overwrite] into table student
[partition (partcol1=val1,…)];
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_par
partition (day='20200401');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par
partition (day='20200402');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_par
partition (day='20200403');
注意:再给分区表中导入数据的时候一定要加上 分区 属性partition(day='20200403')
2. 增加分区
--增加单个分区
alter table dept_par add partition(day='20200404');
--增加多个分区
alter table dept_par add partition(day='20200405') partition(day='20200406');
3. 删除分区
alter table dept_par drop partition(day='20200404');
alter table dept_par drop partition(day='20200405'), partition(day='20200406');
二级分区:
二级分区:
1. 创建分区表
-- 建表
create table dept_par1(
deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
-- 准备数据 导入
load data [local] inpath '数据的path' [overwrite] into table student
[partition (partcol1=val1,…)];
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_par1
partition (day='20200401', hour='00');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par1
partition (day='20200401', hour='01');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_par1
partition (day='20200401', hour='02');
--思考:分区是不是越多越好? -- 分区越多导致文件的数量就会增加,在HDFS尽量避免出现大量小文件,在正常使用中一般做到2级分区就可以啦!
10.1:分区表和数据产生关联的三种方式
分区表和数据产生关联的三种方式:(分区表和普通表不一样需要此三种操作才能找到对应的元数据)
方式一:上传数据后修复? (msck repair table dept_par2;)
方式二:上传数据后 添加分区? ? ?
(alter table dept_par add partition(day='20200401');)
方式三:创建文件夹后load数据到分区
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par4
partition (day='20200402');
10.2:动态分区
?相关参数 :
?注意通过insert into的方式给分区表中插入数据一定要指定 partition(loc)
-- 进行动态分区(动态的往 dept_par_dy 插入数据)
insert into table dept_par_dy partition(loc)
select deptno, dname, loc from dept;
10.3:分桶表(分桶操作实际上就是Hadoop中的分区)
面对海量数据表查询效率低下,此时就考虑缩小数据的管理范围, -- ? ? ? 从而达到避免查询的时候进行全表扫描,分桶操作实质就是Hadoop -- ? ? ? 中默认规则的分区操作
分桶表在查询的时候要加上分桶字段,类似于分区表在查询的时候要加上分区字段
注意:在进行分桶操作的时候要设置? ?set mapreduce.job.reduces=-1;? (因为分区分的是分目录(分区上传文件的时候是每个区的文件上传到对应的目录下边),分桶分的是分文件(上传文件的时候只上传一个文件,hive会按照指定的分桶字段创建不同的桶))
创建分桶表
CREATE TABLE table_name
[(col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
INTO num_buckets BUCKETS]
-- 建表
create table stu_bucket(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
-- 加载数据
load data local inpath '/opt/module/hive-3.1.2/datas/student.txt'
into table stu_bucket;
10.4:抽样查询?
抽样查询取是按照id进行随机分桶,然后选择第x个桶,并不是真正意义上按照分桶的规则进行划分,然后获取第x个桶的数据,而是随机的。
四、 抽样查询
-- 语法: TABLESAMPLE(BUCKET x OUT OF y ON 分桶字段)?
select * from stu_bucket tablesample(bucket 1 out of 4 on id);
select * from emp tablesample(bucket 1 out of 4 on empno);
十一:Hive中的函数(思路:按照需求缺什么 补什么)
2> CASE WHEN THEN ELSE END(逻辑判断函数)
1). 按照部门分组,再按照性别分组,求统计
select
dept_id,
sex,
count(name) as num
from emp_sex
group by dept_id,sex;
+----------+------+------+
| dept_id | sex | num |
+----------+------+------+
| A | 女 | 1 |
| A | 男 | 2 |
| B | 女 | 2 |
| B | 男 | 1 |
+----------+------+------+
2). 实现目标结果数据
select
dept_id,
sum(CASE sex WHEN '男' THEN 1 ELSE 0 END) as man,
sum(CASE sex WHEN '女' THEN 1 ELSE 0 END) as women
from emp_sex
group by dept_id;
+----------+------+--------+
| dept_id | man | women |
+----------+------+--------+
| A | 2 | 1 |
| B | 1 | 2 |
+----------+------+--------+
3). 扩展 if() 函数
select
dept_id,
sum(if(sex='男',1,0)) as man,
sum(if(sex='女',1,0)) as women
from emp_sex
group by dept_id;
+----------+------+--------+
| dept_id | man | women |
+----------+------+--------+
| A | 2 | 1 |
| B | 1 | 2 |
+----------+------+--------+
?11.1:函数第二波
?COLLECT_SET(col):函数只接受基本数据类型,它的主要作用是将某字段的值进行去重汇总,产生array类型字段。
1). 把 constellation 和 blood_type 拼接起来
select
concat_ws(',',constellation,blood_type) as cb,
name
from person_info; --> t1 因为第二个表查询的时候用到了第一个表中的name,因此要把那么查出来
+--------+-------+
| cb | name |
+--------+-------+
| 白羊座,A | 孙悟空 |
| 射手座,A | 大海 |
| 白羊座,B | 宋宋 |
| 白羊座,A | 猪八戒 |
| 射手座,A | 凤姐 |
| 白羊座,B | 苍老师 |
+--------+-------+
2). 根据 cb 字段进行分组
select
t1.cb,
concat_ws('|',collect_set(t1.name))
from
(
select
concat_ws(',',constellation,blood_type) as cb,
name
from person_info
) t1
group by t1.cb;
+--------+----------+
| t1.cb | _c1 |
+--------+----------+
| 射手座,A | 大海|凤姐 |
| 白羊座,A | 孙悟空|猪八戒 |
| 白羊座,B | 宋宋|苍老师 |
+--------+----------+
侧写函数LATERAL VIEW 语法 :(因为SQL语法局限 需要 LATERAL VIEW)
explode 一进多出函数, 侧写函数代表虚拟表的过程;执行过程
炸开用explode ? ?对应key 用侧写函数LATERAL VIEW
EXPLODE(col):将hive一列中复杂的array或者map结构拆分成多行。
select
explode(split(category,',')) as category_name,
movie
from movie_info;
+------+
| col |
+------+
| 悬疑 |
| 动作 |
| 科幻 |
| 剧情 |
| 悬疑 |
| 警匪 |
| 动作 |
| 心理 |
| 剧情 |
| 战争 |
| 动作 |
| 灾难 |
+------+
select
movie,
category_name
from movie_info
lateral view explode(split(category,',')) movie_info_tmp as category_name
;
+--------------+----------------+
| movie | category_name |
+--------------+----------------+
| 《疑犯追踪》 | 悬疑 |
| 《疑犯追踪》 | 动作 |
| 《疑犯追踪》 | 科幻 |
| 《疑犯追踪》 | 剧情 |
| 《Lie to me》 | 悬疑 |
| 《Lie to me》 | 警匪 |
| 《Lie to me》 | 动作 |
| 《Lie to me》 | 心理 |
| 《Lie to me》 | 剧情 |
| 《战狼2》 | 战争 |
| 《战狼2》 | 动作 |
| 《战狼2》 | 灾难 |
+--------------+----------------+
11.2:函数第三波?
窗口函数就是针对SQL中的限制重新开辟一个处理数据空间, -- ? ? ? 可以灵活的的控制窗口的大小,窗口的大小取决于数据的记录的条数。
需求一:查询在2017年4月份购买过的顾客及总人数
-- 开窗函数此处应用 ?大意是 按照名字分组 开辟了一个新的统计名字个数的空间,这个空间的范围是?从每一个组中的起始位置到最后位置? 即每一个组统计一次。?
WINDOW 规范默认为? ROW BETWEEN UNBOUNDED PRECEDING(从前面的起点) AND UNBOUNDED FOLLOWING(到后面的终点).
count(name) over() totalNum 统计姓名的个数? 此时如果不分组的情况 会对相同的名字进行个数统计,因此需要加上 group by()? 而over() 是在group by()之后,
开辟了新的统计组个数的空间(即要统计总人数就要分组
select
name ,
count(name)
from business
where substring(orderdate,0,7) = '2017-04'
group by name ;
name _c1
jack 1
mart 4
)但是分组的结果会重复统计相同的人名,此时开窗的作用就是开辟了新的统计人数的空间,即一个组统计一次;
1) 根据需求的日期过滤数据
select
orderdate ,
name
from business
where substring(orderdate,0,7) = '2017-04' ;
orderdate name
2017-04-06 jack
2017-04-08 mart
2017-04-09 mart
2017-04-11 mart
2017-04-13 mart
2). 根据 name 进行分组,然后使用开窗函数重新处理分组后的结果 求统计
-- (因为用到了分组,而分组要查询的字段 是分组字段 因此 查询字段不能带日期)
select
name,
count(name) over() totalNum
from business
where substring(orderdate,0,7) = '2017-04'
group by name ;
-- 开窗函数此处应用 大意是 按照名字分组 开辟了一个新的统计名字个数的空间,这个空间的范围是
-- 从每一个组中的起始位置到最后位置 。
如果
select
name,
count(name) over() totalNum
from business
where substring(orderdate,0,7) = '2017-04'
没有开窗语句也没有order by语句 则 默认窗口大小从起点到终点
name totalnum
mart 5
mart 5
mart 5
mart 5
jack 5
需求二:查询顾客的购买明细及月购买总额
目的是统计??顾客的购买明细(每个顾客每个月的花费金额)??及??月购买总额(一个月内所有顾客花费的总金额)? 因此SQL(
sum(cost) over(partition by substring(orderdate,0,7)) costByMonth 表示在求一个月内所有顾客花费的总金额的时候 重新开辟了一个新的空间,新的空间的大小是按照substring(orderdate,0,7)一个月进行分区汇总的。
(partition by 可以控制新空间的大小)只有窗口语句没有order by语句 此时窗口的大小是每一个分区内的起点到终点
select
name,
cost,
orderdate,
sum(cost) over(partition by substring(orderdate,0,7)) total_cost
from business;
需求三:查询 每个顾客的购买明细 及 (每个)月购买总额
select
name,
cost,
orderdate,
sum(cost) over(partition by substring(orderdate,0,7),name) oneCostByMonth
from business ;
(sum(cost) over(partition by substring(orderdate,0,7),name) oneCostByMonth)
开窗函数新空间的大小由按照月份和顾客确定。 只有窗口语句没有order by语句 此时窗口的大小是每一个分区内的起点到终点
需求四:上述的场景, 将每个顾客的cost按照日期进行累加
-- 方式一
select
name,
cost,
orderdate,
sum(cost) over(partition by name order by orderdate) oneCustTotal
from business ;
-- 方式二
select
name,
cost,
orderdate,
sum(cost) over(partition by name order by orderdate
rows between UNBOUNDED PRECEDING and current row
) oneCustTotal
from business ;
? 开窗Sql代码
(sum(cost) over(partition by name order by orderdate) oneCustTotal)
每个顾客的cost按照日期进行累加,即开窗的新的空间是按照顾客名字进行分区,按照日期进行累加,因此要排序,开窗规则的默认窗口是从起始行累加到当前行。
(sum(cost) over(partition by name order by orderdate?rows between UNBOUNDED PRECEDING and current row)?开窗规则的默认窗口是从起始行累加到当前行。
注意:? 当 ORDER BY 指定时缺少 WINDOW 子句,WINDOW 窗口大小规范默认为? RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
当 ORDER BY 和 WINDOW 子句都缺失时 即单纯的over(),WINDOW 窗口大小规范默认为? ROW BETWEEN UNBOUNDED PRECEDING 起点?AND UNBOUNDED FOLLOWING. 终点
-- 需求五:查询每个顾客上次的购买时间和下一次购买时间 LAG(col,n,default_val):往前第n行数据 LEAD(col,n, default_val):往后第n行数据
select
name ,
orderdate,
lag(orderdate,1,'startTime') over(partition by name order by orderdate) preTime,
lead(orderdate,1,'endTime') over(partition by name order by orderdate) nextTime
from business ;
name orderdate pretime nexttime
jack 2017-01-01 startTime 2017-01-05
jack 2017-01-05 2017-01-01 2017-01-08
jack 2017-01-08 2017-01-05 2017-02-03
jack 2017-02-03 2017-01-08 2017-04-06
jack 2017-04-06 2017-02-03 endTime
mart 2017-04-08 startTime 2017-04-09
mart 2017-04-09 2017-04-08 2017-04-11
mart 2017-04-11 2017-04-09 2017-04-13
mart 2017-04-13 2017-04-11 endTime
neil 2017-05-10 startTime 2017-06-12
neil 2017-06-12 2017-05-10 endTime
tony 2017-01-02 startTime 2017-01-04
tony 2017-01-04 2017-01-02 2017-01-07
tony 2017-01-07 2017-01-04 endTime
开窗函数是作为对其它函数在统计的时候由于SQL的局限,进行辅助作用的 辅助主函数的,因此不能单独存在; 因此以下写法不正确,应该将over()开窗函数放在主函数的后边。且开窗函数不能单独使用(以下两种写法都是错误的)
select
name ,
orderdate,
over(partition by name order by orderdate) lag(orderdate,1,'startTime') preTime,
over(partition by name order by orderdate) lead(orderdate,1,'endTime') nextTime
from business ;
开窗函数不能单独使用
select
name ,
orderdate,
over(partition by name order by orderdate) preTime,
over(partition by name order by orderdate) nextTime
from business ;
需求六:查询前20%时间的订单信息
?NTILE(n):把有序窗口的行分发到指定数据的组中,?NTILE(n) n为需要的组数各个组有默认有编号,编号从1开始,对于每一行, NTILE函数返回此行所属的组的编号。注意:n必须为int类型。
select
t1.name,
t1.orderdate,
t1.cost
from
(
select
name,
orderdate,
cost,
-- 分成5组 按照日期排序
ntile(5) over(order by orderdate) as group_id
from business
) t1
where t1.group_id = 1;
t1.name t1.orderdate t1.cost
jack 2017-01-01 10
tony 2017-01-02 15
tony 2017-01-04 29
?rank() over(partition by subject order by score desc) rankOrder, dense_rank() over(partition by subject order by score desc) dense_rank_Order, row_number() over(partition by subject order by score desc) row_number_orderNum? 开窗函数辅助主函数? ?row_number()进行统计
select
name ,
subject,
score ,
rank() over(partition by subject order by score desc) rankOrder,
dense_rank() over(partition by subject order by score desc) dense_rank_Order,
row_number() over(partition by subject order by score desc) row_number_orderNum
from score ;
name subject score rankorder dense_rank_order row_number_ordernum
孙悟空 数学 95 1 1 1
宋宋 数学 86 2 2 2
婷婷 数学 85 3 3 3
大海 数学 56 4 4 4
宋宋 英语 84 1 1 1
大海 英语 84 1 1 2
婷婷 英语 78 3 2 3
孙悟空 英语 68 4 3 4
大海 语文 94 1 1 1
孙悟空 语文 87 2 2 2
婷婷 语文 65 3 3 3
宋宋 语文 64 4 4 4
十二:自定义函数(看文档)
?1). 获取客户端连接对象 2). 调用API完成具体功能 3). 关闭资源
十三:Hive企业级调优(看文档)
十四:Hive ETL案例分析
ETL阶段:清洗字段数目
?
?数据清洗(ETL) -- 操作数据的注意点 1). 对字段的数量进行验证 2). 对category字段中的空格进行处理 3). 对关联视频字段 进行处理
-- ETL的思路? 1). 通过MR程序进行数据清洗(编码) 2). 打包,上传至Linux中? 3). 执行数据清洗
-- 进行数据清洗 1). 把待处理的数据 上传至HDFS?
2). 执行MR清洗程序 hadoop jar /opt/module/hive-3.1.2/datas/GuliETL-1.0-SNAPSHOT.jar com.atguigu.mr.GuliEtlDriver /gulivideo/video /gulivideo/video/output
需求分析:
?需求一: 统计视频(字段用videoId视频id) 观看数(views用字段观看次数) Top10
select
videoId,
views
from gulivideo_orc
order by views desc
limit 10;
+--------------+-----------+
| videoid | views |
+--------------+-----------+
| dMH0bHeiRNg | 42513417 |
| 0XxI-hvPRRA | 20282464 |
| 1dmVU08zVpA | 16087899 |
| RB-wUgnyGv0 | 15712924 |
| QjA5faZF1A8 | 15256922 |
| -_CSo1gOd48 | 13199833 |
| 49IDp76kjPw | 11970018 |
| tYnn51C3X_w | 11823701 |
| pv5zWaTEVkI | 11672017 |
| D2kJZOfq7zk | 11184051 |
+--------------+-----------+
需求二:统计视频类别热度Top10(热度用视频的数量来体现)
统计视频类别热度Top10(热度用视频的数量来体现)
-- 分析:
1). 获取视频类别并炸开
select
videoId,
category_name
from gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name; --> t1
2). 根据 t1 中的 category_name 进行分组 求统计 获取 hot值并根据hot值倒序取前十
select
category_name,
count(t1.videoId) as hot
from
(
select
videoId,
category_name
from gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name
) t1
group by t1.category_name
order by hot desc
limit 10
;
+----------------+---------+
| category_name | hot |
+----------------+---------+
| Music | 179049 |
| Entertainment | 127674 |
| Comedy | 87818 |
| Animation | 73293 |
| Film | 73293 |
| Sports | 67329 |
| Games | 59817 |
| Gadgets | 59817 |
| People | 48890 |
| Blogs | 48890 |
+----------------+---------+
需求三: 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
视频(字段用videoId视频id) 观看数(views用字段观看次数) 最高的20个视频
select
videoId ,
views ,
category ,
category_name
from gulivideo_orc
order by views desc
limit 20 ; -- t1
-- top 20 视频的videoId 与 类别
select
t1.videoId videoId_t2,
category_name
from (
select
videoId ,
views ,
category
from gulivideo_orc
order by views desc
limit 20
) t1
LATERAL VIEW explode(t1.category) gulivideo_orc_temp as category_name ; -- t2
以及 类别 包含Top20视频的个数 (按照类别进行分组 统计视频的个数)
select
t2.category_name,
count(t2.videoId_t2) as num
from (
select
t1.videoId videoId_t2,
category_name
from (
select
videoId ,
views ,
category
from gulivideo_orc
order by views desc
limit 20
) t1
LATERAL VIEW explode(t1.category) gulivideo_orc_temp as category_name
) t2
group by t2.category_name ;
t2.category_name _c1
Blogs 2
Comedy 6
Entertainment 6
Music 5
People 2
UNA 1
需求四: 统计视频观看数Top50所关联视频的所属类别排名
统计视频,观看次数,和视频相关id
select
videoId ,
views ,
relatedId
from gulivideo_orc
order by views desc
limit 50 ; -- t1
将 t1 中的关联视频字段炸开
select
t1.videoId ,
videoId_Id
from (
select
videoId ,
views ,
relatedId
from gulivideo_orc
order by views desc
limit 50
) t1
LATERAL VIEW explode(t1.relatedId)
gulivideo_orc_temp as videoId_Id ; -- t2
通过和原表Join获取关联视频对应的所属类别
select
t2.videoId_Id videoId_t4,
t3.category category_t4
from(
select
t1.videoId ,
videoId_Id
from (
select
videoId ,
views ,
relatedId
from gulivideo_orc
order by views desc
limit 50
) t1
LATERAL VIEW explode(t1.relatedId)
gulivideo_orc_temp as videoId_Id
) t2
join gulivideo_orc t3
on t2.videoId_Id = t3.videoId ; -- t4
将 t4 结果中的 category 炸开
select
t4.videoId_t4 videoId_t5,
category_name_t4 category_t5
from
(
select
t2.videoId_Id videoId_t4,
t3.category category_t4
from(
select
t1.videoId ,
videoId_Id
from (
select
videoId ,
views ,
relatedId
from gulivideo_orc
order by views desc
limit 50
) t1
LATERAL VIEW explode(t1.relatedId)
gulivideo_orc_temp as videoId_Id
) t2
join gulivideo_orc t3
on t2.videoId_Id = t3.videoId
) t4
LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4 ; -- t5
根据 t5 的 category_name 进行分组 然后求统计
select
t5.category_t5 category_t6,
count(t5.videoId_t5) num
from
(
select
t4.videoId_t4 videoId_t5,
category_name_t4 category_t5
from
(
select
t2.videoId_Id videoId_t4,
t3.category category_t4
from(
select
t1.videoId ,
videoId_Id
from (
select
videoId ,
views ,
relatedId
from gulivideo_orc
order by views desc
limit 50
) t1
LATERAL VIEW explode(t1.relatedId)
gulivideo_orc_temp as videoId_Id
) t2
join gulivideo_orc t3
on t2.videoId_Id = t3.videoId
) t4
LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4
) t5
group by t5.category_t5 ; -- t6
根据 t6 中num 进行排名
select
t6.category_t6,
t6.num ,
rank() over(order by t6.num desc) as r_k
from
(
select
t5.category_t5 category_t6,
count(t5.videoId_t5) num
from
(
select
t4.videoId_t4 videoId_t5,
category_name_t4 category_t5
from
(
select
t2.videoId_Id videoId_t4,
t3.category category_t4
from(
select
t1.videoId ,
videoId_Id
from (
select
videoId ,
views ,
relatedId
from gulivideo_orc
order by views desc
limit 50
) t1
LATERAL VIEW explode(t1.relatedId)
gulivideo_orc_temp as videoId_Id
) t2
join gulivideo_orc t3
on t2.videoId_Id = t3.videoId
) t4
LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4
) t5
group by t5.category_t5
) t6 ;
需求五: 统计每个类别中的视频热度Top10,以Music为例 (热度以观看数衡量)
方式一:
获取视频类别的并炸开
select
videoId,
views,
category_name
from gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name; --> t1
2). 根据 t1 中的 category_name 分组然后求统计
select
t1.category_name,
t1.views
from
(
select
videoId,
views,
category_name
from gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name
) t1
where t1.category_name = 'Music'
order by t1.views desc
limit 10
;
+-------------------+-----------+
| t1.category_name | t1.views |
+-------------------+-----------+
| Music | 15256922 |
| Music | 11823701 |
| Music | 11672017 |
| Music | 9579911 |
| Music | 7533070 |
| Music | 6946033 |
| Music | 6935578 |
| Music | 6193057 |
| Music | 5581171 |
| Music | 5142238 |
+-------------------+-----------+
方式二:
统计类别为Music 的videoId
select
videoId,
views,
category_name
from gulivideo_orc
LATERAL VIEW explode(category) gulivideo_orc_temp as category_name
where category_name = "Music"
order by views desc
limit 10;
需求六: 统计每个类别中的视频热度Top10
1). 获取视频类别的并炸开
select
videoId,
views,
category_name
from gulivideo_orc
LATERAL VIEW explode(category) gulivideo_orc_temp as category_name ; -- t1
2). 根据 观看数 对每个类别进行排序
select
t1.videoId videoId_t2,
t1.views views_t2,
t1.category_name category_t2,
rank() over(partition by t1.category_name order by views desc) r_k
from (
select
videoId,
views,
category_name
from gulivideo_orc
LATERAL VIEW explode(category) gulivideo_orc_temp as category_name
) t1 ; -- t2
获取前10
select
t2.videoId_t2 ,
t2.views_t2 ,
t2.category_t2,
t2.r_k
from (
select
t1.videoId videoId_t2,
t1.views views_t2,
t1.category_name category_t2,
rank() over(partition by t1.category_name order by views desc) r_k
from (
select
videoId,
views,
category_name
from gulivideo_orc
LATERAL VIEW explode(category) gulivideo_orc_temp as category_name
) t1
) t2
where t2.r_k <=10 ;
每个类别都是前十名
aRNzWyD7C9o 8825788 UNA 1
jtExxsiLgPM 5320895 UNA 2
PxNNR4symuE 4033376 UNA 3
8cjTSvvoddc 3486368 UNA 4
LIhbap3FlGc 2849832 UNA 5
lCSTULqmmYE 2179562 UNA 6
UyTxWvp8upM 2106933 UNA 7
y6oXEWowirI 1666084 UNA 8
_x2-AmY8FI8 1403113 UNA 9
ICoDFooBXpU 1376215 UNA 10
RjrEQaG5jPM 2803140 Vehicles 1
cv157ZIInUk 2773979 Vehicles 2
Gyg9U1YaVk8 1832224 Vehicles 3
6GNB7xT3rNE 1412497 Vehicles 4
tth9krDtxII 1347317 Vehicles 5
46LQd9dXFRU 1262173 Vehicles 6
pdiuDXwgrjQ 1013697 Vehicles 7
kY_cDpENQLE 956665 Vehicles 8
YtxfbxGz1u4 942604 Vehicles 9
aCamHfJwSGU 847442 Vehicles 10
十五:SQL练习
需求一:
?
方式一:
一: 求出每个学生所有科目成绩的最小值
select
uid,
min(score) over(partition by uid) min_score
from score ; -- t1
二: 求出学科的平均成绩的最大值
select
subject_id,
avg(score) avg_score
from score
group by subject_id
order by avg_score desc
limit 1 ; -- t2
三: 求出每个学生所有科目成绩的最小值 大于
各个学科的平均成绩的最大值的uid
select
uid
from(
select
uid,
min(score) over(partition by uid) min_score
from score
) t1 , (
select
subject_id,
avg(score) avg_score
from score
group by subject_id
order by avg_score desc
limit 1
) t2
where t1.min_score > t2.avg_score ;
uid
1001
1001
1001
方式二 (标准方式)
求出每个学科平均成绩
select
uid,
score,
avg(score) over(partition by subject_id) avg_score
from score ; -- t1
求出每个学科平均成绩
1003 70 81.66666666666667
1002 85 81.66666666666667
1001 90 81.66666666666667
1003 70 81.66666666666667
1002 85 81.66666666666667
1001 90 81.66666666666667
1003 85 81.66666666666667
1002 70 81.66666666666667
1001 90 81.66666666666667
根据是否大于平均成绩记录flag,大于则记为0否则记为1
select
t1.uid,
if(t1.score>t1.avg_score,0,1) flag
from(
select
uid,
score,
avg(score) over(partition by subject_id) avg_score
from score
) t1 ; -- t2
1003 1
1002 0
1001 0
1003 1
1002 0
1001 0
1003 0
1002 1
1001 0
根据学生id进行分组统计flag的和,和为0则是所有学科都大于平均成绩
select
uid
from (
select
t1.uid,
if(t1.score>t1.avg_score,0,1) flag
from(
select
uid,
score,
avg(score) over(partition by subject_id) avg_score
from score
) t1
) t2
group by uid
having sum(flag)=0;
uid
1001
需求二:
?
使用SQL统计出每个用户的累积访问次数
修改数据格式
select
userId,
date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
visitCount
from
action ; -- t1
u01 2017-01 5
u02 2017-01 6
u03 2017-01 8
u04 2017-01 3
u01 2017-01 6
u01 2017-02 8
U02 2017-01 6
U01 2017-02 4
计算每人单月访问量
select
t1.userId userId_t2,
t1.mn mn_t2 ,
sum(t1.visitCount) mn_count_t2
from (
select
userId,
date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
visitCount
from
action
) t1
group by t1.userId , t1.mn ; -- t2
userid_t2 mn_t2 mn_count_t2
U01 2017-02 4
U02 2017-01 6
u01 2017-01 11
u01 2017-02 8
u02 2017-01 6
u03 2017-01 8
u04 2017-01 3
按月累计访问量
select
t2.userid_t2,
t2.mn_t2,
t2.mn_count_t2,
sum(t2.mn_count_t2) over(partition by t2.userid_t2 order by t2.mn_t2)
from (
select
t1.userId userId_t2,
t1.mn mn_t2 ,
sum(t1.visitCount) mn_count_t2
from (
select
userId,
date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn,
visitCount
from
action
) t1
group by t1.userId , t1.mn
) t2 ;
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
需求三:
有50W个京东店铺,每个顾客访客访问任何一个店铺的任何一个商品时都会产生一条访问日志,访问日志存储的表名为Visit,访客的用户id为user_id,被访问的店铺名称为shop。
?
方式一 : (不对,每家店铺应该去除相同顾客,即相同顾客算一次)
select
shop,
count(user_id) num
from visit
group by shop ;
a 9
b 6
c 4
方式二: 一家店去除了相同的顾客
select
shop,
count(distinct user_id)
from visit
group by shop ;
a 4
b 4
c 3
每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
(1)查询每个店铺被每个用户访问次数
select
shop ,
user_id ,
count(*) ct
from visit
group by shop , user_id ; -- t1
shop user_id ct
a u1 3
b u1 2
a u2 2
b u2 1
c u2 2
(2)计算每个店铺被用户访问次数排名
select
t1.shop shop_t2,
t1.user_id user_id_t2,
t1.ct ct_t2,
rank() over(partition by shop order by t1.ct) r_k
from(
select
shop ,
user_id ,
count(*) ct
from visit
group by shop , user_id
) t1 ; -- t2
(3)取每个店铺排名前3的
select
t2.shop_t2,
t2.user_id_t2,
t2.ct_t2
from (
select
t1.shop shop_t2,
t1.user_id user_id_t2,
t1.ct ct_t2,
rank() over(partition by shop order by t1.ct ) r_k
from(
select
shop ,
user_id ,
count(*) ct
from visit
group by shop , user_id
) t1
) t2
where t2.r_k <= 3 ;
a u3 1
a u2 2
a u1 3
a u5 3
b u2 1
b u5 1
b u1 2
b u4 2
c u3 1
c u6 1
c u2 2
需求四:
数据样例:2017-01-01,10029028,1000003251,33.57。
1)给出 2017年每个月的订单数、用户数、总成交金额。
select
date_format(dt,'yyyy-MM'),
count(order_id),
count(distinct user_id),
sum(amount)
from order_tab
where date_format(dt,'yyyy') = '2017'
group by date_format(dt,'yyyy-MM') ;
2)给出2017年11月的新客数(指在11月才有第一笔订单)
select
count(user_id)
from order_tab
group by user_id
having date_format(min(dt),'yyyy-MM')='2017-11' ;
需求五:
?
select
sum(user_total_count),
sum(user_total_avg_age),
sum(twice_count),
sum(twice_count_avg_age)
from
(select
0 user_total_count,
0 user_total_avg_age,
count(*) twice_count,
cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_age
from
(
select
user_id,
min(age) age
from
(select
user_id,
min(age) age
from
(
select
user_id,
age,
date_sub(dt,rk) flag
from
(
select
dt,
user_id,
min(age) age,
rank() over(partition by user_id order by dt) rk
from
user_age
group by
dt,user_id
)t1
)t2
group by
user_id,flag
having
count(*)>=2)t3
group by
user_id
)t4
union all
select
count(*) user_total_count,
cast((sum(age)/count(*)) as decimal(10,1)),
0 twice_count,
0 twice_count_avg_age
from
(
select
user_id,
min(age) age
from
user_age
group by
user_id
)t5)t6;
需求六:
select
t1.userid,
t1.paymenttime,
od.money
from
(select
userid,
min(paymenttime) paymenttime
from
ordertable
where
date_format(paymenttime,'yyyy-MM')='2017-10'
group by
userid)t1
join
ordertable od
on
t1.userid=od.userid
and
t1.paymenttime=od.paymenttime;
需求七:
?
select
ip,
interface,
count(*) ct
from
ip
where
date_format(time,'yyyy-MM-dd HH')>='2016-11-09 14'
and
date_format(time,'yyyy-MM-dd HH')<='2016-11-09 15'
and
interface='/api/user/login'
group by
ip,interface
order by
ct desc
limit 2;t1
|