[Big Data] Hive Data Analysis-002


Counting user visits on an Nginx server

Log data table

create external table if not exists tb_regexp_log(
logs string
);

Load the data into the Hive table:

load data local inpath '/home/hadoop/data/access_2013_05_31.log' overwrite into table tb_regexp_log;

(1). Count the number of requests to the server

select count(*) from tb_regexp_log;
select count(1) from tb_regexp_log;
select count(logs) from tb_regexp_log;

(2). Number of users who visited that day
Key point: implement it with a built-in function

select count(distinct substring_index(logs,' ',1)) from tb_regexp_log;
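
To see what the built-in function contributes, the sketch below first applies substring_index to a single literal log line and then reuses the same expression to rank clients by request count. The sample line is made up for illustration (it is not taken from access_2013_05_31.log), the aliases ip and hits are hypothetical, and the sketch assumes a Hive version that provides substring_index and allows SELECT without a FROM clause.

```sql
-- Hypothetical log line, only to show what the expression extracts:
-- everything before the first space, assumed here to be the client IP.
select substring_index('192.168.1.10 - - [31/May/2013:00:00:01 +0800] "GET /index.html HTTP/1.1" 200 512', ' ', 1);

-- The same expression as a grouping key: requests per client IP, busiest first.
select substring_index(logs,' ',1) as ip, count(*) as hits
from tb_regexp_log
group by substring_index(logs,' ',1)
order by hits desc
limit 10;
```

Counting distinct IPs only approximates the number of users, since several users can share one address.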

Hive managed (internal) table partitions

Create a managed partitioned table

CREATE TABLE IF NOT EXISTS pt_flow(
id              string,
phonenumber     bigint,
mac             string,
ip              string,
url             string,
tiele           string,
colum1          string,
colum2          string,
colum3          string,
upflow          int,
downflow        int
)
PARTITIONED BY (year int,month int,day int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as textfile;

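Load data into the partitioned table (creating partitions statically)
For these static partition loads it is best to use the local form. With an HDFS path, the source file is moved into the table when the first statement finishes, so it is no longer there and the second statement cannot run.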

load data local inpath '/home/hadoop/data/flow.log' into table pt_flow partition(year=2015,month=5,day=24);
load data local inpath '/home/hadoop/data/flow.log' into table pt_flow partition(year=2016,month=10,day=1);
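
For contrast, here is a sketch of the HDFS form of the same load, meant as an alternative to the local loads above rather than an extra step. The path /tmp/flow.log is an assumption; the point is only that without the local keyword Hive moves the file instead of copying it.

```sql
-- Assumed: flow.log was uploaded to HDFS beforehand, e.g. with `hdfs dfs -put`.
-- No `local` keyword: the file is moved from /tmp/flow.log into the partition directory.
load data inpath '/tmp/flow.log' into table pt_flow partition(year=2015,month=5,day=24);

-- Repeating the load for another partition would now fail,
-- because /tmp/flow.log no longer exists at its original path:
-- load data inpath '/tmp/flow.log' into table pt_flow partition(year=2016,month=10,day=1);
```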

List all partitions

show partitions pt_flow;

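What table partitions are for
A partition is essentially a subdirectory created under the table directory. Log files keep accumulating over time; when they are stored in separate directories under the table, a query that names the directory it needs reads only that directory, which is much more efficient.
In short, Hive table partitions improve query efficiency.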
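
A sketch of how this looks in practice, using the pt_flow table and one of the partitions loaded earlier. The alias total_up is hypothetical, and the directory printed depends on the warehouse configuration, so no output is shown.

```sql
-- Print the partition's metadata, including the Location of its subdirectory
-- (with default settings the path ends in year=2015/month=5/day=24).
describe formatted pt_flow partition(year=2015,month=5,day=24);

-- Filtering on the partition columns lets Hive read only that subdirectory
-- instead of scanning every file of the table.
select count(*), sum(upflow) as total_up
from pt_flow
where year=2015 and month=5 and day=24;
```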

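Hive external tables

Use case: the data to be analyzed has typically already landed, i.e. the files are already in HDFS
Create an external partitioned table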

CREATE EXTERNAL TABLE IF NOT EXISTS extpart_flow(
id              string,
phonenumber     bigint,
mac             string,
ip              string,
url            string,
tiele           string,
colum1          string,
colum2          string,
colum3          string,
upflow          int,
downflow        int
)
PARTITIONED BY (year int,month int,day int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as textfile
location '/source';

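Note: a managed (internal) table should generally not be pointed at the source files!
Add partition information to the external table (creating the partition statically):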

ALTER TABLE extpart_flow ADD PARTITION (year=2017, month=10,day=24)
LOCATION 'hdfs:///source';
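
When each day's files sit in their own directory, every partition can point at a different location. The sketch below assumes a hypothetical directory /source/2017/10/25 that already holds that day's tab-delimited files; it is not part of the original example.

```sql
-- Hypothetical layout: one HDFS directory per day under /source.
ALTER TABLE extpart_flow ADD IF NOT EXISTS PARTITION (year=2017, month=10, day=25)
LOCATION '/source/2017/10/25';

-- Confirm that the new partition is registered in the metastore.
show partitions extpart_flow;
```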

Note: in strict mode, a query on a partitioned table must filter on the partition columns!

set hive.mapred.mode=strict;
select * from extpart_flow;

Error message:

FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "extpart_flow" Table "extpart_flow"

The correct way:

select * from extpart_flow where year=2017 and month=10 and day=24;
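
A sketch of two related points, under the assumption that strict mode only requires at least one partition predicate: a filter on year alone should pass the check, and the mode can be switched back for the rest of the session.

```sql
set hive.mapred.mode=strict;

-- A predicate on a single partition column; this reads every month/day partition of 2017.
select * from extpart_flow where year=2017 limit 10;

-- Return to the non-strict behaviour for the rest of the session.
set hive.mapred.mode=nonstrict;
```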

3. Dynamic partition example
(1). Create a regular table:

create table if not exists pv(
id int,
name string,
sex string,
age int
)
row format delimited fields terminated by','
stored as textfile;

vi users.txt
1,AA,nv,19
2,BB,nvn,21
3,CC,nan,24
4,DD,nan,23
5,EE,nv,22
6,FF,nan,20
7,GG,nan,20
8,MM,nv,29
9,NN,nan,28
10,OO,nan,22
11,PP,nan,20

load data local inpath '/home/hadoop/data/users.txt' into table pv;

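(2). Create a partitioned table: partition automatically by age
Use case: use dynamic partitioning when the number of partitions is not known in advance!
The regular table was created first because the dynamically partitioned table is populated by inserting from it.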

create table if not exists pv01(
id int,
name string,
sex string
)
partitioned by(age int)
row format delimited fields terminated by'\t'
stored as textfile;

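Before inserting data dynamically, Hive must be set to "nonstrict" mode
Enable dynamic partitioning

hive> set hive.exec.dynamic.partition=true;

Set nonstrict mode

hive> set hive.exec.dynamic.partition.mode=nonstrict;

Optional: by default each node can create at most 100 dynamic partitions

hive> set hive.exec.max.dynamic.partitions.pernode=100;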
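
The per-node limit is one of several caps on dynamic partitioning. The sketch below lists two related settings. The values shown are the defaults documented for Hive, but they can differ between versions and distributions, so treat them as assumptions and check the current values first.

```sql
-- Calling set with only a name prints the current value.
set hive.exec.max.dynamic.partitions.pernode;

-- Total number of dynamic partitions one statement may create (documented default: 1000).
set hive.exec.max.dynamic.partitions=1000;

-- Total number of HDFS files all mappers/reducers of one job may create (documented default: 100000).
set hive.exec.max.created.files=100000;
```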

Partition the users table by age and store it in the partitioned table
insert into table pv01 partition(age) select id,name,sex,age from pv;

List the partitions
show partitions pv01;
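
To check the result, the sketch below reads the data back. With the sample users.txt above, one partition per distinct age is expected; the alias users is hypothetical.

```sql
-- Rows per partition; age is a partition column but can be queried like any other column.
select age, count(*) as users from pv01 group by age;

-- Reads only the age=20 partition directory.
select id, name, sex from pv01 where age=20;
```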

Dynamic partitioning: when the number of partitions is not known in advance, use dynamic partitioning!

Added: 2022-04-18 17:49:30 Updated: 2022-04-18 17:52:05
