一、创建表
官网创建HIVE表定义如下:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];
- 默认创建内部表,内部表和外部表的区别在于:hive能删除内部表的数据,但是不能删除外部表的数据 ;
- hive支持的一些特殊数据类型如下:
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
- 对于行的格式定义如下:
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
- hive创建表的时候默认的文件格式是textfile,无压缩类型,可自行定义,文件格式如下:
file_format:
: SEQUENCEFILE
| TEXTFILE
| RCFILE
| ORC
| PARQUET
| AVRO
| JSONFILE
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
- 创建hive表时,还可手动设置数据存储路径。
location hdfs_path
- 创建表的实例如下:
create external table if not exists student(
name string,
grade string,
teacher array<string>,
scores map<string, int>,
address struct<street:string, city:string>
)
partitioned by(day string,hour string)
clustered by(id) into 4 buckets
row format delimited fields terminated by ','
collection items terminated by '?'
map keys terminated by ':'
lines terminated by '\n'
stored as orc
location '/student';
以下官网中的实例也可进行参考:
CREATE TABLE new_key_value_store
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFile
AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
二、删除表
- 使用drop删除
DROP TABLE [IF EXISTS] table_name [PURGE];
DROP TABLE删除该表的元数据和数据。如果配置了Trash(并且没有指定PURGE),数据实际上会移动到.Trash/Current目录。元数据完全丢失。 需要注意的是:在hive 4.X以后,设置 external.table.purge=true,也可以删除外部表数据了。
When dropping an EXTERNAL table, data in the table will NOT be deleted from the file system.
Starting Hive 4.0.0 ( HIVE-19981 - Managed tables converted to external tables by the
HiveStrictManagedMigration utility should be set to delete data when the table is dropped
RESOLVED ) setting table property external.table.purge=true, will also delete the data.
- 使用truncate清空表数据。
TRUNCATE [TABLE] table_name [PARTITION partition_spec];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
使用该参数删除表数据,不会删除表结构,还可以通过设置分区值,删除指定分区内的数据。
三、修改表
修改表的语法如下:
ALTER TABLE table_name RENAME TO new_table_name;
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
INTO num_buckets BUCKETS;
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/part080808'
PARTITION (dt='2008-08-09', country='us') location '/part080809';
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
[IGNORE PROTECTION] [PURGE];
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');
ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
Example:
CREATE TABLE test_change (a int, b int, c int);
ALTER TABLE test_change CHANGE a a1 INT;
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
ALTER TABLE test_change CHANGE c c1 INT FIRST;
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';
ALTER TABLE table_name
[PARTITION partition_spec]
ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
[CASCADE|RESTRICT]
参考文档:hive官方文档–DDL
四、查询数据
查询语句的基本规则如下:
[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
having语句一般和group by 语句一起使用,和where的区别主要是having可以使用聚合函数作为条件,如下所示:
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10;
limit语句可以使用limit n展示n条数据,也可以使用 limit start,n 表示从start开始,到start+n条数据,结果默认从第0行开始。
SELECT * FROM customers ORDER BY create_date LIMIT 5;
SELECT * FROM customers ORDER BY create_date LIMIT 2,5;
Order, Sort, Cluster, and Distribute By的区别
order by:是对查询出的结果进行全部排序,在严格模式下(hive.mapred.mode=strict),必须和limit一起使用,使用order by的字句最终肯定只有一个reducer,默认是升序(asc); 在Hive 2.1.0及以后版本中,支持在“order by”子句中为null值排序。ASC默认是null在最前,而DESC默认排在最后。 sort by:sort by在reducer之前就会排序,默认的排序方式是按照数字大小和字典编撰顺序进行排序,在hive3.0.0及以后,在子查询和view中,使用没有limit的sort by会被优化器移除,要想使用可通过设置参数hive.remove.orderby.in.subquery=false进行使用。
In Hive 3.0.0 and later, sort by without limit in subqueries and views will be removed by the
optimizer. To disable it, set hive.remove.orderby.in.subquery to false.
order by和sort by的区别:Hive支持SORT BY,按照每个reducer对数据进行排序。order by和sort by之间的区别是,前者保证输出中的总顺序,而后者只保证reducer中的行顺序。如果有多个reducer,sort by排序可能会给出部分有序的最终结果。
distribute by:distribute by 的分区规则是根据分区字段的 hash 码与 reduce 的个数进行模除后, 余数相同的分到一个区,Hive 要求 DISTRIBUTE BY 语句要写在 SORT BY 语句之前。
cluster by:当 distribute by 和 sorts by 字段相同时,可以使用 cluster by 方式。cluster by 除了具有 distribute by 的功能外还兼具 sort by 的功能。但是排序只能是升序排序,不能指定排序规则为 ASC 或者 DESC。
参考文档:hive官方文档–Sort by
============================= 未完,待续 ===================================
|