[Big Data] Introduction to Sqoop and Basic Usage Examples


I. Introduction to Sqoop

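Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
On May 6, 2021, the Apache board announced the termination of the Apache Sqoop project. Sqoop can still be used, but there will be no further official bug fixes. Because Sqoop is practical and widely deployed, it is still worth understanding how to use it.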

II. How Sqoop Works

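Sqoop translates each import or export command into a MapReduce program. The generated MapReduce job mainly customizes the InputFormat and OutputFormat.
In Sqoop, "import" means transferring data from a system outside the big data cluster (an RDBMS) into the big data cluster (HDFS, Hive, HBase), using the import keyword. "Export" is the reverse direction, from the cluster back to the RDBMS, using the export keyword.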

III. Common Syntax Examples

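---------- Export data from HDFS to MySQL ----------

# -m 1 starts one map task for the export (the default is 4).
# --fields-terminated-by sets the field delimiter of the HDFS files. Sqoop's default delimiter is a comma, so '\t' must be given explicitly for tab-separated data.
sqoop export \
--connect jdbc:mysql://hostname:port/database \
--username username \
--password password \
--table table_name \
-m 1 \
--export-dir /...path_to_files_in_hdfs \
--fields-terminated-by '\t'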
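
An export map task fails when it meets a row it cannot parse, so it can help to check the field count of the data before exporting. The following is a minimal sketch of such a check, not part of Sqoop itself. The sample rows, the temporary file, and the `expected_fields` value are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data: two well-formed rows and one row missing a field.
sample=$(mktemp)
printf '1\thenry\t2021-06-01\n2\tpola\t2021-06-01\n3\tariel\n' > "$sample"

# Assumed number of columns in the target MySQL table.
expected_fields=3

# Collect the line numbers whose tab-separated field count differs from the expected one.
bad_lines=$(awk -F'\t' -v n="$expected_fields" 'NF != n { print NR }' "$sample")

if [ -n "$bad_lines" ]; then
  echo "malformed lines: $bad_lines"
else
  echo "all lines have $expected_fields fields"
fi
rm -f "$sample"
```

On a real cluster the same `awk` filter could read from `hdfs dfs -cat` on the export directory instead of a local sample file.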

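---------- Import data from MySQL to HDFS ----------

//========== Full import ==========
# --password is the password used to connect to the remote database.
# --target-dir is the HDFS directory where the files are saved, down to the last folder level.
sqoop import \
--connect jdbc:mysql://hostname:port/database \
--username username \
--password password \
--table table_name \
-m 1 \
--delete-target-dir \
--target-dir /...hdfs_save_path \
--fields-terminated-by '\t' \
--lines-terminated-by '\n'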


//========== Column projection ==========
# --columns lists the columns to import.
sqoop import \
--connect jdbc:mysql://your_hostname:port/database \
--username username \
--password password \
--table table_name \
--columns col1,col2,... \
-m 1 \
--delete-target-dir \
--target-dir /...hdfs_save_path \
--fields-terminated-by '\t' \
--lines-terminated-by '\n'


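//======= Row and column filtering + multiple mappers ===
# --where gives the row condition, for example "id>12" (quoted so the shell does not treat > as a redirect).
# -m 2 starts two map tasks; --split-by names the column used to divide the rows between them.
# --target-dir is the path down to the last folder level.
sqoop import \
--connect jdbc:mysql://your_hostname:port/database \
--username db_username \
--password password \
--table table_name \
--columns col1,col2,... \
--where "id>12" \
-m 2 \
--split-by column_name \
--delete-target-dir \
--target-dir /.../.../... \
--fields-terminated-by ',' \
--lines-terminated-by '\n'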
//======= The following form achieves the same result
sqoop import \
--connect jdbc:mysql://your_hostname:port/database \
--username db_username \
--password password \
--query "select ... from ... where ... and \$CONDITIONS" \
-m 2 \
--split-by column_name \
--delete-target-dir \
--target-dir /.../.../... \
--fields-terminated-by ',' \
--lines-terminated-by '\n'
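
To make the role of `--split-by` and `$CONDITIONS` more concrete: Sqoop looks up the minimum and maximum of the split column and gives each map task its own range condition, which takes the place of `$CONDITIONS` in a `--query` import. The sketch below only approximates that idea for an integer column. The function `split_ranges` and the values 1, 100 and 2 are hypothetical, and Sqoop's real splitter may place the boundaries differently.

```shell
#!/bin/sh
# Illustrative approximation only: divide [min, max] of an integer split column
# into n contiguous ranges, one per map task.
split_ranges() {
  col=$1; min=$2; max=$3; n=$4
  step=$(( (max - min) / n ))
  i=0
  while [ "$i" -lt "$n" ]; do
    lo=$(( min + i * step ))
    if [ "$i" -eq $(( n - 1 )) ]; then
      # The last range is closed so the maximum value is included.
      echo "$col >= $lo AND $col <= $max"
    else
      echo "$col >= $lo AND $col < $(( lo + step ))"
    fi
    i=$(( i + 1 ))
  done
}

# Hypothetical column "id" with values from 1 to 100, split across 2 mappers.
split_ranges id 1 100 2
```

A split column with evenly distributed values keeps the map tasks balanced; a heavily skewed column leaves one task with most of the rows.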

//========== Incremental import: append|merge
# --incremental append imports only rows whose check column is greater than --last-value.
# --last-value is the final check-column value from the previous import (48 here, with stuId as the check column).
sqoop import \
--connect jdbc:mysql://your_hostname:port/database \
--username db_username \
--password password \
--query "select ... from updated_table where \$CONDITIONS" \
-m 1 \
--target-dir /.../.../... \
--fields-terminated-by ',' \
--lines-terminated-by '\n' \
--check-column stuId \
--incremental append \
--last-value 48
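
For recurring incremental imports, the `--last-value` has to be carried from one run to the next. Sqoop prints the value to use for the following run when an incremental import finishes, and a saved job created with `sqoop job --create` keeps track of it automatically. The sketch below shows the manual alternative under simple assumptions: the state file location is hypothetical, and 60 is a made-up number standing in for the value reported by the finished run.

```shell
#!/bin/sh
# Hypothetical state file holding the check-column value of the previous run.
state_file=$(mktemp)
echo 48 > "$state_file"

# Read the saved value and build the incremental arguments from it.
last_value=$(cat "$state_file")
incr_args="--check-column stuId --incremental append --last-value $last_value"
echo "sqoop import ... $incr_args"

# After a successful import, store the new maximum of the check column
# (60 is a made-up value used only for this illustration).
echo 60 > "$state_file"
```

Updating the state file only after the import succeeds means a failed run is retried from the same starting value.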


//========= Incremental import: lastmodified =======
// This mode selects rows by their last-modified time, so the table must contain a timestamp column.
sqoop import \
--connect jdbc:mysql://your_hostname:port/database \
--username db_username \
--password password \
--query "select ... from updated_table where \$CONDITIONS" \
-m 1 \
--target-dir /.../.../... \
--fields-terminated-by ',' \
--lines-terminated-by '\n' \
--check-column col_name \
--incremental lastmodified \
--append \
--last-value '2021-06-29 12:26:08'

//========== Importing a single partition into a partitioned table ===========
# MySQL table
create table sqp_partition(
id int auto_increment primary key,
name varchar(20),
dotime datetime
);
insert into sqp_partition(name,dotime) values
('henry','2021-06-01 12:14:14'),
('pola','2021-06-01 12:44:14'),
('ariel','2021-06-01 12:54:14'),
('rose','2021-06-01 13:14:14'),
('jack','2021-06-01 13:33:14');
# Hive table
create table sqp_partition(
id int,
name string,
dotime timestamp
)
partitioned by (dodate date)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;
# Sqoop command
# --target-dir points at the partition directory under the Hive table's data location.
sqoop import \
--connect jdbc:mysql://your_hostname:port/database \
--username db_username \
--password password \
--table sqp_partition \
--where "cast(dotime as date)='2021-06-01'" \
-m 1 \
--delete-target-dir \
--target-dir /user/hive/warehouse/test.db/sqp_partition/dodate=2021-06-01 \
--fields-terminated-by ',' \
--lines-terminated-by '\n'
# Hive statement
alter table sqp_partition add partition(dodate='2021-06-01');
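
In the single-partition import above, the same date appears in three places: the `--where` filter, the partition directory, and the `alter table` statement. A minimal sketch of deriving all three from one variable follows, so that they cannot drift apart. The variable names are hypothetical; the table name, warehouse path and date are the ones used above.

```shell
#!/bin/sh
# Partition value and Hive table location taken from the example above.
dodate='2021-06-01'
warehouse_dir='/user/hive/warehouse/test.db/sqp_partition'

# Derive the row filter, the partition directory and the Hive statement from the same date.
where_clause="cast(dotime as date)='${dodate}'"
target_dir="${warehouse_dir}/dodate=${dodate}"
add_partition_sql="alter table sqp_partition add partition(dodate='${dodate}');"

echo "--where \"$where_clause\""
echo "--target-dir $target_dir"
echo "$add_partition_sql"
```

Changing `dodate` then updates the Sqoop arguments and the Hive statement together.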

//============ Partitioned import ============
create table sqp_user_par1(
stuId int,
stuName string,
stuAge int,
mobile string,
tutition decimal(10,2),
fkClassId string
)
partitioned by (id_range string)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;

sqoop import \
--connect jdbc:mysql://your_hostname:port/database \
--username db_username \
--password password \
--table studentinfo \
--where "stuId between 1 and 20" \
-m 1 \
--hive-import \
--hive-table test.sqp_user_par1 \
--hive-partition-key id_range \
--hive-partition-value '1-20' \
--fields-terminated-by ',' \
--lines-terminated-by '\n'

Added: 2021-07-11 16:42:24  Updated: 2021-07-11 16:43:48
