A Detailed Introduction to Sqoop for Big Data Development

Note:
Test environment
CDH 6.3.1
Sqoop 1.4.7

1. Sqoop Overview

The Apache Sqoop (SQL-to-Hadoop) project is designed to enable efficient bulk data transfer between RDBMSs and Hadoop. With Sqoop, users can easily import data from a relational database into Hadoop and related systems (such as HBase and Hive), and can likewise extract data from Hadoop and export it back into a relational database.

Sqoop is a tool for bulk data migration between structured data stores (RDBMSs such as MySQL and Oracle) and Hadoop. Under the hood, Sqoop implements extract, transform, and load as MapReduce jobs; MapReduce's inherent characteristics provide parallelism and high fault tolerance. Compared with traditional ETL tools such as Kettle, the jobs run on the Hadoop cluster itself, which reduces the load on dedicated ETL servers, and in certain scenarios the extraction process is significantly faster.

To use Sqoop, Hadoop must be correctly installed and configured, because Sqoop relies on the local Hadoop environment to launch the MapReduce jobs; the JDBC drivers for MySQL, Oracle, and other databases must also be placed in Sqoop's lib directory (see the sketch below).
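
For example, with a plain tarball or package installation the MySQL driver jar can simply be copied into that directory. The paths and jar version below are illustrative and depend on your installation; on CDH the parcel layout differs and drivers are conventionally dropped under /var/lib/sqoop.

# Illustrative only: put the MySQL JDBC driver where Sqoop can load it
cp mysql-connector-java-5.1.48.jar /usr/lib/sqoop/lib/
# On CDH, the usual drop-in location is:
# cp mysql-connector-java-5.1.48.jar /var/lib/sqoop/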

Sqoop architecture diagram:



2. Overview of Sqoop Tools

The sqoop help command shows which tools Sqoop provides:

[root@hp2 ~]# sqoop help
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/20 17:00:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
Tool  Description
codegen  Generate code to interact with database records
create-hive-table  Copy a table definition into a Hive table
eval  Evaluate a SQL statement and display the results
export  Export an HDFS directory to a database table
help  List available commands
import  Import a table from a database to HDFS
import-all-tables  Import all tables from a database to HDFS
import-mainframe  Import mainframe datasets to HDFS
list-databases  List the available databases on a server (sketched after this table)
list-tables  List the available tables in a database
version  Display version information
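
The list-databases and list-tables tools are not demonstrated later in this article, so here is a minimal sketch reusing the same test connection as the other examples (the connection string and credentials belong to this test environment; substitute your own):

sqoop list-databases \
--connect jdbc:mysql://10.31.1.122:3306 \
--username root \
--password abc123

sqoop list-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123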

For example, to see which arguments the sqoop import tool accepts:

[root@hp2 ~]# sqoop help import
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/20 17:11:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                                       Specify JDBC
                                                              connect
                                                              string
   --connection-manager <class-name>                          Specify
                                                              connection
                                                              manager
                                                              class name
   --connection-param-file <properties-file>                  Specify
                                                              connection
                                                              parameters
                                                              file
   --driver <class-name>                                      Manually
                                                              specify JDBC
                                                              driver class
                                                              to use

**Many lines omitted here for brevity**


At minimum, you must specify --connect and --table
Arguments to mysqldump and other subprograms may be supplied
after a '--' on the command line.

3. Sqoop Tools in Detail

3.1 codegen

The codegen tool generates the Java classes that encapsulate and interpret imported records. The Java definition of a record is normally instantiated as part of the import process, but it can also be generated on its own: for example, to recreate the Java source if it has been lost, or to create a new version of the class that uses different delimiters between fields.

Since I moved to big data from traditional data warehousing and am not yet familiar with Java, I will not go into this tool in depth here; a minimal invocation is sketched below for completeness.
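
The following sketch was not run as part of these tests and the output directory is illustrative; it regenerates the record class for the emp table without performing an import:

sqoop codegen \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table emp \
--outdir /tmp/sqoop-codegen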

3.2 create-hive-table

The create-hive-table tool replicates a table definition into Hive.

3.2.1 create-hive-table arguments

Common arguments

Argument  Description
--connect <jdbc-uri>  Specify the JDBC connect string
--connection-manager <class-name>  Specify the connection manager class to use
--driver <class-name>  Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir>  Override $HADOOP_MAPRED_HOME
--help  Print usage instructions
--password-file  Set the path to a file containing the authentication password
-P  Read the password from the console
--password <password>  Set the authentication password (see the sketch after this table for safer alternatives)
--username <username>  Set the authentication username
--verbose  Print more information while working
--connection-param-file <filename>  Optional properties file that provides connection parameters
--relaxed-isolation  Set the connection transaction isolation to read uncommitted
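
Every test run below triggers the warning "Setting your password on the command-line is insecure. Consider using -P instead." A sketch of the two safer alternatives from the table above; the password file path is illustrative, and the file must contain the password with no trailing newline:

# Option 1: prompt for the password on the console
sqoop list-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P

# Option 2: read the password from a protected file
echo -n "abc123" > /root/.mysql.pwd && chmod 400 /root/.mysql.pwd
sqoop list-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password-file file:///root/.mysql.pwd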

Hive arguments

Argument  Description
--hive-home <dir>  Override $HIVE_HOME
--hive-overwrite  Overwrite existing data in the Hive table
--create-hive-table  If set, the job fails when the target Hive table already exists; false by default
--hive-table <table-name>  Name of the Hive table to create
--table  The database table whose definition is read

Output line formatting arguments

Argument  Description
--enclosed-by <char>  Sets a required field-enclosing character
--escaped-by <char>  Sets the escape character
--fields-terminated-by <char>  Sets the field separator character
--lines-terminated-by <char>  Sets the end-of-line character
--mysql-delimiters  Use MySQL's default delimiter set
--optionally-enclosed-by <char>  Sets an optional field-enclosing character

3.2.2 create-hive-table test case

Requirement: replicate the structure of the emp table in the MySQL test database to the table ods_emp in the Hive test database.

sqoop create-hive-table \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--hive-database test \
--table emp --hive-table ods_emp

Test run:

[root@hp1 ~]# 
[root@hp1 ~]# sqoop create-hive-table \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --hive-database test \
> --table emp --hive-table ods_emp
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 18:12:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 18:12:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 18:12:28 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
20/11/23 18:12:28 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
20/11/23 18:12:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/23 18:12:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
20/11/23 18:12:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
20/11/23 18:12:29 WARN hive.TableDefWriter: Column hiredate had to be cast to a less precise type in Hive
20/11/23 18:12:29 WARN hive.TableDefWriter: Column sal had to be cast to a less precise type in Hive
20/11/23 18:12:29 WARN hive.TableDefWriter: Column comm had to be cast to a less precise type in Hive
20/11/23 18:12:30 INFO hive.HiveImport: Loading uploaded data into Hive
20/11/23 18:12:30 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
********** Many lines of output omitted *******
20/11/23 18:12:33 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hp1:2181,hp3:2181,hp2:2181 sessionTimeout=1200000 watcher=org.apache.curator.ConnectionState@176555c
20/11/23 18:12:33 INFO zookeeper.ClientCnxn: Opening socket connection to server hp1/10.31.1.123:2181. Will not attempt to authenticate using SASL (unknown error)
20/11/23 18:12:33 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.31.1.123:54526, server: hp1/10.31.1.123:2181
20/11/23 18:12:33 INFO zookeeper.ClientCnxn: Session establishment complete on server hp1/10.31.1.123:2181, sessionid = 0x175e00052b51d5e, negotiated timeout = 40000
20/11/23 18:12:33 INFO state.ConnectionStateManager: State change: CONNECTED
20/11/23 18:12:33 INFO ql.Driver: Executing command(queryId=root_20201123181231_97629f38-ba50-4864-9d21-6aa2239f11a9): CREATE TABLE IF NOT EXISTS `test`.`ods_emp` ( `empno` INT, `ename` STRING, `job` STRING, `mgr` INT, `hiredate` STRING, `sal` DOUBLE, `comm` DOUBLE, `deptno` INT) COMMENT 'Imported by sqoop on 2020/11/23 18:12:29' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
20/11/23 18:12:33 INFO ql.Driver: Starting task [Stage-0:DDL] in serial mode
20/11/23 18:12:33 INFO hive.metastore: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook to org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl
20/11/23 18:12:33 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/23 18:12:33 INFO exec.DDLTask: creating table test.ods_emp on null
20/11/23 18:12:34 INFO hive.metastore: HMS client filtering is enabled.
20/11/23 18:12:34 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/23 18:12:34 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/23 18:12:34 INFO hive.metastore: Connected to metastore.
20/11/23 18:12:34 INFO ql.Driver: Completed executing command(queryId=root_20201123181231_97629f38-ba50-4864-9d21-6aa2239f11a9); Time taken: 0.515 seconds
OK
20/11/23 18:12:34 INFO ql.Driver: OK
Time taken: 2.633 seconds
20/11/23 18:12:34 INFO CliDriver: Time taken: 2.633 seconds
20/11/23 18:12:34 INFO conf.HiveConf: Using the default value passed in for log id: 2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1
20/11/23 18:12:34 INFO session.SessionState: Resetting thread name to  main
20/11/23 18:12:34 INFO conf.HiveConf: Using the default value passed in for log id: 2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1
20/11/23 18:12:34 INFO session.SessionState: Deleted directory: /tmp/hive/root/2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1 on fs with scheme hdfs
20/11/23 18:12:34 INFO session.SessionState: Deleted directory: /tmp/root/2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1 on fs with scheme file
20/11/23 18:12:34 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/23 18:12:34 INFO hive.HiveImport: Hive import complete.
20/11/23 18:12:34 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
20/11/23 18:12:34 INFO zookeeper.ZooKeeper: Session: 0x175e00052b51d5e closed
20/11/23 18:12:34 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
20/11/23 18:12:34 INFO zookeeper.ClientCnxn: EventThread shut down

Looking at the table on the Hive side:
integer columns are mapped to int, decimals to double, and varchar to string, which is all reasonable. However, the date column also ends up as string; the TableDefWriter warnings above show that these columns were cast to a less precise type, since Sqoop's default Hive mapping for date/time types is string. One way around this is --map-column-hive, sketched below.
In addition, primary and foreign key definitions are simply ignored.
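
If the string mapping for hiredate is not acceptable, the import tool described in section 3.6 accepts --map-column-hive to override the default per-column type mapping. A hedged sketch, not run here; it only changes the declared Hive type, so the imported text must still be parseable as a Hive DATE:

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table emp \
--hive-import \
--hive-database test \
--hive-table ods_emp \
--map-column-hive hiredate=DATE \
--num-mappers 1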

hive> 
    > 
    > desc ods_emp;
OK
empno                   int                                         
ename                   string                                      
job                     string                                      
mgr                     int                                         
hiredate                string                                      
sal                     double                                      
comm                    double                                      
deptno                  int                                         
Time taken: 0.078 seconds, Fetched: 8 row(s)
    > show create table ods_emp;
OK
CREATE TABLE `ods_emp`(
  `empno` int, 
  `ename` string, 
  `job` string, 
  `mgr` int, 
  `hiredate` string, 
  `sal` double, 
  `comm` double, 
  `deptno` int)
COMMENT 'Imported by sqoop on 2020/11/23 18:12:29'
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'='?', 
  'line.delim'='\n', 
  'serialization.format'='?') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/test.db/ods_emp'
TBLPROPERTIES (
  'transient_lastDdlTime'='1606126354')
Time taken: 0.107 seconds, Fetched: 24 row(s)
hive> 

3.3 eval

The eval tool lets you run a SQL statement against the database and inspect the result before using it in a real job.

3.3.1 eval arguments

Common arguments

Argument  Description
--connect <jdbc-uri>  Specify the JDBC connect string
--connection-manager <class-name>  Specify the connection manager class to use
--driver <class-name>  Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir>  Override $HADOOP_MAPRED_HOME
--help  Print usage instructions
--password-file  Set the path to a file containing the authentication password
-P  Read the password from the console
--password <password>  Set the authentication password
--username <username>  Set the authentication username
--verbose  Print more information while working
--connection-param-file <filename>  Optional properties file that provides connection parameters
--relaxed-isolation  Set the connection transaction isolation to read uncommitted

SQL evaluation arguments

Argument  Description
-e,--query <statement>  Execute the given SQL statement and display the results

3.3.2 eval test cases

Testing --query

sqoop eval \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--query "select * from emp"

Test run:

[root@hp1 ~]# sqoop eval \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --query "select * from emp"
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 18:28:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 18:28:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 18:28:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
----------------------------------------------------------------------------------
| empno | ename      | job       | mgr  | hiredate   | sal       | comm      | deptno | 
----------------------------------------------------------------------------------
| 7369 | SMITH      | CLERK     | 7902 | 1980-12-17 | 800.00    | (null)    | 20 | 
| 7499 | ALLEN      | SALESMAN  | 7698 | 1981-02-20 | 1600.00   | 300.00    | 30 | 
| 7521 | WARD       | SALESMAN  | 7698 | 1981-02-22 | 1250.00   | 500.00    | 30 | 
| 7566 | JONES      | MANAGER   | 7839 | 1981-04-02 | 2975.00   | (null)    | 20 | 
| 7654 | MARTIN     | SALESMAN  | 7698 | 1981-09-28 | 1250.00   | 1400.00   | 30 | 
| 7698 | BLAKE      | MANAGER   | 7839 | 1981-05-01 | 2850.00   | (null)    | 30 | 
| 7782 | CLARK      | MANAGER   | 7839 | 1981-06-09 | 2450.00   | (null)    | 10 | 
| 7788 | SCOTT      | ANALYST   | 7566 | 1987-06-13 | 3000.00   | (null)    | 20 | 
| 7839 | KING       | PRESIDENT | (null) | 1981-11-17 | 5000.00   | (null)    | 10 | 
| 7844 | TURNER     | SALESMAN  | 7698 | 1981-09-08 | 1500.00   | 0.00      | 30 | 
| 7876 | ADAMS      | CLERK     | 7788 | 1987-06-13 | 1100.00   | (null)    | 20 | 
| 7900 | JAMES      | CLERK     | 7698 | 1981-12-03 | 950.00    | (null)    | 30 | 
| 7902 | FORD       | ANALYST   | 7566 | 1981-12-03 | 3000.00   | (null)    | 20 | 
| 7934 | MILLER     | CLERK     | 7782 | 1982-01-23 | 1300.00   | (null)    | 10 | 
----------------------------------------------------------------------------------
[root@hp1 ~]# 

Testing -e

sqoop eval \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
-e "insert into emp(empno) values (1)"

Test run:

[root@hp1 ~]# sqoop eval \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> -e "insert into emp(empno) values (1)"
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 18:32:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 18:32:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 18:32:26 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/23 18:32:26 INFO tool.EvalSqlTool: 1 row(s) updated.
[root@hp1 ~]# 

3.4 export

The export tool pushes data from HDFS into a relational database. The target table must already exist in the database; the input files are read and parsed into a set of records according to the user-specified delimiters.

3.4.1 export arguments

Common arguments

Argument  Description
--connect <jdbc-uri>  Specify the JDBC connect string
--connection-manager <class-name>  Specify the connection manager class to use
--driver <class-name>  Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir>  Override $HADOOP_MAPRED_HOME
--help  Print usage instructions
--password-file  Set the path to a file containing the authentication password
-P  Read the password from the console
--password <password>  Set the authentication password
--username <username>  Set the authentication username
--verbose  Print more information while working
--connection-param-file <filename>  Optional properties file that provides connection parameters
--relaxed-isolation  Set the connection transaction isolation to read uncommitted

Validation arguments

Argument  Description
--validate  Enable validation of the copied data; only a single-table copy is supported
--validator <class-name>  Specify the validator class to use
--validation-threshold <class-name>  Specify the validation threshold class to use
--validation-failurehandler <class-name>  Specify the validation failure handler class to use

Export control arguments

Argument  Description
--columns <col,col,col…>  Columns to export from the table
--direct  Use the direct export fast path
--export-dir <dir>  HDFS source path for the export
-m,--num-mappers <n>  Use n map tasks to export in parallel
--table <table-name>  Table to populate
--call <stored-proc-name>  Stored procedure to call
--update-key <col-name>  Anchor column(s) to use for updates; separate multiple columns with commas
--update-mode <mode>  How updates are performed when rows do not match existing keys (updateonly by default, or allowinsert)
--input-null-string <null-string>  String to be interpreted as null for string columns
--input-null-non-string <null-string>  String to be interpreted as null for non-string columns
--staging-table <staging-table-name>  Table in which data is staged before being inserted into the destination table (see the sketch after this table)
--clear-staging-table  Indicates that any data present in the staging table can be deleted
--batch  Use batch mode for the underlying statement execution
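
--staging-table can be used to make an export all-or-nothing: rows are first written to the staging table and only moved into the destination table in a single final transaction. A minimal sketch, assuming an empty MySQL table t1_stage with the same structure as t1 has been created beforehand (staging cannot be combined with --update-key, --call, or --direct):

sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--staging-table t1_stage \
--clear-staging-table \
--export-dir /user/hive/warehouse/test.db/t1 \
--num-mappers 1 \
--fields-terminated-by '\0001'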

Input parsing arguments

Argument  Description
--input-enclosed-by <char>  Sets a required field-enclosing character for input
--input-escaped-by <char>  Sets the input escape character
--input-fields-terminated-by <char>  Sets the input field separator
--input-lines-terminated-by <char>  Sets the input end-of-line character
--input-optionally-enclosed-by <char>  Sets an optional field-enclosing character for input

Output line formatting arguments

Argument  Description
--enclosed-by <char>  Sets a required field-enclosing character
--escaped-by <char>  Sets the escape character
--fields-terminated-by <char>  Sets the field separator character
--lines-terminated-by <char>  Sets the end-of-line character
--mysql-delimiters  Use MySQL's default delimiter set
--optionally-enclosed-by <char>  Sets an optional field-enclosing character

Code generation arguments

Argument  Description
--bindir <dir>  Output directory for compiled objects
--class-name <name>  Sets the name of the generated class
--jar-file <file>  Disable code generation; use the specified jar instead
--outdir <dir>  Output directory for generated code
--package-name <name>  Package for auto-generated classes
--map-column-java <m>  Override the default SQL-to-Java type mapping for the configured columns

3.4.2 export test cases

Sqoop's export command supports inserting into and updating a relational database, but it does not support merge; I have tested this myself, and passing --merge-key id raises an error.
Export does, however, support calling a stored procedure, so after the data has been pushed over it can be post-processed by a MySQL stored procedure. Two hedged sketches of these options follow.
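
Although merge is not supported, an upsert can be approximated with --update-mode allowinsert (whether allowinsert is honoured depends on the connector), and --call replaces --table when each record should be handed to a stored procedure. Both sketches are illustrative; sp_load_t1 is a hypothetical procedure taking the same columns as t1.

# Upsert-style export: update matching rows, insert the rest (connector permitting)
sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--update-key id \
--update-mode allowinsert \
--num-mappers 1 \
--fields-terminated-by '\0001'

# Call a stored procedure for each record instead of inserting into a table
sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--call sp_load_t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--num-mappers 1 \
--fields-terminated-by '\0001'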

3.4.2.1 Insert case: exporting a Hive table to MySQL

Data preparation:

-- Hive table data
hive> 
    > 
    > select * from t1;
OK
1       abc
2       def

-- MySQL data
mysql> select * from t1; 
Empty set (0.00 sec)

mysql> 

Code:
--fields-terminated-by '\0001' had to be written in exactly this form; the other spellings I tried raised errors (still to be investigated). It corresponds to Hive's default field delimiter ^A (\001), which is what the table created above uses.

sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--num-mappers 1 \
--fields-terminated-by '\0001'

Test run:

[root@hp1 ~]# sqoop export \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table t1 \
> --export-dir /user/hive/warehouse/test.db/t1 \
> --num-mappers 1 \
> --fields-terminated-by '\0001'
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 19:22:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 19:22:04 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 19:22:04 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/23 19:22:04 INFO tool.CodeGenTool: Beginning code generation
20/11/23 19:22:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/23 19:22:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/23 19:22:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/23 19:22:06 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/fab279df6cf018b9c6c4a9b4d9508e39/t1.java to /root/./t1.java. Error: Destination '/root/./t1.java' already exists
20/11/23 19:22:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/fab279df6cf018b9c6c4a9b4d9508e39/t1.jar
20/11/23 19:22:06 INFO mapreduce.ExportJobBase: Beginning export of t1
20/11/23 19:22:06 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/23 19:22:07 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
20/11/23 19:22:07 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/23 19:22:07 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/23 19:22:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/23 19:22:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0024
20/11/23 19:22:09 INFO input.FileInputFormat: Total input files to process : 2
20/11/23 19:22:09 INFO input.FileInputFormat: Total input files to process : 2
20/11/23 19:22:09 INFO mapreduce.JobSubmitter: number of splits:1
20/11/23 19:22:09 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/23 19:22:09 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/23 19:22:09 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/23 19:22:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0024
20/11/23 19:22:09 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/23 19:22:10 INFO conf.Configuration: resource-types.xml not found
20/11/23 19:22:10 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/23 19:22:10 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0024
20/11/23 19:22:10 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0024/
20/11/23 19:22:10 INFO mapreduce.Job: Running job: job_1605780958086_0024
20/11/23 19:22:16 INFO mapreduce.Job: Job job_1605780958086_0024 running in uber mode : false
20/11/23 19:22:16 INFO mapreduce.Job:  map 0% reduce 0%
20/11/23 19:22:21 INFO mapreduce.Job:  map 100% reduce 0%
20/11/23 19:22:21 INFO mapreduce.Job: Job job_1605780958086_0024 completed successfully
20/11/23 19:22:21 INFO mapreduce.Job: Counters: 33
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=247166
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=241
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2732
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=2732
                Total vcore-milliseconds taken by all map tasks=2732
                Total megabyte-milliseconds taken by all map tasks=2797568
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Input split bytes=223
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=65
                CPU time spent (ms)=930
                Physical memory (bytes) snapshot=235925504
                Virtual memory (bytes) snapshot=2592751616
                Total committed heap usage (bytes)=193462272
                Peak Map Physical memory (bytes)=235925504
                Peak Map Virtual memory (bytes)=2592751616
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=0
20/11/23 19:22:21 INFO mapreduce.ExportJobBase: Transferred 241 bytes in 13.9456 seconds (17.2815 bytes/sec)
20/11/23 19:22:21 INFO mapreduce.ExportJobBase: Exported 2 records.
[root@hp1 ~]# 

Verify the data.
The export succeeded: the table was empty before the job and contains the two Hive rows afterwards.

mysql> select * from t1;
Empty set (0.00 sec)

mysql> select * from t1;
+------+------+
| id   | name |
+------+------+
|    1 | abc  |
|    2 | def  |
+------+------+
2 rows in set (0.00 sec)

3.4.2.2 Update case: exporting a Hive table to MySQL

Data preparation:

-- Hive data
hive> 
    > 
    > select * from t1;
OK
1       abc
2       def
-- MySQL data
mysql> select * from t1;
+------+------+
| id   | name |
+------+------+
|    1 | a    |
|    3 | b    |
+------+------+
2 rows in set (0.00 sec)

Code:

sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--update-key id \
--num-mappers 1 \
--fields-terminated-by '\0001'

Test run:

[root@hp1 ~]# sqoop export \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table t1 \
> --export-dir /user/hive/warehouse/test.db/t1 \
> --update-key id \
> --num-mappers 1 \
> --fields-terminated-by '\0001'
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/24 15:00:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/24 15:00:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/24 15:00:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/24 15:00:54 INFO tool.CodeGenTool: Beginning code generation
20/11/24 15:00:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/24 15:00:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/24 15:00:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/24 15:00:56 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/a3298f025b919a783d8aeba0fcfdd07c/t1.java to /root/./t1.java. Error: Destination '/root/./t1.java' already exists
20/11/24 15:00:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/a3298f025b919a783d8aeba0fcfdd07c/t1.jar
20/11/24 15:00:56 INFO mapreduce.ExportJobBase: Beginning export of t1
20/11/24 15:00:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/24 15:00:57 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
20/11/24 15:00:57 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/24 15:00:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/24 15:00:57 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/24 15:00:58 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0028
20/11/24 15:00:59 INFO input.FileInputFormat: Total input files to process : 2
20/11/24 15:00:59 INFO input.FileInputFormat: Total input files to process : 2
20/11/24 15:00:59 INFO mapreduce.JobSubmitter: number of splits:1
20/11/24 15:00:59 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/24 15:00:59 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/24 15:00:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/24 15:00:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0028
20/11/24 15:00:59 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/24 15:00:59 INFO conf.Configuration: resource-types.xml not found
20/11/24 15:00:59 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/24 15:01:00 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0028
20/11/24 15:01:00 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0028/
20/11/24 15:01:00 INFO mapreduce.Job: Running job: job_1605780958086_0028
20/11/24 15:01:06 INFO mapreduce.Job: Job job_1605780958086_0028 running in uber mode : false
20/11/24 15:01:06 INFO mapreduce.Job:  map 0% reduce 0%
20/11/24 15:01:11 INFO mapreduce.Job:  map 100% reduce 0%
20/11/24 15:01:11 INFO mapreduce.Job: Job job_1605780958086_0028 completed successfully
20/11/24 15:01:11 INFO mapreduce.Job: Counters: 33
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=247485
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=241
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3094
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=3094
                Total vcore-milliseconds taken by all map tasks=3094
                Total megabyte-milliseconds taken by all map tasks=3168256
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Input split bytes=223
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=60
                CPU time spent (ms)=1010
                Physical memory (bytes) snapshot=243896320
                Virtual memory (bytes) snapshot=2593943552
                Total committed heap usage (bytes)=193462272
                Peak Map Physical memory (bytes)=243896320
                Peak Map Virtual memory (bytes)=2593943552
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=0
20/11/24 15:01:11 INFO mapreduce.ExportJobBase: Transferred 241 bytes in 13.9585 seconds (17.2655 bytes/sec)
20/11/24 15:01:11 INFO mapreduce.ExportJobBase: Exported 2 records.
[root@hp1 ~]# 

Verify the data.
The update succeeded: the row with id = 1 was updated to the Hive value, while the Hive row with id = 2 was not inserted, because --update-key with the default --update-mode (updateonly) only updates rows that already exist.

mysql> select * from t1;
+------+------+
| id   | name |
+------+------+
|    1 | abc  |
|    3 | b    |
+------+------+
2 rows in set (0.00 sec)

3.5 help

The help tool prints usage information for a command.
For example, if I do not know the syntax of sqoop import and want to look it up:

[root@hp1 ~]# sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>     Specify JDBC connect string
   --connection-manager <class-name>     Specify connection manager class to use
   --driver <class-name>    Manually specify JDBC driver class to use
   --hadoop-mapred-home <dir>            Override $HADOOP_MAPRED_HOME
   --help                   Print usage instructions
   --password-file          Set path for file containing authentication password
   -P                       Read password from console
   --password <password>    Set authentication password
   --username <username>    Set authentication username
   --verbose                Print more information while working
   --hadoop-home <dir>      Deprecated. Override $HADOOP_HOME

Import control arguments:
   --as-avrodatafile             Imports data to Avro Data Files
   --as-sequencefile             Imports data to SequenceFiles
   --as-textfile                 Imports data as plain text (default)
   --as-parquetfile              Imports data to Parquet Data Files
...

3.6 import

sqoop import is the most frequently used tool; it copies data from a relational database into HDFS.
Each row of the source table is treated as an individual record and stored as text files, Avro data files, or SequenceFiles.

3.6.1 sqoop import arguments

Common arguments

Argument  Description
--connect <jdbc-uri>  Specify the JDBC connect string
--connection-manager <class-name>  Specify the connection manager class to use
--driver <class-name>  Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir>  Override $HADOOP_MAPRED_HOME
--help  Print usage instructions
--password-file  Set the path to a file containing the authentication password
-P  Read the password from the console
--password <password>  Set the authentication password
--username <username>  Set the authentication username
--verbose  Print more information while working
--connection-param-file <filename>  Optional properties file that provides connection parameters
--relaxed-isolation  Set the connection transaction isolation to read uncommitted

Validation arguments

Argument  Description
--validate  Enable validation of the copied data; only a single-table copy is supported
--validator <class-name>  Specify the validator class to use
--validation-threshold <class-name>  Specify the validation threshold class to use
--validation-failurehandler <class-name>  Specify the validation failure handler class to use

Import control arguments

Argument  Description
--append  Append data to an existing dataset in HDFS
--as-avrodatafile  Import data to Avro data files
--as-sequencefile  Import data to SequenceFiles
--as-textfile  Import data as plain text (default)
--as-parquetfile  Import data to Parquet files
--boundary-query <statement>  Boundary query used to create splits
--columns <col,col,col…>  Columns to import from the table
--delete-target-dir  Delete the import target directory if it already exists
--direct  Use the database-specific direct connector if one exists
--fetch-size <n>  Number of entries to read from the database in one batch
--inline-lob-limit <n>  Set the maximum size for an inline LOB column
-m,--num-mappers <n>  Use n map tasks to import in parallel; 4 tasks are used by default
-e,--query <statement>  Import the results of the given SQL statement (see the sketch after this table)
--split-by <column-name>  Column of the table used to split work units
--split-limit <n>  Upper limit on the size of each split; applies only to integer and date columns, and for date or timestamp fields the limit is calculated in seconds
--autoreset-to-one-mapper  Use a single mapper if the table has no primary key and no split column is provided; cannot be combined with --split-by
--table <table-name>  Table to read
--target-dir <dir>  HDFS destination directory
--temporary-rootdir <dir>  HDFS directory for temporary files created during the import
--warehouse-dir <dir>  Parent directory in HDFS for table storage
--where <where clause>  WHERE clause used to filter rows during import
-z,--compress  Enable compression
--compression-codec <c>  Hadoop codec to use (gzip by default)
--null-string <null-string>  String to write for a null value in string columns
--null-non-string <null-string>  String to write for a null value in non-string columns
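
When -e/--query is used instead of --table, Sqoop requires the literal token $CONDITIONS in the WHERE clause (each mapper replaces it with its own split predicate), an explicit --target-dir, and --split-by whenever more than one mapper is used. A hedged sketch against the same test database; the target directory is illustrative:

# Single quotes keep the shell from expanding $CONDITIONS
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--query 'select empno, ename, sal from emp where sal > 1000 and $CONDITIONS' \
--split-by empno \
--target-dir /user/root/emp_query \
--num-mappers 2 \
--fields-terminated-by '\0001'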

Output line formatting arguments

Argument  Description
--enclosed-by <char>  Sets a required field-enclosing character
--escaped-by <char>  Sets the escape character
--fields-terminated-by <char>  Sets the field separator character
--lines-terminated-by <char>  Sets the end-of-line character
--mysql-delimiters  Use MySQL's default delimiter set
--optionally-enclosed-by <char>  Sets an optional field-enclosing character

Delimiters can also be written as escape sequences, e.g. --fields-terminated-by \t.
Supported escape characters:
1. \b (backspace)
2. \n (newline)
3. \r (carriage return)
4. \t (tab)
5. \" (double-quote)
6. \' (single-quote)
7. \\ (backslash)
8. \0 (NUL), which inserts NUL characters between fields or lines, or disables enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments

The octal representation of a UTF-8 character code point is also supported: --fields-terminated-by \001 would yield the ^A character.

Input parsing arguments

Argument  Description
--input-enclosed-by <char>  Sets a required field-enclosing character for input
--input-escaped-by <char>  Sets the input escape character
--input-fields-terminated-by <char>  Sets the input field separator
--input-lines-terminated-by <char>  Sets the input end-of-line character
--input-optionally-enclosed-by <char>  Sets an optional field-enclosing character for input

Hive arguments

Argument  Description
--hive-home <dir>  Override $HIVE_HOME
--hive-import  Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite  Overwrite existing data in the Hive table
--create-hive-table  If set, the job fails when the target Hive table already exists; false by default
--hive-table <table-name>  Sets the Hive table name to use when importing
--hive-drop-import-delims  Drop \n, \r, and \01 from string fields when importing into Hive
--hive-delims-replacement  Replace \n, \r, and \01 in string fields with a user-defined replacement string when importing into Hive
--hive-partition-key  Name of the Hive partition column
--hive-partition-value <v>  String value to use as the partition key value for imports into Hive in this job (see the sketch after this table)
--map-column-hive <map>  Override the default SQL-to-Hive type mapping for the configured columns; if a comma is needed in the mapping, URL-encode the key and value, e.g. DECIMAL(1%2C%201) instead of DECIMAL(1,1)
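
The two partition arguments are typically used together to land an import in a single static Hive partition. A hedged sketch, not run here: the partition column dt and its value are illustrative, and the partition column must not also be a column of the source table.

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table emp \
--hive-import \
--hive-database test \
--hive-table ods_emp_pt \
--hive-partition-key dt \
--hive-partition-value '2020-11-25' \
--fields-terminated-by '\0001' \
--num-mappers 1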

3.6.2 Hive import test case

Data preparation:

-- MySQL
mysql> select count(*) from fact_sale;
+-----------+
| count(*)  |
+-----------+
| 767830000 |
+-----------+
1 row in set (2 min 32.75 sec)

mysql> desc fact_sale;
+-----------+-------------+------+-----+-------------------+-----------------------------+
| Field     | Type        | Null | Key | Default           | Extra                       |
+-----------+-------------+------+-----+-------------------+-----------------------------+
| id        | bigint(8)   | NO   | PRI | NULL              | auto_increment              |
| sale_date | timestamp   | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| prod_name | varchar(50) | YES  |     | NULL              |                             |
| sale_nums | int(11)     | YES  |     | NULL              |                             |
+-----------+-------------+------+-----+-------------------+-----------------------------+
4 rows in set (0.01 sec)

-- No table needs to be created on the Hive side beforehand

Code:

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table fact_sale \
--fields-terminated-by '\0001' \
--delete-target-dir \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-table ods_fact_sale \
--hive-overwrite

Test run:

Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 09:49:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 09:49:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 09:49:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 09:49:54 INFO tool.CodeGenTool: Beginning code generation
20/11/25 09:49:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/11/25 09:49:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/11/25 09:49:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/25 09:49:56 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/ac811b1a59e599927528fdb832a4a853/fact_sale.java to /root/./fact_sale.java. Error: Destination '/root/./fact_sale.ja
va' already exists
20/11/25 09:49:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/ac811b1a59e599927528fdb832a4a853/fact_sale.jar
20/11/25 09:49:56 INFO tool.ImportTool: Destination directory fact_sale is not present, hence not deleting.
20/11/25 09:49:56 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/11/25 09:49:56 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/11/25 09:49:56 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/11/25 09:49:56 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/11/25 09:49:56 INFO mapreduce.ImportJobBase: Beginning import of fact_sale
20/11/25 09:49:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 09:49:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 09:49:57 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/25 09:49:57 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0032
20/11/25 09:49:59 INFO db.DBInputFormat: Using read commited transaction isolation
20/11/25 09:49:59 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `fact_sale`
20/11/25 09:49:59 INFO db.IntegerSplitter: Split size: 196905399; Num splits: 4 from: 1 to: 787621597
20/11/25 09:49:59 INFO mapreduce.JobSubmitter: number of splits:4
20/11/25 09:49:59 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/25 09:49:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/25 09:49:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0032
20/11/25 09:49:59 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 09:49:59 INFO conf.Configuration: resource-types.xml not found
20/11/25 09:49:59 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 09:49:59 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0032
20/11/25 09:49:59 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0032/
20/11/25 09:49:59 INFO mapreduce.Job: Running job: job_1605780958086_0032
20/11/25 09:50:06 INFO mapreduce.Job: Job job_1605780958086_0032 running in uber mode : false
20/11/25 09:50:06 INFO mapreduce.Job:  map 0% reduce 0%
20/11/25 09:56:28 INFO mapreduce.Job:  map 25% reduce 0%
20/11/25 09:57:08 INFO mapreduce.Job:  map 50% reduce 0%
20/11/25 10:03:40 INFO mapreduce.Job:  map 75% reduce 0%
20/11/25 10:04:07 INFO mapreduce.Job:  map 100% reduce 0%
20/11/25 10:04:07 INFO mapreduce.Job: Job job_1605780958086_0032 completed successfully
20/11/25 10:04:07 INFO mapreduce.Job: Counters: 33
**Some output omitted**
0/11/25 10:04:11 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hp1:2181,hp3:2181,hp2:2181 sessionTimeout=1200000 watcher=org.apache.curator.ConnectionState@615bad16
20/11/25 10:04:11 INFO zookeeper.ClientCnxn: Opening socket connection to server hp3/10.31.1.125:2181. Will not attempt to authenticate using SASL (unknown error)
20/11/25 10:04:11 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.31.1.123:41634, server: hp3/10.31.1.125:2181
20/11/25 10:04:11 INFO zookeeper.ClientCnxn: Session establishment complete on server hp3/10.31.1.125:2181, sessionid = 0x275e0004e5929e2, negotiated timeout = 40000
20/11/25 10:04:11 INFO state.ConnectionStateManager: State change: CONNECTED
20/11/25 10:04:11 INFO ql.Driver: Executing command(queryId=root_20201125100408_f35214cb-e680-4e38-9833-7cb096592131): CREATE TABLE IF NOT EXISTS `test`.`ods_fact_sale` ( `id` BIGINT, `sale_date` STRING, 
`prod_name` STRING, `sale_nums` INT) COMMENT 'Imported by sqoop on 2020/11/25 10:04:07' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
20/11/25 10:04:11 INFO ql.Driver: Completed executing command(queryId=root_20201125100408_f35214cb-e680-4e38-9833-7cb096592131); Time taken: 0.006 seconds
OK
20/11/25 10:04:11 INFO ql.Driver: OK
Time taken: 2.319 seconds
20/11/25 10:04:11 INFO CliDriver: Time taken: 2.319 seconds
20/11/25 10:04:11 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:11 INFO session.SessionState: Resetting thread name to  main
20/11/25 10:04:11 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:11 INFO session.SessionState: Updating thread name to c9680e77-741b-4fe8-a882-55ac6ae6ee79 main
20/11/25 10:04:11 INFO ql.Driver: Compiling command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0): 
LOAD DATA INPATH 'hdfs://nameservice1/user/root/fact_sale' OVERWRITE INTO TABLE `test`.`ods_fact_sale`
20/11/25 10:04:11 INFO ql.Driver: Semantic Analysis Completed
20/11/25 10:04:11 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
20/11/25 10:04:11 INFO ql.Driver: Completed compiling command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0); Time taken: 0.232 seconds
20/11/25 10:04:11 INFO ql.Driver: Executing command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0): 
LOAD DATA INPATH 'hdfs://nameservice1/user/root/fact_sale' OVERWRITE INTO TABLE `test`.`ods_fact_sale`
20/11/25 10:04:11 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
20/11/25 10:04:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
Loading data to table test.ods_fact_sale
20/11/25 10:04:11 INFO exec.Task: Loading data to table test.ods_fact_sale from hdfs://nameservice1/user/root/fact_sale
20/11/25 10:04:11 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:04:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:04:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:04:11 INFO hive.metastore: Connected to metastore.
20/11/25 10:04:11 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale/part-m-00000' to trash at: hdfs://nameservice1/user/root/.Trash/Current/user/hive/wareho
use/test.db/ods_fact_sale/part-m-00000
20/11/25 10:04:11 INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale
20/11/25 10:04:11 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
20/11/25 10:04:11 INFO exec.StatsTask: Executing stats task
20/11/25 10:04:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:04:11 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:04:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:04:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:04:11 INFO hive.metastore: Connected to metastore.
20/11/25 10:04:12 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:04:12 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:04:12 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:04:12 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:04:12 INFO hive.metastore: Connected to metastore.
20/11/25 10:04:12 INFO exec.StatsTask: Table test.ods_fact_sale stats: [numFiles=4, numRows=0, totalSize=31421093662, rawDataSize=0, numFilesErasureCoded=0]
20/11/25 10:04:12 INFO ql.Driver: Completed executing command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0); Time taken: 0.897 seconds
OK
20/11/25 10:04:12 INFO ql.Driver: OK
Time taken: 1.154 seconds
20/11/25 10:04:12 INFO CliDriver: Time taken: 1.154 seconds
20/11/25 10:04:12 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:12 INFO session.SessionState: Resetting thread name to  main
20/11/25 10:04:12 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:12 INFO session.SessionState: Deleted directory: /tmp/hive/root/c9680e77-741b-4fe8-a882-55ac6ae6ee79 on fs with scheme hdfs
20/11/25 10:04:12 INFO session.SessionState: Deleted directory: /tmp/root/c9680e77-741b-4fe8-a882-55ac6ae6ee79 on fs with scheme file
20/11/25 10:04:12 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:04:12 INFO hive.HiveImport: Hive import complete.
20/11/25 10:04:12 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
20/11/25 10:04:12 INFO zookeeper.ZooKeeper: Session: 0x275e0004e5929e2 closed
20/11/25 10:04:12 INFO zookeeper.ClientCnxn: EventThread shut down
20/11/25 10:04:12 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.

3.7 import-all-tables

sqoop import-all-tables imports every table of a database into HDFS, so the tables do not have to be imported one by one.

3.7.1 import-all-tables arguments

Common arguments

Argument  Description
--connect <jdbc-uri>  Specify the JDBC connect string
--connection-manager <class-name>  Specify the connection manager class to use
--driver <class-name>  Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir>  Override $HADOOP_MAPRED_HOME
--help  Print usage instructions
--password-file  Set the path to a file containing the authentication password
-P  Read the password from the console
--password <password>  Set the authentication password
--username <username>  Set the authentication username
--verbose  Print more information while working
--connection-param-file <filename>  Optional properties file that provides connection parameters
--relaxed-isolation  Set the connection transaction isolation to read uncommitted

Import control arguments

Argument  Description
--as-avrodatafile  Import data to Avro data files
--as-sequencefile  Import data to SequenceFiles
--as-textfile  Import data as plain text (default)
--as-parquetfile  Import data to Parquet files
--direct  Use the database-specific direct connector if one exists
--inline-lob-limit <n>  Set the maximum size for an inline LOB column
-m,--num-mappers <n>  Use n map tasks to import in parallel; 4 tasks are used by default
--warehouse-dir <dir>  Parent directory in HDFS for table storage
-z,--compress  Enable compression
--compression-codec <c>  Hadoop codec to use (gzip by default)
--exclude-tables <tables>  Comma-separated list of tables to exclude from the import
--autoreset-to-one-mapper  Use a single mapper when a table without a primary key is encountered

Output line formatting arguments

Argument  Description
--enclosed-by <char>  Sets a required field-enclosing character
--escaped-by <char>  Sets the escape character
--fields-terminated-by <char>  Sets the field separator character
--lines-terminated-by <char>  Sets the end-of-line character
--mysql-delimiters  Use MySQL's default delimiter set
--optionally-enclosed-by <char>  Sets an optional field-enclosing character

Delimiters can also be written as escape sequences, e.g. --fields-terminated-by \t.
Supported escape characters:
1. \b (backspace)
2. \n (newline)
3. \r (carriage return)
4. \t (tab)
5. \" (double-quote)
6. \' (single-quote)
7. \\ (backslash)
8. \0 (NUL), which inserts NUL characters between fields or lines, or disables enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments

The octal representation of a UTF-8 character code point is also supported: --fields-terminated-by \001 would yield the ^A character.

Input parsing arguments

Argument  Description
--input-enclosed-by <char>  Sets a required field-enclosing character for input
--input-escaped-by <char>  Sets the input escape character
--input-fields-terminated-by <char>  Sets the input field separator
--input-lines-terminated-by <char>  Sets the input end-of-line character
--input-optionally-enclosed-by <char>  Sets an optional field-enclosing character for input

Hive arguments

Argument  Description
--hive-home <dir>  Override $HIVE_HOME
--hive-import  Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite  Overwrite existing data in the Hive table
--create-hive-table  If set, the job fails when the target Hive table already exists; false by default
--hive-table <table-name>  Sets the Hive table name to use when importing
--hive-drop-import-delims  Drop \n, \r, and \01 from string fields when importing into Hive
--hive-delims-replacement  Replace \n, \r, and \01 in string fields with a user-defined replacement string when importing into Hive
--hive-partition-key  Name of the Hive partition column
--hive-partition-value <v>  String value to use as the partition key value for imports into Hive in this job
--map-column-hive <map>  Override the default SQL-to-Hive type mapping for the configured columns; if a comma is needed in the mapping, URL-encode the key and value, e.g. DECIMAL(1%2C%201) instead of DECIMAL(1,1)

3.7.2 import-all-tables test case

Requirement: import every table in the MySQL test database except fact_sale and t1 into the Hive test database.

Data preparation:

mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| bonus          |
| dept           |
| emp            |
| fact_sale      |
| salgrade       |
| t1             |
+----------------+
6 rows in set (0.00 sec)

Code:

sqoop import-all-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--exclude-tables fact_sale,t1 \
--autoreset-to-one-mapper \
--fields-terminated-by '\0001' \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-overwrite

Test run:
Because some tables have no primary key, running with --num-mappers 4 alone fails with the error shown below.
Adding --autoreset-to-one-mapper makes Sqoop fall back to a single mapper for any table without a primary key.

[root@hp1 ~]# sqoop import-all-tables \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --exclude-tables fact_sale,t1 \
> --fields-terminated-by '\0001' \
> --num-mappers 4 \
> --hive-import \
> --hive-database test \
> --hive-overwrite
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 10:40:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 10:40:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 10:40:36 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 10:40:37 INFO tool.CodeGenTool: Beginning code generation
20/11/25 10:40:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:40:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:40:37 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/25 10:40:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ef9bc1769dcf7d49b3848d6a55144fa/bonus.jar
20/11/25 10:40:38 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/11/25 10:40:38 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/11/25 10:40:38 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/11/25 10:40:38 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/11/25 10:40:38 ERROR tool.ImportAllTablesTool: Error during import: No primary key could be found for table bonus. Please specify one with --split-by or perform a sequential import with '-m 1'.
[root@hp1 ~]# 

Test record after the adjustment:

Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 10:47:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 10:47:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 10:47:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 10:47:06 INFO tool.CodeGenTool: Beginning code generation
20/11/25 10:47:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:47:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:47:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/25 10:47:07 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/7e08dfd7df6bcae92b216d3c68d5a4f3/bonus.java to /root/./bonus.java. Error: Destination '/root/./bonus.java' already exists
20/11/25 10:47:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/7e08dfd7df6bcae92b216d3c68d5a4f3/bonus.jar
20/11/25 10:47:07 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/11/25 10:47:07 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/11/25 10:47:07 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/11/25 10:47:07 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/11/25 10:47:07 WARN manager.SqlManager: Split by column not provided or can't be inferred.  Resetting to one mapper
20/11/25 10:47:07 INFO mapreduce.ImportJobBase: Beginning import of bonus
20/11/25 10:47:07 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 10:47:08 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 10:47:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/25 10:47:08 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0037
20/11/25 10:47:10 INFO db.DBInputFormat: Using read commited transaction isolation
20/11/25 10:47:10 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 10:47:10 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/25 10:47:10 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/25 10:47:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0037
20/11/25 10:47:10 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 10:47:10 INFO conf.Configuration: resource-types.xml not found
20/11/25 10:47:10 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 10:47:10 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0037
20/11/25 10:47:10 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0037/
20/11/25 10:47:10 INFO mapreduce.Job: Running job: job_1605780958086_0037
20/11/25 10:47:16 INFO mapreduce.Job: Job job_1605780958086_0037 running in uber mode : false
20/11/25 10:47:16 INFO mapreduce.Job:  map 0% reduce 0%
20/11/25 10:47:22 INFO mapreduce.Job:  map 100% reduce 0%
20/11/25 10:47:22 INFO mapreduce.Job: Job job_1605780958086_0037 completed successfully
**(intermediate output omitted)**
20/11/25 10:48:21 INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test.db/salgrade
20/11/25 10:48:21 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
20/11/25 10:48:21 INFO exec.StatsTask: Executing stats task
20/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:48:21 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:48:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:48:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:48:21 INFO hive.metastore: Connected to metastore.
20/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:48:21 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:48:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:48:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:48:21 INFO hive.metastore: Connected to metastore.
20/11/25 10:48:21 INFO exec.StatsTask: Table test.salgrade stats: [numFiles=1, numRows=0, totalSize=59, rawDataSize=0, numFilesErasureCoded=0]
20/11/25 10:48:21 INFO ql.Driver: Completed executing command(queryId=root_20201125104820_ad1fb49d-6166-41cb-9113-91074fb2ea16); Time taken: 0.549 seconds
OK
20/11/25 10:48:21 INFO ql.Driver: OK
Time taken: 0.588 seconds
20/11/25 10:48:21 INFO CliDriver: Time taken: 0.588 seconds
20/11/25 10:48:21 INFO conf.HiveConf: Using the default value passed in for log id: 1e735270-9c28-4658-aad6-fc816a42e377
20/11/25 10:48:21 INFO session.SessionState: Resetting thread name to  main
20/11/25 10:48:21 INFO conf.HiveConf: Using the default value passed in for log id: 1e735270-9c28-4658-aad6-fc816a42e377
20/11/25 10:48:21 INFO session.SessionState: Deleted directory: /tmp/hive/root/1e735270-9c28-4658-aad6-fc816a42e377 on fs with scheme hdfs
20/11/25 10:48:21 INFO session.SessionState: Deleted directory: /tmp/root/1e735270-9c28-4658-aad6-fc816a42e377 on fs with scheme file
20/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:48:21 INFO hive.HiveImport: Hive import complete.
Skipping table: t1
20/11/25 10:48:21 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
20/11/25 10:48:21 INFO zookeeper.ZooKeeper: Session: 0x275e0004e592a1e closed
20/11/25 10:48:21 INFO zookeeper.ClientCnxn: EventThread shut down
20/11/25 10:48:21 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
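
After the job completes, the imported tables can be spot-checked from Hive. A minimal verification sketch (the row counts naturally depend on the source data):

hive -e "use test; show tables; select count(*) from emp;"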

3.8 The import-mainframe tool

sqoop import-mainframe imports datasets from a mainframe server to HDFS. There is no test environment for it here, so it is skipped.

3.9 The list-databases tool

sqoop list-databases lists the databases available on a server.

3.9.1 list-databases arguments

Common arguments

Argument Description
--connect <jdbc-uri> Specify the JDBC connect string
--connection-manager <class-name> Specify the connection manager class to use
--driver <class-name> Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
--help Print usage instructions
--password-file Set the path to a file containing the authentication password
-P Read the password from the console
--password <password> Set the authentication password
--username <username> Set the authentication username
--verbose Print more information while working
--connection-param-file <filename> Optional properties file that provides connection parameters
--relaxed-isolation Set the connection transaction isolation to read uncommitted

3.9.2 list-databases test record

Code:

sqoop list-databases \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 

Test record:

[root@hp1 ~]# sqoop list-databases \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 10:58:57 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 10:58:57 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 10:58:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
mysql
performance_schema
sys
test

3.10 The list-tables tool

The sqoop list-tables command lists the tables in a database.

3.10.1 list-tables arguments

Common arguments

Argument Description
--connect <jdbc-uri> Specify the JDBC connect string
--connection-manager <class-name> Specify the connection manager class to use
--driver <class-name> Manually specify the JDBC driver class to use
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
--help Print usage instructions
--password-file Set the path to a file containing the authentication password
-P Read the password from the console
--password <password> Set the authentication password
--username <username> Set the authentication username
--verbose Print more information while working
--connection-param-file <filename> Optional properties file that provides connection parameters
--relaxed-isolation Set the connection transaction isolation to read uncommitted

3.10.2 list-tables test case

Code:

sqoop list-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123

Test record:

[root@hp1 ~]# sqoop list-tables \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 11:07:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 11:07:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 11:07:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
bonus
dept
emp
fact_sale
salgrade
t1

3.11 The version tool

The sqoop version tool displays version information.

Test record:

[root@hp1 ~]# sqoop version
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 11:09:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
Sqoop 1.4.7-cdh6.3.1
git commit id 
Compiled by jenkins on Thu Sep 26 02:57:53 PDT 2019
[root@hp1 ~]# 

4. Syncing a relational database table to HDFS with Sqoop

In a real production environment, the relational table is usually synced to HDFS first, and the HDFS data is then loaded into the Hive table.
In testing, the performance was the same as importing directly into Hive.

# Create the table structure on the target side
# This step is very fast
sqoop create-hive-table \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--hive-database default \
--table fact_sale --hive-table ods_fact_sale

# Sync the MySQL table data to HDFS
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table fact_sale \
--delete-target-dir \
--fields-terminated-by '\0001' \
--num-mappers 4 \
--target-dir /tmp/fact_sale 

-- Load the HDFS data into the Hive table (run in Hive)
-- This step is very fast
LOAD DATA INPATH '/tmp/fact_sale' into table ods_fact_sale;

Test record:

[root@hp1 tmp]# sqoop import \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table fact_sale \
> --delete-target-dir \
> --fields-terminated-by '\0001' \
> --num-mappers 4 \
> --target-dir /tmp/fact_sale
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/07 15:13:42 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/12/07 15:13:42 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/12/07 15:13:42 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/12/07 15:13:42 INFO tool.CodeGenTool: Beginning code generation
20/12/07 15:13:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/12/07 15:13:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/12/07 15:13:42 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/12/07 15:13:44 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/76fc9602ec3165207e0dd51e2115cc14/fact_sale.java to /tmp/./fact_sale.java. Error: Destination '/tmp/./fact_sale.java' already exists
20/12/07 15:13:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/76fc9602ec3165207e0dd51e2115cc14/fact_sale.jar
20/12/07 15:13:44 INFO tool.ImportTool: Destination directory /tmp/fact_sale deleted.
20/12/07 15:13:44 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/12/07 15:13:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/12/07 15:13:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/12/07 15:13:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/12/07 15:13:45 INFO mapreduce.ImportJobBase: Beginning import of fact_sale
20/12/07 15:13:45 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/12/07 15:13:45 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/12/07 15:13:45 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1606698967173_0119
20/12/07 15:13:47 INFO db.DBInputFormat: Using read commited transaction isolation
20/12/07 15:13:47 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `fact_sale`
20/12/07 15:13:47 INFO db.IntegerSplitter: Split size: 196905399; Num splits: 4 from: 1 to: 787621597
20/12/07 15:13:47 INFO mapreduce.JobSubmitter: number of splits:4
20/12/07 15:13:47 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/12/07 15:13:47 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/12/07 15:13:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606698967173_0119
20/12/07 15:13:48 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/12/07 15:13:48 INFO conf.Configuration: resource-types.xml not found
20/12/07 15:13:48 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/12/07 15:13:48 INFO impl.YarnClientImpl: Submitted application application_1606698967173_0119
20/12/07 15:13:48 INFO mapreduce.Job: The url to track the job: http://hp1:8088/proxy/application_1606698967173_0119/
20/12/07 15:13:48 INFO mapreduce.Job: Running job: job_1606698967173_0119
20/12/07 15:13:55 INFO mapreduce.Job: Job job_1606698967173_0119 running in uber mode : false
20/12/07 15:13:55 INFO mapreduce.Job:  map 0% reduce 0%
20/12/07 15:20:19 INFO mapreduce.Job:  map 25% reduce 0%
20/12/07 15:20:52 INFO mapreduce.Job:  map 50% reduce 0%
20/12/07 15:27:27 INFO mapreduce.Job:  map 75% reduce 0%
20/12/07 15:27:56 INFO mapreduce.Job:  map 100% reduce 0%
20/12/07 15:27:56 INFO mapreduce.Job: Job job_1606698967173_0119 completed successfully
20/12/07 15:27:56 INFO mapreduce.Job: Counters: 33
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=990012
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=441
                HDFS: Number of bytes written=31421093662
                HDFS: Number of read operations=24
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=8
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=1647540
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=1647540
                Total vcore-milliseconds taken by all map tasks=1647540
                Total megabyte-milliseconds taken by all map tasks=1687080960
        Map-Reduce Framework
                Map input records=767830000
                Map output records=767830000
                Input split bytes=441
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=15214
                CPU time spent (ms)=1769770
                Physical memory (bytes) snapshot=1753706496
                Virtual memory (bytes) snapshot=10405761024
                Total committed heap usage (bytes)=1332215808
                Peak Map Physical memory (bytes)=464699392
                Peak Map Virtual memory (bytes)=2621968384
        File Input Format Counters 
                Bytes Read=0
        File Output Format Counters 
                Bytes Written=31421093662
20/12/07 15:27:56 INFO mapreduce.ImportJobBase: Transferred 29.2632 GB in 851.8971 seconds (35.175 MB/sec)
20/12/07 15:27:56 INFO mapreduce.ImportJobBase: Retrieved 767830000 records.
[root@hp1 tmp]# 


hive> 
    > 
    > LOAD DATA INPATH '/tmp/fact_sale' into table ods_fact_sale;
Loading data to table default.ods_fact_sale
OK
Time taken: 0.947 seconds
hive> 
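
Note that LOAD DATA INPATH moves the files out of /tmp/fact_sale into the table's warehouse directory rather than copying them. A quick sanity check, assuming nothing else wrote to the table, is to compare the Hive row count with the 767,830,000 records reported by the import counters above:

-- Run in Hive; the result should match the record count reported by the Sqoop import
select count(*) from ods_fact_sale;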

5. Incremental sync with Sqoop

In a real production environment, a full refresh consumes a lot of storage space and keeps too much redundant data, so incremental updates are used instead; Sqoop supports incremental imports.

5.1 Import modes

  1. append
  2. lastmodified, which must be combined with --append (append) or --merge-key (merge, usually the primary key)

5.2 Test cases

Test data preparation:

-- MySQL 
create table t_test(id int,name varchar(200),last_update_datetime timestamp,primary key(id) );
insert into t_test values (1,'abc','2020-12-01 00:09:00');
insert into t_test values (2,'def','2020-12-02 00:10:00'); 

Sync the data to Hive:

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table t_test

Check the Hive table data:

hive> 
    > select * from t_test;
OK
1       abc     2020-12-01 00:09:00.0
2       def     2020-12-02 00:10:00.0
Time taken: 0.452 seconds, Fetched: 2 row(s)

Create a Hive backup table to make it easy to roll back during testing:

create table t_test_bak as select * from t_test;
hive> 
    > select * from t_test_bak;
OK
1       abc     2020-12-01 00:09:00.0
2       def     2020-12-02 00:10:00.0
Time taken: 0.059 seconds, Fetched: 2 row(s)
hive> 

-- To roll back, truncate the table first and then insert
truncate table t_test;
insert into t_test select * from t_test_bak;

5.2.1 Importing in append mode

Specify append mode when importing a table where new rows are continually added with increasing row id values. Use --check-column to specify the column containing the row id. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

This mode only handles inserted rows, e.g. syncing log-style data; it does not pick up updates to existing rows.

Modify the MySQL source table:
insert two new rows and update the row with id = 2

mysql> insert into t_test values (3,'ccc','2020-12-07 00:09:00');
Query OK, 1 row affected (0.01 sec)

mysql> insert into t_test values (4,'ddd','2020-12-07 00:10:00'); 
Query OK, 1 row affected (0.00 sec)

mysql> update t_test set name='bbb',last_update_datetime = '2020-12-07 08:00:00' where id = 2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from t_test;
+----+------+----------------------+
| id | name | last_update_datetime |
+----+------+----------------------+
|  1 | abc  | 2020-12-01 00:09:00  |
|  2 | bbb  | 2020-12-07 08:00:00  |
|  3 | ccc  | 2020-12-07 00:09:00  |
|  4 | ddd  | 2020-12-07 00:10:00  |
+----+------+----------------------+
4 rows in set (0.00 sec)

Sqoop incremental sync:
--target-dir is the HDFS path where the table is stored
--last-value is the maximum id value from the previous sync

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--incremental append \
--check-column id \
--last-value 2 \
-m 1

Data verification:
the two newly inserted rows were synced, but the updated row with id = 2 was not

hive> 
    > 
    > select * from t_test;
OK
1       abc     2020-12-01 00:09:00.0
2       def     2020-12-02 00:10:00.0
3       ccc     2020-12-07 00:09:00.0
4       ddd     2020-12-07 00:10:00.0
Time taken: 0.1 seconds, Fetched: 4 row(s)

Restore the Hive table data,
then sync again using the last_update_datetime timestamp

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--incremental append \
--check-column last_update_datetime \
--last-value '2020-12-02 00:10:00' \
-m 1

Verify the data:
the row with id = 2 is now duplicated

hive> 
    > 
    > select * from t_test;
OK
1       abc     2020-12-01 00:09:00.0
2       def     2020-12-02 00:10:00.0
2       bbb     2020-12-07 08:00:00.0
3       ccc     2020-12-07 00:09:00.0
4       ddd     2020-12-07 00:10:00.0
Time taken: 0.067 seconds, Fetched: 5 row(s)

5.2.2 Importing in lastmodified mode

Another table-update strategy supported by Sqoop is called lastmodified mode. Use it when rows of the source table may be updated, and each such update sets the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the one specified with --last-value are imported.

At the end of an incremental import, the value that should be specified as --last-value for the next import is printed to the screen. When running a subsequent import, specify --last-value in this way to ensure that only new or updated data is imported. This can be handled automatically by creating the incremental import as a saved job, which is the preferred mechanism for performing recurring incremental imports.

Original text from the official documentation:

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.

5.2.2.1 --incremental lastmodified with --append

This mode also performs incremental inserts only.

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2020-12-02 00:10:00' \
--m 1 \
--append

Test data:
all rows newer than the given last-value timestamp were inserted, so the updated row with id = 2 now appears twice

hive> 
    > 
    > select * from t_test;
OK
1       abc     2020-12-01 00:09:00.0
2       def     2020-12-02 00:10:00.0
2       bbb     2020-12-07 08:00:00.0
3       ccc     2020-12-07 00:09:00.0
4       ddd     2020-12-07 00:10:00.0
Time taken: 0.07 seconds, Fetched: 5 row(s)

5.2.2.2 --incremental lastmodified with --merge-key

--merge-key works like an Oracle MERGE statement: rows that already exist are updated and new rows are inserted.

sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2020-12-02 00:10:00' \
--m 1 \
--merge-key id

Test data:
the row with id = 2 was updated, and rows 3 and 4 were inserted successfully.
For business tables that carry a last-update timestamp, this method keeps the data warehouse consistent with the source data.

    > select * from t_test;
OK
1       abc     2020-12-01 00:09:00.0
2       bbb     2020-12-07 08:00:00.0
3       ccc     2020-12-07 00:09:00.0
4       ddd     2020-12-07 00:10:00.0
Time taken: 0.064 seconds, Fetched: 4 row(s)
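
For recurring runs, the saved-job mechanism mentioned in the documentation excerpt in 5.2.2 is the preferred approach, because the Sqoop metastore then records the new --last-value automatically after each successful execution. A hedged sketch, with the job name incr_t_test chosen purely for illustration (note the mandatory space between -- and the import keyword; Sqoop prompts for the database password at execution time unless a password file is configured):

# Create the saved job; the initial --last-value only seeds the first run
sqoop job --create incr_t_test -- import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2020-12-02 00:10:00' \
--merge-key id \
-m 1

# Run it periodically; Sqoop updates the stored --last-value after every successful run
sqoop job --exec incr_t_test

# Inspect saved jobs and their stored parameters
sqoop job --list
sqoop job --show incr_t_test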

References

1. Hadoop: The Definitive Guide
2. http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html
