Note:
Test environment:
CDH 6.3.1
Sqoop 1.4.7
1. Sqoop Overview
The Apache Sqoop (SQL-to-Hadoop) project is designed to enable efficient bulk data transfer between RDBMSs and Hadoop. With Sqoop you can easily import data from a relational database into Hadoop and its related systems (such as HBase and Hive), and likewise extract data from Hadoop and export it back into a relational database.
Sqoop is a tool for bulk data migration between structured data stores (RDBMSs such as MySQL and Oracle) and Hadoop. Under the hood, Sqoop implements extract, transform and load as MapReduce jobs, so parallelism and fault tolerance come for free. Compared with traditional ETL tools such as Kettle, the work runs on the Hadoop cluster, which reduces the load on a dedicated ETL server; in certain scenarios the extraction process can be much faster.
To use Sqoop, Hadoop must be installed and configured correctly, because Sqoop relies on the local Hadoop environment to launch its MapReduce jobs. The JDBC drivers for MySQL, Oracle, and so on also have to be placed in Sqoop's lib directory.
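For MySQL, for example, the driver jar might be dropped in like this (a minimal sketch; the CDH parcel path and jar version are assumptions for this environment):
# copy the MySQL JDBC driver into Sqoop's lib directory (parcel layout and jar version assumed)
cp mysql-connector-java-5.1.48.jar /opt/cloudera/parcels/CDH/lib/sqoop/lib/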
Sqoop architecture diagram:
2. Sqoop Tools Overview
Running sqoop help shows which tools Sqoop provides:
[root@hp2 ~]# sqoop help
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/20 17:00:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
Tool | Description |
---|---|
codegen | Generate code to interact with database records |
create-hive-table | Copy a table definition into a Hive table |
eval | Evaluate a SQL statement and display the results |
export | Export an HDFS directory to a database table |
help | List available commands |
import | Import a table from a database to HDFS |
import-all-tables | Import all tables from a database to HDFS |
import-mainframe | Import datasets from a mainframe server to HDFS |
list-databases | List available databases on a server |
list-tables | List available tables in a database |
version | Display version information |
For example, to see which arguments the sqoop import tool takes:
[root@hp2 ~]# sqoop help import
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/20 17:11:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC
connect
string
--connection-manager <class-name> Specify
connection
manager
class name
--connection-param-file <properties-file> Specify
connection
parameters
file
--driver <class-name> Manually
specify JDBC
driver class
to use
**Many lines omitted here for brevity**
At minimum, you must specify --connect and --table
Arguments to mysqldump and other subprograms may be supplied
after a '--' on the command line.
3. Sqoop Tools in Detail
3.1 codegen
The codegen tool generates the Java classes that encapsulate and interpret imported records. The Java definition of a record is normally instantiated as part of the import process, but it can also be generated separately: for example, to recreate the class if the Java source has been lost, or to produce a new version of the class that uses different delimiters between fields.
Since I moved to big data from traditional data warehousing and am not yet familiar with Java, I will not go into it in depth here.
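For reference, a minimal codegen invocation might look like the sketch below, reusing the test connection from this article; the --outdir and --bindir paths are illustrative assumptions:
sqoop codegen \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--table emp \
--outdir /tmp/sqoop_codegen/src \
--bindir /tmp/sqoop_codegen/bin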
3.2 create-hive-table
The create-hive-table tool copies a table definition from the relational database into Hive: it creates the Hive table structure only, without moving any data.
3.2.1 create-hive-table arguments
Common arguments
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify the JDBC connect string |
--connection-manager <class-name> | Specify the connection manager class to use |
--driver <class-name> | Manually specify the JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password (see the sketch after this table) |
-P | Read the password from the console |
--password <password> | Set the authentication password |
--username <username> | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
--relaxed-isolation | Set the connection isolation level to read uncommitted data |
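The test logs below repeatedly warn that putting --password on the command line is insecure. A hedged alternative is to read the password interactively with -P, or from a restricted file with --password-file; the file path here is an assumption, and the file must not contain a trailing newline (hence echo -n):
# store the password in a file only the current user can read (illustrative local path)
echo -n "abc123" > /root/.mysql.pwd
chmod 400 /root/.mysql.pwd
sqoop create-hive-table \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password-file file:///root/.mysql.pwd \
--hive-database test \
--table emp --hive-table ods_emp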
Hive arguments
Argument | Description |
---|---|
--hive-home <dir> | Override $HIVE_HOME |
--hive-overwrite | Overwrite existing data in the Hive table |
--create-hive-table | If set, the job fails when the target Hive table already exists; defaults to false |
--hive-table <table-name> | The Hive table name to create |
--table | The database table to read the definition from |
Output line formatting arguments
Argument | Description |
---|---|
--enclosed-by <char> | Sets a required field-enclosing character |
--escaped-by <char> | Sets the escape character |
--fields-terminated-by <char> | Sets the field separator character |
--lines-terminated-by <char> | Sets the end-of-line character |
--mysql-delimiters | Uses MySQL's default delimiter set |
--optionally-enclosed-by <char> | Sets an optional field-enclosing character |
3.2.2 create-hive-table test case
Requirement: copy the structure of the emp table in the MySQL test database to the ods_emp table in the Hive test database.
sqoop create-hive-table \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--hive-database test \
--table emp --hive-table ods_emp
Test log:
[root@hp1 ~]#
[root@hp1 ~]# sqoop create-hive-table \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --hive-database test \
> --table emp --hive-table ods_emp
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 18:12:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 18:12:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 18:12:28 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
20/11/23 18:12:28 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
20/11/23 18:12:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/23 18:12:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
20/11/23 18:12:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
20/11/23 18:12:29 WARN hive.TableDefWriter: Column hiredate had to be cast to a less precise type in Hive
20/11/23 18:12:29 WARN hive.TableDefWriter: Column sal had to be cast to a less precise type in Hive
20/11/23 18:12:29 WARN hive.TableDefWriter: Column comm had to be cast to a less precise type in Hive
20/11/23 18:12:30 INFO hive.HiveImport: Loading uploaded data into Hive
20/11/23 18:12:30 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.cloudera.hive/hive-site.xml
********** many lines of output omitted **********
20/11/23 18:12:33 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hp1:2181,hp3:2181,hp2:2181 sessionTimeout=1200000 watcher=org.apache.curator.ConnectionState@176555c
20/11/23 18:12:33 INFO zookeeper.ClientCnxn: Opening socket connection to server hp1/10.31.1.123:2181. Will not attempt to authenticate using SASL (unknown error)
20/11/23 18:12:33 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.31.1.123:54526, server: hp1/10.31.1.123:2181
20/11/23 18:12:33 INFO zookeeper.ClientCnxn: Session establishment complete on server hp1/10.31.1.123:2181, sessionid = 0x175e00052b51d5e, negotiated timeout = 40000
20/11/23 18:12:33 INFO state.ConnectionStateManager: State change: CONNECTED
20/11/23 18:12:33 INFO ql.Driver: Executing command(queryId=root_20201123181231_97629f38-ba50-4864-9d21-6aa2239f11a9): CREATE TABLE IF NOT EXISTS `test`.`ods_emp` ( `empno` INT, `ename` STRING, `job` STRING, `mgr` INT, `hiredate` STRING, `sal` DOUBLE, `comm` DOUBLE, `deptno` INT) COMMENT 'Imported by sqoop on 2020/11/23 18:12:29' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
20/11/23 18:12:33 INFO ql.Driver: Starting task [Stage-0:DDL] in serial mode
20/11/23 18:12:33 INFO hive.metastore: Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook to org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl
20/11/23 18:12:33 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/23 18:12:33 INFO exec.DDLTask: creating table test.ods_emp on null
20/11/23 18:12:34 INFO hive.metastore: HMS client filtering is enabled.
20/11/23 18:12:34 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/23 18:12:34 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/23 18:12:34 INFO hive.metastore: Connected to metastore.
20/11/23 18:12:34 INFO ql.Driver: Completed executing command(queryId=root_20201123181231_97629f38-ba50-4864-9d21-6aa2239f11a9); Time taken: 0.515 seconds
OK
20/11/23 18:12:34 INFO ql.Driver: OK
Time taken: 2.633 seconds
20/11/23 18:12:34 INFO CliDriver: Time taken: 2.633 seconds
20/11/23 18:12:34 INFO conf.HiveConf: Using the default value passed in for log id: 2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1
20/11/23 18:12:34 INFO session.SessionState: Resetting thread name to main
20/11/23 18:12:34 INFO conf.HiveConf: Using the default value passed in for log id: 2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1
20/11/23 18:12:34 INFO session.SessionState: Deleted directory: /tmp/hive/root/2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1 on fs with scheme hdfs
20/11/23 18:12:34 INFO session.SessionState: Deleted directory: /tmp/root/2fec9e5a-4d5e-4786-9e5b-5e5303b8cca1 on fs with scheme file
20/11/23 18:12:34 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/23 18:12:34 INFO hive.HiveImport: Hive import complete.
20/11/23 18:12:34 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
20/11/23 18:12:34 INFO zookeeper.ZooKeeper: Session: 0x175e00052b51d5e closed
20/11/23 18:12:34 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
20/11/23 18:12:34 INFO zookeeper.ClientCnxn: EventThread shut down
Check the table on the Hive side.
Integer columns are mapped to int, decimals to double, and varchar to string, which is all fine, but the date column also ends up as string. This is Sqoop's default type mapping for Hive; the "had to be cast to a less precise type" warnings in the log above refer to exactly this (a sketch of overriding the mapping follows the output below).
In addition, primary and foreign keys are simply ignored.
hive>
>
> desc ods_emp;
OK
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int
Time taken: 0.078 seconds, Fetched: 8 row(s)
> show create table ods_emp;
OK
CREATE TABLE `ods_emp`(
`empno` int,
`ename` string,
`job` string,
`mgr` int,
`hiredate` string,
`sal` double,
`comm` double,
`deptno` int)
COMMENT 'Imported by sqoop on 2020/11/23 18:12:29'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='?',
'line.delim'='\n',
'serialization.format'='?')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/test.db/ods_emp'
TBLPROPERTIES (
'transient_lastDdlTime'='1606126354')
Time taken: 0.107 seconds, Fetched: 24 row(s)
hive>
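If string is not acceptable for the hiredate column, the default mapping can be overridden at import time with --map-column-hive. A hedged sketch, assuming the Hive version in use accepts the DATE type for this column:
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--table emp \
--hive-import \
--hive-database test \
--hive-table ods_emp \
--map-column-hive hiredate=DATE \
--num-mappers 1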
3.3 eval
The eval tool runs a SQL statement against the database and shows the result, so you can check that a statement is correct before using it in an import or export.
3.3.1 eval arguments
Common arguments
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify the JDBC connect string |
--connection-manager <class-name> | Specify the connection manager class to use |
--driver <class-name> | Manually specify the JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password <password> | Set the authentication password |
--username <username> | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
--relaxed-isolation | Set the connection isolation level to read uncommitted data |
SQL evaluation arguments
Argument | Description |
---|---|
-e,--query <statement> | The SQL statement to execute |
3.3.2 eval test cases
Testing --query
sqoop eval \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--query "select * from emp"
Test log:
[root@hp1 ~]# sqoop eval \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --query "select * from emp"
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 18:28:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 18:28:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 18:28:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
----------------------------------------------------------------------------------
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
----------------------------------------------------------------------------------
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 | (null) | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 | (null) | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 | (null) | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 | (null) | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1987-06-13 | 3000.00 | (null) | 20 |
| 7839 | KING | PRESIDENT | (null) | 1981-11-17 | 5000.00 | (null) | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 | 0.00 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1987-06-13 | 1100.00 | (null) | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 | (null) | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 | (null) | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 | (null) | 10 |
----------------------------------------------------------------------------------
[root@hp1 ~]#
Testing -e
sqoop eval \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
-e "insert into emp(empno) values (1)"
Test log:
[root@hp1 ~]# sqoop eval \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> -e "insert into emp(empno) values (1)"
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 18:32:26 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 18:32:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 18:32:26 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/23 18:32:26 INFO tool.EvalSqlTool: 1 row(s) updated.
[root@hp1 ~]#
3.4 export
The export tool copies data from HDFS back into a relational database. The target table must already exist in the database; the input files are read and parsed into a set of records according to the user-specified delimiters.
3.4.1 export arguments
Common arguments
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify the JDBC connect string |
--connection-manager <class-name> | Specify the connection manager class to use |
--driver <class-name> | Manually specify the JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password <password> | Set the authentication password |
--username <username> | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
--relaxed-isolation | Set the connection isolation level to read uncommitted data |
Validation arguments
Argument | Description |
---|---|
--validate | Enable validation of the copied data; only a single-table copy is supported |
--validator <class-name> | Specify the validator class to use |
--validation-threshold <class-name> | Specify the validation threshold class to use |
--validation-failurehandler <class-name> | Specify the validation failure handler class to use |
Export control arguments
Argument | Description |
---|---|
--columns <col,col,col…> | Columns to export to the table |
--direct | Use the direct export fast path |
--export-dir <dir> | HDFS source path for the export |
-m,--num-mappers <n> | Use n map tasks to export in parallel |
--table <table-name> | Table to populate |
--call <stored-proc-name> | Stored procedure to call |
--update-key <col-name> | Anchor column(s) to use for updates; use a comma-separated list if there is more than one column |
--update-mode <mode> | How to handle rows that are not found in the database: updateonly (default) or allowinsert |
--input-null-string <null-string> | The string to be interpreted as null for string columns |
--input-null-non-string <null-string> | The string to be interpreted as null for non-string columns |
--staging-table <staging-table-name> | The table in which data is staged before being inserted into the destination table |
--clear-staging-table | Indicates that any data present in the staging table can be deleted |
--batch | Use batch mode for underlying statement execution |
Input parsing arguments
Argument | Description |
---|---|
--input-enclosed-by <char> | Sets a required field-enclosing character |
--input-escaped-by <char> | Sets the input escape character |
--input-fields-terminated-by <char> | Sets the input field separator |
--input-lines-terminated-by <char> | Sets the input end-of-line character |
--input-optionally-enclosed-by <char> | Sets an optional field-enclosing character |
Output line formatting arguments
Argument | Description |
---|---|
--enclosed-by <char> | Sets a required field-enclosing character |
--escaped-by <char> | Sets the escape character |
--fields-terminated-by <char> | Sets the field separator character |
--lines-terminated-by <char> | Sets the end-of-line character |
--mysql-delimiters | Uses MySQL's default delimiter set |
--optionally-enclosed-by <char> | Sets an optional field-enclosing character |
Code generation arguments
Argument | Description |
---|---|
--bindir <dir> | Output directory for compiled objects |
--class-name <name> | Sets the name of the generated class |
--jar-file <file> | Disable code generation; use the specified jar |
--outdir <dir> | Output directory for generated code |
--package-name <name> | Put auto-generated classes in this package |
--map-column-java <m> | Override the default SQL-to-Java type mapping for the configured columns |
3.4.2 export test cases
Sqoop export supports insert and update into the relational database, but not merge; I tested this myself, and passing --merge-key id to export raises an error.
However, sqoop export can call a stored procedure, so after the data has been synchronized it can be post-processed by a MySQL stored procedure.
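A hedged sketch of the stored-procedure path: with --call, each parsed record is passed as the argument list of the procedure instead of being inserted into a table. The procedure name proc_load_t1 and its signature are assumptions:
sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--call proc_load_t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--num-mappers 1 \
--fields-terminated-by '\0001'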
3.4.2.1 Insert case: exporting a Hive table to MySQL
Data preparation:
-- Hive table data
hive>
>
> select * from t1;
OK
1 abc
2 def
-- MySQL data
mysql> select * from t1;
Empty set (0.00 sec)
mysql>
Command:
--fields-terminated-by '\0001' has to be written in exactly this form; the other formats I tried raised errors, which still needs investigation
sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--num-mappers 1 \
--fields-terminated-by '\0001'
Test log:
[root@hp1 ~]# sqoop export \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table t1 \
> --export-dir /user/hive/warehouse/test.db/t1 \
> --num-mappers 1 \
> --fields-terminated-by '\0001'
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/23 19:22:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/23 19:22:04 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/23 19:22:04 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/23 19:22:04 INFO tool.CodeGenTool: Beginning code generation
20/11/23 19:22:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/23 19:22:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/23 19:22:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/23 19:22:06 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/fab279df6cf018b9c6c4a9b4d9508e39/t1.java to /root/./t1.java. Error: Destination '/root/./t1.java' already exists
20/11/23 19:22:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/fab279df6cf018b9c6c4a9b4d9508e39/t1.jar
20/11/23 19:22:06 INFO mapreduce.ExportJobBase: Beginning export of t1
20/11/23 19:22:06 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/23 19:22:07 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
20/11/23 19:22:07 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/23 19:22:07 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/23 19:22:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/23 19:22:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0024
20/11/23 19:22:09 INFO input.FileInputFormat: Total input files to process : 2
20/11/23 19:22:09 INFO input.FileInputFormat: Total input files to process : 2
20/11/23 19:22:09 INFO mapreduce.JobSubmitter: number of splits:1
20/11/23 19:22:09 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/23 19:22:09 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/23 19:22:09 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/23 19:22:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0024
20/11/23 19:22:09 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/23 19:22:10 INFO conf.Configuration: resource-types.xml not found
20/11/23 19:22:10 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/23 19:22:10 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0024
20/11/23 19:22:10 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0024/
20/11/23 19:22:10 INFO mapreduce.Job: Running job: job_1605780958086_0024
20/11/23 19:22:16 INFO mapreduce.Job: Job job_1605780958086_0024 running in uber mode : false
20/11/23 19:22:16 INFO mapreduce.Job: map 0% reduce 0%
20/11/23 19:22:21 INFO mapreduce.Job: map 100% reduce 0%
20/11/23 19:22:21 INFO mapreduce.Job: Job job_1605780958086_0024 completed successfully
20/11/23 19:22:21 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=247166
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=241
HDFS: Number of bytes written=0
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2732
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2732
Total vcore-milliseconds taken by all map tasks=2732
Total megabyte-milliseconds taken by all map tasks=2797568
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=223
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=65
CPU time spent (ms)=930
Physical memory (bytes) snapshot=235925504
Virtual memory (bytes) snapshot=2592751616
Total committed heap usage (bytes)=193462272
Peak Map Physical memory (bytes)=235925504
Peak Map Virtual memory (bytes)=2592751616
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
20/11/23 19:22:21 INFO mapreduce.ExportJobBase: Transferred 241 bytes in 13.9456 seconds (17.2815 bytes/sec)
20/11/23 19:22:21 INFO mapreduce.ExportJobBase: Exported 2 records.
[root@hp1 ~]#
Data verification
The data was exported into MySQL successfully.
mysql> select * from t1;
Empty set (0.00 sec)
mysql> select * from t1;
+------+------+
| id | name |
+------+------+
| 1 | abc |
| 2 | def |
+------+------+
2 rows in set (0.00 sec)
3.4.2.2 Update case: exporting a Hive table to MySQL
Data preparation:
-- Hive data
hive>
>
> select * from t1;
OK
1 abc
2 def
-- MySQL data
mysql> select * from t1;
+------+------+
| id | name |
+------+------+
| 1 | a |
| 3 | b |
+------+------+
2 rows in set (0.00 sec)
Command:
sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--update-key id \
--num-mappers 1 \
--fields-terminated-by '\0001'
Test log:
[root@hp1 ~]# sqoop export \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table t1 \
> --export-dir /user/hive/warehouse/test.db/t1 \
> --update-key id \
> --num-mappers 1 \
> --fields-terminated-by '\0001'
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/24 15:00:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/24 15:00:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/24 15:00:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/24 15:00:54 INFO tool.CodeGenTool: Beginning code generation
20/11/24 15:00:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/24 15:00:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t1` AS t LIMIT 1
20/11/24 15:00:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/24 15:00:56 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/a3298f025b919a783d8aeba0fcfdd07c/t1.java to /root/./t1.java. Error: Destination '/root/./t1.java' already exists
20/11/24 15:00:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/a3298f025b919a783d8aeba0fcfdd07c/t1.jar
20/11/24 15:00:56 INFO mapreduce.ExportJobBase: Beginning export of t1
20/11/24 15:00:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/24 15:00:57 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
20/11/24 15:00:57 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/24 15:00:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/24 15:00:57 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/24 15:00:58 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0028
20/11/24 15:00:59 INFO input.FileInputFormat: Total input files to process : 2
20/11/24 15:00:59 INFO input.FileInputFormat: Total input files to process : 2
20/11/24 15:00:59 INFO mapreduce.JobSubmitter: number of splits:1
20/11/24 15:00:59 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/24 15:00:59 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/24 15:00:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/24 15:00:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0028
20/11/24 15:00:59 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/24 15:00:59 INFO conf.Configuration: resource-types.xml not found
20/11/24 15:00:59 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/24 15:01:00 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0028
20/11/24 15:01:00 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0028/
20/11/24 15:01:00 INFO mapreduce.Job: Running job: job_1605780958086_0028
20/11/24 15:01:06 INFO mapreduce.Job: Job job_1605780958086_0028 running in uber mode : false
20/11/24 15:01:06 INFO mapreduce.Job: map 0% reduce 0%
20/11/24 15:01:11 INFO mapreduce.Job: map 100% reduce 0%
20/11/24 15:01:11 INFO mapreduce.Job: Job job_1605780958086_0028 completed successfully
20/11/24 15:01:11 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=247485
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=241
HDFS: Number of bytes written=0
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3094
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=3094
Total vcore-milliseconds taken by all map tasks=3094
Total megabyte-milliseconds taken by all map tasks=3168256
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=223
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=60
CPU time spent (ms)=1010
Physical memory (bytes) snapshot=243896320
Virtual memory (bytes) snapshot=2593943552
Total committed heap usage (bytes)=193462272
Peak Map Physical memory (bytes)=243896320
Peak Map Virtual memory (bytes)=2593943552
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
20/11/24 15:01:11 INFO mapreduce.ExportJobBase: Transferred 241 bytes in 13.9585 seconds (17.2655 bytes/sec)
20/11/24 15:01:11 INFO mapreduce.ExportJobBase: Exported 2 records.
[root@hp1 ~]#
Data verification
The update succeeded: the row with id=1 was updated to abc, while the Hive row with id=2 was not inserted, because --update-key on its own only performs updates (see the upsert sketch after the output below).
mysql> select * from t1;
+------+------+
| id | name |
+------+------+
| 1 | abc |
| 3 | b |
+------+------+
2 rows in set (0.00 sec)
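To also insert the rows that do not yet exist in MySQL (an upsert), --update-mode allowinsert can be added to the update command above. A minimal sketch, assuming the allowinsert behaviour of the MySQL connector on this table has been verified:
sqoop export \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--table t1 \
--export-dir /user/hive/warehouse/test.db/t1 \
--update-key id \
--update-mode allowinsert \
--num-mappers 1 \
--fields-terminated-by '\0001'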
3.5 help
The help tool prints the usage of a command.
For example, if I do not remember the syntax of sqoop import and want to look it up:
[root@hp1 ~]# sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--connection-manager <class-name> Specify connection manager class to use
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
--help Print usage instructions
--password-file Set path for file containing authentication password
-P Read password from console
--password <password> Set authentication password
--username <username> Set authentication username
--verbose Print more information while working
--hadoop-home <dir> Deprecated. Override $HADOOP_HOME
Import control arguments:
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data to Parquet Data Files
...
3.6 import
sqoop import is the most frequently used tool; it copies data from a relational database to HDFS.
Each row of the source table is stored as a separate record in text files, Avro data files, or SequenceFiles.
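Before the Hive import case below, here is a minimal sketch of importing a table straight into an HDFS directory; the --target-dir path is an assumption:
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--table emp \
--target-dir /user/root/sqoop/emp \
--delete-target-dir \
--as-textfile \
--num-mappers 1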
3.6.1 sqoop import arguments
Common arguments
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify the JDBC connect string |
--connection-manager <class-name> | Specify the connection manager class to use |
--driver <class-name> | Manually specify the JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password <password> | Set the authentication password |
--username <username> | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
--relaxed-isolation | Set the connection isolation level to read uncommitted data |
Validation arguments
Argument | Description |
---|---|
--validate | Enable validation of the copied data; only a single-table copy is supported |
--validator <class-name> | Specify the validator class to use |
--validation-threshold <class-name> | Specify the validation threshold class to use |
--validation-failurehandler <class-name> | Specify the validation failure handler class to use |
Import control arguments
Argument | Description |
---|---|
--append | Append data to an existing dataset in HDFS |
--as-avrodatafile | Import data to Avro data files |
--as-sequencefile | Import data to SequenceFiles |
--as-textfile | Import data as plain text (the default) |
--as-parquetfile | Import data to Parquet files |
--boundary-query <statement> | Boundary query used to create splits |
--columns <col,col,col…> | Columns to import from the table |
--delete-target-dir | Delete the target directory if it already exists |
--direct | Use the direct connector if one exists for the database |
--fetch-size <n> | Number of entries to read from the database at once |
--inline-lob-limit <n> | Set the maximum size for an inline LOB column |
-m,--num-mappers <n> | Use n map tasks to import in parallel; the default is 4 |
-e,--query <statement> | Import the results of the given SQL statement (see the example after this table) |
--split-by <column-name> | Column of the table used to split work units |
--split-limit <n> | Upper limit for each split size; applies only to integer and date columns, and for date or timestamp fields the limit is calculated in seconds |
--autoreset-to-one-mapper | Fall back to one mapper if the table has no primary key and no split column is provided; cannot be combined with --split-by |
--table <table-name> | Table to read |
--target-dir <dir> | Target HDFS directory |
--temporary-rootdir <dir> | HDFS directory for temporary files created during the import |
--warehouse-dir <dir> | Parent HDFS directory for table destinations |
--where <where clause> | WHERE clause used to filter the rows imported |
-z,--compress | Enable compression |
--compression-codec <c> | Use the given Hadoop codec (gzip by default) |
--null-string <null-string> | The string to be written for a null value in string columns |
--null-non-string <null-string> | The string to be written for a null value in non-string columns |
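When importing with -e/--query, Sqoop requires the literal token $CONDITIONS in the WHERE clause, and with more than one mapper a --split-by column must be given. A hedged sketch against the emp table; the --target-dir path is an assumption:
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--query 'select empno, ename, sal from emp where deptno = 20 and $CONDITIONS' \
--split-by empno \
--target-dir /user/root/sqoop/emp_dept20 \
--delete-target-dir \
--num-mappers 2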
Output line formatting arguments
Argument | Description |
---|---|
--enclosed-by <char> | Sets a required field-enclosing character |
--escaped-by <char> | Sets the escape character |
--fields-terminated-by <char> | Sets the field separator character |
--lines-terminated-by <char> | Sets the end-of-line character |
--mysql-delimiters | Uses MySQL's default delimiter set |
--optionally-enclosed-by <char> | Sets an optional field-enclosing character |
Escape characters, for example --fields-terminated-by '\t'
Supported escape sequences:
1. \b (backspace)
2. \n (newline)
3. \r (carriage return)
4. \t (tab)
5. \" (double-quote)
6. \' (single-quote)
7. \\ (backslash)
8. \0 (NUL) -- this inserts NUL characters between fields or lines, or disables enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments
A delimiter can also be given as the octal representation of a UTF-8 character code point; for example, --fields-terminated-by \001 yields the ^A character.
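Hive's text format represents NULL as \N by default, so --null-string and --null-non-string are often set explicitly when the imported files will be read by Hive. A minimal sketch, assuming the target directory and the \N convention fit the downstream table:
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--table emp \
--target-dir /user/root/sqoop/emp_nulls \
--delete-target-dir \
--null-string '\\N' \
--null-non-string '\\N' \
--num-mappers 1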
Input parsing arguments
Argument | Description |
---|---|
--input-enclosed-by <char> | Sets a required field-enclosing character |
--input-escaped-by <char> | Sets the input escape character |
--input-fields-terminated-by <char> | Sets the input field separator |
--input-lines-terminated-by <char> | Sets the input end-of-line character |
--input-optionally-enclosed-by <char> | Sets an optional field-enclosing character |
Hive arguments
Argument | Description |
---|---|
--hive-home <dir> | Override $HIVE_HOME |
--hive-import | Import tables into Hive, using Hive's default delimiters if none are set |
--hive-overwrite | Overwrite existing data in the Hive table |
--create-hive-table | If set, the job fails when the target Hive table already exists; defaults to false |
--hive-table <table-name> | The Hive table name to use when importing |
--hive-drop-import-delims | Drop \n, \r and \01 from string fields when importing into Hive |
--hive-delims-replacement | Replace \n, \r and \01 in string fields with a user-defined string when importing into Hive |
--hive-partition-key | Name of the Hive partition column (see the sketch after this table) |
--hive-partition-value <v> | String value to use as the partition key for this import job |
--map-column-hive <map> | Override the default SQL-to-Hive type mapping for the configured columns; if the value contains commas, URL-encode them, e.g. DECIMAL(1%2C%201) instead of DECIMAL(1,1) |
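For a partitioned Hive target, --hive-partition-key and --hive-partition-value load the whole import into one static partition. A hedged sketch; the partition column dt, its value, and the table name ods_emp_pt are assumptions:
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--table emp \
--hive-import \
--hive-database test \
--hive-table ods_emp_pt \
--hive-partition-key dt \
--hive-partition-value '2020-11-25' \
--fields-terminated-by '\0001' \
--num-mappers 1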
3.6.2 Hive import test case
Data preparation:
-- MySQL
mysql> select count(*) from fact_sale;
+-----------+
| count(*) |
+-----------+
| 767830000 |
+-----------+
1 row in set (2 min 32.75 sec)
mysql> desc fact_sale;
+-----------+-------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+-------------------+-----------------------------+
| id | bigint(8) | NO | PRI | NULL | auto_increment |
| sale_date | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| prod_name | varchar(50) | YES | | NULL | |
| sale_nums | int(11) | YES | | NULL | |
+-----------+-------------+------+-----+-------------------+-----------------------------+
4 rows in set (0.01 sec)
-- No need to create the table in Hive beforehand
Command:
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table fact_sale \
--fields-terminated-by '\0001' \
--delete-target-dir \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-table ods_fact_sale \
--hive-overwrite
Test log:
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 09:49:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 09:49:54 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 09:49:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 09:49:54 INFO tool.CodeGenTool: Beginning code generation
20/11/25 09:49:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/11/25 09:49:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/11/25 09:49:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/25 09:49:56 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/ac811b1a59e599927528fdb832a4a853/fact_sale.java to /root/./fact_sale.java. Error: Destination '/root/./fact_sale.ja
va' already exists
20/11/25 09:49:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/ac811b1a59e599927528fdb832a4a853/fact_sale.jar
20/11/25 09:49:56 INFO tool.ImportTool: Destination directory fact_sale is not present, hence not deleting.
20/11/25 09:49:56 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/11/25 09:49:56 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/11/25 09:49:56 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/11/25 09:49:56 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/11/25 09:49:56 INFO mapreduce.ImportJobBase: Beginning import of fact_sale
20/11/25 09:49:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 09:49:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 09:49:57 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/25 09:49:57 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0032
20/11/25 09:49:59 INFO db.DBInputFormat: Using read commited transaction isolation
20/11/25 09:49:59 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `fact_sale`
20/11/25 09:49:59 INFO db.IntegerSplitter: Split size: 196905399; Num splits: 4 from: 1 to: 787621597
20/11/25 09:49:59 INFO mapreduce.JobSubmitter: number of splits:4
20/11/25 09:49:59 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/25 09:49:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/25 09:49:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0032
20/11/25 09:49:59 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 09:49:59 INFO conf.Configuration: resource-types.xml not found
20/11/25 09:49:59 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 09:49:59 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0032
20/11/25 09:49:59 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0032/
20/11/25 09:49:59 INFO mapreduce.Job: Running job: job_1605780958086_0032
20/11/25 09:50:06 INFO mapreduce.Job: Job job_1605780958086_0032 running in uber mode : false
20/11/25 09:50:06 INFO mapreduce.Job: map 0% reduce 0%
20/11/25 09:56:28 INFO mapreduce.Job: map 25% reduce 0%
20/11/25 09:57:08 INFO mapreduce.Job: map 50% reduce 0%
20/11/25 10:03:40 INFO mapreduce.Job: map 75% reduce 0%
20/11/25 10:04:07 INFO mapreduce.Job: map 100% reduce 0%
20/11/25 10:04:07 INFO mapreduce.Job: Job job_1605780958086_0032 completed successfully
20/11/25 10:04:07 INFO mapreduce.Job: Counters: 33
**Some of the output omitted**
0/11/25 10:04:11 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hp1:2181,hp3:2181,hp2:2181 sessionTimeout=1200000 watcher=org.apache.curator.ConnectionState@615bad16
20/11/25 10:04:11 INFO zookeeper.ClientCnxn: Opening socket connection to server hp3/10.31.1.125:2181. Will not attempt to authenticate using SASL (unknown error)
20/11/25 10:04:11 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.31.1.123:41634, server: hp3/10.31.1.125:2181
20/11/25 10:04:11 INFO zookeeper.ClientCnxn: Session establishment complete on server hp3/10.31.1.125:2181, sessionid = 0x275e0004e5929e2, negotiated timeout = 40000
20/11/25 10:04:11 INFO state.ConnectionStateManager: State change: CONNECTED
20/11/25 10:04:11 INFO ql.Driver: Executing command(queryId=root_20201125100408_f35214cb-e680-4e38-9833-7cb096592131): CREATE TABLE IF NOT EXISTS `test`.`ods_fact_sale` ( `id` BIGINT, `sale_date` STRING,
`prod_name` STRING, `sale_nums` INT) COMMENT 'Imported by sqoop on 2020/11/25 10:04:07' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
20/11/25 10:04:11 INFO ql.Driver: Completed executing command(queryId=root_20201125100408_f35214cb-e680-4e38-9833-7cb096592131); Time taken: 0.006 seconds
OK
20/11/25 10:04:11 INFO ql.Driver: OK
Time taken: 2.319 seconds
20/11/25 10:04:11 INFO CliDriver: Time taken: 2.319 seconds
20/11/25 10:04:11 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:11 INFO session.SessionState: Resetting thread name to main
20/11/25 10:04:11 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:11 INFO session.SessionState: Updating thread name to c9680e77-741b-4fe8-a882-55ac6ae6ee79 main
20/11/25 10:04:11 INFO ql.Driver: Compiling command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0):
LOAD DATA INPATH 'hdfs://nameservice1/user/root/fact_sale' OVERWRITE INTO TABLE `test`.`ods_fact_sale`
20/11/25 10:04:11 INFO ql.Driver: Semantic Analysis Completed
20/11/25 10:04:11 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
20/11/25 10:04:11 INFO ql.Driver: Completed compiling command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0); Time taken: 0.232 seconds
20/11/25 10:04:11 INFO ql.Driver: Executing command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0):
LOAD DATA INPATH 'hdfs://nameservice1/user/root/fact_sale' OVERWRITE INTO TABLE `test`.`ods_fact_sale`
20/11/25 10:04:11 INFO ql.Driver: Starting task [Stage-0:MOVE] in serial mode
20/11/25 10:04:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
Loading data to table test.ods_fact_sale
20/11/25 10:04:11 INFO exec.Task: Loading data to table test.ods_fact_sale from hdfs://nameservice1/user/root/fact_sale
20/11/25 10:04:11 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:04:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:04:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:04:11 INFO hive.metastore: Connected to metastore.
20/11/25 10:04:11 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale/part-m-00000' to trash at: hdfs://nameservice1/user/root/.Trash/Current/user/hive/wareho
use/test.db/ods_fact_sale/part-m-00000
20/11/25 10:04:11 INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test.db/ods_fact_sale
20/11/25 10:04:11 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
20/11/25 10:04:11 INFO exec.StatsTask: Executing stats task
20/11/25 10:04:11 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:04:11 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:04:11 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:04:11 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:04:11 INFO hive.metastore: Connected to metastore.
20/11/25 10:04:12 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:04:12 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:04:12 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:04:12 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:04:12 INFO hive.metastore: Connected to metastore.
20/11/25 10:04:12 INFO exec.StatsTask: Table test.ods_fact_sale stats: [numFiles=4, numRows=0, totalSize=31421093662, rawDataSize=0, numFilesErasureCoded=0]
20/11/25 10:04:12 INFO ql.Driver: Completed executing command(queryId=root_20201125100411_b2ff0e5c-3ccc-4f7e-b202-53855934f5f0); Time taken: 0.897 seconds
OK
20/11/25 10:04:12 INFO ql.Driver: OK
Time taken: 1.154 seconds
20/11/25 10:04:12 INFO CliDriver: Time taken: 1.154 seconds
20/11/25 10:04:12 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:12 INFO session.SessionState: Resetting thread name to main
20/11/25 10:04:12 INFO conf.HiveConf: Using the default value passed in for log id: c9680e77-741b-4fe8-a882-55ac6ae6ee79
20/11/25 10:04:12 INFO session.SessionState: Deleted directory: /tmp/hive/root/c9680e77-741b-4fe8-a882-55ac6ae6ee79 on fs with scheme hdfs
20/11/25 10:04:12 INFO session.SessionState: Deleted directory: /tmp/root/c9680e77-741b-4fe8-a882-55ac6ae6ee79 on fs with scheme file
20/11/25 10:04:12 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:04:12 INFO hive.HiveImport: Hive import complete.
20/11/25 10:04:12 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
20/11/25 10:04:12 INFO zookeeper.ZooKeeper: Session: 0x275e0004e5929e2 closed
20/11/25 10:04:12 INFO zookeeper.ClientCnxn: EventThread shut down
20/11/25 10:04:12 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
3.7 import-all-tables
sqoop import-all-tables imports every table in a database into HDFS, so the tables do not have to be imported one at a time.
3.7.1 import-all-tables arguments
Common arguments
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify the JDBC connect string |
--connection-manager <class-name> | Specify the connection manager class to use |
--driver <class-name> | Manually specify the JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password <password> | Set the authentication password |
--username <username> | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
--relaxed-isolation | Set the connection isolation level to read uncommitted data |
Import control arguments
Argument | Description |
---|---|
--as-avrodatafile | Import data to Avro data files |
--as-sequencefile | Import data to SequenceFiles |
--as-textfile | Import data as plain text (the default) |
--as-parquetfile | Import data to Parquet files |
--direct | Use the direct connector if one exists for the database |
--inline-lob-limit <n> | Set the maximum size for an inline LOB column |
-m,--num-mappers <n> | Use n map tasks to import in parallel; the default is 4 |
--warehouse-dir <dir> | Parent HDFS directory for table destinations |
-z,--compress | Enable compression |
--compression-codec <c> | Use the given Hadoop codec (gzip by default) |
--exclude-tables <tables> | Comma-separated list of tables to exclude from the import |
--autoreset-to-one-mapper | Fall back to one mapper when a table with no primary key is encountered |
Output line formatting arguments
Argument | Description |
---|---|
--enclosed-by <char> | Sets a required field-enclosing character |
--escaped-by <char> | Sets the escape character |
--fields-terminated-by <char> | Sets the field separator character |
--lines-terminated-by <char> | Sets the end-of-line character |
--mysql-delimiters | Uses MySQL's default delimiter set |
--optionally-enclosed-by <char> | Sets an optional field-enclosing character |
Escape characters, for example --fields-terminated-by '\t'
Supported escape sequences:
1. \b (backspace)
2. \n (newline)
3. \r (carriage return)
4. \t (tab)
5. \" (double-quote)
6. \' (single-quote)
7. \\ (backslash)
8. \0 (NUL) -- this inserts NUL characters between fields or lines, or disables enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments
A delimiter can also be given as the octal representation of a UTF-8 character code point; for example, --fields-terminated-by \001 yields the ^A character.
Input parsing arguments
Argument | Description |
---|---|
--input-enclosed-by <char> | Sets a required field-enclosing character |
--input-escaped-by <char> | Sets the input escape character |
--input-fields-terminated-by <char> | Sets the input field separator |
--input-lines-terminated-by <char> | Sets the input end-of-line character |
--input-optionally-enclosed-by <char> | Sets an optional field-enclosing character |
Hive arguments
Argument | Description |
---|---|
--hive-home <dir> | Override $HIVE_HOME |
--hive-import | Import tables into Hive, using Hive's default delimiters if none are set |
--hive-overwrite | Overwrite existing data in the Hive table |
--create-hive-table | If set, the job fails when the target Hive table already exists; defaults to false |
--hive-table <table-name> | The Hive table name to use when importing |
--hive-drop-import-delims | Drop \n, \r and \01 from string fields when importing into Hive |
--hive-delims-replacement | Replace \n, \r and \01 in string fields with a user-defined string when importing into Hive |
--hive-partition-key | Name of the Hive partition column |
--hive-partition-value <v> | String value to use as the partition key for this import job |
--map-column-hive <map> | Override the default SQL-to-Hive type mapping for the configured columns; if the value contains commas, URL-encode them, e.g. DECIMAL(1%2C%201) instead of DECIMAL(1,1) |
3.7.2 import-all-tables test case
Goal: synchronize every table in the MySQL test database except fact_sale and t1 into the Hive test database.
Data preparation:
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| bonus |
| dept |
| emp |
| fact_sale |
| salgrade |
| t1 |
+----------------+
6 rows in set (0.00 sec)
Command:
sqoop import-all-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--exclude-tables fact_sale,t1 \
--autoreset-to-one-mapper \
--fields-terminated-by '\0001' \
--num-mappers 4 \
--hive-import \
--hive-database test \
--hive-overwrite
Test log:
Some of the tables have no primary key, so the job cannot run with --num-mappers 4 alone and errors out.
Adding --autoreset-to-one-mapper makes Sqoop fall back to a single mapper for any table without a primary key.
[root@hp1 ~]# sqoop import-all-tables \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --exclude-tables fact_sale,t1 \
> --fields-terminated-by '\0001' \
> --num-mappers 4 \
> --hive-import \
> --hive-database test \
> --hive-overwrite
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 10:40:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 10:40:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 10:40:36 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 10:40:37 INFO tool.CodeGenTool: Beginning code generation
20/11/25 10:40:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:40:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:40:37 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/25 10:40:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ef9bc1769dcf7d49b3848d6a55144fa/bonus.jar
20/11/25 10:40:38 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/11/25 10:40:38 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/11/25 10:40:38 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/11/25 10:40:38 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/11/25 10:40:38 ERROR tool.ImportAllTablesTool: Error during import: No primary key could be found for table bonus. Please specify one with --split-by or perform a sequential import with '-m 1'.
[root@hp1 ~]#
Test log after the adjustment:
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 10:47:05 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 10:47:05 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 10:47:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 10:47:06 INFO tool.CodeGenTool: Beginning code generation
20/11/25 10:47:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:47:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `bonus` AS t LIMIT 1
20/11/25 10:47:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/11/25 10:47:07 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/7e08dfd7df6bcae92b216d3c68d5a4f3/bonus.java to /root/./bonus.java. Error: Destination '/root/./bonus.java' already exists
20/11/25 10:47:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/7e08dfd7df6bcae92b216d3c68d5a4f3/bonus.jar
20/11/25 10:47:07 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/11/25 10:47:07 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/11/25 10:47:07 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/11/25 10:47:07 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/11/25 10:47:07 WARN manager.SqlManager: Split by column not provided or can't be inferred. Resetting to one mapper
20/11/25 10:47:07 INFO mapreduce.ImportJobBase: Beginning import of bonus
20/11/25 10:47:07 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 10:47:08 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 10:47:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
20/11/25 10:47:08 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1605780958086_0037
20/11/25 10:47:10 INFO db.DBInputFormat: Using read commited transaction isolation
20/11/25 10:47:10 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 10:47:10 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/11/25 10:47:10 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/11/25 10:47:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1605780958086_0037
20/11/25 10:47:10 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 10:47:10 INFO conf.Configuration: resource-types.xml not found
20/11/25 10:47:10 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 10:47:10 INFO impl.YarnClientImpl: Submitted application application_1605780958086_0037
20/11/25 10:47:10 INFO mapreduce.Job: The url to track the job: http://hp3:8088/proxy/application_1605780958086_0037/
20/11/25 10:47:10 INFO mapreduce.Job: Running job: job_1605780958086_0037
20/11/25 10:47:16 INFO mapreduce.Job: Job job_1605780958086_0037 running in uber mode : false
20/11/25 10:47:16 INFO mapreduce.Job: map 0% reduce 0%
20/11/25 10:47:22 INFO mapreduce.Job: map 100% reduce 0%
20/11/25 10:47:22 INFO mapreduce.Job: Job job_1605780958086_0037 completed successfully
**(some intermediate output omitted)**
20/11/25 10:48:21 INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test.db/salgrade
20/11/25 10:48:21 INFO ql.Driver: Starting task [Stage-1:STATS] in serial mode
20/11/25 10:48:21 INFO exec.StatsTask: Executing stats task
20/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:48:21 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:48:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:48:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:48:21 INFO hive.metastore: Connected to metastore.
20/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:48:21 INFO hive.metastore: HMS client filtering is enabled.
20/11/25 10:48:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://hp1:9083
20/11/25 10:48:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
20/11/25 10:48:21 INFO hive.metastore: Connected to metastore.
20/11/25 10:48:21 INFO exec.StatsTask: Table test.salgrade stats: [numFiles=1, numRows=0, totalSize=59, rawDataSize=0, numFilesErasureCoded=0]
20/11/25 10:48:21 INFO ql.Driver: Completed executing command(queryId=root_20201125104820_ad1fb49d-6166-41cb-9113-91074fb2ea16); Time taken: 0.549 seconds
OK
20/11/25 10:48:21 INFO ql.Driver: OK
Time taken: 0.588 seconds
20/11/25 10:48:21 INFO CliDriver: Time taken: 0.588 seconds
20/11/25 10:48:21 INFO conf.HiveConf: Using the default value passed in for log id: 1e735270-9c28-4658-aad6-fc816a42e377
20/11/25 10:48:21 INFO session.SessionState: Resetting thread name to main
20/11/25 10:48:21 INFO conf.HiveConf: Using the default value passed in for log id: 1e735270-9c28-4658-aad6-fc816a42e377
20/11/25 10:48:21 INFO session.SessionState: Deleted directory: /tmp/hive/root/1e735270-9c28-4658-aad6-fc816a42e377 on fs with scheme hdfs
20/11/25 10:48:21 INFO session.SessionState: Deleted directory: /tmp/root/1e735270-9c28-4658-aad6-fc816a42e377 on fs with scheme file
20/11/25 10:48:21 INFO hive.metastore: Closed a connection to metastore, current connections: 0
20/11/25 10:48:21 INFO hive.HiveImport: Hive import complete.
Skipping table: t1
20/11/25 10:48:21 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
20/11/25 10:48:21 INFO zookeeper.ZooKeeper: Session: 0x275e0004e592a1e closed
20/11/25 10:48:21 INFO zookeeper.ClientCnxn: EventThread shut down
20/11/25 10:48:21 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
3.8 The import-mainframe tool
sqoop import-mainframe imports datasets from a mainframe server to HDFS. There is no mainframe test environment available here, so the tool was not tested.
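For reference only, an untested sketch based on the tool's documented syntax; the host name, dataset name, and user are placeholders:
# Untested sketch: import the sequential files of a mainframe partitioned dataset into HDFS.
# <mainframe-host> and MYHLQ.MYDATASET are placeholders.
sqoop import-mainframe \
--connect <mainframe-host> \
--dataset MYHLQ.MYDATASET \
--username myuser \
-P \
--target-dir /tmp/mainframe_import \
-m 1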
3.9 The list-databases tool
sqoop list-databases lists the databases available on a server.
3.9.1 list-databases tool parameters
Common arguments
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify the JDBC connect string |
--connection-manager <class-name> | Specify the connection manager class to use |
--driver <class-name> | Manually specify the JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password <password> | Set the authentication password |
--username <username> | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file providing connection parameters |
--relaxed-isolation | Set the connection transaction isolation to read uncommitted |
3.9.2 list-databases test log
Code:
sqoop list-databases \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123
Test log:
[root@hp1 ~]# sqoop list-databases \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 10:58:57 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 10:58:57 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 10:58:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
mysql
performance_schema
sys
test
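Since every run above logs the warning that putting the password on the command line is insecure, a safer variant (a sketch; the file location is arbitrary) stores the password in a protected file and passes --password-file instead:
# Sketch: write the password without a trailing newline, protect the file on HDFS,
# and reference it instead of passing --password on the command line.
echo -n "abc123" > /root/.mysql.pwd
hdfs dfs -put /root/.mysql.pwd /user/root/.mysql.pwd
hdfs dfs -chmod 400 /user/root/.mysql.pwd
sqoop list-databases \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password-file /user/root/.mysql.pwd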
3.10 The list-tables tool
The sqoop list-tables command lists the tables in a database.
3.10.1 list-tables tool parameters
Common arguments
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify the JDBC connect string |
--connection-manager <class-name> | Specify the connection manager class to use |
--driver <class-name> | Manually specify the JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set the path to a file containing the authentication password |
-P | Read the password from the console |
--password <password> | Set the authentication password |
--username <username> | Set the authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file providing connection parameters |
--relaxed-isolation | Set the connection transaction isolation to read uncommitted |
3.10.2 list-tables test case
Code:
sqoop list-tables \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123
Test log:
[root@hp1 ~]# sqoop list-tables \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 11:07:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/11/25 11:07:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 11:07:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
bonus
dept
emp
fact_sale
salgrade
t1
3.11 The version tool
The sqoop version tool displays version information.
Test log:
[root@hp1 ~]# sqoop version
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 11:09:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
Sqoop 1.4.7-cdh6.3.1
git commit id
Compiled by jenkins on Thu Sep 26 02:57:53 PDT 2019
[root@hp1 ~]#
4. Using Sqoop to sync relational database tables to HDFS
In a real production environment, data is usually first imported from the relational database into HDFS, and then loaded from HDFS into the Hive table.
In testing, the overall performance was on par with importing directly into Hive.
-- Create the table structure on the target side
-- very fast
sqoop create-hive-table \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--hive-database default \
--table fact_sale --hive-table ods_fact_sale
-- Import the MySQL table data into HDFS
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table fact_sale \
--delete-target-dir \
--fields-terminated-by '\0001' \
--num-mappers 4 \
--target-dir /tmp/fact_sale
-- Load the HDFS data into the Hive table
-- very fast
LOAD DATA INPATH '/tmp/fact_sale' into table ods_fact_sale;
Test log:
[root@hp1 tmp]# sqoop import \
> --connect jdbc:mysql://10.31.1.122:3306/test \
> --username root \
> --password abc123 \
> --table fact_sale \
> --delete-target-dir \
> --fields-terminated-by '\0001' \
> --num-mappers 4 \
> --target-dir /tmp/fact_sale
Warning: /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/07 15:13:42 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.3.1
20/12/07 15:13:42 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/12/07 15:13:42 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/12/07 15:13:42 INFO tool.CodeGenTool: Beginning code generation
20/12/07 15:13:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/12/07 15:13:42 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `fact_sale` AS t LIMIT 1
20/12/07 15:13:42 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
20/12/07 15:13:44 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/76fc9602ec3165207e0dd51e2115cc14/fact_sale.java to /tmp/./fact_sale.java. Error: Destination '/tmp/./fact_sale.java' already exists
20/12/07 15:13:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/76fc9602ec3165207e0dd51e2115cc14/fact_sale.jar
20/12/07 15:13:44 INFO tool.ImportTool: Destination directory /tmp/fact_sale deleted.
20/12/07 15:13:44 WARN manager.MySQLManager: It looks like you are importing from mysql.
20/12/07 15:13:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
20/12/07 15:13:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
20/12/07 15:13:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
20/12/07 15:13:45 INFO mapreduce.ImportJobBase: Beginning import of fact_sale
20/12/07 15:13:45 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/12/07 15:13:45 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/12/07 15:13:45 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1606698967173_0119
20/12/07 15:13:47 INFO db.DBInputFormat: Using read commited transaction isolation
20/12/07 15:13:47 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `fact_sale`
20/12/07 15:13:47 INFO db.IntegerSplitter: Split size: 196905399; Num splits: 4 from: 1 to: 787621597
20/12/07 15:13:47 INFO mapreduce.JobSubmitter: number of splits:4
20/12/07 15:13:47 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
20/12/07 15:13:47 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/12/07 15:13:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606698967173_0119
20/12/07 15:13:48 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/12/07 15:13:48 INFO conf.Configuration: resource-types.xml not found
20/12/07 15:13:48 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/12/07 15:13:48 INFO impl.YarnClientImpl: Submitted application application_1606698967173_0119
20/12/07 15:13:48 INFO mapreduce.Job: The url to track the job: http://hp1:8088/proxy/application_1606698967173_0119/
20/12/07 15:13:48 INFO mapreduce.Job: Running job: job_1606698967173_0119
20/12/07 15:13:55 INFO mapreduce.Job: Job job_1606698967173_0119 running in uber mode : false
20/12/07 15:13:55 INFO mapreduce.Job: map 0% reduce 0%
20/12/07 15:20:19 INFO mapreduce.Job: map 25% reduce 0%
20/12/07 15:20:52 INFO mapreduce.Job: map 50% reduce 0%
20/12/07 15:27:27 INFO mapreduce.Job: map 75% reduce 0%
20/12/07 15:27:56 INFO mapreduce.Job: map 100% reduce 0%
20/12/07 15:27:56 INFO mapreduce.Job: Job job_1606698967173_0119 completed successfully
20/12/07 15:27:56 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=990012
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=441
HDFS: Number of bytes written=31421093662
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=1647540
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1647540
Total vcore-milliseconds taken by all map tasks=1647540
Total megabyte-milliseconds taken by all map tasks=1687080960
Map-Reduce Framework
Map input records=767830000
Map output records=767830000
Input split bytes=441
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=15214
CPU time spent (ms)=1769770
Physical memory (bytes) snapshot=1753706496
Virtual memory (bytes) snapshot=10405761024
Total committed heap usage (bytes)=1332215808
Peak Map Physical memory (bytes)=464699392
Peak Map Virtual memory (bytes)=2621968384
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=31421093662
20/12/07 15:27:56 INFO mapreduce.ImportJobBase: Transferred 29.2632 GB in 851.8971 seconds (35.175 MB/sec)
20/12/07 15:27:56 INFO mapreduce.ImportJobBase: Retrieved 767830000 records.
[root@hp1 tmp]#
hive>
>
> LOAD DATA INPATH '/tmp/fact_sale' into table ods_fact_sale;
Loading data to table default.ods_fact_sale
OK
Time taken: 0.947 seconds
hive>
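After the load it is worth confirming that the Hive row count matches the 767,830,000 records Sqoop reported; a minimal check (sketch; output not reproduced here):
# Sketch: compare the Hive row count with the record count reported by Sqoop.
hive -e "select count(*) from ods_fact_sale;"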
5. Incremental sync with Sqoop
In a real production environment, refreshing the full data set every time wastes storage and keeps a lot of redundant data, so incremental loading is used instead; Sqoop supports incremental imports.
5.1 Import modes:
- append
- lastmodified, which must be combined with either --append (append) or --merge-key (merge, usually the primary key column)
5.2 Test case
Test data preparation:
-- MySQL
create table t_test(id int,name varchar(200),last_update_datetime timestamp,primary key(id) );
insert into t_test values (1,'abc','2020-12-01 00:09:00');
insert into t_test values (2,'def','2020-12-02 00:10:00');
Sync the data to Hive:
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database test \
--hive-table t_test
Check the Hive table data:
hive>
> select * from t_test;
OK
1 abc 2020-12-01 00:09:00.0
2 def 2020-12-02 00:10:00.0
Time taken: 0.452 seconds, Fetched: 2 row(s)
Create a Hive backup table to make it easy to roll back between tests:
create table t_test_bak as select * from t_test;
hive>
> select * from t_test_bak;
OK
1 abc 2020-12-01 00:09:00.0
2 def 2020-12-02 00:10:00.0
Time taken: 0.059 seconds, Fetched: 2 row(s)
hive>
-- To roll back, truncate the table first and then insert from the backup
truncate table t_test;
insert into t_test select * from t_test_bak;
5.2.1 Import in append mode
When importing a table in append mode, new rows are continually added with increasing row id values. Specify the column containing the row's id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
This mode assumes all changes are inserts, which suits append-only data such as logs; it does not pick up rows that were updated in place.
Modify the MySQL source table data
Insert two new rows and update the row with id = 2:
mysql> insert into t_test values (3,'ccc','2020-12-07 00:09:00');
Query OK, 1 row affected (0.01 sec)
mysql> insert into t_test values (4,'ddd','2020-12-07 00:10:00');
Query OK, 1 row affected (0.00 sec)
mysql> update t_test set name='bbb',last_update_datetime = '2020-12-07 08:00:00' where id = 2;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from t_test;
+----+------+----------------------+
| id | name | last_update_datetime |
+----+------+----------------------+
| 1 | abc | 2020-12-01 00:09:00 |
| 2 | bbb | 2020-12-07 08:00:00 |
| 3 | ccc | 2020-12-07 00:09:00 |
| 4 | ddd | 2020-12-07 00:10:00 |
+----+------+----------------------+
4 rows in set (0.00 sec)
Sqoop incremental sync
--target-dir is the HDFS path where the table's data is stored
--last-value is the maximum id from the previous sync
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--incremental append \
--check-column id \
--last-value 2 \
-m 1
Data verification
The two newly inserted rows were synced, but the updated record with id = 2 was not.
hive>
>
> select * from t_test;
OK
1 abc 2020-12-01 00:09:00.0
2 def 2020-12-02 00:10:00.0
3 ccc 2020-12-07 00:09:00.0
4 ddd 2020-12-07 00:10:00.0
Time taken: 0.1 seconds, Fetched: 4 row(s)
Restore the Hive table data
Re-sync, this time using the last_update_datetime timestamp
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--incremental append \
--check-column last_update_datetime \
--last-value '2020-12-02 00:10:00' \
-m 1
Verify the data:
The data is now duplicated: the old and new versions of the row with id = 2 both exist.
hive>
>
> select * from t_test;
OK
1 abc 2020-12-01 00:09:00.0
2 def 2020-12-02 00:10:00.0
2 bbb 2020-12-07 08:00:00.0
3 ccc 2020-12-07 00:09:00.0
4 ddd 2020-12-07 00:10:00.0
Time taken: 0.067 seconds, Fetched: 5 row(s)
5.2.2 Import in lastmodified mode
Sqoop's other incremental strategy, lastmodified mode, is meant for source tables whose rows may be updated in place, with each update setting a last-modified column to the current timestamp; a saved-job sketch for running such imports on a schedule is given at the end of 5.2.2.2. From the official documentation:
An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
5.2.2.1 --incremental lastmodified with --append
This variant also only performs incremental inserts.
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2020-12-02 00:10:00' \
--m 1 \
--append
Test result:
Every row with a last_update_datetime later than --last-value was inserted, including a second copy of the row with id = 2.
hive>
>
> select * from t_test;
OK
1 abc 2020-12-01 00:09:00.0
2 def 2020-12-02 00:10:00.0
2 bbb 2020-12-07 08:00:00.0
3 ccc 2020-12-07 00:09:00.0
4 ddd 2020-12-07 00:10:00.0
Time taken: 0.07 seconds, Fetched: 5 row(s)
5.2.2.2 --incremental lastmodified with --merge-key
The --merge-key option behaves like an Oracle MERGE statement: rows that already exist are updated, and rows that do not are inserted.
sqoop import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
--password abc123 \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir '/user/hive/warehouse/test.db/t_test' \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2020-12-02 00:10:00' \
--m 1 \
--merge-key id
Test result:
The row with id = 2 was updated, and rows 3 and 4 were inserted.
For business tables that carry a last-update timestamp, this approach keeps the data warehouse consistent with the source data.
> select * from t_test;
OK
1 abc 2020-12-01 00:09:00.0
2 bbb 2020-12-07 08:00:00.0
3 ccc 2020-12-07 00:09:00.0
4 ddd 2020-12-07 00:10:00.0
Time taken: 0.064 seconds, Fetched: 4 row(s)
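As the official documentation quoted in 5.2.2 recommends, a recurring incremental import is best wrapped in a saved job so that Sqoop records the new --last-value after every successful run. A minimal sketch (the job name t_test_incr is arbitrary, and -P prompts for the password when the job is executed):
# Sketch: save the lastmodified/merge-key import as a Sqoop job; the stored
# --last-value is updated automatically after each successful execution.
sqoop job --create t_test_incr -- import \
--connect jdbc:mysql://10.31.1.122:3306/test \
--username root \
-P \
--table t_test \
--fields-terminated-by '\0001' \
--target-dir /user/hive/warehouse/test.db/t_test \
--check-column last_update_datetime \
--incremental lastmodified \
--last-value '2020-12-02 00:10:00' \
--merge-key id \
-m 1
# Run the saved job (for example from a scheduler):
sqoop job --exec t_test_incr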
References
1. Hadoop: The Definitive Guide
2. http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html