Big Data Technology: DataX
Version: V1.0
Chapter 1 Overview
1.1 What is DataX
DataX is Alibaba's open-source offline synchronization tool for heterogeneous data sources. It aims to provide stable, efficient data synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
1.2 DataX design
To solve the heterogeneous-source synchronization problem, DataX turns the complex mesh of point-to-point sync links into a star-shaped topology, with DataX acting as the central transport hub that connects the data sources. To bring in a new data source, you only need to connect that source to DataX; it can then synchronize seamlessly with every source already supported.
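The payoff of the star design can be seen with a little arithmetic: with n source systems and m target systems, a point-to-point mesh needs a dedicated link per pair, while a hub like DataX only needs one Reader or Writer plugin per system. A minimal sketch of that counting argument (plugin names taken from the examples in this document; this is an illustration, not DataX's actual plugin registry):

```python
# Star vs. mesh: with n readers and m writers, a point-to-point mesh
# needs n * m dedicated sync links, while a hub like DataX only needs
# n + m plugins, because every reader can be paired with every writer.
readers = ["mysqlreader", "oraclereader", "hdfsreader", "mongodbreader"]
writers = ["mysqlwriter", "hdfswriter", "streamwriter"]

mesh_links = len(readers) * len(writers)    # 12 dedicated links
star_plugins = len(readers) + len(writers)  # 7 plugins

# Any reader/writer pair forms a valid job:
jobs = [(r, w) for r in readers for w in writers]
print(mesh_links, star_plugins, len(jobs))  # 12 7 12
```

Adding a fifth source costs one new plugin in the star model, versus up to three new links in the mesh.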
1.3 Framework design
1.4 How it runs
Chapter 2 Quick Start
2.1 Official links
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source: https://github.com/alibaba/DataX
2.2 Prerequisites
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (2.6.x recommended)
2.3 Installation
1) Upload the downloaded datax.tar.gz to /opt/software on hadoop102
[atguigu@hadoop102 software]$ ls
datax.tar.gz
2) Extract datax.tar.gz to /opt/module
[atguigu@hadoop102 software]$ tar -zxvf datax.tar.gz -C /opt/module/
3) Run the self-check job
[atguigu@hadoop102 software]$ cd /opt/module/datax/bin/
[atguigu@hadoop102 bin]$ python datax.py /opt/module/datax/job/job.json
Chapter 3 Use Cases
3.1 Reading from a stream and printing to the console
1) View the configuration template
[atguigu@hadoop102 bin]$ python datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
2) Write a configuration file based on the template
[atguigu@hadoop102 job]$ vim stream2stream.json
Enter the following:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
3) Run the job
[atguigu@hadoop102 job]$ /opt/module/datax/bin/datax.py /opt/module/datax/job/stream2stream.json
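Hand-editing job JSON is error-prone; since DataX is driven by a Python launcher anyway, it is natural to generate the config from a script so that the file is guaranteed to be valid JSON. A minimal sketch that builds the same stream2stream job with the standard json module (values taken from the example above; the output path is an assumption):

```python
import json

# Build the stream2stream job shown above as a Python dict and write it
# out; json.dump guarantees the file is syntactically valid JSON.
job = {
    "job": {
        "content": [{
            "reader": {
                "name": "streamreader",
                "parameter": {
                    "sliceRecordCount": 10,
                    "column": [
                        {"type": "long", "value": "10"},
                        {"type": "string", "value": "hello,DataX"},
                    ],
                },
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {"encoding": "UTF-8", "print": True},
            },
        }],
        "setting": {"speed": {"channel": 1}},
    }
}

with open("stream2stream.json", "w") as f:
    json.dump(job, f, indent=4)

# Reload to confirm the round trip preserves the structure.
with open("stream2stream.json") as f:
    loaded = json.load(f)
print(loaded["job"]["content"][0]["reader"]["name"])  # streamreader
```

The generated file can then be passed to datax.py exactly as above.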
3.2 Reading data from MySQL into HDFS
3.2.1 View the official template
[atguigu@hadoop102 ~]$ python /opt/module/datax/bin/datax.py -r mysqlreader -w hdfswriter
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [],
                        "compress": "",
                        "defaultFS": "",
                        "fieldDelimiter": "",
                        "fileName": "",
                        "fileType": "",
                        "path": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
mysqlreader parameter notes:
hdfswriter parameter notes:
3.2.2 Prepare the data
1) Create the student table
mysql> create database datax;
mysql> use datax;
mysql> create table student(id int,name varchar(20));
2) Insert data
mysql> insert into student values(1001,'zhangsan'),(1002,'lisi'),(1003,'wangwu');
3.2.3 Write the configuration file
[atguigu@hadoop102 datax]$ vim /opt/module/datax/job/mysql2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/datax"
                                ],
                                "table": [
                                    "student"
                                ]
                            }
                        ],
                        "username": "root",
                        "password": "000000"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "student.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
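Configs like the one above differ only in table name, columns, and connection details, so it can be handy to fill the template programmatically. A sketch of such a generator (the function and its "everything as string" column typing are assumptions for illustration, not a DataX rule):

```python
import json

def mysql2hdfs_job(table, columns, username, password, mysql_url, default_fs):
    """Fill the mysqlreader -> hdfswriter template above for one table.

    HdfsWriter needs an explicit type per column; here every column is
    written as a string for simplicity (a simplifying assumption).
    """
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": columns,
                        "connection": [{"jdbcUrl": [mysql_url],
                                        "table": [table]}],
                        "username": username,
                        "password": password,
                    },
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [{"name": c, "type": "string"}
                                   for c in columns],
                        "defaultFS": default_fs,
                        "fieldDelimiter": "\t",
                        "fileName": table + ".txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append",
                    },
                },
            }],
            "setting": {"speed": {"channel": "1"}},
        }
    }

job = mysql2hdfs_job("student", ["id", "name"], "root", "000000",
                     "jdbc:mysql://hadoop102:3306/datax",
                     "hdfs://hadoop102:9000")
print(json.dumps(job)[:60])
```

Writing the returned dict with json.dump reproduces the mysql2hdfs.json above (modulo the column types).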
3.2.4 Run the job
[atguigu@hadoop102 datax]$ bin/datax.py job/mysql2hdfs.json
2019-05-17 16:02:16.581 [job-0] INFO  JobContainer -
Job start time        : 2019-05-17 16:02:04
Job end time          : 2019-05-17 16:02:16
Total elapsed time    : 12s
Average traffic       : 3B/s
Record write speed    : 0rec/s
Records read          : 3
Read/write failures   : 0
3.2.5 Check HDFS
Note: at run time HdfsWriter appends a random suffix to the configured file name; each writer thread writes to its own suffixed file.
3.3 Reading HDFS data into MySQL
1) Rename the file uploaded in the previous example
[atguigu@hadoop102 datax]$ hadoop fs -mv /student.txt* /student.txt
2) View the official template
[atguigu@hadoop102 datax]$ python bin/datax.py -r hdfsreader -w mysqlwriter
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [],
                        "defaultFS": "",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ",",
                        "fileType": "orc",
                        "path": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "preSql": [],
                        "session": [],
                        "username": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
3) Create the configuration file
[atguigu@hadoop102 datax]$ vim job/hdfs2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": ["*"],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\t",
                        "fileType": "text",
                        "path": "/student.txt"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/datax",
                                "table": ["student2"]
                            }
                        ],
                        "password": "000000",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
4) Create the student2 table in the MySQL datax database
mysql> use datax;
mysql> create table student2(id int,name varchar(20));
5) Run the job
[atguigu@hadoop102 datax]$ bin/datax.py job/hdfs2mysql.json
2019-05-17 16:21:53.616 [job-0] INFO  JobContainer -
Job start time        : 2019-05-17 16:21:41
Job end time          : 2019-05-17 16:21:53
Total elapsed time    : 11s
Average traffic       : 3B/s
Record write speed    : 0rec/s
Records read          : 3
Read/write failures   : 0
6) Query the student2 table
mysql> select * from student2;
+------+----------+
| id   | name     |
+------+----------+
| 1001 | zhangsan |
| 1002 | lisi     |
| 1003 | wangwu   |
+------+----------+
3 rows in set (0.00 sec)
Chapter 4 Oracle
The operations below are performed as the root account unless noted otherwise.
4.1 Oracle overview
Oracle Database (also called Oracle RDBMS, or simply Oracle) is a relational database management system from Oracle Corporation and has long been a leading product in the database field. It is arguably the most widely used relational database management system in the world: portable, easy to use, and feature-rich, it fits environments from micro to large systems and provides an efficient, reliable, high-throughput database solution.
4.2 Pre-installation preparation
4.2.1 Check dependencies
[root@hadoop102 ~]# rpm -q binutils compat-libstdc++-33 elfutils-libelf elfutils-libelf-devel gcc gcc-c++ glibc glibc-common glibc-devel glibc-headers kernel-headers ksh libaio libaio-devel libgcc libgomp libstdc++ libstdc++-devel make numactl-devel sysstat unixODBC unixODBC-devel
4.2.2 Install dependencies
Install whichever packages are missing, for example:
[root@hadoop102 ~]# yum -y install elfutils-libelf-devel gcc gcc-c++ libaio-devel libstdc++-devel numactl-devel unixODBC unixODBC-devel
4.2.3 Upload and extract the installer
[root@hadoop102 software]# ls
Oracle_Database_12c_Release2_linuxx64.zip
[root@hadoop102 software]# unzip Oracle_Database_12c_Release2_linuxx64.zip -d /opt/module/
4.2.4 Configure the user and group
The Oracle installer may not be launched by root, so set up a dedicated oracle user.
1) Create the sql group
[root@hadoop102 software]# groupadd sql
2) Create the oracle user and put it in the sql group
[root@hadoop102 software]# useradd oracle -g sql
3) Set the oracle user's login password; once set, you can log in as oracle
[root@hadoop102 software]# passwd oracle
4) Change the owner and group of the installer directory
[root@hadoop102 module]# chown -R oracle:sql /opt/module/database/
[root@hadoop102 module]# chgrp -R sql /opt/module/database/
4.2.5 Edit sysctl.conf
[root@hadoop102 module]# vim /etc/sysctl.conf
Remove the existing contents and add:
net.ipv4.ip_local_port_range = 9000 65500
fs.file-max = 6815744
kernel.shmall = 10523004
kernel.shmmax = 6465333657
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576
Parameter notes:
net.ipv4.ip_local_port_range: the range of local IPv4 ports available
fs.file-max: the maximum number of file handles, i.e. how many files may be open on the Linux system
kernel.shmall: the total shared memory the system may use at once (in pages)
kernel.shmmax: the maximum size of a single shared memory segment (in bytes)
kernel.shmmni: the system-wide maximum number of shared memory segments
kernel.sem: the semaphore settings
net.core.rmem_default: the default TCP receive window size (bytes)
net.core.wmem_default: the default TCP send window size (bytes)
net.core.rmem_max: the maximum TCP receive window size (bytes)
net.core.wmem_max: the maximum TCP send window size (bytes)
fs.aio-max-nr: the maximum number of concurrent asynchronous I/O requests
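After reloading with `sysctl -p`, each of these values can be read back from /proc/sys: the mapping from parameter name to path is mechanical (dots become slashes). A small sketch of the settings above and that mapping (the helper is for illustration; it does not touch the live kernel):

```python
# sysctl parameter names map onto /proc/sys paths mechanically:
# dots become slashes, so kernel.shmmax lives at /proc/sys/kernel/shmmax.
ORACLE_SYSCTL = {
    "net.ipv4.ip_local_port_range": "9000 65500",
    "fs.file-max": "6815744",
    "kernel.shmall": "10523004",
    "kernel.shmmax": "6465333657",
    "kernel.shmmni": "4096",
    "kernel.sem": "250 32000 100 128",
    "net.core.rmem_default": "262144",
    "net.core.wmem_default": "262144",
    "net.core.rmem_max": "4194304",
    "net.core.wmem_max": "1048576",
    "fs.aio-max-nr": "1048576",
}

def proc_path(name):
    """Where a sysctl value can be read back after `sysctl -p`."""
    return "/proc/sys/" + name.replace(".", "/")

# Render the sysctl.conf lines shown above from the dict.
conf_lines = [f"{k} = {v}" for k, v in ORACLE_SYSCTL.items()]
print(conf_lines[0])               # net.ipv4.ip_local_port_range = 9000 65500
print(proc_path("kernel.shmmax"))  # /proc/sys/kernel/shmmax
```

For example, `cat /proc/sys/fs/file-max` should print 6815744 once the new settings are active.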
4.2.6 Edit limits.conf
[root@hadoop102 module]# vim /etc/security/limits.conf
Append to the end of the file:
oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 1024
oracle hard nofile 65536
Reboot the machine for the changes to take effect.
4.3 Installing Oracle
4.3.1 Launch the installer from the VM's graphical desktop
[oracle@hadoop102 ~]$ cd /opt/module/database
[oracle@hadoop102 database]$ ./runInstaller
4.3.2 Install the database
1) Uncheck the highlighted box
2) Choose to install the database software only
3) Choose a single-instance database installation
4) Accept the defaults and continue
5) Set the install location
6) Create the product inventory
7) Set the operating system groups
8) Wait for the install
9) Review the prerequisite checks; fix what the prompts suggest, and click Ignore for anything that cannot be fixed
10) On the summary screen, click Install
11) Follow the prompts
12) Installation complete
4.4 Set environment variables
[oracle@hadoop102 dbhome_1]$ vim /home/oracle/.bash_profile
Add:
#ORACLE_HOME
export ORACLE_HOME=/home/oracle/app/oracle/product/12.2.0/dbhome_1
export PATH=$PATH:$ORACLE_HOME/bin
export ORACLE_SID=orcl
export NLS_LANG=AMERICAN_AMERICA.ZHS16GBK
[oracle@hadoop102 ~]$ source /home/oracle/.bash_profile
4.5 Configure the Oracle listener
4.5.1 Run on the command line
[oracle@hadoop102 ~]$ netca
4.5.2 Choose Add
4.5.3 Set the listener name (default is fine)
4.5.4 Choose the protocol (default)
4.5.5 Set the port (default)
4.5.6 Configure additional listeners (default: no)
4.5.7 Finish
4.6 Create a database
4.6.1 Open the creation wizard
[oracle@hadoop102 ~]$ dbca
4.6.2 Choose Create a database
4.6.3 Choose Advanced configuration
4.6.4 Choose Data Warehouse
4.6.5 Uncheck the boxes shown in the figure
4.6.6 Storage options
4.6.7 Fast recovery options
4.6.8 Select the listener
4.6.9 Configure as shown in the figure
4.6.10 Use automatic memory management
4.6.11 Management options (default)
4.6.12 Set a unified password
4.6.13 Creation options: choose Create database
4.6.14 Summary: click Finish
4.6.15 Wait for the install
4.7 Basic usage
4.7.1 Start and stop the listener
Start:
[oracle@hadoop102 ~]$ lsnrctl start
Stop:
[oracle@hadoop102 ~]$ lsnrctl stop
4.7.2 Open the SQL prompt
[oracle@hadoop102 ~]$ sqlplus
SQL*Plus: Release 12.2.0.1.0 Production on Wed May 29 17:08:05 2019
Copyright (c) 1982, 2016, Oracle.  All rights reserved.
Enter user-name: system
Enter password: (enter the unified password configured earlier)
Last Successful login time: Wed May 29 2019 13:03:39 +08:00
Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
SQL>
4.7.3 Create a user and grant privileges
SQL> create user atguigu identified by "000000";
User created.
SQL> grant create session,create table,create view,create sequence,unlimited tablespace to atguigu;
Grant succeeded.
4.7.4 Log in as the atguigu account and create a table
SQL> create TABLE student(id INTEGER,name VARCHAR2(20));
SQL> insert into student values (1,'zhangsan');
SQL> select * from student;
        ID  NAME
---------- ----------------------------------------
         1  zhangsan
Note: after installation, rebooting the machine may produce an "ORACLE not available" error. Fix it as follows:
[oracle@hadoop102 ~]$ sqlplus / as sysdba
SQL> startup
SQL> conn atguigu
Enter password:
4.8 SQL differences between Oracle and MySQL
- Integer types: Oracle number(N)/integer; MySQL int/integer
- Floating-point types: Oracle float; MySQL float/double
- String types: Oracle varchar2(N); MySQL varchar(N)
- NULL: in Oracle '' is treated as NULL; in MySQL null and '' are different
- Pagination: Oracle rownum; MySQL limit
- Double quotes: heavily restricted in Oracle and generally avoided; in MySQL they behave like single quotes
- Price: Oracle is closed-source and paid; MySQL is open-source and free
- Auto-incrementing primary keys: not supported in Oracle; supported in MySQL
- if not exists: not supported in Oracle; supported in MySQL
- auto_increment: not supported in Oracle; supported in MySQL
- create database: not supported in Oracle; supported in MySQL
- select * from table as t: not supported in Oracle; supported in MySQL
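The pagination row above is the difference people hit most often; it can be made concrete with two query strings for "the first 10 rows of student". These SQL strings are illustrative only, using the table from the examples in this chapter:

```python
# MySQL pagination is a single LIMIT clause.
mysql_page = "SELECT id, name FROM student ORDER BY id LIMIT 10 OFFSET 0"

# Classic Oracle pagination filters on rownum. rownum is assigned before
# ORDER BY takes effect, hence the nested subquery.
oracle_page = (
    "SELECT id, name FROM "
    "(SELECT s.*, rownum rn FROM (SELECT * FROM student ORDER BY id) s) "
    "WHERE rn <= 10"
)

print(mysql_page)
print(oracle_page)
```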
4.9 DataX cases
4.9.1 Reading data from Oracle into MySQL
1) Create the table in MySQL
[oracle@hadoop102 ~]$ mysql -uroot -p000000
mysql> create database oracle;
mysql> use oracle;
mysql> create table student(id int,name varchar(20));
2) Write the DataX configuration file
[oracle@hadoop102 ~]$ vim /opt/module/datax/job/oracle2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@hadoop102:1521:orcl"],
                                "table": ["student"]
                            }
                        ],
                        "password": "000000",
                        "username": "atguigu"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/oracle",
                                "table": ["student"]
                            }
                        ],
                        "password": "000000",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
3) Run the job
[oracle@hadoop102 ~]$ /opt/module/datax/bin/datax.py /opt/module/datax/job/oracle2mysql.json
Check the result:
mysql> select * from student;
+------+----------+
| id   | name     |
+------+----------+
|    1 | zhangsan |
+------+----------+
4.9.2 Reading Oracle data into HDFS
1) Write the configuration file
[oracle@hadoop102 datax]$ vim job/oracle2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@hadoop102:1521:orcl"],
                                "table": ["student"]
                            }
                        ],
                        "password": "000000",
                        "username": "atguigu"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "oracle.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
2) Run the job
[oracle@hadoop102 datax]$ bin/datax.py job/oracle2hdfs.json
3) Check the result in HDFS
Chapter 5 MongoDB
5.1 What is MongoDB
MongoDB is an open-source database system written in C++ and based on distributed file storage. It aims to provide a scalable, high-performance data storage solution for web applications. MongoDB stores data as documents whose structure consists of key-value pairs. MongoDB documents are similar to JSON objects; field values may contain other documents, arrays, and arrays of documents.
5.2 MongoDB pros and cons
5.3 Basic concepts
SQL term / MongoDB term / meaning:
- database / database: database
- table / collection: table / collection
- row / document: row / document
- column / field: column / field
- index / index: index
- table joins / (none): MongoDB does not support joins
- primary key / primary key: MongoDB automatically makes the _id field the primary key
The example below also gives a more intuitive view of these concepts in Mongo:
5.4 Installation
5.4.1 Download
https://www.mongodb.com/download-center#community
5.4.2 Install
1) Upload the tarball to the VM
[atguigu@hadoop102 software]$ ls
mongodb-linux-x86_64-4.0.10.tgz
2) Extract
[atguigu@hadoop102 software]$ tar -zxvf mongodb-linux-x86_64-4.0.10.tgz -C /opt/module/
3) Rename the directory
[atguigu@hadoop102 module]$ mv mongodb-linux-x86_64-4.0.10/ mongodb
4) Create the data directory
MongoDB stores its data in the db subdirectory of the data directory, but this directory is not created automatically during installation, so create the data directory and the db directory inside it by hand.
[atguigu@hadoop102 module]$ sudo mkdir -p /data/db
[atguigu@hadoop102 mongodb]$ sudo chmod 777 -R /data/db/
5) Start the MongoDB server
[atguigu@hadoop102 mongodb]$ bin/mongod
6) Open the shell
[atguigu@hadoop102 ~]$ cd /opt/module/mongodb/
[atguigu@hadoop102 mongodb]$ bin/mongo
MongoDB shell version v4.0.10
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("66e2b331-e3c3-4b81-b676-1ee4bac87abf") }
MongoDB server version: 4.0.10
Welcome to the MongoDB shell.
...
>
5.5 Concepts in detail
5.5.1 Databases
A single MongoDB instance can host multiple databases; the shell's default database is "test", and data is stored under the data directory. Each database has its own collections and permissions, and different databases are kept in different files.
1) List all databases
> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB
admin: from a permissions standpoint, this is the "root" database. A user added to this database automatically inherits permissions on every database, and some server-wide commands, such as listing all databases or shutting down the server, can only be run from it.
local: this data is never replicated; it can store collections limited to the local server.
config: used internally when Mongo is sharded, to hold sharding metadata.
2) Show the current database
> db
test
3) Switch databases
> use local
switched to db local
> db
local
5.5.2 Documents
A document is a set of key-value pairs. MongoDB documents do not need to share the same fields, and the same field does not need to hold the same data type in every document. This is a major difference from relational databases and one of MongoDB's most distinctive features.
A simple example:
{"name":"atguigu"}
Note:
1. Key/value pairs in a document are ordered.
2. MongoDB is type-sensitive and case-sensitive.
3. A document cannot contain duplicate keys.
4. Keys are strings; with a few exceptions, any UTF-8 character is allowed in a key.
5.5.3 Collections
A collection is a group of MongoDB documents, analogous to a table in MySQL.
Collections live inside databases and have no fixed structure, which means you can insert data of different formats and types into the same collection; in practice, though, the documents inserted into one collection usually share some structure.
Let's look at how to create a collection.
MongoDB uses the createCollection() method to create collections.
Syntax:
db.createCollection(name, options)
Parameters:
name: the name of the collection to create
options: optional; specifies memory-size and index options
options may contain the following fields:
- capped (boolean, optional): if true, creates a capped collection, i.e. a fixed-size collection that automatically overwrites its oldest documents when it reaches the maximum size. When true, the size parameter must also be specified.
- autoIndexId (boolean, optional): if true, automatically creates an index on the _id field. Defaults to false.
- size (number, optional): the maximum size in bytes for a capped collection. Required when capped is true.
- max (number, optional): the maximum number of documents allowed in a capped collection.
Example 1: create a collection named atguigu in the test database
> use test
switched to db test
> db.createCollection("atguigu")
{ "ok" : 1 }
> show collections
atguigu
//insert data
> db.atguigu.insert({"name":"atguigu","url":"www.atguigu.com"})
WriteResult({ "nInserted" : 1 })
//query data
> db.atguigu.find()
{ "_id" : ObjectId("5d0314ceecb77ee2fb2d7566"), "name" : "atguigu", "url" : "www.atguigu.com" }
Notes:
An ObjectId acts like a unique primary key that is fast to generate and sortable. It is 12 bytes:
- the first 4 bytes are the Unix timestamp at creation
- the next 3 bytes are a machine identifier
- the next 2 bytes are the process id (PID)
- the last 3 bytes are random
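That 12-byte layout can be checked directly against the ObjectId from the find() output above. A sketch that slices the 24-character hex string by the layout (plain string slicing; no MongoDB driver required):

```python
import datetime

# Slice an ObjectId hex string by the 12-byte layout described above:
# 4 bytes timestamp, 3 bytes machine id, 2 bytes pid, 3 bytes random.
oid = "5d0314ceecb77ee2fb2d7566"  # the _id from the find() output above

timestamp = int(oid[0:8], 16)   # first 4 bytes: Unix creation time
machine   = oid[8:14]           # next 3 bytes: machine identifier (hex)
pid       = int(oid[14:18], 16) # next 2 bytes: process id
tail      = int(oid[18:24], 16) # last 3 bytes: random

created = datetime.datetime.utcfromtimestamp(timestamp)
print(timestamp, created.date())  # 1560483022 2019-06-14
```

Because the timestamp occupies the leading bytes, sorting ObjectIds lexicographically roughly sorts documents by creation time.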
Example 2: create a capped collection named mycol
> db.createCollection("mycol",{ capped : true, autoIndexId : true, size : 6142800, max : 1000})
> show tables;
atguigu
mycol
Example 3: implicit collection creation
In MongoDB you do not have to create a collection explicitly: inserting a document creates it automatically.
> db.mycol2.insert({"name":"atguigu"})
WriteResult({ "nInserted" : 1 })
> show collections
atguigu
mycol
mycol2
Example 4: drop a collection
> db.mycol2.drop()
true
> show tables;
atguigu
mycol
5.6 DataX import/export cases
5.6.1 Reading MongoDB data into HDFS
1) Write the configuration file
[atguigu@hadoop102 datax]$ vim job/mongdb2hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["127.0.0.1:27017"],
                        "collectionName": "atguigu",
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "dbName": "test"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "mongo.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
2) mongodbreader parameter notes
- address: the MongoDB address(es). Because MongoDB may be a cluster, the ip:port entries are given as a JSON array. [required]
- userName: MongoDB user name. [optional]
- userPassword: MongoDB password. [optional]
- collectionName: the MongoDB collection name. [required]
- column: the document fields to read. [required]
- name: a column's name. [required]
- type: a column's type. [optional]
- splitter: MongoDB supports array types but the DataX framework itself does not, so arrays read from MongoDB are merged into a single string using this separator. [optional]
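What the splitter does can be shown in a couple of lines: before a DataX record is handed to the writer, an array field from a MongoDB document is flattened into one delimited string. A sketch of the idea (illustrative only, not mongodbreader's actual code; the sample document is hypothetical):

```python
# mongodbreader's "splitter" in miniature: DataX records are flat, so an
# array field from a MongoDB document is joined into a single string.
doc = {"name": "atguigu", "tags": ["bigdata", "datax", "mongodb"]}

splitter = ","
record = {k: (splitter.join(v) if isinstance(v, list) else v)
          for k, v in doc.items()}
print(record["tags"])  # bigdata,datax,mongodb
```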
3) Run the job
[atguigu@hadoop102 datax]$ bin/datax.py job/mongdb2hdfs.json
4) Check the result
5.6.2 Reading MongoDB data into MySQL
1) Create the table in MySQL
mysql> create table atguigu(name varchar(20),url varchar(20));
2) Write the DataX configuration file
[atguigu@hadoop102 datax]$ vim job/mongodb2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["127.0.0.1:27017"],
                        "collectionName": "atguigu",
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "dbName": "test"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/test",
                                "table": ["atguigu"]
                            }
                        ],
                        "password": "000000",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
3) Run the job
[atguigu@hadoop102 datax]$ bin/datax.py job/mongodb2mysql.json
4) Check the result
mysql> select * from atguigu;
+---------+-----------------+
| name    | url             |
+---------+-----------------+
| atguigu | www.atguigu.com |
+---------+-----------------+