2020-03-18

Big Data Technology: DataX


Version: V1.0

Chapter 1 Overview

1.1 What is DataX


DataX is an open-source offline synchronization tool for heterogeneous data sources from Alibaba. It aims to provide stable and efficient data synchronization between a wide range of heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.

1.2 The Design of DataX

To solve the heterogeneous data source synchronization problem, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX acting as the intermediate transport hub that connects the various data sources. When a new data source needs to be integrated, it only has to be connected to DataX, after which it can synchronize data seamlessly with all of the existing data sources.

1.3 Framework Design

1.4 How It Runs

Chapter 2 Quick Start

2.1 Official Addresses

Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz


Source code: https://github.com/alibaba/DataX
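If hadoop102 has outbound network access, the tarball can also be pulled down directly with wget instead of uploading it by hand (a minimal sketch; /opt/software is simply the upload directory used in the installation steps below):

[atguigu@hadoop102 software]$ wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz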

2.2 Prerequisites

- Linux

- JDK (1.8 or later; 1.8 is recommended)

- Python (Python 2.6.x is recommended; a quick version check follows this list)
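A quick way to confirm the JDK and Python prerequisites on the target host (a minimal check; the exact version strings will differ between installations):

[atguigu@hadoop102 ~]$ java -version
[atguigu@hadoop102 ~]$ python -V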

2.3 Installation

1) Upload the downloaded datax.tar.gz to /opt/software on hadoop102

[atguigu@hadoop102 software]$ ls
datax.tar.gz

2) Extract datax.tar.gz to /opt/module

[atguigu@hadoop102 software]$ tar -zxvf datax.tar.gz -C /opt/module/

3) Run the self-check job

[atguigu@hadoop102 bin]$ cd /opt/module/datax/bin/
[atguigu@hadoop102 bin]$ python datax.py /opt/module/datax/job/job.json

Chapter 3 Usage Examples

3.1 Reading Data from a Stream and Printing It to the Console

1) View the configuration template

[atguigu@hadoop102 bin]$ python datax.py -r streamreader -w streamwriter


DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

2) Write a configuration file based on the template

[atguigu@hadoop102 job]$ vim stream2stream.json

Fill in the following content:


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

3) Run

[atguigu@hadoop102 job]$ /opt/module/datax/bin/datax.py /opt/module/datax/job/stream2stream.json

3.2 Reading Data from MySQL and Writing It to HDFS

3.2.1 View the Official Template

[atguigu@hadoop102 ~]$ python /opt/module/datax/bin/datax.py -r mysqlreader -w hdfswriter


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [],
                        "compress": "",
                        "defaultFS": "",
                        "fieldDelimiter": "",
                        "fileName": "",
                        "fileType": "",
                        "path": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

mysqlreader parameter reference:

hdfswriter parameter reference:

3.2.2 Prepare the Data

1) Create the student table

mysql> create database datax;
mysql> use datax;
mysql> create table student(id int,name varchar(20));

2) Insert data

mysql> insert into student values(1001,'zhangsan'),(1002,'lisi'),(1003,'wangwu');

3.2.3 Write the Configuration File

[atguigu@hadoop102 datax]$ vim /opt/module/datax/job/mysql2hdfs.json


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://hadoop102:3306/datax"
                                ],
                                "table": [
                                    "student"
                                ]
                            }
                        ],
                        "username": "root",
                        "password": "000000"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "student.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

3.2.4 Run the Job

[atguigu@hadoop102 datax]$ bin/datax.py job/mysql2hdfs.json


2019-05-17 16:02:16.581 [job-0] INFO  JobContainer -
Job start time              : 2019-05-17 16:02:04
Job end time                : 2019-05-17 16:02:16
Total elapsed time          : 12s
Average traffic             : 3B/s
Record write speed          : 0rec/s
Total records read          : 3
Total read/write failures   : 0

3.2.5 Check HDFS

Note: when HdfsWriter actually runs, it appends a random suffix to the configured file name, and this becomes the actual name of the file written by each task.
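Because of the random suffix, the exact output file name is not known in advance, so a wildcard is the easiest way to inspect it (a minimal sketch, assuming the job above wrote to the HDFS root path / as configured):

[atguigu@hadoop102 datax]$ hadoop fs -ls /
[atguigu@hadoop102 datax]$ hadoop fs -cat /student.txt*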

3.3 Reading HDFS Data and Writing It to MySQL

1) Rename the file written by the previous example

[atguigu@hadoop102 datax]$ hadoop fs -mv /student.txt* /student.txt

2) View the official template

[atguigu@hadoop102 datax]$ python bin/datax.py -r hdfsreader -w mysqlwriter


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [],
                        "defaultFS": "",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ",",
                        "fileType": "orc",
                        "path": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "preSql": [],
                        "session": [],
                        "username": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

3) Create the configuration file

[atguigu@hadoop102 datax]$ vim job/hdfs2mysql.json


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": ["*"],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\t",
                        "fileType": "text",
                        "path": "/student.txt"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/datax",
                                "table": ["student2"]
                            }
                        ],
                        "password": "000000",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

4) Create the student2 table in the datax database in MySQL

mysql> use datax;
mysql> create table student2(id int,name varchar(20));

5) Run the job

[atguigu@hadoop102 datax]$ bin/datax.py job/hdfs2mysql.json


2019-05-17 16:21:53.616 [job-0] INFO  JobContainer -
Job start time              : 2019-05-17 16:21:41
Job end time                : 2019-05-17 16:21:53
Total elapsed time          : 11s
Average traffic             : 3B/s
Record write speed          : 0rec/s
Total records read          : 3
Total read/write failures   : 0

6) Check the student2 table

mysql> select * from student2;
+------+----------+
| id   | name     |
+------+----------+
| 1001 | zhangsan |
| 1002 | lisi     |
| 1003 | wangwu   |
+------+----------+
3 rows in set (0.00 sec)

Chapter 4 Oracle Database

The following operations are performed as the root user.

4.1 Introduction to the Oracle Database

Oracle Database, also known as Oracle RDBMS or simply Oracle, is a relational database management system from Oracle Corporation. It has long been a leading product in the database field. Oracle is arguably the most widely used relational database management system in the world: it is highly portable, easy to use, and feature-rich, and it is suitable for environments of all sizes, from large and mid-sized servers down to small machines. It is an efficient, reliable database solution designed for high throughput.

4.2 Pre-installation Preparation

4.2.1 Check Dependencies

[root@hadoop102 ~]# rpm -q binutils compat-libstdc++-33 elfutils-libelf elfutils-libelf-devel gcc gcc-c++ glibc glibc-common glibc-devel glibc-headers kernel-headers ksh libaio libaio-devel libgcc libgomp libstdc++ libstdc++-devel make numactl-devel sysstat unixODBC unixODBC-devel

4.2.2 Install Dependencies

Install whichever packages are missing, for example:

[root@hadoop102 ~]# yum -y install elfutils-libelf-devel gcc gcc-c++ libaio-devel libstdc++-devel numactl-devel unixODBC unixODBC-devel

4.2.3 Upload and Extract the Installation Package

[root@hadoop102 software]# ls
Oracle_Database_12c_Release2_linuxx64.zip
[root@hadoop102 software]# unzip Oracle_Database_12c_Release2_linuxx64.zip -d /opt/module/

4.2.4 Configure the User and Group

The Oracle installer cannot be started by the root user, so a dedicated oracle user has to be configured.

1) Create the sql group

[root@hadoop102 software]# groupadd sql

2) Create the oracle user and add it to the sql group

[root@hadoop102 software]# useradd oracle -g sql

3) Set a login password for the oracle user; once the password is set, you can log in to the system as oracle

[root@hadoop102 software]# passwd oracle

4) Change the owner and group

[root@hadoop102 module]# chown -R oracle:sql /opt/module/database/
[root@hadoop102 module]# chgrp -R sql /opt/module/database/

4.2.5 Modify sysctl.conf

[root@hadoop102 module]# vim /etc/sysctl.conf

Delete the existing contents and add the following:

net.ipv4.ip_local_port_range = 9000 65500
fs.file-max = 6815744
kernel.shmall = 10523004
kernel.shmmax = 6465333657
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576

Parameter reference:

net.ipv4.ip_local_port_range: the range of local IPv4 ports that may be used.

fs.file-max: the maximum number of file handles, i.e. how many files can be open at once on the Linux system.

kernel.shmall: the total amount of shared memory the system can use at one time (in pages).

kernel.shmmax: the maximum size of a single shared memory segment (in bytes).

kernel.shmmni: the system-wide maximum number of shared memory segments.

kernel.sem: the semaphore settings.

net.core.rmem_default: the default TCP receive window size (bytes).

net.core.wmem_default: the default TCP send window size (bytes).

net.core.rmem_max: the maximum TCP receive window size (bytes).

net.core.wmem_max: the maximum TCP send window size (bytes).

fs.aio-max-nr: the maximum number of concurrent asynchronous I/O requests.
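The new kernel parameters can be loaded without rebooting by re-reading /etc/sysctl.conf (a minimal sketch):

[root@hadoop102 module]# sysctl -p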

4.2.6 Modify limits.conf

[root@hadoop102 module]# vim /etc/security/limits.conf

Append the following at the end of the file:

oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 1024
oracle hard nofile 65536

Reboot the machine for the changes to take effect.
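After the reboot, the new limits can be verified from a session of the oracle user (a minimal check; the numbers should match the values configured above):

[oracle@hadoop102 ~]$ ulimit -Su   # soft nproc
[oracle@hadoop102 ~]$ ulimit -Hu   # hard nproc
[oracle@hadoop102 ~]$ ulimit -Sn   # soft nofile
[oracle@hadoop102 ~]$ ulimit -Hn   # hard nofile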

4.3 Install the Oracle Database

4.3.1 Work from the Virtual Machine's Graphical Desktop

[oracle@hadoop102 ~]$ cd /opt/module/database
[oracle@hadoop102 database]$ ./runInstaller

4.3.2 Install the Database

1) Uncheck the box marked by the red frame

2) Choose to install the database software only

3) Choose single-instance database installation

4) Next (accept the defaults)

5) Set the installation location

6) Create the inventory

7) Operating system group settings

8) Wait for the installation checks

9) Review the verification results and fix what the prompts ask for; if something cannot be fixed, click Ignore

10) Summary: click Install

11) Follow the prompts

12) Installation complete

4.4 Set Environment Variables

[oracle@hadoop102 dbhome_1]$ vim /home/oracle/.bash_profile

Add:

#ORACLE_HOME
export ORACLE_HOME=/home/oracle/app/oracle/product/12.2.0/dbhome_1
export PATH=$PATH:$ORACLE_HOME/bin
export ORACLE_SID=orcl
export NLS_LANG=AMERICAN_AMERICA.ZHS16GBK

[oracle@hadoop102 ~]$ source /home/oracle/.bash_profile
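A quick sanity check that the new variables took effect in the oracle user's shell (a minimal sketch; sqlplus should resolve to a path under $ORACLE_HOME/bin):

[oracle@hadoop102 ~]$ echo $ORACLE_HOME
[oracle@hadoop102 ~]$ which sqlplus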

4.5 Configure the Oracle Listener

4.5.1 Run the Following Command

[oracle@hadoop102 ~]$ netca

4.5.2 Choose Add

4.5.3 Set the Listener Name (the default is fine)

4.5.4 Choose the Protocol (the default is fine)

4.5.5 Set the Port Number (the default is fine)

4.5.6 Configure Additional Listeners (default)

4.5.7 Finish

4.6 Create a Database

4.6.1 Open the Creation Wizard

[oracle@hadoop102 ~]$ dbca

4.6.2 Choose Create a Database

4.6.3 Choose Advanced Configuration

4.6.4 Choose Data Warehouse

4.6.5 Uncheck the Box Shown in the Screenshot

4.6.6 Storage Options

4.6.7 Fast Recovery Options

4.6.8 Select the Listener

4.6.9 Configure as Shown

4.6.10 Use Automatic Memory Management

4.6.11 Management Options (default)

4.6.12 Set a Common Password for All Accounts

4.6.13 Creation Options: Choose Create Database

4.6.14 Summary: Click Finish

4.6.15 Wait for the Installation

4.7 Basic Usage

4.7.1 Start and Stop the Listener Service

Start the service:

[oracle@hadoop102 ~]$ lsnrctl start

Stop the service:

[oracle@hadoop102 ~]$ lsnrctl stop
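The listener's current state can be checked at any time with the status subcommand of the same tool:

[oracle@hadoop102 ~]$ lsnrctl status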

4.7.2 Enter the SQL Command Line

[oracle@hadoop102 ~]$ sqlplus

SQL*Plus: Release 12.2.0.1.0 Production on Wed May 29 17:08:05 2019

Copyright (c) 1982, 2016, Oracle.  All rights reserved.

Enter user-name: system
Enter password: (enter the common password configured earlier)
Last Successful login time: Wed May 29 2019 13:03:39 +08:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production

SQL>

4.7.3 Create a User and Grant Privileges

SQL> create user atguigu identified by 000000;
User created.
SQL> grant create session,create table,create view,create sequence,unlimited tablespace to atguigu;
Grant succeeded.

4.7.4 Log In as atguigu and Create a Table

SQL> create TABLE student(id INTEGER,name VARCHAR2(20));
SQL> insert into student values (1,'zhangsan');
SQL> select * from student;

        ID  NAME
---------- ----------------------------------------
         1  zhangsan

Note: after installation, rebooting the machine may produce an "ORACLE not available" error. The fix is as follows:

[oracle@hadoop102 ~]$ sqlplus / as sysdba
SQL> startup
SQL> conn atguigu
Enter password:

4.8 SQL Differences Between Oracle and MySQL

Item                        | Oracle                                  | MySQL
Integer type                | number(N)/integer                       | int/integer
Floating-point type         | float                                   | float/double
String type                 | varchar2(N)                             | varchar(N)
NULL                        | '' is treated as NULL                   | null and '' are different
Pagination                  | rownum                                  | limit
Double quotes ""            | heavily restricted, generally not used  | treated the same as single quotes
Price                       | closed source, commercial               | open source, free
Auto-increment primary key  | ×                                       | √
if not exists               | ×                                       | √
auto_increment              | ×                                       | √
create database             | ×                                       | √
select * from table as t    | ×                                       | √

4.9 DataX Examples

4.9.1 Reading Data from Oracle into MySQL

1) Create the table in MySQL

[oracle@hadoop102 ~]$ mysql -uroot -p000000
mysql> create database oracle;
mysql> use oracle;
mysql> create table student(id int,name varchar(20));

2) Write the DataX configuration file

[oracle@hadoop102 ~]$ vim /opt/module/datax/job/oracle2mysql.json


{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@hadoop102:1521:orcl"],
                                "table": ["student"]
                            }
                        ],
                        "password": "000000",
                        "username": "atguigu"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/oracle",
                                "table": ["student"]
                            }
                        ],
                        "password": "000000",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

3) Run the command

[oracle@hadoop102 ~]$ /opt/module/datax/bin/datax.py /opt/module/datax/job/oracle2mysql.json

Check the result:

mysql> select * from student;
+------+----------+
| id   | name     |
+------+----------+
|    1 | zhangsan |
+------+----------+

4.9.2 Reading Data from Oracle into HDFS

1) Write the configuration file

[oracle@hadoop102 datax]$ vim job/oracle2hdfs.json

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@hadoop102:1521:orcl"],
                                "table": ["student"]
                            }
                        ],
                        "password": "000000",
                        "username": "atguigu"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "int"
                            },
                            {
                                "name": "name",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "oracle.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

2) Run

[oracle@hadoop102 datax]$ bin/datax.py job/oracle2hdfs.json

3) Check the result in HDFS

Chapter 5 MongoDB

5.1 What is MongoDB

MongoDB is an open-source, document-oriented database system written in C++ and built on distributed file storage. It is designed to provide scalable, high-performance data storage for web applications. MongoDB stores data as documents whose structure consists of key-value pairs. MongoDB documents are similar to JSON objects: field values can contain other documents, arrays, and arrays of documents.

5.2 Advantages and Disadvantages of MongoDB

5.3 Basic Concepts

SQL term/concept | MongoDB term/concept | Explanation
database         | database             | database
table            | collection           | table / collection
row              | document             | record / document
column           | field                | field
index            | index                | index
table joins      | (not supported)      | table joins are not supported by MongoDB
primary key      | primary key          | primary key; MongoDB automatically makes the _id field the primary key

The example in the figure below gives a more intuitive view of some of these MongoDB concepts:

5.4 Installation

5.4.1 Download Address

https://www.mongodb.com/download-center#community

5.4.2 Install

1) Upload the tarball to the virtual machine

[atguigu@hadoop102 software]$ ls
mongodb-linux-x86_64-4.0.10.tgz

2) Extract

[atguigu@hadoop102 software]$ tar -zxvf mongodb-linux-x86_64-4.0.10.tgz -C /opt/module/

3) Rename

[atguigu@hadoop102 module]$ mv mongodb-linux-x86_64-4.0.10/ mongodb

4) Create the data directory

MongoDB stores its data in the db directory under the data directory. This directory is not created automatically during installation, so the data directory, and the db directory inside it, have to be created manually.

[atguigu@hadoop102 module]$ sudo mkdir -p /data/db
[atguigu@hadoop102 mongodb]$ sudo chmod 777 -R /data/db/

5) Start the MongoDB service

[atguigu@hadoop102 mongodb]$ bin/mongod
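Started this way, mongod runs in the foreground and holds the terminal, so the shell in step 6 needs a second terminal. One alternative is to fork the server into the background with an explicit data directory and log file (a minimal sketch; the log path below is only an example, not something the original setup prescribes):

[atguigu@hadoop102 mongodb]$ bin/mongod --dbpath /data/db --logpath /data/db/mongod.log --fork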

6) Enter the MongoDB shell

[atguigu@hadoop102 ~]$ cd /opt/module/mongodb/
[atguigu@hadoop102 mongodb]$ bin/mongo

MongoDB shell version v4.0.10
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("66e2b331-e3c3-4b81-b676-1ee4bac87abf") }
MongoDB server version: 4.0.10
Welcome to the MongoDB shell.
……
>

5.5 Basic Concepts in Detail

5.5.1 Databases

A single MongoDB instance can host multiple independent databases, each with its own collections and permissions; different databases are also stored in different files. The default database in the mongo shell is "test", and the data files live in the data directory.

1) Show all databases

> show dbs
admin   0.000GB
config  0.000GB
local   0.000GB

admin: from a permissions point of view, this is the "root" database. A user added to this database automatically inherits permissions on every database. Certain server-side commands, such as listing all databases or shutting down the server, can only be run from this database.

local: this data is never replicated; it can be used to store collections that are limited to the local server.

config: when MongoDB is used with sharding, the config database is used internally to store sharding metadata.

2) Show the current database

> db
test

3) Switch databases

> use local
switched to db local
> db
local

5.5.2 Documents

A document is a set of key-value pairs. MongoDB documents do not need to share the same set of fields, and the same field does not need to hold the same data type in every document. This is a major difference from relational databases and one of MongoDB's most distinctive features.

A simple example:

{"name":"atguigu"}

Notes:

1. Key/value pairs in a document are ordered.

2. MongoDB is type-sensitive and case-sensitive.

3. A MongoDB document cannot contain duplicate keys.

4. Document keys are strings. With a few exceptions, keys may use any UTF-8 character.

5.5.3 Collections

A collection is a group of MongoDB documents, similar to a table in MySQL.

Collections live inside databases and have no fixed schema, which means you can insert data of different formats and types into the same collection; in practice, though, the data inserted into one collection usually has some relationship.

Let's look at how to create a collection:

In MongoDB, collections are created with the createCollection() method.

Syntax:

db.createCollection(name, options)

Parameters:

name: the name of the collection to create

options: optional; specifies options such as memory size and indexing

options can contain the following fields:

Field       | Type               | Description
capped      | Boolean (optional) | If true, creates a capped collection: a fixed-size collection that automatically overwrites its oldest documents when it reaches its maximum size. When true, the size field must also be specified.
autoIndexId | Boolean (optional) | If true, automatically creates an index on the _id field. Defaults to false.
size        | Number (optional)  | Specifies the maximum size of a capped collection (in bytes). Required when capped is true.
max         | Number (optional)  | Specifies the maximum number of documents allowed in a capped collection.

Example 1: create a collection named atguigu in the test database

> use test
switched to db test
> db.createCollection("atguigu")
{ "ok" : 1 }
> show collections
atguigu

// insert data
> db.atguigu.insert({"name":"atguigu","url":"www.atguigu.com"})
WriteResult({ "nInserted" : 1 })
// query the data
> db.atguigu.find()
{ "_id" : ObjectId("5d0314ceecb77ee2fb2d7566"), "name" : "atguigu", "url" : "www.atguigu.com" }

Explanation:

ObjectId is similar to a unique primary key; it can be generated and sorted quickly. It is 12 bytes long:

the first 4 bytes are the creation Unix timestamp

the next 3 bytes are a machine identifier

the following 2 bytes are the process id (PID)

the last 3 bytes are a random number

Example 2: create a capped collection mycol

> db.createCollection("mycol",{ capped : true,autoIndexId : true,size : 6142800, max : 1000})
> show tables;
atguigu
mycol

Example 3: automatic collection creation

In MongoDB you do not have to create a collection explicitly: when you insert documents, MongoDB creates the collection automatically.

> db.mycol2.insert({"name":"atguigu"})
WriteResult({ "nInserted" : 1 })
> show collections
atguigu
mycol
mycol2

Example 4: drop a collection

> db.mycol2.drop()
true
> show tables;
atguigu
mycol

5.6 DataX Import and Export Examples

5.6.1 Reading MongoDB Data into HDFS

1) Write the configuration file

[atguigu@hadoop102 datax]$ vim job/mongdb2hdfs.json

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["127.0.0.1:27017"],
                        "collectionName": "atguigu",
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "dbName": "test"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://hadoop102:9000",
                        "fieldDelimiter": "\t",
                        "fileName": "mongo.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

2) mongodbreader parameter reference

address: the MongoDB address information. Because MongoDB may be deployed as a cluster, the ip:port entries have to be given as a JSON array. [Required]

userName: the MongoDB user name. [Optional]

userPassword: the MongoDB password. [Optional]

collectionName: the MongoDB collection name. [Required]

column: the document fields to read from MongoDB. [Required]

name: the name of a column. [Required]

type: the type of a column. [Optional]

splitter: MongoDB supports array types but the DataX framework itself does not, so array values read from MongoDB are joined into a single string using this separator. [Optional]

3) Run

[atguigu@hadoop102 datax]$ bin/datax.py job/mongdb2hdfs.json

4) Check the result
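As with the earlier HdfsWriter jobs, the output file name carries a random suffix, so a wildcard works for a quick look (a minimal sketch):

[atguigu@hadoop102 datax]$ hadoop fs -cat /mongo.txt*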

5.6.2 Reading MongoDB Data into MySQL

1) Create the table in MySQL

mysql> create table atguigu(name varchar(20),url varchar(20));

2) Write the DataX configuration file

[atguigu@hadoop102 datax]$ vim job/mongodb2mysql.json

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mongodbreader",
                    "parameter": {
                        "address": ["127.0.0.1:27017"],
                        "collectionName": "atguigu",
                        "column": [
                            {
                                "name": "name",
                                "type": "string"
                            },
                            {
                                "name": "url",
                                "type": "string"
                            }
                        ],
                        "dbName": "test"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://hadoop102:3306/test",
                                "table": ["atguigu"]
                            }
                        ],
                        "password": "000000",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

3) Run

[atguigu@hadoop102 datax]$ bin/datax.py job/mongodb2mysql.json

4) Check the result

mysql> select * from atguigu;
+---------+-----------------+
| name    | url             |
+---------+-----------------+
| atguigu | www.atguigu.com |
+---------+-----------------+
