DataX 是阿里云 DataWorks數(shù)據(jù)集成 的開源版本遣妥,在阿里巴巴集團內(nèi)被廣泛使用的離線數(shù)據(jù)同步工具/平臺拙毫。DataX 實現(xiàn)了包括 MySQL颗祝、Oracle辉哥、SqlServer、Postgre耳贬、HDFS踏堡、Hive、ADS咒劲、HBase顷蟆、TableStore(OTS)诫隅、MaxCompute(ODPS)、Hologres帐偎、DRDS 等各種異構(gòu)數(shù)據(jù)源之間高效的數(shù)據(jù)同步功能
Apache Doris是一個現(xiàn)代化的MPP分析型數(shù)據(jù)庫產(chǎn)品逐纬。僅需亞秒級響應時間即可獲得查詢結(jié)果,有效地支持實時數(shù)據(jù)分析削樊。Apache Doris的分布式架構(gòu)非常簡潔豁生,易于運維,并且可以支持10PB以上的超大數(shù)據(jù)集漫贞。
Apache Doris可以滿足多種數(shù)據(jù)分析需求甸箱,例如固定歷史報表,實時數(shù)據(jù)分析迅脐,交互式數(shù)據(jù)分析和探索式數(shù)據(jù)分析等芍殖。令您的數(shù)據(jù)分析工作更加簡單高效!
為了更好的擴展Apache doris生態(tài)谴蔑,為doris用戶提供更方便的數(shù)據(jù)導入豌骏,社區(qū)開發(fā)擴展支持了Datax DorisWriter,使大家更方便Datax進行數(shù)據(jù)進入
1.場景
這里演示介紹的使用 Doris 的 Datax 擴展 DorisWriter實現(xiàn)從Mysql數(shù)據(jù)定時抽取數(shù)據(jù)導入到Doris數(shù)倉表里
2.編譯 DorisWriter
這個的擴展的編譯可以不在 doris 的 docker 編譯環(huán)境下進行隐锭,本文是在 windows 下的 WLS 下進行編譯的
首先從github上拉取源碼
git clone https://github.com/apache/incubator-doris.git
進入到incubator-doris/extension/DataX/
執(zhí)行編譯
首先執(zhí)行:
sh init_env.sh
這個腳本主要用于構(gòu)建 DataX 開發(fā)環(huán)境窃躲,他主要進行了以下操作:
將 DataX 代碼庫 clone 到本地。
將
doriswriter/
目錄軟鏈到DataX/doriswriter
目錄成榜。在
DataX/pom.xml
文件中添加<module>doriswriter</module>
模塊框舔。-
將
DataX/core/pom.xml
文件中的 httpclient 版本從 4.5 改為 4.5.13.httpclient v4.5 在處理 307 轉(zhuǎn)發(fā)時有bug。
這個腳本執(zhí)行后赎婚,開發(fā)者就可以進入 DataX/
目錄開始開發(fā)或編譯了刘绣。因為做了軟鏈,所以任何對 DataX/doriswriter
目錄中文件的修改挣输,都會反映到 doriswriter/
目錄中纬凤,方便開發(fā)者提交代碼
2.1 開始編譯
這里我為了加快編譯速度去掉了很多無用的插件:這里直接在Datax目錄下的pom.xml里注釋掉就行
hbase11xreader
hbase094xreader
tsdbreader
oceanbasev10reader
odpswriter
hdfswriter
adswriter
ocswriter
oscarwriter
oceanbasev10writer
然后進入到incubator-doris/extension/DataX/
目錄下的 Datax 目錄,執(zhí)行編譯
這里我是執(zhí)行的將 Datax 編譯成 tar 包撩嚼,和官方的編譯命令不太一樣停士。
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
編譯完成以后,tar 包在 Datax/target
目錄下完丽,你可以將這tar包拷貝到你需要的地方恋技,這里我是直接在 datax 執(zhí)行測試,這里因為的 python 版本是 3.x版本逻族,需要將 bin 目錄下的三個文件換成 python 3能之別的版本蜻底,這個你可以去下面的地址下載:
https://github.com/WeiYe-Jing/datax-web/tree/master/doc/datax-web/datax-python3
將下載的三個文件替換 bin 目錄下的文件以后,整個編譯聘鳞,安裝就完成了
如果你編譯不成功也可以從我的百度網(wǎng)盤上下載編譯好的包薄辅,注意我上邊編譯去掉的那些插件
鏈接:https://pan.baidu.com/s/1hXYkpkrUE2qW4j98k2Wu7A
提取碼:3azi
3.數(shù)據(jù)接入
這個時候我們就可以開始使用 Datax 的doriswriter擴展開始從 Mysql(或者其他數(shù)據(jù)源)直接將數(shù)據(jù)抽取出來導入到 Doris 表中了要拂。
3.1 Mysql 數(shù)據(jù)庫準備
下面是我數(shù)據(jù)庫的建表腳本(mysql 8):
CREATE TABLE `order_analysis` (
`date` varchar(19) DEFAULT NULL,
`user_src` varchar(9) DEFAULT NULL,
`order_src` varchar(11) DEFAULT NULL,
`order_location` varchar(2) DEFAULT NULL,
`new_order` int DEFAULT NULL,
`payed_order` int DEFAULT NULL,
`pending_order` int DEFAULT NULL,
`cancel_order` int DEFAULT NULL,
`reject_order` int DEFAULT NULL,
`good_order` int DEFAULT NULL,
`report_order` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT
示例數(shù)據(jù)
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-12 00:00:00', '廣告二維碼', 'Android APP', '上海', 15253, 13210, 684, 1247, 1000, 10824, 862);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-14 00:00:00', '微信朋友圈H5頁面', 'iOS APP', '廣州', 17134, 11270, 549, 204, 224, 10234, 773);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-17 00:00:00', '地推二維碼掃描', 'iOS APP', '北京', 16061, 9418, 1220, 1247, 458, 13877, 749);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-17 00:00:00', '微信朋友圈H5頁面', '微信公眾號', '武漢', 12749, 11127, 1773, 6, 5, 9874, 678);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-18 00:00:00', '地推二維碼掃描', 'iOS APP', '上海', 13086, 15882, 1727, 1764, 1429, 12501, 625);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-18 00:00:00', '微信朋友圈H5頁面', 'iOS APP', '武漢', 15129, 15598, 1204, 1295, 1831, 11500, 320);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-19 00:00:00', '地推二維碼掃描', 'Android APP', '杭州', 20687, 18526, 1398, 550, 213, 12911, 185);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-19 00:00:00', '應用商店', '微信公眾號', '武漢', 12388, 11422, 702, 106, 158, 5820, 474);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-20 00:00:00', '微信朋友圈H5頁面', '微信公眾號', '上海', 14298, 11682, 1880, 582, 154, 7348, 354);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-21 00:00:00', '地推二維碼掃描', 'Android APP', '深圳', 22079, 14333, 5565, 1742, 439, 8246, 211);
INSERT INTO `sql12298540`.`order_analysis` (`date`, `user_src`, `order_src`, `order_location`, `new_order`, `payed_order`, `pending_order`, `cancel_order`, `reject_order`, `good_order`, `report_order`) VALUES ('2015-10-22 00:00:00', 'UC瀏覽器引流', 'iOS APP', '上海', 28968, 18151, 7212, 2373, 1232, 10739, 578);
3.2 doris數(shù)據(jù)庫準備
下面是我上面數(shù)據(jù)表在doris對應的建表腳本
CREATE TABLE `order_analysis` (
`date` datetime DEFAULT NULL,
`user_src` varchar(30) DEFAULT NULL,
`order_src` varchar(50) DEFAULT NULL,
`order_location` varchar(10) DEFAULT NULL,
`new_order` int DEFAULT NULL,
`payed_order` int DEFAULT NULL,
`pending_order` int DEFAULT NULL,
`cancel_order` int DEFAULT NULL,
`reject_order` int DEFAULT NULL,
`good_order` int DEFAULT NULL,
`report_order` int DEFAULT NULL
) ENGINE=OLAP
DUPLICATE KEY(`date`,user_src)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`user_src`) BUCKETS 1
PROPERTIES (
"replication_num" = "3",
"in_memory" = "false",
"storage_format" = "V2"
);
3.3 Datax Job JSON文件
創(chuàng)建并編輯datax job任務json文件,并保存到指定目錄
{
"job": {
"setting": {
"speed": {
"channel": 1
},
"errorLimit": {
"record": 0,
"percentage": 0
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "zhangfeng",
"column": ["date","user_src","order_src","order_location","new_order","payed_order"," pending_order"," cancel_order"," reject_order"," good_order"," report_order" ],
"connection": [ { "table": [ "order_analysis" ], "jdbcUrl": [ "jdbc:mysql://localhost:3306/demo" ] } ] }
},
"writer": {
"name": "doriswriter",
"parameter": {
"feLoadUrl": ["fe:8030"],
"beLoadUrl": ["be1:8040","be1:8040","be1:8040","be1:8040","be1:8040","be1:8040"],
"jdbcUrl": "jdbc:mysql://fe:9030/",
"database": "test_2",
"table": "order_analysis",
"column": ["date","user_src","order_src","order_location","new_order","payed_order"," pending_order"," cancel_order"," reject_order"," good_order"," report_order"],
"username": "root",
"password": "",
"postSql": [],
"preSql": [],
"loadProps": {
},
"maxBatchRows" : 10000,
"maxBatchByteSize" : 104857600,
"labelPrefix": "datax_doris_writer_demo_",
"lineDelimiter": "\n"
}
}
}
]
}
}
這塊 Mysql reader 使用方式參照:
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
doriswriter的使用及參數(shù)說明:
https://github.com/apache/incubator-doris/blob/master/extension/DataX/doriswriter/doc/doriswriter.md
4.執(zhí)行Datax數(shù)據(jù)導入任務
python bin/datax.py doris.json
然后就可以看到執(zhí)行結(jié)果:
再去 Doris 數(shù)據(jù)庫中查看你的表站楚,數(shù)據(jù)就已經(jīng)導入進去了脱惰,任務執(zhí)行結(jié)束
因為 Datax 的任務是要靠外部觸發(fā)才能執(zhí)行,這里你可以使用Linux的crontab或者海豚調(diào)度之類的來控制任務運行