簡(jiǎn)介
Apache Drill是一個(gè)低延遲的分布式海量數(shù)據(jù)(涵蓋結(jié)構(gòu)化、半結(jié)構(gòu)化以及嵌套數(shù)據(jù))交互式查詢引擎萍歉,使用ANSI SQL兼容語(yǔ)法侣颂,支持本地文件、HDFS枪孩、HBase憔晒、MongoDB等后端存儲(chǔ),支持Parquet蔑舞、JSON拒担、CSV、TSV攻询、PSV等數(shù)據(jù)格式从撼。受Google的Dremel啟發(fā),Drill滿足上千節(jié)點(diǎn)的PB級(jí)別數(shù)據(jù)的交互式商業(yè)智能分析場(chǎng)景钧栖。
安裝
Drill可以安裝在單機(jī)或者集群環(huán)境上低零,支持Linux、Windows拯杠、Mac OS X系統(tǒng)掏婶。簡(jiǎn)單起見,我們?cè)贚inux單機(jī)環(huán)境(CentOS 6.3)搭建以供試用阴挣。
準(zhǔn)備安裝包:
- jdk 7:jdk-7u75-linux-x64.tar.gz
- Drill:apache-drill-0.8.0.tar.gz
在$WORK(/path/to/work)目錄中安裝气堕,將jdk和drill分別解壓到j(luò)ava和drill目錄中,并打軟連以便升級(jí):
.
├── drill
│ ├── apache-drill -> apache-drill-0.8.0
│ └── apache-drill-0.8.0
├── init.sh
└── java
├── jdk -> jdk1.7.0_75
└── jdk1.7.0_75
并添加一init.sh腳本初始化java相關(guān)環(huán)境變量:
export WORK="/path/to/work"
export JAVA="$WORK/java/jdk/bin/java"
export JAVA_HOME="$WORK/java/jdk"
啟動(dòng)
在單機(jī)環(huán)境運(yùn)行只需要啟動(dòng)bin/sqlline便可:
$ cd $WORK
$ . ./init.sh
$ ./drill/apache-drill/bin/sqlline -u jdbc:drill:zk=local
Drill log directory /var/log/drill does not exist or is not writable, defaulting to ...
Apr 06, 2015 12:47:30 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
sqlline version 1.1.6
0: jdbc:drill:zk=local>
-u jdbc:drill:zk=local
表示使用本機(jī)的Drill,無(wú)需啟動(dòng)ZooKeeper茎芭,如果是集群環(huán)境則需要配置和啟動(dòng)ZooKeeper并填寫地址揖膜。啟動(dòng)后便可以在0: jdbc:drill:zk=local>
后敲入命令使用了。
試用
Drill的sample-data目錄有Parquet格式的演示數(shù)據(jù)可供查詢:
0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/apache-drill/sample-data/nation.parquet` limit 5;
+-------------+------------+-------------+------------+
| N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT |
+-------------+------------+-------------+------------+
| 0 | ALGERIA | 0 | haggle. carefully f |
| 1 | ARGENTINA | 1 | al foxes promise sly |
| 2 | BRAZIL | 1 | y alongside of the p |
| 3 | CANADA | 1 | eas hang ironic, sil |
| 4 | EGYPT | 4 | y above the carefull |
+-------------+------------+-------------+------------+
5 rows selected (0.741 seconds)
這里用的庫(kù)名格式為dfs.`本地文件(Parquet梅桩、JSON壹粟、CSV等文件)絕對(duì)路徑`∷薨伲可以看出只要熟悉SQL語(yǔ)法幾乎沒有學(xué)習(xí)成本趁仙。但Parquet格式文件需要專用工具查看、編輯垦页,不是很方便雀费,后續(xù)再專門介紹,下文先使用更通用的CSV和JSON文件進(jìn)行演示痊焊。
在$WORK/data
中創(chuàng)建如下test.csv
文件:
1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store
然后查詢:
0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.csv`;
+------------+
| columns |
+------------+
| ["1101","SteveEurich","Steve","Eurich","16","StoreT"] |
| ["1102","MaryPierson","Mary","Pierson","16","StoreT"] |
| ["1103","LeoJones","Leo","Jones","16","StoreTem"] |
| ["1104","NancyBeatty","Nancy","Beatty","16","StoreT"] |
| ["1105","ClaraMcNight","Clara","McNight","16","Store"] |
+------------+
5 rows selected (0.082 seconds)
可以看到結(jié)果和之前的稍有不同盏袄,因?yàn)镃SV文件沒有地方存放列列名,所以統(tǒng)一用columns
代替薄啥,如果需要具體制定列則需要用columns[n]
辕羽,如:
0: jdbc:drill:zk=local> select columns[0], columns[3] from dfs.`/path/to/work/drill/data/test.csv`;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 1101 | Eurich |
| 1102 | Pierson |
| 1103 | Jones |
| 1104 | Beatty |
| 1105 | McNight |
+------------+------------+
CSV文件格式比較簡(jiǎn)單,發(fā)揮不出Drill的強(qiáng)大優(yōu)勢(shì)垄惧,下邊更復(fù)雜的功能使用和Parquet更接近的JSON文件進(jìn)行演示刁愿。
在$WORK/data
中創(chuàng)建如下test.json
文件:
{
"ka1": 1,
"kb1": 1.1,
"kc1": "vc11",
"kd1": [
{
"ka2": 10,
"kb2": 10.1,
"kc2": "vc1010"
}
]
}
{
"ka1": 2,
"kb1": 2.2,
"kc1": "vc22",
"kd1": [
{
"ka2": 20,
"kb2": 20.2,
"kc2": "vc2020"
}
]
}
{
"ka1": 3,
"kb1": 3.3,
"kc1": "vc33",
"kd1": [
{
"ka2": 30,
"kb2": 30.3,
"kc2": "vc3030"
}
]
}
可以看到這個(gè)JSON文件內(nèi)容是有多層嵌套的,結(jié)構(gòu)比之前那個(gè)CSV文件要復(fù)雜不少到逊,而查詢嵌套數(shù)據(jù)正是Drill的優(yōu)勢(shì)所在铣口。
0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+------------+------------+
| ka1 | kb1 | kc1 | kd1 |
+------------+------------+------------+------------+
| 1 | 1.1 | vc11 | [{"ka2":10,"kb2":10.1,"kc2":"vc1010"}] |
| 2 | 2.2 | vc22 | [{"ka2":20,"kb2":20.2,"kc2":"vc2020"}] |
| 3 | 3.3 | vc33 | [{"ka2":30,"kb2":30.3,"kc2":"vc3030"}] |
+------------+------------+------------+------------+
3 rows selected (0.098 seconds)
select *
只查出第一層的數(shù)據(jù),更深層的數(shù)據(jù)只以原本的JSON數(shù)據(jù)呈現(xiàn)出來蕾管,我們顯然不應(yīng)該只關(guān)心第一層的數(shù)據(jù)枷踏,具體怎么查完全隨心所欲:
0: jdbc:drill:zk=local> select sum(ka1), avg(kd1[0].kb2) from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 6 | 20.2 |
+------------+------------+
1 row selected (0.136 seconds)
可以通過kd1[0]
來訪問嵌套到第二層的這個(gè)表。
0: jdbc:drill:zk=local> select kc1, kd1[0].kc2 from dfs.`/path/to/work/drill/data/test.json` where kd1[0].kb2 = 10.1 and ka1 = 1;
+------------+------------+
| kc1 | EXPR$1 |
+------------+------------+
| vc11 | vc1010 |
+------------+------------+
1 row selected (0.181 seconds)
創(chuàng)建view:
0: jdbc:drill:zk=local> create view dfs.tmp.tmpview as select kd1[0].kb2 from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'tmpview' created successfully in 'dfs.tmp' schema |
+------------+------------+
1 row selected (0.055 seconds)
0: jdbc:drill:zk=local> select * from dfs.tmp.tmpview;
+------------+
| EXPR$0 |
+------------+
| 10.1 |
| 20.2 |
| 30.3 |
+------------+
3 rows selected (0.193 seconds)
可以把嵌套的第二層表打平(整合kd1[0]..kd1[n]):
0: jdbc:drill:zk=local> select kddb.kdtable.kc2 from (select flatten(kd1) kdtable from dfs.`/path/to/work/drill/data/test.json`) kddb;
+------------+
| EXPR$0 |
+------------+
| vc1010 |
| vc2020 |
| vc3030 |
+------------+
3 rows selected (0.083 seconds)
使用細(xì)節(jié)上和mysql還是有所不同的掰曾,另外涉及到多層表的復(fù)雜邏輯旭蠕,要想用得得心應(yīng)手還需要仔細(xì)閱讀官方文檔并多多練習(xí)。這次先走馬觀花了旷坦,之后會(huì)深入了解語(yǔ)法層面的特性掏熬。
參考
付費(fèi)解決 Windows、Linux秒梅、Shell旗芬、C、C++捆蜀、AHK疮丛、Python幔嫂、JavaScript、Lua 等領(lǐng)域相關(guān)問題誊薄,靈活定價(jià)履恩,歡迎咨詢,微信 ly50247呢蔫。