Notes taken while following the official documentation, plus some extra material I found along the way.
The official docs are here, and the Chinese version is here. (Installation can follow the tutorial exactly; it is very detailed. To start the service, just remember this one command: bin/drill-embedded.)
FYI: like most writeups on Drill, this one is fairly dry; maybe that is simply all there is to Drill, and most articles are the same translation anyway.
Running in embedded mode
Once installed, Drill is reachable at http://localhost:8047/, or you can connect from the shell:
- cd (path)/drill
- bin/sqlline -u jdbc:drill:zk=local
- Run a query (see the example below).
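Once sqlline connects, a quick smoke test is to query the sample data shipped on Drill's classpath (a minimal sketch; cp.`employee.json` is the bundled sample file):
SELECT * FROM cp.`employee.json` LIMIT 3;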
To change the configuration, go to the conf directory under the Drill install and add settings to drill-env.sh; a sketch follows.
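For example, the default drill-env.sh exposes memory settings; a minimal sketch (the variable names follow the stock file, the values are placeholders to adjust for your machine):
export DRILL_HEAP="4G"
export DRILL_MAX_DIRECT_MEMORY="8G"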
Overview
- Apache Drill is a low-latency, distributed, interactive query engine for large-scale datasets covering structured, semi-structured, and nested data. It is distributed and schema-free.
- It is an open-source implementation of Google Dremel: essentially a distributed MPP (massively parallel processing) query layer that supports SQL and related languages over NoSQL and Hadoop data stores.
- It is aimed at fast queries over massive data, completing analyses by rapidly scanning data at petabyte scale (2^50 bytes).
- Drill is plug-and-play and can be deployed alongside existing Hive and HBase installations at any time.
- It complements MapReduce, whose interactive query capability is weak.
- Nested data model
- Columnar storage
- Combines web-search and parallel-DBMS technology
Note: Hive layers a SQL interface on top of Hadoop and translates SQL into MapReduce jobs that run on the cluster, so data developers and analysts can do large-scale statistics and analysis with SQL instead of going through the trouble of writing MapReduce programs.
There is a separate set of notes on Hive; see here.
Drill's core service is the Drillbit.
When a Drillbit runs on every data node in the cluster, Drill can maximize query execution without moving data over the network or between nodes.
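You can check which Drillbits are currently registered by querying the sys schema (a quick sanity check, assuming at least one Drillbit is running):
SELECT * FROM sys.drillbits;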
Interfaces
- Drill Shell
- Drill Web Console
- ODBC/JDBC
- C++ API
Dynamic schema discovery
Drill discovers the schema while it processes the data, so no schema or type declaration is required up front.
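In practice this means a query can point straight at a raw file with no CREATE TABLE or schema declaration beforehand; a sketch (the path /tmp/logs.json is hypothetical):
SELECT * FROM dfs.`/tmp/logs.json` LIMIT 10;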
Flexible data model
Data attributes may be nested; architecturally, Drill provides a flexible, columnar data model.
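For example, nested fields are addressed with dot and index notation, and arrays can be unnested with FLATTEN; a sketch against the donuts.json sample used later in this post (topping is an array field in that file):
SELECT t.id, FLATTEN(t.topping) AS topping
FROM dfs.`/Users/brumsby/drill/donuts.json` t;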
No centralized metadata
Drill does not depend on a single Hive metastore; it can query multiple Hive stores and combine the results.
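Because names resolve per storage plugin rather than through one central metastore, a single query can mix sources; a sketch with hypothetical table and file names (hive.orders, /data/customers.json):
SELECT o.order_id, c.name
FROM hive.`orders` o
JOIN dfs.`/data/customers.json` c ON o.cust_id = c.cust_id;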
Query execution
When you submit a Drill query, the client or application sends a SQL statement to a Drillbit, the entry point that plans and runs the query.
The Drillbit that receives the request becomes the Foreman and drives the whole query: it first parses the SQL and then rewrites it into a form Drill can work with.
The logical plan describes the work required to generate the query results, defining the data sources and operations; it consists of a set of logical operators.
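You can inspect the plan the Foreman produces with EXPLAIN; a sketch reusing the donuts.json query from later in this post (EXPLAIN PLAN WITHOUT IMPLEMENTATION shows the logical plan, plain EXPLAIN PLAN FOR shows the physical plan):
EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR
SELECT id, type, name, ppu FROM dfs.`/Users/brumsby/drill/donuts.json`;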
Major Fragments
- A major fragment is a concept that represents a phase of the query execution.
- A phase can consist of one or multiple operations that Drill must perform to execute the query.
- Drill assigns each major fragment a MajorFragmentID
- Drill uses an exchange operator to separate major fragments. An exchange is a change in data location and/or parallelization of the physical plan. An exchange is composed of a sender and a receiver to allow data to move between nodes.
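Major fragments are visible in the physical plan: as far as I can tell, each operator line in EXPLAIN output carries a MajorFragmentID-OperatorID prefix (00-xx, 01-xx, ...), and exchange operators mark the boundaries between fragments. A sketch:
EXPLAIN PLAN FOR
SELECT id, type, name, ppu FROM dfs.`/Users/brumsby/drill/donuts.json`;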
Minor Fragments
- Each major fragment is parallelized into minor fragments.
- A minor fragment is a logical unit of work that runs inside a thread. A logical unit of work in Drill is also referred to as a slice.
- The execution plan that Drill creates is composed of minor fragments. Drill assigns each minor fragment a MinorFragmentID.
- Process:
The parallelizer in the Foreman creates one or more minor fragments from a major fragment at execution time, by breaking a major fragment into as many minor fragments as it can usefully run at the same time on the cluster.
Drill executes each minor fragment in its own thread as quickly as possible based on its upstream data requirements. Drill schedules the minor fragments on nodes with data locality; otherwise, it schedules them in a round-robin fashion (rotating through the available nodes) on the existing, available Drillbits.
- Minor fragments contain one or more relational operators. An operator performs a relational operation, such as scan, filter, join, or group by. Each operator has a particular operator type and an OperatorID. Each OperatorID defines its relationship within the minor fragment to which it belongs.
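How far a major fragment is parallelized is governed by planner options that can be read and changed per session; a sketch (planner.width.max_per_node and planner.slice_target are the relevant options as I understand them, and 4 is an arbitrary value):
SELECT * FROM sys.options WHERE name LIKE 'planner.width%' OR name LIKE 'planner.slice%';
ALTER SESSION SET `planner.width.max_per_node` = 4;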
Execution of Minor Fragments
Minor fragments can run as root, intermediate, or leaf fragments. An execution tree contains only one root fragment. The coordinates of the execution tree are numbered from the root, with the root being zero. Data flows downstream from the leaf fragments to the root fragment.
The root fragment runs in the Foreman and receives incoming queries, reads metadata from tables, rewrites the queries and routes them to the next level in the serving tree. The other fragments become intermediate or leaf fragments.
Intermediate fragments start work when data is available or fed to them from other fragments. They perform operations on the data and then send the data downstream. They also pass the aggregated results to the root fragment, which performs further aggregation and provides the query results to the client or application.
The leaf fragments scan tables in parallel and communicate with the storage layer or access data on local disk. The leaf fragments pass partial results to the intermediate fragments, which perform parallel operations on intermediate results.
Query
- For example, a query:
select id, type, name, ppu
from dfs.`/Users/brumsby/drill/donuts.json`;
Note that dfs is the schema name, the path to the file is enclosed by backticks, and the query must end with a semicolon.
- Note that when you select an individual array element you must alias the expression (for example, AS top); a sketch follows.
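A sketch of the alias requirement, again against donuts.json (topping is an array field in that sample, and indexing starts at 0):
SELECT t.topping[3] AS top
FROM dfs.`/Users/brumsby/drill/donuts.json` t;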
- For nested arrays, index twice: select group[1][2] retrieves the third value of the second inner array.
- A more complex SQL example:
SELECT * FROM (SELECT t.trans_id,
t.trans_info.prod_id[0] AS prod_id,
t.trans_info.purch_flag AS purchased
FROM `clicks/clicks.json` t) sq
WHERE sq.prod_id BETWEEN 700 AND 750 AND
sq.purchased = 'true'
ORDER BY sq.prod_id;
REST API
GET/POST; the documentation is here. There is also a JSON-API-related document that is well written and comes with code; it was my reference material when I first looked at this.
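As a sketch, the query endpoint takes a POSTed JSON body with queryType and query fields (host and port assume the default embedded setup):
curl -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "SELECT * FROM cp.`employee.json` LIMIT 3"}' \
  http://localhost:8047/query.json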
To be continued...