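Getting started
This tutorial shows how to load your own streaming data into Druid.
It assumes that you have already downloaded Druid and Tranquility as described in the quickstart and have them running on your local machine. You do not need to have loaded any data beforehand.
Once you complete it, you will be able to load your own datasets by writing a custom ingestion spec.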
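Writing an ingestion spec
For streaming ingestion, the recommended approach is stream push. This tutorial uses Tranquility to push data to Druid over HTTP.
This tutorial covers pushing a stream to Druid over HTTP, but Druid also supports a wide variety of batch and streaming loading methods. See the Loading files and Loading streams pages for more information about the other options, including Hadoop, Kafka, Storm, Samza, Spark Streaming, and your own JVM apps.
To load a new dataset over HTTP, customize the Tranquility Server configuration by editing conf-quickstart/tranquility/server.json as needed.
The configuration file has to answer four questions, which are numbered in the comments below. The // comments are annotations for this tutorial; strict JSON does not allow comments, so leave them out if your parser rejects them.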
{
"dataSources" : {
"metrics" : {
"spec" : {
"dataSchema" : {
//1. What the dataset should be called
"dataSource" : "metrics",
"parser" : {
"type" : "string",
"parseSpec" : {
//2. Which field is the timestamp
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
//3. Which fields should be treated as dimensions
"dimensions" : [],
"dimensionExclusions" : [
"timestamp",
"value"
]
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "hour",
"queryGranularity" : "none"
},
//4. Which fields should be treated as metrics
"metricsSpec" : [
{
"type" : "count",
"name" : "count"
},
{
"name" : "value_sum",
"type" : "doubleSum",
"fieldName" : "value"
},
{
"fieldName" : "value",
"name" : "value_min",
"type" : "doubleMin"
},
{
"type" : "doubleMax",
"name" : "value_max",
"fieldName" : "value"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "100000",
"intermediatePersistPeriod" : "PT10M",
"windowPeriod" : "PT10M"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1"
}
}
},
"properties" : {
"zookeeper.connect" : "localhost",
"druid.discovery.curator.path" : "/druid/discovery",
"druid.selectors.indexing.serviceName" : "druid/overlord",
"http.port" : "8200",
"http.threads" : "8"
}
}
The following pageviews JSON event serves as an example:
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
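For this example, the answers to the four questions above are:
- The dataset is called pageviews.
- The timestamp is the time field.
- url and user are good choices for dimensions.
- Good metrics are a count of pageviews and a sum of the latencyMs field. Collecting the sum at ingestion time also makes it quick and easy to compute an average at query time.
The modified configuration file therefore looks like this: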
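The reasoning behind the last answer can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Druid code: it ignores per-dimension grouping and assumes three events in the shape of the pageviews example, and the helper name summarize is made up for this sketch.

```python
# Three events in the shape of the pageviews example (illustrative data).
events = [
    {"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32},
    {"time": "2000-01-01T00:00:00Z", "url": "/", "user": "bob", "latencyMs": 11},
    {"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45},
]


def summarize(events):
    """Return (views, latency_sum): the two metrics kept at ingestion time.

    Hypothetical helper for illustration; Druid computes these per
    combination of timestamp and dimensions, which is ignored here.
    """
    views = len(events)
    latency_sum = float(sum(e["latencyMs"] for e in events))
    return views, latency_sum


views, latency_sum = summarize(events)

# The average is never stored; it is derived from the two stored metrics.
avg_latency = latency_sum / views
print(views, latency_sum, round(avg_latency, 2))  # prints: 3 88.0 29.33
```

Storing the count and the sum, rather than the average, is what makes the result combinable: sums and counts from different rows can be added together before dividing, whereas averages cannot.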
{
"dataSources" : {
"pageviews" : {
"spec" : {
"dataSchema" : {
//1. What the dataset should be called
"dataSource" : "pageviews",
"parser" : {
"type" : "string",
"parseSpec" : {
//2. Which field is the timestamp
"timestampSpec" : {
"column" : "time",
"format" : "auto"
},
"dimensionsSpec" : {
//3. Which fields should be treated as dimensions
"dimensions" : ["url", "user"],
"dimensionExclusions" : [
"timestamp",
"value"
]
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "hour",
"queryGranularity" : "none"
},
//4. Which fields should be treated as metrics
"metricsSpec" : [
{
"name": "views",
"type": "count"
},
{
"name": "latencyMs",
"type": "doubleSum",
"fieldName": "latencyMs"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "100000",
"intermediatePersistPeriod" : "PT10M",
"windowPeriod" : "PT10M"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1"
}
}
},
"properties" : {
"zookeeper.connect" : "localhost",
"druid.discovery.curator.path" : "/druid/discovery",
"druid.selectors.indexing.serviceName" : "druid/overlord",
"http.port" : "8200",
"http.threads" : "8"
}
}
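Before restarting, it can help to confirm that a sample event actually contains the fields the spec refers to. The following Python sketch does that for the dataSchema portion of the configuration above. The function check_event is a hypothetical helper written for this tutorial, not part of Druid or Tranquility, and the dictionary is a hand-copied subset of the spec.

```python
def check_event(event, data_schema):
    """Return a list of problems; an empty list means the event fits the schema.

    Hypothetical helper: it only checks that referenced fields are present.
    """
    parse_spec = data_schema["parser"]["parseSpec"]
    problems = []

    ts_column = parse_spec["timestampSpec"]["column"]
    if ts_column not in event:
        problems.append("missing timestamp column: " + ts_column)

    for dim in parse_spec["dimensionsSpec"]["dimensions"]:
        if dim not in event:
            problems.append("missing dimension: " + dim)

    for metric in data_schema["metricsSpec"]:
        field = metric.get("fieldName")  # "count" metrics read no input field
        if field is not None and field not in event:
            problems.append("missing metric input: " + field)

    return problems


# Subset of the pageviews dataSchema, copied by hand from the configuration.
data_schema = {
    "dataSource": "pageviews",
    "parser": {
        "type": "string",
        "parseSpec": {
            "format": "json",
            "timestampSpec": {"column": "time", "format": "auto"},
            "dimensionsSpec": {"dimensions": ["url", "user"]},
        },
    },
    "metricsSpec": [
        {"name": "views", "type": "count"},
        {"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"},
    ],
}

event = {"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
print(check_event(event, data_schema))  # prints: []
```

An event that still used the old timestamp field name would be reported as missing the time column, which is an easy mistake to make when adapting the metrics example.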
Restarting the server
Stop Tranquility (CTRL-C) and start it again so that it picks up the new configuration file.
Sending data
Send the following test data:
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2000-01-01T00:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
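Druid streaming ingestion requires relatively current messages, where "current" is measured against a slack window controlled by the windowPeriod value: events whose timestamps fall too far from the current time are not ingested. You should therefore replace 2000-01-01T00:00:00Z in these messages with the current system time in ISO 8601 format, which you can get by running: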
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
Update the timestamps in the JSON sample above with the printed time and save the result to a file named pageviews.json. Then send the data to Druid with the following command:
curl -XPOST -H'Content-Type: application/json' --data-binary @pageviews.json http://localhost:8200/v1/post/pageviews
You should see output like this:
{"result":{"received":3,"sent":3}}
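This indicates that the HTTP server received three events and sent three to Druid. The first run may take a few seconds because Druid has to allocate resources to the ingestion task; subsequent requests should complete quickly.
If you see "sent":0, your timestamps are most likely not recent enough. Update the timestamps and send the data again.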
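Editing the timestamps by hand becomes tedious when you resend the data several times. The following Python 3 sketch stamps the sample events with the current UTC time and produces the contents of pageviews.json. It also includes a simplified freshness check, which assumes an event is acceptable only when its timestamp is within windowPeriod (PT10M in the configuration above) of the current time; the real acceptance logic lives inside Tranquility and may differ in detail. All function names are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

events = [
    {"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32},
    {"time": "2000-01-01T00:00:00Z", "url": "/", "user": "bob", "latencyMs": 11},
    {"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45},
]


def stamp_events(events, now):
    """Return copies of the events with "time" set to `now` (a UTC datetime)."""
    stamp = now.strftime(ISO_FORMAT)
    return [dict(e, time=stamp) for e in events]


def to_json_lines(events):
    """Serialize one JSON object per line, the layout used for pageviews.json."""
    return "\n".join(json.dumps(e) for e in events) + "\n"


def within_window(event_time, now, window=timedelta(minutes=10)):
    """Simplified model (an assumption): is the timestamp within `window` of `now`?"""
    parsed = datetime.strptime(event_time, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return abs(now - parsed) <= window


now = datetime.now(timezone.utc)
fresh = stamp_events(events, now)

print(within_window(events[0]["time"], now))  # the year-2000 timestamp is stale
print(within_window(fresh[0]["time"], now))   # the freshly stamped one is not
print(to_json_lines(fresh), end="")
```

Redirecting the last part of the output to a file, or writing to_json_lines(fresh) to pageviews.json directly, gives a file that can be posted with the curl command shown earlier.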
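Querying your data
Once the data has been sent, it is immediately available for querying. See Druid queries for details.
Further reading
To learn more about streaming ingestion in Druid, see the streaming ingestion documentation.
Original article: http://druid.io/docs/0.9.2/tutorials/tutorial-streams.html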
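As a starting point, the Python sketch below builds a native Druid timeseries query that derives the average latency from the two ingested metrics, views and latencyMs. The function name and the interval are illustrative assumptions. The script only prints the query; where to send it depends on your setup (in the quickstart layout the broker is assumed to listen at http://localhost:8082/druid/v2/, which you should verify against your own configuration).

```python
import json


def build_avg_latency_query(data_source, interval):
    """Build a timeseries query that divides summed latency by summed views.

    Hypothetical helper. The metric names match the metricsSpec of the
    pageviews configuration: "views" (count) and "latencyMs" (doubleSum).
    """
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            # A count metric is summed at query time to get the total count.
            {"type": "longSum", "name": "views", "fieldName": "views"},
            {"type": "doubleSum", "name": "latencyMs", "fieldName": "latencyMs"},
        ],
        "postAggregations": [
            {
                "type": "arithmetic",
                "name": "avgLatencyMs",
                "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "latencyMs"},
                    {"type": "fieldAccess", "fieldName": "views"},
                ],
            }
        ],
    }


# The interval is a deliberately wide, illustrative choice.
query = build_avg_latency_query("pageviews", "2000-01-01/2100-01-01")
print(json.dumps(query, indent=2))
```

The count metric is aggregated with longSum rather than count at query time because each stored row may already represent several raw events.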