1. Background
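Visitor path analysis is a common type of data analysis. As in the figure above (Google Analytics), it shows fairly intuitively how sessions drop off along each navigation path after users arrive at a site, which helps the site optimize its pages and reduce the drop-off rate.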
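Visitor path analysis has the following key points:
- A user's path usually spans multiple levels. By default, 5 levels are expanded, including the landing page, and each further click expands one more level (up to 10 levels; going deeper adds little value).
- Each level shows only the top 5 pages by visit count, and the connecting lines between pages at adjacent levels represent the transitions between them.
- The metrics include the session count and drop-off count of each top 5 page, plus the session count of the remaining pages.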
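Based on the analysis above, implementing visitor path analysis requires the following work:
- Compute the total session count across all pages at each level.
- Compute the top 5 pages by session count at each level.
- Compute the number of transitions between each pair of pages at adjacent levels.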
This article proposes an implementation based on Druid that maps these three queries to Druid's Timeseries (totals), TopN (top 5), and GroupBy (pairwise transitions) queries.
2. Technical Approach
Data cleansing (ETL)
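The users' page-view (PV) log is aggregated into sessions. Within each session, the page views are sorted by time, and the first 11 are stored in the dimensions landing_page through path10. A sample of the data after ETL: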
host | landing_page | path1 | path2 | ... | path10 |
---|---|---|---|---|---|
www.xxx.com | /index.html | /a | /b | ... | /e |
www.xxx.com | /product.html | /c | /d | ... | null |
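
The ETL step above could be sketched roughly as follows. This is a minimal in-memory illustration, not the article's actual pipeline: the record keys `session_id`, `host`, `time`, and `url` are hypothetical, and grouping by `session_id` is an assumption, since the text does not say which key defines a session.

```python
from collections import defaultdict

# landing_page plus path1..path10: the 11 dimensions described above.
LEVELS = ["landing_page"] + [f"path{i}" for i in range(1, 11)]


def sessionize(pv_records):
    """Turn raw page-view records into one row per session.

    Each record is assumed to be a dict with the hypothetical keys
    'session_id', 'host', 'time' and 'url'.
    """
    sessions = defaultdict(list)
    for rec in pv_records:
        sessions[(rec["host"], rec["session_id"])].append(rec)

    rows = []
    for (host, _session_id), recs in sessions.items():
        recs.sort(key=lambda r: r["time"])             # order page views by time
        urls = [r["url"] for r in recs[:len(LEVELS)]]  # keep the first 11 only
        urls += [None] * (len(LEVELS) - len(urls))     # pad short sessions with null
        row = {"host": host, "time": recs[0]["time"]}
        row.update(zip(LEVELS, urls))
        rows.append(row)
    return rows
```

Padding with null matters later: a null at a given level is what the queries interpret as "the session ended at the previous level".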
The data is then loaded into Druid for querying. The schema is designed as follows:
{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : ""
      }
    },
    "dataSchema" : {
      "dataSource" : "",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : {"type":"period","period":"P1D","timeZone":"Asia/Shanghai"},
        "queryGranularity" : {"type":"period","period":"P1D","timeZone":"Asia/Shanghai"},
        "intervals" : []
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions": [
              "host",
              "landing_page",
              "path1",
              ...
              "path10"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 5000000
      },
      "indexSpec" : {
        "bitmap" : { "type" : "roaring"},
        "dimensionCompression":"LZ4",
        "metricCompression" : "LZ4",
        "longEncoding" : "auto"
      }
    }
  }
}
3. Implementation
Example queries
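Compute the total session count across all pages at each level (the first 5 levels are shown by default), filtering out null values (a null means the user left after visiting only the previous level).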
{
  "queryType": "timeseries",
  "dataSource": "visit_path_analysis",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [{"type": "selector", "dimension": "host", "value": "www.xxx.com"}]
  },
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "not",
        "field": { "type": "selector", "dimension": "landing_page", "value": null }
      },
      "aggregator": { "type": "longSum", "name": "count0", "fieldName": "count" }
    },
    {
      "type": "filtered",
      "filter": {
        "type": "not",
        "field": { "type": "selector", "dimension": "path1", "value": null }
      },
      "aggregator": { "type": "longSum", "name": "count1", "fieldName": "count" }
    },
    {
      "type": "filtered",
      "filter": {
        "type": "not",
        "field": { "type": "selector", "dimension": "path2", "value": null }
      },
      "aggregator": { "type": "longSum", "name": "count2", "fieldName": "count" }
    },
    {
      "type": "filtered",
      "filter": {
        "type": "not",
        "field": { "type": "selector", "dimension": "path3", "value": null }
      },
      "aggregator": { "type": "longSum", "name": "count3", "fieldName": "count" }
    },
    {
      "type": "filtered",
      "filter": {
        "type": "not",
        "field": { "type": "selector", "dimension": "path4", "value": null }
      },
      "aggregator": { "type": "longSum", "name": "count4", "fieldName": "count" }
    }
  ],
  "intervals": []
}
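
Because the five filtered aggregators differ only in the dimension and the output name, the query body can be generated instead of written by hand. The following Python sketch builds the same structure for any depth up to the 11 stored levels; the function names and the `intervals` argument are illustrative assumptions, not part of the original setup.

```python
import json


def level_dimensions(depth):
    """Dimension names for the first `depth` levels, landing page included."""
    return ["landing_page"] + [f"path{i}" for i in range(1, depth)]


def build_level_totals_query(host, intervals, depth=5):
    """Build a timeseries query with one filtered longSum per level."""
    aggregations = []
    for index, dimension in enumerate(level_dimensions(depth)):
        aggregations.append({
            "type": "filtered",
            # Skip sessions that ended before reaching this level.
            "filter": {
                "type": "not",
                "field": {"type": "selector", "dimension": dimension, "value": None},
            },
            "aggregator": {"type": "longSum", "name": f"count{index}", "fieldName": "count"},
        })
    return {
        "queryType": "timeseries",
        "dataSource": "visit_path_analysis",
        "granularity": "all",
        "filter": {
            "type": "and",
            "fields": [{"type": "selector", "dimension": "host", "value": host}],
        },
        "aggregations": aggregations,
        "intervals": list(intervals),
    }


if __name__ == "__main__":
    # The interval below is a made-up example value.
    query = build_level_totals_query("www.xxx.com", ["2024-01-01/2024-01-02"])
    print(json.dumps(query, indent=2))
```

Python's `None` serializes to JSON `null`, so the generated filters match the hand-written ones.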
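Compute the top 5 pages by session count at each level, again filtering out null values (the user left after visiting only the previous level). The example below queries the landing_page level: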
{
  "queryType": "topN",
  "dataSource": "visit_path_analysis",
  "granularity": "all",
  "dimension": "landing_page",
  "filter": {
    "type": "and",
    "fields": [
      {"type": "selector", "dimension": "host", "value": "www.xxx.com"},
      {
        "type": "not",
        "field": { "type": "selector", "dimension": "landing_page", "value": null }
      }
    ]
  },
  "threshold": 5,
  "metric": {
    "type": "numeric",
    "metric": "count"
  },
  "aggregations": [{ "type": "longSum", "name": "count", "fieldName": "count" }],
  "intervals": []
}
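
A topN query handles one dimension at a time, so the default view needs one such query per level. A possible way to generate them is sketched below in Python; the helper names and the `intervals` argument are assumptions made for illustration.

```python
def build_top_pages_query(host, dimension, intervals, threshold=5):
    """Build a topN query for the most visited pages at one level."""
    return {
        "queryType": "topN",
        "dataSource": "visit_path_analysis",
        "granularity": "all",
        "dimension": dimension,
        "filter": {
            "type": "and",
            "fields": [
                {"type": "selector", "dimension": "host", "value": host},
                # Exclude sessions that never reached this level.
                {
                    "type": "not",
                    "field": {"type": "selector", "dimension": dimension, "value": None},
                },
            ],
        },
        "threshold": threshold,
        "metric": {"type": "numeric", "metric": "count"},
        "aggregations": [{"type": "longSum", "name": "count", "fieldName": "count"}],
        "intervals": list(intervals),
    }


def build_top_pages_queries(host, intervals, depth=5):
    """One topN query per level: landing_page, path1, ..., path(depth-1)."""
    dimensions = ["landing_page"] + [f"path{i}" for i in range(1, depth)]
    return {dim: build_top_pages_query(host, dim, intervals) for dim in dimensions}
```

Keying the result by dimension name makes it easy to match each response back to its level later.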
Compute the number of transitions between each pair of pages at adjacent levels; the null value at the later level is used to compute the drop-off count.
{
  "queryType": "groupBy",
  "dataSource": "visit_path_analysis",
  "granularity": "all",
  "dimensions": ["landing_page", "path1"],
  "filter": {
    "type": "and",
    "fields": [
      {"type": "selector", "dimension": "host", "value": "www.xxx.com"},
      {
        "type": "in",
        "dimension": "landing_page",
        "values": ["/a", "/b", "/c", "/d", "/e"]
      },
      {
        "type": "in",
        "dimension": "path1",
        "values": ["/f", "/g", "/h", "/i", "/j", null]
      }
    ]
  },
  "aggregations": [{ "type": "longSum", "name": "count", "fieldName": "count" }],
  "intervals": []
}
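
The two `in` filters are filled with the top 5 pages returned by the topN queries for the two adjacent levels. A small Python sketch of that assembly step follows; the function name and arguments are hypothetical, and it appends null to the later level's values so that drop-offs are returned alongside transitions.

```python
def build_transition_query(host, from_dim, from_pages, to_dim, to_pages, intervals):
    """Build a groupBy query over two adjacent levels.

    `from_pages` and `to_pages` are the top pages of each level. None is
    appended to the later level so that sessions ending at `from_dim`
    (the drop-offs) are part of the result.
    """
    return {
        "queryType": "groupBy",
        "dataSource": "visit_path_analysis",
        "granularity": "all",
        "dimensions": [from_dim, to_dim],
        "filter": {
            "type": "and",
            "fields": [
                {"type": "selector", "dimension": "host", "value": host},
                {"type": "in", "dimension": from_dim, "values": list(from_pages)},
                {"type": "in", "dimension": to_dim, "values": list(to_pages) + [None]},
            ],
        },
        "aggregations": [{"type": "longSum", "name": "count", "fieldName": "count"}],
        "intervals": list(intervals),
    }
```

For example, `build_transition_query("www.xxx.com", "landing_page", ["/a", "/b"], "path1", ["/f", "/g"], intervals)` produces a query of the same shape as the one above, restricted to those pages.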
4. Summary and Analysis
The Druid-based visitor path analysis proposed in this article takes several requests to complete.
The per-level total session counts and the per-level top 5 pages can be requested from Druid in parallel for the default view. Subtracting the top 5 pages' session counts from a level's total gives the session count of the remaining pages.
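
As a worked example of that subtraction, the Python sketch below uses made-up numbers arranged in the shape of Druid timeseries and topN responses. The helper name is hypothetical.

```python
def remaining_sessions(level_total, top_pages):
    """Sessions at one level that fall outside the top pages."""
    return level_total - sum(page["count"] for page in top_pages)


# Made-up numbers shaped like Druid timeseries and topN responses.
timeseries_response = [
    {"timestamp": "2024-01-01T00:00:00.000Z",
     "result": {"count0": 1000, "count1": 600}},
]
topn_response = [
    {"timestamp": "2024-01-01T00:00:00.000Z",
     "result": [
         {"landing_page": "/index.html", "count": 500},
         {"landing_page": "/product.html", "count": 300},
     ]},
]

level_total = timeseries_response[0]["result"]["count0"]
others = remaining_sessions(level_total, topn_response[0]["result"])
print(others)  # 1000 - (500 + 300) = 200
```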
Once the top 5 pages of each level are known, a GroupBy query over each pair of adjacent levels yields the transition counts and the drop-off counts.
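
One way to separate the two kinds of rows in a groupBy response is sketched below in Python. It assumes the object-style groupBy result, where each row carries an `event` map, and that a missing or null later-level value marks a drop-off. The function name is hypothetical.

```python
def split_transitions(rows, from_dim, to_dim):
    """Split groupBy rows into page-to-page transitions and drop-offs.

    Returns (transitions, drop_offs), where transitions maps
    (from_page, to_page) to a count and drop_offs maps from_page to a count.
    """
    transitions = {}
    drop_offs = {}
    for row in rows:
        event = row["event"]
        src = event[from_dim]
        dst = event.get(to_dim)  # None means the session ended at `src`
        count = event["count"]
        if dst is None:
            drop_offs[src] = drop_offs.get(src, 0) + count
        else:
            key = (src, dst)
            transitions[key] = transitions.get(key, 0) + count
    return transitions, drop_offs


# Made-up rows for illustration.
sample_rows = [
    {"version": "v1", "timestamp": "2024-01-01T00:00:00.000Z",
     "event": {"landing_page": "/a", "path1": "/f", "count": 40}},
    {"version": "v1", "timestamp": "2024-01-01T00:00:00.000Z",
     "event": {"landing_page": "/a", "path1": None, "count": 25}},
]
transitions, drop_offs = split_transitions(sample_rows, "landing_page", "path1")
```

With the sample rows, `transitions` holds the `/a` to `/f` link and `drop_offs` holds the sessions that ended on `/a`.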
To expand one more level, only a GroupBy between the top 5 of the current last level and the top 5 of the next level is needed.
In terms of data distribution, most traffic is concentrated in the first few steps, and later steps are smaller by orders of magnitude.
The biggest challenge of this approach comes from the concurrent load on Druid: rendering a single page fans out into multiple concurrent Druid queries.
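
One possible way to keep that fan-out under control is to send the independent queries through a bounded thread pool, as in the Python sketch below. `run_query` is a hypothetical callable that posts one query to Druid and returns the parsed response; it is passed in rather than implemented here, and the worker limit of 4 is an arbitrary illustrative value.

```python
from concurrent.futures import ThreadPoolExecutor


def run_queries_in_parallel(queries, run_query, max_workers=4):
    """Run named queries concurrently with a cap on in-flight requests.

    `queries` maps a name to a query dict. `run_query` is a caller-supplied
    function (hypothetical here) that sends one query and returns its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run_query, query) for name, query in queries.items()}
        # result() re-raises any exception from the worker thread.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Stand-in for a real HTTP call, used only to show the call pattern.
    def fake_run_query(query):
        return {"echo": query["queryType"]}

    results = run_queries_in_parallel(
        {"totals": {"queryType": "timeseries"}, "top_landing": {"queryType": "topN"}},
        fake_run_query,
    )
    print(results)
```

Capping `max_workers` limits how many queries one page render puts on Druid at once, at the cost of some extra latency when the cap is lower than the number of queries.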