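Disclaimer
I used the elasticsearch-definitive-guide-cn project as my introductory tutorial and corrected some of its problems in this article. For example, when a demo fails, it is usually because the guide targets an older version, and comparing it with the current official documentation reveals the cause. For instance, the filtered query was deprecated and then removed in ES 5.0; use a bool query with must / filter clauses instead.
Overall, Elasticsearch: The Definitive Guide is a decent introductory tutorial. If your English is good enough, I recommend reading the official documentation directly.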
Environment setup
Elasticsearch version: 6.8.2
For installation, see my other article, "Installing Elasticsearch with Docker on Windows".
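Talking to Elasticsearch
Java API
For more information about the Java API, see the relevant chapter: Java API
RESTful API over HTTP, with JSON as the data exchange format
All other programming languages can communicate with Elasticsearch over port 9200 through the RESTful API, using whichever web client you prefer. In fact, as you will see, you can even talk to Elasticsearch with the curl command.
A request to Elasticsearch is made up of the same parts as any other HTTP request: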
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
- VERB: the HTTP method: GET, POST, PUT, HEAD, or DELETE
- PROTOCOL: http or https (https is only available when there is an https proxy in front of Elasticsearch)
- HOST: the hostname of any node in the Elasticsearch cluster, or localhost for a node on your local machine
- PORT: the port of the Elasticsearch HTTP service, 9200 by default
- PATH: the API path (for example, _count returns the number of documents in the cluster). PATH may contain multiple components, such as _cluster/stats or _nodes/stats/jvm
- QUERY_STRING: optional query-string parameters; for example, ?pretty makes the response return pretty-printed, easier-to-read JSON
- BODY: a JSON-encoded request body (if the request needs one)
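To make the parts listed above concrete, here is a minimal Python sketch that assembles a curl command line from them. The helper name build_curl and its defaults are my own assumptions for illustration; it only builds a string and sends nothing. It also adds a Content-Type header when a body is present, because Elasticsearch 6.0 and later reject request bodies without one.

```python
import json
import shlex
from urllib.parse import urlencode


def build_curl(verb, path, query=None, body=None,
               protocol="http", host="localhost", port=9200):
    """Assemble a curl command from VERB, PROTOCOL, HOST, PORT, PATH, QUERY_STRING, BODY.

    build_curl is a hypothetical helper for illustration; it returns a string only.
    """
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query:
        url += "?" + urlencode(query)
    parts = ["curl", f"-X{verb}", shlex.quote(url)]
    if body is not None:
        # Elasticsearch 6.0+ requires an explicit content type for request bodies.
        parts += ["-H", shlex.quote("Content-Type: application/json")]
        parts += ["-d", shlex.quote(json.dumps(body))]
    return " ".join(parts)


print(build_curl("GET", "_count", query={"pretty": "true"},
                 body={"query": {"match_all": {}}}))
```

Pasting the printed command into a shell should count the documents of a local node, assuming one is listening on localhost:9200.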
Document oriented
Elasticsearch is document oriented, meaning it can store entire objects or documents. It does not just store them, though: it also indexes the content of each document to make it searchable. In Elasticsearch you index, search, sort, and filter documents rather than rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search.
JSON
Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON is supported by most programming languages and has become the standard format in the NoSQL world. It is concise, simple, and easy to read.
Let's start by adding a few records. You can run the following requests with Postman:
PUT /mycompany/customer/101
{
"first_name" : "Donald",
"last_name" : "Trump",
"gender" : "male",
"age":74,
"about" : "Businessman,the president of America,a crazy guy!He has lots of money!",
"interests": [ "golf", "music" ]
}
Response:
{
"_index": "mycompany",
"_type": "customer",
"_id": "101",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1
}
So easy! The index is mycompany, the type is customer, and the ID is 101. Let's create some more data:
PUT /mycompany/customer/102
{
"first_name" : "Jack",
"last_name" : "Ma",
"gender" : "male",
"age":56,
"about" : "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
"interests": [ "太極", "dance" ]
}
PUT /mycompany/customer/103
{
"first_name" : "Jackie",
"last_name" : "Chan",
"gender" : "male",
"age":66,
"about" : "Famous kung fu star,He is a Chinese!",
"interests": [ "kung fu", "music" ]
}
PUT /mycompany/customer/104
{
"first_name" : "Taylor",
"last_name" : "Swift",
"gender" : "female",
"age":31,
"about" : "American Country Singer!",
"interests": [ "music" ]
}
Retrieving documents
Query-string search
Lookup by ID
GET /{index}/{type}/{id}
The _search keyword
Use the _search keyword in place of the document ID. The hits array in the response contains all four of our documents. By default, a search returns the first 10 results.
GET /{index}/{type}/_search
Search with a condition
GET /{index}/{type}/_search?q={field}:{val}
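For example:
GET /mycompany/customer/_search?q=first_name:Jack
DSL queries
Query-string search is handy for ad hoc searches from the command line, but it has its limitations. Elasticsearch provides a rich, flexible query language called the Query DSL, which lets you build more complex and powerful queries.
The DSL (domain-specific language) is expressed as a JSON request body. For example, a search for customers whose about field mentions "singer" looks like this: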
GET /mycompany/customer/_search
{
"query" : {
"match" : {
"about" : "singer"
}
}
}
A more complex search
- Note: the filtered query was deprecated and then removed in ES 5.0; use a bool query with must / filter clauses instead.
GET /mycompany/customer/_search
{
"query" : {
"bool" : {
"filter" : {
"range" : {
"age" : { "gt" : 70 } <1>
}
},
"must" : {
"match" : {
"gender" : "male" <2>
}
}
}
}
}
- <1> This part of the query is a range filter. It finds all documents whose age is greater than 70; gt is short for "greater than".
- <2> This part of the query is a match query, the same kind of query used earlier.
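The way the filter and must clauses combine can be sketched in Python. The block below builds the same request body and then evaluates it against an in-memory copy of the four sample customers. The emulate function is a hypothetical stand-in I wrote for illustration; it is not how Elasticsearch executes the query, and it ignores scoring entirely.

```python
import json

# Sample customers from this article (only the fields the query touches).
CUSTOMERS = [
    {"_id": "101", "last_name": "Trump", "gender": "male", "age": 74},
    {"_id": "102", "last_name": "Ma", "gender": "male", "age": 56},
    {"_id": "103", "last_name": "Chan", "gender": "male", "age": 66},
    {"_id": "104", "last_name": "Swift", "gender": "female", "age": 31},
]


def bool_query(min_age_exclusive, gender):
    """Build the request body: a range filter plus a match clause."""
    return {
        "query": {
            "bool": {
                "filter": {"range": {"age": {"gt": min_age_exclusive}}},
                "must": {"match": {"gender": gender}},
            }
        }
    }


def emulate(body, docs):
    """Hypothetical local stand-in: a document must pass both clauses."""
    clauses = body["query"]["bool"]
    gt = clauses["filter"]["range"]["age"]["gt"]
    gender = clauses["must"]["match"]["gender"]
    return [d["_id"] for d in docs if d["age"] > gt and d["gender"] == gender]


body = bool_query(70, "male")
print(json.dumps(body, indent=2))
print(emulate(body, CUSTOMERS))
```

With the sample data, only customer 101 is both male and older than 70.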
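Full-text search
So far the searches have been simple: looking up specific names and filtering by age. Let's try something more advanced, full-text search, which traditional databases struggle with.
We will search for all customers who are "not interested in money":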
GET /mycompany/customer/_search
{
"query" : {
"match" : {
"about" : "not interested in money"
}
}
}
The search returns two people, but their _score values differ; a higher value means a better match.
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.1507283,
"hits": [
{
"_index": "mycompany",
"_type": "customer",
"_id": "102",
"_score": 1.1507283,
"_source": {
"first_name": "Jack",
"last_name": "Ma",
"gender": "male",
"age": 56,
"about": "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
"interests": [
"太極",
"dance"
]
}
},
{
"_index": "mycompany",
"_type": "customer",
"_id": "101",
"_score": 0.78111285,
"_source": {
"first_name": "Donald",
"last_name": "Trump",
"gender": "male",
"age": 74,
"about": "Businessman,the president of America,a crazy guy!He has lots of money!",
"interests": [
"golf",
"music"
]
}
}
]
}
}
By default, Elasticsearch sorts the result set by relevance score, which measures how well a document matches the query. Clearly, the about field of the top-ranked Jack Ma explicitly says "not interested in money".
But why does Trump also appear in the results? Because "money" is mentioned in his about field. Since only "money" is mentioned and "not interested in" is not, his _score is lower than Jack Ma's.
This example shows nicely how Elasticsearch performs full-text search across text fields and returns the most relevant results first. The concept of relevance is very important in Elasticsearch, and it is foreign to traditional relational databases, where a record either matches a query or does not.
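Why both customers match can be illustrated with a small Python sketch. It splits the query and each about field into lowercase words and reports which query words each document shares. This is only a term-overlap illustration under my own simplified tokenizer; the real _score in Elasticsearch 6.x comes from the BM25 similarity, which this sketch does not attempt to reproduce.

```python
import re

# The about fields of the four sample customers.
ABOUT = {
    "101": "Businessman,the president of America,a crazy guy!He has lots of money!",
    "102": "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
    "103": "Famous kung fu star,He is a Chinese!",
    "104": "American Country Singer!",
}


def tokenize(text):
    """Simplified tokenizer: lowercase, split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_overlap(query, docs):
    """Return, per document, the query terms it contains (any one term is enough to match)."""
    terms = set(tokenize(query))
    hits = {}
    for doc_id, text in docs.items():
        shared = terms & set(tokenize(text))
        if shared:
            hits[doc_id] = sorted(shared)
    return hits


print(match_overlap("not interested in money", ABOUT))
```

Document 102 shares all four query terms while 101 shares only "money", which is consistent with the ranking in the response above.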
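Phrase search
So far we can search for individual words in a field, which is fine, but sometimes you want to match an exact sequence of words, that is, a phrase. For example, we want to find only the people who are "not interested in money", not the people who "has lots of money".
To do this, simply change the match query to a match_phrase query: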
GET /mycompany/customer/_search
{
"query" : {
"match_phrase" : {
"about" : "not interested in money"
}
}
}
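The difference from a plain match query can be sketched as follows: a phrase only matches when all of its words appear next to each other and in the same order. The tokenizer and the contains_phrase helper are my own simplified assumptions for illustration, not the Elasticsearch implementation (which works on token positions stored in the index).

```python
import re

ABOUT = {
    "101": "Businessman,the president of America,a crazy guy!He has lots of money!",
    "102": "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
}


def tokenize(text):
    """Simplified tokenizer: lowercase, split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text, phrase):
    """True when the phrase tokens occur consecutively and in order within the text."""
    tokens, wanted = tokenize(text), tokenize(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


for doc_id, about in ABOUT.items():
    print(doc_id, contains_phrase(about, "not interested in money"))
```

Here only document 102 contains the full phrase, so Trump no longer appears even though his about field mentions "money".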
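Highlighting
Many applications like to highlight the matched keywords in each search result, so that users can see why a document matched the query. Highlighting snippets is easy in Elasticsearch.
Let's add a highlight parameter to the previous request: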
GET /mycompany/customer/_search
{
"query" : {
"match_phrase" : {
"about" : "not interested in money"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}
When we run this request, it hits the same result as before, but the response contains a new section called highlight, which holds the text from the about field with the matched words wrapped in <em></em>.
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.1507283,
"hits": [
{
"_index": "mycompany",
"_type": "customer",
"_id": "102",
"_score": 1.1507283,
"_source": {
"first_name": "Jack",
"last_name": "Ma",
"gender": "male",
"age": 56,
"about": "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!",
"interests": [
"太極",
"dance"
]
},
"highlight": {
"about": [
"Chinese Rural Teacher,the president of Alibaba, he is <em>not</em> <em>interested</em> <em>in</em> <em>money</em>!"
]
}
}
]
}
}
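The shape of the highlight fragment can be reproduced with a few lines of Python that wrap each query word in <em></em>. This is a rough sketch under my own assumptions (a regex on whole words, a hypothetical highlight helper); the actual Elasticsearch highlighters also handle analyzers, fragment sizes, and configurable tags.

```python
import re


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every whole-word occurrence of a query term in the given tags."""
    terms = sorted(set(re.findall(r"[a-z0-9]+", query.lower())), key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, text)


about = "Chinese Rural Teacher,the president of Alibaba, he is not interested in money!"
print(highlight(about, "not interested in money"))
```

Note that the "in" inside "Chinese" is left alone, because only whole words are wrapped.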
Analytics
Elasticsearch has a feature called aggregations, which lets you generate sophisticated analytics over your data. It is similar to GROUP BY in SQL, but much more powerful.
For example, let's find the most common interests among all customers:
GET /mycompany/customer/_search
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}
Running the request above directly fails with an error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "mycompany",
"node": "-Md3f007Q3G6HtdnkXoRiA",
"reason": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
}
],
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
},
"status": 400
}
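Since 5.x, sorting and aggregating on a text field relies on a separate in-memory data structure (fielddata), which is disabled by default and has to be enabled explicitly; see the official fielddata documentation for the details.
In short, run the following before aggregating: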
PUT /mycompany/_mapping/customer
{
"properties": {
"interests": {
"type": "text",
"fielddata": true
}
}
}
Response:
{
"acknowledged": true
}
The aggregation request now runs normally.
Aggregations can also be nested. For example, let's compute the average age of the customers who share each interest:
GET /mycompany/customer/_search
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}
Result:
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1.0,
"hits": [......]
},
"aggregations": {
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "music",
"doc_count": 3,
"avg_age": {
"value": 57.0
}
},
{
"key": "dance",
"doc_count": 1,
"avg_age": {
"value": 56.0
}
},
{
"key": "fu",
"doc_count": 1,
"avg_age": {
"value": 66.0
}
},
{
"key": "golf",
"doc_count": 1,
"avg_age": {
"value": 74.0
}
},
{
"key": "kung",
"doc_count": 1,
"avg_age": {
"value": 66.0
}
},
{
"key": "太",
"doc_count": 1,
"avg_age": {
"value": 56.0
}
},
{
"key": "極",
"doc_count": 1,
"avg_age": {
"value": 56.0
}
}
]
}
}
}
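This result is richer than that of the previous aggregation. We still get the list of interests with their counts (the number of customers who have each interest), but each interest now also has an avg_age field showing the average age of those customers.
Even if you do not understand the syntax yet, you can probably sense that this feature enables quite sophisticated aggregations over any kind of data.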
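The buckets in the result above (including "kung" and "fu" as separate keys, and "太" and "極" as single characters) can be reproduced with a small Python sketch. Because interests is an analyzed text field, the aggregation groups by token, not by the original value. The tokenizer below is my own approximation of that behaviour and terms_with_avg_age is a hypothetical helper; neither is Elasticsearch code.

```python
import re
from collections import defaultdict

# Ages and interests of the four sample customers.
CUSTOMERS = [
    {"age": 74, "interests": ["golf", "music"]},
    {"age": 56, "interests": ["太極", "dance"]},
    {"age": 66, "interests": ["kung fu", "music"]},
    {"age": 31, "interests": ["music"]},
]


def tokenize(text):
    """Approximation: ASCII words stay whole, each CJK ideograph becomes its own token."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def terms_with_avg_age(docs):
    """Group documents by interest token, then average the age inside each bucket."""
    ages = defaultdict(list)
    for doc in docs:
        tokens = {t for value in doc["interests"] for t in tokenize(value)}
        for token in tokens:  # a set, so each document counts once per token
            ages[token].append(doc["age"])
    buckets = [
        {"key": key, "doc_count": len(vals), "avg_age": sum(vals) / len(vals)}
        for key, vals in ages.items()
    ]
    # Largest buckets first, ties broken by key, mirroring the order in the response.
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))


for bucket in terms_with_avg_age(CUSTOMERS):
    print(bucket)
```

For the music bucket the average is (74 + 66 + 31) / 3 = 57.0, the same value the response reports.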