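ES Synonym Matching
Synonym-matching search in ES requires the user to supply a synonym table in the expected format and to place that table in `settings` when the index is created.
The synonym table can either be written directly into `settings` as strings, or be stored in a text file that ES reads.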
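Synonym table format
The synonym table must follow one of these formats:
- `A => B,C` format
  - When text is analyzed (for example at search time), the term A is replaced by B and C; B and C are not treated as synonyms of each other.
- `A,B,C,D` format
  - The behavior depends on `expand`:
    - When `expand == true`, this is equivalent to `A,B,C,D => A,B,C,D`, i.e. A, B, C and D are all synonyms of one another.
    - When `expand == false`, this is equivalent to `A,B,C,D => A`, i.e. all four terms are replaced by A at search time.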
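The two rule formats can be illustrated with a small Python sketch. This is only an illustration of the rules as described above, not how ES parses them internally; the function name `parse_rule` is made up for this example, and multi-word terms are not handled.

```python
def parse_rule(rule, expand=True):
    """Return {input term: replacement terms} for one synonym rule line."""
    if "=>" in rule:
        # Explicit mapping: every term on the left is replaced by the right side.
        left, right = rule.split("=>", 1)
        inputs = [t.strip() for t in left.split(",") if t.strip()]
        outputs = [t.strip() for t in right.split(",") if t.strip()]
    else:
        # Equivalent group: the outcome depends on `expand`.
        inputs = [t.strip() for t in rule.split(",") if t.strip()]
        outputs = inputs if expand else inputs[:1]
    return {term: list(outputs) for term in inputs}


print(parse_rule("A => B,C"))                     # {'A': ['B', 'C']}
print(parse_rule("A,B,C,D", expand=True)["B"])    # ['A', 'B', 'C', 'D']
print(parse_rule("A,B,C,D", expand=False)["D"])   # ['A']
```

With `expand=False` every member of the group collapses to the first term, which is why the order of terms on a line matters in that mode.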
How to query with the synonym table
Create the index
PUT /fond_goods
{
"settings": {
"number_of_replicas": 0,
"number_of_shards": 1,
"analysis": {
"analyzer": {
"my_whitespace":{
"tokenizer":"whitespace",
"filter": ["synonymous_filter"]
}
},
"filter": {
"synonymous_filter":{
"type": "synonym",
"expand": true,
"synonyms": [
"A, B, C, D"
]
}
}
}
},
"mappings": {
"properties": {
"code":{
"type": "keyword"
},
"context":{
"type": "text",
"analyzer": "my_whitespace"
},
"color":{
"type": "text",
"analyzer": "my_whitespace"
}
}
}
}
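Parameters
- `expand`
  Defaults to `true`.
- `lenient`
  Defaults to `false`. If `lenient` is `true`, ES ignores errors raised while parsing the synonym configuration. Note that only exceptions caused by a synonym rule that cannot be parsed are ignored; see the official documentation for a concrete example [ https://www.elastic.co/guide/en/elasticsearch/reference/7.16/analysis-synonym-tokenfilter.html ].
- `synonyms`
  The synonym table, i.e. the list of rules written in the format described at the beginning.
- `synonyms_path`
  Can be used in place of `synonyms`; in that case you give the path of an external file. The built-in filter expects a file on the node's local filesystem, given as an absolute path or as a path relative to the `config` directory. Loading the file from a remote URL is offered by plugins such as the dynamic-synonym plugin described later, not by the built-in filter.
- `format`
  When set to `wordnet`, the rules are parsed in WordNet format, so synonyms from the WordNet English lexical database can be used.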
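To make the relationship between these parameters concrete, here is a small Python helper that assembles the filter definition as a plain dictionary. The helper name `build_synonym_filter` is hypothetical, and the check that exactly one of `synonyms` and `synonyms_path` is given is this helper's own choice rather than documented ES behavior.

```python
import json


def build_synonym_filter(synonyms=None, synonyms_path=None, expand=True, lenient=False):
    """Build the body of one `synonym` token filter definition."""
    # This helper insists on exactly one source of rules: inline rules or a file path.
    if (synonyms is None) == (synonyms_path is None):
        raise ValueError("give exactly one of synonyms or synonyms_path")
    body = {"type": "synonym", "expand": expand, "lenient": lenient}
    if synonyms is not None:
        body["synonyms"] = list(synonyms)
    else:
        body["synonyms_path"] = synonyms_path
    return body


inline = build_synonym_filter(synonyms=["A, B, C, D"])
from_file = build_synonym_filter(synonyms_path="synonym.txt", lenient=True)
print(json.dumps({"filter": {"synonymous_filter": inline}}, indent=2))
```

The printed JSON can be pasted under `settings.analysis` of an index-creation request like the ones shown in this article.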
Usage example
Build the index
PUT /fond_goods
{
"settings": {
"number_of_replicas": 0,
"number_of_shards": 1,
"analysis": {
"analyzer": {
"my_whitespace":{................................................................ I
"tokenizer":"whitespace",
"filter": ["synonymous_filter"]
}
},
"filter": {
"synonymous_filter":{
"type": "synonym",
"synonyms_path": "synonym.txt"................................................. II
}
}
}
},
"mappings": {
"properties": {
"code":{
"type": "keyword"
},
"context":{
"type": "text",
"analyzer": "my_whitespace"
},
"color":{
"type": "text",
"analyzer": "my_whitespace"
}
}
}
}
Notes:
- I: `my_whitespace` is a custom analyzer.
- II: `synonyms_path` here is a path relative to the `config` folder of the ES installation.
Save the synonym file at that path
Women,women,girl,girls
yellow,orange,wheat
blue,skyblue
white,snow,silver
dress,dresses,skirt,skirts
autumn,fall
shirt,shirts
A,B,C
Insert the test data
POST _bulk
{"index" : {"_index" : "fond_goods", "_id":1}}
{"code" : 1,"context" : "ruffled shirt for women 2021 fall slim fit pure color all matching off-neck lantern long sleeve slim women short shirt", "color": "red"}
{"index" : {"_index" : "fond_goods", "_id":2}}
{"code" : 2,"context" : "2021 warmth pullover sweater fall", "color": "blue"}
{"index" : {"_index" : "fond_goods", "_id":3}}
{"code" : 3,"context" : "early autumn elegant dress women dress 2021 autumn new long sleeve", "color": "yellow"}
{"index" : {"_index" : "fond_goods", "_id":4}}
{"code" : 4,"context" : "2021 autumn new sweater yama autumn and winter female autumn and winter dot cardigan knitted coat", "color": "snow"}
{"index" : {"_index" : "fond_goods", "_id":5}}
{"code" : 5,"context" : "za satin party dinner skirts suits woemn sexy bandage shirts and high split skirt elegant luxurious female dinner sets", "color": "white"}
{"index" : {"_index" : "fond_goods", "_id":6}}
{"code" : 6,"context" : "big bow tie sweet puff sleeve shirt dress long sleeve shirt skirt solid color shirt dress short skirt ", "color": "moss green"}
{"index" : {"_index" : "fond_goods", "_id":7}}
{"code" : 7,"context" : "casual button plaid short skirts women streetwear a-line summer skirts female high waist yellow autumn short skirts", "color": "skyblue "}
{"index" : {"_index" : "fond_goods", "_id":8}}
{"code" : 8,"context" : "muslim middle east women fashion dress abaya long dress muslim dress arab dress dres", "color": "orange"}
{"index" : {"_index" : "fond_goods", "_id":9}}
{"code" : 9,"context" : "sexy white party dresses autumn winter sexy mini dresses women fashion solid color off shoulder short", "color": "wheat"}
{"index" : {"_index" : "fond_goods", "_id":10}}
{"code" : 10,"context" : "women green patchwork buttons bodycon mini dresses all-match office ladies long shirt dresses autumn party vestidos new", "color": "silver"}
{"index" : {"_index" : "fond_goods", "_id":11}}
{"code" : 11,"context" : "A", "color": "silver"}
{"index" : {"_index" : "fond_goods", "_id":12}}
{"code" : 12,"context" : "B", "color": "silver"}
{"index" : {"_index" : "fond_goods", "_id":13}}
{"code" : 13,"context" : "C", "color": "silver"}
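If the test data lives in code rather than in a console snippet, the bulk body can be generated. The following Python sketch only builds the newline-delimited JSON text and sends nothing to ES; the helper name `bulk_body` is made up, and it targets the `fond_goods` index that the queries in this article search.

```python
import json


def bulk_body(index, docs):
    """Serialize docs as NDJSON: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["code"]}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"code": 11, "context": "A", "color": "silver"},
    {"code": 12, "context": "B", "color": "silver"},
    {"code": 13, "context": "C", "color": "silver"},
]
body = bulk_body("fond_goods", docs)
print(body)
```

Each document contributes exactly two lines, so a body for N documents has 2N lines.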
Basic usage
Let's try a simple query that relies on the synonym table
- Query
GET fond_goods/_search
{
"query": {
"match": {
"context": "A"
}
}
}
- Query result
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 2.7354302,
"hits" : [
{
"_index" : "fond_goods",
"_type" : "_doc",
"_id" : "11",
"_score" : 2.7354302,
"_source" : {
"code" : 11,
"context" : "A",
"color" : "silver"
}
},
{
"_index" : "fond_goods",
"_type" : "_doc",
"_id" : "12",
"_score" : 2.7354302,
"_source" : {
"code" : 12,
"context" : "B",
"color" : "silver"
}
},
{
"_index" : "fond_goods",
"_type" : "_doc",
"_id" : "13",
"_score" : 2.7354302,
"_source" : {
"code" : 13,
"context" : "C",
"color" : "silver"
}
}
]
}
}
Deleting data
- Delete request
POST fond_goods/_delete_by_query
{
"query": {
"match": {
"context": "A"
}
}
}
- Delete result
{
"took" : 5,
"timed_out" : false,
"total" : 3,
"deleted" : 3,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
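We inserted three documents for the synonym group A, B, C, and three documents were deleted. In other words, deleting by A also deleted the documents for its synonyms B and C.
Conclusions
- We queried for A, yet the results contain the documents for B and C, so the synonym query works.
- We queried for A, yet B and C receive the same relevance score as A. At query time A, B and C are fully equivalent, and ES relevance scoring cannot tell them apart.
- When data is deleted by query, documents that match only through a synonym are deleted as well.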
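The first and third conclusions can be reproduced with a toy in-memory model. This Python sketch is a simplification, not ES itself: it treats a document as matching when its analyzed tokens overlap the analyzed query tokens, and it does not model scoring. All names in it are made up for the illustration.

```python
SYNONYM_GROUPS = [["A", "B", "C"]]


def analyze(text):
    """Whitespace-split, then add every member of a token's synonym group."""
    tokens = set()
    for word in text.split():
        tokens.add(word)
        for group in SYNONYM_GROUPS:
            if word in group:
                tokens.update(group)
    return tokens


def search(index, query):
    """Return ids of documents sharing at least one analyzed token with the query."""
    wanted = analyze(query)
    return sorted(doc_id for doc_id, text in index.items() if analyze(text) & wanted)


def delete_by_query(index, query):
    """Delete every document the query matches and return how many were removed."""
    matched = search(index, query)
    for doc_id in matched:
        del index[doc_id]
    return len(matched)


index = {11: "A", 12: "B", 13: "C", 2: "2021 warmth pullover sweater fall"}
print(search(index, "A"))           # [11, 12, 13]
print(delete_by_query(index, "A"))  # 3
print(sorted(index))                # [2]
```

Because deletion reuses the same matching step as search, anything a synonym query can find, a delete-by-query with the same condition will remove.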
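Dynamically updating the synonym file
The built-in synonym filter, configured as above, loads the synonym file when the analyzer is created. Later changes to the file are not picked up until the analyzer is rebuilt, for example by restarting ES, which is very inconvenient in a real project.
An ES plugin called elasticsearch-analysis-dynamic-synonym can be used to reload the synonym file dynamically.
Plugin address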
https://github.com/bells/elasticsearch-analysis-dynamic-synonym
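How to use the plugin
The project's README describes the usage in detail; here is a brief summary.
- Clone the project locally
- Package the project
- Create a `dynamic-synonym` folder inside the ES `plugins/` folder
- Unzip `target/releases/elasticsearch-analysis-dynamic-synonym-{version}.zip` into `dynamic-synonym`
- When creating the ES index, find the `"type": "synonym"` entry in the synonym filter configuration: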
"filter": {
"synonymous_filter":{
"type": "synonym",
"synonyms_path": "synonym.txt"
}
}
Change it to `"type": "dynamic_synonym"`:
"filter": {
"synonymous_filter":{
"type": "dynamic_synonym",
"synonyms_path": "synonym.txt"
}
}
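Note: the plugin also provides an optional parameter `interval`, the interval between checks of the synonym file; it defaults to 60s.
- Everything else works as before. From now on, every `60s` ES checks the modification time of the synonym file and reloads the file if it has changed.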
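The check-the-modification-time idea described above can be sketched in a few lines of Python. This is not the plugin's implementation, only an illustration of the polling approach; the class name `SynonymFile` is made up, and a temporary file stands in for `config/synonym.txt`.

```python
import os
import tempfile


class SynonymFile:
    """Keeps synonym rules in memory and reloads them when the file's mtime changes."""

    def __init__(self, path):
        self.path = path
        self.mtime = None
        self.rules = []
        self.reload_if_changed()

    def reload_if_changed(self):
        """Re-read the file only if its modification time differs; return True on reload."""
        mtime = os.path.getmtime(self.path)
        if mtime == self.mtime:
            return False
        with open(self.path, encoding="utf-8") as handle:
            self.rules = [line.strip() for line in handle if line.strip()]
        self.mtime = mtime
        return True


# Demo with a temporary file standing in for config/synonym.txt.
path = os.path.join(tempfile.mkdtemp(), "synonym.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("A,B,C\n")
watcher = SynonymFile(path)

with open(path, "a", encoding="utf-8") as handle:
    handle.write("autumn,fall\n")
os.utime(path, (watcher.mtime + 60, watcher.mtime + 60))  # simulate a later edit
print(watcher.reload_if_changed(), watcher.rules)
```

In a long-running process, a timer would call `reload_if_changed()` once per interval; calls made while the file is unchanged return `False` and do no I/O beyond reading the timestamp.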
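How synonym queries work
Analysis
To understand how synonym queries work, you first need to understand how ES breaks text into terms (Term). Analysis in ES is the process of splitting a piece of text into a series of words; it is also called text analysis. In ES, the analyzer (Analyzer) is responsible for this whole process.
An ES analyzer is made up mainly of character filters (Character Filter), a tokenizer (Tokenizer) and token filters (Token Filter).
- Character filter
  - Receives the text as a stream of characters and can transform it by adding, removing or changing characters.
  - An analyzer can have zero or more character filters.
- Tokenizer
  - Splits the text that has passed through the character filters into tokens according to some rule. An analyzer has exactly one tokenizer.
- Token filter
  - Processes the tokens produced by the tokenizer and can add, remove or modify tokens. An analyzer can have several token filters.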
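The three stages above can be pictured as a simple function pipeline. The Python sketch below is a toy model with made-up filter functions; it is meant to show the order of the stages, not to mirror any built-in ES component.

```python
def strip_bold_tags(text):
    """Character filter (toy): drop literal <b> and </b> markers from the raw text."""
    return text.replace("<b>", "").replace("</b>", "")


def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: modify each token."""
    return [token.lower() for token in tokens]


def drop_short(tokens):
    """Token filter: remove tokens shorter than two characters."""
    return [token for token in tokens if len(token) >= 2]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


result = analyze("A <b>Ruffled</b> Shirt", [strip_bold_tags], whitespace_tokenizer, [lowercase, drop_short])
print(result)  # ['ruffled', 'shirt']
```

Token filters run in the order they are listed, so swapping `lowercase` and `drop_short` would not change this result, but in general the order can matter.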
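Synonym filter
The key to synonym queries is a custom token filter. When this filter receives the tokens emitted by the tokenizer, it compares them against the rules loaded from the user's synonym file. When a token has synonyms, the filter uses those rules to decide which set of terms should be searched, and the search then runs against that set.
We can test this with the index from earlier. It uses the custom analyzer `my_whitespace`, whose tokenizer is the `whitespace` tokenizer and whose token filter is our custom synonym filter. So the only difference between our custom analyzer and the built-in `whitespace` analyzer is the token filter.
First, let's look at the tokens produced by the built-in `whitespace` analyzer: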
GET fond_goods/_analyze
{
"analyzer": "whitespace",
"field":"context",
"text": "A"
}
- Result
{
"tokens" : [
{
"token" : "A",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
}
]
}
After analysis, the string A yields the single token "A".
- Now let's try a longer string
GET fond_goods/_analyze
{
"analyzer": "whitespace",
"field":"context",
"text": "ruffled shirt for women 2021 fall slim fit pure color all matching off-neck lantern long sleeve slim women short shirt"
}
- Result
{
"tokens" : [
{
"token" : "ruffled",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "shirt",
"start_offset" : 8,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "for",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "women",
"start_offset" : 18,
"end_offset" : 23,
"type" : "word",
"position" : 3
},
{
"token" : "2021",
"start_offset" : 24,
"end_offset" : 28,
"type" : "word",
"position" : 4
},
{
"token" : "fall",
"start_offset" : 29,
"end_offset" : 33,
"type" : "word",
"position" : 5
},
{
"token" : "slim",
"start_offset" : 34,
"end_offset" : 38,
"type" : "word",
"position" : 6
},
{
"token" : "fit",
"start_offset" : 39,
"end_offset" : 42,
"type" : "word",
"position" : 7
},
{
"token" : "pure",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 8
},
{
"token" : "color",
"start_offset" : 48,
"end_offset" : 53,
"type" : "word",
"position" : 9
},
{
"token" : "all",
"start_offset" : 54,
"end_offset" : 57,
"type" : "word",
"position" : 10
},
{
"token" : "matching",
"start_offset" : 58,
"end_offset" : 66,
"type" : "word",
"position" : 11
},
{
"token" : "off-neck",
"start_offset" : 67,
"end_offset" : 75,
"type" : "word",
"position" : 12
},
{
"token" : "lantern",
"start_offset" : 76,
"end_offset" : 83,
"type" : "word",
"position" : 13
},
{
"token" : "long",
"start_offset" : 84,
"end_offset" : 88,
"type" : "word",
"position" : 14
},
{
"token" : "sleeve",
"start_offset" : 89,
"end_offset" : 95,
"type" : "word",
"position" : 15
},
{
"token" : "slim",
"start_offset" : 96,
"end_offset" : 100,
"type" : "word",
"position" : 16
},
{
"token" : "women",
"start_offset" : 101,
"end_offset" : 106,
"type" : "word",
"position" : 17
},
{
"token" : "short",
"start_offset" : 107,
"end_offset" : 112,
"type" : "word",
"position" : 18
},
{
"token" : "shirt",
"start_offset" : 113,
"end_offset" : 118,
"type" : "word",
"position" : 19
}
]
}
As shown above, the `whitespace` analyzer simply splits the input string on whitespace.
Now let's try the custom analyzer:
GET fond_goods/_analyze
{
"analyzer": "my_whitespace",
"field":"context",
"text": "A"
}
- Result
{
"tokens" : [
{
"token" : "A",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "B",
"start_offset" : 0,
"end_offset" : 1,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "C",
"start_offset" : 0,
"end_offset" : 1,
"type" : "SYNONYM",
"position" : 0
}
]
}
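After analysis, the string A yields three tokens: A, B and C. They are distinguished by the `type` field: A is marked `word`, while B and C are marked `SYNONYM`.
- Let's try the long string again (note: in the synonym file, shirt,shirts is one synonym group and Women,women,girl,girls is another)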
GET fond_goods/_analyze
{
"analyzer": "my_whitespace",
"field":"context",
"text": "ruffled shirt for women 2021 fall slim fit pure color all matching off-neck lantern long sleeve slim women short shirt"
}
- Result
{
"tokens" : [
{
"token" : "ruffled",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "shirt",
"start_offset" : 8,
"end_offset" : 13,
"type" : "word",
"position" : 1
},
{
"token" : "shirts",
"start_offset" : 8,
"end_offset" : 13,
"type" : "SYNONYM",
"position" : 1
},
{
"token" : "for",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "women",
"start_offset" : 18,
"end_offset" : 23,
"type" : "word",
"position" : 3
},
{
"token" : "Women",
"start_offset" : 18,
"end_offset" : 23,
"type" : "SYNONYM",
"position" : 3
},
{
"token" : "girl",
"start_offset" : 18,
"end_offset" : 23,
"type" : "SYNONYM",
"position" : 3
},
{
"token" : "girls",
"start_offset" : 18,
"end_offset" : 23,
"type" : "SYNONYM",
"position" : 3
},
{
"token" : "2021",
"start_offset" : 24,
"end_offset" : 28,
"type" : "word",
"position" : 4
},
{
"token" : "fall",
"start_offset" : 29,
"end_offset" : 33,
"type" : "word",
"position" : 5
},
{
"token" : "autumn",
"start_offset" : 29,
"end_offset" : 33,
"type" : "SYNONYM",
"position" : 5
},
{
"token" : "slim",
"start_offset" : 34,
"end_offset" : 38,
"type" : "word",
"position" : 6
},
{
"token" : "fit",
"start_offset" : 39,
"end_offset" : 42,
"type" : "word",
"position" : 7
},
{
"token" : "pure",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 8
},
{
"token" : "color",
"start_offset" : 48,
"end_offset" : 53,
"type" : "word",
"position" : 9
},
{
"token" : "all",
"start_offset" : 54,
"end_offset" : 57,
"type" : "word",
"position" : 10
},
{
"token" : "matching",
"start_offset" : 58,
"end_offset" : 66,
"type" : "word",
"position" : 11
},
{
"token" : "off-neck",
"start_offset" : 67,
"end_offset" : 75,
"type" : "word",
"position" : 12
},
{
"token" : "lantern",
"start_offset" : 76,
"end_offset" : 83,
"type" : "word",
"position" : 13
},
{
"token" : "long",
"start_offset" : 84,
"end_offset" : 88,
"type" : "word",
"position" : 14
},
{
"token" : "sleeve",
"start_offset" : 89,
"end_offset" : 95,
"type" : "word",
"position" : 15
},
{
"token" : "slim",
"start_offset" : 96,
"end_offset" : 100,
"type" : "word",
"position" : 16
},
{
"token" : "women",
"start_offset" : 101,
"end_offset" : 106,
"type" : "word",
"position" : 17
},
{
"token" : "Women",
"start_offset" : 101,
"end_offset" : 106,
"type" : "SYNONYM",
"position" : 17
},
{
"token" : "girl",
"start_offset" : 101,
"end_offset" : 106,
"type" : "SYNONYM",
"position" : 17
},
{
"token" : "girls",
"start_offset" : 101,
"end_offset" : 106,
"type" : "SYNONYM",
"position" : 17
},
{
"token" : "short",
"start_offset" : 107,
"end_offset" : 112,
"type" : "word",
"position" : 18
},
{
"token" : "shirt",
"start_offset" : 113,
"end_offset" : 118,
"type" : "word",
"position" : 19
},
{
"token" : "shirts",
"start_offset" : 113,
"end_offset" : 118,
"type" : "SYNONYM",
"position" : 19
}
]
}
As shown, the analyzer expands shirt and women into the token groups `shirt, shirts` and `women, Women, girl, girls`, and marks each added token accordingly. In the same way, fall gains the synonym autumn.
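The shape of these results, with synonyms sharing the position and offsets of the original token, can be imitated in Python. The sketch below is a simplified stand-in for the real filter: it handles only single-word synonyms, and the function name `analyze_with_synonyms` is made up. The groups are copied from the synonym file used in this article.

```python
import re

SYNONYM_GROUPS = [
    ["Women", "women", "girl", "girls"],
    ["autumn", "fall"],
    ["shirt", "shirts"],
    ["A", "B", "C"],
]


def analyze_with_synonyms(text):
    """Whitespace-tokenize, then emit each synonym at the original token's position."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        word = match.group()
        base = {"start_offset": match.start(), "end_offset": match.end(), "position": position}
        tokens.append({"token": word, "type": "word", **base})
        for group in SYNONYM_GROUPS:
            if word in group:
                for synonym in group:
                    if synonym != word:
                        # Synonyms reuse the offsets and position of the original token.
                        tokens.append({"token": synonym, "type": "SYNONYM", **base})
    return tokens


for token in analyze_with_synonyms("shirt for women"):
    print(token["token"], token["type"], token["position"])
```

Since a synonym occupies the same position as the word it was derived from, a query for any member of a group lines up with documents containing any other member.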
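References
The section on how synonym queries work draws on these two articles:
https://blog.csdn.net/woshixubo123/article/details/121774972
https://blog.csdn.net/woshixubo123/article/details/121898514
Everything else comes from the official documentation or from my own examples.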