Reading notes:
1. If the formatting of this page is broken, please visit https://www.yuque.com/mrhuang-ire4d/oufb8x/gmzl30v8ofqg3ua3?singleDoc# 《elasticsearch分詞》 and switch to wide-screen mode for a better reading experience.
2. This is an original article; please credit the source when reposting.
Before an inverted index can be built, an analyzer must first process the document text. An analyzer is the component that splits text into a sequence of words (terms, or tokens). It consists of the following three parts (a quick end-to-end sketch follows the list):
- Character filters: transform the raw characters, for example stripping HTML tags, converting numerals to words, or converting special symbols to text;
- Tokenizer: splits the text into one or more tokens according to its rules;
- Token filters: post-process the tokens, for example lowercasing them or removing stop words;
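The _analyze API lets you assemble these three components ad hoc, which is handy for experimenting. A minimal sketch of the full pipeline (the sample text is made up):
POST _analyze
{
"char_filter": [ "html_strip" ],
"tokenizer": "standard",
"filter": [ "lowercase" ],
"text": "<b>The QUICK Brown-Foxes!</b>"
}
This should yield [ the, quick, brown, foxes ]: the character filter strips the tags, the tokenizer splits the words, and the token filter lowercases them.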
Lucene, the search engine underlying Elasticsearch, then builds the inverted index from these tokens; only fields of type text are analyzed.
eleasticsearch內(nèi)部自帶了多個(gè)分析器充易,還可以自定義分析器理卑。對(duì)于中文,有專業(yè)的第三方分析器比如IKAnalyzer等蔽氨。
Usually, search and indexing share one analyzer, which ensures that the tokens produced for a query have the same form as the tokens in the inverted index. Elasticsearch also supports using different analyzers for indexing and search to meet special requirements (a mapping sketch follows the list below):
● analyzer: the index-time analyzer; in this scheme it keeps stop words, so they appear in the index;
● search_analyzer: the analyzer applied to non-phrase queries; it removes stop words;
● search_quote_analyzer: the analyzer applied to phrase queries; it keeps stop words;
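A minimal mapping sketch of this scheme (index name my_index and field name title are made up; standard keeps stop words because its default stop-word list is _none_, while the stop analyzer removes them):
PUT my_index
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "stop",
"search_quote_analyzer": "standard"
}
}
}
}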
內(nèi)置分析器
Standard Analyzer
默認(rèn)分詞器, 按Unicode文本分割算法拆分 , 轉(zhuǎn)化為小寫 , 支持中文(但是中文按照每個(gè)文字拆分,沒啥意義)
Example
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
Standard Analyzer configuration
參數(shù) | 說明 |
---|---|
max_token_length | 分詞最大長度,默認(rèn)255毙驯。 |
stopwords | 自定義停滯詞倒堕。例如_english_,或一個(gè)停滯詞數(shù)組爆价, 默認(rèn)_english_垦巴。 |
stopwords_path | 包含停滯詞的文件的路徑,路徑相對(duì)于Elasticsearch的config目錄铭段。 |
Custom configuration example
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
Simple Analyzer
Splits on any non-letter character and lowercases the tokens.
Example
POST _analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Stop Analyzer
在simple的基礎(chǔ)上序愚, 刪除停滯詞(停止詞是指常見而無意義的詞匯比如a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with)憔披,默認(rèn)使用 stop token filter 的_english_預(yù)定義。
Example
POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Stop Analyzer configuration
參數(shù) | 說明 |
---|---|
stopwords | 自定義停滯詞爸吮。例如_english_活逆,或一個(gè)停滯詞數(shù)組, 默認(rèn)_english_拗胜。 |
stopwords_path | 包含停滯詞的文件的路徑蔗候,路徑相對(duì)于Elasticsearch的config目錄。 |
Custom configuration example
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["the", "over"]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_stop_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
Whitespace Analyzer
特點(diǎn):遇到空格的時(shí)候會(huì)進(jìn)行分詞 , 不會(huì)轉(zhuǎn)小寫埂软。
Example:
POST _analyze
{
"analyzer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
Pattern Analyzer
Splits text with a regular expression, \W+ by default, and lowercases the tokens by default.
Example
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Pattern Analyzer configuration
參數(shù) | 說明 |
---|---|
pattern | 正則表達(dá)式勘畔,默認(rèn)\W+所灸。 |
flags | Java正則表達(dá)式flags,多個(gè)用|分離炫七,例如"CASE_INSENSITIVE | COMMENTS"爬立。 |
lowercase | 是否小寫。默認(rèn)true万哪。 |
stopwords | 預(yù)定義停滯詞列表侠驯,例如_english_, 或一個(gè)停滯詞數(shù)組抡秆,默認(rèn)_none_。 |
stopwords_path | 包含停滯詞的文件的路徑吟策。 |
Custom configuration example
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "pattern",
"pattern": "\\W|_",
"lowercase": true
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_email_analyzer",
"text": "John_Smith@foo-bar.com"
}
分詞結(jié)果:[ john, smith, foo, bar, com ]
Keyword Analyzer
特點(diǎn):不分詞 直接將輸入當(dāng)做輸出
Example
POST _analyze
{
"analyzer": "keyword",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
分詞結(jié)果:[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
Fingerprint Analyzer
實(shí)現(xiàn)了指紋識(shí)別算法儒士。
對(duì)輸入文本小寫,規(guī)范化刪掉擴(kuò)展符檩坚,排序着撩,去重,連接處理成單個(gè)分詞匾委。如果配置了停滯詞列表拖叙,停滯詞也將被刪除。
Example
POST _analyze
{
"analyzer": "fingerprint",
"text": "Yes yes, G?del said this sentence is consistent and."
}
分詞結(jié)果:[ and consistent godel is said sentence this yes ]
Language Analyzers
針對(duì)特殊語言專門的分詞器赂乐。這里不做詳細(xì)介紹薯鳍,后面有針對(duì)中文講解中文分詞器。
Custom analyzers
當(dāng)內(nèi)置分析器不能滿足您的需求時(shí)沪猴,您可以創(chuàng)建一個(gè)自定義分析器。一個(gè)自定義分詞器由零個(gè)或多個(gè)char filters, 一個(gè) Tokenizer 和零個(gè)或多個(gè) token filters 組成采章。elasticsearch已經(jīng)內(nèi)置了多個(gè)字符過濾器运嗜,分詞器和分詞過濾器。
Character Filters
- HTML Strip Character Filter: removes HTML elements from the input text; for example <p>Hello World</p> becomes Hello World. Short name: html_strip.
- Mapping Character Filter: replaces specified characters or strings with others according to a predefined mapping; for example, you can define a rule that replaces "&" with "and". Short name: mapping.
- Pattern Replace Character Filter: matches and replaces characters using a regular expression. Short name: pattern_replace.
HTML Strip Character Filter example
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"text": "<p>I'm so <b>happy</b>!</p>"
}
分詞結(jié)果:[ \nI'm so happy!\n ]
Mapping Character Filter example
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"? => 0",
"? => 1",
"? => 2",
"? => 3",
"? => 4",
"? => 5",
"? => 6",
"? => 7",
"? => 8",
"? => 9"
]
}
],
"text": "My license plate is ?????"
}
分詞結(jié)果:[ My license plate is 25015 ]
Pattern Replace Character Filter example
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
],
"text": "My credit card is 123-456-789"
}
分詞結(jié)果:[My credit card is 123_456_789]
Tokenizers
在內(nèi)置分詞器部分已經(jīng)詳細(xì)講過了。
Token Filters
系統(tǒng)已經(jīng)提供了多個(gè)分詞過濾器姿染,完整請(qǐng)參考Token filter reference | Elasticsearch Guide [7.7] | Elastic背亥。
- Lowercase token filter: lowercases the tokens. Short name: lowercase.
- Stop token filter: removes stop words. Short name: stop.
Lowercase token filter example
GET _analyze
{
"tokenizer" : "standard",
"filter" : ["lowercase"],
"text" : "THE Quick FoX JUMPs"
}
分詞結(jié)果:[ the, quick, fox, jumps ]
Stop token filter example
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "stop" ],
"text": "a quick fox jumps over the lazy dog"
}
分詞結(jié)果:[ quick, fox, jumps, over, lazy, dog ]
Custom analyzer example
下面有一個(gè)完整的自定義分析器示例悬赏,包含字符過濾器狡汉,分詞器和分詞過濾器。
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"char_filter": [
"emoticons"
],
"tokenizer": "punctuation",
"filter": [
"lowercase",
"english_stop"
]
}
},
"tokenizer": {
"punctuation": {
"type": "pattern",
"pattern": "[ .,!?]"
}
},
"char_filter": {
"emoticons": {
"type": "mapping",
"mappings": [
":) => _happy_",
":( => _sad_"
]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "I'm a :) person, and you?"
}
分詞結(jié)果:[ i'm, _happy_, person, you ]
Chinese analyzer: IKAnalyzer
IKAnalyzer provides two analysis modes:
ik_max_word: splits the text at the finest granularity; for example, 中華人民共和國國歌 is split into 中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌, exhausting the possible combinations. Suited to term queries;
ik_smart: performs the coarsest-grained split; for example, 中華人民共和國國歌 is split into 中華人民共和國 and 國歌. Suited to phrase queries.
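The two modes are easy to compare directly via _analyze (this assumes the IK plugin is installed):
POST _analyze
{
"analyzer": "ik_smart",
"text": "中華人民共和國國歌"
}
Per the description above, this should yield 中華人民共和國 and 國歌; swap in ik_max_word to see the fine-grained variant.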
Example
//創(chuàng)建索引
XPUT /index
//映射配置
POST /index/_mapping -H 'Content-Type:application/json' -d'
{
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
//寫入數(shù)據(jù)
POST /index/_create/1
{"content":"美國留給伊拉克的是個(gè)爛攤子嗎"}
POST /index/_create/2
{"content":"公安部:各地校車將享最高路權(quán)"}
POST /index/_create/3
{"content":"中韓漁警沖突調(diào)查:韓警平均每天扣1艘中國漁船"}
POST /index/_create/4
{"content":"中國駐洛杉磯領(lǐng)事館遭亞裔男子槍擊 嫌犯已自首"}
// search for documents containing 中國
POST /index/_search
{
"query" : { "match" : { "content" : "中國" }},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : {
"content" : {}
}
}
}
返回結(jié)果:
{"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":2,"hits":[{"_index":"index","_type":"fulltext","_id":"4","_score":2,"_source":{"content":"中國駐洛杉磯領(lǐng)事館遭亞裔男子槍擊 嫌犯已自首"},"highlight":{"content":["<tag1>中國</tag1>駐洛杉磯領(lǐng)事館遭亞裔男子槍擊 嫌犯已自首 "]}},{"_index":"index","_type":"fulltext","_id":"3","_score":2,"_source":{"content":"中韓漁警沖突調(diào)查:韓警平均每天扣1艘中國漁船"},"highlight":{"content":["均每天扣1艘<tag1>中國</tag1>漁船 "]}}]}}
Dictionary configuration
IKAnalyzer已經(jīng)內(nèi)置了詞庫可婶,在目錄{ik_path}/config下沿癞。
- main.dic: the main dictionary.
- stopword.dic: English stop words; they are not added to the inverted index.
- quantifier.dic: special dictionary for units of measurement.
- suffix.dic: special dictionary for administrative-division suffixes.
- surname.dic: special dictionary for Chinese surnames.
- preposition.dic: special dictionary for function words.
當(dāng)然铣猩,也支持用戶自定義詞庫揖铜,在{ik_path}/config目錄下添加自定義詞庫文件,要求每行一個(gè)詞并且UTF8 編碼达皿,然后修改文件{ik_path}/config/IKAnalyzer.cfg.xml, 添加自定義詞庫位置天吓,重啟es。
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- configure your own extension dictionary here -->
<entry key="ext_dict"></entry>
<!-- configure your own extension stop-word dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
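For example, with a custom dictionary file at custom/my_words.dic (a hypothetical path, relative to the config directory), the ext_dict entry would read:
<entry key="ext_dict">custom/my_words.dic</entry>
The IK README separates multiple dictionary files within one entry with semicolons.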
References:
[1].https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-analyzers.html
[2].https://mp.weixin.qq.com/s?__biz=Mzg4Nzc3NjkzOA==&mid=2247485544&idx=1&sn=cfa20adbb5c7328ea0cab85966d95c02&chksm=cf847badf8f3f2bbefd1b9e893cccf10a24c2a83f8052b613c62c999566e4c8616fded236552#rd
[3].https://zhuanlan.zhihu.com/p/580356194
[4].https://github.com/medcl/elasticsearch-analysis-ik
[5].https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-charfilters.html
[6].https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-tokenizers.html
[7].https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-tokenfilters.html