Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis.html
Analyzer (analyzer)
An analyzer is made up of the following properties (a minimal custom-analyzer sketch follows this list):
Analyzer type type: custom
Character filter char_filter: zero or more
Tokenizer tokenizer: exactly one
Token filter filter: zero or more, applied in order
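To see how these pieces fit together, here is a minimal sketch of a custom analyzer definition (the index name my_custom_index and the analyzer name my_custom_analyzer are illustrative assumptions, not from the original text): it chains the built-in html_strip character filter, the standard tokenizer and the lowercase token filter.
PUT my_custom_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "standard",
"filter": ["lowercase"]
}
}
}
}
}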
Character filters
A character filter (also called a pre-processing filter) pre-processes the character stream before it is passed to the tokenizer.
There are three built-in character filters:
1. html_strip
: HTML tag character filter
Features:
a. Strips HTML tags from the original text.
Optional settings:
escaped_tags: an array of HTML tags that should not be stripped from the original text.
example:
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [ "html_strip" ],
"text": "<p>I'm so <b>happy</b>!</p>"
}
{
"tokens": [
{
"token": """
I'm so happy!
""",
"start_offset": 0,
"end_offset": 32,
"type": "word",
"position": 0
}
]
}
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": ["my_char_filter"]
}
},
"char_filter": {
"my_char_filter": {
"type": "html_strip",
"escaped_tags": ["b"] // 不從原始文本中過濾<b></b>標(biāo)簽
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "<p>I'm so <b>happy</b>!</p>"
}
{
"tokens": [
{
"token": """
I'm so <b>happy</b>!
""",
"start_offset": 0,
"end_offset": 32,
"type": "word",
"position": 0
}
]
}
2. mapping
: mapping character filter
Features:
a. The mapping character filter accepts an array of key-value pairs. Whenever it encounters a string matching a key, it replaces it with the value associated with that key.
b. Matching is greedy; the longest matching pattern wins.
c. Replacement with an empty string is allowed.
Optional settings:
mappings: an array of key-value mappings.
mappings_path: the path to a file containing an array of key-value mappings.
example:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"& => and",
"$ => ¥"
]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "My license plate is $203 & $110"
}
{
"tokens": [
{
"token": "My license plate is ¥203 and ¥110",
"start_offset": 0,
"end_offset": 31,
"type": "word",
"position": 0
}
]
}
3. pattern_replace
: regular-expression replacement character filter
Features:
a. Matches characters with a regular expression and replaces them with the specified replacement string.
b. The replacement string can refer to capture groups in the regular expression.
Optional settings:
pattern: a Java regular expression. Required.
replacement: the replacement string, which may reference capture groups using the $1..$9 syntax.
flags: Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
example:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "My credit card is 123-456-789"
}
{
"tokens": [
{
"token": "My credit card is 123_456_789",
"start_offset": 0,
"end_offset": 29,
"type": "word",
"position": 0
}
]
}
Tokenizers
standard tokenizer
Features:
The standard tokenizer works well for most European languages and supports Unicode.
Optional settings:
max_token_length: the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255. (A sketch of its effect follows the configuration example below.)
ex:
POST _analyze
{
"tokenizer": "standard",
"text": "The 2 QUICK Brown-Foxes of dog's bone."
}
Result
[The, 2, QUICK, Brown, Foxes, of, dog's, bone]
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
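A quick sketch of what max_token_length does (assuming the my_index definition above; the sample sentence is illustrative): any token longer than 5 characters should be split into 5-character chunks.
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped"
}
Expected result (roughly): [The, 2, QUICK, Brown, Foxes, jumpe, d]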
letter tokenizer
Features:
Splits the text whenever it encounters a character that is not a letter.
Optional settings:
Not configurable.
ex:
POST _analyze
{
"tokenizer": "letter",
"text": "The 2 QUICK Brown-Foxes of dog's bone."
}
Result:
[The, QUICK, Brown, Foxes, of, dog, s, bone]
lowercase tokenizer
Features:
Can be thought of as the combination of the letter tokenizer and the lowercase token filter (a quick sketch follows).
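A minimal sketch (the sample sentence simply mirrors the earlier examples):
POST _analyze
{
"tokenizer": "lowercase",
"text": "The 2 QUICK Brown-Foxes of dog's bone."
}
Expected result (roughly): [the, quick, brown, foxes, of, dog, s, bone]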
whitespace tokenizer
Features:
Splits the text whenever it encounters whitespace (a quick sketch follows).
Optional settings:
None.
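A minimal sketch (same sample sentence as above):
POST _analyze
{
"tokenizer": "whitespace",
"text": "The 2 QUICK Brown-Foxes of dog's bone."
}
Expected result (roughly): [The, 2, QUICK, Brown-Foxes, of, dog's, bone.]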
uax_url_email tokenizer
Features:
Similar to the standard tokenizer, but recognizes URLs and email addresses as single tokens.
Optional settings:
max_token_length: defaults to 255.
ex:
POST _analyze
{
"tokenizer": "uax_url_email",
"text": "Email me at john.smith@global-international.com http://www.baidu.com"
}
Result
[Email, me, at, john.smith@global-international.com, http://www.baidu.com]
classic tokenizer
Features:
A grammar-based tokenizer built for English. It handles English acronyms, company names, email addresses, and most internet domain names well, but it does not work well for languages other than English (a quick sketch follows).
Optional settings:
max_token_length: defaults to 255.
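A minimal sketch (reusing the email sample from above); the classic tokenizer should keep the address as a single token:
POST _analyze
{
"tokenizer": "classic",
"text": "Email me at john.smith@global-international.com"
}
Expected result (roughly): [Email, me, at, john.smith@global-international.com]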
thai tokenizer
Features:
A tokenizer dedicated to segmenting Thai text (a quick sketch follows).
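A minimal sketch, using the Thai sample sentence from the official documentation:
POST _analyze
{
"tokenizer": "thai",
"text": "การที่ได้ต้องแสดงว่างานดี"
}
Expected result (roughly): [การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี]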
ngram tokenizer
Features:
N-grams are like a sliding window that moves across the word: a continuous sequence of characters of the specified length. They are useful for querying languages that do not use spaces, such as German or Chinese.
Optional settings:
min_gram: minimum length of the generated grams.
max_gram: maximum length of the generated grams.
token_chars: character classes to keep in a token; Elasticsearch splits the text on characters that do not belong to the specified classes. Defaults to [] (keep all characters).
Available values for token_chars:
letter — for example a, b, ï or 京
digit — for example 3 or 7
whitespace — for example " " or "\n"
punctuation — for example ! or "
symbol — for example $ or √
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter":["lowercase"]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "2 2311 Quick Foxes."
}
Tokenization result:
[231, 2311, 311, qui, quic, quick, uic, uick, ick, fox, foxe, foxes, oxe, oxes, xes]
edge_ngram tokenizer
The edge n-gram tokenizer differs from the ngram tokenizer in that ngram slides a window anywhere inside the text (typically used for search suggestions), while edge_ngram is anchored to the start of the text (typically used for autocomplete).
For example:
POST _analyze
{
"tokenizer": "ngram",
"text": "a Quick Foxes."
}
ngram tokenization result:
["a", "a ", " ", " Q", "Q", "Qu", "u", "ui", "i", "ic", "c", "ck", "k", "k ", " ", " F", "F", "Fo", "o", "ox", "x", "xe", "e", "es", "s", "s.", "."]
POST _analyze
{
"tokenizer": "edge_ngram",
"text": "a Quick Foxes."
}
edge_ngram tokenization result:
["a", "a "]
From the test results above we can see:
By default, both ngram and edge_ngram treat "a Quick Foxes." as one continuous block of text.
By default, both ngram and edge_ngram use a minimum length of 1 and a maximum length of 2.
ngram is a fixed-size window sliding across the text (commonly used for search suggestions).
edge_ngram keeps the start position fixed and grows the window from the minimum to the maximum length (commonly used for word autocomplete). A configured autocomplete sketch follows.
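A sketch of how edge_ngram is typically configured for autocomplete (the index name my_edge_index, the tokenizer name my_edge_tokenizer and the gram lengths are illustrative choices, not taken from the text above):
PUT my_edge_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"tokenizer": "my_edge_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"my_edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": ["letter", "digit"]
}
}
}
}
}
POST my_edge_index/_analyze
{
"analyzer": "autocomplete_analyzer",
"text": "Quick"
}
Analyzing "Quick" with this analyzer should yield roughly [qu, qui, quic, quick].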
keyword tokenizer
Features:
The keyword tokenizer emits the entire input as a single token (a quick sketch follows).
Optional settings:
buffer_size: defaults to 256.
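A minimal sketch (same sample sentence as above); the whole input should come back as one token:
POST _analyze
{
"tokenizer": "keyword",
"text": "The 2 QUICK Brown-Foxes of dog's bone."
}
Expected result: [The 2 QUICK Brown-Foxes of dog's bone.]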
pattern tokenizer
Tokenizes the text using a regular expression.
Optional settings:
pattern: a Java regular expression; defaults to \W+.
flags: Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
group: which capture group to extract as tokens. Defaults to -1 (split).
group = -1 (default): the regular-expression matches are used as separators to split the text.
group = 0: the strings matching the whole regular expression are kept as the tokens.
group = 1, 2, 3...: the content of the corresponding capture group () in the regular expression is kept as the tokens.
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "\"(.*)\"",
"flags": "",
"group": -1
}
}
}
}
}
Note: "\"(.*)\"" and "\".*\"" both match strings wrapped in double quotes; the difference is that the former contains a capture group while the latter does not.
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "comma,\"separated\",values"
}
Tokenization results:
When group is the default -1, the regex matches act as separators, giving: ["comma", "values"]
When group = 0, the strings matched by the whole regex become the tokens: ["\"separated\""]
When group = 1, the content of the first capture group () becomes the tokens: ["separated"]
When group = 2, the content of the second capture group () would become the tokens; since this regular expression has only one capture group, an exception is thrown.
path_hierarchy tokenizer
The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits it on the path separator, and emits a term for each component in the tree.
Optional settings:
delimiter: the character used to split the path. Defaults to /.
replacement: an optional replacement character for the delimiter. Defaults to the delimiter.
buffer_size: the maximum path length read per pass. Defaults to 1024.
reverse: whether to emit tokens in reverse order. Defaults to false.
skip: the number of initial tokens to skip. Defaults to 0.
ex:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-",
"replacement":"/",
"reverse": false,
"skip": 0
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "one-two-three-four"
}
{
"tokens": [
{
"token": "one",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "one/two",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "one/two/three",
"start_offset": 0,
"end_offset": 13,
"type": "word",
"position": 0
},
{
"token": "one/two/three/four",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}
一誉帅、內(nèi)置的8種分析器:
-
standard analyzer
:默認(rèn)分詞器,它提供了基于語法的分詞(基于Unicode文本分割算法,如Unicode? Standard Annex #29所指定的)右莱,適用于大多數(shù)語言,對中文分詞效果很差堵第。
POST _analyze
{
"analyzer":"standard",
"text":"Geneva K. Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "risk",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "issues",
"start_offset": 15,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
}
]
}
- simple analyzer: provides letter-based tokenization; it splits whenever it hits a non-letter character and lowercases all letters.
POST _analyze
{
"analyzer":"simple",
"text":"Geneva K. Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "risk",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 2
},
{
"token": "issues",
"start_offset": 15,
"end_offset": 21,
"type": "word",
"position": 3
}
]
}
- whitespace analyzer: provides whitespace-based tokenization; it splits whenever it hits whitespace.
POST _analyze
{
"analyzer":"whitespace",
"text":"Geneva K. Risk-Issues "
}
{
"tokens": [
{
"token": "Geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "K.",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "Risk-Issues",
"start_offset": 10,
"end_offset": 21,
"type": "word",
"position": 2
}
]
}
- stop analyzer: the same as the simple analyzer, but with support for removing stop words. It defaults to the _english_ stop words.
POST _analyze
{
"analyzer":"stop",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "risk",
"start_offset": 16,
"end_offset": 20,
"type": "word",
"position": 4
},
{
"token": "issues",
"start_offset": 21,
"end_offset": 27,
"type": "word",
"position": 5
}
]
}
- keyword analyzer: a no-op analyzer that returns the entire input string as a single token, i.e. no tokenization.
POST _analyze
{
"analyzer":"keyword",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "Geneva K.of Risk-Issues ",
"start_offset": 0,
"end_offset": 24,
"type": "word",
"position": 0
}
]
}
- pattern analyzer: tokenizes the text using a regular expression. The regular expression should match token separators, not the tokens themselves. It defaults to \W+ (all non-word characters).
POST _analyze
{
"analyzer":"pattern",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "k",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "of",
"start_offset": 9,
"end_offset": 11,
"type": "word",
"position": 2
},
{
"token": "risk",
"start_offset": 12,
"end_offset": 16,
"type": "word",
"position": 3
},
{
"token": "issues",
"start_offset": 17,
"end_offset": 23,
"type": "word",
"position": 4
}
]
}
- language analyzers: a set of analyzers aimed at specific languages. The following languages are supported: arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
POST _analyze
{
"analyzer":"english", ## french(法語)
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "k.of",
"start_offset": 7,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "risk",
"start_offset": 12,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "isue",
"start_offset": 17,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 3
}
]
}
- fingerprint analyzer: lowercases and sorts the input terms, removes duplicates, and concatenates them back into a single token.
POST _analyze
{
"analyzer":"fingerprint",
"text":"Geneva K.of Risk-Issues "
}
{
"tokens": [
{
"token": "geneva issues k.of risk",
"start_offset": 0,
"end_offset": 24,
"type": "fingerprint",
"position": 0
}
]
}
II. Testing custom analyzers
Using the _analyze API
The _analyze API lets you verify what an analyzer produces and explain the analysis process.
text: the text to analyze
explain: explain the analysis process
char_filter: character filters
tokenizer: tokenizer
filter: token filters
GET _analyze
{
"char_filter": ["html_strip"],
"tokenizer": "standard",
"filter": ["lowercase"],
"text": "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>",
"explain": true
}
{
"detail": {
"custom_analyzer": true,
"charfilters": [
{
"name": "html_strip",
"filtered_text": [
"""
No dreams, why bother Beijing !
"""
]
}
],
"tokenizer": {
"name": "standard",
"tokens": [
{
"token": "No",
"start_offset": 7,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0,
"bytes": "[4e 6f]",
"positionLength": 1
},
{
"token": "dreams",
"start_offset": 13,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 1,
"bytes": "[64 72 65 61 6d 73]",
"positionLength": 1
},
{
"token": "why",
"start_offset": 25,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 2,
"bytes": "[77 68 79]",
"positionLength": 1
},
{
"token": "bother",
"start_offset": 29,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3,
"bytes": "[62 6f 74 68 65 72]",
"positionLength": 1
},
{
"token": "Beijing",
"start_offset": 39,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 4,
"bytes": "[42 65 69 6a 69 6e 67]",
"positionLength": 1
}
]
},
"tokenfilters": [
{
"name": "lowercase",
"tokens": [
{
"token": "no",
"start_offset": 7,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0,
"bytes": "[6e 6f]",
"positionLength": 1
},
{
"token": "dreams",
"start_offset": 13,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 1,
"bytes": "[64 72 65 61 6d 73]",
"positionLength": 1
},
{
"token": "why",
"start_offset": 25,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 2,
"bytes": "[77 68 79]",
"positionLength": 1
},
{
"token": "bother",
"start_offset": 29,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3,
"bytes": "[62 6f 74 68 65 72]",
"positionLength": 1
},
{
"token": "beijing",
"start_offset": 39,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 4,
"bytes": "[62 65 69 6a 69 6e 67]",
"positionLength": 1
}
]
}
]
}
}
Normalizer
Fields of type keyword only support exact matching, and that matching is case-sensitive. Sometimes we want exact matching on a keyword field to be case-insensitive; the normalizer solves this problem.
A normalizer is structured like an analyzer minus the tokenizer:
Type type: custom
Character filter char_filter: zero or more, applied in order
Token filter filter: zero or more, applied in order
Here is an example borrowed from the official documentation:
PUT index
{
"settings": {
"analysis": {
"char_filter": {
"quote": {
"type": "mapping",
"mappings": [
"? => \"",
"? => \""
]
}
},
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": ["quote"],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"type": {
"properties": {
"foo": {
"type": "keyword", // normalizer只能用在keyword類型的字段
"normalizer": "my_normalizer"
}
}
}
}
}
PUT index/type/1
{
"foo": "Quick Frox"
}
GET index/type/_search
{
"query": {
"match": {
"foo": {
"query": "quick Frox" // case-insensitive: the document is found regardless of case
}
}
}
}