Note: the code below targets Elasticsearch 7.x. Older versions use slightly different syntax (a type must be specified), and some of the newer relevance-scoring features may not be available there.
1. Analyzers
1.1 Concepts
An analyzer consists of the following three stages (a small sketch combining them follows the list):
- Character filters (CharacterFilters): the string is first passed through each character filter in order. Their job is to tidy up the string before tokenization; a character filter can be used to strip HTML, or to turn & into and.
- Tokenizer: the string is split into individual terms by the tokenizer. Tokenization also records each token's order or position (used for proximity queries), the start and end character offsets (used for highlighting search snippets), and the token type.
- Token filters (TokenFilter): finally, the terms pass through each token filter in order. A filter may change terms (e.g. lowercasing Quick), remove terms (e.g. stop words such as a, and, the), or add terms (e.g. synonyms such as jump and leap).
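As a minimal sketch of how the three stages combine, the _analyze API accepts an ad-hoc char_filter, tokenizer and filter chain; the text and the result shown in the comment are illustrative:
GET _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<b>Quick Foxes</b> JUMP"
}
# Expected result, roughly: [ quick, foxes, jump ]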
1.2 Character filters (CharacterFilters)
Character filter | Description |
---|---|
HTML Strip Character Filter (strips HTML tags and decodes HTML entities) | The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp; . |
Mapping Character Filter (string replacement) | The mapping character filter replaces any occurrences of the specified strings with the specified replacements. |
Pattern Replace Character Filter (regex replacement) | The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement. |
① The html_strip character filter
Strips HTML tags from the text.
Parameter | Description |
---|---|
escaped_tags | Tags that should not be stripped; multiple tags are given as an array |
Default filter configuration
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"text": "<p>I'm so <b>happy</b>!</p>"
}
# 得到結(jié)果 [ \nI'm so happy!\n ]
定制過濾器配置
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_custom_html_strip_char_filter"
]
}
},
"char_filter": {
"my_custom_html_strip_char_filter": {
"type": "html_strip",
"escaped_tags": [
"b"
]
}
}
}
}
}
② The mapping character filter
Replaces text with configured substitutions.
Parameter | Description |
---|---|
mappings | key => value pairs defining the replacements; multiple mappings are given as an array |
mappings_path | Path to a UTF-8 encoded file containing the mappings, one mapping per line |
Default filter configuration
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"? => 0",
"? => 1",
"? => 2",
"? => 3",
"? => 4",
"? => 5",
"? => 6",
"? => 7",
"? => 8",
"? => 9"
]
}
],
"text": "My license plate is ?????"
}
# 得到結(jié)果 [ My license plate is 25015 ]
定制過濾器配置
PUT /my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_mappings_char_filter"
]
}
},
"char_filter": {
"my_mappings_char_filter": {
"type": "mapping",
"mappings": [
":) => _happy_",
":( => _sad_"
]
}
}
}
}
}
Test the filter
GET /my-index-000001/_analyze
{
"tokenizer": "keyword",
"char_filter": [ "my_mappings_char_filter" ],
"text": "I'm delighted about it :("
}
# 結(jié)果 [ I'm delighted about it _sad_ ]
③ The pattern_replace character filter
Matches the text against a regular expression and replaces the matched strings.
Parameter | Description |
---|---|
pattern | Java regular expression |
replacement | Replacement string; capture groups can be referenced with the $1..$9 syntax |
flags | Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS" . |
Custom filter configuration
PUT my-index-00001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
}
}
}
Test the filter
POST my-index-00001/_analyze
{
"analyzer": "my_analyzer",
"text": "My credit card is 123-456-789"
}
# [ My, credit, card, is, 123_456_789 ]
1.3 Tokenizers
1.3.1 Word Oriented Tokenizers
Tokenizer | Description |
---|---|
Standard Tokenizer | The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages. |
Letter Tokenizer | The letter tokenizer divides text into terms whenever it encounters a character which is not a letter. |
Lowercase Tokenizer | The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. |
Whitespace Tokenizer | The whitespace tokenizer divides text into terms whenever it encounters any whitespace character. |
UAX URL Email Tokenizer | The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens. |
Classic Tokenizer | The classic tokenizer is a grammar based tokenizer for the English Language. |
Thai Tokenizer | The thai tokenizer segments Thai text into words. |
① Standard
The standard tokenizer splits text according to the algorithm defined in Unicode Standard Annex #29; characters such as whitespace and '-' act as split points.
Parameter | Description | Default |
---|---|---|
max_token_length | If a token exceeds this length it is split at max_token_length intervals | 255 |
POST _analyze
{
"tokenizer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
Custom configuration
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
② Letter
The letter tokenizer splits text whenever it encounters a character that is not a letter, so a token can be an arbitrarily long run of consecutive letters. It works well for most European languages but is not suitable for many Asian languages.
No parameters.
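A quick illustration (the expected output in the comment follows the behaviour described above; digits and punctuation are dropped):
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]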
③ Lowercase
The lowercase tokenizer can be seen as the combination of the letter tokenizer and the lowercase token filter: it first splits like the letter tokenizer and then lowercases all resulting terms.
No parameters.
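The same sentence, run through the lowercase tokenizer (expected output shown for comparison with the letter tokenizer):
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]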
④ Whitespace
The whitespace tokenizer splits text on whitespace characters (an example follows the parameter table).
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
⑤ UAX URL Email
The uax_url_email tokenizer is similar to the standard tokenizer, except that it keeps URLs and email addresses as single tokens, whereas the standard tokenizer would split them (an example follows the parameter table).
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
⑥ Classic
The classic tokenizer works well for documents written in English. It handles English acronyms, company names, email addresses and most internet host names reasonably well, but it performs poorly for languages other than English.
- It splits on most punctuation and removes it, but a dot that is not followed by whitespace is not treated as a split point.
- It splits on hyphens, unless the token contains a number, in which case the whole token is interpreted as a product number and is not split.
- It keeps email addresses and internet host names as single tokens.
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
POST _analyze
{
"tokenizer": "classic",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
⑦ Thai
Segments Thai text into words.
No parameters.
1.3.2 Partial Word Tokenizers
Tokenizer | Description |
---|---|
N-Gram Tokenizer | The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck] . |
Edge N-Gram Tokenizer | The edge_ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick] . |
① Ngram
A tokenizer of type ngram.
The ngram tokenizer accepts the following settings:
Parameter | Description | Default |
---|---|---|
min_gram | Minimum length of a gram | 1 |
max_gram | Maximum length of a gram | 2 |
token_chars | Character classes to keep in tokens (e.g. digits or letters); Elasticsearch splits on characters that do not belong to the configured classes | [] (keep all characters) |
token_chars accepts the following character classes:
token_chars | Example |
---|---|
letter | a, b, ï or 京 |
digit | 3 or 7 |
whitespace | " " or "\n" |
punctuation | ! or " |
symbol | $ or √ |
custom | custom characters which need to be set using the custom_token_chars setting |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "2 Quick Foxes."
}
# [ Qui, uic, ick, Fox, oxe, xes ]
② Edge NGram
This tokenizer is very similar to ngram, but it only keeps the n-grams anchored to the start of each word.
The edge_ngram tokenizer accepts the following settings:
Parameter | Description | Default |
---|---|---|
min_gram | Minimum length of a gram | 1 |
max_gram | Maximum length of a gram | 2 |
token_chars | Character classes to keep in tokens (e.g. digits or letters); the text is split on characters outside the configured classes | [] (keep all characters) |
token_chars accepts the following character classes:
token_chars | Example |
---|---|
letter | a, b, ï or 京 |
digit | 3 or 7 |
whitespace | " " or "\n" |
punctuation | ! or " |
symbol | $ or √ |
custom | custom characters which need to be set using the custom_token_chars setting |
POST _analyze
{
"text": "Hélène Ségara it's !<>#",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "[^\\s\\p{L}\\p{N}]",
"replacement": ""
}
],
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
{
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "12"
}
]
}
# [h] [he] [hel] [hele] [helen] [helene] [s] [se] [seg] [sega] [segar] [segara] [i] [it] [its]
1.3.3 Structured Text Tokenizers
Tokenizer | Description |
---|---|
Keyword Tokenizer | The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms. |
Pattern Tokenizer | The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms. |
Simple Pattern Tokenizer | The simple_pattern tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer. |
Char Group Tokenizer | The char_group tokenizer is configurable through sets of characters to split on, which is usually less expensive than running regular expressions. |
Simple Pattern Split Tokenizer | The simple_pattern_split tokenizer uses the same restricted regular expression subset as the simple_pattern tokenizer, but splits the input at matches rather than returning the matches as terms. |
Path Tokenizer | The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz ] . |
① Simple Pattern
The simple_pattern tokenizer uses a regular expression to capture matching text as tokens; everything else is discarded.
Parameter | Description | Default |
---|---|---|
pattern | Lucene regular expression | empty string |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern",
"pattern": "[0123456789]{3}"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "fd-786-335-514-x"
}
# [ 786, 335, 514 ]
② Simple Pattern Split
The simple_pattern_split tokenizer uses a Lucene regular expression to match text and splits the input at the matches. Its Lucene regex syntax is less powerful than the Java regex syntax used by the pattern tokenizer, but it is more efficient.
Parameter | Description | Default |
---|---|---|
pattern | Lucene regular expression | empty string |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern_split",
"pattern": "_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "an_underscored_phrase"
}
# [ an, underscored, phrase ]
③ Pattern
The pattern tokenizer uses a Java regular expression and splits the input at the matches.
Parameter | Description | Default |
---|---|---|
pattern | Java regular expression | \W+ |
flags | Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS" | |
group | Which capture group to extract as tokens | -1 |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": ","
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "comma,separated,values"
}
# [ comma, separated, values ]
④ Keyword
The keyword tokenizer does nothing to the text: the whole input is emitted as a single token (an example follows the table).
Setting | Description | Default |
---|---|---|
buffer_size | Size of the term buffer; changing it is not recommended | 256 |
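A quick illustration of the pass-through behaviour:
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
# Expected: [ New York ]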
⑤ Path Hierarchy
The path_hierarchy tokenizer splits a hierarchical value, such as a filesystem path, level by level. For example:
/something/something/else
produces the tokens
/something, /something/something, /something/something/else
Parameter | Description | Default |
---|---|---|
delimiter | The path separator | / |
replacement | An optional replacement character for the delimiter | same as delimiter |
buffer_size | Buffer size | 1024 |
reverse | Whether to emit the tokens in reverse order | false |
skip | The number of initial tokens to skip | 0 |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-",
"replacement": "/",
"skip": 2
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "one-two-three-four-five"
}
# [ /three, /three/four, /three/four/five ]
# If reverse is set to true, the tokens are:
# [ one/two/three/, two/three/, three/ ]
⑥ Char group
The char_group tokenizer splits text on a configurable set of characters, which is usually cheaper than running regular expressions.
Parameter | Description | Default |
---|---|---|
tokenize_on_chars | Characters to split on, e.g. - , or character groups: whitespace , letter , digit , punctuation , symbol | |
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
"-",
"\n"
]
},
"text": "The QUICK brown-fox"
}
# [ The, QUICK, brown, fox ]
1.4 Token filters (TokenFilter)
Token filters are applied in order, so pay attention to the order of the array.
There are too many to cover here; see the official documentation for the full list. The common ones are described below:
Token filter | Description |
---|---|
Apostrophe | |
ASCII folding | Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (first 127 ASCII characters) to their ASCII equivalent, if one exists. For example, the filter changes à to a. |
CJK bigram | |
CJK width | |
Classic | |
Common grams | |
Conditional | |
Decimal digit | |
Delimited payload | |
Dictionary decompounder | |
Edge n-gram | Forms an n-gram of a specified length from the beginning of a token. |
Elision | |
Fingerprint | Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token. |
Flatten graph | |
Hunspell | |
Hyphenation decompounder | |
Keep types | |
Keep words | |
Keyword marker | |
Keyword repeat | Outputs a keyword version of each token in a stream. These keyword tokens are not stemmed. |
KStem | |
Length | |
Limit token count | |
Lowercase | Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. |
MinHash | |
Multiplexer | |
N-gram | Forms n-grams of specified lengths from a token. |
Normalization | |
Pattern capture | |
Pattern replace | |
Phonetic | |
Porter stem | |
Predicate script | |
Remove duplicates | Removes duplicate tokens in the same position. |
Reverse | Reverses each token in a stream. For example, you can use the reverse filter to change cat to tac. |
Shingle | |
Snowball | |
Stemmer | Provides algorithmic stemming for several languages, some with additional variants. For a list of supported languages, see the language parameter. |
Stemmer override | |
Stop | Removes stop words from a token stream. |
Synonym | The synonym token filter allows to easily handle synonyms during the analysis process. Synonyms are configured using a configuration file. |
Synonym graph | |
Trim | Removes leading and trailing whitespace from each token in a stream. While this can change the length of a token, the trim filter does not change a token’s offsets. |
Truncate | |
Unique | Removes duplicate tokens from a stream. For example, you can use the unique filter to change the lazy lazy dog to the lazy dog. |
Uppercase | Changes token text to uppercase. For example, you can use the uppercase filter to change the Lazy DoG to THE LAZY DOG. |
Word delimiter | |
Word delimiter graph |
① Edge n-gram
The edge_ngram filter behaves like ngram, except that it only produces grams anchored to the start of the token: fox becomes [ f, fo ].
Parameter | Description | Default |
---|---|---|
min_gram | Minimum gram length | 1 |
max_gram | Maximum gram length | 2 |
preserve_original | Emits original token when set to true | false |
side | Deprecated. Whether grams are taken from the front or the back of the token | front |
GET _analyze
{
"tokenizer": "standard",
"filter": [
{ "type": "edge_ngram",
"min_gram": 1,
"max_gram": 2
}
],
"text": "the quick brown fox jumps"
}
# [ t, th, q, qu, b, br, f, fo, j, ju ]
② N-gram
The ngram filter splits fox into [ f, fo, o, ox, x ].
Parameter | Description | Default |
---|---|---|
min_gram | Minimum gram length | 1 |
max_gram | Maximum gram length | 2 |
preserve_original | Emits original token when set to true | false |
For example:
GET _analyze
{
"tokenizer": "standard",
"filter": [ "ngram" ],
"text": "Quick fox"
}
# [ Q, Qu, u, ui, i, ic, c, ck, k, f, fo, o, ox, x ]
③ Stop
The stop token filter removes stop words from the token stream.
Parameter | Description | Default |
---|---|---|
stopwords | A pre-defined stop word set or an array of stop words | _english_ |
stopwords_path | Path to a stop words file | |
ignore_case | If true, stop word matching ignores case | false |
remove_trailing | Whether to remove the last token of a stream if it is a stop word | true |
The default _english_ set removes: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "stop" ],
"text": "a quick fox jumps over the lazy dog"
}
# [ quick, fox, jumps, over, lazy, dog ]
④ Stemmer
Provides algorithmic stemming for several languages.
It reduces each token to its stem and replaces the token with it.
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "stemmer" ],
"text": "the foxes jumping quickly"
}
# [ the, fox, jump, quickli ]
⑤ Keyword repeat
Emits a copy of each token. It is usually combined with stemmer and remove_duplicates: stemmer does not keep the original token, so adding keyword_repeat before it yields both the stem and the original word. When a token's stem is identical to the original token, the same token appears twice at the same position, which can then be deduplicated with remove_duplicates.
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer"
],
"text": "fox running"
}
{
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
⑥ Remove duplicates
Removes identical tokens at the same position (start_offset). Usually combined with keyword_repeat and stemmer.
In the following example the same token appears twice at the same position:
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer"
],
"text": "jumping dog"
}
{
"tokens": [
{
"token": "jumping",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
With remove_duplicates added:
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer",
"remove_duplicates"
],
"text": "jumping dog"
}
{
"tokens": [
{
"token": "jumping",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
⑦ Uppercase
Converts all tokens to upper case, e.g. The Lazy DoG becomes THE LAZY DOG.
⑧ Lowercase
Converts all tokens to lower case, e.g. The Lazy DoG becomes the lazy dog.
⑨ ASCII folding
The asciifolding filter converts alphabetic, numeric and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if such equivalents exist; for example, à becomes a.
GET /_analyze
{
"tokenizer" : "standard",
"filter" : ["asciifolding"],
"text" : "a?aí à la carte"
}
# [ acai, a, la, carte ]
⑩ Fingerprint
Removes duplicate tokens, sorts the remaining tokens, and concatenates them into a single output token.
For example, filtering [ the, fox, was, very, very, quick ] with fingerprint works as follows:
1) sort the tokens: [ fox, quick, the, very, very, was ]
2) remove duplicates
3) emit a single token: [ fox quick the very was ]
GET _analyze
{
"tokenizer" : "whitespace",
"filter" : ["fingerprint"],
"text" : "zebra jumps over resting resting dog"
}
# [ dog jumps over resting zebra ]
⑪ Trim
Removes leading and trailing whitespace from each token without changing the token's offsets.
The standard and whitespace tokenizers already produce tokens without surrounding whitespace, so the trim filter is usually unnecessary when they are used.
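A quick check with the keyword tokenizer, which does keep the surrounding whitespace:
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [ "trim" ],
  "text": " fox "
}
# Expected: [ fox ]  (the offsets still cover the original " fox ")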
⑫ Unique
Removes duplicate tokens from the stream, e.g. the lazy lazy dog becomes the lazy dog.
Unlike remove_duplicates, unique only requires the tokens to be identical, regardless of position.
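The example from the description, run through _analyze:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "unique" ],
  "text": "the lazy lazy dog"
}
# Expected: [ the, lazy, dog ]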
⑬ Synonym
The synonym token filter handles synonyms during analysis. Synonyms are configured inline or via a configuration file.
Parameter | Description | Default |
---|---|---|
expand | If the mapping is "bar, foo, baz" and expand is false, no mapping is added, because with expand=false the target mapping is the first word. With expand=true the mappings added are equivalent to foo, baz => foo, baz, i.e. all mappings other than the stop word | true |
lenient | If true, exceptions while parsing the synonym configuration are ignored. Note that only synonym rules which cannot be parsed are ignored | false |
synonyms | Inline synonyms, e.g. [ "foo, bar => baz" ] | |
synonyms_path | Path to a synonyms file | |
tokenizer | The tokenizer used to tokenize the synonyms; kept for backwards compatibility with indices created before 6.0 | |
ignore_case | Works together with the tokenizer parameter only | |
Note: if the target of a synonym rule (the word after =>) is a stop word, the whole rule is dropped. If a source word (before =>) is a stop word, that word is dropped.
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": [ "my_stop", "synonym" ]
}
},
"filter": {
"my_stop": {
"type": "stop",
"stopwords": [ "bar" ]
},
"synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [ "foo, bar => baz" ]
}
}
}
}
}
}
Using a synonyms file
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [ "synonym" ]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
}
}
}
}
}
The file format is as follows:
# Blank lines and lines starting with pound are comments.
# Explicit mappings match any token sequence on the LHS of "=>"
# and replace with all alternatives on the RHS. These types of mappings
# ignore the expand parameter in the schema.
# Examples:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit
# Equivalent synonyms may be separated with commas and give
# no explicit mapping. In this case the mapping behavior will
# be taken from the expand parameter in the schema. This allows
# the same synonym file to be used in different synonym handling strategies.
# Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
lol, laughing out loud
# If expand==true, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod
# Multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz
注:經(jīng)驗所得榴啸,帶有 synonym 的 analyzer 適用于 search 而不適用于存儲 index。
- synonym 增加了field 的 term 數(shù)量(導致評分參數(shù) avgdl 變大)晚岭, 還有重要的是 如果使用 match query 的話鸥印,會導致 匹配的 termFreq 增加到 synonym 的數(shù)量,影響評分坦报。
- 如果 同義詞變化的話库说,需要同步更新所有的關系到同義詞的文檔。
- 對于匹配原詞 和 他的同義詞片择,往往原詞的 評分應該更高潜的。但是 ES 中卻一視同仁。沒有區(qū)別字管。雖然可以通過定義不同的 field 啰挪,一個 field 使用 完全切分,一個field 使用同義詞嘲叔,并且在search時亡呵,給 全完且分詞field 一個較高的權重。但是又帶來了怎加了term 存儲的容量擴大問題硫戈。
⑭ Reverse
Reverses each token, e.g. cat becomes tac.
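A quick check (expected output shown in the comment):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "reverse" ],
  "text": "quick fox jumps"
}
# Expected: [ kciuq, xof, spmuj ]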
1.5 Analyzers
The following analyzers ship with Elasticsearch; most of them can be reproduced by combining the char filters, tokenizers and token filters introduced above.
Analyzer | Description |
---|---|
Standard Analyzer | The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words. |
Simple Analyzer | The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms. |
Whitespace Analyzer | The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms. |
Stop Analyzer | The stop analyzer is like the simple analyzer, but also supports removal of stop words. |
Keyword Analyzer | The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term. |
Pattern Analyzer | The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words. |
Language Analyzer | Elasticsearch provides many language-specific analyzers like english or french . |
Fingerprint Analyzer | The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection. |
① Standard Analyzer
The standard analyzer is Elasticsearch's default analyzer:
- no char filter
- the standard tokenizer
- the lowercase filter and the stop token filter; by default stop words is _none_, i.e. no stop words are removed.
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum length of a single token; longer tokens are split at this interval | 255 |
stopwords | A pre-defined stop word set or an array of stop words | _none_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
② Simple Analyzer
The simple analyzer (a quick example follows the list):
- no char filter
- the lowercase tokenizer
- no token filter
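Given those components, the expected behaviour is:
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]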
③ Whitespace Analyzer
The whitespace analyzer splits text on whitespace (a quick example follows the list):
- no char filter
- the whitespace tokenizer
- no token filter
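For comparison with the simple analyzer:
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]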
④ Stop Analyzer
Like the simple analyzer, but with stop word removal; the _english_ stop word set is used by default.
- no char filter
- the lowercase tokenizer
- the stop token filter, defaulting to _english_
Parameter | Description | Default |
---|---|---|
stopwords | A pre-defined stop word set or an array of stop words | _english_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["the", "over"]
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_stop_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ quick, brown, foxes, jumped, lazy, dog, s, bone ]
⑤ Keyword Analyzer
The keyword analyzer (a quick example follows the list):
- no char filter
- the keyword tokenizer
- no token filter
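The whole input is emitted as a single term:
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]  (one token)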
⑥ Language Analyzer
Language analyzers are available for: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
In practice we mostly use the english analyzer:
- no char filter
- the standard tokenizer
- the stemmer filter (among others)
Parameter | Description | Default |
---|---|---|
stopwords | A pre-defined stop word set or an array of stop words | _english_ |
stopwords_path | Path to a stop words file | |
Direct use
GET _analyze
{
"analyzer": "english",
"text": "Running Apps in a Phone"
}
# [run] [app] [phone]
創(chuàng)建一個自定義分析器實現(xiàn)english分析器的功能
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"rebuilt_english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
⑦ Pattern Analyzer
- no char filter
- the pattern tokenizer
- token filters: lowercase and stop (disabled by default)
Parameter | Description | Default |
---|---|---|
pattern | Java regular expression | \W+ |
flags | Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS" | |
lowercase | Whether terms should be lowercased | true |
stopwords | A pre-defined stop word set or an array of stop words | _none_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "pattern",
"pattern": "\\W|_",
"lowercase": true
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_email_analyzer",
"text": "John_Smith@foo-bar.com"
}
# [ john, smith, foo, bar, com ]
⑧ Fingerprint Analyzer
The fingerprint analyzer implements a fingerprinting algorithm: the text is lowercased, normalized to remove extended characters, sorted, deduplicated, and concatenated into a single token.
- no char filter
- the standard tokenizer
- token filters: lowercase, asciifolding, stop (disabled by default) and fingerprint
Parameter | Description | Default |
---|---|---|
separator | The character used to concatenate the terms | a space |
max_output_size | Maximum output token size; larger outputs are discarded | 255 |
stopwords | A pre-defined stop word set or an array of stop words | _none_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "fingerprint",
"text": "Yes yes, G?del said this sentence is consistent and."
}
# [ and consistent godel is said sentence this yes ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_fingerprint_analyzer": {
"type": "fingerprint",
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_fingerprint_analyzer",
"text": "Yes yes, G?del said this sentence is consistent and."
}
# [ consistent godel said sentence yes ]
1.6 Custom analyzer examples
Example 1
PUT chenjie.asia:9200/analyzetest
{
"settings":{
"analysis":{
"analyzer":{
"my":{ //分析器
"tokenizer":"punctuation", //指定所用的分詞器
"type":"custom", //自定義類型的分析器
"char_filter":["emoticons"], //指定所用的字符過濾器
"filter":["lowercase","english_stop"]
}
},
"char_filter":{ //字符過濾器
"emoticons":{ //字符過濾器的名字
"type":"mapping", //匹配模式
"mappings":[
":)=>_happy_", //如果匹配上:)棒口,那么替換為_happy_
":(=>_sad_" //如果匹配上:(,那么替換為_sad_
]
}
},
"tokenizer":{ //分詞器
"punctuation":{ //分詞器的名字
"type":"pattern", //正則匹配分詞器
"pattern":"[.,!?]" //通過正則匹配方式匹配需要作為分隔符的字符辜膝,此處為 . , ! ? 无牵,作為分隔符進行分詞
}
},
"filter":{ //后過濾器
"english_stop":{ //后過濾器的名字
"type":"stop", //停用詞
"stopwords":"_english_" //指定停用詞,過濾掉停用詞
}
}
}
}
}
# GET chenjie.asia:9200/analyzetest/_analyze
{
"analyzer": "my",
"text": "I am a :) person,and you"
}
The custom analyzer above analyzes the text "I am a :) person,and you" into two tokens, "i am a _happy_ person" and "and you":
Step 1: the character filter replaces :) with _happy_
Step 2: the tokenizer splits at the comma matched by the regular expression
Step 3: stop word tokens are removed
Example 2
{
"settings": {
"analysis": {
"analyzer": {
"my_content_analyzer": {
"type": "custom",
"char_filter": [
"xschool_filter"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"my_stop"
]
}
},
"char_filter": {
"xschool_filter": {
"type": "mapping",
"mappings": [
"X-School => XSchool"
]
}
},
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["so", "to", "the"]
}
}
}
},
"mappings": {
"type":{
"properties": {
"content": {
"type": "text",
"analyzer": "my_content_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
A dedicated analyzer for the query string can be specified at search time via search_analyzer.
2. The ik and pinyin plugins
2.1 Installing the ik and pinyin plugins
- Download the plugin versions matching your ES version; 7.6.2 is used here
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.2/elasticsearch-analysis-ik-7.6.2.zip # ik plugin
https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.6.2/elasticsearch-analysis-pinyin-7.6.2.zip # pinyin plugin
- Install the plugins
elasticsearch/bin/elasticsearch-plugin install elasticsearch-analysis-ik-7.6.2.zip
elasticsearch/bin/elasticsearch-plugin install elasticsearch-analysis-pinyin-7.6.2.zip
Extending the ik dictionary
Add custom words to the ik dictionary so that the tokenizer can produce them as tokens.
Go to the ik plugin's config directory under plugins, create a file myword.dic and add your custom words to it, then register the file as an extension dictionary in IKAnalyzer.cfg.xml: <entry key="ext_dict">myword.dic</entry>. Restart Elasticsearch afterwards.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer 擴展配置</comment>
<!--用戶可以在這里配置自己的擴展字典 -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<!--用戶可以在這里配置自己的擴展停止詞字典-->
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!--用戶可以在這里配置遠程擴展字典 -->
<entry key="remote_ext_dict">location</entry>
<!--用戶可以在這里配置遠程擴展停止詞字典-->
<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
Hot-reloading the IK dictionary
The plugin supports hot-reloading the IK dictionary through the remote dictionary entries mentioned above:
<!-- configure remote extension dictionaries here -->
<entry key="remote_ext_dict">location</entry>
<!-- configure remote extension stop word dictionaries here -->
<entry key="remote_ext_stopwords">location</entry>
Here location is a URL such as http://yoursite.com/getCustomDict. The request only has to satisfy two requirements for hot updates to work:
- The HTTP response must include two headers, Last-Modified and ETag, both strings; whenever either of them changes, the plugin fetches the new word list and updates the dictionary.
- The response body must contain one word per line, with \n as the line separator.
If these two requirements are met, the dictionary is hot-updated without restarting the ES instance.
You can put the words to be updated in a UTF-8 encoded .txt file served by nginx or any simple HTTP server; when the .txt file changes, the server automatically returns the corresponding Last-Modified and ETag on client requests. A small tool can also be built to extract terms from your business system and update this .txt file.
2.2 The ik analyzer
2.2.1 The ik analyzer
Elasticsearch's built-in analyzers are not friendly to Chinese: they split the text character by character and cannot form words, which is why the ik plugin is used.
The ik plugin provides
- Analyzers: ik_smart, ik_max_word
- Tokenizers: ik_smart, ik_max_word
What is the difference between ik_max_word and ik_smart? (A comparison with _analyze follows the list.)
- ik_max_word: the finest-grained segmentation, e.g. 中華人民共和國國歌 is split into 中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌, exhausting all possible combinations; suitable for term queries.
- ik_smart: the coarsest-grained segmentation, e.g. 中華人民共和國國歌 is split into 中華人民共和國, 國歌; suitable for phrase queries.
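Assuming the ik plugin from section 2.1 is installed, the difference can be checked with _analyze; the expected tokens in the comments follow the description above (the exact output depends on the plugin's dictionary):
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國國歌"
}
# Expected: [ 中華人民共和國, 國歌 ]
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國國歌"
}
# Expected: all the combinations listed above, from 中華人民共和國 down to 國歌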
2.2.2 Synonyms (TODO)
PUT http://chenjie.asia:9200/ik_synonym
{
"setting": {
"analysis": {
"analyzer": {
"ik_synonym_analyzer": {
"tokenizer": "",
"filter": ""
}
},
"filter": {
"ik_synonym_filter": {
"type": "synonym",
"synonyms_path": "/"
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "ik_smart"
},
"author": {
"type": "keyword"
}
}
}
}
2.2.3 Stop words
2.3 The pinyin analyzer
2.3.1 The pinyin analyzer
The pinyin plugin converts between Chinese characters and pinyin and integrates NLP tools.
The plugin provides:
- Analyzer: pinyin
- Tokenizer: pinyin
- Token filter: pinyin
參數(shù) | 說明 | 示例 | 默認值 |
---|---|---|---|
keep_first_letter |
保留拼音首字母組合 |
劉德華 >ldh
|
true |
keep_separate_first_letter |
保留拼音首字母 |
劉德華 >l ,d ,h
|
false |
limit_first_letter_length |
設置最長拼音首字母組合 | 16 | |
keep_full_pinyin |
保留全拼 |
劉德華 > [liu ,de ,hua ] |
true |
keep_joined_full_pinyin |
保留全拼組合 |
劉德華 > [liudehua ] |
false |
keep_none_chinese |
過濾掉中文和數(shù)字 | true | |
keep_none_chinese_together |
不切分非中文字母 |
當設為true時:DJ音樂家 -> DJ ,yin ,yue ,jia <br />當設為false時:DJ音樂家 -> D ,J ,yin ,yue ,jia 灸蟆;NOTE:keep_none_chinese需要為true |
true |
keep_none_chinese_in_first_letter |
中文轉(zhuǎn)為拼音首字母驯耻,并將非中文字母與拼音合并 |
劉德華AT2016 ->ldhat2016
|
true |
keep_none_chinese_in_joined_full_pinyin |
中文轉(zhuǎn)為全拼,并將非中文字母與拼音合并 |
劉德華2016 ->liudehua2016
|
false |
none_chinese_pinyin_tokenize |
如果非中文字母是拼音炒考,則將它們分成單獨的拼音詞可缚。keep_none_chinese 和keep_none_chinese_together 需要為true。 |
liudehuaalibaba13zhuanghan -> liu ,de ,hua ,a ,li ,ba ,ba ,13 ,zhuang ,han
|
true |
keep_original |
是否保留原輸入 | false | |
lowercase |
是否小寫非中文字母 | true | |
trim_whitespace |
首位去空格 | true | |
remove_duplicated_term |
會移除重復的短語斋枢,可能會影響位置相關的查詢結(jié)果城看。 |
de的 >de
|
false |
ignore_pinyin_offset |
after 6.0, offset is strictly constrained, overlapped tokens are not allowed, with this parameter, overlapped token will allowed by ignore offset, please note, all position related query or highlight will become incorrect, you should use multi fields and specify different settings for different query purpose. if you need offset, please set it to false. | true |
2.3.2 Usage examples
Using the pinyin tokenizer
Create an index named diypytest that uses a pinyin tokenizer.
# Create the index
PUT http://chenjie.asia:9200/diypytest
{
"settings": {
"analysis": {
"analyzer": {
"pinyin_analyzer": {
"tokenizer": "my_pinyin"
}
},
"tokenizer": {
"my_pinyin": {
"type": "pinyin",
"keep_separate_first_letter": false,
"keep_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"lowercase": true,
"remove_duplicated_term": true
}
}
}
},
"mappings": {
"properties": {
"menu": {
"type": "keyword",
"fields": {
"pinyin": {
"type": "text",
"store": false,
"term_vector": "with_offsets",
"analyzer": "pinyin_analyzer",
"boost": 10
}
}
}
}
}
}
# Test the analyzer
GET http://chenjie.asia:9200/diypytest/_analyze
{
"analyzer": "pinyin_analyzer",
"text": "西紅柿雞蛋"
}
# 插入數(shù)據(jù)
PUT http://chenjie.asia:9200/diypytest/_doc/1
{
"menu":"西紅柿雞蛋"
}
PUT http://chenjie.asia:9200/diypytest/_doc/2
{
"menu":"韭菜雞蛋"
}
# 查詢數(shù)據(jù)
GET http://chenjie.asia:9200/diypytest/_search?q=menu:xhsjd // 查詢?yōu)榭?GET http://chenjie.asia:9200/diypytest/_search?q=menu.pinyin:xhsjd // 查詢得到結(jié)果
Using the pinyin token filter
Create an index named diypytest2 that uses a pinyin token filter.
# Create the index
PUT http://chenjie.asia:9200/diypytest2
{
"settings" : {
"analysis" : {
"analyzer" : {
"menu_analyzer" : {
"tokenizer" : "whitespace",
"filter" : "pinyin_first_letter_and_full_pinyin_filter"
}
},
"filter" : {
"pinyin_first_letter_and_full_pinyin_filter" : {
"type" : "pinyin",
"keep_first_letter" : true,
"keep_full_pinyin" : false,
"keep_none_chinese" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true,
"trim_whitespace" : true,
"keep_none_chinese_in_first_letter" : true
}
}
}
}
}
# Test the analyzer
GET http://chenjie.asia:9200/diypytest2/_analyze
{
"analyzer":"menu_analyzer",
"text":"西紅柿雞蛋 韭菜雞蛋 糖醋里脊"
}
# 結(jié)果如下
{
"tokens": [
{
"token": "xhsjd",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "jcjd",
"start_offset": 6,
"end_offset": 10,
"type": "word",
"position": 1
},
{
"token": "tclj",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 2
}
]
}
Adding a custom analyzer to an index
PUT my_analyzer
{
"settings":{
"analysis":{
"analyzer":{
"my":{ //分析器
"tokenizer":"punctuation", //指定所用的分詞器
"type":"custom", //自定義類型的分析器
"char_filter":["emoticons"], //指定所用的字符過濾器
"filter":["lowercase","english_stop"]
},
"char_filter":{ //字符過濾器
"emoticons":{ //字符過濾器的名字
"type":"mapping", //匹配模式
"mapping":[
":)=>_happy_", //如果匹配上:)缘滥,那么替換為_happy_
":(=>_sad_" //如果匹配上:(轰胁,那么替換為_sad_
]
}
},
"tokenizer":{ //分詞器
"punctuation":{ //分詞器的名字
"type":"pattern", //正則匹配分詞器
"pattern":"[.,!?]" //通過正則匹配方式匹配需要作為分隔符的字符,此處為 . , ! ? 朝扼,作為分隔符進行分詞
}
},
"filter":{ //后過濾器
"english_stop":{ //后過濾器的名字
"type":"stop", //停用詞
"stopwords":"_english_" //指定停用詞赃阀,不影響分詞,但不允許查詢
}
}
}
}
}
}
For example, the custom analyzer above analyzes the text "I am a :) person,and you" into two tokens, "i am a _happy_ person" and "and you":
Step 1: the character filter replaces :) with _happy_
Step 2: the tokenizer splits at the comma matched by the regular expression
Step 3: stop word tokens are removed (they cannot be queried)
3. Search
3.1 Field configuration
When defining a field, the following properties can be configured.
"field": {
"type": "text", //文本類型 ,指定類型
"index": "analyzed", //該屬性共有三個有效值:analyzed王凑、no和not_analyzed搪柑,默認是analyzed;analyzed:表示該字段被分析索烹,編入索引工碾,產(chǎn)生的token能被搜索到;not_analyzed:表示該字段不會被分析百姓,使用原始值編入索引锐极,在索引中作為單個詞叶圃;no:不編入索引正驻,無法搜索該字段宣虾;
"analyzer":"ik"http://指定分詞器
"boost":1.23//字段級別的分數(shù)加權
"doc_values":false//對not_analyzed字段蒲拉,默認都是開啟达布,analyzed字段不能使用睁冬,對排序和聚合能提升較大性能募胃,節(jié)約內(nèi)存,如果您確定不需要對字段進行排序或聚合矗晃,或者從script訪問字段值仑嗅,則可以禁用doc值以節(jié)省磁盤空間:
"fielddata":{"loading" : "eager" }//Elasticsearch 加載內(nèi)存 fielddata 的默認行為是 延遲 加載 。 當 Elasticsearch 第一次查詢某個字段時张症,它將會完整加載這個字段所有 Segment 中的倒排索引到內(nèi)存中仓技,以便于以后的查詢能夠獲取更好的性能。
"fields":{"keyword": {"type": "keyword","ignore_above": 256}} //可以對一個字段提供多種索引模式俗他,同一個字段的值脖捻,一個分詞,一個不分詞
"ignore_above":100 //超過100個字符的文本兆衅,將會被忽略地沮,不被索引
"include_in_all":ture//設置是否此字段包含在_all字段中,默認是true羡亩,除非index設置成no選項
"index_options":"docs"http://4個可選參數(shù)docs(索引文檔號) ,freqs(文檔號+詞頻)摩疑,positions(文檔號+詞頻+位置,通常用來距離查詢)畏铆,offsets(文檔號+詞頻+位置+偏移量雷袋,通常被使用在高亮字段)分詞字段默認是position,其他的默認是docs
"norms":{"enable":true,"loading":"lazy"}//分詞字段默認配置辞居,不分詞字段:默認{"enable":false}楷怒,存儲長度因子和索引時boost蛋勺,建議對需要參與評分字段使用 ,會額外增加內(nèi)存消耗量
"null_value":"NULL"http://設置一些缺失字段的初始化值鸠删,只有string可以使用抱完,分詞字段的null值也會被分詞
"position_increament_gap":0//影響距離查詢或近似查詢,可以設置在多值字段的數(shù)據(jù)上火分詞字段上冶共,查詢時可指定slop間隔乾蛤,默認值是100
"store":false//是否單獨設置此字段的是否存儲而從_source字段中分離,默認是false捅僵,只能搜索家卖,不能獲取值
"search_analyzer":"ik"http://設置搜索時的分詞器,默認跟ananlyzer是一致的庙楚,比如index時用standard+ngram上荡,搜索時用standard用來完成自動提示功能
"similarity":"BM25"http://默認是TF/IDF算法,指定一個字段評分策略馒闷,僅僅對字符串型和分詞類型有效
"term_vector":"no"http://默認不存儲向量信息酪捡,支持參數(shù)yes(term存儲),with_positions(term+位置),with_offsets(term+偏移量)纳账,with_positions_offsets(term+位置+偏移量) 對快速高亮fast vector highlighter能提升性能逛薇,但開啟又會加大索引體積,不適合大數(shù)據(jù)量用
}
3.2 Search
3.2.1 Analyzing the search terms
Whenever a document is indexed into Elasticsearch, it goes through a process called index: its strings are analyzed into individual tokens, which are stored in the index. Likewise, whenever we search for a string, the query string is also analyzed into tokens, but those tokens are not stored.
① When you query a full-text field (match), the same analyzer (or the analyzer specified as search_analyzer) is applied to the query string, producing the correct list of search terms.
② When you query an exact-value field (term), the query string is not analyzed; the exact value you specify is searched.
Example:
PUT http://chenjie.asia:9200/test1
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "english"
}
}
}
}
GET http://chenjie.asia:9200/test1/_search
{
"query": {
"match": {
"content": "Happy a birthday"
}
}
}
For this search, by default "Happy a birthday" is analyzed with the same standard analyzer. If search_analyzer is set to the english analyzer, the letter "a" is filtered out, so only "happy" and "birthday" enter the search.
3.2.2 Single-field search
As shown above, use a match query for retrieval; a minimal sketch contrasting match and term follows.
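A minimal sketch of the two cases from 3.2.1; the index name my-index-000001 and the content/author fields are hypothetical placeholders:
GET my-index-000001/_search
{
  "query": {
    "match": { "content": "Happy a birthday" }
  }
}
# The query string is analyzed; documents containing "happy" or "birthday" match.
GET my-index-000001/_search
{
  "query": {
    "term": { "author": "cj" }
  }
}
# The query string is not analyzed; only the exact value "cj" in the keyword field matches.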
3.2.3 Multi-field search
Searching several fields
PUT http://chenjie.asia:9200/test2
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "ik_smart"
},
"author": {
"type": "keyword"
}
}
}
}
# 插入數(shù)據(jù)
POST http://chenjie.asia:9200/test2/_doc/1
{
"content": "I am good!",
"author": "cj"
}
POST http://chenjie.asia:9200/test2/_doc/2
{
"content": "CJ is good!",
"author": "zs"
}
# Search: both documents match, because each has a matching token in one of its fields
{
"query": {
"multi_match": {
"query": "cj",
"fields": ["content","author"]
}
}
}
Searching several index modes of one field
When a field needs to be analyzed in several different ways to improve search, fields (multi-fields) can define several sub-fields, each analyzing the same string with a different analyzer.
PUT http://chenjie.asia:9200/test3
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"py": {
"type": "text",
"analyzer": "pinyin"
}
}
}
}
}
}
# 插入數(shù)據(jù)
POST http://chenjie.asia:9200/test3/_doc/1
{
"content": "我胡漢三又回來了"
}
# The document can be found with either pinyin or Chinese
GET http://chenjie.asia:9200/test3/_search
{
"query": {
"multi_match": {
"query": "huhansan",
"fields": ["content","content.py"]
}
}
}
GET http://chenjie.asia:9200/test3/_search
{
"query": {
"multi_match": {
"query": "胡漢三",
"fields": ["content","content.py"]
}
}
}
The five types of multi_match query
- best_fields: (default) Finds documents which match any field, but uses the _score from the best field.
- most_fields: Finds documents which match any field and combines the _score from each field. (The difference from best_fields is scoring: best_fields takes the maximum matching score, while most_fields sums the scores of all matching fields.)
- cross_fields: Treats fields with the same analyzer as though they were one big field. Looks for each word in any field. (All input tokens must be matched across the same group of fields.)
- phrase: Runs a match_phrase query on each field and combines the _score from each field.
- phrase_prefix: Runs a match_phrase_prefix query on each field and combines the _score from each field.
GET http://chenjie.asia:9200/article/_search
{
"query": {
"multi_match": {
"query": "hxr",
"fields": [
"name^5",
"name.FPY",
"name.SPY",
"name.IKS^0.8"
],
"type": "best_fields"
}
}
}
文章字數(shù)受限婆咸,下篇請看
Elasticsearch自定義分析器(下)