Note: the code below targets Elasticsearch 7.x. Older versions use slightly different syntax (a type must be specified), and some of the newer relevance-scoring features may not be available there.
1. Analyzers
1.1 Concepts
An analyzer consists of the following three stages (a small sketch combining them follows the list):
- Character filters (CharacterFilters): the string is first passed through each character filter in order. Their job is to tidy up the string before tokenization; a character filter can be used to strip HTML, or to turn & into and.
- Tokenizer: the string is split into individual terms by the tokenizer. Tokenization also records each token's order or position (used for proximity queries), the start and end character offsets (used for highlighting search snippets), and the token type.
- Token filters (TokenFilter): finally, the terms pass through each token filter in order. A filter may change terms (e.g. lowercasing Quick), remove terms (e.g. stop words such as a, and, the), or add terms (e.g. synonyms such as jump and leap).
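As a minimal sketch of how the three stages combine, the _analyze API accepts an ad-hoc char_filter, tokenizer and filter chain; the text and the result shown in the comment are illustrative:
GET _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<b>Quick Foxes</b> JUMP"
}
# Expected result, roughly: [ quick, foxes, jump ]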
1.2 Character filters (CharacterFilters)
Character filter | Description |
---|---|
HTML Strip Character Filter (strips HTML tags and decodes HTML entities) | The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp; . |
Mapping Character Filter (string replacement) | The mapping character filter replaces any occurrences of the specified strings with the specified replacements. |
Pattern Replace Character Filter (regex replacement) | The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement. |
① The html_strip character filter
Strips HTML tags from the text.
Parameter | Description |
---|---|
escaped_tags | Tags that should not be stripped; multiple tags are given as an array |
Default filter configuration
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"text": "<p>I'm so <b>happy</b>!</p>"
}
# 得到結(jié)果 [ \nI'm so happy!\n ]
定制過濾器配置
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_custom_html_strip_char_filter"
]
}
},
"char_filter": {
"my_custom_html_strip_char_filter": {
"type": "html_strip",
"escaped_tags": [
"b"
]
}
}
}
}
}
② The mapping character filter
Replaces text with configured substitutions.
Parameter | Description |
---|---|
mappings | key => value pairs defining the replacements; multiple mappings are given as an array |
mappings_path | Path to a UTF-8 encoded file containing the mappings, one mapping per line |
Default filter configuration
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"? => 0",
"? => 1",
"? => 2",
"? => 3",
"? => 4",
"? => 5",
"? => 6",
"? => 7",
"? => 8",
"? => 9"
]
}
],
"text": "My license plate is ?????"
}
# 得到結(jié)果 [ My license plate is 25015 ]
定制過濾器配置
PUT /my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_mappings_char_filter"
]
}
},
"char_filter": {
"my_mappings_char_filter": {
"type": "mapping",
"mappings": [
":) => _happy_",
":( => _sad_"
]
}
}
}
}
}
Test the filter
GET /my-index-000001/_analyze
{
"tokenizer": "keyword",
"char_filter": [ "my_mappings_char_filter" ],
"text": "I'm delighted about it :("
}
# 結(jié)果 [ I'm delighted about it _sad_ ]
③ The pattern_replace character filter
Matches the text against a regular expression and replaces the matched strings.
Parameter | Description |
---|---|
pattern | Java regular expression |
replacement | Replacement string; capture groups can be referenced with the $1..$9 syntax |
flags | Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS" . |
Custom filter configuration
PUT my-index-00001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "(\\d+)-(?=\\d)",
"replacement": "$1_"
}
}
}
}
}
Test the filter
POST my-index-00001/_analyze
{
"analyzer": "my_analyzer",
"text": "My credit card is 123-456-789"
}
# [ My, credit, card, is, 123_456_789 ]
1.3 Tokenizers
1.3.1 Word Oriented Tokenizers
Tokenizer | Description |
---|---|
Standard Tokenizer | The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages. |
Letter Tokenizer | The letter tokenizer divides text into terms whenever it encounters a character which is not a letter. |
Lowercase Tokenizer | The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. |
Whitespace Tokenizer | The whitespace tokenizer divides text into terms whenever it encounters any whitespace character. |
UAX URL Email Tokenizer | The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens. |
Classic Tokenizer | The classic tokenizer is a grammar based tokenizer for the English Language. |
Thai Tokenizer | The thai tokenizer segments Thai text into words. |
① Standard
The standard tokenizer splits text according to the algorithm defined in Unicode Standard Annex #29; characters such as whitespace and '-' act as split points.
Parameter | Description | Default |
---|---|---|
max_token_length | If a token exceeds this length it is split at max_token_length intervals | 255 |
POST _analyze
{
"tokenizer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
Custom configuration
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
② Letter
The letter tokenizer splits text whenever it encounters a character that is not a letter, so a token can be an arbitrarily long run of consecutive letters. It works well for most European languages but is not suitable for many Asian languages.
No parameters.
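A quick illustration (the expected output in the comment follows the behaviour described above; digits and punctuation are dropped):
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]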
③ Lowercase
The lowercase tokenizer can be seen as the combination of the letter tokenizer and the lowercase token filter: it first splits like the letter tokenizer and then lowercases all resulting terms.
No parameters.
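The same sentence, run through the lowercase tokenizer (expected output shown for comparison with the letter tokenizer):
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]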
④ Whitespace
The whitespace tokenizer splits text on whitespace characters (an example follows the parameter table).
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
⑤ UAX URL Email
The uax_url_email tokenizer is similar to the standard tokenizer, except that it keeps URLs and email addresses as single tokens, whereas the standard tokenizer would split them (an example follows the parameter table).
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
⑥ Classic
The classic tokenizer works well for documents written in English. It handles English acronyms, company names, email addresses and most internet host names reasonably well, but it performs poorly for languages other than English.
- It splits on most punctuation and removes it, but a dot that is not followed by whitespace is not treated as a split point.
- It splits on hyphens, unless the token contains a number, in which case the whole token is interpreted as a product number and is not split.
- It keeps email addresses and internet host names as single tokens.
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
POST _analyze
{
"tokenizer": "classic",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
⑦ Thai
Segments Thai text into words.
No parameters.
1.3.2 Partial Word Tokenizers
Tokenizer | Description |
---|---|
N-Gram Tokenizer | The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck] . |
Edge N-Gram Tokenizer | The edge_ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick] . |
① Ngram
A tokenizer of type ngram.
The ngram tokenizer accepts the following settings:
Parameter | Description | Default |
---|---|---|
min_gram | Minimum length of a gram | 1 |
max_gram | Maximum length of a gram | 2 |
token_chars | Character classes to keep in tokens (e.g. digits or letters); Elasticsearch splits on characters that do not belong to the configured classes | [] (keep all characters) |
token_chars accepts the following character classes:
token_chars | Example |
---|---|
letter | a, b, ï or 京 |
digit | 3 or 7 |
whitespace | " " or "\n" |
punctuation | ! or " |
symbol | $ or √ |
custom | custom characters which need to be set using the custom_token_chars setting |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "2 Quick Foxes."
}
# [ Qui, uic, ick, Fox, oxe, xes ]
② Edge NGram
This tokenizer is very similar to ngram, but it only keeps the n-grams anchored to the start of each word.
The edge_ngram tokenizer accepts the following settings:
Parameter | Description | Default |
---|---|---|
min_gram | Minimum length of a gram | 1 |
max_gram | Maximum length of a gram | 2 |
token_chars | Character classes to keep in tokens (e.g. digits or letters); the text is split on characters outside the configured classes | [] (keep all characters) |
token_chars accepts the following character classes:
token_chars | Example |
---|---|
letter | a, b, ï or 京 |
digit | 3 or 7 |
whitespace | " " or "\n" |
punctuation | ! or " |
symbol | $ or √ |
custom | custom characters which need to be set using the custom_token_chars setting |
POST _analyze
{
"text": "Hélène Ségara it's !<>#",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "[^\\s\\p{L}\\p{N}]",
"replacement": ""
}
],
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
{
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "12"
}
]
}
# [h] [he] [hel] [hele] [helen] [helene] [s] [se] [seg] [sega] [segar] [segara] [i] [it] [its]
1.3.3 Structured Text Tokenizers
Tokenizer | Description |
---|---|
Keyword Tokenizer | The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms. |
Pattern Tokenizer | The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms. |
Simple Pattern Tokenizer | The simple_pattern tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer. |
Char Group Tokenizer | The char_group tokenizer is configurable through sets of characters to split on, which is usually less expensive than running regular expressions. |
Simple Pattern Split Tokenizer | The simple_pattern_split tokenizer uses the same restricted regular expression subset as the simple_pattern tokenizer, but splits the input at matches rather than returning the matches as terms. |
Path Tokenizer | The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz ] . |
① Simple Pattern
The simple_pattern tokenizer uses a regular expression to capture matching text as tokens; everything else is discarded.
Parameter | Description | Default |
---|---|---|
pattern | Lucene regular expression | empty string |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern",
"pattern": "[0123456789]{3}"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "fd-786-335-514-x"
}
# [ 786, 335, 514 ]
② Simple Pattern Split
The simple_pattern_split tokenizer uses a Lucene regular expression to match text and splits the input at the matches. Its Lucene regex syntax is less powerful than the Java regex syntax used by the pattern tokenizer, but it is more efficient.
Parameter | Description | Default |
---|---|---|
pattern | Lucene regular expression | empty string |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "simple_pattern_split",
"pattern": "_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "an_underscored_phrase"
}
# [ an, underscored, phrase ]
③ Pattern
The pattern tokenizer uses a Java regular expression and splits the input at the matches.
Parameter | Description | Default |
---|---|---|
pattern | Java regular expression | \W+ |
flags | Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS" | |
group | Which capture group to extract as tokens | -1 |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": ","
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "comma,separated,values"
}
# [ comma, separated, values ]
④ Keyword
The keyword tokenizer does nothing to the text: the whole input is emitted as a single token (an example follows the table).
Setting | Description | Default |
---|---|---|
buffer_size | Size of the term buffer; changing it is not recommended | 256 |
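A quick illustration of the pass-through behaviour:
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
# Expected: [ New York ]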
⑤ Path Hierarchy
The path_hierarchy tokenizer splits a hierarchical value, such as a filesystem path, level by level. For example:
/something/something/else
produces the tokens
/something, /something/something, /something/something/else
Parameter | Description | Default |
---|---|---|
delimiter | The path separator | / |
replacement | An optional replacement character for the delimiter | same as delimiter |
buffer_size | Buffer size | 1024 |
reverse | Whether to emit the tokens in reverse order | false |
skip | The number of initial tokens to skip | 0 |
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-",
"replacement": "/",
"skip": 2
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_analyzer",
"text": "one-two-three-four-five"
}
# [ /three, /three/four, /three/four/five ]
# If reverse is set to true, the tokens are:
# [ one/two/three/, two/three/, three/ ]
⑥ Char group
The char_group tokenizer splits text on a configurable set of characters, which is usually cheaper than running regular expressions.
Parameter | Description | Default |
---|---|---|
tokenize_on_chars | Characters to split on, e.g. - , or character groups: whitespace , letter , digit , punctuation , symbol | |
max_token_length | Maximum token length; longer tokens are split at this interval | 255 |
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
"-",
"\n"
]
},
"text": "The QUICK brown-fox"
}
# [ The, QUICK, brown, fox ]
1.4 Token filters (TokenFilter)
Token filters are applied in order, so pay attention to the order of the array.
There are too many to cover here; see the official documentation for the full list. The common ones are described below:
Token filter | Description |
---|---|
Apostrophe | |
ASCII folding | Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (first 127 ASCII characters) to their ASCII equivalent, if one exists. For example, the filter changes à to a. |
CJK bigram | |
CJK width | |
Classic | |
Common grams | |
Conditional | |
Decimal digit | |
Delimited payload | |
Dictionary decompounder | |
Edge n-gram | Forms an n-gram of a specified length from the beginning of a token. |
Elision | |
Fingerprint | Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token. |
Flatten graph | |
Hunspell | |
Hyphenation decompounder | |
Keep types | |
Keep words | |
Keyword marker | |
Keyword repeat | Outputs a keyword version of each token in a stream. These keyword tokens are not stemmed. |
KStem | |
Length | |
Limit token count | |
Lowercase | Changes token text to lowercase. For example, you can use the lowercase filter to change THE Lazy DoG to the lazy dog. |
MinHash | |
Multiplexer | |
N-gram | Forms n-grams of specified lengths from a token. |
Normalization | |
Pattern capture | |
Pattern replace | |
Phonetic | |
Porter stem | |
Predicate script | |
Remove duplicates | Removes duplicate tokens in the same position. |
Reverse | Reverses each token in a stream. For example, you can use the reverse filter to change cat to tac. |
Shingle | |
Snowball | |
Stemmer | Provides algorithmic stemming for several languages, some with additional variants. For a list of supported languages, see the language parameter. |
Stemmer override | |
Stop | Removes stop words from a token stream. |
Synonym | The synonym token filter allows to easily handle synonyms during the analysis process. Synonyms are configured using a configuration file. |
Synonym graph | |
Trim | Removes leading and trailing whitespace from each token in a stream. While this can change the length of a token, the trim filter does not change a token’s offsets. |
Truncate | |
Unique | Removes duplicate tokens from a stream. For example, you can use the unique filter to change the lazy lazy dog to the lazy dog. |
Uppercase | Changes token text to uppercase. For example, you can use the uppercase filter to change the Lazy DoG to THE LAZY DOG. |
Word delimiter | |
Word delimiter graph |
① Edge n-gram
The edge_ngram filter behaves like ngram, except that it only produces grams anchored to the start of the token: fox becomes [ f, fo ].
Parameter | Description | Default |
---|---|---|
min_gram | Minimum gram length | 1 |
max_gram | Maximum gram length | 2 |
preserve_original | Emits original token when set to true | false |
side | Deprecated. Whether grams are taken from the front or the back of the token | front |
GET _analyze
{
"tokenizer": "standard",
"filter": [
{ "type": "edge_ngram",
"min_gram": 1,
"max_gram": 2
}
],
"text": "the quick brown fox jumps"
}
# [ t, th, q, qu, b, br, f, fo, j, ju ]
② N-gram
The ngram filter splits fox into [ f, fo, o, ox, x ].
Parameter | Description | Default |
---|---|---|
min_gram | Minimum gram length | 1 |
max_gram | Maximum gram length | 2 |
preserve_original | Emits original token when set to true | false |
For example:
GET _analyze
{
"tokenizer": "standard",
"filter": [ "ngram" ],
"text": "Quick fox"
}
# [ Q, Qu, u, ui, i, ic, c, ck, k, f, fo, o, ox, x ]
③ Stop
The stop token filter removes stop words from the token stream.
Parameter | Description | Default |
---|---|---|
stopwords | A pre-defined stop word set or an array of stop words | _english_ |
stopwords_path | Path to a stop words file | |
ignore_case | If true, stop word matching ignores case | false |
remove_trailing | Whether to remove the last token of a stream if it is a stop word | true |
The default _english_ set removes: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "stop" ],
"text": "a quick fox jumps over the lazy dog"
}
# [ quick, fox, jumps, over, lazy, dog ]
④ Stemmer
Provides algorithmic stemming for several languages.
It reduces each token to its stem and replaces the token with it.
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "stemmer" ],
"text": "the foxes jumping quickly"
}
# [ the, fox, jump, quickli ]
⑤ Keyword repeat
Emits a copy of each token. It is usually combined with stemmer and remove_duplicates: stemmer does not keep the original token, so adding keyword_repeat before it yields both the stem and the original word. When a token's stem is identical to the original token, the same token appears twice at the same position, which can then be deduplicated with remove_duplicates.
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer"
],
"text": "fox running"
}
{
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
⑥ Remove duplicates
Removes identical tokens at the same position (start_offset). Usually combined with keyword_repeat and stemmer.
In the following example the same token appears twice at the same position:
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer"
],
"text": "jumping dog"
}
{
"tokens": [
{
"token": "jumping",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
With remove_duplicates added:
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
"keyword_repeat",
"stemmer",
"remove_duplicates"
],
"text": "jumping dog"
}
{
"tokens": [
{
"token": "jumping",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "dog",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
⑦ Uppercase
Converts all tokens to upper case, e.g. The Lazy DoG becomes THE LAZY DOG.
⑧ Lowercase
Converts all tokens to lower case, e.g. The Lazy DoG becomes the lazy dog.
⑨ ASCII folding
The asciifolding filter converts alphabetic, numeric and symbolic characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if such equivalents exist; for example, à becomes a.
GET /_analyze
{
"tokenizer" : "standard",
"filter" : ["asciifolding"],
"text" : "a?aí à la carte"
}
# [ acai, a, la, carte ]
⑩ Fingerprint
Removes duplicate tokens, sorts the remaining tokens, and concatenates them into a single output token.
For example, filtering [ the, fox, was, very, very, quick ] with fingerprint works as follows:
1) sort the tokens: [ fox, quick, the, very, very, was ]
2) remove duplicates
3) emit a single token: [ fox quick the very was ]
GET _analyze
{
"tokenizer" : "whitespace",
"filter" : ["fingerprint"],
"text" : "zebra jumps over resting resting dog"
}
# [ dog jumps over resting zebra ]
⑪ Trim
Removes leading and trailing whitespace from each token without changing the token's offsets.
The standard and whitespace tokenizers already produce tokens without surrounding whitespace, so the trim filter is usually unnecessary when they are used.
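A quick check with the keyword tokenizer, which does keep the surrounding whitespace:
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [ "trim" ],
  "text": " fox "
}
# Expected: [ fox ]  (the offsets still cover the original " fox ")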
⑫ Unique
Removes duplicate tokens from the stream, e.g. the lazy lazy dog becomes the lazy dog.
Unlike remove_duplicates, unique only requires the tokens to be identical, regardless of position.
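The example from the description, run through _analyze:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "unique" ],
  "text": "the lazy lazy dog"
}
# Expected: [ the, lazy, dog ]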
⑬ Synonym
The synonym token filter handles synonyms during analysis. Synonyms are configured inline or via a configuration file.
Parameter | Description | Default |
---|---|---|
expand | If the mapping is "bar, foo, baz" and expand is false, no mapping is added, because with expand=false the target mapping is the first word. With expand=true the mappings added are equivalent to foo, baz => foo, baz, i.e. all mappings other than the stop word | true |
lenient | If true, exceptions while parsing the synonym configuration are ignored. Note that only synonym rules which cannot be parsed are ignored | false |
synonyms | Inline synonyms, e.g. [ "foo, bar => baz" ] | |
synonyms_path | Path to a synonyms file | |
tokenizer | The tokenizer used to tokenize the synonyms; kept for backwards compatibility with indices created before 6.0 | |
ignore_case | Works together with the tokenizer parameter only | |
Note: if the target of a synonym rule (the word after =>) is a stop word, the whole rule is dropped. If a source word (before =>) is a stop word, that word is dropped.
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": [ "my_stop", "synonym" ]
}
},
"filter": {
"my_stop": {
"type": "stop",
"stopwords": [ "bar" ]
},
"synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [ "foo, bar => baz" ]
}
}
}
}
}
}
Using a synonyms file
PUT /test_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [ "synonym" ]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
}
}
}
}
}
The file format is as follows:
# Blank lines and lines starting with pound are comments.
# Explicit mappings match any token sequence on the LHS of "=>"
# and replace with all alternatives on the RHS. These types of mappings
# ignore the expand parameter in the schema.
# Examples:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit
# Equivalent synonyms may be separated with commas and give
# no explicit mapping. In this case the mapping behavior will
# be taken from the expand parameter in the schema. This allows
# the same synonym file to be used in different synonym handling strategies.
# Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
lol, laughing out loud
# If expand==true, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod
# Multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz
注:經(jīng)驗所得榴啸,帶有 synonym 的 analyzer 適用于 search 而不適用于存儲 index。
- synonym 增加了field 的 term 數(shù)量(導致評分參數(shù) avgdl 變大)晚岭, 還有重要的是 如果使用 match query 的話鸥印,會導致 匹配的 termFreq 增加到 synonym 的數(shù)量,影響評分坦报。
- 如果 同義詞變化的話库说,需要同步更新所有的關系到同義詞的文檔。
- 對于匹配原詞 和 他的同義詞片择,往往原詞的 評分應該更高潜的。但是 ES 中卻一視同仁。沒有區(qū)別字管。雖然可以通過定義不同的 field 啰挪,一個 field 使用 完全切分,一個field 使用同義詞嘲叔,并且在search時亡呵,給 全完且分詞field 一個較高的權重。但是又帶來了怎加了term 存儲的容量擴大問題硫戈。
⑭ Reverse
Reverses each token, e.g. cat becomes tac.
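A quick check (expected output shown in the comment):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "reverse" ],
  "text": "quick fox jumps"
}
# Expected: [ kciuq, xof, spmuj ]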
1.5 Analyzers
The following analyzers ship with Elasticsearch; most of them can be reproduced by combining the char filters, tokenizers and token filters introduced above.
Analyzer | Description |
---|---|
Standard Analyzer | The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words. |
Simple Analyzer | The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms. |
Whitespace Analyzer | The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms. |
Stop Analyzer | The stop analyzer is like the simple analyzer, but also supports removal of stop words. |
Keyword Analyzer | The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term. |
Pattern Analyzer | The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words. |
Language Analyzer | Elasticsearch provides many language-specific analyzers like english or french . |
Fingerprint Analyzer | The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection. |
① Standard Analyzer
The standard analyzer is Elasticsearch's default analyzer:
- no char filter
- the standard tokenizer
- the lowercase filter and the stop token filter; by default stop words is _none_, i.e. no stop words are removed.
Parameter | Description | Default |
---|---|---|
max_token_length | Maximum length of a single token; longer tokens are split at this interval | 255 |
stopwords | A pre-defined stop word set or an array of stop words | _none_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
② Simple Analyzer
The simple analyzer (a quick example follows the list):
- no char filter
- the lowercase tokenizer
- no token filter
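Given those components, the expected behaviour is:
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]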
③ Whitespace Analyzer
The whitespace analyzer splits text on whitespace (a quick example follows the list):
- no char filter
- the whitespace tokenizer
- no token filter
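For comparison with the simple analyzer:
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]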
④ Stop Analyzer
Like the simple analyzer, but with stop word removal; the _english_ stop word set is used by default.
- no char filter
- the lowercase tokenizer
- the stop token filter, defaulting to _english_
Parameter | Description | Default |
---|---|---|
stopwords | A pre-defined stop word set or an array of stop words | _english_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["the", "over"]
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_stop_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ quick, brown, foxes, jumped, lazy, dog, s, bone ]
⑤ Keyword Analyzer
The keyword analyzer (a quick example follows the list):
- no char filter
- the keyword tokenizer
- no token filter
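The whole input is emitted as a single term:
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# Expected: [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]  (one token)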
⑥ Language Analyzer
Language analyzers are available for: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
In practice we mostly use the english analyzer:
- no char filter
- the standard tokenizer
- the stemmer filter (among others)
Parameter | Description | Default |
---|---|---|
stopwords | A pre-defined stop word set or an array of stop words | _english_ |
stopwords_path | Path to a stop words file | |
Direct use
GET _analyze
{
"analyzer": "english",
"text": "Running Apps in a Phone"
}
# [run] [app] [phone]
創(chuàng)建一個自定義分析器實現(xiàn)english分析器的功能
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"rebuilt_english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
⑦ Pattern Analyzer
- no char filter
- the pattern tokenizer
- token filters: lowercase and stop (disabled by default)
Parameter | Description | Default |
---|---|---|
pattern | Java regular expression | \W+ |
flags | Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS" | |
lowercase | Whether terms should be lowercased | true |
stopwords | A pre-defined stop word set or an array of stop words | _none_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "pattern",
"pattern": "\\W|_",
"lowercase": true
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_email_analyzer",
"text": "John_Smith@foo-bar.com"
}
# [ john, smith, foo, bar, com ]
⑧ Fingerprint Analyzer
The fingerprint analyzer implements a fingerprinting algorithm: the text is lowercased, normalized to remove extended characters, sorted, deduplicated, and concatenated into a single token.
- no char filter
- the standard tokenizer
- token filters: lowercase, asciifolding, stop (disabled by default) and fingerprint
Parameter | Description | Default |
---|---|---|
separator | The character used to concatenate the terms | a space |
max_output_size | Maximum output token size; larger outputs are discarded | 255 |
stopwords | A pre-defined stop word set or an array of stop words | _none_ |
stopwords_path | Path to a stop words file | |
Direct use
POST _analyze
{
"analyzer": "fingerprint",
"text": "Yes yes, G?del said this sentence is consistent and."
}
# [ and consistent godel is said sentence this yes ]
With parameters
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_fingerprint_analyzer": {
"type": "fingerprint",
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_fingerprint_analyzer",
"text": "Yes yes, G?del said this sentence is consistent and."
}
# [ consistent godel said sentence yes ]
1.6 Custom analyzer examples
Example 1
PUT chenjie.asia:9200/analyzetest
{
"settings":{
"analysis":{
"analyzer":{
"my":{ //分析器
"tokenizer":"punctuation", //指定所用的分詞器
"type":"custom", //自定義類型的分析器
"char_filter":["emoticons"], //指定所用的字符過濾器
"filter":["lowercase","english_stop"]
}
},
"char_filter":{ //字符過濾器
"emoticons":{ //字符過濾器的名字
"type":"mapping", //匹配模式
"mappings":[
":)=>_happy_", //如果匹配上:)棒口,那么替換為_happy_
":(=>_sad_" //如果匹配上:(,那么替換為_sad_
]
}
},
"tokenizer":{ //分詞器
"punctuation":{ //分詞器的名字
"type":"pattern", //正則匹配分詞器
"pattern":"[.,!?]" //通過正則匹配方式匹配需要作為分隔符的字符辜膝,此處為 . , ! ? 无牵,作為分隔符進行分詞
}
},
"filter":{ //后過濾器
"english_stop":{ //后過濾器的名字
"type":"stop", //停用詞
"stopwords":"_english_" //指定停用詞,過濾掉停用詞
}
}
}
}
}
# GET chenjie.asia:9200/analyzetest/_analyze
{
"analyzer": "my",
"text": "I am a :) person,and you"
}
The custom analyzer above analyzes the text "I am a :) person,and you" into two tokens, "i am a _happy_ person" and "and you":
Step 1: the character filter replaces :) with _happy_
Step 2: the tokenizer splits at the comma matched by the regular expression
Step 3: stop word tokens are removed
Example 2
{
"settings": {
"analysis": {
"analyzer": {
"my_content_analyzer": {
"type": "custom",
"char_filter": [
"xschool_filter"
],
"tokenizer": "standard",
"filter": [
"lowercase",
"my_stop"
]
}
},
"char_filter": {
"xschool_filter": {
"type": "mapping",
"mappings": [
"X-School => XSchool"
]
}
},
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["so", "to", "the"]
}
}
}
},
"mappings": {
"type":{
"properties": {
"content": {
"type": "text",
"analyzer": "my_content_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
A dedicated analyzer for the query string can be specified at search time via search_analyzer.
2. The ik and pinyin plugins
2.1 Installing the ik and pinyin plugins
- Download the plugin versions matching your ES version; 7.6.2 is used here
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.2/elasticsearch-analysis-ik-7.6.2.zip # ik plugin
https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.6.2/elasticsearch-analysis-pinyin-7.6.2.zip # pinyin plugin
- Install the plugins
elasticsearch/bin/elasticsearch-plugin install elasticsearch-analysis-ik-7.6.2.zip
elasticsearch/bin/elasticsearch-plugin install elasticsearch-analysis-pinyin-7.6.2.zip
Extending the ik dictionary
Add custom words to the ik dictionary so that the tokenizer can produce them as tokens.
Go to the ik plugin's config directory under plugins, create a file myword.dic and add your custom words to it, then register the file as an extension dictionary in IKAnalyzer.cfg.xml: <entry key="ext_dict">myword.dic</entry>. Restart Elasticsearch afterwards.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer 擴展配置</comment>
<!--用戶可以在這里配置自己的擴展字典 -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<!--用戶可以在這里配置自己的擴展停止詞字典-->
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!--用戶可以在這里配置遠程擴展字典 -->
<entry key="remote_ext_dict">location</entry>
<!--用戶可以在這里配置遠程擴展停止詞字典-->
<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
Hot-reloading the IK dictionary
The plugin supports hot-reloading the IK dictionary through the remote dictionary entries mentioned above:
<!-- configure remote extension dictionaries here -->
<entry key="remote_ext_dict">location</entry>
<!-- configure remote extension stop word dictionaries here -->
<entry key="remote_ext_stopwords">location</entry>
Here location is a URL such as http://yoursite.com/getCustomDict. The request only has to satisfy two requirements for hot updates to work:
- The HTTP response must include two headers, Last-Modified and ETag, both strings; whenever either of them changes, the plugin fetches the new word list and updates the dictionary.
- The response body must contain one word per line, with \n as the line separator.
If these two requirements are met, the dictionary is hot-updated without restarting the ES instance.
You can put the words to be updated in a UTF-8 encoded .txt file served by nginx or any simple HTTP server; when the .txt file changes, the server automatically returns the corresponding Last-Modified and ETag on client requests. A small tool can also be built to extract terms from your business system and update this .txt file.
2.2 The ik analyzer
2.2.1 The ik analyzer
Elasticsearch's built-in analyzers are not friendly to Chinese: they split the text character by character and cannot form words, which is why the ik plugin is used.
The ik plugin provides
- Analyzers: ik_smart, ik_max_word
- Tokenizers: ik_smart, ik_max_word
What is the difference between ik_max_word and ik_smart? (A comparison with _analyze follows the list.)
- ik_max_word: the finest-grained segmentation, e.g. 中華人民共和國國歌 is split into 中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌, exhausting all possible combinations; suitable for term queries.
- ik_smart: the coarsest-grained segmentation, e.g. 中華人民共和國國歌 is split into 中華人民共和國, 國歌; suitable for phrase queries.
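Assuming the ik plugin from section 2.1 is installed, the difference can be checked with _analyze; the expected tokens in the comments follow the description above (the exact output depends on the plugin's dictionary):
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國國歌"
}
# Expected: [ 中華人民共和國, 國歌 ]
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國國歌"
}
# Expected: all the combinations listed above, from 中華人民共和國 down to 國歌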
2.2.2 Synonyms (TODO)
PUT http://chenjie.asia:9200/ik_synonym
{
"setting": {
"analysis": {
"analyzer": {
"ik_synonym_analyzer": {
"tokenizer": "",
"filter": ""
}
},
"filter": {
"ik_synonym_filter": {
"type": "synonym",
"synonyms_path": "/"
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "ik_smart"
},
"author": {
"type": "keyword"
}
}
}
}
2.2.3 Stop words
2.3 The pinyin analyzer
2.3.1 The pinyin analyzer
The pinyin plugin converts between Chinese characters and pinyin and integrates NLP tools.
The plugin provides:
- Analyzer: pinyin
- Tokenizer: pinyin
- Token filter: pinyin
參數(shù) | 說明 | 示例 | 默認值 |
---|---|---|---|
keep_first_letter |
保留拼音首字母組合 |
劉德華 >ldh
|
true |
keep_separate_first_letter |
保留拼音首字母 |
劉德華 >l ,d ,h
|
false |
limit_first_letter_length |
設置最長拼音首字母組合 | 16 | |
keep_full_pinyin |
保留全拼 |
劉德華 > [liu ,de ,hua ] |
true |
keep_joined_full_pinyin |
保留全拼組合 |
劉德華 > [liudehua ] |
false |
keep_none_chinese |
過濾掉中文和數(shù)字 | true | |
keep_none_chinese_together |
不切分非中文字母 |
當設為true時:DJ音樂家 -> DJ ,yin ,yue ,jia <br />當設為false時:DJ音樂家 -> D ,J ,yin ,yue ,jia 灸蟆;NOTE:keep_none_chinese需要為true |
true |
keep_none_chinese_in_first_letter |
中文轉(zhuǎn)為拼音首字母驯耻,并將非中文字母與拼音合并 |
劉德華AT2016 ->ldhat2016
|
true |
keep_none_chinese_in_joined_full_pinyin |
中文轉(zhuǎn)為全拼,并將非中文字母與拼音合并 |
劉德華2016 ->liudehua2016
|
false |
none_chinese_pinyin_tokenize |
如果非中文字母是拼音炒考,則將它們分成單獨的拼音詞可缚。keep_none_chinese 和keep_none_chinese_together 需要為true。 |
liudehuaalibaba13zhuanghan -> liu ,de ,hua ,a ,li ,ba ,ba ,13 ,zhuang ,han
|
true |
keep_original |
是否保留原輸入 | false | |
lowercase |
是否小寫非中文字母 | true | |
trim_whitespace |
首位去空格 | true | |
remove_duplicated_term |
會移除重復的短語斋枢,可能會影響位置相關的查詢結(jié)果城看。 |
de的 >de
|
false |
ignore_pinyin_offset |
after 6.0, offset is strictly constrained, overlapped tokens are not allowed, with this parameter, overlapped token will allowed by ignore offset, please note, all position related query or highlight will become incorrect, you should use multi fields and specify different settings for different query purpose. if you need offset, please set it to false. | true |
2.3.2 Usage examples
Using the pinyin tokenizer
Create an index named diypytest that uses a pinyin tokenizer.
# Create the index
PUT http://chenjie.asia:9200/diypytest
{
"settings": {
"analysis": {
"analyzer": {
"pinyin_analyzer": {
"tokenizer": "my_pinyin"
}
},
"tokenizer": {
"my_pinyin": {
"type": "pinyin",
"keep_separate_first_letter": false,
"keep_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"lowercase": true,
"remove_duplicated_term": true
}
}
}
},
"mappings": {
"properties": {
"menu": {
"type": "keyword",
"fields": {
"pinyin": {
"type": "text",
"store": false,
"term_vector": "with_offsets",
"analyzer": "pinyin_analyzer",
"boost": 10
}
}
}
}
}
}
# Test the analyzer
GET http://chenjie.asia:9200/diypytest/_analyze
{
"analyzer": "pinyin_analyzer",
"text": "西紅柿雞蛋"
}
# 插入數(shù)據(jù)
PUT http://chenjie.asia:9200/diypytest/_doc/1
{
"menu":"西紅柿雞蛋"
}
PUT http://chenjie.asia:9200/diypytest/_doc/2
{
"menu":"韭菜雞蛋"
}
# 查詢數(shù)據(jù)
GET http://chenjie.asia:9200/diypytest/_search?q=menu:xhsjd // 查詢?yōu)榭?GET http://chenjie.asia:9200/diypytest/_search?q=menu.pinyin:xhsjd // 查詢得到結(jié)果
Using the pinyin token filter
Create an index named diypytest2 that uses a pinyin token filter.
# Create the index
PUT http://chenjie.asia:9200/diypytest2
{
"settings" : {
"analysis" : {
"analyzer" : {
"menu_analyzer" : {
"tokenizer" : "whitespace",
"filter" : "pinyin_first_letter_and_full_pinyin_filter"
}
},
"filter" : {
"pinyin_first_letter_and_full_pinyin_filter" : {
"type" : "pinyin",
"keep_first_letter" : true,
"keep_full_pinyin" : false,
"keep_none_chinese" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true,
"trim_whitespace" : true,
"keep_none_chinese_in_first_letter" : true
}
}
}
}
}
# Test the analyzer
GET http://chenjie.asia:9200/diypytest2/_analyze
{
"analyzer":"menu_analyzer",
"text":"西紅柿雞蛋 韭菜雞蛋 糖醋里脊"
}
# 結(jié)果如下
{
"tokens": [
{
"token": "xhsjd",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "jcjd",
"start_offset": 6,
"end_offset": 10,
"type": "word",
"position": 1
},
{
"token": "tclj",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 2
}
]
}
Adding a custom analyzer to an index
PUT my_analyzer
{
"settings":{
"analysis":{
"analyzer":{
"my":{ //分析器
"tokenizer":"punctuation", //指定所用的分詞器
"type":"custom", //自定義類型的分析器
"char_filter":["emoticons"], //指定所用的字符過濾器
"filter":["lowercase","english_stop"]
},
"char_filter":{ //字符過濾器
"emoticons":{ //字符過濾器的名字
"type":"mapping", //匹配模式
"mapping":[
":)=>_happy_", //如果匹配上:)缘滥,那么替換為_happy_
":(=>_sad_" //如果匹配上:(轰胁,那么替換為_sad_
]
}
},
"tokenizer":{ //分詞器
"punctuation":{ //分詞器的名字
"type":"pattern", //正則匹配分詞器
"pattern":"[.,!?]" //通過正則匹配方式匹配需要作為分隔符的字符,此處為 . , ! ? 朝扼,作為分隔符進行分詞
}
},
"filter":{ //后過濾器
"english_stop":{ //后過濾器的名字
"type":"stop", //停用詞
"stopwords":"_english_" //指定停用詞赃阀,不影響分詞,但不允許查詢
}
}
}
}
}
}
For example, the custom analyzer above analyzes the text "I am a :) person,and you" into two tokens, "i am a _happy_ person" and "and you":
Step 1: the character filter replaces :) with _happy_
Step 2: the tokenizer splits at the comma matched by the regular expression
Step 3: stop word tokens are removed (they cannot be queried)
3. Search
3.1 Field configuration
When defining a field, the following properties can be configured.
"field": {
"type": "text", //文本類型 ,指定類型
"index": "analyzed", //該屬性共有三個有效值:analyzed王凑、no和not_analyzed搪柑,默認是analyzed;analyzed:表示該字段被分析索烹,編入索引工碾,產(chǎn)生的token能被搜索到;not_analyzed:表示該字段不會被分析百姓,使用原始值編入索引锐极,在索引中作為單個詞叶圃;no:不編入索引正驻,無法搜索該字段宣虾;
"analyzer":"ik"http://指定分詞器
"boost":1.23//字段級別的分數(shù)加權
"doc_values":false//對not_analyzed字段蒲拉,默認都是開啟达布,analyzed字段不能使用睁冬,對排序和聚合能提升較大性能募胃,節(jié)約內(nèi)存,如果您確定不需要對字段進行排序或聚合矗晃,或者從script訪問字段值仑嗅,則可以禁用doc值以節(jié)省磁盤空間:
"fielddata":{"loading" : "eager" }//Elasticsearch 加載內(nèi)存 fielddata 的默認行為是 延遲 加載 。 當 Elasticsearch 第一次查詢某個字段時张症,它將會完整加載這個字段所有 Segment 中的倒排索引到內(nèi)存中仓技,以便于以后的查詢能夠獲取更好的性能。
"fields":{"keyword": {"type": "keyword","ignore_above": 256}} //可以對一個字段提供多種索引模式俗他,同一個字段的值脖捻,一個分詞,一個不分詞
"ignore_above":100 //超過100個字符的文本兆衅,將會被忽略地沮,不被索引
"include_in_all":ture//設置是否此字段包含在_all字段中,默認是true羡亩,除非index設置成no選項
"index_options":"docs"http://4個可選參數(shù)docs(索引文檔號) ,freqs(文檔號+詞頻)摩疑,positions(文檔號+詞頻+位置,通常用來距離查詢)畏铆,offsets(文檔號+詞頻+位置+偏移量雷袋,通常被使用在高亮字段)分詞字段默認是position,其他的默認是docs
"norms":{"enable":true,"loading":"lazy"}//分詞字段默認配置辞居,不分詞字段:默認{"enable":false}楷怒,存儲長度因子和索引時boost蛋勺,建議對需要參與評分字段使用 ,會額外增加內(nèi)存消耗量
"null_value":"NULL"http://設置一些缺失字段的初始化值鸠删,只有string可以使用抱完,分詞字段的null值也會被分詞
"position_increament_gap":0//影響距離查詢或近似查詢,可以設置在多值字段的數(shù)據(jù)上火分詞字段上冶共,查詢時可指定slop間隔乾蛤,默認值是100
"store":false//是否單獨設置此字段的是否存儲而從_source字段中分離,默認是false捅僵,只能搜索家卖,不能獲取值
"search_analyzer":"ik"http://設置搜索時的分詞器,默認跟ananlyzer是一致的庙楚,比如index時用standard+ngram上荡,搜索時用standard用來完成自動提示功能
"similarity":"BM25"http://默認是TF/IDF算法,指定一個字段評分策略馒闷,僅僅對字符串型和分詞類型有效
"term_vector":"no"http://默認不存儲向量信息酪捡,支持參數(shù)yes(term存儲),with_positions(term+位置),with_offsets(term+偏移量)纳账,with_positions_offsets(term+位置+偏移量) 對快速高亮fast vector highlighter能提升性能逛薇,但開啟又會加大索引體積,不適合大數(shù)據(jù)量用
}
3.2 Search
3.2.1 Analyzing the search terms
Whenever a document is indexed into Elasticsearch, it goes through a process called index: its strings are analyzed into individual tokens, which are stored in the index. Likewise, whenever we search for a string, the query string is also analyzed into tokens, but those tokens are not stored.
① When you query a full-text field (match), the same analyzer (or the analyzer specified as search_analyzer) is applied to the query string, producing the correct list of search terms.
② When you query an exact-value field (term), the query string is not analyzed; the exact value you specify is searched.
Example:
PUT http://chenjie.asia:9200/test1
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "english"
}
}
}
}
GET http://chenjie.asia:9200/test1/_search
{
"query": {
"match": {
"content": "Happy a birthday"
}
}
}
For this search, by default "Happy a birthday" is analyzed with the same standard analyzer. If search_analyzer is set to the english analyzer, the letter "a" is filtered out, so only "happy" and "birthday" enter the search.
3.2.2 Single-field search
As shown above, use a match query for retrieval; a minimal sketch contrasting match and term follows.
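A minimal sketch of the two cases from 3.2.1; the index name my-index-000001 and the content/author fields are hypothetical placeholders:
GET my-index-000001/_search
{
  "query": {
    "match": { "content": "Happy a birthday" }
  }
}
# The query string is analyzed; documents containing "happy" or "birthday" match.
GET my-index-000001/_search
{
  "query": {
    "term": { "author": "cj" }
  }
}
# The query string is not analyzed; only the exact value "cj" in the keyword field matches.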
3.2.3 Multi-field search
Searching several fields
PUT http://chenjie.asia:9200/test2
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "ik_smart"
},
"author": {
"type": "keyword"
}
}
}
}
# 插入數(shù)據(jù)
POST http://chenjie.asia:9200/test2/_doc/1
{
"content": "I am good!",
"author": "cj"
}
POST http://chenjie.asia:9200/test2/_doc/2
{
"content": "CJ is good!",
"author": "zs"
}
# Search: both documents match, because each has a matching token in one of its fields
{
"query": {
"multi_match": {
"query": "cj",
"fields": ["content","author"]
}
}
}
Searching several index modes of one field
When a field needs to be analyzed in several different ways to improve search, fields (multi-fields) can define several sub-fields, each analyzing the same string with a different analyzer.
PUT http://chenjie.asia:9200/test3
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "ik_smart",
"fields": {
"py": {
"type": "text",
"analyzer": "pinyin"
}
}
}
}
}
}
# 插入數(shù)據(jù)
POST http://chenjie.asia:9200/test3/_doc/1
{
"content": "我胡漢三又回來了"
}
# The document can be found with either pinyin or Chinese
GET http://chenjie.asia:9200/test3/_search
{
"query": {
"multi_match": {
"query": "huhansan",
"fields": ["content","content.py"]
}
}
}
GET http://chenjie.asia:9200/test3/_search
{
"query": {
"multi_match": {
"query": "胡漢三",
"fields": ["content","content.py"]
}
}
}
The five types of multi_match query
- best_fields: (default) Finds documents which match any field, but uses the _score from the best field.
- most_fields: Finds documents which match any field and combines the _score from each field. (The difference from best_fields is scoring: best_fields takes the maximum matching score, while most_fields sums the scores of all matching fields.)
- cross_fields: Treats fields with the same analyzer as though they were one big field. Looks for each word in any field. (All input tokens must be matched across the same group of fields.)
- phrase: Runs a match_phrase query on each field and combines the _score from each field.
- phrase_prefix: Runs a match_phrase_prefix query on each field and combines the _score from each field.
GET http://chenjie.asia:9200/article/_search
{
"query": {
"multi_match": {
"query": "hxr",
"fields": [
"name^5",
"name.FPY",
"name.SPY",
"name.IKS^0.8"
],
"type": "best_fields"
}
}
}
文章字數(shù)受限婆咸,下篇請看
Elasticsearch自定義分析器(下)