1. References
Geek Time (極客時間): Elasticsearch 核心技術與實戰 (Elasticsearch Core Technology and Practice), by 阮一鳴
Elasticsearch 分詞器 (Elasticsearch analyzers)
Comparison and usage of Elasticsearch's default analyzer vs. Chinese analyzers
Elasticsearch series --- using a Chinese analyzer
Official docs: character filters
Official docs: tokenizers
Official docs: token filters
2. Getting started
I. The analyze API
Method 1: specify an analyzer
GET /_analyze
{
"analyzer": "ik_max_word",
"text": "Hello Lady, I'm Elasticsearch ^_^"
}
Method 2: specify an index and a field (the analyzer configured for that field is used)
GET /tmdb_movies/_analyze
{
"field": "title",
"text": "Basketball with cartoon alias"
}
Method 3: specify a tokenizer and filters directly
GET /_analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": "Machine Building Industry Epoch"
}
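With the standard tokenizer plus the lowercase token filter, this request is expected to return the lowercased terms machine, building, industry, epoch, one per position.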
II. Analyzer components
An analyzer is made up of three parts:
- character filter
- tokenizer
- token filter
character filter
Works on the raw text before tokenization. Multiple character filters can be chained, and their changes affect the position and offset information that the tokenizer records.
Elasticsearch ships with a few built-in character filters:
- HTML Strip: removes HTML tags
- Mapping: replaces one string with another
- Pattern Replace: regex-based replacement
Examples
html_strip
GET _analyze
{
"tokenizer": "keyword",
"char_filter": ["html_strip"],
"text": "<br>you know, for search</br>"
}
- Result
{
"tokens" : [
{
"token" : """
you know, for search
""",
"start_offset" : 0,
"end_offset" : 29,
"type" : "word",
"position" : 0
}
]
}
mapping
GET _analyze
{
"tokenizer": "whitespace",
"char_filter": [
{
"type": "mapping",
"mappings": ["- => "]
},
"html_strip"
],
"text": "<br>中國-北京 中國-臺灣 中國-人民</br>"
}
- Result
{
"tokens" : [
{
"token" : "中國北京",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "中國臺灣",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "中國人民",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
}
]
}
pattern_replace
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "https?://(.*)",
"replacement": "$1"
}],
"text": "https://www.elastic.co"
}
- Result
{
"tokens" : [
{
"token" : "www.elastic.co",
"start_offset" : 0,
"end_offset" : 22,
"type" : "word",
"position" : 0
}
]
}
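Note that start_offset and end_offset still refer to positions in the original text (22 characters of "https://www.elastic.co"), even though the character filter shortened the token itself. This is the effect on offset information mentioned above.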
tokenizer
Splits the (filtered) text into individual terms according to a set of rules.
Built-in tokenizers include:
- standard
- letter
- lowercase
- whitespace
- uax url email
- classic
- thai
- n-gram
- edge n-gram
- keyword
- pattern
- simple
- char group
- simple pattern split
- path
Example
path_hierarchy
GET /_analyze
{
"tokenizer": "path_hierarchy",
"text": ["/usr/local/bin/java"]
}
- Result
{
"tokens" : [
{
"token" : "/usr",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/local",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/local/bin",
"start_offset" : 0,
"end_offset" : 14,
"type" : "word",
"position" : 0
},
{
"token" : "/usr/local/bin/java",
"start_offset" : 0,
"end_offset" : 19,
"type" : "word",
"position" : 0
}
]
}
token filter
Processes the terms emitted by the tokenizer, for example by adding, modifying, or removing terms.
Built-in token filters include:
- lowercase
- stop
- uppercase
- reverse
- length
- n-gram
- edge n-gram
- pattern replace
- trim
- ... (see the official docs for the full list; only the filters used here are shown)
Example
GET /_analyze
{
"tokenizer": "whitespace",
"filter": ["stop"],
"text": ["how are you i am fine thank you"]
}
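With the default _english_ stopword set, only "are" should be dropped from this sentence; words such as "how", "i", "you", "fine", and "thank" are not in Lucene's default English stop list, so they are expected to come back unchanged.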
III. Custom analyzers
Defining a custom analyzer simply means supplying your own char_filter, tokenizer, and filter (token filter).
DELETE /my_analysis
PUT /my_analysis
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [
"my_char_filter"
],
"tokenizer": "my_tokenizer",
"filter": [
"my_tokenizer_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["_ => "]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "[,.!? ]"
}
},
"filter": {
"my_tokenizer_filter": {
"type": "stop",
"stopword": "__english__"
}
}
}
}
}
POST /my_analysis/_analyze
{
"analyzer": "my_analyzer",
"text": ["Hello Kitty!, A_n_d you?"]
}