Sample data
PUT /website/article/1
{
"post_date": "2017-01-01",
"title": "my first article",
"content": "this is my first article in this website",
"author_id": 11400
}
PUT /website/article/2
{
"post_date": "2017-01-02",
"title": "my second article",
"content": "this is my second article in this website",
"author_id": 11400
}
PUT /website/article/3
{
"post_date": "2017-01-03",
"title": "my third article",
"content": "this is my third article in this website",
"author_id": 11400
}
Search tests:
GET /website/article/_search?q=2017 // returns all three documents
GET /website/article/_search?q=2017-01-01 // returns all three documents
GET /website/article/_search?q=post_date:2017-01-01 // returns only the document with post_date=2017-01-01
GET /website/article/_search?q=post_date:2017 // also returns only the document with post_date=2017-01-01
Why do we get these results?
They follow from the mapping that ES created automatically.
GET /website/_mapping/article
The response shows the type of each field:
{
"website": {
"mappings": {
"article": {
"properties": {
"author_id": {
"type": "long"
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"post_date": {
"type": "date"
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
When ES creates the mapping automatically, it assigns a data type to each field, and different data types are analyzed and searched differently. That is why searching the _all field and searching the post_date field return completely different results.
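Search modes
- exact value
For example, an exact value search for 2017-01-01 only matches if you enter the full 2017-01-01; entering just 01 finds nothing.
- full text
Also known as full-text search.
When you search, the terms you enter go through a series of transformations:
- abbreviation vs. full form: cn vs. china
- word forms: like, liked, likes
- case: Tom vs. tom
- synonyms: like vs. love
For example, a search value of 2017-01-01 may first be split into 2017, 01, 01, so searching for 2017 or for 01 will match.
In other words, matching is not limited to one complete value: the value is split into terms (tokenized) and matched term by term, and matches can also be made across abbreviations, tenses, letter case, synonyms, and so on.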
Analyzers
Purpose: split text into terms and normalize them (which improves recall).
Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
Recall: the number of relevant results a search is able to find; normalization increases it.
Components of an analyzer
character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you).
tokenizer: splits the text into tokens. For example: hello you and me --> hello, you, and, me
token filter: case conversion, stopword removal, synonym conversion, and so on. For example: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> dropped because they carry little meaning, mother --> mom, small --> little
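The three stages above can be sketched as a small pipeline. This is a toy illustration in Python, not how ES implements analysis: the function names, the stopword set, and the lookup tables standing in for stemming and synonyms are all hypothetical.

```python
import re
from typing import List

# Hypothetical, simplified stand-ins for real filters.
STOPWORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little"}
STEMS = {"dogs": "dog", "liked": "like"}  # toy lookup, not a real stemmer


def char_filter(text: str) -> str:
    """Strip HTML tags and expand '&' before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text: str) -> List[str]:
    """Split the text into tokens on whitespace."""
    return text.split()


def token_filter(tokens: List[str]) -> List[str]:
    """Lowercase, drop stopwords, then apply toy stemming and synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOPWORDS:
            continue
        tok = STEMS.get(tok, tok)
        tok = SYNONYMS.get(tok, tok)
        out.append(tok)
    return out


def analyze(text: str) -> List[str]:
    """Run character filter, tokenizer, and token filter in order."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<span>Tom</span> liked the small dogs"))
print(analyze("I&you"))
```

The order matters: the character filter sees raw text, the tokenizer sees cleaned text, and token filters only ever see individual tokens.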
Built-in ES analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- language analyzer (analyzers for specific languages)
For example:
Sample sentence: Set the shape to semi-transparent by calling set_trans(5)
Tokens produced by each analyzer
- standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
- simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
- whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
- language analyzer (an analyzer for a specific language, for example english): set, shape, semi, transpar, call, set_tran, 5
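The simple and whitespace results listed above are easy to approximate by hand, which helps make the difference concrete. The Python sketch below is an ASCII-only approximation written for illustration; the function names are hypothetical and this is not the ES implementation.

```python
import re

SENTENCE = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_analyze(text):
    """Split on whitespace only; keep case and punctuation untouched."""
    return text.split()


def simple_analyze(text):
    """Split on every non-letter character, then lowercase each token."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


print(whitespace_analyze(SENTENCE))
print(simple_analyze(SENTENCE))
```

Note how the digit 5 disappears under the simple approximation because digits are treated as separators, while the whitespace version keeps set_trans(5) as one token.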
Testing an analyzer
GET /_analyze
{
"analyzer": "standard",
"text":"I love you"
}
Notes on query string searches
The query string must be analyzed with the same analyzer that was used when the index was built.
Query string searches treat exact value fields and full text fields differently.
Back to the queries at the start of this article:
- GET /_search?q=2017 searches the _all field. All fields of a document are concatenated into one long string and analyzed, so all three documents match.
- GET /_search?q=post_date:2017-01-01: post_date is stored as a date and indexed as an exact value, so only one document matches.
- GET /_search?q=post_date:2017 also returns one document; this is due to an optimization specific to the ES version used.
Summary:
- When you insert data directly into ES, it automatically creates the index, along with the type and its mapping.
- The mapping automatically defines the data type of each field.
- Depending on the data type (for example text vs. date), a field is treated either as an exact value or as full text.
- An exact value is added to the inverted index as a single term containing the whole value. Full text goes through analysis first: tokenization and normalization (tense conversion, synonym conversion, case conversion), and only then is it added to the inverted index.
- Whether a field is exact value or full text also determines how a search against it behaves, and that behavior mirrors how the inverted index was built: an exact value search matches the whole value directly, while a full text query string is itself tokenized and normalized before being looked up in the inverted index.
- You can rely on ES dynamic mapping to create the mapping automatically, including the data types, or you can create the index and type mapping by hand in advance and configure each field yourself: data type, indexing behavior, analyzer, and so on.
A mapping is the metadata of a type within an index. Each type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave.
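To make the exact value vs. full text distinction concrete, here is a minimal Python sketch of two inverted indexes over the sample articles. It is a simplification under stated assumptions: the tokenizer is a toy, the combined index only imitates the idea of _all, and real ES stores date fields numerically rather than as string terms.

```python
import re
from collections import defaultdict

# Trimmed copies of the three sample documents.
DOCS = {
    1: {"post_date": "2017-01-01", "title": "my first article"},
    2: {"post_date": "2017-01-02", "title": "my second article"},
    3: {"post_date": "2017-01-03", "title": "my third article"},
}


def tokenize(text):
    """Toy full-text analysis: lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_indexes(docs):
    full_text = defaultdict(set)  # term -> doc ids, built from all fields
    exact = defaultdict(set)      # whole post_date value -> doc ids
    for doc_id, doc in docs.items():
        for value in doc.values():
            for term in tokenize(str(value)):
                full_text[term].add(doc_id)
        exact[doc["post_date"]].add(doc_id)
    return full_text, exact


def search_full_text(index, query):
    """Analyze the query the same way, then return docs matching any term."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return hits


def search_exact(index, value):
    """Look the whole value up as a single term."""
    return set(index.get(value, set()))


full_text, exact = build_indexes(DOCS)
print(search_full_text(full_text, "2017"))    # matches every document
print(search_exact(exact, "2017-01-01"))      # matches only document 1
print(search_exact(exact, "01"))              # matches nothing
```

The key point is that the query goes through the same analysis as the indexed text, so both sides agree on what a term is.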
Core mapping data types
- string/text
- byte,short,integer,long
- float,double
- boolean
- date
When ES creates the mapping dynamically (dynamic mapping), it follows these rules:
- true or false ----------> boolean
- 123 ----------> long
- 123.45 ----------> double
- 2017-01-01 ----------> date
- "hello world" ----------> string/text
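The rules listed above can be mirrored in a few lines of Python. This is only a sketch of the table as written here: `guess_type` is a hypothetical helper, and the date check is a single assumed pattern, whereas real ES date detection accepts more formats.

```python
import re

# Assumption: only plain yyyy-MM-dd strings are treated as dates here.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def guess_type(value):
    """Map a parsed JSON value to the field type from the rules above."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "date" if DATE_RE.match(value) else "text"
    raise TypeError("unsupported value: %r" % (value,))


for sample in (True, 123, 123.45, "2017-01-01", "hello world", "18"):
    print(repr(sample), "->", guess_type(sample))
```

The last sample shows why a quoted number such as "18" ends up as text rather than long: the decision is made on the JSON type, not on what the string looks like.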
Viewing a mapping
GET /index/_mapping/type
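Index options
- analyzed: the value is tokenized
- not_analyzed: the value is not tokenized and is indexed as a whole, i.e. as an exact value. Note: not_analyzed can no longer be used from ES 5.0 onward; use type: keyword instead.
- no: the field is not indexed and cannot be searched
Built-in analyzers include whitespace, simple, english, and others.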
Creating or modifying a mapping
- Create
PUT /website
{
"mappings": {
"article":{
"properties": {
"author_id":{
"type": "long"
},
"title":{
"type": "text",
"analyzer": "english"
},
"content":{
"type": "text"
},
"post_date":{
"type": "date"
},
"publisher_id":{
// the next two lines can be replaced with "type": "keyword"
"type": "string",
"index":"not_analyzed"
}
}
}
}
}
- Modify: add a new field (the type of an existing field cannot be changed)
PUT /website/_mapping/article
{
"properties": {
"new_field":{
// the next two lines can be replaced with "type": "keyword"
"type": "string",
"index":"not_analyzed"
}
}
}
Testing the mapping and index we just created
Testing the content field
GET /website/_analyze
{
"field": "content", // uses the standard analyzer by default
"text": "my-dogs"
}
Result:
{
"tokens": [
{
"token": "my",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dogs",
"start_offset": 3,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Testing new_field
GET /website/_analyze
{
"field": "new_field",
"text":"my-dogs"
}
The request fails because this field was mapped as not analyzed; it is an exact value:
{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[XHoQN0O][127.0.0.1:9300][indices:admin/analyze[s]]"
}
],
"type": "illegal_argument_exception",
"reason": "Can't process field [new_field], Analysis requests are only supported on tokenized fields"
},
"status": 400
}
Complex data types
PUT /company/employee/1
{
"address":{
"country":"china",
"provice":"beijing",
"city":"beijing"
},
"name":"lili",
"age":"18"
}
Mapping of the employee type
GET /company/_mapping/employee
{
"company": {
"mappings": {
"employee": {
"properties": {
"address": {
"properties": {
"city": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"country": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"provice": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"age": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Under the hood, this complex type is stored in a form similar to
{
"name":[lili],
"age":[18],
"address.country":[china],
"address.provice":[beijing],
"address.city":[beijing]
}
Underlying storage for a more complex structure
{
"author":[
{"age":26,"name":"Jack White"},
{"age":55,"name":"Tom Jones"},
{"age":39,"name":"Kitty Smith"}
]
}
// underlying storage
{
"author.age":[26,55,39],
"author.name":[jack,white,tom,jones,kitty,smith]
}
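The flattening shown above can be sketched in Python as follows. `flatten` is a hypothetical helper written for illustration; it only collects values under dotted paths and does not tokenize text, so names stay whole here, whereas the storage shown above has already split them into terms such as jack and white.

```python
def flatten(doc, prefix=""):
    """Collect every leaf value of a nested document under its dotted path."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            # Inner object: recurse with the path extended by a dot.
            for k, v in flatten(value, path + ".").items():
                flat.setdefault(k, []).extend(v)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    # Array of objects: merge each object's fields together.
                    for k, v in flatten(item, path + ".").items():
                        flat.setdefault(k, []).extend(v)
                else:
                    flat.setdefault(path, []).append(item)
        else:
            flat.setdefault(path, []).append(value)
    return flat


authors = {
    "author": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}
print(flatten(authors))
```

One consequence of this layout is that the link between a given age and a given name is lost: after flattening, 26 and "Tom Jones" sit in the same document with nothing tying each value to its original object.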