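1. What is recall?
Suppose you search for "java spark" and the index holds 100 docs in total. Recall is how many of those docs can be returned as results (strictly speaking, the fraction of the relevant docs that the search actually returns).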
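2. What is precision?
Suppose you search for "java spark". Precision is about whether the docs that contain the phrase "java spark", or that contain java and spark close to each other, can be ranked at the top as far as possible.
Using a match_phrase query on its own means that every term must appear in the doc field, and the terms must lie within the distance allowed by slop, before the doc can match.
match_phrase (proximity match) requires a doc to contain all of the terms before it can be returned as a result; a doc that is missing even one term is not returned.
For example:
java spark --> hello world java : no match
java spark --> hello world, java spark : match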
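The all-terms requirement can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: it tokenizes on non-letters, ignores slop and scoring entirely, and the function names `contains_all_terms` and `contains_any_term` are made up for illustration; they are not part of Elasticsearch.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z]+", text.lower())

def contains_all_terms(query, doc):
    """Necessary condition for a match_phrase hit: every query term occurs in the doc."""
    doc_terms = set(tokenize(doc))
    return all(term in doc_terms for term in tokenize(query))

def contains_any_term(query, doc):
    """Condition for a default match hit (operator "or"): at least one query term occurs."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

# The two example docs from the text above.
docs = ["hello world java", "hello world, java spark"]
for doc in docs:
    print(doc,
          "| all terms:", contains_all_terms("java spark", doc),
          "| any term:", contains_any_term("java spark", doc))
```

The first doc passes only the "any term" check, which is why a plain match query would return it while match_phrase would not.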
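3. The problem
With proximity matching alone, recall is fairly low and precision is very high. Sometimes, though, we want a doc to be returned when it matches only some of the terms, which raises recall. At the same time we still want to use match_phrase's ability to boost the score by distance, so that the closer the terms are to each other, the higher the score and the earlier the doc is returned.
In other words, recall comes first. For example:
java spark --> docs containing java are returned, docs containing spark are returned, and docs containing both java and spark are returned. Precision is still taken into account: among the docs that contain both java and spark, the ones where the two terms are closest together rank first.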
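4. The solution
You can use a bool query to combine a match query with a match_phrase query and get the effect described above.
match raises recall: docs containing java and docs containing spark are all returned.
match_phrase raises precision: it makes sure the docs containing both java and spark rank first.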
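As a rough sketch of how that combination is put together, the Python helper below builds the request body as a plain dict. The function name `build_recall_then_proximity_query` and its parameters are hypothetical, chosen only for this illustration; the query keys themselves (`bool`, `must`, `should`, `match`, `match_phrase`, `slop`) are the standard query DSL names.

```python
import json

def build_recall_then_proximity_query(field, text, slop):
    """Build a bool query body (hypothetical helper, not an Elasticsearch API).

    must   -> match: decides which docs are returned (recall)
    should -> match_phrase with slop: optional clause that only adds score
              to docs where all terms occur close together (ranking)
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }

body = build_recall_then_proximity_query("content", "java spark", 50)
print(json.dumps(body, indent=2))
```

Because the match_phrase clause sits under should rather than must, a doc that fails it is still returned; it simply misses out on the extra score.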
Effect 1: a bool query with match only
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}
Result:
{
  "took": 54,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.68324494,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
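The result shows that docs containing only java or only spark are returned as well. However, the doc containing only java (_id 2) is ranked first, while the doc containing both java and spark (_id 5) is ranked last.
Effect 2: match_phrase only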
GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 50
      }
    }
  }
}
Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
The result shows that only the doc containing both java and spark is returned, so recall has dropped.
Final effect: combine the two, using both the bool match query and match_phrase
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
              "slop": 50
            }
          }
        }
      ]
    }
  }
}
Result:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams",
          "new_author_last_name": "Williams",
          "new_author_first_name": "Smith"
        }
      }
    ]
  }
}
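The result is what we wanted: the doc containing both terms (_id 5) is ranked first, with a score well above the second doc (1.258609 versus 0.68640786), and recall is still high because both docs are returned.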
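Where the gap comes from can be checked against the scores in the three responses. A bool query adds up the scores of its matching clauses, so doc 5's combined score should be roughly its match score plus its match_phrase score, while doc 2 only earns the match score. The small Python check below uses only numbers copied from those responses; the tolerance of 1e-6 is my own assumption, there to absorb rounding in the printed values.

```python
import math

# Scores copied from the three responses in this article.
match_score_doc5 = 0.68324494     # effect 1: match only, doc 5
phrase_score_doc5 = 0.5753642     # effect 2: match_phrase only, doc 5
combined_score_doc5 = 1.258609    # final effect: must match + should match_phrase, doc 5
match_score_doc2 = 0.68640786     # doc 2 matches only the must clause

# Sum of the two clause scores for doc 5.
expected_doc5 = match_score_doc5 + phrase_score_doc5

# Does the sum agree with the reported combined score, up to rounding?
sum_matches = math.isclose(expected_doc5, combined_score_doc5, abs_tol=1e-6)

# Does the should clause lift doc 5 above doc 2?
doc5_ranks_first = combined_score_doc5 > match_score_doc2

print(sum_matches, doc5_ranks_first)
```

Doc 2 keeps exactly the score it had with match alone, which is consistent with the should clause contributing nothing to docs that fail the phrase match.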