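Elasticsearch Analysis (Tokenization)
Analysis happens at two points: at search time and at index time.
Search-time analysis happens when a user runs a query. ES analyzes the user's keywords on the fly, the resulting tokens live only in memory, and they are discarded as soon as the query completes. Index-time analysis happens when a document is written. ES analyzes the document and stores the result in the inverted index, which is ultimately persisted as files on disk and is not lost when a query finishes or ES restarts.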
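The difference between the two can be sketched with a toy inverted index. This is only an illustration of the idea, not how ES or Lucene is implemented: the class name `TinyInvertedIndex` and its lowercase-and-split analyzer are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (hypothetical): index-time tokens are stored, search-time tokens are transient.
class TinyInvertedIndex {
    // term -> ids of the documents that contain it; a real engine persists this structure on disk
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in analyzer for this sketch: lowercase, then split on whitespace
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Index-time analysis: the tokens are written into the postings map and kept
    void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
        }
    }

    // Search-time analysis: the query tokens are local values that vanish when the method returns
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(query)) {
            hits.addAll(postings.getOrDefault(token, Collections.emptySet()));
        }
        return hits;
    }
}
```

Indexing "Hello World" as document 1 and then searching for "HELLO" finds document 1, because both sides went through the same analyzer. Only the indexed side left anything behind in `postings`.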
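The index-time analyzer must be specified in the mapping. Once set for a field it cannot be changed; to change it you have to create a new index.
In ES, analysis is performed by an analyzer, which defines the tokenization rules. ES ships with many built-in analyzers, such as Standard, English, Keyword, and Whitespace, and the default is Standard. Their building blocks (character filters, tokenizers, and token filters) can be combined into whatever analysis rules you need. For details on each analyzer, see the analyzer reference on the official site.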
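To make the idea of combining building blocks concrete, here is a minimal sketch of an analyzer as a three-stage pipeline: character filter, tokenizer, token filter. The class `AnalyzerPipelineSketch` and its method names are hypothetical. Real ES token filters can also add or drop tokens, which this one-to-one version ignores.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Simplified model (hypothetical) of an analyzer: char filter -> tokenizer -> token filter
class AnalyzerPipelineSketch {
    private final Function<String, String> charFilter;
    private final Function<String, List<String>> tokenizer;
    private final Function<String, String> tokenFilter;

    AnalyzerPipelineSketch(Function<String, String> charFilter,
                           Function<String, List<String>> tokenizer,
                           Function<String, String> tokenFilter) {
        this.charFilter = charFilter;
        this.tokenizer = tokenizer;
        this.tokenFilter = tokenFilter;
    }

    // Runs the three stages in order and returns the resulting tokens
    List<String> analyze(String text) {
        return tokenizer.apply(charFilter.apply(text)).stream()
                .map(tokenFilter)
                .collect(Collectors.toList());
    }

    // Splits on runs of whitespace, loosely like the Whitespace tokenizer
    static List<String> whitespaceTokenizer(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Emits the whole input as a single token, like the Keyword tokenizer
    static List<String> keywordTokenizer(String text) {
        return Collections.singletonList(text);
    }
}
```

Pairing `whitespaceTokenizer` with a lowercasing token filter turns "Quick Brown Fox" into three lowercase tokens, while `keywordTokenizer` with no filtering keeps the whole string as one token. Swapping a single stage is all it takes to get a different analyzer.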
For Chinese analysis, pinyin analysis, and Traditional/Simplified conversion, the plugins most widely used in China are:
elasticsearch-analysis-ik
elasticsearch-analysis-pinyin
elasticsearch-analysis-stconvert
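Each project's page lists the release that matches your Elasticsearch version; install that one.
I have also packaged a Docker image of Elasticsearch 5.5.0 that adds the three plugins above on top of the official image: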
liaodashuai/elasticsearch:1.0.2
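That is enough background. Next, we integrate with Spring Boot.
Goals:
- Build a match-search API and a highlighted-search API
- Make search work with Chinese, pinyin, and both Traditional and Simplified characters
- Leave room for many other search styles, exercised here through simple test cases
Integrating Spring Boot: highlighting and pinyin search
- Add the dependencies. Spring Boot 2.0.x only supports Elasticsearch 5.x, so keep the versions aligned to avoid surprises.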
compile group: 'org.springframework.boot', name: 'spring-boot-starter-data-elasticsearch', version: '2.0.6.RELEASE'
compile 'org.elasticsearch.client:x-pack-transport:5.5.0'
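- Configure the ES connection
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.xpack.client.PreBuiltXPackTransportClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EsConfiguration {

    private static final Logger log = LoggerFactory.getLogger(EsConfiguration.class);

    private Client esClient;

    /**
     * Builds the transport client.
     * If X-PACK is installed, the user credentials must be configured here.
     *
     * @return the transport client
     */
    @Bean
    public Client transportClient() {
        TransportClient client = null;
        try {
            client = new PreBuiltXPackTransportClient(Settings.builder()
                    // Sniff the cluster state
                    // .put("client.transport.sniff", true)
                    .put("cluster.name", "docker-cluster")
                    // Login credentials, required when the X-Pack plugin is installed
                    .put("xpack.security.user", "elastic:changeme")
                    .build())
                    .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("120.79.58.138"), 9300));
        } catch (UnknownHostException e) {
            // Log the cause as well, so the unresolved host shows up in the stack trace
            log.error("Failed to connect to Elasticsearch!", e);
        }
        return client;
    }

    /**
     * Reuses one TransportClient instead of creating and releasing one on every call.
     */
    public Client esTemplate() {
        if (esClient == null) {
            esClient = transportClient();
        }
        return esClient;
    }
}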
- Configure the entity mapping
@Document(indexName = "film-entity", type = "film")
@Setting(settingPath = "/json/film-setting.json")
@Mapping(mappingPath = "/json/film-mapping.json")
public class FilmEntity {
    @Id
    private Long id;
    // @Field(type = FieldType.Text, searchAnalyzer = "ik_max_word", analyzer = "ik_smart")
    private String name;
    private String nameOri;
    private String publishDate;
    private String type;
    private String language;
    private String fileDuration;
    private String director;
    // @Field(type = FieldType.Date)
    private Date created;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getNameOri() {
        return nameOri;
    }

    public void setNameOri(String nameOri) {
        this.nameOri = nameOri;
    }

    public String getPublishDate() {
        return publishDate;
    }

    public void setPublishDate(String publishDate) {
        this.publishDate = publishDate;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    public String getLanguage() {
        return language;
    }

    public void setLanguage(String language) {
        this.language = language;
    }

    public String getFileDuration() {
        return fileDuration;
    }

    public void setFileDuration(String fileDuration) {
        this.fileDuration = fileDuration;
    }

    public String getDirector() {
        return director;
    }

    public void setDirector(String director) {
        this.director = director;
    }

    public Date getCreated() {
        return created;
    }

    public void setCreated(Date created) {
        this.created = created;
    }

    public Long getId() {
        return id;
    }

    public void setId(Long id) {
        this.id = id;
    }

    @Override
    public String toString() {
        return "FilmEntity [id=" + id + ", name=" + name + ", director=" + director + "]";
    }
}
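The model above deserves a short explanation. Spring Boot (through Spring Data Elasticsearch) offers several ways to define the mapping, and you can use whichever you prefer. I chose the @Mapping annotation, which takes the mapping in the native ES format. It is a little more work, but it is more explicit and not tied to Java: the same JSON can be used to create the mapping directly with curl or from the ES console.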
film-mapping.json
{
  "film": {
    "_all": {
      "enabled": true
    },
    "properties": {
      "id": {
        "type": "long"
      },
      "name": {
        "type": "text",
        "analyzer": "ikSearchAnalyzer",
        "search_analyzer": "ikSearchAnalyzer",
        "fields": {
          "pinyin": {
            "type": "text",
            "analyzer": "pinyinSimpleIndexAnalyzer",
            "search_analyzer": "pinyinSimpleIndexAnalyzer"
          }
        }
      },
      "nameOri": {
        "type": "text"
      },
      "publishDate": {
        "type": "text"
      },
      "type": {
        "type": "text"
      },
      "language": {
        "type": "text"
      },
      "fileDuration": {
        "type": "text"
      },
      "director": {
        "type": "text",
        "index": true,
        "analyzer": "ikSearchAnalyzer"
      },
      "created": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
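Besides @Mapping, Spring Boot also provides another powerful annotation, @Setting. It lets us configure index-level properties for the current index and corresponds to the settings section in Elasticsearch. For example: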
film-setting.json
{
  "index": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 50
        },
        "pinyin_simple_filter": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": " ",
          "limit_first_letter_length": 50,
          "lowercase": true
        }
      },
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "analyzer": {
        "ikSearchAnalyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "char_filter": [
            "tsconvert"
          ]
        },
        "pinyinSimpleIndexAnalyzer": {
          "tokenizer": "keyword",
          "filter": [
            "pinyin_simple_filter",
            "edge_ngram_filter",
            "lowercase"
          ]
        }
      }
    }
  }
}
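The JSON above defines two analyzers, ikSearchAnalyzer and pinyinSimpleIndexAnalyzer. The first uses the IK Chinese tokenizer (ik_max_word) together with a Traditional-to-Simplified char_filter, so any field that references it is automatically tokenized as Chinese and converted from Traditional to Simplified characters.
pinyinSimpleIndexAnalyzer keeps the input as a single token with the keyword tokenizer, converts it with the pinyin token filter, and then applies the edge_ngram filter and the lowercase filter.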
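To see what the edge_ngram step contributes, here is a rough sketch of edge n-gram generation with the same `min_gram` and `max_gram` meaning as in the settings. `EdgeNGramSketch` is a made-up helper for illustration, not the Lucene filter itself. It works on Java `char` units, which is fine for ASCII pinyin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: emits the prefixes of a token, from minGram up to maxGram characters long
class EdgeNGramSketch {
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // A prefix can never be longer than the token itself
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }
}
```

With `min_gram` 1 and `max_gram` 50, the token "ceshi" becomes "c", "ce", "ces", "cesh", and "ceshi". Because every prefix is indexed, a partially typed pinyin string can still match the field.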
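With the settings above in place, start the application and open the head plugin. The elasticsearch-head Chrome extension works too.
Once the index has been created, you can start testing queries.
ElasticsearchRepository<T, ID>, which Spring Boot provides, is enough for simple queries. It has its limits, though: some of the more complex queries have to be built by hand with the client and read back from a SearchResponse.
We start with a simple, ordinary query. It works through a repository that extends ElasticsearchRepository<T, ID>, which already provides basic CRUD.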
public interface FilmDao extends ElasticsearchRepository<FilmEntity, Long> {
}
First, create a few test records:
Service class: building the query
/**
 * Builds and runs the search.
 *
 * @param name     the text to search for
 * @param director the director (currently unused: the same text is matched against both name and director)
 * @return the matching films
 */
public List<FilmEntity> search(String name, String director) {
    // Mixed Chinese/pinyin search that keeps the highest score; for the scoring rules see:
    // https://blog.csdn.net/paditang/article/details/79098830
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withQuery(structureQuery(name))
            .build();
    return filmDao.search(searchQuery).getContent();
}

/**
 * Mixed Chinese and pinyin search.
 *
 * @param content the content
 * @return dis max query builder
 */
public DisMaxQueryBuilder structureQuery(String content) {
    // dis_max scores a document by its best-scoring sub-query
    DisMaxQueryBuilder disMaxQueryBuilder = QueryBuilders.disMaxQuery();
    // boost sets the weight; only the name and director fields are searched
    QueryBuilder ikNameQuery = QueryBuilders.matchQuery("name", content).boost(2f);
    QueryBuilder pinyinNameQuery = QueryBuilders.matchQuery("name.pinyin", content);
    QueryBuilder ikDirectorQuery = QueryBuilders.matchQuery("director", content).boost(2f);
    disMaxQueryBuilder.add(ikNameQuery);
    disMaxQueryBuilder.add(pinyinNameQuery);
    disMaxQueryBuilder.add(ikDirectorQuery);
    return disMaxQueryBuilder;
}
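How dis_max combines the three sub-queries can be sketched in a few lines. The final score is the best sub-query score plus `tie_breaker` times the sum of the remaining ones, and `tie_breaker` defaults to 0. `DisMaxSketch` is an illustrative helper only: the scores passed in are assumed to be already boosted, and the real per-field scores come from Lucene.

```java
// Hypothetical helper: combines sub-query scores the way dis_max does
class DisMaxSketch {
    static double disMax(double tieBreaker, double... scores) {
        double max = 0.0;
        double sum = 0.0;
        for (double score : scores) {
            max = Math.max(max, score);
            sum += score;
        }
        // Best score, plus a fraction of every other matching clause's score
        return max + tieBreaker * (sum - max);
    }
}
```

With the default tie breaker of 0, a document whose boosted name score is 3.0 and whose other clauses score 0.8 each ends up with 3.0: only the best field counts. That is why a strong match on one field is not outranked by weak matches spread across several fields.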
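Searching for the pinyin "ceshi" returns the matching records, and Chinese input works as well:
Searching for the Simplified characters "测试" returns the matching records.
Service class: building the highlighted query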
public List<FilmEntity> search(String query) {
    Client client = esConfig.esTemplate();
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    // Highlighting rules
    highlightBuilder.preTags("<span style='color:green'>");
    highlightBuilder.postTags("</span>");
    // Fields to highlight
    highlightBuilder.field("name");
    highlightBuilder.field("name.pinyin");
    highlightBuilder.field("director");
    String[] fields = {"name", "name.pinyin", "director"};
    QueryBuilder matchQuery = QueryBuilders.multiMatchQuery(query, fields);
    // Run the search
    SearchResponse response = client.prepareSearch("film-entity")
            .setQuery(matchQuery)
            .highlighter(highlightBuilder)
            .execute().actionGet();
    SearchHits searchHits = response.getHits();
    System.out.println("Total hits --> " + searchHits.getTotalHits());
    List<FilmEntity> list = new ArrayList<>();
    for (SearchHit hit : searchHits) {
        FilmEntity entity = new FilmEntity();
        Map<String, Object> entityMap = hit.getSourceAsMap();
        Map<String, HighlightField> highlights = hit.getHighlightFields();
        System.out.println(highlights);
        // Map the stored source to the entity first
        if (!CollectionUtils.isEmpty(entityMap)) {
            if (entityMap.get("id") != null) {
                entity.setId(Long.valueOf(String.valueOf(entityMap.get("id"))));
            }
            if (entityMap.get("language") != null) {
                entity.setLanguage(String.valueOf(entityMap.get("language")));
            }
            entity.setName(String.valueOf(entityMap.get("name")));
            entity.setDirector(String.valueOf(entityMap.get("director")));
        }
        // Then overlay the highlighted fragments; only the first fragment of each field is used.
        // Applying them last keeps a director highlight from resetting a highlighted name.
        if (highlights.get("name") != null) {
            entity.setName(highlights.get("name").getFragments()[0].toString());
        }
        if (highlights.get("name.pinyin") != null) {
            entity.setName(highlights.get("name.pinyin").getFragments()[0].toString());
        }
        if (highlights.get("director") != null) {
            entity.setDirector(highlights.get("director").getFragments()[0].toString());
        }
        list.add(entity);
    }
    return list;
}
The code above configures the highlight fields [name, name.pinyin, director]. Any match in one of these three fields is wrapped in the custom highlight tags:
<span style='color:green'>...</span>
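The effect of `preTags` and `postTags` can be imitated with a naive substring version. This is only a sketch of the wrapping step: the ES highlighter works from analyzed tokens and their offsets, not from plain substring search, and `HighlightSketch` is a hypothetical name.

```java
// Hypothetical helper: wraps every literal occurrence of a term in the given tags
class HighlightSketch {
    static String highlight(String text, String term, String preTag, String postTag) {
        if (term.isEmpty()) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        int at;
        while ((at = text.indexOf(term, from)) >= 0) {
            // Copy the unmatched part, then the match wrapped in the tags
            out.append(text, from, at).append(preTag).append(term).append(postTag);
            from = at + term.length();
        }
        // Append whatever follows the last match
        return out.append(text.substring(from)).toString();
    }
}
```

Called with the green span tags from the service class, it returns the text with each occurrence of the term wrapped in the tags, which is the shape of the fragments the service reads from `getHighlightFields()`.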
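Searching for the pinyin "ceshi" returns the matching records, and Chinese input works as well:
Searching for the Simplified characters "测试" returns the matching records.
Searching for the Traditional characters "認爲" returns the matching records. Because of the pinyin analyzer, 小王 is returned as well.
A hit can in fact contain several highlighted fragments; only the first one is used here for demonstration.
You are probably curious how the text actually gets tokenized, so I added a dedicated endpoint that shows how the search input is analyzed.
API test:
The result: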
{
  "result": [
    {
      "term": "xiao",
      "startOffset": 0,
      "endOffset": 2,
      "position": 0,
      "positionLength": 1,
      "attributes": null,
      "type": "CN_WORD",
      "fragment": false
    },
    {
      "term": "xm",
      "startOffset": 0,
      "endOffset": 2,
      "position": 0,
      "positionLength": 1,
      "attributes": null,
      "type": "CN_WORD",
      "fragment": false
    },
    {
      "term": "ming",
      "startOffset": 0,
      "endOffset": 2,
      "position": 1,
      "positionLength": 1,
      "attributes": null,
      "type": "CN_WORD",
      "fragment": false
    }
  ],
  "msg": "",
  "code": 200,
  "is_success": true
}
As you can see, our analyzer is working.
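The tokens in that response are what a two-character input such as 小明 would plausibly produce: the full pinyin of each character plus the joined first letters. The request itself is not shown, so that input is a guess. The toy sketch below illustrates the idea. The two-entry lookup table and the class `PinyinTokenSketch` are invented for this example, and neither the algorithm nor the token order is the pinyin plugin's own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration (hypothetical): full pinyin per character plus the joined first letters
class PinyinTokenSketch {
    // Made-up two-entry table; a real plugin ships a full dictionary
    private static final Map<Character, String> PINYIN = new HashMap<>();

    static {
        PINYIN.put('\u5c0f', "xiao"); // the character "xiao" (small)
        PINYIN.put('\u660e', "ming"); // the character "ming" (bright)
    }

    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder firstLetters = new StringBuilder();
        for (char c : text.toCharArray()) {
            String pinyin = PINYIN.get(c);
            if (pinyin != null) {
                result.add(pinyin);                     // full pinyin of this character
                firstLetters.append(pinyin.charAt(0));  // collect its initial
            }
        }
        if (firstLetters.length() > 0) {
            result.add(firstLetters.toString());        // e.g. "xm"
        }
        return result;
    }
}
```

Indexing both forms is what lets a user type either the full pinyin or just the initials and still find the record.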
The source code for these examples has been uploaded to GitHub: https://github.com/liaozihong/SpringBoot-Learning/tree/master/SpringBoot-Elasticsearch-Query