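Introduction
Solr is an open-source search engine whose core technology is built on Lucene.
Since search engines came up: the book Solr in Action explains in detail when a search engine is a good fit and how it differs from a database. I learned a lot from it, so here is an excerpt: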
1. Search engine
Search engines like Solr are optimized to handle data exhibiting four main characteristics:
- Text-centric
- Read-dominant
- Document-oriented
- Flexible schema
1.1 Text-centric
A search engine supports non-text data such as dates and numbers, but its primary strength is handling text data based on natural language.
If users aren’t interested in the information in the text, a search engine may not be the best solution for your problem.
Think about whether your data is text-centric. The main consideration is whether or not the text fields in your data contain information that users will want to query.
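Search engines such as Solr are optimized for searching text written in natural language. Emails, web pages, résumés, PDF documents, and social content such as tweets, Weibo posts, and blog posts are all good candidates for Solr.
1.2 Read-dominant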
Think of read-dominant as meaning that documents are read far more often than they’re created or updated.
If you must update existing data in a search engine often, that could be an indication that a search engine might not be the best solution for your needs. Another NoSQL technology, like Cassandra, might be a better choice when you need fast random writes to existing data.
1.3 Document-oriented
In a search engine, a document is a self-contained collection of fields, in which each field only holds data (can have multiple values) and doesn’t contain subfields.
A search engine isn’t the place to store data unless it’s useful for search or displaying results.
In general, you should store the minimal set of information for each document needed to satisfy search requirements.
1.4 Flexible schema
In a relational database, every row in a table has the same structure. In Solr, documents can have different fields.
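To make the flexible-schema point concrete, here is a small sketch of a Solr XML update message in which two documents carry different fields. The field names (title_txt, skills_txt) and values are made up for illustration, and the sketch assumes the schema accepts them, for example through a *_txt dynamic field.

```xml
<!-- Hypothetical update message: two documents with different field sets -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- only this document has a title -->
    <field name="title_txt">Getting started with Solr</field>
  </doc>
  <doc>
    <field name="id">doc-2</field>
    <!-- this document has no title, but carries a field the first one lacks -->
    <field name="skills_txt">Java, Lucene, search relevance</field>
  </doc>
</add>
```

Both documents can live in the same index even though, apart from id, they share no fields.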
2. Solr vs Lucene
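The differences between the two are:
- Lucene is essentially a search library, not a standalone application; Solr is a standalone application
- Lucene focuses on the low-level building blocks of search, while Solr focuses on enterprise applications
- Lucene does not handle the management needed to run a search service; Solr does
In one sentence: Solr is an extension of Lucene aimed at enterprise search applications.
Solr provides faceted search and hit highlighting, and supports multiple output formats (including XML/XSLT and JSON). It ships with an HTTP-based administration interface. Solr's features include:
- Advanced full-text search capabilities
- A real data schema with dynamic fields (Dynamic Field) and a unique key (Unique Key)
- Optimization for high-volume web traffic
- Standards based on open interfaces (XML and HTTP)
- A comprehensive HTML administration interface
- Scalability: efficient replication to other Solr search servers
- Flexibility and adaptability through XML configuration
- An extensible plugin architecture
- Dynamic grouping and filtering of results
- A highly configurable and extensible caching mechanism
Because Solr wraps and extends Lucene, the two share much of the same terminology.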
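2. Solr configuration
2.1 solrconfig.xml
Defines Solr's request handlers and some extension components. It contains many settings, but most of them can be left at their defaults.
- dataDir: where the index is stored
- autoCommit: while indexing, Solr does not write incoming documents to the index files immediately. It buffers them first and writes the buffered data to the index files only when a commit happens. autoCommit triggers that commit automatically under the conditions below.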
- maxDocs:
Maximum number of documents to add since the last commit before automatically triggering a new commit.
- maxTime:
Maximum amount of time in ms that is allowed to pass since a document was added before automatically triggering a new commit.
- openSearcher:
if false, the commit causes recent index changes to be flushed to stable storage, but does not cause a new searcher to be opened to make those changes visible.
- autoSoftCommit:
softAutoCommit is like autoCommit except it causes a 'soft' commit which only ensures that changes are visible but does not ensure that data is synced to disk. This is faster and more near-realtime friendly than a hard commit.
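Put together, the settings above might look like the following solrconfig.xml fragment. This is only a sketch: the numeric values are illustrative assumptions, not recommendations, and the surrounding elements are trimmed to what the settings need.

```xml
<config>
  <!-- Index location; falls back to Solr's default when the property is unset -->
  <dataDir>${solr.data.dir:}</dataDir>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Hard commit: flush to stable storage without opening a new searcher -->
    <autoCommit>
      <maxDocs>10000</maxDocs>   <!-- illustrative: commit after 10000 added docs -->
      <maxTime>15000</maxTime>   <!-- illustrative: or 15 s (in ms) after an add -->
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- Soft commit: makes changes visible quickly, no sync to disk -->
    <autoSoftCommit>
      <maxTime>1000</maxTime>    <!-- illustrative: 1 s (in ms) -->
    </autoSoftCommit>
  </updateHandler>
</config>
```

With this combination, durability comes from the hard commit and search visibility from the soft commit, which is why openSearcher can stay false.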
2.2 managed-schema
Defines the fields and field types of the index.
2.2.1 fieldType: field types (int, float, string, ik, ...)
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ik" class="solr.TextField" sortMissingLast="true" omitNorms="true" autoGeneratePhraseQueries="false">
<analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
<analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
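2.2.2 field: a field definition, giving the field's name and its type
- name: the field name
- type: the field type
- indexed: whether the field is indexed
- stored: whether the field value is stored; if it is not stored, the field can still be searched but its content cannot be displayed
- required: whether the field is mandatory; if so, every document must supply a value, otherwise indexing fails with an error
- multiValued: whether multiple values are allowed
- docValues: whether the value is also kept in a column-oriented structure, which helps sorting and faceting
- sortMissingLast: when sorting, whether documents without a value for this field are placed after those that have one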
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
2.2.3 dynamicFields
A dynamic field acts as a fallback: if no explicit definition is found for a field name, Solr looks for a dynamic field pattern that matches it.
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
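As a sketch of how the pattern is used, the document below sends a field that has no explicit definition but whose name ends in _txt, so it would be handled by the *_txt rule. The field name comments_txt and its values are hypothetical.

```xml
<!-- Hypothetical document: "comments_txt" is not declared explicitly,
     so it falls through to the *_txt dynamic field (type text_general) -->
<add>
  <doc>
    <field name="id">doc-42</field>
    <!-- multiValued="true" on the dynamic field allows repeating the field -->
    <field name="comments_txt">First comment</field>
    <field name="comments_txt">Second comment</field>
  </doc>
</add>
```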
2.2.4 copyField
Copies the source field into the destination field; maxChars limits the number of characters copied.
<copyField source="body" dest="teaser" maxChars="300"/>
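For the copy to work, both fields have to exist in the schema. A minimal sketch, assuming body and teaser are both plain text fields (the types and attributes here are my assumptions, not taken from a real schema):

```xml
<!-- Assumed definitions for the two fields named in the copyField rule -->
<field name="body" type="text_general" indexed="true" stored="true"/>
<!-- teaser receives at most the first 300 characters of body -->
<field name="teaser" type="text_general" indexed="true" stored="true"/>
<copyField source="body" dest="teaser" maxChars="300"/>
```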
2.2.5 uniqueKey
Equivalent to a primary key in a database. If a document with a duplicate key is indexed, it overwrites the earlier record.
<uniqueKey>id</uniqueKey>
2.2.6 defaultSearchField
The default field to search when the query does not name a specific field.
<defaultSearchField>text</defaultSearchField>
2.2.7 solrQueryParser
Configures the default operator applied between query terms; it can be AND or OR.
<solrQueryParser defaultOperator="OR" />
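As far as I know, newer Solr releases no longer accept defaultSearchField and solrQueryParser in the schema and expect the df and q.op request parameters instead. A sketch of setting the same defaults on a request handler in solrconfig.xml, assuming a handler registered at /select and a field named text:

```xml
<!-- Sketch: the same defaults expressed as request handler parameters -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text</str>   <!-- default search field -->
    <str name="q.op">OR</str>   <!-- default operator between terms -->
  </lst>
</requestHandler>
```

The same two parameters can also be passed on an individual query, which overrides the handler defaults.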