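Index architecture
A Lucene index is built from five basic structures: index, segment, document, field, and term.
index: simply a directory
segment: an abstract concept made up of a set of index files
document: a document, for example a web page; one or more documents make up a segment
field: similar to a column in a database. A document contains multiple fields; a web page, for example, contains a title, an author, and content. Different fields can be indexed in different ways
term: the smallest unit of indexing, a string produced by lexical analysis and language processing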
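The containment hierarchy described above can be sketched as a few plain data classes. This is a toy model for illustration only, not Lucene's API: the class names, the `terms_of` helper, and the whitespace tokenizer are all assumptions, and real Lucene analysis is far richer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    # Field name -> raw text, e.g. {"title": "...", "author": "..."}.
    fields: Dict[str, str]


@dataclass
class Segment:
    # A segment groups one or more documents.
    documents: List[Document] = field(default_factory=list)


@dataclass
class Index:
    # An index (a directory on disk in Lucene) groups segments.
    segments: List[Segment] = field(default_factory=list)


def terms_of(doc: Document, field_name: str) -> List[str]:
    """Toy analysis (an assumption): lowercase the text and split on whitespace."""
    return doc.fields.get(field_name, "").lower().split()


# A hypothetical web page with three fields.
page = Document({
    "title": "Lucene Index Files",
    "author": "someone",
    "content": "An index is a directory",
})
index = Index([Segment([page])])

print(terms_of(index.segments[0].documents[0], "title"))
# ['lucene', 'index', 'files']
```

Each field is analyzed on its own, which is why different fields of the same document can be indexed differently.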
Each segment index maintains the following:
Segment info. This contains metadata about a segment, such as the number of documents and which files it uses.
Field names. This contains the set of field names used in the index.
Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.
Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.
Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.
Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.
Per-document values. Like stored values, these are also keyed by document number, but are generally intended to be loaded into main memory for fast access. Whereas stored values are generally intended for summary results from searches, per-document values are useful for things like scoring factors.
Live documents. An optional file indicating which documents are live.
Point values. Optional pair of files, recording dimensionally indexed fields, to enable fast numeric range filtering and large numeric values like BigInteger and BigDecimal (1D) and geographic shape intersection (2D, 3D).
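To make several of the components above concrete, here is a minimal in-memory sketch of one segment: stored field values keyed by document number, a term dictionary with document frequency, per-document term frequency and positions, and a set of live documents. The `ToySegment` class and its method names are hypothetical, only one field is indexed (an assumption made to keep the sketch short), and none of this reflects Lucene's on-disk encoding.

```python
from typing import Dict, List, Set


class ToySegment:
    def __init__(self, indexed_field: str = "content") -> None:
        self.indexed_field = indexed_field
        # Stored field values, keyed by document number (the list index).
        self.stored: List[Dict[str, str]] = []
        # Term dictionary: term -> {document number -> positions of the term}.
        self.postings: Dict[str, Dict[int, List[int]]] = {}
        # Live documents: document numbers that have not been deleted.
        self.live: Set[int] = set()

    def add(self, fields: Dict[str, str]) -> int:
        doc_id = len(self.stored)
        self.stored.append(dict(fields))
        self.live.add(doc_id)
        tokens = fields.get(self.indexed_field, "").lower().split()
        for position, term in enumerate(tokens):
            self.postings.setdefault(term, {}).setdefault(doc_id, []).append(position)
        return doc_id

    def doc_freq(self, term: str) -> int:
        # Number of documents that contain the term.
        return len(self.postings.get(term, {}))

    def freq(self, term: str, doc_id: int) -> int:
        # Term frequency data: occurrences of the term in one document.
        return len(self.postings.get(term, {}).get(doc_id, []))

    def positions(self, term: str, doc_id: int) -> List[int]:
        # Term proximity data: where the term occurs in one document.
        return list(self.postings.get(term, {}).get(doc_id, []))

    def delete(self, doc_id: int) -> None:
        # Deleting only marks the document as not live; postings stay in place.
        self.live.discard(doc_id)

    def search(self, term: str) -> List[int]:
        return sorted(d for d in self.postings.get(term, {}) if d in self.live)


seg = ToySegment()
seg.add({"title": "A", "content": "to be or not to be"})
seg.add({"title": "B", "content": "to do"})

print(seg.doc_freq("to"), seg.freq("to", 0), seg.positions("be", 0))
# 2 2 [1, 5]
seg.delete(0)
print([seg.stored[d]["title"] for d in seg.search("to")])
# ['B']
```

The last two lines show why stored fields are what a hit returns: the search yields document numbers, and the stored values are looked up by that number.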
Index file extensions and their meanings:
The following table summarizes the names and extensions of the files in Lucene:
Name                  Extension    Description
Segments File         segments_N   Stores a commit point (checkpoint)
Lock File             write.lock   Write lock; prevents multiple IndexWriters from writing to the same files
Segment Info          .si          Stores metadata about a segment
Compound File         .cfs, .cfe   Optional; may or may not exist. Packs a segment's other index files into one "virtual" file so the system does not run out of file handles
Fields                .fnm         Field file; stores information about the fields
Field Index           .fdx         Contains pointers to field data
Field Data            .fdt         Field data file; contains the stored fields of documents
Term Dictionary       .tim         The term dictionary, stores term info
Term Index            .tip         The index into the Term Dictionary
Frequencies           .doc         Contains the list of docs which contain each term along with frequency
Positions             .pos         Stores position information about where a term occurs in the index
Payloads              .pay         Stores additional per-position metadata information such as character offsets and user payloads
Norms                 .nvd, .nvm   Encodes length and boost factors for docs and fields
Per-Document Values   .dvd, .dvm   Encodes additional scoring factors or other per-document information
Term Vector Index     .tvx         Stores offset into the document data file
Term Vector Data      .tvd         Contains term vector data
Live Documents        .liv         Info about what documents are live
Point values          .dii, .dim   Holds indexed points, if any
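The table above can be turned into a small lookup that labels the files found in an index directory. This is a sketch under assumptions: `classify` and `group_by_component` are hypothetical helper names, the file names in the example listing are made up for illustration, and the code only inspects names, it does not read any Lucene file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Extension -> component name, following the table above.
COMPONENT_BY_EXTENSION = {
    "si": "Segment Info",
    "cfs": "Compound File", "cfe": "Compound File",
    "fnm": "Fields", "fdx": "Field Index", "fdt": "Field Data",
    "tim": "Term Dictionary", "tip": "Term Index",
    "doc": "Frequencies", "pos": "Positions", "pay": "Payloads",
    "nvd": "Norms", "nvm": "Norms",
    "dvd": "Per-Document Values", "dvm": "Per-Document Values",
    "tvx": "Term Vector Index", "tvd": "Term Vector Data",
    "liv": "Live Documents",
    "dii": "Point values", "dim": "Point values",
}


def classify(file_name: str) -> str:
    """Return the component a file belongs to, or "unknown"."""
    if file_name == "write.lock":
        return "Lock File"
    if file_name.startswith("segments_"):
        return "Segments File"
    _, dot, extension = file_name.rpartition(".")
    if not dot:
        return "unknown"
    return COMPONENT_BY_EXTENSION.get(extension, "unknown")


def group_by_component(file_names: Iterable[str]) -> Dict[str, List[str]]:
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in file_names:
        groups[classify(name)].append(name)
    return dict(groups)


# Hypothetical directory listing, for illustration only.
listing = ["segments_2", "write.lock", "_0.si", "_0.fnm", "_0.fdx", "_0.fdt"]
for component, names in group_by_component(listing).items():
    print(f"{component}: {names}")
```

With a real directory, the listing could come from `os.listdir(path)`; files whose extension is not in the table are reported as "unknown" rather than guessed.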
Summary
Index -> Segments (segments.gen, segments_N) -> Field (.fnm, .fdx, .fdt) -> Term (.tvx, .tvd, .tvf)
References:
https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/codecs/lucene70/package-summary.html#package.description
https://blog.csdn.net/ghj1976/article/details/5586329
http://www.cnblogs.com/forfuture1978/archive/2009/12/14/1623599.html