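Elasticsearch (ES) introduced a very handy API in version 7.15: the Index disk usage API. See the documentation for details: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-disk-usage.html
This API analyzes how much storage each field of a given index uses. With it, you get a quantitative picture of the storage ES consumes.
Many users have misconceptions about ES taking up too much storage or about write performance falling short of expectations. This is mainly a trade-off ES makes in favor of powerful features.
By default, ES builds a very complete set of indexes for every field. These indexes require indexing work at write time and take up extra storage, but with them queries support more features and run faster. This reflects ES's philosophy of schema on write and "make query simple and fast".
However, this can lead to excessive schema on write, which causes write performance and storage bloat problems. At that point, optimizing the index mapping by removing indexes from fields that do not need them improves write performance and reduces storage cost.
The Index disk usage API provides quantitative data for optimizing a mapping. It lists a storage breakdown for every field in the index, so you can optimize mapping fields by ROI, from the largest consumer to the smallest.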
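As a minimal sketch of how the API is called: the request is a POST to <index>/_disk_usage, and the official documentation requires run_expensive_tasks=true because the analysis is resource-intensive. The class name DiskUsageRequest, the host http://localhost:9200 and the index name are placeholders chosen for illustration, not values from the original post. The code only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) the disk usage request.
// Host, port and index name are placeholders, not a real cluster.
class DiskUsageRequest {

    static HttpRequest build(String baseUrl, String index) {
        // The API refuses to run unless run_expensive_tasks is set to true.
        URI uri = URI.create(baseUrl + "/" + index + "/_disk_usage?run_expensive_tasks=true");
        return HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:9200", "my-index-000001");
        System.out.println(request.method() + " " + request.uri());
        // To run it against a cluster, send it with java.net.http.HttpClient:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```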
This is the sample result given in the documentation:
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "my-index-000001": {
    "store_size": "929mb",
    "store_size_in_bytes": 974192723,
    "all_fields": {
      "total": "928.9mb",
      "total_in_bytes": 973977084,
      "inverted_index": {
        "total": "107.8mb",
        "total_in_bytes": 113128526
      },
      "stored_fields": "623.5mb",
      "stored_fields_in_bytes": 653819143,
      "doc_values": "125.7mb",
      "doc_values_in_bytes": 131885142,
      "points": "59.9mb",
      "points_in_bytes": 62885773,
      "norms": "2.3kb",
      "norms_in_bytes": 2356,
      "term_vectors": "2.2kb",
      "term_vectors_in_bytes": 2310
    },
    "fields": {
      "_id": {
        "total": "49.3mb",
        "total_in_bytes": 51709993,
        "inverted_index": {
          "total": "29.7mb",
          "total_in_bytes": 31172745
        },
        "stored_fields": "19.5mb",
        "stored_fields_in_bytes": 20537248,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0
      },
      "_primary_term": {...},
      "_seq_no": {...},
      "_version": {...},
      "_source": {
        "total": "603.9mb",
        "total_in_bytes": 633281895,
        "inverted_index": {...},
        "stored_fields": "603.9mb",
        "stored_fields_in_bytes": 633281895,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0
      },
      "context": {
        "total": "28.6mb",
        "total_in_bytes": 30060405,
        "inverted_index": {
          "total": "22mb",
          "total_in_bytes": 23090908
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "2.3kb",
        "norms_in_bytes": 2356,
        "term_vectors": "2.2kb",
        "term_vectors_in_bytes": 2310
      },
      "context.keyword": {...},
      "message": {...},
      "message.keyword": {...}
    }
  }
}
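To work through the fields in ROI order, it helps to sort them by total_in_bytes. The sketch below does this for the three fields whose totals are visible in the sample response. The numbers are copied by hand from the sample, and the class name FieldRanking is made up for illustration. A real tool would parse the JSON response with a JSON library instead of hard-coding values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: rank fields by total_in_bytes, largest first.
// Values are hand-copied from the sample response, not read from a cluster.
class FieldRanking {

    static List<Map.Entry<String, Long>> rank(Map<String, Long> totalsInBytes) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totalsInBytes.entrySet());
        // Sort descending so the biggest storage consumers come first.
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        long storeSizeInBytes = 974192723L; // store_size_in_bytes of the sample index
        Map<String, Long> totals = new LinkedHashMap<>();
        totals.put("_id", 51709993L);
        totals.put("_source", 633281895L);
        totals.put("context", 30060405L);

        for (Map.Entry<String, Long> e : rank(totals)) {
            double share = 100.0 * e.getValue() / storeSizeInBytes;
            System.out.printf("%-10s %12d bytes  %5.1f%% of store_size%n",
                    e.getKey(), e.getValue(), share);
        }
    }
}
```

In the sample, _source comes out on top by a wide margin, followed by _id and context.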
For a single field, you can see that a very detailed storage breakdown is listed:
"total": "49.3mb",
"total_in_bytes": 51709993,
"inverted_index": {
  "total": "29.7mb",
  "total_in_bytes": 31172745
},
"stored_fields": "19.5mb",
"stored_fields_in_bytes": 20537248,
"doc_values": "0b",
"doc_values_in_bytes": 0,
"points": "0b",
"points_in_bytes": 0,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0
total records the field's total size, which is made up of the inverted index cost, the stored fields (raw data) cost, the doc_values cost, and so on.
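The _id numbers shown above make this easy to check by hand: only inverted_index and stored_fields are non-zero, and together they add up to total_in_bytes. A minimal sketch of that arithmetic (the class name FieldTotalCheck is illustrative only):

```java
// Sketch: check that a field's total equals the sum of its per-structure sizes,
// using the "_id" numbers from the sample response.
class FieldTotalCheck {

    static long sum(long invertedIndex, long storedFields, long docValues,
                    long points, long norms, long termVectors) {
        return invertedIndex + storedFields + docValues + points + norms + termVectors;
    }

    public static void main(String[] args) {
        long total = sum(31172745L, 20537248L, 0L, 0L, 0L, 0L);
        System.out.println(total); // 51709993, the total_in_bytes reported for _id
    }
}
```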
So how does ES obtain this data in real time?
The core logic lives in the IndexDiskUsageAnalyzer class. IndexDiskUsageAnalyzer computes the storage usage of each shard, and the BroadcastAction framework then aggregates the per-shard data.
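The aggregation step can be pictured with a toy model. This is not Elasticsearch code: the class ShardStatsMerge, the map-based representation and the byte counts are all assumptions made for illustration. Each shard reports bytes per field, and the per-shard results are summed into index-level totals.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Elasticsearch code: each shard reports bytes per field,
// and the aggregation step sums them into index-level totals.
class ShardStatsMerge {

    static Map<String, Long> merge(List<Map<String, Long>> perShard) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : perShard) {
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                // Add this shard's bytes to the running total for the field.
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up byte counts for two shards.
        Map<String, Long> shard0 = Map.of("_id", 100L, "message", 400L);
        Map<String, Long> shard1 = Map.of("_id", 150L, "context", 50L);
        System.out.println(merge(List.of(shard0, shard1)));
    }
}
```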
IndexDiskUsageAnalyzer collects the storage size of each data structure type in its doAnalyze method:
void doAnalyze(IndexDiskUsageStats stats) throws IOException {
    long startTimeInNanos;
    final ExecutionTime executionTime = new ExecutionTime();
    try (DirectoryReader directoryReader = DirectoryReader.open(commit)) {
        directory.resetBytesRead();
        for (LeafReaderContext leaf : directoryReader.leaves()) {
            cancellationChecker.checkForCancellation();
            final SegmentReader reader = Lucene.segmentReader(leaf.reader());

            startTimeInNanos = System.nanoTime();
            analyzeInvertedIndex(reader, stats);
            executionTime.invertedIndexTimeInNanos += System.nanoTime() - startTimeInNanos;

            startTimeInNanos = System.nanoTime();
            analyzeStoredFields(reader, stats);
            executionTime.storedFieldsTimeInNanos += System.nanoTime() - startTimeInNanos;

            startTimeInNanos = System.nanoTime();
            analyzeDocValues(reader, stats);
            executionTime.docValuesTimeInNanos += System.nanoTime() - startTimeInNanos;

            startTimeInNanos = System.nanoTime();
            analyzePoints(reader, stats);
            executionTime.pointsTimeInNanos += System.nanoTime() - startTimeInNanos;

            startTimeInNanos = System.nanoTime();
            analyzeNorms(reader, stats);
            executionTime.normsTimeInNanos += System.nanoTime() - startTimeInNanos;

            startTimeInNanos = System.nanoTime();
            analyzeTermVectors(reader, stats);
            executionTime.termVectorsTimeInNanos += System.nanoTime() - startTimeInNanos;

            startTimeInNanos = System.nanoTime();
            analyzeKnnVectors(reader, stats);
            executionTime.knnVectorsTimeInNanos += System.nanoTime() - startTimeInNanos;
        }
    }
    logger.debug("analyzing the disk usage took {} stats: {}", executionTime, stats);
}
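This code calls Lucene APIs to derive each field's storage size from the metadata and the actual data of the corresponding data structure.
InvertedIndex, Points and DocValues all store their data per field, so they can be processed one field at a time.
StoredFields stores data row by row, so the code iterates over every doc and reads the length of each field inside it.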
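The row-oriented case can be illustrated with a toy model. This is not Lucene's API: the class StoredFieldsScan, the map-based document representation and the use of UTF-8 string length as the "field length" are all assumptions made for illustration. Because the data is laid out document by document, per-field sizes can only be accumulated while visiting every document.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Lucene's API: stored fields are laid out document by document,
// so per-field sizes are accumulated while visiting every document.
class StoredFieldsScan {

    static Map<String, Long> bytesPerField(List<Map<String, String>> docs) {
        Map<String, Long> sizes = new HashMap<>();
        for (Map<String, String> doc : docs) {                  // visit every document (row)
            for (Map.Entry<String, String> field : doc.entrySet()) {
                long length = field.getValue().getBytes(StandardCharsets.UTF_8).length;
                sizes.merge(field.getKey(), length, Long::sum); // add this field's length
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Two made-up documents with two stored fields each.
        List<Map<String, String>> docs = List.of(
                Map.of("_id", "1", "message", "hello"),
                Map.of("_id", "22", "message", "hi"));
        System.out.println(bytesPerField(docs));
    }
}
```

A per-field structure, by contrast, would let the analyzer account for one field's data without scanning every document.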
All of the relevant code is in IndexDiskUsageAnalyzer; readers interested in a particular data structure can dig into the corresponding part.