Iceberg metadata

Below are all the files stored in the HDFS directory of an Iceberg table that uses a Hive catalog.
They include:
1. Parquet data files
2. JSON metadata files
3. Avro snapshot files (manifest lists)
4. Avro manifest files
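The four kinds of file can be told apart by name alone. The sketch below is a hypothetical helper, not part of the Iceberg API; it assumes the naming patterns seen in this table's directory (`.parquet`, `.metadata.json`, `snap-*.avro`, `*-m<N>.avro`) and nothing more. Note that the Parquet files under data/ hold both data files and delete files, so the name cannot distinguish those two.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not an Iceberg API): guesses a file's role from the
// naming patterns that appear in this table's directory listing.
class IcebergFileKind {
  // Manifest names in the listing end with "-m<N>.avro".
  private static final Pattern MANIFEST = Pattern.compile(".*-m\\d+\\.avro");

  static String classify(String path) {
    String name = path.substring(path.lastIndexOf('/') + 1);
    if (name.endsWith(".parquet")) {
      return "PARQUET_FILE";      // data file or delete file
    }
    if (name.endsWith(".metadata.json")) {
      return "TABLE_METADATA";    // one per table metadata version
    }
    if (name.startsWith("snap-") && name.endsWith(".avro")) {
      return "MANIFEST_LIST";     // one per snapshot
    }
    if (MANIFEST.matcher(name).matches()) {
      return "MANIFEST";
    }
    return "UNKNOWN";
  }

  public static void main(String[] args) {
    String base = "hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/";
    System.out.println(classify(base + "data/00000-0-79d89118-5069-4877-8332-2a592c887fe3-00001.parquet"));
    System.out.println(classify(base + "metadata/00007-b3430d66-c9fb-401c-b800-e2ea4ad70d8d.metadata.json"));
    System.out.println(classify(base + "metadata/snap-7329471080018208648-1-f0bd795c-6a10-41bc-8f79-437fef1ff5f9.avro"));
    System.out.println(classify(base + "metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m7.avro"));
  }
}
```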

hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00001.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00003.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00004.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00005.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00006.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00007.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00008.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00009.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00010.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00011.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00012.parquet
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-79d89118-5069-4877-8332-2a592c887fe3-00001.parquet

hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00000-f9a42593-ab76-4933-a739-8e10b476fc85.metadata.json
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00001-2002be31-0182-4085-9173-aee3e4facc0b.metadata.json
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00002-2c5e9702-a908-43a6-bbe8-0f0c6582e984.metadata.json
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00003-3db39d6b-6311-4bdb-9d7b-b56f2df74fb3.metadata.json
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00004-a5490f98-4daf-4592-abf1-fdcc408f1b0f.metadata.json
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00005-b13e2c1f-1383-43c3-a53c-832ed8c68fa8.metadata.json
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00006-68ce5b89-27fb-421a-8a49-42f383dfc587.metadata.json
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00007-b3430d66-c9fb-401c-b800-e2ea4ad70d8d.metadata.json

hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/09769592-109f-4f6e-ab46-9b597dacfd43-m0.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/1a49a079-d7cf-41a6-931d-15ad2a44914b-m0.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/1a49a079-d7cf-41a6-931d-15ad2a44914b-m1.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/2b1ddf19-5701-4c0b-ac6a-ea41fdab9c07-m0.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/2b1ddf19-5701-4c0b-ac6a-ea41fdab9c07-m1.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/bf413511-d1cf-407f-bcc9-b6960cde7898-m0.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/bf413511-d1cf-407f-bcc9-b6960cde7898-m1.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/e97d1919-f47d-40c0-9eb6-24bf68f96980-m0.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/e97d1919-f47d-40c0-9eb6-24bf68f96980-m1.avro

hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m0.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m1.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m2.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m3.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m4.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m5.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m6.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m7.avro

hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-1289984099921389549-1-1a49a079-d7cf-41a6-931d-15ad2a44914b.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-3921229567852426700-1-bf413511-d1cf-407f-bcc9-b6960cde7898.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-5386042144404510937-1-09769592-109f-4f6e-ab46-9b597dacfd43.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-7125662397327732785-1-2b1ddf19-5701-4c0b-ac6a-ea41fdab9c07.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-7329471080018208648-1-f0bd795c-6a10-41bc-8f79-437fef1ff5f9.avro
hdfs://10.177.13.120:8020/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-7377732782289998100-1-e97d1919-f47d-40c0-9eb6-24bf68f96980.avro

Below is the CREATE TABLE statement for the Iceberg table in Hive:
CREATE EXTERNAL TABLE iceberg_cdc_table(
id string COMMENT 'unique ID',
data string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.FileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.FileOutputFormat'
LOCATION
'hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'metadata_location'='hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00007-b3430d66-c9fb-401c-b800-e2ea4ad70d8d.metadata.json',
'numFiles'='0',
'numRows'='-1',
'previous_metadata_location'='hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00006-68ce5b89-27fb-421a-8a49-42f383dfc587.metadata.json',
'rawDataSize'='-1',
'table_type'='ICEBERG',
'totalSize'='0',
'transient_lastDdlTime'='1619089695')

Here metadata_location points to the current metadata file. Its contents are:

{
  "format-version" : 2,
  "table-uuid" : "924ae1db-5aad-451a-ae3b-bd933296ea84",
  "location" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table",
  "last-sequence-number" : 6,
  "last-updated-ms" : 1619090084800,
  "last-column-id" : 2,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : true,
      "type" : "string",
      "doc" : "unique ID"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : true,
      "type" : "string"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "row-key" : {
    "identifier-fields" : [ {
      "source-id" : 1
    } ]
  },
  "properties" : { },
  "current-snapshot-id" : 7329471080018208648,
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 5386042144404510937,
    "timestamp-ms" : 1619089843403,
    "summary" : {
      "operation" : "append",
      "flink.job-id" : "94aed63193990d73442f8696c3eee136",
      "flink.max-committed-checkpoint-id" : "1",
      "added-data-files" : "1",
      "added-records" : "1000000",
      "added-files-size" : "3076138",
      "changed-partition-count" : "1",
      "total-records" : "1000000",
      "total-files-size" : "3076138",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-5386042144404510937-1-09769592-109f-4f6e-ab46-9b597dacfd43.avro"
  }, {
    "sequence-number" : 2,
    "snapshot-id" : 1289984099921389549,
    "parent-snapshot-id" : 5386042144404510937,
    "timestamp-ms" : 1619089902186,
    "summary" : {
      "operation" : "overwrite",
      "flink.job-id" : "94aed63193990d73442f8696c3eee136",
      "flink.max-committed-checkpoint-id" : "2",
      "added-data-files" : "1",
      "added-delete-files" : "1",
      "added-records" : "21892",
      "added-files-size" : "184249",
      "added-equality-deletes" : "21892",
      "changed-partition-count" : "1",
      "total-records" : "1021892",
      "total-files-size" : "3260387",
      "total-data-files" : "2",
      "total-delete-files" : "1",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "21892"
    },
    "manifest-list" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-1289984099921389549-1-1a49a079-d7cf-41a6-931d-15ad2a44914b.avro"
  }, {
    "sequence-number" : 3,
    "snapshot-id" : 7377732782289998100,
    "parent-snapshot-id" : 1289984099921389549,
    "timestamp-ms" : 1619089962201,
    "summary" : {
      "operation" : "overwrite",
      "flink.job-id" : "94aed63193990d73442f8696c3eee136",
      "flink.max-committed-checkpoint-id" : "3",
      "added-data-files" : "1",
      "added-delete-files" : "1",
      "added-records" : "73302",
      "added-files-size" : "604308",
      "added-equality-deletes" : "73302",
      "changed-partition-count" : "1",
      "total-records" : "1095194",
      "total-files-size" : "3864695",
      "total-data-files" : "3",
      "total-delete-files" : "2",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "95194"
    },
    "manifest-list" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-7377732782289998100-1-e97d1919-f47d-40c0-9eb6-24bf68f96980.avro"
  }, {
    "sequence-number" : 4,
    "snapshot-id" : 3921229567852426700,
    "parent-snapshot-id" : 7377732782289998100,
    "timestamp-ms" : 1619090021768,
    "summary" : {
      "operation" : "overwrite",
      "flink.job-id" : "94aed63193990d73442f8696c3eee136",
      "flink.max-committed-checkpoint-id" : "4",
      "added-data-files" : "1",
      "added-delete-files" : "1",
      "added-records" : "95137",
      "added-files-size" : "783498",
      "added-equality-deletes" : "95137",
      "changed-partition-count" : "1",
      "total-records" : "1190331",
      "total-files-size" : "4648193",
      "total-data-files" : "4",
      "total-delete-files" : "3",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "190331"
    },
    "manifest-list" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-3921229567852426700-1-bf413511-d1cf-407f-bcc9-b6960cde7898.avro"
  }, {
    "sequence-number" : 5,
    "snapshot-id" : 7125662397327732785,
    "parent-snapshot-id" : 3921229567852426700,
    "timestamp-ms" : 1619090082142,
    "summary" : {
      "operation" : "overwrite",
      "flink.job-id" : "94aed63193990d73442f8696c3eee136",
      "flink.max-committed-checkpoint-id" : "5",
      "added-data-files" : "1",
      "added-delete-files" : "1",
      "added-records" : "2772",
      "added-files-size" : "25696",
      "added-equality-deletes" : "2772",
      "changed-partition-count" : "1",
      "total-records" : "1193103",
      "total-files-size" : "4673889",
      "total-data-files" : "5",
      "total-delete-files" : "4",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "193103"
    },
    "manifest-list" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-7125662397327732785-1-2b1ddf19-5701-4c0b-ac6a-ea41fdab9c07.avro"
  }, {
    "sequence-number" : 6,
    "snapshot-id" : 7329471080018208648,
    "parent-snapshot-id" : 7125662397327732785,
    "timestamp-ms" : 1619090084800,
    "summary" : {
      "operation" : "replace",
      "added-data-files" : "1",
      "deleted-data-files" : "4",
      "removed-delete-files" : "3",
      "added-records" : "1000000",
      "deleted-records" : "1190331",
      "added-files-size" : "3293597",
      "removed-files-size" : "4648193",
      "removed-equality-deletes" : "190331",
      "changed-partition-count" : "1",
      "total-records" : "1002772",
      "total-files-size" : "3319293",
      "total-data-files" : "2",
      "total-delete-files" : "1",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "2772"
    },
    "manifest-list" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/snap-7329471080018208648-1-f0bd795c-6a10-41bc-8f79-437fef1ff5f9.avro"
  } ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1619089843403,
    "snapshot-id" : 5386042144404510937
  }, {
    "timestamp-ms" : 1619089902186,
    "snapshot-id" : 1289984099921389549
  }, {
    "timestamp-ms" : 1619089962201,
    "snapshot-id" : 7377732782289998100
  }, {
    "timestamp-ms" : 1619090021768,
    "snapshot-id" : 3921229567852426700
  }, {
    "timestamp-ms" : 1619090082142,
    "snapshot-id" : 7125662397327732785
  }, {
    "timestamp-ms" : 1619090084800,
    "snapshot-id" : 7329471080018208648
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1619089691387,
    "metadata-file" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00000-f9a42593-ab76-4933-a739-8e10b476fc85.metadata.json"
  }, {
    "timestamp-ms" : 1619089741748,
    "metadata-file" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00001-2002be31-0182-4085-9173-aee3e4facc0b.metadata.json"
  }, {
    "timestamp-ms" : 1619089843403,
    "metadata-file" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00002-2c5e9702-a908-43a6-bbe8-0f0c6582e984.metadata.json"
  }, {
    "timestamp-ms" : 1619089902186,
    "metadata-file" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00003-3db39d6b-6311-4bdb-9d7b-b56f2df74fb3.metadata.json"
  }, {
    "timestamp-ms" : 1619089962201,
    "metadata-file" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00004-a5490f98-4daf-4592-abf1-fdcc408f1b0f.metadata.json"
  }, {
    "timestamp-ms" : 1619090021768,
    "metadata-file" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00005-b13e2c1f-1383-43c3-a53c-832ed8c68fa8.metadata.json"
  }, {
    "timestamp-ms" : 1619090082142,
    "metadata-file" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/00006-68ce5b89-27fb-421a-8a49-42f383dfc587.metadata.json"
  } ]
}

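It contains the information for every snapshot and every earlier metadata file.
Note sequence-number and snapshot-id: the two are tightly coupled.
In a v2 table, sequence-number serves as the sequence number that orders the data.
At read time, equality-delete rows are filtered out of data files by sequence-number:
for each data file, the reader looks for equality-delete files whose sequence number is larger than the data file's.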
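The rule can be written down in a few lines. This is only a sketch of the comparison described here, not Iceberg's reader code; the class and method names are made up for illustration, and the sequence numbers 2 to 5 are the ones carried by the four delete files committed in this table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of the sequence-number rule for equality deletes.
class EqualityDeleteRule {
  // Returns the sequence numbers of the equality-delete files that apply to a
  // data file: only those strictly larger than the data file's own number.
  static List<Long> applicableDeletes(long dataSeq, List<Long> deleteSeqs) {
    return deleteSeqs.stream()
        .filter(deleteSeq -> deleteSeq > dataSeq)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Equality-delete files were committed with sequence numbers 2, 3, 4 and 5.
    List<Long> deleteSeqs = Arrays.asList(2L, 3L, 4L, 5L);

    // A data file written at sequence number 3 is affected by deletes 4 and 5.
    System.out.println(applicableDeletes(3L, deleteSeqs)); // [4, 5]

    // A data file written at sequence number 6 is affected by none of them.
    System.out.println(applicableDeletes(6L, deleteSeqs)); // []
  }
}
```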

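Small-file compaction also produces a new snapshot, just as an ingest checkpoint does.
Suppose ingestion has reached snapshot 3 and compaction starts; during the compaction, ingestion produces snapshot 4,
and then the compaction finishes and produces snapshot 5.
The file in snapshot 5 only merged the files of snapshot 3, so the equality-delete files in snapshot 4 still need to be applied to it, but because 5 is larger than 4 they are not.

So when a small-file compaction spans an ingest snapshot, the data ends up wrong.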
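The scenario can be replayed with the three numbers from the example. This is a sketch under one assumption taken from the description above: the rewritten file is stamped with the sequence number of the snapshot that commits it (5). The class and method names are hypothetical.

```java
// Hypothetical replay of the compaction race described above.
class CompactionRace {
  // An equality delete only affects data files with a strictly smaller
  // sequence number than its own.
  static boolean deleteApplies(long dataSeq, long deleteSeq) {
    return dataSeq < deleteSeq;
  }

  public static void main(String[] args) {
    long compactionInputSeq = 3;  // compaction read the files as of snapshot 3
    long concurrentDeleteSeq = 4; // ingestion committed equality deletes in snapshot 4
    long compactedFileSeq = 5;    // the rewritten file is committed as snapshot 5

    // Against the original files the delete is honoured.
    System.out.println(deleteApplies(compactionInputSeq, concurrentDeleteSeq)); // true

    // Against the rewritten file it is skipped, although the rewritten file
    // still contains the rows that snapshot 4 deleted.
    System.out.println(deleteApplies(compactedFileSeq, concurrentDeleteSeq)); // false
  }
}
```

The second result is the problem: the rows are the same, but the rewrite moved them to a sequence number that the concurrent delete can no longer reach.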

Next, the current snapshot ID and its corresponding file. The contents of snap-7329471080018208648-1-f0bd795c-6a10-41bc-8f79-437fef1ff5f9.avro are:

{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m7.avro",
  "manifest_length" : 6569,
  "partition_spec_id" : 0,
  "content" : 0,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 1,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 0,
  "added_rows_count" : 1000000,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 0,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-79d89118-5069-4877-8332-2a592c887fe3-00001.parquet  "status" : 1  "content" : 0
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/2b1ddf19-5701-4c0b-ac6a-ea41fdab9c07-m0.avro",
  "manifest_length" : 6557,
  "partition_spec_id" : 0,
  "content" : 0,
  "sequence_number" : 5,
  "min_sequence_number" : 5,
  "added_snapshot_id" : 7125662397327732785,
  "added_data_files_count" : 1,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 0,
  "added_rows_count" : 2772,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 0,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00009.parquet  "status" : 1  "content" : 0
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m0.avro",
  "manifest_length" : 6553,
  "partition_spec_id" : 0,
  "content" : 0,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 0,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 1,
  "added_rows_count" : 0,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 95137,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00007.parquet  "status" : 2  "content" : 0
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m3.avro",
  "manifest_length" : 6554,
  "partition_spec_id" : 0,
  "content" : 0,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 0,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 1,
  "added_rows_count" : 0,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 73302,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00005.parquet  "status" : 2  "content" : 0
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m2.avro",
  "manifest_length" : 6553,
  "partition_spec_id" : 0,
  "content" : 0,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 0,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 1,
  "added_rows_count" : 0,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 21892,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00003.parquet  "status" : 2  "content" : 0
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m1.avro",
  "manifest_length" : 6566,
  "partition_spec_id" : 0,
  "content" : 0,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 0,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 1,
  "added_rows_count" : 0,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 1000000,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00001.parquet  "status" : 2  "content" : 0
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/2b1ddf19-5701-4c0b-ac6a-ea41fdab9c07-m1.avro",
  "manifest_length" : 6568,
  "partition_spec_id" : 0,
  "content" : 1,
  "sequence_number" : 5,
  "min_sequence_number" : 5,
  "added_snapshot_id" : 7125662397327732785,
  "added_data_files_count" : 1,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 0,
  "added_rows_count" : 2772,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 0,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00010.parquet  "status" : 1  "content" : 2
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m4.avro",
  "manifest_length" : 6568,
  "partition_spec_id" : 0,
  "content" : 1,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 0,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 1,
  "added_rows_count" : 0,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 95137,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00008.parquet  "status" : 2  "content" : 2
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m5.avro",
  "manifest_length" : 6570,
  "partition_spec_id" : 0,
  "content" : 1,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 0,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 1,
  "added_rows_count" : 0,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 73302,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00006.parquet  "status" : 2  "content" : 2
  
}
{
  "manifest_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/metadata/f0bd795c-6a10-41bc-8f79-437fef1ff5f9-m6.avro",
  "manifest_length" : 6567,
  "partition_spec_id" : 0,
  "content" : 1,
  "sequence_number" : 6,
  "min_sequence_number" : 6,
  "added_snapshot_id" : 7329471080018208648,
  "added_data_files_count" : 0,
  "existing_data_files_count" : 0,
  "deleted_data_files_count" : 1,
  "added_rows_count" : 0,
  "existing_rows_count" : 0,
  "deleted_rows_count" : 21892,
  "partitions" : {
    "array" : [ ]
  }
  
  00000-0-319a206d-7ead-415d-9ec8-700c1a49b8c4-00004.parquet  "status" : 2  "content" : 2
  
}

This file lists all the manifest files. Note the content property, whose meaning is defined in ManifestContent: 0 denotes a data manifest and 1 denotes a delete manifest.

/**
 * Content type stored in a manifest file, either DATA or DELETES.
 */
public enum ManifestContent {
  DATA(0),
  DELETES(1);

  private final int id;

  ManifestContent(int id) {
    this.id = id;
  }

  public int id() {
    return id;
  }
}

The contents of a manifest file:

{
  "status" : 1,
  "snapshot_id" : {
    "long" : 7329471080018208648
  },
  "sequence_number" : null,
  "data_file" : {
    "content" : 0,
    "file_path" : "hdfs://test-hdfs1/user/hive/dc-warehouse/iceberg_cdc_table/data/00000-0-79d89118-5069-4877-8332-2a592c887fe3-00001.parquet",
    "file_format" : "PARQUET",
    "partition" : { },
    "record_count" : 1000000,
    "file_size_in_bytes" : 3293597,
    "column_sizes" : {
      "array" : [ {
        "key" : 1,
        "value" : 2554588
      }, {
        "key" : 2,
        "value" : 734455
      } ]
    },
    "value_counts" : {
      "array" : [ {
        "key" : 1,
        "value" : 1000000
      }, {
        "key" : 2,
        "value" : 1000000
      } ]
    },
    "null_value_counts" : {
      "array" : [ {
        "key" : 1,
        "value" : 0
      }, {
        "key" : 2,
        "value" : 0
      } ]
    },
    "nan_value_counts" : {
      "array" : [ ]
    },
    "lower_bounds" : {
      "array" : [ {
        "key" : 1,
        "value" : "0"
      }, {
        "key" : 2,
        "value" : "007-dacf7d6ae3f9"
      } ]
    },
    "upper_bounds" : {
      "array" : [ {
        "key" : 1,
        "value" : "999999"
      }, {
        "key" : 2,
        "value" : "ff3-e85ff5b95460"
      } ]
    },
    "key_metadata" : null,
    "split_offsets" : {
      "array" : [ 4 ]
    },
    "equality_ids" : null,
    "sort_order_id" : {
      "int" : 0
    }
  }
}

Note the status property; its values are defined by an enum in the ManifestEntry interface:

package org.apache.iceberg;

interface ManifestEntry<F extends ContentFile<F>> {
  enum Status {
    EXISTING(0),
    ADDED(1),
    DELETED(2);

    private final int id;

    Status(int id) {
      this.id = id;
    }

    public int id() {
      return id;
    }
  }
}

0 denotes an existing file, 1 denotes an added file, and 2 denotes a file that is no longer valid and is to be deleted.
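To see which files are still live in the current snapshot, it is enough to drop every entry whose status is 2. The sketch below does this over the ten file/status pairs annotated in the manifest list above; the labels are abbreviated (writer UUID prefix plus file counter) and the class is hypothetical, not an Iceberg API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter over the ten manifest entries annotated earlier.
class LiveEntries {
  static final int DELETED = 2; // ManifestEntry.Status.DELETED

  // Abbreviated labels: writer UUID prefix + file counter.
  static final String[] FILES = {
    "79d89118-00001", "319a206d-00009", "319a206d-00007", "319a206d-00005",
    "319a206d-00003", "319a206d-00001", "319a206d-00010", "319a206d-00008",
    "319a206d-00006", "319a206d-00004"
  };
  // Status of each entry, in the same order as FILES.
  static final int[] STATUSES = {1, 1, 2, 2, 2, 2, 1, 2, 2, 2};

  // Keeps every file whose manifest entry is not marked DELETED.
  static List<String> liveFiles(String[] files, int[] statuses) {
    List<String> live = new ArrayList<>();
    for (int i = 0; i < files.length; i++) {
      if (statuses[i] != DELETED) {
        live.add(files[i]);
      }
    }
    return live;
  }

  public static void main(String[] args) {
    // Prints [79d89118-00001, 319a206d-00009, 319a206d-00010]
    System.out.println(liveFiles(FILES, STATUSES));
  }
}
```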

There is also the content property of each file, whose meaning is defined in the FileContent class: 0 denotes a data file, 1 denotes a POSITION_DELETES file, and 2 denotes an EQUALITY_DELETES file.

package org.apache.iceberg;

/**
 * Content type stored in a file, one of DATA, POSITION_DELETES, or EQUALITY_DELETES.
 */
public enum FileContent {
  DATA(0),
  POSITION_DELETES(1),
  EQUALITY_DELETES(2);

  private final int id;

  FileContent(int id) {
    this.id = id;
  }

  public int id() {
    return id;
  }
}

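The snapshot file shown above, snap-7329471080018208648-1-f0bd795c-6a10-41bc-8f79-437fef1ff5f9.avro, is the latest one. It lists 6 manifests with content 0 and 4 manifests with content 1. That is because I first ingested 1,000,000 rows of CDC data, which produced one data file, then went through 4 rounds of updates, which produced 4 data files and 4 delete files, and finally ran a file compaction, which produced one new data file.

I extracted the Parquet file behind each manifest together with its status and content. Only 3 have status 1, so only 3 files are valid: one is the file produced by the small-file compaction, and the other two are update files from the ingest committed in snapshot 5, which the compaction did not cover. Of those two, one is a DATA file and the other is an EQUALITY_DELETES file (content 2).

Before the compaction there were 9 valid files: 5 data files and 4 EQUALITY_DELETES files.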
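These counts can be cross-checked against the summary fields of the last two snapshots in the metadata file. The sketch below recomputes the totals of the replace snapshot from the totals of the snapshot before it; all numbers are copied from the two summary blocks, and the helper name is made up.

```java
// Hypothetical cross-check of the "replace" snapshot summary.
class CompactionSummaryCheck {
  // New total = previous total - what the commit removed + what it added.
  static long after(long before, long removed, long added) {
    return before - removed + added;
  }

  public static void main(String[] args) {
    // total-data-files: 5 before, 4 deleted, 1 added
    System.out.println(after(5, 4, 1));                   // 2
    // total-delete-files: 4 before, 3 removed, none added
    System.out.println(after(4, 3, 0));                   // 1
    // total-records: deleted-records and added-records
    System.out.println(after(1193103, 1190331, 1000000)); // 1002772
    // total-equality-deletes: removed-equality-deletes
    System.out.println(after(193103, 190331, 0));         // 2772
    // total-files-size: removed-files-size and added-files-size
    System.out.println(after(4673889, 4648193, 3293597)); // 3319293
  }
}
```

Each result matches the corresponding total-* field of the replace snapshot, and 2 data files plus 1 delete file gives the 3 valid files counted above.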
