- Blog post: http://zhwhong.cn/2017/03/27/LIDC-Dicom-data-and-XML-annotation-parse/
- Related article: LIDC-IDRI lung nodule DICOM dataset, parsing and summary
- GitHub reference: zhwhong/lidc_nodule_detection
Data source
The dataset used is LIDC-IDRI (The Lung Image Database Consortium). It consists of chest medical images (such as CT scans and X-rays) together with the corresponding diagnostic lesion annotations. The collection was initiated by the US National Cancer Institute to study early cancer detection in high-risk populations.
The dataset contains 1018 study instances. The images of each instance were annotated by four experienced thoracic radiologists in a two-phase process. In the first phase, each radiologist independently diagnosed the scan and marked lesion locations, using three categories: 1) nodules >= 3 mm, 2) nodules < 3 mm, 3) non-nodules >= 3 mm (official wording: "nodule > or =3 mm," "nodule <3 mm," and "non-nodule > or =3 mm"; see Summary). In the second phase, each radiologist independently reviewed the marks of the other three and gave a final diagnosis of their own. This two-phase process marks all findings as completely as possible while avoiding forced consensus.
Data location: @news-ai:/baina/sda1/data/lidc/
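Parsing results
1. Image matrix pixel information
The module processes a three-dimensional matrix D of size slicers * rows * cols. The element at slice z, row y, column x of D is located at byte offset (z * rows * cols + y * cols + x) * sizeof(data_type), where rows is the number of image rows and cols the number of image columns (both 512 by default), and data_type is the element type (short by default). For details see: lung nodule detection documentation.
- e.g. for case LIDC-IDRI-0001 the matrix is 133 * 512 * 512: 133 slices of 512 * 512 pixels each, written in order to a binary file, with 2 bytes per pixel (the C short type).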
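The addressing rule above can be sketched in Python. This is a minimal illustration, assuming the DAT file contains nothing but the raw voxels in slice-major order and that each short is stored little-endian (the byte order is not stated in the text). `voxel_offset` and `read_voxel` are hypothetical helper names, and the tiny buffer stands in for a real file.

```python
import struct

def voxel_offset(z, y, x, rows=512, cols=512, itemsize=2):
    """Byte offset of the element at slice z, row y, column x."""
    return (z * rows * cols + y * cols + x) * itemsize

def read_voxel(buf, z, y, x, rows=512, cols=512):
    """Read one signed 16-bit voxel from a bytes-like buffer.

    Assumption: little-endian byte order ("<h"); the text only says that
    each pixel is a 2-byte C short.
    """
    return struct.unpack_from("<h", buf, voxel_offset(z, y, x, rows, cols))[0]

# Tiny synthetic volume (2 slices of 3 x 4) standing in for a real DAT file.
slices, rows, cols = 2, 3, 4
values = list(range(-12, 12))               # 24 voxels
buf = struct.pack("<%dh" % len(values), *values)

print(voxel_offset(1, 2, 3, rows, cols))    # offset of the last voxel
print(read_voxel(buf, 1, 2, 3, rows, cols)) # value of the last voxel
```

Under the same assumptions, `voxel_offset(133, 0, 0)` gives the expected size of the LIDC-IDRI-0001 matrix file: 133 * 512 * 512 * 2 = 69730304 bytes.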
2. Nodule region type annotation information
First line: slicers rows cols data_type pixel_space_x pixel_space_y slice_thickness
- slicers : number of slices;
- rows : number of matrix rows, 512 by default;
- cols : number of matrix columns, 512 by default;
- data_type : data type tag, one of the following enum values (default SHORT_TYPE, i.e. 4): enum DATA_TYPE { CHAR_TYPE, UCHAR_TYPE, INT_TYPE, UINT_TYPE, SHORT_TYPE, USHORT_TYPE, FLOAT_TYPE, DOUBLE_TYPE };
- pixel_space_x : X-ray scan step between columns, in millimetres;
- pixel_space_y : X-ray scan step between rows, in millimetres;
- slice_thickness : scan step along the z axis (the slice thickness), in millimetres.
Other lines: Type num x1 y1 z1 x2 y2 z2 … xi yi zi ... xn yn zn
- Type : "1" means "nodules", "2" means "small_nodules", "3" means "non_nodules";
- num : the count of x, y, z numbers on the line (each point has three coordinates, so num is a multiple of 3);
- xi, yi, zi : the spatial coordinates of the i-th point of the nodule, where zi is the slice index.
Data location: @news-ai:/baina/sda1/data/lidc_matrix/ (DAT files hold the matrix, TXT files hold the annotations)
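A reader for the TXT layout described above might look like the following sketch. It assumes whitespace-separated fields and parses coordinates as floats to be permissive. `parse_annotation` and `Header` are hypothetical names. The header of the sample reuses the LIDC-IDRI-0001 values quoted in this post, while the region lines are made up.

```python
from typing import NamedTuple

# Order taken from the DATA_TYPE enum quoted above; index 4 is SHORT_TYPE.
DATA_TYPES = ["CHAR_TYPE", "UCHAR_TYPE", "INT_TYPE", "UINT_TYPE",
              "SHORT_TYPE", "USHORT_TYPE", "FLOAT_TYPE", "DOUBLE_TYPE"]
TYPE_NAMES = {1: "nodules", 2: "small_nodules", 3: "non_nodules"}

class Header(NamedTuple):
    slicers: int
    rows: int
    cols: int
    data_type: str
    pixel_space_x: float
    pixel_space_y: float
    slice_thickness: float

def parse_annotation(text):
    """Return (header, regions); each region is (type name, list of points)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    h = lines[0]
    header = Header(int(h[0]), int(h[1]), int(h[2]), DATA_TYPES[int(h[3])],
                    float(h[4]), float(h[5]), float(h[6]))
    regions = []
    for fields in lines[1:]:
        kind, num = int(fields[0]), int(fields[1])
        coords = [float(v) for v in fields[2:2 + num]]
        if num % 3 != 0 or len(coords) != num:
            raise ValueError("num must be a multiple of 3 and match the data")
        # Group the flat list into (x, y, z) points.
        points = [tuple(coords[i:i + 3]) for i in range(0, num, 3)]
        regions.append((TYPE_NAMES[kind], points))
    return header, regions

# Sample in the described layout; the two region lines are invented.
sample = """133 512 512 4 0.703125 0.703125 2.5
1 6 300 250 40 301 250 40
3 3 120 310 77
"""
header, regions = parse_annotation(sample)
print(header.data_type, header.slice_thickness)
for name, points in regions:
    print(name, len(points), points[0])
```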
Data analysis
File structure
1012 cases have been tested so far. Each case folder has the following structure:
LIDC-IDRI-XXXX / Study Instance UID / Series Instance UID / *.dcm *.xml
- XXXX : from 0000 to 1012;
- Study Instance UID : the study instance identifier of each case;
- Series Instance UID : the series instance identifier of each examination;
- *.dcm, *.xml : for parsing details see "LIDC-IDRI image annotation processing notes".
Special case: LIDC-IDRI-0365 contains two series, each with its own dcm and xml files.
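A small script can list the series folders of each case and flag cases that hold more than one series, such as LIDC-IDRI-0365. The sketch below assumes the three-level layout given above. `find_series` is a hypothetical helper, and the temporary tree with placeholder folder names only exists to exercise it.

```python
import os
import tempfile
from collections import defaultdict

def find_series(root):
    """Map each case folder to the series folders that contain .dcm files.

    Assumes the layout LIDC-IDRI-XXXX/<study UID>/<series UID>/*.dcm.
    """
    series = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.lower().endswith(".dcm") for name in filenames):
            rel = os.path.relpath(dirpath, root).split(os.sep)
            if len(rel) == 3:                      # case / study / series
                series[rel[0]].append(os.path.join(rel[1], rel[2]))
    return {case: sorted(paths) for case, paths in series.items()}

# Build a throwaway tree with placeholder UIDs to exercise the function.
with tempfile.TemporaryDirectory() as root:
    layout = [("LIDC-IDRI-0001", "study-a", "series-a"),
              ("LIDC-IDRI-0365", "study-b", "series-b"),
              ("LIDC-IDRI-0365", "study-c", "series-c")]
    for parts in layout:
        folder = os.path.join(root, *parts)
        os.makedirs(folder)
        open(os.path.join(folder, "000001.dcm"), "wb").close()
    found = find_series(root)

# Cases with more than one series need to be handled separately.
multi = sorted(case for case, paths in found.items() if len(paths) > 1)
print(multi)
```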
**Key DICOM information**
e.g. 000001.dcm in LIDC-IDRI-0001 (GE MEDICAL SYSTEMS) reads as follows (see "Common DICOM tags: categories and descriptions"):
(0008, 0005) Specific Character Set CS: 'ISO_IR 100'
(0008, 0008) Image Type CS: ['ORIGINAL', 'PRIMARY', 'AXIAL']
(0008, 0016) SOP Class UID UI: CT Image Storage
(0008, 0018) SOP Instance UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.143451261327128179989900675595
(0008, 0020) Study Date DA: '20000101'
(0008, 0021) Series Date DA: '20000101'
(0008, 0022) Acquisition Date DA: '20000101'
(0008, 0023) Content Date DA: '20000101'
(0008, 0024) Overlay Date DA: '20000101'
(0008, 0025) Curve Date DA: '20000101'
(0008, 002a) Acquisition DateTime DT: '20000101'
(0008, 0030) Study Time TM: ''
(0008, 0032) Acquisition Time TM: ''
(0008, 0033) Content Time TM: ''
(0008, 0050) Accession Number SH: '2819497684894126'
(0008, 0060) Modality CS: 'CT'
(0008, 0070) Manufacturer LO: 'GE MEDICAL SYSTEMS'
(0008, 0090) Referring Physician Name PN: ''
(0008, 1090) Manufacturer Model Name LO: 'LightSpeed Plus'
(0008, 1155) Referenced SOP Instance UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.675906998158803995297223798692
(0010, 0010) Patient Name PN: ''
(0010, 0020) Patient ID LO: 'LIDC-IDRI-0001'
(0010, 0030) Patient Birth Date DA: ''
(0010, 0040) Patient Sex CS: ''
(0010, 1010) Patient Age AS: ''
(0010, 21d0) Last Menstrual Date DA: '20000101'
(0012, 0062) Patient Identity Removed CS: 'YES'
(0012, 0063) De-identification Method LO: 'DCM:113100/113105/113107/113108/113109/113111'
(0013, 0010) Private Creator LO: 'CTP'
(0013, 1010) Private tag data LO: 'LIDC-IDRI'
(0013, 1013) Private tag data LO: '62796001'
(0018, 0010) Contrast/Bolus Agent LO: 'IV'
(0018, 0015) Body Part Examined CS: 'CHEST'
(0018, 0022) Scan Options CS: 'HELICAL MODE'
(0018, 0050) Slice Thickness DS: '2.500000'
(0018, 0060) KVP DS: '120'
(0018, 0090) Data Collection Diameter DS: '500.000000'
(0018, 1020) Software Version(s) LO: 'LightSpeedApps2.4.2_H2.4M5'
(0018, 1100) Reconstruction Diameter DS: '360.000000'
(0018, 1110) Distance Source to Detector DS: '949.075012'
(0018, 1111) Distance Source to Patient DS: '541.000000'
(0018, 1120) Gantry/Detector Tilt DS: '0.000000'
(0018, 1130) Table Height DS: '144.399994'
(0018, 1140) Rotation Direction CS: 'CW'
(0018, 1150) Exposure Time IS: '570'
(0018, 1151) X-Ray Tube Current IS: '400'
(0018, 1152) Exposure IS: '4684'
(0018, 1160) Filter Type SH: 'BODY FILTER'
(0018, 1170) Generator Power IS: '48000'
(0018, 1190) Focal Spot(s) DS: '1.200000'
(0018, 1210) Convolution Kernel SH: 'STANDARD'
(0018, 5100) Patient Position CS: 'FFS'
(0020, 000d) Study Instance UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178
(0020, 000e) Series Instance UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192
(0020, 0010) Study ID SH: ''
(0020, 0011) Series Number IS: '3000566'
(0020, 0013) Instance Number IS: '80'
(0020, 0032) Image Position (Patient) DS: ['-166.000000', '-171.699997', '-207.500000']
(0020, 0037) Image Orientation (Patient) DS: ['1.000000', '0.000000', '0.000000', '0.000000', '1.000000', '0.000000']
(0020, 0052) Frame of Reference UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.229925374658226729607867499499
(0020, 1040) Position Reference Indicator LO: 'SN'
(0020, 1041) Slice Location DS: '-207.500000'
(0028, 0002) Samples per Pixel US: 1
(0028, 0004) Photometric Interpretation CS: 'MONOCHROME2'
(0028, 0010) Rows US: 512
(0028, 0011) Columns US: 512
(0028, 0030) Pixel Spacing DS: ['0.703125', '0.703125']
(0028, 0100) Bits Allocated US: 16
(0028, 0101) Bits Stored US: 16
(0028, 0102) High Bit US: 15
(0028, 0103) Pixel Representation US: 1
(0028, 0120) Pixel Padding Value US: 63536
(0028, 0303) Longitudinal Temporal Information M CS: 'MODIFIED'
(0028, 1050) Window Center DS: '-600'
(0028, 1051) Window Width DS: '1600'
(0028, 1052) Rescale Intercept DS: '-1024'
(0028, 1053) Rescale Slope DS: '1'
(0038, 0020) Admitting Date DA: '20000101'
(0040, 0002) Scheduled Procedure Step Start Date DA: '20000101'
(0040, 0004) Scheduled Procedure Step End Date DA: '20000101'
(0040, 0244) Performed Procedure Step Start Date DA: '20000101'
(0040, 2016) Placer Order Number / Imaging Servi LO: ''
(0040, 2017) Filler Order Number / Imaging Servi LO: ''
(0040, a075) Verifying Observer Name PN: 'Removed by CTP'
(0040, a123) Person Name PN: 'Removed by CTP'
(0040, a124) UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.335419887712224178340067932923
(0070, 0084) Content Creator's Name PN: ''
(0088, 0140) Storage Media File-set UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.211790042620307056609660772296
(7fe0, 0010) Pixel Data OW: Array of 524288 bytes
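Two values in this dump can be cross-checked with a few lines of Python. Pixel Data holds 524288 bytes, which should equal Rows * Columns * Bits Allocated / 8. Pixel Padding Value is printed as the unsigned number 63536 even though Pixel Representation is 1 (two's complement), so the same 16 bits can be reinterpreted as a signed value. The sketch uses only the standard library, and `as_signed16` is a hypothetical helper name.

```python
import struct

rows, cols, bits_allocated = 512, 512, 16      # values from the dump above
pixel_bytes = rows * cols * bits_allocated // 8
print(pixel_bytes)                              # size of one slice in bytes

def as_signed16(value):
    """Reinterpret an unsigned 16-bit value as two's complement."""
    return struct.unpack("<h", struct.pack("<H", value))[0]

padding = as_signed16(63536)                    # Pixel Padding Value, signed view
print(padding)
```

Read as a signed short, the padding value is -2000, which is the number to look for when masking padded pixels in a matrix stored as short.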
e.g. 000001.dcm in LIDC-IDRI-0069 (TOSHIBA) reads as follows:
(0008, 0008) Image Type CS: ['ORIGINAL', 'PRIMARY', 'AXIAL']
(0008, 0016) SOP Class UID UI: CT Image Storage
(0008, 0018) SOP Instance UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.263800607656124864093833884216
(0008, 0020) Study Date DA: '20000101'
(0008, 0021) Series Date DA: '20000101'
(0008, 0022) Acquisition Date DA: '20000101'
(0008, 0023) Content Date DA: '20000101'
(0008, 0024) Overlay Date DA: '20000101'
(0008, 0025) Curve Date DA: '20000101'
(0008, 002a) Acquisition DateTime DT: '20000101'
(0008, 0030) Study Time TM: ''
(0008, 0032) Acquisition Time TM: '185549.500'
(0008, 0033) Content Time TM: '185605.277'
(0008, 0050) Accession Number SH: '2819497684894126'
(0008, 0060) Modality CS: 'CT'
(0008, 0070) Manufacturer LO: 'TOSHIBA'
(0008, 0090) Referring Physician Name PN: ''
(0008, 1090) Manufacturer Model Name LO: 'Aquilion'
(0010, 0010) Patient Name PN: ''
(0010, 0020) Patient ID LO: 'LIDC-IDRI-0069'
(0010, 0030) Patient Birth Date DA: ''
(0010, 0040) Patient Sex CS: 'M'
(0010, 1010) Patient Age AS: '051Y'
(0010, 2160) Ethnic Group SH: 'white-ns'
(0010, 21c0) Pregnancy Status US: 4
(0010, 21d0) Last Menstrual Date DA: '20000101'
(0012, 0062) Patient Identity Removed CS: 'YES'
(0012, 0063) De-identification Method LO: 'DCM:113100/113105/113107/113108/113109/113111'
(0013, 0010) Private Creator OB: 'CTP '
(0013, 1010) Private tag data OB: 'LIDC-IDRI '
(0013, 1013) Private tag data OB: '62796001'
(0018, 0010) Contrast/Bolus Agent LO: '100ccs_OMNI-350'
(0018, 0015) Body Part Examined CS: 'CHEST'
(0018, 0022) Scan Options CS: 'HELICAL_CT'
(0018, 0050) Slice Thickness DS: '2.0'
(0018, 0060) KVP DS: '135'
(0018, 0090) Data Collection Diameter DS: '400.00'
(0018, 1020) Software Version(s) LO: 'V2.04ER001'
(0018, 1100) Reconstruction Diameter DS: '379.687'
(0018, 1120) Gantry/Detector Tilt DS: '+0.0'
(0018, 1130) Table Height DS: '+48.00'
(0018, 1140) Rotation Direction CS: 'CW'
(0018, 1150) Exposure Time IS: '500'
(0018, 1151) X-Ray Tube Current IS: '260'
(0018, 1152) Exposure IS: '130'
(0018, 1210) Convolution Kernel SH: 'FC10'
(0018, 5100) Patient Position CS: 'FFS'
(0020, 000d) Study Instance UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.303241414168367763244410429787
(0020, 000e) Series Instance UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.131939324905446238286154504249
(0020, 0010) Study ID SH: ''
(0020, 0011) Series Number IS: '3079'
(0020, 0012) Acquisition Number IS: '5'
(0020, 0013) Instance Number IS: '134'
(0020, 0020) Patient Orientation CS: ['L', 'P']
(0020, 0032) Image Position (Patient) DS: ['-184.375000', '-188.281200', '1292.500000']
(0020, 0037) Image Orientation (Patient) DS: ['1.000000', '0.000000', '0.000000', '0.000000', '1.000000', '0.000000']
(0020, 0052) Frame of Reference UID UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.228313061349684266844487315959
(0020, 1040) Position Reference Indicator LO: ''
(0020, 1041) Slice Location DS: '+324.00'
(0028, 0002) Samples per Pixel US: 1
(0028, 0004) Photometric Interpretation CS: 'MONOCHROME2'
(0028, 0010) Rows US: 512
(0028, 0011) Columns US: 512
(0028, 0030) Pixel Spacing DS: ['0.741', '0.741']
(0028, 0100) Bits Allocated US: 16
(0028, 0101) Bits Stored US: 16
(0028, 0102) High Bit US: 15
(0028, 0103) Pixel Representation US: 1
(0028, 0303) Longitudinal Temporal Information M CS: 'MODIFIED'
(0028, 1050) Window Center DS: '-500'
(0028, 1051) Window Width DS: '2000'
(0028, 1052) Rescale Intercept DS: '0'
(0028, 1053) Rescale Slope DS: '1'
(0032, 000a) Study Status ID CS: ''
(0032, 1000) Scheduled Study Start Date DA: ''
(0032, 1001) Scheduled Study Start Time TM: ''
(0032, 1060) Requested Procedure Description LO: ''
(0032, 1064) Requested Procedure Code Sequence 1 item(s) ----
(0008, 0104) Code Meaning LO: ''
---------
(0038, 0020) Admitting Date DA: '20000101'
(0040, 0002) Scheduled Procedure Step Start Date DA: '20000101'
(0040, 0003) Scheduled Procedure Step Start Time TM: ''
(0040, 0004) Scheduled Procedure Step End Date DA: '20000101'
(0040, 0005) Scheduled Procedure Step End Time TM: ''
(0040, 0244) Performed Procedure Step Start Date DA: '20000101'
(0040, 0245) Performed Procedure Step Start Time TM: ''
(0040, 2016) Placer Order Number / Imaging Servi LO: ''
(0040, 2017) Filler Order Number / Imaging Servi LO: ''
(0040, a075) Verifying Observer Name PN: 'Removed by CTP'
(0040, a123) Person Name PN: 'Removed by CTP'
(0070, 0084) Content Creator Name PN: ''
(7fe0, 0010) Pixel Data OB or OW: Array of 524288 bytes
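The header layout differs somewhat between manufacturers, but the main fields are always present:
- SOP Instance UID uniquely identifies each dcm slice. Study Instance UID and Series Instance UID, mentioned above, identify the examination and the series within an examination respectively;
- Modality is the imaging modality, e.g. MRI, CT, CR, DR;
- Manufacturer is the scanner manufacturer. Analysis shows that four manufacturers contributed data: "GE MEDICAL SYSTEMS" (the most), "SIEMENS", "TOSHIBA" and "Philips". See /baina/sda1/data/lidc_matrix/information.txt;
- Slice Thickness is the slice thickness along z. Observed values are GE MEDICAL SYSTEMS: 2.50, 1.25; SIEMENS: 0.75, 1.0, 2.0, 3.0, 5.0; TOSHIBA: 2.0, 3.0; Philips: 2.0, 1.0, 1.5, 0.9;
- Instance Number is the index of a slice within its series and can be used directly to sort slices. A CT scan starts at the head end of the chest and proceeds to the bottom of the lungs, so Instance Number increases while the z of Image Position decreases. Slice Location is a relative position. In the vast majority of cases it equals the z of Image Position and decreases as well, but for some manufacturers, such as TOSHIBA, it may differ: being a relative position, its value is positive and it moves in the same direction as Instance Number. To avoid errors in the analysis, slices should not be sorted by Slice Location alone but by Instance Number or by the z of Image Position. This experiment uses Instance Number;
- Image Position gives the x, y, z coordinates of the upper-left corner of the image in the spatial coordinate system, in millimetres. At the examination level it refers to the upper-left corner of the first image in the series;
- Slice Location is the relative position of the slice along z, in millimetres. It usually equals the z of Image Position, but not in the data from TOSHIBA scanners, so it cannot be used on its own to sort all slices consistently;
- Photometric Interpretation tells whether the image is in colour. CT images use the two enumerated values MONOCHROME1 and MONOCHROME2, both of which are grayscale; RGB is true colour, and other values exist;
- Pixel Spacing is the physical distance between pixel centres;
- Bits Allocated is the number of bits allocated for each pixel, and Bits Stored is the number of bits actually used for each pixel;
- Pixel Representation is the representation of the pixel data. It is an enumerated value, hexadecimal 0000 or 0001: 0000H = unsigned integer, 0001H = two's complement.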
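The sorting pitfall described in the list can be reproduced with plain dictionaries. The three records below are made-up metadata for a TOSHIBA-like series in which z decreases with Instance Number while Slice Location increases. In practice these values would be read from the DICOM headers. `order_by` is a hypothetical helper.

```python
# Made-up metadata for three slices of a TOSHIBA-like series.
slices = [
    {"instance_number": 2, "z": 1290.5, "slice_location": 326.0},
    {"instance_number": 3, "z": 1288.5, "slice_location": 328.0},
    {"instance_number": 1, "z": 1292.5, "slice_location": 324.0},
]

def order_by(records, key, reverse=False):
    """Return Instance Numbers in the order produced by sorting on `key`."""
    ranked = sorted(records, key=lambda r: r[key], reverse=reverse)
    return [r["instance_number"] for r in ranked]

by_instance = order_by(slices, "instance_number")
# Head-to-foot order means decreasing z, so sort z in descending order.
by_z = order_by(slices, "z", reverse=True)
# The same descending rule applied to Slice Location.
by_location = order_by(slices, "slice_location", reverse=True)

print(by_instance)   # reference order
print(by_z)          # agrees with Instance Number
print(by_location)   # reversed for this series
```

For a series where Slice Location equals the z of Image Position, the third ordering would match the other two. Because that cannot be assumed for every manufacturer, Instance Number or z is the safer sort key.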
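**Key XML information**
Analysing the XML annotations of all 1012 patients turned up the following problem:
The radiologists' annotations may contain errors (in my personal opinion)!
After running the annotation script (/home/zhwhong/API/get_txt.sh) on all cases, the generated log (/baina/sda1/data/lidc_matrix/get_txt.log) showed four problematic cases: LIDC-IDRI-0017, LIDC-IDRI-0365, LIDC-IDRI-0566 and LIDC-IDRI-0659.
【LIDC-IDRI-0017】
The sop_uid reported as missing is "1.3.6.1.4.1.14519.5.2.1.6279.6001.305973183883758685859912046949". Next we open the XML file of case 17 and look at the radiologists' annotations:
Two annotations carry this sop_uid, one from radiologist 2 and one from radiologist 4. Their annotations are:
Radiologist 2:
Radiologist 4:
So two radiologists marked this sop_uid, with an ImageZposition of -82.75. Searching the XML file for ImageZposition -82.75 shows that the other two radiologists annotated that position too, but the sop_uid they recorded for the slice at -82.75 differs from the one used by radiologists 2 and 4:
Radiologist 1:
Radiologist 3:
This is awkward: the same ImageZposition, but different sop_uid values. To get to the bottom of it, I wrote a script that walks all dcm slices of LIDC-IDRI-0017, prints the sop_uid of every slice and compares the results. The sop_uid marked by radiologists 2 and 4 does not appear anywhere, whereas the one marked by radiologists 1 and 3 does:
The sop_uid marked by radiologists 2 and 4 is not found:
The sop_uid marked by radiologists 1 and 3 is found:
The preliminary conclusion is that in case LIDC-IDRI-0017, radiologists 2 and 4 each have an erroneous annotation (two in total) with a wrong sop_uid.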
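The check carried out for case 17 can be sketched as follows. The element names `roi`, `imageZposition` and `imageSOP_UID` are an assumption about the XML layout, the UIDs in the sample are placeholders rather than real LIDC values, and `dicom_uids` stands for the SOP Instance UID values collected from the dcm files of the series. This is an illustration, not the script used for this post.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def local(tag):
    """Strip an XML namespace prefix such as '{http://...}roi'."""
    return tag.rsplit("}", 1)[-1]

def roi_references(xml_text):
    """Yield (imageZposition, imageSOP_UID) for every roi element."""
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) == "roi":
            fields = {local(c.tag): (c.text or "").strip() for c in elem}
            yield float(fields["imageZposition"]), fields["imageSOP_UID"]

def check(xml_text, dicom_uids):
    """Return (UIDs missing from the series, z positions with several UIDs)."""
    uids_at_z = defaultdict(set)
    for z, uid in roi_references(xml_text):
        uids_at_z[z].add(uid)
    referenced = set().union(*uids_at_z.values())
    missing = sorted(referenced - set(dicom_uids))
    conflicts = {z: sorted(u) for z, u in uids_at_z.items() if len(u) > 1}
    return missing, conflicts

# Made-up miniature annotation with two readers marking the same z position.
sample = """<LidcReadMessage>
  <readingSession><unblindedReadNodule>
    <roi><imageZposition>-82.75</imageZposition><imageSOP_UID>1.2.3.100</imageSOP_UID></roi>
  </unblindedReadNodule></readingSession>
  <readingSession><unblindedReadNodule>
    <roi><imageZposition>-82.75</imageZposition><imageSOP_UID>1.2.3.999</imageSOP_UID></roi>
  </unblindedReadNodule></readingSession>
</LidcReadMessage>"""

missing, conflicts = check(sample, dicom_uids={"1.2.3.100", "1.2.3.101"})
print(missing)     # UIDs referenced in the XML but absent from the series
print(conflicts)   # z positions annotated with more than one UID
```

A non-empty `missing` list corresponds to the log entries described above, and `conflicts` exposes the situation where readers disagree on the UID of the same slice position.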
【LIDC-IDRI-0365】
LIDC-IDRI-0365 contains two series (Study Instance UID / Series Instance UID):
1.3.6.1.4.1.14519.5.2.1.6279.6001.212341120080087350703610584139 / 1.3.6.1.4.1.14519.5.2.1.6279.6001.207544473852086582434957174616
and
1.3.6.1.4.1.14519.5.2.1.6279.6001.216207548522622026268886920069 / 1.3.6.1.4.1.14519.5.2.1.6279.6001.802846969823720586279982179144
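The problem lies in the second series and is similar to the one in case 17:
The radiologists' annotation is as follows (identical for all four radiologists):
Walking the second series of LIDC-IDRI-0365 in the same way, the sop_uid of the annotated slice cannot be found: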
【LIDC-IDRI-0566】
The same problem as above occurs:
【LIDC-IDRI-0659】