Parsing XML files with Python

The XML file


Here pathway is the root node of the whole XML document.

XML structure

The overall structure of the XML is:

pathway

----entry

--------graphics

----entry

--------graphics

----entry

--------graphics

----....

----relation

--------subtype

----....
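To make the outline concrete, here is a minimal sketch of a document with this shape, parsed from an inline string. The ids, names and the subtype value are invented for illustration, and `outline` is a hypothetical helper; the real Pathway_5.xml will contain more attributes and many more nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the outline above; all attribute values are invented.
SAMPLE_XML = """
<pathway>
    <entry id="1"><graphics name="GeneA"/></entry>
    <entry id="2"><graphics name="GeneB"/></entry>
    <relation entry1="1" entry2="2"><subtype name="activation"/></relation>
</pathway>
"""


def outline(node, depth=0):
    """Return the tag names of node and its descendants, indented by depth."""
    lines = ["----" * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE_XML.strip())
print("\n".join(outline(root)))
```

Printing the outline this way is a quick check that a file really has the nesting you expect before writing the extraction code.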

The goal is to extract the entry1 and entry2 attributes from each relation node and find the entry whose id matches each of them.

The basic approach is:

1. Get all the relation nodes and iterate over them.

2. Read the attributes of the current node and take the values of its entry1 and entry2 attributes.

3. Search the entry nodes for the matching entry id.

4. Build the final result.

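Searching the entry nodes again on every iteration over the relation nodes wastes time: the work grows with the number of relations multiplied by the number of entries.

Instead, first build an entry dictionary whose key is the entry id and whose value is the name of the entry's graphics node. Then, while iterating over the relations, the entry1 and entry2 values can be looked up directly in that dictionary, so no nested for loops are needed.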
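The difference between the two approaches can be sketched as follows. This is only an illustration on an inline sample with invented values; `find_name_by_scan` and `build_entry_index` are hypothetical helper names, not part of the original script.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the real file; ids and names are invented.
root = ET.fromstring(
    '<pathway>'
    '<entry id="1"><graphics name="GeneA"/></entry>'
    '<entry id="2"><graphics name="GeneB"/></entry>'
    '<relation entry1="1" entry2="2"><subtype name="activation"/></relation>'
    '</pathway>'
)


def find_name_by_scan(root, entry_id):
    """Naive lookup: scan the entries on every call."""
    for entry in root.findall('entry'):
        if entry.get('id') == entry_id:
            graphics = entry.find('graphics')
            return graphics.get('name') if graphics is not None else None
    return None


def build_entry_index(root):
    """One pass over the entries: map entry id -> graphics name."""
    index = {}
    for entry in root.findall('entry'):
        graphics = entry.find('graphics')
        if graphics is not None:
            index[entry.get('id')] = graphics.get('name')
    return index


index = build_entry_index(root)
for relation in root.findall('relation'):
    e1, e2 = relation.get('entry1'), relation.get('entry2')
    # Both approaches give the same answer; the dictionary avoids the inner scan.
    assert index.get(e1) == find_name_by_scan(root, e1)
    print(index.get(e1), index.get(e2))
```

With the dictionary, each relation costs two constant-time lookups instead of two passes over the entry list.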

The XML parsing package used here is xml.etree.ElementTree.

Straight to the code =,=

Markdown mode was not turned on for this post... please bear with it, it is only a few lines.

import xml.etree.ElementTree as ET  # package for parsing XML files
import pandas  # needed only for the optional Excel export at the end

tree = ET.parse('Pathway_5.xml')  # open the XML file and parse it with xml.etree
root = tree.getroot()  # get the root node

entry_list = root.findall('entry')  # find all entry nodes
relation_list = root.findall('relation')  # find all relation nodes

entry_dic = {}  # empty dictionary: entry id -> graphics name

# Walk all entry nodes once, using the entry id as the dictionary key and the
# name of the graphics node inside the entry as the value.
# This avoids searching the entries again for every relation
# and avoids deeply nested for loops.
for i in entry_list:
    gene = i.findall('graphics')  # all graphics nodes of this entry, in case there is more than one
    if len(gene) == 1:
        gene = gene[0]
        if 'name' in gene.attrib:
            entry_dic[i.attrib['id']] = gene.attrib['name']  # build the key-value pair
        else:
            print(gene.attrib)
            entry_dic[i.attrib['id']] = 'none'

# Lists prepared for writing to Excel
entry1_name = []
entry2_name = []
subtype_name = []

# Iterate over the relations
for idx, i in enumerate(relation_list):
    # Skip this relation unless it has both entry1 and entry2
    if 'entry1' not in i.attrib or 'entry2' not in i.attrib:
        print("False relation : %s" % idx)
        continue
    # Get the ids of entry1 and entry2
    entry1_id = i.attrib['entry1']
    entry2_id = i.attrib['entry2']
    print(entry1_id, entry2_id)
    # A relation node may contain more than one subtype node
    subtype_name_list = []
    for k in i.findall('subtype'):
        if 'name' in k.attrib:
            subtype_name_list.append(k.attrib['name'])
        else:
            subtype_name_list.append('')
    # Ids missing from entry_dic (entries without exactly one graphics node) fall back to 'none'
    name1 = entry_dic.get(entry1_id, 'none')
    name2 = entry_dic.get(entry2_id, 'none')
    subtypes = ' '.join(subtype_name_list)
    # Append the results to the lists above; pandas needs lists to write Excel
    entry1_name.append(name1)
    entry2_name.append(name2)
    subtype_name.append(subtypes)
    # Append one line to the txt file
    with open('d.txt', 'a+') as f:
        f.write('%s \t\t\t\t%s\t\t\t\t%s\n' % (name1, name2, subtypes))

# # Write the Excel file
# file_name = 'outputs.xlsx'  # file name
# # Build a DataFrame; writing to Excel requires DataFrame data
# msg = pandas.DataFrame(data={'entry1_name': entry1_name, 'entry2_name': entry2_name, 'subtype_name': subtype_name})
# # Write to Excel; passing the path directly avoids ExcelWriter.save(),
# # which recent pandas versions no longer provide
# msg.to_excel(file_name, sheet_name='Sheet1')

Luckily the indentation was preserved....

Here are the APIs I used:

ET.parse('Pathway_5.xml')  # parse the file

tree.getroot()  # get the root node

root.findall('entry')  # find all direct children of the current node whose tag name is entry

.attrib  # get the attributes of the current node as a dictionary
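As a quick reference, the calls above can be tried on a small inline document. This is a sketch with invented values; it also shows `Element.get` and the `.//` path prefix, which the script above does not use but which are part of the standard ElementTree API.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for Pathway_5.xml; the values are invented.
root = ET.fromstring(
    '<pathway>'
    '<entry id="1"><graphics name="GeneA"/></entry>'
    '<relation entry1="1" entry2="1"/>'
    '</pathway>'
)

entries = root.findall('entry')       # direct children with tag "entry"
first = entries[0]
print(root.tag)                       # tag of the root element
print(first.attrib)                   # attributes as a plain dict
print(first.get('id'))                # like first.attrib['id'], but returns None if absent
print(first.get('missing', 'none'))   # optional default for absent attributes
print(root.findall('graphics'))       # a bare tag name does not search grandchildren
print(root.findall('.//graphics'))    # './/' searches all descendants
```

Using `get` with a default is one way to avoid the explicit `'name' in node.attrib` checks in the script.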

Last edited: 2018.07.29 14:44:09
