I recently had to parse a very large XML file and ran into a lot of pitfalls along the way. I wrote a Java program and a Spark program, and in the end processed it with Python.
XML parsing speed, Java vs. Spark vs. Python: Python > Spark > Java.
Because the input is XML, the integrity of each tag must be preserved, so even after the Spark job is submitted to YARN it runs on only one executor (albeit with multiple cores) and is still very slow: neither the Java nor the Spark version finished in an entire morning. Along the way I also hit OOM errors, since the memory of a single Spark executor is capped in the configuration files. As for Java, as everyone knows, it first has to read the whole file into memory (assuming there is enough), and once the intermediate results are stored on top of that, the memory used is far more than the 21 GB file size.
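A minimal sketch of what that whole-file, DOM-style load looks like (reusing the file path from the Python section below): lxml's etree.parse builds the entire tree in RAM before the first <doc> can be read, which is the same failure mode as the Java and Spark versions.

from lxml import etree

# DOM-style parse: the whole document is materialized in memory up front,
# so a 21 GB file needs well over 21 GB of RAM before any work starts
tree = etree.parse('data/patent_info_cited__GR_cited_Thread.xml')
docs = tree.getroot().findall('doc')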
Sample data (the data here is fairly simple, but the principle is the same):
<add overwrite="true" commitWithin="10000">
<doc><field name="id" ><![CDATA[286c9edd3f2721730a8cecdbfec94ee4X]]></field>
<field name="an-country" ><![CDATA[GR]]></field>
<field name="an" ><![CDATA[88100105]]></field>
<field name="an-kind" ><![CDATA[A]]></field>
<field name="pn-country" ><![CDATA[GR]]></field>
<field name="pn" ><![CDATA[880100105]]></field>
<field name="pn-kind" ><![CDATA[A]]></field>
<field name="ctfw-country" ><![CDATA[DE]]></field>
<field name="ctfw-num" ><![CDATA[DE2736069]]></field>
<field name="ctfw-kind" ><![CDATA[A1]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
</doc>
<doc><field name="id" ><![CDATA[caf2088f80da92f58c413d23d9cc8124X]]></field>
<field name="an-country" ><![CDATA[GR]]></field>
<field name="an" ><![CDATA[88100091]]></field>
<field name="an-kind" ><![CDATA[A]]></field>
<field name="pn-country" ><![CDATA[GR]]></field>
<field name="pn" ><![CDATA[880100091]]></field>
<field name="pn-kind" ><![CDATA[A]]></field>
<field name="ctfw-country" ><![CDATA[FR]]></field>
<field name="ctfw-country" ><![CDATA[GB]]></field>
<field name="ctfw-country" ><![CDATA[US]]></field>
<field name="ctfw-country" ><![CDATA[EP]]></field>
<field name="ctfw-country" ><![CDATA[EP]]></field>
<field name="ctfw-num" ><![CDATA[FR2585362]]></field>
<field name="ctfw-num" ><![CDATA[GB2141152]]></field>
<field name="ctfw-num" ><![CDATA[US4292035]]></field>
<field name="ctfw-num" ><![CDATA[EP0026529]]></field>
<field name="ctfw-num" ><![CDATA[EP0146289]]></field>
<field name="ctfw-kind" ><![CDATA[A1]]></field>
<field name="ctfw-kind" ><![CDATA[A]]></field>
<field name="ctfw-kind" ><![CDATA[A]]></field>
<field name="ctfw-kind" ><![CDATA[A1]]></field>
<field name="ctfw-kind" ><![CDATA[A2]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
<field name="srepphase" ><![CDATA[SEA]]></field>
</doc>
</add>
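Both programs below flatten each <doc> into a single output line of name:value pairs separated by commas; for the first <doc> above, the expected line is:

id:286c9edd3f2721730a8cecdbfec94ee4X, an-country:GR, an:88100105, an-kind:A, pn-country:GR, pn:880100105, pn-kind:A, ctfw-country:DE, ctfw-num:DE2736069, ctfw-kind:A1, srepphase:SEA, srepphase:SEA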
Spark code:
The Spark code likewise loads the entire file into memory, which is memory-hungry and slow to parse:
import java.io.{BufferedWriter, FileOutputStream, OutputStreamWriter}
import java.util

import org.apache.spark.sql.SparkSession

import scala.xml.XML

object ParseQuoteData1 {
  def main(args: Array[String]): Unit = {
    // Build the SparkSession
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("Parse_xml").getOrCreate()
    val sc = spark.sparkContext
    /* Equivalent setup with the older SparkContext API:
    val conf = new SparkConf().setAppName("quote_parse").setMaster("local[1]")
    conf.set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)
    */
    // Loads the ENTIRE file into driver memory; this is the bottleneck
    val someXML = XML.loadFile(args(0))
    // The root element is <add>, so its <doc> children are selected directly
    val docs = someXML \ "doc"
    val pubRef_len = docs.length
    val file = args(1)
    val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)))
    for (a <- 0 until pubRef_len) {
      val quotedata = docs(a)
      val fields = quotedata \ "field"
      val fields_nature = quotedata \ "field" \ "@name"
      val quotList = new util.ArrayList[String]()
      for (b <- 0 until fields.length) {
        val k = fields_nature(b).text // the name attribute of the <field>
        val v = fields(b).text        // the CDATA text of the <field>
        val line = k + ":" + v
        quotList.add(line)
      }
      // Strip the brackets from the list's toString to get "k:v, k:v, ..."
      val res = quotList.toString.replace("[", "").replace("]", "")
      println(res)
      writer.write(res + "\n")
    }
    writer.close()
  }
}
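Note that nothing in this job actually distributes the work: XML.loadFile runs on the driver and builds the whole DOM there, and the loop over the <doc> elements is plain sequential Scala, so the cluster's parallelism never comes into play. That is why only a single executor is busy no matter how the job is submitted.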
Python code:
Python's parsing approach handles huge files very well: even a file over 50 GB can be processed normally, and it is fast.
Parsing principle: iterate over the tags, pulling out one tag at a time to parse in memory and then releasing it, so memory consumption stays very small.
# -*- coding:utf-8 -*-
from lxml import etree
import time


def fast_iter(context, *args, **kwargs):
    """
    Read the XML data and free memory as we go.
    context: the iterator produced by etree.iterparse
    """
    # Open the output file
    with open('data/result.txt', 'a') as f:
        # event: the parse event ('end'); elem: the matched element
        # Process the XML data
        for event, elem in context:
            parts = []
            for e in elem:
                # Combine each tag's name attribute with its text value
                s1 = e.get("name") + ":" + e.text
                # print(e.get("name") + ":" + e.text)
                parts.append(s1)
            # Join into one comma-separated line (same effect as stripping
            # the brackets and quotes from str(list))
            res = ', '.join(parts)
            f.write(res)  # write out
            f.write('\n')
            # Reset the element, clearing its internal data
            elem.clear()
            # Select all ancestors of the current node (parent, grandparent,
            # and so on) as well as the node itself
            for ancestor in elem.xpath('ancestor-or-self::*'):
                # While the current node still has a previous sibling, delete
                # the parent's first child. getprevious(): returns the node's
                # previous sibling, or None.
                while ancestor.getprevious() is not None:
                    # Delete the parent's first child. getparent(): returns
                    # the node's parent element, the root element, or None.
                    del ancestor.getparent()[0]
    # Release the iterator
    del context


def process_element(elem):
    """
    Process an element.
    :param elem: Element
    (Leftover helper from a gene-parsing template; fast_iter above does its
    processing inline and never calls this.)
    """
    # Store the list of genes
    gene_list = []
    for i in elem.xpath('add'):
        # Get the gene name
        gene = i.text
        # Append it to the list
        gene_list.append(gene)
    print('gene', gene_list)


if __name__ == '__main__':
    print('start', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()))
    start = time.time()
    # Path of the file to process
    infile = r'data/patent_info_cited__GR_cited_Thread.xml'
    # Iterate through the XML; if the document declares a namespace, the tag
    # must be namespace-qualified:
    # context = etree.iterparse(infile, events=('end',), encoding='UTF-8', tag='{http://uniprot.org/uniprot}doc')
    context = etree.iterparse(infile, events=('end',), encoding='UTF-8', tag='doc')
    # Stream through the XML data
    fast_iter(context, process_element)
    print('stop', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()))
    print('time', time.time() - start)
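To confirm that memory really stays flat while the iterator runs, one option is to print the process's peak RSS from time to time; a minimal sketch using the standard-library resource module (Unix-only; on Linux ru_maxrss is reported in kilobytes):

import resource

# Peak resident set size of the current process so far (KB on Linux)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS: %d KB' % peak_kb)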