This code might be useful to someone:
import sys
import xml.etree.ElementTree as ET

def parsefile(path):
    # Python 2: read the raw bytes, unescape them as unicode,
    # then re-encode to UTF-8 before handing them to ElementTree.
    try:
        with open(path, "r") as f:
            fileread = f.read()
        fileread = unescape(fileread.decode('utf-8')).encode('utf-8')
    except (IOError, UnicodeError):
        print "Reading File Bug"
        sys.exit(1)
    return ET.fromstring(fileread)
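The routine for unescaping HTML entities comes from Fredrik Lundh's site. As published, it does too much for my purposes, because it also converts &amp;, &gt; and &lt;. I want to keep those escaped in URLs and in the places where I have escaped code snippets, so I modified it slightly to suit my own needs.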
import re
import htmlentitydefs

def unescape(text):
    """Removes HTML or XML character references
       and entities from a text string.
       Keeps &amp;, &gt; and &lt; in the source code.
    from Fredrik Lundh
    http://effbot.org/zone/re-sub.htm#unescape-html
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                print "erreur de valeur"
        else:
            # named entity
            try:
                # leave the XML-significant entities escaped
                if text[1:-1] == "amp":
                    text = "&amp;"
                elif text[1:-1] == "gt":
                    text = "&gt;"
                elif text[1:-1] == "lt":
                    text = "&lt;"
                else:
                    print text[1:-1]
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                print "keyerror"
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
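The code above is Python 2. For anyone on Python 3, the same idea could be sketched roughly as follows. This is only an illustration, not Lundh's code: the names `unescape_keep_markup` and `KEEP` are made up here, and unknown entities are silently left alone instead of printed.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical name: entities that stay escaped so the text remains valid XML.
KEEP = {"amp", "gt", "lt"}

def unescape_keep_markup(text):
    """Resolve character references and named entities,
    but leave &amp;, &gt; and &lt; untouched."""
    def fixup(m):
        ref = m.group(0)
        if ref.startswith("&#"):
            # numeric character reference, decimal or hexadecimal
            try:
                if ref[2:3] in ("x", "X"):
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # malformed reference: leave as is
        name = ref[1:-1]
        if name in KEEP or name not in name2codepoint:
            return ref  # protected or unknown entity: leave as is
        return chr(name2codepoint[name])
    return re.sub(r"&#?\w+;", fixup, text)

# The HTML-only entity &eacute; is resolved, while &amp;, &lt; and &gt;
# survive, so ElementTree can still parse the result.
sample = "<p>caf&eacute; &amp; th&#233; &lt;b&gt; &#x20AC;5</p>"
cleaned = unescape_keep_markup(sample)
root = ET.fromstring(cleaned)
print(cleaned)
print(root.text)
```

Note that, as in the Python 2 version, a numeric reference such as `&#38;` is still decoded to a bare `&`, so input that escapes ampersands that way would need extra handling before XML parsing.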
Hope this helps.