- Learn XPath, and use lxml with XPath to extract content.
- Use XPath to extract the reply content from a DXY (Dingxiangyuan) forum thread.
- DXY thread link: http://www.dxy.cn/bbs/thread/626626#626626
- Reference: https://blog.csdn.net/naonao77/article/details/88129994
1. Learning XPath
XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions. (w3school tutorial: http://www.w3school.com.cn/xpath/index.asp)
Reference link: Parsing HTML with lxml
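
To make the XPath ideas above concrete, here is a minimal sketch that runs a few common expressions through lxml. The HTML fragment, its `books` id, and the `item` class are invented for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical sample markup, invented for this illustration only.
SAMPLE_HTML = """
<html><body>
  <ul id="books">
    <li class="item" data-id="1"><a href="/a">Alpha</a></li>
    <li class="item" data-id="2"><a href="/b">Beta</a></li>
    <li class="other" data-id="3"><a href="/c">Gamma</a></li>
  </ul>
</body></html>
"""

# etree.HTML parses (possibly malformed) HTML into an element tree.
tree = etree.HTML(SAMPLE_HTML)

# // searches the whole document; [@class="item"] filters by attribute value.
items = tree.xpath('//li[@class="item"]')

# text() returns the text nodes of the matched elements.
titles = tree.xpath('//li[@class="item"]/a/text()')

# @href returns attribute values instead of elements.
links = tree.xpath('//ul[@id="books"]/li/a/@href')

# string(.) concatenates all descendant text of the context node.
first_text = items[0].xpath('string(.)')

print(len(items), titles, links, first_text)
```

Note that `xpath()` returns a list for node-set expressions such as `//li` or `text()`, but a plain string for `string(.)`.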
2. Using XPath to extract reply content from the DXY forum
import requests
from lxml import etree


def getItem():
    # Browser-like headers so the forum serves its normal HTML page
    headers = {
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "text/html,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip,deflate,sdch",
        "Accept-Language": "zh-CN,zh;q=0.8"
    }
    url = 'http://www.dxy.cn/bbs/thread/626626#626626'
    # A timeout keeps the script from hanging if the server does not respond
    response = requests.get(url, headers=headers, timeout=10)
    html = response.text
    tree = etree.HTML(html)
    # Poster names and the table cells that hold each reply
    users = tree.xpath('//div[@class="auth"]/a/text()')
    contents = tree.xpath('//td[@class="postbody"]')
    # zip stops at the shorter list, so a length mismatch cannot raise IndexError;
    # string(.) flattens each reply cell to plain text
    for user, content in zip(users, contents):
        print(user.strip() + ":" + content.xpath('string(.)').strip())


if __name__ == '__main__':
    getItem()
Output:
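
Because the live thread may change, move, or require login, the extraction step can also be exercised offline. The sketch below separates parsing from fetching and runs the same two selectors against a small fragment. The helper name `extract_replies` and the sample markup are assumptions for illustration: the fragment only imitates the `auth` and `postbody` class names and is not copied from the real page.

```python
from lxml import etree


def extract_replies(html):
    """Return (user, reply_text) pairs using the same XPath selectors as getItem()."""
    tree = etree.HTML(html)
    users = tree.xpath('//div[@class="auth"]/a/text()')
    bodies = tree.xpath('//td[@class="postbody"]')
    # Pair each poster with a reply cell; string(.) flattens nested tags to text.
    return [(u.strip(), b.xpath('string(.)').strip()) for u, b in zip(users, bodies)]


# Hypothetical fragment that only imitates the class names used above;
# it is not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr>
    <td><div class="auth"><a href="#"> user_a </a></div></td>
    <td class="postbody"> First <b>reply</b> text. </td>
  </tr>
  <tr>
    <td><div class="auth"><a href="#">user_b</a></div></td>
    <td class="postbody">Second reply.</td>
  </tr>
</table>
"""

replies = extract_replies(SAMPLE_HTML)
for user, text in replies:
    print(user + ":" + text)
```

Keeping the parser in its own function means the network call can be swapped out or mocked, and the selectors can be checked against a saved copy of the page when the site layout changes.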