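1. Extracting text from div tags
From the xpath.html file used in the course video, use XPath to extract the text values "第一個div" ("first div") and "第二個div" ("second div") under the div tags in a structured way, and print them.
2. Extracting text from ul tags
From xpath.html, use XPath to extract the text values "流程" ("process"), "xpath學習" ("xpath learning"), and "流程2" ("process 2") under the ul tag in a structured way, and print them.
3. Filtering tags
From xpath.html, use XPath to extract the text and hyperlinks of the first three a tags under the first div in a structured way, and print them.
4. Combining the requests module with lxml and XPath to extract data
Building on the previous lesson's requests module material, extract the text and hyperlinks of the navigation bar of the Sunshine Movies site (陽光電影網, http://www.ygdy8.com/) in a structured way.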
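As a rough illustration of the XPath tasks above, the sketch below runs the same kinds of queries against a small inline HTML string. The markup in `SAMPLE_HTML`, the helper `non_empty`, and the assumption that the list items sit in `li` elements are all made up for this example; the real xpath.html from the course is not reproduced here and may be structured differently.

```python
from lxml import etree

# Hypothetical stand-in for xpath.html; the real file may differ.
SAMPLE_HTML = """
<html><body>
<div>first div
  <a href="/a1">link1</a>
  <a href="/a2">link2</a>
  <a href="/a3">link3</a>
  <a href="/a4">link4</a>
</div>
<div>second div</div>
<ul>
  <li>process</li>
  <li>xpath learning</li>
</ul>
</body></html>
"""

selector = etree.HTML(SAMPLE_HTML)


def non_empty(results):
    # Keep only results that still contain text after stripping whitespace
    return [str(r).strip() for r in results if str(r).strip()]


# Task 1: text nodes that are direct children of any div
div_texts = non_empty(selector.xpath("//div/text()"))

# Task 2: text of list items (assumes the items are li children of ul)
li_texts = non_empty(selector.xpath("//ul/li/text()"))

# Task 3: pair the text and href of the first three a tags in the first div
first_links = [
    (a.xpath("string(.)").strip(), a.get("href"))
    for a in selector.xpath("//div[1]//a[position()<4]")
]

print(div_texts)
print(li_texts)
print(first_links)
```

Selecting the a elements first and then reading the text and href from each one keeps every link's text paired with its own URL, which a union of `/@href` and `/text()` results does not guarantee by itself.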
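from lxml import etree
import requests

def clean_data(element_result):
    # Remove spaces and line breaks from a single XPath result
    return str(element_result).replace(" ", "").replace("\n", "").replace("\r", "")

def print_data(elements):
    # Print every result that is non-empty after cleaning
    for element in elements:
        data = clean_data(element)
        if len(data):
            print(data)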
# Tasks 1-3: parse the local xpath.html file
with open("xpath.html", "r", encoding="utf-8") as html_file:
    html_str = html_file.read()
selector = etree.HTML(html_str)
# Task 1: text directly under div tags
div_elements = selector.xpath("//div/text()")
print_data(div_elements)
# Task 2: text directly under ul tags
ul_elements = selector.xpath("//ul/text()")
print_data(ul_elements)
# Task 3: href and text of the first three a tags under the first div
filter_elements = selector.xpath("//div[1]//a[position()<4]/@href|//div[1]//a[position()<4]/text()")
print_data(filter_elements)
url = "http://www.ygdy8.com/"
header_str = '''
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Cookie:37cs_pidx=1; 37cs_user=37cs96544059545; UM_distinctid=160e80f56031c9-0c9b01c124c227-6d1b117c-1fa400-160e80f5607f4; CNZZDATA5783118=cnzz_eid%3D2025418817-1515716500-null%26ntime%3D1515716500; 37cs_show=69; cscpvrich4016_fidx=1
Host:www.ygdy8.com
If-Modified-Since:Thu, 11 Jan 2018 15:12:16 GMT
If-None-Match:"0c8cb90ee8ad31:54c"
Proxy-Connection:keep-alive
Referer:https://www.google.co.uk/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36
'''
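header_list = header_str.strip().split('\n')
# Split on the first colon only, since values such as the Referer URL contain colons
headers_dict = {x.split(':', 1)[0]: x.split(':', 1)[1] for x in header_list}
# Pass the dict by keyword; the second positional argument of requests.get is params
req = requests.get(url, headers=headers_dict)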
req.encoding = "gb2312"
selector = etree.HTML(req.text)
print(req.text)
# Task 4: href and text of the links in the navigation menu
data_elements = selector.xpath("//div[@id = 'menu']//a/@href|//div[@id = 'menu']//a/text()")
print_data(data_elements)
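The header-parsing step in the script can also be sketched in isolation. The helper name `parse_raw_headers` and the three-line sample block are hypothetical, introduced only for illustration. The idea is to split each line at its first colon, so that values which themselves contain colons (URLs, timestamps) stay intact.

```python
def parse_raw_headers(raw):
    """Turn a block of 'Name:value' lines into a dict (hypothetical helper)."""
    headers = {}
    for line in raw.strip().splitlines():
        # partition splits at the first colon only, keeping colons in the value
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            headers[name.strip()] = value.strip()
    return headers


# Sample lines taken from the header block used in the script
sample = """
Host:www.ygdy8.com
Referer:https://www.google.co.uk/
If-Modified-Since:Thu, 11 Jan 2018 15:12:16 GMT
"""

parsed = parse_raw_headers(sample)
print(parsed["Referer"])
```

A dict built this way could then be passed to requests as `requests.get(url, headers=parsed)`.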