I haven't written much code recently; work has been keeping me until 9:30 pm every day. Back home I've been watching videos and learning Scrapy, and I wrote a small project with it, but it fails to pass the next-page link back and continue crawling... still debugging.
In the meantime I wrote a little Tieba spider to practice XPath extraction rules. I'm not very fluent with them yet, and there are still bugs.
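For reference, the standard next-page pattern in Scrapy looks roughly like this. It's only a minimal sketch, not my actual project: the spider name is invented, and the selectors are borrowed from the Tieba code later in this post.

import scrapy

class TiebaSpider(scrapy.Spider):
    name = 'tieba_novel'  # hypothetical spider name, not my real project
    start_urls = ['https://tieba.baidu.com/p/5366583054?see_lz=1']

    def parse(self, response):
        # yield the text of each post on the current page
        for info in response.xpath('//div[@class="d_post_content_main "]/div[1]'):
            yield {'paper': info.xpath('string(.)').extract_first().strip()}
        # follow the next-page link back into parse(), if one exists
        next_href = response.xpath('//*[@id="thread_theme_7"]/div[1]/ul/li[1]/a[6]/@href').extract_first()
        if next_href:
            yield response.follow(next_href, callback=self.parse)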
# -*- coding: utf-8 -*-
import requests
from lxml import etree
def gethtml(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    r = requests.get(url, headers=header)
    return r.content  # raw bytes, not decoded text

def parsehtml(html):
    selector = etree.HTML(html)
    name = selector.xpath('//*[@id="j_core_title_wrap"]/h3/text()')[0]
    author = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[1]/ul/li[3]/a/text()')[0]
    time = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[2]/div[4]/div[1]/div/span[3]/text()')[0]
    infos = selector.xpath('//div[@class="d_post_content_main "]/div[1]')
    for info in infos:
        paper = info.xpath('string(.)').strip().encode('utf-8')  # encodes the str back into bytes
        f = open('001.text', 'a', encoding='utf-8')
        print(paper)
        f.write(str(paper))  # str() of a bytes object writes its b'...' repr
        f.close()

def main(infourl):
    html = gethtml(infourl)
    parsehtml(html)

main("https://tieba.baidu.com/p/5366583054?see_lz=1")
Opening 001.text, it's a mess: instead of readable Chinese, every line is the b'\xe6...'-style repr of a bytes object.
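The culprit is that str() applied to a bytes object just produces its repr. A tiny demo of the fact:

s = '我'
b = s.encode('utf-8')
print(str(b))             # b'\xe6\x88\x91'  <- the literal repr that ends up in the file
print(b.decode('utf-8'))  # 我  <- decoding, not str(), recovers the character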
So I reached for the encode() trick and changed the code:
def parsehtml(html):
    selector = etree.HTML(html)
    name = selector.xpath('//*[@id="j_core_title_wrap"]/h3/text()')[0].encode('utf-8')
    author = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[1]/ul/li[3]/a/text()')[0].encode('utf-8')
    time = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[2]/div[4]/div[1]/div/span[3]/text()')[0].encode('utf-8')
    infos = selector.xpath('//div[@class="d_post_content_main "]/div[1]')
    for info in infos:
        paper = info.xpath('string(.)').strip().encode('utf-8')
        return paper  # note: returns during the first iteration, so only the first post comes back
The file I get still looks like the same byte-escape mess.
But if I write it like this instead:
# -*- coding: utf-8 -*-
import requests
from lxml import etree
url = "https://tieba.baidu.com/p/5366583054?see_lz=1"
header = {'User-Agent' :'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get(url, headers=header)
selector = etree.HTML(r.text)
name = selector.xpath('//*[@id="j_core_title_wrap"]/h3/text()')[0]
author = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[1]/ul/li[3]/a/text()')[0]
time = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[2]/div[4]/div[1]/div/span[3]/text()')[0]
infos = selector.xpath('//div[@class="d_post_content_main "]/div[1]')
for info in infos:
    paper = info.xpath('string(.)').strip()
    print(paper + '\n')
    f = open('content.text', 'a', encoding='utf-8')
    f.write(str(paper) + '\n')
    f.close()
then the content file displays the Chinese correctly again.
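The difference that matters here: r.content is the raw response body as bytes, while r.text is a str that requests has already decoded. A quick check:

import requests

url = "https://tieba.baidu.com/p/5366583054?see_lz=1"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
r = requests.get(url, headers=header)
print(type(r.content))  # <class 'bytes'>, needs explicit decoding before text handling
print(type(r.text))     # <class 'str'>, decoded using the encoding requests detected
print(r.encoding)       # the encoding requests inferred from the response headers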
Later I simply created a new file and reworked that code into functions:
# -*- coding: utf-8 -*-
import requests
from lxml import etree
def gethtml(url):
    # url = "https://tieba.baidu.com/p/5366583054?see_lz=1"
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    r = requests.get(url, headers=header)
    html = r.text
    return html

def parsehtml(html):
    selector = etree.HTML(html)
    name = selector.xpath('//*[@id="j_core_title_wrap"]/h3/text()')[0]
    author = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[1]/ul/li[3]/a/text()')[0]
    time = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[2]/div[4]/div[1]/div/span[3]/text()')[0]
    infos = selector.xpath('//div[@class="d_post_content_main "]/div[1]')
    for info in infos:
        paper = info.xpath('string(.)').strip()
        print(paper + '\n')
        f = open('content.text', 'a', encoding='utf-8')
        f.write(str(paper) + '\n')
        f.close()
url1 = "https://tieba.baidu.com/p/5366583054?see_lz=1"
parsehtml(gethtml(url1))
#
# if len(etree.HTML(gethtml(url1)).xpath('//*[@id="thread_theme_7"]/div[1]/ul/li[1]/a[6]')):
# furtherurl = etree.HTML(gethtml(url1)).xpath('//*[@id="thread_theme_7"]/div[1]/ul/li[1]/a[6]/@href/text()')
# url = "https://tieba.baidu.com" + str(furtherurl)
# parsehtml(gethtml(url))
This still runs fine and displays the Chinese correctly.
But I can't settle for just this one page of the novel, so I extracted the next-page link and concatenated the URL:
if len(etree.HTML(gethtml(url1)).xpath('//*[@id="thread_theme_7"]/div[1]/ul/li[1]/a[6]')):
    furtherurl = etree.HTML(gethtml(url1)).xpath('//*[@id="thread_theme_7"]/div[1]/ul/li[1]/a[6]/@href/text()')
    url = "https://tieba.baidu.com" + str(furtherurl)
    parsehtml(gethtml(url))
But it threw an error when I ran it...
That's it for tonight; it's midnight and I'm sleepy. More tomorrow...
Early-morning update: I realized that extracting a link does not need /text(); we only want the link's @href attribute. Also, since xpath() returns a list, we have to index into it and take element 0. Revised code:
if len(etree.HTML(gethtml(url1)).xpath('//*[@id="thread_theme_7"]/div[1]/ul/li[1]/a[6]')):
    furtherurl = etree.HTML(gethtml(url1)).xpath('//*[@id="thread_theme_7"]/div[1]/ul/li[1]/a[6]/@href')[0]
    url = "https://tieba.baidu.com" + str(furtherurl)
    parsehtml(gethtml(url))
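A tiny standalone illustration of both points, using a made-up HTML snippet rather than the real page:

from lxml import etree

snippet = '<ul><li><a href="/p/5366583054?pn=2">next page</a></li></ul>'  # hypothetical fragment
sel = etree.HTML(snippet)
hrefs = sel.xpath('//a/@href')  # @href yields the attribute values as a list of strings
print(hrefs)                    # ['/p/5366583054?pn=2']
print(hrefs[0])                 # index 0 to get the string itself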
Running this, there was still one problem: the time extraction raised list index out of range, so I commented that line out for now to see whether paging would work. It did, but I could only page through to the third page before the data stopped coming. What is this new problem...
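The index crash comes from indexing [0] into an XPath result that is empty on some pages. A small guard I could add (first_or_default is my own hypothetical helper, not anything from lxml):

def first_or_default(nodes, default=''):
    # return the first item of an XPath result list, or a default when the list is empty
    return nodes[0] if nodes else default

# the fragile lookup in parsehtml() would then become:
# time = first_or_default(selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[2]/div[4]/div[1]/div/span[3]/text()'))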
Updating the next evening after work. I sneaked in some reading during the day, heh, and it seems the site was limiting me. In the end I finished the code by constructing the URLs directly (adding a delay between requests might also help; see the sketch after the full code)... Annoying, because the whole point was to crawl by finding the next-page link and iterating, and I ended up back on the old road anyway. The same problem bites me in Scrapy too. I'll keep digging through resources. The complete code:
# -*- coding: utf-8 -*-
import requests
from lxml import etree
def gethtml(url):
    # url = "https://tieba.baidu.com/p/5366583054?see_lz=1"
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    r = requests.get(url, headers=header)
    html = r.text
    return html

def parsehtml(html):
    selector = etree.HTML(html)
    name = selector.xpath('//*[@id="j_core_title_wrap"]/h3/text()')[0]
    author = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[1]/ul/li[3]/a/text()')[0]
    # time = selector.xpath('//*[@id="j_p_postlist"]/div[1]/div[2]/div[4]/div[1]/div/span[3]/text()')[0]
    infos = selector.xpath('//div[@class="d_post_content_main "]/div[1]')
    for info in infos:
        paper = info.xpath('string(.)').strip()
        print(paper + '\n')
        f = open('content.text', 'a', encoding='utf-8')
        f.write(str(paper) + '\n')
        f.close()

url1 = "https://tieba.baidu.com/p/5366583054?see_lz=1"
parsehtml(gethtml(url1))
for i in range(2, 11):
    url2 = "https://tieba.baidu.com/p/5366583054?pn=" + str(i)
    parsehtml(gethtml(url2))
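If the limit really is rate-based (my assumption; I haven't confirmed how Tieba throttles), the usual mitigation is to pause between requests. The loop above would become:

import time

for i in range(2, 11):
    time.sleep(2)  # assumed 2-second delay between pages; tune empirically
    url2 = "https://tieba.baidu.com/p/5366583054?pn=" + str(i)
    parsehtml(gethtml(url2))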
I'll fill in the remaining holes later.