Today's task is to scrape the Jianshu homepage with XPath.
Fields collected: title, author, publish time, summary text, read count, comment count, like count, reward count, and the collection the article was included in.
The main idea: use XPath to pull the data out of the page, then save it to a local CSV file.
First, a quick look at how XPath is used.
The lxml library has to be installed and imported first.
For the XPath syntax itself, see the tutorial 「Python爬蟲利器三之Xpath語法與lxml庫的用法」 (XPath syntax and the lxml library).
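As a quick warm-up, here is a minimal sketch of how etree.HTML() and xpath() work together; the HTML fragment is made up purely for illustration and has nothing to do with Jianshu's real markup.

from lxml import etree

# A tiny, made-up HTML fragment, just to show the API
doc = etree.HTML('<ul><li class="note"><a href="/p/1">Hello</a></li>'
                 '<li class="note"><a href="/p/2">World</a></li></ul>')
print(doc.xpath('//li[@class="note"]/a/text()'))   # ['Hello', 'World']  (text nodes)
print(doc.xpath('//li[@class="note"]/a/@href'))    # ['/p/1', '/p/2']    (attribute values)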
1穴墅、首先是爬的第一頁的數(shù)據(jù)
運(yùn)行代碼:
#coding: utf-8
import requests
from lxml import etree
import csv
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
header = {'User-Agent': user_agent}
html = requests.get('http://www.reibang.com/', headers = header).content
selector = etree.HTML(html)
infos = selector.xpath('//div[@id="list-container"]/ul/li/div')
a = []
# Extract the fields of every article on the first page
for info in infos:
    titles = info.xpath('a/text()')[0]
    authors = info.xpath('div[1]/div/a/text()')[0]
    times = info.xpath('div[1]/div/span/@data-shared-at')[0]
    contents = info.xpath('p/text()')[0].strip()
    try:
        read_counts = info.xpath('div[2]/a[2]/text()')[1].strip()
    except IndexError:
        read_counts = '0'
    try:
        comment_counts = info.xpath('div[2]/a[3]/text()')[1].strip()
    except IndexError:
        comment_counts = '0'
    try:
        # relative path: an absolute '//...' path here would match the same node for every article
        vote_counts = info.xpath('div[2]/span[1]/text()')[0].strip()
    except IndexError:
        vote_counts = '0'
    try:
        reward_counts = info.xpath('div[2]/span[2]/text()')[0]
    except IndexError:
        reward_counts = '0'
    try:
        subjects = info.xpath('div[2]/a[1]/text()')[0]
    except IndexError:
        subjects = '暫未收錄專題'
    #print(titles, authors, times, contents, read_counts, comment_counts, vote_counts, reward_counts, subjects)
    data = {
        '文章標(biāo)題': titles,
        '作者': authors,
        '發(fā)表時(shí)間': times,
        '內(nèi)容': contents,
        '閱讀量': read_counts,
        '評(píng)論數(shù)': comment_counts,
        '點(diǎn)贊數(shù)': vote_counts,
        '打賞數(shù)': reward_counts,
        '主題': subjects,
    }
    a.append(data)
#print(a)

# Write the scraped data to a CSV file
csv_name = ['文章標(biāo)題', '作者', '發(fā)表時(shí)間', '內(nèi)容', '閱讀量', '評(píng)論數(shù)', '點(diǎn)贊數(shù)', '打賞數(shù)', '主題']
with open('jianshu_xpath.csv', 'w', newline='', encoding='utf-8') as csvfile:
    write = csv.DictWriter(csvfile, fieldnames=csv_name)
    write.writeheader()
    write.writerows(a)
Result:
The first page is fairly straightforward; the main work is picking the XPath for each field and choosing the right node to loop over, as the short illustration below shows.
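One detail that is easy to get wrong inside the loop: a path starting with '//' is evaluated from the document root, not from the current node, so it returns the same (first) match for every article; that is why the like-count path above is written relative to info. A made-up snippet to illustrate the difference:

from lxml import etree

doc = etree.HTML('<div id="list-container"><ul>'
                 '<li><div><a>First</a></div></li>'
                 '<li><div><a>Second</a></div></li></ul></div>')
for node in doc.xpath('//div[@id="list-container"]/ul/li/div'):
    print(node.xpath('a/text()')[0])     # relative to node: 'First', then 'Second'
    print(node.xpath('//a/text()')[0])   # starts from the document root: always 'First'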
2. Scraping the first 15 pages of the Jianshu homepage
a. First, work out how each page is loaded. Clicking "more" does not change the URL, so the extra articles are loaded asynchronously; capture the requests in the browser's developer tools and compare what the URLs of the later pages have in common.
The note ids carried by each request can all be found on the previous pages, and they accumulate from page to page; a rough sketch of the resulting URLs follows below.
For a detailed analysis see liang's article at http://www.reibang.com/p/9afef50a8cc7, which covers it very thoroughly, so I won't repeat it here.
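In other words, every request for the next page carries the ids of all notes already shown as repeated seen_snote_ids[] parameters, plus a page number. Roughly (111 and 222 below are placeholder ids, not real values):

# Rough shape of the paged requests; ids are placeholders
# page 1: http://www.reibang.com/?&page=1
# page 2: http://www.reibang.com/?seen_snote_ids[]=111&seen_snote_ids[]=222&...&page=2
params = ['seen_snote_ids[]=111', 'seen_snote_ids[]=222']
url = 'http://www.reibang.com/?' + '&'.join(params) + '&page={}'.format(2)
print(url)   # http://www.reibang.com/?seen_snote_ids[]=111&seen_snote_ids[]=222&page=2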
Code:
#coding: utf-8
import requests
from lxml import etree
import csv
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
header = {'User-Agent': user_agent,
'cookie': 'remember_user_token=W1szNjE3MDgyXSwiJDJhJDEwJDMuQTVNeHVYTkUubFQvc1ZPM0V5UGUiLCIxNDk3MTcyNDA2Ljk2ODQ2NjMiXQ%3D%3D--56522c2190961ce284b1fe108b267ae0cd5bf32a; _session_id=YVRyNm5tREZkK1JwUGFZVDNLdjJoL25zVS8yMjBnOGlwSnpITEE0U0drZHhxSU5XQVpYM2RpSmY5WU44WGJWeHVZV3d1Z1lINHR0aXhUQzR6Z1pMUW52RGI5UHpPRVFJRk5HeUcybEhwc21raVBqbk9uZmhjN0xQWmc2ZFMreXhGOHlhbmJiSDBHQUVsUTNmN2p0M2Y2TjgrWnBjVis4ODE4UXRhWmJ6K2VETHJlakhHbEl0djhDNDRKYVZEWndENjhrSGIvZ1crNC9NNnh4UmlpOVFPNWxGWm1PUmxhQk1sdnk2OXozQVZwU1hXVm9lMTU3WkUyUkhialZKZ2MvVkFOYk1tOUw3STkrMGNFWXVIaklDNlNpTmkrVi9iNDIrRzBDU0ZNNnc3b3I2bkhvLzFCSCsvTWdsUDExdEZBa0RsU3RqTURWcjdNU1VOTGVBeTk2MERMUXN1UlZqUytuYXdWdnI4cTkxTjFPbG5Ia3IzK3NXcVNpMENwWVZPSUV3TWU4TENaRWUva24ybXMzSE9MTVZRSEdrVDJhMzhzM05RUnBoMk8xU1FHYz0tLTFxUnlXWTZLQXM4dW9EQmVxMHZwRWc9PQ%3D%3D--6fb5c178053ee287201628ee5d7b2b61c170e994'}
a = []
params = []
# Build each page's URL from the note ids seen so far, then request it
for p in range(1, 16):
    url_data = '&'.join(params)
    url = 'http://www.reibang.com/?' + url_data + '&page={}'.format(p)
    # Fetch the page
    html = requests.get(url, headers=header).text
    selector = etree.HTML(html)
    # Remember this page's note ids; they become seen_snote_ids[] parameters for the next request
    li_pages = selector.xpath('//div[@id="list-container"]/ul/li')
    for li_page in li_pages:
        li_page = 'seen_snote_ids[]=' + li_page.xpath('@data-note-id')[0]
        params.append(li_page)
    #print(len(params))
    infos = selector.xpath('//div[@id="list-container"]/ul/li/div')
    for info in infos:
        titles = info.xpath('a/text()')[0]
        authors = info.xpath('div[1]/div/a/text()')[0]
        times = info.xpath('div[1]/div/span/@data-shared-at')[0]
        contents = info.xpath('p/text()')[0].strip()
        try:
            read_counts = info.xpath('div[2]/a[2]/text()')[1].strip()
        except IndexError:
            read_counts = '0'
        try:
            comment_counts = info.xpath('div[2]/a[3]/text()')[1].strip()
        except IndexError:
            comment_counts = '0'
        try:
            # relative path, so each article's own like count is read
            vote_counts = info.xpath('div[2]/span[1]/text()')[0].strip()
        except IndexError:
            vote_counts = '0'
        try:
            reward_counts = info.xpath('div[2]/span[2]/text()')[0]
        except IndexError:
            reward_counts = '0'
        try:
            subjects = info.xpath('div[2]/a[@class="collection-tag"]/text()')[0]
        except IndexError:
            subjects = '暫未收錄專題'
        #print(titles, authors, times, contents, read_counts, comment_counts, vote_counts, reward_counts, subjects)
        data = {
            '文章標(biāo)題': titles,
            '作者': authors,
            '發(fā)表時(shí)間': times,
            '內(nèi)容': contents,
            '閱讀量': read_counts,
            '評(píng)論數(shù)': comment_counts,
            '點(diǎn)贊數(shù)': vote_counts,
            '打賞數(shù)': reward_counts,
            '主題': subjects,
        }
        a.append(data)

# Write the scraped data to a CSV file
csv_name = ['文章標(biāo)題', '作者', '發(fā)表時(shí)間', '內(nèi)容', '閱讀量', '評(píng)論數(shù)', '點(diǎn)贊數(shù)', '打賞數(shù)', '主題']
with open('jianshu_xpath2.csv', 'w', newline='', encoding='utf-8') as csvfile:
    write = csv.DictWriter(csvfile, fieldnames=csv_name)
    write.writeheader()
    write.writerows(a)
Result:
Note: be sure to copy the cookie only after you have logged in; with an anonymous cookie every request just returns the first page again, so you end up with 15 copies of the same data.
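The cookie string hard-coded above is tied to my own logged-in session, so substitute your own if you reuse the code. One simple way to keep it out of the source file is to read it from an environment variable; the variable name JIANSHU_COOKIE below is just an assumed example, and user_agent is the same string already defined in the script.

import os

# Assumed variable name: export JIANSHU_COOKIE='...' in your shell after logging in
cookie = os.environ.get('JIANSHU_COOKIE', '')
header = {'User-Agent': user_agent, 'cookie': cookie}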
Finally, let's tidy the whole thing up and wrap it in a class so it looks a bit cleaner:
#coding: utf-8
import requests
from lxml import etree
import csv
class Jianshu():
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    header = {'User-Agent': user_agent,
'cookie': 'remember_user_token=W1szNjE3MDgyXSwiJDJhJDEwJDMuQTVNeHVYTkUubFQvc1ZPM0V5UGUiLCIxNDk3MTcyNDA2Ljk2ODQ2NjMiXQ%3D%3D--56522c2190961ce284b1fe108b267ae0cd5bf32a; _session_id=YVRyNm5tREZkK1JwUGFZVDNLdjJoL25zVS8yMjBnOGlwSnpITEE0U0drZHhxSU5XQVpYM2RpSmY5WU44WGJWeHVZV3d1Z1lINHR0aXhUQzR6Z1pMUW52RGI5UHpPRVFJRk5HeUcybEhwc21raVBqbk9uZmhjN0xQWmc2ZFMreXhGOHlhbmJiSDBHQUVsUTNmN2p0M2Y2TjgrWnBjVis4ODE4UXRhWmJ6K2VETHJlakhHbEl0djhDNDRKYVZEWndENjhrSGIvZ1crNC9NNnh4UmlpOVFPNWxGWm1PUmxhQk1sdnk2OXozQVZwU1hXVm9lMTU3WkUyUkhialZKZ2MvVkFOYk1tOUw3STkrMGNFWXVIaklDNlNpTmkrVi9iNDIrRzBDU0ZNNnc3b3I2bkhvLzFCSCsvTWdsUDExdEZBa0RsU3RqTURWcjdNU1VOTGVBeTk2MERMUXN1UlZqUytuYXdWdnI4cTkxTjFPbG5Ia3IzK3NXcVNpMENwWVZPSUV3TWU4TENaRWUva24ybXMzSE9MTVZRSEdrVDJhMzhzM05RUnBoMk8xU1FHYz0tLTFxUnlXWTZLQXM4dW9EQmVxMHZwRWc9PQ%3D%3D--6fb5c178053ee287201628ee5d7b2b61c170e994'}
    a = []
    params = []

    def __init__(self):
        pass

    # Build each page's URL from the note ids collected so far
    def total_page(self):
        for p in range(1, 16):
            url_data = '&'.join(self.params)
            url = 'http://www.reibang.com/?' + url_data + '&page={}'.format(p)
            self.get_data(url)

    # Scrape one page and write everything collected so far to CSV
    def get_data(self, url):
        html = requests.get(url, headers=self.header).text
        selector = etree.HTML(html)
        # Remember this page's note ids as seen_snote_ids[] parameters for later requests
        li_pages = selector.xpath('//*[@id="list-container"]/ul/li')
        #print(li_pages)
        for info in li_pages:
            info = 'seen_snote_ids%5B%5D=' + info.xpath('@data-note-id')[0]
            self.params.append(info)
        infos = selector.xpath('//div[@id="list-container"]/ul/li/div')
        for info in infos:
            titles = info.xpath('a/text()')[0]
            authors = info.xpath('div[1]/div/a/text()')[0]
            times = info.xpath('div[1]/div/span/@data-shared-at')[0]
            contents = info.xpath('p/text()')[0].strip()
            try:
                read_counts = info.xpath('div[2]/a[2]/text()')[1].strip()
            except IndexError:
                read_counts = '0'
            try:
                comment_counts = info.xpath('div[2]/a[3]/text()')[1].strip()
            except IndexError:
                comment_counts = '0'
            try:
                # relative path, so each article's own like count is read
                vote_counts = info.xpath('div[2]/span[1]/text()')[0].strip()
            except IndexError:
                vote_counts = '0'
            try:
                reward_counts = info.xpath('div[2]/span[2]/text()')[0]
            except IndexError:
                reward_counts = '0'
            try:
                subjects = info.xpath('div[2]/a[@class="collection-tag"]/text()')[0]
            except IndexError:
                subjects = '暫未收錄專題'
            #print(titles, authors, times, contents, read_counts, comment_counts, vote_counts, reward_counts, subjects)
            data = {
                '文章標(biāo)題': titles,
                '作者': authors,
                '發(fā)表時(shí)間': times,
                '內(nèi)容': contents,
                '閱讀量': read_counts,
                '評(píng)論數(shù)': comment_counts,
                '點(diǎn)贊數(shù)': vote_counts,
                '打賞數(shù)': reward_counts,
                '主題': subjects,
            }
            self.a.append(data)
        #print(self.a)
        # Write the data to CSV (rewritten after every page, so the file always holds all rows scraped so far)
        csv_name = ['文章標(biāo)題', '作者', '發(fā)表時(shí)間', '內(nèi)容', '閱讀量', '評(píng)論數(shù)', '點(diǎn)贊數(shù)', '打賞數(shù)', '主題']
        with open('jianshu_xpath2.csv', 'w', newline='', encoding='utf-8') as csvfile:
            write = csv.DictWriter(csvfile, fieldnames=csv_name)
            write.writeheader()
            write.writerows(self.a)


if __name__ == '__main__':
    jian = Jianshu()
    jian.total_page()
Summary:
1. Scraping page content with XPath like this is quite convenient, isn't it?
Regular expressions, BeautifulSoup and XPath can all extract page content; the point is to use them flexibly. When one approach can't reach the data, switch to another (with regular expressions, for instance, as long as the pattern is written correctly you can almost always get at the data). A rough BeautifulSoup sketch of the same title extraction appears at the end of this post.
2臀玄、這里爬取多頁是通過自己手動(dòng)分析網(wǎng)頁加載方式去構(gòu)造每頁的url瓢阴,然后爬取全部的數(shù)據(jù);對(duì)于這種異步加載的網(wǎng)頁镐牺,后面還會(huì)介紹其他的方法炫掐;