Goal
Scrape the Douban Books Top 250 list and collect, for each book, the title (name), the book's URL (url), the author (author), the publisher (publisher), the publication date (date), the price (price), the rating (rate), and the one-line review (comment).
URL
https://book.douban.com/top250
Approach
(1) Browse the site manually, observe how the URL changes, and build the list of page URLs. It is easy to see that the `start` parameter increases in steps of 25, with 10 pages in total:
https://book.douban.com/top250?start=25
https://book.douban.com/top250?start=50
(2) Scrape the relevant fields from each page.
(3) Write the scraped data to a CSV file.
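The URL pattern from step (1) can be checked with a quick sketch before doing any scraping (the first page is start=0, which Douban also serves at the bare /top250 address):

```python
# Build the ten page URLs for the Top 250 list (25 books per page):
# start = 0, 25, 50, ..., 225.
urls = ['https://book.douban.com/top250?start={}'.format(i)
        for i in range(0, 226, 25)]

print(len(urls))   # 10 pages
print(urls[0])     # https://book.douban.com/top250?start=0
print(urls[-1])    # https://book.douban.com/top250?start=225
```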
The full code is as follows:
import csv

import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

f = open('doubanTop250.csv', 'wt', newline='', encoding='utf-8')  # create the CSV file
writer = csv.writer(f)
writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price', 'rate',
                 'comment'))

# build the list of page URLs: start = 0, 25, ..., 225
urls = ['https://book.douban.com/top250?start={}'.format(i)
        for i in range(0, 226, 25)]
for url in urls:
    print('Scraping ' + url)
    r = requests.get(url, headers=headers)
    selector = etree.HTML(r.text)
    infos = selector.xpath('//tr[@class="item"]')  # one <tr> per book; loop over them
    for info in infos:
        name = info.xpath('td/div/a/@title')[0]
        book_url = info.xpath('td/div/a/@href')[0]
        book_infos = info.xpath('td/p/text()')[0]
        # the publication line looks like "author / publisher / date / price",
        # so index from the end in case extra fields (e.g. a translator) appear
        parts = [p.strip() for p in book_infos.split('/')]
        author = parts[0]
        publisher = parts[-3]
        date = parts[-2]
        price = parts[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if comments else ''  # some books have no one-line review
        writer.writerow((name, book_url, author, publisher, date, price, rate, comment))
f.close()
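The trickiest part of the loop is splitting the publication-info string: the text of the td/p node is a single slash-separated line, which is why the code indexes from the end rather than the front. A small sketch with a made-up sample string (the book details here are illustrative, not scraped):

```python
# Hypothetical sample of the "td/p/text()" string for one book entry:
# "author / translator / publisher / date / price"
book_infos = '[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元'

# Strip whitespace around each slash-separated field.
parts = [p.strip() for p in book_infos.split('/')]

author = parts[0]      # everything before the first slash
publisher = parts[-3]  # third field from the end, so an optional
date = parts[-2]       # translator field does not shift the indices
price = parts[-1]

print(publisher)  # 上海人民出版社
print(date)       # 2006-5
```

Indexing from the end is what makes the same four lines work whether or not a translator appears between the author and the publisher.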
The scraped results look like this.
Note that if you open the finished CSV directly in Excel, it may display garbled characters. Don't panic: open the file in Notepad first, re-save it with UTF encoding, and it will then open normally in Excel. Alternatively, write the file with encoding='utf-8-sig' in the first place, and no re-saving is needed.
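The garbling happens because Excel does not assume UTF-8 unless the file starts with a byte-order mark (BOM). Python's 'utf-8-sig' codec writes that BOM automatically, so the Notepad step becomes unnecessary. A minimal sketch:

```python
import csv

# 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF), which Excel uses to
# detect the encoding when it opens the file.
with open('doubanTop250.csv', 'wt', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(('name', 'rate'))       # header row
    writer.writerow(('示例书名', '9.0'))     # one placeholder data row

# Verify that the file really starts with the BOM.
with open('doubanTop250.csv', 'rb') as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```

To apply this to the scraper above, just change encoding='utf-8' to encoding='utf-8-sig' in the open() call; everything else stays the same.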