Notes from learning Scrapy: a demo that crawls the Douban Top 250 movie list, plus the problems I ran into along the way.
- The core is just one spider and one item (crawling each movie's rank, title, score, number of ratings, and picture URL).
import scrapy

class DouBanMovieItem(scrapy.Item):
    rank = scrapy.Field()        # position on the Top 250 list
    movie_name = scrapy.Field()  # movie title
    score = scrapy.Field()       # rating score
    score_num = scrapy.Field()   # number of people who rated
    pic_url = scrapy.Field()     # picture/detail-page URL
class DoubanmovieSpider(scrapy.Spider):
    name = 'doubanmovie'
    # note: this entry is what causes the problem discussed below
    allowed_domains = ['movie.douban.com/top250']
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        movies = response.css('ol.grid_view li')
        for movie in movies:
            # create a fresh item for each movie instead of reusing one
            item = DouBanMovieItem()
            item['pic_url'] = movie.css('div.pic a::attr(href)').extract()
            item['rank'] = movie.css('div.pic em::text').extract()
            item['movie_name'] = movie.css('div.info > div.hd > a > span:nth-child(1)::text').extract()
            item['score'] = movie.css('div.info > div.bd > div.star > span.rating_num::text').extract()
            item['score_num'] = movie.css('div.info > div.bd > div.star > span:nth-child(4)::text').extract()
            yield item
        # follow the "next page" link, if there is one
        next_page = response.css('div.paginator > span.next > a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
The main thing to test in the code above is whether the selectors locate the right data. A site like Douban certainly has anti-scraping measures, so after some time the selectors above may no longer match the page and the corresponding code will need updating. The scrapy shell makes this kind of debugging much more convenient.
When running the spider, it crawled the first page normally, but stopped when it should have crawled the second page, with the following output on the command line:
2018-09-11 20:56:33 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'movie.douban.com': <GET https://movie.douban.com/top250?start=25&filter=>
2018-09-11 20:56:33 [scrapy.core.engine] INFO: Closing spider (finished)
As shown above, the second page's URL, https://movie.douban.com/top250?start=25&filter=, opens normally in a browser, and its data can also be fetched through the scrapy shell. The program reported no error either, which seemed strange: shouldn't it at least throw an exception? After looking into it, the cause turned out to be that the allowed domain defined in the spider, allowed_domains = ['movie.douban.com/top250'], does not match the domain of the URL being crawled. My guess was that because https://movie.douban.com/top250?start=25&filter= carries parameters after the ?, Scrapy considers the domain to be movie.douban.com rather than movie.douban.com/top250. (In fact, a domain never includes a path: Scrapy's offsite filter compares only the URL's hostname, which is movie.douban.com here, against the entries in allowed_domains, so an entry containing /top250 can never match.)

Changing the setting to allowed_domains = ['movie.douban.com'] and rerunning the spider made the crawl work normally. Alternatively, you can pass an extra keyword argument when calling response.follow(), i.e. response.follow(next_page, callback=self.parse, dont_filter=True); I verified that this also works.
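The filtering behavior can be sketched in a few lines. This is not Scrapy's actual implementation, but a rough equivalent of how OffsiteMiddleware matches a request's hostname against allowed_domains (the function name is mine):

```python
# Sketch of Scrapy's offsite check: only the URL's hostname is
# compared against allowed_domains, roughly via a regex that accepts
# the domain itself or any of its subdomains.
import re
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Return True if `url`'s hostname matches none of `allowed_domains`."""
    host = urlparse(url).hostname or ''
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    return not re.match(pattern, host)

next_page = 'https://movie.douban.com/top250?start=25&filter='

# the hostname is 'movie.douban.com' -- a path can never appear in it
print(is_offsite(next_page, ['movie.douban.com/top250']))  # True  -> request filtered
print(is_offsite(next_page, ['movie.douban.com']))         # False -> request crawled
```

This makes it clear why the original entry silently filtered every follow-up page: 'movie.douban.com/top250' can never equal a hostname, so no request ever matched.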