Keep these three commands in mind:
scrapy startproject xxx    creates a Scrapy project
scrapy genspider xxx "xxx.com"    creates a spider; its name must not be the same as the project name
scrapy crawl xxx    runs the spider named xxx
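First, create the project with scrapy startproject douban.
Then switch to the spiders directory and create the spider with scrapy genspider douban_movie movie.douban.com (genspider takes the spider name followed by a domain).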
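The data we want to crawl is simple: the Douban movie chart. It is simple because the response can be converted into a regular JSON list, and the pagination links are easy to build.
We only extract the title and url of each entry. With the target clear, we start by writing items.py: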
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
Next, edit the spider file under spiders:
# -*- coding: utf-8 -*-
import json

import scrapy

from douban.items import DoubanItem


class DoubanMovieSpider(scrapy.Spider):
    name = 'douban_movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=0&limit=20']
    offset = 0

    def parse(self, response):
        # The endpoint returns a JSON array; an empty array means there are no more pages.
        content_list = json.loads(response.body.decode())
        if not content_list:
            return
        for content in content_list:
            # Create a fresh item for each movie so yielded items do not share state.
            item = DoubanItem()
            item['title'] = content['title']
            item['url'] = content['url']
            yield item
        # Request the next page of 20 entries.
        self.offset += 20
        url = 'https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action=&start=' + str(self.offset) + '&limit=20'
        yield scrapy.Request(url=url, callback=self.parse)
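The next-page URL in the spider is built by string concatenation. As a minimal sketch of an alternative, the query string could be assembled with the standard library's urllib.parse.urlencode. The helper name build_page_url and the constant BASE_URL are hypothetical, introduced only for illustration; the parameter values are copied from the start URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the spider's start URL.
BASE_URL = "https://movie.douban.com/j/chart/top_list"


def build_page_url(offset, limit=20):
    """Return the chart URL for the page that starts at `offset` (hypothetical helper)."""
    params = {
        "type": 11,
        "interval_id": "100:90",  # urlencode percent-encodes ":" as "%3A"
        "action": "",
        "start": offset,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # First two pages, 20 entries each.
    print(build_page_url(0))
    print(build_page_url(20))
```

Inside parse, the concatenated string could then be replaced by build_page_url(self.offset), which keeps the escaping in one place.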
response.body returns data of type <class 'bytes'>, so we convert it to str with response.body.decode(). json.loads() then turns that string into a JSON list, whose elements are ordinary dicts.
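The conversion chain can be tried outside Scrapy. The sketch below uses a made-up byte string in place of a real response body; the sample title and the example.com URL are placeholders, not data from Douban.

```python
import json

# Hypothetical stand-in for response.body: a UTF-8 encoded JSON array.
raw_body = '[{"title": "示例电影", "url": "https://example.com/movie-a"}]'.encode("utf-8")

text = raw_body.decode()          # bytes -> str (UTF-8 is the default codec)
content_list = json.loads(text)   # str -> list of dict

if __name__ == "__main__":
    print(type(raw_body), type(text), type(content_list), type(content_list[0]))
    for content in content_list:
        print(content["title"], content["url"])
```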
Then save the data by editing pipelines.py:
import json


class DoubanPipeline(object):
    def open_spider(self, spider):
        # Write UTF-8 explicitly, because ensure_ascii=False emits raw Chinese characters.
        self.file = open("douban.json", "w", encoding="utf-8")
        self.num = 0

    def process_item(self, item, spider):
        # Write one JSON object per line and count the saved items.
        self.num += 1
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        print('Saved ' + str(self.num) + ' items in total')
        self.file.close()
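Because the pipeline writes one JSON object per line, douban.json is not a single JSON document and has to be read back line by line. The sketch below mirrors the write step and adds a reader, using an in-memory io.StringIO in place of the real file; the function names write_items and read_items and the sample records are hypothetical.

```python
import io
import json


def write_items(items, fileobj):
    """Write each item as one JSON line, as process_item does; return the count."""
    count = 0
    for item in items:
        fileobj.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        count += 1
    return count


def read_items(fileobj):
    """Parse a file of JSON lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fileobj if line.strip()]


if __name__ == "__main__":
    # Made-up records standing in for scraped items.
    sample = [
        {"title": "示例电影", "url": "https://example.com/movie-a"},
        {"title": "Movie B", "url": "https://example.com/movie-b"},
    ]
    buffer = io.StringIO()
    saved = write_items(sample, buffer)
    buffer.seek(0)
    print(saved, read_items(buffer))
```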
Before running the spider, configure settings.py:
# Uncomment these two settings:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.15 Safari/537.36'  # imitate a browser
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}  # remember to register the pipeline once it is written
# ROBOTSTXT_OBEY = True  # comment out the robots.txt setting; otherwise an error is reported
Project source code:
https://gitee.com/stefanpy/Scrapy_projects/tree/dev/douban
Recommended site for learning Scrapy: http://www.scrapyd.cn/