Scrapy getting-started example
Tutorial followed:
the Chinese translation of the Scrapy 0.24.1 documentation
Environment:
- Python 2.7.12
- Scrapy 0.24.1
- Ubuntu 16.04
Installation steps:
pip install scrapy==0.24.1
pip install service_identity==17.0.0
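To confirm that the pinned version was actually installed (a quick sanity check, assuming the two pip commands above succeeded), print the installed version:
scrapy version
python -c "import scrapy; print(scrapy.__version__)"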
Creating a project
scrapy startproject tutorial
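This creates a project skeleton; with Scrapy 0.24 the generated layout looks roughly like the following sketch (newer releases add a few extra files):
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (used below)
        pipelines.py
        settings.py
        spiders/          # spiders go in this directory
            __init__.py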
Our first Spider
This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
How to run our spider
scrapy crawl quotes
A shortcut to the start_requests method
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
The two versions above are equivalent: start_urls is simply a shorthand for writing start_requests yourself.
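The shorthand works because the scrapy.Spider base class already ships a start_requests that iterates over start_urls. A simplified sketch of what it does (not the exact library source) looks like this:
# Simplified sketch of the default start_requests provided by scrapy.Spider:
# every entry of start_urls becomes a Request whose callback defaults to
# self.parse.
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)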
To create a Spider you must subclass scrapy.Spider and define the following three attributes:
- name: identifies the Spider. It must be unique; you may not give two different Spiders the same name.
- start_urls: the list of URLs the Spider crawls when it starts. The first pages fetched will come from this list; subsequent URLs are extracted from the data of those initial pages.
- parse(): a method of the spider. It is called with the Response object produced for each start URL as its only argument, and it is responsible for parsing the response data, extracting data (generating items), and generating Request objects for URLs that need further processing.
Introduction to Selectors
A Selector has four basic methods:
- xpath(): takes an XPath expression and returns a selector list of all nodes matching it.
- css(): takes a CSS expression and returns a selector list of all nodes matching it.
- extract(): serializes the selected nodes to unicode strings and returns them as a list.
- re(): extracts data with the given regular expression and returns a list of unicode strings.
Extracting data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Once the shell has loaded, you get a local response variable holding the fetched page. Typing response.body prints the response body, and response.headers shows the response headers.
You can extract data with response.selector.xpath() and response.selector.css(), with the shortcuts response.xpath() and response.css(), or even with sel.xpath() and sel.css(); they are all equivalent.
# Try these css() calls and see what they print
response.css('title')
response.css('title').extract()
response.css('title::text')
response.css('title::text').extract()
# response.css('title::text').extract_first()  # extract_first() is not available in 0.24.1
response.css('title::text')[0].extract()
response.css('title::text').re(r'Quotes.*')
response.css('title::text').re(r'Q\w+')
response.css('title::text').re(r'(\w+) to (\w+)')
# Try these xpath() calls and see what they print
response.xpath('//title')
response.xpath('//title').extract()
response.xpath('//title/text()')
response.xpath('//title/text()').extract()
# response.xpath('//title/text()').extract_first()  # extract_first() is not available in 0.24.1
response.xpath('//title/text()')[0].extract()
response.xpath('//title/text()').re(r'Quotes.*')
response.xpath('//title/text()').re(r'Q\w+')
response.xpath('//title/text()').re(r'(\w+) to (\w+)')
The css and xpath examples above line up one to one; only the expressions differ, and apart from a few cases their output is the same.
Extracting data in our spider
import scrapy
from tutorial.items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()
            # No trailing commas here, or each field would become a one-element tuple.
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            # extract() without [0] keeps all of the quote's tags, not just the first.
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
The Spider returns the scraped data as Item objects, so we also need to define QuotesItem in items.py:
import scrapy


class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
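Items behave much like dicts whose keys are restricted to the declared fields. A quick illustration (run from the project root so tutorial.items is importable; the values are made up):
from tutorial.items import QuotesItem

item = QuotesItem(title='A quote', author='Somebody')
item['tags'] = ['life', 'books']
print(item['author'])   # Somebody
print(dict(item))       # plain dict holding the populated fields
# Assigning a field that was not declared raises KeyError:
# item['born'] = '1900'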
Run the spider and check the log output to make sure there are no errors; otherwise go back and fix the code above.
scrapy crawl quotes
Storing the scraped data
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -o quotes.json
You can also use other formats, like JSON Lines:
scrapy crawl quotes -o quotes.jl
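A quick way to compare the two output formats (a sketch; it assumes the commands above left quotes.json and quotes.jl in the current directory):
import json

# quotes.json is a single JSON array of items.
with open('quotes.json') as f:
    items = json.load(f)
print(len(items), items[0]['author'])

# quotes.jl is JSON Lines: one JSON object per line, handy for streaming.
with open('quotes.jl') as f:
    items = [json.loads(line) for line in f]
print(len(items), items[0]['author'])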
Following links
import scrapy
from tutorial.items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
        # The last page has no "next" link, so check the selector list before
        # indexing into it (extract_first() is not available in 0.24.1).
        next_page = response.css('li.next a::attr(href)')
        if next_page:
            next_page = 'http://quotes.toscrape.com' + next_page[0].extract()
            yield scrapy.Request(next_page, callback=self.parse)
The key step is response.css('li.next a::attr(href)')[0].extract(), which returns '/page/2/'; yielding a scrapy.Request with parse as its callback then crawls '/page/2/' in turn, and that is how link following is achieved.
Newer versions provide response.urljoin so the URL no longer has to be concatenated by hand,
and the latest versions also add response.follow, which folds the response.urljoin + scrapy.Request pair into a single call.
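For reference, on a newer Scrapy (1.4 or later, where extract_first(), response.urljoin and response.follow are all available) the same spider could be sketched roughly like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Newer Scrapy accepts plain dicts as scraped items.
            yield {
                'title': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            # response.urljoin would resolve '/page/2/' against the response URL:
            #     yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
            # response.follow does the join and builds the Request in one step:
            yield response.follow(next_page, callback=self.parse)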