Preface
System environment: CentOS 7
This article assumes you have already installed virtualenv and activated the virtual environment ENV1. If not, see: Creating a Python sandbox (virtual) environment with virtualenv.
Goal
Use Scrapy's command-line tool to create the project and a spider, write the code in PyCharm, and run the spider inside the virtual environment to scrape the article and author information from http://quotes.toscrape.com/, saving the results to a txt file.
Walkthrough
1. Use the command-line tool to create the project, specifying the project path. Usage:
scrapy startproject <project_name> [project_dir]
<project_name> the project name
[project_dir] the project path; defaults to the current directory when omitted
In this article, quotes is the project name and PycharmProjects/quotes is the project path.
(ENV1) [eason@localhost ~]$scrapy startproject quotes PycharmProjects/quotes
New Scrapy project 'quotes', using template directory '/home/eason/ENV1/lib/python2.7/site-packages/scrapy/templates/project', created in:
/home/eason/PycharmProjects/quotes
You can start your first spider with:
cd PycharmProjects/quotes
scrapy genspider example example.com
(ENV1) [eason@localhost ~]$
2. Enter the project directory and create the spider. Usage:
scrapy genspider [-t template] <name> <domain>
[-t template] specifies the template used to generate the spider; the four available templates are listed below, and basic is the default when the option is omitted
basic
crawl
csvfeed
xmlfeed
<name> sets the spider's name
<domain> sets allowed_domains and start_urls
In this article the spider is named quotes_spider.
(ENV1) [eason@localhost ~]$cd PycharmProjects/quotes
(ENV1) [eason@localhost quotes]$scrapy genspider quotes_spider quotes.toscrape.com
Created spider 'quotes_spider' using template 'basic' in module:
quotes.spiders.quotes_spider
(ENV1) [eason@localhost quotes]$
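If you want to double-check which templates your own Scrapy install ships with, genspider can list them (the listing below is what the install used here prints; yours may differ):
(ENV1) [eason@localhost quotes]$scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed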
At this point, creating the project and the spider is done.
3. Open the project we just created in PyCharm.
The directory structure of the newly created project looks like this:
├── quotes
│   ├── spiders
│   │   ├── __init__.py
│   │   └── quotes_spider.py
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   └── settings.py
└── scrapy.cfg
The official documentation explains these files as follows:
quotes/
    the project's Python module; you'll import your code from here
quotes/spiders/
    a directory where you'll later put your spiders (the code that actually crawls the pages)
quotes/spiders/quotes_spider.py
    the spider file we just generated
quotes/items.py
    the project's items definition file (in effect, the structure of the data to be scraped)
quotes/pipelines.py
    the project's pipelines file (where you define how the scraped data is saved)
quotes/settings.py
    the project's settings file
scrapy.cfg
    the deploy configuration file
4. At this point, opening quotes_spider.py produces an error saying the scrapy module cannot be found. This happens because PyCharm opened the project with the global interpreter, and scrapy is not installed globally, so we change the project settings to let PyCharm use the packages and modules from the virtual environment.
Open File-->Settings, and in the Project Interpreter dropdown select the virtual environment that is already activated. Your path may differ; in this article it is /home/eason/ENV1/bin/python.
After selecting it, click OK and reopen quotes_spider.py; the error is gone.
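As a quick sanity check (an addition of mine, not part of the original steps), you can confirm from PyCharm's Terminal that the selected interpreter really sees Scrapy:
(ENV1) [eason@localhost quotes]$python -c "import scrapy; print(scrapy.__version__)"
If this prints a version number instead of an ImportError, the interpreter is wired up correctly.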
5. Edit items.py to define the data structure.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article = scrapy.Field()
    author = scrapy.Field()
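A scrapy.Item behaves like a dict whose keys are restricted to the declared fields, which is why the spider in the next step can fill it with item['article'] = .... A minimal sketch (the values here are made-up examples):
from quotes.items import QuotesItem

item = QuotesItem()
item['article'] = 'An example quote.'  # hypothetical value
item['author'] = 'Some Author'         # hypothetical value
print(item['author'])
# item['tags'] = ['books']  # would raise KeyError: 'tags' is not a declared field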
6. Edit quotes_spider.py to add the scraping rules.
# -*- coding: utf-8 -*-
import scrapy

from ..items import QuotesItem


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        items = []
        # Each quote on the page lives in a <div class="quote"> element
        articles = response.xpath("//div[@class='quote']")
        for article in articles:
            item = QuotesItem()
            content = article.xpath("span[@class='text']/text()").extract_first()
            author = article.xpath("span/small[@class='author']/text()").extract_first()
            item['article'] = content.encode('utf-8')
            item['author'] = author.encode('utf-8')
            items.append(item)
        return items
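Returning a list works, but parse() may also be written as a generator; yielding each item as soon as it is built is the more idiomatic Scrapy pattern and avoids holding the whole result set in memory. A sketch of the same loop in that style (a drop-in replacement for the parse() above):
    def parse(self, response):
        for article in response.xpath("//div[@class='quote']"):
            item = QuotesItem()
            item['article'] = article.xpath("span[@class='text']/text()").extract_first().encode('utf-8')
            item['author'] = article.xpath("span/small[@class='author']/text()").extract_first().encode('utf-8')
            yield item  # hand each item to the pipeline as soon as it is ready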
7. Edit pipelines.py to define how the data is saved; in this article each item is appended to the text file result.txt.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QuotesPipeline(object):
    def process_item(self, item, spider):
        # The scraped data is saved under /home/eason/PycharmProjects/quotes/
        f = open(r"/home/eason/PycharmProjects/quotes/result.txt", "a")
        f.write(item['article'] + '\t' + item['author'] + '\n')
        f.close()
        return item
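Opening and closing the file for every single item is fine for a crawl this small, but Scrapy pipelines also provide open_spider/close_spider hooks, so the file can be opened once per crawl instead. A variant of the same pipeline (same path and output format; only the file handling changes):
class QuotesPipeline(object):
    def open_spider(self, spider):
        # Called once when the crawl starts: open the output file a single time
        self.f = open(r"/home/eason/PycharmProjects/quotes/result.txt", "a")

    def process_item(self, item, spider):
        self.f.write(item['article'] + '\t' + item['author'] + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends: release the file handle
        self.f.close()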
8. For the pipeline to take effect, it must also be registered in settings.py.
# -*- coding: utf-8 -*-
# Scrapy settings for quotes project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'quotes'
SPIDER_MODULES = ['quotes.spiders']
NEWSPIDER_MODULE = 'quotes.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300,
}
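The value 300 is an ordering, not an identifier: Scrapy runs registered pipelines from the lowest value to the highest, conventionally within the 0-1000 range. With a single pipeline any value works; if a second, hypothetical pipeline were added later, the ordering would look like this:
ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300,
    # 'quotes.pipelines.DedupePipeline': 500,  # hypothetical: would receive each item after QuotesPipeline
}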
9. Open the Terminal in PyCharm, activate the virtual environment, and run the spider.
[eason@localhost quotes]$source /home/eason/ENV1/bin/activate
(ENV1) [eason@localhost quotes]$scrapy crawl quotes_spider
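As an aside, for a simple dump like this Scrapy can also export items without any custom pipeline, using its built-in feed exports; for example, the following alternative (not what this article uses) would write the items to a JSON file instead:
(ENV1) [eason@localhost quotes]$scrapy crawl quotes_spider -o result.json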
10. When the crawl finishes, a result.txt file is generated under /home/eason/PycharmProjects/quotes/, containing the scraped quotes and their authors.
11. Done!