Preface
System environment: CentOS 7
This article assumes you have already installed virtualenv and activated the virtual environment ENV1; if not, see: Creating a Python sandbox (virtual) environment with virtualenv. In the previous post (Scrapy study notes (2) - Running your first spider in a virtual environment with PyCharm) we used Scrapy's command-line tool to create a project and a spider, wrote the code in PyCharm, ran the spider inside the virtual environment to scrape the article and author information from http://quotes.toscrape.com/, and finally saved the results to a txt file. That spider could only scrape a single page; today we build on it and go a step further.
Goal
Follow the next (next-page) links to crawl the article and author information from http://quotes.toscrape.com/ page by page, and save the results to a MySQL database.
Main content
1. Since we will use Python to operate the MySQL database, we first need to install the relevant Python module; this post uses MySQLdb:
# sudo yum install mysql-devel
# pip install MySQL-python
2. Create the target table quotes in the database; the DDL is as follows:
CREATE TABLE `quotes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `article` varchar(500) DEFAULT NULL,
  `author` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
3. The full code of items.py is as follows:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article = scrapy.Field()
    author = scrapy.Field()
4. Modify quotes_spider.py as follows:
# -*- coding: utf-8 -*-
import scrapy
from ..items import QuotesItem
from urlparse import urljoin
from scrapy.http import Request

class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        articles = response.xpath("//div[@class='quote']")
        next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
        for article in articles:
            item = QuotesItem()
            content = article.xpath("span[@class='text']/text()").extract_first()
            author = article.xpath("span/small[@class='author']/text()").extract_first()
            item['article'] = content.encode('utf-8')
            item['author'] = author.encode('utf-8')
            yield item  # yield returns the item without stopping execution
        if next_page:  # check whether a next link exists
            url = urljoin(self.start_urls[0], next_page)  # build the absolute url
            yield Request(url, callback=self.parse)
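The pagination step hinges on urljoin: the next-page href scraped from the page is relative, so it has to be joined with the start URL before a new Request can be issued. A minimal sketch of that joining logic (the /page/2/ value is a hypothetical href for illustration, not taken from a live crawl):

```python
# urljoin lives in urlparse on Python 2 (as used in this post)
# and in urllib.parse on Python 3
try:
    from urlparse import urljoin        # Python 2
except ImportError:
    from urllib.parse import urljoin    # Python 3

base = 'http://quotes.toscrape.com'
next_page = '/page/2/'  # hypothetical relative href from the next link
url = urljoin(base, next_page)
print(url)  # -> http://quotes.toscrape.com/page/2/
```

Because the href is root-relative, urljoin keeps the scheme and host from the base URL and replaces the path, so the same call works on every page of the crawl.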
5. Modify pipelines.py to save the scraped data to the database:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class QuotesPipeline(object):
    def __init__(self):
        db_args = dict(
            host="192.168.0.107",  # database host ip
            db="scrapy",           # database name
            user="root",           # user name
            passwd="123456",       # password
            charset='utf8',        # database character encoding
            cursorclass=MySQLdb.cursors.DictCursor,  # return result sets as dicts
            use_unicode=True,
        )
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **db_args)

    def process_item(self, item, spider):
        self.dbpool.runInteraction(self.insert_into_quotes, item)
        return item

    def insert_into_quotes(self, conn, item):
        conn.execute(
            '''
            INSERT INTO quotes(article, author)
            VALUES (%s, %s)
            ''',
            (item['article'], item['author'])
        )
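The insert above uses a parameterized query (%s placeholders plus a tuple of values) rather than string formatting, so the driver escapes any quote characters inside the scraped text. You can try the same pattern without a MySQL server by using the stdlib sqlite3 module as a stand-in (an assumption for demonstration only; note sqlite3 uses ? placeholders where MySQLdb uses %s):

```python
import sqlite3

# in-memory database mirroring the quotes table from step 2
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE quotes (id INTEGER PRIMARY KEY AUTOINCREMENT, '
             'article VARCHAR(500), author VARCHAR(50))')

# a hypothetical scraped item containing a quote character
item = {'article': "It's a quote with an apostrophe.", 'author': 'Somebody'}

# parameterized insert: the driver handles escaping, not the caller
conn.execute('INSERT INTO quotes(article, author) VALUES (?, ?)',
             (item['article'], item['author']))

print(conn.execute('SELECT article, author FROM quotes').fetchone())
```

The same two-argument execute() call shape carries over to the MySQLdb cursor used in the pipeline; only the placeholder token differs.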
6. settings.py is shown below; the only change is registering the pipeline in ITEM_PIPELINES:
# -*- coding: utf-8 -*-
# Scrapy settings for quotes project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'quotes'
SPIDER_MODULES = ['quotes.spiders']
NEWSPIDER_MODULE = 'quotes.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300,
}
7. Run the spider:
(ENV1) [eason@localhost quotes]$ scrapy crawl quotes_spider
8. Check the results in the database. Done!