When Scrapy scrapes data from a web page, saving it to a database is handled through pipelines. Here is what the official documentation says:
After an item has been collected by a spider, it is passed to the Item Pipeline, where several components process it one after another in a defined order.
Typical uses of an item pipeline include:
- Cleaning HTML data
- Validating scraped data (checking that items contain certain fields)
- Checking for (and dropping) duplicates
- Storing the scraped items in a database
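As a quick illustration of the validation and de-duplication points above, a minimal, hypothetical pipeline (not part of the project built in this article) might look like this:

from scrapy.exceptions import DropItem

class DedupValidatePipeline(object):
    """Toy pipeline: drop items without a title and drop duplicate URLs."""
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('missing title in %s' % item)
        if item['url'] in self.seen_urls:
            raise DropItem('duplicate item %s' % item['url'])
        self.seen_urls.add(item['url'])
        return item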
1. Parsing the page data in the Spider class
This article uses the Jianshu "Reading" collection as an example and scrapes the data of every article included in it: http://www.reibang.com/collection/yD9GAd
Parse the page data we want to crawl, wrap it in an Item object, and yield it. The fragment below extracts the fields for one article (a sketch of the enclosing spider follows the fragment).
item = JsArticleItem()
author = info.xpath('p/a/text()').extract()
pubday = info.xpath('p/span/@data-shared-at').extract()
author_url = info.xpath('p/a/@href').extract()
title = info.xpath('h4/a/text()').extract()
url = info.xpath('h4/a/@href').extract()
reads = info.xpath('div/a[1]/text()').extract()
reads = filter(str.isdigit, str(reads[0]))  # keep only the digits (Python 2: filter() on a str returns a str)
comments = info.xpath('div/a[2]/text()').extract()
comments = filter(str.isdigit, str(comments[0]))
likes = info.xpath('div/span[1]/text()').extract()
likes = filter(str.isdigit, str(likes[0]))
rewards = info.xpath('div/span[2]/text()')
# check whether the article has any reward (tip) data
if len(rewards) == 1:
    rds = info.xpath('div/span[2]/text()').extract()
    rds = int(filter(str.isdigit, str(rds[0])))
else:
    rds = 0
item['author'] = author
item['url'] = 'http://www.reibang.com' + url[0]
item['reads'] = reads
item['title'] = title
item['comments'] = comments
item['likes'] = likes
item['rewards'] = rds
item['pubday'] = pubday
yield item
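The full spider class is not reproduced in this article. As a rough sketch, the fragment above sits inside the parse callback roughly like this (the class name, the items module path, and the XPath that selects the article nodes are assumptions; the spider name and start URL come from the article):

# Sketch of the enclosing spider; only the inner extraction loop shown above
# comes from the original project.
from scrapy import Spider
from jsuser.items import JsArticleItem  # module path assumed from the project name

class JsArticleSpider(Spider):
    name = 'zhanti'
    start_urls = ['http://www.reibang.com/collection/yD9GAd']

    def parse(self, response):
        # 'info' is one article node on the collection page (this XPath is a guess)
        for info in response.xpath('//ul[@class="note-list"]/li/div'):
            item = JsArticleItem()
            # ... the field extraction shown above goes here ...
            yield item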
The Item class is defined in items.py:
from scrapy import Item, Field

class JsArticleItem(Item):
    author = Field()
    url = Field()
    title = Field()
    reads = Field()
    comments = Field()
    likes = Field()
    rewards = Field()
    pubday = Field()
二怠堪、pipelines.py中定義一個(gè)類揽乱,操作數(shù)據(jù)庫
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class WebcrawlerScrapyPipeline(object):
    '''The pipeline class that saves items to the database.
    1. Configure it in settings.py.
    2. When your spider yields an item, this pipeline is invoked automatically.'''

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        '''1. @classmethod declares a class method, as opposed to the ordinary instance methods we usually write.
        2. The first parameter of a class method is cls (short for "class", i.e. the class itself), while the first parameter of an instance method is self, an instance of the class.
        3. It can be called on the class itself, e.g. C.f(), much like a static method in Java.'''
        # read the database parameters configured in settings.py
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',  # set the charset, otherwise Chinese text may come out garbled
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=False,
        )
        # ** expands the dict into keyword arguments, i.e. host=xxx, db=yyy, ...
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)
        return cls(dbpool)  # the dbpool is handed to the instance and is available as self.dbpool

    # called by the pipeline for every yielded item
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)  # run the insert
        query.addErrback(self._handle_error, item, spider)  # attach the error handler
        return item

    # write the item to the database; the SQL statement lives here
    def _conditional_insert(self, tx, item):
        sql = "insert into jsbooks(author,title,url,pubday,comments,likes,rewards,views) values(%s,%s,%s,%s,%s,%s,%s,%s)"
        params = (item['author'], item['title'], item['url'], item['pubday'],
                  item['comments'], item['likes'], item['rewards'], item['reads'])
        tx.execute(sql, params)

    # error handler for failed inserts
    def _handle_error(self, failure, item, spider):
        print(failure)
三诈豌、在settings.py中指定數(shù)據(jù)庫操作的類,啟用pipelines組件
ITEM_PIPELINES = {
    'jsuser.pipelines.WebcrawlerScrapyPipeline': 300,  # save to the MySQL database
}

# MySQL database settings
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'testdb'   # database name, change as needed
MYSQL_USER = 'root'       # database user, change as needed
MYSQL_PASSWD = '1234567'  # database password, change as needed
MYSQL_PORT = 3306         # database port, used in dbhelper
Other settings: spoof the browser User-Agent and add a download delay to avoid getting banned:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
Run the spider with cmdline.execute("scrapy crawl zhanti".split()), for example from a small launcher script like the sketch below, and off it goes.
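A minimal launcher, assuming it lives in the project root next to scrapy.cfg (the filename run.py is just a suggestion):

# run.py - start the crawl from a script instead of typing the scrapy command
from scrapy import cmdline

cmdline.execute("scrapy crawl zhanti".split())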
Storing Scrapy results in MongoDB seems even more convenient and needs less code (a rough sketch follows); see the article linked in the series below.
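For comparison, a MongoDB pipeline can be as short as this sketch using pymongo (the settings keys MONGO_URI and MONGO_DATABASE, the collection name, and the class name are assumptions, not the code from the linked article):

import pymongo

class MongoPipeline(object):
    """Rough sketch of a MongoDB pipeline; not the code from the linked article."""
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read connection settings from settings.py, with defaults
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'jianshu'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # dict(item) works because Scrapy items are dict-like
        self.db['jsbooks'].insert_one(dict(item))
        return item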
My series of articles on the Scrapy crawler framework: