Scraping the Maoyan Movies TOP100
Reference: 靜覓 (Cui Qingcai's personal blog): https://cuiqingcai.com/5534.html
Goal: use Scrapy to crawl the Maoyan Movies TOP100 board and save the results to a MongoDB database
Target URL: http://maoyan.com/board/4?offset=0
Analysis / knowledge points:
Crawling difficulty:
a. Beginner level: the page structure is simple, static HTML with only a little JS and no AJAX involved;
b. Handling pagination requires a regular expression.
Using MongoDB's update statement:
a. The update statement deduplicates or inserts new data in one step, using title as the dedupe key:
def process_item(self, item, spider):
    self.db['movies'].update_one({'title': item['title']}, {'$set': dict(item)}, upsert=True)  # note upsert=True: update if present, otherwise insert
    return item
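As a minimal standalone sketch of that upsert behaviour (assuming a local MongoDB instance and a throwaway test database; the sample document is just an illustration), the first call inserts a document and the second only updates it:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
col = client['test']['movies']

# First call: nothing matches the filter, so upsert inserts a new document.
col.update_one({'title': '霸王別姬'}, {'$set': {'title': '霸王別姬', 'score': '9.6'}}, upsert=True)
# Second call: the title matches, so the existing document is updated in place.
col.update_one({'title': '霸王別姬'}, {'$set': {'score': '9.7'}}, upsert=True)

print(col.count_documents({'title': '霸王別姬'}))  # 1 -- no duplicate was created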
Steps:
- Create the Scrapy project and generate the maoyan spider
Terminal: > scrapy startproject maoyan_movie
Terminal: > scrapy genspider maoyan maoyan.com/board/4?offset=
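For orientation, the generated project layout should look roughly like this (assuming the default Scrapy templates):

maoyan_movie/
├── scrapy.cfg
└── maoyan_movie/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── maoyan.py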
- Configure the settings.py file
# MongoDB settings
MONGO_URI = 'localhost'
MONGO_DB = 'maoyan_movie'
...
# enable the MongoPipeline
ITEM_PIPELINES = {
    'maoyan_movie.pipelines.MongoPipeline': 300,
}
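Note that Maoyan may reject Scrapy's default user agent; if requests come back with 403 responses, a browser-like USER_AGENT is a common workaround. The two settings below are an assumption for hardening, not part of the original setup:

# hypothetical hardening -- only needed if the site rejects the default user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
ROBOTSTXT_OBEY = False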
- Write the items.py file
from scrapy import Item, Field

class MovieItem(Item):
    title = Field()        # movie title
    actors = Field()       # starring actors
    releasetime = Field()  # release date
    cover_img = Field()    # thumbnail image URL
    detail_page = Field()  # URL of the movie's detail page
    score = Field()        # movie rating
- Write the pipelines.py file
Adapted from the Scrapy official documentation: https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=mongo
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    # !! Write to MongoDB with an upsert, deduplicating on the title field
    def process_item(self, item, spider):
        self.db['movies'].update_one({'title': item['title']}, {'$set': dict(item)}, upsert=True)
        return item
- Write the spiders > maoyan.py file
Notes:
a) Use Scrapy's CSS selectors to parse the page nodes;
b) When extracting the thumbnail URL, write the CSS selector against the raw page source, which differs from what the element inspector shows:
item['cover_img'] = movie.css('a.image-link img.board-img::attr(data-src)').extract_first()
c) The next-page link is hard to reach directly with XPath or CSS, so match it with a regex instead:
next_page = response.xpath('.').re_first(r'href="(.*?)">下一頁</a>')
d) The full code is as follows:
from scrapy import Spider, Request
from maoyan_movie.items import MovieItem

class MaoyanSpider(Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/board/4?offset=']
    # base URL prefix for each movie's detail page
    base_url = 'http://maoyan.com'
    # URL prefix for the next page
    next_base_url = 'http://maoyan.com/board/4'

    def parse(self, response):
        if response:
            # all movie nodes on the current page -- note: do NOT call extract() here
            movies = response.css('dl.board-wrapper dd')
            for movie in movies:
                item = MovieItem()  # create a fresh item per movie
                item['title'] = movie.css('p.name a::text').extract_first()
                item['actors'] = movie.css('p.star::text').extract_first().strip()
                item['releasetime'] = movie.css('p.releasetime::text').extract_first().strip()
                item['score'] = movie.css('i.integer::text').extract_first() + movie.css(
                    'i.fraction::text').extract_first()
                item['detail_page'] = self.base_url + movie.css('p.name a::attr(href)').extract_first()
                # write this selector against the raw page source; it differs from the
                # inspector view, presumably because JS rewrites the attribute
                item['cover_img'] = movie.css('a.image-link img.board-img::attr(data-src)').extract_first()
                yield item
            # handle pagination
            next_page = response.xpath('.').re_first(r'href="(.*?)">下一頁</a>')
            if next_page:
                next_url = self.next_base_url + next_page
                yield Request(url=next_url, callback=self.parse, dont_filter=True)
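With everything in place, run the spider from the project root:
Terminal: > scrapy crawl maoyan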
- Results
(screenshot: temp.png)
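The stored data can also be sanity-checked directly with pymongo (a sketch assuming the MONGO_URI/MONGO_DB settings above):

import pymongo

client = pymongo.MongoClient('localhost')
db = client['maoyan_movie']
print(db['movies'].count_documents({}))        # expect 100 for the TOP100 board
print(db['movies'].find_one({}, {'_id': 0}))   # inspect one stored movie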
Summary
- A beginner-level project that walks through Scrapy's basic workflow;
- Scrapy's CSS/XPath selectors and regex patterns need more practice;
- pymongo's update statements also need more practice to master.