1. Background
- Operating system and environment
OS: Windows 10 (master), Ubuntu (slave)
Python version: Python 3.6
Scrapy version: Scrapy 1.5.1
scrapy_redis: must be installed on both machines
Redis: the Redis instance on the master must allow remote connections (see the connectivity check below)
Since the goal here is only to show how to do simple distributed crawling, a site with a fairly simple structure was chosen (the URL is not suitable for publishing and is used for learning purposes only).
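For the slaves to reach the master's Redis, the master's redis.conf usually needs its bind address widened and either protected-mode turned off or a password that matches REDIS_PARAMS in settings.py; scrapy_redis itself installs with `pip install scrapy-redis` on both machines. A minimal connectivity check from a slave, assuming the master IP used throughout this post:
# quick check that a slave can reach the master's Redis (host taken from settings.py below)
import redis

rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
print(rds.ping())  # True means remote connections are working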
2. Code
Main idea
Use the scrapy_redis framework to crawl the site in a distributed way. The work breaks down into the following steps:
1. The first spider collects the URLs of the pages that need downloading and pushes them into a queue in the Redis database (it only needs to run on the master). The slaves fetch the URLs to crawl from this Redis queue.
2. The second spider extracts the movie information and hands it to the pipelines for persistent storage.
3. While downloading the movies, support resumable downloads and show a progress bar.
Project directory structure
- crawlall.py: starts all the spiders
- crawl_url.py: collects the URLs and saves them to the Redis queue
- video_6969.py: crawls the movies
- items.py: defines the movie fields
- pipelines.py: downloads the movies (resumable, with a progress bar) and saves data to Redis
- settings.py: configuration
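Putting those files together, the project tree looks roughly like this (a sketch; the commands package location is an assumption that matches COMMANDS_MODULE in settings.py below):
Video_6969/
├── scrapy.cfg
└── Video_6969/
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    ├── commands/
    │   ├── __init__.py
    │   └── crawlall.py
    └── spiders/
        ├── __init__.py
        ├── crawl_url.py
        └── video_6969.py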
- First, configure our settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for Video_6969 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Video_6969'
SPIDER_MODULES = ['Video_6969.spiders']
NEWSPIDER_MODULE = 'Video_6969.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Video_6969 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 150
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 200
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # In a distributed crawl the item data does not have to go through a local pipeline
    # (nothing needs to be stored locally). The data is stored in the Redis database,
    # so a Redis pipeline component is added here
'Video_6969.pipelines.Video6969Pipeline': 300,
"scrapy_redis.pipelines.RedisPipeline": 100, # item數(shù)據(jù)會報錯到redis
"Video_6969.pipelines.CrawlUrls": 50,
# 'Video_6969.pipelines.Video6969Info': 200,
}
# Redis-related configuration
# Redis host address
REDIS_HOST = '10.36.133.11'  # master host
REDIS_PORT = 6379  # port
# REDIS_PARAMS = {"password": "xxxx"}  # password
# Switch the scheduler to the Scrapy_Redis scheduler (a rewrite of the native scheduler
# by the Scrapy_Redis component that adds distributed scheduling algorithms)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the scrapy_redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Whether the crawl can be paused and resumed (keep the Redis queues when the spider closes)
SCHEDULER_PERSIST = True
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Logging
# Disable logging or adjust the log level
# LOG_ENABLED = False
# LOG_LEVEL = 'ERROR'
LOG_LEVEL = 'DEBUG'
"""
CRITICAL - critical errors
ERROR    - errors
WARNING  - warnings
INFO     - informational messages
DEBUG    - debugging messages
"""
# Log file
LOG_FILE = '6969.log'
# Whether logging is enabled (once the log file is configured this does not need changing)
LOG_ENABLED = True  # defaults to True (logging enabled)
# If True, all standard output (including errors) of the process is redirected to the log
LOG_STDOUT = False
# Log encoding
LOG_ENCODING = 'utf-8'
# Module containing the custom command used to start all spiders
COMMANDS_MODULE = 'Video_6969.commands'
# MongoDB configuration
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "6969"  # database name
MONGO_COLL = "ViodeInfo"  # collection name
# if a username and password are required
# MONGO_USER = "zhangsan"
# MONGO_PSW = "123456"
Note: the spider now inherits from RedisCrawlSpider, and its URLs are fetched from the Redis database under the key set by redis_key, so start_urls must be commented out. Later we will put our starting URL into Redis.
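If you want to seed the starting URL by hand instead of relying on the CrawlUrls pipeline, a minimal sketch (host, key and URL taken from the rest of this post):
# push the starting URL into the queue that video_6969 reads from (its redis_key)
import redis

rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
rds.lpush("video6969:start_urls", "https://www.6969qq.com")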
- crawl_url.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item, UrlItem
class Video6969(CrawlSpider):
name = 'crawl_urls'
start_urls = ['https://www.6969qq.com']
rules = (
        Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),  # category pages
        Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),  # detail ("more") pages
)
def video_info(self, response):
item = UrlItem()
item['html_url'] = response.url
yield item
crawl_url.py is responsible for collecting the URLs of the pages we need to download and storing them in the Redis queue through the pipelines (the persistence could also be done directly inside crawl_url, as in the sketch below).
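A minimal sketch of that alternative, pushing to Redis straight from the callback instead of yielding a UrlItem (same host, key and link rules as above; the class name is just for illustration):
# -*- coding: utf-8 -*-
# variant of crawl_url.py that persists the URLs itself instead of going through a pipeline
import redis
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider


class Video6969Urls(CrawlSpider):
    name = 'crawl_urls_direct'
    start_urls = ['https://www.6969qq.com']
    rules = (
        Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),
        Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),
    )

    def video_info(self, response):
        # push the detail-page URL straight into the queue that video_6969 reads from
        rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
        rds.lpush("video6969:start_urls", response.url)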
- video_6969.py
# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item
class Video6969(RedisCrawlSpider):
name = 'video_6969'
redis_key = "video6969:start_urls"
def parse(self, response):
item = Video6969Item()
item['html_url'] = response.url
        item['name'] = response.xpath("//h1/text()").extract_first()
        item['video_type'] = response.xpath("//div[@class = 'play_nav hidden-xs']//a/@title").extract_first()
item['video_url'] = response.selector.re("(https://\w+.xia12345.com/.+?mp4)")[0]
yield item
The slave machines do not need the crawl_url file at all; they use this file to extract the movie information and download it.
- items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class Video6969Item(scrapy.Item):
video_type = scrapy.Field()
name = scrapy.Field()
html_url = scrapy.Field()
video_url = scrapy.Field()
class UrlItem(scrapy.Item):
html_url = scrapy.Field()
- pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import pymongo
import redis
import requests
import sys
from Video_6969.items import UrlItem, Video6969Item
# Movie download pipeline
class Video6969Pipeline(object):
    dir_path = r'G:\Video_6969'

    def process_item(self, item, spider):
        if isinstance(item, Video6969Item):
            type_path = os.path.join(self.dir_path, item['video_type'])
            if not os.path.exists(type_path):
                os.makedirs(type_path)
            name_path = os.path.join(type_path, item['name'])
            path = name_path + ".mp4"
            try:
                headers = {
                    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.3.2.1000 Chrome/30.0.1599.101 Safari/537.36"
                }
                # receive the video data in a loop
                while True:
                    # if the file already exists, resume: set the position from which to request data
                    if os.path.exists(path):
                        now_length = os.path.getsize(path)
                        print("Network hiccup, resuming download. Already downloaded: {}MB".format(now_length // 1024 // 1024))
                        # the size of the local file is the starting point of the resumed range, in bytes
                        headers['Range'] = 'bytes=%d-' % now_length
                    else:
                        now_length = 0  # bytes downloaded so far
                    res = requests.get(item['video_url'], stream=True,
                                       headers=headers)  # stream=True lets us iterate over the body
                    # with a Range header this is the size of the remaining part, not of the whole file
                    total_length = int(res.headers['Content-Length'])
                    print("About to download: [{}] {} {}MB".format(item["video_type"], item["name"], total_length // 1024 // 1024))
                    # if the reported length is smaller than what we already have, or the file on disk
                    # is at least as large as the reported length, the video is considered complete
                    if total_length < now_length or (
                            os.path.exists(path) and os.path.getsize(path) >= total_length):
                        break
                    # append the received video data to the file
                    with open(path, 'ab') as file:
                        for chunk in res.iter_content(chunk_size=1024):
                            file.write(chunk)
                            now_length += len(chunk)
                            # flush so the data is written out bit by bit
                            file.flush()
                            # render the download progress bar
                            done = int(50 * now_length / total_length)
                            sys.stdout.write(
                                "\r[%s%s]%d%%" % ('█' * done, ' ' * (50 - done), 100 * now_length / total_length))
                            sys.stdout.flush()
                    print()
            except Exception as e:
                print(e)
                raise IOError
            print("[{}] {} finished downloading: {}MB".format(item["video_type"], item["name"], now_length // 1024 // 1024))
        return item
# Store the data in MongoDB
class Video6969Info(object):
    def __init__(self, mongo_host, mongo_db, mongo_coll):
        self.mongo_host = mongo_host
        self.mongo_db = mongo_db
        self.mongo_coll = mongo_coll
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_host=crawler.settings['MONGO_HOST'],
            mongo_db=crawler.settings['MONGO_DB'],
            mongo_coll=crawler.settings['MONGO_COLL']
        )

    def open_spider(self, spider):
        # connect to the database
        self.client = pymongo.MongoClient(self.mongo_host)
        self.db = self.client[self.mongo_db]  # handle to the database
        self.coll = self.db[self.mongo_coll]  # handle to the collection

    def close_spider(self, spider):
        self.client.close()  # close the database connection

    def process_item(self, item, spider):
        data = dict(item)  # convert the item to a plain dict
        try:
            self.coll.insert_one(data)  # insert one document (insert() is deprecated in pymongo 3)
            self.count += 1
        except Exception:
            raise IOError
        if not self.count % 100:
            print("Records stored so far: %d" % self.count)
        return item
# Push the collected URLs into the Redis queue
class CrawlUrls(object):
def process_item(self, item, spider):
rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
if isinstance(item, UrlItem):
rds.lpush("video6969:start_urls", item['html_url'])
return item
Here requests is used to fetch the movie's binary data and write it to disk. Because network hiccups easily corrupt a video file, resumable downloading is added via the HTTP Range header.
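Resuming only works if the video host honours Range requests. A standalone sketch for checking that before trusting partially downloaded files (the function name is just for illustration):
# check whether a URL supports byte-range requests (resume will not work otherwise)
import requests

def supports_resume(url):
    # ask for the first byte only; a 206 Partial Content reply means Range is honoured
    res = requests.get(url, headers={"Range": "bytes=0-0"}, stream=True, timeout=10)
    return res.status_code == 206 or res.headers.get("Accept-Ranges") == "bytes"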
- crawlall.py
from scrapy.commands import ScrapyCommand
class Command(ScrapyCommand):
requires_project = True
def syntax(self):
return '[options]'
def short_desc(self):
return 'Runs all of the spiders'
def run(self, args, opts):
spider_list = self.crawler_process.spiders.list()
for name in spider_list:
self.crawler_process.crawl(name, **opts.__dict__)
self.crawler_process.start()
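Note: for `scrapy crawlall` to be recognised, this file must live in the package referenced by COMMANDS_MODULE = 'Video_6969.commands' in settings.py, i.e. Video_6969/commands/crawlall.py next to an empty __init__.py, and the command class must be named Command as above (this layout is the standard Scrapy custom-command convention rather than something shown explicitly in the post).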
- Start
scrapy crawlall
Once started, the crawl_url spider collects URLs and stores them in the Redis queue; the slaves pick up those URLs and start downloading. You can of course organise the distributed crawl in other ways as well.
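Concretely, assuming the master runs both spiders through the custom command and the slaves only run video_6969, the start-up looks something like this:
# on the master (runs crawl_urls and video_6969 via the crawlall command)
scrapy crawlall

# on each slave (only video_6969.py, items.py, pipelines.py and settings.py are needed)
scrapy crawl video_6969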
Note: when saving the movies, make sure you have read and write permission on that directory.
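A quick way to check this up front, reusing the dir_path from the pipeline (a sketch):
# verify that the download directory exists and is writable before starting the crawl
import os

dir_path = r'G:\Video_6969'
os.makedirs(dir_path, exist_ok=True)           # create it if it does not exist yet
print(os.access(dir_path, os.R_OK | os.W_OK))  # True means we can read and write there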