1. Background
- Operating system and environment
OS: Windows 10 (master), Ubuntu (slave)
Python version: Python 3.6
Scrapy version: Scrapy 1.5.1
scrapy_redis: must be installed on both machines
Redis: the Redis instance on the master must allow remote connections (see the connectivity check below)
Since the goal here is only to show how to do simple distributed crawling, a site with a fairly simple structure was chosen (the URL is not suitable for publishing and is used for learning purposes only).
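For the slaves to reach the master's Redis, the master's redis.conf usually needs its bind address widened and either protected-mode turned off or a password that matches REDIS_PARAMS in settings.py; scrapy_redis itself installs with `pip install scrapy-redis` on both machines. A minimal connectivity check from a slave, assuming the master IP used throughout this post:
# quick check that a slave can reach the master's Redis (host taken from settings.py below)
import redis

rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
print(rds.ping())  # True means remote connections are working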
2. Code
Main idea
Use the scrapy_redis framework to crawl the site in a distributed way. The work breaks down into the following steps:
1. The first spider collects the URLs of the pages that need downloading and pushes them into a queue in the Redis database (it only needs to run on the master). The slaves fetch the URLs to crawl from this Redis queue.
2. The second spider extracts the movie information and hands it to the pipelines for persistent storage.
3. While downloading the movies, support resumable downloads and show a progress bar.
Project directory structure
- crawlall.py: starts all the spiders
- crawl_url.py: collects the URLs and saves them to the Redis queue
- video_6969.py: crawls the movies
- items.py: defines the movie fields
- pipelines.py: downloads the movies (resumable, with a progress bar) and saves data to Redis
- settings.py: configuration
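Putting those files together, the project tree looks roughly like this (a sketch; the commands package location is an assumption that matches COMMANDS_MODULE in settings.py below):
Video_6969/
├── scrapy.cfg
└── Video_6969/
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    ├── commands/
    │   ├── __init__.py
    │   └── crawlall.py
    └── spiders/
        ├── __init__.py
        ├── crawl_url.py
        └── video_6969.py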
- First, configure our settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for Video_6969 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Video_6969'
SPIDER_MODULES = ['Video_6969.spiders']
NEWSPIDER_MODULE = 'Video_6969.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Video_6969 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 150
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 200
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # In a distributed crawl the item data does not have to go through a local pipeline
    # (nothing needs to be stored locally). The data is stored in the Redis database,
    # so a Redis pipeline component is added here
'Video_6969.pipelines.Video6969Pipeline': 300,
"scrapy_redis.pipelines.RedisPipeline": 100, # item數(shù)據(jù)會報錯到redis
"Video_6969.pipelines.CrawlUrls": 50,
# 'Video_6969.pipelines.Video6969Info': 200,
}
# Redis-related configuration
# Redis host address
REDIS_HOST = '10.36.133.11'  # master host
REDIS_PORT = 6379  # port
# REDIS_PARAMS = {"password": "xxxx"}  # password
# Switch the scheduler to the Scrapy_Redis scheduler (a rewrite of the native scheduler
# by the Scrapy_Redis component that adds distributed scheduling algorithms)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the scrapy_redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Whether the crawl can be paused and resumed (keep the Redis queues when the spider closes)
SCHEDULER_PERSIST = True
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Logging
# Disable logging or adjust the log level
# LOG_ENABLED = False
# LOG_LEVEL = 'ERROR'
LOG_LEVEL = 'DEBUG'
"""
CRITICAL - critical errors
ERROR    - errors
WARNING  - warnings
INFO     - informational messages
DEBUG    - debugging messages
"""
# Log file
LOG_FILE = '6969.log'
# Whether logging is enabled (once the log file is configured this does not need changing)
LOG_ENABLED = True  # defaults to True (logging enabled)
# If True, all standard output (including errors) of the process is redirected to the log
LOG_STDOUT = False
# Log encoding
LOG_ENCODING = 'utf-8'
# Module containing the custom command used to start all spiders
COMMANDS_MODULE = 'Video_6969.commands'
# MongoDB configuration
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "6969"  # database name
MONGO_COLL = "ViodeInfo"  # collection name
# if a username and password are required
# MONGO_USER = "zhangsan"
# MONGO_PSW = "123456"
Note: the spider now inherits from RedisCrawlSpider, and its URLs are fetched from the Redis database under the key set by redis_key, so start_urls must be commented out. Later we will put our starting URL into Redis.
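If you want to seed the starting URL by hand instead of relying on the CrawlUrls pipeline, a minimal sketch (host, key and URL taken from the rest of this post):
# push the starting URL into the queue that video_6969 reads from (its redis_key)
import redis

rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
rds.lpush("video6969:start_urls", "https://www.6969qq.com")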
- crawl_url.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item, UrlItem
class Video6969(CrawlSpider):
name = 'crawl_urls'
start_urls = ['https://www.6969qq.com']
rules = (
        Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),  # category pages
        Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),  # detail ("more") pages
)
def video_info(self, response):
item = UrlItem()
item['html_url'] = response.url
yield item
crawl_url.py is responsible for collecting the URLs of the pages we need to download and storing them in the Redis queue through the pipelines (the persistence could also be done directly inside crawl_url, as in the sketch below).
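A minimal sketch of that alternative, pushing to Redis straight from the callback instead of yielding a UrlItem (same host, key and link rules as above; the class name is just for illustration):
# -*- coding: utf-8 -*-
# variant of crawl_url.py that persists the URLs itself instead of going through a pipeline
import redis
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider


class Video6969Urls(CrawlSpider):
    name = 'crawl_urls_direct'
    start_urls = ['https://www.6969qq.com']
    rules = (
        Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),
        Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),
    )

    def video_info(self, response):
        # push the detail-page URL straight into the queue that video_6969 reads from
        rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
        rds.lpush("video6969:start_urls", response.url)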
- video_6969.py
# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item
class Video6969(RedisCrawlSpider):
name = 'video_6969'
redis_key = "video6969:start_urls"
def parse(self, response):
item = Video6969Item()
item['html_url'] = response.url
        item['name'] = response.xpath("//h1/text()").extract_first()
        item['video_type'] = response.xpath("//div[@class = 'play_nav hidden-xs']//a/@title").extract_first()
item['video_url'] = response.selector.re("(https://\w+.xia12345.com/.+?mp4)")[0]
yield item
The slave machines do not need the crawl_url file at all; they use this file to extract the movie information and download it.
- items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class Video6969Item(scrapy.Item):
video_type = scrapy.Field()
name = scrapy.Field()
html_url = scrapy.Field()
video_url = scrapy.Field()
class UrlItem(scrapy.Item):
html_url = scrapy.Field()
- pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import pymongo
import redis
import requests
import sys
from Video_6969.items import UrlItem, Video6969Item
# Movie download pipeline
class Video6969Pipeline(object):
    dir_path = r'G:\Video_6969'

    def process_item(self, item, spider):
        if isinstance(item, Video6969Item):
            type_path = os.path.join(self.dir_path, item['video_type'])
            if not os.path.exists(type_path):
                os.makedirs(type_path)
            name_path = os.path.join(type_path, item['name'])
            path = name_path + ".mp4"
            try:
                headers = {
                    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.3.2.1000 Chrome/30.0.1599.101 Safari/537.36"
                }
                # receive the video data in a loop
                while True:
                    # if the file already exists, resume: set the position from which to request data
                    if os.path.exists(path):
                        now_length = os.path.getsize(path)
                        print("Network hiccup, resuming download. Already downloaded: {}MB".format(now_length // 1024 // 1024))
                        # the size of the local file is the starting point of the resumed range, in bytes
                        headers['Range'] = 'bytes=%d-' % now_length
                    else:
                        now_length = 0  # bytes downloaded so far
                    res = requests.get(item['video_url'], stream=True,
                                       headers=headers)  # stream=True lets us iterate over the body
                    # with a Range header this is the size of the remaining part, not of the whole file
                    total_length = int(res.headers['Content-Length'])
                    print("About to download: [{}] {} {}MB".format(item["video_type"], item["name"], total_length // 1024 // 1024))
                    # if the reported length is smaller than what we already have, or the file on disk
                    # is at least as large as the reported length, the video is considered complete
                    if total_length < now_length or (
                            os.path.exists(path) and os.path.getsize(path) >= total_length):
                        break
                    # append the received video data to the file
                    with open(path, 'ab') as file:
                        for chunk in res.iter_content(chunk_size=1024):
                            file.write(chunk)
                            now_length += len(chunk)
                            # flush so the data is written out bit by bit
                            file.flush()
                            # render the download progress bar
                            done = int(50 * now_length / total_length)
                            sys.stdout.write(
                                "\r[%s%s]%d%%" % ('█' * done, ' ' * (50 - done), 100 * now_length / total_length))
                            sys.stdout.flush()
                    print()
            except Exception as e:
                print(e)
                raise IOError
            print("[{}] {} finished downloading: {}MB".format(item["video_type"], item["name"], now_length // 1024 // 1024))
        return item
# Store the data in MongoDB
class Video6969Info(object):
    def __init__(self, mongo_host, mongo_db, mongo_coll):
        self.mongo_host = mongo_host
        self.mongo_db = mongo_db
        self.mongo_coll = mongo_coll
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_host=crawler.settings['MONGO_HOST'],
            mongo_db=crawler.settings['MONGO_DB'],
            mongo_coll=crawler.settings['MONGO_COLL']
        )

    def open_spider(self, spider):
        # connect to the database
        self.client = pymongo.MongoClient(self.mongo_host)
        self.db = self.client[self.mongo_db]  # handle to the database
        self.coll = self.db[self.mongo_coll]  # handle to the collection

    def close_spider(self, spider):
        self.client.close()  # close the database connection

    def process_item(self, item, spider):
        data = dict(item)  # convert the item to a plain dict
        try:
            self.coll.insert_one(data)  # insert one document (insert() is deprecated in pymongo 3)
            self.count += 1
        except Exception:
            raise IOError
        if not self.count % 100:
            print("Records stored so far: %d" % self.count)
        return item
# Push the collected URLs into the Redis queue
class CrawlUrls(object):
def process_item(self, item, spider):
rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
if isinstance(item, UrlItem):
rds.lpush("video6969:start_urls", item['html_url'])
return item
Here requests is used to fetch the movie's binary data and write it to disk. Because network hiccups easily corrupt a video file, resumable downloading is added via the HTTP Range header.
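Resuming only works if the video host honours Range requests. A standalone sketch for checking that before trusting partially downloaded files (the function name is just for illustration):
# check whether a URL supports byte-range requests (resume will not work otherwise)
import requests

def supports_resume(url):
    # ask for the first byte only; a 206 Partial Content reply means Range is honoured
    res = requests.get(url, headers={"Range": "bytes=0-0"}, stream=True, timeout=10)
    return res.status_code == 206 or res.headers.get("Accept-Ranges") == "bytes"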
- crawlall.py
from scrapy.commands import ScrapyCommand
class Command(ScrapyCommand):
requires_project = True
def syntax(self):
return '[options]'
def short_desc(self):
return 'Runs all of the spiders'
def run(self, args, opts):
spider_list = self.crawler_process.spiders.list()
for name in spider_list:
self.crawler_process.crawl(name, **opts.__dict__)
self.crawler_process.start()
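Note: for `scrapy crawlall` to be recognised, this file must live in the package referenced by COMMANDS_MODULE = 'Video_6969.commands' in settings.py, i.e. Video_6969/commands/crawlall.py next to an empty __init__.py, and the command class must be named Command as above (this layout is the standard Scrapy custom-command convention rather than something shown explicitly in the post).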
- Start
scrapy crawlall
Once started, the crawl_url spider collects URLs and stores them in the Redis queue; the slaves pick up those URLs and start downloading. You can of course organise the distributed crawl in other ways as well.
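Concretely, assuming the master runs both spiders through the custom command and the slaves only run video_6969, the start-up looks something like this:
# on the master (runs crawl_urls and video_6969 via the crawlall command)
scrapy crawlall

# on each slave (only video_6969.py, items.py, pipelines.py and settings.py are needed)
scrapy crawl video_6969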
Note: when saving the movies, make sure you have read and write permission on that directory.
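A quick way to check this up front, reusing the dir_path from the pipeline (a sketch):
# verify that the download directory exists and is writable before starting the crawl
import os

dir_path = r'G:\Video_6969'
os.makedirs(dir_path, exist_ok=True)           # create it if it does not exist yet
print(os.access(dir_path, os.R_OK | os.W_OK))  # True means we can read and write there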