Preface: the title may sound a bit grandiose. The "full stack" here covers data crawling, a web site for display, and a mobile app. These posts mainly record the knowledge points I pick up while learning, as notes for future reference.
Tech stack:
1. Scrapy: the crawler framework; I'll note its workflow and how to write a simple spider.
2. Yii: for the PC site, the mobile site, and the RESTful API. (Why not stay with Python and use a framework such as Django or FastAPI? Mainly because I'm not familiar with them yet.)
3. Flutter: for building the mobile app.
Disclaimer: this project does not store any video resources on the server; it exists only as a record of my personal learning.
Database schema:
vod_detail stores the video metadata, while play_url stores the playback URLs for each video. Keeping the metadata and the playback URLs in separate tables feels more reasonable to me: a TV series, for example, can have many episode URLs for a single title. See the table definitions below for the meaning of each field.
vod_detail:
-- phpMyAdmin SQL Dump
-- version 4.8.5
-- https://www.phpmyadmin.net/
--
-- Host: localhost
-- Generation Time: 2020-09-09 10:33:32
-- Server version: 5.7.26
-- PHP Version: 7.3.4
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET AUTOCOMMIT = 0;
START TRANSACTION;
SET time_zone = "+00:00";
--
-- Database: `film`
--
-- --------------------------------------------------------
--
-- Table structure for table `vod_detail`
--
CREATE TABLE `vod_detail` (
  `id` int(11) NOT NULL,
  `url` varchar(500) NOT NULL COMMENT 'crawled URL',
  `url_id` varchar(100) NOT NULL COMMENT 'unique string generated by hashing the crawled URL',
  `vod_title` varchar(255) NOT NULL COMMENT 'video title',
  `vod_sub_title` varchar(255) DEFAULT NULL COMMENT 'alternate title',
  `vod_blurb` varchar(255) DEFAULT NULL COMMENT 'short synopsis',
  `vod_content` longtext COMMENT 'full description',
  `vod_status` int(11) DEFAULT '0' COMMENT 'status',
  `vod_type` varchar(255) DEFAULT NULL COMMENT 'video category',
  `vod_class` varchar(255) DEFAULT NULL COMMENT 'extended category',
  `vod_tag` varchar(255) DEFAULT NULL,
  `vod_pic_url` varchar(255) DEFAULT NULL COMMENT 'poster image URL',
  `vod_pic_path` varchar(255) DEFAULT NULL COMMENT 'local path of the downloaded image',
  `vod_pic_thumb` varchar(255) DEFAULT NULL,
  `vod_actor` varchar(255) DEFAULT NULL COMMENT 'actors',
  `vod_director` varchar(255) DEFAULT NULL COMMENT 'director',
  `vod_writer` varchar(255) DEFAULT NULL COMMENT 'screenwriter',
  `vod_remarks` varchar(255) DEFAULT NULL COMMENT 'release version/notes',
  `vod_pubdate` int(11) DEFAULT NULL,
  `vod_area` varchar(255) DEFAULT NULL COMMENT 'region',
  `vod_lang` varchar(255) DEFAULT NULL COMMENT 'language',
  `vod_year` varchar(255) DEFAULT NULL COMMENT 'year',
  `vod_hits` int(11) DEFAULT '0' COMMENT 'total views',
  `vod_hits_day` int(11) DEFAULT '0' COMMENT 'daily views',
  `vod_hits_week` int(11) DEFAULT '0' COMMENT 'weekly views',
  `vod_hits_month` int(11) DEFAULT '0' COMMENT 'monthly views',
  `vod_up` int(11) DEFAULT '0' COMMENT 'upvotes',
  `vod_down` int(11) DEFAULT '0' COMMENT 'downvotes',
  `vod_score` decimal(3,1) DEFAULT '0.0' COMMENT 'overall score',
  `vod_score_all` int(11) DEFAULT '0',
  `vod_score_num` int(11) DEFAULT '0',
  `vod_create_time` int(11) DEFAULT NULL COMMENT 'creation time',
  `vod_update_time` int(11) DEFAULT NULL COMMENT 'update time',
  `vod_lately_hit_time` int(11) DEFAULT NULL COMMENT 'last viewed time'
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4;
--
-- Indexes for dumped tables
--
--
-- Indexes for table `vod_detail`
--
ALTER TABLE `vod_detail`
  ADD PRIMARY KEY (`id`),
  ADD UNIQUE KEY `url_id` (`url_id`) COMMENT 'unique, prevents re-collecting URLs that were already crawled';
--
-- AUTO_INCREMENT for dumped tables
--
--
-- AUTO_INCREMENT for table `vod_detail`
--
ALTER TABLE `vod_detail`
  MODIFY `id` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
play_url:
-- phpMyAdmin SQL Dump
-- version 4.8.5
-- https://www.phpmyadmin.net/
--
-- Host: localhost
-- Generation Time: 2020-09-09 10:34:59
-- Server version: 5.7.26
-- PHP Version: 7.3.4
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET AUTOCOMMIT = 0;
START TRANSACTION;
SET time_zone = "+00:00";
--
-- Database: `film`
--
-- --------------------------------------------------------
--
-- Table structure for table `play_url`
--
CREATE TABLE `play_url` (
  `id` int(11) NOT NULL,
  `play_title` varchar(255) DEFAULT NULL,
  `play_from` varchar(255) DEFAULT NULL,
  `play_url` varchar(255) NOT NULL,
  `play_url_aes` varchar(100) NOT NULL COMMENT 'unique string generated from the URL',
  `url_id` varchar(100) NOT NULL COMMENT 'links to vod_detail.url_id',
  `create_time` int(11) DEFAULT NULL,
  `update_time` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4;
--
-- Indexes for dumped tables
--
--
-- Indexes for table `play_url`
--
ALTER TABLE `play_url`
  ADD PRIMARY KEY (`id`),
  ADD UNIQUE KEY `play_url_aes` (`play_url_aes`);
--
-- AUTO_INCREMENT for dumped tables
--
--
-- AUTO_INCREMENT for table `play_url`
--
ALTER TABLE `play_url`
  MODIFY `id` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
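The one-to-many relationship between the two tables (one vod_detail row, many play_url rows joined on url_id) can be sketched with an in-memory SQLite database; the table and column names follow the schema above, everything else (titles, URLs) is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vod_detail (id INTEGER PRIMARY KEY, url_id TEXT UNIQUE, vod_title TEXT);
    CREATE TABLE play_url  (id INTEGER PRIMARY KEY, url_id TEXT, play_title TEXT, play_url TEXT);
""")
conn.execute("INSERT INTO vod_detail (url_id, vod_title) VALUES ('abc123', 'Some Series')")
conn.executemany(
    "INSERT INTO play_url (url_id, play_title, play_url) VALUES (?, ?, ?)",
    [("abc123", "Episode 1", "https://example.com/ep1.m3u8"),
     ("abc123", "Episode 2", "https://example.com/ep2.m3u8")],
)
# one title joins to all of its episode URLs
rows = conn.execute("""
    SELECT v.vod_title, p.play_title, p.play_url
    FROM vod_detail v JOIN play_url p ON p.url_id = v.url_id
""").fetchall()
print(len(rows))  # 2 episodes for the single title
```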
Movie data crawler:
Environment: Python 3.8
Crawler framework: Scrapy
I'll skip the environment setup and configuration details.
The target here is a movie resource site whose whole purpose is to be scraped: no anti-crawler measures and a simple page structure, so the spider is correspondingly simple.
(1) Notes on common pitfalls when installing Scrapy and on configuring the crawler for debugging:
Install the virtual environment tool:
pip install virtualenv
1. Create a virtual environment:
cd into the folder where you keep virtual environments, then run
virtualenv pachong
2. Install the Scrapy framework:
Activate the virtual environment (in cmd or in PyCharm's terminal).
First install Scrapy's dependencies: lxml, Twisted, pywin32. It is safest to install these from offline packages beforehand.
Offline package downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/
3. Then install scrapy itself.
4. Create a Scrapy project (inside the virtual environment): scrapy startproject MoviesSpider
5. Create an okzy spider: scrapy genspider okzy okzy.co
6. Debugging the spider in PyCharm:
PyCharm cannot create a Scrapy project directly, so add a main.py in the project root with the following code:
import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'okzy'])
(2) Crawler directory layout:
Note: I won't go into Scrapy basics or its overall workflow here.
A. The film file in the models directory holds the model classes generated from the database with peewee. peewee works in both directions: it can generate models from an existing database, and it can create the corresponding tables from model classes. It is a lightweight ORM that lets us work with the database in an object-oriented way, so when inserting crawled data we don't have to hand-write error-prone raw SQL. If you know PHP's Yii, think of its scaffolding tool Gii. Docs: peewee documentation.
The retry_mySQLDatabase file in models wraps the MySQL connection with both a connection pool and automatic reconnection, to avoid insert errors when a connection has been open too long.
film.py:
from peewee import *

# database = MySQLDatabase('film', **{'charset': 'utf8', 'sql_mode': 'PIPES_AS_CONCAT', 'use_unicode': True, 'host': '127.0.0.1', 'port': 3306, 'user': 'root', 'password': 'root'})
from models.retry_mySQLDatabase import RetryMySQLDatabase

database = RetryMySQLDatabase.get_db_instance()


class UnknownField(object):
    def __init__(self, *_, **__): pass


class BaseModel(Model):
    class Meta:
        database = database


class PlayUrl(BaseModel):
    create_time = IntegerField(null=True)
    play_from = CharField(null=True)
    play_title = CharField(null=True)
    play_url = CharField()
    play_url_aes = CharField(unique=True)  # unique, matching the play_url table schema
    update_time = IntegerField(null=True)
    url_id = CharField()  # links to vod_detail.url_id; not unique here (many URLs per title)

    class Meta:
        table_name = 'play_url'


class VodDetail(BaseModel):
    url = CharField()
    url_id = CharField(unique=True)
    vod_actor = CharField(null=True)
    vod_area = CharField(null=True)
    vod_blurb = CharField(null=True)
    vod_class = CharField(null=True)
    vod_content = TextField(null=True)
    vod_create_time = IntegerField(null=True)
    vod_director = CharField(null=True)
    vod_down = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_hits = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_hits_day = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_hits_month = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_hits_week = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_lang = CharField(null=True)
    vod_lately_hit_time = IntegerField(null=True)
    vod_pic_path = CharField(null=True)
    vod_pic_thumb = CharField(null=True)
    vod_pic_url = CharField(null=True)
    vod_pubdate = IntegerField(null=True)
    vod_remarks = CharField(null=True)
    vod_score = DecimalField(constraints=[SQL("DEFAULT 0.0")], null=True)
    vod_score_all = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_score_num = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_status = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_sub_title = CharField(null=True)
    vod_tag = CharField(null=True)
    vod_title = CharField()
    vod_type = CharField(null=True)
    vod_up = IntegerField(constraints=[SQL("DEFAULT 0")], null=True)
    vod_update_time = IntegerField(null=True)
    vod_writer = CharField(null=True)
    vod_year = CharField(null=True)

    class Meta:
        table_name = 'vod_detail'


class VodTags(BaseModel):
    frequency = IntegerField(null=True)
    name = CharField()

    class Meta:
        table_name = 'vod_tags'


class VodType(BaseModel):
    type_des = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_en = CharField(constraints=[SQL("DEFAULT ''")], index=True, null=True)
    type_extend = TextField(null=True)
    type_jumpurl = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_key = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_logo = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_mid = IntegerField(constraints=[SQL("DEFAULT 1")], index=True, null=True)
    type_name = CharField(constraints=[SQL("DEFAULT ''")], index=True)
    type_pic = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_pid = IntegerField(constraints=[SQL("DEFAULT 0")], index=True, null=True)
    type_sort = IntegerField(constraints=[SQL("DEFAULT 0")], index=True, null=True)
    type_status = IntegerField(constraints=[SQL("DEFAULT 1")], null=True)
    type_title = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_tpl = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_tpl_detail = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_tpl_down = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_tpl_list = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_tpl_play = CharField(constraints=[SQL("DEFAULT ''")], null=True)
    type_union = CharField(constraints=[SQL("DEFAULT ''")], null=True)

    class Meta:
        table_name = 'vod_type'
retry_mySQLDatabase.py:
from playhouse.pool import PooledMySQLDatabase
from playhouse.shortcuts import ReconnectMixin

"""
Use a connection pool and automatic reconnection together.
"""


class RetryMySQLDatabase(ReconnectMixin, PooledMySQLDatabase):
    _instance = None

    @staticmethod
    def get_db_instance():
        if not RetryMySQLDatabase._instance:
            RetryMySQLDatabase._instance = RetryMySQLDatabase(
                'film',
                **{'charset': 'utf8', 'sql_mode': 'PIPES_AS_CONCAT', 'use_unicode': True,
                   'host': '127.0.0.1', 'port': 3306, 'user': 'root', 'password': 'root'}
            )
        return RetryMySQLDatabase._instance
B揍诽,MoviesSpider目錄是爬蟲主體目錄。spiders中是目標(biāo)站okzy爬蟲栗竖,upload中存放影視圖片,items渠啤、middlewares狐肢、pipelines、settings等同學(xué)們自行熟悉Scrapy工作原理和各個文件作用沥曹。
items.py文件:OkzyMoviesDetailspiderItem和OkzyMoviesspiderPlayurlItem分別對應(yīng)影視詳情和影片播放地址份名,都定義了個save_into_sql方法配合peewee生成的model類插入爬取到的數(shù)據(jù)到mysql。MoviesItemLoader是重寫ItemLoader主要是防止目標(biāo)網(wǎng)站有些數(shù)據(jù)不存在的出錯問題和數(shù)據(jù)清洗妓美。關(guān)于input_processor和output_processor如何處理爬取到的數(shù)據(jù)僵腺,及與之類似作用的優(yōu)先級問題可以參考:scrapy--Itemloader數(shù)據(jù)清洗--input_processor和output_processor比較
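Conceptually, an input processor such as MapCompose runs on every value as it is added to the loader, and the output processor such as TakeFirst runs once per field when load_item() is called. A stdlib-only sketch of that data flow (this mimics, but is not, the real itemloaders implementation):

```python
def map_compose(*funcs):
    """Input-processor style: apply each function to every element, flattening lists."""
    def process(values):
        for f in funcs:
            next_values = []
            for v in values:
                result = f(v)
                next_values.extend(result if isinstance(result, list) else [result])
            values = next_values
        return values
    return process


def take_first(values):
    """Output-processor style: return the first non-empty value, stripping strings."""
    for v in values:
        if v is not None and v != '':
            return v.strip() if isinstance(v, str) else v


clean = map_compose(str.strip)
print(take_first(clean(['   ', ' comedy ', 'action'])))  # comedy
```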
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Identity
from scrapy.loader import ItemLoader
from scrapy.loader.common import wrap_loader_context
from scrapy.utils.misc import arg_to_iter

from models.film import VodDetail, PlayUrl
from utils.common import date_convert


def MergeDict(dict1, dict2):
    # dict.update() returns None, so build and return a merged copy instead
    merged = dict(dict2)
    merged.update(dict1)
    return merged


class MapComposeCustom(MapCompose):
    # Custom MapCompose: when value has no elements, fall back to a single space
    def __call__(self, value, loader_context=None):
        if not value:
            value.append(" ")
        values = arg_to_iter(value)
        if loader_context:
            context = MergeDict(loader_context, self.default_loader_context)
        else:
            context = self.default_loader_context
        wrapped_funcs = [wrap_loader_context(f, context) for f in self.functions]
        for func in wrapped_funcs:
            next_values = []
            for v in values:
                next_values += arg_to_iter(func(v))
            values = next_values
        return values


class TakeFirstCustom(TakeFirst):
    """
    Handle fields whose elements are missing on the page.
    """
    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value.strip() if isinstance(value, str) else value


class MoviesItemLoader(ItemLoader):
    """
    Override ItemLoader: take the first element by default and tolerate missing elements.
    """
    default_output_processor = TakeFirstCustom()
    default_input_processor = MapComposeCustom()


class MoviesspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class OkzyMoviesDetailspiderItem(scrapy.Item):
    url = scrapy.Field()
    url_id = scrapy.Field()
    vod_title = scrapy.Field()
    vod_sub_title = scrapy.Field()
    vod_blurb = scrapy.Field()
    vod_content = scrapy.Field()
    vod_status = scrapy.Field()
    vod_type = scrapy.Field()
    vod_class = scrapy.Field()
    vod_tag = scrapy.Field()
    # takes precedence over default_output_processor: Scrapy's image pipeline
    # expects a list of URLs rather than a string, so keep the raw list
    vod_pic_url = scrapy.Field(output_processor=Identity())
    vod_pic_path = scrapy.Field()  # local path the downloaded image was saved to
    vod_pic_thumb = scrapy.Field()
    vod_actor = scrapy.Field()
    vod_director = scrapy.Field()
    vod_writer = scrapy.Field()
    vod_remarks = scrapy.Field()
    vod_pubdate = scrapy.Field()
    vod_area = scrapy.Field()
    vod_lang = scrapy.Field()
    vod_year = scrapy.Field()
    vod_hits = scrapy.Field()
    vod_hits_day = scrapy.Field()
    vod_hits_week = scrapy.Field()
    vod_hits_month = scrapy.Field()
    vod_up = scrapy.Field()
    vod_down = scrapy.Field()
    vod_score = scrapy.Field()
    vod_score_all = scrapy.Field()
    vod_score_num = scrapy.Field()
    vod_create_time = scrapy.Field(input_processor=MapCompose(date_convert))
    vod_update_time = scrapy.Field(input_processor=MapCompose(date_convert))
    vod_lately_hit_time = scrapy.Field()

    def save_into_sql(self):
        if not VodDetail.table_exists():
            VodDetail.create_table()
        vod_detail = VodDetail.get_or_none(VodDetail.url_id == self['url_id'])
        if vod_detail is not None:
            data = vod_detail
        else:
            data = VodDetail()
        try:
            data.url = self['url']
            data.url_id = self['url_id']
            data.vod_title = self['vod_title']
            data.vod_sub_title = self['vod_sub_title']
            # data.vod_blurb = self['vod_blurb']
            data.vod_content = self['vod_content']
            data.vod_status = 1
            data.vod_type = self['vod_type']
            data.vod_class = self['vod_class']
            # data.vod_tag = self['vod_tag']
            data.vod_pic_url = self['vod_pic_url'][0]
            data.vod_pic_path = self['vod_pic_path']
            # data.vod_pic_thumb = self['vod_pic_thumb']
            data.vod_actor = self['vod_actor']
            data.vod_director = self['vod_director']
            # data.vod_writer = self['vod_writer']
            data.vod_remarks = self['vod_remarks']
            # data.vod_pubdate = self['vod_pubdate']
            data.vod_area = self['vod_area']
            data.vod_lang = self['vod_lang']
            data.vod_year = self['vod_year']
            # data.vod_hits = self['vod_hits']
            # data.vod_hits_day = self['vod_hits_day']
            # data.vod_hits_week = self['vod_hits_week']
            # data.vod_hits_month = self['vod_hits_month']
            # data.vod_up = self['vod_up']
            # data.vod_down = self['vod_down']
            data.vod_score = self['vod_score']
            data.vod_score_all = self['vod_score_all']
            data.vod_score_num = self['vod_score_num']
            data.vod_create_time = self['vod_create_time']
            data.vod_update_time = self['vod_update_time']
            # data.vod_lately_hit_time = self['vod_lately_hit_time']
            row = data.save()
        except Exception as e:
            print(e)


class OkzyMoviesspiderPlayurlItem(scrapy.Item):
    play_title = scrapy.Field()
    play_from = scrapy.Field()
    play_url = scrapy.Field()
    play_url_aes = scrapy.Field()
    url_id = scrapy.Field()
    create_time = scrapy.Field(input_processor=MapCompose(date_convert))
    update_time = scrapy.Field(input_processor=MapCompose(date_convert))

    def save_into_sql(self):
        if not PlayUrl.table_exists():
            PlayUrl.create_table()
        play_url = PlayUrl.get_or_none(PlayUrl.play_url_aes == self['play_url_aes'])
        if play_url is not None:
            data = play_url
        else:
            data = PlayUrl()
        try:
            data.play_title = self['play_title']
            data.play_from = self['play_from']
            data.play_url = self['play_url']
            data.play_url_aes = self['play_url_aes']
            data.url_id = self['url_id']
            data.create_time = self['create_time']
            data.update_time = self['update_time']
            row = data.save()
        except Exception as e:
            print(e)
pipelines.py: MovieImagesPipeline overrides scrapy.pipelines.images.ImagesPipeline to feed the downloaded image path back into the item; MysqlPipeline inserts the crawled data into MySQL.
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import re

from scrapy import Request
# useful for handling different item types with a single interface
from scrapy.pipelines.images import ImagesPipeline

from utils.common import pinyin


class MoviesspiderPipeline:
    def process_item(self, item, spider):
        return item


# Overrides scrapy.pipelines.images.ImagesPipeline to pass the downloaded image path back to the item
class MovieImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        if "vod_pic_url" in item:
            for vod_pic_url in item['vod_pic_url']:
                # meta carries the item so file_path below can rename the file
                yield Request(url=vod_pic_url, meta={'item': item})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        movietitle = item['vod_title']
        # strip special characters; keep only CJK characters, letters, and digits
        sub_str = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", movietitle)
        img_guid = request.url.split('/')[-1]  # image file name with extension
        filename = '/upload/images/{0}/{1}'.format(pinyin(sub_str), img_guid)
        return filename
        # return super().file_path(request, response, info)

    # def thumb_path(self, request, thumb_id, response=None, info=None):
    #     item = request.meta['item']
    #     movietitle = pinyin(item['vod_title'][0])
    #     img_guid = request.url.split('/')[-1]  # image file name with extension
    #     filename = '/images/{0}/thumbs/{1}/{2}'.format(movietitle, thumb_id, img_guid)
    #     return filename

    def item_completed(self, results, item, info):
        image_file_path = ""
        if "vod_pic_url" in item:
            for ok, value in results:
                image_file_path = value["path"]
            item["vod_pic_path"] = image_file_path
        return item


class MysqlPipeline(object):
    def process_item(self, item, spider):
        """
        Every item implements save_into_sql(), so one MysqlPipeline can handle them all.
        :param item:
        :param spider:
        :return:
        """
        item.save_into_sql()
        return item
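The renaming in file_path above comes down to stripping everything except CJK characters, ASCII letters, and digits from the title before converting it to pinyin for the folder name. A quick stdlib check of that regex (the title here is just an example):

```python
import re


def sanitize_title(title):
    # keep only CJK ideographs (\u4e00-\u9fa5), digits, and ASCII letters
    return re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", title)


print(sanitize_title("流浪地球 (2019) HD!"))  # 流浪地球2019HD
```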
settings.py mainly configures which item field holds the images to download and where to store them.
# Scrapy settings for MoviesSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import os
current_dir = os.path.dirname(os.path.abspath(__file__))
BOT_NAME = 'MoviesSpider'
SPIDER_MODULES = ['MoviesSpider.spiders']
NEWSPIDER_MODULE = 'MoviesSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'MoviesSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'MoviesSpider.middlewares.MoviesspiderSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'MoviesSpider.middlewares.MoviesspiderDownloaderMiddleware': 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'MoviesSpider.pipelines.MoviesspiderPipeline': 300,
    # overridden scrapy.pipelines.images.ImagesPipeline that feeds the image path back to the item
    'MoviesSpider.pipelines.MovieImagesPipeline': 1,
    'MoviesSpider.pipelines.MysqlPipeline': 20,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# image storage path
IMAGES_STORE = os.path.join(current_dir, 'upload')
# item field holding the image URLs to download
IMAGES_URLS_FIELD = "vod_pic_url"
# to generate thumbnails, enable:
# IMAGES_THUMBS = {
#     'small': (80, 80),
#     'big': (200, 200),
# }
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
# RANDOMIZE_DOWNLOAD_DELAY = True
# DOWNLOAD_DELAY = 2
For the rest of Scrapy's image and file download configuration, see the docs. One point worth noting: because Scrapy's image/file pipeline expects a list of URLs rather than a string, the item declares vod_pic_url = scrapy.Field(output_processor=Identity()) to bypass the default output processor.
C. The okzy file in the spiders directory is the spider itself. The target site's structure is simple, so the spider is too: parse walks the listing pages and extracts detail-page URLs, while parse_detail parses each detail page for the movie metadata and playback URLs.
from urllib import parse

import scrapy
from scrapy import Request

from MoviesSpider.items import OkzyMoviesDetailspiderItem, OkzyMoviesspiderPlayurlItem, MoviesItemLoader
from utils import common


class OkzySpider(scrapy.Spider):
    name = 'okzy'
    allowed_domains = ['okzy.co']
    start_urls = ['https://okzy.co/?m=vod-index-pg-1.html']
    # start_urls = ['https://okzy.co/?m=vod-type-id-22-pg-1.html']

    def parse(self, response):
        all_urls = response.css(".xing_vb4 a::attr(href)").extract()
        for url in all_urls:
            yield Request(url=parse.urljoin(response.url, url), callback=self.parse_detail)
        list_urls = list(range(1, 3))
        list_urls.reverse()
        # for i in list_urls:
        #     url = 'https://okzy.co/?m=vod-index-pg-{0}.html'.format(i)
        #     yield Request(url=url, callback=self.parse)

    def parse_detail(self, response):
        voddetail_item_loader = MoviesItemLoader(item=OkzyMoviesDetailspiderItem(), response=response)
        voddetail_item_loader.add_value('url', response.url)
        voddetail_item_loader.add_value("url_id", common.get_md5(response.url))
        voddetail_item_loader.add_css('vod_title', 'h2::text')
        voddetail_item_loader.add_css('vod_sub_title',
                                      '.vodinfobox > ul:nth-child(1) > li:nth-child(1) > span:nth-child(1)::text')
        # voddetail_item_loader.add_xpath('vod_blurb', '//h2/text()')
        voddetail_item_loader.add_css('vod_content', 'div.ibox:nth-child(2) > div:nth-child(2)::text')
        # voddetail_item_loader.add_xpath('vod_status', '//h2/text()')
        voddetail_item_loader.add_css('vod_type',
                                      '.vodinfobox > ul:nth-child(1) > li:nth-child(4) > span:nth-child(1)::text')
        voddetail_item_loader.add_xpath('vod_class',
                                        '/html/body/div[5]/div[1]/div/div/div[2]/div[2]/ul/li[4]/span/a/text()')
        # voddetail_item_loader.add_xpath('tag', '//h2/text()')
        voddetail_item_loader.add_css('vod_pic_url', '.lazy::attr(src)')
        # voddetail_item_loader.add_xpath('vod_pic_thumb', '//h2/text()')
        voddetail_item_loader.add_css('vod_actor',
                                      '.vodinfobox > ul:nth-child(1) > li:nth-child(3) > span:nth-child(1)::text')
        voddetail_item_loader.add_css('vod_director',
                                      '.vodinfobox > ul:nth-child(1) > li:nth-child(2) > span:nth-child(1)::text')
        # voddetail_item_loader.add_xpath('vod_writer', '//h2/text()')
        voddetail_item_loader.add_css('vod_remarks', '.vodh > span:nth-child(2)::text')
        # voddetail_item_loader.add_xpath('vod_pubdate', '//h2/text()')
        voddetail_item_loader.add_css('vod_area', 'li.sm:nth-child(5) > span:nth-child(1)::text')
        voddetail_item_loader.add_css('vod_lang', 'li.sm:nth-child(6) > span:nth-child(1)::text')
        voddetail_item_loader.add_css('vod_year', 'li.sm:nth-child(7) > span:nth-child(1)::text')
        # voddetail_item_loader.add_xpath('vod_hits', '//h2/text()')
        # voddetail_item_loader.add_xpath('vod_hits_day', '//h2/text()')
        # voddetail_item_loader.add_xpath('vod_hits_week', '//h2/text()')
        # voddetail_item_loader.add_xpath('vod_hits_month', '//h2/text()')
        # voddetail_item_loader.add_xpath('vod_up', '//h2/text()')
        # voddetail_item_loader.add_xpath('vod_down', '//h2/text()')
        voddetail_item_loader.add_css('vod_score', '.vodh > label:nth-child(3)::text')
        voddetail_item_loader.add_css('vod_score_all', 'li.sm:nth-child(12) > span:nth-child(1)::text')
        voddetail_item_loader.add_css('vod_score_num', 'li.sm:nth-child(13) > span:nth-child(1)::text')
        voddetail_item_loader.add_css('vod_create_time', 'li.sm:nth-child(9) > span:nth-child(1)::text')
        voddetail_item_loader.add_css('vod_update_time', 'li.sm:nth-child(9) > span:nth-child(1)::text')
        # voddetail_item_loader.add_xpath('vod_lately_hit_time', '//h2/text()')
        okzyMoviesspiderItem = voddetail_item_loader.load_item()

        # parse m3u8-format playback URLs
        ckm3u8playurlList = response.xpath('//*[@id="2"]/ul/li/text()').extract()
        for ckm3u8playurlInfo in ckm3u8playurlList:
            m3u8playurlInfoList = ckm3u8playurlInfo.split('$')
            vodm3u8playurl_item_loader = MoviesItemLoader(item=OkzyMoviesspiderPlayurlItem(), response=response)
            vodm3u8playurl_item_loader.add_value('play_title', m3u8playurlInfoList[0])
            vodm3u8playurl_item_loader.add_value('play_url', m3u8playurlInfoList[1])
            vodm3u8playurl_item_loader.add_value('play_url_aes', common.get_md5(m3u8playurlInfoList[1]))
            vodm3u8playurl_item_loader.add_xpath('play_from', '//*[@id="2"]/h3/span/text()')
            vodm3u8playurl_item_loader.add_value("url_id", common.get_md5(response.url))
            vodm3u8playurl_item_loader.add_css('create_time', 'li.sm:nth-child(9) > span:nth-child(1)::text')
            vodm3u8playurl_item_loader.add_css('update_time', 'li.sm:nth-child(9) > span:nth-child(1)::text')
            okzyMoviesm3u8PlayurlspiderItem = vodm3u8playurl_item_loader.load_item()
            yield okzyMoviesm3u8PlayurlspiderItem

        # parse mp4-format playback URLs
        mp4playurlList = response.xpath('//*[@id="down_1"]/ul/li/text()').extract()
        for mp4playurlInfo in mp4playurlList:
            mp4playurlInfoList = mp4playurlInfo.split('$')
            vodmp4playurl_item_loader = MoviesItemLoader(item=OkzyMoviesspiderPlayurlItem(), response=response)
            vodmp4playurl_item_loader.add_value('play_title', mp4playurlInfoList[0])
            vodmp4playurl_item_loader.add_value('play_url', mp4playurlInfoList[1])
            vodmp4playurl_item_loader.add_value('play_url_aes', common.get_md5(mp4playurlInfoList[1]))
            vodmp4playurl_item_loader.add_xpath('play_from', '//*[@id="down_1"]/h3/span/text()')
            vodmp4playurl_item_loader.add_value("url_id", common.get_md5(response.url))
            vodmp4playurl_item_loader.add_css('create_time', 'li.sm:nth-child(9) > span:nth-child(1)::text')
            vodmp4playurl_item_loader.add_css('update_time', 'li.sm:nth-child(9) > span:nth-child(1)::text')
            okzyMoviesmp4PlayurlspiderItem = vodmp4playurl_item_loader.load_item()
            yield okzyMoviesmp4PlayurlspiderItem

        yield okzyMoviesspiderItem
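The utils.common helper module referenced by the spider and the items isn't shown in the post. The names get_md5 and date_convert come from the code above, but the bodies below are my own minimal sketch of what they plausibly do (the pinyin helper would additionally need a third-party library such as pypinyin, so it is omitted here):

```python
import hashlib
import time


def get_md5(url):
    """Hash a URL into the fixed-length unique string stored in url_id / play_url_aes."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


def date_convert(value):
    """Turn a page date string like '2020-09-09 10:33:32' into an int timestamp."""
    try:
        return int(time.mktime(time.strptime(value.strip(), "%Y-%m-%d %H:%M:%S")))
    except (ValueError, AttributeError):
        return 0


print(get_md5("https://okzy.co/"))  # 32-character hex digest
```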
On my machine, a test crawl collected over 60,000 movie detail records and 780,000 playback URLs; half a day was plenty of time to download the 60,000+ movie images.
Next up: using the Yii framework to quickly build the movie display website and a RESTful-style API.
Web site preview: