The Scrapy Framework
Architecture
- Scrapy Engine: the core of the framework; controls the data flow between all the other components.
- Scheduler: receives Requests from the engine, queues and orders them, and hands them back to the engine when the engine asks for them.
- Downloader: downloads every Request sent over by the engine and passes the resulting Responses back to the engine, which forwards them to the Spider for processing.
- Spider: processes all Responses, extracts the data needed to fill Item fields, and submits any follow-up URLs to the engine, which feeds them back into the Scheduler.
- Item Pipeline: handles the Items produced by the Spider and performs the post-processing (cleaning, validation, storage).
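The components cooperate in a loop: a Spider yields Requests, which the engine hands to the Scheduler, and yields Items, which the engine hands to the Item Pipeline. A minimal sketch of that loop from the Spider's point of view, using the quotes.toscrape.com practice site as a hypothetical example (not part of the original notes):

import scrapy


class FlowDemoSpider(scrapy.Spider):
    # illustrative only: shows where Requests and Items go in the architecture above
    name = 'flow_demo'
    start_urls = ['http://quotes.toscrape.com/']  # initial Requests, queued by the Scheduler

    def parse(self, response):
        # the Downloader fetched this Response; the engine passed it to the Spider
        for quote in response.xpath("//div[@class='quote']"):
            # yielding an item sends it through the engine to the Item Pipeline
            yield {'text': quote.xpath("./span[@class='text']/text()").extract_first()}
        next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
        if next_page:
            # yielding a Request sends it back to the Scheduler for a later download
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)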
1. Installation
Windows
pip install --upgrade pip
pip install twisted
pip install lxml
pip install pywin32
pip install Scrapy
Ubuntu
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
pip install --upgrade pip
sudo pip install scrapy
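To confirm that the installation succeeded, check the installed version:
scrapy version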
2袍啡、操作步驟
1踩官、scrapy startproject JobSpider
2、cd JobSpider cd spider
3境输、scrapy genspider 爬蟲程序名 域名
4蔗牡、編寫items.py文件
import scrapy


class DoubanmoviesItem(scrapy.Item):
    name = scrapy.Field()
    score = scrapy.Field()
    intro = scrapy.Field()
    info = scrapy.Field()
5颖系、編寫爬蟲文件
import time
import random

import scrapy
from ..items import DoubanmoviesItem


class RunmoviesSpider(scrapy.Spider):
    name = 'runMovies'
    allowed_domains = ['movie.douban.com']
    # entry point for the crawl (same Top 250 URL that the pagination below builds on)
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        content_list = response.xpath("//div[@class='article']/ol/li/div[@class='item']/div[@class='info']")
        next_link = response.xpath("//div[@class='paginator']/span/a/@href").extract()
        print(next_link)
        for content in content_list:
            item = DoubanmoviesItem()
            name = content.xpath(".//div[@class='hd']/a/span[1]/text()").extract_first(default='')
            score = content.xpath(".//div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract_first(default='').strip()
            intro = content.xpath(".//div[@class='bd']/p[1]/text()").extract_first(default='').strip()
            info = content.xpath(".//div[@class='bd']/p[@class='quote']/span[@class='inq']/text()").extract_first(default='').strip()
            if name and score and intro and info:
                print(name, score, intro, info)
                item["name"] = name
                item["score"] = score
                item["intro"] = intro
                item["info"] = info
                yield item
        # small random pause before requesting the next page
        time.sleep(random.randint(0, 2))
        if next_link:
            url = "https://movie.douban.com/top250" + next_link[-1]
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)
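Run the spider from the project root by its name (runMovies above); the optional -o flag also exports the collected items to a file:
scrapy crawl runMovies
scrapy crawl runMovies -o movies.json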
6. Write the pipelines.py file
import pymysql


class DoubanmoviesPipeline(object):
    def __init__(self):
        # connect to the MySQL database
        self.my_conn = pymysql.connect(
            host='localhost',
            port=3306,
            database='douban',
            user='root',
            password='',
            charset='utf8',
        )
        self.my_cursor = self.my_conn.cursor()

    def process_item(self, item, spider):
        insert_sql = "insert into movies(`name`,`score`,`intro`,`info`) values(%s,%s,%s,%s)"
        print(item["name"], item["score"], item["intro"], item["info"])
        self.my_cursor.execute(insert_sql, [item["name"], item["score"], item["intro"], item["info"]])
        self.my_conn.commit()
        return item

    def close_spider(self, spider):
        # called once when the spider closes; release the connection
        self.my_cursor.close()
        self.my_conn.close()
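The pipeline assumes the douban database and the movies table already exist. A one-off setup sketch is shown below; the column types are assumptions, only the table and column names come from the insert statement above:

import pymysql

# one-off setup script; credentials match the pipeline above
conn = pymysql.connect(host='localhost', port=3306, user='root', password='', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS douban DEFAULT CHARACTER SET utf8")
    cursor.execute("USE douban")
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        " id INT AUTO_INCREMENT PRIMARY KEY,"
        " `name` VARCHAR(255),"
        " `score` VARCHAR(16),"
        " `intro` TEXT,"
        " `info` VARCHAR(255))"
    )
conn.commit()
conn.close()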
7辩越、設置settings.py 文件
- 設置請求頭
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Accept-Language': 'en',
}
- Enable the item pipeline
ITEM_PIPELINES = {
    'doubanMovies.pipelines.DoubanmoviesPipeline': 300,
}
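Depending on the Scrapy version and how aggressively the target site throttles crawlers, two more settings are often adjusted in the same file; these are optional additions, not part of the original steps:

# do not fetch or obey robots.txt (newer project templates enable it by default)
ROBOTSTXT_OBEY = False
# add a small delay between requests to stay polite
DOWNLOAD_DELAY = 1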
3嘁扼、細節(jié)
1、Item pipeline
可以通過管道處理爬起的數(shù)據(jù)区匣,在pipelines.py 文件中對傳輸過來的數(shù)據(jù)進行篩選
from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                # add VAT to prices that were scraped without it
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem('Missing price in %s' % item)
Write items to a JSON Lines file:
import json


class JsonWriterPipeline(object):
    def __init__(self):
        # text mode: json.dumps returns a str in Python 3
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Deduplication: assume each item carries an id; if the id has already been seen the item is dropped, otherwise it is recorded and passed on.
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found: %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
2. Link Extractor
Import
from scrapy.linkextractors import LinkExtractor
Parameters
- allow (a regular expression (or list of))
- deny (a regular expression (or list of))
- allow_domains(str or list)
- deny_domains(str or list)
- restrict_xpaths(str or list)
- attrs(list)
- unique(boolean)
- process_value(callable)
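A brief sketch of how these parameters are typically combined inside a CrawlSpider rule; the site, URL pattern, and XPath below are made-up placeholders:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LinkDemoSpider(CrawlSpider):
    name = 'link_extractor_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = [
        Rule(
            LinkExtractor(
                allow=(r'/category/\d+',),                 # URL must match this regex
                deny=(r'/login',),                          # URL must not match this regex
                restrict_xpaths=("//div[@id='content']",),  # only extract links from this region
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        yield {'url': response.url}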
3姑丑、Logging
通過scrapy.log 模塊使用蛤签,必須通過顯示調用scrapy.log.start()來開啟
- CRITICAL - 嚴重錯誤
- ERROR - 一般錯誤
- WARRING - 警告錯誤
- INFO - 一般信息
- DEBUG - 調試信息
The scrapy.log module
Start logging:
scrapy.log.start(logfile=None, loglevel=None, logstdout=None)
Log a message:
scrapy.log.msg(message, level=INFO, spider=None)
scrapy.log.CRITICAL
scrapy.log.ERROR
scrapy.log.WARNING
scrapy.log.INFO
scrapy.log.DEBUG
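Put together, a minimal sketch using the old scrapy.log API exactly as documented above (file name and message are placeholders):

from scrapy import log

# start logging to a file at INFO level
log.start(logfile='scrapy.log', loglevel=log.INFO)
# record a message
log.msg('Spider started', level=log.INFO)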
Logging can also be configured through the following settings in settings.py:
1. LOG_ENABLED, default: True; enables logging
2. LOG_ENCODING, default: 'utf-8'; encoding used for logging
3. LOG_FILE, default: None; file name for the logging output, created in the current directory
4. LOG_LEVEL, default: 'DEBUG'; minimum level to log
5. LOG_STDOUT, default: False; if True, all standard output (and errors) of the process is redirected to the log, so for example print("hello") will show up in the Scrapy log
4谈火、email
python可以通過smtplib庫發(fā)送email侈询,scrapy提供了自己的實現(xiàn)。采用了Twisted非阻塞式IO糯耍,其避免了對爬蟲的非阻塞式IO的影響扔字。
from scrapy.mail import MailSender
mailer = MailSender()
Alternatively, pass a Scrapy settings object and the MAIL_* settings will be honored:
mailer = MailSender.from_settings(settings)
mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])
The MailSender class
class scrapy.mail.MailSender(smtphost=None, mailfrom=None, smtpuser=None, smtppass=None, smtpport=None)
Parameters:
- smtphost (str) – the SMTP host to use for sending the emails. If omitted, the MAIL_HOST setting is used.
- mailfrom (str) – the address used to send emails (filled into From:). If omitted, the MAIL_FROM setting is used.
- smtpuser – the SMTP user. If omitted, the MAIL_USER setting is used; if not given at all, no SMTP authentication is performed.
- smtppass (str) – the SMTP password used for authentication.
- smtpport (int) – the SMTP port to connect to.
- smtptls – enforce using SMTP STARTTLS.
- smtpssl (boolean) – enforce using a secure SSL connection.
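When MailSender.from_settings(settings) is used, the corresponding values come from settings.py; a sketch with placeholder credentials:

MAIL_HOST = 'smtp.example.com'   # placeholder SMTP server
MAIL_PORT = 25
MAIL_FROM = 'scrapy@example.com'
MAIL_USER = 'scrapy'
MAIL_PASS = 'secret'             # placeholder password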
4嘉蕾、scrapy框架爬取圖片
1贺奠、編寫Item文件
import scrapy


class MyItem(scrapy.Item):
    # ... other item fields ...
    image_urls = scrapy.Field()
    images = scrapy.Field()
2. Enable the images pipeline (it requires the Pillow imaging library: pip install Pillow)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
3错忱、設置圖片存儲信息
#圖片存儲位置
IMAGES_STORE = '/path/to/valid/dir'
# 90天的圖片失效期限
IMAGES_EXPIRES = 90
#縮略圖信息
IMAGES_THUMBS = {
'small': (50, 50),
'big': (270, 270),
}
4儡率、編寫爬蟲文件
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import TuxingspiderItem


class TuxingSpider(CrawlSpider):
    name = 'tuxing'
    allowed_domains = ['so.photophoto.cn']
    start_urls = ['http://so.photophoto.cn/tag/%E6%B5%B7%E6%8A%A5']

    # link extractor for the next-page link
    link_next_page = LinkExtractor(restrict_xpaths=("//div[@id='page']/a[@class='pagenexton']",))

    rules = [
        Rule(link_next_page, callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        img_url_list = response.xpath("//ul[@id='list']/li/div[@class='libg']")
        for img_url in img_url_list:
            item = TuxingspiderItem()
            name = img_url.xpath(".//div[@class='text']/div[@class='text2']/a/text()").extract_first()
            url = img_url.xpath(".//div[@class='image']/a/img/@src").extract_first()
            item["name"] = name
            # ImagesPipeline expects a list of URLs in the image_urls field
            item["image_urls"] = [url]
            yield item
5. Write the pipelines file
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule one download request per image URL
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # keep only the paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
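For this custom pipeline to run (and for item['image_paths'] to be stored), it must be enabled in settings.py in place of the built-in ImagesPipeline, and the item class needs an image_paths field. The project/module path below is an assumption based on this example, not taken from the original:

# settings.py
ITEM_PIPELINES = {'tuxingSpider.pipelines.MyImagesPipeline': 1}  # assumed module path
IMAGES_STORE = '/path/to/valid/dir'

# items.py
import scrapy

class TuxingspiderItem(scrapy.Item):
    name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()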