1. You can run Scrapy from a script using its API, instead of launching it the typical way with the scrapy crawl command. Scrapy is built on the Twisted asynchronous networking library, so it must be run inside the Twisted reactor. Two APIs are available for running one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.
2. The first utility for launching spiders is scrapy.crawler.CrawlerProcess. This class starts the Twisted reactor for you, configures logging, and sets up shutdown handlers; it is the class used by all Scrapy commands.
Example of running a single spider:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
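To make this concrete, here is a minimal self-contained sketch with the spider definition filled in; the spider name, start URL, and CSS selectors are illustrative assumptions, not part of the original example.

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider for illustration; name and start_urls are assumptions.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item per quote block found on the page.
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()  # blocks here until the crawl is finished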
You can also pass settings to CrawlerProcess, using get_project_settings to obtain a Settings instance populated with your project's settings:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
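A Settings instance obtained this way can also be adjusted before being passed in. As a sketch (reusing the same 'followall' spider from the snippet above), Settings.set lets you override an individual project setting for this run only:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('LOG_LEVEL', 'INFO')  # override one project setting for this run

process = CrawlerProcess(settings)
process.crawl('followall', domain='scrapinghub.com')
process.start()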
There is another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class encapsulates some simple helpers for running multiple crawlers, but it will not start or interfere with existing reactors in any way.
When using this class, the reactor must be run explicitly. If a crawl is already running and you want to start another one in the same process, CrawlerRunner is recommended over CrawlerProcess.
Note that after the spider finishes, you have to shut down the Twisted reactor yourself. This can be done by adding a callback to the deferred returned by the CrawlerRunner.crawl method.
Here is an example of its usage, with a callback that manually stops the reactor after MySpider has finished running.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
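Because runner.crawl returns a Twisted deferred, you can also attach an error handler before the callback that stops the reactor. A sketch, reusing MySpider from the example above, that prints a traceback if the crawl fails:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

d = runner.crawl(MySpider)  # MySpider as defined in the example above
d.addErrback(lambda failure: failure.printTraceback())  # report crawl errors
d.addBoth(lambda _: reactor.stop())  # always stop the reactor afterwards
reactor.run()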
Running multiple spiders in the same process
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
The same example using CrawlerRunner:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...
configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
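Note that runner.join() returns a deferred that fires only after all crawls started through the runner have completed, which is why a single addBoth callback is enough to stop the reactor once both spiders are done.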
The same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
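The same chaining pattern generalizes to any number of spiders. A sketch (assuming the MySpider1 and MySpider2 classes from the example above) that loops over a list of spider classes:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Run each spider to completion before starting the next one.
    for spider_cls in (MySpider1, MySpider2):  # classes from the example above
        yield runner.crawl(spider_cls)
    reactor.stop()

crawl()
reactor.run()  # blocks here until the last crawl finishes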