This post combines Scrapy, Selenium, and Headless Chrome to crawl pages that need JavaScript rendering, using the JD search results for mobile phones as the example.
Page analysis
For the phone keyword there are 100 result pages in total. The animated screenshot shows that a page is not loaded in one go: new search results only appear once you scroll the mouse wheel down far enough, which means they are rendered by JavaScript.
We can simulate scrolling to the bottom with Selenium's execute_script("window.scrollTo(0, document.body.scrollHeight);").
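As a minimal standalone sketch of that idea (the chromedriver path and the repeat-until-the-height-stops-growing loop are my assumptions, not part of the project code), we can keep scrolling until the page height no longer changes:

# scroll_sketch.py -- standalone sketch, not part of the project code
from selenium import webdriver
import time

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
browser = webdriver.Chrome(chrome_options=chrome_options,
                           executable_path='/usr/local/bin/chromedriver')  # path is an assumption
browser.get('https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8')
last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the lazily loaded results time to render
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # height stopped growing, so we reached the bottom
        break
    last_height = new_height
print(len(browser.find_elements_by_css_selector('div.p-name a')))  # how many result links were loaded
browser.quit()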
Looking at the page again, when we click "next page" and land on displayed page 2, the page value in the URL is 3; clicking again to reach displayed page 3, the page value is 5. So the URL parameter and the displayed page are related by page = 2 * real_page - 1, which lets us build the URLs for every results page.
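As a quick check of that mapping, a small sketch can print the URL for each displayed page (total_page here is just a placeholder; the spider below reads the real total from the first results page):

# url_pattern_sketch.py -- worked example of the page mapping, not part of the project code
search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
total_page = 100  # placeholder; the spider reads the real total from the first page
for real_page in range(1, total_page + 1):
    print(search_page_url_pattern.format(page=2 * real_page - 1))
# real_page=1 -> page=1, real_page=2 -> page=3, real_page=3 -> page=5, ...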
Implementation
Only the key source files are shown here; settings.py and the other files are omitted. The full project is on my Github.
# search.py
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
import time


class SearchSpider(scrapy.Spider):
    name = 'search'
    search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
    start_urls = ['https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8']

    def __init__(self):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path='/usr/local/bin/chromedriver')
        super(SearchSpider, self).__init__()

    def closed(self, reason):
        self.browser.close()  # remember to close the browser

    def parse(self, response):
        total_page = response.css('span.p-skip em b::text').extract_first()
        if total_page:
            for i in range(int(total_page)):
                next_page_url = self.search_page_url_pattern.format(page=2 * i + 1)
                yield scrapy.Request(next_page_url, callback=self.parse_page)
                time.sleep(1)

    def parse_page(self, response):
        phone_info_list = response.css('div.p-name a')
        for item in phone_info_list:
            phone_name = item.css('a::attr(title)').extract_first()
            phone_href = item.css('a::attr(href)').extract_first()
            yield dict(name=phone_name, href=phone_href)
The webdriver is created on the spider itself, so we avoid opening a fresh browser for every request. In closed() we make sure the browser gets closed. In parse() we first read the total number of result pages, then generate the remaining URLs from the pattern above and keep crawling. In parse_page() we extract the fields we want according to the page structure; nothing more to it.
#middlewares.py
from scrapy import signals
from scrapy.http import HtmlResponse
import time


class JdDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        spider.browser.get(request.url)
        for i in range(5):
            spider.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)  # give the lazy-loaded results time to render before scrolling again
        return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                            encoding='utf8', request=request)

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Here we rely on how downloader middlewares work: in process_request() we use the webdriver to simulate the scrolling and grab the fully rendered page source, then return an HtmlResponse directly. By Scrapy's rules, once process_request() returns a Response object the remaining downloader middlewares (and the actual download) are skipped, and the response is returned straight away.
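For this middleware to run it also has to be enabled in settings.py. A minimal sketch, assuming the Scrapy project is named jd so the class lives in jd.middlewares; adjust the module path to your own project:

# settings.py -- minimal sketch; the module path "jd.middlewares" is an assumption
DOWNLOADER_MIDDLEWARES = {
    'jd.middlewares.JdDownloaderMiddleware': 543,
}
# A single shared headless browser cannot serve concurrent requests safely,
# so keeping concurrency at 1 is the safer choice here.
CONCURRENT_REQUESTS = 1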
Run scrapy crawl search -o result.csv --nolog and the crawled results are written to result.csv.
Summary
This post walked through using Selenium and Headless Chrome together with Scrapy to crawl dynamically rendered pages; with this approach, pages that need JavaScript rendering are no longer a problem.
With dynamic pages handled, the next issue is scale: in the following post we will look at distributed crawling with scrapy-redis.