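Sometimes you run into this situation: every request carries cookies and headers, but no matter how you capture the traffic you cannot see where they come from. Neither scrapy nor requests can execute JavaScript, so they can only crawl static pages. scrapy-splash can crawl dynamic pages, but you have to run a separate service for it. In this case I decided to use selenium, which supports Chrome, Firefox, and other browsers.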
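import urllib.parse
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

def __init__(self):
    chrome_options = Options()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--hide-scrollbars')
    # Do not show the browser window
    # chrome_options.add_argument('--headless')
    self.browser = webdriver.Chrome(executable_path='/opt/webdriver/chrome/chromedriver',
                                    chrome_options=chrome_options)
    self.browser.set_page_load_timeout(30)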
# Override the start_requests method
def start_requests(self):
    cookies = self.convert_cookies(self.get_cookies())
    for form_data in self.form_data_list:
        yield scrapy.FormRequest(self.start_url, method="POST", cookies=cookies, formdata=form_data,
                                 dont_filter=True)
# Get the cookies through webdriver
def get_cookies(self):
    self.browser.get(self.cookies_url)
    cookies = []
    try:
        # Wait (up to 100 seconds) until the search button is clickable
        WebDriverWait(self.browser, 100).until(
            expected_conditions.element_to_be_clickable((By.XPATH, "//a[@class='searchbutton']")))
        cookies = self.browser.get_cookies()
    except Exception as e:
        self.logger.info("Failed to get cookies: %s", e)
    finally:
        # Close the browser
        self.browser.quit()
    return cookies
# Convert the webdriver cookie list into a {name: value} dict
def convert_cookies(self, cookies):
    newcookies = {}
    for cookie in cookies:
        newcookies[cookie['name']] = cookie['value']
    return newcookies
# Convert form data to a dict
def fromData2Dict(self, formData):
    # urlencode turns spaces into +, so convert them back here
    params = urllib.parse.unquote(formData).replace('+', ' ').split("&")
    form_data = {}
    for param in params:
        # Split on the first "=" only, so values may contain "="
        key, value = param.split("=", 1)
        form_data[key] = value
    return form_data
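The two helper methods above can also be sketched as standalone functions, which makes them easy to try without a browser. This is only a sketch under assumptions: the function names `cookies_to_dict` and `form_data_to_dict` and the sample data are hypothetical, and the form parser swaps the manual `unquote`/`split` approach for the standard library's `urllib.parse.parse_qsl`. Unlike decoding the whole string first, `parse_qsl` splits on `&` and `=` before decoding, so an encoded `%26` inside a value is not mistaken for a separator.

```python
from urllib.parse import parse_qsl


def cookies_to_dict(cookies):
    """Collapse selenium-style cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def form_data_to_dict(form_data):
    """Parse an application/x-www-form-urlencoded string into a dict.

    parse_qsl splits on & and = before decoding, so an encoded %26 or %3D
    inside a value survives, and + is decoded to a space.
    """
    return dict(parse_qsl(form_data, keep_blank_values=True))


# Hypothetical sample data, not taken from the original site.
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "token", "value": "xyz", "domain": "example.com", "path": "/"},
]
sample_form = "keyword=web+driver&page=1&note=a%26b&empty="

cookie_dict = cookies_to_dict(sample_cookies)
form_dict = form_data_to_dict(sample_form)
```

The resulting `cookie_dict` has the shape that the `cookies=` argument of `scrapy.FormRequest` is given in `start_requests`, and `form_dict` has the shape given to `formdata=`.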
Enable headless mode so that no window is shown (problem encountered: page elements could no longer be found):
chrome_options.add_argument('--headless')
Disable the sandbox:
chrome_options.add_argument('--no-sandbox')
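Summary of the problems encountered:
1. Everything ran fine on macOS, but on Linux it kept failing with an error saying the DevToolsActivePort file could not be found. Many blog posts, both Chinese and foreign, recommend disabling the sandbox, but that made no difference.
For example: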
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-setuid-sandbox')
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
(Driver info: chromedriver=2.45.615279 (12b89733300bd268cff3b78fc76cb8f3a7cc44e5),platform=Linux 3.10.0-327.el7.x86_64 x86_64)
With headless mode added it did run, but the page element could not be found:
2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
[2019-01-08 16:43:00] 140734813173184 POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
[2019-01-08 16:43:00] 140734813173184 http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358
2019-01-08 16:43:00 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358
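While reading other people's blogs I realized that the Linux server has no graphical interface, and I came across Xvfb: Xvfb performs all graphical operations in memory and does not need any display device. So I tried installing it to see whether it would solve the problem: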
yum install Xvfb
The same error still appeared, so I decided to try downgrading Chrome. First I checked the Linux version information:
[root@localhost google]# uname -a
Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
I uninstalled the current google-chrome (version: google-chrome-stable-71.0.3578.98) and installed an older google-chrome (version: google-chrome-stable-62.0.3202.94). The chromedriver version was changed from 2.45.615279 to 2.33.506092.
In the end it still failed:
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(Driver info: chromedriver=2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4),platform=Linux 3.10.0-327.el7.x86_64 x86_64)
However, the error was different from the earlier one, so it felt like one step closer to success.
After more searching, I installed pyvirtualdisplay:
pip install pyvirtualdisplay
Use it in the code:
from pyvirtualdisplay import Display
from selenium import webdriver
# Start the virtual display before launching Chrome
display = Display(visible=0, size=(800, 800))
display.start()
driver = webdriver.Chrome()
Persistence paid off: the code finally ran perfectly.
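Since the same spider has to run both on a Mac with a real screen and on a Linux server without one, the virtual display can be started only when it is needed. The following is a minimal sketch under assumptions: the helper names `needs_virtual_display` and `start_virtual_display` are hypothetical, and treating an unset or empty `DISPLAY` variable on Linux as "no screen available" is a heuristic, not something pyvirtualdisplay prescribes. The `Display(visible=0, size=(800, 800))` call is the one used above.

```python
import os
import sys


def needs_virtual_display(platform=None, environ=None):
    """Return True on Linux when no X display is configured (DISPLAY unset or empty)."""
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    return platform.startswith("linux") and not environ.get("DISPLAY")


def start_virtual_display(size=(800, 800)):
    """Start an Xvfb-backed display if one is needed; return it, or None.

    pyvirtualdisplay is imported lazily so this module still loads on
    machines (such as a Mac) where it is not installed.
    """
    if not needs_virtual_display():
        return None
    from pyvirtualdisplay import Display

    display = Display(visible=0, size=size)
    display.start()
    return display
```

A caller would run `display = start_virtual_display()` before `webdriver.Chrome()`, and after `browser.quit()` call `display.stop()` if `display` is not `None`, so that the Xvfb process does not linger.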
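2. Defining the spider's initializer in scrapy: in my local Python 3.7 environment, defining it simply as __init__(self) worked, but the Linux Python 3.6 environment raised an error, even though both environments supposedly use scrapy version 3.5.1. The form that works on Linux with Python 3.6: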
def __init__(self, *args, **kwargs):
    super(SpdSpider, self).__init__(*args, **kwargs)
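The post does not record the actual error message, so the cause can only be guessed at. One plausible explanation, stated here as an assumption, is that Scrapy builds spiders through `from_crawler`, which calls `cls(*args, **kwargs)`; an `__init__(self)` that accepts nothing then fails as soon as any argument is passed, and it also skips the base class initializer. The sketch below reproduces only that mechanism in plain Python. `BaseSpider` is a hypothetical stand-in for `scrapy.Spider`, and the two subclasses are illustrative names.

```python
class BaseSpider:
    """Hypothetical stand-in for scrapy.Spider, reduced to argument handling."""

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Keyword arguments become attributes on the spider instance.
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, *args, **kwargs):
        # The framework, not user code, decides which arguments are passed.
        return cls(*args, **kwargs)


class NoArgsSpider(BaseSpider):
    name = "spd"

    def __init__(self):  # accepts nothing and skips the base initializer
        self.form_data_list = []


class ForwardingSpider(BaseSpider):
    name = "spd"

    def __init__(self, *args, **kwargs):
        # Forward everything to the base class, then add spider-specific state.
        super().__init__(*args, **kwargs)
        self.form_data_list = []


spider = ForwardingSpider.from_crawler(category="books")

try:
    NoArgsSpider.from_crawler(category="books")
    no_args_error = None
except TypeError as exc:
    no_args_error = exc
```

Forwarding `*args` and `**kwargs` keeps the spider working regardless of how it is launched, which may explain why the same code behaved differently in the two environments.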
Reference documents