I previously crawled Tianyancha (tyc) outbound-investment data with a single-threaded selenium + PhantomJS script. It was painfully slow, roughly 30-60s per company, but the truly maddening part was that the crawler would hang at some step shortly after starting and the program would stall; I never managed to locate the bug, so I was stuck in an endless cycle of start the crawler, interrupt it by hand, restart it. Having just learned Scrapy, I decided to rewrite the crawler with scrapy + selenium + PhantomJS.
Here is the code:
```python
# coding:utf-8
import time

import xlrd
import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

from tyc.items import TycItem


class TycSpider(scrapy.Spider):
    name = 'tyc'
    allowed_domains = ['tianyancha.com']

    # Read the company names to search for from the first column
    # of a local Excel file.
    fname = "C:\\Users\\Administrator\\Desktop\\test.xlsx"
    workbook = xlrd.open_workbook(fname)
    sheet = workbook.sheet_by_name('Sheet1')
    cols = sheet.col_values(0)

    # URLs to crawl: one search page per company name
    start_urls = ['http://www.tianyancha.com/search?key={}&checkFrom=searchBox'.format(col)
                  for col in cols]

    def make_browser(self):
        # Drive PhantomJS as a headless browser, but still set a real
        # user agent in the headers; without it the page source comes
        # back incomplete.
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Mobile Safari/537.36"
        )
        return webdriver.PhantomJS(desired_capabilities=dcap)

    def parse(self, response):
        # Render the search results page and pull out the company's detail URL.
        browser = self.make_browser()
        browser.get(response.url)
        time.sleep(4)  # crude wait for the JS-rendered results to load
        try:
            url = browser.find_element_by_class_name('query_name').get_attribute('href')
            self.logger.info('Found company page: %s', url)
            yield Request(url=url, callback=self.parse_detail)
        except Exception:
            self.logger.info('No company found for this query!')
        finally:
            browser.quit()  # always release the browser, even on failure

    def parse_detail(self, response):
        # Scrape the company's outbound investment records.
        browser = self.make_browser()
        browser.get(response.url)
        self.logger.info('url %s', response.url)
        time.sleep(3)
        soup = BeautifulSoup(browser.page_source, 'lxml')
        browser.quit()

        name = soup.select('.base-company')[0].text.split(' ')[0]
        self.logger.info('Company name: %s', name)
        try:
            investments = soup.select('#nav-main-outInvestment .m-plele')
            self.logger.info('%d investment rows found', len(investments))
            for row in investments:
                cells = row.select('div')
                item = TycItem()  # fresh item per row, so yielded items don't share state
                item['company'] = name
                item['enterprise_name'] = cells[0].text
                item['legal_person_name'] = cells[2].text
                item['industry'] = cells[3].text
                item['status'] = cells[4].text
                item['reg_captial'] = cells[5].text  # field name as declared in tyc/items.py
                yield item
        except Exception:
            self.logger.info('This company has no outbound investments!')
```
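For completeness, here is a minimal sketch of what `tyc/items.py` would need to declare for the fields the spider fills in. The original post doesn't show this file, so this is an assumed definition reconstructed from the field names used above:

```python
# tyc/items.py -- assumed item definition, reconstructed from the spider's fields
import scrapy


class TycItem(scrapy.Item):
    company = scrapy.Field()            # the company being searched
    enterprise_name = scrapy.Field()    # name of the invested company
    legal_person_name = scrapy.Field()  # legal representative
    industry = scrapy.Field()
    status = scrapy.Field()             # registration status
    reg_captial = scrapy.Field()        # registered capital (spelling kept to match the spider)
```

With the project laid out this way, the spider runs with the usual `scrapy crawl tyc`.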
A few things to note:
- Even though selenium is simulating a browser, you still have to set the user agent in the headers; without it, the returned page source is incomplete. This is what the `DesiredCapabilities` block in the spider does.
- The speed has improved somewhat, but for crawling data at scale you still want distributed crawling with scrapy-redis, or deployment via scrapyd; see the configuration sketch below.
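As a starting point, here is a minimal sketch of the scrapy-redis settings that would turn this into a distributed crawl. These are standard scrapy-redis options, but the Redis URL is a placeholder assumption:

```python
# settings.py -- minimal scrapy-redis setup (a sketch, not from the original post)

# Use the shared Redis-backed scheduler and duplicate filter so that
# multiple spider processes can pull requests from the same queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue between runs so an interrupted crawl can resume.
SCHEDULER_PERSIST = True

# Placeholder: point this at your actual Redis instance.
REDIS_URL = 'redis://localhost:6379'
```

The spider itself would then subclass `scrapy_redis.spiders.RedisSpider` and read its start URLs from a Redis list instead of building `start_urls` from the Excel file locally.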