Python 3.X Web Crawling in Practice (Crawling and Parsing Dynamic Pages)

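Crawling Toutiao (今日頭條) with Python3 + Scrapy + PhantomJS + Selenium
When writing crawlers, we inevitably run into sites whose content is generated by JavaScript, Ajax, and other dynamic-page techniques. Toutiao is a good example.
This article introduces a dynamic-page crawling technique based on Python3, combined with Scrapy, PhantomJS, and Selenium.
The two projects implemented for this article have been uploaded to GitHub; stars are welcome.
1. Crawling Toutiao news-list URLs:
2. Crawling Toutiao news content:
For static-page crawling and for setting up a crawler environment on Windows, see the previous posts; the required software is also provided in the previous post.

This article uses PhantomJS + Selenium to crawl news content. Crawling the news-list URLs works on the same principle, so it is not repeated here.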
Project structure

[Image: project directory structure]

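How the project works
The code is written in Python3 and uses Scrapy as the base crawling framework. Because the target pages are dynamic, the full page is not delivered in the initial response; it is generated on the fly through Ajax and similar techniques. For that reason, PhantomJS + Selenium are used to emulate a headless browser that simulates user actions and captures the rendered page source.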
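
To make the problem concrete, here is a minimal sketch (standard library only) of what a plain HTTP client sees on such a page. The markup in `RAW_HTML`, the `article` id, and the `loadArticle` call are all made up for illustration and are not taken from Toutiao: the container that should hold the article is empty until a script fills it in, so a static parse finds no text.

```python
from html.parser import HTMLParser

# Hypothetical page source as a plain HTTP client would receive it:
# the article container is empty and a script is expected to fill it later.
RAW_HTML = """
<html><body>
  <div id="article"></div>
  <script>loadArticle('/api/article/1');</script>
</body></html>
"""

# Tags that never get a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ArticleText(HTMLParser):
    """Collect the text found inside <div id="article">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside the article container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "article":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def article_text(html):
    """Return the visible text of the article container, or '' if it is empty."""
    parser = ArticleText()
    parser.feed(html)
    return " ".join(parser.chunks)


# The static source carries no article text at all.
print(repr(article_text(RAW_HTML)))
```

Only after a browser has executed the script does the container hold content, which is why the project renders each page in a headless browser before parsing it.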
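Code files
From top to bottom in the project structure:
middleware.py: the core of the project; a downloader middleware that simulates a user operating a browser while Scrapy processes each request
ContentSpider.py: the spider file, which defines the spider
commonUtils: utility functions
items.py: the class that stores the fields the spider scrapes
pipelines.py: the class that processes the scraped data

These five are the key files; the rest is business-specific code.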
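
For these files to take effect, Scrapy has to be told about them in the project settings. The article does not show its settings file, so the fragment below is only a sketch under assumptions: the module paths are inferred from the file and class names used in this article, the priority numbers are arbitrary, and the User-Agent strings are example values. `USER_AGENTS` is a custom setting (not built into Scrapy) that the middleware reads through `settings.USER_AGENTS`.

```python
# settings.py (fragment): a sketch; module paths and priorities are assumptions

# Custom setting, not built into Scrapy: a pool of User-Agent strings that
# the middleware samples with random.choice(). Example values only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 "
    "(KHTML, like Gecko) Version/10.1 Safari/603.1.30",
]

# Route every request through the PhantomJS-backed downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "DgSpiderPhantomJS.middleware.JavaScriptMiddleware": 543,
}

# Send scraped items to the pipeline.
ITEM_PIPELINES = {
    "DgSpiderPhantomJS.pipelines.DgspiderphantomjsPipeline": 300,
}
```

Because the middleware's `process_request` returns a response object, Scrapy uses that response and skips its own download for the request.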
Key code walkthrough
middleware.py

# douguo request middleware
# for the page which loaded by js/ajax
# any changes should be recorded here:
#   @author zhangjianfei
#   @date 2017/05/04

from selenium import webdriver
from scrapy.http import HtmlResponse
from DgSpiderPhantomJS import settings
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import random


class JavaScriptMiddleware(object):
    print("LOGS Starting Middleware ...")

    def process_request(self, request, spider):

        print("LOGS:  process_request is starting  ...")

        # Start from the default PhantomJS capabilities
        dcap = dict(DesiredCapabilities.PHANTOMJS)

        # Pick a random User-Agent from settings.USER_AGENTS
        dcap["phantomjs.page.settings.userAgent"] = random.choice(settings.USER_AGENTS)

        # Launch PhantomJS (adjust executable_path to your local install)
        driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs-2.1.1\bin\phantomjs.exe", desired_capabilities=dcap)

        try:
            # Give up on a page load after 60 seconds
            driver.set_page_load_timeout(60)
            # Give up on an asynchronous script after 60 seconds
            driver.set_script_timeout(60)

            # Load the requested page
            driver.get(request.url)

            # Simulate user behavior by running JavaScript in the page;
            # here the page is scrolled down to offset 10000
            js = "document.body.scrollTop=10000"
            driver.execute_script(js)

            # Wait for asynchronous requests to respond.
            # Note: implicitly_wait only sets how long element lookups keep
            # retrying (20 seconds); the call itself does not pause.
            driver.implicitly_wait(20)

            # Grab the rendered page source and the final URL
            body = driver.page_source
            url = driver.current_url
        finally:
            # Always shut the browser down so PhantomJS processes do not pile up
            driver.quit()

        return HtmlResponse(url, body=body, encoding='utf-8', request=request)
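
One caveat about the waiting step in the middleware: `implicitly_wait` configures how long element lookups keep retrying, and the call itself does not pause for Ajax responses. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for explicit waits. The sketch below shows the same polling idea with the standard library only, so it can be read and tried without a browser; `wait_until`, the `div.article` selector, and the lambda in the usage comment are hypothetical names, not part of the original project.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. `poll` is the pause between attempts.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Possible use inside process_request, after driver.get(request.url)
# (the CSS selector is an assumption about the target page):
#
#   wait_until(lambda: driver.execute_script(
#       "return document.querySelectorAll(arguments[0]).length > 0",
#       "div.article"))
```

Waiting on a concrete condition like this returns as soon as the content appears and fails loudly when it never does, instead of always pausing for a fixed time.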

ContentSpider.py

# -*- coding: utf-8 -*-

import scrapy
import random
import time
from DgSpiderPhantomJS.items import DgspiderPostItem
from scrapy.selector import Selector
from DgSpiderPhantomJS import urlSettings
from DgSpiderPhantomJS import contentSettings
from DgSpiderPhantomJS.mysqlUtils import dbhandle_update_status
from DgSpiderPhantomJS.mysqlUtils import dbhandle_geturl


class DgContentSpider(scrapy.Spider):
    # Everything in this class body runs once, when the class is defined
    print('LOGS: Spider Content_Spider Starting ...')

    # Pause for a random 60 to 90 seconds before crawling
    sleep_time = random.randint(60, 90)
    print("LOGS: Sleeping :" + str(sleep_time))
    time.sleep(sleep_time)

    # get url from db
    result = dbhandle_geturl()
    url = result[0]
    # spider_name = result[1]
    site = result[2]
    gid = result[3]
    module = result[4]

    # set spider name
    name = 'Content_Spider'
    # name = 'DgUrlSpiderPhantomJS'

    # set domains
    allowed_domains = [site]

    # set scrapy url
    start_urls = [url]

    # change status:
    # whether or not the crawl succeeds, set status to 1 for this URL
    # so the same page is not picked up again in an endless loop
    dbhandle_update_status(url, 1)

    # scrapy crawl
    def parse(self, response):

        # init the item
        item = DgspiderPostItem()

        # get the page source
        sel = Selector(response)

        print(sel)

        # get post title: string(.) returns the text of the node and all its descendants
        title_date = sel.xpath(contentSettings.POST_TITLE_XPATH)
        item['title'] = title_date.xpath('string(.)').extract()

        # get post page source
        item['text'] = sel.xpath(contentSettings.POST_CONTENT_XPATH).extract()

        # get url (the one read from the database, not the post-redirect URL)
        item['url'] = self.url

        yield item
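
The title lookup in `parse()` applies the XPath function `string(.)`, which yields the concatenated text of a node and all of its descendants, so inline tags inside the title do not split it into pieces. The standard-library `xml.etree.ElementTree` does not implement `string()`, so this small sketch reproduces the same result with `itertext()` on a made-up fragment; `string_value` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def string_value(element):
    """Concatenate every text node under `element`, in document order.

    This matches what XPath's string(.) returns for an element node.
    """
    return "".join(element.itertext())


# Hypothetical title markup with nested inline tags
fragment = ET.fromstring(
    "<h1>Crawling <em>dynamic</em> pages<span> with Scrapy</span></h1>"
)

# Only the text before the first child tag
print(repr(fragment.text))

# The full title, as string(.) would produce it
print(repr(string_value(fragment)))
```

Selecting only the element's own text would stop at the first nested tag, which is why the spider asks for the string value of the whole node.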

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DgspiderUrlItem(scrapy.Item):
    url = scrapy.Field()


class DgspiderPostItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import re
import datetime
import urllib.request
from DgSpiderPhantomJS import urlSettings
from DgSpiderPhantomJS import contentSettings
from DgSpiderPhantomJS.mysqlUtils import dbhandle_insert_content
from DgSpiderPhantomJS.uploadUtils import uploadImage
from DgSpiderPhantomJS.mysqlUtils import dbhandle_online
from DgSpiderPhantomJS.PostHandle import post_handel
from DgSpiderPhantomJS.mysqlUtils import dbhandle_update_status
from bs4 import BeautifulSoup
from DgSpiderPhantomJS.commonUtils import get_random_user
from DgSpiderPhantomJS.commonUtils import get_linkmd5id


class DgspiderphantomjsPipeline(object):

    # replies constructed for the post
    cs = []

    # post title
    title = ''

    # post text
    text = ''

    # URL currently being crawled
    url = ''

    # random user ID
    user_id = ''

    # image flag
    has_img = 0

    # get title flag
    get_title_flag = 0

    def __init__(self):
        DgspiderphantomjsPipeline.user_id = get_random_user(contentSettings.CREATE_POST_USER)

    # process the data
    def process_item(self, item, spider):
        self.get_title_flag += 1

        # URL of the current page
        DgspiderphantomjsPipeline.url = item['url']

        # get the post title
        if len(item['title']) == 0:
            title_tmp = ''
        else:
            title_tmp = item['title'][0]

        # Replace characters in the title that could cause SQL syntax errors.
        # For paginated articles, only the title of the first page is used.
        if self.get_title_flag == 1:

            # format the title with BeautifulSoup
            soup_title = BeautifulSoup(title_tmp, "lxml")
            title = ''
            # Do not call .prettify() on the parsed tree: it puts every tag on
            # its own line, which introduces many extra spaces and newlines.
            # Use stripped_strings to get the text instead.
            for string in soup_title.stripped_strings:
                title += string

            title = title.replace("'", "”").replace('"', '“')
            DgspiderphantomjsPipeline.title = title

        # get the post body
        if len(item['text']) == 0:
            text_temp = ''
        else:
            text_temp = item['text'][0]

        soup = BeautifulSoup(text_temp, "lxml")
        text_temp = str(soup)

        # find the images
        reg_img = re.compile(r'<img.*?>')
        imgs = reg_img.findall(text_temp)
        for img in imgs:
            # matchObj = re.search('.*src="(.*)"{2}.*', img, re.M | re.I)
            match_obj = re.search('.*src="(.*)".*', img, re.M | re.I)
            if match_obj is None:
                # <img> tag without a src attribute: nothing to download
                continue
            DgspiderphantomjsPipeline.has_img = 1
            img_url_tmp = match_obj.group(1)

            # The greedy match can run past the closing quote of src into
            # later attributes; keep only the part before the first quote
            img_url_tmp = img_url_tmp.split('"')[0]

            # Protocol-relative URLs ("//host/path") need a scheme;
            # absolute URLs are left as they are
            if img_url_tmp.startswith('//'):
                img_url = 'http:' + img_url_tmp
            else:
                img_url = img_url_tmp

            file_name = img_url.split('/')[-1]

            # if os.path.exists(settings.IMAGES_STORE):
            #     os.makedirs(settings.IMAGES_STORE)

            # local path where the image is stored
            file_path = contentSettings.IMAGES_STORE + file_name
            # download the image locally, then upload it
            urllib.request.urlretrieve(img_url, file_path)
            upload_img_result_json = uploadImage(file_path, 'image/jpeg', DgspiderphantomjsPipeline.user_id)
            # server-side image path, width, and height returned by the upload
            img_u = upload_img_result_json['result']['image_url']
            img_w = upload_img_result_json['result']['w']
            img_h = upload_img_result_json['result']['h']
            img_upload_flag = str(img_u)+';'+str(img_w)+';'+str(img_h)

            # replace the <img> tag with the upload info wrapped in marker tags
            text_temp = text_temp.replace(img, '[dgimg]' + img_upload_flag + '[/dgimg]')

        # strip <strong> tags
        text_temp = text_temp.replace('<strong>', '').replace('</strong>', '')

        # format the HTML with BeautifulSoup
        soup = BeautifulSoup(text_temp, "lxml")
        text = ''
        # As above, avoid .prettify() because of the extra spaces and
        # newlines it introduces; use stripped_strings instead
        for string in soup.stripped_strings:
            text += string + '\n\n'

        # replace double quotes with Chinese double quotes to avoid MySQL syntax errors
        DgspiderphantomjsPipeline.text = self.text + text.replace('"', '“')

        return item

    # called when the spider is opened
    def open_spider(self, spider):
        pass

    # called when the spider is closed
    def close_spider(self, spider):

        # write the data to the database: 235
        url = DgspiderphantomjsPipeline.url
        title = DgspiderphantomjsPipeline.title
        content = DgspiderphantomjsPipeline.text
        user_id = DgspiderphantomjsPipeline.user_id
        create_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        dbhandle_insert_content(url, title, content, user_id, DgspiderphantomjsPipeline.has_img, create_time)

        # process the text, set status, and upload to dgCommunity.dg_post;
        # the post is uploaded only if has_img is 1
        if DgspiderphantomjsPipeline.has_img == 1:
            if title.strip() != '' and content.strip() != '':
                spider.logger.info('status=2 , has_img=1, title and content is not null! Uploading post into db...')
                post_handel(url)
            else:
                spider.logger.info('status=1 , has_img=1, but title or content is null! ready to exit...')
        else:
            spider.logger.info('status=1 , has_img=0, changing status and ready to exit...')
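
The pipeline finds image URLs with regular expressions and string replacement, which is easy to trip up when attribute order or the URL scheme varies. As an alternative sketch (not the author's method), the same step can be done with the standard library's HTML parser and `urllib.parse`. The names `ImgSrcCollector`, `image_urls`, and `file_name`, along with the example.com URLs in the usage note, are hypothetical.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags that have no src
                self.sources.append(src)


def image_urls(html, page_url):
    """Return absolute image URLs found in `html`.

    urljoin resolves protocol-relative ("//host/a.jpg") and relative
    paths against `page_url`, and leaves absolute URLs unchanged.
    """
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


def file_name(url):
    """Return the last path segment of `url`, without any query string."""
    return posixpath.basename(urlparse(url).path)
```

For example, calling `image_urls` on a fragment containing `<img src="//img.example.com/pic/a.jpg">` with the page URL `http://www.example.com/post/1` resolves the image to `http://img.example.com/pic/a.jpg`, and `file_name` then gives `a.jpg`.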

Reposted from:

http://blog.csdn.net/qq_31573519/article/details/74248559

Last edited: 2017.12.10 01:53:17
