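I have been learning Python's Scrapy framework lately and have written quite a few small examples. As the saying goes, to do a good job you must first sharpen your tools. Today's example crawls the technology news section of Toutiao (今日头条) to give this tool some practice.
This tutorial depends on the scrapy and pymongo modules, so install both before you start.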
- 1. Analyze the Toutiao news API
- Toutiao loads its data asynchronously as JSON via AJAX, so waiting for the page to render and then extracting data from the HTML is clumsy. Instead, capture and analyze the site's network traffic directly in the browser.
- Open a browser and visit Toutiao's technology news section at http://www.toutiao.com/ch/news_tech/
- Right-click and inspect the page, then analyze its network requests. Tick the checkbox marked with the red arrow (the option that keeps a log of network requests), then reload the site.
- Go through the recorded requests one by one. The request to http://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A155493CA8EBB0F&cp=59C84BEB601F7E1 returns JSON data.
- The returned data looks like this:
{ "has_more": false, "message": "success", "data": [ { "chinese_tag": "財(cái)經(jīng)", "media_avatar_url": "http://p3.pstatp.com/large/1233000741099c9f4a59", "is_feed_ad": false, "tag_url": "news_finance", "title": "【特寫】數(shù)字貨幣的信徒們", "single_mode": true, "middle_mode": true, "abstract": "在九月初在中國(guó)發(fā)文整治ICO后,硅谷的區(qū)塊鏈項(xiàng)目創(chuàng)業(yè)者林嚇洪把籌集的資金全部還給了中國(guó)投資者們剩瓶。在那次整治中延曙,監(jiān)管部門宣布,首次代幣發(fā)行(Initial Coin Offering亡哄,簡(jiǎn)稱ICO)屬于非法行為枝缔,所有平臺(tái)必須返還籌集的資金。", "tag": "news_finance", "label": [ "數(shù)字貨幣", "風(fēng)投", "比特幣", "投資", "經(jīng)濟(jì)" ], "behot_time": 1506326903, "source_url": "/group/6469550301866803469/", "source": "界面新聞", "more_mode": false, "article_genre": "article", "image_url": "http://p1.pstatp.com/list/190x124/317200041ea1cf451f52", "has_gallery": false, "group_source": 1, "comments_count": 10, "group_id": "6469550301866803469", "media_url": "/c/user/52857496566/" }, { "image_url": "http://p3.pstatp.com/list/190x124/31770009f2c887fdb867", "single_mode": true, "abstract": "早磺平,來(lái)看看今天的新聞魂仍。小米就校招風(fēng)波道歉@DoNews【小米就校招風(fēng)波道歉 對(duì)涉事員工通報(bào)批評(píng)】近日,一名自稱在河南鄭州大學(xué)日語(yǔ)專業(yè)學(xué)習(xí)的大學(xué)生表示拣挪,她與同學(xué)在一次校園招聘宣講會(huì)上無(wú)故被來(lái)自小米公司的主管人員諷刺擦酌。導(dǎo)致自己和本專業(yè)的同學(xué)憤然離開(kāi)。", "middle_mode": false, "more_mode": true, "tag": "news_tech", "label": [ "小米科技", "亞馬遜公司", "Uber", "美國(guó)", "樂(lè)視" ], "tag_url": "news_tech", "title": "小米就校招風(fēng)波道歉菠劝;ofo正尋求新一輪融資", "chinese_tag": "科技", "source": "虎嗅APP", "group_source": 1, "has_gallery": false, "media_url": "/c/user/3358265611/", "media_avatar_url": "http://p2.pstatp.com/large/18a50010126f235bf938", "image_list": [ { "url": "http://p3.pstatp.com/list/31770009f2c887fdb867" }, { "url": "http://p1.pstatp.com/list/317b00061c410d6d0352" }, { "url": "http://p3.pstatp.com/list/3172000337e0332b337f" } ], "source_url": "/group/6469472579270672654/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506326303, "comments_count": 114, "group_id": 
"6469472579270672654" }, { "image_url": "http://p3.pstatp.com/list/190x124/3c64000074857b07c81d", "single_mode": true, "abstract": "藍(lán)燕赊舶,經(jīng)常關(guān)注香港電影的人應(yīng)該不陌生,在2011年靠著香港三級(jí)影片《3D肉蒲團(tuán)之極樂(lè)寶鑒》走紅赶诊,并逐漸出現(xiàn)人們的視線中笼平。被稱為新一代的“艷星”√蚧荆可走紅后的她并沒(méi)有獲得很好的資源寓调,所接拍的影片大多數(shù)是一些不知名的配角。", "middle_mode": false, "more_mode": true, "tag": "news_entertainment", "label": [ "藍(lán)燕 ", "肉蒲團(tuán)", "投資", "娛樂(lè)" ], "tag_url": "news_entertainment", "title": "艷星藍(lán)燕美照曝光 靠著《3D肉蒲團(tuán)》走紅", "chinese_tag": "娛樂(lè)", "source": "陪你樂(lè)不停", "group_source": 2, "has_gallery": false, "media_url": "/c/user/61497461135/", "media_avatar_url": "http://p3.pstatp.com/large/382f000f5dd459d0eb74", "image_list": [ { "url": "http://p3.pstatp.com/list/3c64000074857b07c81d" }, { "url": "http://p3.pstatp.com/list/3c6000022fcec3f4ca48" }, { "url": "http://p3.pstatp.com/list/3c60000230155491a84d" } ], "source_url": "/group/6469578595697164813/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506325703, "comments_count": 2, "group_id": "6469578595697164813" }, { "log_extra": "{\"ad_price\":\"Wci5d__iJRJZyLl3_-IlEuQYjwGdUeJEIl99Ew\",\"convert_id\":0,\"external_action\":0,\"req_id\":\"201709251608231720180471641841E3\",\"rit\":1}", "image_url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52", "read_count": 12196, "ban_comment": 1, "single_mode": true, "abstract": "", "image_list": [], "has_video": false, "article_type": 1, "tag": "ad", "display_info": "股市迎來(lái)重磅利好消息锄码,這些股或?qū)⑸蠞q翻倍夺英,微信領(lǐng)取", "has_m3u8_video": 0, "label": "廣告", "user_verified": 0, "aggr_type": 1, "expire_seconds": 314754930, "cell_type": 0, "article_sub_type": 0, "group_flags": 4096, "bury_count": 0, "title": "股市迎來(lái)重磅利好消息晌涕,這些股或?qū)⑸蠞q翻倍,微信領(lǐng)取", "ignore_web_transform": 1, "source_icon_style": 3, "tip": 0, "hot": 0, "share_url": "http://m.toutiao.com/group/6465452273144168717/?iid=0&app=news_article", "has_mp4_video": 0, "source": "聯(lián)訊證券", "comment_count": 0, "article_url": 
"http://cq3.ilyae.cn/toutiao2/index.html", "filter_words": [ { "id": "1:74", "name": "股票", "is_selected": false }, { "id": "1:6", "name": "金融保險(xiǎn)", "is_selected": false }, { "id": "2:0", "name": "來(lái)源:聯(lián)訊證券", "is_selected": false }, { "id": "4:2", "name": "看過(guò)了", "is_selected": false } ], "has_gallery": false, "publish_time": 1505355414, "ad_id": 69048936405, "action_list": [ { "action": 1, "extra": {}, "desc": "" }, { "action": 3, "extra": {}, "desc": "" }, { "action": 7, "extra": {}, "desc": "" }, { "action": 9, "extra": {}, "desc": "" } ], "has_image": false, "cell_layout_style": 1, "tag_id": 6465452273144168717, "source_url": "http://cq3.ilyae.cn/toutiao2/index.html", "video_style": 0, "verified_content": "", "is_feed_ad": true, "large_image_list": [], "item_id": 6465452273144168717, "natant_level": 2, "tag_url": "search/?keyword=None", "article_genre": "ad", "level": 0, "cell_flag": 10, "source_open_url": "sslocal://search?from=channel_source&keyword=%E8%81%94%E8%AE%AF%E8%AF%81%E5%88%B8", "display_url": "http://cq3.ilyae.cn/toutiao2/index.html", "digg_count": 0, "behot_time": 1506325103, "article_alt_url": "http://m.toutiao.com/group/article/6465452273144168717/", "cursor": 1506325103999, "url": "http://cq3.ilyae.cn/toutiao2/index.html", "preload_web": 0, "ad_label": "廣告", "user_repin": 0, "label_style": 3, "item_version": 0, "group_id": "6465452273144168717", "middle_image": { "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52", "width": 456, "url_list": [ { "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52" }, { "url": "http://pb9.pstatp.com/large/26c00009898dbc9c5a52" }, { "url": "http://pb1.pstatp.com/large/26c00009898dbc9c5a52" } ], "uri": "large/26c00009898dbc9c5a52", "height": 256 } }, { "image_url": "http://p3.pstatp.com/list/190x124/3b050002710aff2b3422", "single_mode": true, "abstract": "如今2017年微信的月活躍用戶達(dá)9億痛悯,微信成了中國(guó)最大用戶群體的手機(jī)APP余黎,它集通訊、娛樂(lè)载萌、支付等于一體惧财。很多朋友習(xí)慣每天打開(kāi)微信收發(fā)信息、查看朋友圈動(dòng)態(tài)扭仁。", 
"middle_mode": false, "more_mode": true, "tag": "news_tech", "label": [ "移動(dòng)互聯(lián)網(wǎng)", "微信", "澤西島", "美女", "歐洲" ], "tag_url": "news_tech", "title": "為什么微信中那么多美女來(lái)自安道爾或澤西島垮衷?這是一種暗語(yǔ)嗎", "chinese_tag": "科技", "source": "獅子夜光杯", "group_source": 2, "has_gallery": false, "media_url": "/c/user/53397416061/", "media_avatar_url": "http://p3.pstatp.com/large/12330013573aaa4c18b1", "image_list": [ { "url": "http://p3.pstatp.com/list/3b050002710aff2b3422" }, { "url": "http://p3.pstatp.com/list/3b05000271096e15298e" }, { "url": "http://p9.pstatp.com/list/3b080000bdf469bf7330" } ], "source_url": "/group/6467319367565574670/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506324503, "comments_count": 46, "group_id": "6467319367565574670" }, { "image_url": "http://p3.pstatp.com/list/190x124/3b0f0003c132eb485453", "single_mode": true, "abstract": "最近幾周,各大互聯(lián)網(wǎng)科技公司都開(kāi)始秋季招聘了這些是正經(jīng)的公司的招聘筆試題:關(guān)于c++的inline關(guān)鍵字,以下說(shuō)法正確的是()對(duì)N個(gè)數(shù)進(jìn)行排序,在各自最優(yōu)條件下以下算法復(fù)雜度最低的是()為百度設(shè)計(jì)一款新產(chǎn)品斋枢,可以結(jié)合百度現(xiàn)有的優(yōu)勢(shì)和資源帘靡,專注解決大學(xué)生用戶的某個(gè)需求痛點(diǎn)知给,請(qǐng)給出主", "middle_mode": false, "more_mode": true, "tag": "news_design", "label": [ "電子商務(wù)", "京東", "面試", "劉強(qiáng)東", "計(jì)算復(fù)雜性理論" ], "tag_url": "search/?keyword=%E8%AE%BE%E8%AE%A1", "title": "京東校招筆試題“如何用0.01元買到一瓶可樂(lè)”瓤帚?竟被蘇寧秀了一臉", "chinese_tag": "設(shè)計(jì)", "source": "小禾科技", "group_source": 2, "has_gallery": false, "media_url": "/c/user/59954335187/", "media_avatar_url": "http://p9.pstatp.com/large/39b10003f6cddd5128fa", "image_list": [ { "url": "http://p3.pstatp.com/list/3b0f0003c132eb485453" }, { "url": "http://p3.pstatp.com/list/3b110000ab4c79a56483" }, { "url": "http://p9.pstatp.com/list/3b1600007cde1cf9bdd0" } ], "source_url": "/group/6468140283245625870/", "article_genre": "article", "is_feed_ad": false, "behot_time": 1506323903, "comments_count": 87, "group_id": "6468140283245625870" }, { "chinese_tag": "科技", 
"media_avatar_url": "http://p9.pstatp.com/large/2c6600049c7144303824", "is_feed_ad": false, "tag_url": "news_tech", "title": "為什么家里的WIFI時(shí)快時(shí)慢?竟然是因?yàn)椤?, "single_mode": true, "middle_mode": false, "abstract": "現(xiàn)在還是個(gè)信息的時(shí)代涩赢,不僅手機(jī)戈次、電腦非常普遍,而且現(xiàn)在的人們都喜歡用無(wú)線網(wǎng)絡(luò)之WiFi筒扒,因?yàn)檫@樣更加便捷怯邪。在家使用手機(jī)的時(shí)候,不用打開(kāi)手機(jī)的數(shù)據(jù)流量花墩,只要使用WiFi就可以了悬秉,無(wú)限的流量使用,太方便了冰蘑。但是很多用戶都會(huì)有這樣的體驗(yàn)和泌,WiFi速度時(shí)快時(shí)慢的,很是煩惱祠肥。", "group_source": 2, "image_list": [ { "url": "http://p3.pstatp.com/list/3b1600009ba8a7500c7e" }, { "url": "http://p1.pstatp.com/list/3b1600009bb32db8a78a" }, { "url": "http://p3.pstatp.com/list/3b120000c5dac40ae0fe" } ], "label": [ "Wi-Fi", "科技" ], "behot_time": 1506323303, "source_url": "/group/6468146583144759822/", "source": "水電小知識(shí)", "more_mode": true, "article_genre": "article", "image_url": "http://p3.pstatp.com/list/190x124/3b1600009ba8a7500c7e", "tag": "news_tech", "has_gallery": false, "group_id": "6468146583144759822", "media_url": "/c/user/61795844218/" } ], "next": { "max_behot_time": 1506323303 } }
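The response above has four top-level keys: `has_more`, `message`, `data` (the list of news rows) and `next` (which carries `max_behot_time`). The following is a minimal sketch of pulling those parts out with the standard library. `SAMPLE` is a heavily trimmed stand-in built from values in the response above, and `summarize_feed` is a hypothetical helper name, not part of the project.

```python
import json

# Trimmed stand-in for the feed response; values are copied from the sample above.
SAMPLE = """
{
  "has_more": false,
  "message": "success",
  "data": [
    {"tag": "news_tech", "group_id": "6469472579270672654",
     "is_feed_ad": false, "behot_time": 1506326303},
    {"tag": "ad", "group_id": "6465452273144168717",
     "is_feed_ad": true, "behot_time": 1506325103}
  ],
  "next": {"max_behot_time": 1506323303}
}
"""


def summarize_feed(raw):
    """Return (rows, next_cursor); ([], None) when the API did not report success."""
    payload = json.loads(raw)
    if payload.get("message") != "success":
        return [], None
    rows = payload.get("data", [])
    next_cursor = payload.get("next", {}).get("max_behot_time")
    return rows, next_cursor


rows, next_cursor = summarize_feed(SAMPLE)
for row in rows:
    print(row["group_id"], row["tag"], row["is_feed_ad"])
print("next max_behot_time:", next_cursor)
```

Note that ad rows (`"is_feed_ad": true`) arrive in the same `data` list as articles, so a consumer may want to check that flag before storing a row.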
- 2. Analyze the request parameters and how requests repeat:
- The technology news endpoint is called with a GET request that passes the following query parameters:
category: news_tech
utm_source: toutiao
widen: 1
max_behot_time: 0
max_behot_time_tmp: 0
tadrequire: true
as: A155493CA8EBB0F
cp: 59C84BEB601F7E1
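- Scroll the page to trigger another asynchronous request and look at its parameters: only a few of them change. The `next->max_behot_time` field in the previous response is exactly the value sent as `max_behot_time` and `max_behot_time_tmp` in the following request. The `as` and `cp` parameters have little effect on the GET request, so you can reuse the values from any one captured request. As for `max_behot_time`, I believe it is a timestamp; since the response already hands us the value, there is no need to guess. Packet analysis is sometimes a process of guessing what API parameters mean, and you can verify this yourself: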
max_behot_time: 1506326351
max_behot_time_tmp: 1506326351
as: A115996C383BD3C
cp: 59C82BAD839CBE1
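The pagination rule described above can be sketched with the standard library: the URL for the next page is the same feed URL with `max_behot_time` and `max_behot_time_tmp` set to the `next.max_behot_time` value of the previous response. `build_feed_url` is a hypothetical helper name; the `as` and `cp` defaults are simply the values from the first captured request, reused on the assumption (stated above) that they do not affect the result.

```python
from urllib.parse import urlencode

FEED_ENDPOINT = "http://www.toutiao.com/api/pc/feed/"


def build_feed_url(max_behot_time=0,
                   as_token="A155493CA8EBB0F",
                   cp_token="59C84BEB601F7E1"):
    """Build the feed URL for one page of technology news."""
    params = [
        ("category", "news_tech"),
        ("utm_source", "toutiao"),
        ("widen", 1),
        # Both cursor parameters carry the same value.
        ("max_behot_time", max_behot_time),
        ("max_behot_time_tmp", max_behot_time),
        ("tadrequire", "true"),
        ("as", as_token),
        ("cp", cp_token),
    ]
    return FEED_ENDPOINT + "?" + urlencode(params)


first_page = build_feed_url()           # cursor 0 requests the first page
next_page = build_feed_url(1506326351)  # cursor taken from the previous response
print(first_page)
print(next_page)
```

Keeping the parameters in an ordered list reproduces the captured URL exactly, which makes it easy to compare against what the browser sent.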
- 3. Construct the request URL:
- The directory structure of the Scrapy project is as follows:
- The source of settings.py is as follows:
# -*- coding: utf-8 -*-
# Scrapy settings for todayNews project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'todayNews'
SPIDER_MODULES = ['todayNews.spiders']
NEWSPIDER_MODULE = 'todayNews.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'todayNews (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept':'text/javascript, text/html, application/xml, text/xml, */*',
    'Accept-Encoding':'gzip, deflate, sdch, br',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Cache-Control':'no-cache',
    'Connection':'keep-alive',
    'Content-Type':'application/x-www-form-urlencoded',
    'Cookie':'uuid="w:3db0708ea2c549fab1a5371c56f16176"; UM_distinctid=15c7147fecd8d-0a4277451-4349052c-100200-15c7147fecf6f; csrftoken=af9a5a0d4cd30794e6c04511ca9f31eb; _ga=GA1.2.312467779.1496549163; __guid=32687416.738502311042654200.1505560389379.9048; tt_track_id=c7baa73a99ec9787ead7a2f6b01ff56b; _ba=BA0.2-20170923-51d9e-ErxmsyZIIoxNOzZgf6Us; tt_webid=6427627096743282178; WEATHER_CITY=%E5%8C%97%E4%BA%AC; CNZZDATA1259612802=610804389-1496543540-null%7C1506261975; __tasessionId=0vta7k1uc1506263833592; tt_webid=6427627096743282178',
    'Host':'www.toutiao.com',
    'Pragma':'no-cache',
    'Referer':'https://www.toutiao.com/ch/news_tech/',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'X-Requested-With':'XMLHttpRequest'
}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'todayNews.middlewares.TodaynewsSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'todayNews.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'todayNews.pipelines.MongoPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
DOWNLOAD_DELAY = 1
MONGO_URI="localhost"
MONGO_DATABASE="toutiao"
MONGO_USER="username"
MONGO_PASS="password"
- The source of pipelines.py is as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class MongoPipeline(object):
    # MongoDB collection that receives the scraped rows
    collection_name = "science"

    def __init__(self, mongo_uri, mongo_db, mongo_user, mongo_pass):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongo_user = mongo_user
        self.mongo_pass = mongo_pass

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
            mongo_user=crawler.settings.get("MONGO_USER"),
            mongo_pass=crawler.settings.get("MONGO_PASS"),
        )

    def open_spider(self, spider):
        # Credentials are passed to MongoClient because Database.authenticate()
        # was removed in PyMongo 4; authSource keeps authentication on the target database.
        self.client = pymongo.MongoClient(
            self.mongo_uri,
            username=self.mongo_user,
            password=self.mongo_pass,
            authSource=self.mongo_db,
        )
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db[self.collection_name].update({'url_token': item['url_token']}, {'$set': dict(item)}, True)
        # return item
        # insert_one() replaces insert(), which was removed in PyMongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item
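The commented-out `update(...)` line in `process_item` hints at an upsert, which would avoid storing the same article twice when the feed repeats it. Below is a small sketch of building the two documents such an upsert needs. It assumes `group_id` is a suitable unique key, which is a guess based on the sample response rather than something the API documents (the `url_token` key in the commented line does not appear in the feed data). `build_upsert` is a hypothetical helper name.

```python
def build_upsert(row):
    """Return (filter, update) for an upsert keyed on group_id, or None if the key is missing.

    The pair is meant to be passed to PyMongo as
    collection.update_one(filter, update, upsert=True).
    """
    group_id = row.get("group_id")
    if group_id is None:
        return None
    return {"group_id": group_id}, {"$set": dict(row)}


# Example row with values taken from the sample response.
example = {"group_id": "6469472579270672654", "tag": "news_tech", "comments_count": 114}
print(build_upsert(example))
```

Inside `process_item`, the returned pair could replace the plain insert; rows without a `group_id` would then need a fallback such as `insert_one`.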
- The source of toutiao.py is as follows:
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import json
import logging
from todayNews.items import TodaynewsItem


class ToutiaoSpider(Spider):
    name = "toutiao"
    allowed_domains = ["www.toutiao.com"]
    # First request, exactly as captured (note that it asks for category=__all__)
    start_urls = ['https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=A1D5394CB72C38F&cp=59C71C03883F0E1']
    # Template for the following pages; both cursor parameters take next.max_behot_time
    url = 'https://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time={behot_time}&max_behot_time_tmp={behot_time_tmp}&tadrequire=true&as=A165E92C97CC487&cp=59C74CC4E8F7BE1'

    def parse(self, response):
        jsonData = json.loads(response.body.decode("utf-8"))
        if jsonData["message"] == 'success':
            MainData = jsonData["data"]
            # Yield every raw row as-is; the pipeline stores it in MongoDB
            for rowData in MainData:
                yield rowData
            # Read the cursor only after a successful response, then request the next page
            nextTime = jsonData["next"]["max_behot_time"]
            yield Request(url=self.url.format(behot_time=nextTime, behot_time_tmp=nextTime), callback=self.parse)
        else:
            logging.info("The Data is null")
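- items.py normally defines the structured fields to extract. Because the JSON returned by Toutiao is not uniform (see the sample data above), no item fields are defined here. Instead, each raw row is passed straight to the pipeline, which saves it to MongoDB.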
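To illustrate why a fixed item definition is awkward here: in the sample response, `label` is a list for articles but a plain string for the ad row, and the comment counter is `comments_count` on articles but `comment_count` on the ad. If a uniform shape were wanted, one option is a small normalizing step before storage. The sketch below is only one possible approach; `normalize_row` and the output field names are hypothetical and not part of the project.

```python
def normalize_row(row):
    """Map a raw feed row onto a small, uniform set of fields."""
    label = row.get("label", [])
    if isinstance(label, str):
        # Ad rows carry a single string instead of a list
        label = [label]
    # Articles use comments_count, the ad row uses comment_count
    comments = row.get("comments_count", row.get("comment_count", 0))
    return {
        "group_id": row.get("group_id"),
        "title": row.get("title", ""),
        "tag": row.get("tag", ""),
        "labels": list(label),
        "comments": comments,
        "is_ad": bool(row.get("is_feed_ad", False)),
    }


# Two trimmed rows with values taken from the sample response.
article = {"group_id": "6469472579270672654", "tag": "news_tech",
           "label": ["Uber"], "comments_count": 114, "is_feed_ad": False}
ad = {"group_id": "6465452273144168717", "tag": "ad",
      "label": "ad", "comment_count": 0, "is_feed_ad": True}

print(normalize_row(article))
print(normalize_row(ad))
```

Using `.get` with defaults throughout means a row that lacks a field still produces a complete record rather than raising `KeyError`.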
- 4. Start the crawler and inspect the scraped data