Toutiao (今日頭條) Crawler

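I have been learning Python's Scrapy framework recently and have written quite a few small examples along the way. As the saying goes, a craftsman who wants to do good work must first sharpen his tools, so today's example crawls the technology news section of Toutiao to put that tool to use.
This tutorial depends on the scrapy and pymongo modules; install both before you start (for example with pip install scrapy pymongo).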

  • 1. Analyze the Toutiao news API
    • Toutiao loads its news as JSON through asynchronous AJAX requests, so the usual approach of waiting for the page to render and then extracting data from the HTML does not work well. Instead, capture and analyze the site's network traffic directly in the browser.
    • Open a browser and visit the Toutiao technology news section at http://www.toutiao.com/ch/news_tech/
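    • Right-click the page and choose Inspect Element, then look at the page's network requests. Tick the checkbox marked with the red arrow in the screenshot (the option that keeps a log of network requests), then reload the page.
      [Screenshot: 360截圖20170925161331582.jpg]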
    • Go through the recorded requests one by one. The request to http://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A155493CA8EBB0F&cp=59C84BEB601F7E1 returns JSON data.
      [Screenshot: 今日頭條]
    • The returned data looks like this:
      {
    "has_more": false,
    "message": "success",
    "data": [
      {
        "chinese_tag": "財(cái)經(jīng)",
        "media_avatar_url": "http://p3.pstatp.com/large/1233000741099c9f4a59",
        "is_feed_ad": false,
        "tag_url": "news_finance",
        "title": "【特寫】數(shù)字貨幣的信徒們",
        "single_mode": true,
        "middle_mode": true,
        "abstract": "在九月初在中國(guó)發(fā)文整治ICO后,硅谷的區(qū)塊鏈項(xiàng)目創(chuàng)業(yè)者林嚇洪把籌集的資金全部還給了中國(guó)投資者們剩瓶。在那次整治中延曙,監(jiān)管部門宣布,首次代幣發(fā)行(Initial Coin Offering亡哄,簡(jiǎn)稱ICO)屬于非法行為枝缔,所有平臺(tái)必須返還籌集的資金。",
        "tag": "news_finance",
        "label": [
          "數(shù)字貨幣",
          "風(fēng)投",
          "比特幣",
          "投資",
          "經(jīng)濟(jì)"
        ],
        "behot_time": 1506326903,
        "source_url": "/group/6469550301866803469/",
        "source": "界面新聞",
        "more_mode": false,
        "article_genre": "article",
        "image_url": "http://p1.pstatp.com/list/190x124/317200041ea1cf451f52",
        "has_gallery": false,
        "group_source": 1,
        "comments_count": 10,
        "group_id": "6469550301866803469",
        "media_url": "/c/user/52857496566/"
      },
      {
        "image_url": "http://p3.pstatp.com/list/190x124/31770009f2c887fdb867",
        "single_mode": true,
        "abstract": "早磺平,來(lái)看看今天的新聞魂仍。小米就校招風(fēng)波道歉@DoNews【小米就校招風(fēng)波道歉 對(duì)涉事員工通報(bào)批評(píng)】近日,一名自稱在河南鄭州大學(xué)日語(yǔ)專業(yè)學(xué)習(xí)的大學(xué)生表示拣挪,她與同學(xué)在一次校園招聘宣講會(huì)上無(wú)故被來(lái)自小米公司的主管人員諷刺擦酌。導(dǎo)致自己和本專業(yè)的同學(xué)憤然離開(kāi)。",
        "middle_mode": false,
        "more_mode": true,
        "tag": "news_tech",
        "label": [
          "小米科技",
          "亞馬遜公司",
          "Uber",
          "美國(guó)",
          "樂(lè)視"
        ],
        "tag_url": "news_tech",
        "title": "小米就校招風(fēng)波道歉菠劝;ofo正尋求新一輪融資",
        "chinese_tag": "科技",
        "source": "虎嗅APP",
        "group_source": 1,
        "has_gallery": false,
        "media_url": "/c/user/3358265611/",
        "media_avatar_url": "http://p2.pstatp.com/large/18a50010126f235bf938",
        "image_list": [
          {
            "url": "http://p3.pstatp.com/list/31770009f2c887fdb867"
          },
          {
            "url": "http://p1.pstatp.com/list/317b00061c410d6d0352"
          },
          {
            "url": "http://p3.pstatp.com/list/3172000337e0332b337f"
          }
        ],
        "source_url": "/group/6469472579270672654/",
        "article_genre": "article",
        "is_feed_ad": false,
        "behot_time": 1506326303,
        "comments_count": 114,
        "group_id": "6469472579270672654"
      },
      {
        "image_url": "http://p3.pstatp.com/list/190x124/3c64000074857b07c81d",
        "single_mode": true,
        "abstract": "藍(lán)燕赊舶,經(jīng)常關(guān)注香港電影的人應(yīng)該不陌生,在2011年靠著香港三級(jí)影片《3D肉蒲團(tuán)之極樂(lè)寶鑒》走紅赶诊,并逐漸出現(xiàn)人們的視線中笼平。被稱為新一代的“艷星”√蚧荆可走紅后的她并沒(méi)有獲得很好的資源寓调,所接拍的影片大多數(shù)是一些不知名的配角。",
        "middle_mode": false,
        "more_mode": true,
        "tag": "news_entertainment",
        "label": [
          "藍(lán)燕 ",
          "肉蒲團(tuán)",
          "投資",
          "娛樂(lè)"
        ],
        "tag_url": "news_entertainment",
        "title": "艷星藍(lán)燕美照曝光 靠著《3D肉蒲團(tuán)》走紅",
        "chinese_tag": "娛樂(lè)",
        "source": "陪你樂(lè)不停",
        "group_source": 2,
        "has_gallery": false,
        "media_url": "/c/user/61497461135/",
        "media_avatar_url": "http://p3.pstatp.com/large/382f000f5dd459d0eb74",
        "image_list": [
          {
            "url": "http://p3.pstatp.com/list/3c64000074857b07c81d"
          },
          {
            "url": "http://p3.pstatp.com/list/3c6000022fcec3f4ca48"
          },
          {
            "url": "http://p3.pstatp.com/list/3c60000230155491a84d"
          }
        ],
        "source_url": "/group/6469578595697164813/",
        "article_genre": "article",
        "is_feed_ad": false,
        "behot_time": 1506325703,
        "comments_count": 2,
        "group_id": "6469578595697164813"
      },
      {
        "log_extra": "{\"ad_price\":\"Wci5d__iJRJZyLl3_-IlEuQYjwGdUeJEIl99Ew\",\"convert_id\":0,\"external_action\":0,\"req_id\":\"201709251608231720180471641841E3\",\"rit\":1}",
        "image_url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52",
        "read_count": 12196,
        "ban_comment": 1,
        "single_mode": true,
        "abstract": "",
        "image_list": [],
        "has_video": false,
        "article_type": 1,
        "tag": "ad",
        "display_info": "股市迎來(lái)重磅利好消息锄码,這些股或?qū)⑸蠞q翻倍夺英,微信領(lǐng)取",
        "has_m3u8_video": 0,
        "label": "廣告",
        "user_verified": 0,
        "aggr_type": 1,
        "expire_seconds": 314754930,
        "cell_type": 0,
        "article_sub_type": 0,
        "group_flags": 4096,
        "bury_count": 0,
        "title": "股市迎來(lái)重磅利好消息晌涕,這些股或?qū)⑸蠞q翻倍,微信領(lǐng)取",
        "ignore_web_transform": 1,
        "source_icon_style": 3,
        "tip": 0,
        "hot": 0,
        "share_url": "http://m.toutiao.com/group/6465452273144168717/?iid=0&app=news_article",
        "has_mp4_video": 0,
        "source": "聯(lián)訊證券",
        "comment_count": 0,
        "article_url": "http://cq3.ilyae.cn/toutiao2/index.html",
        "filter_words": [
          {
            "id": "1:74",
            "name": "股票",
            "is_selected": false
          },
          {
            "id": "1:6",
            "name": "金融保險(xiǎn)",
            "is_selected": false
          },
          {
            "id": "2:0",
            "name": "來(lái)源:聯(lián)訊證券",
            "is_selected": false
          },
          {
            "id": "4:2",
            "name": "看過(guò)了",
            "is_selected": false
          }
        ],
        "has_gallery": false,
        "publish_time": 1505355414,
        "ad_id": 69048936405,
        "action_list": [
          {
            "action": 1,
            "extra": {},
            "desc": ""
          },
          {
            "action": 3,
            "extra": {},
            "desc": ""
          },
          {
            "action": 7,
            "extra": {},
            "desc": ""
          },
          {
            "action": 9,
            "extra": {},
            "desc": ""
          }
        ],
        "has_image": false,
        "cell_layout_style": 1,
        "tag_id": 6465452273144168717,
        "source_url": "http://cq3.ilyae.cn/toutiao2/index.html",
        "video_style": 0,
        "verified_content": "",
        "is_feed_ad": true,
        "large_image_list": [],
        "item_id": 6465452273144168717,
        "natant_level": 2,
        "tag_url": "search/?keyword=None",
        "article_genre": "ad",
        "level": 0,
        "cell_flag": 10,
        "source_open_url": "sslocal://search?from=channel_source&keyword=%E8%81%94%E8%AE%AF%E8%AF%81%E5%88%B8",
        "display_url": "http://cq3.ilyae.cn/toutiao2/index.html",
        "digg_count": 0,
        "behot_time": 1506325103,
        "article_alt_url": "http://m.toutiao.com/group/article/6465452273144168717/",
        "cursor": 1506325103999,
        "url": "http://cq3.ilyae.cn/toutiao2/index.html",
        "preload_web": 0,
        "ad_label": "廣告",
        "user_repin": 0,
        "label_style": 3,
        "item_version": 0,
        "group_id": "6465452273144168717",
        "middle_image": {
          "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52",
          "width": 456,
          "url_list": [
            {
              "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52"
            },
            {
              "url": "http://pb9.pstatp.com/large/26c00009898dbc9c5a52"
            },
            {
              "url": "http://pb1.pstatp.com/large/26c00009898dbc9c5a52"
            }
          ],
          "uri": "large/26c00009898dbc9c5a52",
          "height": 256
        }
      },
      {
        "image_url": "http://p3.pstatp.com/list/190x124/3b050002710aff2b3422",
        "single_mode": true,
        "abstract": "如今2017年微信的月活躍用戶達(dá)9億痛悯,微信成了中國(guó)最大用戶群體的手機(jī)APP余黎,它集通訊、娛樂(lè)载萌、支付等于一體惧财。很多朋友習(xí)慣每天打開(kāi)微信收發(fā)信息、查看朋友圈動(dòng)態(tài)扭仁。",
        "middle_mode": false,
        "more_mode": true,
        "tag": "news_tech",
        "label": [
          "移動(dòng)互聯(lián)網(wǎng)",
          "微信",
          "澤西島",
          "美女",
          "歐洲"
        ],
        "tag_url": "news_tech",
        "title": "為什么微信中那么多美女來(lái)自安道爾或澤西島垮衷?這是一種暗語(yǔ)嗎",
        "chinese_tag": "科技",
        "source": "獅子夜光杯",
        "group_source": 2,
        "has_gallery": false,
        "media_url": "/c/user/53397416061/",
        "media_avatar_url": "http://p3.pstatp.com/large/12330013573aaa4c18b1",
        "image_list": [
          {
            "url": "http://p3.pstatp.com/list/3b050002710aff2b3422"
          },
          {
            "url": "http://p3.pstatp.com/list/3b05000271096e15298e"
          },
          {
            "url": "http://p9.pstatp.com/list/3b080000bdf469bf7330"
          }
        ],
        "source_url": "/group/6467319367565574670/",
        "article_genre": "article",
        "is_feed_ad": false,
        "behot_time": 1506324503,
        "comments_count": 46,
        "group_id": "6467319367565574670"
      },
      {
        "image_url": "http://p3.pstatp.com/list/190x124/3b0f0003c132eb485453",
        "single_mode": true,
        "abstract": "最近幾周,各大互聯(lián)網(wǎng)科技公司都開(kāi)始秋季招聘了這些是正經(jīng)的公司的招聘筆試題:關(guān)于c++的inline關(guān)鍵字,以下說(shuō)法正確的是()對(duì)N個(gè)數(shù)進(jìn)行排序,在各自最優(yōu)條件下以下算法復(fù)雜度最低的是()為百度設(shè)計(jì)一款新產(chǎn)品斋枢,可以結(jié)合百度現(xiàn)有的優(yōu)勢(shì)和資源帘靡,專注解決大學(xué)生用戶的某個(gè)需求痛點(diǎn)知给,請(qǐng)給出主",
        "middle_mode": false,
        "more_mode": true,
        "tag": "news_design",
        "label": [
          "電子商務(wù)",
          "京東",
          "面試",
          "劉強(qiáng)東",
          "計(jì)算復(fù)雜性理論"
        ],
        "tag_url": "search/?keyword=%E8%AE%BE%E8%AE%A1",
        "title": "京東校招筆試題“如何用0.01元買到一瓶可樂(lè)”瓤帚?竟被蘇寧秀了一臉",
        "chinese_tag": "設(shè)計(jì)",
        "source": "小禾科技",
        "group_source": 2,
        "has_gallery": false,
        "media_url": "/c/user/59954335187/",
        "media_avatar_url": "http://p9.pstatp.com/large/39b10003f6cddd5128fa",
        "image_list": [
          {
            "url": "http://p3.pstatp.com/list/3b0f0003c132eb485453"
          },
          {
            "url": "http://p3.pstatp.com/list/3b110000ab4c79a56483"
          },
          {
            "url": "http://p9.pstatp.com/list/3b1600007cde1cf9bdd0"
          }
        ],
        "source_url": "/group/6468140283245625870/",
        "article_genre": "article",
        "is_feed_ad": false,
        "behot_time": 1506323903,
        "comments_count": 87,
        "group_id": "6468140283245625870"
      },
      {
        "chinese_tag": "科技",
        "media_avatar_url": "http://p9.pstatp.com/large/2c6600049c7144303824",
        "is_feed_ad": false,
        "tag_url": "news_tech",
        "title": "為什么家里的WIFI時快時慢?竟然是因為…",
        "single_mode": true,
        "middle_mode": false,
        "abstract": "現(xiàn)在還是個(gè)信息的時(shí)代涩赢,不僅手機(jī)戈次、電腦非常普遍,而且現(xiàn)在的人們都喜歡用無(wú)線網(wǎng)絡(luò)之WiFi筒扒,因?yàn)檫@樣更加便捷怯邪。在家使用手機(jī)的時(shí)候,不用打開(kāi)手機(jī)的數(shù)據(jù)流量花墩,只要使用WiFi就可以了悬秉,無(wú)限的流量使用,太方便了冰蘑。但是很多用戶都會(huì)有這樣的體驗(yàn)和泌,WiFi速度時(shí)快時(shí)慢的,很是煩惱祠肥。",
        "group_source": 2,
        "image_list": [
          {
            "url": "http://p3.pstatp.com/list/3b1600009ba8a7500c7e"
          },
          {
            "url": "http://p1.pstatp.com/list/3b1600009bb32db8a78a"
          },
          {
            "url": "http://p3.pstatp.com/list/3b120000c5dac40ae0fe"
          }
        ],
        "label": [
          "Wi-Fi",
          "科技"
        ],
        "behot_time": 1506323303,
        "source_url": "/group/6468146583144759822/",
        "source": "水電小知識(shí)",
        "more_mode": true,
        "article_genre": "article",
        "image_url": "http://p3.pstatp.com/list/190x124/3b1600009ba8a7500c7e",
        "tag": "news_tech",
        "has_gallery": false,
        "group_id": "6468146583144759822",
        "media_url": "/c/user/61795844218/"
      }
    ],
    "next": {
      "max_behot_time": 1506323303
      }
    }
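
The sample above mixes ordinary articles with an advertisement entry ("is_feed_ad": true, "article_genre": "ad"). As a minimal sketch of how such a response could be filtered before storing it, the helper below keeps only non-ad entries. The function name extract_articles and the trimmed sample payload are illustrative assumptions, not part of the original project; only the field names come from the response shown above.

```python
import json


def extract_articles(payload):
    """Return the non-ad entries of a decoded feed response (hypothetical helper)."""
    # The feed signals success through the "message" field.
    if payload.get("message") != "success":
        return []
    return [
        row for row in payload.get("data", [])
        # Ads are flagged twice in the sample: is_feed_ad and article_genre.
        if not row.get("is_feed_ad") and row.get("article_genre") != "ad"
    ]


# Trimmed, made-up payload that mirrors the structure of the real response.
sample = json.loads("""
{"message": "success",
 "data": [
   {"title": "A", "is_feed_ad": false, "article_genre": "article", "group_id": "1"},
   {"title": "B", "is_feed_ad": true, "article_genre": "ad", "group_id": "2"}
 ],
 "next": {"max_behot_time": 1506323303}}
""")

articles = extract_articles(sample)
print([a["title"] for a in articles])
```

Whether ads should be dropped at all is a design choice; the spider later in this post stores every entry as-is.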
    
  • 2. Analyze the request parameters and how requests repeat:
    • The technology news endpoint uses a GET request with the following query parameters:
      category:news_tech
      utm_source:toutiao
      widen:1
      max_behot_time:0
      max_behot_time_tmp:0
      tadrequire:true
      as:A155493CA8EBB0F
      cp:59C84BEB601F7E1
    
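    • Scroll the page to trigger another asynchronous request and compare the query parameters: only a few of them change. The next->max_behot_time field of the previous response is exactly the value sent as max_behot_time and max_behot_time_tmp. The as and cp parameters have little effect on the GET request, so you can reuse the values captured from any one request. As for max_behot_time, I first took it to be the current timestamp, but since the response already hands us the value there is no need to guess. Packet analysis is often a matter of guessing what API parameters mean, so feel free to verify this yourself: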
      max_behot_time:1506326351
      max_behot_time_tmp:1506326351
      as:A115996C383BD3C
      cp:59C82BAD839CBE1
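
The paging rule described above (feed next->max_behot_time back into both time parameters) can be sketched with the standard library alone. The names API_URL, build_feed_url and next_feed_url are hypothetical, and the as/cp defaults are simply the values captured in the first request above; this is an illustration of the observation, not a documented API contract.

```python
from urllib.parse import urlencode

API_URL = "https://www.toutiao.com/api/pc/feed/"


def build_feed_url(max_behot_time, as_value="A155493CA8EBB0F", cp_value="59C84BEB601F7E1"):
    """Build a feed URL; both time parameters receive the same value."""
    params = [
        ("category", "news_tech"),
        ("utm_source", "toutiao"),
        ("widen", 1),
        ("max_behot_time", max_behot_time),
        ("max_behot_time_tmp", max_behot_time),
        ("tadrequire", "true"),
        # Reused captured values, as the analysis above suggests.
        ("as", as_value),
        ("cp", cp_value),
    ]
    return API_URL + "?" + urlencode(params)


def next_feed_url(previous_response):
    """Derive the next page URL from the previous decoded response."""
    return build_feed_url(previous_response["next"]["max_behot_time"])


first = build_feed_url(0)
following = next_feed_url({"next": {"max_behot_time": 1506323303}})
print(first)
print(following)
```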
    
  • 3. Construct the request URL:
    • The directory structure of the Scrapy project is as follows:
      [Image: project structure diagram]
    • The source of settings.py is as follows:
# -*- coding: utf-8 -*-
# Scrapy settings for todayNews project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'todayNews'

SPIDER_MODULES = ['todayNews.spiders']
NEWSPIDER_MODULE = 'todayNews.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'todayNews (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept':'text/javascript, text/html, application/xml, text/xml, */*',
    'Accept-Encoding':'gzip, deflate, sdch, br',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Cache-Control':'no-cache',
    'Connection':'keep-alive',
    'Content-Type':'application/x-www-form-urlencoded',
    'Cookie':'uuid="w:3db0708ea2c549fab1a5371c56f16176"; UM_distinctid=15c7147fecd8d-0a4277451-4349052c-100200-15c7147fecf6f; csrftoken=af9a5a0d4cd30794e6c04511ca9f31eb; _ga=GA1.2.312467779.1496549163; __guid=32687416.738502311042654200.1505560389379.9048; tt_track_id=c7baa73a99ec9787ead7a2f6b01ff56b; _ba=BA0.2-20170923-51d9e-ErxmsyZIIoxNOzZgf6Us; tt_webid=6427627096743282178; WEATHER_CITY=%E5%8C%97%E4%BA%AC; CNZZDATA1259612802=610804389-1496543540-null%7C1506261975; __tasessionId=0vta7k1uc1506263833592; tt_webid=6427627096743282178',
    'Host':'www.toutiao.com',
    'Pragma':'no-cache',
    'Referer':'https://www.toutiao.com/ch/news_tech/',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'X-Requested-With':'XMLHttpRequest'
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'todayNews.middlewares.TodaynewsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'todayNews.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {  
   'todayNews.pipelines.MongoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
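
# Wait about one second between requests to the same site
DOWNLOAD_DELAY = 1

# MongoDB connection settings read by MongoPipeline
MONGO_URI = "localhost"
MONGO_DATABASE = "toutiao"
MONGO_USER = "username"
MONGO_PASS = "password"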
  • The source of pipelines.py is as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


import pymongo

class MongoPipeline(object):
    """Store every scraped item in a MongoDB collection."""

    collection_name = "science"

    def __init__(self, mongo_uri, mongo_db, mongo_user, mongo_pass):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongo_user = mongo_user
        self.mongo_pass = mongo_pass

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details defined at the bottom of settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE"),
            mongo_user=crawler.settings.get("MONGO_USER"),
            mongo_pass=crawler.settings.get("MONGO_PASS"),
        )

    def open_spider(self, spider):
        # Database.authenticate() was removed in PyMongo 4, so the credentials
        # go to MongoClient; authSource keeps authenticating against mongo_db.
        self.client = pymongo.MongoClient(
            self.mongo_uri,
            username=self.mongo_user,
            password=self.mongo_pass,
            authSource=self.mongo_db,
        )
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db[self.collection_name].update({'url_token': item['url_token']}, {'$set': dict(item)}, True)
        # return item
        # Collection.insert() was removed in PyMongo 4; insert_one() replaces it.
        self.db[self.collection_name].insert_one(dict(item))
        return item
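
The commented-out update call in process_item hints at an upsert variant, which would keep repeated crawls from storing the same article twice. A minimal sketch, assuming group_id is a suitable unique key (it appears in every sample entry above, but the post does not state that it is unique): the hypothetical helper build_upsert only prepares the filter and update documents, which could then be passed to PyMongo's update_one(filter, update, upsert=True).

```python
def build_upsert(item, key="group_id"):
    """Return (filter, update) documents for an upsert, or None without a key.

    Hypothetical helper: the choice of group_id as the key is an assumption.
    """
    doc = dict(item)
    if key not in doc:
        return None
    # Match on the key and overwrite the stored fields with the new ones.
    return {key: doc[key]}, {"$set": doc}


result = build_upsert({"group_id": "6469550301866803469", "source": "example"})
print(result)
```

Inside the pipeline this could be used as: unpack the pair into flt and update, then call self.db[self.collection_name].update_one(flt, update, upsert=True), falling back to insert_one when the helper returns None.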
  • The source of toutiao.py is as follows:
# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request


class ToutiaoSpider(Spider):
    name = "toutiao"
    allowed_domains = ["www.toutiao.com"]
    # First request; the as/cp values were captured from the browser.
    start_urls = ['https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=A1D5394CB72C38F&cp=59C71C03883F0E1']
    # Template for follow-up pages; both time parameters take next->max_behot_time.
    url = 'https://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time={behot_time}&max_behot_time_tmp={behot_time_tmp}&tadrequire=true&as=A165E92C97CC487&cp=59C74CC4E8F7BE1'

    def parse(self, response):
        jsonData = json.loads(response.text)
        # Check the status first so a failed response cannot raise KeyError.
        if jsonData.get("message") != 'success':
            self.logger.info("The Data is null")
            return
        # Each entry is yielded as a plain dict, so no Item class is needed.
        for rowData in jsonData.get("data", []):
            yield rowData
        # Request the next page using the timestamp returned by this one.
        nextTime = jsonData.get("next", {}).get("max_behot_time")
        if nextTime is not None:
            yield Request(url=self.url.format(behot_time=nextTime, behot_time_tmp=nextTime), callback=self.parse)
      
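  • items.py would normally define the structured fields to extract. Because the JSON returned by Toutiao is not uniform (see the sample data above), no item fields are defined here; the raw entries are passed straight to the pipeline and stored in MongoDB.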
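
If a uniform shape is wanted later, the irregularities visible in the sample data can be smoothed over in plain Python. The sketch below is one possible approach, not what the project does: FIELDS and normalize are hypothetical names, and the field selection is an arbitrary illustration. It handles two differences seen in the sample: ads carry "comment_count" where articles carry "comments_count", and "label" is a list for articles but a single string for ads.

```python
# Fields copied through unchanged; missing ones become None.
FIELDS = ("group_id", "title", "abstract", "source", "tag",
          "chinese_tag", "behot_time", "source_url")


def normalize(row):
    """Map one raw feed entry onto a fixed set of keys (hypothetical helper)."""
    record = {name: row.get(name) for name in FIELDS}
    # Articles use "comments_count", the ad entry uses "comment_count".
    record["comments_count"] = row.get("comments_count", row.get("comment_count", 0))
    # "label" is a list for articles but a plain string for the ad entry.
    label = row.get("label", [])
    record["label"] = [label] if isinstance(label, str) else list(label)
    return record


article = normalize({"title": "t1", "label": ["a", "b"], "comments_count": 10})
ad = normalize({"title": "t2", "label": "ad", "comment_count": 3})
print(article["label"], article["comments_count"])
print(ad["label"], ad["comments_count"])
```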
  • 4. Start the crawler (scrapy crawl toutiao, run from the project root) and inspect the scraped data
    [Screenshot: the saved data]
Done.