Toutiao (今日頭條) Crawler

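I have been learning Python's Scrapy framework lately and have written quite a few small examples. As the saying goes, a craftsman who wants to do good work must first sharpen his tools. Today's post walks through crawling the technology section of Toutiao (今日頭條) to give that tool some practice.
This tutorial depends on the scrapy and pymongo modules, so install both before you start.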

1. Analyzing the Toutiao news API

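Sites like Toutiao fetch their JSON data asynchronously via AJAX, so the usual approach of waiting for the page to render and then extracting data from it falls short. Instead, we capture and analyze the site's network traffic directly in the browser.
Open a browser and visit Toutiao's technology news section at http://www.toutiao.com/ch/news_tech/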

Right-click and choose Inspect Element to analyze the page's network requests. Tick the checkbox marked by the red arrow so that the network request log is recorded, then reload the page.

[Screenshot: 360截圖20170925161331582.jpg]
Going through the recorded requests one by one, you can see that the request to http://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A155493CA8EBB0F&cp=59C84BEB601F7E1 returns JSON data.

[Image: 今日頭條]
The returned data has the following format:

{
"has_more": false,
"message": "success",
"data": [
  {
    "chinese_tag": "財經",
    "media_avatar_url": "http://p3.pstatp.com/large/1233000741099c9f4a59",
    "is_feed_ad": false,
    "tag_url": "news_finance",
    "title": "【特寫】數字貨幣的信徒們",
    "single_mode": true,
    "middle_mode": true,
    "abstract": "在九月初在中國發文整治ICO后，硅谷的區塊鏈項目創業者林嚇洪把籌集的資金全部還給了中國投資者們。在那次整治中，監管部門宣布，首次代幣發行(Initial Coin Offering，簡稱ICO)屬于非法行為，所有平臺必須返還籌集的資金。",
    "tag": "news_finance",
    "label": [
      "數字貨幣",
      "風投",
      "比特幣",
      "投資",
      "經濟"
    ],
    "behot_time": 1506326903,
    "source_url": "/group/6469550301866803469/",
    "source": "界面新聞",
    "more_mode": false,
    "article_genre": "article",
    "image_url": "http://p1.pstatp.com/list/190x124/317200041ea1cf451f52",
    "has_gallery": false,
    "group_source": 1,
    "comments_count": 10,
    "group_id": "6469550301866803469",
    "media_url": "/c/user/52857496566/"
  },
  {
    "image_url": "http://p3.pstatp.com/list/190x124/31770009f2c887fdb867",
    "single_mode": true,
    "abstract": "早，來看看今天的新聞。小米就校招風波道歉@DoNews【小米就校招風波道歉 對涉事員工通報批評】近日，一名自稱在河南鄭州大學日語專業學習的大學生表示，她與同學在一次校園招聘宣講會上無故被來自小米公司的主管人員諷刺。導致自己和本專業的同學憤然離開。",
    "middle_mode": false,
    "more_mode": true,
    "tag": "news_tech",
    "label": [
      "小米科技",
      "亞馬遜公司",
      "Uber",
      "美國",
      "樂視"
    ],
    "tag_url": "news_tech",
    "title": "小米就校招風波道歉；ofo正尋求新一輪融資",
    "chinese_tag": "科技",
    "source": "虎嗅APP",
    "group_source": 1,
    "has_gallery": false,
    "media_url": "/c/user/3358265611/",
    "media_avatar_url": "http://p2.pstatp.com/large/18a50010126f235bf938",
    "image_list": [
      {
        "url": "http://p3.pstatp.com/list/31770009f2c887fdb867"
      },
      {
        "url": "http://p1.pstatp.com/list/317b00061c410d6d0352"
      },
      {
        "url": "http://p3.pstatp.com/list/3172000337e0332b337f"
      }
    ],
    "source_url": "/group/6469472579270672654/",
    "article_genre": "article",
    "is_feed_ad": false,
    "behot_time": 1506326303,
    "comments_count": 114,
    "group_id": "6469472579270672654"
  },
  {
    "image_url": "http://p3.pstatp.com/list/190x124/3c64000074857b07c81d",
    "single_mode": true,
    "abstract": "藍燕，經常關注香港電影的人應該不陌生，在2011年靠著香港三級影片《3D肉蒲團之極樂寶鑒》走紅，并逐漸出現人們的視線中。被稱為新一代的“艷星”。可走紅后的她并沒有獲得很好的資源，所接拍的影片大多數是一些不知名的配角。",
    "middle_mode": false,
    "more_mode": true,
    "tag": "news_entertainment",
    "label": [
      "藍燕 ",
      "肉蒲團",
      "投資",
      "娛樂"
    ],
    "tag_url": "news_entertainment",
    "title": "艷星藍燕美照曝光 靠著《3D肉蒲團》走紅",
    "chinese_tag": "娛樂",
    "source": "陪你樂不停",
    "group_source": 2,
    "has_gallery": false,
    "media_url": "/c/user/61497461135/",
    "media_avatar_url": "http://p3.pstatp.com/large/382f000f5dd459d0eb74",
    "image_list": [
      {
        "url": "http://p3.pstatp.com/list/3c64000074857b07c81d"
      },
      {
        "url": "http://p3.pstatp.com/list/3c6000022fcec3f4ca48"
      },
      {
        "url": "http://p3.pstatp.com/list/3c60000230155491a84d"
      }
    ],
    "source_url": "/group/6469578595697164813/",
    "article_genre": "article",
    "is_feed_ad": false,
    "behot_time": 1506325703,
    "comments_count": 2,
    "group_id": "6469578595697164813"
  },
  {
    "log_extra": "{\"ad_price\":\"Wci5d__iJRJZyLl3_-IlEuQYjwGdUeJEIl99Ew\",\"convert_id\":0,\"external_action\":0,\"req_id\":\"201709251608231720180471641841E3\",\"rit\":1}",
    "image_url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52",
    "read_count": 12196,
    "ban_comment": 1,
    "single_mode": true,
    "abstract": "",
    "image_list": [],
    "has_video": false,
    "article_type": 1,
    "tag": "ad",
    "display_info": "股市迎來重磅利好消息，這些股或將上漲翻倍，微信領取",
    "has_m3u8_video": 0,
    "label": "廣告",
    "user_verified": 0,
    "aggr_type": 1,
    "expire_seconds": 314754930,
    "cell_type": 0,
    "article_sub_type": 0,
    "group_flags": 4096,
    "bury_count": 0,
    "title": "股市迎來重磅利好消息，這些股或將上漲翻倍，微信領取",
    "ignore_web_transform": 1,
    "source_icon_style": 3,
    "tip": 0,
    "hot": 0,
    "share_url": "http://m.toutiao.com/group/6465452273144168717/?iid=0&app=news_article",
    "has_mp4_video": 0,
    "source": "聯訊證券",
    "comment_count": 0,
    "article_url": "http://cq3.ilyae.cn/toutiao2/index.html",
    "filter_words": [
      {
        "id": "1:74",
        "name": "股票",
        "is_selected": false
      },
      {
        "id": "1:6",
        "name": "金融保險",
        "is_selected": false
      },
      {
        "id": "2:0",
        "name": "來源:聯訊證券",
        "is_selected": false
      },
      {
        "id": "4:2",
        "name": "看過了",
        "is_selected": false
      }
    ],
    "has_gallery": false,
    "publish_time": 1505355414,
    "ad_id": 69048936405,
    "action_list": [
      {
        "action": 1,
        "extra": {},
        "desc": ""
      },
      {
        "action": 3,
        "extra": {},
        "desc": ""
      },
      {
        "action": 7,
        "extra": {},
        "desc": ""
      },
      {
        "action": 9,
        "extra": {},
        "desc": ""
      }
    ],
    "has_image": false,
    "cell_layout_style": 1,
    "tag_id": 6465452273144168717,
    "source_url": "http://cq3.ilyae.cn/toutiao2/index.html",
    "video_style": 0,
    "verified_content": "",
    "is_feed_ad": true,
    "large_image_list": [],
    "item_id": 6465452273144168717,
    "natant_level": 2,
    "tag_url": "search/?keyword=None",
    "article_genre": "ad",
    "level": 0,
    "cell_flag": 10,
    "source_open_url": "sslocal://search?from=channel_source&keyword=%E8%81%94%E8%AE%AF%E8%AF%81%E5%88%B8",
    "display_url": "http://cq3.ilyae.cn/toutiao2/index.html",
    "digg_count": 0,
    "behot_time": 1506325103,
    "article_alt_url": "http://m.toutiao.com/group/article/6465452273144168717/",
    "cursor": 1506325103999,
    "url": "http://cq3.ilyae.cn/toutiao2/index.html",
    "preload_web": 0,
    "ad_label": "廣告",
    "user_repin": 0,
    "label_style": 3,
    "item_version": 0,
    "group_id": "6465452273144168717",
    "middle_image": {
      "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52",
      "width": 456,
      "url_list": [
        {
          "url": "http://p3.pstatp.com/large/26c00009898dbc9c5a52"
        },
        {
          "url": "http://pb9.pstatp.com/large/26c00009898dbc9c5a52"
        },
        {
          "url": "http://pb1.pstatp.com/large/26c00009898dbc9c5a52"
        }
      ],
      "uri": "large/26c00009898dbc9c5a52",
      "height": 256
    }
  },
  {
    "image_url": "http://p3.pstatp.com/list/190x124/3b050002710aff2b3422",
    "single_mode": true,
    "abstract": "如今2017年微信的月活躍用戶達9億，微信成了中國最大用戶群體的手機APP，它集通訊、娛樂、支付等于一體。很多朋友習慣每天打開微信收發信息、查看朋友圈動態。",
    "middle_mode": false,
    "more_mode": true,
    "tag": "news_tech",
    "label": [
      "移動互聯網",
      "微信",
      "澤西島",
      "美女",
      "歐洲"
    ],
    "tag_url": "news_tech",
    "title": "為什么微信中那么多美女來自安道爾或澤西島？這是一種暗語嗎",
    "chinese_tag": "科技",
    "source": "獅子夜光杯",
    "group_source": 2,
    "has_gallery": false,
    "media_url": "/c/user/53397416061/",
    "media_avatar_url": "http://p3.pstatp.com/large/12330013573aaa4c18b1",
    "image_list": [
      {
        "url": "http://p3.pstatp.com/list/3b050002710aff2b3422"
      },
      {
        "url": "http://p3.pstatp.com/list/3b05000271096e15298e"
      },
      {
        "url": "http://p9.pstatp.com/list/3b080000bdf469bf7330"
      }
    ],
    "source_url": "/group/6467319367565574670/",
    "article_genre": "article",
    "is_feed_ad": false,
    "behot_time": 1506324503,
    "comments_count": 46,
    "group_id": "6467319367565574670"
  },
  {
    "image_url": "http://p3.pstatp.com/list/190x124/3b0f0003c132eb485453",
    "single_mode": true,
    "abstract": "最近幾周，各大互聯網科技公司都開始秋季招聘了這些是正經的公司的招聘筆試題：關于c++的inline關鍵字,以下說法正確的是()對N個數進行排序,在各自最優條件下以下算法復雜度最低的是()為百度設計一款新產品，可以結合百度現有的優勢和資源，專注解決大學生用戶的某個需求痛點，請給出主",
    "middle_mode": false,
    "more_mode": true,
    "tag": "news_design",
    "label": [
      "電子商務",
      "京東",
      "面試",
      "劉強東",
      "計算復雜性理論"
    ],
    "tag_url": "search/?keyword=%E8%AE%BE%E8%AE%A1",
    "title": "京東校招筆試題“如何用0.01元買到一瓶可樂”？竟被蘇寧秀了一臉",
    "chinese_tag": "設計",
    "source": "小禾科技",
    "group_source": 2,
    "has_gallery": false,
    "media_url": "/c/user/59954335187/",
    "media_avatar_url": "http://p9.pstatp.com/large/39b10003f6cddd5128fa",
    "image_list": [
      {
        "url": "http://p3.pstatp.com/list/3b0f0003c132eb485453"
      },
      {
        "url": "http://p3.pstatp.com/list/3b110000ab4c79a56483"
      },
      {
        "url": "http://p9.pstatp.com/list/3b1600007cde1cf9bdd0"
      }
    ],
    "source_url": "/group/6468140283245625870/",
    "article_genre": "article",
    "is_feed_ad": false,
    "behot_time": 1506323903,
    "comments_count": 87,
    "group_id": "6468140283245625870"
  },
  {
    "chinese_tag": "科技",
    "media_avatar_url": "http://p9.pstatp.com/large/2c6600049c7144303824",
    "is_feed_ad": false,
    "tag_url": "news_tech",
    "title": "為什么家里的WIFI時快時慢？竟然是因為…",
    "single_mode": true,
    "middle_mode": false,
    "abstract": "現在還是個信息的時代，不僅手機、電腦非常普遍，而且現在的人們都喜歡用無線網絡之WiFi，因為這樣更加便捷。在家使用手機的時候，不用打開手機的數據流量，只要使用WiFi就可以了，無限的流量使用，太方便了。但是很多用戶都會有這樣的體驗，WiFi速度時快時慢的，很是煩惱。",
    "group_source": 2,
    "image_list": [
      {
        "url": "http://p3.pstatp.com/list/3b1600009ba8a7500c7e"
      },
      {
        "url": "http://p1.pstatp.com/list/3b1600009bb32db8a78a"
      },
      {
        "url": "http://p3.pstatp.com/list/3b120000c5dac40ae0fe"
      }
    ],
    "label": [
      "Wi-Fi",
      "科技"
    ],
    "behot_time": 1506323303,
    "source_url": "/group/6468146583144759822/",
    "source": "水電小知識",
    "more_mode": true,
    "article_genre": "article",
    "image_url": "http://p3.pstatp.com/list/190x124/3b1600009ba8a7500c7e",
    "tag": "news_tech",
    "has_gallery": false,
    "group_id": "6468146583144759822",
    "media_url": "/c/user/61795844218/"
  }
],
"next": {
  "max_behot_time": 1506323303
}
}
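
The sample above mixes ordinary articles with an advertisement (the entry whose `is_feed_ad` is true), and the entries do not all carry the same keys. As a minimal sketch, assuming the response has already been parsed into a Python dict, the helper below keeps only the non-ad entries and reads fields defensively with `dict.get`. The name `article_titles` and the shortened sample values are my own, chosen for illustration only.

```python
import json


def article_titles(feed):
    """Return (title, source_url) pairs for the non-ad entries of a parsed feed."""
    pairs = []
    for entry in feed.get("data", []):
        # Advertisements are flagged with is_feed_ad = true in the sample response
        if entry.get("is_feed_ad"):
            continue
        # Not every entry has every key, so use .get() instead of indexing
        pairs.append((entry.get("title"), entry.get("source_url")))
    return pairs


# A trimmed-down stand-in for the response shown above (values shortened)
sample = json.loads("""
{
  "message": "success",
  "data": [
    {"title": "article A", "source_url": "/group/1/", "is_feed_ad": false},
    {"title": "some ad", "source_url": "http://example.com/", "is_feed_ad": true},
    {"title": "article B", "source_url": "/group/2/"}
  ],
  "next": {"max_behot_time": 1506323303}
}
""")

print(article_titles(sample))
```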

2. Analyzing the request parameters and how the requests repeat:
- The technology news endpoint uses a GET request with the following query parameters:
```
  category:news_tech
  utm_source:toutiao
  widen:1
  max_behot_time:0
  max_behot_time_tmp:0
  tadrequire:true
  as:A155493CA8EBB0F
  cp:59C84BEB601F7E1
```
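
To make the parameter list concrete, here is a small sketch that rebuilds the captured URL from those parameters using the standard library's `urlencode`. The function name `build_feed_url` and the constant `FEED_ENDPOINT` are hypothetical; the default `as` and `cp` values are simply the ones from the capture above, and whether the server still accepts them is not something this sketch can promise.

```python
from urllib.parse import urlencode

# Endpoint taken from the captured request; the name is illustrative
FEED_ENDPOINT = "http://www.toutiao.com/api/pc/feed/"


def build_feed_url(max_behot_time=0, as_="A155493CA8EBB0F", cp="59C84BEB601F7E1"):
    """Assemble the feed URL in the same parameter order as the captured request."""
    params = [
        ("category", "news_tech"),
        ("utm_source", "toutiao"),
        ("widen", 1),
        # Both time parameters carry the same value in the captured requests
        ("max_behot_time", max_behot_time),
        ("max_behot_time_tmp", max_behot_time),
        ("tadrequire", "true"),
        ("as", as_),
        ("cp", cp),
    ]
    return FEED_ENDPOINT + "?" + urlencode(params)


first_url = build_feed_url()
print(first_url)
```

Using a list of pairs rather than a dict keeps the parameter order identical to the capture, which makes it easy to compare the generated URL against what the browser sent.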
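- Scroll the page to trigger another asynchronous request and look at its parameters: only a few query parameters change. The next->max_behot_time field in the previous response is exactly the value used for max_behot_time and max_behot_time_tmp in the following request. The as and cp parameters have little effect on the GET request, so you can simply reuse the values captured from any one request. As for max_behot_time, I take it to be a timestamp; since the response already hands us the value, there is no need to guess further. Packet-capture analysis is often a matter of guessing what API parameters mean, and you are welcome to verify this yourself: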
```
  max_behot_time:1506326351
  max_behot_time_tmp:1506326351
  as:A115996C383BD3C
  cp:59C82BAD839CBE1
```
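
One way to check the timestamp guess is to interpret the captured value as Unix seconds and see whether the resulting date is plausible. The sketch below does that and also shows how the next page's time value could be read from a parsed response. Both function names are hypothetical, and treating the value as Unix seconds is the assumption being tested, not an established fact about the API.

```python
from datetime import datetime, timezone


def behot_time_to_utc(value):
    """Interpret a max_behot_time value as Unix seconds and return a UTC datetime."""
    return datetime.fromtimestamp(value, tz=timezone.utc)


def next_behot_time(feed):
    """Return next->max_behot_time from a parsed response, or None if it is absent."""
    return feed.get("next", {}).get("max_behot_time")


# Value taken from the second captured request above
print(behot_time_to_utc(1506326351).isoformat())
# Value taken from the "next" object of the sample response
print(next_behot_time({"next": {"max_behot_time": 1506323303}}))
```

The first value decodes to 2017-09-25 07:59:11 UTC, which is 15:59:11 in China Standard Time on the date embedded in the screenshot filename, so reading it as a Unix timestamp looks consistent.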
3. Constructing the request URL:
- The Scrapy project's directory layout is shown below:
  
  [Image: project directory structure]
- settings.py is as follows:

# -*- coding: utf-8 -*-
# Scrapy settings for todayNews project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'todayNews'

SPIDER_MODULES = ['todayNews.spiders']
NEWSPIDER_MODULE = 'todayNews.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'todayNews (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept':'text/javascript, text/html, application/xml, text/xml, */*',
    'Accept-Encoding':'gzip, deflate, sdch, br',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Cache-Control':'no-cache',
    'Connection':'keep-alive',
    'Content-Type':'application/x-www-form-urlencoded',
    'Cookie':'uuid="w:3db0708ea2c549fab1a5371c56f16176"; UM_distinctid=15c7147fecd8d-0a4277451-4349052c-100200-15c7147fecf6f; csrftoken=af9a5a0d4cd30794e6c04511ca9f31eb; _ga=GA1.2.312467779.1496549163; __guid=32687416.738502311042654200.1505560389379.9048; tt_track_id=c7baa73a99ec9787ead7a2f6b01ff56b; _ba=BA0.2-20170923-51d9e-ErxmsyZIIoxNOzZgf6Us; tt_webid=6427627096743282178; WEATHER_CITY=%E5%8C%97%E4%BA%AC; CNZZDATA1259612802=610804389-1496543540-null%7C1506261975; __tasessionId=0vta7k1uc1506263833592; tt_webid=6427627096743282178',
    'Host':'www.toutiao.com',
    'Pragma':'no-cache',
    'Referer':'https://www.toutiao.com/ch/news_tech/',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'X-Requested-With':'XMLHttpRequest'
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'todayNews.middlewares.TodaynewsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'todayNews.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {  
   'todayNews.pipelines.MongoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
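# Wait 1 second between consecutive requests to the same site
DOWNLOAD_DELAY = 1
# MongoDB connection settings read by MongoPipeline.from_crawler()
MONGO_URI = "localhost"
MONGO_DATABASE = "toutiao"
MONGO_USER = "username"
MONGO_PASS = "password"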

pipelines.py is as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):
    """Store every scraped item in a MongoDB collection."""

    collection_name = "science"

    def __init__(self, mongo_uri, mongo_db, mongo_user, mongo_pass):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongo_user = mongo_user
        self.mongo_pass = mongo_pass

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
            mongo_user=crawler.settings.get("MONGO_USER"),
            mongo_pass=crawler.settings.get("MONGO_PASS"),
        )

    def open_spider(self, spider):
        # Database.authenticate() was removed in PyMongo 4, so the credentials
        # are passed to MongoClient instead; authSource keeps authentication
        # against the target database, as the original call did.
        self.client = pymongo.MongoClient(
            self.mongo_uri,
            username=self.mongo_user,
            password=self.mongo_pass,
            authSource=self.mongo_db,
        )
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Alternative: upsert on a unique key instead of inserting blindly
        # self.db[self.collection_name].update_one({'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)
        # return item
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.db[self.collection_name].insert_one(dict(item))
        return item

toutiao.py is as follows:

# -*- coding: utf-8 -*-
import json

from scrapy import Request, Spider


class ToutiaoSpider(Spider):
    name = "toutiao"
    allowed_domains = ["www.toutiao.com"]
    # First request: a feed URL captured in the browser
    start_urls = ['https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&as=A1D5394CB72C38F&cp=59C71C03883F0E1']
    # Template for follow-up requests; both time parameters take next->max_behot_time
    url = 'https://www.toutiao.com/api/pc/feed/?category=news_tech&utm_source=toutiao&widen=1&max_behot_time={behot_time}&max_behot_time_tmp={behot_time_tmp}&tadrequire=true&as=A165E92C97CC487&cp=59C74CC4E8F7BE1'

    def parse(self, response):
        jsonData = json.loads(response.body.decode("utf-8"))
        if jsonData.get("message") == 'success':
            # Each entry in "data" is yielded as a plain dict item
            for rowData in jsonData["data"]:
                yield rowData
            # Read the paging value only after the success check, so a failed
            # response without a "next" object does not raise KeyError
            nextTime = jsonData["next"]["max_behot_time"]
            yield Request(url=self.url.format(behot_time=nextTime, behot_time_tmp=nextTime), callback=self.parse)
        else:
            self.logger.info("The Data is null")

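items.py normally defines the structured fields to extract. Because the JSON returned by Toutiao is not uniform (see the data shown above), I did not define any item fields. Instead, each raw entry is passed straight to the pipeline, which saves it to MongoDB.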
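
If a more uniform shape is wanted later, the irregularities visible in the sample can be smoothed out before saving. The sketch below is only one possible approach: `normalize_entry` is a hypothetical helper, and the set of fields it keeps is my own choice rather than anything the API prescribes. It handles two differences seen above: `label` is a list for articles but a plain string for the ad, and the comment count appears as `comments_count` on articles but `comment_count` on the ad.

```python
def normalize_entry(entry):
    """Map one raw feed entry to a small dict with a fixed set of keys."""
    label = entry.get("label", [])
    # The ad in the sample carries a plain string label; articles carry a list
    if isinstance(label, str):
        label = [label]
    # Articles use "comments_count"; the ad in the sample uses "comment_count"
    comments = entry.get("comments_count", entry.get("comment_count", 0))
    return {
        "group_id": str(entry.get("group_id", "")),
        "title": entry.get("title", ""),
        "tag": entry.get("tag", ""),
        # Some labels in the sample have trailing spaces, so strip them
        "labels": [text.strip() for text in label],
        "comments": comments,
        "is_ad": bool(entry.get("is_feed_ad", False)),
    }


# Shortened stand-in for the ad entry in the sample response
raw_ad = {
    "group_id": "6465452273144168717",
    "title": "ad title",
    "tag": "ad",
    "label": "ad label",
    "comment_count": 0,
    "is_feed_ad": True,
}
print(normalize_entry(raw_ad))
```

A function like this could be called inside `process_item` before the insert, although storing the raw entries, as this tutorial does, keeps every field the API returned.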
4. Running the crawler and viewing the scraped data

[Image: the saved data]

Done.

Last edited: 2017.12.10 17:35:22
