Today I wrote a crawler to scrape content from a well-known domestic football website.
First, create the project:
scrapy startproject dongXXXdi
Then generate a spider:
scrapy genspider DQD "dongXXXdi.com"
This produces the following directory layout:
(project directory structure)
I won't analyze the page structure in detail here; see my previous post on debugging with Chrome's Network panel.
Straight to the code. First, spider.py:
# -*- coding: utf-8 -*-
import json

import scrapy


class DqdSpider(scrapy.Spider):
    name = "DQD"
    allowed_domains = ["dongqiudi.com"]
    start_urls = ['http://dongqiudi.com/archives/1?page=1']

    def parse(self, response):
        # The archive endpoint returns JSON rather than HTML
        text = json.loads(response.text)
        for data in text['data']:
            yield data
        # Just crawl the first 50 pages for now; Scrapy's duplicate
        # filter drops the repeated requests yielded on later pages
        for i in range(2, 50):
            new_url = "http://dongqiudi.com/archives/1?page={}".format(i)
            yield scrapy.Request(url=new_url, callback=self.parse)
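The parsing step is easy to sanity-check offline. The sketch below simulates it against a hypothetical sample payload, assuming the endpoint returns a body shaped like {"data": [...]} (the sample records are made up for illustration):

```python
import json

# Hypothetical sample of what the archive endpoint returns
sample_body = json.dumps({
    "data": [
        {"id": 1, "title": "Match report"},
        {"id": 2, "title": "Transfer news"},
    ]
})

def extract_items(body):
    # Same logic as DqdSpider.parse, minus the follow-up Requests
    payload = json.loads(body)
    return list(payload["data"])

items = extract_items(sample_body)
print(len(items))          # 2
print(items[0]["title"])   # Match report
```

If the site ever changes the payload shape, this is the one place the spider would break.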
Next, items.py:
import scrapy


class DongqiudiItem(scrapy.Item):
    # define the fields for your item here like:
    id = scrapy.Field()
    title = scrapy.Field()
    discription = scrapy.Field()
    user_id = scrapy.Field()
    type = scrapy.Field()
    display_time = scrapy.Field()
    thumb = scrapy.Field()
    comments_total = scrapy.Field()
    web_url = scrapy.Field()
    official_account = scrapy.Field()
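Note that the spider currently yields the raw dicts from the JSON, so the Item above isn't actually used. One way to wire it up would be to copy only the declared fields before yielding — a sketch (the field list mirrors DongqiudiItem, and the helper name is my own):

```python
# Field names declared in DongqiudiItem
FIELDS = ["id", "title", "discription", "user_id", "type",
          "display_time", "thumb", "comments_total", "web_url",
          "official_account"]

def to_item_dict(data):
    # Keep only the declared fields, defaulting missing ones to None;
    # extra keys from the API response are dropped
    return {key: data.get(key) for key in FIELDS}

row = to_item_dict({"id": 7, "title": "Derby preview", "extra": "dropped"})
print(row["id"])        # 7
print("extra" in row)   # False
```

In parse, one could then yield DongqiudiItem(**to_item_dict(data)) instead of the raw dict.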
Then pipelines.py, which stores every item as a line of JSON:
import json


class DongqiudiPipeline(object):
    def process_item(self, item, spider):
        # dict(item) makes the Item JSON-serializable; utf-8 plus
        # ensure_ascii=False keeps the Chinese text readable on disk
        with open("DQD.json", "a", encoding="utf-8") as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
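Reopening the file for every item works but is wasteful. A common refinement is to open it once per crawl using Scrapy's open_spider/close_spider pipeline hooks — a sketch (class name is my own; the hooks themselves are standard Scrapy pipeline methods):

```python
import json

class JsonLinesPipeline(object):
    # Scrapy calls open_spider once when the crawl starts
    def open_spider(self, spider):
        self.file = open("DQD.json", "a", encoding="utf-8")

    # ...and close_spider once when it ends
    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Swapping this in only requires pointing ITEM_PIPELINES at the new class.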
Finally, settings.py:
BOT_NAME = 'dongqiudi'

SPIDER_MODULES = ['dongqiudi.spiders']
NEWSPIDER_MODULE = 'dongqiudi.spiders'

ROBOTSTXT_OBEY = False  # ignore robots.txt

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36',
}

ITEM_PIPELINES = {
    'dongqiudi.pipelines.DongqiudiPipeline': 300,
}
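As an aside, if you are on a newer Scrapy (2.1 or later, if I recall correctly), the built-in feed exporter can write JSON Lines directly via the FEEDS setting, which would make the custom pipeline unnecessary — a configuration sketch:

```python
# Alternative to the custom pipeline: let Scrapy's feed exporter
# write one JSON object per line (Scrapy >= 2.1)
FEEDS = {
    "DQD.jl": {
        "format": "jsonlines",
        "encoding": "utf8",
    },
}
```

I kept the hand-written pipeline here since it makes the serialization explicit.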
That's all the code needed. Now let's look at the crawl results.
(crawl results)