Target: news listings and article content from the School of Public Administration, Sichuan University
Crawling rule: locate elements with CSS selectors
Collection process
Activate and enter the virtual environment
Create the project
Edit the items.py file
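The two steps above can be sketched as shell commands (a minimal sketch, assuming a virtual environment named `venv` with Scrapy already installed; the names are illustrative, not taken from the original setup):

```shell
# Create and activate a virtual environment (name "venv" is illustrative)
python -m venv venv
source venv/bin/activate

# Generate the Scrapy project skeleton, then enter it
scrapy startproject ggnews
cd ggnews
```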
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GgnewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()
    img = scrapy.Field()
Write the spider
import scrapy

from ggnews.items import GgnewsItem


class GgnewsSpider(scrapy.Spider):
    name = "spidernews"
    start_urls = [
        'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1',
    ]

    def parse(self, response):
        # Follow every article link on the listing page
        for href in response.css('div.pb30.mb30 div.right_info.p20.bgf9 ul.index_news_ul.dn li a.fl::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse2)

        # Read the current page number from the pager, then request the next page
        next_page = response.css('div.w100p div.px_box.w1000.auto.ovh.cf div.pb30.mb30 div.mobile_pager.dn li.c::text').extract_first()
        if next_page is not None:
            next_url = int(next_page) + 1
            next_urls = '?c=special&sid=1&page=%s' % next_url
            print(next_urls)
            next_urls = response.urljoin(next_urls)
            yield scrapy.Request(next_urls, callback=self.parse)

    def parse2(self, response):
        items = []
        for new in response.css('div.w1000.auto.cf div.w780.pb30.mb30.fr div.right_info.p20'):
            item = GgnewsItem()
            # No trailing commas here: a trailing comma after an assignment
            # would wrap each value in a one-element tuple
            item['title'] = new.css('div.detail_zy_title h1::text').extract_first()
            item['time'] = new.css('div.detail_zy_title p::text').extract_first()
            item['content'] = new.css('div.detail_zy_c.pb30.mb30 p span::text').extract()
            item['img'] = new.css('div.detail_zy_c.pb30.mb30 p.MsoNormal img::attr(src)').extract()
            items.append(item)
        return items
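The pagination step above relies on URL joining: a relative URL beginning with `?` replaces the query string of the current page. Scrapy's `response.urljoin` delegates to the standard library's `urljoin`, so the logic can be checked without running the spider (the URL and query-string format are taken from the spider above; the page number is an illustrative value):

```python
from urllib.parse import urljoin

# Current listing page and the page number the spider would scrape from the pager
base = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1'
next_page = '2'  # illustrative value for the pager text

# Build the relative URL for the following page, then resolve it against the base
next_url = int(next_page) + 1
next_urls = '?c=special&sid=1&page=%s' % next_url
full_url = urljoin(base, next_urls)
print(full_url)  # http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=3
```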
Move the spider file into the spiders folder
Run the spider
scrapy crawl spidernews -o spidernews.xml
(The first few runs kept failing with ImportError: No module named items. Searching online showed the cause: a .py file in the spiders directory must not share the project's name, so the spider file was renamed.)
scrapy crawl spidernews -o spidernews.json
The data is obtained.
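As a sketch of consuming the result: `scrapy crawl spidernews -o spidernews.json` writes a JSON array with one object per item, using the fields declared in GgnewsItem. It can be loaded with the standard library (the record below is an illustrative stand-in, not real scraped data):

```python
import json

# Illustrative sample of the structure of spidernews.json -- one object per
# item, with the GgnewsItem fields; the values here are made up
sample = '''[
  {"title": "Example headline",
   "time": "2017-05-01",
   "content": ["First paragraph.", "Second paragraph."],
   "img": ["/uploads/example.jpg"]}
]'''

items = json.loads(sample)
for item in items:
    print(item['title'], '-', item['time'])
```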