This article assumes you already know the Scrapy framework and have completed an introductory tutorial.
Create the project
scrapy startproject tutorial
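The command generates a project skeleton; on recent Scrapy versions it looks roughly like this (file names may vary slightly between versions):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py
        settings.py
        spiders/          # spiders go here
            __init__.py
```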
Define the Item as follows:
import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    movieInfo = scrapy.Field()
    star = scrapy.Field()
    quote = scrapy.Field()
Write the spider
The code is as follows:
import scrapy

class Douban(scrapy.Spider):
    name = "douban"
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        print(response.body)
Give it a run:
scrapy crawl douban
INFO: Closing spider (finished) in the log means the spider ran successfully and shut itself down.
Create an entry point
Create a file main.py with the following content:
from scrapy import cmdline
cmdline.execute("scrapy crawl douban".split())
DEBUG 1: HTTP status code is not handled or not allowed
DEBUG: Ignoring response <403 http://movie.douban.com/top250>: HTTP status code is not handled or not allowed
Answer: the request was blocked by the site. Add a USER_AGENT in settings.py:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
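If you would rather keep the override next to the spider instead of in settings.py, Scrapy also supports per-spider overrides via the custom_settings class attribute; a sketch:

```python
import scrapy

class Douban(scrapy.Spider):
    name = "douban"
    # per-spider override; takes precedence over the project's settings.py
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) '
                       'AppleWebKit/536.5 (KHTML, like Gecko) '
                       'Chrome/19.0.1084.54 Safari/536.5'),
    }
```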
Extract the data
OK, now that we can fetch the page, the next step is pulling the data out of it.
Import the packages:
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request

from tutorial.items import TutorialItem


class Douban(scrapy.Spider):
    name = "douban250"
    start_urls = ['http://movie.douban.com/top250']
    url = 'http://movie.douban.com/top250'

    def parse(self, response):
        selector = Selector(response)
        movies = selector.xpath('//div[@class="info"]')
        for eachMovie in movies:
            # create a fresh item for every movie, not one shared instance
            item = TutorialItem()
            title = eachMovie.xpath('div[@class="hd"]/a/span/text()').extract()
            fullTitle = ''.join(title)
            movieInfo = eachMovie.xpath('div[@class="bd"]/p/text()').extract()
            star = eachMovie.xpath(
                'div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()'
            ).extract()[0]
            quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            quote = quote[0] if quote else ''
            item['title'] = fullTitle
            item['movieInfo'] = ';'.join(movieInfo)
            item['star'] = star
            item['quote'] = quote
            yield item
        # follow the "next page" link until there is none
        nextLink = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextLink:
            print(nextLink[0])
            yield Request(self.url + nextLink[0], callback=self.parse)
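A note on the pagination above: self.url + nextLink only works because Douban's next-page href happens to start with a query string. A more robust pattern is urllib.parse.urljoin (or response.urljoin, available on Scrapy responses), sketched here with an assumed href value:

```python
from urllib.parse import urljoin

base = 'http://movie.douban.com/top250'
next_href = '?start=25&filter='  # assumed shape of Douban's "next" link

# urljoin resolves the href against the page URL, so it also handles
# absolute and path-relative links, not just query-only ones
print(urljoin(base, next_href))  # → http://movie.douban.com/top250?start=25&filter=
```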
Storing the data, option 1
-o names the output file and -t sets the export format:
scrapy crawl douban -o items.csv -t csv
Open it in Numbers (or any spreadsheet application) and you'll see all 250 movies, ranked.
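You can also inspect the exported CSV programmatically; a minimal sketch with Python's csv module, using an invented sample row in the same column layout (not real spider output):

```python
import csv
import io

# one sample row in the layout the spider exports (invented data)
sample = io.StringIO(
    "title,movieInfo,star,quote\n"
    "肖申克的救赎 The Shawshank Redemption,1994;美国,9.6,希望让人自由。\n"
)
for row in csv.DictReader(sample):
    print(row['title'], row['star'])
```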
Storing the data, option 2
You can instead set the output location and format directly in settings.py:
FEED_URI = 'file:///E:/douban/douban.csv'
FEED_FORMAT = 'csv'
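Note that in newer Scrapy versions (2.1+) FEED_URI and FEED_FORMAT are deprecated in favor of the FEEDS dictionary; an equivalent settings.py fragment might look like:

```python
# settings.py, Scrapy 2.1+ — replaces FEED_URI / FEED_FORMAT
FEEDS = {
    'file:///E:/douban/douban.csv': {
        'format': 'csv',
        'encoding': 'utf8',
    },
}
```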