Background: for a school practical-training project we were asked to scrape a job-listing site and analyze the data. Here I'd like to share the process and results of scraping job postings from 58同城 (58.com).
Preliminary preparation:
1. Set up the environment: first get the environment Scrapy needs in place. I won't repeat the details here; there are plenty of tutorials online. Some of them are incomplete or not entirely accurate, so read a few of them and get the environment working first. I did the installation on Windows 7.
2. Once the environment is ready, learn the structure of the Scrapy framework and how a crawl runs. There are many introductions online, so I won't repeat them either; I'll just quote the Baidu Baike description: Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used for crawling web sites and extracting structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
There is also a Chinese-language Scrapy documentation site (link) worth studying; for this project I only worked through the first few topics. A minimal spider sketch below shows the basic run flow.
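To make that run flow concrete, here is a minimal, self-contained spider sketch. The spider name, URL, and XPath are placeholder values for illustration only: Scrapy downloads each URL in start_urls, passes the response to parse(), and whatever you yield there (items or new Requests) goes back to the engine.

import scrapy

class DemoSpider(scrapy.Spider):
    # Illustrative only: the name and start_urls are placeholders, not part of this project
    name = 'demo'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Scrapy calls parse() once per downloaded response;
        # a yielded dict becomes a scraped item, a yielded Request gets scheduled next
        yield {'title': response.xpath('//title/text()').extract()}

Saved inside a project's spiders directory it can be run with scrapy crawl demo, or as a single file with scrapy runspider.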
Code writing process:
1. Create a new project from cmd:
scrapy startproject tc  (tc is short for 58同城 and is the project name)
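For reference, the command generates a project skeleton roughly like the following (the exact files may vary slightly between Scrapy versions):

tc/
    scrapy.cfg            # project configuration file
    tc/
        __init__.py
        items.py          # item definitions (edited in the next step)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules live here
            __init__.py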
2. Write the items class for this project:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TcItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()    # job title
    Cpname = scrapy.Field()  # company name
    pay = scrapy.Field()     # salary
    edu = scrapy.Field()     # education requirement
    num = scrapy.Field()     # number of openings
    year = scrapy.Field()    # years of experience required
    FL = scrapy.Field()      # benefits
These are the fields I defined for the data I want to scrape.
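As a quick aside (not part of the project code), a scrapy.Item behaves like a dict, so the spider fills the fields by key and later stages read them back the same way; the values here are made up for illustration:

from tc.items import TcItem

item = TcItem()
item['name'] = ['Java Developer']   # .extract() in the spider returns a list, hence the list values
item['pay'] = ['5000-8000']
print(dict(item))                   # {'name': ['Java Developer'], 'pay': ['5000-8000']}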
3. I created a new file tc_spider.py under spiders; its code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from tc.items import TcItem
from scrapy.selector import Selector
from scrapy.http import Request


class TcSpider(scrapy.Spider):
    name = 'tc'
    allowed_domains = ['jn.58.com']
    start_urls = [
        "http://jn.58.com/tech/pn1/?utm_source=market&spm=b-31580022738699-me-f-824.bdpz_biaoti&PGTID=0d303655-0010-915b-ca53-cb17de8b2ef6&ClickID=3"
    ]
    # Pre-build the list-page URLs for pages 2 to 76 so they are all crawled
    theurl = "http://jn.58.com/tech/pn"
    theurl2 = "/?utm_source=market&spm=b-31580022738699-me-f-824.bdpz_biaoti&PGTID=0d303655-0010-915b-ca53-cb17de8b2ef6&ClickID=3"
    for i in range(75):
        n = i + 2
        the_url = theurl + str(n) + theurl2
        start_urls.append(the_url)

    # Leftover from an earlier attempt: Scrapy never calls this method, because the
    # actual hook is start_requests(self), which takes no response argument.
    # The crawl really runs through parse() and parse_item() below.
    def start_request(self, response):
        sel = Selector(response)
        sites = sel.xpath("//*[@id='infolist']/dl")
        for site in sites:
            for href in site.xpath('dt/a/@href').extract():
                self.start_urls.append(href)
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def parse_item(self, response):
        # Detail page: fill in the fields defined in TcItem
        item = TcItem()
        item['name'] = response.xpath("//*[@class='headConLeft']/h1/text()").extract()      # job title
        item['Cpname'] = response.xpath("//*[@class='company']/a/text()").extract()         # company name
        item['pay'] = response.xpath("//*[@class='salaNum']/strong/text()").extract()       # salary
        item['edu'] = response.xpath("//*[@class='xq']/ul/li[1]/div[2]/text()").extract()   # education requirement
        item['num'] = response.xpath("//*[@class='xq']/ul/li[2]/div[1]/text()").extract()   # number of openings
        item['year'] = response.xpath("//*[@class='xq']/ul/li[2]/div[2]/text()").extract()  # years of experience
        item['FL'] = response.xpath("//*[@class='cbSum']/span/text()").extract()            # benefits
        yield item

    def parse(self, response):
        # List page: follow every job posting link and parse it with parse_item
        hrefs = response.xpath("//*[@id='infolist']/dl/dt/a/@href").extract()
        for he in hrefs:
            yield Request(he, callback=self.parse_item)
        # Alternative pagination: follow the "next page" link instead of pre-building start_urls
        # next_page = response.xpath("//*[@class='nextMsg']/a/@href")
        # if next_page:
        #     url = response.urljoin(next_page[0].extract())
        #     yield scrapy.Request(url, self.parse)
This code has roughly four parts: ① defining the site and the range of pages to crawl, ② the XPath expressions for each field, ③ the loop that follows each job posting's link (so the static information on the detail page can be scraped), and ④ continuous crawling, i.e. looping over the list pages.
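For part ④ I simply pre-build all 76 list-page URLs in start_urls; the commented-out block at the end of parse() is the alternative of following the "next page" link. A working version of that approach would look roughly like this (the nextMsg class name is taken from the commented code and has not been re-checked against the live page):

    def parse(self, response):
        # ...yield the detail-page Requests as above, then follow the "next page" link
        next_page = response.xpath("//*[@class='nextMsg']/a/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Either way, the crawl is started from cmd inside the project directory, and the results can be exported straight to a file for the later analysis step:

scrapy crawl tc -o jobs.csv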