1. Creating a custom spider
scrapy startproject zhihurb
Directory structure:
scrapy.cfg: the project's configuration file (rarely needed)
zhihurb/: the project's Python module; your code goes here
zhihurb/items.py: the project's item definitions (see the sketch after this list)
zhihurb/pipelines.py: the project's pipelines
zhihurb/settings.py: the project's settings file
zhihurb/spiders/: the directory for spider code
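Since the demo spider at the end of these notes stores name and article fields, a matching items.py might look like this (a minimal sketch; startproject only generates an empty class):

# zhihurb/items.py
import scrapy

class ZhihurbItem(scrapy.Item):
    name = scrapy.Field()     # article title
    article = scrapy.Field()  # article body text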
Common settings.py options:
LOG_LEVEL = 'ERROR'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1  # delay between downloads, in seconds
DEFAULT_REQUEST_HEADERS = {}  # override the default request headers
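For example, DEFAULT_REQUEST_HEADERS can be overridden like this (the values shown are Scrapy's documented defaults; adjust as needed):

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}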
2. XPath selectors
You can test selectors interactively by running scrapy shell [URL to test] on the command line, then evaluating expressions such as:
response.xpath('//div[@class="tqtongji2"]/ul[position()>1]/li[1]/a/text()').extract()
extract()
Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
extract_first()
Return the first matched node as a unicode string, or None if nothing matched.
re(regex)
Apply the given regex and return a list of unicode strings with the matches.
There is also re_first(), which returns only the first match.
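A quick sketch of both in the shell (the selector and pattern here are hypothetical):

response.xpath('//li/a/text()').re(r'\d{4}-\d{2}-\d{2}')        # all matches, as a list
response.xpath('//li/a/text()').re_first(r'\d{4}-\d{2}-\d{2}')  # first match, or None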
Getting all the text content of a node:
node = response.xpath('//div[@class="content"]')[0]
article = node.xpath('string(.)').extract_first()
3. Pausing and resuming a crawl
scrapy crawl article -s JOBDIR=crawls/article
After an interruption (press Ctrl-C once for a graceful shutdown), re-run the same command and the crawl resumes where it left off.
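While a crawl is persisted with JOBDIR, the spider's state dict is saved and restored as well, so you can carry values across restarts; a minimal sketch (the counter key is hypothetical):

def parse_item(self, response):
    # self.state is serialized to JOBDIR and restored when the job resumes
    self.state['items_count'] = self.state.get('items_count', 0) + 1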
4. Exporting data
To simply save the scraped items to a file, pass the -o option:
scrapy crawl heartsong -o index.xml
Supported formats include csv, json, and xml; the format is inferred from the output file extension.
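Note that JSON feed exports escape non-ASCII characters by default; for readable Chinese output, add this to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'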
If you need more complex processing, handle the logic in pipelines: uncomment the ITEM_PIPELINES setting in settings.py and point it at your implementation class, e.g.
ITEM_PIPELINES = {
'music163.pipelines.MongoPipeline': 300,
}
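A minimal sketch of such a pipeline, modeled on the MongoDB example in the Scrapy docs (the MONGO_URI / MONGO_DATABASE setting names are assumptions, and pymongo must be installed):

import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scrapy'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one collection per spider; insert the item as a plain dict
        self.db[spider.name].insert_one(dict(item))
        return item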
Demo:
from scrapy import Spider, Request
from zhihurb.items import ZhihurbItem

class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = ['https://daily.zhihu.com/']

    def parse(self, response):
        # collect article links from the index page
        urls = response.xpath('//div[@class="box"]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            print(url)
            yield Request(url, callback=self.parse_url)

    def parse_url(self, response):
        # extract the title and the full article text
        name = response.xpath('//h1[@class="headline-title"]/text()').extract_first()
        node = response.xpath('//div[@class="content"]')[0]
        article = node.xpath('string(.)').extract_first()
        item = ZhihurbItem()
        item['name'] = name
        item['article'] = article
        # yield the item so it flows through the pipelines / feed export
        yield item
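Run it from the project directory, optionally exporting the items as described in section 4:

scrapy crawl zhihu -o zhihu.json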