Scrapy official documentation
1. Install Scrapy
Terminal command:
pip3 install scrapy
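2. Create a project
Terminal commands:
scrapy startproject <projectname>
cd <projectname>  # enter the project directory
scrapy genspider <spidername> <url domain>
Here <url domain> is the domain of the site you want to crawl.
The startproject command creates a folder named <projectname> in the current path. The examples below use projectname=mySpider, spidername=epilepsy_spider, and url domain=baidu.com. The folder structure is shown in the figure: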
[Figure: Scrapy project directory structure]
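- scrapy.cfg: project configuration. It mainly provides basic settings for the Scrapy command-line tool; the settings that actually control the crawler live in settings.py.
- items.py: defines the objects (items) that hold the scraped data.
- pipelines.py: data-processing behavior, e.g. persisting structured data.
- settings.py: configuration file, e.g. recursion depth, concurrency, download delay.
- spiders: the spider directory, where you create spider files and write the crawling rules.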
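As a rough illustration of the settings mentioned in the list above, a settings.py fragment might look like the following. The setting names are standard Scrapy settings, but the values are arbitrary assumptions chosen for illustration, not values from this tutorial. The ITEM_PIPELINES entry assumes the project name mySpider and the MoviePipeline class written in step 5.

```python
# settings.py (fragment): all values here are illustrative assumptions

# Maximum crawl depth ("recursion depth"); 0 means no limit
DEPTH_LIMIT = 3

# Maximum number of concurrent requests performed by the downloader
CONCURRENT_REQUESTS = 16

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 1

# Enable the pipeline from step 5; the number (0-1000) sets the run order,
# lower values run first
ITEM_PIPELINES = {
    "mySpider.pipelines.MoviePipeline": 300,
}
```

Without an ITEM_PIPELINES entry, Scrapy does not call the pipeline, so items would be scraped but never written to disk.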
3. Write items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
4. Write epilepsy_spider.py, which contains the start URL of the pages to crawl and the XPath expression
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import MovieItem


class MeijuSpider(scrapy.Spider):
    # The name used with "scrapy crawl"
    name = 'epilepsy_spider'
    # Requests to domains outside this list are filtered out
    allowed_domains = ['baidu.com']
    start_urls = ['https://baike.baidu.com/medicine/disease/%E7%99%AB%E7%97%AB/1613?from=lemma']

    def parse(self, response):
        # extract() returns a list of strings, one per matched <ul> element
        movies = response.xpath('//*[@id="medical_content"]/ul').extract()
        item = MovieItem()
        item['name'] = movies
        print(item['name'])
        yield item
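When you create the Spider project inside a PyCharm project, mark the Spider project's parent folder as a Sources Root so that the import of mySpider.items resolves: right-click the folder -> Mark Directory as -> Sources Root.
For how to use XPath expressions, see the next post.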
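To get a feel for what the XPath expression in parse() selects, it can be tried offline on a small fragment. The sketch below is an approximation under stated assumptions: it uses Python's standard-library xml.etree.ElementTree (which supports only a limited XPath subset) instead of Scrapy's selectors, and the SAMPLE markup is a hypothetical, simplified stand-in, not the real page content.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page (not real page content)
SAMPLE = """
<html>
  <body>
    <div id="medical_content">
      <ul>
        <li>first entry</li>
        <li>second entry</li>
      </ul>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree needs a relative path, so the spider's '//*[...]' becomes './/*[...]'
lists = root.findall('.//*[@id="medical_content"]/ul')

# Collect the text of each <li> under every matched <ul>
entries = [li.text for ul in lists for li in ul.findall("li")]
print(entries)
```

In the spider itself, response.xpath(...).extract() returns the matched <ul> elements as HTML strings rather than the text of each list entry.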
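5. Write pipelines.py to define where the scraped data is saved
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MoviePipeline(object):
    def process_item(self, item, spider):
        # Append each extracted fragment to the output file, one per line;
        # utf-8 keeps the Chinese text intact regardless of the system default
        with open("./epilepsy.txt", "a", encoding="utf-8") as fp:
            for i in item['name']:
                fp.write(i + '\n')
        # Return the item so any later pipeline still receives it
        return item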
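The pipeline above reopens the output file for every item. One possible alternative, sketched here under assumptions, opens the file once per crawl using the open_spider and close_spider hooks that Scrapy calls on pipelines. The class name MovieFilePipeline and its path parameter are hypothetical names introduced for illustration; they are not part of the original project.

```python
class MovieFilePipeline:
    """Hypothetical variant of MoviePipeline that keeps one file handle open."""

    def __init__(self, path="./epilepsy.txt"):
        # Scrapy creates the pipeline without arguments, so the default is used
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.fp = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.fp.close()

    def process_item(self, item, spider):
        # Write each extracted fragment on its own line
        for fragment in item["name"]:
            self.fp.write(fragment + "\n")
        # Return the item so later pipelines still receive it
        return item
```

To try this variant, it would need to be registered in the ITEM_PIPELINES setting under its own dotted path (for example mySpider.pipelines.MovieFilePipeline, assuming it is placed in pipelines.py).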
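6. Run the spider
Terminal command (run it from inside the project directory):
scrapy crawl <spider name>
With the example names used above, this is: scrapy crawl epilepsy_spider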