Note up front: the example site crawled in this article is the Digimon database http://digimons.net/digimon/chn.html
The details touched on in this article are covered in three of my other Jianshu posts:
Local MySQL helper class configuration:
[Python] Scraping bioinfo forum post data (requests + openpyxl + pymysql)
Detailed configuration of each Scrapy component:
[Python] Detailed settings for the components of the Scrapy framework
SQL basics:
[SQL] MySQL basics + Python interaction
轉(zhuǎn)載請注明:陳熹 chenx6542@foxmail.com (簡書號:半為花間酒)
若公眾號內(nèi)轉(zhuǎn)載請聯(lián)系公眾號:早起Python
Requirements analysis
- Main page analysis
First open http://digimons.net/digimon/chn.html
to reach the Chinese index page.
Viewing the page source
reveals two things:
- The data is not loaded via Ajax
- No pagination logic is needed to collect everything: the latter half of every Digimon's URL is already in the current source, and each href has the form <Digimon English name>/index.html
Next, look at the URLs of a few detail pages:
http://digimons.net/digimon/agumon_yuki_kizuna/index.html
http://digimons.net/digimon/herakle_kabuterimon/index.html
http://digimons.net/digimon/mugendramon/index.html
http://digimons.net/digimon/king_etemon/index.html
Given this, one approach is to use a regex or another (hyper)text parsing tool to pull every href out of the source, join each one with the parent path http://digimons.net/digimon/ via urllib.parse.urljoin to build the full URL, and then request the detail pages.
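For reference, a minimal sketch of that href-plus-urljoin route, assuming requests is installed; the regex and variable names are illustrative, and this is not the approach used in the rest of the article:

# Sketch of the alternative: pull hrefs from the index page and join them
import re
import requests
from urllib.parse import urljoin

base = 'http://digimons.net/digimon/'
html = requests.get(base + 'chn.html').text
# hrefs on the index page look like "agumon_yuki_kizuna/index.html"
hrefs = re.findall(r'href="([^"]+/index\.html)"', html)
detail_urls = [urljoin(base, h) for h in hrefs]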
This article takes a different route, though.
A lot of crawler data-collection work looks alike, so Scrapy provides several generic spider classes with a higher level of encapsulation.
They can be listed with the following command:
# List the generic spiders (Generic Spiders) that Scrapy provides
scrapy genspider -l
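In recent Scrapy versions this lists four built-in templates: basic, crawl, csvfeed and xmlfeed; crawl is the template used below.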
CrawlSpider
is the most commonly used of these generic spiders.
Driven by a set of rules, it automatically discovers and follows page links, which handles tasks including (but not limited to) collecting detail pages and following category or pagination links. Its main traversal logic is depth-first.
This site's layout is very simple, so the convenient whole-site crawling approach fits well. Its prerequisite is that irrelevant URLs differ clearly from the ones you need, so the two can be told apart with a regex or some other test.
In short: given a single starting URL, the spider keeps visiting new URLs reachable from it, checking each one against the preset rules to decide whether it is wanted. If it is, the page is parsed first and the crawl then expands to new URLs; if not, the crawl simply keeps expanding; a branch ends when it yields no new URLs.
Most of the external links on these pages go to Wikipedia, whose URLs look quite different. Comparing them with the Digimon detail-page URLs gives the target URL pattern:
http://digimons.net/digimon/.*/index.html
with the regex wildcard standing in for the English name in the middle.
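A throwaway snippet to sanity-check that pattern, using the same regex that goes into the crawl rule below (the Wikipedia URL is just an illustrative non-match):

import re

pattern = r'http://digimons.net/digimon/.*/index.html'
print(bool(re.match(pattern, 'http://digimons.net/digimon/mugendramon/index.html')))  # True
print(bool(re.match(pattern, 'https://en.wikipedia.org/wiki/Digimon')))               # False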
- Detail page analysis
The fields to scrape are as follows:
Basically everything on a Digimon's page gets collected, but note that not every Digimon's profile is complete, so the code has to allow for missing fields (two examples are shown).
That settles the requirements; time to write the code.
Hands-on code
- Create the project
# scrapy startproject <Project_name>
scrapy startproject Digimons
# scrapy genspider <spider_name> <domains>
scrapy genspider -t crawl digimons http://digimons.net/digimon/chn.html
- spiders.py
Open the project folders in turn (Digimons - Digimons - spiders) and create digimons.py there:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DigimonsItem


class DigimonsSpider(CrawlSpider):
    name = 'digimons'
    allowed_domains = ['digimons.net']
    start_urls = ['http://digimons.net/digimon/chn.html']

    # The rules are the key part: put the URL pattern worked out above into allow
    # callback='parse_item': URLs matching the rule are passed to that method
    # follow=False: do not keep expanding links from the matched pages
    rules = (
        Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'), callback='parse_item', follow=False),
    )

    # Parse each required field in turn; some may be missing and need a check
    def parse_item(self, response):
        # Name list
        name_lst = response.xpath('//*[@id="main"]/article/h2[1]//text()').extract()
        name_lstn = [i.replace('/', '').strip() for i in name_lst if i.strip() != '']
        # Chinese name
        Cname = name_lstn[0].replace(' ', '-')
        # Japanese name
        Jname = name_lstn[1]
        # English name
        Ename = name_lstn[2]
        # Level
        digit_grade = response.xpath("//article/div[@class='data'][1]/table/tr[1]/td/text()").extract()
        digit_grade = '-' if digit_grade == [] else ''.join(digit_grade)
        # Type
        digit_type = response.xpath("//article/div[@class='data'][1]/table/tr[2]/td/text()").extract()
        digit_type = '-' if digit_type == [] else ''.join(digit_type)
        # Attribute
        digit_attribute = response.xpath("//article/div[@class='data'][1]/table/tr[3]/td/text()").extract()
        digit_attribute = '-' if digit_attribute == [] else ''.join(digit_attribute)
        # Affiliation
        belongs = response.xpath("//article/div[@class='data'][1]/table/tr[4]/td/text()").extract()
        belongs = '-' if belongs == [] else ''.join(belongs)
        # Field (habitat)
        adaptation_field = response.xpath("//article/div[@class='data'][1]/table/tr[5]/td/text()").extract()
        adaptation_field = '-' if adaptation_field == [] else ''.join(adaptation_field)
        # First appearance
        debut = response.xpath("//article/div[@class='data'][1]/table/tr[6]/td/text()").extract()
        debut = '-' if debut == [] else ''.join(debut)
        # Name origin
        name_source = response.xpath("//article/div[@class='data'][1]/table/tr[7]/td/text()").extract()
        name_source = '-' if name_source == [] else '/'.join(name_source).strip('/')
        # Special moves
        nirvana = response.xpath("//article/div[@class='data'][2]/table/tr/td[1]/text()").extract()
        nirvana = '-' if nirvana == [] else '/'.join(nirvana).strip('/')
        # Profile text
        info_lst = response.xpath("//*[@id='cn']/p/text()").extract()
        info = ''.join([i.replace('/', '').strip() for i in info_lst if i.strip() != ''])
        # Image URL
        img_url = response.xpath('//*[@id="main"]/article/div[1]/a/img/@src').extract()
        img_url = response.url[:-10] + img_url[0] if img_url != [] else '-'
        # A simple console printout, out of personal habit
        print(Cname, Jname, Ename)
        # Hand off to items for persistent storage
        item = DigimonsItem()
        item['Cname'] = Cname
        item['Jname'] = Jname
        item['Ename'] = Ename
        item['digit_grade'] = digit_grade
        item['digit_type'] = digit_type
        item['digit_attribute'] = digit_attribute
        item['belongs'] = belongs
        item['adaptation_field'] = adaptation_field
        item['debut'] = debut
        item['name_source'] = name_source
        item['nirvana'] = nirvana
        item['info'] = info
        item['img_url'] = img_url
        yield item
- items.py
For the details of storing to MySQL, see my other article:
[Python] Scraping bioinfo forum post data from the Bioinformatics forum (requests + openpyxl + pymysql)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


# Fields correspond to what is passed in from spiders.py
class DigimonsItem(scrapy.Item):
    Cname = scrapy.Field()
    Jname = scrapy.Field()
    Ename = scrapy.Field()
    digit_grade = scrapy.Field()
    digit_type = scrapy.Field()
    digit_attribute = scrapy.Field()
    belongs = scrapy.Field()
    adaptation_field = scrapy.Field()
    debut = scrapy.Field()
    name_source = scrapy.Field()
    nirvana = scrapy.Field()
    info = scrapy.Field()
    img_url = scrapy.Field()

    # The commented-out block is the SQL for creating the table; run it in the MySQL client first
    def get_insert_sql_and_data(self):
        # CREATE TABLE digimons(
        #     id int not null auto_increment primary key,
        #     Chinese_name text, Japanese_name text, English_name text,
        #     digit_grade text, digit_type text, digit_attribute text,
        #     belongs text, adaptation_field text, debut text, name_source text,
        #     nirvana text, info text, img_url text)ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
        insert_sql = 'insert into digimons(Chinese_name,Japanese_name,English_name,digit_grade,digit_type,' \
                     'digit_attribute,belongs,adaptation_field,debut,name_source,nirvana,info,img_url)' \
                     'values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
        data = (self['Cname'], self['Jname'], self['Ename'], self['digit_grade'], self['digit_type'],
                self['digit_attribute'], self['belongs'], self['adaptation_field'], self['debut'],
                self['name_source'], self['nirvana'], self['info'], self['img_url'])
        return (insert_sql, data)
- pipelines.py
Storage is handled with the help of items.py and the Mysqlhelper class.
# -*- coding: utf-8 -*-
from mysqlhelper import Mysqlhelper

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class DigimonsPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()

    def process_item(self, item, spider):
        # Only items that know how to build their own insert SQL get stored
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
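The Mysqlhelper class itself lives in a local mysqlhelper.py and is covered in the MySQL article linked above; purely for orientation, a minimal pymysql-based sketch might look like the following (host, credentials and database name are placeholders, and only the execute_sql method called by the pipeline is shown):

# mysqlhelper.py -- minimal sketch; adjust the connection parameters to your setup
import pymysql

class Mysqlhelper(object):
    def __init__(self):
        self.conn = pymysql.connect(host='localhost', port=3306, user='root',
                                    password='your_password', database='your_db',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def execute_sql(self, sql, data):
        # Run a parameterized statement and commit right away
        self.cursor.execute(sql, data)
        self.conn.commit()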
- settings.py
No special changes are needed; the block below is commented out by default and just needs to be enabled:
ITEM_PIPELINES = {
    'Digimons.pipelines.DigimonsPipeline': 300,
}
- extends.py
This custom-extension part can be skipped. The feature implemented here pushes a WeChat notification when the spider finishes running (via the 喵提醒 reminder service).
喵提醒 requires registering an account to get your own id; the service exposes an API that can be wrapped locally.
from urllib import request, parse
import json


class Message(object):
    def __init__(self, text):
        self.text = text

    def push(self):
        # Important: fill in the id bound to your own account
        page = request.urlopen("http://miaotixing.com/trigger?" + parse.urlencode({"id": "xxxxxx", "text": self.text, "type": "json"}))
        result = page.read()
        jsonObj = json.loads(result)
        if jsonObj["code"] == 0:
            print("\nReminder message was sent successfully")
        else:
            print("\nReminder message failed to be sent, error code: " + str(jsonObj["code"]) + ", description: " + jsonObj["msg"])
Save this in a separate file named message.py.
Next comes extends.py, which has to be created manually at the same level as pipelines.py and items.py:
from scrapy import signals
from message import Message


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        # The pushed text can be customized, e.g. to also report how many items were scraped
        text = 'DigimonsSpider finished running'
        message = Message(text)
        message.push()
        print('spider closed')
Importantly, if a custom extension is added, it must also be enabled in settings.py.
The default is:
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
Change it to the following and uncomment it:
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'Digimons.extends.MyExtension': 500,
}
- running.py
The last step is running the project. You can cd into the project directory on the command line and run:
scrapy crawl digimons
Personally I prefer launching from a .py file: create a running.py (in the same directory as items.py):
from scrapy.cmdline import execute
execute('scrapy crawl digimons'.split())
Finally, run running.py and the project starts.
Running the project
While it runs, the output looks like this:
The finished run produced 1,179 records in total. Open Navicat (a graphical client for MySQL and other databases):
The data is stored nicely, and you can see that not every Digimon has an affiliation.
Some simple queries are now possible, for example: who is in the Seven Great Demon Lords faction?
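For example, a query along these lines (a sketch: the column names follow the CREATE TABLE statement in items.py, and the affiliation text is assumed to contain the site's wording 七大魔王):

SELECT Chinese_name, English_name, belongs
FROM digimons
WHERE belongs LIKE '%七大魔王%';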
Comparing the query results confirms there are indeed only seven; the repeated entries are all form changes or variants:
Querying one of the repeated Digimon again shows that their profiles are indeed different.
If browsing in MySQL feels inconvenient, the table can be exported to Excel from Navicat.
For exporting from the command line, see: [SQL] MySQL basics + Python interaction
The scraped data is available for download as an Excel file:
https://pan.baidu.com/s/1oKFsw3at4cF5p4WpW7ZRcA
Extraction code: 1wvs
Closing thoughts:
The Digimon profiles contain little numeric information, so there is limited material for mining and modeling.
Scraping Pokémon data next will be a lot more interesting :) The data-analysis side will be covered in later posts; stay tuned.