[Scrapy in Action] Scraping All Hero Data from Honor of Kings (王者榮耀)

Scraping Honor of Kings hero information

Analysis

Entry page URL

https://pvp.qq.com/web201605/herolist.shtml

Step 1: Get the list of all heroes

[Screenshot: hero list on the entry page]

The hero list can be found directly in the page source:


[Screenshot: hero list markup in the page source]
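
Before writing any spider code, it is worth confirming the selector interactively in scrapy shell. A quick check, using the same XPath the spider relies on later (the expected output format is taken from the link list printed further below):

scrapy shell https://pvp.qq.com/web201605/herolist.shtml
>>> response.xpath("//ul[@class='herolist clearfix']//li/a/@href").getall()
# expected: relative links such as 'herodetail/105.shtml', one per hero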

Step 2: Get each hero's details

A hero's basic information sits in a div with class="cover"; we mainly collect the hero's name and skill descriptions.

[Screenshot: basic info block on the hero detail page]

The skill section lives in the ul inside the element with class="zk-con3 zk-con".

[Screenshot: skill markup on the hero detail page]
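
The detail-page selectors can be verified the same way. Note that the ability bars encode their values as inline widths; the "width:NN%" format is an assumption inferred from the [6:] slicing used in the parsing code below:

scrapy shell https://pvp.qq.com/web201605/herodetail/105.shtml
>>> response.xpath("//div[@class='cover']//h2[@class='cover-name']/text()").get()
>>> response.xpath("//div[@class='cover']//ul/li[1]/span/i/@style").get()
# expected: something like 'width:80%' (assumed format)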

Scraping the hero list

Create the project

scrapy startproject wzry
cd wzry

Create the spider

scrapy genspider wzry_spider pvp.qq.com
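
genspider drops a skeleton into spiders/wzry_spider.py roughly like the following (the exact template varies slightly across Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class WzrySpiderSpider(scrapy.Spider):
    name = 'wzry_spider'
    allowed_domains = ['pvp.qq.com']
    start_urls = ['http://pvp.qq.com/']

    def parse(self, response):
        pass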

Update the settings

# Do not obey robots.txt (the site does not serve this file anyway)
ROBOTSTXT_OBEY = False
# Add default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

# Download delay (seconds between requests)
DOWNLOAD_DELAY = 1

Set the start URL

start_urls = ['https://pvp.qq.com/web201605/herolist.shtml']

Add a parse method for the list page (as a test)

    def parse(self, response):
        print("=" * 50)
        print(response)
        print("=" * 50)

Run the spider

scrapy crawl wzry_spider
2020-05-25 11:08:53 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-25 11:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pvp.qq.com/web201605/herolist.shtml> (referer: None)
==================================================
<200 https://pvp.qq.com/web201605/herolist.shtml>
==================================================
2020-05-25 11:08:54 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-25 11:08:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 315,

The request succeeds and we get a response back.

Extracting all hero detail links

class WzrySpiderSpider(scrapy.Spider):
    name = 'wzry_spider'
    allowed_domains = ['pvp.qq.com']
    start_urls = ['https://pvp.qq.com/web201605/herolist.shtml']    # start URL
    base_url = "https://pvp.qq.com/web201605/"  # URL prefix for the relative hero links

    def parse(self, response):
        print("=" * 50)
        # print(response.body)
        hero_list = response.xpath("//ul[@class='herolist clearfix']//li")
        for hero in hero_list:
            url = self.base_url + hero.xpath("./a/@href").get()
            print(url)
            # yield scrapy.Request(url)
        print("=" * 50)
https://pvp.qq.com/web201605/herodetail/114.shtml
https://pvp.qq.com/web201605/herodetail/113.shtml
https://pvp.qq.com/web201605/herodetail/112.shtml
https://pvp.qq.com/web201605/herodetail/111.shtml
https://pvp.qq.com/web201605/herodetail/110.shtml
https://pvp.qq.com/web201605/herodetail/109.shtml
https://pvp.qq.com/web201605/herodetail/108.shtml
https://pvp.qq.com/web201605/herodetail/107.shtml
https://pvp.qq.com/web201605/herodetail/106.shtml
https://pvp.qq.com/web201605/herodetail/105.shtml
==================================================

Scraping the hero detail pages

  • Get the basic info
# Basic info block
hero_info = response.xpath("//div[@class='cover']")
hero_name = hero_info.xpath(".//h2[@class='cover-name']/text()").get()
print(hero_name)
# Role category: the last character of the icon's class name encodes it
sort_num = hero_info.xpath(".//span[@class='herodetail-sort']/i/@class").get()[-1:]
print(sort_num)
# Ability bars carry their value as an inline style like "width:80%";
# the [6:] slice strips the leading "width:"
# Survivability
viability = hero_info.xpath(".//ul/li[1]/span/i/@style").get()[6:]
print("Survivability: " + viability)
# Damage
aggressivity = hero_info.xpath(".//ul/li[2]/span/i/@style").get()[6:]
print("Damage: " + aggressivity)
# Skill effects
effect = hero_info.xpath(".//ul/li[3]/span/i/@style").get()[6:]
print("Skill effects: " + effect)
# Difficulty
difficulty = hero_info.xpath(".//ul/li[4]/span/i/@style").get()[6:]
print("Difficulty: " + difficulty)
  • Get the skill info
skill_list = response.xpath("//div[@class='skill-show']/div[@class='show-list']")

skills = []
for skill in skill_list:
    skill_name = skill.xpath("./p[@class='skill-name']/b/text()").get()
    # Skip empty placeholder skill slots (blocks without a name)
    if not skill_name:
        continue
    # Cooldown, e.g. text like "冷却值:10/9/8" (fullwidth colon) -> ['10', '9', '8']
    cooling = skill.xpath("./p[@class='skill-name']/span[1]/text()").get().split("：")[1].strip().split('/')
    # Cost (second span; reading span[1] here again would just duplicate the cooldown)
    consume = skill.xpath("./p[@class='skill-name']/span[2]/text()").get().split("：")[1].strip().split('/')
    # Skill description
    skill_desc = skill.xpath("./p[@class='skill-desc']/text()").get()
    skills.append({
        "name": skill_name,
        "cooling": cooling,
        "consume": consume,
        "desc": skill_desc
    })

new_hero = HeroInfo(name=hero_name,
                    sort_num=sort_num,
                    viability=viability,
                    aggressivity=aggressivity,
                    effect=effect,
                    difficulty=difficulty,
                    skills_list=skills)

yield new_hero
  • The hero info Item class
class HeroInfo(scrapy.Item):
    # string fields
    name = scrapy.Field()
    sort_num = scrapy.Field()
    viability = scrapy.Field()
    aggressivity = scrapy.Field()
    effect = scrapy.Field()
    difficulty = scrapy.Field()
    # list of skill dicts
    skills_list = scrapy.Field()
  • Persisting the items: pipelines.py

The pipeline class also has to be registered in the ITEM_PIPELINES setting, as shown below.
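
In settings.py (300 is just the conventional priority value; any integer from 0 to 1000 works):

ITEM_PIPELINES = {
    'wzry.pipelines.WzryPipeline': 300,
}

The pipeline itself: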

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonItemExporter


class WzryPipeline:
    def __init__(self):
        # Open the output file and wrap it in a JsonItemExporter
        self.fp = open('result.json', 'wb')
        self.save_json = JsonItemExporter(self.fp, encoding="utf-8", ensure_ascii=False, indent=4)
        # Start exporting (writes the opening bracket of the JSON array)
        self.save_json.start_exporting()

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        # Finish exporting (writes the closing bracket)
        self.save_json.finish_exporting()
        # Close the file
        self.fp.close()

    def process_item(self, item, spider):
        # Export a single item
        self.save_json.export_item(item)
        return item
        return item
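
JsonItemExporter collects everything into a single JSON array, which is fine for a crawl of this size. For larger crawls, a line-per-item export with JsonLinesItemExporter (also in scrapy.exporters) is a common alternative, since the resulting file can be processed one line at a time. A minimal sketch; the class name WzryJsonLinesPipeline is chosen here just for illustration:

from scrapy.exporters import JsonLinesItemExporter


class WzryJsonLinesPipeline:
    def open_spider(self, spider):
        # One JSON object per line; no array brackets to balance
        self.fp = open('result.jsonl', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item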

Run the spider

scrapy crawl wzry_spider

Result

[Screenshot: result.json with the exported hero data]

Full code

wzry_spider.py

# -*- coding: utf-8 -*-
import scrapy
from wzry.items import HeroInfo


class WzrySpiderSpider(scrapy.Spider):
    name = 'wzry_spider'
    allowed_domains = ['pvp.qq.com']
    start_urls = ['https://pvp.qq.com/web201605/herolist.shtml']  # start URL
    base_url = "https://pvp.qq.com/web201605/"  # URL prefix for the relative hero links

    def parse(self, response):
        # print(response.body)
        hero_list = response.xpath("//ul[@class='herolist clearfix']//li")
        for hero in hero_list:
            url = self.base_url + hero.xpath("./a/@href").get()
            yield scrapy.Request(url, callback=self.get_hero_info)

    def get_hero_info(self, response):
        # Basic info block
        hero_info = response.xpath("//div[@class='cover']")
        hero_name = hero_info.xpath(".//h2[@class='cover-name']/text()").get()

        # Role category: the last character of the icon's class name encodes it
        sort_num = hero_info.xpath(".//span[@class='herodetail-sort']/i/@class").get()[-1:]

        # Ability bars carry their value as an inline style like "width:80%";
        # the [6:] slice strips the leading "width:"
        # Survivability
        viability = hero_info.xpath(".//ul/li[1]/span/i/@style").get()[6:]
        # Damage
        aggressivity = hero_info.xpath(".//ul/li[2]/span/i/@style").get()[6:]
        # Skill effects
        effect = hero_info.xpath(".//ul/li[3]/span/i/@style").get()[6:]
        # Difficulty
        difficulty = hero_info.xpath(".//ul/li[4]/span/i/@style").get()[6:]

        # Skill list
        skill_list = response.xpath("//div[@class='skill-show']/div[@class='show-list']")

        skills = []
        for skill in skill_list:
            skill_name = skill.xpath("./p[@class='skill-name']/b/text()").get()
            # Skip empty placeholder skill slots
            if not skill_name:
                continue
            # Cooldown, e.g. text like "冷却值:10/9/8" (fullwidth colon) -> ['10', '9', '8']
            cooling = skill.xpath("./p[@class='skill-name']/span[1]/text()").get().split("：")[1].strip().split('/')
            # Cost (second span; the original read span[1] twice, which duplicated the cooldown)
            consume = skill.xpath("./p[@class='skill-name']/span[2]/text()").get().split("：")[1].strip().split('/')
            # Skill description
            skill_desc = skill.xpath("./p[@class='skill-desc']/text()").get()
            skills.append({
                "name": skill_name,
                "cooling": cooling,
                "consume": consume,
                "desc": skill_desc
            })

        new_hero = HeroInfo(name=hero_name,
                            sort_num=sort_num,
                            viability=viability,
                            aggressivity=aggressivity,
                            effect=effect,
                            difficulty=difficulty,
                            skills_list=skills)

        yield new_hero

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WzryItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class HeroInfo(scrapy.Item):
    # string fields
    name = scrapy.Field()
    sort_num = scrapy.Field()
    viability = scrapy.Field()
    aggressivity = scrapy.Field()
    effect = scrapy.Field()
    difficulty = scrapy.Field()
    # list of skill dicts
    skills_list = scrapy.Field()

pipelines.py


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonItemExporter


class WzryPipeline:
    def __init__(self):
        self.fp = open('result.json', 'wb')
        self.save_json = JsonItemExporter(self.fp, encoding="utf-8", ensure_ascii=False, indent=4)
        self.save_json.start_exporting()

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        self.save_json.finish_exporting()
        self.fp.close()

    def process_item(self, item, spider):
        self.save_json.export_item(item)
        return item
