[Scrapy in Action] Scraping All Hero Data from Honor of Kings (王者榮耀)

Scraping Honor of Kings hero information

Analysis

Entry page URL

https://pvp.qq.com/web201605/herolist.shtml

Step 1: Get the list of all heroes

[Screenshot: hero list on the entry page]

The hero list can be found directly in the page source:


[Screenshot: hero list markup in the page source]
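
Before writing any spider code, it is worth confirming the selector interactively in scrapy shell. A quick check, using the same XPath the spider relies on later (the expected output format is taken from the link list printed further below):

scrapy shell https://pvp.qq.com/web201605/herolist.shtml
>>> response.xpath("//ul[@class='herolist clearfix']//li/a/@href").getall()
# expected: relative links such as 'herodetail/105.shtml', one per hero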

Step 2: Get each hero's details

A hero's basic information sits in a div with class="cover"; we mainly collect the hero's name and skill descriptions.

[Screenshot: basic info block on the hero detail page]

The skill section lives in the ul inside the element with class="zk-con3 zk-con".

[Screenshot: skill markup on the hero detail page]
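
The detail-page selectors can be verified the same way. Note that the ability bars encode their values as inline widths; the "width:NN%" format is an assumption inferred from the [6:] slicing used in the parsing code below:

scrapy shell https://pvp.qq.com/web201605/herodetail/105.shtml
>>> response.xpath("//div[@class='cover']//h2[@class='cover-name']/text()").get()
>>> response.xpath("//div[@class='cover']//ul/li[1]/span/i/@style").get()
# expected: something like 'width:80%' (assumed format)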

Scraping the hero list

Create the project

scrapy startproject wzry
cd wzry

Create the spider

scrapy genspider wzry_spider pvp.qq.com
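
genspider drops a skeleton into spiders/wzry_spider.py roughly like the following (the exact template varies slightly across Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class WzrySpiderSpider(scrapy.Spider):
    name = 'wzry_spider'
    allowed_domains = ['pvp.qq.com']
    start_urls = ['http://pvp.qq.com/']

    def parse(self, response):
        pass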

Update the settings

# Do not obey robots.txt (the site does not serve this file anyway)
ROBOTSTXT_OBEY = False
# Add default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

# Download delay (seconds between requests)
DOWNLOAD_DELAY = 1

Set the start URL

start_urls = ['https://pvp.qq.com/web201605/herolist.shtml']

Add a parse method for the list page (as a test)

    def parse(self, response):
        print("=" * 50)
        print(response)
        print("=" * 50)

Run the spider

scrapy crawl wzry_spider
2020-05-25 11:08:53 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-25 11:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pvp.qq.com/web201605/herolist.shtml> (referer: None)
==================================================
<200 https://pvp.qq.com/web201605/herolist.shtml>
==================================================
2020-05-25 11:08:54 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-25 11:08:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 315,

The request succeeds and we get a response back.

Extracting all hero detail links

class WzrySpiderSpider(scrapy.Spider):
    name = 'wzry_spider'
    allowed_domains = ['pvp.qq.com']
    start_urls = ['https://pvp.qq.com/web201605/herolist.shtml']    # start URL
    base_url = "https://pvp.qq.com/web201605/"  # URL prefix for the relative hero links

    def parse(self, response):
        print("=" * 50)
        # print(response.body)
        hero_list = response.xpath("//ul[@class='herolist clearfix']//li")
        for hero in hero_list:
            url = self.base_url + hero.xpath("./a/@href").get()
            print(url)
            # yield scrapy.Request(url)
        print("=" * 50)
https://pvp.qq.com/web201605/herodetail/114.shtml
https://pvp.qq.com/web201605/herodetail/113.shtml
https://pvp.qq.com/web201605/herodetail/112.shtml
https://pvp.qq.com/web201605/herodetail/111.shtml
https://pvp.qq.com/web201605/herodetail/110.shtml
https://pvp.qq.com/web201605/herodetail/109.shtml
https://pvp.qq.com/web201605/herodetail/108.shtml
https://pvp.qq.com/web201605/herodetail/107.shtml
https://pvp.qq.com/web201605/herodetail/106.shtml
https://pvp.qq.com/web201605/herodetail/105.shtml
==================================================

Scraping the hero detail pages

  • Get the basic info
# Basic info block
hero_info = response.xpath("//div[@class='cover']")
hero_name = hero_info.xpath(".//h2[@class='cover-name']/text()").get()
print(hero_name)
# Role category: the last character of the icon's class name encodes it
sort_num = hero_info.xpath(".//span[@class='herodetail-sort']/i/@class").get()[-1:]
print(sort_num)
# Ability bars carry their value as an inline style like "width:80%";
# the [6:] slice strips the leading "width:"
# Survivability
viability = hero_info.xpath(".//ul/li[1]/span/i/@style").get()[6:]
print("Survivability: " + viability)
# Damage
aggressivity = hero_info.xpath(".//ul/li[2]/span/i/@style").get()[6:]
print("Damage: " + aggressivity)
# Skill effects
effect = hero_info.xpath(".//ul/li[3]/span/i/@style").get()[6:]
print("Skill effects: " + effect)
# Difficulty
difficulty = hero_info.xpath(".//ul/li[4]/span/i/@style").get()[6:]
print("Difficulty: " + difficulty)
  • Get the skill info
skill_list = response.xpath("//div[@class='skill-show']/div[@class='show-list']")

skills = []
for skill in skill_list:
    skill_name = skill.xpath("./p[@class='skill-name']/b/text()").get()
    # Skip empty placeholder skill slots (blocks without a name)
    if not skill_name:
        continue
    # Cooldown, e.g. text like "冷却值:10/9/8" (fullwidth colon) -> ['10', '9', '8']
    cooling = skill.xpath("./p[@class='skill-name']/span[1]/text()").get().split("：")[1].strip().split('/')
    # Cost (second span; reading span[1] here again would just duplicate the cooldown)
    consume = skill.xpath("./p[@class='skill-name']/span[2]/text()").get().split("：")[1].strip().split('/')
    # Skill description
    skill_desc = skill.xpath("./p[@class='skill-desc']/text()").get()
    skills.append({
        "name": skill_name,
        "cooling": cooling,
        "consume": consume,
        "desc": skill_desc
    })

new_hero = HeroInfo(name=hero_name,
                    sort_num=sort_num,
                    viability=viability,
                    aggressivity=aggressivity,
                    effect=effect,
                    difficulty=difficulty,
                    skills_list=skills)

yield new_hero
  • The hero info Item class
class HeroInfo(scrapy.Item):
    # string fields
    name = scrapy.Field()
    sort_num = scrapy.Field()
    viability = scrapy.Field()
    aggressivity = scrapy.Field()
    effect = scrapy.Field()
    difficulty = scrapy.Field()
    # list of skill dicts
    skills_list = scrapy.Field()
  • Persisting the items: pipelines.py

The pipeline class also has to be registered in the ITEM_PIPELINES setting, as shown below.
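
In settings.py (300 is just the conventional priority value; any integer from 0 to 1000 works):

ITEM_PIPELINES = {
    'wzry.pipelines.WzryPipeline': 300,
}

The pipeline itself: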

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonItemExporter


class WzryPipeline:
    def __init__(self):
        # Open the output file and wrap it in a JsonItemExporter
        self.fp = open('result.json', 'wb')
        self.save_json = JsonItemExporter(self.fp, encoding="utf-8", ensure_ascii=False, indent=4)
        # Start exporting (writes the opening bracket of the JSON array)
        self.save_json.start_exporting()

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        # Finish exporting (writes the closing bracket)
        self.save_json.finish_exporting()
        # Close the file
        self.fp.close()

    def process_item(self, item, spider):
        # Export a single item
        self.save_json.export_item(item)
        return item
        return item
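
JsonItemExporter collects everything into a single JSON array, which is fine for a crawl of this size. For larger crawls, a line-per-item export with JsonLinesItemExporter (also in scrapy.exporters) is a common alternative, since the resulting file can be processed one line at a time. A minimal sketch; the class name WzryJsonLinesPipeline is chosen here just for illustration:

from scrapy.exporters import JsonLinesItemExporter


class WzryJsonLinesPipeline:
    def open_spider(self, spider):
        # One JSON object per line; no array brackets to balance
        self.fp = open('result.jsonl', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item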

Run the spider

scrapy crawl wzry_spider

Result

[Screenshot: result.json with the exported hero data]

Full code

wzry_spider.py

# -*- coding: utf-8 -*-
import scrapy
from wzry.items import HeroInfo


class WzrySpiderSpider(scrapy.Spider):
    name = 'wzry_spider'
    allowed_domains = ['pvp.qq.com']
    start_urls = ['https://pvp.qq.com/web201605/herolist.shtml']  # start URL
    base_url = "https://pvp.qq.com/web201605/"  # URL prefix for the relative hero links

    def parse(self, response):
        # print(response.body)
        hero_list = response.xpath("//ul[@class='herolist clearfix']//li")
        for hero in hero_list:
            url = self.base_url + hero.xpath("./a/@href").get()
            yield scrapy.Request(url, callback=self.get_hero_info)

    def get_hero_info(self, response):
        # Basic info block
        hero_info = response.xpath("//div[@class='cover']")
        hero_name = hero_info.xpath(".//h2[@class='cover-name']/text()").get()

        # Role category: the last character of the icon's class name encodes it
        sort_num = hero_info.xpath(".//span[@class='herodetail-sort']/i/@class").get()[-1:]

        # Ability bars carry their value as an inline style like "width:80%";
        # the [6:] slice strips the leading "width:"
        # Survivability
        viability = hero_info.xpath(".//ul/li[1]/span/i/@style").get()[6:]
        # Damage
        aggressivity = hero_info.xpath(".//ul/li[2]/span/i/@style").get()[6:]
        # Skill effects
        effect = hero_info.xpath(".//ul/li[3]/span/i/@style").get()[6:]
        # Difficulty
        difficulty = hero_info.xpath(".//ul/li[4]/span/i/@style").get()[6:]

        # Skill list
        skill_list = response.xpath("//div[@class='skill-show']/div[@class='show-list']")

        skills = []
        for skill in skill_list:
            skill_name = skill.xpath("./p[@class='skill-name']/b/text()").get()
            # Skip empty placeholder skill slots
            if not skill_name:
                continue
            # Cooldown, e.g. text like "冷却值:10/9/8" (fullwidth colon) -> ['10', '9', '8']
            cooling = skill.xpath("./p[@class='skill-name']/span[1]/text()").get().split("：")[1].strip().split('/')
            # Cost (second span; the original read span[1] twice, which duplicated the cooldown)
            consume = skill.xpath("./p[@class='skill-name']/span[2]/text()").get().split("：")[1].strip().split('/')
            # Skill description
            skill_desc = skill.xpath("./p[@class='skill-desc']/text()").get()
            skills.append({
                "name": skill_name,
                "cooling": cooling,
                "consume": consume,
                "desc": skill_desc
            })

        new_hero = HeroInfo(name=hero_name,
                            sort_num=sort_num,
                            viability=viability,
                            aggressivity=aggressivity,
                            effect=effect,
                            difficulty=difficulty,
                            skills_list=skills)

        yield new_hero

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WzryItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class HeroInfo(scrapy.Item):
    # string fields
    name = scrapy.Field()
    sort_num = scrapy.Field()
    viability = scrapy.Field()
    aggressivity = scrapy.Field()
    effect = scrapy.Field()
    difficulty = scrapy.Field()
    # list of skill dicts
    skills_list = scrapy.Field()

pipelines.py


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonItemExporter


class WzryPipeline:
    def __init__(self):
        self.fp = open('result.json', 'wb')
        self.save_json = JsonItemExporter(self.fp, encoding="utf-8", ensure_ascii=False, indent=4)
        self.save_json.start_exporting()

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        self.save_json.finish_exporting()
        self.fp.close()

    def process_item(self, item, spider):
        self.save_json.export_item(item)
        return item
