Data Sources
1. Xiaohongshu
2. Baidu Baike - Produce 101 (創(chuàng)造101)
3. Tencent Video official site
Collection and Analysis
This time there are more than 100 contestants' records to collect, split across a main ranking page and per-contestant sub-pages, so a crawler is the only sensible option; doing it by hand at 20 profiles a day would still take 5 days.
The crawler will be built with Scrapy: the documentation is thorough, it is easy to use, and it supports both XPath and CSS selectors. Add regular expressions on top and, short of anti-crawling measures, there is hardly a page it cannot handle.
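To give a taste of how the three combine, here are two shell one-liners against the ranking page (illustrative only; the selectors actually used appear below):
>>> response.css('div.list_item a::text').extract_first()        # CSS selector
>>> response.xpath('//a[@class="pic"]/@data-starid').re(r'\d+')  # XPath plus regex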
The main targets are:
1. Name
2. Company
3. Rank
4. Height
5. Weight
6. English name
7. Photo
8. Export the data to JSON and CSV
Getting Started
Environment
1. Python 3.6.3
2. Scrapy 1.5.0
3. Windows 10
Create the Project
pip install scrapy
pip install xpinyin
scrapy startproject P101
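startproject also generates items.py, which this post never shows; the spider and pipelines below assume roughly these definitions (a sketch, with field names inferred from the code that assigns them):
# items.py -- minimal sketch, not shown in the original post
from scrapy import Item, Field

class DokiSlimItem(Item):
    # One entry per contestant on the ranking page
    name = Field()         # Chinese name
    engname = Field()      # pinyin of the name
    starid = Field()       # numeric doki star id
    pic = Field()          # original photo URL
    images = Field()       # local file name, e.g. 'maxingyu.png'
    image_urls = Field()   # consumed by ImagesPipeline
    image_paths = Field()  # filled in by item_completed

class DokiItem(Item):
    # Per-contestant detail-page data
    epsdata = Field()      # EP1-EP10 ranks, e.g. ',20,26,31,35,37,35,25,,'
    name = Field()
    engname = Field()
    starid = Field()
    height = Field()
    weight = Field()
    hometown = Field()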
Writing the Spider
Edit P101.py under the spiders directory. The work splits into analyzing the two page types in the shell, then writing the spider itself.
1. Analyze the main page
Enter the Scrapy shell and locate the data we need:
scrapy shell http://v.qq.com/biu/101_star_web
>>> response.xpath('//div[@class="list_item"]//a[contains(@href, "javascript:;")]/text()').extract()
陳意涵
....
....
>>>
2漾肮、分析子頁面
進入命令行模式克懊,分析需要數(shù)據(jù),EP1~EP10數(shù)據(jù)
scrapy shell http://v.qq.com/doki/star?id=1661556
>>>sel.xpath('//div[@id="101"]/@data-round').extract()
[',20,26,31,35,37,35,25,,']
>>>
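The data-round string is comma-separated, with empty slots for episodes where no rank was recorded. A small helper (hypothetical, not part of the original spider) turns it into numbers for later analysis:
def parse_epsdata(epsdata):
    """',20,26,31,35,37,35,25,,' -> [None, 20, 26, 31, 35, 37, 35, 25, None, None]"""
    return [int(rank) if rank else None for rank in epsdata.split(',')]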
3. Spider source
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
from xpinyin import Pinyin

from P101.items import DokiSlimItem
from P101.items import DokiItem


class P101Spider(Spider):
    name = 'P101'
    allowed_domains = ['v.qq.com']
    start_urls = ['http://v.qq.com/biu/101_star_web']
    p101_url = 'http://v.qq.com/biu/101_star_web'
    # Local mirror used during development:
    # allowed_domains = ['127.0.0.1:8080']
    # p101_url = 'http://127.0.0.1:8080/rank.html'
    single_url = 'http://v.qq.com/doki/star?id={starid}'

    def start_requests(self):
        # Fetch the ranking page and hand it to parse_p101
        yield Request(self.p101_url, self.parse_p101)

    def parse_p101(self, response):
        p = Pinyin()
        for divs in response.xpath('//div[@class="list_item"]'):
            item1 = DokiSlimItem()
            for name in divs.xpath('.//a[contains(@href, "javascript:;")]/text()'):
                cnname = name.extract()
                engname = p.get_pinyin(cnname, '')  # e.g. 'maxingyu'
                item1['name'] = cnname
                item1['engname'] = engname
            for starid in divs.xpath('.//a[@class="pic"][contains(@href, "javascript:;")]/@data-starid'):
                item1['starid'] = starid.extract()
            for pic in divs.xpath('.//a[@class="pic"][contains(@href, "javascript:;")]/img/@src'):
                item1['pic'] = pic.extract()
                # Use the pinyin name as the local image file name
                item1['images'] = engname + ".png"
                # The src attribute is protocol-relative, so prepend the scheme
                item1['image_urls'] = ["http:" + pic.extract()]
            yield item1
            # Build the contestant detail URL; parse_idol is the callback
            yield Request(self.single_url.format(starid=item1['starid']), self.parse_idol)

    def parse_idol(self, response):
        # Store the contestant's detail-page fields in the item
        p = Pinyin()
        item2 = DokiItem()
        item2['starid'] = response.url.strip().split("id=")[-1]
        # EP1-EP10 ranks as a comma-separated string, e.g. ',20,26,31,35,37,35,25,,'
        item2['epsdata'] = response.xpath('//div[@id="101"]/@data-round').extract()[0]
        properties = response.xpath('//div[@class="wiki_info_1"]//div[@class="line"]')
        cnname = properties[0].xpath('.//span[@class="content"]/text()').extract()[0]
        item2['name'] = cnname
        item2['engname'] = p.get_pinyin(cnname, '')
        item2['height'] = properties[5].xpath('.//span[@class="content"]/text()').extract()[0]
        item2['weight'] = properties[6].xpath('.//span[@class="content"]/text()').extract()[0]
        item2['hometown'] = properties[7].xpath('.//span[@class="content"]/text()').extract()[0]
        yield item2
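With the spider written, a crawl can be kicked off from the project root:
scrapy crawl P101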
Optimizing the Crawler
1. What should the images be called?
The site serves its photos from URLs like puui.qpic.cn/media_img/0/null1524465427/0. What is that supposed to be?
Once downloaded, the files need names we can actually find again; the plan is to use each contestant's pinyin name as the file name, as the quick check below shows.
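xpinyin does the conversion; a quick sanity check of the names it produces (assuming xpinyin's default lowercase, tone-free output):
from xpinyin import Pinyin

p = Pinyin()
# An empty splitter joins the syllables into one string
print(p.get_pinyin(u'馬興鈺', ''))  # maxingyu
print(p.get_pinyin(u'王菊', ''))    # wangju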
2寓调、需要使用ImagesPipeline技術(shù)下載圖片捶牢,當然如果你覺得麻煩巍耗,直接request也可以。
沒太多難度驯耻,網(wǎng)上找個教程霎迫,添加進來就能用了知给。
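The post never shows settings.py, so here is a minimal sketch of what has to be enabled (the priority numbers are arbitrary, and IMAGES_STORE is whatever directory you want the photos in; ImagesPipeline also needs Pillow, via pip install Pillow):
# settings.py -- sketch
ITEM_PIPELINES = {
    'P101.pipelines.P101ImgDownloadPipeline': 1,
    'P101.pipelines.JsonPipeline': 2,
}
# Directory where ImagesPipeline stores downloaded files
IMAGES_STORE = 'images'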
3戈次、源碼
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import logging
import re

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

from .items import DokiSlimItem


def strip(path):
    """
    :param path: file or folder name to sanitize
    :return: the string with characters that are illegal in Windows
             file names removed
    """
    return re.sub(r'[?\\*|"<>:/]', '', str(path))


class P101Pipeline(object):
    def process_item(self, item, spider):
        return item


class P101ImgDownloadPipeline(ImagesPipeline):
    # Headers for image hosts that check the referer; only used if
    # passed explicitly to the Request
    default_headers = {
        'accept': 'image/webp,image/*,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, sdch, br',
        'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
        # 'referer': 'http://puui.qpic.cn/media_img/0/',
        'referer': 'http://127.0.0.1:8080/',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    }

    def file_path(self, request, response=None, info=None):
        """
        :param request: one image download request; its meta carries the
                        target file name, e.g. 'maxingyu.png'
        :return: the file name, cleaned of characters Windows rejects
        """
        folder = request.meta['item']
        folder_strip = strip(folder)
        return u'{0}'.format(folder_strip)

    def get_media_requests(self, item, info):
        # Only the list items carry image URLs; detail items pass through
        if isinstance(item, DokiSlimItem):
            for image_url in item['image_urls']:
                logging.debug("get_media_requests url:" + image_url)
                # Hand the target file name to file_path via meta
                yield Request(image_url, meta={'item': item['images']})

    def item_completed(self, results, item, info):
        # Detail items trigger no downloads, so only check list items;
        # otherwise every DokiItem would be dropped here
        if isinstance(item, DokiSlimItem):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
        return item
Data Storage
The data is written out as JSON, one object per line, by another pipeline in the same pipelines file.
1. Source
# -*- coding: utf-8 -*-
import json


class JsonPipeline(object):
    def open_spider(self, spider):
        self.file = open('data.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines); non-ASCII characters
        # are escaped to \uXXXX by default, as the output below shows
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
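Target 8 also called for CSV. No extra pipeline is needed for that: Scrapy's built-in feed exports can write CSV (or JSON) straight from the command line:
scrapy crawl P101 -o data.csv
One caveat: the spider yields two item types, so the CSV columns can come out inconsistent; setting FEED_EXPORT_FIELDS in settings.py pins down which fields are exported and in what order.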
The Resulting Data
{"name": "\u9a6c\u5174\u94b0", "engname": "maxingyu", "starid": "1661557", "pic": "http://puui.qpic.cn/media_img/0/null1524465404/0", "images": "maxingyu.png", "image_urls": ["http://puui.qpic.cn/media_img/0/null1524465404/0"]}
{"name": "\u5218\u601d\u7ea4", "engname": "liusixian", "starid": "1642387", "pic": "http://puui.qpic.cn/media_img/0/null1524465187/0", "images": "liusixian.png", "image_urls": ["http://puui.qpic.cn/media_img/0/null1524465187/0"]}
{"name": "\u5f20\u695a\u5bd2", "engname": "zhangchuhan", "starid": "1661544", "pic": "http://puui.qpic.cn/media_img/0/null1524466277/0", "images": "zhangchuhan.png", "image_urls": ["http://puui.qpic.cn/media_img/0/null1524466277/0"]}
{"name": "\u5411\u4fde\u661f", "engname": "xiangyuxing", "starid": "1572221", "pic": "http://puui.qpic.cn/media_img/0/null1524465963/0", "images": "xiangyuxing.png", "image_urls": ["http://puui.qpic.cn/media_img/0/null1524465963/0"]}
{"name": "\u5434\u831c", "engname": "wuqian", "starid": "1661559", "pic": "http://puui.qpic.cn/media_img/0/null1524465836/0", "images": "wuqian.png", "image_urls": ["http://puui.qpic.cn/media_img/0/null1524465836/0"]}
{"name": "\u5c39\u854a", "engname": "yinrui", "starid": "1661563", "pic": "http://puui.qpic.cn/media_img/0/null1524466237/0", "images": "yinrui.png", "image_urls": ["http://puui.qpic.cn/media_img/0/null1524466237/0"]}
{"epsdata": ",8,8,10,9,9,9,12,,", "name": "\u5085\u83c1", "engname": "fujing", "starid": "1661523", "height": "168", "weight": "46kg", "hometown": "\u4e0a\u6d77"}
{"epsdata": ",,94,90,55,36,23,2,,", "name": "\u738b\u83ca", "engname": "wangju", "starid": "1661570", "height": "165", "weight": "60kg", "hometown": "\u4e0a\u6d77"}
{"epsdata": ",14,16,17,29,25,26,26,,", "name": "\u5434\u6620\u9999", "engname": "wuyingxiang", "starid": "1512788", "height": "164", "weight": "64kg", "hometown": "\u5723\u4fdd\u7f57"}
{"epsdata": ",69,66,47,40,47,48,,,", "name": "\u52fe\u96ea\u83b9", "engname": "gouxueying", "starid": "1597083", "height": "164", "weight": "46kg", "hometown": "\u5317\u4eac"}
{"epsdata": ",42,43,42,48,50,51,,,", "name": "\u5f20\u6eaa", "engname": "zhangxi", "starid": "1661547", "height": "163", "weight": "45kg", "hometown": "\u6dc4\u535a"}
{"epsdata": ",45,50,39,53,54,58,,,", "name": "\u5c39\u854a", "engname": "yinrui", "starid": "1661563", "height": "166", "weight": "42kg", "hometown": "\u91cd\u5e86"}
Closing Remarks
Crawling is the most fundamental data-collection technique, and essential knowledge for anyone working with big data.