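I have been busy lately, so the Sina Weibo crawler I started writing last week dragged on until now. I have to say, though, that Sina Weibo's anti-scraping measures have truly earned my respect.
Preparation before crawling
Boss 向右奔跑 said there would be no restriction this time on what to crawl, but he gave a reference that anyone interested could try:
When I saw it I thought it looked interesting and worth a try. My idea was to pick a user with many followers, parse his profile information, then parse his followers' followers, and so on. (Parsing the followers of the people the initial user follows might work even better: since he has many followers himself, the people he follows are unlikely to have few.) Later on I wanted to give the idea up, because I ran into a whole pile of problems. Enough preamble; here is the information I scraped: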
- Information scraped:
- 1. Weibo page title
- 2. Weibo nickname
- 3. User id
- 4. Weibo level
- 5. Region
- 6. School graduated from
- 7. Following count + URL
- 8. Follower count + URL
- 9. Post count + URL
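That is roughly everything I collect. Many users' profiles are incomplete, so I scrape just these fields for testing first.
A basic approach
With the target fields settled, the next step is parsing the page, and here a big difficulty appears. As far as I have seen so far, there are two ways to get data out of a page: 1. parse the page source, 2. capture the requests (JSON). Sina Weibo is more troublesome: the content sits inside JavaScript and is not rendered in the raw response, so the only options are regular expressions or driving a browser with selenium. I thought about this for a while and asked 羅羅攀 whether there was another way; otherwise I would use selenium. He still recommended regular expressions because parsing is faster, with selenium as the last resort, so I had to grit my teeth and write the regexes. To check whether a regex is correct, you can use an online regex tester instead of running the code over and over.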
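As a minimal sketch of the regex route: the profile HTML arrives as an escaped string inside a script block, so every quote appears as `\"` in the raw response and the pattern has to match that backslash too. The `sample` string and the `render(...)` wrapper below are made up for illustration; they are shaped like the snippets matched later in this post and are not real Weibo output.

```python
import re

# Hypothetical fragment: HTML embedded in a JS string, so quotes are escaped as \"
sample = r'<script>render({"html":"<h1 class=\"username\">some_user<\/h1>"})</script>'

# In a raw-string pattern, \\ matches one literal backslash and " matches the quote.
pattern = re.compile(r'class=\\"username\\">(.+?)<')

names = pattern.findall(sample)
print(names)
```

Compiling the pattern once and testing it against a saved copy of the response gives the same quick feedback as an online tester, without sending another request.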
To locate these fields I stared at the source until my head hurt. There is actually a quick way: Ctrl+F.
![搜索框](http://upload-images.jianshu.io/upload_images/5208064-c9c156d6a70f76ab.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
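Now that we know where each field is, the next step is writing the regexes that match them. There is no shortcut here; you have to write them yourself, which is good regex practice.
URLs + follower pagination
Profile page URL
Let's look at an example first: http://weibo.com/p/1005051497035431/home?from=page_100505&mod=TAB&is_hot=1#place
A heads-up about this URL: requesting it directly does not return the profile page. In the source returned during testing, however, there is a location redirect link that replaces the part from # onward with &retcode=6102, so the URL should be: http://weibo.com/p/1005051497035431/home?from=page_100505&mod=TAB&is_hot=1&retcode=6102
I clicked that link to test it, and the content was the same as with the first link. One more point: in every link we extract later, the part from # onward has to be replaced as well. Here is an example: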
urls = re.findall(r'class=\\"t_link S_txt1\\" href=\\"(.*?)\\"',data)
careUrl = urls[0].replace('\\','').replace('#place','&retcode=6102')
fansUrl = urls[1].replace('\\','').replace('#place','&retcode=6102')
wbUrl = urls[2].replace('\\','').replace('#place','&retcode=6102')
Without this replacement, requesting the extracted URLs still does not return the source we need.
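The three replacements above can be folded into one helper. This is only a sketch: `clean_url` is a hypothetical name, and it assumes that dropping any fragment (not just `#place`) and appending `retcode=6102` is acceptable for every link, which this post has only checked for the URLs shown.

```python
def clean_url(raw):
    """Unescape a URL captured from the escaped source and swap its #fragment for retcode=6102."""
    url = raw.replace('\\', '')          # remove the JS escaping backslashes
    base = url.split('#', 1)[0]          # drop everything from # onward
    sep = '&' if '?' in base else '?'    # keep the query string well-formed
    return base + sep + 'retcode=6102'


# Hypothetical captured value, escaped the way it appears inside the script block.
raw = r'http:\/\/weibo.com\/p\/1005051497035431\/follow?from=page_100505&wvr=6&mod=headfollow#place'
print(clean_url(raw))
```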
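Follower pagination
I had hoped that parsing one user's followers would yield a large amount of data, but I ran into a system limit (while crawling, no data came back after page 5).
A system limit, of all things. So only 100 followers can be viewed, and there was nothing to do but keep going. That means we only need to consider 5 pages of data: if the total is more than 5 pages, treat it as 5; if it is fewer than 5, handle it normally. With that settled, the next thing to solve is the pagination links. Compare three URLs: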
- 1.http://weibo.com/p/1005051110411735/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans#place
- 2.http://weibo.com/p/1005051110411735/follow?relate=fans&page=2#Pl_Official_HisRelation__60
- 3.http://weibo.com/p/1005051110411735/follow?relate=fans&page=3#Pl_Official_HisRelation__60
Comparing these three URLs, the difference is in the latter part. Besides replacing the part from # onward with &retcode=6102, as mentioned above, one more change is needed: the query string after follow?. With that change in place, we build the URLs starting from page 2.
Example code:
urls = ['http://weibo.com/p/1005051497035431/follow?relate=fans&page={}&retcode=6102'.format(i) for i in range(2,int(pages)+1)]
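A slightly more general sketch of the same idea, with the five-page cap built in. `fan_page_urls` and `MAX_PAGES` are hypothetical names, and the cap of 5 is simply what I observed while crawling, not a documented limit.

```python
MAX_PAGES = 5  # pages beyond the fifth returned no data in my tests


def fan_page_urls(page_id, total_pages):
    """Build follower-page URLs for pages 2..min(total_pages, MAX_PAGES)."""
    last = min(int(total_pages), MAX_PAGES)
    template = 'http://weibo.com/p/{}/follow?relate=fans&page={}&retcode=6102'
    return [template.format(page_id, page) for page in range(2, last + 1)]


urls = fan_page_urls('1005051497035431', 12)
for url in urls:
    print(url)
```

Page 1 is left out because it is the page already fetched to read the total page count.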
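That settles URL pagination, so one hard problem is solved. If you think this is all there is to Sina Weibo's anti-scraping, you are being naive; read on.
A thorny road
The whole scraping process was one pitfall after another. So far I have covered how to get the data and the URL and follower pagination issues; now let's look at some of Sina Weibo's anti-scraping measures:
First, requests must carry cookies for authentication. That is normal enough, but cookies are no cure-all here because they expire. This caused no trouble when fetching profile information, but the cookies did expire while fetching follower pages. After thinking about it for a long time, I solved it by simulating login with selenium; I will cover that in detail later. In any case, keep this point in mind.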
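The refresh-on-expiry idea can be sketched without any network access. Everything here is an assumption made for illustration: `fetch_fans`, `fake_fetch` and the `'SUB'` values are hypothetical, `fetch` stands in for a `requests.get(...).text` call, and `refresh_cookies` stands in for the selenium login. An empty result may also just mean the user has no followers, so the number of refreshes is capped.

```python
import re

FAN_PATTERN = re.compile(r'fnick=(.+?)&f=1\\')


def fetch_fans(url, fetch, refresh_cookies, cookies, max_refresh=1):
    """Return (fan nicknames, cookies in use), refreshing cookies when a page yields no fans."""
    for attempt in range(max_refresh + 1):
        html = fetch(url, cookies)
        fans = FAN_PATTERN.findall(html)   # findall returns [] when nothing matches
        if fans:
            return fans, cookies
        if attempt < max_refresh:
            cookies = refresh_cookies()    # e.g. log in again and read fresh cookies
    return [], cookies


def fake_fetch(url, cookies):
    """Stand-in for the real request: only 'fresh' cookies get the follower markup."""
    if cookies.get('SUB') == 'fresh':
        return r'action-data=\"uid=1&fnick=alice&f=1\"'
    return '<html>login required</html>'


fans, cookies = fetch_fans('http://example.invalid/fans', fake_fetch,
                           lambda: {'SUB': 'fresh'}, {'SUB': 'stale'})
print(fans)
```

Returning the cookies alongside the result lets the caller keep using the refreshed set for the following pages.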
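Second, not every user's page source is the same. The most obvious case is one you can check yourself: log in to Weibo and compare the source of the pagination section of your own followers page with that of a user you searched for. Other parts of the source differ as well. All I really want to say is that big companies know what they are doing.
Look carefully and you should be able to spot the differences. Overall, Sina Weibo is quite hard to crawl.
Code
The code is honestly not in good shape and has quite a few problems, but I can post the core code for reference and discussion (I feel it is a bit messy).
About the code structure: two main classes and three helper classes.
Two main classes: the first parses follower ids, and the other parses detailed information (before parsing, it checks whether the id has already been parsed).
Three helper classes: the first simulates login and returns cookies (during crawling it seemed to be called only once, which may be a problem in the code), the second returns a random proxy, and the third writes profile information to MySQL.
Below is the source of the two main classes; they still run after everything related to the helper classes is removed.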
1.fansSpider.py
# -*- coding: utf-8 -*-
import requests
import re
import random
from proxy import Proxy
from getCookie import COOKIE
from time import sleep
from store_mysql import Mysql
from weibo_spider import weiboSpider


class fansSpider(object):
    headers = [
{"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"},
{"user-agent": "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1"},
{"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3"},
{"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"},
{"user-agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"}
]
    def __init__(self):
        self.wbspider = weiboSpider()
        self.proxie = Proxy()
        self.cookie = COOKIE()
        self.cookies = self.cookie.getcookie()
        field = ['id']
        self.mysql = Mysql('sinaid', field, len(field) + 1)
        self.key = 1

    def getData(self, url):
        self.url = url
        proxies = self.proxie.popip()
        print(self.cookies)
        print(proxies)
        # Probe the proxy against a known-good site and swap it until one works.
        r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
        while r.status_code != requests.codes.ok:
            proxies = self.proxie.popip()
            r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
        data = requests.get(self.url, headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies, timeout=20).text
        # print(data)
        infos = re.findall(r'fnick=(.+?)&f=1\\', data)
        # re.findall returns an empty list (never None) when nothing matches,
        # so test for emptiness to detect an expired cookie.
        if not infos:
            self.cookies = self.cookie.getcookie()
            data = requests.get(self.url, headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies,
                                timeout=20).text
            infos = re.findall(r'fnick=(.+?)&f=1\\', data)
        fans = []
        for info in infos:
            fans.append(info.split('&')[0])
        try:
            totalpage = re.findall(r'Pl_Official_HisRelation__6\d+\\">(\d+)<', data)[-1]
            print(totalpage)
        except IndexError:
            totalpage = 1
        # totalpage = re.findall(r'Pl_Official_HisRelation__\d+\\">(\d+)<', data)[-1]
        Id = re.findall(r'usercard=\\"id=(\d+)&', data)
        # Keep the ids at odd indices, assuming each fan's id appears twice in the source.
        self.totalid = [Id[i] for i in range(1, len(fans) * 2 + 1, 2)]
        if int(totalpage) == 1:
            for one in self.totalid:
                self.wbspider.getUserData(one)
            item = {}
            for one in self.totalid:
                item[1] = one
                self.mysql.insert(item)
                fansurl = 'http://weibo.com/p/100505' + one + '/follow?from=page_100505&wvr=6&mod=headfollow&retcode=6102'
                # fansurl = 'http://weibo.com/p/100505' + one + '/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
                self.getData(fansurl)
        else:
            # Only the first 5 pages return data, so cap the page count at 5.
            self.mulpage(min(int(totalpage), 5))
# if self.key == 1:
# self.mulpage(totalpage)
# else:
# self.carepage(totalpage)
# def carepage(self,pages):
# #self.key=1
# urls = ['http://weibo.com/p/1005051497035431/follow?page={}&retcode=6102'.format(i) for i in range(2, int(pages) + 1)]
# for url in urls:
# sleep(2)
# print url.split('&')[-2]
# proxies = self.proxie.popip()
# r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
# print r.status_code
# while r.status_code != requests.codes.ok:
# proxies = self.proxie.popip()
# r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
# data = requests.get(url, headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies,
# timeout=20).text
# # print data
# infos = re.findall(r'fnick=(.+?)&f=1\\', data)
# if infos is None:
# self.cookies = self.cookie.getcookie()
# data = requests.get(self.url, headers=random.choice(self.headers), cookies=self.cookies,
# proxies=proxies,
# timeout=20).text
# infos = re.findall(r'fnick=(.+?)&f=1\\', data)
# fans = []
# for info in infos:
# fans.append(info.split('&')[0])
# Id = [one for one in re.findall(r'usercard=\\"id=(\d+)&', data)]
# totalid = [Id[i] for i in range(1, len(fans) * 2 + 1, 2)]
# for one in totalid:
# # print one
# self.totalid.append(one)
# for one in self.totalid:
# sleep(1)
# self.wbspider.getUserData(one)
# item = {}
# for one in self.totalid:
# item[1] = one
# self.mysql.insert(item)
# fansurl = 'http://weibo.com/p/100505'+one+'/follow?from=page_100505&wvr=6&mod=headfollow&retcode=6102'
# #fansurl = 'http://weibo.com/p/100505' + one + '/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
# fan.getData(fansurl)
    def mulpage(self, pages):
        # self.key = 2
        # Build the page URLs for the user currently being crawled.
        pageid = re.findall(r'/p/(\d+)/', self.url)[0]
        urls = ['http://weibo.com/p/{}/follow?relate=fans&page={}&retcode=6102'.format(pageid, i) for i in range(2, int(pages) + 1)]
        for url in urls:
            sleep(2)
            print(url.split('&')[-2])
            proxies = self.proxie.popip()
            r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
            print(r.status_code)
            while r.status_code != requests.codes.ok:
                proxies = self.proxie.popip()
                r = requests.get("https://www.baidu.com", headers=random.choice(self.headers), proxies=proxies)
            data = requests.get(url, headers=random.choice(self.headers), cookies=self.cookies, proxies=proxies,
                                timeout=20).text
            # print(data)
            infos = re.findall(r'fnick=(.+?)&f=1\\', data)
            # An empty result is treated as an expired cookie: refresh it and retry this page.
            if not infos:
                self.cookies = self.cookie.getcookie()
                data = requests.get(url, headers=random.choice(self.headers), cookies=self.cookies,
                                    proxies=proxies,
                                    timeout=20).text
                infos = re.findall(r'fnick=(.+?)&f=1\\', data)
            fans = []
            for info in infos:
                fans.append(info.split('&')[0])
            Id = re.findall(r'usercard=\\"id=(\d+)&', data)
            totalid = [Id[i] for i in range(1, len(fans) * 2 + 1, 2)]
            for one in totalid:
                # print(one)
                self.totalid.append(one)
        # Parse the profile of every follower collected from all pages.
        for one in self.totalid:
            sleep(1)
            self.wbspider.getUserData(one)
        item = {}
        for one in self.totalid:
            item[1] = one
            self.mysql.insert(item)
            # fansurl = 'http://weibo.com/p/1005055847228592/follow?from=page_100505&wvr=6&mod=headfollow&retcode=6102'
            fansurl = 'http://weibo.com/p/100505' + one + '/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
            self.getData(fansurl)


if __name__ == "__main__":
    url = 'http://weibo.com/p/1005051497035431/follow?relate=fans&from=100505&wvr=6&mod=headfans&current=fans&retcode=6102'
    fan = fansSpider()
    fan.getData(url)
<em>Part of the middle is commented out because the code is still being debugged; just refer to the regexes and the handling techniques.</em>
2.weibo_spider.py
# -*- coding: utf-8 -*-
import requests
import re
from store_mysql import Mysql
import MySQLdb


class weiboSpider(object):
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
    }
    cookies = {
'TC-Page-G0':'1bbd8b9d418fd852a6ba73de929b3d0c',
'login_sid_t':'0554454a652ee2a19c672e92ecee3220',
'_s_tentry':'-',
'Apache':'8598167916889.414.1493773707704',
'SINAGLOBAL':'8598167916889.414.1493773707704',
'ULV':'1493773707718:1:1:1:8598167916889.414.1493773707704:',
'SCF':'An3a20Qu9caOfsjo36dVvRQh7tKzwKwWXX7CdmypYAwRoCoWM94zrQyZ-5QJPjjDRpp2fBxA_9d6-06C8vLD490.',
'SUB':'_2A250DV37DeThGeNO7FEX9i3IyziIHXVXe8gzrDV8PUNbmtAKLWbEkW8qBangfcJP4zc_n3aYnbcaf1aVNA..',
'SUBP':'0033WrSXqPxfM725Ws9jqgMF55529P9D9WhR6nHCyWoXhugM0PU8VZAu5JpX5K2hUgL.Fo-7S0ecSoeXehB2dJLoI7pX9PiEIgij9gpD9J-t',
'SUHB':'0jBY7fPNWFbwRJ',
'ALF':'1494378549',
'SSOLoginState':'1493773739',
'wvr':'6',
'UOR':',www.weibo.com,spr_sinamkt_buy_lhykj_weibo_t111',
'YF-Page-G0':'19f6802eb103b391998cb31325aed3bc',
'un':'fengshengjie5 @ live.com'
}
    def __init__(self):
        field = ['title', 'name', 'id', 'wblevel', 'addr', 'graduate', 'care', 'careurl', 'fans', 'fansurl', 'wbcount',
                 'wburl']
        conn = MySQLdb.connect(user='root', passwd='123456', db='zhihu', charset='utf8')
        conn.autocommit(True)
        self.cursor = conn.cursor()
        self.mysql = Mysql('sina', field, len(field) + 1)

    def getUserData(self, id):
        # Skip ids that are already stored in the sina table.
        self.cursor.execute('select id from sina where id=%s', (id,))
        data = self.cursor.fetchall()
        if data:
            return
        item = {}
        # test = [5321549625,1669879400,1497035431,1265189091,5705874800,5073663404,5850521726,1776845763]
        url = 'http://weibo.com/u/' + id + '?topnav=1&wvr=6&retcode=6102'
        data = requests.get(url, headers=self.headers, cookies=self.cookies).text
        # print(data)
        id = url.split('?')[0].split('/')[-1]
        try:
            title = re.findall(r'<title>(.*?)</title>', data)[0]
            title = title.split('_')[0]
        except IndexError:
            title = u''
        try:
            name = re.findall(r'class=\\"username\\">(.+?)<', data)[0]
        except IndexError:
            name = u''
        try:
            # The three counters appear in order: following, followers, posts.
            totals = re.findall(r'class=\\"W_f\d+\\">(\d*)<', data)
            care = totals[0]
            fans = totals[1]
            wbcount = totals[2]
        except IndexError:
            care = u''
            fans = u''
            wbcount = u''
        try:
            urls = re.findall(r'class=\\"t_link S_txt1\\" href=\\"(.*?)\\"', data)
            careUrl = urls[0].replace('\\', '').replace('#place', '&retcode=6102')
            fansUrl = urls[1].replace('\\', '').replace('#place', '&retcode=6102')
            wbUrl = urls[2].replace('\\', '').replace('#place', '&retcode=6102')
        except IndexError:
            careUrl = u''
            fansUrl = u''
            wbUrl = u''
        profile = re.findall(r'class=\\"item_text W_fl\\">(.+?)<', data)
        try:
            wblevel = re.findall(r'title=\\"(.*?)\\"', profile[0])[0]
            addr = re.findall(u'[\u4e00-\u9fa5]+', profile[1])[0]  # address
        except IndexError:
            # Some pages use a different layout: the level sits in another block
            # and the address is the first profile item.
            profile1 = re.findall(r'class=\\"icon_group S_line1 W_fl\\">(.+?)<', data)
            try:
                wblevel = re.findall(r'title=\\"(.*?)\\"', profile1[0])[0]
            except IndexError:
                wblevel = u''
            try:
                addr = re.findall(u'[\u4e00-\u9fa5]+', profile[0])[0]
            except IndexError:
                addr = u''
        try:
            graduate = re.findall(r'profile&wvr=6\\">(.*?)<', data)[0]
        except IndexError:
            graduate = u''
        item[1] = title
        item[2] = name
        item[3] = id
        item[4] = wblevel
        item[5] = addr
        item[6] = graduate
        item[7] = care
        item[8] = careUrl
        item[9] = fans
        item[10] = fansUrl
        item[11] = wbcount
        item[12] = wbUrl
        self.mysql.insert(item)
<em>It is rather messy, so please bear with it; as I said, this is only a demo.</em>
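The repeated try/except blocks in getUserData could be folded into a small helper. This is only a sketch of one way to tidy it up: `first_match` is a hypothetical name, and the sample string is made up rather than taken from a real response.

```python
import re


def first_match(pattern, text, default=u''):
    """Return group 1 of the first match of pattern in text, or default when there is none."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


# Hypothetical page fragment for illustration.
sample = '<title>alice_weibo</title>'
title = first_match(r'<title>(.*?)</title>', sample).split('_')[0]
graduate = first_match(r'profile&wvr=6\\">(.*?)<', sample)
print(repr(title), repr(graduate))
```

With a single capture group, `re.search(...).group(1)` gives the same value as `re.findall(...)[0]`, so fields that are missing from a profile simply fall back to the default.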
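One of the helper classes stores data in MySQL (see Mr_Cxy's short example of operating a MySQL database with Python). The other two, for the random proxy and for getting cookies, will be explained in detail in the next post.
Run output + data results
<em>200 is the status code, meaning the request succeeded.</em>
Summary
Sina Weibo is the site where I have hit the most problems so far, but I also learned a lot, such as regular expressions and random proxies. While learning, the more problems you run into, the more you accumulate and the faster you improve, so running into problems and errors is a blessing. Here are the problems that came up while running the code (let's work on them together):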
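- 1. Two ids keep looping; the loop logic may be at fault. Discussion is welcome, and I will update the post once it is solved.
- 2. Parsing speed (single-threaded is slow; a scrapy version will follow).
- 3. Deduplication (currently parsed ids are written to the database and checked before parsing).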
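For the first and third points, one option worth trying is to replace the recursive calls with a queue plus an in-memory set of ids already seen, so that followers who follow each other cannot send the crawl in circles. This is a sketch under assumptions, not a confirmed fix for the loop above: `crawl`, `get_fans` and `handle_user` are hypothetical names, and the small `graph` dict stands in for real follower pages.

```python
from collections import deque


def crawl(start_id, get_fans, handle_user, max_users=1000):
    """Breadth-first crawl that visits each user id at most once."""
    seen = {start_id}
    queue = deque([start_id])
    visited = []
    while queue and len(visited) < max_users:
        uid = queue.popleft()
        handle_user(uid)                 # e.g. parse and store the profile
        visited.append(uid)
        for fan_id in get_fans(uid):     # e.g. ids parsed from the follower pages
            if fan_id not in seen:
                seen.add(fan_id)
                queue.append(fan_id)
    return visited


# Hypothetical follower graph containing cycles (a <-> b, a <-> c).
graph = {'a': ['b', 'c'], 'b': ['a'], 'c': ['b', 'a']}
order = crawl('a', lambda uid: graph.get(uid, []), lambda uid: None)
print(order)
```

The set lives only in memory, so the database check is still needed to skip ids from earlier runs.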
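That is roughly the simple approach. It still has some problems and can serve as a reference; anyone with questions is welcome to discuss them (message me privately for the full source).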