This is the second post in my Python crawler series; the goal is the gender, region, and other profile details of Sina Weibo commenters. Corrections are welcome wherever the writing falls short.
Let's start by figuring out where the data lives.
Profile pages come in two forms. If the user has not set a custom domain, the URL is the default Weibo ID form shown in Figure 1 (weibo.cn/u/**********); otherwise it is the custom-domain form shown in Figure 2 (weibo.cn/purdence520). Because the value we scrape may be either an ID or a custom domain, we have to tell the two apart before fetching the info page.
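As an aside, the simplest way to tell the two forms apart is a digit check, since default IDs are purely numeric. This little helper is only an illustration, not code from the project; get_page below instead just tries the ID form and falls back when the status code is not 200:

def is_weibo_id(domain):
    # Hypothetical helper, not part of the repo: default IDs
    # (weibo.cn/u/1234567890) are all digits, custom domains
    # (weibo.cn/purdence520) are not.
    return str(domain).isdigit()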
Now look at the page source: the information all sits inside the fifth child tag (0-indexed) of the div with class="c". The code:
def get_page(self, domain, num):
    # Try the direct /<id>/info URL first; this works when `domain`
    # is already a numeric ID.
    url = 'https://weibo.cn/{}/info'.format(domain)
    print(url)
    try:
        req = requests.get(url, headers=self.header, timeout=5,
                           cookies=self.cookie[2])
        soup = BeautifulSoup(req.text, 'lxml')
        if req.status_code == 200:
            return soup
        else:
            # `domain` is a custom domain: fetch its profile page,
            # pull the numeric ID out of the /<id>/info link and retry.
            print(req.status_code)
            url = 'https://weibo.cn/{}'.format(domain)
            req = requests.get(url, timeout=5,
                               cookies=self.cookie[self.cg_id],
                               headers=self.header)
            soup = BeautifulSoup(req.text, 'lxml')
            domain = re.compile(r'/(\d+)/info').findall(str(soup))[0]
            return self.get_page(domain, num)
    except Exception:
        raise
This method fetches the info page, deciding between an ID page and a custom-domain page. The domain parameter is the ID or custom domain; num is the auto-increment column in the database, used to locate the row. If the id/info URL yields the information, the fetched page is returned; otherwise the custom-domain URL is fetched to recover the numeric ID, and the method recurses until it returns an info page.
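For reference, the profile block described earlier (the fifth child tag of the div with class="c") could also be reached by walking the tree instead of regex-matching the whole page. This is only a sketch; the index comes from the description above and will break if Weibo changes its markup:

# Sketch only: navigate to the profile block instead of regexing str(soup).
div_c = soup.find('div', class_='c')
# Keep tag children only (skip bare text nodes), then take the
# "fifth child tag (0-indexed)" mentioned above.
tags = [child for child in div_c.children if getattr(child, 'name', None)]
info_tag = tags[5]
print(info_tag.get_text())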
The fields are matched with regular expressions; anyone who has not filled in a birthday gets 'none', and the tools package handles the database operations:
def get_sab(self, q):
    while True:
        num = q.get()
        self.user_domain = tools.s_domain(num)
        soup = self.get_page(self.user_domain, num)
        try:
            # 性别/地区/生日 are the gender/region/birthday labels
            # on the simplified-Chinese weibo.cn profile page.
            self.user_sex = re.findall(r'性别:(.*?)<br', str(soup))[0]
            self.user_area = re.findall(r'地区:(.*?)<br', str(soup))[0]
            self.user_birth = re.findall(r'生日:(.*?)<br', str(soup))[0]
        except Exception:
            # Profiles without a birthday make the last findall raise
            # IndexError; default the field to 'none'.
            self.user_birth = 'none'
        print(mp.current_process().name, num, self.user_sex,
              self.user_area, self.user_birth)
        tools.i_sab((self.user_sex, self.user_area, self.user_birth, num))
        sleep(randint(1, 3))
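The tools module isn't shown in this post (the repo linked at the end has the real implementation). Purely as a sketch of what the two calls need to do, here is a sqlite version; the users table and its columns are my assumptions, not the project's actual schema:

import sqlite3

DB_PATH = 'weibo.db'  # assumed path; the real project may differ

def s_domain(num):
    # Look up the ID/custom domain stored at auto-increment row `num`.
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute('SELECT domain FROM users WHERE id = ?',
                           (num,)).fetchone()
        return row[0] if row else None

def i_sab(values):
    # Write (sex, area, birth, num) back to the same row.
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute('UPDATE users SET sex = ?, area = ?, birth = ? '
                     'WHERE id = ?', values)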
A Queue feeds in the database's auto-increment num values, which are used to fetch each domain from the database:
def set_num(self, q):
    # Producer: push auto-increment row numbers into the shared queue.
    global num
    while True:
        q.put(num)
        print(num, 'put')
        num += 1
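To show how the pieces fit together, here is a hedged sketch of the wiring; the class name Crawler, the worker count, and the queue size are assumptions, and everything is taken to live in one module since set_num reads the module-level num:

import multiprocessing as mp

num = 1  # module-level counter consumed by set_num via `global num`

if __name__ == '__main__':
    crawler = Crawler()      # assumed name for the class holding the methods
    q = mp.Queue(maxsize=50)  # bounded, so q.put blocks when workers fall behind
    mp.Process(target=crawler.set_num, args=(q,)).start()
    for _ in range(4):       # a handful of workers; tune to taste
        mp.Process(target=crawler.get_sab, args=(q,)).start()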
Source on GitHub: https://github.com/matianhe/crawler