It has been almost a week since I joined the second session of 向右奔跑's web-scraping study group. The three assignments I completed earlier were all exercises in Python program logic, and although I had tried scraping simple pages with BeautifulSoup before, I had never attempted to extract structured data from a web page. After working through my classmates' assignments and the Jianshu articles recommended by Teacher Peng over the past few days, today I tried crawling Qiushibaike (the fields are author, gender, age, joke content, laugh count, and comment count) and inserting the results into a MySQL database.
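The INSERT statement in the code below assumes a table named qiushibaike_tbl already exists in the test database; the crawler itself never creates it. A minimal one-off setup sketch (the column names follow the INSERT statement, the column types are my own assumptions):

import pymysql

connect = pymysql.connect(host='localhost', user='root', passwd='root',
                          db='test', port=3306, charset='utf8')
cursor = connect.cursor()
# Column names match the INSERT used by the crawler; the types are only guesses.
cursor.execute("""
    create table if not exists qiushibaike_tbl (
        author      varchar(255),
        gender      varchar(10),
        age         varchar(10),
        content     text,
        funny_num   int,
        comment_num int
    ) default charset=utf8
""")
connect.close()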
The crawler code is as follows:
import requests
from bs4 import BeautifulSoup
import re
import pymysql
# Connect to the MySQL database
connect = pymysql.connect(
    host='localhost',
    user='root',
    passwd='root',
    db='test',
    port=3306,
    charset='utf8'
)
# Get a cursor
cursor = connect.cursor()
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

def get_data(url):
    # Fetch the page, passing the headers so the request looks like a browser
    html = requests.get(url, headers=headers).content
    # Parse the page content with lxml and hand the result to soup
    soup = BeautifulSoup(html, 'lxml')
    # find_all returns every div that wraps one joke item
    div_list = soup.find_all(name='div', class_='article block untagged mb15')
    # Loop over the divs and pull out the target fields
    for div in div_list:
        # Author name
        author = div.find('h2').text
        # Gender and age sit in a div whose class matches "articleGender ..."
        # (anonymous users have no such div and need special handling)
        genders = div.find(name='div', class_=re.compile('articleGender .*'))
        if genders is None:
            gender = 'Null'
            age = 'Null'
        else:
            gender = genders.attrs['class'][1][:-4]
            age = genders.text
        # Joke content
        content = div.find('span').text
        # Laugh count
        funny_nums = div.find(name='span', class_='stats-vote').find('i')
        if funny_nums is None:
            funny_num = 0
        else:
            funny_num = funny_nums.text
        # Comment count
        comment_nums = div.find(name='span', class_='stats-comments').find('i')
        if comment_nums is None:
            comment_num = 0
        else:
            comment_num = comment_nums.text
        # Select the test database
        cursor.execute("use test")
        # Run the INSERT statement
        cursor.execute("insert into qiushibaike_tbl (author,gender,age,content,funny_num,comment_num) values(%s,%s,%s,%s,%s,%s)",
                       (author, gender, age, content, funny_num, comment_num))
        # Without commit the rows are never written
        connect.commit()

if __name__ == '__main__':
    # Build the URLs for pages 1 to 35
    for i in range(1, 36):
        url = 'http://www.qiushibaike.com/text/page/' + str(i) + '/?s=4986178'
        get_data(url)
    # Close the database connection and release resources
    connect.close()
    print('SUCCESS!')
This was my first time scraping structured data from a web page, and almost every block of code carries a comment. The crawler successfully pulled 699 records from 35 pages of Qiushibaike and stored them in a locally hosted MySQL database. One problem remains that I would like to ask about: anonymous users have no age or gender, which the code detects and handles directly, but their joke content was never scraped or inserted into the database, and the program ran without reporting any error. If anyone can see what is wrong, please leave a comment and point it out. Many thanks!
Thanks to 羅羅攀 and Mr_Cxy for answering the question above. The root cause was that I had not analysed the page structure carefully enough: the HTML for anonymous users differs from that of regular users, so content = div.find('span').text cannot fetch an anonymous user's joke content. The two cases have to be handled separately; an anonymous user's content can be fetched with content = div.find_all('span')[2].text or div.select('span')[2].text. A short illustration of the difference comes next, followed by the full revised program.
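A minimal sketch of the idea; the two HTML snippets below are simplified stand-ins for the real Qiushibaike markup, not copies of it:

from bs4 import BeautifulSoup

# Stand-in for a regular user's item: the first <span> already holds the joke.
normal = BeautifulSoup('<div><h2>someone</h2><span>the joke text</span></div>', 'lxml')
print(normal.find('span').text)        # -> the joke text

# Stand-in for an anonymous item: extra <span>s come before the joke,
# so find('span') grabs the wrong one and the third span is the content.
anon = BeautifulSoup('<div><span>icon</span><h2>匿名用戶</h2><span>age</span>'
                     '<span>the joke text</span></div>', 'lxml')
print(anon.find('span').text)          # -> icon
print(anon.find_all('span')[2].text)   # -> the joke text
print(anon.select('span')[2].text)     # -> the joke text (CSS select, same index)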
import requests
from bs4 import BeautifulSoup
import re
import pymysql
# Connect to the MySQL database
connect = pymysql.connect(
    host='localhost',
    user='root',
    passwd='root',
    db='test',
    port=3306,
    charset='utf8'
)
# Get a cursor
cursor = connect.cursor()
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

def get_data(url):
    # Fetch the page, passing the headers so the request looks like a browser
    html = requests.get(url, headers=headers).content
    # Parse the page content with lxml and hand the result to soup
    soup = BeautifulSoup(html, 'lxml')
    # find_all returns every div that wraps one joke item
    div_list = soup.find_all(name='div', class_='article block untagged mb15')
    # Loop over the divs and pull out the target fields
    for div in div_list:
        # Author name
        author = div.find('h2').text
        # Gender and age sit in a div whose class matches "articleGender ..."
        # (anonymous users have no such div and need special handling)
        genders = div.find(name='div', class_=re.compile('articleGender .*'))
        if genders is None:
            gender = 'Null'
            age = 'Null'
            # Anonymous users: the joke content is the third span
            content = div.find_all('span')[2].text
        else:
            gender = genders.attrs['class'][1][:-4]
            age = genders.text
            # Regular users: the first span holds the joke content
            content = div.find('span').text
        # Laugh count
        funny_nums = div.find(name='span', class_='stats-vote').find('i')
        if funny_nums is None:
            funny_num = 0
        else:
            funny_num = funny_nums.text
        # Comment count
        comment_nums = div.find(name='span', class_='stats-comments').find('i')
        if comment_nums is None:
            comment_num = 0
        else:
            comment_num = comment_nums.text
        # Select the test database
        cursor.execute("use test")
        # Run the INSERT statement
        cursor.execute("insert into qiushibaike_tbl (author,gender,age,content,funny_num,comment_num) values(%s,%s,%s,%s,%s,%s)",
                       (author, gender, age, content, funny_num, comment_num))
        # Without commit the rows are never written
        connect.commit()

if __name__ == '__main__':
    # Build the URLs for pages 1 to 35
    for i in range(1, 36):
        url = 'http://www.qiushibaike.com/text/page/' + str(i) + '/?s=4986178'
        get_data(url)
    # Close the database connection and release resources
    connect.close()
    print('SUCCESS!')
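To double-check the load afterwards, a quick row count against the same table helps; a small sketch that reuses the connection settings above:

import pymysql

connect = pymysql.connect(host='localhost', user='root', passwd='root',
                          db='test', port=3306, charset='utf8')
cursor = connect.cursor()
# Count the inserted rows; the run described above produced 699.
cursor.execute("select count(*) from qiushibaike_tbl")
print(cursor.fetchone()[0])
connect.close()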