Key takeaways
- Wrote dictionary data into a MongoDB database.
- Querying the database: to pull records whose vote count is at least some threshold, use $gte (greater than or equal). The counts were first written as strings; wrapping them with int() at insert time stores them as numbers so the comparison actually works.
- When using select, the "查看全文" ("view full text") button could not be distinguished from the content itself. Each page holds 25 posts, but because of this bug the number of scraped items was always greater than 25. Fixed it by brute force: iterate over the list, drop every item whose text is "查看全文", and rebuild the list. Only then did it work.
-- Lesson learned, and an open question I searched Baidu for a long time without solving: how do you exclude elements with a specific class inside select?
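The open question above does have an answer in newer BeautifulSoup releases: select() is backed by the soupsieve library, so the CSS :not() pseudo-class can exclude elements by class. A minimal sketch; the class name contentForAll and the markup are assumptions for illustration, not taken from the real qiushibaike page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "contentForAll" is an assumed class name
# marking the "查看全文" span, not the real page's class.
html = """
<a class="contentHerf">
  <div>
    <span>joke text here</span>
    <span class="contentForAll">查看全文</span>
  </div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
# :not(.contentForAll) drops every span carrying the unwanted class,
# so no manual filtering loop is needed afterwards
spans = soup.select("a.contentHerf > div > span:not(.contentForAll)")
texts = [s.get_text() for s in spans]
print(texts)
```

With this, the "rebuild the list by hand" workaround in the notes above becomes unnecessary, provided the unwanted element really does carry a distinguishing class.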
```python
import time

import requests
import pymongo
import lxml  # imported so a missing lxml parser fails fast
from bs4 import BeautifulSoup

client = pymongo.MongoClient('localhost', 27017)
donger = client['donger']
sheet_2 = donger['sheet_2']

urls = ["https://www.qiushibaike.com/text/page/{}/".format(str(i)) for i in range(1, 14)]

def getone_url(url):
    web_data = requests.get(url)
    neirongs = []
    soup = BeautifulSoup(web_data.text, 'lxml')
    authors = soup.select('div > a > h2')
    a = soup.select('a.contentHerf > div > span')
    # The content spans and the "查看全文" ("view full text") spans match
    # the same selector, so filter the buttons out and rebuild the list.
    for i in a:
        if i.get_text() != "查看全文":
            neirongs.append(i)
    numbers = soup.select('div.stats span.stats-vote i')
    discuss = soup.select('div.stats span.stats-comments i')
    for author, neirong, number, discuss_one in zip(authors, neirongs, numbers, discuss):
        data = {
            "author": author.get_text().strip(),
            "neirong": neirong.get_text().strip(),
            # int() so numeric queries such as $gte work later
            "number": int(number.get_text()),
            "discuss": discuss_one.get_text(),
        }
        print(data)
        sheet_2.insert_one(data)

for url in urls:
    getone_url(url)
    time.sleep(2)

# for item in sheet_2.find({'number': {'$gte': 3000}}):
#     print(item['neirong'])
```
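The commented-out query above is the reason the vote counts are stored with int(): a $gte comparison against the number 3000 will not match values stored as strings, and string comparison is lexicographic anyway. A plain-Python sketch of the pitfall, with made-up vote counts:

```python
raw_votes = ["987", "1024", "45", "3000"]   # as scraped: strings
int_votes = [int(v) for v in raw_votes]     # as stored after the int() fix

# Lexicographic comparison: "987" >= "1000" because '9' > '1',
# so every string here "passes" the threshold
string_hits = [v for v in raw_votes if v >= "1000"]

# Numeric comparison behaves as intended
number_hits = [v for v in int_votes if v >= 1000]

print(string_hits)  # all four values match, which is wrong
print(number_hits)  # only the genuinely large counts match
```

Converting once at insert time keeps every later query simple: `find({'number': {'$gte': 3000}})` then compares number to number.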