Web scraping is one of Python's strengths and one of its most popular application areas, with plenty of mature scraping libraries to choose from: scrapy, selenium, beautifulsoup, and the simpler urllib and requests. This post shows how to use requests to scrape PubMed search results and batch-download every paper in the result set (provided a DOI is listed). The key lines of code are commented. Just run the script and enter your search keywords, separated by spaces; it will collect the DOIs automatically and download the papers from a sci-hub mirror into your current folder. One caveat: depending on your connection, having Python do the downloading is not very reliable, since plain requests calls do not resume interrupted downloads. For that reason the script also writes each sci-hub PDF URL to a file, so you can paste the list into a download manager such as Thunder (Xunlei) instead; if you go that route, simply comment out the section of the code that downloads the PDFs directly.
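The heart of the scraper is a single regex that pulls DOIs out of PubMed's result HTML. Here is a minimal, self-contained sketch of that pattern; the HTML snippet below is hand-written to imitate how PubMed renders citation lines ("doi: ..." terminated by a period and then a space or a tag), not fetched from the live site:

```python
import re

# Hypothetical snippet modeled on PubMed's citation markup; not a live page.
page_text = (
    '<span class="docsum-journal-citation">Nature. 2020. '
    'doi: 10.1038/s41586-020-2012-7. Epub 2020.</span>'
    '<span>Cell. 2019. doi: 10.1016/j.cell.2019.05.031.<em>'
)

# Same pattern as in the script: capture from "10." up to (but not including)
# the final period that is followed by a space or a "<".
doi_list = re.findall(r"doi: (10\..*?)\.[ <]", page_text)
print(doi_list)  # ['10.1038/s41586-020-2012-7', '10.1016/j.cell.2019.05.031']
```

The non-greedy `.*?` is what keeps the match from running past the end of one citation into the next, and the `\.[ <]` anchor is why DOIs that themselves contain periods (like `10.1016/j.cell...`) are still captured whole.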
import math
import os
import re

import requests

if __name__ == "__main__":
    # Spoof a browser User-Agent so PubMed does not reject the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
    }
    # PubMed search URL; keywords are read from stdin, separated by spaces.
    # Keep the input as a single string: PubMed expects one space-separated
    # "term" parameter, not a repeated parameter per keyword.
    url = "https://pubmed.ncbi.nlm.nih.gov/"
    term = input("Please input your keyword: ")
    # Number of results PubMed shows per page
    size = 200
    # Result page number
    page = 1
    param = {
        "term": term,
        "size": size,
        "page": page
    }
    doi_list = []
    # Fetch the first results page
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text
    # Total number of results; [\d,]+ tolerates any number of thousands separators
    results_amount = int(re.search(r"""<span class="value">([\d,]+)</span>.*?results""",
                                   page_text, re.DOTALL).group(1).replace(",", ""))
    # Extract the DOIs on this page with a regex
    doi_list += re.findall(r"""doi: (10\..*?)\.[ <]""", page_text)
    # Simulate paging: walk the remaining result pages and collect their DOIs
    total_pages = math.ceil(results_amount / size)
    for page in range(2, total_pages + 1):
        param = {
            "term": term,
            "size": size,
            "page": page
        }
        response = requests.get(url=url, params=param, headers=headers)
        page_text = response.text
        doi_list += re.findall(r"""doi: (10\..*?)\.[ <]""", page_text)
    # Download each paper from the sci-hub mirror
    for doi in doi_list:
        down_url = "https://sci.bban.top/pdf/" + doi + ".pdf"
        # Append the download URL to a file, for pasting into a download manager
        with open(r"./down_url.txt", "a") as u:
            u.write(down_url + "\n")
        # Download the PDF directly; comment out these three lines if you
        # only want the URL list
        r = requests.get(url=down_url)
        with open(f"./{os.path.basename(down_url)}", "wb") as f:
            f.write(r.content)
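On the resumability point: requests itself can continue a partial download if the server honors HTTP Range requests, which sidesteps the need for an external download manager. A sketch of the idea follows; the helper names are my own, and whether the sci-hub mirror actually supports Range requests is an assumption you would need to verify:

```python
import os

import requests


def resume_range_header(path):
    """Build a Range header asking the server to skip the bytes
    already present in a partially downloaded file."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": f"bytes={offset}-"}, offset


def download_with_resume(url, path):
    """Hypothetical resumable downloader: append to the partial file when
    the server answers 206 Partial Content, otherwise start over."""
    headers, offset = resume_range_header(path)
    r = requests.get(url, headers=headers, stream=True)
    mode = "ab" if r.status_code == 206 else "wb"
    with open(path, mode) as f:
        # stream=True + iter_content avoids holding the whole PDF in memory
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
```

A server that ignores Range simply returns 200 with the full body, in which case the `"wb"` branch rewrites the file from scratch, so the function degrades gracefully.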