Once we have decided how to build the URLs, the next step is to inspect the page's HTML structure.
We find that the Wikipedia article content sits in the element with id mw-content-text. Since we only need the entry links contained in its first p tag, the lookup path is mw-content-text -> p[0]
56565656.png
We find that the featured-link structure is as follows:
all of the entry-link a tags sit under the element with id mp-tfa,
so the find hierarchy is mp-tfa -> a -> a href
56876586575.png
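The two lookup paths above can be sketched against a minimal, made-up HTML snippet (the snippet below is illustrative only, not Wikipedia's real markup):

```python
from bs4 import BeautifulSoup

# A made-up snippet mimicking the two structures described above.
html = """
<div id="mw-content-text"><p><a href="/wiki/First">First</a></p><p>second</p></div>
<div id="mp-tfa"><a href="/wiki/Featured">Featured</a></div>
"""
bsObj = BeautifulSoup(html, 'html.parser')

# mw-content-text -> p[0]: the first paragraph of the article body
first_p = bsObj.find(id="mw-content-text").findAll("p")[0]
print(first_p.get_text())

# mp-tfa -> a -> a href: the first link under the featured block
href = bsObj.find(id="mp-tfa").find("a").attrs['href']
print(href)
```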
Collecting the data
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getlinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="mp-tfa").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getlinks(newPage)

getlinks("")
console output
09809809.png
We find that an exception is thrown as soon as the a tag is reached.
Re-checking the nesting of the featured link, we change the path to mp-tfa -> p -> b -> a href
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getlinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="mp-tfa", style="padding:2px 5px").find("p").find("b").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print("--------\n" + newPage)
                pages.add(newPage)
                getlinks(newPage)

getlinks("")
console output
7978979.png
The reason is that the earlier analysis only applies to the Main_Page. When we go on to parse the pages reached by following links, we find they contain no mp-tfa element at all
jhgjhgjh.png
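The AttributeError above comes from how BeautifulSoup's find behaves: when no element matches, find returns None, and calling .find on None raises AttributeError. A small sketch with a made-up snippet shows this failure mode:

```python
from bs4 import BeautifulSoup

# An article-style page without an mp-tfa element, like the linked pages above.
bsObj = BeautifulSoup('<div id="mw-content-text"><p>body</p></div>', 'html.parser')

# find returns None when nothing matches the id
print(bsObj.find(id="mp-tfa"))

# chaining another find onto that None raises AttributeError
try:
    bsObj.find(id="mp-tfa").find("p")
except AttributeError as e:
    print("AttributeError:", e)
```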
So we modify the lookup path to mw-content-text -> p -> a href
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getlinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="mw-content-text").find("p").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print("--------\n" + newPage)
                pages.add(newPage)
                getlinks(newPage)

getlinks("")
console output
We successfully get the entry links
867867867.png
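One caveat with the recursive getlinks above: Python's default recursion limit (around 1000 frames) means a deep enough crawl will eventually raise RecursionError. A sketch of an iterative alternative, with an explicit stack replacing the recursion; the in-memory link_graph dict here is made up for illustration and stands in for the urlopen + BeautifulSoup step:

```python
# Iterative version of the crawl: an explicit stack replaces recursion,
# so crawl depth is bounded by memory instead of Python's recursion limit.
# link_graph maps a page URL to the /wiki/ links found on that page.
link_graph = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": [],
}

def getlinks_iterative(start):
    pages = set()       # same dedup role as the global `pages` set above
    stack = [start]
    order = []          # discovery order, for inspection
    while stack:
        pageUrl = stack.pop()
        # In the real crawler, urlopen + BeautifulSoup parsing would run here.
        for newPage in link_graph.get(pageUrl, []):
            if newPage not in pages:
                pages.add(newPage)
                order.append(newPage)
                stack.append(newPage)
    return order

print(getlinks_iterative(""))
```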