In the previous two chapters, everything we did was parse the HTML structure of a single page. A real web crawler, however, follows one link to the next, building up a "map" of the web, so this time we will crawl external links.
Example site: http://oreilly.com
First, test whether we can retrieve external links:
from urllib.parse import urlparse
import random
import datetime
import re

pages = set()
# Seed with the current timestamp so each run takes a different random walk
random.seed(datetime.datetime.now().timestamp())

# Get all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
    internalLinks = []
    # Find all links that begin with "/" or contain the current domain
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith("/"):
                    internalLinks.append(includeUrl + link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

def followExternalOnly(startingPage):
    # Test stub: use a fixed external link instead of choosing one at random
    externalLink = "https://en.wikipedia.org/wiki/Intelligence_agency"
    print("Random external link is: " + externalLink)
    # Recurse; with a hard-coded link this simply repeats until interrupted
    followExternalOnly(externalLink)

# def main():
#     followExternalOnly("http://en.wikipedia.org")
#     print('End')
# if __name__ == '__main__':
#     main()
followExternalOnly("http://en.wikipedia.org")
Console output:
The recursion iterated over the external links, 56 in total.
[screenshot: console output]
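To sanity-check getInternalLinks on its own before wiring it into the random walk, a minimal sketch along these lines can be used (assuming network access and the bs4 package; the URL is just the example domain mentioned above, and the page's markup may have changed since):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://oreilly.com")
bsObj = BeautifulSoup(html, "html.parser")
# Reuses getInternalLinks as defined in the snippet above
links = getInternalLinks(bsObj, "http://oreilly.com")
print(len(links), "internal links found, e.g.:", links[:5])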
A site's front page is not guaranteed to contain any external links. From the console-output experiment in Chapter 2 we already know that a page's HTML structure may contain no external links at all.
Compare the HTML structure of https://en.wikipedia.org/wiki/Main_Page with that of https://en.wikipedia.org/wiki/Auriscalpium_vulgare:
[screenshots: HTML structure of the two pages]
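Rather than eyeballing the HTML, the difference can also be checked programmatically. The sketch below (an illustration only: it assumes both pages are reachable, and their markup may change over time) counts candidate external links on each page using the same regular expression as getExternalLinks in the full listing further down:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

for url in ["https://en.wikipedia.org/wiki/Main_Page",
            "https://en.wikipedia.org/wiki/Auriscalpium_vulgare"]:
    bsObj = BeautifulSoup(urlopen(url), "html.parser")
    # Same pattern as getExternalLinks: starts with http/www and never contains the domain
    external = bsObj.findAll("a", href=re.compile("^(http|www)((?!en.wikipedia.org).)*$"))
    print(url, "->", len(external), "candidate external links")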
The DFS logic for finding a page's external links is as follows: we search recursively, and when we encounter an external link we treat it as reaching a leaf node. If no external link is found on the current page, we step into one of its internal links instead, end the current recursion, and restart the search from that page (see the full listing below).
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import random
import datetime
import re

pages = set()
# Seed with the current timestamp so each run takes a different random walk
random.seed(datetime.datetime.now().timestamp())

# Get all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
    internalLinks = []
    # Find all links that begin with "/" or contain the current domain
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith("/"):
                    internalLinks.append(includeUrl + link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

# Get all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Find all links that start with http or www and do not contain the current domain
    for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!" + excludeUrl + ").)*$")):
        if link.attrs['href'] is not None:
            # Skip links we have already collected
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "html.parser")
    externalLinks = getExternalLinks(bsObj, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        # No external links here: back off to a random internal link and recurse
        print("No external links found, looking around the site for one")
        domain = urlparse(startingPage).scheme + "://" + urlparse(startingPage).netloc
        internalLinks = getInternalLinks(bsObj, domain)
        return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]

def followExternalOnly(startingPage):
    externalLink = getRandomExternalLink(startingPage)
    # externalLink = "https://en.wikipedia.org/wiki/Intelligence_agency"
    print("Random external link is: " + externalLink)
    followExternalOnly(externalLink)

# def main():
#     followExternalOnly("http://en.wikipedia.org")
#     print('End')
# if __name__ == '__main__':
#     main()
followExternalOnly("https://en.wikipedia.org/wiki/Main_Page")
Console output:
[screenshot: console output of the random external-link walk]
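As an aside, the negative-lookahead pattern that getExternalLinks relies on can be verified offline against a few hand-written URLs (a small sketch; the sample hrefs are made up for illustration, and the dots in the excluded domain are left unescaped, exactly as in the listing above):

import re

excludeUrl = "en.wikipedia.org"
pattern = re.compile("^(http|www)((?!" + excludeUrl + ").)*$")
for href in ["https://example.com/page",           # different domain: should match
             "http://en.wikipedia.org/wiki/Foo",   # excluded domain: rejected by the lookahead
             "/wiki/Foo"]:                         # relative link: does not start with http/www
    print(href, "->", bool(pattern.match(href)))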
Tips: in the spirit of following random external links, readers may want to look at the currently popular topic of blockchain:
A plain-language introduction to blockchain: http://python.jobbole.com/88248/
Ruan Yifeng's blockchain primer: http://www.ruanyifeng.com/blog/2017/12/blockchain-tutorial.html