This series closely follows the official Beautiful Soup tutorial. Original link:
http://beautifulsoup.readthedocs.io/zh_CN/latest/
To do good work, one must first sharpen one's tools. Let's start by installing beautifulsoup4 (shown here on CentOS):
# pip install beautifulsoup4
Collecting beautifulsoup4
Downloading beautifulsoup4-4.5.3-py3-none-any.whl (85kB)
100% |████████████████████████████████| 92kB 669kB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.3
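Installing a parser:
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed in different ways; here it is installed with pip: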
# pip install lxml
Collecting lxml
Downloading lxml-3.7.3-cp35-cp35m-manylinux1_x86_64.whl (7.1MB)
100% |████████████████████████████████| 7.1MB 83kB/s
Installing collected packages: lxml
Successfully installed lxml-3.7.3
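Once installation is complete, here is how to use it:
Pass a document to the BeautifulSoup constructor to get a document object. The document can be passed as a string or as an open file handle.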
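from bs4 import BeautifulSoup
# Pass an open file handle (closed automatically by the with block)...
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# ...or pass the markup directly as a string.
soup = BeautifulSoup("<html>data</html>", "html.parser")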
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
BeautifulSoup("Sacr&eacute; bleu!")
<html><head></head><body>Sacré bleu!</body></html>
Then Beautiful Soup picks the most suitable parser available to parse the document. If you specify a parser explicitly, Beautiful Soup uses that one instead.
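As a minimal sketch of naming a parser explicitly, the helper below tries lxml first and falls back to the standard library's html.parser when lxml is not installed. The name parse_html and the sample markup are assumptions made for illustration; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_html(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    parse_html is a hypothetical helper, not part of Beautiful Soup.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser (e.g. lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = parse_html("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Naming the parser also keeps results consistent across machines, since the parser chosen by default depends on what happens to be installed.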
The examples below use this HTML document, stored as a string:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Parsing this string with BeautifulSoup gives a BeautifulSoup object, which can be printed as a tree with standard indentation:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
A few simple ways to navigate the parsed structure:
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.title.parent.name
'head'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id="link2")
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
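Beyond attribute-style access, find_all also accepts filters. The sketch below uses a shortened copy of the sample document so that it runs on its own; it filters by CSS class with the class_ keyword and collects each link's id and text. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python.
sisters = soup.find_all("a", class_="sister")

# Map each tag's id attribute to its text content.
names_by_id = {tag["id"]: tag.string for tag in sisters}
print(names_by_id)

# find() returns the first match, or None when nothing matches.
print(soup.find(id="link4"))
```

Because find returns None when there is no match, it is worth checking the result before chaining further attribute access onto it.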
Find the link URLs of all <a> tags in the document:
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
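Real pages often contain <a> tags with no href at all, in which case link.get('href') returns None. As a small sketch on made-up markup (the html string below is an assumption for illustration), passing href=True to find_all keeps only the tags that actually carry the attribute:

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie">Elsie</a>
<a name="top">no link here</a>
<a href="http://example.com/lacie">Lacie</a>
"""

soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute instead of raising KeyError.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only the tags that have an href attribute.
real_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```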
Extract all the text from the document:
>>> print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
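The raw get_text() output keeps the document's original whitespace. If tidier text is wanted, one option is sketched below on a made-up snippet (the html string is an assumption for illustration): the separator and strip arguments of get_text join the trimmed pieces, and stripped_strings yields the same pieces one at a time.

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time<br/>there were <b>three</b> little sisters.</p>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment; separator is placed between fragments.
joined = soup.get_text(separator=" ", strip=True)
print(joined)

# stripped_strings yields each non-empty fragment with whitespace trimmed.
pieces = list(soup.stripped_strings)
print(pieces)
```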