Environment: Win10 + Cmder + Python 3.6.5
Requirement
Grab the air quality index table from http://www.air-level.com/air/xian/. Tempted to reach for the usual three-step crawler routine already?
Code
I claim three lines of code are enough to get this done. Believe it or not (straight face):
import pandas as pd
df = pd.read_html("http://www.air-level.com/air/xian/", encoding='utf-8', header=0)[0]
df.to_excel('xian_tianqi.xlsx', index=False)
First, have a look at the data on the web page:
Now the same data in Excel:
Impressed? Honestly, so was I...
Explanation
Here is part of the read_html() source:
# Parts of the code omitted; for the full docstring, run: print(pd.read_html.__doc__)
def read_html(io, match='.+', flavor=None, header=None, index_col=None,
skiprows=None, attrs=None, parse_dates=False,
tupleize_cols=None, thousands=',', encoding=None,
decimal='.', converters=None, na_values=None,
keep_default_na=True, displayed_only=True):
r"""Read HTML tables into a ``list`` of ``DataFrame`` objects.
Parameters
----------
io : str or file-like
A URL, a file-like object, or a raw string containing HTML. Note that
lxml only accepts the http, ftp and file url protocols. If you have a
URL that starts with ``'https'`` you might try removing the ``'s'``.
flavor : str or None, container of strings
The parsing engine to use. 'bs4' and 'html5lib' are synonymous with
each other, they are both there for backwards compatibility. The
default of ``None`` tries to use ``lxml`` to parse and if that fails it
falls back on ``bs4`` + ``html5lib``.
header : int or list-like or None, optional
The row (or list of rows for a :class:`~pandas.MultiIndex`) to use to
make the columns headers.
......
As you can see, the io parameter of read_html() accepts several kinds of input, and a URL is one of them. By default the function uses lxml to parse the data in each td inside table tags, and finally returns a list of DataFrame objects. Index into that list to get the DataFrame you want.
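The list-of-DataFrames behavior can be sketched without any network access by feeding read_html() a raw HTML string instead of a URL. The table contents below are made up purely for illustration; wrapping the string in StringIO avoids the deprecation of passing literal HTML directly in newer pandas versions:

```python
import pandas as pd
from io import StringIO

# A minimal HTML snippet standing in for a fetched page (hypothetical data).
html = """
<table>
  <tr><th>City</th><th>AQI</th></tr>
  <tr><td>Xi'an</td><td>85</td></tr>
  <tr><td>Beijing</td><td>60</td></tr>
</table>
"""

# read_html() parses every <table> it finds and returns a list of DataFrames.
tables = pd.read_html(StringIO(html), header=0)
df = tables[0]          # index into the list to get the DataFrame
print(len(tables))      # 1 table found
print(df.shape)         # 2 data rows, 2 columns
```

When a page contains several tables, the match parameter (a regex matched against table text) can narrow down which ones end up in the list.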
Finally
read_html() only parses static pages. For dynamically loaded pages, fetch the rendered HTML by some other means first, then pass that response.text into read_html() to extract the table data.
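That workflow can be sketched as follows. The fetch step is shown only as a comment (requests.get() or a Selenium driver would both work, depending on how the page loads its data); here a stand-in string plays the role of response.text so the sketch stays self-contained, and the table contents are invented for illustration:

```python
import pandas as pd
from io import StringIO

# Step 1 (hypothetical fetch, e.g. with requests or Selenium):
#   page_text = requests.get(url).text
# Here a stand-in string plays the role of the fetched response.text.
page_text = """
<html><body>
<table>
  <tr><th>Monitoring site</th><th>AQI</th><th>Level</th></tr>
  <tr><td>Downtown</td><td>92</td><td>Good</td></tr>
</table>
</body></html>
"""

# Step 2: hand the already-rendered HTML to read_html() as before.
df = pd.read_html(StringIO(page_text), header=0)[0]
print(df.columns.tolist())   # ['Monitoring site', 'AQI', 'Level']
```

Once the HTML is in hand, everything downstream (indexing the list, to_excel, etc.) is identical to the static-page case.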