Environment: Win10 + Cmder + Python 3.6.5
Requirement
Grab the air quality index table from http://www.air-level.com/air/xian/. Tempted to reach for the usual three-step crawler routine already?
Code
I claim three lines of code are enough to get this done. Believe it or not (straight face):
import pandas as pd
df = pd.read_html("http://www.air-level.com/air/xian/", encoding='utf-8', header=0)[0]
df.to_excel('xian_tianqi.xlsx', index=False)
First, have a look at the data on the web page:
Now the same data in Excel:
Impressed? Honestly, so was I...
Explanation
Here is part of the read_html() source:
# Parts of the code omitted; for the full docstring, run: print(pd.read_html.__doc__)
def read_html(io, match='.+', flavor=None, header=None, index_col=None,
skiprows=None, attrs=None, parse_dates=False,
tupleize_cols=None, thousands=',', encoding=None,
decimal='.', converters=None, na_values=None,
keep_default_na=True, displayed_only=True):
r"""Read HTML tables into a ``list`` of ``DataFrame`` objects.
Parameters
----------
io : str or file-like
A URL, a file-like object, or a raw string containing HTML. Note that
lxml only accepts the http, ftp and file url protocols. If you have a
URL that starts with ``'https'`` you might try removing the ``'s'``.
flavor : str or None, container of strings
The parsing engine to use. 'bs4' and 'html5lib' are synonymous with
each other, they are both there for backwards compatibility. The
default of ``None`` tries to use ``lxml`` to parse and if that fails it
falls back on ``bs4`` + ``html5lib``.
header : int or list-like or None, optional
The row (or list of rows for a :class:`~pandas.MultiIndex`) to use to
make the columns headers.
......
As you can see, the io parameter of read_html() accepts several kinds of input, and a URL is one of them. By default the function uses lxml to parse the data in each td inside table tags, and finally returns a list of DataFrame objects. Index into that list to get the DataFrame you want.
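The list-of-DataFrames behavior can be sketched without any network access by feeding read_html() a raw HTML string instead of a URL. The table contents below are made up purely for illustration; wrapping the string in StringIO avoids the deprecation of passing literal HTML directly in newer pandas versions:

```python
import pandas as pd
from io import StringIO

# A minimal HTML snippet standing in for a fetched page (hypothetical data).
html = """
<table>
  <tr><th>City</th><th>AQI</th></tr>
  <tr><td>Xi'an</td><td>85</td></tr>
  <tr><td>Beijing</td><td>60</td></tr>
</table>
"""

# read_html() parses every <table> it finds and returns a list of DataFrames.
tables = pd.read_html(StringIO(html), header=0)
df = tables[0]          # index into the list to get the DataFrame
print(len(tables))      # 1 table found
print(df.shape)         # 2 data rows, 2 columns
```

When a page contains several tables, the match parameter (a regex matched against table text) can narrow down which ones end up in the list.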
Finally
read_html() only parses static pages. For dynamically loaded pages, fetch the rendered HTML by some other means first, then pass that response.text into read_html() to extract the table data.
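That workflow can be sketched as follows. The fetch step is shown only as a comment (requests.get() or a Selenium driver would both work, depending on how the page loads its data); here a stand-in string plays the role of response.text so the sketch stays self-contained, and the table contents are invented for illustration:

```python
import pandas as pd
from io import StringIO

# Step 1 (hypothetical fetch, e.g. with requests or Selenium):
#   page_text = requests.get(url).text
# Here a stand-in string plays the role of the fetched response.text.
page_text = """
<html><body>
<table>
  <tr><th>Monitoring site</th><th>AQI</th><th>Level</th></tr>
  <tr><td>Downtown</td><td>92</td><td>Good</td></tr>
</table>
</body></html>
"""

# Step 2: hand the already-rendered HTML to read_html() as before.
df = pd.read_html(StringIO(page_text), header=0)[0]
print(df.columns.tolist())   # ['Monitoring site', 'AQI', 'Level']
```

Once the HTML is in hand, everything downstream (indexing the list, to_excel, etc.) is identical to the static-page case.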