Chapter 01: Pandas Foundations
Chapter 02: Essential DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment
3.1 Creating a DataFrame
Create a DataFrame from parallel lists:
>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]
Create a dictionary from the lists:
>>> people = {"first": fname, "last": lname, "birth": birth}
Build a DataFrame from that dictionary:
>>> beatles = pd.DataFrame(people)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
How it works
When you call the DataFrame constructor, pandas creates a RangeIndex object:
>>> beatles.index
RangeIndex(start=0, stop=4, step=1)
Override the index by passing one explicitly:
>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
first last birth
a Paul McCartney 1942
b John Lennon 1940
c Richard Starkey 1940
d George Harrison 1943
There's more
A DataFrame can also be built from a list of dictionaries:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ]
... )
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
Use the columns parameter to specify the column order:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ],
... columns=["last", "first", "birth"],
... )
last first birth
0 McCartney Paul 1942
1 Lennon John 1940
2 Starkey Richard 1940
3 Harrison George 1943
3.2 Writing CSV
Use the .to_csv method to write a DataFrame to a CSV file:
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout)  # a file path string would also work here
Inspect what was written:
>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943
There's more
If you read back the CSV you just saved, you get a redundant column named Unnamed: 0:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
The index_col parameter tells read_csv which column to use as the index:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
To omit the row index when saving, set the index parameter to False:
>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943
3.3 Reading large CSV files
pandas processes files in memory; as a rule of thumb, you need 3 to 10 times as much memory as the size of the file.
This recipe uses the diamonds dataset. The nrows parameter limits the read to the first 1,000 rows.
>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
carat cut color clarity ... price x y z
0 0.23 Ideal E SI2 ... 326 3.95 3.98 2.43
1 0.21 Premium E SI1 ... 326 3.89 3.84 2.31
2 0.23 Good E VS1 ... 327 4.05 4.07 2.31
3 0.29 Premium I VS2 ... 334 4.20 4.23 2.63
4 0.31 Good J SI2 ... 335 4.34 4.35 2.75
.. ... ... ... ... ... ... ... ... ...
995 0.54 Ideal D VVS2 ... 2897 5.30 5.34 3.26
996 0.72 Ideal E SI1 ... 2897 5.69 5.74 3.57
997 0.72 Good F VS1 ... 2897 5.82 5.89 3.48
998 0.74 Premium D VS2 ... 2897 5.81 5.77 3.58
999 1.12 Premium J SI2 ... 2898 6.68 6.61 4.03
Use the .info method to check the memory consumed:
>>> diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float64
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float64
table 1000 non-null float64
price 1000 non-null int64
x 1000 non-null float64
y 1000 non-null float64
z 1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB
The 1,000 rows use 78.2 KB of memory, so 1 billion rows would take roughly 78 GB.
Use the dtype parameter to choose smaller numeric types while reading:
>>> diamonds2 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... },
... )
>>> diamonds2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB
Having changed the numeric types, compare the new DataFrame with the original:
>>> diamonds.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689280 61.722800 ... 5.599180 3.457530
std 0.195291 1.758879 ... 0.611974 0.389819
min 0.200000 53.000000 ... 3.750000 2.270000
25% 0.700000 60.900000 ... 5.630000 3.450000
50% 0.710000 61.800000 ... 5.760000 3.550000
75% 0.790000 62.600000 ... 5.910000 3.640000
max 1.270000 69.500000 ... 7.050000 4.330000
>>> diamonds2.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689453 61.718750 ... 5.601562 3.457031
std 0.195312 1.759766 ... 0.611816 0.389648
min 0.199951 53.000000 ... 3.750000 2.269531
25% 0.700195 60.906250 ... 5.628906 3.449219
50% 0.709961 61.812500 ... 5.761719 3.550781
75% 0.790039 62.593750 ... 5.910156 3.640625
max 1.269531 69.500000 ... 7.050781 4.328125
Changing the numeric types saved about 37% of the memory (78.2 KB down to 49.0 KB).
The dtype parameter can also convert the string columns to category. First check their cardinality with .value_counts:
>>> diamonds2.cut.value_counts()
Ideal 333
Premium 290
Very Good 226
Good 89
Fair 62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E 240
F 226
G 139
D 129
H 125
I 95
J 46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1 306
VS2 218
VS1 159
SI2 154
VVS2 62
VVS1 58
I1 29
IF 14
Name: clarity, dtype: int64
因?yàn)槭堑突鶖?shù)张峰,將其轉(zhuǎn)換為category
泪蔫,可以節(jié)省約37%的內(nèi)存:
>>> diamonds3 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... )
>>> diamonds3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB
The usecols parameter specifies which columns to load. Here the x, y, and z columns are skipped:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds4 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... )
>>> diamonds4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB
The memory usage is now only about 23% of the original (17.7 KB versus 78.2 KB).
To process the data one piece at a time, use the chunksize parameter:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds_iter = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... chunksize=200,
... )
>>> def process(df):
... return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
... process(chunk)
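The loop above only reports chunk sizes; a common real pattern is to keep a running aggregate across the chunks. A minimal sketch, using a small in-memory CSV in place of data/diamonds.csv:

```python
import pandas as pd
from io import StringIO

# Hypothetical in-memory CSV standing in for data/diamonds.csv.
csv = StringIO("price\n" + "\n".join(str(p) for p in range(1000)))

total = 0
for chunk in pd.read_csv(csv, chunksize=200):
    # Aggregate each 200-row chunk and keep a running total,
    # so only one chunk is ever in memory at a time.
    total += chunk.price.sum()
```

Any reduction that can be computed per chunk and combined (sums, counts, min/max) fits this pattern; a mean needs both a running sum and a running count.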
Because CSV files store no type information, pandas must infer the type of each column. If every value in a column is an integer and none are missing, pandas uses int64. If a column is numeric but not integral, or has missing values, pandas uses float64. These two types take a lot of memory. For example, if all the values are below 200, you can use a smaller type such as np.int16 (or np.uint8 if they are all non-negative).
If a column is non-numeric, pandas converts it to the object type. The object type uses a lot of memory because it stores the data as Python strings; converting it to category can save substantial space, since each distinct string is then stored only once.
There's more
Storing the prices as int8 would lose information. You can use NumPy's iinfo function to list the range of each NumPy integer type:
>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)
Use np.finfo to see the range of a floating-point type:
>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
max=6.55040e+04, dtype=float16)
The .memory_usage method reports how many bytes a DataFrame or Series uses; note that this includes the row index. Passing deep=True reports the true usage of a Series holding object data:
>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413
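.memory_usage also works on a whole DataFrame, returning one entry per column plus the index; summing it gives the total footprint. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})

# One entry for the index plus one per column, in bytes.
per_col = df.memory_usage(deep=True)
total_bytes = per_col.sum()

assert per_col["a"] == 8000       # 1000 int64 values at 8 bytes each
assert total_bytes > per_col["a"]  # object column and index add more
```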
Once the data types are settled, you can save the data in a binary format that records them, such as Feather (pandas uses the pyarrow library for this):
>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")
Another option is the Parquet format:
>>> diamonds4.to_parquet("/tmp/d.pqt")
3.4 Working with Excel files
You need the xlwt (for .xls) or openpyxl (for .xlsx) package installed to write Excel files.
Use the .to_excel method to save:
>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")
Read the Excel file back with read_excel:
>>> beat2 = pd.read_excel("beat.xls")
>>> beat2
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
Use the index_col parameter to choose the index column:
>>> beat2 = pd.read_excel("beat.xls", index_col=0)
>>> beat2
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
The data types survive the Excel round trip:
>>> beat2.dtypes
first object
last object
birth int64
dtype: object
There's more
Use the sheet_name parameter to write multiple named sheets:
>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
... xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()
The resulting Excel file contains two sheets, one named All and one named 1940.
3.5 Working with ZIP files
If the CSV file is the only file in the ZIP archive, read_csv can read it directly:
>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
barrels08 barrelsA08 ... phevHwy phevComb
0 15.695714 0.0 ... 0 0
1 29.964545 0.0 ... 0 0
2 12.207778 0.0 ... 0 0
3 29.964545 0.0 ... 0 0
4 17.347895 0.0 ... 0 0
... ... ... ... ... ...
41139 14.982273 0.0 ... 0 0
41140 14.330870 0.0 ... 0 0
41141 15.695714 0.0 ... 0 0
41142 15.695714 0.0 ... 0 0
41143 18.311667 0.0 ... 0 0
>>> autos.modifiedOn.dtype
dtype('O')
The date column in the CSV comes back as strings. You can parse it while loading with the parse_dates parameter of read_csv, or load first and convert with the to_datetime method:
>>> autos= pd.read_csv(
... "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0 2013-01-0...
1 2013-01-0...
2 2013-01-0...
3 2013-01-0...
4 2013-01-0...
...
41139 2013-01-0...
41140 2013-01-0...
41141 2013-01-0...
41142 2013-01-0...
41143 2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
Without parse_dates, the column comes back as plain strings:
>>> autos.modifiedOn
0 Tue Jan 01 00:00:00 EST 2013
1 Tue Jan 01 00:00:00 EST 2013
2 Tue Jan 01 00:00:00 EST 2013
3 Tue Jan 01 00:00:00 EST 2013
4 Tue Jan 01 00:00:00 EST 2013
...
39096 Tue Jan 01 00:00:00 EST 2013
39097 Tue Jan 01 00:00:00 EST 2013
39098 Tue Jan 01 00:00:00 EST 2013
39099 Tue Jan 01 00:00:00 EST 2013
39100 Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
...
39096 2013-01-01
39097 2013-01-01
39098 2013-01-01
39099 2013-01-01
39100 2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]
If the ZIP archive contains several files, use the zipfile module. Because the second line of this dataset (the first data row) contains the survey questions, it is stored in kag_questions:
>>> import zipfile
>>> with zipfile.ZipFile(
... "data/kaggle-survey-2018.zip"
... ) as z:
... print("\n".join(z.namelist()))
... kag = pd.read_csv(
... z.open("multipleChoiceResponses.csv")
... )
... kag_questions = kag.iloc[0]
... survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
1 2
Time from... 710 434
Q1 Female Male
Q1_OTHER_... -1 -1
Q2 45-49 30-34
Q3 United S... Indonesia
... ... ...
Q50_Part_5 NaN NaN
Q50_Part_6 NaN NaN
Q50_Part_7 NaN NaN
Q50_Part_8 NaN NaN
Q50_OTHER... -1 -1
There's more
When the archive holds only a single file, read_csv can also read GZIP, BZ2, and XZ files directly.
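The compression format is inferred from the file extension in both directions, so to_csv compresses on write the same way. A sketch writing and re-reading a gzipped CSV under a temporary path:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

path = os.path.join(tempfile.mkdtemp(), "small.csv.gz")
df.to_csv(path, index=False)    # gzip compression inferred from the .gz suffix
round_trip = pd.read_csv(path)  # decompression inferred the same way

assert round_trip.equals(df)
```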
3.6 Working with databases
Create an SQLite database and store the Beatles data in it:
>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
... cur = con.cursor()
...     cur.execute("""DROP TABLE IF EXISTS Band""")
... cur.execute(
... """CREATE TABLE Band(id INTEGER PRIMARY KEY,
... fname TEXT, lname TEXT, birthyear INT)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 0, 'Paul', 'McCartney', 1942)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 1, 'John', 'Lennon', 1940)"""
... )
... _ = con.commit()
Read the database back into a DataFrame, here through SQLAlchemy:
>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
... "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
... "Band", sa_connection, index_col="id"
... )
>>> beat
fname lname birthyear
id
0 Paul McCartney 1942
1 John Lennon 1940
Data can also be read with a SQL statement, using either the sqlite3 or the SQLAlchemy connection:
>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
fname birthyear
0 Paul 1942
1 John 1940
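Writing goes the other way with the .to_sql method. A sketch against a throwaway in-memory SQLite database:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")  # throwaway database

df = pd.DataFrame(
    {"fname": ["Paul", "John"], "birthyear": [1942, 1940]}
)
df.to_sql("Band", con, index=False)           # create the table, insert the rows
back = pd.read_sql("SELECT * FROM Band", con)  # read them back

assert list(back.fname) == ["Paul", "John"]
```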
3.7 Reading JSON
Encoding and decoding JSON data:
>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
Use read_json to load JSON data; the dictionary keys become the column names.
>>> beatles = pd.read_json(encoded)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
When reading JSON, pandas supports several orientations:
- columns — (default) maps column names to lists of that column's values;
- records — a list of rows, where each row is a dictionary mapping column names to values;
- split — maps columns to the column names, index to the index values, and data to a list of row-value lists;
- index — maps index values to rows, where each row is a dictionary mapping column names to values;
- values — a list of rows (each itself a list of values), with no column names or index values;
- table — maps schema to the DataFrame's schema and data to a list of row dictionaries.
The code below demonstrates each orient:
>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
0 1 2
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> (
... pd.read_json(values, orient="values").rename(
... columns=dict(
... enumerate(["first", "last", "birth"])
... )
... )
... )
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
There's more
To convert the data to a dictionary (which makes it easy to attach extra entries), use the .to_dict method:
>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
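.to_dict accepts orient choices much like the JSON methods; orient="records" is often the handiest shape for serialization. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"first": ["Paul", "John"], "birth": [1942, 1940]})

# records: one dictionary per row, column names as keys.
recs = df.to_dict(orient="records")

assert recs == [
    {"first": "Paul", "birth": 1942},
    {"first": "John", "birth": 1940},
]
```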
3.8 Reading HTML tables
pandas can read tables embedded in HTML pages:
# Translator's note: en.wikipedia.org may be inaccessible in some regions without a proxy
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51
That pulled in 51 DataFrames; check the first one:
>>> dfs[0]
The Beatles discography The Beatles discography.1
0 The Beat... The Beat...
1 Studio a... 23
2 Live albums 5
3 Compilat... 53
4 Video al... 15
5 Music vi... 64
6 EPs 21
7 Singles 63
8 Mash-ups 2
9 Box sets 15
read_html has a match parameter that accepts a string or a regular expression, and an attrs parameter for locating the table by its HTML attributes.
Inspect the table element in the HTML:
<table class="wikitable plainrowheaders" style="text-align:center;">
<caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
</caption>
<tbody>
<tr>
<th scope="col" rowspan="2" style="width:20em;">Title
</th>
<th scope="col" rowspan="2" style="width:20em;">Release
...
The table has no distinctive attributes, but the string List of studio albums matches its caption. Missing values are marked with "—", handled via na_values:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Using the first two rows as column names helps, but the result is still messy:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url,
... match="List of studio albums",
... na_values="—",
... header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
Title Release ... Peak chart positions Certifications
Title Release ... US[8][9] Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])
In a case like this, the easiest fix is to assign a new column index:
>>> df = dfs[0]
>>> df.columns = [
... "Title",
... "Release",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "Certifications",
... ]
>>> df
Title Release ... US Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
Next, clean up the Release column. The corresponding HTML looks like this:
<th scope="row" rowspan="2">
<i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
<img alt="double-dagger" src="http://upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
</th>
Split the Release column into separate release_date and label columns:
>>> res = (
... df.pipe(
... lambda df_: df_[
... ~df_.Title.str.startswith("Released")
... ]
... )
... .assign(
... release_date=lambda df_: pd.to_datetime(
... df_.Release.str.extract(
... r"Released: (.*) Label"
... )[0].str.replace(r"\[E\]", "")
... ),
... label=lambda df_: df_.Release.str.extract(
... r"Label: (.*)"
... ),
... )
... .loc[
... :,
... [
... "Title",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "release_date",
... "label",
... ],
... ]
... )
>>> res
Title UK ... release_date label
0 Please P... 1 ... 1963-03-22 Parlopho...
1 With the... 1 ... 1963-11-22 Parlopho...
2 Introduc... NaN ... 1964-01-10 Vee-Jay ...
3 Meet the... NaN ... 1964-01-20 Capitol ...
4 Twist an... NaN ... 1964-02-03 Capitol ...
.. ... ... ... ... ...
21 Magical ... 31 ... 1967-11-27 Parlopho...
22 The Beat... 1 ... 1968-11-22 Apple
23 Yellow S... 3 ... 1969-01-13 Apple (U...
24 Abbey Road 1 ... 1969-09-26 Apple
25 Let It Be 1 ... 1970-05-08 Apple
There's more
read_html can even pull in a CSV file hosted on GitHub, since GitHub renders it as an HTML table (with class csv-data):
>>> url = "https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv"
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
Unnamed: 0 quadrant x y
0 NaN I 10.0 8.04
1 NaN I 14.0 9.96
2 NaN I 6.0 7.24
3 NaN I 9.0 8.81
4 NaN I 4.0 4.26
.. ... ... ... ...
39 NaN IV 8.0 6.58
40 NaN IV 8.0 7.91
41 NaN IV 8.0 8.47
42 NaN IV 8.0 5.25
43 NaN IV 8.0 6.89