Pandas 1.x Cookbook, 2nd Edition — Chapter 03: Creating and Persisting DataFrames

Chapter 01: Pandas Foundations
Chapter 02: Essential DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment


3.1 Creating DataFrames

Create a DataFrame from parallel lists of data:

>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]

Build a dictionary from them:

>>> people = {"first": fname, "last": lname, "birth": birth}

Create the DataFrame from the dictionary:

>>> beatles = pd.DataFrame(people)
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

How it works

When the DataFrame constructor is called, pandas creates a RangeIndex object:

>>> beatles.index
RangeIndex(start=0, stop=4, step=1)

To supply an index explicitly:

>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
     first       last  birth
a     Paul  McCartney   1942
b     John     Lennon   1940
c  Richard    Starkey   1940
d   George   Harrison   1943

There's more

A DataFrame can also be built from a list of dictionaries:

>>> pd.DataFrame(
...     [
...         {
...             "first": "Paul",
...             "last": "McCartney",
...             "birth": 1942,
...         },
...         {
...             "first": "John",
...             "last": "Lennon",
...             "birth": 1940,
...         },
...         {
...             "first": "Richard",
...             "last": "Starkey",
...             "birth": 1940,
...         },
...         {
...             "first": "George",
...             "last": "Harrison",
...             "birth": 1943,
...         },
...     ]
... )
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison

Use the columns parameter to control the column order:

>>> pd.DataFrame(
...     [
...         {
...             "first": "Paul",
...             "last": "McCartney",
...             "birth": 1942,
...         },
...         {
...             "first": "John",
...             "last": "Lennon",
...             "birth": 1940,
...         },
...         {
...             "first": "Richard",
...             "last": "Starkey",
...             "birth": 1940,
...         },
...         {
...             "first": "George",
...             "last": "Harrison",
...             "birth": 1943,
...         },
...     ],
...     columns=["last", "first", "birth"],
... )
        last    first  birth
0  McCartney     Paul   1942
1     Lennon     John   1940
2    Starkey  Richard   1940
3   Harrison   George   1943
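A DataFrame can likewise be built straight from a 2-D NumPy array; unlike the dict-based constructors, the array supplies only the values, so the column names below are labels we choose ourselves (a minimal sketch):

```python
import numpy as np
import pandas as pd

# Each row of the 2-D array becomes one row of the DataFrame.
arr = np.array([["Paul", 1942], ["John", 1940]], dtype=object)
df = pd.DataFrame(arr, columns=["first", "birth"])
```

Note that an object array gives every column the object dtype; for numeric data, pass a numeric array or convert afterwards with .astype.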

3.2 Writing CSV

Use the .to_csv method to write a DataFrame to a CSV file:

>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout)  # a file path could be passed instead of a StringIO

Inspect the file contents:

>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943

There's more

If you read back the CSV you just saved, you get a redundant column named Unnamed: 0:

>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
   Unnamed: 0    first       last  birth
0           0     Paul  McCartney   1942
1           1     John     Lennon   1940
2           2  Richard    Starkey   1940
3           3   George   Harrison   1943

Use the index_col parameter to designate that column as the index:

>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

To skip writing the row index when saving the CSV, set the index parameter to False:

>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943

3.3 Reading large CSV files

Pandas processes files in memory; as a rule of thumb, you need 3 to 10 times as much memory as the file size.

This recipe uses the diamonds dataset. Use the nrows parameter to read just 1,000 rows.

>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
     carat      cut color clarity  ...  price     x     y     z
0     0.23    Ideal     E     SI2  ...    326  3.95  3.98  2.43
1     0.21  Premium     E     SI1  ...    326  3.89  3.84  2.31
2     0.23     Good     E     VS1  ...    327  4.05  4.07  2.31
3     0.29  Premium     I     VS2  ...    334  4.20  4.23  2.63
4     0.31     Good     J     SI2  ...    335  4.34  4.35  2.75
..     ...      ...   ...     ...  ...    ...   ...   ...   ...
995   0.54    Ideal     D    VVS2  ...   2897  5.30  5.34  3.26
996   0.72    Ideal     E     SI1  ...   2897  5.69  5.74  3.57
997   0.72     Good     F     VS1  ...   2897  5.82  5.89  3.48
998   0.74  Premium     D     VS2  ...   2897  5.81  5.77  3.58
999   1.12  Premium     J     SI2  ...   2898  6.68  6.61  4.03

Use the .info method to inspect the memory consumption:

>>> diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float64
cut        1000 non-null object
color      1000 non-null object
clarity    1000 non-null object
depth      1000 non-null float64
table      1000 non-null float64
price      1000 non-null int64
x          1000 non-null float64
y          1000 non-null float64
z          1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB

The 1,000 rows use 78.2 KB of memory, so 1 billion rows would take roughly 78 GB.

Use the dtype parameter to specify the numeric types while reading:

>>> diamonds2 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "x": np.float32,
...         "y": np.float32,
...         "z": np.float32,
...         "price": np.int16,
...     },
... )
>>> diamonds2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float32
cut        1000 non-null object
color      1000 non-null object
clarity    1000 non-null object
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
x          1000 non-null float32
y          1000 non-null float32
z          1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB

With the numeric types changed, compare the new DataFrame against the original:

>>> diamonds.describe()
             carat        depth  ...            y            z
count  1000.000000  1000.000000  ...  1000.000000  1000.000000
mean      0.689280    61.722800  ...     5.599180     3.457530
std       0.195291     1.758879  ...     0.611974     0.389819
min       0.200000    53.000000  ...     3.750000     2.270000
25%       0.700000    60.900000  ...     5.630000     3.450000
50%       0.710000    61.800000  ...     5.760000     3.550000
75%       0.790000    62.600000  ...     5.910000     3.640000
max       1.270000    69.500000  ...     7.050000     4.330000
>>> diamonds2.describe()
             carat        depth  ...            y            z
count  1000.000000  1000.000000  ...  1000.000000  1000.000000
mean      0.689453    61.718750  ...     5.601562     3.457031
std       0.195312     1.759766  ...     0.611816     0.389648
min       0.199951    53.000000  ...     3.750000     2.269531
25%       0.700195    60.906250  ...     5.628906     3.449219
50%       0.709961    61.812500  ...     5.761719     3.550781
75%       0.790039    62.593750  ...     5.910156     3.640625
max       1.269531    69.500000  ...     7.050781     4.328125

Changing the data types saved about 38% of the memory.
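The per-column effect of narrowing a dtype is easy to verify with .memory_usage (a small sketch on synthetic data; halving the float width halves the buffer):

```python
import numpy as np
import pandas as pd

s64 = pd.Series(np.zeros(1000))          # float64: 8 bytes per value
s32 = s64.astype(np.float32)             # float32: 4 bytes per value
b64 = s64.memory_usage(index=False)      # 8000 bytes
b32 = s32.memory_usage(index=False)      # 4000 bytes
```

Summed over a wide, long DataFrame, these per-column halvings add up to the savings shown above.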

The dtype parameter can also convert columns to category. First count the distinct values with .value_counts:

>>> diamonds2.cut.value_counts()
Ideal       333
Premium     290
Very Good   226
Good         89
Fair         62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E    240
F    226
G    139
D    129
H    125
I     95
J     46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1     306
VS2     218
VS1     159
SI2     154
VVS2     62
VVS1     58
I1       29
IF       14
Name: clarity, dtype: int64

Because these columns are low-cardinality, converting them to category saves roughly another 37% of the memory:

>>> diamonds3 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "x": np.float32,
...         "y": np.float32,
...         "z": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
... )
>>> diamonds3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat      1000 non-null float32
cut        1000 non-null category
color      1000 non-null category
clarity    1000 non-null category
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
x          1000 non-null float32
y          1000 non-null float32
z          1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB

Use the usecols parameter to specify which columns to load. Here the x, y, and z columns are skipped:

>>> cols = [
...     "carat",
...     "cut",
...     "color",
...     "clarity",
...     "depth",
...     "table",
...     "price",
... ]
>>> diamonds4 = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
...     usecols=cols,
... )
>>> diamonds4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat      1000 non-null float32
cut        1000 non-null category
color      1000 non-null category
clarity    1000 non-null category
depth      1000 non-null float32
table      1000 non-null float32
price      1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB

The memory use is now only 21% of the original.

To process the data one piece at a time, use the chunksize parameter:

>>> cols = [
...     "carat",
...     "cut",
...     "color",
...     "clarity",
...     "depth",
...     "table",
...     "price",
... ]
>>> diamonds_iter = pd.read_csv(
...     "data/diamonds.csv",
...     nrows=1000,
...     dtype={
...         "carat": np.float32,
...         "depth": np.float32,
...         "table": np.float32,
...         "price": np.int16,
...         "cut": "category",
...         "color": "category",
...         "clarity": "category",
...     },
...     usecols=cols,
...     chunksize=200,
... )
>>> def process(df):
...     return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
...     process(chunk)
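The process function above is just a stub; a more realistic chunked pattern accumulates a partial result per chunk and combines them at the end. A sketch on an in-memory CSV, so it runs without the diamonds file:

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for a large CSV file.
csv = StringIO("price\n100\n200\n300\n400\n")

total = 0
count = 0
for chunk in pd.read_csv(csv, chunksize=2):  # two rows at a time
    total += chunk["price"].sum()
    count += len(chunk)

mean_price = total / count  # exact mean, computed without loading all rows at once
```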

Because CSV files store no type information, pandas has to infer each column's type. If every value in a column is an integer and none are missing, pandas uses int64. If a column is numeric but not integral, or has missing values, pandas uses float64. These two types take up a lot of memory. If, for example, all the values are below 200, you can use a smaller type such as np.int16 (or np.uint8 if they are all non-negative).

If a column holds non-numeric values, pandas converts it to the object type. Object columns use a lot of memory because each value is stored as a Python string; switching the type to category can save a great deal of space, because each distinct string is stored only once.
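You can also shrink types after loading: pd.to_numeric with the downcast parameter picks the smallest type that can hold the values (a sketch on toy data):

```python
import pandas as pd

s = pd.Series([10, 20, 150])                     # inferred as int64
as_int = pd.to_numeric(s, downcast="integer")    # 150 > 127, so int8 won't fit -> int16
as_uint = pd.to_numeric(s, downcast="unsigned")  # 150 <= 255, so uint8 fits
```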

There's more

Using int8 for the price column would lose information. You can use NumPy's iinfo function to list the limits of the NumPy integer types:

>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)

Use .finfo to check the limits of the floating-point types:

>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
      max=6.55040e+04, dtype=float16)

You can also query how many bytes a DataFrame or Series uses with the .memory_usage method. Note that this includes the row index. Additionally, passing deep=True reports the true memory usage of a Series holding object data:

>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413

Once you have settled on the data types, you can save the data in a binary format that records them, such as Feather (pandas uses the pyarrow library for this):

>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")

Another option is the Parquet format:

>>> diamonds4.to_parquet("/tmp/d.pqt")
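If pyarrow is not installed, pandas' own pickle format also round-trips dtypes exactly (a sketch using a temporary file):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"price": [1, 2, 3]}).astype({"price": "int16"})
path = os.path.join(tempfile.mkdtemp(), "d.pkl")
df.to_pickle(path)           # dtypes are serialized along with the values
df2 = pd.read_pickle(path)   # comes back as int16, not int64
```

Pickle is Python-only, though; Feather and Parquet are the better choice for exchanging data with other tools.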

3.4 Working with Excel files

Install xlwt to write XLS files, or openpyxl to write XLSX files.

Use the .to_excel method to save the DataFrame:

>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")

Use read_excel to read the Excel file back:

>>> beat2 = pd.read_excel("beat.xls")
>>> beat2
   Unnamed: 0    first       last  birth
0           0     Paul  McCartney   1942
1           1     John     Lennon   1940
2           2  Richard    Starkey   1940
3           3   George   Harrison   1943

Use the index_col parameter to designate the row index:

>>> beat2 = pd.read_excel("beat.xls", index_col=0)
>>> beat2
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

Excel preserved the data types:

>>> beat2.dtypes
first    object
last     object
birth     int64
dtype: object

There's more

Use the sheet_name parameter to name the worksheets:

>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
...     xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()

The resulting Excel file contains two sheets, one named All and one named 1940.


3.5 Working with ZIP files

If the CSV file is the only file in the ZIP archive, you can use the read_csv function directly:

>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
       barrels08  barrelsA08  ...  phevHwy  phevComb
0      15.695714         0.0  ...        0         0
1      29.964545         0.0  ...        0         0
2      12.207778         0.0  ...        0         0
3      29.964545         0.0  ...        0         0
4      17.347895         0.0  ...        0         0
...          ...         ...  ...      ...       ...
41139  14.982273         0.0  ...        0         0
41140  14.330870         0.0  ...        0         0
41141  15.695714         0.0  ...        0         0
41142  15.695714         0.0  ...        0         0
41143  18.311667         0.0  ...        0         0
>>> autos.modifiedOn.dtype
dtype('O')

The column holding dates in the CSV comes back as strings. One option is to parse it at load time with the parse_dates parameter of read_csv; the other is to load first and convert with the to_datetime method:

>>> autos = pd.read_csv(
...     "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0       2013-01-0...
1       2013-01-0...
2       2013-01-0...
3       2013-01-0...
4       2013-01-0...
            ...     
41139   2013-01-0...
41140   2013-01-0...
41141   2013-01-0...
41142   2013-01-0...
41143   2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
Without parse_dates, the same column comes back as strings, which can then be converted with to_datetime:

>>> autos.modifiedOn
0        Tue Jan 01 00:00:00 EST 2013
1        Tue Jan 01 00:00:00 EST 2013
2        Tue Jan 01 00:00:00 EST 2013
3        Tue Jan 01 00:00:00 EST 2013
4        Tue Jan 01 00:00:00 EST 2013
                     ...
39096    Tue Jan 01 00:00:00 EST 2013
39097    Tue Jan 01 00:00:00 EST 2013
39098    Tue Jan 01 00:00:00 EST 2013
39099    Tue Jan 01 00:00:00 EST 2013
39100    Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0       2013-01-01
1       2013-01-01
2       2013-01-01
3       2013-01-01
4       2013-01-01
           ...
39096   2013-01-01
39097   2013-01-01
39098   2013-01-01
39099   2013-01-01
39100   2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]

If the ZIP archive contains multiple files, use the zipfile module. Because the second line of the file contains the survey questions, that row is stored separately in kag_questions:

>>> import zipfile
>>> with zipfile.ZipFile(
...     "data/kaggle-survey-2018.zip"
... ) as z:
...     print("\n".join(z.namelist()))
...     kag = pd.read_csv(
...         z.open("multipleChoiceResponses.csv")
...     )
...     kag_questions = kag.iloc[0]
...     survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
                        1          2
Time from...          710        434
Q1                 Female       Male
Q1_OTHER_...           -1         -1
Q2                  45-49      30-34
Q3            United S...  Indonesia
...                   ...        ...
Q50_Part_5            NaN        NaN
Q50_Part_6            NaN        NaN
Q50_Part_7            NaN        NaN
Q50_Part_8            NaN        NaN
Q50_OTHER...           -1         -1

There's more

If the archive contains a single file, read_csv can also read GZIP, BZ2, and XZ files directly.
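Both .to_csv and read_csv handle compression natively; with a .gz path the codec is inferred from the extension (a sketch using a temporary file):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
path = os.path.join(tempfile.mkdtemp(), "small.csv.gz")
df.to_csv(path, index=False)   # compression inferred from the .gz suffix
back = pd.read_csv(path)       # likewise inferred on read
```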


3.6 Working with databases

Create a SQLite database to store the Beatles information:

>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
...     cur = con.cursor()
...     cur.execute("""DROP TABLE IF EXISTS Band""")
...     cur.execute(
...         """CREATE TABLE Band(id INTEGER PRIMARY KEY,
...         fname TEXT, lname TEXT, birthyear INT)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         0, 'Paul', 'McCartney', 1942)"""
...     )
...     cur.execute(
...         """INSERT INTO Band VALUES(
...         1, 'John', 'Lennon', 1940)"""
...     )
...     _ = con.commit()

Read from the database into a DataFrame, here using SQLAlchemy:

>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
...     "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
...     "Band", sa_connection, index_col="id"
... )
>>> beat
   fname      lname  birthyear
id                            
0   Paul  McCartney       1942
1   John     Lennon       1940

Read the data with a SQL statement. Either the SQLite or the SQLAlchemy connection can be used:

>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
  fname  birthyear
0  Paul       1942
1  John       1940
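The reverse direction works too: .to_sql writes a DataFrame into a table, and a plain sqlite3 connection is enough (a sketch using an in-memory database):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame(
    {"fname": ["Paul", "John"], "birthyear": [1942, 1940]}
)
con = sqlite3.connect(":memory:")        # throwaway in-memory database
df.to_sql("Band", con, index=False)      # create the table and insert the rows
back = pd.read_sql("SELECT fname, birthyear FROM Band", con)
```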

3.7 Reading JSON

Encoding and decoding JSON data:

>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}

Use .read_json to read JSON data. The dictionary keys become the column names.

>>> beatles = pd.read_json(encoded)
>>> beatles
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

When reading JSON, pandas supports a few dedicated orientations:

  • columns — (the default) maps column names to lists of the values in each column;
  • records — a list of rows, where each row is a dictionary mapping a column name to its value;
  • split — maps columns to the column names, index to the row index values, and data to a list of each row's values;
  • index — maps index values to rows, where each row is a dictionary mapping a column name to its value;
  • values — a list of each row's values (each row itself a list); column names and index values are not included;
  • table — maps schema to the DataFrame's schema and data to a list of row dictionaries.

The following code demonstrates each orientation:

>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
   birth    first       last
0   1942     Paul  McCartney
1   1940     John     Lennon
2   1940  Richard    Starkey
3   1943   George   Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
         0          1     2
0     Paul  McCartney  1942
1     John     Lennon  1940
2  Richard    Starkey  1940
3   George   Harrison  1943
>>> (
...     pd.read_json(values, orient="values").rename(
...         columns=dict(
...             enumerate(["first", "last", "birth"])
...         )
...     )
... )
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3   George   Harrison   1943

There's more

To turn the data into a dictionary — convenient for adding extra entries — use the .to_dict method:

>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
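Like to_json, .to_dict accepts an orient parameter; orient="records" produces a list of row dictionaries (a sketch):

```python
import pandas as pd

df = pd.DataFrame({"first": ["Paul", "John"], "birth": [1942, 1940]})
records = df.to_dict(orient="records")   # one dict per row
```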

3.8 Reading HTML tables

Pandas can read tables embedded in HTML:

Tables from Wikipedia:
# Translator's note: en.wikipedia.org may require a proxy to access
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51

This reads 51 DataFrames; inspect the first one:

>>> dfs[0]
  The Beatles discography The Beatles discography.1
0  The Beat...             The Beat...
1  Studio a...                      23
2  Live albums                       5
3  Compilat...                      53
4  Video al...                      15
5  Music vi...                      64
6          EPs                      21
7      Singles                      63
8     Mash-ups                       2
9     Box sets                      15

.read_html has a match parameter that accepts a string or a regular expression, plus an attrs parameter for locating a table by its HTML attributes.

Inspect the HTML table element:

<table class="wikitable plainrowheaders" style="text-align:center;">
  <caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
  </caption>
  <tbody>
    <tr>
      <th scope="col" rowspan="2" style="width:20em;">Title
      </th>
      <th scope="col" rowspan="2" style="width:20em;">Release
       ...

Although the table has no usable attributes, the string List of studio albums can be matched, and na_values marks "—" as missing:

>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
...     url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

Use the first two rows as column names; the result is still messy:

>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
...     url,
...     match="List of studio albums",
...     na_values="—",
...     header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
          Title      Release  ... Peak chart positions Certifications
          Title      Release  ...             US[8][9] Certifications
0   Please P...  Released...  ...          NaN          BPI: Gol...
1   With the...  Released...  ...          NaN          BPI: Gol...
2   Introduc...  Released...  ...            2          RIAA: Pl...
3   Meet the...  Released...  ...            1          MC: Plat...
4   Twist an...  Released...  ...          NaN          MC: 3× P...
..          ...          ...  ...          ...                  ...
22  The Beat...  Released...  ...            1          BPI: 2× ...
23  Yellow S...  Released...  ...            2          BPI: Gol...
24   Abbey Road  Released...  ...            1          BPI: 2× ...
25    Let It Be  Released...  ...            1          BPI: Gol...
26  "—" deno...  "—" deno...  ...  "—" deno...          "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
  codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])

In this situation, the easiest fix is to overwrite the column index:

>>> df = dfs[0]
>>> df.columns = [
...     "Title",
...     "Release",
...     "UK",
...     "AUS",
...     "CAN",
...     "FRA",
...     "GER",
...     "NOR",
...     "US",
...     "Certifications",
... ]
>>> df
          Title      Release  ...           US Certifications
0   Please P...  Released...  ...          NaN  BPI: Gol...
1   With the...  Released...  ...          NaN  BPI: Gol...
2   Introduc...  Released...  ...            2  RIAA: Pl...
3   Meet the...  Released...  ...            1  MC: Plat...
4   Twist an...  Released...  ...          NaN  MC: 3× P...
..          ...          ...  ...          ...          ...
22  The Beat...  Released...  ...            1  BPI: 2× ...
23  Yellow S...  Released...  ...            2  BPI: Gol...
24   Abbey Road  Released...  ...            1  BPI: 2× ...
25    Let It Be  Released...  ...            1  BPI: Gol...
26  "—" deno...  "—" deno...  ...  "—" deno...  "—" deno...

Next, clean up the Release column. In the HTML, the markup looks like this:

<th scope="row" rowspan="2">
  <i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
  <img alt="double-dagger" src="http://upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
</th>

Split the Release column into separate release_date and label columns:

>>> res = (
...     df.pipe(
...         lambda df_: df_[
...             ~df_.Title.str.startswith("Released")
...         ]
...     )
...     .assign(
...         release_date=lambda df_: pd.to_datetime(
...             df_.Release.str.extract(
...                 r"Released: (.*) Label"
...             )[0].str.replace(r"\[E\]", "")
...         ),
...         label=lambda df_: df_.Release.str.extract(
...             r"Label: (.*)"
...         ),
...     )
...     .loc[
...         :,
...         [
...             "Title",
...             "UK",
...             "AUS",
...             "CAN",
...             "FRA",
...             "GER",
...             "NOR",
...             "US",
...             "release_date",
...             "label",
...         ],
...     ]
... )
>>> res
          Title   UK  ... release_date        label
0   Please P...    1  ...   1963-03-22  Parlopho...
1   With the...    1  ...   1963-11-22  Parlopho...
2   Introduc...  NaN  ...   1964-01-10  Vee-Jay ...
3   Meet the...  NaN  ...   1964-01-20  Capitol ...
4   Twist an...  NaN  ...   1964-02-03  Capitol ...
..          ...  ...  ...          ...          ...
21  Magical ...   31  ...   1967-11-27  Parlopho...
22  The Beat...    1  ...   1968-11-22        Apple
23  Yellow S...    3  ...   1969-01-13  Apple (U...
24   Abbey Road    1  ...   1969-09-26        Apple
25    Let It Be    1  ...   1970-05-08        Apple

There's more

Reading a CSV file hosted online; GitHub renders it as an HTML table, which read_html can target via attrs:

>>> url = "https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv"
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
    Unnamed: 0 quadrant     x     y
0          NaN        I  10.0  8.04
1          NaN        I  14.0  9.96
2          NaN        I   6.0  7.24
3          NaN        I   9.0  8.81
4          NaN        I   4.0  4.26
..         ...      ...   ...   ...
39         NaN       IV   8.0  6.58
40         NaN       IV   8.0  7.91
41         NaN       IV   8.0  8.47
42         NaN       IV   8.0  5.25
43         NaN       IV   8.0  6.89

