Chapter 01: Pandas Foundations
Chapter 02: Essential DataFrame Operations
Chapter 03: Creating and Persisting DataFrames
Chapter 04: Beginning Data Analysis
Chapter 05: Exploratory Data Analysis
Chapter 06: Selecting Subsets of Data
Chapter 07: Filtering Rows
Chapter 08: Index Alignment
3.1 Creating a DataFrame
Create a DataFrame from parallel lists:
>>> import pandas as pd
>>> import numpy as np
>>> fname = ["Paul", "John", "Richard", "George"]
>>> lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
>>> birth = [1942, 1940, 1940, 1943]
Create a dictionary from the lists:
>>> people = {"first": fname, "last": lname, "birth": birth}
Build a DataFrame from that dictionary:
>>> beatles = pd.DataFrame(people)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
How it works
When you call the DataFrame constructor, pandas creates a RangeIndex object:
>>> beatles.index
RangeIndex(start=0, stop=4, step=1)
Override the index by passing one explicitly:
>>> pd.DataFrame(people, index=["a", "b", "c", "d"])
first last birth
a Paul McCartney 1942
b John Lennon 1940
c Richard Starkey 1940
d George Harrison 1943
There's more
A DataFrame can also be built from a list of dictionaries:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ]
... )
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
Use the columns parameter to specify the column order:
>>> pd.DataFrame(
... [
... {
... "first": "Paul",
... "last": "McCartney",
... "birth": 1942,
... },
... {
... "first": "John",
... "last": "Lennon",
... "birth": 1940,
... },
... {
... "first": "Richard",
... "last": "Starkey",
... "birth": 1940,
... },
... {
... "first": "George",
... "last": "Harrison",
... "birth": 1943,
... },
... ],
... columns=["last", "first", "birth"],
... )
last first birth
0 McCartney Paul 1942
1 Lennon John 1940
2 Starkey Richard 1940
3 Harrison George 1943
3.2 Writing CSV
Use the .to_csv method to write a DataFrame to a CSV file:
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> from io import StringIO
>>> fout = StringIO()
>>> beatles.to_csv(fout)  # a file path string would also work here
Inspect what was written:
>>> print(fout.getvalue())
,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943
There's more
If you read back the CSV you just saved, you get a redundant column named Unnamed: 0:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout)
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
The index_col parameter tells read_csv which column to use as the index:
>>> _ = fout.seek(0)
>>> pd.read_csv(fout, index_col=0)
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
To omit the row index when saving, set the index parameter to False:
>>> fout = StringIO()
>>> beatles.to_csv(fout, index=False)
>>> print(fout.getvalue())
first,last,birth
Paul,McCartney,1942
John,Lennon,1940
Richard,Starkey,1940
George,Harrison,1943
3.3 Reading large CSV files
pandas processes files in memory; as a rule of thumb, you need 3 to 10 times as much memory as the size of the file.
This recipe uses the diamonds dataset. The nrows parameter limits the read to the first 1,000 rows.
>>> diamonds = pd.read_csv("data/diamonds.csv", nrows=1000)
>>> diamonds
carat cut color clarity ... price x y z
0 0.23 Ideal E SI2 ... 326 3.95 3.98 2.43
1 0.21 Premium E SI1 ... 326 3.89 3.84 2.31
2 0.23 Good E VS1 ... 327 4.05 4.07 2.31
3 0.29 Premium I VS2 ... 334 4.20 4.23 2.63
4 0.31 Good J SI2 ... 335 4.34 4.35 2.75
.. ... ... ... ... ... ... ... ... ...
995 0.54 Ideal D VVS2 ... 2897 5.30 5.34 3.26
996 0.72 Ideal E SI1 ... 2897 5.69 5.74 3.57
997 0.72 Good F VS1 ... 2897 5.82 5.89 3.48
998 0.74 Premium D VS2 ... 2897 5.81 5.77 3.58
999 1.12 Premium J SI2 ... 2898 6.68 6.61 4.03
Use the .info method to check the memory consumed:
>>> diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float64
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float64
table 1000 non-null float64
price 1000 non-null int64
x 1000 non-null float64
y 1000 non-null float64
z 1000 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB
The 1,000 rows use 78.2 KB of memory, so 1 billion rows would take roughly 78 GB.
Use the dtype parameter to choose smaller numeric types while reading:
>>> diamonds2 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... },
... )
>>> diamonds2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null object
color 1000 non-null object
clarity 1000 non-null object
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB
Having changed the numeric types, compare the new DataFrame with the original:
>>> diamonds.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689280 61.722800 ... 5.599180 3.457530
std 0.195291 1.758879 ... 0.611974 0.389819
min 0.200000 53.000000 ... 3.750000 2.270000
25% 0.700000 60.900000 ... 5.630000 3.450000
50% 0.710000 61.800000 ... 5.760000 3.550000
75% 0.790000 62.600000 ... 5.910000 3.640000
max 1.270000 69.500000 ... 7.050000 4.330000
>>> diamonds2.describe()
carat depth ... y z
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean 0.689453 61.718750 ... 5.601562 3.457031
std 0.195312 1.759766 ... 0.611816 0.389648
min 0.199951 53.000000 ... 3.750000 2.269531
25% 0.700195 60.906250 ... 5.628906 3.449219
50% 0.709961 61.812500 ... 5.761719 3.550781
75% 0.790039 62.593750 ... 5.910156 3.640625
max 1.269531 69.500000 ... 7.050781 4.328125
Changing the numeric types saved about 37% of the memory (78.2 KB down to 49.0 KB).
The dtype parameter can also convert the string columns to category. First check their cardinality with .value_counts:
>>> diamonds2.cut.value_counts()
Ideal 333
Premium 290
Very Good 226
Good 89
Fair 62
Name: cut, dtype: int64
>>> diamonds2.color.value_counts()
E 240
F 226
G 139
D 129
H 125
I 95
J 46
Name: color, dtype: int64
>>> diamonds2.clarity.value_counts()
SI1 306
VS2 218
VS1 159
SI2 154
VVS2 62
VVS1 58
I1 29
IF 14
Name: clarity, dtype: int64
因?yàn)槭堑突鶖?shù)张峰,將其轉(zhuǎn)換為category
泪蔫,可以節(jié)省約37%的內(nèi)存:
>>> diamonds3 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "x": np.float32,
... "y": np.float32,
... "z": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... )
>>> diamonds3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
x 1000 non-null float32
y 1000 non-null float32
z 1000 non-null float32
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB
The usecols parameter specifies which columns to load. Here the x, y, and z columns are skipped:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds4 = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... )
>>> diamonds4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
carat 1000 non-null float32
cut 1000 non-null category
color 1000 non-null category
clarity 1000 non-null category
depth 1000 non-null float32
table 1000 non-null float32
price 1000 non-null int16
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB
The memory usage is now only about 23% of the original (17.7 KB versus 78.2 KB).
To process the data one piece at a time, use the chunksize parameter:
>>> cols = [
... "carat",
... "cut",
... "color",
... "clarity",
... "depth",
... "table",
... "price",
... ]
>>> diamonds_iter = pd.read_csv(
... "data/diamonds.csv",
... nrows=1000,
... dtype={
... "carat": np.float32,
... "depth": np.float32,
... "table": np.float32,
... "price": np.int16,
... "cut": "category",
... "color": "category",
... "clarity": "category",
... },
... usecols=cols,
... chunksize=200,
... )
>>> def process(df):
... return f"processed {df.size} items"
>>> for chunk in diamonds_iter:
... process(chunk)
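The loop above only reports chunk sizes; a common real pattern is to keep a running aggregate across the chunks. A minimal sketch, using a small in-memory CSV in place of data/diamonds.csv:

```python
import pandas as pd
from io import StringIO

# Hypothetical in-memory CSV standing in for data/diamonds.csv.
csv = StringIO("price\n" + "\n".join(str(p) for p in range(1000)))

total = 0
for chunk in pd.read_csv(csv, chunksize=200):
    # Aggregate each 200-row chunk and keep a running total,
    # so only one chunk is ever in memory at a time.
    total += chunk.price.sum()
```

Any reduction that can be computed per chunk and combined (sums, counts, min/max) fits this pattern; a mean needs both a running sum and a running count.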
Because CSV files store no type information, pandas must infer the type of each column. If every value in a column is an integer and none are missing, pandas uses int64. If a column is numeric but not integral, or has missing values, pandas uses float64. These two types take a lot of memory. For example, if all the values are below 200, you can use a smaller type such as np.int16 (or np.uint8 if they are all non-negative).
If a column is non-numeric, pandas converts it to the object type. The object type uses a lot of memory because it stores the data as Python strings; converting it to category can save substantial space, since each distinct string is then stored only once.
There's more
Storing the prices as int8 would lose information. You can use NumPy's iinfo function to list the range of each NumPy integer type:
>>> np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)
Use np.finfo to see the range of a floating-point type:
>>> np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04,
max=6.55040e+04, dtype=float16)
The .memory_usage method reports how many bytes a DataFrame or Series uses; note that this includes the row index. Passing deep=True reports the true usage of a Series holding object data:
>>> diamonds.price.memory_usage()
8080
>>> diamonds.price.memory_usage(index=False)
8000
>>> diamonds.cut.memory_usage()
8080
>>> diamonds.cut.memory_usage(deep=True)
63413
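.memory_usage also works on a whole DataFrame, returning one entry per column plus the index; summing it gives the total footprint. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})

# One entry for the index plus one per column, in bytes.
per_col = df.memory_usage(deep=True)
total_bytes = per_col.sum()

assert per_col["a"] == 8000       # 1000 int64 values at 8 bytes each
assert total_bytes > per_col["a"]  # object column and index add more
```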
Once the data types are settled, you can save the data in a binary format that records them, such as Feather (pandas uses the pyarrow library for this):
>>> diamonds4.to_feather("d.arr")
>>> diamonds5 = pd.read_feather("d.arr")
Another option is the Parquet format:
>>> diamonds4.to_parquet("/tmp/d.pqt")
3.4 Working with Excel files
You need the xlwt (for .xls) or openpyxl (for .xlsx) package installed to write Excel files.
Use the .to_excel method to save:
>>> beatles.to_excel("beat.xls")
>>> beatles.to_excel("beat.xlsx")
Read the Excel file back with read_excel:
>>> beat2 = pd.read_excel("beat.xls")
>>> beat2
Unnamed: 0 first last birth
0 0 Paul McCartney 1942
1 1 John Lennon 1940
2 2 Richard Starkey 1940
3 3 George Harrison 1943
Use the index_col parameter to choose the index column:
>>> beat2 = pd.read_excel("beat.xls", index_col=0)
>>> beat2
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
The data types survive the Excel round trip:
>>> beat2.dtypes
first object
last object
birth int64
dtype: object
There's more
Use the sheet_name parameter to write multiple named sheets:
>>> xl_writer = pd.ExcelWriter("beat2.xlsx")
>>> beatles.to_excel(xl_writer, sheet_name="All")
>>> beatles[beatles.birth < 1941].to_excel(
... xl_writer, sheet_name="1940"
... )
>>> xl_writer.save()
The resulting Excel file contains two sheets, one named All and one named 1940.
3.5 Working with ZIP files
If the CSV file is the only file in the ZIP archive, read_csv can read it directly:
>>> autos = pd.read_csv("data/vehicles.csv.zip")
>>> autos
barrels08 barrelsA08 ... phevHwy phevComb
0 15.695714 0.0 ... 0 0
1 29.964545 0.0 ... 0 0
2 12.207778 0.0 ... 0 0
3 29.964545 0.0 ... 0 0
4 17.347895 0.0 ... 0 0
... ... ... ... ... ...
41139 14.982273 0.0 ... 0 0
41140 14.330870 0.0 ... 0 0
41141 15.695714 0.0 ... 0 0
41142 15.695714 0.0 ... 0 0
41143 18.311667 0.0 ... 0 0
>>> autos.modifiedOn.dtype
dtype('O')
The date column in the CSV comes back as strings. You can parse it while loading with the parse_dates parameter of read_csv, or load first and convert with the to_datetime method:
>>> autos= pd.read_csv(
... "data/vehicles.csv.zip", parse_dates=["modifiedOn"]
... )
>>> autos.modifiedOn
0 2013-01-0...
1 2013-01-0...
2 2013-01-0...
3 2013-01-0...
4 2013-01-0...
...
41139 2013-01-0...
41140 2013-01-0...
41141 2013-01-0...
41142 2013-01-0...
41143 2013-01-0...
Name: modifiedOn, Length: 41144, dtype: datetime64[ns, tzlocal()]
Without parse_dates, the column comes back as plain strings:
>>> autos.modifiedOn
0 Tue Jan 01 00:00:00 EST 2013
1 Tue Jan 01 00:00:00 EST 2013
2 Tue Jan 01 00:00:00 EST 2013
3 Tue Jan 01 00:00:00 EST 2013
4 Tue Jan 01 00:00:00 EST 2013
...
39096 Tue Jan 01 00:00:00 EST 2013
39097 Tue Jan 01 00:00:00 EST 2013
39098 Tue Jan 01 00:00:00 EST 2013
39099 Tue Jan 01 00:00:00 EST 2013
39100 Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, Length: 39101, dtype: object
>>> pd.to_datetime(autos.modifiedOn)
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
...
39096 2013-01-01
39097 2013-01-01
39098 2013-01-01
39099 2013-01-01
39100 2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]
If the ZIP archive contains several files, use the zipfile module. Because the second line of this dataset (the first data row) contains the survey questions, it is stored in kag_questions:
>>> import zipfile
>>> with zipfile.ZipFile(
... "data/kaggle-survey-2018.zip"
... ) as z:
... print("\n".join(z.namelist()))
... kag = pd.read_csv(
... z.open("multipleChoiceResponses.csv")
... )
... kag_questions = kag.iloc[0]
... survey = kag.iloc[1:]
multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
>>> survey.head(2).T
1 2
Time from... 710 434
Q1 Female Male
Q1_OTHER_... -1 -1
Q2 45-49 30-34
Q3 United S... Indonesia
... ... ...
Q50_Part_5 NaN NaN
Q50_Part_6 NaN NaN
Q50_Part_7 NaN NaN
Q50_Part_8 NaN NaN
Q50_OTHER... -1 -1
There's more
When the archive holds only a single file, read_csv can also read GZIP, BZ2, and XZ files directly.
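The compression format is inferred from the file extension in both directions, so to_csv compresses on write the same way. A sketch writing and re-reading a gzipped CSV under a temporary path:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

path = os.path.join(tempfile.mkdtemp(), "small.csv.gz")
df.to_csv(path, index=False)    # gzip compression inferred from the .gz suffix
round_trip = pd.read_csv(path)  # decompression inferred the same way

assert round_trip.equals(df)
```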
3.6 Working with databases
Create an SQLite database and store the Beatles data in it:
>>> import sqlite3
>>> con = sqlite3.connect("data/beat.db")
>>> with con:
... cur = con.cursor()
...     cur.execute("""DROP TABLE IF EXISTS Band""")
... cur.execute(
... """CREATE TABLE Band(id INTEGER PRIMARY KEY,
... fname TEXT, lname TEXT, birthyear INT)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 0, 'Paul', 'McCartney', 1942)"""
... )
... cur.execute(
... """INSERT INTO Band VALUES(
... 1, 'John', 'Lennon', 1940)"""
... )
... _ = con.commit()
Read the database back into a DataFrame, here through SQLAlchemy:
>>> import sqlalchemy as sa
>>> engine = sa.create_engine(
... "sqlite:///data/beat.db", echo=True
... )
>>> sa_connection = engine.connect()
>>> beat = pd.read_sql(
... "Band", sa_connection, index_col="id"
... )
>>> beat
fname lname birthyear
id
0 Paul McCartney 1942
1 John Lennon 1940
Data can also be read with a SQL statement, using either the sqlite3 or the SQLAlchemy connection:
>>> sql = """SELECT fname, birthyear from Band"""
>>> fnames = pd.read_sql(sql, con)
>>> fnames
fname birthyear
0 Paul 1942
1 John 1940
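Writing goes the other way with the .to_sql method. A sketch against a throwaway in-memory SQLite database:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")  # throwaway database

df = pd.DataFrame(
    {"fname": ["Paul", "John"], "birthyear": [1942, 1940]}
)
df.to_sql("Band", con, index=False)           # create the table, insert the rows
back = pd.read_sql("SELECT * FROM Band", con)  # read them back

assert list(back.fname) == ["Paul", "John"]
```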
3.7 Reading JSON
Encoding and decoding JSON data:
>>> import json
>>> encoded = json.dumps(people)
>>> encoded
'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'
>>> json.loads(encoded)
{'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
Use read_json to load JSON data; the dictionary keys become the column names.
>>> beatles = pd.read_json(encoded)
>>> beatles
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
When reading JSON, pandas supports several orientations:
- columns — (default) maps column names to lists of that column's values;
- records — a list of rows, where each row is a dictionary mapping column names to values;
- split — maps columns to the column names, index to the index values, and data to a list of row-value lists;
- index — maps index values to rows, where each row is a dictionary mapping column names to values;
- values — a list of rows (each itself a list of values), with no column names or index values;
- table — maps schema to the DataFrame's schema and data to a list of row dictionaries.
The code below demonstrates each orient:
>>> records = beatles.to_json(orient="records")
>>> records
'[{"first":"Paul","last":"McCartney","birth":1942},{"first":"John","last":"Lennon","birth":1940},{"first":"Richard","last":"Starkey","birth":1940},{"first":"George","last":"Harrison","birth":1943}]'
>>> pd.read_json(records, orient="records")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> split = beatles.to_json(orient="split")
>>> split
'{"columns":["first","last","birth"],"index":[0,1,2,3],"data":[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]}'
>>> pd.read_json(split, orient="split")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> index = beatles.to_json(orient="index")
>>> index
'{"0":{"first":"Paul","last":"McCartney","birth":1942},"1":{"first":"John","last":"Lennon","birth":1940},"2":{"first":"Richard","last":"Starkey","birth":1940},"3":{"first":"George","last":"Harrison","birth":1943}}'
>>> pd.read_json(index, orient="index")
birth first last
0 1942 Paul McCartney
1 1940 John Lennon
2 1940 Richard Starkey
3 1943 George Harrison
>>> values = beatles.to_json(orient="values")
>>> values
'[["Paul","McCartney",1942],["John","Lennon",1940],["Richard","Starkey",1940],["George","Harrison",1943]]'
>>> pd.read_json(values, orient="values")
0 1 2
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> (
... pd.read_json(values, orient="values").rename(
... columns=dict(
... enumerate(["first", "last", "birth"])
... )
... )
... )
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
>>> table = beatles.to_json(orient="table")
>>> table
'{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"first","type":"string"},{"name":"last","type":"string"},{"name":"birth","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"first":"Paul","last":"McCartney","birth":1942},{"index":1,"first":"John","last":"Lennon","birth":1940},{"index":2,"first":"Richard","last":"Starkey","birth":1940},{"index":3,"first":"George","last":"Harrison","birth":1943}]}'
>>> pd.read_json(table, orient="table")
first last birth
0 Paul McCartney 1942
1 John Lennon 1940
2 Richard Starkey 1940
3 George Harrison 1943
There's more
To convert the data to a dictionary (which makes it easy to attach extra entries), use the .to_dict method:
>>> output = beat.to_dict()
>>> output
{'fname': {0: 'Paul', 1: 'John'}, 'lname': {0: 'McCartney', 1: 'Lennon'}, 'birthyear': {0: 1942, 1: 1940}}
>>> output["version"] = "0.4.1"
>>> json.dumps(output)
'{"fname": {"0": "Paul", "1": "John"}, "lname": {"0": "McCartney", "1": "Lennon"}, "birthyear": {"0": 1942, "1": 1940}, "version": "0.4.1"}'
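.to_dict accepts orient choices much like the JSON methods; orient="records" is often the handiest shape for serialization. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"first": ["Paul", "John"], "birth": [1942, 1940]})

# records: one dictionary per row, column names as keys.
recs = df.to_dict(orient="records")

assert recs == [
    {"first": "Paul", "birth": 1942},
    {"first": "John", "birth": 1940},
]
```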
3.8 Reading HTML tables
pandas can read tables embedded in HTML pages:
# Translator's note: en.wikipedia.org may be inaccessible in some regions without a proxy
>>> url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
>>> dfs = pd.read_html(url)
>>> len(dfs)
51
That pulled in 51 DataFrames; check the first one:
>>> dfs[0]
The Beatles discography The Beatles discography.1
0 The Beat... The Beat...
1 Studio a... 23
2 Live albums 5
3 Compilat... 53
4 Video al... 15
5 Music vi... 64
6 EPs 21
7 Singles 63
8 Mash-ups 2
9 Box sets 15
read_html has a match parameter that accepts a string or a regular expression, and an attrs parameter for locating the table by its HTML attributes.
Inspect the table element in the HTML:
<table class="wikitable plainrowheaders" style="text-align:center;">
<caption>List of studio albums,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">[A]</a></sup> with selected chart positions and certifications
</caption>
<tbody>
<tr>
<th scope="col" rowspan="2" style="width:20em;">Title
</th>
<th scope="col" rowspan="2" style="width:20em;">Release
...
The table has no distinctive attributes, but the string List of studio albums matches its caption. Missing values are marked with "—", handled via na_values:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url, match="List of studio albums", na_values="—"
... )
>>> len(dfs)
1
>>> dfs[0].columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Using the first two rows as column names helps, but the result is still messy:
>>> url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
>>> dfs = pd.read_html(
... url,
... match="List of studio albums",
... na_values="—",
... header=[0, 1],
... )
>>> len(dfs)
1
>>> dfs[0]
Title Release ... Peak chart positions Certifications
Title Release ... US[8][9] Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
>>> dfs[0].columns
MultiIndex(levels=[['Certifications', 'Peak chart positions', 'Release', 'Title'], ['AUS[3]', 'CAN[4]', 'Certifications', 'FRA[5]', 'GER[6]', 'NOR[7]', 'Release', 'Title', 'UK[1][2]', 'US[8][9]']],
codes=[[3, 2, 1, 1, 1, 1, 1, 1, 1, 0], [7, 6, 8, 0, 1, 3, 4, 5, 9, 2]])
In a case like this, the easiest fix is to assign a new column index:
>>> df = dfs[0]
>>> df.columns = [
... "Title",
... "Release",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "Certifications",
... ]
>>> df
Title Release ... US Certifications
0 Please P... Released... ... NaN BPI: Gol...
1 With the... Released... ... NaN BPI: Gol...
2 Introduc... Released... ... 2 RIAA: Pl...
3 Meet the... Released... ... 1 MC: Plat...
4 Twist an... Released... ... NaN MC: 3× P...
.. ... ... ... ... ...
22 The Beat... Released... ... 1 BPI: 2× ...
23 Yellow S... Released... ... 2 BPI: Gol...
24 Abbey Road Released... ... 1 BPI: 2× ...
25 Let It Be Released... ... 1 BPI: Gol...
26 "—" deno... "—" deno... ... "—" deno... "—" deno...
Next, clean up the Release column. The corresponding HTML looks like this:
<th scope="row" rowspan="2">
<i><a href="/wiki/A_Hard_Day%27s_Night_(album)" title="A Hard Day's Night (album)">A Hard Day's Night</a></i>
<img alt="double-dagger" src="http://upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" decoding="async" width="9" height="14" data-file-width="9" data-file-height="14">
</th>
Split the Release column into separate release_date and label columns:
>>> res = (
... df.pipe(
... lambda df_: df_[
... ~df_.Title.str.startswith("Released")
... ]
... )
... .assign(
... release_date=lambda df_: pd.to_datetime(
... df_.Release.str.extract(
... r"Released: (.*) Label"
... )[0].str.replace(r"\[E\]", "")
... ),
... label=lambda df_: df_.Release.str.extract(
... r"Label: (.*)"
... ),
... )
... .loc[
... :,
... [
... "Title",
... "UK",
... "AUS",
... "CAN",
... "FRA",
... "GER",
... "NOR",
... "US",
... "release_date",
... "label",
... ],
... ]
... )
>>> res
Title UK ... release_date label
0 Please P... 1 ... 1963-03-22 Parlopho...
1 With the... 1 ... 1963-11-22 Parlopho...
2 Introduc... NaN ... 1964-01-10 Vee-Jay ...
3 Meet the... NaN ... 1964-01-20 Capitol ...
4 Twist an... NaN ... 1964-02-03 Capitol ...
.. ... ... ... ... ...
21 Magical ... 31 ... 1967-11-27 Parlopho...
22 The Beat... 1 ... 1968-11-22 Apple
23 Yellow S... 3 ... 1969-01-13 Apple (U...
24 Abbey Road 1 ... 1969-09-26 Apple
25 Let It Be 1 ... 1970-05-08 Apple
There's more
read_html can even pull in a CSV file hosted on GitHub, since GitHub renders it as an HTML table (with class csv-data):
>>> url = "https://github.com/mattharrison/datasets/blob/master/data/anscombes.csv"
>>> dfs = pd.read_html(url, attrs={"class": "csv-data"})
>>> len(dfs)
1
>>> dfs[0]
Unnamed: 0 quadrant x y
0 NaN I 10.0 8.04
1 NaN I 14.0 9.96
2 NaN I 6.0 7.24
3 NaN I 9.0 8.81
4 NaN I 4.0 4.26
.. ... ... ... ...
39 NaN IV 8.0 6.58
40 NaN IV 8.0 7.91
41 NaN IV 8.0 8.47
42 NaN IV 8.0 5.25
43 NaN IV 8.0 6.89