Note: the dataset used in this article comes from the Kaggle Titanic competition: https://www.kaggle.com/c/titanic/data
1. DataFrame structure
DataFrame is the core data structure in Pandas. It can be constructed from a dictionary whose values are lists:
>> import pandas as pd
>> data = {'a': [1,2,3], 'b':[1.2, None, 1.3], 'c':['Alex', 'Bob', 'Chandler']}
>> data
{'a': [1, 2, 3], 'b': [1.2, None, 1.3], 'c': ['Alex', 'Bob', 'Chandler']}
>> df_data = pd.DataFrame(data)
>> df_data
The dictionary keys become the DataFrame's column labels, and each list becomes the values of the corresponding column:
>> df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       3 non-null      int64
 1   b       2 non-null      float64
 2   c       3 non-null      object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
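The same information that info() prints can also be checked programmatically. The following is a minimal sketch that reuses the dictionary from above; the variable name missing_per_column is only illustrative.

```python
import pandas as pd

# Same dictionary as above: each key becomes a column, each list its values.
data = {'a': [1, 2, 3], 'b': [1.2, None, 1.3], 'c': ['Alex', 'Bob', 'Chandler']}
df_data = pd.DataFrame(data)

# None inside a numeric list is stored as NaN, which is why 'b' is float64
# and reports only 2 non-null values.
missing_per_column = df_data.isna().sum()

print(df_data.dtypes)
print(missing_per_column)
```

Column 'a' stays int64 because it has no missing values, while 'b' is stored as float64 because NaN is a floating-point value.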
2. Loading a DataFrame from a CSV file
>> df = pd.read_csv('../data/train.csv')
>> df.head()
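Result of the read. Note that the df.info() output below shows Name as the index and only 11 columns, so it appears to have been captured after the set_index('Name') step in section 3; immediately after read_csv the frame has a default RangeIndex and Name is an ordinary column: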
>> df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, Braund, Mr. Owen Harris to Dooley, Mr. Patrick
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Sex          891 non-null    object
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64
 6   Parch        891 non-null    int64
 7   Ticket       891 non-null    object
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object
 10  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(4)
memory usage: 123.5+ KB
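For readers without the Kaggle file at hand, read_csv can be tried on an in-memory string. This is only a sketch: the three rows below are made up for illustration and merely mimic a few of the Titanic columns; they are not taken from train.csv.

```python
import io
import pandas as pd

# A tiny inline stand-in for train.csv. The rows are invented for illustration.
csv_text = '''PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Doe, Mr. John",male,22
2,1,1,"Roe, Mrs. Mary",female,38
3,1,3,"Poe, Miss. Anna",female,
'''

# read_csv accepts a file-like object, so StringIO works in place of a path.
sample = pd.read_csv(io.StringIO(csv_text))

# The empty Age field in the last row is parsed as NaN, so Age becomes float64.
print(sample.shape)
print(sample['Age'].isna().sum())
```

The quoted Name fields show that commas inside double quotes are kept as part of the value rather than treated as separators.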
In addition, the DataFrame's describe method returns commonly used summary statistics:
>> df.describe()
The result covers every numeric column and includes the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median).
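To see which rows describe() produces without loading the full dataset, here is a small sketch on a hypothetical four-row frame (the name toy and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-frame: one numeric column with a gap, one text column.
toy = pd.DataFrame({
    'Age': [22.0, 38.0, None, 35.0],
    'Sex': ['male', 'female', 'female', 'female'],
})

# By default describe() summarizes only the numeric columns
# and leaves missing values out of every statistic.
stats = toy.describe()
print(stats)
```

Here the count for Age is 3 rather than 4, since the missing value is not counted, and the text column Sex does not appear in the result.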
3. DataFrame columns as Series
Use a column label to take a single column out of the DataFrame:
>> age = df['Age']
>> age
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 NaN
889 26.0
890 32.0
Name: Age, Length: 891, dtype: float64
>> type(age)
pandas.core.series.Series
Each column of a DataFrame is a Series. Use slicing to print the first 5 elements of the Series:
>> age[:5]
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
A Series also has an index attribute:
>> age.index
RangeIndex(start=0, stop=891, step=1)
The values attribute returns an ndarray:
>> age.values[:5]
array([22., 38., 26., 35., 35.])
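The relationship between a Series, its index, and its underlying array can be sketched on a short hand-built Series (the values below simply repeat the first few ages shown above; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name='Age')

# Slicing with [:3] is positional and keeps the original labels 0, 1, 2.
first_three = ages[:3]

# .values exposes the data as a NumPy array; to_numpy() is an equivalent method.
raw = ages.values
same_raw = ages.to_numpy()

print(first_three.index)
print(type(raw))
```

The slice is still a Series with its name and index intact, whereas values and to_numpy() drop the labels and return only the data.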
Next, use a chosen column as the DataFrame's row index:
>> df = df.set_index('Name')
>> df.head()
The Name column becomes the DataFrame's row index.
A Series taken from a column is now indexed by passenger name as well:
>> age = df['Age']
>> age[:5]
Name
Braund, Mr. Owen Harris 22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0
Heikkinen, Miss. Laina 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
Allen, Mr. William Henry 35.0
Name: Age, dtype: float64
Use an index label to retrieve a specific value from the Series:
>> age['Heikkinen, Miss. Laina']
26.0
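The set_index and label lookup steps can be reproduced on a small frame. This is a sketch with made-up names and ages, not rows from the Titanic data:

```python
import pandas as pd

# Hypothetical passengers, used only to illustrate label-based access.
people = pd.DataFrame({
    'Name': ['Doe, Mr. John', 'Roe, Mrs. Mary', 'Poe, Miss. Anna'],
    'Age': [22.0, 38.0, 26.0],
})

# set_index returns a new DataFrame; 'Name' moves from the columns to the index.
by_name = people.set_index('Name')
age_by_name = by_name['Age']

# Both forms look the value up by label; .loc makes that intent explicit.
print(age_by_name['Poe, Miss. Anna'])
print(age_by_name.loc['Poe, Miss. Anna'])
```

Because set_index does not modify the original frame, the article reassigns the result with df = df.set_index('Name').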
A Series also provides many common statistical methods:
>> age.mean()
29.69911764705882
>> age.max(),age.min()
(80.0, 0.42)
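The Age column contains missing values, yet mean(), max(), and min() still return numbers because these methods skip NaN by default. A minimal sketch on a hypothetical four-element Series:

```python
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([22.0, 38.0, None, 30.0])

mean_default = ages.mean()              # NaN is skipped: (22 + 38 + 30) / 3
mean_strict = ages.mean(skipna=False)   # NaN propagates into the result
non_missing = ages.count()              # number of non-NaN values

print(mean_default, mean_strict, non_missing)
```

Passing skipna=False is one way to find out whether a statistic was computed over incomplete data.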
Like a NumPy ndarray, a Series supports broadcasting when combined with a scalar:
>> age + 10
Name
Braund, Mr. Owen Harris 32.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 48.0
Heikkinen, Miss. Laina 36.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 45.0
Allen, Mr. William Henry 45.0
...
Montvila, Rev. Juozas 37.0
Graham, Miss. Margaret Edith 29.0
Johnston, Miss. Catherine Helen "Carrie" NaN
Behr, Mr. Karl Howell 36.0
Dooley, Mr. Patrick 42.0
Name: Age, Length: 891, dtype: float64
Every value in the Series has 10 added to it, while missing values remain NaN.
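The broadcast can be compared directly with the equivalent NumPy operation. The sketch below uses a hypothetical three-element Series with string labels:

```python
import numpy as np
import pandas as pd

# Hypothetical labeled ages, including one missing value.
ages = pd.Series([22.0, None, 32.0], index=['a', 'b', 'c'])

# The scalar 10 is applied to every element; the index is carried over unchanged.
older = ages + 10

# The same arithmetic on the underlying ndarray gives the same numbers.
older_raw = ages.to_numpy() + 10

print(older)
```

The difference from a plain ndarray is that the result keeps its labels, so values can still be looked up by index after the arithmetic.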