1, Read a CSV file
training_raw = pd.read_csv('dataset/adult.data', header=None, names=headers, sep=r',\s', na_values=["?"], engine='python')
sep is the delimiter (here a regex: a comma followed by one whitespace character, which requires engine='python'); na_values lists the strings to treat as NaN (here "?")
Other useful read_csv arguments:
    dtype={'onpromotion': bool},                  # set the data type of a column
    converters={'unit_sales': lambda u: np.log1p(
        float(u)) if float(u) > 0 else 0},        # transform values while reading
    parse_dates=["date"],                         # parse the column as datetime (the column stays in its original position)
    skiprows=range(1, 66458909)                   # skip rows 1 through 66458908; row 0 (the header) is still read
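A minimal sketch of these arguments working together. The in-memory CSV text and its column names are made up for illustration; they are not the adult dataset.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data; "?" marks a missing value.
csv_text = (
    "date, store, unit_sales\n"
    "2017-01-01, A, 3\n"
    "2017-01-02, ?, -1\n"
    "2017-01-03, B, 0\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=r",\s",            # regex separator, so the python engine is needed
    engine="python",
    na_values=["?"],       # read "?" as NaN
    parse_dates=["date"],  # parse this column as datetime
    # log1p for positive sales, 0 otherwise
    converters={"unit_sales": lambda u: np.log1p(float(u)) if float(u) > 0 else 0},
)
print(df)
```

The date column keeps its position; parse_dates only changes its type.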
2, Save to CSV
df1.to_csv('test.csv', encoding='utf-8', index=False)
index=False leaves the index out of the file
3, Read Excel
pd.read_excel
A sheet can be chosen with sheet_name, e.g. sheet_name='Sheet1',
keep_default_na=False keeps empty cells as '' instead of converting them to NaN,
header=None means the file has no header row
4, DataFrame
pd.DataFrame(x_test, columns=columns)
The table's contents are x_test; columns gives the column names
DataFrame.columns returns all column names of the table
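5, loc, iloc
df1.loc[:, '營銷是否成功'] = y_test
loc is label-based and can create a column that does not exist yet; iloc is position-based and can only address positions that already exist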
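A small sketch of that difference, using a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# loc is label-based: assigning to a label that does not exist creates the column.
df.loc[:, "c"] = [7, 8, 9]

# iloc is position-based: column position 3 does not exist, so this raises IndexError.
try:
    df.iloc[:, 3] = [0, 0, 0]
    iloc_failed = False
except IndexError:
    iloc_failed = True

# iloc works for positions that already exist.
df.iloc[0, 0] = 100
print(df, iloc_failed)
```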
6, Get values by position
df.iloc[:, :-1].values
All rows, every column except the last, as a NumPy array
7, Get all index labels
index_list = data.index.values
8, Change a value in the table
data.iloc[i +1, 0] = name
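9, Show numerical and categorical features
Numerical: values you can do arithmetic on.
Categorical: any feature that holds categories or text.
# summarize all numerical features
dataset_raw.describe()
# summarize categorical (object dtype) features; the argument is the letter 'O', not the digit zero
dataset_raw.describe(include=['O'])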
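A short sketch of the two describe() calls on a made-up frame. The text column is created with dtype=object explicitly, as an assumption, so that include=['O'] matches it regardless of how a given pandas version infers string columns.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 38, 52],
    # Explicit object dtype so include=["O"] selects this column.
    "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
})

numeric_summary = df.describe()                    # count, mean, std, ... for 'age'
categorical_summary = df.describe(include=["O"])   # count, unique, top, freq for 'workclass'
print(numeric_summary)
print(categorical_summary)
```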
10, Show the dtype of a column
dataset_raw.dtypes['fnlwgt']
11, Set predclass to 1 where it equals '>50K'
dataset_raw.loc[dataset_raw['predclass'] == '>50K', 'predclass'] = 1
12, cut: binning
dataset_bin['age'] = pd.cut(dataset_raw['age'], 10)
10 is the number of equal-width bins
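As an illustration, here is pd.cut on a few made-up ages with 3 bins instead of 10, so the result is easy to read:

```python
import pandas as pd

ages = pd.Series([17, 25, 33, 41, 58, 90])

# Three equal-width intervals spanning the range 17..90 (each about 24.3 wide).
binned = pd.cut(ages, 3)

print(binned)
print(binned.cat.codes.tolist())  # bin number of each age
```

Because the bins have equal width, not equal counts, most values can end up in one bin when the data is skewed.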
13, pandas provides one-hot encoding through pd.get_dummies()
dataset_bin_enc = pd.get_dummies(dataset_bin, columns=one_hot_cols)
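A minimal sketch of what get_dummies produces, on a made-up frame with one categorical column:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [30, 41, 25]})

# Only the listed columns are encoded; each category becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["sex"])
print(encoded)
```

The untouched columns come first, followed by one `sex_<category>` column per distinct value.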
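14, astype: set the data type
dataset_con = dataset_con.astype(str)
Convert a non-numeric feature to the category dtype (stored internally as integer codes)
grid_df[col] = grid_df[col].astype('category')
15, Drop a column
dataset_con_enc.drop('predclass', axis=1)
drop returns a new DataFrame, so assign the result to keep the change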
16, Sort by a column
importance.sort_values(by='Importance', ascending=True)
17, Drop rows with missing values
# subset restricts the check to the given columns: drop every row where age or sex is missing (missing values are np.nan)
new_titanic_survival = titanic_survival.dropna(subset=["age","sex"])
# drop every row that has a missing value in any column
train = train.dropna(axis=0)
18, Fill missing values
dataset.fillna(-1,inplace=True)
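The three calls above behave differently on the same data. A sketch on a made-up four-row frame; filling only the numeric columns is an assumption made here to keep the example type-safe, whereas the note above fills the whole frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 40.0],
    "sex": ["male", "female", None, "female"],
    "fare": [7.25, 71.28, 8.05, np.nan],
})

# Drop rows only when 'age' or 'sex' is missing; a missing 'fare' is tolerated.
has_age_and_sex = df.dropna(subset=["age", "sex"])

# Drop rows with a missing value in any column.
complete_rows = df.dropna(axis=0)

# Replace missing values with a sentinel instead of dropping rows.
numeric_filled = df[["age", "fare"]].fillna(-1)

print(has_age_and_sex.index.tolist(), complete_rows.index.tolist())
```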
19, Grouping: group by
http://www.reibang.com/p/50fb023f208c
20, reset_index: reset the index to the default 0..n-1
https://blog.csdn.net/weixin_43655282/article/details/97889398
#drop=True: discard the old index instead of keeping it as a column
21, merge: join two tables
https://blog.csdn.net/Asher117/article/details/84725199
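A short sketch of merge on two made-up tables that share an `id` column:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [0.5, 0.7, 0.9]})

# Default is an inner join: only ids present in both tables survive.
inner = left.merge(right, on="id")

# A left join keeps every row of `left`; unmatched rows get NaN for `score`.
left_join = left.merge(right, on="id", how="left")

print(inner)
print(left_join)
```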
22, value_counts(): count how many times each distinct value occurs
dropna=False keeps missing values in the count; normalize=True returns each value's share instead of its count
http://www.reibang.com/p/f773b4b82c66
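A quick sketch of the three variants on a made-up Series with one missing value:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", None, "a"])

counts = s.value_counts()                 # missing values are ignored by default
with_na = s.value_counts(dropna=False)    # missing values get their own row
shares = s.value_counts(normalize=True)   # fractions of the non-missing values

print(counts)
print(with_na)
print(shares)
```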
23,iterrows
https://blog.csdn.net/Softdiamonds/article/details/80218777
24, pandas groupby grouping and agg aggregation
https://blog.csdn.net/u012706792/article/details/80892510
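Since this note only links out, here is a minimal sketch of groupby followed by agg. The store and sales data are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [10, 20, 5, 7],
})

# One row per store, one column per aggregation.
summary = df.groupby("store")["sales"].agg(["sum", "mean", "max"])
print(summary)
```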
25,map,apply
https://blog.csdn.net/u010814042/article/details/76401133
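26, quantile
# quantile returns the value at the given quantile (0.05 is the 5th percentile); this line raises everything below the 5th percentile up to it
group[group < group.quantile(.05)] = group.quantile(.05)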
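A sketch of the same clamping on a made-up Series holding 1 to 100. Series.clip(lower=...) is shown as an equivalent alternative; it is not something the original note uses.

```python
import pandas as pd

s = pd.Series(range(1, 101), dtype=float)

low = s.quantile(0.05)   # value at the 5th percentile

# Same pattern as above: overwrite everything below the threshold.
clamped = s.copy()
clamped[clamped < low] = low

# Equivalent one-liner.
clipped = s.clip(lower=low)

print(low, clamped.min())
```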
27,transform
http://www.reibang.com/p/509d7b97088c
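As a rough illustration of transform: unlike agg, it returns a result aligned with the original rows, so group statistics can be written back as new columns. The data below is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "sales": [10, 20, 5, 7],
})

# Each row receives the mean of its own group.
df["store_mean"] = df.groupby("store")["sales"].transform("mean")

# Each row's share of its group's total.
df["share"] = df["sales"] / df.groupby("store")["sales"].transform("sum")

print(df)
```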
28, drop_duplicates: remove duplicate rows
https://blog.csdn.net/ghr5582/article/details/80693882
29, nunique: returns the number of distinct values
https://blog.csdn.net/feizxiang3/article/details/93380525
30, sample: shuffle rows (frac=1 returns every row in random order)
x_data = x_data.sample(frac=1, random_state=1).reset_index(drop=True)
https://www.cnblogs.com/webRobot/p/11484648.html
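31, tail
tail() shows rows from the end of the dataset; like head() it shows 5 by default, and the count can be changed.
32, Correlation coefficients, corr()
https://blog.csdn.net/walking_visitor/article/details/85128461
32, as_matrix (removed in pandas 1.0; use .to_numpy() or .values instead)
https://www.cnblogs.com/key221/p/9394051.html
33, .transpose
Swap rows and columns
pd.DataFrame(deck_percentages).transpose()
34, .levels
The levels of a hierarchical index (MultiIndex); mostly seen after a groupby on several keys
35, qcut: quantile-based binning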
pd.qcut(df_all['Fare'], 13)
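A small sketch contrasting qcut with cut on made-up fares that contain one outlier. Four bins are used instead of 13 to keep the output short.

```python
import pandas as pd

fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# qcut: bin edges are quantiles, so each bin holds about the same number of rows.
by_quantile = pd.qcut(fares, 4)
quantile_counts = by_quantile.value_counts()

# cut: bins have equal width, so the outlier pushes almost everything into the first bin.
by_width = pd.cut(fares, 4)

print(quantile_counts)
print(by_width.cat.codes.tolist())
```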
36, Splitting strings: split
df_all['Title'] = df_all['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
# expand: boolean, default False. If True, returns a DataFrame (or MultiIndex); if False, returns a Series (or Index)
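To see the two-step split at work, here is a sketch on two invented names in the "Surname, Title. Given names" pattern:

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"])

# Step 1: split on ", " and keep the part after the surname.
after_surname = names.str.split(", ", expand=True)[1]

# Step 2: split on "." and keep the part before it, which is the title.
titles = after_surname.str.split(".", expand=True)[0]

print(titles.tolist())
```

With expand=True each split returns a DataFrame, which is why the pieces can be picked by column number.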
37, .cat: concatenate strings
https://blog.csdn.net/zbrj12345/article/details/81181015
38,melt
index_columns = ['id','item_id','dept_id','cat_id','store_id','state_id']
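#id_vars are the identifier columns, which stay unchanged; all remaining columns are unpivoted into rows, with the old column names stored in the var_name column and the values in the value_name column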
train_df = train_df.melt(id_vars = index_columns,var_name='d',value_name='sales')
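A minimal before-and-after sketch of melt. The tiny wide table and its column names are invented for the example.

```python
import pandas as pd

# Before: one row per id, one column per day.
wide = pd.DataFrame({"id": ["x", "y"], "d_1": [3, 0], "d_2": [1, 4]})

# After: one row per (id, day) pair.
long = wide.melt(id_vars=["id"], var_name="d", value_name="sales")

print(wide)
print(long)
```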
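34, shift
Shifts the data in a DataFrame by a number of rows
https://www.cnblogs.com/liulangmao/p/9301032.html
35, rolling: moving-window calculations for time series
https://blog.csdn.net/liuhaolei1992/article/details/89421212
36, reindex: conform the data to a new index; can add or reorder rows
https://blog.csdn.net/missyougoon/article/details/83409717
37, diff: the difference between an element and another element of the same column (by default the previous row)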
https://jingyan.baidu.com/article/2a13832852b1d1464a134f90.html
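shift, rolling and diff are closely related, so one sketch on a made-up Series covers all three:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0])

lagged = s.shift(1)                        # previous row's value; first row is NaN
delta = s.diff()                           # same as s - s.shift(1)
rolling_mean = s.rolling(window=2).mean()  # mean of each row and the one before it

print(pd.DataFrame({"s": s, "lagged": lagged, "delta": delta, "rolling_mean": rolling_mean}))
```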
38, add_prefix
Prefixes every label with the given string
https://www.cjavapy.com/article/276/
39, resample: resampling
https://www.jb51.net/article/164438.htm
40, slice: slicing data
https://blog.csdn.net/claroja/article/details/64925356
41, assign: add a column to a DataFrame (returns a new DataFrame)
https://www.cnblogs.com/jason--/p/11502710.html
42, to_pickle
Saves the data to a pickle file
43, Date methods
grid_df['date'] = pd.to_datetime(grid_df['date'])
grid_df['tm_d'] = grid_df['date'].dt.day.astype(np.int8)
grid_df['tm_w'] = grid_df['date'].dt.isocalendar().week.astype(np.int8)  # .dt.week was removed in pandas 2.0
grid_df['tm_m'] = grid_df['date'].dt.month.astype(np.int8)
grid_df['tm_y'] = grid_df['date'].dt.year
grid_df['tm_y'] = (grid_df['tm_y'] - grid_df['tm_y'].min()).astype(np.int8)
grid_df['tm_wm'] = grid_df['tm_d'].apply(lambda x: ceil(x/7)).astype(np.int8)  # week of the month (needs: from math import ceil)
grid_df['tm_dw'] = grid_df['date'].dt.dayofweek.astype(np.int8)
grid_df['tm_w_end'] = (grid_df['tm_dw']>=5).astype(np.int8)  # 1 if weekend, else 0
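A runnable sketch of a few of these features on three made-up dates. It uses .dt.isocalendar().week, available since pandas 1.1, for the week of the year.

```python
from math import ceil

import numpy as np
import pandas as pd

df = pd.DataFrame({"date": ["2016-01-01", "2016-01-09", "2016-02-29"]})
df["date"] = pd.to_datetime(df["date"])

df["tm_d"] = df["date"].dt.day.astype(np.int8)                         # day of month
df["tm_w"] = df["date"].dt.isocalendar().week.astype(np.int8)          # ISO week of the year
df["tm_wm"] = df["tm_d"].apply(lambda d: ceil(d / 7)).astype(np.int8)  # week of the month
df["tm_dw"] = df["date"].dt.dayofweek.astype(np.int8)                  # Monday=0 ... Sunday=6
df["tm_w_end"] = (df["tm_dw"] >= 5).astype(np.int8)                    # 1 on Saturday/Sunday

print(df)
```

Note that 2016-01-01 falls in ISO week 53 of the previous year, so the week number is not always monotonic within a calendar year.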
44, train['SalePrice'].skew(): skewness
train['SalePrice'].kurt(): kurtosis
45, crosstab: cross-tabulation (pd.crosstab)
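A minimal sketch of pd.crosstab, which counts how often each combination of two columns occurs. The data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["m", "f", "m", "f", "m"],
    "survived": [0, 1, 0, 0, 1],
})

# Rows are the values of 'sex', columns the values of 'survived', cells are counts.
table = pd.crosstab(df["sex"], df["survived"])
print(table)
```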