1. Overall understanding of the data:
a. Read the dataset and check its size and original feature dimensions;
1)data_test_a.shape
2)data_train.shape
3)data_train.columns
b. Use info() to get familiar with the data types;
1)data_train.info()
c. Take a quick look at the basic statistics of each feature in the dataset;
1)data_train.describe()
2)pd.concat([data_train.head(3), data_train.tail(3)])
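The calls in this section assume that data_train and data_test_a are already loaded as pandas DataFrames. A minimal sketch of that loading step is shown below. The in-memory CSV text and the loanAmnt column are hypothetical stand-ins, since the real file paths and full column list are not given in this outline; in practice pd.read_csv would be pointed at the actual train and test files.

```python
import io
import pandas as pd

# Hypothetical stand-in data; the real files and columns are not specified in this outline.
train_csv = io.StringIO("id,loanAmnt,grade\n0,35000,E\n1,18000,D\n2,12000,D\n")
test_csv = io.StringIO("id,loanAmnt,grade\n3,14000,B\n4,20000,C\n")

data_train = pd.read_csv(train_csv)
data_test_a = pd.read_csv(test_csv)

print(data_train.shape)          # (rows, columns) of the training set
print(data_test_a.shape)         # (rows, columns) of the test set
print(list(data_train.columns))  # original feature names
```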
2. Missing values and unique values:
a. Check for missing values in the data
1)print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
2)have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
3)fea_null_moreThanHalf
4)missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
b. Check for features with a single unique value
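One possible way to find single-valued features is to count the distinct values per column. The sketch below uses a hypothetical toy frame (the policyCode column and all values are made up for illustration) and is not necessarily the method the original author used.

```python
import pandas as pd

# Hypothetical toy frame: 'policyCode' is constant, the other columns are not.
data_train = pd.DataFrame({
    "policyCode": [1, 1, 1, 1],
    "grade": ["A", "B", "A", "C"],
    "term": [3, 5, 3, 3],
})

# nunique() ignores NaN by default, so a column with at most one
# distinct non-null value carries no information for separating samples.
one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
print(one_value_fea)
```

Such columns are usually candidates for removal before modeling.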
3. Digging deeper: inspecting data types
a. Categorical data
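1)def get_numerical_serial_fea(data, feas):
    # Split numerical features into continuous and discrete ones by distinct-value count.
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            # 10 or fewer distinct values: treat the feature as discrete
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea, numerical_noserial_fea
# Assumption (not in the original): numerical_fea lists the non-object columns.
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
numerical_serial_fea, numerical_noserial_fea = get_numerical_serial_fea(data_train, numerical_fea)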
b. Numerical data
Discrete numerical data
1)data_train['term'].value_counts()
Continuous numerical data
1)f = pd.melt(data_train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
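# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
g = g.map(sns.histplot, "value", kde=True, stat="density")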
4. Relationships within the data
a. Relationships between features
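A common first look at feature-to-feature relationships is a pairwise correlation matrix. The sketch below uses made-up numbers and hypothetical column names (loanAmnt, installment, dti), and Pearson correlation is an assumed choice rather than something this outline specifies.

```python
import pandas as pd

# Hypothetical toy data; 'installment' is exactly linear in 'loanAmnt'.
data_train = pd.DataFrame({
    "loanAmnt": [1000, 2000, 3000, 4000],
    "installment": [110, 210, 310, 410],
    "dti": [9.0, 3.0, 7.0, 1.0],
})

# Pearson correlation between every pair of numerical columns.
corr = data_train.select_dtypes(include="number").corr()
print(corr.round(2))
```

Pairs with a correlation close to 1 or -1 are largely redundant, which is worth knowing before feature selection.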
b. Relationships between features and the target variable
1)fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
plt.show()
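The plotting code above relies on train_loan_fr and train_loan_nofr, which this outline never defines. A plausible sketch is to split the training set on a binary target column. The column name isDefault and the toy values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data; 'isDefault' is an assumed name for the binary target column.
data_train = pd.DataFrame({
    "grade": ["A", "B", "B", "C", "A", "C"],
    "employmentLength": ["1 year", "2 years", "1 year", "10+ years", "2 years", "1 year"],
    "isDefault": [0, 1, 1, 0, 0, 1],
})

# Rows with target 1 form the "fraud" subset, rows with target 0 the "non-fraud" subset.
train_loan_fr = data_train.loc[data_train["isDefault"] == 1]
train_loan_nofr = data_train.loc[data_train["isDefault"] == 0]

# The same per-category counts that the bar charts display.
fr_counts = train_loan_fr.groupby("grade")["grade"].count()
nofr_counts = train_loan_nofr.groupby("grade")["grade"].count()
print(fr_counts.to_dict())
print(nofr_counts.to_dict())
```

Comparing the two count tables side by side shows whether a category is over-represented in either subset.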
5. Generate a data report with pandas_profiling
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")