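Task 1 - Data Analysis (2 days)
Description: This dataset contains financial data (not raw data; it has already been preprocessed). The goal is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.
Requirement: split the data 70/30, with 30% for the test set and 70% for the training set, and set the random seed to 2018.
Task 1: Explore and analyze the data. Time: 2 days
- Analyze the data types
- Remove irrelevant features
- Convert data types
- Handle missing values
- ... plus any other analysis and preprocessing you can think of or borrow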
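The first, second and fourth items in the list can be started from a single per-column summary. The following is a minimal sketch rather than part of the original solution: `summarize_columns` is a hypothetical helper, and the toy frame with its column names is made up to stand in for the real table.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        # dropna=False counts NaN as its own value, so a column holding
        # one value plus NaN is not reported as constant.
        "n_unique": df.nunique(dropna=False),
    })


# Toy frame standing in for the real data; the column names are invented.
toy = pd.DataFrame({
    "amount": [1.0, 2.5, np.nan, 4.0],
    "city": ["A", "B", "B", None],
    "source": ["xs", "xs", "xs", "xs"],
    "status": [0, 1, 0, 0],
})

summary = summarize_columns(toy)

# Columns with a single distinct value carry no information for a model.
constant_cols = summary.index[summary["n_unique"] == 1].tolist()

# Non-numeric columns are the candidates for type conversion or removal.
non_numeric_cols = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]

print(summary)
print("constant:", constant_cols)
print("non-numeric:", non_numeric_cols)
```

Constant columns are candidates for removal, and non-numeric columns need either encoding or removal before most models can use them.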
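My results
Main steps:
1. Remove duplicate rows;
2. Remove irrelevant features: drop columns that carry no useful information and columns whose values are all identical;
3. Convert data types: use pandas one-hot encoding to turn the categorical object column into int columns;
4. Handle missing values: fill with 0 or 1 where a missing value has a specific meaning, otherwise fill with the mode;
5. Split the data: 30% test set, 70% training set, random seed 2018;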
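One detail of step 3 is easy to check on a toy frame: `pd.get_dummies` does not create an indicator for missing values by default, so a row whose category is NaN ends up with zeros in every indicator column. The sketch below assumes made-up category values (`tier1`, `tier2`); only the column name `reg_preference_for_trad` comes from the script.

```python
import numpy as np
import pandas as pd

# Toy frame; the category values are invented for illustration.
toy = pd.DataFrame({
    "reg_preference_for_trad": ["tier1", "tier2", np.nan, "tier1"],
    "status": [0, 1, 0, 1],
})

# dtype=int requests integer indicators instead of booleans.
encoded = pd.get_dummies(toy, columns=["reg_preference_for_trad"], dtype=int)

dummy_cols = [
    col for col in encoded.columns
    if col.startswith("reg_preference_for_trad_")
]

# Sum of the indicator columns for the row whose category was NaN.
missing_row_total = int(encoded.loc[2, dummy_cols].sum())

print(encoded)
print("indicator sum for the NaN row:", missing_row_total)
```

If missing categories should be visible to the model, passing `dummy_na=True` to `pd.get_dummies` adds a separate indicator column for them.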
#!/usr/bin/python
# -*- coding:utf-8 -*-
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv('data.csv', encoding='gbk')
# 1. Remove duplicate rows
data_clean = data.drop_duplicates()
# 2. Remove irrelevant features: drop uninformative columns and constant columns
drop_columns = ['Unnamed: 0', 'trade_no', 'id_name', 'bank_card_no',
                'query_org_count', 'query_finance_count', 'query_cash_count',
                'latest_query_time']
for data_col in data_clean.columns:
    # a column with a single distinct value (NaN counted as a value) is constant
    if data_clean[data_col].nunique(dropna=False) == 1 and data_col not in drop_columns:
        drop_columns.append(data_col)
data_clean = data_clean.drop(drop_columns, axis=1)
# 3. Convert data types: one-hot encode the categorical object column into int columns
# dtype=int keeps the indicators as integers (newer pandas versions default to bool)
data_clean = pd.get_dummies(data_clean, columns=['reg_preference_for_trad'], dtype=int)
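# 4. Handle missing values: fill with 0 or 1 where missing has a meaning, otherwise the mode
# a missing student_feature is treated as meaningful and filled with 0
data_clean['student_feature'] = data_clean['student_feature'].fillna(0)
for data_col in data_clean.columns:
    # value_counts() sorts by frequency, so index[0] is the most frequent value
    fill_value = data_clean[data_col].value_counts().index[0]
    # assign back instead of inplace=True on a column selection, which may not update the frame
    data_clean[data_col] = data_clean[data_col].fillna(fill_value)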
# 5. Split the data: 30% test set, 70% training set, random seed 2018
train_data, test_data = train_test_split(data_clean, test_size=0.3, random_state=2018)
train_data.to_csv('training.csv', index=False, header=True)
test_data.to_csv('testing.csv', index=False, header=True)
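The script above computes the fill values over the whole table before splitting, so the test rows take part in choosing the modes. A possible alternative, shown here only as a sketch on an invented toy frame, is to split first and derive the modes from the training rows alone. This is a swapped-in variant, not the method used in the script.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; column names and values are invented for illustration.
toy = pd.DataFrame({
    "x1": [1.0, 1.0, np.nan, 2.0, 1.0, 3.0, np.nan, 1.0, 2.0, 1.0],
    "x2": [0, 1, 1, 1, 0, 1, 1, 0, 1, 1],
    "status": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Same split settings as the script: 30% test, random seed 2018.
train, test = train_test_split(toy, test_size=0.3, random_state=2018)

# Most frequent value of each column, computed from the training rows only.
# DataFrame.mode() can return several rows on ties; iloc[0] takes the first.
train_modes = train.mode(dropna=True).iloc[0]

# Apply the training-set modes to both parts.
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)

print(train_modes)
print(train_filled.isna().sum().sum(), test_filled.isna().sum().sum())
```

On ties, `mode()` returns the smallest of the tied values first, which can differ from `value_counts().index[0]`.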