This article does some preparation for the competition and lays out a few basics. It is aimed at beginners starting from zero; experts can skip it.
Competition URL: https://tianchi.aliyun.com/competition/entrance/231593/information
1. ROC and AUC concepts
See Wikipedia.
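As a small anchor for these concepts, here is a minimal sketch of AUC computed from its probabilistic reading: the chance that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half. The function name `auc_by_pairs` and the toy scores are made up for illustration. It compares every positive with every negative, so for real data a library routine is the more sensible choice.

```python
import numpy as np

def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum()
    ties = (diff == 0).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 2 negatives followed by 2 positives.
print(auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs are ordered correctly
```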
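2. Reading the data and a first look
Before anything else, import numpy and pandas.
(1) Read the data with pandas' read_csv(). Use the .head() method of the pandas object to look at the first 5 rows. To see the overall picture, such as how many fields there are and the type of each, use the .info() method.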
import numpy as np
import pandas as pd
dftest = pd.read_csv('data/ccf_offline_stage1_test_revised.csv')
dfoff = pd.read_csv('data/ccf_offline_stage1_train.csv')
dfon = pd.read_csv('data/ccf_online_stage1_train.csv')
dfoff.head()
Result:
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date
0 1439408 2632 NaN NaN 0.0 NaN 20160217.0
1 1439408 4663 11002.0 150:20 1.0 20160528.0 NaN
2 1439408 2632 8591.0 20:1 0.0 20160217.0 NaN
3 1439408 2632 1078.0 20:1 0.0 20160319.0 NaN
4 1439408 2632 8591.0 20:1 0.0 20160613.0 NaN
dfoff.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id float64
Discount_rate object
Distance float64
Date_received float64
Date float64
dtypes: float64(4), int64(2), object(1)
memory usage: 93.7+ MB
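- NOTE: these are the two most general-purpose methods. Almost every time you load a new dataset, .head() and .info() give you a first look at it.
(2) Next, combine this with an understanding of the business to dig slightly deeper. Given this problem and dataset, a few questions come up: first, whether the online and offline behaviour data cover the same users; second, whether the training and test sets cover the same users; third, how many customers are useful for training, which here means how many both received and used a coupon. Python's indexing and slicing tools are enough to answer questions like these.
print('Offline training users:', len(set(dfoff['User_id'])))
print('Offline training users who also have online activity:', len(set(dfoff['User_id']) & set(dfon['User_id'])))
# Date_received is numeric after read_csv, so missing values are NaN; test with notnull() instead of comparing with the string 'null'
print('Test-set users who never received a coupon:', len(set(dftest['User_id']) - set(dftest[dftest['Date_received'].notnull()].User_id)))
Result:
Offline training users: 539438
Offline training users who also have online activity: 267448
Test-set users who never received a coupon: 0
- Note: about 270,000 of the offline users (267,448) also have online activity, so online consumption preference could serve as a feature. All users in the test set have received a coupon, so the data quality is good.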
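The second question above (do the training and test sets cover the same users?) is not answered by the printed numbers. A minimal sketch of how it could be checked with the same set operations follows; the two toy DataFrames and their values are made up, and in the article's setting they would be dfoff and dftest.

```python
import pandas as pd

# Toy stand-ins for the training and test tables (values are made up).
train = pd.DataFrame({'User_id': [1, 1, 2, 3]})
test = pd.DataFrame({'User_id': [2, 3, 4]})

train_users = set(train['User_id'])
test_users = set(test['User_id'])

seen = test_users & train_users      # test users that also appear in training
unseen = test_users - train_users    # test users with no training history

print('test users:', len(test_users))
print('also in train:', len(seen))
print('not in train:', len(unseen))
```

A large `unseen` set would suggest that per-user features will be missing for many test rows.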
There is one simple piece of business logic to understand here: a customer can perform three actions (receive a coupon, use a coupon, make a purchase), and each row of the table combines these three actions.
The problem statement contains this passage:
Consumption date: if Date=null & Coupon_id != null, the record means a coupon was received but not used, i.e. a negative sample; if Date!=null & Coupon_id = null, it is the date of an ordinary purchase; if Date!=null & Coupon_id != null, it is the date a coupon was redeemed, i.e. a positive sample.
Let's count the records of each kind.
print('Coupon received but not used (negative samples):', len(dfoff[(dfoff['Date'].isnull()) & (dfoff['Coupon_id'].notnull())]))
print('No coupon, ordinary purchase:', len(dfoff[(dfoff['Date'].notnull() & dfoff['Coupon_id'].isnull())]))
print('Coupon received and used (positive samples):', len(dfoff[(dfoff['Date'].notnull() & dfoff['Coupon_id'].notnull())]))
print('No coupon and no purchase:', len(dfoff[(dfoff['Date'].isnull() & dfoff['Coupon_id'].isnull())]))
Output:
Coupon received but not used (negative samples): 977900
No coupon, ordinary purchase: 701602
Coupon received and used (positive samples): 75382
No coupon and no purchase: 0
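The four counts above can also be obtained in one pass by tagging each row with its record type. This is only a sketch on made-up rows; the column name `record_type` and the type names are my own, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) covering the three record types that occur.
df = pd.DataFrame({
    'Coupon_id': [np.nan, 11002.0, 8591.0, 1078.0],
    'Date': [20160217.0, np.nan, 20160301.0, np.nan],
})

has_coupon = df['Coupon_id'].notnull()
has_purchase = df['Date'].notnull()

# np.select picks the first matching condition for each row.
df['record_type'] = np.select(
    [has_coupon & ~has_purchase, ~has_coupon & has_purchase, has_coupon & has_purchase],
    ['coupon_unused', 'plain_purchase', 'coupon_used'],
    default='neither',
)
print(df['record_type'].value_counts())
```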
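3. Processing the Discount_rate and Distance fields
This part continues from the previous one with a first analysis of the O2O competition data, adding some necessary data processing on top of the observations so far. From here on the tables are called offline_train, online_train and offline_test; they appear to be the same data loaded earlier as dfoff, dfon and dftest.
First, use the .info() method to look at the data type of each field in the training sets.
offline_train:
offline_train.info()
Output:
<class 'pandas.core.frame.DataFrame'>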
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id float64
Discount_rate object
Distance float64
Date_received float64
Date float64
dtypes: float64(4), int64(2), object(1)
memory usage: 93.7+ MB
online_train:
online_train.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11429826 entries, 0 to 11429825
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Action int64
Coupon_id object
Discount_rate object
Date_received float64
Date float64
dtypes: float64(2), int64(3), object(2)
memory usage: 610.4+ MB
So the first job is to convert the object-typed fields into numeric ones.
(1) Processing Discount_rate
The values Discount_rate currently takes:
offline_train['Discount_rate'].unique()
Output:
array([nan, '150:20', '20:1', '200:20', '30:5', '50:10', '10:5', '100:10',
'200:30', '20:5', '30:10', '50:5', '150:10', '100:30', '200:50',
'100:50', '300:30', '50:20', '0.9', '10:1', '30:1', '0.95',
'100:5', '5:1', '100:20', '0.8', '50:1', '200:10', '300:20',
'100:1', '150:30', '300:50', '20:10', '0.85', '0.6', '150:50',
'0.75', '0.5', '200:5', '0.7', '30:20', '300:10', '0.2', '50:30',
'200:100', '150:5'], dtype=object)
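The values are a mix of decimals, ratios and nan. A ratio such as '150:20' means 20 off for a purchase of at least 150, while a decimal such as '0.9' is a direct discount rate. Checking with type() shows that the decimals and ratios are str, while nan is a float. These need to be turned into numbers without losing information, so three fields are added: discount_man (the spending threshold), discount_jian (the amount taken off) and discount_new_rate (the equivalent discount rate).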
import math
from numpy import nan as NaN
from pandas import DataFrame
from pandas import Series
import numpy as np
def get_discount_man(row):
    # Spending threshold ("man") of a 'man:jian' coupon, e.g. 150 for '150:20'.
    if isinstance(row, str) and ':' in row:
        return int(row.split(':')[0])
    elif isinstance(row, str) and '.' in row:
        return 0  # plain rate such as '0.9': no threshold
    elif not isinstance(row, str) and math.isnan(row):
        return 0  # NaN: no coupon
    else:
        print("something unexpected", row, type(row))
        return 0

def get_discount_jian(row):
    # Amount taken off ("jian") of a 'man:jian' coupon, e.g. 20 for '150:20'.
    if isinstance(row, str) and ':' in row:
        return int(row.split(':')[1])
    elif isinstance(row, str) and '.' in row:
        return 0  # plain rate: no fixed amount
    elif not isinstance(row, str) and math.isnan(row):
        return 0  # NaN: no coupon
    else:
        print("something unexpected", row, type(row))
        return 0

def get_discount_rate(row):
    # Equivalent discount rate: 1 - jian/man for ratios, the value itself for rates.
    if isinstance(row, str) and ':' in row:
        rows = row.split(':')
        man = int(rows[0])
        jian = int(rows[1])
        return 1 - float(jian) / float(man)
    elif isinstance(row, str) and '.' in row:
        return float(row)
    elif not isinstance(row, str) and math.isnan(row):
        return 1  # NaN: no coupon, so no discount
    else:
        print("something unexpected", row, type(row))
        return 0

def processData(offline):
    # Adds the three derived columns in place and returns the same DataFrame.
    offline['discount_man'] = offline['Discount_rate'].apply(get_discount_man)
    offline['discount_jian'] = offline['Discount_rate'].apply(get_discount_jian)
    offline['discount_new_rate'] = offline['Discount_rate'].apply(get_discount_rate)
    return offline

offline_train = processData(offline_train)
offline_test = processData(offline_test)
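The three helpers above parse the same string three times. As an alternative sketch (not the approach used in this article), a single hypothetical helper `parse_discount` can return all three values at once. Two deliberate differences: any string without ':' is treated as a rate, and an unexpected value raises an error instead of printing and returning 0.

```python
import math
import pandas as pd

def parse_discount(value):
    """Return (man, jian, rate) for one Discount_rate value."""
    if isinstance(value, str) and ':' in value:
        man, jian = (int(part) for part in value.split(':'))
        return man, jian, 1 - jian / man
    if isinstance(value, str):
        return 0, 0, float(value)      # already a rate such as '0.9'
    if isinstance(value, float) and math.isnan(value):
        return 0, 0, 1.0               # no coupon: treated as no discount
    raise ValueError('unexpected Discount_rate value: %r' % (value,))

# Three sample values taken from the unique() output above.
sample = pd.Series(['150:20', '0.9', float('nan')], dtype=object)
parsed = pd.DataFrame(
    sample.map(parse_discount).tolist(),
    columns=['discount_man', 'discount_jian', 'discount_new_rate'],
    index=sample.index,
)
print(parsed)
```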
After Discount_rate, Distance also needs attention: its dtype is already float, but it contains NaN values, which are replaced with -1 here.
offline_train['Distance'] = offline_train['Distance'].replace(NaN, -1.0)
At this point both Discount_rate and Distance have been processed.
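The replacement above touches only offline_train. Whether the test set contains missing distances is not shown in this article, but a model generally needs the same encoding on both sides, so here is a hedged sketch that applies the fill to both tables. The toy frames and the extra `distance_missing` flag column are my own additions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for offline_train / offline_test (made-up values).
train = pd.DataFrame({'Distance': [0.0, np.nan, 10.0]})
test = pd.DataFrame({'Distance': [np.nan, 3.0]})

for frame in (train, test):
    # Record that the distance was missing before overwriting it.
    frame['distance_missing'] = frame['Distance'].isnull().astype(int)
    frame['Distance'] = frame['Distance'].fillna(-1.0)

print(train)
print(test)
```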
4. Parsing the time data
The data has two time-related fields: the coupon receipt date, Date_received, and the purchase date, Date. First look at the distribution and format of these dates.
date_received = offline_train['Date_received'].unique()
date_received = Series(date_received)
print(date_received)
Output:
0 NaN
1 20160528.0
2 20160217.0
3 20160319.0
4 20160613.0
5 20160516.0
6 20160429.0
7 20160129.0
8 20160530.0
9 20160519.0
10 20160606.0
11 20160207.0
12 20160421.0
13 20160130.0
14 20160412.0
15 20160518.0
16 20160327.0
17 20160127.0
18 20160215.0
19 20160524.0
20 20160523.0
21 20160515.0
22 20160521.0
23 20160114.0
24 20160321.0
25 20160426.0
26 20160409.0
27 20160326.0
28 20160322.0
29 20160131.0
...
138 20160104.0
139 20160113.0
140 20160108.0
141 20160115.0
142 20160513.0
143 20160208.0
144 20160612.0
145 20160419.0
146 20160103.0
147 20160312.0
148 20160209.0
149 20160529.0
150 20160119.0
151 20160227.0
152 20160315.0
153 20160304.0
154 20160216.0
155 20160507.0
156 20160311.0
157 20160320.0
158 20160102.0
159 20160106.0
160 20160224.0
161 20160219.0
162 20160111.0
163 20160310.0
164 20160307.0
165 20160221.0
166 20160226.0
167 20160309.0
Length: 168, dtype: float64
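Apart from NaN, there are records for 167 distinct days. NaN means the customer did not receive a coupon. The time features are built in two steps: first convert Date_received to a date type (keeping NaN for missing values), then add day-of-week features.
5. Converting the time data types
Convert Date_received and Date to a date type, keeping NaN for missing values.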
from datetime import date
def getDateType(row):
    # NaN (no date) is passed through unchanged.
    if math.isnan(row):
        return row
    else:
        # 20160528.0 -> '20160528.0' -> date(2016, 5, 28)
        str_row = str(row)
        return date(int(str_row[0:4]), int(str_row[4:6]), int(str_row[6:8]))
offline_train['date_received_new'] = offline_train['Date_received'].apply(getDateType)
offline_train['date_new'] = offline_train['Date'].apply(getDateType)
offline_test['date_received_new'] = offline_test['Date_received'].apply(getDateType)
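As an alternative to the row-by-row conversion above, the same parsing can be sketched with pandas' own datetime support, which yields a datetime64 column with NaT for missing values instead of an object column mixing date and NaN. The helper name `to_datetime_column` and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def to_datetime_column(col):
    """Parse yyyymmdd numbers into datetime64, leaving missing values as NaT."""
    present = col.dropna().astype(int).astype(str)   # 20160528.0 -> '20160528'
    parsed = pd.to_datetime(present, format='%Y%m%d')
    return parsed.reindex(col.index)                 # missing rows come back as NaT

received = pd.Series([np.nan, 20160528.0, 20160217.0])
print(to_datetime_column(received))
```

Note that code further down which checks `type(...) == date` or `type(...) != float` would need adjusting if this representation were used instead.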
6. Adding day-of-week features
Add a weekday-or-weekend feature (weekday_type: {0,1}) and a day-of-week feature (weekday: {1~7}).
First the day-of-week feature:
def getWeekday(row):
    # date.weekday() runs from 0 (Monday) to 6 (Sunday); shift it to 1..7.
    if type(row) != float:
        return row.weekday() + 1
    else:
        return row  # NaN stays NaN
offline_train['weekday_received'] = offline_train['date_received_new'].apply(getWeekday)
offline_test['weekday_received'] = offline_test['date_received_new'].apply(getWeekday)
offline_train['weekday_buy'] = offline_train['date_new'].apply(getWeekday)
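Then the weekday-or-weekend feature (1 for Saturday or Sunday, 0 otherwise; rows with no date also end up as 0):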
offline_train['weekday_type_received'] = offline_train['weekday_received'].apply(lambda x : 1 if x == 6 or x == 7 else 0)
offline_train['weekday_type_buy'] = offline_train['weekday_buy'].apply(lambda x : 1 if x == 6 or x == 7 else 0)
offline_test['weekday_type_received'] = offline_test['weekday_received'].apply(lambda x : 1 if x == 6 or x == 7 else 0)
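If the dates are held as a datetime64 column rather than as Python date objects (an assumption; the article keeps date objects), both features can be sketched without apply. The sample dates are made up; 2016-05-28 was a Saturday and 2016-05-30 a Monday.

```python
import pandas as pd

# Two made-up receipt dates plus one missing value.
received = pd.Series(pd.to_datetime(['2016-05-28', '2016-05-30', None]))

weekday = received.dt.dayofweek + 1            # Monday=1 ... Sunday=7, NaN if missing
is_weekend = weekday.isin([6, 7]).astype(int)  # missing dates map to 0, like the lambda above

print(pd.DataFrame({'weekday': weekday, 'is_weekend': is_weekend}))
```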
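Next, convert the day-of-week feature to one-hot format. For what one-hot means, the following explanation is concise.
The basic idea of one-hot: treat each value of a discrete feature as a state. If the feature has N distinct values, it can be abstracted into N different states, and one-hot encoding guarantees that each value activates exactly one state: of the N state bits, only one is 1 and all the others are 0. For example, take education level with five categories (primary school, secondary school, university, master's, doctorate): one-hot encoding gives each category a five-bit code with a single 1.
Author: 古怪地區
Link: http://www.reibang.com/p/5f8782bf15b1
Source: Jianshu
Copyright belongs to the author. For reproduction in any form, contact the author for permission and credit the source.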
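The education example from the quote can be reproduced in a few lines. This is a sketch with made-up people; the English category names are my own rendering of the five levels.

```python
import pandas as pd

levels = ['primary', 'secondary', 'university', 'master', 'doctorate']

# A categorical with explicit categories keeps one column per level, in this order,
# even for levels that do not occur in the sample.
people = pd.Series(pd.Categorical(['master', 'primary', 'doctorate'], categories=levels))

one_hot = pd.get_dummies(people).astype(int)
print(one_hot)
```

Each row has exactly one 1, in the column of that person's level.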
Converting to one-hot uses a pandas function, get_dummies(): just pass in the field to convert. To keep the data readable, the columns of the result are renamed after get_dummies.
weekdaycols = [ 'weekday_' + str(i) for i in [1, 2, 3, 4, 5, 6, 7]]
data_weekday = pd.get_dummies(offline_train['weekday_received'])
data_weekday.columns = weekdaycols
offline_train[weekdaycols] = data_weekday
data_weekday = pd.get_dummies(offline_test['weekday_received'])
data_weekday.columns = weekdaycols
offline_test[weekdaycols] = data_weekday
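Renaming the columns by position, as above, relies on all seven weekdays being present in the column. A more defensive sketch pins the columns to 1..7 explicitly, so a weekday that never occurs still gets an all-zero column and rows with a missing weekday get zeros everywhere. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up weekday values; 3, 4, 5 and 7 never occur and one row is missing.
weekday = pd.Series([6.0, 1.0, np.nan, 2.0, 6.0])

present = weekday.dropna().astype(int)
dummies = pd.get_dummies(present).astype(int)
# Force rows back to the original index and columns to 1..7, filling gaps with 0.
dummies = dummies.reindex(index=weekday.index, columns=range(1, 8), fill_value=0)
dummies.columns = ['weekday_' + str(i) for i in range(1, 8)]
print(dummies)
```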
7. Labeling the data
The data is split into three classes:
- Coupon received and used within 15 days, i.e. Date_received != null and Date - Date_received <= 15: y = 1
- No coupon received, i.e. Date_received == null: y = -1
- Everything else, i.e. coupon received but not used within 15 days (never used, or used later): y = 0
def getLabel(row):
    if type(row['date_received_new']) == float and math.isnan(row['date_received_new']):
        return -1  # no coupon received
    elif type(row['date_new']) == date and row['date_new'] - row['date_received_new'] <= pd.Timedelta(15, 'D'):
        return 1   # coupon used within 15 days
    else:
        return 0   # coupon received but not used within 15 days
label = offline_train.apply(getLabel, axis = 1)
offline_train['label'] = label
print(offline_train['label'].value_counts())
Output:
0 988887
-1 701602
1 64395
Name: label, dtype: int64
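Row-wise apply over 1.75 million rows can be slow. Assuming both dates are stored as datetime64 columns (not the date objects used in this article), the same three rules can be sketched with boolean masks. The four rows below are made up to hit each case.

```python
import pandas as pd

# Made-up rows: no coupon, used after 5 days, used after 20 days, never used.
df = pd.DataFrame({
    'date_received_new': pd.to_datetime([None, '2016-05-01', '2016-05-01', '2016-05-01']),
    'date_new': pd.to_datetime(['2016-05-03', '2016-05-06', '2016-05-21', None]),
})

no_coupon = df['date_received_new'].isnull()
gap = df['date_new'] - df['date_received_new']   # NaT when either date is missing
used_in_time = gap <= pd.Timedelta(15, 'D')      # comparisons with NaT are False

df['label'] = 0                                  # default: received but not used in time
df.loc[used_in_time, 'label'] = 1
df.loc[no_coupon, 'label'] = -1
print(df['label'].tolist())
```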
That completes the data processing. Here is what the training data now looks like:
print('Existing columns', offline_train.columns.tolist())
Existing columns ['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance', 'Date_received', 'Date', 'discount_man', 'discount_jian', 'discount_new_rate', 'date_received_new', 'date_new', 'weekday', 'weekday_received', 'weekday_buy', 'weekday_type_received', 'weekday_type_buy', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7', 'label']
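Python statements worth remembering from this section:
1. !ls data #list the file names under this path
2. pd.read_csv(path) #read the csv file at the given path
3. info() #show summary information about a pandas object
4. set(x) - set(y) #turn two fields into sets and take the difference (in x but not in y)
5. isinstance(data, type) #check whether data is of the given type; returns a bool
6. math.isnan(data) #check whether a float is NaN; returns a bool
7. data.apply(function) #apply function to each element of data; returns the results
8. data.replace(a, b) #replace a with b in data
9. date.weekday()+1 #day of the week for a date (1 = Monday)
10. pd.get_dummies(data) #return the one-hot form of data
11. date1-date2 <= pd.Timedelta(15, 'D') #check whether two dates are at most 15 days apart
12. Series.value_counts() #count the occurrences of each value
Lessons from this section
1. After getting the data, first understand and inspect it: learn the meaning and data type of every field.
2. Compare the test set with the training set to make sure the entities in the test set also appear (at least mostly) in the training set.
3. Analyse and clean every column of the data you receive, converting values that cannot be compared (such as missing values or inconsistently formatted values).
4. When analysing consumption data, consider day-of-week and weekday/weekend features.
Reference blog: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.3.292844f2tqQhXQ&postId=4796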