This post uses the Tianchi beginners' offline contest for hands-on practice. Between limited time and compute, and more importantly limited experience, this attempt still has plenty of problems: I don't yet know the traditional machine-learning algorithms, there is a lot in deep learning I'm not familiar with, and my data preprocessing is still clumsy.
Original post by 拎著激光炮的野人. Reposts are welcome; please credit the author and link to the original:
http://www.reibang.com/p/ef1fc958e30f
Problem analysis
Reading the problem carefully, this is an O2O-style prediction task: the items mostly come from the service industry, bought online and consumed offline, so location matters a great deal. To keep things simple, we assume that different item categories are not substitutable for or correlated with one another, and that a user's geo location is strongly correlated with the geo location of the items they act on.
1. Import the data
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math
from sklearn.metrics import f1_score
idx = pd.IndexSlice
# read the items table
items = pd.read_csv("./tianchi_fresh_comp_train_item.csv")
print("read items", items.count()[0])
actions = pd.read_csv("./tianchi_fresh_comp_train_user.csv")
print("user action read, total:", actions.count()[0])
read items 620918
user action read, total: 23291027
# Read and transform the actions table (all user behaviours)
# TODO: ignore all geo information for now
def prepare_data(actions, items):
#convert time
actions.time = pd.to_datetime(actions.time)
#index user
user_index = actions.user_id.drop_duplicates()
user_index = user_index.reset_index(drop=True).reset_index().set_index("user_id")
user_index.columns = ['user']
actions = pd.merge(actions, user_index, left_on='user_id', right_index=True, how='left')
#index item
item_ids = actions.item_id.drop_duplicates()
item_ids = item_ids.reset_index(drop=True).reset_index().set_index("item_id")
item_ids.columns = ['item']
actions = pd.merge(actions, item_ids, left_on='item_id', right_index=True, how='left')
items = pd.merge(items, item_ids, left_on='item_id', right_index=True, how='left')
# index category
category = actions.item_category.drop_duplicates()
category = category.reset_index(drop=True).reset_index().set_index("item_category")
category.columns = ['category']
actions = pd.merge(actions, category, left_on='item_category', right_index=True, how='left')
#drop the raw id columns
actions = actions.drop(['user_id', 'item_id', 'item_category'], axis=1)
items = items.drop(['item_id', 'item_category'], axis=1)
#reorder columns
actions = actions.loc[:, ['user', 'item', 'behavior_type', 'category', 'time', 'user_geohash']]
#add date and hour
actions['date'] = actions.time.dt.date
actions['hour'] = actions.time.dt.hour
return actions, items, user_index, item_ids, category
# actions, items = prepare_data(actions, items)
actions, items, user_index, item_ids, _ = prepare_data(actions, items)
actions.head()
items.head()
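As an aside, the drop_duplicates / reset_index / set_index dance inside prepare_data is one way to build dense integer ids; pd.factorize does the same thing in one call. A minimal sketch on the raw frame, before prepare_data runs (the variable names here are mine):
# codes are dense integers 0..n-1; uniques maps them back to the raw ids
codes, uniques = pd.factorize(actions.user_id)
actions['user'] = codes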
2. Explore the data
geo = pd.concat([items.item_geohash, actions.user_geohash]).drop_duplicates()
item_geo = items.item_geohash.drop_duplicates().dropna()
print("商品的geo去重后總數(shù)的統(tǒng)計(jì)", item_geo.count())
action_geo = actions.user_geohash.drop_duplicates().dropna()
print("用戶(hù)行為的geo去重后總數(shù)的統(tǒng)計(jì)",action_geo.count())
print("商品與用戶(hù)行為的geo去重后總數(shù)的統(tǒng)計(jì):\n",
"交集 / 用戶(hù)行為geo:",
len(action_geo[action_geo.isin(item_geo)]) / len(action_geo),
"\n交集 / 商品geo:",
len(item_geo[item_geo.isin(action_geo)]) / len(item_geo)
)
del item_geo
del action_geo
#Only ~2.5% of user-action geohashes ever appear among item geohashes, while ~45% of
#item geohashes appear in user actions, so user and item geohashes mostly do not match directly
distinct item geohashes: 57358
distinct user-action geohashes: 1018981
overlap between item and user-action geohashes:
 intersection / user-action geohashes: 0.025223237724746585 
 intersection / item geohashes: 0.44809791136371563
ag = actions.loc[:, ['user', 'user_geohash']].dropna()
print("用戶(hù)行為帶有g(shù)eohash的數(shù)量", len(ag))
ag = ag.drop_duplicates()
print("用戶(hù)行為帶有g(shù)eohash的數(shù)量(去重后)", len(ag))
ag['c'] = 1
ag = ag.loc[:, ['user', 'c']].groupby('user').sum()
print(ag.describe())
del ag
#From describe() below: a user with geohash data has 42 distinct geohashes at the
#25th percentile and 68 at the median, i.e. most users' geohash changes frequently.
#Users sit at many different geohashes at different times, so the geohash is fairly
#fine-grained and may lie some distance away from any given item geohash.
#Worth exploring: do two geohashes logged closer together in time imply a smaller
#physical distance? (see the sketch after the describe() output below)
user actions carrying a geohash: 7380017
user actions carrying a geohash (deduplicated): 1257674
c
count 16240.000000
mean 77.442980
std 53.782759
min 1.000000
25% 42.000000
50% 68.000000
75% 103.000000
max 709.000000
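A cheap way to test that hypothesis: geohash is hierarchical, so the length of the common prefix of two geohashes is a rough proxy for distance. A minimal sketch over consecutive actions per user; common_prefix_len is my own helper, and the whole thing is slow on the full 7M geohashed rows, so sample first if needed:
def common_prefix_len(a, b):
    # longer shared prefix means a smaller shared geohash cell
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
g = actions.loc[actions.user_geohash.notna(), ['user', 'time', 'user_geohash']].sort_values(['user', 'time'])
g['prev'] = g.groupby('user').user_geohash.shift(1)
g['gap_s'] = g.groupby('user').time.diff().dt.total_seconds()
g = g.dropna(subset=['prev', 'gap_s'])
g['prefix'] = [common_prefix_len(a, b) for a, b in zip(g.user_geohash, g.prev)]
# if the hypothesis holds, mean prefix length should shrink as the time gap grows
print(g.groupby(pd.cut(g.gap_s, [0, 3600, 86400, 1e9])).prefix.mean())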
df = actions[actions.user_geohash.notna()]
print("購(gòu)買(mǎi)的時(shí)候素征, 有g(shù)eo信息的行為數(shù)量", len(df), "占全部行為的", len(df[df.user_geohash.isin(items.item_geohash)]) / len(df))
del df
actions carrying geo info: 7380017 fraction whose geohash matches some item geohash: 0.03044234179948366
3. Feature extraction
First, decide which features to extract. They should capture user, item, category, and location characteristics (the feature blocks below are all built with one pandas idiom; see the sketch right after this list):
- User: total action counts, plus something that captures purchase preferences, e.g. an affinity for certain categories?
- Item / category: how many distinct users acted on it; total action counts across all users
- Category: how many distinct items it contains
- Temporal versions of the features above?
- Geographic versions of the features above
- Cross features, e.g. a given user especially likes a given item
- Time-dependent features: a user's action counts on one day, used to predict whether they buy the next day
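All the feature blocks in 3.1 to 3.3 below use the same pandas idiom: group by an entity plus behavior_type, count, then unstack behavior_type into columns. A toy illustration of the pattern (the data here is invented):
toy = pd.DataFrame({'user': [0, 0, 1], 'item': [5, 6, 5], 'behavior_type': [1, 4, 1]})
# one row per user, one column per behaviour type, counts as values
print(toy.groupby(['user', 'behavior_type'])[['item']].count()
         .unstack().fillna(0).astype(int))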
3.0 Snapshot the data
saved_actions = actions
print(len(actions))
actions.head()
23291027
# actions = saved_actions  # restore actions
print("共計(jì): {}條交易記錄".format(actions.user.max()))
共計(jì): 19999條交易記錄
# # Restrict feature extraction to a subset of users to cut memory usage;
# # the full set is painfully slow. Remove this later.
# actions = actions.set_index("user").loc[:10000, :]
# actions = actions.reset_index()
# print(actions.user.max())
# actions.head()
3.1 User features
#per-user action counts, split by behaviour type
user = actions.groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
user.rename(columns={'item': 'c'}, level=0, inplace=True)
user.head()
# number of distinct items per user and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
    .groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()
#number of distinct categories per user and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
    .groupby(['user', 'behavior_type'])[['category']].count().unstack().fillna(0).astype(int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()
user = pd.DataFrame(user.values, index=user.index, columns=["u{}".format(i) for i in range(0, 12, 1)])
user.head()
# scale each column by mean + 3·std so typical values land in [0, 1]
user = user / (user.mean() + user.std() * 3)
user.head()
# user.to_csv("user.csv")
# del user
3.2 Item features
#per-item action counts, split by behaviour type
good = actions.groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
good.rename(columns={'user': 'c'}, level=0, inplace=True)
good.head()
#number of distinct users per item and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
    .groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
good = good.merge(c, left_index=True, right_index=True, how='left')
good.head()
good = pd.DataFrame(good.values, index=good.index, columns=["g{}".format(i) for i in range(0, 8, 1)])
good.head()
good = good / (good.mean() + good.std() * 3)
good.head()
# good.to_csv("good.csv")
# del good
3.3 Category features
#per-category action counts, split by behaviour type
cat = actions.groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
cat.rename(columns={'user': 'c'}, level=0, inplace=True)
cat.head()
#number of distinct users per category and behaviour type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
    .groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()
#number of distinct items per category and behaviour type
c = actions.drop_duplicates(['item', 'behavior_type', 'category']) \
    .groupby(['category', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()
cat = pd.DataFrame(cat.values, index=cat.index, columns=["c{}".format(i) for i in range(0, 12, 1)])
cat.head()
cat = cat / (cat.mean() + cat.std() * 3)
cat.head()
# cat.to_csv('cat.csv')
# del cat
del c
3.4 Temporal features
3.5 Geographic features
3.6 Cross features
3.7 Actions within 24 hours
def read_user():
    return pd.read_csv("user.csv", index_col=0)
def read_good():
return pd.read_csv("good.csv", index_col=0)
def read_cat():
return pd.read_csv('cat.csv', index_col=0)
def read_label():
return pd.read_csv("label.csv", index_col=0)
# Label: does the user buy the item the next day?
# Purchase dates are shifted back one day so that day-D features line up with
# day-(D+1) purchases when merged on date.
label = actions[actions.behavior_type == 4].copy()
label.date = (pd.to_datetime(label.date) - np.timedelta64(1, 'D'))
# label.date = label.date.dt.date
print(label.date.dtypes)
label['buy'] = 1
# label = label.loc[:, ['date', 'user','category','item','buy']].groupby(['date', 'user','category','item']).sum()
label = label.set_index(['date', 'user']).loc[:, ['item', 'category', 'buy']].drop_duplicates()
label.set_index(['category','item',], append=True, inplace=True)
label.head()
datetime64[ns]
# label.to_csv("label.csv")
# del label
# read_label().head()
# daily action counts for each (user, category, item), split by behaviour type
d_action = actions.copy()
d_action['d'] = 1
d_action.date = pd.to_datetime(d_action.date)
d_action = d_action.groupby([ 'date', 'user', 'category', 'item', 'behavior_type']).sum()[['d']]
d_action = d_action / (d_action.mean() + d_action.std() * 3)
d_action = d_action.unstack().fillna(0).astype(np.float32)
d_action.columns = d_action.columns.droplevel(0)
d_action.columns = ['d_t{}'.format(i) for i in range(1, 5, 1)]
d_action.head()
# d_action.to_csv('d_action.csv')
# pd.read_csv('d_action.csv', index_col=0).dtypes
#per-user actions in the last few hours of each day
x_action = actions.copy()
x_action['c'] = 1
x_action.date = pd.to_datetime(x_action.date).dt.date
#the data is too large, so only keep the last 3 hours of each day
x_action = x_action.loc[x_action.hour.isin([23, 22, 21])]
x_action.date = pd.to_datetime(x_action.date)
x_action = x_action.groupby([ 'date', 'user', 'category', 'item', 'hour', 'behavior_type']).sum()
x_action = x_action.unstack()
x_action = x_action / (x_action.mean() + x_action.std() * 3)
x_action = x_action.stack().astype(np.float32)
x_action.head()
x_action = x_action.unstack(['hour', 'behavior_type'], fill_value=0).sort_index(axis=1)
x_action.columns = x_action.columns.droplevel(0)
# print(x_action.describe())
#Rebuild the columns explicitly so the frame always has the full 3 hours × 4
#behaviour types = 12 columns, even when some (hour, behaviour) pairs are missing
#from the data. (Level order matches the hour-major sort above.)
x_action = pd.DataFrame(x_action.values, index=x_action.index, columns=pd.MultiIndex.from_product([range(21, 24, 1), range(1, 5, 1)], names=['hour', 'behavior_type']))
x_action = x_action.fillna(0)
# x_action.info()
# x_action[:, :] = x_action[:, :].astype(np.int8)
# x_action.info()
# x_action = x_action.apply(lambda x: x.astype(np.int32))
x_action = pd.DataFrame(x_action.values, index=x_action.index, columns = ["h{}_{}".format(h, t) for h in range(21, 24, 1) for t in [1, 2, 3, 4]])
# print(x_action.describe())
x_action.head()
x_action = d_action.merge(x_action, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.head()
# Merge x and y. how='left' drops items that were bought without any prior action.
# Of course it also drops the case where I browsed n cheap items and then bought a
# hit item from that category that I had never viewed.
# TODO: handle this properly later
x_action = x_action.merge(label, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.buy = x_action.buy.astype(np.int8)
x_action.head()
#attach the user, item, and category features to each (date, user, category, item) row
x_action.reset_index(inplace=True)
x_action = user.merge(x_action, right_on='user', left_index=True, how='right')
x_action = good.merge(x_action, right_on='item', left_index=True, how='right')
x_action = cat.merge(x_action, right_on='category', left_index=True, how='right')
x_action.set_index(['date', 'user', 'category', 'item'], inplace=True)
x_action.head()
#x_action.to_csv("x_action.csv")  # far too slow to write at this size; how to fix? (see the note below)
#pd.read_csv("x_action.csv").head()
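One answer to the slow-CSV question: a binary columnar format writes far faster than CSV and preserves dtypes. A sketch, assuming pyarrow (or fastparquet) is installed; the file name is my own:
x_action.to_parquet("x_action.parquet")
x_action = pd.read_parquet("x_action.parquet")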
3.8 Directions for improvement
3.8.1 Consider adding a noise layer; otherwise a user who viewed once and bought once may be memorized by the network as a guaranteed buyer (see the sketch after this list)
3.8.2 How to train by group? After all, a user typically browses one category of items and then picks a single item from it to buy
3.8.3 Is the current "normalization" (dividing by mean + 3·std) reasonable? Is there a better or more standard preprocessing, or would plain standardization work better?
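For 3.8.1, Keras ships a layer for exactly this. A minimal sketch of how it could be wired into the model defined below; the stddev of 0.1 is a guess to be tuned, not something tested here:
from keras.layers import GaussianNoise
# perturbs the inputs during training only, as a regularizer against memorization
x = GaussianNoise(0.1)(inputs)
x = Dense(256)(x)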
from keras.layers import Dense, LSTM, Dropout
from keras.models import Model, Input
from sklearn.model_selection import train_test_split
# note: .loc[:'2014-12-18'] is inclusive, so the last (unlabeled) day is also
# trained on with buy=0; slicing to '2014-12-17' may be what was intended
x_train, x_test, y_train, y_test = train_test_split(x_action.loc[:'2014-12-18'].values[:, :-1], x_action.values[:, -1], test_size=0.1)
x_train, y_train
(array([[0.03628967, 0.03287605, 0.04344553, ..., 0. , 0. ,
0. ],
[2.19627494, 2.48323743, 2.35556257, ..., 0. , 0. ,
0. ],
[0.12262843, 0.05917688, 0.07467201, ..., 0. , 0. ,
0. ],
...,
[0.08175772, 0.04602647, 0.16020541, ..., 0. , 0. ,
0. ],
[9.42243503, 9.34921724, 9.17379612, ..., 0. , 0. ,
0. ],
[2.09024259, 1.75996439, 2.44856316, ..., 0. , 0. ,
0. ]]), array([0., 0., 0., ..., 0., 0., 0.]))
x_train.shape # (17482, 128) for test, (9014805, 48) for all
(9014805, 48)
inputs = Input(shape=(x_train.shape[1], ))
x = Dense(256)(inputs)
x = Dropout(0.2)(x)
x = Dense(128)(x)
x = Dropout(0.2)(x)
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer='rmsprop',metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=64, epochs=50, validation_data=[x_test, y_test])
# Results with 10000 users:
# Epoch 11/50
# 4065324/4065324 [==============================] - 364s 90us/step - loss: 0.0219 - acc: 0.9963 - val_loss: 0.0196 - val_acc: 0.9968
Train on 9014805 samples, validate on 1001646 samples
Epoch 1/50
9014805/9014805 [==============================] - 796s 88us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0293 - val_acc: 0.9959
Epoch 2/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0215 - val_acc: 0.9968
Epoch 3/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0208 - val_acc: 0.9968
Epoch 4/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0253 - val_acc: 0.9966
Epoch 5/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 6/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0258 - val_acc: 0.9966
Epoch 7/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0235 - val_acc: 0.9969
Epoch 8/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0216 - val_acc: 0.9968
Epoch 9/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0242 - val_acc: 0.9964
Epoch 10/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0273 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9969
Epoch 11/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0234 - val_acc: 0.9967
Epoch 12/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0230 - val_acc: 0.9968
Epoch 13/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0236 - val_acc: 0.9969
Epoch 14/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9967
Epoch 15/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 16/50
5038976/9014805 [===============>..............] - ETA: 5:37 - loss: 0.0271 - acc: 0.9960
............................
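A side note on the architecture: Dense(256) and Dense(128) above have no activation, so everything before the final sigmoid collapses into a single linear map, and the model is effectively logistic regression on the 48 features. That may partly explain why the loss barely moves after epoch 1. A variant worth trying (my suggestion, untested here):
x = Dense(256, activation='relu')(inputs)
x = Dropout(0.2)(x)
x = Dense(128, activation='relu')(x)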
y_predict = model.predict(x_test)
y_predict
array([[1.0223342e-03],
[5.4835586e-10],
[1.0230833e-03],
...,
[5.3039176e-04],
[1.6522235e-03],
[1.2059750e-03]], dtype=float32)
import matplotlib.pyplot as plt
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
#find the threshold with the highest F1 score
def get_f1_by(true, predict, n):
    predict = predict.reshape(-1)  # was y_predict: accidentally read a global instead of the argument
    return f1_score(true, np.where(predict >= n, np.ones_like(predict), np.zeros_like(predict)))
def cal_f1_score(true, predict):
result = area.apply(lambda i : get_f1_by(true, predict, i))
return result
area = pd.Series(np.arange(1e-9, 0.9, 0.05))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
#0.043000000000000024 0.15744941753525443
when n = 0.050000001 best result is 0.061109622085231845
area = pd.Series(np.arange(1e-9, 0.1, 0.001))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
#0.043000000000000024 0.15744941753525443
when n = 0.016000001 best result is 0.19509536784741144
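An alternative to the manual grid: sklearn can enumerate every distinct threshold in one call. A sketch using precision_recall_curve:
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_predict.reshape(-1))
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # small epsilon avoids 0/0
print('best threshold:', thresholds[f1[:-1].argmax()], 'best F1:', f1[:-1].max())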
#The finer sweep puts the best F1 at a threshold of about 0.016; we use that to predict the last day
y_predict = model.predict(x_action.loc['2014-12-18'].iloc[:, :-1])
y_predict = np.where(y_predict > 0.016, np.ones_like(y_predict), np.zeros_like(y_predict))
len(y_predict[y_predict==1])
1154
result = x_action.loc['2014-12-18'].copy()
result.buy = y_predict
result = result.loc[result.buy > 0, 'buy'].reset_index().loc[:, ['user', 'item']]
result.head()
user_index.head()
item_ids.head()
result = result.merge(item_ids.reset_index(), left_on='item', right_on='item', how='left') \
.merge(user_index.reset_index(), left_on='user', right_on='user', how='left').loc[:, ['user_id', 'item_id']]
result.head()
result.to_csv('tianchi_mobile_recommendation_predict.csv', index=False)  # index=False: the submission should contain only the user_id and item_id columns
At this point we have a result. It has been submitted to Tianchi, but the score has not come back yet; I will update this post when it does.
My original plan was to build the network with an LSTM and embeddings; building it the "traditional" way first, as a baseline for later networks, already took a very long time. My fundamentals clearly need work, so I plan to keep learning while practising. If you know a better approach, or spot things I did poorly, please don't hold back!