Today I read a nice tutorial article on collaborative filtering. It mainly introduces the following two algorithms:
Implementing your own recommender systems in Python
- Memory-Based Collaborative Filtering
- Model-Based Collaborative Filtering
After working through the code myself, I found there is also a Chinese translation of the article. The meaning is basically the same, though it still reads slightly machine-translated:
在Python中實現(xiàn)你自己的推薦系統(tǒng)
Memory-based collaborative filtering comes in two flavors: user-item filtering and item-item filtering. In the words of the original article:
- Item-Item Collaborative Filtering: “Users who liked this item also liked …”
- User-Item Collaborative Filtering: “Users who are similar to you also liked …”
That description should be straightforward enough.
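To make the two flavors concrete, here is a tiny toy sketch of my own (not from the article): given a hand-made similarity matrix, item-item CF looks up the items most similar to one the user liked, while user-item CF looks up the users most similar to the target user. The matrices and names here are purely illustrative.
import numpy as np
# toy similarity matrices (higher = more similar); rows/columns are item or user indices
item_sim = np.array([[1.0, 0.9, 0.1],
                     [0.9, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])
user_sim = np.array([[1.0, 0.8, 0.3],
                     [0.8, 1.0, 0.4],
                     [0.3, 0.4, 1.0]])
# item-item CF: "users who liked this item also liked ..." -> items most similar to item 0
print(np.argsort(-item_sim[0])[1:])  # [1 2]
# user-item CF: "users who are similar to you also liked ..." -> users most similar to user 0
print(np.argsort(-user_sim[0])[1:])  # [1 2]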
Let's go straight to the code. The original appears to have been written for Python 2, and porting it to a Python 3 environment requires only minor changes.
Reading the data
import numpy as np
import pandas as pd
# read in the MovieLens 100k ratings file
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('D:/PythonSource/ml-100k/u.data', sep='\t', names=header)
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print('Number of users = ' + str(n_users) + ', number of items = ' + str(n_items))
Number of users = 943, number of items = 1682
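Before splitting, it does not hurt to peek at the raw data; a quick sanity check of my own (u.data has one row per rating event, and MovieLens ratings are integers from 1 to 5):
# preview the first few rating events and the rating distribution
print(df.head())
print(df.rating.describe())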
Splitting the dataset
# train_test_split lives in sklearn.model_selection in current scikit-learn
# (the original used the old sklearn.cross_validation module, removed in 0.20)
from sklearn.model_selection import train_test_split
# split the data: 75% for training, 25% for testing
train_data, test_data = train_test_split(df, test_size=0.25)
# create 2 user-item matrices, one for training and one for testing
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    # user and item ids in the data start from 1
    train_data_matrix[line[1]-1, line[2]-1] = line[3]
test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]
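As a side note, the same training matrix can also be built with a pandas pivot table instead of the explicit loop; this is just an alternative sketch of my own, reindexed so that users or items missing from the training split still get an all-zero row or column:
# alternative: pivot the long table into a users x items matrix, unrated cells become 0
train_data_matrix_alt = (train_data.pivot_table(index='user_id', columns='item_id', values='rating')
                         .reindex(index=range(1, n_users + 1), columns=range(1, n_items + 1))
                         .fillna(0)
                         .values)
assert train_data_matrix_alt.shape == (n_users, n_items)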
Computing similarity (cosine)
from sklearn.metrics.pairwise import pairwise_distances
# pairwise cosine distances between users (rows) and between items (columns)
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')
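One thing worth noting: with metric='cosine', pairwise_distances actually returns the cosine distance (1 minus the cosine similarity), and the article feeds these distances straight into predict as the weights. A quick check of my own to confirm the relationship; if you prefer genuine similarities you could use 1 - user_similarity instead:
from sklearn.metrics.pairwise import cosine_similarity
# cosine distance and cosine similarity sum to 1 elementwise
print(np.allclose(user_similarity, 1 - cosine_similarity(train_data_matrix)))  # should print True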
Computing the predictions
# prediction function
def predict(ratings, similarity, type='user'):
    if type == 'user':
        # subtract each user's mean rating to compensate for different rating scales
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
In the user-based branch, the calculation involves a transpose and NumPy broadcasting; for background on user-based vs. item-based recommenders, see:
Collaborative filtering using RapidMiner: user vs. item recommenders
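The key trick is NumPy broadcasting: mean_user_rating has shape (n_users,), and mean_user_rating[:, np.newaxis] turns it into an (n_users, 1) column so it can be subtracted from, and added back to, the (n_users, n_items) ratings matrix row by row. A tiny mean-centering example with my own toy numbers:
ratings_toy = np.array([[4.0, 0.0, 5.0],
                        [1.0, 2.0, 0.0]])
mean_toy = ratings_toy.mean(axis=1)               # shape (2,)   -> [3. 1.]
centered = ratings_toy - mean_toy[:, np.newaxis]  # shape (2, 3)
print(centered)
# [[ 1. -3.  2.]
#  [ 0.  1. -1.]]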
Writing out the predictions
# write the prediction matrices out to CSV files
item_prediction = predict(train_data_matrix, item_similarity, type='item')
np.savetxt('D:/PythonSource/item_prediction.csv', item_prediction, delimiter=',')
user_prediction = predict(train_data_matrix, user_similarity, type='user')
np.savetxt('D:/PythonSource/user_prediction.csv', user_prediction, delimiter=',')
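Beyond dumping the full matrices, the predictions can be turned into an actual top-N list per user. Here is a minimal sketch of my own (top_n_for_user is a hypothetical helper, not part of the article) that recommends the five highest-scoring items a user has not yet rated in the training set:
def top_n_for_user(pred, train_matrix, user_id, n=5):
    # user_id is 1-based as in the raw data; matrix rows/columns are 0-based
    scores = pred[user_id - 1].copy()
    scores[train_matrix[user_id - 1] > 0] = -np.inf  # never recommend already-rated items
    top_items = np.argsort(-scores)[:n]
    return top_items + 1                             # back to 1-based item ids

print(top_n_for_user(user_prediction, train_data_matrix, user_id=1))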
Evaluating accuracy
# evaluate accuracy with RMSE, using only the ratings present in the test set
from sklearn.metrics import mean_squared_error
from math import sqrt

def rmse(prediction, ground_truth):
    # keep only the positions where the test matrix actually has a rating
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

print('user-based CF RMSE=' + str(rmse(user_prediction, test_data_matrix)))
print('item-based CF RMSE=' + str(rmse(item_prediction, test_data_matrix)))
user-based CF RMSE=3.138256866186845
item-based CF RMSE=3.464855694296178
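For reference, the RMSE reported above is simply

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2}$$

computed over the N user-item pairs that actually have a rating in the test matrix, which is why the rmse function filters with nonzero() first.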