Possibly the Collaborative Filtering Recommendation Engine with the Least Code Ever – 不周山
http://www.wentrue.net/blog/?p=970
It is actually an item-based CF (collaborative filtering) recommendation algorithm implemented in R.
Read in the data; the raw data are user-subject favorite pairs
data = read.table('data.dat', sep=',', header=TRUE)
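The post does not show data.dat itself, so here is a small sketch of what the file is assumed to look like: a header row with user_id and subject_id, then one comma-separated pair per line. The ids u1, s1 and so on are made up for illustration.

```r
# Hypothetical contents for data.dat: a header row, then one user,subject pair per line
toy_text <- "user_id,subject_id
u1,s1
u1,s2
u2,s2"

# read.table accepts inline text through its `text` argument, handy for trying things out
toy <- read.table(text = toy_text, sep = ",", header = TRUE)
print(toy)
```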
Build index lookups for users and subjects
user = unique(data$user_id)
subject = unique(data$subject_id)
uidx = match(data$user_id, user)
iidx = match(data$subject_id, subject)
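To see what unique and match do here, a toy run may help. The ids below are hypothetical; the point is that match turns each id into its position in the list of distinct ids, which later serves as a row or column number.

```r
# Hypothetical toy pairs standing in for data.dat
toy <- data.frame(
  user_id    = c("u1", "u1", "u2", "u3", "u3"),
  subject_id = c("s1", "s2", "s2", "s1", "s3"),
  stringsAsFactors = FALSE
)

toy_user    <- unique(toy$user_id)      # distinct users, in order of first appearance
toy_subject <- unique(toy$subject_id)   # distinct subjects, in order of first appearance

toy_uidx <- match(toy$user_id, toy_user)        # row number for each pair
toy_iidx <- match(toy$subject_id, toy_subject)  # column number for each pair

print(cbind(toy, toy_uidx, toy_iidx))
```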
Build the favorites matrix from the pairs
M = matrix(0, length(user), length(subject))
i = cbind(uidx, iidx)
M[i] = 1
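The trick above is that indexing a matrix with a two-column matrix treats each row as a (row, column) coordinate. A minimal sketch with hypothetical indices for 3 users and 3 subjects:

```r
# Hypothetical (user index, subject index) pairs
toy_idx <- cbind(c(1, 1, 2, 3, 3),
                 c(1, 2, 2, 1, 3))

toy_M <- matrix(0, 3, 3)
toy_M[toy_idx] <- 1   # sets exactly the cells (1,1), (1,2), (2,2), (3,1), (3,3)

print(toy_M)
```

Rows are users and columns are subjects, so each column is the 0/1 vector of users who favorited that subject.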
Normalize the column vectors (the subject vectors); %*% is matrix multiplication
mod = colSums(M^2)^0.5 # norm of each column
MM = M %*% diag(1/mod) # multiply M by the diagonal matrix of 1/mod, which divides each column by its norm
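A quick check of the normalization on a toy matrix (hypothetical data): after dividing by the column norms, every column should have length 1. The sweep call is an alternative I am assuming gives the same result without building the dense diagonal matrix, which may matter when there are many subjects.

```r
# Hypothetical 3 x 3 favorites matrix; matrix() fills column by column,
# so the columns (subject vectors) are (1,0,1), (1,1,0), (0,0,1)
toy_M <- matrix(c(1, 0, 1,
                  1, 1, 0,
                  0, 0, 1), nrow = 3)

toy_mod <- colSums(toy_M^2)^0.5           # Euclidean norm of each column
toy_MM  <- toy_M %*% diag(1 / toy_mod)    # the approach used in the post

# Alternative: divide each column by its norm directly
toy_MM2 <- sweep(toy_M, 2, toy_mod, "/")

print(colSums(toy_MM^2))   # squared norms of the normalized columns
```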
crossprod computes the transpose of MM times MM, which here gives the inner products of the column vectors; S is the subject similarity matrix
S = crossprod(MM)
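Since the columns were scaled to unit length, their inner products are cosine similarities. A small sketch on a hypothetical toy matrix compares one entry of the similarity matrix with the cosine computed by hand:

```r
# Hypothetical favorites matrix; columns are subject vectors (1,0,1), (1,1,0), (0,0,1)
toy_M <- matrix(c(1, 0, 1,
                  1, 1, 0,
                  0, 0, 1), nrow = 3)

toy_mod <- colSums(toy_M^2)^0.5
toy_S   <- crossprod(toy_M %*% diag(1 / toy_mod))   # t(MM) %*% MM

# Cosine between subjects 1 and 2, computed directly from the raw columns
cos12 <- sum(toy_M[, 1] * toy_M[, 2]) / (toy_mod[1] * toy_mod[2])

print(toy_S)
print(cos12)
```

The diagonal of the result is all ones, because every subject is perfectly similar to itself.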
Recommendation scores for each user-subject pair
R = M %*% S
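Put differently, a user's score for a subject is the sum of that subject's similarities to everything the user has favorited. A sketch on the same kind of hypothetical toy matrix:

```r
# Hypothetical favorites matrix: rows are users, columns are subjects
toy_M <- matrix(c(1, 0, 1,
                  1, 1, 0,
                  0, 0, 1), nrow = 3)

toy_S <- crossprod(toy_M %*% diag(1 / colSums(toy_M^2)^0.5))
toy_R <- toy_M %*% toy_S

# User 2 favorited only subject 2, so their scores are just row 2 of toy_S
print(toy_R[2, ])
```

Note that subjects a user has already favorited get scores too (each includes a self-similarity of 1), so they can show up among the top results unless they are filtered out separately.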
R = apply(R, 1, FUN=sort, decreasing=TRUE, index.return=TRUE)
k = 5
Take the 5 subjects with the highest scores
res = lapply(R, FUN=function(r)return(subject[r$ix[1:k]]))
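The ranking step relies on sort with index.return=TRUE, which returns both the sorted values (x) and their original positions (ix); apply then yields one such list per user. A minimal sketch with made-up scores and k = 2:

```r
# Hypothetical subjects and scores (2 users x 3 subjects)
toy_subject <- c("s1", "s2", "s3")
toy_R <- rbind(u1 = c(0.2, 0.9, 0.5),
               u2 = c(0.7, 0.1, 0.4))
k <- 2

# One list(x = sorted scores, ix = original column positions) per user
ranked  <- apply(toy_R, 1, FUN = sort, decreasing = TRUE, index.return = TRUE)
toy_res <- lapply(ranked, function(r) toy_subject[r$ix[1:k]])

print(toy_res)
```

If there could be fewer than k subjects, head(r$ix, k) instead of r$ix[1:k] would avoid NA entries; that is a variation, not what the post does.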
Write the output
write.table(paste(user, res, sep=':'), file='result.dat', quote=FALSE, row.names=FALSE, col.names=FALSE)
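One thing to be aware of: res is a list, and as far as I can tell paste turns each element into its deparsed form, so a line can look like u1:c("s2", "s3"). If a plainer format is wanted, one possible sketch is to collapse each recommendation vector first. The names and the temporary output file below are hypothetical.

```r
# Hypothetical users and their recommended subjects
toy_user <- c("u1", "u2")
toy_res  <- list(c("s2", "s3"), c("s1", "s3"))

# Join each user's subjects with commas, then prefix the user id
toy_lines <- paste(toy_user, sapply(toy_res, paste, collapse = ","), sep = ":")

# Write to a temporary file here rather than result.dat
toy_out <- tempfile(fileext = ".dat")
writeLines(toy_lines, toy_out)

print(toy_lines)
```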