The FM Algorithm
A Factorization Machine (FM) is a machine learning model based on matrix factorization. It is widely used in advertising and recommendation, where its main job is to model feature crosses when the data is sparse.
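Advantages of FM
1) FM can estimate its parameters even when the data is sparse.
2) FM has linear time complexity.
3) FM works with many kinds of feature vectors. Typical input includes numeric features and one-hot encoded categorical features. With suitable inputs, FM can be adapted to behave like MF, SVD++, PITF, FPMC and similar models.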
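Disadvantages of FM
1) FM only performs automatic feature crossing below third order (that is, second-order pairwise crosses), so feature engineering is still unavoidable. (DeepFM was developed later and can capture higher-order feature crosses.)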
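FM Derivation
In this section we derive the factorization machine model.
1) The FM formula
In the FM formula, <.,.> denotes the dot product of two vectors of dimension k.
- w0 is the global bias
- wi is the weight of variable xi
- wi,j := <vi, vj> is the interaction weight between variables xi and xj.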
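The definitions above match the standard second-order FM model. Assuming that standard form (n is the number of features and v_{i,f} is the f-th component of the factor vector v_i, both notational assumptions), the model equation can be written as:

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j,
\qquad
\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}
```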
2) Expressive power of the model
The choice of k determines how expressive FM is. To generalize better on sparse data, a relatively small k is usually chosen.
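To get a feel for why a small k keeps the model compact, the sketch below counts parameters. It compares giving every feature pair its own weight with storing one k-dimensional factor vector per feature. The function names and the feature count are hypothetical, chosen only for illustration.

```python
def pairwise_weight_count(n):
    """Number of independent weights w_ij if every pair (i < j) had its own parameter."""
    return n * (n - 1) // 2


def fm_factor_count(n, k):
    """Number of parameters in the FM factor matrix: one k-dimensional vector per feature."""
    return n * k


n_features = 10_000  # hypothetical number of one-hot features
for k in (4, 8, 16):
    print(k, fm_factor_count(n_features, k), pairwise_weight_count(n_features))
```

The factorized form grows linearly with k, so k acts as a capacity knob: a larger k lets the model represent more interaction patterns but gives it more parameters to fit from sparse observations.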
3) Full derivation
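A sketch of the derivation, assuming the standard second-order FM form, with n features and v_{i,f} denoting the f-th component of v_i (notational assumptions), rewrites the pairwise interaction term as follows:

```latex
\begin{aligned}
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
&= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \langle v_i, v_j \rangle x_i x_j
 - \frac{1}{2} \sum_{i=1}^{n} \langle v_i, v_i \rangle x_i x_i \\
&= \frac{1}{2} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{f=1}^{k} v_{i,f} v_{j,f} x_i x_j
 - \sum_{i=1}^{n} \sum_{f=1}^{k} v_{i,f} v_{i,f} x_i x_i \right) \\
&= \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f} x_i \right) \left( \sum_{j=1}^{n} v_{j,f} x_j \right)
 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right) \\
&= \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2
 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right)
\end{aligned}
```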
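The hardest part to follow is the first step of the transformation. What happens when the lower bound j=i+1 is changed to j=1?
Each squared term xi*xi is added once (it was not in the original sum).
Each cross term xi*xj with i ≠ j is added a second time, as xj*xi.
For example, with three features a, b and c, the original sum is ab + ac + bc, which becomes 1/2 * (first term - second term):
First term: aa + ab + ac + ba + bb + bc + ca + cb + cc
Second term: aa + bb + cc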
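The three-feature example above can be checked numerically. The sketch below uses arbitrary illustrative values for a, b and c, and hypothetical helper names. It evaluates both the original pairwise sum and the "half of first term minus second term" form.

```python
def pairwise_sum(xs):
    """Sum of x_i * x_j over all pairs with i < j (the original form)."""
    total = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            total += xs[i] * xs[j]
    return total


def half_square_trick(xs):
    """0.5 * (first term - second term), as in the a, b, c example."""
    first = sum(xi * xj for xi in xs for xj in xs)  # all n * n products
    second = sum(xi * xi for xi in xs)              # diagonal terms only
    return 0.5 * (first - second)


a, b, c = 2.0, 3.0, 5.0  # arbitrary illustrative values
print(pairwise_sum([a, b, c]))       # ab + ac + bc
print(half_square_trick([a, b, c]))  # same value via the rewritten form
```

Both calls return the same number because the full n-by-n sum counts every cross term twice and every squared term once. Removing the squares and halving leaves exactly the i < j pairs.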
The final formula is:
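Assuming the standard second-order FM form, substituting the rewritten interaction term back into the model gives the expression below. Here n is the number of features and v_{i,f} is the f-th component of the factor vector v_i (notational assumptions).

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
 + \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right)
```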
Gradient
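For the standard second-order FM model, the partial derivatives with respect to each parameter take the form sketched below. The symbol theta stands for any single model parameter, and v_{i,f} is the f-th component of v_i (notational assumptions).

```latex
\frac{\partial \hat{y}(x)}{\partial \theta} =
\begin{cases}
1, & \theta = w_0 \\
x_i, & \theta = w_i \\
x_i \sum_{j=1}^{n} v_{j,f} \, x_j - v_{i,f} \, x_i^2, & \theta = v_{i,f}
\end{cases}
```

The sum over j in the last case does not depend on i, so it can be computed once per factor dimension f and reused for every v_{i,f}.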
FM has linear time complexity, so both training and prediction can be completed in time linear in the number of features.
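The linear-time claim can be illustrated with a small pure-Python sketch. It evaluates the standard second-order FM prediction twice: once with the direct double loop over feature pairs, and once with the rewritten interaction term. The function names, toy sizes and random parameters are assumptions made for illustration.

```python
import random


def fm_naive(x, w0, w, V):
    """Direct evaluation with the double loop over pairs: O(k * n^2)."""
    n, k = len(x), len(V[0])
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            y += dot * x[i] * x[j]
    return y


def fm_linear(x, w0, w, V):
    """Evaluation with the rewritten interaction term: O(k * n)."""
    n, k = len(x), len(V[0])
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))            # sum of v_if * x_i
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))  # sum of squared terms
        y += 0.5 * (s * s - s_sq)
    return y


random.seed(0)
n, k = 6, 3  # toy sizes chosen for illustration
x = [random.choice([0.0, 1.0]) for _ in range(n)]
w0 = 0.1
w = [random.uniform(-1, 1) for _ in range(n)]
V = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)]

# The two evaluations agree up to floating-point rounding.
print(abs(fm_naive(x, w0, w, V) - fm_linear(x, w0, w, V)) < 1e-9)
```

With sparse input, the inner sums only need to visit the non-zero entries of x, so the cost scales with the number of active features, not the full feature count.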
FM network structure
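The DeepFM Algorithm
DeepFM was proposed jointly by Harbin Institute of Technology and Huawei in 2017. After Wide&Deep, it is another two-branch model that is widely used in industry. Compared with Wide&Deep, DeepFM replaces the Wide part with an FM, which strengthens the feature-combination ability of the shallow part of the network.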
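In the implementation later in this article, the logits of the linear part, the FM interaction part and the DNN part are added together and passed through a sigmoid. As a compact summary of that structure (the subscript names are labels chosen here for illustration):

```latex
\hat{y} = \mathrm{sigmoid}\left( y_{\text{linear}} + y_{\text{FM}} + y_{\text{DNN}} \right)
```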
DeepFM code implementation
import os
import numpy as np
import pandas as pd
from collections import namedtuple
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
##### Data preprocessing
data = pd.read_csv('./data/criteo_sample.txt')
data.head()

def data_processing(df, dense_features, sparse_features):
    # Fill missing dense values with 0, then compress the range with log(x + 1)
    df[dense_features] = df[dense_features].fillna(0.0)
    for f in dense_features:
        df[f] = df[f].apply(lambda x: np.log(x+1) if x > -1 else -1)
    # Fill missing categories with "-1", then map each category to an integer id
    df[sparse_features] = df[sparse_features].fillna("-1")
    for f in sparse_features:
        lbe = LabelEncoder()
        df[f] = lbe.fit_transform(df[f])
    # Return a copy so that adding the label column later does not touch a view
    return df[dense_features + sparse_features].copy()

# Columns whose names contain 'I' are treated as dense, those containing 'C' as sparse
dense_features = [i for i in data.columns.values if 'I' in i]
sparse_features = [i for i in data.columns.values if 'C' in i]
df = data_processing(data, dense_features, sparse_features)
df['label'] = data['label']
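The two transformations in the preprocessing step can be seen on toy values with a pure-Python sketch. The helper names and sample values are hypothetical. `label_encode` imitates the behavior of scikit-learn's `LabelEncoder`, which assigns integer ids to the sorted distinct classes; it is not the library implementation.

```python
import math


def log_transform(x):
    """Mirror of the dense-feature rule: log(x + 1) when x > -1, else -1."""
    return math.log(x + 1) if x > -1 else -1


def label_encode(values):
    """Map each distinct category to an integer id, in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values]


dense_column = [0.0, 9.0, -2.0]           # toy values; missing entries already set to 0.0
sparse_column = ["b7", "a1", "-1", "a1"]  # "-1" stands for a missing category

print([log_transform(x) for x in dense_column])
print(label_encode(sparse_column))
```

The log transform squeezes heavy-tailed counts into a narrower range, and the integer ids are what the embedding layers later use as lookup indices.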
##### Model construction
# Describe each feature column with a named tuple
SparseFeature = namedtuple('SparseFeature', ['name', 'vocabulary_size', 'embedding_size'])
DenseFeature = namedtuple('DenseFeature', ['name', 'dimension'])
VarLenSparseFeature = namedtuple('VarLenSparseFeature', ['name', 'vocabulary_size', 'embedding_size', 'maxlen'])
class FM_Layer(Layer):
    def __init__(self):
        super(FM_Layer, self).__init__()

    def call(self, inputs):
        # inputs: (batch, num_fields, embedding_size)
        concate_embed_values = inputs
        # Square of the sum over fields, shape (batch, 1, embedding_size)
        square_of_sum = tf.square(tf.reduce_sum(concate_embed_values, axis=1, keepdims=True))
        # Sum of the squares over fields, shape (batch, 1, embedding_size)
        sum_of_square = tf.reduce_sum(concate_embed_values * concate_embed_values, axis=1, keepdims=True)
        output = square_of_sum - sum_of_square
        # Sum over the embedding dimension and halve, shape (batch, 1)
        output = 0.5 * tf.reduce_sum(output, axis=2, keepdims=False)
        return output

    def compute_output_shape(self, input_shape):
        return (None, 1)
def build_input_layers(feature_columns):
    """ Build the input layers, one per feature column """
    dense_input_dict, sparse_input_dict = {}, {}
    for f in feature_columns:
        if isinstance(f, DenseFeature):
            dense_input_dict[f.name] = Input(shape=(f.dimension, ), name=f.name)
        elif isinstance(f, SparseFeature):
            sparse_input_dict[f.name] = Input(shape=(1, ), name=f.name)
    return dense_input_dict, sparse_input_dict
def build_embedding_layers(feature_columns, is_linear):
    embedding_layers_dict = {}
    # Keep only the sparse feature columns
    sparse_feature_columns = list(filter(lambda x: isinstance(x, SparseFeature), feature_columns)) if feature_columns else []
    if is_linear:
        # 1-dimensional embeddings act as per-category linear weights
        for f in sparse_feature_columns:
            embedding_layers_dict[f.name] = Embedding(f.vocabulary_size + 1, 1, name='1d_emb_' + f.name)
    else:
        # k-dimensional embeddings shared by the FM and DNN parts
        for f in sparse_feature_columns:
            embedding_layers_dict[f.name] = Embedding(f.vocabulary_size + 1, f.embedding_size, name='kd_emb_' + f.name)
    return embedding_layers_dict
def concat_embedding_list(feature_columns, input_layer_dict, embedding_layer_dict, flatten=False):
    """ Look up the embedding of each sparse feature and collect them in a list """
    sparse_feature_columns = list(filter(lambda x: isinstance(x, SparseFeature), feature_columns)) if feature_columns else []
    embedding_list = []
    for f in sparse_feature_columns:
        _input_layer = input_layer_dict[f.name]
        _embed = embedding_layer_dict[f.name]
        embed_layer = _embed(_input_layer)
        if flatten:
            embed_layer = Flatten()(embed_layer)
        embedding_list.append(embed_layer)
    return embedding_list
def get_linear_logits(dense_input_dict, sparse_input_dict, sparse_feature_columns):
    concat_dense_inputs = Concatenate(axis=1)(list(dense_input_dict.values()))
    dense_logits_output = Dense(1)(concat_dense_inputs)
    linear_embedding_layer = build_embedding_layers(sparse_feature_columns, is_linear=True)
    sparse_1d_embed_list = []
    for f in sparse_feature_columns:
        temp_input = sparse_input_dict[f.name]
        temp_embed = Flatten()(linear_embedding_layer[f.name](temp_input))
        sparse_1d_embed_list.append(temp_embed)
    sparse_logits_output = Add()(sparse_1d_embed_list)
    linear_logits = Add()([dense_logits_output, sparse_logits_output])
    return linear_logits
def get_fm_logits(sparse_input_dict, sparse_feature_columns, dnn_embedding_layers):
    sparse_kd_embed_list = []
    for f in sparse_feature_columns:
        f_input = sparse_input_dict[f.name]
        _embed = dnn_embedding_layers[f.name](f_input)
        sparse_kd_embed_list.append(_embed)
    concat_sparse_kd_embed_list = Concatenate(axis=1)(sparse_kd_embed_list)
    fm_logits = FM_Layer()(concat_sparse_kd_embed_list)
    return fm_logits
def get_dnn_logits(sparse_input_dict, sparse_feature_columns, dnn_embedding_layers):
    sparse_kd_embed = concat_embedding_list(sparse_feature_columns, sparse_input_dict, dnn_embedding_layers, flatten=True)
    concat_sparse_kd_embed = Concatenate(axis=1)(sparse_kd_embed)
    # DNN layers
    dnn_out = Dropout(0.5)(Dense(1024, activation='relu')(concat_sparse_kd_embed))
    dnn_out = Dropout(0.5)(Dense(512, activation='relu')(dnn_out))
    dnn_out = Dropout(0.5)(Dense(256, activation='relu')(dnn_out))
    dnn_logits = Dense(1)(dnn_out)
    return dnn_logits
def DeepFm(linear_feature_columns, dnn_feature_columns):
    dense_input_dict, sparse_input_dict = build_input_layers(linear_feature_columns + dnn_feature_columns)
    # linear
    linear_sparse_feature_columns = list(filter(lambda x: isinstance(x, SparseFeature), linear_feature_columns))
    input_layers = list(dense_input_dict.values()) + list(sparse_input_dict.values())
    linear_logits = get_linear_logits(dense_input_dict, sparse_input_dict, linear_sparse_feature_columns)
    # dnn
    dnn_embedding_layers = build_embedding_layers(dnn_feature_columns, is_linear=False)
    dnn_sparse_feature_columns = list(filter(lambda x: isinstance(x, SparseFeature), dnn_feature_columns))
    dnn_logits = get_dnn_logits(sparse_input_dict, dnn_sparse_feature_columns, dnn_embedding_layers)
    # fm
    fm_logits = get_fm_logits(sparse_input_dict, dnn_sparse_feature_columns, dnn_embedding_layers)
    output_logits = Add()([linear_logits, dnn_logits, fm_logits])
    output_layer = Activation("sigmoid")(output_logits)
    model = Model(input_layers, output_layer)
    return model
# Define the feature columns
linear_feature_columns = [SparseFeature(f, vocabulary_size=df[f].nunique(), embedding_size=4) for f in sparse_features] + \
    [DenseFeature(f, 1,) for f in dense_features]
dnn_feature_columns = [SparseFeature(f, vocabulary_size=df[f].nunique(), embedding_size=4) for f in sparse_features] + \
    [DenseFeature(f, 1,) for f in dense_features]
model = DeepFm(linear_feature_columns, dnn_feature_columns)
model.summary()
##### Model training
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["binary_crossentropy", tf.keras.metrics.AUC(name='auc')])
# Feed each feature as a NumPy array keyed by the name of its Input layer
train_input = {col: df[col].values for col in dense_features + sparse_features}
model.fit(train_input, df['label'].values,
          batch_size=64, epochs=5, validation_split=0.2)