This post walks through the image-search ranking algorithms Pinterest has experimented with, looking at them from the angles of training data, user/image featurization, and ranking models, and reviews the performance and quality tests.
Background:
1. Purpose: improve image search quality by evolving the ranking techniques.
The first search system was built on top of Apache Lucene and Solr; results were ranked by the text relevance score between the query and the text description of each image. The search system was subsequently evolved by adding various advancements that address the unique challenges of Pinterest (i.e., problems encountered and the solutions explored).
2. Three challenges:
The first challenge stems from why users search for images on Pinterest: their intent shows up through multiple explicit engagement actions on pins, such as 'try it', 'close up' and 'repin'. This raises a new problem: it is hard to define a universal performance rule that weighs these actions against one another; is 'try it' more important than 'close up'? Another challenge comes from the nature of images: a single image may carry a lot of information that is difficult to extract and describe in a few words. The third is balancing the efficiency and effectiveness of ranking algorithms: although many algorithms have been deployed in industry, it is rarely known which one is best for a given application.
3. Three aspects:
Data: combine engagement training data & human-curated relevance judgment data
Featurization: includes feature engineering, word embeddings and visual embeddings, visual relevance signals, query intent understanding, user intent understanding, etc.
Modeling: a cascading core ranking component (to balance search latency against search quality)
Section 2: How training data are curated
Goal: maximize both engagement & relevance.
Engagement data (user behavioral metrics):
Click-through engagement logs have become the standard training data for learning-to-rank optimization in search engines, recorded as tuples ⟨q, u, (P, T)⟩. On Pinterest, however, T can be any of multiple actions towards pins, including click-through, repin, close-up, like, hide, comment, try-it, etc.
Hence a new challenge: how should we simultaneously combine and optimize multiple kinds of feedback?
Option 1: train one model per engagement type, then ensemble and tune the resulting models. Many variants were tried, but none managed to avoid sacrificing some engagement type.
Option 2: ensemble at the data level. Let l(p | q, u), written l_p, denote the engagement-based quality label of pin p for user u under query q. For a pin set P (the recall set of a user query), a label set L can then be generated; for each pin in P, l_p is a linear combination over the action set T:
Formula 1: l_p = Σ_{t∈T} w_t · c_t, where T is the set of actions listed above (repin, close-up, etc.), c_t is the count of action t, and w_t is the weight of action t, inversely proportional to the action's volume (close-ups are more voluminous than clicks, so they weigh less).
Formula 2 then normalizes each pin's label by position and age: age_p and pos_p are the age and position of pin p, τ is the normalization weight for pin ages, and λ is the parameter that controls position decay.
Another challenge: the samples are skewed, negatives > positives. The sample set is pruned by 1. discarding query groups in which every l_p is below zero, and 2. discarding query groups whose number of negative samples exceeds a threshold. A sketch of both steps follows.
The final engagement training data take the form ⟨q, u, (P, T, L)⟩.
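To make the label construction concrete, here is a minimal Python sketch of formula 1 plus the two pruning rules. The specific action weights (including a negative weight for 'hide') and the negative-sample threshold are illustrative assumptions, not the paper's values; the age/position normalization of formula 2 is omitted since its exact form is not reproduced here.

```python
# Hypothetical action weights: the paper only states that w_t is inversely
# proportional to the volume of action t (close-ups are more common than
# clicks, so they weigh less); the numbers below, and the negative weight
# for 'hide', are assumptions for illustration.
ACTION_WEIGHT = {"click": 1.0, "repin": 2.0, "close_up": 0.5, "hide": -4.0}

def engagement_label(action_counts: dict) -> float:
    """Formula 1: l_p = sum over actions t of w_t * c_t for one pin p."""
    return sum(ACTION_WEIGHT.get(t, 0.0) * c for t, c in action_counts.items())

def prune_query_groups(groups: dict, max_negatives: int = 100) -> dict:
    """Pruning rules: drop query groups whose labels are all negative, and
    groups with more than max_negatives negative samples (threshold assumed)."""
    return {
        q: labels for q, labels in groups.items()
        if not all(l < 0 for l in labels)
        and sum(l < 0 for l in labels) <= max_negatives
    }
```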
Human relevance data (human relevance judgment):
Although aggregating large numbers of individually unreliable user search sessions provides reliable engagement training data from implicit feedback, it also inherits biases from the current ranking function; position bias is one of them. To correct the ranking bias, relevance judgment data are also curated from human experts, much like a survey with manual labeling.
Combining:
We simply consider each training data source independently and feed each into the ranking function to train a source-specific model.
Section 3: Feature representation for ranking
How traditional ranking features are enhanced to address Pinterest's unique challenges.
Beyond Text Relevance Features:
Background: the text description of each Pin is usually short and noisy.
1. Word-level relevance: each image is given text annotations in n-gram form (unigrams, bigrams and trigrams) drawn from different sources such as pin texts, descriptions, text from the crawled linked web page, and automatically classified annotation labels. A text matching score is then computed with BM25 or proximity BM25 (see the sketch after this list).
2. Categoryboost: 32 L1 categories and 500 L2 categories.
3. Topicboost: topics are discovered by statistical topic modeling of word distributions (Latent Dirichlet Allocation).
4. Embedding Features: describe the similarity between users' query requests and the pins in a latent representation space.
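For reference, the word-level relevance score can be illustrated with a plain BM25 implementation; this is the textbook formula with the usual default constants, not Pinterest's proximity variant.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
         k1=1.2, b=0.75):
    """Textbook BM25 between query n-grams and one pin's text annotations.
    doc_freq maps each term to the number of pins whose annotations contain it."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5)
                       / (doc_freq[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score
```

The same scoring would run once per n-gram source (unigrams, bigrams, trigrams), each producing a separate ranking feature.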
User Intent Features: help our learning algorithm learn which types of pins are "really" relevant and interesting to users.
1. Navboost: describes how a pin performs both in general and under specific queries and user segments, based on historical statistics of close-up, click, long-click and repin actions, segmented by country, gender, aggregation time, and so on (see the sketch after this list).
2. Tokenboost: how well a pin performs in general and in the context of a specific token.
3. Gender Features:
4. Personalized Features:
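Read as a smoothed historical action rate, a Navboost/Tokenboost-style feature can be sketched as below; the smoothing prior is an assumption, since the paper does not spell out the estimator.

```python
def boost_feature(action_count: int, impressions: int,
                  prior_rate: float = 0.01,
                  prior_strength: float = 100.0) -> float:
    """Smoothed historical action rate (e.g. repins / impressions) for a pin.
    The Beta-style prior keeps low-traffic pins from getting extreme scores;
    prior_rate and prior_strength are illustrative assumptions."""
    return ((action_count + prior_rate * prior_strength)
            / (impressions + prior_strength))

# Navboost would aggregate counts per pin and per (query, pin, country,
# gender, time-window) segment before smoothing; Tokenboost per (token, pin).
```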
Socialness, Visual and Other Features:
image size, aspect ratio, image hash features, pin following, friends following
Section 4: Cascading models
Three stages: a light-weight stage, a full-ranking stage, and a re-ranking stage (see Figure 4), narrowing the candidates from millions of pins, to tens of thousands, to about 1,000 (a code sketch follows the stage list below).
Notation: q is the query, u the user, p a pin; x is the feature vector of a tuple (q, u, p); l_p is as in Section 2 and y is its ground-truth value; s_p is the quality score of p given q & u; L is the loss function and S is the scoring function.
Table 1 lists the models used at each stage. The rule-based models are skipped here because they are straightforward.
Light-weight stage:
filters out negative (low-quality) pins
Full-ranking stage:
Re-ranking stage:
improves freshness, diversity, localness and language awareness.
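A schematic sketch of the three-stage cascade, assuming placeholder model objects with a score method; the cutoffs follow the millions → tens of thousands → ~1,000 funnel described above.

```python
def cascade_rank(candidates, light_model, full_model, rerank_model):
    """Each stage ranks the survivors of the previous one with a costlier
    model over a smaller set, trading search quality against latency."""
    # Light-weight stage: a cheap model prunes millions of candidates.
    stage1 = sorted(candidates, key=light_model.score, reverse=True)[:20_000]
    # Full-ranking stage: the heavy model keeps the ~1,000 best pins.
    stage2 = sorted(stage1, key=full_model.score, reverse=True)[:1_000]
    # Re-ranking stage: adjust for freshness, diversity, localness, language.
    return sorted(stage2, key=rerank_model.score, reverse=True)
```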
Models:
Gradient Boost Decision Tree (GBDT): uses mean squared error as the training loss.
Deep Neural Network (DNN): learns to predict the quality score s_p, cast as a multi-class classification problem over a 4-scale label; l_p is mapped to y ∈ [1, 2, 3, 4] and the model is trained with cross-entropy loss.
Convolutional Neural Network (CNN): identical to the DNN setup except for the network architecture.
RankNet: learns the correct ordering of pin pairs; the model learns a ranking function S(q, u, p; θ) and predicts the probability that s_{p_i} > s_{p_j}. Pairs are built from the unnormalized labels: l_{p_i} > l_{p_j} means p_i should rank before p_j. The loss function is formula 6 (see the sketch after this list).
RankSVM:
Gradient Boost Ranking Tree (GBRT): a combination of RankSVM and GBDT; each iteration learns a function modeling the probability that p_i ranks before p_j, and, as in GBDT, each ranker is a regression tree of limited depth.
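For concreteness, here is the standard RankNet pairwise loss (Burges et al.): the probability that p_i outranks p_j is modeled as a logistic function of the score difference and trained with cross-entropy. This is presumably what formula 6 is; the paper's exact parameterization may differ.

```python
import math

def ranknet_loss(s_i: float, s_j: float, target: float) -> float:
    """Pairwise cross-entropy loss for one pin pair under the same (q, u).
    target is 1.0 if l_pi > l_pj (p_i should rank first), else 0.0."""
    p_ij = 1.0 / (1.0 + math.exp(-(s_i - s_j)))   # modeled P(s_pi > s_pj)
    p_ij = min(max(p_ij, 1e-12), 1.0 - 1e-12)     # guard the logs
    return -(target * math.log(p_ij) + (1.0 - target) * math.log(1.0 - p_ij))
```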
Model ensembling:
The engagement data and the human relevance data produce different models, which are blended linearly.
Models that are cheap to train are blended during training; the others are blended after training.
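A minimal sketch of the linear blending step, with the blend weight as a placeholder to be tuned on validation or online data.

```python
def blended_score(engagement_score: float, relevance_score: float,
                  alpha: float = 0.5) -> float:
    """Linear blend of the engagement-trained and relevance-trained models;
    alpha = 0.5 is a placeholder, not a value from the paper."""
    return alpha * engagement_score + (1.0 - alpha) * relevance_score
```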
Section 5: Experiments
Offline Experimental Setting:
For each country and language, human relevance data: 5,000 queries, with user judgments collected for 400 pins per query; engagement data: a random 1% sample of the logs from the most recent 7 days. 70% is used for training, 20% for testing, and 10% for validation.
Offline Measurement Metrics: Figure 6 shows the distributions of some features. Offline evaluation uses NDCG (see the reference at the end).
Online Experimental Setting:
Users are split into 100 buckets for the online A/B experiments.
Online Measurement Metrics
Measured at both the user level and the query level.
For the query-level measurement metrics, repin per search (Qrepin), click per search (Qclick), close-up per search (Qclose up) and engagement per search (Qengaged) were the main metrics used (a sketch follows).
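These per-search metrics are plain ratios over the experiment logs; a sketch under an assumed (search_id, action) log format:

```python
from collections import Counter

def query_level_metrics(logs):
    """Compute Qrepin, Qclick, Qclose_up, ... = action count / search count.
    `logs` is assumed to be an iterable of (search_id, action) pairs."""
    logs = list(logs)
    num_searches = max(len({sid for sid, _ in logs}), 1)
    counts = Counter(action for _, action in logs)
    return {f"Q{action}": c / num_searches for action, c in counts.items()}
```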
Performance Results
5.3.1 Lightweight Ranking Comparison.
Figure 7: on offline data, with the rule-based model as the baseline, RankSVM brings a large improvement; online, however, the RankSVM gain is small, a common pattern. The quality lift is modest, but latency drops (Table 2).
5.3.2 Full Ranking Comparison
Offline, compared against RankSVM: CNN > GBRT > DNN > RankNet > GBDT.
Online: neural network models introduce latency, so their scores are precomputed offline and added as features to the online tree models.
5.3.3 Re-ranking Comparison.
When the rule-based re-ranking model was replaced with GBRT, the repin rate and click-through rate on fresh pins increased by 20%.
Reference
Normalized Discounted Cumulative Gain (NDCG)
Precision and recall give every item in the returned set the same weight: the item at position k counts as much as the item at position 1. Users do not read results that way: in a search engine's ranked list, the top answers matter far more than the ones further down.
Normalized Discounted Cumulative Gain (NDCG) accounts for this. It builds on three progressively refined metrics: Cumulative Gain (CG), then Discounted Cumulative Gain (DCG), and finally the normalized version. CG sums the relevance scores of the top k returned items; DCG multiplies each item's score by a weight inversely related to its position, so that earlier positions weigh more; NDCG then normalizes DCG by the ideal DCG (the DCG of the same items sorted by true relevance), so the score always falls between 0.0 and 1.0. The Wikipedia article gives the detailed formulas.
DCG and NDCG are used mostly in information retrieval and in methods that care about the positions of returned items.
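The verbal description above maps directly to code; a standard implementation with the common log2 position discount (variants exist, e.g. the 2^rel − 1 gain):

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the top-k results: each relevance score
    is divided by log2(position + 1), so earlier positions count more."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """Normalize by the ideal DCG (the same items sorted by true relevance),
    so the score always falls in [0.0, 1.0]."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Example: graded relevances of a ranked result list.
print(ndcg([3, 2, 3, 0, 1, 2], k=6))
```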