Overview
How Quora does recommendations; the original article is machine-learning-for-qa-sites-the-quora-example
I came across this article and then followed up on its related references.
First, the overview itself.
The three main factors Quora considers: Relevance, Quality, and Demand.
Feed Ranking
Goal: present the most interesting stories for a user at a given time
Interesting = topical relevance + social relevance + timeliness
Stories = questions + answers
Answer Ranking
A Machine Learning Approach to Ranking Answers on Quora
The quality of the answer content itself. Quora has explicit guidance on what a good answer looks like [2]: it should be grounded in fact, provide reusable value, explain its reasoning, be well formatted, and so on. What-does-a-good-answer-on-Quora-look-like-What-does-it-mean-to-be-helpful
Engagement, including upvotes/downvotes, comments, shares, bookmarks, clicks, and so on.
Features of the answerer, e.g. the author's expertise in the question's domain.
Ask2Answers
Ask-To-Answer-as-a-Machine-Learning-Problem
Given a question and a viewer, rank all other users based on how "well-suited" they are, where "well-suited" = likelihood of the viewer sending a request + likelihood of the candidate adding a good answer. That is, it accounts both for how likely the viewing user is to send the invitation and for how likely the invitee is to answer it well.
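A minimal sketch of that trade-off (entirely illustrative: the user names and probabilities are made up, and the post does not say how the two likelihoods are modeled or combined beyond the sum written above):

```python
# Toy A2A ranking: each candidate is scored by combining two estimated
# probabilities that, in a real system, would come from two trained models.

def a2a_score(p_send: float, p_good_answer: float) -> float:
    """Combine the two likelihoods as the summary writes them (a sum);
    a product would be a stricter alternative that punishes low factors."""
    return p_send + p_good_answer

# Hypothetical candidates: (user, P(viewer sends request), P(good answer))
candidates = [("alice", 0.9, 0.2), ("bob", 0.5, 0.7), ("carol", 0.2, 0.9)]
for user, p_send, p_good in sorted(candidates, key=lambda c: a2a_score(c[1], c[2]), reverse=True):
    print(user, round(a2a_score(p_send, p_good), 2))
```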
Topic Network
User Trust/Expertise Inference
Quora needs to identify experts in each domain, using signals such as how many questions a user has answered in that domain and the upvotes, downvotes, thanks, shares, bookmarks, and views those answers received. Another important signal is the propagation of expertise: if Xavier upvotes an answer in the recommender systems domain, the author of that answer is likely to have high expertise in recommender systems.
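As a toy illustration of this propagation effect (my sketch only: the update rule, rate, and names are hypothetical, not Quora's actual inference):

```python
# Expertise propagation toy: an upvote from a user with high expertise in a
# topic is weak evidence that the answer's author has expertise there too.
from collections import defaultdict

expertise: dict = defaultdict(dict)  # expertise[user][topic] -> score in [0, 1]
expertise["xavier"]["recommender-systems"] = 0.95  # seed value, made up

def propagate_upvote(voter: str, author: str, topic: str, rate: float = 0.1) -> None:
    """Nudge the author's topic expertise toward the voter's on each upvote."""
    voter_exp = expertise[voter].get(topic, 0.0)
    author_exp = expertise[author].get(topic, 0.0)
    expertise[author][topic] = author_exp + rate * max(voter_exp - author_exp, 0.0)

propagate_upvote("xavier", "alice", "recommender-systems")
print(expertise["alice"])  # alice gains a little recommender-systems expertise
```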
The-Product-Engineering-Behind-Most-Viewed-Writers also feels somewhat related.
Data Analysis Case Study
Mapping-the-Discussion-on-Quora-Over-Time-through-Question-Text
Below I go through the points I found most interesting, one by one.
A Machine Learning Approach to Ranking Answers on Quora
Early Attempts and Baselines
A simple score based on upvotes/downvotes: simple, fast, and highly interpretable, which makes it a good baseline. (They also tried optimizing with view counts, but that did not work well.)
Its drawbacks, listed below, motivated moving to richer structured information:
- Time sensitivity: an answer can only be scored after it has received votes
- Rich get richer: the more upvotes an answer has, the more chances it gets to collect further upvotes
- Joke answers: clickbait-like content that attracts votes without actually answering well
- Discoverability: content from new users struggles to attract any engagement
Our Approach to Solving the Ranking Problem
In ranking problems our goal is to predict a list of documents ranked by relevance. In most cases there is also additional context, e.g. an associated user that's viewing the results, which introduces personalization into the problem. The model described below appears to be the base version:
- non-personalized
- supervised
- item-wise regression: give each answer its own score
Ground Truth Dataset
At Quora we define good answers to have the following five properties:
- actually answers the question that was asked
- provides reusable knowledge
- is supported by reasoning
- is demonstrably correct
- is clear and easy to read
Examples of effective features
The extracted features include:
- user credentials, formatting, upvotes
- author and upvoter history, used to infer topic-level expertise and trust
We explored many different ways to establish the true quality of answers:
- once again using the ratio of upvotes to downvotes as labels; the shortcoming is that these labels suffer from the same issues as the vote-based baseline presented earlier
- running a user survey
- combining the sources into one authoritative ground-truth dataset.
Features and Models
Feature engineering is a major contributor to the success of a model, and it's often the hardest part of building a good machine learning system. The features we tried can broadly be categorized into three groups:
- text-based features
- expertise-based features
- author/upvoter history-based features
Text-based features came first, but syntactic complexity made some of them problematic.
In general, feeding the outputs of other models in as features is very useful; for example, a feature that estimates a user's expertise under a given topic proved exactly that.
For this regression model, GBDTs and some deep learning models gave convincing results, but deep learning is weaker on interpretability, so they also ran experiments with linear regression.
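A self-contained sketch of that item-wise regression setup, with scikit-learn's GBDT standing in for whatever Quora actually trains (feature names and labels below are invented):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy per-answer features: [readability, author_topic_expertise, upvoter_quality]
X = rng.random((500, 3))
# Toy quality label: a noisy function of the features
y = 0.2 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.05, 500)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(X, y)

# Item-wise scoring: every answer gets an independent score; ranking is a sort
answers = rng.random((4, 3))
scores = model.predict(answers)
print(scores, np.argsort(-scores))  # indices from best to worst
```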
Metrics
We used the following metrics:
- Rank-based: NDCG, Mean Reciprocal Rank, Spearman's Rho, Kendall's Tau
- Point-wise: R² and Mean Squared Error, to make sure the scores stay on the same scale as the training data, so that answers can also be compared across questions.
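For concreteness, here is how both metric families can be computed on a toy ranking (scipy's kendalltau plus a hand-rolled NDCG; illustrative only, not Quora's evaluation code):

```python
import numpy as np
from scipy.stats import kendalltau

def ndcg(relevance_in_predicted_order: np.ndarray) -> float:
    """NDCG: DCG of the predicted order divided by DCG of the ideal order."""
    def dcg(rel: np.ndarray) -> float:
        return float(np.sum((2.0 ** rel - 1) / np.log2(np.arange(2, len(rel) + 2))))
    ideal = np.sort(relevance_in_predicted_order)[::-1]
    return dcg(relevance_in_predicted_order) / dcg(ideal)

true_rel = np.array([3.0, 1.0, 2.0, 0.0])     # ground-truth quality, in predicted order
pred_scores = np.array([0.9, 0.8, 0.4, 0.1])  # the model's point-wise scores

print("NDCG:", ndcg(true_rel))
tau, _ = kendalltau(true_rel, pred_scores)
print("Kendall's tau:", tau)
```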
Productionalization
- New answers are first ranked with a fast feature-extraction pass; a more accurate score is then recomputed asynchronously.
- Ranking hundreds of answers is time-consuming, so scores are cached and only recomputed when features change.
- One problem: if an author's features change, every answer by that author needs updating, which can be expensive, so those updates are specially organized and batched.
- Another optimization works inside the decision trees: when a feature change cannot affect the score, the feature update is skipped.
- Altogether this cut computation by about 70%.
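My reading of that in-tree optimization, as a sketch: a changed feature value can only change a tree's output if it crosses some split threshold that tests the feature, so a conservative check lets the cached score be reused (the Split layout and threshold values are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Split:
    feature: int
    threshold: float

# Hypothetical: all splits across the forest that test feature 2
splits_for_feature = [Split(2, 0.3), Split(2, 0.7)]

def score_unchanged(old_value: float, new_value: float, splits: list[Split]) -> bool:
    """True if both values fall on the same side of every split, in which case
    no tree's path can change and the cached score is still valid."""
    return all((old_value <= s.threshold) == (new_value <= s.threshold) for s in splits)

print(score_unchanged(0.4, 0.6, splits_for_feature))  # True: skip the rescore
print(score_unchanged(0.4, 0.8, splits_for_feature))  # False: crosses 0.7
```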
The Quora Topic Network
Introduction
Topics form an important organizational backbone for Quora's corpus of knowledge. Our goal is to become the Internet's best source for knowledge on as many of these topics as possible.
- more and more topics, with people contributing plenty of quality content
- by tagging content with topics, people have created a sensible, hierarchical organization of domain knowledge
- people can follow other people, and can also follow questions and topics, creating follow relationships
Quora's Diversification
The number of topics reflects how diversity is changing, so they counted the topics that have 100 good questions.
What counts as a good question: at least one person besides the author found it valuable. By the end of 2013 there were roughly 5,000 such topics, and the number was growing fast, which says diversity is improving.
Defining the Probabilistic Topic Network
Topics are growing fast, but knowledge of the new domains is not improving as quickly, so this knowledge needs organizing.
A question can carry multiple topics, which provides implicit relationships between them, so a network can be built over the topics alone.
- Link topic A to topic B if the two are tagged or referenced together on at least one question
- Many questions tagged Moon landing are also tagged NASA, but not the reverse, so the edges are directed
- Add weights: if n people follow a question, that question contributes importance n to the link; the weight of A->B is then defined as follows:
(the original post shows the formula as an image that is missing here; from the definitions above it is presumably the follower-weighted co-tag mass normalized into a probability) w(A→B) = Σ_{q tagged A,B} n_q / Σ_{q tagged A} n_q
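Under that reconstructed definition, the network can be assembled with NetworkX, which the post's own analysis used (the question data below is made up):

```python
import networkx as nx

# (topics tagged on a question, number of followers of that question)
questions = [
    ({"moon-landing", "nasa"}, 120),
    ({"moon-landing", "nasa", "space"}, 40),
    ({"nasa", "space"}, 300),
]

G = nx.DiGraph()
for topics, n_followers in questions:
    for a in topics:
        for b in topics - {a}:
            prev = G.edges[a, b]["cooccur"] if G.has_edge(a, b) else 0.0
            G.add_edge(a, b, cooccur=prev + n_followers)

# Normalize: w(A->B) = follower weight shared with B / total follower weight of A
for a in G.nodes:
    total = sum(n for tags, n in questions if a in tags)
    for _, b, data in G.out_edges(a, data=True):
        data["weight"] = data["cooccur"] / total

# Every moon-landing question is also tagged NASA, but not vice versa:
print(G.edges["moon-landing", "nasa"]["weight"])  # 1.0
print(G.edges["nasa", "moon-landing"]["weight"])  # ~0.35
```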
Hints of the Topic Hierarchy
The in-degree of each node is a simple measure of generality: just sum the weights of all edges pointing into the node.
If the topics really form a hierarchy, a topic can accumulate a large in-degree along at least two different paths, so the mean and median in-degree over all nodes should differ a lot. The median is driven by the typical, specific topics, whose in-degree should be low, and as more and more niche topics join, the median should fall further. The mean is driven by the big, high-in-degree nodes, so it should be larger and relatively stable. Statistics computed with NetworkX show exactly this pattern, so a hierarchy really can be read out of the topic network.
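The check itself is only a few lines of NetworkX; on a toy hierarchical graph the gap between mean and median is already visible:

```python
import networkx as nx
import numpy as np

# Toy hierarchy: many specific topics all point at one general topic
G = nx.DiGraph()
for i in range(10):
    G.add_edge(f"specific-{i}", "general", weight=1.0)

in_degrees = [d for _, d in G.in_degree(weight="weight")]
print("mean:", np.mean(in_degrees))      # ~0.91: pulled up by the hub
print("median:", np.median(in_degrees))  # 0.0: most nodes get no incoming weight
```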
Diving Deeper into the Topic Hierarchy
- degree distribution: power law, proportional to k^(-1.6)
- connectivity: 99.8% of all topics are connected together in one big "component"
- the joint degree distribution (JDD): assortative means popular nodes hang out with popular nodes and unpopular ones with unpopular ones; disassortative means popular nodes hang out with unpopular ones. This network is mildly disassortative: large, well-connected, general topics tend to be linked to smaller, more specific topics.
- the clustering coefficient (CC): the CC measures the probability that any two of my friends are also friends with each other, given that they are my friends. It decreases steeply with the number of links a topic has, meaning small topics are tightly knit while a big topic's neighbors are only loosely connected to each other. This again suggests the network is hierarchical.
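A sketch of that CC-versus-degree check (NetworkX's clustering is defined for undirected graphs, so a projection of the topic network would be used; the random graph here only stands in to show the procedure):

```python
import networkx as nx
from collections import defaultdict

G = nx.gnp_random_graph(200, 0.05, seed=1)  # stand-in for the (undirected) topic network

clustering = nx.clustering(G)  # per-node clustering coefficient
by_degree = defaultdict(list)
for node, cc in clustering.items():
    by_degree[G.degree(node)].append(cc)

# A steep drop of mean CC as degree grows is the hierarchy signal described above
for k in sorted(by_degree):
    print(k, sum(by_degree[k]) / len(by_degree[k]))
```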
Topic Clustering
We can express the topics through hierarchical topic clustering. The algorithm is as follows:
- 1. Create a list of empty trees with each topic as the root
- 2. Find the topic with the largest total outdegree in the topic network
- 3. Add the topic, and its subtree, to the subtree of each topic it links to with weight
- 4. Remove the topic from the topic network
- 5. Go to step 2 until only N topics are left
The algorithm yields the list of surviving topics plus, for every node, the tree rooted at it, telling us how tightly the related topics are connected. With that information we can start from any topic and find related ones by walking up or down the hierarchy. The clustering uses a fuzzy method that allows a topic to have multiple parents, which turns out to be very useful. They were left with roughly 2,000 topics in this network. A sketch of the greedy loop follows.
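Here is that loop under my reading of the steps above (the post leaves out details such as the weight threshold for step 3, so min_weight is an assumed parameter):

```python
import networkx as nx

def cluster_topics(G: nx.DiGraph, n_roots: int, min_weight: float = 0.0) -> dict:
    """Greedily absorb the topic with the largest weighted out-degree into the
    subtrees of the topics it links to, until only n_roots topics remain."""
    g = G.copy()
    subtree = {t: [t] for t in g.nodes}  # step 1: every topic starts as its own tree
    while g.number_of_nodes() > n_roots:
        # step 2: topic with the largest total (weighted) out-degree
        topic = max(g.nodes, key=lambda t: g.out_degree(t, weight="weight"))
        # step 3: attach it, and its subtree, under every strong-enough parent
        # (fuzzy clustering: a topic may land under multiple parents)
        for _, parent, data in G.out_edges(topic, data=True):
            if parent in g and data.get("weight", 0.0) >= min_weight:
                subtree[parent].extend(subtree[topic])
        g.remove_node(topic)  # step 4
    return subtree            # step 5 is the while condition
```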