Affinity Analysis

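A common application of data mining is retail: when customers buy one product, the retailer can take the opportunity to learn what else they want to buy, and then place products that most customers are willing to buy together next to each other to increase sales. Once the retailer has collected enough data, it can run an affinity analysis on it to determine which products are suited to being sold together.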

What is affinity:

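Affinity analysis determines how closely related sample individuals (objects) are, based on the similarity between them. Typical applications of affinity analysis include: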

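Providing website users with varied services or targeted advertising;
Recommending movies or products to users, and selling them related items;
Finding people who are related to each other based on their genes.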

Product recommendation:

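Let's look at a simple product recommendation service. The idea behind it is easy to understand: two products that people often bought together in the past are likely to be bought together again. Simple as it is, this idea is the basis of many product recommendation services.
To keep the code simple, we only consider the case where two products are bought at a time. For example, people go to the supermarket and buy both bread and milk. As a data mining example, we want to find rules of the following form: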

If a person buys product X, they are likely to buy product Y

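Rules involving more than two products are more complex, for example: customers who buy sausages and burgers are more likely than other customers to buy ketchup. We do not cover such rules here.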

Loading the data:

In [2]: import numpy as np

In [3]: path = 'D:\\books\\affinity_dataset.txt'

In [4]: data = np.loadtxt(path)

In [5]: n_samples,n_features = data.shape

In [6]: n_samples
Out[6]: 100

In [7]: n_features
Out[7]: 5

In [8]: print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
This dataset has 100 samples and 5 features
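
If the dataset file is not at hand, the loading step can still be tried on a few in-memory lines. The sketch below assumes the file is plain whitespace-separated 0/1 values, one transaction per line (which is what the output further down suggests); `sample_text` and `toy_data` are illustrative names, not part of the original session.

```python
import io
import numpy as np

# Hypothetical stand-in for affinity_dataset.txt: one transaction per line,
# one 0/1 flag per product, separated by whitespace.
sample_text = """\
0 0 1 1 1
1 1 0 1 0
1 0 1 1 0
"""

# np.loadtxt accepts a file-like object as well as a file path.
toy_data = np.loadtxt(io.StringIO(sample_text))
n_rows, n_cols = toy_data.shape
print("This toy dataset has {0} samples and {1} features".format(n_rows, n_cols))
```

As with the real file, `np.loadtxt` returns a floating-point array by default, which is why the values print as `0.` and `1.`.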

# Inspect the data
In [11]: print(data[:5])
[[0. 0. 1. 1. 1.]
 [1. 1. 0. 1. 0.]
 [1. 0. 1. 1. 0.]
 [0. 0. 1. 1. 1.]
 [0. 1. 0. 0. 1.]]

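The output can be read by row or by column. Reading across, one row at a time, the first row (0, 0, 1, 1, 1) shows the products contained in the first transaction. Reading down, each column represents one product. In our example the five products are bread, milk, cheese, apples, and bananas. From the first transaction we can see that the customer bought cheese, apples, and bananas, but not bread or milk.
Each feature has only two possible values, 1 or 0, indicating whether a product was bought, not how many units were bought: 1 means at least one unit of the product was bought, and 0 means the customer did not buy it.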
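
To make the row/column reading concrete, here is a minimal sketch that turns one row of flags back into product names. The helper `purchased_items` and the name `product_names` are introduced here for illustration only; the column order is the one stated in the text.

```python
import numpy as np

# Product names in column order, as described in the text.
product_names = ["bread", "milk", "cheese", "apples", "bananas"]

# The first transaction from the output shown above.
first_row = np.array([0., 0., 1., 1., 1.])

def purchased_items(row, names):
    """Return the names of the products whose flag is 1 in this row."""
    return [name for name, flag in zip(names, row) if flag == 1]

print(purchased_items(first_row, product_names))
# ['cheese', 'apples', 'bananas']
```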

Implementing a simple ranking of rules:

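As mentioned above, we want to find rules of the form "if a customer buys product X, they are likely to buy product Y". The brute-force approach is to find every pair of products that were bought together in the dataset. Once we have the rules, we still need to judge how good they are, so that we can pick the useful ones: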

There are many ways to measure the quality of a rule; two common ones are support and confidence.

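Support is the number of times a rule holds in the dataset. It is sometimes normalized by the total number of samples to give a proportion; the code below uses the raw count.
Confidence measures how accurate a rule is: among all samples that satisfy the premise (the "if" part of the rule), what fraction also satisfy the conclusion. To compute it, count the number of samples in which the rule holds, then divide by the number of samples in which the premise holds.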
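
The two definitions can be sketched as a small function over rows of 0/1 flags. This is only an illustration under the count-based definition of support used in this post; `support_and_confidence` and `toy_rows` are hypothetical names, and the four toy rows are made up for the example.

```python
def support_and_confidence(rows, premise, conclusion):
    """Count-based support and confidence for the rule premise -> conclusion."""
    premise_count = 0  # rows in which the premise item was bought
    rule_count = 0     # rows in which both premise and conclusion were bought
    for row in rows:
        if row[premise] == 1:
            premise_count += 1
            if row[conclusion] == 1:
                rule_count += 1
    if premise_count == 0:
        # The premise never occurs, so confidence is undefined; report 0.0 here.
        return 0, 0.0
    return rule_count, rule_count / premise_count

# Made-up data with two products: column 0 is the premise, column 1 the conclusion.
toy_rows = [
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
]
toy_support, toy_confidence = support_and_confidence(toy_rows, 0, 1)
print("support = {0}, confidence = {1:.3f}".format(toy_support, toy_confidence))
```

Here the premise occurs in three rows and the rule holds in two of them, so the support is 2 and the confidence is 2/3.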

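Next, we use an example to show how support and confidence are computed. Let's look at the support and confidence of the rule "if a customer buys apples, they also buy bananas":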

In [12]: features = ['bread','milk','cheese','apples','bananas']

In [13]: num_apple_purchases = 0

# First, how many rows contain our premise: that a person is buying apples
In [14]: for sample in data:
    ...:     if sample[3] == 1: #this person bought apples
    ...:         num_apple_purchases += 1
    ...:

In [15]: print("{0} people bought Apples".format(num_apple_purchases))
36 people bought Apples

Similarly, checking whether sample[4] is 1 tells us whether the customer bought bananas.

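We need these statistics for every rule in the dataset, which calls for two dictionaries: one for the cases where a rule holds and one for the cases where it does not. Each key is a (premise, conclusion) tuple whose elements are indices into the feature list rather than the actual feature names. For the single apples-and-bananas rule, two plain counters are enough: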

In [16]: rule_valid = 0

In [17]: rule_invalid = 0

In [19]: for sample in data:
    ...:     if sample[3] == 1:   #this person bought apples
    ...:         if sample[4] == 1:  #this person bought both apples and bananas
    ...:             rule_valid += 1
    ...:         else:
    ...:             rule_invalid += 1
    ...:

In [20]: print("{0} cases of the rule being valid were discovered".format(rule_valid))
21 cases of the rule being valid were discovered

In [21]: print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
15 cases of the rule being invalid were discovered

We can now compute the support and confidence:

# Now we have all the information needed to compute Support and Confidence
In [22]: support = rule_valid  # The Support is the number of times the rule is discovered.

In [23]: confidence = rule_valid / num_apple_purchases

In [24]: print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
The support is 21 and the confidence is 0.583.
# Confidence can be thought of as a percentage using the following:
In [25]: print("As a percentage, that is {0:.1f}%.".format(100 * confidence))
As a percentage, that is 58.3%.
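
The same counts can also be obtained without explicit loops by using NumPy boolean masks. The sketch below is an alternative formulation, not the approach used in the session above; it runs on the five rows printed earlier rather than the full dataset, so its numbers differ from the 21 and 0.583 shown above. All variable names are illustrative.

```python
import numpy as np

# The first five transactions printed earlier in this post.
rows = np.array([
    [0., 0., 1., 1., 1.],
    [1., 1., 0., 1., 0.],
    [1., 0., 1., 1., 0.],
    [0., 0., 1., 1., 1.],
    [0., 1., 0., 0., 1.],
])

bought_apples = rows[:, 3] == 1    # boolean mask: premise holds
bought_bananas = rows[:, 4] == 1   # boolean mask: conclusion holds

num_apples = int(bought_apples.sum())
mask_support = int((bought_apples & bought_bananas).sum())
mask_confidence = mask_support / num_apples

print("{0} apple buyers, support {1}, confidence {2:.3f}".format(
    num_apples, mask_support, mask_confidence))
# 4 apple buyers, support 2, confidence 0.500
```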

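To compute the confidence and support of all rules, we first create a few dictionaries to hold the results. We use defaultdict here, which supplies a default value when a missing key is accessed.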
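
As a quick illustration of why defaultdict is convenient for counting, the sketch below increments counters for keys that have never been seen before. The `counts` dictionary and its tuple keys are just examples.

```python
from collections import defaultdict

counts = defaultdict(int)  # a missing key starts at int(), i.e. 0
counts[(3, 4)] += 1        # no KeyError even though the key is new
counts[(3, 4)] += 1
counts[(3, 2)] += 1

print(dict(counts))
# {(3, 4): 2, (3, 2): 1}
```

With a plain dict, each `+= 1` on a new key would first need an explicit existence check or a call to `dict.get`.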

from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

for sample in data:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]

for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

Rule: If a person buys bread they will also buy milk
 - Confidence: 0.519
 - Support: 14

Rule: If a person buys milk they will also buy cheese
 - Confidence: 0.152
 - Support: 7

Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9

Rule: If a person buys bread they will also buy apples
 - Confidence: 0.185
 - Support: 5

Rule: If a person buys apples they will also buy bread
 - Confidence: 0.139
 - Support: 5

Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21

Rule: If a person buys apples they will also buy milk
 - Confidence: 0.250
 - Support: 9

Rule: If a person buys milk they will also buy bananas
 - Confidence: 0.413
 - Support: 19

Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27

Rule: If a person buys cheese they will also buy bread
 - Confidence: 0.098
 - Support: 4

Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25

Rule: If a person buys cheese they will also buy milk
 - Confidence: 0.171
 - Support: 7

Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.356
 - Support: 21

Rule: If a person buys bread they will also buy bananas
 - Confidence: 0.630
 - Support: 17

Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27

Rule: If a person buys milk they will also buy bread
 - Confidence: 0.304
 - Support: 14

Rule: If a person buys bananas they will also buy milk
 - Confidence: 0.322
 - Support: 19

Rule: If a person buys bread they will also buy cheese
 - Confidence: 0.148
 - Support: 4

Rule: If a person buys bananas they will also buy bread
 - Confidence: 0.288
 - Support: 17
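
For 0/1 data, the counts for all pairs can also be computed at once with a matrix product. This is a sketch of an alternative to the nested loops above, again run on the five rows printed earlier, so the numbers are not those of the full dataset; `co_counts`, `occurrences`, and `conf_matrix` are illustrative names.

```python
import numpy as np

# The first five transactions printed earlier in this post.
rows = np.array([
    [0., 0., 1., 1., 1.],
    [1., 1., 0., 1., 0.],
    [1., 0., 1., 1., 0.],
    [0., 0., 1., 1., 1.],
    [0., 1., 0., 0., 1.],
])

# co_counts[i, j] = number of rows in which both product i and product j were bought.
co_counts = rows.T @ rows

# The diagonal holds the number of rows containing each product.
occurrences = np.diag(co_counts)

# conf_matrix[i, j] = confidence of the rule i -> j.
# Every product occurs at least once in these rows, so there is no division by zero.
conf_matrix = co_counts / occurrences[:, None]

print("apples -> bananas: support {0:.0f}, confidence {1:.3f}".format(
    co_counts[3, 4], conf_matrix[3, 4]))
# apples -> bananas: support 2, confidence 0.500
```

The diagonal of `conf_matrix` corresponds to rules of the form X -> X and should be ignored, just as the loop version skips `premise == conclusion`.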


def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)
Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9

# Sort by support
from pprint import pprint
pprint(list(support.items()))
[((0, 1), 14),
 ((1, 2), 7),
 ((3, 2), 25),
 ((1, 3), 9),
 ((0, 2), 4),
 ((3, 0), 5),
 ((4, 1), 19),
 ((3, 1), 9),
 ((1, 4), 19),
 ((2, 4), 27),
 ((2, 0), 4),
 ((2, 3), 25),
 ((2, 1), 7),
 ((4, 3), 21),
 ((0, 4), 17),
 ((4, 2), 27),
 ((1, 0), 14),
 ((3, 4), 21),
 ((0, 3), 5),
 ((4, 0), 17)]

Sorting:

from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)

Rule #1
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27

Rule #2
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27

Rule #3
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25

Rule #4
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25

Rule #5
Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.356
 - Support: 21
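
The same pattern could be used to rank rules by confidence instead of support. The sketch below assumes a handful of confidence values copied from the output earlier in this post; `conf_subset`, `names`, and `ranked` are illustrative names, and only five rules are included to keep the example short.

```python
from operator import itemgetter

names = ["bread", "milk", "cheese", "apples", "bananas"]

# A few (premise, conclusion) -> confidence values copied from the output above,
# deliberately listed out of order.
conf_subset = {
    (4, 3): 0.356,
    (2, 4): 0.659,
    (0, 4): 0.630,
    (3, 2): 0.694,
    (2, 3): 0.610,
}

# Sort the (key, value) pairs by value, highest confidence first.
ranked = sorted(conf_subset.items(), key=itemgetter(1), reverse=True)

for rank, ((premise, conclusion), value) in enumerate(ranked, start=1):
    print("Rule #{0}: If a person buys {1} they will also buy {2} (confidence {3:.3f})".format(
        rank, names[premise], names[conclusion], value))
```

Note that the two rankings can disagree: "bananas -> apples" is in the top five by support above, but has the lowest confidence of the rules listed here.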

Last edited: 2018.06.04 11:07:53

