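Data mining has a common application: when customers buy a product, the retailer can take the opportunity to learn what else they want to buy, and then place products that most customers are willing to buy together next to each other to increase sales. Once the retailer has collected enough data, it can run an affinity analysis to determine which products are worth selling together.
What is affinity:
Affinity analysis determines how closely related samples (objects) are, based on the similarity between them. Typical applications include:
- Offering website users a wider range of services or showing them targeted advertising;
- Recommending movies or products to users and selling them related items;
- Finding people who are related by ancestry based on their genes.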
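Product recommendations:
Let's look at a simple product recommendation service. The idea behind it is easy to understand: if people have often bought two products together in the past, they are likely to buy them together again. The idea is simple, but it is the basis of many product recommendation services.
To keep the code simple, we only consider the case where two products are bought at a time. For example, someone goes to the supermarket and buys both bread and milk. As a data mining example, we want to find rules of the following form:
If a person buys product X, they are likely to buy product Y.
Rules involving more products are more complex, for example that customers who buy sausages and burgers are more likely than other customers to buy ketchup. We do not cover such rules here.
Loading the data: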
In [2]: import numpy as np
In [3]: path = 'D:\\books\\affinity_dataset.txt'
In [4]: data = np.loadtxt(path)
In [5]: n_samples,n_features = data.shape
In [6]: n_samples
Out[6]: 100
In [7]: n_features
Out[7]: 5
In [8]: print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
This dataset has 100 samples and 5 features
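As a side note on the loading step, np.loadtxt parses whitespace-separated numeric text into a float array, which is why the values are later printed as `0.` and `1.`. The sketch below illustrates this on a small in-memory stand-in. The three rows in `sample_text` are made up for illustration and are not the contents of affinity_dataset.txt.

```python
import io
import numpy as np

# Hypothetical stand-in for the file: three transactions, five items each.
sample_text = "0 0 1 1 1\n1 1 0 1 0\n1 0 1 1 0\n"

# np.loadtxt accepts a file-like object as well as a path.
toy_data = np.loadtxt(io.StringIO(sample_text))

n_rows, n_cols = toy_data.shape
print(n_rows, n_cols)   # 3 5
print(toy_data.dtype)   # float64
```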
# Look at the data
In [11]: print(data[:5])
[[0. 0. 1. 1. 1.]
[1. 1. 0. 1. 0.]
[1. 0. 1. 1. 0.]
[0. 0. 1. 1. 1.]
[0. 1. 0. 0. 1.]]
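The output can be read by row or by column. Each row is one transaction: the first row (0, 0, 1, 1, 1) lists the items in the first transaction. Each column represents one product. In our example the five products are bread, milk, cheese, apples, and bananas, in that order. So in the first transaction the customer bought cheese, apples, and bananas, but not bread or milk.
Each feature takes one of two values, 1 or 0, indicating whether a product was bought, not how many units were bought: 1 means the customer bought at least one unit of the product, and 0 means they bought none.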
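Because every entry is 0 or 1, summing a column gives the number of transactions that contain that product. Here is a minimal sketch of that idea, using a small made-up array; `toy_data` and `item_names` are names introduced only for this illustration.

```python
import numpy as np

# Made-up transactions for illustration (rows: purchases, columns: items).
item_names = ["bread", "milk", "cheese", "apples", "bananas"]
toy_data = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
])

# Check that every entry is binary before treating sums as counts.
is_binary = bool(np.isin(toy_data, [0, 1]).all())

# Column sums: number of transactions that include each item.
counts = dict(zip(item_names, toy_data.sum(axis=0).tolist()))
print(is_binary)
print(counts)
```

If the array held quantities instead of 0/1 flags, the column sums would count units sold rather than transactions, so the binary check is worth keeping.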
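Implementing a simple ranking of rules:
As mentioned earlier, we want to find rules of the form "if a customer buys product X, they are likely to buy product Y". The brute-force approach is to find every pair of products that were bought together in the dataset. Once we have the rules, we still need to judge how good each one is so that we can pick the useful ones.
There are many ways to measure the quality of a rule; two common ones are support and confidence:
- Support is the number of times the rule holds in the dataset, that is, the number of samples in which both the premise and the conclusion are true. It is sometimes normalized into a proportion, but this post uses the raw count;
- Confidence measures how accurate the rule is: among all samples that satisfy the premise (the "if" part of the rule), the proportion in which the conclusion also holds. To compute it, count how many times the rule holds and divide by the number of samples in which the premise holds.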
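The two definitions can be checked by hand on a tiny example. The sketch below uses six made-up transactions, each reduced to a pair of flags (bought X, bought Y); the data and variable names are assumptions for illustration only.

```python
# Made-up transactions as (bought_x, bought_y) pairs, for illustration only.
transactions = [(1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (1, 1)]

# Premise count: transactions in which X was bought.
premise_count = sum(1 for x, y in transactions if x == 1)

# Support (raw count): transactions in which both X and Y were bought.
support = sum(1 for x, y in transactions if x == 1 and y == 1)

# Confidence: share of the premise transactions in which Y was also bought.
confidence = support / premise_count

print(premise_count, support, confidence)  # 4 3 0.75
```

Note that the transaction (0, 1) affects neither number: a customer who did not buy X says nothing about the rule "X implies Y".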
Let's work through an example of computing support and confidence, using the rule "if a customer buys apples, they also buy bananas".
In [12]: features = ['bread', 'milk', 'cheese', 'apples', 'bananas']
In [13]: num_apple_purchases = 0
# First, how many rows contain our premise: that a person is buying apples
In [14]: for sample in data:
    ...:     if sample[3] == 1:  # this person bought apples
    ...:         num_apple_purchases += 1
    ...:
In [15]: print("{0} people bought Apples".format(num_apple_purchases))
36 people bought Apples
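Similarly, checking whether sample[4] is 1 tells us whether the customer bought bananas.
Eventually we need statistics for every rule in the dataset. For that we will create one dictionary for the cases where a rule holds and one for the cases where it does not. The dictionary keys are (premise, conclusion) tuples whose elements are indices into the feature list, not the feature names. For the single apples-to-bananas rule, two plain counters are enough: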
In [16]: rule_valid = 0
In [17]: rule_invalid = 0
In [19]: for sample in data:
    ...:     if sample[3] == 1:  # this person bought apples
    ...:         if sample[4] == 1:  # this person bought both apples and bananas
    ...:             rule_valid += 1
    ...:         else:
    ...:             rule_invalid += 1
    ...:
In [20]: print("{0} cases of the rule being valid were discovered".format(rule_valid))
21 cases of the rule being valid were discovered
In [21]: print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
15 cases of the rule being invalid were discovered
Now we can compute the support and confidence:
# Now we have all the information needed to compute Support and Confidence
In [22]: support = rule_valid # The Support is the number of times the rule is discovered.
In [23]: confidence = rule_valid / num_apple_purchases
In [24]: print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
The support is 21 and the confidence is 0.583.
# Confidence can be thought of as a percentage using the following:
In [25]: print("As a percentage, that is {0:.1f}%.".format(100 * confidence))
As a percentage, that is 58.3%.
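The same two numbers can also be computed without an explicit loop by using NumPy boolean masks. The following is a sketch under assumptions: `rule_stats` is a hypothetical helper, not part of the original code, and `toy_data` holds only the five transactions printed earlier by `print(data[:5])`, so its results differ from those of the full dataset.

```python
import numpy as np

def rule_stats(data, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion.

    Hypothetical helper: `data` is a 0/1 array and the other two
    arguments are column indices.
    """
    bought_premise = data[:, premise] == 1
    bought_both = bought_premise & (data[:, conclusion] == 1)
    support = int(bought_both.sum())
    n_premise = int(bought_premise.sum())
    # Guard against a premise that never occurs in the data.
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

# The first five transactions shown earlier, not the full dataset.
toy_data = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
], dtype=float)

# Rule: apples (column 3) -> bananas (column 4).
support, confidence = rule_stats(toy_data, 3, 4)
print(support, confidence)  # 2 0.5
```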
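To compute the confidence and support of every rule, we first create a few dictionaries to hold the results. We use defaultdict here, so that a key that has not been seen yet starts from a default value (0 for int) instead of raising a KeyError.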
from collections import defaultdict
# Now compute for all possible rules (`data` is the array loaded earlier)
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
for sample in data:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
Rule: If a person buys bread they will also buy milk
- Confidence: 0.519
- Support: 14
Rule: If a person buys milk they will also buy cheese
- Confidence: 0.152
- Support: 7
Rule: If a person buys apples they will also buy cheese
- Confidence: 0.694
- Support: 25
Rule: If a person buys milk they will also buy apples
- Confidence: 0.196
- Support: 9
Rule: If a person buys bread they will also buy apples
- Confidence: 0.185
- Support: 5
Rule: If a person buys apples they will also buy bread
- Confidence: 0.139
- Support: 5
Rule: If a person buys apples they will also buy bananas
- Confidence: 0.583
- Support: 21
Rule: If a person buys apples they will also buy milk
- Confidence: 0.250
- Support: 9
Rule: If a person buys milk they will also buy bananas
- Confidence: 0.413
- Support: 19
Rule: If a person buys cheese they will also buy bananas
- Confidence: 0.659
- Support: 27
Rule: If a person buys cheese they will also buy bread
- Confidence: 0.098
- Support: 4
Rule: If a person buys cheese they will also buy apples
- Confidence: 0.610
- Support: 25
Rule: If a person buys cheese they will also buy milk
- Confidence: 0.171
- Support: 7
Rule: If a person buys bananas they will also buy apples
- Confidence: 0.356
- Support: 21
Rule: If a person buys bread they will also buy bananas
- Confidence: 0.630
- Support: 17
Rule: If a person buys bananas they will also buy cheese
- Confidence: 0.458
- Support: 27
Rule: If a person buys milk they will also buy bread
- Confidence: 0.304
- Support: 14
Rule: If a person buys bananas they will also buy milk
- Confidence: 0.322
- Support: 19
Rule: If a person buys bread they will also buy cheese
- Confidence: 0.148
- Support: 4
Rule: If a person buys bananas they will also buy bread
- Confidence: 0.288
- Support: 17
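For 0/1 data, the pair counts gathered by the nested loops above can also be obtained with a single matrix product. This is an alternative sketch, not the approach used in this post. It again uses only the five transactions printed earlier, and the names `co_counts`, `occurrences`, and `confidence_matrix` are introduced here for illustration.

```python
import numpy as np

# The first five transactions shown earlier, not the full dataset.
toy_data = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# Entry [i, j] counts the transactions that contain both item i and item j.
# The diagonal holds how often each item was bought at all.
co_counts = toy_data.T @ toy_data
occurrences = np.diag(co_counts)

# confidence_matrix[i, j] = support(i -> j) / occurrences[i].
# Assumes every item was bought at least once; otherwise this divides by zero.
confidence_matrix = co_counts / occurrences[:, np.newaxis]

print(co_counts[3, 4])          # support of apples -> bananas: 2
print(confidence_matrix[3, 4])  # confidence of apples -> bananas: 0.5
```

The diagonal of `confidence_matrix` is always 1 and corresponds to the X -> X rules that the loop version skips.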
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)
Rule: If a person buys milk they will also buy apples
- Confidence: 0.196
- Support: 9
# Inspect the support dictionary before sorting
from pprint import pprint
pprint(list(support.items()))
[((0, 1), 14),
((1, 2), 7),
((3, 2), 25),
((1, 3), 9),
((0, 2), 4),
((3, 0), 5),
((4, 1), 19),
((3, 1), 9),
((1, 4), 19),
((2, 4), 27),
((2, 0), 4),
((2, 3), 25),
((2, 1), 7),
((4, 3), 21),
((0, 4), 17),
((4, 2), 27),
((1, 0), 14),
((3, 4), 21),
((0, 3), 5),
((4, 0), 17)]
Sorting by support:
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
Rule #1
Rule: If a person buys cheese they will also buy bananas
- Confidence: 0.659
- Support: 27
Rule #2
Rule: If a person buys bananas they will also buy cheese
- Confidence: 0.458
- Support: 27
Rule #3
Rule: If a person buys apples they will also buy cheese
- Confidence: 0.694
- Support: 25
Rule #4
Rule: If a person buys cheese they will also buy apples
- Confidence: 0.610
- Support: 25
Rule #5
Rule: If a person buys bananas they will also buy apples
- Confidence: 0.356
- Support: 21
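The same pattern can rank rules by confidence instead of support. The sketch below is self-contained: `confidence_subset` is a hypothetical dictionary holding only a handful of the confidence values copied from the rule listing above, so the ranking covers just those five rules.

```python
from operator import itemgetter

features = ["bread", "milk", "cheese", "apples", "bananas"]

# A few (premise, conclusion) -> confidence values copied from the output above.
confidence_subset = {
    (0, 1): 0.519,  # bread -> milk
    (3, 2): 0.694,  # apples -> cheese
    (2, 4): 0.659,  # cheese -> bananas
    (4, 2): 0.458,  # bananas -> cheese
    (0, 4): 0.630,  # bread -> bananas
}

# Sort the (rule, confidence) pairs by the confidence value, highest first.
sorted_confidence = sorted(confidence_subset.items(), key=itemgetter(1), reverse=True)

top_rules = [(features[p], features[c], value) for (p, c), value in sorted_confidence[:3]]
for premise_name, conclusion_name, value in top_rules:
    print("{0} -> {1}: {2:.3f}".format(premise_name, conclusion_name, value))
```

Ranking by confidence and ranking by support can disagree: in the output above, bananas -> cheese shares the highest support (27) but has a confidence of only 0.458.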