Association analysis is used to discover relationships between the different products users buy, for example a strong correlation between product A and product B. It is commonly used in recommendation systems for physical stores and online e-commerce: if a customer buys product A, they are likely to also buy product B. By mining large volumes of sales data for product combinations that are frequently bought together, we can understand users' purchasing behavior, recommend related products alongside the ones being sold, and find new growth points for sales.
一胸蛛、數(shù)據(jù)來(lái)源及說(shuō)明
https://tianchi.aliyun.com/dataset/dataDetail?dataId=46&userId=1
From this dataset, this article selects 12,256,906 behavior records from 10,000 randomly sampled users, covering November 18, 2014 to December 18, 2014. Each row of the dataset is one user-behavior record, with 6 columns.
The columns are:
user_id: user identity
item_id: item ID
behavior_type: type of user behavior (click, favorite, add-to-cart, purchase, encoded as 1, 2, 3, 4)
user_geohash: geographic location (contains null values)
item_category: category ID (the category the item belongs to)
time: when the behavior occurred
The data layout is shown below. This analysis uses only the user_id and item_id fields.
二津肛、提出問(wèn)題
1章喉、 用戶(hù)購(gòu)買(mǎi)哪種商品次數(shù)最多
2、 用戶(hù)購(gòu)買(mǎi)的商品中,哪些商品組合關(guān)聯(lián)度高
三秸脱、數(shù)據(jù)清洗和構(gòu)建模型
關(guān)聯(lián)分析算法常用Apriori算法和FP-growth算法
(一) Apriori算法
1落包、Apriori算法基本原理
Apriori算法是經(jīng)典的挖掘頻繁項(xiàng)集和關(guān)聯(lián)規(guī)則的數(shù)據(jù)挖掘算法,可以從大規(guī)模數(shù)據(jù)集中尋找物品間的隱含關(guān)系摊唇。其核心思想是通過(guò)連接產(chǎn)生候選項(xiàng)及其支持度咐蝇,然后通過(guò)剪枝生成頻繁項(xiàng)集。
項(xiàng)集:包含0個(gè)或者多個(gè)項(xiàng)的集合稱(chēng)為項(xiàng)集巷查。在購(gòu)物藍(lán)事務(wù)中有序,每一樣商品就是一個(gè)項(xiàng),一次購(gòu)買(mǎi)行為包含了多個(gè)項(xiàng)岛请,把其中的項(xiàng)組合起來(lái)就構(gòu)成了項(xiàng)集
支持度計(jì)數(shù):項(xiàng)集在事務(wù)中出現(xiàn)的次數(shù)
頻繁項(xiàng)集:經(jīng)常出現(xiàn)在一塊的物品的集合
關(guān)聯(lián)規(guī)則:暗示兩種物品之間可能存在很強(qiáng)的關(guān)系
Support(支持度):表示同時(shí)包含 A 和 B 的事務(wù)占所有事務(wù)的比例旭寿。如果用 P(A) 表示包含 A 的事務(wù)的比例,那么 Support = P(A & B)
Confidence(可信度/置信度):表示包含 A 的事務(wù)中同時(shí)包含 B 的事務(wù)的比例崇败,即同時(shí)包含 A 和 B 的事務(wù)占包含 A 的事務(wù)的比例许师。公式表達(dá):Confidence = P(A & B)/ P(A)
Apriori算法兩個(gè)重要的定律:
定律1:如果一個(gè)集合是頻繁項(xiàng)集,則它的所有子集都是頻繁項(xiàng)集
定律2:如果一個(gè)集合不是頻繁項(xiàng)集僚匆,則它的所有超集都不是頻繁項(xiàng)集
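These definitions can be checked with a tiny worked example. The baskets and helper functions below are illustrative only, not part of the analysis dataset:

```python
# Five hypothetical transactions
transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers"},
    {"milk", "bread", "diapers"},
    {"bread", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(A & B) / P(A): of the transactions containing A, how many also contain B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"milk", "bread"}, transactions))       # 0.4  (2 of 5 baskets)
print(confidence({"milk"}, {"bread"}, transactions))  # 0.666... (2 of the 3 milk baskets)
```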
2微渠、Apriori算法實(shí)現(xiàn)基本過(guò)程如下:
1) 創(chuàng)建初始候選集,篩選候選項(xiàng)1項(xiàng)集
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

## Method 1: hand-written Apriori
def create_c1(data_set):
    """Create the initial candidate set of 1-itemsets."""
    print('Creating initial candidate 1-itemsets')
    c1 = set()
    for items in data_set:
        for item in items:
            # frozenset() returns an immutable set which, unlike set,
            # is hashable and can be used as a dict key or set member
            item_set = frozenset([item])
            c1.add(item_set)
    return c1
2) Select from the candidate itemsets the frequent itemsets that meet the minimum support, and compute their supports
def generate_freq_supports(data_set, item_set, min_support):
    """Select the frequent itemsets from the candidates and compute supports."""
    print('Filtering frequent itemsets and computing supports')
    freq_set = set()  # the frequent itemsets
    item_count = {}   # occurrence counts, used to compute supports
    supports = {}     # the supports
    # Count each candidate itemset that appears in a transaction
    for record in data_set:
        for item in item_set:
            # issubset() tests whether every element of `item` is in `record`
            if item.issubset(record):
                if item not in item_count:
                    item_count[item] = 1
                else:
                    item_count[item] += 1
    data_len = float(len(data_set))
    # Compute each itemset's support and keep those above the threshold
    for item in item_count:
        if (item_count[item] / data_len) >= min_support:
            freq_set.add(item)
            supports[item] = item_count[item] / data_len
    return freq_set, supports
3) Generate new candidate itemsets from the current frequent itemsets
def generate_new_combinations(freq_set, k):
    """
    Generate new candidate itemsets of size k from the current frequent itemsets.
    Arguments: the set of frequent itemsets freq_set and the target itemset size k.
    """
    print('Generating new combinations')
    new_combinations = set()       # the new candidates
    sets_len = len(freq_set)       # number of frequent itemsets, for pairwise iteration
    freq_set_list = list(freq_set) # a list view, so the sets can be indexed
    for i in range(sets_len):
        for j in range(i + 1, sets_len):
            l1 = list(freq_set_list[i])
            l2 = list(freq_set_list[j])
            l1.sort()
            l2.sort()
            # Merge two itemsets when their first k-2 items agree
            if l1[0:k - 2] == l2[0:k - 2]:
                freq_item = freq_set_list[i] | freq_set_list[j]
                new_combinations.add(freq_item)
    return new_combinations
4) Loop: generate candidate sets and compute their supports
def apriori(data_set, min_support, max_len=None):
    """Generate candidate sets in a loop and compute their supports."""
    print('Looping over candidate generations')
    max_items = 2   # size of the itemsets to generate next
    freq_sets = []  # all frequent itemsets, grouped by size
    supports = {}   # all supports
    # Candidate 1-itemsets
    c1 = set()
    for items in data_set:
        for item in items:
            item_set = frozenset([item])
            c1.add(item_set)
    # Frequent 1-itemsets and their supports
    l1, support1 = generate_freq_supports(data_set, c1, min_support)
    freq_sets.append(l1)
    supports.update(support1)
    if max_len is None:
        max_len = float('inf')
    while max_items and max_items <= max_len:
        # Generate candidates of the next size
        ci = generate_new_combinations(freq_sets[-1], max_items)
        # Keep the frequent ones and their supports
        li, support = generate_freq_supports(data_set, ci, min_support)
        # Continue only while new frequent itemsets keep appearing
        if li:
            freq_sets.append(li)
            supports.update(support)
            max_items += 1
        else:
            max_items = 0
    return freq_sets, supports
5) Mine association rules from the frequent itemsets, keeping those that meet the minimum confidence
def association_rules(freq_sets, supports, min_conf):
    """Generate association rules."""
    print('Generating association rules')
    rules = []
    max_len = len(freq_sets)
    # For every frequent k-itemset contained in a (k+1)-itemset, compute the
    # confidence and keep the rule if it meets the minimum confidence
    for k in range(max_len - 1):
        for freq_set in freq_sets[k]:
            for sub_set in freq_sets[k + 1]:
                if freq_set.issubset(sub_set):
                    frq = supports[sub_set]
                    conf = supports[sub_set] / supports[freq_set]
                    rule = (freq_set, sub_set - freq_set, frq, conf)
                    if conf >= min_conf:
                        print(freq_set, "-->", sub_set - freq_set, 'frq:', frq, 'conf:', conf)
                        rules.append(rule)
    return rules
First, test the algorithm on toy data, with the minimum support set to 0.5 and the minimum confidence set to 0.7:
if __name__ == '__main__':
    # Build the test data
    dic = {'user_id': [111, 111,
                       112, 112, 112, 112,
                       113, 113, 113, 113,
                       114, 114, 114, 114,
                       115, 115, 115, 115],
           'item_id': ['soy milk', 'lettuce',
                       'lettuce', 'diapers', 'wine', 'beets',
                       'soy milk', 'diapers', 'wine', 'orange juice',
                       'lettuce', 'soy milk', 'diapers', 'wine',
                       'lettuce', 'soy milk', 'diapers', 'orange juice']}
    data = pd.DataFrame(dic)
    # Association rules ignore repeat purchases of one item, so drop duplicates
    data = data.drop_duplicates()
    data_set = []
    # Group by user: merge each user's items into one transaction, and filter
    # out users with a single item, which carry no association information
    groups = data.groupby(by='user_id')
    for group in groups:
        if len(group[1]) >= 2:
            data_set.append(group[1]['item_id'].tolist())
    # L holds the frequent itemsets, support_data the supports
    L, support_data = apriori(data_set, min_support=0.5)
    rules = association_rules(L, support_data, min_conf=0.7)
    # print('Association rules:\n', rules)
The output is as follows:
The same analysis can also be done with the mlxtend package:
# Method 2: mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Build the test data
dic = {'user_id': [111, 111,
                   112, 112, 112, 112,
                   113, 113, 113, 113,
                   114, 114, 114, 114,
                   115, 115, 115, 115],
       'item_id': ['soy milk', 'lettuce',
                   'lettuce', 'diapers', 'wine', 'beets',
                   'soy milk', 'diapers', 'wine', 'orange juice',
                   'lettuce', 'soy milk', 'diapers', 'wine',
                   'lettuce', 'soy milk', 'diapers', 'orange juice']}
data = pd.DataFrame(dic)
# Association rules ignore repeat purchases of one item, so drop duplicates
data = data.drop_duplicates()
data_set = []
# Group by user, as above
groups = data.groupby(by='user_id')
for group in groups:
    if len(group[1]) >= 2:
        data_set.append(group[1]['item_id'].tolist())

# One-hot encode the transactions, mine the frequent itemsets, then the rules
te = TransactionEncoder()
te_ary = te.fit(data_set).transform(data_set)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.3)
print('Association rules:\n', rules)
The toy data confirms that the Apriori implementation works. Next, read the Taobao user-behavior data and analyze the associations between purchased item combinations, with the minimum support set to 0.05 and the minimum confidence set to 0.3:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd

## Method 1, applied to the real data. The function definitions (candidate
## generation, frequent-itemset filtering, apriori, and association_rules)
## are identical to the ones listed above.
if __name__ == '__main__':
    # Read the Taobao user-behavior data
    data = pd.read_csv(r'D:\關(guān)聯(lián)分析\關(guān)聯(lián)分析2\tianchi_mobile_recommend_train_user.zip', encoding='ansi')
    data = data[['user_id', 'item_id']]
    # Association rules ignore repeat purchases of one item, so drop duplicates
    data = data.drop_duplicates()
    data_set = []
    # Group by user: merge each user's items into one transaction, and filter
    # out users with a single item, which carry no association information
    groups = data.groupby(by='user_id')
    for group in groups:
        if len(group[1]) >= 2:
            data_set.append(group[1]['item_id'].tolist())
    # L holds the frequent itemsets, support_data the supports
    L, support_data = apriori(data_set, min_support=0.05)
    rules = association_rules(L, support_data, min_conf=0.3)
    # print('Association rules:\n', rules)
The program ran for more than half an hour without producing a result. Apriori uses several nested for loops, and every update of the frequent itemsets rescans the entire dataset, so it is inefficient on large data. Because of this performance limit, we turn to the FP-growth algorithm to find the frequent itemsets.
(2) The FP-growth algorithm
1. How the FP-growth algorithm works
FP-growth builds on Apriori but uses a more sophisticated data structure to reduce the number of scans, which speeds the algorithm up considerably. FP-growth scans the database only twice, whereas Apriori scans the dataset for every potential frequent itemset to decide whether a given pattern is frequent, so FP-growth is much faster than Apriori.
Pros: generally faster than Apriori.
Cons: harder to implement; performance degrades on some datasets.
Applicable data type: discrete (nominal) values.
This introduces one more metric: lift.
Lift: the ratio of "the proportion of transactions containing A that also contain B" to "the proportion of transactions containing B". As a formula: Lift = (P(A & B) / P(A)) / P(B).
Lift reflects the correlation between A and B in a rule: lift > 1 (and higher) means stronger positive correlation, lift < 1 (and lower) means stronger negative correlation, and lift = 1 means A and B are uncorrelated.
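As a quick sanity check on the formula, with made-up counts (1000 transactions; A in 200 of them, B in 250, A and B together in 80):

```python
# Hypothetical counts illustrating lift
total = 1000
p_a = 200 / total        # P(A) = 0.2
p_b = 250 / total        # P(B) = 0.25
p_ab = 80 / total        # P(A & B) = 0.08
confidence = p_ab / p_a  # P(A & B) / P(A) = 0.4
lift = confidence / p_b  # 0.4 / 0.25 = 1.6 > 1: A and B are positively correlated
print(lift)              # 1.6
```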
2. The basic steps of the FP-growth algorithm:
(Original write-up: https://blog.csdn.net/youhuakongzhi/article/details/87943503)
1) Scan the data once and count every item. Drop the items whose support is below the threshold, put the remaining frequent 1-itemsets into the header table, and sort them by descending support.
2) Scan the data a second time: remove the infrequent items from each transaction and sort the surviving items by descending support. These two scans are all that is needed to build the header table.
3) Build the FP-tree. Insert each sorted transaction into the tree in that order, so higher-ranked items become parent nodes and lower-ranked items their children. When a transaction shares a prefix with an existing path, increment the count of the shared parent node instead of creating a new one. Whenever a new node is created, link it into the node list of the corresponding header-table entry. Once every transaction has been inserted, the FP-tree is complete.
4) Mine the frequent itemsets. Starting from the bottom of the header table and working upward, find each item's conditional pattern base, recursively mine it to obtain that item's frequent itemsets, and return the corresponding node counts.
5) If the number of items per itemset is not limited, return all the frequent itemsets from step 4; otherwise return only those of the requested size.
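Steps 1) and 2) can be sketched in a few lines. The transactions below are a toy dataset often used to illustrate FP-growth, and the minimum support count of 3 is an assumption for illustration:

```python
from collections import Counter

transactions = [["r", "z", "h", "j", "p"],
                ["z", "y", "x", "w", "v", "u", "t", "s"],
                ["z"],
                ["r", "x", "n", "o", "s"],
                ["y", "r", "x", "z", "q", "t", "p"],
                ["y", "z", "x", "e", "q", "s", "t", "m"]]
min_support = 3  # minimum support count

# Step 1: count every item and keep only the frequent ones
counts = Counter(item for t in transactions for item in t)
frequent = {item for item, c in counts.items() if c >= min_support}

# Step 2: strip infrequent items from each transaction and re-sort
# the survivors by descending frequency (ties broken alphabetically)
ordered = [
    sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
    for t in transactions
]
print(ordered[0])  # ['z', 'r'] -- only z (count 5) and r (count 3) survive
```

These ordered transactions are what gets inserted into the FP-tree in step 3.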
Here I use the FP-growth algorithm to mine the frequent itemsets, with the whole algorithm packaged in an fp_growth module.
fp_growth.py:
# encoding: utf-8
"""
fp-growth generates frequent itemsets using the FP-tree data structure;
the whole process scans the dataset only twice.
"""
from collections import defaultdict, namedtuple
# defaultdict fills in a default value for missing keys instead of raising
# KeyError; namedtuple builds tuple subclasses whose fields can be accessed
# by name, which improves readability.


def find_frequent_itemsets(transactions, minimum_support, include_support=False):
    """
    Mine frequent itemsets, yielding each itemset and (optionally) its
    support count.
    """
    items = defaultdict(lambda: 0)  # mapping from items to their supports

    # Load the passed-in transactions and count the support that individual
    # items have.
    for transaction in transactions:
        for item in transaction:
            items[item] += 1

    # Remove infrequent items from the item support dictionary.
    items = dict((item, support) for item, support in items.items()
                 if support >= minimum_support)

    # Build our FP-tree. Before any transactions can be added to the tree, they
    # must be stripped of infrequent items and their surviving items must be
    # sorted in decreasing order of frequency.
    def clean_transaction(transaction):
        transaction = filter(lambda v: v in items, transaction)
        # Materialize the filter object so it can be sorted in place
        transaction_list = list(transaction)
        transaction_list.sort(key=lambda v: items[v], reverse=True)
        return transaction_list

    master = FPTree()
    for transaction in map(clean_transaction, transactions):
        master.add(transaction)

    def find_with_suffix(tree, suffix):
        for item, nodes in tree.items():
            support = sum(n.count for n in nodes)
            if support >= minimum_support and item not in suffix:
                # New winner!
                found_set = [item] + suffix
                yield (found_set, support) if include_support else found_set

                # Build a conditional tree and recursively search for frequent
                # itemsets within it.
                cond_tree = conditional_tree_from_paths(tree.prefix_paths(item))
                for s in find_with_suffix(cond_tree, found_set):
                    yield s  # pass along the good news to our caller

    # Search for frequent itemsets, and yield the results we find.
    for itemset in find_with_suffix(master, []):
        yield itemset
class FPTree(object):
    """
    An FP tree.
    All of the items must be hashable (usable as dict keys or set members).
    """
    Route = namedtuple('Route', 'head tail')

    def __init__(self):
        # The root node of the tree.
        self._root = FPNode(self, None, None)
        # A dictionary mapping items to the head and tail of a path of
        # "neighbors" that will hit every node containing that item.
        self._routes = {}

    @property
    def root(self):
        """The root node of the tree."""
        return self._root

    def add(self, transaction):
        """Add a transaction to the tree."""
        point = self._root
        for item in transaction:
            next_point = point.search(item)
            if next_point:
                # There is already a node in this tree for the current
                # transaction item; reuse it.
                next_point.increment()
            else:
                # Create a new point and add it as a child of the point we're
                # currently looking at.
                next_point = FPNode(self, item)
                point.add(next_point)
                # Update the route of nodes that contain this item to include
                # our new node.
                self._update_route(next_point)
            point = next_point

    def _update_route(self, point):
        """Add the given node to the route through all nodes for its item."""
        assert self is point.tree
        try:
            route = self._routes[point.item]
            route[1].neighbor = point  # route[1] is the tail
            self._routes[point.item] = self.Route(route[0], point)
        except KeyError:
            # First node for this item; start a new route.
            self._routes[point.item] = self.Route(point, point)

    def items(self):
        """
        Generate one 2-tuple for each item represented in the tree. The first
        element of the tuple is the item itself, and the second element is a
        generator that will yield the nodes in the tree that belong to the item.
        """
        for item in self._routes.keys():
            yield (item, self.nodes(item))

    def nodes(self, item):
        """
        Generate the sequence of nodes that contain the given item.
        """
        try:
            node = self._routes[item][0]
        except KeyError:
            return
        while node:
            yield node
            node = node.neighbor

    def prefix_paths(self, item):
        """Generate the prefix paths that end with the given item."""
        def collect_path(node):
            path = []
            while node and not node.root:
                path.append(node)
                node = node.parent
            path.reverse()
            return path
        return (collect_path(node) for node in self.nodes(item))

    def inspect(self):
        print('Tree:')
        self.root.inspect(1)
        print()
        print('Routes:')
        for item, nodes in self.items():
            print('  %r' % item)
            for node in nodes:
                print('    %r' % node)
def conditional_tree_from_paths(paths):
    """Build a conditional FP tree from the given prefix paths."""
    tree = FPTree()
    condition_item = None
    items = set()

    # Import the nodes in the paths into the new tree. Only the counts of the
    # leaf nodes matter; the remaining counts will be reconstructed from the
    # leaf counts.
    for path in paths:
        if condition_item is None:
            condition_item = path[-1].item
        point = tree.root
        for node in path:
            next_point = point.search(node.item)
            if not next_point:
                # Add a new node to the tree.
                items.add(node.item)
                count = node.count if node.item == condition_item else 0
                next_point = FPNode(tree, node.item, count)
                point.add(next_point)
                tree._update_route(next_point)
            point = next_point
    assert condition_item is not None

    # Calculate the counts of the non-leaf nodes.
    for path in tree.prefix_paths(condition_item):
        count = path[-1].count
        for node in reversed(path[:-1]):
            node._count += count
    return tree
class FPNode(object):
    """A node in an FP tree."""

    def __init__(self, tree, item, count=1):
        self._tree = tree
        self._item = item
        self._count = count
        self._parent = None
        self._children = {}
        self._neighbor = None

    def add(self, child):
        """Add the given FPNode `child` as a child of this node."""
        if not isinstance(child, FPNode):
            raise TypeError("Can only add other FPNodes as children")
        if child.item not in self._children:
            self._children[child.item] = child
            child.parent = self

    def search(self, item):
        """
        Check whether this node contains a child node for the given item.
        If so, that node is returned; otherwise, `None` is returned.
        """
        try:
            return self._children[item]
        except KeyError:
            return None

    def __contains__(self, item):
        return item in self._children

    @property
    def tree(self):
        """The tree in which this node appears."""
        return self._tree

    @property
    def item(self):
        """The item contained in this node."""
        return self._item

    @property
    def count(self):
        """The count associated with this node's item."""
        return self._count

    def increment(self):
        """Increment the count associated with this node's item."""
        if self._count is None:
            raise ValueError("Root nodes have no associated count.")
        self._count += 1

    @property
    def root(self):
        """True if this node is the root of a tree; false if otherwise."""
        return self._item is None and self._count is None

    @property
    def leaf(self):
        """True if this node is a leaf in the tree; false if otherwise."""
        return len(self._children) == 0

    @property
    def parent(self):
        """The node's parent."""
        return self._parent

    @parent.setter
    def parent(self, value):
        if value is not None and not isinstance(value, FPNode):
            raise TypeError("A node must have an FPNode as a parent.")
        if value and value.tree is not self.tree:
            raise ValueError("Cannot have a parent from another tree.")
        self._parent = value

    @property
    def neighbor(self):
        """
        The node's neighbor; the one with the same value that is "to the right"
        of it in the tree.
        """
        return self._neighbor

    @neighbor.setter
    def neighbor(self, value):
        if value is not None and not isinstance(value, FPNode):
            raise TypeError("A node must have an FPNode as a neighbor.")
        if value and value.tree is not self.tree:
            raise ValueError("Cannot have a neighbor from another tree.")
        self._neighbor = value

    @property
    def children(self):
        """The nodes that are children of this node."""
        return tuple(self._children.values())  # Python 3: values(), not itervalues()

    def inspect(self, depth=0):
        print(('  ' * depth) + repr(self))
        for child in self.children:
            child.inspect(depth + 1)

    def __repr__(self):
        if self.root:
            return "<%s (root)>" % type(self).__name__
        return "<%s %r (%r)>" % (type(self).__name__, self.item, self.count)
if __name__ == '__main__':
    from optparse import OptionParser
    import csv

    p = OptionParser(usage='%prog data_file')
    p.add_option('-s', '--minimum-support', dest='minsup', type='int',
                 help='Minimum itemset support (default: 2)')
    p.add_option('-n', '--numeric', dest='numeric', action='store_true',
                 help='Convert the values in datasets to numerals (default: false)')
    p.set_defaults(minsup=2)
    p.set_defaults(numeric=False)

    options, args = p.parse_args()
    if len(args) < 1:
        p.error('must provide the path to a CSV file to read')

    transactions = []
    with open(args[0]) as database:
        for row in csv.reader(database):
            if options.numeric:
                transaction = []
                for item in row:
                    transaction.append(int(item))  # Python 3: int(), not long()
                transactions.append(transaction)
            else:
                transactions.append(row)

    result = []
    for itemset, support in find_frequent_itemsets(transactions, options.minsup, True):
        result.append((itemset, support))

    result = sorted(result, key=lambda i: i[0])
    for itemset, support in result:
        print(str(itemset) + ' ' + str(support))
6) Mine association rules from the frequent itemsets, keep those that meet the minimum confidence, and compute support and confidence.
Here I set the minimum support (as a raw count) to 50 and the minimum confidence to 0.3. The code is as follows:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import itertools  # building blocks for efficient iteration
import sys, time

sys.path.append(r"D:\關(guān)聯(lián)分析\關(guān)聯(lián)分析2")  # add the module's folder to the path if the import fails
import fp_growth as fpg  # the fp-growth implementation above

start = time.time()

def convertData(data):
    """Convert the data into one list of items per user."""
    # Association rules ignore repeat purchases of one item, so drop duplicates
    data = data.drop_duplicates()
    itemSets = []
    # Group by user: merge each user's items into one transaction, and filter
    # out users with a single item, which carry no association information
    groups = data.groupby(by=['user_id'])
    for group in groups:
        if len(group[1]) >= 2:
            itemSets.append(group[1]['item_id'].tolist())
    return itemSets

def generate_association_rules(patterns, total, min_confidence):
    """
    Generate association rules and compute support, confidence, and lift.
    """
    antecedent_list = []
    consequent_list = []
    support_list = []
    confidence_list = []
    lift_list = []
    count_antecedent = []
    p_antecedent = []
    count_consequent = []
    p_consequent = []
    count_ant_con = []
    for itemset in patterns.keys():
        upper_support = patterns[itemset]  # count(A & B)
        for i in range(1, len(itemset)):
            # itertools.combinations() yields every length-i subsequence of the
            # itemset, in the order the items appear in the input iterable
            for antecedent in itertools.combinations(itemset, i):
                antecedent = tuple(sorted(antecedent))
                consequent = tuple(sorted(set(itemset) - set(antecedent)))
                if antecedent in patterns:
                    lower_support = patterns[antecedent]       # count(A)
                    consequent_support = patterns[consequent]  # count(B)
                    p_lower_support = lower_support / total            # P(A)
                    p_consequent_support = consequent_support / total  # P(B)
                    support = round(float(upper_support) / total, 6)   # Support = P(A & B)
                    confidence = float(upper_support) / lower_support  # Confidence = P(A & B) / P(A)
                    lift = confidence / p_consequent_support           # Lift = (P(A & B) / P(A)) / P(B)
                    if confidence >= min_confidence:
                        antecedent_list.append(list(antecedent))
                        consequent_list.append(list(consequent))
                        support_list.append(support)
                        confidence_list.append(confidence)
                        lift_list.append(lift)
                        count_antecedent.append(lower_support)        # count(A)
                        p_antecedent.append(p_lower_support)          # P(A)
                        count_consequent.append(consequent_support)   # count(B)
                        p_consequent.append(p_consequent_support)     # P(B)
                        count_ant_con.append(upper_support)           # count(A & B)
    rules_col = {'antecedent': antecedent_list,
                 'consequent': consequent_list,
                 'count_antecedent': count_antecedent,
                 'antecedent_support': p_antecedent,
                 'count_consequent': count_consequent,
                 'consequent_support': p_consequent,
                 'count_ant_con': count_ant_con,
                 'support': support_list,
                 'confidence': confidence_list,
                 'lift': lift_list}
    rules = pd.DataFrame(rules_col)
    col = ['antecedent', 'consequent', 'support', 'confidence', 'lift']
    rules = rules[col]
    rules.sort_values(by=['support', 'confidence'], ascending=False, inplace=True)
    return rules
if __name__ == '__main__':
    # Load the data
    data = pd.read_csv(r'D:\關(guān)聯(lián)分析\關(guān)聯(lián)分析2\tianchi_mobile_recommend_train_user.zip', encoding='ansi')
    data = data[['user_id', 'item_id']]
    # Convert to transactions
    dataset = convertData(data)
    total = len(dataset)
    print('Total transactions:', total)
    # find_frequent_itemsets() generates the frequent itemsets and their counts.
    # minimum_support is a count: itemsets appearing at least this many times are kept.
    # include_support=True makes it yield (itemset, support) pairs instead of bare itemsets.
    frequent_itemsets = fpg.find_frequent_itemsets(dataset, minimum_support=50, include_support=True)
    result = []
    for itemset, support in frequent_itemsets:  # materialize the generator into a list
        result.append((itemset, support))
    result = sorted(result, key=lambda i: i[0])
    item_list = []
    itemset_list = []
    support_list = []
    for itemset, support in result:
        # print(str(itemset) + ' ' + str(support))  # itemset and its count
        item_list.append(itemset)          # kept as lists for the frequent-itemset output
        itemset = tuple(sorted(itemset))   # tuples are hashable, needed as dict keys below
        itemset_list.append(itemset)
        support_list.append(support)
    # Map each frequent itemset to its count
    patterns = dict(zip(itemset_list, support_list))
    print('Number of frequent itemsets:', len(patterns))
    # Generate association rules and compute support, confidence, and lift;
    # min_confidence is the minimum confidence
    rules = generate_association_rules(patterns, total, min_confidence=0.3)
    print('Association rules:\n', rules.head())
    print('Total rules:', len(rules))
    ## Write the results to two sheets of one Excel file
    # Frequent itemsets
    sup = {'item_id': item_list, 'frequency': support_list}
    sup = pd.DataFrame(sup)
    sup['support'] = round(sup['frequency'] / float(total), 6)
    sup.sort_values(by=['support'], ascending=False, inplace=True)
    sup_col = ['item_id', 'frequency', 'support']
    sup = sup[sup_col]
    writer = pd.ExcelWriter(r'D:\關(guān)聯(lián)分析\關(guān)聯(lián)分析2\result\fp-growth-result.xlsx')
    sup.to_excel(excel_writer=writer, sheet_name='support', index=False)
    # Association rules
    rules.to_excel(excel_writer=writer, sheet_name='rules', index=False)
    writer.save()  # flush the file to disk
    end = time.time()
    print('Running time: %s Seconds' % (end - start))
Partial screenshots of the run are shown below.
Frequent-itemset results:

IV. Conclusions
Business conclusions:
1. Sorting the frequent itemsets by support, item 112921337 is by far the most popular, well ahead of every other item.
2. Overall, the supports of all item combinations are low, because the platform sells a very large variety of items.
3. Sorting the combinations by descending support, [387911330] → [97655171] and [97655171] → [387911330] have the highest support, and [387911330] → [97655171] also has the highest confidence of the two: 44% of users who buy item 387911330 also buy item 97655171, so these two items are good candidates for bundled selling.
4. The rule with the highest confidence overall says that buyers of item 126902916 are most likely to also buy item 125666923, but this combination occurs with low support.
Technical conclusions:
1. For both Apriori and FP-Growth, choosing a very small minimum support makes the runtime grow sharply; the performance impact is large.
2. For large datasets, prefer FP-Growth when searching for frequent itemsets.
3. When using FP-Growth, start with a relatively large minimum support, typically around 1/10 of the number of transactions, where "transactions" means the transformed transaction dataset, not the rows of the raw table.
4. Choose the minimum support and confidence based on business knowledge to reach sound conclusions. Set the support too high and potentially key or meaningful rules are filtered out; set it too low and too many frequent itemsets and conditional FP-trees are generated, cluttering the results.
5. For data aggregation, avoid for loops where possible, especially ones nested two or more levels deep, which are a performance disaster. Where applicable, DataFrame groupby and pivot_table, and pandas merge, give a large performance boost.
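As a sketch of that last point, the per-user transaction building done with an explicit loop earlier in this article can be written as a single groupby chain (the data below is made up for illustration):

```python
import pandas as pd

data = pd.DataFrame({
    "user_id": [111, 111, 112, 112, 112, 113],
    "item_id": ["A", "B", "A", "C", "D", "E"],
})

# One groupby call collects each user's items into a list, and the .loc
# filter drops single-item users -- no explicit Python loops needed
baskets = (
    data.groupby("user_id")["item_id"]
        .apply(list)
        .loc[lambda s: s.str.len() >= 2]
)
print(baskets.tolist())  # [['A', 'B'], ['A', 'C', 'D']]
```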
Reference: https://blog.csdn.net/youhuakongzhi/article/details/87943503