Apriori关联分析-Python实现
问题描述
这类问题常出现于确定某些商品之间的关联销售联系情况。比如,沃尔玛将啤酒与尿布一起销售,通常可以获得两个商品的更高销量。背后的原因是妻子要求丈夫为小孩买尿布,但丈夫同时也会去购买自己喜欢的啤酒等商品。
但从数据中还存在着更多的关联关系需要被发掘。关联规则学习就是来解决这些问题的。
关联规则学习(Association rule learning)是一种在大型数据库中发现变量之间的有趣性关系的方法。它的目的是利用一些有趣性的量度来识别数据库中发现的强规则。
这些关系可以有两种形式:频繁项集或者关联规则。
频繁项集(frequent item sets)是经常出现在一块的物品的集合
关联规则(association rules)暗示两种物品之间可能存在很强的关系。
关联规则学习基础概念
支持度与置信度
支持度
$$
support(X,Y)=P(X,Y)=\frac{number(X,Y)}{number(all sample)}
$$
置信度
$$
confidence(X⇒Y)=P(Y∣X)=\frac{P(X,Y)}{P(X)}=\frac{number(X,Y)}{number(X)}
$$
举个例子
交易编号 | 购买商品 |
---|---|
0 | 牛奶,洋葱,肉豆蔻,芸豆,鸡蛋,酸奶 |
1 | 莳萝,洋葱,肉豆蔻,芸豆,鸡蛋,酸奶 |
2 | 牛奶,苹果,芸豆,鸡蛋 |
3 | 牛奶,独角兽,玉米,芸豆,酸奶 |
4 | 玉米,洋葱,洋葱,芸豆,冰淇淋,鸡蛋 |
牛奶的支持度
$$
support(X,Y)=P(X,Y)=\frac{number(X,Y)}{number(all sample)}=3/5
$$
(牛奶->鸡蛋)的置信度
$$
confidence(X⇒Y)=P(Y∣X)=\frac{P(X,Y)}{P(X)}=\frac{number(X,Y)}{number(X)}
$$
$$
Confidence(牛奶->鸡蛋)=(牛奶,鸡蛋)一起购买的次数/鸡蛋的购买次数=2/4=0.5
$$
除了以上的两个概念,还有一个概念提升度(Lift):指当销售一个物品时,另一个物品销售率会增加多少。
$$
Lift(X⇒Y)=confidence(X⇒Y)/support(X)
Lift(牛奶->鸡蛋)=2/4/3/5=0.83
$$
当Lift大于1时,表示该物品X对商品Y的销售产生积极作用;提升度小于1那么意味着购买X反而会减少Y的销量。
Apriori算法
从原理上来看,求解支持度和置信度的公式是很简单的,遍历所有组合计算即可。但是,遍历的时间复杂度是$ 2^{N-1} $
Apriori算法则是找出频繁项集的高效算法。
某个项集是频繁的,那么它的所有子集也是频繁的。
如果一个项集是 非频繁项集,那么它的所有超集也是非频繁项集。
关于这两点,可以举个例子:如果A和B之间并不存在频繁项,则他们和C之间亦不存在。

如果{A,B}这个项集是非频繁的,那么{A,B}这个项集的超集,{A,B,C},{A,B,D}等等也都是非频繁的,这些就都可以忽略不去计算。
具体实现
首先是第三方库
from apyori import apriori
dataset = [['牛奶','洋葱','肉豆蔻','芸豆','鸡蛋','酸奶'],
['莳萝','洋葱','肉豆蔻','芸豆','鸡蛋','酸奶'],
['牛奶','苹果','芸豆','鸡蛋'],
['牛奶','独角兽','玉米','芸豆','酸奶'],
['玉米','洋葱','芸豆','冰淇淋','鸡蛋']]
rules = apriori(dataset, min_support=0.5)
for i in rules:
print(i)
RelationRecord(items=frozenset({'洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'洋葱'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'牛奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'牛奶'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'芸豆'}), support=1.0, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'酸奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'酸奶'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'鸡蛋'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'鸡蛋'}), confidence=0.8, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'洋葱'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'洋葱'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'鸡蛋', '洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'鸡蛋', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'洋葱'}), items_add=frozenset({'鸡蛋'}), confidence=1.0, lift=1.25), OrderedStatistic(items_base=frozenset({'鸡蛋'}), items_add=frozenset({'洋葱'}), confidence=0.7499999999999999, lift=1.2499999999999998)])
RelationRecord(items=frozenset({'芸豆', '牛奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '牛奶'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'牛奶'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'牛奶'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '酸奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '酸奶'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'酸奶'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'酸奶'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '鸡蛋'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '鸡蛋'}), confidence=0.8, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'鸡蛋'}), confidence=0.8, lift=1.0), OrderedStatistic(items_base=frozenset({'鸡蛋'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0)])
RelationRecord(items=frozenset({'芸豆', '鸡蛋', '洋葱'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'芸豆', '鸡蛋', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'洋葱'}), items_add=frozenset({'芸豆', '鸡蛋'}), confidence=1.0, lift=1.25), OrderedStatistic(items_base=frozenset({'芸豆'}), items_add=frozenset({'鸡蛋', '洋葱'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'鸡蛋'}), items_add=frozenset({'芸豆', '洋葱'}), confidence=0.7499999999999999, lift=1.2499999999999998), OrderedStatistic(items_base=frozenset({'芸豆', '洋葱'}), items_add=frozenset({'鸡蛋'}), confidence=1.0, lift=1.25), OrderedStatistic(items_base=frozenset({'鸡蛋', '洋葱'}), items_add=frozenset({'芸豆'}), confidence=1.0, lift=1.0), OrderedStatistic(items_base=frozenset({'芸豆', '鸡蛋'}), items_add=frozenset({'洋葱'}), confidence=0.7499999999999999, lift=1.2499999999999998)])
其次,自己实现
考虑的问题如下:
- 首先生成所有单个物品的项集列表
- 遍历数据集中所有项集,将不满足最小支持度的项集去掉
- 对剩下的项集组合,生成包含两个元素的项集
- 重新遍历数据集,去掉不满足最小支持度的项集
- 重复上述过程,直到所有项集都被去掉
import pandas as pd
from itertools import combinations,permutations
计算最后的置信度和提升度
def calcflf(astring,support_series,minconfidence):
support_series[frozenset()]=1
items = astring.split(',')
t = len(items)
items=frozenset(items)
res = []
ordered_statistics=[]
for num in range(t):
res += list((frozenset(set(items) - set(l)),frozenset(l)) for l in permutations(items, num))
for k,l in reversed(res):
items_base =k
items_add=l
confidence=support_series[items]/support_series[l]
lift=support_series[items]/support_series[l]/support_series[k]
if confidence>=minconfidence:
ordered_statistics.append({'items_base':items_add,'items_add':items_base,'support':support_series[items],'confidence':confidence,'lift':lift})
return ordered_statistics
主进程,计算支持度,利用frozenset作为hash键值
def Apriori(dataset,minconfidence=0.5,minsupport=0.5):
# tranform dataset into table which contains is_exist in all tranctions
func = lambda x: pd.Series(1, index=x)
data = pd.DataFrame(list(map(func, dataset))).fillna(0)
# apply Apriori, use frozenset as the key of support_series
support_series = 1.0 * data.sum() / len(data)
column = list(support_series[support_series >= minsupport].index)
support_series={frozenset([i]):k for i,k in dict(support_series).items() if k>=minsupport}
# k is the id of search, also the parameters of combinations
k = 1
# final res
res = []
cols=column
while len(column) > 1:
k = k + 1
temp=set.union(*(set(i.split(',')) for i in column))
temp = [frozenset(i) for i in combinations(temp, k)]
secfunc = lambda i: data[i].prod(axis=1, numeric_only=True)
datatemp = pd.DataFrame(list(map(secfunc, temp)), index=[','.join(i) for i in temp]).T
support_series_temp = 1.0 * datatemp[[','.join(i) for i in temp]].sum() / len(data)
column = list(support_series_temp[support_series_temp >= minsupport].index)
cols+=column
support_series_temp = {frozenset(i.split(',')): k for i, k in dict(support_series_temp).items() if k >= minsupport}
support_series.update(support_series_temp)
# after find all columns which are greater than minsupport, find all lift and confidence info.
for i in cols:
res_temp=calcflf(i, support_series,minconfidence)
if len(res_temp)>0:
res.append({i: res_temp})
return res
minsupport = 0.5
minconfidence = 0.5
rules = Apriori(dataset, minconfidence, minsupport)
for i in rules:
print(i)
{'牛奶': [{'items_base': frozenset(), 'items_add': frozenset({'牛奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'洋葱': [{'items_base': frozenset(), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆': [{'items_base': frozenset(), 'items_add': frozenset({'芸豆'}), 'support': 1.0, 'confidence': 1.0, 'lift': 1.0}]}
{'鸡蛋': [{'items_base': frozenset(), 'items_add': frozenset({'鸡蛋'}), 'support': 0.8, 'confidence': 0.8, 'lift': 1.0}]}
{'酸奶': [{'items_base': frozenset(), 'items_add': frozenset({'酸奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,牛奶': [{'items_base': frozenset({'牛奶'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'牛奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '牛奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,洋葱': [{'items_base': frozenset({'洋葱'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'鸡蛋,洋葱': [{'items_base': frozenset({'洋葱'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'鸡蛋'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset(), 'items_add': frozenset({'鸡蛋', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,鸡蛋': [{'items_base': frozenset({'鸡蛋'}), 'items_add': frozenset({'芸豆'}), 'support': 0.8, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.8, 'confidence': 0.8, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '鸡蛋'}), 'support': 0.8, 'confidence': 0.8, 'lift': 1.0}]}
{'芸豆,酸奶': [{'items_base': frozenset({'酸奶'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'酸奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '酸奶'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
{'芸豆,鸡蛋,洋葱': [{'items_base': frozenset({'鸡蛋', '洋葱'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆', '洋葱'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'鸡蛋', '洋葱'}), 'items_add': frozenset({'芸豆'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.0}, {'items_base': frozenset({'芸豆', '鸡蛋'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset({'芸豆', '洋葱'}), 'items_add': frozenset({'鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'芸豆', '鸡蛋'}), 'items_add': frozenset({'洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset({'洋葱'}), 'items_add': frozenset({'芸豆', '鸡蛋'}), 'support': 0.6, 'confidence': 1.0, 'lift': 1.25}, {'items_base': frozenset({'鸡蛋'}), 'items_add': frozenset({'芸豆', '洋葱'}), 'support': 0.6, 'confidence': 0.7499999999999999, 'lift': 1.2499999999999998}, {'items_base': frozenset({'芸豆'}), 'items_add': frozenset({'鸡蛋', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}, {'items_base': frozenset(), 'items_add': frozenset({'芸豆', '鸡蛋', '洋葱'}), 'support': 0.6, 'confidence': 0.6, 'lift': 1.0}]}
结论
- map()的使用
- frozenset()作为字典的键值
- itertools库的combinations,permutations使用
- DataFrame().prod()乘积表示多元素同时出现
0 Comments