筆者剛開始接觸數(shù)據(jù)挖掘,入門參考書籍為Peter Harrington編著的《機(jī)器學(xué)習(xí)》励堡,文章代碼亦大量借鑒于書中褐筛。
信息增益
導(dǎo)入模塊:
from math import log
import operator
計(jì)算給定數(shù)據(jù)集的香農(nóng)熵:
def calcShannonEnt(dataSet):
numEntries = len(dataSet)
lableCounts = {}
for featVec in dataSet:
currentLable = featVec[-1]
if currentLable not in lableCounts.keys():
lableCounts[currentLable] = 0
lableCounts[currentLable] += 1
shannonEnt = 0
for key in lableCounts:
prob = float(lableCounts[key])/numEntries
shannonEnt -= prob * log(prob,2)
return shannonEnt
創(chuàng)建簡(jiǎn)單的數(shù)據(jù)集:
def createDataSet():
dataSet = [[1,1,0,'fight'],[1,0,1,'fight'],[1,0,1,'fight'],[1,0,1,'fight'],[0,0,1,'run'],[0,1,0,'fight'],[0,1,1,'run']]
lables = ['weapon','bullet','blood']
return dataSet,lables
字段說明
[1,1,0,'fight']
數(shù)值 | 武器類型 | 子彈 | 血量 |
---|---|---|---|
0 | 步槍 | 少 | 少 |
1 | 機(jī)槍 | 多 | 多 |
值 | 行為類別 |
---|---|
fight | 戰(zhàn)斗 |
run | 逃跑 |
按行打印數(shù)據(jù)集
def printData(myData):
for item in myData:
print '%s' %(item)
用Python命令提示符輸入下列命令:
>>> import tree
>>> myDat,lable = tree.createDataSet()
>>> tree.printData(myDat)
[1, 1, 0, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[0, 0, 1, 'run']
[0, 1, 0, 'fight']
[0, 1, 1, 'run']
>>> tree.calcShannonEnt(myDat)
0.863120568566631
得到香農(nóng)熵為0.863120568566631
熵越高,則混合的數(shù)據(jù)也越多。
為數(shù)據(jù)集添加新分類surrender
:
>>> myDat[0][-1] = 'surrender'
>>> tree.printData(myDat)
[1, 1, 0, 'surrender']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[0, 0, 1, 'run']
[0, 1, 0, 'fight']
[0, 1, 1, 'run']
>>> tree.calcShannonEnt(myDat)
1.3787834934861756
得到香農(nóng)熵為1.3787834934861756
劃分?jǐn)?shù)據(jù)集
按照給定特征劃分?jǐn)?shù)據(jù)集:
def splitDataSet(dataSet,axis,value):
retDataSet = []
for featVec in dataSet:
if featVec[axis] == value:
reducedFeatVec = featVec[:axis]
reducedFeatVec.extend(featVec[axis+1:])
retDataSet.append(reducedFeatVec)
return retDataSet
輸入Python命令卧土,分別提取武器類型為1(機(jī)槍)
和0(步槍)
的行為:
>>> reload(tree)
<module 'tree' from 'tree.py'>
>>> myDat,lable = tree.createDataSet()
>>> tree.printData(myDat)
[1, 1, 0, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[0, 0, 1, 'run']
[0, 1, 0, 'fight']
[0, 1, 1, 'run']
>>> tree.splitDataSet(myDat,0,1)
[[1, 0, 'fight'], [0, 1, 'fight'], [0, 1, 'fight'], [0, 1, 'fight']]
>>> tree.splitDataSet(myDat,0,0)
[[0, 1, 'run'], [1, 0, 'fight'], [1, 1, 'run']]
選擇最好的數(shù)據(jù)集劃分方式:
def chooseBestFeatureToSplit(dataSet):
numFeatures = len(dataSet[0]) - 1
baseEntropy = calcShannonEnt(dataSet)
bestInfoGain = 0
bestFeature = -1
for i in range(numFeatures):
featList = [example[i] for example in dataSet]
uniqueVals = set(featList)
newEntropy = 0
for value in uniqueVals:
subDataSet = splitDataSet(dataSet,i,value)
prob = len(subDataSet)/float(len(dataSet))
newEntropy += prob * calcShannonEnt(subDataSet)
infoGain = baseEntropy - newEntropy
if (infoGain > bestInfoGain):
bestInfoGain = infoGain
bestFeature = i
return bestFeature
chooseBestFeatureToSplit
調(diào)用的數(shù)據(jù)需要滿足的要求:
- 數(shù)據(jù)必須是一種由列表元素組成的列表
- 所有列表元素都要具有相同的數(shù)據(jù)長(zhǎng)度
- 數(shù)據(jù)的最后一列是當(dāng)前數(shù)據(jù)的類別標(biāo)簽
在開始劃分?jǐn)?shù)據(jù)集之前,先計(jì)算整個(gè)數(shù)據(jù)集的原始香農(nóng)熵像樊,保存最初的無(wú)序度量值夸溶,用于與劃分完之后的數(shù)據(jù)集計(jì)算的熵值進(jìn)行比較,從而計(jì)算信息增益凶硅。
遍歷當(dāng)前特征中的所有唯一屬性值缝裁,對(duì)每個(gè)特征劃分一次數(shù)據(jù)集,然后計(jì)算數(shù)據(jù)集的新熵值足绅,并對(duì)所有唯一特征值得到的熵求和捷绑。
最后,比較所有特征中的信息增益氢妈,返回最好特征劃分的索引值粹污。
>>> reload(tree)
<module 'tree' from 'tree.pyc'>
>>> myDat,lable = tree.createDataSet()
>>> tree.printData(myDat)
[1, 1, 0, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[0, 0, 1, 'run']
[0, 1, 0, 'fight']
[0, 1, 1, 'run']
>>> tree.chooseBestFeatureToSplit(myDat)
0.469565211115
0.00597771142377
0.16958442967
0
在劃分?jǐn)?shù)據(jù)集之前之后信息發(fā)生的變化稱為信息增益。
特征 | 信息增益 |
---|---|
武器類型 | 0.469565211115 |
子彈數(shù)量 | 0.00597771142377 |
血量 | 0.16958442967 |
運(yùn)行結(jié)果告訴我們首量,第0個(gè)特征壮吩,也就是武器類型
是最好的用于劃分?jǐn)?shù)據(jù)集的特征。
遞歸構(gòu)建決策樹
創(chuàng)建??的函數(shù)代碼:
def createTree(dataSet,lables):
classList = [example[-1] for example in dataSet]
if classList.count(classList[0]) == len(classList):
return classList[0]
if len(dataSet[0]) == 1:
return majorityCnt(classList)
bestFeat = chooseBestFeatureToSplit(dataSet)
bestFeatLable = lables[bestFeat]
myTree = {bestFeatLable:{}}
del(lables[bestFeat])
featValues = [example[bestFeat] for example in dataSet]
uniqueVals = set(featValues)
for value in uniqueVals:
subLables = lables[:]
myTree[bestFeatLable][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLables)
return myTree
遞歸結(jié)束的條件:
- 遍歷完所有劃分?jǐn)?shù)據(jù)集的屬性
- 每個(gè)分支下的素有實(shí)力都有相同的分類
所有的類標(biāo)簽完全相同加缘,則返回該類標(biāo)簽鸭叙。如果使用完了所有特征,仍然不能將數(shù)據(jù)集劃分成僅包含唯一類別的分組拣宏,則通過majorityCnt
挑選出出現(xiàn)次數(shù)最多的類別標(biāo)簽作為返回值沈贝。
選出出現(xiàn)次數(shù)最多的分類名稱:
def majorityCnt(classList):
classCount = {}
for vote in classList:
if vote not in classCount.keys():classCount[vote] = 0
classCount[vote] += 1
sortedClassCount = sorted(classCount.iteritems(),key = operator.itemgetter(1),reverse=True)
return sortedClassCount[0][0]
為測(cè)試代碼的實(shí)際輸出結(jié)果,在Python命令提示符中輸入下列命令:
>>> reload(tree)
<module 'tree' from 'tree.pyc'>
>>> myDat,lable = tree.createDataSet()
>>> tree.printData(myDat)
[1, 1, 0, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[1, 0, 1, 'fight']
[0, 0, 1, 'run']
[0, 1, 0, 'fight']
[0, 1, 1, 'run']
>>> tree.createTree(myDat,lable)
0.469565211115
0.00597771142377
0.16958442967
0.251629167388
0.918295834054
{'weapon': {0: {'blood': {0: 'fight', 1: 'run'}}, 1: 'fight'}}
createTree
返回的嵌套字典包含了很多代表樹結(jié)構(gòu)的信息勋乾,從左邊開始宋下,第一個(gè)關(guān)鍵字weapon
是第一個(gè)劃分?jǐn)?shù)據(jù)集的特征名稱,該關(guān)鍵字的值也是另一個(gè)數(shù)據(jù)字典。第二個(gè)關(guān)鍵字是weapon
特征劃分的數(shù)據(jù)集辑莫,這些關(guān)鍵字的值是weapon
節(jié)點(diǎn)的子節(jié)點(diǎn)学歧。這些值可能是類標(biāo)簽,也可能是另一個(gè)數(shù)據(jù)字典各吨。如果值是類標(biāo)簽枝笨,則該節(jié)點(diǎn)是葉子節(jié)點(diǎn);如果值是另一個(gè)數(shù)據(jù)字典,則子節(jié)點(diǎn)是一個(gè)判斷節(jié)點(diǎn)伺帘,這種格式結(jié)構(gòu)不斷重復(fù)就構(gòu)成了整棵??昭躺。該例子中包含了3個(gè)葉子節(jié)點(diǎn)和2個(gè)判斷節(jié)點(diǎn)忌锯。