機(jī)器學(xué)習(xí)：如何構(gòu)造一個(gè)簡單的決策樹

數(shù)據(jù)集下載地址：http://archive.ics.uci.edu/ml/datasets/Adult

在寫該算法時(shí)遇到一個(gè)問題：
構(gòu)造決策樹時(shí)全陨，這兩段代碼雖然都可以成功運(yùn)行辱姨，但是構(gòu)造的結(jié)果卻有些不同戚嗅。
如果用第一種方式遍歷每個(gè)分支，會(huì)導(dǎo)致每次從右側(cè)分支開始遍歷替久，即使把branc_dict調(diào)整為{'right':right_split躏尉，'left':left_split}
而使用第二種方式，則可以正常遍歷(先遍歷左分支颅拦，再遍歷右分支)，到目前為止還沒發(fā)現(xiàn)是什么原因?qū)е碌挠蚁牵魑挥兄赖臍g迎留言～

以下為代碼過程：

讀入數(shù)據(jù)

import pandas as pd
columns=['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'high_income']
data=pd.read_table('./data/income.data',delimiter=',',names=columns)
data.head()

在開始構(gòu)建決策樹之前碌秸，我們需要把數(shù)據(jù)集中的分類型數(shù)據(jù)轉(zhuǎn)換為數(shù)值型哮肚，pandas.Categorical方法可以把string型分類的column轉(zhuǎn)換為Categorical Type，轉(zhuǎn)換以后系統(tǒng)就會(huì)自動(dòng)將該column中的類別映射為一個(gè)數(shù)字允趟。

list=['workclass','education','marital_status', 'occupation',
           'relationship', 'race', 'sex', 'native_country','high_income']
for name in list:
    col=pd.Categorical.from_array(data[name])
    data[name]=col.codes
data.head()

計(jì)算熵和信息增益：
熵

信息增益

def calc_entropy(target):
    counts=np.bincount(target)
    probabilities=counts/len(target)
    entropys=probabilities*np.log2(probabilities)
    return -sum(entropys)
def calc_information_gain(data,split_name,target):
    entropy=calc_entropy(data[target])
    median=np.median(data[split_name])
    left_split=data[data[split_name]<=median]
    right_split=data[data[split_name]>median]
    
    to_subtract=0
    for subset in [left_split,right_split]:
        prob=subset.shape[0]/data.shape[0]
        to_subtract+=prob*calc_entropy(subset[target])
    return entropy-to_subtract
#通過計(jì)算每一個(gè)column的信息增益，獲得最佳分裂屬性（信息增益最大的）
def find_best_column(data,columns,target):
    information_gains=[]
    for name in columns:
        information_gains.append(calc_information_gain(data,name,'high_income'))
    information_index=information_gains.index(max(information_gains))
    best_column=columns[information_index]
    return best_column

帶有存儲(chǔ)功能的ID3算法：

為了實(shí)現(xiàn)存儲(chǔ)功能狮斗，可以使用一個(gè)含有以下關(guān)鍵字的dictionary存儲(chǔ)節(jié)點(diǎn)：

left/right 關(guān)鍵字表示左右結(jié)點(diǎn)
column 最佳分裂屬性
median 分裂屬性的中位數(shù)
number 結(jié)點(diǎn)編號(hào)
label
如果結(jié)點(diǎn)為葉節(jié)點(diǎn)弧蝇，則僅僅含有l(wèi)abel（值為0/1）和number關(guān)鍵字
偽代碼如下：

def id3(data, target, columns, tree)
    1 Create a node for the tree
    2 Number the node
    3 If all of the values of the target attribute are 1, assign 1 to the label key in tree
    4 If all of the values of the target attribute are 0, assign 0 to the label key in tree
    5 Using information gain, find A, the column that splits the data best
    6 Find the median value in column A
    7 Assign the column and median keys in tree
    8 Split A into values less than or equal to the median (0), and values above the median (1)
    9 For each possible value (0 or 1), vi, of A,
   10    Add a new tree branch below Root that corresponds to rows of data where A = vi
   11    Let Examples(vi) be the subset of examples that have the value vi for A
   12    Create a new key with the name corresponding to the side of the split (0=left, 1=right).  The value of this key should be an empty dictionary.
   13    Below this new branch, add the subtree id3(data[A==vi], target, columns, tree[split_side])
   14 Return Root

實(shí)現(xiàn)代碼：

tree={}
nodes=[]  #重點(diǎn)注意：因?yàn)樵谶f歸中使用int型不能自增看疗，所以采取使用數(shù)組的方法。
def id3(data,columns,target,tree):

    nodes.append(len(nodes)+1)
    tree['number']=nodes[-1]
    unique_targets=pd.unique(data[target])
    if len(unique_targets)==1:
        tree['label']=unique_targets[0]
        return  #不要忘記返回
    
    #如unique長度不為1摔寨，既包含0又含1怖辆，需要分裂：
    best_column=find_best_column(data,columns,target)
    median=np.median(data[best_column])
    tree['column']=best_column  #分裂key
    tree['median']=median  #median key
    
    left_split=data[data[best_column]<=median]
    right_split=data[data[best_column]>median]
    branch_dict={'left':left_split,'right':right_split}
    for branch in branch_dict:
        tree[branch]={}
        id3(branch_dict[branch],columns,target,tree[branch])

id3(data,  ["age", "marital_status"],"high_income", tree)
print(tree)

結(jié)果為

為了方便觀察決策樹的構(gòu)造結(jié)果竖螃，我們可以寫一個(gè)結(jié)點(diǎn)輸出函數(shù)，結(jié)構(gòu)化的輸出生成的決策樹：

def print_with_depth(string,depth):
    prefix="    "*depth
    print("{0}{1}".format(prefix,string))
def print_node(tree,depth):
    if 'label' in tree:
        print_with_depth('Leaf label{0}'.format(tree['label']),depth)
        return
    print_with_depth('{0}>{1}'.format(tree['column'],tree['median']),depth)
    
    branches = [tree["left"], tree["right"]]
        
    for branch in branches:
        print_node(branch,depth+1)
print_node(tree, 0 )

輸出

實(shí)現(xiàn)預(yù)測功能：

#預(yù)測函數(shù)
def predict(tree,row):
    if 'label' in tree:
        return tree['label']
    column=tree['column']
    median=tree['median']
    
    if row['columns']<=median:
        return predict(tree['left'],row)
    else:
        return predict(tree['right'],row)
print(predict(tree, data.iloc[0]))

predictions=data.apply(lambda x:predict(tree,x),axis=1)

完蒋纬。

最后編輯于：2017.12.07 04:28:11

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末蜀备，一起剝皮案震驚了整個(gè)濱河市，隨后出現(xiàn)的幾起案子输虱，更是在濱河造成了極大的恐慌脂凶，老刑警劉巖，帶你破解...
沈念sama閱讀 217,734評(píng)論 6贊 505
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件亭病，死亡現(xiàn)場離奇詭異嘶居，居然都是意外死亡，警方通過查閱死者的電腦和手機(jī)邮屁，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,931評(píng)論 3贊 394
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門佑吝，熙熙樓的掌柜王于貴愁眉苦臉地迎上來，“玉大人炸客，你說我怎么就攤上這事嚷量∧嫒ぃ” “怎么了嗜历？”我有些...
開封第一講書人閱讀 164,133評(píng)論 0贊 354
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵，是天一觀的道長痕囱。經(jīng)常有香客問我暴匠，道長，這世上最難降的妖魔是什么帮掉？我笑而不...
開封第一講書人閱讀 58,532評(píng)論 1贊 293
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮稽莉，結(jié)果婚禮上涩搓，老公的妹妹穿的比我還像新娘。我一直安慰自己昧甘，他們只是感情好充边，可當(dāng)我...
茶點(diǎn)故事閱讀 67,585評(píng)論 6贊 392
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布痛黎。她就那樣靜靜地躺著，像睡著了一般掖蛤。火紅的嫁衣襯著肌膚如雪蚓庭。梳的紋絲不亂的頭發(fā)上器赞，一...
開封第一講書人閱讀 51,462評(píng)論 1贊 302
城市分裂傳說
那天港柜，我揣著相機(jī)與錄音咳榜，去河邊找鬼。笑死涌韩，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的靶擦。我是一名探鬼主播，決...
沈念sama閱讀 40,262評(píng)論 3贊 418
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼踩蔚，長吁一口氣：“原來是場噩夢啊……” “哼桩盲！你這毒婦竟也來了？” 一聲冷哼從身側(cè)響起捞蛋，我...
開封第一講書人閱讀 39,153評(píng)論 0贊 276
萬榮殺人案實(shí)錄
序言：老撾萬榮一對情侶失蹤柬姚，失蹤者是張志新（化名）和其女友劉穎，沒想到半個(gè)月后搬设，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體撕捍，經(jīng)...
沈念sama閱讀 45,587評(píng)論 1贊 314
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡忧风，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,792評(píng)論 3贊 336
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了腿宰。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片缘厢。...
茶點(diǎn)故事閱讀 39,919評(píng)論 1贊 348
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡，死狀恐怖椿每，靈堂內(nèi)的尸體忽然破棺而出夜畴，到底是詐尸還是另有隱情，我是刑警寧澤贪绘，帶...
沈念sama閱讀 35,635評(píng)論 5贊 345
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布税灌，位于F島的核電站，受9級(jí)特大地震影響菱涤，放射性物質(zhì)發(fā)生泄漏粘秆。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,237評(píng)論 3贊 329
男人毒藥：我在死后第九天來索命
文/蒙蒙一殷勘、第九天我趴在偏房一處隱蔽的房頂上張望昔搂。院中可真熱鬧玲销，春花似錦摘符、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,855評(píng)論 0贊 22
一樁弒父案带族，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽。三九已至堕战，卻和暖如春拍霜，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背祠饺。一陣腳步聲響...
開封第一講書人閱讀 32,983評(píng)論 1贊 269
情欲美人皮
我被黑心中介騙來泰國打工道偷，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人勺鸦。一個(gè)月前我還...
沈念sama閱讀 48,048評(píng)論 3贊 370
代替公主和親
正文我出身青樓换途，卻偏偏與公主長得像懊渡，于是被迫代替她去往敵國和親刽射。傳聞我的和親對象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,864評(píng)論 2贊 354

機(jī)器學(xué)習(xí)：如何構(gòu)造一個(gè)簡單的決策樹

以下為代碼過程：

帶有存儲(chǔ)功能的ID3算法：

實(shí)現(xiàn)預(yù)測功能：

推薦閱讀更多精彩內(nèi)容