Data Mining Practice Guide: Reading Notes 5

Before We Begin

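All source code and data used in the book can be found at: http://guidetodatamining.com/
The theory in this book is fairly simple and the book contains few errors. It offers plenty of hands-on practice, and writing out every piece of code yourself pays off. In short: well suited to beginners.
Reposting is welcome; please credit the source. Corrections are welcome too.
Full collection of notes: https://www.zybuluo.com/hainingwyx/note/559139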

Probability and Naive Bayes

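Key feature: it classifies and also gives a probability for the classification.
Prior probability: P(h)
Posterior (conditional) probability: P(h|d)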
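
For reference, the two quantities are linked by Bayes' theorem. The statement below is the standard textbook form, with h a hypothesis (class) and d the observed data; it is not copied from the book.

```latex
P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)},
\qquad
h_{\mathrm{MAP}} = \arg\max_{h} \; P(d \mid h)\,P(h)
```

Because P(d) is the same for every h, a classifier only needs to compare P(d|h)P(h) across classes, which is why the code below multiplies a prior by conditional probabilities and never divides by P(d).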

# Training
class Classifier:
    def __init__(self, bucketPrefix, testBucketNumber, dataFormat):

        """ a classifier will be built from files with the bucketPrefix
        excluding the file with testBucketNumber. dataFormat is a string that
        describes how to interpret each line of the data files. For example,
        for the iHealth data the format is:
        "attr   attr    attr    attr    class"
        """
   
        total = 0
        classes = {}
        counts = {}
        
        
        # reading the data in from the file
        
        self.format = dataFormat.strip().split('\t')
        self.prior = {}
        self.conditional = {}
        # for each of the buckets numbered 1 through 10:
        for i in range(1, 11):
            # if it is not the bucket we should ignore, read in the data
            if i != testBucketNumber:
                filename = "%s-%02i" % (bucketPrefix, i)
                f = open(filename)
                lines = f.readlines()
                f.close()
                for line in lines:
                    fields = line.strip().split('\t')
                    ignore = []
                    vector = []
                    # use j here so the bucket index i of the outer loop is not shadowed
                    for j in range(len(fields)):
                        if self.format[j] == 'num':
                            vector.append(float(fields[j]))     # numeric values go into the same vector
                        elif self.format[j] == 'attr':
                            vector.append(fields[j])
                        elif self.format[j] == 'comment':
                            ignore.append(fields[j])
                        elif self.format[j] == 'class':
                            category = fields[j]
                    # now process this instance
                    total += 1
                    classes.setdefault(category, 0)     # dict: number of instances per class
                    counts.setdefault(category, {})     # nested dict: per class, per column, count of each value
                    classes[category] += 1
                    # now process each attribute of the instance
                    col = 0
                    for columnValue in vector:
                        col += 1
                        counts[category].setdefault(col, {})
                        counts[category][col].setdefault(columnValue, 0)
                        counts[category][col][columnValue] += 1
        
        # ok done counting. now compute probabilities
        # first prior probabilities p(h)
        for (category, count) in classes.items():
            self.prior[category] = count / total  # dict: prior probability of each class

        # now compute conditional probabilities p(D|h)
        for (category, columns) in counts.items():
              self.conditional.setdefault(category, {})
              for (col, valueCounts) in columns.items():
                  self.conditional[category].setdefault(col, {})
                  for (attrValue, count) in valueCounts.items():
                      self.conditional[category][col][attrValue] = (
                          count / classes[category])        # nested dict: conditional probability of each attribute value given the class
        self.tmp =  counts               # probably not used for now
# Classification
    def classify(self, itemVector):
        """Return class we think item Vector is in"""
        results = []
        for (category, prior) in self.prior.items():
            prob = prior
            col = 1
            for attrValue in itemVector:
                if not attrValue in self.conditional[category][col]:
                    # we did not find any instances of this attribute value
                    # occurring with this category so prob = 0
                    prob = 0
                else:
                    prob = prob * self.conditional[category][col][attrValue]
                col += 1
            results.append((prob, category))
        # return the category with the highest probability
        return(max(results)[1])
# test code
c = Classifier("iHealth/i", 10,"attr\tattr\tattr\tattr\tclass")
print(c.classify(['health', 'moderate', 'moderate', 'yes']))
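
The counting scheme above can also be tried without the bucket files. The following is a minimal sketch, not the book's code: the function names `train` and `classify`, the tuple-keyed dictionaries, and the toy rows are all made up for illustration.

```python
from collections import defaultdict


def train(rows):
    """rows is a list of (attribute_tuple, class_label) pairs."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)  # key: (label, column, value)
    for attrs, label in rows:
        class_counts[label] += 1
        for col, value in enumerate(attrs):
            value_counts[(label, col, value)] += 1
    total = len(rows)
    # prior P(h): share of training rows in each class
    prior = {label: n / total for label, n in class_counts.items()}
    # conditional P(d|h): share of rows in the class that have this value
    conditional = {key: n / class_counts[key[0]]
                   for key, n in value_counts.items()}
    return prior, conditional


def classify(prior, conditional, attrs):
    """Return the class with the largest P(h) * product of P(d_i|h)."""
    scores = []
    for label, p in prior.items():
        for col, value in enumerate(attrs):
            # unseen (label, column, value) combinations count as probability 0
            p *= conditional.get((label, col, value), 0.0)
        scores.append((p, label))
    return max(scores)[1]


# Hypothetical toy data: (main interest, exercise level) -> model
toy_rows = [
    (("health", "active"), "i500"),
    (("health", "moderate"), "i500"),
    (("appearance", "moderate"), "i100"),
    (("appearance", "sedentary"), "i100"),
    (("both", "moderate"), "i100"),
]
prior, conditional = train(toy_rows)
prediction = classify(prior, conditional, ("health", "moderate"))
print(prediction)
```

In this toy set "health" never occurs with class i100, so the i100 score is multiplied by 0 and i500 wins regardless of the other attribute.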

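Problem: a single conditional probability of 0 dominates the entire Bayes computation, even when the conditional probabilities of the other attributes are close to 1. In addition, probabilities estimated from a sample tend to underestimate the true probabilities.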

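Improvement: replace

$$P(x \mid y) = \frac{n_c}{n}$$

with

$$P(x \mid y) = \frac{n_c + mp}{n + m}$$

where $n$ is the total number of instances of $y$, $n_c$ is the number of instances of $y$ in which $x$ occurs, and $m$ is the equivalent sample size, usually set to the number of values the attribute can take. $p$ is the prior estimate of the probability, usually assumed to be uniform (so $p = 1/m$ under that choice of $m$).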
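
A minimal sketch of the smoothed estimate (n_c + m·p) / (n + m), using hypothetical counts. The function name `m_estimate` and the numbers are assumptions for illustration only.

```python
def m_estimate(n_c, n, m, p):
    """Smoothed estimate of P(x|y): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)


# Hypothetical case: an attribute with 3 possible values,
# and class y observed 10 times with value counts 7, 3 and 0.
n = 10
m = 3
p = 1 / m  # uniform prior over the 3 values

smoothed = [m_estimate(n_c, n, m, p) for n_c in (7, 3, 0)]
print(smoothed)
```

The value that was never seen with y now gets 1/13 instead of 0, so it no longer wipes out the whole product, and the three smoothed estimates still sum to 1.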
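When the data is continuous, there are two ways to handle it. One is to discretize it into categories; the other is to assume the values follow a Gaussian distribution and compute probabilities from that.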
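
The first option, discretization, might look like the sketch below. The cut points, the band names, and the helper `discretize` are hypothetical; in practice the boundaries would be chosen from the data.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a number to the label of the interval it falls in.

    cut_points must be sorted, and len(labels) == len(cut_points) + 1.
    A value equal to a cut point goes into the higher interval.
    """
    return labels[bisect_right(cut_points, value)]


# Hypothetical income bands, in thousands
cuts = [40, 75]
names = ["low", "medium", "high"]
bands = [discretize(v, cuts, names) for v in (25, 40, 60, 90)]
print(bands)
```

Once a column has been mapped to labels like these, it can be treated as an ordinary categorical attribute by the counting classifier.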
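Sample standard deviation:

$$sd = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$

For a sample, this formula gives a better estimate of the population standard deviation than the population formula, which divides by $n$.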
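
The difference between the two formulas is only the divisor. A small sketch with made-up numbers, cross-checked against Python's `statistics` module:

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
mean = sum(values) / len(values)
squared_diffs = sum((x - mean) ** 2 for x in values)

population_sd = math.sqrt(squared_diffs / len(values))    # divide by n
sample_sd = math.sqrt(squared_diffs / (len(values) - 1))  # divide by n - 1

print(population_sd, statistics.pstdev(values))
print(sample_sd, statistics.stdev(values))
```

The sample version is always the larger of the two, and the gap shrinks as n grows.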

# PDF implementation
import math   # also needed by the Classifier below

def pdf(mean, ssd, x):
    """Probability Density Function computing P(x|y).
    Input is the mean and the sample standard deviation of all the items
    in y, and x."""
    ePart = math.pow(math.e, -(x - mean)**2 / (2 * ssd**2))
    return (1.0 / (math.sqrt(2 * math.pi) * ssd)) * ePart
# Training with continuous data
class Classifier:
    def __init__(self, bucketPrefix, testBucketNumber, dataFormat):

        """ a classifier will be built from files with the bucketPrefix
        excluding the file with testBucketNumber. dataFormat is a string that
        describes how to interpret each line of the data files. For example,
        for the iHealth data the format is:
        "attr   attr    attr    attr    class"
        """
   
        total = 0
        classes = {}
        # counts used for attributes that are not numeric
        counts = {}
        # totals used for attributes that are numeric
        # we will use these to compute the mean and sample standard deviation for
        # each attribute - class pair.
        totals = {}
        numericValues = {}
        
        
        # reading the data in from the file
        
        self.format = dataFormat.strip().split('\t')
        # 
        self.prior = {}
        self.conditional = {}
 
        # for each of the buckets numbered 1 through 10:
        for i in range(1, 11):
            # if it is not the bucket we should ignore, read in the data
            if i != testBucketNumber:
                filename = "%s-%02i" % (bucketPrefix, i)
                f = open(filename)
                lines = f.readlines()
                f.close()
                for line in lines:
                    fields = line.strip().split('\t')
                    ignore = []
                    vector = []
                    nums = []
                    # use j here so the bucket index i of the outer loop is not shadowed
                    for j in range(len(fields)):
                        if self.format[j] == 'num':
                            nums.append(float(fields[j]))
                        elif self.format[j] == 'attr':
                            vector.append(fields[j])
                        elif self.format[j] == 'comment':
                            ignore.append(fields[j])
                        elif self.format[j] == 'class':
                            category = fields[j]
                    # now process this instance
                    total += 1
                    classes.setdefault(category, 0)
                    counts.setdefault(category, {})
                    totals.setdefault(category, {})
                    numericValues.setdefault(category, {})
                    classes[category] += 1
                    # now process each non-numeric attribute of the instance
                    col = 0
                    for columnValue in vector:
                        col += 1
                        counts[category].setdefault(col, {})
                        counts[category][col].setdefault(columnValue, 0)
                        counts[category][col][columnValue] += 1
                    # process numeric attributes
                    col = 0
                    for columnValue in nums:
                        col += 1
                        totals[category].setdefault(col, 0)
                        #totals[category][col].setdefault(columnValue, 0)
                        totals[category][col] += columnValue
                        numericValues[category].setdefault(col, [])
                        numericValues[category][col].append(columnValue)
                    
        
        #
        # ok done counting. now compute probabilities
        #
        # first prior probabilities p(h)
        #
        for (category, count) in classes.items():
            self.prior[category] = count / total
        #
        # now compute conditional probabilities p(D|h)
        #
        for (category, columns) in counts.items():
              self.conditional.setdefault(category, {})
              for (col, valueCounts) in columns.items():
                  self.conditional[category].setdefault(col, {})
                  for (attrValue, count) in valueCounts.items():
                      self.conditional[category][col][attrValue] = (
                          count / classes[category])
        self.tmp =  counts               
        #
        # now compute mean and sample standard deviation
        #
        self.means = {}
        self.totals = totals
        for (category, columns) in totals.items():
            self.means.setdefault(category, {})
            for (col, cTotal) in columns.items():
                self.means[category][col] = cTotal / classes[category]
        # standard deviation
        self.ssd = {}
        for (category, columns) in numericValues.items():
            
            self.ssd.setdefault(category, {})
            for (col, values) in columns.items():
                SumOfSquareDifferences = 0
                theMean = self.means[category][col]
                for value in values:
                    SumOfSquareDifferences += (value - theMean)**2
                columns[col] = 0
                self.ssd[category][col] = math.sqrt(SumOfSquareDifferences / (classes[category]  - 1))              
# Classification with continuous data
    def classify(self, itemVector, numVector):
        """Return class we think item Vector is in"""
        results = []
        sqrt2pi = math.sqrt(2 * math.pi)
        for (category, prior) in self.prior.items():
            prob = prior
            col = 1
            for attrValue in itemVector:
                if not attrValue in self.conditional[category][col]:
                    # we did not find any instances of this attribute value
                    # occurring with this category so prob = 0
                    prob = 0
                else:
                    prob = prob * self.conditional[category][col][attrValue]
                col += 1
            col = 1
            for x in  numVector:
                mean = self.means[category][col]
                ssd = self.ssd[category][col]
                ePart = math.pow(math.e, -(x - mean)**2/(2*ssd**2))
                prob = prob * ((1.0 / (sqrt2pi*ssd)) * ePart)
                col += 1
            results.append((prob, category))
        # return the category with the highest probability
        #print(results)
        return(max(results)[1])
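
One point worth keeping in mind about the numeric loop: the factor it multiplies in is a density, not a probability, so it can exceed 1 when the standard deviation is small. Only the comparison between classes matters. A small sketch (the name `gaussian_pdf` and the inputs are assumptions for illustration):

```python
import math


def gaussian_pdf(mean, ssd, x):
    """Gaussian density with the given mean and standard deviation, evaluated at x."""
    exponent = -((x - mean) ** 2) / (2 * ssd ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * ssd)


at_mean = gaussian_pdf(0.0, 1.0, 0.0)   # peak of the standard normal
one_sd = gaussian_pdf(0.0, 1.0, 1.0)    # one standard deviation away
narrow = gaussian_pdf(0.0, 0.1, 0.0)    # small ssd gives a tall, narrow peak

print(at_mean, one_sd, narrow)
```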

Naive Bayes versus kNN

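  • Naive Bayes, advantages: simple to implement; needs less training data than other methods.
  • Naive Bayes, disadvantage: cannot learn interactions among features.
  • kNN, advantages: simple to implement; makes no assumption about the structure of the data; a reasonable choice when the training set is large.
  • kNN, disadvantage: needs a large amount of memory to store the training set.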

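In many real data mining problems, many of the attributes are not independent. Sometimes independence can still be assumed. The method is called naive Bayes precisely because it assumes the attributes are independent even though we know the assumption does not hold.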
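
Written out, the independence assumption lets the likelihood factor attribute by attribute. This is the usual textbook form; the notation d_1, ..., d_n for the attribute values of one instance is assumed here rather than taken from the book.

```latex
P(d_1, \dots, d_n \mid h) \approx \prod_{i=1}^{n} P(d_i \mid h)
\quad\Longrightarrow\quad
h_{\mathrm{NB}} = \arg\max_{h} \; P(h) \prod_{i=1}^{n} P(d_i \mid h)
```

The product on the right is what the classify methods above compute column by column.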
