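I started trying out the Weka tool on a few small examples, but found a lot in the output that I didn't understand. For example, I had no idea what output like this actually means.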
1. Mean absolute error: MAE = (|p1-a1| + ... + |pn-an|)/n. But what is the difference between the predicted values pi and the actual values ai in classification? In a classification problem, how is the error computed for each instance? I found a good article: http://weka.8497.n7.nabble.com/Mean-absolute-error-in-classification-td9440.html. I'm worried the link might die, though, and with my memory I definitely won't remember it, so here is a short summary of what it says.
It's not too hard to replicate this using the GUI: go to "More options..." in the Classify tab and then configure the "Output predictions" to generate a CSV table like below.
What the expert generated looks like this:
inst#,actual,predicted,error,distribution,,
1,3:Iris-virginica,3:Iris-virginica,,0,0.02439024390243902464,0.975609756097561
For instance 1, the distribution given by Weka is: 0 0.02439024390243902464 0.975609756097561 (note that it adds up to 1; the order is the same as the order of the labels: first = Setosa, second = Versicolor & third = Virginica).
I personally think that "distribution" is a very vague name. I would rather call them something like prediction scores maybe, [I think his understanding is very thorough. "Distribution function" can sound lofty, but if you call it a confidence value or a score for the prediction, it is much easier to understand] as distribution can be many things in this context (for example, the actual distribution of classes in the dataset). Anyway, in the case of this instance, the error is very simple to calculate.
First, the expected distribution for the instance would be: 0 0 1, since it's an instance of Iris virginica. Then the error is:
abs(0 - 0)/3 + abs(0.02439024390243902464 - 0)/3 + abs(0.975609756097561 - 1)/3 = 0.01626016
Repeating this for all the instances and summing up, I get 5.246992, which divided by 150 is 0.0349799, and that's the same answer I get with Weka.
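To help myself remember the calculation, here is a minimal Python sketch of it. The function names are my own and are not part of Weka; the only data is the single instance quoted above, and the class index 2 for Virginica follows the label order given in the post.

```python
def instance_abs_error(predicted, actual_index):
    """Absolute error of one instance, averaged over the classes.

    predicted    -- list of per-class prediction scores (sums to 1)
    actual_index -- position of the true class in that list
    """
    # Expected distribution: 1 for the true class, 0 for the others.
    expected = [1.0 if i == actual_index else 0.0 for i in range(len(predicted))]
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(predicted)


def mean_absolute_error(rows):
    """Average the per-instance errors over (predicted, actual_index) pairs."""
    return sum(instance_abs_error(p, a) for p, a in rows) / len(rows)


# Instance 1 from the CSV above: true class is Iris-virginica (index 2).
dist = [0.0, 0.02439024390243902464, 0.975609756097561]
err = instance_abs_error(dist, 2)
print(round(err, 8))
```

Calling mean_absolute_error on all 150 (distribution, class index) pairs would correspond to the "sum up and divide by 150" step described in the post.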
2. Correlation coefficient
When two sets of data are strongly linked together we say they have a High Correlation.
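Correlation means the two are linked. For example, temperature and ice cream sales rise together within a certain temperature range, but above a certain temperature sales fall again, so the relationship is a curve. Correlation describes linear relationships and cannot describe curves. Also, the link between the two is not necessarily causal: sunglasses sales and ice cream sales rise together, but that is only a relationship between the numbers.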
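A small Python sketch of the "linear, not curved" point, assuming the coefficient meant here is the usual Pearson correlation. The temperature and sales numbers are made up purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: temperature in degrees and ice cream sales.
temps = [20, 25, 30, 35, 40]
rising_sales = [100, 150, 200, 250, 300]  # sales keep rising with temperature
curved_sales = [100, 200, 250, 200, 100]  # sales drop again when it gets too hot

r_linear = pearson(temps, rising_sales)
r_curved = pearson(temps, curved_sales)
print(r_linear, r_curved)
```

The curved series depends entirely on temperature, yet because the rise and the fall cancel out, its coefficient comes out at zero, while the straight-line series gives a coefficient of one.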
3. Choose the model with the bigger correlation and the smaller error estimates
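In the afternoon I watched some of the Coursera course Introduction to Data Science in Python. It mainly covers how to use pandas, and I remembered very little after watching. For filling in missing values it introduced the fillna method, with options such as forward fill and backward fill. I don't quite understand why the data needs to be sorted first. I think it will be more efficient to go back to the course when I actually need these features.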
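My best guess about the sorting, as a sketch rather than what the course actually said: forward fill and backward fill copy values from the neighbouring rows, so the row order decides which value gets copied. The table below is hypothetical, and I use the ffill and bfill methods, which do the same thing as the forward-fill and backward-fill options of fillna.

```python
import pandas as pd

# Hypothetical readings, deliberately out of time order, with missing values.
df = pd.DataFrame(
    {"time": [3, 1, 4, 2, 5], "reading": [None, 10.0, 40.0, None, None]}
)

# Sort by time first so that "previous row" means "previous point in time".
df = df.sort_values("time").reset_index(drop=True)

forward = df["reading"].ffill()   # carry the last known value forward
backward = df["reading"].bfill()  # pull the next known value backward

print(pd.DataFrame({"time": df["time"], "ffill": forward, "bfill": backward}))
```

Without the sort, the missing reading at time 3 sits in the first row, so forward fill has nothing earlier to copy and leaves it empty. Backward fill leaves the last row empty for the same kind of reason, since there is no later value to pull from.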
Adding ? directly after a function name displays its help.
shift + command + 4 takes a screenshot.