Following the Coursera course Applied Machine Learning in Python, I worked through a machine learning example: a simple fruit classification task. After downloading the data I discovered that simply copy-pasting the file changes its delimiter. The original file uses the tab character (hex 09) as the separator, but after copy-pasting into a new file the separators became runs of 2 to 4 spaces (hex 20). Encoding quirks like this give me a headache; it took quite a while to track down, until I compared the hex dumps of the two files in UltraEdit.
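A minimal sketch of one way to sidestep the delimiter problem when loading the data with pandas (the filename follows the course notebook; treating the separator as a whitespace regex is my assumption, not something the course does):

```python
import pandas as pd

# The original file is tab-separated (hex 09); a copy-pasted version may end up
# separated by 2-4 spaces (hex 20) instead. A whitespace regex handles both.
fruits = pd.read_csv('fruit_data_with_colors.txt', sep=r'\s+')
print(fruits.head())
```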
The fruit classification example uses the k-nearest neighbors (k-NN) algorithm, which is also described as instance-based or memory-based supervised learning. What this means is that instance-based learning methods work by memorizing the labeled examples that they see in the training set, and then they use those memorized examples to classify new objects later.
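To make the memorize-then-classify idea concrete, here is a small sketch with scikit-learn's KNeighborsClassifier on the fruit data. The column names (mass, width, height, fruit_label) follow the course notebook and should be treated as assumptions here:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

fruits = pd.read_csv('fruit_data_with_colors.txt', sep=r'\s+')

# Numeric feature columns and the label column, as in the course notebook.
X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" a k-NN classifier amounts to memorizing the labeled examples;
# prediction consults the nearest memorized examples for each new object.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))        # accuracy on held-out data
print(knn.predict([[150, 7.0, 7.5]]))   # classify a new fruit: mass, width, height
```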
Four criteria specify the method: the distance metric, the number of nearest neighbors k, the weighting function on the neighbor points (if any), and the method for aggregating the neighbors' classes.
By default scikit-learn uses the Minkowski metric with p=2 (i.e. Euclidean distance), n_neighbors=5, uniform weights (no special treatment of the weighting function), and a simple majority vote. See the sketch below.
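The same four choices written out explicitly as KNeighborsClassifier parameters (these are just the defaults spelled out, not special settings):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    metric='minkowski', p=2,   # 1. distance metric: Minkowski with p=2, i.e. Euclidean
    n_neighbors=5,             # 2. number of nearest neighbors to consult
    weights='uniform',         # 3. weighting function: all neighbors count equally
)
# 4. aggregation of the neighbors' labels: a simple majority vote,
#    which is built into the classifier rather than exposed as a parameter.
```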
We can see that when K has a small value like 1, the k-nearest neighbors classifier is good at learning the classes for individual points in the training set, but the resulting decision boundary is fragmented, carving the feature space into noticeably small, scattered regions. This is because when K = 1, the prediction is sensitive to noise, outliers, mislabeled data, and other sources of variation in individual data points.
Using a larger K suppresses the effects of noisy individual labels, but results in classification boundaries that are less detailed.
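One rough way to see this tradeoff on the fruit data is to compare training and test accuracy for several values of K (this reuses X_train, X_test, y_train, y_test from the earlier sketch; the exact scores depend on the random split):

```python
from sklearn.neighbors import KNeighborsClassifier

# Small k fits individual training points tightly (jagged, noise-sensitive
# boundary); larger k smooths the boundary at the cost of detail.
for k in [1, 3, 5, 11, 25]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f'k={k:2d}  train={knn.score(X_train, y_train):.3f}  '
          f'test={knn.score(X_test, y_test):.3f}')
```

With k=1 the training accuracy is trivially perfect (every point is its own nearest neighbor), which is exactly the overfitting symptom described above.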