sklearn's SVM functions do not scale the data, whereas the corresponding function in the e1071 package does. In R you therefore need to pass scale=FALSE to get results similar to sklearn's.
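To see why scaling matters for a Gaussian (RBF) kernel, the minimal sketch below standardizes two toy features to zero mean and unit variance, which is what the e1071 documentation describes as its default, and compares one kernel value before and after. The toy data, the helper names `standardize` and `rbf`, and the choice of gamma as 1 / (number of features) are illustrative assumptions, not code from either library.

```python
import math
import statistics

def standardize(column):
    """Center a column to zero mean and scale it to unit sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1), as R's sd() uses
    return [(v - mean) / sd for v in column]

def rbf(a, b, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Hypothetical toy data: two features on very different scales.
feature_small = [0.1, 0.2, 0.3, 0.4]
feature_large = [100.0, 300.0, 200.0, 400.0]
gamma = 1.0 / 2  # assumed here: 1 / number of features

raw_rows = list(zip(feature_small, feature_large))
scaled_rows = list(zip(standardize(feature_small), standardize(feature_large)))

# On raw data the large-scale feature dominates the distance,
# so the kernel value between the first two rows collapses toward 0.
k_raw = rbf(raw_rows[0], raw_rows[1], gamma)

# After standardization both features contribute comparably to the distance.
k_scaled = rbf(scaled_rows[0], scaled_rows[1], gamma)

print(k_raw, k_scaled)
```

Because the kernel depends only on distances, a model trained on scaled data and one trained on raw data are effectively solving different problems, which is why the scale setting has to match when comparing the two libraries.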
The example here is the letter recognition task from the book Machine Learning with R. The same dataset is available in the UCI repository (UCI letter recognition), and the UCI copy is used here for reproducibility.
First, in Python, read the data with pandas, then put the first 16000 rows in the training set and the last 4000 rows in the test set, which is used to evaluate the SVM's predictive performance.
import pandas as pd
letter_reco_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data"
colnames = [
"letter", "xbox", "ybox", "width", "height", "onpix", "xbar", "ybar", "x2bar", "y2bar",
"xybar", "x2ybar", "xy2bar", "xedge", "xedgey", "yedge", "yedgex"
]
letter_data = pd.read_csv(letter_reco_path, header = None, names = colnames)
training = letter_data.iloc[0:16000, :]
testing = letter_data.iloc[16000:, :]
# .ix has been removed from pandas; use position-based .iloc instead
X_train, y_train = training.iloc[:, 1:].values, training.iloc[:, 0].values
X_test, y_test = testing.iloc[:, 1:].values, testing.iloc[:, 0].values
Next, fit an SVM classifier with sklearn's SVC, using a Gaussian (RBF) kernel.
from sklearn.svm import SVC
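# gamma="auto" (1 / n_features) was the SVC default before scikit-learn 0.22 and matches
# e1071's default gamma; newer versions default to "scale", so it is set explicitly here
svm_model = SVC(kernel="rbf", gamma="auto", random_state=1071).fit(X_train, y_train)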
Then predict on the test set; the accuracy is 0.9722.
svm_pred = svm_model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, svm_pred)
0.97224999999999995
Likewise, in R, read the same UCI data, then put the first 16000 rows in the training set and the rest in the test set.
letter_reco_path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data"
colnames <- c("letter", "xbox", "ybox", "width", "height", "onpix", "xbar", "ybar", "x2bar", "y2bar", "xybar", "x2ybar", "xy2bar", "xedge", "xedgey", "yedge", "yedgex")
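# stringsAsFactors = TRUE keeps letter as a factor (R >= 4.0 reads strings as character by default)
letter_data <- read.csv(letter_reco_path, header = FALSE, col.names = colnames, stringsAsFactors = TRUE)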
training_index <- seq.int(1, 16000)
training <- letter_data[training_index, ]
testing <- letter_data[-training_index, ]
Train the corresponding model with e1071's svm function, using a Gaussian kernel and no scaling of the data, i.e. scale=FALSE.
library(e1071)
svm_model2 <- svm(
letter ~ .,
data = training,
kernel = "radial",  # was misspelled "kernal" in the original call
type = "C-classification",
scale = FALSE
)
Then predict on the test set with predict. The accuracy is 0.9725, close to sklearn's.
svm_pred2 <- predict(svm_model2, newdata = testing)
prop.table(table(svm_pred2 == testing$letter))
FALSE TRUE
0.0275 0.9725
According to the documentation, sklearn's SVC does not scale the data, whereas e1071's svm scales it by default. Keep this difference in mind in practice.
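The opposite direction is also possible: rather than turning scaling off in R, one can add a scaling step on the Python side. The sketch below is one way to do that, under stated assumptions. It uses small synthetic data (hypothetical, not the letter dataset) and chains sklearn's `StandardScaler` with `SVC` in a pipeline. It is not an exact replica of e1071's default, since `StandardScaler` divides by the population standard deviation while R's `scale` uses the sample standard deviation, so small numerical differences should be expected.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data: four features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = (X[:, 0] > 0).astype(int)  # the label depends only on the smallest-scale feature

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# gamma="auto" means 1 / n_features.
unscaled_model = SVC(kernel="rbf", gamma="auto").fit(X_train, y_train)

# The pipeline learns the scaling on the training data only and
# reapplies the same transformation at prediction time.
scaled_model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", gamma="auto"),
).fit(X_train, y_train)

acc_unscaled = unscaled_model.score(X_test, y_test)
acc_scaled = scaled_model.score(X_test, y_test)
print(acc_unscaled, acc_scaled)
```

The two accuracies will generally differ, and by how much depends on the data. The point is only that the scaling choice is part of the model and should be made explicit in both languages before comparing results.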