These are my study notes on Learning Data Mining with Python, recording some of the key points I came across while working through it. The book uses Python 3.4.
Python packages used in the book
scikit-learn: http://scikit-learn.org/stable/
The scikit-learn package is a machine learning library, written in Python. It contains numerous algorithms, datasets, utilities, and frameworks for performing machine learning.
"Data mining provides a way for a computer to learn how to make decisions with data. "
Datasets comprise two aspects:
- Samples that are objects in the real world. This can be a book, photograph, animal, person, or any other object.
- Features that are descriptions of the samples in our dataset. Features could be the length, frequency of a given word, number of legs, date it was created, and so on.
Support is the number of times that a rule occurs in a dataset, which is computed by simply counting the number of samples that the rule is valid for.
While the support measures how often a rule exists, confidence measures how accurate the rule is when it can be used.
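As a minimal sketch of these two measures, the snippet below counts support and confidence for a rule of the form "if premise then conclusion" on a tiny made-up dataset. The data, the column indices, and the helper name `rule_stats` are assumptions for illustration and are not taken from the book.

```python
# Each row is one sample (a shopping basket); each column is a feature
# (1 = item bought, 0 = not bought). The data is made up for illustration.
X = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
]

def rule_stats(X, premise, conclusion):
    """Return (support, confidence) for the rule 'premise -> conclusion'.

    premise and conclusion are column indices (hypothetical helper).
    """
    # Samples where the premise holds, so the rule can be applied.
    applicable = [row for row in X if row[premise] == 1]
    # Support: samples where the rule is valid (premise and conclusion hold).
    support = sum(1 for row in applicable if row[conclusion] == 1)
    # Confidence: fraction of applicable samples where the rule was right.
    confidence = support / len(applicable) if applicable else 0.0
    return support, confidence

support, confidence = rule_stats(X, premise=0, conclusion=1)
print(support, confidence)
```

Here the rule applies to four samples and holds for three of them, so the support is 3 and the confidence is 0.75.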
Overfitting is the problem of creating a model that classifies our training dataset very well, but performs poorly on new samples. *The solution is quite simple: never use training data to test your algorithm.*
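The scikit-learn library contains a function to split data into training and testing components: from sklearn.cross_validation import train_test_split. (In scikit-learn 0.18 and later this function lives in sklearn.model_selection; the sklearn.cross_validation module was removed in 0.20.)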
'''
This function will split the dataset into two sub-datasets,
according to a given ratio (which by default uses 25 percent of the dataset for testing).
Xd_train contains our data for training and Xd_test contains our data for testing.
y_train and y_test give the corresponding class values for these datasets.
random_state: setting the random state will give the same split every time the same value is entered.
It will look random, but the algorithm used is deterministic and the output will be consistent.
'''
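# test_size is the proportion of samples placed in the test set. If it is an integer, it is the number of test samples.
# random_state is the seed for the random number generator. Different seeds give different random splits; the same seed gives the same split.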
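# The book imports from sklearn.cross_validation; current scikit-learn uses sklearn.model_selection.
from sklearn.model_selection import train_test_split
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y, test_size=0.4, random_state=14)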
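To see why a fixed seed gives a repeatable split, here is a simplified stand-in written with only the standard library. It is not scikit-learn's implementation: the function name `simple_split` and its rounding rule are assumptions made for this sketch, and only the meaning of the two parameters mirrors the notes above.

```python
import random

def simple_split(X, y, test_size=0.25, random_state=None):
    """Simplified stand-in for a train/test split (not scikit-learn's code)."""
    n = len(X)
    # A float is read as a proportion of the samples, an int as a count.
    n_test = test_size if isinstance(test_size, int) else int(round(n * test_size))
    indices = list(range(n))
    # Seeding a private generator makes the shuffle reproducible.
    random.Random(random_state).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up data: 10 samples with 2 features each.
X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]

first = simple_split(X, y, test_size=0.4, random_state=14)
second = simple_split(X, y, test_size=0.4, random_state=14)
print(first == second)               # same seed, same split
print(len(first[0]), len(first[1]))  # sizes of the training and testing parts
```

With 10 samples and test_size=0.4, four samples go to the test set and six to the training set, and calling the function again with the same seed returns exactly the same partition.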
scikit-learn estimators
- support vector machines (SVM)
- random forests
- neural networks
- Estimators are scikit-learn's abstraction.
- Estimators are used for classification.
Estimators have the following two main functions:
- fit(): This performs the training of the algorithm and sets internal parameters. It takes two inputs, the training sample dataset and the corresponding classes for those samples.
- predict(): This predicts the class of the testing samples that is given as input. This function returns an array with the predictions of each input testing sample.
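The fit()/predict() convention can be illustrated with a toy classifier written from scratch. The class below is a hypothetical 1-nearest-neighbour estimator invented for this sketch, not a scikit-learn class. It returns a plain Python list, whereas scikit-learn estimators return a NumPy array.

```python
class NearestNeighborEstimator:
    """Toy 1-nearest-neighbour classifier following the fit()/predict() convention."""

    def fit(self, X, y):
        # "Training" here just stores the samples and their classes
        # as the estimator's internal parameters.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        # Return one predicted class per input sample.
        predictions = []
        for sample in X:
            # Squared Euclidean distance to every stored training sample.
            distances = [
                sum((a - b) ** 2 for a, b in zip(sample, train_row))
                for train_row in self.X_
            ]
            nearest = distances.index(min(distances))
            predictions.append(self.y_[nearest])
        return predictions

# Made-up data: two well-separated groups of points.
X_train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]]
y_train = [0, 0, 1, 1]
X_test = [[0.9, 1.1], [5.2, 4.9]]

estimator = NearestNeighborEstimator()
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
print(y_pred)
```

The calling pattern is the same one used with real scikit-learn estimators: construct the object, call fit() on the training data, then call predict() on the testing data.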