- Algorithms are the core; data and computation are the foundation
-
Data types
1. Discrete data
2. Continuous data
-
Machine learning algorithm categories
Supervised learning: feature values + target values
Unsupervised learning: feature values only, no target values
Classification: the target value is discrete
Regression: the target value is continuous
-
Classification algorithms
k-nearest neighbors (k-NN): classify a sample by the classes of its nearest neighbors
k-NN distance formula: typically Euclidean distance, d(a, b) = sqrt(sum_i (a_i - b_i)^2)
Note: k-NN requires feature standardization beforehand
-
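A small sketch of why standardization matters for k-NN (the income/age numbers are made-up toy data, not from these notes): with raw features, the large-scale income column dominates the Euclidean distance, so "nearest" effectively means "similar income"; after StandardScaler both features contribute on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income (in currency units) and age (years).
X = np.array([[50000.0, 25.0],
              [51000.0, 60.0],
              [90000.0, 26.0]])

# Raw Euclidean distances are dominated by the income column.
d_raw_01 = np.linalg.norm(X[0] - X[1])  # rows differ mainly in age
d_raw_02 = np.linalg.norm(X[0] - X[2])  # rows differ mainly in income
print(d_raw_01 < d_raw_02)  # True: the income gap swamps the age gap

# After standardization, each column has mean 0 and unit variance,
# so both features contribute comparably to the distance.
X_std = StandardScaler().fit_transform(X)
d_std_01 = np.linalg.norm(X_std[0] - X_std[1])
d_std_02 = np.linalg.norm(X_std[0] - X_std[2])
print(d_std_01, d_std_02)  # now the two distances are of similar magnitude
```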
sklearn k-NN API: sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
k-NN example:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

def knncls():
    """Predict a user's check-in location with k-NN"""
    # 1. Read the data
    data = pd.read_csv("./data/FBlocation/train.csv")
    # print(data.head(10))  # print the first ten rows
    # 2. Process the data
    # 2.1 Shrink the dataset with a range query
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
    # 2.2 Parse the timestamp column
    time_value = pd.to_datetime(data['time'], unit='s')
    # print(time_value)
    # 2.3 Convert to a DatetimeIndex so the date parts are accessible
    time_value = pd.DatetimeIndex(time_value)
    # 2.4 Construct some time-based features
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday
    # 2.5 Drop the raw time feature
    data = data.drop(['time'], axis=1)  # drop by column
    # 2.6 Drop target locations with fewer than n check-ins
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]
    # 2.7 Separate the feature values and the target value
    y = data['place_id']
    x = data.drop(['place_id'], axis=1)
    # 2.8 Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # 3. Feature engineering (standardization)
    std = StandardScaler()
    # Standardize the features of the training and test sets
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    # 4. Run the algorithm
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(x_train, y_train)
    # Make predictions
    y_predict = knn.predict(x_test)
    print("Predicted check-in locations:", y_predict)
    # Report accuracy
    print("Prediction accuracy:", knn.score(x_test, y_test))
    return None

if __name__ == "__main__":
    knncls()
-
Naive Bayes
Related probability knowledge: Bayes' theorem, P(C|F) = P(F|C) * P(C) / P(F)
Naive Bayes: suited to data whose features are independent (the "naive" assumption)
- API:
sklearn.naive_bayes.MultinomialNB(alpha=1.0)
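A minimal, self-contained MultinomialNB sketch on a made-up four-document corpus (the documents and labels are illustrative assumptions, not the notes' newsgroups data); it uses word counts rather than TF-IDF to keep the example small:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: two "spam" and two "work" documents.
docs = ["free prize money win", "win cash prize now",
        "meeting schedule agenda", "project meeting notes"]
labels = ["spam", "spam", "work", "work"]

cv = CountVectorizer()
X = cv.fit_transform(docs)  # word-count feature matrix

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
clf.fit(X, labels)

# New text shares words only with the spam documents.
pred = clf.predict(cv.transform(["win a free prize"]))
print(pred[0])  # spam
```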
Naive Bayes example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def naviebayes():
    """Text classification with naive Bayes"""
    # Fetch the data
    news = fetch_20newsgroups(subset='all')
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)
    # Feature extraction on the datasets
    tf = TfidfVectorizer()
    # Compute per-article TF-IDF statistics from the training-set vocabulary
    x_train = tf.fit_transform(x_train)
    print(tf.get_feature_names_out())
    # Only transform the test set, reusing the vocabulary learned from training
    x_test = tf.transform(x_test)
    # Predict with naive Bayes
    mlt = MultinomialNB(alpha=1.0)
    print(x_train.toarray())
    mlt.fit(x_train, y_train)
    y_predict = mlt.predict(x_test)
    print("Predicted article categories:", y_predict)
    # Report accuracy
    print("Accuracy:", mlt.score(x_test, y_test))
    return None

if __name__ == "__main__":
    naviebayes()
Summary of naive Bayes classification:
-
Classification model evaluation metrics:
1. Accuracy
2. Precision
3. Recall
- Classification model evaluation API
API:sklearn.metrics.classification_report
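A quick illustration of the three metrics plus classification_report (the y_true/y_pred labels are hypothetical values made up for the demo):

```python
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score

# Hypothetical true vs. predicted labels: TP=3, FP=1, FN=1, TN=3.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/total = 6/8 = 0.75
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print("recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
# classification_report prints precision/recall/f1 per class in one table.
print(classification_report(y_true, y_pred, target_names=["neg", "pos"]))
```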
- Model selection
1. Cross-validation: split the training data into training and validation folds; the test set is not included
2. Hyperparameter search: grid search API
API:sklearn.model_selection.GridSearchCV
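A self-contained GridSearchCV sketch using sklearn's bundled iris dataset (an assumption for runnability, since the notes' check-in CSV is not available here); note that score() is a method and must be called with the test data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled toy dataset stands in for the real data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate hyperparameter values to search over.
param = {"n_neighbors": [3, 5, 10]}
gc = GridSearchCV(KNeighborsClassifier(), param_grid=param, cv=5)
gc.fit(X_train, y_train)  # fits one model per parameter value per fold

print("best CV score:", gc.best_score_)
print("best params  :", gc.best_params_)
print("test accuracy:", gc.score(X_test, y_test))  # refit best model, scored on held-out data
```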
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
import pandas as pd

def knncls():
    """Predict a user's check-in location with k-NN"""
    # 1. Read the data
    data = pd.read_csv("./data/FBlocation/train.csv")
    # print(data.head(10))  # print the first ten rows
    # 2. Process the data
    # 2.1 Shrink the dataset with a range query
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
    # 2.2 Parse the timestamp column
    time_value = pd.to_datetime(data['time'], unit='s')
    # print(time_value)
    # 2.3 Convert to a DatetimeIndex so the date parts are accessible
    time_value = pd.DatetimeIndex(time_value)
    # 2.4 Construct some time-based features
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday
    # 2.5 Drop the raw time feature
    data = data.drop(['time'], axis=1)  # drop by column
    # 2.6 Drop target locations with fewer than n check-ins
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]
    # 2.7 Separate the feature values and the target value
    y = data['place_id']
    x = data.drop(['place_id'], axis=1)
    # 2.8 Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # 3. Feature engineering (standardization)
    std = StandardScaler()
    # Standardize the features of the training and test sets
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    # 4. Run the algorithm with a grid search
    knn = KNeighborsClassifier()
    # Candidate parameter values to search over
    param = {"n_neighbors": [3, 5, 10]}
    # Run the grid search with 10-fold cross-validation
    gc = GridSearchCV(knn, param_grid=param, cv=10)
    gc.fit(x_train, y_train)
    # Report the results
    print("Accuracy on the test set:", gc.score(x_test, y_test))
    print("Best result in cross-validation:", gc.best_score_)
    print("Best model selected:", gc.best_estimator_)
    print("Cross-validation results for each hyperparameter setting:", gc.cv_results_)
    return None

if __name__ == "__main__":
    knncls()
-
Decision trees
-
Criteria for splitting a decision tree
1. Information gain: the reduction in information entropy obtained once a feature's value is known
Example:
Gini index: produces finer-grained splits
-
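The information-gain idea above can be sketched numerically (the labels and the split are a made-up toy example): compute the parent entropy, the weighted entropy of the children after a split, and take the difference.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Toy target: 3 "yes" and 3 "no" labels, perfectly mixed.
y = ["yes", "yes", "no", "no", "yes", "no"]
# A hypothetical binary feature happens to separate the classes completely.
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]

h_parent = entropy(y)  # 1.0 bit: maximally uncertain
h_children = (len(left) / len(y)) * entropy(left) + (len(right) / len(y)) * entropy(right)
gain = h_parent - h_children  # reduction in entropy = information gain
print(gain)  # 1.0: the split removes all uncertainty
```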
API: sklearn.tree.DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd

def decision():
    """Predict Titanic survival with a decision tree"""
    # 1. Fetch the data
    titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titan.txt")
    # 2. Process the data: pick out the feature values and the target value
    x = titan[['pclass', 'age', 'sex']].copy()  # copy to avoid chained-assignment warnings
    y = titan['survived']
    # 2.1 Fill missing ages with the mean age
    x['age'] = x['age'].fillna(x['age'].mean())
    # 2.2 Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # 3. Feature engineering: one-hot encode the categorical features
    dv = DictVectorizer(sparse=False)
    x_train = dv.fit_transform(x_train.to_dict(orient="records"))
    x_test = dv.transform(x_test.to_dict(orient="records"))
    # 4. Predict with the decision tree
    dec = DecisionTreeClassifier()
    dec.fit(x_train, y_train)
    # 4.1 Report accuracy
    print("Prediction accuracy:", dec.score(x_test, y_test))
    return None

if __name__ == "__main__":
    decision()
-