Machine Learning Basics
- The machine learning development workflow
- Classification of machine learning algorithms
- What a machine learning model is
A few points to make clear up front:
Algorithms are the core; data and compute are the foundation.
- Know where you fit in
The algorithm design for most complex models is done by algorithm engineers; I'm just a library caller. That work mostly means:
- Analyzing lots of data
- Analyzing the specific business problem
- Applying common algorithms
- Feature engineering, parameter tuning, and optimization
What we should do
- Learn to analyze the problem: what is the goal of using a machine learning algorithm, and what task do we want it to perform?
- Master the basic idea of each algorithm, and learn to match a problem with the right algorithm
- Learn to solve problems using libraries and frameworks
What a machine learning model is
Definition: a mapping that takes input values to output values.
How to choose a machine learning algorithm
Data types
Discrete data: data obtained by counting the number of individuals in each category, also called count data. All such values are integers; they cannot be subdivided further, and their precision cannot be increased.
Continuous data: the variable can take any value within some range, i.e., its values vary continuously, e.g., length, time, or mass. Such values are usually not integers and contain a fractional part.
Note: just remember one point: discrete data are indivisible within an interval, while continuous data are divisible within it.
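As a quick illustration (a hypothetical example, not from the original notes), the distinction usually shows up as integer versus floating-point values:

```python
import numpy as np

# Discrete (count) data: numbers of customers per category, whole numbers only
customer_counts = np.array([12, 7, 30])

# Continuous data: measured petal lengths in cm, any value within a range
petal_lengths = np.array([1.4, 4.7, 5.95])

print(customer_counts.dtype)  # int64: cannot be meaningfully subdivided
print(petal_lengths.dtype)    # float64: arbitrarily fine-grained
```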
How the data type matters
The type of the data determines how a machine learning model handles different kinds of problems.
Classification of machine learning algorithms
Supervised learning: feature values + target values
Unsupervised learning: feature values only
Classification: the target value is discrete
Regression: the target value is continuous
- Supervised learning
  - Classification
    - k-nearest neighbors
    - Naive Bayes
    - Decision trees and random forests
    - Logistic regression
    - Neural networks
  - Regression
    - Linear regression
    - Ridge regression
  - Sequence labeling
    - Hidden Markov models
- Unsupervised learning
  - Clustering
    - k-means
Supervised learning
Supervised learning learns or builds a model from input data and uses that model to predict outcomes for new data. The input data consist of input feature values plus target values. The model's output can be a continuous value (called regression) or one of a finite set of discrete values (called classification).
Unsupervised learning
Unsupervised learning likewise learns or builds a model from input data and uses it to infer results for new data, but the input data consist of feature values only.
Classification problems
Concept: classification is a core problem in supervised learning. When the output variable takes a finite number of discrete values, the prediction problem becomes a classification problem. The most basic case is binary classification, a yes/no judgment in which one of two classes is chosen as the prediction.
Applications of classification
Classification sorts data into categories according to their characteristics, so it has wide applications in many fields:
- In banking, build a customer classification model that groups customers by loan risk
- In image processing, classification can detect whether a face appears in an image, identify animal species, and so on
- In handwriting recognition, classification can identify handwritten digits
- Text classification, where the text may be news reports, web pages, e-mails, or academic papers
- …
Regression problems
Concept: regression is the other major problem in supervised learning. Regression predicts the relationship between input and output variables, and the output is a continuous value.
Applications of regression
Regression is also widely used across many fields:
- House price prediction: predict prices from a region's historical housing data
- Financial information, such as daily stock movements
- …
Let's label some concrete problems by type:
- Predicting tomorrow's temperature? A regression problem
- Predicting whether tomorrow will be overcast, sunny, or rainy? A classification problem
- Predicting a person's age from a face photo? A regression problem
- Face recognition? A classification problem
Machine learning development workflow
Data:
- data the company already has
- data obtained through partnerships
- purchased data
Building the model: the data type determines the kind of application
- Start from the raw data and clarify what problem we are solving
- Basic data processing: use pandas to clean the data (missing values, merging tables, ...)
- Feature engineering (process the features)
- Decide between classification and regression; the model = algorithm + data, so find a suitable algorithm to make predictions
- Evaluate the model and judge how well it performs
- If it does not pass: switch algorithms or parameters, or redo the feature engineering
- If it passes: put it into production, served as an API
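Strung together in code, the middle steps might look like this; a minimal sketch in which the built-in iris data stands in for already-cleaned company data (the pipeline steps and parameters are illustrative choices, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data: a built-in dataset stands in for company data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Feature engineering + model chained into a single pipeline
model = Pipeline([
    ("scale", StandardScaler()),                # feature processing
    ("clf", LogisticRegression(max_iter=200)),  # the chosen algorithm
])
model.fit(X_train, y_train)

# Evaluation: if the score is not acceptable, change the algorithm,
# its parameters, or the feature engineering, and try again
print(model.score(X_test, y_test))
```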
sklearn datasets
Splitting a dataset
Introduction to the sklearn dataset API
sklearn classification datasets
sklearn regression datasets
Can the data used to evaluate a model be exactly the same as the data used to build it? No!
We train on one portion of the data and evaluate on data the model has never seen.
So the data are split into two parts,
the training set and the test set. Their typical split ratios:
Training set | Test set |
---|---|
70% | 30% |
80% | 20% |
75% | 25% |
Splitting the dataset
A machine learning dataset is generally split into two parts:
Training data: used to train and build the model
Test data: used during model validation to assess whether the model works
sklearn dataset-splitting API
sklearn.model_selection.train_test_split
The problem is that preparing a dataset yourself is time-consuming, labor-intensive, and not necessarily realistic.
Introduction to the scikit-learn dataset API
sklearn.datasets
- loads and fetches popular datasets
datasets.load_*()
- loads small datasets that are bundled inside the datasets module
datasets.fetch_*(data_home=None)
- fetches large datasets that must be downloaded from the network; the first argument, data_home, is the download directory and defaults to ~/scikit_learn_data/
Return type of the dataset loaders
Both load and fetch return a datasets.base.Bunch (a dict-like object)
data: the feature array, a two-dimensional numpy.ndarray of shape [n_samples, n_features]
target: the label array, a one-dimensional numpy.ndarray of length n_samples
DESCR: a description of the dataset
feature_names: the feature names (absent for the news, handwritten-digit, and regression datasets)
target_names: the label names (absent for regression datasets)
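These fields can be read as attributes or as dict keys; a small sketch using the iris dataset that is loaded just below:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris["data"].shape)  # (150, 4); a Bunch also supports dict-style access
```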
sklearn classification datasets
Let's look at the data format sklearn returns.
sklearn.datasets.load_iris() loads and returns the iris dataset
Name | Count |
---|---|
Classes | 3 |
Features | 4 |
Samples | 150 |
Samples per class | 50 |
sklearn.datasets.load_digits() loads and returns the handwritten digits dataset
Name | Count |
---|---|
Classes | 10 |
Features | 64 |
Samples | 1797 |
from sklearn.datasets import load_iris
li = load_iris()
print("獲取特征值")
print(li.data)
print("目標(biāo)值")
print(li.target)
print(li.DESCR)
運(yùn)行結(jié)果
獲取特征值
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
Target values
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
Splitting the dataset
sklearn.model_selection.train_test_split(*arrays, **options)
x: the dataset's feature values
y: the dataset's label values
test_size: the size of the test set, usually a float
random_state: random seed; different seeds produce different random samples, and the same seed reproduces the same sample
return: training-set features, test-set features, training labels, test labels (drawn randomly by default)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
li = load_iris()
# Note the return values: training set x_train, y_train
# test set x_test, y_test
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)
print("Training-set features and targets:", x_train, y_train)
print("Test-set features and targets:", x_test, y_test)
Output:
Training-set features and targets: [[6.4 2.9 4.3 1.3]
[4.3 3. 1.1 0.1]
[5.5 2.4 3.8 1.1]
[6.2 3.4 5.4 2.3]
[6.1 3. 4.9 1.8]
[5.7 3.8 1.7 0.3]
[6.3 2.3 4.4 1.3]
[4.6 3.4 1.4 0.3]
[6.5 3. 5.5 1.8]
[5.4 3.9 1.7 0.4]
[5.1 3.4 1.5 0.2]
[6. 2.7 5.1 1.6]
[5.1 3.3 1.7 0.5]
[4.9 2.5 4.5 1.7]
[6.7 3.3 5.7 2.1]
[7.7 3.8 6.7 2.2]
[5.8 2.6 4. 1.2]
[6.7 3. 5. 1.7]
[6.3 3.3 4.7 1.6]
[5. 3.3 1.4 0.2]
[5.8 2.8 5.1 2.4]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[4.7 3.2 1.3 0.2]
[5.9 3.2 4.8 1.8]
[5.9 3. 4.2 1.5]
[6.7 3.1 5.6 2.4]
[4.4 3.2 1.3 0.2]
[5.1 3.8 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.9 1.3 0.4]
[5.1 3.8 1.9 0.4]
[5.4 3.4 1.5 0.4]
[5.1 3.8 1.5 0.3]
[5.6 2.9 3.6 1.3]
[6.5 3.2 5.1 2. ]
[5. 3.5 1.6 0.6]
[7.2 3.6 6.1 2.5]
[7. 3.2 4.7 1.4]
[4.9 2.4 3.3 1. ]
[6.3 2.9 5.6 1.8]
[6.4 2.8 5.6 2.1]
[6.1 3. 4.6 1.4]
[6.3 3.3 6. 2.5]
[6.6 3. 4.4 1.4]
[5.6 3. 4.1 1.3]
[6.3 2.5 4.9 1.5]
[6. 2.2 4. 1. ]
[5.3 3.7 1.5 0.2]
[6.8 2.8 4.8 1.4]
[4.8 3. 1.4 0.3]
[4.8 3.4 1.6 0.2]
[7.2 3. 5.8 1.6]
[5.5 4.2 1.4 0.2]
[5.5 2.6 4.4 1.2]
[7.2 3.2 6. 1.8]
[5.5 3.5 1.3 0.2]
[4.6 3.2 1.4 0.2]
[6. 2.9 4.5 1.5]
[4.9 3.1 1.5 0.1]
[5.4 3. 4.5 1.5]
[5. 2.3 3.3 1. ]
[6.3 2.5 5. 1.9]
[5.8 2.7 5.1 1.9]
[6.9 3.1 5.4 2.1]
[6.7 3.3 5.7 2.5]
[6. 3. 4.8 1.8]
[5.2 4.1 1.5 0.1]
[7.9 3.8 6.4 2. ]
[4.8 3.4 1.9 0.2]
[7.4 2.8 6.1 1.9]
[5. 3.4 1.5 0.2]
[4.8 3. 1.4 0.1]
[6. 3.4 4.5 1.6]
[6.7 3. 5.2 2.3]
[5.7 4.4 1.5 0.4]
[7.1 3. 5.9 2.1]
[7.3 2.9 6.3 1.8]
[5.7 2.9 4.2 1.3]
[6.4 3.1 5.5 1.8]
[6.4 2.7 5.3 1.9]
[5.4 3.7 1.5 0.2]
[5.7 2.8 4.5 1.3]
[6.3 2.8 5.1 1.5]
[6.3 3.4 5.6 2.4]
[4.6 3.1 1.5 0.2]
[5.8 2.7 4.1 1. ]
[6.4 2.8 5.6 2.2]
[6.7 3.1 4.7 1.5]
[5.8 2.7 5.1 1.9]
[6.2 2.2 4.5 1.5]
[5.1 3.7 1.5 0.4]
[4.9 3. 1.4 0.2]
[6.5 2.8 4.6 1.5]
[5.5 2.5 4. 1.3]
[5.8 4. 1.2 0.2]
[5.6 3. 4.5 1.5]
[5. 3.2 1.2 0.2]
[7.7 2.8 6.7 2. ]
[5.7 3. 4.2 1.2]
[5.4 3.4 1.7 0.2]
[7.7 2.6 6.9 2.3]
[6.4 3.2 5.3 2.3]
[5.5 2.3 4. 1.3]
[5.2 3.4 1.4 0.2]
[4.6 3.6 1. 0.2]
[5.1 3.5 1.4 0.2]
[5. 3.5 1.3 0.3]
[6.1 2.8 4.7 1.2]
[5.8 2.7 3.9 1.2]
[7.6 3. 6.6 2.1]
[6.6 2.9 4.6 1.3]] [1 0 1 2 2 0 1 0 2 0 0 1 0 2 2 2 1 1 1 0 2 2 2 0 1 1 2 0 0 0 0 0 0 0 1 2 0
2 1 1 2 2 1 2 1 1 1 1 0 1 0 0 2 0 1 2 0 0 1 0 1 1 2 2 2 2 2 0 2 0 2 0 0 1
2 0 2 2 1 2 2 0 1 2 2 0 1 2 1 2 1 0 0 1 1 0 1 0 2 1 0 2 2 1 0 0 0 0 1 1 2
1]
Test-set features and targets: [[6.7 2.5 5.8 1.8]
[6.1 2.6 5.6 1.4]
[6.1 2.9 4.7 1.4]
[6.8 3.2 5.9 2.3]
[5. 3.6 1.4 0.2]
[5.6 2.5 3.9 1.1]
[6.8 3. 5.5 2.1]
[4.9 3.1 1.5 0.2]
[4.9 3.6 1.4 0.1]
[6.5 3. 5.8 2.2]
[4.5 2.3 1.3 0.3]
[6.4 3.2 4.5 1.5]
[5. 3.4 1.6 0.4]
[6.2 2.8 4.8 1.8]
[6.7 3.1 4.4 1.4]
[4.7 3.2 1.6 0.2]
[5.2 2.7 3.9 1.4]
[5.1 2.5 3. 1.1]
[6.3 2.7 4.9 1.8]
[5.6 2.7 4.2 1.3]
[6.9 3.1 4.9 1.5]
[6.1 2.8 4. 1.3]
[5.7 2.5 5. 2. ]
[6.5 3. 5.2 2. ]
[4.4 2.9 1.4 0.2]
[4.4 3. 1.3 0.2]
[6.9 3.1 5.1 2.3]
[5.1 3.5 1.4 0.3]
[5.2 3.5 1.5 0.2]
[6. 2.2 5. 1.5]
[7.7 3. 6.1 2.3]
[5. 3. 1.6 0.2]
[5.7 2.8 4.1 1.3]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.7 1. ]
[5.9 3. 5.1 1.8]
[5. 2. 3.5 1. ]
[6.2 2.9 4.3 1.3]] [2 2 1 2 0 1 2 0 0 2 0 1 0 2 1 0 1 1 2 1 1 1 2 2 0 0 2 0 0 2 2 0 1 1 1 2 1
1]
A large dataset for classification
sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')
- subset: 'train', 'test', or 'all'; optional, selects which part of the dataset to load:
the training set with 'train', the test set with 'test', or both with 'all'
datasets.clear_data_home(data_home=None)
- removes the downloaded data from that directory
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
print(news.data)
print(news.target)
This one has tens of thousands of samples... output not shown.
sklearn regression datasets
sklearn.datasets.load_boston() loads and returns the Boston house-price dataset
sklearn.datasets.load_diabetes() loads and returns the diabetes dataset
from sklearn.datasets import load_boston
lb = load_boston()
print("Feature values")
print(lb.data)
print("Target values")
print(lb.target)
print(lb.DESCR)
Output:
Feature values
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
[2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
[2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
...
[6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
[1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
[4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
Target values
[24. 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15. 18.9 21.7 20.4
18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
18.4 21. 12.7 14.5 13.2 13.1 13.5 18.9 20. 21. 24.7 30.8 34.9 26.6
25.3 24.7 21.2 19.3 20. 16.6 14.4 19.4 19.7 20.5 25. 23.4 18.9 35.4
24.7 31.6 23.3 19.6 18.7 16. 22.2 25. 33. 23.5 19.4 22. 17.4 20.9
24.2 21.7 22.8 23.4 24.1 21.4 20. 20.8 21.2 20.3 28. 23.9 24.8 22.9
23.9 26.6 22.5 22.2 23.6 28.7 22.6 22. 22.9 25. 20.6 28.4 21.4 38.7
43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22. 20.3 20.5 17.3 18.8 21.4
15.7 16.2 18. 14.3 19.2 19.6 23. 18.4 15.6 18.1 17.4 17.1 13.3 17.8
14. 14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
17. 15.6 13.1 41.3 24.3 23.3 27. 50. 50. 50. 22.7 25. 50. 23.8
23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
37.9 32.5 26.4 29.6 50. 32. 29.8 34.9 37. 30.5 36.4 31.1 29.1 50.
33.3 30.3 34.6 34.9 32.9 24.1 42.3 48.5 50. 22.6 24.4 22.5 24.4 20.
21.7 19.3 22.4 28.1 23.7 25. 23.3 28.7 21.5 23. 26.7 21.7 27.5 30.1
44.8 50. 37.6 31.6 46.7 31.5 24.3 31.7 41.7 48.3 29. 24. 25.1 31.5
23.7 23.3 22. 20.1 22.2 23.7 17.6 18.5 24.3 20.5 24.5 26.2 24.4 24.8
29.6 42.8 21.9 20.9 44. 50. 36. 30.1 33.8 43.1 48.8 31. 36.5 22.8
30.7 50. 43.5 20.7 21.1 25.2 24.4 35.2 32.4 32. 33.2 33.1 29.1 35.1
45.4 35.4 46. 50. 32.2 22. 20.1 23.2 22.3 24.8 28.5 37.3 27.9 23.9
21.7 28.6 27.1 20.3 22.5 29. 24.8 22. 26.4 33.1 36.1 28.4 33.4 28.2
22.8 20.3 16.1 22.1 19.4 21.6 23.8 16.2 17.8 19.8 23.1 21. 23.8 23.1
20.4 18.5 25. 24.6 23. 22.2 19.3 22.6 19.8 17.1 19.4 22.2 20.7 21.1
19.5 18.5 20.6 19. 18.7 32.7 16.5 23.9 31.2 17.5 17.2 23.1 24.5 26.6
22.9 24.1 18.6 30.1 18.2 20.6 17.8 21.7 22.7 22.6 25. 19.9 20.8 16.8
21.9 27.5 21.9 23.1 50. 50. 50. 50. 50. 13.8 13.8 15. 13.9 13.3
13.1 10.2 10.4 10.9 11.3 12.3 8.8 7.2 10.5 7.4 10.2 11.5 15.1 23.2
9.7 13.8 12.7 13.1 12.5 8.5 5. 6.3 5.6 7.2 12.1 8.3 8.5 5.
11.9 27.9 17.2 27.5 15. 17.2 17.9 16.3 7. 7.2 7.5 10.4 8.8 8.4
16.7 14.2 20.8 13.4 11.7 8.3 10.2 10.9 11. 9.5 14.5 14.1 16.1 14.3
11.7 13.4 9.6 8.7 8.4 12.8 10.5 17.1 18.4 15.4 10.8 11.8 14.9 12.6
14.1 13. 13.4 15.2 16.1 17.8 14.9 14.1 12.7 13.5 14.9 20. 16.4 17.7
19.5 20.2 21.4 19.9 19. 19.1 19.1 20.1 19.9 19.6 23.2 29.8 13.8 13.3
16.7 12. 14.6 21.4 23. 23.7 25. 21.8 20.6 21.2 19.1 20.6 15.2 7.
8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
22. 11.9]
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
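One caveat worth noting: load_boston was deprecated and then removed in scikit-learn 1.2 over ethical concerns with the B attribute, so the snippet above fails on recent versions. A common replacement for regression demos is the California housing dataset; a minimal sketch:

```python
from sklearn.datasets import fetch_california_housing

# A fetch_* dataset: downloaded on first use into ~/scikit_learn_data/
housing = fetch_california_housing()
print(housing.data.shape)   # (20640, 8)
print(housing.target[:5])   # median house values of California districts
```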
Transformers and estimators
Transformers
Think back to the feature-engineering steps we did earlier:
1. Instantiate (what gets instantiated is a transformer class, a Transformer)
2. Call fit_transform (for building a classification term-frequency matrix from documents, fit and transform cannot be called at the same time)
fit_transform(): transforms the input data directly; equivalent to fit() + transform()
fit(): takes the input data and computes the statistics it needs, but transforms nothing yet
transform(): performs the actual data transformation
In [1]: from sklearn.preprocessing import StandardScaler
In [2]: s = StandardScaler()
In [3]: s.fit_transform([[1, 2, 3], [4, 5, 6]])
Out[3]:
array([[-1., -1., -1.],
[ 1., 1., 1.]])
In [4]: ss = StandardScaler()
In [5]: ss.fit([[1, 2, 3], [4, 5, 6]])
Out[5]: StandardScaler(copy=True, with_mean=True, with_std=True)
In [6]: ss.transform([[1, 2, 3], [4, 5, 6]])
Out[6]:
array([[-1., -1., -1.],
[ 1., 1., 1.]])
In [7]: ss.fit([[2, 3, 4], [4, 5, 7]])
Out[7]: StandardScaler(copy=True, with_mean=True, with_std=True)
In [8]: ss.transform([[1, 2, 3], [4, 5, 6]])
Out[8]:
array([[-2. , -2. , -1.66666667],
[ 1. , 1. , 0.33333333]])
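Note what the last two steps show: In [7] refits the scaler on new data, which changes the stored mean and standard deviation, so In [8] maps the same input as before to different values. In other words, fit determines the transformation and transform applies it.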
Estimators
How sklearn implements machine learning algorithms: the estimator
In sklearn, the estimator plays a central role. Classifiers and regressors are both estimators, a family of APIs that implement the algorithms.
- Estimators for classification:
  - sklearn.neighbors: k-nearest neighbors
  - sklearn.naive_bayes: naive Bayes
  - sklearn.linear_model.LogisticRegression: logistic regression
- Estimators for regression:
  - sklearn.linear_model.LinearRegression: linear regression
  - sklearn.linear_model.Ridge: ridge regression
Honestly, the bar for machine learning development is a bit higher. These APIs are not like Web-development APIs, where reading the docs tells you how to use them, say, integrating Alipay's or WeChat's payment interface, which is spelled out clearly. Here, reading the API docs alone doesn't tell you which parameters to pass, so you still have to understand the algorithms.
The estimator workflow
Training set: x_train, y_train
Test set: x_test, y_test
- call fit
- fit(x_train, y_train)
- feed in the test data (x_test, y_test)
- y_predict = predict(x_test)
- prediction accuracy: score(x_test, y_test)
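Put together, the workflow looks like this; a minimal sketch using KNeighborsClassifier on the iris data (the choice of classifier and of n_neighbors is ours, for illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

li = load_iris()
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)

# fit: learn the model from the training set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

# predict: feed in the test-set features
y_predict = knn.predict(x_test)
print("Predictions:", y_predict)

# score: prediction accuracy on the test set
print("Accuracy:", knn.score(x_test, y_test))
```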
"我在人間販賣黃昏,只為收集世間的溫柔去見你"
Macsen Chu