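scikit-learn datasets
This article introduces the dataset utilities in sklearn. The module provides utilities for loading datasets, including methods to load and fetch popular reference datasets, and it also offers some artificial data generators.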
-
sklearn datasets
-
sklearn.datasets
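(1) datasets.load_*()
Loads small datasets that ship with the datasets package.
(2) datasets.fetch_*()
Fetches larger datasets that have to be downloaded from the network. These functions accept a data_home parameter that sets the download directory, which defaults to ~/scikit_learn_data/. To change the default directory, set the SCIKIT_LEARN_DATA environment variable.
(3) datasets.make_*()
Generates datasets locally.
The load_* and fetch_* functions return a datasets.base.Bunch (sklearn.utils.Bunch in recent releases). It is essentially a dict whose key-value pairs can also be accessed as object attributes. Its main attributes are: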
- data: the feature array, a two-dimensional numpy.ndarray of shape n_samples * n_features
- target: the label array, a one-dimensional numpy.ndarray of length n_samples
- DESCR: a description of the dataset
- feature_names: the feature names
- target_names: the label names
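The dict-style and attribute-style access described above can be sketched as follows. This is a minimal illustration that uses load_iris only because it needs no download; the variable names are arbitrary.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict, so its keys can be listed and indexed ...
print(sorted(iris.keys()))

# ... and the same entries can also be read as attributes.
same_object = iris["data"] is iris.data
print(same_object)

# data is 2-D (n_samples, n_features); target is 1-D (n_samples,)
print(iris.data.shape)
print(iris.target.shape)
print(list(iris.target_names))
```

Both access styles return the same underlying array, so either one can be used interchangeably.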
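The dataset directory can be obtained with datasets.get_data_home(); clear_data_home(data_home=None) deletes all downloaded data.
- datasets.get_data_home(data_home=None)
Returns the path of the scikit-learn data directory. This folder is used by some of the larger dataset loaders so that data is not downloaded more than once. By default, the data directory is a folder named "scikit_learn_data" in the user's home folder. Alternatively, it can be set through the "SCIKIT_LEARN_DATA" environment variable, or programmatically by passing an explicit folder path. The '~' symbol is expanded to the user's home folder. If the folder does not exist, it is created automatically.
- sklearn.datasets.clear_data_home(data_home=None)
Deletes the data stored in the data directory.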
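A small sketch of these two helpers, assuming a throwaway folder so that the real ~/scikit_learn_data cache is never touched. The folder name "sk_data" and the temporary base directory are hypothetical choices made for this illustration.

```python
import os
import tempfile
from sklearn.datasets import get_data_home, clear_data_home

# Hypothetical scratch location, used instead of the real cache directory.
base = tempfile.mkdtemp()
custom_home = os.path.join(base, "sk_data")

# get_data_home creates the folder if it does not exist and returns its path.
resolved = get_data_home(data_home=custom_home)
created = os.path.isdir(resolved)
print(resolved)
print(created)

# clear_data_home deletes the data stored under that folder.
clear_data_home(data_home=custom_home)
```

Passing data_home explicitly, as here, is the programmatic alternative to setting the SCIKIT_LEARN_DATA environment variable.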
-
Loading small datasets
For classification
-
sklearn.datasets.load_iris
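The iris dataset records measurements of iris flowers together with the species each flower belongs to. The measurements are sepal length, sepal width, petal length, and petal width. There are three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. The dataset can be used for multi-class classification problems.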
-
Parameters for loading the dataset:
- return_X_y: if True, the data is returned as a (data, target) tuple; the default is False, which returns a dictionary-like object holding all the dataset information (including data and target).
from sklearn.datasets import load_iris
data = load_iris(return_X_y=True)
from sklearn.datasets import load_iris
data = load_iris()
# list the attributes and methods of data
print(dir(data))
print('*'*80)
# show the dataset description
print(data.DESCR)
print('*'*80)
# show the feature names
print(data.feature_names)
#print(data.data)
print('*'*80)
# show the class names
print(data.target_names)
print('*'*80)
print(data.target)
print('*'*80)
# show the target values of the 2nd, 11th, and 101st samples
print(data.target[[1,10, 100]])
['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
********************************************************************************
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
''' (partially omitted) '''
********************************************************************************
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
********************************************************************************
['setosa' 'versicolor' 'virginica']
********************************************************************************
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
********************************************************************************
[0 0 2]
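Since the iris data is described as suitable for multi-class classification, a short sketch of that use may help. The classifier (k-nearest neighbours), the 70/30 split, and the random seed are assumptions chosen for illustration and do not come from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# return_X_y=True yields the (data, target) tuple directly.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples, keeping the three classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit a simple 3-class classifier and measure accuracy on the held-out part.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Any other scikit-learn classifier could be substituted here; the point is only that the (data, target) pair plugs straight into the usual fit/score workflow.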
-
sklearn.datasets.load_digits
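The handwritten digits dataset contains 1797 samples of the handwritten digits 0-9. Each digit is an 8*8 matrix whose values range from 0 to 16 and represent pixel intensity.
Parameters for loading the dataset:
- return_X_y: if True, the data is returned as a (data, target) tuple; the default is False, which returns a dictionary-like object holding all the dataset information (including data and target).
- n_class: the number of classes to return, 10 by default. For example, n_class=5 returns only the samples for digits 0 to 4.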
from sklearn.datasets import load_digits
digits = load_digits(n_class=5, return_X_y=False)
# show the target values of the first 10 samples
print(digits.target[0:10])
[0 1 2 3 4 0 1 2 3 4]
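As a quick check of the n_class behaviour described above, the following sketch lists the distinct labels that come back when n_class=5. The variable names are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits

# Ask for the first five classes only and get the (data, target) tuple.
X, y = load_digits(n_class=5, return_X_y=True)

# Only the digits 0 through 4 should remain in the labels.
classes = np.unique(y)
print(classes)

# Each sample is still a flattened 8*8 image, i.e. 64 features.
print(X.shape[1])
```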
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
digits = load_digits(n_class=10, return_X_y=False)
print(dir(digits))
print('*'*80)
print(digits.DESCR)
print('*'*80)
print(digits.data)
print('*'*80)
print(digits.target_names)
print('*'*80)
print(digits.target[[2,20,200]])
print('*'*80)
print(digits.images.shape)
# draw the second sample as an image and save it
plt.matshow(digits.images[1])
plt.savefig('handwritten_digit_1')
plt.show()
['DESCR', 'data', 'images', 'target', 'target_names']
********************************************************************************
.. _digits_dataset:
Optical recognition of handwritten digits dataset
--------------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 5620
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998
''' (partially omitted) '''
********************************************************************************
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
********************************************************************************
[0 1 2 3 4 5 6 7 8 9]
********************************************************************************
[2 0 1]
********************************************************************************
(1797, 8, 8)
[Figure: the digit image drawn by plt.matshow(digits.images[1])]
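The relationship between digits.images (8*8 matrices) and digits.data (one row per sample) can be checked with a short sketch: flattening each image should reproduce the corresponding row of data. The variable names are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten every 8*8 image into a 64-element row.
flattened = digits.images.reshape(len(digits.images), -1)

# The flattened images should be identical to the data array.
matches = np.array_equal(flattened, digits.data)
print(digits.images.shape, digits.data.shape)
print(matches)

# Pixel intensities stay within the documented 0..16 range.
print(digits.data.min(), digits.data.max())
```

In practice this means images is convenient for plotting, while data is the form that estimators expect.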
For regression
-
sklearn.datasets.load_boston
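The Boston house prices dataset contains 506 records, each describing a housing tract and its surroundings in detail: the town crime rate, the nitric oxide concentration, the average number of rooms per dwelling, the weighted distance to the employment centres, the median value of owner-occupied homes, and so on. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code in this subsection requires an older release.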
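Attribute descriptions for the Boston house prices dataset
CRIM: per capita crime rate by town.
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town.
CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
NOX: nitric oxide concentration.
RM: average number of rooms per dwelling.
AGE: proportion of owner-occupied units built before 1940.
DIS: weighted distance to five Boston employment centres.
RAD: index of accessibility to radial highways.
TAX: full-value property-tax rate per $10,000.
PTRATIO: pupil-teacher ratio by town.
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
LSTAT: percentage of the population with lower status.
MEDV: median value of owner-occupied homes, in thousands of dollars.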
Parameters for loading the dataset:
- return_X_y: if True, the data is returned as a (data, target) tuple; the default is False, which returns a dictionary-like object holding all the dataset information (including data and target).
from sklearn.datasets import load_boston
boston = load_boston()
print(dir(boston))
print('*'*80)
print(boston.DESCR)
print('*'*80)
print(boston.feature_names)
print(boston.data)
print('*'*80)
print(boston.filename)
print('*'*80)
print(boston.target)
['DESCR', 'data', 'feature_names', 'filename', 'target']
********************************************************************************
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
''' (partially omitted) '''
********************************************************************************
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
[2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
[2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
...
[6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
[1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
[4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv
********************************************************************************
[24. 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15. 18.9 21.7 20.4
18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
''' (partially omitted) '''
16.7 12. 14.6 21.4 23. 23.7 25. 21.8 20.6 21.2 19.1 20.6 15.2 7.
8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
22. 11.9]
- sklearn.datasets.load_diabetes
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(dir(diabetes))
print('*'*80)
print(diabetes.DESCR)
print('*'*80)
print(diabetes.data_filename)
print('*'*80)
print(diabetes.feature_names)
print(diabetes.data)
print('*'*80)
print(diabetes.target_filename)
['DESCR', 'data', 'data_filename', 'feature_names', 'target', 'target_filename']
********************************************************************************
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- Age
- Sex
- Body mass index
- Average blood pressure
- S1
- S2
- S3
- S4
- S5
- S6
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
''' (partially omitted) '''
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\diabetes_data.csv.gz
********************************************************************************
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
[[ 0.03807591 0.05068012 0.06169621 ... -0.00259226 0.01990842
-0.01764613]
[-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
-0.09220405]
[ 0.08529891 0.05068012 0.04445121 ... -0.00259226 0.00286377
-0.02593034]
...
[ 0.04170844 0.05068012 -0.01590626 ... -0.01107952 -0.04687948
0.01549073]
[-0.04547248 -0.04464164 0.03906215 ... 0.02655962 0.04452837
-0.02593034]
[-0.04547248 -0.04464164 -0.0730303 ... -0.03949338 -0.00421986
0.00306441]]
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\diabetes_target.csv.gz
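The note in the dataset description, that each feature is mean centered and scaled so that the sum of squares of each column totals 1, can be verified with a short sketch. It assumes the default loader settings; the variable names are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)

# Every column should have (numerically) zero mean ...
col_means = X.mean(axis=0)
centered = np.allclose(col_means, 0.0, atol=1e-6)

# ... and a sum of squares equal to 1.
col_sq_sums = (X ** 2).sum(axis=0)
unit_norm = np.allclose(col_sq_sums, 1.0)

print(centered, unit_norm)
```

This scaling explains why the printed feature values are all small numbers around zero rather than raw ages or blood pressure readings.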
-
Fetching large datasets
sklearn.datasets.fetch_20newsgroups
-
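Parameters for loading the dataset:
subset: 'train', 'test', or 'all', optional. Selects which part to load: 'train' for the training set, 'test' for the test set, 'all' for both.
data_home: optional, default None. Specifies the download path for the dataset. If None, all scikit-learn data is stored in the '~/scikit_learn_data' subfolder.
categories: the list of categories to load; all 20 categories by default.
shuffle: whether to shuffle the data randomly.
random_state: a numpy random number generator or an integer seed.
download_if_missing: optional, default True. Downloads the data if it has not been downloaded before.
remove: a tuple that may contain any of ('headers', 'footers', 'quotes'); the corresponding parts of each text are stripped.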
from sklearn.datasets import fetch_20newsgroups
data_test = fetch_20newsgroups(subset='test', data_home=None, categories=None,
                               shuffle=True, random_state=42, remove=(),
                               download_if_missing=True)
from sklearn.datasets import fetch_20newsgroups
data_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)
data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
print(dir(data_train))
print('*'*80)
#print(data_train.DESCR)
print('*'*80)
print(data_test.data[0])  # the first document in the test set
print('-'*80)
print('Training set category names: {} '.format(data_train.target_names))
print(data_test.target[:10])
print('*'*80)
print('Training set: {} samples'.format(data_train.target.shape))
print('Test set: {} samples'.format(data_test.target.shape))
['DESCR', 'data', 'filenames', 'target', 'target_names']
********************************************************************************
********************************************************************************
From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu
I am a little confused on all of the models of the 88-89 bonnevilles. I have heard of the LE SE LSE SSE SSEI. Could someone tell me the differences are far as features or performance. I am also curious to know what the book value is for prefereably the 89 model. And how much less than book value can you usually get them for. In other words how much are they in demand this time of year. I have heard that the mid-spring early summer is the best time to buy.
Neil Gandler
--------------------------------------------------------------------------------
Training set category names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
[ 7  5  0 17 19 13 15 15  5  1]
********************************************************************************
Training set: (11314,) samples
Test set: (7532,) samples
-
sklearn.datasets.fetch_20newsgroups_vectorized
- Loads the 20 newsgroups dataset and converts it to tf-idf vectors. This is a convenience function; the tf-idf transformation is done with the default settings of sklearn.feature_extraction.text.Vectorizer.
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.utils import shuffle
bunch = fetch_20newsgroups_vectorized(subset='all')
X, y = shuffle(bunch.data, bunch.target)
print(X.shape)
# split the data into a training set (0.7) and a test set (0.3)
offset = int(X.shape[0] * 0.7)
X_train, y_train = X[0:offset], y[0:offset]
X_test, y_test = X[offset:], y[offset:]
print(X_train.shape)
print(X_test.shape)
(18846, 130107)
(13192, 130107)
(5654, 130107)
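The shuffle-and-slice split used for the vectorized newsgroups data can also be written with train_test_split. The sketch below compares the two on a small random sparse matrix that stands in for the real tf-idf matrix, purely to avoid the download; the matrix size, density, labels, and seeds are all assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Stand-in data: 100 sparse "documents" with 50 features and 20 labels.
rng = np.random.RandomState(0)
X = sp.random(100, 50, density=0.1, format="csr", random_state=0)
y = rng.randint(0, 20, size=100)

# Manual approach: shuffle X and y together, then slice at the 70% mark.
X_shuf, y_shuf = shuffle(X, y, random_state=0)
offset = int(X_shuf.shape[0] * 0.7)
X_train_manual, X_test_manual = X_shuf[:offset], X_shuf[offset:]

# Same proportions in a single call.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train_manual.shape, X_test_manual.shape)
print(X_train.shape, X_test.shape)
```

Both approaches keep the matrix sparse, which matters for the real data with its 130107 columns.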
-
Generating data locally
Generating local classification data:
sklearn.datasets.make_classification
-
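Parameters for generating the dataset:
n_samples: int, optional (default = 100), the number of samples
n_features: int, optional (default = 20), the total number of features; it comprises n_informative + n_redundant + n_repeated features, and any remaining features are random noise
n_informative: the number of informative features
n_redundant: the number of redundant features, generated as random linear combinations of the informative features
n_repeated: the number of repeated features, drawn at random from the informative and redundant features
n_classes: int, optional (default = 2), the number of classes
n_clusters_per_class: the number of clusters that make up each class
random_state: int, RandomState instance, optional (default = None); if an int, random_state is the seed used by the random number generator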
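How the feature-count parameters fit together can be seen in a small sketch. It assumes shuffle=False, under which the columns come out in the order informative, redundant, repeated, then noise; the specific counts and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# 8 features in total: 3 informative + 2 redundant + 1 repeated + 2 noise.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=2,
    n_repeated=1, n_classes=2, n_clusters_per_class=1,
    shuffle=False, random_state=0,
)

informative = X[:, :3]
redundant = X[:, 3:5]
repeated = X[:, 5]

# Redundant columns should be exact linear combinations of informative ones.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
redundant_is_linear = np.allclose(informative @ coef, redundant)

# The repeated column should duplicate one informative or redundant column.
repeated_is_copy = any(np.array_equal(repeated, X[:, j]) for j in range(5))

print(X.shape, redundant_is_linear, repeated_is_copy)
```

With the default shuffle=True the same columns exist but appear in random order, so they can no longer be located by position.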
from sklearn import datasets
import matplotlib.pyplot as plt
data, target = datasets.make_classification(n_samples=100, n_features=2,
                                            n_informative=2, n_redundant=0, n_repeated=0,
                                            n_classes=2, n_clusters_per_class=1,
                                            random_state=0)
print(data.shape)
print(target.shape)
#print(data)
#print(target)
plt.scatter(data[:, 0], data[:, 1], c=target)
plt.show()
(100, 2)
(100,)
[Figure: 111.png, the scatter plot of the generated classification data]
Generating local regression data:
sklearn.datasets.make_regression
-
Parameters for generating the dataset:
n_samples: int, optional (default = 100), the number of samples
n_features: int, optional (default = 100), the number of features
coef: boolean, optional (default = False); if True, the coefficients of the underlying linear model are also returned
random_state: int, RandomState instance, optional (default = None)
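# make_regression is imported from sklearn.datasets; the old
# sklearn.datasets.samples_generator module no longer exists in recent releases
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, random_state=1)
print(X.shape)
print(y.shape)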
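The coef parameter can be illustrated with a short sketch: with coef=True the generator also returns the coefficients of the underlying linear model, and with the default settings (no noise, zero bias) the targets should equal X multiplied by those coefficients. The sizes and seed mirror the example above; the variable names are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True adds the ground-truth coefficients as a third return value.
X, y, coef = make_regression(
    n_samples=100, n_features=10, coef=True, random_state=1
)
print(X.shape, y.shape, coef.shape)

# With the default noise=0.0 and bias=0.0, y is exactly the linear model output.
reconstructs = np.allclose(X @ coef, y)
print(reconstructs)
```

Having the true coefficients available is useful for checking how closely a fitted regression model recovers them.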
-
Image data
In an Anaconda installation, the sample images bundled with sklearn are in this directory:
D:\Anaconda3\Lib\site-packages\sklearn\datasets\images
It contains china.jpg and flower.jpg.
from sklearn.datasets import load_sample_image
import matplotlib.pyplot as plt
img = load_sample_image('china.jpg')
plt.imshow(img)
# display the figure when running as a script
plt.show()