Structured Data Classification in Practice: Heart Disease Prediction (translation of the official TensorFlow 2.0 tutorial)

Full TensorFlow 2.0 learning path: https://www.mashangxue123.com
Latest version: https://www.mashangxue123.com/tensorflow/tf2-tutorials-keras-feature_columns.html

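This tutorial demonstrates how to classify structured data (for example, tabular data in CSV format).
We will use Keras to define the model, and use feature columns as a bridge that maps the columns of the CSV to the features used to train the model.
This tutorial contains complete code to:

Load a CSV file using Pandas.
Build an input pipeline that batches and shuffles the rows using tf.data.
Map the columns of the CSV to the features used to train the model, using feature columns.
Build, train, and evaluate a model using Keras.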

1. The dataset

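We will use a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV has a few hundred rows; each row describes a patient and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.

The dataset is described below. Note that it contains both numeric and categorical columns.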

Column	Description	Feature Type	Data Type
Age	Age in years	Numerical	integer
Sex	(1 = male; 0 = female)	Categorical	integer
CP	Chest pain type (0, 1, 2, 3, 4)	Categorical	integer
Trestbps	Resting blood pressure (in mm Hg on admission to the hospital)	Numerical	integer
Chol	Serum cholesterol in mg/dl	Numerical	integer
FBS	Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)	Categorical	integer
RestECG	Resting electrocardiographic results (0, 1, 2)	Categorical	integer
Thalach	Maximum heart rate achieved	Numerical	integer
Exang	Exercise induced angina (1 = yes; 0 = no)	Categorical	integer
Oldpeak	ST depression induced by exercise relative to rest	Numerical	float
Slope	The slope of the peak exercise ST segment	Numerical	integer
CA	Number of major vessels (0-3) colored by fluoroscopy	Numerical	integer
Thal	3 = normal; 6 = fixed defect; 7 = reversible defect	Categorical	string
Target	Diagnosis of heart disease (1 = true; 0 = false)	Classification	integer

2. Import TensorFlow and other libraries

Install the scikit-learn dependency (it is imported under the name sklearn):

pip install scikit-learn

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

3. Create a dataframe using Pandas

Pandas is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to download the dataset from a URL and load it into a dataframe.

URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
dataframe.head()

4. Split the dataframe into training, validation, and test sets

The dataset we downloaded is a single CSV file. We split it into training, validation, and test sets.

train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

      193 train examples
      49 validation examples
      61 test examples
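
The counts printed above follow from applying the 20% split twice. The sketch below reproduces that arithmetic in plain Python. It assumes the CSV holds 303 rows (193 + 49 + 61) and that the held-out share is rounded up to a whole row, which matches these numbers but is our inference rather than something the tutorial states. The helper name split_sizes is hypothetical.

```python
import math


def split_sizes(n_rows, held_out_fraction):
    """Return (kept, held_out) row counts for one split.

    Assumption: the held-out share is rounded up to a whole row.
    """
    held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - held_out, held_out


total_rows = 303  # assumed: sum of the three counts printed above

# First split: 20% of all rows become the test set.
remaining, n_test = split_sizes(total_rows, 0.2)
# Second split: 20% of what is left becomes the validation set.
n_train, n_val = split_sizes(remaining, 0.2)

print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

Note that the validation set is 20% of the remaining 80%, so it ends up being roughly 16% of the full dataset.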

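5. Create an input pipeline using tf.data

Next, we wrap the dataframes with tf.data. This lets us use feature columns as a bridge that maps the columns of the Pandas dataframe to the features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it directly from disk. That is not covered in this tutorial.

# A utility function that creates a tf.data dataset from a Pandas dataframe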
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

batch_size = 5  # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

6. Understand the input pipeline

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We used a small batch size to keep the output readable.

for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['age'])
  print('A batch of targets:', label_batch)

      Every feature: ['age', 'chol', 'fbs', 'ca', 'slope', 'restecg', 'sex', 'thal', 'thalach', 'oldpeak', 'exang', 'cp', 'trestbps']
      A batch of ages: tf.Tensor([58 52 56 35 59], shape=(5,), dtype=int32)
      A batch of targets: tf.Tensor([1 0 1 0 0], shape=(5,), dtype=int32)

We can see that the dataset returns a dictionary that maps column names (from the dataframe) to the column values of a batch of rows.
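
To make the structure of a batch concrete, here is a plain-Python sketch of what the pipeline produces conceptually: rows sliced from a dictionary of columns, optionally shuffled, then grouped into batches. It does not use tf.data. The function iter_batches and the toy two-column table are made up for illustration.

```python
import random


def iter_batches(columns, labels, batch_size, shuffle=True, seed=0):
    """Yield (feature_batch, label_batch) pairs from column lists.

    feature_batch is a dict mapping each column name to the values
    of the rows that landed in this batch.
    """
    order = list(range(len(labels)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        rows = order[start:start + batch_size]
        feature_batch = {name: [values[i] for i in rows]
                         for name, values in columns.items()}
        label_batch = [labels[i] for i in rows]
        yield feature_batch, label_batch


# Toy data (hypothetical values, not rows from heart.csv).
toy_columns = {
    'age': [63, 67, 41, 56, 62, 57, 44],
    'thal': ['fixed', 'normal', 'reversible', 'normal',
             'normal', 'fixed', 'reversible'],
}
toy_labels = [0, 1, 0, 0, 1, 0, 1]

batches = list(iter_batches(toy_columns, toy_labels, batch_size=5))
first_features, first_labels = batches[0]
print('Every feature:', list(first_features.keys()))
print('A batch of ages:', first_features['age'])
print('A batch of targets:', first_labels)
```

With seven rows and a batch size of 5, the second batch holds only the two leftover rows, just as the last batch of a real dataset may be smaller than the others.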

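7. Demonstrate several types of feature columns

TensorFlow provides many types of feature columns. In this section, we create several of them and demonstrate how each one transforms a column from the dataframe.

# We will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_ds))[0]

# A utility function that creates a feature layer from a feature column and transforms a batch of data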
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())

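7.1. Numeric columns

The output of a feature column becomes the input to the model (using the demo function defined above, we can see exactly how each column from the dataframe is transformed). A numeric column is the simplest type of column. It is used to represent real-valued features. When using this column, the model receives the column value from the dataframe unchanged.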

age = feature_column.numeric_column("age")
demo(age)

      [[58.]
      [52.]
      [56.]
      [35.]
      [59.]]

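In the heart disease dataset, most columns from the dataframe are numeric.

7.2. Bucketized columns

Often, you do not want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we can split it into several buckets using a bucketized column.
Notice that the one-hot values below describe which age range each row matches.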

age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)

      [[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
      [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
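
The one-hot rows above can be reproduced by hand. The following sketch is not TensorFlow code. It assumes that a value equal to a boundary falls into the bucket that starts at that boundary (so the buckets are [18, 25), [25, 30), and so on), which is consistent with age 35 landing in position 4 above. The helper name bucketize_one_hot is hypothetical.

```python
import bisect

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucketize_one_hot(value, boundaries=BOUNDARIES):
    """Return a one-hot list with len(boundaries) + 1 entries."""
    # bisect_right counts the boundaries that are <= value,
    # which is the index of the bucket containing the value.
    index = bisect.bisect_right(boundaries, value)
    one_hot = [0.0] * (len(boundaries) + 1)
    one_hot[index] = 1.0
    return one_hot


for age_value in [58, 52, 56, 35, 59]:
    print(age_value, bucketize_one_hot(age_value))
```

Ten boundaries produce eleven buckets, because values below 18 and values of 65 or more each get a bucket of their own.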

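7.3. Categorical columns

In this dataset, thal is represented as a string (for example "fixed", "normal", or "reversible"). We cannot feed strings directly to a model; we must first map them to numeric values. Categorical vocabulary columns provide a way to represent strings as one-hot vectors (much like the age buckets above). The vocabulary can be passed as a list using categorical_column_with_vocabulary_list, or loaded from a file using categorical_column_with_vocabulary_file.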

thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])

thal_one_hot = feature_column.indicator_column(thal)
demo(thal_one_hot)

      [[0. 0. 1.]
      [0. 1. 0.]
      [0. 0. 1.]
      [0. 0. 1.]
      [0. 0. 1.]]
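
As a rough illustration of the mapping, the sketch below encodes a string as a one-hot list over the same three-entry vocabulary. It is plain Python rather than the feature column implementation, and it assumes that a string missing from the vocabulary is encoded as all zeros. The helper name one_hot_from_vocabulary is hypothetical.

```python
VOCABULARY = ['fixed', 'normal', 'reversible']


def one_hot_from_vocabulary(value, vocabulary=VOCABULARY):
    """Encode a string as a one-hot list over the vocabulary.

    Assumption for this sketch: a string missing from the
    vocabulary is encoded as all zeros.
    """
    return [1.0 if value == entry else 0.0 for entry in vocabulary]


print(one_hot_from_vocabulary('reversible'))
print(one_hot_from_vocabulary('normal'))
print(one_hot_from_vocabulary('unknown'))
```

The position of the 1 is simply the index of the string in the vocabulary list, so the order of the list determines the layout of the vector.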

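In a more complex dataset, many columns would be categorical (for example, strings). Feature columns are most valuable when working with categorical data. Although only one categorical column in this dataset is stored as a string, we will use it to demonstrate several important types of feature columns that you could use when working with other datasets.

7.4. Embedding columns

Suppose that instead of having just a few possible strings, we have thousands (or more) of values per category. For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation.
Instead of representing the data as a high-dimensional one-hot vector, an embedding column represents it as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8 in the example below) is a parameter that must be tuned.

Key point: An embedding column is best used when a categorical column has many possible values. We use one here for demonstration purposes, so that you have a complete example you can adapt to a different dataset in the future.

# Notice that the input to the embedding column is the categorical column we created earlier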
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)

[[-0.01019966  0.23583987  0.04172783  0.34261808 -0.02596842  0.05985594
   0.32729048 -0.07209085]
 [ 0.08829682  0.3921798   0.32400072  0.00508362 -0.15642034 -0.17451124
   0.12631968  0.15029909]
 [-0.01019966  0.23583987  0.04172783  0.34261808 -0.02596842  0.05985594
   0.32729048 -0.07209085]
 [-0.01019966  0.23583987  0.04172783  0.34261808 -0.02596842  0.05985594
   0.32729048 -0.07209085]
 [-0.01019966  0.23583987  0.04172783  0.34261808 -0.02596842  0.05985594
   0.32729048 -0.07209085]]
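
In the output above, rows 1, 3, 4, and 5 are identical because those examples share the same thal value. Conceptually, an embedding is a lookup table with one row of numbers per category. The sketch below shows that lookup in plain Python. The table values are random placeholders (real embeddings are learned during training), the batch mirrors the one-hot output of section 7.3, and the names embedding_table and embed are hypothetical.

```python
import random

VOCABULARY = ['fixed', 'normal', 'reversible']
EMBEDDING_DIMENSION = 8

# One row of 8 numbers per category. Real embeddings are learned
# during training; here they are just random starting values.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.5, 0.5) for _ in range(EMBEDDING_DIMENSION)]
    for _ in VOCABULARY
]


def embed(value):
    """Look up the dense vector for a category string."""
    return embedding_table[VOCABULARY.index(value)]


batch = ['reversible', 'normal', 'reversible', 'reversible', 'reversible']
for thal_value in batch:
    print(thal_value, [round(x, 3) for x in embed(thal_value)])
```

The width of each vector stays at 8 no matter how many categories the vocabulary has, which is what makes embeddings practical for large vocabularies.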

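7.5. Hashed feature columns

Another way to represent a categorical column with a large number of values is to use categorical_column_with_hash_bucket.
This feature column calculates a hash value of the input, then selects one of the hash_bucket_size buckets to encode the string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash buckets significantly smaller than the number of actual categories to save space.

Key point: An important downside of this technique is that there may be collisions, in which different strings are mapped to the same bucket. In practice, this can still work well for some datasets.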

thal_hashed = feature_column.categorical_column_with_hash_bucket(
      'thal', hash_bucket_size=1000)
demo(feature_column.indicator_column(thal_hashed))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
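
The idea of hashing into buckets, and why collisions occur, can be sketched in a few lines of plain Python. MD5 is used below only as a stable stand-in; it is not the hash function TensorFlow uses, so the bucket indices will not match those of the feature column. The helper name hash_bucket is hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket index in [0, hash_bucket_size).

    Stand-in hash: MD5 is used here only because it is stable across
    runs; it is not the hash function TensorFlow uses.
    """
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size


categories = ['fixed', 'normal', 'reversible']

# With many buckets, collisions among three strings are unlikely.
print({c: hash_bucket(c, 1000) for c in categories})

# With only two buckets, at least two of the three strings must
# share a bucket (pigeonhole principle): a guaranteed collision.
small = {c: hash_bucket(c, 2) for c in categories}
print(small)
```

No vocabulary is needed because the bucket index is computed from the string itself; the price is that two different strings can end up indistinguishable to the model.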

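7.6. Crossed feature columns

Combining features into a single feature, better known as a feature cross, enables a model to learn separate weights for each combination of features.
Here, we create a new feature that is the cross of age and thal.
Note that crossed_column does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a hashed column, so you can choose how large the table is.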

crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
demo(feature_column.indicator_column(crossed_feature))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
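
A feature cross can be pictured as joining the two values into one key and hashing that key into a fixed number of buckets. The sketch below does this in plain Python. The key format and the MD5 hash are illustrative choices, not what TensorFlow does internally, and the helper name crossed_index is hypothetical.

```python
import bisect
import hashlib

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
HASH_BUCKET_SIZE = 1000


def crossed_index(age_value, thal_value):
    """Return the hashed index for an (age bucket, thal) combination."""
    age_bucket = bisect.bisect_right(BOUNDARIES, age_value)
    # Join the two parts into one key, then hash it. The key format
    # and the MD5 hash are illustrative choices, not TensorFlow's.
    key = '{}_X_{}'.format(age_bucket, thal_value)
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % HASH_BUCKET_SIZE


# Ages 56 and 58 share a bucket, so with the same thal value they
# produce the same crossed feature.
print(crossed_index(56, 'reversible'), crossed_index(58, 'reversible'))
# A different thal value gives a different key.
print(crossed_index(58, 'normal'))

# The full table would need 11 * 3 = 33 entries; hashing lets us
# pick the table size independently of that product.
print((len(BOUNDARIES) + 1) * 3)
```

For this small cross the full table would have been affordable, but the same approach keeps the size bounded when the crossed columns have thousands of values each.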

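8. Choose which columns to use

We have seen how to use several types of feature columns. Now we will use them to train a model. The goal of this tutorial is to show you the complete code (that is, the mechanics) needed to work with feature columns, so the columns used to train the model below were selected arbitrarily.

Key point: If your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include and how they should be represented.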

feature_columns = []

# numeric columns
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
  feature_columns.append(feature_column.numeric_column(header))

# bucketized columns
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# indicator columns
thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# embedding columns
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

# crossed columns
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

8.1. Create a feature layer

Now that we have defined our feature columns, we will use a DenseFeatures layer to feed them into our Keras model.

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
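
The feature layer turns each example into one flat vector by concatenating the outputs of all the columns. As a rough check on how wide that vector is, the tally below adds up the widths seen in the demos of section 7. The dictionary keys are descriptive labels only, not API names.

```python
# Width contributed by each feature column chosen above.
column_widths = {
    'numeric (7 columns, 1 value each)': 7,
    'age_buckets (10 boundaries -> 11 buckets)': 11,
    'thal_one_hot (3 vocabulary entries)': 3,
    'thal_embedding (dimension=8)': 8,
    'crossed_feature (hash_bucket_size=1000)': 1000,
}

total_width = sum(column_widths.values())
for name, width in column_widths.items():
    print('{:45s} {:5d}'.format(name, width))
print('total input width:', total_width)
```

Almost all of the width comes from the crossed column, which is worth keeping in mind when choosing hash_bucket_size for a dataset with only a few hundred rows.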

Earlier, we used a small batch size to demonstrate how feature columns work. We now create a new input pipeline with a larger batch size.

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

9. Create, compile, and train the model

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)

Output of the training process

Epoch 1/5
7/7 [==============================] - 1s 79ms/step - loss: 3.8492 - accuracy: 0.4219 - val_loss: 2.7367 - val_accuracy: 0.7143
......
Epoch 5/5
7/7 [==============================] - 0s 34ms/step - loss: 0.6200 - accuracy: 0.7377 - val_loss: 0.6288 - val_accuracy: 0.6327

<tensorflow.python.keras.callbacks.History at 0x7f48c044c5f8>

Test

loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

      2/2 [==============================] - 0s 19ms/step - loss: 0.5538 - accuracy: 0.6721
      Accuracy 0.6721311
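
For readers who want to see what the two reported numbers measure, here is a minimal plain-Python sketch of binary cross-entropy and accuracy for a model with a sigmoid output. The clipping epsilon, the 0.5 threshold, and the toy labels and probabilities are assumptions of this sketch and are not taken from the training run above.

```python
import math


def binary_crossentropy(y_true, y_prob, epsilon=1e-7):
    """Mean binary cross-entropy over a list of examples.

    Probabilities are clipped to avoid log(0); the epsilon value
    is an assumption of this sketch.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, epsilon), 1.0 - epsilon)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction is correct."""
    correct = sum(1 for y, p in zip(y_true, y_prob)
                  if (p > threshold) == (y == 1))
    return correct / len(y_true)


# Hypothetical labels and sigmoid outputs, not results from the model.
toy_labels = [1, 0, 1, 0]
toy_probabilities = [0.9, 0.2, 0.4, 0.6]

print('loss', binary_crossentropy(toy_labels, toy_probabilities))
print('accuracy', accuracy(toy_labels, toy_probabilities))

# On the 61 test examples, 41 correct predictions would give the
# accuracy printed above (an inference from the number, not a log).
print(41 / 61)
```

Because the test set has only 61 examples, a single additional correct or incorrect prediction moves the accuracy by about 1.6 percentage points, so small differences between runs should not be over-interpreted.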

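Key point: You will typically see the best results with deep learning on much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline.

The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so that you have code to use as a starting point when working with your own datasets in the future.

10. Next steps

The best way to learn more about classifying structured data is to try it yourself. We suggest finding another dataset to work with and training a model to classify it using code similar to the above. To improve accuracy, think carefully about which features to include in your model and how they should be represented.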