《Deep Learning with Python》筆記


id: 9vyvlNjQhL0ZiLxDj0Apo
title: Keras
desc: 《Deep Learning with Python》筆記
updated: 1632854978958
created: 1632414421733


本來是打算趁這個時間好好看看花書的,前幾章看下來確實覺得獲益匪淺,但看下去就發(fā)現(xiàn)跟不上了,特別是抱著急功近利的心態(tài)的話,目前也沉不下心去真的一節(jié)節(jié)吃透地往下看。這類書終歸不是入門教材,是需要你有過一定的積累后再回過頭來看的。

于是想到了《Deep Learning with Python》,忘記這本書怎么來的了,但是在別的地方看到了有人推薦,說是Keras的作者寫的非常好的一本入門書,翻了前面幾十頁后發(fā)現(xiàn)居然跟進去了,不該講的地方?jīng)]講(比如數(shù)學(xué)細節(jié)),而且思路也極其統(tǒng)一,從頭貫穿到尾(比如representations, latent space, hypothesis space),我覺得很受用。

三百多頁全英文,居然也沒查幾個單詞就這么看完了,以前看文檔最多十來頁,也算一個突破了,可見其實還是一個耐心的問題。

看完后書上做了很多筆記,于是順著筆記讀了第二遍,順便就把筆記給電子化了。不是教程,不是導(dǎo)讀。

Fundamentals of deep learning

核心思想
learn useful representations of input data

what’s a representation?

At its core, it’s a different way to look at data—to represent or encode data.

簡單回顧深度學(xué)習之于人工智能的歷史,每本書都會寫,但每本書里都有作者自己的側(cè)重:

  • Artificial intelligence
  • Machine learning
    • Machine learning is tightly related to mathematical statistics, but it differs from statistics in several important ways.
      • machine learning tends to deal with large, complex datasets (such as a dataset of millions of images, each consisting of tens of thousands of pixels)
      • classical statistical analysis such as Bayesian analysis would be impractical(不切實際的).
      • It’s a hands-on discipline in which ideas are proven empirically more often than theoretically.(工程/實踐大于理論)
    • 是一種meaningfully transform data
      • Machine-learning models are all about finding appropriate representations for their input data—transformations of the data that make it more amenable to the task at hand, such as a classification task.
      • 尋找更有代表性的representation, 通過:(coordinate change, linear projections, translations, nonlinear operations)
      • 只會在hypothesis space里尋找
      • 以某種反饋為信號作為優(yōu)化指導(dǎo)
  • Deep learning
    • Machine Learning的子集,一種learning representation的新方法
    • 雖然叫神經(jīng)網(wǎng)絡(luò)(neural network),但它既非neural,也不是network,更合理的名字:
      • layered representations learning and hierarchical representations learning.
    • 相對少的層數(shù)的實現(xiàn)叫shallow learning

Before deep learning

  • Probabilistic modeling
    • the earliest forms of machine learning,
    • still widely used to this day.
      • One of the best-known algorithms in this category
        is the Naive Bayes algorithm(樸素貝葉斯)
    • 條件概率,把規(guī)則理解為“條件”,判斷概率,比如垃圾郵件。
      • A closely related model is the logistic regression
  • Early neural networks
    • in the mid-1980s, multiple people independently rediscovered the Backpropagation algorithm
    • The first successful practical application of neural nets came in 1989 from Bell Labs -> LeNet
  • Kernel methods
    • Kernel methods are a group of classification algorithms(核方法是一組分類算法)
      • the best known of which is the support vector machine (SVM).
      • SVMs aim at solving classification problems by finding good decision boundaries between two sets of points belonging to two different categories.
        1. 先把數(shù)據(jù)映射到高維,decision boundary表示為hyperplane
        2. 最大化每個類別里離hyperplane最近的點到hyperplane的距離:maximizing the margin
      • The technique of mapping data to a high-dimensional representation 非常消耗計算資源,實際使用的是核函數(shù)(kernel function):
        • 不把每個點轉(zhuǎn)換到高維,而只是計算每兩個點在高維中的距離
        • 核函數(shù)是手工設(shè)計的,不是學(xué)習的
      • SVM在分類問題上是經(jīng)典方案,但難以擴展到大型數(shù)據(jù)集上
      • 對于perceptual problems(感知類的問題)如圖像分類效果也不好
        • 它是一個shallow method
        • 需要事先手動提取有用特征(feature engineering)-> difficult and brittle(脆弱的)
  • Decision trees, random forests, and gradient boosting machines
    • Random Forest
      • you could say that they’re almost always the second-best algorithm for any shallow machine-learning task.
    • gradient boosting machines (1st):
      • a way to improve any machine-learning model by iteratively training new models that specialize in addressing the weak points of the previous models.

What makes deep learning different

it completely automates what used to be the most crucial step in a machine-learning workflow: feature engineering. 有人認為這叫窮舉,思路上有點像,至少得到特征的過程不是靠觀察和分析。

feature engineering

manually engineer good layers of representations for their data

Getting started with neural networks

Anatomy of a neural network

  • Layers, which are combined into a network (or model)
    • layers: 常見的比如卷積層,池化層,全連接層等
    • models: layers構(gòu)成的網(wǎng)絡(luò),或多個layers構(gòu)成的模塊(用模塊組成網(wǎng)絡(luò))
      • Two-branch networks
      • Multihead networks
      • Inception blocks, residual blocks etc.
    • The topology of a network defines a hypothesis space
    • 本書反復(fù)強調(diào)的就是這個hypothesis space,一定要理解這個思維:
      • By choosing a network topology, you constrain your space of possibilities (hypothesis space) to a specific series of tensor operations, mapping input data to output data.(network的選擇約束了tensor變換的步驟)
      • 所以如果選擇了不好的network,可能導(dǎo)致你在錯誤的hypothesis space里搜索,以致于效果不好。
  • The input data and corresponding targets
  • The loss function (objective function), which defines the feedback signal used for learning
    • The quantity that will be minimized during training.
    • It represents a measure of success for the task at hand.
    • 多頭網(wǎng)絡(luò)有多個loss function,但基于gradient-descent的網(wǎng)絡(luò)只允許有一個標量的loss,因此需要把它合并起來(相加,平均...)
  • The optimizer, which determines how learning proceeds
    • Determines how the network will be updated based on the loss function.
    • It implements a specific variant of stochastic gradient descent (SGD).

Classifying movie reviews: a binary classification example

一個二元分類的例子

情感分析/情緒判斷,數(shù)據(jù)源是IMDB的影評數(shù)據(jù).

理解hidden的維度

how much freedom you’re allowing the network to have when learning internal representations. 即學(xué)習表示(別的地方通常叫提取特征)的自由度。

目前提出了架構(gòu)網(wǎng)絡(luò)時的兩個問題:

  1. 多少個隱層
  2. 隱層需要多少個神經(jīng)元(即維度)

后面的章節(jié)會介紹一些原則。

激活函數(shù)

李宏毅的課程里,是從用整流函數(shù)來逼近非線性方程的方式來引入激活函數(shù)的,也就是說在李宏毅的課程里,激活函數(shù)是從公式里推導(dǎo)出來的;當然一般的教材都不是這個角度,都是先有了線性方程,再去告訴你這樣還不夠,需要一個activation。

本書也一樣,告訴你,如果只有wX+b,那么只有線性變換,這樣會導(dǎo)致對hypothesis space的極大限制,為了擴展它的空間,就引入了非線性的后續(xù)處理。總之,都是在自己的邏輯體系內(nèi)的。本書的邏輯體系就是hypothesis space,你想要有解,就是在這個空間里。

網(wǎng)絡(luò)結(jié)構(gòu)

from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

entropy

Crossentropy is a quantity from the field of Information Theory(信息論) that measures the distance between probability distributions。

in this case, between the ground-truth distribution and your predictions.
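可以用一小段numpy手算感受一下(示意代碼,非書中內(nèi)容):預(yù)測分布離真值分布越遠,crossentropy越大。

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # 逐樣本計算 -[y*log(p) + (1-y)*log(1-p)],再取平均
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
print(binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8, 0.7])))  # 預(yù)測接近真值 -> loss小
print(binary_crossentropy(y_true, np.array([0.1, 0.9, 0.2, 0.3])))  # 預(yù)測偏離真值 -> loss大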

keras風格的訓(xùn)練

其實就是模仿了scikit learn的風格。對快速實驗非常友好,缺點就是封裝過于嚴重,不利于調(diào)試,但這其實不是問題,誰也不會只用keras。

# 演示用類名和字符串分別做參數(shù)的方式
model.compile(optimizer='rmsprop',
            loss='binary_crossentropy',
            metrics=['accuracy'])

from keras import optimizers
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
            loss='binary_crossentropy',
            metrics=['accuracy'])

from keras import losses
from keras import metrics
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
            loss=losses.binary_crossentropy,
            metrics=[metrics.binary_accuracy])

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

# train
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

后續(xù)優(yōu)化,就是對比train和validate階段的loss和accuracy,找到overfit的節(jié)點(比如是第N輪),然后重新訓(xùn)練到第N輪(或者直接用第N輪生成的模型,如果有),用這個模型來預(yù)測沒有人工標注的數(shù)據(jù)。
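比如可以把fit返回的history畫出來找這個節(jié)點(示意代碼,history沿用上面的變量;metrics對應(yīng)的key在不同keras版本里可能是acc或accuracy):

import matplotlib.pyplot as plt

history_dict = history.history   # 包含 'loss'、'val_loss' 以及 metrics 對應(yīng)的key
epochs = range(1, len(history_dict['loss']) + 1)

plt.plot(epochs, history_dict['loss'], 'bo', label='Training loss')
plt.plot(epochs, history_dict['val_loss'], 'b', label='Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()   # val_loss開始回升的那一輪,就是overfit的節(jié)點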

核心就是要訓(xùn)練到明顯的overfit為止。這是第一個例子的內(nèi)容,所以是告訴你怎么用這個簡單的網(wǎng)絡(luò)來進行預(yù)測,而不是立即著眼怎么去解決overfit.

第一個小結(jié)

  1. 數(shù)據(jù)需要預(yù)處理成tensor, 了解幾種tensor化,或vector化的方式
  2. 堆疊全連接網(wǎng)絡(luò)(Dense),以及activation,就能解決很多分類問題
  3. 二元分類的問題通常在Dense后接一個sigmoid函數(shù)
  4. 引入二元交叉熵(BCE)作為二元分類問題的loss
  5. 用了rmsprop優(yōu)化器,暫時沒有過多介紹。這些優(yōu)化器都是為了解決能不能找到局部極值而進行的努力,具體可看上一篇李宏毅的筆記
  6. 使用overfit之前的那一個模型來做預(yù)測

Classifying newswires: a multiclass classification example

這次用路透社的新聞來做多分類的例子,給每篇新聞標記類別。

預(yù)處理,一些要點:

  1. 不會采用所有的詞匯,所以預(yù)處理時根據(jù)詞頻,只選了前1000個詞
  2. 用索引來實現(xiàn)文字-數(shù)字的對應(yīng)
  3. 用one-hot來實現(xiàn)數(shù)字-向量的對應(yīng)
  4. 理解什么是序列(其實就是一句話)
  5. 所以句子有長有短,為了矩陣的批量計算(即多個句子同時處理),需要“對齊”(補0和截斷)
  6. 理解稠密矩陣(word-embedding)與稀疏矩陣(one-hot)的區(qū)別(這里沒有講,用的是one-hot)
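對應(yīng)第1、2、3點,書中風格的向量化大致如下(示意代碼,num_words取10000只是演示用的假設(shè)值):

import numpy as np
from keras.datasets import reuters
from keras.utils import to_categorical

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)  # 只保留詞頻最高的前N個詞

def vectorize_sequences(sequences, dimension=10000):
    # 每篇新聞變成一個dimension維的multi-hot向量
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
one_hot_train_labels = to_categorical(train_labels)   # 46個類別的one-hot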

網(wǎng)絡(luò)和訓(xùn)練

  1. 網(wǎng)絡(luò)結(jié)構(gòu)不變,每層的神經(jīng)元為(64, 64, 46)
  2. 前面增加了神經(jīng)元(16個特征對語言來說應(yīng)該是不夠的)
  3. 最后一層由1變成了46,因為二元的輸出只需要一個數(shù)字,而多元輸出是用one-hot表示的向量,最有可能的類別在這個向量里擁有最大的值。
  4. 損失函數(shù)為categorical_crossentropy,這在別的教材里應(yīng)該就是普通的CE.

新知識

  1. 介紹了一種不用one-hot而直接用整數(shù)表示真值的方法,但是沒有改變網(wǎng)絡(luò)結(jié)構(gòu)(即最后一層仍然輸出46維,而不是因為你用了一個標量而只輸出一維)。
    • 看來它僅僅就是一個語法糖(loss函數(shù)選擇sparse_categorical_crossentropy就行了,見下面的示意)
  2. 嘗試把第2層由64改為4,變成bottleneck,演示你有46維的數(shù)據(jù)要輸出的話,前面的層維度過小會造成信息壓縮過于嚴重以致于丟失特征。
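第1點里說的“語法糖”,大致就是下面兩種寫法的區(qū)別(示意代碼):

# 寫法一:標簽做成one-hot,配categorical_crossentropy
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

# 寫法二:標簽直接用整數(shù)(0~45),配sparse_categorical_crossentropy,網(wǎng)絡(luò)結(jié)構(gòu)完全不變
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])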

Predicting house prices: a regression example

這里用了預(yù)測房價的Boston Housing Price數(shù)據(jù)集。

與吳恩達的課程一樣,也恰好是在這個例子里引入了對input的normalize,理由也僅僅是簡單的把量綱拉平。現(xiàn)在我們應(yīng)該還知道Normalize還能讓數(shù)據(jù)在進入激活函數(shù)前,把值限定在激活函數(shù)的梯度敏感區(qū)。

此外,一個知識點就是你對訓(xùn)練集進行Normalize用的均值和標準差,是直接用在測試集上的,而不是各計算各的,可以理解為保持訓(xùn)練集的“分布”。

這也是scikit learn里既有fit_transform、又有transform的原因。
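用numpy寫出來大致是這樣(示意,沿用Boston Housing例子里的train_data/test_data變量名):

mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_data = (train_data - mean) / std
test_data = (test_data - mean) / std   # 測試集用的是訓(xùn)練集的mean/std,而不是自己算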

  1. 對scalar進行預(yù)測是不需要進行激活(即無需把輸出壓縮到和為1的概率空間)
  2. loss也直觀很多,就是predict與target的差(取平方,除2,除批量等都是輔助),預(yù)測與真值的差才是核心。

Fundamentals of machine learning

  • Supervised learning
    • binary classification
    • multiclass classification
    • scalar regression
    • vector regression(比如bounding-box)
    • Sequence generation (摘要,翻譯...)
    • Syntax tree prediction
    • Object detection (一般bounding-box的坐標仍然是回歸出來的)
    • Image segmentation
  • Unsupervised learning
    • 是數(shù)據(jù)分析的基礎(chǔ),在監(jiān)督學(xué)習前也常常需要用無監(jiān)督學(xué)習來更好地“理解”數(shù)據(jù)集
    • 主要有降維(Dimensionality reduction)和聚類(clustering)
  • Self-supervised learning
    • 其實還是監(jiān)督學(xué)習,因為它仍需要與某個target做比較
    • 往往半監(jiān)督(自監(jiān)督)學(xué)習仍然有少量有標簽數(shù)據(jù)集,在此基礎(chǔ)上訓(xùn)練的不完善的model用來對無標簽的數(shù)據(jù)打標,循環(huán)中對無標簽數(shù)據(jù)打標的可靠度就越來越高,這樣總體數(shù)據(jù)集的可靠度也越來越高了。有點像生成對抗網(wǎng)絡(luò)里生成器和辨別器一同在訓(xùn)練過程中完善。
    • autoencoders
  • Reinforcement learning
    • an agent receives information about its environment and learns to choose actions that will maximize some reward.
    • 可以用訓(xùn)練狗來理解
    • 工業(yè)界的應(yīng)用除了游戲就是機器人了

Data preprocessing

  • vectorization
  • normalization (small, homogenous)
  • handling missing values
    1. 除非0有特別的含義,不然一般可以對缺失值補0
    2. 你不能保證測試集沒有缺失值,如果訓(xùn)練集沒見過缺失值,那么將不會學(xué)到忽略缺失值
      • 復(fù)制一些訓(xùn)練數(shù)據(jù)并且隨機drop掉一些特征
  • feature extraction
    • making a problem easier by expressing it in a simpler way. It usually requires understanding the problem in depth.
    • Before deep learning, feature engineering used to be critical, because classical shallow algorithms didn’t have hypothesis spaces rich enough to learn useful features by themselves. (又見假設(shè)空間)
    • 但是好的特征仍然能讓你在處理問題上更優(yōu)雅、更省資源,也能減小對數(shù)據(jù)集規(guī)模的依賴。

Overfitting and underfitting

  • Machine learning is the tension between optimization and generalization.
  • optimization要求你在訓(xùn)練過的數(shù)據(jù)集上能達到最好的效果
  • generalization則希望你在沒見過的數(shù)據(jù)上有好的效果
  • 如果訓(xùn)練集上loss在降,驗證集上loss也跟著降,說明還有優(yōu)化(optimize)的余地 -> underfitting看loss
    • just keep training
  • 如果驗證集上generalization stop improving(泛化不再進步,一般看衡量指標,比如準確率) -> overfitting

解決overfitting的思路:

  • the best solution is to get more training data
  • the simple way is to reduce the size of the model
    • 模型容量(capacity)足夠大,就足夠容易記住input到target的映射,也就沒推理(泛化)什么事了
  • add constraints -> weight regularization
  • add dropout

Regularization

Occam’s razor

given two explanations for something, the explanation most likely to be correct is the simplest one—the one that makes fewer assumptions.

即為傳說中“如無必要,勿增實體”的奧卡姆剃刀原理,這是在藝術(shù)創(chuàng)作領(lǐng)域的翻譯,我們這里還是直譯的好,即能解釋一件事的各種理解中,越簡單的、假設(shè)條件越少的,往往是最正確的,引申到機器學(xué)習,就是如何定義一個simple model

A simple model in this context is:

  • a model where the distribution of parameter values has less entropy
  • or a model with fewer parameters

實操就是迫使選擇那些值比較小的weights, which makes the distribution of weight values more regular. This is called weight regularization。這個解釋是我目前看到的對regularization這個名字最好的解釋:“正則化”三個字都認識,卻沒人知道這三個字合起來是什么意思,翻譯了跟沒翻一樣;而“使分布更常規(guī)化、正規(guī)化”,好像更有解釋性。

別的教材里還會告訴你這里是對大的權(quán)重的懲罰(設(shè)計損失函數(shù)時加上權(quán)重自身后,權(quán)重越大,loss也就越大,這就是對大權(quán)重的懲罰)

  • L1 regularization—The cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
  • L2 regularization—The cost added is proportional to the square of the value of the weight coefficients (the L2 norm of the weights).

L2 regularization is also called weight decay in the context of neural networks. Don’t let the different name confuse you: weight decay is mathematically the same as L2 regularization.

只需要在訓(xùn)練時添加正則化
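在keras里給Dense層加L2正則大致是這樣(示意,0.001是假設(shè)的系數(shù)):

from keras import regularizers

model.add(layers.Dense(16,
                       kernel_regularizer=regularizers.l2(0.001),  # 每個權(quán)重給loss貢獻 0.001 * w^2 的懲罰
                       activation='relu',
                       input_shape=(10000,)))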

Dropout

randomly dropping out (setting to zero) a number of output features of the layer during training.

dropout的作者Geoff Hinton解釋dropout的靈感來源于銀行辦事出納的不停更換和移動的防欺詐機制:可能一次欺詐的成功實施需要員工間的配合,所以就盡量降低這種配合的可能性。于是他為了防止神經(jīng)元也聚在一起“密謀”,嘗試隨機去掉一些神經(jīng)元,以及對輸出添加噪聲,讓模型更難記住某些pattern。
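在keras里就是在層與層之間插一個Dropout層(示意,0.5是常用的丟棄比例):

model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))   # 訓(xùn)練時隨機把上一層50%的輸出置零,預(yù)測時不生效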

The universal workflow of machine learning

  1. Defining the problem and assembling a dataset
    • What will your input data be?
    • What are you trying to predict?
    • What type of problem are you facing?
    • You hypothesize that your outputs can be predicted given your inputs.
    • You hypothesize that your available data is sufficiently informative to learn the relationship between inputs and outputs.
    • Just because you’ve assembled examples of inputs X and targets Y doesn’t mean X contains enough information to predict Y.
  2. Choosing a measure of success
    • accuracy? Precision and recall? Customer-retention rate?
    • balanced-classification problems,
      • accuracy and area under the receiver operating characteristic curve (ROC AUC)
    • class-imbalanced problems
      • precision and recall.
    • ranking problems or multilabel classification
      • mean average precision
    • ...
  3. Deciding on an evaluation protocol
    • Maintaining a hold-out validation set—The way to go when you have plenty of data
    • Doing K-fold cross-validation—The right choice when you have too few samples for hold-out validation to be reliable
    • Doing iterated K-fold validation—For performing highly accurate model evaluation when little data is available
  4. Preparing your data
    • tensor化,向量化,歸一化等
    • may do some feature engineering
  5. Developing a model that does better than a baseline
    • baseline:
      • 基本上是用純隨機(比如手寫數(shù)字識別,隨機猜測為10%)和純相關(guān)性推理(比如用前幾天的溫度預(yù)測今天的溫度,因為溫度變化是連續(xù)的),不用任何機器學(xué)習做出baseline
    • model:
      • Last-layer activation
        • sigmoid, relu系列, 等等
      • Loss function
        • 直接的預(yù)測值與真值的差,如MSE
        • 度量代理,如crossentropy是ROC AUC的proxy metric
    • Optimization configuration
      • What optimizer will you use? What will its learning rate be? In most cases, it’s safe to go with rmsprop and its default learning rate.
    • Scaling up: developing a model that overfits
      • 通過增加layers、增加capacity、增加training epoch來加速overfitting,從而再通過減模型和加約束等優(yōu)化
    • Regularizing your model and tuning your hyperparameters
      • Add dropout.
      • Try different architectures: add or remove layers.
      • Add L1 and/or L2 regularization.
      • Try different hyperparameters (such as the number of units per layer or the learning rate of the optimizer) to find the optimal configuration.
      • Optionally, iterate on feature engineering: add new features, or remove features that don’t seem to be informative.

Problem type | Last-layer activation | Loss function
Binary classification | sigmoid | binary_crossentropy
Multiclass, single-label classification | softmax | categorical_crossentropy
Multiclass, multilabel classification | sigmoid | binary_crossentropy
Regression to arbitrary values | None | mse
Regression to values between 0 and 1 | sigmoid | mse or binary_crossentropy

Deep learning for computer vision

Convolution Network

The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map.

  • convolution layers learn local patterns(局部特征)
    • The patterns they learn are translation invariant.(局部特征可在圖片別的地方重復(fù))
    • 有的教材里會說每個滑窗一個特征,然后引入?yún)?shù)共享才講到一個特征其實可以用在所有滑窗
  • They can learn spatial hierarchies of patterns(低級特征堆疊成高級特征)
  • depth axis no longer stands for specific colors as in RGB input; rather, they stand for filters(表示圖片時,3個通道有原始含義,卷積開始后通道只表示filter了)
  • valid and same convolution(加不加padding讓filter在最后一個像素時也能計算)
  • stride,滑窗步長
  • max-pooling or average-pooling
    • usually 2x2 windows by stride 2 -> 下采樣(downsample)
    • 更大的感受野
    • 更小的輸出
    • 不是唯一的下采樣方式(比如在卷積中使用stride也可以)
    • 一般用max而不是average(尋找最強的表現(xiàn))
  • 小數(shù)據(jù)集
    • data augmentation(旋轉(zhuǎn)、平移、縮放、shear、翻轉(zhuǎn)等)
      • 不能產(chǎn)生當前數(shù)據(jù)集不存在的信息
      • 所以仍需要dropout
    • pretrained network(適用通用物體)
      • feature extraction
      • fine-tuning
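把上面這些零件拼起來,一個最小的圖片二分類卷積網(wǎng)絡(luò)大致長這樣(示意,輸入尺寸150x150x3是假設(shè)):

from keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))            # 2x2窗口、步長2的下采樣
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))                    # 小數(shù)據(jù)集:數(shù)據(jù)增廣之外仍需要dropout
model.add(layers.Dense(1, activation='sigmoid'))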

Using a pretrained convnet

A pretrained network is a saved network that was previously trained on a large dataset typically on a large-scale image-classification task.

Feature extraction

Feature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.

  1. 即只使用別的大型模型提取的representations(特征),來構(gòu)建自己的分類器。
  2. 原本模型的分類器不但是為特定任務(wù)寫的,而且基本上喪失了位置和空間信息,只保留了對該任務(wù)上的presence probability.
  3. 最初的層一般只能提取到線、邊緣、顏色等低級特征,再往后會聚合出一些紋理,更高的層就可能會疊加出一些眼、耳等抽象的特征,所以你的識別對象與pretrained數(shù)據(jù)源差別很大的時候,就需要考慮把最尾巴的幾層layer也舍棄掉。(e.g. VGG16最后一層提取了512個feature map)
  4. 兩種用法:
    • 跑一次預(yù)訓(xùn)練模型你選中的部分,把參數(shù)存起來(←這個說法不對,存的是輸出),把輸出當作dataset作為自己構(gòu)建的分類器的input。
      • 快,省資源,但是需要把數(shù)據(jù)集固定住,等于沒法做data augmentation
      • 跑預(yù)訓(xùn)練模型時不需要計算梯度(freeze)
      • 其實應(yīng)用預(yù)訓(xùn)練模型就等于用別人的模型來預(yù)處理數(shù)據(jù)集,而真實的模型只有一個小分類器
    • 合并到自定義的網(wǎng)絡(luò)中當成普通網(wǎng)絡(luò)訓(xùn)練(見下面的骨架)
      • 慢,但是能做數(shù)據(jù)增廣了
      • 需手動設(shè)置來自預(yù)訓(xùn)練模型的層不計算梯度
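第二種用法(把預(yù)訓(xùn)練模型并進自己的網(wǎng)絡(luò)一起訓(xùn)練)的骨架大致如下(示意):

from keras.applications import VGG16
from keras import layers, models

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
conv_base.trainable = False        # freeze:預(yù)訓(xùn)練部分不更新梯度

model = models.Sequential()
model.add(conv_base)               # 把預(yù)訓(xùn)練模型當成一個layer
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))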

注:這里為什么單獨跑預(yù)訓(xùn)練模型不能數(shù)據(jù)增廣呢?

教材用的是keras, 它處理數(shù)據(jù)的方式是做一個generator,只要你給定數(shù)據(jù)增廣的規(guī)則(參數(shù)),哪怕只有一張圖,它也可以無窮無盡地給你生成下一張。所以每一次訓(xùn)練都能有新的數(shù)據(jù)喂到網(wǎng)絡(luò)里。這是出于內(nèi)存考慮,不需要真的把數(shù)據(jù)全部加載到內(nèi)存里。

而如果你是一個固定的數(shù)據(jù)集,比如幾萬條,那么你把所有的數(shù)據(jù)跑一遍,把這個結(jié)果當成數(shù)據(jù)集(全放在內(nèi)存里),那也不是不可以在這一步用數(shù)據(jù)增廣。

Fine-tuning

Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in this case, the fully connected classifier) and these top layers. This is called fine-tuning because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.

前面的feature extraction方式,會把預(yù)訓(xùn)練模型里你選中的layers給freeze掉,即不計算梯度。這里之所以叫fine-tuning,意思就是會把最后幾層(top layers)給unfreeze掉,這樣的好處是保留低級特征,重新訓(xùn)練高級特征,還保留了原來大型模型的結(jié)構(gòu),不需要自行構(gòu)建。

2021-09-25-13-39-58.png

但是: it’s only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained. 新加的classifier還沒訓(xùn)練好時,誤差信號會很大,會破壞被unfreeze的那幾層預(yù)訓(xùn)練權(quán)重,所以變成了先train一個大體差不多的classifier,再聯(lián)合起來train一遍高級特征和classifier:

  1. Add your custom network on top of an already-trained base network.
  2. Freeze the base network.
  3. Train the part you added. (第一次train)
  4. Unfreeze some layers in the base network.
  5. Jointly train both these layers and the part you added.(第二次train)

但千萬別把所有層都unfreeze來訓(xùn)練了

  1. 低級特征都為邊緣和顏色,無需重新訓(xùn)練
  2. 小數(shù)據(jù)量訓(xùn)練大型模型,model capacity相當大,非常容易過擬合
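解凍最后幾層的寫法大致如下(示意,沿用上面feature extraction骨架里的conv_base,以VGG16的block5為例):

conv_base.trainable = True
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':   # 從block5開始解凍
        set_trainable = True
    layer.trainable = set_trainable    # block5之前的層保持freeze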

Visualizing what convnets learn

并不是所有的深度學(xué)習都是黑盒子,至少對圖像的卷積網(wǎng)絡(luò)不是 -> representations of visual concepts, 下面介紹三種可視化和解釋這些representations的方法。

Visualizing intermediate activations

就是把每個中間層(基本上是“卷積+池化+激活”)可視化出來,This gives a view into how an input is decomposed into the different filters learned by the network.

from keras import models

layer_outputs = [layer.output for layer in model.layers[:8]]
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)

activations = activation_model.predict(img_tensor)  # img_tensor 為預(yù)處理好的單張輸入圖片
first_layer_activation = activations[0]             # 第一層的輸出 (1, h, w, channels)

import matplotlib.pyplot as plt
plt.matshow(first_layer_activation[0, :, :, 4], cmap='viridis')

# 注意使用的是matshow而不是show

2021-09-25-23-36-19.png

以上代碼是利用了keras的Model特性,把前8個layers的輸出都作為模型輸出(就是做了一個多頭的模型),然后順便取了第4和第7個channel的feature map畫出來,可以看到,圖一感興趣的是對角線,圖二提取的是藍色的亮點。

把這些輸出結(jié)構(gòu)化地畫出來,可以確信初始layer確實提取的是簡單特征,越往后越高級(抽象)。

A deep neural network effectively acts as an information distillation(信息蒸餾) pipeline, with raw data going in (in this case, RGB pictures) and being repeatedly transformed so that irrelevant information is filtered out (for example, the specific visual appearance of the image), and useful information is magnified and refined (for example, the class of the image).

關(guān)鍵詞:有用的信息被不斷放大和強化

書里舉了個有趣的例子,要你畫一輛自行車。你畫出來的并不是一輛充滿細節(jié)的單車,而往往是你抽象出來的單車,你會用基本的線條勾勒出你對單車特征的理解,比如龍頭、輪子等關(guān)鍵部件,以及相對位置。畫家為什么能畫得又真實又好看?那是因為他們真的仔細觀察了單車,他們繪畫的時候用的并不是特征,而是一切細節(jié);然而對于沒有受過訓(xùn)練的普通人來說,往往只能用簡單幾筆勾勒出腦海中單車的樣子(其實并不是樣子,而是特征的組合)

Visualizing convnet filters

通過強化filter對輸出的反應(yīng)并繪制出來,這是從數(shù)學(xué)方法上直接觀察filter,看什么最能“刺激”一個filter,用“梯度上升”最能體現(xiàn)這種思路:

把filter的output當成loss,用梯度上升(每次修改input_image)最終得到的那張輸入圖,就是最能激活這個filter的極端情況,可以認為它展示了這個filter其實是在提取什么(responsive to):

from keras.applications import VGG16
from keras import backend as K
model = VGG16(weights='imagenet', include_top=False)
layer_name = 'block3_conv1'
filter_index = 0
layer_output = model.get_layer(layer_name).output
loss = K.mean(layer_output[:, :, :, filter_index])  # output就是loss

grads = K.gradients(loss, model.input)[0] # 對input求微分
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)

iterate = K.function([model.input], [loss, grads])
import numpy as np
# 理解靜態(tài)圖的用法
loss_value, grads_value = iterate([np.zeros((1, 150, 150, 3))])

input_img_data = np.random.random((1, 150, 150, 3)) * 20 + 128.
step = 1.
for i in range(40):
    loss_value, grads_value = iterate([input_img_data])
    input_img_data += grads_value * step  # 梯度上升

按上述代碼的思路結(jié)構(gòu)化輸出并繪圖:

2021-09-26-00-06-44.png

從線條到紋理到物件(眼睛,毛皮,葉子)

each layer in a convnet learns a collection of filters such that their inputs can be expressed as a combination of the filters.

This is similar to how the Fourier transform decomposes signals onto a bank of cosine functions.

用傅里葉變換來類比卷積網(wǎng)絡(luò)每一層就是把input表示成一系列特征的組合。

Visualizing heatmaps of class activation

which parts of a given image led a convnet to its final classification decision. 即圖像的哪一部分對最終的決策起了作用。

  • class activation map (CAM) visualization,
  • Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization.”

you’re weighting a spatial map of “how intensely the input image activates different channels” by “how important each channel is with regard to the class,” resulting in a spatial map of “how intensely the input image activates the class.

解讀上面這句話:

不同channels(特征)對圖像的激活強度
×(逐通道加權(quán)后求平均)
每個特征對(鑒定為)該類別的重要程度
=
該“類別”對圖像的激活強度

一張兩只非洲象的例圖,使用VGG16來做分類,得到92.5%置信度的非洲象(African elephant)的判斷,為了visualize哪個部分才是“最像非洲象”的,使用Grad-CAM處理:

import numpy as np
from keras import backend as K
from keras.applications.vgg16 import VGG16

model = VGG16(weights='imagenet')
african_elephant_output = model.output[:, 386]    # African elephant 在ImageNet的類別索引是386
last_conv_layer = model.get_layer('block5_conv3') # VGG16最后一個卷積層

grads = K.gradients(african_elephant_output, last_conv_layer.output)[0]
pooled_grads = K.mean(grads, axis=(0, 1, 2))
iterate = K.function([model.input],
                     [pooled_grads, last_conv_layer.output[0]])
pooled_grads_value, conv_layer_output_value = iterate([x])  # x 為預(yù)處理好的輸入圖片
for i in range(512):
    conv_layer_output_value[:, :, i] *= pooled_grads_value[i]
heatmap = np.mean(conv_layer_output_value, axis=-1)

2021-09-26-01-53-59.png

疊加到原圖上去(用cv2融合兩張圖片,即相同維度的數(shù)組以不同權(quán)重逐像素相加):

2021-09-26-01-58-32.png

Deep learning for text and sequences

空間上的序列、時間上的序列組成的數(shù)據(jù),比如文本,視頻,天氣數(shù)據(jù)等,一般用recurrent neural network(RNN)和1D convnets

其實很多名詞,包括convnets,我并沒有在別的地方看到過,好像就是作者自己發(fā)明的,但這些不重要,知道它描述的是什么就可以了,不一定要公認術(shù)語。

通用場景:

  • [分類: 文本分類] Document classification and timeseries classification, such as identifying the topic of an article or the author of a book
  • [分類: 文本比較] Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are
  • [分類: 生成] Sequence-to-sequence learning, such as decoding an English sentence into French
  • [分類: 情感分析]Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative
  • [回歸: 預(yù)測]Timeseries forecasting, such as predicting the future weather at a certain location, given recent weather data

我畫蛇添足地加了是分類問題還是回歸問題.

none of these deeplearning models truly understand text in a human sense

Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.

tokenizer

圖像用像素上的顏色來數(shù)字化,那文字又把什么數(shù)字化呢?

  • 拆分為詞,把每個詞轉(zhuǎn)化成向量
  • 拆分為字(或字符),把每個字符轉(zhuǎn)化為向量
  • 把字(詞)與前n個字(詞)組合成單元,轉(zhuǎn)化為向量(類似滑窗),N-Grams

all of above are tokens, and breaking text into such tokens is called tokenization. These vectors, packed into sequence tensors, are fed into deep neural networks.

N-grams這種生成的token是無序的,就像一個袋子裝了一堆詞:bag-of-words: a set of tokens rather than a list or sequence.

所以句子結(jié)構(gòu)信息丟失了,更適合用于淺層網(wǎng)絡(luò)。它是一種rigid, brittle(僵硬的、脆弱的)特征工程方式,而深度學(xué)習則用多層網(wǎng)絡(luò)自己去提取特征。
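用keras的Tokenizer走一遍“分詞 -> 索引 -> 對齊”大致是這樣(示意代碼,num_words/maxlen是假設(shè)的取值):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)               # 只保留詞頻最高的前1000個詞
tokenizer.fit_on_texts(samples)                     # 建立 詞->索引 的字典
sequences = tokenizer.texts_to_sequences(samples)   # 每句話變成一串索引(即序列)
padded = pad_sequences(sequences, maxlen=10)        # 補0/截斷,對齊成同樣長度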

vectorizer

token -> vector:

  • one-hot encoding
  • token/word embedding (word2vec)

one-hot

  1. 以token總數(shù)量(一般就是字典容量)為維度
  2. 一般無序,所以生成的時候只需要按出現(xiàn)順序編索引就好了
  3. 有時候也往往伴隨丟棄不常用詞,以減小維度
  4. 也可以在字符維度編碼(維度更低)
  5. 一個小技巧,如果索引數(shù)字過大,可以把單詞hash到固定維度(未跟進)

特點/問題:

  • sparse
  • high-dimensional, 比如幾千幾萬
  • no spatial relationship
  • hardcoded

word embeddings

  • Dense
  • Lower-dimensional,比如128,256...
  • Spatial relationships (語義接近的向量空間上也接近)
  • Learned from data

to obtain word embeddings:

  1. 當成訓(xùn)練參數(shù)之一(以Embedding層的身份),跟著訓(xùn)練任務(wù)一起訓(xùn)練
  2. pretrained word embeddings
    • Word2Vec(2013, google)
      • CBOW
      • Skip-Gram
    • GloVe(2014, Stanford)
    • 前提是語言環(huán)境差不多,不同學(xué)科/專業(yè)/行業(yè)里的詞的關(guān)系是完全不同的
      • GloVe從wikipedia和很多通用語料庫里訓(xùn)練,可以嘗試在許多非專業(yè)場景里使用。

keras加載預(yù)訓(xùn)練詞向量的方式:

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

pytorch:

# TEXT, LABEL為torchtext的Field對象
from torchtext.vocab import Vectors
vectors=Vectors(name='./sgns.sogou.word') # 使用預(yù)訓(xùn)練的詞向量,維度為300
TEXT.build_vocab(train, vectors=vectors) #構(gòu)建詞典
LABEL.build_vocab(train)

vocab_size = len(TEXT.vocab)
vocab_vectors = TEXT.vocab.vectors.numpy() #準備好預(yù)訓(xùn)練詞向量

self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size)

# 上面是為了回顧,真正用來做對比的是下面這兩句
self.embedding.weight.data.copy_(torch.from_numpy(vocab_vectors))
self.embedding.weight.requires_grad = False

預(yù)訓(xùn)練詞向量也可以繼續(xù)訓(xùn)練,以得到task-specific embedding

Recurrent neural networks(RNN)

sequence, time series類的數(shù)據(jù),天然會受到前后數(shù)據(jù)的影響,RNN通過在計算當前token的時候引入上一個token的計算結(jié)果(反向的話就能獲得下一個token的結(jié)果),以獲取上下文的信息。

前面碰到的網(wǎng)絡(luò),數(shù)據(jù)消費完就往前走(按我這種說法,后面還有很多“等著二次消費的”模塊,比如inception, residual等等),叫做feedforward network。顯然,RNN中,一個token產(chǎn)生輸出后并不是直接丟給下一層,而是還復(fù)制了一份丟給了同層的下一個token. 這樣,當前token的output成了下一個token的state

  • 因為一個output其實含有“前面”所有的信息,一般只需要最后一個output
  • 如果是堆疊多層網(wǎng)絡(luò),則需要返回所有output
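書里用numpy寫過一個極簡的RNN前向循環(huán),大意如下(示意):

import numpy as np

timesteps, input_features, output_features = 100, 32, 64
inputs = np.random.random((timesteps, input_features))
state_t = np.zeros((output_features,))              # 初始state為全0

W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

successive_outputs = []
for input_t in inputs:                              # 沿時間步循環(huán)
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t                              # 當前output變成下一步的state
final_output = np.stack(successive_outputs, axis=0)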

序列過長梯度就消失了,所謂的遺忘(推導(dǎo)見另一篇筆記) -> LSTM, GRU

Long Short-Term Memory(LSTM)

  1. 想象有一根傳送帶穿過sequence
  2. 同一組input和state會進行三次形式相同(參數(shù)不同)的線性變換,有沒有聯(lián)想到transformer用同一個輸出去生成q, k, v

output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(C_t, Vo) + bo)
i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi) 
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf) 
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)

c_t+1 = i_t * k_t + c_t * f_t  # 仍然有q,k,v的意思(i、k相乘,再加上c與f相乘,生成新的c)

不要去考慮哪個是遺忘門,哪個是記憶門,還是輸出門,最終是由weights決定的,而不是由設(shè)計決定的。

Just keep in mind what the LSTM cell is meant to do:

allow past information to be reinjected at a later time, thus fighting the vanishing-gradient problem.

關(guān)鍵詞:reinject

dropout

不管是keras還是pytorch,都幫你隱藏了dropout的坑。你能看到應(yīng)用這些框架的時候,是需要你把dropout當參數(shù)傳進去的,而不是手動接一個dropout layer,原因是需要在序列的每一個時間步上應(yīng)用同樣的dropout mask才能起作用,不然就會起到反作用。

keras封裝得要復(fù)雜一點:

model.add(layers.GRU(32,
                    dropout=0.2,
                    recurrent_dropout=0.2,
                    input_shape=(None, float_data.shape[-1])))

stacking recurrent layers

前面說過,設(shè)計好模型的一個判斷依據(jù)是至少讓模型能跑到overfitting。如果還沒到overfitting,表現(xiàn)就已經(jīng)遇到瓶頸,那么可以考慮增加模型容量(疊更多層,以及拓寬layer的輸出維度)

堆疊多層就需要用到每個節(jié)點上的輸出,而不只關(guān)心最后一個輸出了臀玄。
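在keras里堆疊recurrent層時,除最后一層外都要設(shè)return_sequences=True(示意,float_data沿用上面的變量):

from keras import models, layers

model = models.Sequential()
model.add(layers.GRU(32, return_sequences=True,     # 返回每個時間步的輸出,供下一層繼續(xù)處理
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.GRU(64))                            # 最后一層只要最終輸出
model.add(layers.Dense(1))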

Bidirectional

keras奇葩的bidirectional語法:

model.add(layers.Bidirectional(layers.LSTM(32)))

其實這是設(shè)計模式在類的封裝上的典型應(yīng)用,善用繼承和多態(tài),無侵入地擴展類的方法和屬性,而不是不斷魔改原代碼,加參數(shù),改API。但在腳本語言風格的環(huán)境里,這么玩就有點格格不入了。

Sequence processing with convnets

  1. 卷積用到序列上去也是可以的
  2. 一個向量只表示一個token,如果把token的向量打斷就違背了token是最小單元的初衷,所以序列上的卷積,不可能像圖片上兩個方向去滑窗了。(Conv1D的由來)
  3. 一個卷積核等于提取了n個關(guān)聯(lián)的上下文(有點類似n-grams),堆疊得夠深感受野更大,可能得到更大的上下文。
  4. 但仍然理解為filter在全句里提取局部特征

歸根結(jié)底,圖片的最小單元是一個像素(一個數(shù)字),而序列(我們這里說文本)的最小單元是token,而token又被我們定義為vector(一組數(shù)字)了,那么卷積核就限制在至少要達到最小單元(vector)的維度了。
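一個最小的Conv1D文本分類骨架大致如下(示意,max_features/maxlen是假設(shè)的取值):

from keras import layers, models

max_features, maxlen = 10000, 500
model = models.Sequential()
model.add(layers.Embedding(max_features, 128, input_length=maxlen))
model.add(layers.Conv1D(32, 7, activation='relu'))   # 卷積核一次覆蓋7個token(類似7-gram)
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1, activation='sigmoid'))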

Combining CNNs and RNNs to process long sequences

卷積能通過加深網(wǎng)絡(luò)獲取更大的感受野,但仍然是“位置無關(guān)”的,因為每個filter本就是在整個序列里搜索相同的特征。

但是它確實提取出了特征,是否可以把位置關(guān)系等上下文的工作交給下游任務(wù)RNN做呢?

2021-09-27-01-27-17.png

不但可以實現(xiàn),而且堆疊兩種網(wǎng)絡(luò)后,還可以把數(shù)據(jù)集做得更大(CNN是矩陣運算,還能用GPU加速)。

Advanced deep-learning best practices

這一章介紹了更多的網(wǎng)絡(luò)結(jié)構(gòu)和模塊(從keras的封裝特性出發(fā)),以及batch normalization, model ensembling等知識。

beyond Sequential model

前面介紹的都是Sequential模型,就是layer一個接一個地前后堆疊,現(xiàn)實中有很多場景并不是一進一出的:

  1. multi-input model

假設(shè)為二手衣物估價:

  • 格式化的元數(shù)據(jù)(品牌,性別,年齡,款式): one-hot, dense
  • 商品的文字描述:RNN or 1D convnet
  • 圖片展示:2D convnet
  • 每個input用適合自己的網(wǎng)絡(luò)做輸出,然后合并起來作為一個input,回歸一個價格
  2. multi-output model (multi-head)

一般的檢測器通常就是多頭模型,因為既要給出對象類別,還要回歸出對象的位置

  3. graph-like model

這個名字很好地形容了做深度學(xué)習時看別人的網(wǎng)絡(luò)的方式:看圖。現(xiàn)代的SOTA網(wǎng)絡(luò)往往既深且復(fù)雜,網(wǎng)絡(luò)結(jié)構(gòu)畫出來也不再是一條線或幾個簡單分支,這本書干脆把它們叫圖形網(wǎng)絡(luò):Inception, Residual

為了能架構(gòu)這些復(fù)雜的網(wǎng)絡(luò),keras介紹了新的語法,先看看怎么重寫Sequential:

from keras.models import Sequential, Model
from keras import layers, Input

seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))

# 重寫
input_tensor = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = Model(input_tensor, output_tensor)

model.summary()

我們自己實現(xiàn)過靜態(tài)圖,最終去執(zhí)行的時候能從尾追溯到頭,并從頭開始計算,這里也是一樣的:

  1. input, output是Tensor類,所以有完整的層次信息
  2. output往上追溯,最終溯到缺少一個input
  3. 這個input恰好也是Model構(gòu)造函數(shù)的參數(shù)之一,閉環(huán)了。

書里說得更簡單,output是input不斷transforming的結(jié)果。如果傳一個沒有這個關(guān)系的input進去,就會報錯。

demo

用一個QA的例子來演示多輸入(一個問句,一段資料),輸出為答案(簡化為單個詞,所以只有一個輸出)

from keras.models import Model
from keras import layers, Input

# 詞表大小為示意取值
text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

text_input = Input(shape=(None,), dtype='int32', name='text')
embedded_text = layers.Embedding(
    text_vocabulary_size, 64)(text_input)   # Embedding(詞表大小, 向量維度)
encoded_text = layers.LSTM(32)(embedded_text)  # lstm 處理資訊

question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(
    question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)  # lstm 處理問句

concatenated = layers.concatenate([encoded_text, encoded_question],
                                  axis=-1)  # 沿最后一個軸(特征維)拼接
answer = layers.Dense(answer_vocabulary_size,
                      activation='softmax')(concatenated)
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

這里是把答案直接給回歸出來了(one-hot),如果是給出答案的首尾位置,那肯定只能用索引了。

demo

多頭輸出的:

# 線性回歸
age_prediction = layers.Dense(1, name='age')(x)
# 邏輯回歸
income_prediction = layers.Dense(num_income_groups, activation='softmax', name='income')(x)
# 二元邏輯回歸
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)
model = Model(posts_input,
              [age_prediction, income_prediction, gender_prediction])

梯度下降要求loss是一個標量,keras提供了方法將三個loss加起來,同時為了量綱統(tǒng)一,還給了權(quán)重參數(shù):

model.compile(optimizer='rmsprop',
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'],
              loss_weights=[0.25, 1., 10.])

Directed acyclic graphs of layers

有向無環(huán)圖,可以理解為最終不會回到出發(fā)點。

現(xiàn)在要介紹的是幾個Modules,意思是可以把它當成一個layer,來構(gòu)造你的網(wǎng)絡(luò)/模型。

Inception Modules

  • inspired by network-in-network
  • 對同一個輸入做不同(層數(shù)/深度)的卷積(保證最終相同的下采樣維度),最后合并為一個輸出
  • 因為卷積的深度不盡相同,學(xué)到的空間特征也有粗有細

Residual Connections

  • 有些地方叫shortcut
  • 用的是相加,不是concatenate, 如果形狀變了,對earlier activation做linear transformation
  • 解決vanishing gradients and representational bottlenecks
  • adding residual connections to any model that has more than 10 layers is likely to be beneficial.

representational bottlenecks

序列模型中,每一層的表示都來自于前一層,如果前一層很小,比如維度過低,那么攜帶的信息量也被壓縮得很有限了,整個模型都會被這個“瓶頸”限制。比如音頻信號處理,降維就是降頻,比如到0-15kHz,但是下游任務(wù)也沒法recover dropped frequencies了,所有的損失都是永久的。

Residual connections, by reinjecting earlier information downstream, partially solve this issue for deep-learning models.(又一次強調(diào)reinject)

Layer weight sharing

在網(wǎng)絡(luò)的不同位置用同一個layer,并且參數(shù)也相同。等于共享了相同的知識、相同的表示,以及是同時(simultaneously)訓(xùn)練的。

一個語義相似度的例子,輸入是A和B還是B和A,是一樣的(即可以互換)。架構(gòu)網(wǎng)絡(luò)的時候,用LSTM來處理句子,需要做兩個LSTM嗎?當然可以,但是也可以只做一個LSTM,分別喂入兩個句子,合并兩個輸出來做分類。就是考慮到這種互換性,既然能互換,也就是這個layer也能應(yīng)用到另一個句子,因此就沒必要再新建一個LSTM.
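書中共享LSTM(siamese LSTM)的例子大致是這樣(示意):

from keras import layers, Input
from keras.models import Model

lstm = layers.LSTM(32)                       # 只實例化一次,兩個輸入共用同一組權(quán)重

left_input = Input(shape=(None, 128))
left_output = lstm(left_input)

right_input = Input(shape=(None, 128))
right_output = lstm(right_input)             # 復(fù)用同一個layer對象

merged = layers.concatenate([left_output, right_output], axis=-1)
predictions = layers.Dense(1, activation='sigmoid')(merged)
model = Model([left_input, right_input], predictions)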

Models as layers

講了兩點:

  1. model也可以當layer使用
  2. 多處使用同一個model也是共享參數(shù),如上一節(jié)。

舉了個雙攝像頭用以感知深度的例子,每個攝像頭都用一個Xception網(wǎng)絡(luò)提取特征,但是可以共用這個網(wǎng)絡(luò),因為拍的是同樣的內(nèi)容,只需要處理兩個攝像頭拍到的內(nèi)容的差別就能學(xué)習到深度信息,因此希望用的是同樣的特征提取機制。

都是蜻蜓點水。

More Advanced

Batch Normalization

  1. 第一句話就是說為了讓樣本數(shù)據(jù)看起來更相似,說明這是初衷。
  2. 然后是能更好地泛化到未知數(shù)據(jù)(同樣也是因為bn后就更相似了)
  3. 深度網(wǎng)絡(luò)中每一層之后也需要做
    • 還有一個書里沒講到的原因,就是把值移到激活函數(shù)梯度大的區(qū)域(比如0附近),否則過大過小的值在激活函數(shù)的曲線里都是幾乎沒有梯度的位置
  4. 內(nèi)部用的指數(shù)移動平均(exponential moving average)
  5. 一些層數(shù)非常深的網(wǎng)絡(luò)必須用BN,像ResNet 50/101/152, Inception V3, Xception等
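keras里通常緊跟在卷積層或Dense層后面(示意):

from keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, 3, activation='relu', input_shape=(64, 64, 3)))
model.add(layers.BatchNormalization())   # 對上一層輸出做歸一化(內(nèi)部維護均值/方差的指數(shù)移動平均)
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))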

Depthwise Separable Convolution

之前的卷積,不管輸入有多少個channel,都是放到一個矩陣里一起計算的;DSC把每一個channel拆開,單獨做空間卷積(不共享參數(shù)),再用1x1的pointwise卷積把各通道輸出混合起來。因為沒有一個巨大的矩陣,變成了幾個小矩陣乘法,參數(shù)量也大大變少了。

  1. 對于小樣本很有效
  2. 對于大規(guī)模數(shù)據(jù)集,它可以成為里面的固定結(jié)構(gòu)的模塊(它也是Xception的基礎(chǔ)架構(gòu)之一)

In the future, it’s likely that depthwise separable convolutions will completely replace regular convolutions, whether for 1D, 2D, or 3D applications, due to their higher representational efficiency.

?!!

Model ensembling

  1. Ensembling consists of pooling together the predictions of a set of different models, to produce better predictions.
  2. 期望每一個good model擁有part of the truth(部分的真相)。盲人摸象的例子,沒有哪個盲人擁有直接感知一頭象的能力,機器學(xué)習可能就是這樣一個盲人。
  3. The key to making ensembling work is the diversity of the set of classifiers -> 關(guān)鍵是要“多樣性”。 Diversity is what makes ensembling work.
  4. 千萬不要去ensemble同樣的網(wǎng)絡(luò)僅僅改變初始化而train多次的結(jié)果。
  5. 比較好的實踐有ensemble tree-based models(random forests, gradient-boosted trees) 和深度神經(jīng)網(wǎng)絡(luò)
  6. 以及wide and deep category of models, blending deep learning with shallow learning.

同樣是蜻蜓點水。

Generative deep learning

Our perceptual modalities, our language, and our artwork all have statistical structure. Learning this structure is what deep-learning algorithms excel at.

Machine-learning models can learn the statistical latent space of images, music, and stories, and they can then sample from this space, creating new artworks with characteristics similar to those the model has seen in its training data.

Text generation with LSTM

Language model

很多地方都在按自己的理解定義language model,這本書定義很明確:能根據(jù)前文為下一個或多個token的概率建模的網(wǎng)絡(luò)。

any network that can model the probability of the next token given the previous ones is called a language model.

  1. 所以首先,它是一個network
  2. 它做的事是model一個probability
  3. 內(nèi)容是the next token
  4. 條件是previous tokens

一旦你有了這樣一個language model,你就能sample from it,這就是前面筆記里的sample from latent space, 然后generate了。

greedy sampling and stochastic sampling

如果根據(jù)概率模型每次都選“最可能”的輸出,在連貫性上被證明是不好的,而且也喪失了創(chuàng)造性,所以還是要給一定的隨機性,讓它能選到“不那么可能”的輸出。

因為人類思維本身也是跳躍的。

考慮兩個輸出下一個token時的極端情況:

純隨機(所有可選詞的概率均等) | 毫無意義 | max entropy | 創(chuàng)造性高
greedy sampling | 毫無生趣 | minimum entropy | 可預(yù)測性高

實現(xiàn)方式:softmax temperature

除一個溫度:如果溫度大于1,那么溫度越大,數(shù)值被壓縮的幅度就越大(這樣差距就越小,分布會更平均)-> 偏向了純隨機的概率結(jié)構(gòu)(均等)

import numpy as np
def reweight_distribution(original_distribution, temperature=0.5):
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

寫成公式:

\frac{e^{\frac{\log(d)}{T}}}{\sum e^{\frac{\log(d)}{T}}}

這是對溫度和softmax歸一化做了融合:

  1. 一個是對目標分布取自然對數(shù)后除以溫度,再當成e的指數(shù)冪回去(如果不除溫度,那就是先log再e,等于是原數(shù))
  2. 一個是標準的softmax歸一化(除以總和)

這里回顧一個概念:Sampling from a space

書里大量用了這個概念,結(jié)合代碼看,其實就是一個predict函數(shù),也就是說,一般人理解的“預(yù)測,推理”,是從業(yè)務(wù)邏輯方面來理解,作者更愿意從統(tǒng)計學(xué)和線性代數(shù)角度來理解。

兩種訓(xùn)練方法:

  1. 每次用N個字,來預(yù)測第N+1個字,即output只有1個(voc_size, 1),訓(xùn)練的是language model
  2. 每次用N個字(a, b), 來預(yù)測(a+1, b+1), output有N個(voc_size, N),訓(xùn)練的是特定的任務(wù),比如寫詩,作音樂

過程:

  1. 準備數(shù)據(jù),X為一組句子,Y為每一個句子對應(yīng)的下一個字(全部向量化)
  2. 搭建一個LSTM + Dense 的網(wǎng)絡(luò),輸出根據(jù)具體情況要么為1,要么為N
  3. 每一個epoch里均進行預(yù)測(如果不是為了看過程,有必要嗎?我們要最后一輪的預(yù)測不就行了?)
    • 進行一次fit(就是train),得到優(yōu)化后的參數(shù)
    • 隨機取一段文本,用作種子(用來生成第一個字)
    • 計算生成多少個字,就開始for循環(huán)
      • 向量化當前的種子(會越來越長)
      • predict,得到每個字的概率
      • softmax temperature,平滑概率,取出next_token
      • next_token轉(zhuǎn)回文本,附加到seed后面
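上面“softmax temperature + 按概率取樣”這一步,書里大致封裝成這樣一個函數(shù)(示意):

import numpy as np

def sample(preds, temperature=1.0):
    # preds是模型softmax輸出的下一個字的概率分布
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature           # 溫度越高,分布越平 -> 越隨機
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)   # 按重新加權(quán)后的概率抽一個token
    return np.argmax(probas)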

DeepDream

看了一遍,不感興趣。核心思路跟可視化filter的思路是一樣的:gradient ascent

  1. 從對每個layer里的單個filter做梯度上升變成了對整個layer做梯度上升
  2. 不再從隨機噪聲開始,而是從一張真實圖片開始,實現(xiàn)這些layer里對圖片影響最大的patterns的distorting

Neural style transfer

Neural style transfer consists of applying the style of a reference image to a target image while conserving the content of the target image.

  • 兩個對象:reference, target image
  • 兩個概念:stylecontent

B的content應(yīng)用A的style,我們可以理解為“筆刷”,或者用前些年的流行應(yīng)用來解釋:把一幅畫水彩化,或油畫化。

把style分解為不同spatial scales上的:紋理,顏色,和visual pattern

想用深度學(xué)習來嘗試解決這個問題,首先至少得定義損失函數(shù)是什么樣的。

If we were able to mathematically define content and style, then an appropriate loss function to minimize would be the following:

loss = distance(style(reference_image) - style(generated_image)) +
        distance(content(original_image) - content(generated_image))

即對新圖而言,紋理要無限靠近A,內(nèi)容要無限靠近B。

  • the content loss
    • 圖像內(nèi)容屬于高級抽象,因此只需要top layers參與就行了,實際應(yīng)用中只取了最頂層
  • the style loss
    • 應(yīng)用Gram matrix
      • the inner product of the feature maps of a given layer
      • correlations between the layer's feature
      • 需要生成圖和參考圖的每一個對應(yīng)的layer擁有相同的紋理(same textures at different spatial scales),因此需要所有的layer參與

從這里應(yīng)該也能判斷出,要搭建網(wǎng)絡(luò)的話,input至少由三部分(三張圖片)構(gòu)成了。

demo

  • input為參考圖,目標圖,和生成圖(占位),concatenate成一個tensor
  • 用VGG19來做特征提取
  • 計算loss
    1. 用生成圖和目標圖top_layer以L2 norm距離做loss
    2. 用生成圖和參考圖every layer以L2 Norm做loss并累加
    3. 對生成圖偏移1像素做regularization loss(具體看書)
    4. 上述三組loss累加,為一輪的loss
  • 用loss計算對input(即三聯(lián)圖)的梯度

Generating images

Sampling from a latent space of images to create entirely new images

熟悉的句式又來了湿弦。

核心思想:

  1. low-dimensional latent space of representations
    • 一般是個vector space
    • any point can be mapped to a realistic-looking image
  2. the module capable of realizing this mapping, which takes a point as input and outputs an image, is called:
    • generator -> GAN
    • decoder -> VAE

VAE v.s. GAN

  • VAEs are great for learning latent spaces that are well structured
  • GANs generate images that can potentially be highly realistic, but the latent space they come from may not have as much structure and continuity.

VAE(variational autoencoders)

given a latent space of representations, or an embedding space, certain directions in the space may encode interesting axes of variation in the original data. -> inspired by concept space

比如包含人臉的數(shù)據(jù)集的latent space里,是否會存在smile vector?定位這樣的vector,就可以修改圖片(把圖片project到這個latent space,再沿這個方向移動)。

Variational autoencoders

Variational autoencoders are a kind of generative model that’s especially appropriate for the task of image editing via concept vectors.

They’re a modern take on autoencoders (a type of network that aims to encodean input to a low-dimensional latent space and then decode it back) that mixes ideas from deep learning with Bayesian inference.

  • VAE把圖片編碼成latent space里一個統(tǒng)計分布的參數(shù),而不是一個固定的點。
  • 參數(shù)就是表示一種正態(tài)分布的mean和variance(實際取的log_variance)
  • 用這個分布可以進行采樣(sample)
  • 映射回original image
  1. An encoder module turns the input samples input_img into two parameters in a latent space of representations, z_mean and z_log_variance.
  2. You randomly sample a point z from the latent normal distribution that’s assumed to generate the input image, via z = z\_mean + e^{z\_log\_variance} \times \epsilon, where \epsilon is a random tensor of small values.
  3. A decoder module maps this point in the latent space back to the original input image.

Because epsilon is random, the process ensures that every point that’s close to the latent location where you encoded input_img (z-mean) can be decoded to something similar to input_img, thus forcing the latent space to be continuously meaningful.

  1. 所以VAE生成的圖片是可解釋的,比如在latent space中距離相近的兩點,decode出來的圖片相似度也就很高。
  2. 多用于編輯圖片,并且能生成動畫過程(因為是連續(xù)的)

偽代碼(不算,可以說是骨干代碼):

z_mean, z_log_variance = encoder(input_img)
z = z_mean + exp(z_log_variance) * epsilon  # sampling
reconstructed_img = decoder(z)
model = Model(input_img, reconstructed_img)

VAE encoder network

import keras
from keras import layers
from keras import backend as K

img_shape = (28, 28, 1)
batch_size = 16
latent_dim = 2

input_img = keras.Input(shape=img_shape)   # 書中定義的輸入
x = layers.Conv2D(32, 3, padding='same', activation='relu')(input_img)
x = layers.Conv2D(64, 3, padding='same', activation='relu', strides=(2, 2))(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
shape_before_flattening = K.int_shape(x)
x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

  1. 可見是一個標準的multi-head的網(wǎng)絡(luò)
  2. 可見所謂的latent space,其實就是transforming后的結(jié)果
  3. encode的目的是回歸出兩個參數(shù)(本例是兩個2維參數(shù))
  4. 兩個參數(shù)一個理解為mean, 一個理解為log_variance

decoder過程就是根據(jù)mean和var做隨機采樣(得到z),然后不斷上采樣(Conv2DTranspose)得到形狀與源圖一致的輸出(得到z_decoded)的過程。
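采樣這一步在keras里一般寫成一個Lambda層,接著上面encoder的代碼大致是(示意):

def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
                              mean=0., stddev=1.)
    return z_mean + K.exp(z_log_var) * epsilon    # z = mean + exp(log_var) * epsilon

z = layers.Lambda(sampling)([z_mean, z_log_var])  # 后面再接Conv2DTranspose等組成decoder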

  1. z_decoded(重建出來的圖)跟input_img做BCE loss
  2. 還要加一個regularization loss防止overfitting

此處請看書,演示了自定義的loss。因為keras高度封裝,所以各種在封裝之外的自定義用法尤其值得關(guān)注。比如這里,自定義了loss之后,Model和fit里就不需要傳Y,compile時也不需要傳loss了。

loss是在最后一層layer里計算的,并且通過layer的add_loss方法,把loss和input通知給了network(如果你想知道注入點的話)

使用模型的話,就是在latent space里取一些點(隨機數(shù)或網(wǎng)格點),喂給decoder,觀察decode之后的結(jié)果。

GAN

Generative adversarial network可以創(chuàng)作以假亂真的圖片。通過訓(xùn)練最好的造假者和最好的鑒別者,來讓“創(chuàng)造”出的作品越來越逼近人類的創(chuàng)作。

  • Generator network: Takes as input a random vector (a random point in the latent space), and decodes it into a synthetic image
  • Discriminator network (or adversary): Takes as input an image (real or synthetic), and predicts whether the image came from the training set or was created by the generator network.

deep convolutional GAN (DCGAN)

  • a GAN where the generator and discriminator are deep convnets.
  • In particular, it uses a Conv2DTranspose layer for image upsampling in the generator.

訓(xùn)練生成器是沖著能讓鑒別器盡可能鑒別為真的方向去的:the generator is trained to fool the discriminator。

這句話其實暗含了一個前提,下面會說,就是此時discriminator是確定的。即在確定的鑒別能力下,盡可能去調(diào)整generator的輸出,讓它能通過當前鑒別器的測試。

書中說訓(xùn)練DCGAN很復(fù)雜,而且很多trick,超參靠的是經(jīng)驗而不是理論支撐,摘抄并筆記a bag of tricks如下:

  • We use tanh as the last activation in the generator, instead of sigmoid, which is more commonly found in other types of models.
  • We sample points from the latent space using a normal distribution (Gaussian distribution), not a uniform distribution.
  • Stochasticity is good to induce robustness. Because GAN training results in a dynamic equilibrium, GANs are likely to get stuck in all sorts of ways. Introducing randomness during training helps prevent this. We introduce randomness in two ways:
    • by using dropout in the discriminator
    • and by adding random noise to the labels for the discriminator.
  • Sparse gradients can hinder GAN training. In deep learning, sparsity is often a desirable property, but not in GANs. Two things can induce gradient sparsity: max pooling operations and ReLU activations.
    • Instead of max pooling, we recommend using strided convolutions for downsampling(用步長卷積代替pooling),
    • and we recommend using a LeakyReLU layer instead of a ReLU activation. It’s similar to ReLU, but it relaxes sparsity constraints by allowing small negative activation values.
  • In generated images, it’s common to see checkerboard artifacts(stride和kernel size不匹配造成的) caused by unequal coverage of the pixel space in the generator.
    • To fix this, we use a kernel size that’s divisible by the stride size whenever we use a strided Conv2DTranpose or Conv2D in both the generator and the discriminator.

Train

  1. Draw random points in the latent space (random noise).
  2. Generate images with generator using this random noise.
  3. Mix the generated images with real ones.
  4. Train discriminator using these mixed images, with corresponding targets:
    • either “real” (for the real images) or “fake” (for the generated images).
    • 所以鑒別器是單獨訓(xùn)練的(前面筆記鋪墊過了)
    • 下面就是train整個DCGAN了:
  5. Draw new random points in the latent space.
  6. Train gan using these random vectors, with targets that all say “these are real images.” This updates the weights of the generator (only, because the discriminator is frozen inside gan) to move them toward getting the discriminator to predict “these are real images” for generated images: this trains the generator to fool the discriminator.
    • 只train網(wǎng)絡(luò)里的generator
    • discriminator不訓(xùn)練,因為是要用“已經(jīng)訓(xùn)練到目前程度的”discriminator來做下面的任務(wù)
    • 任務(wù)就是只送入偽造圖,并聲明所有圖都是真的,去讓generator生成能逼近這個聲明的圖
    • generator就是這么訓(xùn)練出來的。
    • 所以實際代碼里,一個迭代步由train一次discriminator和train一次GAN組成(見下面的示意).

因為鑒別器和生成器是一起訓(xùn)練的,因此前幾輪生成的肯定是噪音,但前幾輪鑒別器也是瞎鑒別的。
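把上面的步驟落到代碼,一個迭代步大致是這樣(示意;generator/discriminator/gan按書中定義,real_images是從數(shù)據(jù)集取出的一批真實圖片,latent_dim/batch_size是假設(shè)的取值):

import numpy as np

latent_dim, batch_size = 32, 20

# 1-4:先單獨訓(xùn)練discriminator
random_latent_vectors = np.random.normal(size=(batch_size, latent_dim))
generated_images = generator.predict(random_latent_vectors)
combined_images = np.concatenate([generated_images, real_images])
labels = np.concatenate([np.ones((batch_size, 1)),        # 1 = 偽造
                         np.zeros((batch_size, 1))])       # 0 = 真實
labels += 0.05 * np.random.random(labels.shape)            # 給標簽加隨機噪聲(tricks之一)
d_loss = discriminator.train_on_batch(combined_images, labels)

# 5-6:再訓(xùn)練gan(其中discriminator被freeze),并把目標全部聲明為“真”
random_latent_vectors = np.random.normal(size=(batch_size, latent_dim))
misleading_targets = np.zeros((batch_size, 1))
a_loss = gan.train_on_batch(random_latent_vectors, misleading_targets)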
