(Caffe, LeNet) Backpropagation (Part 6)

This article was migrated from CSDN:
http://blog.csdn.net/mounty_fsc/article/details/51379395

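This part analyzes the Net::Backward() function in Caffe, i.e. the backpropagation computation. It is written from the perspective of the LeNet network, and the network being debugged is the training network, which has 9 layers. For the details of each layer, see section 2 of (Caffe, LeNet) Initializing the Training Network (Part 3).

This part does not cover the theory behind the backpropagation algorithm; everything below assumes some familiarity with it.

1 Entry point

Net::Backward() calls BackwardFromTo, which invokes the Backward of every layer in reverse order, from the last layer of the network to the first.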

void Net<Dtype>::BackwardFromTo(int start, int end) {
  for (int i = start; i >= end; --i) {
    if (layer_need_backward_[i]) {
      layers_[i]->Backward(
          top_vecs_[i], bottom_need_backward_[i], bottom_vecs_[i]);
      if (debug_info_) { BackwardDebugInfo(i); }
    }
  }
}
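
The traversal order can be illustrated in isolation. The following is a minimal sketch, not Caffe code: `ToyLayer` and `BackwardOrder` are hypothetical names introduced here, and the layer is reduced to a name plus a flag that plays the role of `layer_need_backward_`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a Caffe layer: only a name and a flag.
struct ToyLayer {
  std::string name;
  bool need_backward;
};

// Visits layers from index `start` down to `end` (inclusive), skips layers
// that do not need a backward pass, and returns the names in visit order.
std::vector<std::string> BackwardOrder(const std::vector<ToyLayer>& layers,
                                       int start, int end) {
  assert(end >= 0 && start < static_cast<int>(layers.size()));
  std::vector<std::string> visited;
  for (int i = start; i >= end; --i) {
    if (layers[i].need_backward) {
      visited.push_back(layers[i].name);
    }
  }
  return visited;
}
```

Calling `BackwardOrder(layers, layers.size() - 1, 0)` on a list such as data, conv1, loss (with data not needing a backward pass) yields loss first and conv1 second, which is the order in which the sections below walk through LeNet.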

2 Layer 9: SoftmaxWithLossLayer

2.1 Code analysis

The implementation is as follows:

void SoftmaxWithLossLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {

    // bottom_diff shape:64*10
    Dtype* bottom_diff = bottom[0]->mutable_gpu_diff();
    // prob_data shape:64*10
    const Dtype* prob_data = prob_.gpu_data();
    // top_data shape:(1)
    const Dtype* top_data = top[0]->gpu_data();
    // Copy the probabilities predicted by the Softmax layer (prob) into bottom_diff
    caffe_gpu_memcpy(prob_.count() * sizeof(Dtype), prob_data, bottom_diff);
    // label shape:64*1
    const Dtype* label = bottom[1]->gpu_data();
    // dim = 640 / 64 = 10
    const int dim = prob_.count() / outer_num_;
    // nthreads = 64 / 1 = 64
    const int nthreads = outer_num_ * inner_num_;
    // Since this memory is never used for anything else,
    // we use it to avoid allocating new GPU memory.
    Dtype* counts = prob_.mutable_gpu_diff();
    
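    // This kernel subtracts 1 from the entry of bottom_diff (which at this point holds
    // the predicted probability of each class) that belongs to the correct class (label);
    // all other entries are left unchanged. See the derivation in section 2.2.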
    SoftmaxLossBackwardGPU<Dtype><<<CAFFE_GET_BLOCKS(nthreads),
        CAFFE_CUDA_NUM_THREADS>>>(nthreads, top_data, label, bottom_diff,
        outer_num_, dim, inner_num_, has_ignore_label_, ignore_label_, counts);
    // Begin expanded kernel code (simplified from the original)
    __global__ void SoftmaxLossBackwardGPU(...) {
      CUDA_KERNEL_LOOP(index, nthreads) { 
        const int label_value = static_cast<int>(label[index]);
        bottom_diff[index * dim + label_value] -= 1;
        counts[index] = 1;        
      }
    }
    // End expanded kernel code

    Dtype valid_count = -1;
    // Note: this is the weight of the loss (typically 1 or 0), normalized here (divided by 64)
    const Dtype loss_weight = top[0]->cpu_diff()[0] /
                              get_normalizer(normalization_, valid_count);
    caffe_gpu_scal(prob_.count(), loss_weight , bottom_diff);
  
}

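Notes:

SoftmaxWithLossLayer has no learnable parameters (see Forward Computation (Part 5)), so no parameters of this layer need to be adjusted; only bottom_diff has to be computed. In terms of the chain rule of backpropagation, bottom_diff is the derivative of the loss with respect to the output of the previous layer, and it is needed in order to go on and compute the weight updates of that previous layer.
The core of the code above is SoftmaxLossBackwardGPU. This kernel subtracts 1 from the entry of bottom_diff (which at this point holds the predicted probability of each class) that belongs to the correct class (label), and leaves all other entries unchanged. The explanation uses the notation and figures of the previous parts.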

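2.2 Derivation

Notation

Let the input of the SoftmaxWithLoss layer be the vector $\mathbf{z}$, i.e. bottom_blob_data, which is the output of the previous layer. The output after the Softmax computation is the vector $\mathbf{f(z)}$, given by (omitting the normalization constant m) $f(z_k)=\frac{e^{z_k}}{\sum_i^n{e^{z_i}}}$. The final output of the SoftmaxWithLoss layer is $loss=-\log{f(z_y)}$ for each sample (summed over the samples of the batch), where $y$ is the label of the sample. See Forward Computation (Part 5).
Backward derivation

Expanding the loss gives
$$loss=\log\sum_i^n{e^{z_i}}-z_y$$
so $\frac{d loss}{d\mathbf{z}}$ is: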
$$
\frac{\partial loss}{\partial z_i}=
\left\{
\begin{aligned}
& f(z_y)-1, & i = y \\
& f(z_i), & i \ne y
\end{aligned}
\right.
$$
Illustration (figure not available)
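
The result of the derivation can be checked with a small CPU sketch. This is not the Caffe kernel: `SoftmaxLossBackward` is a hypothetical helper written for this note. It assumes a loss weight of 1 and divides by the number of samples, mirroring the division by 64 in the code analysis, and it subtracts the row maximum before exponentiating purely for numerical stability.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// z: row-major scores (num x dim); labels: one class index per sample.
// Returns d(loss)/dz for loss = mean over samples of -log f(z_y).
std::vector<double> SoftmaxLossBackward(const std::vector<double>& z,
                                        const std::vector<int>& labels,
                                        int dim) {
  const int num = static_cast<int>(labels.size());
  assert(num > 0 && dim > 0 && static_cast<int>(z.size()) == num * dim);
  std::vector<double> diff(z.size());
  for (int n = 0; n < num; ++n) {
    const double* row = &z[n * dim];
    // Subtracting the row maximum does not change the softmax value.
    const double m = *std::max_element(row, row + dim);
    double sum = 0.0;
    for (int k = 0; k < dim; ++k) sum += std::exp(row[k] - m);
    for (int k = 0; k < dim; ++k) {
      diff[n * dim + k] = std::exp(row[k] - m) / sum;  // f(z_k)
    }
    assert(labels[n] >= 0 && labels[n] < dim);
    diff[n * dim + labels[n]] -= 1.0;  // f(z_y) - 1 for the correct class
  }
  for (double& d : diff) d /= num;  // loss weight 1, normalized by batch size
  return diff;
}
```

Before the final scaling, each row of the gradient sums to zero, because the probabilities sum to 1 and exactly one entry has 1 subtracted from it.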

3 Layer 8: InnerProduct

3.1 Code analysis

template <typename Dtype>
void InnerProductLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  // Gradient w.r.t. the weights: top_diff^T * bottom_data = blobs_[0] diff
  if (this->param_propagate_down_[0]) {
    const Dtype* top_diff = top[0]->gpu_diff();
    const Dtype* bottom_data = bottom[0]->gpu_data();
    // Gradient with respect to weight
    caffe_gpu_gemm<Dtype>(CblasTrans, CblasNoTrans, N_, K_, M_, (Dtype)1.,
        top_diff, bottom_data, (Dtype)1., this->blobs_[0]->mutable_gpu_diff());
  }

  // Gradient w.r.t. the bias: top_diff^T * bias_multiplier = blobs_[1] diff
  if (bias_term_ && this->param_propagate_down_[1]) {
    const Dtype* top_diff = top[0]->gpu_diff();
    // Gradient with respect to bias
    caffe_gpu_gemv<Dtype>(CblasTrans, M_, N_, (Dtype)1., top_diff,
        bias_multiplier_.gpu_data(), (Dtype)1.,
        this->blobs_[1]->mutable_gpu_diff());
  }
  
  // Gradient w.r.t. the previous layer's output: top_diff * blobs_[0] data = bottom_diff
  if (propagate_down[0]) {
    const Dtype* top_diff = top[0]->gpu_diff();
    // Gradient with respect to bottom data
    caffe_gpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, K_, N_, (Dtype)1.,
        top_diff, this->blobs_[0]->gpu_data(), (Dtype)0.,
        bottom[0]->mutable_gpu_diff());
  }
}

3.2 Derivation

The output of the current layer, ip2, is $\mathbf{z}$, and its input (the output of the previous layer) is $\mathbf{u}$.

1. Partial derivative with respect to the previous layer's output

$\frac{\partial loss}{\partial u_j}$ is stored in the bottom_blob_diff ($64 \times 500$) of the ip2 layer and is computed as follows, where $\frac{\partial loss}{\partial z_k}$ is stored in top_blob_diff ($64 \times 10$):

$$
\frac{\partial z_k}{\partial u_j} = \frac{\partial \sum_j^{500}{w_{kj}u_j}}{\partial u_j}=w_{kj}
$$

$$
\frac{\partial loss}{\partial u_j}=\sum_k^{n=10}{\frac{\partial loss}{\partial z_k}\frac{\partial z_k}{\partial u_j}}=\sum_k^{n=10}{\frac{\partial loss}{\partial z_k}w_{kj}}
$$
In vector form:
$$
\frac{\partial loss}{\partial u_j}=\frac{\partial loss}{\partial \mathbf{z^T}} \cdot \mathbf{w_{j}}
$$
Going further, in matrix form, where $\mathbf{u}$ is 500-dimensional, $\mathbf{z}$ is 10-dimensional and $\mathbf{W}$ is $10 \times 500$:
$$
\frac{\partial loss}{\partial \mathbf{u^T}}=\frac{\partial loss}{\partial \mathbf{z^T}} \cdot \mathbf{W}
$$
Going one step further, since a batch contains 64 samples, the expression can be written as follows, where $\mathbf{U}$ is $64 \times 500$, $\mathbf{Z}$ is $64 \times 10$ and $\mathbf{W}$ is $10 \times 500$:
$$
\frac{\partial loss}{\partial \mathbf{U}}=\frac{\partial loss}{\partial \mathbf{Z}} \cdot \mathbf{W}
$$
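
The batch formula is a plain matrix product. The sketch below spells it out with naive loops; `BottomDiff` is a hypothetical helper, not a Caffe function. It reuses the Caffe dimension names M (batch size), N (number of outputs) and K (number of inputs) from the gemm call in section 3.1 and assumes row-major storage.

```cpp
#include <cassert>
#include <vector>

// Row-major matrices: dZ is M x N, W is N x K, the result dU is M x K.
// Implements dU = dZ * W, i.e. d(loss)/dU = d(loss)/dZ * W.
std::vector<double> BottomDiff(const std::vector<double>& dZ,
                               const std::vector<double>& W,
                               int M, int N, int K) {
  assert(static_cast<int>(dZ.size()) == M * N);
  assert(static_cast<int>(W.size()) == N * K);
  std::vector<double> dU(M * K, 0.0);
  for (int m = 0; m < M; ++m) {      // sample in the batch
    for (int k = 0; k < N; ++k) {    // output unit z_k
      for (int j = 0; j < K; ++j) {  // input unit u_j
        dU[m * K + j] += dZ[m * N + k] * W[k * K + j];
      }
    }
  }
  return dU;
}
```

For ip2 in LeNet the call would use M = 64, N = 10 and K = 500, giving a $64 \times 500$ result that matches the shape of bottom_blob_diff.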

2. Partial derivative with respect to the parameters

$$
\frac{\partial loss}{\partial w_{kj}}=\frac{\partial loss}{\partial z_k}\frac{\partial z_k}{\partial w_{kj}}=\frac{\partial loss}{\partial z_k} u_{j}
$$
In vector form:
$$
\frac{\partial loss}{\partial \mathbf{w_{j}}}=\frac{\partial loss}{\partial \mathbf{z}} u_{j}
$$
Going further, in matrix form, where $\mathbf{W}$ is $10 \times 500$, $\mathbf{z}$ is 10-dimensional and $\mathbf{u}$ is 500-dimensional:
$$
\frac{\partial loss}{\partial \mathbf{W}}=\frac{\partial loss}{\partial \mathbf{z}} \mathbf{u^T}
$$
Going one step further, since a batch contains 64 samples, the expression can be written as follows, where $\mathbf{W}$ is $10 \times 500$, $\mathbf{Z}$ is $64 \times 10$ and $\mathbf{U}$ is $64 \times 500$:
$$
\frac{\partial loss}{\partial \mathbf{W}}=\frac{\partial loss}{\partial \mathbf{Z^T}} \cdot \mathbf{U}
$$
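
The weight gradient can be sketched the same way. `AccumulateWeightDiff` is a hypothetical helper, not a Caffe function. It adds to the existing contents of `dW` rather than overwriting them, which is an assumption modeled on the `(Dtype)1.` beta argument of the weight gemm call in section 3.1.

```cpp
#include <cassert>
#include <vector>

// Row-major matrices: dZ is M x N, U is M x K, dW is N x K.
// Implements dW += dZ^T * U, i.e. the sum over the batch of dz_k * u_j.
void AccumulateWeightDiff(const std::vector<double>& dZ,
                          const std::vector<double>& U,
                          int M, int N, int K,
                          std::vector<double>* dW) {
  assert(dW != nullptr);
  assert(static_cast<int>(dZ.size()) == M * N);
  assert(static_cast<int>(U.size()) == M * K);
  assert(static_cast<int>(dW->size()) == N * K);
  for (int m = 0; m < M; ++m) {      // sample in the batch
    for (int k = 0; k < N; ++k) {    // output unit z_k
      for (int j = 0; j < K; ++j) {  // input unit u_j
        (*dW)[k * K + j] += dZ[m * N + k] * U[m * K + j];
      }
    }
  }
}
```

Because the helper accumulates, calling it twice with the same inputs doubles the stored gradient, so a caller following this sketch would zero `dW` before each new iteration.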

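4 Layer 7: ReLU

4.1 Code analysis

The CPU code is analyzed below. Note that this layer has no parameters, so only the derivative with respect to the input is needed.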

void ReLULayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[0]) {
    const Dtype* bottom_data = bottom[0]->cpu_data();
    const Dtype* top_diff = top[0]->cpu_diff();
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    const int count = bottom[0]->count();

    // See the derivation in section 4.2
    Dtype negative_slope = this->layer_param_.relu_param().negative_slope();
    for (int i = 0; i < count; ++i) {
      bottom_diff[i] = top_diff[i] * ((bottom_data[i] > 0)
          + negative_slope * (bottom_data[i] <= 0));
    }
  }
}

4.2 Derivation

Let the input vector be $\mathbf{bottom\_data}$ and the output vector be $\mathbf{top\_data}$. The ReLU layer computes
$$top\_data_i=
\left\{
\begin{aligned}
& bottom\_data_i & bottom\_data_i \gt 0 \\
& bottom\_data_i \cdot slope & bottom\_data_i \le 0
\end{aligned}
\right.
$$
So the partial derivative of the loss with respect to the input is:
$$
\frac{\partial loss}{\partial bottom\_data_i}=\frac{\partial loss}{\partial top\_data_i} \cdot \frac{\partial top\_data_i}{\partial bottom\_data_i}
= \left\{
\begin{aligned}
& top\_diff_i & bottom\_data_i \gt 0 \\
& top\_diff_i \cdot slope & bottom\_data_i \le 0
\end{aligned}
\right.
$$
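
The piecewise formula translates directly into a loop. This is a minimal standalone sketch using `std::vector` instead of Caffe blobs; `ReluBackward` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns bottom_diff for a (leaky) ReLU: top_diff is passed through where
// the input was positive and scaled by negative_slope everywhere else.
std::vector<double> ReluBackward(const std::vector<double>& bottom_data,
                                 const std::vector<double>& top_diff,
                                 double negative_slope) {
  assert(bottom_data.size() == top_diff.size());
  std::vector<double> bottom_diff(bottom_data.size());
  for (std::size_t i = 0; i < bottom_data.size(); ++i) {
    const double local_grad = bottom_data[i] > 0 ? 1.0 : negative_slope;
    bottom_diff[i] = top_diff[i] * local_grad;
  }
  return bottom_diff;
}
```

With `negative_slope` set to 0 this is the standard ReLU, and the gradient is simply blocked wherever the input was not positive.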

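5 Layer 5: Pooling

5.1 Code analysis

The CPU code for max pooling is analyzed below. Note that this layer has no parameters, so only the derivative with respect to the input is needed.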

void PoolingLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {

  const Dtype* top_diff = top[0]->cpu_diff();
  Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
  // Initialize bottom_diff to 0
  caffe_set(bottom[0]->count(), Dtype(0), bottom_diff);
  const int* mask = NULL;  // suppress warnings about uninitialized variables
 
  ...
    // During the forward pass, max_idx_ recorded, for every point of top_data, the position
    // within the feature map of the bottom_data point it was taken from
    mask = max_idx_.cpu_data();
    // Main loop: visit every point of top_data in (N,C,H,W) order
    for (int n = 0; n < top[0]->num(); ++n) {
      for (int c = 0; c < channels_; ++c) {
        for (int ph = 0; ph < pooled_height_; ++ph) {
          for (int pw = 0; pw < pooled_width_; ++pw) {
            const int index = ph * pooled_width_ + pw;
            const int bottom_index = mask[index];
            // See the derivation in section 5.2
            bottom_diff[bottom_index] += top_diff[index];
          }
        }
        bottom_diff += bottom[0]->offset(0, 1);
        top_diff += top[0]->offset(0, 1);
        mask += top[0]->offset(0, 1);
      
      }
    }

}

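5.2 Derivation

Max pooling is a nonlinear transformation, but the relationship between its input and output can be expressed linearly as $top\_data_i=bottom\_data_j$, which is why the forward pass has to record the mapping max_idx_ from index i to index j.
By the chain rule:
$$
bottom\_diff_j = \frac{\partial loss}{\partial bottom\_data_j}=\frac{\partial loss}{\partial top\_data_i} \cdot \frac{\partial top\_data_i}{\partial bottom\_data_j} = top\_diff_i \cdot 1 \quad \text{(mind the subscripts)}
$$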
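
The role of the recorded index mapping is easiest to see on a single feature map. The sketch below is not the Caffe implementation: `PoolResult`, `MaxPoolForward` and `MaxPoolBackward` are hypothetical names, and the window is assumed to be 2x2 with stride 2 and no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PoolResult {
  std::vector<double> top;  // pooled values
  std::vector<int> mask;    // for each top index i, the bottom index j of the max
};

// 2x2, stride-2 max pooling over one row-major H x W feature map.
PoolResult MaxPoolForward(const std::vector<double>& bottom, int H, int W) {
  assert(H > 0 && W > 0 && H % 2 == 0 && W % 2 == 0);
  assert(static_cast<int>(bottom.size()) == H * W);
  const int PH = H / 2, PW = W / 2;
  PoolResult r;
  r.top.resize(PH * PW);
  r.mask.resize(PH * PW);
  for (int ph = 0; ph < PH; ++ph) {
    for (int pw = 0; pw < PW; ++pw) {
      int best = (2 * ph) * W + 2 * pw;  // top-left corner of the window
      for (int dh = 0; dh < 2; ++dh) {
        for (int dw = 0; dw < 2; ++dw) {
          const int idx = (2 * ph + dh) * W + (2 * pw + dw);
          if (bottom[idx] > bottom[best]) best = idx;
        }
      }
      r.top[ph * PW + pw] = bottom[best];
      r.mask[ph * PW + pw] = best;
    }
  }
  return r;
}

// Routes each top_diff[i] back to the bottom position recorded in mask[i].
std::vector<double> MaxPoolBackward(const std::vector<double>& top_diff,
                                    const std::vector<int>& mask,
                                    int bottom_count) {
  assert(top_diff.size() == mask.size());
  std::vector<double> bottom_diff(bottom_count, 0.0);  // start from zero
  for (std::size_t i = 0; i < top_diff.size(); ++i) {
    assert(mask[i] >= 0 && mask[i] < bottom_count);
    bottom_diff[mask[i]] += top_diff[i];
  }
  return bottom_diff;
}
```

Only the positions that won the max in the forward pass receive a gradient; every other entry of `bottom_diff` stays at its initial value of 0.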

6 Layer 4: Convolution


void ConvolutionLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  Dtype* weight_diff = this->blobs_[0]->mutable_cpu_diff();
  for (int i = 0; i < top.size(); ++i) {
    const Dtype* top_diff = top[i]->cpu_diff();
    const Dtype* bottom_data = bottom[i]->cpu_data();
    Dtype* bottom_diff = bottom[i]->mutable_cpu_diff();
    // Bias gradient, if necessary.
    if (this->bias_term_ && this->param_propagate_down_[1]) {
      Dtype* bias_diff = this->blobs_[1]->mutable_cpu_diff();
      // For each sample in the batch, compute the gradient w.r.t. the bias
      for (int n = 0; n < this->num_; ++n) {
        this->backward_cpu_bias(bias_diff, top_diff + n * this->top_dim_);
      }
    }
    if (this->param_propagate_down_[0] || propagate_down[i]) {
      // For each sample in the batch. The weight and input gradient code below has the
      // helper functions expanded inline (not runnable as-is)
      for (int n = 0; n < this->num_; ++n) {
        
        // gradient w.r.t. weight. Note that we will accumulate diffs.
        //top_diff(50*64) * bottom_data(500*64,Transpose) = weight_diff(50*500)
        caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans, conv_out_channels_ / group_,
          kernel_dim_, conv_out_spatial_dim_,
          (Dtype)1., top_diff + n * this->top_dim_, bottom_data + n * this->bottom_dim_,
          (Dtype)1., weight_diff);

        // gradient w.r.t. bottom data, if necessary.
        // weight(50*500,Transpose) * top_diff(50*64) = bottom_diff(500*64)
        caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, kernel_dim_,
          conv_out_spatial_dim_, conv_out_channels_ ,
          (Dtype)1., weight, top_diff + n * this->top_dim_,
          (Dtype)0., bottom_diff + n * this->bottom_dim_);
        
      }
    }
  }
}

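Notes:

The bottom of layer 4 has dimensions $(N,C,H,W)=(64,20,12,12)$ and the top has dimensions $(N,C,H,W)=(64,50,8,8)$. Because every sample is processed separately, only the $(C,H,W)$ dimensions matter: $(20,12,12)$ and $(50,8,8)$ respectively.
According to (Caffe) Implementation of Convolution, this layer can be written as the matrix product $Weight\_data \times Bottom\_data^T = Top\_data$.
$Weight\_data$ has dimensions $C_{out} \times (C \cdot K \cdot K)=50 \times 500$.
$Bottom\_data$ has dimensions $(H \cdot W) \times (C \cdot K \cdot K)=64 \times 500$, where $64$ is the $8 \times 8$ positions of the convolution kernel and $500=C \cdot K \cdot K=20 \times 5 \times 5$.
$Top\_data$ has dimensions $50 \times 64$, which matches the top_diff(50*64) in the code comments.
Written in matrix form, the layer is in a sense the same as a fully connected layer (which is also expressed as a matrix product), so the derivation for the fully connected layer can be reused.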
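
The dimensions quoted above can be recomputed from the layer shape. In the sketch below, `ConvGemmDims` and `ComputeConvGemmDims` are hypothetical names; the default stride of 1 and padding of 0 are assumptions chosen because they are consistent with a $12 \times 12$ input producing an $8 \times 8$ output for a $5 \times 5$ kernel.

```cpp
#include <cassert>

// Shapes of the matrices in the GEMM view of one convolution, per sample.
struct ConvGemmDims {
  int out_h;            // output height
  int out_w;            // output width
  int weight_rows;      // C_out
  int kernel_dim;       // C_in * K * K
  int out_spatial_dim;  // out_h * out_w
};

ConvGemmDims ComputeConvGemmDims(int c_in, int h, int w, int c_out,
                                 int kernel, int stride = 1, int pad = 0) {
  assert(c_in > 0 && c_out > 0 && kernel > 0 && stride > 0 && pad >= 0);
  assert(h + 2 * pad >= kernel && w + 2 * pad >= kernel);
  ConvGemmDims d;
  d.out_h = (h + 2 * pad - kernel) / stride + 1;
  d.out_w = (w + 2 * pad - kernel) / stride + 1;
  d.weight_rows = c_out;
  d.kernel_dim = c_in * kernel * kernel;
  d.out_spatial_dim = d.out_h * d.out_w;
  return d;
}
```

For layer 4, `ComputeConvGemmDims(20, 12, 12, 50, 5)` gives a weight matrix of $50 \times 500$ and 64 output positions, so the per-sample product is $(50 \times 500) \cdot (500 \times 64) = 50 \times 64$.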

Last edited: 2017.12.03 06:49:15
