RePr: Improved Training of Convolutional Filters (Translation, Part 2)

RePr: Improved Training of Convolutional Filters (Translation, Part 1)

6. Ablation study


Comparison of pruning criteria We measure the correlation of our metric with the Oracle to answer the question: how good a substitute is our metric for the filter-importance ranking? The Pearson correlation of our metric, henceforth referred to as Ortho, with the Oracle is 0.38. This is not a strong correlation; however, when we compare it with other known metrics, it is the closest. Molchanov et al. [9] report a Spearman correlation of their criterion (Taylor) with the greedy Oracle of 0.73. We observed similar numbers for the Taylor ranking during the early epochs, but the correlation diminished significantly as the models converged. This is due to the low gradient values from filters that have converged. The Taylor metric is a product of the activation and the gradient. High gradients correlate with important filters during the early phases of learning, but when models converge, low gradients do not necessarily mean less salient weights. It could be that the filter has already converged to a useful feature that is not contributing to the overall error of the model, or is stuck at a saddle point. With the norm of activations, the relationship is reversed. Thus, by multiplying the two terms together, the hope is to achieve a balance. But our experiments show that in a fully converged model, low gradients dominate high activations. Therefore, the Taylor term will have lower values as the models converge and will no longer be correlated with the inefficient filters. While the correlation of the values denotes how well the metric substitutes for predicting the accuracy, it is more important to measure the correlation of the rank of the filters. The correlation of the values and of the ranks may not be the same, and the correlation with the rank is the more meaningful measurement for determining the weaker filters. Ortho has a correlation of 0.58 against the Oracle when measured over the rank of the filters. Other metrics show very poor correlation using the rank. Figure 3 (Left and Center) shows the correlation plot for various metrics with the Oracle. The table on the right of Figure 3 presents the test accuracy on CIFAR-10 of various ranking metrics. From the table, it is evident that Orthogonality ranking leads to a significant boost of accuracy compared to standard training and other ranking criteria.
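
To make the distinction between value correlation and rank correlation concrete, here is a minimal NumPy sketch with made-up per-filter scores (not the paper's data): Spearman rank correlation is simply Pearson correlation computed on the rank-transformed scores.

```python
# Sketch: comparing a hypothetical pruning metric against greedy-Oracle scores.
import numpy as np

rng = np.random.default_rng(0)
oracle = rng.normal(size=128)                  # e.g. accuracy drop when each filter is removed
metric = 0.4 * oracle + rng.normal(size=128)   # stand-in for a ranking metric such as Ortho

def ranks(x):
    """Rank transform: the smallest value gets rank 0."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

value_corr = np.corrcoef(metric, oracle)[0, 1]                # Pearson on the raw values
rank_corr = np.corrcoef(ranks(metric), ranks(oracle))[0, 1]   # Spearman = Pearson on the ranks

print(f"value correlation: {value_corr:.2f}   rank correlation: {rank_corr:.2f}")
```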


Percentage of filters pruned One of the key factors in our training scheme is the percentage of filters to prune at each pruning phase (p%). It behaves like the Dropout parameter, and it impacts the training time and generalization ability of the model (see Figure 4). In general, the higher the pruned percentage, the better the performance; beyond 30%, however, the gains are not significant. Up to 50%, the model seems to recover from the dropping of filters. Beyond that, training is not stable, and sometimes the model fails to converge.
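
As a concrete illustration, here is a minimal sketch (made-up scores and layer sizes, and a hypothetical helper, not the authors' code) of how a global pruning fraction translates into a set of filters to drop across the network.

```python
# Sketch: selecting the lowest-ranked fraction p of filters across the whole network.
# `scores` maps layer name -> per-filter importance (hypothetical values here).
import torch

def filters_to_drop(scores, p=0.3):
    """Return {layer: indices of filters whose score falls in the global bottom p fraction}."""
    flat = torch.cat([s for s in scores.values()])
    k = int(p * flat.numel())
    threshold = torch.kthvalue(flat, k).values     # k-th smallest score overall
    return {name: (s <= threshold).nonzero(as_tuple=True)[0] for name, s in scores.items()}

scores = {"conv1": torch.rand(32), "conv2": torch.rand(32), "conv3": torch.rand(32)}
drop = filters_to_drop(scores, p=0.3)
print({name: idx.numel() for name, idx in drop.items()})   # how many filters each layer loses
```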


Number of RePr iterations Our experiments suggest that each repeat of the RePr process has diminishing returns, and therefore it should be limited to a single-digit number (see Figure 4 (Right)). Similar to Dense-Sparse-Dense [18] and Born-Again-Networks [20], we observe that for most networks, two to three iterations are sufficient to achieve the maximum benefit.
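
Putting the pieces together, the overall schedule can be summarized as below; the callables are hypothetical stand-ins for the steps described in the text (and in Part 1 of this translation), not the authors' implementation.

```python
# Sketch of the overall RePr schedule: N iterations of (S1 epochs of standard training,
# drop the weakest filters, S2 epochs of sub-network training, re-introduce the filters).
from typing import Callable, Sequence

def repr_schedule(train: Callable[[int], None],
                  rank_filters: Callable[[], Sequence[int]],
                  train_masked: Callable[[Sequence[int], int], None],
                  reinit: Callable[[Sequence[int]], None],
                  S1: int, S2: int, N: int = 3) -> None:
    for _ in range(N):
        train(S1)                   # standard training of the full network
        dropped = rank_filters()    # least important filters (e.g. by inter-filter orthogonality)
        train_masked(dropped, S2)   # train the pruned sub-network with those filters held at zero
        reinit(dropped)             # re-introduce the dropped filters, re-initialized
    train(S1)                       # training then continues in the standard way
```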


Optimizer and S1/S2 Figure 5 (left) shows the variance in improvement when using different optimizers. Our model works well with most well-known optimizers. Adam and Momentum perform better than SGD due to their added stability in training. We experimented with various values of S1 and S2, and there is not much difference if each of them is large enough for the model to converge temporarily.
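
For reference, the optimizer settings compared in Figure 5 correspond to standard constructions like the following; the learning rates here are illustrative placeholders, not the paper's values.

```python
# Illustrative optimizer setups for the comparison above (learning rates are placeholders).
import torch

params = [torch.nn.Parameter(torch.randn(8, 8))]
optimizers = {
    "SGD": torch.optim.SGD(params, lr=0.01),
    "Momentum": torch.optim.SGD(params, lr=0.01, momentum=0.9),
    "Adam": torch.optim.Adam(params, lr=0.001),
}
```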


image

Figure 5: Left: Impact of using various optimizers on the RePr training scheme. Right: Results from using different S1/S2 values. For clarity, these experiments only show results with …

image

Learning Rate Schedules SGD with a fixed learning rate does not typically produce optimal model performance. Instead, gradually annealing the learning rate over the course of training is known to produce models with higher test accuracy. State-of-the-art results on ResNet, DenseNet, and Inception were all reported with a predetermined learning rate schedule. However, the selection of the exact learning rate schedule is itself a hyperparameter, one which needs to be specifically tuned for each model. Cyclical learning rates [52] can provide stronger performance without exhaustive tuning of a precise learning rate schedule. Figure 6 shows the comparison of our training technique when applied in conjunction with a fixed learning rate schedule and with a cyclical learning rate. Our training scheme is not impacted by using these schemes, and the improvements over standard training are still apparent.
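
As an illustration, both kinds of schedule are available off the shelf; the PyTorch sketch below sets up a fixed step schedule and a cyclical learning rate [52]. The values are illustrative, not the paper's, and in practice only one of the two schedulers would be used.

```python
# Sketch: a predetermined step schedule vs. a cyclical learning rate.
import torch

model = torch.nn.Conv2d(3, 32, 3)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Predetermined schedule: decay the LR by 10x at fixed epochs.
step_sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30, 60], gamma=0.1)

# Cyclical learning rate: LR oscillates between base_lr and max_lr.
cyclic_sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-3, max_lr=0.1, step_size_up=2000, mode="triangular")
```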


Impact of Dropout Dropout, while commonly applied in Multilayer Perceptrons, is typically not used for ConvNets. Our technique can be viewed as a type of nonrandom Dropout, specifically applicable to ConvNets. Unlike standard Dropout, our method acts on entire filters, rather than individual weights, and is applied only during select stages of training, rather than in every training step. Dropout prevents overfitting by discouraging the co-adaptation of weights. This is effective in the case of overparameterized models, but in compact or shallow models, Dropout may needlessly reduce already limited model capacity.
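
The contrast can be made concrete with a small sketch: standard Dropout zeroes individual activations at random, whereas filter-level dropping zeroes entire channels chosen by a ranking. The dropped indices below are arbitrary placeholders.

```python
# Sketch: element-wise Dropout vs. dropping whole filters (channels).
import torch

x = torch.randn(8, 32, 16, 16)            # (batch, channels, H, W) feature map

elementwise = torch.nn.functional.dropout(x, p=0.5, training=True)  # standard Dropout

dropped = torch.tensor([3, 7, 19])        # filters selected by a ranking, not at random
mask = torch.ones(32)
mask[dropped] = 0.0
filterwise = x * mask.view(1, 32, 1, 1)   # zero the entire output of those filters
```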


image

Figure 7: Test accuracy of a three-layer ConvNet with 32 filters each over 100 epochs, using the standard scheme, RePr with Oracle, and RePr with Ortho on CIFAR-10. Left: with Dropout of 0.5. Right: no Dropout.


image

7. Orthogonality and Distillation


Our method, RePr, and Knowledge Distillation (KD) are both techniques to improve the performance of compact models. RePr reduces the overlap of filter representations, and KD distills the information from a larger network. We present a brief comparison of the techniques and show that they can be combined to achieve even better performance. RePr repetitively drops the filters whose weight directions overlap the most, using the inter-filter orthogonality shown in Equation 2. Therefore, we expect this value to gradually reduce over time during training. Figure 8 (left) shows the sum of this value over the entire network for three training schemes. We show RePr with two different filter ranking criteria: Ortho and Oracle. It is not surprising that the RePr training scheme with Ortho ranking has the lowest Ortho sum, but it is surprising that RePr training with Oracle ranking also reduces the filter overlap compared to standard training. Once the model starts to converge, the least important filters based on the Oracle ranking are the ones with the most overlap. And dropping these filters leads to better test accuracy (table on the right of Figure 3). Does this improvement come from the same source as that due to Knowledge Distillation? Knowledge Distillation (KD) is a well-proven methodology to train compact models. Using soft logits from the teacher and the ground-truth signal, the model converges to better optima compared to standard training. If we apply KD to the same three experiments (see Figure 8, right), we see that all the models have significantly larger Ortho sums. Even the RePr (Ortho) model struggles to lower the sum, as the model is strongly guided to converge to a specific solution. This suggests that the improvement due to KD does not come from reducing filter overlap. Therefore, a model which uses both techniques should benefit from even better generalization. Indeed, that is the case, as the combined model has significantly better performance than either of the individual models, as shown in Table 2.
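
For concreteness, a minimal sketch of an inter-filter orthogonality score of this kind is given below. It follows the spirit of Equation 2 (unit-normalized, flattened filters and off-diagonal overlaps) but omits the paper's exact normalization constant, so treat it as an approximation rather than the authors' implementation.

```python
# Sketch: off-diagonal overlap between the normalized, flattened filters of one layer.
import torch

def ortho_scores(conv_weight: torch.Tensor) -> torch.Tensor:
    """conv_weight: (out_channels, in_channels, kH, kW) -> one overlap score per filter."""
    W = conv_weight.flatten(start_dim=1)                   # one row per filter
    W = W / W.norm(dim=1, keepdim=True).clamp_min(1e-8)    # unit-normalize each filter
    P = (W @ W.t() - torch.eye(W.size(0))).abs()           # pairwise overlaps, diagonal removed
    return P.sum(dim=1)                                    # row sum: overlap with the other filters

conv = torch.nn.Conv2d(16, 32, kernel_size=3)
per_filter = ortho_scores(conv.weight.detach())
layer_ortho_sum = per_filter.sum()   # this layer's contribution to the Ortho-sum curve in Figure 8
```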


image

Figure 8: Comparison of orthogonality of filters (Ortho-sum, Eq. 2) in standard training and RePr training, with and without Knowledge Distillation. A lower value signifies less overlapping filters. Dashed vertical lines denote filter dropping.


8. Results


We present the performance of our training scheme, RePr, with our ranking criterion, inter-filter orthogonality (Ortho), on different ConvNets [53, 1, 29, 54, 31]. For all the results provided, the RePr parameters are the same fixed values of S1, S2, and p%, with three iterations, N = 3.

我們?cè)诓煌腃onvNets [53,1,29,54,31]上展示了我們的訓(xùn)練計(jì)劃RePr的性能,我們的排名標(biāo)準(zhǔn),中間正交性盗舰,Ortho晶府。對(duì)于所有提供的結(jié)果,RePr參數(shù)為:
image

钻趋,
image

川陆,
image

,以及三次迭代
image

蛮位。

image

*Figure 9: Accuracy improvement using RePr over standard training on vanilla ConvNets across many layered networks.*


image

We compare our training scheme with other similar schemes like BAN and DSD in Table 3. All three schemes were trained for three iterations, i.e. N = 3. All models were trained for 150 epochs with similar learning rate schedules and initialization. DSD and RePr (Weights) perform roughly the same function: sparsifying the model guided by magnitude, with the difference that DSD acts on individual weights, while RePr (Weights) acts on entire filters. Thus, we observe similar performance between these techniques. RePr (Ortho) outperforms the other techniques and is significantly cheaper to train compared to BAN, which requires N full training cycles.
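
To make the distinction concrete, the sketch below (illustrative 30% fraction, not a value from the paper) contrasts DSD-style per-weight sparsification with a filter-level magnitude ranking of the kind used by RePr (Weights).

```python
# Sketch: per-weight sparsification vs. per-filter magnitude ranking on one conv layer.
import torch

w = torch.nn.Conv2d(16, 32, kernel_size=3).weight.detach()

# DSD-like: zero out the 30% smallest individual weights.
k = int(0.3 * w.numel())
cutoff = w.abs().flatten().kthvalue(k).values
weight_mask = (w.abs() > cutoff).float()                  # 0 where a weight is pruned

# RePr (Weights)-like: rank whole filters by the norm of their weights, drop the weakest.
filter_norms = w.flatten(start_dim=1).norm(dim=1)         # one value per filter
weakest = filter_norms.argsort()[: int(0.3 * w.size(0))]  # indices of filters to drop
```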


Compared to modern architectures, vanilla ConvNets show significantly more inefficiency in the allocation of their feature representations. Thus, we find larger improvements from our method when applied to vanilla ConvNets, as compared to modern architectures. Table 4 shows test errors on CIFAR-10 and CIFAR-100. Vanilla CNNs with 32 filters each have high error compared to DenseNet or ResNet, but their inference time is significantly faster. RePr training improves the relative accuracy of vanilla CNNs by 8% on CIFAR-10 and 25% on CIFAR-100. The performance of the baseline DenseNet and ResNet models is still better than vanilla CNNs trained with RePr, but these models incur more than twice the inference cost. For comparison, we also consider a reduced DenseNet model with only 5 layers, which has a similar inference time to the 3-layer vanilla ConvNet. This model has many fewer parameters (by a factor of …) than the vanilla ConvNet, leading to significantly higher error rates, but we choose to equalize inference time rather than parameter count, due to the importance of inference time in many practical applications. Figure 9 shows more results on vanilla CNNs of varying depth. Vanilla CNNs start to overfit the data, as most filters converge to similar representations. Our training scheme forces them to be different, which reduces the overfitting (Figure 4, right). This is evident in the larger test error of the 18-layer vanilla CNN on CIFAR-10 compared to the 3-layer CNN. With RePr training, the 18-layer model shows lower test error.
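
For reference, a vanilla ConvNet of the kind discussed here can be sketched as follows. Only the "three conv layers of 32 filters each" part comes from the text; the kernel sizes, pooling, and classifier head are assumptions, not the paper's exact model.

```python
# Sketch of a 3-layer vanilla ConvNet with 32 filters per layer for CIFAR-10.
import torch.nn as nn

vanilla_cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),                        # 10-way CIFAR-10 classifier
)
```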


image

RePr is also able to improve the performance of ResNet and shallow DenseNet. This improvement is larger on CIFAR-100, which is a 100-class classification and thus a harder task that requires more specialized filters. Similarly, our training scheme shows a bigger relative improvement on ImageNet, a 1000-way classification problem. Table 5 presents top-1 test error on ImageNet [55] of various ConvNets trained using standard training and with RePr. RePr was applied three times (N = 3), and the table shows errors after each round. We have attempted to replicate the results of the known models as closely as possible with the suggested hyper-parameters, and are within a small margin of the reported results. More details of the training and hyper-parameters are provided in the supplementary material. Each subsequent RePr leads to improved performance with significantly diminishing returns. The improvement is more distinct in architectures which do not have skip connections, like Inception v1 and VGG, and which have lower baseline performance.


Our training scheme also improves other computer vision tasks that use similar ConvNets. We present a small sample of results from visual question answering and object detection tasks. Both of these tasks involve using ConvNets to extract features, and RePr improves their baseline results.


Visual Question Answering In the domain of visual question answering (VQA), a model is provided with an image and a question (as text) about that image, and must produce an answer to that question. Most of the models that solve this problem use standard ConvNets to extract image features and an LSTM network to extract text features. These features are then fed to a third model, which learns to select the correct answer as a classification problem. State-of-the-art models use an attention layer and intricate mappings between features. We experimented with a more standard model where image features and language features are fed to a Multi-layer Perceptron with a softmax layer at the end that does 1000-way classification over candidate answers. Table 6 provides accuracy on VQAv1 using the VQA-LSTM-CNN model [56]. Results are reported for Open-Ended questions, which is a harder task compared to multiple-choice questions. We extract image features from Inception-v1, trained using standard training and with RePr (Ortho) training, and then feed these image features and the language embeddings (GloVe vectors) from the question to a two-layer fully connected network. Thus, the only difference between the two results reported in Table 6 is the training methodology of Inception-v1.
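
A minimal sketch of this kind of fusion model is shown below; the feature dimensions and hidden size are illustrative assumptions, not the exact VQA-LSTM-CNN configuration.

```python
# Sketch: image features + question embeddings fed to a 2-layer MLP over 1000 answers.
import torch
import torch.nn as nn

image_feat = torch.randn(16, 1024)     # e.g. pooled ConvNet (Inception-v1-style) features
question_feat = torch.randn(16, 300)   # e.g. averaged GloVe embeddings of the question

vqa_head = nn.Sequential(
    nn.Linear(1024 + 300, 1024), nn.ReLU(),
    nn.Linear(1024, 1000),             # 1000-way classification over candidate answers
)
logits = vqa_head(torch.cat([image_feat, question_feat], dim=1))
```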


image
image

Object Detection For object detection, we experimented with Faster R-CNN using ResNet 50 and 101 pretrained on ImageNet. We experimented with both a Feature Pyramid Network and a baseline RPN with the C4 conv layer. We use the model structure from Tensorpack [57], which is able to reproduce the reported mAP scores. The model was trained on the 'trainval35k + minival' split of the COCO dataset (2014). Mean Average Precision (mAP) is calculated at ten IoU thresholds from 0.5 to 0.95. The mAP for boxes obtained with standard training and with RePr training is shown in Table 7.


image

9. Conclusion


We have introduced RePr, a training paradigm which cyclically drops and relearns some percentage of the least expressive filters. After dropping these filters, the pruned sub-model is able to recapture the lost features using the remaining parameters, allowing a more robust and efficient allocation of model capacity once the filters are reintroduced. We show that a reduced model needs training before re-introducing the filters, and careful selection of this training duration leads to substantial gains. We also demonstrate that this process can be repeated with diminishing returns.


Motivated by prior research which highlights inefficiencies in the feature representations learned by convolutional neural networks, we further introduce a novel inter-filter orthogonality metric for ranking filter importance for the purpose of RePr training, and demonstrate that this metric outperforms established ranking metrics. Our training method is able to significantly improve performance in under-parameterized networks by ensuring the efficient use of limited capacity, and the performance gains are complementary to knowledge distillation. Even in the case of complex, over-parameterized network architectures, our method is able to improve performance across a variety of tasks.


10. Acknowledgement


The first author would like to thank NVIDIA and Google for donating hardware resources partially used for this research. He would also like to thank Nick Moran, Solomon Garber and Ryan Marcus for helpful comments.


References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[2] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[3] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[5] Michael Cogswell, Faruk Ahmed, Ross B. Girshick, C. Lawrence Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. ICLR, abs/1511.06068, 2016.

[6] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, abs/1608.08710, 2017.

[7] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.

[8] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. JETC, 2017.

[9] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. ICLR, abs/1611.06440, 2017.

[10] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.

[11] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.

[12] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. ICLR, abs/1510.00149, 2016.

[13] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. NIPS Workshop on Machine Learning of Phones and other Consumer Devices, abs/1710.01878, 2017.

[14] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Training pruned neural networks. CoRR, abs/1803.03635, 2018.

[15] Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks. Neural Networks: the official journal of the International Neural Network Society, 71, 2015.

[16] Li Wan, Matthew D. Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.

[17] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.

[18] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, and William J. Dally. DSD: Dense-sparse-dense training for deep neural networks. 2016.

[19] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.

[20] Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In ICML, 2018.

[21] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. ICLR, abs/1412.6550, 2015.

[22] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, 1989.

[23] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, 1992.

[24] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016.

[25] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan L. Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. CoRR, abs/1712.00559, 2017.

[26] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.

[27] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.

[28] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[30] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, 2017.

[31] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. CVPR, 2017.

[32] Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. CoRR, abs/1803.06959, 2017.

[33] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In ICML, 2014.

[34] Pau Rodríguez, Jordi Gonzàlez, Guillem Cucurull, Josep M. Gonfaus, and F. Xavier Roca. Regularizing CNNs with locally constrained decorrelations. ICLR, abs/1611.01967, 2017.

[35] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. ICLR, abs/1609.07093, 2017.

[36] Ben Poole, Jascha Sohl-Dickstein, and Surya Ganguli. Analyzing noise in autoencoders and deep networks. NIPS Workshop on Deep Learning, abs/1406.1831, 2013.

[37] Pengtao Xie, Barnabás Póczos, and Eric P. Xing. Near-orthogonality regularization in kernel methods. In UAI, 2017.

[38] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[39] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML, 2016.

[40] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, abs/1312.6120, 2014.

[41] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML, 2018.

[42] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Christopher Joseph Pal. On orthogonality and learning recurrent networks with long term dependencies. In ICML, 2017.

[43] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4), 1936.

[44] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E. Hopcroft. Convergent learning: Do different neural networks learn the same representations? In ICLR, 2016.

[45] David R. Hardoon, Sándor Szedmák, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 2004.

[46] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, abs/1412.6980, 2015.

[47] T. Tieleman and G. Hinton. Lecture 6.5, RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[48] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[49] Xiaoliang Dai, Hongxu Yin, and Niraj K. Jha. NeST: A neural network synthesis tool based on a grow-and-prune paradigm. CoRR, abs/1711.02017, 2017.

[50] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015.

[51] Maithra Raghu, Ben Poole, Jon M. Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In ICML, 2017.

[52] Leslie N. Smith. Cyclical learning rates for training neural networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.

[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[54] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[55] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

[56] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

[57] Yuxin Wu et al. Tensorpack. https://github.com/tensorpack/, 2016.

Appendix


ImageNet Training Details


Training large models such as ResNet, VGG or Inception (as discussed in Table 4) can be difficult, and models may not always converge to similar optima across training runs. With our RePr training scheme, we observed that large values of p% can sometimes produce collapse upon reintroduction of the dropped filters. On analysis, we found that this was due to large random activations from the newly initialized filters. This can be overcome by initializing the new filters with relatively small values.


Another trick that minimizes this problem is to also reinitialize the corresponding kernels of the next layer for a given filter. Consider a filter f at layer ℓ. The activations from this filter f become input to one kernel of every filter of the next layer, ℓ+1. If the filter f is pruned and then re-initialized, then all those kernels in layer ℓ+1 should also be initialized to small random values, as the features they had learned to process no longer exist. This prevents the new activations of these kernels (which are currently random) from dominating the activations from other kernels.
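
A minimal PyTorch sketch of this re-initialization trick is given below; `reinit_filter` is a hypothetical helper and the standard deviation is an illustrative placeholder, not a value from the paper.

```python
# Sketch: when filter f of layer l is re-introduced, give it small random weights and
# also re-initialize the kernels of layer l+1 that consume its activations.
import torch

def reinit_filter(conv_l: torch.nn.Conv2d, conv_next: torch.nn.Conv2d, f: int,
                  std: float = 0.01) -> None:
    with torch.no_grad():
        conv_l.weight[f].normal_(0.0, std)         # the re-introduced filter itself
        if conv_l.bias is not None:
            conv_l.bias[f].zero_()
        conv_next.weight[:, f].normal_(0.0, std)   # every next-layer kernel reading from filter f
```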


Pruning a significant number of filters at one iteration can lead to instability in training. This is mostly due to changes in the running mean/variance of the BatchNorm parameters. To overcome this issue, filters can be pruned over multiple mini-batches. There is no need to re-evaluate the rank, as it does not change significantly within a few iterations. Instability of training is compounded in DenseNet, due to the dense connections. Removing multiple filters leads to significant changes to the forward-going dense connections, and they impact all the existing activations. One way to overcome this is to decay the filter weights over multiple iterations to a very small norm before removing the filter altogether from the network. Similarly, Squeeze-and-Excitation Networks (see footnote 2) are also difficult to prune, because they maintain learned scaling parameters for activations from all the filters. Unlike BatchNorm, it is not trivial to remove the corresponding scaling parameters, as they are part of a fully connected layer. Removing this value would change the network structure and also the relative scaling of all the other activations.
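
One way to realize the gradual removal described above is sketched below; `decay_filters` is a hypothetical helper and the decay factor is an illustrative placeholder.

```python
# Sketch: scale the weights of filters that are about to be pruned down over several
# mini-batches, so the rest of the network adapts before they are zeroed out entirely.
import torch

def decay_filters(conv: torch.nn.Conv2d, filters, factor: float = 0.5) -> None:
    """Call once per mini-batch for the selected filters until their norm is negligible."""
    with torch.no_grad():
        for f in filters:
            conv.weight[f].mul_(factor)
            if conv.bias is not None:
                conv.bias[f].mul_(factor)
```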


It is also possible to apply RePr to a pre-trained model. This is especially useful for ImageNet, where the cost of training from scratch is high. Applying RePr to a pretrained model is able to produce some improvement, but is not as effective as applying RePr throughout training. Careful selection of the fine-tuning learning rate is necessary to minimize the required training time. Our experiments show that using adaptive LR optimizers such as Adam might be more suited for fine-tuning from pre-trained weights.


Hyper-parameters


All ImageNet models were trained using TensorFlow on Tesla V100 GPUs, and model definitions were obtained from the official TF repository (footnote 3). Images were augmented with brightness (0.6 to 1.4), contrast (0.6 to 1.4), saturation (0.4), lighting (0.1), random center crop, and horizontal flip. At test time, images were evaluated on the center crop. Most models were trained with a batch size of 256, but the larger ones like ResNet-101, ResNet-152 and Inception-v2 were trained with a batch size of 128. Depending upon the implementation, RePr may add its own non-trainable variables, which take up GPU memory, thus requiring a smaller batch size than that originally reported by other papers. Models with a batch size of 256 were trained using SGD with a learning rate of 0.1 for the first 30 epochs, 0.01 for the next 30 epochs, and 0.001 for the remaining epochs. For models with a batch size of 128, these learning rates were correspondingly reduced by half. For ResNet models, convolutional layers were initialized with MSRA initialization with FAN_OUT (scaling = 2.0), and the fully connected layer was initialized with Random Normal (standard deviation = 0.01).
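
For reference, the schedule and initialization described above map onto standard PyTorch calls roughly as follows; the paper used TensorFlow, and the momentum value and placeholder model here are assumptions.

```python
# Sketch: step LR schedule (0.1 / 0.01 / 0.001 every 30 epochs) and MSRA fan_out init.
import torch
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out")  # MSRA initialization with FAN_OUT
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)                # Random Normal, std = 0.01

model = nn.Sequential(nn.Conv2d(3, 64, 7), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1000))
model.apply(init_weights)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)
```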


Footnote 2: Hu, J., Shen, L., & Sun, G. (2017). Squeeze-and-Excitation Networks. CoRR, abs/1709.01507.


Footnote 3: tensorflow/contrib/slim/python/slim/nets


image

Figure 10: Comparison of filter correlations with RePr and standard training.


Source: http://tongtianta.site/paper/13375
Translated and edited by Lornatang
Proofread by Lornatang
