Improved Training of Wasserstein GANs
4 Gradient penalty
We now propose an alternative way to enforce the Lipschitz constraint. A differentiable function is 1-Lipschitz if and only if it has gradients with norm at most 1 everywhere, so we consider directly constraining the gradient norm of the critic's output with respect to its input. To circumvent tractability issues, we enforce a soft version of the constraint with a penalty on the gradient norm for random samples. Our new objective is

L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[ (\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2 \big],

where P_{\hat{x}} is defined implicitly by sampling uniformly along straight lines between pairs of points sampled from the data distribution P_r and the generator distribution P_g.

Penalty coefficient All experiments in this paper use λ = 10, which we found to work well across a variety of architectures and datasets ranging from toy tasks to large ImageNet CNNs.
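The penalty term can be computed directly with automatic differentiation. Below is a minimal sketch of the penalized critic loss, assuming a PyTorch critic D, batches real and fake of the same shape, and λ = 10; the framework and all names are our own illustration, not code from the paper.

```python
import torch

def critic_loss_with_gp(D, real, fake, lam=10.0):
    """Sketch of the WGAN-GP critic loss: E[D(fake)] - E[D(real)] + lam * gradient penalty."""
    batch_size = real.size(0)

    # Sample x_hat uniformly along straight lines between real and generated pairs,
    # following the sampling distribution P_x_hat described above.
    eps_shape = [batch_size] + [1] * (real.dim() - 1)
    eps = torch.rand(eps_shape, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)

    d_hat = D(x_hat)
    grads = torch.autograd.grad(
        outputs=d_hat,
        inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True,  # the penalty itself must stay differentiable
    )[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)

    # Two-sided penalty: push the per-example gradient norm toward 1,
    # rather than merely keeping it below 1.
    penalty = ((grad_norm - 1.0) ** 2).mean()

    return D(fake).mean() - D(real).mean() + lam * penalty
```

In a critic update, this loss is minimized with respect to the critic parameters only; the generator is updated separately against -E[D(fake)] as in a standard WGAN.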
No critic batch normalization Most prior GAN implementations [22, 23, 2] use batch normalization in both the generator and the discriminator to help stabilize training, but batch normalization changes the form of the discriminator’s problem from mapping a single input to a single output to mapping from an entire batch of inputs to a batch of outputs [23]. Our penalized training objective is no longer valid in this setting, since we penalize the norm of the critic’s gradient with respect to each input independently, and not the entire batch. To resolve this, we simply omit batch normalization in the critic in our models, finding that they perform well without it. Our method works with normalization schemes which don’t introduce correlations between examples. In particular, we recommend layer normalization [3] as a drop-in replacement for batch normalization.
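As an illustration of the drop-in replacement, here is a hypothetical convolutional critic block in PyTorch with the batch normalization layer swapped for layer normalization; the block structure, layer sizes, and names are our own example rather than the paper's architecture.

```python
import torch.nn as nn

def critic_block(in_channels, out_channels, out_spatial):
    # Hypothetical critic block. LayerNorm normalizes each example on its own,
    # so it introduces no correlations between examples in a batch and leaves
    # the per-example gradient penalty valid.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),
        nn.LayerNorm([out_channels, out_spatial, out_spatial]),  # instead of nn.BatchNorm2d(out_channels)
        nn.LeakyReLU(0.2),
    )
```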
Two-sided penalty We encourage the norm of the gradient to go towards 1 (two-sided penalty) instead of just staying below 1 (one-sided penalty). Empirically this seems not to constrain the critic too much, likely because the optimal WGAN critic anyway has gradients with norm 1 almost everywhere under P_r and P_g.
5 Experiments
5.1 Training random architectures within a set
We experimentally demonstrate our model’s ability to train a large number of architectures which we think are useful to be able to train. Starting from the DCGAN architecture, we define a set of architecture variants by changing model settings to random corresponding values in Table 1. We believe that reliable training of many of the architectures in this set is a useful goal, but we do not claim that our set is an unbiased or representative sample of the whole space of useful architectures: it is designed to demonstrate a successful regime of our method, and readers should evaluate whether it contains architectures similar to their intended application.
Table 1: We evaluate WGAN-GP’s ability to train the architectures in this set.
Table 2: Outcomes of training 200 random architectures, for different success thresholds. For comparison, our standard DCGAN scored 7.24.
5.2 Training varied architectures on LSUN bedrooms
To demonstrate our model’s ability to train many architectures with its default settings, we train six different GAN architectures on the LSUN bedrooms dataset [31]. In addition to the baseline DCGAN architecture from [22], we choose six architectures whose successful training we demonstrate: (1) no BN and a constant number of filters in the generator, as in [2], (2) a 4-layer 512-dim ReLU MLP generator, as in [2], (3) no normalization in either the discriminator or generator, (4) gated multiplicative nonlinearities, as in [24], (5) tanh nonlinearities, and (6) a 101-layer ResNet generator and discriminator.
Figure 2: Different GAN architectures trained with different methods. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP.
Although we do not claim it is impossible without our method, to the best of our knowledge this is the first time very deep residual networks were successfully trained in a GAN setting. For each architecture, we train models using four different GAN methods: WGAN-GP, WGAN with weight clipping, DCGAN [22], and Least-Squares GAN [18]. For each objective, we used the default set of optimizer hyperparameters recommended in that work (except LSGAN, where we searched over learning rates).
For WGAN-GP, we replace any batch normalization in the discriminator with layer normalization (see section 4). We train each model for 200K iterations and present samples in Figure 2. We only succeeded in training every architecture with a shared set of hyperparameters using WGAN-GP. For every other training method, some of these architectures were unstable or suffered from mode collapse.
5.3 Improved performance over weight clipping
One advantage of our method over weight clipping is improved training speed and sample quality. To demonstrate this, we train WGANs with weight clipping and our gradient penalty on CIFAR-10 [13] and plot Inception scores [23] over the course of training in Figure 3. For WGAN-GP, we train one model with the same optimizer (RMSProp) and learning rate as WGAN with weight clipping, and another model with Adam and a higher learning rate. Even with the same optimizer, our method converges faster and to a better score than weight clipping. Using Adam further improves performance. We also plot the performance of DCGAN [22] and find that our method converges more slowly (in wall-clock time) than DCGAN, but its score is more stable at convergence.
Figure 3: CIFAR-10 Inception score over generator iterations (left) or wall-clock time (right) for four models: WGAN with weight clipping, WGAN-GP with RMSProp and Adam (to control for the optimizer), and DCGAN. WGAN-GP significantly outperforms weight clipping and performs comparably to DCGAN.
5.4 Sample quality on CIFAR-10 and LSUN bedrooms
For equivalent architectures, our method achieves comparable sample quality to the standard GAN objective. However, the increased stability allows us to improve sample quality by exploring a wider range of architectures. To demonstrate this, we find an architecture which establishes a new state of the art Inception score on unsupervised CIFAR-10 (Table 3). When we add label information (using the method in [20]), the same architecture outperforms all other published models except for SGAN.
Table 3: Inception scores on CIFAR-10. Our unsupervised model achieves state-of-the-art performance, and our conditional model outperforms all others except SGAN.
We also train a deep ResNet on LSUN bedrooms and show samples in Figure 4. We believe these samples are at least competitive with the best reported so far on any resolution for this dataset.
Figure 4: Samples of LSUN bedrooms. We believe these samples are at least comparable to the best published results so far.
5.5 Modeling discrete data with a continuous generator
To demonstrate our method’s ability to model degenerate distributions, we consider the problem of modeling a complex discrete distribution with a GAN whose generator is defined over a continuous space. As an instance of this problem, we train a character-level GAN language model on the Google Billion Word dataset [6]. Our generator is a simple 1D CNN which deterministically transforms a latent vector into a sequence of 32 one-hot character vectors through 1D convolutions. We apply a softmax nonlinearity at the output, but use no sampling step: during training, the softmax output is passed directly into the critic (which, likewise, is a simple 1D CNN). When decoding samples, we just take the argmax of each output vector.
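The argmax decoding happens only when inspecting samples, never inside the training loop. A minimal sketch, assuming a PyTorch generator G that maps a latent batch to softmax outputs of shape (batch, 32, vocab_size) and a character vocabulary charmap; all names here are illustrative assumptions rather than the paper's code.

```python
import torch

def sample_text(G, charmap, batch_size=64, latent_dim=128):
    # charmap: list mapping character index -> character (illustrative).
    z = torch.randn(batch_size, latent_dim)
    probs = G(z)                    # (batch, 32, len(charmap)) softmax outputs;
                                    # during training these go straight into the critic
    indices = probs.argmax(dim=-1)  # decoding only: argmax at each of the 32 positions
    return ["".join(charmap[i] for i in row.tolist()) for row in indices]
```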
We present samples from the model in Table 4. Our model makes frequent spelling errors (likely because it has to output each character independently) but nonetheless manages to learn quite a lot about the statistics of language. We were unable to produce comparable results with the standard GAN objective, though we do not claim that doing so is impossible.
Table 4: Samples from a WGAN-GP character-level language model trained on sentences from the Billion Word dataset, truncated to 32 characters. The model learns to directly output one-hot character embeddings from a latent vector without any discrete sampling step. We were unable to achieve comparable results with the standard GAN objective and a continuous generator.
Figure 5: (a) The negative critic loss of our model on LSUN bedrooms converges toward a minimum as the network trains. (b) WGAN training and validation losses on a random 1000-digit subset of MNIST show overfitting when using either our method (left) or weight clipping (right). In particular, with our method, the critic overfits faster than the generator, causing the training loss to increase gradually over time even as the validation loss drops.
Other attempts at language modeling with GANs [32, 14, 30, 5, 15, 10] typically use discrete models and gradient estimators [28, 12, 17]. Our approach is simpler to implement, though whether it scales beyond a toy language model is unclear.
5.6 Meaningful loss curves and detecting overfitting
An important benefit of weight-clipped WGANs is that their loss correlates with sample quality and converges toward a minimum. To show that our method preserves this property, we train a WGAN-GP on the LSUN bedrooms dataset [31] and plot the negative of the critic’s loss in Figure 5a. We see that the loss converges as the generator minimizes.
Given enough capacity and too little training data, GANs will overfit. To explore the loss curve’s behavior when the network overfits, we train large unregularized WGANs on a random 1000-image subset of MNIST and plot the negative critic loss on both the training and validation sets in Figure 5b. In both WGAN and WGAN-GP, the two losses diverge, suggesting that the critic overfits and provides an inaccurate estimate of W(P_r, P_g), at which point all bets are off regarding correlation with sample quality. However, in WGAN-GP the training loss gradually increases even while the validation loss drops.
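A rough sketch of this diagnostic, under the assumption that separate training and validation data loaders are available and that each loader yields image batches; the function and parameter names are our own illustration.

```python
import torch

@torch.no_grad()
def negative_critic_loss(D, G, data_loader, latent_dim=128):
    # Estimate the negative critic loss E[D(real)] - E[D(fake)] over one data split.
    # Tracking this quantity on both the training split and a held-out split makes
    # critic overfitting visible when the two curves diverge.
    total, n = 0.0, 0
    for real in data_loader:
        z = torch.randn(real.size(0), latent_dim, device=real.device)
        fake = G(z)
        total += (D(real).mean() - D(fake).mean()).item() * real.size(0)
        n += real.size(0)
    return total / n
```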
[29] also measure overfitting in GANs by estimating the generator’s log-likelihood. Compared to that work, our method detects overfitting in the critic (rather than the generator) and measures overfitting against the same loss that the network minimizes.
6 Conclusion
In this work, we demonstrated problems with weight clipping in WGAN and introduced an alternative in the form of a penalty term in the critic loss which does not exhibit the same problems. Using our method, we demonstrated strong modeling performance and stability across a variety of architectures. Now that we have a more stable algorithm for training GANs, we hope our work opens the path for stronger modeling performance on large-scale image datasets and language. Another interesting direction is adapting our penalty term to the standard GAN objective function, where it might stabilize training by encouraging the discriminator to learn smoother decision boundaries.
Acknowledgements
We would like to thank Mohamed Ishmael Belghazi, Léon Bottou, Zihang Dai, Stefan Doerr, Ian Goodfellow, Kyle Kastner, Kundan Kumar, Luke Metz, Alec Radford, Colin Raffel, Sai Rajeshwar, Aditya Ramesh, Tom Sercu, Zain Shah and Jake Zhao for insightful comments.
Source: http://tongtianta.site/paper/3418
Editor: Lornatang
Proofreader: Lornatang