Are GANs Created Equal? A Large-Scale Study: Translation (Part 2)

Are GANs Created Equal? A Large-Scale Study: Translation (Part 1)

5. Metrics

In this work we focus on two sets of metrics. We first analyze the recently proposed FID in terms of robustness (of the metric itself), and conclude that it has desirable properties and can be used in practice. Nevertheless, this metric, as well as Inception Score, is incapable of detecting overfitting: a memory GAN which simply stores all training samples would score perfectly under both measures. Based on these shortcomings, we propose an approximation to precision and recall for GANs and show that it can be used to quantify the degree of overfitting. We stress that the proposed method should be viewed as complementary to IS or FID, rather than a replacement.

5.1. Fréchet Inception Distance

FID was shown to be robust to noise [10]. Here we quantify the bias and variance of FID, its sensitivity to the encoding network, and its sensitivity to mode dropping. To this end, we partition the data set into two groups, i.e. $X = X_1 \cup X_2$. Then, we define the data distribution $\hat{p}_d$ as the empirical distribution on a random subsample of $X_1$ and the model distribution $\hat{p}_g$ to be the empirical distribution on a random subsample from $X_2$. For a random partition this “model distribution” should follow the data distribution.
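
For concreteness, the following is a minimal sketch (not from the original paper) of how FID between two empirical samples can be computed once encoder activations have been extracted; `act1` and `act2` are assumed to be NumPy arrays of shape (N, d), e.g. Inception pool3 features.

```python
import numpy as np
from scipy import linalg


def fid(act1: np.ndarray, act2: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two sets of activations."""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    sigma1 = np.cov(act1, rowvar=False)
    sigma2 = np.cov(act2, rowvar=False)
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts due to numerical noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```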

Bias and variance. We evaluate the bias and variance of FID on four classic data sets used in the GAN literature. We start by using the default train vs. test partition and compute the FID between the test set (limited to 10000 samples for CelebA) and the sample of size N from the train set. The sampling from the train set is performed 50 times. The optimistic estimates of FID are reported in Table 2. We observe that FID has a rather high bias, but a small variance. From this perspective, estimating the full covariance matrix might be unnecessary and counter-productive, and a constrained version might suffice.
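
A sketch of this resampling protocol, reusing the `fid` helper above; `train_acts` and `test_acts` are assumed to be pre-computed activation arrays, and the sample size and number of repetitions are illustrative defaults.

```python
import numpy as np


def fid_bias_variance(train_acts, test_acts, n=10000, repeats=50, seed=0):
    """Repeatedly subsample the train split and score it against the test split."""
    rng = np.random.default_rng(seed)
    scores = [
        fid(test_acts, train_acts[rng.choice(len(train_acts), size=n, replace=False)])
        for _ in range(repeats)
    ]
    scores = np.asarray(scores)
    # If both splits come from the same distribution, the mean estimates the bias
    # and the standard deviation estimates the variability of the score.
    return scores.mean(), scores.std()
```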

² Furthermore, while we present the results which were obtained by a random search, we have also investigated sequential Bayesian optimization, which resulted in comparable results.

Table 2: Bias and variance of FID. If the data distribution matches the model distribution, FID should evaluate to zero. However, we observe some bias and low variance on samples of size 10000.

To test the sensitivity to this initial choice of train vs. test partitioning, we consider 50 random partitions (keeping the relative sizes fixed, i.e. 6 : 1 for MNIST) and compute the FID with samples of size 10000. We observe results similar to Table 2, which is to be expected if the train and test data sets are drawn from the same distribution.

Detecting mode dropping with FID. To simulate missing modes, we fix a partition of the data set $X = X_1 \cup X_2$, and we subsample $X_2$ and keep only samples from the first k classes, increasing k from 1 to 10. For each k, we consider 50 random subsamples from $X_2$. Figure 1 shows that FID is heavily influenced by the missing modes.
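
A sketch of this simulation, again reusing the `fid` helper; `acts` and `labels` are assumed to hold the activations and class ids of the subsampled split, `ref_acts` those of the reference split, and the subsample size is illustrative.

```python
import numpy as np


def fid_under_mode_dropping(ref_acts, acts, labels, max_k=10, subsamples=50,
                            sample_size=1000, seed=0):
    """Mean FID when only the first k classes are present, for k = 1..max_k."""
    rng = np.random.default_rng(seed)
    results = {}
    for k in range(1, max_k + 1):
        kept = acts[labels < k]  # drop all modes (classes) >= k
        scores = []
        for _ in range(subsamples):
            idx = rng.choice(len(kept), size=min(sample_size, len(kept)), replace=False)
            scores.append(fid(ref_acts, kept[idx]))
        results[k] = float(np.mean(scores))
    return results
```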

Figure 1: As the sample captures more classes the FID with respect to the reference data set decreases. We observe that FID drastically increases under mode dropping.

Sensitivity to encoding network. Suppose we compute FID using a different network and encoding layer. Would the ranking of models change? To test this we apply VGG trained on ImageNet and consider the layer FC7 of dimension 4096. Figure 2 shows the resulting distribution. We observe a high Spearman's rank correlation between the two scores, which encourages the use of the default coding layer suggested by the authors. Of course, a natural comparison would be to apply VGG trained on some other data set, which we leave for future work.
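
The ranking comparison itself is a one-liner; given per-model FID scores under the two encoders (hypothetical input lists), the Spearman rank correlation measures how similarly the two encodings rank the models.

```python
from scipy.stats import spearmanr


def ranking_agreement(fids_inception, fids_vgg):
    """Spearman rank correlation between per-model FIDs under two encoders."""
    rho, pvalue = spearmanr(fids_inception, fids_vgg)
    return rho, pvalue
```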

Figure 2: The difference between the FID score computed on InceptionNet vs the FID computed using VGG for the CELEBA data set (for the interesting range FID < 200). We observe a high rank correlation (Spearman's ρ), which encourages the use of the default coding layer suggested by the authors.

5.2. Precision, Recall and F1 Score

Precision, recall and the $F_1$ score are proven and widely adopted techniques for quantitatively evaluating the quality of discriminative models. Precision measures the fraction of relevant instances among the retrieved instances, while recall measures the fraction of the retrieved instances among the relevant instances. The $F_1$ score is the harmonic mean of precision and recall.
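
In symbols, with TP, FP and FN denoting true positives, false positives and false negatives of a retrieval task:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$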

Notice that IS only captures precision: It will not penalize the model for not producing all modes of the data distribution — it will only penalize the model for not producing all classes. On the other hand, FID captures both precision and recall. Indeed, a model which fails to recover different modes of the data distribution will suffer in terms of FID.

We propose a simple and effective data set for evaluating (and comparing) generative models. Our main motivation is that the currently used data sets are either too simple (e.g. simple mixtures of Gaussians, or MNIST) or too complex (e.g. ImageNet). We argue that it is critical to be able to increase the complexity of the task in a relatively smooth and controlled fashion. To this end, we present a set of tasks for which we can approximate the precision and recall of each model. As a result, we can compare different models based on established metrics.

Manifold of convex polygons. The main idea is to construct a data manifold such that the distances from samples to the manifold can be computed efficiently. As a result, the problem of evaluating the quality of the generative model is effectively transformed into a problem of computing the distance to the manifold. This enables an intuitive approach for defining the quality of the model. Namely, if the samples from the model distribution $\hat{p}_g$ are (on average) close to the manifold, its precision is high. Similarly, high recall implies that the generator can recover (i.e. generate something close to) any sample from the manifold.

Figure 3: Samples from models with (a) high recall and precision, (b) high precision, but low recall (lacking in diversity), (c) low precision, but high recall (can decently reproduce triangles, but fails to capture convexity), and (d) low precision and low recall.

For general data sets, this reduction is impractical as one has to compute the distance to the manifold which we are trying to learn. However, if we construct a manifold such that this distance is efficiently computable, the precision and recall can be efficiently evaluated.

To this end, we propose a set of toy data sets for which such computation can be performed efficiently: the manifold of convex polygons. As the simplest example, let us focus on gray-scale triangles represented as one-channel images as in Figure 3. These triangles belong to a low-dimensional manifold $\mathcal{M}$ embedded in $\mathbb{R}^{d}$. Intuitively, the coordinate system of this manifold represents the axes of variation (e.g. rotation, translation, minimum angle size, etc.). A good generative model should be able to capture these factors of variation and recover the training samples. Furthermore, it should recover any sample from this manifold, from which we can efficiently sample, as illustrated in Figure 3.

Computing the distance to the manifold. Let us consider the simplest case: single-channel gray-scale images represented as vectors $x \in \mathbb{R}^d$. The distance of a sample $x$ to the manifold is defined as the squared Euclidean distance to the closest sample from the manifold $\mathcal{M}$, i.e.

$$d^2(x, \mathcal{M}) = \min_{\hat{x} \in \mathcal{M}} \lVert x - \hat{x} \rVert_2^2 .$$

Figure 4: How does the minimum FID behave as a function of the budget? The plot shows the distribution of the minimum FID achievable for a fixed budget along with a one standard deviation interval. For each budget, we estimate the mean and variance using 5000 bootstrap resamples out of 100 runs. We observe that, given a relatively low budget (say less than 15 hyperparameter settings), all models achieve a similar minimum FID. Furthermore, for a fixed FID, “bad” models can outperform “good” models given enough computational budget. We argue that the computational budget to search over hyperparameters is an important aspect of the comparison between algorithms.
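
A minimal sketch of the bootstrap procedure behind this plot; `run_fids` is assumed to hold the best FID of each of the (say) 100 independent hyperparameter draws for one model, and `budgets` the list of budget sizes to evaluate.

```python
import numpy as np


def min_fid_vs_budget(run_fids, budgets, n_bootstrap=5000, seed=0):
    """Bootstrap estimate of the minimum FID reachable with k hyperparameter draws."""
    rng = np.random.default_rng(seed)
    run_fids = np.asarray(run_fids)
    stats = {}
    for k in budgets:
        # Resample k runs with replacement and keep the best (minimum) FID.
        mins = rng.choice(run_fids, size=(n_bootstrap, k), replace=True).min(axis=1)
        stats[k] = (float(mins.mean()), float(mins.std()))
    return stats
```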

This is a non-convex optimization problem. We find an approximate solution by gradient descent on the vertices of the triangle (more generally, a convex polygon), ensuring that each iterate is a valid triangle (more generally, a convex polygon). To reduce the false-negative rate we repeat the algorithm several times from random initial solutions. To compute the latent representation $z$ of a sample $x$ we invert the generator, i.e. we solve

$$z^* = \arg\min_{z} \lVert G(z) - x \rVert_2^2$$

using gradient descent on z while keeping G fixed [15].
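
A minimal PyTorch sketch of this inversion step (not the authors' code); `G` is assumed to be a trained generator module mapping a 64-dimensional latent vector to an image tensor of the same shape as the target `x`.

```python
import torch


def invert_generator(G, x, latent_dim=64, steps=500, lr=0.05, restarts=5):
    """Approximate z* = argmin_z ||G(z) - x||^2 by gradient descent on z, with G fixed."""
    for p in G.parameters():
        p.requires_grad_(False)  # keep G fixed
    best_z, best_loss = None, float("inf")
    for _ in range(restarts):  # random restarts reduce the false-negative rate
        z = torch.randn(1, latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((G(z) - x) ** 2).sum()  # squared Euclidean distance
            loss.backward()
            opt.step()
        with torch.no_grad():
            final = float(((G(z) - x) ** 2).sum())
        if final < best_loss:
            best_z, best_loss = z.detach(), final
    return best_z, best_loss
```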

6. Large-scale Experimental Evaluation

We consider two budget-constrained experimental setups: (i) the wide one-shot setup, where one may select 100 samples of hyperparameters per model and the range for each hyperparameter is wide, and (ii) the narrow two-shots setup, where one is allowed to select 50 samples from narrower ranges which were manually selected by first performing the wide hyperparameter search over a specific data set. For the exact ranges and hyperparameter search details we refer the reader to Appendix A. In the second set of experiments we evaluate the models based on the “novel” metric: the $F_1$ score on the proposed data set. Finally, we included the Variational Autoencoder [13] in the experiments as a popular alternative.

6.1. Experimental Setup

To ensure a fair comparison, we made the following choices: (i) we use the generator and discriminator architecture from INFO GAN [5], as the resulting function space is rich enough and none of the considered GANs was originally designed for this architecture. Furthermore, it is similar to a proven architecture used in DCGAN [20]. The exception is BEGAN, where an autoencoder is used as the discriminator. We maintain similar expressive power to INFO GAN by using identical convolutional layers in the encoder and approximately matching the total number of parameters.

For all experiments we fix the latent code size to 64 and the prior distribution over the latent space to be uniform on $[-1, 1]^{64}$, except for VAE where it is Gaussian $\mathcal{N}(0, I)$. We choose Adam [12] as the optimization algorithm as it was the most popular choice in the GAN literature.³ We apply the same learning rate for both generator and discriminator. We set the batch size to 64 and perform optimization for 20 epochs on MNIST and FASHION MNIST, 40 on CELEBA, and 100 on CIFAR.?

Finally, we allow for recent suggestions, such as batch normalization in the discriminator, and imbalanced update frequencies of the generator and discriminator. We explore these possibilities, together with the learning rate, the parameter $\beta_1$ of Adam, and the hyperparameters of each model. We report the hyperparameter ranges and other details in Appendix A.

6.2. A Large Hyperparameter Search

We perform hyperparameter optimization and, for each run, look for the best FID across the training run (simulating early stopping). To choose the best model, every 5 epochs we compute the FID between the 10k samples generated by the model and the 10k samples from the test set. We have performed this computationally expensive search for each data set. We present the sensitivity of models to the hyperparameters in Figure 5 and the best FID achieved by each model in Table 3.
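
The search loop itself is simple; a sketch is shown below, where `sample_hparams` and `train_and_score` are hypothetical user-supplied callables: the latter trains one model with the given hyperparameters and returns the list of FIDs computed every 5 epochs, as described above.

```python
def random_search(sample_hparams, train_and_score, n_trials=100):
    """Random hyperparameter search keeping the best FID seen during each run."""
    best_overall, best_hparams = float("inf"), None
    for _ in range(n_trials):
        hparams = sample_hparams()                 # draw from the (wide) ranges
        per_epoch_fids = train_and_score(hparams)  # FID every 5 epochs
        best_for_run = min(per_epoch_fids)         # simulated early stopping
        if best_for_run < best_overall:
            best_overall, best_hparams = best_for_run, hparams
    return best_overall, best_hparams
```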

³ An empirical comparison to RMSProp is provided in Appendix F.
? Those four data sets are a popular choice for generative modeling. They are of simple to medium complexity, making it possible to run many experiments as well as getting decent results.

Figure 5: A wide-range hyperparameter search (100 hyperparameter samples per model). Black stars indicate the performance of the suggested hyperparameter settings. We observe that GAN training is extremely sensitive to hyperparameter settings and there is no model which is significantly more stable than others. The importance of the hyperparameter search is further highlighted in Figure 15.

Table 3: Best FID obtained in a large-scale hyperparameter search for each data set. The scores were computed in two phases: first, we run a large-scale search over a wide range of hyperparameters and select the best model. Then, we re-run the training of the selected model 50 times with different initialization seeds, to estimate the stability of the training, and report the mean FID and standard deviation, excluding outliers. The asterisk (*) on some combinations of models and data sets indicates the presence of significant outlier runs, usually severe mode collapses or training failures (* indicates up to 20% failures). We observe that the performance of each model heavily depends on the data set and no model strictly dominates the others. We note that VAE is heavily penalized due to the blurriness of the generated images. Note that these results are not “state-of-the-art”: (i) larger architectures could improve all models, (ii) authors often report the best FID, which opens the door for random seed optimization.

Critically, we consider the mean FID as the computational budget increases, which is shown in Figure 4. There are three important observations. Firstly, there is no algorithm which clearly dominates the others. Secondly, for an interesting range of FIDs, a “bad” model trained on a large budget can outperform a “good” model trained on a small budget. Finally, when the budget is limited, any statistically significant comparison of the models is unattainable.

6.3. Impact of Limited Computational Budget

In some cases, the computational budget available to a practitioner is too small to perform such a large-scale hyperparameter search. Instead, one can tune the range of hyperparameters on one data set and interpolate good hyperparameter ranges for other data sets. We now consider this setting, in which we allow only 50 samples from a set of narrow ranges, which were selected based on the wide hyperparameter search on the FASHION-MNIST data set. We report the narrow hyperparameter ranges in Appendix A. Figure 15 shows the variance of FID per model, where the hyperparameters were selected from the narrow ranges. From the practical point of view, there are significant differences between the models: in some cases the hyperparameter ranges transfer from one data set to the others (e.g. NS GAN), while others are more sensitive to this choice (e.g. WGAN). We note that better scores can be obtained by a wider hyperparameter search. These results support the conclusion that discussing the best score obtained by a model on a data set is not a meaningful way to discern between these models. One should instead discuss the distribution of the obtained scores.

6.4. Robustness to Random Initialization

For a fixed model, hyperparameters, training algorithm, and order in which the data is presented to the model, one would expect similar model performance. To test this hypothesis we re-train the best models from the limited hyperparameter range considered in the previous section, while changing the initial weights of the generator and discriminator networks (i.e. by varying a random seed). Table 3 and Figure 16 show the results for each data set. Most models are relatively robust to random initialization, except LSGAN, even though for all of them the variance is significant and should be taken into account when comparing models.

6.5. Precision, recall, and F1

We perform a search over the wide range of hyperparameters and compute precision and recall by considering $n$ samples. In particular, we compute the precision of the model as the fraction of generated samples whose squared Euclidean distance to the manifold is below a threshold $\delta$. We then consider $n$ samples from the test set, invert each sample $x$ to compute $z^*$, and compute the squared Euclidean distance between $x$ and $G(z^*)$. We define the recall as the fraction of samples with squared Euclidean distance below $\delta$. Figure 6 shows the results, where we select the best $F_1$ score for a fixed model and hyperparameters and vary the budget. We observe that even for this seemingly simple task, many models struggle to achieve a high $F_1$ score. Analogous plots where we instead maximize precision or recall for various thresholds are presented in Appendix E.
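
A sketch of turning these distances into the reported scores; `dists_generated` (squared distances of generated samples to the manifold) and `dists_test` (squared distances between each test sample $x$ and $G(z^*)$) are assumed to be pre-computed with the procedures sketched earlier.

```python
import numpy as np


def precision_recall_f1(dists_generated, dists_test, delta):
    """Thresholded precision/recall and their harmonic mean (F1)."""
    precision = float(np.mean(np.asarray(dists_generated) <= delta))
    recall = float(np.mean(np.asarray(dists_test) <= delta))
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)
```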

Figure 6: How does the $F_1$ score vary with the computational budget? The plot shows the distribution of the maximum $F_1$ score achievable for a fixed budget with a 95% confidence interval. For each budget, we estimate the mean and confidence interval (of the mean) using 5000 bootstrap resamples out of 100 runs. When optimizing for the $F_1$ score, both NS GAN and WGAN enjoy high precision and recall. The underwhelming performance of BEGAN and VAE on this particular data set merits further investigation.

7. Conclusion & Open Problems

In this paper we have started a discussion on how to neutrally and fairly compare GANs. We focus on two sets of evaluation metrics: (i) the Fréchet Inception Distance, and (ii) precision, recall and the $F_1$ score. We provide empirical evidence that FID is a reasonable metric due to its robustness with respect to mode dropping and encoding network choices.

Comparison based on FID. Our main insight is that, to compare models, it is meaningless to report the minimum FID achieved. Instead, distributions of the FID for a fixed computational budget should be compared. Indeed, the empirical evidence presented herein implies that algorithmic differences in state-of-the-art GANs become less relevant as the computational budget increases. Furthermore, given a limited budget (say a month of compute time), a “good” algorithm might be outperformed by a “bad” algorithm.

Comparison based on precision, recall and the $F_1$ score. Our simple triangle data set allows us to compute well-understood precision and recall metrics, and consequently the $F_1$ score. We observe that even for this seemingly simple task, many models struggle to achieve a high $F_1$ score. When optimizing for the $F_1$ score, both NS GAN and WGAN enjoy both high precision and recall. Other models, such as DRAGAN and WGAN GP, fail to reach high recall values. Finally, we observe that it is possible to achieve high precision and high recall on this task (cf. Appendix E).

Comparison with respect to the original GAN. While many algorithms have claimed superiority over the original GAN model [8], we found no empirical evidence which supports such claims across all data sets. In fact, the NS GAN performs on par with most other models and achieves the best overall FID on MNIST. Furthermore, it outperforms other models in terms of the $F_1$ score on TRIANGLES.

Open problems. It remains to be examined whether FID is stable under a more radical change of the encoding, e.g. using a network trained on a different task. Also, FID cannot detect overfitting to the training data set, and an algorithm that just remembers all the training examples would perform very well. Finally, FID can probably be “fooled” by artifacts that are not detected by the embedding network.

The triangles data set can be made progressively more complex by: (i) introducing multiple convex polygons at once, (ii) providing color or texture inside the polygon, and (iii) gradually increasing the resolution. While the performance of existing models might be improved given a bigger computational budget and larger model capacity, we argue that algorithmic improvements should drive better performance. Having such a series of tasks of increasing complexity should greatly benefit the research community.

As discussed in Section 4, many dimensions have to be taken into account when comparing different models, and this work only explores a subset of the options. We cannot exclude the possibility that some models significantly outperform others under currently unexplored conditions.

Finally, this work strongly suggests that future GAN research should be more experimentally systematic and that models should be compared on neutral ground.

Source: http://tongtianta.site/paper/3092
Editing: Lornatang
Proofreading: Lornatang
