https://arxiv.org/search/?query=convolution+transformer+robust&searchtype=all&source=header
RobustART: Benchmarking Robustness on Architecture Design and Training Techniques
Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs
★★★★★
Authors: Philipp Benz, Soomin Ham, Chaoning Zhang, Adil Karjauv, In So Kweon
Abstract: Convolutional Neural Networks (CNNs) have become the de facto gold standard in computer vision applications in the past years. Recently, however, new model architectures have been proposed challenging the status quo. The Vision Transformer (ViT) relies solely on attention modules, while the MLP-Mixer architecture substitutes the self-attention modules with Multi-Layer Perceptrons (MLPs). Despite their great success, CNNs have been widely known to be vulnerable to adversarial attacks, causing serious concerns for security-sensitive applications. Thus, it is critical for the community to know whether the newly proposed ViT and MLP-Mixer are also vulnerable to adversarial attacks. To this end, we empirically evaluate their adversarial robustness under several adversarial attack setups and benchmark them against the widely used CNNs. Overall, we find that the two architectures, especially ViT, are more robust than their CNN models. Using a toy example, we also provide empirical evidence that the lower adversarial robustness of CNNs can be partially attributed to their shift-invariant property. Our frequency analysis suggests that the most robust ViT architectures tend to rely more on low-frequency features compared with CNNs. Additionally, we have an intriguing finding that MLP-Mixer is extremely vulnerable to universal adversarial perturbations.
Submitted 11 October, 2021; v1 submitted 6 October, 2021; originally announced October 2021.
Comments: Code: https://github.com/phibenz/robustness_comparison_vit_mlp-mixer_cnn
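A minimal sketch of the kind of white-box comparison the abstract describes, using a single-step FGSM attack; the model names, epsilon, and data pipeline are assumptions for illustration, not the paper's exact setup (see the linked repository for that):
# Sketch: compare robustness of a ViT, an MLP-Mixer and a CNN under FGSM.
import torch
import timm

def fgsm_accuracy(model, images, labels, eps=2 / 255):
    """Top-1 accuracy on FGSM-perturbed inputs (L-infinity budget eps)."""
    images = images.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    adv = (images + eps * images.grad.sign()).clamp(0, 1).detach()
    with torch.no_grad():
        preds = model(adv).argmax(dim=1)
    return (preds == labels).float().mean().item()

models = {
    "vit":   timm.create_model("vit_base_patch16_224", pretrained=True).eval(),
    "mixer": timm.create_model("mixer_b16_224", pretrained=True).eval(),
    "cnn":   timm.create_model("resnet50", pretrained=True).eval(),
}
# `loader` is assumed to yield ImageNet batches scaled to [0, 1]; a faithful
# evaluation would fold each model's mean/std normalisation into the forward pass.
# for name, m in models.items():
#     accs = [fgsm_accuracy(m, x, y) for x, y in loader]
#     print(name, sum(accs) / len(accs))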
Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers
Authors: Katelyn Morrison, Benjamin Gilby, Colton Lipchak, Adam Mattioli, Adriana Kovashka
Abstract: Recently, vision transformers and MLP-based models have been developed in order to address some of the prevalent weaknesses in convolutional neural networks. Due to the novelty of transformers being used in this domain along with the self-attention mechanism, it remains unclear to what degree these architectures are robust to corruptions. Despite some works proposing that data augmentation remains essential for a model to be robust against corruptions, we propose to explore the impact that the architecture has on corruption robustness. We find that vision transformer architectures are inherently more robust to corruptions than the ResNet-50 and MLP-Mixers. We also find that vision transformers with 5 times fewer parameters than a ResNet-50 have more shape bias. Our code is available to reproduce.
Submitted 3 July, 2021; v1 submitted 24 June, 2021; originally announced June 2021.
Comments: Under review at the Uncertainty and Robustness in Deep Learning workshop at ICML 2021. Our appendix is attached to the last page of the paper
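As a rough illustration of a corruption-robustness probe in the ImageNet-C spirit, one can measure the accuracy drop under synthetic noise at increasing severity; the noise schedule below is an assumption, not the official ImageNet-C recipe:
import torch

@torch.no_grad()
def corrupted_accuracy(model, images, labels, severity=1):
    """Accuracy under additive Gaussian noise at a given severity level."""
    sigma = 0.04 * severity                     # assumed severity schedule
    noisy = (images + sigma * torch.randn_like(images)).clamp(0, 1)
    return (model(noisy).argmax(dim=1) == labels).float().mean().item()

# Averaging corrupted accuracy over severities 1..5 (and, in the full benchmark,
# over all 15 ImageNet-C corruption types) gives one number for comparing the
# ViT, MLP-Mixer and ResNet-50 backbones.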
★★★★★
Intriguing Properties of Vision Transformers
★★★★★
Authors: Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang
Abstract: Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility in attending image-wide context conditioned on a given patch can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robust performance to occlusions is not due to a bias towards local textures, and ViTs are significantly less biased towards textures compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show effective features of ViTs are due to flexible and dynamic receptive fields possible via the self-attention mechanism.
Submitted 8 June, 2021; v1 submitted 21 May, 2021; originally announced May 2021.
Comments: Code: https://git.io/Js15X
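The 80%-occlusion result can be probed with a simple random patch-drop test like the sketch below; the 16-pixel patch size and zero fill value are assumptions for illustration:
import torch

def occlude_patches(images, drop_ratio=0.8, patch=16):
    """Zero out a random fraction of non-overlapping patches in each image."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, gh * gw, device=images.device) >= drop_ratio
    mask = keep.view(b, 1, gh, gw).float()
    mask = torch.nn.functional.interpolate(mask, scale_factor=patch, mode="nearest")
    return images * mask  # dropped patches become black squares

# occluded = occlude_patches(batch, drop_ratio=0.8)   # ~80% of content removed
# acc = (model(occluded).argmax(1) == labels).float().mean()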
Towards Robust Vision Transformer
Authors: Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, Hui Xue
Abstract: Recent advances on Vision Transformer (ViT) and its improved variants have shown that self-attention-based networks surpass traditional Convolutional Neural Networks (CNNs) in most vision tasks. However, existing ViTs focus on the standard accuracy and computation cost, lacking the investigation of the intrinsic influence on model robustness and generalization. In this work, we conduct systematic evaluation on components of ViTs in terms of their impact on robustness to adversarial examples, common corruptions and distribution shifts. We find some components can be harmful to robustness. By using and combining robust components as building blocks of ViTs, we propose Robust Vision Transformer (RVT), which is a new vision transformer and has superior performance with strong robustness. We further propose two new plug-and-play techniques called position-aware attention scaling and patch-wise augmentation to augment our RVT, which we abbreviate as RVT*. The experimental results on ImageNet and six robustness benchmarks show the advanced robustness and generalization ability of RVT compared with previous ViTs and state-of-the-art CNNs. Furthermore, RVT-S* also achieves Top-1 rank on multiple robustness leaderboards including ImageNet-C and ImageNet-Sketch. The code will be available at \url{https://git.io/Jswdk}.
Submitted 26 May, 2021; v1 submitted 17 May, 2021; originally announced May 2021.
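The patch-wise augmentation idea can be illustrated with a toy per-patch perturbation; RVT's actual augmentations and training schedule differ, so the sketch below only shows the per-patch granularity, not the paper's method:
import torch

def patchwise_jitter(images, patch=16, strength=0.2):
    """Apply an independent random brightness factor to every image patch."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    factors = 1.0 + strength * (2 * torch.rand(b, 1, gh, gw, device=images.device) - 1)
    factors = torch.nn.functional.interpolate(factors, scale_factor=patch, mode="nearest")
    return (images * factors).clamp(0, 1)

# augmented = patchwise_jitter(batch)   # each 16x16 patch gets its own jitter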
Vision Transformers are Robust Learners
★★★★★
Authors: Sayak Paul, Pin-Yu Chen
Abstract: Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align different components present inside the input data, it leaves grounds to investigate its performance under model robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six different diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), Big-Transfer. Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A which is 4.3x higher than a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on discrete cosine energy spectrum reveal intriguing properties of ViT attributing to improved robustness. Code for reproducing our experiments is available here: https://git.io/J3VO0.
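The Fourier spectrum sensitivity analysis can be sketched as perturbing inputs along single 2D Fourier basis directions and recording the error rate at each frequency; the perturbation norm below is an assumption:
import torch

def fourier_basis_noise(h, w, u, v, eps=4.0):
    """Real image-space perturbation concentrated at spatial frequency (u, v)."""
    spectrum = torch.zeros(h, w, dtype=torch.complex64)
    spectrum[u, v] = 1.0
    basis = torch.fft.ifft2(spectrum).real
    return eps * basis / basis.norm()

# noise = fourier_basis_noise(224, 224, u=3, v=7)       # a low-frequency direction
# err = (model((images + noise).clamp(0, 1)).argmax(1) != labels).float().mean()
# Sweeping all (u, v) pairs yields the usual sensitivity heat map.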
On the Robustness of Vision Transformers to Adversarial Examples
Authors: Kaleel Mahmood, Rigel Mahmood, Marten van Dijk
Abstract: Recent advances in attention-based networks have shown that Vision Transformers can achieve state-of-the-art or near state-of-the-art results on many image classification tasks. This puts transformers in the unique position of being a promising alternative to traditional convolutional neural networks (CNNs). While CNNs have been carefully studied with respect to adversarial attacks, the same cannot be said of Vision Transformers. In this paper, we study the robustness of Vision Transformers to adversarial examples. Our analyses of transformer security is divided into three parts. First, we test the transformer under standard white-box and black-box attacks. Second, we study the transferability of adversarial examples between CNNs and transformers. We show that adversarial examples do not readily transfer between CNNs and transformers. Based on this finding, we analyze the security of a simple ensemble defense of CNNs and transformers. By creating a new attack, the self-attention blended gradient attack, we show that such an ensemble is not secure under a white-box adversary. However, under a black-box adversary, we show that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy. Our analysis for this work is done using six types of white-box attacks and two types of black-box attacks. Our study encompasses multiple Vision Transformers, Big Transfer Models and CNN architectures trained on CIFAR-10, CIFAR-100 and ImageNet.
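The transferability experiment can be sketched as crafting adversarial examples on a source model and checking how often they also fool a target model; FGSM is used here for brevity, whereas the paper covers a wider set of attacks:
import torch

def transfer_rate(source, target, images, labels, eps=8 / 255):
    """Fraction of FGSM examples crafted on `source` that also fool `target`."""
    images = images.clone().requires_grad_(True)
    torch.nn.functional.cross_entropy(source(images), labels).backward()
    adv = (images + eps * images.grad.sign()).clamp(0, 1).detach()
    with torch.no_grad():
        fooled_source = source(adv).argmax(1) != labels
        fooled_target = target(adv).argmax(1) != labels
    both = (fooled_source & fooled_target).float().sum()
    return (both / fooled_source.float().sum().clamp(min=1)).item()

# Low transfer rates between a CNN source and a ViT target (and vice versa)
# are what motivates the ensemble-defense analysis in the paper.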
cs.CV cs.AI cs.LG
On the Adversarial Robustness of Vision Transformers
★★★★★★★★★★★★★★★★★★★★★★★★★
Authors: Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, Cho-Jui Hsieh
Abstract: Following the success in advancing natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations. Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs). This observation also holds for certified robustness. We summarize the following main observations contributing to the improved robustness of ViTs: 1) Features learned by ViTs contain less low-level information and are more generalizable, which contributes to superior robustness against adversarial perturbations. 2) Introducing convolutional or tokens-to-token blocks for learning low-level features in ViTs can improve classification accuracy but at the cost of adversarial robustness. 3) Increasing the proportion of transformers in the model structure (when the model consists of both transformer and CNN blocks) leads to better robustness. But for a pure transformer model, simply increasing the size or adding layers cannot guarantee a similar effect. 4) Pre-training on larger datasets does not significantly improve adversarial robustness though it is critical for training ViTs. 5) Adversarial training is also applicable to ViT for training robust models. Furthermore, feature visualization and frequency analysis are conducted for explanation. The results show that ViTs are less sensitive to high-frequency perturbations than CNNs and there is a high correlation between how well the model learns low-level features and its robustness against different frequency-based perturbations.
Submitted 14 October, 2021; v1 submitted 29 March, 2021; originally announced March 2021.
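The frequency analysis can be illustrated by adding band-limited noise (low-pass vs. high-pass via an FFT mask) and comparing the accuracy drop; the cutoff radius and noise scale are assumptions:
import torch

def band_limited_noise(shape, low_pass=True, radius=16, scale=0.1):
    """Gaussian noise restricted to low (or high) spatial frequencies."""
    b, c, h, w = shape
    spec = torch.fft.fftshift(torch.fft.fft2(torch.randn(b, c, h, w)), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    dist = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).sqrt()
    mask = ((dist <= radius) if low_pass else (dist > radius)).float()
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return scale * filtered / filtered.std()

# x_low  = (images + band_limited_noise(images.shape, low_pass=True)).clamp(0, 1)
# x_high = (images + band_limited_noise(images.shape, low_pass=False)).clamp(0, 1)
# Comparing accuracy on x_low vs. x_high probes the low/high-frequency sensitivity gap.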
Understanding Robustness of Transformers for Image Classification
★★★★★
Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit
Abstract: Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture -- such as the use of non-overlapping patches -- lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
Submitted 8 October, 2021; v1 submitted 26 March, 2021; originally announced March 2021.
Comments: Accepted for publication at ICCV 2021. Rewrote Section 5 and made other minor changes throughout
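The single-layer-removal experiment can be sketched by replacing one transformer block with an identity mapping and re-evaluating; this assumes a timm-style ViT whose encoder blocks live in model.blocks and relies on the residual connections keeping shapes intact:
import copy
import torch
import timm

def ablate_block(model, index):
    """Return a copy of the model with transformer block `index` removed."""
    ablated = copy.deepcopy(model)
    ablated.blocks[index] = torch.nn.Identity()   # residual path preserves the token shape
    return ablated.eval()

# vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
# for i in range(len(vit.blocks)):
#     acc_i = evaluate(ablate_block(vit, i), loader)   # `evaluate`/`loader` assumed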