The 1x1 convolution kernel was first proposed in Yan Shuicheng's paper Network In Network [1312.4400], and was later adopted by the Inception structure of GoogLeNet, Going Deeper with Convolutions [1409.4842]. The precondition for getting away with fewer channels is that the representation is fairly sparse; otherwise a 1x1 convolution will not help much.
Network in Network and 1×1 convolutions
Lin et al., 2013. Network in network
A 1x1 convolution can compress the number of channels; pooling compresses the width and height.
A 1x1 convolution also adds non-linearity to the network, and it can reduce the number of channels, keep it unchanged, or increase it.
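A minimal PyTorch sketch of that contrast, with illustrative shapes (not taken from any particular network): pooling shrinks height and width while keeping the channels, whereas a 1x1 convolution shrinks the channels while keeping height and width.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)                # (batch, channels, height, width)

pool = nn.MaxPool2d(kernel_size=2)             # pooling compresses width and height
conv1x1 = nn.Conv2d(192, 32, kernel_size=1)    # 1x1 convolution compresses channels

print(pool(x).shape)     # torch.Size([1, 192, 14, 14]) -- H and W halved, channels kept
print(conv1x1(x).shape)  # torch.Size([1, 32, 28, 28])  -- channels 192 -> 32, H and W kept
```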
1. Cross-channel interaction and information integration
The 1×1 convolution layer (arguably) first drew attention in the NIN architecture. Lin Min's idea in that paper was to replace the traditional linear convolution kernel with an MLP in order to increase the network's representational power. The paper also explains this from the perspective of cross-channel pooling: the proposed MLP is equivalent to appending cccp (cascaded cross-channel parametric pooling) layers after a traditional convolution kernel, which linearly combines multiple feature maps and thereby integrates information across channels. Since a cccp layer is equivalent to a 1×1 convolution, a close look at the Caffe implementation of NIN shows that every traditional convolution layer is followed by two cccp layers (that is, two 1×1 convolution layers).
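A rough PyTorch sketch of one such block, with illustrative channel sizes rather than the exact NIN/Caffe configuration: a traditional convolution followed by two cccp layers, each of which is just a 1×1 convolution with its own ReLU.

```python
import torch.nn as nn

# One "mlpconv"-style block: a traditional convolution followed by two cccp
# layers, i.e. two 1x1 convolutions; channel sizes here are illustrative.
mlpconv = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5, padding=2),   # traditional linear convolution
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, kernel_size=1),             # cccp layer 1 (1x1 convolution)
    nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, kernel_size=1),             # cccp layer 2 (1x1 convolution)
    nn.ReLU(inplace=True),
)
```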
2. Reducing and increasing the number of channels
Because 3x3 or 5x5 convolutions are quite expensive when computed on convolution layers with several hundred filters, a 1x1 convolution is applied first to reduce the dimensionality before the 3x3 or 5x5 convolution. The main roles of a 1x1 convolution are the following:
1. Dimensionality reduction. For example, a 500 x 500 input with a depth of 100, convolved with 20 filters of size 1x1, gives an output of size 500 x 500 x 20 (see the sketch after this list).
2. Adding non-linearity. Since the convolution layer is followed by an activation layer, a 1x1 convolution adds a non-linear activation on top of the previous layer's learned representation and improves the network's expressive power.
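The dimensionality-reduction example from point 1, written out as a small PyTorch check (the 500x500 input with depth 100 and the 20 filters are exactly the numbers above; the parameter count is 100 x 20 weights plus 20 biases):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 500, 500)             # a 500x500 input with depth 100
reduce = nn.Conv2d(100, 20, kernel_size=1)    # 20 filters of size 1x1

print(reduce(x).shape)                                # torch.Size([1, 20, 500, 500])
print(sum(p.numel() for p in reduce.parameters()))    # 100*20 + 20 = 2020 parameters
```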
What is Depth of a convolutional neural network?
If the input and output of a convolution were just a 2D plane, a 1x1 kernel would be meaningless, because it completely ignores the relationship between a pixel and its neighbours. But the input and output of a convolution are 3D volumes, so a 1x1 convolution actually performs, at every pixel, a linear combination (information integration) across the different channels, while preserving the original spatial structure and adjusting the depth; it can therefore either increase or decrease the number of channels.
As shown in the figure below, a 1x1 convolution layer with 2 filters reduces the depth from the original 3 down to 2, whereas 4 filters would increase it.
MSRA's ResNet also uses 1×1 convolutions, placed both before and after the 3×3 convolution layer: one reduces the channel count and the other restores it, so the 3×3 convolution sees fewer input and output channels and the number of parameters drops further, as in the bottleneck structure shown below.
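A minimal sketch of that bottleneck pattern, assuming the common 256 -> 64 -> 64 -> 256 channel sizes; batch normalization and the residual shortcut of the real ResNet block are omitted for brevity.

```python
import torch.nn as nn

# Bottleneck pattern: a 1x1 convolution reduces the channels, the 3x3 works on
# the smaller volume, and a second 1x1 restores the channel count.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),             # reduce: 256 -> 64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),   # cheap 3x3 on only 64 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1),             # restore: 64 -> 256
)
```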
Simple Answer
The most simplistic explanation would be that 1x1 convolution leads to dimensionality reduction. For example, an image of 200 x 200 with 50 features, on convolution with 20 filters of 1x1, would result in a size of 200 x 200 x 20. But then again, is this the best way to do dimensionality reduction in a convolutional neural network? What about efficacy vs efficiency?
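To make the efficiency side of that question concrete, here is a back-of-the-envelope multiply count using the 200 x 200 x 50 example above, comparing the 1x1 reduction to 20 channels against doing the same channel reduction with a hypothetical 3x3 kernel:

```python
H, W = 200, 200
c_in, c_out = 50, 20

# Multiplies needed to map a 200x200x50 volume to 200x200x20
cost_1x1 = H * W * c_out * (1 * 1 * c_in)   # 40,000,000
cost_3x3 = H * W * c_out * (3 * 3 * c_in)   # 360,000,000 (9x more)

print(cost_1x1, cost_3x3)
```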
One by One [ 1 x 1 ] Convolution - counter-intuitively useful
Complex Answer
Feature transformation
Although 1x1 convolution is a ‘feature pooling’ technique, there is more to it than just sum pooling of features across the various channels/feature-maps of a given layer. 1x1 convolution acts like a coordinate-dependent transformation in the filter space [https://plus.google.com/118431607943208545663/posts/2y7nmBuh2ar]. It is important to note that this transformation is strictly linear, but in most applications of 1x1 convolution it is followed by a non-linear activation layer like ReLU. The transformation is learned through (stochastic) gradient descent. An important distinction is that it suffers less from over-fitting due to the small kernel size (1x1).
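A quick numerical check of that claim (shapes are arbitrary): a 1x1 convolution without bias is exactly a per-pixel linear map in filter space, i.e. the same C_out x C_in matrix applied at every spatial position.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 50, 8, 8)
conv = nn.Conv2d(50, 20, kernel_size=1, bias=False)

# The same operation written as a coordinate-wise linear map in filter space:
# every position (h, w) is multiplied by the same 20x50 matrix.
W = conv.weight.view(20, 50)                          # drop the 1x1 spatial dims
y_linear = torch.einsum('oc,bchw->bohw', W, x)

print(torch.allclose(conv(x), y_linear, atol=1e-6))   # True
```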
3. Greatly increasing non-linearity while keeping the feature-map size unchanged (i.e. without losing resolution), so the network can be made very deep
Deeper Network
One by one convolution was first introduced in the paper titled Network in Network. In this paper, the authors' goal was to generate a deeper network without simply stacking more layers. It replaces a few filters with a smaller perceptron layer, a mixture of 1x1 and 3x3 convolutions. In a way, this can be seen as “going wide” instead of “deep”, but it should be noted that in machine learning terminology, ‘going wide’ often means adding more data to the training. A combination of 1x1 (x F) convolutions is mathematically equivalent to a multi-layer perceptron [https://www.reddit.com/r/MachineLearning/comments/3oln72/1x1_convolutions_why_use_them/cvyxood/]
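A small sketch of that equivalence (layer widths are arbitrary): a stack of 1x1 convolutions with a ReLU in between computes the same thing as one shared MLP applied independently at every spatial position.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 50, 8, 8)

# A stack of 1x1 convolutions with a ReLU in between ...
conv_stack = nn.Sequential(
    nn.Conv2d(50, 64, kernel_size=1), nn.ReLU(),
    nn.Conv2d(64, 20, kernel_size=1),
)

# ... is the same MLP applied independently at every spatial position.
mlp = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 20))
mlp[0].weight.data = conv_stack[0].weight.data.view(64, 50)
mlp[0].bias.data = conv_stack[0].bias.data
mlp[2].weight.data = conv_stack[2].weight.data.view(20, 64)
mlp[2].bias.data = conv_stack[2].bias.data

y_conv = conv_stack(x)                                    # (1, 20, 8, 8)
y_mlp = mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # channels-last in, NCHW out
print(torch.allclose(y_conv, y_mlp, atol=1e-6))           # True
```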
Inception Module
In the GoogLeNet architecture, 1x1 convolution is used for the following purposes:
To make the network deeper by adding an “inception module”, like in the Network in Network paper, as described above.
To reduce the dimensions inside this “inception module”.
To add more non-linearity by having ReLU immediately after every 1x1 convolution.
Here is the screenshot from the paper, which elucidates the above points:
It can be seen from the image on the right that 1x1 convolutions (in yellow) are used specifically before the 3x3 and 5x5 convolutions to reduce the dimensions. It should be noted that a two-step convolution operation can always be combined into one, but in this case, as in most other deep learning networks, the convolutions are followed by a non-linear activation, so they are no longer linear operators and cannot be combined.
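A simplified PyTorch sketch of such a module; the channel sizes are illustrative (loosely in the spirit of the paper's early inception blocks), and each 1x1 convolution is followed by a ReLU for the extra non-linearity mentioned above.

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Simplified Inception module: 1x1 convolutions (the yellow boxes) reduce
    the channel count before the expensive 3x3 and 5x5 branches, and each one
    is followed by a ReLU. Channel sizes are illustrative."""
    def __init__(self, c_in=192):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(c_in, 64, 1), nn.ReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(c_in, 96, 1), nn.ReLU(),
                                     nn.Conv2d(96, 128, 3, padding=1), nn.ReLU())
        self.branch5 = nn.Sequential(nn.Conv2d(c_in, 16, 1), nn.ReLU(),
                                     nn.Conv2d(16, 32, 5, padding=2), nn.ReLU())
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(c_in, 32, 1), nn.ReLU())

    def forward(self, x):
        # All branches keep H and W, so their outputs concatenate along channels.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

print(InceptionSketch()(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```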
In designing such a network, it is important to note that the initial convolution kernel should be larger than 1x1, so that it has a receptive field capable of capturing local spatial information. According to the NIN paper, 1x1 convolution is equivalent to a cross-channel parametric pooling layer. From the paper: “This cascaded cross channel parametric pooling structure allows complex and learnable interactions of cross channel information”.
Cross-channel information learning (cascaded 1x1 convolution) is biologically inspired, because the human visual cortex has receptive fields (kernels) tuned to different orientations. For example:
Different orientation tuned receptive field profiles in the human visual cortex Source
More Uses
1x1 Convolution can be combined with Max pooling
1x1 convolution with higher strides leads to even more reduction in data by decreasing resolution, while losing very little non-spatially-correlated information.
Replace fully connected layers with 1x1 convolutions, as Yann LeCun believes they are the same (see the sketch after the quote below):
“In Convolutional Nets, there is no such thing as ‘fully-connected layers’. There are only convolution layers with 1x1 convolution kernels and a full connection table.” – Yann LeCun
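A minimal check of that statement (sizes are arbitrary): a fully connected layer over a C-dimensional feature vector computes exactly the same thing as a 1x1 convolution over a C x 1 x 1 feature map.

```python
import torch
import torch.nn as nn

fc = nn.Linear(512, 10)
conv = nn.Conv2d(512, 10, kernel_size=1)
conv.weight.data = fc.weight.data.view(10, 512, 1, 1)   # copy the same parameters
conv.bias.data = fc.bias.data

x = torch.randn(4, 512)                                  # a batch of feature vectors
y_fc = fc(x)                                             # (4, 10)
y_conv = conv(x.view(4, 512, 1, 1)).view(4, 10)          # same numbers
print(torch.allclose(y_fc, y_conv, atol=1e-6))           # True
```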
Convolution GIF images were generated using this wonderful code; more images on 1x1 convolutions and 3x3 convolutions can be found here.