Alexnet input 224?227

2 minute read

Published:

0.结论

  1. 224的输入配套会给一个padding=2, 227的输入只有kernel=11,stride=4,整体计算得到第一个卷积层的特征输出都是55,这里没有区别。
  2. pytorch的实现和caffe原版的实现第一和第二个卷积层的输出通道数不同,但是整体参数量没有很大变化,都是6kw左右。

1.问题

在Alexnet的论文中,结构图上给的输入是224,但是如果按照224去计算,得到的下一次的输出不是55,即:

\[\begin{aligned} (224-11)/4+1&=&54.25 \\ (227-11)/4+1&=&55 \end{aligned}\]

而Alexnet训练时的数据处理步骤为:

  1. 下采样得到$256\times 256$的图像
  2. 从256的图像中crop出224的图像

因此有人认为,在真正送入网络之前,可能加了3个像素的padding,得到227

2. 解决问题

2.1 文字描述

类似的言论在cs231的资料中也有提及

The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]

2.2 代码层面

在AlexNet的Caffe实现中,可以看到使用的图像尺寸其实是227

name: "AlexNet"
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 227

不过,在pytorch或者是mmpretrain的Alexnet实现中,使用的都是224

# pytorch
[docs]class AlexNet_Weights(WeightsEnum):
    IMAGENET1K_V1 = Weights(
        url="https://download.pytorch.org/models/alexnet-owt-7be5be79.pth",
        transforms=partial(ImageClassification, crop_size=224),

# mmpretrain
@MODELS.register_module()
class AlexNet(BaseBackbone):
    """`AlexNet <https://en.wikipedia.org/wiki/AlexNet>`_ backbone.

    The input for AlexNet is a 224x224 RGB image.

    Args:
        num_classes (int): number of classes for classification.
            The default value is -1, which uses the backbone as
            a feature extractor without the top classifier.
    """

2.3 手动计算比较

                                           以224为输入(pytorch等实现)
-------------------------------------------------------------------------------------------------------------------------
        Layer (type)               Output Shape         Param #                       kernel
=========================================================================================================================
            Conv2d-1           [-1, 64, 55, 55]          3*11*11*64+64=23,296         k=11,s=4,p=2 (224+4-11)/4+1=55.25
              ReLU-2           [-1, 64, 55, 55]               0
         MaxPool2d-3           [-1, 64, 27, 27]               0                       k=3, s=2, (55-3)/2+1=27
            Conv2d-4          [-1, 192, 27, 27]         64*5*5*192+192=307,392        k=5, p=2, (27+4-5)+1=27
              ReLU-5          [-1, 192, 27, 27]               0
         MaxPool2d-6          [-1, 192, 13, 13]               0                       k=3, s=2, (27-3)/2+1=13
            Conv2d-7          [-1, 384, 13, 13]         192*3*3*384+384=663,936       k=3, p=1, (13-3+2)+1=13
              ReLU-8          [-1, 384, 13, 13]               0
            Conv2d-9          [-1, 256, 13, 13]         384*3*3*256+256=884,992       k=3, p=1, (13-3+2)+1=13
             ReLU-10          [-1, 256, 13, 13]               0
           Conv2d-11          [-1, 256, 13, 13]         256*3*3*256+256=590,080       k=3, p=1, (13-3+2)+1=13
             ReLU-12          [-1, 256, 13, 13]               0
        MaxPool2d-13            [-1, 256, 6, 6]               0                       k=3, s=2, (13-3)/2+1=6
AdaptiveAvgPool2d-14            [-1, 256, 6, 6]               0                       可能k=3, p=1,(6-3+2)+1=6
          Dropout-15                 [-1, 9216]               0
           Linear-16                 [-1, 4096]      (256*6*6)9216*4096+4096=37,752,832                       
             ReLU-17                 [-1, 4096]               0
          Dropout-18                 [-1, 4096]               0
           Linear-19                 [-1, 4096]      4096*4096+4096=16,781,312
             ReLU-20                 [-1, 4096]               0
           Linear-21                 [-1, 1000]       4096*1000+1000=4,097,000
===========================================================================================================================
Total params: 61,100,840
                                         以227为输入(Alexnet原论文,caffe实现)
------------------------------------------------------------------------------------------------------------------------
        Layer (type)               Output Shape         Param #                       kernel
=========================================================================================================================
            Conv2d-1         🔝[-1, 96, 55, 55]      3*11*11*96+96=34,944             k=11,s=4, (227-11)/4+1=55
              ReLU-2           [-1, 96, 55, 55]               0
         MaxPool2d-3           [-1, 96, 27, 27]               0                       k=3, s=2, (55-3)/2+1=27
            Conv2d-4        🔝[-1, 256, 27, 27]      96*5*5*256+256=614,656           k=5, p=2, (27+4-5)+1=27
              ReLU-5          [-1, 256, 27, 27]               0
         MaxPool2d-6          [-1, 256, 13, 13]               0                       k=3, s=2, (27-3)/2+1=13
            Conv2d-7          [-1, 384, 13, 13]      256*3*3*384+384=885,120          k=3, p=1, (13-3+2)+1=13
              ReLU-8          [-1, 384, 13, 13]               0
            Conv2d-9        🔝[-1, 384, 13, 13]      384*3*3*384+384=1,327,488        k=3, p=1, (13-3+2)+1=13
             ReLU-10          [-1, 384, 13, 13]               0
           Conv2d-11          [-1, 256, 13, 13]      384*3*3*256+256=884,992          k=3, p=1, (13-3+2)+1=13
             ReLU-12          [-1, 256, 13, 13]               0
        MaxPool2d-13            [-1, 256, 6, 6]               0                       k=3, s=2, (13-3)/2+1=6
AdaptiveAvgPool2d-14            [-1, 256, 6, 6]               0
          Dropout-15                 [-1, 9216]               0
           Linear-16                 [-1, 4096]      37,752,832                       
             ReLU-17                 [-1, 4096]               0
          Dropout-18                 [-1, 4096]               0
           Linear-19                 [-1, 4096]      16,781,312
             ReLU-20                 [-1, 4096]               0
           Linear-21                 [-1, 1000]       4,097,000
===========================================================================================================================
Total params: 62,368,344

可以看到,

  • 主要是前两个卷积层部分的输出通道数不一致,
  • 但是整体参数量都是6kw左右,参数量并没有发生很大变化,
  • caffe实现的Alexnet看起来结构更对称一些。

🔗参考资料: