ConvNets

Mathematically

32x32x3 image (3 channels)

Filter: a 5x5x3 filter. Convolve the filter with the image: slide it over the image spatially.

Activation map => 28 x 28 x 1, computed as $w^T x + b$ at each spatial location.

Then, with a second 5x5x3 filter, we get another activation map.

If we had six 5x5 filters, we'd get 6 separate activation maps.

Stack them up to get a new 28x28x6 image. This operation is called convolution.
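A minimal numpy sketch of the operation just described (shapes follow the 32x32x3 input and six 5x5x3 filters from the example; the explicit loops are for clarity, not speed):

```python
import numpy as np

# Example shapes from the notes: 32x32x3 input, six 5x5x3 filters, stride 1, no padding.
H, W, C = 32, 32, 3
K, F = 6, 5

image = np.random.randn(H, W, C)
filters = np.random.randn(K, F, F, C)
biases = np.random.randn(K)

out_h, out_w = H - F + 1, W - F + 1            # 28 x 28
activation_maps = np.zeros((out_h, out_w, K))  # stacked -> 28 x 28 x 6

for k in range(K):                # one activation map per filter
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i+F, j:j+F, :]     # 5x5x3 region under the filter
            # each output value is w^T x + b for that patch
            activation_maps[i, j, k] = np.sum(patch * filters[k]) + biases[k]

print(activation_maps.shape)      # (28, 28, 6)
```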

Conv ReLU

Conv -> ReLU -> Pool -> Conv -> ReLU -> Pool -> … -> Pool -> FC -> Output
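As a rough illustration of this layer pattern, a PyTorch sketch with made-up layer counts and channel sizes (not a specific architecture from these notes):

```python
import torch
import torch.nn as nn

# Illustrative Conv -> ReLU -> Pool blocks followed by an FC classifier.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 10),    # FC -> Output (10 classes, hypothetical)
)

x = torch.randn(1, 3, 32, 32)     # one 32x32x3 image, channels-first
print(model(x).shape)             # torch.Size([1, 10])
```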

Common to zero pad the border

  • Add a border of zero pixels around the input
  • Stride is typically 1
  • Each conv layer shrinks the spatial size (shrinking too fast is bad), so we add padding at the border

E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 …). Shrinking too fast is not good; it doesn't work well.

Each filter has 5*5*3 + 1 = 76 params (+1 for bias)

With 10 such filters: 76 * 10 = 760 params in total.

Summary

Input width, height, and depth: $W_1 \times H_1 \times D_1$

Four hyperparameters:
  • K — number of filters (usually a power of 2, e.g. 32, 64, 128)
  • F — spatial extent of the filter
  • S — stride
  • P — amount of zero padding

Common settings:

  • K = (powers of 2, e.g. 32, 64, 128, 512)
  • F = 3, S = 1, P = 1
  • F = 5, S = 1, P = 2
  • F = 5, S = 2, P = ? (whatever fits)
  • F = 1, S = 1, P = 0

A 1x1 filter still makes sense: it convolves over the full depth.

F is usually odd.

By default we deal with square inputs and filters.

Usually a single layer does not mix filters of several sizes.

The output volume's width, height, and depth:
$W_2 = (W_1 - F + 2P)/S + 1$, $H_2 = (H_1 - F + 2P)/S + 1$, $D_2 = K$

Because of parameter sharing, each filter has $F \cdot F \cdot D_1$ weights, giving $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases in total.

In the output volume, the $d$-th depth slice (of size $W_2 \times H_2$) is the result of convolving the $d$-th filter over the input.
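A small Python helper that applies these formulas; the parameter names mirror the symbols above:

```python
def conv_layer_summary(W1, H1, D1, K, F, S, P):
    """Output size and parameter count of a conv layer."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    weights = F * F * D1 * K          # parameter sharing: F*F*D1 weights per filter
    biases = K
    return (W2, H2, D2), weights + biases

# 32x32x3 input, ten 5x5 filters, stride 1, no padding -> 28x28x10 output, 760 params
print(conv_layer_summary(32, 32, 3, K=10, F=5, S=1, P=0))
```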

The Conv Layer in the neuron view

An activation map is a 28x28 sheet of neuron outputs:
  1. Each is connected to a small region in the input
  2. All of them share parameters, because they all come from the same filter

“5x5 filter” -> “5x5 receptive field for each neuron”

Pool / FC

Pooling Layers

Performs downsampling, e.g. from 224x224x64 -> 112x112x64

  • makes the representations smaller and more manageable
  • operates over each activation map independently:

The most common form of downsampling is max pooling: take the max over a 2x2 filter applied at stride 2.

Pooling does not change the depth.
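A minimal numpy sketch of 2x2 max pooling at stride 2, acting on each depth slice independently:

```python
import numpy as np

def max_pool_2x2(volume):
    """2x2 max pooling, stride 2, applied to each depth slice independently."""
    H, W, D = volume.shape
    out = np.zeros((H // 2, W // 2, D))
    for d in range(D):                       # depth is preserved
        for i in range(0, H - 1, 2):
            for j in range(0, W - 1, 2):
                out[i // 2, j // 2, d] = volume[i:i+2, j:j+2, d].max()
    return out

x = np.random.randn(224, 224, 64)
print(max_pool_2x2(x).shape)   # (112, 112, 64)
```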

Fully Connected Layer (FC)

  • Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

JavaScript Demo

Visualization is hard.

Case Study

  • LeNet-5 [LeCun et al., 1998]

Conv -> Subsampling -> Conv -> Subsampling -> FC -> FC -> Output

AlexNet

8 layers

Input: 227x227x3 images

At the time there wasn't enough GPU memory, so the network was split into two separate streams (we won't consider this here).

First layer (CONV1): 96 11x11 filters applied at stride 4

=> Output volume: 55x55x96

Parameters: (11*11*3)*96 ≈ 35K

Second layer (POOL1): 3x3 filters at stride 2. No parameters.

Pooling layers have no parameters.

Details/Retrospectives:
  • first use of ReLU
  • used Norm layers (not common anymore)
  • heavy data augmentation
  • dropout 0.5 (on the last few FC layers)
  • batch size 128
  • SGD Momentum 0.9
  • Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
  • L2 weight decay 5e-4
  • 7 CNN ensemble: 18.2% -> 15.4%
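A hedged PyTorch sketch of these training settings (torchvision's single-stream `alexnet` stands in for the original two-stream model, and `ReduceLROnPlateau` approximates the manual learning-rate drops):

```python
import torch
import torchvision

model = torchvision.models.alexnet(num_classes=1000)

# SGD + momentum 0.9, lr 1e-2, L2 weight decay 5e-4, as listed above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# "reduced by 10 manually when val accuracy plateaus" -- approximated here
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)

batch_size = 128   # dropout 0.5 on the FC layers is already the torchvision default
```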

FC7 features: the output of the last FC layer before the classifier.

ZFNet

8 layers

[Zeiler and Fergus, 2013]

AlexNet but:

  • CONV1: change from (11x11 stride 4) to (7x7 stride 2)
  • CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 15.4% -> 14.8%

VGGNet

19 layers

[Simonyan and Zisserman, 2014]

Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2

19 layers; 11.2% top-5 error in ILSVRC 2013 -> 7.3% top-5 error

Most of the parameters are in the FC layers.
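A PyTorch sketch of the resulting building block (channel counts are illustrative; the point is the repeated 3x3 / stride 1 / pad 1 conv followed by a 2x2 / stride 2 pool):

```python
import torch.nn as nn

# Every conv is 3x3, stride 1, pad 1; every pool is 2x2, stride 2.
vgg_stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),      # halves the spatial size
)
```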

GoogLeNet

22 layers

[Szegedy et al., 2014]

Inception Module

Stacked Inception layers, with average pooling at the end.

reduce computation & memory
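A hedged PyTorch sketch of an Inception module (the branch channel counts here are illustrative; the 1x1 convolutions in front of the 3x3 and 5x5 branches are what reduce computation and memory):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pool branches, concatenated along depth."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # concatenate the branch outputs along the depth dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)   # torch.Size([1, 256, 28, 28])
```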

ResNet

152 layers [He et al., 2015]

It's unclear exactly when it converged… possibly they simply stopped after two weeks of training because they were tired…

ILSVRC 2015 winner (3.6% top 5 error)

plain net vs res net

  • 2-3 weeks of training on 8 GPU machine
  • at runtime: faster than a VGGNet! (even though it has 8x more layers)

Skip connections

res net vs plain net


  • Batch Normalization after every CONV layer
  • Xavier/2 initialization from He et al.
  • SGD + Momentum (0.9)
  • Learning rate: 0.1, divided by 10 when validation error plateaus
  • Mini-batch size 256
  • Weight decay of 1e-5
  • No dropout used
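A sketch of a basic residual block following these settings (Batch Normalization after every conv, an identity skip connection added back before the final ReLU; the channel count is illustrative):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """conv-BN-ReLU-conv-BN plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)      # skip connection: add the input back

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```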

Alpha Go

policy network

[19x19x48] Input
  • CONV1: 192 5x5 filters, stride 1, pad 2 => [19x19x192]
  • CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
  • CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] => probability map of promising moves
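A hedged PyTorch transcription of this layer list (the final softmax over the 19x19 board is an assumption about how the probability map is produced):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """[19x19x48] board features -> probability map over the 19x19 moves."""
    def __init__(self):
        super().__init__()
        layers = [nn.Conv2d(48, 192, 5, stride=1, padding=2), nn.ReLU()]   # CONV1
        for _ in range(11):                                                # CONV2..12
            layers += [nn.Conv2d(192, 192, 3, stride=1, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(192, 1, 1, stride=1, padding=0)]              # final 1x1 conv
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.net(x).flatten(1)                  # [N, 19*19]
        return torch.softmax(logits, dim=1).view(-1, 19, 19)

x = torch.randn(1, 48, 19, 19)
print(PolicyNetwork()(x).shape)    # torch.Size([1, 19, 19])
```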