ConvNets

Mathematically

32x32x3 image (3 channels)

Filter: a 5x5x3 filter. Convolve the filter with the image: slide it over the image spatially.

Activation map => 28 x 28 x 1, computed as $w^T x + b$ at each spatial location.

Then, with a second 5x5x3 filter, we get another activation map.

If we had six 5x5 filters, we'd get 6 separate activation maps.

Stack them up to get a new 28x28x6 image. This operation is called convolution.
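A minimal numpy sketch of the operation just described (shapes follow the 32x32x3 input and six 5x5x3 filters from the example; the explicit loops are for clarity, not speed):

```python
import numpy as np

# Example shapes from the notes: 32x32x3 input, six 5x5x3 filters, stride 1, no padding.
H, W, C = 32, 32, 3
K, F = 6, 5

image = np.random.randn(H, W, C)
filters = np.random.randn(K, F, F, C)
biases = np.random.randn(K)

out_h, out_w = H - F + 1, W - F + 1            # 28 x 28
activation_maps = np.zeros((out_h, out_w, K))  # stacked -> 28 x 28 x 6

for k in range(K):                # one activation map per filter
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i+F, j:j+F, :]     # 5x5x3 region under the filter
            # each output value is w^T x + b for that patch
            activation_maps[i, j, k] = np.sum(patch * filters[k]) + biases[k]

print(activation_maps.shape)      # (28, 28, 6)
```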

Conv ReLU

Conv -> ReLU -> Pool -> Conv -> ReLU -> Pool -> … -> Pool -> FC -> Output
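As a rough illustration of this layer pattern, a PyTorch sketch with made-up layer counts and channel sizes (not a specific architecture from these notes):

```python
import torch
import torch.nn as nn

# Illustrative Conv -> ReLU -> Pool blocks followed by an FC classifier.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 10),    # FC -> Output (10 classes, hypothetical)
)

x = torch.randn(1, 3, 32, 32)     # one 32x32x3 image, channels-first
print(model(x).shape)             # torch.Size([1, 10])
```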

Common to zero pad the border

  • Add a border of zero pixels around the input
  • Stride is typically 1
  • Each conv layer shrinks the spatial size (shrinking too fast is bad), so we add padding at the border

E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 …). Shrinking too fast is not good; it doesn't work well.

Each filter has 5*5*3 + 1 = 76 params (+1 for bias)

With 10 such filters: 76 * 10 = 760 params in total.

Summary

Input width, height, and depth: $W_1 \times H_1 \times D_1$

Four hyperparameters:
  • K — number of filters (usually a power of 2, e.g. 32, 64, 128)
  • F — spatial extent of the filter
  • S — stride
  • P — amount of zero padding

Common settings:

  • K = (powers of 2, e.g. 32, 64, 128, 512)
  • F = 3, S = 1, P = 1
  • F = 5, S = 1, P = 2
  • F = 5, S = 2, P = ? (whatever fits)
  • F = 1, S = 1, P = 0

A 1x1 filter still makes sense: it convolves over the full depth.

F is usually odd.

By default we deal with square inputs and filters.

Usually a single layer does not mix filters of several sizes.

The output volume's width, height, and depth:
$W_2 = (W_1 - F + 2P)/S + 1$, $H_2 = (H_1 - F + 2P)/S + 1$, $D_2 = K$

Because of parameter sharing, each filter has $F \cdot F \cdot D_1$ weights, giving $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases in total.

In the output volume, the $d$-th depth slice (of size $W_2 \times H_2$) is the result of convolving the $d$-th filter over the input.
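A small Python helper that applies these formulas; the parameter names mirror the symbols above:

```python
def conv_layer_summary(W1, H1, D1, K, F, S, P):
    """Output size and parameter count of a conv layer."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    weights = F * F * D1 * K          # parameter sharing: F*F*D1 weights per filter
    biases = K
    return (W2, H2, D2), weights + biases

# 32x32x3 input, ten 5x5 filters, stride 1, no padding -> 28x28x10 output, 760 params
print(conv_layer_summary(32, 32, 3, K=10, F=5, S=1, P=0))
```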

The Conv Layer in the neuron view

An activation map is a 28x28 sheet of neuron outputs:
  1. Each is connected to a small region in the input
  2. All of them share parameters, because they all come from the same filter

“5x5 filter” -> “5x5 receptive field for each neuron”

Pool / FC

Pooling Layers

Performs downsampling, e.g. from 224x224x64 -> 112x112x64

  • makes the representations smaller and more manageable
  • operates over each activation map independently:

The most common form of downsampling is max pooling: take the max over a 2x2 filter applied at stride 2.

Pooling does not change the depth.
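A minimal numpy sketch of 2x2 max pooling at stride 2, acting on each depth slice independently:

```python
import numpy as np

def max_pool_2x2(volume):
    """2x2 max pooling, stride 2, applied to each depth slice independently."""
    H, W, D = volume.shape
    out = np.zeros((H // 2, W // 2, D))
    for d in range(D):                       # depth is preserved
        for i in range(0, H - 1, 2):
            for j in range(0, W - 1, 2):
                out[i // 2, j // 2, d] = volume[i:i+2, j:j+2, d].max()
    return out

x = np.random.randn(224, 224, 64)
print(max_pool_2x2(x).shape)   # (112, 112, 64)
```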

Fully Connected Layer (FC)

  • Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

JavaScript Demo

Visualization is hard.

Case Study

  • LeNet-5 [LeCun et al., 1998]

Conv -> Subsampling -> Conv -> Subsampling -> FC -> FC -> Output

AlexNet

8 layers

Input: 227x227x3 images

At the time there wasn't enough GPU memory, so the network was split into two separate streams (we won't consider this here).

First layer (CONV1): 96 11x11 filters applied at stride 4

=> Output volume: 55x55x96

Parameters: (11*11*3)*96 ≈ 35K

Second layer (POOL1): 3x3 filters at stride 2. No parameters.

Pooling layers have no parameters.

Details/Retrospectives:
  • first use of ReLU
  • used Norm layers (not common anymore)
  • heavy data augmentation
  • dropout 0.5 (on the last few FC layers)
  • batch size 128
  • SGD Momentum 0.9
  • Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
  • L2 weight decay 5e-4
  • 7 CNN ensemble: 18.2% -> 15.4%
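A hedged PyTorch sketch of these training settings (torchvision's single-stream `alexnet` stands in for the original two-stream model, and `ReduceLROnPlateau` approximates the manual learning-rate drops):

```python
import torch
import torchvision

model = torchvision.models.alexnet(num_classes=1000)

# SGD + momentum 0.9, lr 1e-2, L2 weight decay 5e-4, as listed above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# "reduced by 10 manually when val accuracy plateaus" -- approximated here
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)

batch_size = 128   # dropout 0.5 on the FC layers is already the torchvision default
```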

FC7 features: the output of the last FC layer before the classifier.

ZFNet

8 layers

[Zeiler and Fergus, 2013]

AlexNet but:

  • CONV1: change from (11x11 stride 4) to (7x7 stride 2)
  • CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 15.4% -> 14.8%

VGGNet

19 layers

[Simonyan and Zisserman, 2014]

Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2

19 layers; 11.2% top-5 error in ILSVRC 2013 -> 7.3% top-5 error

Most of the parameters are in the FC layers.
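A PyTorch sketch of the resulting building block (channel counts are illustrative; the point is the repeated 3x3 / stride 1 / pad 1 conv followed by a 2x2 / stride 2 pool):

```python
import torch.nn as nn

# Every conv is 3x3, stride 1, pad 1; every pool is 2x2, stride 2.
vgg_stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),      # halves the spatial size
)
```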

GoogLeNet

22 layers

[Szegedy et al., 2014]

Inception Module

Stacked Inception layers, with average pooling at the end.

reduce computation & memory
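A hedged PyTorch sketch of an Inception module (the branch channel counts here are illustrative; the 1x1 convolutions in front of the 3x3 and 5x5 branches are what reduce computation and memory):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pool branches, concatenated along depth."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # concatenate the branch outputs along the depth dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)   # torch.Size([1, 256, 28, 28])
```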

ResNet

152 layers [He et al., 2015]

It's unclear exactly when it converged… possibly they simply stopped after two weeks of training because they were tired…

ILSVRC 2015 winner (3.6% top 5 error)

plain net vs res net

  • 2-3 weeks of training on 8 GPU machine
  • at runtime: faster than a VGGNet! (even though it has 8x more layers)

Skip connections

res net vs plain net


  • Batch Normalization after every CONV layer
  • Xavier/2 initialization from He et al.
  • SGD + Momentum (0.9)
  • Learning rate: 0.1, divided by 10 when validation error plateaus
  • Mini-batch size 256
  • Weight decay of 1e-5
  • No dropout used
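A sketch of a basic residual block following these settings (Batch Normalization after every conv, an identity skip connection added back before the final ReLU; the channel count is illustrative):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """conv-BN-ReLU-conv-BN plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)      # skip connection: add the input back

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```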

Alpha Go

policy network

[19x19x48] Input
  • CONV1: 192 5x5 filters, stride 1, pad 2 => [19x19x192]
  • CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
  • CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] => probability map of promising moves
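A hedged PyTorch transcription of this layer list (the final softmax over the 19x19 board is an assumption about how the probability map is produced):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """[19x19x48] board features -> probability map over the 19x19 moves."""
    def __init__(self):
        super().__init__()
        layers = [nn.Conv2d(48, 192, 5, stride=1, padding=2), nn.ReLU()]   # CONV1
        for _ in range(11):                                                # CONV2..12
            layers += [nn.Conv2d(192, 192, 3, stride=1, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(192, 1, 1, stride=1, padding=0)]              # final 1x1 conv
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.net(x).flatten(1)                  # [N, 19*19]
        return torch.softmax(logits, dim=1).view(-1, 19, 19)

x = torch.randn(1, 48, 19, 19)
print(PolicyNetwork()(x).shape)    # torch.Size([1, 19, 19])
```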