Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

GoogLeNet-v4

Top-5 error: 3.08%


Segmentation

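  • Semantic Segmentation (labels every pixel with a class, without knowing how many instances there are)

  • Instance Segmentation (SDS, simultaneous detection and segmentation: every individual is distinguished, so different people get separate labels)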
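
The difference between the two tasks can be seen in the label maps themselves. A minimal sketch, assuming a made-up 4x6 scene with two people (the arrays and class ids are illustrative, not from the lecture):

```python
import numpy as np

# Semantic segmentation: one class id per pixel (0 = background, 1 = person).
# Both people share the same label, so the map cannot tell them apart.
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: each object gets its own id (0 = background).
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

num_classes_present = len(np.unique(semantic)) - 1  # background excluded
num_instances = len(np.unique(instance)) - 1        # background excluded
print(num_classes_present, num_instances)
```

Here the semantic map contains one foreground class while the instance map contains two separate objects of that class.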

Semantic Segmentation

Figure credit: Shotton et al, “TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context”, IJCV 2007

Instance Segmentation

Figure credit: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Semantic Segmentation

Because of pooling, the output is smaller than the input image
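
How much smaller depends on the number of pooling layers. A small sketch of the arithmetic, assuming unpadded 2x2 pooling with stride 2 and a 224-pixel input with five pooling layers (these numbers are illustrative assumptions, not from the notes):

```python
def pooled_size(size, num_pools, window=2, stride=2):
    """Spatial size after num_pools unpadded pooling layers."""
    for _ in range(num_pools):
        size = (size - window) // stride + 1
    return size

# 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(pooled_size(224, 5))
```

A 7x7 prediction for a 224x224 image is why the output has to be upsampled back to pixel resolution.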

image pyramid

Resize to multiple different sizes

For each scale -> run one CNN per scale -> upsample the outputs and concatenate
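
The multi-scale pipeline above can be sketched in NumPy. This is only an illustration: `toy_cnn` is a hypothetical stand-in (average pooling plus a fixed random projection) for a real network, and nearest-neighbour resizing is assumed for both the pyramid and the upsampling.

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array."""
    h, w = x.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[rows][:, cols]

def toy_cnn(x, num_classes=4):
    """Hypothetical stand-in for a CNN: 2x2 average pooling, then a
    fixed random projection from channels to class scores."""
    h, w, c = x.shape
    x = x[:h - h % 2, :w - w % 2]
    pooled = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    weights = np.random.default_rng(0).standard_normal((c, num_classes))
    return pooled @ weights

def pyramid_features(image, scales=(0.5, 1.0, 2.0)):
    h, w = image.shape[:2]
    outputs = []
    for s in scales:
        scaled = resize_nearest(image, int(h * s), int(w * s))  # resize
        scores = toy_cnn(scaled)                                # one CNN per scale
        outputs.append(resize_nearest(scores, h, w))            # upsample
    return np.concatenate(outputs, axis=-1)                     # concatenate

image = np.random.default_rng(1).random((8, 8, 3))
features = pyramid_features(image)
print(features.shape)  # (8, 8, 12): 3 scales x 4 class scores per pixel
```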

Method 2

Three RGB channels

Apply CNN once

More iterations improve results

Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014


Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Learnable upsampling!

skip connections

Better results

Skip from pool3 or pool4 to the final layers
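
One way to picture a skip connection is as a sum of class-score maps at matching resolution. A rough sketch, assuming random score maps and fixed nearest-neighbour upsampling in place of the learned upsampling (the names `final_scores` and `pool4_scores` are illustrative):

```python
import numpy as np

def upsample2x(scores):
    """Nearest-neighbour 2x upsampling of an (H, W, C) score map.
    A fixed stand-in for the learnable upsampling layer."""
    return scores.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
num_classes = 3
final_scores = rng.standard_normal((2, 2, num_classes))  # coarse, after all pooling
pool4_scores = rng.standard_normal((4, 4, num_classes))  # finer, from an earlier layer

# Skip connection: bring the coarse scores up to the finer grid and add.
fused = upsample2x(final_scores) + pool4_scores
labels = fused.argmax(axis=-1)  # per-pixel class on the 4x4 grid
print(fused.shape, labels.shape)
```

The earlier layer has seen less pooling, so its scores carry finer spatial detail than the final layer alone.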

upsampling

Deconvolution

!!!Input gives weight for filter

The convolution uses stride 1; the deconvolution uses stride 2

Where outputs overlap, sum them

Same as backward pass for normal convolution!
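
The deconvolution described above (each input value weights a copy of the filter, copies are placed `stride` apart, overlaps are summed) can be sketched as follows. This is a minimal single-channel version, assuming a square kernel and no padding; it is not how a framework would implement it.

```python
import numpy as np

def conv_transpose2d(x, kernel, stride=2):
    """Transposed convolution of a single-channel (H, W) input.
    Assumes a square kernel and no padding."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            # The input value gives the weight for a copy of the filter...
            patch = x[i, j] * kernel
            # ...which is added into the output, so overlapping copies sum.
            out[i * stride:i * stride + k, j * stride:j * stride + k] += patch
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((3, 3))
y = conv_transpose2d(x, kernel, stride=2)
print(y.shape)  # (5, 5)
print(y)
```

With a 3x3 filter and stride 2, neighbouring copies overlap by one row or column. The centre output `y[2, 2]` is covered by all four copies, so it equals 1 + 2 + 3 + 4 = 10.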

“inverse of convolution”

Names:

convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution

Papers that argue about the name:

  • Im et al, “Generating images with recurrent adversarial networks”, arXiv 2016
  • Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016

Instance Segmentation

Input -> Region Proposal (Segment Proposal) -> External Segment Proposal

Then two paths: one for feature extraction, the other an R-CNN

-> Region Classification

-> Refinement


Decide whether each pixel is foreground or background

Hariharan et al, “Hypercolumns for Object Segmentation and Fine-grained Localization”, CVPR 2015


Winner of the COCO 2015 instance segmentation challenge (Microsoft Research)

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Region proposal network (RPN)

Then apply RoI warping and pooling

faster R-CNN && other to captioning

COCO 2015 performance

Attention Models

Each step does not process the whole input; it processes only the part that is attended to most

Very popular recently

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

R-CNN

Soft Attention

reinforcement learning

However, it is unclear how well this works on unconstrained images

Soft Attention for Translation

Sequence -> Sequence

Video captioning, attention over input frames:

  • Yao et al, “Describing Videos by Exploiting Temporal Structure”, ICCV 2015

Image, question to answer, attention over image:

  • Xu and Saenko, “Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering”, arXiv 2015

  • Zhu et al, “Visual7W: Grounded Question Answering in Images”, arXiv 2015

RNN handwriting

Graves, “Generating Sequences with Recurrent Neural Networks”, arXiv 2013

Demo http://www.cs.toronto.edu/~graves/handwriting.html

Spatial Transformer Networks

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Soft attention:

  • Easy to implement: produce distribution over input locations, reweight features and feed as input
  • Attend to arbitrary input locations using spatial transformer networks
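
The first bullet (distribution over locations, then reweight features) can be sketched in a few lines. This assumes the relevance scores are already given; in a captioning model they would come from the network's state, which is left out here. The function name and toy values are illustrative.

```python
import numpy as np

def soft_attention(features, scores):
    """features: (L, D), one feature vector per input location.
    scores: (L,), unnormalised relevance of each location."""
    scores = scores - scores.max()                 # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # distribution over locations
    context = weights @ features                   # reweighted features, shape (D,)
    return context, weights

features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
scores = np.zeros(3)  # equal scores give uniform attention
context, weights = soft_attention(features, scores)
print(weights, context)
```

Every step is differentiable, so the scores can be trained with ordinary backpropagation.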

Hard attention:

  • Attend to a single input location
  • Can’t use gradient descent!
  • Need reinforcement learning
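
Sampling a single location is not differentiable, which is why hard attention falls back on reinforcement learning. A minimal sketch of a score-function (REINFORCE-style) gradient estimate, assuming a toy reward that is 1 only when location 2 is chosen; the reward function, sample count, and names are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(logits, reward_fn, rng, num_samples=1000):
    """Monte Carlo estimate of the gradient of expected reward
    with respect to the attention logits."""
    p = softmax(logits)
    grad = np.zeros_like(logits)
    for _ in range(num_samples):
        loc = rng.choice(len(p), p=p)   # hard choice: a single location
        one_hot = np.eye(len(p))[loc]
        # Gradient of log p[loc] w.r.t. the logits is (one_hot - p).
        grad += reward_fn(loc) * (one_hot - p)
    return grad / num_samples

rng = np.random.default_rng(0)
logits = np.zeros(3)
grad = reinforce_grad(logits, lambda loc: 1.0 if loc == 2 else 0.0, rng)
print(grad)
```

The estimate pushes up the logit of the rewarded location and pushes the others down, without ever differentiating through the sampling step.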

Selectively paying attention to different parts of the image