Feature-based approaches to Activity Recognition

Dense trajectories and motion boundary descriptors for action recognition

Wang et al., 2013

Action Recognition with Improved Trajectories

Wang and Schmid, 2013

Used to track what is changing in the video

[T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” 2011]

optical flow

Q: What if the input is now a small chunk of video? E.g. [227x227x3x15] ?

A: Extend the convolutional filters in time and perform spatio-temporal convolutions! E.g. can have 11x11xT filters, where T = 2..15.

The filter slides not only in space but also in time
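The idea above can be sketched with a naive numpy implementation (really cross-correlation, as in ConvNets): the filter slides along time as well as height and width. Sizes here are toy-scale, not the 227x227 from the slide.

```python
import numpy as np

# "Valid" spatio-temporal convolution.
# clip: (C, T, H, W) video chunk; filt: (C, t, h, w) filter.
def conv3d_valid(clip, filt):
    C, T, H, W = clip.shape
    _, t, h, w = filt.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):          # slide in time
        for j in range(out.shape[1]):      # slide in height
            for k in range(out.shape[2]):  # slide in width
                out[i, j, k] = np.sum(clip[:, i:i+t, j:j+h, k:k+w] * filt)
    return out

# toy clip: 3 channels, 15 frames, 16x16 spatial; 11x11xT filter with T = 3
rng = np.random.default_rng(0)
clip = rng.normal(size=(3, 15, 16, 16))
filt = rng.normal(size=(3, 3, 11, 11))
response = conv3d_valid(clip, filt)
print(response.shape)  # (13, 6, 6): one response map per temporal position
```

Note the output has a time axis too: the filter produces a response at each of the 13 valid temporal offsets.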

Spatio-Temporal ConvNets

Before AlexNet: [3D Convolutional Neural Networks for Human Action Recognition, Ji et al., 2010]

[Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011]

After

[Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014]

spatio-temporal convolutions

Single Frame -> a very good option; recommended to try this first

Model & Gain


C3D architecture

Extends the small VGG-style filters in time (3x3x3 convolutions); works well

[Learning Spatiotemporal Features with 3D Convolutional Networks, Tran et al. 2015]


2D convolution

Simonyan is one of the authors of VGG

[Two-Stream Convolutional Networks for Action Recognition in Videos, Simonyan and Zisserman 2014]

Two-stream version works much better than either alone.
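The simplest way to combine the two streams is late fusion: average the per-class scores from the spatial (RGB) stream and the temporal (optical-flow) stream. A toy sketch with made-up score vectors:

```python
import numpy as np

# Hypothetical softmax scores over 3 action classes (numbers are made up)
spatial_scores  = np.array([0.70, 0.20, 0.10])  # RGB frame stream
temporal_scores = np.array([0.40, 0.50, 0.10])  # optical-flow stream

fused = (spatial_scores + temporal_scores) / 2  # late fusion by averaging
pred = int(np.argmax(fused))
print(fused, pred)
```

The fused scores remain a valid distribution, and classes that only one stream is confident about still contribute to the final prediction.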

Longer Term Events

Use an LSTM

[Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al., 2015]

[Beyond Short Snippets: Deep Networks for Video Classification, Ng et al., 2015]

Wild!!!

(This paper was way ahead of its time. Cited 65 times.)

Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011

Three architectures

  • Model temporal motion locally (3D CONV)
  • Model temporal motion globally (LSTM / RNN)
  • Fuse both approaches in the same model

idea: fast-forward / rewind

Long-time Spatio-Temporal ConvNets

[Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016]

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Very important; this should be usable

A GRU is a simplified variant of the LSTM
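To see in what sense the GRU simplifies the LSTM: it merges the forget/input gates into one update gate and drops the separate cell state. A minimal numpy sketch of a single GRU step (weight shapes and scales are arbitrary here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One GRU step: update gate z, reset gate r, candidate state h_tilde.
def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # interpolate old and new state

rng = np.random.default_rng(0)
d_in, d_h = 8, 4
# input-to-hidden weights (d_h, d_in) at even indices, recurrent (d_h, d_h) at odd
params = [rng.normal(scale=0.1, size=(d_h, d_in)) if i % 2 == 0
          else rng.normal(scale=0.1, size=(d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
for _ in range(5):  # unroll over 5 time steps of random "frame features"
    h = gru_step(rng.normal(size=d_in), h, *params)
print(h.shape)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden state stays bounded over arbitrarily many steps.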


Infinite (in theory) temporal extent

(neurons that are function of all video frames in the past)

Note

When working on video, first consider a Spatio-Temporal Video ConvNet

Decide whether you need local motion (3D Conv) or global motion (LSTM)

Try out using Optical Flow in a second stream

Try out GRU-RCN! (imo best model)

Unsupervised

Clustering, dimensionality reduction, feature learning, generative models, etc

Autoencoders

  • Vanilla (traditional): feature learning
  • Variational: generate samples
  • Generative Adversarial Networks: generate samples

Encoder

Input data goes through the Encoder to produce Features

The Features are usually smaller than the Data, so this can be used for dimensionality reduction

The Decoder then reconstructs the Input Data from the Features

The Decoder and Encoder share weights


  • After training, throw away decoder!

  • Use encoder to initialize a supervised model


A three-layer NN whose input and output are the same; what we want are the middle-layer Features, for use elsewhere

PCA can solve part of this; here the emphasis is more on Reconstruction
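The encoder/decoder setup above can be sketched in a few lines of numpy: one tied weight matrix, ReLU features smaller than the input, and an L2 reconstruction loss (the sizes and initialization scale here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 16, 4                       # features smaller than the input
W = rng.normal(scale=0.1, size=(d_feat, d_in))

def encode(x):
    return np.maximum(0.0, W @ x)          # ReLU features

def decode(z):
    return W.T @ z                         # tied weights: decoder uses W^T

x = rng.normal(size=d_in)
z = encode(x)
x_hat = decode(z)
loss = np.mean((x - x_hat) ** 2)           # L2 reconstruction loss
print(z.shape, float(loss))
```

After training the reconstruction loss, the decoder is discarded and `encode` serves as the learned feature extractor for a downstream supervised model.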

Variational Autoencoder: Encoder

Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014

Encoder

Variational Autoencoder

Didn't understand this at all ==

no idea
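For later reference, the two mechanical pieces that make the VAE trainable (not covered in these notes; a minimal numpy sketch with made-up encoder outputs) are the reparameterization trick and the closed-form Gaussian KL term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced these for one input (made-up numbers):
mu = np.array([0.5, -0.2])
log_var = np.array([0.0, -1.0])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so sampling is differentiable w.r.t. mu and log_var.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL( N(mu, sigma^2) || N(0, 1) ) in closed form, summed over latent dims
kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
print(z.shape, float(kl))
```

The training loss is this KL term plus a reconstruction term from decoding `z`, which is what distinguishes the variational autoencoder from the vanilla one.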

Goodfellow et al, “Generative Adversarial Nets”, NIPS 2014

Generative Adversarial Nets: Multiscale

Generator is an upsampling network with fractionally-strided convolutions

Discriminator is a convolutional network

Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
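The fractionally-strided (transposed) convolutions in the generator double the spatial size at each layer. The output-size formula and the resulting DCGAN-style size progression (assuming the common 4x4 kernel, stride 2, padding 1):

```python
# Output size of a transposed convolution:
#   out = (in - 1) * stride - 2 * pad + kernel
def tconv_out(n, kernel=4, stride=2, pad=1):
    return (n - 1) * stride - 2 * pad + kernel

# Generator upsampling path: double spatial size at each layer, 4 -> 64
sizes = [4]
for _ in range(4):
    sizes.append(tconv_out(sizes[-1]))
print(sizes)  # [4, 8, 16, 32, 64]
```

This is the mirror image of a strided convolution in the discriminator, which halves the spatial size with the same kernel/stride/padding choice.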

!!!!!!!!!!!! Super impressive

Radford et al,ICLR 2016

Generative Adversarial Nets: Vector Math

(figure: latent-space vector arithmetic with "smile" and "glasses" attributes)

Put everything together

Dosovitskiy and Brox, “Generating Images with Perceptual Similarity Metrics based on Deep Networks”, arXiv 2016