Notes on RNNs, Image Captioning, and LSTMs (Stanford CS231n)
Recurrent Neural Network
CNN
- fixed-size image -> fixed-size output vector (class label): one-to-one
Image Captioning
- image -> sequence of words: one-to-many
Sentiment Classification
- sequence of words -> sentiment (positive or negative): many-to-one
Machine Translation
- sequence of words -> sequence of words: many-to-many
Video classification on frame level
- classify every frame, where each prediction may depend on the current frame and all frames before it: many-to-many
Sequential processing of fixed inputs / fixed outputs
Even when the input and output have fixed sizes, we can still process them sequentially:
- Visual Attention, Ba et al.
- DRAW: A Recurrent Neural Network for Image Generation, Gregor et al.
RNN
An RNN maintains an internal state. As input vectors are fed into the RNN over time, the state is updated after each one. Usually we also want to predict an output vector at some (or every) time step.
      y
      ^
      |
    [RNN] <-- (recurrent connection back to itself)
      ^
      |
      x
Predictions are made from the RNN's current state, and that state persists and evolves across time steps.
We can process a sequence of vectors $x$ by applying a recurrence formula at every time step: $h_t = f_W(h_{t-1}, x_t)$, where $h_t$ is the new state, $h_{t-1}$ is the old state, $x_t$ is the input at step $t$, and $f_W$ is the recurrence function with parameters $W$.
We train the parameters $W$ of $f_W$; when using the network, the same $f_W$ is applied at every single step, no matter how long the input or output sequences are.
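A minimal sketch (not from the notes) of this idea, assuming a tanh recurrence and illustrative parameter names: the same weights are reused at every step, so the loop handles sequences of any length.
import numpy as np

def f_W(h_prev, x, Wxh, Whh, b):
  # one recurrence step: h_t = f_W(h_{t-1}, x_t)
  return np.tanh(np.dot(Whh, h_prev) + np.dot(Wxh, x) + b)

D, H = 8, 16                                # arbitrary input / hidden sizes
Wxh, Whh, b = np.random.randn(H, D), np.random.randn(H, H), np.zeros((H, 1))
h = np.zeros((H, 1))                        # initial state
for x in np.random.randn(20, D, 1):         # a sequence of 20 input vectors
  h = f_W(h, x, Wxh, Whh, b)                # same W at every time step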
Vanilla RNN
The simplest case: the entire state is a single hidden vector $h$.
For character-level language modeling, we feed one character at a time into the RNN and train it to predict the next character.
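Concretely, with the same weight names as the code below ($W_{xh}$, $W_{hh}$, $W_{hy}$), each step computes
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad y_t = W_{hy} h_t + b_y$$
where $y_t$ holds the unnormalized log-probabilities (scores) over the next character. The following character-level RNN code implements exactly this update: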
import numpy as np
# data I/O
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for (we cannot unroll over the whole text, so we train on chunks of this length)
learning_rate = 1e-1
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass (compute loss)
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state (the core RNN recurrence)
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
# sampling uses the same weight matrices as training
def sample(h, seed_ix, n):
  """
  sample a sequence of integers from the model
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in range(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
# main loop: sweep through the data in chunks of seq_length (25) characters
while True:
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0:
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print('----\n %s \n----' % (txt, ))
  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print('iter %d, loss: %f' % (n, smooth_loss)) # print progress
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                [dWxh, dWhh, dWhy, dbh, dby],
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update
  p += seq_length # move data pointer
  n += 1 # iteration counter
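To try this out: the script reads a plain-text file named input.txt from the working directory (any reasonably long text works), prints a 200-character sample plus the smoothed loss every 100 iterations, and runs until interrupted, since the while True loop never terminates on its own.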
Searching for interpretable cells
[Visualizing and Understanding Recurrent Networks, Andrej Karpathy, Justin Johnson, Li Fei-Fei]
- quote detection cell
- line length tracking cell
- if statement cell
- quote/comment cell
- code depth cell
Image Captioning Sentence Datasets
Microsoft COCO Tsung-Yi Lin et al. 2014
LSTM
Show, Attend and Tell, Xu et al., 2015
- Take the RNN's output at each step and use it to decide where in the image to look next.
- The RNN attends spatially to different parts of the image while generating each word of the sentence.
RNNs can also be stacked into multiple layers, so the hidden states extend over both depth and time.
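One common way to write the stacked update (biases omitted): layer $\ell$ at time $t$ combines the hidden state from the layer below at the same time step with the hidden state of the same layer at the previous time step,
$$h_t^{\ell} = \tanh\!\left( W^{\ell} \begin{pmatrix} h_t^{\ell-1} \\ h_{t-1}^{\ell} \end{pmatrix} \right), \qquad h_t^{0} = x_t.$$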
LSTM (Long Short-Term Memory)
[Hochreiter and Schmidhuber, 1997]
In addition to the hidden state $h_t$, the LSTM keeps a cell state vector $c_t$.
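The per-step LSTM update, in the same stacked notation (biases omitted; $\sigma$ is the sigmoid and $\tanh$ applied blockwise, $\odot$ is elementwise multiplication):
$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W^{\ell} \begin{pmatrix} h_t^{\ell-1} \\ h_{t-1}^{\ell} \end{pmatrix}, \qquad c_t^{\ell} = f \odot c_{t-1}^{\ell} + i \odot g, \qquad h_t^{\ell} = o \odot \tanh(c_t^{\ell}).$$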
RNN vs. LSTM
Ignoring the forget gate, the LSTM's cell update is purely additive ($c_t = c_{t-1} + i \odot g$), much like the skip connections in ResNets: "plain nets" transform the signal at every layer, whereas ResNets add to it and pass it through.
Visually, the difference shows up in how gradients flow backward: in a vanilla RNN the gradients gradually die as they are propagated back through many time steps, while the LSTM's cell state acts like a gradient super-highway straight back through time.
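A minimal sketch (not from the notes) of a single LSTM step, with illustrative names and shapes, just to make the additive cell update concrete:
import numpy as np

def sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
  # x: (D,1) input; h_prev, c_prev: (H,1) previous hidden / cell state
  # Wx: (4H,D), Wh: (4H,H), b: (4H,1) hold the stacked i, f, o, g parameters
  H = h_prev.shape[0]
  a = np.dot(Wx, x) + np.dot(Wh, h_prev) + b   # (4H,1) pre-activations
  i = sigmoid(a[0*H:1*H])                      # input gate
  f = sigmoid(a[1*H:2*H])                      # forget gate
  o = sigmoid(a[2*H:3*H])                      # output gate
  g = np.tanh(a[3*H:4*H])                      # candidate cell update
  c = f * c_prev + i * g                       # additive update: the "highway"
  h = o * np.tanh(c)
  return h, c

D, H = 3, 4                                    # toy sizes
h, c = lstm_step(np.random.randn(D, 1), np.zeros((H, 1)), np.zeros((H, 1)),
                 np.random.randn(4*H, D), np.random.randn(4*H, H), np.zeros((4*H, 1)))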
Instability of vanilla RNNs
The backward pass keeps multiplying by the same matrix, because it contains code like the following (here with a ReLU nonlinearity):
Whh = np.random.randn(H,H)
...
dss[t] = (hs[t] > 0) * dhs[t]
dhs[t-1] = np.dot(Whh.T, dss[t])
As a result, if the largest eigenvalue of Whh is > 1 the gradient will explode (in practice we clip it: once it exceeds a threshold, cap it), and if the largest eigenvalue is < 1 the gradient will vanish.
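A small illustrative sketch (not from the notes; the nonlinearity is ignored): repeatedly multiplying a gradient by Whh.T shrinks or blows up its norm depending on whether the spectrum of Whh sits below or above 1.
import numpy as np

np.random.seed(0)
H, T = 100, 50
dh = np.random.randn(H, 1)                          # some upstream gradient
for scale in [0.5, 1.5]:
  Whh = np.random.randn(H, H) * scale / np.sqrt(H)  # spectral radius roughly = scale
  g = dh.copy()
  for t in range(T):
    g = np.dot(Whh.T, g)                            # the repeated factor in backprop
  print('scale %.1f -> gradient norm after %d steps: %e' % (scale, T, np.linalg.norm(g)))
# scale 0.5: the norm collapses toward 0 (vanishing); scale 1.5: it blows up (exploding)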
Further reading
[On the difficulty of training Recurrent Neural Networks, Pascanu et al., 2013]
[LSTM: A Search Space Odyssey, Greff et al., 2015]
[An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015]
GRU: a modification of the LSTM with fewer parameters.
[Learning phrase representations using RNN encoder-decoder for statistical machine translation, Cho et al., 2014]
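One common formulation of the GRU update, following Cho et al., 2014 (reset gate $r_t$, update gate $z_t$; biases omitted):
$$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1}), \qquad z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})$$
$$\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1})), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$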
In practice, raw (vanilla) RNNs do not work particularly well; LSTMs and GRUs are used instead.