Spatial Transformation

1 Spatial Transformation Presented by Liqun Chen June 30, 2017

2 Outline
1. Overview
2. Spatial Transformer Networks
3. STN experiments
4. Recurrent Models of Visual Attention (RAM)
5. Recurrent Models of Visual Attention Experiment

4 Overview Today's Papers
- Spatial Transformer Networks (STN): a CNN architecture with differentiable sampling
- Recurrent Models of Visual Attention (RAM): an RNN with a Partially Observable Markov Decision Process (POMDP) formulation

5 Overview Background
Goal: both papers want the neural network to learn to extract invariant information, so that the models are robust to translation, rotation, scale, and more generic warping.
Example applications: object tracking, classification, ...

7 Spatial Transformer Networks Model Architecture U: input feature maps V: output feature maps

8 Spatial Transformer Networks Localisation net
- Input: feature map $U \in \mathbb{R}^{H \times W \times C}$
- Output: $\theta = f_{loc}(U)$, the parameters of the transformation $T_\theta$
- The size of $\theta$ depends on the transformation type; e.g. for an affine transformation, $\theta$ is 6-dimensional
- $f_{loc}$ can be a fully-connected network or a convolutional network

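9 Spatial Transformer Networks Parameterised Sampling Grid
Output pixels lie on a regular grid $G = \{G_i\}$ of pixels $G_i = (x_i^t, y_i^t)$, and they form an output feature map $V \in \mathbb{R}^{H' \times W' \times C}$.
For example, let $T_\theta$ be a 2D affine transformation $A_\theta$. In this affine case, the pointwise transformation is:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}, \qquad A_\theta = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}$$

Here, coordinates are normalised: $-1 \le x_i^t, y_i^t \le 1$ and $-1 \le x_i^s, y_i^s \le 1$.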
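The grid mapping above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `affine_grid`, the `(x, y)` output ordering, and the use of `np.linspace` to place target pixels on $[-1, 1]$ are choices made here, not details taken from the slides.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map a regular output grid through a 2x3 affine matrix (hypothetical helper).

    Returns source coordinates of shape (out_h, out_w, 2), ordered (x_s, y_s),
    in the same normalised [-1, 1] convention as the target grid.
    """
    theta = np.asarray(theta, dtype=float).reshape(2, 3)
    # Target coordinates (x_t, y_t) on a regular grid in [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                         np.linspace(-1.0, 1.0, out_w), indexing="ij")
    # Homogeneous target coordinates, shape (3, out_h * out_w).
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    source = theta @ target                      # shape (2, out_h * out_w)
    return source.T.reshape(out_h, out_w, 2)

# The identity transform maps every target pixel onto itself.
grid = affine_grid([[1, 0, 0], [0, 1, 0]], 3, 3)
# A non-zero theta_13 shifts every sampling point along x.
shifted = affine_grid([[1, 0, 0.5], [0, 1, 0]], 3, 3)
```

With the identity parameters the grid corners stay at $(-1, -1)$ and $(1, 1)$; changing $\theta_{13}$ translates all source x-coordinates by the same amount.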

10 Spatial Transformer Networks Parameterised Sampling Grid
The class of transformations $T_\theta$ may be more constrained, such as that used for attention:

$$A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}$$

This $T_\theta$ allows cropping, translation and isotropic scaling by varying $s$, $t_x$, $t_y$.
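As a small worked example of the attention transform, the sketch below maps the two extreme corners of the output grid into the input. The function name `attention_matrix` and the chosen values of $s$, $t_x$, $t_y$ are illustrative assumptions.

```python
import numpy as np

def attention_matrix(s, t_x, t_y):
    """Build the constrained 2x3 matrix [[s, 0, t_x], [0, s, t_y]] (hypothetical helper)."""
    return np.array([[s, 0.0, t_x],
                     [0.0, s, t_y]])

# Top-left and bottom-right corners of the output grid as columns (x_t, y_t, 1).
corners = np.array([[-1.0, -1.0, 1.0],
                    [ 1.0,  1.0, 1.0]]).T

# s = 0.5 halves the window; t_x = 0.5 moves it onto the right half of the input.
A = attention_matrix(0.5, 0.5, 0.0)
window = (A @ corners).T   # rows: source (x_s, y_s) for the two corners
```

Here the output grid reads from the input region $x \in [0, 1]$, $y \in [-0.5, 0.5]$, i.e. a crop of the right-hand side at half scale.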

11 Spatial Transformer Networks Differentiable Image Sampling
Each $(x_i^s, y_i^s)$ coordinate in $T_\theta(G)$ defines the spatial location in the input where a sampling kernel is applied to get the value at a particular pixel in the output $V$, which can be written as:

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, k(x_i^s - m; \phi_x) \, k(y_i^s - n; \phi_y) \quad \forall i \in [1 \ldots H'W'], \; \forall c \in [1 \ldots C]$$

Here, $\phi_x$ and $\phi_y$ are the parameters of a generic sampling kernel $k()$ which defines the image interpolation (e.g. bilinear), and $H' \times W'$ is the size of the output grid.

12 Spatial Transformer Networks Differentiable Image Sampling
In theory, any kernel can be used, as long as (sub-)gradients can be defined with respect to $(x_i^s, y_i^s)$.
Example 1: bilinear sampling kernel:

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|)$$

Example 2: integer sampling kernel:

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \delta(\lfloor x_i^s \rceil - m) \, \delta(\lfloor y_i^s \rceil - n)$$

Here, $\delta$ is the Kronecker delta function and $\lfloor \cdot \rceil$ rounds to the nearest integer.
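The bilinear kernel can be transcribed almost literally into NumPy. The sketch below is a direct, unoptimised version of the double sum for a single channel; it assumes the sampling coordinates are given in pixel units (column index for x, row index for y), and the name `bilinear_sample` is made up for this example.

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample a single-channel map U (H x W) at pixel coordinates (xs, ys).

    Every input pixel (n, m) contributes with weight
    max(0, 1 - |x - m|) * max(0, 1 - |y - n|), as in the bilinear kernel.
    """
    U = np.asarray(U, dtype=float)
    H, W = U.shape
    xs = np.asarray(xs, dtype=float).ravel()
    ys = np.asarray(ys, dtype=float).ravel()
    m = np.arange(W)
    n = np.arange(H)
    wx = np.maximum(0.0, 1.0 - np.abs(xs[:, None] - m[None, :]))  # (P, W)
    wy = np.maximum(0.0, 1.0 - np.abs(ys[:, None] - n[None, :]))  # (P, H)
    # V_i = sum_n sum_m U[n, m] * wx[i, m] * wy[i, n]
    return np.einsum("nm,im,in->i", U, wx, wy)

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
# Sample at a pixel centre, at the midpoint of all four pixels, and at another centre.
values = bilinear_sample(U, xs=[0.0, 0.5, 1.0], ys=[0.0, 0.5, 1.0])
```

At integer coordinates the sampler returns the pixel itself; at the midpoint of the four pixels it returns their average. A multi-channel feature map can be handled by applying the same weights to each channel.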

13 Spatial Transformer Networks Differentiable Image Sampling partial derivatives in bilinear sampling:
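The equations for this slide did not survive extraction. The following is a reconstruction obtained by differentiating the bilinear kernel from the previous slide, so it should be read as a derivation under that assumption rather than as the slide's exact content:

```latex
\frac{\partial V_i^c}{\partial U_{nm}^c}
  = \max(0, 1 - |x_i^s - m|)\,\max(0, 1 - |y_i^s - n|)

\frac{\partial V_i^c}{\partial x_i^s}
  = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \,\max(0, 1 - |y_i^s - n|)
    \begin{cases}
      0  & \text{if } |m - x_i^s| \ge 1 \\
      1  & \text{if } m \ge x_i^s \\
      -1 & \text{if } m < x_i^s
    \end{cases}
```

The derivative with respect to $y_i^s$ has the same form with the roles of $(x_i^s, m)$ and $(y_i^s, n)$ exchanged. Because the kernel is only piecewise differentiable, these are sub-gradients at the points where $|m - x_i^s|$ equals 0 or 1.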

15 STN experiments MNIST Distortion The percentage errors for different models on different distorted MNIST datasets. The distorted MNIST datasets tested are TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion.

16 STN experiments Street View House Numbers Left: the sequence error for SVHN multi-digit recognition on crops of 64px and on inflated crops of 128px, which include more background. Right: the schematic of the ST-CNN Multi model.

18 Recurrent Models of Visual Attention (RAM) Overview
- The model is a recurrent neural network (RNN) that processes inputs sequentially, attending to different locations within the images (or video frames) one at a time, and incrementally combines information from these fixations to build up a dynamic internal representation of the scene or environment.
- It is attention-based: an agent observes the environment only via a bandwidth-limited sensor, and it can actively control how to deploy that sensor.
- At each step, the agent receives a scalar reward; the goal is to maximize the total sum of these rewards (a POMDP).

19 Recurrent Models of Visual Attention (RAM) The Recurrent Attention Model (RAM) Sensor Internal State Actions Rewards

20 Recurrent Models of Visual Attention (RAM) Sensor
Glimpse Sensor: extracts a retina-like representation $\rho(x_t, l_{t-1})$ around location $l_{t-1}$ from image $x_t$. It consists of $k$ square patches: the first has size $g_m \times g_m$, and each successive patch has twice the width of the previous one.
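A minimal sketch of such a multi-scale sensor in NumPy follows. Several details are assumptions made for illustration and are not stated on the slide: the function name `glimpse`, zero-padding at the image border, an even base size `g`, and average-pooling each larger patch back down to `g x g` so that outer patches are lower resolution.

```python
import numpy as np

def glimpse(image, center, g=8, k=3):
    """Extract k square patches centred at `center` = (row, col).

    Patch j has side g * 2**j; each is average-pooled back to g x g, so the
    outer patches cover more area at lower resolution. Assumes g is even.
    """
    image = np.asarray(image, dtype=float)
    max_side = g * 2 ** (k - 1)
    pad = max_side // 2
    # Zero-pad so that crops near the border stay inside the array.
    padded = np.pad(image, pad, mode="constant")
    r, c = center[0] + pad, center[1] + pad
    patches = []
    for j in range(k):
        side = g * 2 ** j
        half = side // 2
        crop = padded[r - half:r + half, c - half:c + half]   # side x side
        factor = side // g
        # Average-pool by `factor` to bring the crop back to g x g.
        pooled = crop.reshape(g, factor, g, factor).mean(axis=(1, 3))
        patches.append(pooled)
    return np.stack(patches)   # shape (k, g, g)

image = np.ones((28, 28))
rho = glimpse(image, center=(14, 14), g=8, k=3)
```

For a 28 x 28 image the innermost 8 x 8 patch lies entirely inside the image, while the outermost 32 x 32 patch extends past the border and therefore picks up some of the zero padding.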

21 Recurrent Models of Visual Attention (RAM) Sensor
Glimpse Network: parameterised by $(\theta_g^0, \theta_g^1, \theta_g^2)$; all three components are MLPs with ReLU activations.
$g_t$ is the glimpse feature vector.
Glimpse: a low-resolution representation.

22 Recurrent Models of Visual Attention (RAM) Internal State
The internal state is formed by the hidden units $h_t$ of the RNN and is updated by $h_t = f_h(h_{t-1}, g_t; \theta_h)$. The external input to the network is the glimpse feature vector $g_t$.
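The update rule can be illustrated with a toy recurrence. The slide does not fix the form of $f_h$; the sketch below assumes a plain rectified-linear recurrence with hypothetical parameters `W_h`, `W_g`, `b` standing in for $\theta_h$, and random vectors standing in for the glimpse features.

```python
import numpy as np

def core_step(h_prev, g_t, W_h, W_g, b):
    """One recurrent update h_t = f_h(h_{t-1}, g_t; theta_h).

    f_h is assumed here to be ReLU(W_h h_{t-1} + W_g g_t + b).
    """
    return np.maximum(0.0, W_h @ h_prev + W_g @ g_t + b)

rng = np.random.default_rng(0)
hidden, feat = 4, 3
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_g = rng.normal(scale=0.1, size=(hidden, feat))
b = np.zeros(hidden)

h = np.zeros(hidden)
for g_t in rng.normal(size=(5, feat)):   # five stand-in glimpse vectors
    h = core_step(h, g_t, W_h, W_g, b)
```

After the loop, `h` summarises all five glimpses; it is the only information passed on to the location and action networks.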

23 Recurrent Models of Visual Attention (RAM) Action
- Location actions are chosen stochastically from a distribution parameterized by the location network $f_l(h_t; \theta_l)$ at time $t$: $l_t \sim p(\cdot \mid f_l(h_t; \theta_l))$.
- The environment action $a_t$ is similarly drawn from a distribution conditioned on a second network output: $a_t \sim p(\cdot \mid f_a(h_t; \theta_a))$.
- For classification, $a_t$ is formulated using a softmax output; for dynamic environments, its exact formulation depends on the action set defined for that particular environment.
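The two stochastic draws can be sketched as follows for the classification case. The concrete choices are assumptions for illustration: a Gaussian location policy with fixed standard deviation `sigma`, a `tanh` squashing of the location mean into $[-1, 1]$, and single linear layers `W_l`, `W_a` in place of $f_l$ and $f_a$.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_actions(h_t, W_l, W_a, sigma, rng):
    """Draw l_t ~ N(f_l(h_t), sigma^2 I) and a_t ~ Categorical(softmax(f_a(h_t)))."""
    mean_l = np.tanh(W_l @ h_t)                   # location mean, kept in [-1, 1]
    l_t = rng.normal(loc=mean_l, scale=sigma)     # stochastic location action
    probs = softmax(W_a @ h_t)                    # class probabilities
    a_t = rng.choice(len(probs), p=probs)         # stochastic classification action
    return l_t, a_t, probs

rng = np.random.default_rng(0)
h_t = np.ones(4)
W_l = np.zeros((2, 4))      # 2-D location
W_a = np.zeros((10, 4))     # 10 classes, e.g. MNIST digits
l_t, a_t, probs = sample_actions(h_t, W_l, W_a, sigma=0.1, rng=rng)
```

With zero weights the location mean is the image centre and the class distribution is uniform; training moves both away from these defaults.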

24 Recurrent Models of Visual Attention (RAM) Reward
The goal of the agent is to maximize the sum of the reward signal, which is usually very sparse and delayed:

$$R = \sum_{t=1}^{T} r_t$$

The agent needs to learn a (stochastic) policy $\pi((l_t, a_t) \mid s_{1:t}; \theta)$, conditioned on the history of interactions with the environment $s_{1:t} = x_1, l_1, a_1, \ldots, x_{t-1}, l_{t-1}, a_{t-1}, x_t$.
The policy $\pi$ here is the RNN.

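25 Recurrent Models of Visual Attention (RAM) Reward
The model aims to maximize the expected reward under the distribution over interaction sequences induced by the policy:

$$J(\theta) = \mathbb{E}_{p(s_{1:T}; \theta)}\left[\sum_{t=1}^{T} r_t\right] = \mathbb{E}_{p(s_{1:T}; \theta)}[R]$$

The gradient of $J$ is estimated with a sample approximation, which requires the gradient of the log-policy at each time step. But this is just the gradient of the RNN that defines the agent, evaluated at time step $t$, and it can be computed by standard backpropagation.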
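The equation for the sample approximation was not recovered from this slide. For reference, the standard REINFORCE (score-function) estimator for this objective has the form below; the symbols $u_t = (l_t, a_t)$ for the joint action and $M$ for the number of sampled episodes are introduced here and may differ from the slide's notation.

```latex
\nabla_\theta J(\theta)
  = \sum_{t=1}^{T} \mathbb{E}_{p(s_{1:T};\theta)}
      \left[ \nabla_\theta \log \pi(u_t \mid s_{1:t}; \theta)\, R \right]
  \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T}
      \nabla_\theta \log \pi(u_t^i \mid s_{1:t}^i; \theta)\, R^i
```

Each term raises the log-probability of the actions taken in episode $i$ in proportion to that episode's return $R^i$, which is why only $\nabla_\theta \log \pi$ needs to be backpropagated through the RNN.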

27 Recurrent Models of Visual Attention Experiment MNIST Classification Classification results on the MNIST and Translated MNIST datasets.
