LEARNING REPRESENTATIONS OF SEQUENCES IPAM GRADUATE SUMMER SCHOOL ON DEEP LEARNING


Graham Taylor, School of Engineering, University of Guelph. Papers and software available at:

OVERVIEW: THIS TALK

Learning representations of temporal data:
- existing methods and challenges faced
- recent methods inspired by deep learning and representation learning

Applications: in particular, modeling human pose and activity
- highly structured data, e.g. motion capture
- weakly structured data, e.g. video

(Figure: convolutional GRBM architecture with input X, output Y, feature layer Z^k and pooling layer P^k, revisited later in the talk)

OUTLINE

Learning representations from sequences: existing methods and their challenges

Composable, distributed-state models for sequences: Conditional Restricted Boltzmann Machines and their variants

Using learned representations to analyze video: a brief (and incomplete) survey of deep learning for activity recognition

TIME SERIES DATA

Time is an integral part of many human behaviours (motion, reasoning). In building statistical models, time is sometimes ignored, which is often problematic. Models that do incorporate dynamics often fail to account for the fact that data is typically high-dimensional, nonlinear, and contains long-range dependencies.

Today we will discuss a number of models that have been developed to address these challenges.

(Graphic: intensity (no. of stories) over time; David McCandless, informationisbeautiful.net)

VECTOR AUTOREGRESSIVE MODELS

$v_t = b + \sum_{m=1}^{M} A_m v_{t-m} + e_t$

Have dominated statistical time-series analysis for approximately 50 years
Can be fit easily by least-squares regression
Can fail even for simple nonlinearities present in the system
- but many data sets can be modeled well by a linear system
Well understood; many extensions exist
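As a concrete illustration, here is a minimal NumPy sketch (not from the talk) of fitting a VAR model by least squares; the data, model order, and coefficients are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, M = 500, 3, 2                      # time steps, dimensionality, model order

# Simulate a toy stable VAR(2) process: v_t = b + A_1 v_{t-1} + A_2 v_{t-2} + e_t
A = [0.5 * np.eye(D), -0.2 * np.eye(D)]  # assumed ground-truth coefficients
b = np.zeros(D)
v = np.zeros((T, D))
for t in range(M, T):
    v[t] = b + sum(A[m] @ v[t - 1 - m] for m in range(M)) + 0.1 * rng.standard_normal(D)

# Least-squares fit: regress v_t on [1, v_{t-1}, v_{t-2}]
X = np.hstack([np.ones((T - M, 1))] + [v[M - 1 - m : T - 1 - m] for m in range(M)])
Y = v[M:]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

b_hat = W[0]
A_hat = [W[1 + m * D : 1 + (m + 1) * D].T for m in range(M)]
print("estimated A_1:\n", np.round(A_hat[0], 2))
```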

MARKOV ("N-GRAM") MODELS

(Graphical model: fully observed chain over v_{t-2}, v_{t-1}, v_t)

Fully observable
Sequential observations may have nonlinear dependence
Derived by assuming sequences have the Markov property:
$p(v_t \mid \{v_1^{t-1}\}) = p(v_t \mid \{v_{t-N}^{t-1}\})$
This leads to the joint:
$p(\{v_1^T\}) = p(\{v_1^N\}) \prod_{t=N+1}^{T} p(v_t \mid \{v_{t-N}^{t-1}\})$
Number of parameters is exponential in N!

EXPONENTIAL INCREASE IN PARAMETERS

$|\theta| = Q^{N+1}$. Here, Q = 3, with alphabet {a, b, c}.

1st-order Markov (N = 1): a table of 9 conditionals, p(a|a), p(b|a), p(c|a), ..., p(c|c).

2nd-order Markov (N = 2): a table of 27 conditionals, one distribution p(· | {·,·}) per ordered pair of previous symbols.
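A quick way to see the $Q^{N+1}$ growth is to enumerate the contexts directly; this short sketch (not from the talk) uses the slide's alphabet of size Q = 3.

```python
from itertools import product

alphabet = ["a", "b", "c"]          # Q = 3, as in the slide
Q = len(alphabet)

for N in (1, 2, 3):
    # one conditional distribution p(. | context) per length-N context
    contexts = list(product(alphabet, repeat=N))
    n_params = Q ** (N + 1)         # Q entries per context, Q**N contexts
    print(f"N={N}: {len(contexts)} contexts, {n_params} table entries")
```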

HIDDEN MARKOV MODELS (HMM)

(Graphical model: hidden chain h_{t-2} -> h_{t-1} -> h_t, each h_t emitting the observation v_t)

Introduces a hidden state that controls the dependence of the current observation on the past

Successful in speech & language modeling, biology

Defined by 3 sets of parameters:
- Initial state parameters, π
- Transition matrix, A
- Emission distribution, p(v_t | h_t)

Factored joint distribution:
$p(\{h_t\}, \{v_t\}) = p(h_1)\, p(v_1 \mid h_1) \prod_{t=2}^{T} p(h_t \mid h_{t-1})\, p(v_t \mid h_t)$

INFERENCE AND LEARNING

Typically there are three tasks we want to perform in an HMM:
- Likelihood estimation: computing $p(\{v_1, \dots, v_T\} \mid \theta)$
- Inference: inferring the hidden states $\{h_1, \dots, h_T\}$ given the observations
- Learning: estimating the parameters $\theta$ from data

All are exact and tractable due to the simple structure of the HMM

(Figure: three copies of the HMM graphical model, one per task)
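For the first task, here is a minimal sketch (not from the talk) of the forward algorithm with scaling; the toy parameters below are made up.

```python
import numpy as np

def hmm_log_likelihood(pi, A, B, obs):
    """pi: (K,) initial probs; A: (K,K) transitions A[i,j] = p(h_t=j | h_{t-1}=i);
    B: (K,V) emissions B[k,v] = p(v | h=k); obs: list of observation indices."""
    alpha = pi * B[:, obs[0]]                   # alpha_1(k) = p(h_1=k) p(v_1 | h_1=k)
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]  # propagate, then weight by emission
        c = alpha.sum()                         # scaling factor = p(v_t | v_1..v_{t-1})
        log_lik += np.log(c)
        alpha /= c
    return log_lik

# Toy 2-state, 3-symbol HMM
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(hmm_log_likelihood(pi, A, B, [0, 1, 2, 2, 1]))
```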

LIMITATIONS OF HMMS

Many high-dimensional data sets contain rich componential structure

Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the entire history of the time series

To model K bits of information, they need 2^K hidden states

We seek models with distributed hidden state:
- capacity linear in the number of components

LINEAR DYNAMICAL SYSTEMS

(Graphical model: same chain structure as the HMM)

Graphical model is the same as the HMM, but with real-valued state vectors

Characterized by linear-Gaussian dynamics and observations:
$p(h_t \mid h_{t-1}) = \mathcal{N}(h_t;\, A h_{t-1},\, Q)$
$p(v_t \mid h_t) = \mathcal{N}(v_t;\, C h_t,\, R)$

Inference is performed using Kalman smoothing (belief propagation)
Learning can be done by EM
Dynamics and observations may also depend on an observed input (control)
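A minimal sketch (not from the talk) of one step of Kalman filtering, the forward recursion underlying Kalman smoothing; the matrices and observations below are illustrative only.

```python
import numpy as np

def kalman_filter_step(mu, P, v, A, C, Q, R):
    """mu, P: posterior mean/covariance of h_{t-1}; v: new observation v_t."""
    # Predict under p(h_t | h_{t-1}) = N(A h_{t-1}, Q)
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update with the observation under p(v_t | h_t) = N(C h_t, R)
    S = C @ P_pred @ C.T + R                  # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
    mu_new = mu_pred + K @ (v - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new

# Toy 2-d state, 1-d observation
A = np.array([[1.0, 0.1], [0.0, 1.0]]); Q = 0.01 * np.eye(2)
C = np.array([[1.0, 0.0]]);             R = 0.1 * np.eye(1)
mu, P = np.zeros(2), np.eye(2)
for v in [0.2, 0.35, 0.5]:
    mu, P = kalman_filter_step(mu, P, np.array([v]), A, C, Q, R)
print(mu)
```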

LATENT REPRESENTATIONS FOR REAL-WORLD DATA

Data for many real-world problems (e.g. vision, motion capture) is high-dimensional, containing complex nonlinear relationships between components

Hidden Markov Models
- Pro: complex, nonlinear emission model
- Con: a single K-state multinomial represents the entire history

Linear Dynamical Systems
- Pro: state can convey much more information
- Con: emission model is constrained to be linear

LEARNING DISTRIBUTED REPRESENTATIONS

Simple networks are capable of discovering useful and interesting internal representations of static data (e.g. many of the talks so far!)

Can we learn, in a similar way, representations of temporal data?

Simple idea: a spatial representation of time (e.g. a spectrogram):
- Needs a buffer; not biologically plausible
- Cannot process inputs of differing length
- Cannot distinguish between absolute and relative position

This motivates an implicit representation of time in connectionist models, where time is represented by its effect on processing

A RICH HISTORY

Elman networks: time-delayed context units, truncated BPTT (Elman, 1990), (Jordan, 1986)

Mean-field Boltzmann Machines through Time: inference is approximate, learning less efficient than HMMs (Williams and Hinton, 1990)

Spiking Boltzmann Machines: hidden-state dynamics and smoothness constraints on observed data (Hinton and Brown, 2000)

RECURRENT NEURAL NETWORKS

(Figure from Martens and Sutskever: an RNN unrolled in time, with inputs v_t, hidden states h_t, and predictions ŷ_t)

Neural network replicated in time

At each step, the network receives an input vector, updates its internal representation via nonlinear activation functions, and makes a prediction:

$x_t = W_{hv} v_t + W_{hh} h_{t-1} + b_h$
$h_{j,t} = f(x_{j,t})$
$s_t = W_{yh} h_t + b_y$
$\hat{y}_{k,t} = g(s_{k,t})$
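A minimal NumPy sketch (not from the talk) of this forward pass; the shapes, the tanh/logistic choices for f and g, and the random parameters are assumptions for illustration.

```python
import numpy as np

def rnn_forward(V, W_hv, W_hh, W_yh, b_h, b_y, f=np.tanh):
    """V: (T, n_in) input sequence. Returns hidden states and predictions."""
    n_hid = W_hh.shape[0]
    h = np.zeros(n_hid)
    H, Y = [], []
    for v_t in V:
        x = W_hv @ v_t + W_hh @ h + b_h     # x_t = W_hv v_t + W_hh h_{t-1} + b_h
        h = f(x)                             # h_t = f(x_t)
        s = W_yh @ h + b_y                   # s_t = W_yh h_t + b_y
        y_hat = 1.0 / (1.0 + np.exp(-s))     # g = logistic output (an assumption)
        H.append(h); Y.append(y_hat)
    return np.array(H), np.array(Y)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 8, 2, 10
V = rng.standard_normal((T, n_in))
H, Y = rnn_forward(V,
                   0.1 * rng.standard_normal((n_hid, n_in)),
                   0.1 * rng.standard_normal((n_hid, n_hid)),
                   0.1 * rng.standard_normal((n_out, n_hid)),
                   np.zeros(n_hid), np.zeros(n_out))
print(Y.shape)  # (10, 2)
```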

TRAINING RECURRENT NEURAL NETWORKS

A possibly high-dimensional, distributed internal representation and nonlinear dynamics allow the model, in theory, to model complex time series

Exact gradients can be computed via Backpropagation Through Time

It is an interesting and powerful model. What's the catch?
- Training RNNs via gradient descent fails on simple problems
- Attributed to vanishing or exploding gradients
- Much work in the 1990s focused on identifying and addressing these issues: none of these methods were widely adopted

Best-known attempts to resolve the problem of RNN training:
- Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997)
- Echo-State Networks (ESN) (Jaeger and Haas 2004)

(Figure adapted from James Martens)

FAILURE OF GRADIENT DESCENT

Two hypotheses for why gradient descent fails for RNNs:
- increased frequency and severity of bad local minima
- pathological curvature, like the type seen in the Rosenbrock function:
$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$

(Figures from James Martens)

SECOND ORDER METHODS

Model the objective function by the local approximation:
$f(\theta + p) \approx q_\theta(p) = f(\theta) + \nabla f(\theta)^T p + \tfrac{1}{2} p^T B p$
where p is the search direction and B is a matrix which quantifies curvature.

In Newton's method, B = H is the Hessian matrix.

By taking the curvature information into account, Newton's method rescales the gradient so it is a much more sensible direction to follow.

Not feasible for high-dimensional problems!

(Figure from James Martens)
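To make the quadratic model concrete, here is a small sketch (not from the talk) of plain Newton steps on the Rosenbrock function from the previous slide; the starting point and iteration count are arbitrary.

```python
import numpy as np

def f(w):
    x, y = w
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def grad(w):
    x, y = w
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def hess(w):
    x, y = w
    return np.array([[2 - 400 * (y - 3 * x ** 2), -400 * x],
                     [-400 * x,                    200.0]])

w = np.array([-1.2, 1.0])                    # a classic starting point
for _ in range(10):
    p = np.linalg.solve(hess(w), -grad(w))   # Newton direction: B p = -grad, B = H
    w = w + p
print(w, f(w))                               # approaches the minimum at (1, 1)
```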

HESSIAN-FREE OPTIMIZATION

Based on exploiting two simple ideas (and some additional "tricks"):

For an n-dimensional vector d, the Hessian-vector product Hd can easily be computed using finite differences, at the cost of a single extra gradient evaluation
- In practice, the R-operator (Pearlmutter 1994) is used instead of finite differences

There is a very effective algorithm for optimizing quadratic objectives which requires only Hessian-vector products: linear conjugate gradient (CG)

This method was shown to effectively train RNNs on the pathological long-term dependency problems they were previously unable to solve (Martens and Sutskever 2011)

RNN demo code (using Theano):
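The two ingredients can be demonstrated in a few lines. This sketch (not from the talk, and unrelated to the Theano demo code mentioned above) computes Hessian-vector products by finite differences of the gradient and feeds them to linear CG on a toy quadratic.

```python
import numpy as np

def hvp_fd(grad_fn, w, d, eps=1e-6):
    """Approximate H(w) d with one extra gradient evaluation:
    H d ~ (grad(w + eps d) - grad(w)) / eps."""
    return (grad_fn(w + eps * d) - grad_fn(w)) / eps

def conjugate_gradient(matvec, b, n_iters=50, tol=1e-10):
    """Solve A x = b given only the map x -> A x (A symmetric positive definite)."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy quadratic f(w) = 0.5 w^T A w - c^T w, so grad f = A w - c and H = A.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)
c = rng.standard_normal(20)
grad_fn = lambda w: A @ w - c

w0 = np.zeros(20)
g = grad_fn(w0)
step = conjugate_gradient(lambda d: hvp_fd(grad_fn, w0, d), -g)  # solve H p = -g
print(np.allclose(w0 + step, np.linalg.solve(A, c), atol=1e-4))  # True on a quadratic
```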

GENERATIVE MODELS WITH DISTRIBUTED STATE

Many sequences are high-dimensional and have complex structure:
- music, human motion, weather/climate data
- RNNs simply predict the expected value at the next time step
- they can't capture multi-modality

Generative models (like Restricted Boltzmann Machines) can capture complex distributions

Use binary hidden state and gain the best of the HMM & LDS:
- the nonlinear dynamics and observation model of the HMM, without the limited hidden state
- the efficient, expressive state of the LDS, without the linear-Gaussian restriction on dynamics and observations

DISTRIBUTED BINARY HIDDEN STATE

Using distributed binary representations for hidden state in directed models of time series makes inference difficult. But we can:
- Use a Restricted Boltzmann Machine (RBM) for the interactions between hidden and visible variables. A factorial posterior makes inference and sampling easy.
- Treat the visible variables in the previous time slice as additional fixed inputs

(Figure: hidden variables (factors) at time t connected to visible variables (observations) at time t)

One typically uses binary logistic units for both visibles and hiddens:
$p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i W_{ij})$
$p(v_i = 1 \mid h) = \sigma(b_i + \sum_j h_j W_{ij})$
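A minimal sketch (not from the talk) of these factorial conditionals and one step of block Gibbs sampling; the sizes and random parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid = 6, 4
W = 0.1 * rng.standard_normal((n_vis, n_hid))   # pairwise weights W_ij
b_vis = np.zeros(n_vis)                          # visible biases b_i
b_hid = np.zeros(n_hid)                          # hidden biases b_j

def p_h_given_v(v):
    return sigma(b_hid + v @ W)      # p(h_j = 1 | v)

def p_v_given_h(h):
    return sigma(b_vis + h @ W.T)    # p(v_i = 1 | h)

v = rng.integers(0, 2, n_vis).astype(float)       # a binary visible vector
ph = p_h_given_v(v)                               # exact factorial posterior
h = (rng.random(n_hid) < ph).astype(float)        # sample the hiddens
v_recon = (rng.random(n_vis) < p_v_given_h(h)).astype(float)  # sample a reconstruction
print(ph, v_recon)
```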

MODELING OBSERVATIONS WITH AN RBM

So the distributed binary latent (hidden) state of an RBM lets us:
- Model complex, nonlinear dynamics
- Easily and exactly infer the latent binary state given the observations

But RBMs treat data as static (i.i.d.)

(Figure: hidden variables (factors) at time t and visible variables (joint angles) at time t)

CONDITIONAL RESTRICTED BOLTZMANN MACHINES
(Taylor, Hinton and Roweis NIPS 2006, JMLR 2011)

Start with a Restricted Boltzmann Machine (RBM)

Add two types of directed connections:
- Autoregressive connections model short-term, linear structure
- History can also influence dynamics through the hidden layer

Conditioning does not change inference or learning

(Figure: hidden layer and visible layer, with directed connections from the recent history)

CONTRASTIVE DIVERGENCE LEARNING IN A CRBM

(Figure: one step of contrastive divergence with the history held fixed; iter = 0 (data) gives <v_i h_j>_data and iter = 1 (reconstruction) gives <v_i h_j>_recon)

When updating visible and hidden units, we implement directed connections by treating data from previous time steps as a dynamically changing bias

Inference and learning do not change
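A simplified sketch (not the authors' implementation) of a CD-1 update in a CRBM, where the history enters only through dynamically changing biases; the shapes, binary/mean-field unit choices, and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid, n_hist = 5, 8, 10          # n_hist = model order * n_vis, flattened history
W = 0.01 * rng.standard_normal((n_vis, n_hid))    # RBM weights
A = 0.01 * rng.standard_normal((n_hist, n_vis))   # autoregressive: history -> visibles
B = 0.01 * rng.standard_normal((n_hist, n_hid))   # history -> hiddens
b_vis = np.zeros(n_vis); b_hid = np.zeros(n_hid)
lr = 0.01

def cd1_update(v0, hist):
    # Directed connections act as per-example dynamic biases
    dyn_vis = b_vis + hist @ A
    dyn_hid = b_hid + hist @ B
    ph0 = sigma(dyn_hid + v0 @ W)                   # positive phase
    h0 = (rng.random(n_hid) < ph0).astype(float)
    v1 = sigma(dyn_vis + h0 @ W.T)                  # one-step reconstruction
    ph1 = sigma(dyn_hid + v1 @ W)                   # negative phase
    # Approximate gradients: <v h>_data - <v h>_recon, and likewise for A, B
    dW = np.outer(v0, ph0) - np.outer(v1, ph1)
    dA = np.outer(hist, v0 - v1)
    dB = np.outer(hist, ph0 - ph1)
    return lr * dW, lr * dA, lr * dB

v0 = rng.random(n_vis)                    # current frame (e.g. joint angles)
hist = rng.random(n_hist)                 # concatenated previous frames
dW, dA, dB = cd1_update(v0, hist)
W += dW; A += dA; B += dB
```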

STACKING: THE CONDITIONAL DEEP BELIEF NETWORK

Learn a CRBM

Now, treat the sequence of hidden units as fully observed data and train a second CRBM

The composition of CRBMs is a conditional deep belief net

It can be fine-tuned generatively or discriminatively

(Figure: a two-layer stack with hidden layers h^1_t and h^0_t above the visible layer, conditioned on the recent history h^0_{t-1}, h^0_{t-2})

MOTION SYNTHESIS WITH A 2-LAYER CDBN

Model is trained on ~8000 frames of 60 fps data (49 dimensions)
10 styles of walking: cat, chicken, dinosaur, drunk, gangly, graceful, normal, old-man, sexy and strong
600 binary hidden units per layer
< 1 hour training on a modern workstation

MODELING CONTEXT

A single model was trained on 10 styled walks from CMU subject 137

The model can generate each style based on initialization

We cannot prevent or control transitioning

How to blend styles?

Style or person labels can be provided as part of the input to the top layer

(Figure: the two-layer conditional model, with label units added at the top layer)

HOW TO MAKE CONTEXT INFLUENCE DYNAMICS?

MULTIPLICATIVE INTERACTIONS

Let latent variables act like gates that dynamically change the connections between other variables

This amounts to letting variables multiply connections between other variables: three-way multiplicative interactions

Recently used in the context of learning correspondence between images (Memisevic & Hinton 2007, 2010), but with a long history before that

Roland Memisevic has a nice tutorial and review paper on the subject

(Figure: three-way interaction among h_j, z_k and v_i)
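A minimal sketch (not from the talk) of a three-way interaction, where a context vector z blends a stack of pairwise weight matrices into one effective matrix between v and h; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, n_z = 5, 4, 3
W = 0.1 * rng.standard_normal((n_v, n_h, n_z))   # one pairwise slice W[:, :, k] per unit of z

v = rng.random(n_v)
z = rng.random(n_z)                               # gating / context variables

W_eff = np.einsum('ijk,k->ij', W, z)              # z gates the effective v-h connections
p_h = sigma(v @ W_eff)                            # hidden activations under the gated weights
print(W_eff.shape, p_h)
```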

GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
Memisevic & Hinton (2007)

(Figure: two views of the same model; latent variables h_j interact multiplicatively with the input v_i and the output z_k)

INFERRING OPTICAL FLOW: IMAGE ANALOGIES

Toy images (Memisevic & Hinton 2006). There is no structure in these images, only in how they change. The model can infer the optical flow from a pair of images and apply it to a random image.

(Figure panels: input, output, inferred flow field, new input, applied transformation)

BACK TO MOTION STYLE

Introduce a set of latent context variables whose value is known at training time

In our example, these represent motion style, but they could also represent height, weight, gender, etc.

The contextual variables gate every existing pairwise connection in our model

(Figure: context variables z_k gating the connections between h_j and v_i)

LEARNING AND INFERENCE

Learning and inference remain almost the same as in the standard CRBM

We can think of the context or style variables as blending in a whole sub-network

This allows us to share parameters across styles but selectively adapt dynamics

(Figure: the gated model with h_j, z_k and v_i)

SUPERVISED MODELING OF STYLE
(Taylor, Hinton and Roweis ICML 2009, JMLR 2011)

(Figure, built up over several steps: an input layer (e.g. data at time t-1:t-n) and an output layer (e.g. data at time t) are connected to a hidden layer j; style features l are then added and gate the connections)

OVERPARAMETERIZATION

Note: the weight matrix W^{v,h} has been replaced by a tensor W^{v,h,z} (likewise for the other weights)

The number of parameters is O(N^3) per group of weights; more, if we want sparse, overcomplete hiddens

However, there is a simple yet powerful solution!

(Figure: input layer (data at time t-1:t-n), output layer (data at time t), hidden layer j, style features l)

FACTORING

Each three-way weight tensor is factored into three matrices, one per group of variables:
$W^{vh}_{ijl} = \sum_f W^v_{if} W^h_{jf} W^z_{lf}$

(Figure adapted from Roland Memisevic: hidden layer j, style features l, and output layer (e.g. data at time t), connected through factors f)
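A minimal sketch (not from the talk) showing that the factored form reproduces the full-tensor computation without ever building the O(N^3) tensor; sizes and parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, n_z, n_f = 20, 30, 10, 15   # visibles, hiddens, style features, factors

Wv = 0.1 * rng.standard_normal((n_v, n_f))
Wh = 0.1 * rng.standard_normal((n_h, n_f))
Wz = 0.1 * rng.standard_normal((n_z, n_f))

v = rng.random(n_v)
z = rng.random(n_z)                    # style / context features

# Factored computation: project each group onto the factors, multiply, project back.
hidden_input_factored = ((v @ Wv) * (z @ Wz)) @ Wh.T          # shape (n_h,)

# Equivalent computation through the full tensor W_ijl = sum_f Wv_if Wh_jf Wz_lf.
W_full = np.einsum('if,jf,lf->ijl', Wv, Wh, Wz)
hidden_input_full = np.einsum('ijl,i,l->j', W_full, v, z)
print(np.allclose(hidden_input_factored, hidden_input_full))  # True
```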

SUPERVISED MODELING OF STYLE
(Taylor, Hinton and Roweis ICML 2009, JMLR 2011)

(Figure, built up again: the input layer (data at time t-1:t-n), output layer (data at time t), hidden layer j and style features l are now connected through a layer of factors)

PARAMETER SHARING

MOTION SYNTHESIS: FACTORED 3RD-ORDER CRBM

Same 10-styles dataset
600 binary hidden units
Deterministic factors
100 real-valued style features
< 1 hour training on a modern workstation
Synthesis is real-time

QUANTITATIVE EVALUATION

It is not computationally tractable to compute likelihoods; Annealed Importance Sampling will not work in conditional models (open problem)

We can evaluate predictive power (even though the model has been trained generatively)

We can also evaluate on denoising tasks

(Plot: RMS prediction error in the normalized data space vs. predict-ahead interval, comparing CRBM 600, FCRBM, style FCRBM, unfactored style CRBM 100 and a 6th-order autoregressive model; parameter counts in parentheses)

3D CONVNETS FOR ACTIVITY RECOGNITION
Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu (ICML 2010)

One approach: treat video frames as still images (LeCun et al. 2005)

Alternatively, perform 3D convolution so that discriminative features across space and time are captured

Multiple convolutions are applied to contiguous frames to extract multiple features

(Images from Ji et al.: (a) 2D convolution vs. (b) 3D convolution along the temporal dimension)
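A minimal sketch (not from the paper's implementation) of a single valid 3D convolution over a video volume, illustrating how one filter produces features that span both space and time; the video and filter here are random.

```python
import numpy as np

def conv3d(video, kernel):
    """video: (T, H, W); kernel: (kt, kh, kw). Returns the valid 3D correlation."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t+kt, y:y+kh, x:x+kw] * kernel)
    return out

video = np.random.default_rng(0).random((10, 32, 32))    # 10 frames of 32x32 pixels
kernel = np.random.default_rng(1).random((3, 7, 7))       # 7x7 spatial, 3 temporal
features = conv3d(video, kernel)
print(features.shape)   # (8, 26, 26): the feature map spans space and time
```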

3D CNN ARCHITECTURE

(Image from Ji et al.: input -> H1 (hardwired) -> C2 (7x7x3 3D convolution) -> S3 (2x2 subsampling) -> C4 (7x6x3 3D convolution) -> S5 (3x3 subsampling) -> C6 (7x4 convolution) -> full connection -> action units)

- H1 is hardwired to extract 5 channels: grayscale, grad-x, grad-y, flow-x, flow-y
- C2: 2 different 3D filters applied to each of the 5 blocks independently
- S3: subsample spatially
- C4: 3 different 3D filters applied to each of the 5 channels in 2 blocks
- Followed by two fully-connected layers and the action units

3D CONVNET: DISCUSSION

Good performance on TRECVID surveillance data (CellToEar, ObjectPut, Pointing)
Good performance on KTH actions (box, handwave, handclap, jog, run, walk)
Still a fair amount of engineering: person detection (TRECVID), foreground extraction (KTH), hard-coded first layer

(Image from Ji et al.)

LEARNING FEATURES FOR VIDEO UNDERSTANDING

Most work on unsupervised feature extraction has concentrated on static images

We propose a model that extracts motion-sensitive features from pairs of images

Existing attempts (e.g. Memisevic & Hinton 2007, Cadieu & Olshausen 2009) ignore the pictorial structure of the input, and are thus limited to modeling small image patches

(Figure: an image pair mapped to transformation feature maps)

GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
Memisevic & Hinton (2007)

(Figure, repeated from earlier: two views of the three-way model relating latent variables h_j, input v_i and output z_k)

CONVOLUTIONAL GRBM
Graham Taylor, Rob Fergus, Yann LeCun, and Chris Bregler (ECCV 2010)

Like the GRBM, captures third-order interactions
Shares weights at all locations in an image
As in a standard RBM, exact inference is efficient
Inference and reconstruction are performed through convolution operations

(Figure: input X and output Y images, feature layer Z^k with units z^k_{m,n}, and pooling layer P^k with units p^k_α)

MORE COMPLEX EXAMPLE OF ANALOGIES

(Figure: from an input/output image pair, the model infers feature maps encoding the transformation; applying them to a novel input produces the analogous output, shown against the ground truth)

HUMAN ACTIVITY: KTH ACTIONS DATASET

We learn 32 feature maps; 6 are shown here
KTH contains 25 subjects performing 6 actions under 4 conditions
The only preprocessing is local contrast normalization

(Figure: feature maps z^k over time for hand clapping (above) and walking (below); motion-sensitive features (1, 3), edge features (4), a segmentation operator (6))

ACTIVITY RECOGNITION: KTH

Prior art                Acc. (%)    Convolutional architectures              Acc. (%)
HOG3D+KM+SVM             85.3        convGRBM+3D-convnet+logistic reg
HOG/HOF+KM+SVM           86.1        convGRBM+3D-convnet+MLP                  90.0
HOG+KM+SVM                           3D-convnet+3D-convnet+logistic reg
HOF+KM+SVM                           3D-convnet+3D-convnet+MLP                79.5

Compared to methods that do not use explicit interest point detection
State of the art: 92.1% (Laptev et al. 2008), 93.9% (Le et al. 2011)
The other reported result on 3D convnets uses a different evaluation scheme

ACTIVITY RECOGNITION: HOLLYWOOD 2

12 classes of human action extracted from 69 movies (20 hours)
Much more realistic and challenging than KTH (changing scenes, zoom, etc.)
Performance is evaluated by mean average precision over classes

Method                                Average Prec.
Prior art (Wang et al. survey 2009):
  HOG3D+KM+SVM                        45.3
  HOG/HOF+KM+SVM                      47.4
  HOG+KM+SVM                          39.4
  HOF+KM+SVM                          45.5
Our method:
  GRBM+SC+SVM

SUMMARY

Learning distributed representations of sequences

For high-dimensional, multi-modal data: CRBM, FCRBM

Activity recognition: 2 methods

(Figure: the factored CRBM and the convolutional GRBM architectures)

The University of Guelph is not in Belgium!

(Map: Montreal, Guelph, Toronto, New York)


More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Recurrent neural networks

Recurrent neural networks 12-1: Recurrent neural networks Prof. J.C. Kao, UCLA Recurrent neural networks Motivation Network unrollwing Backpropagation through time Vanishing and exploding gradients LSTMs GRUs 12-2: Recurrent neural

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Spring 2018 http://vllab.ee.ntu.edu.tw/dlcv.html (primary) https://ceiba.ntu.edu.tw/1062dlcv (grade, etc.) FB: DLCV Spring 2018 Yu-Chiang Frank Wang 王鈺強, Associate Professor

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

Multimodal context analysis and prediction

Multimodal context analysis and prediction Multimodal context analysis and prediction Valeria Tomaselli (valeria.tomaselli@st.com) Sebastiano Battiato Giovanni Maria Farinella Tiziana Rotondo (PhD student) Outline 2 Context analysis vs prediction

More information

Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network

More information

Recurrent Neural Networks Deep Learning Lecture 5. Efstratios Gavves

Recurrent Neural Networks Deep Learning Lecture 5. Efstratios Gavves Recurrent Neural Networks Deep Learning Lecture 5 Efstratios Gavves Sequential Data So far, all tasks assumed stationary data Neither all data, nor all tasks are stationary though Sequential Data: Text

More information

Part 2. Representation Learning Algorithms

Part 2. Representation Learning Algorithms 53 Part 2 Representation Learning Algorithms 54 A neural network = running several logistic regressions at the same time If we feed a vector of inputs through a bunch of logis;c regression func;ons, then

More information

Denoising Autoencoders

Denoising Autoencoders Denoising Autoencoders Oliver Worm, Daniel Leinfelder 20.11.2013 Oliver Worm, Daniel Leinfelder Denoising Autoencoders 20.11.2013 1 / 11 Introduction Poor initialisation can lead to local minima 1986 -

More information

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i

More information

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] WHAT ARE NEURAL NETWORKS? William s notation Logis;c regression + 1

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

Lecture 11 Recurrent Neural Networks I

Lecture 11 Recurrent Neural Networks I Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks

More information

Hidden CRFs for Human Activity Classification from RGBD Data

Hidden CRFs for Human Activity Classification from RGBD Data H Hidden s for Human from RGBD Data Avi Singh & Ankit Goyal IIT-Kanpur CS679: Machine Learning for Computer Vision April 13, 215 Overview H 1 2 3 H 4 5 6 7 Statement H Input An RGBD video with a human

More information

Deep Learning Architectures and Algorithms

Deep Learning Architectures and Algorithms Deep Learning Architectures and Algorithms In-Jung Kim 2016. 12. 2. Agenda Introduction to Deep Learning RBM and Auto-Encoders Convolutional Neural Networks Recurrent Neural Networks Reinforcement Learning

More information

Slide credit from Hung-Yi Lee & Richard Socher

Slide credit from Hung-Yi Lee & Richard Socher Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on

More information

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009 AN INTRODUCTION TO NEURAL NETWORKS Scott Kuindersma November 12, 2009 SUPERVISED LEARNING We are given some training data: We must learn a function If y is discrete, we call it classification If it is

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Recurrent Neural Network

Recurrent Neural Network Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks

More information

Deep Boltzmann Machines

Deep Boltzmann Machines Deep Boltzmann Machines Ruslan Salakutdinov and Geoffrey E. Hinton Amish Goel University of Illinois Urbana Champaign agoel10@illinois.edu December 2, 2016 Ruslan Salakutdinov and Geoffrey E. Hinton Amish

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

Learning to Disentangle Factors of Variation with Manifold Learning

Learning to Disentangle Factors of Variation with Manifold Learning Learning to Disentangle Factors of Variation with Manifold Learning Scott Reed Kihyuk Sohn Yuting Zhang Honglak Lee University of Michigan, Department of Electrical Engineering and Computer Science 08

More information

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of

More information

Introduction to Restricted Boltzmann Machines

Introduction to Restricted Boltzmann Machines Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,

More information

Reservoir Computing and Echo State Networks

Reservoir Computing and Echo State Networks An Introduction to: Reservoir Computing and Echo State Networks Claudio Gallicchio gallicch@di.unipi.it Outline Focus: Supervised learning in domain of sequences Recurrent Neural networks for supervised

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Chapter 20. Deep Generative Models

Chapter 20. Deep Generative Models Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information