LEARNING REPRESENTATIONS OF SEQUENCES IPAM GRADUATE SUMMER SCHOOL ON DEEP LEARNING


Graham Taylor, School of Engineering, University of Guelph. Papers and software available at:

OVERVIEW: THIS TALK

Learning representations of temporal data:
- existing methods and challenges faced
- recent methods inspired by deep learning and representation learning

Applications: in particular, modeling human pose and activity
- highly structured data, e.g. motion capture
- weakly structured data, e.g. video

(Figure: convolutional GRBM architecture with input X, output Y, feature layer Z^k and pooling layer P^k, revisited later in the talk)

OUTLINE

Learning representations from sequences: existing methods and their challenges

Composable, distributed-state models for sequences: Conditional Restricted Boltzmann Machines and their variants

Using learned representations to analyze video: a brief (and incomplete) survey of deep learning for activity recognition

TIME SERIES DATA

Time is an integral part of many human behaviours (motion, reasoning). In building statistical models, time is sometimes ignored, which is often problematic. Models that do incorporate dynamics often fail to account for the fact that data is typically high-dimensional, nonlinear, and contains long-range dependencies.

Today we will discuss a number of models that have been developed to address these challenges.

(Graphic: intensity (no. of stories) over time; David McCandless, informationisbeautiful.net)

VECTOR AUTOREGRESSIVE MODELS

$v_t = b + \sum_{m=1}^{M} A_m v_{t-m} + e_t$

Have dominated statistical time-series analysis for approximately 50 years
Can be fit easily by least-squares regression
Can fail even for simple nonlinearities present in the system
- but many data sets can be modeled well by a linear system
Well understood; many extensions exist
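As a concrete illustration, here is a minimal NumPy sketch (not from the talk) of fitting a VAR model by least squares; the data, model order, and coefficients are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, M = 500, 3, 2                      # time steps, dimensionality, model order

# Simulate a toy stable VAR(2) process: v_t = b + A_1 v_{t-1} + A_2 v_{t-2} + e_t
A = [0.5 * np.eye(D), -0.2 * np.eye(D)]  # assumed ground-truth coefficients
b = np.zeros(D)
v = np.zeros((T, D))
for t in range(M, T):
    v[t] = b + sum(A[m] @ v[t - 1 - m] for m in range(M)) + 0.1 * rng.standard_normal(D)

# Least-squares fit: regress v_t on [1, v_{t-1}, v_{t-2}]
X = np.hstack([np.ones((T - M, 1))] + [v[M - 1 - m : T - 1 - m] for m in range(M)])
Y = v[M:]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

b_hat = W[0]
A_hat = [W[1 + m * D : 1 + (m + 1) * D].T for m in range(M)]
print("estimated A_1:\n", np.round(A_hat[0], 2))
```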

MARKOV ("N-GRAM") MODELS

(Graphical model: fully observed chain over v_{t-2}, v_{t-1}, v_t)

Fully observable
Sequential observations may have nonlinear dependence
Derived by assuming sequences have the Markov property:
$p(v_t \mid \{v_1^{t-1}\}) = p(v_t \mid \{v_{t-N}^{t-1}\})$
This leads to the joint:
$p(\{v_1^T\}) = p(\{v_1^N\}) \prod_{t=N+1}^{T} p(v_t \mid \{v_{t-N}^{t-1}\})$
Number of parameters is exponential in N!

EXPONENTIAL INCREASE IN PARAMETERS

$|\theta| = Q^{N+1}$. Here, Q = 3, with alphabet {a, b, c}.

1st-order Markov (N = 1): a table of 9 conditionals, p(a|a), p(b|a), p(c|a), ..., p(c|c).

2nd-order Markov (N = 2): a table of 27 conditionals, one distribution p(· | {·,·}) per ordered pair of previous symbols.
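A quick way to see the $Q^{N+1}$ growth is to enumerate the contexts directly; this short sketch (not from the talk) uses the slide's alphabet of size Q = 3.

```python
from itertools import product

alphabet = ["a", "b", "c"]          # Q = 3, as in the slide
Q = len(alphabet)

for N in (1, 2, 3):
    # one conditional distribution p(. | context) per length-N context
    contexts = list(product(alphabet, repeat=N))
    n_params = Q ** (N + 1)         # Q entries per context, Q**N contexts
    print(f"N={N}: {len(contexts)} contexts, {n_params} table entries")
```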

HIDDEN MARKOV MODELS (HMM)

(Graphical model: hidden chain h_{t-2} -> h_{t-1} -> h_t, each h_t emitting the observation v_t)

Introduces a hidden state that controls the dependence of the current observation on the past

Successful in speech & language modeling, biology

Defined by 3 sets of parameters:
- Initial state parameters, π
- Transition matrix, A
- Emission distribution, p(v_t | h_t)

Factored joint distribution:
$p(\{h_t\}, \{v_t\}) = p(h_1)\, p(v_1 \mid h_1) \prod_{t=2}^{T} p(h_t \mid h_{t-1})\, p(v_t \mid h_t)$

INFERENCE AND LEARNING

Typically there are three tasks we want to perform in an HMM:
- Likelihood estimation: computing $p(\{v_1, \dots, v_T\} \mid \theta)$
- Inference: inferring the hidden states $\{h_1, \dots, h_T\}$ given the observations
- Learning: estimating the parameters $\theta$ from data

All are exact and tractable due to the simple structure of the HMM

(Figure: three copies of the HMM graphical model, one per task)
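For the first task, here is a minimal sketch (not from the talk) of the forward algorithm with scaling; the toy parameters below are made up.

```python
import numpy as np

def hmm_log_likelihood(pi, A, B, obs):
    """pi: (K,) initial probs; A: (K,K) transitions A[i,j] = p(h_t=j | h_{t-1}=i);
    B: (K,V) emissions B[k,v] = p(v | h=k); obs: list of observation indices."""
    alpha = pi * B[:, obs[0]]                   # alpha_1(k) = p(h_1=k) p(v_1 | h_1=k)
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]  # propagate, then weight by emission
        c = alpha.sum()                         # scaling factor = p(v_t | v_1..v_{t-1})
        log_lik += np.log(c)
        alpha /= c
    return log_lik

# Toy 2-state, 3-symbol HMM
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(hmm_log_likelihood(pi, A, B, [0, 1, 2, 2, 1]))
```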

LIMITATIONS OF HMMS

Many high-dimensional data sets contain rich componential structure

Hidden Markov Models cannot model such data efficiently: a single, discrete K-state multinomial must represent the entire history of the time series

To model K bits of information, they need 2^K hidden states

We seek models with distributed hidden state:
- capacity linear in the number of components

LINEAR DYNAMICAL SYSTEMS

(Graphical model: same chain structure as the HMM)

Graphical model is the same as the HMM, but with real-valued state vectors

Characterized by linear-Gaussian dynamics and observations:
$p(h_t \mid h_{t-1}) = \mathcal{N}(h_t;\, A h_{t-1},\, Q)$
$p(v_t \mid h_t) = \mathcal{N}(v_t;\, C h_t,\, R)$

Inference is performed using Kalman smoothing (belief propagation)
Learning can be done by EM
Dynamics and observations may also depend on an observed input (control)
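A minimal sketch (not from the talk) of one step of Kalman filtering, the forward recursion underlying Kalman smoothing; the matrices and observations below are illustrative only.

```python
import numpy as np

def kalman_filter_step(mu, P, v, A, C, Q, R):
    """mu, P: posterior mean/covariance of h_{t-1}; v: new observation v_t."""
    # Predict under p(h_t | h_{t-1}) = N(A h_{t-1}, Q)
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update with the observation under p(v_t | h_t) = N(C h_t, R)
    S = C @ P_pred @ C.T + R                  # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
    mu_new = mu_pred + K @ (v - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new

# Toy 2-d state, 1-d observation
A = np.array([[1.0, 0.1], [0.0, 1.0]]); Q = 0.01 * np.eye(2)
C = np.array([[1.0, 0.0]]);             R = 0.1 * np.eye(1)
mu, P = np.zeros(2), np.eye(2)
for v in [0.2, 0.35, 0.5]:
    mu, P = kalman_filter_step(mu, P, np.array([v]), A, C, Q, R)
print(mu)
```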

LATENT REPRESENTATIONS FOR REAL-WORLD DATA

Data for many real-world problems (e.g. vision, motion capture) is high-dimensional, containing complex nonlinear relationships between components

Hidden Markov Models
- Pro: complex, nonlinear emission model
- Con: a single K-state multinomial represents the entire history

Linear Dynamical Systems
- Pro: state can convey much more information
- Con: emission model is constrained to be linear

LEARNING DISTRIBUTED REPRESENTATIONS

Simple networks are capable of discovering useful and interesting internal representations of static data (e.g. many of the talks so far!)

Can we learn, in a similar way, representations of temporal data?

Simple idea: a spatial representation of time (e.g. a spectrogram):
- Needs a buffer; not biologically plausible
- Cannot process inputs of differing length
- Cannot distinguish between absolute and relative position

This motivates an implicit representation of time in connectionist models, where time is represented by its effect on processing

A RICH HISTORY

Elman networks: time-delayed context units, truncated BPTT (Elman, 1990), (Jordan, 1986)

Mean-field Boltzmann Machines through Time: inference is approximate, learning less efficient than HMMs (Williams and Hinton, 1990)

Spiking Boltzmann Machines: hidden-state dynamics and smoothness constraints on observed data (Hinton and Brown, 2000)

RECURRENT NEURAL NETWORKS

(Figure from Martens and Sutskever: an RNN unrolled in time, with inputs v_t, hidden states h_t, and predictions ŷ_t)

Neural network replicated in time

At each step, the network receives an input vector, updates its internal representation via nonlinear activation functions, and makes a prediction:

$x_t = W_{hv} v_t + W_{hh} h_{t-1} + b_h$
$h_{j,t} = f(x_{j,t})$
$s_t = W_{yh} h_t + b_y$
$\hat{y}_{k,t} = g(s_{k,t})$
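A minimal NumPy sketch (not from the talk) of this forward pass; the shapes, the tanh/logistic choices for f and g, and the random parameters are assumptions for illustration.

```python
import numpy as np

def rnn_forward(V, W_hv, W_hh, W_yh, b_h, b_y, f=np.tanh):
    """V: (T, n_in) input sequence. Returns hidden states and predictions."""
    n_hid = W_hh.shape[0]
    h = np.zeros(n_hid)
    H, Y = [], []
    for v_t in V:
        x = W_hv @ v_t + W_hh @ h + b_h     # x_t = W_hv v_t + W_hh h_{t-1} + b_h
        h = f(x)                             # h_t = f(x_t)
        s = W_yh @ h + b_y                   # s_t = W_yh h_t + b_y
        y_hat = 1.0 / (1.0 + np.exp(-s))     # g = logistic output (an assumption)
        H.append(h); Y.append(y_hat)
    return np.array(H), np.array(Y)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 8, 2, 10
V = rng.standard_normal((T, n_in))
H, Y = rnn_forward(V,
                   0.1 * rng.standard_normal((n_hid, n_in)),
                   0.1 * rng.standard_normal((n_hid, n_hid)),
                   0.1 * rng.standard_normal((n_out, n_hid)),
                   np.zeros(n_hid), np.zeros(n_out))
print(Y.shape)  # (10, 2)
```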

TRAINING RECURRENT NEURAL NETWORKS

A possibly high-dimensional, distributed internal representation and nonlinear dynamics allow the model, in theory, to model complex time series

Exact gradients can be computed via Backpropagation Through Time

It is an interesting and powerful model. What's the catch?
- Training RNNs via gradient descent fails on simple problems
- Attributed to vanishing or exploding gradients
- Much work in the 1990s focused on identifying and addressing these issues: none of these methods were widely adopted

Best-known attempts to resolve the problem of RNN training:
- Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997)
- Echo-State Networks (ESN) (Jaeger and Haas 2004)

(Figure adapted from James Martens)

FAILURE OF GRADIENT DESCENT

Two hypotheses for why gradient descent fails for RNNs:
- increased frequency and severity of bad local minima
- pathological curvature, like the type seen in the Rosenbrock function:
$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$

(Figures from James Martens)

SECOND ORDER METHODS

Model the objective function by the local approximation:
$f(\theta + p) \approx q_\theta(p) = f(\theta) + \nabla f(\theta)^T p + \tfrac{1}{2} p^T B p$
where p is the search direction and B is a matrix which quantifies curvature.

In Newton's method, B = H is the Hessian matrix.

By taking the curvature information into account, Newton's method rescales the gradient so it is a much more sensible direction to follow.

Not feasible for high-dimensional problems!

(Figure from James Martens)
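To make the quadratic model concrete, here is a small sketch (not from the talk) of plain Newton steps on the Rosenbrock function from the previous slide; the starting point and iteration count are arbitrary.

```python
import numpy as np

def f(w):
    x, y = w
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def grad(w):
    x, y = w
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def hess(w):
    x, y = w
    return np.array([[2 - 400 * (y - 3 * x ** 2), -400 * x],
                     [-400 * x,                    200.0]])

w = np.array([-1.2, 1.0])                    # a classic starting point
for _ in range(10):
    p = np.linalg.solve(hess(w), -grad(w))   # Newton direction: B p = -grad, B = H
    w = w + p
print(w, f(w))                               # approaches the minimum at (1, 1)
```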

HESSIAN-FREE OPTIMIZATION

Based on exploiting two simple ideas (and some additional "tricks"):

For an n-dimensional vector d, the Hessian-vector product Hd can easily be computed using finite differences, at the cost of a single extra gradient evaluation
- In practice, the R-operator (Pearlmutter 1994) is used instead of finite differences

There is a very effective algorithm for optimizing quadratic objectives which requires only Hessian-vector products: linear conjugate gradient (CG)

This method was shown to effectively train RNNs on the pathological long-term dependency problems they were previously unable to solve (Martens and Sutskever 2011)

RNN demo code (using Theano):
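The two ingredients can be demonstrated in a few lines. This sketch (not from the talk, and unrelated to the Theano demo code mentioned above) computes Hessian-vector products by finite differences of the gradient and feeds them to linear CG on a toy quadratic.

```python
import numpy as np

def hvp_fd(grad_fn, w, d, eps=1e-6):
    """Approximate H(w) d with one extra gradient evaluation:
    H d ~ (grad(w + eps d) - grad(w)) / eps."""
    return (grad_fn(w + eps * d) - grad_fn(w)) / eps

def conjugate_gradient(matvec, b, n_iters=50, tol=1e-10):
    """Solve A x = b given only the map x -> A x (A symmetric positive definite)."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy quadratic f(w) = 0.5 w^T A w - c^T w, so grad f = A w - c and H = A.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)
c = rng.standard_normal(20)
grad_fn = lambda w: A @ w - c

w0 = np.zeros(20)
g = grad_fn(w0)
step = conjugate_gradient(lambda d: hvp_fd(grad_fn, w0, d), -g)  # solve H p = -g
print(np.allclose(w0 + step, np.linalg.solve(A, c), atol=1e-4))  # True on a quadratic
```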

GENERATIVE MODELS WITH DISTRIBUTED STATE

Many sequences are high-dimensional and have complex structure:
- music, human motion, weather/climate data
- RNNs simply predict the expected value at the next time step
- they can't capture multi-modality

Generative models (like Restricted Boltzmann Machines) can capture complex distributions

Use binary hidden state and gain the best of the HMM & LDS:
- the nonlinear dynamics and observation model of the HMM, without the limited hidden state
- the efficient, expressive state of the LDS, without the linear-Gaussian restriction on dynamics and observations

DISTRIBUTED BINARY HIDDEN STATE

Using distributed binary representations for hidden state in directed models of time series makes inference difficult. But we can:
- Use a Restricted Boltzmann Machine (RBM) for the interactions between hidden and visible variables. A factorial posterior makes inference and sampling easy.
- Treat the visible variables in the previous time slice as additional fixed inputs

(Figure: hidden variables (factors) at time t connected to visible variables (observations) at time t)

One typically uses binary logistic units for both visibles and hiddens:
$p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i W_{ij})$
$p(v_i = 1 \mid h) = \sigma(b_i + \sum_j h_j W_{ij})$
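A minimal sketch (not from the talk) of these factorial conditionals and one step of block Gibbs sampling; the sizes and random parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid = 6, 4
W = 0.1 * rng.standard_normal((n_vis, n_hid))   # pairwise weights W_ij
b_vis = np.zeros(n_vis)                          # visible biases b_i
b_hid = np.zeros(n_hid)                          # hidden biases b_j

def p_h_given_v(v):
    return sigma(b_hid + v @ W)      # p(h_j = 1 | v)

def p_v_given_h(h):
    return sigma(b_vis + h @ W.T)    # p(v_i = 1 | h)

v = rng.integers(0, 2, n_vis).astype(float)       # a binary visible vector
ph = p_h_given_v(v)                               # exact factorial posterior
h = (rng.random(n_hid) < ph).astype(float)        # sample the hiddens
v_recon = (rng.random(n_vis) < p_v_given_h(h)).astype(float)  # sample a reconstruction
print(ph, v_recon)
```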

MODELING OBSERVATIONS WITH AN RBM

So the distributed binary latent (hidden) state of an RBM lets us:
- Model complex, nonlinear dynamics
- Easily and exactly infer the latent binary state given the observations

But RBMs treat data as static (i.i.d.)

(Figure: hidden variables (factors) at time t and visible variables (joint angles) at time t)

CONDITIONAL RESTRICTED BOLTZMANN MACHINES
(Taylor, Hinton and Roweis NIPS 2006, JMLR 2011)

Start with a Restricted Boltzmann Machine (RBM)

Add two types of directed connections:
- Autoregressive connections model short-term, linear structure
- History can also influence dynamics through the hidden layer

Conditioning does not change inference or learning

(Figure: hidden layer and visible layer, with directed connections from the recent history)

CONTRASTIVE DIVERGENCE LEARNING IN A CRBM

(Figure: one step of contrastive divergence with the history held fixed; iter = 0 (data) gives <v_i h_j>_data and iter = 1 (reconstruction) gives <v_i h_j>_recon)

When updating visible and hidden units, we implement directed connections by treating data from previous time steps as a dynamically changing bias

Inference and learning do not change
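A simplified sketch (not the authors' implementation) of a CD-1 update in a CRBM, where the history enters only through dynamically changing biases; the shapes, binary/mean-field unit choices, and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid, n_hist = 5, 8, 10          # n_hist = model order * n_vis, flattened history
W = 0.01 * rng.standard_normal((n_vis, n_hid))    # RBM weights
A = 0.01 * rng.standard_normal((n_hist, n_vis))   # autoregressive: history -> visibles
B = 0.01 * rng.standard_normal((n_hist, n_hid))   # history -> hiddens
b_vis = np.zeros(n_vis); b_hid = np.zeros(n_hid)
lr = 0.01

def cd1_update(v0, hist):
    # Directed connections act as per-example dynamic biases
    dyn_vis = b_vis + hist @ A
    dyn_hid = b_hid + hist @ B
    ph0 = sigma(dyn_hid + v0 @ W)                   # positive phase
    h0 = (rng.random(n_hid) < ph0).astype(float)
    v1 = sigma(dyn_vis + h0 @ W.T)                  # one-step reconstruction
    ph1 = sigma(dyn_hid + v1 @ W)                   # negative phase
    # Approximate gradients: <v h>_data - <v h>_recon, and likewise for A, B
    dW = np.outer(v0, ph0) - np.outer(v1, ph1)
    dA = np.outer(hist, v0 - v1)
    dB = np.outer(hist, ph0 - ph1)
    return lr * dW, lr * dA, lr * dB

v0 = rng.random(n_vis)                    # current frame (e.g. joint angles)
hist = rng.random(n_hist)                 # concatenated previous frames
dW, dA, dB = cd1_update(v0, hist)
W += dW; A += dA; B += dB
```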

STACKING: THE CONDITIONAL DEEP BELIEF NETWORK

Learn a CRBM

Now, treat the sequence of hidden units as fully observed data and train a second CRBM

The composition of CRBMs is a conditional deep belief net

It can be fine-tuned generatively or discriminatively

(Figure: a two-layer stack with hidden layers h^1_t and h^0_t above the visible layer, conditioned on the recent history h^0_{t-1}, h^0_{t-2})

MOTION SYNTHESIS WITH A 2-LAYER CDBN

Model is trained on ~8000 frames of 60 fps data (49 dimensions)
10 styles of walking: cat, chicken, dinosaur, drunk, gangly, graceful, normal, old-man, sexy and strong
600 binary hidden units per layer
< 1 hour training on a modern workstation

MODELING CONTEXT

A single model was trained on 10 styled walks from CMU subject 137

The model can generate each style based on initialization

We cannot prevent or control transitioning

How to blend styles?

Style or person labels can be provided as part of the input to the top layer

(Figure: the two-layer conditional model, with label units added at the top layer)

HOW TO MAKE CONTEXT INFLUENCE DYNAMICS?

MULTIPLICATIVE INTERACTIONS

Let latent variables act like gates that dynamically change the connections between other variables

This amounts to letting variables multiply connections between other variables: three-way multiplicative interactions

Recently used in the context of learning correspondence between images (Memisevic & Hinton 2007, 2010), but with a long history before that

Roland Memisevic has a nice tutorial and review paper on the subject

(Figure: three-way interaction among h_j, z_k and v_i)
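A minimal sketch (not from the talk) of a three-way interaction, where a context vector z blends a stack of pairwise weight matrices into one effective matrix between v and h; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, n_z = 5, 4, 3
W = 0.1 * rng.standard_normal((n_v, n_h, n_z))   # one pairwise slice W[:, :, k] per unit of z

v = rng.random(n_v)
z = rng.random(n_z)                               # gating / context variables

W_eff = np.einsum('ijk,k->ij', W, z)              # z gates the effective v-h connections
p_h = sigma(v @ W_eff)                            # hidden activations under the gated weights
print(W_eff.shape, p_h)
```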

GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
Memisevic & Hinton (2007)

(Figure: two views of the same model; latent variables h_j interact multiplicatively with the input v_i and the output z_k)

INFERRING OPTICAL FLOW: IMAGE ANALOGIES

Toy images (Memisevic & Hinton 2006). There is no structure in these images, only in how they change. The model can infer the optical flow from a pair of images and apply it to a random image.

(Figure panels: input, output, inferred flow field, new input, applied transformation)

BACK TO MOTION STYLE

Introduce a set of latent context variables whose value is known at training time

In our example, these represent motion style, but they could also represent height, weight, gender, etc.

The contextual variables gate every existing pairwise connection in our model

(Figure: context variables z_k gating the connections between h_j and v_i)

LEARNING AND INFERENCE

Learning and inference remain almost the same as in the standard CRBM

We can think of the context or style variables as blending in a whole sub-network

This allows us to share parameters across styles but selectively adapt dynamics

(Figure: the gated model with h_j, z_k and v_i)

SUPERVISED MODELING OF STYLE
(Taylor, Hinton and Roweis ICML 2009, JMLR 2011)

(Figure, built up over several steps: an input layer (e.g. data at time t-1:t-n) and an output layer (e.g. data at time t) are connected to a hidden layer j; style features l are then added and gate the connections)

OVERPARAMETERIZATION

Note: the weight matrix W^{v,h} has been replaced by a tensor W^{v,h,z} (likewise for the other weights)

The number of parameters is O(N^3) per group of weights; more, if we want sparse, overcomplete hiddens

However, there is a simple yet powerful solution!

(Figure: input layer (data at time t-1:t-n), output layer (data at time t), hidden layer j, style features l)

FACTORING

Each three-way weight tensor is factored into three matrices, one per group of variables:
$W^{vh}_{ijl} = \sum_f W^v_{if} W^h_{jf} W^z_{lf}$

(Figure adapted from Roland Memisevic: hidden layer j, style features l, and output layer (e.g. data at time t), connected through factors f)
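A minimal sketch (not from the talk) showing that the factored form reproduces the full-tensor computation without ever building the O(N^3) tensor; sizes and parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, n_z, n_f = 20, 30, 10, 15   # visibles, hiddens, style features, factors

Wv = 0.1 * rng.standard_normal((n_v, n_f))
Wh = 0.1 * rng.standard_normal((n_h, n_f))
Wz = 0.1 * rng.standard_normal((n_z, n_f))

v = rng.random(n_v)
z = rng.random(n_z)                    # style / context features

# Factored computation: project each group onto the factors, multiply, project back.
hidden_input_factored = ((v @ Wv) * (z @ Wz)) @ Wh.T          # shape (n_h,)

# Equivalent computation through the full tensor W_ijl = sum_f Wv_if Wh_jf Wz_lf.
W_full = np.einsum('if,jf,lf->ijl', Wv, Wh, Wz)
hidden_input_full = np.einsum('ijl,i,l->j', W_full, v, z)
print(np.allclose(hidden_input_factored, hidden_input_full))  # True
```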

SUPERVISED MODELING OF STYLE
(Taylor, Hinton and Roweis ICML 2009, JMLR 2011)

(Figure, built up again: the input layer (data at time t-1:t-n), output layer (data at time t), hidden layer j and style features l are now connected through a layer of factors)

PARAMETER SHARING

MOTION SYNTHESIS: FACTORED 3RD-ORDER CRBM

Same 10-styles dataset
600 binary hidden units
Deterministic factors
100 real-valued style features
< 1 hour training on a modern workstation
Synthesis is real-time

QUANTITATIVE EVALUATION

It is not computationally tractable to compute likelihoods; Annealed Importance Sampling will not work in conditional models (open problem)

We can evaluate predictive power (even though the model has been trained generatively)

We can also evaluate on denoising tasks

(Plot: RMS prediction error in the normalized data space vs. predict-ahead interval, comparing CRBM 600, FCRBM, style FCRBM, unfactored style CRBM 100 and a 6th-order autoregressive model; parameter counts in parentheses)

3D CONVNETS FOR ACTIVITY RECOGNITION
Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu (ICML 2010)

One approach: treat video frames as still images (LeCun et al. 2005)

Alternatively, perform 3D convolution so that discriminative features across space and time are captured

Multiple convolutions are applied to contiguous frames to extract multiple features

(Images from Ji et al.: (a) 2D convolution vs. (b) 3D convolution along the temporal dimension)
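A minimal sketch (not from the paper's implementation) of a single valid 3D convolution over a video volume, illustrating how one filter produces features that span both space and time; the video and filter here are random.

```python
import numpy as np

def conv3d(video, kernel):
    """video: (T, H, W); kernel: (kt, kh, kw). Returns the valid 3D correlation."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t+kt, y:y+kh, x:x+kw] * kernel)
    return out

video = np.random.default_rng(0).random((10, 32, 32))    # 10 frames of 32x32 pixels
kernel = np.random.default_rng(1).random((3, 7, 7))       # 7x7 spatial, 3 temporal
features = conv3d(video, kernel)
print(features.shape)   # (8, 26, 26): the feature map spans space and time
```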

3D CNN ARCHITECTURE

(Image from Ji et al.: input -> H1 (hardwired) -> C2 (7x7x3 3D convolution) -> S3 (2x2 subsampling) -> C4 (7x6x3 3D convolution) -> S5 (3x3 subsampling) -> C6 (7x4 convolution) -> full connection -> action units)

- H1 is hardwired to extract 5 channels: grayscale, grad-x, grad-y, flow-x, flow-y
- C2: 2 different 3D filters applied to each of the 5 blocks independently
- S3: subsample spatially
- C4: 3 different 3D filters applied to each of the 5 channels in 2 blocks
- Followed by two fully-connected layers and the action units

3D CONVNET: DISCUSSION

Good performance on TRECVID surveillance data (CellToEar, ObjectPut, Pointing)
Good performance on KTH actions (box, handwave, handclap, jog, run, walk)
Still a fair amount of engineering: person detection (TRECVID), foreground extraction (KTH), hard-coded first layer

(Image from Ji et al.)

LEARNING FEATURES FOR VIDEO UNDERSTANDING

Most work on unsupervised feature extraction has concentrated on static images

We propose a model that extracts motion-sensitive features from pairs of images

Existing attempts (e.g. Memisevic & Hinton 2007, Cadieu & Olshausen 2009) ignore the pictorial structure of the input, and are thus limited to modeling small image patches

(Figure: an image pair mapped to transformation feature maps)

GATED RESTRICTED BOLTZMANN MACHINES (GRBM)
Memisevic & Hinton (2007)

(Figure, repeated from earlier: two views of the three-way model relating latent variables h_j, input v_i and output z_k)

CONVOLUTIONAL GRBM
Graham Taylor, Rob Fergus, Yann LeCun, and Chris Bregler (ECCV 2010)

Like the GRBM, captures third-order interactions
Shares weights at all locations in an image
As in a standard RBM, exact inference is efficient
Inference and reconstruction are performed through convolution operations

(Figure: input X and output Y images, feature layer Z^k with units z^k_{m,n}, and pooling layer P^k with units p^k_α)

MORE COMPLEX EXAMPLE OF ANALOGIES

(Figure: from an input/output image pair, the model infers feature maps encoding the transformation; applying them to a novel input produces the analogous output, shown against the ground truth)

HUMAN ACTIVITY: KTH ACTIONS DATASET

We learn 32 feature maps; 6 are shown here
KTH contains 25 subjects performing 6 actions under 4 conditions
The only preprocessing is local contrast normalization

(Figure: feature maps z^k over time for hand clapping (above) and walking (below); motion-sensitive features (1, 3), edge features (4), a segmentation operator (6))

ACTIVITY RECOGNITION: KTH

Prior art                Acc. (%)    Convolutional architectures              Acc. (%)
HOG3D+KM+SVM             85.3        convGRBM+3D-convnet+logistic reg
HOG/HOF+KM+SVM           86.1        convGRBM+3D-convnet+MLP                  90.0
HOG+KM+SVM                           3D-convnet+3D-convnet+logistic reg
HOF+KM+SVM                           3D-convnet+3D-convnet+MLP                79.5

Compared to methods that do not use explicit interest point detection
State of the art: 92.1% (Laptev et al. 2008), 93.9% (Le et al. 2011)
The other reported result on 3D convnets uses a different evaluation scheme

ACTIVITY RECOGNITION: HOLLYWOOD 2

12 classes of human action extracted from 69 movies (20 hours)
Much more realistic and challenging than KTH (changing scenes, zoom, etc.)
Performance is evaluated by mean average precision over classes

Method                                Average Prec.
Prior art (Wang et al. survey 2009):
  HOG3D+KM+SVM                        45.3
  HOG/HOF+KM+SVM                      47.4
  HOG+KM+SVM                          39.4
  HOF+KM+SVM                          45.5
Our method:
  GRBM+SC+SVM

SUMMARY

Learning distributed representations of sequences

For high-dimensional, multi-modal data: CRBM, FCRBM

Activity recognition: 2 methods

(Figure: the factored CRBM and the convolutional GRBM architectures)

The University of Guelph is not in Belgium!

(Map: Montreal, Guelph, Toronto, New York)


More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Recurrent neural networks

Recurrent neural networks 12-1: Recurrent neural networks Prof. J.C. Kao, UCLA Recurrent neural networks Motivation Network unrollwing Backpropagation through time Vanishing and exploding gradients LSTMs GRUs 12-2: Recurrent neural

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Spring 2018 http://vllab.ee.ntu.edu.tw/dlcv.html (primary) https://ceiba.ntu.edu.tw/1062dlcv (grade, etc.) FB: DLCV Spring 2018 Yu-Chiang Frank Wang 王鈺強, Associate Professor

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

Multimodal context analysis and prediction

Multimodal context analysis and prediction Multimodal context analysis and prediction Valeria Tomaselli (valeria.tomaselli@st.com) Sebastiano Battiato Giovanni Maria Farinella Tiziana Rotondo (PhD student) Outline 2 Context analysis vs prediction

More information

Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network

More information

Recurrent Neural Networks Deep Learning Lecture 5. Efstratios Gavves

Recurrent Neural Networks Deep Learning Lecture 5. Efstratios Gavves Recurrent Neural Networks Deep Learning Lecture 5 Efstratios Gavves Sequential Data So far, all tasks assumed stationary data Neither all data, nor all tasks are stationary though Sequential Data: Text

More information

Part 2. Representation Learning Algorithms

Part 2. Representation Learning Algorithms 53 Part 2 Representation Learning Algorithms 54 A neural network = running several logistic regressions at the same time If we feed a vector of inputs through a bunch of logis;c regression func;ons, then

More information

Denoising Autoencoders

Denoising Autoencoders Denoising Autoencoders Oliver Worm, Daniel Leinfelder 20.11.2013 Oliver Worm, Daniel Leinfelder Denoising Autoencoders 20.11.2013 1 / 11 Introduction Poor initialisation can lead to local minima 1986 -

More information

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i

More information

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] WHAT ARE NEURAL NETWORKS? William s notation Logis;c regression + 1

More information

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions

Bias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions - Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of

More information

Lecture 11 Recurrent Neural Networks I

Lecture 11 Recurrent Neural Networks I Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks

More information

Hidden CRFs for Human Activity Classification from RGBD Data

Hidden CRFs for Human Activity Classification from RGBD Data H Hidden s for Human from RGBD Data Avi Singh & Ankit Goyal IIT-Kanpur CS679: Machine Learning for Computer Vision April 13, 215 Overview H 1 2 3 H 4 5 6 7 Statement H Input An RGBD video with a human

More information

Deep Learning Architectures and Algorithms

Deep Learning Architectures and Algorithms Deep Learning Architectures and Algorithms In-Jung Kim 2016. 12. 2. Agenda Introduction to Deep Learning RBM and Auto-Encoders Convolutional Neural Networks Recurrent Neural Networks Reinforcement Learning

More information

Slide credit from Hung-Yi Lee & Richard Socher

Slide credit from Hung-Yi Lee & Richard Socher Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on

More information

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009 AN INTRODUCTION TO NEURAL NETWORKS Scott Kuindersma November 12, 2009 SUPERVISED LEARNING We are given some training data: We must learn a function If y is discrete, we call it classification If it is

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Recurrent Neural Network

Recurrent Neural Network Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks

More information

Deep Boltzmann Machines

Deep Boltzmann Machines Deep Boltzmann Machines Ruslan Salakutdinov and Geoffrey E. Hinton Amish Goel University of Illinois Urbana Champaign agoel10@illinois.edu December 2, 2016 Ruslan Salakutdinov and Geoffrey E. Hinton Amish

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

Learning to Disentangle Factors of Variation with Manifold Learning

Learning to Disentangle Factors of Variation with Manifold Learning Learning to Disentangle Factors of Variation with Manifold Learning Scott Reed Kihyuk Sohn Yuting Zhang Honglak Lee University of Michigan, Department of Electrical Engineering and Computer Science 08

More information

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of

More information

Introduction to Restricted Boltzmann Machines

Introduction to Restricted Boltzmann Machines Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,

More information

Reservoir Computing and Echo State Networks

Reservoir Computing and Echo State Networks An Introduction to: Reservoir Computing and Echo State Networks Claudio Gallicchio gallicch@di.unipi.it Outline Focus: Supervised learning in domain of sequences Recurrent Neural networks for supervised

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Chapter 20. Deep Generative Models

Chapter 20. Deep Generative Models Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information