Learning the Linear Dynamical System with ASOS (Approximated Second-Order Statistics)


James Martens
Computer Science, University of Toronto
June 24, 2010

The Linear Dynamical System model

- a model of vector-valued time-series {y_t ∈ R^{N_y}}, t = 1, ..., T
- widely applied due to its predictable behavior, easy inference, etc.
- vector-valued hidden states {x_t ∈ R^{N_x}}, t = 1, ..., T, evolve via linear dynamics:

      x_{t+1} = A x_t + ε_t,    A ∈ R^{N_x × N_x},    ε_t ~ N(0, Q)

- observations are generated linearly from the hidden states:

      y_t = C x_t + δ_t,    C ∈ R^{N_y × N_x},    δ_t ~ N(0, R)
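
To make the generative model concrete, here is a minimal NumPy sketch (not from the talk) that samples a trajectory from an LDS; the parameter values at the bottom are arbitrary illustrative choices:

```python
import numpy as np

def sample_lds(A, C, Q, R, x1, T, rng=None):
    """Sample {x_t, y_t}, t = 1..T, from x_{t+1} = A x_t + eps_t, y_t = C x_t + delta_t."""
    rng = np.random.default_rng() if rng is None else rng
    Nx, Ny = A.shape[0], C.shape[0]
    X = np.zeros((T, Nx))
    Y = np.zeros((T, Ny))
    X[0] = x1
    for t in range(T):
        Y[t] = C @ X[t] + rng.multivariate_normal(np.zeros(Ny), R)
        if t + 1 < T:
            X[t + 1] = A @ X[t] + rng.multivariate_normal(np.zeros(Nx), Q)
    return X, Y

# Example with made-up parameters: a small stable LDS.
rng = np.random.default_rng(0)
Nx, Ny, T = 3, 2, 100
A = 0.9 * np.linalg.qr(rng.standard_normal((Nx, Nx)))[0]   # spectral radius 0.9
C = rng.standard_normal((Ny, Nx))
Q, R = 0.1 * np.eye(Nx), 0.1 * np.eye(Ny)
X, Y = sample_lds(A, C, Q, R, x1=np.zeros(Nx), T=T, rng=rng)
```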

Learning the LDS

Expectation Maximization (EM)
- finds a local optimum of the log-likelihood
- pretty slow: convergence requires many iterations, and the E-step is expensive

Subspace identification
- hidden states are estimated directly from the data, and the parameters from these
- asymptotically unbiased / consistent
- non-iterative algorithm, but the solution is not optimal under any objective
- a good way to initialize EM or other iterative optimizers

Our contribution

- accelerate the EM algorithm by reducing its per-iteration cost to be constant time w.r.t. T (the length of the time-series)
- key idea: approximate the inference done in the E-step
- the E-step approximation is unbiased and asymptotically consistent
- it also converges exponentially with L, where L is a meta-parameter that trades off approximation quality against speed (notation change: L is k_lim from the paper)

Learning via EM: the algorithm

EM objective function: at each iteration we maximize the following objective, where θ_n is the current parameter estimate:

      Q_n(θ) = E_{θ_n}[ log p(x, y | θ) | y ] = ∫ p(x | y, θ_n) log p(x, y | θ) dx

E-step
- computes the expectation of log p(x, y | θ) under p(x | y, θ_n)
- uses the classical Kalman filtering/smoothing algorithm

Learning via EM: the algorithm (cont.)

M-step
- maximize the objective Q_n(θ) w.r.t. θ, producing a new estimate θ_{n+1}:

      θ_{n+1} = arg max_θ Q_n(θ)

- very easy: similar to linear regression (see the sketch below)

Problem
- EM can get very slow when we have lots of data
- this is mainly due to the expensive Kalman filter/smoother call in the E-step: O(N_x^3 T), where T = length of the training time-series and N_x = hidden state dimension
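
The slide doesn't spell the M-step out, but for intuition here is a minimal sketch of the standard regression-style closed-form updates for A and C, assuming the E-step has already produced the summed expected statistics named below (names are illustrative; the Q, R, and initial-state updates are omitted):

```python
import numpy as np

def m_step_A_C(S_xx, S_xx_head, S_x1x, S_yx):
    """Closed-form M-step updates for A and C (regression on expected statistics).

    S_xx      = sum_{t=1}^{T}   E[x_t x_t']        (Nx x Nx)
    S_xx_head = sum_{t=1}^{T-1} E[x_t x_t']        (Nx x Nx)
    S_x1x     = sum_{t=1}^{T-1} E[x_{t+1} x_t']    (Nx x Nx)
    S_yx      = sum_{t=1}^{T}   y_t E[x_t]'        (Ny x Nx)
    """
    # A_new = S_x1x @ inv(S_xx_head): "regress" x_{t+1} on x_t in expectation.
    A_new = np.linalg.solve(S_xx_head.T, S_x1x.T).T
    # C_new = S_yx @ inv(S_xx): regress y_t on x_t in expectation.
    C_new = np.linalg.solve(S_xx.T, S_yx.T).T
    return A_new, C_new
```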

The Kalman filter/smoother

- estimates hidden-state means and covariances:

      x_t^k ≡ E_{θ_n}[ x_t | y_{1:k} ]        V_{t,s}^k ≡ Cov_{θ_n}[ x_t, x_s | y_{1:k} ]

  for each t ∈ {1, ..., T} and s = t, t+1
- these are summed over time to obtain the statistics required for the M-step, e.g.:

      Σ_{t=1}^{T-1} E_{θ_n}[ x_{t+1} x_t' | y_{1:T} ] = (x^T, x^T)_1 + Σ_{t=1}^{T-1} V_{t+1,t}^T

  where (a, b)_k ≡ Σ_{t=1}^{T-k} a_{t+k} b_t' (see the short code sketch after this list)
- but we only care about these M-statistics, not the individual inferences for each time-step - so let's estimate them directly!
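
In code, the (a, b)_k statistic is just a lagged sum of outer products; a small illustrative NumPy helper (the name is my own) might look like:

```python
import numpy as np

def lagged_stat(a, b, k):
    """(a, b)_k = sum_{t=1}^{T-k} a_{t+k} b_t', for sequences stored as (T, dim) arrays."""
    T = a.shape[0]
    # Rows k..T-1 of `a` paired with rows 0..T-1-k of `b` (0-based indexing).
    return a[k:].T @ b[:T - k]

# Example: with T = 5, lagged_stat(a, b, 1) = sum over t = 0..3 of outer(a[t+1], b[t]).
```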

Steady-state

- first we need a basic tool from linear systems/control theory: steady-state
- the covariance terms, and the filtering and smoothing matrices (denoted K_t and J_t), do not depend on the data y - only on the current parameters
- and they rapidly converge to steady-state values:

      V_{t,t}^T, V_{t,t-1}^T, J_t, K_t  →  Λ_0, Λ_1, J, K    as min(t, T - t) → ∞

- we can approximate the Kalman filter/smoother equations using the steady-state matrices
- this gives the highly simplified recurrences

      x_t = H x_{t-1} + K y_t        x_t^T = J x_{t+1}^T + P x_t

  where x_t ≡ x_t^t ≡ E_{θ_n}[ x_t | y_{1:t} ], H ≡ A - KCA and P ≡ I - JA (see the code sketch after this list)
- per time-step, these require no matrix-matrix multiplications or inversions
- we apply the approximate filter/smoother everywhere except the first and last i time-steps
- this yields a run-time of O(N_x^2 T + N_x^3 i)
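
A minimal sketch of these steady-state recurrences, assuming the steady-state matrices H, K, J, P and an initial filtered mean x1 have already been computed from the current parameters:

```python
import numpy as np

def steady_state_smoother(Y, H, K, J, P, x1):
    """Approximate filtered and smoothed means via the steady-state recurrences.

    Filter:   x_t   = H x_{t-1} + K y_t     (t = 2..T)
    Smoother: x_t^T = J x_{t+1}^T + P x_t   (t = T-1..1)
    """
    T = Y.shape[0]
    Nx = H.shape[0]
    x_filt = np.zeros((T, Nx))
    x_filt[0] = x1                    # initial filtered mean, assumed given
    for t in range(1, T):
        x_filt[t] = H @ x_filt[t - 1] + K @ Y[t]
    x_smooth = np.zeros((T, Nx))
    x_smooth[-1] = x_filt[-1]         # x_T^T = x_T
    for t in range(T - 2, -1, -1):
        x_smooth[t] = J @ x_smooth[t + 1] + P @ x_filt[t]
    return x_filt, x_smooth
```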

ASOS: Approximated Second-Order Statistics

- steady-state makes the covariance terms easy to estimate in time independent of T
- we want something similar for the sums of products of means - terms like (x^T, x^T)_0 ≡ Σ_t x_t^T (x_t^T)'
- we will call such sums 2nd-order statistics; the ones needed for the M-step are the M-statistics
- idea #1: derive recursions and equations that relate the 2nd-order statistics of different time-lags
  (time-lag refers to the value of k in (a, b)_k ≡ Σ_{t=1}^{T-k} a_{t+k} b_t')
- idea #2: evaluate these efficiently using approximations

Deriving the 2nd-order recursions/equations: an example

- suppose we wish to find the recursion for (x, y)_k
- the steady-state Kalman recursion for x_{t+k} is: x_{t+k} = H x_{t+k-1} + K y_{t+k}
- right-multiply both sides by y_t' and sum over t
- factor out the matrices H and K
- finally, re-write everything using our special notation for 2nd-order statistics, (a, b)_k ≡ Σ_{t=1}^{T-k} a_{t+k} b_t':

      (x, y)_k ≡ Σ_{t=1}^{T-k} x_{t+k} y_t'
               = Σ_{t=1}^{T-k} ( H x_{t+k-1} y_t' + K y_{t+k} y_t' )
               = H Σ_{t=1}^{T-k} x_{t+k-1} y_t'  +  K Σ_{t=1}^{T-k} y_{t+k} y_t'
               = H ( (x, y)_{k-1} - x_T y_{T-k+1}' ) + K (y, y)_k
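
As a quick sanity check (not part of the talk), this identity can be verified numerically: pick arbitrary H and K, generate any y sequence, run the steady-state filter recursion for x, and compare the two sides for some lag k:

```python
import numpy as np

rng = np.random.default_rng(1)
Nx, Ny, T, k = 3, 2, 50, 4
H = 0.3 * rng.standard_normal((Nx, Nx))   # arbitrary illustrative matrices
K = rng.standard_normal((Nx, Ny))
Y = rng.standard_normal((T, Ny))

# Steady-state filter recursion: x_t = H x_{t-1} + K y_t (t = 2..T).
X = np.zeros((T, Nx))
X[0] = rng.standard_normal(Nx)
for t in range(1, T):
    X[t] = H @ X[t - 1] + K @ Y[t]

def stat(a, b, k):
    return a[k:].T @ b[:a.shape[0] - k]   # (a, b)_k

lhs = stat(X, Y, k)
# (x, y)_k = H ( (x, y)_{k-1} - x_T y_{T-k+1}' ) + K (y, y)_k
rhs = H @ (stat(X, Y, k - 1) - np.outer(X[-1], Y[T - k])) + K @ stat(Y, Y, k)
assert np.allclose(lhs, rhs)
```

The check passes for any choice of H, K, and y, since the identity only relies on x satisfying the filter recursion, not on the data coming from an LDS.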

The complete list (don't bother to memorize this)

The recursions:

    (y, x)_k     = (y, x)_{k+1} H' + ( (y, y)_k - y_{1+k} y_1' ) K' + y_{1+k} x_1'
    (x, y)_k     = H ( (x, y)_{k-1} - x_T y_{T-k+1}' ) + K (y, y)_k
    (x, x)_k     = (x, x)_{k+1} H' + ( (x, y)_k - x_{1+k} y_1' ) K' + x_{1+k} x_1'
    (x, x)_k     = H ( (x, x)_{k-1} - x_T x_{T-k+1}' ) + K (y, x)_k
    (x^T, y)_k   = J (x^T, y)_{k+1} + P ( (x, y)_k - x_T y_{T-k}' ) + x_T^T y_{T-k}'
    (x^T, x)_k   = J (x^T, x)_{k+1} + P ( (x, x)_k - x_T x_{T-k}' ) + x_T^T x_{T-k}'
    (x^T, x^T)_k = ( (x^T, x^T)_{k-1} - x_k^T (x_1^T)' ) J' + (x^T, x)_k P'
    (x^T, x^T)_k = J (x^T, x^T)_{k+1} + P ( (x, x^T)_k - x_T (x_{T-k}^T)' ) + x_T^T (x_{T-k}^T)'

The equations:

    (x, x)_k     = H (x, x)_k H' + ( (x, y)_k - x_{1+k} y_1' ) K' - H x_T x_{T-k}' H' + K (y, x)_{k+1} H' + x_{1+k} x_1'
    (x^T, x^T)_k = J (x^T, x^T)_k J' + P ( (x, x^T)_k - x_T (x_{T-k}^T)' ) - J x_{k+1}^T (x_1^T)' J' + J (x^T, x)_{k+1} P' + x_T^T (x_{T-k}^T)'

- noting that statistics with time-lag T + 1 are 0 by definition, we could start the 2nd-order recursions from the maximum time-lag
- but this doesn't get us anywhere - it would be even more expensive than the usual Kalman recursions on the 1st-order terms
- instead, start the recursions at time-lag L with unbiased approximations (the "ASOS approximations"):

      (y, x)_{L+1} ≈ CA ( (x, x)_L - x_T x_{T-L}' ),    (x^T, x)_L ≈ (x, x)_L,    (x^T, y)_L ≈ (x, y)_L

- we also need x_t^T for t ∈ {1, 2, ..., L} ∪ {T-L, T-L+1, ..., T}, but these can be approximated by a separate procedure (see paper)

Why might this be reasonable?

- 2nd-order statistics with large time-lag quantify relationships between variables that are far apart in time - these are weaker and less important than relationships between variables that are close in time
- in the steady-state Kalman recursions, information is propagated via multiplication by H and J:

      x_t = H x_{t-1} + K y_t        x_t^T = J x_{t+1}^T + P x_t

- both of these matrices have spectral radius (denoted σ(·)) less than 1, and so they decay the signal exponentially

Procedure for estimating the M-statistics

- how do we compute an estimate of the M-statistics consistent with the 2nd-order recursions/equations and the approximations?
- essentially it is just a large linear system of dimension O(N_x^2 L)
- but using a general solver would be far too expensive: O(N_x^6 L^3)
- fortunately, using the special structure of this system, we have developed a (non-trivial) algorithm which is much more efficient:
  - the equations can be solved using an efficient iterative algorithm we developed for a generalization of the Sylvester equation
  - evaluating the recursions is then straightforward
- the cost is then just O(N_x^3 L) after (y, y)_k ≡ Σ_t y_{t+k} y_t' has been pre-computed for k = 0, ..., L (see the sketch below)
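
The pre-computation of the data statistics (y, y)_k is the only pass over the full time-series, and it is done once before the EM iterations begin; a minimal sketch:

```python
import numpy as np

def precompute_yy(Y, L):
    """Pre-compute (y, y)_k = sum_{t=1}^{T-k} y_{t+k} y_t' for k = 0..L.

    This is the only step whose cost grows with T; each subsequent EM
    iteration works only with these L + 1 small Ny x Ny matrices
    (plus boundary terms), independently of T.
    """
    T = Y.shape[0]
    return [Y[k:].T @ Y[:T - k] for k in range(L + 1)]
```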

First convergence result: our intuition confirmed

First result: for a fixed θ, the l2-error in the M-statistics is bounded by a quantity proportional to L^2 λ^{L-1}, where λ = σ(H) = σ(J) < 1 (σ(·) denotes the spectral radius)

- so as L grows, the estimation error for the M-statistics decays exponentially
- but λ might be close enough to 1 that L would have to be made impractically large
- fortunately we have a 2nd result which provides a very different type of guarantee

Second convergence result

Second result: the updates produced by the ASOS procedure are asymptotically consistent with the usual EM updates in the limit as T → ∞ (this assumes the data is generated from the model)

- the first result didn't make strong use of any property of the approximations (we could use 0 for each and the result would still hold)
- this second result is justified in the opposite way: it makes strong use of the approximations
- it follows from the convergence of the (1/T)-scaled expected l2-error of the approximations towards zero
- the result holds for any value of L

Experiments

- we considered 3 real datasets of varying sizes and dimensionality: evaporator, motion capture, and warship sounds
- each algorithm was initialized from the same random parameters
- the latent dimension N_x was determined by trial and error

Experimental parameters:

      Name              length (T)    N_y    N_x
      evaporator
      motion capture
      warship sounds

Experimental results (cont.)

[Plot: negative log-likelihood vs. iteration for ASOS-EM with L = 5, 10, 20, 35, and 75, SS-EM, and EM]

Experimental results (cont.)

[Plot: negative log-likelihood vs. iteration for ASOS-EM with L = 20, 30, and 50, SS-EM, and EM]

Experimental results (cont.)

[Plot: negative log-likelihood vs. iteration for ASOS-EM with L = 150, 300, 850, and 1700, and SS-EM]

Conclusion

- we applied steady-state approximations to derive a set of 2nd-order recursions and equations
- approximated statistics of time-lag L
- produced an efficient algorithm for solving the resulting system

Per-iteration run-times:

      EM              SS-EM                      ASOS-EM
      O(N_x^3 T)      O(N_x^2 T + N_x^3 i)       O(N_x^3 L)    (L is k_lim in the paper)

- gave 2 formal convergence results
- demonstrated significant performance benefits for learning with long time-series

Thank you for your attention
