Probabilistic Time Series Classification

1 Probabilistic Time Series Classification
Y. Cem Sübakan, Boğaziçi University. M.Sc. Thesis Defense.

2 Problem Statement
The goal is to assign "similar" sequences to the same class.

3 Problem Statement
This is not a trivial task; there is variability within each class.

4 Problem Statement
Data in real-life applications is even more challenging, so we make a computer program "learn" how to classify sequences.

5-7 Goal of this thesis
The goal of this thesis is to understand and develop learning algorithms for supervised and unsupervised time series classification models. We propose two novel learning algorithms: one for mixtures of Markov models and one for mixtures of HMMs.
[Plate diagrams of the two models: the mixture of Markov models with cluster indicator $h_n$, observations $x_{1:T_n,n}$, and transition matrices $A_k$ for $k = 1, \dots, K$; and the mixture of HMMs, which adds latent state sequences $r_{1:T_n,n}$ and emission parameters $O_k$.]

8-10 Achievements
A new spectral learning algorithm for mixtures of Markov models, based on learning a mixture of Dirichlet distributions.
A new spectral-learning-based algorithm for mixtures of HMMs: faster than the conventional EM approach, and possible to generalize to an infinite mixture of HMMs.
A survey of discriminative versions of Markov models and HMMs.

11 Outline
1. Learning a Latent Variable Model with ML: Expectation Maximization (EM) algorithm, description and drawbacks
2. Method of Moments based learning algorithms: Method of Moments vs ML; Spectral Learning for LVMs
3. Spectral Learning for Mixture of Markov models: derivation of an algorithm with existing methods; derivation of a superior learning scheme; results
4. Spectral Learning based algorithm for Mixture of HMMs: for finite mixtures; for infinite mixtures; results
5. Discriminative vs Generative Time Series Models

12 EM for latent variable models
The optimization problem to train a latent variable model:
$\theta^* = \arg\max_\theta p(x \mid \theta) = \arg\max_\theta \sum_h p(x, h \mid \theta)$
where $x$ are the observations (sequences), $h$ are the latent variables (cluster indicators / latent state sequences), and $\theta$ are the model parameters. The summation is over all possible configurations of $h$, which is intractable in practice.

13-15 EM for latent variable models
The idea is to construct a lower bound on $\log p(x \mid \theta)$ which is easier to maximize:
$\log p(x \mid \theta) = \log \sum_h p(x, h \mid \theta) = \log \sum_h q(h)\,\frac{p(x, h \mid \theta)}{q(h)} = \log \mathbb{E}_{q(h)}\!\left[\frac{p(x, h \mid \theta)}{q(h)}\right]$
Then, using Jensen's inequality,
$\log \mathbb{E}_{q(h)}\!\left[\frac{p(x, h \mid \theta)}{q(h)}\right] \ge \mathbb{E}_{q(h)}\!\left[\log \frac{p(x, h \mid \theta)}{q(h)}\right]$
so
$\log p(x \mid \theta) \ge \mathbb{E}_{q(h)}[\log p(x, h \mid \theta)] - \mathbb{E}_{q(h)}[\log q(h)] =: \mathcal{Q}(\theta)$
$\mathcal{Q}(\theta)$ is easier to maximize than $p(x \mid \theta)$.

16 EM for latent variable models
EM Algorithm: initialize $\theta$, then at each iteration:
E-step: compute the lower bound $\mathcal{Q}(\theta)$ using $q(h) = p(h \mid x, \theta)$.
M-step: $\theta \leftarrow \arg\max_\theta \mathcal{Q}(\theta)$.
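To make the E-step / M-step alternation concrete, here is a minimal numpy sketch of EM for a toy two-component one-dimensional Gaussian mixture (a hypothetical illustration, not one of the thesis models); the responsibilities play the role of $q(h) = p(h \mid x, \theta)$ and the M-step maximizes $\mathcal{Q}(\theta)$ in closed form.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a 2-component 1-D Gaussian mixture (illustrative only)."""
    # Initialize parameters theta = (weights, means, variances).
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities q(h) = p(h | x, theta).
        log_lik = (-0.5 * np.log(2 * np.pi * var)
                   - 0.5 * (x[:, None] - mu) ** 2 / var + np.log(w))
        log_norm = np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
        q = np.exp(log_lik - log_norm)
        # M-step: maximize the lower bound Q(theta) in closed form.
        nk = q.sum(axis=0)
        w = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

x = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 1, 200)])
print(em_gmm_1d(x))
```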

17-24 EM for latent variable models
Let us consider a toy example with joint distribution $p(x_1, x_2)$. [Figure: the joint distribution $p(x_1, x_2)$.] We take $h = x_2$ and $\theta = x_1$, so the problem is
$x_1^* = \arg\max_{x_1} \sum_{x_2} p(x_1, x_2).$
Running EM from two different initializations, $x_1^{(1)} = 111$ and $x_1^{(1)} = 95$, the iterates climb $\log p(x_1)$; the two runs illustrate the dependence on the starting point. [Figures: $\log p(x_1)$ with the EM iterates for the two initializations.]

25-26 EM for latent variable models
We see that EM is an iterative algorithm that is sensitive to initialization. In the next section, we introduce an alternative learning method for LVMs that is non-iterative and free of local optima, based on the method of moments.

27 Outline
Next: Method of Moments based learning algorithms (Method of Moments vs ML; Spectral Learning for LVMs).

28-30 Method of Moments vs ML
Suppose we have i.i.d. data $x_{1:N}$ generated from a Gamma distribution $\mathcal{G}(x; a, b)$. A maximum likelihood estimator for $a$ and $b$ requires solving
$\log(a) - \psi(a) = \log\!\left(\frac{1}{N}\sum_{n=1}^N x_n\right) - \frac{1}{N}\sum_{n=1}^N \log(x_n)$
for the estimate $\hat a$, and then setting $\hat b = \frac{1}{\hat a N}\sum_{n=1}^N x_n$. So an ML estimate for $\mathcal{G}(a, b)$ requires an iterative method.
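As a sketch of the iterative method this requires, the shape equation above can be solved with Newton's method; the code below assumes scipy is available, and the starting point $a \approx 1/(2c)$ and the fixed iteration count are illustrative choices, not taken from the thesis.

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_mle(x, n_iter=20):
    """Newton iteration for the Gamma shape a solving log(a) - psi(a) = c (sketch)."""
    c = np.log(x.mean()) - np.log(x).mean()   # right-hand side of the ML equation
    a = 0.5 / c                               # rough starting point
    for _ in range(n_iter):
        f = np.log(a) - digamma(a) - c
        fprime = 1.0 / a - polygamma(1, a)
        a -= f / fprime                       # Newton update
    b = x.mean() / a                          # b_hat = (1/(a N)) sum_n x_n
    return a, b

x = np.random.gamma(shape=2.0, scale=3.0, size=5000)
print(gamma_mle(x))
```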

31-33 Method of Moments vs ML
Alternatively, we can use the method of moments to estimate $a$ and $b$:
$M_1 := \mathbb{E}[x] = ab, \qquad M_2 := \mathbb{E}[x^2] = ab^2 + a^2 b^2.$
Then,
$\hat a = \frac{M_1^2}{M_2 - M_1^2}, \qquad \hat b = \frac{M_2 - M_1^2}{M_1}.$
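By contrast, the moment-matching estimator above is closed form; a minimal numpy sketch:

```python
import numpy as np

def gamma_mom(x):
    """Method of moments estimates for Gamma(a, b) with shape a and scale b."""
    m1 = x.mean()           # M1 = E[x]   = a*b
    m2 = (x ** 2).mean()    # M2 = E[x^2] = a*b^2 + a^2*b^2
    a_hat = m1 ** 2 / (m2 - m1 ** 2)
    b_hat = (m2 - m1 ** 2) / m1
    return a_hat, b_hat

x = np.random.gamma(shape=2.0, scale=3.0, size=5000)
print(gamma_mom(x))   # closed form, no iterations
```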

34-37 Method of Moments vs ML
[Figures: the true parameters, the method of moments estimate, and the ML estimate in the $(a, b)$ plane for sample sizes $N = 10$, $40$, $100$, and $200$.]

38-41 Method of Moments vs ML
The method of moments yields simpler solutions than ML, and its solutions are non-iterative while ML solutions are iterative. ML solutions are statistically more efficient and are guaranteed to return parameter estimates in the feasible range. ML solutions may get stuck in local maxima and require good initializations; method of moments solutions do not.

42-44 Spectral Learning for Latent Variable Models
Spectral learning is an instance of the method of moments: model parameters are obtained from a function of observable moments. Let us consider a simple mixture model,
$h_n \sim \text{Discrete}(p(h_n)), \qquad x_n \mid h_n \sim p(x_n \mid h_n, O(:, h_n)),$
where $x \in \mathbb{R}^L$ are the observations, $h_n \in \{1, \dots, K\}$ are cluster indicators, and $\mathbb{E}[x \mid h] = O \in \mathbb{R}^{L \times K}$ holds the model parameters (e.g. for Poisson observations, $O(i, k) = \lambda_{i,k}$). Given $x_{1:N}$, one can use EM to learn $O$; alternatively, we can use spectral learning to learn $O$ in a non-iterative fashion.

45-46 Spectral Learning for Latent Variable Models
Given that $L \ge K$ and $O$ has $K$ linearly independent columns,
$B_i := (U^T \mathbb{E}[x \otimes x \otimes x_i] V)(U^T \mathbb{E}[x \otimes x] V)^{-1} = (U^T O)\,\mathrm{diag}(O(i,:))\,(U^T O)^{-1},$
where $\mathbb{E}[x \otimes x] = U \Sigma V^T$. Since $B_i$ is expressed purely in terms of the observable moments $\mathbb{E}[x \otimes x]$ and $\mathbb{E}[x \otimes x \otimes x_i]$, the model parameters can simply be read off from the eigen-decomposition of $B_i$.
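A small numpy sketch of this recovery, using exact population moments built from a known $O$ and $p(h)$ for clarity, so that the identities hold exactly. The random-combination trick for obtaining a common eigenvector basis is standard practice for this family of algorithms and is an assumption of this sketch, not a detail given on the slides.

```python
import numpy as np

# Sketch of the observable-operator recovery above (K=3 clusters, L=5 dims).
rng = np.random.default_rng(0)
L, K = 5, 3
O = rng.uniform(0.5, 5.0, size=(L, K))            # E[x | h] = O(:, h)
p_h = np.array([0.5, 0.3, 0.2])

M2 = O @ np.diag(p_h) @ O.T                        # E[x (x)^T]
M3 = [O @ np.diag(O[i]) @ np.diag(p_h) @ O.T for i in range(L)]  # E[x (x)^T x_i]

U, _, Vt = np.linalg.svd(M2)
U, V = U[:, :K], Vt.T[:, :K]                       # top-K singular subspaces
P = np.linalg.inv(U.T @ M2 @ V)
B = [U.T @ M3[i] @ V @ P for i in range(L)]        # B_i = (U'O) diag(O(i,:)) (U'O)^{-1}

# Eigenvectors of a random combination of the B_i give R = U'O up to column
# scaling and permutation; the diagonal of R^{-1} B_i R then gives O(i, :).
eta = rng.normal(size=L)
_, R = np.linalg.eig(sum(eta[i] * B[i] for i in range(L)))
O_hat = np.vstack([np.diag(np.linalg.inv(R) @ B[i] @ R).real for i in range(L)])
print(np.round(O_hat, 2))                          # columns of O, up to permutation
```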

47-50 Spectral Learning for Latent Variable Models
Proof:
$U^T \mathbb{E}[x \otimes x \otimes x_i] V = U^T O\,\mathrm{diag}(O(i,:))\,\mathrm{diag}(p(h))\,O^T V$
$\quad = U^T O\,\mathrm{diag}(O(i,:))\,(U^T O)^{-1}\,\underbrace{U^T O\,\mathrm{diag}(p(h))\,O^T V}_{U^T \mathbb{E}[x \otimes x] V},$
so
$(U^T O)\,\mathrm{diag}(O(i,:))\,(U^T O)^{-1} = (U^T \mathbb{E}[x \otimes x \otimes x_i] V)(U^T \mathbb{E}[x \otimes x] V)^{-1}.$

51-55 Spectral Learning for Latent Variable Models
This algorithm is based on the work of Anandkumar et al. (2012). One can also modify it slightly to learn HMMs. The general goal is to express the observable moments in an eigen-decomposition form from which the parameters can be read off. This gives a fast, local-optima-free algorithm for learning LVMs. However, this scheme requires high-order moments to learn a temporally connected LVM such as a mixture of Markov models.

56 Outline
Next: Spectral Learning for Mixture of Markov models (derivation of an algorithm with existing methods; derivation of a superior learning scheme; results).

57 Mixture of Markov models
$h_n \sim \text{Discrete}(p(h_n)), \qquad x_{t,n} \mid x_{t-1,n}, h_n, A \sim \text{Discrete}(A(:, x_{t-1,n}, h_n)).$

58-61 Spectral Learning for Mixture of Markov models
We show that a spectral learning algorithm as in Anandkumar et al. (2012) requires observable moments up to order five. We propose an alternative scheme based on hierarchical Bayesian modeling:
$p(A_{h_n} \mid x_n, h_n) \propto p(x_n \mid A_{h_n}, h_n)\, p(A_{h_n}) = \text{Dirichlet}(c^n_{1,1} + \beta - 1,\; c^n_{1,2} + \beta - 1,\; \dots,\; c^n_{L,L} + \beta - 1).$
Assuming a uniform Dirichlet prior $p(A_{h_n})$, i.e. taking $\beta = 1$, the posterior of $A_{h_n}$ becomes $\text{Dirichlet}(c^n_{1,1}, c^n_{1,2}, \dots, c^n_{L,L})$. So we can represent a sequence solely via its empirical transition counts: $s_n \sim \text{Dirichlet}(c^n_{1,1}, c^n_{1,2}, \dots, c^n_{L,L}) = p(A_{h_n} \mid x_n, h_n)$.
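A minimal sketch of extracting such a statistic from a discrete sequence; using the flattened, normalized transition counts as the summary vector $s_n$ is a simplifying assumption of this sketch (the slides instead treat $s_n$ as distributed according to the Dirichlet posterior over $A_{h_n}$).

```python
import numpy as np

def transition_stats(seq, L):
    """Empirical transition counts c[i, j] = #{t : x_t = i, x_{t-1} = j} for a
    sequence over symbols {0, ..., L-1}, plus a normalized summary vector."""
    c = np.zeros((L, L))
    for prev, cur in zip(seq[:-1], seq[1:]):
        c[cur, prev] += 1
    s = (c / c.sum()).ravel()   # simple point summary of the posterior over A_{h_n}
    return c, s

c, s = transition_stats([0, 1, 1, 2, 0, 1], L=3)
print(c)
print(s)
```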

62 Spectral Learning for Mixture of Markov models - Dirichlet distribution
[Figure: examples of the Dirichlet distribution.]

63 Spectral Learning for Mixture of Markov models - Dirichlet posterior
[Figure: an observation sequence $x_t$ over time, and the Dirichlet posterior over transitions indexed by $(x_t, x_{t-1})$.]

64-65 Spectral Learning for Mixture of Dirichlet Distributions
$h_n \sim \text{Discrete}(p(h_n)), \qquad x_n \mid h_n, \alpha \sim \text{Dirichlet}(\alpha(:, h_n)).$
Note that $\mathbb{E}[s_{n,i} \mid h_n = k] = \alpha(i, k)/\alpha_0$. So we can estimate $\alpha$, up to the scaling factor $\alpha_0$, via spectral learning.

66-70 Spectral Learning for Mixture of Markov models
Input: sequences $x_{1:N}$. Output: clustering assignments $\hat h_{1:N}$.
1. Extract the sufficient statistics $s_n$ from $x_n$, for all $n \in \{1, \dots, N\}$.
2. Compute empirical estimates of the moments $\mathbb{E}[s \otimes s]$ and $\mathbb{E}[s \otimes s \otimes s]$, and compute $U$ and $V$ such that $\mathbb{E}[s \otimes s] = U \Sigma V^T$.
3. Estimate $\alpha$ via the eigen-decomposition of $B_i$.
4. For all $n \in \{1, \dots, N\}$, set $\hat h_n = \arg\max_k \text{Dirichlet}(s_n; \alpha(:, k))$.

71 Results - Synthetic Data
[Figure: average clustering accuracy (%) versus sequence length for Spectral M.Markov, Spectral M.Dirichlet, EM M.Markov, and EM M.Dirichlet, on 100 synthetic datasets of 60 sequences each.]

72 Results - Network Traffic Data
Exploratory data analysis on network traffic data consisting of Skype and BitTorrent traffic.

73 Results - Network Traffic Data
Effect of clustering in training:
Algorithm                        Classification Accuracy
No Clustering                    63.41%
Mixture of Dirichlet (Spectral)  84.85%
Mixture of Dirichlet (EM)        79.39%
Mixture of Markov (EM)           83.51%

74 Results - Finding number of Clusters
We can also estimate the number of clusters:
$\mathbb{E}[s_n \otimes s_n] = \alpha\,\mathrm{diag}(p(h))\,\alpha^T = \sum_{k=1}^K p(h = k)\,\alpha(:, k)\,\alpha(:, k)^T.$
By looking at the jump in the ratio $\sigma_{k+1}/\sigma_k$ of the singular values of $\mathbb{E}[s_n \otimes s_n]$, we can estimate $K$.
[Figures: $\sigma_{k+1}/\sigma_k$ versus number of clusters $K$ for the network traffic data and for the synthetic data.]
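A short numpy sketch of this criterion: compute the singular values of the empirical second moment of the statistics $s_n$ and pick $K$ at the sharpest drop in $\sigma_{k+1}/\sigma_k$ (the toy data below is only there to make the snippet runnable).

```python
import numpy as np

def estimate_k(S, max_k=10):
    """Pick the number of clusters from the largest drop in consecutive
    singular values of the empirical second moment E[s s^T] (sketch)."""
    M2 = S.T @ S / S.shape[0]                      # S is N x D, rows are the s_n
    sigma = np.linalg.svd(M2, compute_uv=False)
    ratios = sigma[1:max_k + 1] / sigma[:max_k]    # sigma_{k+1} / sigma_k
    return int(np.argmin(ratios)) + 1, ratios      # K at the sharpest drop

S = np.random.dirichlet(np.ones(9), size=200)      # illustrative random statistics
print(estimate_k(S, max_k=5))
```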

75 Results - Finding number of Clusters
[Figure: $\sigma_{k+1}/\sigma_k$ versus number of clusters $K$ on motion capture data.]

76 Conclusions - Spectral Learning for Mixture of Markov models
Spectral learning of a mixture of Dirichlet distributions yields a simple and effective algorithm. On synthetic data it outperforms its rivals, and it also makes it possible to automatically determine the number of clusters.

77 Outline
Next: Spectral Learning based algorithm for Mixture of HMMs (for finite mixtures; for infinite mixtures; results).

78 Mixture of HMMs
$h_n \sim \text{Discrete}(p(h_n)), \qquad r_{t,n} \mid r_{t-1,n}, h_n \sim \text{Discrete}(A(:, r_{t-1,n}, h_n)), \qquad x_{t,n} \mid r_{t,n}, h_n \sim p(x_{t,n} \mid O(:, r_{t,n}, h_n)).$

79 Spectral Learning based algorithm for Mixture of HMMs
An EM algorithm for a mixture of HMMs is expensive: it requires forward-backward message passing for every sequence in every EM iteration. We propose to replace the parameter estimation step with spectral learning.

80-82 Cheaper Alternative Algorithm for HMM Mixture learning
Randomly initialize $h_{1:N}$. At each iteration:
  for $k = 1, \dots, K$: $\theta_k \leftarrow \text{EMHMM}(\{x_n : h_n = k\})$
  for $n = 1, \dots, N$: $h_n \leftarrow \arg\max_k p(x_n \mid \theta_k)$
Replacing the inner EM fit with spectral learning gives the proposed variant:
  for $k = 1, \dots, K$: $\theta_k \leftarrow \text{SpectralHMM}(\{x_n : h_n = k\})$
  for $n = 1, \dots, N$: $h_n \leftarrow \arg\max_k p(x_n \mid \theta_k)$
With spectral learning, the parameter estimation step reduces to $K$ low-rank SVDs and $K$ eigen-decompositions, which is substantially cheaper than $N \times K$ forward-backward passes.
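A schematic of this alternating hard-assignment loop; fit_hmm and loglik are hypothetical callables standing in for the spectral HMM learner and an HMM likelihood evaluator, which the slides do not spell out.

```python
import numpy as np

def hard_assignment_mixture(seqs, K, fit_hmm, loglik, n_iter=10, seed=0):
    """Alternate between fitting one HMM per cluster and reassigning each
    sequence to the cluster whose HMM gives it the highest likelihood.
    Empty clusters are not handled in this sketch."""
    rng = np.random.default_rng(seed)
    h = rng.integers(K, size=len(seqs))          # random initial assignments
    for _ in range(n_iter):
        thetas = [fit_hmm([s for s, hn in zip(seqs, h) if hn == k]) for k in range(K)]
        h = np.array([np.argmax([loglik(s, th) for th in thetas]) for s in seqs])
    return h, thetas
```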

83 Results on mocap data
Results obtained on motion capture data: we input the temporal differences of the 3D coordinates of 32 markers on the human body, using 20 sequences. We restarted the algorithm 10 times with random initial assignments and recorded the results.
[Table: average success (%), maximum success (%), iteration time (s), and average number of iterations to convergence, comparing EM and the spectral variant. *PC: 3.33 GHz dual core CPU, 4 GB RAM; software: MATLAB.]

84-85 Learning Infinite mixture of HMMs
In finite mixtures, we fix the number of clusters $K$ a priori; in infinite mixtures, $K$ is learned automatically. Learning an infinite mixture involves an intractable integral over the HMM parameters, which we approximate using spectral learning.

86-88 Learning Infinite mixture of HMMs
Learning an infinite mixture of HMMs requires an integral over the HMM parameters $\theta = (O, A)$ (collapsed Gibbs sampler). For existing clusters,
$p(h_n = k, k \le K \mid h^{-n}_{1:N}, x_{1:N}) = \frac{N_k^{-n}}{N + \alpha - 1} \int p(x_n \mid \theta)\, p(\theta \mid \{x_l : l \ne n, h_l = k\})\, d\theta,$
and to open a new cluster,
$p(h_n = K + 1 \mid h^{-n}_{1:N}, x_{1:N}) = \frac{\alpha}{N + \alpha - 1} \int p(x_n \mid \theta)\, p(\theta)\, d\theta.$
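For concreteness, a small sketch of turning these two expressions into normalized assignment probabilities for one sequence, taking the per-cluster predictive likelihoods (the integrals) as given; these are exactly the quantities the spectral approximation on the next slides replaces.

```python
import numpy as np

def crp_assignment_probs(pred_lik, counts, prior_lik, alpha):
    """Collapsed-Gibbs assignment probabilities for one sequence:
    pred_lik[k] ~ integral p(x_n | theta) p(theta | cluster k data) dtheta,
    prior_lik   ~ integral p(x_n | theta) p(theta) dtheta (new cluster),
    counts[k]   = N_k^{-n}, alpha = concentration parameter."""
    N = counts.sum() + 1                       # include the sequence being resampled
    w_existing = counts * pred_lik / (N + alpha - 1)
    w_new = alpha * prior_lik / (N + alpha - 1)
    w = np.append(w_existing, w_new)
    return w / w.sum()

print(crp_assignment_probs(np.array([0.02, 0.3]), np.array([5, 3]), 0.05, alpha=1.0))
```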

89 Learning Infinite Mixture of HMMs
One can use Gibbs sampling with the auxiliary variable method of Neal (2000). However, this requires sampling from the prior to approximate the marginal likelihood, which needs a large number of samples and is slow. We suggest using spectral learning to learn infinite mixtures of HMMs (IM-HMMs).

90-92 IMM - Spectral learning
Assuming the sequences are long enough, we make the approximation
$\int p(x_n \mid \theta)\, p(\theta \mid \{x_l : l \ne n, h_l = k\})\, d\theta \approx p(x_n \mid \theta_k^{\text{spectral}}).$
We compute the marginal likelihood of each sequence offline and then reuse it in every iteration. By modifying the spectral algorithm for finite mixtures, we obtain a learning algorithm for infinite mixtures of HMMs.

93-95 Cheaper Alternative Algorithm for HMM Mixture learning (infinite version)
Randomly initialize $h_{1:N}$. At each iteration:
  for $k = 1, \dots, K$: $\theta_k \leftarrow \text{SpectralHMM}(\{x_n : h_n = k\})$
  for $n = 1, \dots, N$: $h_n \leftarrow \arg\max_k\,[\,p(x_n \mid \theta_1), \dots, p(x_n \mid \theta_K),\ p(x_n)\,]$; if $h_n > K$ then $K \leftarrow K + 1$
The marginal likelihoods $p(x_n)$ are computed before starting the algorithm. Note that this algorithm is the analogue, for mixtures of HMMs, of the Dirichlet Process Means algorithm of Kulis and Jordan, the infinite version of k-means.
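A sketch of the infinite variant of the loop, with the same hypothetical fit_hmm / loglik callables and the precomputed marginal log-likelihoods passed in; starting from a single cluster is a simplification made in this sketch.

```python
import numpy as np

def infinite_hard_assignment(seqs, marg_loglik, fit_hmm, loglik, n_iter=10):
    """Like the finite version, but a sequence opens a new cluster whenever its
    precomputed marginal log-likelihood beats every current cluster's HMM."""
    h = np.zeros(len(seqs), dtype=int)            # simple choice: start from one cluster
    for _ in range(n_iter):
        K = h.max() + 1
        thetas = [fit_hmm([s for s, hn in zip(seqs, h) if hn == k]) for k in range(K)]
        for n, s in enumerate(seqs):
            scores = [loglik(s, th) for th in thetas] + [marg_loglik[n]]
            h[n] = int(np.argmax(scores))         # index K means "open a new cluster"
    return h, thetas
```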

96 Results - Human Action Classification
We use the KTH human action database: 6 classes (boxing, hand waving, hand clapping, walking, jogging, running) and 100 sequences, with 64 sequences for training and 36 for testing. We cluster within the training phase to increase the classification accuracy, since data instances in the same class tend to show variability.

97 Results - Human Action Classification
Algorithm        Classification Accuracy
No Clustering    70.1%
Spectral         74.1%
Gibbs Sampling   74.0%

98 Results - MoCap Data
A dataset of 44 sequences with two different walking styles and running. The algorithm automatically determines $K = 3$ and converges in 9 seconds.

99 Conclusions - Spectral Learning for Mixture of HMMs
We have proposed a simple, fast, and effective algorithm for learning mixtures of HMMs. The algorithm is competitive with the more conventional Gibbs sampling method.

100 Outline
Next: Discriminative vs Generative Time Series Models.

101 Discriminative vs Generative Models
Given the training set $(x_n, h_n)_{n=1}^N$:
Training a generative model: $\theta^* = \arg\max_\theta \prod_n p(x_n \mid \theta, h_n)$.
Training a discriminative model: $\theta^* = \arg\max_\theta \prod_n p(h_n \mid x_n, \theta)$.

102 Discriminative vs Generative Models
[Figure: logistic regression versus a naive Bayes classifier with isotropic Gaussian observations.]
Generative models model the data distribution $p(x \mid h)$; discriminative models model the class labels $p(h \mid x)$.

103-106 Discriminative and Generative Time Series Models
HMMs, Markov models, and their mixtures are all generative models. Deriving the discriminative model corresponding to an LVM amounts to converting the model into an undirected graph, i.e. defining an energy function. For the discriminative Markov model,
$\psi(x_n, h_n; \theta_k) = \sum_{t=1}^{T_n} \sum_{k=1}^{K} \sum_{l_1=1}^{L} \sum_{l_2=1}^{L} [h_n = k][l_1 = x_t][l_2 = x_{t-1}]\, \theta_{k, l_1, l_2},$
and the training objective is
$\theta_{1:K}^* = \arg\max_{\theta_{1:K}} \sum_{n=1}^N \log \frac{\exp(\psi(x_n, h_n; \theta_{1:K}))}{\sum_{h} \exp(\psi(x_n, h; \theta_{1:K}))}.$
Note that the data $x$ is treated as an input.
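Since $\psi$ is linear in transition-count features, this training objective amounts to multiclass logistic regression on those features; below is a minimal numpy sketch with plain gradient ascent, where the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def transition_features(seq, L):
    """phi(x)[l1, l2] = #{t : x_t = l1, x_{t-1} = l2}, flattened."""
    phi = np.zeros((L, L))
    for prev, cur in zip(seq[:-1], seq[1:]):
        phi[cur, prev] += 1
    return phi.ravel()

def train_disc_markov(seqs, labels, L, K, lr=0.01, n_iter=200):
    """Gradient ascent on sum_n log p(h_n | x_n, theta), with
    psi(x, h=k; theta) = <theta_k, phi(x)> as in the energy above."""
    Phi = np.array([transition_features(s, L) for s in seqs])   # N x L^2
    theta = np.zeros((K, L * L))
    Y = np.eye(K)[labels]                                       # one-hot labels
    for _ in range(n_iter):
        scores = Phi @ theta.T                                  # psi(x_n, k)
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)                       # p(h = k | x_n)
        theta += lr * (Y - P).T @ Phi                           # conditional log-likelihood gradient
    return theta

seqs = [[0, 1, 1, 0, 1], [2, 2, 1, 2, 2], [0, 0, 0, 1, 0]]
theta = train_disc_markov(seqs, labels=[0, 1, 0], L=3, K=2)
print(theta.shape)
```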

107 Result
We investigate the amount of training data versus classification accuracy for the generative and discriminative Markov models, using the synthetic control chart dataset from the UCI machine learning repository: 6 classes, 100 sequences per class, with 70 sequences for training and 30 for testing.
[Figure: classification accuracy (%) versus the percentage of the training set used, for the discriminative and generative models.]

108 Conclusions and Future Work
We have proposed two novel algorithms for time series clustering and investigated generative and discriminative classification of time series. As future work, we plan to investigate the hierarchical Bayesian viewpoint for deriving spectral learning algorithms for more complicated time series models, and a complete spectral learning algorithm for mixtures of HMMs based on constrained optimization.


More information

Heeyoul (Henry) Choi. Dept. of Computer Science Texas A&M University

Heeyoul (Henry) Choi. Dept. of Computer Science Texas A&M University Heeyoul (Henry) Choi Dept. of Computer Science Texas A&M University hchoi@cs.tamu.edu Introduction Speaker Adaptation Eigenvoice Comparison with others MAP, MLLR, EMAP, RMP, CAT, RSW Experiments Future

More information

Spectral Learning of Mixture of Hidden Markov Models

Spectral Learning of Mixture of Hidden Markov Models Spectral Learning of Mixture of Hidden Markov Models Y Cem Sübakan, Johannes raa, Paris Smaragdis,, Department of Computer Science, University of Illinois at Urbana-Champaign Department of Electrical and

More information

Machine Learning. Probabilistic KNN.

Machine Learning. Probabilistic KNN. Machine Learning. Mark Girolami girolami@dcs.gla.ac.uk Department of Computing Science University of Glasgow June 21, 2007 p. 1/3 KNN is a remarkably simple algorithm with proven error-rates June 21, 2007

More information

CSC411: Final Review. James Lucas & David Madras. December 3, 2018

CSC411: Final Review. James Lucas & David Madras. December 3, 2018 CSC411: Final Review James Lucas & David Madras December 3, 2018 Agenda 1. A brief overview 2. Some sample questions Basic ML Terminology The final exam will be on the entire course; however, it will be

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

Coupled Hidden Markov Models: Computational Challenges

Coupled Hidden Markov Models: Computational Challenges .. Coupled Hidden Markov Models: Computational Challenges Louis J. M. Aslett and Chris C. Holmes i-like Research Group University of Oxford Warwick Algorithms Seminar 7 th March 2014 ... Hidden Markov

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

Dynamic Approaches: The Hidden Markov Model

Dynamic Approaches: The Hidden Markov Model Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message

More information

METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS

METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS Y. Cem Subakan, Johannes Traa, Paris Smaragdis,,, Daniel Hsu UIUC Computer Science Department, Columbia University Computer Science Department

More information

Tree-Based Inference for Dirichlet Process Mixtures

Tree-Based Inference for Dirichlet Process Mixtures Yang Xu Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, USA Katherine A. Heller Department of Engineering University of Cambridge Cambridge, UK Zoubin Ghahramani

More information

A minimalist s exposition of EM

A minimalist s exposition of EM A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Generative MaxEnt Learning for Multiclass Classification

Generative MaxEnt Learning for Multiclass Classification Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 9: Expectation Maximiation (EM) Algorithm, Learning in Undirected Graphical Models Some figures courtesy

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Pattern Recognition. Parameter Estimation of Probability Density Functions

Pattern Recognition. Parameter Estimation of Probability Density Functions Pattern Recognition Parameter Estimation of Probability Density Functions Classification Problem (Review) The classification problem is to assign an arbitrary feature vector x F to one of c classes. The

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Combining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning

Combining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning Combining Generative and Discriminative Methods for Pixel Classification with Multi-Conditional Learning B. Michael Kelm Interdisciplinary Center for Scientific Computing University of Heidelberg michael.kelm@iwr.uni-heidelberg.de

More information

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information