Probabilistic Time Series Classification

Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54

Problem Statement The goal is to assign "similar" sequences to same classes. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 2 / 54

Problem Statement It is not a trivial task. Variability within each class. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 3 / 54

Problem Statement Data in real life applications is even more challenging. So, we make a computer program "learn" how to classify sequences. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 4 / 54

Goal of this thesis The goal of this thesis is to understand and develop learning algorithms for supervised and unsupervised time series classification models. We propose two novel learning algorithms for mixture of Markov models and mixture of HMMs. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 5 / 54

Goal of this thesis The goal of this thesis is to understand and develop learning algorithms for supervised and unsupervised time series classification models. We propose two novel learning algorithms for mixture of Markov models and mixture of HMMs. h n x 1,n x 2,n... x Tn,n n = 1... N A k k = 1... K Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 5 / 54

Goal of this thesis The goal of this thesis is to understand and develop learning algorithms for supervised and unsupervised time series classification models. We propose two novel learning algorithms for mixture of Markov models and mixture of HMMs. A k h n k = 1... K r 1,n r 2,n... r Tn,n x 1,n x 2,n... x Tn,n h n A k k = 1... K n = 1... N x 1,n x 2,n... x Tn,n n = 1... N O k k = 1... K Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 5 / 54

Achievements A new spectral learning algorithm for learning mixture of Markov models Based on learning mixture of Dirichlet distributions Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 6 / 54

Achievements A new spectral learning algorithm for learning mixture of Markov models Based on learning mixture of Dirichlet distributions A new spectral learning based algorithm for learning mixture of HMMs Faster compared to the conventional EM approach Possible to generalize to infinite mixture of HMMs Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 6 / 54

Achievements A new spectral learning algorithm for learning mixture of Markov models Based on learning mixture of Dirichlet distributions A new spectral learning based algorithm for learning mixture of HMMs Faster compared to the conventional EM approach Possible to generalize to infinite mixture of HMMs Survey of Discriminative versions of Markov models and HMMs Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 6 / 54

Outline 1 Learning a Latent Variable Model with ML Expectation Maximization (EM) algorithm description, drawbacks 2 Method of Moments based learning algorithms Method of Moments vs ML Spectral Learning for LVMs 3 Spectral Learning for Mixture of Markov models Derivation of an algorithm with existing methods Derivation of a superior learning scheme Results 4 Spectral Learning based algorithm for Mixture of HMMs For Finite mixtures For Infinite mixtures Results 5 Discriminative vs Generative Time Series Models Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 7 / 54

EM for latent variable models The optimization problem to train a latent variable model: θ = arg max p(x θ) θ = arg max p(x, h θ) θ where, x are the observations (sequences), h are the latent variables (cluster indicators/latent state sequences), θ is the model parameter. The summation is over all possible combinations of h. An intractable summation in practice. h Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 8 / 54

EM for latent variable models The idea is to have a lower bound on log p(x θ), which is easier to maximize. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 9 / 54

EM for latent variable models The idea is to have a lower bound on log p(x θ), which is easier to maximize. log p(x θ) = log h = log h p(x, h θ) = log h [ p(x, h θ) ] E q(h) q(h) p(x, h θ) q(h) q(h) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 9 / 54

EM for latent variable models The idea is to have a lower bound on log p(x θ), which is easier to maximize. log p(x θ) = log h = log h p(x, h θ) = log h [ p(x, h θ) ] E q(h) q(h) p(x, h θ) q(h) q(h) Then, using Jensen s inequality; log h E q(h) [ p(x, h θ) q(h) ] E q(h) [ log p(x, h θ) ] q(h) log p(x θ) E q(h) [log p(x, h θ)] E q(h) [log q(h)] }{{} Q(θ):= Q(θ) is easier to maximize than p(x θ). Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 9 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 10 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) Let us consider the following toy example: 20 p(x 1,x 2 ) x 10 5 11 10 40 60 80 9 8 7 x 1 100 120 140 160 6 5 4 3 2 180 1 200 50 100 150 200 x 2 Joint distribution p(x 1, x 2 ) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 10 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) Let us consider the following toy example: 20 p(x 1,x 2 ) x 10 5 11 10 40 x 1 60 80 100 120 140 160 180 200 50 100 150 200 x 2 9 8 7 6 5 4 3 2 1 We take h = x 2, θ = x 1. The problem is: x1 = arg max p(x 1, x 2 ) x 1 x 2 Joint distribution p(x 1, x 2 ) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 10 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) 4 Initialization from 111 log p(x 1 ) 4 Initialization from 95 log p(x 1 ) 5 6 1 5 6 1 7 7 8 8 9 9 10 10 11 11 12 0 20 40 60 80 100 120 140 160 180 200 x 1 12 0 20 40 60 80 100 120 140 160 180 200 x 1 Initialization x 1 1 = 111 Initialization x 1 1 = 95 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 11 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) 4 5 6 Initialization from 111 1 2 log p(x 1 ) 4 5 6 2 Initialization from 95 1 log p(x 1 ) 7 7 8 8 9 9 10 10 11 11 12 0 20 40 60 80 100 120 140 160 180 200 x 1 12 0 20 40 60 80 100 120 140 160 180 200 x 1 Initialization x 1 1 = 111 Initialization x 1 1 = 95 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 11 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) 4 5 6 Initialization from 111 1 2 3 log p(x 1 ) 4 5 6 32 Initialization from 95 1 log p(x 1 ) 7 7 8 8 9 9 10 10 11 11 12 0 20 40 60 80 100 120 140 160 180 200 x 1 12 0 20 40 60 80 100 120 140 160 180 200 x 1 Initialization x 1 1 = 111 Initialization x 1 1 = 95 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 11 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) 4 5 6 Initialization from 111 1 2 3 log p(x 1 ) 4 5 6 432 Initialization from 95 1 log p(x 1 ) 7 7 8 8 9 9 10 10 11 11 12 0 20 40 60 80 100 120 140 160 180 200 x 1 12 0 20 40 60 80 100 120 140 160 180 200 x 1 Initialization x 1 1 = 111 Initialization x 1 1 = 95 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 11 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) 4 5 6 Initialization from 111 1 2 3 log p(x 1 ) 4 5 6 4532 Initialization from 95 1 log p(x 1 ) 7 7 8 8 9 9 10 10 11 11 12 0 20 40 60 80 100 120 140 160 180 200 x 1 12 0 20 40 60 80 100 120 140 160 180 200 x 1 Initialization x 1 1 = 111 Initialization x 1 1 = 95 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 11 / 54

EM for latent variable models EM Algorithm Initialize θ then, at each iteration; E-Step: Compute the lower bound Q(θ) using q(h) = p(h x, θ) M-Step: θ = arg max θ Q(θ) 4 5 6 Initialization from 111 1 2 3 log p(x 1 ) 4 5 6 45632 Initialization from 95 1 log p(x 1 ) 7 7 8 8 9 9 10 10 11 11 12 0 20 40 60 80 100 120 140 160 180 200 x 1 12 0 20 40 60 80 100 120 140 160 180 200 x 1 Initialization x 1 1 = 111 Initialization x 1 1 = 95 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 11 / 54

EM for latent variable models We see that EM is an iterative algorithm, sensitive to initialization. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 12 / 54

EM for latent variable models We see that EM is an iterative algorithm, sensitive to initialization. In next section, we introduce an alternative local optima free, non-iterative learning method for LVMs, based on method of moments. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 12 / 54

Method of Moments vs ML Suppose we have the i.i.d. data x 1:N, generated from G(x; a, b). Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 14 / 54

Method of Moments vs ML Suppose we have the i.i.d. data x 1:N, generated from G(x; a, b). A maximum likelihood estimator for a and b would require to solve, ( ) 1 N log(a) ψ(a) = log x n 1 N log(x n ) N N to find an estimate â. Then, ˆb = 1 ân n=1 N n=1 x n n=1 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 14 / 54

Method of Moments vs ML Suppose we have the i.i.d. data x 1:N, generated from G(x; a, b). A maximum likelihood estimator for a and b would require to solve, ( ) 1 N log(a) ψ(a) = log x n 1 N log(x n ) N N to find an estimate â. Then, ˆb = 1 ân n=1 N n=1 x n n=1 So, a ML estimate for G(a, b) would require to use an iterative method. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 14 / 54

Method of Moments vs ML Alternatively, we can use method of moments to estimate a and b: Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 15 / 54

Method of Moments vs ML Alternatively, we can use method of moments to estimate a and b: Then, M 1 := E[x] =ab M 2 := E[x 2 ] =ab 2 + a 2 b 2 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 15 / 54

Method of Moments vs ML Alternatively, we can use method of moments to estimate a and b: Then, M 1 := E[x] =ab M 2 := E[x 2 ] =ab 2 + a 2 b 2 â = M2 1 M 2 M 2 1 ˆb = M 2 M 2 1 M 1 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 15 / 54

a Method of Moments vs ML 2 1.8 1.6 N = 10 True parameters MoMoment estimate ML estimate 1.4 1.2 1 0.8 0.6 0.4 0.2 0 0 2 4 6 8 10 b Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 16 / 54

a a Method of Moments vs ML 2 1.8 1.6 N = 10 True parameters MoMoment estimate ML estimate 2 1.8 1.6 N = 40 True parameters MoMoment estimate ML estimate 1.4 1.4 1.2 1.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 2 4 6 8 10 b 0 0 2 4 6 8 10 b Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 16 / 54

a a a Method of Moments vs ML 2 1.8 1.6 N = 10 True parameters MoMoment estimate ML estimate 2 1.8 1.6 N = 40 True parameters MoMoment estimate ML estimate 1.4 1.4 1.2 1.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 2 4 6 8 10 b 0 0 2 4 6 8 10 b 2 1.8 1.6 N = 100 True parameters MoMoment estimate ML estimate 1.4 1.2 1 0.8 0.6 0.4 0.2 0 0 2 4 6 8 10 b Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 16 / 54

a a a a Method of Moments vs ML 2 1.8 1.6 N = 10 True parameters MoMoment estimate ML estimate 2 1.8 1.6 N = 40 True parameters MoMoment estimate ML estimate 1.4 1.4 1.2 1.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 2 4 6 8 10 b 0 0 2 4 6 8 10 b 2 1.8 1.6 N = 100 True parameters MoMoment estimate ML estimate 2 1.8 1.6 N = 200 True parameters MoMoment estimate ML estimate 1.4 1.4 1.2 1.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 2 4 6 8 10 b 0 0 2 4 6 8 10 b Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 16 / 54

Method of Moments vs ML Method of Moments yield simpler solutions than ML. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 17 / 54

Method of Moments vs ML Method of Moments yield simpler solutions than ML. Method of Moments solutions are not iterative. ML solutions are. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 17 / 54

Method of Moments vs ML Method of Moments yield simpler solutions than ML. Method of Moments solutions are not iterative. ML solutions are. ML solutions are more efficient, guaranteed to return parameter estimates in feasible range. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 17 / 54

Method of Moments vs ML Method of Moments yield simpler solutions than ML. Method of Moments solutions are not iterative. ML solutions are. ML solutions are more efficient, guaranteed to return parameter estimates in feasible range. ML solutions may get stuck in local maxima, require good initializations. Method of moments do not. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 17 / 54

Spectral Learning for Latent Variable Models Spectral learning is an instance of method of moments. Model parameters are solved from a function of some observable moments. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 18 / 54

Spectral Learning for Latent Variable Models Spectral learning is an instance of method of moments. Model parameters are solved from a function of some observable moments. Let us consider a simple mixture model: h n x n O(:, k) n = 1... N k = 1... K h n Discrete(p(h n )) x n h n p(x n h n, O(:, h n )) where, x R L observations, h n {1,..., K} cluster indicators, E[x h] = O R L K model parameters, e.g. for Poisson, O(i, k) = λ i,k. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 18 / 54

Spectral Learning for Latent Variable Models Spectral learning is an instance of method of moments. Model parameters are solved from a function of some observable moments. Let us consider a simple mixture model: h n x n O(:, k) n = 1... N k = 1... K h n Discrete(p(h n )) x n h n p(x n h n, O(:, h n )) where, x R L observations, h n {1,..., K} cluster indicators, E[x h] = O R L K model parameters, e.g. for Poisson, O(i, k) = λ i,k. Given x 1:N, one can use EM to learn O. Alternatively, we can use spectral learning to learn O, in a non iterative fashion. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 18 / 54

Spectral Learning for Latent Variable Models h n x n O(:, k) n = 1... N k = 1... K h n Discrete(p(h n )) x n h n p(x n h n, O(:, h n )) Given that L K, and O has K linearly independent columns then, B i :=(U T E[x x x i ]V )(U T E[x x]v ) 1 =(U T O)diag(O(i, :))(U T O) 1 where, E[x x] = UΣV T. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 19 / 54

Spectral Learning for Latent Variable Models h n x n O(:, k) n = 1... N k = 1... K h n Discrete(p(h n )) x n h n p(x n h n, O(:, h n )) Given that L K, and O has K linearly independent columns then, where, E[x x] = UΣV T. B i :=(U T E[x x x i ]V )(U T E[x x]v ) 1 =(U T O)diag(O(i, :))(U T O) 1 B i can be expressed in terms of observable moments E[x x] and E[x x x i ]. So, model parameters can be simply be read from the eigenvectors of B i. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 19 / 54

Spectral Learning for Latent Variable Models Proof: U T E[x x x i ]V =U T Odiag(O(i, :))diag(p(h))o T V Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 20 / 54

Spectral Learning for Latent Variable Models Proof: U T E[x x x i ]V =U T Odiag(O(i, :))diag(p(h))o T V =U T Odiag(O(i, :))(U T O) 1 U T Odiag(p(h))O T V Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 20 / 54

Spectral Learning for Latent Variable Models Proof: U T E[x x x i ]V =U T Odiag(O(i, :))diag(p(h))o T V =U T Odiag(O(i, :))(U T O) 1 U T Odiag(p(h))O T V =U T Odiag(O(i, :))(U T O) 1 U T Odiag(p(h))O T V }{{} U T E[x x]v Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 20 / 54

Spectral Learning for Latent Variable Models Proof: so, U T E[x x x i ]V =U T Odiag(O(i, :))diag(p(h))o T V =U T Odiag(O(i, :))(U T O) 1 U T Odiag(p(h))O T V =U T Odiag(O(i, :))(U T O) 1 U T Odiag(p(h))O T V }{{} U T E[x x]v (U T O)diag(O(i, :))(U T O) 1 = (U T E[x x x i ]V )(U T E[x x]v ) 1 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 20 / 54

Spectral Learning for Latent Variable Models This algorithm is based on the work of Anandkumar et al. 2012. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 21 / 54

Spectral Learning for Latent Variable Models This algorithm is based on the work of Anandkumar et al. 2012. One can also modify this algorithm slightly to learn HMMs. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 21 / 54

Spectral Learning for Latent Variable Models This algorithm is based on the work of Anandkumar et al. 2012. One can also modify this algorithm slightly to learn HMMs. The general goal is to express the observable moments so that, we obtain an eigen-decomposition form, from which we can read the parameters. This is a fast, and local optima free algorithm to learn LVMs. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 21 / 54

Spectral Learning for Latent Variable Models This algorithm is based on the work of Anandkumar et al. 2012. One can also modify this algorithm slightly to learn HMMs. The general goal is to express the observable moments so that, we obtain an eigen-decomposition form, from which we can read the parameters. This is a fast, and local optima free algorithm to learn LVMs. However, this scheme requires high order moments to learn a temporally connected LVM such as mixture of Markov models. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 21 / 54

Mixture of Markov models h n x 1,n x 2,n... x Tn,n n = 1... N A k k = 1... K h n Discrete(p(h n )) x t,n x t 1,n, h n, A Discrete(A(:, x t 1,n, h n )) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 23 / 54

Spectral Learning for Mixture of Markov models We show that a spectral learning algorithm as in Anandkumar, et al. 2012, requires observable moments up to order five. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 24 / 54

Spectral Learning for Mixture of Markov models We show that a spectral learning algorithm as in Anandkumar, et al. 2012, requires observable moments up to order five. We propose an alternative scheme based on hierarchical Bayesian modeling: p(a hn x n, h n ) p(x n A hn, h n )p(a hn ) = Dirichlet(c n 1,1 + β 1, c n 1,2 + β 1,..., c n L,L + β 1) Assuming a uniform Dirichlet prior p(a hn ), i.e. taking β = 1, the posterior of A hn becomes Dirichlet(c n 1,1, cn 1,2,..., cn L,L ). Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 24 / 54

Spectral Learning for Mixture of Markov models We show that a spectral learning algorithm as in Anandkumar, et al. 2012, requires observable moments up to order five. We propose an alternative scheme based on hierarchical Bayesian modeling: p(a hn x n, h n ) p(x n A hn, h n )p(a hn ) = Dirichlet(c n 1,1 + β 1, c n 1,2 + β 1,..., c n L,L + β 1) Assuming a uniform Dirichlet prior p(a hn ), i.e. taking β = 1, the posterior of A hn becomes Dirichlet(c n 1,1, cn 1,2,..., cn L,L ). So, we can represent a sequence only via empirical transition counts matrix s n Dirichlet(c n 1,1, cn 1,2,..., cn L,L ) = p(a h n x n, h n ). Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 24 / 54

Spectral Learning for Mixture of Markov models - Dirichlet distribution Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 25 / 54

Spectral Learning for Mixture of Markov models - Dirichlet posterior 6 25 Observation 4 2 0 0 10 20 30 40 50 60 70 80 90 100 Time x t 2 4 6 2 4 6 x t 1 20 15 10 5 0 Observation 6 4 2 x t 2 4 30 20 10 0 0 10 20 30 40 50 60 70 80 90 100 Time 6 2 4 6 x t 1 0 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 26 / 54

Spectral Learning for Mixture of Dirichlet Distributions h n x n α(:, k) n = 1... N k = 1... K h n Discrete(p(h n )) x n h n, α Dirichlet(α(1, h n ),..., α(k, h n )) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 27 / 54

Spectral Learning for Mixture of Dirichlet Distributions h n x n α(:, k) n = 1... N k = 1... K h n Discrete(p(h n )) x n h n, α Dirichlet(α(1, h n ),..., α(k, h n )) Note that E[s n,i h n = k] = α(i, k)/α 0. So, we can estimate α, up to a scaling factor α 0 via spectral learning. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 27 / 54

Spectral Learning for Mixture of Markov models Input: Sequences x 1:N Output: Clustering assignments ĥ1:n Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 28 / 54

Spectral Learning for Mixture of Markov models Input: Sequences x 1:N Output: Clustering assignments ĥ1:n 1. Extract the sufficient statistics s n from x n, n {1,..., N} Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 28 / 54

Spectral Learning for Mixture of Markov models Input: Sequences x 1:N Output: Clustering assignments ĥ1:n 1. Extract the sufficient statistics s n from x n, n {1,..., N} 2. Compute empirical moment estimates for E[s s] and E[s s s]. Compute U and V such that E[s s] = UΣV T. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 28 / 54

Spectral Learning for Mixture of Markov models Input: Sequences x 1:N Output: Clustering assignments ĥ1:n 1. Extract the sufficient statistics s n from x n, n {1,..., N} 2. Compute empirical moment estimates for E[s s] and E[s s s]. Compute U and V such that E[s s] = UΣV T. 3. Estimate α by doing eigen-decomposition of B i. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 28 / 54

Spectral Learning for Mixture of Markov models Input: Sequences x 1:N Output: Clustering assignments ĥ1:n 1. Extract the sufficient statistics s n from x n, n {1,..., N} 2. Compute empirical moment estimates for E[s s] and E[s s s]. Compute U and V such that E[s s] = UΣV T. 3. Estimate α by doing eigen-decomposition of B i. 4. n {1,... N}, ĥn = arg max k Dirichlet(s n, α(:, k)). Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 28 / 54

Results - Synthetic Data 100 Clustering Accuracy % 95 90 85 80 75 70 Spectral MMarkov Spectral MDirichlet 65 EM MMarkov EM MDirichlet 60 3090150 300 600 1200 Sequence Length On 100 synthetic datasets, each consisting of 60 sequences, average clustering accuracies for varying sequence lengths. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 29 / 54

Results - Network Traffic Data Exploratory data analysis on network traffic data consisting of Skype and Bit-torrent 1 2 3 4 5 6 15 23 24 13 11 7 8 9 10 11 12 20 17 22 21 19 14 53 7 4 12 1 18 2 6 108 9 13 14 15 16 17 18 19 20 21 22 23 24 16 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 30 / 54

Results - Network Traffic Data Effect of clustering in training Algorithm Classification Accuracy No Clustering 63.41% Mixture of Dirichlet (Spectral) 84.85% Mixture of Dirichlet (EM) 79.39% Mixture of Markov (EM) 83.51% Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 31 / 54

Results - Finding number of Clusters We can also estimate the number of clusters: K E[s n s n ] = αdiag(p(h))α T = p(h = k)α(:, k)α(:, k) T k=1 By looking at the jump σ k+1 /σ k of E[s n s n ], we can estimate K. 2.8 14 σ k / σ k 1 2.6 2.4 2.2 2 1.8 1.6 1.4 1.2 σ k / σ k 1 12 10 8 6 4 2 1 0 10 20 30 40 50 60 70 Number of Clusters K 0 0 10 20 30 40 50 60 70 Number of Clusters K Figure: Network Traffic Data. Figure: Synthetic Data. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 32 / 54

Results - Finding number of Clusters Finding number of clusters on motion capture data. 25 20 15 σ k / σ k 1 10 5 0 1 2 3 4 5 6 7 8 9 Number of Clusters K Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 33 / 54

Conclusions - Spectral Learning for Mixture of Markov models Spectral learning of mixture of Dirichlet disributions yield a simple and effective algorithm. On synthetic data, it outperforms its rivals. It is also possible to automatically determine the number clusters. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 34 / 54

Mixture of HMMs A k k = 1... K r 1,n r 2,n... r Tn,n h n x 1,n x 2,n... x Tn,n h n Discrete(p(h n )) r t,n r t 1,n, h n Discrete(A(:, r t 1,n, h n )) x t,n r t,n, h n p(x t,n O(:, r t,n, h n )) n = 1... N O k k = 1... K Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 36 / 54

Spectral Learning based algorithm for Mixture of HMMs An EM algorithm for mixture of HMMs is expensive. Requires forward-backward message passing for each sequence in each EM iteration. We propose to replace the parameter estimation step with spectral learning. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 37 / 54

Cheaper Alternative Algorithm for HMM Mixture learning Randomly initialize r 1:N At each iteration, for k = 1 K do θ k EMHMM({x n n, h n = k}) end for for n = 1 N do h n = arg max k p(x n θ k ) end for Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 38 / 54

Cheaper Alternative Algorithm for HMM Mixture learning Randomly initialize r 1:N At each iteration, for k = 1 K do θ k SpectralHMM({x n n, h n = k}) end for for n = 1 N do h n = arg max k p(x n θ k ) end for With spectral learning, required computations for parameter estimation step become K low-rank SVDs and K eigen-decompositions, which is substantially cheaper compared to N K forward-backward. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 38 / 54

Results on mocap data Results obtained on Motion capture data. We input the temporal difference of 3D coordinates of 32 markers on human body. In this case, we used 20 sequences. We restarted the algorithm 10 times with random r 1:N, and recorded the results: Average Max Iteration Average Success(%) Success(%) time* (s) convergence EM1 70 100 7.5 3.2 EM2 73 100 12 2.9 Spectral 76 100 3.1 3.8 *PC: 3.33 GHz dual core CPU, 4 GB RAM, Software: MATLAB Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 39 / 54

Learning Infinite mixture of HMMs In Finite mixtures, we fix the number of clusters K a-priori. In infinite mixtures, K is automatically learned. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 40 / 54

Learning Infinite mixture of HMMs In Finite mixtures, we fix the number of clusters K a-priori. In infinite mixtures, K is automatically learned. Learning an infinite mixture would involve an intractable integral over HMM parameters. We approximate these integrals using spectral learning. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 40 / 54

Learning Infinite mixture of HMMs Learning an infinite Mixture of HMMs would require an integral over HMM parameters θ = (O, A). (Collapsed Gibbs sampler) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 41 / 54

Learning Infinite mixture of HMMs Learning an infinite Mixture of HMMs would require an integral over HMM parameters θ = (O, A). (Collapsed Gibbs sampler) For existing clusters: p(h n = k, k K h n 1:N, x 1:N) = N n k p(x n θ)p(θ {x l : l n, h l = k})dθ N + α 1 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 41 / 54

Learning Infinite mixture of HMMs Learning an infinite Mixture of HMMs would require an integral over HMM parameters θ = (O, A). (Collapsed Gibbs sampler) For existing clusters: p(h n = k, k K h n 1:N, x 1:N) = N n k p(x n θ)p(θ {x l : l n, h l = k})dθ N + α 1 To open a new cluster: p(h n = k + 1 h n 1:N, x 1:N) α = N + α 1 p(x n θ)p(θ)dθ Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 41 / 54

Learning Infinite Mixture of HMMs One can use Gibbs sampling with auxiliary variable method of Neal (2000). However, this would require sampling from the prior to approximate the marginal likelihood. Large number of samples, slow. We suggest to use spectral learning for learning IM-HMMs. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 42 / 54

IMM - Spectral learning Assuming that the sequences are long enough, we make the following approximation; p(x n θ)p(θ {x l : l n, h l = k})dθ p(x n θ spectral k ) Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 43 / 54

IMM - Spectral learning Assuming that the sequences are long enough, we make the following approximation; p(x n θ)p(θ {x l : l n, h l = k})dθ p(x n θ spectral k ) We compute the marginal likelihood of each sequence offline, and then use them in each iteration. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 43 / 54

IMM - Spectral learning Assuming that the sequences are long enough, we make the following approximation; p(x n θ)p(θ {x l : l n, h l = k})dθ p(x n θ spectral k ) We compute the marginal likelihood of each sequence offline, and then use them in each iteration. By modifying the spectral algorithm for finite mixtures, we obtain a learning algorithm for infinite mixtures of HMMs: Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 43 / 54

Cheaper Alternative Algorithm for HMM Mixture learning Randomly initialize h 1:N At each iteration, for k = 1 K do θ k SpectralHMM({x n n, h n = k}) end for for n = 1 N do h n = arg max k p(x n θ k ) end for Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 44 / 54

Cheaper Alternative Algorithm for HMM Mixture learning Randomly initialize h 1:N At each iteration, for k = 1 K do θ k SpectralHMM({x n n, h n = k}) end for for n = 1 N do h n = arg max k [p(x n θ k ) p(x n )] if h n > K then K = K + 1 end if end for Marginal likelihoods are computed before starting the algorithm p(x n ) Note that this algorithm is the analogue of the infinite version of k-means, Dirichlet Process Means algorithm of Kulis and Jordan. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 44 / 54

Results - Human Action Classification We use KTH human action database. 6 classes (boxing, hand waving, hand clapping, walking, jogging, running), 100 sequences: 64 sequences for training 36 for testing. We do clustering in training phase to increase the clustering accuracy, as data instances in same class tend to show variability. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 45 / 54

Results - Human Action Classification Algorithm Classification Accuracy No Clustering 70.1% Spectral 74.1% Gibbs Sampling 74.0% Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 46 / 54

Results - MoCap Data A dataset 44 sequences with two different walking characteristics and running. Algorithm automatically determines K = 3. In 9 seconds algorithm converged. 10 5 0 5 10 15 20 10 10 0 10 20 30 40 5 0 5 5 0 5 10 15 10 5 30 20 10 0 10 20 5 0 5 30 0 5 10 15 20 0 20 40 5 0 5 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 47 / 54

Conclusions - Spectral Learning for Mixture of HMMs We have proposed a simple, fast and effective algorithm for learning mixture of HMMs. Algorithm is competitive with more conventional Gibbs sampling method. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 48 / 54

Discriminative vs Generative Models Given the training set (x n, h n ) N n=1 Training a Generative Model θ = arg max θ Training a Discriminative Model θ = arg max θ p(x n θ, h n ) n p(h n x n, θ) n Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 50 / 54

Discriminative vs Generative Models Logistic Regression vs Naive Bayes Classifiers (with Isotropic Gaussian observations) 8 Logistic Reg. Naive Bayes 6 4 2 0 2 4 12 10 8 6 4 2 0 2 4 Generative Models model the data distribution p(x h). Discriminative Models model the class labels p(h x). Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 51 / 54

Discriminative and Generative Time Series Models HMM, Markov Model and their mixtures are all generative models. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 52 / 54

Discriminative and Generative Time Series Models HMM, Markov Model and their mixtures are all generative models. Deriving a discriminative model corresponding to a LVM amounts to converting the model into an undirected graph. (Defining an energy function) For Discriminative Markov model: ψ(x n, h n ; θ k ) = T n K L t=1 k=1 l 1 =1 l 2 =1 L [h n = k][l 1 = x t ][l 2 = x t 1 ]θ k,l1,l 2 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 52 / 54

Discriminative and Generative Time Series Models HMM, Markov Model and their mixtures are all generative models. Deriving a discriminative model corresponding to a LVM amounts to converting the model into an undirected graph. (Defining an energy function) For Discriminative Markov model: ψ(x n, h n ; θ k ) = T n K L t=1 k=1 l 1 =1 l 2 =1 L [h n = k][l 1 = x t ][l 2 = x t 1 ]θ k,l1,l 2 The training is: θ 1:K = arg max θ 1:K N exp(ψ(x n, h n ; θ k )) log K h exp(ψ(x n=1 n, h n ; θ k )) n=1 Note that we treat data x as an input. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 52 / 54

Result We investigate amount of training data vs classification accuracy for generative and discriminative Markov model We use the synthetic control chart dataset from UCI machine learning repository. 6 classes 100 sequences for each class. 70 sequence for training 30 sequence for test. Classification Accuracy % 90 80 70 60 50 40 30 20 Discriminative Generative 10 10 20 30 40 50 60 70 80 90 100 % of the training set Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 53 / 54

Conclusions and Future Work We have proposed two novel algorithms for time series clustering. We have investigated generative and discriminative classification of time series. As future work, we plan on investigating the hierarchical Bayesian viewpoint for deriving spectral learning algorithms for more complicated time series models. A complete spectral learning algorithm for mixture of HMMs based on constrained optimization. Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 54 / 54