Probabilistic Time Series Classification

Probabilistic Time Series Classification. Y. Cem Sübakan, Boğaziçi University. M.Sc. Thesis Defense, 25.06.2013.

Problem Statement: The goal is to assign "similar" sequences to the same class.

Problem Statement: This is not a trivial task; there is variability within each class.

Problem Statement: Data in real-life applications are even more challenging, so we make a computer program "learn" how to classify sequences.

Goal of this thesis: The goal of this thesis is to understand and develop learning algorithms for supervised and unsupervised time series classification models. We propose two novel learning algorithms, one for mixtures of Markov models and one for mixtures of HMMs. [Graphical models shown: the mixture of Markov models, with cluster indicator h_n, observed sequence x_{1,n}, ..., x_{T_n,n} (n = 1...N) and transition matrices A_k (k = 1...K); and the mixture of HMMs, which adds latent state sequences r_{1,n}, ..., r_{T_n,n} and emission parameters O_k.]

Achievements:
A new spectral learning algorithm for mixtures of Markov models, based on learning a mixture of Dirichlet distributions.
A new spectral-learning-based algorithm for mixtures of HMMs: faster than the conventional EM approach, and possible to generalize to an infinite mixture of HMMs.
A survey of discriminative versions of Markov models and HMMs.

Outline
1 Learning a Latent Variable Model with ML: Expectation Maximization (EM) algorithm description, drawbacks
2 Method of Moments based learning algorithms: Method of Moments vs ML; Spectral Learning for LVMs
3 Spectral Learning for Mixture of Markov models: derivation of an algorithm with existing methods; derivation of a superior learning scheme; results
4 Spectral Learning based algorithm for Mixture of HMMs: for finite mixtures; for infinite mixtures; results
5 Discriminative vs Generative Time Series Models

EM for latent variable models: The optimization problem for training a latent variable model is
\theta^\star = \arg\max_\theta p(x \mid \theta) = \arg\max_\theta \sum_h p(x, h \mid \theta),
where x are the observations (sequences), h are the latent variables (cluster indicators / latent state sequences), and \theta are the model parameters. The summation is over all possible configurations of h, which is intractable in practice.

EM for latent variable models: The idea is to construct a lower bound on \log p(x \mid \theta) that is easier to maximize. Write
\log p(x \mid \theta) = \log \sum_h p(x, h \mid \theta) = \log \sum_h q(h) \frac{p(x, h \mid \theta)}{q(h)} = \log \mathbb{E}_{q(h)}\!\left[ \frac{p(x, h \mid \theta)}{q(h)} \right].
Then, using Jensen's inequality,
\log \mathbb{E}_{q(h)}\!\left[ \frac{p(x, h \mid \theta)}{q(h)} \right] \geq \mathbb{E}_{q(h)}\!\left[ \log \frac{p(x, h \mid \theta)}{q(h)} \right],
so
\log p(x \mid \theta) \geq \underbrace{\mathbb{E}_{q(h)}[\log p(x, h \mid \theta)] - \mathbb{E}_{q(h)}[\log q(h)]}_{=: Q(\theta)}.
Q(\theta) is easier to maximize than p(x \mid \theta).
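One identity worth keeping in mind (standard, though not spelled out on the slide) explains why the E-step choice q(h) = p(h \mid x, \theta) used below makes the bound tight:

```latex
\log p(x \mid \theta)
  = \underbrace{\mathbb{E}_{q(h)}\!\left[\log p(x, h \mid \theta)\right]
      - \mathbb{E}_{q(h)}\!\left[\log q(h)\right]}_{Q(\theta)}
  + \underbrace{\mathrm{KL}\!\left(q(h)\,\big\|\,p(h \mid x, \theta)\right)}_{\ge 0}
```

so the gap between \log p(x \mid \theta) and Q(\theta) vanishes exactly when q(h) = p(h \mid x, \theta).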

EM for latent variable models. EM Algorithm: Initialize \theta; then, at each iteration:
E-Step: compute the lower bound Q(\theta) using q(h) = p(h \mid x, \theta).
M-Step: \theta = \arg\max_\theta Q(\theta).
Toy example: consider a joint distribution p(x_1, x_2) [shown as a heat map over (x_1, x_2)], and take h = x_2, \theta = x_1. The problem is
x_1^\star = \arg\max_{x_1} \sum_{x_2} p(x_1, x_2).
[Figures: successive EM iterates 1, 2, 3, ... overlaid on \log p(x_1) for two initializations, x_1^{(1)} = 111 and x_1^{(1)} = 95.]

EM for latent variable models: We see that EM is an iterative algorithm and that it is sensitive to initialization. In the next section, we introduce an alternative learning method for LVMs that is non-iterative and free of local optima, based on the method of moments.
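To make the E-step/M-step loop concrete, here is a minimal sketch (not from the thesis) of EM for a two-component 1-D Gaussian mixture; the model, variable names, and data are illustrative assumptions, but the structure is the same as above: alternate between computing posteriors q(h) and re-maximizing Q(\theta), and note that different initializations of the means can land in different local optima.

```python
import numpy as np

def em_gmm_1d(x, mu_init, n_iter=50):
    """Minimal EM for a 2-component 1-D Gaussian mixture (illustrative sketch)."""
    mu = np.array(mu_init, dtype=float)      # component means (theta)
    var = np.array([1.0, 1.0])               # component variances
    pi = np.array([0.5, 0.5])                # mixing weights
    for _ in range(n_iter):
        # E-step: q(h_n = k) = p(h_n = k | x_n, theta)
        log_lik = (-0.5 * (x[:, None] - mu) ** 2 / var
                   - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        q = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: maximize Q(theta) in closed form
        Nk = q.sum(axis=0)
        mu = (q * x[:, None]).sum(axis=0) / Nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi

# Two initializations can converge to different local optima.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(em_gmm_1d(x, mu_init=[-1.0, 1.0]))
print(em_gmm_1d(x, mu_init=[10.0, 10.5]))
```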

Method of Moments vs ML: Suppose we have i.i.d. data x_{1:N} generated from a Gamma distribution \mathcal{G}(x; a, b). A maximum likelihood estimator for a and b requires solving
\log(a) - \psi(a) = \log\!\left( \frac{1}{N} \sum_{n=1}^N x_n \right) - \frac{1}{N} \sum_{n=1}^N \log(x_n)
for the estimate \hat a, where \psi is the digamma function; then \hat b = \frac{1}{\hat a N} \sum_{n=1}^N x_n. So ML estimation for \mathcal{G}(a, b) requires an iterative method.

Method of Moments vs ML: Alternatively, we can use the method of moments to estimate a and b. The first two moments are
M_1 := \mathbb{E}[x] = ab, \qquad M_2 := \mathbb{E}[x^2] = ab^2 + a^2 b^2,
and solving for the parameters gives
\hat a = \frac{M_1^2}{M_2 - M_1^2}, \qquad \hat b = \frac{M_2 - M_1^2}{M_1},
with the empirical moments plugged in for M_1 and M_2.
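A minimal numerical sketch of the two estimators above (illustrative, not the thesis code): the method-of-moments estimate is a closed-form expression in the empirical moments, while the ML estimate needs an iterative root-finder for the digamma equation.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(0)
a_true, b_true = 1.2, 3.0                      # shape a, scale b (illustrative values)
x = rng.gamma(shape=a_true, scale=b_true, size=200)

# Method of moments: closed form in the empirical moments.
M1, M2 = x.mean(), (x ** 2).mean()
a_mom = M1 ** 2 / (M2 - M1 ** 2)
b_mom = (M2 - M1 ** 2) / M1

# Maximum likelihood: solve log(a) - psi(a) = log(mean(x)) - mean(log(x)) iteratively.
rhs = np.log(x.mean()) - np.log(x).mean()
a_ml = brentq(lambda a: np.log(a) - digamma(a) - rhs, 1e-3, 1e3)
b_ml = x.mean() / a_ml

print("MoM:", a_mom, b_mom, " ML:", a_ml, b_ml)
```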

Method of Moments vs ML: [Figure: for sample sizes N = 10, 40, 100, and 200, scatter plots of the method-of-moments estimate and the ML estimate of (a, b) against the true parameters.]

Method of Moments vs ML:
The method of moments yields simpler solutions than ML.
Method-of-moments solutions are not iterative; ML solutions are.
ML solutions are more efficient and are guaranteed to return parameter estimates in the feasible range.
ML solutions may get stuck in local maxima and require good initialization; method-of-moments solutions do not.

Spectral Learning for Latent Variable Models: Spectral learning is an instance of the method of moments; the model parameters are solved for from a function of observable moments. Consider a simple mixture model [graphical model: h_n -> x_n with parameters O(:, k), n = 1...N, k = 1...K]:
h_n \sim \text{Discrete}(p(h_n)), \qquad x_n \mid h_n \sim p(x_n \mid h_n, O(:, h_n)),
where x_n \in \mathbb{R}^L are observations, h_n \in \{1, \dots, K\} are cluster indicators, and \mathbb{E}[x \mid h] = O \in \mathbb{R}^{L \times K} holds the model parameters (e.g., for Poisson observations, O(i, k) = \lambda_{i,k}). Given x_{1:N}, one can use EM to learn O; alternatively, we can use spectral learning to learn O in a non-iterative fashion.

Spectral Learning for Latent Variable Models: Given that L \geq K and that O has K linearly independent columns,
B_i := (U^T \mathbb{E}[x \otimes x \, x_i] V)(U^T \mathbb{E}[x \otimes x] V)^{-1} = (U^T O)\,\text{diag}(O(i,:))\,(U^T O)^{-1},
where \mathbb{E}[x \otimes x] = U \Sigma V^T. B_i can be expressed in terms of the observable moments \mathbb{E}[x \otimes x] and \mathbb{E}[x \otimes x \, x_i], so the model parameters can simply be read off from the eigen-decomposition of B_i.

Spectral Learning for Latent Variable Models - Proof:
U^T \mathbb{E}[x \otimes x \, x_i] V = U^T O\,\text{diag}(O(i,:))\,\text{diag}(p(h))\, O^T V
= U^T O\,\text{diag}(O(i,:))\,(U^T O)^{-1}\, \underbrace{U^T O\,\text{diag}(p(h))\, O^T V}_{U^T \mathbb{E}[x \otimes x] V},
so
(U^T O)\,\text{diag}(O(i,:))\,(U^T O)^{-1} = (U^T \mathbb{E}[x \otimes x \, x_i] V)(U^T \mathbb{E}[x \otimes x] V)^{-1}.
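A small structural sketch of the computation in the two slides above (an illustrative assumption of how it might be coded, not the thesis implementation): estimate the moments empirically, project with the top-K singular vectors of the second moment, and read parameters off the eigen-decomposition of B_i. It mirrors the formulas as written for a single coordinate i and ignores the moment corrections and averaging over i that a practical estimator would need.

```python
import numpy as np

def spectral_mixture(X, K, i=0):
    """Spectral parameter-recovery sketch for a simple mixture with E[x|h] = O(:, h).

    X: (N, L) data matrix; returns an estimate related to O up to column permutation/scaling.
    """
    N, L = X.shape
    M2 = X.T @ X / N                                  # empirical E[x x^T], L x L
    U, S, Vt = np.linalg.svd(M2)
    U, V = U[:, :K], Vt[:K, :].T                      # top-K singular vectors
    P2 = U.T @ M2 @ V                                 # U^T E[x x^T] V, K x K
    M3_i = (X * X[:, [i]]).T @ X / N                  # empirical E[x x^T x_i], L x L
    B_i = U.T @ M3_i @ V @ np.linalg.inv(P2)          # (U^T O) diag(O(i,:)) (U^T O)^{-1}
    eigvals, eigvecs = np.linalg.eig(B_i)             # eigvecs ~ U^T O (up to scaling)
    O_hat = U @ eigvecs                               # map back to the L-dim observation space
    return np.real(O_hat), np.real(eigvals)           # eigvals ~ O(i, :)

# Illustrative call on synthetic Poisson mixture data (structure only).
rng = np.random.default_rng(0)
O_true = rng.uniform(1, 10, size=(6, 2))              # L = 6, K = 2 Poisson rates
h = rng.integers(0, 2, size=5000)
X = rng.poisson(O_true[:, h].T).astype(float)
O_hat, _ = spectral_mixture(X, K=2)
```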

Spectral Learning for Latent Variable Models: This algorithm is based on the work of Anandkumar et al. (2012), and it can be modified slightly to learn HMMs. The general goal is to express the observable moments in an eigen-decomposition form from which the parameters can be read off. This is a fast, local-optima-free algorithm for learning LVMs. However, this scheme requires high-order moments to learn a temporally connected LVM such as a mixture of Markov models.

Mixture of Markov models [graphical model: cluster indicator h_n, observed chain x_{1,n}, ..., x_{T_n,n}, n = 1...N, transition matrices A_k, k = 1...K]:
h_n \sim \text{Discrete}(p(h_n)), \qquad x_{t,n} \mid x_{t-1,n}, h_n, A \sim \text{Discrete}(A(:, x_{t-1,n}, h_n)).

Spectral Learning for Mixture of Markov models: We show that a spectral learning algorithm as in Anandkumar et al. (2012) requires observable moments up to order five. We therefore propose an alternative scheme based on hierarchical Bayesian modeling:
p(A_{h_n} \mid x_n, h_n) \propto p(x_n \mid A_{h_n}, h_n)\, p(A_{h_n}) = \text{Dirichlet}(c^n_{1,1} + \beta - 1,\; c^n_{1,2} + \beta - 1,\; \dots,\; c^n_{L,L} + \beta - 1).
Assuming a uniform Dirichlet prior p(A_{h_n}), i.e. taking \beta = 1, the posterior of A_{h_n} becomes \text{Dirichlet}(c^n_{1,1}, c^n_{1,2}, \dots, c^n_{L,L}). So we can represent a sequence purely via its empirical transition-count matrix:
s_n \sim \text{Dirichlet}(c^n_{1,1}, c^n_{1,2}, \dots, c^n_{L,L}) = p(A_{h_n} \mid x_n, h_n).
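As an illustration (names and helper are assumptions, not the thesis code, and it is a sketch under the assumption that s_n is the normalized count matrix), the statistic used here is just the matrix of empirical transition counts c^n_{l_1,l_2} of the sequence:

```python
import numpy as np

def transition_counts(x, L):
    """Empirical transition counts c[l1, l2] = #{t : x_t = l1, x_{t-1} = l2}.

    x: 1-D integer sequence with symbols in {0, ..., L-1}.
    """
    c = np.zeros((L, L))
    np.add.at(c, (x[1:], x[:-1]), 1)     # accumulate one count per observed transition
    return c

x = np.array([0, 1, 1, 2, 0, 1, 2, 2, 0])
c = transition_counts(x, L=3)
s = c / c.sum()                                          # normalized counts as the statistic s_n
A_point = c / c.sum(axis=0, keepdims=True).clip(min=1)   # column-normalized point estimate of A
```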

Spectral Learning for Mixture of Markov models - the Dirichlet distribution. [Figure-only slide.]

Spectral Learning for Mixture of Markov models - Dirichlet posterior. [Figure: two example observation sequences (observation value vs. time) and their empirical transition-count matrices over (x_{t-1}, x_t).]

Spectral Learning for Mixture of Dirichlet Distributions [graphical model: h_n -> x_n with parameters \alpha(:, k), n = 1...N, k = 1...K]:
h_n \sim \text{Discrete}(p(h_n)), \qquad x_n \mid h_n, \alpha \sim \text{Dirichlet}(\alpha(:, h_n)).
Note that \mathbb{E}[s_{n,i} \mid h_n = k] = \alpha(i, k)/\alpha_0. So we can estimate \alpha, up to the scaling factor \alpha_0, via spectral learning.

Spectral Learning for Mixture of Markov models. Input: sequences x_{1:N}. Output: clustering assignments \hat h_{1:N}.
1. Extract the sufficient statistics s_n from x_n, for all n \in \{1, \dots, N\}.
2. Compute empirical moment estimates of \mathbb{E}[s \otimes s] and \mathbb{E}[s \otimes s \otimes s]. Compute U and V such that \mathbb{E}[s \otimes s] = U \Sigma V^T.
3. Estimate \alpha via the eigen-decomposition of B_i.
4. For all n \in \{1, \dots, N\}, set \hat h_n = \arg\max_k \text{Dirichlet}(s_n; \alpha(:, k)).
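A sketch of the final assignment step (step 4) under the stated model: evaluate the Dirichlet log-density of each sequence's statistic s_n under each estimated \alpha(:, k) and take the argmax. The variable names are assumptions for illustration.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_logpdf(s, alpha):
    """Log-density of Dirichlet(alpha) evaluated at the simplex point s."""
    s = np.clip(s, 1e-12, None)               # guard against zero entries in s
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + ((alpha - 1) * np.log(s)).sum())

def assign_clusters(S, alpha):
    """S: (N, D) normalized sufficient statistics; alpha: (D, K) estimated parameters."""
    N, K = S.shape[0], alpha.shape[1]
    logp = np.array([[dirichlet_logpdf(S[n], alpha[:, k]) for k in range(K)]
                     for n in range(N)])
    return logp.argmax(axis=1)                 # hat{h}_n = argmax_k Dirichlet(s_n; alpha(:, k))
```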

Results - Synthetic Data. [Figure: average clustering accuracy (%) vs. sequence length (30, 90, 150, 300, 600, 1200) for Spectral MMarkov, Spectral MDirichlet, EM MMarkov, and EM MDirichlet.] On 100 synthetic datasets, each consisting of 60 sequences, average clustering accuracies for varying sequence lengths.

Results - Network Traffic Data. Exploratory data analysis on network traffic data consisting of Skype and BitTorrent traffic. [Figure: exploratory visualization of the 24 traffic sequences.]

Results - Network Traffic Data: effect of clustering in training (classification accuracy).
No Clustering: 63.41%
Mixture of Dirichlet (Spectral): 84.85%
Mixture of Dirichlet (EM): 79.39%
Mixture of Markov (EM): 83.51%

Results - Finding the number of Clusters. We can also estimate the number of clusters:
\mathbb{E}[s_n \otimes s_n] = \alpha\, \text{diag}(p(h))\, \alpha^T = \sum_{k=1}^{K} p(h = k)\, \alpha(:, k)\, \alpha(:, k)^T.
By looking at the jump in the ratio of consecutive singular values of \mathbb{E}[s_n \otimes s_n], we can estimate K. [Figures: consecutive singular-value ratio vs. number of clusters K, for the network traffic data and for the synthetic data.]
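A small sketch of that rank estimate (illustrative; it assumes S is the N x D matrix whose rows are the statistics s_n): compute the singular values of the empirical second moment and look for the largest ratio between consecutive ones.

```python
import numpy as np

def estimate_num_clusters(S, k_max=20):
    """Pick K at the largest gap (ratio) between consecutive singular values
    of the empirical second moment E[s s^T]."""
    M2 = S.T @ S / S.shape[0]
    sigma = np.linalg.svd(M2, compute_uv=False)[:k_max]
    ratios = sigma[:-1] / sigma[1:]          # sigma_k / sigma_{k+1}, k = 1 .. k_max-1
    return int(np.argmax(ratios)) + 1        # index of the largest jump, 1-based
```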

Results - Finding the number of Clusters: finding the number of clusters on motion capture data. [Figure: consecutive singular-value ratio vs. number of clusters K.]

Conclusions - Spectral Learning for Mixture of Markov models: Spectral learning of a mixture of Dirichlet distributions yields a simple and effective algorithm; on synthetic data, it outperforms its rivals. It is also possible to automatically determine the number of clusters.

Mixture of HMMs [graphical model: cluster indicator h_n, latent state sequence r_{1,n}, ..., r_{T_n,n}, observations x_{1,n}, ..., x_{T_n,n}, n = 1...N, with transition matrices A_k and emission parameters O_k, k = 1...K]:
h_n \sim \text{Discrete}(p(h_n)), \qquad r_{t,n} \mid r_{t-1,n}, h_n \sim \text{Discrete}(A(:, r_{t-1,n}, h_n)), \qquad x_{t,n} \mid r_{t,n}, h_n \sim p(x_{t,n} \mid O(:, r_{t,n}, h_n)).

Spectral Learning based algorithm for Mixture of HMMs: An EM algorithm for a mixture of HMMs is expensive; it requires forward-backward message passing for every sequence in every EM iteration. We propose to replace the parameter estimation step with spectral learning.

Cheaper Alternative Algorithm for HMM Mixture learning (EM version). Randomly initialize h_{1:N}; then, at each iteration:
for k = 1 ... K do: \theta_k \leftarrow \text{EMHMM}(\{x_n : h_n = k\}) end for
for n = 1 ... N do: h_n = \arg\max_k p(x_n \mid \theta_k) end for

Cheaper Alternative Algorithm for HMM Mixture learning (spectral version). Randomly initialize h_{1:N}; then, at each iteration:
for k = 1 ... K do: \theta_k \leftarrow \text{SpectralHMM}(\{x_n : h_n = k\}) end for
for n = 1 ... N do: h_n = \arg\max_k p(x_n \mid \theta_k) end for
With spectral learning, the required computation in the parameter estimation step becomes K low-rank SVDs and K eigen-decompositions, which is substantially cheaper than N \times K forward-backward passes.
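A structural sketch of this alternating scheme (illustrative: `spectral_hmm` and `hmm_loglik` stand in for the spectral parameter estimator and the HMM likelihood, which are not reproduced here):

```python
import numpy as np

def cluster_hmm_mixture(sequences, K, spectral_hmm, hmm_loglik, n_iter=10, seed=0):
    """Alternate between per-cluster spectral HMM estimation and hard reassignment.

    spectral_hmm(list_of_sequences) -> theta_k
    hmm_loglik(sequence, theta_k)   -> log p(x_n | theta_k)
    Assumes every cluster stays non-empty (a practical version would handle empty clusters).
    """
    rng = np.random.default_rng(seed)
    h = rng.integers(0, K, size=len(sequences))        # random initial assignments
    for _ in range(n_iter):
        # Parameter estimation step: one spectral fit per cluster.
        thetas = [spectral_hmm([x for x, hn in zip(sequences, h) if hn == k])
                  for k in range(K)]
        # Assignment step: hard-assign each sequence to its most likely HMM.
        h = np.array([np.argmax([hmm_loglik(x, th) for th in thetas])
                      for x in sequences])
    return h, thetas
```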

Results on mocap data. Results obtained on motion capture data: we input the temporal differences of the 3D coordinates of 32 markers on the human body, using 20 sequences. We restarted the algorithm 10 times with random initial assignments h_{1:N} and recorded the results.
Algorithm: Average Success (%) / Max Success (%) / Iteration time* (s) / Average convergence
EM1: 70 / 100 / 7.5 / 3.2
EM2: 73 / 100 / 12 / 2.9
Spectral: 76 / 100 / 3.1 / 3.8
*PC: 3.33 GHz dual core CPU, 4 GB RAM; software: MATLAB

Learning Infinite mixture of HMMs: In finite mixtures, we fix the number of clusters K a priori; in infinite mixtures, K is learned automatically. Learning an infinite mixture involves an intractable integral over the HMM parameters; we approximate these integrals using spectral learning.

Learning Infinite mixture of HMMs: Learning an infinite mixture of HMMs requires an integral over the HMM parameters \theta = (O, A) (collapsed Gibbs sampler). For existing clusters k \leq K:
p(h_n = k \mid h^{\neg n}_{1:N}, x_{1:N}) \propto \frac{N_k^{\neg n}}{N + \alpha - 1} \int p(x_n \mid \theta)\, p(\theta \mid \{x_l : l \neq n, h_l = k\})\, d\theta.
To open a new cluster:
p(h_n = K + 1 \mid h^{\neg n}_{1:N}, x_{1:N}) \propto \frac{\alpha}{N + \alpha - 1} \int p(x_n \mid \theta)\, p(\theta)\, d\theta.

Learning Infinite Mixture of HMMs: One can use Gibbs sampling with the auxiliary variable method of Neal (2000). However, this requires sampling from the prior to approximate the marginal likelihood, which needs a large number of samples and is slow. We suggest using spectral learning to learn infinite mixtures of HMMs (IM-HMMs).

IMM - Spectral learning: Assuming that the sequences are long enough, we make the approximation
\int p(x_n \mid \theta)\, p(\theta \mid \{x_l : l \neq n, h_l = k\})\, d\theta \approx p(x_n \mid \theta^{\text{spectral}}_k).
We compute the marginal likelihood of each sequence offline and reuse it in each iteration. By modifying the spectral algorithm for finite mixtures, we obtain a learning algorithm for infinite mixtures of HMMs:

Cheaper Alternative Algorithm for HMM Mixture learning (infinite version). Randomly initialize h_{1:N}; then, at each iteration:
for k = 1 ... K do: \theta_k \leftarrow \text{SpectralHMM}(\{x_n : h_n = k\}) end for
for n = 1 ... N do: h_n = \arg\max_k [p(x_n \mid \theta_1), \dots, p(x_n \mid \theta_K), p(x_n)]; if h_n > K then K = K + 1 end if end for
The marginal likelihoods p(x_n) are computed before starting the algorithm. Note that this algorithm is the analogue, for mixtures of HMMs, of the infinite version of k-means: the Dirichlet Process Means algorithm of Kulis and Jordan.
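A sketch of this infinite variant, building on the finite loop shown earlier (again illustrative; `spectral_hmm`, `hmm_loglik`, and `log_marginal` are placeholders for the spectral estimator, the HMM likelihood, and the precomputed marginal likelihoods p(x_n), and the choices of starting from one cluster and refitting a new cluster from its founding sequence are assumptions, not the thesis recipe):

```python
import numpy as np

def cluster_hmm_infinite(sequences, spectral_hmm, hmm_loglik, log_marginal, n_iter=10):
    """DP-means-style variant: a sequence opens a new cluster when its precomputed
    marginal likelihood beats every existing cluster's likelihood.
    Assumes clusters stay non-empty (prune empty clusters in practice)."""
    K = 1
    h = np.zeros(len(sequences), dtype=int)             # start with a single cluster
    for _ in range(n_iter):
        thetas = [spectral_hmm([x for x, hn in zip(sequences, h) if hn == k])
                  for k in range(K)]
        for n, x in enumerate(sequences):
            scores = [hmm_loglik(x, th) for th in thetas] + [log_marginal[n]]
            h[n] = int(np.argmax(scores))
            if h[n] == K:                                # the "new cluster" option won
                K += 1
                thetas.append(spectral_hmm([x]))         # initialize the new cluster from x_n
    return h, K
```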

Results - Human Action Classification: We use the KTH human action database: 6 classes (boxing, hand waving, hand clapping, walking, jogging, running) and 100 sequences, with 64 sequences for training and 36 for testing. We do clustering in the training phase to increase the classification accuracy, since data instances in the same class tend to show variability.

Results - Human Action Classification (classification accuracy).
No Clustering: 70.1%
Spectral: 74.1%
Gibbs Sampling: 74.0%

Results - MoCap Data: A dataset of 44 sequences with two different walking styles and running. The algorithm automatically determines K = 3 and converges in 9 seconds. [Figure: motion-capture visualizations of the three discovered clusters.]

Conclusions - Spectral Learning for Mixture of HMMs: We have proposed a simple, fast, and effective algorithm for learning mixtures of HMMs. The algorithm is competitive with the more conventional Gibbs sampling method.

Discriminative vs Generative Models: Given the training set (x_n, h_n)_{n=1}^N, training a generative model solves
\theta^\star = \arg\max_\theta \prod_n p(x_n \mid \theta, h_n),
while training a discriminative model solves
\theta^\star = \arg\max_\theta \prod_n p(h_n \mid x_n, \theta).

Discriminative vs Generative Models: [Figure: logistic regression vs. a naive Bayes classifier (with isotropic Gaussian observations) on 2-D data.] Generative models model the data distribution p(x \mid h); discriminative models model the class labels p(h \mid x).

Discriminative and Generative Time Series Models: HMMs, Markov models, and their mixtures are all generative models. Deriving the discriminative model corresponding to an LVM amounts to converting the model into an undirected graph, i.e., defining an energy function. For the discriminative Markov model:
\psi(x_n, h_n; \theta) = \sum_{t=1}^{T_n} \sum_{k=1}^{K} \sum_{l_1=1}^{L} \sum_{l_2=1}^{L} [h_n = k]\,[l_1 = x_t]\,[l_2 = x_{t-1}]\, \theta_{k, l_1, l_2}.
The training objective is
\theta^\star_{1:K} = \arg\max_{\theta_{1:K}} \sum_{n=1}^{N} \log \frac{\exp(\psi(x_n, h_n; \theta))}{\sum_{h} \exp(\psi(x_n, h; \theta))}.
Note that we treat the data x as an input.
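A small sketch of the energy function above (illustrative code, not from the thesis): \psi(x, h; \theta) just sums the transition weights \theta_{h, x_t, x_{t-1}} along the sequence, and the class posterior is a softmax over \psi.

```python
import numpy as np

def psi(x, h, theta):
    """Energy psi(x, h; theta) = sum_t theta[h, x_t, x_{t-1}] for a discrete sequence x."""
    return theta[h, x[1:], x[:-1]].sum()

def class_posterior(x, theta):
    """p(h | x, theta) proportional to exp(psi(x, h; theta))."""
    scores = np.array([psi(x, h, theta) for h in range(theta.shape[0])])
    scores -= scores.max()                       # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()

K, L = 3, 4                                      # 3 classes, 4 symbols (illustrative sizes)
theta = np.zeros((K, L, L))
x = np.array([0, 1, 2, 2, 3, 0])
print(class_posterior(x, theta))                 # uniform posterior for theta = 0
```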

Result: We investigate the amount of training data vs. classification accuracy for the generative and the discriminative Markov model, using the synthetic control chart dataset from the UCI machine learning repository: 6 classes, 100 sequences per class, with 70 sequences for training and 30 for testing. [Figure: classification accuracy (%) vs. percentage of the training set used, for the discriminative and the generative model.]

Conclusions and Future Work: We have proposed two novel algorithms for time series clustering, and we have investigated generative and discriminative classification of time series. As future work, we plan to investigate the hierarchical Bayesian viewpoint for deriving spectral learning algorithms for more complicated time series models, and a complete spectral learning algorithm for mixtures of HMMs based on constrained optimization.