Particle Swarm Optimization of Hidden Markov Models: a comparative study
D. Novák, Department of Cybernetics, Czech Technical University in Prague, Czech Republic, xnovakd@labe.felk.cvut.cz
M. Macaš, Department of Cybernetics, Czech Technical University in Prague, Czech Republic, mmacas@seznam.cz

Abstract — In recent years, Hidden Markov Models (HMM) have been increasingly applied in data mining applications; however, most authors have used the classical Expectation-Maximization (EM) optimization scheme. A new method of HMM learning based on Particle Swarm Optimization (PSO) has been developed. Along with other global approaches, Simulated Annealing (SIM) and Genetic Algorithms (GA), the following local gradient methods have also been compared: the classical Expectation-Maximization algorithm, the Maximum A Posteriori approach (MAP) and variational Bayes learning (VAR). The methods are evaluated on a synthetic data set using different evaluation criteria, including a classification problem. The most reliable optimization approach in terms of performance, numerical stability and speed is VAR learning, followed by the PSO approach.

I. INTRODUCTION

Will the classical EM algorithm for HMM optimization stand up? For several decades the EM algorithm has been the gold standard for HMM optimization. Owing to the increasing popularity of the HMM modelling technique, HMMs have been applied in many areas such as speech processing, signal processing, dynamic systems, robotics, handwriting recognition, economics and molecular biology. In all of these applications the authors used the EM algorithm. In this paper we ask whether other optimization techniques could bring further improvements. We present a comparative study of several different methods for continuous HMMs and introduce a new global technique based on Particle Swarm Optimization (PSO) [1]. The techniques can be divided into two groups: (i) hill-climbing algorithms (the EM, MAP and VAR approaches) and (ii) global search algorithms (PSO, the genetic approach and simulated annealing). The first group depends quite strongly on the initial estimate of the model parameters; an arbitrary initial estimate will usually lead to a sub-optimal model in practice. The second group is able to escape from the initial guess and find better solutions thanks to its global search capability.

The paper is organized as follows. First, HMM theory is introduced. Then, in Sect. II-B, the optimization techniques are described. In Sect. II-C several criteria for evaluating the methods are defined. Finally, results are presented in Sect. III and a discussion with concluding remarks follows in Sect. IV.

II. METHOD

A. Hidden Markov Models

An HMM is a stochastic finite state automaton characterized by the following [2]:

1) N, the number of states in the model. Although the states are hidden, for many practical applications they often have some physical significance. We denote the individual states as S = (S_1, S_2, ..., S_N), and the state at time t as q_t.

2) M, the number of distinct observation symbols per state, or the number of mixtures in the Gaussian pdf.

3) The state transition probability distribution A = {a_{ij}}, of size N × N, which defines the probability of a transition from state i at time t to state j at time t+1:

a_{ij} = P(q_{t+1} = S_j | q_t = S_i),   1 ≤ i, j ≤ N.   (1)

4) The initial state distribution π = {π_i}, which defines the probability of any given state being the initial state of the given sequences:

π_i = P(q_1 = S_i),   1 ≤ i ≤ N.   (2)
5) The emission probability, which can be further divided into two categories depending on whether the observation sequence is discrete or continuous. In the present paper the continuous model is used: the continuous emission probability B = {b_j(O_t)}, where O = O_1, O_2, ..., O_T, and the emission probability density function of each state is defined by a finite multivariate Gaussian mixture:

b_j(O_t) = Σ_{m=1}^{M} d_{jm} N(O_t, µ_{jm}, C_{jm}),   1 ≤ j ≤ N,   (3)

where O_t is the feature vector of the sequence being modelled, d_{jm} is the mixture coefficient of the m-th mixture in state j, and N is a Gaussian density with mean vector µ_{jm} and covariance matrix C_{jm} for the m-th mixture component in state j.
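To make Eq. (3) concrete, the following sketch (not part of the original paper; a minimal NumPy illustration with hypothetical variable names) evaluates the Gaussian-mixture emission density b_j(O_t) for a single state j.

```python
import numpy as np

def gaussian_pdf(o, mu, cov):
    """Multivariate Gaussian density N(o; mu, cov)."""
    d = len(mu)
    diff = o - mu
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def emission_prob(o_t, d_j, mu_j, cov_j):
    """b_j(O_t) of Eq. (3): a finite mixture of M Gaussians for one state j.

    d_j   : (M,)      mixture coefficients d_{jm}, summing to one
    mu_j  : (M, D)    mean vectors mu_{jm}
    cov_j : (M, D, D) covariance matrices C_{jm}
    """
    return sum(d_j[m] * gaussian_pdf(o_t, mu_j[m], cov_j[m])
               for m in range(len(d_j)))
```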
We will refer to these models as continuous HMMs (CHMM). A complete specification of an HMM requires the specification of two model parameters (N and M) and of the three probability measures A, B and π. We will refer to the complete parameter set as λ = {A, B, π}. Each model can be used to compute the probability P(O|λ) of observing an input sequence O = O_1, ..., O_T, to find the state sequence S that maximizes the probability P(S|O, λ) of the input sequence, and to induce the model λ̄ that maximizes the probability of a given sequence, P(O|λ̄) > P(O|λ). These tasks are known as the three problems of an HMM: evaluation, generation, and training.

B. HMM optimization techniques

1) Expectation-Maximization [7], or Baum-Welch: The EM algorithm is a general method for finding the maximum-likelihood (ML) estimate of the parameters of an underlying distribution from a given data set when the data are incomplete or have missing values. Let us have a density function P(O|λ) governed by the parameter set λ and observations O = {O_1, O_2, ..., O_N}. We assume that these data are independent and identically distributed (iid) with distribution P. The resulting density for the samples is therefore

P(O|λ) = Π_{i=1}^{N} P(O_i|λ) = L(λ|O).   (4)

In the maximum likelihood problem, our goal is to find the λ that maximizes L. That is, we wish to find λ* where

λ* = argmax_λ L(λ|O).   (5)

The EM algorithm first finds the expected value of the complete-data likelihood P(O, S|λ) with respect to the unknown hidden states S, given the observed data O and the current parameter estimates λ^(i). The evaluation of this expectation is called the E-step of the algorithm. The second step, called the M-step, maximizes the expectation computed in the first step.

2) Maximum A Posteriori approach (MAP) [3]: The difference between MAP and ML estimation lies in the assumption of an appropriate prior distribution of the parameters to be estimated from the observation sequence O with probability density function P(O|λ). If P(λ) is the prior density function of λ, then the MAP estimate is defined as

λ_MAP = argmax_λ P(λ|O) = argmax_λ P(O|λ) P(λ),   (6)

where we used Bayes' theorem. Regarding the MAP estimate, it follows the same iterative procedure as the EM algorithm described in the previous paragraph. Nevertheless, the EM algorithm can be applied to the MAP estimation problem if the prior density P(λ) belongs to the conjugate family of the complete-data density. For the initial and transition probabilities, a Dirichlet density was used for the initial probability vector π and for each row of the transition probability matrix A. For the mean vector and covariance matrix of the Gaussian mixture, the conjugate densities are a Normal density for the mean and a normal-Wishart density for the covariance [3].
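To illustrate the ML/MAP distinction, the sketch below (a standard conjugate-prior update shown for illustration under assumed array shapes, not the authors' implementation) re-estimates the transition matrix A from the expected transition counts produced by the E-step. Without a prior it is the ordinary Baum-Welch (ML) M-step; with a Dirichlet prior per row it becomes the MAP M-step, which simply adds pseudo-counts before normalization.

```python
import numpy as np

def reestimate_transitions(expected_counts, dirichlet_prior=None):
    """M-step for the transition matrix A from expected counts.

    expected_counts : (N, N) array; entry [i, j] is the expected number of
                      i -> j transitions accumulated during the E-step.
    dirichlet_prior : optional (N, N) array of Dirichlet hyperparameters
                      (one Dirichlet per row of A), as used in MAP learning.
    """
    counts = np.asarray(expected_counts, dtype=float).copy()
    if dirichlet_prior is not None:
        # MAP: the mode of the Dirichlet posterior adds (prior - 1) pseudo-counts.
        counts += np.asarray(dirichlet_prior, dtype=float) - 1.0
    # ML (Baum-Welch) and MAP alike: normalize each row to sum to one.
    return counts / counts.sum(axis=1, keepdims=True)
```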
3) Variational Bayes learning: We wish to approximate the conditional probability P(S|O), because exact algorithms might not provide a satisfactory solution to the inference and learning problems due to their time or space complexity. We introduce an approximating family of conditional probability distributions, Q(S|O, ν), where ν are variational parameters. From this family we choose a particular distribution by minimizing the Kullback-Leibler (KL) divergence, D(Q‖P), with respect to the variational parameters [4]:

ν* = argmin_ν D[Q(S|O, ν) ‖ P(S|O)],   (7)

where the KL divergence between any probability distributions Q(V) and P(V), with V = {S, O}, is defined as

D(Q‖P) = Σ_{V} Q(V) ln [Q(V) / P(V)].   (8)

The minimizing values of the variational parameters, ν*, define a particular distribution, Q(S|O, ν*), that we treat as the best approximation of P(S|O) within the family Q(S|O, ν). Another important remark is that the optimization procedure can be cast in the framework of the EM algorithm, as in MAP learning. We used the same approximating family as in the MAP approach, i.e. Dirichlet, Normal and normal-Wishart densities.

Fig. 1 shows the typical development of the density functions during a particular training period. Note that the model priors were set to be as flat as possible in this case. The mean priors, which follow a Normal cdf, are very flat, see Fig. 1(b), first column. The covariance priors, which follow a Wishart cdf, were set to the total training covariance matrix. As the training process proceeds, the refinement of the means and covariances can be observed. A similar development can also be observed for the MAP approach.

Fig. 1. Sub-figure (a): Development of the first two Dirichlet cdf states of model (13) throughout variational Bayes learning; the total number of iterations was 38. Sub-figure (b): Development of the mean and variance cdf of model (13) throughout variational Bayes learning.

4) Simulated annealing: Simulated annealing is a well-known general heuristic approach to combinatorial optimization. Given the observation sequence O, a state sequence S is generated at random and the logarithm of the probability P(O|λ) of generating O is considered to be the objective value f(S) to be minimized. The solution structure is based on the choice of a state trajectory. The various building blocks were chosen as follows [5]: (i) The initial solution is obtained simply by generating a random state trajectory.
(ii) The initial temperature should be high enough to allow virtually all transitions to be accepted. Thereafter, the temperature is decreased at each iteration by a factor of 0.98. (iii) The number of trials at each temperature progressively increases as the temperature decreases (in our case by a constant factor). (iv) A move from one solution to the next is obtained by choosing at random a state at a randomly chosen instant and reassigning it randomly to another state. (v) The objective function to be minimized is the overall probability of the observation sequence.

5) Genetic algorithm: Thanks to the global search capability and the problem-independent nature of GAs, a GA for HMM training can find the optimal model parameters. Generally speaking, each GA consists of several components: an encoding mechanism, a fitness evaluation, a selection mechanism, and a replacement mechanism. Below we briefly describe our algorithm, which is a modification of the approach proposed in [6], with a selection sketch given after this paragraph. In the encoding mechanism, each chromosome represents one HMM and each gene expresses one HMM parameter. The likelihood P(O|λ) is used in the fitness function to determine the quality of a chromosome. The selection mechanism is one of the most common, roulette-wheel selection. Finally, steady-state reproduction is used as the replacement strategy. To speed up the GA, we also used a hybrid operator: after every ten generations, classical EM estimation (with only 8 HMM iterations) is applied to all chromosomes in the population. Regarding the algorithm setup, the number of generations was N_gen = 6, the number of chromosomes in the population N_pop = 6, and the number of offspring N_child = 6.
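The sketch below is a rough, NumPy-based illustration (hypothetical helper names; not the authors' code) of the encoding and roulette-wheel selection steps just described. Each chromosome is the HMM parameter set flattened into a vector of genes, and parents are drawn with probability proportional to a positive-shifted version of their log-likelihood fitness. The steady-state replacement and the hybrid EM operator are omitted.

```python
import numpy as np

def encode_hmm(A, pi, means, variances):
    """Flatten one HMM parameter set into a chromosome; each gene is one parameter."""
    return np.concatenate([A.ravel(), pi.ravel(),
                           means.ravel(), variances.ravel()])

def roulette_select(population, fitnesses, rng):
    """Roulette-wheel selection over a list of chromosomes.

    fitnesses are log-likelihoods log P(O | lambda); they are shifted to be
    positive so that selection probabilities are proportional to fitness.
    """
    f = np.asarray(fitnesses, dtype=float)
    f = f - f.min() + 1e-9
    probs = f / f.sum()
    idx = rng.choice(len(population), p=probs)
    return population[idx]

# Toy usage: select a parent chromosome from a small random population.
rng = np.random.default_rng(0)
population = [rng.random(20) for _ in range(6)]
fitnesses = [-120.0, -95.5, -101.2, -88.7, -130.4, -99.9]  # illustrative values only
parent = roulette_select(population, fitnesses, rng)
```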
6) Particle Swarm Optimization: The PSO method is an optimization method developed for finding the global optimum of a nonlinear function [1]. It is inspired by the social behaviour of birds and fish. The method uses a group of problem solutions. Each solution consists of a set of parameters and represents a point in a multidimensional space. A solution is called a particle and the group of particles (the population) is called a swarm. Two kinds of information are available to the particles. The first is their own experience: they have tried the choices and know which state has been best so far and how good it was. The second is social knowledge: the particles know how the other individuals in their neighbourhood have performed.

Each particle i is represented as a D-dimensional position vector x_i(t) and has a corresponding instantaneous velocity vector v_i(t). Furthermore, it remembers its individual best value of the fitness function and the position p_i at which that value was achieved. During each iteration t, the velocity update rule (9) is applied to each particle in the swarm, where p_g is the best position of the entire swarm and represents the social knowledge:

v_i(t) = α v_i(t−1) + Φ_1 (p_i − x_i(t−1)) + Φ_2 (p_g − x_i(t−1)).   (9)

The parameter α is called the inertia weight and decreases linearly from α_start to α_end over the iterations. The symbols Φ_1 and Φ_2 are computed according to

Φ_j = ϕ_j [r_{j1}, r_{j2}, ..., r_{jD}]^T,   j = 1, 2,   (10)

where the parameters ϕ_j are constants that weight the influence of the particle's own experience and of the social knowledge, and r_{jk}, k = 1, ..., D, are random numbers drawn from a uniform distribution between 0 and 1. In our experiments, fixed values of ϕ_1, ϕ_2, α_start and α_end were used. Next, the position update rule

x_i(t) = x_i(t−1) + v_i(t)   (11)

is applied. If any component of v_i is less than −V_max or greater than +V_max, the corresponding value is replaced by −V_max or +V_max, respectively. V_max is the maximum velocity parameter, whose setting depends on the range of the HMM parameters. The update formulas (9) and (11) are applied during each iteration, and the p_i and p_g values are updated simultaneously. The algorithm stops when the maximum number of iterations is reached.
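The velocity and position updates (9)-(11), including the ±V_max clamping, can be sketched as follows (a minimal NumPy illustration under assumed array shapes, not the authors' implementation; decoding a particle back into the HMM parameter set λ and evaluating its log-likelihood fitness are left out).

```python
import numpy as np

def pso_step(x, v, p_best, g_best, alpha, phi1, phi2, v_max, rng):
    """One swarm update according to Eqs. (9)-(11).

    x, v    : (P, D) positions and velocities of the P particles
    p_best  : (P, D) best position found so far by each particle
    g_best  : (D,)   best position found so far by the whole swarm
    alpha   : inertia weight (decreased linearly from alpha_start to alpha_end)
    phi1/2  : constants weighting own experience and social knowledge
    v_max   : maximum absolute velocity per component
    """
    r1 = rng.random(x.shape)        # uniform [0, 1] draws r_{1k}
    r2 = rng.random(x.shape)        # uniform [0, 1] draws r_{2k}
    v = alpha * v + phi1 * r1 * (p_best - x) + phi2 * r2 * (g_best - x)
    v = np.clip(v, -v_max, v_max)   # clamp every component to [-Vmax, +Vmax]
    x = x + v                       # position update, Eq. (11)
    return x, v
```

In a complete training loop, each particle would then be decoded into λ = {A, B, π}, its fitness log P(O|λ) evaluated, and p_i and p_g refreshed before the next iteration, until the maximum number of iterations is reached.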
C. Evaluation criteria

We evaluate the quality of the learned HMMs using the following criteria:

Data likelihood (Lik). The data likelihood measures the log-likelihood of the data for a given HMM.

Distance measure (DM). We can define a distance measure D(λ_1, λ_2) between two Markov models, λ_1 (generating model) and λ_2 (derived model), as [2]

D(λ_1, λ_2) = (1/T) [log P(O|λ_1) − log P(O|λ_2)],   (12)

where O = O_1, O_2, ..., O_T is a sequence of observations generated by model λ_1.

Classification experiment (Clas). The last benchmark is the classification rate (in %) on a synthetic data set generated from three very similar HMMs (13)-(15).

Time. Duration of the classification task in minutes. The following computational framework was used in all experiments: Intel Pentium, 3 GHz, Windows Vista, Matlab R7 edition.

III. RESULTS AND DISCUSSION

To verify the effectiveness of the different initialization and learning methods, we performed ten experiment runs in total (N_run = 10) to obtain statistically significant results. Regarding the HMM parameter setup, we used a continuous HMM with only one mixture component per state (M = 1) and a diagonal covariance matrix in B. We constructed three HMM models (13)-(15) for generating the synthetic data set. The models are full-transition models with four states. The data set consists of one observation sequence of length T, generated by a probabilistic walk through the HMM. In the first part of the experiment, the likelihood and the distance measure were evaluated on the first model (13). In the second part, the classification experiment, the HMM generating models used for each class are those shown in (13)-(15), differing only slightly in the transition matrix A and the observation matrix B; in total, the three HMMs are quite similar to each other.

(Models (13)-(15): the generating models λ_1, λ_2, λ_3 are four-state CHMMs specified by initial distributions π_k, full transition matrices A_k and single-Gaussian emission parameters, means µ and variances σ²; their numerical values are not reproduced here.)

A summary of the evaluation criteria is shown in Table I. The mean and the variance (in parentheses) of each criterion across the ten runs are given. For the local methods (EM, VAR, MAP), k-means initialization was used, and the stopping criterion combined a small tolerance on the likelihood improvement with a maximum number of iterations. The GA and PSO population sizes were held constant across the runs.

TABLE I
METHODS COMPARISON: LIKELIHOOD (LIK), DISTANCE MEASURE (DM), CLASSIFICATION (%) AND TIME (MINS)

Method   Lik           DM          Clas          Time
EM       99.6 (5.5)    . (.7)      85.  (.6)     .5
MAP      6.   (3.)     5.9 (3.)    83.6 (.5)     .
VAR      8.   (.6)     6.  (3.)    8.   (.3)     3.
SIM      .9   (5.6)    7.3 (5.)    8.   (3.9)    .
GA       .    (3.)     .   (.8)    83.  (5.)     9
PSO      7.   (5.9)    .8  (.)     9.7  (.7)     5.

The best classification performance is achieved by PSO and the classical HMM approach, while the MAP, VAR, GA and SIM methods yield similar results. Not surprisingly, all global optimization methods are several times slower than the local gradient approaches.
In particular, the time cost of GA and PSO is caused by their population size. The most stable method from the group of local optimization techniques was variational Bayes: unlike its counterparts (EM and MAP), the covariance matrix of the Gaussian density B did not collapse into singular points so frequently during optimization, see Fig. 1, where the density functions of the HMM parameters are depicted over the training iterations.

Fig. 2. Log-likelihood comparison of the algorithms (EM, MAP, VAR, SIM, GA, PSO) as a function of the iteration number.

In Fig. 2, the log-likelihood curves are compared for one run of each algorithm. In this case PSO outperformed classical EM in terms of likelihood; however, the final values of the global techniques are close to each other.

IV. CONCLUSION

All three local-search learning techniques follow the same EM optimization framework. Apart from the main drawback of the EM algorithm, its sensitivity to initialization, the EM algorithm also produced meaningless parameter estimates several times, when EM converged to the boundary of the parameter space. There the likelihood is unbounded [7], and the computation had to be either restarted completely, or it was sufficient to reset only the covariance diagonals. Using global strategies such as the PSO, SIM and GA approaches, these problems were overcome. On the other hand, we have traded time-efficient optimization for numerical stability. This is an important drawback, because even when using a hybrid combination of local and global approaches, these algorithms were still at least ten times slower than the local gradient approaches. In terms of performance on the evaluation criteria, no large differences between the local and global techniques were observed. To sum up, the most suitable choice is to apply a local gradient algorithm: the classical EM approach due to its speed, or the variational Bayes approach due to its numerical stability and insensitivity to initialization. However, if time is not the most limiting factor, then Particle Swarm Optimization yielded the best performance.

ACKNOWLEDGMENT

The project was supported by the Ministry of Education, Youth and Sport of the Czech Republic under grant No. MSM6877, "Transdisciplinary Research in Biomedical Engineering II".

REFERENCES

[1] R. Eberhart, Y. Shi, and J. Kennedy, Swarm Intelligence. Morgan Kaufmann, 2001.
[2] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, 1989.
[3] J. Gauvain and C. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2, 1994.
[4] I. Rezek and S. Roberts, "Learning ensemble hidden Markov models for biosignal analysis," in International Conference on Digital Signal Processing, Greece, 2002.
[5] Y. Hamam and T. Al-Ani, "Simulated annealing approach for training hidden Markov models," in Working Conference on Optimization-Based Computer-Aided Modeling and Design, ESIEE, France, 1996.
[6] S. Kwong, C. Chau, K. Man, and K. Tang, "Optimisation of HMM topology and its model parameters by genetic algorithms," Pattern Recognition, vol. 34, 2001.
[7] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. John Wiley & Sons, 1997.