Expectation-Maximization (EM) algorithm

I529: Machine Learning in Bioinformatics (Spring 2017). Expectation-Maximization (EM) algorithm. Yuzhen Ye, School of Informatics and Computing, Indiana University, Bloomington.

Contents
- Introduce the EM algorithm using the coin-flipping experiment
- Formal definition of the EM algorithm
- Two toy examples: coin toss with missing data; coin toss with hidden data
- Applications of the EM algorithm: motif finding, the Baum-Welch algorithm, binning of metagenomes

A coin-flipping experiment. θ: the probability of getting heads; θ_A: the probability of coin A landing on heads; θ_B: the probability of coin B landing on heads. Ref: What is the expectation maximization algorithm? Nature Biotechnology 26, 897-899 (2008).

When the identities of the coins are unknown: instead of committing to a single best guess, the EM algorithm computes probabilities for each possible completion of the missing data using the current parameters, and from these computes the expected numbers of heads and tails attributable to each coin (e.g., E(H) for coin B).

Main applications of the EM algorithm:
- When the data indeed has missing values, due to problems with or limitations of the observation process.
- When optimizing the likelihood function is analytically intractable, but it can be simplified by assuming the existence of (and values for) additional missing (or hidden) parameters.

The EM algorithm handles hidden data. Consider a model where, for observed data x and model parameters θ, $p(x \mid \theta) = \sum_z p(x, z \mid \theta)$; here z is the hidden variable that is marginalized out. Finding the θ* that maximizes $\sum_z p(x, z \mid \theta)$ is hard! The EM algorithm reduces the difficult task of optimizing $\log p(x; \theta)$ into a sequence of simpler optimization subproblems: in each iteration, the EM algorithm receives parameters θ^(t) and returns new parameters θ^(t+1) such that $p(x \mid \theta^{(t+1)}) \ge p(x \mid \theta^{(t)})$.

The EM algorithm. In each iteration the EM algorithm does the following.
E-step: calculate $Q_t(\theta) = \sum_z P(z \mid x; \hat{\theta}^{(t)}) \log p(x, z; \theta)$.
M-step: find the $\hat{\theta}^{(t+1)}$ which maximizes the Q function (the next iteration then sets $\theta^{(t)} = \hat{\theta}^{(t+1)}$ and repeats).
The EM update rule: $\hat{\theta}^{(t+1)} = \arg\max_{\theta} \sum_z P(z \mid x; \hat{\theta}^{(t)}) \log p(x, z; \theta)$.
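
As a rough sketch of this iteration in code (not from the lecture; the e_step, m_step, and log_likelihood callables are illustrative placeholders that would be supplied for a specific model):

```python
import numpy as np

def em(x, theta0, e_step, m_step, log_likelihood, max_iter=100, tol=1e-8):
    """Generic EM loop: alternate E- and M-steps until log p(x; theta) stops improving."""
    theta, prev_ll = theta0, -np.inf
    for t in range(max_iter):
        posteriors = e_step(x, theta)    # P(z | x; theta^(t)) for every completion z
        theta = m_step(x, posteriors)    # theta^(t+1) = argmax_theta Q_t(theta)
        ll = log_likelihood(x, theta)    # log p(x; theta^(t+1)), non-decreasing across iterations
        if ll - prev_ll < tol:           # stop once the improvement is negligible
            break
        prev_ll = ll
    return theta
```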

Convergence of the EM algorithm. Compare the Q function and the g function:
$Q_t(\theta) = \sum_z P(z \mid x; \hat{\theta}^{(t)}) \log p(x, z; \theta)$
$g_t(\theta) = \sum_z P(z \mid x; \hat{\theta}^{(t)}) \log \frac{P(x, z; \theta)}{P(z \mid x; \hat{\theta}^{(t)})}$
Figure 1 of the cited reference illustrates the convergence of the EM algorithm. Starting from the current parameters $\hat{\theta}^{(t)}$, the E-step constructs a function $g_t$ that lower-bounds the objective function $\log p(x; \theta)$, i.e., $g_t(\theta) \le \log p(x; \theta)$ for all θ, with $g_t(\hat{\theta}^{(t)}) = \log p(x; \hat{\theta}^{(t)})$. In the M-step, $\hat{\theta}^{(t+1)}$ is computed as the maximizer of $g_t$. In the next E-step a new lower bound $g_{t+1}$ is constructed; maximization of $g_{t+1}$ in the next M-step gives $\hat{\theta}^{(t+2)}$, etc. Because the lower bound $g_t$ matches the objective function at $\hat{\theta}^{(t)}$, it follows that
$\log p(x; \hat{\theta}^{(t)}) = g_t(\hat{\theta}^{(t)}) \le g_t(\hat{\theta}^{(t+1)}) \le \log p(x; \hat{\theta}^{(t+1)})$   (2)
so the objective function never decreases over the iterations of expectation maximization. Ref: Nature Biotechnology 26, 897-899 (2008).

The EM update rule: $\hat{\theta}^{(t+1)} = \arg\max_{\theta} \sum_z P(z \mid x; \hat{\theta}^{(t)}) \log p(x, z; \theta)$. The EM update rule maximizes the log likelihood of a dataset expanded to contain all possible completions of the unobserved variables, where each completion is weighted by its posterior probability.

Coin toss with missing data. Given a coin with two possible outcomes, H (head) and T (tail), with probabilities θ and 1-θ, respectively. The coin is tossed twice, but only the 1st outcome, T, is seen, so the data is x = (T, *) (incomplete data!). We wish to apply the EM algorithm to obtain parameters that increase the likelihood of the data. Let the initial parameter be θ = 1/4.

The EM algorithm at work. Inputs: observation x = (T, *); hidden completions z1 = (T, T) and z2 = (T, H); initial guess θ_t = 1/4.
$Q_t(\theta) = \sum_z P(z \mid x; \theta_t) \log P(x, z; \theta)$
$= P(z_1 \mid x; \theta_t) \log P(x, z_1; \theta) + P(z_2 \mid x; \theta_t) \log P(x, z_2; \theta)$
$= P(z_1 \mid x; \theta_t) \log P(z_1; \theta) + P(z_2 \mid x; \theta_t) \log P(z_2; \theta)$
$= P(z_1 \mid x; \theta_t) \log[\theta^{n_H(z_1)} (1-\theta)^{n_T(z_1)}] + P(z_2 \mid x; \theta_t) \log[\theta^{n_H(z_2)} (1-\theta)^{n_T(z_2)}]$
$= P(z_1 \mid x; \theta_t)[n_H(z_1)\log\theta + n_T(z_1)\log(1-\theta)] + P(z_2 \mid x; \theta_t)[n_H(z_2)\log\theta + n_T(z_2)\log(1-\theta)]$
$= n_H \log\theta + n_T \log(1-\theta)$,
where $n_H = P(z_1 \mid x; \theta_t)\, n_H(z_1) + P(z_2 \mid x; \theta_t)\, n_H(z_2)$, $n_T = P(z_1 \mid x; \theta_t)\, n_T(z_1) + P(z_2 \mid x; \theta_t)\, n_T(z_2)$, and $n_H(z_1)$ is the number of heads in z1 (i.e., (T, T)), etc. The expression $n_H \log\theta + n_T \log(1-\theta)$ is maximized when $\theta = n_H / (n_H + n_T)$.
With θ_t = 1/4:
$P(x; \theta_t) = P(z_1; \theta_t) + P(z_2; \theta_t) = (1-\theta_t)^2 + (1-\theta_t)\theta_t = 3/4$
$P(z_1 \mid x; \theta_t) = P(x, z_1; \theta_t) / P(x; \theta_t) = (1-\theta_t)^2 / P(x; \theta_t) = (9/16)/(3/4) = 3/4$
$P(z_2 \mid x; \theta_t) = 1 - P(z_1 \mid x; \theta_t) = 1/4$
$n_H(z_1) = 0$, $n_T(z_1) = 2$, $n_H(z_2) = 1$, $n_T(z_2) = 1$.
Hence $n_H = 1/4$, $n_T = 7/4$, and $\theta_{t+1} = n_H/(n_H + n_T) = 1/8$.

The EM algorithm at work, continued. Initial guess: θ = 1/4; after one iteration: θ = 1/8. In fact each iteration halves the estimate ($\theta_{t+1} = \theta_t/2$), so θ approaches the maximum-likelihood value θ = 0 but the optimal parameter is never actually reached by the EM algorithm.
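
A minimal numeric check of this toy example (an illustrative sketch, not code from the slides):

```python
def em_missing_coin(theta=0.25, iterations=5):
    """EM for observed data x = (T, *); completions are z1 = (T, T) and z2 = (T, H)."""
    for t in range(iterations):
        # E-step: posterior over the two completions under the current theta
        p_x = (1 - theta) ** 2 + (1 - theta) * theta   # = 1 - theta
        p_z1 = (1 - theta) ** 2 / p_x                  # weight of completion (T, T)
        p_z2 = 1 - p_z1                                # weight of completion (T, H)
        # Expected head/tail counts under the posterior
        n_h = p_z1 * 0 + p_z2 * 1
        n_t = p_z1 * 2 + p_z2 * 1
        # M-step: theta = expected heads / expected tosses
        theta = n_h / (n_h + n_t)
        print(f"iteration {t + 1}: theta = {theta}")
    return theta

em_missing_coin()   # 0.125, 0.0625, 0.03125, ... each step halves theta
```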

Coin toss with hidden data. Two coins A and B, with parameters θ = {θ_A, θ_B}; compute the θ that maximizes the log likelihood of the observed data x = {x_1, x_2, ..., x_5}, where x_1, ..., x_5 are independent sets of 10 tosses. E.g., with the initial parameters θ_A = 0.60 and θ_B = 0.50, the posterior for the first set is
$P(z_1 = A \mid x_1; \theta_t) = \frac{P(z_1 = A, x_1; \theta_t)}{P(z_1 = A, x_1; \theta_t) + P(z_1 = B, x_1; \theta_t)} = \frac{0.6^5 \cdot 0.4^5}{0.6^5 \cdot 0.4^5 + 0.5^5 \cdot 0.5^5} \approx 0.45$

Observation          n_H  n_T  P(A)  P(B)  Coin A (H, T)  Coin B (H, T)
x1: HTTTHHTHTH        5    5   0.45  0.55    2.2, 2.2       2.8, 2.8
x2: HHHHTHHHHH        9    1   0.80  0.20    7.2, 0.8       1.8, 0.2
x3: HTHHHHHTHH        8    2   0.73  0.27    5.9, 1.5       2.1, 0.5
x4: HTHTTTHHTT        4    6   0.35  0.65    1.4, 2.1       2.6, 3.9
x5: THHHTHHHTH        7    3   0.65  0.35    4.5, 1.9       2.5, 1.1
Totals                                      21.3H, 8.6T    11.7H, 8.4T

New parameters: θ_A = 21.3/(21.3+8.6) ≈ 0.71, θ_B = 11.7/(11.7+8.4) ≈ 0.58.
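
A small sketch of a single EM iteration for this two-coin example (illustrative code, assuming a uniform prior over which coin was picked, as in the cited Nature Biotechnology primer):

```python
import numpy as np

def two_coin_em_step(observations, theta_A, theta_B):
    """One EM iteration with hidden coin identities; returns updated (theta_A, theta_B)."""
    counts = {"A": np.zeros(2), "B": np.zeros(2)}     # expected [heads, tails] per coin
    for obs in observations:
        h, t = obs.count("H"), obs.count("T")
        # E-step: posterior probability that this set of tosses came from coin A
        like_A = theta_A ** h * (1 - theta_A) ** t
        like_B = theta_B ** h * (1 - theta_B) ** t
        p_A = like_A / (like_A + like_B)
        counts["A"] += p_A * np.array([h, t])
        counts["B"] += (1 - p_A) * np.array([h, t])
    # M-step: new parameters are the expected fractions of heads for each coin
    return counts["A"][0] / counts["A"].sum(), counts["B"][0] / counts["B"].sum()

xs = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH", "HTHTTTHHTT", "THHHTHHHTH"]
print(two_coin_em_step(xs, 0.60, 0.50))   # approximately (0.71, 0.58)
```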

Motif finding problem. The motif finding problem is not that different from the coin toss problem! Probabilistic approaches to motif finding include EM and Gibbs sampling (a generalized EM algorithm); there are also combinatorial approaches.

Motif finding problem. Given a set of DNA sequences:
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
find the motif in each of the individual sequences.

The MEME algorithm. Collect all substrings of the same length w from the input sequences: X = (X_1, ..., X_n). Treat the sequences as bags of subsequences: one bag for the motif and one for the background. We need to figure out two models (one for the motif, one for the background) and assign each subsequence to one of the bags, such that the likelihood of the data (the subsequences) is maximized. This is a difficult problem, solved by the EM algorithm; a sketch of the first step appears below.
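
Collecting the length-w windows is straightforward; a tiny illustrative sketch (names are not from the lecture):

```python
def all_substrings(sequences, w):
    """Collect every length-w window from each input sequence (candidate motif or background)."""
    return [seq[i:i + w] for seq in sequences for i in range(len(seq) - w + 1)]
```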

Motif finding vs coin toss. [Figure: each subsequence, e.g. tagacgctatc, gctatccacgt, gtaggtcctct, is fractionally assigned to the motif model M and the background model B, e.g. 0.3 M / 0.7 B, 0.7 M / 0.3 B, 0.2 M / 0.8 B.] The analogy: in the coin experiment, θ is the probability of getting heads, with θ_A = P(head) for coin A and θ_B = P(head) for coin B, and the probability of an observation sequence is $P(x \mid \theta) = \theta^{\#(\text{heads})} (1-\theta)^{\#(\text{tails})}$; in motif finding, the probability of a subsequence is P(x | M) under the motif model or P(x | B) under the background model.

Fitting a mixture model by EM. A finite mixture model: the data X = (X_1, ..., X_n) arise from two or more groups described by g models q = (q_1, ..., q_g). Introduce indicator vectors Z = (Z_1, ..., Z_n), where Z_i = (Z_{i1}, ..., Z_{ig}) and Z_{ij} = 1 if X_i is from group j and 0 otherwise, with P(Z_{ij} = 1 | q_j) = λ_j; for any given i, all Z_{ij} are 0 except one. For motif finding g = 2: class 1 (the motif) is given by a position-specific multinomial distribution and class 2 (the background) by a general multinomial distribution.

The E- and M-steps. E-step: since the log likelihood is a sum over i and j of terms multiplied by Z_{ij}, and these are independent across i, we need only consider the expectation of one such term given X_i. Using the initial parameter values q and λ, and the fact that the Z_{ij} are binary, we get
$E(Z_{ij} \mid X, q, \lambda) = \frac{\lambda_j P(X_i \mid q_j)}{\sum_k \lambda_k P(X_i \mid q_k)} = \hat{Z}_{ij}$
M-step: the maximization over λ is independent of the rest and is readily achieved with $\lambda_j = \sum_i \hat{Z}_{ij} / n$.
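
A short sketch of these two updates for the g = 2 motif/background case (illustrative code; motif_probs is assumed to be a w x 4 position-specific probability matrix and bg_probs a single background distribution over a, c, g, t, both as NumPy arrays fixed for this cycle; their re-estimation from the weighted base counts is omitted):

```python
import numpy as np

BASES = "acgt"

def em_cycle(subseqs, motif_probs, bg_probs, lam):
    """One E/M cycle of the two-component mixture; lam = (lambda_motif, lambda_background)."""
    z = np.zeros((len(subseqs), 2))
    for i, s in enumerate(subseqs):
        idx = [BASES.index(c) for c in s]
        p_motif = np.prod([motif_probs[pos, j] for pos, j in enumerate(idx)])
        p_bg = np.prod(bg_probs[idx])
        w = np.array([lam[0] * p_motif, lam[1] * p_bg])
        z[i] = w / w.sum()                  # E-step: Z_ij = lam_j P(X_i|q_j) / sum_k lam_k P(X_i|q_k)
    lam_new = z.sum(axis=0) / len(subseqs)  # M-step: lambda_j = sum_i Z_ij / n
    return z, lam_new
```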

Baum-Welch algorithm for HMM parameter estimation. During each iteration, compute the expected number of transitions between every pair of states and the expected number of emissions of each symbol from every state, averaging over all ways of explaining the training sequences (E-step); these expected counts are then used to compute new parameters (M-step). With forward values $f_k^j(i)$, backward values $b_k^j(i)$ and likelihood $p(x^j)$ for each training sequence $x^j$ ($j = 1, \ldots, n$) of length L:
$A_{kl} = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} p(s_{i-1}=k, s_i=l, x^j \mid \theta) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} f_k^j(i-1)\, a_{kl}\, e_l(x_i^j)\, b_l^j(i)$
$E_k(b) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i:\, x_i^j = b} f_k^j(i)\, b_k^j(i)$
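
A sketch of how these expected counts could be accumulated for one training sequence, assuming the forward matrix f, the backward matrix b (rows indexed by position 0..L, with row 0 for the start) and the sequence likelihood px = p(x) have already been computed (illustrative code, not the lecture's):

```python
import numpy as np

def expected_counts(x, a, e, f, b, px):
    """Accumulate Baum-Welch expected transition counts A[k, l] and emission
    counts E[k, symbol] for one sequence x (a list of symbol indices)."""
    K = a.shape[0]
    A = np.zeros((K, K))
    E = np.zeros_like(e)
    for i in range(1, len(x) + 1):
        sym = x[i - 1]
        # A_kl += f_k(i-1) * a_kl * e_l(x_i) * b_l(i) / p(x)
        A += np.outer(f[i - 1], e[:, sym] * b[i]) * a / px
        # E_k(x_i) += f_k(i) * b_k(i) / p(x)
        E[:, sym] += f[i] * b[i] / px
    return A, E
```

The M-step then normalizes these counts row-wise: $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$.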

Application of EM algorithms in metagenomics: binning.
- AbundanceBin: binning of short reads into bins (species). Ref: A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-tuples (JCB 2011, 18(3): 523-534; RECOMB 2010).
- MaxBin/MaxBin2: binning of assembled metagenomic scaffolds using an EM algorithm (Microbiome, 2014, doi: 10.1186/2049-2618-2-26).

Pros and cons.
Pros: the E-step and M-step are often easy to implement for many problems, thanks to the nice form of the complete-data likelihood function; solutions to the M-step often exist in closed form.
Cons: slow convergence; convergence only to a local optimum.
References:
On the convergence properties of the EM algorithm. C. F. J. Wu, 1983.
A gentle tutorial of the EM algorithm and its applications to parameter estimation for Gaussian mixture and hidden Markov models. J. A. Bilmes, 1998.
What is the expectation maximization algorithm? Nature Biotechnology 26, 897-899 (2008).