XII.3 The EM (Expectation-Maximization) Algorithm


Toshinori Munakata, 3/7/06

The EM algorithm is a technique for dealing with various types of incomplete data or hidden variables. It can be applied to a wide range of learning problems, including Bayesian networks.

Basics of the EM algorithm

Assume that we have an incomplete data set. It is incomplete because data for some or all instances of certain attributes are missing, or some characteristic values are unknown. The data and associated parameters in the EM algorithm are classified into three categories: X, Z and θ. Let X = {x_1, ..., x_m} be a set of observed data, Z = {z_1, ..., z_m} be a set of unobserved (i.e., hidden) data, and Y = X ∪ Z be the entire set of data. There is a one-to-one correspondence between x_i in X and z_i in Z. Also let θ be a set of unknown parameters that characterizes Y. Given X, our problem is to determine Z and θ. We note that the problem is doubly complicated because we have to determine both Z and θ.

Depending on the application, X, Z and θ can take a variety of forms. For example, each x_i and z_i can be a scalar or a vector having components such as z_i1, z_i2, .... Each parameter in θ can also be a scalar or a vector.

The basic steps of the EM algorithm can be described as follows:

0. Initialization. Assign values to the parameters in θ (arbitrarily, at random, or based on some knowledge). Then repeat (iterate) the following two steps until the solution for Z and θ converges.
1. Expectation Step (E-step for short). Assuming the current parameter values are correct, compute the expected values of Z.
2. Maximization Step (M-step for short). Assuming that Y = X ∪ Z is a truly observed data set (although the Z part is not), calculate new parameter values of θ.

Step 2 is called the maximization step because we search for a maximum likelihood hypothesis in terms of the parameters given the data set Y = X ∪ Z. As with many other so-called hill-climbing algorithms, EM can get stuck in a local maximum, particularly when substantial data are missing.

Prelude: Simple Illustrations

As discussed in elementary statistics textbooks, the normal distribution f(x) with mean µ and standard deviation σ is given by

    f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}    (1)

f(x) represents the probability density function, or the probability distribution, of the normal distribution. The area under the curve f(x) is 1, with the normalizing factor 1/(σ√(2π)) (see Fig. 6). Many types of data follow the normal distribution with appropriate scaling factors.
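For concreteness, equation (1) can be evaluated with a few lines of Python. This is a minimal sketch for illustration only; the function name normal_pdf and the sample values are not part of the original notes.

    import math

    def normal_pdf(x, mu, sigma):
        """Normal density f(x) of equation (1), with mean mu and standard deviation sigma."""
        coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        return coef * math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

    # Example: density of a normal distribution with mu = 5, sigma = 2 at x = 5.47
    print(normal_pdf(5.47, 5.0, 2.0))   # approx 0.194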

Figure 6. A normal distribution with mean µ and standard deviation σ.

Example 1. A 10-point quiz in a class of 100 students. The only given data is X = {x_i}, i = 1, ..., 100, where x_i represents the score received by the i-th student. For example, X = {x_1, x_2, x_3, x_4, ..., x_100} = {0, 1, 1, 2, ..., 10}. (Note: usually the normal distribution is used for continuous values of x, such as x = 5.47. We are applying the distribution to discrete values of x as a simple illustration.) If we assume that the x_i are sorted for simplicity, the data indicate that one student received 0 points, two students received 1 point, etc. Fig. 7 shows parts of X: x_1, x_2, x_3, x_99, and x_100. Each circle on the x-axis represents a data instance.

Figure 7. Points received by students on a 10-point quiz in a class of 100 students. Only the points for five of the 100 students are shown: the leftmost circle represents x_1 = 0, the next two circles represent x_2 = 1 and x_3 = 1, and the two rightmost circles x_99 = 10 and x_100 = 10.

In general, the data need not be sorted. For example, x_1 for Adams may be 8, x_2 for Smith may be 3, etc. It is common practice to tally the data to see the point distribution more easily. The following is such a tally for our example:

    x:    0  1  2  3   4   5   6   7   8  9  10
    f(x): 1  2  6  7  12  20  19  17  12  3   1     Total = 100

Here x represents the quiz score, and f(x) the number, i.e., the frequency, of students who received score x. This is convenient for seeing the distribution, i.e., how many students received which score. But the original information in the raw data, the score received by each individual student, is lost. The tally can be depicted as a graph by adding an ordinate representing the number of students to Fig. 7 and dropping the circles representing students, resulting in Fig. 8. A graph such as Fig. 8 can be approximated by a normal distribution like equation (1) and Fig. 6 (with an appropriate scaling factor).
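The tallying step itself is mechanical; in Python it is one call to collections.Counter. The score list below is a made-up fragment, since only a few of the 100 values are spelled out above:

    from collections import Counter

    # Hypothetical raw scores for a few students (the full list is not given)
    scores = [0, 1, 1, 2, 8, 3, 10, 10]

    tally = Counter(scores)          # frequency f(x) of each score x
    for x in range(11):              # possible scores on a 10-point quiz: 0..10
        print(f"x = {x:2d}  f(x) = {tally[x]}")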

Figure 8. Point distribution for a 10-point quiz in a class of 100 students.

In certain cases, data may not fit a single normal distribution well. Instead, it may be more natural to fit the data to a mixture of normal distributions. Suppose the point distribution is given as follows:

    x:    0  1   2   3   4   5  6  7   8  9  10
    f(x): 2  2  10  17  16  14  3  7  18  6   5     Total = 100

Fig. 9 depicts this distribution. We see two peaks, at x = 3 and x = 8. It may be natural to consider this distribution as a mixture of two normal distributions with different means, standard deviations, and heights. In general, data can be a mixture of two, three, ..., normal distributions. This is the type of problem the EM algorithm addresses: given data, some of which are observable while some are not, we are to determine the underlying mixture of normal distributions.

Figure 9. Data that fit a mixture of two normal distributions.

Case Study: A Mixture of K Distinct Normal Distributions

Fig. 10 shows an example of two normal distributions, f_1(x) with µ_1 and σ_1, and f_2(x) with µ_2 and σ_2. The area under each of the two distributions is 1. In our problem, we want to fit each data instance x_i to a probability distribution f(x) that is a mixture of K distinct normal distributions multiplied by weights, where K is a known positive integer (for example, K = 2 in Fig. 10):

    f(x) = \sum_{k=1}^{K} p_k\, f_k(x; \mu_k, \sigma_k)    (2)
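A direct way to read equation (2) is as a weighted sum of K evaluations of the density in equation (1). The following sketch reuses normal_pdf from above; the name mixture_pdf is illustrative only:

    def mixture_pdf(x, p, mu, sigma):
        """Mixture density of equation (2): K weighted normal components.

        p, mu, sigma are length-K lists of weights, means, and standard
        deviations; the weights p must sum to 1.
        """
        return sum(p[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(len(p)))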

where f_k is the k-th normal distribution for a data instance x, with mean µ_k and standard deviation σ_k. For specific values of x, µ_k and σ_k, the constituent function f_k(x; µ_k, σ_k) can be evaluated by substituting these values into equation (1). p_k is the weight, or the probability, of the k-th component contributing to the total distribution f(x). p_k is associated only with k and does not depend on a specific x; \sum_{k=1}^{K} p_k = 1 holds. The probability distribution f(x_i) for a specific data instance x_i can be represented by simply replacing x with x_i in equation (2).

Figure 10. A mixture of K = 2 normal distributions, f_1(x) and f_2(x). Each circle on the x-axis represents a data instance.

Example 2. Suppose that our probability distribution f(x) is a mixture of K = 2 normal distributions, f_1(x) and f_2(x) given in Fig. 10, with p_1 = 0.8 and p_2 = 0.2. We note p_1 + p_2 = 1. Then

    f(x) = 0.8 f_1(x; \mu_1, \sigma_1) + 0.2 f_2(x; \mu_2, \sigma_2)
         = \frac{0.8}{\sigma_1\sqrt{2\pi}}\, e^{-(x-\mu_1)^2/(2\sigma_1^2)} + \frac{0.2}{\sigma_2\sqrt{2\pi}}\, e^{-(x-\mu_2)^2/(2\sigma_2^2)}

A graph of this f(x) can be obtained from Fig. 10 as follows: contract (flatten) f_1(x) and f_2(x) along the ordinate direction by multiplying them by 0.8 and 0.2, respectively, then add the two graphs (Fig. 11).

Figure 11. Graph of f(x) obtained by superposing K = 2 normal distributions, 0.8 f_1(x) and 0.2 f_2(x).
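Example 2 can be evaluated numerically with the mixture_pdf sketch above. The means and standard deviations here are made-up placeholders, since Fig. 10 specifies only the shapes of the two components:

    # Example 2 in code (mu and sigma values are hypothetical)
    p     = [0.8, 0.2]
    mu    = [3.0, 8.0]
    sigma = [1.0, 1.5]
    print(mixture_pdf(3.0, p, mu, sigma))   # density near the first (heavier) component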

The EM algorithm

Let X = {x_1, ..., x_m} be a set of observed data instances. Our problem is to determine a set of unobservable data Z and a set of parameters θ that characterizes Y = X ∪ Z. More specifically, Z and θ are:

Z = {p_ik}, i = 1, ..., m and k = 1, ..., K, where p_ik represents the probability that x_i belongs to the k-th component. Here each z_i in Z = {z_1, ..., z_m} is a vector having K components: z_i1 = p_i1, ..., z_iK = p_iK.

θ = {p_1, ..., p_K, µ_1, ..., µ_K, σ_1, ..., σ_K}, the parameter vector.

In the Fig. 10 example, where K = 2, {x_1, ..., x_m} are represented by the small circles on the abscissa. They are the only data that are observable. Our problem is to determine the two distributions f_1(x) and f_2(x) through the parameter vector θ, together with {p_ik}, a measure of which data instance is generated by which distribution. Our EM algorithm can be performed as follows:

Step 0. Initialization. Assign appropriate values to θ = {p_1, ..., p_K, µ_1, ..., µ_K, σ_1, ..., σ_K}.

E-step. Assuming the current value of θ, compute Z = {p_ik}, i = 1, ..., m and k = 1, ..., K, as follows:

    p_{ik} = P(k \mid x_i) = \frac{(\text{weight}) \cdot (k\text{-th distribution})}{(\text{total distribution})} = \frac{p_k\, f_k(x_i; \mu_k, \sigma_k)}{f(x_i)}    (3)

This result can also be obtained by employing Bayes' rule:

    P(k \mid x_i) = \frac{P(x_i \mid k)\, P(k)}{P(x_i)} = \frac{p_k\, f_k(x_i; \mu_k, \sigma_k)}{f(x_i)}

In the above, we used P(x_i | k) = f_k(x_i; µ_k, σ_k), P(k) = p_k, and P(x_i) = f(x_i). We note that

    \sum_{k=1}^{K} p_{ik} = \sum_{k=1}^{K} \frac{p_k\, f_k(x_i; \mu_k, \sigma_k)}{f(x_i)} = \frac{f(x_i)}{f(x_i)} = 1,

i.e., the probabilities add up to 1 for each x_i. f_k(x_i; µ_k, σ_k) can be determined by equation (1), and f(x_i) can be determined by equation (2).

M-step. From the above, we can estimate the new parameters of θ as follows:

    p_k = \frac{1}{m} \sum_{i=1}^{m} p_{ik}    (4)

to average the probability for the k-th component over the m data points;

    \mu_k = \frac{\sum_{i=1}^{m} p_{ik}\, x_i}{\sum_{i=1}^{m} p_{ik}}    (5)

to average x_i over the m data points with weight factors p_ik; and

    \sigma_k = \sqrt{\frac{\sum_{i=1}^{m} p_{ik}\, (x_i - \mu_k)^2}{\sum_{i=1}^{m} p_{ik}}}    (6)

which is the standard deviation version of equation (5) for µ_k. After initialization, iterations are performed as E, M, E, M, ... steps until Z and θ converge.
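Equations (3) through (6) translate almost line for line into code. The following is a compact sketch of the whole iteration for one-dimensional data, assuming NumPy. All names (em_mixture, the responsibility matrix r holding the p_ik, and so on) are illustrative, the fixed iteration count stands in for a proper convergence test, and there are no guards against degenerate components; it is a sketch, not production code.

    import numpy as np

    def em_mixture(x, K, n_iter=100, seed=0):
        """EM for a 1-D mixture of K normal distributions, per equations (3)-(6)."""
        x = np.asarray(x, dtype=float)
        rng = np.random.default_rng(seed)
        # Step 0. Initialization: uniform weights, means picked from the data,
        # and the global standard deviation for every component.
        p = np.full(K, 1.0 / K)
        mu = rng.choice(x, size=K, replace=False)
        sigma = np.full(K, x.std())
        for _ in range(n_iter):
            # E-step: responsibilities p_ik = p_k f_k(x_i) / f(x_i), equation (3).
            dens = (p / (sigma * np.sqrt(2 * np.pi))
                    * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)))  # (m, K)
            r = dens / dens.sum(axis=1, keepdims=True)
            # M-step: equations (4), (5), (6).
            p = r.mean(axis=0)                                        # (4)
            mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)         # (5)
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / r.sum(axis=0)
            sigma = np.sqrt(var)                                      # (6)
        return p, mu, sigma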

Example 3. A simple special case: K = 2, and σ_1 = σ_2 = σ is known, so that θ = {p_1, p_2, µ_1, µ_2}. Then

    f(x_i) = p_1 f_1(x_i; \mu_1, \sigma) + p_2 f_2(x_i; \mu_2, \sigma)

E-step. For k = 1, 2:

    p_{ik} = \frac{p_k\, f_k(x_i; \mu_k, \sigma)}{f(x_i)} = \frac{p_k\, e^{-(x_i-\mu_k)^2/(2\sigma^2)}}{p_1\, e^{-(x_i-\mu_1)^2/(2\sigma^2)} + p_2\, e^{-(x_i-\mu_2)^2/(2\sigma^2)}}    (3')

(The common normalizing factors 1/(σ√(2π)) cancel.)

M-step. Compute the new p_k, µ_k for k = 1, 2:

    p_k = \frac{1}{m} \sum_{i=1}^{m} p_{ik}    (4')

    \mu_k = \frac{\sum_{i=1}^{m} p_{ik}\, x_i}{\sum_{i=1}^{m} p_{ik}}    (5')

To perform iterations, initialize θ = {p_1, p_2, µ_1, µ_2}, then repeat the E and M steps, starting with the E-step.

In the above, each data instance x_i is assumed to be a single scalar value. As an extension, each data instance can be a vector x_i when there are multiple independent variables. For example, when there are three independent variables, each data instance would be a vector x_i = (x_i1, x_i2, x_i3). The above discussion can be extended by replacing the scalar quantities x_i, z_i, and the parameters with vector quantities.
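As a quick check of the em_mixture sketch above, one can generate synthetic data resembling the two-peak situation of Fig. 9 and confirm that the iteration recovers the mixture, up to a possible reordering of the components. The numbers below are made up:

    # Two components with shared sigma = 1, means 3 and 8, weights 0.8 and 0.2
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(3.0, 1.0, 80), rng.normal(8.0, 1.0, 20)])
    p, mu, sigma = em_mixture(x, K=2)
    print(p, mu, sigma)   # should land near p = (0.8, 0.2), mu = (3, 8), sigma = (1, 1)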