Mixtures of Gaussians and the EM Algorithm

Mixtures of Gaussians and the EM Algorithm. CSE 6363 Machine Learning. Vassilis Athitsos, Computer Science and Engineering Department, University of Texas at Arlington.

Gaussians
A popular way to estimate probability density functions is to model them as Gaussians. Review: a 1D normal distribution is defined as:
N(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
To define a Gaussian, we need to specify just two parameters: μ, which is the mean (average) of the distribution, and σ, which is the standard deviation of the distribution. Note: σ^2 is called the variance of the distribution.

Estimating a Gaussian
In one dimension, a Gaussian is defined like this:
N(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
Given a set of real numbers x_1, ..., x_n, we can easily find the best-fitting Gaussian for that data. The mean μ is simply the average of those numbers:
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
The standard deviation σ is computed as:
\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}
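As a quick illustration, here is a minimal NumPy sketch of these two formulas (the function name and the data values are mine, not from the slides):

```python
import numpy as np

def fit_gaussian_1d(x):
    """Fit a 1D Gaussian: return the mean and the standard deviation."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()            # mu = (1/n) * sum_i x_i
    sigma = x.std(ddof=1)    # sigma = sqrt( (1/(n-1)) * sum_i (x_i - mu)^2 )
    return mu, sigma

# Example with made-up numbers:
mu, sigma = fit_gaussian_1d([1.2, 2.5, 1.9, 7.3, 7.8])
print(mu, sigma)
```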

Estimating a Gaussian
Fitting a Gaussian to data does not guarantee that the resulting Gaussian will be an accurate distribution for the data. The data may have a distribution that is very different from a Gaussian.

Example of Fitting a Gaussian
The blue curve is a density function F such that: F(x) = 0.25 for 1 ≤ x ≤ 3, and F(x) = 0.5 for 7 ≤ x ≤ 8. The red curve is the Gaussian fit G to data generated using F.

Naïve Bayes with 1D Gaussians
Suppose the patterns come from a d-dimensional space. Examples: pixels to be classified as skin or non-skin, or the statlog dataset. Notation: x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}). For each dimension j, we can use a Gaussian to model the distribution p_j(x_{i,j} | C_k) of the data in that dimension, given their class. For example, for the statlog dataset we would get 216 Gaussians: 36 dimensions * 6 classes. Then, we can use the naïve Bayes approach (i.e., assume that the dimensions are conditionally independent given the class) to define P(x | C_k) as:
P(x_i \mid C_k) = \prod_{j=1}^{d} p_j(x_{i,j} \mid C_k)
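A minimal sketch of this per-dimension Gaussian model (function names and array layout are my own choices, not from the slides); working with log-densities avoids numerical underflow when multiplying d densities:

```python
import numpy as np

def fit_naive_bayes_gaussians(X, y):
    """Fit one 1D Gaussian per (class, dimension) pair.
    X: n-by-d array of training vectors; y: length-n array of class labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    return {c: (X[y == c].mean(axis=0), X[y == c].std(axis=0, ddof=1))
            for c in np.unique(y)}

def log_p_x_given_class(x, params, c):
    """log P(x | C_k) under the conditional independence assumption."""
    mu, sigma = params[c]
    log_densities = (-0.5 * ((np.asarray(x, dtype=float) - mu) / sigma) ** 2
                     - np.log(sigma * np.sqrt(2.0 * np.pi)))
    return log_densities.sum()   # product of densities -> sum of logs
```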

Mixtures of Gaussians
This figure shows our previous example, where we fitted a Gaussian to some data, and the fit was poor. Overall, Gaussians have attractive properties: they require learning only two numbers (μ and σ), and thus need relatively little training data to estimate those numbers. However, for some data, Gaussians are just not good fits.

Mixtures of Gaussians
Mixtures of Gaussians are oftentimes a better solution. They are defined in the next slide. They still require relatively few parameters to estimate, and thus can be learned from relatively small amounts of data. They can fit actual data distributions quite well.

Mixtures of Gaussians
Suppose we have k Gaussian distributions N_i. Each N_i has its own mean μ_i and standard deviation σ_i. Using these k Gaussians, we can define a Gaussian mixture M as follows:
M(x) = \sum_{i=1}^{k} w_i N_i(x)
Each w_i is a weight, specifying the relative importance of Gaussian N_i in the mixture. Weights w_i are real numbers between 0 and 1. Weights w_i must sum up to 1, so that the integral of M is 1.
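A small sketch of how such a mixture density could be evaluated (the parameter values in the example are made up for illustration, not taken from the slides' figures):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1D Gaussian density N(x) with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """M(x) = sum_i w_i * N_i(x); the weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))

# Example: a two-component mixture with w_1 = 0.9, w_2 = 0.1 (illustrative parameters).
print(mixture_pdf(6.5, [0.9, 0.1], [3.0, 7.0], [1.0, 0.5]))
```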

Mixtures of Gaussians Example
The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.9 and w_2 = 0.1. The mixture looks a lot like N_1, but is influenced a little by N_2 as well.

Mixtures of Gaussians Example
The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.7 and w_2 = 0.3. The mixture looks less like N_1 compared to the previous example, and is influenced more by N_2.

Mixtures of Gaussians Example
The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.5 and w_2 = 0.5. At each point x, the value of the mixture is the average of N_1(x) and N_2(x).

Mixtures of Gaussians Example
The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.3 and w_2 = 0.7. The mixture now resembles N_2 more than N_1.

Mixtures of Gaussians Example
The blue and green curves show two Gaussians. The red curve shows a mixture of those Gaussians, with w_1 = 0.1 and w_2 = 0.9. The mixture now is almost identical to N_2(x).

Learning a Mixture of Gaussians
Suppose we are given training data x_1, x_2, ..., x_n. Suppose all x_j belong to the same class c. How can we fit a mixture of Gaussians to this data? This will be the topic of the next few slides. We will learn a very popular machine learning algorithm, called the EM algorithm. EM stands for Expectation-Maximization.
Step 0 of the EM algorithm: pick k manually. Decide how many Gaussians the mixture should have. Any approach for choosing k automatically is beyond the scope of this class.

Learning a Mixture of Gaussians
Suppose we are given training data x_1, x_2, ..., x_n. Suppose all x_j belong to the same class c. We want to model P(x | c) as a mixture of Gaussians. Given k, how many parameters do we need to estimate in order to fully define the mixture? Remember, a mixture M of k Gaussians is defined as:
M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}
For each N_i, we need to estimate three numbers: w_i, μ_i, σ_i. So, in total, we need to estimate 3k numbers.

Learning a Mixture of Gaussians
Suppose we are given training data x_1, x_2, ..., x_n. A mixture M of k Gaussians is defined as:
M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}
For each N_i, we need to estimate w_i, μ_i, σ_i. Suppose that we knew, for each x_j, that it belongs to one and only one of the k Gaussians. Then, learning the mixture would be a piece of cake. For each Gaussian N_i: estimate μ_i and σ_i based on the examples that belong to it, and set w_i equal to the fraction of examples that belong to N_i.

Learning a Mixture of Gaussians
Suppose we are given training data x_1, x_2, ..., x_n. A mixture M of k Gaussians is defined as:
M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}
For each N_i, we need to estimate w_i, μ_i, σ_i. However, we have no idea which Gaussian each x_j belongs to. If we knew μ_i and σ_i for each N_i, we could probabilistically assign each x_j to a component. "Probabilistically" means that we would not make a hard assignment, but we would partially assign x_j to different components, with each assignment weighted proportionally to the density value N_i(x_j).

Example of Partial Assignments
Using our previous example of a mixture: suppose x_j = 6.5. How do we assign 6.5 to the two Gaussians? N_1(6.5) = 0.0913 and N_2(6.5) = 0.3521. So:
6.5 belongs to N_1 by \frac{0.0913}{0.0913 + 0.3521} = 20.6%.
6.5 belongs to N_2 by \frac{0.3521}{0.0913 + 0.3521} = 79.4%.
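The arithmetic of this partial assignment, as a tiny sketch (the slide normalizes the two density values directly):

```python
n1, n2 = 0.0913, 0.3521   # N_1(6.5) and N_2(6.5), values taken from the slide
p1 = n1 / (n1 + n2)       # ~0.206, i.e., 6.5 belongs to N_1 by 20.6%
p2 = n2 / (n1 + n2)       # ~0.794, i.e., 6.5 belongs to N_2 by 79.4%
print(p1, p2)
```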

The Chicken-and-Egg Problem
To recap, fitting a mixture of Gaussians to data involves estimating, for each N_i, values w_i, μ_i, σ_i. If we could assign each x_j to one of the Gaussians, we could easily compute w_i, μ_i, σ_i. Even if we probabilistically assign x_j to multiple Gaussians, we can still easily compute w_i, μ_i, σ_i, by adapting our previous formulas. We will see the adapted formulas in a few slides. Conversely, if we knew μ_i, σ_i, and w_i, we could assign (at least probabilistically) the x_j's to Gaussians. So, this is a chicken-and-egg problem: if we knew one piece, we could compute the other. But we know neither. So, what do we do?

On Chicken-and-Egg Problems
Such chicken-and-egg problems occur frequently in AI. Surprisingly (at least to people new to AI), we can easily solve such chicken-and-egg problems. Overall, chicken-and-egg problems in AI look like this: we need to know A to estimate B, and we need to know B to compute A. There is a fairly standard recipe for solving these problems. Any guesses?

On Chicken-and-Egg Problems
Such chicken-and-egg problems occur frequently in AI. Surprisingly (at least to people new to AI), we can easily solve such chicken-and-egg problems. Overall, chicken-and-egg problems in AI look like this: we need to know A to estimate B, and we need to know B to compute A. There is a fairly standard recipe for solving these problems:
- Start by giving A values chosen randomly (or perhaps non-randomly, but still in an uninformed way, since we do not know the correct values).
- Repeat this loop:
  - Given our current values for A, estimate B.
  - Given our current values of B, estimate A.
  - If the new values of A and B are very close to the old values, break.

The EM Algorithm - Overview
We use this approach to fit mixtures of Gaussians to data. This algorithm, that fits mixtures of Gaussians to data, is called the EM algorithm (Expectation-Maximization algorithm). Remember, we choose k (the number of Gaussians in the mixture) manually, so we don't have to estimate that. To initialize the EM algorithm, we initialize each μ_i, σ_i, and w_i. Values w_i are set to 1/k. We can initialize μ_i and σ_i in different ways:
- Giving random values to each μ_i.
- Uniformly spacing the values given to each μ_i.
- Giving random values to each σ_i.
- Setting each σ_i to 1 initially.
Then, we iteratively perform two steps: the E-step and the M-step.

The E-Step
E-step: given our current estimates for μ_i, σ_i, and w_i, we compute, for each i and j, the probability p_{ij} = P(N_i | x_j): the probability that x_j was generated by Gaussian N_i. How? Using Bayes rule:
p_{ij} = P(N_i \mid x_j) = \frac{P(x_j \mid N_i) P(N_i)}{P(x_j)} = \frac{N_i(x_j)\, w_i}{P(x_j)}
where
N_i(x_j) = \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x_j-\mu_i)^2}{2\sigma_i^2}}, \quad P(x_j) = \sum_{i'=1}^{k} w_{i'} N_{i'}(x_j)
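A minimal NumPy sketch of the E-step, computing the full k-by-n matrix of probabilities p_{ij} (the array layout and function name are my own choices):

```python
import numpy as np

def e_step(x, weights, mus, sigmas):
    """Return a k-by-n array p with p[i, j] = P(N_i | x_j).
    Each column sums to 1, since the k Gaussians compete for each point."""
    x = np.asarray(x, dtype=float)
    # numerators[i, j] = w_i * N_i(x_j)
    numerators = np.array([
        w * np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
        for w, m, s in zip(weights, mus, sigmas)
    ])
    # Dividing by the column sums divides by P(x_j) = sum_i w_i * N_i(x_j).
    return numerators / numerators.sum(axis=0, keepdims=True)
```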

The M-Step: Updating μ_i and σ_i
M-step: given our current estimates of p_{ij}, for each i, j, we compute μ_i and σ_i for each N_i, as follows:
\mu_i = \frac{\sum_{j=1}^{n} p_{ij}\, x_j}{\sum_{j=1}^{n} p_{ij}}, \quad \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij}\,(x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}
To understand these formulas, it helps to compare them to the standard formulas for fitting a Gaussian to data:
\mu = \frac{1}{n} \sum_{j=1}^{n} x_j, \quad \sigma = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \mu)^2}

The M-Step: Updating μ_i and σ_i
\mu_i = \frac{\sum_{j=1}^{n} p_{ij}\, x_j}{\sum_{j=1}^{n} p_{ij}}, \quad \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij}\,(x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}
To understand these formulas, it helps to compare them to the standard formulas for fitting a Gaussian to data:
\mu = \frac{1}{n} \sum_{j=1}^{n} x_j, \quad \sigma = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \mu)^2}
Why do we take weighted averages at the M-step? Because each x_j is probabilistically assigned to multiple Gaussians. We use p_{ij} = P(N_i | x_j) as the weight of the assignment of x_j to N_i.

The M-Step: Updating w_i
At the M-step, in addition to updating μ_i and σ_i, we also need to update w_i, which is the weight of the i-th Gaussian in the mixture. The formula used for the update of w_i is:
w_i = \frac{\sum_{j=1}^{n} p_{ij}}{\sum_{i'=1}^{k} \sum_{j=1}^{n} p_{i'j}}
We sum up the weights of all objects for the i-th Gaussian, and divide that sum by the sum of weights of all objects for all Gaussians. The division ensures that \sum_{i=1}^{k} w_i = 1.
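A matching sketch of the M-step, updating all three sets of parameters from the responsibility matrix p produced by the E-step (names and array shapes are assumptions of this sketch):

```python
import numpy as np

def m_step(x, p):
    """x: length-n data array; p: k-by-n responsibility matrix from the E-step.
    Returns updated weights, means, and standard deviations."""
    x = np.asarray(x, dtype=float)
    totals = p.sum(axis=1)                    # sum_j p_ij, one total per Gaussian
    mus = (p * x).sum(axis=1) / totals        # weighted means
    sigmas = np.sqrt((p * (x - mus[:, None]) ** 2).sum(axis=1) / totals)
    # Since each column of p sums to 1, totals.sum() equals n, so this is totals / n.
    weights = totals / totals.sum()
    return weights, mus, sigmas
```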

The EM Steps: Summary
E-step: given current estimates for each μ_i, σ_i, and w_i, update p_{ij}:
p_{ij} = \frac{N_i(x_j)\, w_i}{P(x_j)}
M-step: given our current estimates for each p_{ij}, update μ_i, σ_i, and w_i:
\mu_i = \frac{\sum_{j=1}^{n} p_{ij}\, x_j}{\sum_{j=1}^{n} p_{ij}}, \quad \sigma_i = \sqrt{\frac{\sum_{j=1}^{n} p_{ij}\,(x_j - \mu_i)^2}{\sum_{j=1}^{n} p_{ij}}}, \quad w_i = \frac{\sum_{j=1}^{n} p_{ij}}{\sum_{i'=1}^{k} \sum_{j=1}^{n} p_{i'j}}

The EM Algorithm - Termination
The log likelihood of the training data is defined as:
L(x_1, \ldots, x_n) = \sum_{j=1}^{n} \log_2 M(x_j)
As a reminder, M is the Gaussian mixture, defined as:
M(x) = \sum_{i=1}^{k} w_i N_i(x) = \sum_{i=1}^{k} w_i \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}
One can prove that, after each iteration of the E-step and the M-step, this log likelihood increases or stays the same. We check how much the log likelihood changes at each iteration. When the change is below some threshold, we stop.

The EM Algorithm: Summary
Initialization: initialize each μ_i, σ_i, w_i, using your favorite approach (e.g., set each μ_i to a random value, set each σ_i to 1, and set each w_i equal to 1/k). last_log_likelihood = -infinity.
Main loop:
- E-step: given our current estimates for each μ_i, σ_i, and w_i, update each p_{ij}.
- M-step: given our current estimates for each p_{ij}, update each μ_i, σ_i, and w_i.
- log_likelihood = L(x_1, ..., x_n).
- if (log_likelihood - last_log_likelihood) < threshold, break.
- last_log_likelihood = log_likelihood.
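Putting the pieces together, here is a sketch of the full loop for 1D data, following the outline above (the initialization choices, the base-2 log likelihood, and the default threshold are just one reasonable reading of the slides, not a definitive implementation):

```python
import numpy as np

def component_densities(x, weights, mus, sigmas):
    """Return a k-by-n array with entry [i, j] equal to w_i * N_i(x_j)."""
    return np.array([
        w * np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
        for w, m, s in zip(weights, mus, sigmas)
    ])

def em_mixture_1d(x, k, threshold=1e-6, max_iters=1000, seed=None):
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: random means (drawn from the data), unit sigmas, uniform weights.
    mus = rng.choice(x, size=k, replace=False)
    sigmas = np.ones(k)
    weights = np.full(k, 1.0 / k)
    last_log_likelihood = -np.inf

    for _ in range(max_iters):
        # E-step: p[i, j] = P(N_i | x_j).
        dens = component_densities(x, weights, mus, sigmas)
        p = dens / dens.sum(axis=0, keepdims=True)

        # M-step: update weights, means, standard deviations.
        totals = p.sum(axis=1)
        mus = (p * x).sum(axis=1) / totals
        sigmas = np.sqrt((p * (x - mus[:, None]) ** 2).sum(axis=1) / totals)
        weights = totals / len(x)

        # Termination test on the log likelihood L = sum_j log2 M(x_j).
        log_likelihood = np.log2(
            component_densities(x, weights, mus, sigmas).sum(axis=0)).sum()
        if log_likelihood - last_log_likelihood < threshold:
            break
        last_log_likelihood = log_likelihood

    return weights, mus, sigmas
```

In practice one would also guard against a component's σ_i collapsing toward zero, but that detail is beyond what the slides cover.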

The EM Algorithm: Limitations
When we fit a Gaussian to data, we always get the same result. We can also prove that the result we get is the best possible result: there is no other Gaussian giving a higher log likelihood to the data than the one we compute as described in these slides. When we fit a mixture of Gaussians to the same data, do we always end up with the same result?

The EM Algorithm: Limitations
When we fit a Gaussian to data, we always get the same result. We can also prove that the result we get is the best possible result: there is no other Gaussian giving a higher log likelihood to the data than the one we compute as described in these slides. When we fit a mixture of Gaussians to the same data, we (sadly) do not always get the same result. The EM algorithm is a greedy algorithm: the result depends on the initialization values. We may have bad luck with the initial values, and end up with a bad fit. There is no good way to know if our result is good or bad, or if better results are possible.

Mixtures of Gaussians - Recap
Mixtures of Gaussians are widely used. Why? Because with the right parameters, they can fit various types of data very well. Actually, they can fit almost anything, as long as k is large enough (so that the mixture contains sufficiently many Gaussians). The EM algorithm is widely used to fit mixtures of Gaussians to data.

Multidimensional Gaussians
Instead of assuming that each dimension is independent, we can instead model the distribution using a multi-dimensional Gaussian:
N(x) = \frac{1}{\sqrt{(2\pi)^d \,|\Sigma|}} \exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)
To specify this Gaussian, we need to estimate the mean μ and the covariance matrix Σ.
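This density can be evaluated directly with NumPy; a minimal sketch is shown below (scipy.stats.multivariate_normal provides an equivalent, battle-tested implementation):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, cov):
    """Evaluate the d-dimensional Gaussian density at x, given mean mu and covariance cov."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    d = len(mu)
    diff = x - mu
    normalizer = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    # (x - mu)^T Sigma^{-1} (x - mu), via a linear solve instead of an explicit inverse.
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * mahalanobis_sq) / normalizer
```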

Multidimensional Gaussians - Mean
Let x_1, x_2, ..., x_n be d-dimensional vectors. x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}), where each x_{i,j} is a real number. Then, the mean μ = (μ_1, ..., μ_d) is computed as:
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
Therefore,
\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}

Multidimensional Gaussians - Covariance Matrix
Let x_1, x_2, ..., x_n be d-dimensional vectors. x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}), where each x_{i,j} is a real number. Let Σ be the covariance matrix. Its size is d × d. Let σ_{r,c} be the value of Σ at row r, column c:
\sigma_{r,c} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{j,r} - \mu_r)(x_{j,c} - \mu_c)
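Both the mean and the covariance matrix can be estimated in a couple of lines; a minimal sketch, assuming the training vectors sit in an n-by-d array:

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """X: n-by-d array of d-dimensional training vectors.
    Returns the mean vector (length d) and the d-by-d covariance matrix."""
    mu = X.mean(axis=0)                      # mu_j = (1/n) * sum_i x_{i,j}
    Sigma = np.cov(X, rowvar=False, ddof=1)  # sigma_{r,c} with the 1/(n-1) normalization
    return mu, Sigma
```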

Multidimensional Gaussians - Training
Let N be a d-dimensional Gaussian with mean μ and covariance matrix Σ. How many parameters do we need to specify N? The mean μ is defined by d numbers. The covariance matrix Σ requires d^2 numbers σ_{r,c}. Strictly speaking, Σ is symmetric, σ_{r,c} = σ_{c,r}, so we need roughly d^2/2 parameters. The number of parameters is quadratic in d. The number of training data we need for reliable estimation is also quadratic in d.

The Curse of Dimensionality
We will discuss this "curse" in several places in this course. Summary: dealing with high-dimensional data is a pain, and presents challenges that may be surprising to someone used to dealing with one, two, or three dimensions. One first example is in estimating Gaussian parameters. In one dimension, it is very simple: we estimate two parameters, μ and σ, and estimation can be pretty reliable with a few tens of examples. In d dimensions, we estimate O(d^2) parameters, and the number of training data we need is quadratic in the dimension.

The Curse of Dimensionality
For example: suppose we want to train a system to recognize the faces of Michael Jordan and Kobe Bryant. Assume each image is 100x100 pixels, and each pixel has three numbers: r, g, b. Thus, each image has 30,000 numbers. Suppose we model each class as a multi-dimensional Gaussian. Then, we need to estimate the parameters of a 30,000-dimensional Gaussian. We need roughly 450 million numbers for the covariance matrix, and we would need more than ten billion training images to have a reliable estimate. It is not realistic to expect to have such a large training set for learning how to recognize a single person.

The Curse of Dimensionality
The curse of dimensionality makes it (usually) impossible to estimate probability densities precisely in high-dimensional spaces: the number of training data that is needed is exponential in the number of dimensions. The curse of dimensionality also makes histogram-based probability estimation infeasible in high dimensions, since estimating a histogram still requires a number of training examples that is exponential in the dimension. Estimating a Gaussian requires a number of training examples that is "only" quadratic in the dimension. However, Gaussians may not be accurate fits for the actual distribution. Mixtures of Gaussians can often provide significantly better fits.