Mixture of Gaussians Models

Outline Inference, Learning, and Maximum Likelihood Why Mixtures? Why Gaussians? Building up to the Mixture of Gaussians Single Gaussians Fully-Observed Mixtures Hidden Mixtures

Perception Involves Inference and Learning. Must infer the hidden causes, α, of sensory data, x. Sensory data: air pressure wave frequency composition, patterns of electromagnetic radiation. Hidden causes: proverbial tigers in bushes, lecture slides, sentences. Must learn the correct model for the relationship between hidden causes and sensory data. Models will be parameterized, with parameters θ. We will use quality of prediction as our figure of merit.

Generative models (diagram): inference maps sensory data to hidden causes; explanation or prediction maps hidden causes to data.

Maximum Likelihood and Maximum a Posteriori. The model parameters θ (or hidden causes α) that make the data most probable are called the maximum likelihood parameters (or causes). Finding the most probable causes is INFERENCE; finding the most probable parameters is LEARNING.
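
A compact way to write the two problems named on this slide, reconstructed from the standard definitions (the slide's own equations are not in the transcript):

```latex
% Inference: the most probable hidden causes, given data and parameters
\hat{\alpha} = \arg\max_{\alpha} \; p(x \mid \alpha, \theta)

% Learning (maximum likelihood): the parameters that make the data most probable
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \; p(x \mid \theta)

% Maximum a posteriori: the same objective, weighted by a prior over parameters
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; p(x \mid \theta)\, p(\theta)
```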

In practice, we maximize log-likelihoods. Taking logs doesn't change the answer. Logs turn multiplication into addition. Logs turn many natural operations on probabilities into linear algebra operations. Negative log probabilities arise naturally in information theory.
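
For instance, with i.i.d. datapoints x_1, …, x_N, the "multiplication into addition" point looks like this:

```latex
\log p(x_1, \dots, x_N \mid \theta)
  = \log \prod_{n=1}^{N} p(x_n \mid \theta)
  = \sum_{n=1}^{N} \log p(x_n \mid \theta)
```

Because the log is monotonic, the θ that maximizes the sum is the same θ that maximizes the product.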

The Maximum Likelihood Answer Depends on Model Class

Outline Inference, Learning, and Maximum Likelihood Why Mixtures? Why Gaussians? Building up to the Mixture of Gaussians Single Gaussians Fully-Observed Mixtures Hidden Mixtures

Why Mixtures?

What is a Mixture Model? The probability of the DATA is a weighted sum over components, each contributing a LIKELIHOOD times a PRIOR: p(x) = Σ_α p(x | α) p(α). This is precisely analogous to using a basis to approximate a vector.

Example Mixture Datasets: Mixtures of Uniforms, Spike-and-Gaussian Mixtures, Mixtures of Gaussians.

Why Gaussians?

Why Gaussians? (figure: Wikipedia)

Why Gaussians? An unhelpfully terse answer: Gaussians satisfy a particular differential equation. From this differential equation, all the properties of the Gaussian family can be derived without solving for the explicit form: Gaussians are isotropic, the Fourier transform of a Gaussian is a Gaussian, the sum of Gaussian RVs is Gaussian, the Central Limit Theorem. See this blogpost for details: http://bit.ly/gaussian-diff-eq
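
The differential equation itself does not survive in this transcript; for a standard (zero-mean, unit-variance) Gaussian it is presumably the first-order ODE below, which is an assumption based on the slide's description rather than a quote from it:

```latex
f'(x) = -x\, f(x)
\quad\Longrightarrow\quad
f(x) \propto e^{-x^{2}/2}
```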

Why Gaussians? Gaussians are everywhere, thanks to the Central Limit Theorem. Gaussians are the maximum entropy distribution with a given center (mean) and spread (std dev). Inference on Gaussians is linear algebra.

Central Limit Theorem. Statistics: adding up independent random variables with finite variances results in a Gaussian distribution. Science: if we assume that many small, independent random factors produce the noise in our results, we should see a Gaussian distribution.

Central Limit Theorem in Action

Central Limit Theorem in Action
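
The figures on these two "in Action" slides are not reproduced in the transcript. A minimal simulation in the same spirit, assuming NumPy and Matplotlib are available (the sample sizes are arbitrary illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Sum many independent, non-Gaussian random variables (here: uniforms on [-1, 1]).
n_terms = 30          # how many variables go into each sum
n_samples = 10_000    # how many sums we draw
sums = rng.uniform(-1.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)

# By the Central Limit Theorem, the histogram of the sums should look approximately
# Gaussian with mean 0 and variance n_terms * Var(U(-1, 1)) = n_terms / 3.
plt.hist(sums, bins=60, density=True, alpha=0.6, label="sums of uniforms")
xs = np.linspace(sums.min(), sums.max(), 200)
sigma = np.sqrt(n_terms / 3.0)
plt.plot(xs, np.exp(-xs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi)),
         label="Gaussian with matching variance")
plt.legend()
plt.show()
```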

Why Gaussians? Gaussians are everywhere, thanks to the Central Limit Theorem. Gaussians are the maximum entropy distribution with a given center (mean) and spread (std dev). Inference on Gaussians is linear algebra.

Gaussians are a natural MAXENT distribution. The principle of maximum entropy (MAXENT) will be covered in detail later. Teaser: MAXENT maps statistics of data to probability distributions in a principled, faithful manner. For the most common choice of statistic, mean ± s.d., the MAXENT distribution is a Gaussian.

Why Gaussians? Gaussians are everywhere, thanks to the Central Limit Theorem. Gaussians are the maximum entropy distribution with a given center (mean) and spread (std dev). Inference on Gaussians is linear algebra.

Inference with Gaussians is just linear algebra. The log-probabilities of a Gaussian are a negative-definite quadratic form. Quadratic forms can be mapped onto matrices. So solving an inference problem becomes solving a linear algebra problem. Linear algebra is the Scottie Pippen of mathematics.
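
Concretely, for a multivariate Gaussian (a standard identity, not an equation taken from the slide):

```latex
\log \mathcal{N}(x;\, \mu, \Sigma)
  = -\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)
    \;-\; \tfrac{1}{2}\log\det\!\left(2\pi\Sigma\right)
```

Maximizing this over x or μ amounts to solving linear systems involving Σ, which is why Gaussian inference reduces to linear algebra.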

Outline Inference, Learning, and Maximum Likelihood Why Mixtures? Why Gaussians? Building up to the Mixture of Gaussians Single Gaussians Fully-Observed Mixtures Hidden Mixtures

What is a Gaussian Mixture Model? A mixture model in which each component's LIKELIHOOD is a Gaussian, so the probability of the DATA is p(x) = Σ_α w_α N(x; μ_α, Σ_α), with the mixture weights w_α playing the role of the PRIOR over components. Example: (figure of sample data drawn from a Gaussian mixture, not reproduced here).
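
To make the generative story concrete, here is a minimal sketch of sampling from a two-component Gaussian mixture; the weights, means, and covariances are made-up illustrative values, not parameters from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up mixture parameters: weights w, means mu, covariances cov.
w = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
cov = np.array([np.eye(2), 0.5 * np.eye(2)])

# Generative process: draw the hidden component alpha from the prior w,
# then draw x from that component's Gaussian likelihood.
n = 500
alpha = rng.choice(len(w), size=n, p=w)
x = np.stack([rng.multivariate_normal(mu[a], cov[a]) for a in alpha])
print(x.shape, np.bincount(alpha))
```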

Maximum Likelihood for Gaussian Mixture Models. Plan of Attack: 1. ML for a single Gaussian, 2. ML for a fully-observed mixture, 3. ML for a hidden mixture.

Maximum Likelihood for a Single Gaussian

Maximum Likelihood for a Single Gaussian

Maximum Likelihood for a Single Gaussian By a similar argument:
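
The derivation on these slides is not in the transcript; the standard maximum-likelihood results it arrives at, for data x_1, …, x_N fit by a single Gaussian, are:

```latex
\hat{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n,
\qquad
\hat{\sigma}^{2}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} \left(x_n - \hat{\mu}_{\mathrm{ML}}\right)^{2}
```

Both follow from setting the derivative of the log-likelihood to zero; the "similar argument" gives the variance.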

Maximum Likelihood for Gaussian Mixture Models. Plan of Attack: 1. ML for a single Gaussian, 2. ML for a fully-observed mixture, 3. ML for a hidden mixture.

Maximum Likelihood for Fully-Observed Mixture. Observed mixture means we receive datapoints (x, α). Examples: classification (discrete), regression (continuous). (Figure: cats vs. dogs, from the Memming Wordpress Blog.)

Maximum Likelihood for Fully-Observed Mixture. For each mixture component, the problem is exactly the same: what are the parameters of a single Gaussian? Because we know which component each data point came from, we can solve all these problems separately, using the same method as for a single Gaussian. How do we figure out the mixture weights w?
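
The slide leaves the weights as a question; the standard fully-observed maximum-likelihood estimates (a reconstruction, not text from the transcript) are per-component Gaussian fits plus label frequencies:

```latex
N_\alpha = \sum_{n=1}^{N} \mathbb{1}[\alpha_n = \alpha],
\qquad
\hat{w}_\alpha = \frac{N_\alpha}{N},
\qquad
\hat{\mu}_\alpha = \frac{1}{N_\alpha}\sum_{n:\,\alpha_n = \alpha} x_n,
\qquad
\hat{\Sigma}_\alpha = \frac{1}{N_\alpha}\sum_{n:\,\alpha_n = \alpha} \left(x_n - \hat{\mu}_\alpha\right)\left(x_n - \hat{\mu}_\alpha\right)^{\top}
```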

Bonus: We Can Now Classify Unlabeled Datapoints. We can label new datapoints x with a corresponding α using our model. This is the key idea behind supervised learning approaches in general. How do we label them? Maximum likelihood method: find the closest mean (in z-score units); that's our label. Fully Bayesian method: maintain a distribution over the labels, p(α | x; θ). (Figure: dogs vs. cats, from the Memming Wordpress Blog.)
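
A minimal sketch of the "closest mean in z-score units" rule for one-dimensional components (assuming per-component means mu and standard deviations sigma have already been fit; for full covariances this generalizes to the Mahalanobis distance):

```python
import numpy as np

def ml_label(x, mu, sigma):
    """Assign each point in x to the component whose mean is closest in z-score units."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # z-score distance of each point to each component mean: |x - mu_k| / sigma_k
    z = np.abs(x[:, None] - mu[None, :]) / sigma[None, :]
    return np.argmin(z, axis=1)

# Toy usage with made-up component parameters.
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 2.0])
print(ml_label([0.3, 4.1, 2.4], mu, sigma))   # -> [0 1 1]
```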

Maximum Likelihood for Gaussian Mixture Models. Plan of Attack: 1. ML for a single Gaussian, 2. ML for a fully-observed mixture, 3. ML for a hidden mixture.

Hidden Variables Example: Spike Sorting. (Figure: Martinez, Quian Quiroga, et al., 2009, Journal of Neuroscience Methods.)

Maximum Likelihood for Models with Hidden Variables. p(x | μ, σ, α) is the same, but now we don't have the labels α. Problem: if we had the labels, we could find the parameters (just as before), and if we had the parameters, we could compute the labels (again, just as before). It's a chicken-and-egg problem! Solution: let's just make-believe we have the parameters.

Our Clustering Algorithm on Spike Sorting (figure: Wikipedia)

Our Clustering Algorithm on Spike Sorting (figure: Wikipedia)

The K-Means Algorithm: 1. Make up K values for the means of the clusters (usually initialized randomly). 2. Assign datapoints to clusters (each datapoint is assigned to the cluster with the nearest mean). 3. Update the cluster means to the new empirical means. 4. Repeat steps 2-3.
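
A minimal NumPy sketch of these four steps (the data X and number of clusters K are assumed inputs; the convergence check is deliberately simple):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: hard-assign points to the nearest mean, then update the means."""
    rng = np.random.default_rng(seed)
    # 1. Make up K values for the means (here: K randomly chosen datapoints).
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each datapoint to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 3. Update each cluster mean to the empirical mean of its assigned points.
        new_means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else means[k] for k in range(K)])
        # 4. Repeat until the means stop moving.
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

# Toy usage on made-up 2-D data.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
print(kmeans(X, K=2)[0])
```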

Issues with K-Means: 1. The cluster assignment step (inference) is not Bayesian. 2. Small changes in the data can cause big changes in behavior.

Soft Clustering? (figure: Bishop, PRML)

Expectation-Maximization for Means: 1. Make up K values for the means. 2. (E) Infer p(α | x) for each x and α. 3. (M) Update the means via a weighted average, weighting the contribution of datapoint x by p(α | x). 4. Repeat steps 2-3.

Full Expectation-Maximization: 1. Make up K values for the means, covariances, and mixture weights. 2. (E) Infer p(α | x) for each x and α. 3. (M) Update the parameters with weighted averages, weighting the contribution of datapoint x by p(α | x). 4. Repeat steps 2-3.
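
A compact sketch of this loop for a Gaussian mixture with full covariances, assuming SciPy's multivariate_normal for the component densities (initialization and stopping are simplified):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Minimal EM for a K-component Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Make up initial means, covariances, and mixture weights.
    means = X[rng.choice(N, size=K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # 2. E-step: responsibilities r[n, k] = p(alpha = k | x_n).
        dens = np.stack([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                         for k in range(K)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: responsibility-weighted updates of all parameters.
        Nk = r.sum(axis=0)
        weights = Nk / N
        means = (r.T @ X) / Nk[:, None]
        covs = np.array([((r[:, k, None] * (X - means[k])).T @ (X - means[k])) / Nk[k]
                         + 1e-6 * np.eye(D) for k in range(K)])
    return weights, means, covs

# Toy usage on made-up 2-D data.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
print(em_gmm(X, K=2)[0])
```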

E-Step: Bayes Rule for Inference
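
The equation on this slide is not in the transcript; the Bayes-rule computation the title refers to is the standard responsibility:

```latex
p(\alpha \mid x;\, \theta)
  = \frac{p(x \mid \alpha;\, \theta)\, p(\alpha)}
         {\sum_{\alpha'} p(x \mid \alpha';\, \theta)\, p(\alpha')}
  = \frac{w_\alpha\, \mathcal{N}(x;\, \mu_\alpha, \Sigma_\alpha)}
         {\sum_{\alpha'} w_{\alpha'}\, \mathcal{N}(x;\, \mu_{\alpha'}, \Sigma_{\alpha'})}
```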

M-Step: Direct Maximization
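
Likewise, the direct maximization here presumably yields the familiar responsibility-weighted updates (writing r_{nα} = p(α | x_n; θ) from the E-step; these formulas are a reconstruction, not text from the slide):

```latex
N_\alpha = \sum_{n} r_{n\alpha},
\qquad
\hat{w}_\alpha = \frac{N_\alpha}{N},
\qquad
\hat{\mu}_\alpha = \frac{1}{N_\alpha}\sum_{n} r_{n\alpha}\, x_n,
\qquad
\hat{\Sigma}_\alpha = \frac{1}{N_\alpha}\sum_{n} r_{n\alpha}\left(x_n - \hat{\mu}_\alpha\right)\left(x_n - \hat{\mu}_\alpha\right)^{\top}
```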

Bonus Slides: Information Geometry

A Geometric View (diagram): the space of all probability distributions, with the family of model distributions as a subset.

Geometry of MLE (diagram): the space of all probability distributions and the family of model distributions (here, Gaussians with different means).

Geometry of EM (diagram): the space of all probability distributions, containing the family of data labelings (same p(x), different p(α | x)) and the family of model distributions.

(Diagram) EM alternates between the two families: the E-step (inference) updates within the family of data labelings, and the M-step (learning) updates within the family of model distributions.