# Mixture Models and EM

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering of data coping with missing data segmentation? (... stay tuned) Readings: Chapter 16 in the Forsyth and Ponce. Matlab Tutorials: modelselectiontut.m (optional) 2503: Mixture Models and EM c D.J. Fleet & A.D. Jepson, 2005 Page: 1

2 Model Fitting: Density Estimation Let s say we want to model the distribution of grey levels d k d( x k ) at pixels, { x k } K, within some image region of interest. Non-parametric model: Compute a histogram. Parametric model: Fit an analytic density function to the data. For example, if we assume the samples were drawn from a Gaussian distribution, then we could fit a Gaussian density to the data by computing the sample mean and variance: µ = 1 d k, σ 2 = 1 (d k µ) 2 K K 1 k k Right plot shows a histogram of 150 IID samples drawn from the Gaussian density on the left (dashed). Overlaid is the estimated Gaussian model (solid). 2503: Mixture Models and EM Page: 2

3 Model Fitting: Multiple Data Modes When the data come from an image region with more than one dominant color, perhaps near an occlusion boundary, then a single Gaussian will not fit the data well: Missing Data: If the assignment of measurements to the two modes were known, then we could easily solve for the means and variances using sample statistics, as before, but only incorporating those data assigned to their respective models. Soft Assignments: But we don t know the assignments of pixels to the two Gaussians. So instead, let s infer them: Using Bayes rule, the probability that d k is owned (i.e., generated) by model M n is p(m n d k ) = p(d k M n ) p(m n ) p(d k ) 2503: Mixture Models and EM Page: 3

4 Ownership (example) Above we drew samples from two Gaussians in equal proportions, so p(m 1 ) = p(m 2 ) = 1 2, and p(d k M n ) = G(d k ; µ n, σ 2 n ) where G(d; µ, σ 2 ) is a Gaussian pdf with mean µ and variance σ 2 evaluated at d. And remember p(d k ) = n p(d k M n ) p(m n ). So, the ownerships, q n (d k ) p(m n d k ), then reduce to q 1 (d k ) = G(d k ; µ 1, σ 2 1 ) G(d k ; µ 1, σ 2 1 ) + G(d k; µ 2, σ 2 2 ), and q 2 (d k ) = 1 q 1 (d k ) For the 2-component density below: Then, the Gaussian parameters are given by weighted sample stats: µ n = 1 q n (d k )d k, σn 2 S = 1 q n (d k )(d k µ n ) 2, S n = q n (d k ) n S n k k k 2503: Mixture Models and EM Page: 4

5 Mixture Model Assume N processes, {M n } N n=1, each of which generates some data (or measurements). Each sample d from process M n is IID with density p n (d a n ), where a n denotes parameters for process M n. The proportion of the entire data set produced solely by M n is denoted m n = p(m n ) (it s called a mixing probability). Generative Process: First, randomly select one of the N processes according to the mixing probabilities, m (m 1,..., m N ). Then, given n, generate a sample from the observation density p n (d a n ). Mixture Model Likelihood: The probability of observing a datum d from the collection of N processes is given by their linear mixture: p(d M) = N n=1 m n p n (d a n ) The mixture model, M, comprises m, and the parameters, { a n } N n=1. Mixture Model Inference: Given K IID measurements (the data), {d k } K, our goal is to estimate the mixture model parameters. Remarks: One may also wish to estimate N and the parametric form of each component, but that s outside the scope of these notes. 2503: Mixture Models and EM Page: 5

6 Expectation-Maximization (EM) Algorithm EM is an iterative algorithm for parameter estimation, especially useful when one formulates the estimation problem in terms of observed and missing data. Observed data are the K intensities. Missing data are the assignments of observations to model components, z n (d k ) {0, 1}. Each EM iteration comprises an E-step and an M-step: E-Step: Compute the expected values of the missing data given the current model parameter estimate. For mixture models one can show this gives the ownership probability: E[z n (d k )] = q n (d k ). M-Step: Compute ML model parameters given observed data and the expected value of the missing data. For mixture models this yields a weighted regression problem for each model component: K q n (d k ) a n log p n (d k a n ) = 0. and the mixing probabilities are m n = 1 K K q n(d k ). Remarks: Each EM iteration can be shown to increase the likelihood of the observed data given the model parameters. EM converges to local maxima (not necessarily global maxima). An initial guess is required (e.g., random ownerships). 2503: Mixture Models and EM Page: 6

7 Derivation of EM for Mixture Models The mixture model likelihood function is given by: p ( {d k } K M ) = K p (d k M) = K N m n p n (d k a n ) n=1 where M ( m, { a n } n=1) N. The log likelihood is then given by L(M) = log p ( ( K N ) {d k } K M) = log m n p n (d k a n ) n=1 Our goal is to find extrema of the log likelihood function subject to the constraint that the mixing probabilities sum to 1. The constraint that n m n = 1 can be included with a Lagrange multiplier. Accordingly, the following conditions can be shown to hold at the extrema of the objective function: 1 K K q n (d k ) = m n and L a n = K q n (d k ) log p n (d k a n ) = 0. a n The first condition is easily derived from the derivative of the log likliehood with respect to m n, along with the Lagrange multiplier. The second condition is more involved as we show here, beginning with form of the derivative of the log likelihood with respect to the motion parameters for the m th component: ( L K N ) 1 = a N n n=1 m m n p n (d k a n ) n p n (d k a n ) a n n=1 K m n = N n=1 m p n (d k a n ) n p n (d k a n ) a n = K m n p n (d k a n ) N n=1 m n p n (d k a n ) a n log p n (d k a n ) The last step is an algebraic manipulation that uses the fact that log p(a) = 1 p(a). a p(a) a 2503: Mixture Models and EM Notes: 7

8 Derivation of EM for Mixture Models (cont) Notice that this equation can be greatly simplified because each term in the sum is really the product of the ownership probability q n (d k ) and the derivative of the component log likelihood. Therefore L a n = K q n (d k ) log p n (d k a n ) a n This is just a weighted log likelihood. In the case of a Gaussian component likelihood, p n (d k a n ), this is the derivative of a weighted least-squares error. Thus, setting L/ a n = 0 in the Gaussian case yields a weighted least-squares estimate for a n. 2503: Mixture Models and EM Notes: 8

9 Examples Example 1: Two distant modes. (We don t necessarily need EM here since hard assignments would be simple to determine, and reasonably efficient statistically.) Example 2: Two nearby modes. (Here, the soft ownerships are essential to the estimation of the mode locations and variances.) 2503: Mixture Models and EM Page: 9

10 More Examples Example 3: Nearby modes with uniformly distributed outliers. The model is a mixture of two Gaussians and a uniform outlier process. Example 4: Four modes and uniform noise present a challenge to EM. With only 1000 samples the model fit is reasonably good. 2503: Mixture Models and EM Page: 10

11 Mixture Models for Layered Optical Flow Layered motion is a natural domain for mixture models and the EM algorithm. Here, we believe there may be multiple motions present, but we don t know the motions, nor which pixels belong together (i.e., move coherently). Two key sources of multiple motions, even within small image neighbourhoods, are occlusion and transparency. Mixture models are also useful when the motion model doesn t provide a good approximation to the 2D motion within the region Example: The camera in the Pepsi sequence moves left-to-right. The depth discontinuities at the boundaries of the can produce motion discontinuities. For the pixels in the box in the left image, the right plot shows some of the motion constraint lines (in velocity space) : Mixture Models and EM Page: 11

12 Mixture Models for Optical Flow Formulation: Assume an image region contains two motions. Let the pixels in the region be { x k } K. Let s use parameterized motion models u j ( x; a j ) for j = 1, 2, where a j are the motion model parameters. (E.g., a is 2D for a translational model, and 6D for an affine motion model.) One gradient measurement per pixel, c k (f x ( x k ), f y ( x k ), f t ( x k )). Like gradient-based flow estimation, let f( x k ) u + f t ( x k ) be mean-zero Gaussian with variance σv 2, that is, p n ( c k x k, a n ) = G( f( x k ) u + f t ( x k ); 0, σ 2 v ) Let the fraction of measurements (pixels) owned by each of the two motions be denoted m 1 and m 2. Let m 0 denote the fraction of outlying measurements (i.e., constraints not consistent with either of the motions), and assume a uniform density for the outlier likelihood, denoted p 0. With three components, m 0 + m 1 + m 2 = : Mixture Models and EM Page: 12

13 Mixture Models for Optical Flow Mixture Model: The observation density for measurement c k is p( c k m, a 1, a 2 ) = m 0 p 0 + where m (m 0, m 1, m 2 ). 2 n=1 m n p n ( c k x k, a n ). Given K IID measurements { c k } K the joint likelihood is the product of individual component likelihoods: EM Algorithm: E Step: L( m, a 1, a 2 ) = K p( c k m, a 1, a 2 ). Infer the ownership probability, q n ( c k ), that constraint c k is owned by the n th mixture component. For the motion components of the mixture (n=1, 2), given m, a 1 and a 2, we have: q n ( c k ) = m n p n ( c k a n ) m 0 p 0 + m 1 p 1 ( c k x k, a 1 ) + m 2 p 2 ( c k x k, a 2 ). And of course the ownership probabilities sum to one so: q 0 ( c k ) = 1 q 1 ( c k ) q 2 ( c k ) M Step: Compute the maximum likelihood estimates of the mixing probabilities m and the flow field parameters, a 1 and a : Mixture Models and EM Page: 13

14 Mixing Probabilities: ML Parameter Estimation Given the ownership probabilities, the mixing probabilities are the fractions of the total ownership probability assigned to the respective components: 1 K q n ( c k ) = m n K Flow Estimation: Given the ownership probabilities, we can estimate the motion parameters for each component separately with a form of weighted, least-squares area-based regression. E.g., for 2D translation, where the a n u n, this amounts to the minimization of the weighted least-squares error K E( u n ) = q n ( c k ) d 2 ( u n, c k ) = K q n ( c k ) [ ] 2 f( xk, t) u n + f t ( x k, t), where f (f x, f y ) T (cf. iteratively reweighted LS for robust estimation). Convergence Behaviour: (of two motion estimates in velcocity space) 2503: Mixture Models and EM Page: 14

15 Representation for the Occlusion Example Model: translational flow, with 2 layers and an outlier process. Region at an occlusion boundary. Pixel Ownership: for Layer #2 Constraint Ownership: Layer #1 Layer #2 Outliers 2503: Mixture Models and EM Page: 15

16 Additional Test Patches #4 #2 #3 Test Patches: #1 Motion constraints for regions #1, #3 and #4: Outliers for regions #1, #3 and #4: 2503: Mixture Models and EM Page: 16

17 Further Readings Papers on mixture models and EM: A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society Series B, pp. 1 38, R. Neal, and G. Hinton. A view of the EM algorithm that justifies increemental, sparse and other variants. In Learning in Graphical Models, M. Jordan (ed.). Papers on mixture models for layered motion: S. Ayer and H. Sawhney. Compact Representations of Videos Through Dominant and Multiple Motion Estimation, IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(8): , A.D. Jepson and M. J. Black. Mixture models for optical flow computation. Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp , New York, June Y. Weiss and E.H. Adelson. A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models. IEEE Proc. Computer Vision and Pattern Recognition, San Francisco, pp , : Mixture Models and EM Notes: 17

### The Expectation Maximization Algorithm

The Expectation Maximization Algorithm Frank Dellaert College of Computing, Georgia Institute of Technology Technical Report number GIT-GVU-- February Abstract This note represents my attempt at explaining

### Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models.

Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.com This Lecture Expectation-Maximization (EM)

### Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

### Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

### Statistical Pattern Recognition

Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization

### Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:

### output dimension input dimension Gaussian evidence Gaussian Gaussian evidence evidence from t +1 inputs and outputs at time t x t+2 x t-1 x t+1

To appear in M. S. Kearns, S. A. Solla, D. A. Cohn, (eds.) Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 999. Learning Nonlinear Dynamical Systems using an EM Algorithm Zoubin

### EM Algorithm LECTURE OUTLINE

EM Algorithm Lukáš Cerman, Václav Hlaváč Czech Technical University, Faculty of Electrical Engineering Department of Cybernetics, Center for Machine Perception 121 35 Praha 2, Karlovo nám. 13, Czech Republic

### Linear Subspace Models

Linear Subspace Models Goal: Explore linear models of a data set. Motivation: A central question in vision concerns how we represent a collection of data vectors. The data vectors may be rasterized images,

### Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

### CS 4495 Computer Vision Principle Component Analysis

CS 4495 Computer Vision Principle Component Analysis (and it s use in Computer Vision) Aaron Bobick School of Interactive Computing Administrivia PS6 is out. Due *** Sunday, Nov 24th at 11:55pm *** PS7

### Bayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington

Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The

### Newscast EM. Abstract

Newscast EM Wojtek Kowalczyk Department of Computer Science Vrije Universiteit Amsterdam The Netherlands wojtek@cs.vu.nl Nikos Vlassis Informatics Institute University of Amsterdam The Netherlands vlassis@science.uva.nl

### Probability Theory for Machine Learning. Chris Cremer September 2015

Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares

### EXPECTATION- MAXIMIZATION THEORY

Chapter 3 EXPECTATION- MAXIMIZATION THEORY 3.1 Introduction Learning networks are commonly categorized in terms of supervised and unsupervised networks. In unsupervised learning, the training set consists

### Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

### Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

### Technical Details about the Expectation Maximization (EM) Algorithm

Technical Details about the Expectation Maximization (EM Algorithm Dawen Liang Columbia University dliang@ee.columbia.edu February 25, 2015 1 Introduction Maximum Lielihood Estimation (MLE is widely used

### Statistical Pattern Recognition

Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 203 http://ce.sharif.edu/courses/9-92/2/ce725-/ Agenda Expectation-maximization

### Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II

Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II Gatsby Unit University College London 27 Feb 2017 Outline Part I: Theory of ICA Definition and difference

### p L yi z n m x N n xi

y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### Data Preprocessing. Cluster Similarity

1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M

### COM336: Neural Computing

COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

### Support Vector Machines. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Support Vector Machines CAP 5610: Machine Learning Instructor: Guo-Jun QI 1 Linear Classifier Naive Bayes Assume each attribute is drawn from Gaussian distribution with the same variance Generative model:

### Convergence Rate of Expectation-Maximization

Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation

### Variational Principal Components

Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

### Clustering. Léon Bottou COS 424 3/4/2010. NEC Labs America

Clustering Léon Bottou NEC Labs America COS 424 3/4/2010 Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification, clustering, regression, other.

### Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Kernel Density Estimation, Factor Analysis Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 2: 2 late days to hand it in tonight. Assignment 3: Due Feburary

### CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

### Expectation Maximization (EM)

Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

### Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### Week 3: The EM algorithm

Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent

### Motion Estimation (I)

Motion Estimation (I) Ce Liu celiu@microsoft.com Microsoft Research New England We live in a moving world Perceiving, understanding and predicting motion is an important part of our daily lives Motion

### Expectation Maximization (EM)

Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are

### Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

### Kernel Density Topic Models: Visual Topics Without Visual Words

Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de

### Machine Learning 4771

Machine Learning 4771 Instructor: Tony Jebara Topic 11 Maximum Likelihood as Bayesian Inference Maximum A Posteriori Bayesian Gaussian Estimation Why Maximum Likelihood? So far, assumed max (log) likelihood

### Solution. Daozheng Chen. Challenge 1

Solution Daozheng Chen 1 For all the scatter plots and 2D histogram plots within this solution, the x axis is for the saturation component, and the y axis is the value component. Through out the solution,

### CS 231A Section 1: Linear Algebra & Probability Review

CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability

### MIXTURE MODELS AND EM

Last updated: November 6, 212 MIXTURE MODELS AND EM Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Simon Prince, University College London Sergios Theodoridis,

### Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

### Probability models for machine learning. Advanced topics ML4bio 2016 Alan Moses

Probability models for machine learning Advanced topics ML4bio 2016 Alan Moses What did we cover in this course so far? 4 major areas of machine learning: Clustering Dimensionality reduction Classification

### MACHINE LEARNING AND PATTERN RECOGNITION Fall 2006, Lecture 8: Latent Variables, EM Yann LeCun

Y. LeCun: Machine Learning and Pattern Recognition p. 1/? MACHINE LEARNING AND PATTERN RECOGNITION Fall 2006, Lecture 8: Latent Variables, EM Yann LeCun The Courant Institute, New York University http://yann.lecun.com

### CS 540: Machine Learning Lecture 1: Introduction

CS 540: Machine Learning Lecture 1: Introduction AD January 2008 AD () January 2008 1 / 41 Acknowledgments Thanks to Nando de Freitas Kevin Murphy AD () January 2008 2 / 41 Administrivia & Announcement

### ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:

### Clustering, K-Means, EM Tutorial

Clustering, K-Means, EM Tutorial Kamyar Ghasemipour Parts taken from Shikhar Sharma, Wenjie Luo, and Boris Ivanovic s tutorial slides, as well as lecture notes Organization: Clustering Motivation K-Means

### COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

### Motion Estimation (I) Ce Liu Microsoft Research New England

Motion Estimation (I) Ce Liu celiu@microsoft.com Microsoft Research New England We live in a moving world Perceiving, understanding and predicting motion is an important part of our daily lives Motion

### 1 Introduction Independent component analysis (ICA) [10] is a statistical technique whose main applications are blind source separation, blind deconvo

The Fixed-Point Algorithm and Maximum Likelihood Estimation for Independent Component Analysis Aapo Hyvarinen Helsinki University of Technology Laboratory of Computer and Information Science P.O.Box 5400,

### Statistical learning. Chapter 20, Sections 1 3 1

Statistical learning Chapter 20, Sections 1 3 Chapter 20, Sections 1 3 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### p(d θ ) l(θ ) 1.2 x x x

p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

### Today. Calculus. Linear Regression. Lagrange Multipliers

Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject

### Tracking. Readings: Chapter 17 of Forsyth and Ponce. Matlab Tutorials: motiontutorial.m. 2503: Tracking c D.J. Fleet & A.D. Jepson, 2009 Page: 1

Goal: Tracking Fundamentals of model-based tracking with emphasis on probabilistic formulations. Examples include the Kalman filter for linear-gaussian problems, and maximum likelihood and particle filters

### Image Analysis. Feature extraction: corners and blobs

Image Analysis Feature extraction: corners and blobs Christophoros Nikou cnikou@cs.uoi.gr Images taken from: Computer Vision course by Svetlana Lazebnik, University of North Carolina at Chapel Hill (http://www.cs.unc.edu/~lazebnik/spring10/).

### Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop

Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector

### An Adaptive Bayesian Network for Low-Level Image Processing

An Adaptive Bayesian Network for Low-Level Image Processing S P Luttrell Defence Research Agency, Malvern, Worcs, WR14 3PS, UK. I. INTRODUCTION Probability calculus, based on the axioms of inference, Cox

### Machine Learning: Probability Theory

Machine Learning: Probability Theory Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Theories p.1/28 Probabilities probabilistic statements subsume different effects

### THE blind image deconvolution (BID) problem is a difficult

2222 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 8, AUGUST 2004 A Variational Approach for Bayesian Blind Image Deconvolution Aristidis C. Likas, Senior Member, IEEE, and Nikolas P. Galatsanos,

### Parametric Techniques

Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### Lecture 8: Interest Point Detection. Saad J Bedros

#1 Lecture 8: Interest Point Detection Saad J Bedros sbedros@umn.edu Review of Edge Detectors #2 Today s Lecture Interest Points Detection What do we mean with Interest Point Detection in an Image Goal:

### Factor Analysis (10/2/13)

STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

### Robert Collins CSE586 CSE 586, Spring 2015 Computer Vision II

CSE 586, Spring 2015 Computer Vision II Hidden Markov Model and Kalman Filter Recall: Modeling Time Series State-Space Model: You have a Markov chain of latent (unobserved) states Each state generates

### Communications in Statistics - Simulation and Computation. Comparison of EM and SEM Algorithms in Poisson Regression Models: a simulation study

Comparison of EM and SEM Algorithms in Poisson Regression Models: a simulation study Journal: Manuscript ID: LSSP-00-0.R Manuscript Type: Original Paper Date Submitted by the Author: -May-0 Complete List

### Statistical learning. Chapter 20, Sections 1 3 1

Statistical learning Chapter 20, Sections 1 3 Chapter 20, Sections 1 3 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### SIFT: Scale Invariant Feature Transform

1 SIFT: Scale Invariant Feature Transform With slides from Sebastian Thrun Stanford CS223B Computer Vision, Winter 2006 3 Pattern Recognition Want to find in here SIFT Invariances: Scaling Rotation Illumination

### Uncertainty quantification and visualization for functional random variables

Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,

### Bayesian Machine Learning

Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

### COMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma

COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released

### bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o

Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale

### CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations

### Bayesian Networks Structure Learning (cont.)

Koller & Friedman Chapters (handed out): Chapter 11 (short) Chapter 1: 1.1, 1., 1.3 (covered in the beginning of semester) 1.4 (Learning parameters for BNs) Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic

### Training Neural Networks with Deficient Data

Training Neural Networks with Deficient Data Volker Tresp Siemens AG Central Research 81730 Munchen Germany tresp@zfe.siemens.de Subutai Ahmad Interval Research Corporation 1801-C Page Mill Rd. Palo Alto,

### Feature extraction: Corners and blobs

Feature extraction: Corners and blobs Review: Linear filtering and edge detection Name two different kinds of image noise Name a non-linear smoothing filter What advantages does median filtering have over

### The Minimum Message Length Principle for Inductive Inference

The Principle for Inductive Inference Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health University of Melbourne University of Helsinki, August 25,

### Robust Scale Estimation for the Generalized Gaussian Probability Density Function

Metodološki zvezki, Vol. 3, No. 1, 26, 21-37 Robust Scale Estimation for the Generalized Gaussian Probability Density Function Rozenn Dahyot 1 and Simon Wilson 2 Abstract This article proposes a robust

### ESTIMATING an unknown probability density function

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL 17, NO 6, JUNE 2008 897 Maximum-Entropy Expectation-Maximization Algorithm for Image Reconstruction and Sensor Field Estimation Hunsop Hong, Student Member, IEEE,

### Templates, Image Pyramids, and Filter Banks

Templates, Image Pyramids, and Filter Banks 09/9/ Computer Vision James Hays, Brown Slides: Hoiem and others Review. Match the spatial domain image to the Fourier magnitude image 2 3 4 5 B A C D E Slide:

### COMS 4771 Regression. Nakul Verma

COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the

### Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton

EM Algorithm Last lecture 1/35 General optimization problems Newton Raphson Fisher scoring Quasi Newton Nonlinear regression models Gauss-Newton Generalized linear models Iteratively reweighted least squares

### Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

### Improved Kalman Filter Initialisation using Neurofuzzy Estimation

Improved Kalman Filter Initialisation using Neurofuzzy Estimation J. M. Roberts, D. J. Mills, D. Charnley and C. J. Harris Introduction It is traditional to initialise Kalman filters and extended Kalman

### L11: Pattern recognition principles

L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

### Overlapping Astronomical Sources: Utilizing Spectral Information

Overlapping Astronomical Sources: Utilizing Spectral Information David Jones Advisor: Xiao-Li Meng Collaborators: Vinay Kashyap (CfA) and David van Dyk (Imperial College) CHASC Astrostatistics Group April

### Introduction to Graphical Models

Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

### Association studies and regression

Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration