Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf. 2014-15


We know that X ~ B(n, p), but we do not know p. We get a random sample from X, a random number m. $\Pr(X = m \mid X \sim B(n, p)) = \binom{n}{m} p^m (1-p)^{n-m}$

We know that X ~ B(n, p), but we do not know p. We get a random sample from X, a random number m. The likelihood is defined as: $L(p; X = m) = \Pr(X = m \mid X \sim B(n, p))$

The Likelihood Function. Assume we have a set of hypotheses to choose from. Normally a hypothesis will be defined by a set of parameters $\theta$. We do not know $\theta$, but we make some observations and get data D. The likelihood of $\theta$ is $L(\theta; D) = \Pr(D \mid \theta)$. We are interested in the hypothesis that maximizes the likelihood.

Example. We know that X ~ B(n, p), but we do not know p. We get a random sample from X, a random number m. In this case, the data D is the number m, and the parameter $\theta$ is p. The likelihood is $L(p; X = m) = \Pr(X = m \mid X \sim B(n, p)) = \binom{n}{m} p^m (1-p)^{n-m}$

Maximum Likelihood Estimate. The maximum likelihood estimate is $\hat\theta = \arg\max_\theta L(\theta; D)$. In the example above, the maximum is obtained for $\hat{p} = \frac{m}{n}$.
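To make the binomial example concrete, here is a minimal sketch (the values n = 10 and m = 7 are hypothetical, not from the slides) that evaluates the likelihood on a grid of p values and confirms numerically that it peaks at m/n:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical example: n coin flips, m of them successes.
n, m = 10, 7

# Evaluate the binomial likelihood L(p; X = m) on a grid of p values.
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = binom.pmf(m, n, p_grid)

p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)   # ~0.7
print(m / n)   # the closed-form MLE, m/n
```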

Reminder: The Normal Distribution

Reminder: The Normal Distribution. We obtain a set of n independent samples $x_1, \dots, x_n \sim N(\mu, \sigma^2)$. We want to estimate the model parameters $\mu$ and $\sigma^2$.

Maximum Likelihood Estimate (MLE)
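The estimators themselves appeared as formulas on the slides; as a hedged illustration, the standard closed-form MLEs for a normal sample (the sample mean and the average squared deviation) can be checked on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic samples, true mu = 2, sigma = 1.5

mu_hat = x.mean()                          # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()    # MLE of the variance (divides by n, not n-1)
print(mu_hat, np.sqrt(sigma2_hat))
```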

Example. $X_1, \dots, X_n \sim U(0, \theta)$. What is the maximum likelihood?

Example. $X_1, \dots, X_n \sim U(0, \theta)$. What is the maximum likelihood? Assume $X_{(1)} < \dots < X_{(n)}$. For $\theta < X_{(n)}$, $L(\theta; D) = 0$. For $\theta \ge X_{(n)}$, $L(\theta; D) = \frac{1}{\theta^n}$. Maximum likelihood: $\hat\theta = X_{(n)}$.

Example: MLE of a Multinomial. We are given a universe of possible strings (e.g., words of a language): $h_1, \dots, h_t \in \{0,1\}^k$. Assume a model by which the strings are generated from a multinomial with (unknown) probabilities $p_1, \dots, p_t$. We are given a sample from the multinomial with counts $c_1, \dots, c_t$.

Generative Model
Unknown: a multinomial over the strings 01000010, 11111111, 00001111, 00000000 with probabilities $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/8$, $p_4 = 1/8$.
Observed sample: 11111111, 00001111, 01000010, 11111111, 00000000, 11111111, 01000010.
GOAL: estimate the probabilities from the sample.

Generative Model
Unknown: a multinomial over the strings 01000010, 11111111, 00001111, 00000000 with probabilities $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/8$, $p_4 = 1/8$.
Observed sample: 11111111, 00001111, 01000010, 11111111, 00000000, 11111111, 01000010.
Counts: $c_1 = 2$, $c_2 = 3$, $c_3 = 1$, $c_4 = 1$.
GOAL: estimate the probabilities from the sample.

MLE of a Multinomial. Strings: $h_1, \dots, h_t \in \{0,1\}^k$. Counts: $c_1, \dots, c_t$.
$L(p_1, \dots, p_t; c_1, \dots, c_t) = \binom{n}{c_1}\binom{n - c_1}{c_2}\cdots\binom{n - c_1 - \dots - c_{t-1}}{c_t}\, p_1^{c_1} p_2^{c_2} \cdots p_t^{c_t}$
Maximize $\sum_i c_i \log(p_i)$ subject to $\sum_i p_i = 1$, $p_i > 0$.

Using Lagrange Multipliers. We are interested in maximizing $\sum_i c_i \log(p_i)$ subject to $\sum_i p_i = 1$, $p_i > 0$. Instead, we will consider the Lagrange function: $\max \sum_i c_i \log(p_i) + \lambda\left(1 - \sum_i p_i\right)$, s.t. $p_i > 0$. An optimal solution of the original problem corresponds to a stationary point of the Lagrange function.

Using Lagrange Multipliers
$f(p, \lambda) = \sum_i c_i \log(p_i) + \lambda\left(1 - \sum_i p_i\right)$
Compute the gradient: $\frac{\partial f}{\partial p_i} = \frac{c_i}{p_i} - \lambda$, $\frac{\partial f}{\partial \lambda} = 1 - \sum_i p_i$
Equating to zero: $p_i = \frac{c_i}{\lambda}$, $\lambda = \sum_i c_i = n$, so $\hat p_i = \frac{c_i}{n}$.
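As a small sanity check (a sketch reusing the toy counts c = (2, 3, 1, 1) from the generative-model slide), the closed-form MLE $\hat p_i = c_i / n$ can be compared against a direct numerical maximization of the constrained log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

c = np.array([2.0, 3.0, 1.0, 1.0])   # toy counts from the example
n = c.sum()

print(c / n)                          # closed-form MLE: [0.286, 0.429, 0.143, 0.143]

# Numerical check: maximize sum_i c_i log(p_i) subject to sum_i p_i = 1, p_i > 0.
neg_loglik = lambda p: -np.sum(c * np.log(p))
res = minimize(neg_loglik, x0=np.full(4, 0.25),
               bounds=[(1e-9, 1.0)] * 4,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
print(res.x)                          # matches c / n up to numerical tolerance
```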

Bayesian Estimators. Maximum likelihood: Advantage: no assumptions are made on the model distribution. Disadvantage: in reality we are looking for $\max_\theta \Pr(\theta \mid D)$ rather than $\max_\theta \Pr(D \mid \theta)$. Is it well defined?

Prior and Posterior. Sometimes we know something about the PRIOR distribution $\Pr(\theta)$. Then, based on Bayes' rule, we can calculate the POSTERIOR distribution: $\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta)\Pr(\theta)}{\Pr(D)}$

MAP (Maximum a posteriori). Maximum a posteriori estimation (MAP) is the mode of the posterior distribution: $\hat\theta_{MAP} = \arg\max_\theta \Pr(\theta \mid D)$, whereas $\hat\theta_{ML} = \arg\max_\theta \Pr(D \mid \theta)$.

MAP (Maximum a posteriori). Maximum a posteriori estimation (MAP) is the mode of the posterior distribution: $\hat\theta_{MAP} = \arg\max_\theta \Pr(\theta \mid D)$, whereas $\hat\theta_{ML} = \arg\max_\theta \Pr(D \mid \theta)$. Equivalently, $\hat\theta_{MAP} = \arg\max_\theta \Pr(D \mid \theta)\Pr(\theta)$.

Example. Assume $x_1, \dots, x_n \sim N(\mu, 1)$. Then $\hat\mu_{ML} = \frac{\sum_{i=1}^n x_i}{n}$.

Normal Prior. Assume the prior $\mu \sim N(0, 1)$.
$\log \Pr(x_1, \dots, x_n \mid \mu) = -\frac{n}{2}\log(2\pi) - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2}$
$\log \Pr(\mu) = -\frac{1}{2}\log(2\pi) - \frac{\mu^2}{2}$
$\hat\mu_{MAP} = \arg\max_\mu \left\{ -\mu^2 - \sum_{i=1}^n (x_i - \mu)^2 \right\}$

Normal Prior. Assume the prior $\mu \sim N(0, 1)$.
$\hat\mu_{MAP} = \arg\max_\mu \left\{ -\mu^2 - \sum_{i=1}^n (x_i - \mu)^2 \right\}$
$\hat\mu_{MAP} = \frac{\sum_{i=1}^n x_i}{n + 1}$, while $\hat\mu_{ML} = \frac{\sum_{i=1}^n x_i}{n}$
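A minimal sketch (with hypothetical data drawn from N(2, 1)) comparing the two estimators; it shows the MAP estimate shrinking toward the prior mean 0:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=5)   # small hypothetical sample, true mean 2

mu_ml = x.sum() / len(x)            # maximum likelihood estimate
mu_map = x.sum() / (len(x) + 1)     # MAP estimate under the N(0, 1) prior
print(mu_ml, mu_map)                # the MAP estimate is pulled toward 0
```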

Posterior of a Normal Prior. Assume the prior $\mu \sim N(0, 1)$. Then
$\Pr(\mu \mid x_1, \dots, x_n) \propto \exp\!\left(-\frac{\left(\mu - \frac{\sum_{i=1}^n x_i}{n+1}\right)^2}{\frac{2}{n+1}}\right)$, i.e., $\mu \mid x_1, \dots, x_n \sim N\!\left(\frac{\sum_{i=1}^n x_i}{n + 1},\ \frac{1}{n + 1}\right)$

Choosing a prior for B(n, p). $X \sim B(n, p)$. One sample: $X = m$. Then $\hat p_{ML} = \frac{m}{n}$.

The Beta Distribution. $X \sim \mathrm{Beta}(\alpha, \beta)$, $\alpha > 0$, $\beta > 0$. $f(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$, $\mu = E[X] = \frac{\alpha}{\alpha + \beta}$

Posterior with a Beta Prior. $X \sim B(n, p)$. Assume the prior $p \sim \mathrm{Beta}(\alpha, \beta)$.
$\Pr(p \mid X = m, \alpha, \beta) \propto \binom{n}{m} p^m (1-p)^{n-m} \cdot \frac{p^{\alpha - 1}(1-p)^{\beta - 1}}{B(\alpha, \beta)}$
$\Pr(p \mid X = m, \alpha, \beta) \propto p^{m + \alpha - 1}(1-p)^{n - m + \beta - 1}$

Posterior with a Beta Prior.
$\Pr(p \mid X = m, \alpha, \beta) \propto p^{m + \alpha - 1}(1-p)^{n - m + \beta - 1}$
$p \mid X = m, \alpha, \beta \sim \mathrm{Beta}(m + \alpha,\ n - m + \beta)$
$\hat p_{MAP} = \frac{m + \alpha - 1}{n + \alpha + \beta - 2}$
If the prior distribution is Beta, then the posterior distribution is Beta as well: a conjugate prior.
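A short sketch of the conjugate update (the values n = 10, m = 7 and the Beta(2, 2) prior are hypothetical) comparing the ML estimate, the MAP estimate, and the posterior mean:

```python
from scipy.stats import beta

n, m = 10, 7            # hypothetical binomial experiment
a, b = 2.0, 2.0         # hypothetical Beta prior parameters

post = beta(m + a, n - m + b)            # posterior is Beta(m + alpha, n - m + beta)
p_ml = m / n                             # maximum likelihood estimate
p_map = (m + a - 1) / (n + a + b - 2)    # posterior mode (MAP)
print(p_ml, p_map, post.mean())
```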

Posterior with a Uniform Prior. $X \sim B(n, p)$. Assume the prior $p \sim U(0, 1)$. Then $\Pr(p \mid X = m) \propto p^m (1-p)^{n-m}$, so $p \mid X = m \sim \mathrm{Beta}(m + 1,\ n - m + 1)$.

Posterior with a Uniform Prior. $X \sim B(n, p)$. Assume the prior $p \sim U(0, 1)$, so $p \mid X = m \sim \mathrm{Beta}(m + 1,\ n - m + 1)$. Then $\hat p_{MAP} = \frac{m}{n}$, while $E[p \mid X = m] = \frac{m + 1}{n + 2}$.

Classification (Naïve Bayes). Data: cholesterol levels $x_1, \dots, x_7$ with heart-attack (HA) labels 1, 1, 0, 1, 0, 0, 0. Given a new individual, can we predict whether the individual will get a heart attack based on his cholesterol level?

Classification (Naïve Bayes). Data: cholesterol levels $x_1, \dots, x_7$ with heart-attack (HA) labels 1, 1, 0, 1, 0, 0, 0. Given a new individual, can we predict whether the individual will get a heart attack based on his cholesterol level? Assumption: cholesterol levels are normally distributed with a different mean in the HA = 1 and HA = 0 sets: $\Pr(x \mid HA = 1) \sim N(\mu_1, \sigma^2)$, $\Pr(x \mid HA = 0) \sim N(\mu_0, \sigma^2)$. These parameters can be estimated using MLE.

Classification (Naïve Bayes). Decision rule: predict HA = 1 if $\Pr(HA = 1 \mid x) > \Pr(HA = 0 \mid x)$, which by Bayes' rule amounts to comparing $\Pr(x \mid HA = 1)\Pr(HA = 1)$ with $\Pr(x \mid HA = 0)\Pr(HA = 0)$.
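A hedged sketch of this classifier: the cholesterol values below are made up (the slide table lists only the labels), the class-conditional Gaussians are fitted by MLE with a shared variance, and the Bayes decision rule above is applied:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical cholesterol levels for x_1..x_7; the labels are from the slide.
x = np.array([230.0, 250.0, 180.0, 240.0, 170.0, 190.0, 175.0])
y = np.array([1, 1, 0, 1, 0, 0, 0])

# MLE of the class-conditional Gaussians and the class prior.
mu1, mu0 = x[y == 1].mean(), x[y == 0].mean()
resid = np.where(y == 1, x - mu1, x - mu0)
sigma = np.sqrt((resid ** 2).mean())   # pooled (shared) standard deviation
prior1 = (y == 1).mean()

def predict(x_new):
    # Bayes decision rule: compare Pr(x | HA=1) Pr(HA=1) with Pr(x | HA=0) Pr(HA=0).
    s1 = norm.pdf(x_new, mu1, sigma) * prior1
    s0 = norm.pdf(x_new, mu0, sigma) * (1 - prior1)
    return int(s1 > s0)

print(predict(235.0), predict(165.0))   # -> 1, 0
```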

Multiple Variables
x_1   x_2   ...   x_n   y
195   17    ...   117   1
195   24    ...   114   1
184   13    ...   117   0
250   22    ...   111   1
173   15    ...   108   0
185   18    ...   145   0
178   22    ...   136   0
Assumptions: 1. Normal marginal distributions. 2. Variables are independent.

Multiple Variables

Multiple Variables

Naïve Bayes. A naïve assumption, but it often works in practice. Interpretation: a weighted sum of evidence. Allows the incorporation of features with different distributions. Requires only small amounts of data.

Naïve Bayes Might Break
[Two scatter plots: for y = 1 the two variables are independent; for y = 0, x_2 = x_1.]

The Multivariate Normal Distribution
[Example: scatter plot of samples drawn from a two-dimensional multivariate normal distribution.]

The Multivariate Normal Distribution. Notation: the variance-covariance matrix is $\Sigma$, with $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$. If we do not use Naïve Bayes, we need to estimate $O(k^2)$ parameters.
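A brief sketch on synthetic data contrasting the full covariance estimate, which has O(k²) free parameters, with the diagonal (Naïve Bayes style) estimate, which has only O(k):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5
A = rng.normal(size=(k, k))
data = rng.multivariate_normal(mean=np.zeros(k), cov=A @ A.T, size=500)

mu_hat = data.mean(axis=0)
cov_full = np.cov(data, rowvar=False)    # full variance-covariance matrix
cov_diag = np.diag(data.var(axis=0))     # diagonal estimate (independence assumption)
print(cov_full.shape)                    # (5, 5): O(k^2) entries
print(k * (k + 1) // 2, "vs", k)         # free parameters: full vs diagonal
```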

Reminder: K-means objective. Given: vectors $x_1, \dots, x_n$ and a number K. Objective: partition the points into K clusters $S_1, \dots, S_K$ so as to minimize the sum of squared distances of the points to their cluster means.

K-Means: A Likelihood Formulation. There are unknown clusters $S_1, \dots, S_k$. The points in $S_i$ are normally distributed around an unknown cluster mean $\mu_i$. Each point $x_i$ originates from a cluster $c_i$.

Mixture of Gaussians. There are unknown clusters $S_1, \dots, S_k$. The points in $S_i$ are normally distributed. Each point $x_i$ originates from cluster $c_i$ with probability $p_{c_i}$.

[Scatter plot: two-dimensional data generated from a mixture of Gaussians.]

In one dimension. There are unknown clusters $S_1, \dots, S_k$. The points in $S_i$ are distributed $N(\mu_i, \sigma_i^2)$. Each point $x_i$ originates from cluster $c_i$ with probability $p_{c_i}$.

For every i, we choose:

The Expectation-Maximization Algorithm Start with a guess: In each iteration t+1 set:

The Expectation-Maximization Algorithm By construction:

The Expectation-Maximization Algorithm Conclusion: The likelihood is non-decreasing in each iteration. Stopping rule: When the likelihood flattens.

Expectation Maximization (EM). D: given data. $\theta$: parameters that need to be estimated. Z: missing (latent) variables. 1. E-step: $Q(\theta \mid \theta_t) = E_{Z \mid D, \theta_t}[\log \Pr(D, Z \mid \theta)]$. 2. M-step: $\theta_{t+1} := \arg\max_\theta Q(\theta \mid \theta_t)$.

EM - Comments. No guarantee of convergence to a local maximum. No guarantee on running times; often it takes many iterations to converge. Efficiency: no matrix inversion is needed (e.g., as in Newton's method). Generalized EM: no need to find the maximum in the M-step. Easy to implement. Numerical stability. Monotone: it is easy to ensure correctness in EM. Interpretation: it provides an interpretation for the latent variables.

Reminder: Mixture of Gaussians. There are unknown clusters $S_1, \dots, S_k$. The points in $S_i$ are normally distributed. Each point $x_i$ originates from cluster $c_i$ with probability $p_{c_i}$. Variant: each $S_i$ has its own distribution parameters.

Mixture of Gaussians - EM E-step:

EM for Mixture of Gaussians. E-step:

EM for Mixture of Gaussians. M-step: Relation to K-Means: The E-step assigns each point to a cluster. The M-step finds the new cluster centers.
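The E-step and M-step formulas on these slides were rendered as images; the following is a minimal one-dimensional sketch of the updates they describe (responsibilities in the E-step; weighted means, variances, and mixing weights in the M-step) on a synthetic two-component data set:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])  # synthetic 1-D mixture

k = 2
p = np.full(k, 1 / k)         # mixing weights
mu = rng.choice(x, k)         # initial means
sigma = np.full(k, x.std())   # initial standard deviations

for _ in range(50):
    # E-step: responsibility of cluster j for point i.
    r = p * norm.pdf(x[:, None], mu, sigma)     # shape (n, k)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances from the responsibilities.
    nk = r.sum(axis=0)
    p = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(p, mu, sigma)   # should be close to the generating parameters
```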

Hidden Coin. There are two coins with probabilities of Heads $p_1, p_2$. In each step, with probability $\lambda$ we flip the first coin, and with probability $1 - \lambda$ we flip the second coin. We observe the series of Heads: $x = (0, 1, 1, 0, 1, \dots)$. Parameters: $\lambda, p_1, p_2$.

Hidden Coin

Hidden Coin E-step: M-step:

Hidden Coin. Consider the observed sequence (1, 1, 0, 0). Start with an initial guess. Exercise: show that the EM is stuck and that we are at a saddle point.
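The E-step and M-step for this model were given as formulas on the slides; the sketch below follows the standard two-component Bernoulli-mixture EM and uses an assumed symmetric initial guess (λ = 0.5, p_1 = p_2 = 0.5, since the slide's initial values were in a figure). It illustrates the exercise: from this symmetric point, EM does not move.

```python
import numpy as np

x = np.array([1, 1, 0, 0])    # observed flips from the slide
lam, p1, p2 = 0.5, 0.5, 0.5   # assumed symmetric initial guess

for _ in range(20):
    # E-step: posterior probability that each flip came from coin 1.
    w1 = lam * p1 ** x * (1 - p1) ** (1 - x)
    w2 = (1 - lam) * p2 ** x * (1 - p2) ** (1 - x)
    g = w1 / (w1 + w2)
    # M-step: re-estimate lambda, p1, p2 from the soft assignments.
    lam = g.mean()
    p1 = (g * x).sum() / g.sum()
    p2 = ((1 - g) * x).sum() / (1 - g).sum()

print(lam, p1, p2)   # stays at (0.5, 0.5, 0.5): EM does not move from this point
```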

Multinomial Revisited. We are given a universe of possible strings (e.g., words of a language): $h_1, \dots, h_t \in \{0,1\}^k$. Assume a model by which the strings are generated from a multinomial with (unknown) probabilities $p_1, \dots, p_t$. We are given a sample from the multinomial with counts $c_1, \dots, c_t$.

Ambiguous Multinomial. We are given a universe of possible strings (e.g., words of a language). Assume a model by which the strings are generated from a multinomial with (unknown) probabilities $p_1, \dots, p_t$. We are given a sample from the multinomial, but due to a technical issue we can only observe sums of pairs of strings.

Generative Model
Unknown: a multinomial over the strings 01000010, 11111111, 00001111, 00000000 with probabilities $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/8$, $p_4 = 1/8$.
Hidden sample of pairs: (11111111, 00001111), (01000010, 11111111), (00000000, 11111111), (01000010, 01000010).
Observed sums of pairs: 11112222, 12111121, 11111111, 02000020.
GOAL: estimate the probabilities from the observed sums.

Likelihood Formulation. How do we find the maximum? How do we know there is only one solution?

Expectation Maximization
Data (observed sums): 21110, 10001, 22221.
Candidate decompositions and their E-step posteriors: 21110 = 10000+11110, 10010+11100, 10100+11010, 10110+11000 (0.25 each); 10001 = 10000+00001, 10001+00000 (0.5 each); 22221 = 11111+11110 (1).
Current string probabilities: 1/12 each for 10000, 11110, 10010, 11100, 10100, 11010, 10110, 11000, 00001, 10001, 00000, 11111.

Expectation Maximization
Data (observed sums): 21110, 10001, 22221. Decomposition posteriors: 0.25, 0.25, 0.25, 0.25 for 21110; 0.5, 0.5 for 10001; 1 for 22221.
M-step: updated string probabilities: 10000: 0.125, 11110: 0.208, 10010: 0.041, 11100: 0.041, 10100: 0.041, 11010: 0.041, 10110: 0.041, 11000: 0.041, 00001: 0.083, 10001: 0.083, 00000: 0.083, 11111: 0.166.

Expectation Maximization
Data (observed sums): 21110, 10001, 22221. String probabilities as above (0.125, 0.208, 0.041, ..., 0.166).
E-step: updated decomposition posteriors: 0.838, 0.054, 0.054, 0.054 for 21110; 0.6, 0.4 for 10001; 1 for 22221.

Expectation Maximization
Data (observed sums): 21110, 10001, 22221. Decomposition posteriors: 0.838, 0.054, 0.054, 0.054 for 21110; 0.6, 0.4 for 10001; 1 for 22221.
M-step: updated string probabilities: 10000: 0.239, 11110: 0.306, 10010: 0.009, 11100: 0.009, 10100: 0.009, 11010: 0.009, 10110: 0.009, 11000: 0.009, 00001: 0.1, 10001: 0.066, 00000: 0.066, 11111: 0.166.

Expectation Maximization
Data (observed sums): 21110, 10001, 22221. String probabilities as above.
E-step: updated decomposition posteriors: 1, 0, 0, 0 for 21110; 0.85, 0.15 for 10001; 1 for 22221.

After many iterations
Data (observed sums): 21110, 10001, 22221. Decomposition posteriors: 1, 0, 0, 0 for 21110; 1, 0 for 10001; 1 for 22221.
String probabilities: 10000: 0.333, 11110: 0.333, 00001: 0.166, 11111: 0.166; all others: 0.

Expectation Maximization (EM) E-step: M-step:

Functions over n-dimensions. For a function $f(x_1, \dots, x_n)$, the gradient is the vector of partial derivatives: $\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)$. In one dimension: the derivative.

Gradient Descent
Goal: minimize a function f.
Algorithm:
1. Start from a point $x^0 = (x^0_1, \dots, x^0_n)$.
2. Compute $u = \nabla f(x^i_1, \dots, x^i_n)$.
3. Update $x^{i+1} = x^i - \epsilon u$ for a small step size $\epsilon > 0$.
4. Return to (2), unless converged.
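A minimal sketch of this loop on a toy quadratic objective (the function, starting point, and step size are hypothetical choices, not from the slides):

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = ||x - (1, 2)||^2, a toy objective.
    return 2 * (x - np.array([1.0, 2.0]))

x = np.zeros(2)                        # 1. start from a point
eps = 0.1                              # small step size
for _ in range(200):
    u = grad_f(x)                      # 2. compute the gradient
    x = x - eps * u                    # 3. update
    if np.linalg.norm(u) < 1e-8:       # 4. stop when converged
        break
print(x)                               # close to (1, 2)
```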

Gradient of g: $\frac{\partial g}{\partial p_i} = \sum_{\{h \in H:\, h \in C(h_i)\}} \frac{1}{\sum_{j:\, h \in C(h_j)} p_j}$. Projected onto the space $\sum_i p_i = 1$: $u = \left(\frac{\partial g}{\partial p_1} - \lambda, \dots, \frac{\partial g}{\partial p_m} - \lambda\right)$, where $\lambda = \frac{1}{m}\sum_{i=1}^m \frac{\partial g}{\partial p_i}$.

Gradient descent with projections
Start from a point $p^0 = (p^0_1, \dots, p^0_m)$. Repeat:
1. The current point is $p = (p_1, \dots, p_m)$.
2. Calculate $u = \left(\frac{\partial g}{\partial p_1}, \dots, \frac{\partial g}{\partial p_m}\right)$, projected as above.
3. Find a new point, for a small $\epsilon > 0$: $p' = p + \epsilon u$.
4. Return to step (2) until the gradient is zero or until you reach a corner ($p_i = 1$ or $p_i = 0$).
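A sketch of the projected update on a toy concave objective over the simplex (the objective g(p) = Σ_i c_i log p_i and the counts used here are stand-ins, not the lecture's g): the projection subtracts the mean of the gradient so that the iterate keeps summing to one.

```python
import numpy as np

c = np.array([2.0, 3.0, 1.0, 1.0])   # toy counts; g(p) = sum_i c_i log(p_i)

def grad_g(p):
    return c / p

p = np.full(4, 0.25)                 # start from the uniform point
eps = 0.01
for _ in range(2000):
    u = grad_g(p)
    u = u - u.mean()                 # project onto the plane sum_i p_i = 1
    if np.linalg.norm(u) < 1e-8 or np.any(p <= 1e-9) or np.any(p >= 1 - 1e-9):
        break                        # projected gradient is zero, or we hit a corner
    p = p + eps * u                  # ascent step (we are maximizing g)
print(p, c / c.sum())                # converges toward c / n
```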