Introduction to Machine Learning: Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf. 2014-15

We know that $X \sim B(n,p)$, but we do not know $p$. We get a random sample from $X$, a random number $m$.

$$\Pr(X = m \mid X \sim B(n,p)) = \binom{n}{m} p^m (1-p)^{n-m}$$

The likelihood is defined as: $L(p; X = m) = \Pr(X = m \mid X \sim B(n,p))$

The Likelihood Function
Assume we have a set of hypotheses to choose from. Normally a hypothesis will be defined by a set of parameters $\theta$. We do not know $\theta$, but we make some observations and get data $D$. The likelihood of $\theta$ is $L(\theta; D) = \Pr(D \mid \theta)$. We are interested in the hypothesis that maximizes the likelihood.

Example
We know that $X \sim B(n,p)$, but we do not know $p$. We get a random sample from $X$, a random number $m$. In this case, the data $D$ is the number $m$, and the parameter $\theta$ is $p$. The likelihood is

$$L(p; X = m) = \Pr(X = m \mid X \sim B(n,p)) = \binom{n}{m} p^m (1-p)^{n-m}$$

Maximum Likelihood Estimate
Maximum likelihood $= \arg\max_\theta L(\theta; D)$
In the example above, the maximum is obtained for $\hat{p} = m/n$.
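The claim $\hat{p} = m/n$ can be checked numerically. The sketch below evaluates the binomial likelihood on a grid of candidate values of $p$ and picks the largest; the sample size `n = 20`, the observation `m = 7`, and the helper names are hypothetical choices made only for illustration.

```python
from math import comb

def binomial_likelihood(p, n, m):
    """L(p; X = m) = C(n, m) * p^m * (1 - p)^(n - m)."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def grid_mle(n, m, steps=1000):
    """Return the grid point in [0, 1] with the largest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: binomial_likelihood(p, n, m))

n, m = 20, 7                 # hypothetical sample: m = 7 observed with n = 20
p_hat = grid_mle(n, m)
print(p_hat, m / n)          # the grid maximizer should coincide with m / n
```

A grid search is used here only because it makes no use of calculus, so it is an independent check of the closed-form answer.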

Reminder: The Normal Distribution
We obtain a set of n independent samples: $x_1,\dots,x_n \sim N(\mu, \sigma^2)$.
We want to estimate the model parameters: $\mu, \sigma^2$.

Maximum Likelihood Estimate (MLE)

Example
$X_1,\dots,X_n \sim U(0,\theta)$. What is the maximum likelihood?
Assume $X_{(1)} < \dots < X_{(n)}$.
For $\theta < X_{(n)}$, $L(\theta; D) = 0$.
For $\theta \ge X_{(n)}$, $L(\theta; D) = 1/\theta^n$.
Max Likelihood: $\hat{\theta} = X_{(n)}$
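The two cases of the uniform likelihood translate directly into code. In this minimal sketch the sample `xs` is hypothetical; it shows that the likelihood is zero below the largest observation and decreasing above it, so the largest observation is the maximizer.

```python
def uniform_likelihood(theta, xs):
    """L(theta; D) for X_1..X_n ~ U(0, theta): 0 if theta < max(xs), else theta^(-n)."""
    if theta < max(xs):
        return 0.0
    return theta ** (-len(xs))

xs = [0.8, 2.1, 1.4, 0.3]    # hypothetical sample
theta_hat = max(xs)          # the largest order statistic X_(n)

for theta in (theta_hat - 0.1, theta_hat, theta_hat + 0.1):
    print(theta, uniform_likelihood(theta, xs))
```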

Example: MLE of a Multinomial
We are given a universe of possible strings (e.g., words of a language): $h_1,\dots,h_t \in \{0,1\}^k$
Assume a model by which the strings are generated from a multinomial with (unknown) probabilities $p_1,\dots,p_t$
We are given a sample from the multinomial with counts $c_1,\dots,c_t$

Generative Model
[Figure: a multinomial with probabilities $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/8$, $p_4 = 1/8$ (unknown; estimating them is the goal) generates the strings 01000010, 11111111, 00001111, 00000000, 11111111, 00001111, 01000010, 11111111, 00000000, 11111111, 01000010. The figure is shown alongside the counts $c_1 = 2$, $c_2 = 3$, $c_3 = 1$, $c_4 = 1$.]

MLE of a Multinomial
Strings: $h_1,\dots,h_t \in \{0,1\}^k$
Counts: $c_1,\dots,c_t$

$$L(p_1,\dots,p_t; c_1,\dots,c_t) = \binom{n}{c_1}\binom{n-c_1}{c_2}\cdots\binom{n-c_1-\cdots-c_{t-1}}{c_t}\; p_1^{c_1} p_2^{c_2} \cdots p_t^{c_t}$$

$$\max \sum_i c_i \log(p_i) \quad \text{s.t. } \sum_i p_i = 1,\; p_i > 0$$

Using Lagrange Multipliers
We are interested in maximizing:

$$\max \sum_i c_i \log(p_i) \quad \text{s.t. } \sum_i p_i = 1,\; p_i > 0$$

Instead, we will consider the Lagrange function:

$$\max \sum_i c_i \log(p_i) + \lambda\Big(1 - \sum_i p_i\Big), \quad \text{s.t. } p_i > 0$$

An optimal solution of the original problem corresponds to a stationary point of the Lagrange function.

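Using Lagrange Multipliers

$$f(p, \lambda) = \sum_i c_i \log(p_i) + \lambda\Big(1 - \sum_i p_i\Big)$$

Compute the gradient:

$$\frac{\partial f}{\partial p_i} = \frac{c_i}{p_i} - \lambda, \qquad \frac{\partial f}{\partial \lambda} = 1 - \sum_i p_i$$

Equating to zero:

$$p_i = \frac{c_i}{\lambda}, \qquad \lambda = \sum_i c_i = n$$

Hence $\hat{p}_i = c_i / n$.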
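Solving the stationarity conditions gives $p_i = c_i / n$, the relative frequencies. The sketch below computes this estimate for the counts $c = (2, 3, 1, 1)$ from the generative-model slide and compares its log-likelihood $\sum_i c_i \log p_i$ with that of a uniform guess; the function names are illustrative only.

```python
from math import log

def multinomial_mle(counts):
    """Closed-form MLE from the Lagrange conditions: p_i = c_i / n."""
    n = sum(counts)
    return [c / n for c in counts]

def log_likelihood(counts, probs):
    """sum_i c_i * log(p_i), i.e. the log-likelihood up to an additive constant."""
    return sum(c * log(p) for c, p in zip(counts, probs) if c > 0)

counts = [2, 3, 1, 1]
p_hat = multinomial_mle(counts)
uniform = [0.25] * 4

print(p_hat)
print(log_likelihood(counts, p_hat), log_likelihood(counts, uniform))
```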

Bayesian Estimators
Maximum likelihood:
Advantage: No assumptions made on the model distribution.
Disadvantage: In reality we are looking for $\max_\theta \Pr(\theta \mid D)$ rather than $\max_\theta \Pr(D \mid \theta)$. Is it well defined?

Prior and Posterior
Sometimes we know something about the PRIOR distribution $\Pr(\theta)$.
Then, based on Bayes rule, we can calculate the POSTERIOR distribution:

$$\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta)\Pr(\theta)}{\Pr(D)}$$

MAP (Maximum a posteriori)
Maximum a posteriori estimation (MAP) is the mode of the posterior distribution:

$$\hat{\theta}_{MAP} = \arg\max_\theta \Pr(\theta \mid D) = \arg\max_\theta \Pr(D \mid \theta)\Pr(\theta)$$
$$\hat{\theta}_{ML} = \arg\max_\theta \Pr(D \mid \theta)$$

Example
Assume: $x_1,\dots,x_n \sim N(\mu, 1)$

$$\hat{\mu}_{ML} = \frac{\sum_{i=1}^n x_i}{n}$$

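Normal Prior
Assume prior $\mu \sim N(0, 1)$

$$\log \Pr(x_1,\dots,x_n \mid \mu) = -\frac{n}{2}\log(2\pi) - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2}$$
$$\log \Pr(\mu) = -\frac{1}{2}\log(2\pi) - \frac{\mu^2}{2}$$
$$\hat{\mu}_{MAP} = \arg\max_\mu \Big\{ -\frac{\mu^2}{2} - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2} \Big\} = \arg\max_\mu \Big\{ -\mu^2 - \sum_{i=1}^n (x_i - \mu)^2 \Big\}$$
$$\hat{\mu}_{MAP} = \frac{\sum_{i=1}^n x_i}{n+1}, \qquad \hat{\mu}_{ML} = \frac{\sum_{i=1}^n x_i}{n}$$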
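With a $N(0,1)$ prior and a $N(\mu,1)$ likelihood, the MAP estimate divides the sum of the observations by $n+1$, which shrinks the sample mean toward the prior mean 0. A minimal sketch, with a hypothetical three-point sample, compares the two estimators and checks that the MAP value maximizes the unnormalized log-posterior.

```python
def mu_ml(xs):
    """Maximum likelihood estimate for x_i ~ N(mu, 1): the sample mean."""
    return sum(xs) / len(xs)

def mu_map(xs):
    """MAP estimate under the prior mu ~ N(0, 1): sum(x_i) / (n + 1)."""
    return sum(xs) / (len(xs) + 1)

def log_posterior_unnormalized(mu, xs):
    """-mu^2 / 2 - sum((x_i - mu)^2) / 2, dropping terms that do not depend on mu."""
    return -mu**2 / 2 - sum((x - mu) ** 2 for x in xs) / 2

xs = [1.0, 2.0, 3.0]         # hypothetical sample
print(mu_ml(xs), mu_map(xs))
```

As $n$ grows, the factor $n/(n+1)$ approaches 1 and the two estimates coincide.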

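Posterior of a Normal Prior
Assume prior $\mu \sim N(0, 1)$

$$\Pr(\mu \mid x_1,\dots,x_n) \propto \exp\left( -\frac{\left(\mu - \frac{\sum_{i=1}^n x_i}{n+1}\right)^2}{2 \cdot \frac{1}{n+1}} \right)$$
$$\mu \mid x_1,\dots,x_n \sim N\left( \frac{\sum_{i=1}^n x_i}{n+1},\; \frac{1}{n+1} \right)$$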

Choosing a prior for B(n,p)
$X \sim B(n, p)$
One sample: $X = m$
$\hat{p}_{ML} = m/n$

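The Beta Distribution
$X \sim \mathrm{Beta}(\alpha, \beta)$, $\alpha > 0$, $\beta > 0$

$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$$
$$\mu = E[X] = \frac{\alpha}{\alpha + \beta}$$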

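Posterior with a Beta Prior
$X \sim B(n, p)$. Assume prior: $p \sim \mathrm{Beta}(\alpha, \beta)$

$$\Pr(p \mid X = m, \alpha, \beta) \propto \binom{n}{m} p^m (1-p)^{n-m} \cdot \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}$$
$$\Pr(p \mid X = m, \alpha, \beta) \propto p^{m+\alpha-1}(1-p)^{n-m+\beta-1}$$
$$\Pr(p \mid X = m, \alpha, \beta) \sim \mathrm{Beta}(m + \alpha,\; n - m + \beta)$$
$$\hat{p}_{MAP} = \frac{m + \alpha - 1}{n + \alpha + \beta - 2}$$

If the prior distribution is Beta then the posterior distribution is Beta as well. A conjugate prior.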
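Because the Beta prior is conjugate, updating on a binomial observation only changes the two Beta parameters. The sketch below applies that update and reads off the posterior mode and mean; the values `n = 10`, `m = 3` and the prior parameters are hypothetical. Setting $\alpha = \beta = 1$ makes the prior uniform on $(0,1)$.

```python
def beta_posterior(m, n, alpha, beta):
    """Posterior parameters after observing X = m from B(n, p) with a Beta(alpha, beta) prior."""
    return m + alpha, n - m + beta

def beta_mode(a, b):
    """Mode of Beta(a, b); this formula assumes a + b > 2 and a, b >= 1."""
    return (a - 1) / (a + b - 2)

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

n, m = 10, 3                              # hypothetical observation
a, b = beta_posterior(m, n, alpha=2, beta=2)
print(a, b, beta_mode(a, b))              # MAP = (m + alpha - 1) / (n + alpha + beta - 2)

a1, b1 = beta_posterior(m, n, alpha=1, beta=1)   # uniform prior
print(beta_mode(a1, b1), beta_mean(a1, b1))      # m / n and (m + 1) / (n + 2)
```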

Posterior with a Uniform Prior
$X \sim B(n, p)$. Assume prior: $p \sim U(0, 1)$

$$\Pr(p \mid X = m) \propto p^m (1-p)^{n-m}$$
$$\Pr(p \mid X = m) \sim \mathrm{Beta}(m + 1,\; n - m + 1)$$
$$\hat{p}_{MAP} = \frac{m}{n}, \qquad E[p \mid X = m] = \frac{m+1}{n+2}$$

Classification (Naïve Bayes)

Cholesterol level    Heart Attack (HA)
$x_1$                1
$x_2$                1
$x_3$                0
$x_4$                1
$x_5$                0
$x_6$                0
$x_7$                0

Given a new individual, can we predict whether the individual will get a heart attack based on his cholesterol level?
Assumption: Cholesterol levels are normally distributed with a different mean in the 1 and 0 sets:
$\Pr(x \mid HA = 1) \sim N(\mu_1, \sigma^2)$, $\Pr(x \mid HA = 0) \sim N(\mu_0, \sigma^2)$
The parameters can be estimated using MLE.
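A minimal sketch of this one-feature classifier follows. Several things are assumptions rather than content from the slides: the cholesterol values are hypothetical (only the 0/1 labels follow the table), the two classes share one variance, the class prior is estimated by the class frequency, and the decision rule picks the class with the larger prior times likelihood.

```python
from math import log, pi

def fit(xs, ys):
    """MLE of the two class means, a shared variance, and the prior Pr(HA = 1)."""
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    mu1, mu0 = sum(x1) / len(x1), sum(x0) / len(x0)
    sq = sum((x - mu1) ** 2 for x in x1) + sum((x - mu0) ** 2 for x in x0)
    return mu1, mu0, sq / len(xs), len(x1) / len(xs)

def log_normal_pdf(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x, mu1, mu0, var, prior1):
    """Assumed rule: choose the class with the larger log prior + log likelihood."""
    s1 = log(prior1) + log_normal_pdf(x, mu1, var)
    s0 = log(1 - prior1) + log_normal_pdf(x, mu0, var)
    return 1 if s1 > s0 else 0

xs = [240, 250, 180, 230, 190, 170, 200]   # hypothetical cholesterol levels
ys = [1, 1, 0, 1, 0, 0, 0]                 # HA labels as in the table
params = fit(xs, ys)
print(params, predict(250, *params), predict(170, *params))
```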

Classification (Naïve Bayes) Decision rule:

Multiple Variables

$x_1$    $x_2$    $x_n$    $y$
195      17       117      1
195      24       114      1
184      13       117      0
250      22       111      1
173      15       108      0
185      18       145      0
178      22       136      0

Assumptions:
1. Normal marginal distributions
2. Variables are independent

Multiple Variables

Naïve Bayes
A Naïve assumption. Often works in practice.
Interpretation: A weighted sum of evidence.
Allows for the incorporation of features of different distributions.
Requires small amounts of data.

Naïve Bayes Might Break
[Figure: two scatter-plot panels. y=1: independent variables. y=0: $x_2 = x_1$.]

The Multivariate Normal Distribution
is a multivariate normal distribution
Example: [Figure: example scatter plot.]

The Multivariate Normal Distribution
Notation: The variance-covariance matrix is
If we do not use Naïve Bayes, we need to estimate $O(k^2)$ parameters.

Reminder: K-means objective Given: Vectors A number K Objective:

K-Means: A Likelihood Formulation
There are unknown clusters: $S_1,\dots,S_k$.
The points in $S_i$ are distributed
Each point $x_i$ originates from a cluster $c_i$.

Mixture of Gaussians
There are unknown clusters: $S_1,\dots,S_k$.
The points in $S_i$ are distributed
Each point $x_i$ originates from cluster $c_i$ with probability $p_i$.


In one dimension
There are unknown clusters: $S_1,\dots,S_k$.
The points in $S_i$ are distributed
Each point $x_i$ originates from cluster $c_i$ with probability $p_i$.

For every i, we choose:

The Expectation-Maximization Algorithm Start with a guess: In each iteration t+1 set:

The Expectation-Maximization Algorithm By construction:

The Expectation-Maximization Algorithm Conclusion: The likelihood is non-decreasing in each iteration. Stopping rule: When the likelihood flattens.

Expectation Maximization (EM)
$D$: given data
$\theta$: parameters that need to be estimated
$Z$: missing (latent) variables
1. E-step: $Q(\theta \mid \theta_t) = E_{Z \mid D, \theta_t}[\log \Pr(D, Z \mid \theta)]$
2. M-step: $\theta_{t+1} := \arg\max_\theta Q(\theta \mid \theta_t)$

EM - Comments
No guarantee of optimization to local maximum.
No guarantee of running times. Often it takes many iterations to converge.
Efficiency: no matrix inversion is needed (unlike, e.g., Newton's method).
Generalized EM: no need to find the max in the M-step.
Easy to implement.
Numerical stability.
Monotone: it is easy to ensure correctness in EM.
Interpretation: provides interpretation for the latent variables.

Reminder: Mixture of Gaussians
There are unknown clusters: $S_1,\dots,S_k$.
The points in $S_i$ are distributed
Each point $x_i$ originates from cluster $c_i$ with probability $p_i$.
Variant: $S_i$ is distributed

EM for Mixture of Gaussians. E-step:

EM for Mixture of Gaussians. M-step: Relation to K-Means: The E-step assigns each point to a cluster. The M-step finds the new cluster centers.
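The E-step and M-step can be sketched for a one-dimensional mixture. The update formulas below are the standard textbook ones, used here as an assumption; the simplification to unit-variance components, the data, and the starting point are all hypothetical. The E-step computes a soft assignment of each point to each cluster, and the M-step re-estimates the mixing weights and cluster centers from those soft assignments.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu):
    """Density of N(mu, 1) at x (unit variance is a simplifying assumption)."""
    return exp(-((x - mu) ** 2) / 2) / sqrt(2 * pi)

def log_likelihood(xs, weights, means):
    return sum(log(sum(w * normal_pdf(x, m) for w, m in zip(weights, means))) for x in xs)

def em_step(xs, weights, means):
    # E-step: resp[i][j] = posterior probability that point i came from cluster j
    resp = []
    for x in xs:
        joint = [w * normal_pdf(x, m) for w, m in zip(weights, means)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate mixing weights and centers from the soft counts
    k = len(weights)
    nk = [sum(r[j] for r in resp) for j in range(k)]
    new_weights = [c / len(xs) for c in nk]
    new_means = [sum(r[j] * x for r, x in zip(resp, xs)) / nk[j] for j in range(k)]
    return new_weights, new_means

xs = [-2.1, -1.9, -2.4, -1.6, 2.2, 1.8, 2.5, 1.5]   # hypothetical data
weights, means = [0.5, 0.5], [-1.0, 1.0]            # hypothetical initial guess
history = [log_likelihood(xs, weights, means)]
for _ in range(20):
    weights, means = em_step(xs, weights, means)
    history.append(log_likelihood(xs, weights, means))
print(weights, means)
```

Replacing the soft assignments with a hard assignment to the most probable cluster would turn the M-step for the means into the K-means center update.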

Hidden Coin
There are two coins with probabilities for Head: $p_1, p_2$.
In each step, with probability $\lambda$ we flip the first coin. With probability $1-\lambda$ we flip the second coin.
Observe the series of Heads: $x = (0,1,1,0,1,\dots)$
Parameters: $\lambda, p_1, p_2$.

Hidden Coin

Hidden Coin E-step: M-step:

Hidden Coin Consider the set of Heads: (1,1,0,0). Start with an initial guess: Exercise: the EM is stuck and we are at a saddle point.
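One way to explore the exercise is to run a single EM iteration by hand. The updates below are the standard ones for a two-component Bernoulli mixture and are an assumption, as is the symmetric starting point $\lambda = p_1 = p_2 = 0.5$, which is a hypothetical choice and not necessarily the guess used in the lecture. From this start, EM returns exactly the parameters it was given.

```python
def em_step(xs, lam, p1, p2):
    """One EM iteration for the hidden-coin model."""
    # E-step: w[i] = posterior probability that flip i used the first coin
    w = []
    for x in xs:
        a = lam * (p1 if x == 1 else 1 - p1)
        b = (1 - lam) * (p2 if x == 1 else 1 - p2)
        w.append(a / (a + b))
    # M-step: weighted frequencies
    sw = sum(w)
    new_lam = sw / len(xs)
    new_p1 = sum(wi * x for wi, x in zip(w, xs)) / sw
    new_p2 = sum((1 - wi) * x for wi, x in zip(w, xs)) / (len(xs) - sw)
    return new_lam, new_p1, new_p2

xs = [1, 1, 0, 0]
start = (0.5, 0.5, 0.5)          # hypothetical symmetric initial guess
after = em_step(xs, *start)
print(start, after)              # the parameters do not move
```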

Multinomial Revisited
We are given a universe of possible strings (e.g., words of a language): $h_1,\dots,h_t \in \{0,1\}^k$
Assume a model by which the strings are generated from a multinomial with (unknown) probabilities $p_1,\dots,p_t$
We are given a sample from the multinomial with counts $c_1,\dots,c_t$

Ambiguous Multinomial
We are given a universe of possible strings (e.g., words of a language).
Assume a model by which the strings are generated from a multinomial with (unknown) probabilities $p_1,\dots,p_t$
We are given a sample from the multinomial, but due to a technical issue we can only observe sums of pairs of strings.

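Generative Model
[Figure: a multinomial with probabilities $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/8$, $p_4 = 1/8$ (unknown; estimating them is the goal) generates the strings 01000010, 11111111, 00001111, 00000000, 11111111, 00001111, 01000010, 11111111, 00000000, 11111111, 01000010. Only sums of pairs are observed: 01000010, 11112222, 12111121, 11111111, 02000020.]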

Likelihood Formulation
How do we find the max? How do we know there is only one solution?

Expectation Maximization

Data: three observed sums, each with its candidate decompositions into a pair of strings:
21110 = 10000 + 11110 = 10010 + 11100 = 10100 + 11010 = 10110 + 11000
10001 = 10000 + 00001 = 10001 + 00000
22221 = 11111 + 11110

The slides alternate between two quantities. In each column below, the pair weights are computed from the string probabilities of the same column; the string probabilities of the next column are then computed from those weights.

Weights of the candidate pairs:

Observed   Pair             Start   Update 1   Update 2   After many iterations
21110      10000 + 11110    0.25    0.838      1          1
21110      10010 + 11100    0.25    0.054      0          0
21110      10100 + 11010    0.25    0.054      0          0
21110      10110 + 11000    0.25    0.054      0          0
10001      10000 + 00001    0.5     0.6        0.85       1
10001      10001 + 00000    0.5     0.4        0.15       0
22221      11111 + 11110    1       1          1          1

String probabilities:

String   Start   Update 1   Update 2   After many iterations
10000    1/12    0.125      0.239      0.333
11110    1/12    0.208      0.306      0.333
10010    1/12    0.041      0.009      0
11100    1/12    0.041      0.009      0
10100    1/12    0.041      0.009      0
11010    1/12    0.041      0.009      0
10110    1/12    0.041      0.009      0
11000    1/12    0.041      0.009      0
00001    1/12    0.083      0.1        0.166
10001    1/12    0.083      0.066      0
00000    1/12    0.083      0.066      0
11111    1/12    0.166      0.166      0.166
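The iterations in this example can be reproduced with a short program. This is a sketch under the assumption that a pair's weight is proportional to the product of the probabilities of its two strings, and that each string's new probability is its expected count divided by the total number of strings (two per observation). With full floating-point precision the first pair's weight for 21110 after one update is about 0.833, slightly different from the 0.838 on the slide, which appears to come from intermediate rounding.

```python
candidates = {
    "21110": [("10000", "11110"), ("10010", "11100"), ("10100", "11010"), ("10110", "11000")],
    "10001": [("10000", "00001"), ("10001", "00000")],
    "22221": [("11111", "11110")],
}
strings = sorted({s for pairs in candidates.values() for pair in pairs for s in pair})

def e_step(probs):
    """Weight of each candidate pair, proportional to the product of its string probabilities."""
    weights = {}
    for obs, pairs in candidates.items():
        scores = [probs[a] * probs[b] for a, b in pairs]
        total = sum(scores)
        weights[obs] = [s / total for s in scores]
    return weights

def m_step(weights):
    """New string probabilities: expected counts divided by the total number of strings."""
    counts = {s: 0.0 for s in strings}
    for obs, pairs in candidates.items():
        for (a, b), w in zip(pairs, weights[obs]):
            counts[a] += w
            counts[b] += w
    n = sum(counts.values())
    return {s: c / n for s, c in counts.items()}

def run_em(iterations):
    probs = {s: 1 / len(strings) for s in strings}   # uniform start: 1/12 each
    for _ in range(iterations):
        probs = m_step(e_step(probs))
    return probs

probs1 = run_em(1)
weights1 = e_step(probs1)
final = run_em(50)
print(round(probs1["10000"], 3), round(probs1["11110"], 3))
print([round(w, 3) for w in weights1["10001"]])
print({s: round(p, 3) for s, p in final.items()})
```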

Expectation Maximization (EM) E-step: M-step:

Functions over n-dimensions
For a function $f(x_1,\dots,x_n)$, the gradient is the vector of partial derivatives:

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)$$

In one dimension: derivative.

Gradient Descent
Goal: Minimize a function
Algorithm:
1. Start from a point
2. Compute $u = \nabla f(x_1^i, \dots, x_n^i)$
3. Update
4. Return to (2), unless converged.
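The four steps can be sketched as a short loop. The update rule used here, $x \leftarrow x - \epsilon u$ with a fixed step size $\epsilon$, is the usual choice and is an assumption; so are the test function, the step size, and the stopping tolerance.

```python
def gradient_descent(grad, x0, eps=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient, using the assumed update x <- x - eps * u."""
    x = list(x0)                                       # 1. start from a point
    for _ in range(max_iter):
        u = grad(x)                                    # 2. compute the gradient
        if sum(ui * ui for ui in u) ** 0.5 < tol:      # 4. stop once converged
            break
        x = [xi - eps * ui for xi, ui in zip(x, u)]    # 3. update
    return x

def grad_f(x):
    """Gradient of the hypothetical example f(x, y) = (x - 1)^2 + 2 * (y + 3)^2."""
    return [2 * (x[0] - 1), 4 * (x[1] + 3)]

x_min = gradient_descent(grad_f, [0.0, 0.0])
print(x_min)    # close to the minimizer (1, -3)
```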

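Gradient of g:

$$\frac{\partial g}{\partial p_i} = \sum_{\{h \in H :\, h \in C(h_i)\}} \frac{1}{\sum_{j :\, h \in C(h_j)} p_j}$$

Projected on the space $\sum_i p_i = 1$:

$$u = \left( \frac{\partial g}{\partial p_1} - \lambda, \dots, \frac{\partial g}{\partial p_m} - \lambda \right) \quad \text{where } \lambda = \frac{1}{m}\sum_{i=1}^m \frac{\partial g}{\partial p_i}$$

Gradient descent with projections
Start from a point $p^0 = (p_1^0, \dots, p_m^0)$
Repeat:
1. Current point is $p = (p_1, \dots, p_m)$
2. Calculate $u = \left( \frac{\partial g}{\partial p_1} - \lambda, \dots, \frac{\partial g}{\partial p_m} - \lambda \right)$
3. Find a new point for a small $\epsilon > 0$: $p' = p + \epsilon u$
4. Return to step (2) until the gradient is zero or until you get to a corner ($p_i = 1$ or $p_i = 0$)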
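The projection step can be illustrated in isolation. Subtracting the mean of the partial derivatives from each of them gives a direction whose components sum to zero, so a step along it leaves the sum of the $p_i$ unchanged. In this sketch the gradient values are hypothetical placeholders, not derivatives of the actual function $g$, and the step size is arbitrary.

```python
def project_gradient(grad):
    """Subtract the mean so that the components of the direction sum to zero."""
    lam = sum(grad) / len(grad)
    return [g - lam for g in grad]

def projected_step(p, grad, eps=0.01):
    """Move from p along the projected gradient: p' = p + eps * u."""
    u = project_gradient(grad)
    return [pi + eps * ui for pi, ui in zip(p, u)]

p = [0.25, 0.5, 0.125, 0.125]
grad = [1.0, 3.0, 0.5, 0.5]        # hypothetical partial derivatives of g at p
u = project_gradient(grad)
p_new = projected_step(p, grad)
print(u, p_new, sum(p_new))        # the new point still sums to 1
```

The sketch does not handle the corner case in step 4; a fuller implementation would stop or clip once some $p_i$ reaches 0 or 1.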