Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

1 Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

2 We know that $X \sim B(n,p)$, but we do not know $p$. We get a random sample from $X$, a random number $m$. $\Pr(X = m \mid X \sim B(n,p)) = \binom{n}{m} p^m (1-p)^{n-m}$

3 We know that $X \sim B(n,p)$, but we do not know $p$. We get a random sample from $X$, a random number $m$. The likelihood is defined as: $L(p; X = m) = \Pr(X = m \mid X \sim B(n,p))$

4 The Likelihood Function. Assume we have a set of hypotheses to choose from. Normally a hypothesis will be defined by a set of parameters $\theta$. We do not know $\theta$, but we make some observations and get data $D$. The likelihood of $\theta$ is $L(\theta; D) = \Pr(D \mid \theta)$. We are interested in the hypothesis that maximizes the likelihood.

5 Example. We know that $X \sim B(n,p)$, but we do not know $p$. We get a random sample from $X$, a random number $m$. In this case, the data $D$ is the number $m$, and the parameter $\theta$ is $p$. The likelihood is $L(p; X = m) = \Pr(X = m \mid X \sim B(n,p)) = \binom{n}{m} p^m (1-p)^{n-m}$

6 Maximum Likelihood Estimate. The maximum likelihood estimate is $\hat{\theta}_{ML} = \arg\max_\theta L(\theta; D)$. In the example above, the maximum is obtained for $\hat{p} = m/n$.
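
The claim that the binomial likelihood peaks at $m/n$ can be checked numerically. The following is a minimal sketch, not part of the slides: the function names and the observation ($m = 7$ successes in $n = 20$ trials) are hypothetical, chosen only for illustration.

```python
import math

def binomial_log_likelihood(p, n, m):
    """log L(p; X = m) for X ~ B(n, p), valid for 0 < p < 1."""
    return (math.log(math.comb(n, m))
            + m * math.log(p)
            + (n - m) * math.log(1 - p))

def binomial_mle(n, m):
    """Closed-form maximum likelihood estimate p_hat = m / n."""
    return m / n

# Hypothetical observation: m = 7 successes out of n = 20 trials.
n, m = 20, 7

# Brute-force search over a grid of p values strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: binomial_log_likelihood(p, n, m))

p_hat = binomial_mle(n, m)
print(p_hat, p_grid)
```

The grid search and the closed form should agree up to the grid resolution, since the log-likelihood is concave in $p$.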

7 Reminder: The Normal Distribution

8 Reminder: The Normal Distribution. We obtain a set of $n$ independent samples $x_1, \dots, x_n$. We want to estimate the model parameters $\mu$ and $\sigma^2$.

10 Maximum Likelihood Estimate (MLE)
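
The formulas of this slide are not present in the text, so the following is a sketch of the standard result rather than the lecturers' derivation: for independent normal samples, the likelihood is maximized by the sample mean and by the variance estimate that divides by $n$ (not $n - 1$). The sample values and function names are hypothetical.

```python
import math

def normal_mle(xs):
    """Return (mu_hat, sigma2_hat) maximizing the normal likelihood."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # The MLE of the variance divides by n, not by n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

def normal_log_likelihood(xs, mu, sigma2):
    """Log-likelihood of independent samples xs under N(mu, sigma2)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

xs = [2.1, 1.9, 3.2, 2.8, 2.5]  # hypothetical sample
mu_hat, sigma2_hat = normal_mle(xs)
print(mu_hat, sigma2_hat)
```

Perturbing either estimate and re-evaluating `normal_log_likelihood` gives a quick sanity check that the closed form is a maximum.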

11 Example. $X_1, \dots, X_n \sim U(0, \theta)$. What is the maximum likelihood estimate?

12 Example. $X_1, \dots, X_n \sim U(0, \theta)$. What is the maximum likelihood estimate? Assume $X_{(1)} < \dots < X_{(n)}$. For $\theta < X_{(n)}$, $L(\theta; D) = 0$. For $\theta \ge X_{(n)}$, $L(\theta; D) = 1/\theta^n$. Maximum likelihood: $\hat{\theta} = X_{(n)}$
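
The piecewise likelihood above can be written down directly. This is a minimal sketch with hypothetical sample values and function names; it only illustrates that the likelihood is zero below the sample maximum and decreasing above it.

```python
def uniform_likelihood(theta, xs):
    """L(theta; D) for X_1..X_n ~ U(0, theta), with all xs >= 0.

    The likelihood is 0 when theta is below the largest observation,
    and theta**(-n) otherwise.
    """
    if theta < max(xs):
        return 0.0
    return theta ** (-len(xs))

def uniform_mle(xs):
    """The likelihood is maximized at the largest observation X_(n)."""
    return max(xs)

xs = [0.8, 2.3, 1.1, 1.9]  # hypothetical sample
theta_hat = uniform_mle(xs)
print(theta_hat, uniform_likelihood(theta_hat, xs))
```

Because $1/\theta^n$ is strictly decreasing, any $\theta$ larger than the sample maximum yields a smaller likelihood.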

13 Example: MLE of a Multinomial. We are given a universe of possible strings (e.g., words of a language): $h_1, \dots, h_t \in \{0,1\}^k$. Assume a model by which the strings are generated from a multinomial with (unknown) probabilities $p_1, \dots, p_t$. We are given a sample from the multinomial with counts $c_1, \dots, c_t$.

14 Generative Model. [Figure: a multinomial source with probabilities $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/8$, $p_4 = 1/8$, marked "Unknown" and "GOAL".]

15 Generative Model. [Figure: a multinomial source with probabilities $p_1 = 1/4$, $p_2 = 1/2$, $p_3 = 1/8$, $p_4 = 1/8$, marked "Unknown" and "GOAL", together with the observed counts $c_1 = 2$, $c_2 = 3$, $c_3 = 1$, $c_4 = 1$.]

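16 MLE of a Multinomial. Strings: $h_1, \dots, h_t \in \{0,1\}^k$. Counts: $c_1, \dots, c_t$. $L(p_1, \dots, p_t; c_1, \dots, c_t) = \binom{n}{c_1} \binom{n - c_1}{c_2} \cdots \binom{n - c_1 - \cdots - c_{t-1}}{c_t} \, p_1^{c_1} p_2^{c_2} \cdots p_t^{c_t}$, where $n = \sum_i c_i$. The binomial coefficients do not depend on the $p_i$, so maximizing the likelihood amounts to: $\max \sum_i c_i \log(p_i)$ s.t. $\sum_i p_i = 1$, $p_i > 0$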

17 Using Lagrange Multipliers. We are interested in maximizing: $\max \sum_i c_i \log(p_i)$ s.t. $\sum_i p_i = 1$, $p_i > 0$. Instead, we will consider the Lagrange function: $\max \sum_i c_i \log(p_i) + \lambda (1 - \sum_i p_i)$, s.t. $p_i > 0$. An optimal solution of the original problem corresponds to a stationary point of the Lagrange function.

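18 Using Lagrange Multipliers. $f(p, \lambda) = \sum_i c_i \log(p_i) + \lambda (1 - \sum_i p_i)$. Compute the gradient: $\partial f / \partial p_i = c_i / p_i - \lambda$, $\partial f / \partial \lambda = 1 - \sum_i p_i$. Equating to zero: $p_i = c_i / \lambda$, $\lambda = \sum_i c_i = n$, and therefore $\hat{p}_i = c_i / n$.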
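
The stationary point can be checked on the counts shown in the generative-model slides ($c = (2, 3, 1, 1)$). This is a minimal sketch; the function names are hypothetical, and the comparison against the generating probabilities is only an illustration that the estimate scores at least as well on the sample.

```python
import math

def multinomial_mle(counts):
    """Maximum likelihood estimate p_i = c_i / n, with n = sum of counts."""
    n = sum(counts)
    return [c / n for c in counts]

def objective(counts, probs):
    """sum_i c_i * log(p_i): the part of the log-likelihood that depends on p."""
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

counts = [2, 3, 1, 1]                   # counts from the generative-model slide
p_true = [1 / 4, 1 / 2, 1 / 8, 1 / 8]   # generating probabilities from the same slide
p_hat = multinomial_mle(counts)

print(p_hat)
print(objective(counts, p_hat), objective(counts, p_true))
```

With only seven observations the estimate differs noticeably from the generating probabilities, which is expected for a small sample.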

19 Bayesian Estimators. Maximum likelihood: Advantage: no assumptions are made on the model distribution. Disadvantage: in reality we are looking for $\max_\theta \Pr(\theta \mid D)$ rather than $\max_\theta \Pr(D \mid \theta)$. Is it well defined?

20 Prior and Posterior. Sometimes we know something about the PRIOR distribution $\Pr(\theta)$. Then, based on Bayes' rule, we can calculate the POSTERIOR distribution: $\Pr(\theta \mid D) = \frac{\Pr(D \mid \theta) \Pr(\theta)}{\Pr(D)}$

21 MAP (Maximum a posteriori). The maximum a posteriori (MAP) estimate is the mode of the posterior distribution: $\hat{\theta}_{MAP} = \arg\max_\theta \Pr(\theta \mid D)$, whereas $\hat{\theta}_{ML} = \arg\max_\theta \Pr(D \mid \theta)$

23 Example. Assume: $x_1, \dots, x_n \sim N(\mu, 1)$. $\hat{\mu}_{ML} = \frac{\sum_{i=1}^n x_i}{n}$

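24 Normal Prior. Assume prior $\mu \sim N(0, 1)$. $\log(\Pr(x_1, \dots, x_n \mid \mu)) = -\frac{n}{2} \log(2\pi) - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2}$. $\log(\Pr(\mu)) = -\frac{1}{2} \log(2\pi) - \frac{\mu^2}{2}$. $\hat{\mu}_{MAP} = \arg\max_\mu \{ -\frac{\mu^2}{2} - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 \}$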

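25 Normal Prior. Assume prior $\mu \sim N(0, 1)$. $\hat{\mu}_{MAP} = \arg\max_\mu \{ -\mu^2 - \sum_{i=1}^n (x_i - \mu)^2 \}$, so $\hat{\mu}_{MAP} = \frac{\sum_{i=1}^n x_i}{n + 1}$, compared with $\hat{\mu}_{ML} = \frac{\sum_{i=1}^n x_i}{n}$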

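26 Posterior of a Normal Prior. Assume prior $\mu \sim N(0, 1)$. $\Pr(\mu \mid x_1, \dots, x_n) \propto \exp\left( -\frac{n+1}{2} \left( \mu - \frac{\sum_{i=1}^n x_i}{n+1} \right)^2 \right)$, so $\mu \mid x_1, \dots, x_n \sim N\left( \frac{\sum_{i=1}^n x_i}{n+1}, \frac{1}{n+1} \right)$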
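
The effect of the $N(0, 1)$ prior is to shrink the estimate toward zero, as if one extra observation equal to 0 had been added. The sketch below compares the two estimators and checks the MAP formula against a grid search over the log-posterior; the sample values and function names are hypothetical.

```python
def normal_ml(xs):
    """ML estimate of mu for x_i ~ N(mu, 1): the sample mean."""
    return sum(xs) / len(xs)

def normal_map_standard_prior(xs):
    """MAP estimate of mu for x_i ~ N(mu, 1) with prior mu ~ N(0, 1)."""
    return sum(xs) / (len(xs) + 1)

def posterior_params(xs):
    """Mean and variance of the posterior N(sum(x)/(n+1), 1/(n+1))."""
    n = len(xs)
    return sum(xs) / (n + 1), 1.0 / (n + 1)

def log_posterior_unnormalized(mu, xs):
    """-mu^2/2 - sum_i (x_i - mu)^2 / 2, dropping terms constant in mu."""
    return -mu ** 2 / 2 - sum((x - mu) ** 2 for x in xs) / 2

xs = [1.2, 0.7, 1.9, 1.4]  # hypothetical sample
mu_ml = normal_ml(xs)
mu_map = normal_map_standard_prior(xs)
post_mean, post_var = posterior_params(xs)

# Grid search over mu in [-3, 3] as a numerical check of the closed form.
grid = [i / 100 for i in range(-300, 301)]
mu_grid = max(grid, key=lambda mu: log_posterior_unnormalized(mu, xs))
print(mu_ml, mu_map, mu_grid, post_var)
```

As $n$ grows, the factor $n/(n+1)$ approaches 1 and the two estimates coincide.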

27 Choosing a prior for $B(n,p)$. $X \sim B(n,p)$. One sample: $X = m$. $\hat{p}_{ML} = \frac{m}{n}$

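28 The Beta Distribution. $X \sim \mathrm{Beta}(\alpha, \beta)$, $\alpha > 0$, $\beta > 0$. $f(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$. $\mu = E[X] = \frac{\alpha}{\alpha + \beta}$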

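29 Posterior with a Beta Prior. $X \sim B(n,p)$. Assume prior: $p \sim \mathrm{Beta}(\alpha, \beta)$. $\Pr(p \mid X = m, \alpha, \beta) \propto \binom{n}{m} p^m (1-p)^{n-m} \cdot \frac{p^{\alpha - 1} (1-p)^{\beta - 1}}{B(\alpha, \beta)}$, so $\Pr(p \mid X = m, \alpha, \beta) \propto p^{m + \alpha - 1} (1-p)^{n - m + \beta - 1}$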

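30 Posterior with a Beta Prior. $\Pr(p \mid X = m, \alpha, \beta) \propto p^{m + \alpha - 1} (1-p)^{n - m + \beta - 1}$, so $p \mid X = m, \alpha, \beta \sim \mathrm{Beta}(m + \alpha, n - m + \beta)$. $\hat{p}_{MAP} = \frac{m + \alpha - 1}{n + \alpha + \beta - 2}$. If the prior distribution is Beta then the posterior distribution is Beta as well: the Beta distribution is a conjugate prior for the binomial.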
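
Conjugacy means the posterior update is just arithmetic on the two Beta parameters. The following sketch uses hypothetical prior parameters and data ($\alpha = \beta = 2$, $n = 10$, $m = 7$) and takes the MAP estimate as the mode of the posterior Beta, which is $(a - 1)/(a + b - 2)$ when both parameters exceed 1.

```python
def beta_posterior(alpha, beta, n, m):
    """Posterior Beta parameters after observing m successes in n trials."""
    return alpha + m, beta + n - m

def beta_mode(a, b):
    """Mode of Beta(a, b); this formula requires a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

# Hypothetical prior and data.
alpha, beta, n, m = 2.0, 2.0, 10, 7

a_post, b_post = beta_posterior(alpha, beta, n, m)
p_map = beta_mode(a_post, b_post)
p_ml = m / n
print(a_post, b_post, p_map, p_ml)
```

With a symmetric prior centered at 1/2, the MAP estimate lies between the prior mode and the ML estimate $m/n$.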

31 Classification (Naïve Bayes).

Cholesterol level    Heart Attack (HA)
$x_1$                1
$x_2$                1
$x_3$                0
$x_4$                1
$x_5$                0
$x_6$                0
$x_7$                0

Given a new individual, can we predict whether the individual will get a heart attack based on their cholesterol level?

32 Classification (Naïve Bayes). Data: cholesterol levels $x_1, \dots, x_7$ with Heart Attack (HA) labels 1, 1, 0, 1, 0, 0, 0. Given a new individual, can we predict whether the individual will get a heart attack based on their cholesterol level? Assumption: cholesterol levels are normally distributed with a different mean in the 1 and 0 sets: $\Pr(x \mid HA = 1) \sim N(\mu_1, \sigma^2)$, $\Pr(x \mid HA = 0) \sim N(\mu_0, \sigma^2)$. The parameters $\mu_1$, $\mu_0$, $\sigma^2$ can be estimated using MLE.

33 Classification (Naïve Bayes) Decision rule:
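
The rule itself is not present in the text of this slide. A common choice, sketched here as an assumption rather than as the lecturers' rule, is to predict the class with the larger posterior probability, using the shared-variance normal model from the previous slide and class priors estimated from label frequencies. The cholesterol values are hypothetical; only the labels match the slides.

```python
import math

def fit(xs, ys):
    """MLE of the class means, the shared variance, and the class priors."""
    n = len(xs)
    groups = {c: [x for x, y in zip(xs, ys) if y == c] for c in (0, 1)}
    mu = {c: sum(g) / len(g) for c, g in groups.items()}
    sigma2 = sum((x - mu[y]) ** 2 for x, y in zip(xs, ys)) / n
    prior = {c: len(g) / n for c, g in groups.items()}
    return mu, sigma2, prior

def log_score(x, c, mu, sigma2, prior):
    """log Pr(HA = c) + log N(x; mu_c, sigma2), up to a constant shared by both classes."""
    return math.log(prior[c]) - (x - mu[c]) ** 2 / (2 * sigma2)

def predict(x, mu, sigma2, prior):
    """Predict 1 if the posterior of HA = 1 exceeds that of HA = 0, else 0."""
    s1 = log_score(x, 1, mu, sigma2, prior)
    s0 = log_score(x, 0, mu, sigma2, prior)
    return 1 if s1 > s0 else 0

# Hypothetical cholesterol levels; labels follow the slide (1, 1, 0, 1, 0, 0, 0).
xs = [260.0, 245.0, 180.0, 270.0, 190.0, 200.0, 175.0]
ys = [1, 1, 0, 1, 0, 0, 0]

mu, sigma2, prior = fit(xs, ys)
print(predict(280.0, mu, sigma2, prior), predict(170.0, mu, sigma2, prior))
```

Because the variance is shared, the normalizing constant of the normal density cancels between the two classes and can be dropped from the comparison.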

34 Multiple Variables. Data columns: $x_1, x_2, \dots, x_n$ and the label $y$. Assumptions: 1. Normal marginal distributions. 2. Variables are independent given the class $y$.

35 Multiple Variables

36 Multiple Variables

37 Naïve Bayes. A naïve assumption. Easy to implement. Often works in practice. Interpretation: a weighted sum of evidence. Allows for the incorporation of features with different distributions. Requires small amounts of data.

38 Naïve Bayes Might Break. [Figure: two scatter plots, one for y=1 and one for y=0.] y=1: independent variables. y=0: $x_2 = x_1$.

39 The Multivariate Normal Distribution. $X \sim N(\mu, \Sigma)$ is a multivariate normal distribution.

40 The Multivariate Normal Distribution. $X \sim N(\mu, \Sigma)$ is a multivariate normal distribution. Example: [Figure: plot of an example distribution.]

41 The Multivariate Normal Distribution. Notation: the variance-covariance matrix is denoted $\Sigma$. If we do not use Naïve Bayes, we need to estimate $O(k^2)$ parameters.
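
As a reminder behind the $O(k^2)$ remark, here is the standard multivariate normal density with a parameter count. The notation is an assumption, not taken from the slides: $x$ is a $k$-dimensional vector, $\mu$ the mean vector, and $\Sigma$ a symmetric positive-definite covariance matrix.

```latex
% Density of a k-dimensional normal with mean \mu and covariance \Sigma
f(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}
       \exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)

% Full covariance: k means plus k(k+1)/2 distinct entries of the symmetric \Sigma
\#\text{parameters}_{\text{full}} = k + \frac{k(k+1)}{2} = O(k^2)

% Naive Bayes with normal marginals: \Sigma is diagonal, so k means plus k variances
\#\text{parameters}_{\text{naive}} = 2k = O(k)
```

These counts are per class; with two classes and separate parameters for each, both figures double.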

42 Reminder: K-means objective Given: Vectors A number K Objective:

43 K-Means: A Likelihood Formulation. There are unknown clusters $S_1, \dots, S_k$. The points in $S_i$ are distributed according to a cluster-specific distribution. Each point $x_i$ originates from a cluster $c_i$.

44 Mixture of Gaussians. There are unknown clusters $S_1, \dots, S_k$. The points in $S_i$ are distributed according to a cluster-specific Gaussian. Each point $x_i$ originates from cluster $S_j$ with probability $p_j$.

47 In one dimension. There are unknown clusters $S_1, \dots, S_k$. The points in $S_i$ are distributed according to a cluster-specific one-dimensional Gaussian. Each point $x_i$ originates from cluster $S_j$ with probability $p_j$.

48 For every i, we choose:

49 The Expectation-Maximization Algorithm Start with a guess: In each iteration t+1 set:

51 The Expectation-Maximization Algorithm By construction:

52 The Expectation-Maximization Algorithm Conclusion: The likelihood is non-decreasing in each iteration. Stopping rule: When the likelihood flattens.

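53 Expectation Maximization (EM). $D$: given data. $\theta$: parameters that need to be estimated. $Z$: missing (latent) variables. 1. E-step: $Q(\theta \mid \theta_t) = E_{Z \mid D, \theta_t}[\log(\Pr(D, Z \mid \theta))]$. 2. M-step: $\theta_{t+1} := \arg\max_\theta Q(\theta \mid \theta_t)$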
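
For a one-dimensional mixture of Gaussians, the E-step and M-step have well-known closed forms. The sketch below is one standard instance, not necessarily the exact update rules shown on the slides (those formulas are missing from the text). The data, initial guess, function names, and the small variance floor are all assumptions added for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def log_likelihood(xs, weights, mus, sigma2s):
    """Log-likelihood of xs under the mixture sum_j p_j N(mu_j, sigma2_j)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s2)
                     for w, m, s2 in zip(weights, mus, sigma2s)))
        for x in xs
    )

def em_step(xs, weights, mus, sigma2s):
    """One EM iteration for a 1D Gaussian mixture."""
    k = len(weights)
    # E-step: resp[i][j] = Pr(point i came from cluster j | current parameters).
    resp = []
    for x in xs:
        joint = [w * normal_pdf(x, m, s2)
                 for w, m, s2 in zip(weights, mus, sigma2s)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: weighted ML estimates using the responsibilities as soft counts.
    new_w, new_mu, new_s2 = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mu_j = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        s2_j = sum(r[j] * (x - mu_j) ** 2 for r, x in zip(resp, xs)) / nj
        new_w.append(nj / len(xs))
        new_mu.append(mu_j)
        new_s2.append(max(s2_j, 1e-6))  # variance floor: an added safeguard, not part of EM
    return new_w, new_mu, new_s2

# Hypothetical data: two well-separated clusters around 0 and 5.
random.seed(0)
xs = ([random.gauss(0.0, 1.0) for _ in range(100)]
      + [random.gauss(5.0, 1.0) for _ in range(100)])

params = ([0.5, 0.5], [-1.0, 6.0], [1.0, 1.0])  # initial guess
history = [log_likelihood(xs, *params)]
for _ in range(30):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))

print(params[1], history[0], history[-1])
```

Tracking `history` makes the monotonicity property visible: each entry should be no smaller than the previous one (up to floating-point rounding), and the sequence flattening out is the stopping signal mentioned above.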

56 EM - Comments. No guarantee of convergence to a local maximum. No guarantee on running time: it often takes many iterations to converge. Efficiency: no matrix inversion is needed (unlike, e.g., Newton's method). Generalized EM: there is no need to find the maximum in the M-step. Easy to implement. Numerical stability. Monotone: it is easy to ensure correctness in EM. Interpretation: provides an interpretation for the latent variables.
