Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

1 Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

2 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a random number m. Pr(X = m | X ~ B(n, p)) = C(n, m) p^m (1 − p)^(n − m)

3 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a random number m. The likelihood is defined as: L(p; X = m) = Pr(X = m | X ~ B(n, p))

4 The Likelihood Function Assume we have a set of hypotheses to choose from. Normally a hypothesis will be defined by a set of parameters θ. We do not know θ, but we make some observations and get data D. The likelihood of θ is L(θ; D) = Pr(D | θ). We are interested in the hypothesis that maximizes the likelihood.

5 Example We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a random number m. In this case, the data D is the number m, and the parameter θ is p. The likelihood is L(p; X = m) = Pr(X = m | X ~ B(n, p)) = C(n, m) p^m (1 − p)^(n − m)

6 Maximum Likelihood Estimate The maximum likelihood estimate is argmax_θ L(θ; D). In the example above, the maximum is obtained at p̂ = m/n.
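For concreteness, the closed form p̂ = m/n can be checked numerically. The following Python sketch is not from the slides, and the values n = 20, m = 7 are assumed for illustration:

```python
import numpy as np

# A minimal sketch: the binomial log-likelihood (the constant C(n, m) is dropped)
# is maximized over a fine grid and compared with the closed form m / n.
n, m = 20, 7                                   # assumed example values
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = m * np.log(p_grid) + (n - m) * np.log(1 - p_grid)
p_hat_numeric = p_grid[np.argmax(log_lik)]
print(p_hat_numeric, m / n)                    # both approximately 0.35
```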

7 Reminder: The Normal Distribution

8 Reminder: The Normal Distribution We obtain a set of n independent samples x_1, ..., x_n ~ N(μ, σ²). We want to estimate the model parameters μ and σ².

10 Maximum Likelihood Estimate (MLE) For the normal model, maximizing the likelihood gives μ̂ = (1/n) Σ_{i=1}^n x_i and σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)².
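As a quick illustration of these estimates, here is a small Python sketch (not from the slides) that computes them on synthetic data; the generating parameters are assumptions:

```python
import numpy as np

# A minimal sketch: MLE of the normal parameters from n independent samples.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)      # assumed synthetic data

mu_hat = x.mean()                                  # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()            # MLE of the variance (divides by n, not n - 1)
print(mu_hat, sigma2_hat)                          # close to 2.0 and 2.25
```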

11 Example X_1, ..., X_n ~ U(0, θ). What is the maximum likelihood estimate of θ?

12 Example X_1, ..., X_n ~ U(0, θ). What is the maximum likelihood estimate of θ? Assume X_(1) < ... < X_(n). For θ < X_(n): L(θ; D) = 0. For θ ≥ X_(n): L(θ; D) = 1/θ^n. Maximum likelihood: θ̂ = X_(n).
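A short Python sketch (not from the slides) illustrating this estimator on simulated data; the value θ = 5 is an assumption:

```python
import numpy as np

# A minimal sketch: the MLE of theta for U(0, theta) is the sample maximum,
# since the likelihood 1 / theta^n decreases in theta for theta >= X_(n).
rng = np.random.default_rng(1)
theta_true = 5.0                          # assumed value for illustration
x = rng.uniform(0.0, theta_true, size=200)
theta_hat = x.max()                       # MLE: the largest order statistic X_(n)
print(theta_hat)                          # slightly below 5.0
```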

13 Example: MLE of a Multinomial We are given a universe of possible strings (e.g., words of a language): h_1, ..., h_t ∈ {0,1}^k. Assume a model by which the strings are generated from a multinomial with (unknown) probabilities p_1, ..., p_t. We are given a sample from the multinomial with counts c_1, ..., c_t.

14 Generative Model The strings are drawn from a multinomial with p_1 = 1/4, p_2 = 1/2, p_3 = 1/8, p_4 = 1/8. GOAL: the probabilities are unknown and must be recovered.

15 Generative Model The strings are drawn from a multinomial with p_1 = 1/4, p_2 = 1/2, p_3 = 1/8, p_4 = 1/8. Observed counts: c_1 = 2, c_2 = 3, c_3 = 1, c_4 = 1. GOAL: the probabilities are unknown and must be recovered.

16 MLE of a Multinomial Strings: h_1, ..., h_t ∈ {0,1}^k. Counts: c_1, ..., c_t. L(p_1, ..., p_t; c_1, ..., c_t) = C(n, c_1) C(n − c_1, c_2) ... C(c_t, c_t) p_1^{c_1} p_2^{c_2} ... p_t^{c_t}. Equivalently: maximize Σ_i c_i log(p_i) subject to Σ_i p_i = 1, p_i > 0.

17 Using Lagrange Multipliers We are interested in maximizing Σ_i c_i log(p_i) subject to Σ_i p_i = 1, p_i > 0. Instead, we consider the Lagrange function: max_p Σ_i c_i log(p_i) + λ(1 − Σ_i p_i), subject to p_i > 0. An optimal solution of the original problem corresponds to a stationary point of the Lagrange function.

18 Using Lagrange Multipliers f(p, λ) = Σ_i c_i log(p_i) + λ(1 − Σ_i p_i). Compute the gradient: ∂f/∂p_i = c_i/p_i − λ, ∂f/∂λ = 1 − Σ_i p_i. Equating to zero: p_i = c_i/λ and λ = Σ_i c_i = n, so p̂_i = c_i/n.
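The resulting estimator p̂_i = c_i/n can be illustrated with the counts from the generative-model example; the Python sketch below is not from the slides:

```python
import numpy as np

# A minimal sketch: the Lagrange-multiplier solution p_i = c_i / n,
# applied to the counts c_1, ..., c_4 from the generative-model example.
counts = np.array([2, 3, 1, 1])
n = counts.sum()
p_hat = counts / n                        # MLE of the multinomial probabilities
print(p_hat)                              # approximately [0.286, 0.429, 0.143, 0.143]
```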

19 Bayesian Estimators Maximum likelihood: Advantage: no assumptions are made about the distribution of the parameters (no prior is needed). Disadvantage: in reality we are looking for max_θ Pr(θ | D), not max_θ Pr(D | θ). Is it well defined?

20 Prior and Posterior Sometimes we know something about the PRIOR distribution Pr(θ). Then, based on Bayes' rule, we can calculate the POSTERIOR distribution: Pr(θ | D) = Pr(D | θ) Pr(θ) / Pr(D).

21 MAP (Maximum a posteriori) Maximum a posteriori estimation (MAP) is the mode of the posterior distribution: θ̂_MAP = argmax_θ Pr(θ | D), whereas θ̂_ML = argmax_θ Pr(D | θ).

22 MAP (Maximum a posteriori) Maximum a posteriori estimation (MAP) is the mode of the posterior distribution: θ̂_MAP = argmax_θ Pr(θ | D), whereas θ̂_ML = argmax_θ Pr(D | θ). Since Pr(D) does not depend on θ: θ̂_MAP = argmax_θ Pr(D | θ) Pr(θ).

23 Example Assume x_1, ..., x_n ~ N(μ, 1). Then μ̂_ML = (Σ_{i=1}^n x_i)/n.

24 Normal Prior Assume the prior μ ~ N(0, 1). log Pr(x_1, ..., x_n | μ) = −(n/2) log(2π) − Σ_{i=1}^n (x_i − μ)²/2. log Pr(μ) = −(1/2) log(2π) − μ²/2. Therefore μ̂_MAP = argmax_μ { −μ²/2 − Σ_{i=1}^n (x_i − μ)²/2 }.

25 Normal Prior Assume the prior μ ~ N(0, 1). μ̂_MAP = argmax_μ { −μ² − Σ_{i=1}^n (x_i − μ)² }. Solving: μ̂_MAP = (Σ_{i=1}^n x_i)/(n + 1), compared with μ̂_ML = (Σ_{i=1}^n x_i)/n.
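A small Python sketch (not from the slides) comparing the two estimators on synthetic data; the sample and its generating parameters are assumptions:

```python
import numpy as np

# A minimal sketch: ML vs MAP estimates of a normal mean with unit variance
# and a N(0, 1) prior, following the formulas above.
rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=1.0, size=10)     # assumed small synthetic sample

mu_ml = x.sum() / len(x)                        # sum(x_i) / n
mu_map = x.sum() / (len(x) + 1)                 # sum(x_i) / (n + 1): shrunk toward the prior mean 0
print(mu_ml, mu_map)
```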

26 Posterior of a Normal Prior Assume the prior μ ~ N(0, 1). Pr(μ | x_1, ..., x_n) ∝ exp( −((n+1)/2) (μ − Σ_{i=1}^n x_i/(n+1))² ), i.e., μ | x_1, ..., x_n ~ N( Σ_{i=1}^n x_i/(n+1), 1/(n+1) ).

27 Choosing a prior for B(n,p) X ~ B(n, p). One sample: X = m. p̂_ML = m/n.

28 The Beta Distribution X ~ Beta(α, β), α > 0, β > 0. f(x) = x^(α−1) (1 − x)^(β−1) / B(α, β). μ = E[X] = α/(α + β).

29 Posterior with a Beta Prior X ~ B(n, p). Assume the prior p ~ Beta(α, β). Pr(p | X = m, α, β) ∝ C(n, m) p^m (1 − p)^(n−m) · p^(α−1) (1 − p)^(β−1) / B(α, β). Therefore Pr(p | X = m, α, β) ∝ p^(m+α−1) (1 − p)^(n−m+β−1).

30 Posterior with a Beta Prior Pr(p | X = m, α, β) ∝ p^(m+α−1) (1 − p)^(n−m+β−1), so p | X = m, α, β ~ Beta(m + α, n − m + β), and p̂_MAP = (m + α − 1)/(n + α + β − 2). If the prior distribution is Beta, then the posterior distribution is Beta as well: a conjugate prior.
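A minimal Python sketch (not from the slides) of this conjugate update; the hyperparameters α = β = 2 and the observation n = 10, m = 7 are assumptions:

```python
# A minimal sketch: the Beta-Binomial conjugate update.
# Prior Beta(alpha, beta); observing m heads out of n gives posterior Beta(m + alpha, n - m + beta).
alpha, beta = 2.0, 2.0                          # assumed prior hyperparameters
n, m = 10, 7                                    # assumed observation

post_a, post_b = m + alpha, n - m + beta
p_map = (post_a - 1) / (post_a + post_b - 2)    # posterior mode (MAP estimate)
p_mean = post_a / (post_a + post_b)             # posterior mean
print(post_a, post_b, p_map, p_mean)            # 9.0 5.0 0.666... 0.642...
```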

31 Posterior with a Uniform Prior X ~ B(n, p). Assume the prior p ~ U(0, 1). Pr(p | X = m) ∝ p^m (1 − p)^(n−m), so p | X = m ~ Beta(m + 1, n − m + 1).

32 Posterior with a Uniform Prior X ~ B(n, p). Assume the prior p ~ U(0, 1). p | X = m ~ Beta(m + 1, n − m + 1). p̂_MAP = m/n, while E[p | X = m] = (m + 1)/(n + 2).

33 Classification (Naïve Bayes) Given a new individual, can we predict whether the individual will get a heart attack based on his cholesterol level? Data (cholesterol level, heart attack HA): (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 1), (x_5, 0), (x_6, 0), (x_7, 0).

34 Classification (Naïve Bayes) Data (cholesterol level, heart attack HA): (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 1), (x_5, 0), (x_6, 0), (x_7, 0). Given a new individual, can we predict whether the individual will get a heart attack based on his cholesterol level? Assumption: cholesterol levels are normally distributed with a different mean in the 1 and 0 sets: Pr(x | HA = 1) ~ N(μ_1, σ²), Pr(x | HA = 0) ~ N(μ_0, σ²). The parameters can be estimated using MLE.

35 Classification (Naïve Bayes) Decision rule: predict HA = 1 if Pr(HA = 1 | x) > Pr(HA = 0 | x), i.e., if Pr(x | HA = 1) Pr(HA = 1) > Pr(x | HA = 0) Pr(HA = 0).
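A possible implementation of this rule is sketched below in Python (not from the slides); the cholesterol values are hypothetical, only the 0/1 labels come from the table above:

```python
import numpy as np

# A minimal sketch: a Bayes decision rule with Gaussian class-conditional
# densities fitted by MLE, sharing one variance across the two classes.
chol = np.array([250., 230., 180., 260., 170., 190., 175.])   # hypothetical cholesterol levels
ha = np.array([1, 1, 0, 1, 0, 0, 0])                           # labels from the slide

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu1, mu0 = chol[ha == 1].mean(), chol[ha == 0].mean()
# MLE of the shared variance: pooled squared deviations from each class mean
sigma = np.sqrt(np.mean(np.where(ha == 1, (chol - mu1) ** 2, (chol - mu0) ** 2)))
prior1 = ha.mean()                      # Pr(HA = 1) estimated from the labels

def predict(x):
    # choose the class with the larger (unnormalized) posterior: likelihood * prior
    s1 = gauss_pdf(x, mu1, sigma) * prior1
    s0 = gauss_pdf(x, mu0, sigma) * (1 - prior1)
    return int(s1 > s0)

print(predict(240.0), predict(165.0))   # -> 1, 0
```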

36 Multiple Variables Features x_1, x_2, ..., x_n and a label y. Assumptions: 1. Normal marginal distributions. 2. The variables are independent given y.

37 Multiple Variables

38 Multiple Variables

39 Naïve Bayes A naïve assumption, yet it often works in practice. Interpretation: a weighted sum of evidence. Allows features with different distributions to be incorporated. Requires only small amounts of data.

40 Naïve Bayes Might Break Example (figure with two panels, y = 1 and y = 0): for y = 1 the variables are independent; for y = 0, x_2 = x_1.

41 The Multivariate Normal Distribution

42 The Multivariate Normal Distribution Example (figure).

43 The Multivariate Normal Distribution Notation: Σ denotes the variance-covariance matrix, with Σ_ij = Cov(x_i, x_j). If we do not use Naïve Bayes, we need to estimate O(k²) parameters.

44 Reminder: K-means objective Given: vectors x_1, ..., x_n and a number K. Objective: find centers μ_1, ..., μ_K minimizing Σ_i min_j ||x_i − μ_j||².
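As a reference point, a small Python sketch (not from the slides) that evaluates this objective for given centers; the toy data and centers are assumptions:

```python
import numpy as np

# A minimal sketch: the K-means objective, the sum of squared distances
# of each point to its nearest center.
def kmeans_objective(X, centers):
    # X: (n, d) data matrix, centers: (K, d)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, K) squared distances
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])    # assumed toy data
centers = np.array([[0.1, 0.05], [5.05, 4.9]])
print(kmeans_objective(X, centers))
```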

45 K-Means: A Likelihood Formulation There are unknown clusters S_1, ..., S_k. The points in S_i are normally distributed around a center μ_i. Each point x_i originates from a cluster c_i.

46 Mixture of Gaussians There are unknown clusters S_1, ..., S_k. The points in S_i are normally distributed around a center μ_i. Each point x_i originates from cluster c_i, where cluster j is chosen with probability p_j.

49 In one dimension There are unknown clusters S_1, ..., S_k. The points in S_i are normally distributed around a center μ_i. Each point x_i originates from cluster c_i, where cluster j is chosen with probability p_j.

50 For every i, we choose:

51 The Expectation-Maximization Algorithm Start with a guess: In each iteration t+1 set:

53 The Expectation-Maximization Algorithm By construction:

54 The Expectation-Maximization Algorithm Conclusion: The likelihood is non-decreasing in each iteration. Stopping rule: When the likelihood flattens.

55 Expectation Maximization (EM) D: the given data. θ: the parameters that need to be estimated. Z: the missing (latent) variables. 1. E-step: Q(θ; θ^t) = E_{Z | D, θ^t}[log Pr(D, Z | θ)]. 2. M-step: θ^{t+1} := argmax_θ Q(θ; θ^t).

58 EM - Comments No guarantee of convergence to a local maximum (it may stop at a saddle point). No guarantee on running time; it often takes many iterations to converge. Efficiency: no matrix inversion is needed (unlike, e.g., Newton's method). Generalized EM: there is no need to find the exact maximum in the M-step, only to improve Q. Easy to implement. Numerically stable. Monotone: the likelihood never decreases, which makes it easy to check the correctness of an EM implementation. Interpretation: provides an interpretation for the latent variables.

59 Reminder: Mixture of Gaussians There are unknown clusters S_1, ..., S_k. The points in S_i are normally distributed around a center μ_i. Each point x_i originates from cluster c_i, where cluster j is chosen with probability p_j. Variant: each S_i has its own distribution parameters.

60 Mixture of Gaussians - EM E-step:

62 EM for Mixture of Gaussians. E-step:

63 EM for Mixture of Gaussians. M-step: Relation to K-Means: The E-step assigns each point to a cluster. The M-step finds the new cluster centers.
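Putting the E-step and M-step together, here is a Python sketch (not from the slides) of EM for a one-dimensional mixture of two unit-variance Gaussians; the data and the initial guesses are assumptions:

```python
import numpy as np

# A minimal sketch: EM for a 1-D mixture of two Gaussians with unit variance.
rng = np.random.default_rng(3)
x = np.r_[rng.normal(-2, 1, 100), rng.normal(3, 1, 150)]     # synthetic 1-D data

def gauss_pdf(v, mu):
    return np.exp(-(v - mu) ** 2 / 2) / np.sqrt(2 * np.pi)   # unit variance

p = np.array([0.5, 0.5])        # mixture weights
mu = np.array([-1.0, 1.0])      # initial guesses for the two centers

for _ in range(50):
    # E-step: responsibility of each cluster for each point
    w = p * gauss_pdf(x[:, None], mu)          # shape (n, 2)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: re-estimate the mixture weights and the centers
    p = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

print(p, mu)   # weights near [0.4, 0.6], centers near [-2, 3]
```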

64 Hidden Coin There are two coins with probabilities of Heads p_1 and p_2. In each step, with probability λ we flip the first coin and with probability 1 − λ we flip the second coin. We observe only the sequence of outcomes (1 = Heads): x = (0, 1, 1, 0, 1, ...). Parameters: λ, p_1, p_2.

65 Hidden Coin

66 Hidden Coin E-step: M-step:
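One way to write this EM concretely is sketched below in Python (not from the slides); the observed flips and the initial guesses are assumptions:

```python
import numpy as np

# A minimal sketch: EM for the hidden-coin model. Each observation is one flip;
# the identity of the coin that produced it is the latent variable.
x = np.array([0, 1, 1, 0, 1, 1, 1, 0, 1, 0])   # assumed observed flips (1 = Heads)
lam, p1, p2 = 0.5, 0.6, 0.3                    # assumed initial guesses

for _ in range(100):
    # E-step: posterior probability that each flip came from coin 1
    l1 = lam * p1 ** x * (1 - p1) ** (1 - x)
    l2 = (1 - lam) * p2 ** x * (1 - p2) ** (1 - x)
    w = l1 / (l1 + l2)
    # M-step: re-estimate lambda, p1, p2 from the expected assignments
    lam = w.mean()
    p1 = (w * x).sum() / w.sum()
    p2 = ((1 - w) * x).sum() / (1 - w).sum()

# Note: with a symmetric initial guess (p1 == p2) the responsibilities are constant,
# so EM stays put after one iteration (compare the exercise on the next slide).
print(lam, p1, p2)
```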

67 Hidden Coin Consider the observed flips (1, 1, 0, 0) and start with an initial guess for λ, p_1, p_2. Exercise: show that EM is stuck and we are at a saddle point.

68 Multinomial Revisited We are given a universe of possible strings (e.g., words of a language): h_1, ..., h_t ∈ {0,1}^k. Assume a model by which the strings are generated from a multinomial with (unknown) probabilities p_1, ..., p_t. We are given a sample from the multinomial with counts c_1, ..., c_t.

69 Ambiguous Multinomial We are given a universe of possible strings (e.g., words of a language). Assume a model by which the strings are generated from a multinomial with (unknown) probabilities p_1, ..., p_t. We are given a sample from the multinomial, but due to a technical issue we can only observe sums of pairs of strings.

70 Generative Model The strings are drawn from a multinomial with p_1 = 1/4, p_2 = 1/2, p_3 = 1/8, p_4 = 1/8, but we only observe sums of pairs of the sampled strings. GOAL: the probabilities are unknown and must be recovered.

71 Likelihood Formulation How do we find the maximum? How do we know there is only one solution?

72 Expectation Maximization Data: …: 1/12; 11110: 1/12; 10010: 1/12; 11100: 1/12; 10100: 1/12; 11010: 1/12; 10110: 1/12; 11000: 1/12; 00001: 1/12; 10001: 1/12; 00000: 1/12; 11111: 1/12.

73 Expectation Maximization Data: …: 0.125; 11110: 0.208; 10010: 0.041; 11100: 0.041; 10100: 0.041; 11010: 0.041; 10110: 0.041; 11000: 0.041; 00001: 0.083; 10001: 0.083; 00000: 0.083; 11111: 0.166.

75 Expectation Maximization Data: …: 0.239; 11110: 0.306; 10010: 0.009; 11100: 0.009; 10100: 0.009; 11010: 0.009; 10110: 0.009; 11000: 0.009; 00001: 0.1; 10001: 0.066; 00000: 0.066; 11111: 0.166.

77 After many iterations Data: …: 0.333; 11110: 0.333; 10010: 0; 11100: 0; 10100: 0; 11010: 0; 10110: 0; 11000: 0; 00001: 0.166; 10001: 0; 00000: 0; 11111: 0.166.

78 Expectation Maximization (EM) E-step: M-step:

79 Functions over n-dimensions For a function f(x_1, ..., x_n), the gradient is the vector of partial derivatives: ∇f = (∂f/∂x_1, ..., ∂f/∂x_n). In one dimension this is just the derivative.

80 Gradient Descent Goal: minimize a function f. Algorithm: 1. Start from a point x^0 = (x_1^0, ..., x_n^0). 2. Compute u = ∇f(x_1^i, ..., x_n^i). 3. Update: x^{i+1} = x^i − εu for a small ε > 0. 4. Return to (2), unless converged.
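A minimal Python sketch (not from the slides) of this loop on a simple quadratic; the function, step size, and starting point are assumptions:

```python
import numpy as np

# A minimal sketch: gradient descent on f(x) = (x_1 - 1)^2 + 2 * (x_2 + 3)^2.
def grad_f(x):
    return np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])

x = np.zeros(2)                        # starting point
eps = 0.1                              # step size (assumed)
for _ in range(200):
    u = grad_f(x)
    x = x - eps * u                    # move against the gradient
    if np.linalg.norm(u) < 1e-8:       # converged
        break

print(x)                               # approximately [1, -3]
```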

81 Gradient of g: ∂g/∂p_i = Σ_{h ∈ H : h ∈ C(h_i)} 1 / (Σ_{j : h ∈ C(h_j)} p_j). Projected onto the space Σ_i p_i = 1: u = (∂g/∂p_1 − λ, ..., ∂g/∂p_m − λ), where λ = (1/m) Σ_{i=1}^m ∂g/∂p_i.

82 Gradient descent with projections Start from a point p^0 = (p_1^0, ..., p_m^0). Repeat: 1. The current point is p = (p_1, ..., p_m). 2. Calculate u = (∂g/∂p_1, ..., ∂g/∂p_m). 3. Find a new point, for a small ε > 0: p' = p + εu. 4. Return to step (2) until the gradient is zero or until you reach a corner (p_i = 1 or p_i = 0).
