Lecture 4: Probabilistic Learning


1 DD2431 Autumn, 2015

2 Outline: 1. Maximum Likelihood Methods, Maximum A Posteriori Methods, Bayesian methods; 2. Classification vs Clustering, Heuristic Example: K-means, Expectation Maximization; 3. Model Selection and Overfitting.

3 Classification with Probability Distributions. Given features $x$ and a class $y \in \{y_1, \ldots, y_K\}$, classify by $\hat{k} = \arg\max_k P(y_k \mid x) = \arg\max_k P(y_k)\,P(x \mid y_k)$.
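As a concrete illustration of the rule on slide 3, here is a minimal NumPy sketch of Bayes classification with class-conditional Gaussians; all priors and parameters below are made-up numbers, not values from the lecture.

```python
import numpy as np

# Pick the class k that maximizes P(y_k) * P(x | y_k); the evidence P(x) is
# the same for every class, so it can be dropped from the arg max.

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

priors = np.array([0.5, 0.3, 0.2])   # P(y_k), k = 1..3 (hypothetical)
means  = np.array([-2.0, 0.0, 3.0])  # class-conditional means (hypothetical)
varis  = np.array([1.0, 0.5, 2.0])   # class-conditional variances (hypothetical)

def classify(x):
    scores = priors * gaussian_pdf(x, means, varis)   # prior * likelihood per class
    return int(np.argmax(scores))                     # index of the predicted class

print(classify(-1.5))   # -> 0 (closest to the first class)
print(classify(2.5))    # -> 2
```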

4 Estimation Theory. In the last lecture we assumed we knew the prior $P(y)$, the likelihood $P(x \mid y)$, and the evidence $P(x)$, and we used them to compute the posterior $P(y \mid x)$. How can we obtain this information from observations (data)? Estimation theory / learning.

5 Parametric vs Non-Parametric Estimation. Parametric and non-parametric approaches; Bayesian non-parametric methods integrate out the parameters.

6 Assumption #1: Class Independence. Samples from class $i$ do not influence the estimate for class $j$, $i \neq j$. Generative vs discriminative models.

7 Parameter estimation (cont.). Under the class independence assumption, each distribution is a likelihood of the form $P(x \mid \theta_i)$ for class $i$; in the following we drop the class index and write $P(x \mid \theta)$.

8 Assumption #2: i.i.d. Samples from each class are independent and identically distributed: $D = \{x_1, \ldots, x_N\}$. The likelihood of the whole data set factorizes as $P(D \mid \theta) = P(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^N P(x_i \mid \theta)$, and the log-likelihood becomes $\log P(D \mid \theta) = \sum_{i=1}^N \log P(x_i \mid \theta)$.
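A small sketch of the i.i.d. factorization on slide 8, assuming a univariate Gaussian model and simulated data: the data log-likelihood is just the sum of per-sample log densities.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=2.0, size=100)      # D = {x_1, ..., x_N}, simulated

def log_gaussian(x, mu, var):
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

def log_likelihood(D, mu, var):
    # log P(D | theta) = sum_i log P(x_i | theta) under the i.i.d. assumption
    return np.sum(log_gaussian(D, mu, var))

print(log_likelihood(D, mu=1.0, var=4.0))   # parameters close to the truth
print(log_likelihood(D, mu=5.0, var=4.0))   # worse parameters -> lower value
```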

9 Three Approaches: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian.

10 Maximum likelihood estimation: illustration. Find the parameter vector $\hat{\theta}$ that maximizes $P(D \mid \theta)$ with $D = \{x_1, \ldots, x_n\}$ (figure: likelihood and log-likelihood as functions of $\theta$).

11 Maximum likelihood estimation: illustration (cont.). 1. Estimate the optimal parameters of the model.

12 Maximum likelihood estimation: illustration (cont.). 1. Estimate the optimal parameters of the model; 2. evaluate the predictive distribution $P(x \mid \hat{\theta})$ on new data points.

13 ML estimation of the Gaussian mean. $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$, with $\theta = \{\mu, \sigma^2\}$. Log-likelihood of the data (i.i.d. samples): $\log P(D \mid \theta) = \sum_{i=1}^N \log \mathcal{N}(x_i \mid \mu, \sigma^2) = -N \log(\sqrt{2\pi}\,\sigma) - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2}$. Setting $0 = \frac{d \log P(D \mid \theta)}{d\mu} = \sum_{i=1}^N \frac{x_i - \mu}{\sigma^2} = \frac{\sum_{i=1}^N x_i - N\mu}{\sigma^2}$ gives $\hat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i$.

14 ML estimation of Gaussian parameters: $\hat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i$, $\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \hat{\mu})^2$. Same result as minimizing the sum of squared errors, but here we make the assumptions explicit.
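A minimal NumPy sketch of the ML estimates on slide 14, on simulated data: the estimates are the sample mean and the 1/N sample variance.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=500)   # simulated data, for illustration

mu_ml  = x.mean()                       # hat{mu} = (1/N) sum_i x_i
var_ml = np.mean((x - mu_ml) ** 2)      # hat{sigma}^2 = (1/N) sum_i (x_i - hat{mu})^2

print(mu_ml, var_ml)    # should be close to 3.0 and 1.5**2 = 2.25
```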

15 Problem: few data points. 10 repetitions with 5 points each (figure).

16 Problem: few data points (cont.). 10 repetitions with 5 points each (figure).

17 Maximum a Posteriori Estimation. $\hat{\mu}, \hat{\sigma}^2 = \arg\max_{\mu, \sigma^2} \left[\prod_{i=1}^N P(x_i \mid \mu, \sigma^2)\right] P(\mu, \sigma^2)$, where the prior $P(\mu, \sigma^2)$ needs a nice mathematical form for a closed-form solution: $\hat{\mu}_{MAP} = \frac{N}{N+\gamma}\,\hat{\mu}_{ML} + \frac{\gamma}{N+\gamma}\,\delta$, $\hat{\sigma}^2_{MAP} = \frac{N}{N-\alpha}\,\hat{\sigma}^2_{ML} + \frac{2\beta + \gamma(\delta - \hat{\mu}_{MAP})^2}{N-\alpha}$, where $\alpha, \beta, \gamma, \delta$ are parameters of the prior distribution.
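A sketch of the MAP mean estimate as reconstructed above, with arbitrary (hypothetical) prior parameters gamma and delta; it shows how the ML estimate is shrunk towards the prior mean when there are few data points.

```python
import numpy as np

# Assumed closed form from slide 17:
#   mu_MAP = N/(N+gamma) * mu_ML + gamma/(N+gamma) * delta,
# where delta is the prior mean and gamma controls the prior strength.

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=5)   # few data points, as on slide 15

gamma, delta = 10.0, 0.0                     # hypothetical prior parameters
N = len(x)
mu_ml  = x.mean()
mu_map = N / (N + gamma) * mu_ml + gamma / (N + gamma) * delta

print(mu_ml, mu_map)   # the MAP estimate is pulled from mu_ML towards delta
```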

18 ML, MAP and Point Estimates. Both ML and MAP produce point estimates of $\theta$. Assumption: there is a true value for $\theta$. Advantage: once $\hat{\theta}$ is found, everything is known.

19 Bayesian estimation. Consider $\theta$ as a random variable and characterize it with the posterior distribution $P(\theta \mid D)$ given the data. ML: $D \rightarrow \hat{\theta}_{ML}$; MAP: $D, P(\theta) \rightarrow \hat{\theta}_{MAP}$; Bayes: $D, P(\theta) \rightarrow P(\theta \mid D)$. For new data points, instead of $P(x_{new} \mid \hat{\theta}_{ML})$ or $P(x_{new} \mid \hat{\theta}_{MAP})$, compute $P(x_{new} \mid D) = \int_{\theta \in \Theta} P(x_{new} \mid \theta)\,P(\theta \mid D)\,d\theta$.

20 Bayesian estimation (cont.). We can compute $P(x \mid D)$ instead of $P(x \mid \hat{\theta})$ by integrating the joint density $P(x, \theta \mid D) = P(x \mid \theta)\,P(\theta \mid D)$.

21 Bayesian estimation (cont.). We can compute $P(x \mid D)$ instead of $P(x \mid \hat{\theta})$ by integrating the joint density $P(x, \theta \mid D) = P(x \mid \theta)\,P(\theta \mid D)$: $P(x \mid D) = \int P(x \mid \theta)\,P(\theta \mid D)\,d\theta$ (figure: the joint distribution and its integral over $\theta$, built up over slides 21-23).

24 Bayesian estimation (cont.). Pros: better use of the data; makes a priori assumptions explicit; can be implemented recursively (if there is a conjugate prior, use the posterior $P(\theta \mid D)$ as the new prior); reduces overfitting. Cons: the definition of noninformative priors can be tricky; often requires numerical integration; not widely accepted by traditional (frequentist) statistics.
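A minimal sketch of the recursive update mentioned among the pros, assuming the simplest conjugate case: a Gaussian likelihood with known variance and a Gaussian prior on the mean. After each observation the posterior is used as the new prior; all numbers are illustrative.

```python
import numpy as np

sigma2 = 1.0                 # known observation noise variance
mu, tau2 = 0.0, 100.0        # broad Gaussian prior N(mu, tau2) on the unknown mean

rng = np.random.default_rng(3)
data = rng.normal(loc=4.0, scale=np.sqrt(sigma2), size=20)   # simulated data

for x in data:
    # Standard conjugate update for one observation:
    # precisions add up, the mean is a precision-weighted average.
    precision = 1.0 / tau2 + 1.0 / sigma2
    mu = (mu / tau2 + x / sigma2) / precision
    tau2 = 1.0 / precision   # posterior becomes the prior for the next point

print(mu, tau2)   # posterior mean near 4.0, posterior variance shrinks roughly as sigma2/N
```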

25 Clustering vs Classification (figures: a labelled data set for classification and an unlabelled one for clustering).

26 Fitting complex distributions. We can try to fit a mixture of $K$ distributions, $P(x \mid \theta) = \sum_{k=1}^K \pi_k P(x \mid \theta_k)$, with $\theta = \{\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K\}$. Problem: we do not know which point has been generated by which component of the mixture, so we cannot optimize $P(x \mid \theta)$ directly.

27 Expectation Maximization. Fitting model parameters with missing (latent) variables: $P(x \mid \theta) = \sum_{k=1}^K \pi_k P(x \mid \theta_k)$, with $\theta = \{\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K\}$. A very general idea (applies to many different probabilistic models): augment the data with the missing variables $h_{ik}$, the probability that each data point $x_i$ was generated by each component $k$ of the mixture, and optimize the likelihood of the complete data, $P(x, h \mid \theta)$.

28 Heuristic Example: K-means. Describes each class with a centroid; a point belongs to a class if the corresponding centroid is the closest (Euclidean distance); iterative procedure guaranteed to converge, but not guaranteed to find the optimal solution; used in vector quantization (since the 1950s).

29 K-means: algorithm (a NumPy sketch follows below)
    Data: k (number of desired clusters), n data points x_i
    Result: k clusters
    initialization: assign initial values to the k centroids c_j;
    repeat
        assign each point x_i to the closest centroid c_j;
        compute new centroids as the mean of each group of points;
    until the centroids do not change;
    return k clusters;
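A straightforward NumPy translation of the pseudocode above; the initialization and the toy data are arbitrary, and empty-cluster handling is omitted for brevity.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialization: pick k random data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids as the mean of each group of points
        # (empty-cluster handling omitted in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids do not change
            break
        centroids = new_centroids
    return labels, centroids

# toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)     # roughly [0, 0] and [6, 6], in some order
```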

30 K-means: example (figure: iteration 20, update clusters).

31 K-means: sensitivity to initial conditions (figure: iteration 20, update clusters).

32 K-means: limits of the Euclidean distance. The Euclidean distance is isotropic (the same in all directions in $\mathbb{R}^p$); this favours spherical clusters; the size of the clusters is controlled by their distance.

33 K-means: non-spherical classes (figure: two non-spherical classes).

34 Expectation Maximization (recap). Fitting model parameters with missing (latent) variables: $P(x \mid \theta) = \sum_{k=1}^K \pi_k P(x \mid \theta_k)$, with $\theta = \{\pi_1, \ldots, \pi_K, \theta_1, \ldots, \theta_K\}$. A very general idea (applies to many different probabilistic models): augment the data with the missing variables $h_{ik}$, the probability of assignment of each data point $x_i$ to each component $k$ of the mixture, and optimize the likelihood of the complete data, $P(x, h \mid \theta)$.

35 Mixture of Gaussians. This distribution is a weighted sum of $K$ Gaussian distributions, $P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2)$, where $\pi_1 + \cdots + \pi_K = 1$ and $\pi_k > 0$ ($k = 1, \ldots, K$). This model can describe complex multi-modal probability distributions by combining simpler distributions. (Figure 7.6 from Prince, Computer Vision: models, learning and inference: a complex multimodal density in 1D created as a weighted sum, or mixture, of several constituent normal distributions with different means and variances; to ensure that the final distribution is a valid density, the weights must be positive and sum to one.)
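A short sketch that evaluates the 1D mixture density on slide 35 with hypothetical weights, means, and variances, and checks numerically that it is a valid density.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

pis   = np.array([0.3, 0.5, 0.2])      # mixing weights (positive, sum to 1)
mus   = np.array([-3.0, 0.0, 4.0])     # component means (hypothetical)
varis = np.array([1.0, 0.5, 2.0])      # component variances (hypothetical)

def mixture_pdf(x):
    # P(x) = sum_k pi_k N(x; mu_k, sigma_k^2); x can be a scalar or a 1D array
    return np.sum(pis * gaussian_pdf(np.atleast_1d(x)[:, None], mus, varis), axis=1)

xs = np.linspace(-8, 10, 1000)
p = mixture_pdf(xs)
print((p * (xs[1] - xs[0])).sum())   # ~1.0: a valid (multi-modal) density
```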

36 Mixture of Gaussians. $P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2)$. Learning the parameters of this model from training data $x_1, \ldots, x_n$ is not trivial with the usual, straightforward maximum likelihood approach. Instead, the parameters are learned with the Expectation-Maximization (EM) algorithm.

37 Mixture of Gaussians as a marginalization. We can interpret the Mixture of Gaussians model by introducing a discrete hidden/latent variable $h$ and a joint distribution $P(x, h)$: $P(x) = \sum_{k=1}^K P(x, h = k) = \sum_{k=1}^K P(x \mid h = k)\,P(h = k) = \sum_{k=1}^K \pi_k \mathcal{N}(x; \mu_k, \sigma_k^2)$. (Figure 7.7 from Prince, Computer Vision: models, learning and inference: the mixture of Gaussians can also be thought of in terms of a joint distribution Pr(x, h) between the observed variable x and a discrete hidden variable h; to create the mixture density we marginalize over h; the hidden variable is simply the index of the constituent normal distribution.)

38 EM for two Gaussians. Assume: we know that the pdf of $x$ has the form $P(x) = \pi_1 \mathcal{N}(x; \mu_1, \sigma_1^2) + \pi_2 \mathcal{N}(x; \mu_2, \sigma_2^2)$, where $\pi_1 + \pi_2 = 1$ and $\pi_k > 0$ for components $k = 1, 2$. Unknown: the values of the parameters (many!) $\Theta = (\pi_1, \mu_1, \sigma_1, \mu_2, \sigma_2)$. Have: $n$ observed samples $x_1, \ldots, x_n$ drawn from $P(x)$. Want to: estimate $\Theta$ from $x_1, \ldots, x_n$. How would it be possible to get them all?

39 EM for two Gaussians. For each sample $x_i$ introduce a hidden variable $h_i$, with $h_i = 1$ if sample $x_i$ was drawn from $\mathcal{N}(x; \mu_1, \sigma_1^2)$ and $h_i = 2$ if it was drawn from $\mathcal{N}(x; \mu_2, \sigma_2^2)$, and come up with initial values for each of the parameters, $\Theta^{(0)} = (\pi_1^{(0)}, \mu_1^{(0)}, \sigma_1^{(0)}, \mu_2^{(0)}, \sigma_2^{(0)})$. EM is an iterative algorithm which updates $\Theta^{(t)}$ using the following two steps...

40 EM for two Gaussians: E-step. The responsibility of the $k$-th Gaussian for each sample $x$ (indicated by the size of the projected data point); look at each sample $x$ along the hidden variable $h$ in the E-step. (Figure 7.8 from Prince, Computer Vision: models, learning and inference: for each data point $x_i$ we calculate the posterior distribution $Pr(h_i \mid x_i)$ over the hidden variable $h_i$; the posterior probability $Pr(h_i = k \mid x_i)$ can be understood as the responsibility of normal distribution $k$ for data point $x_i$.)

41 EM for two Gaussians: E-step (cont.). E-step: compute the posterior probability (responsibility) that $x_i$ was generated by component $k$, given the current estimate of the parameters $\Theta^{(t)}$, for $i = 1, \ldots, n$ and $k = 1, 2$:
$\gamma_{ik}^{(t)} = P(h_i = k \mid x_i, \Theta^{(t)}) = \frac{\pi_k^{(t)}\,\mathcal{N}(x_i; \mu_k^{(t)}, \sigma_k^{(t)})}{\pi_1^{(t)}\,\mathcal{N}(x_i; \mu_1^{(t)}, \sigma_1^{(t)}) + \pi_2^{(t)}\,\mathcal{N}(x_i; \mu_2^{(t)}, \sigma_2^{(t)})}$
Note: $\gamma_{i1}^{(t)} + \gamma_{i2}^{(t)} = 1$ and $\pi_1 + \pi_2 = 1$.
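A minimal NumPy version of the E-step formula above for two components; the current parameter values playing the role of Theta^(t) are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

pis   = np.array([0.4, 0.6])    # current mixing weights (hypothetical)
mus   = np.array([-1.0, 2.0])   # current means (hypothetical)
varis = np.array([1.0, 1.0])    # current variances (hypothetical)

def e_step(x):
    # gamma_ik = pi_k N(x_i; ...) / sum_j pi_j N(x_i; ...); returns an (n, 2) array
    weighted = pis * gaussian_pdf(x[:, None], mus, varis)
    return weighted / weighted.sum(axis=1, keepdims=True)   # rows sum to 1

x = np.array([-1.5, 0.5, 2.5])
print(e_step(x))   # each row: [gamma_i1, gamma_i2]
```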

42 EM for two Gaussians: M-step. Fit the Gaussian model for each $k$-th constituent; sample $x_i$ contributes according to the responsibility $\gamma_{ik}$ (dashed and solid lines show the fit before and after the update); look at the samples $x$ for each value of $h$ in the M-step. (Figure 7.9 from Prince, Computer Vision: models, learning and inference: for the $k$-th constituent Gaussian we update the parameters $\{\lambda_k, \mu_k, \Sigma_k\}$; the $i$-th data point $x_i$ contributes to these updates according to the responsibility $r_{ik}$, indicated by the size of the point, assigned in the E-step.)

43 EM for two Gaussians: M-step (cont.). M-step: compute the maximum-likelihood estimates of the parameters of the mixture model given our data's membership distribution, i.e. the responsibilities $\gamma_{ik}^{(t)}$, for $k = 1, 2$:
$\mu_k^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{ik}^{(t)} x_i}{\sum_{i=1}^n \gamma_{ik}^{(t)}}$, $\left(\sigma_k^{(t+1)}\right)^2 = \frac{\sum_{i=1}^n \gamma_{ik}^{(t)} (x_i - \mu_k^{(t+1)})^2}{\sum_{i=1}^n \gamma_{ik}^{(t)}}$, $\pi_k^{(t+1)} = \frac{\sum_{i=1}^n \gamma_{ik}^{(t)}}{n}$.
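Putting the E- and M-steps together, a self-contained sketch of the full EM loop for two Gaussians on simulated data; the initial values are arbitrary and no numerical safeguards are included.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# simulated data: 30% from N(-2, 1), 70% from N(3, 1.5^2)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# initial guesses Theta^(0), chosen arbitrarily
pis   = np.array([0.5, 0.5])
mus   = np.array([-1.0, 1.0])
varis = np.array([1.0, 1.0])

for _ in range(200):
    # E-step: responsibilities gamma_ik (n x 2), each row sums to one
    weighted = pis * gaussian_pdf(x[:, None], mus, varis)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted ML updates (slide 43)
    Nk        = gamma.sum(axis=0)                              # effective counts
    mus_new   = (gamma * x[:, None]).sum(axis=0) / Nk          # weighted means
    varis_new = (gamma * (x[:, None] - mus_new) ** 2).sum(axis=0) / Nk
    pis_new   = Nk / len(x)

    if np.allclose(mus_new, mus) and np.allclose(varis_new, varis):
        break
    mus, varis, pis = mus_new, varis_new, pis_new

print(pis, mus, varis)   # roughly [0.3, 0.7], [-2, 3], [1, 2.25]
```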

44 EM in practice. (Figure 7.10 from Prince, Computer Vision: models, learning and inference: a) the initial model; b) E-step: for each data point, the posterior probability that it was generated from each Gaussian is calculated (indicated by the colour of the point); c) M-step: the mean, variance, and weight of each Gaussian are updated based on these posterior probabilities.)

45 EM properties. Similar to K-means: guaranteed to find a local maximum of the complete-data likelihood; somewhat sensitive to initial conditions. Better than K-means: Gaussian distributions can model clusters with different shapes; all data points are smoothly used to update all parameters.

46 Model Selection and Overfitting (figure).

47 Overfitting (figures: two fits $f(x)$ of the same data).

48 Overfitting: Phoneme Discrimination

49 Occam's Razor. Choose the simplest explanation for the observed data. Important factors: the number of model parameters, the number of data points, and the model fit to the data.

50 Overfitting and Maximum Likelihood. We can make the likelihood arbitrarily large by increasing the number of parameters.
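A small illustration of this point, using least-squares polynomial fits (the ML solution under Gaussian noise, where higher likelihood corresponds to lower squared error) rather than the lecture's Gaussian examples: on the training data alone, a model with more parameters always fits at least as well.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=len(x))   # small noisy sample

for degree in [1, 3, 5, 9]:
    coeffs = np.polyfit(x, y, degree)                 # least-squares / ML fit
    sse = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, sse)    # training error keeps shrinking as parameters are added
```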

51 Occam's Razor and Bayesian Learning. Remember that $P(x_{new} \mid D) = \int_{\theta \in \Theta} P(x_{new} \mid \theta)\,P(\theta \mid D)\,d\theta$. Intuition: more complex models fit the data very well (large $P(D \mid \theta)$), but only for small regions of the parameter space $\Theta$.

52 Summary: 1. Maximum Likelihood Methods, Maximum A Posteriori Methods, Bayesian methods; 2. Classification vs Clustering, Heuristic Example: K-means, Expectation Maximization; 3. Model Selection and Overfitting. If you are interested in learning more, take a look at: C. M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag, 2006.
