Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Size: px

Start display at page:

Download "Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions"

Christian Hampton
5 years ago
Views:

1 DD2431 Autumn, Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K } class ˆ = arg max P(ω )P(x ω ) P(x) Evidence and we used them to compute the Posterior P(y x) How can we obtain this information from observations (data)? Estimation Theory Learning x1

x1 Assumption # 1: Class Independence Parameter estimation (cont.

estimate for class j, i j Generative vs discriminative models xx[,2] each distribution is a lielihood in the form P(x θ i )

2 x1 Assumption # 1: Class Independence Parameter estimation (cont.) class independence assumption: x2 x2 xx[,2] xx[,2] xx[,2] xx[,2] x1 Assumptions: samples from class i do not influence estimate for class j, i j Generative vs discriminative models xx[,2] each distribution is a lielihood in the form P(x θ i ) for class i in the following we drop the class index and tal about P(x θ) Assumption #2: i.i.d. Parametric vs Non-Parametric Estimation Samples from each class are independent and identically distributed: D = {x 1,..., x N } The lielihood of the whole data set can be factorized: P(D θ) = P(x 1,..., x N θ) = N P(x i θ) Parametric Non Parametric And the log-lielihood becomes: log P(D θ) = N log P(x i θ) We only consider parametric methods today

3 Maximum lielihood estimation: Illustration Find parameter vector ˆθ that maximizes P(D θ) with D = {x 1,..., x n } Maximum lielihood estimation: Illustration Find parameter vector ˆθ that maximizes P(D θ) with D = {x 1,..., x n } p(d theta) p(d theta) lielihood log lielihood log lielihood lielihood 1 estimate the optimal parameters of the model Maximum lielihood estimation: Illustration Find parameter vector ˆθ that maximizes P(D θ) with D = {x 1,..., x n } p(d theta) lielihood p(x theta) ML estimation of Gaussian mean N(x µ, σ 2 ) = 1 ] (x µ)2 exp [ 2πσ 2σ 2, with θ = {µ, σ 2 } Log-lielihood of data (i.i.d. samples): log P(D θ) = N ( ) log N(x i µ, σ 2 ) = N log 2πσ N (x i µ) 2 2σ 2 log lielihood 0 = d log P(D θ) dµ = N (x i µ) N σ 2 = x i Nµ σ 2 N 1 estimate the optimal parameters of the model 2 evaluate the predictive distribution on new data points ˆµ = 1 N x i

4 ML estimation of Gaussian parameters Problem: few data points ˆµ = 1 N ˆσ 2 = 1 N N x i N (x i ˆµ) 2 10 repetitions with 5 points each same result by minimizing the sum of square errors! but we mae assumptions explicit Problem: few data points Maximum a Posteriori Estimation 10 repetitions with 5 points each ˆµ, ˆσ 2 = arg max µ,σ 2 [ N ] P(x i µ, σ 2 )P(µ, σ 2 ) where the prior P(µ, σ 2 ) needs a nice mathematical form for closed solution ˆµ MAP = ˆσ 2 MAP = N N + γ ˆµ ML + γ N + γ δ N N α ˆσ2 ML + 2β + γ(δ + ˆµ MAP) 2 N α where α, β, γ, δ are parameters of the prior distribution

5 ML, MAP and Point Estimates Bayesian estimation Both ML and MAP produce point estimates of θ Assumption: there is a true value for θ advantage: once ˆθ is found, everything is nown p(d theta) lielihood log lielihood p(x theta) Consider θ as a random variable characterize θ with the posterior distribution P(θ D) given the data ML: D ˆθ ML MAP: D, P(θ) ˆθ MAP Bayes: D, P(θ) P(θ D) for new data points, instead of P(x new ˆθ ML ) or P(x new ˆθ MAP ), compute: P(x new D) = P(x new θ)p(θ D)dθ θ Θ Bayesian estimation (cont.) Bayesian estimation we can compute P(x D) instead of P(x ˆθ) integrate the joint density P(x, θ D) = P(x θ)p(θ D) we can compute P(x D) instead of P(x ˆθ) integrate the joint density P(x, θ D) = P(x θ)p(θ D) P(x ˆθ) p(d theta) lielihood p(x theta) P(x D) = P(x θ)p(θ D)dθ join dist log lielihood integral

6 Bayesian estimation Bayesian estimation we can compute P(x D) instead of P(x ˆθ) integrate the joint density P(x, θ D) = P(x θ)p(θ D) we can compute P(x D) instead of P(x ˆθ) integrate the joint density P(x, θ D) = P(x θ)p(θ D) join dist join dist P(x D) = P(x θ)p(θ D)dθ P(x D) = P(x θ)p(θ D)dθ integral integral Bayesian estimation (cont.) Clustering vs Classification Pros: Classification Clustering better use of the data Cons: maes a priori assumptions explicit can be implemented recursively (if conjugate prior) use posterior P(θ D) as new prior reduce overfitting definition of noninformative priors can be tricy often requires numerical integration not widely accepted by traditional statistics (frequentism) x2 x2 x1 x1

7 Fitting complex distributions We can try to fit a mixture of K distributions: P(x θ) = π P(x θ ), =1 with θ = {π 1,..., π, θ 1,..., θ K } Problem: We do not now which point has been generated by which component of the mixture We cannot optimize P(x θ) directly Fitting model parameters with missing (latent) variables P(x θ) = π P(x θ ), =1 with θ = {π 1,..., π, θ 1,..., θ K } very general idea (applies to many different probabilistic models) augment the data with the missing variables: h i probability that each data point x i was generated by each component of the mixture optimize the Lielihood of the complete data: P(x, h θ) K-means: algorithm describes each class with a centroid a point belongs to a class if the corresponding centroid is closest (Euclidean distance) iterative procedure guaranteed to converge not guaranteed to find the optimal solution used in vector quantization (since the 1950 s) Data: (number of desired clusters), n data points x i Result: clusters initialization: assign initial value to centroids c i ; repeat assign each point x i to closest centroid c j ; compute new centroids as mean of each group of points; until centroids do not change; return clusters;

8 K-means: example K-means: sensitivity to initial conditions iteration 20, update clusters iteration 20, update clusters K-means: limits of Euclidean distance K-means: non-spherical classes two non spherical classes the Euclidean distance is isotropic (same in all directions in R p ) this favours spherical clusters the size of the clusters is controlled by their distance

9 Fitting model parameters with missing (latent) variables P(x θ) = π P(x θ ), =1 with θ = {π 1,..., π, θ 1,..., θ K } Mixture of Gaussians This distribution is a weight sum of K Gaussian distributions P(x) = π N (x; µ, σ) 2 =1 where π π K = 1 and π > 0 ( = 1,..., K). very general idea (applies to many different probabilistic models) augment the data with the missing variables: h i probability of assignment of each data point x i to each component of the mixture optimize the Lielihood of the complete data: P(x, h θ) This model can describe complex multi-modal probability distributions by combining simpler distributions. Mixture of Gaussians Mixture of Gaussians as a marginalization We can interpret the Mixture of Gaussians model with the introduction of a discrete hidden/latent variable h and P(x, h): P(x) = π N (x; µ, σ) 2 =1 Learning the parameters of this model from training data x 1,..., x n is not trivial - using the usual straightforward maximum lielihood approach. Instead learn parameters using the Expectation-Maximization (EM) algorithm. P(x) = P(x, h = ) = P(x h = )P(h = ) =1 =1 = π N (x; µ, σ) 2 =1 mixture density Figures taen from Computer Vision: models, learning and inference by Simon Prince.

10 EM for two Gaussians EM for two Gaussians Assume: We now the pdf of x has this form: P(x) = π 1 N (x; µ 1, σ1) 2 + π 2 N (x; µ 2, σ2) 2 where π 1 + π 2 = 1 and π > 0 for components = 1, 2. Unnown: Values of the parameters (Many!) Θ = (π 1, µ 1, σ 1, µ 2, σ 2 ). Have: Observed n samples x 1,..., x n drawn from P(x). Want to: Estimate Θ from x 1,..., x n. How would it be possible to get them all??? For each sample x i introduce a hidden variable h i { 1 if sample x i was drawn from N (x; µ 1, σ1) 2 h i = 2 if sample x i was drawn from N (x; µ 2, σ2) 2 and come up with initial values for each of the parameters. Θ (0) = (π (0) 1, µ(0) 1, σ(0) 1, µ(0) 2, σ(0) 2 ) EM is an iterative algorithm which updates Θ (t) using the following two steps... EM for two Gaussians: E-step The responsibility of -th Gaussian for each sample x (indicated by the size of the projected data point) EM for two Gaussians: E-step (cont.) E-step: Compute the posterior probability that x i was generated by component given the current estimate of the parameters Θ (t). (responsibilities) for i = 1,... n for = 1, 2 γ (t) i = P(h i = x i, Θ (t) ) = π (t) N (x i ; µ (t), σ(t) ) π (t) 1 N (x i ; µ (t) 1, σ(t) 1 ) + π(t) 2 N (x i ; µ (t) 2, σ(t) 2 ) Loo at each sample x along hidden variable h in the E-step Note: γ (t) i1 + γ(t) i2 = 1 and π 1 + π 2 = 1 Figure from Computer Vision: models, learning and inference by Simon Prince.

11 EM for two Gaussians: M-step Fitting the Gaussian model for each of -th constinuetnt. Sample x i contributes according to the responsibility γ i. EM for two Gaussians: M-step (cont.) M-step: Compute the Maximum Lielihood of the parameters of the mixture model given out data s membership distribution, the s: γ (t) i for = 1, 2 µ (t+1) = σ (t+1) = n γ(t) i x i, n γ(t) i n γ(t) i (x i µ (t+1) n γ(t) i ) 2, (dashed and solid lines for fit before and after update) Loo along samples x for each h in the M-step Figure from Computer Vision: models, learning and inference by Simon Prince. EM in practice EM properties n π (t+1) = γ(t) i. n Similar to K-means guaranteed to find a local maximum of the complete data lielihood somewhat sensitive to initial conditions Better than K-means Gaussian distributions can model clusters with different shapes all data points are smoothly used to update all parameters

12 Model Selection and Overfitting Overfitting f(x) f(x) x x Overfitting: Phoneme Discrimination Occam s Razor Choose the simplest explanation for the observed data Important factors: number of model parameters number of data points model fit to the data

13 Overfitting and Maximum Lielihood Occam s Razor and Bayesian Learning we can mae the lielihood arbitrary large by increasing the number of parameters Remember that: P(x new D) = θ Θ P(x new θ)p(θ D)dθ Intuition: More complex models fit the data very well (large P(D θ)) but only for small regions of the parameter space Θ. Summary If you are interested in learning more tae a loo at: C. M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag 2006.

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods