Faculty of Mathematics and Informatics, Babeş-Bolyai University, Cluj-Napoca (Kolozsvár), March 2008
Table of Contents 1 2 methods 3 Methods 4
Machine learning
Historical background / Motivation:
  Huge amounts of data that should be processed automatically;
  Mathematics provides general solutions, i.e. solutions that are not tailored to a given problem;
  Need for a science that uses the machinery of mathematics to solve practical problems.
Definitions for Machine learning
Collection of methods (from statistics, probability theory) to solve problems met in practice:
  noise filtering for non-linear regression and/or non-Gaussian noise;
  classification: binary, multiclass, partially labelled;
  clustering, inversion problems, density estimation, novelty detection.
Generally, we need to model the data.
Observation
Real world: there is a function $y = f(x)$, observed at $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.
Observation process: a corrupted datum is collected for a sample $x_n$:
  $t_n = y_n + \epsilon$ (additive noise);
  $t_n = h(y_n, \epsilon)$ ($h$ is a distortion function).
Problem: find the function $y = f(x)$.
Inference
Data set collected: $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.
Assume a function class $F$: polynomial, Fourier expansion, wavelet;
The observation process encodes the noise;
Find the optimal function $f^*(x)$ from the class.
II
We have the data set $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$.
Consider a function class:
  (1) $F = \{ w^T x + b \mid w \in \mathbb{R}^d,\ b \in \mathbb{R} \}$
  (2) $F = \{ a_0 + \sum_{k=1}^{K} a_k \sin(2\pi k x) + \sum_{k=1}^{K} b_k \cos(2\pi k x) \mid a, b \in \mathbb{R}^K,\ a_0 \in \mathbb{R} \}$
Assume an observation process: $y_n = f(x_n) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.
III
  1. The data set: $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$.
  2. Assume a function class $F$ (polynomial, etc.): $F = \{ f(x, \theta) \mid \theta \in \mathbb{R}^p \}$.
  3. Assume an observation process. Define a loss function: $L(y_n, f(x_n, \theta))$.
For Gaussian noise: $L(y_n, f(x_n, \theta)) = (y_n - f(x_n, \theta))^2$.
Parameter estimation
Estimating parameters: finding the optimal value of $\theta$:
  $\theta^* = \arg\min_{\theta \in \Omega} L(D, \theta)$
where $\Omega$ is the domain of the parameters and $L(D, \theta)$ is a loss function for the data set.
Example: $L(D, \theta) = \sum_{n=1}^{N} L(y_n, f(x_n, \theta))$
$L(D, \theta)$: the negative (log-)likelihood function.
Maximum likelihood estimation of the model: $\theta^* = \arg\min_{\theta \in \Omega} L(D, \theta)$
Example, quadratic regression: $L(D, \theta) = \sum_{n=1}^{N} (y_n - f(x_n, \theta))^2$ (the loss factorises over the data).
Drawback: can produce a perfect fit to the data, i.e. over-fitting.
Example of an ML estimate
[Figure: height $h$ (cm) plotted against weight $w$ (kg).]
We want to fit a model to the data.
  Use a linear model: $h = \theta_0 + \theta_1 w$.
  Use a log-linear model: $h = \theta_0 + \theta_1 \log(w)$.
  Use higher order polynomials, e.g.: $h = \theta_0 + \theta_1 w + \theta_2 w^2 + \theta_3 w^3 + \dots$
M.L. for linear models I
Assume:
  a linear model for the $x \to y$ relation: $f(x_n | \theta) = \sum_{l=1}^{d} \theta_l x_l$ with $x = [1, x, x^2, \log(x), \dots]^T$;
  a quadratic loss for $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$: $E_2(D | f) = \sum_{n=1}^{N} (y_n - f(x_n | \theta))^2$
M.L. for linear models II
Minimisation:
  $\sum_{n=1}^{N} (y_n - f(x_n | \theta))^2 = (y - X\theta)^T (y - X\theta) = \theta^T X^T X \theta - 2\theta^T X^T y + y^T y$
Solution:
  $0 = 2 X^T X \theta - 2 X^T y$
  $\theta = (X^T X)^{-1} X^T y$
where $y = [y_1, \dots, y_N]^T$ and $X = [x_1, \dots, x_N]^T$ are the transformed data.
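A minimal sketch of the closed-form solution, assuming NumPy. The weight/height values are invented for illustration and the function name `fit_least_squares` is hypothetical, not part of the slides.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X^T X theta = X^T y for theta."""
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical weight (kg) / height (cm) pairs, invented for illustration.
w = np.array([55.0, 62.0, 70.0, 78.0, 85.0, 95.0])
h = np.array([152.0, 160.0, 168.0, 171.0, 178.0, 183.0])

# Transformed inputs x = [1, w]: one row per observation.
X = np.column_stack([np.ones_like(w), w])
theta = fit_least_squares(X, h)

# At the optimum the residual is orthogonal to the columns of X.
residual = h - X @ theta
```

Replacing the columns of `X` by other transformations (powers of `w`, `log(w)`) gives the other model families without changing the solver.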
M.L. for linear models III
Generalised linear models:
  Use a set of functions $\Phi = [\phi_1(\cdot), \dots, \phi_M(\cdot)]$.
  Project the inputs into the space spanned by $\mathrm{Im}(\Phi)$.
  Have a parameter vector of length $M$: $\theta = [\theta_1, \dots, \theta_M]^T$.
  The model is $\{ \sum_m \theta_m \phi_m(x) \mid \theta_m \in \mathbb{R} \}$.
  The optimal parameter vector is: $\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y$
Summary
There are many candidate model families:
  the degree of the polynomial specifies a model family;
  so does the rank of a Fourier expansion;
  a mixture of {log, sin, cos, ...} is also a family.
Selecting the best family is a difficult modelling problem.
In maximum likelihood there is no control over how good a family is when processing a given data set.
The number of parameters should be smaller than the number of data points.
Maximum a posteriori I
The generalised linear model is powerful, but it can be extremely complex;
With no complexity control, overfitting is a problem.
Aim: to include knowledge in the inference process.
Our beliefs are reflected by the choice of the candidate functions.
Goal:
  Prior knowledge specification using probabilities;
  Using probability theory for consistent estimation;
  Encode the observation noise in the model.
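Maximum a posteriori Data/noise
Probabilistic data description: how likely is it that $\theta$ generated the data:
  $y = f(x) \Leftrightarrow y - f(x) \sim \delta_0$
  $y = f(x) + \epsilon \Leftrightarrow y - f(x) \sim N_\epsilon$
Gaussian noise: $y - f(x) \sim N(0, \sigma^2)$
  $P(y | f(x)) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(y - f(x))^2}{2\sigma^2} \right]$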
Maximum a posteriori Prior
William of Ockham (1285–1349) principle: "Entities should not be multiplied beyond necessity."
Also known as:
  Principle of simplicity;
  KISS;
  "When you hear hoofbeats, think horses, not zebras."
Simple models: small number of parameters ($L_0$ norm, $L_2$ norm).
Probabilistic representation: $p_0(\theta) \propto \exp\left[ -\frac{\|\theta\|_2^2}{2\sigma_0^2} \right]$
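M.A.P. Inference
M.A.P.: probabilities are assigned to $D$ via the log-likelihood function:
  $P(y_n | x_n, \theta, F) \propto \exp\left[ -L(y_n, f(x_n, \theta)) \right]$
$\theta$ prior probabilities:
  $p_0(\theta) \propto \exp\left[ -\frac{\|\theta\|^2}{2\sigma_0^2} \right]$
A posteriori probability:
  $p(\theta | D, F) = \frac{P(D | \theta)\, p_0(\theta)}{p(D | F)}$
where $p(D | F)$ is the probability of the data for a given family.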
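M.A.P. Inference II
M.A.P. estimation finds the $\theta$ with the largest probability:
  $\theta_{MAP} = \arg\max_{\theta \in \Omega} p(\theta | D, F)$
Example: with $L(y_n, f(x_n, \theta))$ and a Gaussian prior:
  $\theta_{MAP} = \arg\max_{\theta \in \Omega} \left[ K - \frac{1}{2} \sum_n L(y_n, f(x_n, \theta)) - \frac{\|\theta\|^2}{2\sigma_0^2} \right]$
$\sigma_0^2 = \infty \Rightarrow$ maximum likelihood (after a change of sign and max $\to$ min).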
M.A.P. Example I
[Figure: sixth-order polynomial M.A.P. fits for two settings of the noise deviation, shown together with the true function and the training data.]
M.A.P. Linear models I
[Figure: height $h$ (cm) plotted against weight $w$ (kg), with one fitted curve per prior width.]
Aim: test different levels of flexibility, $p = 10$.
Prior width: $\sigma_0^2 = 10^6, 10^5, 10^4, 10^3, 10^2, 10^1, 10^0$
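M.A.P. Linear models II
  $\theta_{MAP} = \arg\max_{\theta \in \Omega} \left[ K - \frac{1}{2} \sum_n E_2(y_n, f(x_n, \theta)) - \frac{\|\theta\|^2}{2\sigma_0^2} \right]$
Transform into vector notation:
  $\theta_{MAP} = \arg\max_{\theta \in \Omega} \left[ K - \frac{1}{2} (y - X\theta)^T (y - X\theta) - \frac{\theta^T \theta}{2\sigma_0^2} \right]$
Solve for $\theta$ by differentiation:
  $X^T (y - X\theta) - \frac{1}{\sigma_0^2} I_d\, \theta = 0$
  $\theta_{MAP} = \left( X^T X + \frac{1}{\sigma_0^2} I_d \right)^{-1} X^T y$
Again M.L. for $\sigma_0^2 = \infty$.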
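The M.A.P. solution above can be sketched in a few lines, assuming NumPy. The synthetic data, the cubic feature map and the name `fit_map` are assumptions made only for this illustration.

```python
import numpy as np

def fit_map(X, y, sigma0_sq):
    """M.A.P. estimate (X^T X + I / sigma0^2)^{-1} X^T y."""
    d = X.shape[1]
    A = X.T @ X + np.eye(d) / sigma0_sq
    return np.linalg.solve(A, X.T @ y)

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)
X = np.vander(x, 4, increasing=True)      # columns 1, x, x^2, x^3

theta_wide = fit_map(X, y, 1e6)           # very wide prior: close to M.L.
theta_narrow = fit_map(X, y, 1e-2)        # narrow prior: strong shrinkage
theta_ml = np.linalg.lstsq(X, y, rcond=None)[0]
```

A narrow prior shrinks the parameter vector towards zero, while a very wide prior leaves the estimate practically equal to the maximum likelihood one.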
M.A.P. Summary
Maximum a posteriori models:
  Allow for the inclusion of prior knowledge;
  May protect against overfitting;
  Can measure the fitness of the family to the data. This procedure is called M.L. type II.
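M.A.P. application M.L. II
Idea: instead of computing the most probable value of $\theta$, we can measure the fit of the model $F$ to the data $D$:
  $P(D | F) = \sum_{\theta_l \in \Omega} p(D, \theta_l | F) = \sum_{\theta_l \in \Omega} p(D | \theta_l, F)\, p_0(\theta_l | F)$
Gaussian noise and polynomial of order $K$:
  $\log p(D | F) = \log \int_{\Omega_\theta} d\theta\; p(D | \theta, F)\, p_0(\theta | F) = \log N(y | 0, \Sigma_X)$
  $\log p(D | F) = -\frac{1}{2} \left( N \log(2\pi) + \log |\Sigma_X| + y^T \Sigma_X^{-1} y \right)$
where $\Sigma_X = I_N \sigma_n^2 + X \Sigma_0 X^T$ with $X = [x^0, x^1, \dots, x^K]$ and $\Sigma_0 = \mathrm{diag}(\sigma_0^2, \sigma_1^2, \dots, \sigma_K^2)$.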
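A rough sketch of how the log-evidence formula could be evaluated for several polynomial orders, assuming NumPy. The data are synthetic, the name `log_evidence` is hypothetical, and using one common prior variance for all coefficients is a simplifying assumption.

```python
import numpy as np

def log_evidence(x, y, order, sigma0_sq, sigma_n_sq):
    """log N(y | 0, Sigma_X) with Sigma_X = sigma_n^2 I + X Sigma_0 X^T."""
    N = x.size
    X = np.vander(x, order + 1, increasing=True)   # columns x^0, ..., x^K
    Sigma0 = sigma0_sq * np.eye(order + 1)         # assumption: equal prior variances
    Sigma_X = sigma_n_sq * np.eye(N) + X @ Sigma0 @ X.T
    _, logdet = np.linalg.slogdet(Sigma_X)
    quad = y @ np.linalg.solve(Sigma_X, y)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Synthetic quadratic data, invented for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - x**2 + 0.1 * rng.standard_normal(x.size)

# Score the polynomial families k = 0, ..., 6 and pick the most probable one.
scores = {k: log_evidence(x, y, k, sigma0_sq=10.0, sigma_n_sq=0.01)
          for k in range(7)}
best_order = max(scores, key=scores.get)
```

Families that cannot represent the data are penalised through the quadratic term, and overly flexible ones through the determinant term.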
M.A.P. M.L.II poly
[Figure: $\log p(D | k)$ plotted against $\log(\sigma_0^2)$, one curve per polynomial family.]
Aim: test different models.
Polynomial families: $k = 10, 9, 8, 7, 6, 5, 4, 3, 2, 1$.
Bayesian estimation Intro
M.L. and M.A.P. estimates provide single solutions.
Point estimates lack an assessment of (un)certainty.
Better solution: for a query $x$, the system output is probabilistic: $x \to p(y | x, F)$.
Tool: go beyond the M.A.P. solution and use the a posteriori distribution of the parameters.
Bayesian estimation II
We again use Bayes' rule:
  $p(\theta | D, F) = \frac{P(D | \theta)\, p_0(\theta)}{p(D | F)}$ with $p(D | F) = \int_\Omega d\theta\; P(D | \theta)\, p_0(\theta)$
and exploit the whole posterior distribution of the parameters.
A posteriori parameter estimates: we operate with $p_{post}(\theta) \stackrel{def}{=} p(\theta | D, F)$ and use the total probability rule
  $p(y | D, F) = \sum_{\theta_l \in \Omega_\theta} p(y | \theta_l, F)\, p_{post}(\theta_l)$
in assessing the system output.
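Bayesian estimation Example I
Given the data $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$, estimate the linear fit:
  $y = \theta_0 + \sum_{i=1}^{d} \theta_i x_i = [\theta_0, \theta_1, \dots, \theta_d]\, [1, x_1, \dots, x_d]^T \stackrel{def}{=} \theta^T x$
Gaussian distributions for the noise and the prior:
  $\epsilon = y_n - \theta^T x_n \sim N(0, \sigma_n^2)$
  $\theta \sim N(0, \Sigma_0)$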
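Bayesian estimation Example II
Goal: compute the posterior distribution $p_{post}(\theta)$.
  $p_{post}(\theta) \propto p_0(\theta)\, p(D | \theta, F) = p_0(\theta | \Sigma_0) \prod_{n=1}^{N} P(y_n | \theta^T x_n)$
  $-2 \log p_{post}(\theta) = K_{post} + \frac{1}{\sigma_n^2} (y - X\theta)^T (y - X\theta) + \theta^T \Sigma_0^{-1} \theta$
  $\quad = \theta^T \left( \frac{1}{\sigma_n^2} X^T X + \Sigma_0^{-1} \right) \theta - \frac{2}{\sigma_n^2} \theta^T X^T y + K_{post}$
  $\quad = (\theta - \mu_{post})^T \Sigma_{post}^{-1} (\theta - \mu_{post}) + K_{post}$
where $K_{post}$ collects the terms that do not depend on $\theta$ (its value changes from line to line), and by identification
  $\Sigma_{post} = \left( \frac{1}{\sigma_n^2} X^T X + \Sigma_0^{-1} \right)^{-1}$ and $\mu_{post} = \Sigma_{post} \frac{X^T y}{\sigma_n^2}$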
Bayesian estimation Example III
Bayesian linear model: the posterior distribution for the parameters is a Gaussian with parameters
  $\Sigma_{post} = \left( \frac{1}{\sigma_n^2} X^T X + \Sigma_0^{-1} \right)^{-1}$ and $\mu_{post} = \Sigma_{post} \frac{X^T y}{\sigma_n^2}$
Point estimates are recovered as special cases:
  M.L. if we take $\Sigma_0 \to \infty$ and consider only $\mu_{post}$.
  M.A.P. if we approximate the distribution with a single value at the maximum, i.e. $\mu_{post}$.
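Bayesian estimation Example IV
Prediction for new values $x_*$: use the likelihood $P(y_* | x_*, \theta, F)$, the posterior for $\theta$, and Bayes' rule. The steps:
  $p(y_* | x_*, D, F) = \int_{\Omega_\theta} d\theta\; p(y_* | x_*, \theta, F)\, p_{post}(\theta | D, F)$
  $\quad = \int_{\Omega_\theta} d\theta\; \exp\left[ -\frac{1}{2} \left( K + \frac{(y_* - \theta^T x_*)^2}{\sigma_n^2} + (\theta - \mu_{post})^T \Sigma_{post}^{-1} (\theta - \mu_{post}) \right) \right]$
  $\quad = \int_{\Omega_\theta} d\theta\; \exp\left[ -\frac{1}{2} \left( K + \frac{y_*^2}{\sigma_n^2} - a^T C^{-1} a + Q(\theta) \right) \right]$
where
  $a = \frac{x_* y_*}{\sigma_n^2} + \Sigma_{post}^{-1} \mu_{post}$, $\quad C = \frac{x_* x_*^T}{\sigma_n^2} + \Sigma_{post}^{-1}$,
$Q(\theta) = (\theta - C^{-1} a)^T C\, (\theta - C^{-1} a)$ is the quadratic in $\theta$, and $K$ collects the terms that depend on neither $\theta$ nor $y_*$.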
Bayesian estimation Example V
Integrating out the quadratic in $\theta$ gives the predictive distribution at $x_*$:
  $p(y_* | x_*, D, F) = \exp\left[ -\frac{1}{2} \left( K + \frac{(y_* - x_*^T \mu_{post})^2}{\sigma_n^2 + x_*^T \Sigma_{post} x_*} \right) \right] = N\left( y_* \mid x_*^T \mu_{post},\ \sigma_n^2 + x_*^T \Sigma_{post} x_* \right)$
With the predictive distribution we can:
  measure the variance of the prediction for each point: $\sigma_*^2 = \sigma_n^2 + x_*^T \Sigma_{post} x_*$;
  sample from the parameters and plot the candidate predictors.
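The posterior and the predictive distribution of the Bayesian linear model can be sketched as follows, assuming NumPy. The data are synthetic and the function names `posterior` and `predict` are hypothetical.

```python
import numpy as np

def posterior(X, y, Sigma0, sigma_n_sq):
    """Gaussian posterior (mu_post, Sigma_post) of the linear model."""
    precision = X.T @ X / sigma_n_sq + np.linalg.inv(Sigma0)
    Sigma_post = np.linalg.inv(precision)
    mu_post = Sigma_post @ X.T @ y / sigma_n_sq
    return mu_post, Sigma_post

def predict(x_star, mu_post, Sigma_post, sigma_n_sq):
    """Predictive mean and variance at the query vector x_star."""
    mean = x_star @ mu_post
    var = sigma_n_sq + x_star @ Sigma_post @ x_star
    return mean, var

# Synthetic data y = 1 + 2x + noise, invented for illustration.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 20)
sigma_n_sq = 0.01
y = 1.0 + 2.0 * x + np.sqrt(sigma_n_sq) * rng.standard_normal(x.size)
X = np.column_stack([np.ones_like(x), x])
Sigma0 = 10.0 * np.eye(2)

mu_post, Sigma_post = posterior(X, y, Sigma0, sigma_n_sq)
mean_near, var_near = predict(np.array([1.0, 0.0]), mu_post, Sigma_post, sigma_n_sq)
mean_far, var_far = predict(np.array([1.0, 10.0]), mu_post, Sigma_post, sigma_n_sq)
```

The predictive variance never drops below the noise variance and grows for queries far from the training inputs, which is what the error bars visualise.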
Bayesian example Error bars
[Figure: sixth-order polynomial fit with predictive error bars; noise variance $\sigma^2 = 1$.]
The error bars are the symmetric thin lines.
Bayesian example Predictive samples Third order polynomials are used to approximate the data.
Bayesian estimation Problems
When computing $p_{post}(\theta | D, F)$ we assumed that the posterior can be represented analytically.
In general this is not the case.
Approximations are needed for:
  the posterior distribution;
  the predictive distribution.
In Bayesian modelling an important issue is how we approximate the posterior distribution.
Bayesian estimation Summary
Complete specification of the model: can include prior beliefs about the model.
Accurate predictions: can include prior beliefs about the model.
Computational cost: using models for prediction can be difficult and expensive in time and memory.
Bayesian models: flexible and accurate if priors about the model are used; they can be expensive.
setting
Data can be unlabelled, i.e. no values $y$ are associated with an input $x$.
We want to extract information from $D = \{x_1, \dots, x_N\}$.
We assume that the data, although high-dimensional, spans a manifold of much smaller dimension.
The task is to find the subspace corresponding to the data span.
Models in unsupervised learning
The model of the data is again important: ...; ...; mixture models.
[Figure: three scatter plots in the $(x_1, x_2)$ plane, one per model.]
The PCA model I
Simple data structure: a spherical cluster that is translated, scaled and rotated.
[Figure: scatter plot of the data in the $(x_1, x_2)$ plane.]
We aim to find the principal directions of the data spread.
Principal direction: the direction $u$ along which the data preserves most of its variance.
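The PCA model II
Principal direction:
  $u^* = \arg\max_{\|u\| = 1} \frac{1}{2N} \sum_{n=1}^{N} (u^T x_n - u^T \bar{x})^2$
We pre-process so that $\bar{x} = 0$. Replacing the empirical covariance with $\Sigma_x$:
  $u^* = \arg\max_{\|u\| = 1} \frac{1}{2N} \sum_{n=1}^{N} (u^T x_n)^2 = \arg\max_{u, \lambda} \left[ \frac{1}{2} u^T \Sigma_x u - \frac{\lambda}{2} (\|u\|^2 - 1) \right]$
with $\lambda$ the Lagrange multiplier. Differentiating w.r.t. $u$:
  $\Sigma_x u - \lambda u = 0$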
The PCA model III
The optimum solution must obey $\Sigma_x u^* = \lambda^* u^*$: the eigendecomposition of the covariance matrix.
$(\lambda^*, u^*)$ is an eigenvalue, eigenvector pair of the system.
If we substitute back, the variance along $u^*$ is $u^{*T} \Sigma_x u^* = \lambda^*$.
The optimal solution is obtained when $\lambda^* = \lambda_{max}$.
Principal direction: the eigenvector $u_{max}$ corresponding to the largest eigenvalue of the system.
The PCA model Data mining I
How is this used in data mining? Assume that the data is:
  jointly Gaussian: $x \sim N(m_x, \Sigma_x)$;
  high-dimensional;
  only a few (2) directions are relevant.
[Figure: three-dimensional scatter plot of the data.]
The PCA model Data mining II
How is this used in data mining?
  Subtracting the mean.
  Eigendecomposition.
  Selecting the $K$ eigenvectors corresponding to the $K$ largest eigenvalues.
  Computing the $K$ projections: $z_{nl} = x_n^T u_l$.
The projection uses the matrix $P \stackrel{def}{=} [u_1, \dots, u_K]$ (one eigenvector per column): $Z = XP$, and $z_n$ is used as a compact representation of $x_n$.
[Figure: scatter plot of the data in the $(x_1, x_2)$ plane.]
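The PCA model Data mining III
Reconstruction:
  $\hat{x}_n = \sum_{l=1}^{K} z_{nl} u_l$, or, with matrix notation, $\hat{X} = Z P^T$
where $P$ is the $d \times K$ matrix whose columns are $u_1, \dots, u_K$.
PCA projection analysis:
  $E_{PCA} = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \hat{x}_n\|^2 = \frac{1}{N} \mathrm{tr}\left[ (X - \hat{X})^T (X - \hat{X}) \right]$
  $\quad = \mathrm{tr}\left[ \Sigma_x - P \Sigma_z P^T \right]$
  $\quad = \mathrm{tr}\left[ U \left( \mathrm{diag}(\lambda_1, \dots, \lambda_d) - \mathrm{diag}(\lambda_1, \dots, \lambda_K, 0, \dots) \right) U^T \right]$
  $\quad = \mathrm{tr}\left[ U^T U\, \mathrm{diag}(0, \dots, 0, \lambda_{K+1}, \dots, \lambda_d) \right] = \sum_{l=1}^{d-K} \lambda_{K+l}$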
The PCA model Data mining IV
PCA reconstruction error: the error made using the PCA directions:
  $E_{PCA} = \sum_{l=1}^{d-K} \lambda_{K+l}$
PCA properties:
  The PCA system is orthonormal: $u_l^T u_r = \delta_{l-r}$.
  Reconstruction is fast.
  The spherical assumption is critical.
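A small numerical check of the PCA steps and of the reconstruction error, assuming NumPy. The Gaussian data are synthetic and all variable names are chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic Gaussian data, invented for illustration: N = 500 points, d = 5.
A = rng.standard_normal((5, 5))
X = rng.standard_normal((500, 5)) @ A
X = X - X.mean(axis=0)                     # subtract the mean

Sigma_x = X.T @ X / X.shape[0]             # empirical covariance
eigvals, eigvecs = np.linalg.eigh(Sigma_x) # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder to descending
lam, U = eigvals[order], eigvecs[:, order]

K = 2
P = U[:, :K]                               # d x K, eigenvectors as columns
Z = X @ P                                  # compact representation
X_hat = Z @ P.T                            # reconstruction

# Mean squared reconstruction error; it should equal the discarded eigenvalues.
E_pca = np.sum((X - X_hat) ** 2) / X.shape[0]
discarded = lam[K:].sum()
```

Comparing `E_pca` with `discarded` reproduces the identity derived above.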
PCA application USPS I USPS digits testbed for several models.
PCA application USPS II
USPS characteristics:
  handwritten data, centred and scaled;
  10,000 items of $16 \times 16$ grayscale images.
We plot $k_r = \sum_{l=1}^{r} \lambda_l$ (normalised).
[Figure: normalised cumulative eigenvalue sum plotted against the eigenvalue index.]
Conclusion for the USPS set:
  The normalised $\lambda_1 = 0.24$, so $u_1$ accounts for 24% of the data variance.
  At $r = 10$ more than 70% of the variance is explained.
  At $r = 50$ more than 98%: 50 numbers instead of 256.
PCA application USPS III Visualisation application: Visualisation along the first two eigendirections.
PCA application USPS IV Visualisation application: Detail.
The ICA model I
Start from the PCA: $x = Pz$ is a generative model for the data.
We assumed that:
  $z$ are Gaussian sources;
  $z$ are i.i.d. Gaussian random variables, $z \sim N(0, \mathrm{diag}(\lambda_l))$;
  $x$ are not independent.
In most real data:
  Sources are not Gaussian.
  But sources are independent. We exploit that!
[Figure: scatter plot of the data in the $(x_1, x_2)$ plane.]
The ICA model II
The following model assumption:
  $x = As$
where $s$ are the independent sources and $A$ is the linear mixing matrix.
We are looking for a matrix $B$ that recovers the sources:
  $\hat{s} \stackrel{def}{=} Bx = B(As) = (BA)s$
i.e. $(BA)$ is a permutation and scaling, but retains independence.
The ICA model III
In practice: $\hat{s} \stackrel{def}{=} Bx$ with $\hat{s} = [s_1, \dots, s_K]$ all independent sources.
Independence test: the KL-divergence between the joint distribution and the product of the marginals:
  $B^* = \arg\min_{B \in SO_d} KL\left( p(s_1, s_2) \,\|\, p(s_1) p(s_2) \right)$
where $SO_d$ is the group of matrices with $|B| = 1$.
In ICA we are looking for the matrix $B$ that minimises:
  $\int_\Omega ds\; p(s_1, \dots, s_d) \log p(s_1, \dots, s_d) - \sum_l \int_{\Omega_l} ds_l\; p(s_l) \log p(s_l)$
The KL-divergence Detour
Kullback-Leibler divergence:
  $KL(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
  is zero if and only if $p = q$;
  is not a measure of distance (but close to it);
  is efficient when exponential families are used.
Short proof of $KL(p \| q) \geq 0$:
  $0 = \log 1 = \log \left( \sum_x q(x) \right) = \log \left( \sum_x p(x) \frac{q(x)}{p(x)} \right) \geq \sum_x p(x) \log \left( \frac{q(x)}{p(x)} \right) = -KL(p \| q)$
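The properties listed above can be checked numerically for discrete distributions. This is a minimal sketch assuming NumPy; the function name `kl_divergence` and the two example distributions are invented for illustration, and `q` is assumed to be strictly positive wherever `p` is.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                 # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl_pq = kl_divergence(p, q)      # non-negative
kl_qp = kl_divergence(q, p)      # differs from kl_pq: not symmetric
kl_pp = kl_divergence(p, p)      # zero when both arguments coincide
```

The asymmetry between `kl_pq` and `kl_qp` is the reason the divergence is not a distance.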
ICA Application Data
Separation of source signals:
[Figure: the mixture signals; panel labels m1, m2, m3, m4.]
ICA Application Results
Results of separation:
[Figure: the recovered source signals; panel labels m1, m2, m3, m4.]
Computed with the FastICA package.
Applications of ICA
Applications:
  Cocktail party problem: separates noisy and multiple sources from multiple observations.
  Fetus ECG: separation of the ECG signal of a fetus from its mother's ECG.
  MEG recordings: separation of MEG sources.
  Financial data: finding hidden factors in financial data.
  Noise reduction: noise reduction in natural images.
  Interference removal: interference removal from CDMA (code-division multiple access) communication systems.
The mixture model Introduction
The data structure is more complex: there is more than a single source for the data.
[Figure: scatter plot of the data in the $(x_1, x_2)$ plane.]
The mixture model:
  $P(x | \Sigma) = \sum_{k=1}^{K} \pi_k\, p_k(x | \mu_k, \Sigma_k)$   (1)
where:
  $\pi_1, \dots, \pi_K$ are the mixing proportions;
  $p_k(x | \mu_k, \Sigma_k)$ is the density of a component.
The components are usually called clusters.
The mixture model Data generation The generation process reflects the assumptions about the model. The data generation: first we select from which component, then we sample from the component s density function. When modelling data we do not know: Which point belongs to which cluster. What are the parameters for each density function.
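The two-stage generation process can be sketched as follows, assuming NumPy and Gaussian component densities. The function name `sample_mixture` and the parameter values are hypothetical.

```python
import numpy as np

def sample_mixture(n, weights, means, covs, rng):
    """Draw n points: first pick a component, then sample from its density."""
    K = len(weights)
    labels = rng.choice(K, size=n, p=weights)        # which component
    x = np.empty((n, means.shape[1]))
    for k in range(K):
        idx = labels == k
        x[idx] = rng.multivariate_normal(means[k], covs[k], size=int(idx.sum()))
    return x, labels

# Hypothetical two-component mixture in two dimensions.
rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])

X, labels = sample_mixture(1000, weights, means, covs, rng)
```

When modelling, only `X` is observed; `labels` and the component parameters are exactly the unknowns listed above.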
The mixture model Example I
Old Faithful geyser in Yellowstone National Park. Characterised by:
  intense eruptions;
  differing times between them.
Rule:
  Duration is 1.5 to 5 minutes.
  The length of an eruption helps determine the interval.
  If an eruption lasts less than 2 minutes, the interval will be around 55 minutes.
  If the eruption lasts 4.5 minutes, the interval may be around 88 minutes.
The mixture model Example II
[Figure: scatter plot of the Old Faithful data; horizontal axis: duration, vertical axis: interval between eruptions.]
The longer the duration, the longer the interval.
The linear relation $I = \theta_0 + \theta_1 d$ is not the best.
There are only very few eruptions lasting 3 minutes.
The mixture model I
Assumptions:
  We know the family of the individual density functions: these density functions are parametrised with a few parameters.
  The densities are easily identifiable: if we knew which data belongs to which cluster, the density function would be easy to identify.
Gaussian densities, which are often used, fulfil both conditions.
The mixture model II
The Gaussian mixture model:
  $p(x) = \pi_1 N_1(x | \mu_1, \Sigma_1) + \pi_2 N_2(x | \mu_2, \Sigma_2)$
For known densities (centres and ellipses):
  $p(k | x_n) = \frac{N_k(x_n | \mu_k, \Sigma_k)\, p(k)}{\sum_l N_l(x_n | \mu_l, \Sigma_l)\, p(l)}$
i.e. we know the probability that a datum comes from cluster $k$ (shades from red to green).
For $D$:
  x      p(1|x)   p(2|x)
  x_1    γ_11     γ_12
  ...    ...      ...
  x_N    γ_N1     γ_N2
$\gamma_{nl}$: responsibility of $x_n$ in cluster $l$.
[Figure: the Old Faithful data with the two Gaussian components drawn as centres and ellipses.]
The mixture model III
When $\gamma_{nl}$ is known, the parameters are computed using the data weighted by their responsibilities:
  $(\mu_k, \Sigma_k) = \arg\max_{\mu, \Sigma} \prod_{n=1}^{N} \left( N_k(x_n | \mu, \Sigma) \right)^{\gamma_{nk}}$ for all $k$.
This means:
  $(\mu_k, \Sigma_k) = \arg\max_{\mu, \Sigma} \sum_n \gamma_{nk} \log N(x_n | \mu, \Sigma)$
When making inference we have to find both the responsibility vector and the parameters of the mixture.
Given data $D$:
  Initial guess: $(\mu_1, \Sigma_1), \dots, (\mu_K, \Sigma_K)$;
  Re-estimate the responsibilities $\gamma_{n1}, \gamma_{n2}$ for every $x_1, \dots, x_N$;
  Re-estimate the parameters: $(\mu_1, \Sigma_1), \dots, (\mu_K, \Sigma_K)$.
The mixture model Summary
Responsibilities $\gamma$: the additional latent variables needed to help the computation.
In the mixture model:
  the goal is to fit the model to the data;
  we must also decide which submodel gets a particular datum.
Both are achieved by maximising the log-likelihood function.
The EM algorithm I
  $(\pi^*, \Theta^*) = \arg\max \sum_n \log \left[ \sum_l \pi_l N_l(x_n | \mu_l, \Sigma_l) \right]$
$\Theta = [\mu_1, \Sigma_1, \dots, \mu_K, \Sigma_K]$ is the vector of parameters;
$\pi = [\pi_1, \dots, \pi_K]$ are the shares of the factors.
Problem with the optimisation: the parameters are not separable due to the sum within the logarithm.
Solution: use an approximation.
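The EM algorithm II
  $\log P(D | \pi, \Theta) = \sum_n \log \left[ \sum_l \pi_l N_l(x_n | \mu_l, \Sigma_l) \right] = \sum_n \log \left[ \sum_l p_l(x_n, l) \right]$
Use Jensen's inequality:
  $\log \left( \sum_l p_l(x_n, l | \theta_l) \right) = \log \left( \sum_l q_n(l) \frac{p_l(x_n, l | \theta_l)}{q_n(l)} \right) \geq \sum_l q_n(l) \log \left( \frac{p_l(x_n, l)}{q_n(l)} \right)$
for any distribution $[q_n(1), \dots, q_n(L)]$.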
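Jensen Inequality Detour
[Figure: a concave function $f(z)$ with the chord between $z_1$ and $z_2$, drawn for $\gamma_2 = 0.75$; the chord value $\gamma_1 f(z_1) + \gamma_2 f(z_2)$ lies below $f(\gamma_1 z_1 + \gamma_2 z_2)$.]
Jensen's Inequality: for any concave $f(z)$, any $z_1$ and $z_2$, and any $\gamma_1, \gamma_2 > 0$ such that $\gamma_1 + \gamma_2 = 1$:
  $f(\gamma_1 z_1 + \gamma_2 z_2) \geq \gamma_1 f(z_1) + \gamma_2 f(z_2)$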
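A quick numerical illustration of the inequality for the concave function $f = \log$, assuming NumPy. The points $z_1, z_2$ are arbitrary values chosen for this sketch; the weight $\gamma_2 = 0.75$ follows the figure.

```python
import numpy as np

def jensen_gap(f, z1, z2, g1):
    """f(g1*z1 + g2*z2) - (g1*f(z1) + g2*f(z2)); non-negative for concave f."""
    g2 = 1.0 - g1
    return f(g1 * z1 + g2 * z2) - (g1 * f(z1) + g2 * f(z2))

# Arbitrary positive points with gamma_1 = 0.25, gamma_2 = 0.75.
gap = jensen_gap(np.log, 0.5, 4.0, 0.25)

# The same check on many random pairs and weights.
rng = np.random.default_rng(0)
z = rng.uniform(0.1, 10.0, size=(1000, 2))
g = rng.uniform(0.01, 0.99, size=1000)
gaps = jensen_gap(np.log, z[:, 0], z[:, 1], g)
```

The gap vanishes only when $z_1 = z_2$, which is why the choice of the weights matters in the EM bound.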
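The EM algorithm III
  $\log \left( \sum_l p_l(x_n, l | \theta_l) \right) \geq \sum_l q_n(l) \log \left( \frac{p_l(x_n, l | \theta_l)}{q_n(l)} \right)$
for any distribution $q_n(\cdot)$. Replacing with the right-hand side, we have:
  $\log P(D | \pi, \Theta) \geq \sum_n \sum_l q_n(l) \log \frac{p_l(x_n, l | \theta_l)}{q_n(l)} = \sum_l \left[ \sum_n q_n(l) \log \frac{p_l(x_n, l | \theta_l)}{q_n(l)} \right] = L$
and therefore the optimisation w.r.t. the cluster parameters separates:
  $0 = \sum_n q_n(l) \frac{\partial \log p_l(x_n, l | \theta_l)}{\partial \theta_l}$
For distributions from the exponential family the optimisation is easy.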
The EM algorithm IV
Any set of distributions $q_1(l), \dots, q_N(l)$ provides a lower bound to the log-likelihood.
We should choose the distributions so that the bound is the closest to the log-likelihood at the current parameter set.
We assume the parameters have the value $\theta^0$ and write $P(x_n | \theta^0) = \sum_l p_l(x_n, l | \theta_l^0)$. We want to minimise the difference:
  $\log P(x_n | \theta^0) - L_n = \sum_l q_n(l) \log P(x_n | \theta^0) - \sum_l q_n(l) \log \frac{p_l(x_n, l | \theta_l^0)}{q_n(l)} = \sum_l q_n(l) \log \frac{P(x_n | \theta^0)\, q_n(l)}{p_l(x_n, l | \theta_l^0)}$
and observe that by setting
  $q_n(l) = \frac{p_l(x_n, l | \theta_l^0)}{P(x_n | \theta^0)}$
every logarithm vanishes, so the difference is $\sum_l q_n(l) \cdot 0 = 0$.
The EM algorithm V
The EM algorithm:
  Init: initialise the model parameters;
  E step: compute the responsibilities $\gamma_{nl} = q_n(l)$;
  M step: for each $l$ solve $0 = \sum_n q_n(l) \frac{\partial \log p_l(x_n | \theta_l)}{\partial \theta_l}$;
  Repeat: go to the E step.
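The loop above can be sketched for a Gaussian mixture as follows, assuming NumPy. The data are synthetic (two clusters loosely shaped like the duration/interval example, not the real measurements), the names `gaussian_pdf` and `em_gmm` are hypothetical, and the initialisation of the means from points spread along the first coordinate is a heuristic chosen only for this sketch.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x | mu, Sigma) evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    return np.exp(-0.5 * (d * np.log(2 * np.pi) + logdet + quad))

def em_gmm(X, K, n_iter=50):
    N, d = X.shape
    # Init: equal shares, means spread along the first coordinate (heuristic),
    # every covariance set to the covariance of the whole data set.
    pi = np.full(K, 1.0 / K)
    mu = X[np.argsort(X[:, 0])[np.linspace(0, N - 1, K).astype(int)]]
    Sigma = np.array([np.cov(X.T) for _ in range(K)])
    log_lik = []
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = q_n(k).
        dens = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        log_lik.append(np.sum(np.log(dens.sum(axis=1))))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted shares, means and covariances.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma, np.array(log_lik)

# Synthetic two-cluster data, invented for illustration.
rng = np.random.default_rng(0)
short = rng.normal([2.0, 55.0], [0.3, 6.0], size=(150, 2))
long_ = rng.normal([4.5, 80.0], [0.3, 6.0], size=(150, 2))
X = np.vstack([short, long_])

pi, mu, Sigma, log_lik = em_gmm(X, K=2)
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a convenient sanity check for any EM implementation.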
EM application I-IV
Old Faithful:
[Four figures of the Old Faithful data; horizontal axis: duration, vertical axis: interval between eruptions.]