Lehel Csató, Faculty of Mathematics and Informatics, Babeş-Bolyai University, Cluj-Napoca. March 2008


Faculty of Mathematics and Informatics, Babeş-Bolyai University, Cluj-Napoca (Matematika és Informatika Tanszék, Babeş-Bolyai Tudományegyetem, Kolozsvár). March 2008

Table of Contents

Outline

Machine learning: Historical background and motivation. There is a huge amount of data that should be processed automatically. Mathematics provides general solutions, i.e. solutions not tailored to one given problem. Hence the need for a science that uses the machinery of mathematics to solve practical problems.

Definitions of machine learning: a collection of methods (from statistics and probability theory) to solve problems met in practice. Examples: noise filtering for non-linear regression and/or non-Gaussian noise; classification (binary, multiclass, partially labelled); clustering, inversion problems, density estimation, novelty detection. Generally, we need to model the data.

Real world: there is a function y = f(x), and we observe samples (x_1, y_1), ..., (x_N, y_N). Observation process: a corrupted datum is collected for each sample x_n, either t_n = y_n + ε (additive noise) or t_n = h(y_n, ε) with h a distortion function. Problem: find the function y = f(x).

Inference of f*(x): the data set is collected; assume a function class F (polynomial, Fourier expansion, wavelet, ...); the observation process encodes the noise; find the optimal function from the class.

II. We have the data set D = {(x_1, y_1), ..., (x_N, y_N)}. Consider a function class, for example
(1) F = { w^T x + b : w ∈ R^d, b ∈ R }, or
(2) F = { a_0 + Σ_{k=1}^K a_k sin(2πkx) + Σ_{k=1}^K b_k cos(2πkx) : a, b ∈ R^K, a_0 ∈ R }.
Assume an observation process: y_n = f(x_n) + ε with ε ~ N(0, σ^2).

III. 1. The data set: D = {(x_1, y_1), ..., (x_N, y_N)}. 2. Assume a function class F (polynomial, etc.): F = { f(x, θ) : θ ∈ R^p }. 3. Assume an observation process and define a loss function L(y_n, f(x_n, θ)). For Gaussian noise: L(y_n, f(x_n, θ)) = (y_n − f(x_n, θ))^2.

Outline

Parameter estimation. Estimating parameters means finding the optimal value of θ: θ* = argmin_{θ∈Ω} L(D, θ), where Ω is the domain of the parameters and L(D, θ) is a loss function for the data set. Example: L(D, θ) = Σ_{n=1}^N L(y_n, f(x_n, θ)).

L(D, θ) plays the role of the (negative log-)likelihood function. Maximum likelihood estimation of the model: θ* = argmin_{θ∈Ω} L(D, θ). Example, quadratic regression: L(D, θ) = Σ_{n=1}^N (y_n − f(x_n, θ))^2, since the likelihood factorises over the data. Drawback: it can produce a perfect fit to the data, i.e. over-fitting.
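A minimal sketch of this over-fitting drawback, on toy data of my own choosing (not from the slides): as the polynomial family grows, the training error under the quadratic loss drops towards zero, since a degree N−1 polynomial can interpolate N points.

```python
import numpy as np

# Toy data: noisy samples of a smooth function.
rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 12)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=x.size)

for degree in [1, 3, 6, 11]:
    coeffs = np.polyfit(x, y, degree)                  # ML fit under quadratic loss
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.2e}")
```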

Example of an ML estimate. [Figure: scatter of height h(cm) against weight w(kg).] We want to fit a model to the data: a linear model h = θ_0 + θ_1 w; a log-linear model h = θ_0 + θ_1 log(w); or higher-order polynomials, e.g. h = θ_0 + θ_1 w + θ_2 w^2 + θ_3 w^3 + ...

M.L. for linear models I. Assume a linear model for the x → y relation, f(x_n | θ) = Σ_{l=1}^d θ_l x_l with x = [1, x, x^2, log(x), ...]^T, and the quadratic loss for D = {(x_1, y_1), ..., (x_N, y_N)}: E_2(D | f) = Σ_{n=1}^N (y_n − f(x_n | θ))^2.

M.L. for linear models II. Minimisation: Σ_{n=1}^N (y_n − f(x_n | θ))^2 = (y − Xθ)^T (y − Xθ) = θ^T X^T X θ − 2 θ^T X^T y + y^T y. Solution: 0 = 2 X^T X θ − 2 X^T y, so θ* = (X^T X)^{-1} X^T y, where y = [y_1, ..., y_N]^T and X = [x_1, ..., x_N]^T are the transformed data.

M.L. for linear models III. Generalised linear models: use a set of functions Φ = [φ_1(·), ..., φ_M(·)]; project the inputs into the space spanned by Im(Φ); have a parameter vector of length M, θ = [θ_1, ..., θ_M]^T. The model is { Σ_m θ_m φ_m(x) : θ_m ∈ R }. The optimal parameter vector is θ* = (Φ^T Φ)^{-1} Φ^T y.
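A minimal sketch of the normal-equation solution for a generalised linear model; the basis functions and toy data below are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 40)
y = 1.5 + 2.0 * np.log(x) + rng.normal(scale=0.2, size=x.size)   # toy observations

# Design matrix Phi = [phi_1(x), ..., phi_M(x)]: here a constant, x, and log(x).
Phi = np.column_stack([np.ones_like(x), x, np.log(x)])

# Solve the normal equations theta = (Phi^T Phi)^{-1} Phi^T y; lstsq is the
# numerically safer way to do this.
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("ML estimate of theta:", theta)
```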

Summary. There are many candidate model families: the degree of a polynomial specifies a model family; so does the rank of a Fourier expansion; a mixture of {log, sin, cos, ...} is also a family. Selecting the best family is a difficult modelling problem. In maximum likelihood there is no control over how good a family is for a given data set. We need a smaller number of parameters than data points.

Maximum a posteriori I. A generalised linear model is powerful: it can be extremely complex. With no complexity control, there is an overfitting problem. Aim: include knowledge in the inference process; our beliefs are reflected by the choice of candidate functions. Goal: specify prior knowledge using probabilities; use probability theory for consistent estimation; encode the observation noise in the model.

Maximum a posteriori: data/noise. Probabilistic data description: how likely is it that θ generated the data? Noise-free case: y = f(x), i.e. y − f(x) ~ δ_0. Noisy case: y = f(x) + ε, i.e. y − f(x) ~ N_ε. Gaussian noise: y − f(x) ~ N(0, σ^2), with P(y − f(x)) = (1 / (√(2π) σ)) exp[ −(y − f(x))^2 / (2σ^2) ].

Maximum a posteriori: prior. William of Ockham (1285-1349): "Entities should not be multiplied beyond necessity." Also known as the principle of simplicity, KISS, or "when you hear hoofbeats, think horses, not zebras". Simple models: a small number of parameters (L_0 norm) or small parameter values (L_2 norm). Probabilistic representation: p_0(θ) ∝ exp[ −‖θ‖_2^2 / (2σ_0^2) ].

M.A.P. Inference. Probabilities are assigned to D via the log-likelihood function: P(y_n | x_n, θ, F) ∝ exp[ −L(y_n, f(x_n, θ)) ]. The prior probability of θ is p_0(θ) ∝ exp[ −‖θ‖^2 / (2σ_0^2) ]. A posteriori probability: p(θ | D, F) = P(D | θ) p_0(θ) / p(D | F), where p(D | F) is the probability of the data for a given family.

M.A.P. Inference II. M.A.P. estimation finds the θ with the largest posterior probability: θ_MAP = argmax_{θ∈Ω} p(θ | D, F). Example, with loss L(y_n, f(x_n, θ)) and a Gaussian prior: θ_MAP = argmax_{θ∈Ω} [ K − (1/2) Σ_n L(y_n, f(x_n, θ)) − ‖θ‖^2 / (2σ_0^2) ]. For σ_0^2 → ∞ this reduces to maximum likelihood (after a change of sign, max ↔ min).

M.A.P. Example I. [Figure: degree-6 polynomial MAP fits for two noise-deviation settings, shown together with the true function and the training data.]

M.A.P. Linear models I. [Figure: height h(cm) against weight w(kg) with polynomial MAP fits.] Aim: test different levels of flexibility, with p = 10 and prior widths σ_0^2 = 10^6, 10^5, 10^4, 10^3, 10^2, 10^1, 10^0.

M.A.P. Linear models II. θ_MAP = argmax_{θ∈Ω} [ K − (1/2) Σ_n E_2(y_n, f(x_n, θ)) − ‖θ‖^2 / (2σ_0^2) ]. Transform into vector notation: θ_MAP = argmax_{θ∈Ω} [ K − (1/2)(y − Xθ)^T (y − Xθ) − θ^T θ / (2σ_0^2) ]. Solve for θ by differentiation: X^T (y − Xθ) − (1/σ_0^2) I_d θ = 0, hence θ_MAP = ( X^T X + (1/σ_0^2) I_d )^{-1} X^T y. Again M.L. for σ_0^2 → ∞.
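A minimal sketch of this MAP formula on assumed toy data: a wide prior (large σ_0^2) reproduces the ML solution, while a narrow prior shrinks the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))                  # 30 samples, p = 10 features
theta_true = np.zeros(10); theta_true[:3] = [2.0, -1.0, 0.5]
y = X @ theta_true + rng.normal(scale=0.3, size=30)

def theta_map(X, y, sigma0_sq):
    # theta_MAP = (X^T X + I/sigma_0^2)^{-1} X^T y, as on the slide.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(d) / sigma0_sq, X.T @ y)

for s in [1e6, 1e2, 1e0]:                      # wide prior ~ ML, narrow prior shrinks theta
    print(f"sigma0^2 = {s:g}:", np.round(theta_map(X, y, s), 3))
```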

M.A.P. Summary. Maximum a posteriori models: allow for the inclusion of prior knowledge; may protect against overfitting; can measure the fitness of the family to the data. The latter procedure is called M.L. type II.

M.A.P. application: M.L. II. Idea: instead of computing the most probable value of θ, we can measure the fit of the model family F to the data D:
P(D | F) = Σ_{θ_l∈Ω} p(D, θ_l | F) = Σ_{θ_l∈Ω} p(D | θ_l, F) p_0(θ_l | F).
For Gaussian noise and a polynomial of order K:
log p(D | F) = log ∫_{Ω_θ} dθ p(D | θ, F) p_0(θ | F) = log N(y | 0, Σ_X),
log p(D | F) = −(1/2) [ N log(2π) + log|Σ_X| + y^T Σ_X^{-1} y ],
where Σ_X = σ_n^2 I_N + X Σ_0 X^T, with X = [x^0, x^1, ..., x^K] and Σ_0 = diag(σ_0^2, σ_1^2, ..., σ_K^2).
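A minimal sketch of this evidence computation; the toy data, noise variance and unit prior variances below are assumptions made for illustration.

```python
import numpy as np

def log_evidence(X, y, sigma_n_sq, Sigma0):
    # log p(D|F) = -0.5 * (N log 2pi + log|Sigma_X| + y^T Sigma_X^{-1} y),
    # with Sigma_X = sigma_n^2 I + X Sigma_0 X^T, as on the slide.
    N = X.shape[0]
    Sigma_X = sigma_n_sq * np.eye(N) + X @ Sigma0 @ X.T
    _, logdet = np.linalg.slogdet(Sigma_X)
    quad = y @ np.linalg.solve(Sigma_X, y)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=25)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=25)

# Compare polynomial families of different order k by their evidence.
for k in range(1, 6):
    X = np.vander(x, k + 1, increasing=True)     # columns [x^0, x^1, ..., x^k]
    Sigma0 = np.eye(k + 1)                        # assumed prior variance 1 per coefficient
    print(f"order {k}: log p(D|F) = {log_evidence(X, y, 0.01, Sigma0):.2f}")
```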

M.A.P. and M.L. II for polynomials. [Figure: log p(D | k) as a function of log σ_0^2.] Aim: test different models, i.e. polynomial families with k = 10, 9, ..., 1.

Bayesian estimation: intro. M.L. and M.A.P. estimates provide single solutions; such point estimates lack an assessment of uncertainty. A better solution: for a query x, the system output is probabilistic, x → p(y | x, F). Tool: go beyond the M.A.P. solution and use the a-posteriori distribution of the parameters.

Bayesian estimation II. We again use Bayes' rule, p(θ | D, F) = P(D | θ) p_0(θ) / p(D | F) with p(D | F) = ∫_Ω dθ P(D | θ) p_0(θ), and exploit the whole posterior distribution of the parameters. A-posteriori parameter estimates: we operate with p_post(θ) := p(θ | D, F) and use the total probability rule p(y | D, F) = Σ_{θ_l∈Ω_θ} p(y | θ_l, F) p_post(θ_l) in assessing the system output.

Bayesian estimation, Example I. Given the data D = {(x_1, y_1), ..., (x_N, y_N)}, estimate the linear fit y = θ_0 + Σ_{i=1}^d θ_i x_i = [θ_0, θ_1, ..., θ_d]^T [1, x_1, ..., x_d] := θ^T x. Gaussian distributions for noise and prior: ε = y_n − θ^T x_n ~ N(0, σ_n^2) and θ ~ N(0, Σ_0).

Bayesian estimation, Example II. Goal: compute the posterior distribution p_post(θ).
p_post(θ) ∝ p_0(θ) p(D | θ, F) = p_0(θ | Σ_0) Π_{n=1}^N P(y_n | θ^T x_n),
−2 log(p_post(θ)) = K_post + (1/σ_n^2)(y − Xθ)^T (y − Xθ) + θ^T Σ_0^{-1} θ
= θ^T ( (1/σ_n^2) X^T X + Σ_0^{-1} ) θ − 2 θ^T X^T y / σ_n^2 + K_post
= (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) + K'_post,
and by identification Σ_post = ( (1/σ_n^2) X^T X + Σ_0^{-1} )^{-1} and μ_post = Σ_post X^T y / σ_n^2.

Bayesian estimation, Example III. Bayesian linear model: the posterior distribution of the parameters is a Gaussian with Σ_post = ( (1/σ_n^2) X^T X + Σ_0^{-1} )^{-1} and μ_post = Σ_post X^T y / σ_n^2. Point estimates are recovered as follows: M.L. if we take Σ_0 → ∞ and consider only μ_post; M.A.P. if we approximate the distribution with a single value at its maximum, i.e. μ_post.

Bayesian estimation, Example IV. Prediction for a new value x*: use the likelihood P(y | x*, θ, F), the posterior for θ, and Bayes' rule. The steps:
p(y | x*, D, F) = ∫_{Ω_θ} dθ p(y | x*, θ, F) p_post(θ | D, F)
= ∫_{Ω_θ} dθ exp[ −(1/2) ( K + (y − θ^T x*)^2 / σ_n^2 + (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) ) ]
= ∫_{Ω_θ} dθ exp[ −(1/2) ( K + y^2 / σ_n^2 − a^T C^{-1} a + Q(θ) ) ],
where a = x* y / σ_n^2 + Σ_post^{-1} μ_post and C = x* x*^T / σ_n^2 + Σ_post^{-1}.

Bayesian estimation, Example V. Integrating out the quadratic in θ gives the predictive distribution at x*:
p(y | x*, D, F) = exp[ −(1/2) ( K + (y − x*^T μ_post)^2 / (σ_n^2 + x*^T Σ_post x*) ) ] = N( y | x*^T μ_post, σ_n^2 + x*^T Σ_post x* ).
With the predictive distribution we can: measure the variance of the prediction at each point, σ*^2 = σ_n^2 + x*^T Σ_post x*; sample from the parameters and plot the candidate predictors.
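A minimal sketch of the posterior and predictive formulas above, with assumed toy data and an assumed broad prior.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=20)
y = 0.5 + 1.2 * x + rng.normal(scale=0.3, size=20)
X = np.column_stack([np.ones_like(x), x])             # design matrix [1, x]

sigma_n_sq = 0.3**2
Sigma0 = np.eye(2) * 10.0                             # broad prior on the two coefficients

# Sigma_post = (X^T X / sigma_n^2 + Sigma_0^{-1})^{-1}, mu_post = Sigma_post X^T y / sigma_n^2.
Sigma_post = np.linalg.inv(X.T @ X / sigma_n_sq + np.linalg.inv(Sigma0))
mu_post = Sigma_post @ X.T @ y / sigma_n_sq

# Predictive mean and variance at a query point x_* = 2.5 (in feature form [1, x]).
x_star = np.array([1.0, 2.5])
mean = x_star @ mu_post
var = sigma_n_sq + x_star @ Sigma_post @ x_star
print(f"predictive mean {mean:.3f}, predictive std {np.sqrt(var):.3f}")
```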

Bayesian example: error bars. [Figure: degree-6 polynomial with noise variance σ^2 = 1; the errors are the symmetric thin lines around the prediction.]

Bayesian example Predictive samples Third order polynomials are used to approximate the data.

Bayesian estimation: problems. When computing p_post(θ | D, F) we assumed that the posterior can be represented analytically. In general this is not the case, and approximations are needed for the posterior distribution and for the predictive distribution. In Bayesian modelling an important issue is how we approximate the posterior distribution.

Bayesian estimation: summary. Complete specification of the model: can include prior beliefs about the model. Accurate predictions. Computational cost: using the models for prediction can be difficult and expensive in time and memory. In short, Bayesian models are flexible and accurate if priors about the model are used, but they can be expensive.

Outline

The unsupervised setting. Data can be unlabeled, i.e. no values y are associated with an input x. We want to extract information from D = {x_1, ..., x_N}. We assume that the data, although high-dimensional, span a manifold of much smaller dimension. The task is to find the subspace corresponding to the data span.

Models in unsupervised learning. The model of the data is again important: PCA; ICA; mixture models. [Figures: three example data sets in the (x_1, x_2) plane.]

The PCA model I. Simple data structure: a spherical cluster that is translated, scaled, and rotated. [Figure: elongated cluster in the (x_1, x_2) plane.] We aim to find the principal directions of the data spread. Principal direction: the direction u along which the data preserve most of their variance.

The PCA model II. Principal direction: u* = argmax_{‖u‖=1} (1/2N) Σ_{n=1}^N (u^T x_n − u^T x̄)^2, where we pre-process the data so that x̄ = 0. Replacing the empirical covariance with Σ_x:
u* = argmax_{‖u‖=1} (1/2N) Σ_{n=1}^N (u^T x_n − u^T x̄)^2 = argmax_{u,λ} (1/2) u^T Σ_x u − λ(‖u‖^2 − 1),
with λ the Lagrange multiplier. Differentiating w.r.t. u: Σ_x u − λu = 0.

The PCA model III. The optimum solution must obey Σ_x u = λu: the eigendecomposition of the covariance matrix. (λ*, u*) is an eigenvalue-eigenvector pair of the system. Substituting back, the value of the objective is λ*, so the optimum is attained at λ* = λ_max. Principal direction: the eigenvector u_max corresponding to the largest eigenvalue of the system.

The PCA model, data mining I. How is this used in data mining? Assume that the data are jointly Gaussian, x ~ N(m_x, Σ_x), high-dimensional, and that only a few (say 2) directions are relevant. [Figure: 3-d point cloud concentrated around a low-dimensional subspace.]

The PCA model, data mining II. How is this used in data mining? Subtract the mean; compute the eigendecomposition; select the K eigenvectors corresponding to the K largest eigenvalues; compute the K projections z_nl = x_n^T u_l. Using the projection matrix P := [u_1, ..., u_K], the projections are Z = XP, and z_n is used as a compact representation of x_n. [Figure: projected data in the (x_1, x_2) plane.]

The PCA model, data mining III. Reconstruction: x̂_n = Σ_{l=1}^K z_nl u_l, or, in matrix notation, X̂ = Z P^T. PCA projection analysis:
E_PCA = (1/N) Σ_{n=1}^N ( ‖x_n‖^2 − ‖x̂_n‖^2 ) = (1/N) tr[ (X − X̂)^T (X − X̂) ]
= tr[ Σ_x − P Σ_z P^T ]
= tr[ U ( diag(λ_1, ..., λ_d) − diag(λ_1, ..., λ_K, 0, ...) ) U^T ]
= tr[ U^T U diag(0, ..., 0, λ_{K+1}, ..., λ_d) ]
= Σ_{l=1}^{d−K} λ_{K+l}.

The PCA model, data mining IV. PCA reconstruction error: the error made using the PCA directions is E_PCA = Σ_{l=1}^{d−K} λ_{K+l}. PCA properties: the PCA system is orthonormal, u_l^T u_r = δ_{lr}; reconstruction is fast; the spherical assumption is critical.
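A minimal sketch of PCA as eigendecomposition of the covariance, on assumed synthetic data: project onto the K leading eigenvectors and check that the average reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
X = rng.normal(size=(500, 5)) @ A                      # correlated 5-d data
X = X - X.mean(axis=0)                                 # subtract the mean

Sigma = X.T @ X / X.shape[0]
lam, U = np.linalg.eigh(Sigma)                         # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]                         # sort descending

K = 2
P = U[:, :K]                                           # d x K matrix of leading eigenvectors
Z = X @ P                                              # compact representations z_n
X_hat = Z @ P.T                                        # reconstructions

err = np.mean(np.sum((X - X_hat)**2, axis=1))
print("reconstruction error:", err, " sum of discarded eigenvalues:", lam[K:].sum())
```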

PCA application, USPS I. The USPS digits data set is a testbed for several models.

PCA application, USPS II. USPS characteristics: handwritten digit data, centred and scaled; 10,000 items of 16×16 grayscale images. We plot the normalised cumulative spectrum k_r = Σ_{l=1}^r λ_l. [Figure: normalised cumulative eigenvalue spectrum.] Conclusions for the USPS set: the normalised λ_1 = 0.24, so u_1 accounts for 24% of the data variance; at 10 eigendirections more than 70% of the variance is explained; at 50, more than 98%: 50 numbers instead of 256.

PCA application USPS III Visualisation application: Visualisation along the first two eigendirections.

PCA application USPS IV Visualisation application: Detail.

The ICA model I. Start from PCA: x = Pz is a generative model for the data. We assumed that the z are i.i.d. Gaussian random variables, z ~ N(0, diag(λ_l)); the components of x are not independent; the z are Gaussian sources. In most real data the sources are not Gaussian, but they are independent. We exploit that. [Figure: non-Gaussian source data in the (x_1, x_2) plane.]

The ICA model II. Model assumption: x = As, where s are independent sources and A is a linear mixing matrix. We look for a matrix B that recovers the sources: ŝ := Bx = B(As) = (BA)s, i.e. (BA) is a permutation and scaling, but it retains independence.

The ICA model III. In practice ŝ := Bx with s = [s_1, ..., s_K] all independent sources. Independence test: the KL divergence between the joint distribution and the product of the marginals, B* = argmin_{B∈SO_d} KL( p(s_1, s_2) ‖ p(s_1) p(s_2) ), where SO_d is the group of matrices with |B| = 1. In ICA we are looking for the matrix B that minimises ∫_Ω ds p(s) log p(s_1, ..., s_d) − Σ_l ∫_{Ω_l} ds_l p(s_l) log p(s_l).

The KL-divergence, detour. Kullback-Leibler divergence: KL(p ‖ q) = Σ_x p(x) log( p(x) / q(x) ). It is zero if and only if p = q; it is not a measure of distance (but close to it!); it is efficient when exponential families are used. Short proof that KL(p ‖ q) ≥ 0:
0 = log 1 = log Σ_x q(x) = log Σ_x p(x) ( q(x) / p(x) ) ≥ Σ_x p(x) log( q(x) / p(x) ) = −KL(p ‖ q).
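A minimal sketch of the discrete KL divergence; the two distributions below are arbitrary examples.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))     # non-negative and asymmetric, so not a true distance
print(kl(p, p))               # zero if and only if p = q
```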

ICA application: data. Separation of source signals. [Figure: observed mixtures of the sources m1-m4.]

ICA application: results. Results of the separation. [Figure: recovered sources m1-m4.] Computed with the FastICA package.
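A minimal sketch of blind source separation on synthetic signals; it uses scikit-learn's FastICA implementation (an assumption on my part, not the original FastICA package mentioned on the slide).

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t),                    # smooth source
                     np.sign(np.sin(3 * t)),           # square-wave source
                     rng.laplace(size=t.size)])        # heavy-tailed source
A = rng.normal(size=(3, 3))                            # unknown mixing matrix
X = S @ A.T                                            # observed mixtures x = A s

ica = FastICA(n_components=3, random_state=0)
S_hat = ica.fit_transform(X)                           # recovered sources (up to permutation/scale)
print("correlation of the first recovered source with each true source:")
print([round(np.corrcoef(S_hat[:, 0], S[:, k])[0, 1], 3) for k in range(3)])
```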

Applications of ICA: the cocktail-party problem (separating noisy, multiple sources from multiple observations); fetal ECG (separation of the ECG signal of a fetus from its mother's ECG); MEG recordings (separation of MEG sources); financial data (finding hidden factors); noise reduction in natural images; interference removal from CDMA (code-division multiple access) communication systems.

The mixture model: introduction. The data structure is more complex: there is more than a single source for the data. [Figure: two clusters in the (x_1, x_2) plane.] The mixture model: P(x | Σ) = Σ_{k=1}^K π_k p_k(x | µ_k, Σ_k), (1) where π_1, ..., π_K are the mixing components and p_k(x | µ_k, Σ_k) is the density of a component. The components are usually called clusters.

The mixture model: data generation. The generation process reflects the assumptions about the model. Data generation: first select which component, then sample from that component's density function. When modelling data we do not know which point belongs to which cluster, nor the parameters of each density function.

The mixture model, Example I. The Old Faithful geyser in Yellowstone National Park is characterised by intense eruptions and differing times between them. Rule of thumb: the duration is 1.5 to 5 minutes, and the length of an eruption helps determine the interval. If an eruption lasts less than 2 minutes, the interval will be around 55 minutes; if it lasts 4.5 minutes, the interval may be around 88 minutes.

The mixture model, Example II. [Figure: interval between eruptions versus duration.] The longer the duration, the longer the interval, but the linear relation I = θ_0 + θ_1 d is not the best: there are only very few eruptions lasting about 3 minutes.

The mixture model I. Assumptions: we know the family of the individual density functions, and these densities are parametrised with a few parameters; the densities are easily identifiable: if we knew which data belong to which cluster, each density function would be easy to identify. Gaussian densities are often used; they fulfil both conditions.

The mixture model II. The Gaussian mixture model: p(x) = π_1 N_1(x | µ_1, Σ_1) + π_2 N_2(x | µ_2, Σ_2). For known densities (centres and ellipses), p(k | x_n) = N_k(x_n | µ_k, Σ_k) p(k) / Σ_l N_l(x_n | µ_l, Σ_l) p(l), i.e. we know the probability that a data point comes from cluster k. For D we obtain a table of responsibilities: x_1 → (γ_11, γ_12), ..., x_N → (γ_N1, γ_N2), where γ_nl is the responsibility of x_n in cluster l. [Figure: Old Faithful data coloured by responsibility.]

The mixture model III. When the γ_nl are known, the parameters are computed using the data weighted by their responsibilities: (µ_k, Σ_k) = argmax_{µ,Σ} Π_{n=1}^N ( N_k(x_n | µ, Σ) )^{γ_nk} for all k; this means that (µ_k, Σ_k) maximise Σ_n γ_nk log N(x_n | µ, Σ). When making inference we have to find both the responsibility vectors and the parameters of the mixture. Given data D: make an initial guess (µ_1, Σ_1), ..., (µ_K, Σ_K); re-estimate the responsibilities γ_n1, ..., γ_nK for every x_n; re-estimate the parameters (µ_1, Σ_1), ..., (µ_K, Σ_K); repeat.
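A minimal sketch of the responsibility computation γ_nk = π_k N_k(x_n) / Σ_l π_l N_l(x_n); the points and mixture parameters below are illustrative values, and scipy is assumed for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

X = np.array([[1.8, 55.0], [4.5, 85.0], [3.2, 70.0]])     # e.g. (duration, interval) pairs
pis = np.array([0.4, 0.6])
mus = [np.array([2.0, 55.0]), np.array([4.4, 80.0])]
covs = [np.diag([0.1, 30.0]), np.diag([0.2, 40.0])]

# Weighted component densities, then normalise each row to get responsibilities.
dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k])
                        for k in range(2)])
gamma = dens / dens.sum(axis=1, keepdims=True)            # rows sum to one
print(np.round(gamma, 3))
```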

The mixture model: summary. Responsibilities γ are the additional latent variables needed to help the computation. In the mixture model the goal is to fit the model to the data and to decide which submodel is responsible for a particular data point; this is achieved by maximisation of the log-likelihood function.

The EM algorithm I. (π*, Θ*) = argmax Σ_n log[ Σ_l π_l N_l(x_n | µ_l, Σ_l) ], where Θ = [µ_1, Σ_1, ..., µ_K, Σ_K] is the vector of parameters and π = [π_1, ..., π_K] are the shares of the factors. Problem with the optimisation: the parameters are not separable due to the sum inside the logarithm. Solution: use an approximation.

The EM algorithm II. log P(D | π, Θ) = Σ_n log[ Σ_l π_l N_l(x_n | µ_l, Σ_l) ] = Σ_n log[ Σ_l p_l(x_n, l) ]. Use Jensen's inequality:
log( Σ_l p_l(x_n, l | θ_l) ) = log( Σ_l q_n(l) p_l(x_n, l | θ_l) / q_n(l) ) ≥ Σ_l q_n(l) log( p_l(x_n, l) / q_n(l) )
for any [q_n(1), ..., q_n(L)].

Jensen's inequality, detour. [Figure: a concave f(z) with the chord between z_1 and z_2, γ_2 = 0.75.] For any concave f(z), any z_1 and z_2, and any γ_1, γ_2 > 0 such that γ_1 + γ_2 = 1: f(γ_1 z_1 + γ_2 z_2) ≥ γ_1 f(z_1) + γ_2 f(z_2).

The EM algorithm III. log( Σ_l p_l(x_n, l | θ_l) ) ≥ Σ_l q_n(l) log( p_l(x_n, l) / q_n(l) ) for any distribution q_n(·). Replacing the log-likelihood with the right-hand side, we have log P(D | π, Θ) ≥ Σ_n Σ_l q_n(l) log( p_l(x_n | θ_l) / q_n(l) ) = L, and therefore the optimisation with respect to the cluster parameters separates: 0 = Σ_n q_n(l) ∂ log p_l(x_n | θ_l) / ∂θ_l. For distributions from the exponential family this optimisation is easy.

The EM algorithm IV. Any set of distributions q_1(·), ..., q_N(·) provides a lower bound to the log-likelihood. We should choose the distributions so that the bound is tightest at the current parameter set; assume the parameters have the value θ^0. We want to minimise the difference
log P(x_n | θ^0) − L_n = Σ_l q_n(l) log P(x_n | θ^0) − Σ_l q_n(l) log( p_l(x_n, l | θ^0) / q_n(l) ) = Σ_l q_n(l) log( P(x_n | θ^0) q_n(l) / p_l(x_n, l | θ^0) ),
and observe that by setting q_n(l) = p_l(x_n, l | θ^0) / P(x_n | θ^0) the difference is 0.

The EM algorithm V. The EM algorithm: Init: initialise the model parameters; E step: compute the responsibilities γ_nl = q_n(l); M step: for each l, solve 0 = Σ_n q_n(l) ∂ log p_l(x_n | θ_l) / ∂θ_l; Repeat: go to the E step.
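A minimal sketch of this loop for a Gaussian mixture, written with numpy only; the synthetic data (loosely inspired by the Old Faithful example) and K = 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
               rng.normal([4.4, 80.0], [0.4, 6.0], size=(150, 2))])
N, d, K = X.shape[0], X.shape[1], 2

# Init: crude initial guess for the parameters.
pis = np.full(K, 1.0 / K)
mus = X[rng.choice(N, K, replace=False)]
covs = np.array([np.cov(X.T) for _ in range(K)])

def gauss_pdf(X, mu, cov):
    # Multivariate Gaussian density evaluated at every row of X.
    diff = X - mu
    sol = np.linalg.solve(cov, diff.T).T
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff * sol, axis=1)) / norm

for it in range(100):
    # E step: responsibilities gamma_nk.
    dens = np.column_stack([pis[k] * gauss_pdf(X, mus[k], covs[k]) for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted updates of pi_k, mu_k, Sigma_k.
    Nk = gamma.sum(axis=0)
    pis = Nk / N
    mus = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mus[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

print("mixing weights:", np.round(pis, 3))
print("means:\n", np.round(mus, 2))
```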

EM application I-IV: Old Faithful. [Figures: successive EM iterations of the two-component Gaussian mixture fit to the duration-interval data.]

J. M. Bernardo and A. F. Smith. Bayesian Theory. John Wiley & Sons, 1994.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, New York, NY, 2006.
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.