Introduction to machine learning and pattern recognition
Lecture 2
Coryn Bailer-Jones
http://www.mpia.de/homes/calj/mlpr_mpia2008.html
Last week...
- supervised and unsupervised methods; the former are fit (trained) on a labelled data set
- need adaptive methods which learn the significance of the input data in predicting the outputs
- need to regularize fitting in order to achieve generalization
- trade-off between fit bias and variance
- curse of dimensionality
- in practice we must make assumptions (e.g. smoothness), and much of machine learning is about how to do this
Topics today
- kernels: for density estimation, for regression
- parametric density estimation as a means of classification (LDA)
- basis functions for structured regression, e.g. splines
- smoothing splines
1D density estimation
Given a 1D vector of measurements, how do we infer the density?
Bin the data: histograms
[figure: example histograms, from Bishop (1995)]
Kernel density estimation
$$\hat{f}(x) = \frac{1}{Nh} \sum_{n=1}^{N} K\!\left(\frac{x - x_n}{h}\right)$$
- the sum is over all N points, n, in the data set
- K() is a fixed kernel function with bandwidth h
- simple (Parzen) kernel: $K(u) = 1$ if $|u| \le 1/2$, $0$ otherwise
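As a concrete illustration, here is a minimal R sketch of the Parzen (box) kernel estimator; the function and variable names (parzen_kde, data, xs) are my own, not from the lecture's scripts.

```r
# Parzen (box) kernel density estimate: the fraction of points falling
# in a window of width h centred on each evaluation point, divided by h.
parzen_kde <- function(xs, data, h) {
  sapply(xs, function(x0) sum(abs((x0 - data) / h) <= 0.5) / (length(data) * h))
}

set.seed(1)
data <- c(rnorm(50, 0, 1), rnorm(50, 4, 0.5))   # toy bimodal sample
xs <- seq(-3, 6, length.out = 200)
plot(xs, parzen_kde(xs, data, h = 0.5), type = "s",
     xlab = "x", ylab = "density estimate")
```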
Gaussian kernel
$$\hat{f}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\!\left(-\frac{\|x - x_n\|^2}{2h^2}\right)$$
where the sum is over the entire data set of N points.
[figure: Gaussian kernel density estimates, from Bishop (1995)]
K-NN density estimation
Overcome the fixed kernel size: vary the search volume size, V, until it reaches K neighbours:
$$\hat{f}(x) = \frac{K}{NV}$$
- K = no. of neighbours
- N = total no. of points
- V = volume occupied by the K neighbours
[figure from Bishop (1995)]
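A minimal 1D sketch of this idea (my own illustration, not the course code): here the "volume" around x is taken to be twice the distance to the K-th nearest neighbour.

```r
# 1D k-NN density estimate: V is the length of the smallest interval
# centred on x that contains the K nearest points, i.e. 2 * d_K.
knn_density <- function(xs, data, K) {
  N <- length(data)
  sapply(xs, function(x0) {
    dK <- sort(abs(data - x0))[K]   # distance to the K-th nearest neighbour
    K / (N * 2 * dK)
  })
}

lines(xs, knn_density(xs, data, K = 10), col = "red")  # reuses data/xs from above
```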
Histograms and 1D kernel density estimation
[figures from MASS4, section 5.6; see R scripts on the web]
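For reference, the built-in R equivalents (a sketch in the spirit of the MASS examples; the bandwidth and the toy data from above are my own choices):

```r
# Base-R histogram with two kernel density estimates overlaid.
hist(data, breaks = 20, freq = FALSE, main = "histogram vs. KDE")
lines(density(data, bw = 0.3, kernel = "gaussian"), col = "blue")
lines(density(data, bw = 0.3, kernel = "rectangular"), col = "red")  # box (Parzen) kernel
```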
2D kernel density estimation
[figures from MASS4, section 5.6; see R scripts on the web]
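A minimal sketch with MASS::kde2d, a bivariate normal-kernel estimator (the bandwidths and toy data here are my own):

```r
library(MASS)
x2 <- rnorm(200)
y2 <- 0.5 * x2 + rnorm(200, sd = 0.5)       # toy correlated sample
dens <- kde2d(x2, y2, h = c(1, 1), n = 50)  # density on a 50 x 50 grid
image(dens, xlab = "x", ylab = "y")         # heat map of the estimate
contour(dens, add = TRUE)                   # overlay density contours
```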
Classification via (parametric) density modelling
Separately model the PDF of each class using a specific function, e.g. a p-dimensional Gaussian:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\mathbf{C}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$
where $\boldsymbol{\mu} = E[\mathbf{x}]$ and $\mathbf{C} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T]$.
The quantity $\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is called the Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.
Surfaces of constant probability density are surfaces of constant $\Delta^2$ (ellipsoids). Their principal axes are the eigenvectors, $\mathbf{u}_j$, of $\mathbf{C}$, and the variances in these directions are the eigenvalues, $\lambda_j$:
$$\mathbf{C}\,\mathbf{u}_j = \lambda_j \mathbf{u}_j$$
Maximum likelihood estimate of parameters
$$L = p(X \mid \boldsymbol{\theta}) = \prod_{j=1}^{N} p(\mathbf{x}_j \mid \boldsymbol{\theta})$$
L is the likelihood, a function of the parameters $\boldsymbol{\theta} = \{\boldsymbol{\mu}, \mathbf{C}\}$. We maximise this w.r.t. $\boldsymbol{\theta}$. In practice we minimize the negative log likelihood
$$E = -\ln L = -\sum_j \ln p(\mathbf{x}_j \mid \boldsymbol{\theta})$$
For a multivariate Gaussian distribution,
$$-\ln L = \frac{Np}{2} \ln(2\pi) + \frac{N}{2} \ln |\mathbf{C}| + \frac{1}{2} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})$$
i.e. it corresponds to least squares. In this case analytic differentiation can be used and gives
$$\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i, \qquad \hat{\mathbf{C}} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T \;\text{(note the outer product)}$$
which are intuitive.
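These estimates are one line each in R (a sketch on toy data; note that cov() uses the 1/(N-1) convention, so it is rescaled here to give the 1/N maximum-likelihood estimate):

```r
X <- cbind(rnorm(100), rnorm(100, sd = 2))  # N x p matrix of toy data
N <- nrow(X)
mu_hat <- colMeans(X)            # ML estimate of the mean
C_hat  <- cov(X) * (N - 1) / N   # rescale: cov() divides by N-1, the MLE by N
```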
Example: modelling the PDF with two Gaussians
See R scripts on the web page.
- class 1: $\mu = (0.0, 0.0)$, $\sigma = (0.5, 0.5)$
- class 2: $\mu = (1.0, 1.0)$, $\sigma = (0.7, 0.3)$
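A minimal sketch of the idea (my own reconstruction, not the course script): fit a Gaussian to each class and classify a point by comparing the two log densities, here with equal priors.

```r
# Fit a Gaussian to each class; classify by the larger log density.
# mahalanobis() returns (x - mu)^T C^{-1} (x - mu).
fit_gauss <- function(X) {
  N <- nrow(X)
  list(mu = colMeans(X), C = cov(X) * (N - 1) / N)
}
log_dens <- function(x, g) {
  -0.5 * (length(g$mu) * log(2 * pi) + log(det(g$C)) + mahalanobis(x, g$mu, g$C))
}
X1 <- cbind(rnorm(100, 0, 0.5), rnorm(100, 0, 0.5))  # class 1
X2 <- cbind(rnorm(100, 1, 0.7), rnorm(100, 1, 0.3))  # class 2
g1 <- fit_gauss(X1); g2 <- fit_gauss(X2)
x_new <- c(0.8, 0.6)
if (log_dens(x_new, g1) > log_dens(x_new, g2)) "class 1" else "class 2"
```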
Bayes' Theorem
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$
- $p(y \mid x)$: posterior probability of y given x
- $p(x \mid y)$: likelihood of x given y
- $p(y)$: prior over y
This follows from $p(y, x) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$.
Applications:
1. Prediction: x = predictors (input), y = response(s) (output)
2. Learning: x = data, y = model parameters
$$p(y \mid x, M) = \frac{p(x \mid y, M)\, p(y \mid M)}{p(x \mid M)}, \qquad p(x \mid M) = \int p(x \mid y, M)\, p(y \mid M)\, dy$$
where $p(x \mid M)$ is the evidence for model M.
Linear Discriminant Analysis (LDA)
Approach: directly model the class posterior probabilities, $\Pr(G \mid X)$.
- $y_k(\mathbf{x})$ is the class-conditional density of X for class $G = k$ (the likelihood)
- $\pi_k$ is the prior probability of class k, where $\sum_{k=1}^{K} \pi_k = 1$
From Bayes' theorem,
$$\Pr(G = k \mid X = \mathbf{x}) = \frac{y_k(\mathbf{x})\, \pi_k}{\sum_{l=1}^{K} y_l(\mathbf{x})\, \pi_l}$$
We need a form for the conditional density. There are many possibilities:
- Gaussian
- mixture of Gaussians
- nonparametric density estimates (e.g. kernel)
- naive Bayes (assume the inputs to be conditionally independent)
Linear Discriminant Analysis (LDA)
For the multivariate Gaussian case,
$$y_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\mathbf{C}_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \mathbf{C}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}$$
LDA is the special case in which we take $\mathbf{C}_k = \mathbf{C} \;\;\forall k$.
To compare two classes k and l, look at the log odds ratio:
$$\log \frac{\Pr(G = k \mid X = \mathbf{x})}{\Pr(G = l \mid X = \mathbf{x})} = \log \frac{y_k(\mathbf{x})}{y_l(\mathbf{x})} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \tfrac{1}{2} (\boldsymbol{\mu}_k + \boldsymbol{\mu}_l)^T \mathbf{C}^{-1} (\boldsymbol{\mu}_k - \boldsymbol{\mu}_l) + \mathbf{x}^T \mathbf{C}^{-1} (\boldsymbol{\mu}_k - \boldsymbol{\mu}_l)$$
which is linear in $\mathbf{x}$. Solve using maximum likelihood; use the class frequencies as the priors.
[figure: LDA applied to example data, from Hastie, Tibshirani & Friedman (2001)]
LDA as data compression (data description)
We can also write the log odds ratio as
$$\log \frac{\Pr(G = k \mid X = \mathbf{x})}{\Pr(G = l \mid X = \mathbf{x})} = \delta_k(\mathbf{x}) - \delta_l(\mathbf{x})$$
where
$$\delta_k(\mathbf{x}) = \mathbf{x}^T \mathbf{C}^{-1} \boldsymbol{\mu}_k - \tfrac{1}{2} \boldsymbol{\mu}_k^T \mathbf{C}^{-1} \boldsymbol{\mu}_k + \log \pi_k$$
is called the linear discriminant function. The decision rule is then $G(\mathbf{x}) = \mathrm{argmax}_k\, \delta_k(\mathbf{x})$.
Aside: this is the same as linear least squares for two classes when the priors are equal, $\pi_k = \pi_l$.
For g classes we have g-1 linear discriminants. This gives a dimensionality reduction: from the p original data dimensions to g-1 LDs.
LDA example: Fisher's iris data
- 4 measurements: sepal and petal length and width
- 3 classes: iris setosa, versicolor, virginica
- 50 flowers in each class
- use lda in the MASS package in R
- we will get two LDs
[photo: R.A. Fisher (1890-1962)]
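A minimal sketch of the MASS::lda call on the iris data (the plotting choices are mine, not necessarily those of the course scripts):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)  # fit LDA on all 4 measurements
proj <- predict(fit)                  # project onto the two LDs
plot(proj$x[, 1], proj$x[, 2], col = iris$Species,
     xlab = "LD1", ylab = "LD2")      # 3 classes in the 2 discriminant dimensions
mean(proj$class == iris$Species)      # training-set classification accuracy
```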
Quadratic Discriminant Analysis (QDA)
As before,
$$y_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\mathbf{C}_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \mathbf{C}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}$$
The log odds ratio is
$$\log \frac{\Pr(G = k \mid X = \mathbf{x})}{\Pr(G = l \mid X = \mathbf{x})} = \log \frac{y_k(\mathbf{x})}{y_l(\mathbf{x})} + \log \frac{\pi_k}{\pi_l} = \delta_k(\mathbf{x}) - \delta_l(\mathbf{x})$$
With QDA we no longer assume equal covariances, so they do not cancel. The discriminant functions are
$$\delta_k(\mathbf{x}) = -\tfrac{1}{2} \log |\mathbf{C}_k| - \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \mathbf{C}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k$$
which are quadratic functions of $\mathbf{x}$.
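The corresponding R call is one change from the LDA sketch above:

```r
fit_q <- qda(Species ~ ., data = iris)      # one covariance matrix per class
mean(predict(fit_q)$class == iris$Species)  # compare with the LDA accuracy
```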
[figure: LDA and QDA applied to a 3-class problem in 2D, from Hastie, Tibshirani & Friedman (2001). The data are non-Gaussian and the classes have unequal covariances.]
One-dimensional kernel smoothers: k-NN
The k-NN smoother is
$$\hat{f}(x) = \hat{E}(Y \mid X = x) = \mathrm{Ave}\left[\, y_i \mid x_i \in N_k(x) \,\right]$$
where $N_k(x)$ is the set of k points nearest to x in (e.g.) squared distance.
A drawback is that the estimator is not smooth in x.
[figure: k-NN smoother with k = 30, from Hastie, Tibshirani & Friedman (2001)]
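A minimal R sketch of the k-NN smoother on toy data (the names and data are my own):

```r
# k-NN smoother: average the y values of the k points with nearest x.
knn_smooth <- function(x0, x, y, k) {
  sapply(x0, function(p) mean(y[order(abs(x - p))[1:k]]))
}

set.seed(2)
x <- sort(runif(100))
y <- sin(4 * x) + rnorm(100, sd = 0.3)
plot(x, y)
lines(x, knn_smooth(x, x, y, k = 30), col = "red")  # piecewise constant, not smooth
```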
One-dimensional kernel smoothers: Epanechnikov
Instead give more distant points less weight, e.g. with the Nadaraya-Watson kernel-weighted average
$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$
using the Epanechnikov kernel
$$K_\lambda(x_0, x_i) = D\!\left(\frac{|x_i - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
We could generalize the kernel to have a variable width:
$$K_\lambda(x_0, x_i) = D\!\left(\frac{|x_i - x_0|}{h_\lambda(x_0)}\right)$$
[figure from Hastie, Tibshirani & Friedman (2001)]
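A sketch of the Nadaraya-Watson average with the Epanechnikov kernel, continuing the toy data above (names are mine):

```r
epan <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)  # the kernel D(t)
nw_smooth <- function(x0, x, y, lambda) {
  sapply(x0, function(p) {
    w <- epan(abs(x - p) / lambda)  # kernel weights around the target point
    sum(w * y) / sum(w)             # Nadaraya-Watson weighted average
  })
}
lines(x, nw_smooth(x, x, y, lambda = 0.2), col = "blue")  # smooth in x
```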
Kernel comparison
- Epanechnikov: $D(t) = \tfrac{3}{4}(1 - t^2)$ if $|t| \le 1$, $0$ otherwise
- Tri-cube: $D(t) = (1 - |t|^3)^3$ if $|t| \le 1$, $0$ otherwise
[figure from Hastie, Tibshirani & Friedman (2001)]
k-NN and Epanechnikov kernels
[figure: k = 30 and λ = 0.2, from Hastie, Tibshirani & Friedman (2001)]
- the Epanechnikov kernel has fixed width (bias approx. constant, variance not)
- k-NN has adaptive width (constant variance, bias varies as 1/density)
- free parameters: k or λ
Locally-weighted averages can be biased at boundaries
The kernel is asymmetric at the boundary.
[figure from Hastie, Tibshirani & Friedman (2001)]
Local linear regression
Solve a linear least squares problem in a local region to predict at a single point.
[figure from Hastie, Tibshirani & Friedman (2001); green points show the effective kernel]
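In R, loess() fits local polynomial regressions of this kind (a sketch on the toy data above; note loess uses tri-cube weights rather than the Epanechnikov kernel, and degree = 2 gives the local quadratic fit of the next slide):

```r
fit1 <- loess(y ~ x, degree = 1, span = 0.3)  # local linear regression
fit2 <- loess(y ~ x, degree = 2, span = 0.3)  # local quadratic regression
lines(x, predict(fit1), col = "darkgreen")
lines(x, predict(fit2), col = "purple")
```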
Local quadratic regression
[figure from Hastie, Tibshirani & Friedman (2001)]
Kernels in higher dimensions
- kernel smoothing and local regression generalize to higher dimensions...
- ...but the curse of dimensionality is not overcome: we cannot simultaneously retain localness (= low bias) and a sufficient sample size (= low variance) without increasing the total sample exponentially with dimension
- in general we need to make assumptions about the underlying data / true function and use structured regression/classification
Basis expansions
$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$
- linear model: $h_m(X) = X_m$, $m = 1, \ldots, p$
- quadratic terms: $h_m(X) = X_j X_k$; higher order terms
- other transformations, e.g. $h_m(X) = \log X_j$, $\sqrt{X_j}$
- split the range with an indicator function: $h_m(X) = I(L_m \le X_j < U_m)$
As higher order terms are added we need to control the complexity:
- restrict the class of functions a priori, e.g. additive models
- adaptive selection of functions, e.g. variable selection: only include significant basis functions
- fit a complex function and regularize; typically an integral part of nonlinear modelling
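A basis expansion is just a linear model in the transformed inputs, so in R it is an ordinary lm() call (a sketch; this particular basis is my own choice, continuing the toy data above):

```r
# Linear least squares on hand-chosen basis functions:
# h1 = 1 (intercept), h2 = x, h3 = x^2, h4 = log(x).
fit_basis <- lm(y ~ x + I(x^2) + I(log(x)))
coef(fit_basis)  # the fitted beta_m
```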
Splines (one-dimensional)
The idea behind splines is that the region to fit is split into subregions which are then fit separately.
In the plots: data drawn from a nonlinear function (blue line) with Gaussian noise. The region to fit is split into three by the knots $\xi_1, \xi_2$.
Piecewise linear fits:
- 3 constants
- 3 linear fits
- 3 linear fits forced to be continuous (2 restrictions, so 2 fewer free parameters)
Piecewise linear fits
$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+$$
$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$
[figure from Hastie, Tibshirani & Friedman (2001)]
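The truncated-linear basis is easy to build by hand in R (a sketch; the knot positions are arbitrary choices of mine):

```r
# Continuous piecewise-linear fit via the truncated basis (X - xi)_+.
xi1 <- 0.33; xi2 <- 0.66                        # two knots
h3 <- pmax(x - xi1, 0); h4 <- pmax(x - xi2, 0)  # hinge terms
fit_pl <- lm(y ~ x + h3 + h4)                   # the intercept plays the role of h1
lines(x, fitted(fit_pl), col = "orange")
```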
Piecewise cubic polynomial fits
$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = X^2, \quad h_4(X) = X^3, \quad h_5(X) = (X - \xi_1)_+^3, \quad h_6(X) = (X - \xi_2)_+^3$$
$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$
The subregions are fit separately but not independently.
[figure from Hastie, Tibshirani & Friedman (2001)]
Cubic splines
Regression splines:
- fixed knots; must select the knots
- regularization controlled by the number of knots and the degree of the polynomials
- not very satisfactory...
Natural cubic splines:
- polynomial fits near the boundaries are often very wild, increasing the variance at the boundaries
- force the extrapolation to be linear: frees up 4 d.o.f. (i.e. adds two constraints at each of the two boundaries)
See the R scripts for spline examples (more next week).
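In R both are available through the splines package (a sketch; the knot and d.o.f. choices are mine):

```r
library(splines)
fit_bs <- lm(y ~ bs(x, knots = c(0.33, 0.66)))  # cubic regression spline
fit_ns <- lm(y ~ ns(x, df = 4))                 # natural spline: linear beyond the boundaries
lines(x, fitted(fit_ns), col = "brown")
```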
Smoothing splines
Avoid knot selection by using all data points as knots; avoid overfitting by using regularization. Minimise a penalized sum-of-squares
$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2\, dt$$
where f is the fitting function with continuous second derivatives.
What is the effect of varying $\lambda$? What fits do we get for the extreme values?
Smoothing splines
Minimising the penalized sum-of-squares $\mathrm{RSS}(f, \lambda)$ above:
- $\lambda = 0$: f can be any function which interpolates the data (could be wild)
- $\lambda = \infty$: the straight-line least squares fit (no second derivative is tolerated)
The solution is a cubic spline with knots at each of the $\{x_i\}$, i.e.
$$f(x) = \sum_{j=1}^{N} \theta_j h_j(x)$$
Splines in R
See the R scripts on the web.
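The penalized fit above is implemented in base R by smooth.spline() (a sketch, continuing the toy data; spar controls the penalty λ):

```r
fit_ss <- smooth.spline(x, y)                # lambda chosen by generalized cross-validation
fit_wild <- smooth.spline(x, y, spar = 0.2)  # small penalty: wigglier fit
lines(predict(fit_ss), col = "magenta")
lines(predict(fit_wild), col = "grey", lty = 2)
```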
Summary
- density estimation: in order to fit a model or make inferences, some assumptions are essential (CBJ)
  - histograms ("binning is sinning", Rix)
  - kernels ("smoothing is evil", Martin)
- classification via parametric density estimation (LDA)
  - parametric model: a Gaussian fit to each class. A common covariance matrix gives LDA and leads to linear decision boundaries
  - g-1 linear discriminants between g classes
  - we get QDA when we relax the common-covariance restriction
- basis functions
  - linear model of local or nonlinear terms
  - splines (see web notes / R scripts for more)
  - complexity control: smoothing splines
- taken to a higher level with neural networks... next week...