
Introduction to machine learning and pattern recognition, Lecture 2. Coryn Bailer-Jones. http://www.mpia.de/homes/calj/mlpr_mpia2008.html

Last week...
- supervised and unsupervised methods; the former are fit (trained) on a labelled data set
- need adaptive methods which learn the significance of the input data in predicting the outputs
- need to regularize the fitting in order to achieve generalization; trade-off between fit bias and variance
- curse of dimensionality
- in practice we must make assumptions (e.g. smoothness), and much of machine learning is about how to do this

Topics today
- kernels: for density estimation, for regression
- parametric density estimation as a means of classification (LDA)
- basis functions for structured regression, e.g. splines
- smoothing splines

1D density estimation. Given a 1D vector of measurements, how do we infer the density?

Bin the data: histograms. Figure from Bishop (1995).

Kernel density estimation

$$\hat{f}(x) = \frac{1}{Nh} \sum_{n=1}^{N} K\left(\frac{x - x_n}{h}\right)$$

where $N$ is the total number of points, the sum is over all points $n$, and $K()$ is a fixed kernel function with bandwidth $h$. Simple (Parzen) kernel:

$$K(u) = \begin{cases} 1 & \text{if } |u| \le 1/2 \\ 0 & \text{otherwise} \end{cases}$$
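As a concrete illustration, here is a minimal hand-rolled sketch of this estimator in R with the Parzen (box) kernel; the data vector x, the evaluation grid, and the bandwidth value are assumptions made for the example.

```r
# hand-rolled 1D kernel density estimate with the simple Parzen (box) kernel
parzen_kde <- function(xgrid, x, h) {
  N <- length(x)
  sapply(xgrid, function(x0) {
    u <- (x0 - x) / h
    K <- as.numeric(abs(u) <= 0.5)   # K(u) = 1 if |u| <= 1/2, else 0
    sum(K) / (N * h)                 # f(x) = (1/(N h)) * sum_n K((x - x_n)/h)
  })
}

# example usage with made-up data
x    <- rnorm(200)
fhat <- parzen_kde(seq(-3, 3, by = 0.1), x, h = 0.5)
```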

Gaussian kernel

$$\hat{f}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\left( -\frac{\|x - x_n\|^2}{2h^2} \right)$$

where the sum runs over the entire data set of $N$ points. Figure from Bishop (1995).
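In practice R's built-in density() gives this kind of estimate directly; a short sketch (the bandwidth value is illustrative, and density() parameterizes the Gaussian kernel by its standard deviation):

```r
# Gaussian kernel density estimate with base R; x is an assumed numeric data vector
dens <- density(x, kernel = "gaussian", bw = 0.3)   # bw plays the role of h
plot(dens)
```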

K-NN density estimation

Overcome the fixed kernel size: vary the search volume $V$ until it contains $K$ neighbours,

$$\hat{f}(x) = \frac{K}{NV}$$

where $K$ = no. of neighbours, $N$ = total no. of points, $V$ = volume occupied by the $K$ neighbours. Figure from Bishop (1995).
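A minimal 1D sketch of this idea in R, under the assumption that the "volume" is the width of the interval reaching out to the K-th nearest point:

```r
# 1D k-NN density estimate: f(x) = K / (N * V)
knn_density <- function(xgrid, x, K = 10) {
  N <- length(x)
  sapply(xgrid, function(x0) {
    dK <- sort(abs(x - x0))[K]   # distance to the K-th nearest neighbour
    V  <- 2 * dK                 # 1D "volume" = width of the interval around x0
    K / (N * V)
  })
}
```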


Histograms and 1D kernel density estimation. From MASS4 section 5.6. See R scripts on the web.

2D kernel density estimation. From MASS4 section 5.6. See R scripts on the web.
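A sketch of how such a 2D estimate can be produced with MASS (the grid size is illustrative; x and y are assumed numeric vectors):

```r
# 2D kernel density estimation with a Gaussian product kernel
library(MASS)
dens2d <- kde2d(x, y, n = 50)   # density evaluated on a 50 x 50 grid
contour(dens2d)                 # contour plot of the estimated density
```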

Classification via (parametric) density modelling

Separately model the PDF of each class using a specific function, e.g. a p-dimensional Gaussian:

$$p(x) = \frac{1}{(2\pi)^{p/2} |C|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right\}$$

where $\mu = E[x]$ and $C = E[(x - \mu)(x - \mu)^T]$.

The quantity $\Delta^2 = (x - \mu)^T C^{-1} (x - \mu)$ is called the Mahalanobis distance from $x$ to $\mu$. Surfaces of constant probability density are surfaces of constant $\Delta^2$ (ellipsoids). Their principal axes are the eigenvectors $u_i$ of $C$, and the variances in these directions are the eigenvalues $\lambda_i$: $C u_i = \lambda_i u_i$.
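Base R computes this distance directly; a small sketch with an assumed N x p data matrix X:

```r
# squared Mahalanobis distance of each row of X from the sample mean
mu <- colMeans(X)
C  <- cov(X)
d2 <- mahalanobis(X, center = mu, cov = C)   # (x - mu)^T C^{-1} (x - mu) per row
```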

Maximum likelihood estimate of parameters

$$L = p(X|\theta) = \prod_{j=1}^{N} p(x_j|\theta)$$

$L$ is the likelihood, a function of the parameters $\theta = \{\mu, C\}$. We maximise this w.r.t. $\theta$. In practice we minimize the negative log likelihood

$$E = -\ln L = -\sum_j \ln p(x_j|\theta)$$

For a multivariate Gaussian distribution

$$p(x) = \frac{1}{(2\pi)^{p/2} |C|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right\}$$

$$-\ln L = \frac{Np}{2} \ln 2\pi + \frac{N}{2} \ln |C| + \frac{1}{2} \sum_{i=1}^{N} (x_i - \mu)^T C^{-1} (x_i - \mu)$$

i.e. corresponds to least squares. In this case analytic differentiation can be used and gives

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{C} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})(x_i - \hat{\mu})^T \quad \text{(note the outer product)}$$

which are intuitive.
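In R these ML estimates are one line each; note that cov() uses the 1/(N-1) convention, so the ML estimate needs rescaling (X is an assumed N x p data matrix):

```r
# ML estimates of the multivariate Gaussian parameters
N      <- nrow(X)
mu_hat <- colMeans(X)              # (1/N) * sum_i x_i
C_hat  <- cov(X) * (N - 1) / N     # rescale the unbiased estimate to the 1/N ML estimate
```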

Example: modelling the PDF with two Gaussians. See R scripts on the web page.
class 1: $\mu = (0.0, 0.0)$, $\sigma = (0.5, 0.5)$
class 2: $\mu = (1.0, 1.0)$, $\sigma = (0.7, 0.3)$
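A minimal simulation of this setup (not the course script itself, just a sketch with the stated means, per-axis standard deviations, and an assumed sample size):

```r
# simulate the two classes and fit a Gaussian to each by maximum likelihood
set.seed(1)
n <- 200
class1 <- cbind(rnorm(n, 0.0, 0.5), rnorm(n, 0.0, 0.5))   # mu = (0,0), sigma = (0.5, 0.5)
class2 <- cbind(rnorm(n, 1.0, 0.7), rnorm(n, 1.0, 0.3))   # mu = (1,1), sigma = (0.7, 0.3)
mu1 <- colMeans(class1); C1 <- cov(class1) * (n - 1) / n
mu2 <- colMeans(class2); C2 <- cov(class2) * (n - 1) / n
```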

Bayes' Theorem

Posterior probability of $y$ given $x$:

$$p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}$$

where $p(x|y)$ is the likelihood of $x$ given $y$ and $p(y)$ is the prior over $y$; note $p(y, x) = p(y|x)\, p(x) = p(x|y)\, p(y)$.

Applications:
1. Prediction: $x$ = predictors (input), $y$ = response(s) (output)
2. Learning: $x$ = data, $y$ = model parameters

Conditioning on a model $\mathcal{M}$,

$$p(y|x, \mathcal{M}) = \frac{p(x|y, \mathcal{M})\, p(y|\mathcal{M})}{p(x|\mathcal{M})}, \qquad p(x|\mathcal{M}) = \int p(x|y, \mathcal{M})\, p(y|\mathcal{M})\, dy$$

where $p(x|\mathcal{M})$ is the evidence for the model.

Linear Discriminant Analysis (LDA)

Approach: directly model the class posterior probabilities, $\Pr(G|X)$.
$y_k(x)$ is the class-conditional density of $X$ for class $G = k$ (the likelihood); $\pi_k$ is the prior probability of class $k$, where $\sum_{k=1}^{K} \pi_k = 1$.
From Bayes' theorem,

$$\Pr(G = k \mid X = x) = \frac{y_k(x)\, \pi_k}{\sum_{l=1}^{K} y_l(x)\, \pi_l}$$

Need a form for the conditional density. Many possibilities:
- Gaussian
- mixture of Gaussians
- nonparametric density estimates (e.g. kernel)
- naive Bayes (assume the inputs to be conditionally independent)

Linear Discriminant Analysis (LDA), continued

Multivariate Gaussian case:

$$y_k(x) = \frac{1}{(2\pi)^{p/2} |C_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T C_k^{-1} (x - \mu_k) \right\}$$

LDA is the special case in which we take $C_k = C$ for all $k$. To compare two classes $k$ and $l$, look at the log odds ratio

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log \frac{y_k(x)}{y_l(x)} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \frac{1}{2} (\mu_k + \mu_l)^T C^{-1} (\mu_k - \mu_l) + x^T C^{-1} (\mu_k - \mu_l)$$

which is linear in $x$. Solve using maximum likelihood; use the class frequencies as priors.

LDA illustration. Figure from Hastie, Tibshirani & Friedman (2001).

LDA as data compression (data description)

Can also write the log odds ratio as

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \delta_k(x) - \delta_l(x)$$

where

$$\delta_k(x) = x^T C^{-1} \mu_k - \frac{1}{2} \mu_k^T C^{-1} \mu_k + \log \pi_k$$

is called the linear discriminant function. The decision rule is then

$$G(x) = \arg\max_k \delta_k(x)$$

Aside: for two classes this is the same as linear least squares when the classes have equal priors ($\pi_k = \pi_l$, e.g. equal class sizes).
For $g$ classes we have $g - 1$ linear discriminants. This gives a dimensionality reduction from the $p$ original data dimensions to $g - 1$ LDs.

LDA example: Fisher's iris data
- 4 measurements: sepal and petal length and width
- 3 classes: iris setosa, versicolor, virginica
- 50 flowers in each class
- use lda in the MASS package in R; will get two LDs (a sketch follows)
R. A. Fisher (1890-1962)
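A minimal sketch of that call on R's built-in iris data set (the plotting step is optional):

```r
# LDA on Fisher's iris data: 3 classes give g - 1 = 2 linear discriminants
library(MASS)
fit_lda <- lda(Species ~ ., data = iris)   # priors default to the class frequencies
fit_lda$scaling                            # coefficients of LD1 and LD2
plot(fit_lda)                              # data projected onto the two LDs
```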

Quadratic Discriminant Analysis (QDA)

As before,

$$y_k(x) = \frac{1}{(2\pi)^{p/2} |C_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T C_k^{-1} (x - \mu_k) \right\}$$

Odds ratio:

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log \frac{y_k(x)}{y_l(x)} + \log \frac{\pi_k}{\pi_l} = \delta_k(x) - \delta_l(x)$$

With QDA we no longer assume equal covariances, so they do not cancel. The discriminant functions are

$$\delta_k(x) = -\frac{1}{2} \log |C_k| - \frac{1}{2} (x - \mu_k)^T C_k^{-1} (x - \mu_k) + \log \pi_k$$

which are quadratic functions of $x$.
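The corresponding MASS call, again as a sketch on the iris data:

```r
# QDA fits a separate covariance matrix per class, giving quadratic decision boundaries
library(MASS)
fit_qda <- qda(Species ~ ., data = iris)
pred    <- predict(fit_qda, iris)
table(pred$class, iris$Species)   # confusion matrix on the training data
```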

LDA and QDA applied to a 3-class problem in 2D. The data are non-Gaussian and the classes have unequal covariances. Figure from Hastie, Tibshirani & Friedman (2001).

One-dimensional kernel smoothers: k-NN

k-NN smoother:

$$\hat{f}(x) = E[Y \mid X = x] = \mathrm{Ave}\left[\, y_i \mid x_i \in N_k(x) \,\right]$$

$N_k(x)$ is the set of $k$ points nearest to $x$ in (e.g.) squared distance. The drawback is that the estimator is not smooth in $x$. Shown with $k = 30$. Figure from Hastie, Tibshirani & Friedman (2001).
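A minimal R sketch of such a smoother (x and y are assumed training vectors; x0 holds the points at which to predict):

```r
# k-NN smoother: average the responses of the k nearest training points
knn_smooth <- function(x0, x, y, k = 30) {
  sapply(x0, function(x0i) {
    idx <- order(abs(x - x0i))[1:k]   # indices of the k nearest neighbours
    mean(y[idx])
  })
}
```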

One-dimensional kernel smoothers: Epanechnikov

Instead give more distant points less weight, e.g. with the Nadaraya-Watson kernel-weighted average

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

using the Epanechnikov kernel

$$K_\lambda(x_0, x_i) = D\left( \frac{|x_i - x_0|}{\lambda} \right), \qquad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Could generalize the kernel to have variable width:

$$K_\lambda(x_0, x_i) = D\left( \frac{|x_i - x_0|}{h_\lambda(x_0)} \right)$$

Figure from Hastie, Tibshirani & Friedman (2001).
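A direct implementation sketch of this estimator (the vectors x and y and the bandwidth value are assumptions):

```r
# Nadaraya-Watson kernel-weighted average with the Epanechnikov kernel
epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

nw_smooth <- function(x0, x, y, lambda = 0.2) {
  sapply(x0, function(x0i) {
    w <- epanechnikov(abs(x - x0i) / lambda)   # K_lambda(x0, x_i)
    sum(w * y) / sum(w)                        # weighted average (NaN if no points within reach)
  })
}
```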

Kernel comparison

Epanechnikov: $D(t) = \frac{3}{4}(1 - t^2)$ if $|t| \le 1$, $0$ otherwise.
Tri-cube: $D(t) = (1 - |t|^3)^3$ if $|t| \le 1$, $0$ otherwise.

Figure from Hastie, Tibshirani & Friedman (2001).

k-NN and Epanechnikov kernels ($k = 30$, $\lambda = 0.2$)
- the Epanechnikov kernel has a fixed width (bias approximately constant, variance not)
- k-NN has an adaptive width (constant variance, bias varies as 1/density)
- free parameters: $k$ or $\lambda$
Figure from Hastie, Tibshirani & Friedman (2001).

Locally-weighted averages can be biased at the boundaries: the kernel is asymmetric at the boundary. Figure from Hastie, Tibshirani & Friedman (2001).

Local linear regression: solve a linear least-squares problem in a local region to predict at a single point. Green points: effective kernel. Figure from Hastie, Tibshirani & Friedman (2001).
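In R, loess() performs local polynomial regression; a sketch with an illustrative span (degree = 1 gives local linear fits; x and y are assumed numeric vectors):

```r
# local linear regression with loess
fit_loc <- loess(y ~ x, degree = 1, span = 0.3)
ord <- order(x)
plot(x, y)
lines(x[ord], fitted(fit_loc)[ord], col = "red")   # the local linear fit
```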

Local quadratic regression. Figure from Hastie, Tibshirani & Friedman (2001).

Kernels in higher dimensions
- kernel smoothing and local regression generalize to higher dimensions...
- ...but the curse of dimensionality is not overcome: we cannot simultaneously retain localness (= low bias) and a sufficient sample size (= low variance) without increasing the total sample exponentially with dimension
- in general we need to make assumptions about the underlying data/true function and use structured regression/classification

Basis expansions

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$

- linear model: $h_m(X) = X_m$, $m = 1, \ldots, p$
- quadratic terms: $h_m(X) = X_j X_k$
- higher order terms
- other transformations, e.g. $h_m(X) = \log X_j$, $\sqrt{X_j}$
- split the range with an indicator function: $h_m(X) = I(L_m \le X_j < U_m)$

As higher order terms are added we need to control the complexity:
- restrict the class of functions a priori, e.g. additive models
- adaptive selection of functions, e.g. variable selection: only include significant basis functions
- fit a complex function and regularize

Typically an integral part of nonlinear modelling. A small example follows this list.
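A minimal sketch of a basis-expanded linear model in R; the data frame dat, its columns x1, x2 (assumed positive) and y, and the particular transforms are all illustrative assumptions:

```r
# linear model in transformed inputs: the basis functions h_m(X) are chosen by hand
fit_basis <- lm(y ~ x1 + x2 + I(x1 * x2) + log(x1) + sqrt(x2), data = dat)
summary(fit_basis)
```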

Splines (one-dimensional)

The idea behind splines is that the region to fit is split into subregions which are then fit separately.
In the plots: data drawn from a nonlinear function (blue line) with Gaussian noise; the region to fit is split into three by the knots.
Piecewise fits: 3 constants; 3 linear fits; 3 linear fits forced to be continuous at the knots (2 restrictions, so 2 fewer free parameters).

Piecewise linear fits

$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+$$

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$

Figure from Hastie, Tibshirani & Friedman (2001).
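This basis can be fit directly with lm(); a sketch in which the data vectors x, y and the knot placement are assumptions:

```r
# continuous piecewise linear fit via the truncated basis above
xi1 <- quantile(x, 1/3); xi2 <- quantile(x, 2/3)   # illustrative knot positions
h3  <- pmax(x - xi1, 0)                            # (X - xi_1)_+
h4  <- pmax(x - xi2, 0)                            # (X - xi_2)_+
fit_pw <- lm(y ~ x + h3 + h4)                      # h1 = 1 is the intercept, h2 = X
```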

Piecewise cubic polynomial fits

$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = X^2, \quad h_4(X) = X^3, \quad h_5(X) = (X - \xi_1)^3_+, \quad h_6(X) = (X - \xi_2)^3_+$$

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$

The subregions are fit separately but not independently. Figure from Hastie, Tibshirani & Friedman (2001).

Cubic splines
- regression spline: fixed knots, which must be selected; regularization is controlled by the number of knots and the degree of the polynomials; not very satisfactory...
- natural cubic splines: polynomial fits near the boundaries are often very wild and the variance increases at the boundaries; force the extrapolation to be linear, which frees up 4 d.o.f. (i.e. adds two constraints at each of the two boundaries)
See the R scripts for spline examples (more next week); a brief sketch follows.
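A sketch of both options via the splines package (data vectors and knot positions are illustrative assumptions):

```r
# regression splines in R: B-spline basis vs. natural cubic splines
library(splines)
knots  <- quantile(x, c(1/3, 2/3))       # two interior knots, placed for illustration
fit_bs <- lm(y ~ bs(x, knots = knots))   # cubic regression spline
fit_ns <- lm(y ~ ns(x, knots = knots))   # natural cubic spline: linear beyond the boundary knots
```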

Smoothing splines

Avoid knot selection by using all data points as knots; avoid overfitting by using regularization. Minimise a penalized sum-of-squares

$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2\, dt$$

where $f$ is the fitting function with continuous second derivatives.
What is the effect of varying $\lambda$? What fits do we get for the extreme values?

Smoothing splines, continued
- $\lambda = 0$: $f$ is any function which interpolates the data (could be wild)
- $\lambda = \infty$: the straight-line least-squares fit (no second derivative tolerated)
The solution is a cubic spline with knots at each of the $\{x_i\}$, i.e.

$$f(x) = \sum_{j=1}^{N} \theta_j h_j(x)$$
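In R this is smooth.spline(); a minimal sketch with assumed vectors x and y (by default the smoothing parameter is chosen by generalized cross-validation):

```r
# smoothing spline: every x_i is a knot, complexity controlled by the penalty lambda
fit_ss <- smooth.spline(x, y)
plot(x, y)
lines(fit_ss, col = "blue")   # the penalized fit
```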

Splines in R: see the R scripts on the web.

Summary
- density estimation: histograms ("binning is sinning", Rix); kernels ("smoothing is evil", Martin); parametric models. In order to fit a model or make inferences, some assumptions are essential (CBJ).
- classification via parametric density estimation (LDA): a Gaussian is fit to each class. A common covariance matrix gives LDA and leads to linear decision boundaries, with g-1 linear discriminants between g classes; we get QDA when we relax the common-covariance restriction.
- basis functions: a linear model of local or nonlinear terms; splines (see web notes/R scripts for more); complexity control: smoothing splines.
Taken to a higher level with neural networks... next week...