Non-parametric Methods
Machine Learning
Torsten Möller
Reading
Chapter 2 of Pattern Recognition and Machine Learning by Bishop (with an emphasis on section 2.5)
Outline
Last week
Parametric Models
Kernel Density Estimation
Nearest-neighbour
Last week
Model selection / generalization, or the tale of finding good parameters
Curve fitting
Decision theory
Probability theory
Which Degree of Polynomial?
[Figure: polynomial fits of varying degree M to the training data]
A model selection problem
M = 9: E(w*) = 0, but this is over-fitting
Generalization
[Figure: training and test RMS error vs. polynomial degree M]
Generalization is the holy grail of ML
Want good performance for new data
Measure generalization using a separate test set
Use root-mean-squared (RMS) error: E_RMS = √(2 E(w*) / N)
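As a minimal sketch (assuming NumPy and the sum-of-squares error E(w) = ½ Σ_n (y(x_n, w) − t_n)² from last week), the RMS error on a held-out set could be computed as:

import numpy as np

def rms_error(y_pred, t):
    # E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2, hence E_RMS = sqrt(2 E(w) / N)
    E = 0.5 * np.sum((y_pred - t) ** 2)
    return np.sqrt(2.0 * E / len(t))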
Decision Theory
For a sample x, decide which class (C_k) it is from. Ideas:
Maximum Likelihood
Minimum Loss/Cost (e.g. misclassification rate)
Maximum A Posteriori (MAP)
Decision: Maximum Likelihood
Inference step: determine statistics from training data: p(x, t) or p(x | C_k)
Decision step: determine the optimal t for a test input x:
t = argmax_k p(x | C_k)   (the likelihood)
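A minimal sketch of the maximum-likelihood decision step, assuming we already hold one estimated class-conditional density per class (the `class_densities` list is illustrative):

import numpy as np

def ml_decide(x, class_densities):
    # class_densities[k] is a callable returning p(x | C_k)
    likelihoods = [p(x) for p in class_densities]
    return int(np.argmax(likelihoods))   # index k of the most likely class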
Parametric Models
Problem statement: what is a good probabilistic model (probability distribution) that produced our data?
Answer: density estimation. Given a finite set x_1, ..., x_N of observations, model the probability distribution p(x) of a random variable.
One standard approach are parametric methods: assume a model (e.g. Gaussian, Bernoulli, Dirichlet, etc.) and fit its parameters to the given observations.
Today we will focus on non-parametric methods: rather than having a fixed set of parameters (e.g. a weight vector for regression, µ and Σ for a Gaussian), we have a possibly growing set of parameters based on each data point:
Kernel density estimation
Nearest-neighbour methods
Histograms
Consider the problem of modelling the distribution of brightness values in pictures taken on sunny days versus cloudy days
We could build histograms of pixel values for each class
Histograms
[Figure: histogram density estimates of the same data for different bin widths]
E.g. for sunny days
Count n_i, the number of datapoints (pixels) whose brightness value falls into bin i:
p_i = n_i / (N Δ_i)
Sensitive to bin width Δ_i
Discontinuous due to bin edges
In D-dim space with M bins per dimension, M^D bins
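A minimal sketch of the histogram estimate p_i = n_i / (N Δ_i) for 1-D data, using NumPy (function and variable names are illustrative):

import numpy as np

def histogram_density(x, bin_width, x_min, x_max):
    # n_i = counts per bin; dividing by N * Delta makes the estimate integrate to 1
    edges = np.arange(x_min, x_max + bin_width, bin_width)
    counts, edges = np.histogram(x, bins=edges)
    return counts / (len(x) * bin_width), edges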
Local Density Estimation
In a histogram we use nearby points to estimate density
Assuming our space is divided into smallish regions of size V, then their probability mass is
P = ∫_R p(x) dx ≈ p(x) V
We assume that p(x) is approximately constant in that region
Further, the number of data points per region is roughly K ≈ N P
Hence, for a small region around x, estimate the density as:
p(x) = K / (N V)
K is the number of points in the region, V is the volume of the region, N is the total number of datapoints
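A minimal sketch of the p(x) = K/(NV) estimate with a fixed cubic region of side h around the query point (NumPy, D-dimensional data; names are illustrative):

import numpy as np

def local_density(x, data, h):
    # Count the K points falling inside the cube of side h centred at x
    N, D = data.shape
    K = np.sum(np.all(np.abs(data - x) <= h / 2.0, axis=1))
    V = h ** D                      # volume of the region
    return K / (N * V)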
Kernel Density Estimation
First idea: keep the volume of the neighbourhood V constant
Try to keep the idea of using nearby points to estimate density, but obtain a smoother estimate
Estimate the density by placing a small bump at each datapoint
The kernel function k(·) determines the shape of these bumps
Density estimate is
p(x) ≈ (1/N) Σ_{n=1}^{N} k((x − x_n) / h)
Kernel Density Estimation
[Figure: Gaussian kernel density estimates of the same data for different bandwidths h]
Example using a Gaussian kernel:
p(x) = (1/N) Σ_{n=1}^{N} (2πh²)^{-1/2} exp{ −‖x − x_n‖² / (2h²) }
Kernel Density Estimation.14.12 1.9.8.7.6.5.4.3.2.1.8.6.4.2.1!3!2!1 1 2 3!5 5 1 15 2 25 3 Other kernels: Rectangle, Triangle, Epanechnikov Fast at training time, slow at test time keep all datapoints Sensitive to kernel bandwidth h Möller/Mori 26
Nearest-neighbour
[Figure: K-nearest-neighbour density estimates of the same data for different K]
Instead of relying on the kernel bandwidth to get a proper density estimate, fix the number of nearby points K:
p(x) = K / (N V)
Note: the integral of this estimate diverges, so it is not a proper density estimate
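A minimal sketch of the 1-D K-nearest-neighbour density estimate, taking V to be the smallest interval around x that contains the K nearest training points (NumPy; names are illustrative):

import numpy as np

def knn_density(x, data, K):
    # Grow the region around x until it holds K points, then p(x) = K / (N * V)
    dists = np.sort(np.abs(data - x))
    radius = dists[K - 1]            # distance to the K-th nearest neighbour
    V = 2 * radius                   # 1-D "volume": the interval [x - r, x + r]
    return K / (len(data) * V)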
Nearest-neighbour for Classification
[Figure: (a) 3-nearest-neighbour and (b) 1-nearest-neighbour decision regions in the (x_1, x_2) plane]
K nearest neighbour is often used for classification
Classification: predict labels t_i from x_i
e.g. x_i ∈ R², t_i ∈ {0, 1}, 3-nearest neighbour
K = 1 is referred to as nearest-neighbour
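A minimal brute-force sketch of K-nearest-neighbour classification by majority vote (NumPy; names are illustrative):

import numpy as np

def knn_classify(x, X_train, t_train, K=3):
    # Find the K closest training points and take a majority vote over their labels
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]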
Nearest-neighbour for Classification
Good baseline method
Slow, but can use fancy data structures for efficiency (KD-trees, Locality Sensitive Hashing)
Nice theoretical properties:
As we obtain more training data points, space becomes more densely filled with labelled data
As N → ∞, the nearest-neighbour error is no more than twice the Bayes error
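A minimal sketch of speeding up the neighbour search with a KD-tree (SciPy's cKDTree is one possible choice; the data here is random and purely illustrative):

import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(1000, 2)              # illustrative 2-D training points
t_train = np.random.randint(0, 2, size=1000)   # illustrative binary labels

tree = cKDTree(X_train)                        # build once at training time
dist, idx = tree.query([0.5, 0.5], k=3)        # 3 nearest neighbours of a test point
prediction = np.bincount(t_train[idx]).argmax()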
Bayes Error
[Figure: joint densities p(x, C_1) and p(x, C_2), decision boundaries x̂ and x_0, decision regions R_1 and R_2]
Best classification possible given the features
Two classes, PDFs shown
Decision rule: choose C_1 if x ≤ x̂; this makes errors on the red, green, and blue regions
Optimal decision rule: choose C_1 if x ≤ x_0; the Bayes error is the area of the green and blue regions
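A minimal numerical sketch of the Bayes error for two illustrative 1-D Gaussian classes: any decision rule errs on at least the smaller of the two joint densities at each x, so integrating that minimum gives the Bayes error:

import numpy as np

x = np.linspace(-10.0, 10.0, 10001)
p1 = 0.5 * np.exp(-(x - 0.0) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)   # p(x, C_1)
p2 = 0.5 * np.exp(-(x - 2.0) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)   # p(x, C_2)
bayes_error = np.trapz(np.minimum(p1, p2), x)   # area of the unavoidable-error region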
Conclusion
Readings: Ch. 2.5
Kernel density estimation: model the density p(x) using kernels around the training datapoints
Nearest neighbour: model the density or perform classification using the nearest training datapoints
Multivariate Gaussian: needed for next week's lectures; if you need a refresher, read pp. 78-81