Non-parametric Methods

Non-parametric Methods
Machine Learning
Torsten Möller
Möller/Mori

Reading
- Chapter 2 of Pattern Recognition and Machine Learning by Bishop (with an emphasis on Section 2.5)

Outline
- Last week
- Parametric Models
- Kernel Density Estimation
- Nearest-neighbour

Last week
- Model selection / generalization, OR the tale of finding good parameters
- Curve fitting
- Decision theory
- Probability theory

Which Degree of Polynomial?
- A model selection problem
- M = 9 gives $E(w^*) = 0$: this is over-fitting

Generalization
[Figure: error curves, legend: Training, Test]
- Generalization is the holy grail of ML
  - Want good performance for new data
- Measure generalization using a separate set
  - Use root-mean-squared (RMS) error: $E_{RMS} = \sqrt{2E(w^*)/N}$
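
The RMS formula can be sketched in a few lines of Python. This is only an illustration: it assumes the sum-of-squares definition $E(w) = \frac{1}{2}\sum_n (y_n - t_n)^2$ from the curve-fitting setting, and the function names and the held-out numbers are made up for the example.

```python
import math

def sum_of_squares_error(predictions, targets):
    # Assumed definition: E(w) = 1/2 * sum_n (y_n - t_n)^2
    return 0.5 * sum((y - t) ** 2 for y, t in zip(predictions, targets))

def rms_error(predictions, targets):
    # E_RMS = sqrt(2 E(w) / N); dividing by N makes sets of different size comparable
    n = len(targets)
    return math.sqrt(2.0 * sum_of_squares_error(predictions, targets) / n)

# Hypothetical predictions and targets on a separate (held-out) set
held_out_targets = [0.0, 1.0, 2.0, 3.0]
held_out_predictions = [0.5, 1.0, 1.5, 3.0]
print(rms_error(held_out_predictions, held_out_targets))
```

Computing the same quantity on the training set and on the separate set gives the two curves that are compared when judging generalization.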

Decision Theory
For a sample x, decide which class ($C_k$) it is from.
Ideas:
- Maximum Likelihood
- Minimum Loss/Cost (e.g. misclassification rate)
- Maximum A Posteriori (MAP)

Decision: Maximum Likelihood
- Inference step: determine statistics from training data: $p(x, t)$ OR $p(x \mid C_k)$
- Decision step: determine optimal t for test input x:
  $t = \arg\max_k \underbrace{p(x \mid C_k)}_{\text{Likelihood}}$
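
The two steps can be sketched as follows. The example assumes one-dimensional inputs and a Gaussian model for each class-conditional density $p(x \mid C_k)$; that model choice, the function names, and the toy training values are assumptions made for illustration only.

```python
import math

def fit_gaussian(samples):
    # Inference step: maximum-likelihood mean and variance for one class
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def ml_decide(x, class_params):
    # Decision step: pick the class k that maximizes the likelihood p(x | C_k)
    return max(class_params, key=lambda k: gaussian_pdf(x, *class_params[k]))

# Hypothetical training data for two classes
training = {"C1": [0.9, 1.1, 1.0, 1.2, 0.8], "C2": [2.9, 3.2, 3.0, 3.1, 2.8]}
class_params = {k: fit_gaussian(v) for k, v in training.items()}
print(ml_decide(1.3, class_params), ml_decide(2.7, class_params))
```

Note that this rule ignores class priors and misclassification costs, which is what separates it from the MAP and minimum-loss ideas.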

Parametric Models
- Problem statement: what is a good probabilistic model (the probability distribution that produced our data)?
- Answer: density estimation. Given a finite set $x_1, \ldots, x_N$ of observations, model the probability distribution p(x) of a random variable.
- One standard approach is parametric methods: assume a model (e.g. Gaussian, Bernoulli, Dirichlet, etc.) and fit its parameters to the given observations.
- Today we will focus on non-parametric methods: rather than having a fixed set of parameters (e.g. weight vector for regression, µ, Σ for Gaussian) we have a possibly infinite set of parameters based on each data point
  - Kernel density estimation
  - Nearest-neighbour methods

Histograms
- Consider the problem of modelling the distribution of brightness values in pictures taken on sunny days versus cloudy days
- We could build histograms of pixel values for each class

Histograms
- E.g. for sunny days
- Count $n_i$, the number of datapoints (pixels) with brightness value falling into each bin: $p_i = \frac{n_i}{N \Delta_i}$
- Sensitive to bin width $\Delta_i$
- Discontinuous due to bin edges
- In D-dim space with M bins per dimension, $M^D$ bins
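
A minimal sketch of the histogram estimate $p_i = n_i / (N \Delta_i)$ in one dimension. It assumes equal-width bins over a fixed range; the function name, the range arguments, and the brightness values are hypothetical.

```python
def histogram_density(data, lower, upper, num_bins):
    # Equal-width bins, so every Delta_i is the same width
    width = (upper - lower) / num_bins
    counts = [0] * num_bins
    for x in data:
        if lower <= x <= upper:
            # Clamp so that x == upper falls into the last bin
            i = min(int((x - lower) / width), num_bins - 1)
            counts[i] += 1
    n = len(data)
    # p_i = n_i / (N * Delta_i)
    return [c / (n * width) for c in counts], width

# Hypothetical brightness values scaled to [0, 1]
brightness = [0.1, 0.15, 0.4, 0.45, 0.5, 0.9]
for bins in (2, 4):
    densities, width = histogram_density(brightness, 0.0, 1.0, bins)
    print(bins, densities)
```

Running it with different bin counts shows the sensitivity to bin width: the same data give visibly different estimates.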

Local Density Estimation
- In a histogram we use nearby points to estimate density
- Assuming our space is divided into smallish regions R of volume V, the probability mass of a region is
  $P = \int_R p(x)\,dx \approx p(x)V$
  - We assume that p(x) is approximately constant in that region
- Further, the number of data points per region is roughly $K \approx NP$
- Hence, for a small region around x, estimate density as
  $p(x) = \frac{K}{NV}$
  - K is number of points in region, V is volume of region, N is total number of datapoints
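
The estimate $p(x) = K/(NV)$ can be tried directly in one dimension by taking the region to be an interval of length h centred on x, so that V = h. The interval choice, the function name, and the data are assumptions for this sketch.

```python
def local_density(x, data, h):
    # K: number of datapoints inside the interval [x - h/2, x + h/2]
    k = sum(1 for xn in data if abs(xn - x) <= h / 2.0)
    # V = h in one dimension, N = len(data)
    return k / (len(data) * h)

# Hypothetical one-dimensional observations
data = [0.0, 0.1, 0.2, 1.0]
print(local_density(0.1, data, 0.4))  # three of four points fall in the region
print(local_density(1.0, data, 0.4))  # only one point falls in the region
```

Here V is held fixed and K is counted from the data; fixing K and letting V vary gives the nearest-neighbour variant.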

Kernel Density Estimation
- First idea: keep volume of neighbourhood V constant
- Try to keep idea of using nearby points to estimate density, but obtain smoother estimate
- Estimate density by placing a small bump at each datapoint
  - Kernel function $k(\cdot)$ determines shape of these bumps
- Density estimate is
  $p(x) \propto \frac{1}{N} \sum_{n=1}^{N} k\left(\frac{x - x_n}{h}\right)$
  (a factor $1/h^D$ in D dimensions normalizes the estimate)

Kernel Density Estimation
- Example using Gaussian kernel:
  $p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{1/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}$
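
A small sketch of the Gaussian-kernel estimate for one-dimensional data, where $\|x - x_n\|^2$ reduces to $(x - x_n)^2$. The function name and the sample values are illustrative assumptions.

```python
import math

def gaussian_kde(x, data, h):
    # One Gaussian bump of standard deviation h per datapoint, averaged over N
    norm = 1.0 / math.sqrt(2.0 * math.pi * h * h)
    total = sum(norm * math.exp(-(x - xn) ** 2 / (2.0 * h * h)) for xn in data)
    return total / len(data)

# Hypothetical observations; evaluate the estimate at one point for several bandwidths
data = [0.0, 1.0, 2.0]
for h in (0.1, 0.5, 2.0):
    print(h, gaussian_kde(0.5, data, h))
```

Every evaluation loops over all N datapoints, and the value at a given x changes noticeably with h.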

Kernel Density Estimation
- Other kernels: Rectangle, Triangle, Epanechnikov
- Fast at training time, slow at test time: keep all datapoints
- Sensitive to kernel bandwidth h

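Nearest-neighbour
- Instead of relying on kernel bandwidth to get a proper density estimate, fix the number of nearby points K:
  $p(x) = \frac{K}{NV}$
  where V is the volume of the smallest region around x that contains K datapoints
- Note: the integral of this estimate over all space diverges, so it is not a proper density estimate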
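
For one-dimensional data, the K-nearest-neighbour estimate can be sketched by taking V to be the length of the interval that just reaches the K-th nearest datapoint. The function name and the data are assumptions for illustration.

```python
def knn_density(x, data, k):
    # Distance from x to its k-th nearest datapoint
    distances = sorted(abs(xn - x) for xn in data)
    radius = distances[k - 1]
    # Interval [x - radius, x + radius]; zero if x coincides with k datapoints
    volume = 2.0 * radius
    return k / (len(data) * volume)

# Hypothetical observations
data = [0.0, 1.0, 2.0, 3.0]
print(knn_density(0.5, data, 2))
print(knn_density(10.0, data, 2))  # far from the data: large V, small density
```

Far from the data the estimate decays only like 1/|x|, which is why its integral over the whole line does not converge.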

Nearest-neighbour for Classification
[Figure: panels (a) and (b), axes $x_1$ and $x_2$]
- K Nearest neighbour is often used for classification
- Classification: predict labels $t_i$ from $x_i$
  - e.g. $x_i \in \mathbb{R}^2$ and $t_i \in \{0, 1\}$, 3-nearest neighbour
  - K = 1 referred to as nearest-neighbour
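
A brute-force sketch of K-nearest-neighbour classification for points in $\mathbb{R}^2$ with labels in {0, 1}. Euclidean distance, majority voting, the function name, and the toy points are assumptions of this example; it uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k=3):
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    # Majority vote among the labels of the k closest training points
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: two well-separated clusters
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_classify((0.5, 0.5), points, labels, k=3))
print(knn_classify((4.0, 4.0), points, labels, k=1))  # K = 1: plain nearest-neighbour
```

Each query scans all N training points, which is the cost that data structures such as KD-trees aim to reduce.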

Nearest-neighbour for Classification
- Good baseline method
  - Slow, but can use fancy data structures for efficiency (KD-trees, Locality Sensitive Hashing)
- Nice theoretical properties
  - As we obtain more training data points, space becomes more filled with labelled data
  - As $N \to \infty$, error is no more than twice the Bayes error

Bayes Error
[Figure: joint densities $p(x, C_1)$ and $p(x, C_2)$ over x, with thresholds $x_0$ and $\hat{x}$ and decision regions $R_1$, $R_2$]
- Best classification possible given features
- Two classes, PDFs shown
- Decision rule: $C_1$ if $x < \hat{x}$; makes errors on red, green, and blue regions
- Optimal decision rule: $C_1$ if $x < x_0$; Bayes error is area of green and blue regions
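
The Bayes error can be checked numerically on a toy problem. The sketch assumes two classes with equal priors and unit-variance Gaussian class-conditionals centred at 0 and 2; these densities, the integration range, and all function names are assumptions chosen for illustration, not the distributions in the figure.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def joint_c1(x):
    # p(x, C1) = p(x | C1) p(C1), assumed prior 0.5
    return 0.5 * gaussian_pdf(x, 0.0, 1.0)

def joint_c2(x):
    # p(x, C2) = p(x | C2) p(C2), assumed prior 0.5
    return 0.5 * gaussian_pdf(x, 2.0, 1.0)

def error_rate(threshold, lo=-10.0, hi=12.0, steps=22000):
    # Error of the rule "C1 if x < threshold", by midpoint integration
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        # Below the threshold we answer C1, so the C2 mass there is misclassified
        total += (joint_c2(x) if x < threshold else joint_c1(x)) * dx
    return total

def bayes_error(lo=-10.0, hi=12.0, steps=22000):
    # The optimal rule errs only on the smaller joint density at each x
    dx = (hi - lo) / steps
    return sum(min(joint_c1(lo + (i + 0.5) * dx), joint_c2(lo + (i + 0.5) * dx)) * dx
               for i in range(steps))

print(bayes_error())     # error of the optimal rule
print(error_rate(1.0))   # threshold where the joint densities cross
print(error_rate(1.5))   # a shifted threshold does worse
```

In this symmetric setup the two joint densities cross at x = 1, so that threshold attains the Bayes error and any other threshold adds extra error mass.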

Conclusion
- Readings: Ch. 2.5
- Kernel density estimation
  - Model density p(x) using kernels around training datapoints
- Nearest neighbour
  - Model density or perform classification using nearest training datapoints
- Multivariate Gaussian
  - Needed for next week's lectures; if you need a refresher, read pp
