Machine Learning Lecture 2


Machine Learning Lecture 2: Probability Density Estimation. 16.04.2015. Bastian Leibe, RWTH Aachen. http://www.vision.rwth-aachen.de, leibe@vision.rwth-aachen.de. Many slides adapted from B. Schiele.

Announcements. Course webpage: http://www.vision.rwth-aachen.de/teaching/ — slides will be made available on the webpage. L2P electronic repository: exercises and supplementary materials will be posted on the L2P. Please subscribe to the lecture on the Campus system! Important to get email announcements and L2P access!

Announcements. Exercise sheet 1 is now available on L2P: Bayes decision theory, Maximum Likelihood, kernel density estimation / k-NN. Submit your results to Ishrat/Michael by the evening of 29.04. Working in teams (of up to 3 people) is encouraged. Who is not part of an exercise team yet?

Course Outline. Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation. Discriminative Approaches (5 weeks): Linear Discriminant Functions, Support Vector Machines, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns. Generative Models (4 weeks): Bayesian Networks, Markov Random Fields.

Topics of This Lecture. Bayes Decision Theory: basic concepts, minimizing the misclassification rate, minimizing the expected loss, discriminant functions. Probability Density Estimation: general concepts, Gaussian distribution. Parametric Methods: Maximum Likelihood approach, Bayesian vs. Frequentist views on probability, Bayesian Learning.

Recap: Bayes Decision Theory Concepts. Concept 1: Priors (a priori probabilities) p(C_k): what we can tell about the probability before seeing the data. Example: two classes C_1 = a, C_2 = b with p(C_1) = 0.75 and p(C_2) = 0.25. In general: Σ_k p(C_k) = 1. Slide credit: Bernt Schiele

Recap: Bayes Decision Theory Concepts. Concept 2: Conditional probabilities p(x|C_k). Let x be a feature vector; x measures/describes certain properties of the input, e.g. number of black pixels, aspect ratio, ... p(x|C_k) describes the likelihood of x for class C_k. (Plot: class-conditional densities p(x|a) and p(x|b) over x.) Slide credit: Bernt Schiele

Bayes Decision Theory Concepts. Concept 3: Posterior probabilities p(C_k|x). We are typically interested in the a posteriori probability, i.e. the probability of class C_k given the measurement vector x. Bayes' theorem: p(C_k|x) = p(x|C_k) p(C_k) / p(x) = p(x|C_k) p(C_k) / Σ_i p(x|C_i) p(C_i). Interpretation: Posterior = (Likelihood × Prior) / Normalization Factor. Slide credit: Bernt Schiele
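
A minimal Python sketch of this computation (not part of the original slides; the class names and the likelihood values are made up for illustration):

```python
# Bayes' theorem for two classes: posterior = likelihood * prior / evidence.
# Illustrative numbers only (not from the lecture).
priors = {"a": 0.75, "b": 0.25}          # p(C_k)
likelihoods = {"a": 0.10, "b": 0.60}     # p(x | C_k) for one observed x

evidence = sum(likelihoods[c] * priors[c] for c in priors)           # p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

print(posteriors)  # {'a': 0.333..., 'b': 0.666...} -> decide for class b
```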

Bayes Decision Theory. (Plots: the likelihoods p(x|a) and p(x|b); the products likelihood × prior, p(x|a)p(a) and p(x|b)p(b), with the decision boundary; and the posteriors p(a|x) and p(b|x).) Posterior = (Likelihood × Prior) / Normalization Factor. Slide credit: Bernt Schiele

Bayes Decision Theory. Goal: minimize the probability of a misclassification, p(error) = ∫_{R_1} p(C_2|x) p(x) dx + ∫_{R_2} p(C_1|x) p(x) dx. The green and blue regions stay constant; only the size of the red region varies! Image source: C.M. Bishop, 2006

Bayes Decision Theory. Optimal decision rule: decide for C_1 if p(C_1|x) > p(C_2|x). This is equivalent to p(x|C_1) p(C_1) > p(x|C_2) p(C_2), which is again equivalent to the likelihood-ratio test p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1), where the right-hand side acts as the decision threshold. Slide credit: Bernt Schiele

Generalization to More Than 2 Classes. Decide for class k whenever it has the greatest posterior probability of all classes: p(C_k|x) > p(C_j|x) ∀ j ≠ k, equivalently p(x|C_k) p(C_k) > p(x|C_j) p(C_j) ∀ j ≠ k. Likelihood-ratio test: p(x|C_k) / p(x|C_j) > p(C_j) / p(C_k) ∀ j ≠ k. Slide credit: Bernt Schiele

Bayes Decision Theory. Decision regions: R_1, R_2, R_3, ... Slide credit: Bernt Schiele

Classifying with Loss Functions. Generalization to decisions with a loss function: differentiate between the possible decisions and the possible true classes. Example: medical diagnosis. Decisions: diagnosis is "sick" or "healthy" (or: further examination necessary). Classes: patient is sick or healthy. The cost may be asymmetric: loss(decision = healthy | patient = sick) >> loss(decision = sick | patient = healthy). Slide credit: Bernt Schiele

Classifying with Loss Functions. In general, we can formalize this by introducing a loss matrix L_kj, where L_kj = loss for decision C_j if the truth is C_k. Example: cancer diagnosis, with rows indexed by the true class and columns by the decision (the loss matrix L_cancer diagnosis is given on the slide).

Classifying with Loss Functions. Loss functions may be different for different actors. Example (columns: invest / don't invest):
L_stocktrader(subprime) = ( −½ c_gain   0 ;   0   0 ),
L_bank(subprime) = ( −½ c_gain   0 ;   ·   0 ).
Different loss functions may lead to different Bayes optimal strategies.

Minimizing the Expected Loss. The optimal solution is the one that minimizes the loss. But the loss function depends on the true class, which is unknown. Solution: minimize the expected loss E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx. This can be done by choosing the regions R_j such that each x is assigned to the decision j that minimizes Σ_k L_kj p(C_k|x), which is easy to do once we know the posterior class probabilities p(C_k|x).

Minimizing the Expected Loss. Example: 2 classes C_1, C_2 and 2 decisions α_1, α_2. Loss function: L(α_j|C_k) = L_kj. Expected loss (= risk R) for the two decisions: R(α_1|x) = L_11 p(C_1|x) + L_21 p(C_2|x) and R(α_2|x) = L_12 p(C_1|x) + L_22 p(C_2|x). Goal: decide such that the expected loss is minimized, i.e. decide α_1 if R(α_2|x) > R(α_1|x). Slide credit: Bernt Schiele

Minimizing the Expected Loss. Decide α_1 if
R(α_2|x) > R(α_1|x)
⟺ L_12 p(C_1|x) + L_22 p(C_2|x) > L_11 p(C_1|x) + L_21 p(C_2|x)
⟺ (L_12 − L_11) p(C_1|x) > (L_21 − L_22) p(C_2|x)
⟺ (L_12 − L_11) / (L_21 − L_22) > p(C_2|x) / p(C_1|x) = p(x|C_2) p(C_2) / (p(x|C_1) p(C_1))
⟺ p(x|C_1) / p(x|C_2) > (L_21 − L_22) / (L_12 − L_11) · p(C_2) / p(C_1).
This is the adapted decision rule taking the loss into account. Slide credit: Bernt Schiele
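
A minimal sketch of minimum-expected-loss classification (not from the slides; the loss matrix and posterior values below are illustrative):

```python
import numpy as np

# Minimum-expected-loss decision: pick the decision a that minimizes
# R(a|x) = sum_k L[k, a] * p(C_k|x).  Illustrative numbers only.
L = np.array([[0.0, 1000.0],   # true class C_1: loss for deciding C_1, C_2
              [1.0,    0.0]])  # true class C_2: loss for deciding C_1, C_2

posterior = np.array([0.3, 0.7])      # p(C_1|x), p(C_2|x) for one input x

expected_loss = posterior @ L         # R(a|x) for every decision a
decision = np.argmin(expected_loss)   # 0-based index of the chosen class
print(expected_loss, "-> decide class", decision + 1)   # [0.7, 300.0] -> class 1
```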

The Reject Option. Classification errors arise from regions where the largest posterior probability p(C_k|x) is significantly less than 1. These are the regions where we are relatively uncertain about class membership. For some applications, it may be better to reject the automatic decision entirely in such a case and, e.g., consult a human expert. Image source: C.M. Bishop, 2006

Discriminant Functions. Formulate classification in terms of comparisons between discriminant functions y_1(x), ..., y_K(x): classify x as class C_k if y_k(x) > y_j(x) ∀ j ≠ k. Examples (Bayes Decision Theory): y_k(x) = p(C_k|x), y_k(x) = p(x|C_k) p(C_k), y_k(x) = log p(x|C_k) + log p(C_k). Slide credit: Bernt Schiele
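
A small Python sketch of classification with the log-discriminant y_k(x) = log p(x|C_k) + log p(C_k) (our illustration, assuming 1D Gaussian class-conditionals with made-up means, variances, and priors):

```python
import numpy as np

# Classify by the largest discriminant y_k(x) = log p(x|C_k) + log p(C_k).
means  = np.array([0.0, 3.0])     # illustrative class means
sigmas = np.array([1.0, 1.5])     # illustrative class standard deviations
priors = np.array([0.6, 0.4])     # illustrative priors p(C_k)

def log_gaussian(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def classify(x):
    y = log_gaussian(x, means, sigmas) + np.log(priors)   # y_k(x) for all k
    return np.argmax(y)                                    # largest discriminant wins

print(classify(0.5), classify(2.8))   # -> 0, 1 (class indices)
```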

Different Views on the Decision Problem. Generative methods: y_k(x) ∝ p(x|C_k) p(C_k). First determine the class-conditional densities for each class individually and separately infer the prior class probabilities; then use Bayes' theorem to determine class membership. Discriminative methods: y_k(x) = p(C_k|x). First solve the inference problem of determining the posterior class probabilities; then use decision theory to assign each new x to its class. Alternative: directly find a discriminant function y_k(x) which maps each input x directly onto a class label.

Topics of This Lecture. Bayes Decision Theory: basic concepts, minimizing the misclassification rate, minimizing the expected loss, discriminant functions. Probability Density Estimation: general concepts, Gaussian distribution. Parametric Methods: Maximum Likelihood approach, Bayesian vs. Frequentist views on probability, Bayesian Learning.

Probability Density Estimation. Up to now: Bayes optimal classification, based on the probabilities p(x|C_k) p(C_k). How can we estimate (= learn) those probability densities? Supervised training case: data and class labels are known. Estimate the probability density p(x|C_k) for each class C_k separately. (For simplicity of notation, we will drop the class label C_k in the following.) Slide credit: Bernt Schiele

Probability Density Estimation. Data: x_1, x_2, x_3, x_4, ... Estimate: p(x). Methods: parametric representations, non-parametric representations, mixture models (next lecture). Slide credit: Bernt Schiele

The Gaussian (or Normal) Distribution. One-dimensional case, with mean μ and variance σ²:
N(x|μ, σ²) = 1/√(2πσ²) · exp{ −(x − μ)² / (2σ²) }.
Multi-dimensional case, with mean μ and covariance Σ:
N(x|μ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) · exp{ −½ (x − μ)^T Σ^{−1} (x − μ) }.
Image source: C.M. Bishop, 2006
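
A direct Python sketch of these two density formulas (illustrative only; the function names are ours):

```python
import numpy as np

# Evaluate the 1D and D-dimensional Gaussian densities defined above.
def gaussian_1d(x, mu, sigma2):
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gaussian_nd(x, mu, cov):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi)**(d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

print(gaussian_1d(0.0, 0.0, 1.0))                        # ~0.3989
print(gaussian_nd(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592
```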

Gaussian Distribution Properties. Central Limit Theorem: the distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. In practice, the convergence to a Gaussian can be very rapid. This makes the Gaussian interesting for many applications. Example: N uniform [0,1] random variables. Image source: C.M. Bishop, 2006
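
A quick simulation sketch of this example (our illustration; the values of N and the number of trials are arbitrary):

```python
import numpy as np

# Central Limit Theorem illustration: the mean of N uniform[0,1] variables
# concentrates and looks increasingly Gaussian as N grows.
rng = np.random.default_rng(0)
for N in (1, 2, 10):
    means = rng.uniform(0.0, 1.0, size=(100_000, N)).mean(axis=1)
    print(f"N={N:2d}  mean={means.mean():.3f}  std={means.std():.3f}")
    # Expected: mean -> 0.5, std -> 1/sqrt(12*N) ~ 0.289, 0.204, 0.091
```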

Gaussian Distribution Properties. Quadratic form: N(x|μ, Σ) depends on x only through the exponent Δ² = (x − μ)^T Σ^{−1} (x − μ). Here, Δ is often called the Mahalanobis distance from μ to x. Shape of the Gaussian: Σ is a real, symmetric matrix. We can therefore decompose it into its eigenvectors, Σ = Σ_{i=1}^{D} λ_i u_i u_i^T, and thus obtain Δ² = Σ_i y_i² / λ_i with y_i = u_i^T (x − μ). The density is constant on ellipsoids with main directions along the eigenvectors u_i and scaling factors √λ_i. Image source: C.M. Bishop, 2006
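
A small sketch of the Mahalanobis distance computed via this eigendecomposition (our code; Σ and x are made-up values):

```python
import numpy as np

# Mahalanobis distance via the eigendecomposition of the covariance matrix.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.zeros(2)
x = np.array([1.5, -0.5])

lam, U = np.linalg.eigh(Sigma)       # Sigma = sum_i lam_i u_i u_i^T
y = U.T @ (x - mu)                   # coordinates along the eigenvectors
d2_eig = np.sum(y**2 / lam)          # Delta^2 = sum_i y_i^2 / lam_i
d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

print(d2_eig, d2_direct)             # both give the same squared distance
```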

Gaussian Distribution Properties. Special cases: a full covariance matrix Σ = [σ_ij] gives a general ellipsoid shape; a diagonal covariance matrix Σ = diag{σ_i²} gives an axis-aligned ellipsoid; uniform variance Σ = σ² I gives a hypersphere. Image source: C.M. Bishop, 2006

Gaussian Distribution Properties. The marginals of a Gaussian are again Gaussians. Image source: C.M. Bishop, 2006

Topics of This Lecture. Bayes Decision Theory: basic concepts, minimizing the misclassification rate, minimizing the expected loss, discriminant functions. Probability Density Estimation: general concepts, Gaussian distribution. Parametric Methods: Maximum Likelihood approach, Bayesian vs. Frequentist views on probability, Bayesian Learning.

Parametric Methods. Given: data X = {x_1, x_2, ..., x_N} and a parametric form of the distribution with parameters θ, e.g. for a Gaussian distribution: θ = (μ, σ). Learning: estimation of the parameters θ. Likelihood of θ: the probability that the data X have indeed been generated from a probability density with parameters θ, L(θ) = p(X|θ). Slide adapted from Bernt Schiele

Maximum Likelihood Approach. Computation of the likelihood: single data point p(x_n|θ); assumption: all data points are independent, so L(θ) = p(X|θ) = Π_{n=1}^{N} p(x_n|θ). Negative log-likelihood: E(θ) = −ln L(θ) = −Σ_{n=1}^{N} ln p(x_n|θ). Estimation of the parameters θ (learning): maximize the likelihood, i.e. minimize the negative log-likelihood. Slide credit: Bernt Schiele

Maximum Likelihood Approach. Likelihood: L(θ) = p(X|θ) = Π_{n=1}^{N} p(x_n|θ). We want to obtain θ̂ such that L(θ̂) is maximized. (Plot: p(X|θ) as a function of θ, with its maximum at θ̂.) Slide credit: Bernt Schiele

Maximum Likelihood Approach. Minimizing the negative log-likelihood: how do we minimize a function? Take the derivative and set it to zero:
∂E(θ)/∂θ = −∂/∂θ Σ_{n=1}^{N} ln p(x_n|θ) = −Σ_{n=1}^{N} [∂p(x_n|θ)/∂θ] / p(x_n|θ) = 0.
Negative log-likelihood for the Normal distribution (1D case):
E(θ) = −Σ_{n=1}^{N} ln p(x_n|μ, σ) = −Σ_{n=1}^{N} ln [ 1/√(2πσ²) · exp{ −(x_n − μ)² / (2σ²) } ].

Maximum Likelihood Approach. Minimizing the negative log-likelihood with respect to μ, using p(x_n|μ, σ) = 1/√(2πσ²) · exp{ −(x_n − μ)² / (2σ²) }:
∂E(μ, σ)/∂μ = −Σ_{n=1}^{N} [∂p(x_n|μ, σ)/∂μ] / p(x_n|μ, σ)
= −Σ_{n=1}^{N} (x_n − μ)/σ²
= −(1/σ²) ( Σ_{n=1}^{N} x_n − Nμ ).
Setting ∂E(μ, σ)/∂μ = 0 gives μ̂ = (1/N) Σ_{n=1}^{N} x_n.

Maximum Likelihood Approach. We thus obtain the sample mean μ̂ = (1/N) Σ_{n=1}^{N} x_n. In a similar fashion, we get the sample variance σ̂² = (1/N) Σ_{n=1}^{N} (x_n − μ̂)². θ̂ = (μ̂, σ̂) is the Maximum Likelihood estimate for the parameters of a Gaussian distribution. This is a very important result. Unfortunately, it is wrong...

Maximum Likelihood Approach. Or not wrong, but rather biased. Assume the samples x_1, x_2, ..., x_N come from a true Gaussian distribution with mean μ and variance σ². We can now compute the expectations of the ML estimates with respect to the data set values. It can be shown that E(μ_ML) = μ and E(σ²_ML) = ((N − 1)/N) σ². The ML estimate will underestimate the true variance. Corrected estimate: σ̃² = (N/(N − 1)) σ²_ML = (1/(N − 1)) Σ_{n=1}^{N} (x_n − μ̂)².
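
A quick simulation sketch of this bias (our code; N, the number of trials, and the true parameters are arbitrary):

```python
import numpy as np

# Empirically check E[sigma^2_ML] = (N-1)/N * sigma^2 for small N,
# and compare with the corrected 1/(N-1) estimator.
rng = np.random.default_rng(2)
N, trials, sigma2 = 5, 200_000, 1.0

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
var_ml = samples.var(axis=1, ddof=0)      # biased ML estimator (1/N)
var_corr = samples.var(axis=1, ddof=1)    # corrected estimator (1/(N-1))

print(var_ml.mean(), var_corr.mean())     # ~0.8 (= (N-1)/N * sigma^2) vs ~1.0
```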

Maximum Likelihood Limitations. Maximum Likelihood has several significant limitations. It systematically underestimates the variance of the distribution! E.g. consider the case N = 1, X = {x_1}: the Maximum Likelihood estimate is μ̂ = x_1 with σ̂ = 0! We say ML overfits to the observed data. We will still often use ML, but it is important to know about this effect. Slide adapted from Bernt Schiele

Deeper Reason. Maximum Likelihood is a Frequentist concept. In the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available. This is in contrast to the Bayesian interpretation: in the Bayesian view, probabilities quantify the uncertainty about certain states or events, and this uncertainty can be revised in the light of new evidence. Bayesians and Frequentists do not like each other too well...

Bayesian vs. Frequentist View. To see the difference, suppose we want to estimate the uncertainty whether the Arctic ice cap will have disappeared by the end of the century. This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times. In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting. If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior: Posterior ∝ Likelihood × Prior. This generally allows us to get better uncertainty estimates in many situations. Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse.

Bayesian Approach to Parameter Learning. Conceptual shift: Maximum Likelihood views the true parameter vector θ as unknown but fixed. In Bayesian learning, we consider θ to be a random variable. This allows us to use knowledge about the parameters θ, i.e. to use a prior for θ. Training data then converts this prior distribution on θ into a posterior probability density. The prior thus encodes knowledge we have about the type of distribution we expect to see for θ. Slide adapted from Bernt Schiele

Bayesian Learning Approach. Bayesian view: consider the parameter vector θ as a random variable. When estimating the parameters, what we compute is
p(x|X) = ∫ p(x, θ|X) dθ, with p(x, θ|X) = p(x|θ, X) p(θ|X) = p(x|θ) p(θ|X),
since, given θ, the estimate p(x|θ, X) does not depend on X anymore; it is entirely determined by the parameter θ (i.e. by the parametric form of the pdf). Therefore
p(x|X) = ∫ p(x|θ) p(θ|X) dθ.
Slide adapted from Bernt Schiele

Bayesian Learning Approach.
p(x|X) = ∫ p(x|θ) p(θ|X) dθ,
p(θ|X) = p(X|θ) p(θ) / p(X) = L(θ) p(θ) / p(X),
p(X) = ∫ p(X|θ) p(θ) dθ = ∫ L(θ) p(θ) dθ.
Inserting this above, we obtain
p(x|X) = ∫ p(x|θ) L(θ) p(θ) / p(X) dθ = ∫ [ p(x|θ) L(θ) p(θ) / ∫ L(θ) p(θ) dθ ] dθ.
Slide credit: Bernt Schiele

Bayesian Learning Approach. Discussion of
p(x|X) = ∫ [ p(x|θ) L(θ) p(θ) / ∫ L(θ) p(θ) dθ ] dθ:
p(x|θ) is the estimate for x based on the parametric form θ; L(θ) is the likelihood of the parametric form θ given the data set X; p(θ) is the prior for the parameters θ; the denominator ∫ L(θ) p(θ) dθ is the normalization, integrating over all possible values of θ. If we now plug in a (suitable) prior p(θ), we can estimate p(x|X) from the data set X.

Bayesian Density Estimation. Discussion of
p(x|X) = ∫ p(x|θ) p(θ|X) dθ = ∫ [ p(x|θ) L(θ) p(θ) / ∫ L(θ) p(θ) dθ ] dθ:
the probability p(θ|X) makes the dependency of the estimate on the data explicit. If p(θ|X) is very small everywhere but large for one θ̂, then p(x|X) ≈ p(x|θ̂). The more uncertain we are about θ, the more we average over all parameter values. Slide credit: Bernt Schiele
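
A rough numerical sketch of this predictive integral (our code, not from the slides; it assumes a 1D Gaussian with known σ, so that the only parameter θ is the mean μ, evaluated on a grid):

```python
import numpy as np

# Numerical sketch of p(x|X) = ∫ p(x|θ) p(θ|X) dθ on a parameter grid.
sigma = 1.0
X = np.array([0.8, 1.2, 1.0, 0.6])                  # observed data (made up)
mu_grid = np.linspace(-5.0, 5.0, 2001)              # grid over the mean mu
dmu = mu_grid[1] - mu_grid[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

prior = gauss(mu_grid, 0.0, 2.0)                                 # p(mu)
likelihood = np.prod(gauss(X[:, None], mu_grid, sigma), axis=0)  # L(mu)
posterior = likelihood * prior
posterior /= posterior.sum() * dmu                               # p(mu|X) on the grid

x_new = 1.5
p_pred = np.sum(gauss(x_new, mu_grid, sigma) * posterior) * dmu  # p(x_new|X)
print(p_pred)
```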

Bayesian Density Estimation. Problem: in the general case, the integration over θ is not possible analytically (or only possible stochastically). Example where an analytical solution is possible: Normal distribution for the data, with σ² assumed known and fixed; estimate the distribution of the mean, p(μ|X) = p(X|μ) p(μ) / p(X). Prior: we assume a Gaussian prior over μ. Slide credit: Bernt Schiele

Bayesian Learning Approach. With the sample mean x̄ = (1/N) Σ_{n=1}^{N} x_n, the Bayes estimate is
μ_N = (σ² μ_0 + N σ_0² x̄) / (σ² + N σ_0²),
1/σ_N² = 1/σ_0² + N/σ².
(Plot: the posterior p(μ|X) for increasing N, with prior mean μ_0 = 0.) Slide adapted from Bernt Schiele. Image source: C.M. Bishop, 2006
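
A small sketch of this closed-form update (our code; the prior parameters and the data are illustrative):

```python
import numpy as np

# Conjugate Bayesian update for the mean of a Gaussian with known variance:
# mu_N = (sigma^2 * mu_0 + N * sigma0^2 * xbar) / (sigma^2 + N * sigma0^2),
# 1/sigma_N^2 = 1/sigma0^2 + N/sigma^2.
sigma = 1.0                  # known data standard deviation
mu_0, sigma_0 = 0.0, 2.0     # Gaussian prior over the mean

rng = np.random.default_rng(3)
X = rng.normal(1.5, sigma, size=20)
N, xbar = len(X), X.mean()

mu_N = (sigma**2 * mu_0 + N * sigma_0**2 * xbar) / (sigma**2 + N * sigma_0**2)
sigma_N2 = 1.0 / (1.0 / sigma_0**2 + N / sigma**2)

print(mu_N, np.sqrt(sigma_N2))   # posterior mean close to xbar, small posterior std
```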

Summary: ML vs. Bayesian Learning. Maximum Likelihood: simple approach, often analytically possible. Problem: the estimate is biased and tends to overfit to the data, so it often needs some correction or regularization. But: the approximation becomes accurate for N → ∞. Bayesian Learning: general approach, avoids the estimation bias through a prior. Problems: need to choose a suitable prior (not always obvious); the integral over θ is often not analytically feasible anymore. But: efficient stochastic sampling techniques are available (see Lecture 15). (In this lecture, we'll use both concepts wherever appropriate.)

References and Further Reading. More information in Bishop's book: Gaussian distribution and ML: Ch. 1.2.4 and 2.3.1-2.3.4; Bayesian Learning: Ch. 1.2.3 and 2.3.6; nonparametric methods: Ch. 2.5. Additional information can be found in Duda & Hart: ML estimation: Ch. 3.2; Bayesian Learning: Ch. 3.3-3.5; nonparametric methods: Ch. 4.1-4.5. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2000.