Non-parametric Methods

Non-parametric Methods Machine Learning Torsten Möller Möller/Mori 1

Reading Chapter 2 of Pattern Recognition and Machine Learning by Bishop (with an emphasis on section 2.5) Möller/Mori 2

Outline Last week Parametric Models Kernel Density Estimation Nearest-neighbour Möller/Mori 3

Last week: model selection / generalization, or the tale of finding good parameters; curve fitting; decision theory; probability theory Möller/Mori 4

Which Degree of Polynomial? [Figure: polynomial fits of different degree M to the training data] A model selection problem: M = 9 gives $E(\mathbf{w}^\star) = 0$, i.e. this is over-fitting Möller/Mori 5

Generalization [Figure: training and test error versus polynomial degree M] Generalization is the holy grail of ML: want good performance for new data. Measure generalization using a separate test set. Use root-mean-squared (RMS) error: $E_{\mathrm{RMS}} = \sqrt{2E(\mathbf{w}^\star)/N}$ Möller/Mori 6
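A minimal sketch of computing this RMS error from the sum-of-squares error (not from the slides; the held-out targets and predictions below are made up):

```python
import numpy as np

# Hypothetical held-out targets and predictions; the data here are made up.
def rms_error(y_pred, y_true):
    N = len(y_true)
    E = 0.5 * np.sum((y_pred - y_true) ** 2)   # sum-of-squares error E(w*)
    return np.sqrt(2.0 * E / N)                # E_RMS = sqrt(2 E(w*) / N)

y_true = np.array([0.1, 0.9, -0.3, 0.5])
y_pred = np.array([0.2, 0.7, -0.1, 0.4])
print(rms_error(y_pred, y_true))               # ~0.158
```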

Decision Theory For a sample x, decide which class ($C_k$) it is from. Ideas: Maximum Likelihood; Minimum Loss/Cost (e.g. misclassification rate); Maximum A Posteriori (MAP) Möller/Mori 7

Decision: Maximum Likelihood Inference step: determine statistics from training data, $p(x, t)$ or $p(x \mid C_k)$. Decision step: determine the optimal $t$ for a test input $x$: $t = \arg\max_k \underbrace{p(x \mid C_k)}_{\text{likelihood}}$ Möller/Mori 8

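A minimal sketch of this maximum-likelihood decision rule (not from the slides): fitting a 1-D Gaussian class-conditional $p(x \mid C_k)$ per class is an assumption made here for illustration, as are the toy data.

```python
import numpy as np

def fit_gaussian(x):
    """Inference step: fit a 1-D Gaussian class-conditional p(x|C_k) by maximum likelihood."""
    return x.mean(), x.std()

def log_likelihood(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Made-up training data for two classes C_0 and C_1
x_c0 = np.array([0.1, 0.3, 0.2, 0.4])
x_c1 = np.array([1.0, 1.2, 0.9, 1.1])
params = [fit_gaussian(x_c0), fit_gaussian(x_c1)]

# Decision step: t = argmax_k p(x|C_k) for a test input x
x_test = 0.8
t = int(np.argmax([log_likelihood(x_test, mu, s) for mu, s in params]))
print(t)   # predicted class index (here: 1)
```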

Parametric Models Problem statement: what is a good probabilistic model (probability distribution that produced our data)? Answer: density estimation. Given a finite set x_1, ..., x_N of observations, model the probability distribution p(x) of a random variable. One standard approach is parametric methods: assume a model (e.g. Gaussian, Bernoulli, Dirichlet, etc.) and fit its parameters to the given observations. Today we will focus on non-parametric methods: rather than a fixed set of parameters (e.g. the weight vector for regression, or µ, Σ for a Gaussian), we have a possibly infinite set of parameters, with one contribution per data point: Kernel density estimation; Nearest-neighbour methods Möller/Mori 11

Histograms Consider the problem of modelling the distribution of brightness values in pictures taken on sunny days versus cloudy days. We could build histograms of pixel values for each class. Möller/Mori 12

Histograms [Figure: histogram density estimates of the same data for several bin widths] E.g. for sunny days: count the number $n_i$ of datapoints (pixels) whose brightness value falls into bin $i$: $p_i = \frac{n_i}{N\Delta_i}$ Sensitive to bin width $\Delta_i$; Discontinuous due to bin edges; In D-dim space with M bins per dimension, $M^D$ bins Möller/Mori 13

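A minimal sketch of histogram density estimation (illustrative, not from the slides; the data and bin width are made up):

```python
import numpy as np

# Made-up 1-D data, e.g. brightness values in [0, 1]
x = np.random.default_rng(0).normal(loc=0.5, scale=0.15, size=200)

delta = 0.05                              # bin width Delta_i (equal for all bins here)
bins = np.arange(0.0, 1.0 + delta, delta)
n_i, edges = np.histogram(x, bins=bins)   # count n_i of datapoints per bin

N = len(x)
p_i = n_i / (N * delta)                   # density estimate p_i = n_i / (N * Delta_i)

print(np.sum(p_i * delta))                # integrates to ~1 over the binned range
```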

Local Density Estimation In a histogram we use nearby points to estimate density. Assume our space is divided into smallish regions $R$ of volume $V$; their probability mass is $P = \int_R p(x)\,dx \approx p(x)V$, where we assume that $p(x)$ is approximately constant in that region. Further, the number of data points per region is roughly $K \approx NP$. Hence, for a small region around $x$, estimate the density as $p(x) = \frac{K}{NV}$, where $K$ is the number of points in the region, $V$ is the volume of the region, and $N$ is the total number of datapoints. Möller/Mori 17

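A quick worked example with made-up numbers: if $N = 1000$ points are observed in total and a small region of volume $V = 0.02$ around $x$ contains $K = 10$ of them, the estimate is $p(x) \approx \frac{K}{NV} = \frac{10}{1000 \times 0.02} = 0.5$.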

Kernel Density Estimation First idea: keep the volume of the neighbourhood, V, constant. Try to keep the idea of using nearby points to estimate density, but obtain a smoother estimate. Estimate the density by placing a small bump at each datapoint; the kernel function k(·) determines the shape of these bumps. Density estimate is $p(x) = \frac{1}{N}\sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right)$ Möller/Mori 21

Kernel Density Estimation [Figure: Gaussian kernel density estimates of the same data for several bandwidths h] Example using a Gaussian kernel: $p(x) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{1/2}} \exp\left\{-\frac{\lVert x - x_n\rVert^2}{2h^2}\right\}$ Möller/Mori 22

Kernel Density Estimation [Figures: shapes of common kernels; the resulting density estimate on example data] Other kernels: Rectangle, Triangle, Epanechnikov. Fast at training time, slow at test time: must keep all datapoints. Sensitive to kernel bandwidth h. Möller/Mori 23-26
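A minimal sketch of kernel density estimation with a Gaussian kernel (illustrative, not from the slides; the training data and bandwidth h are made up):

```python
import numpy as np

def kde_gaussian(x_query, x_train, h):
    """Gaussian KDE: average of a Gaussian bump of width h centred on each datapoint."""
    x_query = np.atleast_1d(x_query)[:, None]              # shape (Q, 1)
    diff = x_query - x_train[None, :]                       # shape (Q, N)
    bumps = np.exp(-diff ** 2 / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)
    return bumps.mean(axis=1)                               # p(x) at each query point

rng = np.random.default_rng(1)
x_train = rng.normal(0.0, 1.0, size=100)                    # made-up training data
xs = np.linspace(-4, 4, 9)
print(kde_gaussian(xs, x_train, h=0.3))                     # sensitive to the choice of h
```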

Nearest-neighbour [Figure: K-nearest-neighbour density estimates of the same data for several K] Instead of relying on the kernel bandwidth to get a proper density estimate, fix the number of nearby points K and let V be the volume containing the K nearest neighbours of x: $p(x) = \frac{K}{NV}$ Note: its integral diverges, so it is not a proper density estimate Möller/Mori 27
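A minimal sketch of the K-nearest-neighbour density estimate in 1-D, where V is taken as the interval around x reaching its K-th nearest training point (illustrative; the data and K are made up):

```python
import numpy as np

def knn_density_1d(x_query, x_train, K):
    """1-D kNN density estimate p(x) = K / (N V), with V the interval reaching the K-th neighbour."""
    N = len(x_train)
    dists = np.sort(np.abs(np.atleast_1d(x_query)[:, None] - x_train[None, :]), axis=1)
    V = 2.0 * dists[:, K - 1]              # interval of radius r_K (distance to K-th neighbour)
    return K / (N * V)

rng = np.random.default_rng(2)
x_train = rng.normal(0.0, 1.0, size=200)   # made-up training data
print(knn_density_1d(np.array([0.0, 2.0]), x_train, K=10))
```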

Nearest-neighbour for Classification [Figure: K-nearest-neighbour classification of 2-D data, panels (a) and (b)] K-nearest-neighbour is often used for classification. Classification: predict labels $t_i$ from $x_i$, e.g. $x_i \in \mathbb{R}^2$ and $t_i \in \{0, 1\}$ with 3-nearest neighbour. K = 1 is referred to as nearest-neighbour. Möller/Mori 28-30
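A minimal sketch of K-nearest-neighbour classification by majority vote (illustrative; the toy data, K, and the Euclidean distance choice are assumptions):

```python
import numpy as np

def knn_classify(x_query, X_train, t_train, K=3):
    """Predict the label of x_query by majority vote among its K nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:K]                     # indices of the K nearest neighbours
    votes = np.bincount(t_train[nearest])               # count labels among the neighbours
    return int(np.argmax(votes))

# Made-up 2-D training data with labels t_i in {0, 1}
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
t_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(np.array([0.15, 0.2]), X_train, t_train, K=3))   # -> 0
print(knn_classify(np.array([0.95, 1.0]), X_train, t_train, K=3))   # -> 1
```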

Nearest-neighbour for Classification Good baseline method. Slow, but can use fancy data structures for efficiency (KD-trees, Locality Sensitive Hashing). Nice theoretical properties: as we obtain more training data points, the space becomes more densely filled with labelled data, and as N → ∞ the error is no more than twice the Bayes error. Möller/Mori 31

Bayes Error [Figure: joint densities p(x, C_1) and p(x, C_2), decision boundary x̂, optimal boundary x_0, regions R_1 and R_2] Best classification possible given the features. Two classes, PDFs shown. Decision rule: C_1 if x ≤ x̂; makes errors on the red, green, and blue regions. Optimal decision rule: C_1 if x ≤ x_0; the Bayes error is the area of the green and blue regions. Möller/Mori 32

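As a hedged illustration of the Bayes error (not from the slides): for two equally likely classes with Gaussian class-conditionals, the Bayes error can be approximated by numerically integrating the smaller of the two joint densities at each x; all numbers below are made up.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Two equally likely classes with made-up class-conditional densities
xs = np.linspace(-6.0, 8.0, 10001)
joint_c1 = 0.5 * gaussian_pdf(xs, mu=0.0, sigma=1.0)   # p(x, C_1) = p(C_1) p(x | C_1)
joint_c2 = 0.5 * gaussian_pdf(xs, mu=2.0, sigma=1.0)   # p(x, C_2) = p(C_2) p(x | C_2)

# With the optimal decision rule, the error contribution at each x is the smaller joint
# density; integrating it (here with a simple Riemann sum) gives the Bayes error.
dx = xs[1] - xs[0]
bayes_error = np.sum(np.minimum(joint_c1, joint_c2)) * dx
print(bayes_error)   # ~0.159 for these particular numbers
```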

Conclusion Readings: Ch. 2.5. Kernel density estimation: model the density p(x) using kernels around the training datapoints. Nearest neighbour: model the density or perform classification using the nearest training datapoints. Multivariate Gaussian: needed for next week's lectures; if you need a refresher, read pp. 78-81. Möller/Mori 35