Supervised Learning: Non-parametric Estimation

Size: px
Start display at page:

Download "Supervised Learning: Non-parametric Estimation"

Transcription

1 Supervised Learning: Non-parametric Estimation Edmondo Trentin March 18, 2018

2 Non-parametric Estimates No assumptions are made on the form of the pdfs 1. There are 3 major instances of non-parametric estimates: 1. We want P(ω i x) = p(x ω i )P(ω i ) p(x) and we try and estimate the pdf p(x ω i ) relying on the available data 2. We want P(ω i x) = p(x ω i )P(ω i ) p(x) and we try and estimate P(ω i x), that is the discriminant function (posterior probability) 3. the feature space X is first projected onto a sub-space Y (having reduced dimensionality) via ϕ : X Y such that, given x and letting y = ϕ(x), computing P(ω i y) = p(y ω i )P(ω i ) p(y) turns out to be easier than computing P(ω i x) = p(x ω i )P(ω i ) p(x) 1 This is the fundamental thing: we recognize we do not know anything!

3 Non-parametric pdf estimation Let us elaborate on the relation between pdf and probability: what is the probability P that a generic pattern x, drawn from a certain pdf p(.), belongs to an arbitrary region R of the feature space? We have: P = p(x)dx (1) R Thence, P is an averaged (over R) version of p(x). If we can estimate P, then we we can come up with an estimate of the averaged version of p(x). Assume we collected a random sample of n data, x 1,..., x n, independently and identically distributed (iid) according to p(x). If k out of n data are in R, we can estimate P via the relative frequency: P k n (2)

4 If p(x) is continuous and has little variation in R (e.g. if R is small) then: p(x)dx p(x 0 )V (3) R where x 0 R and V is the volume of R. Thus far we have P = R p(x)dx and P k n, thence: p(x 0 ) k/n V (4) Practical problem: to retrieve p(x 0 ) instead of its version averaged over R, V should become 0. Since n is fixed in real-world cases, this would drive k to 0 as well, thenceforth making the estimate p(x 0 ) 0 V useless.

5 Some theoretical issues Let x 0 be any pattern of interest. Let us imagine the number n of available data may grow unbounded. We can then build a sequence of regions R 1, R 2,..., R n,... s.t. x 0 R i i as n increases. We use R i to estimate p(x 0 ) from i data. Let: V n be the volume of R n k n be the number of data (out of n) in R n p n (x 0 ) be the n-th estimate of p(x 0 ), i.e., p n (x 0 ) = kn/n V n Asymptotic necessary and sufficient conditions that ensure p n (x 0 ) p(x 0 ): 1. lim n V n = 0 2. lim n k n = 3. lim n k n /n = 0 (to guarantee convergence) How do we satisfy 1, 2 and 3? There are two complementary approaches that make sure p n (x 0 ) p(x 0 ) in probability: 1. Fix a proper volume, say V n = 1/ (n), and determine k n /n consequently (Parzen Window) 2. Fix k n, e.g. k n = n, and determine V n consequently, in such a way that exactly k n patterns fall in R n (k n -nearest neighbor)

6 Parzen Window To fix ideas, let us assume that R n is an hypercube having edge h n (thus, V n = hn d ). We define a window function (or, kernel): { 1 uj 1/2 j = 1, 2,..., d ϕ(u) = 0 else ( x0 x It is seen that ϕ i is a kernel having value 1 only within the h n ) hypercube centered in x 0 and having edge h n. The number k n of data in this hypercube is: n ( ) x0 x i k n = ϕ (6) i=1 h n (5) Bearing in mind that p n (x 0 ) = kn/n V n, we have: p n (x 0 ) = 1 n ( 1 x0 x i ϕ n V n i=1 h n ) (7)

7 The approach can be extended to window functions ϕ ( ) of different form. The equation: p n (x 0 ) = 1 n ( ) 1 xi x 0 ϕ n V n i=1 tells us that, using the n available data, the estimate p n (.) of the unknown pdf p(.) is obtained by averaging the kernel function ϕ ( ) over x 1,..., x n, i.e. by interpolation 2. We need a ϕ ( ) such that p n ( ) is a pdf, that is p n ( ) 0 and pn (x)dx = 1. If V n = hn d, this is guaranteed by the (sufficient) conditions: 1. ϕ(u) 0 u R d 2. ϕ(u)du = 1 Exercise: check it out. A popular and effective choice for ϕ ( ) is the Gaussian kernel, namely: ϕ(u) = N(u; 0, 1I ) (9) 2 In fact, I can think of each kernel as of a window centered in x i and evaluated on x 0, since ϕ(.). is a symmetric function. h n (8)

8 Q: how deep is the impact of the window width h n on the quality of the estimate p n (x)? Let us define: δ n (x) = 1 ( ) x ϕ (10) V n Since we can write Thence: p n (x) = 1 n p n (x) = 1 n n i=1 h n ( 1 x xi ϕ V n h n n ( ) δ n x xi i=1 ) (11) (12) if h n is large, then δ n (x) is very smooth (having small variation), and p n (x) is yielded by the superposition of wide, smooth functions; if h n is small, then δ n (x) is very peaked around x = x i. Since δn (x x i )dx = 1 V n ϕ ( x xi h n ) dx = ϕ(u)du = 1, if h n 0 then δ n (x x i ) converges to a Dirac s Delta centered in x i

9 p(x) = N(x; 0, 1) ϕ(u) = N(u; 0, 1) hn = h1 / n n pn (x) = 1X 1 ϕ n hn i=1 x xi hn

10 2.5 < x < < x < 2 p(x) = 0 otherwise ϕ(u) = N(u; 0, 1) hn = h1 / n

11 k n -Nearest Neighbor In Parzen Window, the asymptotic conditions are guaranteed by taking h n = h 1 / n. Unfortunately, in the finite-sample case the estimate is affected by the choice for the initial edge (bandwidth) h 1 : if h 1 is too small, then p n ( ) = 0 if h 1 is too big, then p n ( ) = E[p( )] The inconvenience could be overcome by choosing the volume according to the nature of the data! 1. define k n as a function of n 2. to estimate p(x 0 ), consider a small ball around x 0 and let it grow until it embraces k n data (the k n -nearest neighbors) 3. Eventually: p n (x 0 ) = k n/n V n as we did in the generic non-parametric pdf estimation setup.

12 Good asymptotic behavior p n (x 0 ) p(x 0 ) is guaranteed once these two necessary and sufficient conditions hold: 1. lim n k n = 2. lim n k n n = 0 For instance, by letting k n = n the conditions are satisfied. Moreover: V n = 1 p(x 0 ) n (13) that is, exactly like in Parzen Window, we have V n = V 1 / n but V 1 = 1/p(x 0 ), i.e., V 1 is not chosen arbitrarily but it is uniquely determined by the nature of the data. In practice, nonetheless, we usually set k n = k 1 n where k1 is determined empirically.

13

14 Non-parametric Decision Rule Let {x 1, x 2,..., x n } be a supervised sample Take a ball (of volume V ) around x 0, s.t. it embraces k patterns Among these k patterns, let k i be the number of patterns of class ω i Since p n (x 0 ) = k/n V, we have: p n (x 0, ω i ) = k i/n V Thence, an estimate P n (ω i x 0 ) of P(ω i x 0 ) is given by k p n (x 0, ω i ) i /n P n (ω i x 0 ) = c j=1 p n(x 0, ω j ) = V c j=1 k j /n V = k i k i.e., the fraction of data of class ω i that falls within the ball under consideration. The ball (that is, its volume V ) may be fixed either using the Parzen Window or the k n -nearest neighbors philosophies.

15 Nearest Neighbor Algorithm The aforementioned decision rule, when applied along with the k n -NN philosophy and k n = 1, takes the (simple, yet effective) form of a popular classification algorithm. Let Y n = {x 1, x 2,..., x n } be the supervised sample, and let x 0 be a pattern to be classified. Nearest-neighbor decision rule: assign x 0 to class ω i if and only if (i) x n Y n is the nearest 3 pattern to x 0, among all those in Y n ; and (ii) x n belongs to class ω i. It is possible to show that, asymptotically, the performance of NN for n is as good as that of k n -NN. Intuitively, this is due to the fact that the probability that x n belongs to ω i is P(ω i x n). If we had many data (say, n ), then P(ω i x n) P(ω i x 0 ). Using NN, indeed, I assign x 0 to ω i due to ω i having taken place in x 0 with the highest probability, i.e. P(ω i x 0 ), 3 According to Euclidean distance.

16 K-Nearest Neighbor Given the sample Y n = {x 1, x 2,..., x n } and the pattern x 0 to be classified, we can apply this K-NN decision rule: consider the k patterns in Y n that are the nearest to x 0 (in terms of Euclidean distance). Assign x 0 to ω i if and only if the latter is the class having the highest relative frequency within this sub-sample of k patterns. Remarks: K-NN is in the spirit of k n -NN while k n -NN estimates a pdf, K-NN estimates P(ω i x) for n, the asymptotic behavior of K-NN tends to be optimal (i.e., Bayesian) in 2-class cases, an odd value for k is used in order to break ties more accurate decisions would be taken when k, but in the finite-sample case we cannot move away from x 0 too far (trade-off)

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Nonparametric Methods Lecture 5

Nonparametric Methods Lecture 5 Nonparametric Methods Lecture 5 Jason Corso SUNY at Buffalo 17 Feb. 29 J. Corso (SUNY at Buffalo) Nonparametric Methods Lecture 5 17 Feb. 29 1 / 49 Nonparametric Methods Lecture 5 Overview Previously,

More information

The Gaussian distribution

The Gaussian distribution The Gaussian distribution Probability density function: A continuous probability density function, px), satisfies the following properties:. The probability that x is between two points a and b b P a

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Probability Models for Bayesian Recognition

Probability Models for Bayesian Recognition Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIAG / osig Second Semester 06/07 Lesson 9 0 arch 07 Probability odels for Bayesian Recognition Notation... Supervised Learning for Bayesian

More information

probability of k samples out of J fall in R.

probability of k samples out of J fall in R. Nonparametric Techniques for Density Estimation (DHS Ch. 4) n Introduction n Estimation Procedure n Parzen Window Estimation n Parzen Window Example n K n -Nearest Neighbor Estimation Introduction Suppose

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Kernel-based density. Nuno Vasconcelos ECE Department, UCSD

Kernel-based density. Nuno Vasconcelos ECE Department, UCSD Kernel-based density estimation Nuno Vasconcelos ECE Department, UCSD Announcement last week of classes we will have Cheetah Day (exact day TBA) what: 4 teams of 6 people each team will write a report

More information

Machine Learning. Theory of Classification and Nonparametric Classifier. Lecture 2, January 16, What is theoretically the best classifier

Machine Learning. Theory of Classification and Nonparametric Classifier. Lecture 2, January 16, What is theoretically the best classifier Machine Learning 10-701/15 701/15-781, 781, Spring 2008 Theory of Classification and Nonparametric Classifier Eric Xing Lecture 2, January 16, 2006 Reading: Chap. 2,5 CB and handouts Outline What is theoretically

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Bayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI

Bayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI Bayes Classifiers CAP5610 Machine Learning Instructor: Guo-Jun QI Recap: Joint distributions Joint distribution over Input vector X = (X 1, X 2 ) X 1 =B or B (drinking beer or not) X 2 = H or H (headache

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods) Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March

More information

Expect Values and Probability Density Functions

Expect Values and Probability Density Functions Intelligent Systems: Reasoning and Recognition James L. Crowley ESIAG / osig Second Semester 00/0 Lesson 5 8 april 0 Expect Values and Probability Density Functions otation... Bayesian Classification (Reminder...3

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

12 - Nonparametric Density Estimation

12 - Nonparametric Density Estimation ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6

More information

Metric-based classifiers. Nuno Vasconcelos UCSD

Metric-based classifiers. Nuno Vasconcelos UCSD Metric-based classifiers Nuno Vasconcelos UCSD Statistical learning goal: given a function f. y f and a collection of eample data-points, learn what the function f. is. this is called training. two major

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Artificial Neural Networks (ANN)

Artificial Neural Networks (ANN) Artificial Neural Networks (ANN) Edmondo Trentin April 17, 2013 ANN: Definition The definition of ANN is given in 3.1 points. Indeed, an ANN is a machine that is completely specified once we define its:

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Maximum Likelihood Estimation. only training data is available to design a classifier

Maximum Likelihood Estimation. only training data is available to design a classifier Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

STA 414/2104, Spring 2014, Practice Problem Set #1

STA 414/2104, Spring 2014, Practice Problem Set #1 STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

More on Estimation. Maximum Likelihood Estimation.

More on Estimation. Maximum Likelihood Estimation. More on Estimation. In the previous chapter we looked at the properties of estimators and the criteria we could use to choose between types of estimators. Here we examine more closely some very popular

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

Pattern Recognition. Parameter Estimation of Probability Density Functions

Pattern Recognition. Parameter Estimation of Probability Density Functions Pattern Recognition Parameter Estimation of Probability Density Functions Classification Problem (Review) The classification problem is to assign an arbitrary feature vector x F to one of c classes. The

More information

Bayesian Decision Theory

Bayesian Decision Theory Bayesian Decision Theory Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Bayesian Decision Theory Bayesian classification for normal distributions Error Probabilities

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

Density Estimation: ML, MAP, Bayesian estimation

Density Estimation: ML, MAP, Bayesian estimation Density Estimation: ML, MAP, Bayesian estimation CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Maximum-Likelihood Estimation Maximum

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor

More information

Informatics 2B: Learning and Data Lecture 10 Discriminant functions 2. Minimal misclassifications. Decision Boundaries

Informatics 2B: Learning and Data Lecture 10 Discriminant functions 2. Minimal misclassifications. Decision Boundaries Overview Gaussians estimated from training data Guido Sanguinetti Informatics B Learning and Data Lecture 1 9 March 1 Today s lecture Posterior probabilities, decision regions and minimising the probability

More information

Applied inductive learning - Lecture 4

Applied inductive learning - Lecture 4 Applied inductive learning - Lecture 4 Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège Montefiore - Liège - October 10, 2005 Find slides: http://montefiore.ulg.ac.be/

More information

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture 21 K - Nearest Neighbor V In this lecture we discuss; how do we evaluate the

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Review for Exam 1. Erik G. Learned-Miller Department of Computer Science University of Massachusetts, Amherst Amherst, MA

Review for Exam 1. Erik G. Learned-Miller Department of Computer Science University of Massachusetts, Amherst Amherst, MA Review for Exam Erik G. Learned-Miller Department of Computer Science University of Massachusetts, Amherst Amherst, MA 0003 March 26, 204 Abstract Here are some things you need to know for the in-class

More information

A tailor made nonparametric density estimate

A tailor made nonparametric density estimate A tailor made nonparametric density estimate Daniel Carando 1, Ricardo Fraiman 2 and Pablo Groisman 1 1 Universidad de Buenos Aires 2 Universidad de San Andrés School and Workshop on Probability Theory

More information

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout :. The Multivariate Gaussian & Decision Boundaries..15.1.5 1 8 6 6 8 1 Mark Gales mjfg@eng.cam.ac.uk Lent

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning 3. Instance Based Learning Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 Outline Parzen Windows Kernels, algorithm Model selection

More information

Nonparametric probability density estimation

Nonparametric probability density estimation A temporary version for the 2018-04-11 lecture. Nonparametric probability density estimation Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Pattern Recognition 2

Pattern Recognition 2 Pattern Recognition 2 KNN,, Dr. Terence Sim School of Computing National University of Singapore Outline 1 2 3 4 5 Outline 1 2 3 4 5 The Bayes Classifier is theoretically optimum. That is, prob. of error

More information

CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors. Furong Huang /

CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors. Furong Huang / CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors Furong Huang / furongh@cs.umd.edu What we know so far Decision Trees What is a decision tree, and how to induce it from

More information

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture - 17 K - Nearest Neighbor I Welcome to our discussion on the classification

More information

A PECULIAR COIN-TOSSING MODEL

A PECULIAR COIN-TOSSING MODEL A PECULIAR COIN-TOSSING MODEL EDWARD J. GREEN 1. Coin tossing according to de Finetti A coin is drawn at random from a finite set of coins. Each coin generates an i.i.d. sequence of outcomes (heads or

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

CENG 793. On Machine Learning and Optimization. Sinan Kalkan

CENG 793. On Machine Learning and Optimization. Sinan Kalkan CENG 793 On Machine Learning and Optimization Sinan Kalkan 2 Now Introduction to ML Problem definition Classes of approaches K-NN Support Vector Machines Softmax classification / logistic regression Parzen

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Nearest neighbor classification in metric spaces: universal consistency and rates of convergence

Nearest neighbor classification in metric spaces: universal consistency and rates of convergence Nearest neighbor classification in metric spaces: universal consistency and rates of convergence Sanjoy Dasgupta University of California, San Diego Nearest neighbor The primeval approach to classification.

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Lecture 4: September Reminder: convergence of sequences

Lecture 4: September Reminder: convergence of sequences 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 4: September 6 In this lecture we discuss the convergence of random variables. At a high-level, our first few lectures focused

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Lecture 3: Probability, Bayes Theorem, and Bayes Classification Peter Belhumeur Computer Science Columbia University Probability Should you play this game? Game: A fair

More information

A first model of learning

A first model of learning A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers

More information

Nearest Neighbor Pattern Classification

Nearest Neighbor Pattern Classification Nearest Neighbor Pattern Classification T. M. Cover and P. E. Hart May 15, 2018 1 The Intro The nearest neighbor algorithm/rule (NN) is the simplest nonparametric decisions procedure, that assigns to unclassified

More information

Pattern recognition. "To understand is to perceive patterns" Sir Isaiah Berlin, Russian philosopher

Pattern recognition. To understand is to perceive patterns Sir Isaiah Berlin, Russian philosopher Pattern recognition "To understand is to perceive patterns" Sir Isaiah Berlin, Russian philosopher The more relevant patterns at your disposal, the better your decisions will be. This is hopeful news to

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

Bayesian Decision and Bayesian Learning

Bayesian Decision and Bayesian Learning Bayesian Decision and Bayesian Learning Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 30 Bayes Rule p(x ω i

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Tips and Tricks in Real Analysis

Tips and Tricks in Real Analysis Tips and Tricks in Real Analysis Nate Eldredge August 3, 2008 This is a list of tricks and standard approaches that are often helpful when solving qual-type problems in real analysis. Approximate. There

More information

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales.

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. Lecture 2 1 Martingales We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. 1.1 Doob s inequality We have the following maximal

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

More information

CSE 546 Final Exam, Autumn 2013

CSE 546 Final Exam, Autumn 2013 CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Material presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010.

Material presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010. Material presented Direct Models for Classification SCARF JHU Summer School June 18, 2010 Patrick Nguyen (panguyen@microsoft.com) What is classification? What is a linear classifier? What are Direct Models?

More information

Metric Embedding for Kernel Classification Rules

Metric Embedding for Kernel Classification Rules Metric Embedding for Kernel Classification Rules Bharath K. Sriperumbudur University of California, San Diego (Joint work with Omer Lang & Gert Lanckriet) Bharath K. Sriperumbudur (UCSD) Metric Embedding

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far

More information

Machine Learning Lecture 3

Machine Learning Lecture 3 Announcements Machine Learning Lecture 3 Eam dates We re in the process of fiing the first eam date Probability Density Estimation II 9.0.207 Eercises The first eercise sheet is available on L2P now First

More information

An Introduction to Machine Learning

An Introduction to Machine Learning An Introduction to Machine Learning L2: Instance Based Estimation Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January

More information

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information

Multi-layer Neural Networks

Multi-layer Neural Networks Multi-layer Neural Networks Steve Renals Informatics 2B Learning and Data Lecture 13 8 March 2011 Informatics 2B: Learning and Data Lecture 13 Multi-layer Neural Networks 1 Overview Multi-layer neural

More information

6.1 Moment Generating and Characteristic Functions

6.1 Moment Generating and Characteristic Functions Chapter 6 Limit Theorems The power statistics can mostly be seen when there is a large collection of data points and we are interested in understanding the macro state of the system, e.g., the average,

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Minimum Error-Rate Discriminant

Minimum Error-Rate Discriminant Discriminants Minimum Error-Rate Discriminant In the case of zero-one loss function, the Bayes Discriminant can be further simplified: g i (x) =P (ω i x). (29) J. Corso (SUNY at Buffalo) Bayesian Decision

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Time Series Classification

Time Series Classification Distance Measures Classifiers DTW vs. ED Further Work Questions August 31, 2017 Distance Measures Classifiers DTW vs. ED Further Work Questions Outline 1 2 Distance Measures 3 Classifiers 4 DTW vs. ED

More information