Kernel-based density estimation. Nuno Vasconcelos, ECE Department, UCSD

Announcement: in the last week of classes we will have Cheetah Day (exact day TBA).
- what: 4 teams of 6 people; each team will write a report on the 4 cheetah problems and give a presentation on one of the problems
- why: to make sure that we get the big picture out of all this work, and presenting is always good practice

Announcement:
- how much: 10% of the final grade (5% report, 5% presentation)
- what to talk about:
  - report: a comparative analysis of all solutions of the problem (8 pages), as if you were writing a conference paper
  - presentation: on one single problem; review what the solution was, what this problem taught us about learning, what tricks we learned solving it, and how well this solution did compared to others

Announcement, details:
- get together and form groups; let me know what they are by Wednesday (November 19) (email is fine)
- I will randomly assign the problem on which each group has to be the expert
- prepare a talk of 20 min (max 10 slides)
- feel free to use my solutions and your results
- feel free to go beyond what we have done (e.g. search over features, whatever)

Plan for today: we have talked a lot about the BDR and methods based on density estimation, but practical densities are not well approximated by simple probability models. Today: what can we do if we have complicated densities? Use better probability density models!

Non-parametric density estimates

Binomial random variable: the relative frequency k/N is an estimate of P whose variance drops quickly with the sample size.

N:         10      100     1,000
Var[P] <   0.025   0.0025  0.00025
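
The equations of this slide did not survive transcription, but the tabulated values match the standard 1/(4N) bound on the variance of a binomial proportion (Var[k/N] = P(1-P)/N <= 1/(4N)). A quick simulation sketch of that bound (mine, not the lecture's):

```python
# Simulate k ~ Binomial(N, P) and compare the empirical variance of k/N
# against P(1-P)/N and the 1/(4N) bound.
import numpy as np

rng = np.random.default_rng(0)
P = 0.3
for N in (10, 100, 1000):
    k = rng.binomial(N, P, size=100_000)      # many draws of the count
    print(N, (k / N).var(), P * (1 - P) / N, 1 / (4 * N))
```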

Histogram: this means that k/n is a very good estimate of P. On the other hand, from the mean value theorem, if P_X(x) is continuous we can always find a box whose area equals the integral of the function over the region R, i.e. a point ε in R such that the integral of P_X over R equals P_X(ε) times the volume of R. This is easiest to see in 1D: since P_X(x) is continuous, there must be an ε in R such that P_X(ε) is the box height. (Figure: P_X(x) in 1D, with a box of height P_X(ε) over the region R.)

Histogram: hence, using continuity of P_X(x) again and assuming R is small, P_X is roughly constant over R and we obtain P_X(x) ≈ k / (n vol(R)). This is the histogram. It is the simplest possible non-parametric estimator and can be generalized into the kernel-based density estimator.
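
A minimal 1D sketch of this estimator (my illustration, not the lecture's code), counting the samples that fall in a bin of width h around x:

```python
import numpy as np

def hist_density(x, samples, h):
    """Histogram-style estimate P_X(x) ~ k / (n * h), where k is the
    number of samples in the interval [x - h/2, x + h/2)."""
    samples = np.asarray(samples)
    k = np.sum((samples >= x - h / 2) & (samples < x + h / 2))
    return k / (len(samples) * h)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)              # samples from N(0, 1)
print(hist_density(0.0, data, h=0.5))     # should be close to 1/sqrt(2*pi) ~ 0.399
```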

Kernel density estimates

Kernel density estimates: this means that the histogram can be written as P_X(x) = (1/n) Σ_i (1/h^d) φ((x - X_i)/h), with φ the indicator of the unit hypercube, which is equivalent to putting a box around x for each X_i that lands on the hypercube. This can be seen as a very crude form of interpolation; we get better interpolation if the contribution of X_i decreases with its distance to x, which motivates considering other windows φ(x). (Figure: a window φ centered at x, with samples x_1, x_2, x_3 at varying distances.)
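
As a concrete illustration of that generic formula (a sketch under my own naming, not the lecture's code), here is the estimator with a box window; swapping in any other valid window gives the general kernel estimate:

```python
import numpy as np

def box_window(u):
    """Indicator of the unit hypercube centered at the origin."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def kde(x, samples, h, window=box_window):
    """Generic window estimate: (1/n) * sum_i (1/h^d) * window((x - X_i) / h)."""
    samples = np.atleast_2d(samples)      # shape (n, d)
    n, d = samples.shape
    u = (np.atleast_1d(x) - samples) / h  # shape (n, d)
    return window(u).sum() / (n * h ** d)

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
print(kde(np.array([0.0]), data, h=0.5))  # roughly 0.4 for N(0, 1)
```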

Windows: what sort of functions are valid windows? Note that P_X(x) is a pdf if and only if it is non-negative and integrates to one; these conditions hold if φ(x) is itself a pdf.

Gaussian kernel: probably the most popular in practice. Note that P_X(x) can also be seen as a sum of pdfs centered on the X_i, when φ(x) is symmetric in x and X_i.

Gaussian kernel: the Gaussian case can be interpreted as a sum of n Gaussians centered at the X_i with covariance h^2 I. More generally, we can have a full covariance: a sum of n Gaussians centered at the X_i with covariance Σ. Gaussian kernel density estimate: approximate the pdf of X with a sum of Gaussian bumps.
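
A minimal sketch of the sum-of-Gaussian-bumps estimate (my code, assuming the standard N(X_i, h^2 I) bumps; in practice something like scipy.stats.gaussian_kde can also be used):

```python
import numpy as np

def gauss_kde(x, samples, h):
    """Average of n Gaussian bumps N(X_i, h^2 I) evaluated at x."""
    samples = np.atleast_2d(samples)      # shape (n, d)
    n, d = samples.shape
    diff = np.atleast_1d(x) - samples     # (n, d)
    sq = np.sum(diff ** 2, axis=1) / h ** 2
    norm = (2 * np.pi) ** (d / 2) * h ** d
    return np.exp(-0.5 * sq).sum() / (n * norm)

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 1))
print(gauss_kde(np.array([0.0]), data, h=0.3))  # close to 0.399 for N(0, 1)
```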

Kernel bandwidth: back to the generic model, what is the role of h (the bandwidth parameter)? Defining δ(x) = (1/h^d) φ(x/h), we can write P_X(x) = (1/n) Σ_i δ(x - X_i), i.e. a sum of translated replicas of δ(x).

Kernel bandwidth: h has two roles:
1. rescale the x-axis
2. rescale the amplitude of δ(x)
This implies that for large h:
1. δ(x) has low amplitude
2. the iso-contours of δ(x) are quite distant from zero (x must be large before φ(x/h) changes significantly from φ(0))

Kernel bandwidth: for small h:
1. δ(x) has large amplitude
2. the iso-contours of δ(x) are quite close to zero (x is small before φ(x/h) changes significantly from φ(0))
What is the impact of this on the quality of the density estimates?

Kernel bandwidth: it controls the smoothness of the estimate. As h goes to zero we have a sum of delta functions (a very spiky approximation); as h goes to infinity we have a sum of constant functions (approximation by a constant); in between we get approximations that are gradually more smooth.
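
A small illustration of this effect (building on the gauss_kde sketch above; not part of the lecture): evaluate the estimate on a grid for several bandwidths and measure how wiggly it is.

```python
# Tiny h gives a spiky curve, huge h gives an almost flat one.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 1))
grid = np.linspace(-3, 3, 121)

for h in (0.02, 0.3, 5.0):
    est = [gauss_kde(np.array([g]), data, h) for g in grid]
    # rough "spikiness" measure: mean absolute change between grid points
    print(h, np.abs(np.diff(est)).mean())
```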

Kernel bandwidth: why does this matter? When the density estimates are plugged into the BDR, the smoothness of the estimates determines the smoothness of the decision boundaries (figure: less smooth vs. more smooth boundaries), and this affects the probability of error!

Convergence: since P_X(x) depends on the sample points X_i, it is a random variable. As we add more points, the estimate should get better; the question is then whether the estimate ever converges. This is no different than parameter estimation: as before, we talk about convergence in probability.

Convergence of the mean: the expected value of the estimate follows from the linearity of P_X(x) on the kernels.

Convergence of the mean: hence this is the convolution of P_X(x) with δ(x), i.e. a blurred ("low-pass filtered") version of the true density, unless h → 0; in that case δ(x - v) converges to the Dirac delta, and the mean of the estimate converges to P_X(x) itself.
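
The equations on these two slides were lost in transcription; the standard derivation they presumably show, in the δ(x) = (1/h^d) φ(x/h) notation used above (my reconstruction), is:

```latex
E\big[\hat{P}_X(x)\big]
  = \frac{1}{n}\sum_{i=1}^{n} E\big[\delta(x - X_i)\big]
  = \int \delta(x - v)\, P_X(v)\, dv
  = (\delta * P_X)(x)
  \;\xrightarrow[h \to 0]{}\; P_X(x).
```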

Convergence of the variance: since the X_i are iid, the variance of the estimate is 1/n times the variance of a single kernel term δ(x - X_i).
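
The variance equation was also lost; a standard bound consistent with the O(1/(n h^d)) statement a few slides ahead (my reconstruction, not necessarily the exact steps shown in class) is:

```latex
\mathrm{Var}\big[\hat{P}_X(x)\big]
  = \frac{1}{n}\,\mathrm{Var}\big[\delta(x - X_1)\big]
  \le \frac{1}{n}\,E\big[\delta^2(x - X_1)\big]
  \le \frac{\sup_u \varphi(u)}{n\,h^d}\, E\big[\hat{P}_X(x)\big],
\qquad\text{since } \delta(x) = \tfrac{1}{h^d}\,\varphi\!\left(\tfrac{x}{h}\right).
```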

Convergence, in summary: this means that
- to obtain small bias we need h ~ 0
- to obtain small variance we need h to go to infinity

Convergence: intuitively this makes sense. h ~ 0 means a Dirac around each point, which can approximate any function arbitrarily well: there is no bias. But if we get a different sample, the estimate is likely to be very different: there is large variance. As before, the variance can be decreased by getting a larger sample; but, for fixed n, smaller h always means greater variability. Example: fit to N(0,I) using h = h_1/n^{1/2}.

Example:
- small h: spiky estimate; we need a lot of points to converge (variance)
- large h: we approximate N(0,I) with a sum of Gaussians of larger covariance, so the estimate will never have zero error (bias)

Optimal bandwidth: we would like h ~ 0, to guarantee zero bias, and zero variance as n goes to infinity. Solution: make h a function of n that goes to zero. Since the variance is O(1/(n h^d)), this is fine if n h^d goes to infinity. Hence we need h_n → 0 with n h_n^d → ∞; optimal sequences exist, e.g. h_n = h_1/n^{1/2}.
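
A small simulation of such a sequence (my sketch, reusing the gauss_kde function above; not from the slides): with h_n = h_1/sqrt(n), both the bias and the spread of the estimate at x = 0 shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_val = 1 / np.sqrt(2 * np.pi)      # N(0, 1) density at x = 0
h1 = 1.0

for n in (50, 500, 5000):
    h = h1 / np.sqrt(n)
    ests = np.array([
        gauss_kde(np.array([0.0]), rng.normal(size=(n, 1)), h)
        for _ in range(200)            # 200 independent samples of size n
    ])
    print(n, ests.mean() - true_val, ests.std())
```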

Optimal bandwidth: in practice this has limitations. It does not say anything about the finite-data case (the one we care about), and we still have to find the best constant (e.g. h_1 in the sequence above). Usually we end up using trial and error or techniques like cross-validation.

Cross-validation, basic idea: leave some data out of your training set (the cross-validation set); train with different parameters; evaluate performance on the cross-validation set; pick the best parameter configuration. (Figure: the data split into a training set, a cross-validation set used for testing during training, and a held-out test set.)

Leave-one-out cross-validation: there are many variations. Leave-one-out CV: compute n estimators of P_X(x) by leaving one X_i out at a time; for each estimator, evaluate P_X(X_i) on the point that was left out; pick the P_X(x) that maximizes this likelihood.
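
A minimal sketch of this procedure for bandwidth selection (my illustration, reusing the gauss_kde function above; log densities are summed for numerical stability):

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """For each X_i, estimate the density at X_i from the other n-1
    points and add up the log densities."""
    samples = np.atleast_2d(samples)
    total = 0.0
    for i in range(len(samples)):
        rest = np.delete(samples, i, axis=0)
        total += np.log(gauss_kde(samples[i], rest, h) + 1e-300)
    return total

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 1))
candidates = (0.05, 0.1, 0.2, 0.4, 0.8)
best_h = max(candidates, key=lambda h: loo_log_likelihood(data, h))
print(best_h)
```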
