Nonparametric Density Estimation Intro

Nonparametric Density Estimation Intro: Parzen Windows

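Non-Parametric Methods
Neither the probability distribution nor the discriminant function is known. This happens quite often. All we have is labeled data (for example: salmon, bass, salmon, salmon), and we estimate the probability distribution from the labeled data.
[Figure: scale running from "a lot is known" (easier) to "little is known" (harder).]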

Nonparametric Techniques: Introduction
In previous lectures we assumed that either
1. someone gives us the density $p(x \mid c_j)$. In pattern recognition applications this never happens; or
2. someone gives us $p(x \mid \theta_{c_j})$. This does happen sometimes, but we are likely to suspect whether the given $p(x \mid \theta)$ models the data well.
Most parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities.

Nonparametric Techniques: Introduction
Nonparametric procedures can be used with arbitrary distributions and without any assumption about the forms of the underlying densities. There are two types of nonparametric methods:
Parzen windows: estimate the likelihood $p(x \mid c_j)$.
Nearest neighbors: bypass the likelihood and go directly to posterior estimation $P(c_j \mid x)$.

Nonparametric Techniques: Introduction
Nonparametric techniques attempt to estimate the underlying density functions from the training data. Idea: the more data in a region, the larger the density function $p(x)$ is there.
$$\Pr[X \in R] = \int_R f(x)\,dx$$
[Figure: samples along the salmon length axis.]

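Nonparametric Techniques: Introduction
How can we approximate $\Pr[X \in R_1]$ and $\Pr[X \in R_2]$?
$$\Pr[X \in R_1] \approx \frac{6}{20} \quad\text{and}\quad \Pr[X \in R_2] \approx \frac{6}{20}$$
Should the density curves above $R_1$ and $R_2$ be equally high? No, since $R_1$ is smaller than $R_2$:
$$\Pr[X \in R_1] = \int_{R_1} f(x)\,dx \approx \int_{R_2} f(x)\,dx = \Pr[X \in R_2]$$
To get the density, normalize by region size.
[Figure: regions $R_1$ and $R_2$ on the salmon length axis.]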

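Nonparametric Techniques: Introduction
Assuming $f(x)$ is basically flat inside $R$,
$$\frac{\#\text{ of samples in } R}{\text{total } \#\text{ of samples}} \approx \Pr[X \in R] = \int_R f(y)\,dy \approx f(x)\cdot \mathrm{Volume}(R)$$
Thus the density at a point $x$ inside $R$ can be approximated by
$$f(x) \approx \frac{\#\text{ of samples in } R}{\text{total } \#\text{ of samples}} \cdot \frac{1}{\mathrm{Volume}(R)}$$
Now let's derive this formula more formally.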
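The counting formula above can be sketched in a few lines of Python. This is only an illustration in one dimension: the function name `region_density` and the 20 fish lengths are made up for the example (6 samples are placed in each of two regions to mirror the 6/20 figures on the previous slide).

```python
def region_density(samples, lo, hi):
    """Approximate f(x) for x in R = [lo, hi) as (k / n) / Volume(R) in 1-D."""
    n = len(samples)
    k = sum(1 for s in samples if lo <= s < hi)  # samples falling in R
    volume = hi - lo                             # length of the interval R
    return (k / n) / volume

# Hypothetical data: 20 lengths, 6 of them in [2, 4) and 6 in [10, 16)
lengths = [2.1, 2.5, 2.8, 3.0, 3.3, 3.9,
           10.5, 11.0, 12.2, 13.1, 14.0, 15.5,
           5.0, 6.0, 7.0, 8.0, 9.0, 17.0, 18.0, 19.0]

d1 = region_density(lengths, 2, 4)    # same probability mass, small region
d2 = region_density(lengths, 10, 16)  # same probability mass, larger region
```

Both regions hold the same fraction of the samples, but the smaller region gets the higher density estimate because the count is divided by the region size.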

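Binomial Random Variable
Let us flip a coin $n$ times (each flip is called a "trial"). The probability of a head is $\rho$, the probability of a tail is $1-\rho$. The binomial random variable $K$ counts the number of heads in $n$ trials:
$$P(K = k) = \binom{n}{k}\rho^k(1-\rho)^{n-k}, \quad\text{where}\quad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
The mean is $E(K) = n\rho$. The variance is $\mathrm{var}(K) = n\rho(1-\rho)$.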
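As a quick sanity check of the binomial facts above, the following sketch computes the probability mass function directly and compares its mean and variance with $n\rho$ and $n\rho(1-\rho)$. The values `n = 10` and `rho = 0.3` are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(k, n, rho):
    """P(K = k) for n independent trials with success probability rho."""
    return comb(n, k) * rho**k * (1 - rho)**(n - k)

n, rho = 10, 0.3  # illustrative values, not from the slides
pmf = [binomial_pmf(k, n, rho) for k in range(n + 1)]

mean = sum(k * p for k, p in enumerate(pmf))             # should match n * rho
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))  # should match n * rho * (1 - rho)
```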

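Density Estimation: Basic Issues
From the definition of a density function, the probability $\rho$ that a vector $x$ will fall in region $R$ is
$$\rho = \Pr[x \in R] = \int_R p(x')\,dx'$$
Suppose we have samples $x_1, x_2, \dots, x_n$ drawn from the distribution $p(x)$. The probability that $k$ points fall in $R$ is then given by the binomial distribution:
$$\Pr[K = k] = \binom{n}{k}\rho^k(1-\rho)^{n-k}$$
Suppose that $k$ points fall in $R$. We can use MLE to estimate the value of $\rho$. The likelihood function is
$$p(x_1, \dots, x_n \mid \rho) = \binom{n}{k}\rho^k(1-\rho)^{n-k}$$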

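Density Estimation: Basic Issues
$$p(x_1, \dots, x_n \mid \rho) = \binom{n}{k}\rho^k(1-\rho)^{n-k}$$
This likelihood function is maximized at $\rho = k/n$. Thus the MLE is $\hat{\rho} = k/n$.
Assume that $p(x)$ is continuous and that the region $R$ is so small that $p(x)$ is approximately constant in $R$:
$$\int_R p(x')\,dx' \approx p(x)\,V$$
where $x$ is in $R$ and $V$ is the volume of $R$. Recall from the previous slide that $\rho = \int_R p(x')\,dx'$. Thus $p(x)$ can be approximated:
$$p(x) \approx \frac{k/n}{V}$$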
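The slide states the maximizer without showing the step. A short derivation, written here with $L(\rho)$ as shorthand for the likelihood (a symbol introduced only for this note), fills it in by maximizing the log-likelihood:

```latex
\begin{aligned}
\log L(\rho) &= \log\binom{n}{k} + k\log\rho + (n-k)\log(1-\rho) \\
\frac{d}{d\rho}\log L(\rho) &= \frac{k}{\rho} - \frac{n-k}{1-\rho} = 0 \\
k(1-\rho) &= (n-k)\rho \\
\hat{\rho} &= \frac{k}{n}
\end{aligned}
```

The second derivative, $-k/\rho^2 - (n-k)/(1-\rho)^2$, is negative for $0 < k < n$, so this stationary point is a maximum.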

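Density Estimation: Basic Issues
This is exactly what we had before:
$$p(x) \approx \frac{k/n}{V}$$
where $x$ is inside some region $R$, $k$ is the number of samples inside $R$, $n$ is the total number of samples, and $V$ is the volume of $R$.
Our estimate will always be the average of the true density over $R$:
$$p(x) \approx \frac{k/n}{V} = \frac{\hat{\rho}}{V} \approx \frac{1}{V}\int_R p(x')\,dx'$$
Ideally, $p(x)$ should be constant inside $R$.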

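Density Estimation: Histogram
$$p(x) \approx \frac{k/n}{V}$$
[Figure: density estimates computed over three regions $R_1$, $R_2$, $R_3$; the numeric labels are not recoverable from the transcript.]
If the regions $R_i$ do not overlap, we have a histogram.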

Density Estimation: Accuracy
How accurate is the density approximation $p(x) \approx \frac{k/n}{V}$? We have made two approximations:
1. $\hat{\rho} = k/n$: as $n$ increases, this estimate becomes more accurate.
2. $\int_R p(x')\,dx' \approx p(x)\,V$: as $R$ grows smaller, the estimate becomes more accurate.
As we shrink $R$ we have to make sure it contains samples, otherwise our estimate is $p(x) = 0$ for all $x$ in $R$.
Thus in theory, if we have an unlimited number of samples, we get convergence as we simultaneously increase the number of samples $n$ and shrink the region $R$, but not so much that $R$ no longer contains a lot of samples.

Density Estimation: Accuracy
$$p(x) \approx \frac{k/n}{V}$$
In practice, the number of samples is always fixed. Thus the only available option for increasing the accuracy is to decrease the size of $R$ ($V$ gets smaller). If $V$ is too small, $p(x) = 0$ for most $x$, because most regions will have no samples. Thus we have to find a compromise for $V$: not too small, so that it contains enough samples, but also not too large, so that $p(x)$ is approximately constant inside $V$.

Density Estimation: Two Approaches
$$p(x) \approx \frac{k/n}{V}$$
1. Parzen windows: choose a fixed value for the volume $V$ and determine the corresponding $k$ from the data.
2. k-Nearest Neighbors: choose a fixed value for $k$ and determine the corresponding volume $V$ from the data.
Under appropriate conditions, and as the number of samples goes to infinity, both methods can be shown to converge to the true $p(x)$.
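The difference between the two approaches is only which of $k$ and $V$ is held fixed. A minimal one-dimensional sketch follows; the function names and the sample values are illustrative assumptions, and the k-NN version assumes the k-th nearest sample is not located exactly at `x` (otherwise the volume would be zero).

```python
def parzen_estimate(x, samples, h):
    """Fix the volume V = h; count the k samples within h/2 of x."""
    n = len(samples)
    k = sum(1 for s in samples if abs(s - x) <= h / 2)
    return (k / n) / h

def knn_estimate(x, samples, k):
    """Fix k; let V be the smallest interval centered at x that holds k samples."""
    n = len(samples)
    dists = sorted(abs(s - x) for s in samples)
    radius = dists[k - 1]   # distance to the k-th nearest sample
    volume = 2 * radius     # interval [x - radius, x + radius]
    return (k / n) / volume

data = [2, 3, 4, 8, 10, 11, 12]            # illustrative samples
p_parzen = parzen_estimate(3, data, h=3)   # V fixed at 3, k found from data
p_knn = knn_estimate(3, data, k=3)         # k fixed at 3, V found from data
```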

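Parzen Windows
$$p(x) \approx \frac{k/n}{V}$$
where $x$ is inside some region $R$, $k$ is the number of samples inside $R$, $n$ is the total number of samples, and $V$ is the volume of $R$.
To estimate the density at a point $x$, simply center the region $R$ at $x$, count the number of samples in $R$, and substitute everything into our formula.
[Figure: region $R$ centered at $x$ containing 3 of 6 samples, with $p(x) \approx \frac{3/6}{10}$.]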

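Parzen Windows
In the Parzen-window approach to estimating densities we fix the size and shape of the region $R$. Let us assume that the region $R$ is a $d$-dimensional hypercube with side length $h$, so its volume is $h^d$.
[Figure: the region $R$ with side $h$ in 1 dimension, 2 dimensions, and 3 dimensions.]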

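Parzen Windows
Let $u = [u_1, u_2, \dots, u_d]$ and define a window function
$$\varphi(u) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2}, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
[Figure: in 1 dimension, $\varphi(u)$ is 1 on $[-1/2, 1/2]$; in 2 dimensions, $\varphi$ is 1 inside the unit square centered at the origin and 0 outside.]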

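Parzen Windows
Recall we have $d$-dimensional samples $x_1, x_2, \dots, x_n$. Let $x_{ij}$ be the $j$th coordinate of sample $x_i$. Then
$$\varphi\!\left(\frac{x - x_i}{h}\right) = \begin{cases} 1 & |x_j - x_{ij}| \le \tfrac{h}{2}, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
That is, $\varphi\!\left(\frac{x - x_i}{h}\right)$ is 1 if $x_i$ is inside the hypercube with width $h$ centered at $x$, and 0 otherwise.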

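Parzen Windows
How do we count the total number of sample points $x_1, x_2, \dots, x_n$ which are inside the hypercube with side $h$ centered at $x$?
$$k = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h}\right)$$
Recall $p(x) \approx \frac{k/n}{V}$ with $V = h^d$. Thus we get the desired analytical expression for the estimate of the density $p_\varphi(x)$:
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$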
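The estimator above translates almost directly into code. The sketch below is a plain-Python illustration for $d$-dimensional points given as tuples; the names `box_window` and `parzen_box` and the four 2-D sample points are assumptions made for the example.

```python
def box_window(u):
    """phi(u) = 1 if every coordinate satisfies |u_j| <= 1/2, else 0."""
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0

def parzen_box(x, samples, h):
    """p(x) = (1/n) * sum_i (1/h^d) * phi((x - x_i) / h)."""
    n, d = len(samples), len(x)
    # k counts the samples inside the hypercube of side h centered at x
    k = sum(box_window([(xj - sj) / h for xj, sj in zip(x, s)])
            for s in samples)
    return k / (n * h ** d)

# Hypothetical 2-D samples
points = [(0.0, 0.0), (0.4, 0.3), (1.0, 1.0), (3.0, 3.0)]
p = parzen_box((0.2, 0.2), points, h=1.0)  # two of four samples fall in the cube
```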

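Parzen Windows
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$
Let's make sure $p_\varphi(x)$ is in fact a density. First, $p_\varphi(x) \ge 0$ for all $x$. Second,
$$\int p_\varphi(x)\,dx = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}\int \varphi\!\left(\frac{x - x_i}{h}\right)dx = \frac{1}{n h^d}\sum_{i=1}^{n} h^d = 1$$
because each integral $\int \varphi\!\left(\frac{x - x_i}{h}\right)dx$ equals the volume of the hypercube, $h^d$.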

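Parzen Windows: Example in 1D
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$
Suppose we have 7 samples $D = \{2, 3, 4, 8, 10, 11, 12\}$. Let the window width be $h = 3$ and estimate the density at $x = 1$:
$$p_\varphi(1) = \frac{1}{7}\sum_{i=1}^{7}\frac{1}{3}\,\varphi\!\left(\frac{1 - x_i}{3}\right) = \frac{1}{21}\left[\varphi\!\left(\frac{1-2}{3}\right) + \varphi\!\left(\frac{1-3}{3}\right) + \varphi\!\left(\frac{1-4}{3}\right) + \dots + \varphi\!\left(\frac{1-12}{3}\right)\right]$$
Only the first argument satisfies the window condition: $\left|-\tfrac{1}{3}\right| \le \tfrac{1}{2}$, while $\left|-\tfrac{2}{3}\right| > \tfrac{1}{2}$, $\left|-1\right| > \tfrac{1}{2}$, ..., $\left|-\tfrac{11}{3}\right| > \tfrac{1}{2}$. Therefore
$$p_\varphi(1) = \frac{1}{21}\left[1 + 0 + 0 + \dots + 0\right] = \frac{1}{21}$$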
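The arithmetic in this example can be checked exactly with rational numbers. The sketch below uses Python's `fractions` module and assumes integer inputs; the helper names `phi` and `parzen_1d` are chosen for illustration.

```python
from fractions import Fraction

def phi(u):
    """1-D box window: 1 if |u| <= 1/2, else 0."""
    return 1 if abs(u) <= Fraction(1, 2) else 0

def parzen_1d(x, samples, h):
    """Exact box-window estimate for integer x, samples and h (d = 1)."""
    n = len(samples)
    return sum(Fraction(phi(Fraction(x - xi, h)), h) for xi in samples) / n

D = [2, 3, 4, 8, 10, 11, 12]
p_at_1 = parzen_1d(1, D, 3)  # only the sample at 2 lies within h/2 of x = 1
p_at_3 = parzen_1d(3, D, 3)  # samples 2, 3 and 4 lie within h/2 of x = 3
```

Here `p_at_1` reproduces the value 1/21 from the slide, and `p_at_3` is three times larger because three boxes overlap at that point.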

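Parzen Windows: Sum of Functions
Now let's look at our density estimate $p_\varphi(x)$ again:
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right) = \sum_{i=1}^{n}\frac{1}{n h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$
where $\varphi\!\left(\frac{x - x_i}{h}\right)$ is 1 inside the square of side $h$ centered at $x_i$ and 0 otherwise. Thus $p_\varphi(x)$ is just a sum of $n$ box-like functions, each of height $\frac{1}{n h^d}$.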

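Parzen Windows: Example in 1D
Let's come back to our example: 7 samples $D = \{2, 3, 4, 8, 10, 11, 12\}$, $h = 3$. To see what the function $p_\varphi(x)$ looks like, we need to generate 7 boxes and add them up. The width of each box is $h = 3$ and the height is $\frac{1}{n h^d} = \frac{1}{21}$.
[Figure: $p_\varphi(x)$ as the sum of 7 boxes.]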
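Adding up the seven boxes can be done numerically on a grid. The following sketch is an illustration only: the grid range `[0, 15)` and the step size are arbitrary choices, and the area is a simple Riemann sum, so it matches 1 only approximately.

```python
def box_sum(x, samples, h):
    """Sum of n boxes of width h and height 1/(n*h), one centered at each sample."""
    n = len(samples)
    height = 1.0 / (n * h)  # 1/(n h^d) with d = 1
    return sum(height for xi in samples if abs(x - xi) <= h / 2)

D = [2, 3, 4, 8, 10, 11, 12]
h = 3
step = 0.001
grid = [i * step for i in range(int(15 / step))]  # covers [0, 15)

values = [box_sum(x, D, h) for x in grid]
area = sum(values) * step  # Riemann-sum approximation of the integral
peak = max(values)         # reached where three boxes overlap
```

The estimate is piecewise constant: it steps up or down by 1/21 wherever a box starts or ends.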

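Parzen Windows: Interpolation
In essence, the window function $\varphi$ is used for interpolation: each sample $x_i$ contributes to the resulting density at $x$ if $x$ is close enough to $x_i$.
[Figure: the estimate $p_\varphi(x)$ for the example above.]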

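Parzen Windows: Drawbacks of Hypercube $\varphi$
As long as the sample point $x_i$ and $x$ are in the same hypercube, the contribution of $x_i$ to the density at $x$ is constant, regardless of how close $x_i$ is to $x$.
[Figure: two samples $x_1$ and $x_2$ at different distances from $x$, both inside the window of width $h$, contributing equally.]
The resulting density $p_\varphi(x)$ is not smooth; it has discontinuities.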

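Parzen Windows: general $\varphi$
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$
We can use a general window $\varphi$ as long as the resulting $p_\varphi(x)$ is a legitimate density, i.e.
1. $p_\varphi(x) \ge 0$, which is satisfied if $\varphi(u) \ge 0$;
2. $\int p_\varphi(x)\,dx = 1$, which is satisfied if $\int \varphi(u)\,du = 1$:
$$\int p_\varphi(x)\,dx = \frac{1}{n h^d}\sum_{i=1}^{n}\int \varphi\!\left(\frac{x - x_i}{h}\right)dx = \frac{1}{n h^d}\sum_{i=1}^{n} h^d \int \varphi(u)\,du = 1$$
using the change of coordinates $u = \frac{x - x_i}{h}$, so that $dx = h^d\,du$.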

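Parzen Windows: general $\varphi$
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$
Notice that with a general window $\varphi$ we are no longer counting the number of samples inside $R$. We are computing a weighted average over potentially every single sample point (although only those within distance $h$ have any significant weight). With an infinite number of samples, and under appropriate conditions, it can still be shown that $p_\varphi(x)$ converges to $p(x)$.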

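Parzen Windows: Gaussian $\varphi$
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$
A popular choice for $\varphi$ is the $N(0,1)$ density:
$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}$$
This solves both drawbacks of the box window: points $x$ which are close to the sample point $x_i$ receive higher weight, and the resulting density $p_\varphi(x)$ is smooth.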

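Parzen Windows: Example with General $\varphi$
Let's come back to our example: 7 samples $D = \{2, 3, 4, 8, 10, 11, 12\}$, $h = 1$.
$$p_\varphi(x) = \frac{1}{7}\sum_{i=1}^{7}\varphi(x - x_i)$$
$p_\varphi(x)$ is the sum of 7 Gaussians, each centered at one of the sample points and each scaled by $1/7$.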
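A Gaussian-window version of the estimator for this example might look as follows. This is a one-dimensional sketch; the function names and the evaluation points 3.0 and 6.0 are illustrative choices.

```python
import math

def gaussian_window(u):
    """N(0,1) density used as the window function phi."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_gauss(x, samples, h):
    """p(x) = (1/n) * sum_i (1/h) * phi((x - x_i) / h) in one dimension."""
    n = len(samples)
    return sum(gaussian_window((x - xi) / h) for xi in samples) / (n * h)

D = [2, 3, 4, 8, 10, 11, 12]
p_near = parzen_gauss(3.0, D, h=1.0)  # in the middle of the cluster 2, 3, 4
p_gap = parzen_gauss(6.0, D, h=1.0)   # in the gap between the two clusters
```

Unlike the box window, every sample contributes to every point, with a weight that decays smoothly with distance, so the estimate is positive everywhere and has no jumps.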

Parzen Windows: Did We Solve the Problem?
Let's test whether we solved the problem.
1. Draw samples from a known distribution.
2. Use our density approximation method and compare with the true density.
We will vary the number of samples $n$ and the window size $h$. We will play with 2 distributions: $N(0,1)$, and a mixture of a triangle and a uniform density.
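The experiment described above can be sketched for the $N(0,1)$ case. Everything specific here is an assumption made for illustration: the Gaussian window, the sample sizes, the grid on $[-3, 3]$, the random seed, and the use of mean absolute error as the comparison measure.

```python
import math
import random

def gaussian_window(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_gauss(x, samples, h):
    n = len(samples)
    return sum(gaussian_window((x - xi) / h) for xi in samples) / (n * h)

def true_density(x):
    """The known distribution we sample from: N(0,1)."""
    return gaussian_window(x)

def mean_abs_error(n, h, rng):
    """Draw n samples, then compare the estimate with the truth on a grid."""
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    grid = [-3.0 + 0.1 * i for i in range(61)]
    return sum(abs(parzen_gauss(x, samples, h) - true_density(x))
               for x in grid) / len(grid)

rng = random.Random(0)  # fixed seed so the run is repeatable
errors = {(n, h): mean_abs_error(n, h, rng)
          for n in (10, 100, 1000)
          for h in (1.0, 0.5, 0.1)}

for (n, h), err in sorted(errors.items()):
    print(f"n={n:5d}  h={h:3.1f}  mean |error| = {err:.4f}")
```

Printing the table makes it easy to see how the error depends jointly on the number of samples and the window width.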

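Parzen Windows: True Density N(0,1)
[Figure: Parzen-window estimates of the N(0,1) density for window widths h = 1, h = 0.5, and h = 0.1; the sample-size labels of the panels are not fully recoverable from the transcript.]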

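Parzen Windows: True Density N(0,1)
[Figure: further Parzen-window estimates of the N(0,1) density for window widths h = 1, h = 0.5, and h = 0.1 at other sample sizes; the remaining panel labels are not recoverable from the transcript.]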

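Parzen Windows: True Density is a Mixture of Uniform and Triangle
[Figure: Parzen-window estimates of the mixture density for window widths h = 1, h = 0.5, and h = 0.2; the sample-size labels of the panels are not fully recoverable from the transcript.]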

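Parzen Windows: True Density is a Mixture of Uniform and Triangle
[Figure: further Parzen-window estimates of the mixture density for window widths h = 1, h = 0.5, and h = 0.2, including a panel labeled 256; the remaining panel labels are not recoverable from the transcript.]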

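Parzen Windows: Effect of Window Width h
By choosing $h$ we are guessing the region where the density is approximately constant. Without knowing anything about the distribution, it is really hard to guess where the density is approximately constant.
[Figure: a density $p(x)$ with two candidate window widths $h$.]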

Parzen Windows: Effect of Window Width h
If $h$ is small, we superimpose $n$ sharp pulses centered at the data. Each sample point $x_i$ influences too small a range of $x$. Smoothed too little: the result will look noisy and not smooth enough.
If $h$ is large, we superimpose broad, slowly changing functions. Each sample point $x_i$ influences too large a range of $x$. Smoothed too much: the result looks oversmoothed or "out-of-focus".
Finding the best $h$ is challenging, and indeed no single $h$ may work well. We may need to adapt $h$ for different sample points. However, we can try to learn the best $h$ to use from our labeled data.

Learning Window Width h From Labeled Data
Divide the labeled data into a training set, a validation set, and a test set.
For a range of different values of $h$ (possibly using binary search), construct the density estimate $p_\varphi(x)$ using Parzen windows.
Test the classification performance on the validation set for each value of $h$ you tried.
For the final density estimate, choose the $h$ giving the smallest error on the validation set.
Now you can test the performance of the classifier on the test set.
Notice that we need the validation set to find the best parameter $h$. We can't use the test set for this because the test set cannot be used for training. In general, we need a validation set if our classifier has some tunable parameters.
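The procedure above can be sketched as follows. Several things are assumptions made for illustration rather than part of the lecture: the two-class one-dimensional synthetic data, the Gaussian window, the class labels "a" and "b", and a plain scan over a fixed list of candidate widths in place of the binary search the slide mentions.

```python
import math
import random

def parzen_gauss(x, samples, h):
    """1-D Gaussian-window Parzen estimate."""
    n = len(samples)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples) / (
        n * h * math.sqrt(2.0 * math.pi))

def classify(x, class_samples, h):
    """Pick the class with the largest prior * Parzen likelihood."""
    total = sum(len(s) for s in class_samples.values())
    return max(class_samples,
               key=lambda c: len(class_samples[c]) / total
               * parzen_gauss(x, class_samples[c], h))

def error_rate(data, class_samples, h):
    wrong = sum(1 for x, label in data if classify(x, class_samples, h) != label)
    return wrong / len(data)

rng = random.Random(1)

def draw(n):
    """Hypothetical labeled data: class 'a' ~ N(0,1), class 'b' ~ N(3,1)."""
    return ([(rng.gauss(0.0, 1.0), "a") for _ in range(n)]
            + [(rng.gauss(3.0, 1.0), "b") for _ in range(n)])

train, valid, test = draw(100), draw(50), draw(50)
class_samples = {c: [x for x, label in train if label == c] for c in ("a", "b")}

candidates = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
val_errors = {h: error_rate(valid, class_samples, h) for h in candidates}
best_h = min(candidates, key=lambda h: val_errors[h])  # smallest validation error
test_error = error_rate(test, class_samples, best_h)   # reported once, at the end
```

The test set is touched only after `best_h` has been chosen, which keeps the reported test error an honest estimate of performance on novel data.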

Parzen Windows: Classification Example
In classifiers based on Parzen-window estimation, we estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior. The decision region for a Parzen-window classifier depends upon the choice of window function, as illustrated in the following figure.

Parzen Windows: Classification Example
For a small enough window size $h$, the classification on the training data is perfect. However, the decision boundaries are complex, and this solution is not likely to generalize well to novel data.
For a larger window size $h$, the classification on the training data is not perfect. However, the decision boundaries are simpler, and this solution is more likely to generalize well to novel data.
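The effect of the window size on training accuracy can be seen on a tiny hand-made data set. The eight one-dimensional points below are hypothetical and deliberately interleaved (one point of each class sits inside the other class's range), the window is Gaussian, and the two classes have equal size so the priors cancel.

```python
import math

def parzen_gauss(x, samples, h):
    """1-D Gaussian-window Parzen estimate."""
    n = len(samples)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples) / (
        n * h * math.sqrt(2.0 * math.pi))

# Hypothetical training data with one "stray" point per class
training = {"a": [1.0, 2.0, 3.0, 5.5],
            "b": [4.0, 5.0, 6.0, 2.5]}

def classify(x, h):
    """Maximum-posterior label; equal class sizes mean equal priors."""
    return max(training, key=lambda c: parzen_gauss(x, training[c], h))

def training_error(h):
    pairs = [(x, c) for c, xs in training.items() for x in xs]
    return sum(1 for x, c in pairs if classify(x, h) != c) / len(pairs)

err_small_h = training_error(0.1)  # every point is dominated by its own pulse
err_large_h = training_error(2.0)  # stray points are absorbed by the other class
```

With the small window each training point is classified by its own sharp pulse, which fits the training data perfectly but carves out tiny islands around the stray points. With the large window those islands disappear and the stray points are misclassified, giving a simpler decision boundary.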

Parzen Windows: Summary
$$p_\varphi(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^d}\,\varphi\!\left(\frac{x - x_i}{h}\right)$$
Advantages
Can be applied to data from any distribution.
In theory, can be shown to converge as the number of samples goes to infinity.
Disadvantages
The number of training data is limited in practice, and so choosing the appropriate window size $h$ is difficult.
May need a large number of samples for accurate estimates.
Computationally heavy: to classify one point we have to compute a function which potentially depends on all samples. But we need a lot of samples for accurate density estimation!