Non-Parametric Techniques. Jacob Hays, Amit Pillay, James DeFelice. Sections 4.1, 4.2, 4.3

Parametric vs. Non-Parametric. Parametric: based on assumed functions (e.g. the normal distribution); unimodal, with only one peak; real data is unlikely to conform to the assumed function. Non-Parametric: based on the data itself; has as many peaks as the data has; provides methods for estimating both the class-conditional density p(x | ω_j) and the posterior P(ω_j | x).

Density Estimation. The probability that a vector x falls in a region R is

P = ∫_R p(x') dx'    (1)

Assume n samples, drawn independently and identically distributed. By the binomial law, the probability that exactly k of the n samples fall in region R is

P_k = (n choose k) P^k (1 − P)^(n−k)    (2)

The expected value of k is E[k] = nP, so P ≈ k/n.

Density Estimation. For large n, k/n is a good estimate of P. If p(x) is continuous and does not vary appreciably over R, then

P = ∫_R p(x') dx' ≈ p(x) V    (4)

where V is the volume of R. Combining this with P ≈ k/n gives the basic estimate

p(x) ≈ (k/n) / V
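To make the estimate concrete, here is a minimal numerical sketch (the choice of distribution, region size, and variable names are illustrative assumptions, not from the slides): draw n samples from a known one-dimensional density, count the number k that land in a small region R around x, and compute (k/n)/V.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                      # number of i.i.d. samples
x = 0.5                          # point at which to estimate the density
half_width = 0.05                # region R = [x - 0.05, x + 0.05]
V = 2 * half_width               # volume (length) of R

samples = rng.normal(loc=0.0, scale=1.0, size=n)    # true density: N(0, 1)

k = np.sum(np.abs(samples - x) < half_width)        # samples falling in R
p_hat = (k / n) / V                                 # estimate p(x) ~ (k/n)/V

true_p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # N(0,1) density at x
print(f"estimate: {p_hat:.4f}   true: {true_p:.4f}")
```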

If the volume V is fixed and n is increased toward infinity, k/n converges to P, so the estimate k/(nV) converges to the space-averaged density P/V over that volume. In the illustration from the text, the relative frequency peaks at the true probability, 0.7, and with infinite n it converges to 0.7.

Density Estimation. If n is fixed and V approaches zero, the region eventually becomes so small that it encloses no samples, or it sits directly on a sample point, making the estimate p(x) ≈ 0 or infinite. In practice we cannot allow the volume to become too small, since data is limited; with any non-zero V, the estimate carries some variance in k/n away from the true probability. In theory, with unlimited data, these limitations can be circumvented.

Density Estimation with Infinite Data. To estimate the density at x, assume a sequence of regions (R_1, R_2, ..., R_n, ...) that all contain x. The estimate for R_n uses n samples; V_n is the volume of R_n, and k_n is the number of samples falling in R_n. The nth estimate of p(x) is

p_n(x) = (k_n / n) / V_n

The goal is to make p_n(x) converge to p(x).

Convergence of p_n(x) to p(x). p_n(x) converges to p(x) if the following three conditions hold:

lim_{n→∞} V_n = 0 (the region R_n eventually covers negligible space)

lim_{n→∞} k_n = ∞ (p(x) is averaged over ever more samples, unless p(x) = 0)

lim_{n→∞} k_n / n = 0 (the k_n samples are a negligible fraction of the whole set: n grows faster than k_n does)

Satisfying the Conditions. Two common methods satisfy these conditions, and both converge: fix the volume of region R_n as a function of n, as in Parzen windows, with V_n = 1/√n; or fix the number of points in the region, as in k-nearest neighbors, with k_n = √n.
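A brief sketch contrasting the two schedules at a single point (all parameter choices here are illustrative): the Parzen route fixes V_n = 1/√n and counts the samples that fall inside, while the k-nearest-neighbor route fixes k_n = √n and grows the interval until it encloses that many samples.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 0.0  # estimate the density here; the true N(0,1) value is ~0.3989

for n in (100, 10_000, 1_000_000):
    samples = rng.normal(size=n)
    dist = np.abs(samples - x)

    # Parzen route: fix the volume V_n = 1/sqrt(n), count samples inside.
    V_n = 1 / np.sqrt(n)
    k = np.sum(dist < V_n / 2)
    parzen = (k / n) / V_n

    # kNN route: fix k_n = sqrt(n), grow the region to enclose k_n samples.
    k_n = int(np.sqrt(n))
    V_knn = 2 * np.sort(dist)[k_n - 1]   # interval just enclosing k_n samples
    knn = (k_n / n) / V_knn

    print(f"n={n:>9}  Parzen={parzen:.4f}  kNN={knn:.4f}")
```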

Example (figure): shrinking the region by fixing the volume, V_n = 1/√n, versus fixing the number of enclosed points, k_n = √n.

Parzen Windows. Assume R_n is a d-dimensional hypercube and let h_n be the length of an edge of that cube, so the volume of the cube is V_n = h_n^d. We need to determine k_n, the number of samples that fall within R_n.

Parzen Windows. Define a window function that tells us whether a sample lies in R_n. With V_n = h_n^d (h_n: length of an edge of R_n), let φ(u) be the window function

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d; 0 otherwise

Example. Assume d = 2 and h = 1. Then φ((x − x_i)/h_n) = 1 exactly when x_i falls within R_n, the unit square centered at x. (Figure: the square −1/2 ≤ x_1, x_2 ≤ 1/2 on which φ(u) = 1.)

Parzen Windows. The number of samples in R_n is computed as

k_n = Σ_{i=1}^{n} φ((x − x_i) / h_n)

Substituting into the earlier estimate p_n(x) = (k_n/n)/V_n gives the redefined Parzen-window estimate

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i) / h_n)
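A self-contained sketch of this hypercube estimator, assuming NumPy and with function names of my own choosing: φ fires when every coordinate of (x − x_i)/h_n lies within ±1/2, and p_n(x) averages the scaled window values.

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h^d."""
    n, d = samples.shape
    V = h ** d
    u = (x - samples) / h                     # shape (n, d)
    return hypercube_window(u).sum() / (n * V)

rng = np.random.default_rng(2)
samples = rng.normal(size=(5000, 2))          # n = 5000 samples, d = 2
x = np.zeros(2)
print(parzen_estimate(x, samples, h=0.5))     # true N(0,I) density at 0: ~0.159
```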

Generalizing φ(x). p_n(x) is an average of functions of x and the samples x_i; the window function φ is being used for interpolation, with each x_i contributing to p_n(x) according to its distance from x. We would like φ itself to be a legitimate density function:

φ(v) ≥ 0 and ∫ φ(v) dv = 1

Window Width. Recall that V_n = h_n^d (h_n: length of an edge of R_n) and

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i) / h_n)

The width h_n clearly affects both the amplitude and the width of the effective kernel. Define

δ_n(x) = (1/V_n) φ(x / h_n)

so that the estimate becomes

p_n(x) = (1/n) Σ_{i=1}^{n} δ_n(x − x_i)

Window Width. Very large h_n: the amplitude of δ_n is small, and x_i must be far from x before δ_n(x − x_i) differs from δ_n(0); p_n(x) is a superposition of broad, slowly changing functions, an out-of-focus estimate with too little resolution. Very small h_n: the amplitude of δ_n is large, and the peak of δ_n(x − x_i) is sharp, occurring near x = x_i; p_n(x) is a superposition of sharp pulses, an erratic, noisy estimate with too much statistical instability.

Window Width. For any choice of h_n the distribution remains normalized:

∫ δ_n(x − x_i) dx = ∫ (1/V_n) φ((x − x_i) / h_n) dx = ∫ φ(u) du = 1
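A quick numerical check of this normalization in one dimension (illustrative code): integrating δ_n(x − x_i) = (1/V_n) φ((x − x_i)/h_n) over a grid returns 1 for any h_n, since V_n = h_n when d = 1.

```python
import numpy as np

def delta_n(x, x_i, h):
    """delta_n(x - x_i) = (1/V_n) * phi((x - x_i)/h), with V_n = h in 1-D."""
    return (np.abs((x - x_i) / h) <= 0.5).astype(float) / h

x_i = 0.3
grid = np.linspace(-10, 10, 200_001)
for h in (0.1, 1.0, 5.0):
    integral = np.trapz(delta_n(grid, x_i, h), grid)
    print(f"h={h:<4} integral={integral:.4f}")   # ~1.0 for every h
```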

Convergence. With limited samples, the best we can do for h_n is a compromise. With unlimited samples, we can let V_n slowly approach zero as n increases, and then p_n(x) → p(x). For any fixed x, p_n(x) depends on the random samples (x_1, x_2, ..., x_n), so p_n(x) has some mean p̄_n(x) and variance σ_n²(x).

Convergence of the Mean. The expected value of the estimate is

p̄_n(x) = E[p_n(x)] = ∫ δ_n(x − v) p(v) dv

an averaged value of the unknown true density: a blurred or smoothed version of p(x) as seen through the averaging window. In the limit V_n → 0, δ_n(x − v) approaches a delta function centered at x, and the expected value of the estimate approaches the true density.

Convergence of the Variance. For any n, the expected value approaches the true density if we let V_n → 0; but for any finite set of samples such an estimate is useless ("spiky"), so we must also consider the variance. The variance obeys the bound

σ_n²(x) ≤ sup(φ(·)) p̄_n(x) / (n V_n)

so we should let V_n → 0 more slowly than 1/n, ensuring n V_n → ∞ and hence σ_n²(x) → 0.


Illustration. The behavior of the Parzen-window method in the case where p(x) ~ N(0,1). Let

φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n > 1; h_1 a known parameter)

Then

p_n(x) = (1/n) Σ_{i=1}^{n} (1/h_n) φ((x − x_i) / h_n)

is an average of normal densities centered at the samples x_i.
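A sketch of this illustration under assumed parameter values (h_1 = 1 and a few evaluation points of my choosing): draw samples from N(0,1), form the Gaussian-window estimate with h_n = h_1/√n, and watch it approach the true density as n grows.

```python
import numpy as np

def gaussian_parzen(x, samples, h1):
    """p_n(x) with phi(u) = N(0,1) density and width h_n = h1/sqrt(n)."""
    n = len(samples)
    h_n = h1 / np.sqrt(n)
    u = (x[:, None] - samples[None, :]) / h_n
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=1) / h_n

rng = np.random.default_rng(3)
x = np.array([-1.0, 0.0, 1.0])
true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

for n in (10, 100, 10_000):
    samples = rng.normal(size=n)
    est = gaussian_parzen(x, samples, h1=1.0)
    print(f"n={n:>6}  estimate={np.round(est, 3)}  true={np.round(true, 3)}")
```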

Probabilistic Neural Network. The PNN for this case has: 1. d input units comprising the input layer; 2. n pattern units comprising the pattern layer; 3. c category units. Each input unit is connected to each pattern unit, and each pattern unit is connected to one and only one category unit, the one corresponding to the category of its training sample. The connections from the input units to the pattern units have modifiable weights w, which are learned during training.

(Figure: network diagram. Input units x_1, ..., x_d are fully connected through weights w_11, ..., w_dn to pattern units p_1, ..., p_n in the pattern layer; each pattern unit emits a nonlinear activation to exactly one of the category units ω_1, ..., ω_c.)

PNN Training. The training procedure is simple, consisting of three steps: 1) Normalize the training feature vectors so that ||x_i|| = 1 for all i = 1, ..., n. 2) Set the weight vector w_i = x_i for all i = 1, ..., n; w_i consists of the weights connecting the input units to the ith pattern unit. 3) Connect pattern unit i to the category unit corresponding to the category of x_i, for all i = 1, ..., n.

PNN Classification. 1. Normalize the test pattern x and place it at the input units. 2. Each pattern unit computes the inner product net_k = w_k^T x as its net activation and emits the nonlinear function

f(net_k) = exp((net_k − 1) / σ²)

(Since ||x|| = ||w_k|| = 1, we have ||x − w_k||² = 2 − 2 net_k, so this activation equals exp(−||x − w_k||² / (2σ²)), a Gaussian window centered on the stored pattern.) 3. Each category unit sums the contributions from all pattern units connected to it, yielding the Parzen-window estimate P_n(x | ω_j) ∝ Σ_{i: x_i ∈ ω_j} f(net_i). 4. Classify by selecting the maximum of P_n(x | ω_j), j = 1, ..., c.
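A compact sketch of the train-and-classify pipeline from the last two slides (the toy data, the σ value, and the function names are illustrative assumptions): training stores each normalized pattern as a weight vector, and classification sums the Gaussian activations per category.

```python
import numpy as np

def pnn_train(X, y):
    """One-pass PNN training: normalize each pattern and store it as a weight."""
    W = X / np.linalg.norm(X, axis=1, keepdims=True)   # w_i = x_i / ||x_i||
    return W, y

def pnn_classify(x, W, y, sigma=0.5):
    """Score each class by summing Gaussian activations of its pattern units."""
    x = x / np.linalg.norm(x)                # normalize the test pattern
    net = W @ x                              # net_k = w_k . x
    act = np.exp((net - 1.0) / sigma**2)     # f(net_k) = exp((net_k - 1)/sigma^2)
    classes = np.unique(y)
    scores = np.array([act[y == c].sum() for c in classes])
    return classes[np.argmax(scores)]

# Toy data: two classes in d = 2 (illustrative).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([2, 0], 0.3, (20, 2)),
               rng.normal([0, 2], 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

W, y_train = pnn_train(X, y)
print(pnn_classify(np.array([2.0, 0.2]), W, y_train))   # expect class 0
```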

Advantages of PNN. Speed of learning: since w_k = x_k, training requires only a single pass through the data. Time complexity: a parallel implementation runs in O(1), since the inner products can be computed in parallel. New training patterns can be incorporated quite easily.

Summary. Non-parametric estimation can be applied to any distribution of data. The Parzen-window method provides a good estimate of the pdf without assuming its functional form. The quality of the estimate depends on the number of samples n and on the Parzen window size. The PNN gives an efficient implementation of the Parzen-window method.

References
R.J. Schalkoff (1992), Pattern Recognition: Statistical, Structural, and Neural Approaches, Wiley.
R.O. Duda, P.E. Hart, and D.G. Stork (2001), Pattern Classification, 2nd Edition, Wiley.