No-Parametric Techiques Jacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3
Parametric vs. No-Parametric Parametric Based o Fuctios (e.g Normal Distributio) Uimodal Oly oe peak Ulikely real data cofies to fuctio No-Parametric Based o Data As may peaks as Data has Methods for both p(w j x) ad P(w j x)
Desity Estimatio Probability a vector x will fall i regio R. P = p( x') dx' (1) R Assume samples, idetically distributed. By a Biomial equatio, Probability that k samples are i Regio R is k k Pk = P (1 P) (2) k Expected value for k = P, so P k/
Desity Estimatio For large, k/ is a good estimate for P If p(x) is cotiuous, ad p does ot vary i R P = p( x') dx' p( x) V R (4) Where V is volume of R Combie with P = k/ p( x) k / V
If volume V is fixed, ad is icreased towards, P(x) coverges to the average p of that volume. It peaks at the true probability, which is 0.7, ad with ifiite, will coverge to 0.7.
Desity Estimatio If is fixed, ad V approaches zero, V will become so small it has zero samples, or reside directly o a poit, makig p(x) 0 or I Practice, ca ot allow volume to become too small, sice data is limited. If you use a o-zero V, estimatio will have some variace i k/ from actual. I theory, with ulimited data, ca get aroud limitatios
Desity Est. with Ifiite data To get the desity at x. Assume a sequece of regios (R 1, R 2, R ) that all cotai x. I R i the estimate uses i samples V is volume of R, k is the umber of samples i R. p (x) is the th estimate for. p (x) = k / / V Goal is to get p (x) to coverge to p(x)
Covergece of p (x) to p(x) p (x) coverges to p(x) if the followig is true lim V = 0 lim k k = lim = 0 Regio R covers egligible space p(x) is average of ifiite samples (uless p(x) = 0) The samples of k, are a egligible amout of the whole set. gets bigger faster the k does.
Satisfyig coditios Two commo methods to satisfy coditios that both coverge Volume of Regio R based o Parze Widows V = 1 / Number of poits i regio (k) based o k earest eighbors k =
Example Volume based o Volume based o k
Parze Widows Assume R is d-dimesioal hypercube h legth of a edge of that cube Volume of cube is V = h d Need to determie k (umber of samples that fall withi R )
Parze Widows Defie a widow fuctio that tells us if a sample is i R : d V = h (h :legth of the edge of R ) Let ϕ(u) be the followig widow fuctio : ϕ(u) = 1 0 1 u j j 2 otherwise = 1,..., d
Example Assume d = 2, h = 1 ϕ((x-x i )/h ) = 1 if x i falls withi R x2 1/2 ϕ(u) = 1 0 1 u j j = 1,..., d 2 otherwise -1/2-1/2 1/2 x1
Parze Widows 1 i 1 u j = 1,..., d = j ϕ(u) = 2 k = 0 otherwise i= 1 x ϕ h x i Number of samples i R computed as k Derive ew p (x) Earlier, p (x) = (k /)/V, ow redefied as p (x) i 1 = = 1 V x ϕ h i= 1 x i
Geeralize ϕ(x) p (x) is average of fuctios of x ad samples x i Widow fuctio ϕ(x) is beig used for iterpolatio Each x i cotributes to p (x) accordig to its distace from x We d like ϕ(x) to be a legitimate desity fuctio ϕ( v) 0 ϕ ( v) dv =1
Widow Width Remember that: New defiitio: ) the edge of :legth of (h R = d V h = = = i i i h x x V x p ϕ 1 1 1 ) ( h clearly affects the amplitude ad width of delta fuctio = h x V x ϕ δ 1 ) ( = = = i i i x x x p 1 ) ( 1 ) ( δ
Widow Width Very large h Small amplitude of delta fuctio x i must be far from x before δ (x-x i ) chages from δ (0) p (x) is superpositio of a broad, slowly chagig fuctio (out of focus) Too little resolutio Very small h Large amplitude of delta fuctio Peak of δ (x-x i ) is large, occurs ear x=x i p (x) is superpositio of sharp pulses (erratic, oisy) Too much statistical istability
Widow Width For ay h distributio is ormalized 1 x x 1 V h i ( x x ) dx = ϕ dx = ϕ( u) δ i du =
Covergece With limited samples, best we ca do for h is compromise With ulimited samples, we ca let V slowly approach zero as icreases, ad p (x) p(x) For ay fixed x, p (x) depeds o r.v. samples (x 1, x 1,.. x ) p 2 ( x) has some mea p ( x) ad variaceσ ( x)
Covergece of mea ( x) δ ( x v) p( v) = dv Expected value of estimate is averaged value of ukow, true desity p(x) blurred or smoothed versio of p(x) as see through the averagig widow Limits, as V 0 V δ (x-v) delta fuctio cetered at x expected value of estimate true desity p
Covergece of variace For ay expected value of estimate true desity if we let V 0 for some set of samples estimate is useless ( spiky ) eed to cosider variace Should let V 0 slower tha σ 2 ( x) sup ( ϕ( ) ) p ( ) x V
Example
Illustratio The behavior of the Parze-widow method Case where p(x) N(0,1) Let ϕ(u) = (1/ (2π) exp(-u 2 /2) ad h = h 1 / (>1) Thus: p ( i 1 = x ) = (h 1 : kow parameter) x ϕ h i= 1 is a average of ormal desities cetered at the samples x i. 1 h x i
Probabilistic Neural Network The PNN for this case has 1. d iput uits comprisig the iput layer, 2. patter uits comprisig of the patter layer, 3. c category uits Each iput uit is coected to each patter uit. Each patter uit is coected to oe ad oly oe category uit correspodig the category of the traiig sample. The coectios from the iput to patter uits have modifiable weights w which will be leared durig the traiig.
Iput layer x 1 W 11 p1 x 2 p 2.... x d.... W d W d2 p Patters layer
Patters layer. p 1 p 2... ω 1. p k.......ω 2... Category uits.ω c p Activatios (Emissio of oliear fuctios)
PNN traiig The traiig procedure is simple cosistig of three simple steps. 1) Normalize the traiig feature vectors so that x i = 1 for all i = 1. 2) Set the weight vector w i = x i for all i = 1. w i cosists of weights coectig the iput uits to the ith patter uit. 3) Coect the patter uit i to the category uit correspodig to the category of x i for all i = 1.
PNN Classificatio 1. Normalize the test patter x ad place it at the iput uits 2. Each patter uit computes the ier product i order to yield the et activatio et k = w t k.x ad emit a oliear fuctio f ( et k ) etk 1 exp σ = 2 3. Each output uit sums the cotributios from all patter uits coected to it P ( x ω ) = i= 1 P( ω 4. Classify by selectig the maximum value of P (x ω j ) (j = 1,, c) j ϕ i j x )
Advatages of PNN Speed of learig Sice W k =X k, it requires a sigle pass thru traiig Time complexity For parallel implemetatio its O(1) as ier product ca be doe i parallel New traiig patters ca be icorporated quite easily
Summary No parametric estimatio ca be applied to ay radom distributio of data Parze widow method provide a better estimatio of pdf Estimatio depeds upo o. of samples ad Parze widow size PPN gives a efficiet Parze widow method implemetatio
Refereces R.J. Schalkoff. (1992) Patter Recogitio: Statistical, Structural, ad Neural Approaches, Wiley.* Patter Classificatio (2d Editio) by R.O. Duda, P. E. Hart ad D. Stork, Wiley 2001