Statistical Pattern Recognition Classification: Non-Parametric Modeling
Hamid R. Rabiee, Jafar Muhammadi
Spring 2014
http://ce.sharif.edu/courses/92-93/2/ce725-2/
Agenda
- Parametric Modeling
- Non-Parametric Modeling
- Density Estimation
- Parzen Window
- Parzen Window - Illustration
- Parzen Window and Classification
- K-Nearest Neighbor (K-NN)
- K-NN - Illustration
- K-NN and a-posteriori probabilities
- K-NN and Classification
- Pros and cons
Parametric Modeling
- Data availability in a Bayesian framework: we could design an optimal classifier if we knew P(w_i) and P(x|w_i). Unfortunately, we rarely have that much information available!
- Assumptions: a priori information about the problem; the form of the underlying density. Example: normality of P(x|w_i), characterized by 2 parameters.
- Estimation techniques (studied in the stochastic processes course): Maximum-Likelihood (ML) and Bayesian estimation (MAP: Maximum A Posteriori). Results are nearly identical, but the approaches are different!
- Other techniques (will be discussed later): Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM)
Non-Parametric Modeling
- Non-parametric modeling tries to model arbitrary distributions without assuming a certain parametric form.
- Non-parametric models can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known. Moreover, they can be used with multimodal distributions, which are much more common in practice than unimodal distributions.
- There are two types of non-parametric methods:
  - Estimating P(x|w_j): Parzen window
  - Estimating P(w_j|x) (bypass the density and go directly to a-posteriori probability estimation): K-Nearest Neighbor
Density Estimation
- Basic idea: the probability that a vector x falls in a region R is P = ∫_R p(x') dx'. P is a smoothed (or averaged) version of the density function p(x).
- If we have a sample of size n, the probability that exactly k points fall in R is binomial: P_k = C(n, k) P^k (1 - P)^(n-k). The expected value of k is E(k) = nP.
- The ML estimate of P is P̂ = k/n; therefore the ratio k/n is a good estimate for P.
- Assuming p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write (V is the volume of R): P = ∫_R p(x') dx' ≈ p(x) V.
- Combining the above equations, the density estimate becomes: p(x) ≈ (k/n) / V
Density Estimation
- The volume V needs to approach zero if we want this estimate to converge to the true density. Practically, V cannot be allowed to become too small, since the number of samples is always limited; theoretically, if an unlimited number of samples were available, we could circumvent this difficulty.
- To estimate the density at x under these limitations, we do the following steps: in the n-th step, consider a total of n data samples centered around x; form a region R_n containing x; let V_n be the volume of R_n, k_n the number of samples falling in R_n, and p_n(x) the n-th estimate of p(x); then: p_n(x) = (k_n / n) / V_n
- Three necessary conditions for p_n(x) to converge to p(x): lim_{n→∞} V_n = 0; lim_{n→∞} k_n = ∞; lim_{n→∞} k_n/n = 0.
- There are two different ways of obtaining sequences of regions that satisfy these conditions:
  - Parzen-window estimation: shrink an initial region, e.g. V_n = 1/√n, and show that p_n(x) → p(x).
  - k-nearest-neighbor estimation: specify k_n as some function of n, such as k_n = √n; the volume V_n is grown until it encloses k_n neighbors of x.
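The counting estimate p_n(x) = (k_n/n)/V_n can be tried directly. A minimal NumPy sketch, assuming a 1-D N(0,1) target and a fixed small region (the sample size, region width, and function name are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)  # n samples drawn from N(0, 1)

def region_density_estimate(x, samples, h):
    """General estimate p_n(x) = (k_n / n) / V_n for the 1-D region
    R = [x - h/2, x + h/2], whose 'volume' is V_n = h."""
    n = len(samples)
    k = np.count_nonzero(np.abs(samples - x) <= h / 2)  # k_n samples fall in R
    return (k / n) / h

# The true N(0, 1) density at 0 is 1/sqrt(2*pi) ≈ 0.3989.
estimate = region_density_estimate(0.0, samples, h=0.2)
```

With n = 10,000 samples the estimate lands close to the true value; shrinking h with fixed n eventually makes the region empty, which is exactly the practical limitation the slide describes.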
Density Estimation
- Parzen window vs. k-nearest neighbor (comparison figure)
Parzen Window
- Parzen-window approach to estimating densities: assume the region R_n is a d-dimensional hypercube with volume V_n = h_n^d (h_n: length of the edge of R_n).
- Let φ(u) be the following window function: φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and 0 otherwise.
- φ((x - x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise. The number of samples in this hypercube is therefore: k_n = Σ_{i=1..n} φ((x - x_i)/h_n)
- Then we obtain the following estimate: p_n(x) = (1/n) Σ_{i=1..n} (1/V_n) φ((x - x_i)/h_n)
- p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1, ..., n). These functions can be general density functions!
Parzen Window
- Example: the behavior of the Parzen-window method for the case where both p(x) ~ N(0,1) and φ(u) ~ N(0,1).
- Let φ(u) = (1/√(2π)) e^(-u²/2) and h_n = h_1/√n (h_1: a known parameter). Thus: p_n(x) = (1/n) Σ_{i=1..n} (1/h_n) φ((x - x_i)/h_n)
- p_n is an average of normal densities centered at the samples x_i.
- Numerical results: for n = 1 and h_1 = 1, p_1(x) = φ(x - x_1) = (1/√(2π)) e^(-(x - x_1)²/2) = N(x_1, 1)
- For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable!
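The Gaussian-window average above is a one-liner in NumPy. A short sketch (the function name and sample values are mine, for illustration):

```python
import numpy as np

def parzen_gaussian(x, samples, h):
    """Parzen-window estimate with a Gaussian window:
    p_n(x) = (1/n) * sum_i (1/h) * phi((x - x_i) / h),
    where phi is the N(0, 1) pdf."""
    u = (x - np.asarray(samples, dtype=float)) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / h

# With a single sample x1 and h1 = 1, the estimate is exactly a normal
# density centered at x1, matching the slide: p_1(x) = N(x1, 1).
p = parzen_gaussian(1.0, [0.0], h=1.0)  # N(0, 1) pdf evaluated at 1
```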
Parzen Window - Illustration
- Example illustration (figures)
- Note that the n = ∞ estimates are the same and match the true density function regardless of window width.
Parzen Window - Illustration
- Example 2: the case where p(x) = λ1·U(a, b) + λ2·T(c, d) (unknown density), a mixture of a uniform and a triangle density.
- The estimates p_n behave the same as in the previous example.
Parzen Window and Classification
- In classifiers based on Parzen-window estimation: we estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior.
- Using the points of only category w_i, P(x|w_i) can be estimated; knowing P(w_i), posterior probabilities can be found.
- The decision region for a Parzen-window classifier depends upon the choice of window function, as illustrated in the following figure. (See next slide)
Parzen Window and Classification
- The left one: a small h (complicated boundaries); the right one: a larger h (simple boundaries).
- Compare the upper and lower regions of the two cases: a small h is appropriate for the upper region, a large h for the lower region.
- No single window width is ideal overall.
Parzen Window - 1D Example
- Suppose we have 7 samples D = {2, 3, 4, 8, 10, 11, 12}.
- Let the window width be h = 3; estimate the density at x = 1.
Parzen Window - 1D Example
- Suppose we have 7 samples D = {2, 3, 4, 8, 10, 11, 12}, h = 3.
- Plot the probability density function using the Parzen window; notice that the resulting PDF is not smooth.
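The 1D example above can be checked numerically with the hypercube window from the earlier slide. A sketch (the function name is mine):

```python
import numpy as np

def parzen_hypercube(x, samples, h):
    """Parzen estimate with the unit hypercube window:
    phi(u) = 1 if |u| <= 1/2 else 0, so
    p_n(x) = (1 / (n * h)) * sum_i phi((x - x_i) / h)."""
    samples = np.asarray(samples, dtype=float)
    k = np.count_nonzero(np.abs((x - samples) / h) <= 0.5)
    return k / (len(samples) * h)

D = [2, 3, 4, 8, 10, 11, 12]
# The window [x - h/2, x + h/2] = [-0.5, 2.5] around x = 1 captures only
# the sample x_i = 2, so p(1) = 1 / (7 * 3) = 1/21 ≈ 0.0476.
p_at_1 = parzen_hypercube(1.0, D, h=3.0)
```

Evaluating this on a grid of x values reproduces the non-smooth, staircase-like PDF the slide mentions, since the hypercube window jumps as samples enter or leave the interval.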
K-Nearest Neighbor
- Goal: a solution to the problem of the unknown best window function.
- Let the cell volume be a function of the training data: center a cell about x and let it grow until it captures k_n samples (k_n = f(n)). These k_n samples are called the k_n nearest neighbors of x.
- Two possibilities can occur: if the density is high near x, the cell will be small, which provides good resolution; if the density is low, the cell will grow large and stop only when higher-density regions are reached.
- We can obtain a family of estimates by setting k_n = k_1·√n and choosing different values for k_1.
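The cell-growing idea above is easy to sketch in 1-D: the interval around x is grown just far enough to enclose k samples. A minimal sketch (the helper name and the deterministic sample set are mine):

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """k-NN density estimate: grow an interval around x until it
    contains k samples; then p_n(x) = k / (n * V_n), where V_n is
    twice the distance to the k-th nearest sample."""
    samples = np.asarray(samples, dtype=float)
    dists = np.sort(np.abs(samples - x))
    V = 2.0 * dists[k - 1]            # smallest interval enclosing k samples
    return k / (len(samples) * V)

# Deterministic check: with samples [0, 1, 2, 3, 4], x = 2 and k = 3,
# the 3rd-nearest distance is 1, so V = 2 and the estimate is
# 3 / (5 * 2) = 0.3.
p = knn_density_1d(2.0, [0, 1, 2, 3, 4], k=3)
```

Note how the two cases on the slide fall out of the formula: dense neighborhoods give a small V (high resolution), sparse ones give a large V.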
K-NN - Illustration
- How do we classify a point using k-NN?
  - k = 1: belongs to the square class
  - k = 3: belongs to the triangle class
  - k = 7: belongs to the square class
K-NN - Illustration
- For k_n = √n and n = 1, the estimate becomes: p_n(x) = k_n / (n·V_n) = 1/V_1 = 1 / (2|x - x_1|)
K-NN and a-posteriori probabilities
- Goal: estimate P(w_i|x) from a set of n labeled samples.
- Let's place a cell of volume V around x and capture k samples; suppose k_i samples among the k turn out to be labeled w_i. Then p_n(x, w_i) = (k_i/n)/V, and an estimate for P_n(w_i|x) is: P_n(w_i|x) = p_n(x, w_i) / Σ_{j=1..c} p_n(x, w_j) = k_i / k
- k_i/k is the fraction of the samples within the cell that are labeled w_i.
- For minimum error rate, the most frequently represented category within the cell is selected.
- If k is large and the cell sufficiently small, the performance will approach the best possible.
K-NN and Classification
- The nearest neighbor rule (k = 1): let D = {x_1, x_2, ..., x_n} be a set of n labeled prototypes, and let x' ∈ D be the closest prototype to a test point x; then the nearest-neighbor rule for classifying x is to assign it the label associated with x'.
- The nearest-neighbor rule leads to an error rate greater than the minimum possible: the Bayes rate.
- If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (it can be demonstrated!). Think more about it: it means that 50% of the information needed to optimally classify point x is aggregated within its nearest labeled neighbor.
- If n → ∞, it is always possible to find x' sufficiently close so that P(w_i|x') ≈ P(w_i|x).
- If P(w_m|x) ≈ 1, then the nearest neighbor selection is almost always the same as the Bayes selection.
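The nearest-neighbor rule, and the k-NN majority vote from the earlier illustration, take only a few lines. A sketch with toy 2-D prototypes (the point coordinates and labels are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=1):
    """k-NN rule: assign x the majority label among its k nearest
    prototypes; k = 1 is the nearest-neighbor rule."""
    prototypes = np.asarray(prototypes, dtype=float)
    dists = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest
    votes = Counter(labels[i] for i in nearest)     # majority vote
    return votes.most_common(1)[0][0]

# Toy prototypes: two "square" points near the origin, three "triangle"
# points further away.
protos = [[0, 0], [0, 1], [2, 2], [2, 3], [2, 4]]
labels = ["square", "square", "triangle", "triangle", "triangle"]
label_k1 = knn_classify([0.4, 0.6], protos, labels, k=1)
```

As the illustration slide shows, the decision can change with k, since a larger k pools votes from a wider neighborhood.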
K-NN and Classification
- The nearest neighbor rule: in 2D, the nearest neighbor rule leads to a partitioning of the input space into Voronoi cells.
- In 3D, the cells are 3D, and the decision boundary resembles the surface of a crystal.
Pros and Cons
- Pros: no assumptions are needed about the distributions ahead of time (generality); with enough samples, convergence to an arbitrarily complicated target density can be obtained.
- Cons: the number of samples needed may be very large (it grows exponentially with the dimensionality of the feature space); these methods are very sensitive to the choice of window size (if too small, most of the volume will be empty; if too large, important variations may be lost); there may be severe requirements for computation time and storage.
Any Questions?
End of Lecture 8. Thank you!
Spring 2014
http://ce.sharif.edu/courses/92-93/2/ce725-2/