Part 3: Introduction to statistical classification techniques
Machine Learning, Part 3, March 2017
Fabio Roli
Preamble
Ø In Part 2 we have seen that if we know:
  the posterior probabilities P(ω_i/x), or the equivalent terms P(ω_i) and p(x/ω_i),
  and the loss matrix Λ,
then the minimum-risk theory allows us to design the optimal classifier (the one that minimizes the classification risk) for the task at hand. A minimal sketch of this decision rule is given below.
Ø However, in practical cases, we never know all this information.
Ø The only information that we usually have is a data set D (called the design or training set):
  D = {x_1, x_2, ..., x_n}, with x_i = (x_i1, x_i2, ..., x_id), i = 1,...,n, and each x_i belonging to one of the c classes (x_i ∈ ω_j, j = 1,...,c).
Ø Patterns x_i are drawn independently according to p(x_i/ω_j).
Note: in statistics, D is often called the sample of size n drawn from the distribution p(x). In pattern recognition, the term sample is usually used for the single pattern x_i.
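The minimum-risk rule picks the action α_i that minimizes the conditional risk R(α_i/x) = Σ_j λ_ij P(ω_j/x). A minimal Python sketch of this rule, using a hypothetical two-class posterior and a zero-one loss matrix as placeholder inputs:

```python
import numpy as np

def min_risk_decision(posteriors, loss):
    """Pick the action minimizing R(a_i/x) = sum_j loss[i, j] * P(w_j/x)."""
    risks = loss @ posteriors              # conditional risk of each action
    return int(np.argmin(risks))

posteriors = np.array([0.3, 0.7])          # hypothetical P(w_1/x), P(w_2/x)
loss = np.array([[0.0, 1.0],               # zero-one loss matrix, assumed for the example
                 [1.0, 0.0]])
print(min_risk_decision(posteriors, loss))  # -> 1, i.e. decide w_2
```

With the zero-one loss, this rule reduces to picking the class with the largest posterior probability.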
Classification techniques
Ø If we know the classes to which the patterns x_i of the design/training set belong, we speak of Supervised Classification.
Further information, beyond the data set D, that we can have:
Ø We may know the parametric model (the "parametric form") of the distribution p(x/ω_i), so that we can use Parametric Techniques.
Ø If we know nothing about the distribution p(x/ω_i), we are obliged to use the so-called Non-Parametric Techniques.
Ø Parametric Techniques: we know the parametric form of the distribution p(x/ω_i); for example, we know that the distribution is Gaussian.
Ø Non-Parametric Techniques: we know nothing about the distribution, and we are not able to get any information with an unsupervised analysis.
Note that we are assuming that estimating the priors P(ω_i) is an easy problem, an assumption that is often but not always true.
Ø Here we are disregarding the costs of classification: the choice of cost values is a problem-dependent issue, and very little can be said about it in general.
Classification: Parametric Techniques
Ø We know, or we assume, a parametric form of the distributions p(x/ω_i).
Ø The main problem is then to estimate the parameters of the model (e.g., the mean value and the variance of the Gaussian model); see the sketch after this list.
Ø We discuss these techniques in detail in Part 4.
Ø The estimation of the parameters is done using the data set D, or more often a subset of it (to avoid a problem called over-fitting).
Ø How can we assume a good parametric model of the distributions p(x/ω_i)? In practical applications we have two possibilities:
  We assume different parametric models, compute the parameters for each model, then compare the errors of the models and select the best one.
  We use Unsupervised Classification Techniques (basic concepts in Part 9) to gain some knowledge of the parametric form of p(x/ω_i): using the data set D we try to learn something about p(x/ω_i) (e.g., we discover that it is made up of two clusters, i.e., it is the sum of two Gaussian distributions).
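For instance, with an assumed Gaussian model the parameters to estimate are the mean and the variance of each class-conditional distribution. A minimal sketch of the maximum-likelihood estimates computed from a design set; the data below are invented placeholders:

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood estimates for a Gaussian model: per-feature mean and variance.
    X is an (n, d) array of training patterns belonging to one class."""
    return X.mean(axis=0), X.var(axis=0)   # np.var divides by n, the ML estimate

# Hypothetical single-class design set: n=5 patterns, d=2 features
X = np.array([[1.0, 2.1], [0.9, 1.8], [1.2, 2.3], [1.1, 2.0], [0.8, 1.9]])
mean, var = fit_gaussian(X)
print(mean, var)
```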
Classification: Non-Parametric Techniques
Ø We know nothing about the distribution p(x/ω_i), and we are not able to gain knowledge with an unsupervised analysis.
Ø We use techniques (Part 5) that allow us to estimate the densities p(x/ω_i), or the posterior probabilities P(ω_i/x), using the data set D.
Ø Non-parametric techniques aim at estimating the density functions p(x) directly; a minimal sketch follows.
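One widely used non-parametric density estimator (not named on this slide, but representative of the techniques of Part 5) is the Parzen-window estimate. A minimal one-dimensional sketch with a Gaussian kernel; the data set and the window width h are invented placeholders:

```python
import numpy as np

def parzen_density(x, data, h):
    """One-dimensional Parzen-window estimate of p(x) with a Gaussian kernel of width h."""
    kernels = np.exp(-0.5 * ((x - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()                  # average kernel contribution of the n patterns

data = np.array([0.1, 0.3, 0.35, 0.8, 1.1])  # hypothetical one-feature design set
print(parzen_density(0.4, data, h=0.2))
```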
Example of parametric techniques in biometrics
In biometric recognition, parametric techniques can be used to model the genuine and impostor distributions.
Parametric techniques sometimes provide lower performance than non-parametric techniques.
Linear discriminant functions
Ø In some cases it can be more effective to assume a parametric form of the discriminant functions g_i(x), i = 1,...,c, instead of a parametric form of the p(x/ω_i) (we discuss this in Part 6).
Ø For example, we can assume a linear form of the discriminant functions g_i(x), as in the sketch below.
In some cases linear functions discriminate well between classes that would be difficult to model by computing the distributions p(x/ω_i). It is worth noting that, in the end, what we want to do in many cases is just to classify, not to model the p(x/ω_i)!
Even if a linear discriminant function does not provide the optimal solution, its error rate can be acceptable for the task at hand!
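A minimal sketch of classification with linear discriminant functions g_i(x) = w_i · x + w_i0; the weights below are invented placeholders, standing in for whatever a training procedure (Part 6) would produce:

```python
import numpy as np

def linear_discriminants(X, W, w0):
    """Evaluate g_i(x) = w_i . x + w_i0 for each class and pick the largest.
    W is a (c, d) weight matrix, w0 a (c,) bias vector, X an (n, d) batch of patterns."""
    g = X @ W.T + w0                       # (n, c) discriminant values
    return g.argmax(axis=1)                # class index with the largest g_i(x)

# Hypothetical weights for c=2 classes in a d=2 feature space
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
w0 = np.array([0.0, 0.5])
X = np.array([[0.2, 0.9], [1.5, 0.1]])
print(linear_discriminants(X, W, w0))      # -> [1 0]
```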
Design of a classifier: basic design cycle
We have just a design set D = {x_1, x_2, ..., x_n}.
1. Perform an unsupervised analysis of D.
2. Do you know the form of p(x)?
  NO → Non-parametric techniques: split D into 3 sets (training, validation, and test set); use the training+validation sets to estimate the parameters.
  YES → Parametric techniques: split D into 3 sets (training, validation, and test set); use the validation set to estimate the parameters, and the training set to train the classifier.
3. Use the test set to estimate the error probability.
We see later that non-parametric techniques have some parameters to be estimated as well! A minimal splitting sketch is given below.
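A minimal sketch of the three-way split of the design set D; the 60/20/20 fractions are an assumption for illustration, not prescribed by the slides:

```python
import numpy as np

def split_design_set(X, y, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split D into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle pattern indices
    n_tr = int(train_frac * len(X))
    n_va = int(val_frac * len(X))
    tr, va, te = np.split(idx, [n_tr, n_tr + n_va])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X = np.arange(20.0).reshape(10, 2)         # hypothetical design set: n=10, d=2
y = np.arange(10) % 2                      # class labels for c=2 classes
(train, val, test) = split_design_set(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # -> 6 2 2
```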
Some notable concepts: feature (re)scaling
Ø Features used to characterize patterns are usually linked to physical measurements with different scales. Given the samples in D, the feature scales can be very different (e.g., height in meters and weight in kilograms). This is due to non-homogeneous physical measurements or to the intrinsic scales of the different features.
Ø Solution: normalization, i.e., (re)scaling of the features. The normalization operation can be regarded as a function h_j applied to feature j: it takes as input the original feature value x_ij and outputs the rescaled (normalized) feature value x'_ij = h_j(x_ij), with h_j being the normalization function (j = 1, 2, ..., d).
Some normalization functions
Given D = {x_1, x_2, ..., x_n}, with x_i = (x_i1, x_i2, ..., x_id), i = 1,...,n, widely used normalization functions h_j are the following:
Ø Division by the maximum value (over D):
  x'_ij = x_ij / x_j,max, with x_j,max = max_{k=1,...,n} x_kj
Ø Division by the maximum range:
  x'_ij = (x_ij - x_j,min) / (x_j,max - x_j,min) ∈ [0,1], with x_j,min = min_{k=1,...,n} x_kj
Ø Division by the standard deviation of feature j:
  x'_ij = (x_ij - m_j) / σ_j, with m_j = E{x_kj} and σ_j² = E{(x_kj - m_j)²},
estimated over D as:
  m̂_j = (1/n) Σ_{k=1}^{n} x_kj,  σ̂_j² = (1/n) Σ_{k=1}^{n} (x_kj - m̂_j)²
A sketch implementing the three functions is given below.
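A minimal sketch implementing the three normalization functions above, applied column-wise so that each feature is normalized separately; the two-feature data (height in meters, weight in kg) are invented placeholders:

```python
import numpy as np

def normalize(X, method="std"):
    """Rescale each feature (column) of X with one of the three functions above."""
    if method == "max":                    # divide by the maximum value
        return X / X.max(axis=0)
    if method == "range":                  # divide by the maximum range -> values in [0, 1]
        return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    if method == "std":                    # subtract the mean, divide by the std deviation
        return (X - X.mean(axis=0)) / X.std(axis=0)
    raise ValueError(f"unknown method: {method}")

X = np.array([[1.70, 65.0], [1.80, 80.0], [1.60, 55.0]])
print(normalize(X, "range"))
```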
Remarks on normalization
Ø The third normalization method (division by the standard deviation) is useful, for example, when the feature distribution is Gaussian. If feature x_ij has a Gaussian distribution, the normalized feature x'_ij has a standard (zero-mean, unit-variance) Gaussian distribution.
Ø Normalization must be done using all the patterns available in D, and for each feature separately.
Ø Hereafter, we assume that all the features used have been properly normalized, and therefore we omit the prime in x'_ij.
Some notable concepts: separation of classes
Definition of separated class: in a two-dimensional feature space (d = 2), a class is called separated if a curve (closed or open) exists such that all the samples of that class lie on the same side of the curve. In a d-dimensional feature space, the separating curves become hypersurfaces.
Two separated classes can be:
Ø Linearly separable, if the curve that separates the two classes is a linear function (for d = 2, the curve is a straight line);
Ø Non-linearly separable, if the separation requires non-linear curves.
Note that separation demands that two patterns belonging to different classes do not have the same feature values! So we are speaking of deterministic separation!
Notable concepts: multimodal classes
Ø A data class is multimodal if it contains clusters of patterns which are linearly separable, or if its density function has several peaks.
[Figure: (a) two linearly separable classes; (b) and (c) two classes that are not linearly separable; the class ω in (c) is bimodal.]
In (a) and (c) statistical methods work well; case (b) is much more difficult.
A notable concept: geometrical complexity of classes
The characteristics of a class also depend on the geometrical features of the data distribution in the feature space. In particular, if classes have elongated distributions and/or are strongly overlapped, some techniques work poorly.
Example: it is difficult to discriminate samples in regions where the two classes strongly overlap. Each class in the figure has a privileged direction in the feature space, and the features have a very high correlation (conditional correlation, given the class).
Correlation coefficient
The correlation between two features x_i and x_j can be measured by the correlation coefficient ρ_ij (i, j = 1, 2, ..., d). It is linked to the covariance σ_ij = E{(x_i - m_i)(x_j - m_j)} and to the feature variances σ_ii and σ_jj by:
  ρ_ij = σ_ij / √(σ_ii σ_jj)
If d is the number of features, [ρ_ij] is a square d × d matrix, with -1 ≤ ρ_ij ≤ 1 for i, j = 1, 2, ..., d and ρ_ii = 1 on the main diagonal (i = 1, 2, ..., d).
Features x_i and x_j are correlated if |ρ_ij| is high (e.g., > 0.8).
The correlation analysis can be done for each class and for the whole data set; a small sketch follows.
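A minimal sketch computing the matrix [ρ_ij] from data, mirroring the formula above (NumPy's np.corrcoef would give the same result); the data are invented placeholders:

```python
import numpy as np

def correlation_matrix(X):
    """d x d matrix of correlation coefficients rho_ij = sigma_ij / sqrt(sigma_ii * sigma_jj)."""
    cov = np.cov(X, rowvar=False)          # covariance estimates sigma_ij
    std = np.sqrt(np.diag(cov))            # standard deviations sqrt(sigma_ii)
    return cov / np.outer(std, std)        # ones on the main diagonal

X = np.random.default_rng(0).normal(size=(100, 3))  # hypothetical data: n=100, d=3
print(np.round(correlation_matrix(X), 2))
```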
Notable concepts: geometrical vs. probabilistic complexity
[Figure: two squares of very different size, illustrating two very unbalanced classes.]
Ø Probabilistic complexity: I must recognize one pattern out of one million! Two very unbalanced classes! The problem has simple geometrical features, but it is very hard!
[Figure: example of geometrical complexity.]