INF 4300 Classification III — Anne Solberg, 29.10.14

The agenda today:
- More on estimating classifier accuracy
- Curse of dimensionality
- knn-classification
- K-means clustering

Notation:
- x_i: feature vector for pixel i
- ω_i: the class label for pixel i
- K: the number of classes given in the training data
- A mask with training pixels
- A multiband image with n spectral channels or features

With a multivariate Gaussian model, the class-conditional density is

p(x) = 1 / ((2π)^{n/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) )

Confusion matrices
A matrix with the true class labels versus the estimated class labels for each class:

                        Estimated class label
True class label    Class 1   Class 2   Class 3   Total # of samples
Class 1                80        15         5          100
Class 2                 5       140         5          150
Class 3                25        50       125          200
Total                 110       205       135          450

True/false positives/negatives (e.g., testing for cancer)
- True positive (TP): patient has cancer and the test result is positive.
- True negative (TN): a healthy patient and a negative test result.
- False positive (FP): a healthy patient who gets a positive test result.
- False negative (FN): a cancer patient who gets a negative test result.
Good to have: TP & TN. Bad to have: FP (but this will probably be detected). Worst to have: FN (may go undetected).
Sensitivity and specificity
- Sensitivity: the proportion of the dataset that tested positive out of all the positive patients tested: Sensitivity = TP/(TP+FN). This is the probability that the test is positive given that the patient is sick. Higher sensitivity means that fewer disease cases go undetected.
- Specificity: the proportion of the dataset that tested negative out of all the negative patients tested: Specificity = TN/(TN+FP). This is the probability that the test is negative given that the patient is not sick. Higher specificity means that fewer healthy patients are labeled as sick.

Bayes classification with a loss function
In cases where different classes have different importance (e.g. sick/healthy), we can incorporate this into a Bayesian classifier if we consider the loss. Let λ(α_i | ω_j) be the loss if we decide class α_i when the true class is ω_j. The risk of deciding class α_i is then:

R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x)

To minimize the overall risk, compute R(α_i | x) for i = 1, ..., c and choose the class for which R(α_i | x) is minimum.

Outliers and doubt
In a classification problem, we might want to identify outlier and doubt samples. We might want an ideal classifier to report:
- "this sample is from class l" (usual case)
- "this sample is not from any of the classes" (outlier)
- "this sample is too hard for me" (doubt/reject)
The two last cases should lead to a rejection of the sample!

Outliers
- Heuristically defined as samples which did not come from the assumed population of samples.
- Outliers can result from some breakdown in preprocessing.
- Outliers can also come from pixels belonging to other classes than the classes in the training data set. Example: K tree species classes, but a few road pixels divide the forest region.
- One way to deal with outliers is to model them as a separate class, e.g., a Gaussian with very large variance, and estimate its prior probability from the training data.
- Another approach is to decide on some threshold on the a posteriori probability, and if a sample falls below this threshold for all classes, declare it an outlier.
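The definitions above can be sketched in code. This is a minimal illustration: the function names, the 10:1 loss ratio, and the example posteriors are my own choices, not from the slides; only the formulas themselves are.

```python
def sensitivity(tp, fn):
    # Proportion of truly positive cases that the test detects: TP/(TP+FN).
    return tp / (tp + fn)

def specificity(tn, fp):
    # Proportion of truly negative cases that the test clears: TN/(TN+FP).
    return tn / (tn + fp)

def bayes_decision(posteriors, loss):
    # posteriors[j] = P(omega_j | x); loss[i][j] = loss of deciding class i
    # when the true class is j. Return the index i minimizing the risk
    # R(alpha_i | x) = sum_j loss[i][j] * P(omega_j | x).
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)

# Illustrative loss: class 0 = sick, class 1 = healthy. A false negative
# (deciding "healthy" for a sick patient) costs 10, a false positive costs 1.
loss = [[0, 1],
        [10, 0]]
# Even though P(healthy | x) = 0.8, the costly false negative makes
# "sick" (index 0) the minimum-risk decision.
print(bayes_decision([0.2, 0.8], loss))  # -> 0
```

This also shows why risk minimization differs from picking the maximum posterior: the asymmetric loss shifts the decision boundary toward the costly error.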
Doubt samples
- Doubt samples are samples for which the class with the highest probability is not significantly more probable than some of the other classes (e.g. two classes have essentially equal probability).
- Doubt pixels typically occur on the border between two classes ("mixels").
- Close to the decision boundary, the probabilities will be almost equal.
- Classification software can allow the user to specify thresholds for doubt.

The training / test set dilemma
- Ideally we want to maximize the size of both the training and test datasets.
- Obviously, there is a fixed amount of available data with known labels.
- A very simple approach is to separate the dataset into two random subsets.
- For small sample sizes we may have to use another strategy: cross-validation. This is a good strategy when we have very few ground-truth samples. It is common in medicine, where we might have a small number of patients with a certain type of cancer, and the cost of obtaining more ground-truth data might be so high that we have to make do with a small number of samples.

Cross-validation / leave-n-out
A very simple (but computationally expensive) idea that allows us to "fake" a large test set:
- Train the classifier on a set of N−n samples.
- Test the classifier on the n remaining samples.
- Repeat N/n times (depending on the subsampling).
- Report the average performance over the repeated experiments as the test-set error.
An example with leave-1-out and 30 samples:
- Select one sample to leave out.
- Train on the remaining 29 samples.
- Classify the one sample and store its class label.
- Repeat this 30 times.
- Count the number of misclassifications among the 30 experiments.
Leave-n-out estimation generally overestimates the classification accuracy. Feature selection should be performed within the loop, not in advance! Using a training set and a test set of approximately the same size is better.

The covariance matrix and dimensionality
Assume we have S classes and an n-dimensional feature vector. With a fully multivariate Gaussian model, we must estimate S different mean vectors and S different covariance matrices from training samples.
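The leave-1-out procedure above can be sketched as follows. The 1-NN classifier and the toy data are illustrative stand-ins; the slides do not fix a particular classifier.

```python
def nn_classify(train_x, train_y, x):
    # Assign x the label of its nearest training sample (Euclidean distance).
    d2 = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[min(range(len(d2)), key=d2.__getitem__)]

def leave_one_out_error(xs, ys):
    errors = 0
    for i in range(len(xs)):
        # Train on all samples except sample i, then classify sample i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nn_classify(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Two well-separated toy clusters: every held-out sample is classified
# correctly by its neighbors, so the estimated error rate is 0.
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (1.0, 1.1), (1.2, 1.0), (0.9, 0.9)]
ys = [0, 0, 0, 1, 1, 1]
print(leave_one_out_error(xs, ys))  # -> 0.0
```

Note that if feature selection were part of the pipeline, it would have to run inside the loop, on the 29-sample training set of each fold, exactly as the slide warns.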
- μ̂ has n elements; Σ̂ has n(n−1)/2 off-diagonal elements to estimate.
- Assume that we have M training samples from each class.
- Given M, there is a maximum of the achievable classification performance for a certain value of n; increasing n beyond this limit will lead to worse performance. Adding more features is not always a good idea!
- Rule of thumb for the total number of samples: more than 10·n·S, i.e. M > 10·n samples per class.
- If we have limited training data, we can use diagonal covariance matrices or regularization.
The curse of dimensionality
- In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade.
- For a finite training sample size, the correct classification rate initially increases when adding new features, attains a maximum, and then begins to decrease.
- For a high dimensionality, we will need lots of training data to get the best performance: roughly 10 samples per feature per class.
[Figure: correct classification rate as a function of feature dimensionality, for different amounts of training data; equal prior probabilities of the two classes are assumed.]

Use few, but good features
To avoid the curse of dimensionality we must take care in finding a set of relatively few features. A good feature has high within-class homogeneity and should ideally have large between-class separation. In practice, one feature is not enough to separate all classes, but a good feature should:
- separate some of the classes well
- isolate one class from the others.
If two features look very similar (or have high correlation), they are often redundant, and we should use only one of them.
Class separation can be studied by:
- visual inspection of the feature image overlaid with the training mask
- scatter plots.
Evaluating features by actually training a classifier is difficult to do automatically, so manual interaction is normally required.

How do we beat the curse of dimensionality?
- Use regularized estimates for the Gaussian case: diagonal covariance matrices, or regularized covariance estimation (INF 5300).
- Generate few, but informative features: careful feature design given the application.
- Reduce the dimensionality: feature selection (more in INF 5300), feature transforms (INF 5300).

Exhaustive feature selection
If for some reason you know that you will use d out of the D available features, an exhaustive search will involve testing

n = D! / ((D − d)! d!)

combinations. If we want to perform an exhaustive search through D features for the optimal subset of the d ≤ m best features, the number of combinations to test is

n = Σ_{d=1}^{m} D! / ((D − d)! d!)
This is impractical even for a moderate number of features! With m = 5 and D = 100, n = 79,375,495.
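The count can be checked directly; `exhaustive_subsets` is a hypothetical helper name, not something from the slides.

```python
from math import comb

def exhaustive_subsets(D, m):
    # Number of feature subsets to test in an exhaustive search over D
    # features for the best subset of at most m features:
    # sum over d = 1..m of C(D, d) = D! / ((D - d)! d!).
    return sum(comb(D, d) for d in range(1, m + 1))

print(exhaustive_subsets(100, 5))  # -> 79375495
```

For comparison, testing every non-empty subset of all D = 100 features would require 2^100 − 1 evaluations, which is why suboptimal search strategies (next slides) are needed.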
Suboptimal feature selection
- Select the best single features based on some quality criterion, e.g., estimated correct classification rate.
- A combination of the best single features will often contain correlated features and will therefore be suboptimal.
More in INF 5300:
- Sequential forward selection implies that once a feature is selected or removed, this decision is final.
- Stepwise forward-backward selection overcomes this; it is a special case of the "add-a, remove-r" algorithm.
- Improved into floating search by making the number of forward and backward search steps data dependent.
- Adaptive floating search; oscillating search.

Examples of feature selection from INF 5300 — Method 1: Individual feature selection
- Each feature is treated individually (no correlation/covariance between features is considered).
- Select a criterion, e.g. a distance measure.
- Rank the features according to the value of the criterion C(k).
- Select the set of features with the best individual criterion values.
- Multiclass situation: average class separability, or the worst case C(k) = min distance(i, j), is often used.
- Advantage of individual selection: computation time. Disadvantage: no correlation is utilized.

Method 2: Sequential backward selection
- Select l features out of m. Example: 4 features x1, x2, x3, x4.
- Choose a criterion C and compute it for the vector [x1, x2, x3, x4]^T.
- Eliminate one feature at a time by computing C for [x1, x2, x3]^T, [x1, x2, x4]^T, [x1, x3, x4]^T and [x2, x3, x4]^T.
- Select the best combination, say [x1, x2, x3]^T.
- From the selected 3-dimensional feature vector, eliminate one more feature: evaluate the criterion for [x1, x2]^T, [x1, x3]^T, [x2, x3]^T and select the one with the best value.
- Number of combinations searched: 1 + (1/2)((m+1)m − l(l+1)).

Method 3: Sequential forward selection
- Compute the criterion value for each feature. Select the feature with the best value, say x1.
- Form all possible combinations of x1 (the winner of the previous step) and one new feature, e.g. [x1, x2]^T, [x1, x3]^T, [x1, x4]^T, etc.
- Compute the criterion and select the best combination, say [x1, x3]^T. Continue adding one new feature at a time.
- Number of combinations searched: lm − l(l−1)/2.
- Backward selection is faster if l is closer to m than to 1.
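Method 3 can be sketched as below. The criterion here is a toy stand-in (a fixed additive score per feature); in practice it would be, e.g., estimated classification accuracy or a class-separability measure, which the slides leave open.

```python
def forward_selection(features, l, criterion):
    # Sequential forward selection: greedily grow the subset one feature
    # at a time, always adding the feature whose inclusion scores best.
    selected = []
    remaining = list(features)
    while len(selected) < l:
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)   # the decision to include `best` is final
        remaining.remove(best)
    return selected

# Toy criterion: each feature has a fixed score and subsets score additively.
# (With such an independent criterion, forward selection is trivially optimal;
# with correlated features it generally is not, which is the point above.)
scores = {"x1": 0.9, "x2": 0.3, "x3": 0.7, "x4": 0.1}
crit = lambda subset: sum(scores[f] for f in subset)

print(forward_selection(["x1", "x2", "x3", "x4"], 2, crit))  # -> ['x1', 'x3']
```

Each round evaluates at most m candidate subsets, giving the lm − l(l−1)/2 total stated above instead of the exhaustive combinatorial count.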
Hyperspectral image example
- A hyperspectral image from France: 81 features/spectral bands, 6 classes (tree species).
- μ̂ has 81 parameters to compute for each class; Σ̂ has 81·80/2 = 3240 parameters for each class.
- 1000 training samples for each class. Test set: 1000-2000 samples for each class.
[Figure: 3 of the 81 bands shown as an RGB image; plot of the 81 mean values for each class.]

Hyperspectral example: classification accuracy vs. number of features on the test set
[Figure: classification accuracy as a function of the number of features used in a Gaussian classifier; each curve shows a different dimensionality-reduction method.]
Note that as we include more features, the classification accuracy first increases, then it starts to decrease. Curse of dimensionality!

Exploratory data analysis
For a small number of features, manual data analysis to study the features is recommended. Choose intelligent features. Evaluate e.g.:
- error rates for single-feature classification
- scatter plots of feature combinations.

k-nearest-neighbor classification
A very simple classifier. Classification of a new sample x_i is done as follows:
- Out of the N training vectors, identify the k nearest neighbors (measured by Euclidean distance) in the training set, irrespective of class label.
- Out of these k samples, identify the number of vectors k_j that belong to class j, j = 1, 2, ..., M (if we have M classes).
- Assign x_i to the class j with the maximum number of k_j samples.
k should be odd, and must be selected a priori.
knn example
[Figure: a test sample among training points from two classes; with k = 1 it is assigned to the class of its single nearest neighbor, with k = 5 to the majority class among its five nearest neighbors.]

About knn classification
- If k = 1 (1NN classification), each sample is assigned to the same class as the closest sample in the training data set. If the number of training samples is very high, this can be a good rule.
- If k → ∞, this is theoretically a very good classifier.
- This classifier involves no training time, but the time needed to classify one pattern x_i depends on the number of training samples, as the distance to all points in the training set must be computed.
- Practical values for k: 3 ≤ k ≤ 9.

Supervised or unsupervised classification
Supervised classification:
- Classify each object or pixel into a set of k known classes.
- Class parameters are estimated using a set of training samples from each class.
- The object is classified to the class which has the highest posterior probability.
Unsupervised classification:
- Partition the feature space into a set of k clusters.
- k is not known and must be estimated (difficult).
- The clusters we get are not necessarily the classes we want.
In both cases, classification is based on the values of the set of n features x1, ..., xn.

Unsupervised classification / clustering
- Divide the data into clusters based on similarity (or dissimilarity).
- Similarity or dissimilarity is based on distance measures (sometimes called proximity measures): Euclidean distance, Mahalanobis distance, etc.
- Two main approaches to clustering: hierarchical (divisive or agglomerative) and non-hierarchical (sequential).
- Non-hierarchical methods are often used in image analysis.
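The three knn steps above can be sketched as follows; the function name and toy data are my own (the slides specify only the procedure).

```python
from collections import Counter

def knn_classify(train_x, train_y, x, k):
    # Step 1: squared Euclidean distance from x to every training vector.
    d2 = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    # Step 2: indices of the k nearest neighbors, irrespective of class label.
    nearest = sorted(range(len(d2)), key=d2.__getitem__)[:k]
    # Step 3: majority vote among the neighbors' class labels.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
train_y = ["A", "A", "A", "B", "B"]
print(knn_classify(train_x, train_y, (0.9, 1.0), 3))  # -> 'B'
```

Note the cost structure the slide mentions: there is no training step at all, but each call computes the distance to all N training vectors.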
K-means clustering
Note: "K-means algorithm" normally means ISODATA, but different definitions are found in different books. K is assumed to be known.
1. Start by assigning the K cluster centers, e.g. to K random data points, the first K points, or K equally spaced points. For k = 1, ..., K, set μ_k equal to the feature vector x_k of these points.
2. Assign each object/pixel x_i in the image to the closest cluster center using Euclidean distance. Compute for each sample the squared distance to each cluster center:

   r_k^2 = (x_i − μ_k)^T (x_i − μ_k)

   and assign x_i to the closest cluster (minimum r_k^2).
3. Recompute the cluster centers based on the new labels.
4. Repeat from 2 until the number of label changes is below a limit.
ISODATA K-means: splitting and merging of clusters are included in the algorithm.

k-means example
- Step 1: choose k cluster centers μ_k^(0) randomly from the available data points. Here: k = 3.
- Step 2: assign each of the objects x_n to the nearest cluster center: x_n ∈ c_j^(i), where j = arg min_{k=1..K} ||x_n − μ_k^(i)||.
[Figure: data points with three cluster centers μ1, μ2, μ3.]
k-means example (continued)
- Step 3: recalculate each cluster center μ_k^(i+1) based on the clustering in iteration i:

  μ_k^(i+1) = (1/N_k) Σ_{x_n ∈ c_k^(i)} x_n

- Step 4: if the clusters don't change, μ_k^(i+1) ≈ μ_k^(i) (or a prespecified number of iterations is reached), terminate; else increase the iteration counter i and go to step 2.
[Figure: the cluster centers are recalculated in each iteration until the assignments stabilize.]

k-means variations
- The generic algorithm has many improvements.
- ISODATA allows for merging and splitting of clusters; among other things, this seeks to repair an initially bad choice of k.
- k-medians is another variation.
- k-means optimizes a probabilistic model.
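Steps 1-4 above can be sketched as plain k-means; the toy data, function name, and "stop when assignments no longer change" termination (one of the options the slides allow) are my own choices.

```python
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: pick k data points
    labels = [None] * len(points)
    while True:
        # Step 2: assign each point to the closest center (squared
        # Euclidean distance r_k^2).
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:             # step 4: no changes -> stop
            return centers, labels
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                      # keep old center if cluster empty
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))

points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centers, labels = kmeans(points, 2)
print(labels)  # the first three points share one label, the last three the other
```

ISODATA would add split/merge tests between iterations; the core assign-then-update loop is the same.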
How do we determine k?
- The number of "natural" clusters in the data rarely corresponds to the number of information classes of interest.
- Cluster validity indices can give an indication of how many clusters there are.
- Use cluster merging or splitting tailored to the application.
- Rule of thumb for practical image clustering: start with approximately twice as many clusters as expected information classes, determine which clusters correspond to the information classes, then split and merge clusters to improve.

Example: K-means clustering
[Figure: the original image; K-means with K = 5; supervised classification with 4 classes; K-means with K = 10.]

Learning goals for this lecture
- Understand how different measures of classification accuracy work: confusion matrices, sensitivity/specificity, average classification accuracy.
- Be familiar with the curse of dimensionality and the importance of selecting few, but good features.
- Understand knn classification.
- Understand the difference between supervised and unsupervised classification.
- Understand the K-means algorithm.