INF 4300 Introduction to classification Anne Solberg
1 INF 4300 Introduction to classification Anne Solberg. Based on handout from Pattern Recognition by Theodoridis, available after the lecture.
2 Plan for this lecture: Explain the relation between thresholding and classification with 2 classes. Background in probability theory. Bayes rule. Classification with a Gaussian density and a single feature. Linear boundaries in feature space. Briefly: training and testing a classifier. INF 4300
3 Scatter plots. A 2D scatter plot is a plot of feature values for two different features. Each object's feature values are plotted in the position given by the feature values, and with a class label telling its object class. A scatter plot visualizes the space spanned by 2 or more features, called the feature space. Matlab: gscatter(feature1, feature2, labelvector). Classification is done based on more than two features, but this is difficult to visualize. Features with good class separation show clusters for each class, but different clusters should ideally be separated. (Figure: scatter plot with Feature 2: major axis length against Feature 1: minor axis length.)
4 Discriminating between classes in a scatter plot
5 Introduction to classification. One of the most challenging topics in image analysis is recognizing a specific object in an image. To do that, several steps are normally used. Some kind of segmentation can delimit the spatial extent of the foreground objects, then a set of features to describe the object characteristics are computed, before the recognition step that decides which object type this is. The focus of the next lectures is the recognition step. The starting point is that we have a set of K features computed for an object. These features can either describe the shape or the gray level/color/texture properties of the object. Based on these features we need to decide what kind of an object this is. Statistical classification is the process of estimating the probability that an object belongs to one of S object classes based on the observed values of K features. Classification can be done both unsupervised and supervised. In unsupervised classification the categories or classes are not known, and the classification process will be based on grouping similar objects together. In supervised classification the categories or classes are known, and the classification will consist of estimating the probability that the object belongs to each of the S object classes; the object is assigned to the class with the highest probability. For supervised classification, training data is needed. Training data consists of a set of objects with known class type, and they are used to estimate the parameters of the classifier. Classification can be pixel-based or region-based. The performance of a classifier is normally computed as the accuracy it gets when classifying a different set of objects with known class labels, called the test set.
6 Concepts in classification. In the following three lectures we will cover these topics related to classification: Training set. Validation set. Test set. Classifier accuracy/confusion matrices. Computing the probability that an object belongs to a class: let each class be represented by a probability density function. In general many probability densities can be used, but we use the multivariate normal distribution, which is commonly used. Bayes rule. Discriminant functions/decision boundaries. Normal distribution, mean vector and covariance matrices. kNN classification. Unsupervised classification/clustering.
7 Training data set. A set of known objects from all classes for the application. Typically a high number of samples from each class. Used to determine the parameters of the given classification algorithm. Example 1: fit a Gaussian density to the data. Example 2: find a Support Vector Machine that fits the data. Example 3: find the weights in a neural net.
8 Validation data set. A set of known objects from all classes for the application. Typically a sufficiently high number of samples from each class. Used to optimize hyperparameters, e.g. compare feature selection algorithms, compare classifiers, compare network architectures.
9 Test data set. A set of known objects from all classes for the application. Typically a high number of samples from each class. Used ONCE to estimate the accuracy of a trained model. Should never optimize any parameters on the test data set.
10 From INF 310: Thresholding (Gonzalez and Woods 10.3). Basic thresholding assigns all pixels in the image to one of 2 classes, foreground or background: g(x,y) = 1 if f(x,y) > T, and g(x,y) = 0 if f(x,y) <= T. This can be seen as a 2-class classification problem based on a single feature, the gray level. The classes are background and foreground, and the threshold T defines the border between them.
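The rule above can be sketched in a few lines of pure Python; the image values and the threshold T below are illustrative assumptions, not data from the lecture:

```python
# Minimal sketch of thresholding as 2-class classification.
# The gray levels and the threshold are made-up illustrative values.
f = [[ 10,  40, 200, 220],
     [ 30,  20, 180, 240],
     [ 50,  60, 210, 190],
     [ 15,  25, 170, 230]]

T = 128  # threshold between background (<= T) and foreground (> T)

# g(x, y) = 1 if f(x, y) > T, else 0
g = [[1 if v > T else 0 for v in row] for row in f]

n_foreground = sum(sum(row) for row in g)
```

Every pixel is assigned to one of the two classes based on a single feature, the gray level.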
11 Classification error for thresholding. (Figure: the background and foreground probability distributions plotted against gray level, with threshold t. In the region below t under the foreground distribution, foreground pixels are misclassified as background; in the region above t under the background distribution, background pixels are misclassified as foreground.)
12 Classification error for thresholding. We assume that b(z) is the normalized histogram for background and f(z) is the normalized histogram for foreground. The histograms are estimates of the probability distributions of the gray levels in the image. Let F and B be the prior probabilities for background and foreground (B + F = 1). The normalized histogram for the image is then given by p(z) = B b(z) + F f(z). The probabilities of misclassification given a threshold t are: E_F(t) = ∫_{-∞}^{t} f(z) dz and E_B(t) = ∫_{t}^{∞} b(z) dz.
13 Find T that minimizes the error. E(t) = F ∫_{-∞}^{t} f(z) dz + B ∫_{t}^{∞} b(z) dz. Setting dE(t)/dt = 0 gives F f(T) = B b(T). Minimum error is achieved by setting T equal to the point where the (prior-weighted) probabilities for foreground and background are equal. The locations in feature space where the probabilities are equal are called the decision boundary in classification. The goal of classification is to find these boundaries.
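As a numeric sketch of this condition, assume Gaussian b(z) and f(z) (the means, standard deviations and priors below are made up) and scan for the gray level where B b(T) = F f(T):

```python
import math

def gauss(z, mu, sigma):
    # Univariate Gaussian density
    return math.exp(-(z - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

# Assumed class parameters: dark background, bright foreground, equal priors.
mu_b, sigma_b, B = 80.0, 15.0, 0.5
mu_f, sigma_f, F = 160.0, 15.0, 0.5

# Scan the interval between the means for the point where the weighted
# densities cross; that crossing is the minimum-error threshold T.
best_T, best_diff = None, float("inf")
steps = 8000
for k in range(steps + 1):
    z = mu_b + (mu_f - mu_b) * k / steps
    diff = abs(B * gauss(z, mu_b, sigma_b) - F * gauss(z, mu_f, sigma_f))
    if diff < best_diff:
        best_T, best_diff = z, diff
```

With equal priors and equal variances, as assumed here, the crossing lies midway between the two means.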
14 Partitioning the feature space using thresholding: 1 feature and 1 threshold. Thresholding one feature gives a linear boundary separating the feature space.
15 Partitioning the feature space using thresholding: 2 features and 2 thresholds. Thresholding two features independently gives rectangular boxes in feature space.
16 Can we find a line with better separation? We cannot isolate all classes using straight lines. This problem is not linearly separable.
17 The goal of classification: partitioning the feature space with smooth boundaries
18 Distributions, standard deviation and variance. A univariate (one feature) Gaussian distribution (normal distribution) is specified given the mean value μ and the variance σ²: p(z) = (1/(√(2π) σ)) exp(−(z − μ)²/(2σ²)). σ² is the variance, and σ is the standard deviation.
19 Two Gaussian distributions for thresholding a single feature. Assume that b(z) and f(z) are Gaussian distributions; then p(z) = (B/(√(2π) σ_B)) exp(−(z − μ_B)²/(2σ_B²)) + (F/(√(2π) σ_F)) exp(−(z − μ_F)²/(2σ_F²)). μ_B and μ_F are the mean values for background and foreground, and σ_B² and σ_F² are the variances for background and foreground.
20 The 2-class classification problem summarized. Given two Gaussian distributions b(z) and f(z). The classes have prior probabilities F and B. Every pixel should be assigned to the class that minimizes the classification error. The classification error is minimized at the point where F f(z) = B b(z). What we will do now is to generalize to K classes and D features.
21 The goal of classification. We estimate the decision boundaries (equivalent to the threshold) for multivariate data based on training data. Classification performance is always estimated on a separate test data set. We try to measure the generalization performance: the classifier should perform well when classifying new samples, i.e. have the lowest possible classification error. We often face a tradeoff between classification error on the training set and generalization ability when determining the complexity of the decision boundary.
22 Probability theory. Let x be a discrete random variable that can assume any of a finite number of M different values. The probability that x belongs to class i is p_i = Pr(x = i), i = 1,...,M. A probability distribution must sum to 1 and probabilities must be non-negative, so p_i ≥ 0 and Σ_{i=1}^{M} p_i = 1.
23 Expected values - definition. The expected value or mean of a random variable x is: E[x] = μ = Σ_{i=1}^{M} x_i p_i. The variance or second order moment is: Var[x] = σ² = E[(x − μ)²] = Σ_{i=1}^{M} (x_i − μ)² p_i. These will be estimated from training data where we know the true class labels (maximum likelihood estimates): μ_k = (1/N_k) Σ_{i=1}^{N_k} x_i and σ_k² = (1/N_k) Σ_{i=1}^{N_k} (x_i − μ_k)².
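The maximum likelihood estimates can be computed directly; a small sketch with made-up samples for one class:

```python
# Illustrative training samples for one class (true labels known).
x = [4.0, 6.0, 5.0, 7.0, 3.0]
N = len(x)

# Maximum likelihood estimates:
# mu  = (1/N) * sum(x_i)            (sample mean)
# var = (1/N) * sum((x_i - mu)^2)   (sample variance, ML form with 1/N)
mu = sum(x) / N
var = sum((xi - mu)**2 for xi in x) / N
```

Note the 1/N factor: the ML estimate of the variance divides by N, not by N − 1.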
24 Pairs of random variables - definitions. Let x and y be two random variables. The joint probability of observing a pair of values (x = i, y = j) is p_ij. Alternatively we can define a joint probability distribution function P(x, y), for which P(x, y) ≥ 0 and Σ_x Σ_y P(x, y) = 1. The marginal distributions for x and y (if we want to eliminate one of them) are: P(x) = Σ_y P(x, y) and P(y) = Σ_x P(x, y).
25 Statistical independence - definitions. Variables x and y are statistically independent if and only if P(x, y) = P(x) P(y). In words: two variables are independent if the occurrence of one does not affect the other. If two variables are not independent, they are dependent. If two variables are independent, they are also uncorrelated. For more than two variables: all pairs must be independent. Two variables are uncorrelated if Cov[x, y] = 0. Since Cov[x, y] = E[xy] − E[x]E[y], uncorrelated means E[xy] = E[x]E[y]. If two variables are uncorrelated, they can still be dependent.
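The last point deserves a concrete example. Let x take the values -1, 0, 1 with equal probability and set y = x²: y is a deterministic function of x (clearly dependent), yet the covariance is exactly zero. A small sketch:

```python
# x in {-1, 0, 1} with probability 1/3 each; y = x^2.
xs = [-1.0, 0.0, 1.0]
p = 1.0 / 3.0

E_x  = sum(p * x for x in xs)          # 0
E_y  = sum(p * x**2 for x in xs)       # 2/3
E_xy = sum(p * x * x**2 for x in xs)   # E[x^3] = 0
cov  = E_xy - E_x * E_y                # 0 -> uncorrelated

# Dependence: P(x=1, y=1) = 1/3, but P(x=1) * P(y=1) = (1/3) * (2/3) = 2/9,
# so P(x, y) != P(x) P(y) and the variables are NOT independent.
P_joint = 1.0 / 3.0
P_product = (1.0 / 3.0) * (2.0 / 3.0)
```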
26 Expected values of two variables. E[x] = μ_x = Σ_x Σ_y x P(x, y), E[y] = μ_y = Σ_x Σ_y y P(x, y), E[xy] = Σ_x Σ_y xy P(x, y), and in general E[f(x, y)] = Σ_x Σ_y f(x, y) P(x, y). The variances and covariance follow the same pattern: σ_x² = Σ_x Σ_y (x − μ_x)² P(x, y), σ_y² = Σ_x Σ_y (y − μ_y)² P(x, y), and σ_xy = Σ_x Σ_y (x − μ_x)(y − μ_y) P(x, y). Where in this course have you seen similar formulas?
27 (Figure: a 2D scatter plot with axes Feature 1 and Feature 2.) Can you sketch approximately μ₁, μ₂, σ₁, σ₂? Which is largest, σ₁ or σ₂?
28 Conditional probability. If two variables are statistically dependent, knowing the value of one of them lets us get a better estimate of the value of the other one. We need to consider their covariance. The conditional probability of x given y is defined: Pr(x = i | y = j) = Pr(x = i, y = j) / Pr(y = j), and for distributions: P(x | y) = P(x, y) / P(y). Example: draw two cards from a deck. Drawing a king in the first draw has probability 4/52. Drawing a king in the second draw, given that the first draw gave a king, has probability 3/51.
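The card example can be checked with exact fractions using the standard library:

```python
from fractions import Fraction

# Two draws without replacement from a standard 52-card deck.
p_first_king = Fraction(4, 52)           # 4 kings among 52 cards
p_second_given_first = Fraction(3, 51)   # 3 kings left among 51 cards

# From the definition of conditional probability:
# P(king1 and king2) = P(king2 | king1) * P(king1)
p_both = p_second_given_first * p_first_king
```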
29 Bayesian decision theory. A fundamental statistical approach to pattern classification. Named after Thomas Bayes, an English priest and mathematician. It combines prior knowledge about the problem with a probability distribution function. The most central concept for us is Bayes decision rule.
30 Bayes rule for a classification problem. Suppose we have J classes ω_j, j = 1,...,J. ω_j is the class label for a pixel, and x is the observed gray level (or feature vector). We can use Bayes rule to find an expression for the class with the highest probability: P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x), where P(ω_j) is the prior probability, P(ω_j | x) is the posterior probability, p(x | ω_j) is the likelihood, and p(x) is a normalizing factor. For thresholding, P(ω_j) is the prior probability for background or foreground. If we don't have special knowledge that one of the classes occurs more frequently than the others, we set them equal for all classes: P(ω_j) = 1/J, j = 1,...,J. Small p means a probability distribution; capital P means a probability (a scalar value between 0 and 1).
31 Bayes rule explained. P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x). p(x | ω_j) is the probability density function that models the likelihood of observing gray level x if the pixel belongs to class ω_j. Typically we assume a type of distribution, e.g. Gaussian, and the mean and covariance of that distribution are fitted to some data that we know belongs to that class. This fitting is called classifier training. P(ω_j | x) is the posterior probability that the pixel actually belongs to class ω_j. We will soon see that the classifier that achieves the minimum error is a classifier that assigns each pixel to the class that has the highest posterior probability. p(x) is just a scaling factor that assures that the probabilities sum to 1.
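A small numeric sketch of Bayes rule with two classes; the Gaussian likelihood parameters and priors below are illustrative assumptions standing in for the result of training:

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

priors = [0.5, 0.5]                     # P(w_1), P(w_2)
params = [(80.0, 15.0), (160.0, 15.0)]  # assumed (mean, std) per class

x = 100.0  # observed gray level

# Bayes rule: P(w_j | x) = p(x | w_j) P(w_j) / p(x)
likelihoods = [gauss(x, mu, s) for mu, s in params]
p_x = sum(l * P for l, P in zip(likelihoods, priors))  # normalizing factor
posteriors = [l * P / p_x for l, P in zip(likelihoods, priors)]
```

Note that p(x) only scales the posteriors so they sum to 1; it does not change which class wins.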
32 Probability of error. If we have 2 classes, we make an error either if we decide ω₁ when the true class is ω₂, or if we decide ω₂ when the true class is ω₁. If P(ω₁ | x) > P(ω₂ | x) we have more belief that x belongs to ω₁, and we decide ω₁. The probability of error is then: P(error | x) = P(ω₁ | x) if we decide ω₂, and P(ω₂ | x) if we decide ω₁.
33 Back to classification error for thresholding. P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx. (Figure: background and foreground distributions; in the region on one side of the threshold, foreground pixels are misclassified as background, and on the other side, background pixels are misclassified as foreground.)
34 Minimizing the error. P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx. When we derived the optimal threshold (INF 310), we showed that the minimum error was achieved by placing the threshold (or decision boundary, as we will call it now) at the point where P(ω₁ | x) = P(ω₂ | x). This is still valid.
35 Bayes decision rule. In the 2-class case, our goal of minimizing the error implies a decision rule: decide ω₁ if P(ω₁ | x) > P(ω₂ | x); otherwise ω₂. For J classes, the rule analogously extends to choosing the class with maximum a posteriori probability. The decision boundary is the border between classes i and j, simply where P(ω_i | x) = P(ω_j | x). Exactly where the threshold was set in minimum error thresholding!
36 Bayes classification with J classes and D features. How do we generalize: to more than one feature at a time, to J classes, and to loss functions where some errors are more costly than others?
37 Bayes rule with J classes and d features. If we measure d features, x will be a d-dimensional feature vector. Let {ω₁,...,ω_J} be a set of J classes. The posterior probability for class ω_j is now computed as P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x), where p(x) = Σ_{j=1}^{J} p(x | ω_j) P(ω_j). Still, we assign a pixel with feature vector x to the class that has the highest posterior probability: decide ω₁ if P(ω₁ | x) > P(ω_j | x) for all j ≠ 1.
38 Feature space and decision regions. The minimum error criterion (or a different classifier criterion) partitions the feature space into disjoint regions R_i. The decision surface separates the regions in multidimensional space. For minimum error, the equation for the decision surface is P(ω_i | x) = P(ω_k | x).
39 Discriminant functions. The decision rule "decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i" can be written as: assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i. The classifier computes J discriminant functions g_i(x) and selects the class corresponding to the largest value of the discriminant function. Since classification consists of choosing the class that has the largest value, replacing the discriminant function g_i(x) by f(g_i(x)) will not affect the decision if f is a monotonically increasing function. This can lead to simplifications, as we will soon see.
40 Equivalent discriminant functions. The following choices of discriminant functions give equivalent decisions: g_i(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x); g_i(x) = p(x | ω_i) P(ω_i); g_i(x) = ln p(x | ω_i) + ln P(ω_i). The effect of the decision rules is to divide the feature space into c decision regions R₁,...,R_c. If g_i(x) > g_j(x) for all j ≠ i, then x is in region R_i. The regions are separated by decision boundaries: surfaces in feature space where the discriminant functions for two classes are equal.
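A quick sketch checking the equivalence: for assumed class densities, g_i(x) = p(x|ω_i)P(ω_i) and its logarithm pick the same class at every test point, since ln is monotonically increasing:

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

priors = [0.3, 0.7]
params = [(2.0, 1.0), (5.0, 2.0)]  # assumed (mean, std) per class

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def decide(x):
    g = [gauss(x, mu, s) * P for (mu, s), P in zip(params, priors)]
    log_g = [math.log(v) for v in g]  # monotone transform of g
    return argmax(g), argmax(log_g)

# Both discriminants give the same decision at every point.
decisions = [decide(x) for x in [0.0, 2.0, 3.5, 6.0]]
all_equal = all(a == b for a, b in decisions)
```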
41 The Gaussian density - univariate case (a single feature). To use a classifier we need to select a probability density function p(x | ω_i). The most commonly used probability density is the normal (Gaussian) distribution: p(x) = (1/(√(2π) σ)) exp(−(x − μ)²/(2σ²)), with expected value (or mean) μ = E[x] = ∫ x p(x) dx and variance σ² = ∫ (x − μ)² p(x) dx.
42 Example: image and training masks. The masks contain labels for the training data. If label = 1, then the pixel belongs to class 1 (red), and so on. If a pixel is not part of the training data, it will have label 0. A pixel belonging to class k will have value k in the mask image. The mask is often visualized in pseudo-colors on top of the input image, where each class is assigned a color. We should have a similar mask for the test data.
43 Training a univariate Gaussian classifier. To be able to compute the value of the discriminant function, we need to have an estimate of μ_k and σ_k for each class k. Assume that we know the true class labels for some pixels and that this is given in a mask image. The mask has N_k pixels for each class. Training the classifier then consists of computing μ_k and σ_k for all pixels with class label k in the mask file. They are computed from training data as: for all pixels x_i with label k in the training mask, μ_k = (1/N_k) Σ_{i=1}^{N_k} x_i and σ_k² = (1/N_k) Σ_{i=1}^{N_k} (x_i − μ_k)².
44 Training (pseudocode):
for i = 1:N
  for j = 1:M
    if mask(i,j) == k
      increment the number of samples in class k
      store the feature value f(i,j) in a vector of training samples from class k
    end
  end
end
For each class k = 1:K, compute mean(k) and sigma(k): μ_k = (1/N_k) Σ x_i and σ_k² = (1/N_k) Σ (x_i − μ_k)².
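The same training loop as a runnable Python sketch; the feature image, mask, and class count below are made up for illustration:

```python
# Illustrative 3x3 feature image and training mask (0 = not training data).
f = [[1.0, 2.0, 9.0],
     [3.0, 0.0, 8.0],
     [2.0, 7.0, 0.0]]
mask = [[1, 1, 2],
        [1, 0, 2],
        [1, 2, 0]]

K = 2
samples = {k: [] for k in range(1, K + 1)}
for i in range(len(f)):
    for j in range(len(f[0])):
        k = mask[i][j]
        if k > 0:
            samples[k].append(f[i][j])  # collect training pixels per class

mu = {}
sigma2 = {}
for k, xs in samples.items():
    N_k = len(xs)
    mu[k] = sum(xs) / N_k                              # (1/N_k) sum x_i
    sigma2[k] = sum((x - mu[k])**2 for x in xs) / N_k  # (1/N_k) sum (x_i - mu_k)^2
```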
45 How to do classification with a univariate Gaussian (1 feature). Decide on values for the prior probabilities P(ω_j). If we have no prior information, assume that all classes are equally probable and P(ω_j) = 1/J. Estimate μ_j and σ_j based on training data, using the formulae on the previous slide (training). For each pixel x in a new image: for class j = 1,...,J, compute the discriminant function p(x | ω_j) P(ω_j) = (1/(√(2π) σ_j)) exp(−(x − μ_j)²/(2σ_j²)) P(ω_j). Assign pixel x to the class C with the highest value of the discriminant function by setting label_image(x, y) = C. The result after classification is an image with class labels corresponding to the most probable class for each pixel. We compute the classification error rate from an independent test mask.
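Putting it together, a sketch that classifies each pixel of a small image with per-class univariate Gaussians; the trained parameters and priors are assumed values, and the log form of the discriminant is used for numerical convenience:

```python
import math

def discriminant(x, mu, sigma2, prior):
    # ln p(x | w_j) + ln P(w_j); the constant -0.5*ln(2*pi) is dropped
    # since it is the same for all classes.
    return -0.5 * math.log(sigma2) - (x - mu)**2 / (2 * sigma2) + math.log(prior)

# Assumed trained parameters for J = 2 classes, equal priors.
mus = [2.0, 8.0]
sigma2s = [1.0, 1.0]
priors = [0.5, 0.5]

f = [[1.0, 9.0],
     [3.0, 7.0]]

label_image = []
for row in f:
    labels = []
    for x in row:
        g = [discriminant(x, mus[j], sigma2s[j], priors[j]) for j in range(2)]
        labels.append(1 + g.index(max(g)))  # class labels 1..J
    label_image.append(labels)
```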
46 Estimating classification error. A simple measure of classification accuracy can be to count the percentage of correctly classified pixels, either overall (averaged over all classes) or per class. If a pixel has true class label k, it is correctly classified if the estimated label equals k. Normally we use different pixels to train and test a classifier, so we have a disjoint training mask and test mask. Estimate the classification error by classifying all pixels in the test set and counting the percentage of wrongly classified pixels.
47 Validating classifier performance. Classification performance is evaluated on a different set of samples with known class - the test set. The training set and the test set must be independent! Normally, the set of ground truth pixels with known class is partitioned into a set of training pixels and a set of test pixels of approximately the same size. If the classifier has hyperparameters, we use a separate validation set to find the best value of them. More on this in a later lecture.
48 Confusion matrices. A matrix with the true class labels versus the estimated class labels for each class: each row corresponds to a true class label (class 1, class 2, class 3), each column to an estimated class label, and the entry (i, j) counts the samples from true class i that were assigned to class j. A final column gives the total number of samples per true class.
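A sketch of building a confusion matrix from labeled test pixels; the label vectors below are made up for illustration:

```python
# Illustrative true and estimated labels for 10 test pixels, 3 classes.
true_labels = [1, 1, 1, 2, 2, 3, 3, 3, 3, 2]
est_labels  = [1, 1, 2, 2, 2, 3, 3, 1, 3, 2]

J = 3
confusion = [[0] * J for _ in range(J)]
for t, e in zip(true_labels, est_labels):
    confusion[t - 1][e - 1] += 1  # row = true class, column = estimated class

total = len(true_labels)
overall_accuracy = sum(confusion[i][i] for i in range(J)) / total
per_class_accuracy = [confusion[i][i] / sum(confusion[i]) for i in range(J)]
```

The diagonal holds the correctly classified samples; off-diagonal entries show which classes are confused with which.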
49 Confusion matrix - cont. Alternatives: report the number of correctly classified pixels for each class; report the percentage of correctly classified pixels for each class; report the percentage of correctly classified pixels in total. Why is the last one not a good measure if the number of test pixels from each class varies between classes?
50 Using more than one feature. The power of the computer lies in deciding based on more than 1 feature at a time. A simple trick to do this is to assume that features x_i and x_j are independent given the class; then p(x_i, x_j | c) = p(x_i | c) p(x_j | c). The joint density based on D independent features is then: p(x_1,...,x_D | ω_j) = Π_{d=1}^{D} p(x_d | ω_j), and the posterior becomes P(ω_j | x_1,...,x_D) ∝ P(ω_j) Π_{d=1}^{D} p(x_d | ω_j).
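Under the independence assumption the joint likelihood is just a product of per-feature densities; a minimal sketch with assumed per-feature Gaussians for one class:

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

# Assumed per-feature class-conditional parameters for one class c.
feature_params = [(2.0, 1.0), (10.0, 3.0)]  # (mean, std) for features 1 and 2

x = [2.5, 9.0]  # observed feature vector

# p(x1, ..., xD | c) = product over d of p(x_d | c)
p_joint = 1.0
for xd, (mu, s) in zip(x, feature_params):
    p_joint *= gauss(xd, mu, s)
```

This per-class joint likelihood is then weighted by the prior P(c) and compared across classes exactly as in the single-feature case.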
51 Upcoming lectures: Multivariate Gaussian. Classifier evaluation. Support vector machine classification. Feature selection/feature transforms. kNN classification. Unsupervised classification.
52 If time: we start with multivariate Gaussian classification.
Generative classifiers: The Gaussian classifier Ata Kaban School of Computer Science University of Birmingham Outline We have already seen how Bayes rule can be turned into a classifier In all our examples
More informationChapter 18 Quadratic Function 2
Chapter 18 Quadratic Function Completed Square Form 1 Consider this special set of numbers - the square numbers or the set of perfect squares. 4 = = 9 = 3 = 16 = 4 = 5 = 5 = Numbers like 5, 11, 15 are
More informationMACHINE LEARNING ADVANCED MACHINE LEARNING
MACHINE LEARNING ADVANCED MACHINE LEARNING Recap of Important Notions on Estimation of Probability Density Functions 2 2 MACHINE LEARNING Overview Definition pdf Definition joint, condition, marginal,
More informationf x, y x 2 y 2 2x 6y 14. Then
SECTION 11.7 MAXIMUM AND MINIMUM VALUES 645 absolute minimum FIGURE 1 local maimum local minimum absolute maimum Look at the hills and valles in the graph of f shown in Figure 1. There are two points a,
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationSTA 414/2104, Spring 2014, Practice Problem Set #1
STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,
More informationx y plane is the plane in which the stresses act, yy xy xy Figure 3.5.1: non-zero stress components acting in the x y plane
3.5 Plane Stress This section is concerned with a special two-dimensional state of stress called plane stress. It is important for two reasons: () it arises in real components (particularl in thin components
More informationPerturbation Theory for Variational Inference
Perturbation heor for Variational Inference Manfred Opper U Berlin Marco Fraccaro echnical Universit of Denmark Ulrich Paquet Apple Ale Susemihl U Berlin Ole Winther echnical Universit of Denmark Abstract
More informationSemi-Supervised Laplacian Regularization of Kernel Canonical Correlation Analysis
Semi-Supervised Laplacian Regularization of Kernel Canonical Correlation Analsis Matthew B. Blaschko, Christoph H. Lampert, & Arthur Gretton Ma Planck Institute for Biological Cbernetics Tübingen, German
More informationBayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory
Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology
More information6.867 Machine learning
6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More information5. Zeros. We deduce that the graph crosses the x-axis at the points x = 0, 1, 2 and 4, and nowhere else. And that s exactly what we see in the graph.
. Zeros Eample 1. At the right we have drawn the graph of the polnomial = ( 1) ( 2) ( 4). Argue that the form of the algebraic formula allows ou to see right awa where the graph is above the -ais, where
More informationBayes Decision Theory
Bayes Decision Theory Minimum-Error-Rate Classification Classifiers, Discriminant Functions and Decision Surfaces The Normal Density 0 Minimum-Error-Rate Classification Actions are decisions on classes
More informationSample Problems For Grade 9 Mathematics. Grade. 1. If x 3
Sample roblems For 9 Mathematics DIRECTIONS: This section provides sample mathematics problems for the 9 test forms. These problems are based on material included in the New York Cit curriculum for 8.
More informationLecture 1: Bayesian Framework Basics
Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of
More informationHandout for Adequacy of Solutions Chapter SET ONE The solution to Make a small change in the right hand side vector of the equations
Handout for dequac of Solutions Chapter 04.07 SET ONE The solution to 7.999 4 3.999 Make a small change in the right hand side vector of the equations 7.998 4.00 3.999 4.000 3.999 Make a small change in
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 1
CS434a/541a: Pattern Recognition Prof. Olga Veksler Lecture 1 1 Outline of the lecture Syllabus Introduction to Pattern Recognition Review of Probability/Statistics 2 Syllabus Prerequisite Analysis of
More informationNotes on Discriminant Functions and Optimal Classification
Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem
More informationGenerative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul
Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far
More informationBayes Decision Theory - I
Bayes Decision Theory - I Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Statistical Learning from Data Goal: Given a relationship between a feature vector and a vector y, and iid data samples ( i,y i ), find
More informationMachine Learning 4771
Machine Learning 4771 Instructor: Tony Jebara Topic 7 Unsupervised Learning Statistical Perspective Probability Models Discrete & Continuous: Gaussian, Bernoulli, Multinomial Maimum Likelihood Logistic
More informationMinimum Error Rate Classification
Minimum Error Rate Classification Dr. K.Vijayarekha Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur-613 401 Table of Contents 1.Minimum Error Rate Classification...
More informationD u f f x h f y k. Applying this theorem a second time, we have. f xx h f yx k h f xy h f yy k k. f xx h 2 2 f xy hk f yy k 2
93 CHAPTER 4 PARTIAL DERIVATIVES We close this section b giving a proof of the first part of the Second Derivatives Test. Part (b) has a similar proof. PROOF OF THEOREM 3, PART (A) We compute the second-order
More informationV-0. Review of Probability
V-0. Review of Probabilit Random Variable! Definition Numerical characterization of outcome of a random event!eamles Number on a rolled die or dice Temerature at secified time of da 3 Stock Market at close
More informationComputer Vision & Digital Image Processing
Computer Vision & Digital Image Processing Image Segmentation Dr. D. J. Jackson Lecture 6- Image segmentation Segmentation divides an image into its constituent parts or objects Level of subdivision depends
More informationFor questions 5-8, solve each inequality and graph the solution set. You must show work for full credit. (2 pts each)
Alg Midterm Review Practice Level 1 C 1. Find the opposite and the reciprocal of 0. a. 0, 1 b. 0, 1 0 0 c. 0, 1 0 d. 0, 1 0 For questions -, insert , or = to make the sentence true. (1pt each) A. 5
More informationSection 5.1: Functions
Objective: Identif functions and use correct notation to evaluate functions at numerical and variable values. A relationship is a matching of elements between two sets with the first set called the domain
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationL20: MLPs, RBFs and SPR Bayes discriminants and MLPs The role of MLP hidden units Bayes discriminants and RBFs Comparison between MLPs and RBFs
L0: MLPs, RBFs and SPR Bayes discriminants and MLPs The role of MLP hidden units Bayes discriminants and RBFs Comparison between MLPs and RBFs CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU
More informationIN Pratical guidelines for classification Evaluation Feature selection Principal component transform Anne Solberg
IN 5520 30.10.18 Pratical guidelines for classification Evaluation Feature selection Principal component transform Anne Solberg (anne@ifi.uio.no) 30.10.18 IN 5520 1 Literature Practical guidelines of classification
More information3.2 Understanding Relations and Functions-NOTES
Name Class Date. Understanding Relations and Functions-NOTES Essential Question: How do ou represent relations and functions? Eplore A1.1.A decide whether relations represented verball, tabularl, graphicall,
More informationCurve Fitting Re-visited, Bishop1.2.5
Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More informationThe Bayes classifier
The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal
More information3.7 InveRSe FUnCTIOnS
CHAPTER functions learning ObjeCTIveS In this section, ou will: Verif inverse functions. Determine the domain and range of an inverse function, and restrict the domain of a function to make it one-to-one.
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More information1.6 ELECTRONIC STRUCTURE OF THE HYDROGEN ATOM
1.6 ELECTRONIC STRUCTURE OF THE HYDROGEN ATOM 23 How does this wave-particle dualit require us to alter our thinking about the electron? In our everda lives, we re accustomed to a deterministic world.
More informationAlgebra II Notes Unit Six: Polynomials Syllabus Objectives: 6.2 The student will simplify polynomial expressions.
Algebra II Notes Unit Si: Polnomials Sllabus Objectives: 6. The student will simplif polnomial epressions. Review: Properties of Eponents (Allow students to come up with these on their own.) Let a and
More informationHST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007
MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationMachine Learning for Signal Processing Bayes Classification
Machine Learning for Signal Processing Bayes Classification Class 16. 24 Oct 2017 Instructor: Bhiksha Raj - Abelino Jimenez 11755/18797 1 Recap: KNN A very effective and simple way of performing classification
More information2. This problem is very easy if you remember the exponential cdf; i.e., 0, y 0 F Y (y) = = ln(0.5) = ln = ln 2 = φ 0.5 = β ln 2.
FALL 8 FINAL EXAM SOLUTIONS. Use the basic counting rule. There are n = number of was to choose 5 white balls from 7 = n = number of was to choose ellow ball from 5 = ( ) 7 5 ( ) 5. there are n n = ( )(
More informationSF2935: MODERN METHODS OF STATISTICAL LECTURE 3 SUPERVISED CLASSIFICATION, LINEAR DISCRIMINANT ANALYSIS LEARNING. Tatjana Pavlenko.
SF2935: MODERN METHODS OF STATISTICAL LEARNING LECTURE 3 SUPERVISED CLASSIFICATION, LINEAR DISCRIMINANT ANALYSIS Tatjana Pavlenko 5 November 2015 SUPERVISED LEARNING (REP.) Starting point: we have an outcome
More informationSummary of Random Variable Concepts March 17, 2000
Summar of Random Variable Concepts March 17, 2000 This is a list of important concepts we have covered, rather than a review that devives or eplains them. Tpes of random variables discrete A random variable
More informationAn Information Theory For Preferences
An Information Theor For Preferences Ali E. Abbas Department of Management Science and Engineering, Stanford Universit, Stanford, Ca, 94305 Abstract. Recent literature in the last Maimum Entrop workshop
More information