Discriminant analysis and supervised classification
Angela Montanari

1 Linear discriminant analysis

Linear discriminant analysis (LDA), also known as Fisher's linear discriminant analysis or canonical variate analysis, is a widely used method aimed at finding linear combinations of observed features which best characterize or separate two or more classes of objects or events. The resulting combinations are commonly used for dimensionality reduction and discrimination, before later classification. LDA is also closely related to principal component analysis (PCA), in that both look for linear combinations of variables which best explain the data. However, LDA explicitly attempts to model the difference between the classes of data, while PCA does not take class membership into account.

Here are a few examples of possible applications of the method:

Identification. To identify the type of customer likely to buy a certain product in a store. Using simple questionnaire surveys we can obtain the features of customers; discriminant analysis then helps us select the features that best describe membership in the "buy" or "not buy" group.

Decision making. A doctor diagnosing an illness may be seen as assessing which disease the patient has. We can transform this into a classification problem by assigning the patient to one of a number of possible groups of diseases on the basis of the observed symptoms.

Prediction. The question "Will it rain today?" can be thought of as a prediction problem: assigning today to one of the two possible groups, rain and dry.
Pattern recognition. Distinguishing pedestrians from dogs and cars in a captured image sequence of traffic data is a classification problem.

Learning. Teaching a robot to talk can be seen as a classification problem: it assigns frequency, pitch, tune, and many other measurements of sound to many groups of words.

The method was first introduced by R.A. Fisher (1936).

Discrimination

Let us assume we have $G$ groups of units (each composed of $n_g$ units, for $g = 1, \dots, G$, such that $\sum_{g=1}^G n_g = n$) on which a vector random variable $x$ (corresponding to $p$ observed numeric variables) has been observed. We also assume that, for the $G$ populations the groups come from, the homoscedasticity condition holds, i.e. $\Sigma_1 = \Sigma_2 = \dots = \Sigma_g = \dots = \Sigma_G = \Sigma$.

Fisher suggested looking for the linear combination $y$ of the variables in $x$, $y = a^T x$, which best separates the groups. This amounts to looking for the vector $a$ such that, when the units are projected along it, the groups are as separated as possible and, at the same time, as internally homogeneous as possible. In this framework the function to be optimized with respect to $a$ is the ratio of the between-group to the within-group variance of the linear combination $y$.

In the $x$ space the overall average is $\bar{x}$, while each group has average vector $\bar{x}_g$ and covariance matrix $S_g$; because of the properties of the arithmetic mean,

$$\bar{x} = \frac{1}{n} \sum_{g=1}^G n_g \bar{x}_g \qquad (1)$$

The variable $y$ will therefore have overall average $\bar{y} = a^T \bar{x}$ and, for each group, average value $\bar{y}_g = a^T \bar{x}_g$ and variance $\mathrm{Var}(y_g) = a^T S_g a$.

$$\mathrm{Var}(y)_{within} = \frac{1}{n-G} \sum_{g=1}^G (n_g - 1)\,\mathrm{Var}(y_g) = \frac{1}{n-G} \sum_{g=1}^G (n_g - 1)\, a^T S_g a = a^T \left\{ \frac{1}{n-G} \sum_{g=1}^G (n_g - 1) S_g \right\} a = a^T W a$$
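The within-group variance computation above can be sketched numerically. This is a minimal illustration on synthetic data; the variable names are mine, not from the text:

```python
# Pooled within-group covariance W = (1/(n-G)) * sum_g (n_g - 1) S_g,
# and the within-group variance of y = a^T x as a^T W a.
import numpy as np

rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=(n_g, 2))
          for m, n_g in zip([0.0, 2.0, 4.0], [30, 40, 50])]
n = sum(len(g) for g in groups)
G = len(groups)

# np.cov with ddof = 1 (the default) gives each group's S_g
W = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups) / (n - G)

# For any direction a, the within-group variance of y = a^T x is a^T W a
a = np.array([1.0, -0.5])
var_within = a @ W @ a
```

The same quantity can of course be obtained by projecting each group first and pooling the univariate variances, which is what the derivation in the text does.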
where $W = \frac{1}{n-G} \sum_{g=1}^G (n_g - 1) S_g$ is the within-group covariance matrix (also known as the within-group scatter matrix) in the observed variable space (it is meaningful because of the homoscedasticity assumption). The $n - G$ degrees of freedom derive from the fact that the external sum involves $G$ terms, each of which has $n_g - 1$ degrees of freedom.

$$\mathrm{Var}(y)_{between} = \frac{1}{G-1} \sum_{g=1}^G n_g (\bar{y}_g - \bar{y})^2 = \frac{1}{G-1} \sum_{g=1}^G n_g (a^T \bar{x}_g - a^T \bar{x})^2 = a^T \left\{ \frac{1}{G-1} \sum_{g=1}^G n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T \right\} a = a^T B a$$

where $B = \frac{1}{G-1} \sum_{g=1}^G n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T$ is the between-group covariance matrix (also known as the between-group scatter matrix) in the observed variable space. Its rank is at most $G - 1$.

In the simple two-group case the between variance has one degree of freedom only, and $B$ has the simple expression $B = \sum_{g=1}^2 n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T$. After writing $\bar{x}$ as in equation (1) and after a little algebra it becomes

$$B = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T$$

This clearly shows that in the two-group case the between-group covariance matrix has rank equal to 1.

In general, the function we need to optimize with respect to $a$ is therefore

$$\phi = \frac{\mathrm{Var}(y)_{between}}{\mathrm{Var}(y)_{within}} = \frac{a^T B a}{a^T W a} \qquad (2)$$

(notice that it coincides with the F statistic of the analysis of variance). In order to find the vector $a$ for which $\phi$ is maximum we differentiate with respect to $a$ and set the derivatives to 0:

$$\frac{\partial \phi}{\partial a} = 2 \left\{ \frac{Ba(a^T W a) - Wa(a^T B a)}{(a^T W a)^2} \right\} = 2 \left\{ \frac{Ba(a^T W a)}{(a^T W a)^2} - \frac{Wa(a^T B a)}{(a^T W a)^2} \right\} = 0$$
Remembering equation (2), this becomes

$$\frac{Ba}{a^T W a} - \frac{Wa\,\phi}{a^T W a} = 0$$

and then

$$Ba - Wa\phi = 0 \qquad (3)$$

or equivalently

$$(B - \phi W) a = 0$$

After pre-multiplying both sides by $W^{-1}$ (under the assumption that it is non-singular) we obtain

$$(W^{-1}B - \phi I) a = 0$$

This is a homogeneous linear equation system which admits a non-trivial solution if and only if $\det(W^{-1}B - \phi I) = 0$; this means that $\phi$ is an eigenvalue of $W^{-1}B$ and $a$ is the corresponding eigenvector. As $\phi$ is the function we want to maximize, we choose the largest eigenvalue and the corresponding eigenvector as the best discriminant direction.

We can find up to $G - 1$ different discriminant directions, as the rank of $B$ (and hence of $W^{-1}B$) is at most $G - 1$. Each of them has decreasing discriminatory power. Together they span a subspace containing the between-group variability. These vectors are primarily used in feature reduction and can be interpreted in the same way as principal components.

As $W^{-1}B$ is non-symmetric its eigenvectors are not necessarily orthogonal, but it can be proved that the linear combinations $y$ obtained through them (also known as canonical variates) are uncorrelated. Let us go back to equation (3) and consider two generic eigenvalue-eigenvector couples, say $\{\phi_i, a_i\}$ and $\{\phi_j, a_j\}$. For them the following equalities hold:

$$Ba_i = Wa_i \phi_i$$
$$Ba_j = Wa_j \phi_j$$

After pre-multiplying the first one by $a_j^T$ and the second one by $a_i^T$ we obtain

$$a_j^T B a_i = a_j^T W a_i \phi_i$$
$$a_i^T B a_j = a_i^T W a_j \phi_j$$
But $a_j^T B a_i = a_i^T B a_j$ because they are both scalars, and hence also

$$a_j^T W a_i \phi_i = a_i^T W a_j \phi_j$$

Again $a_j^T W a_i = a_i^T W a_j$ as both are scalars; therefore, unless $\phi_i = \phi_j$ (a rare event in practice), the above equality only holds if

$$a_j^T W a_i = a_i^T W a_j = 0$$

i.e. if the discriminant variables are uncorrelated. In matrix form: $A^T W A = \mathrm{diag}$. Many software packages (R included, see tutorial) scale $A$ so that $A^T W A = I$: the discriminant variables are uncorrelated and have unit within-group variance. They are often said to be sphered.

As, from equation (3), $BA = WA\Phi$, where $\Phi$ is the diagonal matrix of the eigenvalues, after pre-multiplying it by $A^T$, in the within-group sphered data case we obtain

$$A^T B A = A^T W A \Phi = \Phi$$

i.e. the discriminant variables are uncorrelated between groups too, and have decreasing between-group variance. Being uncorrelated both within and between groups, the discriminant variables are also uncorrelated with respect to the whole set of units.

Fisher proved that in the two-group case the only discriminant direction $a$ can be obtained as $a = W^{-1}(\bar{x}_1 - \bar{x}_2)$. The corresponding linear combination will therefore be $y = a^T x = (\bar{x}_1 - \bar{x}_2)^T W^{-1} x$.

Classification

Even if it has been derived for discrimination purposes, Fisher's linear function can also be used to address classification issues, i.e. to define a rule for assigning a unit, whose group membership is unknown, to one out of the $G$ known groups. The method is general, but for teaching purposes we will limit our attention to the two-group case only.

Let us denote by $\bar{y}_1$ the projection of the group 1 average on $a$, $\bar{y}_1 = a^T \bar{x}_1 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_1$, and by $\bar{y}_2 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_2$ the projection of the group 2 average on $a$. Let us also assume, without loss of generality, that $\bar{y}_1 > \bar{y}_2$.
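Fisher's two-group result can be checked numerically: the leading eigenvector of $W^{-1}B$ is proportional to $W^{-1}(\bar{x}_1 - \bar{x}_2)$. A minimal sketch on synthetic data follows; the variable names are mine:

```python
# Two groups: compare the leading eigenvector of W^{-1}B with the closed-form
# Fisher direction a = W^{-1}(xbar_1 - xbar_2).
import numpy as np

rng = np.random.default_rng(1)
g1 = rng.normal([0., 0., 0.], 1.0, size=(40, 3))
g2 = rng.normal([2., 1., 0.], 1.0, size=(50, 3))
n, G = len(g1) + len(g2), 2
m1, m2 = g1.mean(0), g2.mean(0)
xbar = (len(g1) * m1 + len(g2) * m2) / n

W = ((len(g1) - 1) * np.cov(g1, rowvar=False)
     + (len(g2) - 1) * np.cov(g2, rowvar=False)) / (n - G)
B = sum(len(g) * np.outer(g.mean(0) - xbar, g.mean(0) - xbar)
        for g in (g1, g2)) / (G - 1)

# Leading eigenvector of W^{-1}B (computed via a linear solve, not an explicit inverse)
phi, vecs = np.linalg.eig(np.linalg.solve(W, B))
a_eig = vecs.real[:, np.argmax(phi.real)]

# Fisher's closed form, rescaled to unit norm for comparison
a_fisher = np.linalg.solve(W, m1 - m2)
a_fisher /= np.linalg.norm(a_fisher)
```

Since $B$ has rank 1 here, only one eigenvalue of $W^{-1}B$ is non-zero, and the two directions agree up to sign.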
Being $x_0$ the new unit we want to classify and $y_0 = a^T x_0 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0$ its projection on $a$, a natural allocation rule consists in assigning $x_0$ to the group whose average it is closest to along $a$: i.e. assign $x_0$ to group 1 if $|y_0 - \bar{y}_1| < |y_0 - \bar{y}_2|$ and to group 2 if the opposite inequality holds. This amounts to assigning $x_0$ to group 1 if $y_0 > \frac{\bar{y}_1 + \bar{y}_2}{2}$.
Due to the results previously obtained, this defines the following rule: assign $x_0$ to group 1 if

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} \left\{ (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_1 + (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_2 \right\}$$

or equivalently

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

This is known as the linear classification rule, as it is a linear function of the observed vector variable $x$. This allocation rule is very popular as it can be obtained by addressing the classification problem from a variety of different perspectives. We will see a couple of them in the following.

2 Classification based on Mahalanobis distance

One common way to measure the distance between points in a $p$-dimensional space is provided by the Euclidean distance. However, the Euclidean distance attaches equal weight to all the axes of the representation; if the observed variables have different variances, and if they are correlated, this might not be a desirable feature. In Fig. 1 a bivariate point cloud is represented. The variables have different variances and are correlated. The points A and B have the same Euclidean distance from the center of the point cloud (i.e. from the average vector), but while B lies in the core of the distribution, A is in a low-frequency region.

Figure 1: Euclidean distance
Mahalanobis distance provides a way to take variances and correlations into account when computing distances, i.e. to take into account the shape of the point cloud. Given a point whose coordinate vector is $x$, its Mahalanobis distance from the center $\bar{x}$ of the point cloud is defined as

$$d_M(x, \bar{x}) = \{(x - \bar{x})^T S^{-1} (x - \bar{x})\}^{1/2}$$

where, as usual, $S$ is the sample covariance matrix. Mahalanobis distance is thus a weighted Euclidean distance. In Fig. 2 the same point cloud and two points having the same Mahalanobis distance are represented.

Figure 2: Mahalanobis distance

Mahalanobis distance turns out to be really useful for classification purposes too. Assuming homoscedasticity, and estimating the unknown common covariance matrix by the within-group covariance matrix $W$, the squared Mahalanobis distance between a unit $x_0$ and the group 1 center can be defined as

$$d_M^2(x_0, \bar{x}_1) = (x_0 - \bar{x}_1)^T W^{-1} (x_0 - \bar{x}_1)$$

and similarly for group 2. If $W$ is the identity matrix, i.e. if, within groups, the variables are uncorrelated and have unit variance, then the Mahalanobis distance becomes the ordinary Euclidean one.

The allocation rule consisting in assigning a unit to the group whose average it is closest to will assign unit $x_0$ to group 1 if

$$(x_0 - \bar{x}_1)^T W^{-1} (x_0 - \bar{x}_1) < (x_0 - \bar{x}_2)^T W^{-1} (x_0 - \bar{x}_2)$$

and to group 2 if the opposite inequality holds. After a few algebraic passages the classification rule: assign $x_0$ to group 1 if

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$
can easily be obtained. This is exactly Fisher's linear classification rule.

The method can be extended to the classification of units into more than two groups. In the $G$-group case we allocate an object to the group for which (a) the Mahalanobis distance between the object and the class mean, computed on the original variables, is smallest, or (b) the Euclidean distance between the object and the class mean, computed on the canonical variates, is smallest.

Figure 3: Three group case

3 Classification based on probability models

So far we have addressed classification issues in a purely distribution-free context. Now we will assume that we know the shape of the probability density functions (pdfs) that have generated our data in the $G$ groups. We will consider the $G = 2$ case only.

Let $x$ be the $p$-dimensional vector of the observed variables and $x_0$ a new unit whose group membership is unknown. Let $\Pi_1$ and $\Pi_2$ denote the two parent populations. The key assumption is that $x$ has a different pdf in $\Pi_1$ and $\Pi_2$: denote the pdf of $x$ in $\Pi_1$ by $f_1(x)$ and the pdf of $x$ in $\Pi_2$ by $f_2(x)$.

Let $R$ be the set of all possible values $x$ can assume. As $f_1(x)$ and $f_2(x)$ usually overlap, each point of $R$ can belong to both $\Pi_1$ and $\Pi_2$, but with different probability. The goal is to partition $R$ into two exhaustive, non-overlapping regions $R_1$ and $R_2$ ($R_1 \cup R_2 = R$ and $R_1 \cap R_2 = \emptyset$) such that the probability of a wrong classification is minimum when a unit falling in $R_1$ is allocated to $\Pi_1$ and a unit falling in $R_2$ is allocated to $\Pi_2$.
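The equivalence just shown can be verified numerically: allocating by smallest squared Mahalanobis distance coincides with Fisher's linear rule. A minimal sketch on synthetic two-group data; all names are mine:

```python
# Mahalanobis-based allocation versus Fisher's linear classification rule.
import numpy as np

rng = np.random.default_rng(2)
g1 = rng.normal([0., 0.], 1.0, size=(50, 2))
g2 = rng.normal([3., 1.], 1.0, size=(60, 2))
W = ((len(g1) - 1) * np.cov(g1, rowvar=False)
     + (len(g2) - 1) * np.cov(g2, rowvar=False)) / (len(g1) + len(g2) - 2)
Winv = np.linalg.inv(W)
m1, m2 = g1.mean(0), g2.mean(0)

def d2_mahalanobis(x, m):
    """Squared Mahalanobis distance of x from group centre m, using W."""
    return (x - m) @ Winv @ (x - m)

def classify_mahalanobis(x):
    return 1 if d2_mahalanobis(x, m1) < d2_mahalanobis(x, m2) else 2

def classify_fisher(x):
    # group 1 iff (m1-m2)^T W^{-1} x > (1/2)(m1-m2)^T W^{-1}(m1+m2)
    return 1 if (m1 - m2) @ Winv @ x > 0.5 * (m1 - m2) @ Winv @ (m1 + m2) else 2
```

Expanding the two quadratic forms and cancelling the common $x_0^T W^{-1} x_0$ term shows the two rules are algebraically identical, so they agree on every point.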
Given $x_0$, a very intuitive rule consists in allocating it to $\Pi_1$ if the probability that it comes from $\Pi_1$ is larger than the probability that it comes from $\Pi_2$, and to $\Pi_2$ if the opposite holds. According to this criterion, $R_1$ is the set of the $x$ values such that $f_1(x) > f_2(x)$ and $R_2$ is the set of the $x$ values such that $f_1(x) < f_2(x)$. The ensuing classification rule is therefore:

allocate $x_0$ to $\Pi_1$ if $\frac{f_1(x_0)}{f_2(x_0)} > 1$;

allocate $x_0$ to $\Pi_2$ if $\frac{f_1(x_0)}{f_2(x_0)} < 1$;

allocate $x_0$ randomly to one of the two populations if equality holds.

This classification rule is known as the likelihood ratio rule. However intuitively reasonable, this rule neglects possibly different prior probabilities of class membership and possibly different misclassification costs.

Let us denote by $\pi_1$ the prior probability that $x_0$ belongs to $\Pi_1$ and by $\pi_2$ the prior probability that $x_0$ belongs to $\Pi_2$ ($\pi_1 + \pi_2 = 1$). Based on likelihoods only, the probability that a unit belonging to $\Pi_1$ is wrongly classified to $\Pi_2$ (this happens when it falls in $R_2$) is

$$\int_{R_2} f_1(x)\,dx$$

If we consider prior probabilities too, the probability of a wrong classification of a unit to $\Pi_2$ when it effectively comes from $\Pi_1$ is

$$p(2|1) = \pi_1 \int_{R_2} f_1(x)\,dx$$

and the probability of a wrong classification of a unit to $\Pi_1$ when it effectively comes from $\Pi_2$ is

$$p(1|2) = \pi_2 \int_{R_1} f_2(x)\,dx$$

The total probability of a wrong classification is therefore

$$\mathrm{prob} = p(2|1) + p(1|2)$$
$R_1$ and $R_2$ should therefore be chosen in such a way that $\mathrm{prob}$ is minimum. $\mathrm{prob}$ can be written as

$$\mathrm{prob} = \pi_1 \int_{R_2} f_1(x)\,dx + \pi_2 \int_{R_1} f_2(x)\,dx \qquad (4)$$

Since $R$ is the complete space and pdfs integrate to 1 over their domain,

$$\int_R f_1(x)\,dx = \int_R f_2(x)\,dx = 1$$

and, as $R_1 \cup R_2 = R$ and $R_1 \cap R_2 = \emptyset$, we have

$$\int_R f_1(x)\,dx = \int_{R_1} f_1(x)\,dx + \int_{R_2} f_1(x)\,dx = 1$$

and hence

$$\int_{R_2} f_1(x)\,dx = 1 - \int_{R_1} f_1(x)\,dx$$

After replacing it in equation (4) we obtain

$$\mathrm{prob} = \pi_1 \left[ 1 - \int_{R_1} f_1(x)\,dx \right] + \pi_2 \int_{R_1} f_2(x)\,dx = \pi_1 - \pi_1 \int_{R_1} f_1(x)\,dx + \pi_2 \int_{R_1} f_2(x)\,dx = \pi_1 + \int_{R_1} \left[ \pi_2 f_2(x) - \pi_1 f_1(x) \right] dx$$

As $\pi_1$ is a constant, $\mathrm{prob}$ will be minimum when the integral is minimum, i.e. when the integrand is negative. This means that $R_1$ should be chosen so that, for the points belonging to it,

$$\pi_2 f_2(x) - \pi_1 f_1(x) < 0$$

that is

$$\pi_1 f_1(x) > \pi_2 f_2(x)$$

The ensuing allocation rule will then be:

allocate $x_0$ to $\Pi_1$ if $\frac{f_1(x_0)}{f_2(x_0)} > \frac{\pi_2}{\pi_1}$;
allocate $x_0$ to $\Pi_2$ if $\frac{f_1(x_0)}{f_2(x_0)} < \frac{\pi_2}{\pi_1}$;

allocate $x_0$ randomly to one of the two populations if equality holds.

In the equal-prior case ($\pi_1 = \pi_2 = 1/2$) the likelihood ratio rule is obtained.

It is worth adding that the classification rule obtained by minimizing the total probability of a wrong classification is equivalent to the one that would be obtained by maximizing the posterior probability of population membership. That is the reason why it is often called the optimal Bayesian rule. Following the same steps we have gone through before, in the case of unequal misclassification costs the rule minimizing the total misclassification cost can be obtained.

Gaussian populations

Let us assume that both $f_1(x)$ and $f_2(x)$ are multivariate normal densities:

$$f_1(x) = (2\pi)^{-p/2} |\Sigma_1|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) \right\}$$

$$f_2(x) = (2\pi)^{-p/2} |\Sigma_2|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right\}$$

The likelihood ratio is therefore

$$\frac{f_1(x)}{f_2(x)} = \frac{|\Sigma_1|^{-1/2}}{|\Sigma_2|^{-1/2}} \exp\left\{ -\tfrac{1}{2} \left[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \right\}$$

$$= \frac{|\Sigma_1|^{-1/2}}{|\Sigma_2|^{-1/2}} \exp\left\{ -\tfrac{1}{2} \left[ x^T \Sigma_1^{-1} x + \mu_1^T \Sigma_1^{-1} \mu_1 - 2 x^T \Sigma_1^{-1} \mu_1 - x^T \Sigma_2^{-1} x - \mu_2^T \Sigma_2^{-1} \mu_2 + 2 x^T \Sigma_2^{-1} \mu_2 \right] \right\}$$

$$= \frac{|\Sigma_1|^{-1/2}}{|\Sigma_2|^{-1/2}} \exp\left\{ -\tfrac{1}{2} \left[ x^T (\Sigma_1^{-1} - \Sigma_2^{-1}) x - 2 x^T (\Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2) + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right] \right\}$$

The expression can be simplified by considering the logarithm of the likelihood ratio. In this way the so-called quadratic discriminant function is obtained:

$$Q(x) = \ln \frac{f_1(x)}{f_2(x)} = \frac{1}{2} \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \left[ x^T (\Sigma_1^{-1} - \Sigma_2^{-1}) x - 2 x^T (\Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2) + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right]$$
The expression clearly shows that it is a quadratic function of $x$. The ensuing classification rule suggests allocating $x_0$ to $\Pi_1$ if $Q(x_0) > \ln(H)$, where $H$ is either 1, if equal priors are assumed, or $\pi_2/\pi_1$ in the unequal-prior case.

When $\Sigma_1 = \Sigma_2 = \Sigma$ the likelihood ratio becomes

$$f_1(x)/f_2(x) = \exp\left\{ (\mu_1 - \mu_2)^T \Sigma^{-1} x - \tfrac{1}{2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) \right\} = \exp\left\{ (\mu_1 - \mu_2)^T \Sigma^{-1} \left[ x - \tfrac{1}{2} (\mu_1 + \mu_2) \right] \right\}$$

After taking logarithms we obtain

$$L(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} \left[ x - \tfrac{1}{2} (\mu_1 + \mu_2) \right]$$

This is known as the linear discriminant rule, as it is a linear function of $x$. The ensuing classification rule suggests allocating $x_0$ to $\Pi_1$ if $L(x_0) > \ln(H)$, where $H$ is either 1, if equal priors are assumed, or $\pi_2/\pi_1$ in the unequal-prior case.

In empirical applications, maximum likelihood estimates of the model parameters are plugged into the classification rules: $\mu_1$ and $\mu_2$ are estimated by $\bar{x}_1$ and $\bar{x}_2$. Furthermore, in the heteroscedastic case $\Sigma_1$ and $\Sigma_2$ are replaced by $S_1$ and $S_2$ respectively, while in the homoscedastic case the common $\Sigma$ is replaced by the within-group covariance matrix $W$. The empirical linear discriminant rule, in the equal-prior case, becomes

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} \left[ x_0 - \tfrac{1}{2} (\bar{x}_1 + \bar{x}_2) \right] > 0$$

or equivalently

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \tfrac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

In it the allocation rule obtained according to Fisher's approach can easily be recognized. This means that, for Gaussian populations and equal priors, besides optimizing group separation, Fisher's rule also minimizes the probability of a wrong classification.
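The empirical linear discriminant rule with priors can be sketched as follows. This is an illustration on synthetic Gaussian data; the variable names and the particular priors are assumptions of the example, not from the text:

```python
# Plug-in linear discriminant rule: allocate to Pi_1 iff L(x) > ln(pi2/pi1).
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2 = np.array([0., 0.]), np.array([2.5, 0.])
g1 = rng.normal(mu1, 1.0, size=(200, 2))
g2 = rng.normal(mu2, 1.0, size=(100, 2))
pi1, pi2 = 2 / 3, 1 / 3          # priors chosen proportional to group sizes here

m1, m2 = g1.mean(0), g2.mean(0)
W = ((len(g1) - 1) * np.cov(g1, rowvar=False)
     + (len(g2) - 1) * np.cov(g2, rowvar=False)) / (len(g1) + len(g2) - 2)
Winv = np.linalg.inv(W)

def L(x):
    """Empirical linear discriminant function with plug-in estimates."""
    return (m1 - m2) @ Winv @ (x - 0.5 * (m1 + m2))

def allocate(x):
    # threshold ln(pi2/pi1); it reduces to 0 in the equal-prior case
    return 1 if L(x) > np.log(pi2 / pi1) else 2
```

With $\pi_1 > \pi_2$ the threshold $\ln(\pi_2/\pi_1)$ is negative, so the cut-off moves toward group 2 and the midpoint between the two averages is allocated to the more probable group 1.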
4 Assessment of classification rules

After a classification rule has been devised, a measure of its performance must be obtained. In other words, an answer must be provided to the question: how good is the classification rule at correctly predicting class membership?

The most obvious measure of performance is an estimate of the proportion of individuals that will be misallocated by the rule, and the simplest way to obtain it is by applying the allocation rule to the very sample units that have been used to derive it. The proportion of misallocated individuals then provides an estimate of the error rate. The method is called the resubstitution method (because the sample units are used to derive the rule and then resubstituted into it in order to evaluate its performance) and the corresponding error rate is known as the apparent error rate. Although a very simple procedure, the resubstitution method produces biased error estimates: it gives an optimistic measure of the performance of the rule, which is expected to perform worse on units that have not contributed to its derivation.

A safer strategy is to split the original sample into two subgroups: the training sample (containing 70-75% of the units), on which the classification rule is constructed, and the test sample, on which the classification rule is evaluated. The error rate is estimated by the proportion of misallocated units in the test set.

In case the sample is so small that splitting it in two parts might produce rules based on unstable model parameters, resampling techniques have been suggested. The most common one is k-fold cross-validation: split the sample into k equal-sized sub-samples, build the rule using the data of k-1 sub-samples as the training set and the remaining sub-sample as the test set. Repeat the procedure excluding one sub-sample at a time, and obtain the error rate estimate by averaging the error rates obtained in the various steps.
A famous variant is the so-called leave-one-out (or jackknife) estimator, obtained by excluding one observation at a time from the training set and using it as a test set.
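The apparent (resubstitution) and leave-one-out error estimators described above can be sketched as follows. Synthetic data; the helper `fit_and_classify` is my own illustrative function, not from the text:

```python
# Apparent error rate versus leave-one-out error rate for Fisher's linear rule.
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0., 0.], 1.0, size=(40, 2)),
               rng.normal([2., 1.], 1.0, size=(40, 2))])
y = np.array([1] * 40 + [2] * 40)

def fit_and_classify(Xtr, ytr, x):
    """Fit Fisher's linear rule on (Xtr, ytr) and classify the point x."""
    g1, g2 = Xtr[ytr == 1], Xtr[ytr == 2]
    m1, m2 = g1.mean(0), g2.mean(0)
    W = ((len(g1) - 1) * np.cov(g1, rowvar=False)
         + (len(g2) - 1) * np.cov(g2, rowvar=False)) / (len(g1) + len(g2) - 2)
    a = np.linalg.solve(W, m1 - m2)
    return 1 if a @ x > 0.5 * a @ (m1 + m2) else 2

# Apparent error rate: resubstitute the training units into the rule
apparent = np.mean([fit_and_classify(X, y, x) != t for x, t in zip(X, y)])

# Leave-one-out: hold each unit out in turn, refit, classify the held-out unit
mask = np.ones(len(X), dtype=bool)
errs = []
for i in range(len(X)):
    mask[i] = False
    errs.append(fit_and_classify(X[mask], y[mask], X[i]) != y[i])
    mask[i] = True
loo = np.mean(errs)
```

On data like these the apparent error rate is typically the more optimistic of the two, which is exactly the bias the resubstitution method is criticized for.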
More informationMetric-based classifiers. Nuno Vasconcelos UCSD
Metric-based classifiers Nuno Vasconcelos UCSD Statistical learning goal: given a function f. y f and a collection of eample data-points, learn what the function f. is. this is called training. two major
More informationGenerative classifiers: The Gaussian classifier. Ata Kaban School of Computer Science University of Birmingham
Generative classifiers: The Gaussian classifier Ata Kaban School of Computer Science University of Birmingham Outline We have already seen how Bayes rule can be turned into a classifier In all our examples
More informationENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition
Memorial University of Newfoundland Pattern Recognition Lecture 6 May 18, 2006 http://www.engr.mun.ca/~charlesr Office Hours: Tuesdays & Thursdays 8:30-9:30 PM EN-3026 Review Distance-based Classification
More informationBayesian Decision and Bayesian Learning
Bayesian Decision and Bayesian Learning Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 30 Bayes Rule p(x ω i
More informationTable of Contents. Multivariate methods. Introduction II. Introduction I
Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation
More informationDimension Reduction (PCA, ICA, CCA, FLD,
Dimension Reduction (PCA, ICA, CCA, FLD, Topic Models) Yi Zhang 10-701, Machine Learning, Spring 2011 April 6 th, 2011 Parts of the PCA slides are from previous 10-701 lectures 1 Outline Dimension reduction
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationIN Pratical guidelines for classification Evaluation Feature selection Principal component transform Anne Solberg
IN 5520 30.10.18 Pratical guidelines for classification Evaluation Feature selection Principal component transform Anne Solberg (anne@ifi.uio.no) 30.10.18 IN 5520 1 Literature Practical guidelines of classification
More informationComputation. For QDA we need to calculate: Lets first consider the case that
Computation For QDA we need to calculate: δ (x) = 1 2 log( Σ ) 1 2 (x µ ) Σ 1 (x µ ) + log(π ) Lets first consider the case that Σ = I,. This is the case where each distribution is spherical, around the
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More informationLinear & Non-Linear Discriminant Analysis! Hugh R. Wilson
Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson PCA Review! Supervised learning! Fisher linear discriminant analysis! Nonlinear discriminant analysis! Research example! Multiple Classes! Unsupervised
More informationPrincipal Component Analysis -- PCA (also called Karhunen-Loeve transformation)
Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) PCA transforms the original input space into a lower dimensional space, by constructing dimensions that are linear combinations
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationRevision: Chapter 1-6. Applied Multivariate Statistics Spring 2012
Revision: Chapter 1-6 Applied Multivariate Statistics Spring 2012 Overview Cov, Cor, Mahalanobis, MV normal distribution Visualization: Stars plot, mosaic plot with shading Outlier: chisq.plot Missing
More informationInformatics 2B: Learning and Data Lecture 10 Discriminant functions 2. Minimal misclassifications. Decision Boundaries
Overview Gaussians estimated from training data Guido Sanguinetti Informatics B Learning and Data Lecture 1 9 March 1 Today s lecture Posterior probabilities, decision regions and minimising the probability
More informationMachine Learning 11. week
Machine Learning 11. week Feature Extraction-Selection Dimension reduction PCA LDA 1 Feature Extraction Any problem can be solved by machine learning methods in case of that the system must be appropriately
More informationL5: Quadratic classifiers
L5: Quadratic classifiers Bayes classifiers for Normally distributed classes Case 1: Σ i = σ 2 I Case 2: Σ i = Σ (Σ diagonal) Case 3: Σ i = Σ (Σ non-diagonal) Case 4: Σ i = σ 2 i I Case 5: Σ i Σ j (general
More informationMinimum Error Rate Classification
Minimum Error Rate Classification Dr. K.Vijayarekha Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur-613 401 Table of Contents 1.Minimum Error Rate Classification...
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More informationAn introduction to multivariate data
An introduction to multivariate data Angela Montanari 1 The data matrix The starting point of any analysis of multivariate data is a data matrix, i.e. a collection of n observations on a set of p characters
More informationLinear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Linear Discriminant Analysis Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki Principal
More informationVectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1. x 2. x =
Linear Algebra Review Vectors To begin, let us describe an element of the state space as a point with numerical coordinates, that is x 1 x x = 2. x n Vectors of up to three dimensions are easy to diagram.
More informationClassification: Linear Discriminant Analysis
Classification: Linear Discriminant Analysis Discriminant analysis uses sample information about individuals that are known to belong to one of several populations for the purposes of classification. Based
More informationCPSC 340: Machine Learning and Data Mining. More PCA Fall 2017
CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).
More informationDIMENSION REDUCTION AND CLUSTER ANALYSIS
DIMENSION REDUCTION AND CLUSTER ANALYSIS EECS 833, 6 March 2006 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and resources available at http://people.ku.edu/~gbohling/eecs833
More informationPrincipal component analysis (PCA) for clustering gene expression data
Principal component analysis (PCA) for clustering gene expression data Ka Yee Yeung Walter L. Ruzzo Bioinformatics, v17 #9 (2001) pp 763-774 1 Outline of talk Background and motivation Design of our empirical
More informationData Preprocessing. Cluster Similarity
1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M
More informationExercises * on Principal Component Analysis
Exercises * on Principal Component Analysis Laurenz Wiskott Institut für Neuroinformatik Ruhr-Universität Bochum, Germany, EU 4 February 207 Contents Intuition 3. Problem statement..........................................
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationDimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas
Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx
More informationEigenvalues, Eigenvectors, and an Intro to PCA
Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Changing Basis We ve talked so far about re-writing our data using a new set of variables, or a new basis.
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More informationEigenvalues, Eigenvectors, and an Intro to PCA
Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Changing Basis We ve talked so far about re-writing our data using a new set of variables, or a new basis.
More informationGENOMIC SIGNAL PROCESSING. Lecture 2. Classification of disease subtype based on microarray data
GENOMIC SIGNAL PROCESSING Lecture 2 Classification of disease subtype based on microarray data 1. Analysis of microarray data (see last 15 slides of Lecture 1) 2. Classification methods for microarray
More information1 Using standard errors when comparing estimated values
MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail
More informationClassification Methods II: Linear and Quadratic Discrimminant Analysis
Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear
More informationOptimization Problems
Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that
More informationRobustness of the Quadratic Discriminant Function to correlated and uncorrelated normal training samples
DOI 10.1186/s40064-016-1718-3 RESEARCH Open Access Robustness of the Quadratic Discriminant Function to correlated and uncorrelated normal training samples Atinuke Adebanji 1,2, Michael Asamoah Boaheng
More informationSystem 1 (last lecture) : limited to rigidly structured shapes. System 2 : recognition of a class of varying shapes. Need to:
System 2 : Modelling & Recognising Modelling and Recognising Classes of Classes of Shapes Shape : PDM & PCA All the same shape? System 1 (last lecture) : limited to rigidly structured shapes System 2 :
More information4 Bias-Variance for Ridge Regression (24 points)
Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
More informationChemometrics: Classification of spectra
Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture
More informationIntroduction to Supervised Learning. Performance Evaluation
Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationFinal Exam, Linear Algebra, Fall, 2003, W. Stephen Wilson
Final Exam, Linear Algebra, Fall, 2003, W. Stephen Wilson Name: TA Name and section: NO CALCULATORS, SHOW ALL WORK, NO OTHER PAPERS ON DESK. There is very little actual work to be done on this exam if
More informationStructure in Data. A major objective in data analysis is to identify interesting features or structure in the data.
Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationEigenvalues, Eigenvectors, and an Intro to PCA
Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Changing Basis We ve talked so far about re-writing our data using a new set of variables, or a new basis.
More information