Discriminant analysis and supervised classification

Angela Montanari

1 Linear discriminant analysis

Linear discriminant analysis (LDA), also known as Fisher's linear discriminant analysis or as canonical variate analysis, is a widely used method aimed at finding the linear combinations of observed features which best characterize or separate two or more classes of objects or events. The resulting combinations are commonly used for dimensionality reduction and discrimination, prior to a subsequent classification step. LDA is closely related to principal component analysis (PCA) in that both look for linear combinations of variables which best explain the data; however, LDA explicitly attempts to model the difference between the classes, while PCA does not take class membership into account.

Here are a few examples of possible applications of the method:

Identification. To identify the type of customer that is likely to buy a certain product in a store. Using simple questionnaire surveys we can collect the features of customers, and discriminant analysis will help us select the features that best describe membership in the "buy" or "not buy" group.

Decision making. A doctor diagnosing an illness may be seen as assessing which disease the patient has. We can turn this into a classification problem by assigning the patient to one of a number of possible groups of diseases based on the observed symptoms.

Prediction. The question "Will it rain today?" can be thought of as a prediction problem, namely assigning today to one of the two possible groups, rain and dry.

Pattern recognition. Distinguishing pedestrians from dogs and cars in a captured image sequence of traffic data is a classification problem.

Learning. Scientists' attempts to teach a robot to learn to talk can be seen as a classification problem: it assigns frequency, pitch, tune and many other measurements of sound to many groups of words.

The method was first introduced by R.A. Fisher in 1936.

Discrimination

Let's assume we have $G$ groups of units (each composed of $n_g$ units, for $g = 1, \dots, G$, such that $\sum_{g=1}^G n_g = n$) on which a vector random variable $x$ (corresponding to $p$ observed numeric variables) has been observed. We also assume that, for the $G$ populations the groups come from, the homoscedasticity condition holds, i.e. $\Sigma_1 = \Sigma_2 = \dots = \Sigma_g = \dots = \Sigma_G = \Sigma$.

Fisher suggested looking for the linear combination $y$ of the variables in $x$, $y = a^T x$, which best separates the groups. This amounts to looking for the vector $a$ such that, when the units are projected along it, the groups are at the same time as separated as possible and as internally homogeneous as possible. In this framework the function to be optimized with respect to $a$ is the ratio of the between-group to the within-group variance of the linear combination $y$.

In the $x$ space the overall mean vector is $\bar{x}$, while each group has mean vector $\bar{x}_g$ and covariance matrix $S_g$; because of the properties of the arithmetic mean,

$$\bar{x} = \frac{1}{n} \sum_{g=1}^G n_g \bar{x}_g \qquad (1)$$

The variable $y$ will therefore have overall mean $\bar{y} = a^T \bar{x}$ and, within each group, mean $\bar{y}_g = a^T \bar{x}_g$ and variance $\mathrm{Var}(y_g) = a^T S_g a$. The within-group variance of $y$ is

$$\mathrm{Var}(y)_{within} = \frac{1}{n-G} \sum_{g=1}^G (n_g - 1)\,\mathrm{Var}(y_g) = \frac{1}{n-G} \sum_{g=1}^G (n_g - 1)\, a^T S_g a = a^T \Big\{ \frac{1}{n-G} \sum_{g=1}^G (n_g - 1) S_g \Big\} a = a^T W a$$

where

$$W = \frac{1}{n-G} \sum_{g=1}^G (n_g - 1) S_g$$

is the within-group covariance matrix (also known as the within-group scatter matrix) in the observed variable space (it is meaningful because of the homoscedasticity assumption). The $n - G$ degrees of freedom derive from the fact that the outer sum involves $G$ terms, each of which has $n_g - 1$ degrees of freedom.

The between-group variance of $y$ is

$$\mathrm{Var}(y)_{between} = \frac{1}{G-1} \sum_{g=1}^G n_g (\bar{y}_g - \bar{y})^2 = \frac{1}{G-1} \sum_{g=1}^G n_g (a^T \bar{x}_g - a^T \bar{x})^2 = a^T \Big\{ \frac{1}{G-1} \sum_{g=1}^G n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T \Big\} a = a^T B a$$

where $B = \frac{1}{G-1} \sum_{g=1}^G n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T$ is the between-group covariance matrix (also known as the between-group scatter matrix) in the observed variable space. Its rank is at most $G - 1$.

In the simple two-group case the between-group variance has one degree of freedom only, and $B$ has the simple expression $B = \sum_{g=1}^2 n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^T$. After writing $\bar{x}$ as in equation (1) and a little algebra it becomes

$$B = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T$$

This clearly shows that in the two-group case the between-group covariance matrix has rank equal to 1.

In general, the function we need to optimize with respect to $a$ is therefore

$$\phi = \frac{\mathrm{Var}(y)_{between}}{\mathrm{Var}(y)_{within}} = \frac{a^T B a}{a^T W a} \qquad (2)$$

(notice that it coincides with the F statistic of the analysis of variance).
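To make these quantities concrete, here is a minimal numpy sketch (not part of the original notes; the function name scatter_matrices and the variables X and labels are illustrative) computing $W$, $B$ and the criterion $\phi(a)$ of equation (2) from a data matrix and a vector of group labels.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (W, B): within- and between-group covariance matrices.

    W = 1/(n-G) * sum_g (n_g - 1) S_g
    B = 1/(G-1) * sum_g n_g (xbar_g - xbar)(xbar_g - xbar)^T
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, p = X.shape
    groups = np.unique(labels)
    G = len(groups)
    xbar = X.mean(axis=0)
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in groups:
        Xg = X[labels == g]
        ng = Xg.shape[0]
        if ng > 1:                                  # singleton groups add nothing to W
            W += (ng - 1) * np.cov(Xg, rowvar=False)   # (n_g - 1) S_g
        d = (Xg.mean(axis=0) - xbar).reshape(-1, 1)
        B += ng * (d @ d.T)                         # n_g (xbar_g - xbar)(xbar_g - xbar)^T
    return W / (n - G), B / (G - 1)


def phi(a, W, B):
    """Fisher's criterion (2): between- to within-group variance ratio of y = a^T x."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    return float((a.T @ B @ a) / (a.T @ W @ a))
```

In the two-group case one can verify numerically that the $B$ returned here equals $\frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T$, as derived above.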

In order to find the vector $a$ for which $\phi$ is maximum we differentiate it with respect to $a$ and set the derivative to zero:

$$\frac{\partial \phi}{\partial a} = 2\,\frac{Ba\,(a^T W a) - Wa\,(a^T B a)}{(a^T W a)^2} = 2\left\{ \frac{Ba\,(a^T W a)}{(a^T W a)^2} - \frac{Wa\,(a^T B a)}{(a^T W a)^2} \right\} = 0$$

Remembering equation (2), this becomes

$$\frac{Ba}{a^T W a} - \frac{Wa\,\phi}{a^T W a} = 0$$

and then

$$Ba - Wa\,\phi = 0 \qquad (3)$$

or equivalently

$$(B - \phi W) a = 0$$

After pre-multiplying both sides by $W^{-1}$ (under the assumption that it is non-singular) we obtain

$$(W^{-1} B - \phi I) a = 0$$

This is a homogeneous linear system which admits a non-trivial solution if and only if $\det(W^{-1} B - \phi I) = 0$; this means that $\phi$ is an eigenvalue of $W^{-1} B$ and $a$ is the corresponding eigenvector. As $\phi$ is the function we want to maximize, we choose the largest eigenvalue and the corresponding eigenvector as the best discriminant direction. We can find up to $G - 1$ different discriminant directions, as the rank of $B$ (and hence of $W^{-1} B$) is at most $G - 1$, each with decreasing discriminatory power. They span a subspace containing the between-group variability. These vectors are primarily used for feature reduction and can be interpreted in the same way as principal components.

As $W^{-1} B$ is not symmetric, its eigenvectors are not necessarily orthogonal, but it can be proved that the linear combinations $y$ obtained through them (also known as canonical variates) are uncorrelated. Let's go back to equation (3) and consider two generic eigenvalue-eigenvector pairs, say $\{\phi_i, a_i\}$ and $\{\phi_j, a_j\}$. For them the following equalities hold:

$$Ba_i = Wa_i \phi_i \qquad Ba_j = Wa_j \phi_j$$

After pre-multiplying the first one by $a_j^T$ and the second one by $a_i^T$ we obtain

$$a_j^T B a_i = a_j^T W a_i \phi_i \qquad a_i^T B a_j = a_i^T W a_j \phi_j$$

But $a_j^T B a_i = a_i^T B a_j$ because both are scalars, and hence also

$$a_j^T W a_i \phi_i = a_i^T W a_j \phi_j$$

Again $a_j^T W a_i = a_i^T W a_j$ as both are scalars; therefore, unless $\phi_i = \phi_j$ (a rare event in practice), the above equality only holds if

$$a_j^T W a_i = a_i^T W a_j = 0$$

i.e. if the discriminant variables are uncorrelated. In matrix form:

$$A^T W A = \mathrm{diag}$$

Many software packages (R included, see the tutorial) scale $A$ so that $A^T W A = I$: the discriminant variables are then uncorrelated and have unit within-group variance. They are often said to be sphered. As, from equation (3), $BA = WA\Phi$, where $\Phi$ is the diagonal matrix of the eigenvalues, after pre-multiplying by $A^T$, in the within-group sphered case we obtain

$$A^T B A = A^T W A \Phi = \Phi$$

i.e. the discriminant variables are uncorrelated between groups too, and have decreasing between-group variances. Being uncorrelated both within and between groups, the discriminant variables are also uncorrelated over the whole set of units.

Fisher proved that in the two-group case the only discriminant direction $a$ can be obtained as $a = W^{-1}(\bar{x}_1 - \bar{x}_2)$. The corresponding linear combination will therefore be $y = a^T x = (\bar{x}_1 - \bar{x}_2)^T W^{-1} x$.
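The canonical variates can be computed from the eigendecomposition of $W^{-1}B$. The following sketch (again with illustrative names, reusing the hypothetical scatter_matrices helper above; it is not the exact algorithm of any specific package such as R's lda) rescales the eigenvectors so that $A^T W A = I$, as discussed above.

```python
import numpy as np

def canonical_variates(X, labels):
    """Discriminant directions A (as columns), scaled so that A^T W A = I.

    The directions are the leading eigenvectors of W^{-1} B, sorted by
    decreasing eigenvalue; at most G - 1 of them are meaningful.
    """
    W, B = scatter_matrices(X, labels)            # helper sketched above
    eigval, eigvec = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigval.real)[::-1]         # largest phi first
    G = len(np.unique(labels))
    A = eigvec.real[:, order[:G - 1]]
    # rescale each column a so that a^T W a = 1 (within-group sphering)
    scale = np.sqrt(np.einsum('ij,jk,ki->i', A.T, W, A))
    A = A / scale
    return A, np.sort(eigval.real)[::-1][:G - 1]
```

One can check that A.T @ W @ A is, up to rounding, the identity matrix, that A.T @ B @ A is diagonal with decreasing entries, and that in the two-group case the single column of A is proportional to $W^{-1}(\bar{x}_1 - \bar{x}_2)$.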

Classification

Even though it has been derived for discrimination purposes, Fisher's linear function can also be used to address classification issues, i.e. to define a rule for assigning a unit, whose group membership is unknown, to one of the $G$ known groups. The method is general, but for teaching purposes we will limit our attention to the two-group case only.

Let's denote by $\bar{y}_1$ the projection of the group 1 mean on $a$, $\bar{y}_1 = a^T \bar{x}_1 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_1$, and by $\bar{y}_2 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_2$ the projection of the group 2 mean on $a$. Let's also assume, without loss of generality, that $\bar{y}_1 > \bar{y}_2$. Let $x_0$ be the new unit we want to classify and $y_0 = a^T x_0 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0$ its projection on $a$. A natural allocation rule consists in assigning $x_0$ to the group whose mean it is closest to along $a$: i.e. assign $x_0$ to group 1 if $|y_0 - \bar{y}_1| < |y_0 - \bar{y}_2|$ and to group 2 if the opposite inequality holds. This amounts to assigning $x_0$ to group 1 if $y_0 > \frac{\bar{y}_1 + \bar{y}_2}{2}$.

Due to the results previously obtained, this defines the following rule: assign $x_0$ to group 1 if

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2}\left\{ (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_1 + (\bar{x}_1 - \bar{x}_2)^T W^{-1} \bar{x}_2 \right\}$$

or equivalently

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

This is known as the linear classification rule, as it is a linear function of the observed vector variable $x$. This allocation rule is very popular as it can be obtained by addressing the classification problem from a variety of different perspectives. We will see a couple of them in the following.
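A minimal sketch of this two-group allocation rule (illustrative names, reusing the hypothetical scatter_matrices helper):

```python
import numpy as np

def fisher_two_group_rule(X, labels, x0):
    """Assign x0 to one of the two groups using Fisher's linear rule.

    Returns the label of the group whose projected mean is closest to
    a^T x0, where a = W^{-1} (xbar_1 - xbar_2).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    x0 = np.asarray(x0, dtype=float)
    g1, g2 = np.unique(labels)
    W, _ = scatter_matrices(X, labels)             # helper sketched above
    xbar1 = X[labels == g1].mean(axis=0)
    xbar2 = X[labels == g2].mean(axis=0)
    a = np.linalg.solve(W, xbar1 - xbar2)          # a = W^{-1}(xbar_1 - xbar_2)
    y0 = a @ x0
    threshold = 0.5 * a @ (xbar1 + xbar2)          # midpoint of the projected means
    return g1 if y0 > threshold else g2
```

Since $a^T \bar{x}_1 - a^T \bar{x}_2 = (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 - \bar{x}_2) > 0$, comparing $y_0$ with the midpoint of the projected means is equivalent to choosing the closer projected mean.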

2 Classification based on Mahalanobis distance

One common way to measure the distance between points in a $p$-dimensional space is the Euclidean distance. However, the Euclidean distance attaches equal weight to all the axes of the representation; if the observed variables have different variances, or if they are correlated, this may not be a desirable feature. In Fig. 1 a bivariate point cloud is represented. The variables have different variances and are correlated. The points A and B have the same Euclidean distance from the center of the point cloud (i.e. from the mean vector), but while B lies in the core of the distribution, A lies in a low-frequency region.

Figure 1: Euclidean distance

The Mahalanobis distance provides a way to take variances and correlations into account when computing distances, i.e. to take into account the shape of the point cloud. Given a point whose coordinate vector is $x$, its Mahalanobis distance from the center $\bar{x}$ of the point cloud is defined as

$$d_M(x, \bar{x}) = \left\{ (x - \bar{x})^T S^{-1} (x - \bar{x}) \right\}^{1/2}$$

where, as usual, $S$ is the sample covariance matrix. The Mahalanobis distance is thus a weighted Euclidean distance. In Fig. 2 the same point cloud is represented together with two points having the same Mahalanobis distance from its center.

Figure 2: Mahalanobis distance

The Mahalanobis distance turns out to be really useful for classification purposes too. Assuming homoscedasticity, and estimating the unknown common covariance matrix by the within-group covariance matrix $W$, the squared Mahalanobis distance between a unit $x_0$ and the group 1 center can be defined as $d_M^2(x_0, \bar{x}_1) = (x_0 - \bar{x}_1)^T W^{-1} (x_0 - \bar{x}_1)$, and analogously for group 2. If $W$ is the identity matrix, i.e. if, within groups, the variables are uncorrelated and have unit variance, then the Mahalanobis distance reduces to the ordinary Euclidean one.

The allocation rule consisting in assigning a unit to the group whose mean it is closest to will assign unit $x_0$ to group 1 if

$$(x_0 - \bar{x}_1)^T W^{-1} (x_0 - \bar{x}_1) < (x_0 - \bar{x}_2)^T W^{-1} (x_0 - \bar{x}_2)$$

and to group 2 if the opposite inequality holds.
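A sketch of Mahalanobis-based allocation, written for an arbitrary number of groups (illustrative names, reusing the hypothetical scatter_matrices helper as the estimate of the common covariance matrix):

```python
import numpy as np

def mahalanobis_classify(X, labels, x0):
    """Allocate x0 to the group whose mean has the smallest squared
    Mahalanobis distance (x0 - xbar_g)^T W^{-1} (x0 - xbar_g)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    x0 = np.asarray(x0, dtype=float)
    W, _ = scatter_matrices(X, labels)             # common covariance estimate
    W_inv = np.linalg.inv(W)
    best_group, best_d2 = None, np.inf
    for g in np.unique(labels):
        d = x0 - X[labels == g].mean(axis=0)
        d2 = d @ W_inv @ d                         # squared Mahalanobis distance
        if d2 < best_d2:
            best_group, best_d2 = g, d2
    return best_group
```

For two groups this reproduces Fisher's linear classification rule, as the notes show next.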

After a few passages one easily obtains the classification rule: assign $x_0$ to group 1 if

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \frac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

This is exactly Fisher's linear classification rule.

The method can be extended to the classification of units into more than two groups. In the $G$-group case we allocate an object to the group for which (a) the Mahalanobis distance between the object and the class mean is smallest, using the original variables, or, equivalently, (b) the Euclidean distance between the object and the class mean is smallest, in terms of the canonical variates.

Figure 3: Three group case

3 Classification based on probability models

So far we have addressed classification issues in a purely distribution-free context. Now we will assume that we know the shape of the probability density functions (pdf) that have generated our data in the $G$ groups. We will consider the $G = 2$ case only.

Let $x$ be the $p$-dimensional vector of the observed variables and $x_0$ a new unit whose group membership is unknown. Let $\Pi_1$ and $\Pi_2$ denote the two parent populations. The key assumption is that $x$ has a different pdf in $\Pi_1$ and in $\Pi_2$. Let's denote the pdf of $x$ in $\Pi_1$ by $f_1(x)$ and the pdf of $x$ in $\Pi_2$ by $f_2(x)$, and let $R$ be the set of all possible values $x$ can assume. As $f_1(x)$ and $f_2(x)$ usually overlap, each point of $R$ can belong to both $\Pi_1$ and $\Pi_2$, but with different probability. The goal is to partition $R$ into two exhaustive, non-overlapping regions $R_1$ and $R_2$ ($R_1 \cup R_2 = R$ and $R_1 \cap R_2 = \emptyset$) such that, when a unit falling in $R_1$ is allocated to $\Pi_1$ and a unit falling in $R_2$ is allocated to $\Pi_2$, the probability of a wrong classification is minimum.

Given $x_0$, a very intuitive rule consists in allocating it to $\Pi_1$ if the probability that it comes from $\Pi_1$ is larger than the probability that it comes from $\Pi_2$, and to $\Pi_2$ if the opposite holds. According to this criterion, $R_1$ is the set of $x$ values such that $f_1(x) > f_2(x)$ and $R_2$ is the set of $x$ values such that $f_1(x) < f_2(x)$. The ensuing classification rule is therefore:

allocate $x_0$ to $\Pi_1$ if $\frac{f_1(x_0)}{f_2(x_0)} > 1$;

allocate $x_0$ to $\Pi_2$ if $\frac{f_1(x_0)}{f_2(x_0)} < 1$;

allocate $x_0$ randomly to one of the two populations if equality holds.

This classification rule is known as the likelihood ratio rule. However intuitively reasonable, it neglects possibly different prior probabilities of class membership and possibly different misclassification costs.

Let's denote by $\pi_1$ the prior probability that $x_0$ belongs to $\Pi_1$ and by $\pi_2$ the prior probability that $x_0$ belongs to $\Pi_2$ ($\pi_1 + \pi_2 = 1$). Based on likelihoods only, the probability that a unit belonging to $\Pi_1$ is wrongly classified to $\Pi_2$ (this happens when it falls in $R_2$) is

$$\int_{R_2} f_1(x)\,dx$$

If we consider prior probabilities too, the probability of wrongly classifying to $\Pi_2$ a unit that actually comes from $\Pi_1$ is

$$p(2|1) = \pi_1 \int_{R_2} f_1(x)\,dx$$

and the probability of wrongly classifying to $\Pi_1$ a unit that actually comes from $\Pi_2$ is

$$p(1|2) = \pi_2 \int_{R_1} f_2(x)\,dx$$

The total probability of a wrong classification is therefore

$$\mathrm{prob} = p(2|1) + p(1|2)$$
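Before deriving the regions that minimize prob, a small numerical illustration may help fix ideas. The sketch below uses a made-up one-dimensional example (two normal populations and an arbitrary family of cut points $t$, with $R_1 = (-\infty, t)$ and $R_2 = [t, +\infty)$; all parameter values are purely illustrative) and computes $p(2|1)$, $p(1|2)$ and their sum with scipy.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-d example: Pi_1 = N(0, 1), Pi_2 = N(2, 1), priors 0.7 / 0.3.
pi1, pi2 = 0.7, 0.3
f1, f2 = norm(loc=0.0, scale=1.0), norm(loc=2.0, scale=1.0)

def total_error(t):
    p2_given_1 = pi1 * f1.sf(t)      # pi_1 * integral of f1 over R_2 = [t, +inf)
    p1_given_2 = pi2 * f2.cdf(t)     # pi_2 * integral of f2 over R_1 = (-inf, t)
    return p2_given_1 + p1_given_2

ts = np.linspace(-2.0, 4.0, 601)
best_t = ts[np.argmin([total_error(t) for t in ts])]
print(best_t, total_error(best_t))
```

Scanning $t$ and minimizing total_error numerically anticipates the general result derived next: the best boundary lies where $\pi_1 f_1(x) = \pi_2 f_2(x)$.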

$R_1$ and $R_2$ should therefore be chosen in such a way that prob is minimum. prob can be written as

$$\mathrm{prob} = \pi_1 \int_{R_2} f_1(x)\,dx + \pi_2 \int_{R_1} f_2(x)\,dx \qquad (4)$$

Since $R$ is the complete space and pdfs integrate to 1 over their domain,

$$\int_R f_1(x)\,dx = \int_R f_2(x)\,dx = 1$$

and, as $R_1 \cup R_2 = R$ and $R_1 \cap R_2 = \emptyset$, we have

$$\int_R f_1(x)\,dx = \int_{R_1} f_1(x)\,dx + \int_{R_2} f_1(x)\,dx = 1$$

and hence

$$\int_{R_2} f_1(x)\,dx = 1 - \int_{R_1} f_1(x)\,dx$$

After replacing this in equation (4) we obtain

$$\mathrm{prob} = \pi_1 \Big[ 1 - \int_{R_1} f_1(x)\,dx \Big] + \pi_2 \int_{R_1} f_2(x)\,dx = \pi_1 - \pi_1 \int_{R_1} f_1(x)\,dx + \pi_2 \int_{R_1} f_2(x)\,dx = \pi_1 + \int_{R_1} \big[ \pi_2 f_2(x) - \pi_1 f_1(x) \big]\,dx$$

As $\pi_1$ is a constant, prob will be minimum when the integral is minimum, i.e. when the integrand is negative over $R_1$. This means that $R_1$ should be chosen so that, for the points belonging to it,

$$\pi_2 f_2(x) - \pi_1 f_1(x) < 0$$

that is

$$\pi_1 f_1(x) > \pi_2 f_2(x)$$

The ensuing allocation rule will then be:

allocate $x_0$ to $\Pi_1$ if $\frac{f_1(x_0)}{f_2(x_0)} > \frac{\pi_2}{\pi_1}$;

allocate $x_0$ to $\Pi_2$ if $\frac{f_1(x_0)}{f_2(x_0)} < \frac{\pi_2}{\pi_1}$;

allocate $x_0$ randomly to one of the two populations if equality holds.

In the equal prior case ($\pi_1 = \pi_2 = 1/2$) the likelihood ratio rule is obtained. It is worth adding that the classification rule obtained by minimizing the total probability of a wrong classification is equivalent to the one that would be obtained by maximizing the posterior probability of population membership; that is the reason why it is often called the optimal Bayesian rule. Following the same steps we have gone through before, in the case of unequal misclassification costs the rule minimizing the total misclassification cost can be obtained.

Gaussian populations

Let's assume that both $f_1(x)$ and $f_2(x)$ are multivariate normal densities:

$$f_1(x) = (2\pi)^{-p/2} |\Sigma_1|^{-1/2} \exp\Big\{ -\tfrac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) \Big\}$$

$$f_2(x) = (2\pi)^{-p/2} |\Sigma_2|^{-1/2} \exp\Big\{ -\tfrac{1}{2} (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \Big\}$$

The likelihood ratio is therefore

$$\frac{f_1(x)}{f_2(x)} = \frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}} \exp\Big\{ -\tfrac{1}{2} \big[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \big] \Big\}$$

$$= \frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}} \exp\Big\{ -\tfrac{1}{2} \big[ x^T \Sigma_1^{-1} x + \mu_1^T \Sigma_1^{-1} \mu_1 - 2 x^T \Sigma_1^{-1} \mu_1 - x^T \Sigma_2^{-1} x - \mu_2^T \Sigma_2^{-1} \mu_2 + 2 x^T \Sigma_2^{-1} \mu_2 \big] \Big\}$$

$$= \frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}} \exp\Big\{ -\tfrac{1}{2} \big[ x^T (\Sigma_1^{-1} - \Sigma_2^{-1}) x - 2 x^T (\Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2) + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \big] \Big\}$$

The expression can be simplified by taking the logarithm of the likelihood ratio. In this way the so-called quadratic discriminant function is obtained:

$$Q(x) = \ln \frac{f_1(x)}{f_2(x)} = \frac{1}{2} \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \big[ x^T (\Sigma_1^{-1} - \Sigma_2^{-1}) x - 2 x^T (\Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2) + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \big]$$

The expression clearly shows that $Q$ is a quadratic function of $x$. The ensuing classification rule allocates $x_0$ to $\Pi_1$ if $Q(x_0) > \ln(H)$, where $H$ is either 1, if equal priors are assumed, or $\pi_2/\pi_1$ in the unequal prior case.

When $\Sigma_1 = \Sigma_2 = \Sigma$ the likelihood ratio becomes

$$\frac{f_1(x)}{f_2(x)} = \exp\Big\{ (\mu_1 - \mu_2)^T \Sigma^{-1} x - \tfrac{1}{2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) \Big\} = \exp\Big\{ (\mu_1 - \mu_2)^T \Sigma^{-1} \big[ x - \tfrac{1}{2} (\mu_1 + \mu_2) \big] \Big\}$$

After taking logarithms we obtain

$$L(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} \big[ x - \tfrac{1}{2} (\mu_1 + \mu_2) \big]$$

This is known as the linear discriminant rule, as it is a linear function of $x$. The ensuing classification rule allocates $x_0$ to $\Pi_1$ if $L(x_0) > \ln(H)$, where $H$ is again either 1, if equal priors are assumed, or $\pi_2/\pi_1$ in the unequal prior case.

In empirical applications, maximum likelihood estimates of the model parameters are plugged into the classification rules: $\mu_1$ and $\mu_2$ are estimated by $\bar{x}_1$ and $\bar{x}_2$; in the heteroscedastic case $\Sigma_1$ and $\Sigma_2$ are replaced by $S_1$ and $S_2$ respectively, while in the homoscedastic case the common $\Sigma$ is replaced by the within-group covariance matrix $W$. The empirical linear discriminant rule, in the equal prior case, becomes

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} \big[ x_0 - \tfrac{1}{2} (\bar{x}_1 + \bar{x}_2) \big] > 0$$

or equivalently

$$(\bar{x}_1 - \bar{x}_2)^T W^{-1} x_0 > \tfrac{1}{2} (\bar{x}_1 - \bar{x}_2)^T W^{-1} (\bar{x}_1 + \bar{x}_2)$$

In it the allocation rule obtained according to Fisher's approach can be easily recognized. This means that, for Gaussian populations and equal priors, besides optimizing group separation, Fisher's rule also minimizes the probability of a wrong classification.
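The plug-in rules are straightforward to code. The sketch below (illustrative names, reusing the hypothetical scatter_matrices helper; it is not the implementation of any particular package) applies the empirical linear discriminant rule with the threshold $\ln(\pi_2/\pi_1)$.

```python
import numpy as np

def lda_allocate(X, labels, x0, priors=(0.5, 0.5)):
    """Empirical linear discriminant rule for two homoscedastic Gaussian groups.

    Allocates x0 to group 1 if L(x0) > ln(pi_2 / pi_1), with mu_g estimated
    by the group means and the common Sigma by W.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    x0 = np.asarray(x0, dtype=float)
    g1, g2 = np.unique(labels)
    xbar1 = X[labels == g1].mean(axis=0)
    xbar2 = X[labels == g2].mean(axis=0)
    W, _ = scatter_matrices(X, labels)                 # plug-in estimate of Sigma
    # L(x0) = (xbar_1 - xbar_2)^T W^{-1} [x0 - (xbar_1 + xbar_2)/2]
    L = (xbar1 - xbar2) @ np.linalg.solve(W, x0 - 0.5 * (xbar1 + xbar2))
    return g1 if L > np.log(priors[1] / priors[0]) else g2
```

The quadratic rule is coded analogously, keeping $S_1$ and $S_2$ separate and comparing $Q(x_0)$ with the same threshold.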

4 Assessment of classification rules

After a classification rule has been devised, a measure of its performance must be obtained; in other words, an answer must be provided to the question: how good is the classification rule at correctly predicting class membership?

The most obvious measure of performance is an estimate of the proportion of individuals that will be misallocated by the rule, and the simplest way to obtain it is to apply the allocation rule to the same sample units that have been used to derive it. The proportion of misallocated individuals then provides an estimate of the error rate. The method is called the resubstitution method (because the sample units are used to derive the rule and are then resubstituted into it in order to evaluate its performance) and the corresponding error rate is known as the apparent error rate. Although very simple, the resubstitution method produces biased error estimates: it gives an optimistic measure of the performance of the rule, since the rule is expected to perform worse on units that have not contributed to its derivation.

A safer strategy is to split the original sample into two subsets: the training sample (containing 70-75% of the units), on which the classification rule is constructed, and the test sample, on which the classification rule is evaluated. The error rate is estimated by the proportion of misallocated units in the test set.

When the sample is so small that splitting it into two parts might produce rules based on unstable parameter estimates, resampling techniques have been suggested. The most common one is $k$-fold cross-validation: split the sample into $k$ equal-sized sub-samples, build the rule using the data of $k-1$ folds as the training set and the remaining fold as the test set; repeat the procedure excluding one fold at a time, and obtain the error rate estimate by averaging the error rates obtained at the various steps. A famous variant is the so-called leave-one-out (or jackknife) estimator, obtained by excluding one observation at a time from the training set and using it as the test set.
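As a closing illustration, here is a minimal sketch of the $k$-fold cross-validation error estimate (illustrative names; classify can be any allocation rule with the signature used in the earlier sketches, e.g. the hypothetical fisher_two_group_rule).

```python
import numpy as np

def cv_error_rate(X, labels, classify, k=5, seed=0):
    """k-fold cross-validation estimate of the misclassification rate.

    classify(X_train, y_train, x0) must return the predicted label of x0.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)      # random, roughly equal folds
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        wrong = sum(
            classify(X[train_idx], labels[train_idx], X[i]) != labels[i]
            for i in test_idx
        )
        errors.append(wrong / len(test_idx))
    return float(np.mean(errors))                      # average error over the k folds
```

Setting k = n gives the leave-one-out (jackknife) estimator mentioned above.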
