Discriminant Analysis (DA)


Discriminant Analysis (DA) involves two main goals:

1) Separation/discrimination: Descriptive Discriminant Analysis (DDA)
2) Classification/allocation: Predictive Discriminant Analysis (PDA)

In DDA:
- Classification of subjects into known groups on the basis of their quantitative characteristics
- Using known groups (classes) and their multiple characteristics to build a model (discriminant rule) that can discriminate among groups; in other words, relating a classification variable to multiple quantitative explanatory variables (p responses)
- The model built may be used to classify new observations into the known groups (PDA)
- The success or error of assigning new observations to known groups depends on the quality of the model built

Differences between DDA and Cluster Analysis
- In DA the grouping is known before the data analysis; we perform the analysis to gain a better understanding of the grouping structure using multiple characteristics
- In cluster analysis the grouping and its structure are not known before the data analysis
- The grouping that results from cluster analysis is only suggestive (not necessarily the true but unknown clustering structure)

Difference between DDA and Regression Analysis
- In DDA the dependent variable is categorical; in regression analysis the dependent variable is continuous

Examples of usage:
- Having a number of subjects who can be classified as having 1) heart disease or 2) no heart disease, together with a set of their medical, physiological, dietary, and other characteristics, to address whether the heart condition can be explained by such characteristics, and if so, to identify a discriminatory model to classify new observations into each of the heart-condition groups (i.e., predict group membership of new subjects) based on the related characteristics
- Credit card companies develop discriminant models based on past records to predict which applicants will be creditworthy or delinquent
- Developing a model to explain the popularity of TV programs (ads, news, etc.) and predict the popularity of a new program
- Males and females, young and adults, conifers and hardwoods, smokers and non-smokers, students successful or unsuccessful in graduating, bankrupt and successful companies (useful for company managers as well as shareholders), attributing an artifact to a civilization or tribe

Assumptions:
- Independence of subjects
- If multivariate normality can be assumed and
  -- the variance/covariance matrices are equal, then linear discriminant analysis should be used
  -- the variance/covariance matrices are not equal, then quadratic discriminant analysis should be used
- If multivariate normality CANNOT be assumed, or the explanatory variables are categorical, then use the SAS GENMOD or LOGISTIC procedures, or the SAS DISCRIM procedure with non-parametric discriminant analysis using the kernel or k-nearest-neighbor method
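A minimal sketch of how this choice can be delegated to SAS itself (the data set MYDATA, class variable GROUP, and predictors X1-X3 are hypothetical placeholders, not from these notes):

proc discrim data=mydata method=normal pool=test slpool=0.05;
   /* POOL=TEST performs a chi-square test of homogeneity of the
      within-group covariance matrices at the SLPOOL= significance level;
      the pooled (linear) rule is used if the test is not rejected,
      the within-group (quadratic) rule if it is rejected */
   class group;
   var x1 x2 x3;
run;

METHOD=NPAR with the KERNEL= or K= option gives the non-parametric alternatives mentioned above.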

A Simple Example of DDA and PDA:

There are two known groups belonging to 2 separate populations (happy and unhappy) and one quantitative variable (income). Furthermore, assume that income is normally distributed with equal variances for the two populations, as below.

[Figure: distributions of income for the two populations; adapted from Afifi & Clark, 1998]

What would be the best criterion, based on the available information (income), to separate the 2 groups (assuming the income means differ significantly, p < 0.01)? In other words, what is the separation (discrimination) value of income (C) for the two groups?

C = (x̄_I + x̄_II) / 2

This can serve as a classification (discrimination) rule to model (predict) future observations:
- an observation with income < C is assigned to the group with the smaller mean income
- an observation with income > C is assigned to the group with the larger mean income

Is it possible to make an error (i.e., misclassification)? Is the probability of misclassification equal for the two groups? How can we calculate the exact error?
- classify the observations based on the rule developed and compare the result to the known classification:

                         Classified as
Known grouping        Happy      Unhappy     %error (1-%correct)
Happy (n = 200)
Unhappy (n = 100)
Total (n = 300)

Is income a good separator of the two groups? How might such error (probability of misclassification) decrease with regard to
- the mean and variance of income?
- the addition of other variables?
- the sample size (# of obs.)?
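A minimal SAS sketch of this single-variable rule (the data set HAPPYDATA with variables GROUP and INCOME, and the two group means, are hypothetical placeholders):

%let mean_I  = 60;    /* assumed mean income of group I (happy)    */
%let mean_II = 40;    /* assumed mean income of group II (unhappy) */

data classified;
   set happydata;                         /* GROUP, INCOME              */
   C = (&mean_I + &mean_II) / 2;          /* cutoff = midpoint of means */
   length pred $7;
   if income > C then pred = 'Happy';
   else pred = 'Unhappy';
run;

/* compare predicted to known grouping to estimate the error rates */
proc freq data=classified;
   tables group*pred;
run;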

Let's add age (x_2) into our analysis and assume normality and equality of variances for age and income for the two groups (I = happy, II = unhappy). The univariate distributions and bivariate concentration ellipses for the two populations may be shown as below.

[Figure: univariate distributions and bivariate concentration ellipses for the two populations; adapted from Afifi & Clark, 1998]

What do the shaded areas indicate? Which analysis (univariate or bivariate) results in greater shaded areas? What is the simplest way of separating the two groups based on both age and income?

What is the mathematical equation of the dividing line z?

z = a_1 x_1 + a_2 x_2

- This was developed first by Fisher (1936) and hence is called Fisher's discriminant function.
- The symbol z here should not be mistaken for the z used as the symbol for a standardized value.
- Fisher calculated the coefficients a_1 and a_2 such that the squared statistical distance (D^2) between the means of the two groups in terms of z values is maximized.
  -- What is the implication of a large D^2? (see the note after the classification discussion below)
- Formulas for their computation can be found in Fisher (1936), Lachenbruch (1975), and Afifi and Azen (1979).
- For each observation within a group, a z value can be calculated using the above formula; then calculate the cutoff C as

C = (z̄_I + z̄_II) / 2

[Figure adapted from Afifi & Clark, 1998]

C can then be used to classify
-- existing observations with their known group membership, and
-- additional observations, to predict their group membership

                         Classified as
Known grouping        Happy      Unhappy     %error (1-%correct)
Happy (n = 200)
Unhappy (n = 100)
Total (n = 300)

Did the addition of the second variable (age) help the discrimination between the two groups?
- It depends on the effects on the error rates: are they decreased?
- Is the overall error acceptable?

How may we improve the discriminant rule further?
- By adding new variables; this cannot be shown graphically, but it holds mathematically.

So, classification may be done based on a single variable, but the classification may not be very accurate. The more variables involved, the smaller the error; but the number of variables that can be involved depends on the number of observations.
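Returning to the D^2 question above: the two-group coefficients and D^2 have a standard closed form (a textbook result, stated here for completeness; S denotes the pooled within-group covariance matrix):

a = S^{-1} (x̄_I - x̄_II)        D^2 = (x̄_I - x̄_II)' S^{-1} (x̄_I - x̄_II)

A large D^2 means the group means are far apart relative to the within-group variability, so the overlap between the groups, and hence the probability of misclassification, is small.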

In general, there are 4 similar ways to develop and use discriminant rules:

1. Fisher's Linear Discriminant Function Rule (presented above)
- for cases when two multivariate normal populations have equal variance/covariance matrices

2. Likelihood Rule
- Rule: Choose g_1 if L(x; μ_1, Σ_1) > L(x; μ_2, Σ_2), and choose g_2 otherwise, where L(x; μ_i, Σ_i) is the likelihood function (the multivariate normal probability density function, presented in earlier lectures)

3. Mahalanobis Distance Rule
- When the two populations have equal variance/covariance matrices, the likelihood rule is equivalent to:
- Rule: Choose g_1 when d_1 < d_2, where d_i = (x - μ_i)' Σ^{-1} (x - μ_i)
- d_i measures how far x is from μ_i (the Mahalanobis squared distance between x and μ_i); the equivalence holds because, with a common Σ, maximizing the normal density is the same as minimizing d_i

4. Posterior Probability Rule, based on Bayes' Theorem
- When the variance/covariance matrices are equal, the quantity P(g_i | x) is defined as

P(g_i | x) = P(x | g_i) P(g_i) / Σ_{i=1}^{k} P(x | g_i) P(g_i)

where:

P(x | g_i) is the probability of observing x assuming the data are from population g_i; in other words, it is
- the proportion of units in population g_i that have a response vector close to x
- it is called the typicality probability
- we use the data to calculate this

P(g_i | x) is the probability of belonging to population g_i conditioned on observing x
- conceptually, P(g_i | x) ≠ P(x | g_i)
- P(g_i | x) is called the posterior probability
- we use Bayes' theorem to calculate this

P(g_i) is the probability of belonging to population g_i
- it is called the prior probability

k is the number of criterion (known) populations

Recall Bayes' theorem from univariate analysis:

P(B | A) = P(A | B) P(B) / P(A), because P(A) P(B | A) = P(B) P(A | B)

Rule: Choose g_1 if P(g_1 | x) > P(g_2 | x), P(g_1 | x) > P(g_3 | x), and so on (i.e., choose the population with the largest posterior probability)

Remark: When the variance/covariance matrices are equal, all four discriminant rules are equivalent.
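A small worked illustration (all numbers invented for illustration only): suppose k = 2 with priors P(g_1) = 0.7 and P(g_2) = 0.3, and for an observed x the typicality probabilities are P(x | g_1) = 0.2 and P(x | g_2) = 0.5. Then

P(g_1 | x) = (0.2)(0.7) / [(0.2)(0.7) + (0.5)(0.3)] = 0.14 / 0.29 ≈ 0.48
P(g_2 | x) = (0.5)(0.3) / 0.29 = 0.15 / 0.29 ≈ 0.52

so the rule assigns x to g_2, even though g_1 has the larger prior probability.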

Note: The following examples are from SAS Help and Documentation, with slight modifications in some cases.

Example 1: Performing a simple discriminant analysis on simulated data

options nocenter ps=35 ls=65 nodate pageno=1;

data a;
   drop n;
   Type = 'H';
   do n = 1 to 20;
      X =     * normal(57391);   /* the numeric constants in the X and Y */
      Y = X   normal(57391);     /* expressions were lost from this copy */
      output;
   end;
   Type = 'C';
   do n = 1 to 30;
      X =     * normal(57391);
      Y = X   normal(57391);
      output;
   end;
run;

symbol1 v='H' c=black;
symbol2 v='C' c=red;
run;

proc print data=a;
run;

[Output of PROC PRINT: the 50 observations (Obs, Type = H or C, X, Y)]

proc gplot;
   plot Y*X=Type / cframe=w nolegend;
run;

[Output: scatter plot of Y versus X with points labeled H and C by Type]

proc discrim data=a all;
   class Type;
   var X Y;
run;

Prior probabilities in the above code are equal by default (as indicated in the output). Anywhere before the RUN statement we may add a PRIORS statement, either PRIORS PROP (proportional to the group sample sizes) or explicit values such as 'C' = 0.7 'H' = 0.3 (or any other desired or hypothesized probabilities); see the sketch after the output below.

The DISCRIM Procedure

Observations   50    DF Total             49
Variables       2    DF Within Classes    48
Classes         2    DF Between Classes    1

Class Level Information

[Output: variable name, frequency, weight, proportion, and prior probability (0.5 each) for Type C and Type H]
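A minimal sketch of the two PRIORS variants mentioned above (the values 0.7/0.3 are the hypothesized ones from the note, not estimates):

proc discrim data=a;
   class Type;
   var X Y;
   priors 'C' = 0.7 'H' = 0.3;   /* or:  priors prop;  */
run;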

[Output: Within-Class SSCP Matrices (Type = C and Type = H), Pooled Within-Class SSCP Matrix, Between-Class SSCP Matrix, and Total-Sample SSCP Matrix, each over X and Y]

[Output: Within-Class Covariance Matrices (Type = C, DF = 29; Type = H, DF = 19), Pooled Within-Class Covariance Matrix (DF = 48), Between-Class Covariance Matrix (DF = 1), and Total-Sample Covariance Matrix (DF = 49), each over X and Y]

[Output: Within-Class (Type = C and Type = H), Pooled Within-Class, Between-Class, and Total-Sample Correlation Coefficients with Pr > |r|; the X-Y correlations in the within-class, pooled, and total-sample tables are significant at < .0001]

[Output: Simple Statistics (N, sum, mean, variance, standard deviation) for X and Y in the total sample, in Type = C, and in Type = H, followed by the Total-Sample and Pooled Within-Class Standardized Class Means for C and H]

Pooled Covariance Matrix Information

[Output: covariance matrix rank and natural log of the determinant of the covariance matrix]

Pairwise Squared Distances Between Groups

D^2(i|j) = (X̄_i - X̄_j)' COV^{-1} (X̄_i - X̄_j)

[Output: squared distances between Type C and Type H]

Univariate Test Statistics

[Output: F statistics (Num DF = 1, Den DF = 48) for X and Y: total, pooled, and between standard deviations, F value, and Pr > F]

Multivariate Statistics and Exact F Statistics

S=1  M=0  N=22.5

[Output: Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, and Roy's Greatest Root, each with its F value, Num DF, Den DF, and Pr > F < .0001]

Linear Discriminant Function

Constant = -0.5 X̄_j' COV^{-1} X̄_j        Coefficient Vector = COV^{-1} X̄_j

Linear Discriminant Function for Type

[Output: constant and X, Y coefficients for Type C and Type H]

Classification Summary for Calibration Data: WORK.A
Resubstitution Summary using Linear Discriminant Function

Posterior Probability of Membership in Each Type

Pr(j|X) = exp(-0.5 D_j^2(X)) / SUM_k exp(-0.5 D_k^2(X))

[Output: number of observations and percent classified into each Type (from C and H), the priors, and the Error Count Estimates for Type C, Type H, and in total]

NOTE: If the prior probabilities are set proportional to the group sample sizes, the total classification error decreases.

To learn how the information given under "Linear Discriminant Function for Type" is used for classification purposes (i.e., to generate the classification table), we may run the following code; the missing coefficients are the constants and the X and Y coefficients from that table:

data b;
   set a;
   class_c = ( *x) + ( *y);   /* constant + coefficients for Type C */
   class_h = ( *x) + ( *y);   /* constant + coefficients for Type H */
   if class_c > class_h then pred_type = 'C';
   else pred_type = 'H';
run;

proc freq data=b;
   tables type*pred_type;
run;
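A hedged alternative sketch that avoids copying the coefficients by hand: the OUT= data set from PROC DISCRIM already contains the assigned class in the automatic variable _INTO_, so the same table can be produced with

proc discrim data=a out=scored noprint;
   class Type;
   var X Y;
run;

proc freq data=scored;
   tables Type*_INTO_;
run;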

Example 2: The iris data (Fisher, 1936) are used. The sepal length, sepal width, petal length, and petal width are measured in millimeters on 50 iris specimens from each of three known species: Iris setosa, I. versicolor, and I. virginica. The first discriminant analysis is performed with a single quantitative variable (petal width) to simplify the output. The GCHART procedure is used to display the sample distribution of petal width in the three species. Note the overlap between species I. versicolor and I. virginica.

data iris;
   input SepalLength SepalWidth PetalLength PetalWidth Species $;
   datalines;
   ... (the 150 data lines are omitted here) ...
;

proc gchart data=iris;
   vbar PetalWidth / subgroup=Species midpoints=0 to 25
                     raxis=axis1 maxis=axis2 legend=legend1 cframe=ligr;
run;

[Output: bar chart of the sample distribution of petal width in the three species]

To use the discriminant model built above to predict the species membership of plants with known petal width but unknown species, 30 plants are simulated and saved in a data set named B using the following code:

data b;
   do plant = 1 to 30;
      PetalWidth = 10 + 4*normal(1);
      output;
   end;
run;

options nocenter ls=75;
proc print data=b noobs;
run;

[Output: listing of the 30 simulated plants (Plant, PetalWidth)]

Data set B can then be used as the test data, via the TESTDATA=, TESTLIST, and TESTID SAS keywords, for predicting the species of the simulated plants.

The following run uses normal-theory methods (METHOD=NORMAL). The CROSSLISTERR option lists the misclassified observations under cross validation and displays cross-validation error-rate estimates. The TESTDATA= option names the data set containing the plants whose species we would like to predict using the discriminant model. The TESTLIST option lists the predicted species membership for each plant in the test data. The TESTID statement indicates the variable that names each observation when its species is predicted; this statement works only if the TESTLIST and/or TESTLISTERR option is used.

Although it is not done in the following run, note that TESTCLASS, TESTDATA, TESTLIST, TESTLISTERR, and TESTID may also be used to split the original data, with known grouping for all observations, into two parts: build the discriminant model on one part, and then test this model on the rest of the data, supplied as TESTDATA (see the sketch after the code below).

proc discrim data=iris method=normal crosslisterr testdata=b testlist;
   class Species;
   var PetalWidth;
   testid Plant;
run;
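A minimal sketch of that train/test use (the 50/50 split and the seed 123 are arbitrary illustrative choices, not from the notes):

data calib test;
   set iris;
   if ranuni(123) < 0.5 then output calib;   /* calibration (training) part */
   else output test;                         /* hold-out (test) part        */
run;

proc discrim data=calib testdata=test testlisterr;
   class Species;
   var PetalWidth;
   testclass Species;   /* known species in TEST enables misclassification counts */
run;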

The DISCRIM Procedure

Total Sample Size   150    DF Total             149
Variables             1    DF Within Classes    147
Classes               3    DF Between Classes     2

Number of Observations Read   150
Number of Observations Used   150

Class Level Information

[Output: variable name, frequency (50 each), weight, proportion, and prior probability for Setosa, Versicolor, and Virginica]

Pooled Covariance Matrix Information

[Output: covariance matrix rank and natural log of the determinant of the covariance matrix]

Pairwise Generalized Squared Distances Between Groups

D^2(i|j) = (X̄_i - X̄_j)' COV^{-1} (X̄_i - X̄_j)

[Output: generalized squared distances among Setosa, Versicolor, and Virginica]

Linear Discriminant Function

Constant = -0.5 X̄_j' COV^{-1} X̄_j        Coefficient Vector = COV^{-1} X̄_j

Linear Discriminant Function for Species

[Output: constant and PetalWidth (Petal Width in mm.) coefficient for Setosa, Versicolor, and Virginica]

Classification Summary for Calibration Data: WORK.IRIS
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

D_j^2(X) = (X - X̄_j)' COV^{-1} (X - X̄_j)

Posterior Probability of Membership in Each Species

Pr(j|X) = exp(-0.5 D_j^2(X)) / SUM_k exp(-0.5 D_k^2(X))

[Output: number of observations and percent classified into each species (from Setosa, Versicolor, and Virginica), the priors, and the Error Count Estimates per species and in total]

The DISCRIM Procedure
Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Linear Discriminant Function

Generalized Squared Distance Function

D_j^2(X) = (X - X̄_(X)j)' COV_(X)^{-1} (X - X̄_(X)j)

(the subscript (X) indicates that the class mean and covariance matrix are recomputed with observation X left out)

Posterior Probability of Membership in Each Species

Pr(j|X) = exp(-0.5 D_j^2(X)) / SUM_k exp(-0.5 D_k^2(X))

Posterior Probability of Membership in Species

                       Classified
Obs   From Species     into Species      Setosa   Versicolor   Virginica
  5   Virginica        Versicolor *
 ..   Versicolor       Virginica  *
 ..   Virginica        Versicolor *
 ..   Virginica        Versicolor *
 ..   Virginica        Versicolor *
 ..   Versicolor       Virginica  *

* Misclassified observation
(the remaining observation numbers and the posterior probabilities are not reproduced)

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

D_j^2(X) = (X - X̄_(X)j)' COV_(X)^{-1} (X - X̄_(X)j)

Posterior Probability of Membership in Each Species

Pr(j|X) = exp(-0.5 D_j^2(X)) / SUM_k exp(-0.5 D_k^2(X))

[Output: number of observations and percent classified into each species under cross validation, the priors, and the Error Count Estimates per species and in total]

The DISCRIM Procedure
Classification Results for Test Data: WORK.B
Classification Results using Linear Discriminant Function

Classified into Species (plants 1-30, in order; the posterior probabilities of membership are not reproduced):

Virginica, Versicolor, Versicolor, Setosa, Virginica, Setosa, Versicolor,
Versicolor, Setosa, Versicolor, Setosa, Versicolor, Versicolor, Setosa,
Setosa, Setosa, Versicolor, Versicolor, Setosa, Versicolor, Versicolor,
Versicolor, Versicolor, Versicolor, Setosa, Setosa, Versicolor, Versicolor,
Versicolor, Versicolor

Observation Profile for Test Data

Number of Observations Read   30
Number of Observations Used   30

[Output: number of observations and percent classified into Setosa, Versicolor, and Virginica, with the priors]

Part of the options in PROC DISCRIM, adapted from SAS Help and Documentation:

LISTERR
   displays the resubstitution classification results for misclassified observations only. You can specify this option only when the input data set is an ordinary SAS data set.

NOCLASSIFY
   suppresses the resubstitution classification of the input DATA= data set. You can specify this option only when the input data set is an ordinary SAS data set.

OUT=SAS-data-set
   creates an output SAS data set containing all the data from the DATA= data set, plus the posterior probabilities and the class into which each observation is classified by resubstitution. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. See the "OUT= Data Set" section.

OUTCROSS=SAS-data-set
   creates an output SAS data set containing all the data from the DATA= data set, plus the posterior probabilities and the class into which each observation is classified by cross validation. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. See the "OUT= Data Set" section.

OUTD=SAS-data-set
   creates an output SAS data set containing all the data from the DATA= data set, plus the group-specific density estimates for each observation. See the "OUT= Data Set" section.

OUTSTAT=SAS-data-set
   creates an output SAS data set containing various statistics such as means, standard deviations, and correlations. When the input data set is an ordinary SAS data set, or when TYPE=CORR, TYPE=COV, TYPE=CSSCP, or TYPE=SSCP, this option can be used to generate discriminant statistics. When you specify the CANONICAL option, canonical correlations, canonical structures, canonical coefficients, and means of canonical variables for each class are included in the data set. If you specify METHOD=NORMAL, the output data set also includes coefficients of the discriminant functions, and the output data set is TYPE=LINEAR (POOL=YES), TYPE=QUAD (POOL=NO), or TYPE=MIXED (POOL=TEST). If you specify METHOD=NPAR, this output data set is TYPE=CORR. This data set also holds calibration information that can be used to classify new observations. See the "Saving and Using Calibration Information" section and the "OUT= Data Set" section.

POSTERR
   displays the posterior probability error-rate estimates of the classification criterion based on the classification results.

TESTDATA=SAS-data-set
   names an ordinary SAS data set with observations that are to be classified. The quantitative variable names in this data set must match those in the DATA= data set. When you specify the TESTDATA= option, you can also specify the TESTCLASS, TESTFREQ, and TESTID statements. When you specify the TESTDATA= option, you can use the TESTOUT= and TESTOUTD= options to generate classification results and group-specific density estimates for observations in the test data set. Note that if the CLASS variable is not present in the TESTDATA= data set, the output will not include misclassification statistics.

TESTLIST
   lists classification results for all observations in the TESTDATA= data set.

TESTLISTERR
   lists only the misclassified observations in the TESTDATA= data set, and only if a TESTCLASS statement is also used.

TESTOUT=SAS-data-set
   creates an output SAS data set containing all the data from the TESTDATA= data set, plus the posterior probabilities and the class into which each observation is classified. When you specify the CANONICAL option, the data set also contains new variables with canonical variable scores. See the "OUT= Data Set" section.

TESTOUTD=SAS-data-set
   creates an output SAS data set containing all the data from the TESTDATA= data set, plus the group-specific density estimates for each observation. See the "OUT= Data Set" section.
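A minimal sketch of TESTOUT= with the simulated plants from Example 2 (like OUT=, the TESTOUT= data set stores the assigned class in the automatic variable _INTO_ plus one posterior-probability variable per class level):

proc discrim data=iris testdata=b testout=b_scored noprint;
   class Species;
   var PetalWidth;
run;

proc print data=b_scored(obs=5);
   var plant PetalWidth _INTO_ Setosa Versicolor Virginica;
run;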

The following is a non-parametric discriminant analysis (METHOD=NPAR). It uses equal bandwidths (smoothing parameters). The value of the radius parameter that, assuming normality, minimizes an approximate mean integrated square error is 0.48 (see the "Nonparametric Methods" section). Choosing r = 0.4 gives a more detailed look at the irregularities in the data. The following statements produce the output below:

proc discrim data=iris method=npar kernel=normal r=.4
             short noclassify crosslisterr;
   class Species;
   var PetalWidth;
   title2 'Using Kernel Density Estimates with Equal Bandwidth';
run;

Output: Kernel Density Estimates with Equal Bandwidth

Discriminant Analysis of Fisher (1936) Iris Data
Using Kernel Density Estimates with Equal Bandwidth

The DISCRIM Procedure

Observations   150    DF Total             149
Variables        1    DF Within Classes    147
Classes          3    DF Between Classes     2

Class Level Information

[Output: variable name, frequency, weight, proportion, and prior probability for Setosa, Versicolor, and Virginica]

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Normal Kernel Density

Posterior Probability of Membership in Species

                       Classified
Obs   From Species     into Species      Setosa   Versicolor   Virginica
  5   Virginica        Versicolor *
 ..   Versicolor       Virginica  *
 ..   Virginica        Versicolor *
 ..   Virginica        Versicolor *
 ..   Virginica        Versicolor *
 ..   Versicolor       Virginica  *

* Misclassified observation

Classification Summary for Calibration Data: WORK.IRIS
Cross-validation Summary using Normal Kernel Density

[Output: number of observations and percent classified into each species, the priors, and the Error Count Estimates per species and in total]
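A minimal sketch of the k-nearest-neighbor alternative mentioned in the assumptions section (k = 5 is an arbitrary illustrative choice, not from the notes):

proc discrim data=iris method=npar k=5 crosslisterr;
   class Species;
   var PetalWidth;
run;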

In the following example:
- All four variables are used.
- POOL=TEST (YES and NO are the other options; YES is the default) tests the homogeneity of the within-group covariance matrices (the null hypothesis). This test rejects the null at the 0.10 level (the SAS default level for declaring the calculated P significant), so separate within-group covariance matrices are used to derive the quadratic discriminant criterion.
- The WCOV and PCOV options display the within-group covariance matrices and the pooled covariance matrix.
- The DISTANCE option displays squared distances between classes.
- The ANOVA and MANOVA options test whether the class means are equal, using ANOVA and MANOVA respectively.
- The LISTERR option lists the misclassified observations under resubstitution.
- The CROSSLISTERR option lists the misclassified observations under cross validation and displays cross-validation error-rate estimates.
- The OUTSTAT= option generates a TYPE=MIXED (because POOL=TEST) output data set containing various statistics such as means, covariances, and coefficients of the discriminant function.

As expected, the resubstitution error-count estimate is smaller than that of the cross-validation method, because resubstitution is optimistically biased.

proc discrim data=iris outstat=irisstat
             wcov pcov method=normal pool=test
             distance anova manova listerr crosslisterr;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

Output: Covariance Matrices

[Output: Within-Class Covariance Matrices for Setosa, Versicolor, and Virginica (each DF = 49), over SepalLength, SepalWidth, PetalLength, and PetalWidth]

[Output: Pooled Within-Class Covariance Matrix (DF = 147) over SepalLength, SepalWidth, PetalLength, and PetalWidth]

Output: Homogeneity Test

Test of Homogeneity of Within Covariance Matrices

Chi-Square      DF      Pr > ChiSq
                        <.0001

Since the chi-square value is significant at the 0.1 level, the within covariance matrices will be used in the discriminant function.

Reference: Morrison, D.F. (1976) Multivariate Statistical Methods.

Output: Squared Distances

Discriminant Analysis of Fisher (1936) Iris Data
Using Quadratic Discriminant Function

The DISCRIM Procedure

[Output: Squared Distance and Generalized Squared Distance to each species from each species (Setosa, Versicolor, Virginica)]

Output: Tests of Equal Class Means

Univariate Test Statistics
F Statistics, Num DF = 2, Den DF = 147

[Output: for each variable (SepalLength, SepalWidth, PetalLength, PetalWidth) — total, pooled, and between standard deviations, R-square, R-square/(1-RSq), F value, and Pr > F (< .0001 for all four variables); average R-square, unweighted and weighted by variance]

Multivariate Statistics and F Approximations

S=2  M=0.5  N=71

[Output: Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, and Roy's Greatest Root, each with its F value, Num DF, Den DF, and Pr > F < .0001]

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

Output: Misclassified Observations — Resubstitution

Classification Results for Calibration Data: WORK.IRIS
Resubstitution Results using Quadratic Discriminant Function

Posterior Probability of Membership in Species

                       Classified
Obs   From Species     into Species      Setosa   Versicolor   Virginica
  5   Virginica        Versicolor *
 ..   Versicolor       Virginica  *
 ..   Versicolor       Virginica  *

* Misclassified observation

Resubstitution Summary using Quadratic Discriminant Function

[Output: number of observations and percent classified into each species, the priors, and the Error Count Estimates per species and in total]

Output: Misclassified Observations — Cross Validation

Classification Results for Calibration Data: WORK.IRIS
Cross-validation Results using Quadratic Discriminant Function

Posterior Probability of Membership in Species

                       Classified
Obs   From Species     into Species      Setosa   Versicolor   Virginica
  5   Virginica        Versicolor *
 ..   Versicolor       Virginica  *
 ..   Versicolor       Virginica  *
 ..   Versicolor       Virginica  *

* Misclassified observation

[Output: number of observations and percent classified into each species under cross validation, the priors, and the Error Count Estimates per species and in total]

Note that if we assume homogeneity of the variance/covariance matrices (POOL=YES), the error-count estimate for the resubstitution method remains the same, but for the cross-validation method it becomes 0.02.
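A minimal sketch of that comparison run (the same analysis, forcing the pooled covariance matrix):

proc discrim data=iris method=normal pool=yes crosslisterr;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;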
