Applied Multivariate Analysis


1 Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017

2 Discriminant Analysis

3 Background
1 Discriminant analysis
   Background
   General Setup for the Discriminant Analysis
   Descriptive Discriminant Analysis
   Number of Discriminant Functions

4 Background Example 1 Consider the following data on financial ratios for solvent and bankrupt companies.
Financial Ratios of Bankrupt and Solvent Companies, Altman (1968). Source: Morrison (1990). Multivariate Statistical Methods, 3rd ed. McGraw-Hill.
X1 = Working Capital / Total Assets
X2 = Retained Earnings / Total Assets
X3 = Earnings Before Interest and Taxes / Total Assets
X4 = Market Value of Equity / Total Value of Liabilities
X5 = Sales / Total Assets
Group: 1 = Bankrupt, 2 = Solvent

5 Background [Data table with columns Group, X1, X2, X3, X4, X5 for each company; the values are not preserved.] (Slide footer: Seppo Pynnönen)

7 Background Relevant questions then are:
How do the companies in these two groups differ from each other?
Which ratios best discriminate between the groups?
Are the ratios useful for predicting bankruptcies?
Partial answers can be obtained by examining each variable one at a time.

8 Background For example, sample statistics can be computed for each group. [Table of group statistics not preserved.]

9 Background Some graphics may also be helpful. [Figure not preserved.] A more complete use of the group separation information, however, is provided by discriminant analysis (DA).

11 General Setup for the Discriminant Analysis Discriminant analysis is used for two purposes: (1) describing major differences among the groups, and (2) classifying subjects into groups on the basis of measurements.

13 Descriptive Discriminant Analysis The starting setup: p variables and q mutually exclusive groups.

14 Descriptive Discriminant Analysis The goal of descriptive DA is to form $k$ new variables such that
1 The new variables are uncorrelated.
2 The first new variable has the best discriminating power with respect to the given groups. The second new variable has the second best discriminating power and is uncorrelated with the first one, the third has the third best discriminating power and is uncorrelated with the previous ones, etc.
Remark 1 $k \le \min(p, q - 1)$. For example, if $q = 2$ then $k = \min(p, 1) = 1$.

15 Descriptive Discriminant Analysis More precisely, suppose we have observations on random variables $x_1, \ldots, x_p$ from $q$ groups. Then the $j$th discriminant function is defined as a linear combination of the original variables
$$y_j = a_{j1} x_1 + \cdots + a_{jp} x_p, \qquad (1)$$
such that $\mathrm{corr}[y_j, y_l] = 0$ for $j \ne l$, and $y_1$ has the best discriminating power, $y_2$ the second best, and so on.

16 Descriptive Discriminant Analysis Remark 2 In the basic case the assumption is that the groups differ only with respect to the means of the variables. Consequently, the variances of the variables and the correlations between them are assumed to be the same in every group (the groups have a common covariance structure).

17 Descriptive Discriminant Analysis The idea in deriving the discriminant functions is to divide the total variation into between-group and within-group variation
$$T = B + W, \qquad (2)$$
where $T$ denotes the total covariance matrix, $B$ the between-group covariance matrix, and $W$ the within-group covariance matrix.
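The decomposition (2) can be checked numerically. The sketch below, in plain Python, uses a made-up data set of two groups and two variables (the numbers are hypothetical, not the bankruptcy data). It works with sums-of-squares-and-cross-products (SSCP) matrices, for which the identity holds exactly; the corresponding covariance matrices are obtained by dividing each SSCP matrix by its degrees of freedom.

```python
# Hypothetical toy data: two groups, two variables (not the bankruptcy data).
groups = {
    1: [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)],
    2: [(4.0, 6.0), (5.0, 4.0), (6.0, 8.0), (5.0, 6.0)],
}

def mean(rows):
    """Column means of a list of equal-length tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

def sscp(rows, center):
    """Sums of squares and cross products of rows around `center`."""
    p = len(center)
    m = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - c for x, c in zip(row, center)]
        for i in range(p):
            for j in range(p):
                m[i][j] += d[i] * d[j]
    return m

def add(a, b):
    """Elementwise sum of two matrices stored as nested lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

all_rows = [row for rows in groups.values() for row in rows]
grand_mean = mean(all_rows)
p = len(grand_mean)

# Total SSCP: deviations of every observation from the grand mean.
T = sscp(all_rows, grand_mean)

# Within SSCP: deviations from each group's own mean, summed over groups.
W = [[0.0] * p for _ in range(p)]
for rows in groups.values():
    W = add(W, sscp(rows, mean(rows)))

# Between SSCP: deviations of group means from the grand mean,
# weighted by group size.
B = [[0.0] * p for _ in range(p)]
for rows in groups.values():
    d = [g - c for g, c in zip(mean(rows), grand_mean)]
    for i in range(p):
        for j in range(p):
            B[i][j] += len(rows) * d[i] * d[j]

BW = add(B, W)
max_gap = max(abs(T[i][j] - BW[i][j]) for i in range(p) for j in range(p))
print("largest |T - (B + W)| entry:", max_gap)
```

The printed gap should be zero up to floating-point rounding, whatever data are substituted for the toy values.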

18 Descriptive Discriminant Analysis Technically the problem reduces again to an eigenvalue problem. In this case the eigenvalues are extracted from the matrix
$$BW^{-1}, \qquad (3)$$
or equivalently from $W^{-1}B$, which has the same eigenvalues (SAS labels it INV(E)*H). The eigenvectors of $W^{-1}B$ form the coefficients for the discriminant functions $y_j$, $j = 1, \ldots, k$, with $k = \min(q - 1, p)$. The functions are called canonical discriminant functions.
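One standard way to see why an eigenvalue problem appears (a sketch of the usual textbook argument, not taken from the slides, and assuming $W$ is nonsingular): the first discriminant function $y = a'x$ is chosen to maximize the ratio of the between-group to the within-group variation of $y$.

```latex
\lambda(a) = \frac{a' B a}{a' W a},
\qquad
\frac{\partial \lambda}{\partial a}
  = \frac{2\,\bigl[(a' W a)\, B a - (a' B a)\, W a\bigr]}{(a' W a)^2} = 0
\;\Longleftrightarrow\;
B a = \lambda\, W a
\;\Longleftrightarrow\;
W^{-1} B\, a = \lambda\, a .
```

The largest eigenvalue $\lambda_1$ is the best attainable ratio and its eigenvector $a_1$ gives the coefficients of $y_1$; the remaining eigenvectors give $y_2, \ldots, y_k$. Because $B$ has rank at most $q - 1$, at most $\min(p, q - 1)$ eigenvalues are nonzero, which is the bound in Remark 1. With $q = 2$ the single eigenvector is proportional to $W^{-1}(\bar{x}_1 - \bar{x}_2)$.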

19 Descriptive Discriminant Analysis Example 2 Consider the bankruptcy data. Use SAS proc candisc or SPSS (Analyze > Classify > Discriminant). Below are SAS results.
Example: Discriminant analysis applied to bankrupt data
Canonical Discriminant Analysis
66 Observations   65 DF Total
5 Variables       64 DF Within Classes
2 Classes         1 DF Between Classes
Class Level Information
Columns: GROUP, Frequency, Weight, Proportion [values not preserved]

20 Descriptive Discriminant Analysis
Canonical Discriminant Analysis: Within-Class Covariance Matrices
GROUP = 1, DF = 32: 5 x 5 covariance matrix of X1 to X5 [values not preserved]
GROUP = 2, DF = 32: 5 x 5 covariance matrix of X1 to X5 [values not preserved]

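21 Descriptive Discriminant Analysis
Canonical Discriminant Analysis: Simple Statistics
Total-Sample: columns Variable, N, Mean, Variance, Std Dev; rows X1 to X5 [values not preserved]
GROUP = 1: columns Variable, N, Mean, Variance, Std Dev; rows X1 to X5 [values not preserved]
GROUP = 2: columns Variable, N, Mean, Variance, Std Dev; rows X1 to X5 [values not preserved]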

22 Descriptive Discriminant Analysis
Univariate Test Statistics
F Statistics, Num DF = 1, Den DF = 64
Columns: Variable, Total STD, Pooled STD, Between STD, R-Squared, RSQ/(1-RSQ), F, Pr > F; rows X1 to X5 [values not preserved]
Average R-Squared: Unweighted = [not preserved], Weighted by Variance = [not preserved]
Multivariate Statistics and Exact F Statistics
S=1 M=1.5 N=29
Columns: Statistic, Value, F, Num DF, Den DF, Pr > F; rows Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, Roy's Greatest Root [values not preserved]

23 Descriptive Discriminant Analysis
Example: Discriminant analysis applied to bankrupt data
Canonical Discriminant Analysis
Columns: Canonical Correlation, Adjusted Canonical Correlation, Approx Standard Error, Squared Canonical Correlation [values not preserved]
Eigenvalues of INV(E)*H = CanRsq/(1-CanRsq)
Columns: Eigenvalue, Difference, Proportion, Cumulative [values not preserved]
Test of H0: The canonical correlations in the current row and all that follow are zero
Columns: Likelihood Ratio, Approx F, Num DF, Den DF, Pr > F [values not preserved]
NOTE: The F statistic is exact.
Total Canonical Structure: column CAN1; rows X1 to X5 [values not preserved]

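24 Descriptive Discriminant Analysis
Between Canonical Structure: column CAN1; rows X1 to X5 [values not preserved]
Pooled Within Canonical Structure: column CAN1; rows X1 to X5 [values not preserved]
Total-Sample Standardized Canonical Coefficients: column CAN1; rows X1 to X5 [values not preserved]
Pooled Within-Class Standardized Canonical Coefficients: column CAN1; rows X1 to X5 [values not preserved]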

25 Descriptive Discriminant Analysis
Raw Canonical Coefficients: column CAN1; rows X1 to X5 [values not preserved]
Class Means on Canonical Variables: columns GROUP, CAN1 [values not preserved]

26 Descriptive Discriminant Analysis The output includes several coefficient matrices. The structure matrices give the correlations of the original variables with the discriminant function. For interpretation, the most useful of these is the pooled within canonical structure. With more than two groups, the between canonical structure may also give useful additional information: it shows how the group means of the variables are correlated with the group means of the discriminant functions.

27 Descriptive Discriminant Analysis The standardized coefficients are obtained by multiplying the raw coefficients by the standard deviations of the corresponding variables. These coefficients give the marginal effect of each (standardized) variable on the discriminant function. The discriminant function is labeled on the basis of the variables with the largest correlations and the largest standardized coefficients.
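The link between raw and standardized coefficients can be sketched in a few lines of Python. All numbers are hypothetical placeholders (the SAS values are not reproduced here). The point is only that rescaling a variable to unit standard deviation multiplies its coefficient by that standard deviation and leaves the centered discriminant score unchanged.

```python
# Hypothetical raw coefficients, pooled within-group standard deviations
# and means for three variables (placeholders, not SAS output).
raw_coef = [2.0, -0.5, 10.0]
std_dev = [0.25, 4.0, 0.05]
means = [0.1, 3.0, 0.02]

# Standardized coefficient = raw coefficient * standard deviation.
std_coef = [a * s for a, s in zip(raw_coef, std_dev)]

def score_raw(x):
    """Centered discriminant score computed from the raw variables."""
    return sum(a * (xi - m) for a, xi, m in zip(raw_coef, x, means))

def score_standardized(x):
    """Same score computed from standardized variables z = (x - mean) / sd."""
    z = [(xi - m) / s for xi, m, s in zip(x, means, std_dev)]
    return sum(b * zi for b, zi in zip(std_coef, z))

x = [0.3, 5.0, 0.04]  # one hypothetical observation
print("standardized coefficients:", std_coef)
print("scores:", score_raw(x), score_standardized(x))
```

In this toy case the third variable has the largest raw coefficient only because it is measured on a small scale; after standardization its weight is no larger than that of the first variable, which is why standardized rather than raw coefficients are compared.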

28 Descriptive Discriminant Analysis Example 3 From the within canonical structure we observe:
X2 (Retained Earnings / Total Assets) has the highest correlation with the discriminant function.
X4 (Market Value of Equity / Total Value of Liabilities), X1 (Working Capital / Total Assets), and X3 (Earnings Before Interest and Taxes / Total Assets) have the next highest correlations.
The correlation of X5 (Sales / Total Assets) is small, but X5 has a large standardized coefficient.
In summary, profitability and a high market value are the properties that protect a company from bankruptcy.

29 Descriptive Discriminant Analysis It should be noted that the basic assumptions in discriminant analysis are that the variables are normally distributed in each of the groups and that the covariance matrices are the same. The former assumption is harder to test. The latter is easier (in SPSS select Box's M from the options). If the covariance matrices are not the same, linear discriminant function analysis is invalid and one should move to quadratic discriminant function analysis. That method, however, is designed for classification purposes.

30 Descriptive Discriminant Analysis Example 4 Testing for the equality of the population covariance matrices:
$$H_0 : \Sigma_1 = \Sigma_2, \qquad (4)$$
where $\Sigma_i$ is the covariance matrix of population $i$ ($i = 1, 2$). SPSS gives the result: Test Chi-Square Value = [value not preserved] with 15 degrees of freedom and p-value = [value not preserved]. The null hypothesis is rejected, hence the analysis results should be interpreted with caution.
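The 15 degrees of freedom agree with the usual chi-square approximation to Box's M test, which has $p(p+1)(q-1)/2$ degrees of freedom for $p$ variables and $q$ groups (stated here as the standard formula, not quoted from the slides):

```latex
\mathrm{df} = \frac{p(p+1)(q-1)}{2} = \frac{5 \cdot 6 \cdot 1}{2} = 15,
\qquad p = 5,\; q = 2 .
```

This equals the number of distinct elements of one symmetric $5 \times 5$ covariance matrix: with two groups, the null hypothesis imposes one equality per distinct element.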

32 Number of Discriminant Functions With more than two groups the question is in how many dimensions the groups differ. With two groups this is not an issue, because the groups can differ in only one dimension. In general, however, there can be more discriminating dimensions if $q > 2$.
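The upper bound on the number of dimensions is $\min(p, q - 1)$. As a small illustration, the Python sketch below evaluates it for the three data sets used in these slides; the function name is a hypothetical helper introduced only for this example.

```python
def max_discriminant_functions(p: int, q: int) -> int:
    """Upper bound k = min(p, q - 1) on the number of discriminant functions."""
    if p < 1 or q < 2:
        raise ValueError("need at least one variable and two groups")
    return min(p, q - 1)

# (p variables, q groups) for the examples in these slides.
examples = {
    "bankruptcy data": (5, 2),
    "iris data": (4, 3),
    "four-group reorganization data": (5, 4),
}
bounds = {name: max_discriminant_functions(p, q)
          for name, (p, q) in examples.items()}
for name, k in bounds.items():
    print(name, "->", k)
```

The bound only says how many functions can exist; how many of them actually separate the groups is a matter for the significance tests.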

33 Number of Discriminant Functions Example 5 The following data are a classic example concerning three species of iris (Iris setosa, Iris versicolor, and Iris virginica). The following measurements were made:
SL: Sepal Length
SW: Sepal Width
PL: Petal Length
PW: Petal Width

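34 Number of Discriminant Functions The CANDISC procedure produces the following results.
    title;
    data iris;
       title 'Discriminant Analysis of Fisher (1936) Iris Data';
       /* declare the length first so that VERSICOLOR is not truncated */
       length species $ 10;
       /* spec_no restored: the IF statements below need it; add a trailing @@
          if each data line holds several records */
       input sepallen sepalwid petallen petalwid spec_no;
       if spec_no=1 then species='SETOSA';
       if spec_no=2 then species='VERSICOLOR';
       if spec_no=3 then species='VIRGINICA';
       label sepallen='Sepal Length in mm.'
             sepalwid='Sepal Width in mm.'
             petallen='Petal Length in mm.'
             petalwid='Petal Width in mm.';
       datalines;
[The data records that followed are not preserved.]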

35 Number of Discriminant Functions
    title 'Canonical Discriminant Analysis of IRIS data';
    proc candisc data=iris;
       class species;
       var sepallen--petalwid;
    run;
Which gives the results:
Canonical Discriminant Analysis of IRIS data
Canonical Discriminant Analysis
150 Observations   149 DF Total
4 Variables        147 DF Within Classes
3 Classes          2 DF Between Classes
Class Level Information
Columns: SPECIES, Frequency, Weight, Proportion; rows SETOSA, VERSICOLOR, VIRGINICA [values not preserved]
Multivariate Statistics and F Approximations
S=2 M=0.5 N=71
Columns: Statistic, Value, F, Num DF, Den DF, Pr > F; rows Wilks' Lambda, Pillai's Trace, Hotelling-Lawley Trace, Roy's Greatest Root [values not preserved]
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

36 Number of Discriminant Functions
Columns: Canonical Correlation, Adjusted Canonical Correlation, Approx Standard Error, Squared Canonical Correlation [values not preserved]
Eigenvalues of INV(E)*H = CanRsq/(1-CanRsq)
Columns: Eigenvalue, Difference, Proportion, Cumulative [values not preserved]
Test of H0: The canonical correlations in the current row and all that follow are zero
Columns: Likelihood Ratio, Approx F, Num DF, Den DF, Pr > F [values not preserved]
Total Canonical Structure: columns CAN1, CAN2; rows SEPALLEN (Sepal Length in mm.), SEPALWID (Sepal Width in mm.), PETALLEN (Petal Length in mm.), PETALWID (Petal Width in mm.) [values not preserved]

37 Number of Discriminant Functions
Between Canonical Structure: columns CAN1, CAN2; rows SEPALLEN, SEPALWID, PETALLEN, PETALWID [values not preserved]
Pooled Within Canonical Structure: columns CAN1, CAN2; rows SEPALLEN, SEPALWID, PETALLEN, PETALWID [values not preserved]

38 Number of Discriminant Functions
Total-Sample Standardized Canonical Coefficients: columns CAN1, CAN2; rows SEPALLEN, SEPALWID, PETALLEN, PETALWID [values not preserved]
Pooled Within-Class Standardized Canonical Coefficients: columns CAN1, CAN2; rows SEPALLEN, SEPALWID, PETALLEN, PETALWID [values not preserved]
Raw Canonical Coefficients: columns CAN1, CAN2; rows SEPALLEN, SEPALWID, PETALLEN, PETALWID [values not preserved]
Class Means on Canonical Variables: columns SPECIES, CAN1, CAN2; rows SETOSA, VERSICOLOR, VIRGINICA [values not preserved]

39 Number of Discriminant Functions The Wilks' lambda test indicates that there are two statistically significant discriminators at the five percent level. In general, the hypotheses to be tested are, as in factor analysis,
$$H_0 : \text{the number of discriminators} = m, \qquad H_1 : \text{more are needed}. \qquad (5)$$
On the basis of the within-matrices, the first discriminator indicates that the species differ with respect to the overall size of the sepals and petals, and the second that the species also differ with respect to their width.
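The sequential testing idea can be sketched with Bartlett's chi-square approximation, a common textbook alternative to the F approximation that SAS prints. For $H_0$: only the first $m$ functions are needed, it uses $V_m = \bigl(n - 1 - (p+q)/2\bigr) \sum_{j>m} \ln(1 + \lambda_j)$ with $(p - m)(q - m - 1)$ degrees of freedom. In the Python sketch below the eigenvalues are hypothetical placeholders (the SAS values are not reproduced here), the function name is illustrative, and n, p, q match the iris example.

```python
import math

def bartlett_residual_tests(eigenvalues, n, p, q):
    """Bartlett's V for H0: only the first m discriminant functions are needed.

    Returns a list of (m, wilks_lambda, chi_square, df) for m = 0, 1, ...
    """
    results = []
    scale = n - 1 - (p + q) / 2
    for m in range(len(eigenvalues)):
        # Wilks' lambda built from the eigenvalues left after the first m.
        wilks = math.prod(1.0 / (1.0 + lam) for lam in eigenvalues[m:])
        chi_square = -scale * math.log(wilks)
        df = (p - m) * (q - m - 1)
        results.append((m, wilks, chi_square, df))
    return results

# n, p, q as in the iris example; the two eigenvalues are hypothetical.
results = bartlett_residual_tests([30.0, 0.3], n=150, p=4, q=3)
for m, wilks, chi2, df in results:
    print(f"m={m}: Wilks lambda={wilks:.4f}, chi-square={chi2:.1f}, df={df}")
```

Each chi-square value is compared with the upper five percent point of the chi-square distribution with the stated degrees of freedom; the smallest $m$ whose test is not rejected is the number of discriminators retained, and if every test is rejected all of them are kept.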

40 Number of Discriminant Functions Example 9.6: Bankruptcy risk and signal to reorganization of a company (Laitinen, Luoma, Pynnönen 1996, UV, Discussion Papers 200) Thus we have four groups.

41 Number of Discriminant Functions Sample statistics:
Table 7. Descriptive statistics of groups for estimation data.
Groups: B1 (n=20), B2 (n=20), N3 (n=17), N4 (n=23)
Columns: Variable; Mean and Std Dev for each group; F for equality of means [values not preserved]
Significance of F: ROI ***, TCF ***, QRA **, SCA ***, DSR ***
**=significant at level 0.01 ***=significant at level 0.001

42 Number of Discriminant Functions Number of canonical discriminant functions: [test table not preserved] The results indicate that the third canonical discriminant function is also statistically significant.

43 Number of Discriminant Functions Canonical structure and standardized coefficients:
Table 11. Canonical structure and standardized canonical coefficients, both pooled within.
Columns: Variable; Canonical structure* CAN1, CAN2, CAN3; Standardized coefficient CAN1, CAN2, CAN3; rows ROI, TCF, QRA, SCA, DSR [values not preserved]
*Correlation coefficients between original variables and canonical variables.

44 Number of Discriminant Functions Interpretation of the discriminant functions:

45 Number of Discriminant Functions Group differences:

46 Number of Discriminant Functions
CAN1, the financial performance, shows that financial performance is the main characteristic differentiating healthy and bankrupt firms (as expected).
CAN2, the contrast between dynamic liquidity and static ratios, is the characteristic differentiating reorganizable non-bankrupt firms from reorganizable bankrupt firms.
CAN3, the contrast between liquidity and the other ratios, differentiates reorganizable non-bankrupt firms from healthy firms. The distinction is probably due to the fact that non-bankrupt firms may have cash reserves (high liquidity) but do not use them profitably.

The SAS System, 18:28 Saturday, March 10, 2018
[Figure: Plot of Canonical Variables Identified by Cluster]
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Initial Seeds
Columns: Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth [values not preserved]