
Master of Applied Statistics
ST116: Chemometrics and Multivariate Statistical Data Analysis
Per Bruun Brockhoff

Module 4: Computing

Contents

4.1 Computing section
    Example: Bivariate measurements on plants
    Post hoc analysis

4.1 Computing section

Example: Bivariate measurements on plants

The data are given here:

    data fertil;
      input ID N $ yield seedwt;
      cards;
      [eight data lines, four with N = Low followed by four with N = High; the numeric values are not preserved here]
    ;

Bar charts similar to those in Figure 4.1 can be obtained with the following lines:

    goptions colors=(blue blue blue blue red red red red);
    axis1 label=(a=90 r=0);
    title;
    proc gchart data=fertil;
      vbar ID / sumvar=yield type=sum discrete patternid=midpoint;
    proc gchart data=fertil;
      vbar ID / sumvar=seedwt type=sum discrete patternid=midpoint;
    run;

The univariate and multivariate analyses of variance and the CVA can all be carried out with the GLM procedure:

    options ls=70;
    proc glm data=fertil outstat=out;
      class N;   /* N is a character variable, so it must be declared as a class variable */
      model yield seedwt = N;
      manova h=N / printh printe canonical;
    run;

The first part of the output consists of the results from the two univariate analyses of variance, parts of which are shown here:

    Dependent Variable: yield

                                 Sum of
    Source              DF      Squares    Mean Square    F Value    Pr > F
    Model
    Error
    Corrected Total

    Dependent Variable: seedwt

                                 Sum of
    Source              DF      Squares    Mean Square    F Value    Pr > F
    Model
    Error
    Corrected Total

The multivariate analysis is carried out as a consequence of the manova statement. The options printh and printe request printing of the H = SS_Between and E = SS_Within matrices, and canonical requests the CVA. In the following we go through the output step by step. The first part is self-explanatory:

    E = Error SSCP Matrix

               yield    seedwt
    yield
    seedwt

Next, the group-corrected correlations calculated from this E matrix are given:

    Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|

    DF = 6     yield    seedwt
    yield
    seedwt

Then the H matrix is given:

    H = Type III SSCP Matrix for N

               yield    seedwt
    yield
    seedwt

And finally the CVA results are listed. As is often the case with SAS, more information is provided than we really want to look at. Hence only part of the information is explained here. Again we take the relevant information step by step and omit the rest. The title of each piece of output identifies it, so that the same information can be located in new applications.

The first set of relevant information is the eigenvalues of the $\mathrm{SS}_{\text{Within}}^{-1}\,\mathrm{SS}_{\text{Between}}$ matrix:

    Eigenvalues of Inv(E)*H
    = CanRsq/(1-CanRsq)

    Eigenvalue    Difference    Proportion    Cumulative

In this case there is only one eigenvalue, because there are only two groups in the analysis. Next, a list of statistical hypothesis tests regarding the number of components is given. Again, in this case only one test is listed, namely the test for zero components, that is, no structure at all:

    Test of H0: The canonical correlations in the
    current row and all that follow are zero

    Likelihood    Approximate
         Ratio        F Value    Num DF    Den DF    Pr > F

    NOTE: The F statistic is exact.

Note that this is a different test from the simple test given by the left-out eigenvalues. The hypothesis of zero components is the same as the hypothesis that all groups are equal. It is therefore the same overall hypothesis that the MANOVA tests in the first place, and it gives the same result.

Next SAS provides some information related to the structural loadings:

    Canonical Structure

               Total    Between    Within
               Can1     Can1       Can1
    yield
    seedwt

These numbers are correlations between the canonical components (scores) and the original variables. Under the Total heading the correlations are calculated across all observations. Under the Between heading they are calculated at group level, and under Within they are the group-corrected (partial) correlations. The structural loadings, as defined in the main text, can be obtained from the Within correlations by scaling with the within-group standard deviation of each original variable. Writing $r_{\text{yield}}$ and $r_{\text{seedwt}}$ for the two Within correlations:

$$b_1^T = \left(r_{\text{yield}}\sqrt{5.8333},\; r_{\text{seedwt}}\sqrt{2.0000}\right) = (0.707,\; 0.283)$$

The unscaled discriminative loadings are given under the heading Raw:

    Canonical Coefficients

               Standardized    Raw
               Can1            Can1
    yield
    seedwt

The raw coefficients (the second column) are the ones to apply directly to new observations to obtain the value (projection) of such an observation in the subspace, for instance in order to classify it to the most likely group. They must also be scaled with the within-group standard deviations to obtain the interpretable discriminative loadings, as defined in the main text. Writing $a_{\text{yield}}$ and $a_{\text{seedwt}}$ for the raw coefficients:

$$l_1^T = \left(a_{\text{yield}}\sqrt{5.8333},\; a_{\text{seedwt}}\sqrt{2.0000}\right) = (2.049,\; 2.000)$$

The standardized coefficients given by SAS, (2.210, 2.000), are the raw coefficients multiplied by the standard deviations across all observations (including group differences):

$$\left(a_{\text{yield}}\sqrt{6.7857},\; a_{\text{seedwt}}\sqrt{2.0000}\right) = (2.210,\; 2.000)$$

Coincidentally, the total variance of seedwt equals its within-group variance of 2. It is more correct to use the definition given here than the SAS version.

Finally, the overall MANOVA hypothesis tests are provided:

    MANOVA Test Criteria and Exact F Statistics for
    the Hypothesis of No Overall N Effect
    H = Type III SSCP Matrix for N
    E = Error SSCP Matrix

    S=1    M=0    N=1.5

    Statistic                  Value    F Value    Num DF    Den DF    Pr > F
    Wilks' Lambda
    Pillai's Trace
    Hotelling-Lawley Trace
    Roy's Greatest Root

Note that the canonical scores were not provided, and that neither of the two sets of loadings was provided explicitly. We now give a few additional SAS lines that save these three sets of information in SAS data sets, which can then be used for simple listing/printing or for subsequent plotting of the results.

The call to PROC GLM above was prepared for extracting the canonical scores by specifying the option outstat=out. This saves various information about the analysis, including the raw canonical coefficients a needed for constructing the scores with the SCORE procedure:

    proc score data=fertil score=out out=scores;
      var yield seedwt;
    run;

The canonical scores are now saved in the data set scores with the names can1, can2, etc. A print of the data set is given by

    proc print data=scores;
      var N yield seedwt can1;
    run;

    Obs    N       yield    seedwt    CAN1
    1      Low
           Low
           Low
           Low
           High
           High
           High
           High

A print of the contents of the result data set out gives a short version of most of the important information:

    proc print data=out round;
    run;

    Obs    _NAME_    _SOURCE_    _TYPE_      yield    seedwt    DF    SS    F    PROB
    1      yield     ERROR       ERROR
           seedwt    ERROR       ERROR
           yield     N           SS
           seedwt    N           SS
                     N           CANCORR
           CAN1      N           STRUCTUR
           CAN1      N           SCORE
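As a quick numerical check of the coefficient scaling described for the canonical coefficients, the following minimal sketch (in Python rather than SAS, purely for the arithmetic) rescales the raw yield coefficient. It assumes that 0.8485 is the raw Can1 coefficient for yield, as it appears in the can1 formula used in this section, and takes the variances 5.8333 (within groups) and 6.7857 (all observations) as quoted in the text. The variable and function names are illustrative only.

```python
import math

# Values quoted in the text for the yield variable (assumed to be read correctly).
raw_yield = 0.8485          # raw canonical coefficient for yield
within_var_yield = 5.8333   # within-group variance of yield
total_var_yield = 6.7857    # variance of yield across all observations


def scale(coef, variance):
    """Multiply a raw coefficient by the standard deviation sqrt(variance)."""
    return coef * math.sqrt(variance)


# Discriminative loading: raw coefficient times within-group standard deviation.
discriminative = scale(raw_yield, within_var_yield)

# SAS "standardized" coefficient: raw coefficient times total standard deviation.
standardized = scale(raw_yield, total_var_yield)

print(round(discriminative, 3))
print(round(standardized, 3))
```

To three decimals the two results agree with the first entries of (2.049, 2.000) and (2.210, 2.000) respectively.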

In the last line of the data set out the raw canonical coefficients are given, and the first two lines hold the SS Within matrix (together with the corresponding degrees of freedom). The SAS lines below extract this information, transform the raw coefficients into both sets of loadings, and construct a data set for each set of loadings. (Even though there is only a single component in this case, the programme works for any number of components.)

    data E;    /* Extracting the E-matrix */
      set out;
      if _source_ = 'ERROR';
      keep yield seedwt;
    data DF;   /* Extracting the df */
      set out;
      if _source_ = 'ERROR';
      keep df;
    data A;    /* Extracting the A matrix */
      set out;
      if _type_ = 'SCORE';
      keep yield seedwt;
    proc iml;  /* used for the matrix manipulation/transformations */
      use E;  read all into E;
      use A;  read all into A;
      use DF; read all into df;
      E = E#(1/df);            /* within-group covariance matrix */
      diags = sqrt(diag(E));   /* within-group standard deviations */
      tA = t(A);
      L = diags*tA;            /* discriminative loadings */
      B = E*tA;                /* structural loadings */
      create B from B; append from B;
      create L from L; append from L;
    quit;      /* end of IML */
    title 'Structural Loadings';
    proc print data=b;
    run;
    title 'Discriminative Loadings';
    proc print data=l;
    run;
    title;

The result is:

    Structural Loadings

    Obs    COL1

    Discriminative Loadings

    Obs    COL1

Post hoc analysis

It is possible to obtain a hypothesis test for group differences for, e.g., the average of the two measurements by using the M= option of the MANOVA statement together with the usual contrast feature known from univariate ANOVA:

    proc glm data=fertil outstat=out;
      class N;
      model yield seedwt = N;
      contrast 'Group difference' N -1 1;
      manova h=N M=( ) / summary;
    run;

The additional output this gives is:

    M Matrix Describing Transformed Variables

               yield    seedwt
    MVAR1

    Characteristic Roots and Vectors of: E Inverse * H, where
    H = Type III SSCP Matrix for N
    E = Error SSCP Matrix

    Variables have been transformed by the M Matrix

    Characteristic               Characteristic Vector V'EV=1
              Root    Percent    MVAR1

    MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall N Effect
    on the Variables Defined by the M Matrix Transformation
    H = Type III SSCP Matrix for N
    E = Error SSCP Matrix

    S=1    M=-0.5    N=2

    Statistic                  Value    F Value    Num DF    Den DF    Pr > F
    Wilks' Lambda
    Pillai's Trace
    Hotelling-Lawley Trace
    Roy's Greatest Root

    Dependent Variable: MVAR1

    Source    DF    Type III SS    Mean Square    F Value    Pr > F
    N
    Error

    Contrast            DF    Contrast SS    Mean Square    F Value    Pr > F
    Group difference

Note that the multivariate test equals the univariate F-test for the average, since the average constitutes only a single variable. The optimal F-test is obtained by using the canonical coefficients in the M matrix:

    proc glm data=fertil outstat=out;
      class N;
      model yield seedwt = N;
      contrast 'Group difference' N -1 1;
      manova h=N M=( ) / summary;
    run;

giving

    Dependent Variable: MVAR1

    Source    DF    Type III SS    Mean Square    F Value    Pr > F
    N
    Error

Or equivalently, using the canonical scores in a univariate analysis:

    proc glm data=scores;
      class N;
      model can1 = N;
      contrast 'Group difference' N -1 1;
    run;

Or equivalently, computing the canonical score directly from the raw coefficients:

    data can1scores;
      set fertil;
      can1=0.8485*yield *seedwt;
    proc glm data=can1scores;
      class N;
      model can1 = N;
      estimate 'Group difference' N -1 1;
    run;
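The claim that the canonical coefficients give the optimal F-test can be illustrated with a small sketch. It is written in Python for the arithmetic only and uses made-up measurements, not the plant data of this example. It relies on two standard facts: with only two groups the canonical direction is proportional to $E^{-1}$ times the difference between the group means, and the single eigenvalue $\lambda$ of $E^{-1}H$ gives Wilks' Lambda as $1/(1+\lambda)$ and the F statistic of the canonical score as $\lambda \cdot \mathrm{df}_E/\mathrm{df}_H$. All names in the code are hypothetical.

```python
# Hypothetical (yield, seedwt) measurements; NOT the data from the text.
low = [(10.0, 5.0), (12.0, 7.0), (14.0, 6.0), (16.0, 8.0)]
high = [(15.0, 4.0), (17.0, 3.0), (19.0, 5.0), (21.0, 4.0)]
groups = [low, high]


def mean(rows):
    """Column means of a list of (x1, x2) pairs."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def sscp(groups, between):
    """H (between=True) or E (between=False) as a 2x2 SSCP matrix."""
    grand = mean([r for g in groups for r in g])
    m = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        gm = mean(g)
        if between:
            devs = [(gm[0] - grand[0], gm[1] - grand[1])] * len(g)
        else:
            devs = [(r[0] - gm[0], r[1] - gm[1]) for r in g]
        for d in devs:
            for i in range(2):
                for j in range(2):
                    m[i][j] += d[i] * d[j]
    return m


def inverse(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def f_ratio(coef, groups):
    """One-way ANOVA F statistic for the score coef[0]*x1 + coef[1]*x2."""
    scores = [[coef[0] * r[0] + coef[1] * r[1] for r in g] for g in groups]
    flat = [s for g in scores for s in g]
    grand = sum(flat) / len(flat)
    ss_b = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in scores)
    ss_w = sum((s - sum(g) / len(g)) ** 2 for g in scores for s in g)
    return (ss_b / (len(groups) - 1)) / (ss_w / (len(flat) - len(groups)))


E = sscp(groups, between=False)
H = sscp(groups, between=True)
E_inv = inverse(E)

# With two groups H has rank one, so the only nonzero eigenvalue of
# inv(E)*H equals its trace.
eigenvalue = sum(E_inv[i][k] * H[k][i] for i in range(2) for k in range(2))
wilks = 1.0 / (1.0 + eigenvalue)

# Canonical direction (up to scale): inv(E) * (mean_high - mean_low).
diff = [h - l for h, l in zip(mean(high), mean(low))]
canonical = [E_inv[i][0] * diff[0] + E_inv[i][1] * diff[1] for i in range(2)]

f_canonical = f_ratio(canonical, groups)
f_average = f_ratio([0.5, 0.5], groups)
print(eigenvalue, wilks, f_canonical, f_average)
```

With these made-up numbers the eigenvalue is 7.5, so the canonical score gives F = 45 on 1 and 6 degrees of freedom, against about 1.12 for the plain average. No other linear combination of the two variables can give a larger F.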
