Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA

Similar documents
MANOVA is an extension of the univariate ANOVA as it involves more than one Dependent Variable (DV). The following are assumptions for using MANOVA:

Application of Ghosh, Grizzle and Sen s Nonparametric Methods in. Longitudinal Studies Using SAS PROC GLM

Stevens 2. Aufl. S Multivariate Tests c

Performing response surface analysis using the SAS RSREG procedure

GLM Repeated Measures

Paper: ST-161. Techniques for Evidence-Based Decision Making Using SAS Ian Stockwell, The Hilltop UMBC, Baltimore, MD

Neuendorf MANOVA /MANCOVA. Model: X1 (Factor A) X2 (Factor B) X1 x X2 (Interaction) Y4. Like ANOVA/ANCOVA:

STAT 501 Assignment 2 NAME Spring Chapter 5, and Sections in Johnson & Wichern.

Chapter 7, continued: MANOVA

Neuendorf MANOVA /MANCOVA. Model: X1 (Factor A) X2 (Factor B) X1 x X2 (Interaction) Y4. Like ANOVA/ANCOVA:

Statistics 5100 Spring 2018 Exam 1

Multivariate analysis of variance and covariance

Multivariate Linear Regression Models

Neuendorf MANOVA /MANCOVA. Model: MAIN EFFECTS: X1 (Factor A) X2 (Factor B) INTERACTIONS : X1 x X2 (A x B Interaction) Y4. Like ANOVA/ANCOVA:

Least Squares Estimation

Business Statistics. Lecture 10: Correlation and Linear Regression

MULTIVARIATE ANALYSIS OF VARIANCE

Chapter 9. Multivariate and Within-cases Analysis. 9.1 Multivariate Analysis of Variance

Multivariate Regression (Chapter 10)

Gregory Carey, 1998 Regression & Path Analysis - 1 MULTIPLE REGRESSION AND PATH ANALYSIS

Lecture 3: Inference in SLR

General Linear Model (Chapter 4)

On MANOVA using STATA, SAS & R

Types of Statistical Tests DR. MIKE MARRAPODI

STAT Checking Model Assumptions

Applied Multivariate Analysis

General Principles Within-Cases Factors Only Within and Between. Within Cases ANOVA. Part One

Correlation and Simple Linear Regression

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions

Analysing data: regression and correlation S6 and S7

An Introduction to Multivariate Statistical Analysis

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur

You can compute the maximum likelihood estimate for the correlation

UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Applied Statistics Friday, January 15, 2016

Statistics in medicine

Rejection regions for the bivariate case

POWER AND TYPE I ERROR RATE COMPARISON OF MULTIVARIATE ANALYSIS OF VARIANCE

Chapter 4. Regression Models. Learning Objectives

M A N O V A. Multivariate ANOVA. Data

Chapter 5: Multivariate Analysis and Repeated Measures

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

WITHIN-PARTICIPANT EXPERIMENTAL DESIGNS

Generating Half-normal Plot for Zero-inflated Binomial Regression

The SEQDESIGN Procedure

Multiple Linear Regression

Inferences for Regression

Leonor Ayyangar, Health Economics Resource Center VA Palo Alto Health Care System Menlo Park, CA

Principal Component Analysis, A Powerful Scoring Technique

1 Introduction to Minitab

Homework 2: Simple Linear Regression

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression

Bootstrapping, Randomization, 2B-PLS

Lecture 5: Hypothesis tests for more than one sample

Biostatistics. Correlation and linear regression. Burkhardt Seifert & Alois Tschopp. Biostatistics Unit University of Zurich

Analysis of Longitudinal Data: Comparison between PROC GLM and PROC MIXED.

Repeated Measures ANOVA Multivariate ANOVA and Their Relationship to Linear Mixed Models

MANOVA MANOVA,$/,,# ANOVA ##$%'*!# 1. $!;' *$,$!;' (''

Applied Multivariate and Longitudinal Data Analysis

Analysis of Covariance. The following example illustrates a case where the covariate is affected by the treatments.

Outline. Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping

Biostatistics. Chapter 11 Simple Linear Correlation and Regression. Jing Li

Correlation and the Analysis of Variance Approach to Simple Linear Regression

SESUG 2011 ABSTRACT INTRODUCTION BACKGROUND ON LOGLINEAR SMOOTHING DESCRIPTION OF AN EXAMPLE. Paper CC-01

TWO-FACTOR AGRICULTURAL EXPERIMENT WITH REPEATED MEASURES ON ONE FACTOR IN A COMPLETE RANDOMIZED DESIGN

In most cases, a plot of d (j) = {d (j) 2 } against {χ p 2 (1-q j )} is preferable since there is less piling up

Lecture 3: Multiple Regression. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II

Implementation of Pairwise Fitting Technique for Analyzing Multivariate Longitudinal Data in SAS

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

df=degrees of freedom = n - 1

Analyzing and Interpreting Continuous Data Using JMP

Lecture 11: Simple Linear Regression

Multivariate Linear Models

Czech J. Anim. Sci., 50, 2005 (4):

The Model Building Process Part I: Checking Model Assumptions Best Practice

STAT 501 EXAM I NAME Spring 1999

Repeated Measures Part 2: Cartoon data

Analysis of Longitudinal Data: Comparison Between PROC GLM and PROC MIXED. Maribeth Johnson Medical College of Georgia Augusta, GA

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

Multivariate Analysis of Variance

Nemours Biomedical Research Statistics Course. Li Xie Nemours Biostatistics Core October 14, 2014

Review of the General Linear Model

Introduction to Power and Sample Size Analysis

SAS/STAT 14.1 User s Guide. Introduction to Structural Equation Modeling with Latent Variables

Exam Applied Statistical Regression. Good Luck!

Software for genome-wide association studies having multivariate responses: Introducing MAGWAS

23. Inference for regression

Applied Multivariate and Longitudinal Data Analysis

Other hypotheses of interest (cont d)

STATISTICS 479 Exam II (100 points)

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

Chapter 9. Correlation and Regression

Multiple linear regression S6

Lecture 6: Single-classification multivariate ANOVA (k-group( MANOVA)

GMM Logistic Regression with Time-Dependent Covariates and Feedback Processes in SAS TM

COLLABORATION OF STATISTICAL METHODS IN SELECTING THE CORRECT MULTIPLE LINEAR REGRESSIONS

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Logistic Regression in R. by Kerry Machemer 12/04/2015

SAS macro to obtain reference values based on estimation of the lower and upper percentiles via quantile regression.

Transcription:

Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA ABSTRACT Regression analysis is one of the most used statistical methodologies. It can be used to describe or predict causal relationships between a response variable and a set of predictor variables. An extension of univariate (one response variable) regression analysis is multivariate regression analysis. Multivariate regression extends regression analysis into more than one response variable. In a multiple linear regression model, linear association between the response variable and the predictors are assumed. In addition, the random errors are assumed to be un-correlated and normally distributed with a constant variance. In conducting a multivariate regression analysis, the assumptions are similar to multiple linear regression models but extended to multivariate aspect. In this paper, we introduce SAS codes for data analyses related to multivariate regression using PROC REG. The mtest statement in PROC REG is a key component for multivariate regression analysis. An example is introduced to demonstrate the usage of the mtest statement. INTRODUCTION Regression analysis was first developed in 19 th century and is one of the most used statistical methods (Kutner et al, 2004). It can be used for prediction or for assessing an association between two variables. The purpose of this paper is to review the mtest statement in PROC REG to test hypotheses in multivariate regression. Multivariate regression is a statistical method that is useful in many fields such as medical industry and psychology among others. It is an extension of a univariate regression model (single dependent variable) to a model with multiple response variables. An example of fitting a multivariate regression model is to predict a subject s systolic blood pressure and diastolic blood pressure based on BMI, age, and alcohol consumption. In this example, there are two response variables (systolic blood pressure and diastolic blood pressure) and three predictor variables (BMI, age, and alcohol). In this example, the three predictor variables (BMI, age, and alcohol) are used to predict both response variables, systolic blood pressure and diastolic blood pressure. An approach for the analysis is to fit two univariate multiple linear regression models (one model for each response variable) for the two response variables and interpret the results independently. While the parameter estimates are the same by using either univariate or multivariate regression model, univariate approaches may not be able to address important scientific questions. In addition, using a univariate approach will be less efficient in constructing simultaneous confidence intervals for regression coefficients when the correlation among the response variables is high. In any statistical data analysis, checking assumptions are always an essential step. Assumptions needed for regression analysis include random errors following normal distribution with constant variance and being uncorrelated (under normal distribution, uncorrelated and independent are equivalent). Assumptions for multivariate regression analysis are similar to assumptions in univariate regression analysis but they are extended to multivariate aspect. The assumptions include the random error vector to follow a multivariate normal distribution with homogeneity of variancecovariance matrix (Johnson and Wichern, 2007). To test multivariate normality, an introduction and SAS code can be found in SAS Customer Support website (SAS Institute online website). A macro (multnorm) that is used to test multivariate normality can also be found under the same SAS Customer Support website (SAS Institute online website). To test homogeneity of variance covariance matrix, the Box s M test can be applied. In doing so, one can partition the data into several groups based on X values and apply the Box s M test to test homogeneity of variancecovariance matrix among the partitioned groups. The Box s M test can be produced using PROC DISCRIM procedure. More information for the Box s M test can be found in SAS STAT manual (SAS Institute (2008)). This paper emphasizes on providing SAS codes for hypothesis tests in multivariate regression analyses through an example. A quick overview of multiple linear regression and multivariate regression is given. An example is used to address interesting scientific questions and how the corresponding SAS codes are written. Finally, a conclusion that summarizes this paper is provided. MULTIPLE LINEAR REGRESSION VS. MULTIVARIATE REGRESSION For a linear regression model with one predictor variable (so called simple linear regression model), we can state the model as: Y i = β 0 + β 1 X i + ε i, 1

where Y i is the i th response, X i is the i th observation on predictor variable X and is a known constant, β 0 and β 1 are unknown parameters, and ε i is a random error following a normal distribution with 0 mean and a constant variance. In addition, ε i and ε j are assumed to be uncorrelated. In this model, the slope β 1 represents how much change the outcome variable Y will be when the value of x is changed by one unit. The intercept β 0 represents the value of Y when the value of X is 0. Depending on the range of the collected data, it is possible that the intercept is meaningless (in the case that the range of x values does not cover 0, which will run into an extrapolation issue for interpreting the meaning of the intercept). If there is more than one predictor variable, it is called a multiple linear regression model. We can use a matrix format to present the multiple linear regression model: Y = Xβ + ε, where Y is an n x 1 response vector, X is an n x (p+1) matrix with known constant elements, β is a (p+1) x 1 parameter vector, and ε is an n x 1 random vector. In this model, we assume there are p predictor variables. An extension of a multiple linear regression model is to consider the model with more than one response variable. In this case, it is called multivariate regression analysis. A multivariate regression model with k response variables can be expressed as Y = XB + ε, where Y is an n x k response matrix, X is an n x (p+1) matrix, B is a (p+1) x k parameter matrix, and ε is an n x k random error matrix. For the two models described above, the design matrix X is identical for both models. That is, we use same set of independent variables to predict different response variables. The four matrices in the model can be expressed as follow: y 1 1 x 11 x 1p ε β y 2 1 x 21 x 01 β 02 β 1 0k 2p ε β Y =, X =, B = [ 11 β 12 β 2 1k ], ε =. β [ y n ] [ 1 x n1 x p1 β p2 β pk np] [ ε n ] Note y i (1 x k) represents k outcomes from i th subject. Note also if k = 1, the above multivariate regression model is the same as the usual univariate multiple linear regression model. The linear function of the mtest statement in PROC REG is straightforward for most parameter tests in multivariate regression. For complicated multivariate tests, one needs to rewrite the hypothesis tests into a multivariate general linear hypothesis H 0 : LBM = 0, where L and M are matrices to be decided so that H 0 : LBM = 0 and the desired hypothesis are identical. The matrix B is the parameter matrix defined above. The elements in L and M are used to decide the linear functions of the mtest statement. USING PROC REG FOR MULTIVARIATE REGRESSION SAS PROC REG provides tools for regression model fitting, model selections, and diagnostic analyses. Diagnostic plots such as residual plot, studentized residual plot, histogram of the residual, quantile-quantile plot (QQ plot), and Cook s distance are automatically produced for a newer version of SAS. For SAS 9.2, one has to specify ODS GRAPHICS ON; and ODS GRAPHICS OFF; statements to generate the specified diagnostic plots. Note the k outcome measures are collected in each subject and they are expected to be correlated. If the correlations among the response variables are small, univariate regression approaches can be applied since there is not much gain by using multivariate approaches. The gain of using multivariate approaches is more substantial when the correlations among the response variables are high. As mentioned in the Introduction section, the SAS macro, multnorm, can be used to test multivariate normal assumption and Box s M test under PROC DISCRIM can be used to test homogeneity of variance-covariance matrix assumption. The mtest statement in PROC REG is used for analyses related to multivariate regression models. If no equations or options are specified, the mtest statement will test the hypothesis that all estimated parameters except the intercept are zero. The mtest statement for different multivariate tests is introduced through an example in next section. AN EXAMPLE In this section, we discuss several scenarios from an example to demonstrate the usage of mtest statement in PROC REG procedure. The example we use is weightlifting data collected from the International Weightlifting Federation 2

(IWF) (IWF(2015)). Weightlifting competition is an Olympic event that is categorized by an athlete s weight and gender. There are eight categories (from 56 kg to 105+ kg) for men and seven categories (from 48 kg to 75+ kg) for women. Two lifts (the snatch and the clean and jerk) are required for each competition. At most three attempts are allowed for each lift. Three champions are awarded for each category (the snatch, the clean and jerk, and the aggregate of the best snatch and the best clean and jerk (total)). While there is no age category for the Olympic game, the IWF does maintain world records categorized by age group as well. This example was chosen to demonstrate multivariate tests in multivariate regression analysis. In addition, we use this example to show advantages of multivariate regression over univariate regression in assessing scientific questions. We consider two predictors (AGE and athlete s bodyweight (WT)) and three outcome variables (the snatch, the clean and jerk, and the total) in this example. Note the peak performance age for this sport is around 40 and it is older than the peak performance age of most other sports. For demonstrating purposes, we use six age groups (age from 42 to 72) and seven WT categories (range from 56 kg to 105 kg) from men s records only. We want to assess if AGE and WT variables are linearly associated with the three outcome variables (the snatch, the clean and jerk, and the total). We want to assess if the predictor variables AGE and WT are good predicting variables for the response variables. Can we build a model with good predicting power to predict best athlete s performance based on the athlete s AGE and WT? If an association is identified, we can also test if the AGE parameter is the same for the snatch and the clean and jerk responses. That is, under same WT category, we want to see if the effect of AGE on the snatch and AGE on the clean and jerk responses is identical. Similarly, we can evaluate the WT predictor variable on the snatch and the clean and jerk outcomes as well. Since the response variable TOTAL is the aggregate of the best snatch lift and the best clean and jerk lift, we can also test if the AGE parameter associated with the total response is equal to the sum of the AGE parameters associated with the snatch and the clean and jerk responses. Note all meaningful multivariate tests can be performed through mtest statement. For a more complicated test, it is easier to rewrite the null hypothesis into a form of H 0: LBM = 0, where B is the parameter matrix and L and M are matrices need to be decided so that LBM = 0 and the desired test are equivalent. The elements of the matrices L and M will be used for the mtest statement. Another test we can perform is if the parameter difference between the sum of SNATCH and CLEAN and the TOTAL is identical for the AGE or for the WT variable. In this example, the dimensions for the multivariate regression model are n = 42 (42 observations), p = 2 (two predictor variables, AGE and WT), and k = 3 (three response variables, SNATCH, CLEAN, and TOTAL). The parameter matrix B for this example is β 01 β 02 β 03 B = [ β 11 β 12 β 13 ]. β 21 β 22 β 23 Note β 0i is the intercept for the ith outcome variable, β 1i (β 2i ) is the slope associated with AGE (WT) variable for the i th response variable. The corresponding SAS codes for the mtest statement, the PROC REG procedure, and partial outputs for the previous stated questions are listed below. Before we conduct an overall test for the regression analysis, scatterplots are generated to check linearity assumption. The scatterplots (shown below) show that within each WT class, a linear trend is a reasonable pattern to describe the association between outcome variables and age variable. The scatterplots also reveal that the weight lifting ability is negatively correlated with age (left panel). That is, from age 40 to age 70, the older an athlete is, the lighter an athlete can lift. This is consistent with all three response variables (SNATCH, CLEAN, and TOTAL). Similar findings apply to the WT variable as well. The linearity trend is also suitable for the WT variable. A positive correlation between the three response variables and the WT variable is observed. That is, within each age group, the heavier an athlete s body mass, the more weight an athlete can lift. From the scatterplots, we can fit linear regression models for the response and predictor variables we considered here. In this example, we want to study the association between the three response variables (SNATCH, CLEAN, and TOTAL) and the two predictor variables (WT and AGE). 3

The left panel of the previous scatterplots was generated by the PROC SGPANEL procedure. The SAS code is shown below: proc sgpanel data = wtlift ; PANELBY wt / LAYOUT = COLUMNLATTICE ; SCATTER X = age Y = snatch; run; Similarly, the SAS code for the right panel plots is: proc sgpanel data = wtlift ; PANELBY age / LAYOUT = COLUMNLATTICE ; SCATTER X = wt Y = snatch; run; To evaluate the dependency among the three response variables and between the response variables and the predictor variables, a Pearson correlation coefficients matrix is generated and shown below. As expected, the three response variables are highly correlated. The correlations between the response variables and the AGE variable are negative (from -0.6 to -0.65) while the correlations between the response variables and the WT variable are positive (from 0.72 to 0.77). These findings are consistent with the findings from previous scatterplots. 4

Pearson Correlation Coefficients, N = 42 Prob > r under H0: Rho=0 snatch clean total wt age snatch 1.00000 0.98928 clean 0.98928 total 0.99420 wt 0.72269 age -0.65166 0.99420 1.00000 0.99680 0.99680 0.76608-0.59813 0.72269 0.76608 1.00000 0.75071 0.75071-0.62231-0.65166-0.59813-0.62231 1.00000 0.00000 1.0000 0.00000 1.0000 1.00000 Is there a linear association between the outcome variables and the set of independent variables (an overall test)? Typically this is the first hypothesis test to be performed in regression analysis. The null hypothesis for this test is H 0 : β 1 = β 2 = 0. To conduct the test, we can add mtest statement to PROC REG procedure. The SAS code is provided below: PROC REG DATA = wtlift; MTEST; Note in the MODEL statement, there are three variables (snatch, clean, and total) on the left hand side of the equal sign. These are three response variables we considered. Independent variables WT and AGE are listed on the right hand size of the equal sign as usual. The WT and AGE variables will be used for predicting SNATCH, CLEAN, and TOTAL variables in this analysis. With mtest statement, a multivariate test will be performed on top of univariate regression analysis outputs from PROC REG procedure. We expect the estimated coefficients of WT and AGE are different for different response variables. Partial outputs including parameter estimates tables for the three univariate regression analyses and a multivariate test table are shown below. The three parameter estimates tables in order are for SNATCH, CLEAN, and TOTAL response variables, respectively. Both WT and AGE predictor variables are linearly associated with three response variables. The three fitted regression lines are SNATCH = 121.27 + 0.949*WT -1.632*AGE CLEAN = 135.15 + 1.217*WT -1.811*AGE TOTAL = 252.35 + 2.187*WT -3.456*AGE As we expected, WT variable is positively correlated with all three response variables while AGE variable is negatively correlated with these three response variables. From the fitted model, we can predict that an athlete can lift 0.949 kg more if the body mass is increased by 1 kg using SNATCH lift (AGE is held constant). Similarly, an athlete will lift 1.632 kg lighter if AGE is increased by 1 year (WT is held constant). R squares (not shown) for the three fitted models are all very high (range from.94 to.96). These indicate that both WT and AGE are very good predictor variables for predicting how much weight an athlete can lift in the three considered categories (SNATCH, CLEAN, and TOTAL). 5

Variable DF Parameter Estimates (SNATCH) Parameter Estimate Standard Error t Value Pr > t Intercept 1 121.27092 6.35127 19.09 Wt 1 0.94918 0.04844 19.59 Age 1-1.63184 0.09236-17.67 Variable DF Parameter Estimates (CLEAN) Parameter Estimate Standard Error t Value Pr > t Intercept 1 135.15105 7.84591 17.23 Wt 1 1.21685 0.05984 20.34 Age 1-1.81143 0.11409-15.88 Variable DF Parameter Estimates (TOTAL) Parameter Estimate Standard Error t Value Pr > t Intercept 1 252.34745 13.55889 18.61 Wt 1 2.18659 0.10341 21.14 Age 1-3.45592 0.19717-17.53 With the mtest statement been added to the Proc REG procedure, a multivariate statistics table shows the test statistics and p values are provided. In the table, four test statistics (Wilks Lambda, Pillai s Trace, Hotelling-Lawley Trace,and Roy s Greatest Root) are presented. All four p values in this table are very small. This indicates that there are linear associations between the two predictor variables and the three response variables. Note an R 2 type measure for the model can be calculated from the Wilk s Lambda. The subtraction of the Wilk s Lambda from 1 can be used as an R 2 type measure. In this case, the value is 1 0.0376 = 0.9624. This indicates that about 96% of the variation of the 3 response variables can be explained by the two predictor variables in the model. This value is similar to R 2 s calculated from the three univariate regression approaches (range from.94 to.96). Multivariate Statistics and F Approximations S=2 M=0 N=17.5 Wilks' Lambda 0.03755975 51.31 6 74 Pillai's Trace 1.17464085 18.03 6 76 Hotelling-Lawley Trace 19.97456401 121.63 6 47.596 Roy's Greatest Root 19.68759753 249.38 3 38 NOTE: F Statistic for Roy's Greatest Root is an upper bound. NOTE: F Statistic for Wilks' Lambda is exact. 6

The overall model test shows that there are strong linear associations between the three considered response variables and the two considered predictor variables. We further perform some subject-specific multivariate tests for this example. Are the fitted regression lines for SNATHC and CLEAN identical? Since the world records using SNATCH and CLEAN lifts are quite similar, we may want to test if the two regression lines are identical. The hypothesis for this test is H 0: β 01 = β 02, β 11 = β 12, β 21 = β 22. The null hypothesis assumes the intercepts, the coefficients of AGE and the coefficients of WT are all identical for the fitted lines associated with SNATCH and CLEAN responses. The corresponding SAS code for this test is: PROC REG data = wtlift; MTEST snatch-clean, intercept, wt, age; The output table shown below suggests that the two fitted lines are not identical (p value < 0.0001). S=1 M=0.5 N=18.5 Wilks' Lambda 0.01601561 798.71 3 39 Pillai's Trace 0.98398439 798.71 3 39 Hotelling-Lawley Trace 61.43909100 798.71 3 39 Roy's Greatest Root 61.43909100 798.71 3 39 Are AGE coefficients equal for SNATCH and CLEAN responses? In this test, we are interested in evaluating if AGE effect on SNATCH and CLEAN are equal under same WT. That is, we want to test if β 11 = β 12. Note β 11 and β 12 are the coefficients of AGE associated with SNATCH and CLEAN responses, respectively. From previous tables, we see the estimated AGE coefficients associated with SNATCH and CLEAN outcomes are -1.63 and -1.81, respectively. We want to test if the two coefficients are statistically significant. This can be done by adding a linear function into the mtest statement. In this test, we will have MTEST SNATCH- CLEAN, AGE;. The SNATCH-CLEAN component describes the coefficient difference for the SNATCH and CLEAN responses and the AGE component limits the coefficient difference to AGE only. The complete code for the test is shown below. PROC REG data = wtlift; MTEST snatch-clean, age; To be able to identify the output table easily, we can also add a label into the mtest statement. This can be done by typing any label followed by the : symbol before the mtest statement. A sample code can be: snatch_eq_clean4age: MTEST snatch-clean, age;. The following output table shows the result for testing the equivalence of the two coefficients associated with the AGE variable. S=1 M=-0.5 N=18.5 Wilks' Lambda 0.81784061 8.69 1 39 0.0054 Pillai's Trace 0.18215939 8.69 1 39 0.0054 Hotelling-Lawley Trace 0.22273214 8.69 1 39 0.0054 Roy's Greatest Root 0.22273214 8.69 1 39 0.0054 7

The result shows that coefficients of AGE for SNATCH and CLEAN are significantly difference. That means, an athlete s lifting ability for SNATCH and CLEAN AND JERK is significantly different when his age is changed by a year while the WT variable is held constant. Similarly, we can test the slopes for the WT variable on SNATCH and CLEAN response variables are well. This can be done by replacing AGE with WT in the mtest statement. PROC REG data = wtlift; MTEST snatch-clean, wt; The result table is shown below. S=1 M=-0.5 N=18.5 Wilks' Lambda 0.35731346 70.15 1 39 Pillai's Trace 0.64268654 70.15 1 39 Hotelling-Lawley Trace 1.79866314 70.15 1 39 Roy's Greatest Root 1.79866314 70.15 1 39 A highly significant result is observed for this test as well. Test H 0: β 11 + β 12 = β 13 Since the outcome variable TOTAL is the aggregate of the best snatch and the best clean and jerk lifting weights, it is worthwhile to test if the coefficients are additive for either AGE or WT variables. The stated hypothesis is for AGE variable only. The corresponding SAS code is PROC REG data = wtlift; MTEST snatch+clean-total, age; The p-value from the multivariate test is 0.8085 and thus we can conclude that there is no statistically significant difference between the sum of the AGE coefficients associated with SNATCH and CLEAN and the coefficient of AGE for TOTAL response. S=1 M=-0.5 N=18.5 Wilks' Lambda 0.99847513 0.06 1 39 0.8085 Pillai's Trace 0.00152487 0.06 1 39 0.8085 Hotelling-Lawley Trace 0.00152720 0.06 1 39 0.8085 Roy's Greatest Root 0.00152720 0.06 1 39 0.8085 Test H 0: β 21 + β 22 = β 23 Similarly, the argument of mtest statement for testing if the sum of coefficients of WT for SNATCH and CLEAN responses is equal to the coefficient of WT for TOTAL response is shown below. 8

PROC REG data = wtlift; MTEST snatch+clean-total, wt; Similar to the AGE case, a non-significant result has been observed. S=1 M=-0.5 N=18 Wilks' Lambda 0.93644654 2.58 1 38 0.1166 Pillai's Trace 0.06355346 2.58 1 38 0.1166 Hotelling-Lawley Trace 0.06786662 2.58 1 38 0.1166 Roy's Greatest Root 0.06786662 2.58 1 38 0.1166 CONCLUSION This paper describes data analyses in multivariate regression. The SAS procedure, PROC REG, accompanying with the mtest statement, is the key feature for multivariate regression related analyses. In this paper, we use a weight lifting example to demonstrate the usage of mtest statement in variety of tests related to multivariate regression. Advantages for using multivariate regression over univariate regression analysis are discussed. While a formal test for multivariate normal assumption is not available yet, the SAS macro, multnorm, can be used to check the assumption. In checking homogeneity of variance-covariance matrix for the residuals, we can apply Box s M test that can be generated from PROC DISCRIM procedure. Note the F test is quite robust for mild violation of model assumptions especially when the sample size is large. Most multivariate tests are straightforward by specifying a linear function in the mtest statement as shown in this paper. REFERENCES IWF (2015). IWF Masters Weightlifting Records, Retrieved in May 2015 from http://www.mastersweightlifting.org/. Johnson and Wichern (2007). Applied Multivariate Statistical Analysis(6 th ed).upper Sallde River, NJ: Pearson Prentice Hall. Kutner, Nachtsheim, and Neter. 2004. Applied Linear Regression Models- 4th Edition. McGraw-Hill Education. SAS Institute Inc, Multivariate Analysis Concepts. http://support.sas.com/publishing/pubcat/chaps/56903.pdf SAS Institute Inc, Macro to Test Multivariate Normality, http://support.sas.com/kb/24/983.html. SAS Institute Inc. 2008. SAS/STAT 9.2 User s Guide. Cary, NC: SAS Institute Inc. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Joey Lin Department of Mathematics & Statistics, San Diego State University San Diego, CA 92181 E-mail: cdlin@mail.sdsu.edu SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 9