Use of Transformations and the Repeated Statement in PROC GLM in SAS

Ed Stanek


Introduction

We describe how the Repeated Statement in PROC GLM in SAS transforms the data to provide tests of hypotheses of interest. A good reference for multivariate methods is Timm (1975). We use the SAS manual as a reference for the repeated statement. First, we briefly review notation and the context for multivariate data.

The Population, Sample, and Context

Consider a simple random sample of n subjects from a very large population. Assume that there are p measures of response made on each selected subject. The measures may correspond to measures of different characteristics of the subjects (such as age, height, weight, systolic blood pressure, etc.), or repeated measures of the same variable (total cholesterol) at different times or conditions. With these assumptions, we represent the vector of responses for the i-th selected subject as $Y_i = (Y_{i1} \; Y_{i2} \; \cdots \; Y_{ip})'$. We assume that selections of subjects are independent, but that measures on a selected subject may be correlated, representing $\mathrm{var}(Y_i) = \Sigma$.

In experimental settings, factor levels may be assigned to subjects (blocks) or to occasions within a subject (plots). Factors assigned to subjects are modeled with a design matrix such that

$$Y = X\beta + E,$$

where $Y = (Y_1 \; Y_2 \; \cdots \; Y_n)'$ is the $n \times p$ response matrix, $X$ is an $n \times k$ design matrix representing the levels of the factor assigned to the selected subjects, and

$$\beta = \begin{pmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1p} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2p} \\ \vdots & & & \vdots \\ \beta_{k1} & \beta_{k2} & \cdots & \beta_{kp} \end{pmatrix}$$

is a $k \times p$ matrix of parameters, with rows corresponding to levels of factor A, and columns corresponding to the p measures, or occasions. With these assumptions, $\mathrm{var}(\mathrm{vec}(Y)) = \Sigma \otimes I_n$, or equivalently $\mathrm{var}(\mathrm{vec}(Y')) = I_n \otimes \Sigma$.

Estimation under a General Linear Multivariate Model

Estimates of the parameters corresponding to least squares estimates can be obtained in a manner similar to estimates in univariate models. The estimates are given by

$$\hat{\beta} = (X'X)^{-1} X'Y.$$

Since $\mathrm{vec}(ABC) = (C' \otimes A)\,\mathrm{vec}(B)$, we can express $\mathrm{vec}(\hat{\beta}) = \left(I_p \otimes (X'X)^{-1}X'\right)\mathrm{vec}(Y)$, so that

$$\mathrm{var}\!\left(\mathrm{vec}(\hat{\beta})\right) = \left(I_p \otimes (X'X)^{-1}X'\right)\left(\Sigma \otimes I_n\right)\left(I_p \otimes X(X'X)^{-1}\right),$$

which simplifies to

$$\mathrm{var}\!\left(\mathrm{vec}(\hat{\beta})\right) = \Sigma \otimes (X'X)^{-1}.$$

The variance matrix is estimated by

$$\hat{\Sigma} = \frac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{n - k}.$$

Hypotheses under a General Linear Multivariate Normal Model

Traditionally, hypotheses for a general linear multivariate model have been specified as linear combinations of rows and columns of the $k \times p$ parameter matrix $\beta$. Such hypotheses are expressed as

$$\underset{g \times k}{L}\;\underset{k \times p}{\beta}\;\underset{p \times u}{M} = 0,$$

where L and M are matrices of constants. For example, the hypothesis that the average of the p parameters in the first row equals the average of the parameters in the second row is specified by setting $L = (1 \;\; {-1} \;\; 0 \;\cdots\; 0)$ and $M = \frac{1}{p}(1 \; 1 \;\cdots\; 1)'$. It should be clear that there are limitations in defining hypotheses in terms of $L\beta M = 0$: with such a structure it is not possible, for example, to choose L and M so as to test only the single hypothesis $\beta_{11} = \beta_{22}$. Using mixed models, any linear hypothesis concerning the elements of $\beta$ can be tested.

When $u = p$ and M is of full rank, M is a square matrix that can be viewed as a transformation matrix. The standard multivariate model can then be written as a model on a set of transformed random variables. This is the idea behind the Repeated Statement in PROC GLM.
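The vec/Kronecker algebra behind the variance derivation can be checked numerically. The following is an illustrative numpy sketch (not SAS); the dimensions and data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small illustrative data: n = 8 subjects, k = 2 design columns, p = 3 measures.
n, k, p = 8, 2, 3
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])  # made-up design
Y = rng.normal(size=(n, p))                                   # made-up responses

# Least-squares estimate, computed as in univariate models, one column at a time:
#   beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

def vec(A):
    return A.flatten(order="F")   # column-stacking vec operator

# The identity vec(ABC) = (C' kron A) vec(B) underlies the variance derivation.
A, B, C = rng.normal(size=(3, 4)), rng.normal(size=(4, 2)), rng.normal(size=(2, 5))
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# vec(beta_hat) = (I_p kron (X'X)^{-1}X') vec(Y), the expression that yields
# var(vec(beta_hat)) = Sigma kron (X'X)^{-1}.
H = np.kron(np.eye(p), np.linalg.solve(X.T @ X, X.T))
assert np.allclose(H @ vec(Y), vec(beta_hat))
```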

the non-singular matrix M. The model is given by

$$YM = X\beta M + EM,$$

which we express as

$$Y^* = X\beta^* + E^*,$$

where $Y^* = YM$, $\beta^* = \beta M$, and $E^* = EM$. Note that $\mathrm{vec}(Y^*) = \mathrm{vec}(YM) = (M' \otimes I_n)\,\mathrm{vec}(Y)$, while $\mathrm{vec}(\beta^*) = \mathrm{vec}(\beta M) = (M' \otimes I_k)\,\mathrm{vec}(\beta)$. Then in the transformed model, $E(Y^*) = X\beta^*$ and

$$\mathrm{var}\!\left(\mathrm{vec}(Y^*)\right) = (M' \otimes I_n)(\Sigma \otimes I_n)(M \otimes I_n) = (M'\Sigma M) \otimes I_n = \Sigma^* \otimes I_n,$$

where $\Sigma^* = M'\Sigma M$. Thus, all the analysis tools used for the usual multivariate model can be applied to the transformed multivariate model.

Example: Special Case for a Simple Response Error Model

Suppose that p independent measures are made on each selected subject in a simple random sample of n subjects from a large population. Let the model for the k-th measure on the i-th selected subject be given by

$$Y_{ik} = \mu_i + E_{ik},$$

where $\mathrm{var}(Y_i) = \sigma_e^2 I_p + \sigma^2 J_p = \Sigma$, with $\sigma_e^2 = \frac{1}{N}\sum_{s=1}^N \sigma_s^2$ representing the average response error variance in the population, and $\sigma^2 = \frac{1}{N}\sum_{s=1}^N (\mu_s - \mu)^2$ representing the variance between subjects.

Consider a transformation matrix $M = \left(\frac{1}{p}\mathbf{1}_p \;\; M_1\right)$, where we assume that $\mathbf{1}_p' M_1 = 0$. Thus, the column space spanned by the remaining columns is orthogonal to the column space of the one-vector. Using this transformation matrix, $Y^* = X\beta^* + E^*$ and $\mathrm{var}(\mathrm{vec}(Y^*)) = \Sigma^* \otimes I_n$, where $\Sigma^* = M'\Sigma M$. Using

the definitions of these terms,

$$\Sigma^* = M'\Sigma M = \begin{pmatrix} \frac{1}{p}\mathbf{1}' \\ M_1' \end{pmatrix} \left(\sigma_e^2 I_p + \sigma^2 J_p\right) \begin{pmatrix} \frac{1}{p}\mathbf{1} & M_1 \end{pmatrix} = \begin{pmatrix} \dfrac{\sigma_e^2 + p\sigma^2}{p} & 0 \\ 0 & \sigma_e^2\, M_1' M_1 \end{pmatrix}.$$

Frequently, we will require the columns of the second part of the transformation matrix to be orthogonal and normal (such that the inner product of each column vector with itself is 1). When this is true, $M_1'M_1 = I_{p-1}$, and hence

$$\Sigma^* = \begin{pmatrix} \dfrac{\sigma_e^2 + p\sigma^2}{p} & 0 \\ 0 & \sigma_e^2\, I_{p-1} \end{pmatrix}.$$

Notice that with this choice of transformation, the columns of the transformed random variables are independent. Thus, univariate methods can be used to test hypotheses concerning the transformed random variables.

Using the Repeated Statement in PROC GLM in SAS

In repeated measures studies, the same variable is measured under different conditions. Often, simple functions of the measures are of interest. For example, if there are 2 measures on each selected subject corresponding to a pretest and a posttest, then the difference between the two measures is of interest. This difference is specified in terms of an M matrix. If there is a factorial structure on the repeated measures, the main effects and interactions can be specified in terms of linear combinations. We consider two examples as illustrations.

First, suppose that p = 6 measures were made on each subject, one under each combination of factor levels in a two-factor study where Factor A had 2 levels and Factor B had 3 levels. Let the responses be organized as follows:

Level of Factor A:   A1     A1     A1     A2     A2     A2
Level of Factor B:   B1     B2     B3     B1     B2     B3
Response:            Y_i11  Y_i12  Y_i13  Y_i21  Y_i22  Y_i23

Suppose an appropriate model for the m-th measure on subject s under level j of Factor A and level k of Factor B is given by

$$Y_{sjkm} = \mu_{sjk} + E_{sjkm},$$

where $E(Y_{sjkm}) = \mu_{sjk}$ and $\mathrm{var}(Y_{sjkm}) = \sigma^2$. Let us define parameters corresponding to the average expected response under level j of Factor A and level k of Factor B by $\mu_{jk} = \frac{1}{N}\sum_{s=1}^N \mu_{sjk}$, and re-parameterize these averages in terms of a grand mean, main effects, and interactions (see bem032.sas):

$$\mu_{jk} = \mu + \alpha_j + \beta_k + (\alpha\beta)_{jk},$$

where $\mu = \frac{1}{ab}\sum_{j=1}^a \sum_{k=1}^b \mu_{jk}$, $\alpha_j = \bar{\mu}_{j\cdot} - \mu$, $\beta_k = \bar{\mu}_{\cdot k} - \mu$, and $(\alpha\beta)_{jk} = \mu_{jk} - \bar{\mu}_{j\cdot} - \bar{\mu}_{\cdot k} + \mu$.

Finally, let us define $\delta_{sjk} = \mu_{sjk} - \mu_s - \mu_{jk} + \mu$, where $\mu_s = \frac{1}{ab}\sum_{j=1}^a \sum_{k=1}^b \mu_{sjk}$ and $\mu = \frac{1}{Nab}\sum_{s=1}^N \sum_{j=1}^a \sum_{k=1}^b \mu_{sjk}$. Then

$$\mu_{sjk} = \mu_{jk} + (\mu_s - \mu) + \delta_{sjk}.$$

The term $\delta_{sjk}$ is the subject-by-treatment interaction. It is the difference between the subject effect under a specific treatment and the average subject effect. If a treatment affects subjects differently, this term will be non-zero. If the subject-by-treatment interaction is zero, the model simplifies. The mean structure is then given by

$$\mu_{sjk} = \mu_{jk} + (\mu_s - \mu) = \mu + \alpha_j + \beta_k + (\alpha\beta)_{jk} + \delta_s,$$

where $\delta_s = \mu_s - \mu$.
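The decomposition of cell means into a grand mean, main effects, and an interaction can be checked numerically. A minimal numpy sketch, using a made-up 2 x 3 table of cell means:

```python
import numpy as np

# Hypothetical 2x3 table of cell means mu_jk (rows: levels of A, cols: levels of B).
mu = np.array([[10.0, 12.0, 17.0],
               [11.0, 15.0, 19.0]])

grand = mu.mean()                    # overall mean
alpha = mu.mean(axis=1) - grand      # Factor A main effects
beta  = mu.mean(axis=0) - grand      # Factor B main effects
ab = mu - grand - alpha[:, None] - beta[None, :]   # AB interaction

# The re-parameterization reproduces every cell mean ...
recon = grand + alpha[:, None] + beta[None, :] + ab
assert np.allclose(recon, mu)
# ... and the effects satisfy the usual sum-to-zero constraints.
assert np.isclose(alpha.sum(), 0) and np.isclose(beta.sum(), 0)
assert np.allclose(ab.sum(axis=0), 0) and np.allclose(ab.sum(axis=1), 0)
```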

Post-multiplying $Y_i$ by the matrix

$$M = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 \\ 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & -1 & 1 & 0 & -1 & 0 \\ 1 & -1 & 0 & 1 & 0 & -1 \\ 1 & -1 & -1 & -1 & 1 & 1 \end{pmatrix}$$

results in

$$Z_i = Y_i M = \left( \sum_{j=1}^a \sum_{k=1}^b Y_{ijk}, \;\; \sum_{k=1}^b \left(Y_{i1k} - Y_{i2k}\right), \;\; \sum_{j=1}^a \left(Y_{ij1} - Y_{ij3}\right), \;\; \sum_{j=1}^a \left(Y_{ij2} - Y_{ij3}\right), \;\; \left(Y_{i11} - Y_{i13}\right) - \left(Y_{i21} - Y_{i23}\right), \;\; \left(Y_{i12} - Y_{i13}\right) - \left(Y_{i22} - Y_{i23}\right) \right).$$

The columns of this vector correspond to functions that represent main effects and interactions of the two factors. For example, simply testing the hypothesis that $E(Z_{i2}) = 0$ will test for a main effect of Factor A. In such settings, some of the univariate tests on the transformed data will be interpretable.
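To see how the columns of such an M act, and why the between-subject variance drops out of the contrast columns under the response error model above, consider a small numpy sketch (an illustration, not SAS; the response values and variance components are made up):

```python
import numpy as np

# One subject's responses, ordered (A1B1, A1B2, A1B3, A2B1, A2B2, A2B3).
y = np.array([5.0, 7.0, 6.0, 4.0, 9.0, 8.0])

# Columns: overall sum, Factor A contrast, two B contrasts (B1-B3, B2-B3),
# and the two corresponding AB interaction contrasts.
M = np.array([
    [1,  1,  1,  0,  1,  0],
    [1,  1,  0,  1,  0,  1],
    [1,  1, -1, -1, -1, -1],
    [1, -1,  1,  0, -1,  0],
    [1, -1,  0,  1,  0, -1],
    [1, -1, -1, -1,  1,  1],
], dtype=float)

z = y @ M
# Second element is the Factor A contrast: sum over A1 cells minus sum over A2 cells.
assert np.isclose(z[1], y[:3].sum() - y[3:].sum())

# Every contrast column (all but the first) is orthogonal to the one-vector ...
C = M[:, 1:]
assert np.allclose(C.sum(axis=0), 0)

# ... so under compound symmetry Sigma = se2*I + s2*J, the between-subject
# component s2 drops out of the contrast covariances: C' Sigma C = se2 * C'C.
se2, s2 = 2.0, 5.0
Sigma = se2 * np.eye(6) + s2 * np.ones((6, 6))
assert np.allclose(C.T @ Sigma @ C, se2 * (C.T @ C))
```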

Polynomial Regression

Polynomial regression poses no special problems for the analysis of data. Polynomial functions of $x_i$ (i.e., $x_i^2$, $x_i^3$, etc.) can be treated simply as different regression variables. There are two peculiarities in analysis, however, that deserve special discussion with respect to polynomial regression, or polynomial trends. The first relates to a computing problem. The second relates to testing contrasts in ANOVA applications that represent polynomial trends. We discuss these two issues here.

Example: Cubic Polynomial Model

Suppose heart rate is measured at P = 6 walking speeds, corresponding to $x_i$ = 1, 2, 3, 4, 5 and 6 mph, for i = 1, ..., 6 = P. Let $y_{ij}$ = heart rate at walking speed i on the j-th measure. If we assume that heart rate is a cubic function of walking speed, then

$$y_{ij} = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + e_{ij}$$

is a cubic polynomial model. When considering polynomial models, it is customary to include all lower-order polynomial terms along with the highest-order polynomial term in the model. We will always follow this convention. We can estimate the regression coefficients in the usual manner with a design matrix corresponding to

$$X = \begin{pmatrix} 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ 1 & x_3 & x_3^2 & x_3^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_P & x_P^2 & x_P^3 \end{pmatrix}.$$

Polynomial Models as a Simple Transformation of the Cell Mean Model

If a polynomial model is of degree (P-1) for i = 1, ..., P times, then there will be as many polynomial parameters as time points. For example, with P = 6 walking speeds, a 5th-degree polynomial model will have 6 parameters. In such cases, the polynomial parameters are simply a re-parameterization of the cell mean response at each time. The ideas can be illustrated with a simple example with P = 4. Suppose that mean responses at four doses are recorded, where the doses are 1, 2, 3, and 4. The four means can be represented as their individual means. Alternatively, the four means

can be represented by four regression parameters corresponding to a constant, linear, quadratic, and cubic parameter. The cell mean parameters can be transformed to form the regression parameters. The transformation is a re-parameterization. The connection between the parameters can be seen by considering the relationship

$$\mu_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3.$$

Using matrix notation,

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = X\beta, \quad \text{or} \quad \beta = X^{-1}\mu.$$

This is a simple transformation of the means. If we had n measures at each dose, we could first estimate the population mean response based on the sample mean, and then transform these mean responses to regression coefficients. Estimates of the regression coefficients in a model with n measures per dose can be fit directly using a design matrix expressed as a stack of individual design matrices (like the one given above):

$$X_0 = \begin{pmatrix} X \\ X \\ \vdots \\ X \end{pmatrix}.$$

Then $X_0'X_0 = n(X'X)$. As a result, when the same number of measures are made at each dose level (or time point), the $X'X$ matrix is a function of the polynomial matrix for a single set of measures.

Problems with Polynomial Design Matrices: Example with 10 Times

Suppose a child is measured at P = 10 times. Let the times be given by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, where the measures correspond to the weight at each of the 10 ages. Suppose an 8th-order ($x_i^8$) regression model is to be fit to these data. To fit this, we need to invert the matrix $X'X$. This turns out to be a difficult computing problem. For example, in PROC REG, the program will not form the regression coefficients. In PROC IML, a check of $(X'X)^{-1}(X'X)$ results in a matrix that is not an identity matrix. Orthogonal polynomials enable tests to be constructed for the regression coefficients

in this problem by re-parameterizing the regression parameters, thereby avoiding inversion of an ill-conditioned matrix. The test results avoid the inaccuracies that may be introduced by matrix inversion. Although tests can be constructed based on orthogonal polynomials (as can predicted values), the matrix inversion is necessary to estimate regression parameters on the original metric. For this reason, this re-parameterization is often not done.

Cholesky Decomposition of a Symmetric Full Rank Matrix and its Relationship to Orthogonal Polynomials

Orthogonal polynomials were introduced as a way of making tests more precise and not dependent on the inversion of $X'X$. While their use improves the accuracy of the tests, it transforms the parameters into quantities that are not easily interpretable. We illustrate orthogonal polynomials in the context of a polynomial regression with P time points, where a polynomial of degree (P-1) (having P parameters) is fit. We assume that the matrix X is a P x P square polynomial matrix. When X represents a matrix of polynomials, the matrix $X'X$ will be positive definite (of full rank) and symmetric. In such cases, it is possible to factor the matrix using a Cholesky decomposition (see Timm (1975), 73-75). Cholesky factoring of a matrix A results in an upper triangular matrix T such that

$$A = T'T.$$

In PROC IML, the Cholesky decomposition is specified via the ROOT function, which returns the matrix T = ROOT(A). Of course, the question is how this factorization helps in estimating sums of squares for regression problems. To answer this question, we consider a transformation of the original X matrix that will result in a T matrix. First, note that

$$X'X = T'T.$$

Both X and T are P x P matrices. Let us indicate the relationship between X and T via a matrix P such that

$$X = PT, \quad \text{or} \quad P = XT^{-1},$$

which implies that

$$X'X = T'T = T'P'PT, \quad \text{so that} \quad P'P = I.$$

Simple Example with P = 3 (x = 1, 2, 3):

$$X = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}, \quad X'X = \begin{pmatrix} 3 & 6 & 14 \\ 6 & 14 & 36 \\ 14 & 36 & 98 \end{pmatrix}, \quad (X'X)^{-1} = \begin{pmatrix} 19 & -21 & 5 \\ -21 & 24.5 & -6 \\ 5 & -6 & 1.5 \end{pmatrix}$$

$$T = \begin{pmatrix} 1.7320508 & 3.4641016 & 8.0829038 \\ 0 & 1.4142136 & 5.6568542 \\ 0 & 0 & 0.8164966 \end{pmatrix}, \quad (T'T)^{-1} = \begin{pmatrix} 19 & -21 & 5 \\ -21 & 24.5 & -6 \\ 5 & -6 & 1.5 \end{pmatrix}$$

$$P = XT^{-1} = \begin{pmatrix} 0.5773503 & -0.7071068 & 0.4082483 \\ 0.5773503 & 0 & -0.8164966 \\ 0.5773503 & 0.7071068 & 0.4082483 \end{pmatrix}$$

Column scale factors: $\sqrt{3}$, $\sqrt{2}$, $\sqrt{6}$ (the columns of P are $(1,1,1)'/\sqrt{3}$, $(-1,0,1)'/\sqrt{2}$, and $(1,-2,1)'/\sqrt{6}$). The coefficients in the matrix P are orthogonal polynomial coefficients.

Aside: Note that the solution we have given for P requires inversion of the matrix T. If the design matrix is ill-conditioned, then the matrix T will also be ill-conditioned. Hence, it may be questionable how much the problem of inverting the matrix has been resolved. Another function in SAS, ORPOL, will generate orthogonal polynomials without inverting T. This function uses a slightly different algorithm, but does not appear to be better than simple use of the Cholesky decomposition.

Summary of Transformations of Regression Coefficients with Orthogonal Polynomials

To summarize, if the model is $Y = X\beta + e$ where $X = PT$, we can transform the model such that

$$Y = PT\beta + e$$

$$= P\beta^* + e,$$

where $\beta^* = T\beta$. The model with transformed parameters will fit identically to the original model. It is easy to fit this model, since the design matrix is orthogonal ($P'P = I$), and hence

$$\hat{\beta}^* = P'y.$$

Also,

$$\mathrm{var}(\hat{\beta}^*) = I\sigma^2,$$

where $\sigma^2$ is estimated via the MSE. [Aside: It is this transformation that is used in the SAS procedure PROC GLM with the REPEATED statement's POLYNOMIAL option.] Note that the diagonal form of the variance of the transformed parameters indicates the independence of the parameters.
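The P = 3 example above can be reproduced numerically. A minimal numpy sketch, using numpy's lower-triangular Cholesky factor in place of the SAS/IML ROOT function (which returns the upper factor):

```python
import numpy as np

# Polynomial design matrix for x = 1, 2, 3 (columns: constant, linear, quadratic).
x = np.array([1.0, 2.0, 3.0])
X = np.vander(x, 3, increasing=True)

# SAS/IML ROOT(A) returns upper-triangular T with A = T'T; numpy's cholesky
# gives the lower factor L with A = L L', so T = L'.
A = X.T @ X
T = np.linalg.cholesky(A).T
assert np.allclose(T.T @ T, A)

# P = X T^{-1} has orthonormal columns: the normalized orthogonal
# polynomial coefficients.
P = X @ np.linalg.inv(T)
assert np.allclose(P.T @ P, np.eye(3))

# Columns match (1,1,1)/sqrt(3), (-1,0,1)/sqrt(2), (1,-2,1)/sqrt(6),
# the values tabulated in the text.
expected = np.column_stack([
    np.ones(3) / np.sqrt(3),
    np.array([-1.0, 0.0, 1.0]) / np.sqrt(2),
    np.array([1.0, -2.0, 1.0]) / np.sqrt(6),
])
assert np.allclose(P, expected)
```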

A similar analysis can be made when measures are made over time. Suppose three measures of pulse rate are made on each selected subject: immediately following exercise, 1 minute following exercise, and 2 minutes following exercise. We may consider a transformation matrix similar to a polynomial matrix given by

$$M = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \end{pmatrix},$$

where the columns correspond to a mean, a linear, and a quadratic trend. The transformation used in SAS is a normalized orthogonal polynomial matrix, not the polynomial matrix given above. A normalized matrix means that the sum of the squared values of the coefficients in any column adds up to 1. Orthogonal means that the inner product of any two columns of the matrix is zero. The orthonormal polynomial matrix is formed by taking the Cholesky decomposition of the polynomial matrix (see Timm (1975), 73-75). The decomposition results in $M'M = T'T$, where T is an upper triangular matrix; all elements of T below the diagonal are zero. In PROC IML, this can be obtained by the command T = ROOT(A);. With this transformation, the orthonormal polynomial matrix is $MT^{-1}$.

References

Timm, N.H. (1975). Multivariate Analysis with Applications in Education and Psychology. Wadsworth Publishing Company, Belmont, California.

be740-trans-repeatedsas.doc 1/25/2006