Biostatistics: Multiple Regression

ORIGIN := 0

Multiple Regression is an extension of the technique of linear regression used to describe the relationship between a single dependent (response) variable (Y) and multiple independent (predictor) variables (X1, X2, X3, ...). Typically, multiple regression involves specifying one, or sometimes several, linear models, constructing the multiple regression, and then testing hypotheses, often involving the several regression coefficients (β1, β2, β3, ...) corresponding to each of the X variables.

ZAR := READPRN("c:/data/biostatistics/zarex0.ar.txt")
Y := ZAR<0>                                  < dependent variable Y
X1 := ZAR<1>   X2 := ZAR<2>   X3 := ZAR<3>   X4 := ZAR<4>   < independent variables X1-X4
n := length(Y)
k := 4                                       < number of independent variables
i := 0 .. n-1    ii := 0 .. n-1              < range variables i, ii for n observations
j := 0 .. k                                  < range variable j for columns of X

< data listing from the worksheet omitted: values garbled in extraction >

Assumptions:
- Multiple Linear Regression depends on specifying in advance which variable is considered 'dependent' and which others 'independent'. This decision matters, as changing the roles of Y versus the X's usually produces a different result.
- Y1, Y2, Y3, ..., Yn (dependent variable) is a random sample ~ N(μ, σ²).
- k vectors of fixed Independent Variables:
  - X1,1, X1,2, X1,3, ..., X1,n (independent variable) with each value of X1,i matched to Yi
  - X2,1, X2,2, X2,3, ..., X2,n (independent variable) with each value of X2,i matched to Yi
  ...
  - Xk,1, Xk,2, Xk,3, ..., Xk,n (independent variable) with each value of Xk,i matched to Yi

Model:
Yi = β0 + β1·X1,i + β2·X2,i + β3·X3,i + ... + βk·Xk,i + εi    for i = 1 to n
where:
β0 is the y intercept of the regression line (translation).
The βi's are the regression coefficients (i.e., "slopes") for each Xi of the regression line.
The εi's are the residuals (i.e., "error") in the prediction of Y; each εi is a random variable with distribution N(0, σ²).

Note that this is one of many possible linear models: the Xi's may be squared or higher-order functions of an original variable (i.e., X², X³, etc.) or cross products of two original variables (i.e., Xa,i·Xb,i, etc.). This is the wonderful and extremely powerful world of Linear Modeling.
Least Squares Estimation of the Regression Line:
Calculations in Multiple Regression are extensive and are best visualized using matrix algebra, where the sums of squares and cross products are implicit in the matrix manipulations:

X            becomes an (n × k+1) matrix of fixed X values with a first column of 1's and each of the k vectors above comprising the subsequent columns.
X^T          is the transpose matrix of X, where rows and columns are reversed.
(X^T·X)^-1   is the inverse matrix of (X^T·X), such that (X^T·X)^-1·(X^T·X) = I (the identity matrix).
Y            is the vector of Yi's arrayed as a column of numbers.
b            is the vector of regression coefficients, β0 plus all the βi's, arrayed as a single column of numbers.
Yhat         is the vector of fitted values for the Yi arrayed as a column of numbers.
e            is the vector of residuals ei (Yi - Yhati) arrayed as a column of numbers.

Constructing Design Matrix X of Independent Variables:
X0_i := 1                          < vector of 1's for constructing matrix X below
X := augment(X0, X1, X2, X3, X4)

Estimated Regression Coefficients (b):
b := (X^T·X)^-1·X^T·Y              ^ Note: all calculations here involve MathCad's matrix algebra functions!

Estimated values of Y (Yhat):   Yhat := X·b       < see next page for Yhat
Residuals (e):                  e := Y - Yhat     < see next page for e

X^T·Y        < Sums of XY Cross Products: matrix analog of Lxy
X^T·X        < Sums of Squares and Cross Products of X's: matrix analog of Lxx
(X^T·X)^-1   < Inverse of Matrix X^T·X: multiplication by this matrix is the matrix algebra analog of dividing by X^T·X

< numeric matrices X, X^T·Y, X^T·X, (X^T·X)^-1, and b omitted: values garbled in extraction >
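The matrix recipe above translates almost line-for-line into NumPy; a minimal sketch with simulated data (hypothetical values, not Zar's example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 33, 4
X0 = np.ones((n, 1))                  # column of 1's
Xpred = rng.normal(size=(n, k))       # hypothetical X1..X4 as columns
X = np.hstack([X0, Xpred])            # design matrix, like augment(X0, X1, X2, X3, X4)
Y = rng.normal(size=n)

XtX = X.T @ X                         # sums of squares / cross products of X's (Lxx analog)
XtY = X.T @ Y                         # sums of XY cross products (Lxy analog)
b = np.linalg.inv(XtX) @ XtY          # b = (X'X)^-1 X'Y
Yhat = X @ b                          # fitted values Yhat = X b
e = Y - Yhat                          # residuals e = Y - Yhat

# Cross-check against NumPy's built-in least-squares solver
b_ls = np.linalg.lstsq(X, Y, rcond=None)[0]
```

A useful property to remember: the normal equations force the residual vector e to be orthogonal to every column of X, including the column of 1's (so the residuals sum to zero).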
Sums of Squares as Quadratic Forms:
H := X·(X^T·X)^-1·X^T              < called the "Hat Matrix", often reported in software
J_{i,ii} := 1                      < n×n matrix of 1's useful in calculations
I := identity(n)                   < identity matrix of length n

SSR := Y^T·(H - (1/n)·J)·Y
SSE := Y^T·(I - H)·Y
SSTO := Y^T·(I - (1/n)·J)·Y

(Each quadratic form evaluates to a 1×1 matrix in MathCad. Note also that H·Y = Yhat, which is where the "Hat Matrix" gets its name.)

Degrees of Freedom:
df_R := k
df_E := n - k - 1
df_T := n - 1

ANOVA Table:
MSR := SSR/df_R
MSE := SSE/df_E
MSTO := SSTO/df_T

< numeric values of Yhat, e, SSR, SSE, SSTO, and the mean squares omitted: garbled in extraction >

Standard error of the Regression Parameters:
MS_E := MSE_{0,0}                  < converting 1×1 matrix into scalar
sb_sq := MS_E·(X^T·X)^-1           < matrix of variances/covariances among regression coefficients
sb_j := sqrt(sb_sq_{j,j})          < standard errors of the regression parameters b
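The quadratic-form identities above can be verified directly; a minimal NumPy sketch (simulated data, not Zar's example) computing H, J, the three sums of squares, and the coefficient standard errors:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 33, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])   # design matrix with 1's column
Y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix: H @ Y gives the fitted values
J = np.ones((n, n))                      # n x n matrix of 1's
I = np.eye(n)

SSR = Y @ (H - J / n) @ Y               # regression sum of squares
SSE = Y @ (I - H) @ Y                   # error (residual) sum of squares
SSTO = Y @ (I - J / n) @ Y              # total sum of squares

df_R, df_E, df_T = k, n - k - 1, n - 1
MSR, MSE = SSR / df_R, SSE / df_E

sb_sq = MSE * np.linalg.inv(X.T @ X)    # variance/covariance matrix of the coefficients
sb = np.sqrt(np.diag(sb_sq))            # standard errors of b
```

The hat matrix is a projection (H·H = H), and the decomposition SSR + SSE = SSTO holds exactly whenever the model contains the intercept column.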
Coefficient of Multiple Determination (R²):
R_sq := 1 - SSE_{0,0}/SSTO_{0,0}

Coefficient of Multiple Correlation (R):
R := sqrt(R_sq)

Adjusted Coefficient of Multiple Determination:
R_sqa := 1 - MSE_{0,0}/MSTO_{0,0}

Overall F Test of the Multiple Regression:
Hypotheses:
H0: all slope β's = 0              < i.e., only β0 is left in the regression model
H1: at least some slope β's are not zero
Test Statistic:
F := MSR_{0,0}/MSE_{0,0}
Sampling Distribution:
If the Assumptions hold and H0 is true, then F ~ F(k, n-k-1).
Critical Value of the Test:
α := 0.05                          < Probability of Type I error must be explicitly set
C := qf(1-α, k, n-k-1)
Decision Rule:
IF F > C, THEN REJECT H0, OTHERWISE ACCEPT H0
Probability Value:
P := 1 - pf(F, k, n-k-1)

< numeric values of R_sq, R, R_sqa, F, C, and P omitted: garbled in extraction >
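The R², adjusted R², and overall F statistic follow directly from the ANOVA pieces; a minimal NumPy sketch (simulated data with hypothetical coefficients, not Zar's example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 33, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.8, 0.0, 0.0, -0.5])       # hypothetical beta0..beta4
Y = X @ beta + rng.normal(0.0, 0.3, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
J, I = np.ones((n, n)), np.eye(n)
SSE = Y @ (I - H) @ Y
SSTO = Y @ (I - J / n) @ Y
SSR = SSTO - SSE

R_sq = 1.0 - SSE / SSTO                  # coefficient of multiple determination
R = np.sqrt(R_sq)                        # coefficient of multiple correlation
MSE = SSE / (n - k - 1)
MSTO = SSTO / (n - 1)
R_sqa = 1.0 - MSE / MSTO                 # adjusted R^2 (penalizes extra predictors)
F = (SSR / k) / MSE                      # overall F statistic, df = (k, n-k-1)
```

In R, the critical value and P value would come from qf(1-alpha, k, n-k-1) and 1-pf(F, k, n-k-1); note that the adjusted R² is always at most the unadjusted R².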
Partial t/F-tests of single coefficients:
Note: this is a "marginal" test, so the order of entry into the regression does not matter.
Hypotheses:
H0: a single βj = 0                < typically a single independent variable, but also the intercept β0
H1: βj ≠ 0                         < each coefficient is tested in turn (β0 first, then each of the βj's)
Test Statistics:
t_j := b_j/sb_j
F_j := (t_j)²
Sampling Distributions:
If the Assumptions hold and H0 is true, then t ~ t(n-k-1).
If the Assumptions hold and H0 is true, then F ~ F(1, n-k-1).
Critical Values of the Test:
α := 0.05                          < Probability of Type I error must be explicitly set
CV_t := qt(1-α/2, n-k-1)
CV_F := qf(1-α, 1, n-k-1)
Decision Rules:
IF |t| > CV_t, THEN REJECT H0, OTHERWISE ACCEPT H0    < t-test version
IF F > CV_F, THEN REJECT H0, OTHERWISE ACCEPT H0      < F-test version
Probability Values:
P_t_j := 2·min(pt(t_j, n-k-1), 1 - pt(t_j, n-k-1))
^ takes the minimum of the two tail P calculations, doubled for a two-tailed test
P_F_j := 1 - pf(F_j, 1, n-k-1)

< numeric values of t, F, CV_t, CV_F, P_t, and P_F omitted: garbled in extraction. P_t and P_F agree for each coefficient, so the t and F versions of the test are equivalent. >
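The marginal t statistics, and the claim that column order does not matter for them, can be sketched in NumPy (simulated data, not Zar's example); swapping two predictor columns simply swaps their t values:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 33, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])   # 1's, X1..X4
Y = rng.normal(size=n)

def marginal_t(Xm, Yv):
    """Marginal t statistic b_j / sb_j for each column of the design matrix Xm."""
    XtX_inv = np.linalg.inv(Xm.T @ Xm)
    b = XtX_inv @ Xm.T @ Yv
    e = Yv - Xm @ b
    MSE = (e @ e) / (Xm.shape[0] - Xm.shape[1])   # df = n - (k+1)
    sb = np.sqrt(MSE * np.diag(XtX_inv))
    return b / sb

t = marginal_t(X, Y)          # one t per coefficient (intercept first)
F = t ** 2                    # equivalent marginal F, df = (1, n-k-1)

# Reorder the predictors (swap X1 and X2): the marginal t's just permute.
t_swapped = marginal_t(X[:, [0, 2, 1, 3, 4]], Y)
```

This is the "marginal" property: each test compares the full model to the model with that one column dropped, regardless of entry order.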
Confidence Intervals for single coefficients β0 & βj:
CI_b := augment(b - CV_t·sb, b + CV_t·sb)     < the augment function puts the lower and upper limits into a matrix for display

< numeric values of b and CI_b omitted: garbled in extraction >

Prototype in R:

#MULTIPLE REGRESSION
#ZAR EXAMPLE 0.a
ZAR=read.table("ZarEX0.aR.txt")
ZAR
attach(ZAR)
Y=var1
X1=var2
X2=var3
X3=var4
X4=var5
LM=lm(Y~X1+X2+X3+X4)
LM
summary(LM)
anova(LM)

> LM
Call: lm(formula = Y ~ X1 + X2 + X3 + X4)
Coefficients: (Intercept), X1, X2, X3, X4

> summary(LM)
Call: lm(formula = Y ~ X1 + X2 + X3 + X4)
Residuals: Min, 1Q, Median, 3Q, Max
Coefficients: Estimate, Std. Error, t value, Pr(>|t|) for (Intercept) and X1-X4
  < equivalent to the worksheet's b, sb, t, and P_t: the partial t-tests of individual coefficients >
Residual standard error on n-k-1 degrees of freedom
Multiple R-squared                 < Coefficient of Multiple Determination R² >
Adjusted R-squared                 < Adjusted Coefficient of Multiple Determination R²_a >
F-statistic on k and n-k-1 DF, with p-value    < Overall F-test >
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

< numeric output of the R session omitted: values garbled in extraction >
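The confidence-interval construction above can be sketched in NumPy (simulated data, not Zar's example). The critical value is hardcoded here as an assumption: 2.048 is qt(0.975, 28) from standard t tables, matching α = 0.05 with n = 33 and k = 4; in R it would be computed with qt().

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 33, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
Y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b
sb = np.sqrt((e @ e) / (n - k - 1) * np.diag(XtX_inv))   # standard errors of b

# Assumed two-tailed critical value t(0.975, df=28) ~ 2.048 (from t tables)
C = 2.048
CI_b = np.column_stack([b - C * sb, b + C * sb])          # analog of augment(b - C*sb, b + C*sb)
```

Each row of CI_b is the (lower, upper) 95% confidence interval for one coefficient, intercept first.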
Note: the Sums of Squares for X1-X4 in R's ANOVA table sum to SSR above, and the SS for 'Residuals' in R equals SSE above.

> anova(LM)
Analysis of Variance Table
Response: Y
Rows: X1, X2, X3, X4, Residuals, with columns Df, Sum Sq, Mean Sq, F value, Pr(>F)
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

< numeric output omitted: values garbled in extraction >

The Sums of Squares for X1-X4 in the ANOVA chart, also known as "Partial" or "Extra" Sums of Squares, are not calculated in this worksheet. To understand how to interpret them, see Biostatistics Worksheet 400.
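The note above (the per-predictor sums of squares summing to SSR) can be checked directly. A minimal NumPy sketch (simulated data, not Zar's example) computes sequential "extra" SS by adding one predictor at a time, as R's anova() does:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 33, 4
Xpred = rng.normal(size=(n, k))          # hypothetical X1..X4
Y = rng.normal(size=n)

def sse(Xm, Yv):
    """Residual sum of squares of a least-squares fit of Yv on the columns of Xm."""
    b = np.linalg.lstsq(Xm, Yv, rcond=None)[0]
    e = Yv - Xm @ b
    return e @ e

ones = np.ones((n, 1))
SSTO = sse(ones, Y)                      # intercept-only SSE equals SSTO
sse_prev = SSTO
seq_SS = []
for j in range(k):                       # enter X1, then X2, then X3, then X4
    Xm = np.hstack([ones, Xpred[:, : j + 1]])
    sse_j = sse(Xm, Y)
    seq_SS.append(sse_prev - sse_j)      # SS explained by adding this predictor
    sse_prev = sse_j

SSE_full = sse_prev                      # R's 'Residuals' SS
SSR = SSTO - SSE_full                    # regression SS of the full model
```

Unlike the marginal t-tests earlier, these sequential SS do depend on the order in which predictors are entered.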