Biostatistics 380 Multiple Regression 1


ORIGIN := 0   < MathCad array indexing begins at zero

Multiple Regression is an extension of the technique of linear regression to describe the relationship between a single dependent (response) variable (Y) and multiple independent (predictor) variables (X1, X2, X3, ...). Typically, multiple regression involves specifying one, or sometimes several, linear models, constructing the multiple regression, and then testing hypotheses, often involving the several regression coefficients (β1, β2, β3, ...) corresponding to each of the X variables.

ZAR := READPRN("c:/data/biostatistics/zarex0.ar.txt")   < read Zar's example data (digits in the file name were lost in transcription)
Y := ZAR<0>                       < dependent variable Y
X1 := ZAR<1> ... X4 := ZAR<4>     < independent variables X1-X4
n := length(Y)                    < number of observations
k := 4                            < number of independent variables

i := 0..n-1, ii := 0..n-1         < range variables i, ii for n observations
j := 0..k                         < range variable j for columns of X

[The numeric listing of Y and X1-X4 was garbled in transcription and is omitted here.]

Assumptions:
- Multiple Linear Regression depends on specifying in advance which variable is considered 'dependent' and which others 'independent'. This decision matters, as exchanging the roles of Y versus the X's usually produces a different result.
- Y1, Y2, Y3, ..., Yn (dependent variable) is a random sample ~ N(μ, σ²).
- k vectors of fixed independent variables:
  - X1,1, X1,2, X1,3, ..., X1,n (independent variable) with each value of X1,i matched to Yi
  - X2,1, X2,2, X2,3, ..., X2,n (independent variable) with each value of X2,i matched to Yi
  - ...
  - Xk,1, Xk,2, Xk,3, ..., Xk,n (independent variable) with each value of Xk,i matched to Yi

Model:

Yi = β0 + β1·X1,i + β2·X2,i + β3·X3,i + ... + βk·Xk,i + εi   for i = 1 to n

where:
β0 is the Y intercept of the regression line (translation).
The βj's are the regression coefficients (i.e., "slopes") for each Xj of the regression line.
The εi's are the residuals (i.e., "errors") in the prediction of Y; each εi is a random variable with distribution N(0, σ²).

Note that this is one of many possible linear models: the terms may involve X's that are squared or higher-order functions of an original variable (e.g., X², X³) or cross products of two original variables (e.g., Xa,i·Xb,i). This is the wonderful and extremely powerful world of Linear Modeling.
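Because the original data listing was garbled in transcription, the R sketches interleaved below work through the same steps on a small simulated data set; the data frame dat, its column names, and the coefficient values are hypothetical stand-ins, not Zar's data.

#Simulated stand-in data (hypothetical): n observations of y generated from
#the linear model y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + e, e ~ N(0, sigma^2)
set.seed(1)
n <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 + 0.5*dat$x1 - 0.3*dat$x2 + 0*dat$x3 + 0.8*dat$x4 + rnorm(n, sd = 0.5)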

Least Squares Estimation of the Regression Line:

Calculations in Multiple Regression are extensive, and are best visualized using matrix algebra, where sums of squares and cross products are implicit in the matrix manipulations:

b := (X^T·X)^-1·X^T·Y   < estimated regression coefficients
Yhat := X·b             < estimated values of Y (Yhat)
e := Y - Yhat           < residuals

where:
X becomes an (n × k+1) matrix of fixed X values, with a first column of 1's and each of the k vectors above comprising the subsequent columns.
(X^T·X)^-1 is the inverse matrix of X^T·X, such that (X^T·X)^-1·(X^T·X) = I (the identity matrix).
X^T is the transpose matrix of X, where rows and columns are reversed.
Y is the vector of Yi's arrayed as a column of numbers.
b is the vector of regression coefficients, including β0 plus all the βj's, arrayed as a single column of numbers.
Yhat is the vector of fitted values of Yi arrayed as a column of numbers.
e is the vector of residuals ei = Yi - Yhati arrayed as a column of numbers.

Constructing Design Matrix X of Independent Variables:

X0i := 1                           < vector of 1's for constructing matrix X below
X := augment(X0, X1, X2, X3, X4)

Estimated Regression Coefficients (b):

X^T·Y        < sums of XY cross products: matrix analog of Lxy
X^T·X        < sums of squares and cross products of the X's: matrix analog of Lxx
(X^T·X)^-1   < inverse of matrix X^T·X; multiplication by this matrix is the matrix algebra analog of dividing by X^T·X
b := (X^T·X)^-1·X^T·Y

Note: all calculations here involve MathCad's matrix algebra functions! (See the following pages for Yhat and e.)

[The numeric displays of X, X^T·Y, X^T·X, (X^T·X)^-1, and b were garbled in transcription and are omitted here.]
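A minimal R sketch of the same matrix-algebra steps, continuing the simulated example above (all names hypothetical):

Y <- dat$y
X <- cbind(1, dat$x1, dat$x2, dat$x3, dat$x4)   # design matrix: column of 1's, then X1-X4
b <- solve(t(X) %*% X) %*% t(X) %*% Y           # b = (X^T X)^-1 X^T Y
Yhat <- X %*% b                                 # fitted values
e <- Y - Yhat                                   # residuals
cbind(b, coef(lm(y ~ x1 + x2 + x3 + x4, data = dat)))   # identical to lm()'s coefficients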

Sums of Squares as Quadratic Forms:

H := X·(X^T·X)^-1·X^T   < called the "Hat Matrix", often reported in software
Ji,ii := 1              < n×n matrix of 1's, useful in calculations
I := identity(n)        < identity matrix of length n

SSR := Y^T·(H - (1/n)·J)·Y     < regression sum of squares
SSE := Y^T·(I - H)·Y           < error sum of squares
SSTO := Y^T·(I - (1/n)·J)·Y    < total sum of squares

Note that SSR + SSE = SSTO, and that H·Y = Yhat (the original worksheet displayed the two columns side by side to show they are identical).

Degrees of Freedom:
dfR := k
dfE := n - k - 1
dfT := n - 1

ANOVA Table:
MSR := SSR/dfR
MSE := SSE/dfE
MSTO := SSTO/dfT

Standard Error of the Regression Parameters:
sb² := MSE·(X^T·X)^-1   < matrix of variances/covariances among the regression coefficients (MSE is first converted from a 1×1 matrix into a scalar)
sbj := sqrt(sb²j,j)     < standard errors of the regression parameters b, from the diagonal of sb²

[The numeric values of SSR, SSE, SSTO, the mean squares, Yhat, e, sb², and sb were garbled in transcription and are omitted here.]
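The same quadratic forms in R, continuing the simulated example (X, Y, and n as defined in the sketch above):

k <- 4
H <- X %*% solve(t(X) %*% X) %*% t(X)           # hat matrix
J <- matrix(1, n, n)                            # n x n matrix of 1's
SSR  <- drop(t(Y) %*% (H - J/n) %*% Y)          # regression SS
SSE  <- drop(t(Y) %*% (diag(n) - H) %*% Y)      # error SS
SSTO <- drop(t(Y) %*% (diag(n) - J/n) %*% Y)    # total SS
MSR <- SSR / k                                  # mean squares
MSE <- SSE / (n - k - 1)
sb <- sqrt(diag(MSE * solve(t(X) %*% X)))       # standard errors of the b's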

Coefficient of Multiple Determination (R²):

R² := 1 - SSE/SSTO

Coefficient of Multiple Correlation (R):

R := sqrt(R²)

Adjusted Coefficient of Multiple Determination (R²a):

R²a := 1 - MSE/MSTO

Overall F Test of the Multiple Regression:

Hypotheses:
H0: all slope β's = 0   < i.e., only β0 is left in the regression model
H1: at least some slope β's are not zero

Test Statistic:
F := MSR/MSE

Sampling Distribution:
If the Assumptions hold and H0 is true, then F ~ F(k, n-k-1).

Critical Value of the Test:
α   < probability of Type I error must be explicitly set (the worksheet's numeric value was garbled in transcription)
C := qf(1 - α, k, n - k - 1)

Decision Rule:
IF F > C, THEN REJECT H0; OTHERWISE ACCEPT H0.

Probability Value:
P := 1 - pf(F, k, n - k - 1)

[The numeric values of R², R, R²a, F, C, and P were garbled in transcription; they appear in the R summary() output on the Prototype in R page below.]
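The overall F test in R, continuing the simulated example; α = 0.05 here is an assumed value, since the worksheet's α was garbled:

alpha <- 0.05                        # assumed Type I error rate (not from the worksheet)
F_stat <- MSR / MSE                  # overall F statistic
C <- qf(1 - alpha, k, n - k - 1)     # critical value
P <- 1 - pf(F_stat, k, n - k - 1)    # P value
c(F = F_stat, critical = C, P = P)   # reject H0 if F > C (equivalently, P < alpha)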

Partial t/F Tests of Single Coefficients:

Note: this is a "marginal" test, so the order of entry into the regression does not matter.

Hypotheses:
H0: a single βj = 0   < typically the slope of a single independent variable, but also the intercept
H1: βj ≠ 0            < each coefficient takes a turn being tested (β0 first, then all the βj's in turn)

Test Statistics:
tj := bj/sbj
Fj := (tj)²

Sampling Distributions:
If the Assumptions hold and H0 is true, then t ~ t(n-k-1) and F ~ F(1, n-k-1).

Critical Values of the Test:
α   < probability of Type I error must be explicitly set (numeric value garbled in transcription)
CVt := qt(1 - α/2, n - k - 1)
CVF := qf(1 - α, 1, n - k - 1)

Decision Rules:
IF |t| > CVt, THEN REJECT H0; OTHERWISE ACCEPT H0   < t-test version
IF F > CVF, THEN REJECT H0; OTHERWISE ACCEPT H0     < F-test version

Probability Values:
Ptj := 2·min(pt(tj, n-k-1), 1 - pt(tj, n-k-1))   < twice the minimum of the two one-tailed P calculations, giving a two-tailed P value
PFj := 1 - pf(Fj, 1, n-k-1)

[The numeric vectors t, F, CVt, CVF, Pt, and PF were garbled in transcription. The original worksheet displayed Pt and PF side by side to show that the t-test and F-test versions give identical P values for each of the five coefficients.]
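The partial t tests in R, continuing the simulated example (b, sb, n, and k from the sketches above):

t_stat <- drop(b) / sb                        # t statistic for each coefficient
P_t <- 2 * pmin(pt(t_stat, n - k - 1),
                1 - pt(t_stat, n - k - 1))    # two-tailed P values
cbind(b = drop(b), se = sb, t = t_stat, P = P_t)
summary(lm(y ~ x1 + x2 + x3 + x4, data = dat))$coefficients   # same table from lm()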

Confidence Intervals for Single Coefficients β0 & βj:

CIb := augment(b - CVt·sb, b + CVt·sb)   < the augment function puts the lower and upper limits into a matrix for display

[The numeric matrix CIb and the vector b were garbled in transcription.]

Prototype in R:

#MULTIPLE REGRESSION
#ZAR EXAMPLE (example number garbled in transcription)
ZAR=read.table("ZarEX0.aR.txt")   #digits in the file name were lost in transcription
ZAR
attach(ZAR)
Y=var5
X1=var1
X2=var2
X3=var3
X4=var4
LM=lm(Y~X1+X2+X3+X4)
LM
summary(LM)
anova(LM)

> LM

Call:
lm(formula = Y ~ X1 + X2 + X3 + X4)

Coefficients:
(Intercept)  X1  X2  X3  X4
[estimates garbled in transcription]

> summary(LM)

Call:
lm(formula = Y ~ X1 + X2 + X3 + X4)

Residuals:
Min  1Q  Median  3Q  Max
[values garbled in transcription]

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  [values garbled in transcription]  *
X1           [garbled]                          ***
X2           [garbled]
X3           [garbled]
X4           [garbled]                          **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
< the Coefficients table reports the partial t-tests of the individual coefficients

Residual standard error: [garbled] on n-k-1 degrees of freedom
Multiple R-squared: [garbled]   < R² above; Adjusted R-squared: [garbled]   < R²a above
F-statistic: [garbled] on 4 and n-k-1 DF, p-value: [garbled]   < the Overall F test above

The MathCad vectors b, sb, t, and Pt, displayed beside this output in the original worksheet, are equivalent to summary()'s report of the partial t-tests of the individual coefficients.
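In R, confint() reproduces the b ± CVt·sb intervals; a sketch, continuing the simulated example (all names hypothetical, 95% level assumed):

fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
confint(fit, level = 0.95)              # built-in CIs for beta0..beta4
CVt <- qt(0.975, n - k - 1)             # the same intervals by hand:
cbind(lower = drop(b) - CVt * sb,       #   b - CVt*sb
      upper = drop(b) + CVt * sb)       #   b + CVt*sb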

Note: the Sums of Squares for X1-X4 in R's ANOVA table sum to the SSR above, and the SS for 'Residuals' in R equals the SSE above.

> anova(LM)

Analysis of Variance Table

Response: Y
           Df  Sum Sq  Mean Sq  F value  Pr(>F)
X1          1  [values garbled in transcription]  ***
X2          1  [garbled]
X3          1  [garbled]
X4          1  [garbled]                          **
Residuals      [garbled]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Sums of Squares for X1-X4 in the ANOVA table, also known as "Partial" or "Extra" Sums of Squares, are not calculated in this worksheet. To understand how to interpret them, see Biostatistics Worksheet 400.
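A quick check of this identity in R, continuing the simulated example: anova()'s sequential sums of squares for x1-x4 add up to the quadratic-form SSR, and its Residuals SS equals SSE.

a <- anova(fit)              # fit, SSR, and SSE from the sketches above
sum(a[["Sum Sq"]][1:4])      # equals SSR
a[["Sum Sq"]][5]             # equals SSE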