36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015. Lecture 4: Linear Regression


Simple Regression Example

Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted to find out whether there is a relationship between the mass (in g) of the stones carried and the immunologic health of the birds as measured by a test of T cell function (in mm).

Descriptive Statistics
                     N    Minimum   Maximum   Mean     Std. Deviation
MASS                 21   3.33      9.95      7.2043   1.70244
TCELL                21   .18       .51       .3240    .09674
Valid N (listwise)   21

[Scatterplot: TCELL (mm), roughly .2 to .6, vs. MASS (g), 0 to 10]

Example, cont.

Coefficients
             Unstandardized Coefficients                95% Confidence Interval for B
             B      Std. Error   t       Sig.           Lower Bound   Upper Bound
(Constant)   .087   .079         1.12    .280           -.077         .252
MASS         .033   .011         3.084   .006           .011          .055

ANOVA (b. Dependent Variable: TCELL)
             Sum of Squares   df   Mean Square   F       Sig.
Regression   .062             1    .062          9.513   .006
Residual     .125             19   .007
Total        .187             20
a. Predictors: (Constant), MASS

Regression: Main Ideas

Setting: quantitative outcome with a quantitative explanatory variable. (9.1)

Structural (means) model: E(Y|x) = β₀ + β₁x (the parameters are called "betas" or "coefficients"). This defines a linear relationship, on average (the linearity assumption). The intercept, β₀, is the (population) mean of Y when x = 0. The slope, β₁, is the (population) mean change in Y when x increases by 1.
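The slides show SPSS output, but the same coefficient and ANOVA tables can be reproduced in any statistics package. Here is a minimal sketch in Python with statsmodels; since the bird-by-bird measurements are not printed on the slides, the arrays below are synthetic stand-ins with roughly the same structure, not the Soler et al. data:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in data (NOT the Soler et al. measurements, which the
# slides do not list): 21 birds spanning roughly the observed MASS range.
rng = np.random.default_rng(0)
mass = rng.uniform(3.3, 10.0, size=21)
tcell = 0.087 + 0.033 * mass + rng.normal(0, 0.08, size=21)

X = sm.add_constant(mass)            # adds the intercept column (b0)
model = sm.OLS(tcell, X).fit()       # least-squares fit of E(Y|x) = b0 + b1*x
print(model.summary())               # coefficients, t tests, ANOVA F, R^2
print(model.conf_int(alpha=0.05))    # 95% CIs, like SPSS's Lower/Upper Bounds
```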

Main ideas, cont.

Fixed-x assumption: the x's are measured with no (or little) error.

Error model: the errors (deviations of Y from β₀ + β₁x) are Gaussian with constant variance σ², called the error variance, and the errors are independent of each other.

Alternate model form: Yᵢ|xᵢ = β₀ + β₁xᵢ + εᵢ, where εᵢ ~ N(0, σ²) iid.

Simple vs. multiple regression.

Null hypothesis: H₀: β₁ = 0 vs. H₁: β₁ ≠ 0. (Sometimes H₀: β₀ = 0 vs. H₁: β₀ ≠ 0, or H₀: β₁ = 1 vs. H₁: β₁ ≠ 1.) (9.2)

Interpolation: good / extrapolation: bad.

The least squares principle finds the best-fit line by minimizing the sum of squared residuals, as sketched below. (9.4)

Estimates: the best estimates of β₀ and β₁ are called b₀ and b₁, or β̂₀ and β̂₁ (read "beta-zero-hat" and "beta-one-hat"). (9.4, 9.5)

Technical & optional: the estimates of the coefficients are statistics that are unbiased, minimum variance estimates. For linear regression, least squares is the same as maximum likelihood.
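To make the least squares principle concrete, the best-fit line can be found by numerically minimizing the sum of squared residuals and comparing against the closed-form fit. A sketch using the same synthetic stand-in data as above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(3.3, 10.0, size=21)              # synthetic MASS stand-in
y = 0.087 + 0.033 * x + rng.normal(0, 0.08, 21)  # synthetic TCELL stand-in

def ssr(beta):
    """Sum of squared residuals for a candidate line (b0, b1)."""
    return np.sum((y - beta[0] - beta[1] * x) ** 2)

fit = minimize(ssr, x0=[0.0, 0.0])   # least squares: minimize the SSR
b0, b1 = fit.x
# Agrees (to numerical precision) with the closed-form least-squares line:
print(b0, b1, np.polyfit(x, y, 1))   # polyfit returns [slope, intercept]
```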

Fitted (predicted) values: Ŷᵢ = fitᵢ = predᵢ = β̂₀ + β̂₁xᵢ = b₀ + b₁xᵢ. (9.4)

Optional estimation formulas (sums over i = 1 to n):
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  b₀ = ȳ − b₁x̄,  s² = Σrᵢ² / (n − 2)

Residual: resᵢ = rᵢ = Yᵢ − fitᵢ = Yᵢ − Ŷᵢ. This is an estimate of the true error. (9.4, 9.6)

MSwithin = MSresid = MSerror = SSresid/df are exactly the same estimate of the error variance, σ². (9.8) In general, the df equals (N − #betas).

Main ideas, cont.

The normality assumption leads to a normal sampling distribution of β̂₁ with mean β₁ and a variance that is a complicated combination of σ² and the x values. When we substitute in the estimate of σ² and take the square root, we get a standard error (SE) of β̂₁ that can be used in a t-test (9.4) of H₀: β₁ = 0.

95% confidence interval on β₁: [b₁ − m·SE(b₁), b₁ + m·SE(b₁)], where m is approximately 2 and the exact value comes from finding the value of the appropriate t distribution that holds 95% of the distribution. (9.4)

Main ideas, cont.

Goodness of fit / model comparison: (9.8) R² (0 to 1) measures the fraction of the variability in the outcome accounted for by the explanatory variable(s). It is also r² in simple (one-x) regression, where r is the correlation of x and Y. (See slide 23 below for the formula.)

Adjusted R² corrects for the "more IVs always give a higher R²" problem. Many consider AIC/BIC, which are penalized likelihoods, even better for comparing models, e.g., for deciding whether some additional x should be included.
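These estimation formulas are easy to verify directly. A sketch with the same synthetic stand-in data; note that the slope SE used here, √(s² / Σ(xᵢ − x̄)²), is the standard simple-regression formula, spelled out because the slide leaves the variance expression unstated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(3.3, 10.0, 21)                     # synthetic MASS stand-in
y = 0.087 + 0.033 * x + rng.normal(0, 0.08, 21)    # synthetic TCELL stand-in
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)                  # estimate of sigma^2

se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # SE of the slope
t = b1 / se_b1                                     # t statistic for H0: beta1 = 0
m = stats.t.ppf(0.975, df=n - 2)                   # the "m is approximately 2" value
print(b1, se_b1, t, (b1 - m * se_b1, b1 + m * se_b1))
```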

Robustness of the t-test in regression (9.7):
- moderately severe non-normality is OK (CLT)
- mild to moderate unequal spread is OK (ratio < 2)
- only minimal correlation of errors is OK
- non-fixed x is OK only if its variability is much less than the variability of Y
- non-linearity is bad

SPSS Output and Interpretation

SPSS has some mechanisms for trying different sets of explanatory variables (model selection) in multiple (>1 x) regression. The default is to include all specified x variables.

Variables Entered/Removed
Variables Entered: MASS; Variables Removed: (none); Method: Enter
a. All requested variables entered.

There is a statistically significant positive association between the mass of stones collected and the T cell activity (reject H₀: β₁ = 0 or, equivalently, βMASS = 0; p = 0.006). If the levels of the explanatory variable had been randomly assigned, we could say that a rise in stone mass causes a rise in T cell function. Concluding only that stone mass is "statistically significant" is stupid!!!

[Coefficients table as shown above; a. Dependent Variable: TCELL]

The three major causes of an association between variables X and Y are: changes in X cause changes in Y; changes in Y cause changes in X; and changes in a third variable cause changes in both X and Y. In addition, a Type 1 error can result in an apparent association between X and Y when there really is none.

Interpretation of the slope coefficient: a rise of 1 gram of mass is associated with an estimated mean rise of 0.033 mm in T cell function. It is also OK (better) to say that, e.g., a rise of 10 grams of mass is associated with an estimated 0.33 mm mean rise in T cell function; see the arithmetic sketched below.
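The slope interpretation is just arithmetic on the fitted line. For example, with the coefficients taken from the SPSS table above:

```python
b0, b1 = 0.087, 0.033  # fitted intercept and slope from the SPSS table

def mean_tcell(mass_g):
    """Estimated mean T cell response (mm) for stones of a given mass (g)."""
    return b0 + b1 * mass_g

print(mean_tcell(7.0))                    # estimated mean near the MASS mean
print(mean_tcell(8.0) - mean_tcell(7.0))  # 0.033 mm per 1 g rise (the slope)
print(10 * b1)                            # 0.33 mm per 10 g rise
```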

The intercept is the mean of Y when x = 0. In this example, since x = 0 is far from the rest of the data, an interpretation of the intercept would be an unwarranted extrapolation. If we were to make that extrapolation, we would express it as: the predicted (estimated) mean T cell function for wheatears that carry no stones is 0.087 mm. Here, the non-significant p-value (and the CI) tell us that we cannot rule out that the mean T cell response for 0 g of stones is 0 mm (we cannot reject H₀: β₀ = 0). Only interpret the intercept p-value when x = 0 makes sense, when we have data at or near x = 0, and when we care whether E(Y|x=0) equals zero or not, i.e., it was originally unknown.

The CI for BMASS has the interpretation that we are 95% confident that the mean rise in T cell function associated with a one-gram increase in stone mass is between 0.011 and 0.055 mm. Technically, when the assumptions of the model are met, CIs constructed by the standard recipe used here will include the true coefficient value 95% of the time (over repeated experiments, whether or not the null hypothesis is true, but only when the assumptions are true).

The mean squared error (MSE) is 0.007, which is the best estimate of σ². Our model predicts that, for any given x value, the distribution of the outcome for many subjects with that same x value will be normally distributed with mean β₀ + β₁x and variance 0.007, so sd = √0.007 ≈ 0.08 and 2 sd ≈ 0.16.

[ANOVA table as shown above; a. Predictors: (Constant), MASS]

The casewise diagnostics try to detect possible outliers; see the sketch below. Here they tell us that the subject farthest (vertically) from the best-fit line is subject 19, who has actual outcome 0.21 mm, predicted outcome 0.39 mm, and residual -0.18. So this bird had a T cell measurement 0.18 lower than expected by the model. The standardized residual divides by the standard deviation of the residuals; a value of -2.24 is not highly unlikely.

Casewise Diagnostics (a. Dependent Variable: TCELL)
Case Number   Std. Residual   TCELL   Predicted Value   Residual
19            -2.239          .21     .3944             -.184
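Something like SPSS's casewise diagnostics can be approximated by standardizing the residuals and flagging the large ones. This rough sketch divides by the residual standard deviation, a simplification of SPSS's exact per-case standardization (synthetic stand-in data as before):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(3.3, 10.0, 21)                   # synthetic MASS stand-in
y = 0.087 + 0.033 * x + rng.normal(0, 0.08, 21)  # synthetic TCELL stand-in

b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
resid = y - (b0 + b1 * x)
z = resid / resid.std(ddof=2)         # standardized residuals (df = n - 2)
# Flag cases more than 2 SDs from the line (there may be none):
for i in np.flatnonzero(np.abs(z) > 2):
    print(f"case {i + 1}: resid={resid[i]:+.3f}, std resid={z[i]:+.2f}")
```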

The normal P-P plot (similar to the quantile-quantile or quantile-normal plot) of the residuals helps test the assumption of a normal distribution of the Y values at each x by combining the residuals from all x values.

[Normal P-P Plot of Regression Standardized Residual; Dependent Variable: TCELL; Expected Cum Prob vs. Observed Cum Prob, each from 0.00 to 1.00]

The residual-vs-fit plot of standardized residual vs. standardized predicted value helps test the assumptions of linearity and equal spread. (Alternatively, un-standardized values are used, but they are a bit harder to get in SPSS.) Use the vertical band method of interpretation.

The R-squared value of 0.334 indicates that 33.4% of the overall variation in the outcome has been explained by the explanatory variable(s).

SSTotal = Σ(yᵢ − ȳ)², SSError = Σ(yᵢ − ŷᵢ)², SSExplained = SSTotal − SSError
R² = (SSTotal − SSError) / SSTotal: the fraction of the total deviations that is explained by the explanatory variable(s).
R² values range from 0 to 1 (high in physics, low in psychology).
Adjusted R² corrects for "more is always better" and is more realistic.
AIC/BIC are even better for model comparison/selection.

Model Summary
R      R Square   Adjusted R Square   Std. Error of the Estimate
.578   .334       .299                .0802
a. Predictors: (Constant), MASS

Conclusion: there are lots of useful regression results beyond just a p-value!
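The same diagnostics are available outside SPSS. A sketch with scipy and matplotlib, using a Q-Q plot in place of SPSS's P-P plot (the slide notes the two are similar) and computing R² from the sums of squares defined above (synthetic stand-in data as before):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(3.3, 10.0, 21)                   # synthetic MASS stand-in
y = 0.087 + 0.033 * x + rng.normal(0, 0.08, 21)  # synthetic TCELL stand-in

b1, b0 = np.polyfit(x, y, 1)
fit = b0 + b1 * x
resid = y - fit

ss_total = np.sum((y - y.mean()) ** 2)
ss_error = np.sum(resid ** 2)
print("R^2 =", (ss_total - ss_error) / ss_total)  # fraction explained

fig, (ax1, ax2) = plt.subplots(1, 2)
stats.probplot(resid, plot=ax1)                   # normality check (Q-Q plot)
ax2.scatter(fit, resid)                           # residual-vs-fit plot
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("fitted value")
ax2.set_ylabel("residual")
plt.show()
```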