
Topic 22 - Linear Regression and Correlation
STAT 511, Professor Bruce Craig

Overview

Consider one population but two variables. For each sampling unit we observe X and Y, and we assume a linear relationship between the variables. Regression and correlation assess this association. The relationship may be non-linear overall but linear in a particular region, and we can often transform Y and/or X to create a linear association. Examples: a dose-response curve models the response against log(dose); a Lineweaver-Burk plot models 1/velocity against 1/concentration.

Background reading: Devore, Sections 12.1 - 12.5.

Scatter Plot

A scatter plot allows visual assessment of the relationship: one variable (X) is plotted on the x-axis, the other variable (Y) on the y-axis. Linear regression determines the best line through the (X, Y) pairs,

    \hat{Y} = b_0 + b_1 X,

while the correlation coefficient r describes the tightness of the linear fit.

Example

The VO2 max readings for 8 healthy adults following exercise are recorded. Does it appear that VO2 max decreases with an increase in activity? Create a scatter plot to investigate the relationship.

    Subject   VO2 Max   Duration of Exercise (min)
       1        82              9.5
       2        74              9.9
       3        63             10.2
       4        65             10.0
       5        58             10.7
       6        44             11.0
       7        55             10.8
       8        48             11.0
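The scatter plot can be produced in SAS along the following lines. This is a minimal sketch: the data set name vo2 and the variable names subject, vo2max, and duration are mine, not from the notes.

data vo2;
   input subject vo2max duration;
   datalines;
1 82  9.5
2 74  9.9
3 63 10.2
4 65 10.0
5 58 10.7
6 44 11.0
7 55 10.8
8 48 11.0
;
run;

proc gplot data=vo2;
   plot vo2max*duration;   /* Y variable first, X variable second */
run;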

Basic Computations

We have n pairs of (x, y) values. The univariate summary statistics needed for the analysis are:

    Statistic        X                                   Y
    Mean             \bar{x}                             \bar{y}
    Sum of Squares   S_{xx} = \sum (x - \bar{x})^2       S_{yy} = \sum (y - \bar{y})^2
    Std Deviation    s_x = \sqrt{S_{xx}/(n-1)}           s_y = \sqrt{S_{yy}/(n-1)}

We also need a joint summary statistic,

    S_{xy} = \sum (x - \bar{x})(y - \bar{y}).

The sign of S_{xy} indicates the direction of the trend: a term (x - \bar{x})(y - \bar{y}) is positive when x > \bar{x} and y > \bar{y}, or when x < \bar{x} and y < \bar{y}; it is negative when x > \bar{x} and y < \bar{y}, or when x < \bar{x} and y > \bar{y}.

Least Squares Estimation

There are many different ways to fit a line, so we need a criterion to assess the best fit. Given b_0 and b_1, the predicted value at x is \hat{y} = b_0 + b_1 x. The residual y - \hat{y} represents the vertical distance of y from the fitted line. Least squares minimizes the sum of these squared residuals,

    SSE = \sum (y - \hat{y})^2.

The least squares estimates are

    b_1 = S_{xy} / S_{xx},    b_0 = \bar{y} - b_1 \bar{x},

also labeled \hat{\beta}_0 and \hat{\beta}_1. One can show these estimates result in SSE = S_{yy} - S_{xy}^2 / S_{xx}.
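The quantities S_{xx}, S_{yy}, and S_{xy} can be obtained from PROC CORR with the CSSCP option; a minimal sketch, assuming the vo2 data set entered above:

proc corr data=vo2 csscp;
   var duration vo2max;   /* corrected sums of squares and crossproducts */
run;

In the CSSCP matrix the diagonal entries are S_{xx} and S_{yy} and the off-diagonal entry is S_{xy}, so b_1 = S_{xy}/S_{xx} can then be computed by hand.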

Residual Standard Deviation

The residual standard deviation describes the closeness of the data to the fitted line, that is, how far above or below the line the y's tend to be. It is a standard deviation based on the squared residuals:

    \hat{\sigma}^2 = s^2 = SSE / (n-2) = \sum (y_i - \hat{y}_i)^2 / (n-2).

This is similar to the estimate s_y except for the use of \hat{y}_i and n - 2. We use n - 2 because the variation is about the line (i.e., we are using \hat{y}_i, not \bar{y}, as the predicted value). By the Normal approximation, about 95% of the observations fall within \pm 2s of the line.

Linear Model of X and Y

We want to generalize from the sample to the population. Assume that for each value of X we can observe different values of Y; if X indicates treatment, this is similar to ANOVA. The distribution of the Y's is assumed Normal with unknown mean. Given X, we have the conditional distribution of Y, and we model the mean of Y given X = x as a linear function:

    E(Y | X = x) = \beta_0 + \beta_1 x,    Y | X = x ~ N(\beta_0 + \beta_1 x, \sigma^2).

Assumptions

The conditional distribution of Y is Normal, with mean \mu_{Y|X=x} = \beta_0 + \beta_1 x and constant standard deviation \sigma_{Y|X=x} (so we can drop the X = x). Compare this with the ANOVA problem: there we assume the mean depends on the treatment group and assume constant variance. ANOVA is just a special case of regression.

Prediction

Recall that the predicted value for X = x_0 is \hat{y} = b_0 + b_1 x_0. We must use caution in the interpretation of \hat{y}: if x_0 is within the range of the x's, it is interpolation; if x_0 is outside the range of the x's, it is extrapolation. Extrapolation should be avoided, since there is no assurance the relationship is still linear outside the observed range.
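The residual standard deviation s is what PROC REG labels "Root MSE". A small sketch on the vo2 data set from earlier, also saving the residuals so that s^2 = SSE/(n-2) can be verified by hand:

proc reg data=vo2;
   model vo2max = duration;   /* Root MSE in the output equals s */
   output out=fit r=resid;
run;

proc means data=fit uss n;
   var resid;   /* USS of the residuals equals SSE = 69.288 here, so s^2 = 69.288/6 */
run;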

Prediction (continued)

The standard error of \hat{y} depends on whether we are estimating

(a) a conditional mean, i.e., a point on the regression line:

    s^2 ( 1/n + (x_0 - \bar{x})^2 / S_{xx} )

(b) a future observation, i.e., a point that may vary from the line:

    s^2 ( 1 + 1/n + (x_0 - \bar{x})^2 / S_{xx} )

The argument is similar to that for a confidence interval versus a prediction interval.

Example

Recall the VO2 max example. Construct a 95% CI for the mean VO2 level when a healthy adult exercises for 10.5 minutes. First, we need to estimate the regression line. The summary statistics necessary for this calculation are shown below.

    \sum y = 489    \sum x = 83.1    \sum y^2 = 31023    \sum xy = 5030.8    \sum x^2 = 865.43

From these,

    S_{yy} = 31023 - 489^2/8 = 1132.88
    S_{xx} = 865.43 - 83.1^2/8 = 2.23
    S_{xy} = 5030.8 - 489(83.1)/8 = -48.69

so b_1 = -48.69/2.23 = -21.84 and b_0 = (489/8) + 21.84(83.1/8) = 288.0.

The table below computes the predicted values and residuals:

     x       y    Predicted   Residual   (y - \hat{y})^2
     9.5    82      80.520      1.480         2.190
     9.9    74      71.784      2.216         4.911
    10.2    63      65.232     -2.232         4.982
    10.0    65      69.600     -4.600        21.160
    10.7    58      54.312      3.688        13.601
    11.0    44      47.760     -3.760        14.138
    10.8    55      52.128      2.872         8.248
    11.0    48      47.760      0.240         0.058
    83.1   489                  0.000        69.288

Since SSE = 69.288, s = \sqrt{69.288/6} = 3.40. The predicted value for x = 10.5 is 288.0 - 21.84(10.5) = 58.68. We are interested in the average VO2 level, so

    SE(\hat{y}) = 3.40 \sqrt{ 1/8 + (10.5 - 10.3875)^2 / 2.23 } = 1.23.

Because we use s, the df is 6, so a 95% CI is 58.68 \pm 2.447(1.23) = (55.67, 61.69). An interval which 95% of the time would contain the observed y for a person exercising x = 10.5 minutes is 58.68 \pm 2.447(3.615) = (49.83, 67.53).

Standard Error of b_1

As with all estimates, b_1 is subject to sampling error. The standard error of b_1 is

    s_{b_1} = \sqrt{ s^2 / S_{xx} }.

In situations where the X's are under experimental control: if S_{xx} is made large, the SE is small. S_{xx} is increased by increasing the dispersion of the x's (spreading them out), and S_{xx} also increases if n increases.
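The intervals computed by hand above, (55.67, 61.69) and (49.83, 67.53), can be reproduced in PROC REG with the CLM and CLI options. A common idiom, sketched here under the same assumed vo2 data set, is to append an observation with x_0 = 10.5 and a missing response: it is ignored when fitting but still receives predicted values and intervals.

/* Row with missing response at x0 = 10.5 */
data x0;
   input subject vo2max duration;
   datalines;
9 . 10.5
;
run;

data vo2plus;
   set vo2 x0;
run;

proc reg data=vo2plus;
   model vo2max = duration / clm cli;  /* clm: CI for the mean; cli: prediction interval */
run;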

Inference for \beta_1

We need the sampling distribution of b_1 to construct a CI or perform a hypothesis test. Given the normality assumption, the sampling distribution of b_1 is also normal.

    CI: b_1 \pm t_{\alpha/2} s_{b_1}   (df = n - 2)

    Hypothesis test of H_0: \beta_1 = \beta_1^0:   t_s = (b_1 - \beta_1^0) / s_{b_1}

We can test \beta_1 = 0 to see whether there is a linear association.

Standard Error of b_0

Sometimes we are interested in the intercept \beta_0. The standard error of b_0 is

    s_{b_0} = \sqrt{ s^2 ( 1/n + \bar{x}^2 / S_{xx} ) }.

In situations where X is under experimental control: if S_{xx} is made large, the SE is small (increase S_{xx} by increasing the dispersion of the x's); if \bar{x} is close to zero, the SE is small; and increasing n increases S_{xx}.

Inference for \beta_0

Again we need the sampling distribution to construct a CI or perform a hypothesis test. Given the normality assumption, the sampling distribution of b_0 is also normal.

    CI: b_0 \pm t_{\alpha/2} s_{b_0}   (df = n - 2)

    Hypothesis test of H_0: \beta_0 = \beta_0^0:   t_s = (b_0 - \beta_0^0) / s_{b_0}

Joint confidence regions for (\beta_0, \beta_1) can be computed using the F distribution.

Example

Recall the VO2 max problem. The standard errors for b_1 and b_0 are

    s_{b_1} = \sqrt{ (69.288/6) / 2.23 } = 2.27
    s_{b_0} = \sqrt{ (69.288/6) ( 1/8 + (83.1/8)^2 / 2.23 ) } = 23.67

The 95% CIs are

    -21.84 \pm 2.447(2.27) = (-27.39, -16.29)
    288.0 \pm 2.447(23.67) = (230.08, 345.92)

Since 0 is not in the CI for \beta_1, we can say that there is a linear association and it is a negative association: as the amount of exercise increases, there is a decrease in the VO2 max.

We must be careful interpreting anything outside of the x range. We should not feel comfortable saying that the VO2 max at rest is somewhere between 230.08 and 345.92 with 95% confidence.
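These parameter intervals can be requested with the CLB option on the MODEL statement (the ALPHA= option controls the level). A minimal sketch, again assuming the vo2 data set; the printed limits should match (-27.39, -16.29) and (230.08, 345.92) up to rounding.

proc reg data=vo2;
   model vo2max = duration / clb alpha=0.05;  /* 95% confidence limits for beta0 and beta1 */
run;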

The Correlation Coefficient

The correlation coefficient describes how close the data cluster about the line. It is a dimensionless statistic:

    r = S_{xy} / \sqrt{ S_{xx} S_{yy} }.

Properties: r has the same sign as b_1 and describes both the direction and the tightness of the linear relationship. Symmetry: X and Y can be interchanged without altering the value of r.

The quantity r^2 is known as the coefficient of determination, the percent of the total variation in Y explained by the regression:

    r^2 = 1 - SSE / S_{yy}.

If a straight line fits exactly, SSE = 0, so r = \pm 1 and r^2 = 100%. If there is no linear association, SSE = S_{yy}, so r = 0 and r^2 = 0%.

Coefficient of Determination

r^2 determines the percent of variability in Y explained by the linear relationship with X, r^2 = 1 - SSE/SST. We can also use r^2 to approximate the reduction in the standard deviation:

    s \approx s_Y \sqrt{ 1 - r^2 }.

Prior to the regression, the standard deviation of Y is s_Y; after regressing Y on X, the standard deviation about the line is s. Thus \sqrt{1 - r^2} is the approximate reduction in standard deviation (i.e., closeness).

Hypothesis Test for \rho

Here r is the sample correlation coefficient; we use \rho to denote the population correlation coefficient. Under the linear model with normal errors,

    \rho = \beta_1 \sigma_X / \sigma_Y    and    b_1 = r \sqrt{ S_{yy} / S_{xx} }.

We can do a t-test to see if there is a population linear association (the same as H_0: \beta_1 = 0):

    t_s = r \sqrt{n - 2} / \sqrt{1 - r^2}.
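PROC CORR reports r and the p-value for the test of H_0: \rho = 0, which for simple regression matches the t-test of H_0: \beta_1 = 0 from PROC REG. In recent SAS releases the FISHER option also prints the confidence limits based on the Fisher transformation derived on the next slide. A sketch on the vo2 data:

proc corr data=vo2 fisher;
   var duration vo2max;   /* r, the test of H0: rho = 0, and Fisher z confidence limits */
run;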

Confidence Interval for \rho

If the sample size is large, we can construct a CI for \rho based on the Fisher transformation of r:

    V = (1/2) \ln( (1 + r) / (1 - r) ) \approx N( (1/2) \ln( (1 + \rho) / (1 - \rho) ),  1/(n - 3) ).

Construct a CI for the transformed parameter, then convert back to a CI for \rho:

    c_1 = v - z_{\alpha/2} / \sqrt{n - 3},    c_2 = v + z_{\alpha/2} / \sqrt{n - 3}

    CI for \rho:  ( (e^{2 c_1} - 1) / (e^{2 c_1} + 1),  (e^{2 c_2} - 1) / (e^{2 c_2} + 1) ).

Example - Transformation (Bates and Watts: Nonlinear Regression)

Consider the data set used to describe the relationship between the velocity of an enzymatic reaction (V) and the substrate concentration (C). Consider only the experiment where the enzyme is treated with Puromycin. The common model used to describe the relationship between velocity and concentration is the Michaelis-Menten model

    V = \theta_1 C / (\theta_2 + C),

where \theta_1 is the maximum velocity of the reaction and \theta_2 describes how quickly (in terms of increasing concentration) the reaction will reach maximum velocity.

[Scatter plot: Velocity vs. Concentration for the Puromycin data]

Since this is a non-linear model, one approach is to transform the variables so that a linear relationship exists. While this usually works quite well, one must be aware that the transformation changes the distribution of the data. For example, this model can be rewritten as a linear model in the inverse velocity and inverse concentration:

    1/V = 1/\theta_1 + (\theta_2 / \theta_1)(1/C).

[Scatter plot: 1/Velocity vs. 1/Concentration]

One must be careful with this transformation for two reasons. First, since we are looking at inverse concentrations, very low concentrations will be highly influential in the regression analysis. Second, the equal variance assumption is often violated; the variance violation is more easily picked up when there are replicates. Also notice that the two observations at a very low concentration are now much further separated from the others, making these observations more influential.
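The inverse transformation above takes only a short data step. This is a sketch: the data set name puromycin and the variables conc and velocity are illustrative, not from the notes. The fitted intercept estimates 1/\theta_1 and the slope estimates \theta_2/\theta_1, so \theta_1 = 1/b_0 and \theta_2 = b_1/b_0 can be recovered afterward.

data puro_inv;
   set puromycin;
   inv_v = 1/velocity;   /* 1/V */
   inv_c = 1/conc;       /* 1/C */
run;

proc reg data=puro_inv;
   model inv_v = inv_c;  /* intercept = 1/theta1, slope = theta2/theta1 */
run;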

Since the variance appears constant in the untransformed plot, the better way to estimate the parameters is to use non-linear estimation methods. These procedures are beyond the scope of the class, but one could view them as an iterative least squares approach: some initial estimates of the parameters are given, the predicted values are calculated using the untransformed model, and new parameter estimates are proposed until the residual sum of squares decreases to a minimum (a sketch using PROC NLIN is given at the end of these notes). The two plots below show the fitted line in reference to the original data using both the linear and non-linear approaches.

    Method        \hat{\theta}_1    \hat{\theta}_2
    Regression        195.80           0.0484
    Non-Linear        212.70           0.0641

[Two plots: Velocity vs. Concentration with the fitted curve from the linear (transformed) fit and from the non-linear fit]

SAS Proc Reg

options nocenter ls=75;
goptions colors=(none);

PROC IMPORT OUT=EXAMPLE
     DATAFILE="U:\.www\datasets5\exp2-0.xls"
     DBMS=EXCEL2000 REPLACE;
     GETNAMES=YES;
RUN;

proc gplot data=example;
   plot y_*x_;
run;

proc reg data=example;
   model y_=x_ / clb clm cli p r;
   output out=a2 p=pred r=resid;
run;

proc gplot data=a2;
   plot resid*x_ / vref=0;
run;

Analysis of Variance

                                 Sum of         Mean
Source             DF           Squares       Square    F Value    Pr > F
Model               1          39.68595     39.68595     418.32    <.0001
Error              28           2.65634      0.09487
Corrected Total    29          42.34230

Root MSE           0.30801    R-Square    0.9373
Dependent Mean     2.84033    Adj R-Sq    0.9350
Coeff Var         10.8441

Parameter Estimates

                         Parameter      Standard
Variable         DF       Estimate         Error    t Value    Pr > |t|
Intercept         1       -0.39774       0.16801      -2.37      0.025
x_                1        3.07997       0.15059      20.45      <.0001

Parameter Estimates

Variable         DF       95% Confidence Limits
Intercept         1       -0.74189      -0.05359
x_                1        2.77150       3.38843

Output Statistics

                Dep Var    Predicted    Std Error
Obs                  y_        Value Mean Predict           95% CL Mean
  1              1.0200       0.8342       0.1130      0.6027      1.0658
  2              1.2100       0.8958       0.1105      0.6696      1.1221
  3              0.8800       1.0806       0.1028      0.8701      1.2912
  4              0.9800       1.1730       0.0990      0.9702      1.3759
  5              1.5200       1.3578       0.0917      1.1699      1.5458
  6              1.8300       1.4502       0.0882      1.2695      1.6309
  7              1.5000       1.7582       0.0772      1.6001      1.9164

                                          Std Error     Student
Obs       95% CL Predict       Residual    Residual    Residual    -2 -1 0 1 2
  1     0.1622      1.5063       0.1858       0.287       0.648    |     *    |
  2     0.2256      1.5661       0.3142       0.288       1.093    |     **   |
  3     0.4155      1.7458      -0.2006       0.290      -0.691    |    *|    |
  4     0.5103      1.8358      -0.1930       0.292      -0.662    |    *|    |
  5     0.6995      2.0162       0.1622       0.294       0.552    |     *    |
  6     0.7939      2.1065       0.3798       0.295       1.287    |     **   |
  7     1.1078      2.4087      -0.2582       0.298      -0.866    |    *|    |
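The iterative least squares fit described above is what PROC NLIN performs. A minimal sketch, again assuming a puromycin data set with variables conc and velocity; the starting values are rough guesses from the scatter plot, not values from the notes.

proc nlin data=puromycin;
   parameters theta1=200 theta2=0.05;               /* initial estimates */
   model velocity = theta1*conc / (theta2 + conc);  /* Michaelis-Menten mean function */
run;

Estimates near \hat{\theta}_1 = 212.70 and \hat{\theta}_2 = 0.0641 would match the non-linear row of the table above.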