Regression Analysis. Simple Regression Multivariate Regression Stepwise Regression Replication and Prediction Error EE290H F05


Regression Analysis: Simple Regression, Multivariate Regression, Stepwise Regression, Replication and Prediction Error

Regression Analysis

In general, we "fit" a model by minimizing a metric that represents the error:

min Σ_{i=1..n} (y_i - ŷ_i)^2

The sum of squares gives closed-form solutions and minimum-variance estimates for linear models.

The Simplest Regression Model

Line through the origin:

y_u = β x_u + ε_u,  u = 1, 2, ..., n,  ε_u ~ N(0, σ_R^2)

Here η_u = β x_u is the true value of the model, ŷ = b x is the fitted line, b is the estimate of β, ŷ is the estimate of η_u, and s^2 is the estimate of σ_R^2.

Using the Normal Equation to Fit the Line-Through-the-Origin Model

Our model has only one degree of freedom; this is why the fitted values are confined to the line ŷ = b x (1 d.f.). We choose b to solve

min Σ (y - ŷ)^2

Using the Normal Equation (cont.) (fitting the line-through-the-origin model)

Choose b so that the residual vector is perpendicular to the model vector:

Σ (y - ŷ) x = 0  ⟹  Σ (y - b x) x = 0  ⟹  b = Σ x y / Σ x^2  (estimate of β)

s^2 = S_R / (n - 1)  (estimate of σ_R^2)

V(b) = s^2 / Σ x^2

67% confidence interval: b ± sqrt( s^2 / Σ x^2 )

Significance test: t = (b - β*) / sqrt( s^2 / Σ x^2 ) ~ t_{n-1}
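The formulas above can be sketched directly in code. This is a minimal illustration, not the slides' etch data; the function name and the sample values are invented:

```python
import math

def fit_through_origin(x, y):
    """Least-squares line through the origin: b = sum(xy) / sum(x^2),
    with s^2, V(b), and the t statistic from the slide."""
    n = len(x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    S_R = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
    s2 = S_R / (n - 1)                       # estimate of sigma_R^2 (one d.f. used by b)
    V_b = s2 / sum(xi * xi for xi in x)      # variance of the slope estimate
    t = b / math.sqrt(V_b)                   # t statistic for H0: beta = 0
    return b, s2, V_b, t

# Invented data roughly following y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.0, 8.0]
b, s2, V_b, t = fit_through_origin(x, y)
print(b, t)   # slope near 2; a large t says the slope is clearly significant
```
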

Etch time vs. removed material: y = b x

[Figure: removed material (nm, 0-500) vs. etch time (sec x 10^3, 0.0-1.0); the data fall on a line through the origin.]

Variable Name    Coefficient  Std. Err. Estimate  t Statistic  Prob > |t|
Etch Time (sec)  0.501        0.0162              30.9         0.000

Model Validation through ANOVA

The idea is to decompose the sum of squares into orthogonal components. Assuming that there is no need for a model at all* (always a good null hypothesis!):

H_0: β* = 0

Σ y_u^2  =  Σ ŷ_u^2  +  Σ (y_u - ŷ_u)^2
n (total) =  p (model) + n-p (residual)  degrees of freedom

* This is equivalent to saying that y ~ N(μ, σ^2), where μ and σ are constants, independent of x.

Model Validation through ANOVA (cont.)

Assuming a specific model, H_0: β = β*:

Σ (y_u - β* x_u)^2  =  Σ (ŷ_u - β* x_u)^2  +  Σ (y_u - ŷ_u)^2
n (total)           =  p (model)           + n-p (residual)  degrees of freedom

The ANOVA table will answer the question: Is there a relationship between x and y?
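The orthogonality that makes this decomposition exact can be checked numerically. A small sketch for the line-through-the-origin fit, with invented data:

```python
def anova_origin_fit(x, y):
    """ANOVA decomposition for the line-through-the-origin fit:
    total SS = model SS + residual SS (orthogonal components)."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    yhat = [b * xi for xi in x]
    total = sum(yi * yi for yi in y)                        # sum of y_u^2,       n d.f.
    model = sum(yh * yh for yh in yhat)                     # sum of yhat_u^2,    p = 1 d.f.
    resid = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # sum of (y-yhat)^2,  n-p d.f.
    F = (model / 1) / (resid / (len(x) - 1))                # F ratio for H0: beta = 0
    return total, model, resid, F

# Invented data roughly following y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.0, 8.0]
total, model, resid, F = anova_origin_fit(x, y)
print(total, model + resid)   # equal: the components are orthogonal
```

Because the residual is perpendicular to the fitted values, the two right-hand sums reproduce the total exactly (up to floating-point error).
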

ANOVA Table and Residual Plot

Source  Sum of Squares  Deg. of Freedom  Mean Squares  F-Ratio  Prob>F
Model   1.83e+5         1                1.83e+5       1.98e+2  2.17e-6
Error   6.47e+3         7                9.24e+2
Total   1.89e+5         8

[Figure: residuals vs. etch time (sec x 10^3); the residuals scatter between about -60 and +60 with no visible pattern.]

A More Complex Regression Equation: a straight line with two parameters

actual:     η = α + β (x - x̄)
estimated:  ŷ = a + b (x - x̄),   y_i ~ N(η_i, σ^2)

Minimize R = Σ (y_i - ŷ_i)^2 to estimate α and β:

a = ȳ
b = Σ (x_i - x̄) y_i / Σ (x_i - x̄)^2 = Σ (x_i - x̄)(y_i - ȳ) / Σ (x_i - x̄)^2

Are a and b good estimators of α and β? They are unbiased:

E[a] = α
E[b] = Σ (x_i - x̄) E[y_i] / Σ (x_i - x̄)^2 = β
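The centered parametrization makes the estimates especially simple to compute. A minimal sketch with invented data chosen to lie exactly on a line:

```python
def fit_centered_line(x, y):
    """Least squares for y = a + b (x - xbar): a = ybar,
    b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
    a = ybar        # in the centered parametrization the intercept is ybar
    return a, b, xbar

# Exact line y = 3 + 2 (x - xbar), so the estimates recover alpha = 3, beta = 2
x = [1.0, 2.0, 3.0, 4.0]
y = [0.0, 2.0, 4.0, 6.0]
a, b, xbar = fit_centered_line(x, y)
print(a, b)   # → 3.0 2.0
```
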

Variance Estimation

Note that all variability comes from the y_i:

V[a] = V[ Σ y_i / k ] = (1/k^2) Σ V[y_i] = σ^2 / k

V[b] = V[ Σ (x_i - x̄) y_i / Σ (x_i - x̄)^2 ] = σ^2 / Σ (x_i - x̄)^2

These are the minimum variances, thanks to least squares!

LTO thickness vs. deposition time: y = a + b x

[Figure: LTO thickness (Å x 10^3, 1-4) vs. deposition time (x 10^3, 1.0-3.5); the data fall on a straight line.]

Variable Name  Coefficient  Std. Err. Estimate  t Statistic  Prob > |t|
Constant       6.04e+1      5.61e+1             1.08e+0      0.030
Dep time       9.75e-1      2.52e-2             3.87e+1      0.000

ANOVA Table and Residual Plot

Source  Sum of Squares  Deg. of Freedom  Mean Squares  F-Ratio  Prob>F
Model   4.77e+6         1                4.77e+6       1.50e+3  0.000
Error   5.09e+4         16               3.18e+3
Total   4.82e+6         17

[Figure: residuals vs. deposition time (x 10^3, 1.0-3.5); the residuals scatter between about -100 and +100.]

ANOVA Representation

[Figure: geometric picture at a point x_i showing the data point (x_i, y_i), the fitted line ŷ_i = a + b(x_i - x̄), and the true line η_i = α + β(x_i - x̄); the gap (y_i - η_i) is split into the pieces (a - α), (b - β)(x_i - x̄), and (y_i - ŷ_i).]

Note the differences between the "true" and the "estimated" model.

ANOVA Representation (cont.)

(y_i - η_i) = (a - α) + (b - β)(x_i - x̄) + (y_i - ŷ_i)

Σ (y_i - η_i)^2  =  k (a - α)^2  +  (b - β)^2 Σ (x_i - x̄)^2  +  Σ (y_i - ŷ_i)^2
~ σ^2 χ^2(k)       ~ σ^2 χ^2(1)    ~ σ^2 χ^2(1)                 ~ σ^2 χ^2(k-2)

In this way, the significance of the model can be analyzed in detail.

Confidence Limits of an Estimate

ŷ_0 = ȳ + b (x_0 - x̄)

V(ŷ_0) = V(ȳ) + (x_0 - x̄)^2 V(b) = [ 1/n + (x_0 - x̄)^2 / Σ (x - x̄)^2 ] s^2

Prediction interval: ŷ_0 ± t_{α/2} sqrt( V(ŷ_0) )
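The variance formula shows why the interval is narrowest at x̄ and widens away from it. A minimal sketch; the design points, s², and the t value are all invented for illustration (a real analysis would look up t_{α/2} for the chosen α and degrees of freedom):

```python
import math

def prediction_variance(x, s2, x0):
    """V(yhat0) = [1/n + (x0 - xbar)^2 / Sxx] * s^2, per the slide."""
    n = len(x)
    xbar = sum(x) / n
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    return (1.0 / n + (x0 - xbar) ** 2 / Sxx) * s2

# Invented design points and residual mean square
x = [1.0, 2.0, 3.0, 4.0]
s2 = 4.0
t_crit = 2.0   # assumed t_{alpha/2} value, for illustration only

v_center = prediction_variance(x, s2, 2.5)  # at xbar the variance is s^2/n
v_edge = prediction_variance(x, s2, 4.0)    # larger away from the center
half_width = t_crit * math.sqrt(v_edge)     # yhat0 ± t_{alpha/2} sqrt(V(yhat0))
print(v_center, v_edge)
```
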

Confidence Interval of Prediction (all points)

[Figure: LTO thickness (1000-3000) vs. deposition time (1000-3000), leverage plot; the fitted line with its confidence band, narrowest near the center of the data.]

Confidence Interval of Prediction (half the points)

[Figure: same leverage plot fitted to half the points; the confidence band is wider.]

Confidence Interval of Prediction (1/4 of points)

[Figure: same leverage plot fitted to a quarter of the points; the confidence band is wider still.]

Prediction Error vs. Experimental Error

[Figure: true model and estimated model; the gap between a data point and the true model is the experimental error, and the gap between the estimated and the true model is the prediction error.]

Experimental error does not depend on location or sample size. Prediction error depends on location and gets smaller as the sample size increases.

Multivariate Regression

η = β_1 x_1 + β_2 x_2

[Figure: y is projected onto the plane spanned by x_1 and x_2; the residual R is perpendicular to ŷ, x_1, and x_2.]

Coefficient estimation:

Σ (y - ŷ) x_1 = 0  ⟹  Σ y x_1 - b_1 Σ x_1^2 - b_2 Σ x_1 x_2 = 0
Σ (y - ŷ) x_2 = 0  ⟹  Σ y x_2 - b_2 Σ x_2^2 - b_1 Σ x_1 x_2 = 0

Variance Estimation

s^2 = S_R / (n - p)

V(b_1) = [ 1 / (1 - ρ^2) ] s^2 / Σ x_1^2
V(b_2) = [ 1 / (1 - ρ^2) ] s^2 / Σ x_2^2

ρ = - Σ x_1 x_2 / sqrt( Σ x_1^2 Σ x_2^2 )
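Correlation between the regressors inflates both coefficient variances by 1/(1 - ρ²). A small sketch (the slide's ρ for the estimates carries a minus sign, but only ρ² enters the inflation factor; the vectors below are invented):

```python
import math

def variance_inflation(x1, x2):
    """Correlation between two centered regressors and the 1/(1 - rho^2)
    factor by which it inflates V(b1) and V(b2)."""
    rho = sum(a * b for a, b in zip(x1, x2)) / math.sqrt(
        sum(a * a for a in x1) * sum(b * b for b in x2))
    return rho, 1.0 / (1.0 - rho ** 2)

# Orthogonal regressors: no inflation
rho0, f0 = variance_inflation([1, 1, -1, -1], [1, -1, 1, -1])
print(rho0, f0)   # → 0.0 1.0

# Correlated regressors: variances inflated by 1/(1 - rho^2)
rho1, f1 = variance_inflation([1, 1, -1, -1], [1, 1, 1, -1])
print(rho1, f1)   # rho = 0.5, factor = 4/3
```

This is one reason orthogonal (factorial) designs are attractive: with ρ = 0 the coefficient estimates are as precise as the data allow.
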

Thickness vs. time and temperature: y = a + b_1 x_1 + b_2 x_2

Variable Name  Coefficient  Std. Err. Estimate  t Statistic  Prob > |t|
Constant       -7.04e+2     7.18e+1             -9.80e+0     0.000
temp           7.14e-1      7.00e-2             1.02e+1      0.000
time min       8.69e-1      3.89e-2             2.23e+1      0.000

ANOVA Table and Correlation of Estimates

Source  Sum of Squares  Deg. of Freedom  Mean Squares  F-Ratio  Prob>F
Model   2.58e+4         2                1.29e+4       3.01e+2  0.000
Error   7.71e+2         18               4.28e+1
Total   2.66e+4         20

Correlation matrix (data file: tox regression):

          Tox    Temp   Time
Tox nm    1.000  0.410  0.896
temp      0.410  1.000  0.000
time min  0.896  0.000  1.000

Multiple Regression in General

With X = [x_1 x_2 ... x_p], the model is y = X b + e. Minimize

||X b - y||^2 = ||e||^2 = (y - X b)^T (y - X b)

which is equivalent to making the residual perpendicular to the columns of X:

(y - X b)^T X = 0  ⟹  X^T X b = X^T y  ⟹  b = (X^T X)^{-1} X^T y

V(b) = (X^T X)^{-1} σ^2
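The normal equations XᵀX b = Xᵀy can be solved without any matrix library. A minimal pure-Python sketch (the helper names and the data are invented; the data lie exactly on y = 1 + 2 x₁ + 3 x₂ so the fit recovers the coefficients):

```python
def solve(A, c):
    """Solve the linear system A b = c by Gaussian elimination with
    partial pivoting (A is a small square matrix as nested lists)."""
    n = len(A)
    M = [row[:] + [c[i]] for i, row in enumerate(A)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    b = [0.0] * n
    for k in range(n - 1, -1, -1):     # back substitution
        b[k] = (M[k][n] - sum(M[k][j] * b[j] for j in range(k + 1, n))) / M[k][k]
    return b

def least_squares(X, y):
    """b = (X^T X)^(-1) X^T y via the normal equations X^T X b = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Rows are [1, x1, x2]; y = 1 + 2*x1 + 3*x2 exactly (invented data)
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
b = least_squares(X, y)
print(b)   # coefficients close to [1, 2, 3]
```

In practice one would use a QR or SVD-based solver for numerical stability, but the normal-equation form mirrors the slide's algebra directly.
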

Joint Confidence Region for β_1, β_2

S = S_R [ 1 + p/(n-p) F_α(p, n-p) ]

(β_1 - b_1)^2 Σ x_1^2 + 2 (β_1 - b_1)(β_2 - b_2) Σ x_1 x_2 + (β_2 - b_2)^2 Σ x_2^2 = S - S_R

What if a linear model is not enough?

[Figure: deposition rate (100-300) vs. inlet temperature (600-650); the straight-line fit leaves visible curvature.]

Variable Name  Coefficient  Std. Err. Estimate  t Statistic  Prob > |t|
Constant       -1.85e+3     4.64e+1             -3.99e+1     0.000
inlet temp     3.24e+0      7.46e-2             4.35e+1      0.000

ANOVA Table and Residual Plot

Source  Sum of Squares  Deg. of Freedom  Mean Squares  F-Ratio  Prob>F
Model   3.65e+4         1                3.65e+4       1.89e+3  0.000
Error   4.06e+2         21               1.93e+1
Total   3.69e+4         22

[Figure: residuals vs. inlet temperature (600-650); the residuals show a clear curved, systematic pattern.]

Multiple Regression with Replication

With duplicate runs at each setting, the pure-error sum of squares is

S_E = (1/2) Σ_i (y_i1 - y_i2)^2

and the lack-of-fit sum of squares is S_LF = S_R - S_E. The residual decomposes as

Σ_i Σ_v (y_iv - ŷ_i)^2  =  Σ_i Σ_v (y_iv - ȳ_i.)^2  +  Σ_i n_i (ȳ_i. - ŷ_i)^2
(residual)                 (pure error)                (lack of fit)

so the lack of fit of the model can be tested against the pure (replication) error.
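The pure-error / lack-of-fit split can be sketched for a straight-line fit with duplicate runs. The data below are invented so that the cell means fall exactly on a line, leaving essentially no lack of fit:

```python
def lack_of_fit(x, y):
    """Split the residual SS of a straight-line fit into pure error
    (from replicate runs) and lack of fit: S_LF = S_R - S_E."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    S_R = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    # Pure error: deviations of replicates from their own cell mean
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    S_E = sum(sum((yi - sum(g) / len(g)) ** 2 for yi in g)
              for g in groups.values())
    return S_R, S_E, S_R - S_E

# Duplicate runs whose cell means (3, 5, 7) lie exactly on y = 1 + 2x
x = [1, 1, 2, 2, 3, 3]
y = [2.9, 3.1, 4.9, 5.1, 6.9, 7.1]
S_R, S_E, S_LF = lack_of_fit(x, y)
print(S_R, S_E, S_LF)   # S_LF near 0: no lack of fit
```

For duplicates, the grouped sum reproduces S_E = ½ Σ (y_i1 - y_i2)²; a large S_LF relative to S_E (via their mean squares) would flag an inadequate model, as in the slides that follow.
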

Pure Error vs. Lack of Fit Example

Lack of Fit
Source       DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Lack of Fit  17  401.01          23.59        21.04    0.005
Pure Error   4   4.49            1.12
Total Error  21  405.50

Parameter Estimates
Term        Estimate  Std Error  t Ratio  Prob > |t|
Intercept   -1850.16  46.42      -39.85   0.000
inlet temp  3.24      0.07       43.47    0.000

Model Test
Source      DF  Sum of Squares  F Ratio  Prob > F
inlet temp  1   36489.55        999.99   0.000

Deposition rate vs. temperature: y = a + b x + c x^2

[Figure: deposition rate (100-300) vs. inlet temperature (600-650) with the quadratic fit.]

Variable Name  Coefficient  Std. Err. Estimate  t Statistic  Prob > |t|
Constant       8.34e+3      1.80e+3             4.66e+0      0.000
inlet temp     -2.94e+1     5.74e+0             -5.13e+0     0.000
inlet temp^2   2.62e-2      4.60e-3             5.69e+0      0.000

Pure Error vs. Lack of Fit Example (cont.)

Lack of Fit
Source       DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Lack of Fit  16  150.24          9.39         8.37     0.026
Pure Error   4   4.49            1.12
Total Error  20  154.73

Parameter Estimates
Term          Estimate  Std Error  t Ratio  Prob > |t|
Intercept     8339.05   1789.92    4.66     0.0002
inlet temp^1  -29.45    5.74       -5.13    0.0001
inlet temp^2  0.03      0.005      5.69     0.0000

Model Test
Source               DF  Sum of Squares  F Ratio  Prob > F
Poly(inlet temp, 2)  2   36740.32        999.99   0.0000

ANOVA Table and Residual Plot

Source  Sum of Squares  Deg. of Freedom  Mean Squares  F-Ratio  Prob>F
Model   3.67e+4         2                1.84e+4       2.37e+3  0.000
Error   1.55e+2         20               7.74e+0
Total   3.69e+4         22

[Figure: residuals vs. inlet temperature (600-650); the residuals now scatter between about -6 and +6 with no systematic pattern.]

Using the regression line to predict LTO thickness

y = 60.352 + 0.97456 x,   R^2 = 0.989
y = -38.440 + 1.0153 x,   R^2 = 0.989

[Figure: two panels of LTO thickness (Å) vs. deposition time (sec), each showing a fitted line with its 90% lower and upper limits.]

Response Surface Methodology

Objectives:
- get a feel for the I/O relationships
- find setting(s) that satisfy multiple constraints
- find settings that lead to optimum performance

Observations:
- the function is nearly linear away from the peak
- the function is nearly quadratic at the peak

Building the Planar Model

A factorial experiment with center points is enough to build and confirm a planar model:

b_1, b_2, b_12 = -0.65 ± 0.75
b_11 + b_22 (from the factorial vs. center-point averages) = -0.50 ± 1.15

Quadratic Model and Confirmation Run

Close to the peak, a quadratic model can be built and confirmed by an expanded two-phase experiment.

Response Surface Methodology

RSM consists of creating models that lead to visual images of a response. The models are usually linear or quadratic in nature. Either expanded factorial experiments or regression analysis can be used. All empirical models have a random prediction error. In RSM, the average variance of the model is:

V(ŷ) = (1/n) Σ_{i=1..n} V(ŷ_i) = p σ^2 / n

where p is the number of model parameters and n is the number of experiments.
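The pσ²/n result follows because V(ŷ_i) = σ² h_ii, where the leverages h_ii of the design always sum to p. A small check for the straight-line model (p = 2), with invented design points:

```python
def leverages(x):
    """Leverages h_ii for the straight-line model y = a + b x:
    h_ii = 1/n + (x_i - xbar)^2 / Sxx, with sum(h_ii) = p = 2."""
    n = len(x)
    xbar = sum(x) / n
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / Sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
h = leverages(x)
p = 2               # parameters in the line model
sigma2 = 1.0        # assumed error variance, for illustration
avg_V = sigma2 * sum(h) / len(x)
print(sum(h), avg_V)   # leverages sum to p, so avg V(yhat) = p*sigma^2/n
```
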

Response Surface Exploration

"Popular" RSM

- Use single-stage Box-Behnken or Box-Wilson (central composite) designs
- Use computer (simulated) experiments
- Rely on "goodness of fit" measures
- Automate model structure generation

Problems?