Regression Analysis

- Simple Regression
- Multivariate Regression
- Stepwise Regression
- Replication and Prediction Error
Regression Analysis

In general, we "fit" a model by minimizing a metric that represents the error:

    min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

The sum of squares gives closed-form solutions and minimum variance for linear models.
The Simplest Regression Model

Line through the origin:

    y_u = β x_u + ε_u,   u = 1, 2, ..., n,   ε_u ~ N(0, σ_R²)

    min S = min Σᵤ₌₁ⁿ (y_u − β x_u)²

    ŷ = b x,   η_u = β x_u
    b: estimate of β
    ŷ: estimate of η_u, the true value of the model
Using the Normal Equation to Fit the Line-Through-the-Origin Model

Our model has only one degree of freedom; this is why our choices are confined to the line ŷ = bx (1 d.f.):

    min Σ (y − ŷ)²
Using the Normal Equation (cont.) — Fitting the Line-Through-the-Origin Model

Choose b so that the residual vector is perpendicular to the model vector:

    Σ (y − ŷ) x = 0   →   Σ (y − bx) x = 0   →   b = Σxy / Σx²   (estimate of β)

    s² = S_R / (n − 1)   (estimate of σ_R²)
    V(b) = s² / Σx²
    ~67% confidence interval: b ± √(s² / Σx²)
    Significance test: t = (b − β*) / √(s² / Σx²) ~ t_{n−1}
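The slope, its variance, and the t statistic above can be sketched in a few lines of numpy; the data here are made up for illustration, not the etch-time measurements from the slides:

```python
import numpy as np

# Line-through-the-origin fit (illustrative data).
x = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
y = np.array([0.11, 0.19, 0.32, 0.41, 0.49])

b = np.sum(x * y) / np.sum(x**2)       # b = Sum(xy) / Sum(x^2)
S_R = np.sum((y - b * x)**2)           # residual sum of squares
n = len(x)
s2 = S_R / (n - 1)                     # estimate of sigma_R^2
V_b = s2 / np.sum(x**2)                # V(b) = s^2 / Sum(x^2)
t = b / np.sqrt(V_b)                   # t statistic for H0: beta* = 0
```

A large |t| relative to t_{n−1} quantiles indicates the slope is significant.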
Etch Time vs. Removed Material: y = bx

[Scatter plot: Removed (nm) vs. Etch Time (sec ×10³), with the fitted line through the origin.]

Variable Name     Coefficient   Std. Err. Estimate   t Statistic   Prob > |t|
Etch Time (sec)   0.501         0.0162               30.9          0.000
Model Validation through ANOVA

The idea is to decompose the sum of squares into orthogonal components. Assuming that there is no need for a model at all* (always a good null hypothesis):

    H₀: β* = 0

    Σ y_u²  =  Σ ŷ_u²  +  Σ (y_u − ŷ_u)²
    (n)        (p)         (n − p)         ← degrees of freedom
    total      model       residual

* This is equivalent to saying that y ~ N(μ, σ²), where μ and σ are constants, independent of x.
Model Validation through ANOVA (cont.)

Assuming a specific model:

    H₀: β = β*

    Σ (y_u − β* x_u)²  =  Σ (ŷ_u − β* x_u)²  +  Σ (y_u − ŷ_u)²
    (n)                   (p)                    (n − p)         ← degrees of freedom
    total                 model                  residual

The ANOVA table will answer the question: is there a relationship between x and y?
ANOVA Table and Residual Plot

Source   Sum of Squares   Deg. of Freedom   Mean Squares   F-Ratio   Prob > F
Model    1.83e+5          1                 1.83e+5        1.98e+2   2.17e-6
Error    6.47e+3          7                 9.24e+2
Total    1.89e+5          8

[Residual plot: residuals vs. Etch Time (sec ×10³), scattered around zero.]
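The decomposition behind such a table can be verified numerically; this is a sketch on illustrative data (not the values behind the table above):

```python
import numpy as np

# No-intercept ANOVA decomposition:
# Sum(y^2) = Sum(yhat^2) + Sum((y - yhat)^2), with d.f. n = p + (n - p).
x = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
y = np.array([0.11, 0.19, 0.32, 0.41, 0.49])

b = np.sum(x * y) / np.sum(x**2)
y_hat = b * x
SS_total = np.sum(y**2)
SS_model = np.sum(y_hat**2)
SS_resid = np.sum((y - y_hat)**2)

n, p = len(y), 1
F = (SS_model / p) / (SS_resid / (n - p))   # F-ratio for H0: beta* = 0
```

Because least squares makes the residual orthogonal to the model vector, the two components add up to the total exactly.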
A More Complex Regression Equation: a Straight Line with Two Parameters

    actual:     η = α + β (x − x̄)
    estimated:  ŷ = a + b (x − x̄),   y_i ~ N(η_i, σ²)

Minimize R = Σ (y_i − ŷ_i)² to estimate α and β:

    a = ȳ
    b = Σ(x_i − x̄) y_i / Σ(x_i − x̄)²  =  Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²

Are a and b good estimators of α and β?

    E[a] = α
    E[b] = Σ(x_i − x̄) E[y_i] / Σ(x_i − x̄)² = β
Variance Estimation

Note that all variability comes from the y_i:

    V[a] = V[Σ y_i / k] = (1/k²) Σ V[y_i] = σ²/k
    V[b] = V[Σ (x_i − x̄) y_i / Σ (x_i − x̄)²] = σ² / Σ (x_i − x̄)²

These are minimum-variance estimates, thanks to least squares.
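The centered two-parameter fit is easy to check against an ordinary straight-line fit; a minimal sketch on made-up data:

```python
import numpy as np

# Centered fit yhat = a + b(x - xbar): a = ybar and
# b = Sum((x - xbar) y) / Sum((x - xbar)^2).
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([1.02, 1.55, 1.98, 2.46, 3.05, 3.49])

xb = x.mean()
a = y.mean()
b = np.sum((x - xb) * y) / np.sum((x - xb)**2)

# The centered form agrees with an ordinary (uncentered) fit:
slope, intercept = np.polyfit(x, y, 1)
```

Centering decouples the two estimates: a and b are uncorrelated, which is one reason this parameterization is used here.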
LTO Thickness vs. Deposition Time: y = a + bx

[Scatter plot: LTO thickness (Å ×10³) vs. Dep time (×10³), with fitted line.]

Variable Name   Coefficient   Std. Err. Estimate   t Statistic   Prob > |t|
Constant        6.04e+1       5.61e+1              1.08e+0       0.030
Dep time        9.75e-1       2.52e-2              3.87e+1       0.000
ANOVA Table and Residual Plot

Source   Sum of Squares   Deg. of Freedom   Mean Squares   F-Ratio   Prob > F
Model    4.77e+6          1                 4.77e+6        1.50e+3   0.000
Error    5.09e+4          16                3.18e+3
Total    4.82e+6          17

[Residual plot: residuals vs. Dep time (×10³), scattered around zero.]
ANOVA Representation (x i,y i ) (y i -y i ) y (y i -η i ) b(x i -x) (y i -η i ) (a-α) y i = a+b(x i -x) η i = α+β(x i -x) β(x i -x) x x i x Note differences between "true" and "estimated" model. 14
ANOVA Representation (cont.)

    (y_i − η_i) = (a − α) + (b − β)(x_i − x̄) + (y_i − ŷ_i)

    Σ (y_i − η_i)²  =  k (a − α)²  +  (b − β)² Σ(x_i − x̄)²  +  Σ (y_i − ŷ_i)²
    (k)                (1)            (1)                       (k − 2)        ← d.f.
    ~ σ²χ²(k)          ~ σ²χ²(1)      ~ σ²χ²(1)                 ~ σ²χ²(k−2)

In this way, the significance of the model can be analyzed in detail.
Confidence Limits of an Estimate

    ŷ₀ = ȳ + b (x₀ − x̄)
    V(ŷ₀) = V(ȳ) + (x₀ − x̄)² V(b)
    V(ŷ₀) = [1/n + (x₀ − x̄)² / Σ(x − x̄)²] s²

    Prediction interval: ŷ₀ ± t_{α/2} √V(ŷ₀)
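These limits can be computed directly; a sketch on made-up data (the helper `interval` is hypothetical, introduced here only for illustration):

```python
import numpy as np
from scipy import stats

# Confidence limits yhat0 +/- t_{alpha/2} * sqrt(V(yhat0)) at a new x0.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([1.02, 1.55, 1.98, 2.46, 3.05, 3.49])
n = len(x)
xb, yb = x.mean(), y.mean()
b = np.sum((x - xb) * y) / np.sum((x - xb)**2)
s2 = np.sum((y - (yb + b * (x - xb)))**2) / (n - 2)

def interval(x0, alpha=0.05):
    y0 = yb + b * (x0 - xb)
    V_y0 = (1.0 / n + (x0 - xb)**2 / np.sum((x - xb)**2)) * s2
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    return y0 - t * np.sqrt(V_y0), y0 + t * np.sqrt(V_y0)

lo, hi = interval(2.2)
```

The (x₀ − x̄)² term makes the limits widen away from the center of the data, which is exactly what the following three plots illustrate.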
Confidence Interval of Prediction (all points)

[Plot: LTO Thickness vs. Dep time with fitted line and confidence bands.]
Confidence Interval of Prediction (half the points)

[Plot: the same fit using half the data; the confidence bands are wider.]
Confidence Interval of Prediction (1/4 of the points)

[Plot: the same fit using a quarter of the data; the confidence bands are wider still.]
Prediction Error vs. Experimental Error

[Diagram: the true model and the estimated model; experimental error scatters the data about the true model, while prediction error separates the estimated model from the true one.]

Experimental error does not depend on location or sample size.
Prediction error depends on location and gets smaller as the sample size increases.
Multivariate Regression

    η = β₁ x₁ + β₂ x₂

[Diagram: y, ŷ, and the regressor vectors x₁, x₂; the residual R is perpendicular to ŷ, x₁, and x₂.]

Coefficient estimation:

    Σ (y − ŷ) x₁ = 0        Σ y x₁ − b₁ Σ x₁² − b₂ Σ x₁x₂ = 0
    Σ (y − ŷ) x₂ = 0   →    Σ y x₂ − b₂ Σ x₂² − b₁ Σ x₁x₂ = 0
Variance estimation:

    s² = S_R / (n − p)
    V(b₁) = [1/(1 − ρ²)] s² / Σ x₁²
    V(b₂) = [1/(1 − ρ²)] s² / Σ x₂²
    ρ = −Σ x₁x₂ / √(Σ x₁² Σ x₂²)
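Solving the pair of normal equations and computing ρ can be sketched on synthetic, centered data (the coefficients 2.0 and −1.0 below are arbitrary choices for the demonstration):

```python
import numpy as np

# Two-regressor normal equations, with deliberately correlated x1, x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.5 * x1 + rng.normal(size=50)          # correlated with x1
x1 -= x1.mean(); x2 -= x2.mean()             # centered regressors
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=50)

A = np.array([[np.sum(x1**2), np.sum(x1 * x2)],
              [np.sum(x1 * x2), np.sum(x2**2)]])
b1, b2 = np.linalg.solve(A, [np.sum(y * x1), np.sum(y * x2)])

rho = np.sum(x1 * x2) / np.sqrt(np.sum(x1**2) * np.sum(x2**2))
s2 = np.sum((y - b1 * x1 - b2 * x2)**2) / (50 - 2)
V_b1 = s2 / ((1 - rho**2) * np.sum(x1**2))   # inflated by regressor correlation
```

The 1/(1 − ρ²) factor shows why correlated regressors inflate the variance of the coefficient estimates.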
Thickness vs. Time and Temperature: y = a + b₁x₁ + b₂x₂

Variable Name   Coefficient   Std. Err. Estimate   t Statistic   Prob > |t|
Constant        -7.04e+2      7.18e+1              -9.80e+0      0.000
temp            7.14e-1       7.00e-2              1.02e+1       0.000
time min        8.69e-1       3.89e-2              2.23e+1       0.000
ANOVA Table and Correlation of Estimates

Source   Sum of Squares   Deg. of Freedom   Mean Squares   F-Ratio   Prob > F
Model    2.58e+4          2                 1.29e+4        3.01e+2   0.000
Error    7.71e+2          18                4.28e+1
Total    2.66e+4          20

Correlation of estimates (data file: regression):

           Tox     Temp    Time
tox nm     1.000   0.410   0.896
temp       0.410   1.000   0.000
time min   0.896   0.000   1.000
Multiple Regression in General

With the regressors collected into the matrix X = [x₁ x₂ ... x_n], the model is Xb = ŷ and y = Xb + e. Minimize:

    ‖Xb − y‖² = ‖e‖² = (y − Xb)ᵀ(y − Xb)

which is equivalent to requiring the residual to be perpendicular to the model:

    (y − Xb)ᵀ Xb = 0
    XᵀXb = Xᵀy
    b = (XᵀX)⁻¹ Xᵀy
    V(b) = (XᵀX)⁻¹ σ²
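The matrix form translates directly to code; a sketch on synthetic data (in practice one prefers a solver or `lstsq` over forming an explicit inverse):

```python
import numpy as np

# Matrix least squares: b = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n), rng.uniform(0, 1, n)])
beta = np.array([1.0, 2.0, -0.5])            # arbitrary "true" coefficients
y = X @ beta + rng.normal(scale=0.05, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations
resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])
V_b = np.linalg.inv(X.T @ X) * s2            # estimated V(b) = (X^T X)^{-1} s^2
```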
Joint Confidence Region for β₁, β₂

    S = S_R [1 + p/(n − p) · F_α(p, n − p)]

    (β₁ − b₁)² Σx₁² + 2(β₁ − b₁)(β₂ − b₂) Σx₁x₂ + (β₂ − b₂)² Σx₂² = S − S_R
What If a Linear Model Is Not Enough?

[Scatter plot: dep rate vs. inlet temp with a straight-line fit; the data show visible curvature.]

Variable Name   Coefficient   Std. Err. Estimate   t Statistic   Prob > |t|
Constant        -1.85e+3      4.64e+1              -3.99e+1      0.000
inlet temp      3.24e+0       7.46e-2              4.35e+1       0.000
ANOVA Table and Residual Plot

Source   Sum of Squares   Deg. of Freedom   Mean Squares   F-Ratio   Prob > F
Model    3.65e+4          1                 3.65e+4        1.89e+3   0.000
Error    4.06e+2          21                1.93e+1
Total    3.69e+4          22

[Residual plot: residuals vs. inlet temp show a systematic curved pattern.]
Multiple Regression with Replication

With duplicate runs, the pure-error sum of squares is

    S_E = ½ Σᵢ (y_i1 − y_i2)²

and the lack-of-fit sum of squares is S_LF = S_R − S_E.

More generally, with n_i replicates at each of k settings (ȳ_i. the replicate average at setting i), the sum of squares decomposes as:

    Σᵢ Σᵥ (y_iv − η_i)²  =  (Σᵢ n_i)(a − α)²  +  (b − β)² Σᵢ n_i (x_i − x̄)²
                            +  Σᵢ n_i (ȳ_i. − ŷ_i)²  +  Σᵢ Σᵥ (y_iv − ȳ_i.)²

with degrees of freedom Σᵢ n_i = 1 + 1 + (k − 2) + Σᵢ (n_i − 1); the last two terms are the lack-of-fit and pure-error components. Equivalently, about the grand mean:

    Σᵢ Σᵥ (y_iv − ȳ)²  =  Σᵢ Σᵥ (y_iv − ȳ_i.)²  +  Σᵢ n_i (ȳ_i. − ŷ_i)²  +  Σᵢ n_i (ŷ_i − ȳ)²
Pure Error vs. Lack of Fit Example

Lack of Fit
Source        DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Lack of Fit   17   401.01           23.59         21.04     0.005
Pure Error    4    4.49             1.12
Total Error   21   405.50

Parameter Estimates
Term         Estimate   Std Error   t Ratio   Prob > |t|
Intercept    -1850.16   46.42       -39.85    0.000
inlet temp   3.24       0.07        43.47     0.000

Model Test
Source       DF   Sum of Squares   F Ratio   Prob > F
inlet temp   1    36489.55         999.99    0.000
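The pure-error / lack-of-fit split can be sketched with duplicate runs on synthetic data; the curvature term below is an arbitrary choice so that a straight line genuinely lacks fit:

```python
import numpy as np

# S_E = 0.5 * Sum((y_i1 - y_i2)^2) from duplicates; S_LF = S_R - S_E.
rng = np.random.default_rng(2)
x = np.repeat(np.array([600.0, 610.0, 620.0, 630.0, 640.0]), 2)  # 2 runs each
y = 0.01 * (x - 600.0)**2 + rng.normal(scale=0.5, size=x.size)   # curved truth

slope, intercept = np.polyfit(x, y, 1)             # straight-line fit
S_R = np.sum((y - (intercept + slope * x))**2)

pairs = y.reshape(-1, 2)
S_E = 0.5 * np.sum((pairs[:, 0] - pairs[:, 1])**2)  # pure error
S_LF = S_R - S_E                                    # lack of fit

k, n = pairs.shape[0], y.size
F = (S_LF / (k - 2)) / (S_E / (n - k))              # lack-of-fit F ratio
```

A large F says the residual error is much bigger than replication alone can explain, i.e. the model form is inadequate.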
Dep. Rate vs. Temperature: y = a + bx + cx²

[Scatter plot: dep rate vs. inlet temp with the quadratic fit.]

Variable Name    Coefficient   Std. Err. Estimate   t Statistic   Prob > |t|
Constant         8.34e+3       1.80e+3              4.66e+0       0.000
inlet temp       -2.94e+1      5.74e+0              -5.13e+0      0.000
inlet temp ^2    2.62e-2       4.60e-3              5.69e+0       0.000
Pure Error vs. Lack of Fit Example (cont.)

Lack of Fit
Source        DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Lack of Fit   16   150.24           9.39          8.37      0.026
Pure Error    4    4.49             1.12
Total Error   20   154.73

Parameter Estimates
Term            Estimate   Std Error   t Ratio   Prob > |t|
Intercept       8339.05    1789.92     4.66      0.0002
inlet temp^1    -29.45     5.74        -5.13     0.0001
inlet temp^2    0.03       0.005       5.69      0.0000

Model Test
Source               DF   Sum of Squares   F Ratio   Prob > F
Poly(inlet temp,2)   2    36740.32         999.99    0.0000
ANOVA Table and Residual Plot

Source   Sum of Squares   Deg. of Freedom   Mean Squares   F-Ratio   Prob > F
Model    3.67e+4          2                 1.84e+4        2.37e+3   0.000
Error    1.55e+2          20                7.74e+0
Total    3.69e+4          22

[Residual plot: residuals vs. inlet temp, now scattered around zero with no systematic pattern.]
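The effect of adding the quadratic term can be sketched with synthetic data generated from coefficients of the same magnitude as the fitted ones (8.34e+3, −2.94e+1, 2.62e-2), plus noise; the noise level is an assumption:

```python
import numpy as np

# Adding the x^2 term absorbs the curvature a straight line cannot.
rng = np.random.default_rng(3)
x = np.linspace(600.0, 650.0, 23)
y = 8339.05 - 29.45 * x + 0.0262 * x**2 + rng.normal(scale=2.8, size=x.size)

lin = np.polyfit(x, y, 1)                     # y = a + b x
quad = np.polyfit(x, y, 2)                    # y = a + b x + c x^2
SS_lin = np.sum((y - np.polyval(lin, x))**2)
SS_quad = np.sum((y - np.polyval(quad, x))**2)
```

Since the linear model is nested in the quadratic one, the quadratic residual sum of squares is always at least as small; the lack-of-fit test above decides whether the reduction is significant.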
Use the Regression Line to Predict LTO Thickness

    y = 60.352 + 0.97456 x,    R² = 0.989
    y = −38.440 + 1.0153 x,    R² = 0.989

[Two plots: LTO Thick (Å) vs. Dep Time (sec), each with the fitted line and 90% confidence limits (90%LimitLow, 90%LimitHigh).]
Response Surface Methodology

Objectives:
- get a feel for I/O relationships
- find setting(s) that satisfy multiple constraints
- find settings that lead to optimum performance

Observations:
- the function is nearly linear away from the peak
- the function is nearly quadratic at the peak
Building the Planar Model

A factorial experiment with center points is enough to build and confirm a planar model:

    b₁, b₂, b₁₂ = −0.65 ± 0.75
    b₁₁ + b₂₂ = (1/4) Σ_p y − (1/3) Σ_c y = −0.50 ± 1.15

(The difference between the average of the four factorial points and the average of the three center points estimates the overall curvature.)
Quadratic Model and Confirmation Run

Close to the peak, a quadratic model can be built and confirmed by an expanded two-phase experiment.
Response Surface Methodology

RSM consists of creating models that lead to visual images of a response. The models are usually linear or quadratic in nature. Either expanded factorial experiments or regression analysis can be used.

All empirical models have a random prediction error. In RSM, the average variance of the model is:

    V(ŷ) = (1/n) Σᵢ₌₁ⁿ V(ŷᵢ) = pσ²/n

where p is the number of model parameters and n is the number of experiments.
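The pσ²/n result can be verified numerically: the prediction variances at the design points are σ² times the diagonal of the hat matrix, whose trace equals p. A sketch for a straight-line design:

```python
import numpy as np

# Average prediction variance over the design equals p*sigma^2/n,
# because trace(H) = p for the hat matrix H = X (X^T X)^{-1} X^T.
n, p, sigma2 = 20, 2, 1.0
X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])
H = X @ np.linalg.inv(X.T @ X) @ X.T      # V(yhat_i) = sigma^2 * H_ii
avg_var = sigma2 * np.trace(H) / n
```

This holds for any full-rank design matrix, not just this one: more parameters cost prediction variance, more experiments buy it back.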
Response Surface Exploration
"Popular" RSM

- Use single-stage Box-B or Box-W designs
- Use computer (simulated) experiments
- Rely on "goodness of fit" measures
- Automate model structure generation

Problems?