SMA 6304 / MIT 2.853 / MIT 2.854 Manufacturing Systems
Lecture 10: Data and Regression Analysis
Lecturer: Prof. Duane S. Boning

Agenda
1. Comparison of Treatments (One Variable): Analysis of Variance (ANOVA)
2. Multivariate Analysis of Variance: model forms
3. Regression Modeling: regression fundamentals; significance of model terms; confidence intervals

Is Process B Better Than Process A?

time order   method   yield
 1           A        89.7
 2           A        81.4
 3           A        84.5
 4           A        84.8
 5           A        87.3
 6           A        79.7
 7           A        85.1
 8           A        81.7
 9           A        83.7
10           A        84.5
11           B        84.7
12           B        86.1
13           B        83.2
14           B        91.9
15           B        86.3
16           B        79.3
17           B        82.6
18           B        89.1
19           B        83.7
20           B        88.5

[Run chart: yield (78 to 92) plotted against time order (1 to 20).]

Two Means with Internal Estimate of Variance

Pooled estimate of σ², combining the Method A and Method B samples:
$s_p^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}$

Estimated variance of the difference in sample means $\bar{y}_B - \bar{y}_A$:
$s_p^2 \left( \frac{1}{n_A} + \frac{1}{n_B} \right)$

Estimated standard error of $\bar{y}_B - \bar{y}_A$:
$\sqrt{s_p^2 (1/n_A + 1/n_B)}$, with ν = 18 d.o.f.

So we are only about 80% confident that the mean difference is real (significant).
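
As a cross-check on this slide, here is a minimal Python sketch (mine, not part of the lecture) that computes the pooled two-sample t statistic for the yield data above; scipy's ttest_ind with equal_var=True performs the same pooled test.

import numpy as np
from scipy import stats

yield_A = np.array([89.7, 81.4, 84.5, 84.8, 87.3, 79.7, 85.1, 81.7, 83.7, 84.5])
yield_B = np.array([84.7, 86.1, 83.2, 91.9, 86.3, 79.3, 82.6, 89.1, 83.7, 88.5])

# Pooled variance estimate with nu = nA + nB - 2 = 18 d.o.f.
nA, nB = len(yield_A), len(yield_B)
sp2 = ((nA - 1) * yield_A.var(ddof=1) + (nB - 1) * yield_B.var(ddof=1)) / (nA + nB - 2)
se = np.sqrt(sp2 * (1 / nA + 1 / nB))        # standard error of the mean difference
t0 = (yield_B.mean() - yield_A.mean()) / se  # observed t statistic, roughly 0.88

# One-sided confidence that B's mean really exceeds A's: about 0.80
confidence = stats.t.cdf(t0, df=nA + nB - 2)
print(t0, confidence)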

Comparison of Treatments

[Diagram: populations A, B, and C, with a sample drawn from each.]

Consider multiple conditions (treatments: settings of some variable). There is an overall mean µ and real effects or deltas τ_t between conditions. We observe samples at each condition of interest. Key question: are the observed differences in means significant? Typical assumption (should be checked): the underlying variances are all the same, usually an unknown value σ_0².

Steps/Issues in Analysis of Variance
1. Within-group variation: estimates the underlying population variance.
2. Between-group variation: estimates the group-to-group variance.
3. Compare the two estimates of variance: if there is a real difference between the treatments, the between-group variance estimate will be inflated compared to the within-group estimate. This lets us establish confidence in whether or not observed differences between treatments are significant. Hint: we'll be using F tests to look at ratios of variances.

(1) Within-Group Variation

Assume that each group is normally distributed and shares a common variance σ_0². The sum of squared deviations within the t-th group (there are k groups) is
$SS_t = \sum_{i=1}^{n_t} (y_{ti} - \bar{y}_t)^2$

The estimate of the within-group variance in the t-th group (just the variance formula) is
$s_t^2 = \frac{SS_t}{n_t - 1}$

Pool these (across the different conditions) to get an estimate of the common within-group variance:
$s_E^2 = \frac{SS_1 + SS_2 + \cdots + SS_k}{\nu_1 + \nu_2 + \cdots + \nu_k} = \frac{SS_R}{N - k}, \qquad \nu_t = n_t - 1$

This is the within-group mean square (variance estimate).

(2) Between-Group Variation

We will be testing the hypothesis µ_1 = µ_2 = ... = µ_k. If all the means are in fact equal, then a second estimate of σ² could be formed based on the observed differences between group means:
$s_T^2 = \frac{\sum_{t=1}^{k} n_t (\bar{y}_t - \bar{\bar{y}})^2}{k - 1}$

If the treatments in fact have different means, then s_T² estimates something larger:
$E[s_T^2] = \sigma_0^2 + \frac{\sum_t n_t \tau_t^2}{k - 1}$

The variance is inflated by the real treatment effects τ_t.

(3) Compare Variance Estimates

We now have two different possibilities for s_T², depending on whether the observed sample mean differences are real or are just occurring by chance (by sampling). Use the F statistic to see if the ratio of these variances is likely to have occurred by chance! The formal test for significance:
$F_0 = \frac{s_T^2}{s_E^2} \sim F_{k-1,\,N-k} \quad \text{under } H_0: \mu_1 = \mu_2 = \cdots = \mu_k$

(4) Compute Significance Level

Calculate the observed F ratio (with the appropriate degrees of freedom in the numerator and denominator). Use the F distribution to find how likely a ratio this large is to have occurred by chance alone; this is our significance level. If
$\Pr(F_{k-1,\,N-k} > F_0) \le \alpha$

then we say that the mean differences or treatment effects are significant at (1 − α)·100% confidence or better.
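
The four steps above can be carried out directly in a few lines. A minimal Python sketch (my own, applied to the three-treatment data from the worked example on the next slide):

import numpy as np
from scipy import stats

groups = [np.array([11, 10, 12]),   # treatment A
          np.array([10, 8, 6]),     # treatment B
          np.array([12, 10, 11])]   # treatment C
N = sum(len(g) for g in groups)     # total observations (9)
k = len(groups)                     # number of treatments (3)
grand = np.concatenate(groups).mean()

ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)        # SS_R = 12, df = N - k = 6
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # SS_T = 18, df = k - 1 = 2

F0 = (ss_between / (k - 1)) / (ss_within / (N - k))  # F0 = 4.5
p = stats.f.sf(F0, k - 1, N - k)                     # Pr(F > F0) ~ 0.064
print(F0, p)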

(5) Variance Due to Treatment Effects

We also want to estimate the sum of squared deviations from the grand mean among all samples:
$SS_D = \sum_{t=1}^{k} \sum_{i=1}^{n_t} (y_{ti} - \bar{\bar{y}})^2 = SS_T + SS_R, \qquad \nu_D = N - 1$

(6) Results: The ANOVA Table

source of variation                sum of squares   degrees of freedom   mean square            F_0           Pr(F_0)
Between treatments                 SS_T             k − 1                s_T² = SS_T/(k − 1)    s_T²/s_E²     p
Within treatments (residual SS)    SS_R             N − k                s_E² = SS_R/(N − k)
Total about the grand average      SS_D             N − 1

Example: ANOVA

Data (three treatments, three observations each):

A    B    C
11   10   12
10   8    10
12   6    11

Excel: Data Analysis, "Anova: Single Factor"

SUMMARY
Groups   Count   Sum   Average   Variance
A        3       33    11        1
B        3       24    8         4
C        3       33    11        1

ANOVA
Source of Variation   SS   df   MS   F     P-value   F crit
Between Groups        18   2    9    4.5   0.064     5.14
Within Groups         12   6    2
Total                 30   8
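
The same table can be reproduced in a single call with scipy (a cross-check of mine, not part of the original slides):

from scipy.stats import f_oneway

F0, p = f_oneway([11, 10, 12], [10, 8, 6], [12, 10, 11])
print(F0, p)   # F = 4.5, p ~ 0.064: significant at roughly 94% confidence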

ANOVA: Implied Model

The ANOVA approach assumes a simple mathematical model:
$y_{ti} = \mu + \tau_t + \epsilon_{ti} = \mu_t + \epsilon_{ti}$

where µ_t is the treatment mean (for treatment type t), τ_t is the treatment effect, and the ε_ti are zero-mean normal residuals, ~N(0, σ_0²).

Checks:
- Plot residuals against time order
- Examine the distribution of residuals: should be IID, normal
- Plot residuals vs. estimates
- Plot residuals vs. other variables of interest
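
A sketch of these residual checks in Python (illustrative only, reusing the three-treatment example; it assumes the data are listed in run order):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Residuals from the one-way ANOVA model: each y_ti minus its treatment mean
groups = [np.array([11, 10, 12]), np.array([10, 8, 6]), np.array([12, 10, 11])]
resid = np.concatenate([g - g.mean() for g in groups])
fitted = np.concatenate([np.full(len(g), g.mean()) for g in groups])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(resid, marker="o")          # residuals vs. time order
axes[0].set_title("Residuals vs. time order")
stats.probplot(resid, plot=axes[1])      # normal probability plot (IID, normal?)
axes[2].scatter(fitted, resid)           # residuals vs. estimates
axes[2].set_title("Residuals vs. estimates")
plt.show()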

MANOVA: Two Dependencies

Can extend to two (or more) variables of interest. MANOVA assumes a mathematical model, again simply capturing the means (or treatment offsets) for each discrete variable level:
$y_{ti} = \mu + \tau_t + \beta_i + \epsilon_{ti}$

This assumes that the effects from the two variables are additive.

Example: Two-Factor MANOVA

Two LPCVD deposition tube types, three gas suppliers. Does the supplier matter in average particle counts on wafers? Experiment: 3 lots on each tube, for each gas; report the average number of particles added.

                      Tube (factor 2)
Gas (factor 1)     1      2      row mean
A                  7      13     10
B                  36     44     40
C                  2      18     10
column mean        15     25     (grand mean 20)

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      3    1350.00          450.0         32.14     0.0303
Error      2    28.00            14.0
C. Total   5    1378.00

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
Tube     1       1    150.00           10.71     0.0820
Gas      2       2    1200.00          42.85     0.0228

Decomposition of the observations (grand mean + gas effect + tube effect + residual):

 y          grand mean   gas effect   tube effect   residual
 7  13      20  20       -10  -10     -5   5         2  -2
36  44      20  20        20   20     -5   5         1  -1
 2  18      20  20       -10  -10     -5   5        -3   3
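
This two-factor table can be reproduced with statsmodels (my own cross-check; the column names are arbitrary):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "tube": [1, 2, 1, 2, 1, 2],
    "gas":  ["A", "A", "B", "B", "C", "C"],
    "particles": [7, 13, 36, 44, 2, 18],
})
model = ols("particles ~ C(tube) + C(gas)", data=df).fit()  # additive model, no interaction
print(sm.stats.anova_lm(model, typ=2))  # SS: tube 150, gas 1200, residual 28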

MANOVA: Two Factors with Interactions

There may be an interaction: the effects may not be simply additive, but may depend synergistically on both factors:
$y_{tij} = \mu + \tau_t + \beta_i + \omega_{ti} + \epsilon_{tij}, \qquad \epsilon_{tij} \text{ IID}, \sim N(0, \sigma^2)$

Can split out the model more explicitly. ω_ti is an effect that depends on both the t and i factors simultaneously, where
- t = first factor = 1, 2, ... k (k = number of levels of the first factor)
- i = second factor = 1, 2, ... n (n = number of levels of the second factor)
- j = replication = 1, 2, ... m (m = number of replications at the (t, i)-th combination of factor levels)

Estimate the interaction by:
$\hat{\omega}_{ti} = \bar{y}_{ti\cdot} - \hat{\mu} - \hat{\tau}_t - \hat{\beta}_i$
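
With replicated data, the interaction model can be fit with the same formula machinery; a sketch on synthetic data (entirely my own illustration; the lecture gives no numeric example at this point):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
k, n, m = 2, 3, 4   # levels of factor 1, levels of factor 2, replicates per cell
df = pd.DataFrame([(t, i, 10 + 2 * t + 3 * i + 1.5 * t * i + rng.normal())
                   for t in range(k) for i in range(n) for j in range(m)],
                  columns=["t", "i", "y"])

# C(t)*C(i) expands to both main effects plus the interaction term omega_ti
model = ols("y ~ C(t) * C(i)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))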

MANOVA Table: Two-Way with Interactions

source of variation               sum of squares   degrees of freedom   mean square
Between levels of factor 1 (T)    SS_T             k − 1                SS_T / (k − 1)
Between levels of factor 2 (B)    SS_B             n − 1                SS_B / (n − 1)
Interaction                       SS_I             (k − 1)(n − 1)       SS_I / [(k − 1)(n − 1)]
Within groups (error)             SS_E             kn(m − 1)            SS_E / [kn(m − 1)]
Total about the grand average     SS_D             knm − 1

Each F_0 is the corresponding mean square divided by the error mean square, with Pr(F_0) from the F distribution.

Measures of Model Goodness: R²

Goodness of fit, R². Question considered: how much better does the model do than just using the grand average?
$R^2 = \frac{SS_D - SS_R}{SS_D} = 1 - \frac{SS_R}{SS_D}$

Think of this as the fraction of squared deviations (from the grand average) in the data which is captured by the model.

Adjusted R². For fair comparison between models with different numbers of coefficients, an alternative is often used:
$R^2_{\text{adj}} = 1 - \frac{SS_R/\nu_R}{SS_D/\nu_D}$

Think of this as (1 − variance remaining in the residual). Recall ν_R = ν_D − ν_T.
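
In code, both measures follow directly from the sums of squares. A sketch (mine) using the quadratic fit to the growth-rate data that appears later in this lecture, where JMP reports R² = 0.936 and adjusted R² = 0.918:

import numpy as np

x = np.array([10, 10, 15, 20, 20, 25, 25, 25, 30, 35])
y = np.array([73, 78, 85, 90, 91, 87, 86, 91, 75, 65])
yhat = np.polyval(np.polyfit(x, y, 2), x)   # quadratic model predictions

ss_total = ((y - y.mean()) ** 2).sum()   # deviations about the grand average, nu_D = n - 1
ss_resid = ((y - yhat) ** 2).sum()       # residual SS, nu_R = n - 3 (three coefficients fit)
n = len(y)
r2 = 1 - ss_resid / ss_total                              # ~0.936
r2_adj = 1 - (ss_resid / (n - 3)) / (ss_total / (n - 1))  # ~0.918
print(r2, r2_adj)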

Regression Fundamentals

Use least squared error as the measure of goodness to estimate coefficients in a model. Topics:
- One-parameter model: model form; squared error; estimation using normal equations
- Estimate of experimental error
- Precision of estimate: variance in b
- Confidence interval for β
- Analysis of variance: significance of b
- Lack of fit vs. pure error
- Polynomial regression

Least Squares Regression

We use least squares to estimate coefficients in typical regression models.

One-parameter model: $y_i = \beta x_i + \epsilon_i$

The goal is to estimate β with the best b. How do we define "best"? That b which minimizes the sum of squared errors between prediction and data:
$SS(\beta) = \sum_{i=1}^{n} (y_i - \beta x_i)^2$

The residual sum of squares (for the best estimate b) is
$SS_R = \sum_{i=1}^{n} (y_i - b x_i)^2$

Least Squares Regression, cont.

Least squares estimation via the normal equations: for linear problems, we need not search over SS(β); a direct solution for b is possible. Recognize that the vector of residuals will be normal to the vector of x values at the least squares estimate:
$\sum_i x_i (y_i - b x_i) = 0 \quad \Rightarrow \quad b = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$

Estimate of experimental error: assuming the model structure is adequate, an estimate s² of σ² can be obtained:
$s^2 = \frac{SS_R}{n - 1}$

(n − 1 degrees of freedom, since one parameter is estimated by least squares.)

Precision of Estimate: Variance in b

We can calculate the variance in our estimate of the slope, b:
$\mathrm{Var}(b) = \frac{\sigma^2}{\sum_i x_i^2}, \quad \text{estimated by} \quad s_b^2 = \frac{s^2}{\sum_i x_i^2}$

Why? Because b is a linear combination of the independent observations y_i:
$b = \frac{\sum_i x_i y_i}{\sum_i x_i^2} \quad \Rightarrow \quad \mathrm{Var}(b) = \frac{\sum_i x_i^2\,\sigma^2}{\left(\sum_i x_i^2\right)^2} = \frac{\sigma^2}{\sum_i x_i^2}$

Confidence Interval for β

Once we have the standard error in b, we can calculate confidence intervals to some desired (1 − α)·100% level of confidence:
$b \pm t_{\alpha/2,\,\nu}\; s_b$

Analysis of variance: test the hypothesis β = 0. If the confidence interval for β includes 0, then β is not significant. Degrees of freedom (needed in order to use the t distribution): ν = n − p, where p = number of parameters estimated by least squares.

Example Regression

age   income
8     6.16
22    9.88
35    14.35
40    24.06
57    30.34
73    32.17
78    42.18
87    43.23
98    48.76

[Leverage plot: income vs. age (0 to 100), P < .0001]

Whole-Model Analysis of Variance (tested against the reduced model Y = 0)
Source     DF   Sum of Squares   Mean Square   F Ratio    Prob > F
Model      1    8836.6440        8836.64       1093.146   <.0001
Error      8    64.6695          8.08
C. Total   9    8901.3135

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob > |t|
Intercept   0 (zeroed)   0           .         .
age         0.500983     0.015152    33.06     <.0001

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio    Prob > F
age      1       1    8836.6440        1093.146   <.0001

Note that this simple model assumes an intercept of zero: the model must go through the origin. We will relax this requirement soon.
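
The JMP estimates above can be reproduced directly from the normal equation; a minimal sketch (mine, not from the lecture):

import numpy as np
from scipy import stats

age = np.array([8, 22, 35, 40, 57, 73, 78, 87, 98])
income = np.array([6.16, 9.88, 14.35, 24.06, 30.34, 32.17, 42.18, 43.23, 48.76])

b = (age @ income) / (age @ age)       # normal equation, zero-intercept model: 0.500983
resid = income - b * age
s2 = (resid @ resid) / (len(age) - 1)  # one parameter estimated, so nu = n - 1 = 8
se_b = np.sqrt(s2 / (age @ age))       # standard error of b: 0.015152
t_ratio = b / se_b                     # 33.06
ci = b + np.array([-1, 1]) * stats.t.ppf(0.975, len(age) - 1) * se_b  # 95% CI for beta
print(b, se_b, t_ratio, ci)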

Lack-of-Fit Error vs. Pure Error

Sometimes we have replicated data, e.g. multiple runs at the same x values in a designed experiment. We can then decompose the residual error contributions:
$SS_R = SS_L + SS_E$

where SS_R = residual sum of squares (error), SS_L = lack-of-fit squared error, and SS_E = pure replicate error. This allows us to TEST for lack of fit. By "lack of fit" we mean evidence that the linear model form is inadequate.

Regression: Mean-Centered Models

Model form:
$y_i = \beta_0 + \beta_1 (x_i - \bar{x}) + \epsilon_i$

Estimate by:
$b_0 = \bar{y}, \qquad b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

Regression: Mean-Centered Models, Confidence Intervals

The confidence interval for the mean response at a point x is
$\hat{y}(x) \pm t_{\alpha/2,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$

Our confidence interval on y widens as we get further from the center of our data!
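
A sketch of this widening interval (my own illustration), reusing the age/income data with a mean-centered model, i.e. with the intercept now included:

import numpy as np
from scipy import stats

x = np.array([8, 22, 35, 40, 57, 73, 78, 87, 98])
y = np.array([6.16, 9.88, 14.35, 24.06, 30.34, 32.17, 42.18, 43.23, 48.76])

n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxx = ((x - xbar) ** 2).sum()
b1 = ((x - xbar) * (y - ybar)).sum() / Sxx   # slope estimate
yhat = ybar + b1 * (x - xbar)                # mean-centered model fit
s2 = ((y - yhat) ** 2).sum() / (n - 2)       # two parameters estimated

# 95% confidence half-width for the mean response: narrowest at x = xbar
x_new = np.array([10.0, xbar, 95.0])
half_width = stats.t.ppf(0.975, n - 2) * np.sqrt(s2 * (1 / n + (x_new - xbar) ** 2 / Sxx))
print(half_width)   # grows as x_new moves away from the center of the data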

Polynomial Regression

We may believe that a higher-order model structure applies. Polynomial forms are also linear in the coefficients and can be fit with least squares:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$

Curvature is included through the x² term. Example: growth rate data.

Regression Example: Growth Rate Data

[Scatterplot removed due to copyright considerations: bivariate fit of y by x (x from 5 to 40, y from 60 to 95), showing the fitted mean, a linear fit, and a degree-2 polynomial fit.]

Replicate data provides an opportunity to check for lack of fit.

Growth Rate: First-Order Model

[Image removed due to copyright considerations.]

The mean is significant, but the linear term is not. Clear evidence of lack of fit.

Growth Rate: Second-Order Model

[Image removed due to copyright considerations.]

The quadratic term is significant. No evidence of lack of fit.

Polynomial Regression in Excel

Create an additional input column for each higher-order input term, then use the Data Analysis, Regression tool.

x    x^2    y
10   100    73
10   100    78
15   225    85
20   400    90
20   400    91
25   625    87
25   625    86
25   625    91
30   900    75
35   1225   65

Regression Statistics
Multiple R          0.968
R Square            0.936
Adjusted R Square   0.918
Standard Error      2.541
Observations        10

ANOVA
             df   SS        MS        F        Significance F
Regression   2    665.706   332.853   51.555   6.48E-05
Residual     7    45.194    6.456
Total        9    710.9

            Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept   35.657         5.618            6.347     0.0004    22.373      48.942
x           5.263          0.558            9.431     3.1E-05   3.943       6.582
x^2         -0.128         0.013            -9.966    2.2E-05   -0.158      -0.097
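
The same quadratic fit in Python (a cross-check of mine; note np.polyfit returns coefficients highest power first):

import numpy as np

x = np.array([10, 10, 15, 20, 20, 25, 25, 25, 30, 35])
y = np.array([73, 78, 85, 90, 91, 87, 86, 91, 75, 65])

# Fit y = b0 + b1*x + b2*x^2; polyfit returns [b2, b1, b0]
b2, b1, b0 = np.polyfit(x, y, 2)
print(b0, b1, b2)   # ~35.657, 5.263, -0.128, matching the Excel output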

Polynomial Regression (generated using the JMP package)

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      2    665.70617        332.853       51.5551   <.0001
Error      7    45.19383         6.456
C. Total   9    710.90000

Lack of Fit
Source        DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Lack of Fit   3    18.193829        6.0646        0.8985    0.5157
Pure Error    4    27.000000        6.7500
Total Error   7    45.193829
Max RSq: 0.9620

Summary of Fit
RSquare                      0.936427
RSquare Adj                  0.918264
Root Mean Square Error       2.540917
Mean of Response             82.1
Observations (or Sum Wgts)   10

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob > |t|
Intercept   35.657437   5.617927    6.35      0.0004
x           5.2628956   0.558022    9.43      <.0001
x*x         -0.127674   0.012811    -9.97     <.0001

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
x        1       1    574.28553        88.9502   <.0001
x*x      1       1    641.20451        99.3151   <.0001
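
The lack-of-fit decomposition in this table can be reproduced by hand; a sketch (mine) that splits the residual error into pure replicate error and lack of fit:

import numpy as np
from scipy import stats

x = np.array([10, 10, 15, 20, 20, 25, 25, 25, 30, 35])
y = np.array([73, 78, 85, 90, 91, 87, 86, 91, 75, 65])

yhat = np.polyval(np.polyfit(x, y, 2), x)
ss_resid = ((y - yhat) ** 2).sum()            # SS_R ~ 45.19, df = n - 3 = 7

# Pure error: deviations of replicates about their own mean at each x value
ss_pure, df_pure = 0.0, 0
for xv in np.unique(x):
    grp = y[x == xv]
    ss_pure += ((grp - grp.mean()) ** 2).sum()
    df_pure += len(grp) - 1                   # SS_E = 27.0, df = 4

ss_lof = ss_resid - ss_pure                   # SS_L ~ 18.19
df_lof = (len(y) - 3) - df_pure               # df = 7 - 4 = 3
F0 = (ss_lof / df_lof) / (ss_pure / df_pure)  # ~0.90
print(F0, stats.f.sf(F0, df_lof, df_pure))    # p ~ 0.52: no evidence of lack of fit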

Summary
- Comparison of Treatments: ANOVA
- Multivariate Analysis of Variance
- Regression Modeling

Next Time
- Time Series Models
- Forecasting