Lecture 11 Simple Linear Regression

Fall 2013
Prof. Yao Xie, yao.xie@isye.gatech.edu
H. Milton Stewart School of Industrial & Systems Engineering, Georgia Tech

Midterm 2
Mean: 91.2, median: 93.75, standard deviation: 6.5

Meddicorp Sales
Meddicorp Company sells medical supplies to hospitals, clinics, and doctors' offices. Meddicorp's management considers the effectiveness of a new advertising program. Management wants to know if the advertisement in 1999 is related to sales.

Data
For 25 offices, the company observes the yearly sales (in thousands) and the advertisement expenditure for the new program (in hundreds).

Office   SALES     ADV
1         963.50   374.27
2         893.00   408.50
3        1057.25   414.31
4        1183.25   448.42
5        1419.50   517.88
...

Regression analysis
Step 1: graphical display of the data
Scatter plot: sales vs. advertisement cost

Step 2: find the relationship or association between Sales and Advertisement Cost
Tool: regression

Regression Analysis
The collection of statistical tools used to model and explore relationships between variables that are related in a nondeterministic manner is called regression analysis.
Such problems occur frequently in engineering and science.

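Scatter Diagram
Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis is a statistical technique that is very useful for these types of problems.
Sample correlation coefficient:
$$\hat\rho = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2 \, \sum_{i=1}^{n} (y_i - \bar y)^2}}, \qquad -1 \le \hat\rho \le 1$$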
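As a minimal sketch of the sample correlation coefficient, the R snippet below evaluates the formula term by term and compares it with R's built-in cor(). It uses only the five Meddicorp offices listed on the data slide (5 of the 25), so the printed number is illustrative and is not the result for the full data set. The variable names sales and adv are assumptions made for this example.

```r
# Five of the 25 Meddicorp offices listed on the data slide
# (sales in thousands, advertising expenditure in hundreds)
sales <- c(963.50, 893.00, 1057.25, 1183.25, 1419.50)
adv   <- c(374.27, 408.50, 414.31, 448.42, 517.88)

# Sample correlation coefficient, written out term by term
num <- sum((adv - mean(adv)) * (sales - mean(sales)))
den <- sqrt(sum((adv - mean(adv))^2) * sum((sales - mean(sales))^2))
rho_hat <- num / den

print(rho_hat)
print(cor(adv, sales))  # built-in Pearson correlation; should agree with rho_hat
```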

Basics of Regression
We observe a response or dependent variable (Y).
With each (Y), we also observe regressors or predictors $\{X_1, \ldots, X_n\}$.
Goal: determine the mathematical relationship between the response variable and the regressors:
$$Y = h(X_1, \ldots, X_n)$$

The function can be nonlinear.
In this class, we will focus on the case where Y is a linear function of $\{X_1, \ldots, X_n\}$:
$$Y = h(X_1, \ldots, X_n) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$
[Figure: scatter plot; horizontal axis C1]

Different forms of regression
Simple linear regression: $Y = \beta_0 + \beta_1 x + \varepsilon$
Multiple linear regression: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
Polynomial regression: $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$
...
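A short sketch of how these three forms could be written as R model formulas with lm(). The data are simulated purely for illustration (they are not from the lecture), and the names x1, x2, y and the fit_* objects are assumptions.

```r
set.seed(1)  # simulated data, for illustration only
n  <- 50
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + 0.1 * x1^2 + rnorm(n)

fit_simple <- lm(y ~ x1)            # simple:     Y = b0 + b1*x + error
fit_multi  <- lm(y ~ x1 + x2)       # multiple:   Y = b0 + b1*x1 + b2*x2 + error
fit_poly   <- lm(y ~ x1 + I(x1^2))  # polynomial: Y = b0 + b1*x + b2*x^2 + error

print(coef(fit_simple))
print(coef(fit_multi))
print(coef(fit_poly))
```

All three are linear in the coefficients, which is why the same lm() call handles them; only the formula on the right-hand side changes.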

Basics of regressions
Which is the RESPONSE and which is the PREDICTOR?
The response or dependent variable varies with different values of the regressor/predictor.
The predictor values are fixed: we observe the response for these fixed values.
The focus is on explaining the response variable in association with one or more predictors.

Simple linear regression
Our goal is to find the best line that describes a linear relationship: find $(\beta_0, \beta_1)$ where
$$Y = \beta_0 + \beta_1 x + \varepsilon$$
Unknown parameters:
1. $\beta_0$: intercept (where the line crosses the y-axis)
2. $\beta_1$: slope of the line
Basic idea:
a. Plot the observations (X, Y)
b. Find the best line that follows the plotted points
[Figure: scatter plot with a fitted line; horizontal axis C1]

Class activity
1. In the Meddicorp Company example, the response is:
   A. Sales
   B. Advertisement Expenditure
2. In the Meddicorp Company example, the predictor is:
   A. Sales
   B. Advertisement Expenditure
3. To learn about the association between sales and the advertisement expenditure we can use simple linear regression:
   A. True
   B. False
4. If the association between response and predictor is positive, then the slope is:
   A. Positive
   B. Negative
   C. We cannot identify the slope sign

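Simple linear regression: model
With observed data $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, we model the linear relationship
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n$$
$E(\varepsilon_i) = 0$
$\mathrm{Var}(\varepsilon_i) = \sigma^2$
$\{\varepsilon_1, \ldots, \varepsilon_n\}$ are independent random variables
(Later we assume $\varepsilon_i \sim$ Normal)
Later, we will check these assumptions when we check model adequacy.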

Summary: simple linear regression
Based on the scatter diagram, it is probably reasonable to assume that the mean of the random variable Y is related to X by the following simple linear regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
Here $Y_i$ is the response, $X_i$ the regressor or predictor, $\beta_0$ the intercept, $\beta_1$ the slope, and $\varepsilon_i$ the random error. The slope and intercept of the line are called regression coefficients.
The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y.

Estimate regression parameters
To estimate $(\beta_0, \beta_1)$, we find the values that minimize the squared error:
$$\sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
Derivation: method of least squares
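To make the "minimize squared error" idea concrete, here is a small sketch that minimizes the sum of squared errors numerically with optim() and compares the result with lm(). The five data points are made up for illustration and do not come from the lecture; the names sse and opt are assumptions.

```r
# Made-up data for illustration only (not from the lecture)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared errors as a function of b = (beta0, beta1)
sse <- function(b) sum((y - (b[1] + b[2] * x))^2)

# Numerical minimization, starting from (0, 0)
opt <- optim(c(0, 0), sse, method = "BFGS")
print(opt$par)

# Least squares fit from lm(), for comparison
print(coef(lm(y ~ x)))
```

In practice one would not use a numerical optimizer here, because the minimizer has a closed form; the sketch only shows that both routes land on the same line.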

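Method of least squares
Model:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
To estimate $(\beta_0, \beta_1)$, we find the values that minimize the squared error:
$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
The least squares estimators of $\beta_0$ and $\beta_1$, say $\hat\beta_0$ and $\hat\beta_1$, must satisfy
$$\left.\frac{\partial L}{\partial \beta_0}\right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$
$$\left.\frac{\partial L}{\partial \beta_1}\right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\, x_i = 0$$
Least squares normal equations:
$$n \hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i$$
[Figure 11-3: observed values (data, y) and the estimated regression line, plotted against x]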

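Least squares estimates
The least squares estimates of the intercept and slope in the simple linear regression model are
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x \qquad \text{(11-7)}$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \dfrac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \qquad \text{(11-8)}$$
where $\bar y = (1/n) \sum_{i=1}^{n} y_i$ and $\bar x = (1/n) \sum_{i=1}^{n} x_i$.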
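The step from the two normal equations to (11-7) and (11-8) is not spelled out on the slides. A sketch of the standard algebra, with all sums running over $i = 1, \ldots, n$: divide the first normal equation by $n$ to get the intercept, then substitute it into the second.

```latex
\begin{align*}
n\hat\beta_0 + \hat\beta_1 \sum x_i = \sum y_i
  \;&\Longrightarrow\; \hat\beta_0 = \bar y - \hat\beta_1 \bar x \\
\hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2 = \sum y_i x_i
  \;&\Longrightarrow\; (\bar y - \hat\beta_1 \bar x) \sum x_i + \hat\beta_1 \sum x_i^2 = \sum y_i x_i \\
  \;&\Longrightarrow\; \hat\beta_1 \left( \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \right)
      = \sum y_i x_i - \frac{\left(\sum y_i\right)\left(\sum x_i\right)}{n}
\end{align*}
```

Dividing both sides of the last line by the bracketed term gives the slope estimate.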

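Alternative notation
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \qquad \text{(11-10)}$$
$$S_{xy} = \sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \qquad \text{(11-11)}$$
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{S_{xy}}{S_{xx}}$$
Fitted (estimated) regression model:
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$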

Example: oxygen and hydrocarbon level
Question: fit a simple regression model to relate purity (y) to hydrocarbon level (x).

Table 11-1 Oxygen and Hydrocarbon Levels
Observation   Hydrocarbon level x (%)   Purity y (%)
 1            0.99                      90.01
 2            1.02                      89.05
 3            1.15                      91.43
 4            1.29                      93.74
 5            1.46                      96.73
 6            1.36                      94.45
 7            0.87                      87.59
 8            1.23                      91.77
 9            1.55                      99.42
10            1.40                      93.65
11            1.19                      93.54
12            1.15                      92.52
13            0.98                      90.56
14            1.01                      89.54
15            1.11                      89.85
16            1.20                      90.39
17            1.26                      93.25
18            1.32                      93.41
19            1.43                      94.98
20            0.95                      87.33

[Figure 11-1: Scatter diagram of oxygen purity versus hydrocarbon level from Table 11-1. Horizontal axis: hydrocarbon level (x); vertical axis: purity (y).]

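$$n = 20, \quad \sum_{i=1}^{20} x_i = 23.92, \quad \sum_{i=1}^{20} y_i = 1{,}843.21, \quad \bar x = 1.1960, \quad \bar y = 92.1605$$
$$\sum_{i=1}^{20} y_i^2 = 170{,}044.5321, \quad \sum_{i=1}^{20} x_i^2 = 29.2892, \quad \sum_{i=1}^{20} x_i y_i = 2{,}214.6566$$
$$S_{xx} = \sum_{i=1}^{20} x_i^2 - \frac{\left(\sum_{i=1}^{20} x_i\right)^2}{20} = 29.2892 - \frac{(23.92)^2}{20} = 0.68088$$
and
$$S_{xy} = \sum_{i=1}^{20} x_i y_i - \frac{\left(\sum_{i=1}^{20} x_i\right)\left(\sum_{i=1}^{20} y_i\right)}{20} = 2{,}214.6566 - \frac{(23.92)(1{,}843.21)}{20} = 10.17744$$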

Therefore, the least squares estimates of the slope and intercept are
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{10.17744}{0.68088} = 14.94748$$
and
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 92.1605 - (14.94748)(1.196) = 74.28331$$
The fitted simple linear regression model (with the coefficients reported to three decimal places) is
$$\hat y = 74.283 + 14.947 x$$
[Figure: fitted regression line over the data. Horizontal axis: hydrocarbon level x (%); vertical axis: oxygen purity y (%).]
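The hand calculation above can be reproduced in a few lines of R. This is a sketch using the Table 11-1 data; the variable names (x, y, Sxx, Sxy, b0, b1, fit) are assumptions chosen to mirror the slide notation, and lm() is used only as a cross-check.

```r
# Table 11-1: hydrocarbon level x (%) and oxygen purity y (%)
x <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
       1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
       93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)
n <- length(x)

# Computational forms (11-10) and (11-11)
Sxx <- sum(x^2) - sum(x)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n

# Least squares estimates of slope and intercept
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
print(c(Sxx = Sxx, Sxy = Sxy, b0 = b0, b1 = b1))

# Cross-check against R's built-in linear model fit
fit <- lm(y ~ x)
print(coef(fit))
```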

Interpretation of regression model
Regression model:
$$\hat y = 74.283 + 14.947 x$$
The model gives $\hat y = 89.23\%$ when the hydrocarbon level is $x = 1.00\%$.
This may be interpreted as an estimate of the true population mean purity when $x = 1.00\%$.
The estimates are subject to error.
Later: we will use confidence intervals to describe the error in estimation from a regression model.

Estimation of variance
Using the fitted model, we can estimate the value of the response variable for a given predictor:
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$
Residuals:
$$r_i = y_i - \hat y_i$$
Our model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \ldots, n$, with $\mathrm{Var}(\varepsilon_i) = \sigma^2$
Unbiased estimator (MSE: Mean Square Error):
$$\hat\sigma^2 = \mathrm{MSE} = \frac{\sum_{i=1}^{n} r_i^2}{n - 2}$$
Oxygen and hydrocarbon level example: $\hat\sigma^2 = 1.18$
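A sketch of the residual and variance calculation for the oxygen purity example, again in R. The names fit, r, mse and pred are assumptions; summary(fit)$sigma is R's residual standard error, so its square should match the MSE computed by hand.

```r
# Table 11-1: hydrocarbon level x (%) and oxygen purity y (%)
x <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
       1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
       93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)

fit <- lm(y ~ x)

# Residuals r_i = y_i - yhat_i and the unbiased variance estimate (divide by n - 2)
r   <- y - fitted(fit)
mse <- sum(r^2) / (length(y) - 2)
print(mse)
print(summary(fit)$sigma^2)  # same quantity, taken from the model summary

# Estimated mean purity at a hydrocarbon level of x = 1.00%
pred <- predict(fit, newdata = data.frame(x = 1.00))
print(pred)
```

The divisor is n - 2 rather than n because two parameters, the intercept and the slope, were estimated from the same data.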

Example: Oil Well Drilling Costs
Estimating the costs of drilling oil wells is an important consideration for the oil industry.
Data: the total costs and the depths of 16 off-shore oil wells located in the Philippines.

Depth    Cost
 5000     2596.8
 5200     3328.0
 6000     3181.1
 6538     3198.4
 7109     4779.9
 7556     5905.6
 8005     5769.2
 8207     8089.5
 8210     4813.1
 8600     5618.7
 9026     7736.0
 9197     6788.3
 9926     7840.8
10813     8882.5
13800    10489.5
14311    12506.6

Step 1: graphical display of the data
R code: plot(depth, Cost, xlab = "Depth", ylab = "Cost")

Class activity
1. In this example, the response is:
   A. The drilling cost
   B. The well depth
2. In this example, the dependent variable is:
   A. The drilling cost
   B. The well depth
3. Is there a linear association between the drilling cost and the well depth?
   A. Yes and positive
   B. Yes and negative
   C. No

Step 2: find the relationship between Depth and Cost

Results and use of the regression model
1. Fit a linear regression model: the estimates $(\hat\beta_0, \hat\beta_1)$ are (-2277.1, 1.0033).
2. What does the model predict as the cost increase for an additional depth of 1000 ft? If we increase X by 1000, we increase Y by $1000\hat\beta_1 \approx 1003$ dollars.
3. What cost would you predict for an oil well of 10,000 ft depth? X = 10,000 ft is in the range of the data, and the estimate of the line at x = 10,000 is $\hat\beta_0 + (10{,}000)\hat\beta_1 = -2277.1 + 10{,}033 \approx 7756$ dollars.
4. What is the estimate of the error variance? $\hat\sigma^2 \approx 774{,}211$.
5. What could you say about the cost of an oil well of depth 20,000 ft? X = 20,000 ft is much greater than all the observed values of X. We should not extrapolate the regression out that far.
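The five answers can be checked with a short R sketch on the 16 wells from the data slide. The vector names depth and Cost follow the plot() call used for the scatter plot; fit, b, pred and sigma2 are assumptions introduced for this example.

```r
# Depths (ft) and total costs of the 16 off-shore wells from the data slide
depth <- c(5000, 5200, 6000, 6538, 7109, 7556, 8005, 8207,
           8210, 8600, 9026, 9197, 9926, 10813, 13800, 14311)
Cost  <- c(2596.8, 3328.0, 3181.1, 3198.4, 4779.9, 5905.6, 5769.2, 8089.5,
           4813.1, 5618.7, 7736.0, 6788.3, 7840.8, 8882.5, 10489.5, 12506.6)

# 1. Fit the simple linear regression Cost = b0 + b1 * depth + error
fit <- lm(Cost ~ depth)
b   <- coef(fit)
print(b)

# 2. Predicted cost increase for an additional 1000 ft of depth
print(1000 * b[["depth"]])

# 3. Predicted cost at 10,000 ft (inside the observed depth range)
pred <- predict(fit, newdata = data.frame(depth = 10000))
print(pred)

# 4. Estimate of the error variance (MSE)
sigma2 <- summary(fit)$sigma^2
print(sigma2)

# 5. Observed depth range; 20,000 ft lies far outside it, so do not extrapolate
print(range(depth))
```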

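Summary
Simple linear regression: $Y = \beta_0 + \beta_1 x + \varepsilon$
Estimate coefficients from data: method of least squares
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{S_{xy}}{S_{xx}}$$
Fitted (estimated) regression model:
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$
Estimate of variance
[Figure: observed values (data, y) and the estimated regression line, plotted against x]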