Lecture 11 Simple Linear Regression


Lecture 11: Simple Linear Regression
Fall 2013
Prof. Yao Xie, yao.xie@isye.gatech.edu
H. Milton Stewart School of Industrial & Systems Engineering, Georgia Tech

Midterm 2: mean 91.2, median 93.75, std 6.5

Meddicorp Sales
Meddicorp Company sells medical supplies to hospitals, clinics, and doctors' offices. Meddicorp's management is considering the effectiveness of a new advertising program. Management wants to know whether the advertisement in 1999 is related to sales.

Data
For 25 offices, the company observes the yearly sales (in thousands) and the advertisement expenditure for the new program (in hundreds).
Office   SALES      ADV
1         963.50    374.27
2         893.00    408.50
3        1057.25    414.31
4        1183.25    448.42
5        1419.50    517.88
...

Regression analysis
Step 1: graphical display of the data. Scatter plot: sales vs. advertisement cost.

Step 2: find the relationship or association between sales and advertisement cost using regression.

Regression Analysis
The collection of statistical tools used to model and explore relationships between variables that are related in a nondeterministic manner is called regression analysis. It occurs frequently in engineering and science.

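Scatter Diagram
Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis is a statistical technique that is very useful for these types of problems.
The sample correlation coefficient is
$$\hat{\rho} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad -1 \le \hat{\rho} \le 1$$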
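As a rough illustration of the correlation formula, the sketch below evaluates it term by term in R and compares the result with base R's `cor()`. It uses only the five Meddicorp offices listed on the Data slide (the other 20 rows are not shown there), so the value is illustrative and not the correlation for the full data set. The variable names are chosen here for the example.

```r
# First five Meddicorp offices from the Data slide (the remaining rows are not shown)
sales <- c(963.50, 893.00, 1057.25, 1183.25, 1419.50)  # yearly sales (thousands)
adv   <- c(374.27, 408.50, 414.31, 448.42, 517.88)     # advertisement expenditure (hundreds)

# Sample correlation coefficient, following the formula term by term
dx <- adv - mean(adv)
dy <- sales - mean(sales)
rho_hat <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Base R's cor() computes the same Pearson correlation
rho_builtin <- cor(adv, sales)

print(rho_hat)
print(rho_builtin)
```

Both numbers should agree, and the value lies between -1 and 1 by construction.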

Basics of Regression
We observe a response or dependent variable (Y). With each Y, we also observe regressors or predictors $\{X_1, \ldots, X_n\}$.
Goal: determine the mathematical relationship between the response variable and the regressors:
$$Y = h(X_1, \ldots, X_n)$$

The function can be non-linear. In this class, we will focus on the case where Y is a linear function of $\{X_1, \ldots, X_n\}$:
$$Y = h(X_1, \ldots, X_n) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$
[Figure: plot with horizontal axis C1]

Different forms of regression
Simple linear regression: $Y = \beta_0 + \beta_1 x + \varepsilon$
Multiple linear regression: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
Polynomial regression: $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$
...

Basics of regression
Which is the RESPONSE and which is the PREDICTOR?
The response or dependent variable varies with different values of the regressor/predictor.
The predictor values are fixed: we observe the response for these fixed values.
The focus is on explaining the response variable in association with one or more predictors.

Simple linear regression
Our goal is to find the best line that describes a linear relationship: find $(\beta_0, \beta_1)$ where
$$Y = \beta_0 + \beta_1 x + \varepsilon$$
Unknown parameters:
1. $\beta_0$: intercept (where the line crosses the y-axis)
2. $\beta_1$: slope of the line
Basic idea:
a. Plot the observations (X, Y)
b. Find the best line that follows the plotted points
[Figure: plot with horizontal axis C1]

Class activity
1. In the Meddicorp Company example, the response is:
   A. Sales  B. Advertisement expenditure
2. In the Meddicorp Company example, the predictor is:
   A. Sales  B. Advertisement expenditure
3. To learn about the association between sales and the advertisement expenditure we can use simple linear regression:
   A. True  B. False
4. If the association between response and predictor is positive, then the slope is:
   A. Positive  B. Negative  C. We cannot identify the slope sign

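Simple linear regression: model
With observed data $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, we model the linear relationship
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n$$
$$E(\varepsilon_i) = 0, \qquad \mathrm{Var}(\varepsilon_i) = \sigma^2$$
$\{\varepsilon_1, \ldots, \varepsilon_n\}$ are independent random variables (later we assume $\varepsilon_i \sim$ Normal).
Later, we will check these assumptions when we check model adequacy.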

Summary: simple linear regression
Based on the scatter diagram, it is probably reasonable to assume that the mean of the random variable Y is related to X by the following simple linear regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
Here $Y_i$ is the response, $X_i$ is the regressor or predictor, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the random error. The slope and intercept of the line are called regression coefficients. The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y.
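To make the model assumptions concrete, here is a minimal simulation sketch in R. All numeric values (sample size, coefficients, error standard deviation, seed) are hypothetical choices made for this illustration and do not come from the lecture.

```r
set.seed(1)        # arbitrary seed, only for reproducibility of the sketch

# Hypothetical "true" parameter values (assumptions for illustration only)
n     <- 50
beta0 <- 2
beta1 <- 0.5
sigma <- 1

x   <- seq(1, 10, length.out = n)       # fixed predictor values
eps <- rnorm(n, mean = 0, sd = sigma)   # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps          # responses generated by the model

# The mean of Y at each x lies on the line beta0 + beta1 * x;
# the observations scatter around it with variance sigma^2
mean_line <- beta0 + beta1 * x
print(head(cbind(x, mean_line, y)))
```

Plotting `y` against `x` and overlaying `mean_line` shows the kind of scatter the model is meant to describe.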

Estimate regression parameters
To estimate $(\beta_0, \beta_1)$, we find the values that minimize the squared error:
$$\sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$$
Derivation: method of least squares.

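Method of least squares
Model for the observations:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
To estimate $(\beta_0, \beta_1)$, we find the values that minimize the squared error:
$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
The least squares estimators of $\beta_0$ and $\beta_1$, say $\hat{\beta}_0$ and $\hat{\beta}_1$, must satisfy
$$\left. \frac{\partial L}{\partial \beta_0} \right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
$$\left. \frac{\partial L}{\partial \beta_1} \right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) x_i = 0$$
Least squares normal equations:
$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i$$
[Figure 11-3: observed values (data, y) and the estimated regression line, plotted against x]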
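The two normal equations form a 2×2 linear system in the intercept and slope estimates. The sketch below, under the assumption that only the five Meddicorp rows shown on the Data slide are available, solves that system with `solve()` and compares the result with R's `lm()`. Because only 5 of the 25 offices are used, the numbers are illustrative only.

```r
# First five Meddicorp offices from the Data slide (the remaining rows are not shown)
y <- c(963.50, 893.00, 1057.25, 1183.25, 1419.50)  # sales (response)
x <- c(374.27, 408.50, 414.31, 448.42, 517.88)     # advertisement expenditure (predictor)
n <- length(x)

# Normal equations written as A %*% c(b0, b1) = rhs
A <- matrix(c(n,      sum(x),
              sum(x), sum(x^2)),
            nrow = 2, byrow = TRUE)
rhs <- c(sum(y), sum(x * y))

b <- solve(A, rhs)   # b[1] = intercept estimate, b[2] = slope estimate

# lm() minimizes the same sum of squares, so it should give the same estimates
fit <- lm(y ~ x)

print(b)
print(coef(fit))
```

In practice `lm()` is preferable because it uses a QR decomposition, which is numerically more stable than forming the normal equations directly.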

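Least squares estimates
The least squares estimates of the intercept and slope in the simple linear regression model are
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (11\text{-}7)$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \dfrac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \qquad (11\text{-}8)$$
where $\bar{y} = (1/n) \sum_{i=1}^{n} y_i$ and $\bar{x} = (1/n) \sum_{i=1}^{n} x_i$.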

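Alternative notation
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \qquad (11\text{-}10)$$
$$S_{xy} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \qquad (11\text{-}11)$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$
Fitted (estimated) regression model:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$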

Example: oxygen and hydrocarbon level
Question: fit a simple regression model to relate purity (y) to hydrocarbon level (x).
Table 11-1 Oxygen and Hydrocarbon Levels
Observation   Hydrocarbon level   Purity
number        x (%)               y (%)
 1            0.99                90.01
 2            1.02                89.05
 3            1.15                91.43
 4            1.29                93.74
 5            1.46                96.73
 6            1.36                94.45
 7            0.87                87.59
 8            1.23                91.77
 9            1.55                99.42
10            1.40                93.65
11            1.19                93.54
12            1.15                92.52
13            0.98                90.56
14            1.01                89.54
15            1.11                89.85
16            1.20                90.39
17            1.26                93.25
18            1.32                93.41
19            1.43                94.98
20            0.95                87.33
Figure 11-1 Scatter diagram of oxygen purity versus hydrocarbon level from Table 11-1. [Axes: purity (y) versus hydrocarbon level (x)]

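For the data in Table 11-1 (n = 20):
$$\sum_{i=1}^{20} x_i = 23.92, \quad \sum_{i=1}^{20} y_i = 1{,}843.21, \quad \bar{x} = 1.1960, \quad \bar{y} = 92.1605$$
$$\sum_{i=1}^{20} y_i^2 = 170{,}044.5321, \quad \sum_{i=1}^{20} x_i^2 = 29.2892, \quad \sum_{i=1}^{20} x_i y_i = 2{,}214.6566$$
$$S_{xx} = \sum_{i=1}^{20} x_i^2 - \frac{\left(\sum_{i=1}^{20} x_i\right)^2}{20} = 29.2892 - \frac{(23.92)^2}{20} = 0.68088$$
and
$$S_{xy} = \sum_{i=1}^{20} x_i y_i - \frac{\left(\sum_{i=1}^{20} x_i\right)\left(\sum_{i=1}^{20} y_i\right)}{20} = 2{,}214.6566 - \frac{(23.92)(1{,}843.21)}{20} = 10.17744$$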
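These summary quantities can be checked directly from Table 11-1. The following R sketch recomputes the sums and then $S_{xx}$ and $S_{xy}$ with the computational formulas; the variable names are chosen here for the example.

```r
# Table 11-1: hydrocarbon level x (%) and oxygen purity y (%)
x <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
       1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
       93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)
n <- length(x)

# Raw sums used in the computational formulas
sum_x  <- sum(x)
sum_y  <- sum(y)
sum_xx <- sum(x^2)
sum_xy <- sum(x * y)

# Corrected sums of squares and cross-products
Sxx <- sum_xx - sum_x^2 / n
Sxy <- sum_xy - sum_x * sum_y / n

print(c(sum_x = sum_x, sum_y = sum_y, sum_xx = sum_xx, sum_xy = sum_xy))
print(c(Sxx = Sxx, Sxy = Sxy))
```

The centered forms `sum((x - mean(x))^2)` and `sum((x - mean(x)) * (y - mean(y)))` give the same values and are usually less sensitive to rounding.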

Therefore, the least squares estimates of the slope and intercept are
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{10.17744}{0.68088} = 14.94748$$
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 92.1605 - (14.94748)(1.196) = 74.28331$$
The fitted simple linear regression model (with the coefficients reported to three decimal places) is
$$\hat{y} = 74.283 + 14.947x$$
[Figure: fitted regression line; oxygen purity y (%) versus hydrocarbon level x (%)]
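The same estimates can be obtained with R's built-in `lm()`. This is a minimal sketch using the Table 11-1 data; it is one way to check the hand calculation, not necessarily how the slides produced their numbers.

```r
# Table 11-1: hydrocarbon level x (%) and oxygen purity y (%)
x <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
       1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
       93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)

# Least squares fit of purity on hydrocarbon level
fit <- lm(y ~ x)

b0_hat <- coef(fit)[["(Intercept)"]]   # intercept estimate
b1_hat <- coef(fit)[["x"]]             # slope estimate

print(round(c(intercept = b0_hat, slope = b1_hat), 3))
```

Calling `plot(x, y)` followed by `abline(fit)` draws the scatter diagram with the fitted line.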

Interpretation of regression model
Regression model: $\hat{y} = 74.283 + 14.947x$
The model gives $\hat{y} = 74.283 + 14.947(1.00) = 89.23\%$ when the hydrocarbon level is x = 1.00%. This may be interpreted as an estimate of the true population mean purity when x = 1.00%.
The estimates are subject to error. Later, we will use confidence intervals to describe the error in estimation from a regression model.
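As a small worked check, the fitted line can be wrapped in an R function and evaluated at a chosen hydrocarbon level. The function name `predict_purity` is made up for this sketch; the coefficients are the rounded values reported above.

```r
# Rounded coefficients of the fitted model y_hat = 74.283 + 14.947 x
b0_hat <- 74.283
b1_hat <- 14.947

# Hypothetical helper: estimated mean purity (%) at hydrocarbon level x (%)
predict_purity <- function(x) {
  b0_hat + b1_hat * x
}

y_hat <- predict_purity(1.00)
print(y_hat)
```

Evaluating the function only makes sense for hydrocarbon levels within the observed range of the data, roughly 0.87% to 1.55%.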

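Estimation of variance
Using the fitted model, we can estimate the value of the response variable for a given predictor:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
Residuals: $r_i = y_i - \hat{y}_i$
Our model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \ldots, n$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$
Unbiased estimator (MSE: mean square error):
$$\hat{\sigma}^2 = \mathrm{MSE} = \frac{\sum_{i=1}^{n} r_i^2}{n - 2}$$
Oxygen and hydrocarbon level example: $\hat{\sigma}^2 = 1.18$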
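The variance estimate for the oxygen purity example can be reproduced from the residuals. The R sketch below assumes the Table 11-1 data and divides the residual sum of squares by n − 2, as in the MSE formula.

```r
# Table 11-1: hydrocarbon level x (%) and oxygen purity y (%)
x <- c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
       1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y <- c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
       93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)
n <- length(x)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)    # fitted values from the estimated line
r     <- y - y_hat      # residuals r_i = y_i - y_hat_i

# Unbiased estimator of sigma^2: residual sum of squares over n - 2
sigma2_hat <- sum(r^2) / (n - 2)

# summary(fit)$sigma is the residual standard error, i.e. sqrt(MSE)
sigma2_from_summary <- summary(fit)$sigma^2

print(round(sigma2_hat, 2))
```

The divisor is n − 2 rather than n because two parameters, the intercept and the slope, are estimated from the data.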

Example: Oil Well Drilling Costs
Estimating the costs of drilling oil wells is an important consideration for the oil industry.
Data: the total costs and the depths of 16 off-shore oil wells located in the Philippines.
Depth    Cost
 5000     2596.8
 5200     3328.0
 6000     3181.1
 6538     3198.4
 7109     4779.9
 7556     5905.6
 8005     5769.2
 8207     8089.5
 8210     4813.1
 8600     5618.7
 9026     7736.0
 9197     6788.3
 9926     7840.8
10813     8882.5
13800    10489.5
14311    12506.6

Step 1: graphical display of the data
R code: plot(Depth, Cost, xlab = "Depth", ylab = "Cost")

Class activity
1. In this example, the response is:
   A. The drilling cost  B. The well depth
2. In this example, the dependent variable is:
   A. The drilling cost  B. The well depth
3. Is there a linear association between the drilling cost and the well depth?
   A. Yes and positive  B. Yes and negative  C. No

Step 2: find the relationship between Depth and Cost

Results and use of regression model
1. Fit a linear regression model: the estimates $(\hat{\beta}_0, \hat{\beta}_1)$ are (-2277.1, 1.0033).
2. What does the model predict as the cost increase for an additional depth of 1000 ft? If we increase X by 1000, we increase Y by $1000 \hat{\beta}_1 \approx \$1003$.
3. What cost would you predict for an oil well of 10,000 ft depth? X = 10,000 ft is in the range of the data, and the estimate of the line at x = 10,000 is $\hat{\beta}_0 + (10{,}000) \hat{\beta}_1 = -2277.1 + 10{,}033 \approx \$7756$.
4. What is the estimate of the error variance? $\hat{\sigma}^2 \approx 774{,}211$.
5. What could you say about the cost of an oil well of depth 20,000 ft? X = 20,000 ft is much greater than all the observed values of X. We should not extrapolate the regression out that far.
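The quantities discussed in this example can be reproduced from the 16 wells listed earlier. The R sketch below fits the line with `lm()`, then computes the cost change per additional 1000 ft, the predicted cost at 10,000 ft, the error variance estimate, and a simple range check for 20,000 ft. Variable names other than `Depth` and `Cost` are chosen here for illustration.

```r
# Depths and total costs of the 16 off-shore oil wells
Depth <- c(5000, 5200, 6000, 6538, 7109, 7556, 8005, 8207,
           8210, 8600, 9026, 9197, 9926, 10813, 13800, 14311)
Cost  <- c(2596.8, 3328.0, 3181.1, 3198.4, 4779.9, 5905.6, 5769.2, 8089.5,
           4813.1, 5618.7, 7736.0, 6788.3, 7840.8, 8882.5, 10489.5, 12506.6)

# 1. Fit the simple linear regression of Cost on Depth
fit <- lm(Cost ~ Depth)
b0_hat <- coef(fit)[["(Intercept)"]]
b1_hat <- coef(fit)[["Depth"]]

# 2. Predicted cost increase for an additional 1000 ft of depth
extra_per_1000ft <- 1000 * b1_hat

# 3. Predicted cost at 10,000 ft (inside the observed depth range)
pred_10000 <- unname(predict(fit, newdata = data.frame(Depth = 10000)))

# 4. Estimate of the error variance: residual sum of squares over n - 2
sigma2_hat <- sum(resid(fit)^2) / (length(Depth) - 2)

# 5. 20,000 ft lies outside the observed depths, so prediction there is extrapolation
in_range_20000 <- 20000 >= min(Depth) && 20000 <= max(Depth)

print(c(b0_hat = b0_hat, b1_hat = b1_hat))
print(c(extra_per_1000ft = extra_per_1000ft, pred_10000 = pred_10000))
print(sigma2_hat)
print(in_range_20000)
```

Small differences from hand calculations are expected when rounded coefficients are used instead of the full-precision fit.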

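Summary
Simple linear regression: $Y = \beta_0 + \beta_1 x + \varepsilon$
Estimate the coefficients from data by the method of least squares:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$
Fitted (estimated) regression model:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
Estimate of variance
[Figure: observed values (data, y) and the estimated regression line, plotted against x]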