Linear Regression Models


Linear Regression Models
Dr. John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu
COMP 528 Lecture 9, 15 February 2005

Goals for Today

Understand how to:
- Use scatter diagrams to inspect the relationship between two numerical variables
- Fit a line to observations using linear regression
- Calculate and interpret a coefficient of determination
- Compute confidence intervals associated with regressions
- Verify the assumptions underlying regression analysis

Scatter Diagrams

What is a good model?
[Three scatter plots of Y versus X: a good model, a bad model (wrong slope), and a bad model (non-linear behavior)]

Possible Models of a Random Variable

- Mean value observed in several trials
- A distribution that fits the observations, e.g. y ~ N(µ,σ)
- An equation in terms of one or more independent variables

Linear Regression Analysis

Fit a linear model that predicts the value of a random variable.
Examples:
- predict the time for gzip to compress a file from its size
- predict the size of a file compressed by gzip from its original size

Estimating Model Parameters

Given n observation pairs {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
- each x_i is an independent variable
- each y_i is a dependent variable

Determine regression parameters b0 and b1 in ŷ_i = b0 + b1 x_i
ŷ_i: predicted value of the i-th observation
e_i = y_i − ŷ_i: prediction error for the i-th observation
[Plot: Y versus X showing an observed point (x_i, y_i), its prediction (x_i, ŷ_i), and the error e_i between them]

Fitting a Line to Data

Approach 1: minimize the sum of prediction errors. Choose the line so that Σ e_i = 0.
Many lines satisfy this equation, so a better method is needed.
[Plot: Y versus X with several different lines, each giving zero total error]
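A quick numeric sketch of why Approach 1 fails (the data values here are made up for illustration): any line with intercept b0 = ȳ − b1 x̄ passes through (x̄, ȳ), so its total error is zero no matter what slope we pick.

```python
# Illustration: many different lines give zero total prediction error.
# Any line with intercept b0 = ybar - b1*xbar passes through (xbar, ybar),
# so its residuals sum to zero regardless of the slope b1.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

for b1 in (0.0, 0.6, 1.0):           # three different slopes
    b0 = ybar - b1 * xbar            # force the line through (xbar, ybar)
    errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    print(b1, round(sum(errors), 10))   # total error is 0 for every slope
```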

Fitting a Line to Data

Approach 2: least-squares fitting criterion. Minimize the sum of squared errors (SSE), where SSE = Σ e_i², subject to the constraint that the total error is 0: Σ e_i = 0.

Deriving Coefficient b0 for Least Squares

Error: e_i = y_i − ŷ_i = y_i − (b0 + b1 x_i)
Mean error: ē = (1/n) Σ e_i = (1/n) Σ (y_i − (b0 + b1 x_i)) = ȳ − b0 − b1 x̄
Setting the mean error to 0, we obtain b0:
0 = ȳ − b0 − b1 x̄  ⇒  b0 = ȳ − b1 x̄

Computing the Sum of Squared Errors

Error: e_i = y_i − ŷ_i = y_i − (b0 + b1 x_i)
Substituting b0 = ȳ − b1 x̄, we get e_i = y_i − (ȳ − b1 x̄ + b1 x_i), so
SSE = Σ e_i² = Σ ((y_i − ȳ) − b1 (x_i − x̄))²
SSE/(n−1) = (1/(n−1)) Σ ((y_i − ȳ)² − 2 b1 (y_i − ȳ)(x_i − x̄) + b1² (x_i − x̄)²)
          = (1/(n−1)) Σ (y_i − ȳ)² − 2 b1 (1/(n−1)) Σ (y_i − ȳ)(x_i − x̄) + b1² (1/(n−1)) Σ (x_i − x̄)²
          = s_y² − 2 b1 s_xy + b1² s_x²
where s_x², s_y² are the sample variances and s_xy is the sample covariance.
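The identity above can be verified numerically (a sketch with made-up data; b1 is taken as the least-squares slope s_xy/s_x² derived on the next slide):

```python
# Numeric check of the identity SSE/(n-1) = s_y^2 - 2*b1*s_xy + b1^2*s_x^2
# (data made up for illustration).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

s_x2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)            # sample variance of x
s_y2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)            # sample variance of y
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)  # covariance

b1 = s_xy / s_x2
b0 = ybar - b1 * xbar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

lhs = sse / (n - 1)
rhs = s_y2 - 2 * b1 * s_xy + b1 ** 2 * s_x2
print(round(lhs, 6), round(rhs, 6))   # 0.6 0.6 -- the two sides agree
```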

Deriving Coefficient b1 for Least Squares

To find b1, solve ∂(SSE)/∂b1 = 0.
SSE/(n−1) = s_y² − 2 b1 s_xy + b1² s_x²
(1/(n−1)) ∂(SSE)/∂b1 = −2 s_xy + 2 b1 s_x²
0 = −2 s_xy + 2 b1 s_x²  ⇒  b1 = s_xy / s_x²
Equivalently, b1 = (Σ x_i y_i − n x̄ ȳ) / (Σ x_i² − n x̄²)
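The two closed-form coefficients are easy to compute directly; here is a minimal sketch (the data values are made up for illustration):

```python
# Least-squares fit of y = b0 + b1*x using the closed-form coefficients
# b1 = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2), b0 = ybar - b1*xbar.
def fit_line(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / \
         (sum(xi ** 2 for xi in x) - n * xbar ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

# Example data (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(round(b0, 6), round(b1, 6))   # 2.2 0.6
```

For these points the fitted line is ŷ = 2.2 + 0.6x; the same two formulas appear in most statistics texts as the "normal equations" solution.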

Allocating Variation

Quantifying variation:
SST = Σ (y_i − ȳ)²  (total sum of squares)
SSE = Σ (y_i − ŷ_i)²  (sum of squares error)
SSR = SST − SSE  (sum of squares regression)
Key questions: how much variation is unexplained? how much variation is accounted for by the regression?

Coefficient of Determination

Measuring the quality of a regression model:
coefficient of determination = R² = SSR/SST = (SST − SSE)/SST = (Σ (y_i − ȳ)² − Σ (y_i − ŷ_i)²) / Σ (y_i − ȳ)²
What does each of the following mean: R² = 1? R² = 0? R² = 0.77?
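A minimal sketch of the variation split and R² (illustrative data; b0 = 2.2 and b1 = 0.6 are the least-squares coefficients for this data):

```python
# Allocation of variation and coefficient of determination R^2 = SSR/SST.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
ybar = sum(y) / n
b0, b1 = 2.2, 0.6                       # least-squares fit for this data
yhat = [b0 + b1 * xi for xi in x]       # predicted values

sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained variation
ssr = sst - sse                                        # explained by regression
r2 = ssr / sst
print(round(sst, 6), round(sse, 6), round(r2, 6))     # 6.0 2.4 0.6
```

So for this data the regression accounts for 60% of the total variation in y.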

Standard Deviation of Errors

Variance of errors = SSE/(degrees of freedom): s_e² = SSE/(n−2)
Why n−2 degrees of freedom for SSE? SSE is computed after calculating two regression parameters.
Degrees of freedom and linear regression: SST = SSR + SSE, and (n−1) = 1 + (n−2).
The variance of errors is known as the Mean Squared Error (MSE).
Standard deviation of errors for linear regression: s_e = √MSE = √(SSE/(n−2))
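Continuing the same illustrative example (SSE ≈ 2.4 from the fit b0 = 2.2, b1 = 0.6 on this data):

```python
import math

# Standard deviation of errors s_e = sqrt(SSE / (n - 2)) for a fitted line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
b0, b1 = 2.2, 0.6                       # least-squares fit for this data
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

mse = sse / (n - 2)        # variance of errors: n-2 degrees of freedom
s_e = math.sqrt(mse)       # standard deviation of errors
print(round(mse, 6), round(s_e, 6))    # 0.8 0.894427
```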

Confidence Intervals for b0 & b1

Assume the population is described by a linear model y = β0 + β1 x.
b0 and b1 are estimates of β0 and β1 from a single sample; other samples might yield different estimates.
How accurate are b0 and b1? Compute confidence intervals at a 100(1−α)% confidence level:
b0 ± t[1−α/2; n−2] s_b0    b1 ± t[1−α/2; n−2] s_b1
where
s_b0 = s_e (1/n + x̄² / (Σ x_i² − n x̄²))^(1/2)
s_b1 = s_e / (Σ x_i² − n x̄²)^(1/2)
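A sketch of the interval computation on the same illustrative data; the quantile t[0.975; n−2] = t[0.975; 3] = 3.182 is taken from a t-table (an assumption, not computed here):

```python
import math

# 95% confidence intervals for b0 and b1 of a least-squares line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
b0, b1 = 2.2, 0.6                       # least-squares fit for this data
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))          # standard deviation of errors
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2

s_b0 = s_e * math.sqrt(1 / n + xbar ** 2 / sxx)   # std. dev. of b0
s_b1 = s_e / math.sqrt(sxx)                       # std. dev. of b1

t = 3.182                               # t[0.975; 3] from a t-table
print((round(b0 - t * s_b0, 3), round(b0 + t * s_b0, 3)))   # (-0.785, 5.185)
print((round(b1 - t * s_b1, 3), round(b1 + t * s_b1, 3)))   # (-0.3, 1.5)
```

Note that the interval for b1 includes zero, so with this tiny sample the slope is not significantly different from 0 at the 95% level.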

Confidence Intervals for Predictions

Use the regression model for y to predict for new x values: ŷ_p = b0 + b1 x_p
This is the mean value of the predicted response based on the sample.
Standard deviation of the mean of a future sample of m observations at x_p:
s_ŷmp = s_e (1/m + 1/n + (x_p − x̄)² / (Σ x_i² − n x̄²))^(1/2)
For 1 observation, 1/m = 1; as m → ∞, 1/m → 0.
Confidence interval for m future predictions at x_p: ŷ_p ± t[1−α/2; n−2] s_ŷmp
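A sketch of the prediction interval for a single future observation (m = 1) on the same illustrative data, again assuming t[0.975; 3] = 3.182 from a t-table:

```python
import math

# 95% confidence interval for the mean of m future observations at x_p.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
b0, b1 = 2.2, 0.6                       # least-squares fit for this data
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))
sxx = sum(xi ** 2 for xi in x) - n * xbar ** 2

def prediction_interval(x_p, m=1, t=3.182):
    """CI for the mean of m future observations at x_p."""
    y_p = b0 + b1 * x_p
    s_pred = s_e * math.sqrt(1 / m + 1 / n + (x_p - xbar) ** 2 / sxx)
    return y_p - t * s_pred, y_p + t * s_pred

lo, hi = prediction_interval(3.0)       # at xbar, where the interval is tightest
print(round(lo, 3), round(hi, 3))       # 0.882 7.118
```

Calling `prediction_interval` with an x_p far from x̄ (say 10.0) gives a visibly wider interval, matching the fan shape on the next slide.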

Facts about Confidence in Predictions

s_ŷmp = s_e (1/m + 1/n + (x_p − x̄)² / (Σ x_i² − n x̄²))^(1/2)
The confidence interval ŷ_p ± t[1−α/2; n−2] s_ŷmp is tightest at x_p = x̄ and widens as x_p moves away from x̄.
[Plot: Y versus X showing the regression line (mean) with upper and lower confidence bounds curving away from it on either side of x̄]
Be cautious in making predictions far from x̄.

Assumptions of Linear Regression

When deriving regression parameters, we make the following four assumptions:
1. The predictor x is non-stochastic and is measured error-free
2. The true relationship between y and predictor x is linear
3. The model errors are statistically independent
4. The errors are normally distributed with a 0 mean and constant std. deviation
If any of the assumptions are violated, the model would be misleading. Apply visual tests to verify that assumptions 2-4 hold.

2. Test linear relationship of y and x

Use a scatter plot of y versus x.
[Four scatter plots of Y versus X: linear, multilinear, outlier, nonlinear]

3. Errors are independent

Plot e_i versus ŷ_i and verify that there is no trend.
Plot error as a function of experiment number and verify that there is no trend; any trend would indicate that some factor not accounted for affected the observed values.
[Plots: e_i versus ŷ_i, and e_i versus experiment number i]
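The slides' test is visual; a rough numeric stand-in is the lag-1 correlation of the residuals (e_i against e_{i+1}), which should be near zero when errors are independent. This sketch uses simulated data (y = 2 + 0.5x + Gaussian noise) purely for illustration:

```python
import math
import random

# Sketch: lag-1 correlation of least-squares residuals as an independence check.
random.seed(1)
x = list(range(1, 51))
y = [2 + 0.5 * xi + random.gauss(0, 1) for xi in x]   # simulated observations
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares fit (closed-form coefficients)
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar) / \
     (sum(xi ** 2 for xi in x) - n * xbar ** 2)
b0 = ybar - b1 * xbar
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]     # residuals

def corr(a, b):
    m = len(a)
    abar, bbar = sum(a) / m, sum(b) / m
    num = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - abar) ** 2 for ai in a) *
                    sum((bi - bbar) ** 2 for bi in b))
    return num / den

r1 = corr(e[:-1], e[1:])    # lag-1 correlation of residuals
print(round(r1, 3))          # small magnitude: no evidence of dependence
```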

4. Errors are normally distributed

Use a quantile-quantile plot of e_i (residual quantiles) versus N(0,1) (normal quantiles).
Check for a constant standard deviation of errors by verifying that there is no trend in the spread of a plot of e_i versus ŷ_i.
[Plots: residual quantile versus normal quantile; e_i versus ŷ_i with no trend in spread; e_i versus ŷ_i with increasing spread]
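The quantile-quantile pairs themselves are easy to compute; here is a minimal sketch using Python's `statistics.NormalDist` for the normal quantiles (the data and fitted coefficients are the same illustrative example used earlier; if the errors are normal, the pairs fall near a straight line):

```python
import statistics

# Sketch: coordinates of a normal quantile-quantile plot for the residuals.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6                       # least-squares fit for this data
e = sorted(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))   # sorted residuals
n = len(e)

# Pair the i-th smallest residual with the ((i + 0.5)/n)-quantile of N(0,1).
qq = [(statistics.NormalDist().inv_cdf((i + 0.5) / n), e[i]) for i in range(n)]
for nq, rq in qq:
    print(round(nq, 3), round(rq, 3))   # (normal quantile, residual quantile)
```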