University of California, Los Angeles Department of Statistics. Simple regression analysis

Statistics 100C                                    Instructor: Nicolas Christou

Simple regression analysis

Introduction: Regression analysis is a statistical method aiming at discovering how one variable is related to another variable. It is useful in predicting one variable from another variable. Consider the following scatterplot of the percentage of body fat against thigh circumference (cm). This data set is described in detail in the handout on R.

[Figure: scatterplot of body fat (%) against thigh circumference (cm).]

And another one: this is the concentration of lead against the concentration of zinc (see the handout on R for more details on this data set).

[Figure: scatterplot of lead concentration (ppm) against zinc concentration (ppm).]

What do you observe? Is there an equation that can model the picture above?

Regression model equation:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where
- $y$: response variable (random)
- $x$: predictor variable (non-random)
- $\beta_0$: intercept (non-random)
- $\beta_1$: slope (non-random)
- $\epsilon$: random error term, $\epsilon \sim N(0, \sigma)$

Using the method of least squares we estimate $\beta_0$ and $\beta_1$:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n (x_i - \bar{x}) y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}$$
$$\hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x}$$

The fitted line is: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.

Distribution of $\hat{\beta}_1$ and $\hat{\beta}_0$:
$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}}\right), \qquad \hat{\beta}_0 \sim N\left(\beta_0, \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}\right)$$

The standard deviation $\sigma$ is unknown and it is estimated with the residual standard error, which measures the variability around the fitted line. It is computed as follows:
$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}}$$
where $e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ is called the residual (the difference between the observed $y_i$ value and the fitted value $\hat{y}_i$).
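The least-squares formulas above translate directly into code. Here is a minimal sketch in plain Python; the x, y values are made up for illustration and are not from the handout.

```python
import math

# Least-squares estimates from the formulas above, in plain Python.
# The x, y values below are made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# beta1_hat = sum (x_i - xbar)(y_i - ybar) / sum (x_i - xbar)^2
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar                     # beta0_hat = ybar - beta1_hat * xbar

# fitted values, residuals, and the residual standard error s_e
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
s_e = math.sqrt(sum(e * e for e in resid) / (n - 2))

print(round(b1, 4), round(b0, 4))         # → 1.99 0.05
```

Note that the residuals sum to zero, as least squares guarantees.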

Coefficient of determination: The total variation in $y$ (total sum of squares, $SST = \sum (y_i - \bar{y})^2$) is equal to the regression sum of squares ($SSR = \sum (\hat{y}_i - \bar{y})^2$) plus the error sum of squares ($SSE = \sum (y_i - \hat{y}_i)^2$):
$$SST = SSR + SSE$$
The percentage of the variation in $y$ that can be explained by $x$ is called the coefficient of determination ($R^2$):
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Always $0 \le R^2 \le 1$.

Useful: $SST = \sum (y_i - \bar{y})^2 = (n-1) s_y^2$, where $s_y^2$ is the variance of $y$.

Coefficient of correlation ($r$):
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
Or, easier for calculations:
$$r = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}}$$
Always $-1 \le r \le 1$ and $R^2 = r^2$.

Another formula for $r$:
$$r = \hat{\beta}_1 \frac{s_x}{s_y}$$
where $s_x$, $s_y$ are the standard deviations of $x$ and $y$.

Sample covariance between $y$ and $x$:
$$\text{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
Therefore $r = \frac{\text{cov}(x, y)}{s_x s_y} \Rightarrow \text{cov}(x, y) = r s_x s_y$ and $\hat{\beta}_1 = r \frac{s_y}{s_x}$.
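The decomposition $SST = SSR + SSE$ and the identity $R^2 = r^2$ can be verified numerically. A small Python sketch with illustrative data (not from the handout):

```python
import math

# Numerical check of SST = SSR + SSE and R^2 = r^2 on small made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)          # this is SST

# fit the least-squares line, then split the variation in y
b1 = sxy / sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in x]

sst = syy
ssr = sum((fi - ybar) ** 2 for fi in fitted)     # explained variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained

r = sxy / math.sqrt(sxx * syy)                   # correlation coefficient
r2 = ssr / sst                                   # coefficient of determination

print(abs(sst - (ssr + sse)) < 1e-9, abs(r2 - r * r) < 1e-9)  # → True True
```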

Standard error of $\hat{\beta}_1$ and $\hat{\beta}_0$:
$$s_{\hat{\beta}_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{s_e}{\sqrt{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}}$$
and
$$s_{\hat{\beta}_0} = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}} = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}}$$

Testing for linear relationship between $y$ and $x$:
$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \ne 0$$
Test statistic:
$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$$
Reject $H_0$ (i.e. there is a linear relationship) if $t > t_{\frac{\alpha}{2}; n-2}$ or $t < -t_{\frac{\alpha}{2}; n-2}$.

Confidence interval for $\beta_1$:
$$\hat{\beta}_1 - t_{\frac{\alpha}{2}; n-2} s_{\hat{\beta}_1} \le \beta_1 \le \hat{\beta}_1 + t_{\frac{\alpha}{2}; n-2} s_{\hat{\beta}_1}$$
Or: $\beta_1$ falls in $\hat{\beta}_1 \pm t_{\frac{\alpha}{2}; n-2} s_{\hat{\beta}_1}$.

Prediction interval for $y$ for a given $x$ (when $x_i = x_g$):
$$\hat{y}_g \pm t_{\frac{\alpha}{2}; n-2} \, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}, \quad \text{where } \hat{y}_g = \hat{\beta}_0 + \hat{\beta}_1 x_g.$$

Confidence interval for the mean value of $y$ for a given $x$ (when $x_i = x_g$):
$$\hat{y}_g \pm t_{\frac{\alpha}{2}; n-2} \, s_e \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}, \quad \text{where } \hat{y}_g = \hat{\beta}_0 + \hat{\beta}_1 x_g.$$

Useful things to know:
$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n} \quad \text{and} \quad \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$
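These inference formulas can also be sketched in Python. The data are illustrative, and the critical value $t_{0.025;4} = 2.776$ is hard-coded from a t table (in practice it would be looked up in software):

```python
import math

# Slope standard error, t test, confidence and prediction intervals,
# following the handout's formulas. Data are made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 4.1, 5.2, 6.8, 8.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(e * e for e in resid) / (n - 2))

s_b1 = s_e / math.sqrt(sxx)            # standard error of beta1_hat
t_stat = b1 / s_b1                     # test statistic for H0: beta1 = 0
t_crit = 2.776                         # t_{.025; n-2} = t_{.025; 4}, from a t table

# 95% confidence interval for beta1
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)

# 95% prediction interval for y at x = x_g
xg = 4.5
yg = b0 + b1 * xg
half = t_crit * s_e * math.sqrt(1 + 1 / n + (xg - xbar) ** 2 / sxx)
pi = (yg - half, yg + half)
```

Because the prediction interval carries the extra "1 +" term for the noise in a single new observation, it is always wider than the confidence interval for the mean response at the same $x_g$.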

Simple regression analysis - A simple example
The data below give the mileage per gallon ($Y$) obtained by a test automobile when using gasoline of varying octane ($x$):

   y       x       xy        y^2       x^2
  13.0     89    1157.0    169.00     7921
  13.5     93    1255.5    182.25     8649
  13.0     87    1131.0    169.00     7569
  13.2     90    1188.0    174.24     8100
  13.3     89    1183.7    176.89     7921
  13.8     95    1311.0    190.44     9025
  14.3    100    1430.0    204.49    10000
  14.0     98    1372.0    196.00     9604

$\sum_{i=1}^8 y_i = 108.1$, $\sum_{i=1}^8 x_i = 741$, $\sum_{i=1}^8 x_i y_i = 10028.2$, $\sum_{i=1}^8 y_i^2 = 1462.31$, $\sum_{i=1}^8 x_i^2 = 68789$

a. Find the least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.
$$\hat{\beta}_1 = \frac{\sum x_i y_i - \frac{1}{n}(\sum x_i)(\sum y_i)}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \frac{10028.2 - \frac{1}{8}(741)(108.1)}{68789 - \frac{741^2}{8}} = 0.100325.$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{108.1}{8} - 0.100325 \cdot \frac{741}{8} = 4.2199.$$
Therefore the fitted line is: $\hat{y}_i = 4.2199 + 0.100325 x_i$.

b. Compute the fitted values and residuals.
Using the fitted line $\hat{y}_i = 4.2199 + 0.100325 x_i$ we can find the fitted values and residuals. For example, the first fitted value is $\hat{y}_1 = 4.2199 + 0.100325(89) = 13.1488$, and the first residual is $e_1 = y_1 - \hat{y}_1 = 13.0 - 13.1488 = -0.1488$, etc. The table below shows all the fitted values and residuals.

   yhat_i       e_i        e_i^2
  13.14883   -0.14882    0.02215
  13.55013   -0.05013    0.00251
  12.94818    0.05183    0.00269
  13.24915   -0.04915    0.00242
  13.14883    0.15118    0.02285
  13.75078    0.04922    0.00242
  14.25240    0.04760    0.00227
  14.05175   -0.05175    0.00268

$\sum e_i = 0$, $\sum e_i^2 = 0.05998$

c. Find the estimate of $\sigma^2$.
$$s_e^2 = \frac{\sum e_i^2}{n-2} = \frac{0.05998}{8-2} = 0.009997. \quad \text{Therefore } s_e = \sqrt{0.009997} = 0.09999.$$

d. Compute the standard error of $\hat{\beta}_1$.
$$s_{\hat{\beta}_1} = \frac{s_e}{\sqrt{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}} = \frac{0.09999}{\sqrt{68789 - \frac{741^2}{8}}} = 0.00806.$$

e. Construct a 95% confidence interval for $\beta_1$.
The parameter $\beta_1$ falls in $\hat{\beta}_1 \pm t_{\frac{\alpha}{2}; n-2} s_{\hat{\beta}_1}$, or $0.100325 \pm 2.447(0.00806)$. Therefore we are 95% confident that $\beta_1$ falls in the interval $0.0806 \le \beta_1 \le 0.12$.

f. Estimate the miles per gallon for an octane gasoline level of 94.
$$\hat{y} = 4.2199 + 0.100325(94) = 13.65.$$

g. Compute the coefficient of determination, $R^2$.
$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum e_i^2}{(n-1)s_y^2} = 1 - \frac{0.05998}{7(0.2298)} = 0.9627.$$
Therefore, 96.27% of the variation in $Y$ can be explained by $x$.

The same example can be done with a few simple commands in R:

#Enter the data:
> x <- c(89,93,87,90,89,95,100,98)
> y <- c(13,13.5,13,13.2,13.3,13.8,14.3,14)

#Run the regression of y on x:
> ex <- lm(y ~ x)

#Display the results:
> summary(ex)

Call:
lm(formula = y ~ x)

Residuals:
       Min         1Q     Median         3Q        Max
-0.1488221 -0.0505280 -0.0007717  0.0498781  0.1511779

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.21990    0.74743   5.646  0.00132 **
x            0.10032    0.00806  12.447 1.64e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09999 on 6 degrees of freedom
Multiple R-squared: 0.9627,     Adjusted R-squared: 0.9565
F-statistic: 154.9 on 1 and 6 DF,  p-value: 1.643e-05
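The arithmetic in parts (a) through (g) can also be cross-checked with a few lines of Python, using the octane data from the table above:

```python
import math

# Octane example: verify beta1_hat, beta0_hat, s_e and R^2.
x = [89, 93, 87, 90, 89, 95, 100, 98]
y = [13.0, 13.5, 13.0, 13.2, 13.3, 13.8, 14.3, 14.0]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# computational formulas for the least-squares estimates
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

# residuals, residual standard error, and R^2
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in resid)
s_e = math.sqrt(sse / (n - 2))

ybar = sum_y / n
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - sse / sst

print(round(b1, 6), round(b0, 4), round(s_e, 5), round(r2, 4))
# → 0.100325 4.2199 0.09999 0.9627
```

The numbers match the hand calculation and the R output.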

Simple regression in R - examples

Example 1: We will use the following data:

data1 <- read.table("http://www.stat.ucla.edu/~christo/statistics100c/body_fat.txt", header=TRUE)

This file contains data on percentage of body fat determined by underwater weighing and various body circumference measurements for 251 men. Here is the variable description:

Variable   Description
x1         Density determined from underwater weighing
y          Percent body fat from Siri's (1956) equation
x3         Age (years)
x4         Weight (lbs)
x5         Height (inches)
x6         Neck circumference (cm)
x7         Chest circumference (cm)
x8         Abdomen 2 circumference (cm)
x9         Hip circumference (cm)
x10        Thigh circumference (cm)
x11        Knee circumference (cm)
x12        Ankle circumference (cm)
x13        Biceps (extended) circumference (cm)
x14        Forearm circumference (cm)
x15        Wrist circumference (cm)

We want to run the regression of y (percentage body fat) on x10 (thigh circumference). Here is the regression output:

ex1 <- lm(data1$y ~ data1$x10)
summary(ex1)

Call:
lm(formula = data1$y ~ data1$x10)

Residuals:
     Min       1Q   Median       3Q      Max
-18.1601  -4.7707  -0.1076   4.5219  25.5994

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.26252    4.99529  -6.859 5.46e-11 ***
data1$x10     0.89861    0.08373  10.732  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.947 on 249 degrees of freedom
Multiple R-squared: 0.3163,     Adjusted R-squared: 0.3135
F-statistic: 115.2 on 1 and 249 DF,  p-value: < 2.2e-16

[Figure: scatterplot of body fat (%) against thigh circumference (cm) with the fitted line $\hat{y} = -34.26 + 0.8986x$.]

Example 2: Here are the data:

data2 <- read.table("http://www.stat.ucla.edu/~christo/statistics100c/soil.txt", header=TRUE)

This data set consists of 4 variables. The first two columns are the x and y coordinates, and the last two columns are the concentration of lead and zinc in ppm at 155 locations. We will run the regression of lead against zinc. Our goal is to build a regression model to predict the lead concentration from the zinc concentration. Here is the regression output.

ex2 <- lm(data2$lead ~ data2$zinc)
summary(ex2)

Call:
lm(formula = data2$lead ~ data2$zinc)

Residuals:
    Min      1Q  Median      3Q     Max
-79.853 -12.945  -1.646  15.339 104.200

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.367688   4.344268   3.998 9.92e-05 ***
data2$zinc   0.289523   0.007296  39.681  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.24 on 153 degrees of freedom
Multiple R-squared: 0.9114,     Adjusted R-squared: 0.9109
F-statistic: 1575 on 1 and 153 DF,  p-value: < 2.2e-16

Exercise:
a. Construct the histograms of lead and zinc and comment.
b. Transform the data to get a bell-shaped histogram.
c. Plot the transformed data of lead against the transformed data of zinc and compare this scatterplot with the scatterplot of the original data.
d. Run the regression of the transformed data of lead on the transformed data of zinc and compare the $R^2$ of this regression to the $R^2$ using the original data.
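The transform-then-regress workflow of parts (b) through (d) can be sketched in Python. The soil data are not downloaded here; a synthetic lognormal sample stands in for the lead and zinc concentrations, so the numbers it produces are illustrative only:

```python
import math
import random

# Sketch of the exercise workflow: regress a skewed response on a skewed
# predictor, then repeat after a log transform. Synthetic lognormal data
# stand in for the soil lead/zinc concentrations (illustrative only).
random.seed(1)

def r_squared(x, y):
    """R^2 of the simple regression of y on x, via R^2 = r^2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# zinc-like predictor; lead-like response related to it on the log scale
zinc = [math.exp(random.gauss(6.0, 0.8)) for _ in range(155)]
lead = [math.exp(0.9 * math.log(z) - 1.0 + random.gauss(0, 0.2)) for z in zinc]

r2_raw = r_squared(zinc, lead)
r2_log = r_squared([math.log(z) for z in zinc],
                   [math.log(v) for v in lead])
```

With the real soil data, comparing `r2_raw` and `r2_log` (and the two scatterplots) shows how the log transform affects the linearity of the relationship.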