University of California, Los Angeles Department of Statistics. Simple regression analysis


University of California, Los Angeles
Department of Statistics
Statistics 100C
Instructor: Nicolas Christou

Simple regression analysis

Introduction:
Regression analysis is a statistical method aiming at discovering how one variable is related to another variable. It is useful in predicting one variable from another variable. Consider the following scatterplot of the percentage of body fat against thigh circumference (cm). This data set is described in detail in the handout on R.

[Figure: scatterplot of body fat (%) against thigh circumference (cm)]

And another one: this is the concentration of lead against the concentration of zinc (see the handout on R for more details on this data set).

[Figure: scatterplot of lead concentration (ppm) against zinc concentration (ppm)]

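What do you observe? Is there an equation that can model the picture above?

Regression model equation:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where
- $y$: response variable (random)
- $x$: predictor variable (non-random)
- $\beta_0$: intercept (non-random)
- $\beta_1$: slope (non-random)
- $\epsilon$: random error term, $\epsilon \sim N(0, \sigma)$ (here the second argument of $N$ is the standard deviation)

Using the method of least squares we estimate $\beta_0$ and $\beta_1$:

$$\hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2} = \frac{\sum (x_i - \bar x)\, y_i}{\sum (x_i - \bar x)^2} = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}$$

$$\hat\beta_0 = \frac{\sum y_i}{n} - \hat\beta_1 \frac{\sum x_i}{n} \quad\Rightarrow\quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$

The fitted line is:

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$

Distribution of $\hat\beta_1$ and $\hat\beta_0$:

$$\hat\beta_1 \sim N\left(\beta_1,\ \frac{\sigma}{\sqrt{\sum (x_i - \bar x)^2}}\right), \qquad \hat\beta_0 \sim N\left(\beta_0,\ \sigma\sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2}}\right)$$

The standard deviation $\sigma$ is unknown and it is estimated with the residual standard error, which measures the variability around the fitted line. It is computed as follows:

$$s_e = \sqrt{\frac{\sum (y_i - \hat y_i)^2}{n - 2}} = \sqrt{\frac{\sum e_i^2}{n - 2}}$$

where $e_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$ is called the residual (the difference between the observed value $y_i$ and the fitted value $\hat y_i$).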
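The least squares formulas above can be checked numerically. The following is a minimal sketch in R, assuming the octane/mileage data from the worked example later in this handout; the variable names (`sxx`, `sxy`, `b0`, `b1`, `se`) are illustrative choices, not part of the handout.

```r
# Octane (x) and mileage (y) data from the worked example in this handout
x <- c(89, 93, 87, 90, 89, 95, 100, 98)
y <- c(13, 13.5, 13, 13.2, 13.3, 13.8, 14.3, 14)
n <- length(x)

# Slope and intercept from the computational formulas
sxx <- sum(x^2) - sum(x)^2 / n          # sum of (x_i - xbar)^2
sxy <- sum(x * y) - sum(x) * sum(y) / n # sum of (x_i - xbar)(y_i - ybar)
b1 <- sxy / sxx
b0 <- mean(y) - b1 * mean(x)

# Fitted values, residuals and residual standard error
yhat <- b0 + b1 * x
e <- y - yhat
se <- sqrt(sum(e^2) / (n - 2))

# Cross-check against R's built-in least squares fit
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1, se = se))
print(coef(fit))
```

The hand-computed `b0` and `b1` should agree with `coef(fit)`, and `se` with the residual standard error reported by `summary(fit)`.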

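Coefficient of determination:
The total variation in $y$ (total sum of squares $SST = \sum (y_i - \bar y)^2$) is equal to the regression sum of squares ($SSR = \sum (\hat y_i - \bar y)^2$) plus the error sum of squares ($SSE = \sum (y_i - \hat y_i)^2$):

$$SST = SSR + SSE$$

The percentage of the variation in $y$ that can be explained by $x$ is called the coefficient of determination ($R^2$):

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Always $0 \le R^2 \le 1$.

Useful:

$$SST = \sum (y_i - \bar y)^2 \quad\Rightarrow\quad SST = (n - 1)\, s_y^2$$

where $s_y^2$ is the variance of $y$.

Coefficient of correlation ($r$):

$$r = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum (x_i - \bar x)^2 \sum (y_i - \bar y)^2}}$$

Or easier for calculations:

$$r = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]\left[\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}\right]}}$$

Always $-1 \le r \le 1$ and $R^2 = r^2$.

Another formula for $r$:

$$r = \hat\beta_1 \frac{s_x}{s_y}$$

where $s_x$, $s_y$ are the standard deviations of $x$ and $y$.

Sample covariance between $y$ and $x$:

$$cov(x, y) = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{n - 1}$$

Therefore

$$r = \frac{cov(x, y)}{s_x s_y} \quad\Rightarrow\quad cov(x, y) = r\, s_x s_y \quad\text{and}\quad \hat\beta_1 = r \frac{s_y}{s_x}$$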
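The identities on this page can be verified numerically. The sketch below is in R and assumes the octane/mileage data from the worked example in this handout; the names `sst`, `ssr`, `sse` and `r2` are illustrative.

```r
# Octane (x) and mileage (y) data from the worked example in this handout
x <- c(89, 93, 87, 90, 89, 95, 100, 98)
y <- c(13, 13.5, 13, 13.2, 13.3, 13.8, 14.3, 14)
n <- length(x)

fit <- lm(y ~ x)
yhat <- fitted(fit)
b1 <- unname(coef(fit)[2])

# Sums of squares
sst <- sum((y - mean(y))^2)
ssr <- sum((yhat - mean(y))^2)
sse <- sum((y - yhat)^2)

# Coefficient of determination and coefficient of correlation
r2 <- ssr / sst
r <- cor(x, y)

# Each comparison below should print TRUE
print(all.equal(sst, ssr + sse))                 # SST = SSR + SSE
print(all.equal(sst, (n - 1) * var(y)))          # SST = (n - 1) s_y^2
print(all.equal(r2, r^2))                        # R^2 = r^2
print(all.equal(r, b1 * sd(x) / sd(y)))          # r = b1 s_x / s_y
print(all.equal(cov(x, y), r * sd(x) * sd(y)))   # cov(x, y) = r s_x s_y
```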

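Standard error of $\hat\beta_1$ and $\hat\beta_0$:

$$s_{\hat\beta_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar x)^2}} = \frac{s_e}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}}$$

and

$$s_{\hat\beta_0} = s_e \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2}} = s_e \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}}$$

Testing for a linear relationship between $y$ and $x$:

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \ne 0$$

Test statistic:

$$t = \frac{\hat\beta_1 - \beta_1}{s_{\hat\beta_1}}$$

which under $H_0$ reduces to $t = \hat\beta_1 / s_{\hat\beta_1}$.

Reject $H_0$ (i.e. conclude that there is a linear relationship) if $t > t_{\frac{\alpha}{2};\, n-2}$ or $t < -t_{\frac{\alpha}{2};\, n-2}$.

Confidence interval for $\beta_1$:

$$\hat\beta_1 - t_{\frac{\alpha}{2};\, n-2}\, s_{\hat\beta_1} \le \beta_1 \le \hat\beta_1 + t_{\frac{\alpha}{2};\, n-2}\, s_{\hat\beta_1}$$

Or $\beta_1$ falls in: $\hat\beta_1 \pm t_{\frac{\alpha}{2};\, n-2}\, s_{\hat\beta_1}$

Prediction interval for $y$ for a given $x$ (when $x_i = x_g$):

$$\hat y_g \pm t_{\frac{\alpha}{2};\, n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar x)^2}{\sum (x_i - \bar x)^2}}, \quad\text{where } \hat y_g = \hat\beta_0 + \hat\beta_1 x_g.$$

Confidence interval for the mean value of $y$ for a given $x$ (when $x_i = x_g$):

$$\hat y_g \pm t_{\frac{\alpha}{2};\, n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_g - \bar x)^2}{\sum (x_i - \bar x)^2}}, \quad\text{where } \hat y_g = \hat\beta_0 + \hat\beta_1 x_g.$$

Useful things to know:

$$\sum (x_i - \bar x)^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \quad\text{and}\quad \sum (y_i - \bar y)^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$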
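As an illustration of the formulas on this page, here is a minimal sketch in R that computes the standard errors, the t test and the three intervals by hand. It assumes the octane/mileage data from the worked example in this handout, a significance level of 0.05 and the point `xg = 94`; all variable names are illustrative.

```r
# Octane (x) and mileage (y) data from the worked example in this handout
x <- c(89, 93, 87, 90, 89, 95, 100, 98)
y <- c(13, 13.5, 13, 13.2, 13.3, 13.8, 14.3, 14)
n <- length(x)

fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])
se <- summary(fit)$sigma            # residual standard error s_e
sxx <- sum((x - mean(x))^2)

# Standard errors of the slope and the intercept
se_b1 <- se / sqrt(sxx)
se_b0 <- se * sqrt(1 / n + mean(x)^2 / sxx)

# Two-sided t test of H0: beta1 = 0 (alpha = 0.05 is an assumed level)
alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df = n - 2)
tstat <- b1 / se_b1
reject <- abs(tstat) > tcrit

# Confidence interval for beta1
ci_b1 <- b1 + c(-1, 1) * tcrit * se_b1

# Prediction interval for y and confidence interval for the mean of y at xg
xg <- 94
yg <- b0 + b1 * xg
pi_y <- yg + c(-1, 1) * tcrit * se * sqrt(1 + 1 / n + (xg - mean(x))^2 / sxx)
ci_mean <- yg + c(-1, 1) * tcrit * se * sqrt(1 / n + (xg - mean(x))^2 / sxx)

print(c(se_b0 = se_b0, se_b1 = se_b1, t = tstat))
print(reject)
print(ci_b1)
print(pi_y)
print(ci_mean)
```

The prediction interval is always wider than the confidence interval for the mean, because of the extra 1 under the square root.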

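Simple regression analysis - A simple example

The data below give the mileage per gallon ($Y$) obtained by a test automobile when using gasoline of varying octane ($x$):

    y       x      xy        y^2       x^2
    13.0    89     1157.0    169.00    7921
    13.5    93     1255.5    182.25    8649
    13.0    87     1131.0    169.00    7569
    13.2    90     1188.0    174.24    8100
    13.3    89     1183.7    176.89    7921
    13.8    95     1311.0    190.44    9025
    14.3    100    1430.0    204.49    10000
    14.0    98     1372.0    196.00    9604

$\sum y_i = 108.1$, $\sum x_i = 741$, $\sum x_i y_i = 10028.2$, $\sum y_i^2 = 1462.31$, $\sum x_i^2 = 68789$

a. Find the least squares estimates of $\beta_0$ and $\beta_1$.

$$\hat\beta_1 = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}} = \frac{10028.2 - \frac{1}{8}(741)(108.1)}{68789 - \frac{741^2}{8}} = \frac{15.4375}{153.875} = 0.100325$$

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = \frac{108.1}{8} - 0.100325\, \frac{741}{8} = 4.2199$$

Therefore the fitted line is: $\hat y_i = 4.2199 + 0.100325\, x_i$.

b. Compute the fitted values and residuals.
Using the fitted line $\hat y_i = 4.2199 + 0.100325\, x_i$ we can find the fitted values and residuals. For example, the first fitted value is $\hat y_1 = 4.2199 + 0.100325(89) = 13.1488$, and the first residual is $e_1 = y_1 - \hat y_1 = 13 - 13.1488 = -0.1488$, etc. The table below shows all the fitted values and residuals.

    yhat_i      e_i        e_i^2
    13.1488    -0.1488     0.0221
    13.5501    -0.0501     0.0025
    12.9482     0.0518     0.0027
    13.2491    -0.0491     0.0024
    13.1488     0.1512     0.0229
    13.7508     0.0492     0.0024
    14.2524     0.0476     0.0023
    14.0517    -0.0517     0.0027

$\sum e_i = 0$, $\sum e_i^2 = 0.0600$

c. Find the estimate of $\sigma^2$.

$$s_e^2 = \frac{\sum e_i^2}{n - 2} = \frac{0.0600}{6} = 0.0100$$

Therefore, $s_e = \sqrt{0.0100} = 0.1000$.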

d. Compute the standard error of $\hat\beta_1$.

$$s_{\hat\beta_1} = \frac{s_e}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}} = \frac{0.1000}{\sqrt{68789 - \frac{741^2}{8}}} = \frac{0.1000}{\sqrt{153.875}} = 0.00806$$

e. Construct a 95% confidence interval for $\beta_1$.
The parameter $\beta_1$ falls in: $\hat\beta_1 \pm t_{\frac{\alpha}{2};\, n-2}\, s_{\hat\beta_1}$, or $0.100325 \pm 2.447(0.00806)$.
Therefore we are 95% confident that $\beta_1$ falls in the interval: $0.0806 \le \beta_1 \le 0.1200$.

f. Estimate the miles per gallon for an octane gasoline level of 94.

$$\hat y = 4.2199 + 0.100325(94) \approx 13.65$$

g. Compute the coefficient of determination, $R^2$.

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum e_i^2}{(n - 1)\, s_y^2} = 1 - \frac{0.0600}{7(0.2298)} = 0.9627$$

Therefore, 96.27% of the variation in $Y$ can be explained by $x$.

The same example can be done with a few simple commands in R:

    #Enter the data:
    > x <- c(89,93,87,90,89,95,100,98)
    > y <- c(13,13.5,13,13.2,13.3,13.8,14.3,14)

    #Run the regression of y on x:
    > ex <- lm(y ~ x)

    #Display the results:
    > summary(ex)

    Call:
    lm(formula = y ~ x)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                       **
    x                                           e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error:        on 6 degrees of freedom
    Multiple R-squared:        , Adjusted R-squared:
    F-statistic:        on 1 and 6 DF,  p-value: 1.643e-05
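Parts e, f and g of the example can also be obtained from the fitted model object with standard R helper functions rather than by hand. This is a minimal sketch, assuming the same `x` and `y` vectors as above; the object names `ci_slope`, `y94` and `r2` are illustrative.

```r
# Same data as in the worked example
x <- c(89, 93, 87, 90, 89, 95, 100, 98)
y <- c(13, 13.5, 13, 13.2, 13.3, 13.8, 14.3, 14)
ex <- lm(y ~ x)

# Part e: 95% confidence interval for the slope
ci_slope <- confint(ex, "x", level = 0.95)

# Part f: estimated miles per gallon at octane level 94
y94 <- predict(ex, newdata = data.frame(x = 94))

# Part g: coefficient of determination
r2 <- summary(ex)$r.squared

print(ci_slope)
print(y94)
print(r2)
```

Note that `predict()` needs the new value supplied in a data frame whose column name matches the predictor used in the model formula (`x` here).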

Simple regression in R - examples

Example 1:
We will use the following data:

    data1 <- read.table("...", header=TRUE)

This file contains data on percentage of body fat determined by underwater weighing and various body circumference measurements for 251 men. Here is the variable description:

    Variable   Description
    x1         Density determined from underwater weighing
    y          Percent body fat from Siri's (1956) equation
    x3         Age (years)
    x4         Weight (lbs)
    x5         Height (inches)
    x6         Neck circumference (cm)
    x7         Chest circumference (cm)
    x8         Abdomen 2 circumference (cm)
    x9         Hip circumference (cm)
    x10        Thigh circumference (cm)
    x11        Knee circumference (cm)
    x12        Ankle circumference (cm)
    x13        Biceps (extended) circumference (cm)
    x14        Forearm circumference (cm)
    x15        Wrist circumference (cm)

We want to run the regression of y (percentage body fat) on x10 (thigh circumference). Here is the regression output:

    ex1 <- lm(data1$y ~ data1$x10)
    summary(ex1)

    Call:
    lm(formula = data1$y ~ data1$x10)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                 e-11 ***
    data1$x10                                 <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error:        on 249 degrees of freedom
    Multiple R-squared:        , Adjusted R-squared:
    F-statistic:        on 1 and 249 DF,  p-value: < 2.2e-16

[Figure: scatterplot of body fat (%) against thigh circumference (cm) with the fitted line]

Example 2:
Here are the data:

    data2 <- read.table("...", header=TRUE)

This data set consists of 4 variables. The first two columns are the x and y coordinates, and the last two columns are the concentrations of lead and zinc in ppm at 155 locations. We will run the regression of lead against zinc. Our goal is to build a regression model to predict the lead concentration from the zinc concentration. Here is the regression output.

    ex2 <- lm(data2$lead ~ data2$zinc)
    summary(ex2)

    Call:
    lm(formula = data2$lead ~ data2$zinc)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                 e-05 ***
    data2$zinc                                <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error:        on 153 degrees of freedom
    Multiple R-squared:        , Adjusted R-squared:
    F-statistic:  1575 on 1 and 153 DF,  p-value: < 2.2e-16

Exercise:
a. Construct the histograms of lead and zinc and comment.
b. Transform the data to get a bell-shaped histogram.
c. Plot the transformed data of lead on the transformed data of zinc and compare this scatterplot with the scatterplot of the original data.
d. Run the regression of the transformed data of lead on the transformed data of zinc and compare the $R^2$ of this regression to the $R^2$ using the original data.
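One possible workflow for the exercise is sketched below in R. Because the data file is not reproduced here, the sketch uses simulated right-skewed values in place of the real lead and zinc concentrations, and it assumes the natural logarithm as the transformation; both are assumptions for illustration, and other transformations may suit the real data better.

```r
# Simulated stand-in for the lead/zinc data (hypothetical values, not the real data set)
set.seed(1)
n <- 155
zinc <- rlnorm(n, meanlog = 5.5, sdlog = 0.7)
lead <- exp(0.5 + 0.8 * log(zinc) + rnorm(n, sd = 0.3))

# Avoid writing a plot file when the script is run non-interactively
if (!interactive()) pdf(NULL)

# a, b: histograms before and after the (assumed) log transformation
par(mfrow = c(2, 2))
hist(lead)
hist(zinc)
hist(log(lead))
hist(log(zinc))

# c: scatterplots of the original and the transformed data
par(mfrow = c(1, 2))
plot(zinc, lead)
plot(log(zinc), log(lead))

# d: compare R^2 for the two regressions
r2_raw <- summary(lm(lead ~ zinc))$r.squared
r2_log <- summary(lm(log(lead) ~ log(zinc)))$r.squared
print(c(raw = r2_raw, log = r2_log))
```

With the real data, `lead` and `zinc` would be replaced by `data2$lead` and `data2$zinc`.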
