University of California, Los Angeles
Department of Statistics

Statistics 100C
Instructor: Nicolas Christou

Simple regression analysis

Introduction:
Regression analysis is a statistical method aimed at discovering how one variable is related to another variable. It is useful in predicting one variable from another variable. Consider the following scatterplot of the percentage of body fat against thigh circumference (cm). This data set is described in detail in the handout on R.

[Figure: scatterplot of body fat (%) against thigh circumference (cm)]

And another one: this is the concentration of lead against the concentration of zinc (see the handout on R for more details on this data set).

[Figure: scatterplot of lead concentration (ppm) against zinc concentration (ppm)]
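What do you observe? Is there an equation that can model the picture above?

Regression model equation:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where
- $y$: response variable (random)
- $x$: predictor variable (non-random)
- $\beta_0$: intercept (non-random)
- $\beta_1$: slope (non-random)
- $\epsilon$: random error term, $\epsilon \sim N(0, \sigma)$

Using the method of least squares we estimate $\beta_0$ and $\beta_1$:

$$\hat\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x})\, y_i}{\sum (x_i - \bar{x})^2} = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}$$

$$\hat\beta_0 = \frac{\sum y_i}{n} - \hat\beta_1 \frac{\sum x_i}{n} \quad\Rightarrow\quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$

The fitted line is:

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$$

Distribution of $\hat\beta_1$ and $\hat\beta_0$:

$$\hat\beta_1 \sim N\left(\beta_1,\ \frac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}}\right), \qquad \hat\beta_0 \sim N\left(\beta_0,\ \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}\right)$$

The standard deviation $\sigma$ is unknown and it is estimated with the residual standard error, which measures the variability around the fitted line. It is computed as follows:

$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}}$$

where $e_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$ is called the residual (the difference between the observed value $y_i$ and the fitted value $\hat{y}_i$).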
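The least squares formulas above can be checked numerically. The following is a minimal sketch in R, assuming a small made-up data set (the five `x`, `y` values are hypothetical and are not part of this handout). It computes the estimates, the residuals, and the residual standard error from the formulas and compares them with R's built-in `lm()`.

```r
# Hypothetical illustrative data (NOT from the handout)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Least squares estimates from the computational formulas
b1 <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
b0 <- mean(y) - b1 * mean(x)

# Fitted values, residuals, and residual standard error
y_hat <- b0 + b1 * x
e <- y - y_hat
s_e <- sqrt(sum(e^2) / (n - 2))

# Compare with R's built-in least squares fit
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1, s_e = s_e))
print(coef(fit))
print(summary(fit)$sigma)
```

The hand-computed values and the `lm()` values should agree up to floating point error, and the residuals should sum to (numerically) zero.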
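Coefficient of determination:
The total variation in $y$ (total sum of squares, $SST = \sum (y_i - \bar{y})^2$) is equal to the regression sum of squares ($SSR = \sum (\hat{y}_i - \bar{y})^2$) plus the error sum of squares ($SSE = \sum (y_i - \hat{y}_i)^2$):

$$SST = SSR + SSE$$

The percentage of the variation in $y$ that can be explained by $x$ is called the coefficient of determination ($R^2$):

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Always $0 \le R^2 \le 1$.

Useful:

$$SST = \sum (y_i - \bar{y})^2 \quad\Rightarrow\quad SST = (n-1)\, s_y^2$$

where $s_y^2$ is the variance of $y$.

Coefficient of correlation ($r$):

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

Or, easier for calculations:

$$r = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]\left[\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}\right]}}$$

Always $-1 \le r \le 1$ and $R^2 = r^2$.

Another formula for $r$:

$$r = \hat\beta_1 \frac{s_x}{s_y}$$

where $s_x$, $s_y$ are the standard deviations of $x$ and $y$.

Sample covariance between $y$ and $x$:

$$\mathrm{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

Therefore

$$r = \frac{\mathrm{cov}(x, y)}{s_x s_y} \quad\Rightarrow\quad \mathrm{cov}(x, y) = r\, s_x s_y \quad\text{and}\quad \hat\beta_1 = r\, \frac{s_y}{s_x}$$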
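The relations $\hat\beta_1 = r\, s_y / s_x$ and $R^2 = r^2$ follow from the definitions in this section. A short derivation is sketched below. It uses only the least squares slope formula, the fitted line, and the sample variance $s_x^2 = \sum (x_i - \bar{x})^2 / (n-1)$ (and likewise for $s_y^2$).

```latex
\begin{align*}
\hat\beta_1
  &= \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}
   = \frac{(n-1)\,\mathrm{cov}(x,y)}{(n-1)\,s_x^2}
   = \frac{r\,s_x s_y}{s_x^2}
   = r\,\frac{s_y}{s_x}, \\[6pt]
\hat{y}_i - \bar{y}
  &= \hat\beta_0 + \hat\beta_1 x_i - \bar{y}
   = \hat\beta_1 (x_i - \bar{x})
   \qquad \text{since } \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \\[6pt]
R^2
  &= \frac{SSR}{SST}
   = \frac{\hat\beta_1^2 \sum (x_i-\bar{x})^2}{\sum (y_i-\bar{y})^2}
   = \hat\beta_1^2\,\frac{s_x^2}{s_y^2}
   = \left(r\,\frac{s_y}{s_x}\right)^2 \frac{s_x^2}{s_y^2}
   = r^2 .
\end{align*}
```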
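Standard error of $\hat\beta_1$ and $\hat\beta_0$:

$$s_{\hat\beta_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{s_e}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}}$$

and

$$s_{\hat\beta_0} = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}} = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}}$$

Testing for linear relationship between $y$ and $x$:

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \ne 0$$

Test statistic (with $\beta_1 = 0$ under $H_0$):

$$t = \frac{\hat\beta_1 - \beta_1}{s_{\hat\beta_1}}$$

Reject $H_0$ (i.e. conclude that there is a linear relationship) if $t > t_{\alpha/2;\, n-2}$ or $t < -t_{\alpha/2;\, n-2}$.

Confidence interval for $\beta_1$:

$$\hat\beta_1 - t_{\alpha/2;\, n-2}\, s_{\hat\beta_1} \le \beta_1 \le \hat\beta_1 + t_{\alpha/2;\, n-2}\, s_{\hat\beta_1}$$

Or $\beta_1$ falls in:

$$\hat\beta_1 \pm t_{\alpha/2;\, n-2}\, s_{\hat\beta_1}$$

Prediction interval for $y$ for a given $x$ (when $x_i = x_g$):

$$\hat{y}_g \pm t_{\alpha/2;\, n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}, \quad\text{where } \hat{y}_g = \hat\beta_0 + \hat\beta_1 x_g.$$

Confidence interval for the mean value of $y$ for a given $x$ (when $x_i = x_g$):

$$\hat{y}_g \pm t_{\alpha/2;\, n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}, \quad\text{where } \hat{y}_g = \hat\beta_0 + \hat\beta_1 x_g.$$

Useful things to know:

$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \quad\text{and}\quad \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$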
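The standard errors, the t test, and the three intervals on this page can be checked against R's built-in functions. The sketch below assumes a small made-up data set and a made-up new value `x_g` (both hypothetical, not part of this handout), and uses $\alpha = 0.05$. It computes each quantity from the formulas and then compares with `summary()`, `confint()`, and `predict()`.

```r
# Hypothetical illustrative data (NOT from the handout)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)
alpha <- 0.05

fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])
b1 <- unname(coef(fit)[2])
s_e <- summary(fit)$sigma          # residual standard error
sxx <- sum((x - mean(x))^2)        # sum of squared deviations of x

# Standard errors of the slope and the intercept
se_b1 <- s_e / sqrt(sxx)
se_b0 <- s_e * sqrt(1 / n + mean(x)^2 / sxx)

# t statistic for H0: beta1 = 0, and the critical value t_{alpha/2; n-2}
t_stat <- (b1 - 0) / se_b1
t_crit <- qt(1 - alpha / 2, df = n - 2)
reject_h0 <- abs(t_stat) > t_crit

# Confidence interval for beta1
ci_b1 <- b1 + c(-1, 1) * t_crit * se_b1

# Prediction interval and confidence interval for the mean at x_g
x_g <- 3.5                         # hypothetical new x value
y_g <- b0 + b1 * x_g
pi_y <- y_g + c(-1, 1) * t_crit * s_e * sqrt(1 + 1 / n + (x_g - mean(x))^2 / sxx)
ci_mean <- y_g + c(-1, 1) * t_crit * s_e * sqrt(1 / n + (x_g - mean(x))^2 / sxx)

# Cross-check against R's built-in functions
new_x <- data.frame(x = x_g)
print(rbind(by_hand = ci_b1, confint = confint(fit)["x", ]))
print(rbind(by_hand = pi_y,
            predict = predict(fit, new_x, interval = "prediction")[1, 2:3]))
print(rbind(by_hand = ci_mean,
            predict = predict(fit, new_x, interval = "confidence")[1, 2:3]))
```

The prediction interval is always wider than the confidence interval for the mean at the same `x_g`, because of the extra 1 under the square root.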
Simple regression analysis - A simple example

The data below give the mileage per gallon ($Y$) obtained by a test automobile when using gasoline of varying octane ($x$):

    y       x      xy        y^2       x^2
    13.0    89     1157.0    169.00    7921
    13.5    93     1255.5    182.25    8649
    13.0    87     1131.0    169.00    7569
    13.2    90     1188.0    174.24    8100
    13.3    89     1183.7    176.89    7921
    13.8    95     1311.0    190.44    9025
    14.3    100    1430.0    204.49    10000
    14.0    98     1372.0    196.00    9604

With $n = 8$ the column totals are:

$$\sum y_i = 108.1, \quad \sum x_i = 741, \quad \sum x_i y_i = 10028.2, \quad \sum y_i^2 = 1462.31, \quad \sum x_i^2 = 68789$$

a. Find the least squares estimates of $\beta_0$ and $\beta_1$.

$$\hat\beta_1 = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}} = \frac{10028.2 - \frac{1}{8}(741)(108.1)}{68789 - \frac{741^2}{8}} = 0.100325.$$

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \frac{108.1}{8} - 0.100325 \cdot \frac{741}{8} = 4.2199.$$

Therefore the fitted line is: $\hat{y}_i = 4.2199 + 0.100325\, x_i$.

b. Compute the fitted values and residuals.
Using the fitted line $\hat{y}_i = 4.2199 + 0.100325\, x_i$ we can find the fitted values and residuals. For example, the first fitted value is $\hat{y}_1 = 4.2199 + 0.100325(89) = 13.1488$, and the first residual is $e_1 = y_1 - \hat{y}_1 = 13.0 - 13.1488 = -0.1488$, etc. The table below shows all the fitted values and residuals.

    y-hat       e           e^2
    13.14883    -0.14882    0.02215
    13.55013    -0.05013    0.00251
    12.94818     0.05183    0.00269
    13.24915    -0.04915    0.00242
    13.14883     0.15118    0.02285
    13.75078     0.04922    0.00242
    14.25240     0.04760    0.00227
    14.05175    -0.05175    0.00268

$$\sum e_i = 0, \qquad \sum e_i^2 = 0.05998$$

c. Find the estimate of $\sigma^2$.

$$s_e^2 = \frac{\sum e_i^2}{n-2} = \frac{0.05998}{8-2} = 0.009997.$$

Therefore, $s_e = \sqrt{0.009997} = 0.09999$.
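The arithmetic in parts a to c can be reproduced directly from the data table. The following is a minimal sketch in R that recomputes the column totals, the estimates, the residual table, and $s_e$ using the same formulas. The variable names (`sums`, `b0`, `b1`, `s2_e`) are chosen here for illustration only.

```r
# Data from the table above: mileage (y) and octane (x)
x <- c(89, 93, 87, 90, 89, 95, 100, 98)
y <- c(13.0, 13.5, 13.0, 13.2, 13.3, 13.8, 14.3, 14.0)
n <- length(x)

# Column totals used in part a
sums <- c(y = sum(y), x = sum(x), xy = sum(x * y), y2 = sum(y^2), x2 = sum(x^2))
print(sums)

# Part a: least squares estimates from the computational formula
b1 <- (sums[["xy"]] - sums[["x"]] * sums[["y"]] / n) /
      (sums[["x2"]] - sums[["x"]]^2 / n)
b0 <- mean(y) - b1 * mean(x)

# Part b: fitted values and residuals
y_hat <- b0 + b1 * x
e <- y - y_hat
print(round(cbind(y_hat, e, e2 = e^2), 5))

# Part c: estimate of sigma^2 and the residual standard error
s2_e <- sum(e^2) / (n - 2)
s_e <- sqrt(s2_e)
print(round(c(b0 = b0, b1 = b1, s2_e = s2_e, s_e = s_e), 6))
```

Small differences in the last printed digit relative to the hand calculation are expected, since the hand calculation rounds intermediate results.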
d. Compute the standard error of $\hat\beta_1$.

$$s_{\hat\beta_1} = \frac{s_e}{\sqrt{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}} = \frac{0.09999}{\sqrt{68789 - \frac{741^2}{8}}} = 0.00806.$$

e. Construct a 95% confidence interval for $\beta_1$.
The parameter $\beta_1$ falls in:

$$\hat\beta_1 \pm t_{\alpha/2;\, n-2}\, s_{\hat\beta_1} \quad\text{or}\quad 0.100325 \pm 2.447(0.00806)$$

Therefore we are 95% confident that $\beta_1$ falls in the interval:

$$0.0806 \le \beta_1 \le 0.12.$$

f. Estimate the miles per gallon for an octane gasoline level of 94.

$$\hat{y} = 4.2199 + 0.100325(94) = 13.65.$$

g. Compute the coefficient of determination, $R^2$.

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum e_i^2}{(n-1)\, s_y^2} = 1 - \frac{0.05998}{7(0.2298)} = 0.9627.$$

Therefore, 96.27% of the variation in $Y$ can be explained by $x$.

The same example can be done with a few simple commands in R:

    #Enter the data:
    > x <- c(89,93,87,90,89,95,100,98)
    > y <- c(13,13.5,13,13.2,13.3,13.8,14.3,14)

    #Run the regression of y on x:
    > ex <- lm(y ~ x)

    #Display the results:
    > summary(ex)

    Call:
    lm(formula = y ~ x)

    Residuals:
           Min         1Q     Median         3Q        Max
    -0.1488221 -0.0505280 -0.0007717  0.0498781  0.1511779

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.21990    0.74743   5.646  0.00132 **
    x            0.10032    0.00806  12.447 1.64e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.09999 on 6 degrees of freedom
    Multiple R-squared:  0.9627,    Adjusted R-squared:  0.9565
    F-statistic: 154.9 on 1 and 6 DF,  p-value: 1.643e-05
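The answers to parts e, f, and g can also be read off the fitted object instead of being computed by hand. The sketch below is one possible way to do it, assuming the object name `ex` from the commands above. The helper names `ci_slope`, `y_94`, and `r2` are introduced here for illustration.

```r
# Same data and fit as in the commands above
x <- c(89, 93, 87, 90, 89, 95, 100, 98)
y <- c(13, 13.5, 13, 13.2, 13.3, 13.8, 14.3, 14)
ex <- lm(y ~ x)

# Part e: 95% confidence interval for the slope beta1
ci_slope <- confint(ex, "x", level = 0.95)
print(ci_slope)

# Part f: estimated miles per gallon at octane level 94
y_94 <- predict(ex, newdata = data.frame(x = 94))
print(y_94)

# Part g: coefficient of determination, and the identity R^2 = r^2
r2 <- summary(ex)$r.squared
print(c(R2 = r2, r_squared = cor(x, y)^2))
```

Note that `predict()` looks up the predictor by name, so the new data frame must have a column called `x`, matching the formula `y ~ x`.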
Simple regression in R - examples

Example 1:
We will use the following data:

    data1 <- read.table("http://www.stat.ucla.edu/~christo/statistics100c/body_fat.txt", header=TRUE)

This file contains data on percentage of body fat determined by underwater weighing and various body circumference measurements for 251 men. Here is the variable description:

    Variable   Description
    x1         Density determined from underwater weighing
    y          Percent body fat from Siri's (1956) equation
    x3         Age (years)
    x4         Weight (lbs)
    x5         Height (inches)
    x6         Neck circumference (cm)
    x7         Chest circumference (cm)
    x8         Abdomen 2 circumference (cm)
    x9         Hip circumference (cm)
    x10        Thigh circumference (cm)
    x11        Knee circumference (cm)
    x12        Ankle circumference (cm)
    x13        Biceps (extended) circumference (cm)
    x14        Forearm circumference (cm)
    x15        Wrist circumference (cm)

We want to run the regression of $Y$ (percentage body fat) on $x_{10}$ (thigh circumference). Here is the regression output:

    ex1 <- lm(data1$y ~ data1$x10)
    summary(ex1)

    Call:
    lm(formula = data1$y ~ data1$x10)

    Residuals:
         Min       1Q   Median       3Q      Max
    -18.1601  -4.7707  -0.1076   4.5219  25.5994

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -34.26252    4.99529  -6.859 5.46e-11 ***
    data1$x10     0.89861    0.08373  10.732  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.947 on 249 degrees of freedom
    Multiple R-squared:  0.3163,    Adjusted R-squared:  0.3135
    F-statistic: 115.2 on 1 and 249 DF,  p-value: < 2.2e-16

[Figure: scatterplot of body fat (%) against thigh circumference (cm) with the fitted line $\hat{y} = -34.26 + 0.8986x$]
Example 2:
Here are the data:

    data2 <- read.table("http://www.stat.ucla.edu/~christo/statistics100c/soil.txt", header=TRUE)

This data set consists of 4 variables. The first two columns are the x and y coordinates, and the last two columns are the concentrations of lead and zinc in ppm at 155 locations. We will run the regression of lead against zinc. Our goal is to build a regression model to predict the lead concentration from the zinc concentration. Here is the regression output.

    ex2 <- lm(data2$lead ~ data2$zinc)
    summary(ex2)

    Call:
    lm(formula = data2$lead ~ data2$zinc)

    Residuals:
        Min      1Q  Median      3Q     Max
    -79.853 -12.945  -1.646  15.339 104.200

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 17.367688   4.344268   3.998 9.92e-05 ***
    data2$zinc   0.289523   0.007296  39.681  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 33.24 on 153 degrees of freedom
    Multiple R-squared:  0.9114,    Adjusted R-squared:  0.9109
    F-statistic: 1575 on 1 and 153 DF,  p-value: < 2.2e-16

Exercise:
a. Construct the histograms of lead and zinc and comment.
b. Transform the data to get a bell-shaped histogram.
c. Plot the transformed data of lead against the transformed data of zinc and compare this scatterplot with the scatterplot of the original data.
d. Run the regression of the transformed data of lead on the transformed data of zinc and compare the $R^2$ of this regression to the $R^2$ using the original data.