Lecture 16: Again on Regression


1 Lecture 16: Again on Regression
S. Massa, Department of Statistics, University of Oxford, 10 February 2016

2 The Normality Assumption
Body weights (kg) and brain weights (g) of 62 mammals.
[Table with columns Species, Body weight (kg), Body rank, Brain weight (g), Brain rank; the numeric entries are missing here.]
Species shown: African elephant, African giant pouched rat, Arctic fox, Arctic ground squirrel, Asian elephant, Baboon, Big brown bat, Brazilian tapir, Cat, Chimpanzee, Chinchilla, Cow, Desert hedgehog, Donkey, Eastern American mole, Echidna, European hedgehog, Galago, Genet, Giant armadillo, Giraffe, Goat, Golden hamster, Gorilla, Gray seal, Gray wolf, Ground squirrel, Guinea pig.

3 The Normality Assumption
Suppose we want to test for correlation between body weight X and brain weight Y. Look at the scatterplot.
[Figure: scatterplot of brain against body]
Compute $r_{xy}$ = [value missing]. But we cannot see any obvious linear relationship. The regression line is completely dominated by the extreme values. Are our data really normal?

4 The Normality Assumption
[Figure: histogram of body and histogram of brain; vertical axis: Frequency]
Here you can see histograms of the body and brain weights. The data don't really look normal. What if we apply some transformation? Take logarithms, for example.

5 The Normality Assumption
[Figure: histogram of log(body) and histogram of log(brain); vertical axis: Frequency]
Here you can see histograms of the logarithms of the body and brain weights. The situation is much better now. What about the scatterplot?

6 The Normality Assumption
[Figure: scatterplot of log(body) and log(brain)]
After taking logarithms, there is an obvious linear relationship. We then compute $r_{\log x, \log y} = 0.96$.

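7 The Normality Assumption
[Figure: scatterplot of log(body) and log(brain)]
We can actually compute the regression line to be $\log_{10}(y) = a + b\,\log_{10}(x)$ (the fitted coefficients are missing here), or after exponentiating, $y \approx 8.5\,x^{3/4}$, which corresponds to $10^{a} \approx 8.5$ and $b \approx 3/4$.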
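The step from a straight line on the log scale to a power law can be sketched in a few lines. This is only an illustration: the helper names `fit_line` and `fit_power_law` are hypothetical, and the data are made up to lie exactly on $y = 8.5\,x^{3/4}$; they are not the mammals dataset.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

def fit_power_law(xs, ys):
    """Fit y ~ c * x**b by regressing log10(y) on log10(x)."""
    a, b = fit_line([math.log10(x) for x in xs],
                    [math.log10(y) for y in ys])
    return 10 ** a, b  # exponentiating the intercept gives the multiplier c

# Made-up values lying exactly on y = 8.5 * x**0.75 (NOT the real data).
body = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
brain = [8.5 * x ** 0.75 for x in body]

c, b = fit_power_law(body, brain)
print(round(c, 3), round(b, 3))
```

On real data the points scatter around the line, so the recovered multiplier and exponent are estimates rather than exact values.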

8 Spearman Rank Correlation Coefficient
A non-parametric alternative to the correlation coefficient is known as the Spearman rank correlation coefficient. Replace the actual observations by their ranks.
Drawback: this correlation coefficient will not give us a regression line!
It can be used to test $H_0: \rho = 0$.

9 Spearman Rank Correlation Coefficient
To compute the Spearman rank correlation coefficient:
1. Order and rank the x's and the y's. With each pair $(x_i, y_i)$ we will have two ranks $(R_{x_i}, R_{y_i})$.
2. Compute the absolute difference of the two ranks, $d_i = |R_{x_i} - R_{y_i}|$.
3. Compute the sum of the squared differences, $D = \sum_i d_i^2 = \sum_i (R_{x_i} - R_{y_i})^2$.
Spearman's rank correlation coefficient is
$r_s = 1 - \frac{6D}{n(n^2 - 1)}.$
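The three steps above can be sketched as follows. The function names `ranks` and `spearman` are hypothetical, the five data pairs are made up, and the sketch assumes there are no tied values (ties would need averaged ranks, which the formula above does not cover).

```python
def ranks(values):
    """Rank observations from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's r_s = 1 - 6D / (n(n^2 - 1)), with D the sum of squared rank differences."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    D = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * D / (n * (n * n - 1))

# Made-up illustration data.
xs = [1.2, 3.4, 0.5, 8.9, 5.0]
ys = [10.0, 30.0, 20.0, 50.0, 40.0]
print(spearman(xs, ys))
```

Here the x ranks are (2, 3, 1, 5, 4) and the y ranks are (1, 3, 2, 5, 4), so D = 2 and $r_s = 1 - 12/120 = 0.9$.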

10 Hypothesis Testing
Having computed $r_s$ we can now conduct the hypothesis test of
$H_0$: there is no association between the ranks, against
$H_1$: there is association between the ranks.
This is the two-sided test. One can also conduct a one-sided test if there is reason to believe that any association present should be positive (or negative).
Under the null hypothesis $H_0$,
$T := r_s \sqrt{\frac{n-2}{1-r_s^2}} \sim t_{n-2}.$
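As a small sketch of how the test statistic is evaluated, the function below computes T from $r_s$ and n. The name `spearman_t` and the example inputs ($r_s = 0.5$, $n = 29$) are made up for illustration.

```python
import math

def spearman_t(r_s, n):
    """Test statistic T = r_s * sqrt((n - 2) / (1 - r_s^2)), compared with t on n - 2 d.f."""
    return r_s * math.sqrt((n - 2) / (1 - r_s ** 2))

# Made-up example: r_s = 0.5 with n = 29 observations (27 degrees of freedom).
t_value = spearman_t(0.5, 29)
print(t_value)
```

With these inputs the square root is $\sqrt{27/0.75} = 6$, so T = 3. The statistic grows both with the strength of the rank correlation and with the sample size.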

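11 Example
We compute the rank pairs for our dataset.
[Table of rank pairs with columns $R_x$, $R_y$, $d$; the entries are missing here.]
Then the sum of the squared differences is $D = \sum_i d_i^2$ = [value missing], and Spearman's rank correlation coefficient is $r_s = 1 - \frac{6D}{n(n^2 - 1)}$ = [value missing].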

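12 Example
The observed value of the statistic is
$t_{obs} := r_s \sqrt{\frac{62 - 2}{1 - r_s^2}} = 24.64,$
and as usual we compare it with a t-distribution with $n - 2 = 60$ degrees of freedom.
[Table of critical values with columns d.f., P = 0.10, P = 0.05, P = 0.02 and one further column whose label is cut off; the entries are missing here.]
The critical value (two-sided test at the 5% level) is 2.00, therefore we reject the null hypothesis (24.64 > 2.00).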
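The value of $r_s$ itself is not legible above, but the statistic can be inverted: from $T^2 (1 - r^2) = r^2 (n - 2)$ we get $r = T / \sqrt{T^2 + n - 2}$. The sketch below applies this to the reported $t_{obs} = 24.64$ with $n = 62$; the function name `r_from_t` is hypothetical, and the result is a back-calculation, not a value taken from the slides.

```python
import math

def r_from_t(t, n):
    """Invert T = r * sqrt((n - 2) / (1 - r^2)) for r (same sign as t)."""
    return t / math.sqrt(t * t + n - 2)

r_s = r_from_t(24.64, 62)

# Round trip: plugging r_s back into the statistic should return 24.64.
t_check = r_s * math.sqrt((62 - 2) / (1 - r_s ** 2))
print(round(r_s, 3), round(t_check, 2))
```

The back-calculated coefficient is roughly 0.95, which is in line with the strong monotone relationship seen after taking logarithms.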

13 Residuals $R^2$ I
Let $(x_i, y_i)$ be observations of normally distributed variables $(X, Y)$. Suppose that $(X, Y)$ are related through
$Y = \alpha + \beta X + \varepsilon,$
where $\varepsilon$ is independent of X and has zero mean. Since $\varepsilon$ is independent of X,
$\mathrm{var}(Y) = \beta^2\,\mathrm{var}(X) + \mathrm{var}(\varepsilon).$
This means that our uncertainty about Y consists of two parts: our uncertainty about X, and our uncertainty about $\varepsilon$. The first part should disappear once we know x.

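14 Residuals $R^2$ II
The residual uncertainty is due to $\varepsilon$ and thus independent of X. In fact we can quantify the proportion of uncertainty in Y explained by X. Since the slope satisfies $\beta^2 = r^2\,\mathrm{var}(Y)/\mathrm{var}(X)$,
$\mathrm{var}(Y) = \beta^2\,\mathrm{var}(X) + \mathrm{var}(\varepsilon) = r^2\,\frac{\mathrm{var}(Y)}{\mathrm{var}(X)}\,\mathrm{var}(X) + \mathrm{var}(\varepsilon) = r^2\,\mathrm{var}(Y) + \mathrm{var}(\varepsilon).$
Therefore
$\mathrm{var}(\varepsilon) = (1 - r^2)\,\mathrm{var}(Y).$
$r^2$ is precisely the proportion of the variability in Y that is explained by the variability of X.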
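The decomposition can be checked numerically on simulated data. In the sketch below all parameter values ($\alpha = 1$, $\beta = 2$, the two standard deviations, the seed) are arbitrary assumptions, and the helpers `mean`, `var` and `cov` are written out by hand to keep the example self-contained.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Simulate Y = alpha + beta*X + eps with eps independent of X (assumed values).
random.seed(0)
alpha, beta, n = 1.0, 2.0, 20000
xs = [random.gauss(0.0, 1.5) for _ in range(n)]
eps = [random.gauss(0.0, 2.0) for _ in range(n)]  # true var(eps) = 4
ys = [alpha + beta * x + e for x, e in zip(xs, eps)]

# Sample correlation and least-squares line.
r = cov(xs, ys) / math.sqrt(var(xs) * var(ys))
b = cov(xs, ys) / var(xs)
a = mean(ys) - b * mean(xs)

# Variance of the residuals around the fitted line vs (1 - r^2) var(Y).
residual_var = var([y - (a + b * x) for x, y in zip(xs, ys)])
unexplained = (1 - r ** 2) * var(ys)
print(residual_var, unexplained)
```

The two printed numbers agree up to rounding error, because for the least-squares line the identity holds exactly in the sample; both are close to the true var(ε) = 4 only up to sampling error.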

15 Example: the Galton Dataset again
[Figure: scatterplot of child height (in.) against parent average height (in.)]

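16 Example: the Galton Dataset again
Recall the summary statistics:
[Table with columns Parent, Child, Sum, Diff and rows SD, Variance; the entries are missing here.]
Then we have
$s_{xy} = \frac{1}{4}\left(s^2_{x+y} - s^2_{x-y}\right) = \frac{1}{4}(\ldots) = 2.07,$
and
$r_{xy} = \frac{s_{xy}}{s_x s_y}$ = [value missing].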
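The identity used here follows from $s^2_{x \pm y} = s_x^2 + s_y^2 \pm 2 s_{xy}$, so the difference of the two variances is $4 s_{xy}$. A minimal numerical check, using five made-up height pairs (not the Galton data) and hypothetical helper names:

```python
def variance(v):
    """Sample variance with divisor n - 1."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def covariance(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Made-up heights in inches, for illustration only.
parent = [66.0, 68.5, 70.0, 67.5, 69.0]
child = [65.5, 68.0, 69.5, 68.5, 67.0]

sums = [p + c for p, c in zip(parent, child)]
diffs = [p - c for p, c in zip(parent, child)]

via_sum_diff = (variance(sums) - variance(diffs)) / 4
direct = covariance(parent, child)
print(via_sum_diff, direct)
```

This is why a table listing only the variances of the sum and the difference is enough to recover the covariance.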

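17 Example: the Galton dataset revisited
Therefore we quickly compute
$r_{xy}$ = [value missing],
$b = \frac{s_y}{s_x}\,r_{xy} = 0.649,$
$a = \bar{y} - b\bar{x} = 23.8,$
so the regression line is $y = 23.8 + 0.649x$.
If the parents' average height is 72 inches we can predict the child's height as $y_{pred} = 23.8 + 0.649 \times 72 = 70.5$.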

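18 Example: Prediction
What is the probability that the child's height Y is over 70 inches? We know that $Y \sim N(68.1, \sigma_Y^2)$, where $\sigma_Y$ is the SD of the child heights (value missing here). Then standardising and using the standard normal table we find
$P(Y > 70) = P\left(\frac{Y - 68.1}{\sigma_Y} > \frac{70 - 68.1}{\sigma_Y}\right) = P(Z > 0.75)$ = [value missing].
[Summary table with columns Parents, Child and rows mean, SD; the entries are missing here. The reported correlation is 0.459.]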

19 Example: Prediction
If we know the parents' height is average, i.e., 68.3, what is the probability that the child's height Y is over 70 inches? Now we are looking within the second column. There is significantly less variability there than across the whole sample. Knowing X helps us predict Y.

20 Example: Prediction
Recall that $\mathrm{var}(Y) = \beta^2\,\mathrm{var}(X) + \mathrm{var}(\varepsilon)$. Thus if we know X, the residual variance is just
$\mathrm{var}(\varepsilon) = (1 - r^2)\,\mathrm{var}(Y)$ = [values missing].

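21 Example: Prediction
The height of a child whose parents' average height is 68.3 will be approximately $N(68.1, \sigma_\varepsilon^2)$, where $\sigma_\varepsilon^2 = \mathrm{var}(\varepsilon)$ is the residual variance (value missing here). Then
$P(Y > 70 \mid X = 68.3) = P\left(\frac{Y - 68.1}{\sigma_\varepsilon} > \frac{70 - 68.1}{\sigma_\varepsilon}\right) = P(Z > 0.85)$ = [value missing],
which is smaller than the unconditional probability $P(Y > 70)$.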
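The two probabilities can be compared with a short calculation. The mean 68.1 and the correlation 0.459 come from the slides; the child SD of 2.53 is an assumption, back-derived from the standardised value 0.75 quoted earlier ((70 − 68.1)/0.75 ≈ 2.53), since the summary table is not legible. The helper `normal_tail` is hypothetical and uses the complementary error function in place of a normal table.

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mean_child = 68.1   # from the slides
r = 0.459           # correlation reported in the slides
sd_child = 2.53     # ASSUMPTION: back-derived from the z-value 0.75, not read from the table

# Unconditional: Y ~ N(68.1, sd_child^2).
p_marginal = normal_tail((70 - mean_child) / sd_child)

# Conditional on X at its mean: same centre, SD shrunk by sqrt(1 - r^2).
sd_resid = math.sqrt(1 - r ** 2) * sd_child
p_conditional = normal_tail((70 - mean_child) / sd_resid)

print(round(p_marginal, 3), round(p_conditional, 3))
```

Under these assumptions the probability drops from about 0.23 to about 0.20: knowing that the parents are of average height makes a child taller than 70 inches less likely.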

22 Homoscedasticity
Regression only makes sense when the data are homoscedastic. Homoscedasticity means that there is the same dispersion if we look at the y values for different x values. The opposite is known as heteroscedasticity. Roughly, homoscedasticity means that in our model $Y = \alpha + \beta X + \varepsilon$, the dispersion of $\varepsilon$ does not depend on X. The scatterplot should be roughly oval shaped.

23 Homoscedasticity
[Figure: two simulated data sets, side by side]
Here you can see simulated data sets. The left one is homoscedastic: you can see similar variability across all the different vertical slices. The right one is heteroscedastic: the variability of Y depends on X.
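A rough way to see the difference numerically is to compare the spread of the errors in a low-x slice and a high-x slice. The sketch below simulates the errors $\varepsilon = Y - (\alpha + \beta X)$ directly; the noise models, the cut point 5.5 and the names `sd` and `spread_ratio` are all assumptions made for illustration, not the simulation behind the figure.

```python
import math
import random

def sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def spread_ratio(xs, noise, cut=5.5):
    """SD of the errors for x >= cut divided by SD of the errors for x < cut."""
    low = [e for x, e in zip(xs, noise) if x < cut]
    high = [e for x, e in zip(xs, noise) if x >= cut]
    return sd(high) / sd(low)

random.seed(1)
n = 4000
xs = [random.uniform(1.0, 10.0) for _ in range(n)]

# Homoscedastic errors: the same SD everywhere.
homo_noise = [random.gauss(0.0, 1.0) for _ in xs]
# Heteroscedastic errors: the SD grows in proportion to x.
hetero_noise = [random.gauss(0.0, 0.5 * x) for x in xs]

ratio_homo = spread_ratio(xs, homo_noise)
ratio_hetero = spread_ratio(xs, hetero_noise)
print(round(ratio_homo, 2), round(ratio_hetero, 2))
```

A ratio near 1 matches the "similar variability across vertical slices" picture, while a ratio well above 1 is the fan shape of the heteroscedastic panel.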

24 Regression to the mean I
Recall the general form of the regression line
$y - \bar{y} = r_{xy}\,\frac{s_y}{s_x}\,(x - \bar{x}).$
Standardising, we can rewrite this as
$\frac{y - \bar{y}}{s_y} = r_{xy}\,\frac{x - \bar{x}}{s_x}.$
If $r_{xy} \neq \pm 1$, then we can observe what is known as regression to the mean: an extremely high (or low) value of x will usually occur with a less extreme value of y.

25 Regression to the mean II
Suppose for example that $s_x = s_y$; then
$y - \bar{y} = r_{xy}(x - \bar{x}),$
and since $|r_{xy}| < 1$ we have, for $x \neq \bar{x}$,
$|y - \bar{y}| < |x - \bar{x}|.$
In other words, if the standardised deviation of x from its mean is extremely high, then we expect that y will be closer to its mean, if standardised.

26 Regression to the mean III
Very important in practice: a test score of 100/100 will invariably be followed by a score $\leq 100$, most likely $< 100$. An extremely high value will usually be followed by a less extreme one. You have to be careful to account for this regression to the mean before reaching any conclusions.
Read the example with the traffic cameras: traffic cameras were installed in locations where large clusters of accidents were observed. A year later the number of accidents was much lower. How much of this was due to the cameras and how much due to regression to the mean?
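The effect is easy to reproduce in a simulation where nothing changes between two measurements. In the sketch below each student has a fixed ability and each test adds independent noise; the distributions, the seed and the 10% cut-off are arbitrary assumptions, and scores are not capped at 100 for simplicity.

```python
import random

random.seed(2)
n = 10000

# Each score = stable ability + independent test-day noise (assumed model).
ability = [random.gauss(70.0, 10.0) for _ in range(n)]
test1 = [a + random.gauss(0.0, 10.0) for a in ability]
test2 = [a + random.gauss(0.0, 10.0) for a in ability]

# Select the top 10% on the first test only.
cutoff = sorted(test1)[int(0.9 * n)]
top = [i for i in range(n) if test1[i] >= cutoff]

top_mean1 = sum(test1[i] for i in top) / len(top)
top_mean2 = sum(test2[i] for i in top) / len(top)
overall_mean2 = sum(test2) / n

print(round(top_mean1, 1), round(top_mean2, 1), round(overall_mean2, 1))
```

The selected group still scores above average on the second test, but clearly lower than on the first, even though no intervention took place. The same logic applies to accident sites chosen because of an unusually bad year.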

27 Summary
If the data do not look normal then we use Spearman's rank correlation coefficient.
If r is the correlation, then the standard deviation of Y when X = x is known is $\sqrt{1 - r^2}\,\mathrm{SD}(Y)$.
Regression only makes sense if the data are homoscedastic: in other words, the variability of Y does not depend on X.
If (X, Y) have correlation $|\rho| < 1$ then there is regression to the mean, which has to be accounted for in order to reach safe conclusions.
