Data Analysis and Statistical Methods Statistics 651


Lecture 31 (MWF): Review of test for independence and starting with linear regression
Suhasini Subba Rao

Review: Test for independence

In many situations we observe two variables on an individual, for example gender and favourite colour. Often we want to see whether there is dependence between the two observations (does gender influence colour preference?). If there is no dependence, then the proportions within each subpopulation should be the same as the proportions over the entire population. If there is a dependence, this is no longer true. The principle of the test for independence is to calculate the numbers we would expect if the two variables were independent and compare them with what we observe.

Example I: Test for independence

Psychologists wanted to investigate whether there was dependence between height and how bossy someone was (aka: do short men have a Napoleon complex?). They gathered the following data.

[Table: counts of bossy and not bossy men among short, medium and large men, with row and column totals.]

Test the hypothesis that there is no dependence between height and bossiness against the alternative that there is.

Solution I

Recall that independence means that if you randomly select someone, the probability that they are bossy is the same as if you were to restrict the population to tall people (or short people, or medium-sized people) and randomly select someone from that subpopulation. If this is the case, then bossiness does not depend on size. In reality we cannot calculate these probabilities, because we do not observe the entire population of people, but we do have a sample from the population. In this case we have a sample of 1000 people.

First look at the data. We see that the proportion of short men who are bossy is larger than the proportion of medium and large men who are bossy. So from looking at the data, there appears to be a dependence. But this difference could be due to random variation, so we want to test whether the difference is significant or not. Our objective is to test:

$H_0$: There is no dependence between height and bossiness.
$H_A$: There is a dependence between height and bossiness.

We first have to make a table of expected values under the null that there is no dependence between height and bossiness.

Motivation

We observe that, in the whole sample of men, 30% = 300/1000 are bossy and 70% = 700/1000 are not bossy. We transfer these percentages to the subgroups of short, medium and large men: under independence, the expected number of bossy men in each height group is 30% of that group's total, and the expected number of not bossy men is 70% of that group's total.

Since 30% = 300/1000 and 70% = 700/1000, this is the same as computing, for each cell,

$$E_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{1000}.$$

In summary, what you need to do...

So basically you just need to multiply each column total by the row total and divide by the overall total to get each entry of the table of expected values. We can now evaluate the test statistic, starting with the differences between the expected and the observed values.

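For each cell we compute $(\text{expected} - \text{observed})^2 / \text{expected}$; for example, the short and bossy cell gives $(60 - 90)^2 / 60$. The test statistic is the sum of these six terms:

$$T = \sum_{\text{cells}} \frac{(\text{expected} - \text{observed})^2}{\text{expected}} = 26.$$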

Now because there are $3 \times 2$ cells (it is a 3 by 2 table), under the null $T$ has a $\chi^2$ distribution with $(3 - 1) \times (2 - 1) = 2$ degrees of freedom. Look up Table 7: $\chi^2_2(0.05) = 5.99$. The p-value is $P(\chi^2_2 > 26)$. Since $T = 26 > 5.99$, there is enough evidence to reject the null. Equivalently, the p-value is very small. That is, based on the data there appears to be a dependence between size and bossiness.
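The table lookup above can be checked numerically. The following is a minimal sketch, assuming only the value $T = 26$ stated on the slide; it relies on the fact that for 2 degrees of freedom the chi-squared tail probability has the closed form $P(\chi^2_2 > t) = e^{-t/2}$. The function names are illustrative and not part of the lecture.

```python
import math


def chi2_tail_2df(t):
    """P(chi-squared with 2 degrees of freedom > t), which equals exp(-t / 2)."""
    return math.exp(-t / 2.0)


def chi2_critical_2df(alpha):
    """Value c such that P(chi-squared with 2 df > c) = alpha, i.e. c = -2 ln(alpha)."""
    return -2.0 * math.log(alpha)


T = 26.0                               # test statistic quoted on the slide
critical = chi2_critical_2df(0.05)     # plays the role of the Table 7 lookup
p_value = chi2_tail_2df(T)

print(f"critical value at 5%: {critical:.2f}")
print(f"p-value: {p_value:.2e}")
print("reject H0" if T > critical else "do not reject H0")
```

The closed form only holds for 2 degrees of freedom; for other table sizes a statistics library or the printed table would be needed.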

Example 2

A group of space explorers has discovered a planet which is inhabited by alien creatures. They notice that there are three main groups of aliens: the Pink aliens, the Blue aliens and the Green aliens. One of the explorers happens to be a statistician. She notices that the size of the aliens tends to differ amongst the population. So she sets out to determine whether there is any dependence between the size of an alien and its colour. She randomly selects 160 aliens and notes their colour and size (grouped as either big or little). This is the data she collected:

            Pink   Blue   Green   Subtotal
Big           30     20      50        100
Little        10     20      30         60
Subtotal      40     40      80        160

State the null and alternative. What do you think were the conclusions of the statistician's research (use $\alpha = 0.05$)?

Solution 2

$H_0$: There is no dependence between size and colour.
$H_A$: There is a dependence between size and colour.

We do a chi-squared test for independence, so we need to make a table of what we expect to observe if there is no dependence between size and colour.

            Pink                   Blue                   Green                  Subtotal
Big         100 × 40 / 160 = 25    100 × 40 / 160 = 25    100 × 80 / 160 = 50    100
Little      60 × 40 / 160 = 15     60 × 40 / 160 = 15     60 × 80 / 160 = 30     60
Subtotal    40                     40                     80                     160

We now construct the $T$ statistic:

$$T = \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(50-50)^2}{50} + \frac{(10-15)^2}{15} + \frac{(20-15)^2}{15} + \frac{(30-30)^2}{30} = 5.33.$$

Under the null, $T$ has a chi-squared distribution with $(3 - 1) \times (2 - 1) = 2$ degrees of freedom. Looking into the tables we see that $\chi^2_2(0.05) = 5.99$. The p-value for 5.33 is about 0.07 (which is greater than 0.05). Since $5.33 < 5.99$ there is not enough evidence to reject the null. Therefore we cannot conclude from the data that there is clear evidence of dependence between colour and size.
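The whole calculation for Example 2 can be sketched in a few lines. This assumes that the observed counts are the first number in each term of $T$ above (30, 20, 50 for big aliens and 10, 20, 30 for little aliens); the function names are illustrative only, and the p-value uses the 2-degrees-of-freedom shortcut $e^{-T/2}$.

```python
import math


def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi2_statistic(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Observed counts (assumed to be as read off the terms of T): rows Big, Little;
# columns Pink, Blue, Green.
observed = [[30, 20, 50], [10, 20, 30]]

expected = expected_counts(observed)
T = chi2_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = math.exp(-T / 2.0)  # tail probability, valid only because df == 2

print("expected:", expected)
print(f"T = {T:.2f}, df = {df}, p-value = {p_value:.3f}")
```

Swapping in another two-row, three-column table only requires changing `observed`; for other table shapes the p-value line would need a general chi-squared tail function.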

Linear regression

Suppose I randomly pick an adult and ask you to guess their height. You would probably give me an interval of, say, 4.5 to 6.5 feet (this can be considered as a CI). Suppose I gave you the additional information that their feet are size 5. Would you reassess your previous estimate? You would probably change your estimate and give a narrower interval. The shoe size of 5 gives us additional information about that person. It allows us to narrow down our estimate and make a more precise estimate of their height.

Put into statistical terms, without knowledge of their shoe size the standard deviation is quite large. Recall that the standard deviation is a measure of error. Once we know their shoe size, the standard deviation (amount of error) decreases.

Often we believe that one variable may have an influence on another variable. For example, the variable $X$ (shoe size of a person) may influence the variable $Y$ (the height of that person). We call $X$ the independent variable. We call $Y$ the dependent variable.

To see if $X$ has an influence on $Y$ we often draw a scatter plot with $X$ on the x-axis and $Y$ on the y-axis. We look for a relationship between the two.

Sometimes it is not clear what influences what (for example, does shoe size have an influence on height, or height have an influence on shoe size?), in which case you let the dependent variable $Y$ be the variable of interest.

Smoking and lung cancer

The independent variable is the number of cigarettes smoked per capita in a state and the dependent variable is the incidence of lung cancer per 100K people.

Smoking and leukemia

The independent variable is the number of cigarettes smoked per capita in a state and the dependent variable is the incidence of leukemia per 100K people.

None of the plots follows a straight line exactly. To check if $x$ has an effect on $Y$ in a linear way we could fit a line through the points. We can use the line to predict the average value of $Y$ given $x$, for example the average height of a person with size 5 feet. What line is the best line to use? How can we check whether this line has any meaning at all (after all, we can put a line through any scatter plot)?

Recall the equation of a line, $y = mx + c$, where the slope $m$ is the change in $y$ divided by the change in $x$ and $c$ is the intercept on the $Y$-axis. In linear regression we fit this line through the data.

Least squares - the line of best fit

We fit the line $\beta_0 + \beta_1 x$ through the data; the way we choose $\beta_0$ and $\beta_1$ is the method of least squares. We have the observations $\{(y_1, x_1), \ldots, (y_n, x_n)\}$, and believe that $y_i$ depends linearly on $x_i$. We use $x_i$ to predict $y_i$. The predictor is $\hat{y}_i$, where $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$. We want $\hat{y}_i$ to be as close as possible to $y_i$, hence we choose the $\hat\beta_0$ and $\hat\beta_1$ that minimise the quantity

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$

A graphical representation

[Figure: five observations $(x_1, y_1), \ldots, (x_5, y_5)$ scattered around the fitted line, with the vertical distance $y_i - \hat{y}_i$ from each point to the line marked.]

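Quantities required

We need the average of the $x$'s, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, and the average of the $y$'s, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$. We need to calculate:

$$S_{xy} = (y_1 - \bar{y})(x_1 - \bar{x}) + \cdots + (y_n - \bar{y})(x_n - \bar{x}) = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}),$$

$$S_{xx} = (x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2.$$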

The equations for the parameter estimator

The least squares estimator minimises the sum of the squares of all these vertical distances. Basically it gives the line of best fit through the observations. The slope $\hat\beta_1$ and intercept $\hat\beta_0$ of this line can be evaluated using the formulas

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x},$$

where $S_{xy} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})$ and $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, with $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ the sample means of $X$ and $Y$.
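For readers who want to see where these formulas come from, here is a short sketch of the standard calculus argument; it is not part of the slides. Write $Q(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$ for the least squares criterion and set both partial derivatives to zero.

```latex
\begin{align*}
\frac{\partial Q}{\partial \beta_0}
  &= -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  &&\Longrightarrow\quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \\
\frac{\partial Q}{\partial \beta_1}
  &= -2 \sum_{i=1}^{n} x_i \, (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  &&\Longrightarrow\quad \sum_{i=1}^{n} x_i \bigl( (y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x}) \bigr) = 0 .
\end{align*}
% The bracketed terms sum to zero, so x_i may be replaced by (x_i - \bar{x}):
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  = \hat\beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2
  \quad\Longrightarrow\quad
  \hat\beta_1 = \frac{S_{xy}}{S_{xx}} .
\end{align*}
```

The second step substitutes $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ from the first equation into the second.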

Therefore, given the estimates $\hat\beta_0$ and $\hat\beta_1$ and any value $x_i$ of the regressor (explanatory variable), we can predict $y_i$ using the predictor $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$.
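The formulas translate directly into code. Below is a minimal sketch in plain Python; the function names `least_squares_line` and `predict` and the small data set (which lies exactly on the line $y = 1 + 2x$) are made up for illustration and do not come from the lecture.

```python
def least_squares_line(x, y):
    """Return (beta0_hat, beta1_hat) using beta1 = S_xy / S_xx, beta0 = y_bar - beta1 * x_bar."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def predict(beta0_hat, beta1_hat, x_new):
    """Predicted y for a new regressor value x_new."""
    return beta0_hat + beta1_hat * x_new


# Hypothetical data lying exactly on y = 1 + 2x, used only as a sanity check.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

beta0_hat, beta1_hat = least_squares_line(x, y)
print(beta0_hat, beta1_hat)
print(predict(beta0_hat, beta1_hat, 4.0))
```

Because the illustrative points lie exactly on a line, the fitted intercept and slope recover that line; with real data the fitted line only passes near the points.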

What $S_{xy}$ and $S_{xx}$ mean

$S_{xy}$ and $S_{xx}$ just fall out when minimising the least squares criterion. However, they do have a useful interpretation. We start by centring the data, i.e. taking $Y_i - \bar{Y}$ and $X_i - \bar{X}$; this does not change the slope. Let us suppose that $X_i$ exerts a positive influence on $Y_i$. This means that large negative values of $X_i - \bar{X}$ are likely to go with large negative values of $Y_i - \bar{Y}$, and large positive values of $X_i - \bar{X}$ are likely to go with large positive values of $Y_i - \bar{Y}$. What this means is that each product $(X_i - \bar{X})(Y_i - \bar{Y})$ is likely to be positive, and thus $\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$ is highly likely to be positive (only highly likely, because the data is random, so we can never be sure that an effect is seen in the data). Using a similar argument, if $X_i$ exerts a negative influence on $Y_i$, then it is highly likely that $\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$ will be negative. On the other hand, if $X_i$ does not exert any linear influence on $Y_i$, then the product $(X_i - \bar{X})(Y_i - \bar{Y})$ can be either negative or positive, so the negative and positive terms in the sum $\sum_i (X_i - \bar{X})(Y_i - \bar{Y})$ tend to cancel and the sum is likely to be close to zero.

$S_{xx}$ is simply the sample variance of the $x_i$ before dividing by $n - 1$, and measures the amount of variation in the independent variable.

The value of the coefficient $\hat\beta_1$ will vary according to the units you use. For example, suppose you want to measure the effect temperature has on the volume of ice on a lake. If you measure the temperature in Celsius, the slope will be different from the slope you get if you measure the temperature in Fahrenheit. Thus the slope (like the mean) is sensitive to the units used.
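The effect of units on the slope can be seen in a small sketch. The temperatures and ice volumes below are invented purely for illustration; the only fact used is the conversion $F = 1.8\,C + 32$, which stretches the x-axis by a factor of 1.8 and therefore divides the slope by 1.8.

```python
def slope(x, y):
    """Least squares slope S_xy / S_xx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Hypothetical measurements: temperature in Celsius and ice volume (arbitrary units).
temps_celsius = [-10.0, -5.0, 0.0, 5.0, 10.0]
ice_volume = [52.0, 47.0, 40.0, 31.0, 25.0]

# The same temperatures expressed in Fahrenheit.
temps_fahrenheit = [1.8 * c + 32.0 for c in temps_celsius]

slope_celsius = slope(temps_celsius, ice_volume)
slope_fahrenheit = slope(temps_fahrenheit, ice_volume)

print("slope per degree Celsius:   ", slope_celsius)
print("slope per degree Fahrenheit:", slope_fahrenheit)
print("ratio:", slope_celsius / slope_fahrenheit)
```

The two slopes describe the same relationship; only the unit of "one degree" differs, which is why the size of a slope on its own says little about the strength of a relationship.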

Toy Example: size of a person and their shoe size

This shows the mechanics of how the slope and intercept are calculated. You do not have to learn the precise details. However, it does give you some idea of what exactly $S_{xx}$ and $S_{xy}$ are.

Let $x_i$ be the shoe size and $y_i$ the height. We observe the height and shoe size of 5 people:

[Table: height $y_i$ and feet size $x_i$ for each person.]

It is natural to believe there is a possible linear dependence between the shoe size and height. Summary statistics: $\bar{y} = 14$ and $\bar{x} = 4$.

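To compute the slope, tabulate for each person the height $y_i$, the deviation $y_i - \bar{y}$ (with $\bar{y} = 14$), the feet size $x_i$, the deviation $x_i - \bar{x}$ (with $\bar{x} = 4$), the square $(x_i - \bar{x})^2$ and the product $(y_i - \bar{y})(x_i - \bar{x})$. For example, one row gives the product $(-8) \times (-3) = 24$. Summing the columns:

$$S_{xy} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = 60, \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 20.$$

Then we have $\hat\beta_1 = 60/20 = 3$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 14 - 3 \times 4 = 2$.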

The line of best fit is $\hat{Y} = 2 + 3x$. We plot this below. The points are the observations and the line is the line of best fit.

[Figure: scatter plot of the observations ($y$ against $x$) with the fitted line.]
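The mechanics of the toy example can be replayed in code. The data table did not survive in this copy, so the five $(x_i, y_i)$ pairs below are hypothetical values chosen only so that they reproduce the summary statistics quoted in the example ($\bar{x} = 4$, $\bar{y} = 14$, $S_{xy} = 60$, $S_{xx} = 20$); they should not be read as the original observations.

```python
# Hypothetical data, NOT the original table: chosen to match the quoted summaries.
shoe_size = [1, 3, 4, 5, 7]      # x_i
height = [6, 12, 14, 12, 26]     # y_i

n = len(shoe_size)
x_bar = sum(shoe_size) / n
y_bar = sum(height) / n

# One working row per person: (y_i - y_bar, x_i - x_bar, squared x deviation, product).
rows = [
    (y - y_bar, x - x_bar, (x - x_bar) ** 2, (y - y_bar) * (x - x_bar))
    for x, y in zip(shoe_size, height)
]

s_xx = sum(row[2] for row in rows)
s_xy = sum(row[3] for row in rows)

beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

for row in rows:
    print(row)
print("x_bar =", x_bar, "y_bar =", y_bar)
print("S_xy =", s_xy, "S_xx =", s_xx)
print("fitted line: Y_hat =", beta0_hat, "+", beta1_hat, "* x")
```

Printing the rows makes it easy to see which products are positive and which are negative before they are summed into $S_{xy}$.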

Interpreting the slope

What does the slope in $\hat{Y} = 2 + 3x$ (where $x$ is the shoe size and $\hat{Y}$ is the predicted height) tell us about the relationship between shoe size and height? At face value, the fact that 3 is large and positive may make you think that there is a positive relationship (since the slope is not zero, and a zero slope indicates no relationship). DO NOT be fooled by this! We have estimated the slope from a sample of 5 people; a slope of 3 could easily be obtained by chance (when there is no relationship at all). Recall that $S_{xy}$ is a sum of only a few random terms $(X_i - \bar{X})(Y_i - \bar{Y})$. The slope $\hat\beta_1 = 3$ is an estimate of the true slope (which we define later).

Therefore our objectives are:

(i) Is the slope (estimator) significant? I.e. is there really evidence of a relationship? Here we need to use statistical techniques since, as usual, we do not observe the entire population (in this example it is just 5 people!).

(ii) If there is a relationship, how strong is this relationship? Again, the size of the slope does not mean anything by itself; the strength of a relationship is determined by how well the line fits the points.
