PS5: Two Variable Statistics LT3: Linear regression LT4: The test of independence.

Example by eye.
On a hot day, nine cars were left in the sun in a car parking lot. The length of time each car was left in the sun was recorded, as well as the temperature inside the car at the end of the period.

Car          A     B     C     D     E     F     G     H     I
Time (min)   50    5     25    40    15    45    55    10    15
Temp         100   70    88    96    77    110   121   80    73

A. Calculate the mean of both variables.
B. Draw a scatter diagram of the data.
C. Plot the mean point $(\bar{x}, \bar{y})$ on the scatter diagram and then draw the line of best fit.
D. Predict the temperature at 35 minutes and at 75 minutes. Comment on the reliability of your predictions.
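For part A, the mean point can be checked with a few lines of Python. This is only a sketch for checking a hand calculation; the names `time`, `temp` and `mean_point` are made up for this illustration.

```python
from statistics import mean

# Car data from the example: minutes in the sun and temperature inside the car.
time = [50, 5, 25, 40, 15, 45, 55, 10, 15]
temp = [100, 70, 88, 96, 77, 110, 121, 80, 73]


def mean_point(xs, ys):
    """Return the mean point (x-bar, y-bar) of paired data."""
    return mean(xs), mean(ys)


x_bar, y_bar = mean_point(time, temp)
print(f"mean point: ({x_bar:.2f}, {y_bar:.2f})")
```

The line of best fit drawn by eye should pass through this point.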

More on reliability of a prediction
The accuracy of an interpolation depends on how linear the original data was. This can be gauged by the correlation coefficient and by checking that the data is randomly scattered around the line of best fit.
The accuracy of an extrapolation depends not only on how linear the original data was, but also on the assumption that the linear trend will continue past the poles (the smallest and largest observed values of x). The validity of this assumption depends greatly on the situation we are looking at.

Issues
The problem with drawing a line of best fit by eye is that the line drawn will vary from one person to another. Instead, we use a method known as linear regression to find the equation of the line which best fits the data. The most common method is the method of least squares.

The Least Squares Regression Line
For any line we draw to model the points, we can find the vertical distances $d_1, d_2, d_3, \ldots$ between each point and the line. We then square each distance and add the squares to get $d_1^2 + d_2^2 + d_3^2 + \cdots$. The least squares regression line is the line which makes this sum as small as possible.
We write the equation of the line as $y = ax + b$ ($a$ is the slope).
Take a look at how regressions are used with stocks, specifically Strangles.
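To make the "sum of squared vertical distances" idea concrete, here is a minimal Python sketch that scores two candidate lines on a tiny data set. The function name `sum_of_squared_distances` and the two candidate lines are assumptions chosen for illustration only.

```python
def sum_of_squared_distances(xs, ys, a, b):
    """Sum of the squared vertical distances between the points and y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Small illustrative data set (three points).
xs = [1, 3, 5]
ys = [3, 5, 6]

# Two candidate lines: one guessed by eye, one with slope 0.75 and intercept 29/12.
by_eye = sum_of_squared_distances(xs, ys, a=1.0, b=2.0)
candidate = sum_of_squared_distances(xs, ys, a=0.75, b=29 / 12)

print(f"by eye:    {by_eye:.4f}")
print(f"candidate: {candidate:.4f}")
```

The second line gives the smaller sum, so by the least squares criterion it fits these three points better than the line guessed by eye.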

The Least Squares Regression Line
Let's take the last data set and calculate this on our GDC!

Car          A     B     C     D     E     F     G     H     I
Time (x)     50    5     25    40    15    45    55    10    15
Temp (y)     100   70    88    96    77    110   121   80    73

A) Use technology to find the least squares regression line.
It is very simple to calculate the least squares regression line, and furthermore, it's easy to graph the line. With the data in $L_1$ and $L_2$ (no mean point needed):
1) Press STAT, CALC, LinReg(ax+b).
2) Make sure it looks like this.
3) Calculate.
4) Put into $y = ax + b$ form.
5) Plot it with Y=.

On your own. You try.
The annual income and average weekly grocery bill for a selection of families is shown below:

Family                          A     B     C     D     E     F     G     H
Income (x thousand pounds)      55    36    25    47    60    64    42    50
Grocery bill (y pounds)         120   90    60    160   190   250   110   150

a Construct a scatter diagram to illustrate the data.
b Use technology to find the least squares regression line. Graph it.
c Estimate the weekly grocery bill for a family with an annual income of 95,000 pounds. Comment on whether this estimate is likely to be reliable.
Answer: Weekly grocery bill: about 340 pounds. Because this is extrapolation, it may be unreliable.

The Least Squares Formulae (by hand)
Some questions may be asked about this. Let's do a small problem for background, but note that this will not be on the IB paper.

$y - \bar{y} = \frac{s_{xy}}{s_x^2}(x - \bar{x})$

Remember: what is $s_x^2$? We use a different formula: $s_x^2 = \frac{\sum x^2}{n} - \bar{x}^2$
What is $s_{xy}$? $s_{xy} = \frac{\sum xy}{n} - \bar{x}\bar{y}$. This is new.

Problem:

x      y      xy     x^2
1      3      3      1
3      5      15     9
5      6      30     25
Sum           48     35

We need to find $\bar{x}$, $\bar{y}$, and the sums of $xy$ and $x^2$: $\bar{x} = 3$, $\bar{y} = 4.667$.

$s_{xy} = \frac{48}{3} - 3 \times 4.667 \approx 2$
$s_x^2 = \frac{35}{3} - 3^2 \approx 2.667$
$\frac{s_{xy}}{s_x^2} \approx 0.7499$ (exactly 0.75 without rounding)

$y - 4.667 = 0.7499(x - 3)$
$y = 0.7499x + 2.417$
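The by-hand formulas $s_{xy} = \frac{\sum xy}{n} - \bar{x}\bar{y}$ and $s_x^2 = \frac{\sum x^2}{n} - \bar{x}^2$ translate almost line for line into Python. The sketch below is only a way to check the arithmetic on the small problem; the function name `least_squares_by_hand` is an assumption for illustration.

```python
def least_squares_by_hand(xs, ys):
    """Return (a, b) for y = a*x + b using the s_xy / s_x^2 formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # s_xy = (sum of xy) / n - x_bar * y_bar
    s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    # s_x^2 = (sum of x^2) / n - x_bar^2
    s_x2 = sum(x * x for x in xs) / n - x_bar ** 2
    a = s_xy / s_x2
    # Rearranging y - y_bar = a(x - x_bar) gives the intercept.
    b = y_bar - a * x_bar
    return a, b


a, b = least_squares_by_hand([1, 3, 5], [3, 5, 6])
print(f"y = {a:.4f}x + {b:.3f}")
```

Because the code does not round $\bar{y}$ along the way, its slope differs slightly in the last decimal place from a hand calculation that uses 4.667.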

New equation that works with r.
Here is another equation that works without finding $s_{xy}$ and $s_x^2$:

$y - \bar{y} = r\,\frac{s_y}{s_x}(x - \bar{x})$

This is nice because it uses only values that can be found easily with a calculator. Cool. Let's move on.

LT4, 11E: The $\chi^2$ test of independence.
Pronounced: Chi = "Ki".
This is the formal test of independence and its counterparts.
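One common way to write the regression line in terms of $r$ is $y - \bar{y} = r\,\frac{s_y}{s_x}(x - \bar{x})$. Assuming that is the form intended here, the sketch below builds the line from $r$, $s_x$, $s_y$ and the mean point, the values a calculator reports. The function name `line_from_r` and the three-point data set are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def line_from_r(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for y = a*x + b, given r, the standard deviations and the mean point."""
    a = r * s_y / s_x          # slope from y - y_bar = r * (s_y / s_x) * (x - x_bar)
    b = y_bar - a * x_bar      # intercept from rearranging the same equation
    return a, b


xs = [1, 3, 5]
ys = [3, 5, 6]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = pstdev(xs), pstdev(ys)  # population standard deviations

# Correlation coefficient r = s_xy / (s_x * s_y), with s_xy = (sum of xy) / n - x_bar * y_bar.
s_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
r = s_xy / (s_x * s_y)

a, b = line_from_r(r, s_x, s_y, x_bar, y_bar)
print(f"r = {r:.4f}")
print(f"y = {a:.4f}x + {b:.3f}")
```

The slope agrees with $s_{xy} / s_x^2$ because $r\,\frac{s_y}{s_x} = \frac{s_{xy}}{s_x s_y} \cdot \frac{s_y}{s_x}$.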

11E: Categorical or grouped tests.
Currently, we are only able to look at numerical data, through least squares regression. What if we want to look at categorical or grouped data?
For example, suppose the variables gender and regular exercise are examined. They may be dependent, for example females may be more likely to exercise regularly than males. Alternatively, the variables may be independent, which means the gender of a person has no effect on whether they exercise regularly.
We need to use the $\chi^2$ (chi-squared) test of independence (pronounced "ky squared"). The $\chi^2$ test of independence looks at whether two variables, usually grouped or categorical, are independent of or dependent on one another.
What are some other examples that can be used? EX: Is GPA dependent on height or vice versa? We can test this dependency, or lack thereof, using the $\chi^2$ test of independence.

Contingency tables
To understand grouped or categorical data, we use a contingency table to show the number of values that fall under each particular category.
Example: Suppose we surveyed 400 people about their gender and number of text messages per week. Since the number of text messages is a numerical value, we want to group it into categories, such as $\geq 300$ and $< 300$. We want to test whether the two variables are independent, so we put the variables in a contingency table and count the number of people within each pair of categories. After we fill in the contingency table, it's time to run the $\chi^2$ test.

          ≥ 300    < 300    sum
Male      60       135      195
Female    152      53       205
sum       212      188      400

Calculating $\chi^2$ by hand.
After we have gathered the data and placed it into a contingency table, we need to calculate the expected values if the variables were independent.
For example, if they were independent, then $P(\text{male} \cap \geq 300) = P(\text{male}) \times P(\geq 300) = \frac{195}{400} \times \frac{212}{400}$. This would be the probability for that cell of the survey, so the expected value would be $400 \times \frac{195}{400} \times \frac{212}{400} = 103.35$ males sending 300 or more texts.

          ≥ 300    < 300    sum
Male      60       135      195
Female    152      53       205
sum       212      188      400

To streamline this, we make an expected frequency table and calculate the expected value of each cell.

          ≥ 300                        < 300                       sum
Male      212 × 195 / 400 = 103.35     188 × 195 / 400 = 91.65     195
Female    212 × 205 / 400 = 108.65     188 × 205 / 400 = 96.35     205
sum       212                          188                         400

Contingency tables examined.
The observed frequency table is the one where you collected the data. The expected frequency table is the one we calculate to find the expected frequencies. Here is a generic look at an expected frequency contingency table:

                                         sum
Category V    a × c / n    b × c / n     c
Category W    a × d / n    b × d / n     d
sum           a            b             n
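The rule for each expected cell (row sum times column sum, divided by the grand total) can be written as a short Python function. This is a sketch for checking the expected frequency table; the name `expected_frequencies` and the list-of-rows layout are assumptions for illustration.

```python
def expected_frequencies(observed):
    """Expected table under independence: (row sum * column sum) / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


# Observed texting data: rows are Male, Female; columns are >= 300, < 300.
observed = [
    [60, 135],
    [152, 53],
]

expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, [round(value, 2) for value in row])
```

The same function works for tables with more rows or columns, since it only relies on the row and column sums.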

Before we continue
The chi-squared test examines the difference between the observed values obtained from our sample and the expected values we have just calculated. To do this, we look at

$\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$

where $f_o$ is an observed frequency and $f_e$ is the corresponding expected frequency.
If the variables are independent, the observed and expected values will be very similar. This means that the values of $(f_o - f_e)$ will be small, and hence $\chi^2_{calc}$ will be small.
If the variables are not independent, the observed values will differ significantly from the expected values. The values of $(f_o - f_e)$ will be large, and hence $\chi^2_{calc}$ will be large.
Now, let's continue using a table to help us calculate $\chi^2_{calc}$.

Finished the calculations
What we want to find is the sum of $(f_o - f_e)^2$ divided by $f_e$. We fill in the table from the contingency table row by row: start with the top row, first column, work across the columns, then move to the next row.

f_o     f_e       f_o - f_e    (f_o - f_e)^2    (f_o - f_e)^2 / f_e
60      103.35    -43.35       1879.22          18.18
135     91.65     43.35        1879.22          20.50
152     108.65    43.35        1879.22          17.30
53      96.35     -43.35       1879.22          19.50
                                       total    75.48

So $\chi^2_{calc} = 75.48$. This is fairly large, so we can predict that gender and the amount of texting may be dependent. What are some things that hinder our reliability?
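The whole table calculation, $\chi^2_{calc} = \sum \frac{(f_o - f_e)^2}{f_e}$, can be checked with a short Python sketch. The function name `chi_squared` is an assumption for illustration, and the expected values are recomputed inside it as row sum times column sum over the grand total.

```python
def chi_squared(observed):
    """Return the chi-squared statistic for a table of observed frequencies."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)

    total = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_sums[i] * col_sums[j] / n  # expected frequency for this cell
            total += (f_o - f_e) ** 2 / f_e
    return total


# Observed texting data: rows are Male, Female; columns are >= 300, < 300.
observed = [
    [60, 135],
    [152, 53],
]

chi2 = chi_squared(observed)
print(f"chi-squared = {chi2:.2f}")
```

Without rounding each row first, the total comes out near 75.49, slightly above the 75.48 obtained by adding the rounded table entries.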

Do it on your own!
A survey conducted in my Concepts of Algebra class asked students their gender and their hat type preference. The results were as follows:

Fedora Cowboy sum Fedora Cowboy sum Male 13 23 17 21 Male Female 6 17 36 9 Female sum sum

1) Calculate a contingency table.
2) Find $\chi^2_{calc}$ using a table.

Calculating chi-squared using the GDC
What we need to know is that both the by-hand method and the technology method are required for your IB papers.
How to use a TI-84 Plus to calculate $\chi^2$
Let's enter the data into a matrix to calculate 1) the expected values, and 2) $\chi^2_{calc}$.
Step 1) Enter the raw data into a $2 \times 3$ matrix.
Step 2) Press STAT, go to TESTS, and select the $\chi^2$-Test.
Step 3) Calculate! That gets $\chi^2_{calc}$.
To find the filled-in expected values, go back to the matrix menu and edit matrix B.

The formal test.
Next time.
Homework:
11D p.18 #1,2,5,6
11E.1 p.22 #2,3