UNIT 12 ~ More About Regression

***SECTION 15.1*** The Regression Model

When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use the least-squares line fitted to the data to predict y for a given value of x. Now we want to do tests and confidence intervals in this setting.

Example 1: Crying and IQ

Infants who cry easily may be more easily stimulated than others. This may be a sign of higher IQ. Child development researchers explored the relationship between the crying of infants four to ten days old and their later IQ test scores. A snap of a rubber band on the sole of the foot caused the infants to cry. The researchers recorded the crying and measured its intensity by the number of peaks in the most active 20 seconds. They later measured the children's IQ at age three years using the Stanford-Binet IQ test. The following table contains data on 38 infants. Let us analyze the data. (Be sure to use the DATA ANALYSIS TOOLBOX on pages 93-94!)

Crying   IQ     Crying   IQ     Crying   IQ     Crying   IQ
  10     87       20     90       17     94       12     94
  12     97       16    100       19    103       12    103
   9    103       23    103       13    104       14    106
  16    106       27    108       18    109       10    109
  18    109       15    112       18    112       23    113
  15    114       21    114       16    118        9    119
  12    119       12    120       19    120       16    124
  20    132       15    133       22    135       31    135
  16    136       17    141       30    155       22    157
  33    159       13    162

Data
Who? What? Why? When, where, how, and by whom?

Graphs

Numerical summaries

Model

Interpretation

Conditions for the Regression Model

The slope b and intercept a of the least-squares line are statistics. That is, we calculate them from the data. In our previous example, we know that these statistics would take somewhat different values if we repeated the study with different infants. When we perform formal inference, we think of a and b as estimates of the unknown parameters α and β.

Conditions for Regression Inference (L.I.N.E.R.)

We have n observations of an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.

Linear: The actual relationship between x and y is linear. For any fixed value of x, the mean response falls on the population (true) regression line $\mu_y = \alpha + \beta x$. The slope β and intercept α are usually unknown parameters.

Independent: Individual observations are independent of each other.

Normal: For any fixed value of x, the response y varies according to a Normal distribution.

Equal variance: The standard deviation of y (call it σ) is the same for all values of x. The common standard deviation σ is usually an unknown parameter.

Random: The data come from a well-designed random sample or randomized experiment.

The heart of this model is that there is an "on the average" straight-line relationship between y and x. The true regression line $\mu_y = \alpha + \beta x$ says that the mean response moves along a straight line as the explanatory variable changes. We cannot observe the true regression line. The values of y that we do observe vary about their means according to a Normal distribution.

The figure shows the regression model in picture form. The line in the figure is the true regression line. The mean of the response y moves along this line as the explanatory variable x takes different values. The Normal curves show how y will vary when x is held fixed at different values. All of the curves have the same σ, so the variability of y is the same for all values of x.

*YOU SHOULD CHECK THE CONDITIONS FOR INFERENCE WHEN YOU DO INFERENCE ABOUT REGRESSION!

Checking the Regression Conditions (L.I.N.E.R.)

You can fit a least-squares line to any set of explanatory-response data when both variables are quantitative. Before we do inference, we must check these conditions one by one.

Linear: Examine the scatterplot to check that the overall pattern is roughly linear. Look for curved patterns in the residual plot. Check that the residuals center on the residual = 0 line at each x-value in the residual plot.

Independent: Look at how the data were produced. Random sampling and random assignment help ensure the independence of individual observations. If sampling is done without replacement, remember to check that the population is at least 10 times as large as the sample (10% condition).

Normal: Make a stemplot, histogram, boxplot, or Normal probability plot of the residuals and check for clear skewness or other major departures from Normality.

Equal variance: Look at the scatter of the residuals above and below the residual = 0 line in the residual plot. The amount of scatter should be roughly the same from the smallest to the largest x-value.

Random: See if the data were produced by random sampling or a randomized experiment.

Example 2: Crying and IQ

Let us check conditions for Example 1.

Estimating Parameters

The first step in inference is to estimate the unknown parameters α, β, and σ. When the regression model describes our data and we calculate the least-squares line $\hat{y} = a + bx$, the slope b of the least-squares line is an unbiased estimator of the true slope β, and the intercept a of the least-squares line is an unbiased estimator of the true intercept α.

Example 3: Crying and IQ

Let us find and interpret the slope and intercept for Example 1.
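One way to carry out the calculation in Example 3 is sketched below in plain Python. The data are the 38 pairs from the Example 1 table; the helper name `least_squares` is an illustrative choice, not something from the text, and a calculator or statistical software would normally do this step.

```python
# Example 1 data as (crying peaks, IQ at age three) pairs, copied from the table.
data = [
    (10, 87), (20, 90), (17, 94), (12, 94), (12, 97), (16, 100), (19, 103), (12, 103),
    (9, 103), (23, 103), (13, 104), (14, 106), (16, 106), (27, 108), (18, 109), (10, 109),
    (18, 109), (15, 112), (18, 112), (23, 113), (15, 114), (21, 114), (16, 118), (9, 119),
    (12, 119), (12, 120), (19, 120), (16, 124), (20, 132), (15, 133), (22, 135), (31, 135),
    (16, 136), (17, 141), (30, 155), (22, 157), (33, 159), (13, 162),
]


def least_squares(pairs):
    """Return (a, b) for the least-squares line y-hat = a + b*x (hypothetical helper)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    b = sxy / sxx            # slope: estimates beta
    a = y_bar - b * x_bar    # intercept: estimates alpha
    return a, b


a, b = least_squares(data)
print(f"IQ-hat = {a:.2f} + {b:.3f} * (crying peaks)")
```

For these data the fitted line comes out near ŷ = 91.27 + 1.493x. Read this way, each additional crying peak goes with a predicted IQ about 1.49 points higher on average, while the intercept (the predicted IQ for an infant with zero peaks) lies outside the observed crying counts and should be interpreted with caution.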

The remaining parameter is σ, which describes the variability of the response y about the true regression line. The residuals estimate how much y varies about the true line. Recall that the residuals are the vertical deviations of the data points from the least-squares line:

$$\text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y}$$

There are n residuals, one for each data point. Because σ is the standard deviation of responses about the true regression line, we estimate it by a sample standard deviation of the residuals. We saw this error measure before in Chapter 3 (Pg. 218). We call this sample standard deviation a standard error to emphasize that it is estimated from data. The residuals from a least-squares line always have mean zero, which simplifies their standard error.

Standard Error about the Least-Squares Line

The standard error about the line is

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$$

Use s to estimate the unknown σ in the regression model.

Example 4: Crying and IQ

Let us calculate the standard error for Example 1.
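The standard error formula can be checked by hand with a short script. This is a sketch in plain Python using the Example 1 data; the variable names are illustrative, and the division by n − 2 follows the definition of s given in this section.

```python
import math

# Example 1 data as (crying peaks, IQ at age three) pairs, copied from the table.
data = [
    (10, 87), (20, 90), (17, 94), (12, 94), (12, 97), (16, 100), (19, 103), (12, 103),
    (9, 103), (23, 103), (13, 104), (14, 106), (16, 106), (27, 108), (18, 109), (10, 109),
    (18, 109), (15, 112), (18, 112), (23, 113), (15, 114), (21, 114), (16, 118), (9, 119),
    (12, 119), (12, 120), (19, 120), (16, 124), (20, 132), (15, 133), (22, 135), (31, 135),
    (16, 136), (17, 141), (30, 155), (22, 157), (33, 159), (13, 162),
]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Least-squares slope b and intercept a.
sxx = sum((x - x_bar) ** 2 for x, _ in data)
b = sum((x - x_bar) * (y - y_bar) for x, y in data) / sxx
a = y_bar - b * x_bar

# One residual per data point: observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in data]

# Standard error about the line: divide by n - 2, not n - 1.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"sum of residuals = {sum(residuals):.6f}")
print(f"s = {s:.2f}")
```

The residuals sum to zero up to rounding, and s comes out near 17.50 IQ points, which is the estimate of σ for these data.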

Confidence Intervals for the Regression Slope

The slope β of the true regression line is usually the most important parameter in the regression problem. The slope is the rate of change of the mean response as the explanatory variable increases. We often want to estimate β. The slope b of the least-squares line is an unbiased estimator of β. A confidence interval is more useful because it shows how accurate the estimate b is likely to be. The confidence interval for β has the familiar form

$$\text{estimate} \pm t^{*}\,\mathrm{SE}_{\text{estimate}}$$

Because b is our estimate, the confidence interval becomes

$$b \pm t^{*}\,\mathrm{SE}_b$$

Confidence Interval for Regression Slope

A level C confidence interval for the slope β of the true regression line is

$$b \pm t^{*}\,\mathrm{SE}_b$$

In this expression, the standard error of the least-squares slope b is

$$\mathrm{SE}_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

and $t^{*}$ is the critical value for the $t(n-2)$ density curve with area C between $-t^{*}$ and $t^{*}$.

Example 5: Crying and IQ

Let us examine regression output for Example 1.
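As a rough check on software output, the interval can be assembled directly from the formulas in this section. The sketch below uses the Example 1 data and a 95% confidence level. The critical value t* = 2.028 for df = 36 is an assumption taken from a t table rather than computed here, and the helper name `slope_and_se` is illustrative.

```python
import math

# Example 1 data as (crying peaks, IQ at age three) pairs, copied from the table.
data = [
    (10, 87), (20, 90), (17, 94), (12, 94), (12, 97), (16, 100), (19, 103), (12, 103),
    (9, 103), (23, 103), (13, 104), (14, 106), (16, 106), (27, 108), (18, 109), (10, 109),
    (18, 109), (15, 112), (18, 112), (23, 113), (15, 114), (21, 114), (16, 118), (9, 119),
    (12, 119), (12, 120), (19, 120), (16, 124), (20, 132), (15, 133), (22, 135), (31, 135),
    (16, 136), (17, 141), (30, 155), (22, 157), (33, 159), (13, 162),
]


def slope_and_se(pairs):
    """Return (b, SE_b, df) for the least-squares slope (hypothetical helper)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in pairs)
    s = math.sqrt(sse / (n - 2))          # standard error about the line
    return b, s / math.sqrt(sxx), n - 2   # SE_b = s / sqrt(sum of (x - x_bar)^2)


b, se_b, df = slope_and_se(data)

# Assumption: 95% critical value for t(36), read from a t table.
t_star = 2.028

lower = b - t_star * se_b
upper = b + t_star * se_b
print(f"b = {b:.3f}, SE_b = {se_b:.4f}, df = {df}")
print(f"95% CI for beta: ({lower:.3f}, {upper:.3f})")
```

With these inputs the interval runs from roughly 0.51 to 2.48 IQ points per crying peak. A table that lacks a df = 36 row would typically lead to the more conservative df = 30 value, which widens the interval slightly.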

Testing the Hypothesis of No Linear Relationship

The most common hypothesis about the slope is $H_0: \beta = 0$. A regression line with slope 0 is horizontal. That is, the mean of y does not change when x changes. So this $H_0$ says that there is no true linear relationship between x and y. In other words, $H_0$ says there is no correlation between x and y in the population from which we drew our data.

* TRICK: You can use the test for zero slope to test the hypothesis of zero correlation between any two quantitative variables.

* NOTE: Testing for correlation makes sense only if the observations are a random sample. In regression settings, this is often not the case because researchers may fix in advance the values of x they want to study.

Significance Tests for Regression Slope

To test the hypothesis $H_0: \beta = 0$, compute the t statistic

$$t = \frac{b}{\mathrm{SE}_b}$$

In terms of a random variable T having the $t(n-2)$ distribution, the P-value for a test of $H_0$ against

$H_a: \beta > 0$ is $P(T \ge t)$
$H_a: \beta < 0$ is $P(T \le t)$
$H_a: \beta \ne 0$ is $2P(T \ge |t|)$

This test is also a test of the hypothesis that the correlation is 0 in the population.

* Regression output from statistical software usually gives t and its two-sided P-value. So, for a one-sided test, be sure to divide the P-value in the output by 2 (this applies when b falls on the side of 0 named in $H_a$).

Example 6: Crying and IQ

Let us test the regression slope for Example 1.
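The test statistic and a two-sided P-value for Example 6 can be sketched in plain Python. Software or a t table would normally supply the P-value; here it is approximated by integrating the t density with Simpson's rule, which is a stand-in technique and not the method of the text. The helper names `slope_and_se`, `t_pdf`, and `t_upper_tail` are illustrative.

```python
import math

# Example 1 data as (crying peaks, IQ at age three) pairs, copied from the table.
data = [
    (10, 87), (20, 90), (17, 94), (12, 94), (12, 97), (16, 100), (19, 103), (12, 103),
    (9, 103), (23, 103), (13, 104), (14, 106), (16, 106), (27, 108), (18, 109), (10, 109),
    (18, 109), (15, 112), (18, 112), (23, 113), (15, 114), (21, 114), (16, 118), (9, 119),
    (12, 119), (12, 120), (19, 120), (16, 124), (20, 132), (15, 133), (22, 135), (31, 135),
    (16, 136), (17, 141), (30, 155), (22, 157), (33, 159), (13, 162),
]


def slope_and_se(pairs):
    """Return (b, SE_b, df) for the least-squares slope (hypothetical helper)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in pairs)
    return b, math.sqrt(sse / (n - 2)) / math.sqrt(sxx), n - 2


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


b, se_b, df = slope_and_se(data)
t_stat = b / se_b                                 # t = b / SE_b
p_two_sided = 2 * t_upper_tail(abs(t_stat), df)   # H_a: beta != 0

print(f"t = {t_stat:.2f} with df = {df}")
print(f"two-sided P-value = {p_two_sided:.4f}")
```

The statistic comes out near t = 3.07 with a two-sided P-value of about 0.004, so halving it gives the one-sided P-value for $H_a: \beta > 0$.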

Example 7: Beer and blood alcohol

Let us look at how well the number of beers a student drinks predicts his or her blood alcohol content (BAC). Sixteen student volunteers at Ohio State University drank a randomly assigned number of cans of beer. Thirty minutes later, a police officer measured their BAC. We will perform a linear regression t test. Here are the data:

Student:   1      2      3      4      5      6      7      8
Beers:     5      2      9      8      3      7      3      5
BAC:       0.10   0.03   0.19   0.12   0.04   0.095  0.07   0.06

Student:   9      10     11     12     13     14     15     16
Beers:     3      5      4      6      5      7      1      4
BAC:       0.02   0.05   0.07   0.10   0.085  0.09   0.01   0.05

STEP 1: State

STEP 2: Plan

STEP 3: Do

STEP 4: Conclude
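The arithmetic for the "Do" step can be sketched as follows, assuming the conditions have already been checked in the "Plan" step. The alternative $H_a: \beta > 0$ is an assumption about how the hypotheses are stated (more beers, higher BAC), the P-value is approximated by Simpson's rule as a stand-in for a t table or software, and the helper names are illustrative.

```python
import math

# Example 7 data as (beers, BAC) pairs for students 1 through 16.
data = [
    (5, 0.10), (2, 0.03), (9, 0.19), (8, 0.12), (3, 0.04), (7, 0.095), (3, 0.07), (5, 0.06),
    (3, 0.02), (5, 0.05), (4, 0.07), (6, 0.10), (5, 0.085), (7, 0.09), (1, 0.01), (4, 0.05),
]


def fit(pairs):
    """Return (a, b, SE_b, df) for the least-squares line (hypothetical helper)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in pairs)
    s = math.sqrt(sse / (n - 2))
    return a, b, s / math.sqrt(sxx), n - 2


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


a, b, se_b, df = fit(data)
t_stat = b / se_b
p_one_sided = t_upper_tail(t_stat, df)  # assumed alternative H_a: beta > 0

print(f"BAC-hat = {a:.4f} + {b:.5f} * beers")
print(f"t = {t_stat:.2f}, df = {df}")
print(f"one-sided P-value below 0.001: {p_one_sided < 0.001}")
```

The fitted slope is close to 0.018 BAC units per beer and t is about 7.5 on 14 degrees of freedom, so the one-sided P-value is far below any usual significance level.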

***SECTION 4.1*** Transforming To Achieve Linearity

Linear Transformations:

Note: Linear transformations cannot straighten a curved relationship between two variables.

Non-Linear Transformations: Nonlinear relationships between variables can sometimes be changed into linear relationships by transforming one or both of the variables.

GOAL: Transform the data into a linear pattern. Once our data are linear, we can make a regression model to help make future predictions.

Types of Transformations:

(1) The exponential model $y = ab^{x}$ becomes linear when we plot log y against x.

(2) The power model $y = ax^{p}$ becomes linear when we plot log y against log x.

Transforming Non-Linear Bivariate Data:

The table below lists the mean distance from the sun (in astronomical units or AU) and the period (time to orbit) for the nine planets of the solar system.

Planet     Distance (AU)   Period (years)
Mercury        0.386            0.241
Venus          0.720            0.615
Earth          1.00             1.00
Mars           1.52             1.88
Jupiter        5.19            11.9
Saturn         9.53            29.46
Uranus        19.2             83.8
Neptune       30.0            164
Pluto         39.5            248
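The reason logarithms straighten the two model types listed under Types of Transformations can be seen by taking the log of each side. This short derivation assumes a > 0, b > 0, and positive data values so that every logarithm is defined; any base works, as long as the same base is used throughout.

```latex
\begin{align*}
y = a\,b^{x} \;&\Longrightarrow\; \log y = \log a + (\log b)\,x
  && \text{linear in } x, \text{ with slope } \log b \\
y = a\,x^{p} \;&\Longrightarrow\; \log y = \log a + p\,\log x
  && \text{linear in } \log x, \text{ with slope } p
\end{align*}
```

In both cases the intercept of the straightened line is log a, so the original model is recovered by undoing the logarithm after fitting the line.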

Confirm that linear regression for this data yields evidence that the points lie fairly close to a line. Explain how you know this.

Provide evidence to confirm that a line is not an appropriate model for this data, and explain your evidence.

Develop a more appropriate model for this data, and provide supporting evidence for your choice.

HINT: Instead of trying to find a function (non-linear) to fit the curve, we take our data and transform it to make it linear and then fit a line to it.

Use your model to predict the period of an asteroid located 4.0 AU from the sun.
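One possible route through these tasks is sketched below in plain Python. It assumes a power model is the right family to try, so both variables are transformed with base-10 logarithms before a line is fitted; that choice, and the helper names `least_squares`, `correlation`, and `predict_period`, are illustrative rather than given in the text. A residual plot for the untransformed fit would still be needed to show the curved pattern.

```python
import math

# (planet, mean distance in AU, period in years), copied from the table.
planets = [
    ("Mercury", 0.386, 0.241), ("Venus", 0.720, 0.615), ("Earth", 1.00, 1.00),
    ("Mars", 1.52, 1.88), ("Jupiter", 5.19, 11.9), ("Saturn", 9.53, 29.46),
    ("Uranus", 19.2, 83.8), ("Neptune", 30.0, 164), ("Pluto", 39.5, 248),
]


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def correlation(xs, ys):
    """Return the correlation r between xs and ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


distance = [d for _, d, _ in planets]
period = [p for _, _, p in planets]

# Untransformed data: r is high, yet the relationship is curved.
r_raw = correlation(distance, period)

# Power-model transformation: log(period) against log(distance).
log_d = [math.log10(d) for d in distance]
log_p = [math.log10(p) for p in period]
a, b = least_squares(log_d, log_p)
r_log = correlation(log_d, log_p)


def predict_period(distance_au):
    """Undo the log transformation to predict a period in years."""
    return 10 ** (a + b * math.log10(distance_au))


print(f"r (raw) = {r_raw:.4f}, r (log-log) = {r_log:.6f}")
print(f"log10(period) = {a:.4f} + {b:.4f} * log10(distance)")
print(f"predicted period at 4.0 AU = {predict_period(4.0):.2f} years")
```

The fitted slope on the log-log scale is close to 1.5, which corresponds to the power model period ≈ distance^1.5, and the prediction for an asteroid at 4.0 AU is close to 8 years.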