Stat 101: Lecture 6 Summer 2006
Outline Review and Questions Example for regression Transformations, Extrapolations, and Residual Review
Mathematical model for regression Each point (X_i, Y_i) in the scatterplot satisfies: Y_i = a + b X_i + ε_i, where ε_i ~ N(0, sd = σ). σ is usually unknown. The ε's have nothing to do with one another (they are independent); e.g., a big ε_i does not imply a big ε_j. We know the X_i's exactly. This implies that all error occurs in the vertical direction.
Estimating the regression line e_i = Y_i − (a + b X_i) is called a residual. It measures the vertical distance from a point to the regression line. One estimates â and b̂ by minimizing f(a, b) = Σ_{i=1}^{n} (Y_i − (a + b X_i))². Taking the derivatives of f(a, b) with respect to a and b and setting them to 0, we get: â = Ȳ − b̂ X̄ and b̂ = [ (1/n) Σ X_i Y_i − X̄ Ȳ ] / [ (1/n) Σ X_i² − X̄² ]. f(a, b) is also referred to as the Sum of Squared Errors (SSE).
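The closed-form least-squares formulas above can be sketched in a few lines of Python. The data set here is made up purely for illustration, not taken from the lecture:

```python
# Minimal sketch of the least-squares estimators:
#   b_hat = ((1/n) * sum(X_i * Y_i) - X_bar * Y_bar) / ((1/n) * sum(X_i^2) - X_bar^2)
#   a_hat = Y_bar - b_hat * X_bar
# Data below are hypothetical, chosen only to illustrate the arithmetic.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.0, 9.9]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

b_hat = (sum(x * y for x, y in zip(X, Y)) / n - x_bar * y_bar) / \
        (sum(x * x for x in X) / n - x_bar ** 2)
a_hat = y_bar - b_hat * x_bar

print(a_hat, b_hat)  # fitted intercept and slope
```

Minimizing SSE directly (e.g. by calculus or a numerical optimizer) gives the same answer; the closed form just skips the search.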
An example A biologist wants to predict brain weight from body weight, based on a sample of 62 mammals. A scatterplot is shown below. Ecological correlation?
The regression equation is Y = 90.996 + 0.966X. The correlation is 0.9344, but it is heavily influenced by a few outliers. The sd of the residuals is 334.721; this represents the typical distance of a point from the regression line in the vertical direction. Under the Parameter Estimates portion of the printout, the last column tells whether the intercept and slope are significantly different from 0. Small numbers indicate significant differences; values less than 0.05 are usually taken to indicate real differences from zero, as opposed to chance errors.
The root mean square error (RMSE) is the standard deviation of the vertical distances between each point and the estimated line. It is an estimate of the standard deviation of the vertical distances between the observations and the true line. Formally, RMSE = sqrt( (1/n) Σ_{i=1}^{n} (Y_i − (â + b̂ X_i))² ). Note that â + b̂ X_i is the mean of the Y-values at X_i.
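The RMSE formula above translates directly into code. The data and fitted coefficients here are hypothetical, used only to show the computation:

```python
from math import sqrt

# Sketch: RMSE is the root of the average squared residual.
# Data and fitted line (a_hat, b_hat) are made up for illustration.
X = [1.0, 2.0, 3.0, 4.0]
Y = [1.2, 1.9, 3.2, 3.7]
a_hat, b_hat = 0.25, 0.89  # hypothetical fitted intercept and slope

residuals = [y - (a_hat + b_hat * x) for x, y in zip(X, Y)]
rmse = sqrt(sum(e * e for e in residuals) / len(X))
print(rmse)
```

With real output from a package such as JMP, this number is reported for you, but it is worth knowing it is nothing more than the typical vertical miss of the fitted line.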
The regression line predicts the average value of the Y values at a given X. In practice, one often wants to predict the individual value for a particular value of X. E.g., if my weight is 50 (kg), then how much would my brain weigh? The prediction (g) is Ŷ = â + b̂ X = 90.996 + 0.966 × 50 = 139.3. But this is just the average for all mammals who weigh as much as I do.
The individual value is less exact than the average value. To predict the average value, the only source of uncertainty is the exact location of the regression line (i.e., â and b̂ are estimates of the true intercept and slope). To predict my brain weight, the uncertainty about my deviation from the average is added to the uncertainty about the location of the line. For example, if I weigh 50 (kg), then my brain should weigh 139.3 (g) + ε. Assuming the regression model is correct, ε has a normal distribution with mean zero and standard deviation 334.721. Note: with this model, my brain could easily have negative weight. This should make us question the regression assumptions.
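The "negative brain weight" point above can be made concrete: under the fitted model Y = 90.996 + 0.966X with residual sd 334.721, a sizable fraction of the implied normal curve at X = 50 kg sits below zero. A sketch (using `math.erf` for the normal CDF instead of a table):

```python
from math import erf, sqrt

# Fitted model from the printout: Y = 90.996 + 0.966 X, residual sd 334.721.
a_hat, b_hat, sd = 90.996, 0.966, 334.721

mean_pred = a_hat + b_hat * 50  # average brain weight (g) at X = 50 kg

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Probability the model assigns to a NEGATIVE brain weight at 50 kg.
p_negative = normal_cdf((0 - mean_pred) / sd)
print(mean_pred, p_negative)
```

The model puts roughly a third of its probability on impossible negative weights, which is strong evidence that the untransformed regression assumptions are wrong for this data.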
Transformations The scatterplot of brain weight against body weight showed the line was probably controlled by a few large values (high-leverage points). Even worse, the scatterplot did not resemble the football-shaped point cloud that supports the regression assumptions listed before. In cases like this, one can consider transforming the response variable, the explanatory variable, or both. For this data, consider taking the base-10 logarithm of both the brain weight and the body weight. The scatterplot is much better.
Taking the log shows that the outliers are not surprising. The regression equation is now log Y = 0.908 + 0.763 log X. Now 91.23% of the variation in brain weight is explained by body weight. Both the intercept and the slope are highly significant. The estimated sd of ε is 0.317; this is the typical vertical distance between a point and the line. Making transformations is an art. Here the analysis suggests that Y = 8.1 X^0.763, so there is a power-law relationship between brain mass and body mass.
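The back-transformation from the log-scale equation to the power law can be checked numerically. A sketch, with hypothetical function names, verifying that log Y = 0.908 + 0.763 log X and Y = 10^0.908 · X^0.763 are the same model:

```python
from math import log10

# Fitted log-scale model from the lecture: log10(Y) = 0.908 + 0.763 * log10(X).
def predict_via_logs(body_kg):
    log_y = 0.908 + 0.763 * log10(body_kg)
    return 10 ** log_y

# Equivalent power-law form: Y = 10**0.908 * X**0.763, with 10**0.908 ~ 8.1.
def predict_power_law(body_kg):
    return (10 ** 0.908) * body_kg ** 0.763

print(predict_via_logs(50.0), predict_power_law(50.0))
```

This is why a straight line on the log-log scatterplot corresponds to a power law on the original scale: adding logs is multiplying on the raw scale.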
Extrapolation Predicting Y values for X values outside the range of X values observed in the data is called extrapolation. This is risky, because you have no evidence that the linear relationship seen in the scatterplot continues to hold in the new X region. Extrapolated values can be entirely wrong. For example, it would be unreliable to predict the brain weight of a blue whale or of a hog-nosed bat, whose body weights lie outside the observed range.
Residuals Estimate the regression line (using JMP software or by calculating â and b̂ by hand). Then find the difference between each observed Y_i and the predicted value Ŷ_i from the fitted line. These differences are called the residuals. Plot each difference against the corresponding X_i value. This plot is called a residual plot.
If the assumptions for linear regression hold, what should one see in the residual plot? If the pattern of the residuals around the horizontal line at zero is: curved, then the assumption of linearity is violated; fan-shaped, then the assumption of constant sd is violated; filled with many outliers, then again the assumption of constant sd is violated; showing a pattern (e.g. positive, negative, positive, negative), then the assumption of independent errors is violated.
When the residuals have a histogram that looks normal and the residual plot shows no pattern, then we can use the normal distribution to make inferences about individuals. Suppose we do not make the log transformation. What percentage of 20-kilogram mammals have brains that weigh more than 180 grams? The regression equation says that the mean brain weight for 20-kilogram animals is 90.996 + 0.966 × 20 = 110.33. The sd of the residuals is 334.721. Under the regression assumptions, the 20-kilogram mammals have brain weights that are normally distributed with mean 110.33 and sd 334.721. The z-transformation is (180 − 110.33) / 334.72 = 0.208. From the table, the area under the curve to the right of 0.208 is (100 − 15.85) / 2 = 42.075%.
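The same table lookup can be reproduced with `math.erf` in place of the normal table; the answer differs slightly from the slide's 42.075% only because the table rounds z to 0.2:

```python
from math import erf, sqrt

# Worked example above: fraction of 20 kg mammals with brains over 180 g,
# under the fitted model Y = 90.996 + 0.966 X with residual sd 334.721.
mean_20kg = 90.996 + 0.966 * 20   # mean brain weight at 20 kg, ~110.3 g
sd_resid = 334.721

z = (180 - mean_20kg) / sd_resid  # z-transformation, ~0.208

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

right_tail = 1 - normal_cdf(z)    # fraction above 180 g, ~0.42
print(z, right_tail)
```

Drawing the picture first (a normal curve centered at 110.3, shading above 180) is still the best way to keep the direction of the tail straight.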
Midterm I Instructions We will have Midterm I on Thursday, July 13th. The exam is 12:30pm - 2:30pm. Do not be late! Office hour: 10:00am - 12:00pm, Wednesday, July 12th, 211 Old Chem. The exam will cover all the material we have discussed so far. The exam is open book, open lecture. You can use a laptop if you wish. If you choose to type, you must send your answers as an attachment to my email fei@stat.duke.edu by 2:30pm; otherwise, the answers will not be accepted. The questions are expected to be similar to the exercises / review exercises / quiz 1. You should be able to finish the exam in 2 hours. When time is up, put your pens / pencils down while I collect the answers. Otherwise, you will get a score of 0.
Designed Experiments and Observational Studies Double-blind, randomized, controlled studies versus observational studies. Drug-placebo study. Lung cancer and smoking. Association does not imply causation. Confounding factors. Subgroup studies or weighted averages can help to understand confounding factors.
Descriptive Statistics Central tendency: mean, median (quantile, percentile), mode. Dispersion: standard deviation, range, IQR. Histograms, boxplots, and scatterplots.
Normal Distributions Use of the normal table. For a normal distribution, the probability that you observe a value within 1 sd is 68%, within 2 sd is 95%, and within 3 sd is 99.7%. Use of the z-transformation. Always draw pictures.
Correlation Correlation r measures the linear association between two variables. Calculate the correlation by z-transformation. r² is the coefficient of determination; it is the proportion of the variation in Y that is explained by X. No linear association does not imply no association. And association is not causation. Ecological correlation may be misleading.
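"Calculate the correlation by z-transformation" means: r is the average product of the z-scores of X and Y. A sketch with made-up data (using the 1/n, population-sd convention for the z-scores):

```python
from math import sqrt

# r as the average product of z-scores. Data are hypothetical.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.1, 5.9, 8.2, 9.8]

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Population sd (divide by n), matching the 1/n z-score convention."""
    m = mean(v)
    return sqrt(sum((x - m) ** 2 for x in v) / len(v))

zx = [(x - mean(X)) / sd(X) for x in X]
zy = [(y - mean(Y)) / sd(Y) for y in Y]
r = sum(a * b for a, b in zip(zx, zy)) / len(X)
print(r, r ** 2)  # correlation and coefficient of determination
```

With these data r is very close to 1, and r² gives the fraction of the variation in Y explained by the line.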
Regression Fit the best line to the data. Regression effect in the test-retest example. The formula for regression is Y_i = a + b X_i + ε_i. We are assuming ε_i ~ N(0, sd = σ), and the ε's are independent.
Residuals: e_i = Y_i − (a + b X_i). Find the regression line by minimizing the Sum of Squared Errors (SSE), f(a, b) = Σ_{i=1}^{n} (Y_i − (a + b X_i))². The Least Squares Estimators (LSE) are â = Ȳ − b̂ X̄ and b̂ = [ (1/n) Σ X_i Y_i − X̄ Ȳ ] / [ (1/n) Σ X_i² − X̄² ]. The estimated residuals are ê_i = Y_i − (â + b̂ X_i). Data transformation. Extrapolation is risky.