Applied Linear Regression



Applied Linear Regression
Third Edition

SANFORD WEISBERG
University of Minnesota
School of Statistics
Minneapolis, Minnesota

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2005 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the U.S. or outside the U.S.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Weisberg, Sanford, 1947-
Applied linear regression / Sanford Weisberg. - 3rd ed.
p. cm. - (Wiley series in probability and statistics)
Includes bibliographical references and index.
ISBN (acid-free paper)
1. Regression analysis. I. Title. II. Series.
QA278.2.W

Printed in the United States of America

To Carol, Stephanie, and to the memory of my parents

Contents

Preface

1 Scatterplots and Regression: Scatterplots; Mean Functions; Variance Functions; Summary Graph; Tools for Looking at Scatterplots; Size; Transformations; Smoothers for the Mean Function; Scatterplot Matrices; Problems

2 Simple Linear Regression: Ordinary Least Squares Estimation; Least Squares Criterion; Estimating σ²; Properties of Least Squares Estimates; Estimated Variances; Comparing Models: The Analysis of Variance; The F-Test for Regression; Interpreting p-values; Power of Tests; The Coefficient of Determination, R²; Confidence Intervals and Tests; The Intercept; Slope; Prediction; Fitted Values; The Residuals; Problems

3 Multiple Regression: Adding a Term to a Simple Linear Regression Model; Explaining Variability; Added-Variable Plots; The Multiple Linear Regression Model; Terms and Predictors; Ordinary Least Squares; Data and Matrix Notation; Variance-Covariance Matrix of e; Ordinary Least Squares Estimators; Properties of the Estimates; Simple Regression in Matrix Terms; The Analysis of Variance; The Coefficient of Determination; Hypotheses Concerning One of the Terms; Relationship to the t-Statistic; t-Tests and Added-Variable Plots; Other Tests of Hypotheses; Sequential Analysis of Variance Tables; Predictions and Fitted Values; Problems

4 Drawing Conclusions: Understanding Parameter Estimates; Rate of Change; Signs of Estimates; Interpretation Depends on Other Terms in the Mean Function; Rank Deficient and Over-Parameterized Mean Functions; Tests; Dropping Terms; Logarithms; Experimentation Versus Observation; Sampling from a Normal Population; More on R²; Simple Linear Regression and R²; Multiple Linear Regression; Regression through the Origin; Missing Data; Missing at Random; Alternatives; Computationally Intensive Methods; Regression Inference without Normality; Nonlinear Functions of Parameters; Predictors Measured with Error; Problems

5 Weights, Lack of Fit, and More: Weighted Least Squares; Applications of Weighted Least Squares; Additional Comments; Testing for Lack of Fit, Variance Known; Testing for Lack of Fit, Variance Unknown; General F Testing; Non-null Distributions; Additional Comments; Joint Confidence Regions; Problems

6 Polynomials and Factors: Polynomial Regression; Polynomials with Several Predictors; Using the Delta Method to Estimate a Minimum or a Maximum; Fractional Polynomials; Factors; No Other Predictors; Adding a Predictor: Comparing Regression Lines; Additional Comments; Many Factors; Partial One-Dimensional Mean Functions; Random Coefficient Models; Problems

7 Transformations: Transformations and Scatterplots; Power Transformations; Transforming Only the Predictor Variable; Transforming the Response Only; The Box and Cox Method; Transformations and Scatterplot Matrices; The 1D Estimation Result and Linearly Related Predictors; Automatic Choice of Transformation of Predictors; Transforming the Response; Transformations of Nonpositive Variables; Problems

8 Regression Diagnostics: Residuals: The Residuals; Difference Between ê and e; The Hat Matrix; Residuals and the Hat Matrix with Weights; The Residuals When the Model Is Correct; The Residuals When the Model Is Not Correct; Fuel Consumption Data; Testing for Curvature; Nonconstant Variance; Variance Stabilizing Transformations; A Diagnostic for Nonconstant Variance; Additional Comments; Graphs for Model Assessment; Checking Mean Functions; Checking Variance Functions; Problems

9 Outliers and Influence: Outliers; An Outlier Test; Weighted Least Squares; Significance Levels for the Outlier Test; Additional Comments; Influence of Cases; Cook's Distance; Magnitude of D_i; Computing D_i; Other Measures of Influence; Normality Assumption; Problems

10 Variable Selection: The Active Terms; Collinearity; Collinearity and Variances; Variable Selection; Information Criteria; Computationally Intensive Criteria; Using Subject-Matter Knowledge; Computational Methods; Subset Selection Overstates Significance; Windmills; Six Mean Functions; A Computationally Intensive Approach; Problems

11 Nonlinear Regression: Estimation for Nonlinear Mean Functions; Inference Assuming Large Samples; Bootstrap Inference; References; Problems

12 Logistic Regression: Binomial Regression; Mean Functions for Binomial Regression; Fitting Logistic Regression; One-Predictor Example; Many Terms; Deviance; Goodness-of-Fit Tests; Binomial Random Variables; Maximum Likelihood Estimation; The Log-Likelihood for Logistic Regression; Generalized Linear Models; Problems

Appendix: A.1 Web Site; A.2 Means and Variances of Random Variables (A.2.1 E Notation; A.2.2 Var Notation; A.2.3 Cov Notation; A.2.4 Conditional Moments); A.3 Least Squares for Simple Regression; A.4 Means and Variances of Least Squares Estimates; A.5 Estimating E(Y|X) Using a Smoother; A.6 A Brief Introduction to Matrices and Vectors (A.6.1 Addition and Subtraction; A.6.2 Multiplication by a Scalar; A.6.3 Matrix Multiplication; A.6.4 Transpose of a Matrix; A.6.5 Inverse of a Matrix; A.6.6 Orthogonality; A.6.7 Linear Dependence and Rank of a Matrix); A.7 Random Vectors; A.8 Least Squares Using Matrices (A.8.1 Properties of Estimates; A.8.2 The Residual Sum of Squares; A.8.3 Estimate of Variance); A.9 The QR Factorization; A.10 Maximum Likelihood Estimates; A.11 The Box-Cox Method for Transformations (A.11.1 Univariate Case; A.11.2 Multivariate Case); A.12 Case Deletion in Linear Regression

References
Author Index
Subject Index

Preface

Regression analysis answers questions about the dependence of a response variable on one or more predictors, including prediction of future values of a response, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response. At the publication of the second edition of this book about 20 years ago, regression analysis using least squares was essentially the only methodology available to analysts interested in questions like these. Cheap, widely available high-speed computing has changed the rules for examining these questions. Modern competitors include nonparametric regression, neural networks, support vector machines, and tree-based methods, among others. A new field of computer science, called machine learning, adds diversity, and confusion, to the mix. With the availability of software, using a neural network or any of these other methods seems to be just as easy as using linear regression. So, a reasonable question to ask is: Who needs a revised book on linear regression using ordinary least squares when all these other newer and, presumably, better methods exist?

This question has several answers. First, most other modern regression modeling methods are really just elaborations or modifications of linear regression modeling. To understand, as opposed to use, neural networks or the support vector machine is nearly impossible without a good understanding of linear regression methodology. Second, linear regression methodology is relatively transparent, as will be seen throughout this book. We can draw graphs that will generally allow us to see relationships between variables and decide whether the models we are using make any sense. Many of the more modern methods are much like a black box in which data are stuffed in at one end and answers pop out at the other, without much hope for the nonexpert to understand what is going on inside the box.

Third, if you know how to do something in linear regression, the same methodology with only minor adjustments will usually carry over to other regression-type problems for which least squares is not appropriate. For example, the methodology for comparing response curves for different values of a treatment variable when the response is continuous is studied in Chapter 6 of this book. Analogous methodology can be used when the response is a possibly censored survival time, even though the method of fitting needs to be appropriate for the censored response and not least squares. The methodology of Chapter 6 is useful both in its

own right when applied to linear regression problems and as a set of core ideas that can be applied in other settings.

Probably the most important reason to learn about linear regression and least squares estimation is that even with all the new alternatives most analyses of data continue to be based on this older paradigm. And why is this? The primary reason is that it works: least squares regression provides good, and useful, answers to many problems. Pick up the journals in any area where data are commonly used for prediction or estimation and the dominant method used will be linear regression with least squares estimation.

What's New in this Edition

Many of the examples and homework data sets from the second edition have been kept, although some have been updated. The fuel consumption data, for example, now uses 2001 values rather than 1974 values. Most of the derivations are the same as in the second edition, although the order of presentation is somewhat different. To keep the length of the book nearly unchanged, methods that failed to gain general usage have been deleted, as have the separate chapters on prediction and missing data. These latter two topics have been integrated into the remaining text.

The continuing theme of the second edition was the need for diagnostic methods, in which fitted models are analyzed for deficiencies, through analysis of residuals and influence. This emphasis was unusual when the second edition was published, and important quantities like Studentized residuals and Cook's distance were not readily available in the commercial software of the time. Times have changed, and so has the emphasis of this book. This edition stresses graphical methods, including looking at data both before and after fitting models. This is reflected immediately in the new Chapter 1, which introduces the key idea of looking at data with scatterplots and the somewhat less universal tool of scatterplot matrices. Most analyses and homework problems start with drawing graphs. We tailor analyses to correspond to what we see in the graphs, and this additional step can make modeling easier and fitted models reflect the data more closely. Remarkably, this also lessens the need for diagnostic methods.

The emphasis on graphs leads to several additional methods and procedures that were not included in the second edition. The use of smoothers to help summarize a scatterplot is introduced early, although only a little of the theory of smoothing is presented (in Appendix A.5). Transformations of predictors and the response are stressed, and relatively unfamiliar methods based both on smoothing and on generalization of the Box-Cox method are presented in Chapter 7. Another new topic included in the book is computationally intensive methods and simulation. The key example of this is the bootstrap, in Section 4.6, which can be used to make inferences about fitted models in small samples. A somewhat different computationally intensive method is used in an example in Chapter 10, which is a completely rewritten chapter on variable selection.

The book concludes with two expanded chapters on nonlinear and logistic regression, both of which are generalizations of the linear regression model. I have

included these chapters to provide instructors and students with enough information for basic usage of these models and to take advantage of the intuition gained about them from an in-depth study of the linear regression model. Each of these can be treated at book-length, and appropriate references are given.

Mathematical Level

The mathematical level of this book is roughly the same as the level of the second edition. Matrix representation of data is used, particularly in the derivation of the methodology in Chapters 2-4. Derivations are less frequent in later chapters, and so the necessary mathematics is less. Calculus is generally not required, except for an occasional use of a derivative, for the discussion of the delta method, Section 6.1.2, and for a few topics in the Appendix. The discussions requiring calculus can be skipped without much loss.

Computing and Computer Packages

Like the second edition, only passing mention is made in the book to computer packages. To help the reader make a connection between the text and a computer package for doing the computations, we provide several web companions for Applied Linear Regression that discuss how to use standard statistical packages for linear regression analysis. The packages covered include JMP, SAS, SPSS, R, and S-plus; others may be included after publication of the book. In addition, all the data files discussed in the book are also on the website.

Some readers may prefer to have a book that integrates the text more closely with a computer package, and for this purpose, I can recommend R. D. Cook and S. Weisberg (1999), Applied Regression Including Computing and Graphics, also published by John Wiley. This book includes a very user-friendly, free computer package called Arc that does everything that is described in that book and also nearly everything in Applied Linear Regression.

Teaching with this Book

The first ten chapters of the book should provide adequate material for a one-quarter course on linear regression. For a semester-length course, the last two chapters can be added. A teacher's manual, primarily giving solutions to all the homework problems, can be obtained from the publisher by instructors.

Acknowledgments

I am grateful to several people who generously shared their data for inclusion in this book; they are cited where their data appears. Charles Anderson and Don Pereira suggested several of the examples. Keija Shan, Katherine St. Clair, and Gary Oehlert helped with the website and its content. Brian Sell helped with the

examples and with many administrative chores. Several others helped with earlier editions: Christopher Bingham, Morton Brown, Cathy Campbell, Dennis Cook, Stephen Fienberg, James Frane, Seymour Geisser, John Hartigan, David Hinkley, Alan Izenman, Soren Johansen, Kenneth Koehler, David Lane, Michael Lavine, Kinley Larntz, John Rice, Donald Rubin, Joe Shih, Pete Stewart, Stephen Stigler, Douglas Tiffany, Carol Weisberg, and Howard Weisberg.

Sanford Weisberg
St. Paul, Minnesota
April 13, 2004

CHAPTER 1

Scatterplots and Regression

Regression is the study of dependence. It is used to answer questions such as: Does changing class size affect success of students? Can we predict the time of the next eruption of Old Faithful Geyser from the length of the most recent eruption? Do changes in diet result in changes in cholesterol level, and if so, do the results depend on other characteristics such as age, sex, and amount of exercise? Do countries with higher per-person income have lower birth rates than countries with lower income? Regression analysis is a central part of many research projects. In most of this book, we study the important instance of regression methodology called linear regression. These methods are the most commonly used in regression, and virtually all other regression methods build upon an understanding of how linear regression works.

As with most statistical analyses, the goal of regression is to summarize observed data as simply, usefully, and elegantly as possible. In some problems, a theory may be available that specifies how the response varies as the values of the predictors change. In other problems, a theory may be lacking, and we need to use the data to help us decide on how to proceed. In either case, an essential first step in regression analysis is to draw appropriate graphs of the data.

In this chapter, we discuss the fundamental graphical tool for looking at regression data, a two-dimensional scatterplot. In regression problems with one predictor and one response, the scatterplot of the response versus the predictor is the starting point for regression analysis. In problems with many predictors, several simple graphs will be required at the beginning of an analysis. A scatterplot matrix is a convenient way to organize looking at many scatterplots at once. We will look at several examples to introduce the main tools for looking at scatterplots and scatterplot matrices and extracting information from them. We will also introduce the notation that will be used throughout the rest of the book.

1.1 SCATTERPLOTS

We begin with a regression problem with one predictor, which we will generically call X, and one response variable, which we will call Y. Data consists of

values (x_i, y_i), i = 1, ..., n, of (X, Y) observed on each of n units or cases. In any particular problem, both X and Y will have other names, such as Temperature or Concentration, that are more descriptive of the data that is to be analyzed. The goal of regression is to understand how the values of Y change as X is varied over its range of possible values. A first look at how Y changes as X is varied is available from a scatterplot.

Inheritance of Height

One of the first uses of regression was to study inheritance of traits from generation to generation. E. S. Pearson organized the collection of n = 1375 heights of mothers in the United Kingdom under the age of 65 and one of their adult daughters over the age of 18. Pearson and Lee (1903) published the data, and we shall use these data to examine inheritance. The data are given in the data file heights.txt (see Appendix A.1 for instructions for getting data files from the Internet). Our interest is in inheritance from the mother to the daughter, so we view the mother's height, called Mheight, as the predictor variable and the daughter's height, Dheight, as the response variable. Do taller mothers tend to have taller daughters? Do shorter mothers tend to have shorter daughters? A scatterplot of Dheight versus Mheight helps us answer these questions.

The scatterplot is a graph of each of the n points with the response Dheight on the vertical axis and predictor Mheight on the horizontal axis. This plot is shown in Figure 1.1. For regression problems with one predictor X and a response Y, we call the scatterplot of Y versus X a summary graph. Here are some important characteristics of Figure 1.1:

1. The range of heights appears to be about the same for mothers and for daughters. Because of this, we draw the plot so that the lengths of the horizontal and vertical axes are the same, and the scales are the same. If all mothers and daughters had exactly the same height, then all the points would fall exactly on a 45-degree line. Some computer programs for drawing a scatterplot are not smart enough to figure out that the lengths of the axes should be the same, so you might need to resize the plot or to draw it several times.

2. The original data that went into this scatterplot was rounded so each of the heights was given to the nearest inch. If we were to plot the original data, we would have substantial overplotting, with many points at exactly the same location. This is undesirable because we will not know if one point represents one case or many cases, and this can be very misleading. The easiest solution is to use jittering, in which a small uniform random number is added to each value. In Figure 1.1, we used a uniform random number on the range from -0.5 to +0.5, so the jittered values would round to the numbers given in the original source.

3. One important function of the scatterplot is to decide if we might reasonably assume that the response on the vertical axis is independent of the predictor

FIG. 1.1  Scatterplot of mothers' and daughters' heights in the Pearson and Lee data. The original data have been jittered to avoid overplotting, but if rounded to the nearest inch would return the original data provided by Pearson and Lee.

on the horizontal axis. This is clearly not the case here, since as we move across Figure 1.1 from left to right, the scatter of points is different for each value of the predictor. What we mean by this is shown in Figure 1.2, in which we show only points corresponding to mother-daughter pairs with Mheight rounding to either 58, 64, or 68 inches. We see that within each of these three strips or slices, even though the number of points is different within each slice, (a) the mean of Dheight is increasing from left to right, and (b) the vertical variability in Dheight seems to be more or less the same for each of the fixed values of Mheight.

4. The scatter of points in the graph appears to be more or less elliptically shaped, with the axis of the ellipse tilted upward. We will see in Section 4.3 that summary graphs that look like this one suggest use of the simple linear regression model that will be discussed in Chapter 2.

5. Scatterplots are also important for finding separated points, which are either points with values on the horizontal axis that are well separated from the other points or points with values on the vertical axis that, given the value on the horizontal axis, are either much too large or too small. In terms of this example, this would mean looking for very tall or short mothers or, alternatively, for daughters who are very tall or short, given the height of their mother.
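Two of the ideas above, jittering rounded data (point 2) and slicing the plot at fixed predictor values (point 3), can be sketched in a few lines of Python. The heights below are simulated stand-ins, not the actual Pearson and Lee data; the coefficients are assumptions chosen only to mimic the pattern in Figure 1.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the Pearson-Lee data: heights recorded to the
# nearest inch, with Dheight tending upward in Mheight (assumed values).
n = 3000
mheight = rng.choice([58.0, 61.0, 64.0, 68.0], size=n)
dheight = np.round(30 + 0.55 * mheight + rng.normal(0, 2.3, size=n))

# Jittering: add uniform noise on [-0.5, 0.5) so that rounding the
# jittered values returns the recorded (integer-inch) heights.
jittered = dheight + rng.uniform(-0.5, 0.5, size=n)
assert np.array_equal(np.round(jittered), dheight)

# Slicing: within each strip of Mheight, the mean of Dheight increases
# from left to right while the spread stays roughly constant.
for m in (58.0, 64.0, 68.0):
    d = dheight[mheight == m]
    print(f"Mheight={m:.0f}: mean={d.mean():.1f}, SD={d.std():.1f}")
```

Plotting `jittered` against a similarly jittered `mheight` would reproduce the texture of Figure 1.1; the printed slice summaries are the numerical version of Figure 1.2.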

FIG. 1.2  Scatterplot showing only pairs with mother's height that rounds to 58, 64, or 68 inches.

These two types of separated points have different names and roles in a regression problem. Extreme values on the left and right of the horizontal axis are points that are likely to be important in fitting regression models and are called leverage points. The separated points on the vertical axis, here unusually tall or short daughters given their mother's height, are potentially outliers, cases that are somehow different from the others in the data. While the data in Figure 1.1 do include a few tall and a few short mothers and a few tall and short daughters, given the height of the mothers, none appears worthy of special treatment, mostly because in a sample size this large we expect to see some fairly unusual mother-daughter pairs. We will continue with this example later.

Forbes Data

In an 1857 article, a Scottish physicist named James D. Forbes discussed a series of experiments that he had done concerning the relationship between atmospheric pressure and the boiling point of water. He knew that altitude could be determined from atmospheric pressure, measured with a barometer, with lower pressures corresponding to higher altitudes. In the middle of the nineteenth century, barometers were fragile instruments, and Forbes wondered if a simpler measurement of the boiling point of water could substitute for a direct reading of barometric pressure. Forbes

collected data in the Alps and in Scotland. He measured at each location pressure in inches of mercury with a barometer and boiling point in degrees Fahrenheit using a thermometer. Boiling point measurements were adjusted for the difference between the ambient air temperature when he took the measurements and a standard temperature. The data for n = 17 locales are reproduced in the file forbes.txt.

The scatterplot of Pressure versus Temp is shown in Figure 1.3a.

FIG. 1.3  Forbes data. (a) Pressure versus Temp; (b) Residuals versus Temp.

The general appearance of this plot is very different from the summary graph for the heights data. First, the sample size is only 17, as compared to over 1300 for the heights data. Second, apart from one point, all the points fall almost exactly on a smooth curve. This means that the variability in pressure for a given temperature is extremely small.

The points in Figure 1.3a appear to fall very close to the straight line shown on the plot, and so we might be encouraged to think that the mean of pressure given temperature could be modelled by a straight line. Look closely at the graph, and you will see that there is a small systematic error with the straight line: apart from the one point that does not fit at all, the points in the middle of the graph fall below the line, and those at the highest and lowest temperatures fall above the line. This is much easier to see in Figure 1.3b, which is obtained by removing the linear trend from Figure 1.3a, so the plotted points on the vertical axis are given for each value of Temp by

    Residual = Pressure - point on the line

This allows us to gain resolution in the plot, since the range on the vertical axis in Figure 1.3a is about 10 inches of mercury while the range in Figure 1.3b is about 0.8 inches of mercury. To get the same resolution in Figure 1.3a, we would need a graph that is 10/0.8 = 12.5 times as big as Figure 1.3b. Again ignoring the one point that clearly does not match the others, the curvature in the plot is clearly visible in Figure 1.3b.
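The residual computation just described can be sketched as follows. The temperatures and coefficients below are hypothetical stand-ins for Forbes's measurements, chosen so that log10(Pressure) is exactly linear in Temp on roughly the right scale; the `fit_line` helper is ours, not from the book.

```python
import numpy as np

# Hypothetical (Temp, Pressure) pairs: log10(Pressure) exactly linear in
# Temp, roughly on the scale of Forbes's data (deg F, inches of mercury).
temp = np.linspace(194.0, 212.0, 17)
pressure = 10 ** (-0.385 + 0.00878 * temp)

def fit_line(x, y):
    """Ordinary least squares fit of y on x: returns (intercept, slope)."""
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    return y.mean() - slope * x.mean(), slope

# Residual = Pressure minus the point on the fitted straight line.
b0, b1 = fit_line(temp, pressure)
resid = pressure - (b0 + b1 * temp)

# The systematic pattern of Figure 1.3b: residuals in the middle of the
# temperature range fall below the line, those at the extremes above it.
assert resid[0] > 0 and resid[-1] > 0 and resid[8] < 0

# After a log transformation the straight line fits (here, exactly by
# construction), so the residuals show no pattern, as in Figure 1.4b.
c0, c1 = fit_line(temp, np.log10(pressure))
log_resid = np.log10(pressure) - (c0 + c1 * temp)
assert np.max(np.abs(log_resid)) < 1e-8
```

Because an exponential curve is convex, fitting a straight line to it necessarily leaves positive residuals at the ends and negative residuals in the middle, which is exactly the pattern the text describes.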

FIG. 1.4  (a) Scatterplot of Forbes data. The line shown is the ols line for the regression of log(Pressure) on Temp. (b) Residuals versus Temp.

While there is nothing at all wrong with curvature, the methods we will be studying in this book work best when the plot can be summarized by a straight line. Sometimes we can get a straight line by transforming one or both of the plotted quantities. Forbes had a physical theory that suggested that log(Pressure) is linearly related to Temp. Forbes (1857) contains what may be the first published summary graph corresponding to his physical model. His figure is redrawn in Figure 1.4. Following Forbes, we use base-ten common logs in this example, although in most of the examples in this book we will use base-two logarithms. The choice of base has no material effect on the appearance of the graph or on fitted regression models, but interpretation of parameters can depend on the choice of base, and using base-two often leads to a simpler interpretation for parameters.

The key feature of Figure 1.4a is that apart from one point the data appear to fall very close to the straight line shown on the figure, and the residual plot in Figure 1.4b confirms that the deviations from the straight line are not systematic the way they were in Figure 1.3b. All this is evidence that the straight line is a reasonable summary of these data.

Length at Age for Smallmouth Bass

The smallmouth bass is a favorite game fish in inland lakes. Many smallmouth bass populations are managed through stocking, fishing regulations, and other means, with a goal to maintain a healthy population. One tool in the study of fish populations is to understand the growth pattern of fish, such as the dependence of a measure of size like fish length on age of the fish. Managers could compare these relationships between different populations with dissimilar management plans to learn how management impacts fish growth.

Figure 1.5 displays the Length at capture in mm versus Age at capture for n = 439 smallmouth bass measured in West Bearskin Lake in northeastern Minnesota. Only fish of age seven or less are included in this graph. The data were provided by the Minnesota Department of Natural Resources and are given in the

file wblake.txt.

FIG. 1.5  Length (mm) versus Age for West Bearskin Lake smallmouth bass. The solid line shown was estimated using ordinary least squares or ols. The dashed line joins the average observed length at each age.

Fish scales have annular rings like trees, and these can be counted to determine the age of a fish. These data are cross-sectional, meaning that all the observations were taken at the same time. In a longitudinal study, the same fish would be measured each year, possibly requiring many years of taking measurements. The data file gives the Length in mm, Age in years, and the Scale radius, also in mm.

The appearance of this graph is different from the summary plots shown for the last two examples. The predictor Age can only take on integer values corresponding to the number of annular rings on the scale, so we are really plotting seven distinct populations of fish. As might be expected, length generally increases with age, but the longest age-one fish exceeds the length of the shortest age-four fish, so knowing the age of a fish will not allow us to predict its length exactly; see Problem 2.5.

Predicting the Weather

Can early season snowfall from September 1 until December 31 predict snowfall in the remainder of the year, from January 1 to June 30? Figure 1.6, using data from the data file ftcollinssnow.txt, gives a plot of Late season snowfall from January 1 to June 30 versus Early season snowfall for the period September 1 to December 31 of the previous year, both measured in inches at Ft. Collins, Colorado.² If Late is related to Early, the relationship is considerably weaker than

² The data are from a public domain source.

FIG. 1.6  Plot of snowfall for 93 years from 1900 to 1992 in inches. The solid horizontal line is drawn at the average late season snowfall. The dashed line is the best fitting (ordinary least squares) line of arbitrary slope.

in the previous examples, and the graph suggests that early winter snowfall and late winter snowfall may be completely unrelated, or uncorrelated. Interest in this regression problem will therefore be in testing the hypothesis that the two variables are uncorrelated versus the alternative that they are not uncorrelated, essentially comparing the fit of the two lines shown in Figure 1.6. Fitting models will be helpful here.

Turkey Growth

This example is from an experiment on the growth of turkeys (Noll, Weibel, Cook, and Witmer, 1984). Pens of turkeys were grown with an identical diet, except that each pen was supplemented with a Dose of the amino acid methionine as a percentage of the total diet of the birds. The methionine was provided using either a standard source or one of two experimental sources. The response is average weight gain in grams of all the turkeys in the pen.

Figure 1.7 provides a summary graph based on the data in the file turkey.txt. Except at Dose = 0, each point in the graph is the average response of five pens of turkeys; at Dose = 0, there were ten pens of turkeys. Because averages are plotted, the graph does not display the variation between pens treated alike. At each value of Dose > 0, there are three points shown, with different symbols corresponding to the three sources of methionine, so the variation between points at a given Dose is really the variation between sources. At Dose = 0, the point has been arbitrarily labelled with the symbol for the first group, since Dose = 0 is the same treatment for all sources.

For now, ignore the three sources and examine Figure 1.7 in the way we have been examining the other summary graphs in this chapter. Weight gain seems

FIG. 1.7 Weight gain (g) versus Dose of methionine (percent of diet) for turkeys. The three symbols for the points refer to three different sources of methionine.

to increase with increasing Dose, but the increase does not appear to be linear, meaning that a straight line does not seem to be a reasonable representation of the average dependence of the response on the predictor. This leads to study of mean functions.

1.2 MEAN FUNCTIONS

Imagine a generic summary plot of Y versus X. Our interest centers on how the distribution of Y changes as X is varied. One important aspect of this distribution is the mean function, which we define by

E(Y|X = x) = a function that depends on the value of x   (1.1)

We read the left side of this equation as "the expected value of the response when the predictor is fixed at the value X = x"; if the notation E(·) for expectations and Var(·) for variances is unfamiliar, please read Appendix A.2. The right side of (1.1) depends on the problem. For example, in the heights data in Example 1.1, we might believe that

E(Dheight|Mheight = x) = β0 + β1x   (1.2)

that is, the mean function is a straight line. This particular mean function has two parameters, an intercept β0 and a slope β1. If we knew the values of the βs, then the mean function would be completely specified, but usually the βs need to be estimated from data.

Figure 1.8 shows two possibilities for the βs in the straight-line mean function (1.2) for the heights data. For the dashed line, β0 = 0 and β1 = 1. This mean function

FIG. 1.8 The heights data. The dashed line is for E(Dheight|Mheight) = Mheight, and the solid line is estimated by OLS.

would suggest that daughters have the same height as their mothers on average. The second line is estimated using ordinary least squares, or OLS, the estimation method that will be described in the next chapter. The OLS line has slope less than one, meaning that tall mothers tend to have daughters who are taller than average because the slope is positive, but shorter than themselves because the slope is less than one. Similarly, short mothers tend to have short daughters, but taller than themselves. This is perhaps a surprising result and is the origin of the term regression, since extreme values in one generation tend to revert, or regress, toward the population mean in the next generation.

Two lines are shown in Figure 1.5 for the smallmouth bass data. The dashed line joins the average length at each age. It provides an estimate of the mean function E(Length|Age) without actually specifying any functional form for the mean function. We will call this a nonparametric estimated mean function; sometimes we will call it a smoother. The solid line is the OLS estimate of the straight-line mean function (1.2). Perhaps surprisingly, the straight line and the dashed lines that join the within-age means appear to agree very closely, and we might be encouraged to use the straight-line mean function to describe these data. This would mean that the increase in length per year is the same for all ages. We cannot expect this to be true if we were to include older-aged fish because eventually the growth rate must slow down. For the range of ages here, the approximation seems to be adequate.

For the Ft. Collins weather data, we might expect the straight-line mean function (1.2) to be appropriate but with β1 = 0. If the slope is zero, then the mean function is parallel to the horizontal axis, as shown in Figure 1.6.
We will eventually test for independence of Early and Late by testing the hypothesis that β1 = 0 against the alternative hypothesis that β1 ≠ 0.
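The "regression toward the mean" effect described for the heights data can be imitated with simulated data. This is a hypothetical sketch, not the real heights file: the population mean of 64 inches, the spreads, and the halfway pull toward the mean are all invented for illustration. The fitted OLS slope comes out positive but less than one.

```python
import random

# Hypothetical illustration of "regression toward the mean" using simulated
# mother/daughter heights; these are invented numbers, NOT the heights data
# analyzed in the book.
random.seed(1)
n = 2000
mheight = [64 + random.gauss(0, 2.5) for _ in range(n)]
# each daughter keeps half of her mother's deviation from the mean (an
# assumed pull of 0.5), plus independent noise
dheight = [64 + 0.5 * (m - 64) + random.gauss(0, 2.0) for m in mheight]

def ols(x, y):
    """Slope and intercept by ordinary least squares."""
    nn = len(x)
    xbar, ybar = sum(x) / nn, sum(y) / nn
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

b0, b1 = ols(mheight, dheight)
# b1 is positive but less than one: tall mothers tend to have daughters
# taller than average yet shorter than themselves
```

Because the simulated daughters retain only half of each mother's deviation from the mean, the fitted slope lands near 0.5, reproducing the pattern described in the text.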

Not all summary graphs will have a straight-line mean function. In Forbes' data, to achieve linearity we have replaced the measured value of Pressure by log(Pressure). Transformation of variables will be a key tool in extending the usefulness of linear regression models. In the turkey data and other growth models, a nonlinear mean function might be more appropriate, such as

E(Y|Dose = x) = β0 + β1[1 − exp(−β2x)]   (1.3)

The βs in (1.3) have a useful interpretation, and they can be used to summarize the experiment. When Dose = 0, E(Y|Dose = 0) = β0, so β0 is the baseline growth without supplementation. Assuming β2 > 0, when the Dose is large, exp(−β2Dose) is small, and so E(Y|Dose) approaches β0 + β1 for large Dose. We think of β0 + β1 as the limit to growth with this additive. The rate parameter β2 determines how quickly maximum growth is achieved. This three-parameter mean function will be considered in a later chapter.

1.3 VARIANCE FUNCTIONS

Another characteristic of the distribution of the response given the predictor is the variance function, defined by the symbol Var(Y|X = x) and in words as the variance of the response distribution given that the predictor is fixed at X = x. For example, in Figure 1.2 we can see that the variance function for Dheight|Mheight is approximately the same for each of the three values of Mheight shown in the graph. In the smallmouth bass data in Figure 1.5, an assumption that the variance is constant across the plot is plausible, even if it is not certain (see Problem 1.1). In the turkey data, we cannot say much about the variance function from the summary plot because we have plotted treatment means rather than the actual pen values, so the graph does not display the information about the variability between pens that have a fixed value of Dose.

A frequent assumption in fitting linear regression models is that the variance function is the same for every value of x. This is usually written as

Var(Y|X = x) = σ²   (1.4)

where σ² (read "sigma squared") is a generally unknown positive constant.
We will encounter later in this book other problems with complicated variance functions.

1.4 SUMMARY GRAPH

In all the examples except the snowfall data, there is a clear dependence of the response on the predictor. In the snowfall example, there might be no dependence at all. The turkey growth example is different from the others because the average value of the response seems to change nonlinearly with the value of the predictor on the horizontal axis.
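The behavior claimed for the nonlinear mean function (1.3) — baseline β0 at Dose = 0 and approach to the limit β0 + β1 as Dose grows — can be seen by evaluating it directly. The parameter values below are invented for illustration; they are not estimates from the turkey data.

```python
import math

# The nonlinear mean function (1.3) for a growth model. The parameter
# values (b0, b1, b2) are made-up illustrative choices, NOT estimates
# from the turkey experiment.
def mean_gain(dose, b0=620.0, b1=180.0, b2=8.0):
    """E(Y | Dose = x) = b0 + b1 * (1 - exp(-b2 * x))."""
    return b0 + b1 * (1.0 - math.exp(-b2 * dose))

baseline = mean_gain(0.0)     # equals b0: growth without supplementation
near_limit = mean_gain(10.0)  # approaches the limit b0 + b1 for large Dose
```

With β2 > 0 the function rises monotonically from β0 toward β0 + β1, and larger β2 makes the approach to the limit faster, matching the interpretation given in the text.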

TABLE 1.1 Four Hypothetical Data Sets (columns X1, Y1, Y2, Y3, X2, Y4). The data are given in the file anscombe.txt.

The scatterplots for these examples are all typical of graphs one might see in problems with one response and one predictor. Examination of the summary graph is a first step in exploring the relationships these graphs portray.

Anscombe (1973) provided the artificial data given in Table 1.1 that consist of 11 pairs of points (xi, yi), to which the simple linear regression mean function E(y|x) = β0 + β1x is fit. Each data set leads to an identical summary analysis with the same estimated slope, intercept, and other summary statistics, but the visual impression of each of the graphs is very different. The first example in Figure 1.9a is as one might expect to observe if the simple linear regression model were appropriate. The graph of the second data set given in Figure 1.9b suggests that the analysis based on simple linear regression is incorrect and that a smooth curve, perhaps a quadratic polynomial, could be fit to the data with little remaining variability. Figure 1.9c suggests that the prescription of simple regression may be correct for most of the data, but one of the cases is too far away from the fitted regression line. This is called the outlier problem. Possibly the case that does not match the others should be deleted from the data set, and the regression should be refit from the remaining ten cases. This will lead to a different fitted line. Without a context for the data, we cannot judge one line correct and the other incorrect. The final set graphed in Figure 1.9d is different from the other three in that there is not enough information to make a judgment concerning the mean function. If the eighth case were deleted, we could not even estimate a slope. We must distrust an analysis that is so heavily dependent upon a single case.
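Because Anscombe's four data sets are published and widely reproduced, the claim of an identical summary analysis can be checked directly: an OLS fit to each set gives essentially the same intercept (about 3.0) and slope (about 0.5). The small `ols` helper below is our own, not from the book's software.

```python
# Anscombe's (1973) four data sets, using the widely reproduced published
# values; sets 1-3 share the same x's, set 4 has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

def ols(x, y):
    """Intercept and slope by ordinary least squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

fits = [ols(x, y) for x, y in zip(xs, ys)]
# all four fits are near intercept 3.0 and slope 0.5, even though the
# four scatterplots of Figure 1.9 look completely different
```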
1.5 TOOLS FOR LOOKING AT SCATTERPLOTS

Because looking at scatterplots is so important to fitting regression models, we establish some common vocabulary for describing the information in them and some tools to help us extract the information they contain.

FIG. 1.9 Four hypothetical data sets (from Anscombe, 1973).

The summary graph is of the response Y versus the predictor X. The mean function for the graph is defined by (1.1), and it characterizes how Y changes on the average as the value of X is varied. We may have a parametric model for the mean function and will use data to estimate the parameters. The variance function also characterizes the graph, and in many problems we will assume, at least at first, that the variance function is constant. The scatterplot also will highlight separated points that may be of special interest because they do not fit the trend determined by the majority of the points. A null plot has a constant mean function, constant variance function, and no separated points. The scatterplot for the snowfall data appears to be a null plot.

1.5.1 Size

To extract all the available information from a scatterplot, we may need to interact with it by changing scales, by resizing, or by removing linear trends. An example of this is given in Problem 1.2.

1.5.2 Transformations

In some problems, either or both of Y and X can be replaced by transformations so the summary graph has desirable properties. Most of the time, we will use power transformations, replacing, for example, X by X^λ for some number λ. Because logarithmic transformations are so frequently used, we will interpret λ = 0 as corresponding to a log transform. In this book, we will generally use logs to the base two, but if your computer program does not permit the use of base-two logarithms, any other base, such as base-ten or natural logarithms, is equivalent.

1.5.3 Smoothers for the Mean Function

In the smallmouth bass data in Figure 1.5, we computed an estimate of E(Length|Age) using a simple nonparametric smoother obtained by averaging the repeated observations at each value of Age. Smoothers can also be defined when we do not have repeated observations at values of the predictor by averaging the observed data for all values of X close to, but not necessarily equal to, x. The literature on using smoothers to estimate mean functions has exploded in recent years, with good, fairly elementary treatments given by Härdle (1990), Simonoff (1996), Bowman and Azzalini (1997), and Green and Silverman (1994). Although these authors discuss nonparametric regression as an end in itself, we will generally use smoothers as plot enhancements to help us understand the information available in a scatterplot and to help calibrate the fit of a parametric mean function to a scatterplot. For example, Figure 1.10 repeats Figure 1.1, this time adding the estimated straight-line mean function and a smoother called a loess smooth (Cleveland, 1979). Roughly speaking, the loess smooth estimates E(Y|X = x) at the point x by fitting

FIG. 1.10 Heights data with the OLS line and a loess smooth with span = 0.10.

a straight line to a fraction of the points closest to x; we used the fraction 0.20 in this figure because the sample size is so large, but it is more usual to set the fraction to about 2/3. The smoother is obtained by joining the estimated values of E(Y|X = x) for many values of x. The loess smoother and the straight line agree almost perfectly for Mheight close to average, but they agree less well for larger values of Mheight, where there is much less data. Smoothers tend to be less reliable at the edges of the plot. We briefly discuss the loess smoother in Appendix A.5, but this material depends on results from later chapters.

1.6 SCATTERPLOT MATRICES

With one potential predictor, a scatterplot provides a summary of the regression relationship between the response and the potential predictor. With many potential predictors, we need to look at many scatterplots. A scatterplot matrix is a convenient way to organize these plots.

Fuel Consumption

The goal of this example is to understand how fuel consumption varies over the 50 United States and the District of Columbia and, in particular, to understand the effect on fuel consumption of state gasoline tax. Table 1.2 describes the variables to be used in this example; the data are given in the file fuel2001.txt. The data were collected by the US Federal Highway Administration. Both Drivers and FuelC are state totals, so these will be larger in states with more people and smaller in less populous states. Income is computed per person. To make all these comparable and to attempt to eliminate the effect of size of the state, we compute rates Dlic = Drivers/Pop and Fuel = FuelC/Pop. Additionally, we replace Miles by its (base-two) logarithm before doing any further analysis. Justification for replacing Miles with log(Miles) is deferred to Problem 7.7.
TABLE 1.2 Variables in the Fuel Consumption Data^a

Drivers     Number of licensed drivers in the state
FuelC       Gasoline sold for road use, thousands of gallons
Income      Per person personal income for the year 2000, in thousands of dollars
Miles       Miles of Federal-aid highway miles in the state
Pop         2001 population age 16 and over
Tax         Gasoline state tax rate, cents per gallon
State       State name
Fuel        1000 × FuelC/Pop
Dlic        1000 × Drivers/Pop
log(Miles)  Base-two logarithm of Miles

Source: Highway Statistics 2001.
^a All data are for 2001, unless otherwise noted. The last three variables do not appear in the data file but are computed from the previous variables, as described in the text.
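The derived variables in Table 1.2 are simple per-capita rates plus a base-two logarithm. A sketch with two invented rows (the numbers are made up and are not the real values in fuel2001.txt):

```python
import math

# Hypothetical rows in the spirit of fuel2001.txt; every number below is
# invented for illustration, not taken from the real data file.
states = [
    {"State": "A", "Drivers": 3_559_000, "FuelC": 2_382_000,
     "Pop": 3_451_000, "Miles": 70_000},
    {"State": "B", "Drivers": 1_500_000, "FuelC": 1_100_000,
     "Pop": 1_600_000, "Miles": 25_000},
]
for s in states:
    s["Dlic"] = 1000 * s["Drivers"] / s["Pop"]  # licensed drivers per 1000 people
    s["Fuel"] = 1000 * s["FuelC"] / s["Pop"]    # fuel sold per 1000 people
    s["logMiles"] = math.log2(s["Miles"])       # base-two log of highway miles
```

Dividing the state totals by Pop puts large and small states on a comparable per-person scale, which is exactly the motivation given in the text.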

FIG. 1.11 Scatterplot matrix for the fuel data, with diagonal panels Tax, Dlic, Income, log(Miles), and Fuel.

The scatterplot matrix for the fuel data is shown in Figure 1.11. Except for the diagonal, a scatterplot matrix is a 2D array of scatterplots. The variable names on the diagonal label the axes. In Figure 1.11, the variable log(Miles) appears on the horizontal axis of all the plots in the fourth column from the left and on the vertical axis of all the plots in the fourth row from the top³. Each plot in a scatterplot matrix is relevant to a particular one-predictor regression of the variable on the vertical axis, given the variable on the horizontal axis. For example, the plot of Fuel versus Tax in the last plot in the first column of the scatterplot matrix is relevant for the regression of Fuel on Tax; this is the first plot in the last row of Figure 1.11. We can interpret this plot as we would a scatterplot for simple regression. We get the overall impression that Fuel decreases on the average as Tax increases, but there is lots of variation. We can make similar qualitative judgments about each of the regressions of Fuel on the other variables. The overall impression is that Fuel is at best weakly related to each of the variables in the scatterplot matrix.

³The scatterplot matrix program used to draw Figure 1.11, which is the pairs function in R, has the diagonal running from the top left to the lower right. Other programs, such as the splom function in R, have the diagonal running from lower left to upper right. There seems to be no strong reason to prefer one over the other.

Does this help us understand how Fuel is related to all four predictors simultaneously? The marginal relationships between the response and each of the variables are not sufficient to understand the joint relationship between the response and the predictors. The interrelationships among the predictors are also important. The pairwise relationships between the predictors can be viewed in the remaining cells of the scatterplot matrix. In Figure 1.11, the relationships between all pairs of predictors appear to be very weak, suggesting that for this problem the marginal plots including Fuel are quite informative about the multiple regression problem. General considerations for other scatterplot matrices will be developed in later chapters.

PROBLEMS

1.1. Smallmouth bass data. Compute the means and the variances for each of the eight subpopulations in the smallmouth bass data. Draw a graph of average length versus Age and compare with Figure 1.5. Draw a graph of the standard deviations versus Age. If the variance function is constant, then the plot of standard deviation versus Age should be a null plot. Summarize the information.

1.2. Mitchell data. The data shown in Figure 1.12 give average soil temperature in degrees C at 20 cm depth in Mitchell, Nebraska, for 17 years beginning January 1976, plotted versus the month number. The data were collected by K. Hubbard and provided by O. Burnside.

1.2.1. Summarize the information in the graph about the dependence of soil temperature on month number.

FIG. 1.12 Monthly soil temperature data (Average Soil Temperature versus Months after January 1976).

1.2.2. The data used to draw Figure 1.12 are in the file Mitchell.txt. Redraw the graph, but this time make the length of the horizontal axis at least four times the length of the vertical axis. Repeat Problem 1.2.1.

1.3. United Nations. The data in the file UN1.txt contains PPgdp, the 2001 gross national product per person in US dollars, and Fertility, the birth rate per 1000 females in the population in the year 2000. The data are for 193 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries; the third variable in the file, called Locality, gives the name of the locality. The data were collected from unsd/demographic. In this problem, we will study the conditional distribution of Fertility given PPgdp.

1.3.1. Identify the predictor and the response.

1.3.2. Draw the scatterplot of Fertility on the vertical axis versus PPgdp on the horizontal axis and summarize the information in this graph. Does a straight-line mean function seem to be plausible for a summary of this graph?

1.3.3. Draw the scatterplot of log(Fertility) versus log(PPgdp), using logs to the base two. Does the simple linear regression model seem plausible for a summary of this graph?

1.4. Old Faithful. The data in the data file oldfaith.txt give information about eruptions of Old Faithful Geyser during October 1980. Variables are the Duration in seconds of the current eruption and the Interval, the time in minutes to the next eruption. The data were collected by volunteers and were provided by R. Hutchinson. Apart from missing data for the period from midnight to 6 AM, this is a complete record of eruptions for that month. Old Faithful Geyser is an important tourist attraction, with up to several thousand people watching it erupt on pleasant summer days. The park service uses data like these to obtain a prediction equation for the time to the next eruption.
1.4.1. Draw the relevant summary graph for predicting Interval from Duration, and summarize your results.

1.5. Water run-off in the Sierras. Can Southern California's water supply in future years be predicted from past data? One factor affecting water availability is stream run-off. If run-off could be predicted, engineers, planners, and policy makers could do their jobs more efficiently. The data in the file water.txt contains 43 years' worth of precipitation measurements taken at six sites in the Sierra Nevada mountains (labelled APMAM, APSAB, APSLAKE, OPBPC, OPRC, and OPSLAKE) and stream run-off volume at a site near Bishop, California, labelled BSAAM. The data are from the UCLA Statistics WWW server. Draw the scatterplot matrix for these data and summarize the information available from these plots.

CHAPTER 2

Simple Linear Regression

The simple linear regression model consists of the mean function and the variance function

E(Y|X = x) = β0 + β1x
Var(Y|X = x) = σ²   (2.1)

The parameters in the mean function are the intercept β0, which is the value of E(Y|X = x) when x equals zero, and the slope β1, which is the rate of change in E(Y|X = x) for a unit change in X; see Figure 2.1. By varying the parameters, we can get all possible straight lines. In most applications, parameters are unknown and must be estimated using data. The variance function in (2.1) is assumed to be constant, with a positive value σ² that is usually unknown.

Because the variance σ² > 0, the observed value of the ith response yi will typically not equal its expected value E(Y|X = xi). To account for this difference between the observed data and the expected value, statisticians have invented a quantity called a statistical error, or ei, for case i, defined implicitly by the equation yi = E(Y|X = xi) + ei or explicitly by ei = yi − E(Y|X = xi). The errors ei depend on unknown parameters in the mean function and so are not observable quantities. They are random variables and correspond to the vertical distance between the point yi and the mean function E(Y|X = xi). In the heights data, page 2, the errors are the differences between the heights of particular daughters and the average height of all daughters with mothers of a given fixed height.

If the assumed mean function is incorrect, then the difference between the observed data and the incorrect mean function will have a nonrandom component, as illustrated in Figure 2.2.

We make two important assumptions concerning the errors. First, we assume that E(ei|xi) = 0, so if we could draw a scatterplot of the ei versus the xi, we would have a null scatterplot, with no patterns.
The second assumption is that the errors are all independent, meaning that the value of the error for one case gives

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

FIG. 2.1 Equation of a straight line E(Y|X = x) = β0 + β1x, with β0 the intercept and β1 the slope.

FIG. 2.2 Approximating a curved mean function by a straight line adds a fixed lack-of-fit component to the errors.

no information about the value of the error for another case. This is likely to be true in the examples in Chapter 1, although this assumption will not hold in all problems.

Errors are often assumed to be normally distributed, but normality is much stronger than we need. In this book, the normality assumption is used primarily to obtain tests and confidence statements with small samples. If the errors are thought to follow some different distribution, such as the Poisson or the Binomial,

other methods besides OLS may be more appropriate; we return to this topic later in the book.

2.1 ORDINARY LEAST SQUARES ESTIMATION

Many methods have been suggested for obtaining estimates of parameters in a model. The method discussed here is called ordinary least squares, or OLS, in which parameter estimates are chosen to minimize a quantity called the residual sum of squares. A formal development of the least squares estimates is given in Appendix A.3.

Parameters are unknown quantities that characterize a model. Estimates of parameters are computable functions of data and are therefore statistics. To keep this distinction clear, parameters are denoted by Greek letters like α, β, γ and σ, and estimates of parameters are denoted by putting a "hat" over the corresponding Greek letter. For example, β̂1, read "beta one hat", is the estimator of β1, and σ̂² is the estimator of σ². The fitted value for case i is given by Ê(Y|X = xi), for which we use the shorthand notation ŷi,

ŷi = Ê(Y|X = xi) = β̂0 + β̂1xi   (2.2)

Although the ei are not parameters in the usual sense, we shall use the same hat notation to specify the residuals: the residual for the ith case, denoted êi, is given by the equation

êi = yi − Ê(Y|X = xi) = yi − ŷi = yi − (β̂0 + β̂1xi),  i = 1,...,n   (2.3)

which should be compared with the equation for the statistical errors,

ei = yi − (β0 + β1xi),  i = 1,...,n

All least squares computations for simple regression depend only on averages, sums of squares, and sums of cross-products. Definitions of the quantities used are given in Table 2.1. Sums of squares and cross-products have been centered by subtracting the average from each of the values before squaring or taking cross-products. Alternative formulas for computing the corrected sums of squares and cross-products from uncorrected sums of squares and cross-products are often given in elementary textbooks; they are useful for mathematical proofs, but they can be highly inaccurate when used on a computer and should be avoided.
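The warning about uncorrected sums of squares is easy to demonstrate in double-precision arithmetic: computing SXX as Σxi² − n·x̄² cancels catastrophically when the x's have a large mean, while the centered definition of Table 2.1 remains exact for the values chosen here.

```python
# x values with a huge offset: the true SXX is exactly 82.5, since the
# deviations from the mean are -4.5, -3.5, ..., 4.5.
x = [1e9 + i for i in range(10)]
n = len(x)
xbar = sum(x) / n

# centered (corrected) sum of squares, as defined in Table 2.1
sxx_centered = sum((xi - xbar) ** 2 for xi in x)

# uncorrected "shortcut" formula: sum of raw squares minus n * xbar**2;
# both terms are near 1e19, so almost all significant digits cancel
sxx_uncorrected = sum(xi * xi for xi in x) - n * xbar ** 2
# sxx_centered is exact here, while sxx_uncorrected is wildly inaccurate
```

With the centered formula, each deviation xi − x̄ is small and computed exactly, so the result is 82.5 to the last bit; the shortcut formula subtracts two numbers of size about 10¹⁹ and loses everything below their rounding granularity.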
Table 2.1 also lists definitions for the usual univariate and bivariate summary statistics, the sample averages (x̄, ȳ), sample variances (SDx², SDy²), and estimated covariance and correlation (sxy, rxy). The "hat rule" described earlier would suggest that different symbols should be used for these quantities; for example, ρ̂xy might be more appropriate for the sample correlation if the population correlation is ρxy.

TABLE 2.1 Definitions of Symbols^a

Quantity   Definition                            Description
x̄          Σxi/n                                 Sample average of x
ȳ          Σyi/n                                 Sample average of y
SXX        Σ(xi − x̄)² = Σ(xi − x̄)xi              Sum of squares for the x's
SDx²       SXX/(n − 1)                           Sample variance of the x's
SDx        √(SXX/(n − 1))                        Sample standard deviation of the x's
SYY        Σ(yi − ȳ)² = Σ(yi − ȳ)yi              Sum of squares for the y's
SDy²       SYY/(n − 1)                           Sample variance of the y's
SDy        √(SYY/(n − 1))                        Sample standard deviation of the y's
SXY        Σ(xi − x̄)(yi − ȳ) = Σ(xi − x̄)yi       Sum of cross-products
sxy        SXY/(n − 1)                           Sample covariance
rxy        sxy/(SDx SDy)                         Sample correlation

^a In each equation, the symbol Σ means to add over all the n values or pairs of values in the data.

This inconsistency is deliberate, since in many regression situations these statistics are not estimates of population parameters.

To illustrate computations, we will use Forbes' data, page 4, for which n = 17. The data are given in Table 2.2. In our analysis of these data, the response will be taken to be Lpres = 100 × log10(Pressure), and the predictor is Temp. We have used the values for these variables shown in Table 2.2 to do the computations.

TABLE 2.2 Forbes' 1857 Data on Boiling Point and Barometric Pressure for 17 Locations in the Alps and Scotland. Columns: Case Number; Temp (°F); Pressure (Inches Hg); Lpres = 100 log(Pressure).
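The definitions in Table 2.1 translate directly into code; the helper below is a plain transcription of the table (the function name and the small check data are ours).

```python
import math

def summary_stats(x, y):
    """Centered sums of squares/cross-products and the derived sample
    statistics, exactly as defined in Table 2.1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    SXX = sum((xi - xbar) ** 2 for xi in x)
    SYY = sum((yi - ybar) ** 2 for yi in y)
    SXY = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sdx = math.sqrt(SXX / (n - 1))
    sdy = math.sqrt(SYY / (n - 1))
    sxy = SXY / (n - 1)                 # sample covariance
    rxy = sxy / (sdx * sdy)             # sample correlation
    return {"xbar": xbar, "ybar": ybar, "SXX": SXX, "SYY": SYY,
            "SXY": SXY, "SDx": sdx, "SDy": sdy, "sxy": sxy, "rxy": rxy}

# tiny invented data set with y exactly linear in x, so rxy should be 1
stats = summary_stats([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```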

Neither multiplication by 100 nor the base of the logarithms has important effects on the analysis. Multiplication by 100 avoids using scientific notation for numbers we display in the text, and changing the base of the logarithms merely multiplies the logarithms by a constant. For example, to convert from base-ten logarithms to base-two logarithms, multiply by log2(10) ≈ 3.3219. To convert natural logarithms to base-two, multiply by log2(e) ≈ 1.4427.

Forbes' data were collected at 17 selected locations, so the sample variance of boiling points, SDx² = 33.17, is not an estimate of any meaningful population variance. Similarly, rxy depends as much on the method of sampling as it does on the population value ρxy, should such a population value make sense. In the heights example, page 2, if the 1375 mother–daughter pairs can be viewed as a sample from a population, then the sample correlation is an estimate of a population correlation.

The usual sample statistics are often presented and used in place of the corrected sums of squares and cross-products, so alternative formulas are given using both sets of quantities.

2.2 LEAST SQUARES CRITERION

The criterion function for obtaining estimators is based on the residuals, which geometrically are the vertical distances between the fitted line and the actual y-values, as illustrated in Figure 2.3. The residuals reflect the inherent asymmetry in the roles of the response and the predictor in regression problems.

FIG. 2.3 A schematic plot for OLS fitting. Each data point is indicated by a small circle, and the solid line is a candidate OLS line given by a particular choice of slope and intercept. The solid vertical lines between the points and the solid line are the residuals. Points below the line have negative residuals, while points above the line have positive residuals.

The OLS estimators are those values β0 and β1 that minimize the function¹

RSS(β0, β1) = Σᵢ₌₁ⁿ [yi − (β0 + β1xi)]²   (2.4)

When evaluated at (β̂0, β̂1), we call the quantity RSS(β̂0, β̂1) the residual sum of squares, or just RSS.

The least squares estimates can be derived in many ways, one of which is outlined in Appendix A.3. They are given by the expressions

β̂1 = SXY/SXX = rxy (SDy/SDx) = rxy (SYY/SXX)^{1/2}
β̂0 = ȳ − β̂1x̄   (2.5)

The several forms for β̂1 are all equivalent.

We emphasize again that OLS produces estimates of parameters but not the actual values of the parameters. The data in Figure 2.3 were created by setting the xi to be a random sample of 20 numbers from a N(2, 1.5) distribution and then computing yi = 0.7 + 0.8xi + ei, where the errors were N(0, 1) random numbers. For this graph, the true values of β0 = 0.7 and β1 = 0.8 are known. The graph of the true mean function is shown in Figure 2.3 as a dashed line, and it seems to match the data poorly compared to OLS, given by the solid line. Since OLS minimizes (2.4), it will always fit at least as well as, and generally better than, the true mean function.

Using Forbes' data, we will write x̄ for the sample mean of Temp and ȳ for the sample mean of Lpres. The quantities needed for computing the least squares estimators are

x̄ =        SXX =        SXY =
ȳ =        SYY =           (2.6)

The quantity SYY, although not yet needed, is given for completeness. In the rare instances that regression calculations are not done using statistical software or a statistical calculator, intermediate calculations such as these should be done as accurately as possible, and rounding should be done only to final results. Using (2.6), we find

β̂1 = SXY/SXX =
β̂0 = ȳ − β̂1x̄ =

¹We abuse notation by using the symbol for a fixed though unknown quantity like βj as if it were a variable argument. Thus, for example, RSS(β0, β1) is a function of two variables to be evaluated as its arguments β0 and β1 vary.
The same abuse of notation is used in the discussion of confidence intervals.
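The construction behind Figure 2.3 can be redone in code to check the claim above: because OLS minimizes (2.4), its RSS is never larger than the RSS of the true mean function. This is a sketch with an arbitrary random seed, so the particular numbers differ from those behind the book's figure.

```python
import random

# Regenerate data in the spirit of Figure 2.3: true line beta0 = 0.7,
# beta1 = 0.8, x's from N(2, 1.5), errors from N(0, 1).
random.seed(2)
x = [random.gauss(2, 1.5) for _ in range(20)]
y = [0.7 + 0.8 * xi + random.gauss(0, 1) for xi in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
SXX = sum((xi - xbar) ** 2 for xi in x)
SXY = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = SXY / SXX           # beta1-hat, from (2.5)
b0 = ybar - b1 * xbar    # beta0-hat

def rss(a, b):
    """RSS(a, b) as defined in (2.4)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

rss_ols, rss_true = rss(b0, b1), rss(0.7, 0.8)
# rss_ols <= rss_true always, since the OLS pair minimizes (2.4)
```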

The estimated line, given by either of the equations

Ê(Lpres|Temp) = β̂0 + β̂1Temp = ȳ + β̂1(Temp − x̄)

was drawn in Figure 1.4a. The fit of this line to the data is excellent.

2.3 ESTIMATING σ²

Since the variance σ² is essentially the average squared size of the ei², we should expect that its estimator σ̂² is obtained by averaging the squared residuals. Under the assumption that the errors are uncorrelated random variables with zero means and common variance σ², an unbiased estimate of σ² is obtained by dividing RSS = Σêi² by its degrees of freedom (df), where residual df = number of cases minus the number of parameters in the mean function. For simple regression, residual df = n − 2, so the estimate of σ² is given by

σ̂² = RSS/(n − 2)   (2.7)

This quantity is called the residual mean square. In general, any sum of squares divided by its df is called a mean square. The residual sum of squares can be computed by squaring the residuals and adding them up. It can also be computed from the formula (Problem 2.9)

RSS = SYY − SXY²/SXX = SYY − β̂1²SXX   (2.8)

Using the summaries for Forbes' data given at (2.6), we find

RSS = SYY − SXY²/SXX =        (2.9)
σ̂² = RSS/(n − 2) =        (2.10)

The square root of σ̂², the quantity σ̂, is often called the standard error of regression. It is in the same units as the response variable.

If, in addition to the assumptions made previously, the ei are drawn from a normal distribution, then the residual mean square will be distributed as a multiple of a chi-squared random variable with df = n − 2, or in symbols,

(n − 2)σ̂²/σ² ∼ χ²(n − 2)

This is proved in more advanced books on linear models and is used to obtain the distribution of test statistics and also to make confidence statements concerning σ². In particular, this fact implies that E(σ̂²) = σ², although normality is not required for unbiasedness.

2.4 PROPERTIES OF LEAST SQUARES ESTIMATES

The OLS estimates depend on the data only through the statistics given in Table 2.1. This is both an advantage, making computing easy, and a disadvantage, since any two data sets for which these statistics are identical give the same fitted regression, even if a straight-line model is appropriate for one but not the other, as we have seen in Anscombe's examples in Section 1.4.

The estimates β̂0 and β̂1 can both be written as linear combinations of y1,...,yn. For example, writing ci = (xi − x̄)/SXX (see Appendix A.3),

β̂1 = Σ[(xi − x̄)/SXX] yi = Σ ci yi

The fitted value at x = x̄ is

Ê(Y|X = x̄) = (ȳ − β̂1x̄) + β̂1x̄ = ȳ

so the fitted line must pass through the point (x̄, ȳ), intuitively the center of the data. Finally, as long as the mean function includes an intercept, Σêi = 0. Mean functions without an intercept will usually have Σêi ≠ 0.

Since the estimates β̂0 and β̂1 depend on the random ei's, the estimates are also random variables. If all the ei have zero mean and the mean function is correct, then, as shown in Appendix A.4, the least squares estimates are unbiased,

E(β̂0) = β0
E(β̂1) = β1

The variances of the estimators, assuming Var(ei) = σ², i = 1,...,n, and Cov(ei, ej) = 0, i ≠ j, are, from Appendix A.4,

Var(β̂1) = σ² (1/SXX)
Var(β̂0) = σ² (1/n + x̄²/SXX)   (2.11)

The two estimates are correlated, with covariance

Cov(β̂0, β̂1) = −σ² (x̄/SXX)   (2.12)

The correlation between the estimates can be computed to be

ρ(β̂₀, β̂₁) = −x̄/(SXX/n + x̄²)^{1/2} = −x̄/((n − 1)SDₓ²/n + x̄²)^{1/2}

This correlation can be close to plus or minus one if SDₓ is small compared to x̄, and can be made equal to zero if the predictor is centered to have sample mean zero.

The Gauss–Markov theorem provides an optimality result for ols estimates. Among all estimates that are linear combinations of the ys and unbiased, the ols estimates have the smallest variance. If one believes the assumptions and is interested in using linear unbiased estimates, the ols estimates are the ones to use. When the errors are normally distributed, the ols estimates can be justified using a completely different argument, since they are then also maximum likelihood estimates, as discussed in any mathematical statistics text, for example, Casella and Berger (1990).

Under the assumption that the errors are independent, normal with constant variance, which is written in symbols as

eᵢ ∼ NID(0, σ²), i = 1, ..., n

β̂₀ and β̂₁ are also normally distributed, since they are linear functions of the yᵢ's and hence of the eᵢ, with variances and covariances given by (2.11) and (2.12). These results are used to get confidence intervals and tests. Normality of the estimates also holds without normality of the errors if the sample size is large enough.²

2.5 ESTIMATED VARIANCES

Estimates of Var(β̂₀) and Var(β̂₁) are obtained by substituting σ̂² for σ² in (2.11). We use the symbol V̂ar() for an estimated variance. Thus

V̂ar(β̂₁) = σ̂²(1/SXX)    V̂ar(β̂₀) = σ̂²(1/n + x̄²/SXX)

The square root of an estimated variance is called a standard error, for which we use the symbol se(). The use of this notation is illustrated by

se(β̂₁) = [V̂ar(β̂₁)]^{1/2}

²The main requirement for all estimates to be normally distributed in large samples is that maxᵢ((xᵢ − x̄)²/SXX) must get close to zero as the sample size increases (Huber, 1981).
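A short sketch of Section 2.5's estimated variances on toy data (the formulas are (2.11) with σ̂² substituted for σ²; nothing here comes from the book's data sets):

```python
import math

# Standard errors of the ols estimates, Section 2.5, on toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
SXX = sum((x - xbar) ** 2 for x in xs)
SXY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = SXY / SXX
b0 = ybar - b1 * xbar
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2_hat = rss / (n - 2)                                  # (2.7)
se_b1 = math.sqrt(sigma2_hat / SXX)                         # se(beta1-hat)
se_b0 = math.sqrt(sigma2_hat * (1 / n + xbar ** 2 / SXX))   # se(beta0-hat)
```

With a positive x̄, the intercept is estimated less precisely than the slope here, as the extra x̄²/SXX term suggests.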

2.6 COMPARING MODELS: THE ANALYSIS OF VARIANCE

The analysis of variance provides a convenient method of comparing the fit of two or more mean functions for the same set of data. The methodology developed here is very useful in multiple regression and, with minor modification, in most regression problems.

An elementary alternative to the simple regression model suggests fitting the mean function

E(Y|X = x) = β₀   (2.13)

The mean function (2.13) is the same for all values of X. Fitting with this mean function is equivalent to finding the best line parallel to the horizontal or x-axis, as shown in Figure 2.4. The ols estimate of the mean function is Ê(Y|X) = β̂₀, where β̂₀ is the value of β₀ that minimizes Σ(yᵢ − β₀)². The minimizer is given by

β̂₀ = ȳ   (2.14)

The residual sum of squares is

Σ(yᵢ − β̂₀)² = Σ(yᵢ − ȳ)² = SYY   (2.15)

This residual sum of squares has n − 1 df, n cases minus one parameter in the mean function.

Next, consider the simple regression mean function obtained from (2.13) by adding a term that depends on X,

E(Y|X = x) = β₀ + β₁x   (2.16)

Fitting this mean function is equivalent to finding the best line of arbitrary slope, as shown in Figure 2.4. The ols estimates for this mean function are given by

FIG. 2.4 Two mean functions compared by the analysis of variance.

(2.5). The estimates of β₀ under the two mean functions are different, just as the meaning of β₀ in the two mean functions is different. For (2.13), β₀ is the average of the yᵢ's, but for (2.16), β₀ is the expected value of Y when X = 0. For (2.16), the residual sum of squares, given in (2.8), is

RSS = SYY − (SXY)²/SXX   (2.17)

As mentioned earlier, RSS has n − 2 df.

The difference between the sum of squares at (2.15) and that at (2.17) is the reduction in residual sum of squares due to enlarging the mean function from (2.13) to the simple regression mean function (2.16). This is the sum of squares due to regression, SSreg, defined by

SSreg = SYY − RSS = SYY − (SYY − (SXY)²/SXX) = (SXY)²/SXX   (2.18)

The df associated with SSreg is the difference between the df for mean function (2.13), n − 1, and the df for mean function (2.16), n − 2, so the df for SSreg is (n − 1) − (n − 2) = 1 for simple regression.

These results are often summarized in an analysis of variance table, abbreviated as ANOVA, given in Table 2.3. The column marked Source refers to descriptive labels given to the sums of squares; in more complicated tables, there may be many sources, and the labels given may be different in some computer programs. The df column gives the number of degrees of freedom associated with each named source. The next column gives the associated sum of squares. The mean square column is computed from the sum of squares column by dividing sums of squares by the corresponding df. The mean square on the residual line is just σ̂², as already discussed.

The analysis of variance for Forbes's data is given in Table 2.4. Although this table will be produced by any linear regression software program, the entries in Table 2.4 can be constructed from the summary statistics given at (2.6).
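The construction of the ANOVA table from summary statistics can be sketched as follows (Python, with illustrative numbers of our own, not the Forbes entries):

```python
# Building the simple-regression ANOVA of Table 2.3 from summary
# statistics, as described in the text. Toy data throughout.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.2, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
SXX = sum((x - xbar) ** 2 for x in xs)
SYY = sum((y - ybar) ** 2 for y in ys)
SXY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
SSreg = SXY ** 2 / SXX          # (2.18)
RSS = SYY - SSreg               # (2.17)
MSreg = SSreg / 1               # regression mean square
sigma2_hat = RSS / (n - 2)      # residual mean square
F = MSreg / sigma2_hat          # (2.19), anticipating Section 2.6.1
anova = [
    ("Regression", 1, SSreg, MSreg, F),
    ("Residual", n - 2, RSS, sigma2_hat, None),
    ("Total", n - 1, SYY, None, None),
]
for row in anova:
    print(row)
```

The df and SS columns add up across the Regression and Residual lines to the Total line, mirroring the decomposition SSreg + RSS = SYY.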
The ANOVA is always computed relative to a specific larger mean function, here given by (2.16), and a smaller mean function obtained from the larger by setting

TABLE 2.3 The Analysis of Variance Table for Simple Regression

Source       df      SS      MS                   F
Regression   1       SSreg   SSreg/1              MSreg/σ̂²
Residual     n − 2   RSS     σ̂² = RSS/(n − 2)
Total        n − 1   SYY
</gr-replace>

TABLE 2.4 Analysis of Variance Table for Forbes's Data (Source: Regression on Temp, df = 1, F = 2963, p-value ≈ 0; Residual, df = 15)

some parameters to zero, or occasionally setting them to some other known value. For example, equation (2.13) was obtained from (2.16) by setting β₁ = 0. The line in the ANOVA table for the total gives the residual sum of squares corresponding to the mean function with the fewest parameters. In the next chapter, the analysis of variance is applied to a sequence of mean functions, but the reference to a fixed larger mean function remains intact.

2.6.1 The F-Test for Regression

If the sum of squares for regression SSreg is large, then the simple regression mean function E(Y|X = x) = β₀ + β₁x should be a significant improvement over the mean function given by (2.13), E(Y|X = x) = β₀. This is equivalent to saying that the additional parameter in the simple regression mean function, β₁, is different from zero, or that E(Y|X = x) is not constant as X varies. To formalize this notion, we need to be able to judge how large is large. This is done by comparing the regression mean square, SSreg divided by its df, to the residual mean square σ̂². We call this ratio F:

F = ((SYY − RSS)/1)/σ̂² = (SSreg/1)/σ̂²   (2.19)

F is just a rescaled version of SSreg = SYY − RSS, with larger values of SSreg resulting in larger values of F. Formally, we can consider testing the null hypothesis (NH) against the alternative hypothesis (AH),

NH: E(Y|X = x) = β₀
AH: E(Y|X = x) = β₀ + β₁x   (2.20)

If the errors are NID(0, σ²) or the sample size is large enough, then under NH, (2.19) will follow an F-distribution with the df associated with the numerator and denominator of (2.19), 1 and n − 2 for simple regression. This is written F ∼ F(1, n − 2). For Forbes's data, we compute F = 2963. We obtain a significance level, or p-value, for this test by comparing F to the percentage points of the F(1, n − 2)-distribution. Most computer programs that fit regression models include functions for computing percentage points of the F

and other standard distributions and will include the p-value along with the ANOVA table, as in Table 2.4. The p-value is shown as approximately zero, meaning that, if the NH were true, the chance of F exceeding its observed value is essentially zero. This is very strong evidence against the NH and in favor of the AH.

2.6.2 Interpreting p-values

Under the appropriate assumptions, the p-value is the conditional probability of observing a value of the computed statistic, here the value of F, as extreme or more extreme, here as large or larger, than the observed value, given that the NH is true. A small p-value provides evidence against the NH.

In some research areas, it has become traditional to adopt a fixed significance level when examining p-values. For example, if a fixed significance level of α is adopted, then we would say that an NH is rejected at level α if the p-value is less than α. The most common choice for α is 0.05, which would mean that, were the NH to be true, we would incorrectly find evidence against it about 5% of the time, or about 1 test in 20. Accept-reject rules like this are generally unnecessary for reasonable scientific inquiry. Simply reporting p-values and allowing readers to decide on significance seems a better approach.

There is an important distinction between statistical significance, the observation of a sufficiently small p-value, and scientific significance, observing an effect of sufficient magnitude to be meaningful. Judgment of the latter usually will require examination of more than just the p-value.

2.6.3 Power of Tests

When the NH is true and all assumptions are met, the chance of incorrectly declaring an NH to be false at level α is just α. If α = 0.05, then in 5% of tests where the NH is true we will get a p-value smaller than or equal to 0.05. When the NH is false, we expect to see small p-values more often. The power of a test is defined to be the probability of detecting a false NH.
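Power can also be approximated by simulation. The sketch below uses a toy configuration of our own choosing (not from the book): it estimates the 5% critical value of F under the NH empirically, then estimates the probability that F exceeds it when β₁ ≠ 0:

```python
import random

# Monte Carlo illustration of power, Section 2.6.3. All settings here
# (x values, beta1, sigma) are illustrative choices, not book data.
random.seed(1)
xs = [float(i) for i in range(1, 11)]   # n = 10
n = len(xs)
xbar = sum(xs) / n
SXX = sum((x - xbar) ** 2 for x in xs)

def f_stat(ys):
    """F statistic (2.19) for the simple regression of ys on xs."""
    ybar = sum(ys) / n
    SYY = sum((y - ybar) ** 2 for y in ys)
    SXY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    SSreg = SXY ** 2 / SXX
    RSS = SYY - SSreg
    return SSreg / (RSS / (n - 2))

def simulate(beta1, reps):
    # generate reps data sets with E(Y|X=x) = beta1 * x and sigma = 1
    return [f_stat([beta1 * x + random.gauss(0.0, 1.0) for x in xs])
            for _ in range(reps)]

# empirical 5% critical value under the NH (beta1 = 0) ...
null_fs = sorted(simulate(0.0, 2000))
crit = null_fs[int(0.95 * len(null_fs))]
# ... and the estimated power when beta1 = 0.5
alt_fs = simulate(0.5, 2000)
power = sum(f > crit for f in alt_fs) / len(alt_fs)
```

With these settings the noncentrality SXXβ₁²/σ² is large, so the estimated power is close to one; shrinking β₁ or SXX in the sketch drives it back toward α.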
For the hypothesis test (2.20), when the NH is false, it is shown in more advanced books on linear models (such as Seber, 1977) that the statistic F given by (2.19) has a noncentral F-distribution, with 1 and n − 2 df, and with noncentrality parameter given by SXXβ₁²/σ². The larger the value of the noncentrality parameter, the greater the power. The noncentrality is increased if β₁² is large, if SXX is large, either by spreading out the predictors or by increasing the sample size, or by decreasing σ².

2.7 THE COEFFICIENT OF DETERMINATION, R²

If both sides of (2.18) are divided by SYY, we get

SSreg/SYY = 1 − RSS/SYY   (2.21)

The left-hand side of (2.21) is the proportion of variability of the response explained by regression on the predictor. The right-hand side consists of one minus the

remaining unexplained variability. This concept of dividing up the total variability according to whether or not it is explained is of sufficient importance that a special name is given to it. We define R², the coefficient of determination, to be

R² = SSreg/SYY = 1 − RSS/SYY   (2.22)

R² is computed from quantities that are available in the ANOVA table. It is a scale-free one-number summary of the strength of the relationship between the xᵢ and the yᵢ in the data. It generalizes nicely to multiple regression, depends only on the sums of squares, and appears to be easy to interpret. For Forbes's data,

R² = SSreg/SYY ≈ 0.995

and thus about 99.5% of the variability in the observed values of 100 log(Pressure) is explained by boiling point. Since R² does not depend on units of measurement, we would get the same value if we had used logarithms with a different base, or if we had not multiplied log(Pressure) by 100.

By appealing to (2.22) and to Table 2.1, we can write

R² = SSreg/SYY = (SXY)²/(SXX × SYY) = r²ₓᵧ

and thus R² is the same as the square of the sample correlation between the predictor and the response.

2.8 CONFIDENCE INTERVALS AND TESTS

When the errors are NID(0, σ²), parameter estimates, fitted values, and predictions will be normally distributed, because all of these are linear combinations of the yᵢ and hence of the eᵢ. Confidence intervals and tests can be based on the t-distribution, which is the appropriate distribution with normal estimates but using an estimate σ̂² of the variance. Suppose we let t(α/2, d) be the value that cuts off α/2 × 100% in the upper tail of the t-distribution with d df. These values can be computed in most statistical packages or spreadsheet software.³

2.8.1 The Intercept

The intercept is used to illustrate the general form of confidence intervals for normally distributed estimates. The standard error of the intercept is se(β̂₀) = σ̂(1/n + x̄²/SXX)^{1/2}.
Hence a (1 − α) × 100% confidence interval for the intercept is the set of points β₀ in the interval

β̂₀ − t(α/2, n − 2)se(β̂₀) ≤ β₀ ≤ β̂₀ + t(α/2, n − 2)se(β̂₀)

³Such as the function tinv in Microsoft Excel, or the function pt in R or S-plus.
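A numerical sketch of this interval on toy data with n = 9, so that the 90% multiplier is t(0.05, 7) = 1.895 (the data are illustrative; in practice the quantile comes from tables or software):

```python
import math

# 90% confidence interval for the intercept, Section 2.8.1, on toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [3.1, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2, 9.9, 11.1]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
SXX = sum((x - xbar) ** 2 for x in xs)
SXY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = SXY / SXX
b0 = ybar - b1 * xbar
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma_hat = math.sqrt(rss / (n - 2))
se_b0 = sigma_hat * math.sqrt(1 / n + xbar ** 2 / SXX)
t_mult = 1.895                     # t(0.05, n - 2) for n = 9
lower, upper = b0 - t_mult * se_b0, b0 + t_mult * se_b0
```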

For Forbes's data, se(β̂₀) = 3.340. For a 90% confidence interval, t(0.05, 15) = 1.753, and the interval is

β̂₀ − 1.753(3.340) ≤ β₀ ≤ β̂₀ + 1.753(3.340)

Ninety percent of such intervals will include the true value.

A hypothesis test of

NH: β₀ = β₀*, β₁ arbitrary
AH: β₀ ≠ β₀*, β₁ arbitrary

is obtained by computing the t-statistic

t = (β̂₀ − β₀*)/se(β̂₀)   (2.23)

and referring this ratio to the t-distribution with n − 2 df. For example, in Forbes's data, consider testing the NH β₀ = −35 against the alternative that β₀ ≠ −35. The statistic is

t = (β̂₀ − (−35))/se(β̂₀)

which has a p-value near 0.05, providing some evidence against the NH. This hypothesis test for these data is not one that would occur to most investigators and is used only as an illustration.

2.8.2 Slope

The standard error of β̂₁ is se(β̂₁) = σ̂/SXX^{1/2} = 0.0164. A 95% confidence interval for the slope is the set of β₁ such that

β̂₁ − t(0.025, 15)(0.0164) ≤ β₁ ≤ β̂₁ + t(0.025, 15)(0.0164)

As an example of a test for slope equal to zero, consider the Ft. Collins snowfall data presented on page 7. One can obtain the estimated slope β̂₁ and its standard error se(β̂₁) (Problem 2.11). The test of interest is of

NH: β₁ = 0
AH: β₁ ≠ 0   (2.24)

For the Ft. Collins data, t = β̂₁/se(β̂₁). To get a significance level for this test, compare t with the t(91) distribution; the two-sided p-value is

0.124, suggesting no evidence against the NH that Early and Late season snowfalls are independent.

Compare the hypothesis (2.24) with (2.20). Both appear to be identical. In fact,

t² = (β̂₁/se(β̂₁))² = β̂₁²/(σ̂²/SXX) = β̂₁²SXX/σ̂² = F

so the square of a t-statistic with df degrees of freedom is equivalent to an F-statistic with (1, df) degrees of freedom. In the nonlinear and logistic regression models discussed later in the book, the analog of the t-test will not be identical to the analog of the F-test, and they can give conflicting conclusions. For linear regression models, no conflict occurs and the two tests are equivalent.

2.8.3 Prediction

The estimated mean function can be used to obtain values of the response for given values of the predictor. The two important variants of this problem are prediction and estimation of fitted values. Since prediction is more important, we discuss it first.

In prediction we have a new case, possibly a future value, not one used to estimate parameters, with observed value of the predictor x*. We would like to know the value y*, the corresponding response, but it has not yet been observed. We can use the estimated mean function to predict it. We assume that the data used to estimate the mean function are relevant to the new case, so the fitted model applies to it. In the heights example, we would probably be willing to apply the fitted mean function to other mother-daughter pairs alive in England at the end of the nineteenth century. Whether the prediction would be reasonable for mother-daughter pairs in other countries or in other time periods is much less clear. In Forbes's problem, we would probably be willing to apply the results for altitudes in the range he studied. Given this additional assumption, a point prediction of y*, say ỹ*, is just

ỹ* = β̂₀ + β̂₁x*

ỹ* predicts the as yet unobserved y*.
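The identity t² = F from Section 2.8.2 holds for any simple regression fit; a quick numerical check on toy data:

```python
import math

# Verify t^2 = F for the slope test, Section 2.8.2, on toy data
# (not the Ft. Collins snowfall data).
xs = [1.0, 2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
ys = [2.0, 4.0, 3.0, 6.0, 7.0, 6.0, 9.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
SXX = sum((x - xbar) ** 2 for x in xs)
SYY = sum((y - ybar) ** 2 for y in ys)
SXY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = SXY / SXX
SSreg = SXY ** 2 / SXX
RSS = SYY - SSreg
sigma2_hat = RSS / (n - 2)
F = (SSreg / 1) / sigma2_hat          # (2.19)
t = b1 / math.sqrt(sigma2_hat / SXX)  # slope t-statistic
```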
The variability of this predictor has two sources: the variation in the estimates β̂₀ and β̂₁, and the variation due to the fact that y* will not equal its expectation, since even if we knew the parameters exactly, the future value of the response will not generally equal its expectation. Using Appendix A.4,

Var(ỹ*|x*) = σ² + σ²(1/n + (x* − x̄)²/SXX)   (2.25)

Taking square roots and estimating σ² by σ̂², we get the standard error of prediction (sepred) at x*,

sepred(ỹ*|x*) = σ̂(1 + 1/n + (x* − x̄)²/SXX)^{1/2}   (2.26)

A prediction interval uses multipliers from the t-distribution. For prediction of 100 log(Pressure) at a location with x* = 200, the point prediction is ỹ* = β̂₀ + β̂₁(200), with standard error of prediction

sepred(ỹ*|x* = 200) = σ̂(1 + 1/17 + (200 − x̄)²/SXX)^{1/2} = 0.393

Thus a 99% predictive interval is the set of all y* such that

ỹ* − t(0.005, 15)(0.393) ≤ y* ≤ ỹ* + t(0.005, 15)(0.393)

More interesting would be a 99% prediction interval for Pressure, rather than for 100 log(Pressure). A point prediction is just 10^{ỹ*/100} inches of Mercury. The prediction interval is found by exponentiating the end points of the interval in log scale: dividing each end point by 100 and then exponentiating gives

10^{lower/100} ≤ Pressure ≤ 10^{upper/100}

In the original scale, the prediction interval is not symmetric about the point estimate.

For the heights data, Figure 2.5 is a plot of the estimated mean function, given by the dashed line, for the regression of Dheight on Mheight, along with curves at

β̂₀ + β̂₁x ± t(0.025, n − 2)sepred(Dheight|Mheight)

The vertical distance between the two solid curves for any value of Mheight corresponds to a 95% prediction interval for daughter's height given mother's height. Although not obvious from the graph because of the very large sample size, the interval is wider for mothers who were either relatively tall or short, as the curves bend outward from the narrowest point at the sample mean of Mheight.

2.8.4 Fitted Values

In rare problems, one may be interested in obtaining an estimate of E(Y|X = x). In the heights data, this is like asking for the population mean height of all daughters of mothers with a particular height. This quantity is estimated by the fitted value ŷ = β̂₀ + β̂₁x, and its standard error is

sefit(ŷ|x) = σ̂(1/n + (x − x̄)²/SXX)^{1/2}
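The prediction computations above can be sketched as follows (toy data; the multiplier 2.0 is a placeholder for the appropriate t quantile, which would come from tables or software):

```python
import math

# Point prediction and standard error of prediction (2.26) on toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.1, 3.9, 5.2, 5.8, 7.1, 8.2, 8.8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
SXX = sum((x - xbar) ** 2 for x in xs)
SXY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = SXY / SXX
b0 = ybar - b1 * xbar
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma_hat = math.sqrt(rss / (n - 2))

def sepred(xstar):
    """Standard error of prediction (2.26) at a new predictor value."""
    return sigma_hat * math.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / SXX)

xstar = 10.0                    # a new case, outside the observed range
ytilde = b0 + b1 * xstar        # point prediction
half = 2.0 * sepred(xstar)      # 2.0 is a stand-in for t(alpha/2, n - 2)
interval = (ytilde - half, ytilde + half)
```

Note that sepred never falls below σ̂, and it grows as x* moves away from x̄, which is why the bands in Figure 2.5 bend outward.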

FIG. 2.5 Prediction intervals (solid lines) and intervals for fitted values (dashed lines) for the heights data.

To obtain confidence intervals, it is more usual to compute a simultaneous interval for all possible values of x. This is the same as first computing a joint confidence region for β₀ and β₁ and, from these, computing the set of all possible mean functions with slope and intercept in the joint confidence set (Section 5.5). The confidence region for the mean function is the set of all y such that

(β̂₀ + β̂₁x) − sefit(ŷ|x)[2F(α; 2, n − 2)]^{1/2} ≤ y ≤ (β̂₀ + β̂₁x) + sefit(ŷ|x)[2F(α; 2, n − 2)]^{1/2}

For multiple regression, replace 2F(α; 2, n − 2) by p′F(α; p′, n − p′), where p′ is the number of parameters estimated in the mean function, including the intercept. The simultaneous band for the fitted line for the heights data is shown in Figure 2.5 as the vertical distance between the two dotted lines. The prediction intervals are much wider than the confidence intervals. Why is this so (Problem 2.4)?

2.9 THE RESIDUALS

Plots of residuals versus other quantities are used to find failures of assumptions. The most common plot, especially useful in simple regression, is the plot of residuals versus the fitted values. A null plot would indicate no failure of assumptions. Curvature might indicate that the fitted mean function is inappropriate. Residuals that seem to increase or decrease in average magnitude with the fitted values might indicate nonconstant residual variance. A few relatively large residuals may be indicative of outliers, cases for which the model is somehow inappropriate.

The plot of residuals versus fitted values for the heights data is shown in Figure 2.6. This is a null plot, as it indicates no particular problems.
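One way to compare the two kinds of intervals numerically is to compute sefit and sepred side by side; a sketch on toy data (not the heights data):

```python
import math

# Compare sefit (Section 2.8.4) with sepred (2.26) across x values.
xs = [1.0, 2.0, 3.0, 5.0, 6.0, 8.0, 9.0]
ys = [1.5, 2.2, 3.4, 4.9, 6.2, 7.9, 9.3]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
SXX = sum((x - xbar) ** 2 for x in xs)
SXY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = SXY / SXX
b0 = ybar - b1 * xbar
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2_hat = rss / (n - 2)

def sefit(x):
    # standard error of the fitted value at x
    return math.sqrt(sigma2_hat * (1 / n + (x - xbar) ** 2 / SXX))

def sepred(x):
    # standard error of prediction at x; the extra "1 +" term is the
    # variability of the new observation itself
    return math.sqrt(sigma2_hat * (1 + 1 / n + (x - xbar) ** 2 / SXX))
```

At every x, sepred exceeds sefit, since sepred² = σ̂² + sefit², which is one way to think about Problem 2.4.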

FIG. 2.6 Residuals versus fitted values for the heights data.

FIG. 2.7 Residual plot for Forbes's data.

The fitted values and residuals for Forbes's data are plotted in Figure 2.7. The residuals are generally small compared to the fitted values, and they do not follow any distinct pattern in Figure 2.7. The residual for case number 12 is about four times the size of the next largest residual in absolute value. This may suggest that the assumptions concerning the errors are not correct. Either Var(100 log(Pressure)|Temp) may not be constant, or, for case 12, the corresponding error may have a large fixed component. Forbes may have misread or miscopied the results of his calculations for this case, which would suggest that the numbers in

TABLE 2.5 Summary Statistics for Forbes's Data with All Data and with Case 12 Deleted (rows: β̂₀, β̂₁, se(β̂₀), se(β̂₁), σ̂², R²; columns: All Data, Delete Case 12)

the data do not correspond to the actual measurements. Forbes noted this possibility himself, by marking this pair of numbers in his paper as being "evidently a mistake," presumably because of the large observed residual.

Since we are concerned with the effects of case 12, we could refit the data, this time without case 12, and then examine the changes that occur in the estimates of parameters, fitted values, residual variance, and so on. This is summarized in Table 2.5, giving estimates of parameters, their standard errors, σ̂², and the coefficient of determination R², with and without case 12. The estimates of parameters are essentially identical with and without case 12. In other regression problems, deletion of a single case can change everything. The effect of case 12 on standard errors is more marked: if case 12 is deleted, standard errors are decreased by a factor of about 3.1, and variances are decreased by a factor of about 3.1² ≈ 10. Inclusion of this case gives the appearance of less reliable results than would be suggested on the basis of the other 16 cases. In particular, prediction intervals for Pressure are much wider based on all the data than on the 16-case data, although the point predictions are nearly the same. The residual plot obtained when case 12 is deleted before computing indicates no obvious failures in the remaining 16 cases.

Two competing fits using the same mean function but somewhat different data are available, and they lead to slightly different conclusions, although the results of the two analyses agree more than they disagree. On the basis of the data, there is no real way to choose between the two, and we have no way of deciding which is the correct ols analysis of the data. A good approach to this problem is to describe both or, in general, all plausible alternatives.

PROBLEMS

2.1
Height and weight data The table below and in the data file htwt.txt gives Ht = height in centimeters and Wt = weight in kilograms for a sample of n = 10 18-year-old girls. The data are taken from a larger study described in Problem 3.1. Interest is in predicting weight from height.

2.1.1 Draw a scatterplot of Wt on the vertical axis versus Ht on the horizontal axis. On the basis of this plot, does a simple linear regression model make sense for these data? Why or why not?

2.1.2 Verify the values of x̄, ȳ (= 59.47), SXX, SYY, and SXY for these data. Compute estimates of the slope and the intercept for the regression of Y on X. Draw the fitted line on your scatterplot.

2.1.3 Obtain the estimate of σ² and find the estimated standard errors of β̂₀ and β̂₁. Also find the estimated covariance between β̂₀ and β̂₁. Compute the t-tests for the hypotheses that β₀ = 0 and that β₁ = 0, and find the appropriate p-values using two-sided tests.

2.1.4 Obtain the analysis of variance table and the F-test for regression. Show numerically that F = t², where t was computed in Problem 2.1.3 for testing β₁ = 0.

2.2 More with Forbes's data An alternative approach to the analysis of Forbes's experiments comes from the Clausius-Clapeyron formula of classical thermodynamics, which dates to Clausius (1850). According to this theory, we should find that

E(Lpres|Temp) = β₀ + β₁(1/Ktemp)   (2.27)

where Ktemp is temperature in degrees Kelvin, which equals 255.37 plus (5/9)Temp. If we were to graph this mean function on a plot of Lpres versus Ktemp, we would get a curve, not a straight line. However, we can estimate the parameters β₀ and β₁ using simple linear regression methods by defining u₁ to be the inverse of temperature in degrees Kelvin,

u₁ = 1/Ktemp = 1/(255.37 + (5/9)Temp)

Then the mean function (2.27) can be rewritten as

E(Lpres|Temp) = β₀ + β₁u₁   (2.28)

for which simple linear regression is suitable. The notation we have used in (2.28) is a little different, as the left side of the equation says we are conditioning on Temp, but the variable Temp does not appear explicitly on the right side of the equation.

2.2.1 Draw the plot of Lpres versus u₁, and verify that, apart from case 12, the 17 points in Forbes's data fall close to a straight line.

2.2.2 Compute the linear regression implied by (2.28), and summarize your results.

2.2.3 We now have two possible models for the same data: the regression of Lpres on Temp used by Forbes, and (2.28) based on the Clausius-Clapeyron formula. To compare these two, draw the plot of the fitted values from Forbes's mean function fit versus the fitted values from (2.28). On the basis of these and any other computations you think might help, is it possible to prefer one approach over the other? Why?

2.2.4 In his original paper, Forbes provided additional data collected by the botanist Dr. Joseph Hooker on temperatures and boiling points measured, often at higher altitudes, in the Himalaya Mountains. The data for n = 31 locations are given in the file hooker.txt. Find the estimated mean function (2.28) for Hooker's data.

2.2.5 This problem is not recommended unless you have access to a package with a programming language, like R, S-plus, Mathematica, or SAS IML. For each of the cases in Hooker's data, compute the predicted values ŷ and the standard error of prediction. Then compute z = (Lpres − ŷ)/sepred. Each of the zs is a random variable, but if the model is correct, each has mean zero and standard deviation close to one. Compute the sample mean and standard deviation of the zs, and summarize the results.

2.2.6 Repeat Problem 2.2.5, but this time predict and compute the z-scores for the 17 cases in Forbes's data, again using the fitted mean function from Hooker's data.
If the mean function for Hooker's data applies to Forbes's data, then each of the z-scores should have zero mean and standard deviation close to one. Compute the z-scores, compare them to those in the last problem, and comment on the results.

2.3 Deviations from the mean Sometimes it is convenient to write the simple linear regression model in a different form that is a little easier to manipulate. Taking equation (2.1), and adding β₁x̄ − β₁x̄, which equals zero, to the

right-hand side, and combining terms, we can write

yᵢ = β₀ + β₁x̄ + β₁xᵢ − β₁x̄ + eᵢ
   = (β₀ + β₁x̄) + β₁(xᵢ − x̄) + eᵢ
   = α + β₁(xᵢ − x̄) + eᵢ   (2.29)

where we have defined α = β₀ + β₁x̄. This is called the deviations from the sample average form for simple regression.

2.3.1 What is the meaning of the parameter α?

2.3.2 Show that the least squares estimates are α̂ = ȳ, with β̂₁ as given by (2.5).

2.3.3 Find expressions for the variances of the estimates and the covariance between them.

2.4 Heights of mothers and daughters

2.4.1 For the heights data in the file heights.txt, compute the regression of Dheight on Mheight, and report the estimates, their standard errors, the value of the coefficient of determination, and the estimate of variance. Give the analysis of variance table that tests the hypothesis that E(Dheight|Mheight) = β₀ versus the alternative that E(Dheight|Mheight) = β₀ + β₁Mheight, and write a sentence or two that summarizes the results of these computations.

2.4.2 Write the mean function in the deviations from the mean form as in Problem 2.3. For this particular problem, give an interpretation for the value of β₁. In particular, discuss the three cases of β₁ = 1, β₁ < 1, and β₁ > 1. Obtain a 99% confidence interval for β₁ from the data.

2.4.3 Obtain a prediction and 99% prediction interval for a daughter whose mother is 64 inches tall.

2.5 Smallmouth bass

2.5.1 Using the West Bearskin Lake smallmouth bass data in the file wblake.txt, obtain 95% intervals for the mean length at ages 2, 4, and 6 years.

2.5.2 Obtain a 95% interval for the mean length at age 9. Explain why this interval is likely to be untrustworthy.

2.5.3 The file wblake2.txt contains all the data for ages one to eight and, in addition, includes a few older fishes. Using the methods we have learned in this chapter, show that the simple linear regression model is not appropriate for this larger data set.

2.6 United Nations data Refer to the UN data in Problem 1.3, page 18.

2.6.1 Using base-ten logarithms, use a software package to compute the simple linear regression model corresponding to the graph in Problem 1.3.3, and get the analysis of variance table.

2.6.2 Draw the summary graph, and add the fitted line to the graph.

2.6.3 Test the hypothesis that the slope is zero versus the alternative that it is negative (a one-sided test). Give the significance level of the test and a sentence that summarizes the result.

2.6.4 Give the value of the coefficient of determination, and explain its meaning.

2.6.5 Increasing log(PPgdp) by one unit is the same as multiplying PPgdp by ten. If two localities differ in PPgdp by a factor of ten, give a 95% confidence interval on the difference in log(Fertility) for these two localities.

2.6.6 For a locality not in the data with PPgdp = 1000, obtain a point prediction and a 95% prediction interval for log(Fertility). If the interval (a, b) is a 95% prediction interval for log(Fertility), then a 95% prediction interval for Fertility is given by (10ᵃ, 10ᵇ). Use this result to get a 95% prediction interval for Fertility.

2.6.7 Identify (1) the locality with the highest value of Fertility; (2) the locality with the lowest value of Fertility; and (3) the two localities with the largest positive residuals from the regression when both variables are in log scale, and the two localities with the largest negative residuals in log scale.

2.7 Regression through the origin Occasionally, a mean function in which the intercept is known a priori to be zero may be fit. This mean function is given by

E(y|x) = β₁x   (2.30)

The residual sum of squares for this model, assuming the errors are independent with common variance σ², is RSS = Σ(yᵢ − β̂₁xᵢ)².

2.7.1 Show that the least squares estimate of β₁ is β̂₁ = Σxᵢyᵢ/Σxᵢ². Show that β̂₁ is unbiased and that Var(β̂₁) = σ²/Σxᵢ². Find an expression for σ̂². How many df does it have?

2.7.2 Derive the analysis of variance table with the larger model given by (2.16), but with the smaller model specified in (2.30).
Show that the F-test derived from this table is numerically equivalent to the square of the t-test (2.23) with β₀* = 0.

2.7.3 The data in Table 2.6 and in the file snake.txt give X = water content of snow on April 1 and Y = water yield from April to July, in inches, in the Snake River watershed in Wyoming for n = 17 years from 1919 to 1935 (from Wilm, 1950).

TABLE 2.6 Snake River Data for Problem 2.7 (columns X and Y; the values are in the file snake.txt)

Fit a regression through the origin and find β̂₁ and σ̂². Obtain a 95% confidence interval for β₁. Test the hypothesis that the intercept is zero.

2.7.4 Plot the residuals versus the fitted values and comment on the adequacy of the mean function with zero intercept. In regression through the origin, Σêᵢ ≠ 0.

2.8 Scale invariance

2.8.1 In the simple regression model (2.1), suppose the value of the predictor X is replaced by cX, where c is some nonzero constant. How are β̂₀, β̂₁, σ̂², R², and the t-test of NH: β₁ = 0 affected by this change?

2.8.2 Suppose each value of the response Y is replaced by dY, for some d ≠ 0. Repeat 2.8.1.

2.9 Using Appendix A.3, verify equation (2.8).

2.10 Zipf's law Suppose we counted the number of times each word was used in the written works by Shakespeare, Alexander Hamilton, or some other author with a substantial written record (Table 2.7). Can we say anything about the frequencies of the most common words? Suppose we let fᵢ be the rate per 1000 words of text for the ith most frequent word used. The linguist George Zipf (1902-1950) observed a law-like relationship between rate and rank (Zipf, 1949),

E(fᵢ|i) = a/iᵇ

and further observed that the exponent is close to b = 1. Taking logarithms of both sides, we get approximately

E(log(fᵢ)|log(i)) = log(a) − b log(i)   (2.31)

TABLE 2.7 The Word Count Data

Word           The word
Hamilton       Rate per 1000 words of this word in the writings of Alexander Hamilton
HamiltonRank   Rank of this word in Hamilton's writings
Madison        Rate per 1000 words of this word in the writings of James Madison
MadisonRank    Rank of this word in Madison's writings
Jay            Rate per 1000 words of this word in the writings of John Jay
JayRank        Rank of this word in Jay's writings
Ulysses        Rate per 1000 words of this word in Ulysses by James Joyce
UlyssesRank    Rank of this word in Ulysses

Zipf's law has been applied to frequencies of many other classes of objects besides words, such as the frequency of visits to web pages on the internet and the frequencies of species of insects in an ecosystem.

The data in MWwords.txt give the frequencies of words in works from four different sources: the political writings of eighteenth-century American political figures Alexander Hamilton, James Madison, and John Jay, and the book Ulysses by twentieth-century Irish writer James Joyce. The data are from Mosteller and Wallace (1964, Table 8.1-1), and give the frequencies of 165 very common words. Several missing values occur in the data; these are really words that were used so infrequently that their count was not reported in Mosteller and Wallace's table.

2.10.1 Using only the 50 most frequent words in Hamilton's work (that is, using only rows in the data for which HamiltonRank ≤ 50), draw the appropriate summary graph, estimate the mean function (2.31), and summarize your results.

2.10.2 Test the hypothesis that b = 1 against the two-sided alternative, and summarize.

2.10.3 Repeat Problem 2.10.1, but for words with rank of 75 or less, and then for words with rank less than 100. For larger numbers of words, Zipf's law may break down. Does that seem to happen with these data?

2.11 For the Ft. Collins snowfall data discussed in Example 1.1, test the hypothesis that the slope is zero versus the alternative that it is not zero.
Show that the t-test of this hypothesis is the same as the F-test; that is, t² = F.

Old Faithful. Use the data from Problem 1.4. Use simple linear regression methodology to obtain a prediction equation for interval from duration. Summarize your results in a way that might be useful for the nontechnical personnel who staff the Old Faithful Visitor's Center.

Construct a 95% confidence interval for E(interval | duration = 250).

An individual has just arrived at the end of an eruption that lasted 250 seconds. Give a 95% confidence interval for the time the individual will have to wait for the next eruption.

Estimate the 0.90 quantile of the conditional distribution of interval | (duration = 250), assuming that the population is normally distributed.

Windmills. Energy can be produced from wind using windmills. Choosing a site for a wind farm, the location of the windmills, can be a multimillion dollar gamble. If wind is inadequate at the site, then the energy produced over the lifetime of the wind farm can be much less than the cost of building and operation. Prediction of long-term wind speed at a candidate site can be an important component in the decision to build or not to build. Since energy produced varies as the square of the wind speed, even small errors can have serious consequences. The data in the file wm1.txt provide measurements that can be used to help in the prediction process. Data were collected every six hours for the year 2002, except that the month of May 2002 is missing. The values CSpd are the calculated wind speeds in meters per second at a candidate site for building a wind farm. These values were collected at a tower erected on the site. The values RSpd are wind speeds at a reference site, which is a nearby location for which wind speeds have been recorded over a very long time period. Airports sometimes serve as reference sites, but in this case the reference data come from the National Center for Environmental Modeling. The reference is about 50 km southwest of the candidate site. Both sites are in the northern part of South Dakota. The data were provided by Mark Ahlstrom and Rolf Miller of WindLogics.

Draw the scatterplot of the response CSpd versus the predictor RSpd. Is the simple linear regression model plausible for these data?
Fit the simple regression of the response on the predictor, and present the appropriate regression summaries.

Obtain a 95% prediction interval for CSpd at a given value of RSpd.

For this problem, we revert to generic notation and let x = RSpd and y = CSpd, and let n be the number of cases used in the regression (n = 1116 in the data we have used in this problem), with x̄ and SXX defined from these n observations. Suppose we want to make predictions at time points with values of wind speed x*1, …, x*m that are different from the n cases used in constructing the prediction equation. Show that (1) the average of the m predictions is equal to the prediction taken at the average value x̄* of the m values of the

predictor, and (2) using the first result, the standard error of the average of the m predictions is

se of average prediction = sqrt( σ̂²/m + σ̂²(1/n + (x̄* − x̄)²/SXX) )    (2.32)

If m is very large, then the first term in the square root is negligible, and the standard error of the average prediction is essentially the same as the standard error of a fitted value at x̄*.

For the period from January 1, 1948 to July 31, 2003, a total of m wind speed measurements are available at the reference site, excluding the data from the year 2002. For these measurements, the average wind speed was x̄*. Give a 95% prediction interval on the long-term average wind speed at the candidate site. This long-term average of the past is then taken as an estimate of the long-term average of the future and can be used to help decide if the candidate is a suitable site for a wind farm.
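Result (1) of the last problem holds because the fitted prediction is linear in x. A minimal Python sketch with invented data (not the windmill file) checks it numerically:

```python
# Sketch (data invented for illustration): fit a simple linear regression
# from summary statistics, then check that the average of predictions at
# several new x-values equals the prediction at their average value.

def fit_simple(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx               # slope, SXY/SXX
    b0 = ybar - b1 * xbar        # intercept
    return b0, b1

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [3.1, 5.0, 5.9, 8.2, 10.1]
b0, b1 = fit_simple(x, y)

new_x = [3.0, 6.0, 12.0]                         # new prediction points
preds = [b0 + b1 * t for t in new_x]
avg_of_preds = sum(preds) / len(preds)
pred_at_avg = b0 + b1 * (sum(new_x) / len(new_x))

# Because each prediction is linear in x, the two quantities agree exactly.
assert abs(avg_of_preds - pred_at_avg) < 1e-9
```

The same linearity is what collapses the variance of the average prediction down to the form in (2.32).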

CHAPTER 3

Multiple Regression

Multiple linear regression generalizes the simple linear regression model by allowing for many terms in a mean function rather than just one intercept and one slope.

3.1 ADDING A TERM TO A SIMPLE LINEAR REGRESSION MODEL

We start with a response Y and the simple linear regression mean function

E(Y | X1 = x1) = β0 + β1x1

Now suppose we have a second variable X2 with which to predict the response. By adding X2 to the problem, we will get a mean function that depends on both the value of X1 and the value of X2,

E(Y | X1 = x1, X2 = x2) = β0 + β1x1 + β2x2    (3.1)

The main idea in adding X2 is to explain the part of Y that has not already been explained by X1.

United Nations Data

We will reconsider the United Nations data discussed in Problem 1.3. To the regression of log(Fertility), the base-two log of the fertility rate, on log(PPgdp), the base-two log of the per-person gross domestic product, we consider adding Purban, the percentage of the population that lives in an urban area. The data in the file UN2.txt give values for these three variables, as well as the name of the Locality, for 193 localities, mostly countries, for which the United Nations provides data. Figure 3.1 presents several graphical views of these data. Figure 3.1a can be viewed as a summary graph for the simple regression of log(Fertility) on log(PPgdp). The fitted mean function using OLS is

Ê(log(Fertility) | log(PPgdp)) = β̂0 + β̂1 log(PPgdp)

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

FIG. 3.1 United Nations data on 193 localities, mostly nations. (a) log(Fertility) versus log(PPgdp); (b) log(Fertility) versus Purban; (c) Purban versus log(PPgdp); (d) Added-variable plot for Purban after log(PPgdp).

with R² = 0.459, so about 46% of the variability in log(Fertility) is explained by log(PPgdp). An increase of one unit in log(PPgdp), which corresponds to a doubling of PPgdp, is estimated to decrease log(Fertility). Similarly, Figure 3.1b is the summary graph for the regression of log(Fertility) on Purban. This simple regression has fitted mean function

Ê(log(Fertility) | Purban) = β̂0 + β̂1 Purban

with R² = 0.348, so Purban explains about 35% of the variability in log(Fertility). An increase of one percent urban implies a change on the average in log(Fertility). To get a summary graph of the regression of log(Fertility) on both log(PPgdp) and Purban would require a three-dimensional plot of these three variables, with log(PPgdp) on one of the horizontal axes, Purban on the other horizontal axis, and log(Fertility) on the vertical axis. Although such plots are possible by using

either perspective or motion to display the third dimension, using them is much more difficult than using two-dimensional graphics, and their successful use is not widespread. Cook and Weisberg (1999a) discuss using motion to understand three-dimensional graphics for regression. As a partial substitute for looking at the full three-dimensional plot, we add a third plot to the first two in Figure 3.1, namely, the plot of Purban versus log(PPgdp) shown in Figure 3.1c. This graph does not include the response, so it only shows the relationship between the two potential predictors. In this problem, these two variables are positively correlated, and the mean function for Figure 3.1c seems to be well approximated by a straight line. The inference to draw from Figure 3.1c is that, to the extent that Purban can be predicted by log(PPgdp), these two potential predictors are measuring the same thing, and so the roles of these two variables in predicting log(Fertility) will be overlapping, and they will both, to some extent, be explaining the same variability.

3.1.1 Explaining Variability

Given these graphs, what can be said about the proportion of variability in log(Fertility) explained by log(PPgdp) and Purban? We can say that the total explained variation must exceed 46 percent, the larger of the two values explained by each variable separately, since using both log(PPgdp) and Purban must surely be at least as informative as using just one of them. The total variation will be additive, 46% + 35% = 81%, only if the two variables are completely unrelated and measure different things. The total can be less than the sum if the terms are related and are at least in part explaining the same variation. Finally, the total can exceed the sum if the two variables act jointly so that knowing both gives more information than knowing just one of them.
For example, the area of a rectangle may be only poorly determined by either the length or width alone, but if both are considered at the same time, area can be determined exactly. It is precisely this inability to predict the joint relationship from the marginal relationships that makes multiple regression rich and complicated.

3.1.2 Added-Variable Plots

The unique effect of adding Purban to a mean function that already includes log(PPgdp) is determined by the relationship between the part of log(Fertility) that is not explained by log(PPgdp) and the part of Purban that is not explained by log(PPgdp). The unexplained parts are just the residuals from these two simple regressions, and so we need to examine the scatterplot of the residuals from the regression of log(Fertility) on log(PPgdp) versus the residuals from the regression of Purban on log(PPgdp). This plot is shown in Figure 3.1d. Figure 3.1b is the summary graph for the relationship between log(Fertility) and Purban ignoring log(PPgdp), while Figure 3.1d shows this relationship, but after adjusting for log(PPgdp). If Figure 3.1d shows a stronger relationship than does Figure 3.1b, meaning that the points in the plot show less variation about the fitted straight line,

then the two variables act jointly to explain extra variation, while if the relationship is weaker, or the plot exhibits more variation, then the total explained variability is less than the additive amount. The latter seems to be the case here. If we fit the simple regression mean function to Figure 3.1d, the fitted line has zero intercept, since the averages of the two plotted variables are zero, and the estimated slope via OLS is β̂2. It turns out that this is exactly the estimate β̂2 that would be obtained using OLS to get the estimates for the mean function (3.1). Figure 3.1d is called an added-variable plot.

We now have two estimates of the coefficient β2 for Purban: one from the simple regression ignoring log(PPgdp), and one adjusting for log(PPgdp). While both of these indicate that more urbanization is associated with lower fertility, adjusting for log(PPgdp) suggests that the magnitude of this effect is only about one-fourth as large as one might think if log(PPgdp) were ignored. In other problems, slope estimates for the same term but from different mean functions can be even more wildly different, changing signs, magnitude, and significance. This naturally complicates the interpretation of fitted models, and also the comparison between studies fit with even slightly different mean functions. To get the coefficient estimate for log(PPgdp) in the regression of log(Fertility) on both predictors, we would use the same procedure we used for Purban and consider the problem of adding log(PPgdp) to a mean function that already includes Purban. This would require looking at the graph of the residuals from the regression of log(Fertility) on Purban versus the residuals from the regression of log(PPgdp) on Purban (see Problem 3.2).

3.2 THE MULTIPLE LINEAR REGRESSION MODEL

The general multiple linear regression model with response Y and terms X1, …, Xp will have the form

E(Y | X) = β0 + β1X1 + ⋯ + βpXp    (3.2)

The symbol X in E(Y | X) means that we are conditioning on all the terms on the right side of the equation.
Similarly, when we are conditioning on specific values for the predictors x1, …, xp that we will collectively call x, we write

E(Y | X = x) = β0 + β1x1 + ⋯ + βpxp    (3.3)

As in Chapter 2, the βs are unknown parameters we need to estimate. Equation (3.2) is a linear function of the parameters, which is why this is called linear regression. When p = 1, X has only one element, and we get the simple regression problem discussed in Chapter 2. When p = 2, the mean function (3.2) corresponds

FIG. 3.2 A linear regression surface with p = 2 predictors.

to a plane in three dimensions, as shown in Figure 3.2. When p > 2, the fitted mean function is a hyperplane, the generalization of a p-dimensional plane in a (p + 1)-dimensional space. We cannot draw a general p-dimensional plane in our three-dimensional world.

3.3 TERMS AND PREDICTORS

Regression problems start with a collection of potential predictors. Some of these may be continuous measurements, like the height or weight of an object. Some may be discrete but ordered, like a doctor's rating of the overall health of a patient on a nine-point scale. Other potential predictors can be categorical, like eye color or an indicator of whether a particular unit received a treatment. All these types of potential predictors can be useful in multiple linear regression. From the pool of potential predictors, we create a set of terms that are the X-variables that appear in (3.2). The terms might include:

The intercept. The mean function (3.2) can be rewritten as E(Y | X) = β0X0 + β1X1 + ⋯ + βpXp, where X0 is a term that is always equal to one. Mean functions without an intercept would not have this term included.

Predictors. The simplest type of term is equal to one of the predictors, for example, the variable Mheight in the heights data.

Transformations of predictors. Sometimes the original predictors need to be transformed in some way to make (3.2) hold to a reasonable approximation. This was the case with the UN data just discussed, in which PPgdp was

used in log scale. The willingness to replace predictors by transformations of them greatly expands the range of problems that can be summarized with a linear regression model.

Polynomials. Problems with curved mean functions can sometimes be accommodated in the multiple linear regression model by including polynomial terms in the predictor variables. For example, we might include as terms both a predictor X1 and its square X1² to fit a quadratic polynomial in that predictor. Complex polynomial surfaces in several predictors can be useful in some problems.¹

Interactions and other combinations of predictors. Combining several predictors is often useful. An example of this is using body mass index, given by weight divided by height squared, in place of both height and weight, or using a total test score in place of the separate scores from each of several parts. Products of predictors called interactions are often included in a mean function along with the original predictors to allow for joint effects of two or more variables.

Dummy variables and factors. A categorical predictor with two or more levels is called a factor. Factors are included in multiple linear regression using dummy variables, which are typically terms that have only two values, often zero and one, indicating which category is present for a particular observation. We will see in Chapter 6 that a categorical predictor with two categories can be represented by one dummy variable, while a categorical predictor with many categories can require several dummy variables.

A regression with, say, k predictors may combine to give fewer than k terms or expand to require more than k terms. The distinction between predictors and terms can be very helpful in thinking about an appropriate mean function to use in a particular problem, and in using graphs to understand a problem. For example, a regression with one predictor can always be studied using the 2D scatterplot of the response versus the predictor, regardless of the number of terms required in the mean function.
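The distinction between predictors and terms can be made concrete in code. A small Python sketch (the variable names and data here are hypothetical, not from the book's files) builds the columns X0, …, Xp of (3.2) from one continuous predictor and one two-level factor:

```python
# Sketch (hypothetical data): expanding predictors into regression terms:
# an intercept, the predictor itself, a log transformation, a polynomial
# term, and a dummy variable for a two-level factor.
import math

heights = [1.6, 1.7, 1.8]                 # a continuous predictor
groups = ["treated", "control", "treated"]  # a two-level factor

rows = []
for h, g in zip(heights, groups):
    rows.append([
        1.0,                              # X0: the intercept term
        h,                                # a predictor used as-is
        math.log(h),                      # a transformation of a predictor
        h * h,                            # polynomial term for a quadratic
        1.0 if g == "treated" else 0.0,   # dummy variable for the factor
    ])

# Two predictors have expanded into five terms (including the intercept).
assert all(len(r) == 5 for r in rows)
```

The same two raw predictors could just as well collapse into a single term, as in the body mass index example, which is why counting predictors and counting terms are different activities.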
We will use the fuel consumption data introduced in Section 1.6 as the primary example for the rest of this chapter. As discussed earlier, the goal is to understand how fuel consumption varies as a function of state characteristics. The variables are defined in Table 1.2 and are given in the file fuel2001.txt. From the six initial predictors, we use a set of four combinations to define terms in the regression mean function. Basic summary statistics for the relevant variables in the fuel data are given in Table 3.1, and these begin to give us a bit of a picture of these data. First, there is quite a bit of variation in Fuel, with values between a minimum of about 626 gallons per year and a maximum of about 843 gallons per year. The gas Tax varies

¹ This discussion of polynomials might puzzle some readers because in Section 3.2, we said the linear regression mean function was a hyperplane, but here we have said that it might be curved, seemingly a contradiction. However, both of these statements are correct. If we fit a mean function like E(Y | X = x) = β0 + β1x + β2x², the mean function is a quadratic curve in the plot of the response versus x but a plane in the three-dimensional plot of the response versus x and x².

TABLE 3.1 Summary Statistics for the Fuel Data (columns Variable, N, Average, Std Dev, Minimum, Median, Maximum, for Tax, Dlic, Income, logMiles, and Fuel)

from only 7.5 cents per gallon to a high of 29 cents per gallon, so, unlike much of the world, gasoline taxes account for only a small part of the cost to consumers of gasoline. Also of interest is the range of values in Dlic: the number of licensed drivers per 1000 population over the age of 16 is between about 700 and, in a few states, more than 1000. Some states appear to have more licensed drivers than they have population over age 16. Either these states allow drivers under the age of 16, allow nonresidents to obtain a driver's license, or the data are in error. For this example, we will assume one of the first two reasons. Of course, these univariate summaries cannot tell us much about how the fuel consumption depends on the other variables. For this, graphs are very helpful. The scatterplot matrix for the fuel data is repeated in Figure 3.3. From our previous

FIG. 3.3 Scatterplot matrix for the fuel data (panels for Tax, Dlic, Income, logMiles, and Fuel).

TABLE 3.2 Sample Correlations for the Fuel Data (pairwise sample correlations among Tax, Dlic, Income, logMiles, and Fuel)

discussion, Fuel decreases on the average as Tax increases, but there is a lot of variation. We can make similar qualitative judgments about each of the regressions of Fuel on the other variables. The overall impression is that Fuel is at best weakly related to each of the variables in the scatterplot matrix, and in turn these variables are only weakly related to each other. Does this help us understand how Fuel is related to all four predictors simultaneously? We know from the discussion in Section 3.1 that the marginal relationships between the response and each of the variables are not sufficient to understand the joint relationship between the response and the terms. The interrelationships among the terms are also important. The pairwise relationships between the terms can be viewed in the remaining cells of the scatterplot matrix. In Figure 3.3, the relationships between all pairs of terms appear to be very weak, suggesting that for this problem the marginal plots including Fuel are quite informative about the multiple regression problem. A more traditional, and less informative, summary of the two-variable relationships is the matrix of sample correlations, shown in Table 3.2. In this instance, the correlation matrix helps to reinforce the relationships we see in the scatterplot matrix, with fairly small correlations between the predictors and Fuel, and essentially no correlation between the predictors themselves.

3.4 ORDINARY LEAST SQUARES

From the initial collection of potential predictors, we have computed a set of p + 1 terms, including an intercept, X = (X0, X1, …, Xp).
The mean function and variance function for multiple linear regression are

E(Y | X) = β0 + β1X1 + ⋯ + βpXp    (3.4)
Var(Y | X) = σ²

Both the βs and σ² are unknown parameters that need to be estimated.

3.4.1 Data and Matrix Notation

Suppose we have observed data for n cases or units, meaning we have a value of Y and all of the terms for each of the n cases. We have symbols for the response and

the terms using matrices and vectors; see Appendix A.6 for a brief introduction. We define

Y = (y1, y2, …, yn)'    X = the matrix with ith row (1, xi1, …, xip), i = 1, …, n    (3.5)

so Y is an n × 1 vector and X is an n × (p + 1) matrix. We also define β = (β0, β1, …, βp)' to be a (p + 1) × 1 vector of regression coefficients and e = (e1, e2, …, en)' to be the n × 1 vector of statistical errors. The matrix X gives all of the observed values of the terms. The ith row of X will be defined by the symbol x'i, which is a (p + 1) × 1 vector for mean functions that include an intercept. Even though xi is a row of X, we use the convention that all vectors are column vectors and therefore need to write x'i to represent a row. An equation for the mean function evaluated at xi is

E(Y | X = xi) = x'iβ = β0 + β1xi1 + ⋯ + βpxip    (3.6)

In matrix notation, we will write the multiple linear regression model as

Y = Xβ + e    (3.7)

The ith row of (3.7) is yi = x'iβ + ei. For the fuel data, the terms in X are in the order intercept, Tax, Dlic, Income, and finally log(Miles). The matrix X is 51 × 5 and Y is 51 × 1.

3.4.2 Variance-Covariance Matrix of e

The 51 × 1 error vector is an unobservable random vector, as in Appendix A.6. The assumptions concerning the ei given in Chapter 2 are summarized in matrix form as

E(e) = 0    Var(e) = σ²In

where Var(e) means the covariance matrix of e, In is the n × n matrix with ones on the diagonal and zeroes everywhere else, and 0 is a matrix or vector of zeroes of appropriate size. If we add the assumption of normality, we can write e ~ N(0, σ²In).

3.4.3 Ordinary Least Squares Estimators

The least squares estimate β̂ of β is chosen to minimize the residual sum of squares function

RSS(β) = Σ(yi − x'iβ)² = (Y − Xβ)'(Y − Xβ)    (3.8)

The OLS estimates can be found from (3.8) by differentiation in a matrix analog to the development of Appendix A.3. The OLS estimate is given by the formula

β̂ = (X'X)⁻¹X'Y    (3.9)

provided that the inverse (X'X)⁻¹ exists. The estimator β̂ depends only on the sufficient statistics X'X and X'Y, which are matrices of uncorrected sums of squares and cross-products. Do not compute the least squares estimates using (3.9)! Uncorrected sums of squares and cross-products are prone to large rounding error, and so computations can be highly inaccurate. The preferred computational methods are based on matrix decompositions, as briefly outlined in Appendix A.8. At the very least, computations should be based on corrected sums of squares and cross-products. Suppose we define X* to be the n × p matrix with ith row ((xi1 − x̄1), …, (xip − x̄p)). This matrix consists of the original X matrix, but with the first column removed and the column mean subtracted from each of the remaining columns. Similarly, Y* is the vector with typical elements yi − ȳ. Then

C = (1/(n − 1)) [ X*'X*  X*'Y* ; Y*'X*  Y*'Y* ]    (3.10)

is the matrix of sample variances and covariances. When p = 1, the matrix C is given by

C = (1/(n − 1)) [ SXX  SXY ; SXY  SYY ]

The elements of C are the summary statistics needed for OLS computations in simple linear regression. If we let β* be the parameter vector excluding the intercept β0, then for p ≥ 1,

β̂* = (X*'X*)⁻¹X*'Y*    β̂0 = ȳ − β̂*'x̄*

where x̄* is the vector of sample means for all the terms except for the intercept. Once β̂ is computed, we can define several related quantities. The fitted values are Ŷ = Xβ̂ and the residuals are ê = Y − Ŷ. The function (3.8) evaluated at β̂ is the residual sum of squares, or RSS,

RSS = ê'ê = (Y − Xβ̂)'(Y − Xβ̂)    (3.11)

3.4.4 Properties of the Estimates

Additional properties of the OLS estimates are derived in Appendix A.8 and are only summarized here. Assuming that E(e) = 0 and Var(e) = σ²In, then β̂ is unbiased, E(β̂) = β, and

Var(β̂) = σ²(X'X)⁻¹    (3.12)

Excluding the intercept term,

Var(β̂*) = σ²(X*'X*)⁻¹    (3.13)

and so (X*'X*)⁻¹ is all but the first row and column of (X'X)⁻¹. An estimate of σ² is given by

σ̂² = RSS/(n − (p + 1))    (3.14)

which is the residual sum of squares divided by its df = n − (p + 1). Several formulas for RSS can be computed by substituting the value of β̂ into (3.11) and simplifying:

RSS = Y'Y − β̂'(X'X)β̂
    = Y'Y − β̂'X'Y    (3.15)
    = Y*'Y* − β̂*'(X*'X*)β̂*

where the last form uses Y*'Y* = Y'Y − nȳ².

Recognizing that Y*'Y* = SYY, (3.15) has the nicest interpretation, as it writes RSS as equal to the total sum of squares minus a quantity we will call the regression sum of squares, or SSreg. In addition, if e is normally distributed, then the residual sum of squares has a chi-squared distribution,

(n − (p + 1))σ̂²/σ² ~ χ²(n − (p + 1))

By substituting σ̂² for σ² in (3.12), we find the estimated variance of β̂ to be

V̂ar(β̂) = σ̂²(X'X)⁻¹    (3.16)

3.4.5 Simple Regression in Matrix Terms

For simple regression, X is the n × 2 matrix with rows (1, xi) and Y = (y1, …, yn)', and thus

X'X = [ n  Σxi ; Σxi  Σxi² ]    X'Y = [ Σyi ; Σxiyi ]

By direct multiplication, (X'X)⁻¹ can be shown to be

(X'X)⁻¹ = (1/SXX) [ Σxi²/n  −x̄ ; −x̄  1 ]

so that

β̂ = (β̂0, β̂1)' = (X'X)⁻¹X'Y = ( ȳ − (SXY/SXX)x̄ ; SXY/SXX )    (3.17)

as found previously. Also, since Σxi²/(n SXX) = 1/n + x̄²/SXX, the variances and covariances for β̂0 and β̂1 found in Chapter 2 are identical to those given by σ²(X'X)⁻¹. The results are simpler in the deviations-from-the-sample-average form, since X*'X* = SXX, X*'Y* = SXY,

and

β̂1 = (X*'X*)⁻¹X*'Y* = SXY/SXX    β̂0 = ȳ − β̂1x̄

3.4.6 Fuel Consumption Data

We will generally let p equal the number of terms in a mean function excluding the intercept, and p′ = p + 1 if the intercept is included; p′ = p if the intercept is not included. We shall now fit the mean function with p′ = 5 terms, including the intercept, for the fuel consumption data. Continuing a practice we have already begun, we will write

Fuel on Tax Dlic Income log(Miles)

as shorthand for using OLS to fit the multiple linear regression model with mean function

E(Fuel | X) = β0 + β1Tax + β2Dlic + β3Income + β4log(Miles)

where conditioning on X is short for conditioning on all the terms in the mean function. All the computations are based on the summary statistics, which are the sample means given in Table 3.1 and the sample covariance matrix C defined at (3.10), a 5 × 5 table of the sample variances and covariances among Tax, Dlic, Income, logMiles, and Fuel. Most statistical software will give the sample correlations rather than the covariances. The reader can verify that the correlations in Table 3.2 can be obtained from these covariances; for example, the sample correlation between Tax and Income is their sample covariance divided by the product of their sample standard deviations. One can also convert back from correlations and sample variances to covariances; the square roots of the sample variances are given in Table 3.1. The 5 × 5 matrix (X'X)⁻¹ has rows and columns indexed by Intercept, Tax, Dlic, Income, and logMiles. The elements of (X'X)⁻¹ often differ by several orders of magnitude, as is the case here. It is the combining of these numbers of very different magnitude that can lead to numerical inaccuracies in computations.
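The matrix formulas above can be checked on a tiny example. The following Python sketch (data invented) forms β̂ = (X'X)⁻¹X'Y for p = 1, where the 2 × 2 inverse is explicit, and confirms that it matches the summary-statistic form and the RSS identity in (3.15); on an exact small problem the rounding-error warning about (3.9) does not bite:

```python
# Sketch (invented data): OLS via the normal equations for p = 1,
# using the explicit 2x2 inverse of X'X, then cross-checked against
# betahat1 = SXY/SXX, betahat0 = ybar - betahat1*xbar, and
# RSS = Y'Y - betahat'X'Y.

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)

# X'X = [[n, sum x], [sum x, sum x^2]],  X'Y = [sum y, sum xy]
sx, sxx_raw = sum(x), sum(t * t for t in x)
sy, sxy_raw = sum(y), sum(t * u for t, u in zip(x, y))
det = n * sxx_raw - sx * sx          # determinant of X'X, equals n*SXX
b0 = (sxx_raw * sy - sx * sxy_raw) / det
b1 = (n * sxy_raw - sx * sy) / det

# agreement with the summary-statistic formulas
xbar, ybar = sx / n, sy / n
SXX = sum((t - xbar) ** 2 for t in x)
SXY = sum((t - xbar) * (u - ybar) for t, u in zip(x, y))
assert abs(b1 - SXY / SXX) < 1e-12
assert abs(b0 - (ybar - b1 * xbar)) < 1e-12

# residual sum of squares two ways, as in (3.11) and (3.15)
rss_resid = sum((u - (b0 + b1 * t)) ** 2 for t, u in zip(x, y))
rss_form = sum(u * u for u in y) - (b0 * sy + b1 * sxy_raw)
assert abs(rss_resid - rss_form) < 1e-9
```

For larger or nearly collinear problems, the warning stands: software uses QR-type decompositions rather than inverting X'X.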

The lower-right 4 × 4 sub-matrix of (X'X)⁻¹ is (X*'X*)⁻¹. Using the formulas based on corrected sums of squares in this chapter, the estimate β̂* = (β̂1, β̂2, β̂3, β̂4)' is computed as

β̂* = (X*'X*)⁻¹X*'Y*

The estimated intercept is

β̂0 = ȳ − β̂*'x̄*

and the residual sum of squares is

RSS = Y*'Y* − β̂*'(X*'X*)β̂* = 193,700

so the estimate of σ² is

σ̂² = RSS/(n − (p + 1)) = 193,700/46 = 4211

Standard errors and estimated covariances of the β̂j are found by multiplying σ̂ by the square roots of elements of (X'X)⁻¹; for example, se(β̂2) is σ̂ times the square root of the diagonal element of (X'X)⁻¹ for Dlic. Virtually all statistical software packages include higher-level functions that will fit multiple regression models, but getting intermediate results like (X'X)⁻¹ may be a challenge. Table 3.3 shows typical output from a statistical package. This output gives the estimates β̂ and their standard errors computed based on σ̂² and the

TABLE 3.3 Edited Output from the Summary Method in R for Multiple Regression in the Fuel Data

Coefficients: Estimate, Std. Error, t value, and Pr(>|t|) for (Intercept), Tax, Dlic, Income, and logMiles
Residual standard error: 64.9 on 46 degrees of freedom
Multiple R-Squared: about 0.5
F-statistic on 4 and 46 DF, p-value: 9.33e-07

diagonal elements of (X'X)⁻¹. The column marked "t value" is the ratio of the estimate to its standard error. The column labelled "Pr(>|t|)" will be discussed shortly. Below the table are a number of other summary statistics; at this point, only the estimate of σ, called the residual standard error, and its df are familiar.

3.5 THE ANALYSIS OF VARIANCE

For multiple regression, the analysis of variance is a very rich technique that is used to compare mean functions that include different nested sets of terms. In the overall analysis of variance, the mean function with all the terms

E(Y | X = x) = β'x    (3.18)

is compared with the mean function that includes only an intercept,

E(Y | X = x) = β0    (3.19)

For simple regression, these correspond to (2.16) and (2.13), respectively. For mean function (3.19), β̂0 = ȳ and the residual sum of squares is SYY. For mean function (3.18), the estimate of β is given by (3.9) and RSS is given in (3.11). We must have RSS < SYY, and the difference between these two,

SSreg = SYY − RSS    (3.20)

corresponds to the sum of squares of Y explained by the larger mean function that is not explained by the smaller mean function. The number of df associated with SSreg is equal to the number of df in SYY minus the number of df in RSS, which equals p, the number of terms in the mean function excluding the intercept. These results are summarized in the analysis of variance table in Table 3.4. We can judge the importance of the regression on the terms in the larger model by determining if SSreg is sufficiently large by comparing the ratio of the mean square for regression to σ̂² to the F(p, n − p′) distribution² to get a significance

TABLE 3.4 The Overall Analysis of Variance Table

Source      df           SS     MS                        F
Regression  p            SSreg  MSreg = SSreg/p           MSreg/σ̂²
Residual    n − (p + 1)  RSS    σ̂² = RSS/(n − (p + 1))
Total       n − 1        SYY

² Reminder: p′ = p for mean functions with no intercept, and p′ = p + 1 for mean functions with an intercept.

level. If the computed significance level is small enough, then we would judge that the mean function (3.18) provides a significantly better fit than does (3.19). The ratio will have an exact F distribution if the errors are normal and (3.19) is true. The hypothesis tested by this F-test is

NH: E(Y | X = x) = β0
AH: E(Y | X = x) = x'β

3.5.1 The Coefficient of Determination

As with simple regression, the ratio

R² = SSreg/SYY = 1 − RSS/SYY    (3.21)

gives the proportion of variability in Y explained by regression on the terms. R² can also be shown to be the square of the correlation between the observed values Y and the fitted values Ŷ; we will explore this further in the next chapter. R² is also called the multiple correlation coefficient because it is the maximum of the correlation between Y and any linear combination of the terms in the mean function.

Fuel Consumption Data

The overall analysis of variance table has columns Df, Sum Sq, Mean Sq, F value, and Pr(>F), and rows for Regression (4 df), Residuals (46 df, with sum of squares 193,700 and mean square 4211), and Total (50 df). To get a significance level for the test, we would compare the F value with the F(4, 46) distribution. Most computer packages do this automatically, and the result, shown in the column marked Pr(>F), is about 9.3 × 10⁻⁷, a very small number, leading to very strong evidence against the null hypothesis that the mean of Fuel does not depend on any of the terms. The value of R² indicates that about half the variation in Fuel is explained by the terms. The value of F, its significance level, and the value of R² are given in Table 3.3.

3.5.2 Hypotheses Concerning One of the Terms

Obtaining information on one of the terms may be of interest. Can we do as well at understanding the mean function for Fuel if we delete the Tax variable? This amounts to the following hypothesis test of

NH: β1 = 0, β0, β2, β3, β4 arbitrary
AH: β1 ≠ 0, β0, β2, β3, β4 arbitrary    (3.22)

The following procedure can be used. First, fit the mean function that excludes the term for Tax and get the residual sum of squares for this smaller mean function.

Then fit again, this time including Tax, and once again get the residual sum of squares. Subtracting the residual sum of squares for the larger mean function from the residual sum of squares for the smaller mean function will give the sum of squares for regression on Tax after adjusting for the terms that are in both mean functions, Dlic, Income, and log(Miles). Here is a summary of the computations that are needed: a table with columns Df, SS, MS, F, and Pr(>F), and rows marked "Excluding Tax", "Including Tax", and "Difference". The row marked "Excluding Tax" gives the df and RSS for the mean function without Tax, and the next line gives these values for the larger mean function including Tax. The difference between these two, given on the next line, is the sum of squares explained by Tax after adjusting for the other terms in the mean function, 18,264 with 1 df. The F-test is given by F = (18,264/1)/σ̂² = 4.34, which, when compared to the F distribution with (1, 46) df, gives a significance level of about 0.04. We thus have modest evidence that the coefficient for Tax is different from zero. This is called a partial F-test. Partial F-tests can be generalized to testing several coefficients to be zero, but we delay that generalization until later.

3.5.3 Relationship to the t-Statistic

Another reasonable procedure for testing the importance of Tax is simply to compare the estimate of the coefficient divided by its standard error to the t(n − p′) distribution. One can show that the square of this t-statistic is the same number as the F-statistic just computed, so these two procedures are identical. Therefore, the t-statistic tests hypothesis (3.22) concerning the importance of terms adjusted for all the other terms, not ignoring them. From Table 3.3, the t-statistic for Tax is t = −2.083, and t² = (−2.083)² = 4.34, the same as the F-statistic we just computed. The significance level for Tax given in Table 3.3 also agrees with the significance level we just obtained for the F-test, and so the significance level reported is for the two-sided test.
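Both constructions in this section can be checked numerically. A Python sketch with invented data and p = 1, so that the single term's partial F is also the overall F of Table 3.4:

```python
# Sketch (invented data, p = 1): the overall analysis of variance,
# SSreg = SYY - RSS, F = (SSreg/p)/sigma2hat, R^2 = SSreg/SYY, and a
# numerical check of the identity t^2 = F for a single coefficient.
import math

x = [1.0, 3.0, 4.0, 6.0, 8.0]
y = [2.0, 4.5, 5.2, 7.9, 9.6]
n, p = len(x), 1

xbar, ybar = sum(x) / n, sum(y) / n
SXX = sum((t - xbar) ** 2 for t in x)
SXY = sum((t - xbar) * (u - ybar) for t, u in zip(x, y))
SYY = sum((u - ybar) ** 2 for u in y)

b1 = SXY / SXX
b0 = ybar - b1 * xbar
RSS = sum((u - (b0 + b1 * t)) ** 2 for t, u in zip(x, y))
sigma2 = RSS / (n - (p + 1))       # sigmahat^2 on n - (p+1) df

# overall analysis of variance quantities
SSreg = SYY - RSS
F = (SSreg / p) / sigma2
R2 = SSreg / SYY

# t-statistic for the slope: estimate over its standard error
t_stat = b1 / math.sqrt(sigma2 / SXX)
assert abs(t_stat ** 2 - F) < 1e-9
assert 0.0 < R2 < 1.0
```

In the fuel data the same computation runs through the 5-term fit, but the algebra behind t² = F is unchanged.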
To test the hypothesis that β1 = 0 against the one-sided alternative that β1 < 0, we could again use the same t-value, but the significance level would be one-half of the value for the two-sided test. A t-test that βj has a specific value versus a two-sided or one-sided alternative (with all other coefficients arbitrary) can be carried out as described for simple regression in Chapter 2.

t-Tests and Added-Variable Plots

In Section 3.1, we discussed adding a term to a simple regression mean function. The same general procedure can be used to add a term to any linear regression mean function. For the added-variable plot for a term, say X1, plot the residuals from the regression of Y on all the other Xs versus the residuals from the regression

of X1 on all the other Xs. One can show (Problem 3.2) that (1) the slope of the regression in the added-variable plot is the estimated coefficient for X1 in the regression with all the terms, and (2) the t-test for testing the slope to be zero in the added-variable plot is essentially the same as the t-test for testing β1 = 0 in the fit of the larger mean function, the only difference being a correction for degrees of freedom.

Other Tests of Hypotheses

We have obtained a test of a hypothesis concerning the effect of Tax adjusted for all the other terms in the mean function. Equally well, we could obtain tests for the effect of Tax adjusting for some of the other terms, or for none of the other terms. In general, these tests will not be equivalent, and a variable can be judged useful ignoring other variables but useless when adjusted for them. Furthermore, a predictor that is useless by itself may become important when considered in concert with the other variables. The outcome of these tests depends on the sample correlations between the terms.

Sequential Analysis of Variance Tables

By separating Tax from the other terms, SSreg is divided into two pieces, one for fitting the first three terms, and one for fitting Tax after the other three. This subdivision can be continued by dividing SSreg into a sum of squares explained by each term separately. Unless all the terms are uncorrelated, this breakdown is not unique. For example, we could first fit Dlic, then Tax adjusted for Dlic, then Income adjusted for both Dlic and Tax, and finally log(Miles) adjusted for the other three. The resulting table is given in Table 3.5a. Alternatively, we could fit in the order log(Miles), Income, Dlic and then Tax, as in Table 3.5b. The sums of squares can be quite different in the two tables. For example, the sum of squares for Dlic ignoring the other terms is about 25% larger than the sum of squares for Dlic adjusting for the other terms. In this problem, the terms are nearly uncorrelated, see Table 3.2, so the effect of ordering is relatively minor.
In problems with high sample correlations between terms, order can be very important.

TABLE 3.5 Two Analysis of Variance Tables with Different Orders of Fitting

(a) First analysis
             Df   Sum Sq   Mean Sq
Dlic          1
Tax           1
Income        1
log(Miles)    1
Residuals    46

(b) Second analysis
             Df   Sum Sq   Mean Sq
log(Miles)    1
Income        1
Dlic          1
Tax           1
Residuals    46
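The order dependence of sequential analysis of variance tables can be seen in a small simulation. This sketch is not from the book; it uses two made-up correlated terms in place of Dlic, Tax, and the others:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
# Hypothetical correlated terms (the real fuel-data terms are only mildly correlated).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
y = 2.0 + x1 + x2 + rng.normal(size=n)

def rss(*cols):
    """Residual sum of squares for the regression of y on an intercept plus cols."""
    X = np.column_stack([np.ones(n)] + list(cols))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

syy = rss()                          # total sum of squares (intercept only)
ss_x1_first = syy - rss(x1)          # SS for x1 ignoring x2
ss_x1_last = rss(x2) - rss(x1, x2)   # SS for x1 adjusted for x2

# The two sums of squares differ because x1 and x2 are correlated,
# but either fitting order decomposes the same total SSreg = syy - rss(x1, x2).
print(ss_x1_first, ss_x1_last)
```

With uncorrelated terms the two sums of squares would agree, which is why ordering matters so little in the fuel data.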

3.6 PREDICTIONS AND FITTED VALUES

Suppose we have observed, or will in the future observe, a new case with its own set of predictors that result in a vector of terms x∗. We would like to predict the value of the response given x∗. In exactly the same way as was done in simple regression, the point prediction is ỹ∗ = x∗′β̂, and the standard error of prediction, sepred(ỹ∗|x∗), using Appendix A.8, is

sepred(ỹ∗|x∗) = σ̂ sqrt(1 + x∗′(X′X)⁻¹x∗)   (3.23)

Similarly, the estimated average of all possible units with a value x for the terms is given by the estimated mean function at x, Ê(Y|X = x) = ŷ = x′β̂, with standard error given by

sefit(ŷ|x) = σ̂ sqrt(x′(X′X)⁻¹x)   (3.24)

Virtually all software packages will give the user access to the fitted values, but getting the standard error of prediction and of the fitted value may be harder. If a program produces sefit but not sepred, the latter can be computed from the former from the result

sepred(ỹ∗|x∗) = sqrt(σ̂² + sefit(ỹ∗|x∗)²)

PROBLEMS

3.1 Berkeley Guidance Study  The Berkeley Guidance Study enrolled children born in Berkeley, California, between January 1928 and June 1929, and then measured them periodically until age eighteen (Tuddenham and Snyder, 1954). The data we use is described in Table 3.6, and the data is given in the data files BGSgirls.txt for girls only, BGSboys.txt for boys only, and BGSall.txt for boys and girls combined. For this example, use only the data on the girls.

3.1.1 For the girls only, draw the scatterplot matrix of all the age two variables, all the age nine variables and Soma. Write a summary of the information in this scatterplot matrix. Also obtain the matrix of sample correlations between the height variables.

3.1.2 Starting with the mean function E(Soma|WT9) = β0 + β1WT9, use added-variable plots to explore adding LG9 to get the mean function E(Soma|WT9, LG9) = β0 + β1WT9 + β2LG9.
In particular, obtain the four plots equivalent to Figure 3.1, and summarize the information in the plots.

3.1.3 Fit the multiple linear regression model with mean function

E(Soma|X) = β0 + β1HT2 + β2WT2 + β3HT9 + β4WT9 + β5ST9   (3.25)

TABLE 3.6 Variable Definitions for the Berkeley Guidance Study in the Files BGSgirls.txt, BGSboys.txt, and BGSall.txt

Variable   Description
Sex        0 for males, 1 for females
WT2        Age 2 weight, kg
HT2        Age 2 height, cm
WT9        Age 9 weight, kg
HT9        Age 9 height, cm
LG9        Age 9 leg circumference, cm
ST9        Age 9 strength, kg
WT18       Age 18 weight, kg
HT18       Age 18 height, cm
LG18       Age 18 leg circumference, cm
ST18       Age 18 strength, kg
Soma       Somatotype, a scale from 1, very thin, to 7, obese, of body type

Find σ̂, R², the overall analysis of variance table and overall F-test. Compute the t-statistics to be used to test each of the βj to be zero against two-sided alternatives. Explicitly state the hypotheses tested and the conclusions.

3.1.4 Obtain the sequential analysis of variance table for fitting the variables in the order they are given in (3.25). State the hypotheses tested and the conclusions for each of the tests.

3.1.5 Obtain analysis of variance again, this time fitting with the five terms in the order given from right to left in (3.25). Explain the differences with the table you obtained in Problem 3.1.4. What graphs could help understand the issues?

3.2 Added-variable plots  This problem uses the United Nations example in Section 3.1 to demonstrate many of the properties of added-variable plots. This problem is based on the mean function

E(log(Fertility)|log(PPgdp) = x1, Purban = x2) = β0 + β1x1 + β2x2

There is nothing special about the two-predictor regression mean function, but we are using this case for simplicity.

3.2.1 Show that the estimated coefficient for log(PPgdp) is the same as the estimated slope in the added-variable plot for log(PPgdp) after Purban. This correctly suggests that all the estimates in a multiple linear regression model are adjusted for all the other terms in the mean function. Also, show that the residuals in the added-variable plot are identical to the residuals from the mean function with both predictors.

3.2.2 Show that the t-test for the coefficient for log(PPgdp) is not quite the same from the added-variable plot and from the regression with both terms, and explain why they are slightly different.

3.3 The following questions all refer to the mean function

E(Y|X1 = x1, X2 = x2) = β0 + β1x1 + β2x2   (3.26)

3.3.1 Suppose we fit (3.26) to data for which x1 = 2.2x2, with no error. For example, x1 could be a weight in pounds, and x2 the weight of the same object in kg. Describe the appearance of the added-variable plot for X2 after X1.

3.3.2 Again referring to (3.26), suppose now that Y and X1 are perfectly correlated, so Y = 3X1, without any error. Describe the appearance of the added-variable plot for X2 after X1.

3.3.3 Under what conditions will the added-variable plot for X2 after X1 have exactly the same shape as the scatterplot of Y versus X2?

3.3.4 True or false: The vertical variation in an added-variable plot for X2 after X1 is always less than or equal to the vertical variation in a plot of Y versus X2. Explain.

3.4 Suppose we have a regression in which we want to fit the mean function (3.1). Following the outline in Section 3.1, suppose that the two terms X1 and X2 have sample correlation equal to zero. This means that, if xij, i = 1, ..., n, and j = 1, 2 are the observed values of these two terms for the n cases in the data, then Σᵢ (xi1 − x̄1)(xi2 − x̄2) = 0.

3.4.1 Give the formula for the slope of the regression for Y on X1, and for Y on X2. Give the value of the slope of the regression for X2 on X1.

3.4.2 Give formulas for the residuals for the regressions of Y on X1 and for X2 on X1. The plot of these two sets of residuals corresponds to the added-variable plot in Figure 3.1d.

3.4.3 Compute the slope of the regression corresponding to the added-variable plot for the regression of Y on X2 after X1, and show that this slope is exactly the same as the slope for the simple regression of Y on X2 ignoring X1. Also find the intercept for the added-variable plot.

3.5 Refer to the data described in Problem 1.5, page 18.
For this problem, consider the regression problem with response BSAAM, and three predictors as terms given by OPBPC, OPRC and OPSLAKE.

3.5.1 Examine the scatterplot matrix drawn for these three terms and the response. What should the correlation matrix look like (that is, which correlations are large and positive, which are large and negative, and which are small)? Compute the correlation matrix to verify your results. Get the regression summary for the regression of BSAAM on these three terms. Explain what the t-values column of your output means.

3.5.2 Obtain the overall test of the hypothesis that BSAAM is independent of the three terms versus the alternative that it is not independent of them, and summarize your results.

3.5.3 Obtain three analysis of variance tables, fitting in the order (OPBPC, OPRC and OPSLAKE), then (OPBPC, OPSLAKE and OPRC), and finally (OPSLAKE, OPRC and OPBPC). Explain the resulting tables, and discuss in particular any apparent inconsistencies. Which F-tests in the Anova tables are equivalent to t-tests in the regression output?

3.5.4 Using the output from the last problem, test the hypothesis that the coefficients for both OPRC and OPBPC are both zero against the alternative that they are not both zero.
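The standard-error formulas (3.23) and (3.24) from Section 3.6, and the relation between sefit and sepred, can be checked directly. This is a minimal sketch, not from the book, with simulated data and a made-up new case:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate
resid = y - X @ beta
sigma2 = resid @ resid / (n - p)           # sigma-hat squared

x_new = np.array([1.0, 0.5, -0.2])         # terms for a hypothetical new case
h = x_new @ np.linalg.inv(X.T @ X) @ x_new

sefit = np.sqrt(sigma2 * h)                # eq. (3.24): se of the estimated mean
sepred = np.sqrt(sigma2 * (1 + h))         # eq. (3.23): se of prediction

# If a package reports only sefit, sepred is recoverable:
print(np.isclose(sepred, np.sqrt(sigma2 + sefit**2)))  # True
```

The extra "1 +" inside (3.23) is the variance of the new case's own error, which is why sepred always exceeds sefit.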

CHAPTER 4

Drawing Conclusions

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

The computations that are done in multiple linear regression, including drawing graphs, creation of terms, fitting models, and performing tests, will be similar in most problems. Interpreting the results, however, may differ by problem, even if the outline of the analysis is the same. Many issues play into drawing conclusions, and some of them are discussed in this chapter.

4.1 UNDERSTANDING PARAMETER ESTIMATES

Parameters in mean functions have units attached to them. For example, the fitted mean function for the fuel consumption data is

Ê(Fuel|X) = 154.19 − 4.23 Tax + 0.47 Dlic − 6.14 Income + 18.55 log(Miles)

Fuel is measured in gallons, and so all the quantities on the right of this equation must also be in gallons. The intercept is 154.19 gallons. Since Income is measured in thousands of dollars, the coefficient for Income must be in gallons per thousand dollars of income. Similarly, the units for the coefficient for Tax are gallons per cent of tax.

4.1.1 Rate of Change

The usual interpretation of an estimated coefficient is as a rate of change: increasing the Tax rate by one cent should decrease consumption, all other factors being held constant, by about 4.23 gallons per person. This assumes that a predictor can in fact be changed without affecting the other terms in the mean function and that the available data will apply when the predictor is so changed. The fuel data are observational, since the assignment of values for the predictors was not under the control of the analyst, so whether increasing taxes would cause

a decrease in fuel consumption cannot be assessed from these data. From these data, we can observe association but not cause: states with higher tax rates are observed to have lower fuel consumption. To draw conclusions concerning the effects of changing tax rates, the rates must in fact be changed and the results observed.

The coefficient estimate of log(Miles) is 18.55, meaning that a change of one unit in log(Miles) is associated with an 18.55 gallon per person increase in consumption. States with more roads have higher per capita fuel consumption. Since we used base-two logarithms in this problem, increasing log(Miles) by one unit means that the value of Miles doubles. If we double the amount of road in a state, we expect to increase fuel consumption by about 18.55 gallons per person. If we had used base-ten logarithms, then the fitted mean function would be

Ê(Fuel|X) = 154.19 − 4.23 Tax + 0.47 Dlic − 6.14 Income + 61.62 log10(Miles)

The only change in the fitted model is for the coefficient for the log of Miles, which is now interpreted as the change in expected Fuel consumption when log10(Miles) increases by one unit, or when Miles is multiplied by 10.

4.1.2 Signs of Estimates

The sign of a parameter estimate indicates the direction of the relationship between the term and the response. In multiple regression, if the terms are correlated, the sign of a coefficient may change depending on the other terms in the model. While this is mathematically possible and, occasionally, scientifically reasonable, it certainly makes interpretation more difficult. Sometimes this problem can be removed by redefining the terms into new linear combinations that are easier to interpret.

4.1.3 Interpretation Depends on Other Terms in the Mean Function

The value of a parameter estimate not only depends on the other terms in a mean function, but it can also change if the other terms are replaced by linear combinations of the terms.

Berkeley Guidance Study

Data from the Berkeley Guidance Study on the growth of boys and girls are given in Problem 3.1.
As in Problem 3.1, we will view Soma as the response, but consider the three predictors WT2, WT9, WT18 for the n = 70 girls in the study. The scatterplot matrix for these four variables is given in Figure 4.1. First look at the last row of this figure, giving the marginal response plots of Soma versus each of the three potential predictors. For each of these plots, we see that Soma is increasing with the potential predictor on the average, although the relationship is strongest at the oldest age and weakest at the youngest age. The two-dimensional plots of each pair of predictors suggest that the predictors are correlated among themselves. Taken together, we have evidence that the regression on all three predictors cannot

be viewed as just the sum of the three separate simple regressions, because we must account for the correlations between the terms.

FIG. 4.1 Scatterplot matrix for the girls in the Berkeley Guidance Study.

We will proceed with this example using the three original predictors as terms and Soma as the response. We are encouraged to do this because of the appearance of the scatterplot matrix. Since each of the two-dimensional plots appears to be well summarized by a straight-line mean function, we will see later that this suggests that the regression of the response on the original predictors without transformation is likely to be appropriate.

The parameter estimates for the regression of Soma on WT2, WT9, and WT18, given in the column marked Model 1 in Table 4.1, lead to the unexpected conclusion that heavier girls at age two may tend to be thinner, that is, have lower expected somatotype, at age 18. We reach this conclusion because the t-statistic for testing the coefficient equal to zero, which is not shown in the table, has a significance level of about 0.07. The sign, and the weak significance, may be due to the correlations between the terms. In place of the preceding variables, consider the following:

WT2  = Weight at age 2
DW9  = WT9 − WT2 = Weight gain from age 2 to 9
DW18 = WT18 − WT9 = Weight gain from age 9 to 18

TABLE 4.1 Regression of Soma on Different Combinations of Three Weight Variables for the n = 70 Girls in the Berkeley Guidance Study

Term          Model 1   Model 2   Model 3
(Intercept)
WT2
WT9
WT18
DW9                                   NA
DW18                                  NA

Since all three original terms measure weight, combining them in this way is reasonable. If the variables measured different quantities, then combining them could lead to conclusions that are even less useful than those originally obtained. The parameter estimates for Soma on WT2, DW9, and DW18 are given in the column marked Model 2 in Table 4.1. Although not shown in the table, summary statistics for the regression like R² and σ̂² are identical for all the mean functions in Table 4.1, but coefficient estimates and t-tests are not the same. For example, the slope estimate for WT2 is about −0.12, with t = −1.87, in the column Model 1, while in Model 2, the estimate is about one-tenth the size, and the t-value is much closer to zero. In the former case, the effect of WT2 appears plausible, while in the latter it does not. Although the estimate is negative in each, we would be led in the latter case to conclude that the effect of WT2 is negligible. Thus, interpretation of the effect of a variable depends not only on the other variables in a model but also upon which linear transformation of those variables is used.

Another interesting feature of Table 4.1 is that the estimate for WT18 in Model 1 is identical to the estimate for DW18 in Model 2. In Model 1, the estimate for WT18 is the effect on Soma of changing WT18 by one unit, with all other terms held fixed. In Model 2, the estimate for DW18 is the change in Soma when DW18 changes by one unit, when all other terms are held fixed. But the only way DW18 = WT18 − WT9 can be changed by one unit with the other variables, including WT9 = DW9 + WT2, held fixed is by changing WT18 by one unit. Consequently, the terms WT18 in Model 1 and DW18 in Model 2 play identical roles and therefore we get the same estimates.
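Both facts, that a linear reparameterization leaves the fit unchanged and that the WT18 and DW18 coefficients coincide, are easy to confirm numerically. A minimal sketch, not from the book, with simulated weights standing in for the Berkeley data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 70
# Hypothetical weights standing in for WT2, WT9, WT18 (correlated, as in the text).
wt2 = rng.normal(13.0, 2.0, n)
wt9 = wt2 + rng.normal(17.0, 3.0, n)
wt18 = wt9 + rng.normal(30.0, 6.0, n)
soma = 3.0 + 0.05 * wt18 + rng.normal(scale=0.8, size=n)

def fit(*cols):
    X = np.column_stack([np.ones(n)] + list(cols))
    b, *_ = np.linalg.lstsq(X, soma, rcond=None)
    return b, soma - X @ b

b1, r1 = fit(wt2, wt9, wt18)                # Model 1
b2, r2 = fit(wt2, wt9 - wt2, wt18 - wt9)    # Model 2, using DW9 and DW18

# Identical residuals, so R^2 and sigma-hat^2 are identical too.
assert np.allclose(r1, r2)
# The WT18 coefficient in Model 1 equals the DW18 coefficient in Model 2.
assert np.isclose(b1[3], b2[3])

# Adding DW9 as a fourth term gives a rank deficient (aliased) term matrix.
X5 = np.column_stack([np.ones(n), wt2, wt9, wt18, wt9 - wt2])
print(np.linalg.matrix_rank(X5))  # 4, not 5
```

The rank computation at the end previews the aliasing discussed in the next section: the five columns span only a four-dimensional space.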
The linear transformation of the three weight variables we have used so far could be replaced by other linear combinations, and, depending on the context, others might be preferred. For example, another set might be

AVE  = (WT2 + WT9 + WT18)/3
LIN  = WT18 − WT2
QUAD = WT2 − 2WT9 + WT18

This transformation focuses on the fact that WT2, WT9 and WT18 are ordered in time and are more or less equally spaced. Pretending that the weight measurements

are equally spaced, AVE, LIN and QUAD are, respectively, the average, linear, and quadratic time trends in weight gain.

4.1.4 Rank Deficient and Over-Parameterized Mean Functions

In the last example, several combinations of the basic predictors WT2, WT9, and WT18 were studied. One might naturally ask what would happen if more than three combinations of these predictors were used in the same regression model. As long as we use linear combinations of the predictors, as opposed to nonlinear combinations or transformations of them, we cannot use more than three, the number of linearly independent quantities. To see why this is true, consider adding DW9 to the mean function including WT2, WT9 and WT18. As in Chapter 3, we can learn about adding DW9 using an added-variable plot of the residuals from the regression of Soma on WT2, WT9 and WT18 versus the residuals from the regression of DW9 on WT2, WT9 and WT18. Since DW9 can be written as an exact linear combination of the other predictors, DW9 = WT9 − WT2, the residuals from this second regression are all exactly zero. A slope coefficient for DW9 is thus not defined after adjusting for the other three terms. We would say that the four terms WT2, WT9, WT18, and DW9 are linearly dependent, since one can be determined exactly from the others. The three variables WT2, WT9 and WT18 are linearly independent because one of them cannot be determined exactly by a linear combination of the others. The maximum number of linearly independent terms that could be included in a mean function is called the rank of the data matrix X.

Model 3 in Table 4.1 gives the estimates produced in a computer package when we tried to fit using an intercept and the five terms WT2, WT9, WT18, DW9, and DW18. Most computer programs, including this one, will select the first three of these terms and estimate coefficients only for them.
For the remaining terms, this program sets the estimates to NA, a code for a missing value; the word aliased is sometimes used to indicate a term that is a linear combination of terms already in the mean function, and so a coefficient for it is not estimable.

Mean functions that are over-parameterized occur most often in designed experiments. The simplest example is the one-way design. Suppose that a unit is assigned to one of three treatment groups, and let X1 = 1 if the unit is in group one and zero otherwise, X2 = 1 if the unit is in group two and zero otherwise, and X3 = 1 if the unit is in group three and zero otherwise. For each unit, we must have X1 + X2 + X3 = 1 since each unit is in only one of the three groups. We therefore cannot fit the model

E(Y|X) = β0 + β1X1 + β2X2 + β3X3

because the sum of the Xj is equal to the column of ones, and so, for example, X3 = 1 − X1 − X2. To fit a model, we must do something else. The options are: (1) place a constraint like β1 + β2 + β3 = 0 on the parameters; (2) exclude one of the Xj from the model; or (3) leave out an explicit intercept. All of these options will in some sense be equivalent, since the same R², σ̂², overall F-test, and

predictions will result. Of course, some care must be taken in using parameter estimates, since these will surely depend on the parameterization used to get a full rank model. For further reading on matrices and models of less than full rank, see, for example, Searle (1971, 1982).

4.1.5 Tests

Even if the fitted model were correct and errors were normally distributed, tests and confidence statements for parameters are difficult to interpret because correlations among the terms lead to a multiplicity of possible tests. Sometimes, tests of effects adjusted for other variables are clearly desirable, such as in assessing a treatment effect after adjusting for other variables to reduce variability. At other times, the order of fitting is not clear, and the analyst must expect ambiguous results. In most situations, the only true test of significance is repeated experimentation.

4.1.6 Dropping Terms

Suppose we have a sample of n rectangles from which we want to model log(Area) as a function of log(Length), perhaps through the simple regression mean function

E(log(Area)|log(Length)) = η0 + η1 log(Length)   (4.1)

From elementary geometry, we know that Area = Length × Width, and so the true mean function for log(Area) is

E(log(Area)|log(Length), log(Width)) = β0 + β1 log(Length) + β2 log(Width)   (4.2)

with β0 = 0, and β1 = β2 = 1. The questions of interest are: (1) can the incorrect mean function specified by (4.1) provide a useful approximation to the true mean function (4.2), and if so, (2) what are the relationships between the ηs in (4.1) and the βs in (4.2)? The answers to these questions come from Appendix A.2.4. Suppose that the true mean function were

E(Y|X1 = x1, X2 = x2) = β0 + β1x1 + β2x2   (4.3)

but we want to fit a mean function with X1 only.
The mean function for Y|X1 is obtained by averaging (4.3) over X2,

E(Y|X1 = x1) = E[E(Y|X1 = x1, X2)|X1 = x1]
             = β0 + β1x1 + β2E(X2|X1 = x1)   (4.4)

We cannot, in general, simply drop a set of terms from a correct mean function; rather, we need to substitute the conditional expectation of the terms dropped given the terms that remain in the mean function. In the context of the rectangles example, we get

E(log(Area)|log(Length)) = β0 + β1 log(Length) + β2E(log(Width)|log(Length))   (4.5)

The answers to the questions posed depend on the mean function for the regression of log(Width) on log(Length). This conditional expectation has little to do with the area of rectangles, but much to do with the way we obtain a sample of rectangles to use in our study. We will consider three cases.

In the first case, imagine that each of the rectangles in the study is formed by sampling a log(Length) and a log(Width) from independent distributions. If the mean of the log(Width) distribution is µW, then by independence

E(log(Width)|log(Length)) = E(log(Width)) = µW

Substituting into (4.5),

E(log(Area)|log(Length)) = β0 + β1 log(Length) + β2µW
                         = (β0 + β2µW) + β1 log(Length)
                         = µW + log(Length)

where the last equation follows by substituting β0 = 0, β1 = β2 = 1. For this case, the mean function (4.1) would be appropriate for the regression of log(Area) on log(Length). The intercept for the mean function (4.1) would be µW, and so it depends on the distribution of the widths in the data. The slope for log(Length) is the same for fitting (4.1) or (4.2).

In the second case, suppose that

E(log(Width)|log(Length)) = γ0 + γ1 log(Length)

so the mean function for the regression of log(Width) on log(Length) is a straight line. This could occur, for example, if the rectangles in our study were obtained by sampling from a family of similar rectangles, so the ratio Width/Length is the same for all rectangles in the study. Substituting this into (4.5) and simplifying gives

E(log(Area)|log(Length)) = β0 + β1 log(Length) + β2(γ0 + γ1 log(Length))
                         = (β0 + β2γ0) + (β1 + β2γ1) log(Length)
                         = γ0 + (1 + γ1) log(Length)

Once again, fitting using (4.1) will be appropriate, but the values of η0 = γ0 and η1 = 1 + γ1 depend on the parameters of the regression of log(Width) on log(Length). The γs are a characteristic of the sampling plan, not of rectangles. Two experimenters who sample rectangles of different shapes will end up estimating different parameters.
For a final case, suppose that the mean function

E(log(Width)|log(Length)) = γ0 + γ1 log(Length) + γ2 log(Length)²

is quadratic. Substituting into (4.5), setting β0 = 0, β1 = β2 = 1 and simplifying gives

E(log(Area)|log(Length)) = β0 + β1 log(Length) + β2(γ0 + γ1 log(Length) + γ2 log(Length)²)
                         = γ0 + (1 + γ1) log(Length) + γ2 log(Length)²

which is a quadratic function of log(Length). If the mean function is quadratic, or any other function beyond a straight line, then fitting (4.1) is inappropriate.

From the above three cases, we see that both the mean function and the parameters for the response depend on the mean function for the regression of the removed terms on the remaining terms. If the mean function for the regression of the removed terms on the retained terms is not linear, then a linear mean function will not be appropriate for the regression problem with fewer terms.

Variances are also affected when terms are dropped. Returning to the true mean function given by (4.3), the general result for the regression of Y on X1 alone is, from Appendix A.2.4,

Var(Y|X1 = x1) = E[Var(Y|X1 = x1, X2)|X1 = x1] + Var[E(Y|X1 = x1, X2)|X1 = x1]
               = σ² + β2′ Var(X2|X1 = x1) β2   (4.6)

In the context of the rectangles example, β2 = 1 and we get

Var(log(Area)|log(Length)) = σ² + Var(log(Width)|log(Length))

Although fitting (4.1) can be appropriate if log(Width) and log(Length) are linearly related, the errors for this mean function can be much larger than those for (4.2) if Var(log(Width)|log(Length)) is large. If Var(log(Width)|log(Length)) is small enough, then fitting (4.1) can actually give answers that are nearly as accurate as fitting with the true mean function (4.2).

4.1.7 Logarithms

If we start with the simple regression mean function,

E(Y|X = x) = β0 + β1x

a useful way to interpret the coefficient β1 is as the first derivative of the mean function with respect to x,

dE(Y|X = x)/dx = β1

We recall from elementary calculus that the first derivative is the rate of change, or the slope of the tangent to a curve, at a point. Since the mean function for

simple regression is a straight line, the slope of the tangent is the same value β1 for any value of x, and β1 completely characterizes the change in the mean when the predictor is changed, for any value of x.

When the predictor is replaced by log(x), the mean function as a function of x,

E(Y|X = x) = β0 + β1 log(x)

is no longer a straight line, but rather it is a curve. The tangent at the point x > 0 is

dE(Y|X = x)/dx = β1/x

The slope of the tangent is different for each x, and the effect of changing x on E(Y|X = x) is largest for small values of x and gets smaller as x is increased.

When the response is in log scale, we can get similar approximate results by exponentiating both sides of the equation:

E(log(Y)|X = x) = β0 + β1x
E(Y|X = x) ≈ exp(β0) exp(β1x)

Differentiating this second equation gives

dE(Y|X = x)/dx ≈ β1 E(Y|X = x)

The rate of change at x is thus equal to β1 times the mean at x. We can also write

[dE(Y|X = x)/dx] / E(Y|X = x) = β1

which is constant, and so β1 can be interpreted as the constant rate of change in the response per unit of response.

4.2 EXPERIMENTATION VERSUS OBSERVATION

There are fundamentally two types of predictors that are used in a regression analysis, experimental and observational. Experimental predictors have values that are under the control of the experimenter, while for observational predictors, the values are observed rather than set. Consider, for example, a hypothetical study of factors determining the yield of a certain crop. Experimental variables might include the amount and type of fertilizers used, the spacing of plants, and the amount of irrigation, since each of these can be assigned by the investigator to the units, which are plots of land. Observational predictors might include characteristics of the plots in the study, such as drainage, exposure, soil fertility, and weather variables. All of these are beyond the control of the experimenter, yet may have important effects on the observed yields.
The primary difference between experimental and observational predictors is in the inferences we can make. From experimental data, we can often infer causation.

If we assign the level of fertilizer to plots, usually on the basis of a randomization scheme, and observe differences due to levels of fertilizer, we can infer that the fertilizer is causing the differences. Observational predictors allow weaker inferences. We might say that weather variables are associated with yield, but the causal link is not available for variables that are not under the experimenter's control. Some experimental designs, including those that use randomization, are constructed so that the effects of observational factors can be ignored or used in analysis of covariance (see, e.g., Cox, 1958; Oehlert, 2000).

Purely observational studies that are not under the control of the analyst can only be used to predict or model the events that were observed in the data, as in the fuel consumption example. To apply observational results to predict future values, additional assumptions about the behavior of future values compared to the behavior of the existing data must be made. From a purely observational study, we cannot infer a causal relationship without additional information external to the observational study.

Feedlots

A feedlot is a farming operation that includes a large number of cattle, swine or poultry in a small area. Feedlots are efficient producers of animal products, and can provide high-paying skilled jobs in rural areas. They can also cause environmental problems, particularly with odors, ground water pollution, and noise. Taff, Tiffany, and Weisberg (1996) report a study on the effect of feedlots on property values. This study was based on all 292 rural residential property sales in two southern Minnesota counties. Regression analysis was used. The response was sale price. Predictors included house characteristics such as size, number of bedrooms, age of the property, and so on. Additional predictors described the relationship of the property to existing feedlots, such as distance to the nearest feedlot, number of nearby feedlots, and related features of the feedlots such as their size.
The feedlot effect could be inferred from the coefficients for the feedlot variables. In the analysis, the coefficient estimates for feedlot effects were generally positive and judged to be nonzero, meaning that close proximity to feedlots was associated with an increase in sale prices. While association of the opposite sign was expected, the positive sign is plausible if the positive economic impact of the feedlot outweighs the negative environmental impact. The positive effect is estimated to be small, however, and equal to 5% or less of the sale price of the homes in the study.

These data are purely observational, with no experimental predictors. The data collectors had no control over the houses that actually sold, or the siting of feedlots. Consequently, any inference that nearby feedlots cause increases in sale price is unwarranted from this study. Given that we are limited to association, rather than causation, we might next turn to whether we can generalize the results. Can we infer the same association for houses that were not sold in these counties during this period? We have no way of knowing from the data if the same

93 EXPERIMENTATION VERSUS OBSERVATION 79 relationship would hold or hoes that did not sell. For exaple, soe hoeowners ay have perceived that they could not get a reasonable price and ay have decided not to sell. This would create a bias in avor o a positive eect o eedlots. Can we generalize geographically, to other Minnesota counties or to other places in the Midwest United States? The answer to this question depends on the characteristics o the two counties studied. Both are rural counties with populations o about 17,000. Both have very low property values with edian sale price in this period o less than $50,000. Each had dierent regulations or operators o eedlots, and these regulations could ipact pollution probles. Applying the results to a county with dierent deographics or regulations cannot be justiied by these data alone, and additional inoration and assuptions are required. Joiner (1981) coined the picturesque phrase lurking variable to describe a predictor variable not included in a ean unction that is correlated with ters in the ean unction. Suppose we have a regression with predictors X that are included in the regression and a lurking variable L not included in the study, and that the true regression ean unction is E(Y X = x,l= l) = β 0 + p β j x j + δl (4.7) with δ 0. We assue that X and L are correlated and or siplicity we assue urther that E(L X = x) = γ 0 + γ j x j. When we it the incorrect ean unction that ignores the lurking variable, we get, ro Section 4.1.6, E(Y X = x) = β 0 + j 1 p β j x j + δe(l X = x) j 1 = (β 0 + δγ 0 ) + p (β j + δγ j )x j (4.8) Suppose we are particularly interested in inerences about the coeicient or X 1, and, unknown to us, β 1 in (4.7) is equal to zero. I we were able to it with the lurking variable included, we would probably conclude that X 1 is not an iportant predictor. I we it the incorrect ean unction (4.8), the coeicient or X 1 becoes (β 1 + δγ 1 ), which will be non zero i γ 1 0. 
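The bias in (4.8) is easy to check by simulation. In this sketch (pure Python; the sample size and parameter values are illustrative, not from the book), the true coefficient for X1 is β1 = 0, but a lurking variable L with δ = 2 and γ1 = 0.8 is omitted from the fit, so the fitted slope lands near β1 + δγ1 = 1.6:

```python
import random

random.seed(1)
n = 20000
delta, gamma0, gamma1 = 2.0, 1.0, 0.8  # lurking-variable parameters (illustrative)

x = [random.gauss(0, 1) for _ in range(n)]
# L is correlated with X: E(L | X = x) = gamma0 + gamma1 * x
lurk = [gamma0 + gamma1 * xi + random.gauss(0, 0.5) for xi in x]
# True mean function has beta1 = 0: Y depends on X only through the lurking L
y = [3.0 + delta * li + random.gauss(0, 1) for li in lurk]

# ols slope of Y on X alone, i.e., the incorrect mean function that ignores L
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sxy / sxx
print(round(slope, 2))  # close to beta1 + delta*gamma1 = 0 + 2.0*0.8 = 1.6
```

With a large sample the fitted slope is within a few hundredths of 1.6, even though X1 has no effect once L is accounted for.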
The lurking variable masquerades as the variable of interest to give an incorrect inference. A lurking variable can also hide the effect of an important variable if, for example, β1 ≠ 0 but β1 + δγ1 = 0.

All large observational studies like this feedlot study potentially have lurking variables. For this study, a casino had recently opened near these counties, creating many jobs and a demand for housing that might well have overshadowed any effect of feedlots. In experimental data with random assignment, the potential effects of lurking variables are greatly decreased, since the random assignment guarantees that the correlation between the terms in the mean function and any lurking variable is small or zero.

The interpretation of results from a regression analysis depends on the details of the data design and collection. The feedlot study has extremely limited scope, and is but one element to be considered in trying to understand the effect of feedlots on property values. Studies like this feedlot study are easily misused. As recently as spring 2004, the study was cited in an application for a permit to build a feedlot in Starke County, Indiana, claiming that the study supports the positive effect of feedlots on property values, confusing association with causation, and inferring generalizability to other locations without any logical foundation for doing so.

4.3 SAMPLING FROM A NORMAL POPULATION

Much of the intuition for the use of least squares estimation is based on the assumption that the observed data are a sample from a multivariate normal population. While the assumption of multivariate normality is almost never tenable in practical regression problems, it is worthwhile to explore the relevant results for normal data, first assuming random sampling and then removing that assumption.

Suppose that all of the observed variables are normal random variables, and the observations on each case are independent of the observations on each other case. In a two-variable problem, for the ith case observe (xi, yi), and suppose that

    (xi)       ( (μx)   ( σx²         ρxy σx σy ) )
    (yi)  ~  N ( (μy) , ( ρxy σx σy   σy²       ) )    (4.9)

Equation (4.9) says that xi and yi are each realizations of normal random variables with means μx and μy, variances σx² and σy², and correlation ρxy. Now, suppose we consider the conditional distribution of yi given that we have already observed the value of xi.
It can be shown (see, e.g., Lindgren, 1993; Casella and Berger, 1990) that the conditional distribution of yi given xi is normal and

    yi | xi ~ N( μy + ρxy (σy/σx)(xi − μx),  σy²(1 − ρxy²) )    (4.10)

If we define

    β0 = μy − β1 μx
    β1 = ρxy σy/σx
    σ² = σy²(1 − ρxy²)    (4.11)

then the conditional distribution of yi given xi is simply

    yi | xi ~ N(β0 + β1 xi, σ²)    (4.12)

which is essentially the same as the simple regression model with the added assumption of normality.
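Substituting sample moments for the parameters in (4.11) reproduces the least squares estimates of Chapter 2; the identity rxy·SDy/SDx = Sxy/Sxx is exact algebra. A quick numerical check with made-up data (not from the book):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 9.9, 12.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

SDx = math.sqrt(sxx / (n - 1))          # sample standard deviations (n - 1 divisor)
SDy = math.sqrt(syy / (n - 1))
r_xy = sxy / math.sqrt(sxx * syy)       # sample correlation

# Substitute the sample moments into (4.11)
b1_moments = r_xy * SDy / SDx
b0_moments = ybar - b1_moments * xbar

# Direct ordinary least squares for comparison
b1_ols = sxy / sxx
b0_ols = ybar - b1_ols * xbar

print(b1_moments, b1_ols)  # identical up to floating-point rounding
```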

Given random sampling, the five parameters in (4.9) are estimated, using the notation of Table 2.1, by

    μ̂x = x̄    σ̂x² = SDx²    ρ̂xy = rxy
    μ̂y = ȳ    σ̂y² = SDy²    (4.13)

Estimates of β0 and β1 are obtained by substituting estimates from (4.13) for parameters in (4.11), so that β̂1 = rxy SDy/SDx, and so on, as derived in Chapter 2. However, σ̂² = [(n − 1)/(n − 2)] SDy²(1 − rxy²) to correct for degrees of freedom.

If the observations on the ith case are yi and a p × 1 vector xi not including a constant, multivariate normality is shown symbolically by

    (xi)       ( (μx)   ( Σxx    Σxy ) )
    (yi)  ~  N ( (μy) , ( Σxy'   σy² ) )    (4.14)

where Σxx is a p × p matrix of variances and covariances between the elements of xi, and Σxy is a p × 1 vector of covariances between xi and yi. The conditional distribution of yi given xi is then

    yi | xi ~ N( (μy − β'μx) + β'xi, σ² )    (4.15)

If R² is the population multiple correlation,

    β = Σxx⁻¹ Σxy;    σ² = σy² − Σxy' Σxx⁻¹ Σxy = σy²(1 − R²)

The formulas for β and σ² and the formulas for their least squares estimators differ only by the substitution of estimates for parameters, with (n − 1)⁻¹(X'X) estimating Σxx and (n − 1)⁻¹(X'Y) estimating Σxy.

4.4 MORE ON R²

The conditional distribution in (4.10) or (4.15) does not depend on random sampling, but only on normal distributions, so whenever multivariate normality seems reasonable, a linear regression model is suggested for the conditional distribution of one variable, given the others. However, if random sampling is not used, some of the usual summary statistics, including R², lose their connection to population parameters.

Figure 4.2a repeats Figure 1.1, the scatterplot of Dheight versus Mheight for the heights data. These data closely resemble a bivariate normal sample, and so R² = 0.24 estimates the population R² for this problem. Figure 4.2b repeats this last figure, except that all cases with Mheight between 61 and 64 inches, the lower and upper quartiles of the mothers' heights rounded to the nearest inch, have been removed from the data.
The ols regression line appears similar, but the value of R² = 0.37 is about 50% larger. By removing the middle of the data, we have made R² larger, and it no longer estimates a population value. Similarly, in Figure 4.2c, we exclude all the cases with Mheight outside the quartiles, and get R² = 0.027, and the relationship between Dheight and Mheight virtually disappears.

FIG. 4.2 Three views of the heights data (Dheight versus Mheight): (a) all, (b) outer, (c) inner.

This example points out that even in the unusual event of analyzing data drawn from a multivariate normal population, if sampling of the population is not random, the interpretation of R² may be completely misleading, as this statistic will be strongly influenced by the method of sampling. In particular, a few cases with unusual values for the predictors can largely determine the observed value of this statistic. We have seen that we can manipulate the value of R² merely by changing our sampling plan for collecting data: if the values of the terms are widely dispersed, then R² will tend to be too large, while if the values are sampled over a very small range, then R² will tend to be too small. Because the notion of proportion of variability explained is so useful, a diagnostic method is needed to decide if it is a useful concept in any particular problem.

4.4.1 Simple Linear Regression and R²

In simple linear regression problems, we can always determine the appropriateness of R² as a summary by examining the summary graph of the response versus the predictor. If the plot looks like a sample from a bivariate normal population, as in Figure 4.2a, then R² is a useful measure. The less the graph looks like this figure, the less useful is R² as a summary measure.

Figure 4.3 shows six summary graphs. Only for the first three of them is R² a useful summary of the regression problem. In Figure 4.3e, the mean function appears curved rather than straight, so correlation is a poor measure of dependence. In Figure 4.3d, the value of R² is virtually determined by one point, making R² necessarily unreliable. The regular appearance of the remaining plot suggests a different type of problem. We may have several identifiable groups of points caused by a lurking variable not included in the mean function, such that the mean function for each group has a negative slope, but when groups are combined the slope becomes positive. Once again, R² is not a useful summary of this graph.

FIG. 4.3 Six summary graphs (response versus predictor or ŷ). R² is an appropriate measure for (a)-(c), but inappropriate for (d)-(f).
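The sensitivity of R² to the sampling plan, as in the three panels of Figure 4.2, can be reproduced with a simulated bivariate normal sample. The correlation, cutoffs, and sample size below are illustrative, not the heights data:

```python
import math
import random

random.seed(4)
rho = 0.5  # population correlation, so population R^2 = rho^2 = 0.25
q = 0.674  # the quartiles of the standard normal are about -0.674 and +0.674

def r_squared(pairs):
    m = len(pairs)
    xbar = sum(x for x, _ in pairs) / m
    ybar = sum(y for _, y in pairs) / m
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    syy = sum((y - ybar) ** 2 for _, y in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    return sxy ** 2 / (sxx * syy)

# Bivariate normal sample with correlation rho
data = []
for _ in range(5000):
    x = random.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    data.append((x, y))

full = r_squared(data)                                  # like Figure 4.2a
outer = r_squared([p for p in data if abs(p[0]) > q])   # middle removed, like 4.2b
inner = r_squared([p for p in data if abs(p[0]) <= q])  # middle only, like 4.2c
print(inner < full < outer)  # True: restricting the x range shrinks R^2
```

Only the full random sample estimates the population value of 0.25; the other two values reflect the sampling plan, not the population.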

4.4.2 Multiple Linear Regression

In multiple linear regression, R² can also be interpreted as the square of the correlation in a summary graph, this time of Y versus the fitted values Ŷ. This plot can be interpreted exactly the same way as the plot of the response versus the single term in simple linear regression to decide on the usefulness of R² as a summary measure. For other regression methods such as nonlinear regression, we can define R² to be the square of the correlation between the response and the fitted values, and use this summary graph to decide if R² is a useful summary.

4.4.3 Regression through the Origin

With regression through the origin, the proportion of variability explained is given by SSreg/Σyi² = 1 − RSS/Σyi², using uncorrected sums of squares. This quantity is not invariant under location change, so, for example, if units are changed from Fahrenheit to Celsius, you will get a different value for the proportion of variability explained. For this reason, use of an R²-like measure for regression through the origin is not recommended.

4.5 MISSING DATA

In many problems, some variables will be unrecorded for some cases. The methods we study in this book generally assume and require complete data, without any missing values. The literature on analyzing incomplete data problems is very large, and our goal here is more to point out the issues than to provide solutions. Two recent books on this topic are Little and Rubin (1987) and Schafer (1997).

4.5.1 Missing at Random

The most common solution to missing data problems is to delete either cases or variables so the resulting data set is complete. Many software packages delete partially missing cases by default, and fit regression models to the remaining, complete, cases. This is a reasonable approach as long as the fraction of cases deleted is small enough, and the cause of values being unobserved is unrelated to the relationships under study. This would include data lost through an accident like dropping a test tube, or making an illegible entry in a logbook.
If the reason for not observing values depends on the values that would have been observed, then the analysis of data may require modeling the cause of the failure to observe values. For example, if values of a measurement are unrecorded when the value is less than the minimum detection limit of an instrument, then the value is missing because the value that should have been observed is too small. A simple expedient in this case that is sometimes helpful is to substitute a value less than or equal to the detection limit for the unobserved values. This expedient is not always entirely satisfactory because substituting, or imputing, a fixed value for the unobserved quantity can reduce the variation on the filled-in variable, and yield misleading inferences.
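A tiny simulation shows the shrinkage. Below, values from a made-up skewed measurement are censored at a hypothetical detection limit and replaced by the limit itself; the standard deviation of the filled-in variable falls below that of the complete variable:

```python
import random
import statistics

random.seed(7)
limit = 1.0  # hypothetical minimum detection limit of the instrument
true_vals = [random.lognormvariate(0, 0.8) for _ in range(2000)]

# Substitute the detection limit itself for every value that falls below it
filled = [v if v >= limit else limit for v in true_vals]

sd_true = statistics.stdev(true_vals)
sd_filled = statistics.stdev(filled)
print(sd_filled < sd_true)  # True: imputing a constant shrinks the variability
```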

As a second example, suppose we have a clinical trial that enrolls subjects with a particular medical condition, assigns each subject a treatment, and then the subjects are followed for a period of time to observe their response, which may be the time until a particular landmark occurs, such as improvement of the medical condition. Subjects who do not respond well to the treatment may drop out of the study early, while subjects who do well may be more likely to remain in the study. Since the probability of observing a value depends on the value that would have been observed, simply deleting subjects who drop out early can easily lead to incorrect inferences because the successful subjects will be overrepresented among those who complete the study. In many clinical trials, the response variable is not observed because the study ends, not because of patient characteristics. In this case, we call the response times censored; so for each patient, we know either the time to the landmark or the time to censoring. This is a different type of missing data problem, and analysis needs to include both the uncensored and censored observations. Book-length treatments of censored survival data are given by Kalbfleisch and Prentice (1980) and Cox and Oakes (1984), among others.

As a final example, consider a cross-cultural demographic study. Some demographic variables are harder to measure than others, and some variables, such as the rate of employment for women over the age of 15, may not be available for less-developed countries. Deleting countries that do not have this variable measured could change the population that is studied by excluding less-developed countries.

Rubin (1976) defined data to be missing at random (MAR) if the failure to observe a value does not depend on the value that would have been observed. With MAR data, case deletion can be a useful option.
Determining whether an assumption of MAR is appropriate for a particular data set is an important step in the analysis of incomplete data.

4.5.2 Alternatives

All the alternatives we briefly outline here require strong assumptions concerning the data that may be impossible to check in practice. Suppose first that we combine the response and predictors into a single vector Z. We assume that the distribution of Z is fully known, apart from unknown parameters. The simplest assumption is that Z ~ N(μ, Σ). If we had reasonable estimates of μ and Σ, then we could use (4.15) to estimate parameters for the regression of the response on the other terms. The EM algorithm (Dempster, Laird, and Rubin, 1977) is a computational method that is used to estimate the parameters of the known joint distribution based on data with missing values.

Alternatively, given a model for the data like multivariate normality, one could impute values for the missing data and then analyze the completed data as if it were fully observed. Multiple imputation carries this one step further by creating several imputed data sets that, according to the model used, are plausible, filled-in data sets, and then averaging the analyses of the filled-in data sets. Software for both imputation and the EM algorithm for maximum likelihood estimation is available in several standard statistical packages, including the missing package in S-plus and the MI procedure in SAS.

The third approach is more comprehensive, as it requires building a model for the process of interest and the missing data process simultaneously. Examples of this approach are given by Ibrahim, Lipsitz, and Horton (2001), and Tang, Little, and Raghunathan (2003).

The data described in Table 4.2 provide an example. Allison and Cicchetti (1976) presented data on sleep patterns of 62 mammal species along with several other possible predictors of sleep. The data were in turn compiled from several other sources, and not all values are measured for all species. For example, PS, the number of hours of paradoxical sleep, was measured for only 50 of the 62 species in the data set, and GP, the gestation period, was measured for only 58 of the species. If we are interested in the dependence of hours of sleep on the other predictors, then we have at least three possible responses, PS, SWS, and TS, all observed on only a subset of the species. To use case deletion and then standard methods to analyze the conditional distributions of interest, we need to assume that the chance of a value being missing does not depend on the value. For example, the four missing values of GP are missing because no one had (as of 1976) published this value for these species. Using the imputation or the maximum likelihood methods are alternatives for these data, but they require making assumptions like normality, which might be palatable for many of the variables if transformed to logarithmic scale. Some of the variables, like P and SE, are categorical, so other assumptions beyond multivariate normality might be needed.
TABLE 4.2 The Sleep Data^a

Variable   Type      Number    Percent  Description
                     Observed  Missing
BodyWt     Variate   62        0        Body weight in kg
BrainWt    Variate   62        0        Brain weight in g
D          Factor    62        0        Danger index, 1 = least danger, ..., 5 = most
GP         Variate   58        6        Gestation time, days
Life       Variate   58        6        Maximum life span, years
P          Factor    62        0        Predation index, 1 = lowest, ..., 5 = highest
SE         Factor    62        0        Sleep exposure index, 1 = more exposed, ..., 5 = most protected
PS         Response  50        19       Paradoxical dreaming sleep, hrs/day
SWS        Response                     Slow wave nondreaming sleep, hrs/day
TS         Response  58        6        Total sleep, hrs/day
Species    Labels    62        0        Species of mammal

^a 10 variables, 62 observations, 8 patterns of missing values; 5 variables (50%) have at least one missing value; 20 observations (32%) have at least one missing value.

4.6 COMPUTATIONALLY INTENSIVE METHODS

Suppose we have a sample y1, ..., yn from a particular distribution G, for example a standard normal distribution. What is a confidence interval for the population median? We can obtain an approximate answer to this question by computer simulation, set up as follows:

1. Obtain a simulated random sample y1*, ..., yn* from the known distribution G. Most statistical computing languages include functions for simulating random deviates (see Thisted, 1988, for computational methods).
2. Compute and save the median of the sample in step 1.
3. Repeat steps 1 and 2 a large number of times, say B times. The larger the value of B, the more precise the ultimate answer.
4. If we take B = 999, a simple percentile-based 95% confidence interval for the median is the interval between the 25th smallest value and the 975th smallest value, which are the sample 2.5 and 97.5 percentiles, respectively.

In most interesting problems, we will not actually know G, and so this simulation is not available. Efron (1979) pointed out that the observed data can be used to estimate G, and then we can sample from the estimate Ĝ. The algorithm becomes:

1. Obtain a random sample y1*, ..., yn* from Ĝ by sampling with replacement from the observed values y1, ..., yn. In particular, the ith element of the sample yi* is equally likely to be any of the original y1, ..., yn. Some of the yi will appear several times in the random sample, while others will not appear at all.
2. Continue with steps 2-4 of the first algorithm.

A test at the 5% level concerning the population median can be rejected if the hypothesized value of the median does not fall in the confidence interval computed at step 4. Efron called this method the bootstrap, and we call B the number of bootstrap samples. Excellent references for the bootstrap are the books by Efron and Tibshirani (1993) and Davison and Hinkley (1997).

4.6.1 Regression Inference without Normality

Bootstrap methods can be applied in more complex problems like regression.
Inferences and accurate standard errors for parameters and mean functions require either normality of regression errors or large sample sizes. In small samples without normality, standard inference methods can be misleading, and in these cases a bootstrap can be used for inference.
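Before applying the bootstrap to regression, here is the percentile bootstrap of the previous section applied to the median of a made-up sample; the seed, data, and B = 999 are illustrative:

```python
import random
import statistics

random.seed(11)
data = [random.gauss(10, 2) for _ in range(50)]  # the observed sample defines G-hat

B = 999
boot_medians = []
for _ in range(B):
    resample = random.choices(data, k=len(data))      # step 1: sample with replacement
    boot_medians.append(statistics.median(resample))  # step 2: save the median

boot_medians.sort()  # steps 3-4: percentile interval from the ordered B values
lower, upper = boot_medians[24], boot_medians[974]
print(round(lower, 2), round(upper, 2))  # 95% interval for the population median
```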

Transactions Data

The data in this example consist of a sample of branches of a large Australian bank (Cunningham and Heathcote, 1989). Each branch makes transactions of two types, and for each of the branches we have recorded the number of transactions T1 and T2, as well as Time, the total number of minutes of labor used by the branch in type 1 and type 2 transactions. If βj is the average number of minutes for a transaction of type j, j = 1, 2, then the total number of minutes in a branch for transaction type j is βj Tj, and the total number of minutes is expected to be

    E(Time | T1, T2) = β0 + β1 T1 + β2 T2    (4.16)

possibly with β0 = 0 because zero transactions should imply zero time spent. The data are displayed in Figure 4.4, and are given in the data file transact.txt. The key features of the scatterplot matrix are: (1) the marginal response plots in the last row appear to have reasonably linear mean functions; (2) there appear to be a number of branches with no T1 transactions but many T2 transactions; and (3) in the plot of Time versus T2, variability appears to increase from left to right.

FIG. 4.4 Scatterplot matrix for the transactions data.

The errors in this problem probably have a skewed distribution. Occasional transactions take a very long time, but since transaction time is bounded below by zero, there cannot be any really extreme quick transactions. Inferences based on normal theory are therefore questionable. Following the suggestion of Pardoe and Weisberg (2001) for this example, a bootstrap is computed as follows:

1. Number the cases in the data set from 1 to n. Take a random sample with replacement of size n from these case numbers. Thus, the ith case number in the sample is equally likely to be any of the n cases in the original data.
2. Create a data set from the original data, but repeating each row in the data set the number of times that row was selected in the random sample in step 1. Some cases will appear several times and others will not appear at all. Compute the regression using this data set, and save the values of the coefficient estimates.
3. Repeat steps 1 and 2 a large number of times, say, B times.
4. Estimate a 95% confidence interval for each of the estimates by the 2.5 and 97.5 percentiles of the sample of B bootstrap samples.

TABLE 4.3 Summary for B = 999 Case Bootstraps for the Transactions Data, Giving 95% Confidence Intervals, Lower to Upper, Based on Standard Normal Theory and on the Percentile Bootstrap (rows: Intercept, T1, T2; columns: Estimate, Lower, Upper under each of Normal Theory and Bootstrap)

Table 4.3 summarizes the percentile bootstrap for the transactions data. The column marked Estimate gives the ols estimate under Normal Theory and the average of the B bootstrap simulations under Bootstrap. The difference between these two is called the bootstrap bias, which is quite small for all three terms relative to the size of the confidence intervals. The 95% bootstrap intervals are consistently wider than the corresponding normal intervals, indicating that the normal-theory confidence intervals are probably overly optimistic. The bootstrap intervals given in Table 4.3 are random, since if the bootstrap is repeated, the answers will be a little different.
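The four steps can be sketched in a few lines. The data below are synthetic stand-ins for the transactions data (the real values are in transact.txt and are not reproduced here), with a right-skewed error to mimic occasional very slow transactions; the small solver fits the two-term mean function (4.16) via the centered normal equations:

```python
import random

def fit2(cases):
    """ols for E(Y | T1, T2) = b0 + b1*T1 + b2*T2 via centered normal equations."""
    n = len(cases)
    m1 = sum(c[0] for c in cases) / n
    m2 = sum(c[1] for c in cases) / n
    my = sum(c[2] for c in cases) / n
    s11 = sum((c[0] - m1) ** 2 for c in cases)
    s22 = sum((c[1] - m2) ** 2 for c in cases)
    s12 = sum((c[0] - m1) * (c[1] - m2) for c in cases)
    s1y = sum((c[0] - m1) * (c[2] - my) for c in cases)
    s2y = sum((c[1] - m2) * (c[2] - my) for c in cases)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return (my - b1 * m1 - b2 * m2, b1, b2)

random.seed(3)
# Synthetic "branches": Time = 150 + 5.5*T1 + 2.0*T2 + right-skewed error
cases = []
for _ in range(120):
    t1, t2 = random.randint(0, 50), random.randint(100, 1000)
    err = random.expovariate(1 / 200) - 200  # skewed errors with mean zero
    cases.append((t1, t2, 150 + 5.5 * t1 + 2.0 * t2 + err))

B = 999
boot_b1 = []
for _ in range(B):
    # Steps 1-2: resample whole cases (rows) with replacement, then refit
    boot_b1.append(fit2(random.choices(cases, k=len(cases)))[1])
boot_b1.sort()

# Steps 3-4: percentile interval from the 25th and 975th smallest of B = 999 values
lower, upper = boot_b1[24], boot_b1[974]
print(round(lower, 2), round(upper, 2))  # 95% interval for the T1 coefficient
```

A percentile interval for a nonlinear function such as β1/β2 works the same way: save the ratio of the two fitted slopes inside the loop instead of a single coefficient.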
The variability in the end-points of the interval can be decreased by increasing the number B of bootstrap samples.

4.6.2 Nonlinear Functions of Parameters

One of the important uses of the bootstrap is to get estimates of error variability in problems where standard theory is either missing or, equally often, unknown to the analyst. Suppose, for example, we wanted to get a confidence interval for the ratio β1/β2 in the transactions data. This is the ratio of the time for a type 1 transaction to the time for a type 2 transaction. The point estimate for this ratio is just β̂1/β̂2, but we will not learn how to get a normal-theory confidence interval for a nonlinear function of parameters like this until later in the book. Using the bootstrap, this computation is easy: just compute the ratio in each of the bootstrap samples and then use the percentiles of the bootstrap distribution to get the confidence interval. For these data, the point estimate is 2.68, with 95% bootstrap confidence interval from 1.76 to 3.86, so with 95% confidence, type 1 transactions take on average from about 1.76 to 3.86 times as long as do type 2 transactions.

4.6.3 Predictors Measured with Error

Predictors and the response are often measured with error. While we might have a theory that tells us the mean function for the response given the true values of the predictors, we must fit with the response given the imperfectly measured values of the predictors. We can sometimes use simulation to understand how the measurement error affects our answers.

Here is the basic setup. We have a true response Y* and a set of terms X*, and a true mean function

    E(Y* | X* = x*) = β'x*

In place of Y* and X* we observe Y = Y* + δ and X = X* + η, where δ and η are measurement errors. If we fit the mean function

    E(Y | X = x) = γ'x

what can we say about the relationship between β and γ? While there is a substantial theoretical literature on this problem (for example, Fuller, 1987), we shall attempt to get an answer to this question using simulation. To do so, we need to know something about δ and η.

Catchability of Northern Pike

One of the questions of interest to fisheries managers is the difficulty of catching a fish. A useful concept is the idea of catchability. Suppose that Y* is the catch for an angler for a fixed amount of effort, and X* is the abundance of fish available in the population that the angler is fishing.
Suppose further that

    E(Y* | X* = x*) = β1 x*    (4.17)

If this mean function were to hold, then we could define β1 to be the catchability of this particular fish species. The data we use come from a study of Northern Pike, a popular game fish in inland lakes in the United States. Data were collected on 16 lakes by Rob Pierce of the Minnesota Department of Natural Resources. On each lake we have a measurement called CPUE, for catch per unit effort, which is the catch for a specific amount of fishing effort. Abundance on the lake is measured using the fish Density, defined to be the number of fish in the lake divided by the surface area of the lake. While surface area can be determined with reasonable accuracy, the number of fish in the lake is estimated using a capture-recapture experiment (Seber, 2002). Since both CPUE and Density are experimentally estimated, they both have standard errors attached to them. In terms of (4.17), we have observed CPUE = Y* + δ and Density = x* + η. In addition, we can obtain estimates of the standard deviations of the δs and ηs from the properties of the methods used to find CPUE and Density. The data file npdata.txt includes both CPUE and Density and their standard errors SECPUE and SEdens.

Figure 4.5 is the plot of the estimated CPUE and Density. Ignoring the lines on the graph, a key characteristic of this graph is the large variability in the points. A straight-line mean function seems plausible for these data, but many other curves are equally plausible. We continue under the assumption that a straight-line mean function is sensible.

FIG. 4.5 Scatterplot of estimated CPUE versus Density for the northern pike data. Solid line is the ols mean function through the origin, and the dashed line is the ols line allowing an intercept.

The two lines on Figure 4.5 are the ols simple regression fits through the origin (solid line) and not through the origin (dashed line). The F-test comparing them has a p-value of about 0.13, so we are encouraged to use the simpler through-the-origin model that will allow us to interpret the slope as the catchability. The estimate is β̂1 = 0.34 with standard error 0.035, so a 95% confidence interval for β1 ignoring measurement errors is (0.250, 0.399).

To assess the effect of measurement error on the estimate and on the confidence interval, we first make some assumptions. First, we suppose that the estimated standard errors of the measurements are the actual standard errors of the measurements. Second, we assume that the measurement errors are independently and normally distributed. Neither of these assumptions is checkable from these data, but for the purposes of a simulation these seem like reasonable assumptions. The simulation proceeds as follows:

1. Generate a pseudo-response vector given by Ỹ = CPUE + δ, where the ith element of δ is a normal random number with mean zero and variance given by the square of the estimated standard error for the ith CPUE value. In this problem, each observation has its own estimated error variance, but in other problems there may be a common estimate for all elements of the response.
2. Repeat step 1, but for the predictor, to get x̃ = Density + η.
3. Fit the simple regression model of Ỹ on x̃ and save the estimated slope.
4. Repeat steps 1-3 B times.

The average of the B values of the slope estimate is an estimate of the slope in the problem with no measurement error. A confidence interval for the slope is found using the percentile method discussed with the bootstrap. The samples generated in steps 1-2 are not quite from the right distribution, as they are centered at the observed values of CPUE and Density rather than the unobserved values of Y* and x*, but the observed values estimate the unobserved true values, so this substitution adds variability to the results but does not affect the validity of the methodology.

TABLE 4.4 Simulation Summary for the Northern Pike Data

                 Point Estimate    95% Confidence Interval
Normal theory    0.34              (0.250, 0.399)
Simulation                         (0.230, 0.387)

The results for B = 999 simulations are summarized in Table 4.4. The results of the normal theory and the simulation that allows for measurement error are remarkably similar. In this problem, we judge the measurement error to be unimportant.

PROBLEMS

4.1. Fit the regression of Soma on AVE, LIN, and QUAD as defined in Section 4.1 for the girls in the Berkeley Guidance Study data, and compare to the results given there.

4.2. Starting with (4.10), we can write

    yi = μy + ρxy (σy/σx)(xi − μx) + εi

4.2.1. Ignoring the error term εi, solve this equation for xi as a function of yi and the parameters.
4.2.2. Find the conditional distribution of xi | yi.
4.2.3. Under what conditions is the equation you obtained in Problem 4.2.1, which is computed by inverting the regression of y on x, the same as the regression of x on y?

4.3. For the transactions data described in Section 4.6.1, define A = (T1 + T2)/2 to be the average transaction time, and D = T1 − T2, and fit the following four mean functions:

    M1: E(Y | T1, T2) = β01 + β11 T1 + β21 T2
    M2: E(Y | T1, T2) = β02 + β32 A + β42 D
    M3: E(Y | T1, T2) = β03 + β23 T2 + β43 D
    M4: E(Y | T1, T2) = β04 + β14 T1 + β24 T2 + β34 A + β44 D

4.3.1. In the fit of M4, some of the coefficient estimates are labelled as either aliased or as missing. Explain what this means.
4.3.2. What aspects of the fitted regressions are the same? What is different?
4.3.3. Why is the estimate for T2 different in M1 and M3?

4.4. Interpreting coefficients with logarithms
4.4.1. For the simple regression with mean function E(log(Y) | X = x) = β0 + β1 log(x), provide an interpretation for β1 as a rate of change in Y for a small change in x.
4.4.2. Show that these results do not depend on the base of the logarithms.

4.5. Use the bootstrap to estimate confidence intervals for the coefficients in the fuel data.

4.6. Windmill data For the windmill data in the data file wm1.txt discussed in Problem 2.13, page 45, use B = 999 replications of the bootstrap to estimate a 95% confidence interval for the long-term average wind speed at the candidate site, and compare this to the prediction interval in Problem 2.13. See the comment at the end of Problem 2.13 to justify using a bootstrap confidence interval for the mean as a prediction interval for the long-term mean.

4.7. Suppose we fit a regression with the true mean function

    E(Y | X1 = x1, X2 = x2) = 3 + 4x1 + 2x2

Provide conditions under which the mean function for E(Y | X1 = x1) is linear but has a negative coefficient for x1.

108 94 DRAWING CONCLUSIONS 4.8. In a study o aculty salaries in a sall college in the Midwest, a linear regression odel was it, giving the itted ean unction E( Salary Sex) = Sex (4.18) where Sex equals one i the aculty eber was eale and zero i ale. The response Salary is easured in dollars (the data are ro the 1970s) Give a sentence that describes the eaning o the two estiated coeicients An alternative ean unction it to these data with an additional ter, Years, the nuber o years eployed at this college, gives the estiated ean unction E( Salary Sex, Years) = Sex + 759Years (4.19) The iportant dierence between these two ean unctions is that the coeicient or Sex has changed signs. Using the results o this chapter, explain how this could happen. (Data consistent with these equations are presented in Proble 6.13) Sleep data For the sleep data described in Section 4.5, describe conditions under which the issing at rando assuption is reasonable. In this case, deleting the partially observed species and analyzing the coplete data can ake sense Describe conditions under which the issing at rando assuption or the sleep data is not reasonable. In this case, deleting partially observed species can change the inerences by changing the deinition o the sapled population Suppose that the sleep data were ully observed, eaning that values or all the variables were available or all 62 species. Assuing that there are ore than 62 species o aals, provide a situation where exaining the issing at rando assuption could still be iportant Thedatagiveninlongley.txt were irst given by Longley (1967) to deonstrate inadequacies o regression coputer progras then available. The variables are: De = GNP price delator, in percent GNP = GNP, in illions o dollars Uneployed = Uneployent, in thousands o persons Ared.Forces = Size o ared orces, in thousands Population = Population 14 years o age and over, in thousands

Employed = Total derived employment, in thousands (the response)
Year = Year

4.10.1. Draw the scatterplot matrix for these data excluding Year, and explain from the plot why this might be a good example to illustrate numerical problems of regression programs. (Hint: numerical problems arise through rounding errors, and these are most likely to occur when terms in the regression model are very highly correlated.)
4.10.2. Fit the regression of Employed on the others, excluding Year.
4.10.3. Suppose that the values given in this example were only accurate to three significant figures (two figures for Def). The effects of measurement errors can be assessed using a simulation study in which we add uniform random values to the observed values and recompute estimates for each simulation. For example, Unemployed for 1947 is given as 2356, which corresponds to 2,356,000. If we assume only three significant figures, we believe only the first three digits. In the simulation we would replace 2356 by 2356 + u, where u is a uniform random number between −5 and +5. Repeat the simulation 1000 times, and on each simulation compute the coefficient estimates. Compare the standard deviation of the coefficient estimates from the simulation to the coefficient standard errors from the regression on the unperturbed data. If the standard deviations in the simulation are as large or larger than the standard errors, we would have evidence that rounding would have important impact on results.
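The perturbation simulation in Problem 4.10.3 can be sketched as follows. This is a minimal illustration using synthetic, nearly collinear predictors as a stand-in for the Longley terms (the actual data live in longley.txt and are not reproduced here); the coefficient values and sample size are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Longley setup: two highly correlated terms.
n = 16
x1 = np.linspace(1000, 5000, n)
x2 = x1 + rng.normal(0, 2, n)              # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([2.0, 0.5, 0.5]) + rng.normal(0, 1, n)

def ols(X, y):
    """Ordinary least squares coefficient estimates."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_unperturbed = ols(X, y)

# Add uniform rounding error on (-5, 5) to each predictor value and refit.
B = 1000
betas = np.empty((B, 3))
for b in range(B):
    Xp = X.copy()
    Xp[:, 1:] += rng.uniform(-5, 5, size=(n, 2))
    betas[b] = ols(Xp, y)

# Standard deviations of the estimates across simulations; these are the
# quantities to compare with the unperturbed coefficient standard errors.
sim_sd = betas.std(axis=0)
```

With highly correlated terms, the simulated standard deviations can be surprisingly large relative to the nominal standard errors, which is the point of the Longley example.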

CHAPTER 5

Weights, Lack of Fit, and More

This chapter introduces a number of additional basic tools for fitting and using multiple linear regression models.

5.1 WEIGHTED LEAST SQUARES

The assumption that the variance function Var(Y|X) is the same for all values of the terms X can be relaxed in a number of ways. In the simplest case, suppose we have the multiple regression mean function given for the ith case by

E(Y|X = x_i) = β'x_i   (5.1)

but rather than assume that the errors have constant variance, we assume that

Var(Y|X = x_i) = Var(e_i) = σ²/w_i   (5.2)

where w_1,...,w_n are known positive numbers. The variance function is still characterized by only one unknown positive number σ², but the variances can be different for each case. This leads to the use of weighted least squares, or wls, in place of ols, to get estimates. In matrix terms, let W be an n × n diagonal matrix with the w_i on the diagonal. The model we now use is

Y = Xβ + e,   Var(e) = σ²W⁻¹   (5.3)

We will continue to use the symbol β̂ for the estimator of β, even though the estimate will be obtained via weighted, not ordinary, least squares. The estimator β̂ is chosen to minimize the weighted residual sum of squares function,

RSS(β) = (Y − Xβ)'W(Y − Xβ)   (5.4)
       = Σ w_i (y_i − x_i'β)²   (5.5)

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

The use of the weighted residual sum of squares recognizes that some of the errors are more variable than others: cases with large values of w_i have small variance and are therefore given more weight in the weighted RSS. The wls estimator is given by

β̂ = (X'WX)⁻¹X'WY   (5.6)

While this last equation can be found directly, it is convenient to transform the problem specified by (5.3) to one that can be solved by ols. Then all of the results for ols can be applied to wls problems. Let W^(1/2) be the n × n diagonal matrix with ith diagonal element √w_i, so that W^(−1/2) is a diagonal matrix with 1/√w_i on the diagonal, and W^(−1/2)W^(1/2) = I. Using Appendix A.7 on random vectors, the covariance matrix of W^(1/2)e is

Var(W^(1/2)e) = W^(1/2)Var(e)W^(1/2) = W^(1/2)(σ²W⁻¹)W^(1/2)
             = σ²W^(1/2)(W^(−1/2)W^(−1/2))W^(1/2) = σ²I   (5.7)

This means that the vector W^(1/2)e is a random vector with covariance matrix equal to σ² times the identity matrix. Multiplying both sides of equation (5.3) by W^(1/2) gives

W^(1/2)Y = W^(1/2)Xβ + W^(1/2)e   (5.8)

Define Z = W^(1/2)Y, M = W^(1/2)X, and d = W^(1/2)e, and (5.8) becomes

Z = Mβ + d   (5.9)

From (5.7), Var(d) = σ²I, and in (5.9) β is exactly the same as β in (5.3). Model (5.9) can be solved using ols:

β̂ = (M'M)⁻¹M'Z
  = ((W^(1/2)X)'(W^(1/2)X))⁻¹(W^(1/2)X)'(W^(1/2)Y)
  = (X'W^(1/2)W^(1/2)X)⁻¹(X'W^(1/2)W^(1/2)Y)
  = (X'WX)⁻¹(X'WY)

which is the estimator given at (5.6).
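The equivalence between direct wls, equation (5.6), and ols on the transformed problem (5.9) can be checked numerically. This is a sketch with simulated data; the weights and coefficient values are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
w = rng.uniform(0.5, 2.0, n)                 # known positive weights w_i
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, n) / np.sqrt(w)

W = np.diag(w)

# Direct wls:  beta-hat = (X'WX)^{-1} X'WY,  equation (5.6)
beta_direct = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Transformed ols: regress Z = W^{1/2}Y on M = W^{1/2}X, equation (5.9)
sw = np.sqrt(w)
M = X * sw[:, None]          # even the column of ones is multiplied by sqrt(w_i)
Z = y * sw
beta_via_ols = np.linalg.lstsq(M, Z, rcond=None)[0]
```

The two coefficient vectors agree to machine precision, which is the algebraic identity derived above.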

To summarize, the wls regression of Y on X with weights given by the diagonal elements of W is the same as the ols regression of Z on M, where M has ith row √w_i (1, x_i1, ..., x_ip) and Z has ith element √w_i y_i. Even the column of ones gets multiplied by the √w_i. The regression problem is then solved using M and Z in place of X and Y.

Applications of Weighted Least Squares
Known weights w_i can occur in many ways. If the ith response is an average of n_i equally variable observations, then Var(y_i) = σ²/n_i, and w_i = n_i. If y_i is a total of n_i observations, Var(y_i) = n_i σ², and w_i = 1/n_i. If variance is proportional to some predictor x_i, Var(y_i) = x_i σ², then w_i = 1/x_i.

Strong Interaction
The purpose of the experiment described here is to study the interactions of unstable elementary particles in collision with proton targets (Weisberg et al., 1978). These particles interact via the so-called strong interaction force that holds nuclei together. Although the electromagnetic force is well understood, the strong interaction is somewhat mysterious, and this experiment was designed to test certain theories of the nature of the strong interaction. The experiment was carried out with beams having various values of incident momentum, or equivalently for various values of s, the square of the total energy in the center-of-mass frame of reference system. For each value of s, we observe the scattering cross-section y, measured in microbarns (µb). A theoretical model of the strong interaction force predicts that

E(y|s) = β0 + β1 s^(−1/2) + relatively small terms   (5.10)

The theory makes quantitative predictions about β0 and β1 and their dependence on particular input and output particle type. Of interest, therefore, are: (1) estimation of β0 and β1, given (5.10), and (2) assessment of whether (5.10) provides an accurate description of the observed data.
The data given in Table 5.1 and in the file physics.txt summarize the results of experiments when both the input and output particle was the π meson. At each value of s, a very large number of particles was counted, and as a result the values of Var(y|s = s_i) = σ²/w_i are known almost exactly; the square roots of these values are given in the third column of Table 5.1, labelled SD_i. The variables in the file are labelled as x, y, and SD, corresponding to Table 5.1. Ignoring the smaller terms, mean function (5.10) is a simple linear regression mean function with terms for an intercept and x = s^(−1/2). We will need to use wls,

TABLE 5.1 The Strong Interaction Data (columns: x = s^(−1/2), y in µb, SD_i)

TABLE 5.2 wls Estimates for the Strong Interaction Data (coefficients for (Intercept) and x with standard errors, t values, and p-values; residual standard error on 8 degrees of freedom; Multiple R-Squared; analysis of variance table)

because the variances are not constant but are different for each value of s. In this problem, we are in the unusual situation that we not only know the weights but also know the value of σ²/w_i for each value of i. There are 11 quantities w_1,...,w_10 and σ² that describe the values of only 10 variances, so we have too many parameters, and we are free to specify one of the 11 parameters to be any nonzero value we like. The simplest approach is to set σ² = 1. If σ² = 1, then the last column of Table 5.1 gives 1/√w_i, i = 1, 2,...,n, and so the weights are just the inverse squares of the last column of this table.

The fit of the simple regression model via wls is summarized in Table 5.2. R² is large, and the parameter estimates are well determined. The next question is whether (5.10) does in fact fit the data. This question of fit or lack of fit of a model is the subject of the next section.

Additional Comments
Many statistical models, including mixed effects, variance components, time series, and some econometric models, will specify that Var(e) = Σ, where Σ is an n × n symmetric matrix that depends on a small number of parameters. Estimates for

the generalized least squares problem minimize (5.4), with W replaced by Σ⁻¹. Pinheiro and Bates (2000) is one recent source for discussion of these models.

If many observations are taken at each value of x, the inverse of the sample variance of y given x can provide useful estimated weights. This method was used to get weights in the strong interaction data, where the number of cases per value of x was extremely large. Problem 5.6 provides another example using estimated weights as if they were true weights. The usefulness of this method depends on having a large sample size at each value of x.

In some problems, Var(Y|X) will depend on the mean E(Y|X). For example, if the response is a count that follows a Poisson distribution, then Var(Y|X) = E(Y|X), while if the response follows a gamma distribution, Var(Y|X) = σ²(E(Y|X))². The traditional approach to fitting when the variance depends on the mean is to use a variance stabilizing transformation, as will be described in Section 8.3, in which the response is replaced by a transformation of it so that the variance function is approximately constant in the transformed scale. Nelder and Wedderburn (1972) introduced generalized linear models that extend linear regression methodology to problems in which the variance function depends on the mean. One particular generalized linear model, in which the response is a binomial count, leads to logistic regression and is discussed in Chapter 12. The other most important example of a generalized linear model is Poisson regression and log-linear models, discussed by Agresti (1996). McCullagh and Nelder (1989) provide a general introduction to generalized linear models.

5.2 TESTING FOR LACK OF FIT, VARIANCE KNOWN

When the mean function used in fitting is correct, the residual mean square σ̂² provides an unbiased estimate of σ².
If the mean function is wrong, then σ̂² will estimate a quantity larger than σ², since its size will depend both on the errors and on systematic bias from fitting the wrong mean function. If σ² is known, or if an estimate of it is available that does not depend on the fitted mean function, a test for lack of fit of the model can be obtained by comparing σ̂² to the model-free value. If σ̂² is too large, we may have evidence that the mean function is wrong.

In the strong interaction data, we want to know if the straight-line mean function (5.10) is correct. As outlined in Section 5.1, the inverses of the squares of the values in column 3 of Table 5.1 are used as weights when we set σ² = 1, a known value. Table 5.2 gives σ̂². Evidence against the simple regression model will be obtained if we judge σ̂² large when compared with the known value of σ² = 1. To assign a p-value to this comparison, we use the following result. If the e_i ~ NID(0, σ²/w_i), i = 1, 2,...,n, with the w_i and σ² known, parameters in the mean function are estimated using wls, and the mean function is correct, then

X² = RSS/σ² = (n − p')σ̂²/σ²   (5.11)

is distributed as a chi-squared random variable with n − p' df. As usual, RSS is the residual sum of squares. For the example, X² is computed from the residual sum of squares in Table 5.2. Using a table or a computer program that computes quantiles of the χ² distribution, χ²(0.01, 8) = 20.09, and the observed X² exceeds this value, so the p-value associated with the test is less than 0.01, which suggests that the straight-line mean function may not be adequate.

When this test indicates lack of fit, it is usual to fit alternative mean functions, either by transforming some of the terms or by adding polynomial terms in the predictors. The available physical theory suggests this latter approach, and the quadratic mean function

E(y|s) = β0 + β1 s^(−1/2) + β2 (s^(−1/2))² + smaller terms
       ≈ β0 + β1 x + β2 x²   (5.12)

with x = s^(−1/2) should be fit to the data. This mean function has three terms: an intercept, x = s^(−1/2), and x² = s^(−1). Fitting must use wls with the same weights as before, as given in Table 5.3. The fitted curve for the quadratic fit is shown in Figure 5.1. The curve matches the data very closely. We can test for lack of fit of this model by computing X² = RSS/σ². Comparing this value with the percentage points of χ²(7) gives a p-value of about 0.86, indicating no evidence of lack of fit for mean function (5.12).

TABLE 5.3 wls Estimates for the Quadratic Mean Function for the Strong Interaction Data (coefficients for (Intercept), x, and x² with standard errors, t values, and p-values; residual standard error on 7 degrees of freedom; analysis of variance table)
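The comparison of X² = RSS/σ² with the χ² distribution can be sketched in code. Since the degrees of freedom here (8 and 7 after the quadratic fit) are small, the sketch below uses the closed-form survival function that holds for even df; for odd or fractional df one would need an incomplete-gamma routine, which is omitted here.

```python
import math

def lack_of_fit_x2(rss, sigma2=1.0):
    """Lack-of-fit statistic X^2 = RSS / sigma^2, equation (5.11)."""
    return rss / sigma2

def chi2_sf_even_df(x, df):
    """P(chi-squared_df > x), exact closed form for even df:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    if df % 2 != 0:
        raise ValueError("this closed form requires even df")
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total
```

As a check, `chi2_sf_even_df(20.09, 8)` is approximately 0.01, matching the χ²(0.01, 8) = 20.09 critical value quoted in the text.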

FIG. 5.1 Scatterplot for the strong interaction data (cross section y versus x = s^(−1/2)). Solid line: simple linear regression mean function. Dashed line: quadratic mean function.

Although (5.10) does not describe the data, (5.12) does result in an adequate fit. Judgment of the success or failure of the model for the strong interaction force requires analysis of data for other choices of incidence and product particles, as well as the data analyzed here. On the basis of this further analysis, Weisberg et al. (1978) concluded that the theoretical model for the strong interaction force is consistent with the observed data.

5.3 TESTING FOR LACK OF FIT, VARIANCE UNKNOWN

When σ² is unknown, a test for lack of fit requires a model-free estimate of the variance. The most common model-free estimate makes use of the variation between cases with the same values on all of the predictors. For example, consider the artificial data with n = 10 given in Table 5.4. The data were generated by first choosing the values of x_i and then computing y_i = x_i + e_i, i = 1, 2,...,10, where the e_i are standard normal random deviates. If we consider only the values of y_i corresponding to x = 1, we can compute the average response ȳ and the standard deviation with 3 − 1 = 2 df, as shown in the table. If we assume that the variance is the same for all values of x, a pooled estimate of the common variance is obtained by pooling the individual standard deviations into a single estimate. If n is the number of cases at a value of x and SD is the standard deviation for that value of x, then the sum of squares for pure error, symbolically SS_pe, is given by

SS_pe = Σ (n − 1)SD²   (5.13)

TABLE 5.4 An Illustration of the Computation of Pure Error (columns: X, Y, ȳ, (n − 1)SD², SD, df)

where the sum is over all groups of cases. For example, SS_pe is simply the sum of the numbers in the fourth column of Table 5.4. Associated with SS_pe is its df, df_pe = Σ(n − 1) = 6. The pooled, or pure-error, estimate of variance is σ̂²_pe = SS_pe/df_pe. This is the same estimate that would be obtained for the residual variance if the data were analyzed using a one-way analysis of variance, grouping according to the value of x.

The pure-error estimate of variance makes no reference to the linear regression mean function. It uses only the assumption that the residual variance is the same for each x. Now suppose we fit a linear regression mean function to the data. The analysis of variance is given in Table 5.5. The residual mean square in Table 5.5 provides an estimate of σ², but this estimate depends on the mean function. Thus we have two estimates of σ², and if the latter is much larger than the former, the model is inadequate. We can obtain an F-test if the residual sum of squares in Table 5.5 is divided into two parts: the sum of squares for pure error, as given in Table 5.4, and the remainder, called the sum of squares for lack of fit,

SS_lof = RSS − SS_pe

with df = n − p' − df_pe. The F-test is the ratio of the

TABLE 5.5 Analysis of Variance for the Data in Table 5.4 (Df, Sum Sq, Mean Sq, F value, and Pr(>F) for Regression and Residuals, with Residuals split into Lack of fit and Pure error)
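The computation of SS_pe and its df in equation (5.13) amounts to grouping cases by their predictor value and pooling within-group sums of squares, as in this sketch:

```python
from collections import defaultdict

def pure_error(x, y):
    """Sum of squares for pure error, equation (5.13), and its df.

    Groups cases by identical predictor value; groups with a single
    case contribute nothing to either SS_pe or df_pe.
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss, df = 0.0, 0
    for ys in groups.values():
        m = len(ys)
        if m > 1:
            ybar = sum(ys) / m
            ss += sum((yi - ybar) ** 2 for yi in ys)   # (m - 1) * SD^2
            df += m - 1
    return ss, df
```

For multiple predictors, the grouping key would simply be the tuple of all predictor values for a case, since pure error uses cases that match on every predictor.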

mean square for lack of fit to the mean square for pure error. The observed F = 2.36 is considerably smaller than F(0.05; 2, 6) = 5.14, suggesting no lack of fit of the model to these data.

Although all the examples in this section have a single predictor, the ideas used to get a model-free estimate of σ² are perfectly general. The pure-error estimate of variance is based on the sum of squares between the values of the response for all cases with the same values on all of the predictors.

Apple Shoots
Many types of trees produce two types of morphologically different shoots. Some branches remain vegetative year after year and contribute considerably to the size of the tree. Called long shoots or leaders, they may grow as much as 15 or 20 cm in one growing season. Short shoots, rarely exceeding 1 cm in total length, produce fruit. To complicate the issue further, long shoots occasionally change to short in a new growing season and vice versa. The mechanism that the tree uses to control the long and short shoots is not well understood.

Bland (1978) has done a descriptive study of the difference between long and short shoots of McIntosh apple trees. Using healthy trees of clonal stock planted in 1933 and 1934, he took samples of long and short shoots from the trees every few days throughout the 1971 growing season of about 106 days. The shoots sampled are presumed to be a sample of available shoots at the sampling dates. The sampled shoots were removed from the tree, marked, and taken to the laboratory for analysis. Among the many measurements taken, Bland counted the number of stem units in each shoot. The long and the short shoots could differ because of the number of stem units, the average size of stem units, or both. Bland's data are given in the data files longshoots.txt for data on long shoots, shortshoots.txt for data on short shoots, and allshoots.txt for both long and short shoots. We will consider only the long shoots, leaving the short shoots to the problems section.
Our goal is to find an equation that can adequately describe the relationship between Day = days from dormancy and Y = number of stem units. Lacking a theoretical form for this equation, we first examine Figure 5.2, a scatterplot of average number of stem units versus Day. The apparent linearity of this plot should encourage us to fit a straight-line mean function,

E(Y|Day) = β0 + β1 Day   (5.14)

If this mean function were adequate, we would have the interesting result that the observed rate of production of stem units per day is constant over the growing season. For each sampled day, Table 5.6 reports n = number of shoots sampled, ȳ = average number of stem units on that day, and SD = within-day standard deviation. Assuming that the residual variance is constant from day to day, we can do the regression in two ways. On the basis of the summaries given, since Var(ȳ|Day) = σ²/n, we must compute the wls regression of ȳ on Day with weights given by the values of n. This is

FIG. 5.2 Scatterplot for long shoots in the apple shoot data (number of stem units versus days since dormancy).

summarized in Table 5.7a. Alternatively, if the original 189 data points were available, we could compute the unweighted regression of the original data on Day. This is summarized in Table 5.7b. Both methods give identical intercept, slope, and regression sum of squares. They differ on any calculation that uses the residual sum of squares, because in Table 5.7b the residual sum of squares is the sum of SS_pe and SS_lof. For example, the standard errors of the coefficients in the two tables differ because in Table 5.7a the apparent estimate of variance has 20 df, while in Table 5.7b it has 187 df. Using pure error alone to estimate σ̂² may be appropriate, especially if the model is doubtful; this would lead to a third set of standard errors.

The SS_pe can be computed directly from Table 5.6 using (5.13), SS_pe = Σ(n − 1)SD², with Σ(n − 1) = 167 df. Since the F-test for lack of fit exceeds F(0.01; 20, 167) = 1.99, the p-value for this test is less than 0.01, indicating that the straight-line mean function (5.14) does not appear to be adequate. However, an F-test with this many df is very powerful and will detect very small deviations from the null hypothesis. Thus, while the result here is statistically significant, it may not be scientifically important, and for purposes of describing the growth of apple shoots, the mean function (5.14) may be adequate.

5.4 GENERAL F TESTING

We have encountered several situations that lead to computation of a statistic that has an F distribution when a null hypothesis (NH) and normality hold. The theory for the F-tests is quite general. In the basic structure, a smaller mean function of the null hypothesis is compared with a larger mean function of the alternative

TABLE 5.6 Bland's Data for Long and Short Apple Shoots (for each group: Day, n, ȳ, SD, Len; Len = 0 for short shoots and 1 for long shoots)

hypothesis (AH), and the smaller mean function can be obtained from the larger by setting some parameters in the larger mean function equal to zero, equal to each other, or equal to some specific value. One example previously encountered is testing to see if the last q terms in a mean function are needed after fitting the first p' − q. In matrix notation, partition X = (X1, X2), where X1 is n × (p' − q) and X2 is n × q, and partition β' = (β1', β2'), where β1 is (p' − q) × 1 and β2 is q × 1, so the two hypotheses in matrix terms are

NH: Y = X1β1 + e
AH: Y = X1β1 + X2β2 + e   (5.15)

TABLE 5.7 Regression for Long Shoots in the Apple Shoot Data: (a) wls regression of ȳ on Day using day means, with coefficients for (Intercept) and Day and residual standard error on 20 degrees of freedom; (b) ols regression of y on Day, with residual standard error on 187 degrees of freedom and the residual sum of squares split into lack of fit and pure error.

The smaller model is obtained from the larger by setting β2 = 0. The most general approach to computing the F-test is to fit two regressions. After fitting NH, find the residual sum of squares and its degrees of freedom, RSS_NH and df_NH. Similarly, under the alternative mean function, compute RSS_AH and df_AH. We must have df_NH > df_AH, since the alternative mean function has more parameters. Also, RSS_NH − RSS_AH ≥ 0, since the fit of the AH must be at least as good as the fit of the NH. The F-test then gives evidence against NH if

F = [(RSS_NH − RSS_AH)/(df_NH − df_AH)] / [RSS_AH/df_AH]   (5.16)

is large when compared with the F(df_NH − df_AH, df_AH) distribution.

Non-null Distributions
The numerator and denominator of (5.16) are independent of each other. Assuming normality, apart from the degrees of freedom, the denominator is distributed as σ² times a χ² random variable under both NH and AH. Ignoring the degrees of
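The general F statistic (5.16) needs only the two residual sums of squares and their degrees of freedom, so it reduces to a few lines of code:

```python
def general_f(rss_nh, df_nh, rss_ah, df_ah):
    """General F statistic, equation (5.16), comparing a null-hypothesis
    mean function to a larger alternative in which it is nested."""
    if df_nh <= df_ah:
        raise ValueError("AH must have more parameters than NH")
    numerator = (rss_nh - rss_ah) / (df_nh - df_ah)
    denominator = rss_ah / df_ah
    return numerator / denominator
```

The value returned is compared with the F(df_NH − df_AH, df_AH) distribution; the lack-of-fit F of Section 5.3 is the special case with the one-way-grouping fit as the alternative.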

freedom, when NH is true the numerator is also distributed as σ² times a χ², so the ratio (5.16) has an F-distribution under NH because the F is defined to be the ratio of two independent χ² random variables, each divided by its degrees of freedom. When AH is true, apart from degrees of freedom, the numerator is distributed as σ² times a noncentral χ². In particular, the expected value of the numerator of (5.16) will be

E(numerator of (5.16)) = σ²(1 + noncentrality parameter)   (5.17)

For hypothesis (5.15), the noncentrality parameter is given by the expression

β2'X2'(I − X1(X1'X1)⁻¹X1')X2β2 / (qσ²)   (5.18)

To help understand this, consider the special case of X2'X2 = I and X1'X2 = 0, so the terms in X2 are uncorrelated with each other and with the terms in X1. Then (5.17) becomes

E(numerator) = σ² + β2'β2/q   (5.19)

For this special case, the expected value of the numerator of (5.16), and the power of the F-test, will be large if β2'β2 is large. In the general case where X1'X2 ≠ 0, the results are more complicated, and the size of the noncentrality parameter, and power of the F-test, depend not only on σ² but also on the sample correlations between the variables in X1 and those in X2. If these correlations are large, then the power of F may be small even if β2'β2 is large. More general results on F-tests are presented in advanced linear model texts such as Seber (1977).

Additional Comments
The F distribution for (5.16) is exact if the errors are normally distributed, and in this case it is the likelihood ratio test for (5.15). The F-test is generally robust to departures from normality of the errors, meaning that estimates, tests, and confidence procedures are only modestly affected by modest departures from normality. In any case, when normality is in doubt, the bootstrap described in Section 4.6 can be used to get significance levels for (5.16).
5.5 JOINT CONFIDENCE REGIONS

Just as confidence intervals for a single parameter are based on the t distribution, confidence regions for several parameters will require use of an F distribution. The regions are elliptical. The (1 − α) × 100% confidence region for β is the set of vectors β such that

(β − β̂)'(X'X)(β − β̂) / (p'σ̂²) ≤ F(α; p', n − p')   (5.20)

The confidence region for β*, the parameter vector excluding β0, is, using the notation of Chapter 3, the set of vectors β* such that

(β* − β̂*)'(X*'X*)(β* − β̂*) / (pσ̂²) ≤ F(α; p, n − p')   (5.21)

The region (5.20) is a p'-dimensional ellipsoid centered at β̂, while (5.21) is a p-dimensional ellipsoid centered at β̂*. For example, the 95% confidence region for (β1, β2) in the regression of log(Fertility) on log(PPgdp) and Purban in the UN data is given in Figure 5.3. This ellipse is centered at the point estimate (β̂1, β̂2). The orientation of the ellipse, the direction of the major axis, is negative, reflecting the negative correlation between the estimates of these two coefficients. The horizontal and vertical lines shown on the plot are the marginal 95% confidence intervals for each of the two coefficient estimates. From the graph, it is apparent that there are values of the coefficients in the 95% joint confidence region that would be viewed as implausible if we examined only the marginal intervals.

A slight generalization is needed to get a confidence ellipsoid for an arbitrary subset of β. Suppose that β1 is a subvector of β with q elements. Let S be the q × q submatrix of (X'X)⁻¹ corresponding to the q elements of β1. Then the 95% confidence region is the set of points β1 such that

(β1 − β̂1)'S⁻¹(β1 − β̂1) / (qσ̂²) ≤ F(α; q, n − p')   (5.22)

FIG. 5.3 95% confidence region for the UN data (coefficient for Purban versus coefficient for log(PPgdp)).
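Checking whether a candidate parameter vector lies inside the joint region (5.20) is a single quadratic-form computation. In this sketch the F critical value is passed in by the caller (e.g. read from tables), since computing F quantiles is outside the scope of the example:

```python
import numpy as np

def in_joint_region(beta, beta_hat, XtX, sigma2_hat, p_prime, f_crit):
    """Equation (5.20): True if beta lies inside the joint confidence
    ellipsoid.  f_crit is the F(alpha; p', n - p') critical value,
    supplied by the caller."""
    d = np.asarray(beta, dtype=float) - np.asarray(beta_hat, dtype=float)
    return float(d @ XtX @ d) <= p_prime * sigma2_hat * f_crit
```

The same function handles (5.21) and (5.22) if the appropriate subvector, inverse submatrix, and degrees of freedom are supplied, which is how a point can be inside both marginal intervals yet outside the joint ellipse.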

The bootstrap can also be used to get joint confidence regions by generalizing the method outlined for confidence intervals in Section 4.6, but the number of bootstrap replications B required is likely to be much larger than the number needed for intervals. The bootstrap confidence region would be the smallest set that includes (1 − α) × 100% of the bootstrap replications.

PROBLEMS

5.1. Galton's sweet peas Many of the ideas of regression first appeared in the work of Sir Francis Galton on the inheritance of characteristics from one generation to the next. In a paper on "Typical Laws of Heredity," delivered to the Royal Institution on February 9, 1877, Galton discussed some experiments on sweet peas. By comparing the sweet peas produced by parent plants to those produced by offspring plants, he could observe inheritance from one generation to the next. Galton categorized parent plants according to the typical diameter of the peas they produced. For seven size classes from 0.15 to 0.21 inches, he arranged for each of nine of his friends to grow 10 plants from seed in each size class; however, two of the crops were total failures. A summary of Galton's data was later published by Karl Pearson (1930) (see Table 5.8 and the data file galtonpeas.txt). Only average diameter and standard deviation of the offspring peas are given by Pearson; sample sizes are unknown.
5.1.1. Draw the scatterplot of Progeny versus Parent.
5.1.2. Assuming that the standard deviations given are population values, compute the weighted regression of Progeny on Parent. Draw the fitted mean function on your scatterplot.
5.1.3. Galton wanted to know if characteristics of the parent plant, such as size, were passed on to the offspring plants.
In fitting the regression, a parameter value of β1 = 1 would correspond to perfect inheritance, while β1 < 1 would suggest that the offspring are reverting toward "what may be roughly and perhaps fairly described as the average ancestral type" (the substitution of "regression" for "reversion" was

TABLE 5.8 Galton's Peas Data (columns: Parent Diameter (.01 in), Progeny Diameter (.01 in), SD)

probably due to Galton in 1885). Test the hypothesis that β1 = 1 versus the alternative that β1 < 1.
5.1.4. In his experiments, Galton took the average size of all peas produced by a plant to determine the size class of the parental plant. Yet for seeds to represent that plant and produce offspring, Galton chose seeds that were as close to the overall average size as possible. Thus, for a small plant, the exceptional large seed was chosen as a representative, while larger, more robust plants were represented by relatively smaller seeds. What effects would you expect these experimental biases to have on (1) estimation of the intercept and slope and (2) estimates of error?

5.2. Apple shoots Apply the analysis of Section 5.3 to the data on short shoots in Table 5.6.

5.3. Nonparametric lack of fit The lack-of-fit tests in the preceding sections require either a known value for σ² or repeated observations for a given value of the predictor that can be used to obtain a model-free, or pure-error, estimate of σ². Loader (2004, Sec. 4.3) describes a lack-of-fit test that can be used without repeated observations or prior knowledge of σ², based on comparing the fit of the parametric model to the fit of a smoother. For illustration, consider Figure 5.4, which uses data that will be described later in this problem.

FIG. 5.4 Height versus Dbh for the Upper Flat Creek grand fir data. The solid line is the ols fit; the dashed line is the loess fit with smoothing parameter 2/3.

For each data point, we can find the fitted value ŷ_i from the parametric fit, which is just a point on the line, and ỹ_i, the fitted value from the smoother, which is

a point on the dashed line. If the parametric model is appropriate for the data, then the differences (ŷ_i − ỹ_i) should all be relatively small. A suggested test statistic is based on looking at the squared differences and then dividing by an estimate of σ²:

G = Σ_{i=1}^{n} (ŷ_i − ỹ_i)² / σ̂²   (5.23)

where σ̂² is the estimate of variance from the parametric fit. Large values of G provide evidence against the NH that the parametric mean function matches the data. Loader (2004) provides an approximation to the distribution of G and also a bootstrap for computing an approximate significance level for a test based on G. In this problem, we will present the bootstrap.
5.3.1. The appropriate bootstrap algorithm is a little different from what we have seen before, and uses a parametric bootstrap. It works as follows:
1. Fit the parametric and smooth regressions to the data, and compute G from (5.23). Save the residuals, ê_i = y_i − ŷ_i, from the parametric fit.
2. Obtain a bootstrap sample ê*_1,...,ê*_n by sampling with replacement from ê_1,...,ê_n. Some residuals will appear in the sample many times, some not at all.
3. Given the bootstrap residuals, compute a bootstrap response Y* with elements y*_i = ŷ_i + ê*_i. Use the original predictors unchanged in every bootstrap sample. Obtain the parametric and nonparametric fitted values with the response Y*, and then compute G* from (5.23).
4. Repeat steps 2–3 B times.
5. The significance level of the test is estimated to be the fraction of bootstrap samples that give a value of G* that exceeds the observed G.
The important problem of selecting a smoothing parameter for the smoother has been ignored. If the loess smoother is used, selecting the smoothing parameter to be 2/3 is a reasonable default, and statistical packages may include methods to choose a smoothing parameter. See Simonoff (1996), Bowman and Azzalini (1997), and Loader (2004) for more discussion of this issue.
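The five steps above can be sketched as follows. A crude running-mean smoother stands in for loess (which is not assumed to be available here), so this is a sketch of the algorithm's structure rather than a replication of Loader's test; the window size k is an arbitrary choice.

```python
import numpy as np

def line_fit(x, y):
    """Parametric (simple linear regression) fitted values."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ beta

def running_mean_smooth(x, y, k=7):
    """Crude running-mean smoother, a stand-in for loess."""
    order = np.argsort(x)
    yo = y[order]
    out = np.empty_like(y)
    n = len(y)
    for j in range(n):
        lo, hi = max(0, j - k // 2), min(n, j + k // 2 + 1)
        out[order[j]] = yo[lo:hi].mean()
    return out

def g_statistic(x, y):
    """G from (5.23), plus the parametric fits and residuals (step 1)."""
    yhat = line_fit(x, y)
    ytilde = running_mean_smooth(x, y)
    resid = y - yhat
    sigma2_hat = (resid ** 2).sum() / (len(y) - 2)
    return ((yhat - ytilde) ** 2).sum() / sigma2_hat, yhat, resid

def bootstrap_pvalue(x, y, B=199, seed=0):
    """Steps 2-5: parametric bootstrap significance level for G."""
    rng = np.random.default_rng(seed)
    g_obs, yhat, resid = g_statistic(x, y)
    exceed = 0
    for _ in range(B):
        # resample residuals with replacement, rebuild a bootstrap response
        y_star = yhat + rng.choice(resid, size=len(resid), replace=True)
        exceed += g_statistic(x, y_star)[0] >= g_obs
    return g_obs, exceed / B
```

When the data really are linear, the returned significance level should usually be unremarkable; lack of fit shows up as a small fraction of bootstrap G* values exceeding the observed G.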
5.3.2. Write a computer program that implements this algorithm for regression with one predictor.
5.3.3. The data file ufcgf.txt gives the diameter Dbh in millimeters at 137 cm perpendicular to the bole, and the Height of the tree in decimeters, for a sample of grand fir trees at Upper Flat Creek, Idaho, in 1991, courtesy of Andrew Robinson. Also included in the file are the Plot number, the Tree number in a plot, and the Species, which is always GF for these data. Use the computer program you wrote in the last subproblem to test for lack of fit of the simple linear regression mean function for the regression of Height on Dbh.

PROBLEMS

5.4. An F-test In simple regression, derive an explicit formula for the F-test of

NH: E(Y|X = x) = x  (β₀ = 0, β₁ = 1)
AH: E(Y|X = x) = β₀ + β₁x

5.5. Snow geese Aerial surveys sometimes rely on visual methods to estimate the number of animals in an area. For example, to study snow geese in their summer range areas west of Hudson Bay in Canada, small aircraft were used to fly over the range, and when a flock of geese was spotted, an experienced person estimated the number of geese in the flock. To investigate the reliability of this method of counting, an experiment was conducted in which an airplane carrying two observers flew over n = 45 flocks, and each observer made an independent estimate of the number of birds in each flock. Also, a photograph of the flock was taken so that a more or less exact count of the number of birds in the flock could be made. The resulting data are given in the data file snowgeese.txt (Cook and Jacobson, 1978). The three variables in the data set are Photo = photo count, Obs1 = aerial count by observer one, and Obs2 = aerial count by observer two.

5.5.1. Draw a scatterplot matrix of the three variables. Do these graphs suggest that a linear regression model might be appropriate for the regression of Photo on either of the observer counts, or on both of the observer counts? Why or why not? For the simple regression model of Photo on Obs1, what do the error terms measure? Why is it appropriate to fit the regression of Photo on Obs1 rather than the regression of Obs1 on Photo?

5.5.2. Compute the regression of Photo on Obs1 using OLS, and test the hypothesis of Problem 5.4. State in words the meaning of this hypothesis and the result of the test. Is the observer reliable (you must define reliable)? Summarize your results.

5.5.3. Repeat Problem 5.5.2, except fit the regression of Photo^(1/2) on Obs1^(1/2). The square-root scale is used to stabilize the error variance.

5.5.4. Repeat Problem 5.5.2, except assume that the variance of an error is Obs1 × σ².

5.5.5. Do both observers combined do a better job at predicting Photo than either predictor separately?
To answer this question, you may wish to look at the regression of Photo on both Obs1 and Obs2. Since, from the scatterplot matrix, the two terms are highly correlated, interpretation of results might be a bit hard. An alternative is to replace Obs1 and Obs2 by Average = (Obs1 + Obs2)/2 and Diff = Obs1 − Obs2. The new terms have the same information as the observer counts, but they are much less correlated. You might also need to consider using WLS. As a result of this experiment, the practice of using visual counts of flock size to determine population estimates was discontinued in favor of using photographs.

TABLE 5.9 Jevons Gold Coinage Data
(columns: Age in decades; sample size n; average Weight; SD; minimum Weight; maximum Weight)

5.6. Jevons' gold coins The data in this example are deduced from a diagram in a paper written by W. Stanley Jevons (1868) and provided by Stephen M. Stigler. In a study of coinage, Jevons weighed 274 gold sovereigns that he had collected from circulation in Manchester, England. For each coin, he recorded the weight after cleaning to the nearest 0.001 g, and the date of issue. Table 5.9 lists the average, minimum, and maximum weight for each age class. The age classes are coded 1 to 5, roughly corresponding to the age of the coin in decades. The standard weight of a gold sovereign was supposed to be 7.9876 g; the minimum legal weight was 7.9379 g. The data are given in the file jevons.txt.

5.6.1. Draw a scatterplot of Weight versus Age, and comment on the applicability of the usual assumptions of the linear regression model. Also draw a scatterplot of SD versus Age, and summarize the information in this plot.

5.6.2. Since the numbers of coins n in each age class are all fairly large, it is reasonable to pretend that the variance of coin weight for each Age is well approximated by SD², and hence Var(Weight) is given by SD²/n. Compute the implied WLS regression.

5.6.3. Compute a lack-of-fit test for the linear regression model, and summarize results.

5.6.4. Is the fitted regression consistent with the known standard weight for a new coin?

5.6.5. For previously unsampled coins of Age = 1, 2, 3, 4, 5, estimate the probability that the weight of the coin is less than the legal minimum. (Hints: The standard error of prediction is a sum of two terms, the known variance of an unsampled coin of known Age and the estimated variance of the fitted value for that Age. You should use the normal distribution rather than a t to get the probabilities.)

5.7. The data file physics1.txt gives the results of the experiment described in Section 5.1.1, except in this case the input is the π⁻ meson as before, but the output is the π⁺ meson.
Analyze these data following the analysis done in the text, and summarize your results.
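Several of the problems above (5.5.4, 5.6.2) call for weighted least squares. A minimal sketch, assuming numpy, using the standard identity that WLS is OLS after each row is scaled by the square root of its weight:

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: minimize sum_i w_i (y_i - x_i' beta)^2.
    Implemented as OLS on the rows of X and y scaled by sqrt(w_i)."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    Xw = X * sw[:, None]
    yw = y * sw
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta
```

For Problem 5.6.2 the weights would be n/SD² for each age class; for 5.5.4 they would be 1/Obs1.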

CHAPTER 6

Polynomials and Factors

6.1 POLYNOMIAL REGRESSION

If a mean function with one predictor X is smooth but not straight, integer powers of the predictor can be used to approximate E(Y|X). The simplest example of this is quadratic regression, in which the mean function is

E(Y|X = x) = β₀ + β₁x + β₂x²   (6.1)

Depending on the signs of the βs, a quadratic mean function can look like either of the curves shown in Figure 6.1. Quadratic mean functions can therefore be used when the mean is expected to have a minimum or maximum in the range of the predictor. The minimum or maximum will occur for the value of X for which the derivative dE(Y|X = x)/dx = 0, which occurs at

x_M = −β₁/(2β₂)   (6.2)

x_M is estimated by substituting estimates for the βs into (6.2). Quadratics can also be used when the mean function is curved but does not have a minimum or maximum within the range of the predictor. Referring to Figure 6.1a, if the range of X is between the dashed lines, then the mean function is everywhere increasing but not linear, while in Figure 6.1b it is decreasing but not linear.

Quadratic regression is an important special case of polynomial regression. With one predictor, the polynomial mean function of degree d is

E(Y|X) = β₀ + β₁X + β₂X² + ··· + β_d X^d   (6.3)

If d = 2, the model is quadratic, d = 3 is cubic, and so on. Any smooth function can be estimated by a polynomial of high-enough degree, and polynomial mean functions are generally used as approximations and rarely represent a physical model.

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

FIG. 6.1 Generic quadratic curves. A quadratic is the simplest curve that can approximate a mean function with a minimum or maximum within the range of possible values of the predictor. It can also be used to approximate some nonlinear functions without a minimum or maximum in the range of interest, possibly using the part of the curve between the dashed lines.

The mean function (6.3) can be fit via OLS with p = d + 1 terms given by an intercept and X, X², ..., X^d. Any regression program can be used for fitting polynomials, but if d is larger than three, serious numerical problems may arise with some computer packages, and direct fitting of (6.3) can be unreliable. Some numerical accuracy can be retained by centering, using terms such as Z_k = (X − x̄)^k, k = 1, ..., d. Better methods using orthogonal polynomials are surveyed by Seber (1977, Chapter 8).

An example of quadratic regression has already been given with the physics data in Section 5.1.1. From Figure 5.1, page 102, a maximum value of the mean function does not occur within the range of the data, and we are in a situation like Figure 6.1b with the range of x between the dashed lines. The approximating mean function may be very accurate within the range of X observed in the data, but it may be very poor outside this range; see the Problems for this chapter. In the physics example, a test for lack of fit that uses extra information about variances indicated that a straight-line mean function was not adequate for the data, while the test for lack of fit after fitting the quadratic model indicated that this model was adequate. When a test for lack of fit is not available, comparison of the quadratic model

E(Y|X) = β₀ + β₁X + β₂X²

to the simple linear regression model

E(Y|X) = β₀ + β₁X

is usually based on a t-test of β₂ = 0. A simple strategy for choosing d is to continue adding terms to the mean function until the t-test for the highest-order term is nonsignificant.
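The numerical benefit of centering mentioned above can be seen directly. A small sketch (numpy assumed; the 1000–1010 predictor range is an arbitrary illustration of a variable that is large relative to its spread):

```python
import numpy as np

def poly_terms(x, d, center=True):
    """Design matrix with an intercept column and (x - xbar)^k, k = 1..d.
    Centering greatly reduces the near-collinearity of the power terms."""
    z = x - x.mean() if center else x
    return np.column_stack([z ** k for k in range(d + 1)])

x = np.linspace(1000.0, 1010.0, 50)
raw = poly_terms(x, 3, center=False)   # columns 1, x, x^2, x^3
cen = poly_terms(x, 3, center=True)    # columns 1, z, z^2, z^3
# The condition number of the design matrix is far smaller after centering,
# so the normal equations are solved far more accurately.
```

Comparing `np.linalg.cond(raw)` with `np.linalg.cond(cen)` shows the improvement by many orders of magnitude.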
An elimination scheme can also be used, in which a maximum value of d is fixed, and terms are deleted from the mean function one at a time, starting with the highest-order term, until the highest-order remaining term has a

significant t-value. Kennedy and Bancroft (1971) suggest using a significance level of about 0.10 for this procedure.

In most applications of polynomial regression, only d = 1 or d = 2 are considered. For larger values of d, the fitted polynomial curves become wiggly, providing an increasingly better fit by matching the variation in the observed data more and more closely. The curve is then modeling the random variation rather than the overall shape of the relationship between the variables.

6.1.1 Polynomials with Several Predictors

With more than one predictor, we can contemplate having integer powers and products of all the predictors as terms in the mean function. For example, for the important special case of two predictors, the second-order mean function is given by

E(Y|X₁ = x₁, X₂ = x₂) = β₀ + β₁x₁ + β₂x₂ + β₁₁x₁² + β₂₂x₂² + β₁₂x₁x₂   (6.4)

The new term in (6.4) is the multiplicative term x₁x₂, called an interaction. With k predictors, the second-order model includes an intercept, k linear terms, k quadratic terms, and k(k − 1)/2 interaction terms. If k = 5, the second-order mean function has 21 terms, and with k = 10, it has 66 terms. A usual strategy is to view the second-order model as consisting of too many terms, and to use testing or other selection strategies, such as those to be outlined later, to delete unneeded quadratic and interaction terms. We will provide an alternative approach in Section 6.4. The most important new feature of the second-order model is the interaction. Return to the k = 2 predictor mean function (6.4).
If x₁ is changed to x₁ + δ, then the value of the mean function is

E(Y|X₁ = x₁ + δ, X₂ = x₂) = β₀ + β₁(x₁ + δ) + β₂x₂ + β₁₁(x₁ + δ)² + β₂₂x₂² + β₁₂(x₁ + δ)x₂   (6.5)

The change in the expected response is the difference between (6.5) and (6.4),

E(Y|X₁ = x₁ + δ, X₂ = x₂) − E(Y|X₁ = x₁, X₂ = x₂) = (β₁₁δ² + β₁δ) + 2β₁₁δx₁ + β₁₂δx₂   (6.6)

If β₁₂ = 0, the expected change is the same for every value of x₂. If β₁₂ ≠ 0, then β₁₂δx₂ will be different for each value of x₂, and so the effect of changing x₁ will depend on the value of x₂. Without the interaction, the effect of changing one predictor is the same for every value of the other predictor.

Cakes

Oehlert (2000, Example 19.3) provides data from a small experiment on baking packaged cake mixes. Two factors, X₁ = baking time in minutes and X₂ = baking temperature in degrees F, were varied in the experiment. The response Y was the average palatability score of four cakes baked at a given combination of (X₁, X₂), with higher values desirable. Figure 6.2 is a graphical representation of the experimental design, from which we see that the center point at (35, 350) was replicated

FIG. 6.2 Central composite design for the cake example. The center points have been slightly jittered to avoid overprinting. (Axes: X₂ versus X₁.)

six times. Replication allows for estimation of pure error and tests for lack of fit. The experiment consisted of n = 14 runs. The estimated mean function based on (6.4) and using the data in the file cakes.txt has estimated coefficients β̂₀, β̂₁, β̂₂, β̂₁₁, β̂₂₂, and β̂₁₂ in

Ê(Y|X₁, X₂) = β̂₀ + β̂₁X₁ + β̂₂X₂ + β̂₁₁X₁² + β̂₂₂X₂² + β̂₁₂X₁X₂   (6.7)

Each of the coefficient estimates, including both quadratics and the interaction, has a small significance level, so all terms are useful in the mean function (see Problem 6.1). Since each of X₁ and X₂ appears in three of the terms in (6.7), interpreting this mean function is virtually impossible without the aid of graphs.

Figure 6.3 presents a useful way of summarizing the fitted mean function. In Figure 6.3a, the horizontal axis is the baking time X₁, and the vertical axis is the response Y. The three curves shown on the graph are obtained by fixing the value of temperature X₂ at either 340, 350, or 360, and substituting into (6.7). For example, when X₂ = 350, substitute 350 for X₂ in (6.7), and simplify to get

Ê(Y|X₁, X₂ = 350) = β̂₀ + β̂₂(350) + β̂₂₂(350)² + β̂₁X₁ + β̂₁₂(350)X₁ + β̂₁₁X₁²   (6.8)

which is a quadratic in X₁ alone.
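As a numeric check on the interaction algebra of (6.5)–(6.6), a small sketch (numpy assumed; the coefficient values are arbitrary):

```python
import numpy as np

def Ey(b, x1, x2):
    """Second-order mean function (6.4); b = (b0, b1, b2, b11, b22, b12)."""
    b0, b1, b2, b11, b22, b12 = b
    return b0 + b1*x1 + b2*x2 + b11*x1**2 + b22*x2**2 + b12*x1*x2

def change(b, x1, x2, d):
    """Expected change when x1 -> x1 + d, as given by (6.6)."""
    b0, b1, b2, b11, b22, b12 = b
    return (b11*d**2 + b1*d) + 2*b11*d*x1 + b12*d*x2
```

With b12 = 0 the change is the same at every value of x2; otherwise it depends on x2, which is the defining property of an interaction.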

FIG. 6.3 Estimated response curves for the cakes data, based on (6.7). (Panel (a): Y versus X₁ with curves for X₂ = 340, 350, 360; panel (b): Y versus X₂ with curves for several baking times X₁.)

Each of the lines within a plot is a quadratic curve, because both the X₁² and X₂² terms are in the mean function. Each of the curves has a somewhat different shape. For example, in Figure 6.3a, the baking time X₁ that maximizes the response is lower at X₂ = 360 degrees than it is at X₂ = 340 degrees. Similarly, we see from Figure 6.3b that the response curves are about the same for baking times of 35 or 37 minutes, but the response is lower at the shorter baking time. The palatability score is perhaps surprisingly sensitive to changes in temperature of 10 or 15 degrees and baking times of just a few minutes.

If we had fit the mean function (6.4), but with β₁₂ = 0 so the interaction is absent, we would get the fitted response curves shown in Figure 6.4.

FIG. 6.4 Estimated response curves for the cakes data, based on fitting with β₁₂ = 0. (Panel (a): curves for X₂ = 340, 350, 360; panel (b): curves for X₁ = 33 and 35, among others.)

Without the

interaction, all the curves within a plot have the same shape, and all are maximized at the same point. Without the interaction, we could say, for example, that for all temperatures the response is maximized for a baking time of around 36 min, and for all baking times, the response is maximized for a temperature around 355 degrees. While this mean function is simpler, F-testing would show that it does not adequately match the data, and so (6.8) and Figure 6.3 give appropriate summaries for these data.

6.1.2 Using the Delta Method to Estimate a Minimum or a Maximum

We have seen at (6.2) that the value of the predictor that will maximize or minimize a quadratic, depending on the signs of the βs, is x_M = −β₁/(2β₂). This is a nonlinear combination of the βs, and so its estimate, x̂_M = −β̂₁/(2β̂₂), is a nonlinear combination of estimates. The delta method provides an approximate standard error of a nonlinear combination of estimates that is accurate in large samples. The derivation of the delta method, and possibly its use, requires elementary calculus. We will use different notation for this derivation to emphasize that the results are much more general than just for ratios of coefficient estimates in multiple linear regression.

Let θ be a k × 1 parameter vector, with estimator θ̂ such that

θ̂ ~ N(θ*, σ²D)   (6.9)

where D is a known, positive definite matrix. Equation (6.9) can be exact, as it is for the multiple linear regression model with normal errors, or asymptotically valid, as in nonlinear or generalized linear models. In some problems, σ² may be known, but in the multiple linear regression problem it is usually unknown and also estimated from data.

Suppose g(θ) is a nonlinear continuous function of θ that we would like to estimate, and suppose that θ* is the true value of θ. To approximate g(θ̂), we can use a Taylor series expansion (see Section 11.1) about g(θ*),

g(θ̂) = g(θ*) + Σᵏⱼ₌₁ (∂g/∂θⱼ)(θ̂ⱼ − θ*ⱼ) + small terms
     ≈ g(θ*) + ġ(θ*)′(θ̂ − θ*)   (6.10)

where we have defined

ġ(θ*) = ∂g/∂θ = (∂g/∂θ₁, ..., ∂g/∂θₖ)′

evaluated at θ*.
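A concrete numeric sketch of this machinery for the quadratic-regression case, where g(β) = −β₁/(2β₂) and Var(β̂) = σ²(X′X)⁻¹ (numpy assumed; the gradient is computed by hand rather than symbolically):

```python
import numpy as np

def quad_fit_xm_se(x, y):
    """Fit E(Y|X=x) = b0 + b1*x + b2*x^2 by OLS, then return the estimated
    minimizer/maximizer xM = -b1/(2*b2) and its delta-method standard error,
    using gdot' Var(beta_hat) gdot with Var(beta_hat) = sigma^2 (X'X)^{-1}."""
    X = np.column_stack([np.ones_like(x), x, x ** 2])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(x) - 3)
    V = sigma2 * XtX_inv                     # estimated Var(beta_hat)
    b1, b2 = beta[1], beta[2]
    xm = -b1 / (2 * b2)
    gdot = np.array([0.0, -1 / (2 * b2), b1 / (2 * b2 ** 2)])  # gradient of g
    return xm, np.sqrt(gdot @ V @ gdot)
```

The gradient vector here is exactly the one derived below for g(β) = −β₁/(2β₂): zero for the intercept, −1/(2β̂₂) for β₁, and β̂₁/(2β̂₂²) for β₂.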
The vector ġ has dimension k × 1. We have expressed in (6.10) our estimate g(θ̂) as approximately a constant g(θ*) plus a linear combination of

the data. The variance of a constant is zero, as is the covariance between a constant and a function of the data. We can therefore approximate the variance of g(θ̂) by

Var(g(θ̂)) ≈ Var(g(θ*)) + Var[ġ(θ*)′(θ̂ − θ*)]
          = ġ(θ*)′ Var(θ̂) ġ(θ*)
          = σ² ġ(θ*)′ D ġ(θ*)   (6.11)

This equation is the heart of the delta method, so we will write it out again as a scalar equation. Let ġᵢ be the i-th element of ġ(θ̂), so ġᵢ is the partial derivative of g(θ) with respect to θᵢ, and let dᵢⱼ be the (i, j)-element of the matrix D. Then the estimated variance of g(θ̂) is

Var(g(θ̂)) = σ² Σᵏᵢ₌₁ Σᵏⱼ₌₁ ġᵢ ġⱼ dᵢⱼ   (6.12)

In practice, all derivatives are evaluated at θ̂, and σ² is replaced by its estimate. In large samples and under regularity conditions, g(θ̂) will be normally distributed with mean g(θ*) and variance (6.11). In small samples, the normal approximation may be poor, and inference based on the bootstrap, Problem 6.16, might be preferable.

For quadratic regression (6.1), the minimum or maximum occurs at g(β) = −β₁/(2β₂), which is estimated by g(β̂). To apply the delta method, we need the partial derivatives, evaluated at β̂,

∂g/∂β = (0, −1/(2β̂₂), β̂₁/(2β̂₂²))′

Using (6.12), straightforward calculation gives

Var(g(β̂)) = [1/(4β̂₂²)] [Var(β̂₁) + (β̂₁²/β̂₂²) Var(β̂₂) − (2β̂₁/β̂₂) Cov(β̂₁, β̂₂)]   (6.13)

The variances and covariances in (6.13) are elements of the matrix σ²(X′X)⁻¹, and so the estimated variance is obtained from σ̂²D = σ̂²(X′X)⁻¹.

As a modestly more complicated example, the estimated mean function for palatability for the cake data when the temperature is 350 degrees is given by (6.8). The estimated maximum palatability occurs when the baking time is

x̂_M = −[β̂₁ + β̂₁₂(350)] / (2β̂₁₁) = 36.2 min

which depends on the estimate β̂₁₂ for the interaction as well as on the linear and quadratic terms for X₁. The standard error from the delta method can be computed to be 0.4 minutes. If we can believe the normal approximation, a 95% confidence interval for x_M is 36.2 ± 1.96(0.4), or about 35.4 to 37.0 min.

Writing a function for computing the delta method is not particularly hard using a language such as Maple, Mathematica, MatLab, R or S-plus that can do symbolic differentiation to get ġ. If your package will not do the differentiation for you, then you can still compute the derivatives by hand and use (6.12) to get the estimated standard error. The estimated variance matrix σ̂²(X′X)⁻¹ is computed by all standard regression programs, although getting access to it may not be easy in all programs.

6.1.3 Fractional Polynomials

Most problems that use polynomials use only integer powers of the predictors as terms. Royston and Altman (1994) considered using fractional powers of predictors in addition to integer powers. This provides a wider class of mean functions that can be approximated using only a few terms, and gives results similar to the results we will get in choosing a transformation in Chapter 7.

6.2 FACTORS

Factors allow the inclusion of qualitative or categorical predictors in the mean function of a multiple linear regression model. Factors can have two levels, such as male or female, treated or untreated, and so on, or they can have many levels, such as eye color, location, or many others. To include factors in a multiple linear regression mean function, we need a way to indicate which particular level of the factor is present for each case in the data. For a factor with two levels, a dummy variable, which is a term that takes the value 1 for one of the categories and 0 for the other category, can be used. Assignment of labels to the values is generally arbitrary, and will not change the outcome of the analysis. Dummy variables can alternatively be defined with a different set of values, perhaps 1 and −1, or possibly 1 and 2.
The only important point is that the term has only two distinct values. As an example, we return to the sleep data described in Section 4.5. This is an observational study of the sleeping patterns of 62 mammal species. One of the response variables in the study is TS, the total hours of sleep per day. Consider here as an initial predictor the variable D, which is a categorical index measuring the overall danger of a species. D has five categories, with D = 1 indicating species facing the least danger from other animals, to D = 5 for species facing the most danger. Category labels here are the numbers 1, 2, 3, 4, and 5, but D is not a measured variable. We could have just as easily used names such as lowest, low, middle, high, and highest for these five category names. The data are in the file sleep1.txt. TS was not given for three of the species, so this analysis is based on the 59 species for which data are provided.

6.2.1 No Other Predictors

We begin this discussion by asking how the mean of TS varies as D changes from category to category. We would like to be able to write down a mean function that allows each level of D to have its own mean, and we do that with a set of dummy variables. Since D has five levels, the j-th dummy variable U_j for the factor, j = 1, ..., 5, has i-th value u_ij, for i = 1, ..., n, given by

u_ij = 1 if D_i = jth category of D, and u_ij = 0 otherwise   (6.14)

If the factor had three levels rather than five, and the sample size n = 7 with cases 1, 2, and 7 at the first level of the factor, cases 4 and 5 at the second level, and cases 3 and 6 at the third level, then the three dummy variables would be

U₁ = (1, 1, 0, 0, 0, 0, 1)′
U₂ = (0, 0, 0, 1, 1, 0, 0)′
U₃ = (0, 0, 1, 0, 0, 1, 0)′

If these three dummy variables are added together, we get a column of ones. This is an important characteristic of a set of dummy variables for a factor: their sum always adds up to the same value, usually one, for each case, because each case has one and only one level of the factor. Returning to the sleep data, we can write the mean function as

E(TS|D) = β₁U₁ + β₂U₂ + β₃U₃ + β₄U₄ + β₅U₅   (6.15)

and we can interpret β_j as the population mean for all species with danger index equal to j. Mean function (6.15) does not appear to include an intercept, but since the sum of the U_j is just a column of ones, the intercept is implicit in (6.15). Since it is usual to have an intercept included explicitly, we can leave out one of the dummy variables, leading to the factor rule:

The factor rule A factor with d levels can be represented by at most d dummy variables. If the intercept is in the mean function, at most d − 1 of the dummy variables can be used in the mean function.

One common choice is to delete the first dummy variable, and get the mean function

E(TS|D) = η₀ + η₂U₂ + η₃U₃ + η₄U₄ + η₅U₅   (6.16)

where we have changed the names of the parameters because they now have different meanings. The means for the five groups are now η₀ + η_j for levels

j = 2, 3, 4, 5 of D, and η₀ for D = 1. Although the parameters have different meanings in (6.15) and (6.16), both are fitting a separate mean for each level of D, and so both are really the same mean function. The mean function (6.16) is the usual mean function for one-way analysis of variance models, which simply fits a separate mean for each level of the classification factor.

Most computer programs allow the user to use a factor¹ in a mean function without actually computing the dummy variables. For example, in the packages S-plus and R, D would first be declared to be a factor, and then the mean function (6.16) would be specified by

TS ~ 1 + D   (6.17)

where the 1 specifies fitting the intercept, and the D specifies fitting the terms that are created for the factor D. As is common in the specification of mean functions in linear regression computer programs, (6.17) specifies the terms in the mean function but not the parameters. This will work for linear models because each term has one corresponding parameter. Since most mean functions include an intercept, the specification TS ~ D is equivalent to (6.17)²; to fit (6.15) without an explicit intercept, use

TS ~ D - 1

Sets of dummy variables are not the only way to convert a factor into a set of terms, and each computer package can have its own rules for getting the terms that will represent the factor, so it is important to know what your package is doing if you need to interpret and use coefficient estimates. Some packages, like R and S-plus, allow the user to choose the way that a factor will be represented.

Figure 6.5 provides a scatterplot of TS versus D for the sleep data. Some comments about this plot are in order. First, D is a categorical variable, but since the categories are ordered, it is reasonable to plot them in the order from one to five. However, the spacings on the horizontal axis between the categories are arbitrary.
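The dummy-variable construction (6.14), and the three-level, seven-case example above, can be sketched directly (numpy assumed; R and S-plus do this automatically for declared factors):

```python
import numpy as np

def dummies(d_values, levels=None):
    """Full set of dummy variables for a factor: one 0/1 column per level,
    as in (6.14). The columns sum to one for every case."""
    d = np.asarray(d_values)
    if levels is None:
        levels = sorted(set(d.tolist()))
    return np.column_stack([(d == lev).astype(int) for lev in levels])

# The 7-case, 3-level example: cases 1, 2, 7 at level 1;
# cases 4, 5 at level 2; cases 3, 6 at level 3.
D = [1, 1, 3, 2, 2, 3, 1]
U = dummies(D)
```

Dropping the first column of `U` and adding an intercept gives the parameterization of (6.16).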
Figure 6.5 appears to have an approximately linear mean function, but we are not using this discovery in the one-way analysis of variance model (but see Problem 6.3). If D had unordered categories, the graph could be drawn with the categories on the horizontal axis arranged so that the average response within group is increasing from left to right. Also from Figure 6.5, variability seems to be more or less the same for each group, suggesting that fitting with constant variance is appropriate.

Table 6.1 summarizes the fit of the one-way analysis of variance, first using (6.15), then using (6.16). In Table 6.1a, the coefficient for each U_j is the corresponding estimated mean for level j of D, and the t-value is for testing the

¹A factor is called a class variable in SAS.
²In SAS, the equivalent model specification would be TS=D.

FIG. 6.5 Total sleep versus danger index for the sleep data. (Axes: Total sleep (h) versus Danger index.)

TABLE 6.1 One-Way Mean Function for the Sleep Data Using Two Parameterizations
(a) Mean function (6.15): Estimate, Std. Error, t-value, and Pr(>|t|) for each of U₁, ..., U₅, followed by an analysis of variance with rows for D and Residuals giving df, Sum Sq, Mean Sq, F-value, and Pr(>F).
(b) Mean function (6.16): the same quantities for the Intercept and U₂, ..., U₅, followed by the corresponding analysis of variance.

hypothesis that the mean for level j is zero versus the alternative that the mean is not zero. In Table 6.1b, the estimate for the intercept is the mean for level one of D, and the other estimates are the differences between the mean for level one and the mean for the jth level. Similarly, the t-test that the coefficient for U_j is zero for j > 1 is really testing the difference between the mean for the jth level of D and the first level of D.

There are also differences in the analysis of variance tables. The analysis of variance in Table 6.1a corresponding to (6.15) is testing the null hypothesis that all the βs equal zero, or that E(TS|D) = 0, against the alternative (6.15). This is not the usual hypothesis that one wishes to test using the overall analysis of variance. The analysis of variance in Table 6.1b is the usual table, with null hypothesis E(TS|D) = β₀ versus (6.16). In summary, both the analyst and a computer package have considerable flexibility in the way that dummy variables for a factor are defined. Different choices have both advantages and disadvantages, and the analyst should be aware of the choice made by a particular computer program.

6.2.2 Adding a Predictor: Comparing Regression Lines

To the sleep data, suppose we add log(BodyWt), the base-two logarithm of the species' average body weight, as a predictor. We now have two predictors, the danger index D, a factor with five levels, and log(BodyWt). We assume for now that, for a fixed value of D,

E(TS|log(BodyWt) = x, D = j) = β₀ⱼ + β₁ⱼx   (6.18)

We distinguish four different situations:

Model 1: Most general Every level of D has a different slope and intercept, corresponding to Figure 6.6a. We can write this most general mean function in several ways. If we include a dummy variable for each level of D, we can write

E(TS|log(BodyWt) = x, D = j) = Σᵈⱼ₌₁ (β₀ⱼUⱼ + β₁ⱼUⱼx)   (6.19)

Mean function (6.19) has 2d terms: the d dummy variables for d intercepts, and d interactions formed by multiplying each dummy variable by the continuous predictor for d slope parameters.
In the computer packages R and S-plus, if D has been declared to be a factor, this mean function can be fit by the statement

TS ~ -1 + D + D:log(BodyWt)

FIG. 6.6 Four models for the regression of TS on log₂(BodyWt) with five groups determined by D: (a) general, (b) parallel, (c) common intercept, (d) common regression.

where the -1 explicitly deletes the intercept, the term D fits a separate intercept for each level of D, and the term D:log(BodyWt) specifies interactions between each of the dummy variables for D and log(BodyWt)³. Using a different letter for the parameters, this mean function can also be written as

E(TS|log(BodyWt) = x, D = j) = η₀ + η₁x + Σᵈⱼ₌₂ (η₀ⱼUⱼ + η₁ⱼUⱼx)   (6.20)

Comparing the two parameterizations, we have η₀ = β₀₁, η₁ = β₁₁, and, for j > 1, η₀ⱼ = β₀ⱼ − β₀₁ and η₁ⱼ = β₁ⱼ − β₁₁. The parameterization (6.19) is more convenient for getting interpretable parameters, while (6.20) is useful for comparing mean functions. Mean function (6.20) can be specified in R

³Other programs such as SAS may replace the ":" with a "*".

and S-plus by

log(TS) ~ log(BodyWt) + D + D:log(BodyWt)

where again the overall intercept is implicit in the mean function and need not be specified. In Figure 6.6a, the estimated mean functions for each level of D appear to be nearly parallel, so we should expect that a simpler mean function might be appropriate for these data.

Model 2: Parallel regressions For this mean function, all the within-group mean functions are parallel, as in Figure 6.6b, so β₁₁ = β₁₂ = ··· = β₁d in (6.19), or η₁₂ = ··· = η₁d = 0 in (6.20). Each level of D can have its own intercept. This mean function can be specified as

log(TS) ~ D + log(BodyWt)

The difference between levels of D is the same for every value of the continuous predictor because no dummy-variable-by-predictor interactions are included in the mean function. This mean function should only be used if it is in fact appropriate for the data. It is fit with terms for the intercept, log(BodyWt), and D, so the number of parameters estimated is d + 1.

We can see from Figure 6.6b that the fitted mean function for D = 5 has the smallest intercept, for D = 1 the intercept is largest, and for the three intermediate categories, the mean functions are very nearly the same. This might suggest that the three middle categories could be combined; see Problem 6.3.

Model 3: Common intercept In this mean function, the intercepts are all equal, β₀₁ = ··· = β₀d in (6.19) or η₀₂ = ··· = η₀d = 0 in (6.20), but the slopes are arbitrary, as illustrated in Figure 6.6c. This particular mean function is probably inappropriate for the sleep data, as it requires that the expected hours of TS for a species of 1-kg body weight, so log(BodyWt) = 0, is the same for all levels of danger, and this seems to be totally arbitrary. The mean function would change if we used different units, like grams or pounds. This mean function is fit with terms for the intercept, log(BodyWt), and the log(BodyWt) × D interaction, for a total of d + 1 parameters.
The R or S-plus specification of this mean function is

TS ~ 1 + D:log(BodyWt)

Model 4: Coincident regression lines Here, all the lines are the same, β₀₁ = ··· = β₀d and β₁₁ = ··· = β₁d in (6.19), or η₀₂ = ··· = η₀d = η₁₂ = ··· = η₁d = 0 in (6.20). This is the most stringent model, as illustrated in Figure 6.6d. This mean function requires only a term for the intercept and for log(BodyWt), for a total of 2 parameters, and is given by

TS ~ log(BodyWt)

TABLE 6.2 Residual Sum of Squares and df for the Four Mean Functions for the Sleep Data
(columns: df, RSS, F, P(>F); rows: Model 1, most general; Model 2, parallel; Model 3, common intercept; Model 4, all the same)

It is usually of interest to test the plausibility of Models 4 or 2 against a different, less stringent model as an alternative. The form of these tests comes from the formulation of the general F-test given in Section 5.4. Table 6.2 gives the RSS and df for each of the four models for the sleep data. Most tests concerning the slopes and intercepts of different regression lines will use the general mean function of Model 1 as the alternative model. The usual F-test for testing mean functions 2, 3, and 4 is then given, for l = 2, 3, 4, by

F_l = [(RSS_l − RSS₁)/(df_l − df₁)] / (RSS₁/df₁)   (6.21)

If the hypothesis provides as good a model as does the alternative, then F will be small. If the model is not adequate when compared with the general model, then F will be large when compared with the percentage points of the F(df_l − df₁, df₁) distribution. The F-values for comparison to the mean function for Model 1 are given in Table 6.2. Both the common-intercept mean function and the coincident mean function are clearly worse than Model 1, since their p-values are quite small. However, the p-value for the parallel regression model is very large, suggesting that the parallel regression model is appropriate for these data. The analysis is completed in Problem 6.3.

6.2.3 Additional Comments

Probably the most common problem in comparing groups is testing for parallel slopes in simple regression with two groups. Since this F-test has 1 df in the numerator, it is equivalent to a t-test. Let β̂ⱼ, σ̂ⱼ², nⱼ, and SXXⱼ be, respectively, the estimated slope, residual mean square, sample size, and corrected sum of squares for the fit of the mean function in group j, j = 1, 2. Then a pooled estimate of σ² is

σ̂² = [(n₁ − 2)σ̂₁² + (n₂ − 2)σ̂₂²] / (n₁ + n₂ − 4)   (6.22)

and the t-test for equality of slopes is

t = (β̂₁ − β̂₂) / [σ̂ (1/SXX₁ + 1/SXX₂)^(1/2)]   (6.23)

with n1 + n2 − 4 df. The square of this t-statistic is numerically identical to the corresponding F-statistic.

The model for common intercept, Model 3 of Section 6.2.2, can easily be extended to the case where the regression lines are assumed common at any fixed point, X = c. In the sleep data, suppose we wished to test for concurrence at c = 2, close to the average log body weight in the data. Simply replace log(BrainWt) by z = log(BrainWt) − 2 in all the models. Another generalization of this method is to allow the regression lines to be concurrent at some arbitrary and unknown point, so the point of equality must be estimated from the data. This turns out to be a nonlinear regression problem (Saw, 1966).

6.3 MANY FACTORS

Increasing the number of factors or the number of continuous predictors in a mean function can add considerably to complexity but does not really raise any new fundamental issues. Consider first a problem with many factors but no continuous predictors. The data in the file wool.txt are from a small experiment to understand the strength of wool as a function of three factors that were under the control of the experimenter (Box and Cox, 1964). The variables are summarized in Table 6.3. Each of the three factors was set to one of three levels, and all 3³ = 27 possible combinations of the three factors were used exactly once in the experiment, so we have a single replication of a 3³ design. The response variable log(Cycles) is the logarithm of the number of loading cycles to failure of worsted yarn. We will treat each of the three predictors as a factor with three levels. A main-effects mean function for these data would include only an intercept and two dummy variables for each of the three factors, for a total of seven parameters. A full second-order mean function would add all the two-factor interactions to the mean function; each two-factor interaction would require 2 × 2 = 4 dummy variables, so the second-order model will have 7 + 3 × 4 = 19 parameters.
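The dummy-variable counting above can be made concrete with a short sketch. This is an illustration in Python rather than R, and the level codes 0, 1, 2 are placeholders for the actual factor levels; only the shapes of the design matrices, not the wool responses, are reproduced.

```python
import itertools
import numpy as np

# The 27 runs of a single replicate of a 3x3x3 design
# (factor levels coded 0, 1, 2; responses omitted).
runs = np.array(list(itertools.product([0, 1, 2], repeat=3)))

def dummies(col):
    # Two dummy variables for a three-level factor, baseline level 0.
    return np.column_stack([(col == 1).astype(float), (col == 2).astype(float)])

intercept = np.ones((len(runs), 1))
mains = [dummies(runs[:, j]) for j in range(3)]
X_main = np.hstack([intercept] + mains)     # 1 + 3*2 = 7 parameters

def interact(a, b):
    # All 2*2 = 4 products of the dummies for a pair of factors.
    return np.hstack([a[:, [i]] * b[:, [j]] for i in range(2) for j in range(2)])

pairs = [(0, 1), (0, 2), (1, 2)]
X_second = np.hstack([X_main] + [interact(mains[a], mains[b]) for a, b in pairs])
# 7 + 3*4 = 19 parameters
```

The two matrices have 7 and 19 columns, matching the parameter counts in the text; adding the three-factor interaction would saturate the design.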
The third-order model includes the three-factor interaction with 2 × 2 × 2 = 8 dummy variables, for a total of 19 + 8 = 27 parameters. This latter mean function will fit the data exactly because it has as many parameters as data points. The R and S-plus specifications of these three mean functions are, assuming that Len, Amp and Load

TABLE 6.3 The Wool Data

Variable      Definition
Len           Length of test specimen (250, 300, 350 mm)
Amp           Amplitude of loading cycle (8, 9, 10 mm)
Load          Load put on the specimen (40, 45, 50 g)
log(Cycles)   Logarithm of the number of cycles until the specimen fails

have all been declared as factors,

log(Cycles) ~ Len + Amp + Load
log(Cycles) ~ Len + Amp + Load + Len:Amp + Len:Load + Amp:Load
log(Cycles) ~ Len + Amp + Load + Len:Amp + Len:Load + Amp:Load + Len:Amp:Load

Other mean functions can be obtained by dropping some of the two-factor interactions. Problems with many factors can be neatly handled using analysis of variance methods given, for example, by Oehlert (2000), and in many other books. The analysis of variance models are the same as multiple linear regression models, but the notation is a little different. Analysis of the wool data is continued in Problems 6.20 and … .

6.4 PARTIAL ONE-DIMENSIONAL MEAN FUNCTIONS

A problem with several continuous predictors and factors requires generalization of the four mean functions given in Section 6.2.2. For example, suppose we have two continuous predictors X1 and X2 and a single factor F. All of the following are generalizations of the parallel regression mean functions, using a generic response Y and the computer notation of showing the terms but not the parameters:

Y ~ 1 + F + X1
Y ~ 1 + F + X2
Y ~ 1 + F + X1 + X2
Y ~ 1 + F + X1 + X2 + X1:X2

These mean functions differ only with respect to the complexity of the dependence of Y on the continuous predictors. With more continuous predictors, interpreting mean functions such as these can be difficult. In particular, how to summarize these fitted functions using a graph is not obvious. Cook and Weisberg (2004) have provided a different way to look at problems such as this one that can be very useful in practice and can be easily summarized graphically. In the basic setup, suppose we have terms X = (X1, ..., Xp) created from the continuous terms and a factor F with d levels. We suppose that:

1. The mean function depends on the X-terms through a single linear combination, so if X = x, the linear combination has the value x′β for some unknown β. A term for the intercept is not included in X.

2. For an observation at level j of the factor F, the mean function is

E(Y | X = x, F = j) = η0j + η1j(x′β)   (6.24)

This is equivalent to the most general Model 1 given previously, since each level of the factor has its own intercept and slope. We can then summarize the regression problem with a graph like one of the frames in Figure 6.6, with x′β, or an estimate of it, on the horizontal axis. The generalization of the parallel mean functions is obtained by setting all the η1j equal, while the generalization of the common-intercepts mean function sets all the η0j to be equal. There is an immediate complication: the mean function (6.24) is not a linear mean function because the unknown parameter η1j multiplies the unknown parameter β, and so the parameters cannot be fit in the usual way using linear least squares software. Even so, estimating parameters is not particularly hard. In Problem 6.21, we suggest a simple computer program that can be written to use standard linear regression software to estimate parameters, and in Problem 11.6, we show how the parameters can be estimated using a nonlinear least squares program.

Australian Athletes

As an example, we will use data provided by Richard Telford and Ross Cunningham, collected on a sample of 202 elite athletes who were in training at the Australian Institute of Sport. The data are in the file ais.txt. For this example, we are interested in the conditional distribution of the variable LBM, the lean body mass, given three terms: Ht, height in cm; Wt, weight in kg; and RCC, the red cell count, separately for each sex. The data are displayed in Figure 6.7, with one plotting symbol for males and another for females. With the exception of RCC, the variables are all approximately linearly related; RCC appears to be at best weakly related to the others.
We begin by computing an F-test to compare the null hypothesis given by

E(LBM | Sex, Ht, Wt, RCC) = β0 + β1Sex + β2Ht + β3Wt + β4RCC

to the alternative mean function

E(LBM | Sex, Ht, Wt, RCC) = β0 + β1Sex + β2Ht + β3Wt + β4RCC + β12(Sex × Ht) + β13(Sex × Wt) + β14(Sex × RCC)   (6.25)

that has a separate mean function for each of the two sexes. For the first of these two, we find RSS = … with 198 df. For the second mean function, RSS = … with 194 df. The value of the test statistic is F = … with (4, 194) df, for a corresponding p-value that is zero to three decimal places. We have strong evidence that the smaller of these mean functions is inadequate.
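The F-statistic used here is the general comparison of (6.21), and it can be sketched in a few lines. This is an illustration in Python rather than R, and the RSS values passed in are invented placeholders, not the values from the athlete fits.

```python
def general_f_test(rss_nh, df_nh, rss_ah, df_ah):
    # F-statistic (6.21): compare a restricted (null) mean function
    # against a more general alternative with smaller RSS.
    return ((rss_nh - rss_ah) / (df_nh - df_ah)) / (rss_ah / df_ah)

# Placeholder RSS values, for illustration only.
f = general_f_test(rss_nh=1400.0, df_nh=198, rss_ah=1100.0, df_ah=194)
```

The resulting statistic is compared with the percentage points of the F(df_nh − df_ah, df_ah) distribution, here F(4, 194).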

FIG. 6.7 Scatterplot matrix of LBM, Ht, Wt, and RCC for the Australian athletes data, with separate symbols for males and females.

Interpretation of the larger mean function is difficult because we cannot draw simple graphs to summarize the results. The partial one-dimensional (POD) mean function for these data is given by

E(LBM | Sex, Ht, Wt, RCC) = β0 + β1Sex + β2Ht + β3Wt + β4RCC + η0Sex + η1Sex × (β2Ht + β3Wt + β4RCC)   (6.26)

which is a modest reparameterization of (6.24). We can fit the mean function (6.26) using either the algorithm outlined in Problem 6.21 or the nonlinear least squares method outlined in Problem 11.6. The residual sum of squares is … with 196 df. The F-test for comparing this mean function to (6.25) has value F = 0.63 with (2, 194) df, with a p-value of about … . The conclusion is that the POD mean function matches the data as well as the more complicated (6.25). The major advantage of the POD mean function is that we can draw the summarizing graph given in Figure 6.8. The horizontal axis in the graph is the single linear combination of the predictors β̂2Ht + β̂3Wt + β̂4RCC from the fit of (6.26). The vertical axis is the response LBM, once again with separate symbols for males and

FIG. 6.8 Summary graph for the POD mean function for the Australian athletes data: LBM, grouped by Sex, versus the linear predictor from the POD mean function.

females. The two lines shown on the graph are the fitted values, the solid line for males and the dashed line for females. We get the interesting summary that LBM depends on the same linear combination of the terms for each of the two sexes, but the fitted regression has a larger slope for males than for females.

6.5 RANDOM COEFFICIENT MODELS

We conclude this chapter with a brief introduction to problems in which the methodology of this chapter seems appropriate, but for which different methodology is to be preferred. Little is known about wetland contamination by road salt, primarily NaCl. An exploratory study examined chloride concentrations in five roadside marshes and four marshes isolated from roads, to evaluate potential differences in chloride concentration between marshes receiving road runoff and those isolated from road runoff, and to explore trends in chloride concentrations across an agricultural growing season, from about April to October. The data in the file chloride.txt, provided by Stefanie Miklovic and Susan Galatowitsch, summarize the results. Repeated measurements of chloride level were taken in April, June, August, and October of 2001. Two of the marshes were dry by August 2001, so only April and June measurements were taken on those two. The variables in the file are Cl, the measured chloride level in mg/liter; Month, the month of measurement, with April as 4, June as 6, and so on; Marsh, the marsh number; and Type, either isolated or roadside. Following the methodology of this chapter, we might contemplate fitting multiple linear regression models with a separate intercept and slope for each level of

Type, following the outline in Section 6.2.2. This mean function ignores the possibility of each marsh having a different intercept and slope. Were we to include a factor Marsh in the mean function, we would end up fitting up to 18 parameters, but the data include only 32 observations. Furthermore, fitting a separate regression for each marsh does not directly answer the questions of interest, which average over marshes. The data are shown in Figure 6.9. A separate graph is given for the two types, and the points within a wetland are joined in each graph. While it is clear that the overall chloride level is different in the two types of wetlands, the lines within a graph do not tell a coherent story; we cannot tell if there is a dependence on Month, or if the dependence is the same for the two types. To examine data such as these, we use a random coefficients model, which in this problem assumes that the marshes within a type are a random sample of marshes that could have been studied. Using generic notation, suppose that y_ijk is the value of the response for the jth marsh of type i at time x_k. The random coefficients model specifies that

y_ijk = β0i + β1i x_k + b0ij + b1ij x_k + e_ijk   (6.27)

As in other models, the βs are fixed, unknown parameters that specify a separate linear regression for each of the two types, β01 + β11 x_k for Type = isolated, and β02 + β12 x_k for Type = roadside. The errors e_ijk will be taken to be independent and identically distributed with variance σ². Mean function (6.27) is different from other mean functions we have seen because of the inclusion of the bs. We assume

FIG. 6.9 The chloride concentration data: Cl (mg/liter) versus month number, in separate panels for isolated and roadside marshes.

that the bs are random variables, independent of the e_ijk, and that

(b0ij, b1ij)′ ~ N((0, 0)′, T)

where the covariance matrix T has diagonal elements τ0² and τ1² and off-diagonal element τ01. According to this model, a particular marsh has intercept β0i + b0ij and slope β1i + b1ij, while the average intercept and slope for type i are β0i and β1i, respectively. The two variances τ0² and τ1² model the variation in intercepts and slopes between marshes within a type, and τ01 allows the bs to be correlated within a marsh. Rather than estimate all the bs, we will instead estimate the τs that characterize the variability in the bs. One of the effects of a random coefficient model is that the y_ijk are no longer independent, as they are in the linear regression models we have considered so far. Using Appendix A.2.2:

Cov(y_ijk, y_i′j′k′) =
  0                                          if i ≠ i′
  0                                          if i = i′, j ≠ j′
  τ0² + x_k x_k′ τ1² + (x_k + x_k′)τ01       if i = i′, j = j′, k ≠ k′
  σ² + τ0² + x_k² τ1² + 2x_k τ01             if i = i′, j = j′, k = k′   (6.28)

The important point here is that repeated observations on the same marsh are correlated, while observations on different marshes are not correlated. If we consider the simpler random intercepts model given by

y_ijk = β0i + β1i x_k + b0ij + e_ijk   (6.29)

we will have b1ij = τ1² = τ01 = 0, and the correlation between observations in the same marsh will be τ0²/(σ² + τ0²), which is known as the intra-class correlation. Since this is always positive, the variation within one marsh will always be smaller than the variation between marshes, as often makes very good sense. Methods for fitting a model such as (6.27) are beyond the scope of this book. Software is widely available, however, and books by Pinheiro and Bates (2000), Littell, Milliken, Stroup, and Wolfinger (1996), and Verbeke and Molenberghs (2000) describe the methodology and the software.
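The covariance structure (6.28) and the intra-class correlation can be sketched directly. This is an illustrative Python sketch (the book's software references are R and SAS); the argument names and the numeric values used in any call are our own, with τ0², τ1², and σ² passed as plain variance arguments.

```python
def rc_cov(x1, x2, same_marsh, same_time, sigma2, tau0_sq, tau1_sq, tau01):
    # Covariance (6.28) between two observations of the same type under
    # the random coefficients model (6.27); x1, x2 are the two times.
    if not same_marsh:
        return 0.0
    cov = tau0_sq + x1 * x2 * tau1_sq + (x1 + x2) * tau01
    if same_time:
        cov += sigma2   # add the error variance on the diagonal
    return cov

def intraclass_corr(sigma2, tau0_sq):
    # Within-marsh correlation under the random intercepts model (6.29).
    return tau0_sq / (sigma2 + tau0_sq)
```

For example, with τ1² = τ01 = 0 the within-marsh covariance reduces to τ0², giving the intra-class correlation τ0²/(σ² + τ0²).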
Without going into detail, a generalization of the testing methods discussed in this book suggests that the random intercepts model (6.29) is appropriate for these data with β11 = β12, as there is no evidence either that the slope is different for the two types or that the slope varies from marsh to marsh. When we fit (6.29) using the software described by Pinheiro and Bates, we get the summaries shown in Table 6.4. The difference between the types is estimated to be between about 28 and 72 mg/liter, while the level of Cl appears to be increasing over the year by about 1.85 mg/liter per month. The estimates of the variances τ0² and σ² are shown on the standard deviation scale. Since these two are of comparable size, there is a substantial gain in precision in the analysis from accounting for, and removing, the between-marsh variation. The random coefficients models are but one instance of a general class of linear mixed models. These are described in Pinheiro and Bates (2000), Littell, Milliken,

TABLE 6.4 Approximate 95% Confidence Intervals for Parameters of the Random Coefficients Model Using R for the Chloride Data

Fixed effects:
                   lower    est.    upper
(Intercept)
Month
Type

Random effects:
                   lower    est.    upper
sd((Intercept))

Within-group standard error:

Stroup, and Wolfinger (1996), and Diggle, Heagerty, Liang, and Zeger (2002). This is a very rich and important class of models that allows fitting with a wide variety of correlation and mean structures.

PROBLEMS

6.1. Cake data The data for this example are in the data file cakes.txt.

6.1.1. Fit (6.4) and verify that the significance levels are all less than … .

6.1.2. Estimate the optimal (X1, X2) combination and find the standard errors of the estimated optimal values of X1 and X2.

6.1.3. The cake experiment was carried out in two blocks of seven observations each. It is possible that the response might differ by block. For example, if the blocks were different days, then differences in air temperature or humidity when the cakes were mixed might have some effect on Y. We can allow for block effects by adding a factor for Block to the mean function, and possibly allowing for Block-by-term interactions. Add block effects to the mean function fit in Section … and summarize results. The blocking is indicated by the variable Block in the data file.

6.2. The data in the file lathe1.txt are the results of an experiment on characterizing the life of a drill bit in cutting steel on a lathe. Two factors were varied in the experiment, Speed and Feed rate. The response is Life, the total time until the drill bit fails, in minutes. The values of Speed and Feed in the data have been coded by computing

Speed = (Actual speed in feet per minute − 900)/300
Feed = (Actual feed rate in thousandths of an inch per revolution − 13)/6

The coded variables are centered at zero. Coding has no material effect on the analysis but can be convenient in interpreting coefficient estimates.

6.2.1. Draw a scatterplot matrix of Speed, Feed, Life, and log(Life), the base-two logarithm of tool life. Add a little jittering to Speed and Feed to reveal overplotting. The plot of Speed versus Feed gives a picture of the experimental design, which is called a central composite design. It is useful when we are trying to find a value of the factors that maximizes or minimizes the response. Also, several of the experimental conditions were replicated, allowing for a pure-error estimate of variance and lack-of-fit testing. Comment on the scatterplot matrix.

6.2.2. For experiments in which the response is a time to failure or time to event, the response often needs to be transformed to a more useful scale, typically by taking the log of the response, or sometimes by taking the inverse. For this experiment, the log scale can be shown to be appropriate (Problem 9.7). Fit the full second-order mean function (6.4) to these data using log(Life) as the response, and summarize results.

6.2.3. Test for the necessity of the Speed × Feed interaction, and summarize your results. Draw appropriate summary graphs equivalent to Figure 6.3 or Figure 6.4, depending on the outcome of your test.

6.2.4. For Speed = 0.5, estimate the value of Feed that minimizes log(Life), and obtain a 95% confidence interval for this value using the delta method.

6.3. In the sleep data, do a lack-of-fit test for D linear against the one-way Anova model. Summarize results.

6.4. The data in the file twins.txt give the IQ scores of identical twins, one raised in a foster home, IQf, and the other raised by birth parents, IQb. The data were published by Burt (1966), and their authenticity has been questioned. For purposes of this example, the twin pairs can be divided into three social classes C, low, middle or high, coded in the data file as 1, 2, and 3, respectively, according to the social class of the birth parents.
Treat IQf as the response and IQb as the predictor, with C as a factor. Perform an appropriate analysis of these data. Be sure to draw and discuss a relevant graph. Are the within-class mean functions straight lines? Are there class differences? If there are differences, what are they?

6.5. Referring to the data in Problem 2.2, compare the regression lines for Forbes's data and Hooker's data, for the mean function E(log(Pressure) | Temp) = β0 + β1Temp.

6.6. Refer to the Berkeley Guidance Study described in Problem 3.1. Using the data file BGSall.txt, consider the regression of HT18 on HT9 and the grouping factor Sex.

6.6.1. Draw the scatterplot of HT18 versus HT9, using a different symbol for males and females. Comment on the information in the graph about an appropriate mean function for these data.

6.6.2. Fit the four mean functions suggested in Section 6.2.2, perform the appropriate tests, and summarize your findings.

6.7. In the Berkeley Guidance Study data, Problem 6.6, consider the response HT18 and predictors HT2 and HT9. Model 1 in Section 6.2.2 allows each level of the grouping variable, in this example the variable Sex, to have its own mean function. Write down at least two generalizations of this model for this problem with two continuous predictors rather than one.

6.8. Continuing with Problem 6.7 and assuming no interaction between HT2 and HT9, obtain a test for the null hypothesis that the regression planes are parallel for boys and girls versus the alternative that separate planes are required for each sex.

6.9. Refer to the apple shoot data, Section 5.3, using the data file allshoots.txt, giving information on both long and short shoots.

6.9.1. Compute a mean square for pure error separately for long and short shoots, and show that the pure-error estimate of variance for long shoots is about twice the size of the estimate for short shoots. Since these two estimates are based on completely different observations, they are independent, and so their ratio will have an F distribution under the null hypothesis that the variance is the same for the two types of shoots. Obtain the appropriate test, and summarize results. (Hint: the alternative hypothesis is that the two variances are unequal, meaning that you need to compute a two-tailed significance level, not one-tailed as is usually done with F-tests.) Under the assumption that the variance for short shoots is σ² and the variance for long shoots is 2σ², obtain a pooled pure-error estimate of σ².

6.9.2. Draw the scatterplot of ybar versus Day, with a separate symbol for each of the two types of shoots, and comment on the graph. Are straight-line mean functions plausible? Are the two types of shoots different?
6.9.3. Fit Models 1, 3 and 4 from Section 6.2.2 to these data. You will need to use weighted least squares, since each of the responses is an average of n values. Also, in light of Problem 6.9.1, assume that the variance for short shoots is σ², but the variance for long shoots is 2σ².

6.10. Gothic and Romanesque cathedrals The data in the data file cathedral.txt give Height = nave height and Length = total length, both in feet, for medieval English cathedrals. The cathedrals can be classified according to their architectural style, either Romanesque or, later, Gothic. Some

cathedrals have both a Gothic and a Romanesque part, each of differing height; these cathedrals are included twice. Names of the cathedrals are also provided in the file. The data were provided by Gould S.J. based on plans given by Clapham (1934).

6.10.1. For these data, it is useful to draw separate plots of Length versus Height for each architectural style. Summarize the differences apparent in the graphs in the regressions of Length on Height for the two styles.

6.10.2. Use the data and the plots to fit regression models that summarize the relationship between the response Length and the predictor Height for the two architectural styles.

6.11. Windmill data In Problem 2.13, page 45, we considered data to predict wind speed CSpd at a candidate site based on wind speed RSpd at a nearby reference site where long-term data are available. In addition to RSpd, we also have available the wind direction, RDir, measured in degrees. A standard method to include the direction data in the prediction is to divide the directions into several bins and then fit a separate mean function for the regression of CSpd on RSpd in each bin. In the wind farm literature, this is called the measure, correlate, predict method (Derrick, 1992). The data file wm2.txt contains values of CSpd, RSpd, RDir, and Bin for 2002 for the same candidate and reference sites considered in Problem 2.13. Sixteen bins are used, the first bin for cases with RDir between 0 and 22.5 degrees, the second for cases with RDir between 22.5 and 45 degrees, ..., and the last bin between 337.5 and 360 degrees. Both the number of bins and their starting points are arbitrary.

6.11.1. Obtain tests that compare fitting the four mean functions discussed in Section 6.2.2 to the 16 bins. How many parameters are in each of the mean functions? Do not attempt this problem unless your computer package has a programming language.
6.11.2. Table 6.5 gives the number of observations in each of the 16 bins along with the average wind speed in that bin for the reference site for the period January 1, 1948 to July 31, 2003, excluding the year 2002; the table is also given in the data file wm3.txt. Assuming the most general model of a separate regression in each bin is appropriate, predict the average wind speed at the candidate site for each of the 16 bins, and find the standard error. This will give you 16 predictions and 16 independent standard errors. Finally, combine these 16 estimates into one overall estimate (you should weight according to the number of cases in a bin), and then compare your answer to the prediction and standard error from Problem 2.13.

6.12. Land valuation Taxes on farmland enrolled in a Green Acres program in metropolitan Minneapolis-St. Paul are valued only with respect to the land's value as productive farmland; the fact that a shopping center or industrial park

TABLE 6.5 Bin Counts and Means for the Windmill Data (a)

Bin   Bin.count   RSpd

(a) These data are also given in the file wm3.txt.

has been built nearby cannot enter into the valuation. This creates difficulties because almost all sales, which are the basis for setting assessed values, are priced according to the development potential of the land, not its value as farmland. A method of equalizing valuation of land of comparable quality was needed. One method of equalization is based on a soil productivity score P, a number between 1, for very poor land, and 100, for the highest quality agricultural land. The data in the file prodscore.txt, provided by Doug Tiffany, give P along with Value, the average assessed value, the Year, either 1981 or 1982, and the County name, for four counties in Minnesota (Le Sueur, Meeker, McLeod, and Sibley) where development pressures had little effect on assessed value of land at that time. The unit of analysis is a township, roughly six miles square. The goal of the analysis is to decide if soil productivity score is a good predictor of assessed value of farmland. Be sure to examine county and year differences, and write a short summary that would be of use to decision makers who need to determine if this method can be used to set property taxes.

6.13. Sex discrimination The data in the file salary.txt concern salary and other characteristics of all faculty in a small Midwestern college, collected in the early 1980s for presentation in legal proceedings in which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure-track positions; temporary faculty are not included. The data were collected from personnel files and consist of the quantities described in Table 6.6.

6.13.1. Draw an appropriate graphical summary of the data, and comment on the graph.

6.13.2. Test the hypothesis that the mean salary for men and women is the same. What alternative hypothesis do you think is appropriate?

TABLE 6.6 The Salary Data

Variable   Description
Sex        Sex, 1 for female and 0 for male
Rank       Rank, 1 for Assistant Professor, 2 for Associate Professor, and 3 for Full Professor
Year       Number of years in current rank
Degree     Highest degree, 1 if Doctorate, 0 if Masters
YSdeg      Number of years since highest degree was earned
Salary     Academic year salary in dollars

6.13.3. Obtain a test of the hypothesis that salary adjusted for years in current rank, highest degree, and years since highest degree is the same for each of the three ranks, versus the alternative that the salaries are not the same. Test to see if the sex differential in salary is the same in each rank.

6.13.4. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, "... [a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be tainted." Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts. Fit two mean functions, one including Sex, Year, YSdeg and Degree, and the second adding Rank. Summarize and compare the results of leaving out rank effects on inferences concerning the differential in pay by sex.

6.14. Using the salary data in Problem 6.13, one fitted mean function is

E(Salary | Sex, Year) = … + … Sex + 741 Year + 169 Sex × Year

6.14.1. Give the coefficients in the estimated mean function if Sex were coded so males had the value 2 and females had the value 1 (the coding given to get the above mean function was 0 for males and 1 for females).

6.14.2. Give the coefficients if Sex were coded as −1 for males and +1 for females.

6.15. Pens of turkeys were grown with an identical diet, except that each pen was supplemented with an amount A of an amino acid, methionine, as a percentage of the total diet of the birds. The data in the file turk0.txt give the response, average weight Gain in grams of all the turkeys in the pen, for 35 pens of turkeys receiving various levels of A.

6.15.1. Draw the scatterplot of Gain versus A and summarize. In particular, does simple linear regression appear plausible?

6.15.2. Obtain a lack-of-fit test for the simple linear regression mean function, and summarize results. Repeat for the quadratic regression mean function.

6.15.3. To the graph drawn in Problem 6.15.1, add the fitted mean functions based on both the simple linear regression mean function and the quadratic mean function, for values of A in the range from 0 to 0.60, and comment.

6.16. For the quadratic regression mean function for the turkey data discussed in Problem 6.15, use the bootstrap to estimate the standard error of the value of A that maximizes gain. Compare this estimated standard error with the answer obtained using the delta method.

6.17. Refer to Jevons's coin data, Problem 5.6. Determine the Age at which the predicted weight of coins is equal to the legal minimum, and use the delta method to get a standard error for the estimated age. This problem is called inverse regression, and is discussed by Brown (1994).

6.18. The data in the file mile.txt give the world record times for the one-mile run. For males, the records are for the period from …, and for females, for the period … . The variables in the file are Year, year of the record; Time, the record time, in seconds; Name, the name of the runner; Country, the runner's home country; Place, the place where the record was run (missing for many of the early records); and Gender, either Male or Female. The data were taken from sut/eng/.

6.18.1. Draw a scatterplot of Time versus Year, using a different symbol for men and women. Comment on the graph.

6.18.2. Fit separate simple linear regression mean functions to each sex, and show that separate slopes and intercepts are required. Provide an interpretation of the slope parameters for each sex.

6.18.3. Find the year in which the female record is expected to be 240 seconds, or four minutes. This will require inverting the fitted regression equation.
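The inversion asked for here, and the delta-method standard error that goes with it, can be sketched as follows. This is an illustrative Python sketch with invented coefficient values; the function names and arguments are our own, not part of the book's software.

```python
import math

def invert_line(b0, b1, target):
    # Year at which the fitted line b0 + b1*year reaches `target`
    # (inverse regression: solve target = b0 + b1*year for year).
    return (target - b0) / b1

def invert_line_se(b0, b1, target, var0, var1, cov01):
    # Delta-method standard error of the inverted value:
    # g(b0, b1) = (target - b0)/b1, gradient (-1/b1, -(target - b0)/b1**2).
    g0 = -1.0 / b1
    g1 = -(target - b0) / b1 ** 2
    return math.sqrt(g0 * g0 * var0 + 2 * g0 * g1 * cov01 + g1 * g1 * var1)
```

The variances and covariance of the two coefficient estimates come from the fitted regression; an approximate interval is then the inverted value plus or minus a normal multiplier times this standard error.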
6.18.4. Use the delta method to estimate the standard error of this estimate.

6.18.5. Using the model fit in Problem 6.18.2, estimate the year in which the female record will match the male record, and use the delta method to estimate the standard error of the year in which they will agree. Comment on whether you think using the point at which the fitted regression lines cross is a reasonable estimator of the crossing time.

6.19. Use the delta method to get a 95% confidence interval for the ratio β1/β2 for the transactions data, and compare with the bootstrap interval obtained at the end of Section … .
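The delta-method calculation for a ratio of two coefficients has a simple closed form that can be sketched directly. This is an illustrative Python sketch; the inputs in any call are placeholders, not the transactions-data estimates.

```python
import math

def ratio_se(b1, b2, var1, var2, cov12):
    # Delta-method standard error for the ratio b1/b2:
    # g(b1, b2) = b1/b2, gradient (1/b2, -b1/b2**2).
    g1 = 1.0 / b2
    g2 = -b1 / b2 ** 2
    return math.sqrt(g1 * g1 * var1 + 2 * g1 * g2 * cov12 + g2 * g2 * var2)
```

An approximate 95% interval is then b1/b2 plus or minus 1.96 times this standard error, which can be compared with the bootstrap interval.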

6.20. Refer to the wool data discussed in Section 6.3.

6.20.1. Write out in full the main-effects and the second-order mean functions, assuming that the three predictors will be turned into factors, each with three levels. This will require you to define appropriate dummy variables and parameters.

6.20.2. For the two mean functions in Problem 6.20.1, write out the expected change in the response when Len and Amp are fixed at their middle levels, but Load is increased from its middle level to its high level.

6.21. A POD model for a problem with p predictors X = (X1, ..., Xp) and a factor F with d levels is specified, for the jth level of F, by

E(Y | X = x, F = j) = η0j + η1j(x′β)   (6.30)

This is a nonlinear model because η1j multiplies the parameter β. Estimation of parameters can use the following two-step algorithm:

1. Assume that the η1j, j = 1, ..., d are known. At the first step of the algorithm, set η1j = 1, j = 1, ..., d. Define a new term z_j = η1j x, and substituting into (6.30),

E(Y | X = x, F = j) = η0j + z_j′β

We recognize this as a mean function for parallel regressions with common slopes β and a separate intercept for each level of F. This mean function can be fit using standard ols linear regression software. Save the estimate β̂ of β.

2. Let v = x′β̂, where β̂ was computed in step 1. Substitute v for x′β in (6.30) to get

E(Y | X = x, F = j) = η0j + η1j v

which we recognize as a mean function with a separate intercept and slope for each level of F. This mean function can also be fit using ols linear regression software. Save the estimates of the η1j and use them in the next iteration of step 1.

Repeat this algorithm until the residual sum of squares obtained at the two steps is essentially the same. The estimates obtained at the last step will be the ols estimates for the original mean function, and the residual sum of squares will be the residual sum of squares that would be obtained by fitting using nonlinear least squares.
Estimated standard errors of the coefficients from this algorithm will be too small, so t-tests should not be used, but F-tests can be used to compare models. Write a computer program that implements this algorithm.
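The two-step algorithm of Problem 6.21 can be sketched with ordinary least squares via numpy. This is an illustrative Python sketch, not the book's own program; the function and variable names are our own, and each step is just a linear least squares fit as the problem describes.

```python
import numpy as np

def fit_pod(y, X, groups, tol=1e-10, max_iter=200):
    # Two-step algorithm for the POD mean function (6.30),
    # E(Y | X = x, F = j) = eta0_j + eta1_j * (x' beta),
    # alternating two ordinary least squares fits.
    n, p = X.shape
    labels = np.unique(groups)
    d = len(labels)
    G = (groups[:, None] == labels[None, :]).astype(float)  # intercept dummies
    eta1 = np.ones(d)                                       # start with eta1_j = 1
    rss_old = np.inf
    for _ in range(max_iter):
        # Step 1: with eta1 fixed, z = eta1_j * x; parallel fit, common beta.
        scale = G @ eta1                         # each case's group eta1
        Z = np.hstack([G, X * scale[:, None]])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        beta = coef[d:]
        # Step 2: with beta fixed, v = x' beta; separate line per group.
        v = X @ beta
        W = np.hstack([G, G * v[:, None]])
        coef2, *_ = np.linalg.lstsq(W, y, rcond=None)
        eta0, eta1 = coef2[:d], coef2[d:]
        rss = float(np.sum((y - W @ coef2) ** 2))
        if abs(rss_old - rss) < tol:             # stop when RSS stabilizes
            break
        rss_old = rss
    return eta0, eta1, beta, rss
```

On data generated exactly from a parallel POD mean function, the algorithm reproduces the fit with essentially zero residual sum of squares and slope multipliers near one.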

6.22. Using the computer program written in the last problem, or some other computational tool, verify the results obtained in the text for the Australian athletes data. Also, obtain tests for the general POD mean function versus the POD mean function with parallel mean functions.

6.23. The Minnesota Twins professional baseball team plays its games in the Metrodome, an indoor stadium with a fabric roof. In addition to the large air fans required to keep the roof from collapsing, the baseball field is surrounded by ventilation fans that blow heated or cooled air into the stadium. Air is normally blown into the center of the field equally from all directions. According to a retired supervisor in the Metrodome, in the late innings of some games the fans would be modified so that the ventilation air would blow out from home plate toward the outfield. The idea is that the air flow might increase the length of a fly ball. For example, if this were done in the middle of the eighth inning, then the air-flow advantage would be in favor of the home team for six outs, three in each of the eighth and ninth innings, and in favor of the visitor for three outs in the ninth inning, resulting in a slight advantage for the home team. To see if manipulating the fans could possibly make any difference, a group of students at the University of Minnesota and their professor built a cannon that used compressed air to shoot baseballs. They then did the following experiment in the Metrodome in March 2003:

1. A fixed angle of 50 degrees and velocity of 150 feet per second was selected. In the actual experiment, neither the velocity nor the angle could be controlled exactly, so the actual angle and velocity varied from shot to shot.

2. The ventilation fans were set so that, to the extent possible, all the air was blowing in from the outfield towards home plate, providing a headwind. After waiting about 20 minutes for the air flows to stabilize, 20 balls were shot into the outfield, and their distances were recorded.
Additional variables recorded on each shot include the weight (in grams) and diameter (in cm) of the ball used on that shot, and the actual velocity and angle.

3. The ventilation fans were then reversed, so as much as possible air was blowing out toward the outfield, giving a tailwind. After waiting 20 minutes for air currents to stabilize, 15 balls were shot into the outfield, again measuring the ball weight and diameter, and the actual velocity and angle on each shot.

The data from this experiment are available in the file domedata.txt, courtesy of Ivan Marusic. The variable names are: Cond, the condition, head or tail wind; Velocity, the actual velocity in feet per second; Angle, the actual angle; BallWt, the weight of the ball in grams used on that particular test; BallDia, the diameter in inches of the ball used on that test; Dist, distance in feet of the flight of the ball.

Summarize any evidence that manipulating the fans can change the distance that a baseball travels. Be sure to explain how you reached

your conclusions, and provide appropriate summary statistics that might be useful for a newspaper reporter (a report of this experiment is given in the Minneapolis StarTribune for July 27, 2003).

In light of the discussion in Section 6.5, one could argue that this experiment by itself cannot provide adequate information to decide if the fans can affect the length of a fly ball. The treatment is manipulating the fans; each condition was set up only once and then repeatedly observed. Unfortunately, resetting the fans after each shot is not practical because of the need to wait at least 20 minutes for the air flows to stabilize. A second experiment was carried out in May 2003, using a similar experimental protocol. As before, the fans were first set to provide a headwind, and then, after several trials, the fans were switched to a tailwind. Unlike the first experiment, however, the nominal Angle and Velocity were varied according to a 3 × 2 factorial design. The data file domedata1.txt contains the results from both the first experiment and the second experiment, with an additional column called Date indicating which sample is which. Analyze these data, and write a brief report of your findings.

CHAPTER 7

Transformations

There are exceptional problems for which we know that the mean function E(Y|X) is a linear regression mean function. For example, if (Y, X) has a joint normal distribution, then as in Section 4.3, the conditional distribution of Y|X has a linear mean function. Sometimes, the mean function may be determined by a theory, apart from parameter values, as with the strong interaction data discussed earlier. Often, there is no theory to tell us the correct form for the mean function, and any parametric form we use is little more than an approximation that we hope is adequate for the problem at hand. Replacing either the predictors, the response, or both by nonlinear transformations of them is an important tool that the analyst can use to extend the number of problems for which linear regression methodology is appropriate. This brings up two important questions: How do we choose transformations? How do we decide if an approximate model is adequate for the data at hand? We address the first of these questions in this chapter, and the second in Chapters 8 and 9.

7.1 TRANSFORMATIONS AND SCATTERPLOTS

The most frequent purpose of transformations is to achieve a mean function that is linear in the transformed scale. In problems with only one predictor and one response, the mean function can be visualized in a scatterplot, and we can attempt to select a transformation so the resulting scatterplot has an approximate straight-line mean function. With many predictors, selection of transformations can be harder, as the criterion to use for selecting transformations is less clear, so we consider the one-predictor case first. We seek a transformation so that if X is the transformed predictor and Y is the transformed response, then the mean function in the transformed scale is

E(Y|X = x) ≈ β_0 + β_1 x

where we have used ≈ rather than = to recognize that this relationship may be an approximation and not exactly true.

FIG. 7.1 Plot of BrainWt versus BodyWt for 62 mammal species.

Figure 7.1 contains a plot of body weight BodyWt in kilograms and brain weight BrainWt in grams for 62 species of mammals (Allison and Cicchetti, 1976), using the data in the file brains.txt. Apart from the three separated points for two species of elephants and for humans, the uneven distribution of points hides any useful visual information about the mean of BrainWt, given BodyWt. In any case, there is little or no evidence for a straight-line mean function here. Both variables range over several orders of magnitude, from tiny species with body weights of just a few grams to huge animals of over 6600 kg. Transformations can help in this problem.

7.1.1 Power Transformations

A transformation family is a collection of transformations that are indexed by one or a few parameters that the analyst can select. The family that is used most often is called the power family, defined for a strictly positive variable U by

ψ(U, λ) = U^λ    (7.1)

As the power parameter λ is varied, we get the members of this family, including the square root and cube root transformations, λ = 1/2 or 1/3, and the inverse, λ = −1. We will interpret the value λ = 0 to be a log transformation. The usual values of λ that are considered are in the range from −2 to 2, but values in the range from −1 to +1 are ordinarily selected. The value λ = +1 corresponds to no transformation. The variable U must be strictly positive for these transformations to be used, but we will have more to say later about transforming variables that may be zero or negative. We have introduced this ψ-notation¹ because we will later consider other families of transformations, and having this notation will allow more clarity in the discussion.

¹ ψ is the Greek letter psi.
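The power family (7.1) is simple to compute. The following sketch (our own illustrative code, not from the text; the function name psi is an assumption) applies ψ(U, λ), with λ = 0 read as the log transformation:

```python
import numpy as np

def psi(u, lam):
    """Basic power transformation psi(U, lambda) = U**lambda of (7.1).

    The value lambda = 0 is interpreted as the natural-log
    transformation; U must be strictly positive.
    """
    u = np.asarray(u, dtype=float)
    if np.any(u <= 0):
        raise ValueError("power transformations require strictly positive U")
    return np.log(u) if lam == 0 else u ** lam
```

For example, psi(u, -1) gives the inverse transformation and psi(u, 0) the log; the error check guards the strict-positivity requirement noted above.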

Some statistical packages include graphical tools that can help you select power transformations of both the predictor and the response. For example, a plot could include slidebars to select values of the transformation parameters applied to the horizontal and vertical variables. As different values of the parameters are selected in the slidebars, the graph is updated to reflect the transformation of the data corresponding to the currently selected value of the transformation parameter. If a graphical interface is not available in your package, you can draw several figures to help select a transformation. A mean smoother and the ols line added to each of the plots may be helpful in looking at these plots. Figure 7.2 shows plots of ψ(BrainWt, λ) versus ψ(BodyWt, λ) with the same λ for both variables, for λ = 1, 0, 1/3, 1/2. There is no necessity for the transformation to be the same for the two variables, but it is reasonable here because both variables are the same type of measurement, one being the weight of an object, and the other the weight of a component of the object. If we allowed each variable to have its own transformation parameter, the visual search for a transformation is harder because more possibilities need to be considered.

FIG. 7.2 Scatterplots for the brain weight data with four possible transformations: (a) λ = 1; (b) logs of both variables; (c) λ = 0.33; (d) λ = 0.5. The solid line on each plot is the ols line; the dashed line is a loess smooth.

From the four graphs in Figure 7.2, the clear choice is replacing the weights by their logarithms. In this scale, the mean function appears to be a straight line, with the smoother closely matching the ols line shown on the graph in log scale but matching less well for the other transformations. As a bonus, the variance function in the log plot appears to be constant. The use of logarithms for the brain weight data may not be particularly surprising, in light of the following two empirical rules that are often helpful in linear regression modeling:

The log rule: If the values of a variable range over more than one order of magnitude and the variable is strictly positive, then replacing the variable by its logarithm is likely to be helpful.

The range rule: If the range of a variable is considerably less than one order of magnitude, then any transformation of that variable is unlikely to be helpful.

The log rule is satisfied for both BodyWt, with range from a small fraction of a kilogram to 6654 kg, and for BrainWt, with range 0.14 g to 5712 g, so log transformations would have been indicated as a starting point for examining these variables for transformations. Simple linear regression seems to be appropriate with both variables in log scale. This corresponds to the physical model

BrainWt = α BodyWt^β_1 δ    (7.2)

where δ is a multiplicative error, meaning that the actual average brain weight for a particular species is obtained by taking the mean brain weight for species of a particular body weight and multiplying by δ. We would expect that δ would have mean 1 and a distribution concentrated on values close to 1. On taking logarithms and setting β_0 = log(α) and e = log(δ),

log(BrainWt) = β_0 + β_1 log(BodyWt) + e

which is the simple linear regression model. Scientists who study the relationships between attributes of individuals or species call (7.2) an allometric model (see, for example, Gould, 1966, 1973; Hahn, 1979), and the value of β_1 plays an important role in allometric studies.
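To make the log-linearization step concrete, here is a small simulated illustration (our own sketch; all numbers, including the slope 0.75 and intercept, are invented and are not estimates from the brain weight data). Data are generated from the allometric model (7.2) and β_0 = log(α) and β_1 are recovered by ols on the log scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the allometric model BrainWt = alpha * BodyWt**beta1 * delta,
# with a multiplicative error delta concentrated near 1.
n = 200
body = np.exp(rng.normal(2.0, 1.5, size=n))   # hypothetical body weights, kg
alpha, beta1 = 10.0, 0.75                     # hypothetical parameter values
delta = np.exp(rng.normal(0.0, 0.1, size=n))  # multiplicative error near 1
brain = alpha * body ** beta1 * delta

# On the log scale the model is log(brain) = beta0 + beta1*log(body) + e,
# with beta0 = log(alpha) and e = log(delta), so ols applies directly.
X = np.column_stack([np.ones(n), np.log(body)])
beta_hat, *_ = np.linalg.lstsq(X, np.log(brain), rcond=None)
```

The fitted slope beta_hat[1] should be close to the generating value 0.75, and the fitted intercept close to log(10).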
We emphasize, however, that not all useful transformations will correspond to interpretable physical models.

7.1.2 Transforming Only the Predictor Variable

In the brain weight example, transformations of both the response and the predictor are required to get a linear mean function. In other problems, transformation of only one variable may be desirable. If we want to use a family of power transformations, it is convenient to introduce the family of scaled power transformations, defined for strictly positive X by

ψ_S(X, λ) = (X^λ − 1)/λ   if λ ≠ 0
ψ_S(X, λ) = log(X)        if λ = 0    (7.3)
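In code, (7.3) is a one-liner (an illustrative sketch; the name psi_s is our own):

```python
import numpy as np

def psi_s(x, lam):
    """Scaled power transformation of (7.3) for strictly positive x:
    (x**lam - 1)/lam when lam != 0, and log(x) when lam == 0.  Unlike
    the basic power family, this is continuous in lam and preserves
    the direction of association even when lam < 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam
```

Note that psi_s(x, 1e-8) is already numerically close to log(x), and psi_s is increasing in x even for negative λ, unlike X^λ.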

The scaled power transformations ψ_S(X, λ) differ from the basic power transformations ψ(X, λ) in several respects. First, ψ_S(X, λ) is continuous as a function of λ. Since lim_{λ→0} ψ_S(X, λ) = log_e(X), the logarithmic transformation is a member of this family with λ = 0. Also, scaled power transformations preserve the direction of association, in the sense that if (X, Y) are positively related, then (ψ_S(X, λ), Y) are positively related for all values of λ. With basic power transformations, the direction of association changes when λ < 0. If we find an appropriate power to use for a scaled power transformation, we would in practice use the basic power transformation ψ(X, λ) in regression modeling, since the two differ only by a scale, location, and possibly sign change. The scaled transformations are used to select a transformation only.

If transforming only the predictor and using a choice from the power family, we begin with the mean function

E(Y|X) = β_0 + β_1 ψ_S(X, λ)    (7.4)

If we know λ, we can fit (7.4) via ols and get the residual sum of squares, RSS(λ). The estimate λ̂ of λ is simply the value of λ that minimizes RSS(λ). As a practical matter, we do not need to know λ very precisely, and selecting λ to minimize RSS(λ) from

λ ∈ {−1, −1/2, 0, 1/3, 1/2, 1}    (7.5)

is usually adequate.

As an example of transforming only the predictor, we consider the dependence of tree Height in decimeters on Dbh, the diameter of the tree in mm at 137 cm above the ground, for a sample of western cedar trees measured in 1991 in the Upper Flat Creek stand of the University of Idaho Experimental Forest (courtesy of Andrew Robinson). The data are in the file ufcwc.txt.

FIG. 7.3 Height versus Dbh for the red cedar data from Upper Flat Creek.

Figure 7.3 is the scatterplot of

the data, and on this plot we have superimposed three curved lines. For each λ, we computed fitted values ŷ(λ) from the ols regression of Height on ψ_S(Dbh, λ). The line for a particular value of λ is obtained by plotting the points (Dbh, ŷ(λ)) and joining them with a line. Only three values of λ are shown in the figure because it gets too crowded to see much with more lines, but among these three, the choice λ = 0 seems to match the data most closely. The choice λ = 1 does not match the data for large and small trees, while the inverse is too curved to match the data for larger trees. This suggests replacing Dbh with log(Dbh), as we have done in Figure 7.4.

FIG. 7.4 The red cedar data from Upper Flat Creek transformed, plotting Height versus log_2(Dbh).

As an alternative approach, the value of the transformation parameter can be estimated by fitting using nonlinear least squares. The mean function (7.4) is a nonlinear function of the parameters because β_1 multiplies the nonlinear function ψ_S(X, λ) of the parameter λ. Using the methods described in Chapter 11, the estimate of λ turns out to be λ̂ = 0.05 with a standard error of 0.15, so λ = 0 is close enough to believe that this is a sensible transformation to use.

7.1.3 Transforming the Response Only

A transformation of the response only can be selected using an inverse fitted value plot, in which we put the fitted values from the regression of Y on X on the vertical axis and the response on the horizontal axis. In simple regression the fitted values are proportional to the predictor X, so an equivalent plot is of X on the vertical axis versus Y on the horizontal axis. The method outlined in Section 7.1.2 can then be applied to this inverse problem, as suggested by Cook and Weisberg (1994). Thus, to estimate a transformation ψ_S(Y, λ_y), start with the mean function

E(ŷ|Y) = α_0 + α_1 ψ_S(Y, λ_y)

and estimate λ_y. An example of the use of an inverse response plot will be given in Section 7.3.

7.1.4 The Box and Cox Method

Box and Cox (1964) provided another general method for selecting transformations of the response that is applicable both in simple and multiple regression. As with the previous methods, we will select the transformation from a family indexed by a parameter λ. For the Box–Cox method, we need a slightly more complicated version of the power family that we will call the modified power family, defined by Box and Cox (1964) for strictly positive Y to be

ψ_M(Y, λ_y) = ψ_S(Y, λ_y) × gm(Y)^(1−λ_y)
            = gm(Y)^(1−λ_y) (Y^λ_y − 1)/λ_y   if λ_y ≠ 0
            = gm(Y) log(Y)                     if λ_y = 0    (7.6)

where gm(Y) is the geometric mean of the untransformed variable². In the Box–Cox method, we assume that the mean function

E(ψ_M(Y, λ_y)|X = x) = β′x    (7.7)

holds for some λ_y. If λ_y were known, we could fit the mean function (7.7) using ols because the transformed response ψ_M(Y, λ_y) would then be completely specified. Write the residual sum of squares from this regression as RSS(λ_y). Multiplication of the scaled power transformation by gm(Y)^(1−λ_y) guarantees that the units of ψ_M(Y, λ_y) are the same for all values of λ_y, and so all the RSS(λ_y) are in the same units. We estimate λ_y to be the value of the transformation parameter that minimizes RSS(λ_y). From a practical point of view, we can again select λ_y from among the choices in (7.5).

The Box–Cox method is not transforming for linearity, but rather it is transforming for normality: λ is chosen to make the residuals from the regression of ψ_M(Y, λ_y) on X as close to normally distributed as possible. Hernandez and Johnson (1980) point out that as close to normal as possible need not be very close to normal, and so graphical checks are desirable after selecting a transformation. The Box and Cox method will also produce a confidence interval for the transformation parameter; see Appendix A.11.1 for details.
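A minimal sketch of the Box–Cox computation (our own illustrative code: the data are simulated, and λ_y is chosen from the small grid (7.5) rather than by full maximum likelihood):

```python
import numpy as np

def psi_m(y, lam):
    """Modified power family (7.6): the scaled power transformation
    multiplied by gm(Y)**(1 - lam), where gm(Y) is the geometric mean,
    so that RSS values are comparable across values of lam."""
    y = np.asarray(y, dtype=float)
    gm = np.exp(np.mean(np.log(y)))
    return gm * np.log(y) if lam == 0 else (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))

def box_cox_lambda(X, y, grid=(-1.0, -0.5, 0.0, 1/3, 0.5, 1.0)):
    """Pick the lambda in the grid minimizing RSS(lambda) from the
    ols fit of psi_m(y, lambda) on X (X should include an intercept)."""
    def rss(lam):
        z = psi_m(y, lam)
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        r = z - X @ beta
        return float(r @ r)
    return min(grid, key=rss)
```

On data simulated so that log(y) is linear in the predictor, the grid search should select λ = 0, the log transformation.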
7.2 TRANSFORMATIONS AND SCATTERPLOT MATRICES

The data described in Table 7.1 and given in the data file highway.txt are taken from an unpublished master's paper in civil engineering by Carl Hofstedt.

² If the values of Y are y_1, ..., y_n, the geometric mean of Y is gm(Y) = exp(Σ log(y_i)/n), using natural logarithms.

TABLE 7.1 The Highway Accident Data^a

Variable   Description
Rate       1973 accident rate per million vehicle miles
Len        Length of the segment in miles
ADT        Estimated average daily traffic count in thousands
Trks       Truck volume as a percent of the total volume
Slim       1973 speed limit
Shld       Shoulder width in feet of outer shoulder on the roadway
Sigs       Number of signalized interchanges per mile in the segment

^a Additional variables appear in the file highway.txt and will be described later.

They relate the automobile accident rate in accidents per million vehicle miles to several potential terms. The data include 39 sections of large highways in the state of Minnesota in 1973. The goal of this analysis was to understand the impact of the design variables Acpts, Slim, Sigs, and Shld, which are under the control of the highway department, on accidents. The other variables are thought to be important determinants of accidents but are more or less beyond the control of the highway department and are included to reduce variability due to these uncontrollable factors. We have no particular reason to believe that Rate will be a linear function of the predictors, or any theoretical reason to prefer any particular form for the mean function. An important first step in this analysis is to examine the scatterplot matrix of all the predictors and the response, as given in Figure 7.5. Here are some observations about this scatterplot matrix that might help in selecting transformations:

1. The variable Sigs, the number of traffic lights per mile, is zero for freeway-type road segments but can be well over 2 for other segments. Transformations may help with this variable, but since it has nonpositive values, we cannot use the power transformations directly. Since Sigs is computed as the number of signals divided by Len, we will replace Sigs by a related variable Sigs1 defined by

Sigs1 = (Sigs × Len + 1)/Len

This variable is always positive and can be transformed using the power family.

2.
ADT and Len have a large range, and logarithms are likely to be appropriate for them.

3. Slim varies only from 40 mph to 70 mph, with most values in the range 50 to 60. Transformations are unlikely to be of much use here.

4. Each of the predictors seems to be at least modestly associated with Rate, as the mean function for each of the plots in the top row of Figure 7.5 is not flat.

FIG. 7.5 The highway accident data, no transformations.

5. Many of the predictors are also related to each other. In some cases, the mean functions for the plots of predictor versus predictor appear to be linear; in other cases, they are not linear.

Given these preliminary views of the scatterplot matrix, we now have the daunting task of finding good transformations to use. This raises immediate questions: What are the goals in selecting transformations? How can we decide if we have made a good choice? The overall goal of transforming in linear regression is to find transformations in which multiple linear regression matches the data to a good approximation. The connection between this goal and choosing transformations that make the 2D plots of predictors have linear mean functions is not entirely obvious. Important work by Brillinger (1983) and Li and Duan (1989) provides a theoretical connection. Suppose we have a response variable Y and a set of predictors X, and suppose it were true that

E(Y|X = x) = g(β′x)    (7.8)

for some completely unknown and unspecified function g. According to this, the mean of Y depends on X only through a linear combination of the terms in X, and if we could draw a graph of Y versus β′x, this graph would have g as its mean function. We could then either estimate g, or we could transform Y to make the mean function linear. All this depends on estimating β without specifying anything about g. Are there conditions under which the ols regression of Y on X can help us learn about β?

7.2.1 The 1D Estimation Result and Linearly Related Predictors

Suppose that A = a′X and B = b′X are any two linear combinations of the terms in X, such that

E(A|B) = γ_0 + γ_1 B    (7.9)

so the graph of A versus B has a straight-line mean function. We will say that X is a set of linear predictors if (7.9) holds for all linear combinations A and B. The condition that all the graphs in a scatterplot matrix of X have straight-line mean functions is weaker than (7.9), but it is a reasonable condition that we can check in practice. Requiring that X have a multivariate normal distribution is much stronger than (7.9). Hall and Li (1993) show that (7.9) holds approximately as the number of predictors grows large, so in very large problems, transformation becomes less important because (7.9) will hold approximately without any transformations.

Given that (7.9) holds at least to a reasonable approximation, and assuming that E(Y|X = x) = g(β′x), the ols estimate β̂ is a consistent estimate of cβ for some constant c that is usually nonzero (Li and Duan, 1989; see also Cook, 1998). Given this theorem, a useful general procedure for applying multiple linear regression analysis is:

1. Transform predictors to get terms for which (7.9) holds, at least approximately. The terms in X may include dummy variables that represent factors, which should not be transformed, as well as transformations of continuous predictors.

2. We can estimate g from the 2D scatterplot of Y versus β̂′x, where β̂ is the ols estimator from the regression of Y on X.
Almost equivalently, we can estimate a transformation of Y either from the inverse plot of β̂′x versus Y or from using the Box–Cox method.

This is a general and powerful approach to building regression models that match data well, based on the assumption that (7.8) is appropriate for the data. We have already seen mean functions in Chapter 6 for which (7.8) does not hold because of the inclusion of interaction terms, and so transformations chosen using the methods discussed here may not provide a comprehensive mean function when interactions are present. The Li–Duan theorem is actually much more general and has been extended to problems with interactions present and to many other estimation methods beyond

ols. See Cook and Weisberg (1999a, Chapters 18–20) and, at a higher mathematical level, Cook (1998).

7.2.2 Automatic Choice of Transformation of Predictors

Using the results of Section 7.2.1, we seek to transform the predictors so that all plots of one predictor versus another have a linear mean function, or at least have mean functions that are not too curved. Without interactive graphical tools, or some automatic method for selecting transformations, this can be a discouraging task, as the analyst may need to draw many scatterplot matrices to get a useful set of transformations. Velilla (1993) proposed a multivariate extension of the Box and Cox method to select transformations to linearity, and this method can often suggest a very good starting point for selecting transformations of predictors.

Starting with k untransformed strictly positive predictors X = (X_1, ..., X_k), we will apply a modified power transformation to each X_j, and so there will be k transformation parameters collected into λ = (λ_1, λ_2, ..., λ_k). We will write ψ_M(X, λ) to be the set of variables

ψ_M(X, λ) = (ψ_M(X_1, λ_1), ..., ψ_M(X_k, λ_k))

Let V(λ) be the sample covariance matrix of the transformed data ψ_M(X, λ). The value λ̂ is selected as the value of λ that minimizes the logarithm of the determinant of V(λ). This minimization can be carried out using a general function minimizer included in high-level languages such as R, S-plus, Maple, Mathematica, or even Excel. The minimizers generally require only specification of the function to be minimized and a set of starting values for the algorithm. The starting values can be taken to be λ = 0, λ = 1, or some other appropriate vector of zeros and ones.

Returning to the highway data, we eliminate Slim as a variable to be transformed because its range is too narrow. For the remaining terms, we get the summary of transformations using the multivariate Box–Cox method in Table 7.2.
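To make the criterion concrete, here is an illustrative sketch (our own code, not Velilla's implementation): a crude coordinate-descent search over the grid (7.5) that minimizes log |V(λ)|. In practice, a general-purpose minimizer, as described above, would be used instead.

```python
import numpy as np

def psi_m(u, lam):
    # Modified power transformation (7.6), scaled by the geometric mean.
    u = np.asarray(u, dtype=float)
    gm = np.exp(np.mean(np.log(u)))
    return gm * np.log(u) if lam == 0 else (u ** lam - 1.0) / (lam * gm ** (lam - 1.0))

def log_det_V(cols, lams):
    # log |V(lambda)|, V = sample covariance of the transformed predictors.
    Z = np.column_stack([psi_m(c, l) for c, l in zip(cols, lams)])
    V = np.atleast_2d(np.cov(Z, rowvar=False))
    return np.linalg.slogdet(V)[1]

def choose_powers(cols, grid=(-1.0, -0.5, 0.0, 1/3, 0.5, 1.0), sweeps=10):
    # Start from lambda = 1 (no transformation) and improve one
    # coordinate at a time until a full sweep makes no change.
    lams = [1.0] * len(cols)
    for _ in range(sweeps):
        changed = False
        for j in range(len(lams)):
            best = min(grid, key=lambda g: log_det_V(cols, lams[:j] + [g] + lams[j + 1:]))
            if best != lams[j]:
                lams[j], changed = best, True
        if not changed:
            break
    return lams
```

For a single strongly lognormal predictor, for instance, this search picks λ = 0, the log transformation.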
TABLE 7.2 Power Transformations to Normality for the Highway Data

Box–Cox transformations to multivariate normality

        Est. Power   Std. Err.   Wald(Power=0)   Wald(Power=1)
Len
ADT
Trks
Shld
Sigs1

L.R. test, all powers = 0:     df = 5, p = 3e-04
L.R. test, all powers = 1:     df = 5, p ≈ 0
L.R. test, of (0, 0, 0, 1, 0): df = 5, p = 0.29

The table

gives the value of λ̂ in the column marked Est. Power. The standard errors are computed as outlined in the Appendix. For our purposes, the standard errors can be treated like standard errors of regression coefficients. The next two columns are like t-tests of the transformation parameter equal to zero or to one. These tests should be compared with a normal distribution, so values larger in absolute value than 1.96 correspond to p-values less than 0.05. The power parameters for Len, ADT, Trks, and Sigs1 do not appear to be different from zero, and Shld does not appear to be different from one. At the foot of the table are three likelihood ratio tests. The first of these tests is that all powers are zero; this is firmly rejected, as the approximate χ²(5) is very large. Similarly, the test for no transformation (λ = 1) is firmly rejected. The test that the first three variables should be in log scale, the next untransformed, and the last in log scale, has a p-value of 0.29 and suggests using these simple transformations in further analysis with these data. The predictors in transformed scale, along with the response, are shown in Figure 7.6. All these 2D plots have a linear mean function, or at least are not strongly nonlinear. They provide a good place to start regression modeling.

FIG. 7.6 Transformed predictors for the highway data: Rate, logLen, logADT, logTrks, Slim, Shld, logSigs1.

7.3 TRANSFORMING THE RESPONSE

Once the terms are transformed, we can turn our attention to transforming the response. Figure 7.7 is the inverse fitted value plot for the highway data using the transformed terms determined in the last section. This plot has the response Rate on the horizontal axis and the fitted values from the regression of Rate on the transformed predictors on the vertical axis. Cook and Weisberg (1994) have shown that if the predictors are approximately linearly related, then we can use the method of Section 7.1.3 to select a transformation for Rate. Among the three curves shown on this plot, the logarithmic seems to be the most appropriate.

The Box–Cox method provides an alternative procedure for finding a transformation of the response. It is often summarized by a graph with λ_y on the horizontal axis and either RSS(λ_y) or, better yet, −(n/2) log(RSS(λ_y)/n) on the vertical axis. With this latter choice, the estimate λ̂_y is the point that maximizes the curve, and a confidence interval for the estimate is given by the set of all λ_y with log(L(λ̂_y)) − log(L(λ_y)) < 1.92; see Appendix A.11.1. This graph for the highway data is shown in Figure 7.8, with λ̂_y ≈ −0.2 and the confidence interval extending from about −0.8 to a value above zero. The log transformation is in the confidence interval, agreeing with the inverse fitted value plot.

In the highway data, the two transformation methods for the response seem to agree, but there is no theoretical reason why they need to give the same transformation. The following path is recommended for selecting a response transformation:

1. With approximately linear predictors, draw the inverse response plot of ŷ versus the response. If this plot shows a clear nonlinear trend, then the response should be transformed to match the nonlinear trend. There is no reason why only power transformations should be considered. For example,

FIG. 7.7 Inverse fitted value plot for the highway data.

FIG. 7.8 Box–Cox summary graph for the highway data, showing the log-likelihood as a function of λ_y with a 95% confidence interval.

the transformation could be selected using a smoother. If there is no clear nonlinear trend, transformation of the response is unlikely to be helpful.

2. The Box–Cox procedure can be used to select a transformation to normality. It requires the use of a transformation family.

For the highway data, we now have a reasonable starting point for regression, with several of the predictors and the response all transformed to log scale. We will continue with this example in later chapters.

7.4 TRANSFORMATIONS OF NONPOSITIVE VARIABLES

Several transformation families for a variable U that includes negative values have been suggested. The central idea is to use the methods discussed in this chapter for selecting a transformation from a family, but to use a family that permits U to be nonpositive. One possibility is to consider transformations of the form (U + γ)^λ, where γ is sufficiently large to ensure that U + γ is strictly positive. We used a variant of this method with the variable Sigs in the highway data. In principle, (γ, λ) could be estimated simultaneously, although in practice estimates of γ are highly variable and unreliable.

Alternatively, Yeo and Johnson (2000) proposed a family of transformations that can be used without restrictions on U and that have many of the good properties of the Box–Cox power family. These transformations are defined by

ψ_YJ(U, λ) = ψ_M(U + 1, λ)          if U ≥ 0
ψ_YJ(U, λ) = −ψ_M(−U + 1, 2 − λ)    if U < 0    (7.10)

If U is strictly positive, then the Yeo–Johnson transformation is the same as the Box–Cox power transformation of (U + 1). If U is strictly negative, then the Yeo–Johnson transformation is the Box–Cox power transformation of (−U + 1),

but with power 2 − λ. With both negative and positive values, the transformation is a mixture of these two, so different powers are used for positive and negative values. In this latter case, interpretation of the transformation parameter is difficult, as it has a different meaning for U ≥ 0 and for U < 0. Figure 7.9 shows the Box–Cox transformation and the Yeo–Johnson transformation for the values λ = −1, 0, 0.5. For positive values, the two transformations differ in their behavior for values close to zero, with the Box–Cox transformations providing a much larger change for small values than do the Yeo–Johnson transformations.

FIG. 7.9 Comparison of Box–Cox (dashed lines) and Yeo–Johnson (solid lines) power transformations for λ = −1, 0, 0.5. The Box–Cox transformations and Yeo–Johnson transformations behave differently for values of y close to zero.

PROBLEMS

7.1. The data in the file baeskel.txt were collected in a study of the effect of dissolved sulfur on the surface tension of liquid copper (Baes and Kellogg,

1953). The predictor Sulfur is the weight percent sulfur, and the response is Tension, the decrease in surface tension in dynes per cm. Two replicate observations were taken at each value of Sulfur. These data were previously discussed by Sclove (1968).

7.1.1. Draw the plot of Tension versus Sulfur to verify that a transformation is required to achieve a straight-line mean function.

7.1.2. Set λ = −1, and fit the mean function

E(Tension|Sulfur) = β_0 + β_1 Sulfur^λ

using ols; that is, fit the ols regression with Tension as the response and 1/Sulfur as the predictor. Let new be a vector of 100 equally spaced values between the minimum value of Sulfur and its maximum value. Compute the fitted values from the regression you just fit, given by Fit.new = β̂_0 + β̂_1 new^λ. Then, add to the graph you drew in Problem 7.1.1 the line joining the points (new, Fit.new). Repeat for λ = 0, 1. Which of these three choices of λ gives fitted values that match the data most closely?

7.1.3. Replace Sulfur by its logarithm, and consider transforming the response Tension. To do this, draw the inverse fitted value plot with the fitted values from the regression of Tension on log(Sulfur) on the vertical axis and Tension on the horizontal axis. Repeat the methodology of Problem 7.1.2 to decide if further transformation of the response will be helpful.

7.2. The (hypothetical) data in the file stopping.txt give stopping times for n = 62 trials of various automobiles traveling at Speed miles per hour and the resulting stopping Distance in feet (Ezekiel and Fox, 1959).

7.2.1. Draw the scatterplot of Distance versus Speed. Add the simple regression mean function to your plot. What problems are apparent? Compute a test for lack of fit, and summarize results.

7.2.2. Find an appropriate transformation for Distance that can linearize this regression.

7.2.3. Hald (1960) suggested on the basis of a theoretical argument that the mean function E(Distance|Speed) = β_0 + β_1 Speed + β_2 Speed², with Var(Distance|Speed) = σ² Speed², is appropriate for data of this type.
Compare the fit of this model to the model found in Problem 7.2.2. For Speed in the range 0 to 40 mph, draw the curves that give the predicted Distance from each model, and qualitatively compare them.

7.3. This problem uses the data discussed in Problem 1.5. A major source of water in Southern California is the Owens Valley. This water supply is in turn replenished by spring runoff from the Sierra Nevada mountains. If runoff could be predicted, engineers, planners, and policy makers could do their

PROBLEMS

jobs more efficiently. The data in the file water.txt contain 43 years of precipitation measurements taken at six sites in the mountains, in inches of water, and stream runoff volume at a site near Bishop, California. The three sites with names starting with O are fairly close to each other, and the three sites starting with A are also fairly close to each other.

Load the data file, and construct the scatterplot matrix of the six snowfall variables, which are the predictors in this problem. Using the methodology for automatic choice of transformations outlined in Section 7.2.2, find transformations to make the predictors as close to linearly related as possible. Obtain a test of the hypothesis that all λj = 0 against a general alternative, and summarize your results. Do the transformations you found appear to achieve linearity? How do you know?

Given log transformations of the predictors, show that a log transformation of the response is reasonable.

Consider the multiple linear regression model with mean function given by

E(log(y)|x) = β0 + β1 log(APMAM) + β2 log(APSAB) + β3 log(APSLAKE) + β4 log(OPBPC) + β5 log(OPRC) + β6 log(OPSLAKE)

with constant variance function. Estimate the regression coefficients using ols. You will find that two of the estimates are negative. Which are they? Does a negative coefficient make any sense? Why are the coefficients negative?

In the ols fit, the regression coefficient estimates for the three predictors beginning with O are approximately equal. Are there conditions under which one might expect these coefficients to be equal? What are they? Test the hypothesis that they are equal against the alternative that they are not all equal.

Write one or two paragraphs that summarize the use of the snowfall variables to predict runoff. The summary should discuss the important predictors, give useful graphical summaries, and give an estimate of variability.
Be creative.

The data in the file salarygov.txt give the maximum monthly salary for 495 non-unionized job classes in a midwestern governmental unit. The variables are described in Table 7.3.

The data as given have as unit of analysis the job class. In a study of the dependence of maximum salary on skill, one might prefer to have as unit of analysis the employee, not the job class. Discuss how this preference would change the analysis.
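The automatic-transformation step in the water-runoff problem, choosing a Box-Cox power for each predictor and testing λ = 0, can be sketched numerically. The code below is a Python/numpy sketch (an assumption; the book supplies no code) that does the univariate version of the calculation on simulated lognormal data standing in for a precipitation variable, rather than the full multivariate procedure of Section 7.2.2; all names are illustrative:

```python
import numpy as np

def boxcox(x, lam):
    """Scaled power (Box-Cox) transformation; log(x) at lam = 0."""
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def boxcox_loglik(x, lam):
    """Profile log-likelihood of lam for one strictly positive variable,
    treating the transformed values as approximately normal."""
    z = boxcox(x, lam)
    return -0.5 * x.size * np.log(np.var(z)) + (lam - 1.0) * np.sum(np.log(x))

def lambda_hat(x, grid=np.linspace(-2.0, 2.0, 401)):
    """Grid-search maximum likelihood estimate of lam."""
    ll = np.array([boxcox_loglik(x, l) for l in grid])
    return float(grid[np.argmax(ll)])

# Made-up skewed "precipitation" variable: lognormal data, for which the
# log transformation (lam = 0) is the appropriate normalizing choice.
rng = np.random.default_rng(1)
x = np.exp(rng.normal(2.0, 0.7, size=200))
lam = lambda_hat(x)
# Likelihood-ratio statistic for H0: lam = 0 (compare with chi-squared, 1 df)
lrt = 2.0 * (boxcox_loglik(x, lam) - boxcox_loglik(x, 0.0))
```

With genuinely lognormal data the estimated power lands near zero and the likelihood-ratio statistic stays modest, which is the pattern the problem asks you to look for in the snowfall predictors.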

TABLE 7.3 The Governmental Salary Data

Variable    Description
MaxSalary   Maximum salary in dollars for employees in this job class, the response
NE          Total number of employees currently employed in this job class
NW          Number of women employees in the job class
Score       Score for the job class, based on difficulty, skill level, training requirements, and level of responsibility, as determined by a consultant to the governmental unit. For these data the value ranges upward from 82
JobClass    Name of the job class; a few names were illegible or partly illegible

Examine the scatterplot of MaxSalary versus Score. Find transformation(s) that would make the mean function for the resulting scatterplot approximately linear. Does the transformation you choose also appear to achieve constant variance?

According to Minnesota statutes, and probably laws in other states as well, a job class is considered to be female dominated if 70% or more of the employees in the job class are female. These data were collected to examine whether female-dominated positions are compensated at a lower level, adjusting for Score, than are other positions. Create a factor with two levels that divides the job classes into female dominated or not, fit appropriate models, and summarize your results. Be mindful of the need to transform variables and the possibility of weighting.

An alternative to using a factor for female-dominated jobs is to use the term NW/NE, the fraction of women in the job class. Repeat the last problem, but encoding the information about sex using this variable in place of the factor.

World cities. The Union Bank of Switzerland publishes a report entitled Prices and Earnings Around the Globe on its internet web site. The data in the file BigMac2003.txt, described in Table 7.4, are taken from the 2003 version for 70 world cities.

Draw the scatterplot with BigMac on the vertical axis and FoodIndex on the horizontal axis. Provide a qualitative description of this graph.
Use an inverse fitted value plot and the Box-Cox method to find a transformation of BigMac so that the resulting scatterplot has a linear mean function. Two of the cities, with very large values for BigMac, are very influential for selecting a transformation. You should do this exercise with all the cities and with those two cities removed.

Draw the scatterplot matrix of the three variables (BigMac, Rice, Bread), and use the multivariate Box-Cox procedure to decide on normalizing transformations. Test the null hypothesis that λ = (1, 1, 1) against a

general alternative. Does deleting Karachi and Nairobi change your conclusions?

TABLE 7.4 Global Price Comparison Data

Variable    Description
BigMac      Minutes of labor to buy a Big Mac hamburger, based on a typical wage averaged over 13 occupations
Bread       Minutes of labor to buy 1 kg of bread
Rice        Minutes of labor to buy 1 kg of rice
Bus         Lowest cost of a 10 km public transit trip
FoodIndex   Food price index, Zurich = 100
TeachGI     Primary teacher's gross annual salary, thousands of US dollars
TeachNI     Primary teacher's net annual salary, thousands of US dollars
TaxRate     100 × (TeachGI − TeachNI)/TeachGI. In some places, this is negative, suggesting a government subsidy rather than a tax
TH          Teacher's hours per week of work
Apt         Monthly rent in US dollars of a typical three-room apartment
City        City name

Source: Most of the data are from the Union Bank of Switzerland publication Prices and Earnings Around the Globe, 2003 edition.

Set up the regression using the four terms log(Bread), log(Bus), log(TeachGI), and Apt^0.33, and with response BigMac. Draw the inverse fitted value plot of ŷ versus BigMac. Estimate the best power transformation. Check on the adequacy of your estimate by refitting the regression model with the transformed response and drawing the inverse fitted value plot again. If the transformation was successful, this second inverse fitted value plot should have a linear mean function.

The data in the file wool.txt were introduced in Section 6.3. For this problem, we will start with Cycles, rather than its logarithm, as the response.

Draw the scatterplot matrix for these data and summarize the information in this plot.

View all three predictors as factors with three levels, and without transforming Cycles, fit the second-order mean function with terms for all main effects and all two-factor interactions. Summarize results.

Fit the first-order mean function consisting only of the main effects. From Problem 7.6.2, this mean function is not adequate for these data based on using Cycles as the response.
Use both the inverse fitted value plot and the Box-Cox method to select a transformation for Cycles based on the first-order mean function.

In the transformed scale, refit the second-order model, and show that none of the interactions are required in this scale. For this problem, the transformation leads to a much simpler model than is required for

the response in the original scale. This is an example of removable nonadditivity.

Justify transforming Miles in the Fuel data.

The data file UN3.txt contains data described in Table 7.5. There are data for n = 125 localities, mostly UN member countries, for which values are observed for all the variables recorded. Consider the regression problem with ModernC as the response variable and the other variables in the file as defining terms.

Select appropriate transformations of the predictors to be used as terms. (Hint: Since Change is negative for some localities, the Box-Cox family of transformations cannot be used directly.)

Given the transformed predictors as terms, select a transformation for the response.

Fit the regression using the transformations you have obtained, and summarize your results.

TABLE 7.5 Description of Variables in the Data File UN3.txt

Variable    Description
Locality    Country/locality name
ModernC     Percent of unmarried women using a modern method of contraception
Change      Annual population growth rate, percent
PPgdp       Per capita gross national product, US dollars
Frate       Percent of females over age 15 economically active
Pop         Total 2001 population, 1000s
Fertility   Expected number of live births per female, 2000
Purban      Percent of population that is urban, 2001

Source: The data refer to values collected between 2000 and 2003.

CHAPTER 8

Regression Diagnostics: Residuals

So far in this book, we have mostly used graphs to help us decide what to do before fitting a regression model. Regression diagnostics are used after fitting, to check whether a fitted mean function and assumptions are consistent with observed data. The basic statistics here are the residuals or possibly rescaled residuals. If the fitted model does not give a set of residuals that appear to be reasonable, then some aspect of the model, either the assumed mean function or the assumptions concerning the variance function, may be called into doubt.

A related issue is the importance of each case on estimation and other aspects of the analysis. In some data sets, the observed statistics may change in important ways if one case is deleted from the data. Such a case is called influential, and we shall learn to detect such cases. We will be led to study and use two relatively unfamiliar diagnostic statistics, called distance measures and leverage values. We concentrate on graphical diagnostics but include numerical quantities that can aid in interpretation of the graphs.

8.1 THE RESIDUALS

Using the matrix notation outlined in Chapter 3, we begin by deriving the properties of residuals. The basic multiple linear regression model is given by

Y = Xβ + e,  Var(e) = σ²I  (8.1)

where X is a known matrix with n rows and p columns, including a column of 1s for the intercept if the intercept is included in the mean function. We will further assume that we have selected a parameterization for the mean function so that X has full column rank, meaning that the inverse (X'X)⁻¹ exists; as we have seen previously, this is not an important limitation on regression models because we can always delete terms from the mean function, or equivalently delete columns from X, until we have full rank. The p × 1 vector β is the unknown parameter vector.

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

The vector e consists of unobservable errors that we assume are equally variable and uncorrelated, unless stated otherwise. In fitting model (8.1), we estimate β by β̂ = (X'X)⁻¹X'Y, and the fitted values Ŷ corresponding to the observed values Y are then given by

Ŷ = Xβ̂ = X(X'X)⁻¹X'Y = HY  (8.2)

where H is the n × n matrix defined by

H = X(X'X)⁻¹X'  (8.3)

H is called the hat matrix because it transforms the vector of observed responses Y into the vector of fitted responses Ŷ. The vector of residuals ê is defined by

ê = Y − Ŷ = Y − Xβ̂ = Y − X(X'X)⁻¹X'Y = (I − H)Y  (8.4)

Difference Between ê and e

The errors e are unobservable random variables, assumed to have zero mean and uncorrelated elements, each with common variance σ². The residuals ê are computed quantities that can be graphed or otherwise studied. Their mean and variance, using (8.4) and Appendix A.7, are

E(ê) = 0,  Var(ê) = σ²(I − H)  (8.5)

Like the errors, each of the residuals has zero mean, but each residual may have a different variance. Unlike the errors, the residuals are correlated. From (8.4), the residuals are linear combinations of the errors. If the errors are normally distributed, so are the residuals. If the intercept is included in the model, then the sum of the residuals is 0: 1'ê = Σ ê_i = 0. In scalar form, the variance of the ith residual is

Var(ê_i) = σ²(1 − h_ii)  (8.6)

where h_ii is the ith diagonal element of H. Diagnostic procedures are based on the computed residuals, which we would like to assume behave as the unobservable errors would. The usefulness of this assumption depends on the hat matrix, since it is H that relates e to ê and also gives the variances and covariances of the residuals.
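The quantities in (8.2) through (8.6) are easy to compute directly. The following Python/numpy sketch (an assumption; the book's own computations are done in statistical packages) builds the hat matrix, residuals, and residual variances for simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
# Design matrix with an intercept column, so X has p = 3 columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(0.0, 1.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y          # betahat = (X'X)^{-1} X'y
H = X @ XtX_inv @ X.T                # hat matrix, (8.3)
yhat = H @ y                         # fitted values, (8.2)
ehat = y - yhat                      # residuals, (8.4): (I - H) y
h = np.diag(H)                       # leverages h_ii
sigma2hat = ehat @ ehat / (n - p)    # usual estimate of sigma^2
var_ehat = sigma2hat * (1.0 - h)     # estimated Var(ehat_i), as in (8.6)
```

Because the intercept is in the mean function, the residuals sum to zero, and each residual has its own estimated variance through its leverage.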

The Hat Matrix

H is n × n and symmetric, with many special properties that are easy to verify directly from (8.3). Multiplying X on the left by H leaves X unchanged, HX = X. Similarly, (I − H)X = 0. The property HH = H² = H also shows that H(I − H) = 0, so the covariance between the fitted values HY and the residuals (I − H)Y is

Cov(Ŷ, ê) = Cov(HY, (I − H)Y) = σ²H(I − H) = 0

Another name for H is the orthogonal projection on the column space of X. The elements of H, the h_ij, are given by

h_ij = x_i'(X'X)⁻¹x_j = x_j'(X'X)⁻¹x_i = h_ji  (8.7)

Many helpful relationships can be found between the h_ij. For example,

Σ_{i=1}^{n} h_ii = p  (8.8)

and, if the mean function includes an intercept,

Σ_{i=1}^{n} h_ij = Σ_{j=1}^{n} h_ij = 1  (8.9)

Each diagonal element h_ii is bounded below by 1/n and above by 1/r, if r is the number of rows of X that are identical to x_i. As can be seen from (8.6), cases with large values of h_ii will have small values for Var(ê_i); as h_ii gets closer to 1, this variance will approach 0. For such a case, no matter what value of y_i is observed for the ith case, we are nearly certain to get a residual near 0. Hoaglin and Welsch (1978) pointed this out using a scalar version of (8.2),

ŷ_i = Σ_{j=1}^{n} h_ij y_j = h_ii y_i + Σ_{j≠i} h_ij y_j  (8.10)

In combination with (8.9), equation (8.10) shows that as h_ii approaches 1, ŷ_i gets closer to y_i. For this reason, they called h_ii the leverage of the ith case. Cases with large values of h_ii will have unusual values for x_i. Assuming that the intercept is in the mean function, and using the notation of the deviations from the average cross-products matrix discussed in Chapter 3, h_ii can be written as

h_ii = 1/n + (x*_i − x̄)'(𝒳'𝒳)⁻¹(x*_i − x̄)  (8.11)

where 𝒳 is the matrix of deviations from the column averages. The second term on the right-hand side of (8.11) is the equation of an ellipsoid centered at x̄, and x_i = (1, x*_i')'.
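The identities HX = X, H(I − H) = 0, and the centered form (8.11) of the leverages can all be checked numerically. A Python/numpy sketch on simulated data (an assumption; not code from the book):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
Z = rng.normal(size=(n, 2))               # the two non-constant terms
X = np.column_stack([np.ones(n), Z])      # mean function with intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverages

# (8.11): with an intercept, h_ii = 1/n + (z_i - zbar)'(Zc'Zc)^{-1}(z_i - zbar),
# where Zc holds the deviations of the terms from their column averages.
Zc = Z - Z.mean(axis=0)
M = np.linalg.inv(Zc.T @ Zc)
h_alt = 1.0 / n + np.einsum('ij,jk,ik->i', Zc, M, Zc)
```

The quadratic form in the second line of the computation is exactly the ellipsoid term of (8.11), which is why contours of constant leverage are ellipses, as in Figure 8.1.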

FIG. 8.1 Contours of constant leverage in two dimensions (log(PPgdp) versus Purban, with the point for Liechtenstein marked).

For example, consider again the United Nations data of Section 3.1. The plot of log(PPgdp) versus Purban is given in the scatterplot in Figure 8.1. The ellipses drawn on the graph correspond to elliptical contours of constant h_ii, for h_ii = 0.02, 0.04, 0.06, 0.08, and 0.10. Any point that falls exactly on the outer contour would have h_ii = 0.10, while points on the innermost contour have h_ii = 0.02. Points near the long or major axis of the ellipsoid need to be much farther away from x̄, in the usual Euclidean distance sense, than do points closer to the minor axis, to have the same values for h_ii.¹ In the example, the localities with the highest level of urbanization, which are Bermuda, Hong Kong, Singapore, and Guadeloupe, all with 100% urbanization, do not have particularly high leverage, as all the points for these places are between the contours for h_ii = 0.04 and 0.06. None of the h_ii is very large, with the largest value for the marked point for Liechtenstein, which has relatively high income for relatively low urbanization. In other problems, high-leverage points with values close to 1 can occur, and identifying these cases is very useful in understanding a regression problem.

¹The term Purban is a percentage between 0 and 100. Contours of constant leverage corresponding to Purban > 100 are shown to give the shape of the contours, even though in this particular problem points could not occur in this region.

Residuals and the Hat Matrix with Weights

When Var(e) = σ²W⁻¹, with W a known diagonal matrix of positive weights as in Section 5.1, all the results so far in this section require some modification. A useful version of the hat matrix is given by (see Problem 8.9)

H = W^{1/2}X(X'WX)⁻¹X'W^{1/2}  (8.12)

and the leverages are the diagonal elements of this matrix. The fitted values are given as usual by Ŷ = Xβ̂, where now β̂ is the wls estimator. The definition of the residuals is a little trickier. The obvious definition of a residual is, in scalar version, y_i − β̂'x_i, but this choice has important deficiencies. First, the sum of squares of these residuals will not equal the residual sum of squares because the weights are ignored. Second, the variance of the ith residual will depend on the weight of case i. Both of these problems can be solved by defining residuals for weighted least squares, for i = 1, ..., n, by

ê_i = √w_i (y_i − β̂'x_i)  (8.13)

The sum of squares of these residuals is the residual sum of squares, and the variance of the residuals does not depend on the weight. When all the weights are equal to 1, (8.13) reduces to (8.4). In drawing graphs and in the other diagnostic procedures discussed in this book, (8.13) should be used to define residuals. Some computer packages use the unweighted residuals rather than (8.13) by default. There is no consistent name for these residuals. For example, in R and S-plus, the residuals defined by (8.13) are called Pearson residuals in some functions, and weighted residuals elsewhere. In this book, the symbols ê_i or ê always refer to the residuals defined by (8.13).

The Residuals When the Model Is Correct

Suppose that U is equal to one of the terms in the mean function, or some linear combination of the terms. Residuals are generally used in scatterplots of the residuals ê against U. The key features of these residual plots when the correct model is fit are as follows:

1. The mean function is E(ê|U) = 0. This means that the scatterplot of residuals versus any linear combination of the terms should have a constant mean function equal to 0.
2. Since Var(ê_i|U) = σ²(1 − h_ii) even if the fitted model is correct, the variance function is not quite constant.
The variability will be smaller for high-leverage cases with h_ii close to 1.
3. The residuals are correlated, but this correlation is generally unimportant and not visible in residual plots.

When the model is correct, residual plots should look like null plots.

The Residuals When the Model Is Not Correct

If the fitted model is based on incorrect assumptions, there will be a plot of residuals versus some term or combination of terms that is not a null plot. Figure 8.2 shows several generic residual plots for a simple linear regression problem. The first plot is

a null plot that indicates no problems with the fitted model. From Figures 8.2b, c, and d, in simple regression, we would infer nonconstant variance as a function of the quantity plotted on the horizontal axis. The curvature apparent in Figures 8.2e, f, g, and h suggests an incorrectly specified mean function. Figures 8.2g and h suggest both curvature and nonconstant variance.

FIG. 8.2 Residual plots: (a) null plot; (b) right-opening megaphone; (c) left-opening megaphone; (d) double outward box; (e), (f) nonlinearity; (g), (h) combinations of nonlinearity and nonconstant variance function.

In models with many terms, we cannot necessarily associate shapes in a residual plot with a particular problem with the assumptions. For example, Figure 8.3 shows a residual plot for the fit of the mean function E(Y|X = x) = β0 + β1 x1 + β2 x2 for the artificial data given in the file caution.txt from Cook and Weisberg

(1999b). The right-opening megaphone is clear in this graph, suggesting nonconstant variance. But these data were actually generated using a mean function

E(Y|X = x) = |x1|/(1.5 + x2)²  (8.14)

with constant variance, with scatterplot matrix given in Figure 8.4. The real problem is that the mean function is wrong, even though from the residual plot, nonconstant variance appears to be the problem. A nonnull residual plot in multiple regression indicates that something is wrong but does not necessarily tell what is wrong.

FIG. 8.3 Residual plot (residuals versus fitted values) for the caution data.

Residual plots in multiple regression can be interpreted just as residual plots in simple regression if two conditions are satisfied. First, the predictors should be approximately linearly related. The second condition is on the mean function: we must be able to write the mean function in the form E(Y|X = x) = g(β'x) for some unspecified function g. If either of these conditions fails, then residual plots cannot be interpreted as in simple regression (Cook and Weisberg, 1999b). In the caution data, the first condition is satisfied, as can be verified by looking at the plot of X1 versus X2 in Figure 8.4, but the second condition fails because (8.14) cannot be written as a function of a single linear combination of the terms.

Fuel Consumption Data

According to theory, if the mean function and other assumptions are correct, then all possible residual plots of residuals versus any function of the terms should resemble a null plot, so many plots of residuals should be examined. Usual choices

include plots versus each of the terms and versus fitted values, as shown in Figure 8.5 for the fuel consumption data. None of the plots versus individual terms in Figure 8.5a-d suggests any particular problems, apart from the relatively large positive residual for Wyoming and the large negative residual for Alaska. In some of the graphs, the point for the District of Columbia is separated from the others.

FIG. 8.4 Scatterplot matrix for the caution data (X1, X2, Y).

Wyoming is large but sparsely populated with a well-developed road system. Driving long distances for the necessities of life, like going to see a doctor, will be common in this state. While Alaska is also very large and sparsely populated, most people live in relatively small areas around cities. Much of Alaska is not accessible by road. These conditions should result in lower use of motor fuel than might otherwise be expected. The District of Columbia is a very compact urban area with good rapid transit, so use of cars will generally be less. It has a small residual but unusual values for the terms in the mean function, so it is separated horizontally from most of the rest of the data. The District of Columbia has high leverage (h_{9,9} = 0.415), while the other two are candidates for outliers. We will return to these issues in the next chapter.

FIG. 8.5 Residual plots for the fuel consumption data: (a)-(d) residuals versus the terms Tax, Dlic, Income, and log(Miles); (e) residuals versus fitted values; (f) the response Fuel versus fitted values. The points for Wyoming (WY), Alaska (AK), and the District of Columbia (DC) are marked.

Figure 8.5e is a plot of residuals versus the fitted values, which are just a linear combination of the terms. Some computer packages will produce this graph as the only plot of residuals, and if only one plot were possible, this would be the plot to draw, as it contains some information from all the terms in the mean function. There is a hint of curvature in this plot, possibly suggesting that the mean function is not adequate for the data. We will look at this more carefully in the next section.

Figure 8.5f is different from the others because it is not a plot of residuals but rather a plot of the response versus the fitted values. This is really just a rescaling of Figure 8.5e. If the mean function and the assumptions appear to be sensible, then this figure is a summary graph for the regression problem. The mean function for the graph should be the straight line shown in the figure, which is just the fitted ols simple regression of the response on the fitted values.

TABLE 8.1 Significance Levels for the Lack-of-Fit Tests for the Residual Plots in Figure 8.5

Term           Test Stat.   Pr(>|t|)
Tax
Dlic
Income
log(Miles)
Fitted values

8.2 TESTING FOR CURVATURE

Tests can be computed to help decide if residual plots such as those in Figure 8.5 are null plots or not. One helpful test looks for curvature in this plot. Suppose we have a plot of residuals ê versus a quantity U on the horizontal axis, where U could be a term in the mean function or a combination of terms². A simple test for curvature is to refit the original mean function with an additional term for U² added. The test for curvature is then based on the t-statistic for testing the coefficient for U² to be 0. If U does not depend on estimated coefficients, then a usual t-test of this hypothesis can be used. If U is equal to the fitted values, so that it depends on the estimated coefficients, then the test statistic should be compared with the standard normal distribution to get significance levels. This latter case is called Tukey's test for nonadditivity (Tukey, 1949).

Table 8.1 gives the lack-of-fit tests for the residual plots in Figure 8.5. None of the tests has a small significance level, providing no evidence against the mean function.

As a second example, consider again the United Nations data from Section 3.1, with response log(Fertility) and two predictors log(PPgdp) and Purban. The apparent linearity in all the frames of the scatterplot matrix in Figure 8.6 suggests that the mean function

E(log(Fertility)|log(PPgdp), Purban) = β0 + β1 log(PPgdp) + β2 Purban  (8.15)

should be appropriate for these data. Plots of residuals versus the two terms and versus fitted values are shown in Figure 8.7. Without reference to the curved lines shown on the plots, the visual appearance of these plots is satisfactory, with no obvious curvature or nonconstant variance. However, the curvature tests tell a different story.
In each of the graphs, the test statistic shown in Table 8.2 has a p-value of 0 to two decimal places, suggesting that the mean function (8.15) is not adequate for these data. We will return to this example later in the book.

²If U is a polynomial term, for example, U = X1², where X1 is another term in the mean function, this procedure is not recommended.
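The curvature test is simple enough to write out directly: refit with U² added and examine the t-statistic for its coefficient. A Python/numpy sketch on simulated data (an assumption; not from the book), with U taken to be the fitted values as in Tukey's test:

```python
import numpy as np

def curvature_test(X, y, u):
    """Lack-of-fit test for the residual plot of ehat versus u:
    add u^2 to the mean function and return the t-statistic for its
    coefficient (compare with N(0,1) when u is the fitted values)."""
    Xa = np.column_stack([X, u ** 2])
    n, p = Xa.shape
    XtX_inv = np.linalg.inv(Xa.T @ Xa)
    b = XtX_inv @ Xa.T @ y
    res = y - Xa @ b
    s2 = res @ res / (n - p)
    se = np.sqrt(s2 * XtX_inv[-1, -1])
    return float(b[-1] / se)

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])

# A curved truth: the straight-line mean function should fail the test...
y_curved = 1.0 + 2.0 * x + 1.5 * x ** 2 + rng.normal(0.0, 0.5, size=n)
fit1 = X @ np.linalg.lstsq(X, y_curved, rcond=None)[0]
t_curved = curvature_test(X, y_curved, fit1)

# ...while a correct straight-line mean function should not.
y_line = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=n)
fit2 = X @ np.linalg.lstsq(X, y_line, rcond=None)[0]
t_line = curvature_test(X, y_line, fit2)
```

The statistic is large when the mean function is curved and behaves roughly like a standard normal deviate when the straight-line mean function is correct.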

FIG. 8.6 Scatterplot matrix for three variables in the UN data: log(PPgdp), Purban, and log(Fertility).

TABLE 8.2 Lack-of-Fit Tests for the UN Data

Term         Test Stat.   Pr(>|t|)
log(PPgdp)
Purban
Tukey test

8.3 NONCONSTANT VARIANCE

A nonconstant variance function in a residual plot may indicate that a constant variance assumption is false. There are at least four basic remedies for nonconstant variance. The first is to use a variance stabilizing transformation, since replacing Y by Y_T may induce constant variance in the transformed scale. A second option is to find empirical weights that could be used in weighted least squares. Weights

that are simple functions of single predictors, such as Var(Y|X) = σ²X1, with X1 > 0, can sometimes be justified theoretically. If replication is available, then within-group variances may be used to provide approximate weights. Although beyond the scope of this book, it is also possible to use generalized least squares and estimate weights and coefficients simultaneously (see, for example, Pinheiro and Bates, 2000, Section 5.1.2).

The third option is to do nothing. Estimates of parameters, given a misspecified variance function, remain unbiased, if somewhat inefficient. Tests and confidence intervals computed with the wrong variance function will be inaccurate, but the bootstrap may be used to get more accurate results. The final option is to use regression models that account for nonconstant variance that is a function of the mean. These are called generalized linear models and are discussed in the context of logistic regression in Chapter 12. In this section, we consider primarily the first two options.

FIG. 8.7 Residual plots for the UN data (residuals versus log(PPgdp), Purban, and fitted values). The dotted curved lines are quadratic polynomials fit to the residual plot and do not correspond exactly to the lack-of-fit tests that add a quadratic term to the original mean function.

Variance Stabilizing Transformations

Suppose that the response is strictly positive, and the variance function before transformation is

Var(Y|X = x) = σ²g(E(Y|X = x))  (8.16)

where g(E(Y|X = x)) is a function that is increasing with the value of its argument. For example, if the distribution of Y|X has a Poisson distribution, then g(E(Y|X = x)) = E(Y|X = x), since for Poisson variables, the mean and variance are equal. For distributions in which the mean and variance are functionally related as in (8.16), Scheffé (1959, Section 10.7) provides a general theory for determining transformations that can stabilize variance, so that Var(Y_T|X = x) will be approximately constant. Table 8.3 lists the common variance stabilizing transformations. Of course, transforming away nonconstant variance can introduce nonlinearity into the mean function, so this option may not always be reasonable.

The square root, log(Y), and 1/Y transformations are appropriate when variance increases or decreases with the response, but each is more severe than the one before it. The square-root transformation is relatively mild and is most appropriate when the response follows a Poisson distribution, usually the first model considered for errors in counts. The logarithm is the most commonly used transformation; the base of the logarithms is irrelevant. It is appropriate when the error standard deviation is a percent of the response, such as ±10% of the response, not ±10 units, so Var(Y|X) = σ²[E(Y|X)]². The reciprocal or inverse transformation is often applied when the response is a time until an event, such as time to complete a task, or time until healing. This converts times per event to a rate per unit time; often the transformed measurements may be multiplied by a constant to avoid very small numbers. Rates can provide a natural measurement scale.

TABLE 8.3 Common Variance Stabilizing Transformations

Y_T       Comments
√Y        Used when Var(Y|X) ∝ E(Y|X), as for Poisson distributed data. Y_T = √Y + √(Y + 1) can be used if all the counts are small (Freeman and Tukey, 1950).
log(y ) Use i Var(Y X) [E(Y X)] 2. In this case, the errors behave like a percentage o the response, ±10%, rather than an absolute deviation, ±10 units. 1/Y The inverse transoration stabilizes variance when Var(Y X) [E(Y X)] 4. It can be appropriate when responses are ostly close to 0, but occasional large values occur. sin 1 ( Y) The arcsine square-root transoration is used i Y is a proportion between 0 and 1, but it can be used ore generally i y has a liited range by irst transoring Y to the range (0, 1), and then applying the transoration.

A Diagnostic for Nonconstant Variance

Cook and Weisberg (1983) provided a diagnostic test for nonconstant variance. Suppose now that Var(Y|X) depends on an unknown vector parameter λ and a known set of terms Z, with observed values for the ith case z_i. For example, if Z = Y, then variance depends on the response. Similarly, Z may be the same as X, a subset of X, or indeed it could be completely different from X, perhaps indicating spatial location or time of observation. We assume that

Var(Y|X, Z = z) = σ² exp(λ'z)  (8.17)

This complicated form says that (1) Var(Y|Z = z) > 0 for all z, because the exponential function is never negative; (2) variance depends on z and λ, but only through the linear combination λ'z; (3) Var(Y|Z = z) is monotonic, either increasing or decreasing, in each component of Z; and (4) if λ = 0, then Var(Y|Z = z) = σ². The results of Chen (1983) suggest that the tests described here are not very sensitive to the exact functional form used in (8.17), so the use of the exponential function is relatively benign, and any form that depends on the linear combination λ'z would lead to very similar inference.

Assuming that errors are normally distributed, a score test of λ = 0 is particularly simple to compute using standard regression software. The test is carried out using the following steps:

1. Compute the ols fit with the mean function as if

E(Y|X = x) = β'x,  Var(Y|X, Z = z) = σ²

or equivalently, λ = 0. Save the residuals ê_i.

2. Compute scaled squared residuals u_i = ê_i²/σ̃² = nê_i²/[(n − p)σ̂²], where σ̃² = Σê_j²/n is the maximum likelihood estimate of σ², which differs from σ̂² only by the divisor of n rather than n − p. We combine the u_i into a variable U.

3. Compute the regression with the mean function E(U|Z = z) = λ0 + λ'z. Obtain SSreg for this regression, with df = q, the number of components in Z. If variance is thought to be a function of the responses, then in this regression, replace Z by the fitted values from the regression in step 1. The SSreg then will have 1 df.
4. Compute the score test, S = SSreg/2. The significance level for the test can be obtained by comparing S with its asymptotic distribution, which, under the hypothesis λ = 0, is χ²(q). If λ ≠ 0, then S will be too large, so large values of S provide evidence against the hypothesis of constant variance.
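The four steps translate directly into code. A Python/numpy sketch (an assumption; the book carries out the test with standard regression software) on simulated data, with variance suspected to depend on a single term z = x, so q = 1:

```python
import numpy as np

def score_test(X, y, Z):
    """Cook-Weisberg score test for nonconstant variance.
    X: terms for the mean function (with intercept column);
    Z: terms the variance may depend on (no intercept column).
    Returns S = SSreg/2, compared with chi-squared on Z.shape[1] df."""
    n = y.size
    # Step 1: ols fit and residuals
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    # Step 2: scaled squared residuals, using the mle sigma-tilde^2
    u = e ** 2 / (e @ e / n)
    # Step 3: regression of U on Z; SSreg = total SS minus residual SS
    Za = np.column_stack([np.ones(n), Z])
    r = u - Za @ np.linalg.lstsq(Za, u, rcond=None)[0]
    ssreg = float(np.sum((u - u.mean()) ** 2) - r @ r)
    # Step 4: the score statistic
    return ssreg / 2.0

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])

# Variance increasing with x, as in the snow geese example below
y_het = 2.0 + 0.5 * x + rng.normal(size=n) * x
S_het = score_test(X, y_het, x.reshape(-1, 1))

# Constant variance: S should look like a chi-squared(1) draw
y_hom = 2.0 + 0.5 * x + rng.normal(size=n)
S_hom = score_test(X, y_hom, x.reshape(-1, 1))
```

Under the alternative the statistic is far out in the tail of the χ²(1) reference distribution, while under constant variance it is a typical χ²(1) value.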

If we had started with a set of known weights, then the score test could be based on fitting the variance function

Var(Y|X, Z = z) = (σ²/w) exp(λ′z)    (8.18)

The null hypothesis for the score test is then Var(Y|X, Z = z) = σ²/w versus the alternative given by (8.18). The test is exactly the same as outlined above, except that in step 1, the wls fit with weights w is used in place of the ols fit, and in the remaining steps, the weighted or Pearson residuals given by (8.13) are used, not unweighted residuals.

Snow Geese

The relationship between photo = photo count, obs1 = count by observer 1, and obs2 = count by observer 2 of flocks of snow geese in the Hudson Bay area of Canada is discussed in Problem 5.5 of Chapter 5. The data are displayed in Figure 8.8. We see in the graph that (1) there is substantial disagreement between the observers; (2) the observers cannot predict the photo count very well; and (3) the variability appears to be larger for larger flocks.

FIG. 8.8 The snow geese data. The line on each plot is a loess smooth with smoothing parameter 2/3.

Using the first observer only, we illustrate computation of the score test for constant variance. The first step is to fit the ols regression of photo on obs1 and save the residuals ê_i. From these, we can compute σ̃² = Σê_i²/n, and then compute the u_i = ê_i²/σ̃². We then compute the regression of U on obs1, under the hypothesis suggested by Figure 8.8 that variance increases with obs1. In the analysis of variance for this second regression, the sum of squares for regression for obs1, with 1 df, is 162.82, and the F test gives an extremely small p-value, on the order of 10⁻⁹. The score test for nonconstant variance is S = (1/2)SSreg = (1/2)(162.82) = 81.41, which, when compared with the chi-squared distribution with one df, gives an extremely small p-value. The hypothesis of constant residual variance is not tenable. The analyst must now cope with the almost certain nonconstant variance evident in the data. Two courses of action are outlined in the problems at the end of this chapter.

Sniffer Data

When gasoline is pumped into a tank, hydrocarbon vapors are forced out of the tank and into the atmosphere. To reduce this significant source of air pollution, devices are installed to capture the vapor. In testing these vapor recovery systems, a "sniffer" measures the amount recovered. To estimate the efficiency of the system, some method of estimating the total amount given off must be used. To this end, a laboratory experiment was conducted in which the amount of vapor given off was measured under controlled conditions. Four predictors are relevant for modeling:

TankTemp = initial tank temperature (degrees F)
GasTemp = temperature of the dispensed gasoline (degrees F)
TankPres = initial vapor pressure in the tank (psi)
GasPres = vapor pressure of the dispensed gasoline (psi)

The response is the hydrocarbons Y emitted in grams. The data, kindly provided by John Rice, are given in the data file sniffer.txt, and are shown in Figure 8.9.
The clustering of points in many of the frames of this scatterplot is indicative of the attempt of the experimenters to set the predictors at a few nominal values, but the actual values of the predictors measured during the experiment were somewhat different from the nominal values. We also see that (1) the predictors are generally linearly related, so transformations are unlikely to be desirable here, and (2) some of the predictors, notably the two pressure variables, are closely linearly related, suggesting, as we will see in Chapter 10, that using both in the mean function may not be desirable. For now, however, we will use all four terms and begin with the mean function including all four predictors as terms and fit via ols as if the

FIG. 8.9 Scatterplot matrix for the sniffer data.

variance function were constant. Several plots of the residuals for this regression are shown in Figure 8.10. Figure 8.10a is the plot of residuals versus fitted values. While this plot is far from perfect, it does not suggest the need to worry much about the assumption of nonconstant variance. Figures 8.10b and c, which are plots of residuals against TankTemp and GasPres, respectively, give a somewhat different picture, as particularly in Figure 8.10c variance does appear to increase from left to right. Because none of the graphs in Figure 8.9 have clearly nonlinear mean functions, the inference that variance may not be constant can be tentatively adopted from the residual plots.

Table 8.4 gives the results of several nonconstant variance score tests, each computed using a different choice for Z. Each of these tests is just half the sum of squares for regression for U on the choice of Z shown. The plot shown in Figure 8.10d has on its horizontal axis the fitted values from the regression of U on all four predictors, and corresponds to the last line of Table 8.4. From Table 8.4, we would diagnose nonconstant variance as a function of various choices of Z. We can compare nested choices for Z by taking the difference between the score tests and comparing the result with the χ² distribution with df

FIG. 8.10 Residual plots for the sniffer data with variance assumed to be constant: (a) versus fitted values; (b) versus tank temperature; (c) versus gas pressure; (d) versus the linear combination in the last line of Table 8.4.

TABLE 8.4 Score Tests for Sniffer Data

Choice for Z           df     S      p-value
TankTemp                1    5.50    0.019
GasPres                 1    9.71    0.002
TankTemp, GasPres       2   11.78    0.003
TankTemp, GasTemp       2    5.78    0.056
TankPres, GasPres       2    9.72    0.008
Fitted values           1    4.80    0.028

equal to the difference in their df (Hinkley, 1985). For example, to compare the 4 df choice of Z to Z = (TankTemp, GasPres), we can compute the difference between the two score statistics, 1.98, with 4 − 2 = 2 df, to get a p-value of about 0.37, and so the simpler Z with two terms is adequate. Comparing Z = (TankTemp, GasPres) with Z = GasPres,

the test statistic is 11.78 − 9.71 = 2.07 with 2 − 1 = 1 df, giving a p-value of about 0.15, so once again the simpler choice of Z seems adequate. Combining these tests, we would conclude that the variance is primarily a function of GasPres. A reasonable approach to working with these data is to assume that Var(Y|X, Z) = σ² GasPres and use 1/GasPres as weights in weighted least squares.

Additional Comments

Some computer packages will include functions for the score test for nonconstant variance. With other computer programs, it may be more convenient to compute the score test as follows: (1) Compute the residuals ê_i from the regression of Y on X; let σ̂² be the usual residual mean square from this regression. (2) Compute the regression of ê_i² on Z for the choice of Z of interest, and let SSreg(Z) be the resulting sum of squares for regression. (3) Compute S = (1/2)SSreg(Z)/[(n − p′)σ̂²/n]².

Pinheiro and Bates (2000, Section 5.2.1) present methodology and software for estimating weights using models similar to those discussed here.

8.4 GRAPHS FOR MODEL ASSESSMENT

Residual plots are used to examine regression models to see if they fail to match observed data. If systematic failures are found, then models may need to be reformulated to find a better fitting model. A closely related problem is assessing how well a model matches the data. Let us first think about a regression with one predictor in which we have fitted a simple linear regression model, and the goal is to summarize how well the fitted model matches the observed data. The lack-of-fit tests developed in Sections 5.3 and 6.1 approach the question of goodness of fit from a testing point of view. We now look at this issue from a graphical point of view using marginal model plots. We illustrate the idea first with a problem with just one predictor. In Section 7.1.2, we discussed the regression of Height on Dbh for a sample of western red cedar trees from Upper Flat Creek, Idaho.
The mean function

E(Height|Dbh) = β₀ + β₁Dbh    (8.19)

was shown to be a poor summary of these data, as can be seen in Figure 8.11. Two smooths are given on the plot. The ols fit of (8.19), shown as a dashed line, estimates the mean function only if the simple linear regression model is correct. The loess fit, shown as a solid line, estimates the mean function regardless of the fit of the simple linear regression model. If we judge these two fits to be different, then we have visual evidence against the simple linear regression mean function.
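The comparison of the two fits can be sketched in code. The snippet below uses simulated curved data, and a simple kernel smoother stands in for loess; the data, smoother, and bandwidth are assumptions for the sketch, not the tree data or the book's smoother:

```python
import numpy as np

def kernel_smooth(x, y, x0, bandwidth):
    """Nadaraya-Watson kernel smoother: a rough stand-in for loess."""
    w = np.exp(-0.5 * ((x - x0[:, None]) / bandwidth) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 150))
y = 10 + 3 * x - 0.25 * x**2 + rng.normal(scale=1.0, size=150)  # curved truth

# Dashed line in Figure 8.11: the ols simple linear regression fit.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
ols_line = X @ beta

# Solid line: a smooth that does not assume the straight-line mean function.
loess_like = kernel_smooth(x, y, x, bandwidth=0.8)

# If the two estimates of E(Y|X) disagree noticeably, the straight-line
# mean function is suspect.
disagreement = np.max(np.abs(ols_line - loess_like))
```

Because the true mean function here is curved, the straight ols line and the smooth separate clearly near the edges of the plot, which is exactly the visual evidence described above.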

FIG. 8.11 Model checking plot for the simple linear regression for western red cedar trees at Upper Flat Creek. The dashed line is the ols simple linear regression fit, and the solid line is a loess smooth.

The loess fit is clearly curved, so the mean function (8.19) is not a very good summary of this regression problem. Although Figure 8.11 includes the data, the primary focus in this plot is comparing the two curves, using the data as background mostly to help choose the smoothing parameter for the loess smooth, to help visualize variation, and to locate any extreme or unusual points.

Checking Mean Functions

With more than one predictor, we will look at marginal models to get a sequence of two-dimensional plots to examine. Suppose that the model we have fitted has mean function E(Y|X = x) = β′x, although in what follows the exact form of the mean function is not important. We will draw a plot with the response Y on the vertical axis. On the horizontal axis, we will plot a quantity U that can consist of any function of X we think is relevant, such as fitted values, any of the individual terms in X, or even transformations of them. Fitting a smoother to the plot of Y versus U estimates E(Y|U) without any assumptions. We want to compare this smooth to an estimate of E(Y|U) based on the model. Following Cook and Weisberg (1997), an estimate of E(Y|U), given the model, can be based on application of equation (A.4) in Appendix A.2.4. Under the model, we have

E(Y|U = u) = E[E(Y|X = x)|U = u]

We need to substitute an estimate for E(Y|X = x). On the basis of the model, this expectation is estimated at the observed values of the terms by the fitted values Ŷ.

We get

Ê(Y|U = u) = Ê(Ŷ|U = u)    (8.20)

The implication of this result is that we can estimate E(Y|U = u) by smoothing the scatterplot with U on the horizontal axis and the fitted values Ŷ on the vertical axis. If the model is correct, then the smooth of Y versus U and the smooth of Ŷ versus U should agree; if the model is not correct, these smooths may not agree.

As an example, we return to the United Nations data discussed in Section 8.2, starting with the mean function given by (8.15),

E(log(Fertility)|log(PPgdp), Purban) = β₀ + β₁log(PPgdp) + β₂Purban    (8.21)

and suppose that U = log(PPgdp), one of the terms in the mean function. Figure 8.12 shows plots of log(Fertility) versus U and of Ŷ versus U. The smooth in Figure 8.12a estimates E(log(Fertility)|log(PPgdp)) whether the model is right or not, but the smooth in Figure 8.12b may not give a useful estimate of E(log(Fertility)|log(PPgdp)) if the linear regression model is wrong. Comparison of these two estimated mean functions provides a visual assessment of the adequacy of the mean function for the model. Superimposing the smooth in Figure 8.12b on Figure 8.12a gives the marginal model plot shown in Figure 8.13a. The two fitted curves do not match well, suggesting that the mean function (8.21) is inadequate.

Three additional marginal model plots are shown in Figure 8.13. The plots versus Purban and versus fitted values also exhibit a curved mean smooth based on the data compared to a straight smooth based on the fitted model, confirming the inadequacy of the mean function. The final plot in Figure 8.13d is a little different. The quantity on the horizontal axis is a randomly selected linear combination of Purban and

FIG. 8.12 Plots of log(Fertility) versus log(PPgdp) and Ŷ versus log(PPgdp). In both plots, the curves are loess smooths with smoothing parameters equal to 2/3. If the model has the correct mean function, then these two smooths estimate the same quantity.

FIG. 8.13 Four marginal model plots, versus the two terms in the mean function, fitted values, and a random linear combination of the terms in the mean function.

log(PPgdp). In this direction, both smooths are curved, and they agree fairly well, except possibly at the left edge of the graph. If a fitted model is appropriate for the data, then the two smooths in the marginal model plots will agree for any choice of U, including randomly selected ones. If the model is wrong, then for some choices of U, the two smooths will disagree.

Since the mean function (8.21) is inadequate, we need to consider further modification to get a mean function that matches the data well. One approach is to expand (8.15) by including both quadratic terms and an interaction between log(PPgdp) and Purban. Using the methods described elsewhere in this book, we conclude that the mean function

E(log(Fertility)|log(PPgdp), Purban) = β₀ + β₁log(PPgdp) + β₂Purban + β₂₂Purban²    (8.22)

203 GRAPHS FOR MODEL ASSESSMENT 189 log(fertility) log(fertility) log(ppgdp) (a) Purban (b) log(fertility) log(fertility) Fitted values (c) Rando direction (d) FIG Marginal odel plots or the United Nations data, including a quadratic ter in Purban. atches the data well, as conired by the arginal odel plots in Figure Evidently, adding the quadratic in Purban allows the eect o increasing Purban on log(fertility) to be saller when Purban is large than when Purban is sall Checking Variance Functions Model checking plots can also be used to check or odel inadequacy in the variance unction, which or the ultiple linear regression proble eans checking the constant variance assuption. We call the square root o the variance unction the standard deviation unction. The plot o Y versus U can be used to get the estiate SD data (Y U) o the standard deviation unction, as discussed in Appendix A.5. This estiate o the square root o the variance unction does not depend on a odel.

We need a model-based estimate of the standard deviation function. Applying (A.5) and again substituting Ŷ for E(Y|X = x),

Var(Y|U) = E[Var(Y|X)|U] + Var[E(Y|X)|U]    (8.23)
         ≈ E[σ²|U] + Var[Ŷ|U] = σ² + Var[Ŷ|U]    (8.24)

Equation (8.23) is the general result that holds for any model. Equation (8.24) holds for the linear regression model in which the variance function Var(Y|X) = σ² is constant. According to this result, we can estimate Var(Y|U) under the model by getting a variance smooth of Ŷ versus U, and then adding to this an estimate of σ², for which we use σ̂² from the ols fit of the model. We will call the square root of this estimated variance function SD_model(Y|U). If the model is appropriate for the data, then apart from sampling error, SD_data(Y|U) = SD_model(Y|U), but if the model is wrong, these two functions need not be equal.

For visual display, we show the mean function estimated from the plot ±SD_data(Y|U) using solid lines and the mean function estimated from the model ±SD_model(Y|U) using dashed lines; colored lines would be helpful here. The same smoothing parameter should be used for all the smooths, so any bias due to smoothing will tend to cancel. These smooths for the United Nations example are shown in Figure 8.15, first for the fit of (8.21) and then for the fit of (8.22). For both, the horizontal axis is fitted values, but almost anything could be put on this axis. Apart from the edges of the plot where the smooths are less accurate, these plots do not suggest any problem with nonconstant variance, as the estimated variance functions using the two methods are similar, particularly for the mean function (8.22) that matches the data.

FIG. 8.15 Marginal model plots with standard deviation smooths added. (a) The fit of (8.21). (b) The fit of (8.22).
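The construction of a marginal model plot, including the standard deviation smooths of this section, can be sketched as follows. The data, kernel smoother, and bandwidth below are illustrative assumptions, with the smoother standing in for loess; the model is chosen to be correct so the data-based and model-based smooths should roughly agree:

```python
import numpy as np

def smooth(x, y, x0, bw=1.0):
    # Kernel mean smoother, a stand-in for loess.
    w = np.exp(-0.5 * ((x - x0[:, None]) / bw) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 1 + x1 + 0.5 * x2 + rng.normal(size=n)   # the fitted model is correct here

# Fit the multiple linear regression model.
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
sigma2 = np.sum((y - yhat) ** 2) / (n - 3)

u = x1        # horizontal axis U: here, one of the terms in the mean function
grid = u      # evaluate the smooths at the observed points

# Mean smooths: E(Y|U) from the data, and E(Yhat|U) from the model, per (8.20).
mean_data = smooth(u, y, grid)
mean_model = smooth(u, yhat, grid)

# Standard deviation smooths: SD_data from smoothing squared deviations,
# SD_model from (8.24): sigma-hat squared plus a variance smooth of Yhat.
sd_data = np.sqrt(smooth(u, (y - mean_data) ** 2, grid))
sd_model = np.sqrt(sigma2 + smooth(u, (yhat - mean_model) ** 2, grid))

# With a correct model, both pairs of smooths should roughly agree.
mean_gap = np.max(np.abs(mean_data - mean_model))
sd_gap = np.max(np.abs(sd_data - sd_model))
```

In practice the four curves would be drawn on the plot of Y versus U; here the maximum gaps summarize the agreement numerically.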

The marginal model plots described here can be applied in virtually any regression problem, not just linear regression. For example, Pan, Connett, Porzio, and Weisberg (2001) discuss application to longitudinal data, and Porzio (2002) discusses calibrating marginal model plots for binary regression.

PROBLEMS

8.1. Working with the hat matrix
8.1.1. Prove the results given by (8.7).
8.1.2. Prove that 1/n ≤ h_ii ≤ 1/r, where h_ii is a diagonal entry in H, and r is the number of rows in X that are exactly the same as x_i.

8.2. If the linear trend were removed from Figure 8.5, what would the resulting graph look like?

8.3. This example compares in-field ultrasonic measurements of the depths of defects in the Alaska oil pipeline to measurements of the same defects in a laboratory. The lab measurements were done in six different batches. The goal is to decide if the field measurement can be used to predict the more accurate lab measurement. In this analysis, the field measurement is the response variable and the laboratory measurement is the predictor variable. The data, in the file pipeline.txt, were given at section6/pmd621.htm. The three variables are called Field, the in-field measurement, Lab, the more accurate in-lab measurement, and Batch, the batch number.
8.3.1. Draw the scatterplot of Lab versus Field, and comment on the applicability of the simple linear regression model.
8.3.2. Fit the simple regression model, and get the residual plot. Compute the score test for nonconstant variance, and summarize your results.
8.3.3. Fit the simple regression mean function again, but this time assume that Var(Lab|Field) = σ²/Field. Get the score test for the fit of this variance function. Also test for nonconstant variance as a function of batch; since the batches are arbitrarily numbered, be sure to treat Batch as a factor. (Hint: Both these tests are extensions of the methodology outlined in the text. The only change required is to be sure that the residuals defined by (8.13) are used when computing the statistic.)
8.3.4. Repeat Problem 8.3.3, but with Var(Lab|Field) = σ²/Field².

8.4. Refer to Problem 7.2. Fit Hald's model, given in Problem 7.2.3, but with constant variance, Var(Distance|Speed) = σ². Compute the score test for nonconstant variance for the alternatives that (a) variance depends on the mean; (b) variance depends on Speed; and (c) variance depends on Speed and Speed². Is adding Speed² helpful?

8.5. Consider the simple regression model, E(Y|X = x) = β₀ + β₁x with Var(Y|X = x) = σ².
8.5.1. Find a formula for the h_ij and for the leverages h_ii.
8.5.2. In a 2D plot of the response versus the predictor in a simple regression problem, explain how high-leverage points can be identified.
8.5.3. Make up a predictor X so that the value of the leverage in simple regression for one of the cases is equal to one.

8.6. Using the QR factorization defined in Appendix A.12, show that H = QQ′. Hence, if q_i is the ith row of Q,

h_ii = q_i′q_i    h_ij = q_i′q_j

Thus, if the QR factorization of X is computed, the h_ii and the h_ij are easily obtained.

8.7. Let U be an n × 1 vector with 1 as its first element and 0s elsewhere. Consider computing the regression of U on an n × p′ full rank matrix X. As usual, let H = X(X′X)⁻¹X′ be the hat matrix with elements h_ij.
8.7.1. Show that the elements of the vector of fitted values from the regression of U on X are the h_1j, j = 1, 2, ..., n.
8.7.2. Show that the vector of residuals has 1 − h_11 as its first element, and the other elements are −h_1j, j > 1.

8.8. Two n × n matrices A and B are orthogonal if AB = BA = 0. Show that I − H and H are orthogonal. Use this result to show that as long as the intercept is in the mean function, the slope of the regression of ê on Ŷ is 0. What is the slope of the regression of ê on Y?

8.9. Suppose that W is a known diagonal matrix of positive weights, and we have a weighted least squares problem,

Y = Xβ + e    Var(e) = σ²W⁻¹    (8.25)

Using the transformations as in Section 5.1, show that the hat matrix is given by (8.12).

8.10. Draw residual plots for the mean function for the California water data described in the problems for Chapter 7, and comment on your results. Test for curvature as a function of fitted values. Also, get marginal model plots for this model.

8.11. Refer to the transactions data discussed in Chapter 4. Fit the mean function (4.16) with constant variance, and use marginal model plots to examine the fit. Be sure to consider both the mean function and the variance function. Comment on the results.

TABLE 8.5 Crustacean Zooplankton Species Data

Variable     Description
Species      Number of zooplankton species
MaxDepth     Maximum lake depth, m
MeanDepth    Mean lake depth, m
Cond         Specific conductance, micro Siemens
Elev         Elevation, m
Lat          N latitude, degrees
Long         W longitude, degrees
Dist         Distance to nearest lake, km
NLakes       Number of lakes within 20 km
Photo        Rate of photosynthesis, mostly by the ¹⁴C method
Area         Surface area of the lake, in hectares
Lake         Name of lake

Source: From Dodson (1992).

8.12. The number of crustacean zooplankton species present in a lake can be different, even for two nearby lakes. The data in the file lakes.txt, provided by S. Dodson and discussed in part in Dodson (1992), give the number of known crustacean zooplankton species for 69 world lakes. Also included are a number of characteristics of each lake. There are some missing values, indicated with a ? in the data file. The goal of the analysis is to understand how the number of species present depends on the other measured variables that are characteristics of the lake. The variables are described in Table 8.5. Decide on appropriate transformations of the data to be used in this problem. Then, fit appropriate linear regression models, and summarize your results.

CHAPTER 9

Outliers and Influence

9.1 OUTLIERS

In some problems, the observed response for a few of the cases may not seem to correspond to the model fitted to the bulk of the data. In a simple regression problem such as displayed in Figure 1.9c, page 13, this may be obvious from a plot of the response versus the predictor, where most of the cases lie near a fitted line but a few do not. Cases that do not follow the same model as the rest of the data are called outliers, and identifying these cases can be useful.

We use the mean shift outlier model to define outliers. Suppose that the ith case is a candidate for an outlier. We assume that the mean function for all other cases is E(Y|X = x_j) = x_j′β, but for case i the mean function is E(Y|X = x_i) = x_i′β + δ. The expected response for the ith case is shifted by an amount δ, and a test of δ = 0 is a test for a single outlier in the ith case. In this development, we assume Var(Y|X) = σ².

Cases with large residuals are candidates for outliers. Not all large residual cases are outliers, since large errors e_i will occur with the frequency prescribed by the generating probability distribution. Whatever testing procedure we develop must offer protection against declaring too many cases to be outliers. This leads to the use of simultaneous testing procedures. Also, not all outliers are bad. For example, a geologist searching for oil deposits may be looking for outliers, if the oil is in the places where a fitted model does not match the data. Outlier identification is done relative to a specified model. If the form of the model is modified, the status of individual cases as outliers may change. Finally, some outliers will have greater effect on the regression estimates than will others, a point that is pursued shortly.

An Outlier Test

Suppose that the ith case is suspected to be an outlier. First, define a new term, say U, with the jth element u_j = 0 for j ≠ i, and the ith element u_i = 1. Thus, U is

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

a dummy variable that is zero for all cases but the ith. Then, simply compute the regression of the response on both the terms in X and U. The estimated coefficient for U is the estimate of the mean shift δ. The t-statistic for testing δ = 0 against a two-sided alternative is the appropriate test statistic. Normally distributed errors are required for this test, and then the test statistic is distributed as Student's t with n − p′ − 1 df.

We will now consider an alternative approach that will lead to the same test, but from a different point of view. The equivalence of the two approaches is left as an exercise. Again suppose that the ith case is suspected to be an outlier. We can proceed as follows:

1. Delete the ith case from the data, so n − 1 cases remain in the reduced data set.
2. Using the reduced data set, estimate β and σ². Call these estimates β̂_(i) and σ̂²_(i) to remind us that case i was not used in estimation. The estimator σ̂²_(i) has n − p′ − 1 df.
3. For the deleted case, compute the fitted value ŷ_i(i) = x_i′β̂_(i). Since the ith case was not used in estimation, y_i and ŷ_i(i) are independent. The variance of y_i − ŷ_i(i) is given by

Var(y_i − ŷ_i(i)) = σ² + σ²x_i′(X_(i)′X_(i))⁻¹x_i    (9.1)

where X_(i) is the matrix X with the ith row deleted. This variance is estimated by replacing σ² with σ̂²_(i) in (9.1).
4. Now E(y_i − ŷ_i(i)) = δ, which is zero under the null hypothesis that case i is not an outlier but nonzero otherwise. Assuming normal errors, a Student's t-test of the hypothesis δ = 0 is given by

t_i = (y_i − ŷ_i(i)) / [σ̂_(i)√(1 + x_i′(X_(i)′X_(i))⁻¹x_i)]    (9.2)

This test has n − p′ − 1 df, and is identical to the t-test suggested in the first paragraph of this section.

There is a simple computational formula for t_i in (9.2). We first define an intermediate quantity, often called a standardized residual, by

r_i = ê_i / (σ̂√(1 − h_ii))    (9.3)

where h_ii is the leverage for the ith case, defined at (8.7). Like the residuals ê_i, the r_i have mean zero, but unlike the ê_i, the variances of the r_i are all equal to

one. Because the h_ii need not all be equal, the r_i are not just a rescaling of the ê_i. With the aid of Appendix A.12, one can show that t_i can be computed as

t_i = r_i[(n − p′ − 1)/(n − p′ − r_i²)]^(1/2) = ê_i / (σ̂_(i)√(1 − h_ii))    (9.4)

A statistic divided by its estimated standard deviation is usually called a studentized statistic, in honor of W. S. Gosset, who first wrote about the t-distribution using the pseudonym Student¹. The residual t_i is called a studentized residual. We see that r_i and t_i carry the same information, since one can be obtained from the other via a simple formula. Also, this result shows that t_i can be computed from the residuals, the leverages, and σ̂², so we don't need to delete the ith case, or to add a variable U, to get the outlier test.

Weighted Least Squares

If we initially assumed that Var(Y|X) = σ²/w for known positive weights w, then in equation (9.3), we compute the residuals ê_i using the correct weighted formula (8.13), and leverages are the diagonal elements of (8.12). Otherwise, no changes are required.

Significance Levels for the Outlier Test

If the analyst suspects in advance that the ith case is an outlier, then t_i should be compared with the central t-distribution with the appropriate number of df. The analyst rarely has a prior choice for the outlier. Testing the case with the largest value of t_i to be an outlier is like performing n significance tests, one for each of n cases. If, for example, n = 65 and p′ = 4, the probability that a t statistic with 60 df exceeds 2.00 in absolute value is 0.05; however, the probability that the largest of 65 independent t-tests exceeds 2.00 is 0.964, suggesting quite clearly the need for a different critical value for a test based on the maximum of many tests. Since tests based on the t_i are correlated, this computation is only a guide. Excellent discussions of this and other multiple-test problems are presented by Miller (1981).
The technique we use to find critical values is based on the Bonferroni inequality, which states that for n tests each of size a, the probability of falsely labeling at least one case as an outlier is no greater than na. This procedure is conservative and provides an upper bound on the probability. For example, the Bonferroni inequality specifies only that the probability of the maximum of 65 tests exceeding 2.00 is no greater than 65(0.05), which is larger than 1. Choosing the critical value to be the (α/n) × 100% point of t will give a significance level of no more than n(α/n) = α. We would choose a level of 0.05/65 = 0.00077 for each test to give an overall level of no more than 65(0.00077) = 0.05.

¹See www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Gosset.html for a biography of Student.

Standard functions for the t-distribution can be used to compute p-values for the outlier test: simply compute the p-value as usual and then multiply by the sample size. If this number is smaller than one, then this is the p-value adjusted for multiple testing. If this number exceeds one, then the p-value is one.

In Forbes data, Example 1.1, case 12 was suspected to be an outlier because of its large residual. To perform the outlier test, we first compute the standardized residual r₁₂ from (9.3), using ê₁₂ = 1.36, σ̂ = 0.379, and the leverage h₁₂,₁₂; the outlier test t₁₂ then follows from (9.4). The nominal two-sided p-value corresponding to this test statistic when compared with the t(14) distribution is extremely small, and, if the location of the outlier was not selected in advance, so is the Bonferroni-adjusted p-value obtained by multiplying by n = 17. This very small value supports case 12 as an outlier.

The test locates an outlier, but it does not tell us what to do about it. If we believe that the case is an outlier because of a blunder, for example, an unusually large measurement error or a recording error, then we might delete the outlier and analyze the remaining cases without the suspected case. Sometimes, we can try to figure out why a particular case is outlying, and finding the cause may be the most important part of the analysis. All this depends on the context of the problem you are studying.

Additional Comments

There is a vast literature on methods for handling outliers, including Barnett and Lewis (2004), Hawkins (1980), and Beckman and Cook (1983). If a set of data has more than one outlier, a sequential approach can be recommended, but the cases may mask each other, making finding groups of outliers difficult. Cook and Weisberg (1982, p. 28) provide the generalization of the mean shift model given here to multiple cases. Hawkins, Bradu, and Kass (1984) provide a promising method for searching all subsets of cases for outlying subsets. Bonferroni bounds for outlier tests are discussed by Cook and Prescott (1981).
They find that for one-case-at-a-time methods the bound is very accurate, but it is much less accurate for multiple-case methods. The testing procedure helps the analyst in finding outliers, to make them available for further study. Alternatively, we could design robust statistical methods that can tolerate or accommodate some proportion of bad or outlying data; see, for example, Staudte and Sheather (1990).
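The computational shortcut in (9.4) can be checked numerically: t_i computed from r_i, the leverages, and σ̂ agrees with the deletion-based definition in (9.2). A minimal sketch with simulated data follows; the data and the planted mean shift are assumptions for illustration, not any of the book's data sets:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(scale=0.5, size=n)
y[7] += 4.0                      # plant a mean shift in case i = 7

X = np.column_stack([np.ones(n), x])
p_prime = X.shape[1]

# Full-data fit: residuals, leverages, and sigma-hat squared.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
sigma2 = np.sum(e**2) / (n - p_prime)

# Studentized residuals via the shortcut (9.4).
r = e / np.sqrt(sigma2 * (1 - h))
t_short = r * np.sqrt((n - p_prime - 1) / (n - p_prime - r**2))

# Studentized residuals via deletion, following steps 1-4 and (9.2).
t_delete = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    s2i = np.sum((yi - Xi @ bi) ** 2) / (n - p_prime - 1)
    pred_var = s2i * (1 + X[i] @ np.linalg.inv(Xi.T @ Xi) @ X[i])
    t_delete[i] = (y[i] - X[i] @ bi) / np.sqrt(pred_var)
```

The two computations agree case by case, and the case with the planted shift has by far the largest |t_i|, as the theory predicts.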

9.2 INFLUENCE OF CASES

Single cases or small groups of cases can strongly influence the fit of a regression model. In Anscombe's examples in Figure 1.9d, page 13, the fitted model depends entirely on the one point with x = 19. If that case were deleted, we could not estimate the slope. If it were perturbed, moved around a little, the fitted line would follow the point. In contrast, if any of the other cases were deleted or moved around, the change in the fitted mean function would be quite small. The general idea of influence analysis is to study changes in a specific part of the analysis when the data are slightly perturbed. Whereas statistics such as residuals are used to find problems with a model, influence analysis is done as if the model were correct, and we study the robustness of the conclusions, given a particular model, to the perturbations.

The most useful and important method of perturbing the data is deleting the cases from the data one at a time. We then study the effects or influence of each individual case by comparing the full data analysis to the analysis obtained with a case removed. Cases whose removal causes major changes in the analysis are called influential. Using the notation from the last section, a subscript (i) means with the ith case deleted, so, for example, β̂_(i) is the estimate of β computed without case i, X_(i) is the (n − 1) × p′ matrix obtained from X by deleting the ith row, and so on. In particular, then,

β̂_(i) = (X_(i)′X_(i))⁻¹X_(i)′Y_(i)    (9.5)

Figure 9.1 is a scatterplot matrix of coefficient estimates for the three parameters in the UN data from Section 3.1 obtained by deleting cases one at a time. Every time a case is deleted, different coefficient estimates may be obtained. All 2D plots in Figure 9.1 are more or less elliptically shaped, which is a common characteristic of the deletion estimates.
In the plot for the coefficients for log(PPgdp) and Purban, the points for Armenia and Ukraine are in one corner and the point for Djibouti is in the opposite corner; deleting any one of these localities causes the largest change in the values of the estimated parameters, although all the changes are small. While the plots in Figure 9.1 are informative about the effects of deleting cases one at a time, looking at these plots can be bewildering, particularly if the number of terms in the model is large. A single summary statistic that can summarize these pictures is desirable, and this is provided by Cook's distance.

Cook's Distance

We can summarize the influence on the estimate of β by comparing β̂ to β̂_(i). Since each of these is a p′ vector, the comparison requires a method of combining information from each of the p′ components into a single number. Several ways of doing this have been proposed in the literature, but most of them will result in roughly the same information, at least for multiple linear regression. The method we use is due to Cook (1977). We define Cook's distance D_i to be

D_i = (β̂_(i) − β̂)′(X′X)(β̂_(i) − β̂) / (p′σ̂²)    (9.6)

FIG. 9.1 Estimates of parameters in the UN data obtained by deleting one case at a time (panels show the coefficients for the intercept, log(ppgdp), and Purban).

This statistic has several desirable properties. First, contours of constant D_i are ellipsoids, with the same shape as confidence ellipsoids. Second, the contours can be thought of as defining the distance from β̂(i) to β̂. Third, D_i does not depend on parameterization, so if the columns of X are modified by linear transformation, D_i is unchanged. Finally, if we define vectors of fitted values as Ŷ = Xβ̂ and Ŷ(i) = Xβ̂(i), then (9.6) can be rewritten as

    D_i = (Ŷ(i) − Ŷ)′(Ŷ(i) − Ŷ) / (p σ̂²)    (9.7)

so D_i is the ordinary Euclidean distance between Ŷ and Ŷ(i). Cases for which D_i is large have substantial influence on both the estimate of β and on fitted values, and deletion of them may result in important changes in conclusions.

9.2.2 Magnitude of D_i

Cases with large values of D_i are the ones whose deletion will result in substantial changes in the analysis. Typically, the case with the largest D_i, or in large data sets the cases with the largest few D_i, will be of interest. One method of calibrating

D_i is obtained by analogy to confidence regions. If D_i were exactly equal to the α × 100% point of the F distribution with p and n − p df, then deletion of the ith case would move the estimate of β̂ to the edge of a (1 − α) × 100% confidence region based on the complete data. Since for most F distributions the 50% point is near one, a value of D_i = 1 will move the estimate to the edge of about a 50% confidence region, a potentially important change. If the largest D_i is substantially less than one, deletion of a case will not change the estimate of β̂ by much. To investigate the influence of a case more closely, the analyst should delete the large-D_i case and recompute the analysis to see exactly what aspects of it have changed.

9.2.3 Computing D_i

From the derivation of Cook's distance, it is not clear that using these statistics is computationally convenient. However, the results sketched in Appendix A.12 can be used to write D_i using more familiar quantities. A simple form for D_i is

    D_i = (1/p) r_i² [h_ii / (1 − h_ii)]    (9.8)

D_i is a product of the square of the ith standardized residual r_i and a monotonic function of h_ii. If p is fixed, the size of D_i will be determined by two different sources: the size of r_i, a random variable reflecting lack of fit of the model at the ith case, and h_ii, reflecting the location of x_i relative to x̄. A large value of D_i may be due to large r_i, large h_ii, or both.

Rat Data
An experiment was conducted to investigate the amount of a particular drug present in the liver of a rat. Nineteen rats were randomly selected, weighed, placed under light ether anesthesia and given an oral dose of the drug. Because large livers would absorb more of a given dose than smaller livers, the actual dose an animal received was approximately determined as 40 mg of the drug per kilogram of body weight. Liver weight is known to be strongly related to body weight. After a fixed length of time, each rat was sacrificed, the liver weighed, and the percent of the dose in the liver determined.
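Before turning to the data, note that the computational form (9.8) means every D_i comes from one least-squares fit, with no refitting at all. A minimal numpy sketch (illustrative names, not code from the book, whose examples use R):

```python
import numpy as np

def cooks_distances(X, y):
    """All n Cook's distances via (9.8): D_i = (1/p) r_i^2 h_ii/(1 - h_ii),
    where r_i is the standardized residual and h_ii the leverage.
    X is assumed to include the intercept column."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
    h = np.diag(H)                            # leverages h_ii
    ehat = y - H @ y                          # ordinary residuals
    sigma2 = ehat @ ehat / (n - p)            # sigma-hat^2
    r2 = ehat**2 / (sigma2 * (1 - h))         # squared standardized residuals
    return r2 * h / (p * (1 - h))
```

A case can earn a large D_i through a large residual, a large leverage, or both, matching the decomposition in the text.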
The experimental hypothesis was that, for the method of determining the dose, there is no relationship between the percentage of the dose in the liver (Y) and the body weight BodyWt, liver weight LiverWt, and relative Dose. The data, provided by Dennis Cook and given in the file rat.txt, are shown in Figure 9.2. As had been expected, the marginal summary plots for Y versus each of the predictors suggest no relationship, and none of the simple regressions is significant, all having t-values less than one.

The fitted regression summary for the regression of Y on the three predictors is shown in Table 9.1. BodyWt and Dose have significant t-tests, with p < 0.05 in both cases, indicating that the two measurements combined are a useful indicator of Y; if LiverWt is dropped from the mean function, the same phenomenon appears. The analysis so far, based only on summary statistics, might lead to the conclusion

FIG. 9.2 Scatterplot matrix for the rat data (BodyWt, LiverWt, Dose, and y).

TABLE 9.1 Regression Summary for the Rat Data

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
BodyWt
LiverWt
Dose

Residual standard error: on 15 degrees of freedom
Multiple R-Squared:
F-statistic: 2.86 on 3 and 15 DF, p-value:

that while neither BodyWt nor Dose is associated with the response when the other is ignored, in combination they are associated with the response. But, from Figure 9.2, Dose and BodyWt are almost perfectly linearly related, so they measure the same thing! We turn to case analysis to attempt to resolve this paradox.

Figure 9.3 displays diagnostic statistics for the mean function with all the terms included. The outlier

FIG. 9.3 Diagnostic statistics for the rat data (outlier test, leverage, and Cook's distance, each plotted against case number).

statistics are not particularly large. However, Cook's distance immediately locates a possible cause: case three has D₃ = 0.93; no other case has D_i bigger than 0.27, suggesting that case number three alone may have large enough influence on the fit to induce the anomaly. The value of h₃₃ = 0.85 indicates that the problem is an unusual set of predictors for case 3.

One suggestion at this point is to delete the third case and recompute the regression. These computations are given in Table 9.2. The paradox dissolves, and the apparent relationship found in the first analysis can thus be ascribed to the third case alone. Once again, the diagnostic analysis finds a problem, but does not tell us what to do next, and this will depend on the context of the problem. Rat number three, with weight 190 g, was reported to have received a full dose of 1.000, which was a larger dose than it should have received according to the rule for assigning doses; for example, rat eight with weight of 195 g got a lower dose. A number of causes for the result found in the first analysis are possible: (1) the dose or weight recorded for case 3 was in error, so the case should probably be deleted

TABLE 9.2 Regression Summary for the Rat Data with Case 3 Deleted

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
BodyWt
LiverWt
Dose

Residual standard error: on 14 degrees of freedom
Multiple R-Squared:
F-statistic: on 3 and 14 DF, p-value:

from the study, or (2) the regression fit in the second analysis is not appropriate except in the region defined by the 18 points excluding case 3. This has many implications concerning the experiment. It is possible that the combination of dose and rat weight chosen was fortuitous, and that the lack of relationship found would not persist for any other combinations of them, since inclusion of a data point apparently taken under different conditions leads to a different conclusion. This suggests the need for collection of additional data, with dose determined by some rule other than a constant proportion of weight.

9.2.4 Other Measures of Influence

The added-variable plots introduced in Section 3.1 provide a graphical diagnostic for influence. Cases corresponding to points at the left or right of an added-variable plot that do not match the general trend in the plot are likely to be influential for the variable that is to be added. For example, Figure 9.4 shows the added-variable plots for BodyWt and for Dose for the rat data. The point for case three is clearly

FIG. 9.4 Added-variable plots for BodyWt and for Dose (residuals of y versus residuals of each term, given the others).

separated from the others, and is a likely influential point based on these graphs. The added-variable plot does not correspond exactly to Cook's distance, but to local influence defined by Cook (1986).

As with the outlier problem, influential groups of cases may serve to mask each other and may not be found by examination of cases one at a time. In some problems, multiple-case methods may be desirable; see Cook and Weisberg (1982, Section 3.6).

9.3 NORMALITY ASSUMPTION

The assumption of normal errors plays only a minor role in regression analysis. It is needed primarily for inference with small samples, and even then the bootstrap outlined in Section 4.6 can be used for inference. Furthermore, nonnormality of the unobservable errors is very difficult to diagnose in small samples by examination of residuals. From (8.4), the relationship between the errors and the residuals is

    ê = (I − H)Y = (I − H)(Xβ + e) = (I − H)e

because (I − H)X = 0. In scalar form, the ith residual is

    ê_i = e_i − Σ_{j=1}^{n} h_ij e_j    (9.9)

Thus, ê_i is equal to e_i, adjusted by subtracting off a weighted sum of all the errors. By the central limit theorem, the sum in (9.9) will be nearly normal even if the errors are not normal. With a small or moderate sample size n, the second term can dominate the first, and the residuals can behave like a normal sample even if the errors are not normal. Gnanadesikan (1997) refers to this as the supernormality of residuals. As n increases for fixed p, the second term in (9.9) has small variance compared to the first term, so it becomes less important, and residuals can be used to assess normality; but in large samples, normality is much less important.

Should a test of normality be desirable, a normal probability plot can be used. A general treatment of probability plotting is given by Gnanadesikan (1997). Suppose we have a sample of n numbers z₁, z₂, ..., z_n, and we wish to examine the hypothesis that the z's are a sample from a normal distribution with unknown mean µ and variance σ². A useful way to proceed is as follows:

1.
Order the z's to get z(1) ≤ z(2) ≤ ... ≤ z(n). The ordered z's are called the sample order statistics.

2. Now, consider a standard normal sample of size n. Let u(1) ≤ u(2) ≤ ... ≤ u(n) be the mean values of the order statistics that would be obtained if we repeatedly took samples of size n from the standard normal. The u(i)'s are called the expected order statistics. The u(i) are available in printed tables, or can be well approximated using a computer program.²

3. If the z's are normal, then E(z(i)) = µ + σu(i), so that the regression of z(i) on u(i) will be a straight line. If it is not straight, we have evidence against normality.

Judging whether a probability plot is sufficiently straight requires experience. Daniel and Wood (1980) provided many pages of plots to help the analyst learn to use these plots; this can be easily recreated using a computer package that allows one to look quickly at many plots. Atkinson (1985) used a variation of the bootstrap to calibrate probability plots.

Many statistics have been proposed for testing a sample for normality. One of these that works extremely well is the Shapiro and Wilk (1965) W statistic, which is essentially the square of the correlation between the observed order statistics and the expected order statistics. Normality is rejected if W is too small. Royston (1982a,b,c) provides details and computer routines for the calculation of the test and for finding p-values.

Figure 9.5 shows normal probability plots of the residuals for the heights data (Section 1.1) and for the transactions data (Section 4.6.1). Both have large enough

FIG. 9.5 Normal probability plots of residuals for (a) the heights data and (b) the transactions data (sample quantiles versus expected order statistics).

² Suppose Φ(x) is a function that returns the area p to the left of x under a standard normal distribution, and Φ⁻¹(p) computes the inverse of the normal, so for a given value of p, it returns the associated value of x. Then the ith expected normal order statistic is approximately Φ⁻¹[(i − (3/8))/(n + (1/4))] (Blom, 1958).
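The Blom approximation quoted in the footnote is easy to compute, and the probability-plot idea can be automated as the correlation between the sorted data and the approximate expected order statistics (the idea behind the Shapiro–Francia variant of W). A sketch using only Python's standard library; the function names are illustrative, not from the text:

```python
from statistics import NormalDist

def blom_scores(n):
    """Approximate expected normal order statistics u_(i) using Blom's
    (1958) formula: u_(i) ~= Phi^{-1}((i - 3/8) / (n + 1/4))."""
    inv = NormalDist().inv_cdf
    return [inv((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def probability_plot_correlation(z):
    """Correlation between the sorted data and the Blom scores; values
    near 1 indicate a nearly straight normal probability plot."""
    z = sorted(z)
    u = blom_scores(len(z))
    n = len(z)
    zbar = sum(z) / n
    # sum of the Blom scores is zero by symmetry, so no centering of u
    num = sum((zi - zbar) * ui for zi, ui in zip(z, u))
    den = (sum((zi - zbar) ** 2 for zi in z) * sum(ui * ui for ui in u)) ** 0.5
    return num / den
```

A correlation near 1 corresponds to a nearly straight plot; strongly skewed samples give visibly smaller values.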

samples for normal probability plots to be useful. For the heights data, the plot is very nearly straight, indicating no evidence against normality. For the transactions data, normality is in doubt because the plot is not straight. In particular, there are very large positive residuals well away from a fitted line. This supports the earlier claim that the errors for this problem are likely to be skewed with too many large values.

PROBLEMS

9.1. In an unweighted regression problem with n = 54, p = 5, the results included σ̂ = 4.0 and the following statistics for four of the cases:

    ê_i    h_ii

For each of these four cases, compute r_i, D_i, and t_i. Test each of the four cases to be an outlier. Make a qualitative statement about the influence of each case on the analysis.

9.2. In the fuel consumption data, consider fitting the mean function

    E(Fuel|X) = β₀ + β₁Tax + β₂Dlic + β₃Income + β₄log(Miles)

For this regression, we find σ̂ = with 46 df, and the diagnostic statistics for four states and the District of Columbia were:

                          Fuel    ê_i    h_ii
    Alaska
    New York
    Hawaii
    Wyoming
    District of Columbia

Compute D_i and t_i for each of these cases, and test for one outlier. Which is most influential?

9.3. The matrix (X′(i)X(i)) can be written as (X′(i)X(i)) = X′X − x_i x′_i, where x′_i is the ith row of X. Use this definition to prove that (A.37) holds.

9.4. The quantity y_i − x′_i β̂(i) is the residual for the ith case when β is estimated without the ith case. Use (A.37) to show that

    y_i − x′_i β̂(i) = ê_i / (1 − h_ii)

This quantity is called the predicted residual, or the PRESS residual.

9.5. Use (A.37) to verify (9.8).

9.6. Suppose that interest centered on β* rather than β, where β* is the parameter vector excluding the intercept. Using (5.21) as a basis, define a distance measure D*_i like Cook's D_i and show that (Cook, 1979)

    D*_i = (r_i²/p) [(h_ii − 1/n) / (1 − h_ii + 1/n)]

where p is the number of terms in the mean function excluding the intercept.

9.7. Refer to the lathe data in Problem . Starting with the full second-order model, use the Box–Cox method to show that an appropriate scale for the response is the logarithmic scale.

Find the two cases that are most influential in the fit of the quadratic mean function, and explain why they are influential. Delete these points from the data, refit the quadratic mean function, and compare to the fit with all the data.

9.8. Florida election 2000. In the 2000 election for US president, the counting of votes in Florida was controversial. In Palm Beach county in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore. The data in the file florida.txt³ have four variables: County, the county name, and Gore, Bush, and Buchanan, the number of votes for each of these three candidates. Draw the scatterplot of Buchanan versus Bush, and test the hypothesis that Palm Beach county is an outlier relative to the simple linear regression mean function for E(Buchanan|Bush). Identify another county with an unusual value of the Buchanan vote, given its Bush vote, and test that county to be an outlier. State your conclusions from the test, and its relevance, if any, to the issue of the butterfly ballot.

Next, repeat the analysis, but first consider transforming the variables in the plot to better satisfy the assumptions of the simple linear regression model.
Again test to see if Palm Beach County is an outlier and summarize.

³ Source: county.html.

9.9. Refer to the United Nations data described in Problem 7.8 and consider the regression with response ModernC and predictors (log(ppgdp), Change, Pop, Fertility, Frate, Purban).

Examine added-variable plots for each of the terms in the regression model and summarize. Is it likely that any of the localities are influential for any of the terms? Which localities? Which terms? Are there any outliers in the data? Complete the analysis of the regression of ModernC on the terms in the mean function.

9.10. The data in the data file landrent.txt were collected by Douglas Tiffany to study the variation in rent paid in 1977 for agricultural land planted to alfalfa. The variables are Y = average rent per acre planted to alfalfa, X₁ = average rent paid for all tillable land, X₂ = density of dairy cows (number per square mile), X₃ = proportion of farmland used as pasture, X₄ = 1 if liming is required to grow alfalfa; 0, otherwise. The unit of analysis is a county in Minnesota; the 67 counties with appreciable rented farmland are included. Alfalfa is a high-protein crop that is suitable feed for dairy cows. It is thought that rent for land planted to alfalfa relative to rent for other agricultural purposes would be higher in areas with a high density of dairy cows, and rents would be lower in counties where liming is required, since that would mean additional expense. Use all the techniques learned so far to explore these data with regard to understanding rent structure. Summarize your results.

9.11. The data in the file cloud.txt summarize the results of the first Florida Area Cumulus Experiment, or FACE-1, designed to study the effectiveness of cloud seeding to increase rainfall in a target area (Woodley, Simpson, Biondini, and Berkeley, 1977). A fixed target area of approximately 3000 square miles was established to the north and east of Coral Gables, Florida. During the summer of 1975, each day was judged on its suitability for seeding.
The decision to use a particular day in the experiment was based primarily on a suitability criterion S depending on a mathematical model for rainfall. Days with S > 1.5 were chosen as experimental days; there were 24 days chosen in 1975. On each day, the decision to seed was made by flipping a coin; as it turned out, 12 days were seeded, 12 unseeded. On seeded days, silver iodide was injected into the clouds from small aircraft. The predictors and the response are defined in Table 9.3. The goal of the analysis is to decide if there is evidence that cloud seeding is effective in increasing rainfall. Begin your analysis by drawing appropriate graphs. Obtain appropriate transformations of predictors. Fit appropriate mean functions and summarize your results. (Hint: Be sure to check for influential observations and outliers.)

TABLE 9.3 The Florida Area Cumulus Experiment on Cloud Seeding

Variable  Description
A         Action, 1 = seed, 0 = do not seed
D         Days after the first day of the experiment (June 16, 1975 = 0)
S         Suitability for seeding
C         Percent cloud cover in the experimental area, measured using radar in Coral Gables, Florida
P         Prewetness, amount of rainfall in the hour preceding seeding in 10⁷ cubic meters
E         Echo motion category, either 1 or 2, a measure of the type of cloud
Rain      Rainfall following the action of seeding or not seeding in 10⁷ cubic meters

9.12. Health plans use many tools to try to control the cost of prescription medicines. For older drugs, generic substitutes that are equivalent to name-brand drugs are sometimes available at a lower cost. Another tool that may lower costs is restricting the drugs that physicians may prescribe. For example, if three similar drugs are available for treating the same symptoms, a health plan may require physicians to prescribe only one of them. Since the usage of the chosen drug will be higher, the health plan may be able to negotiate a lower price for that drug.

The data in the file drugcost.txt, provided by Mark Siracuse, can be used to explore the effectiveness of these two strategies in controlling drug costs.
The response variable is COST, the average cost of drugs per prescription per day, and predictors include GS (the extent to which the plan uses generic substitution, a number between zero, no substitution, and 100, always use a generic substitute if available) and RI (a measure of the restrictiveness of the plan, from zero, no restrictions on the physician, to 100,

TABLE 9.4 The Drug Cost Data

Variable  Description
COST      Average cost to plan for one prescription for one day, dollars
RXPM      Average number of prescriptions per member per year
GS        Percent generic substitution used by the plan
RI        Restrictiveness index (0 = none, 100 = total)
COPAY     Average member copayment for prescriptions
AGE       Average member age
F         Percent female members
MM        Member months, a measure of the size of the plan
ID        An identifier for the name of the plan

the maximum possible restrictiveness). Other variables that might impact cost were also collected, and are described in Table 9.4. The data are from the mid-1990s, and are for 29 plans throughout the United States with pharmacies administered by a national insurance company. Provide a complete analysis of these data, paying particular regard to possible outliers and influential cases. Summarize your results with regard to the importance of GS and RI. In particular, can we infer that more use of GS and RI will reduce drug costs?

CHAPTER 10

Variable Selection

We live in an era of cheap data but expensive information. A manufacturer studying the factors that impact the quality of its product, for example, may have many measures of quality, and possibly hundreds or even thousands of potential predictors of quality, including characteristics of the manufacturing process, training of employees, supplier of raw materials, and many others. In a medical setting, to model the size of a tumor, we might have many potential predictors that describe the status of the patient, treatments given, and environmental factors thought to be relevant. In both of these settings, and in many others, we can have too many predictors.

One response to working with problems with many potential predictors is to try to identify the important or active predictors and the unimportant or inactive ones. Variable selection methods are often used for this purpose, and we will study them in this chapter. Estimates and predictions will generally be more precise from fitted models based only on relevant terms, although selection tends to overestimate significance. Sometimes, identifying the important predictors can be an end in itself. For example, learning if the supplier of raw materials impacts quality of a manufactured product may be more important than attempting to measure the exact form of the dependence.

Linear regression with variable selection is not the only approach to the problem of modeling a response as a function of a very large number of terms or predictors. The fields of machine learning and to some extent data mining provide alternative techniques for this problem, and in some circumstances, the methods developed in these areas can give superior answers. An introduction to these areas is given by Hastie, Tibshirani, and Friedman (2001).
The methods described in this chapter are important in their own right because they are so widely used, and also because they can provide a basis for understanding newer methods.

10.1 THE ACTIVE TERMS

Given a response Y and a set of terms X derived from the predictors, the idealized goal of variable selection is to divide X into two pieces, X = (X_A, X_I), where X_A

is the set of active terms, while X_I is the set of inactive terms that are not relevant to the regression problem, in the sense that the two mean functions E(Y|X_A, X_I) and E(Y|X_A) would give exactly the same values. Identifying the active predictors can be surprisingly difficult.

Dividing X = (X_A, X_I) into active and inactive terms, suppose the mean function

    E(Y|X = x) = β′x = β′_A x_A + β′_I x_I    (10.1)

was a correct specification. For example, using the methods of the previous chapters in this book, we might have selected transformations, added interactions, and possibly deleted a few outliers so that (10.1) holds at least to a reasonable approximation. For the inactive terms, we will have β_I = 0. If we have a sufficiently large sample to estimate β, then identifying X_A seems easy: the terms in X_A will have nonzero corresponding elements in β̂, and the terms in X_I will correspond to elements in β̂ close to zero.

Simulated Data
To illustrate, we consider two cases based on artificial data, each with five terms in X, including the intercept. For both cases, the response is obtained as

    y = 1 + x₁ + x₂ + 0x₃ + 0x₄ + error

where the error is N(0, 1), so σ² = 1. For this mean function, X_A is the intercept plus the first two components of X, and X_I is the remaining two components. In the first case we consider, X₁ = (x₁, x₂, x₃, x₄) are independent standard normal random variables, and so the population covariance matrix for X₁ is Var(X₁) = I, the 4 × 4 identity matrix. In the second case, X₂ = (x₁, x₂, x₃, x₄) are again normal with mean zero, but the population covariance matrix is

    Var(X₂) =
        1       0       0.95    0
        0       1       0      −0.95
        0.95    0       1       0
        0      −0.95    0       1        (10.2)

so the first and third variables are highly positively correlated, and the second and fourth variables are highly negatively correlated.

Table 10.1 summarizes one set of simulated data for the first case and with n = 100. β̂ is reasonably close to the true value of 1 for the first three coefficients and 0 for the remaining two coefficients.
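The two simulated cases are easy to reproduce in outline. This numpy sketch is a stand-in for the book's actual simulation, not a reconstruction of it: the random draws differ, and the ±0.95 correlations are an assumption chosen to match the chapter's description of (10.2) (highly correlated pairs, variances inflated roughly tenfold).

```python
import numpy as np

def simulate_case(n, rho, seed=0):
    """One draw of y = 1 + x1 + x2 + 0*x3 + 0*x4 + N(0,1) error.
    rho = 0 gives case 1 (independent terms); rho = 0.95 correlates
    (x1, x3) positively and (x2, x4) negatively, mimicking (10.2).
    Returns the OLS estimates and their standard errors."""
    rng = np.random.default_rng(seed)
    V = np.eye(4)
    V[0, 2] = V[2, 0] = rho
    V[1, 3] = V[3, 1] = -rho
    X = rng.multivariate_normal(np.zeros(4), V, size=n)
    y = 1 + X[:, 0] + X[:, 1] + rng.normal(size=n)
    Xm = np.column_stack([np.ones(n), X])          # add intercept
    beta = np.linalg.lstsq(Xm, y, rcond=None)[0]
    resid = y - Xm @ beta
    s2 = resid @ resid / (n - 5)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xm.T @ Xm)))
    return beta, se
```

Comparing the two cases at n = 100 shows the slope standard errors inflated by roughly a factor of three in the correlated case, consistent with a variance inflation near ten.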
The t-values for the first three terms are large, indicating that these are clearly estimated to be nonzero, while the t-values

TABLE 10.1 Regression Summary for the Simulated Data with No Correlation between the Predictors

             Estimate  Std. Error  t-value  Pr(>|t|)
(Intercept)
x₁
x₂
x₃
x₄

Var(β̂) =

σ̂ = 0.911, df = 95, R² =

for the remaining two terms are much smaller. As it happens, the t-test for β₄ has a p-value of about 0.03, suggesting incorrectly that β₄ ≠ 0. If we do tests at the 5% level, then 5% of the time we will make errors like this. Also shown in Table 10.1 is Var(β̂), which is approximately equal to 1/n times a diagonal matrix; ignoring the entries in the first row and column of this matrix that involve the intercept, the remaining 4 × 4 matrix should be approximately the inverse of Var(X₁), which is the identity matrix. Apart from the intercept, all the estimates are equally variable and independent. If the sample size were increased to 1000, Var(β̂) would be approximately the same matrix multiplied by 1/1000 rather than 1/100.

Table 10.2 gives the summary of the results when n = 100, and the covariance of the terms excluding the intercept is given by (10.2). X_A and X_I are not clearly identified. Since x₂ and x₄ are almost the same variable, apart from a sign change, identification of x₄ as more likely to be the active predictor is not surprising; the choice between x₂ and x₄ can vary from realization to realization of this simulation. All of the t-values are much smaller than in the first case, primarily because with covariance between the terms, variances of the estimated coefficients are greatly inflated relative to uncorrelated terms. To get estimates with about the same variances in case 2 as we got in case 1 requires about 11 times as many observations. The simulation for case 2 is repeated in Table 10.3 with n = 1100. Apart from the intercept, estimates and standard errors are now similar to those in Table 10.1, but the large correlations between some of the estimates, indicated by the large off-diagonal elements in the covariance matrix for β̂, remain.
Identification of the terms in X_A and X_I with correlation present can require huge sample sizes relative to problems with uncorrelated terms.

Selection methods try to identify the active terms and then refit, ignoring the terms thought to be inactive. Table 10.4a is derived from Table 10.2 by deleting the two terms with small t-values. This seems like a very good solution and summary of this problem, with one exception: it is the wrong answer, since x₄ is included

TABLE 10.2 Regression Summary for the Simulated Data with High Correlation between the Predictors

             Estimate  Std. Error  t-value  Pr(>|t|)
(Intercept)
x₁
x₂
x₃
x₄

Var(β̂) =

σ̂ = 0.911, df = 95, R² =

TABLE 10.3 Regression Summary for the Simulated Data, Correlated Case But with n = 1100

             Estimate  Std. Error  t-value  Pr(>|t|)
(Intercept)
x₁
x₂
x₃
x₄

σ̂ = 1.01, df = 1095, R² =

Var(β̂) =

rather than x₂. Table 10.4b is the fit of the mean function using only the correct X_A as terms. The fit of this choice for the mean function is somewhat worse, with larger σ̂ and smaller R².

10.1.1 Collinearity

Two terms X₁ and X₂ are exactly collinear, or linearly dependent, if there is a linear equation such as

    c₁X₁ + c₂X₂ = c₀    (10.3)

for some constants c₀, c₁, and c₂ that is true for all cases in the data. For example, suppose that X₁ and X₂ are amounts of two chemicals and are chosen so that

TABLE 10.4 Regression Summary for Two Candidate Subsets in the Simulated Data, Correlated Case, with n = 100

(a) Candidate terms are intercept, x₁ and x₄
             Estimate  Std. Error  t-value  Pr(>|t|)
(Intercept)
x₁
x₄

σ̂ = 0.906, df = 97, R² =

(b) Candidate terms are intercept, x₁ and x₂
             Estimate  Std. Error  t-value  Pr(>|t|)
(Intercept)
x₁
x₂

σ̂ = 0.927, df = 97, R² =

X₁ + X₂ = 50 ml, then X₁ and X₂ are exactly collinear. Since X₂ = 50 − X₁, knowing X₁ is exactly the same as knowing both X₁ and X₂, and only one of X₁ or X₂ can be included in a mean function. Exact collinearities can occur by accident when, for example, weight in pounds and in kilograms are both included in a mean function, or with sets of dummy variables. Approximate collinearity is obtained if the equation (10.3) nearly holds for the observed data. For example, the variables Dose and BodyWt in the rat data in Section 9.2.3 are approximately collinear since Dose was approximately determined as a multiple of BodyWt. In the first of the two simulated cases in the last section, there is no collinearity because the terms are uncorrelated. In the second case, because of high correlation, x₁ ≈ x₃ and x₂ ≈ −x₄, so these pairs of terms are collinear.

Collinearity between terms X₁ and X₂ is measured by the square of their sample correlation, r₁₂². Exact collinearity corresponds to r₁₂² = 1, and noncollinearity corresponds to r₁₂² = 0. As r₁₂² approaches 1, approximate collinearity becomes generally stronger. Most discussions of collinearity are really concerned with approximate collinearity.

The definition of approximate collinearity extends naturally to p > 2 terms. A set of terms X₁, X₂, ..., X_p is approximately collinear if, for constants c₀, c₁, ..., c_p,

    c₁X₁ + c₂X₂ + ··· + c_pX_p ≈ c₀

with at least one c_j ≠ 0. For that j, we can write

    X_j ≈ (1/c_j)(c₀ − Σ_{l≠j} c_l X_l) = c₀/c_j + Σ_{l≠j} (−c_l/c_j) X_l

which is similar to a linear regression mean function with intercept c₀/c_j and slopes −c_l/c_j. A simple diagnostic analogous to the squared correlation for the two-variable case is the square of the multiple correlation between X_j and the other X's, which we will call R_j². This number is computed from the regression of X_j on the other X's. If the largest R_j² is near 1, we would diagnose approximate collinearity.

When a set of predictors is exactly collinear, one or more predictors must be deleted, or else unique least squares estimates of coefficients do not exist. Since the deleted predictor contains no information after the others, no information is lost by this process, although interpretation of parameters can be more complex. When collinearity is approximate, a usual remedy is again to delete variables from the mean function, with loss of information about fitted values expected to be minimal. The hard part is deciding which variables to delete.

10.1.2 Collinearity and Variances

We have seen in the initial example in this chapter that correlation between the terms increases the variances of estimates. In a mean function with two terms beyond the intercept,

    E(Y|X₁ = x₁, X₂ = x₂) = β₀ + β₁x₁ + β₂x₂

suppose the sample correlation between X₁ and X₂ is r₁₂, and define the symbol SX_jX_j = Σ(x_ij − x̄_j)² to be the sum of squares for the jth term in the mean function. It is an exercise (Problem 10.7) to show that, for j = 1, 2,

    Var(β̂_j) = [σ²/(1 − r₁₂²)] [1/SX_jX_j]    (10.4)

The variances of β̂₁ and β̂₂ are minimized if r₁₂² = 0, while as r₁₂² nears 1, these variances are greatly inflated; for example, if r₁₂² = 0.95, the variance of β̂₁ is 20 times as large as if r₁₂² = 0. The use of collinear predictors can lead to unacceptably variable estimated coefficients compared to problems with no collinearity. When p > 2, the variance of the jth coefficient is (Problem 10.7)

    Var(β̂_j) = [σ²/(1 − R_j²)] [1/SX_jX_j]    (10.5)

The quantity 1/(1 − R_j²) is called the jth variance inflation factor, or VIF_j (Marquardt, 1970).
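Computing R_j² from the auxiliary regressions gives the variance inflation factors directly. A minimal numpy sketch (the function name is illustrative; the book's examples use R, where vif() is available in add-on packages):

```python
import numpy as np

def vifs(X):
    """Variance inflation factors 1/(1 - R_j^2), where R_j^2 comes from
    regressing term j on the remaining terms (with an intercept in each
    auxiliary regression).  X holds the non-intercept terms, one column
    per term.  Uses 1/(1 - R_j^2) = TSS/RSS, since R^2 = 1 - RSS/TSS."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        fit = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        rss = np.sum((xj - fit) ** 2)
        tss = np.sum((xj - xj.mean()) ** 2)
        out.append(tss / rss)
    return np.array(out)
```

VIF_j equals 1 exactly when term j is uncorrelated with the remaining terms in the sample, and grows without bound as the terms approach exact collinearity.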
Assuming that the X_j's could have been sampled to make R_j² = 0 while keeping SX_jX_j constant, the VIF represents the increase in variance due to the correlation between the predictors and, hence, collinearity.

In the first of the two simulated examples earlier in this section, all the terms are independent, so each of the R_j² should be close to zero, and the VIF are all close

to their minimum value of one. For the second example, each of the R_j² ≈ 0.95² = 0.9025, and each VIF should be close to 1/(1 − 0.9025) ≈ 10.3. Estimates in the second case are about 10 or 11 times as variable as estimates in the first case.

10.2 VARIABLE SELECTION

The goal of variable selection is to divide X into the set of active terms X_A and the set of inactive terms X_I. For this purpose, we assume that the mean function (10.1) is appropriate for the data at hand. If we have k terms in the mean function apart from the intercept, then there are potentially 2^k possible choices of X_A obtained from all possible subsets of the terms. If k = 5, there are only 2⁵ = 32 choices for X_A, and all 32 possible choices can be fit and compared. If k = 10, there are 1024 choices, and fitting such a large number of models is possible but still an unpleasant prospect. For k as small as 30, the number of models possible is much too large to consider them all. There are two basic issues. First, given a particular candidate X_C for the active terms, what criterion should be used to compare X_C to other possible choices for X_A? The second issue is computational: How do we deal with the potentially huge number of comparisons that need to be made?

10.2.1 Information Criteria

Suppose we have a particular candidate subset X_C. If X_C = X_A, then the fitted values from the fit of the mean function

    E(Y|X_C = x_C) = β′_C x_C    (10.6)

should be very similar to the fit of mean function (10.1), and the residual sum of squares for the fit of (10.6) should be similar to the residual sum of squares for (10.1). If X_C misses important terms, the residual sum of squares should be larger; see Problem .

Criteria for comparing various candidate subsets are based on the lack of fit of a model and its complexity. Lack of fit for a candidate subset X_C is measured by its residual sum of squares RSS_C. Complexity for multiple linear regression models is measured by the number of terms p_C in X_C, including the intercept.¹
The most common criterion that is useful in multiple linear regression, and in many other problems where model comparison is at issue, is the Akaike Information Criterion, or AIC. Ignoring constants that are the same for every candidate subset, AIC is given by (Sakamoto, Ishiguro, and Kitagawa, 1987)

AIC = n log(RSS_C/n) + 2p_C    (10.7)

¹The complexity may also be defined as the number of parameters estimated in the regression as a whole, which is equal to the number of terms plus one for estimating σ².

Small values of AIC are preferred, so better candidate sets will have smaller RSS_C and a smaller number of terms p_C. An alternative to AIC is the Bayes Information Criterion, or BIC, given by (Schwarz, 1978)

BIC = n log(RSS_C/n) + p_C log(n)    (10.8)

which provides a different balance between lack of fit and complexity. Once again, smaller values are preferred. Yet a third criterion that balances lack of fit and complexity is Mallows' Cp (Mallows, 1973), where the subscript p is the number of terms in the candidate X_C. This statistic is defined by

C_pC = RSS_C/σ̂² + 2p_C − n    (10.9)

where σ̂² is from the fit of (10.1). As with many problems for which many solutions are proposed, there is no clear choice among these criteria for preferring a subset mean function. There is an important similarity between all three criteria: if we fix the complexity, meaning that we consider only the choices X_C with a fixed number of terms, then all three will agree that the choice with the smallest residual sum of squares is the preferred choice.

Highway Accident Data

We will use the highway accident data described in Section 7.2. The initial terms we consider include the transformations found earlier in Chapter 7 and a few others, and are described in Table 10.5. The response variable is, from Section 7.3, log(rate). This mean function includes 14 terms to describe only n = 39 cases.
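The three information criteria are simple functions of RSS_C, n, and p_C and can be sketched in a few lines. The numbers below are the highway-data values used later in this section (RSS = 5.016 with 6 terms for a candidate subset, RSS = 3.5377 with 14 terms for the full fit); the function names are ours, not from any particular package.

```python
import math

def aic(rss, n, p):
    # AIC = n log(RSS_C / n) + 2 p_C, dropping constants shared by all subsets
    return n * math.log(rss / n) + 2 * p

def bic(rss, n, p):
    # BIC trades the 2 p_C penalty for p_C log(n)
    return n * math.log(rss / n) + p * math.log(n)

def cp(rss, sigma2, n, p):
    # Mallows Cp; sigma2 comes from the fit with all terms
    return rss / sigma2 + 2 * p - n

n = 39
rss_sub, p_sub = 5.016, 6         # candidate subset for the highway data
rss_full, p_full = 3.5377, 14     # all terms
sigma2 = rss_full / (n - p_full)  # residual mean square on 25 df

print(aic(rss_sub, n, p_sub), aic(rss_full, n, p_full))
print(cp(rss_sub, sigma2, n, p_sub), cp(rss_full, sigma2, n, p_full))
```

For the full mean function, Cp works out to exactly p_C, since RSS_X/σ̂² = n − p_X by construction; for a good candidate subset, all three criteria come out smaller than for the full fit.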
TABLE 10.5 Definition of Terms for the Highway Accident Data

log(rate): Base-two logarithm of 1973 accident rate per million vehicle miles, the response
log(len): Base-two logarithm of the length of the segment in miles
log(adt): Base-two logarithm of average daily traffic count in thousands
log(trks): Base-two logarithm of truck volume as a percent of the total volume
Slim: 1973 speed limit
Lwid: Lane width in feet
Shld: Shoulder width in feet of the outer shoulder on the roadway
Itg: Number of freeway-type interchanges per mile in the segment
log(sigs1): Base-two logarithm of (number of signalized interchanges per mile in the segment + 1)/(length of segment)
Acpt: Number of access points per mile in the segment
Hwy: A factor coded 0 if a federal interstate highway, 1 if a principal arterial highway, 2 if a major arterial, and 3 otherwise

TABLE 10.6 Regression Summary for the Fit of All Terms in the Highway Accident Data^a

Columns: Estimate, Std. Error, t-value, Pr(>|t|); rows: Intercept, loglen, logadt, logtrks, logsigs1, Slim, Shld, Lane, Acpt, Itg, Lwid, Hwy1, Hwy2, Hwy3. (The numerical entries were not preserved in this transcription.) σ̂ = 0.376 on 25 df, R² = 0.791.
^a The terms Hwy1, Hwy2, and Hwy3 are dummy variables for the highway factor.

The regression on all the terms is summarized in Table 10.6. Only two of the terms have t-values exceeding 2 in absolute value, in spite of the fact that R² = 0.791. Few of the predictors, adjusted for the others, are clearly important, even though, taken as a group, they are useful for predicting accident rates. This is usually evidence that X_A is smaller than the full set of available terms.

To illustrate the information criteria, consider a candidate subset X_C consisting of the intercept and (log(len), Slim, Acpt, log(trks), Shld). For this choice, RSS_C = 5.016 with p_C = 6. For the mean function with all the terms, RSS_X = 3.5377 with p_X = 14, so σ̂² = 3.5377/25 = 0.1415. From these, we can compute the values of AIC, BIC, and Cp for the subset mean function,

AIC = 39 log(5.016/39) + 2 × 6 = −67.99
BIC = 39 log(5.016/39) + 6 log(39) = −58.00
Cp = 5.016/0.1415 + 2 × 6 − 39 = 8.45

and for the full mean function,

AIC = 39 log(3.5377/39) + 2 × 14 = −65.60
BIC = 39 log(3.5377/39) + 14 log(39) = −42.31
Cp = 3.5377/0.1415 + 2 × 14 − 39 = 14.00

All three criteria are smaller for the subset, and so the subset is preferred over the mean function with all the terms. This subset need not be preferred to other subsets, however.

10.2.2 Computationally Intensive Criteria

Cross-validation can also be used to compare candidate subset mean functions. The most straightforward type of cross-validation is to split the data into two parts at random, a construction set and a validation set. The construction set is used to estimate the parameters in the mean function. Fitted values from this fit are then computed for the cases in the validation set, and the average of the squared differences between the response and the fitted values in the validation set is used as a summary of fit for the candidate subset. Good candidates for X_A will have small cross-validation errors. Correction for complexity is not required because different data are used for fitting and for estimating fitting errors.

Another version of cross-validation uses predicted residuals for the subset mean function based on the candidate X_C. For this criterion, compute the fitted value for each case i from a regression that does not include case i; the predicted residual is the difference between y_i and this fitted value. The sum of squares of these predicted residuals is the predicted residual sum of squares, or PRESS (Allen, 1974; Geisser and Eddy, 1979),

PRESS = Σ (y_i − x′_Ci β̂_C(i))² = Σ (ê_Ci/(1 − h_Cii))², summing over i = 1,...,n    (10.10)

where ê_Ci and h_Cii are, respectively, the residual and the leverage for the ith case in the subset model. For the subset mean function and the full mean function considered in the last section, the value of PRESS is much smaller for the subset, suggesting substantial improvement of this particular subset over the full mean function because it gives much smaller errors on the average.

The PRESS method for linear regression depends only on residuals and leverages, so it is fairly easy to compute. This simplicity does not carry over to other problems, in which the computation will require refitting the mean function n times.
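The identity in (10.10), which obtains the leave-one-out predicted residuals from a single fit via the ordinary residuals and leverages, is easy to check numerically. A minimal sketch on simulated data (not the highway data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(n)

# PRESS from one fit: sum of (e_i / (1 - h_ii))^2 as in (10.10)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
press = np.sum((e / (1 - h)) ** 2)

# brute force: refit n times, each time predicting the deleted case
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_loo += (y[i] - X[i] @ b_i) ** 2

print(press, press_loo)  # the two computations agree
```

Because PRESS needs only the residuals and leverages, the single fit suffices for linear regression; the n-refit loop is what a nonlinear model would require.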
As a result, this method is not often used outside the linear regression model.

10.2.3 Using Subject-Matter Knowledge

The single most important tool in selecting a subset of variables is the analyst's knowledge of the area under study and of each of the variables. In the highway accident data, Hwy is a factor, so all of its levels should probably either be in the candidate subset or excluded from it. Also, the variable log(len) should be treated differently from the others, since its inclusion in the active predictors may be required by the way highway segments are defined. Suppose that highways consist of safe stretches and bad spots, and that most accidents occur at the bad spots. If we were to lengthen a highway segment in our study by a small amount, it is unlikely that we would add another bad spot to the section, assuming bad spots are

rare, but the computed response, accidents per million vehicle miles on the section of roadway, will decrease. Thus, the response and log(len) should be negatively correlated, and we should consider only subsets that include log(len).

Of the 14 terms, one is to be included in all candidate subsets, and three, the dummy variables for Hwy, are to be included or excluded as a group. Thus, we have 10 terms (or groups of terms) that can be included or not, for a total of 2^10 = 1024 possible subset mean functions to consider.

10.3 COMPUTATIONAL METHODS

With linear regression, it is possible to find the few candidate subsets of each subset size that minimize the information criteria. Furnival and Wilson (1974) provided an algorithm, called the leaps and bounds algorithm, that uses information from regressions already computed to bound the possible value of the criterion function for regressions not yet computed. This trick allows skipping the computation of most regressions. The algorithm has been widely implemented in statistical packages and in subroutine libraries. It cannot be used with factors, unless the factors are replaced by sets of dummy variables, and it cannot be used with computationally intensive criteria such as cross-validation or PRESS.

For problems other than linear least squares regression, or if cross-validation is to be used as the criterion function, exhaustive methods are generally not feasible, and computational compromise is required. Stepwise methods require examining only a few subsets of each subset size. The estimate of X_A is then selected from the few subsets that were examined. Stepwise methods are not guaranteed to find the candidate subset that is optimal according to any criterion function, but they often give useful results in practice.

Stepwise methods have three basic variations. For simplicity of presentation, we assume that no terms beyond the intercept are forced into the subsets considered.
As before, let k be the number of terms, or groups of terms, that might be added to the mean function. Forward selection uses the following procedure:

[FS.1] Consider all candidate subsets consisting of one term beyond the intercept, and find the subset that minimizes the criterion of interest. If an information criterion is used, this amounts to finding the term that is most highly correlated with the response, because its inclusion in the subset gives the smallest residual sum of squares. Regardless of the criterion, this step requires examining k candidate subsets.

[FS.2] For all remaining steps, consider adding one term to the subset selected at the previous step. Using an information criterion, this amounts to adding the term with the largest partial correlation² with the response given the terms already in the subset, and so this is a very easy calculation. Using

²The partial correlation is the ordinary correlation coefficient between the two plotted quantities in an added-variable plot.

cross-validation, this will require fitting all subsets consisting of the subset selected at the previous step plus one additional term. At step l, k − l + 1 subsets need to be considered.

[FS.3] Stop when all the terms are included in the subset, or when addition of another term increases the value of the selection criterion.

If the number of terms beyond the intercept is k, this algorithm will consider at most k + (k − 1) + ··· + 1 = k(k + 1)/2 of the 2^k possible subsets. For k = 10, the number of subsets considered is 55 of the 1024 possible subsets. The subset among these 55 that has the best value of the selected criterion is tentatively chosen as the candidate for X_A. The algorithm requires modification if a group of terms is to be treated as all included or all excluded, as would be the case with a factor. At each step, we would consider adding the term or group of terms that produces the best value of the criterion of interest. The information criteria can now give different best choices at a step, because we are no longer necessarily examining mean functions with p_C fixed.

The backward elimination algorithm works in the opposite order:

[BE.1] Fit first with the candidate subset X_C = X, as given by (10.1).

[BE.2] At the next step, consider all possible subsets obtained by removing one term, other than those forced to be in all mean functions, from the candidate subset selected at the last step. Using an information criterion, this amounts to removing the term with the smallest t-value in absolute value in the regression summary, because this will give the smallest increase in the residual sum of squares. Using cross-validation, all subsets formed by deleting one term from the current subset must be considered.

[BE.3] Continue until all terms but those forced into all mean functions are deleted, or until the next deletion increases the value of the criterion.

As with the forward selection method, only k(k + 1)/2 subsets are considered, and the best among those considered is the candidate for X_A.
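A compact sketch of forward selection [FS.1]-[FS.3] with AIC as the criterion, on simulated data with two active terms; this illustrates the algorithm and is not the book's highway computation.

```python
import numpy as np

def rss(X, y, cols):
    # residual sum of squares for the subset of columns in cols, plus intercept
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ b
    return r @ r

def aic(rss_val, n, p):
    return n * np.log(rss_val / n) + 2 * p

def forward_selection(X, y):
    n, k = X.shape
    active, remaining = [], list(range(k))
    best = aic(rss(X, y, active), n, 1)          # intercept-only fit
    while remaining:
        # [FS.2] try adding each remaining term to the current subset
        score, j = min((aic(rss(X, y, active + [j]), n, len(active) + 2), j)
                       for j in remaining)
        if score >= best:                        # [FS.3] stop when AIC worsens
            break
        best, active = score, active + [j]
        remaining.remove(j)
    return active

rng = np.random.default_rng(2)
n = 200
X = rng.standard_normal((n, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)  # terms 0, 3 active
print(forward_selection(X, y))
```

As in [FS.1], the first term entered is the one most highly correlated with the response; the truly active terms are picked up before the complexity penalty stops the search.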
The subsets considered by forward selection and by backward elimination may not be the same. If factors are included among the terms, then with backward elimination, too, the information criteria need not all select the same subset of fixed size as the best. The forward and backward algorithms can be combined into a stepwise method, in which at each step a term is either deleted or added so that the resulting candidate mean function minimizes the criterion function of interest. This has the advantage of allowing consideration of more subsets, without the need to examine all 2^k subsets.

Highway Accidents

Table 10.7 presents a summary of the 55 mean functions examined using forward selection for the highway accident data, using PRESS as the criterion function for selecting subsets. The volume of information in this table may seem overwhelming,

TABLE 10.7 Forward Selection for the Highway Accident Data. Subsets Within a Step Are Ordered According to the Value of PRESS. (For each candidate, the table reports df, RSS, p, C(p), AIC, BIC, and PRESS; the numerical entries were not preserved in this transcription.)

Step 1: Base terms: (loglen). Candidates, in order: Add Slim; Shld; Acpt; Hwy; logsigs1; logtrks; logadt; Itg; Lane; Lwid.

Step 2: Base terms: (loglen Slim). Add logtrks; Hwy; logsigs1; Itg; Lane; logadt; Shld; Acpt; Lwid.

Step 3: Base terms: (loglen Slim logtrks). Add Hwy; Itg; logadt; logsigs1; Lane; Shld; Acpt; Lwid.

Step 4: Base terms: (loglen Slim logtrks Hwy). Add logsigs1; Itg; Lane; logadt; Shld; Acpt; Lwid.

Step 5: Base terms: (loglen Slim logtrks Hwy logsigs1). Add Itg; Lane; (continued overleaf)

TABLE 10.7 (Continued)

Step 5 (continued): Add logadt; Shld; Acpt; Lwid.

Step 6: Base terms: (loglen Slim logtrks Hwy logsigs1 Itg). Add Lane; logadt; Shld; Acpt; Lwid.

Step 7: Base terms: (loglen Slim logtrks Hwy logsigs1 Itg Lane). Add logadt; Shld; Acpt; Lwid.

Step 8: Base terms: (loglen Slim logtrks Hwy logsigs1 Itg Lane logadt). Add Shld; Acpt; Lwid.

Step 9: Base terms: (loglen Slim logtrks Hwy logsigs1 Itg Lane logadt Shld). Add Lwid; Acpt.

Step 10: Base terms: (loglen Slim logtrks Hwy logsigs1 Itg Lane logadt Shld Lwid). Add Acpt.

so some description is in order. At Step 1, the mean function consists of the single term log(len) beyond the intercept, because this term is to be included in all mean functions. Ten mean functions can be obtained by adding one of the remaining 10 candidate terms, counting Hwy as a single term. For each candidate mean function, the df, RSS, and number of terms p = p_C in the mean function are printed, as are

PRESS and the three information criteria. If none of the terms were factors, then all three information criteria would order the terms identically. Since Hwy is a factor, the ordering need not be the same on all criteria, and PRESS may choose a different ordering from the other three. All the criteria agree on the best term to add at the first step, since adding Slim gives the smallest value of each criterion. Step 2 starts with the base mean function consisting of log(len) and Slim, and PRESS selects log(trks) at this step. Both Cp and BIC would select a different term at this step, leading to different results (see Problem 10.10). This process is repeated at each step.

The candidate mean function with the smallest value of PRESS is given by (log(len), Slim, log(trks), Hwy, log(sigs1)). Several other subsets have values of PRESS that differ from this one by only a trivial amount, and, since the values of all the criteria are random variables, declaring this subset to be the best needs to be tempered with a bit of common sense. The estimated active predictors should be selected from among the few essentially equivalent subsets on some other grounds, such as agreement with theory. The candidate for the active subset has R² = 0.765, as compared to the maximum possible value of 0.791, the R² for the mean function using all the predictors. Further analysis of this problem is left to the homework problems.

10.3.1 Subset Selection Overstates Significance

All selection methods can overstate significance. Consider another simulated example. A data set of n = 100 cases with a response Y and 50 predictors X = (X1,...,X50) was generated using standard normal random deviates, so there are no active terms, and the true multiple correlation between Y and X is also exactly zero. The regression of Y on X is summarized in the first line of Table 10.8. The value of R² = 0.54 may seem surprisingly large, considering that all the data are independent random numbers.
The overall F-test, which is on a scale more easily calibrated, gives a p-value of .301 for these data; Rencher and Pun (1980) and Freedman (1983) report similar simulations, with the overall p-value varying from near 0 to near 1, as it should since the null hypothesis β = 0 is true. In the simulation reported here, 11 of the 50 terms had |t| > 2, while 7 of the 50 had |t| > 2.5.

Line 2 of Table 10.8 displays summary statistics from the fit of the mean function that retains all the terms with |t| > 2. The value of R² drops, but the major change is in the perceived significance of the result: the overall F now has a p-value of about .001, and the t-values for five terms exceed 2 in absolute value. The third line is similar to the second, except that the more stringent cutoff |t| > 2.5 was used. Using seven terms, R² = 0.20, and four terms have t-values exceeding 2 in absolute value.

This example demonstrates many lessons. Significance is overstated. The coefficients for the terms left in the mean function will generally be too large in absolute value and have t- or F-values that are too large. Even if the response and the predictors are unrelated, R² can be large: when β = 0, the expected value of R² is k/(n − 1). With selection, R² can be much too large.
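A small simulation in the spirit of this example (our own sketch, not the book's code) confirms that, with β = 0 and no selection, R² averages about k/(n − 1) = 50/99 ≈ 0.505, so an R² near 0.54 from pure noise is not surprising:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, reps = 100, 50, 200
r2 = []
for _ in range(reps):
    X = rng.standard_normal((n, k))
    y = rng.standard_normal(n)            # response unrelated to the predictors
    Z = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ b
    tss = np.sum((y - y.mean()) ** 2)
    r2.append(1.0 - resid @ resid / tss)

print(np.mean(r2))   # close to k/(n - 1) = 50/99, even though beta = 0
```

Selecting terms by their t-values and refitting, as in lines 2 and 3 of Table 10.8, would then make the retained coefficients and test statistics look far more significant than the null truth warrants.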

TABLE 10.8 Results of a Simulated Example with 50 Terms and n = 100

Method         Number of Terms   R²     p-value of Overall F   Number |t| > 2   Number |t| > 2.5
No selection   50                0.54   .301                   11               7
|t| > 2        11                …      ≈ .001                 5                …
|t| > 2.5      7                 0.20   …                      4                …

(Entries shown are those recoverable from the text; the remaining values were not preserved in this transcription.)

10.4 WINDMILLS

The windmill data discussed in Problems 2.13, 4.6, and 6.11 provide another case study for model selection. In Problem 2.13, only the wind speed at the reference site was used in the mean function. In Problem 6.11, wind direction at the reference site was used to divide the data into 16 bins, and a separate regression was fit in each of the bins, giving a mean function with 32 parameters. We now consider several other potential mean functions.

10.4.1 Six Mean Functions

For this particular candidate site, we used as a reference site the closest site where the National Center for Environmental Modeling data are collected, which is southwest of the candidate. There are additional possible reference sites to the northwest, the northeast, and the southeast of the candidate site. We could use data from all four of these sites to predict wind speed at the candidate site. In addition, we could consider the use of lagged variables, in which we use the wind speed at the reference site six hours before the current time to model the current wind speed at the candidate site. Lagged variables are commonly used with data collected at regularly spaced intervals and can help account for serial correlations between consecutive measurements. In all, we will consider six different mean functions for predicting CSpd, using the terms defined in Table 10.9:

[Model 1] E(CSpd|Spd1) = β0 + β1 Spd1. This was fit in Problem 2.13.

[Model 2] Fit as in Model 1, but with a separate intercept and slope for each of the 16 bins determined by the wind direction at reference site 1.

[Model 3] This mean function uses the information about the wind directions in a different way.
Writing θ for the wind direction at the reference site, the mean function is

E(CSpd|X) = β0 + β1 Spd1 + β2 cos(θ) + β3 sin(θ) + β4 cos(θ)Spd1 + β5 sin(θ)Spd1

This mean function uses four terms to include the information in the wind direction. The term cos(θ)Spd1 is the wind component in the east-west

direction, while sin(θ)Spd1 is the component in the north-south direction. The terms in sine and cosine alone are included to allow information from the wind direction alone to enter the mean function.

TABLE 10.9 Description of Data in the Windmill Data in the File wm4.txt

Date: Date and time of measurement; 2002/3/4/12 means March 4, 2002 at 12 hours after midnight
Dir1: Wind direction θ at reference site 1 in degrees
Spd1: Wind speed at reference site 1 in meters per second; site 1 is the closest site to the candidate site
Spd2: Wind speed at reference site 2 in m/s
Spd3: Wind speed at reference site 3 in m/s
Spd4: Wind speed at reference site 4 in m/s
Spd1Lag1: Wind speed at reference site 1 six hours previously
Spd2Lag1: Wind speed at reference site 2 six hours previously
Spd3Lag1: Wind speed at reference site 3 six hours previously
Spd4Lag1: Wind speed at reference site 4 six hours previously
Bin: Bin number
Spd1Sin1: Spd1 × sin(θ), site 1
Spd1Cos1: Spd1 × cos(θ), site 1
CSpd: Wind speed in m/s at the candidate site

[Model 4] This model uses the mean function E(CSpd|X) = β0 + β1 Spd1 + β2 Spd1Lag1, which ignores information from the angles but includes information from the wind speed at the previous period.

[Model 5] This model uses the wind speeds from all four sites, E(CSpd|X) = β0 + β1 Spd1 + β2 Spd2 + β3 Spd3 + β4 Spd4.

[Model 6] The final mean function starts with Model 5 and then adds information on the lagged wind speeds: E(CSpd|X) = β0 + β1 Spd1 + β2 Spd2 + β3 Spd3 + β4 Spd4 + β5 Spd1Lag1 + β6 Spd2Lag1 + β7 Spd3Lag1 + β8 Spd4Lag1.

All six of these mean functions were fit using the data in the file wm4.txt. The first case in the data does not have a value for the lagged variables, so it has been deleted from the file. Since the month of May 2002 is also missing, the first case in June 2002 was also deleted.
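Constructing a lagged term such as Spd1Lag1 is a simple shift of the series by one observation (six hours). The sketch below uses simulated speeds, not the windmill file, and shows why the first case must be dropped.

```python
import numpy as np

rng = np.random.default_rng(4)
spd1 = 8.0 + rng.standard_normal(20)   # stand-in for six-hourly wind speeds

# Spd1Lag1[t] is the speed one observation (six hours) earlier
lag1 = np.full_like(spd1, np.nan)
lag1[1:] = spd1[:-1]

# the first case has no lagged value, so it is deleted before fitting
keep = ~np.isnan(lag1)
print(spd1[keep].size, lag1[keep].size)
```

A gap in the record, like the missing month of May 2002, would similarly make the lag invalid for the first case after the gap, which is why that case is deleted as well.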

TABLE 10.10 Summary Criteria for the Fit of Six Mean Functions to the Windmill Data

Rows: Models 1-6; columns: df, AIC, BIC, PRESS. (The numerical entries were not preserved in this transcription.)

Table 10.10 gives the information criteria for comparing the fits of these six mean functions, along with PRESS. All three criteria agree on the ordering of the models. The simplest model, Model 1, is preferred over Model 3; evidently, the information in the sine and cosine of the direction is not helpful. Adding the lagged wind speed in Model 4 is clearly helpful, and apparently is more useful than the information from binning the directions used in Model 2. Adding information from four reference sites, as in Models 5 and 6, gives a substantial improvement, with about a 15% decrease in the criterion statistics. Model 6, which includes lags but no information on angles, appears to be the most appropriate model here.

10.4.2 A Computationally Intensive Approach

The windmill data provide an unusual opportunity to look at model selection by examining the related problem of estimating the long-term average wind speed, not at the candidate site, but at the closest reference site. The data file wm5.txt³ gives 55 years of data from all four sites. We can simulate the original problem by estimating a regression model for predicting wind speed at site 1, given the data from the remaining three sites, and then see how well we do by comparing the prediction to the actual value, which is the known average over the 55 years of data. We used the following procedure:

1. The year 2002 at the candidate site had n = 1116 data points. We begin by selecting n time points at random from the 55 years of data to comprise the year of complete data.

2. For the sample of times selected, fit the models we wish to compare. In this simulation, we considered only the model with one site as the predictor without binning; one site as predictor with wind directions binned into 16 bins; and the model using the wind speeds at all three remaining sites as predictors, without using bins or lagged variables.
Data from the simulated year were used to estimate the parameters of the model, predict the average wind speed

³Because this file is so large, it is not included with the other data files and must be downloaded separately from the web site for this book.

over the remaining time points, and also compute the standard error of the prediction, using the methodology outlined previously in this section and in Problems 2.13 and 4.6.

3. Repeat the first two steps 1000 times, and summarize the results in histograms.

The summarizing histograms are shown in Figure 10.1. The first column shows the histograms for the estimates of the long-term average wind speed for the three mean functions. The vertical dashed lines indicate the true mean wind speed at site 1 over the 55 years of data collection. All three methods have distributions of estimates that are centered very close to the true value, and the distributions appear to be more or less normal. The second column gives the standard errors of the estimated mean for the 1000 simulations, and the dashed line corresponds to the

[Figure 10.1: six histograms in two columns, with panels labeled "Reference = Spd2, ignore bins", "Reference = Spd2, 16 bins", and "Three references", each paired with a corresponding SE panel; the vertical axes show Frequency.]

FIG. 10.1 Summary of the simulation for the windmill data. The first column gives a histogram of the estimated mean wind speed at reference site 1 for 1000 simulations using three mean functions. The second column gives a histogram of the 1000 standard errors. The dashed lines give the true values: the average of the wind speed measurements from 1948 to 2003 for the averages, and the standard deviation of the 1000 averages from the simulation for the standard errors.

true value, which is actually the standard deviation of the 1000 means in the first column. In each case, most of the histogram is to the right of the dashed line, indicating that the standard formulas will generally overestimate the actual standard error, perhaps by 5%. Also, the mean functions that use only one reference, with or without binning, are extremely similar, suggesting only a trivial improvement due to binning. Using three references, however, shifts the distribution of the standard errors to the left, so this method is much more precise than the others.

Generalizing these results to prediction of the candidate site from reference site 1 does not seem to be too large a leap. This would suggest that we can do better with more reference sites than with one, and that the information about wind direction, at least at this candidate site, is probably unimportant.

PROBLEMS

10.1. Generate data as described for the two simulated data sets in Section 10.1, and compare the results you get to the results given in the text.

10.2. Using the data in Table 10.11, with a response Y and three predictors X1, X2, and X3, from Mantel (1970), in the file mantel.txt, apply the BE and FS algorithms, using Cp as a criterion function. Also, find AIC and Cp for all possible models, and compare results. What is X_A?

TABLE 10.11 Mantel's Data for Problem 10.2
Columns: Y, X1, X2, X3. (The data values were not preserved in this transcription; see the file mantel.txt.)

10.3. Use BE with the highway accident data and compare with the results in Table 10.7.

10.4. For the boys in the Berkeley Guidance Study in Problem 3.1, find a model for HT18 as a function of the other variables for ages 9 or earlier. Perform a complete analysis, including selection of transformations and a diagnostic analysis, and summarize your results.

10.5. An experiment was conducted to study O2UP, oxygen uptake in milligrams of oxygen per minute, given five chemical measurements shown in Table 10.12 (Moore, 1975). The data were collected on samples of dairy wastes kept in suspension in water in a laboratory for 220 days. All observations were on the

TABLE 10.12 Oxygen Uptake Experiment

Day: Day number
BOD: Biological oxygen demand
TKN: Total Kjeldahl nitrogen
TS: Total solids
TVS: Total volatile solids
COD: Chemical oxygen demand
O2UP: Oxygen uptake

same sample over time. We desire an equation relating log(O2UP) to the other variables. The goal is to find variables that should be further studied, with the eventual goal of developing a prediction equation; Day cannot be used as a predictor. The data are given in the file dwaste.txt. Complete the analysis of these data, including a complete diagnostic analysis. What diagnostic indicates the need for transforming O2UP to a logarithmic scale?

10.6. Prove the results (10.4)-(10.5). To avoid tedious algebra, start with an added-variable plot for X_j after all the other terms in the mean function. The estimated slope β̂_j is the OLS estimated slope in the added-variable plot. Find the standard error of this estimate, and show that it agrees with the given equations.

10.7. Galápagos Islands. The Galápagos Islands off the coast of Ecuador provide an excellent laboratory for studying the factors that influence the development and survival of different life species. Johnson and Raven (1973) have presented data in the file galapagos.txt, giving the number of species and related variables for 29 different islands (Table 10.13). Counts are given for both the total number of species and the number of species that occur only on that one island (the endemic species). Use these data to find the factors that influence diversity, as measured by some function of the number of species and the number of endemic species, and summarize your results. One complicating factor is that elevation is not recorded for six very small islands, so some provision must be made for this. Four possibilities are: (1) find the elevations; (2) delete these six islands from the data; (3) ignore elevation as a predictor of diversity; or (4) substitute a plausible value for the missing data.
Examination of large-scale maps suggests that none of these elevations exceed … .

10.8. Suppose that (10.1) holds with β_I = 0, but we fit a subset model using the terms X_C ⊂ X_A; that is, X_C does not include all the relevant terms. Give general conditions under which the mean function E(Y|X_C) is a linear mean function. (Hint: See Appendix A.2.4.)

TABLE 10.13 Galápagos Island Data

Island: Island name
NS: Number of species
ES: Number of endemic species (occurs only on that island)
Area: Surface area of island, hectares
Anear: Area of closest island, hectares
Dist: Distance to closest island, kilometers
DistSC: Distance from Santa Cruz Island, kilometers
Elevation: Elevation in m, missing values given as zero
EM: 1 if elevation is observed, 0 if missing

10.9. For the highway accident data, fit the regression model with active predictors given by the subset with the smallest value of PRESS in Table 10.7. The coefficient estimate of Slim is negative, meaning that segments with higher speed limits have lower accident rates. Explain this finding.

10.10. Reëxpress Cp as a function of the F-statistic used for testing the null hypothesis (10.6) against the alternative (10.1). Discuss.

10.11. In the windmill data discussed in Section 10.4, data were collected at the candidate site for about a year, or about 1200 observations. One issue is whether the collection period could be shortened to six months, about 600 observations, or to three months, about 300 observations, and still give a reliable estimate of the long-term average wind speed. Design and carry out a simulation experiment using the data described in Section 10.4.2 to characterize the increase in error due to shortening the collection period. For the purpose of the simulation, consider site #1 to be the candidate site and site #2 to be the reference site, and consider only the use of Spd2 to predict Spd1. (Hint: The sampling scheme used in Section 10.4.2 may not be appropriate for time periods shorter than a year because of seasonal variation. Rather than picking 600 observations at random to make up a simulated six-month period, a better idea might be to pick a starting observation at random, and then take 600 consecutive observations to comprise the simulated six months.)

CHAPTER 11

Nonlinear Regression

A regression mean function cannot always be written as a linear combination of the terms. For example, in the turkey diet supplement experiment described in Section 1.1, the mean function

E(Y|X = x) = θ1 + θ2(1 − exp(−θ3 x))    (11.1)

where Y was growth and X the amount of supplement added to the turkey diet, was suggested. This mean function has three parameters, θ1, θ2, and θ3, but only one predictor, X. It is a nonlinear mean function because it is not a linear combination of the parameters: in (11.1), θ2 multiplies 1 − exp(−θ3 x), and θ3 enters through the exponent. Another nonlinear mean function we have already seen was used in estimating transformations of predictors to achieve linearity, given by

E(Y|X = x) = β0 + β1 ψS(x, λ)    (11.2)

where ψS(x, λ) is the scaled power transformation defined by (7.3), page 150. This is a nonlinear model because the slope parameter β1 multiplies ψS(x, λ), which depends on the parameter λ. In Chapter 7, we estimated λ visually and then estimated the βs from the linear model, assuming λ is fixed at its estimated value. If we estimate all three parameters simultaneously, then the mean function is nonlinear.

Nonlinear mean functions usually arise when we have additional information about the dependence of the response on the predictor. Sometimes, the mean function is selected because the parameters of the function have a useful interpretation. In the turkey growth example, when X = 0, E(Y|X = 0) = θ1, so θ1 is the expected growth with no supplementation. Assuming θ3 > 0, as X increases, E(Y|X = x) will approach θ1 + θ2, so the sum of the first two parameters is the maximum growth possible for any dose, called the asymptote, and θ2 is the maximum additional growth due to supplementation. The final parameter θ3 is a rate parameter; for larger

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright © 2005 John Wiley & Sons, Inc.

values of θ3, the expected growth approaches its maximum more quickly than it would if θ3 were smaller.

11.1 ESTIMATION FOR NONLINEAR MEAN FUNCTIONS

Here is the general setup for nonlinear regression. We have a set of p terms X, and a vector θ = (θ1,...,θk)′ of parameters such that the mean function relating the response Y to X is given by

E(Y|X = x) = m(x, θ)    (11.3)

We call the function m a kernel mean function. The two examples of m we have seen so far in this chapter are in (11.1) and (11.2), but there are of course many other choices, both simpler and more complex. The linear kernel mean function, m(x, θ) = x′θ, is a special case of the nonlinear kernel mean function. Many nonlinear mean functions impose restrictions on the parameters, like θ3 > 0 in (11.1).

As with linear models, we also need to specify the variance function, and for this we will use exactly the same structure as for the linear model and assume

Var(Y|X = x_i) = σ²/w_i    (11.4)

where, as before, the w_i are known, positive weights and σ² is an unknown positive number. Equations (11.3) and (11.4), together with the assumption that observations are independent of each other, define the nonlinear regression model. The only difference between the nonlinear regression model and the linear regression model is the form of the mean function, and so we should expect that there will be many parallels that can be exploited.

The data consist of observations (x_i, y_i), i = 1,...,n. Because we have retained the assumption that observations are independent and that the variance function (11.4) is known apart from the scale factor σ², we can use least squares to estimate the unknown parameters, so we need to minimize over all permitted values of θ the residual sum of squares function,

RSS(θ) = Σ_{i=1}^{n} w_i (y_i − m(x_i, θ))²    (11.5)

We have ols if all the weights are equal and wls if they are not all equal. For linear models, there is a formula for the value θ̂ of θ that minimizes RSS(θ), given at (A.21) in the Appendix.
For nonlinear regression, there generally is no formula, and minimization of (11.5) is a numerical problem. We present some theory now that will approximate (11.5) at each iteration of a computing algorithm by a nearby linear regression problem. Not only will this give one of the standard computing algorithms used for nonlinear regression but it will also provide expressions

for approximate standard errors and point out how to do approximate tests. The derivation uses some calculus.

We begin with a brief refresher on approximating a function using a Taylor series expansion.¹ In the scalar version, suppose we have a function g(β), where β is a scalar. We want to approximate g(β) for values of β close to some fixed value β*. The Taylor series approximation is

g(β) = g(β*) + (β − β*) dg(β)/dβ + ½(β − β*)² d²g(β)/dβ² + Remainder    (11.6)

All the derivatives in equation (11.6) are evaluated at β*, and so the Taylor series approximates g(β), the function on the left side of (11.6), using the polynomial in β on the right side of (11.6). We have only shown a two-term Taylor expansion and have collected all the higher-order terms into the remainder. By taking enough terms in the Taylor expansion, any function g can be approximated as closely as wanted. In most statistical applications, only one or two terms of the Taylor series are needed to get an adequate approximation. Indeed, in the application of the Taylor expansion here, we will mostly use a one-term expansion that includes the quadratic term in the remainder.

When g(θ) is a function of a vector-valued parameter θ, the two-term Taylor series is very similar,

g(θ) = g(θ*) + (θ − θ*)′u(θ*) + ½(θ − θ*)′H(θ*)(θ − θ*) + Remainder    (11.7)

where we have defined two new quantities in (11.7), the score vector u(θ*) and the Hessian matrix H(θ*). If θ has k elements, then u(θ*) also has k elements, and its jth element is given by ∂g(θ)/∂θj, evaluated at θ = θ*. The Hessian is a k × k matrix whose (l, j) element is the partial second derivative ∂²g(θ)/(∂θl ∂θj), evaluated at θ = θ*.

We return to the problem of minimizing (11.5). Suppose we have a current guess θ* of the value of θ that will minimize (11.5). The general idea is to approximate m(x_i, θ) using a Taylor approximation around θ*.
Using a one-term Taylor series, ignoring the term with the Hessian in (11.7), we get

m(x_i, θ) ≈ m(x_i, θ*) + u_i(θ*)′(θ − θ*)    (11.8)

We have put the subscript i on the u because the value of the derivatives can be different for every value of x_i. The u_i(θ*) play the same role as the terms in the multiple linear regression model. There are as many elements of u_i(θ*) as parameters in the mean function. The difference between nonlinear and linear models is that the u_i(θ*) may depend on unknown parameters, while in multiple linear regression, the terms depend only on the predictors.

¹Jerzy Neyman (1894–1981), one of the major figures in the development of statistics in the twentieth century, often said that arithmetic had five basic operations: addition, subtraction, multiplication, division, and Taylor series.
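The accuracy of the one-term expansion (11.8) is easy to check numerically. The short sketch below does so for the mean function (11.1); the particular parameter values and the size of the perturbation are arbitrary illustrations, not values from the text.

```python
import numpy as np

def m(x, th):                 # the mean function (11.1)
    return th[0] + th[1] * (1.0 - np.exp(-th[2] * x))

def u(x, th):                 # score vector: derivatives of m in theta
    ex = np.exp(-th[2] * x)
    return np.array([1.0, 1.0 - ex, th[1] * x * ex])

x = 0.2
th_star = np.array([620.0, 180.0, 8.0])        # current guess theta*
th = th_star + np.array([1.0, -2.0, 0.05])     # a nearby theta
approx = m(x, th_star) + u(x, th_star) @ (th - th_star)   # one-term (11.8)
exact = m(x, th)
# exact and approx agree to within a few thousandths here, even though
# the mean function itself changed by a much larger amount
```

For parameter changes this small, the linear approximation error is an order of magnitude smaller than the change in the mean function itself, which is what makes the algorithm of the next paragraphs work.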

Substitute the approximation (11.8) into (11.5) and simplify to get

RSS(θ) = Σ_{i=1}^{n} w_i [y_i − m(x_i, θ)]²
       ≈ Σ_{i=1}^{n} w_i [y_i − m(x_i, θ*) − u_i(θ*)′(θ − θ*)]²
       = Σ_{i=1}^{n} w_i [ê_i − u_i(θ*)′(θ − θ*)]²    (11.9)

where ê_i = y_i − m(x_i, θ*) is the ith working residual that depends on the current guess θ*. The approximate RSS(θ) is now in the same form as the residual sum of squares function for multiple linear regression (5.5), with response given by the ê_i, terms given by the u_i(θ*), parameters given by θ − θ*, and weights w_i. We switch to matrix notation and let U(θ*) be an n × k matrix with ith row u_i(θ*)′, W an n × n diagonal matrix of weights, and ê = (ê1,...,ên)′. The least squares estimate is then

θ − θ* = [U(θ*)′WU(θ*)]⁻¹ U(θ*)′Wê    (11.10)

or, solving for θ and calling the solution θ̂,

θ̂ = θ* + [U(θ*)′WU(θ*)]⁻¹ U(θ*)′Wê    (11.11)

We will use (11.10) in two ways, first to get a computing algorithm for estimating θ in the rest of this section and then as a basis for inference in the next section.

Here is the Gauss–Newton algorithm that is suggested by (11.10)–(11.11):

1. Select an initial guess θ(0) for θ, and compute RSS(θ(0)).
2. Set the iteration counter at j = 0.
3. Compute U(θ(j)) and ê(j) with ith element y_i − m(x_i, θ(j)). Evaluating (11.11) requires the estimate from a weighted linear least squares problem, with response ê(j), predictors U(θ(j)), and weights given by the w_i. The new estimate is θ(j+1). Also, compute the residual sum of squares RSS(θ(j+1)).
4. Stop if RSS(θ(j)) − RSS(θ(j+1)) is sufficiently small, in which case there is convergence. Otherwise, set j = j + 1. If j is too large, stop, and declare that the algorithm has failed to converge. If j is not too large, go to step 3.

The Gauss–Newton algorithm estimates the parameters of a nonlinear regression problem by a sequence of approximating linear wls calculations. Most statistical software for nonlinear regression uses the Gauss–Newton algorithm, or a modification of it, for estimating parameters.
Some programs allow using a general function minimizer based on some other algorithm to minimize (11.5). We provide some references at the end of the chapter.
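The four steps above can be sketched in a few lines of code. Everything in the sketch below is a hypothetical illustration: the simulated data, the starting values, and the stopping rules are stand-ins, not the book's turkey data or any particular package's implementation, and production implementations add safeguards (such as step halving) that plain Gauss–Newton lacks.

```python
import numpy as np

def gauss_newton(m, jac, y, x, theta0, w=None, max_iter=100, tol=1e-9):
    """Minimize RSS(theta) = sum_i w_i (y_i - m(x_i, theta))^2 by the
    Gauss-Newton algorithm: each step solves the weighted linear least
    squares problem with response e-hat and predictors U(theta)."""
    theta = np.asarray(theta0, dtype=float)
    w = np.ones(len(y)) if w is None else np.asarray(w, dtype=float)
    rss = np.sum(w * (y - m(x, theta)) ** 2)
    for _ in range(max_iter):
        U = jac(x, theta)                   # n x k matrix of score vectors
        ehat = y - m(x, theta)              # working residuals
        UW = U * w[:, None]
        step = np.linalg.solve(UW.T @ U, UW.T @ ehat)   # wls step, (11.10)
        theta = theta + step                            # update, (11.11)
        new_rss = np.sum(w * (y - m(x, theta)) ** 2)
        if abs(rss - new_rss) < tol:        # step 4: declare convergence
            return theta, new_rss
        rss = new_rss
    return theta, rss                       # hit max_iter without converging

# Kernel mean function (11.1) and its derivatives with respect to theta
def m(x, th):
    return th[0] + th[1] * (1.0 - np.exp(-th[2] * x))

def jac(x, th):
    ex = np.exp(-th[2] * x)
    return np.column_stack([np.ones_like(x), 1.0 - ex, th[1] * x * ex])

# Hypothetical data simulated from theta = (620, 180, 8)
rng = np.random.default_rng(0)
x = np.repeat([0.0, 0.04, 0.10, 0.16, 0.28, 0.44], 5)
y = m(x, [620.0, 180.0, 8.0]) + rng.normal(0.0, 5.0, size=x.size)
theta_hat, rss = gauss_newton(m, jac, y, x, theta0=[600.0, 200.0, 5.0])
```

From the deliberately off starting value (600, 200, 5), the iteration settles near the generating parameters in a handful of steps, because each step is just a weighted linear regression of the working residuals on the score vectors.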

There appear to be two impediments to the use of the Gauss–Newton algorithm. First, the score vectors, which are the derivatives of m with respect to the parameters, are needed. Some software may require the user to provide expressions for the derivatives, but many packages compute derivatives using either symbolic or numeric differentiation. Also, the user must provide starting values θ(0); there appears to be no general way to avoid specifying starting values. The optimization routine may also converge to a local minimum of the residual sum of squares function rather than a global minimum, and so finding good starting values can be very important in some problems. With poor starting values, an algorithm may fail to converge to any estimate. We will shortly discuss starting values in the context of an example.

11.2 INFERENCE ASSUMING LARGE SAMPLES

We repeat (11.11), but now we reinterpret θ* as the true, unknown value of θ. In this case, the working residuals ê are now the actual errors e, the differences between the response and the true means. We write

θ̂ = θ* + [U(θ*)′WU(θ*)]⁻¹ U(θ*)′We    (11.12)

This equation is based on the assumption that the nonlinear kernel mean function can be accurately approximated close to θ* by the linear approximation (11.8), and this can be guaranteed only if the sample size n is large enough. We then see that θ̂ is equal to the true value plus a linear combination of the elements of e, and by the central limit theorem θ̂ under regularity conditions will be approximately normally distributed,

θ̂ ∼ N(θ*, σ²[U(θ*)′WU(θ*)]⁻¹)    (11.13)

An estimate of the large-sample variance is obtained by replacing the unknown θ* by θ̂ on the right side of (11.13),

V̂ar(θ̂) = σ̂²[U(θ̂)′WU(θ̂)]⁻¹    (11.14)

where the estimate of σ² is

σ̂² = RSS(θ̂)/(n − k)    (11.15)

where k is the number of parameters estimated in the mean function.
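Equations (11.14)–(11.15) translate directly into code. The sketch below is a minimal illustration; to make it checkable, it uses the linear kernel m(x, θ) = θ1 + θ2x, for which U has columns (1, x) and the answer must agree with the familiar ols variance formulas, and the RSS value plugged in is hypothetical.

```python
import numpy as np

def large_sample_cov(U, w, rss, n, k):
    """Var-hat(theta-hat) = sigma2 * [U'WU]^{-1} with sigma2 = RSS/(n - k),
    equations (11.14)-(11.15); returns (standard errors, covariance)."""
    sigma2 = rss / (n - k)
    cov = sigma2 * np.linalg.inv((U * w[:, None]).T @ U)
    return np.sqrt(np.diag(cov)), cov

# Linear-kernel check: U = (1, x), equal weights, hypothetical RSS = 2.5
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
U = np.column_stack([np.ones_like(x), x])
w = np.ones_like(x)
se, cov = large_sample_cov(U, w, rss=2.5, n=5, k=2)
```

For a genuinely nonlinear kernel, U is simply the Jacobian of m evaluated at θ̂, so the same function applies unchanged; only the interpretation of the result as a large-sample approximation differs.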
These results closely parallel the results for the linear model, and consequently the inferential methods, such as F- and t-tests and the analysis of variance for comparing nested mean functions, can be used for nonlinear models. One change that is recommended is to use the normal distribution rather than the t for inferences where the t would be relevant, but since (11.13) is really expected to be valid only

in large samples, this is hardly important. We emphasize that in small samples, large-sample inferences may be inaccurate.

We can illustrate using these results with the turkey growth experiment. Methionine is an amino acid that is essential for normal growth in turkeys. Depending on the ingredients in the feed, turkey producers may need to add supplemental methionine for a proper diet. Too much methionine could be toxic. Too little methionine could result in malnourished birds. An experiment was conducted to study the effects on turkey growth of different amounts A of methionine, ranging from a control with no supplementation to 0.44% of the total diet. The experimental unit was a pen of young turkeys, and treatments were assigned to pens at random so that 10 pens received the control (no supplementation) and 5 pens received each of the other five amounts used in the experiment, for a total of 35 pens. Pen weights, the average weight of the turkeys in the pen, were obtained at the beginning and the end of the experiment three weeks later. The response variable is Gain, the average weight gain in grams per turkey in a pen. The weight gains are shown in Table 11.1 and are also given in the file turk0.txt (Cook and Witmer, 1985). The primary goal of this experiment is to understand how expected weight gain E(Gain|A) changes as A is varied.

The data are shown in Figure 11.1. In Figure 11.1, E(Gain|A) appears to increase with A, at least over the range of values of A in the data. In addition, there is considerable pen-to-pen variation, reflected by the variability between repeated observations at the same value of A. The mean function is certainly not a straight line, since the difference in the means when A > 0.3 is much smaller than the difference in means when A < 0.2. While a polynomial of degree two or three might well match the mean at the six values of A in the experiment, it will surely not match the data outside the range of A, and the parameters would have little physical meaning (see Problem 6.15).
A nonlinear mean function is preferable for this problem. For turkey growth as a function of an amino acid, the mean function

E(Gain|A) = θ1 + θ2(1 − exp(−θ3 A))    (11.16)

TABLE 11.1 The Turkey Growth Data

Amount, A    Gain
   …         …, 631, 661, 624, …
   …         …, 615, 605, 608, …
   …         …, 667, 657, 685, …
   …         …, 715, 717, 709, …
   …         …, 712, 726, 760, …
   …         …, 796, 763, 791, …
   …         …, 771, 799, 799, 791

FIG. 11.1 Turkey data: weight gain (g) versus amount (percent of diet).

was suggested by Parks (1982). To estimate the parameters in (11.16), we need starting values for θ. While there is no absolute rule for selecting starting values, the following approaches are often useful:

Guessing  Sometimes, starting values can be obtained by guessing values for the parameters. In the turkey data, from Figure 11.1, the intercept is about 620 and the asymptote is around 800. This leads to starting values θ1(0) = 620 and θ2(0) = 800 − 620 = 180. Guessing a value for the rate parameter θ3 is harder.

Solving equations for a subset of the data  Select as many distinct data points as parameters, and solve the equations for the unknown parameters. The hope is that the equations will be easy to solve. Selecting data points that are diverse often works well. In the turkey data, given θ1(0) = 620 and θ2(0) = 180 from the graph, we can get an initial estimate for θ3 by solving only one equation in one unknown. For example, when A = 0.16, a plausible value of Gain is Gain = 750, so

750 = 620 + 180(1 − exp(−θ3(0)(0.16)))

which is easily solved to give θ3(0) ≈ 8. Thus, we now have starting values for all three parameters.

Linearization  If possible, transform to a multiple linear regression mean function, and fit it to get starting values. In the turkey data, we can move the parameters θ1 and θ2 to the left side of the mean function to get

((θ1 + θ2) − y_i)/θ2 = exp(−θ3 A)

Taking logarithms of both sides,

log[((θ1 + θ2) − y_i)/θ2] = −θ3 A

Substituting the initial guesses θ1(0) = 620 and θ2(0) = 180 on the left side of this equation, we can compute an initial guess for θ3 by the linear regression of log[(800 − y_i)/180] on A, through the origin. The negative of the ols slope in this approximate problem provides the starting value θ3(0).

Many computer packages for nonlinear regression require specification of the function using an expression such as

y ~ th1 + th2*(1 - exp(-th3*a))

In this equation, the symbol ~ can be read "is modeled as," and the mathematical symbols +, -, * and / represent addition, subtraction, multiplication and division, respectively. Similarly, exp represents exponentiation, and ^, or, in some programs, **, is used for raising to a power. Parentheses are used according to the usual mathematical rules. Generally, the formula on the right of the model specification will be similar to an expression in a computer programming language such as Basic, Fortran, or C. This form of a computer model should be contrasted with the model statements described in earlier sections. Model statements for nonlinear models include explicit statement of both the parameters and the terms on the right side of the equation. For linear models, the parameters are usually omitted and only the terms are included in the model statement.

If the starting values are adequate and the nonlinear optimizer converges, output including the quantities in Table 11.2 will be produced. This table is very similar to the usual output for linear regression. The column marked Estimate gives θ̂. Since there is no necessary connection between terms and parameters, the lines of the table are labeled with the names of the parameters, not the names of the terms. The next column labeled Std. Error gives the square roots of the diagonal entries of the matrix given at (11.14), so the standard errors are based on the large-sample approximation.
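The starting-value strategies just described are easy to carry out numerically. In the sketch below, the intercept and asymptote guesses come from the text, but the four dose/mean-gain pairs used in the linearization step are hypothetical placeholders, not the actual pen weights.

```python
import numpy as np

th1_0, th2_0 = 620.0, 180.0   # intercept and asymptote guessed from the plot

# Solving one equation in one unknown:
#   750 = 620 + 180*(1 - exp(-th3 * 0.16))
th3_point = -np.log(1.0 - (750.0 - th1_0) / th2_0) / 0.16

# Linearization: log[((th1 + th2) - y)/th2] = -th3 * A, so an ols
# regression through the origin of the left side on A estimates -th3.
A = np.array([0.04, 0.10, 0.16, 0.28])         # hypothetical doses
gain = np.array([660.0, 705.0, 748.0, 782.0])  # hypothetical mean gains
z = np.log((th1_0 + th2_0 - gain) / th2_0)
th3_lin = -np.sum(A * z) / np.sum(A * A)       # minus the ols slope
```

Both routes give a rate near 8, matching the value obtained by hand above; either is an adequate starting value for the Gauss–Newton iteration.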
The column labeled t-value is the ratio of the estimate to its large-sample standard error and can be used for a test of the null hypothesis

TABLE 11.2 Nonlinear Least Squares Fit of (11.16)

Formula: Gain ~ th1 + th2 * (1 - exp(-th3 * A))
Parameters:
       Estimate   Std. Error   t-value   Pr(>|t|)
th1        …           …           …      < 2e-16
th2        …           …           …        …e-16
th3        …           …           …        …e-…
Residual standard error: … on 32 degrees of freedom

FIG. 11.2 Fitted mean function: weight gain (g) versus amount (percent of diet).

that a particular parameter is equal to zero against either a general or one-sided alternative. The column marked Pr(>|t|) is the significance level for this test, using a normal reference distribution rather than a t-distribution. Given at the foot of the table is the estimate σ̂ and its df, which is the number of cases minus the number of elements in θ that were estimated, df = 35 − 3 = 32.

Since this example has only one predictor, Figure 11.1 is a summary graph for this problem. Figure 11.2 repeats this figure, but with the fitted mean function

Ê(Gain|A = a) = θ̂1 + θ̂2(1 − exp(−7.122a))

added to the graph. The fitted mean function does not reproduce the possible decline of response for the largest value of A because it is constrained to increase toward an asymptote. For A = 0.28, the fitted function is somewhat less than the mean of the observed values, while at A = 0.44, it is somewhat larger than the mean of the observed values. If we believe that an asymptotic form is really appropriate for these data, then the fit of this mean function seems to be very good. Using the repeated observations at each level of A, we can perform a lack-of-fit test for the mean function, which is F = 2.50 with (3, 29) df, for a significance level of 0.08, so the fit appears adequate.

Three Sources of Methionine

The purpose of this experiment was not only to estimate the weight gain response curve as a function of amount of methionine added but also to decide if the source of methionine was important. The complete experiment included three sources that we will call S1, S2, S3. We can imagine a separate response curve such as Figure 11.2 for each of the three sources, and the goal might be to decide if the three response curves are different.

Suppose we create three dummy variables S_i, i = 1, 2, 3, so that S_i is equal to one if an observation is from source i, and it is zero otherwise. Assuming that the

mean function (11.16) is appropriate for each source, the largest model we might contemplate is

E(Gain|A = a, S1, S2, S3) = S1[θ11 + θ21(1 − exp(−θ31 a))]
                          + S2[θ12 + θ22(1 − exp(−θ32 a))]
                          + S3[θ13 + θ23(1 − exp(−θ33 a))]    (11.17)

This equation has a separate intercept, rate parameter, and asymptote for each group, and so has nine parameters. For this particular problem, this function has too many parameters because a dose of A = 0 from source 1 is the same as A = 0 with any of the sources, so the expected response at A = 0 must be the same for all three sources. This requires that θ11 = θ12 = θ13 = θ1, which is a model of common intercepts, different asymptotes, and different slopes,

E(Gain|A = a, S1, S2, S3) = θ1 + S1[θ21(1 − exp(−θ31 a))]
                               + S2[θ22(1 − exp(−θ32 a))]
                               + S3[θ23(1 − exp(−θ33 a))]    (11.18)

Other reasonable mean functions to examine include common intercepts and asymptotes but separate rate parameters,

E(Gain|A = a, S1, S2, S3) = θ1 + θ2{S1[1 − exp(−θ31 a)] + S2[1 − exp(−θ32 a)] + S3[1 − exp(−θ33 a)]}
                          = θ1 + θ2(1 − exp(−(Σ θ3i S_i) a))    (11.19)

and finally the mean function of identical mean functions, given by (11.16).

The data from this experiment are given in the data file turkey.txt. This file is a little different because it does not give the response in each pen, but rather for each combination of A and source it gives m, the number of pens with the combination, the mean response for those pens, and SD, the standard deviation of the pen responses. From Section 5.4, we can use these standard deviations to get a pure error estimate of σ² that can be used in lack-of-fit testing,

σ̂²_pe = SS_pe/df_pe = Σ(m − 1)SD² / Σ(m − 1)

with df_pe = Σ(m − 1) = 70.

The data are shown in Figure 11.3. A separate symbol was used for each of the three groups. Each point shown is an average over m pens, where m = 5 for every

FIG. 11.3 Turkey growth as a function of methionine added for three sources of methionine. The lines shown on the figure are for the fit of (11.18), the most general reasonable mean function for these data.

point except at A = 0, where m = 10. The point at A = 0 is common to all three groups.

The four mean functions (11.16)–(11.19) can all be fit using nonlinear weighted least squares, with weights equal to the m's. Starting values for the estimates can be obtained in the same way as for fitting for one group. Table 11.3 summarizes the fit of the four mean functions, giving the RSS and df for each. For comparing the mean functions, we start with a lack-of-fit test for the most restrictive, common mean function, for which the F-test for lack of fit is

F = [(RSS for (11.16) − SS_pe)/10] / σ̂²_pe

which, when compared with the F(10, 70) distribution, gives a p-value of about 0.65, or no evidence of lack of fit of this mean function. Since the most restrictive

TABLE 11.3 Four Mean Functions Fit to the Turkey Growth Data with Three Sources of Methionine

Source                                    df    SS    Change in df    Change in SS
Common mean function, (11.16)              …     …
Equal intercept and asymptote, (11.19)     …     …         …               …
Common intercepts, (11.18)                 …     …         …               …
Separate regressions, (11.17)              …     …         …               …
Pure error                                 …     …

mean function is adequate, we need not test the other mean functions for lack of fit. We can perform tests to compare the mean functions using the general F-testing procedure of Section 5.4. For example, to compare as a null hypothesis the equal intercept and asymptote mean function (11.19) versus the common intercept mean function (11.18), we need to compute the change in RSS, given in the table as 528.4 with 2 df. For the F-test, we can use σ̂²_pe in the denominator, and so

F = (528.4/2)/σ̂²_pe = 0.93 ∼ F(2, 70)

for which p = 0.61, suggesting no evidence against the simpler mean function. Continued testing suggests that the simplest mean function of no difference between sources appears appropriate for these data, so we conclude that there is no evidence of a difference in response curve due to source. If we did not have a pure error estimate of variance, the estimate of variance from the most general mean function would be used in the F-tests.

11.3 BOOTSTRAP INFERENCE

The inference methods based on large samples introduced in the last section may be inaccurate and misleading in small samples. We cannot tell in advance if the large-sample inference will be accurate or not, as it depends not only on the mean function but also on the way we parameterize it, since there are many ways to write the same nonlinear mean function, and on the actual values of the predictors and the response. Because of this possible inaccuracy, computing inferences in some other way, at least as a check on the large-sample inferences, is a good idea. One generally useful approach is to use the bootstrap introduced in Section 4.6. The case resampling bootstrap described there can be applied in nonlinear regression. Davison and Hinkley (1997) describe an alternative bootstrap scheme based on resampling residuals, but we will not discuss it here.
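The mechanics of the case resampling bootstrap for a nonlinear fit are simple: resample the cases (x_i, y_i) with replacement, refit, and collect the estimates. The sketch below illustrates this with a hypothetical two-parameter model and simulated data, a bare-bones Gauss–Newton fitter, and a deliberately small B so the example runs quickly (the text uses B = 999); none of it is the book's segreg data or code.

```python
import numpy as np

def m(x, th):                                 # hypothetical kernel mean function
    return th[0] * (1.0 - np.exp(-th[1] * x))

def fit(x, y, th0, iters=30):
    """Plain Gauss-Newton for the two-parameter model above."""
    th = np.array(th0, dtype=float)
    for _ in range(iters):
        ex = np.exp(-th[1] * x)
        U = np.column_stack([1.0 - ex, th[0] * x * ex])
        th = th + np.linalg.solve(U.T @ U, U.T @ (y - m(x, th)))
    return th

rng = np.random.default_rng(2)
x = np.linspace(0.1, 3.0, 40)
y = m(x, [10.0, 2.0]) + rng.normal(0.0, 0.4, x.size)
th_hat = fit(x, y, th0=(8.0, 1.5))

# Case resampling: draw cases (x_i, y_i) with replacement and refit,
# starting each refit at the full-data estimate
B, n = 199, x.size
boot = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = fit(x[idx], y[idx], th0=th_hat)

se_boot = boot.std(axis=0)                     # bootstrap standard errors
ci = np.percentile(boot, [2.5, 97.5], axis=0)  # percentile 95% intervals
```

Comparing se_boot and ci with the large-sample standard errors and normal-theory intervals is exactly the check recommended in this section; disagreement is a warning that (11.13) is a poor approximation for the problem at hand.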
We illustrate the use of the bootstrap with data in the file segreg.txt, which consists of measurements of electricity consumption in KWH and mean temperature in degrees F for one building on the University of Minnesota's Twin Cities campus for 39 months, courtesy of Charles Ng. The goal is to model consumption as a function of temperature. Higher temperature causes the use of air conditioning, so high temperatures should mean high consumption. This building is steam heated, so electricity is not used for heating. Figure 11.4 is a plot of C = consumption in KWH/day versus Temp, the mean temperature in degrees F. The mean function for these data is

E(C|Temp) = θ0                       Temp ≤ γ
E(C|Temp) = θ0 + θ1(Temp − γ)        Temp > γ

This mean function has three parameters: the level θ0 of the first phase, the slope θ1 of the second phase, and the knot, γ, and assumes that energy consumption is

FIG. 11.4 Electrical energy consumption per day as a function of mean temperature for one building. The line shown on the graph is the least squares fit.

unaffected by temperature when the temperature is below the knot, but the mean increases linearly with temperature beyond the knot. The goal is to estimate the parameters. The mean function can be combined into a single equation by writing

E(C|Temp) = θ0 + θ1 max(0, Temp − γ)

Starting values can be easily obtained from the graph, with θ0(0) = 70, θ1(0) = 0.5 and γ(0) = 40. The fitted model is summarized in Table 11.4. The baseline electrical consumption is estimated to be about θ̂0 ≈ 75 KWH per day. The knot is estimated to be at γ̂ ≈ 42°F, and the increment in consumption beyond that temperature is estimated by θ̂1, in KWH per degree increase. From Figure 11.4, one might get the impression that information about the knot is asymmetric: γ could be larger than 42 but is unlikely to be substantially less than 42. We might expect that in this case confidence or test procedures based on asymptotic normality will be quite poor. We can confirm this using the bootstrap.

TABLE 11.4 Regression Summary for the Segmented Regression Example

Formula: C ~ th0 + th1 * (pmax(0, Temp - gamma))
Parameters:
         Estimate   Std. Error   t value   Pr(>|t|)
th0          …           …           …      < 2e-16
th1          …           …           …        …e-06
gamma        …           …           …        …e-11
Residual standard error: … on 36 degrees of freedom
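One convenient way to fit this particular mean function, sketched below, exploits the fact that for a fixed knot γ the model is linear in (θ0, θ1): profile over a grid of candidate knots, estimating (θ0, θ1) by ols at each, and keep the γ with the smallest RSS. This is an alternative to the Gauss–Newton route, offered here only as an illustration; the data are simulated stand-ins with a knot at 42°F, not the segreg.txt measurements.

```python
import numpy as np

def fit_segmented(temp, c, knots):
    """Profile over the knot: for fixed gamma the mean function
    th0 + th1*max(0, Temp - gamma) is linear in (th0, th1), so estimate
    those by ols and keep the gamma giving the smallest RSS."""
    best = None
    for g in knots:
        X = np.column_stack([np.ones_like(temp), np.maximum(0.0, temp - g)])
        beta, _res, _rank, _sv = np.linalg.lstsq(X, c, rcond=None)
        rss = float(np.sum((c - X @ beta) ** 2))
        if best is None or rss < best[0]:
            best = (rss, beta[0], beta[1], g)
    rss, th0, th1, gamma = best
    return th0, th1, gamma, rss

# Simulated stand-in for the segreg data: 39 "months", knot at 42 F
rng = np.random.default_rng(1)
temp = rng.uniform(10.0, 80.0, size=39)
c = 72.0 + 0.55 * np.maximum(0.0, temp - 42.0) + rng.normal(0.0, 1.5, 39)
th0, th1, gamma, rss = fit_segmented(temp, c,
                                     knots=np.linspace(20.0, 60.0, 401))
```

The grid search sidesteps the non-smoothness of the mean function in γ, which is exactly what makes derivative-based algorithms (and the large-sample normal approximation for γ̂) delicate for this example.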

FIG. 11.5 Scatterplot matrix of estimates of the parameters θ0, θ1, and γ in the segmented regression example, computed from B = 999 case bootstraps.

Figure 11.5 is a scatterplot matrix of B = 999 bootstrap replications. All three parameters are estimated on each replication. This scatterplot matrix is a little more complicated than the ones we have previously seen. The diagonals contain histograms of the 999 estimates of each of the parameters. If the normal approximation were adequate, we would expect that each of these histograms would look like a normal density function. While this may be so for θ0, this is not the case for θ1 and for γ. As expected, the histogram for γ is skewed to the right, meaning that estimates of γ much larger than about 40 occasionally occur but smaller values almost never occur. The univariate normal approximations are therefore poor. The other graphs in the scatterplot matrix tell us about the distributions of the estimated parameters taken two at a time. If the normal approximation were to hold, these graphs should have approximately straight-line mean functions. The smoothers on Figure 11.5 are generally far from straight, and so the large-sample inferences are likely to be badly in error.

FIG. 11.6 Scatterplot matrix of bootstrap estimates of θ1, θ2, and θ3 for the turkey growth data. Two of the replicates were very different from the others and were deleted before graphing.

In contrast, Figure 11.6 is the bootstrap summary for the first source in the turkey growth data. Normality is apparent in the histograms on the diagonal, and a linear mean function seems plausible for most of the scatterplots, and so the large-sample inference is adequate here.

Table 11.5 compares the estimates and confidence intervals produced by large-sample theory and by the bootstrap. The bootstrap standard errors are the standard

TABLE 11.5 Comparison of Large-Sample and Bootstrap Inference for the Segmented Regression Data

              Large Sample            Bootstrap
              θ0     θ1     γ         θ0     θ1     γ
Estimate/Mean  …      …      …         …      …      …
SE/SD          …      …      …         …      …      …
2.5%           …      …      …         …      …      …
50%            …      …      …         …      …      …
97.5%          …      …      …         …      …      …

deviation of the bootstrap replicates, and the ends of the bootstrap 95% confidence interval are the 0.025 and 0.975 quantiles of the bootstrap replicates. The large-sample theory confidence interval is given by the usual rule of estimate plus or minus 1.96 times the standard error computed from large-sample theory. Although the bootstrap SDs match the large-sample standard errors reasonably well, the confidence intervals for both θ1 and for γ are shifted toward smaller values than the more accurate bootstrap estimates.

11.4 REFERENCES

Seber and Wild (1989) and Bates and Watts (1988) provide textbook-length treatments of nonlinear regression problems. Computational issues are also discussed in these references, and in Thisted (1988, Chapter 4). Ratkowsky (1990) provides an extensive listing of nonlinear mean functions that are commonly used in various fields of application.

PROBLEMS

11.1 Suppose we have a response Y, a predictor X, and a factor G with g levels. A generalization of the concurrent regression mean function given by Model 3 of Section 6.2.2 is, for j = 1,...,g,

E(Y|X = x, G = j) = β0 + β1j(x − γ)    (11.20)

for some point of concurrence γ.

11.1.1 Explain why (11.20) is a nonlinear mean function. Describe in words what this mean function specifies.

11.1.2 Fit (11.20) to the sleep data discussed in Section 6.2.2, so the mean function of interest is

E(TS|log(BodyWt) = x, D = j) = β0 + β1j(x − γ)

(Hint: To get starting values, fit the concurrent regression model with γ = 0. The estimate of γ will be very highly variable, as is often the case with centering parameters like γ in this mean function.)

11.2 In fisheries studies, the most commonly used mean function for expected length of a fish at a given age is the von Bertalanffy function (von Bertalanffy, 1938; Haddon, 2001), given by

E(Length|Age = t) = L∞(1 − exp(−K(t − t0)))    (11.21)

The parameter L∞ is the expected value of Length for extremely large ages, and so it is the asymptotic or upper limit to growth, and K is a growth rate parameter that determines how quickly the upper limit to growth is reached.
When Age = t0, the expected length of the fish is 0, which allows fish to have nonzero length at birth if t0 < 0.

11.2.1 The data in the file lakemary.txt give the Age in years and Length in millimeters for a sample of 78 bluegill fish from Lake Mary, Minnesota, in 1981 (courtesy of Richard Frie). Age is determined by counting the number of rings on a scale of the fish. This is a cross-sectional data set, meaning that all the fish were measured once. Draw a scatterplot of the data.

11.2.2 Use nonlinear regression to fit the von Bertalanffy function to these data. To get starting values, first guess at L∞ from the scatterplot to be a value larger than any of the observed values in the data. Next, divide both sides of (11.21) by the initial estimate of L∞, and rearrange terms to get just exp(−K(t − t0)) on the right of the equation. Take logarithms to get a linear mean function, and then use ols for the linear mean function to get the remaining starting values. Draw the fitted mean function on your scatterplot.

11.2.3 Obtain a 95% confidence interval for L∞ using the large-sample approximation, and using the bootstrap.

11.3 The data in the file walleye.txt give the length in mm and the age in years of a sample of over 3000 male walleye, a popular game fish, captured in Butternut Lake in Northern Wisconsin (LeBeau, 2004). The fish are also classified according to the time period in which they were captured, with period = 1 for pre-1990, period = 2 for …, and period = 3 for …. Management practices on the lake were different in each of the periods, so it is of interest to compare the length at age for the three time periods. Using the von Bertalanffy length at age function (11.21), compare the three time periods. If different, are all the parameters different, or just some of them? Which ones? Summarize your results.

11.4 A quadratic polynomial as a nonlinear model. The data in the file swan96.txt were collected by the Minnesota Department of Natural Resources to study the abundance of black crappies, a species of fish, on Swan Lake, Minnesota in 1996. The response variable is LCPUE, the logarithm of the catch of 200 mm or longer black crappies per unit of fishing effort.
It is believed that LCPUE is proportional to abundance. The single predictor is Day, the day on which the sample was taken, measured as the number of days after June 19. Some of the measurements were taken the following spring on the same population of fish before the young of the year are born in late June. No samples are taken during the winter months when the lake surface was frozen.

11.4.1 For these data, fit the quadratic polynomial

E(LCPUE|Day = x) = β0 + β1x + β2x²

assuming Var(LCPUE|Day = x) = σ². Draw a scatterplot of LCPUE versus Day, and add the fitted curve to this plot.

11.4.2 Using the delta method described in Section 6.1.2, obtain the estimate and variance for the value of Day that maximizes E(LCPUE|Day).

11.4.3 Another parameterization of the quadratic polynomial is

E(Y|X = x) = θ1 − 2θ2θ3x + θ3x²

where the θs can be related to the βs by θ1 = β0, θ2 = −β1/(2β2), θ3 = β2. In this parameterization, θ1 is the intercept, θ2 is the value of the predictor that gives the maximum value of the response, and θ3 is a measure of curvature. This is a nonlinear model because the mean function is a nonlinear function of the parameters. Its advantage is that at least two of the parameters, the intercept θ1 and the value of x that maximizes the response θ2, are directly interpretable. Use nonlinear least squares to fit this mean function. Compare your results to the first two parts of this problem.

11.5 Nonlinear regression can be used to select transformations for a linear regression mean function. As an example, consider the highway accident data, described in Table 7.1, with response log(Rate) and two predictors X1 = Len and X2 = ADT. Fit the nonlinear mean function

E(log(Rate)|X1 = x1, X2 = x2) = β0 + β1ψS(X1, λ1) + β2ψS(X2, λ2)

where the scaled power transformations ψS(Xj, λj) are defined at (7.3). Compare the results you get to results obtained using the transformation methodology in Chapter 7.

11.6 POD models. Partial one-dimensional mean functions for problems with both factors and continuous predictors were discussed in Section 6.4. For the Australian athletes data discussed in that section, the mean function (6.26),

E(LBM|Sex, Ht, Wt, RCC) = β0 + β1Sex + β2Ht + β3Wt + β4RCC + η0Sex + η1Sex(β2Ht + β3Wt + β4RCC)

was suggested. This mean function is nonlinear because η1 multiplies each of the βs. Problem 6.21 provides a simple algorithm for finding estimates using only standard linear regression software.
This method, however, will not produce the large-sample estimated covariance matrix that is available using nonlinear least squares. Describe a reasonable method for finding starting values for fitting (6.26) using nonlinear least squares. For the cloud seeding data, Problem 9.11, fit the partial one-dimensional model using the action variable A as the grouping variable, and summarize your results.

CHAPTER 12

Logistic Regression

A storm on July 4, 1999, with winds exceeding 90 miles per hour, hit the Boundary Waters Canoe Area Wilderness (BWCAW) in northeastern Minnesota, causing serious damage to the forest. Roy Rich studied the effects of this storm using a very extensive ground survey of the area, determining for over 3600 trees the status, either alive or dead, species, and size. One goal of this study is to determine the dependence of survival on species, size of the tree, and on the local severity. Figure 12.1a shows a plot for 659 Balsam Fir trees, with the response variable Y coded as 1 for trees that were blown down and died and 0 for trees that survived, versus the single predictor log(D), the base-two logarithm of the diameter of the tree. To minimize overprinting, the plotted values of the variables were slightly jittered before plotting. Even with the jittering, this plot is much less informative than most of the plots of response versus a predictor that we have seen earlier in this book. Since the density of ink is higher in the lower-left and upper-right corners, the probability of blowdown is apparently higher for large trees than for small trees, but little more than that can be learned from this plot.

Figure 12.1b is an alternative to Figure 12.1a. This graph displays density estimates, which are like smoothed histograms, for log(D), separately for the survivors, the solid line, and for the trees blown down, the dashed line¹. Both densities are roughly shaped like a normal density. The density for Y = 1 is shifted to the right relative to the density for Y = 0, meaning that the trees that blew down are generally larger. If the densities had no overlap, then the quantity on the horizontal axis, log(D), would be a perfect predictor of survival. Since there is substantial overlap of the densities, log(D) is not a perfect predictor of blowdown. For values of log(D) where the two densities have the same height, the probability of survival will be about 0.5.
For values of log(D) where the height of the density for survivors is higher than the density for blowdown, the probability of surviving exceeds 0.5; when the height of the density of

¹Looking at overlapping histograms is much harder than looking at overlapping density estimates. Silverman (1986) and Bowman and Azzalini (1997), among others, provide discussions of density estimation.

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

blowdown is higher, the probability of surviving is less than 0.5.

FIG. 12.1 Blowdown data for Balsam Fir. (a) Scatterplot of Y versus log(D). The solid line is the ols line. The dotted line is the fit of a smoothing spline. The dashed line is the logistic regression fit. Data have been jittered in both variables to minimize overprinting. (b) Separate density estimates for log(D) for survivors, Y = 0, and blowdown, Y = 1.

In Figure 12.1b, the probability of survival is greater than 0.5 if log(D) is less than about 3.3, and less than 0.5 if log(D) exceeds 3.3. More generally, suppose in a problem with predictor X we let θ(x) = Pr(Y = 1|X = x) be the conditional probability that Y = 1 given the value of the predictor. This conditional probability plays the role of the mean function in regression problems when the response is either one or zero. For the blowdown data in Figure 12.1, the probability of blowdown increases from left to right. We can visualize θ(log(D)) by adding a smoother to Figure 12.1a. The straight line on this graph is the ols regression of Y on log(D). It includes estimated values

of θ(log(D)) outside the range (0, 1) for very small or large trees, and so the ols line cannot be a good representation of θ(log(D)) for all values of log(D). The ols line is often inappropriate for a bounded response because it will produce fitted values outside the permitted range. The dotted line in Figure 12.1 was computed with a smoother, so it estimates the mean function without a model. This estimate of the mean function has the characteristic shape of binary regression, with asymptotes at 0 and 1 for extreme values of the predictor. The logistic regression models we will study next also have this shape.

As with other regression problems, with a binary response we also have a set of terms or predictors X, and we are interested in the study of Pr(Y = 1|X = x) = θ(x) as x is varied. The response variable is really a category, like success or failure, alive or dead, passed or failed, and so on. In some problems, the ith value of the response yᵢ will be a count of the number of successes in mᵢ independent trials each with the same probability of success. If all the mᵢ = 1, then each element of Y has a Bernoulli distribution; if some of the mᵢ > 1, then each element of Y has a binomial distribution if each of the trials has the same probability of success and all trials are independent. Bernoulli regression is a special case of binomial regression with all the mᵢ = 1.

12.1 BINOMIAL REGRESSION

We recall the basic facts about binomial random variables. Let y be the number of successes out of m independent trials, each with the same probability θ of success, so y can have any integer value between 0 and m. The random variable y has a binomial distribution. We write this as y ~ Bin(m, θ). The probability that y equals a specific integer j = 0, 1, ..., m is given by

Pr(y = j) = C(m, j) θ^j (1 − θ)^(m−j)    (12.1)

where C(m, j) = m!/(j!(m − j)!) is the number of different orderings of j successes in m trials. Equation (12.1) is called the probability mass function for the binomial.
The mean and variance of a binomial are

E(y) = mθ;  Var(y) = mθ(1 − θ)    (12.2)

Since m is known, both the mean and variance are determined by the one parameter θ. In the binomial regression problem, the response yᵢ counts the number of successes in mᵢ trials, and so mᵢ − yᵢ of the trials were failures. In addition, we have p terms or predictors xᵢ, possibly including a constant for the intercept, and assume that the probability of success for the ith case is θ(xᵢ). We write this compactly as

(Y|X = xᵢ) ~ Bin(mᵢ, θ(xᵢ)),  i = 1, ..., n    (12.3)

We use yᵢ/mᵢ, the observed fraction of successes at each i, as the response because the range of yᵢ/mᵢ is always between 0 and 1, whereas the range of yᵢ is between

0 and mᵢ and can be different for each i. Using (12.2), the mean and variance functions are

E(yᵢ/mᵢ|xᵢ) = θ(xᵢ)    (12.4)

Var(yᵢ/mᵢ|xᵢ) = θ(xᵢ)(1 − θ(xᵢ))/mᵢ    (12.5)

In the multiple linear regression model, the mean function and the variance function generally have completely separate parameters, but that is not so for binomial regression. The value of θ(xᵢ) determines both the mean function and the variance function, so we need to estimate θ(xᵢ). If the mᵢ are all large, we could simply estimate θ(xᵢ) by yᵢ/mᵢ, the observed proportion of successes at xᵢ. In many applications, the mᵢ are small, often mᵢ = 1 for all i, so this simple method will not always work.

12.1.1 Mean Functions for Binomial Regression

As with linear regression models, we assume that θ(xᵢ) depends on xᵢ only through a linear combination β′xᵢ for some unknown β. This means that any two cases for which β′x is equal will have the same probability of success. We can write θ(x) as a function of β′x,

θ(xᵢ) = m(β′xᵢ)

The quantity β′xᵢ is called the linear predictor. As in nonlinear models, the function m is called a kernel mean function; m(β′xᵢ) should take values in the range (0, 1) for all β′x. The most frequently used kernel mean function for binomial regression is the logistic function,

θ(xᵢ) = m(β′xᵢ) = exp(β′xᵢ)/(1 + exp(β′xᵢ)) = 1/(1 + exp(−β′xᵢ))    (12.6)

A graph of this kernel mean function is shown in Figure 12.2. The logistic mean function is always between 0 and 1, and has no additional parameters. Most presentations of logistic regression work with the inverse of the kernel mean function, called the link function. Solving (12.6) for β′x, we find

log( θ(x)/(1 − θ(x)) ) = β′x    (12.7)

The left side of (12.7) is called a logit and the right side is the linear predictor β′x. The logit is a linear function of the terms on the right side of (12.7). If we were to draw a graph of log(θ(x)/(1 − θ(x))) versus β′x, we would get a straight line. The ratio θ(x)/(1 − θ(x)) is the odds of success.
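Equations (12.6) and (12.7) say that the logistic kernel mean function and the logit link are inverses of each other, and that the odds are a simple transformation of the probability. The following short Python sketch, which is ours rather than the book's, checks these facts numerically:

```python
import math

def logistic(eta):
    """Kernel mean function (12.6): maps the linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(theta):
    """Link function (12.7): the natural log of the odds of success."""
    return math.log(theta / (1.0 - theta))

# Round trip: the logit is the inverse of the logistic function.
assert abs(logit(logistic(2.26)) - 2.26) < 1e-12

# A probability of 0.25 corresponds to odds of 1/3: one success per three failures.
assert abs(0.25 / (1.0 - 0.25) - 1.0 / 3.0) < 1e-12
```

Because the logistic function has asymptotes at 0 and 1, `logistic(eta)` stays in the permitted range for any value of the linear predictor.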
For example, if the probability of success is 0.25, the odds of success are 0.25/(1 − 0.25) = 1/3, one success to each three failures. If the probability of success is 0.8, then the odds of success are 0.8/0.2 = 4, or four successes to one failure. Whereas probabilities are bounded

between 0 and 1, odds can be any nonnegative number.

FIG. 12.2 The logistic kernel mean function.

The logit is the logarithm of the odds; natural logs are used in defining the logit. According to equation (12.7), the logit is equal to a linear combination of the terms. In summary, the logistic regression model consists of the data and distribution specified by (12.3), and a fixed component that connects the response to the mean through (12.6).

12.2 FITTING LOGISTIC REGRESSION

Many standard statistical packages allow estimation for logistic regression models. The most common computational method is outlined in Section 12.3.2; for now, we return to our example.

12.2.1 One-Predictor Example

Consider first logistic regression with one predictor using the Balsam Fir data from the BWCAW blowdown shown in Figure 12.1. The data are given in the file blowbf.txt. The single predictor is log(D), using base-two logarithms. All the mᵢ = 1. We fit with two terms, the intercept and log(D). The results are summarized in Table 12.1.

TABLE 12.1 Logistic Regression Summary for the Balsam Fir Blowdown Data

Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                                 <2e-16
logD                                        <2e-16

Residual deviance:           on 657 degrees of freedom
Pearson's X^2:               on 657 degrees of freedom

The output reports the estimate β̂₀ for the intercept and the estimate β̂₁ for the slope for log(D). The dashed curve drawn on Figure 12.1a corresponds to the fitted mean function

Ê(Y|log(D)) = 1/(1 + exp[−(β̂₀ + β̂₁log(D))])

The logistic fit matches the nonparametric spline fit fairly closely, except for the largest and smallest trees, where smoothers and parametric fits are often in disagreement. It is not always easy to tell if the logistic fit is matching the data without comparing it to a nonparametric smooth.

Equation (12.7) provides a basis for understanding coefficients in logistic regression. In the example, the coefficient for log(D) is about 2.26. If log(D) were increased by one unit (since this is a base-two logarithm, increasing log(D) by one unit means that the value of D doubles), then the natural logarithm of the odds will increase by 2.26 and the odds will be multiplied by exp(2.26) = 9.6. Thus, a tree with diameter 10 in. is 9.6 times as likely to blow down as a tree of diameter five inches, and a tree of 2-in. diameter is 9.6 times as likely to blow down as a tree of 1-in. diameter. In general, if β̂ⱼ is an estimated coefficient in a logistic regression, then if xⱼ is increased by one unit, the odds of success, that is, the odds that Y = 1, are multiplied by exp(β̂ⱼ).

Table 12.1 also reports standard errors of the estimates, and the column marked z value shows the ratio of the estimates to their standard errors. These values can be used for testing coefficients to be zero after adjusting for the other terms in the model, as in linear models, but the test should be compared to the standard normal distribution rather than a t-distribution. The deviance and Pearson's X² reported in the table will be discussed shortly. Since the variance of a binomial is determined by the mean, there is no variance parameter that can be estimated separately. While an equivalent to an R² measure can be defined for logistic regression, its use is not recommended.

12.2.2 Many Terms

We introduce a second predictor into the blowdown data.
The variable S is a local measure of the severity of the storm that varies from location to location, from near 0, with very few trees affected, to near 1, with nearly all trees blown down. Figure 12.3 shows two useful plots. Figure 12.3a gives the density for S for each of the two values of Y. In contrast to log(D), the two density estimates are much less nearly normal in shape, with one group somewhat skewed and the other perhaps bimodal, or at least very diffuse. The two densities are less clearly separated, and this indicates that S is a weaker predictor of blowdown for these Balsam Fir trees. Figure 12.3b is a scatterplot of S versus log(D), with different symbols for Y = 1 and for Y = 0. This particular plot would be much easier to use with different colors indicating the two classes. In the upper-right of the plot, the symbol for Y = 1 predominates, while in the lower-left, the symbol for Y = 0 predominates.

FIG. 12.3 (a) Density estimates for S for survivors, Y = 0, and blowdown, Y = 1, for the Balsam Fir data. (b) Plot of S versus log(D) with separate symbols for points with Y = 1 and Y = 0. The values of log(D) have been jittered.

This suggests that the two predictors have a joint effect because the prevalence of symbols for Y = 1 changes along a diagonal in the plot. If the prevalence changed from left to right but not down to up, then only the variable on the horizontal axis would have an effect on the probability of blowdown. If the prevalence of symbols for Y = 1 were uniform throughout the plot, then neither variable would be important. Fitting logistic regression with the two predictors log(D) and S means that the probability of Y = 1 depends on these predictors only through a linear combination β₁log(D) + β₂S for some (β₁, β₂). This is like finding a sequence of parallel lines like those in Figure 12.4a so that the probability of blowdown is constant on the

lines, and increases (or, in other problems, decreases) from lower left to upper right.

FIG. 12.4 Scatterplots with contours of estimated probability of blowdown. (a) No-interaction mean function. (b) Mean function with interaction.

The lines shown on Figure 12.4a come from the logistic regression fit that is summarized in Table 12.2a. From the table, we see that all points with the same value of the estimated linear combination of log(D) and S have the same estimated probability of blowdown. For example, for the points near the line marked 0.5, we would expect about 50% symbols for Y = 0 and 50% symbols for Y = 1, but near the line marked 0.1, we would expect 90% symbols for Y = 0. Figure 12.4b and Table 12.2b correspond to fitting with a mean function that includes the two terms and their interaction. The lines of constant estimated probability of Y = 1 shown on Figure 12.4b are now curves rather than straight lines,

TABLE 12.2 Logistic Regressions for the Balsam Fir Data

(a) No interaction
Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                                 <2e-16 ***
logD                                        <2e-16 ***
S                                           <2e-16 ***
---
Residual deviance:           on 656 degrees of freedom
Pearson's X^2:               on 656 degrees of freedom

(b) Mean function with interaction
Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)
logD
S
logD:S
---
Residual deviance:           on 655 degrees of freedom
Pearson's X^2:               on 655 degrees of freedom

but otherwise the interpretation of this plot is the same as that of Figure 12.4a. Visually deciding which of these two mean functions matches the data more closely is difficult, and we will shortly develop a test for this comparison.

Interpretation of estimates of parameters in mean functions with no interaction works the same way with many predictors as it does with one predictor. For example, the estimated effect of increasing log(D) by one unit using Table 12.2a is to multiply the odds of blowdown by exp(2.2164) ≈ 9.2, similar to the estimated effect when S is ignored. Interpretation of estimates is complicated by interactions. Using the estimates in Table 12.2b, if log(D) is increased by one unit, then the odds of blowdown are multiplied by exp(β̂₁ + β̂₃S), where β̂₁ is the estimate for log(D) and β̂₃ is the estimate for the interaction, so the multiplier depends on the value of S. For the effect of S, since S is bounded between 0 and 1, we cannot increase S by one unit, so we can summarize the S-effect by looking at an increase of 0.1 units. The odds multiplier for S is then exp(0.1[β̂₂ + β̂₃log(D)]), where β̂₂ is the estimate for S. These two functions are graphed in Figure 12.5. Big trees were much more likely to blow down in severe areas than in areas where severity was low. For fixed diameter, increasing severity by 0.1 has a relatively modest effect on the odds of blowdown.

As with logistic regression with a single term, the estimates are approximately normally distributed if the sample size is large enough, with estimated standard errors given in Table 12.2. The ratios of the estimates to their standard errors are called Wald tests.
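The two quantities just described, the Wald test and the odds multiplier in the presence of an interaction, are simple to compute. Here is a hedged Python sketch; the function names and the illustrative coefficient values are ours, not the fitted values from Table 12.2:

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value comparing z = estimate/SE to a standard normal."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

def blowdown_odds_multiplier(b_logD, b_interaction, S):
    """Odds multiplier for a one-unit increase in log2(D), a doubling of
    diameter, when the mean function includes a log(D):S interaction."""
    return math.exp(b_logD + b_interaction * S)

# With made-up coefficients, the multiplier grows with local severity S:
low = blowdown_odds_multiplier(0.5, 4.0, 0.1)
high = blowdown_odds_multiplier(0.5, 4.0, 0.9)
assert high > low
```

The p-value uses the complementary error function, which equals twice the upper standard normal tail at |z|; this is the comparison to the normal distribution, not to a t-distribution, that the text prescribes.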
For the no-interaction mean function in Table 12.2a, the p-values for all three Wald tests are very small, indicating that all terms are important when adjusting for the other terms in the mean function. For the interaction mean function in Table 12.2b, the p-values for S and for the interaction are both small, but the main effect for log(D) has a large p-value. Using the hierarchy principle,

however, we recommend that whenever an interaction is included in a mean function, all the main effects in that interaction be included as well, so in light of the significant interaction the test for log(D) is not relevant, and we would retain log(D).

FIG. 12.5 Blowdown odds multiplier for (a) doubling the diameter of a Balsam Fir tree as a function of local severity, S, and (b) increasing S by 0.1 as a function of diameter.

In the next section, we derive tests analogous to F-tests for linear regression. Unlike the linear model, where F-tests and Wald tests are equivalent, in logistic regression they can give conflicting results. The tests in the next section are to be preferred over the Wald tests.

12.2.3 Deviance

In multiple linear regression, the residual sum of squares provides the basis for tests for comparing mean functions. In logistic regression, the residual sum of squares is replaced by the deviance, which is often called G². The deviance is defined for logistic regression to be

G² = 2 Σᵢ₌₁ⁿ [ yᵢ log(yᵢ/ŷᵢ) + (mᵢ − yᵢ) log( (mᵢ − yᵢ)/(mᵢ − ŷᵢ) ) ]    (12.8)

where ŷᵢ = mᵢθ̂(xᵢ) are the fitted number of successes in mᵢ trials. The df associated with the deviance is equal to the number of cases n used in the calculation minus the number of elements of β that were estimated; in the example, df = 659 − 4 = 655.

Methodology for comparing models parallels the results in Section 5.4. Write β′x = β₁′x₁ + β₂′x₂, and consider testing

NH: θ(x) = m(β₁′x₁)
AH: θ(x) = m(β₁′x₁ + β₂′x₂)

TABLE 12.3 Analysis of Deviance for Balsam Fir Blowdown Data

Terms                     df   Deviance   Change in df   Deviance   P(>|Chi|)
1, log(D)                657
1, log(D), S, S·log(D)   655                        2

TABLE 12.4 Sequential Analysis of Deviance for Balsam Fir Blowdown Data

Terms             df   Deviance   Change in df   Deviance   P(>|Chi|)
Intercept        658
Add log(D)       657                        1
Add S            656                        1
Add interaction  655                        1

to see if the terms in x₂ have zero coefficients. Obtain the deviance G²_NH and degrees of freedom df_NH under the null hypothesis, and then obtain G²_AH and df_AH under the alternative hypothesis. As with linear models, we will have evidence against the null hypothesis if G²_NH − G²_AH is too large. To get p-values, we compare the difference G²_NH − G²_AH with the χ² distribution with df = df_NH − df_AH, not with an F-distribution as was done for linear models. In the example, we set x₁ = (Ones, log(D)), where Ones is the vector of ones to fit the intercept, and x₂ = (S, S·log(D)), to test that only log(D) and the intercept are required in the mean function. Fitting under the null hypothesis is summarized in Table 12.1, with df_NH = 657. The alternative hypothesis is summarized in Table 12.2b, with df_AH = 655. These results can be summarized in an analysis of deviance table, as in Table 12.3. The interpretation of this table parallels closely the results for multiple linear regression models in Section 5.4. The table includes the deviance and df for each of the models. The test statistic depends on the change in deviance and the change in df, as given in the table. The p-value is 0 to 4 decimals, and the larger model is preferred. We could also have a longer sequence of models, for example, first fitting with an intercept only, then the intercept and log(D), then adding S, and finally the interaction.
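The change-in-deviance comparison can be sketched in a few lines of Python. The deviance values below are placeholders rather than the fitted values from Table 12.3, and the χ² tail probability uses a closed-form series that is valid only for even degrees of freedom, which covers the two-df comparison here:

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-squared variable with EVEN df, via the
    closed-form Poisson-tail identity exp(-x/2) * sum_j (x/2)^j / j!."""
    assert df > 0 and df % 2 == 0
    term, total = 1.0, 0.0
    for j in range(df // 2):
        if j > 0:
            term *= (x / 2.0) / j
        total += term
    return math.exp(-x / 2.0) * total

# Placeholder deviances under NH (df 657) and AH (df 655):
G2_nh, G2_ah = 655.0, 610.0
change = G2_nh - G2_ah                        # compared to chi-squared, 2 df
p_value = chi2_sf_even_df(change, 657 - 655)
assert p_value < 0.0001                       # a large change rejects NH
```

For odd degrees of freedom, or in any real analysis, a library routine such as `scipy.stats.chi2.sf` would be used instead of this hand-rolled tail formula.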
This would give an analysis of deviance table like Table 12.4 that parallels the sequential analysis of variance tables discussed earlier for linear models. The table displays the tests for comparing two adjacent mean functions.

12.2.4 Goodness-of-Fit Tests

When the number of trials mᵢ > 1, the deviance G² can be used to provide a goodness-of-fit test for a logistic regression model, essentially comparing the null hypothesis that the mean function used is adequate versus the alternative that a separate parameter needs to be fit for each value of i (this latter case is called the

saturated model). When all the mᵢ are large enough, G² can be compared with the χ²(n − p) distribution to get an approximate p-value. The goodness-of-fit test is not applicable in the blowdown example because all the mᵢ = 1.

Pearson's X² is an approximation to G² defined for logistic regression by

X² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² [ 1/ŷᵢ + 1/(mᵢ − ŷᵢ) ]
   = Σᵢ₌₁ⁿ mᵢ (yᵢ/mᵢ − θ̂(xᵢ))² / [ θ̂(xᵢ)(1 − θ̂(xᵢ)) ]    (12.9)

X² and G² have the same large-sample distribution and often give the same inferences. In small samples, there may be differences, and sometimes X² may be preferred for testing goodness-of-fit.

Titanic

The Titanic was a British luxury passenger liner that sank when it struck an iceberg about 640 km south of Newfoundland on April 14-15, 1912, on its maiden voyage to New York City from Southampton, England. Of 2201 known passengers and crew, only 711 are reported to have survived. The data in the file titanic.txt from Dawson (1995) classify the people on board the ship according to their Sex as male or female; Age, either child or adult; and Class, either first, second, third, or crew. Not all combinations of the three factors occur in the data, since no children were members of the crew. For each age/sex/class combination, the number of people M and the number surviving Surv are also reported. The data are shown in Table 12.5.

Table 12.6 gives the value of G² and Pearson's X² for the fit of five mean functions to these data. Since almost all the mᵢ exceed 1, we can use either G² or X² as a goodness-of-fit test for these models.
The first two mean functions, the main effects only model and the main effects plus the Class × Sex interaction, clearly do not fit the data because the values of G² and X² are both much larger than their df, and the corresponding p-values from the χ² distribution are

TABLE 12.5 Data from the Titanic Disaster. Each Cell Gives Surv/M, the Number of Survivors, and the Number of People in the Cell

                Female             Male
Class      Adult     Child    Adult     Child
Crew       20/23     NA       192/862   NA
First      140/144   1/1      57/175    5/5
Second     80/93     13/13    14/168    11/11
Third      76/165    14/31    75/462    13/48

TABLE 12.6 Fit of Five Mean Functions for the Titanic Data. Each of the Mean Functions Treats Age, Sex, and Class as Factors, and Fits Different Main Effects and Interactions

Mean Function                                            df    G²    X²
Main effects only
Main effects + Class × Sex
Main effects + Class × Sex + Class × Age
Main effects + all two-factor interactions
Main effects, two-factor and three-factor interactions

0 to several decimal places. The third model, which adds the Class × Age interaction, has both G² and X² smaller than its df, with p-values of about 0.64, so this mean function seems to match the data well. Adding more terms can only reduce the values of G² and X², and adding the third interaction decreases these statistics to 0 to the accuracy shown. Adding the three-factor interaction fits one parameter for each cell, effectively estimating the probability of survival by the observed probability of survival in each cell. This will give an exact fit to the data. The analysis of these data is continued in the Problems.

12.3 BINOMIAL RANDOM VARIABLES

In this section, we provide a very brief introduction to maximum likelihood estimation and then provide a computing algorithm for finding maximum likelihood estimates for logistic regression.

12.3.1 Maximum Likelihood Estimation

Data can be used to estimate θ using maximum likelihood estimation. Suppose we have observed y successes in m independent trials, each with the same probability θ of success. The maximum likelihood estimate or mle of θ is the value θ̂ of θ that maximizes the probability of observing y successes in m trials. This amounts to rewriting (12.1) as a function of θ, with y held fixed at its observed value,

L(θ) = C(m, y) θ^y (1 − θ)^(m−y)    (12.10)

L(θ) is called the likelihood function for θ. Since the same value maximizes both L(θ) and log(L(θ)), we work with the more convenient log-likelihood, given by

log(L(θ)) = log C(m, y) + y log(θ) + (m − y) log(1 − θ)    (12.11)
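As a numerical check on (12.10) and (12.11), and on the closed-form mle derived next, here is a short sketch; the counts y = 18 and m = 70 are invented for illustration:

```python
import math

def binom_loglik(theta, y, m):
    """Log-likelihood (12.11), dropping the constant log C(m, y) term,
    which does not depend on theta."""
    return y * math.log(theta) + (m - y) * math.log(1.0 - theta)

y, m = 18, 70
mle = y / m                                # observed proportion of successes
se = math.sqrt(mle * (1.0 - mle) / m)      # large-sample SE from (12.13)

# The closed form should beat any other candidate value of theta:
for t in (0.05, 0.1, 0.2, 0.3, 0.5, 0.9):
    assert binom_loglik(mle, y, m) >= binom_loglik(t, y, m)
```

Dropping the constant term is harmless because it shifts the log-likelihood by the same amount for every value of θ, so the maximizing value is unchanged.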

Differentiating (12.11) with respect to θ and setting the result to 0 gives

d log(L(θ))/dθ = y/θ − (m − y)/(1 − θ) = 0

Solving for θ gives the mle,

θ̂ = y/m = (observed number of successes)/(observed fixed number of trials)

which is the observed proportion of successes. Although we can find the variance of this estimator directly, we use a result that gives the large-sample variance of the mle for most statistical problems. Suppose the parameter θ is a vector. Then in large samples,

Var(θ̂) ≈ [ −E( ∂² log(L(θ))/∂θ ∂θ′ ) ]⁻¹    (12.12)

For the binomial example, θ is a scalar, and

[ −E( d² log(L(θ))/dθ² ) ]⁻¹ = [ E( y/θ² + (m − y)/(1 − θ)² ) ]⁻¹ = θ(1 − θ)/m    (12.13)

This variance is estimated by substituting θ̂ for θ. In large samples, the mle θ̂ is approximately normally distributed with mean θ and variance given by (12.12).

12.3.2 The Log-Likelihood for Logistic Regression

Equation (12.10) provides the likelihood function for a single binomial random variable y with m trials and probability of success θ. We generalize now to having n independent random variables (y₁, ..., yₙ), with yᵢ a binomial random variable with mᵢ trials and probability of success θ(xᵢ) that depends on the value of xᵢ and so may be different for each i. The likelihood based on (y₁, ..., yₙ) is obtained by multiplying the likelihoods for each observation,

L = Πᵢ₌₁ⁿ C(mᵢ, yᵢ) (θ(xᵢ))^yᵢ (1 − θ(xᵢ))^(mᵢ−yᵢ) ∝ Πᵢ₌₁ⁿ (θ(xᵢ))^yᵢ (1 − θ(xᵢ))^(mᵢ−yᵢ)

In the last expression, we have dropped the binomial coefficients C(mᵢ, yᵢ) because they do not depend on the parameters. After minor rearranging, the log-likelihood is

log(L) ∝ Σᵢ₌₁ⁿ [ yᵢ log( θ(xᵢ)/(1 − θ(xᵢ)) ) + mᵢ log(1 − θ(xᵢ)) ]

Next, we substitute for θ(xᵢ) using equation (12.7) to get

log(L(β)) = Σᵢ₌₁ⁿ [ (β′xᵢ)yᵢ − mᵢ log(1 + exp(β′xᵢ)) ]    (12.14)

The log-likelihood depends on the regression parameters β explicitly, and we can maximize (12.14) to get estimates. An iterative procedure is required. The usual methods, using either the Newton-Raphson algorithm or Fisher scoring, attain convergence in just a few iterations, although problems can arise with unusual data sets, for example, if one or more of the predictors can determine the value of the response exactly; see Collett (2002, Section 3.12). Details of the computational method are provided by McCullagh and Nelder (1989, Section 2.5), Collett (2002), and Agresti (1996, 2002), among others. The estimated covariance matrix of the estimates is given by

Var(β̂) = (X′ŴX)⁻¹

where Ŵ is a diagonal matrix with entries mᵢθ̂(xᵢ)(1 − θ̂(xᵢ)), and X is a matrix with ith row xᵢ′.

12.4 GENERALIZED LINEAR MODELS

Both the multiple linear regression model discussed earlier in this book and the logistic regression model discussed in this chapter are particular instances of a generalized linear model. Generalized linear models all share three basic characteristics:

1. The distribution of the response Y, given a set of terms X, is distributed according to an exponential family distribution. The important members of this class include the normal and binomial distributions we have already encountered, as well as the Poisson and gamma distributions. Generalized linear models based on the Poisson distribution are the basis of the most common models for contingency tables of counts; see Agresti (1996, 2002).

2. The response Y depends on the terms X only through the linear combination β′X.

3. The mean E(Y|X = x) = m(β′x) for some kernel mean function m.
For the multiple linear regression model, m is the identity function, and for logistic regression, it is the logistic function. There is considerable flexibility in selecting the kernel mean function. Most presentations of generalized linear models discuss the link function, which is the inverse of m rather than m itself.
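The Fisher scoring computation outlined in Section 12.3.2 can be sketched for the simplest case, an intercept plus one predictor. This is a bare-bones illustration of maximizing (12.14), written by us, and not production code or the algorithm of any particular package:

```python
import math

def fit_logistic(x, y, m, iters=25):
    """Fisher scoring for binomial logistic regression with an intercept
    and one predictor. Returns estimates and large-sample variances from
    the diagonal of the inverse information matrix (X'WX)^{-1}."""
    b0 = b1 = 0.0
    for _ in range(iters):
        # Score U = X'(y - m*theta) and information I = X'WX, where W is
        # diagonal with entries m_i * theta_i * (1 - theta_i).
        u0 = u1 = i00 = i01 = i11 = 0.0
        for xi, yi, mi in zip(x, y, m):
            theta = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - mi * theta
            w = mi * theta * (1.0 - theta)
            u0 += r
            u1 += r * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        b0 += (i11 * u0 - i01 * u1) / det   # beta <- beta + I^{-1} U
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1, i11 / det, i00 / det
```

Given grouped data x, success counts y, and trial counts m, the estimates converge in a handful of iterations for well-behaved data; in practice one would use a fitted-package routine such as R's glm with a binomial family.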

These three components are enough to specify completely a regression problem along with methods for computing estimates and making inferences. The methodology for these models generally builds on the methods in this book, usually with only minor modification. Generalized linear models were first suggested by Nelder and Wedderburn (1972) and are discussed at length by McCullagh and Nelder (1989). Some statistical packages use common software to fit all generalized linear models, including the multiple linear regression model. Book-length treatments of binomial regression are given by Collett (2002) and by Hosmer and Lemeshow (2000).

PROBLEMS

Downer data. For unknown reasons, dairy cows sometimes become recumbent; that is, they lie down. Called downers, these cows may have a serious illness that may lead to death of the cow. These data are from a study of blood samples of over 400 downer cows studied at the Ruakura New Zealand Animal Health Laboratory. A variety of blood tests were performed, and for many of the animals, the outcome (survived, died, or animal was killed) was determined. The goal is to see if survival can be predicted from the blood measurements. The variables in the data file downer.txt are described in Table 12.7. These data were collected from veterinary records, and not all variables were recorded for all cows.

Consider first predicting Outcome from Myopathy. Find the fraction of surviving cows for Myopathy = 0 and for Myopathy = 1.

Fit the logistic regression with response Outcome and the single predictor Myopathy. Obtain a 95% confidence interval for the coefficient for Myopathy, and compute the estimated decrease in odds of survival when Myopathy = 1.
Obtain the estimated probability of survival when Myopathy = 0 and when Myopathy = 1, and compare with the observed survival fractions found in the first part of this problem.

TABLE 12.7 The Downer Data

Variable   n    Description
AST        429  Serum aspartate amino transferase (U/l at 30C)
Calving         0 if measured before calving, 1 if after
CK         413  Serum creatine phosphokinase (U/l at 30C)
Daysrec    432  Days recumbent when measurements were done
Inflamat   136  Is inflammation present? 0 = no, 1 = yes
Myopathy   222  Is muscle disorder present? 1 = yes, 0 = no
PCV        175  Packed cell volume (hematocrit), percent
Urea       266  Serum urea (mmol/l)
Outcome         1 if survived, 0 if died or was killed

Source: Clark, Henderson, Hoggard, Ellison, and Young (1987).

Next, consider the regression problem with only CK as a predictor (CK is observed more often than is Myopathy, so this regression will be based on more cases than were used in the first two parts of this problem). Draw separate density estimates of CK for Outcome = 0 and for Outcome = 1. Also, draw separate density estimates of log(CK) for the two groups. Comment on the graphs.

Fit the logistic regression mean function with log(CK) as the only term beyond the intercept. Summarize results.

Fit the logistic mean function with terms for log(CK), Myopathy, and a Myopathy × log(CK) interaction. Interpret each of the coefficient estimates. Obtain a sequential deviance table for fitting the terms in the order given above, and summarize results. (Missing data can cause a problem here: if your computer program requires that you fit three separate mean functions to get the analysis of deviance, then you must be sure that each fit is based on the same set of observations, those for which CK and Myopathy are both observed.)

Starting with (12.6), prove (12.7).

Electric shocks. A study carried out by R. Norell was designed to learn about the effect of small electrical currents on farm animals, with the eventual goal of understanding the effects of high-voltage power lines near farms. A total of m = 70 trials were carried out at each of six intensities, 0, 1, 2, 3, 4, and 5 mA (shocks on the order of 15 mA are painful for many humans; Dalziel, Lagen, and Thurston, 1941). The data are given in the file shocks.txt with columns Intensity; m, the number of trials, which is always equal to 70; and Y, the number of trials out of m for which the response, mouth movement, was observed. Draw a plot of the fraction responding versus Intensity. Then, fit the logistic regression with predictor Intensity, and add the fitted curve to your plot. Test the hypothesis that the probability of response is independent of Intensity, and summarize your conclusions. Provide a brief interpretation of the coefficient for Intensity.
(Hint: The response in the logistic regression is the number of successes in m trials. Unless the number of trials is one for every case, computer programs will require that you specify the number of trials in some way. Some programs will have an argument with a name like trials or weights for this purpose. Others, like R and JMP, require that you specify a bivariate response consisting of the number of successes Y and the number of failures m − Y.)

Donner party  In the winter of 1846–47, about 90 wagon train emigrants in the Donner party were unable to cross the Sierra Nevada Mountains of California before winter, and almost half of them starved to death. The data in file donner.txt from Johnson (1996) include some information about each of the members of the party. The variables include Age, the age of the person; Sex, whether male or female; Status, whether the person was a

member of a family group, a hired worker for one of the family groups, or a single individual who did not appear to be a hired worker or a member of any of the larger family groups; and Outcome, coded 1 if the person survived and 0 if the person died.

How many men and women were in the Donner Party? What was the survival rate for each sex? Obtain a test that the survival rates were the same against the alternative that they were different. What do you conclude?

Fit the logistic regression model with response Outcome and predictor Age, and provide an interpretation for the fitted coefficient for Age.

Draw the graph of Outcome versus Age, and add both a smooth and a fitted logistic curve to the graph. The logistic regression curve apparently does not match the data: explain what the differences are and how this failure might be relevant to understanding who survived this tragedy. Fit again, but this time, add a quadratic term in Age. Does the fitted curve now match the smooth more accurately?

Fit the logistic regression model with terms for an intercept, Age, Age², Sex, and a factor for Status. Provide an interpretation for the parameter estimate for Sex and for each of the parameter estimates for Status. Obtain tests on the basis of the deviance for adding each of the terms to a mean function that already includes the other terms, and summarize the results of each of the tests via a p-value and a one-sentence summary of the results.

Assuming that the logistic regression model provides an adequate summary of the data, give a one-paragraph written summary on the survival of members of the Donner Party.

Counterfeit banknotes  The data in the file banknote.txt contains information on 100 counterfeit Swiss banknotes with Y = 0 and 100 genuine banknotes with Y = 1. Also included are six physical measurements of the notes, including the Length, Diagonal and the Left and Right edges of the note, all in millimeters, and the distance from the image to the Top edge and Bottom edge of the paper, all in millimeters (Flury and Riedwyl, 1988).
The goal of the analysis is to estimate the probability or odds that a banknote is counterfeit, given the values of the six measurements.

Draw a scatterplot matrix of the six predictors, marking the points with different colors for the two groups (genuine or counterfeit). Summarize the information in the scatterplot matrix.

Use logistic regression to study the conditional distribution of Y, given the predictors.

Challenger  The file challeng.txt from Dalal, Fowlkes, and Hoadley (1989) contains data on O-rings on 23 U.S. space shuttle missions prior

to the Challenger disaster of January 1986. For each of the previous missions, the temperature at take-off and the pressure of a pre-launch test were recorded, along with the number of O-rings that failed out of six. Use these data to try to understand the probability of failure as a function of temperature, and of temperature and pressure. Use your fitted model to estimate the probability of failure of an O-ring when the temperature was 31°F, the launch temperature on the day of the disaster.

Titanic  Refer to the Titanic data, described earlier in this chapter. Fit a logistic regression model with terms for the factors Sex, Age and Class. On the basis of examination of the data in Table 12.5, explain why you expect that this mean function will be inadequate to explain these data.

Fit a logistic regression model that includes all the terms of the last part, plus all the two-factor interactions. Use appropriate testing procedures to decide if any of the two-factor interactions can be eliminated. Assuming that the mean function you have obtained matches the data well, summarize the results you have obtained by interpreting the parameters to describe different survival rates for various factor combinations. (Hint: How does the survival of the crew differ from the passengers? First class from third class? Males from females? Children versus adults? Did children in first class survive more often than children in third class?)

BWCAW blowdown  The data file blowapb.txt contains the data for Rich's blowdown data, as introduced at the beginning of this chapter, but for the two species SPP = A for aspen and SPP = PB for paper birch.

Fit the same mean function used for balsam fir to each of these species. Is the interaction between S and log(D) required for these species?

Ignoring the variable S, compare the two species, using the mean functions outlined earlier in this chapter.

Windmill data  For the windmill data in the data file wm4.txt, use the four-site data to estimate the probability that the wind speed at the candidate site exceeds six meters per second, and summarize your results.

Appendix

A.1  WEB SITE

The web site for this book includes free text primers on how to do the computations described in the book with several standard computer programs, all the data files described in the book, errata for the book, and scripts for some of the packages that can reproduce the examples in the book.

A.2  MEANS AND VARIANCES OF RANDOM VARIABLES

Suppose we let u_1, u_2, ..., u_n be random variables and also let a_0, a_1, ..., a_n be n + 1 known constants.

A.2.1  E Notation

The symbol E(u_i) is read as the expected value of the random variable u_i. The phrase "expected value" is the same as the phrase "mean value." Informally, the expected value of u_i is the average value of a very large sample drawn from the distribution of u_i. If E(u_i) = 0, then the average value we would get for u_i if we sampled its distribution repeatedly is 0. Since u_i is a random variable, any particular realization of u_i is likely to be nonzero. The expected value is a linear operator, which means

    E(a_0 + a_1 u_1) = a_0 + a_1 E(u_1)
    E(a_0 + Σ a_i u_i) = a_0 + Σ a_i E(u_i)                                  (A.1)

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

For example, suppose that u_1, ..., u_n are a random sample from a population, and E(u_i) = μ, i = 1, ..., n. The sample mean is ū = Σ u_i / n = (1/n) Σ u_i, and the expectation of the sample mean is

    E(ū) = E((1/n) Σ u_i) = (1/n) Σ E(u_i) = (1/n)(nμ) = μ

We say that ū is an unbiased estimate of the population mean μ, since its expected value is μ.

A.2.2  Var Notation

The symbol Var(u_i) is the variance of u_i. The variance is defined by the equation Var(u_i) = E[u_i − E(u_i)]², the expected squared difference between an observed value for u_i and its mean value. The larger Var(u_i), the more variable observed values for u_i are likely to be. The symbol σ² is often used for a variance, or σ_u² might be used for the variance of the identically distributed u_i if several variances are being discussed. The general rule for the variance of a sum of uncorrelated random variables is

    Var(a_0 + Σ a_i u_i) = Σ a_i² Var(u_i)                                   (A.2)

The a_0 term vanishes because the variance of a_0 + u is the same as the variance of u, since the variance of a constant is 0. Assuming that Var(u_i) = σ², we can find the variance of the sample mean of independently, identically distributed u_i:

    Var(ū) = Var((1/n) Σ u_i) = (1/n²) Σ Var(u_i) = (1/n²)(nσ²) = σ²/n

A.2.3  Cov Notation

The symbol Cov(u_i, u_j) is read as the covariance between the random variables u_i and u_j and is defined by the equation

    Cov(u_i, u_j) = E[(u_i − E(u_i))(u_j − E(u_j))] = Cov(u_j, u_i)

The covariance describes the way two random variables vary jointly. If the two variables are independent, then Cov(u_i, u_j) = 0, but zero correlation does not imply independence. The variance is a special case of covariance, since Cov(u_i, u_i) = Var(u_i). The rule for covariance is

    Cov(a_0 + a_1 u_1, a_3 + a_2 u_2) = a_1 a_2 Cov(u_1, u_2)
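Both facts about the sample mean, E(ū) = μ and Var(ū) = σ²/n, are easy to check by simulation; the values μ = 3, σ² = 4, n = 10 below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 3.0, 4.0, 10, 200_000

# Draw `reps` independent samples of size n and compute each sample mean.
means = rng.normal(mu, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)

emp_mean = means.mean()   # should be close to mu = 3
emp_var = means.var()     # should be close to sigma2 / n = 0.4
```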

The correlation coefficient is defined by

    ρ(u_i, u_j) = Cov(u_i, u_j) / [Var(u_i) Var(u_j)]^(1/2)

The correlation does not depend on units of measurement and has a value between −1 and 1. The general form for the variance of a linear combination of correlated random variables is

    Var(a_0 + Σ a_i u_i) = Σ_{i=1}^{n} a_i² Var(u_i) + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} a_i a_j Cov(u_i, u_j)    (A.3)

A.2.4  Conditional Moments

Throughout the book, we use notation like E(Y|X = x) to denote the mean of the random variable Y in the population for which the value of X is fixed at the value X = x. Similarly, Var(Y|X = x) is the variance of the random variable Y in the population for which X is fixed at X = x. There are simple relationships between the conditional mean and variance of Y given X and the unconditional mean and variance (see, for example, Casella and Berger, 1990):

    E(Y) = E[E(Y|X = x)]                                                     (A.4)
    Var(Y) = E[Var(Y|X = x)] + Var[E(Y|X = x)]                               (A.5)

For example, suppose that when we condition on the predictor X we have a simple linear regression mean function with constant variance, E(Y|X = x) = β_0 + β_1 x, Var(Y|X = x) = σ². In addition, suppose the unconditional moments of the predictor are E(X) = μ_x and Var(X) = τ_x². Then for the unconditional random variable Y,

    E(Y) = E[E(Y|X = x)] = E[β_0 + β_1 X] = β_0 + β_1 μ_x
    Var(Y) = E[Var(Y|X = x)] + Var[E(Y|X = x)] = E[σ²] + Var[β_0 + β_1 X] = σ² + β_1² τ_x²

The mean of the unconditional variable Y is obtained by substituting the mean of the unconditional variable X into the conditional mean formula, and the unconditional variance of Y equals the conditional variance plus a second quantity that depends on both β_1² and on τ_x².
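Equations (A.4) and (A.5) can likewise be checked by simulating the simple linear regression example just given; all parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
b0, b1, sigma2 = 1.0, 2.0, 0.25    # conditional mean and variance parameters
mu_x, tau2 = 5.0, 9.0              # unconditional moments of the predictor
n = 500_000

x = rng.normal(mu_x, np.sqrt(tau2), n)
y = b0 + b1 * x + rng.normal(0.0, np.sqrt(sigma2), n)

ey_theory = b0 + b1 * mu_x              # (A.4): 1 + 2(5) = 11
vary_theory = sigma2 + b1 ** 2 * tau2   # (A.5): 0.25 + 4(9) = 36.25
```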

A.3  LEAST SQUARES FOR SIMPLE REGRESSION

The ols estimates of β_0 and β_1 in simple regression are the values that minimize the residual sum of squares function,

    RSS(β_0, β_1) = Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i)²                        (A.6)

One method of finding the minimizer is to differentiate with respect to β_0 and β_1, set the derivatives equal to 0, and solve:

    ∂RSS(β_0, β_1)/∂β_0 = −2 Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i) = 0
    ∂RSS(β_0, β_1)/∂β_1 = −2 Σ_{i=1}^{n} x_i (y_i − β_0 − β_1 x_i) = 0

Upon rearranging terms, we get

    β_0 n + β_1 Σ x_i = Σ y_i
    β_0 Σ x_i + β_1 Σ x_i² = Σ x_i y_i                                       (A.7)

Equations (A.7) are called the normal equations for the simple linear regression model (2.1). The normal equations depend on the data only through the sufficient statistics Σ x_i, Σ y_i, Σ x_i² and Σ x_i y_i. Using the formulas

    SXX = Σ (x_i − x̄)² = Σ x_i² − n x̄²
    SXY = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i y_i − n x̄ ȳ                           (A.8)

equivalent and numerically more stable sufficient statistics are given by x̄, ȳ, SXX and SXY. Solving (A.7), we get

    β̂_0 = ȳ − β̂_1 x̄,    β̂_1 = SXY/SXX                                       (A.9)

A.4  MEANS AND VARIANCES OF LEAST SQUARES ESTIMATES

The least squares estimates are linear combinations of the observed values y_1, ..., y_n of the response, so we can apply the results of Appendix A.2 to the estimates found in Appendix A.3 to get the means, variances, and covariances of the estimates. Assume the simple regression model (2.1) is correct. The estimator β̂_1 given at (A.9) can be written as β̂_1 = Σ c_i y_i, where for each i, c_i = (x_i − x̄)/SXX.
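Formulas (A.8) and (A.9) translate directly into code. The sketch below, on made-up data, computes β̂_0 and β̂_1 from the sufficient statistics and checks them against a general-purpose least squares routine.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])   # roughly y = 2x, for illustration

xbar, ybar = x.mean(), y.mean()
SXX = np.sum((x - xbar) ** 2)           # = sum(x**2) - n * xbar**2
SXY = np.sum((x - xbar) * (y - ybar))   # = sum(x*y) - n * xbar * ybar

b1 = SXY / SXX           # (A.9)
b0 = ybar - b1 * xbar

# Check against numpy's least squares polynomial fit (degree 1).
b1_ref, b0_ref = np.polyfit(x, y, 1)
```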

Since we are conditioning on the values of X, the c_i are fixed numbers. By (A.1),

    E(β̂_1|X) = E(Σ c_i y_i | X) = Σ c_i E(y_i | X = x_i)
             = Σ c_i (β_0 + β_1 x_i) = β_0 Σ c_i + β_1 Σ c_i x_i

By direct summation, Σ c_i = 0 and Σ c_i x_i = 1, giving

    E(β̂_1|X) = β_1

which shows that β̂_1 is unbiased for β_1. A similar computation will show that E(β̂_0|X) = β_0. Since the y_i are assumed independent, the variance of β̂_1 is found by an application of (A.2),

    Var(β̂_1|X) = Var(Σ c_i y_i | X) = Σ c_i² Var(Y | X = x_i)
               = σ² Σ c_i² = σ²/SXX

This computation also used Σ c_i² = Σ (x_i − x̄)²/SXX² = 1/SXX. Computing the variance of β̂_0 requires an application of (A.3). We write

    Var(β̂_0|X) = Var(ȳ − β̂_1 x̄ | X)
               = Var(ȳ|X) + x̄² Var(β̂_1|X) − 2x̄ Cov(ȳ, β̂_1|X)                (A.10)

To complete this computation, we need to compute the covariance,

    Cov(ȳ, β̂_1|X) = Cov((1/n) Σ y_i, Σ c_i y_i)
                  = (1/n) Σ c_i Cov(y_i, y_i)
                  = (σ²/n) Σ c_i = 0

because the y_i are independent and Σ c_i = 0. Substituting into (A.10) and simplifying,

    Var(β̂_0|X) = σ² (1/n + x̄²/SXX)

Finally,

    Cov(β̂_0, β̂_1|X) = Cov(ȳ − β̂_1 x̄, β̂_1|X)
                     = Cov(ȳ, β̂_1|X) − x̄ Cov(β̂_1, β̂_1|X)
                     = 0 − σ² x̄/SXX = −σ² x̄/SXX

Further application of these results gives the variance of a fitted value, ŷ = β̂_0 + β̂_1 x:

    Var(ŷ|X = x) = Var(β̂_0 + β̂_1 x | X = x)
                 = Var(β̂_0|X = x) + x² Var(β̂_1|X = x) + 2x Cov(β̂_0, β̂_1|X = x)
                 = σ²(1/n + x̄²/SXX) + σ² x²/SXX − 2σ² x x̄/SXX
                 = σ² (1/n + (x − x̄)²/SXX)                                   (A.11)

A prediction ỹ_* at the future value x_* is just β̂_0 + β̂_1 x_*. The variance of a prediction consists of the variance of the fitted value at x_* given by (A.11), plus σ², the variance of the error that will be attached to the future value, as given by (2.25):

    Var(ỹ_*|X = x_*) = σ² (1/n + (x_* − x̄)²/SXX) + σ²

A.5  ESTIMATING E(Y|X) USING A SMOOTHER

For a 2D scatterplot of Y versus X, a scatterplot smoother provides an estimate of the mean function E(Y|X = x) as x varies, without making parametric assumptions about the mean function, so we do not need to assume that the mean function is a straight line or any other particular form. We very briefly introduce one of many types of local smoothers and provide references to other approaches to smoothing.
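Before turning to smoothers: (A.11) is the scalar form of the matrix expression Var(ŷ|X = x) = σ² x*'(X'X)⁻¹x* with x* = (1, x)' (a result of Section A.8), and the two can be checked against each other numerically. The x values below are arbitrary, and σ² is taken to be 1.

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.5, 4.0, 6.0])   # arbitrary predictor values
n = len(x)
xbar = x.mean()
SXX = np.sum((x - xbar) ** 2)

X = np.column_stack([np.ones(n), x])           # design matrix with rows (1, x_i)
XtX_inv = np.linalg.inv(X.T @ X)

for x0 in (0.0, 1.7, 5.0):
    scalar = 1.0 / n + (x0 - xbar) ** 2 / SXX                  # (A.11), sigma^2 = 1
    matrix = np.array([1.0, x0]) @ XtX_inv @ np.array([1.0, x0])
    assert np.isclose(scalar, matrix)          # the two formulas agree
```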

The smoother we use most often in this book is the simplest case of the loess smoother (Cleveland, 1979; see also the first step of the algorithm in Härdle, 1990, p. 192). This smoother estimates E(Y|X = x_g) by ỹ_g at the point x_g via a weighted least squares simple regression, giving more weight to points close to x_g than to points distant from x_g. Here is the method:

1. Select a value f for the smoothing parameter, a number between 0 and 1. Values of f close to 1 will give curves that are too smooth and will be close to a straight line, while small values of f give curves that are too rough and match all the wiggles in the data. The value of f must be chosen to balance the bias of oversmoothing with the variability of undersmoothing. Remarkably, for many problems, f = 2/3 is a good choice. There is a substantial literature on the appropriate ways to estimate a smoothing parameter for loess and for other smoothing methods, but for the purposes of using a smoother to help us look at a graph, optimal choice of a smoothing parameter is not critical.

2. Find the fn closest points to x_g. For example, if n = 100 and f = 0.6, then find the fn = 60 closest points to x_g. Every time the value of x_g is changed, the points selected may change.

3. Among these fn nearest neighbors to x_g, compute the wls estimates for the simple regression of Y on X, with weights determined so that points close to x_g have the highest weight, and the weights decline toward 0 for points farther from x_g. We use a triangular weight function that gives maximum weight to data at x_g and weights that decrease linearly to 0 at the edge of the neighborhood. If a different weight function is used, answers are somewhat different.

4. The value of ỹ_g is the fitted value at x_g from the wls regression using the nearest neighbors found at step 2 as the data, and the weights from step 3 as weights.

5. Repeat steps 1–4 for many values of x_g that form a grid of points covering the interval on the x-axis of interest. Join the points.

Figure A.1 shows a plot of Y versus X, along with four smoothers.
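The five steps can be sketched in a few lines of code. This is a bare-bones nearest-neighbor version with the triangular weight function of step 3, not Cleveland's full loess (no robustness iterations, no tricube weights); on exactly linear data it should reproduce the line.

```python
import numpy as np

def nn_smooth(x, y, f=2/3, grid=None):
    """Nearest-neighbor WLS smoother with triangular weights (steps 1-5)."""
    if grid is None:
        grid = np.linspace(x.min(), x.max(), 25)   # step 5: grid of x_g values
    k = max(2, int(np.ceil(f * len(x))))           # step 2: fn nearest points
    out = np.empty_like(grid)
    for g, xg in enumerate(grid):
        d = np.abs(x - xg)
        idx = np.argsort(d)[:k]                    # the k nearest neighbors
        h = d[idx].max()
        w = 1.0 - d[idx] / h                       # step 3: triangular weights
        W = np.diag(w)
        Xl = np.column_stack([np.ones(k), x[idx]])
        beta = np.linalg.solve(Xl.T @ W @ Xl, Xl.T @ W @ y[idx])
        out[g] = beta[0] + beta[1] * xg            # step 4: fitted value at x_g
    return grid, out

# On exactly linear data the smoother reproduces the line.
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x
grid, smooth = nn_smooth(x, y)
```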
The first smoother is the ols simple regression line, which does not match the data well because the mean function for the data in this figure is probably curved, not straight. The loess smooth with f = 0.1 is, as expected, very wiggly, matching the local variation rather than the mean. The line for f = 2/3 seems to match the data very well, while the loess fit for f = .95 is nearly the same as for f = 2/3, but it tends toward oversmoothing and attempts to match the ols line. We would conclude from this graph that a straight-line mean function is likely to be inadequate because it does not match the data very well. Loader (2004) discusses a formal lack-of-fit test on the basis of comparing parametric and nonparametric estimates of the mean function that is presented in Problem 5.3. The loess smoother is an example of a nearest neighbor smoother. Local polynomial regression smoothers and kernel smoothers are similar to loess, except they

give positive weight to all cases within a fixed distance of the point of interest rather than a fixed number of points. There is a large literature on nonparametric regression, for which scatterplot smoothing is a primary tool. Recent references on this subject include Simonoff (1996), Bowman and Azzalini (1997), and Loader (2004).

[FIG. A.1  Three choices of the smoothing parameter (f = .1, f = 2/3, f = .95) for a loess smooth of Height versus Dbh, together with the OLS line.]

The literature on estimating a variance function from a scatterplot is much smaller than the literature on estimating the mean (but see Ruppert, Wand, Holst and Hössjer, 1997). Here is a simple algorithm that can produce a smoother that estimates the standard deviation function, which is the square root of the variance function:

1. Smooth the y_i on the x_i to get an estimate, say ỹ_i, for each value of X = x_i. Compute the squared residuals, r_i = (y_i − ỹ_i)². Under normality of errors, the expectation E(r_i|x_i) = Var(Y|X = x_i), so a mean smooth for the squared residuals estimates the variance smooth for Y.

2. Smooth the r_i on x_i to estimate Var(Y|X = x_i) by s_i² at each value x_i. Then s_i² is the smoothed estimate of the variance, and s_i is a smoothed estimate of the standard deviation.

3. Add three lines to the scatterplot: the mean smooth (x_i, ỹ_i), the mean smooth plus one standard deviation, (x_i, ỹ_i + s_i), and the mean smooth minus one standard deviation, (x_i, ỹ_i − s_i).

[FIG. A.2  loess smooth for the mean function (solid line), and mean ± one standard deviation (dashed lines), for Height versus Dbh.]

Figure A.2 shows the loess smooth for the mean function and the mean plus and minus one standard deviation for the same data as in Figure A.1. The variability appears to be a bit larger in the middle of the range than at the edges.

A.6  A BRIEF INTRODUCTION TO MATRICES AND VECTORS

We provide only a brief introduction to matrices and vectors. More complete references include Graybill (1969), Searle (1982), Schott (1996), or any good linear algebra book. Boldface type is used to indicate matrices and vectors. We will say that X is an r × c matrix if it is an array of numbers with r rows and c columns. A specific 4 × 3 matrix X can be written in terms of its elements as

    X = [ x_11  x_12  x_13 ]
        [ x_21  x_22  x_23 ]
        [ x_31  x_32  x_33 ]
        [ x_41  x_42  x_43 ]  = (x_ij)                                       (A.12)

The element x_ij of X is the number in the ith row and the jth column; for example, x_32 is the element in the third row, second column.

A vector is a matrix with just one column. A specific 4 × 1 matrix y, which is a vector of length 4, is given by

    y = [ y_1 ]
        [ y_2 ]
        [ y_3 ]
        [ y_4 ]

The elements of a vector are generally singly subscripted; thus, y_3 is the third element of y. A row vector is a matrix with one row. We do not use row vectors in this book. If a vector is needed to represent a row, a transpose of a column vector will be used (see below). A square matrix has the same number of rows and columns, so r = c. A square matrix Z is symmetric if z_ij = z_ji for all i and j. A square matrix is diagonal if all elements off the main diagonal are 0: z_ij = 0 unless i = j. The matrices C and D below (with entries chosen for illustration) are symmetric and diagonal, respectively:

    C = [ 7  2  1 ]        D = [ 7  0  0 ]
        [ 2  4  3 ]            [ 0  4  0 ]
        [ 1  3  5 ]            [ 0  0  5 ]

The diagonal matrix with all elements on the diagonal equal to 1 is called the identity matrix, for which the symbol I is used. The 4 × 4 identity matrix is

    I = [ 1  0  0  0 ]
        [ 0  1  0  0 ]
        [ 0  0  1  0 ]
        [ 0  0  0  1 ]

A scalar is a 1 × 1 matrix, an ordinary number.

A.6.1  Addition and Subtraction

Two matrices can be added or subtracted only if they have the same number of rows and columns. The sum C = A + B of r × c matrices is also r × c. Addition is done elementwise:

    C = A + B = [ a_11  a_12 ]   [ b_11  b_12 ]   [ a_11+b_11  a_12+b_12 ]
                [ a_21  a_22 ] + [ b_21  b_22 ] = [ a_21+b_21  a_22+b_22 ]
                [ a_31  a_32 ]   [ b_31  b_32 ]   [ a_31+b_31  a_32+b_32 ]

Subtraction works the same way, with the + signs changed to − signs. The usual rules for addition of numbers apply to addition of matrices, namely commutativity, A + B = B + A, and associativity, (A + B) + C = A + (B + C).

A.6.2  Multiplication by a Scalar

If k is a number and A is an r × c matrix with elements (a_ij), then kA is an r × c matrix with elements (k a_ij). For example, the matrix σ²I has all diagonal elements equal to σ² and all off-diagonal elements equal to 0.

A.6.3  Matrix Multiplication

Multiplication of matrices follows rules that are more complicated than are the rules for addition and subtraction. For two matrices to be multiplied together in the order AB, the number of columns of A must equal the number of rows of B. For example, if A is r × c and B is c × q, then C = AB is r × q. If the elements of A are (a_ij) and the elements of B are (b_ij), then the elements of C = (c_ij) are given by the formula

    c_ij = Σ_{k=1}^{c} a_ik b_kj

This formula says that c_ij is formed by taking the ith row of A and the jth column of B, multiplying the first element of the specified row in A by the first element in the specified column in B, multiplying second elements, and so on, and then adding the products together. If A is 1 × c and B is c × 1, then the product AB is 1 × 1, an ordinary number. For example, if (with entries chosen for illustration)

    A = ( 1  3  2  −1 )        B = [ 2 ]
                                   [ 1 ]
                                   [ 2 ]
                                   [ 4 ]

then the product AB is

    AB = (1 × 2) + (3 × 1) + (2 × 2) + (−1 × 4) = 5

AB is not the same as BA. For the preceding matrices, the product BA will be a 4 × 4 matrix. The following small example illustrates what happens when all the dimensions are bigger than 1. A 3 × 2 matrix A times a 2 × 2 matrix B is given as

    [ a_11  a_12 ]                  [ a_11 b_11 + a_12 b_21   a_11 b_12 + a_12 b_22 ]
    [ a_21  a_22 ] [ b_11  b_12 ] = [ a_21 b_11 + a_22 b_21   a_21 b_12 + a_22 b_22 ]
    [ a_31  a_32 ] [ b_21  b_22 ]   [ a_31 b_11 + a_32 b_21   a_31 b_12 + a_32 b_22 ]
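These shape rules and a small 1 × 4 times 4 × 1 product are easy to reproduce with a matrix library; the entries below are chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, 3.0, 2.0, -1.0]])        # 1 x 4 row vector
B = np.array([[2.0], [1.0], [2.0], [4.0]])   # 4 x 1 column vector

AB = A @ B   # 1 x 1: (1)(2) + (3)(1) + (2)(2) + (-1)(4) = 5
BA = B @ A   # 4 x 4: matrix multiplication is not commutative

# The associative law A(BC) = (AB)C, with C chosen to conform.
C = np.array([[1.0, 2.0]])                   # 1 x 2
assoc_left = (A @ B) @ C
assoc_right = A @ (B @ C)
```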

Using numbers (again chosen for illustration), an example of multiplication of two matrices is

    [ 3  1 ]              [ 7   6 ]
    [ 1  2 ] [ 2  1 ]  =  [ 4   7 ]
    [ 0  4 ] [ 1  3 ]     [ 4  12 ]

In this example, BA is not defined because the number of columns of B is not equal to the number of rows of A. However, the associative law holds: if A is r × c, B is c × q, and C is q × p, then A(BC) = (AB)C, and the result is an r × p matrix.

A.6.4  Transpose of a Matrix

The transpose of an r × c matrix X is a c × r matrix called X' such that if the elements of X are (x_ij), then the elements of X' are (x_ji). For the matrix X given at (A.12),

    X' = [ x_11  x_21  x_31  x_41 ]
         [ x_12  x_22  x_32  x_42 ]
         [ x_13  x_23  x_33  x_43 ]

The transpose of a column vector is a row vector. The transpose of a product (AB)' is the product of the transposes, in opposite order, so (AB)' = B'A'. Suppose that a is an r × 1 vector with elements a_1, ..., a_r. Then the product a'a will be a 1 × 1 matrix or scalar, given by

    a'a = a_1² + a_2² + ... + a_r² = Σ_{i=1}^{r} a_i²                        (A.13)

Thus, a'a provides a compact notation for the sum of the squares of the elements of a vector a. The square root of this quantity, (a'a)^(1/2), is called the norm or length of the vector a. Similarly, if a and b are both r × 1 vectors, then we obtain

    a'b = a_1 b_1 + a_2 b_2 + ... + a_r b_r = Σ_{i=1}^{r} a_i b_i = Σ_{i=1}^{r} b_i a_i = b'a

The fact that a'b = b'a is often quite useful in manipulating the vectors used in regression calculations. Another useful formula in regression calculations is obtained by applying the distributive law,

    (a − b)'(a − b) = a'a + b'b − 2a'b                                       (A.14)

A.6.5  Inverse of a Matrix

For any scalar c ≠ 0, there is another number called the inverse of c, say d, such that the product cd = 1. For example, if c = 3, then d = 1/c = 1/3, and the inverse

of 3 is 1/3. Similarly, the inverse of 1/3 is 3. The number 0 does not have an inverse because there is no other number d such that 0 × d = 1. Square matrices can also have an inverse. We will say that the inverse of a matrix C is another matrix D, such that CD = I, and we write D = C⁻¹. Not all square matrices have an inverse. The collection of matrices that have an inverse are called full rank, invertible, or nonsingular. A square matrix that is not invertible is of less than full rank, or singular. If a matrix has an inverse, it has a unique inverse. The inverse is easy to compute only in special cases, and its computation in general can require a very tedious calculation that is best done on a computer. High-level matrix and statistical languages such as Matlab, Maple, Mathematica, R and S-plus include functions for inverting matrices, or returning an appropriate message if the inverse does not exist. The identity matrix I is its own inverse. If C is a diagonal matrix with nonzero diagonal elements, say

    C = [ c_1   0  ···   0  ]
        [  0   c_2 ···   0  ]
        [  ⋮         ⋱   ⋮  ]
        [  0    0  ···  c_n ]

then C⁻¹ is the diagonal matrix with diagonal elements 1/c_1, ..., 1/c_n, as can be verified by direct multiplication. For any diagonal matrix with nonzero diagonal elements, the inverse is obtained by inverting the diagonal elements. If any of the diagonal elements are 0, then no inverse exists.

A.6.6  Orthogonality

Two vectors a and b of the same length are orthogonal if a'b = 0. An r × c matrix Q has orthonormal columns if its columns, viewed as a set of c different r × 1 vectors, are orthogonal and in addition have length 1. This is equivalent to requiring that Q'Q = I_c, the c × c identity matrix. A square matrix A is orthogonal if A'A = AA' = I, and so A⁻¹ = A'. For example, the matrix

    A = (1/√2) [ 1   1 ]
               [ 1  −1 ]

can be shown to be orthogonal by showing that A'A = I, and therefore A⁻¹ = A'.

A.6.7  Linear Dependence and Rank of a Matrix

Suppose we have an n × p matrix X, with columns given by the vectors x_1, ..., x_p; we consider only the case p ≤ n. We will say that x_1, ..., x_p are linearly dependent if we can find multipliers a_1, ..., a_p, not all of which are 0, such that

    Σ_{i=1}^{p} a_i x_i = 0                                                  (A.15)

If no such multipliers exist, then we say that the vectors are linearly independent, and the matrix is full rank. In general, the rank of a matrix is the maximum number of the x_i that form a linearly independent set. For example, the matrix X given at (A.12) has linearly independent columns if no a_i, not all equal to zero, can be found that satisfy (A.15). On the other hand, a matrix

    X = (x_1, x_2, x_3)  with  x_3 = 3x_1 + x_2                              (A.16)

has linearly dependent columns and is singular, because 3x_1 + x_2 − x_3 = 0. This matrix is of rank two, because the largest linearly independent subset of its columns, consisting of any two of the three columns, has two elements. The matrix X'X is a p × p matrix. If X has rank p, so does X'X. Full-rank square matrices always have an inverse. Square matrices of less than full rank never have an inverse.

A.7  RANDOM VECTORS

An n × 1 vector Y is a random vector if each of its elements is a random variable. The mean of an n × 1 random vector Y is also an n × 1 vector whose elements are the means of the elements of Y. The variance of an n × 1 vector Y is an n × n square symmetric matrix, often called a covariance matrix, written Var(Y), with Var(y_i) as its (i, i) element and Cov(y_i, y_j) = Cov(y_j, y_i) as both the (i, j) and (j, i) elements.
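A dependence like the one in Section A.6.7 above, x_3 = 3x_1 + x_2, is easy to exhibit numerically; the column entries below are made up, and only the dependence among them matters.

```python
import numpy as np

x1 = np.array([1.0, 0.0, 2.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0, 3.0])
x3 = 3.0 * x1 + x2                    # exactly the dependence 3*x1 + x2 - x3 = 0
X = np.column_stack([x1, x2, x3])     # 4 x 3, but only rank 2

rank = np.linalg.matrix_rank(X)
# X'X inherits the rank deficiency, so it is singular (determinant 0).
det_XtX = np.linalg.det(X.T @ X)
```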

The rules for means and variances of random vectors are matrix equivalents of the scalar versions in Appendix A.2. If a_0 is a vector of constants, and A is a matrix of constants,

    E(a_0 + AY) = a_0 + A E(Y)                                               (A.17)
    Var(a_0 + AY) = A Var(Y) A'                                              (A.18)

A.8  LEAST SQUARES USING MATRICES

The multiple linear regression model can be written as

    E(Y|X = x) = β'x,    Var(Y|X = x) = σ²

In matrix terms, we will write the model using errors as

    Y = Xβ + e

where Y is the n × 1 vector of response values and X is an n × p' matrix. If the mean function includes an intercept, then the first column of X is a vector of ones, and p' = p + 1. If the mean function does not include an intercept, then the column of ones is not included in X and p' = p. The ith row of the n × p' matrix X is x_i', β is a p' × 1 vector of parameters for the mean function, e is the n × 1 vector of unobservable errors, and σ² is an unknown positive constant. The ols estimate β̂ of β is given by the arguments that minimize the residual sum of squares function,

    RSS(β) = (Y − Xβ)'(Y − Xβ)

Using (A.14), we obtain

    RSS(β) = Y'Y + β'(X'X)β − 2Y'Xβ                                          (A.19)

RSS(β) depends on only three functions of the data: Y'Y, X'X, and Y'X. Any two data sets that have the same values of these three quantities will have the same least squares estimates. Using (A.8), the information in these quantities is equivalent to the information contained in the sample means of the terms plus the sample covariances of the terms and the response. To minimize (A.19), differentiate with respect to β and set the result equal to 0. This leads to the matrix version of the normal equations,

    X'Xβ = X'Y                                                               (A.20)

The ols estimates are any solution to these equations. If the inverse of (X'X) exists, as it will if the columns of X are linearly independent, the ols estimates are unique and are given by

    β̂ = (X'X)⁻¹X'Y                                                          (A.21)
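Equation (A.21) in code, on made-up data: solving the normal equations (A.20) directly is numerically safer than forming the inverse, and the answer matches a library least squares routine.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])     # intercept column of ones
Y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # solves (A.20); same answer as (A.21)
beta_ref = np.linalg.lstsq(X, Y, rcond=None)[0]
```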

If the inverse does not exist, then the matrix (X'X) is of less than full rank, and the ols estimate is not unique. In this case, most computer programs will use a linearly independent subset of the columns of X in fitting the model, so that the reduced model matrix does have full rank. This is discussed further in the text.

A.8.1  Properties of Estimates

Using the rules for means and variances of random vectors, (A.17) and (A.18), we find

    E(β̂|X) = E((X'X)⁻¹X'Y|X) = (X'X)⁻¹X'E(Y|X) = (X'X)⁻¹X'Xβ = β            (A.22)

so β̂ is unbiased for β, as long as the mean function that was fit is the true mean function. The variance of β̂ is

    Var(β̂|X) = Var((X'X)⁻¹X'Y|X)
             = (X'X)⁻¹X'[Var(Y|X)]X(X'X)⁻¹
             = (X'X)⁻¹X'[σ²I]X(X'X)⁻¹
             = σ²(X'X)⁻¹X'X(X'X)⁻¹
             = σ²(X'X)⁻¹                                                     (A.23)

The variances and covariances are compactly determined as σ² times a matrix whose elements are determined only by X and not by Y.

A.8.2  The Residual Sum of Squares

Let Ŷ = Xβ̂ be the n × 1 vector of fitted values corresponding to the n cases in the data, and ê = Y − Ŷ the vector of residuals. One representation of the residual sum of squares, which is the residual sum of squares function evaluated at β̂, is

    RSS = (Y − Ŷ)'(Y − Ŷ) = ê'ê = Σ_{i=1}^{n} ê_i²

which suggests that the residual sum of squares can be computed by squaring the residuals and adding them up. In multiple linear regression, it can also be computed more efficiently on the basis of summary statistics. Using (A.19) and the summary statistics X'X, X'Y and Y'Y, we write

    RSS = RSS(β̂) = Y'Y + β̂'X'Xβ̂ − 2Y'Xβ̂

We will first show that β̂'X'Xβ̂ = Y'Xβ̂. Substituting for one of the β̂s, we get

    β̂'X'X(X'X)⁻¹X'Y = β̂'X'Y = Y'Xβ̂

the last result following because taking the transpose of a 1 × 1 matrix does not change its value. The residual sum of squares function can now be rewritten as

    RSS = Y'Y − β̂'X'Xβ̂ = Y'Y − Ŷ'Ŷ

where Ŷ = Xβ̂ are the fitted values. The residual sum of squares is the difference in the squares of the lengths of the two vectors Y and Ŷ. Another useful form for the residual sum of squares is

    RSS = SYY(1 − R²)

where R² is the square of the sample correlation between Ŷ and Y.

A.8.3  Estimate of Variance

Under the assumption of constant variance, the estimate of σ² is

    σ̂² = RSS/d                                                               (A.24)

with d degrees of freedom, where d is equal to the number of cases n minus the number of terms with estimated coefficients in the model. If the matrix X is of full rank, then d = n − p', where p' = p for mean functions without an intercept, and p' = p + 1 for mean functions with an intercept. The number of estimated coefficients will be less than p' if X is not of full rank.

A.9  THE QR FACTORIZATION

Most of the formulas given in this book are convenient for derivations but can be inaccurate when used on a computer, because inverting a matrix such as (X'X) leaves open the possibility of introducing significant rounding errors into calculations. Most statistical packages will use better methods of computing, and understanding how they work is useful. We start with the basic n × p' matrix X of terms. Suppose we could find an n × p' matrix Q and a p' × p' matrix R such that (1) X = QR; (2) Q has orthonormal columns, meaning that Q'Q = I_{p'}; and (3) R is an upper triangular matrix, meaning that all the entries in R below the diagonal are equal to 0, but those on or above the diagonal can be nonzero.
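A factorization with these three properties is exactly what standard libraries return. The sketch below (made-up data) uses it to compute β̂ by a triangular solve and hat-matrix elements from Q alone, with no matrix explicitly inverted.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, x ** 2])   # n x p' matrix of terms
Y = 1.0 + x - 0.5 * x ** 2 + rng.normal(scale=0.2, size=n)

Q, R = np.linalg.qr(X)                  # X = QR, Q'Q = I, R upper triangular
beta_qr = np.linalg.solve(R, Q.T @ Y)   # backsolve R beta = Q'Y
beta_ref = np.linalg.lstsq(X, Y, rcond=None)[0]

H = Q @ Q.T                             # hat matrix QQ', no inverse needed
h_00 = Q[0] @ Q[0]                      # an element h_ij is q_i' q_j, rows of Q
```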

Using the basic properties of matrices, we can write

    X = QR
    X'X = (QR)'(QR) = R'R
    (X'X)⁻¹ = (R'R)⁻¹ = R⁻¹(R')⁻¹                                            (A.25)
    β̂ = (X'X)⁻¹X'Y = R⁻¹(Q'Y)                                               (A.26)
    H = X(X'X)⁻¹X' = QQ'                                                     (A.27)

Equation (A.25) follows because R is a square matrix, and the inverse of the product of square matrices is the product of the inverses in opposite order. From (A.26), to compute β̂, first compute Q'Y, which is a p' × 1 vector, and multiply on the left by R to get

    Rβ̂ = Q'Y                                                                (A.28)

This last equation is very easy to solve because R is a triangular matrix, and so we can use backsolving. For example, to solve the equations

    [ 7  1  2 ]        [ 3 ]
    [ 0  2  1 ]  β̂  =  [ 2 ]
    [ 0  0  4 ]        [ 4 ]

(the numbers here are chosen for illustration), first solve the last equation, so β̂_3 = 1; substitute into the equation above it, so 2β̂_2 + 1 = 2 and β̂_2 = 1/2; finally, the first equation is 7β̂_1 + 1/2 + 2 = 3, so β̂_1 = 1/14. Equation (A.27) shows how the elements of the n × n hat matrix H can be computed without inverting a matrix and without using all the storage needed to save H in full. If q_i' is the ith row of Q, then an element h_ij of the H matrix is simply computed as h_ij = q_i'q_j. Golub and Van Loan (1996) provide a complete treatment of computing and using the QR factorization. Very high quality computer code for computing this and related quantities for statistics is provided in the publicly available Lapack package, described on the internet. This code is also used in many standard statistical packages.

A.10  MAXIMUM LIKELIHOOD ESTIMATES

Maximum likelihood estimation is probably the most frequently used method of deriving estimates in statistics. A general treatment is given by Casella and Berger (1990, Section 7.2.2); here we derive the maximum likelihood estimates for the linear regression model assuming normality, without proof or much explanation. Our goal is to establish notation and define quantities that will be used in the

discussion of Box–Cox transformations, and estimation for generalized linear models in Chapter 12.

The normal multiple linear regression model specifies for the ith observation that

(y_i|x_i) ~ N(β′x_i, σ²)

Given this model, the density for the ith observation y_i is the normal density function,

f(y_i|x_i, β, σ²) = (1/(√(2π)σ)) exp(−(y_i − β′x_i)²/(2σ²))

Assuming the observations are independent, the likelihood function is just the product of the densities for each of the n observations, viewed as a function of the parameters with the data fixed rather than as a function of the data with the parameters fixed:

L(β, σ²|Y) = ∏_{i=1}^n f(y_i|x_i, β, σ²)
           = ∏_{i=1}^n (1/(√(2π)σ)) exp(−(y_i − β′x_i)²/(2σ²))
           = (1/(√(2π)σ))ⁿ exp(−(1/(2σ²)) Σ_{i=1}^n (y_i − β′x_i)²)

The maximum likelihood estimates are simply the values of β and σ² that maximize the likelihood function. The values that maximize the likelihood will also maximize the logarithm of the likelihood,

log L(β, σ²|Y) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − β′x_i)²     (A.29)

The log-likelihood function (A.29) is a sum of three terms. Since β is included only in the third term and this term has a negative sign in front of it, maximizing the log-likelihood over β is the same as minimizing the third term, which, apart from constants, is the same as the residual sum of squares function (see Section 3.4.3). We have just shown that the maximum likelihood estimate of β for the normal linear regression problem is the same as the OLS estimator. Fixing β at the OLS estimator β̂, (A.29) becomes

log L(β̂, σ²|Y) = −(n/2) log(2π) − (n/2) log(σ²) − RSS/(2σ²)     (A.30)
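The argument above is easy to check numerically: with a small invented data set, the log-likelihood (A.29) evaluated at the closed-form OLS estimates, with σ² set to RSS/n, is at least as large as at any perturbed parameter values. A minimal pure-Python sketch (data are illustrative only):

```python
# Numerical check: for normal errors, the log-likelihood (A.29) is
# maximized at the OLS estimate, and profiling out beta gives
# sigma^2 = RSS/n. Toy data, invented for illustration.
import math

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.2, 2.9, 5.1, 7.2, 8.8]
n = len(x)

def loglik(b0, b1, s2):
    # the log-likelihood (A.29) for simple regression E(y|x) = b0 + b1*x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return (-n / 2 * math.log(2 * math.pi)
            - n / 2 * math.log(s2) - rss / (2 * s2))

# OLS estimates in closed form
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s2_mle = rss / n          # the MLE of sigma^2, from differentiating (A.30)

best = loglik(b0, b1, s2_mle)
```

Evaluating `loglik` at any nearby (b0, b1, s2) returns a value no larger than `best`, which is the numerical face of the argument in the text.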

Differentiating (A.30) with respect to σ² and setting the result to 0 gives the maximum likelihood estimator for σ² as RSS/n, the same estimate we have been using, apart from division by n rather than n − p′. Maximum likelihood estimation has many important properties that make it useful. These estimates are approximately normally distributed in large samples, and the large-sample variance achieves the lower bound for the variance of all unbiased estimates.

A.11 THE BOX–COX METHOD FOR TRANSFORMATIONS

A.11.1 Univariate Case

Box and Cox (1964) derived the Box–Cox method for selecting a transformation using a likelihood-like method. They supposed that, for some value of λ, ψ_M(Y, λ) given by (7.6), page 153, is normally distributed. With n independent observations, therefore, the log-likelihood function for (β, σ², λ) is given by (A.29), but with y_i replaced by ψ_M(y_i, λ),¹

log L(β, σ², λ|Y) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (ψ_M(y_i, λ) − β′x_i)²     (A.31)

For a fixed value of λ, (A.31) is the same as (A.29), and so the maximum likelihood estimates for β and σ² are obtained from the regression of ψ_M(Y, λ) on X, and the value of the log-likelihood evaluated at these estimates is

log L(β(λ), σ²(λ), λ|Y) = −(n/2) log(2π) − (n/2) log(RSS(λ)/n) − n/2     (A.32)

where RSS(λ) is the residual sum of squares in the regression of ψ_M(Y, λ) on X. Only the second term in (A.32) involves the data, and so the global maximum likelihood estimate of λ minimizes RSS(λ). Standard likelihood theory can be applied to get a (1 − α) × 100% confidence interval for λ as the set

{λ | 2[log L(β(λ̂), σ²(λ̂), λ̂|Y) − log L(β(λ), σ²(λ), λ|Y)] < χ²(1, 1 − α)}

Or, setting α = .05 so χ²(1, .95) = 3.84, and using (A.32),

{λ | (n/2)(log(RSS(λ)) − log(RSS(λ̂))) < 1.92}     (A.33)

¹As λ is varied, the units of ψ_M(Y, λ) can change, and so the joint density of the transformed data would in general require a Jacobian term; see Casella and Berger (1990, Section 4.3).
The modified power transformations are defined so that the Jacobian of the transformation is always equal to 1, and it can therefore be ignored.
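A minimal sketch of this computation in pure Python, with an invented five-point data set: profile (A.32) over a grid of λ by computing RSS(λ), then collect the confidence set (A.33). The transformation is taken here to be the geometric-mean-scaled modified power family, gm(Y)^(1−λ)(Y^λ − 1)/λ, with gm(Y)·log(Y) at λ = 0, which is one common form of (7.6); the scaling keeps RSS(λ) comparable across λ.

```python
# Sketch of the Box-Cox profile computation in (A.32)-(A.33).
# Toy data, invented for illustration: y is roughly exp(x), so the
# log transformation (lambda = 0) should be strongly preferred.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.7, 7.4, 20.1, 54.6, 148.4]
n = len(y)
gm = math.exp(sum(math.log(v) for v in y) / n)   # geometric mean of y

def psi_m(v, lam):
    # modified power transformation, scaled so the Jacobian is 1
    if lam == 0:
        return gm * math.log(v)
    return gm ** (1 - lam) * (v ** lam - 1) / lam

def rss_lambda(lam):
    # RSS from the simple regression of psi_M(y, lambda) on x
    z = [psi_m(v, lam) for v in y]
    xbar, zbar = sum(x) / n, sum(z) / n
    b1 = (sum((a - xbar) * (c - zbar) for a, c in zip(x, z))
          / sum((a - xbar) ** 2 for a in x))
    b0 = zbar - b1 * xbar
    return sum((c - b0 - b1 * a) ** 2 for a, c in zip(x, z))

grid = [i / 100 for i in range(-100, 101)]       # lambda in [-1, 1]
lam_hat = min(grid, key=rss_lambda)              # minimizes RSS(lambda)
ci = [lam for lam in grid                        # the set in (A.33)
      if n / 2 * (math.log(rss_lambda(lam))
                  - math.log(rss_lambda(lam_hat))) < 1.92]
```

For these data λ̂ lands near 0 and the confidence set excludes λ = 1, mirroring the kind of plot described in the next paragraph.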

Many statistical packages have routines that will provide a graph of RSS(λ) versus λ, or of −(n/2) log(RSS(λ)) versus λ, as shown in Figure 7.8 for the highway accident data. Equation (A.33) shows that the confidence interval for λ includes all values of λ for which the log-likelihood is within 1.92 units of the maximum value of the log-likelihood, between the two vertical lines in the figure.

A.11.2 Multivariate Case

Although the material in this section uses more mathematical statistics than most of this book, it is included because the details of computing the multivariate extension of Box–Cox transformations are not published elsewhere. The basic idea was proposed by Velilla (1993). Suppose X is a set of k variables we wish to transform, and define

ψ_M(X, λ) = (ψ_M(X₁, λ₁), ..., ψ_M(X_k, λ_k))′

Here, we have used the modified power transformations (7.6) for each element of X, but the same general idea can be applied using other transformations, such as the Yeo–Johnson family introduced in Section 7.4. In analogy to the univariate case, we assume that for some λ we will have

ψ_M(X, λ) ~ N(µ, V)

where V is some unknown positive definite symmetric matrix that needs to be estimated. If x_i is the observed value of X for the ith observation, then the likelihood function is given by

L(µ, V, λ|X) = ∏_{i=1}^n |2πV|^{−1/2} exp(−½ (ψ_M(x_i, λ) − µ)′V⁻¹(ψ_M(x_i, λ) − µ))     (A.34)

where |V| is the determinant.² After rearranging terms, the log-likelihood is given by

log(L(µ, V, λ|X)) = −(n/2) log(|2πV|) − ½ Σ_{i=1}^n (ψ_M(x_i, λ) − µ)′V⁻¹(ψ_M(x_i, λ) − µ)     (A.35)

If we fix λ, then (A.35) is the standard log-likelihood for the multivariate normal distribution. The values of V and µ that maximize (A.35) are the sample mean and

²The determinant is defined in any linear algebra textbook.

sample covariance matrix, the latter with divisor n rather than n − 1,

µ(λ) = (1/n) Σ_{i=1}^n ψ_M(x_i, λ)
V(λ) = (1/n) Σ_{i=1}^n (ψ_M(x_i, λ) − µ(λ))(ψ_M(x_i, λ) − µ(λ))′

Substituting these estimates into (A.35) gives the profile log-likelihood for λ,

log(L(µ(λ), V(λ), λ|X)) = −(n/2) log(|2πV(λ)|) − nk/2     (A.36)

This equation will be maximized by minimizing the determinant of V(λ) over values of λ. This is a numerical problem for which there is no closed-form solution, but it can be solved using a general-purpose function minimizer. Standard theory for maximum likelihood estimates can provide tests concerning λ and standard errors for the elements of λ. To test the hypothesis that λ = λ₀ against a general alternative, compute

G² = 2[log(L(µ(λ̂), V(λ̂), λ̂)) − log(L(µ(λ₀), V(λ₀), λ₀))]

and compare G² to a chi-squared distribution with k df. The standard error of λ̂ is obtained from the inverse of the expected information matrix evaluated at λ̂. The expected information for λ̂ is the negative of the matrix of second derivatives of (A.36) with respect to λ, evaluated at λ̂. Many optimization routines, such as optim in R, will return the matrix of estimated second derivatives if requested; all that is required is inverting this matrix, and then the square roots of the diagonal elements are the estimated standard errors.

A.12 CASE DELETION IN LINEAR REGRESSION

Suppose X is the n × p′ matrix of terms with linearly independent columns. We use the subscript (i) to mean "without case i," so that X₍ᵢ₎ is an (n − 1) × p′ matrix. We can compute (X′₍ᵢ₎X₍ᵢ₎)⁻¹ from the remarkable formula

(X′₍ᵢ₎X₍ᵢ₎)⁻¹ = (X′X)⁻¹ + (X′X)⁻¹x_i x_i′(X′X)⁻¹ / (1 − h_ii)     (A.37)

where h_ii = x_i′(X′X)⁻¹x_i is the ith leverage value, a diagonal value from the hat matrix. This formula was used by Gauss (1821); a history of it and of its many variations is given by Henderson and Searle (1981). It can be applied to give all the results

that one would want relating multiple linear regression with and without the ith case. For example, the estimate of β computed without case i is

β̂₍ᵢ₎ = β̂ − (X′X)⁻¹x_i ê_i / (1 − h_ii)     (A.38)

Writing r_i = ê_i/(σ̂√(1 − h_ii)), the estimate of variance is

σ̂²₍ᵢ₎ = σ̂² (n − p′ − r_i²) / (n − p′ − 1)     (A.39)

and the studentized residual t_i is

t_i = r_i ((n − p′ − 1)/(n − p′ − r_i²))^{1/2}     (A.40)

The diagnostic statistics examined in this book were first thought to be practical because simple formulas such as these could be used to obtain various statistics when cases are deleted, avoiding the need to recompute estimates. Advances in computing in the last 20 years or so have made the computational burden of recomputing without a case much less onerous, and so diagnostic methods equivalent to those discussed here can be applied to problems other than linear regression, where the updating formulas are not available.
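The updating formulas are easy to verify numerically. The sketch below, in pure Python with invented data and a 2-term mean function, computes β̂₍ᵢ₎ two ways: once from the one-step update (A.38) and once by refitting from scratch without case i.

```python
# Numerical sanity check of the case-deletion updates (A.37)-(A.38).
# Toy data, invented for illustration.

def matinv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def xtx_xty(X, Y):
    """Return X'X and X'Y for a list-of-rows X and response Y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yv for r, yv in zip(X, Y)) for i in range(p)]
    return xtx, xty

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
Y = [2.0, 2.9, 4.2, 4.8, 6.1]

xtx, xty = xtx_xty(X, Y)
inv = matinv2(xtx)                                  # (X'X)^{-1}
beta = [sum(inv[a][b] * xty[b] for b in range(2)) for a in range(2)]

i = 2                                               # case to delete
xi = X[i]
hii = sum(xi[a] * inv[a][b] * xi[b] for a in range(2) for b in range(2))
ei = Y[i] - sum(xi[j] * beta[j] for j in range(2))  # residual for case i

# one-step update (A.38)
v = [sum(inv[a][b] * xi[b] for b in range(2)) for a in range(2)]
beta_del = [beta[a] - v[a] * ei / (1 - hii) for a in range(2)]

# refit from scratch without case i
X2, Y2 = X[:i] + X[i + 1:], Y[:i] + Y[i + 1:]
xtx2, xty2 = xtx_xty(X2, Y2)
inv2 = matinv2(xtx2)
beta_ref = [sum(inv2[a][b] * xty2[b] for b in range(2)) for a in range(2)]
```

The two estimates agree to rounding error, and the intermediate quantities also verify (A.37) directly, since inv2 equals inv plus the rank-one correction built from v and h_ii.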

References

Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York: Wiley.
Agresti, A. (2002). Categorical Data Analysis, Second Edition. New York: Wiley.
Allen, D. M. (1974). The relationship between variable selection and prediction. Technometrics, 16.
Allison, T. and Cicchetti, D. V. (1976). Sleep in mammals: Ecological and constitutional correlates. Science, 194.
Anscombe, F. (1973). Graphs in statistical analysis. Am. Stat., 27.
Atkinson, A. C. (1985). Plots, Transformations and Regression. Oxford: Oxford University Press.
Baes, C. and Kellogg, H. (1953). Effects of dissolved sulphur on the surface tension of liquid copper. J. Metals, 5.
Barnett, V. and Lewis, T. (2004). Outliers in Statistical Data, Third Edition. Chichester: Wiley.
Bates, D. and Watts, D. (1988). Relative curvature measures of nonlinearity (with discussion). J. R. Stat. Soc., Ser. B, 22.
Beckman, R. and Cook, R. D. (1983). Outliers. Technometrics, 25.
Bland, J. (1978). A comparison of certain aspects of ontogeny in the long and short shoots of McIntosh apple during one annual growth cycle. Unpublished Ph.D. dissertation, University of Minnesota, St. Paul.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variates. New York: Wiley.
Bowman, A. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis. Oxford: Oxford University Press.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. J. R. Stat. Soc., Ser. B, 26.
Brillinger, D. (1983). A generalized linear model with Gaussian regression variables. In Bickel, P. J., Doksum, K. A., and Hodges Jr., J. L., eds., A Festschrift for Erich L. Lehmann. New York: Chapman & Hall.
Brown, P. (1994). Measurement, Regression and Calibration. Oxford: Oxford University Press.

Applied Linear Regression, Third Edition, by Sanford Weisberg. Copyright 2005 John Wiley & Sons, Inc.

Burt, C. (1966). The genetic determination of differences in intelligence: A study of monozygotic twins reared together and apart. Br. J. Psychol., 57.
Casella, G. and Berger, R. (1990). Statistical Inference. Pacific Grove: Wadsworth & Brooks-Cole.
Chen, C. F. (1983). Score tests for regression models. J. Am. Stat. Assoc., 78.
Clapham, A. W. (1934). English Romanesque Architecture After the Conquest. Oxford: Clarendon Press.
Clark, R., Henderson, H. V., Hoggard, G. K., Ellison, R., and Young, B. (1987). The ability of biochemical and haematological tests to predict recovery in periparturient recumbent cows. N. Z. Vet. J., 35.
Clausius, R. (1850). Über die bewegende Kraft der Wärme und die Gesetze welche sich daraus für die Wärmelehre selbst ableiten lassen. Annalen der Physik, 79.
Cleveland, W. (1979). Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc., 74.
Collett, D. (2002). Modelling Binary Data, Second Edition. Boca Raton: CRC Press.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19.
Cook, R. D. (1979). Influential observations in linear regression. J. Am. Stat. Assoc., 74.
Cook, R. D. (1986). Assessment of local influence (with discussion). J. R. Stat. Soc., Ser. B, 48.
Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions through Graphics. New York: Wiley.
Cook, R. D. and Jacobson, J. (1978). Analysis of 1977 West Hudson Bay snow goose surveys. Unpublished report, Canadian Wildlife Services.
Cook, R. D. and Prescott, P. (1981). Approximate significance levels for detecting outliers in linear regression. Technometrics, 23.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman & Hall.
Cook, R. D. and Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika, 70.
Cook, R. D. and Weisberg, S. (1994). Transforming a response variable for linearity. Biometrika, 81.
Cook, R. D. and Weisberg, S. (1997).
Graphics for assessing the adequacy of regression models. J. Am. Stat. Assoc., 92.
Cook, R. D. and Weisberg, S. (1999a). Applied Regression Including Computing and Graphics. New York: Wiley.
Cook, R. D. and Weisberg, S. (1999b). Graphs in statistical analysis: Is the medium the message? Am. Stat., 53.
Cook, R. D. and Weisberg, S. (2004). Partial one-dimensional regression models. Am. Stat., 58.
Cook, R. D. and Witmer, J. (1985). A note on parameter-effects curvature. J. Am. Stat. Assoc., 80.
Cox, D. R. (1958). The Planning of Experiments. New York: Wiley.
Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data. London: Chapman & Hall.

Cunningham, R. and Heathcote, C. (1989). Estimating a non-Gaussian regression model with multicollinearity. Australian Journal of Statistics, 31.
Dalziel, C., Lagen, J., and Thurston, J. (1941). Electric shocks. Transactions of the IEEE, 60.
Daniel, C. and Wood, F. (1980). Fitting Equations to Data, Second Edition. New York: Wiley.
Davison, A. and Hinkley, D. (1997). Bootstrap Methods and their Application. Cambridge: Cambridge University Press.
Dawson, R. (1995). The unusual episode data revisited. Journal of Statistics Education, 3, an electronic journal.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. J. R. Stat. Soc., Ser. B, 39.
Derrick, A. (1992). Development of the measure-correlate-predict strategy for site assessment. Proceedings of the 14th BWEA Conference, Nottingham.
Diggle, P., Heagerty, P., Liang, K. Y., and Zeger, S. (2002). Analysis of Longitudinal Data, Second Edition. Oxford: Oxford University Press.
Dodson, S. (1992). Predicting crustacean zooplankton species richness. Limnology and Oceanography, 37.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Stat., 7.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton: Chapman & Hall.
Ezekiel, M. and Fox, K. A. (1959). Methods of Correlation Analysis, Linear and Curvilinear. New York: Wiley.
Finkelstein, M. (1980). The judicial reception of multiple regression studies in race and sex discrimination cases. Columbia Law Review, 80.
Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A Practical Approach. London: Chapman & Hall.
Forbes, J. (1857). Further experiments and remarks on the measurement of heights by boiling point of water. Trans. R. Soc. Edinburgh, 21.
Freedman, D. (1983). A note on screening regression equations. Am. Stat., 37.
Freeman, M. and Tukey, J. (1950). Transformations related to the angular and the square root. Ann. Math. Stat., 21.
Fuller, W. (1987). Measurement Error Models.
New York: Wiley.
Furnival, G. and Wilson, R. (1974). Regression by leaps and bounds. Technometrics, 16.
Gauss, C. (1821). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae (Theory of the combination of observations which leads to the smallest errors). Werke, 4.
Geisser, S. and Eddy, W. F. (1979). A predictive approach to model selection. J. Am. Stat. Assoc., 74.
Gnanadesikan, R. (1997). Methods for Statistical Analysis of Multivariate Data, Second Edition. New York: Wiley.
Golub, G. and Van Loan, C. (1996). Matrix Computations, Third Edition. Baltimore: Johns Hopkins.
Gould, S. J. (1966). Allometry and size in ontogeny and phylogeny. Biol. Rev., 41.

Gould, S. J. (1973). The shape of things to come. Syst. Zool., 22.
Graybill, F. (1969). Introduction to Matrices with Statistical Applications. Belmont, CA: Wadsworth.
Green, P. and Silverman, B. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.
Haddon, M. (2001). Modelling and Quantitative Methods in Fisheries. Boca Raton: Chapman & Hall.
Hahn, M., ed. (1979). Development and Evolution of Brain Size. New York: Academic Press.
Hald, A. (1960). Statistical Theory with Engineering Applications. New York: Wiley.
Hall, P. and Li, K. C. (1993). On almost linearity of low dimensional projections from high dimensional data. Ann. Stat., 21.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. New York: Springer.
Hawkins, D. M. (1980). Identification of Outliers. London: Chapman & Hall.
Hawkins, D. M., Bradu, D., and Kass, G. (1984). Location of several outliers in multiple regression using elemental sets. Technometrics, 26.
Henderson, H. V. and Searle, S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Rev., 23.
Hernandez, F. and Johnson, R. A. (1980). The large sample behavior of transformations to normality. J. Am. Stat. Assoc., 75.
Hinkley, D. (1985). Transformation diagnostics for linear models. Biometrika, 72.
Hoaglin, D. C. and Welsch, R. (1978). The hat matrix in regression and ANOVA. Am. Stat., 32.
Hosmer, D. and Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.
Huber, P. (1981). Robust Statistics. New York: Wiley.
Ibrahim, J., Lipsitz, S., and Horton, N. (2001). Using auxiliary data for parameter estimation with non-ignorably missing outcomes. Applied Statistics, 50.
Jevons, W. S. (1868). On the condition of the gold coinage of the United Kingdom, with reference to the question of international currency. J. [R.] Stat. Soc., 31.
Johnson, K. (1996).
Unfortunate Emigrants: Narratives of the Donner Party. Logan: Utah State University Press.
Johnson, M. P. and Raven, P. H. (1973). Species number and endemism: The Galápagos Archipelago revisited. Science, 179.
Joiner, B. (1981). Lurking variables: Some examples. Am. Stat., 35.
Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.
Kennedy, W. and Bancroft, T. (1971). Model building for prediction in regression based on repeated significance tests. Ann. Math. Stat., 42.
LeBeau, M. (2004). Evaluation of the intraspecific effects of a 15-inch minimum size limit on walleye populations in Northern Wisconsin. Unpublished Ph.D. dissertation, University of Minnesota.

Li, K. C. and Duan, N. (1989). Regression analysis under link violation. Ann. Stat., 17.
Lindgren, B. L. (1993). Statistical Theory, Fourth Edition. New York: Macmillan.
Littell, R., Milliken, G., Stroup, W., and Wolfinger, R. (1996). SAS System for Mixed Models. Cary: SAS Institute.
Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. New York: Wiley.
Loader, C. (2004). Smoothing: Local regression techniques. In Gentle, J., Härdle, W., and Mori, Y., eds., Handbook of Computational Statistics. New York: Springer-Verlag.
Longley, J. (1967). An appraisal of least squares programs for the electronic computer from the point of view of the user. J. Am. Stat. Assoc., 62.
Mallows, C. (1973). Some comments on Cp. Technometrics, 15.
Mantel, N. (1970). Why stepdown procedures in variable selection? Technometrics, 12.
Marquardt, D. W. (1970). Generalized inverses, ridge regression and biased linear estimation. Technometrics, 12.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. London: Chapman & Hall.
Miller, R. (1981). Simultaneous Inference, Second Edition. New York: Springer.
Moore, J. A. (1975). Total biomedical oxygen demand of animal manures. Unpublished Ph.D. dissertation, University of Minnesota.
Mosteller, F. and Wallace, D. (1964). Inference and Disputed Authorship: The Federalist. Reading: Addison-Wesley.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135.
Noll, S., Weibel, P., Cook, R. D., and Witmer, J. (1984). Biopotency of methionine sources for young turkeys. Poultry Science, 63.
Oehlert, G. (2000). A First Course in the Design and Analysis of Experiments. New York: W. H. Freeman.
Pan, W., Connett, J., Porzio, G., and Weisberg, S. (2001). Graphical model checking using marginal model plots with longitudinal data. Statistics in Medicine, 20.
Pardoe, I. and Weisberg, S. (2001). An introduction to bootstrap methods using Arc. Unpublished report.
Parks, J. (1982).
A Theory of Feeding and Growth of Animals. New York: Springer.
Pearson, E. S. and Lee, S. (1903). On the laws of inheritance in man. Biometrika, 2.
Pearson, K. (1930). Life and Letters and Labours of Francis Galton, Vol. IIIa. Cambridge: Cambridge University Press.
Pinheiro, J. and Bates, D. (2000). Mixed-Effects Models in S and S-plus. New York: Springer.
Porzio, G. (2002). A simulated band to check binary regression models. Metron, 60.
Ratkowsky, D. A. (1990). Handbook of Nonlinear Regression Models. New York: Marcel Dekker.
Rencher, A. and Pun, F. (1980). Inflation of R² in best subset regression. Technometrics, 22.
Royston, J. P. (1982a). An extension of Shapiro and Wilk's W test for normality to large samples. Appl. Stat., 31.

Royston, J. P. (1982b). Expected normal order statistics (exact and approximate), Algorithm AS 177. Appl. Stat., 31.
Royston, J. P. (1982c). The W test for normality, Algorithm AS 181. Appl. Stat., 31.
Royston, P. and Altman, D. (1994). Regression using fractional polynomials of continuous covariates: Parsimonious parametric modelling. Applied Statistics, 43.
Rubin, D. (1976). Inference and missing data. Biometrika, 63.
Ruppert, D., Wand, M., Holst, U., and Hössjer, O. (1997). Local polynomial variance-function estimation. Technometrics, 39.
Sakamoto, Y., Ishiguro, M., and Kitagawa, G. (1987). Akaike Information Criterion Statistics. Dordrecht: Reidel.
Saw, J. (1966). A conservative test for concurrence of several regression lines and related problems. Biometrika, 53.
Schafer, J. (1997). Analysis of Incomplete Multivariate Data. Boca Raton: Chapman & Hall/CRC.
Scheffé, H. (1959). The Analysis of Variance. New York: Wiley.
Schott, J. (1996). Matrix Analysis for Statistics. New York: Wiley.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6.
Sclove, S. (1968). Improved estimators for coefficients in linear regression. J. Am. Stat. Assoc., 63.
Searle, S. R. (1971). Linear Models. New York: Wiley.
Searle, S. R. (1982). Matrix Algebra Useful for Statistics. New York: Wiley.
Seber, G. A. F. (1977). Linear Regression Analysis. New York: Wiley.
Seber, G. A. F. (2002). The Estimation of Animal Abundance. Caldwell, NJ: Blackburn Press.
Seber, G. and Wild, C. (1989). Nonlinear Regression. New York: Wiley.
Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52.
Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Boca Raton: CRC/Chapman & Hall.
Simonoff, J. (1996). Smoothing Methods in Statistics. New York: Springer-Verlag.
Staudte, R. and Sheather, S. (1990). Robust Estimation and Testing. New York: Wiley.
Taff, S., Tiffany, D., and Weisberg, S. (1996).
Measured effects of feedlots on residential property values in Minnesota: A report to the legislature. Technical report, Department of Applied Economics, University of Minnesota, agecon.lib.umn.edu/mn/p96-12.pdf.
Tang, G., Little, R., and Raghunathan, T. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90.
Thisted, R. (1988). Elements of Statistical Computing. New York: Chapman & Hall.
Tuddenham, R. and Snyder, M. (1954). Physical growth of California boys and girls from birth to age 18. Calif. Publ. Child Dev., 1.
Tukey, J. (1949). One degree of freedom for nonadditivity. Biometrics, 5.
Velilla, S. (1993). A note on the multivariate Box-Cox transformation to normality. Stat. Probab. Lett., 17.
Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag.

von Bertalanffy, L. (1938). A quantitative theory of organic growth. Human Biology, 10.
Weisberg, H., Beier, H., Brody, H., Patton, R., Raychaudhari, K., Takeda, H., Thern, R., and Van Berg, R. (1978). s-dependence of proton fragmentation by hadrons. II. Incident laboratory momenta, GeV/c. Phys. Rev. D, 17.
Wilm, H. (1950). Statistical control in hydrologic forecasting. Research Notes, 71, Pacific Northwest Forest and Range Experiment Station, Oregon.
Woodley, W. L., Simpson, J., Biondini, R., and Berkeley, J. (1977). Rainfall results: Florida area cumulus experiment. Science, 195.
Yeo, I. and Johnson, R. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87.
Zipf, G. (1949). Human Behavior and the Principle of Least-Effort. Cambridge: Addison-Wesley.


Examples (continued): Electric shocks, 267; Feedlots, 78; Florida election 2000, 207; Forbes, 4, 22-24, 29, 33-34, 37, 39, 138; Ft. Collins snowfall, 7, 33, 44; Fuel consumption, 15, 52, 69, 93, 166, 173, 206; Galápagos species, 231; Galton's peas, 110; Government salary, 163; Heights, 2, 19, 23, 34-36, 41, 51, 205; Highway accidents, 153, 159, 218, 222, 230, 232, 250; Hooker, 40, 138; Jevons coins, 114, 143; Lake Mary, 248; Land productivity, 140; Land rent, 208; Lathe, 137; Lathe data, 207; Longley, 94; Mantel, 230; Metrodome, 145; Mitchell, 17; Northern Pike, 90; Old Faithful, 18, 44; Oxygen uptake, 230; Physics data, 116; Rat, 200, 203; Segmented regression, 244; Sex discrimination, 141; Sleep, 86, 122, 126, 138, 248; Smallmouth bass, 6, 17, 41; Snake River levels, 42; Sniffer, 182; Snow geese, 113, 181; Stopping distances, 162; Strong interaction, 98, 114; Surface tension, 161; Titanic, 262; Transactions, 88, 143, 205; Turkey growth, 8, 233, 238, 247; Twin study, 138; United Nations, 18, 41, 47, 51, 66, 166, 170, 176, 187, 198; Upper Flat Creek, 151, 112; Walleye growth, 249; Windmills, 45, 93, 140, 226, 232, 269; Wool data, 130, 165; World cities, 164; Zipf's Law, 43; Zooplankton species, 193
Expected information matrix, 291
Expected order statistics, 205
Experiment, 77
Exponential family distribution, 265
F-test, 28; overall anova, 30; partial, 63; power, 107; robustness, 108
Factor, 52
Factor rule, 123
Fisher scoring, 265
Fitted value, 21, 35, 57, 65
Fixed significance level, 31
Forward selection, 221
Full rank matrix, 283
Gauss-Markov theorem, 27
Gauss-Newton algorithm, 236
Generalized least squares, 100, 178
Generalized linear models, 100, 178, 265
Geometric mean, 153
Goodness of fit, 261
Hat matrix
Hat notation, 21
Hessian matrix, 235
Histogram, 251
Hyperplane, 51
Independent, 2, 19
Information matrix, 291
Influence, 167, 198
Inheritance, 2
Interaction, 52, 117
Intercept, 9, 19, 51
Intra-class correlation, 136
Inverse fitted value plot, 152, 159, 165
Inverse regression, 143
Jittering, 2
Kernel mean function, 234, 254; logistic function, 254
Lack of fit: F test, 103; for nonlinear models, 241; nonparametric, 111; sum of squares, 103; variance known, 100; variance unknown, 102
Lagged variables, 226
Lapack, 287

Leaps and bounds, 221
Least squares: nonlinear, 234; ordinary, 6; weighted, 96
Leverage, 4, 169, 196
Li-Duan theorem, 156
Likelihood function, 263
Likelihood ratio test, 108, 158, 260
Linear: dependence, 73, 214, 283; independence, 73, 283; mixed model, 136; operator, 270; predictor, 156, 254; regression, 1
Link function, 254
Local influence, 204
loess, 14-15, 149, 181, 185, 187
Logarithms, 70; and coefficient estimates, 76; choice of base, 23; log rule, 150
Logistic function, 254
Logistic regression, 100; log-likelihood, 264; model comparisons, 261
Logit, 254
Lurking variable, 79
Machine learning, 211
Main effects, 130
Mallows Cp, 218
Marginal model plot
Matrix: diagonal, 279; full rank, 282; identity, 279; inverse, 282; invertible, 282; nonsingular, 282; norm, 281; notation, 54; orthogonal, 282; orthogonal columns, 282; singular, 282; square, 279; symmetric, 279
Maximum likelihood, 27
Mean function, 9-11
Mean square, 25
Measure, correlate, predict, 140
Measurement error, 90
Median, 87
Missing: at random, 85; data, 84; multiple imputation, 85; values in data files, 270
Modified power family, 153
Multiple correlation coefficient, 62
Multiple linear regression, 47
Multivariate normal, 80
Multivariate transformations, 157
Nearest neighbor, 276
Newton-Raphson, 265
NID, 27
Noncentral χ², 108
Noncentral F, 31
Nonconstant variance, 177, 180
Nonlinear least squares, 152, 234
Nonlinear regression: comparing groups, 241; large-sample inference, 237; starting values, 237
Nonparametric, 10
Nonparametric regression, 277
Normal distributions, 80
Normal equations, 273, 284
Normality, 20, 58, 204; large sample, 27
Normal probability plot, 204
Null plot, 13, 36
Observational study, 69, 77
Odds of success, 254
OLS, see Ordinary least squares
One-dimensional estimation result, 156
One-way analysis of variance, 124
Order statistics, 205
Ordinary least squares, 6-7, 10, 21-28, 116, 144, 231, 234, 240, 288; same as maximum likelihood estimate, 288
Orthogonal, 192; polynomials, 116; projection, 169
Outlier, 4, 36; mean shift, 194
Over-parameterized, 73
Overplotting, 2
p, 59
p-value, 268; interpreting, 31

Parameters, 9, 21; interpreting, 69; not the same as estimates, 21, 24; terms in log scale, 70
Parametric bootstrap, 112
Partial correlation, 221
Partial one-dimensional, 131, 144, 250
Pearson residual, 171
Pearson's X², 262
POD model, 144, 250
Polynomial regression, 115
Power, 31, 108
Power family, 148
Power transformations, 14
Predicted residual, 207, 220
Prediction, 34, 65
Predictor, 1, 51
PRESS, 220; residual, 207
Probability mass function, 253
Profile log-likelihood, 291
Pure error, 103, 242
Quadratic regression, 115; estimated minimum or maximum, 115
R², 31-32, 62, 81; interpretation, 83
Random coefficients model, 135
Random intercepts model, 136
Random sampling, 81
Random vector, 283
Range rule, 150
Rectangles, 74
Regression, 1, 10; binomial, 251; multiple linear, 47; nonlinear, 233; sum of squares, 58; through the origin, 42, 84
Removable nonadditivity, 166
Residual, 21, 23, 36, 57; degrees of freedom, 25; mean square, 25; Pearson, 171; properties, 168; standardized, 195; studentized, 196; sum of squares, 21, 24, 57; sum of squares function, 234; weighted
Residual plots, 171; curvature, 176; Tukey's test, 176
Response, 1
Sample correlations, 54
Sample covariance matrix, 57
Sample order statistics, 204
Saturated model, 262
Scalar, 279
Scaled power transformation, 233
Scatterplot, 1
Scatterplot matrix
Score test, 180; nonconstant variance, 180
Score vector, 235
se, 27
sefit, 35
sepred, 34
Second-order mean function, 117
Segmented regression, 244
Separated points, 3
Significance level, fixed, 31
Simple regression model, 19; deviations form, 41; matrix version, 58
Slices, 3
Slope, 9, 19
Smoother, 10, 14, 275
Smoothing parameter, 276
Standard error, 27; of fitted value, 2; of prediction, 2; of regression, 25
Standardized residual, 195
Statistical error, 19
Stepwise methods: backward elimination, 222; forward selection, 221
Straight lines, 19
Studentized residual, 196
Studentized statistic, 196
Sum of squares: due to regression, 29; for lack of fit, 103; regression, 58
Summary graph, 1-2, 11-12, 175; multiple regression, 84
Supernormality, 204
Taylor series, 120, 235
Terms, 47
Three-dimensional plot, 48

Transformations, 52; arcsine square-root, 179; Box-Cox, 153; families, 148; for linearity, 233; log rule, 150; many predictors; modified power, 153; multivariate Box-Cox, 157; nonpositive variables, 160; power family, 148; predictor, 150; range rule, 150; response, 159; scaled, 150; using a nonlinear model, 233; Yeo-Johnson, 160
t-tests, 32
Tukey's test for nonadditivity, 176
Unbiased, 271
Uncorrelated, 8
Variability explained, 83
Variable selection methods, 211
Variance estimate, pure error, 103
Variance function, 11, 96
Variance inflation factor, 216
Variance stabilizing transformation, 100, 177
Vector, 279; length, 281
Web site
Weighted least squares, 96-99, 114, 234; outliers, 196
Weighted residual, 171
Wind farm, 45
WLS, see Weighted least squares
Yeo-Johnson transformation family, 160, 290
Zipf's law, 43

323 Applied Linear Regression, Third Edition, by Sanord Weisberg ISBN Copyright 2005 John Wiley & Sons, Inc.

324

325

326

327

328

329

I. Understand get a conceptual grasp of the problem

I. Understand get a conceptual grasp of the problem MASSACHUSETTS INSTITUTE OF TECHNOLOGY Departent o Physics Physics 81T Fall Ter 4 Class Proble 1: Solution Proble 1 A car is driving at a constant but unknown velocity,, on a straightaway A otorcycle is

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Topic 5a Introduction to Curve Fitting & Linear Regression

Topic 5a Introduction to Curve Fitting & Linear Regression /7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline

More information

A method to determine relative stroke detection efficiencies from multiplicity distributions

A method to determine relative stroke detection efficiencies from multiplicity distributions A ethod to deterine relative stroke detection eiciencies ro ultiplicity distributions Schulz W. and Cuins K. 2. Austrian Lightning Detection and Inoration Syste (ALDIS), Kahlenberger Str.2A, 90 Vienna,

More information

Applied Regression Modeling

Applied Regression Modeling Applied Regression Modeling Applied Regression Modeling A Business Approach Iain Pardoe University of Oregon Charles H. Lundquist College of Business Eugene, Oregon WILEY- INTERSCIENCE A JOHN WILEY &

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

OBJECTIVES INTRODUCTION

OBJECTIVES INTRODUCTION M7 Chapter 3 Section 1 OBJECTIVES Suarize data using easures of central tendency, such as the ean, edian, ode, and idrange. Describe data using the easures of variation, such as the range, variance, and

More information

Finite fields. and we ve used it in various examples and homework problems. In these notes I will introduce more finite fields

Finite fields. and we ve used it in various examples and homework problems. In these notes I will introduce more finite fields Finite fields I talked in class about the field with two eleents F 2 = {, } and we ve used it in various eaples and hoework probles. In these notes I will introduce ore finite fields F p = {,,...,p } for

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Sexually Transmitted Diseases VMED 5180 September 27, 2016

Sexually Transmitted Diseases VMED 5180 September 27, 2016 Sexually Transitted Diseases VMED 518 Septeber 27, 216 Introduction Two sexually-transitted disease (STD) odels are presented below. The irst is a susceptibleinectious-susceptible (SIS) odel (Figure 1)

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Kinematics and dynamics, a computational approach

Kinematics and dynamics, a computational approach Kineatics and dynaics, a coputational approach We begin the discussion of nuerical approaches to echanics with the definition for the velocity r r ( t t) r ( t) v( t) li li or r( t t) r( t) v( t) t for

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

McGill University. Faculty of Science. Department of Mathematics and Statistics. Part A Examination. Statistics: Methodology Paper

McGill University. Faculty of Science. Department of Mathematics and Statistics. Part A Examination. Statistics: Methodology Paper cgill University aculty o Science Departent o atheatics and Statistics Part A Exaination Statistics: ethodology Paper Date: 17th August 2018 Instructions Tie: 1p-5p Answer only two questions ro. I you

More information

Measures of average are called measures of central tendency and include the mean, median, mode, and midrange.

Measures of average are called measures of central tendency and include the mean, median, mode, and midrange. CHAPTER 3 Data Description Objectives Suarize data using easures of central tendency, such as the ean, edian, ode, and idrange. Describe data using the easures of variation, such as the range, variance,

More information

Principal Components Analysis

Principal Components Analysis Principal Coponents Analysis Cheng Li, Bingyu Wang Noveber 3, 204 What s PCA Principal coponent analysis (PCA) is a statistical procedure that uses an orthogonal transforation to convert a set of observations

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Chapter 4 FORCES AND NEWTON S LAWS OF MOTION PREVIEW QUICK REFERENCE. Important Terms

Chapter 4 FORCES AND NEWTON S LAWS OF MOTION PREVIEW QUICK REFERENCE. Important Terms Chapter 4 FORCES AND NEWTON S LAWS OF MOTION PREVIEW Dynaics is the study o the causes o otion, in particular, orces. A orce is a push or a pull. We arrange our knowledge o orces into three laws orulated

More information

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007 Deflation of the I-O Series 1959-2. Soe Technical Aspects Giorgio Rapa University of Genoa g.rapa@unige.it April 27 1. Introduction The nuber of sectors is 42 for the period 1965-2 and 38 for the initial

More information

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers Ocean 40 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers 1. Hydrostatic Balance a) Set all of the levels on one of the coluns to the lowest possible density.

More information

Chaotic Coupled Map Lattices

Chaotic Coupled Map Lattices Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each

More information

The Distribution of the Covariance Matrix for a Subset of Elliptical Distributions with Extension to Two Kurtosis Parameters

The Distribution of the Covariance Matrix for a Subset of Elliptical Distributions with Extension to Two Kurtosis Parameters journal of ultivariate analysis 58, 96106 (1996) article no. 0041 The Distribution of the Covariance Matrix for a Subset of Elliptical Distributions with Extension to Two Kurtosis Paraeters H. S. Steyn

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

STATISTICAL ANALYSIS WITH MISSING DATA

STATISTICAL ANALYSIS WITH MISSING DATA STATISTICAL ANALYSIS WITH MISSING DATA SECOND EDITION Roderick J.A. Little & Donald B. Rubin WILEY SERIES IN PROBABILITY AND STATISTICS Statistical Analysis with Missing Data Second Edition WILEY SERIES

More information

Research in Area of Longevity of Sylphon Scraies

Research in Area of Longevity of Sylphon Scraies IOP Conference Series: Earth and Environental Science PAPER OPEN ACCESS Research in Area of Longevity of Sylphon Scraies To cite this article: Natalia Y Golovina and Svetlana Y Krivosheeva 2018 IOP Conf.

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

Ph 20.3 Numerical Solution of Ordinary Differential Equations

Ph 20.3 Numerical Solution of Ordinary Differential Equations Ph 20.3 Nuerical Solution of Ordinary Differential Equations Due: Week 5 -v20170314- This Assignent So far, your assignents have tried to failiarize you with the hardware and software in the Physics Coputing

More information

Chapter 6: Economic Inequality

Chapter 6: Economic Inequality Chapter 6: Econoic Inequality We are interested in inequality ainly for two reasons: First, there are philosophical and ethical grounds for aversion to inequality per se. Second, even if we are not interested

More information

AN INTRODUCTION TO PROBABILITY AND STATISTICS

AN INTRODUCTION TO PROBABILITY AND STATISTICS AN INTRODUCTION TO PROBABILITY AND STATISTICS WILEY SERIES IN PROBABILITY AND STATISTICS Established by WALTER A. SHEWHART and SAMUEL S. WILKS Editors: David J. Balding, Noel A. C. Cressie, Garrett M.

More information

Sampling How Big a Sample?

Sampling How Big a Sample? C. G. G. Aitken, 1 Ph.D. Sapling How Big a Saple? REFERENCE: Aitken CGG. Sapling how big a saple? J Forensic Sci 1999;44(4):750 760. ABSTRACT: It is thought that, in a consignent of discrete units, a certain

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Measuring Temperature with a Silicon Diode

Measuring Temperature with a Silicon Diode Measuring Teperature with a Silicon Diode Due to the high sensitivity, nearly linear response, and easy availability, we will use a 1N4148 diode for the teperature transducer in our easureents 10 Analysis

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Meta-Analytic Interval Estimation for Bivariate Correlations

Meta-Analytic Interval Estimation for Bivariate Correlations Psychological Methods 2008, Vol. 13, No. 3, 173 181 Copyright 2008 by the Aerican Psychological Association 1082-989X/08/$12.00 DOI: 10.1037/a0012868 Meta-Analytic Interval Estiation for Bivariate Correlations

More information

PHY 171. Lecture 14. (February 16, 2012)

PHY 171. Lecture 14. (February 16, 2012) PHY 171 Lecture 14 (February 16, 212) In the last lecture, we looked at a quantitative connection between acroscopic and icroscopic quantities by deriving an expression for pressure based on the assuptions

More information

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area Proceedings of the 006 WSEAS/IASME International Conference on Heat and Mass Transfer, Miai, Florida, USA, January 18-0, 006 (pp13-18) Spine Fin Efficiency A Three Sided Pyraidal Fin of Equilateral Triangular

More information

USEFUL HINTS FOR SOLVING PHYSICS OLYMPIAD PROBLEMS. By: Ian Blokland, Augustana Campus, University of Alberta

USEFUL HINTS FOR SOLVING PHYSICS OLYMPIAD PROBLEMS. By: Ian Blokland, Augustana Campus, University of Alberta 1 USEFUL HINTS FOR SOLVING PHYSICS OLYMPIAD PROBLEMS By: Ian Bloland, Augustana Capus, University of Alberta For: Physics Olypiad Weeend, April 6, 008, UofA Introduction: Physicists often attept to solve

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

A practical approach to real-time application of speaker recognition using wavelets and linear algebra

A practical approach to real-time application of speaker recognition using wavelets and linear algebra A practical approach to real-tie application o speaker recognition using wavelets and linear algebra Duc Son Pha, Michael C. Orr, Brian Lithgow and Robert Mahony Departent o Electrical and Coputer Systes

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Tactics Box 2.1 Interpreting Position-versus-Time Graphs

Tactics Box 2.1 Interpreting Position-versus-Time Graphs 1D kineatic Retake Assignent Due: 4:32p on Friday, October 31, 2014 You will receive no credit for ites you coplete after the assignent is due. Grading Policy Tactics Box 2.1 Interpreting Position-versus-Tie

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

arxiv: v1 [stat.ot] 7 Jul 2010

arxiv: v1 [stat.ot] 7 Jul 2010 Hotelling s test for highly correlated data P. Bubeliny e-ail: bubeliny@karlin.ff.cuni.cz Charles University, Faculty of Matheatics and Physics, KPMS, Sokolovska 83, Prague, Czech Republic, 8675. arxiv:007.094v

More information

6.2 Grid Search of Chi-Square Space

6.2 Grid Search of Chi-Square Space 6.2 Grid Search of Chi-Square Space exaple data fro a Gaussian-shaped peak are given and plotted initial coefficient guesses are ade the basic grid search strateg is outlined an actual anual search is

More information

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Tenerife, Canary Islands, Spain, Deceber 16-18, 2006 183 Qualitative Modelling of Tie Series Using Self-Organizing Maps:

More information

ANALYSIS OF HALL-EFFECT THRUSTERS AND ION ENGINES FOR EARTH-TO-MOON TRANSFER

ANALYSIS OF HALL-EFFECT THRUSTERS AND ION ENGINES FOR EARTH-TO-MOON TRANSFER IEPC 003-0034 ANALYSIS OF HALL-EFFECT THRUSTERS AND ION ENGINES FOR EARTH-TO-MOON TRANSFER A. Bober, M. Guelan Asher Space Research Institute, Technion-Israel Institute of Technology, 3000 Haifa, Israel

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Kinetic Theory of Gases: Elementary Ideas

Kinetic Theory of Gases: Elementary Ideas Kinetic Theory of Gases: Eleentary Ideas 17th February 2010 1 Kinetic Theory: A Discussion Based on a Siplified iew of the Motion of Gases 1.1 Pressure: Consul Engel and Reid Ch. 33.1) for a discussion

More information

Kinetic Theory of Gases: Elementary Ideas

Kinetic Theory of Gases: Elementary Ideas Kinetic Theory of Gases: Eleentary Ideas 9th February 011 1 Kinetic Theory: A Discussion Based on a Siplified iew of the Motion of Gases 1.1 Pressure: Consul Engel and Reid Ch. 33.1) for a discussion of

More information

Optimization of Flywheel Weight using Genetic Algorithm

Optimization of Flywheel Weight using Genetic Algorithm IN: 78 7798 International Journal o cience, Engineering and Technology esearch (IJET) Volue, Issue, March 0 Optiization o Flywheel Weight using Genetic Algorith A. G. Appala Naidu*, B. T.N. Charyulu, C..C.V.aanaurthy

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

DECONVOLUTION VERSUS CONVOLUTION A COMPARISON FOR MATERIALS WITH CONCENTRATION GRADIENT

DECONVOLUTION VERSUS CONVOLUTION A COMPARISON FOR MATERIALS WITH CONCENTRATION GRADIENT Materials Structure, vol. 7, nuber () 43 DECONVOLUTION VERSUS CONVOLUTION A COMPARISON FOR MATERIALS WITH CONCENTRATION GRADIENT David Raaa Departent o Electronic Structures, Faculty o Matheatics and Physics,

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

UNCERTAINTIES IN THE APPLICATION OF ATMOSPHERIC AND ALTITUDE CORRECTIONS AS RECOMMENDED IN IEC STANDARDS

UNCERTAINTIES IN THE APPLICATION OF ATMOSPHERIC AND ALTITUDE CORRECTIONS AS RECOMMENDED IN IEC STANDARDS Paper Published on the16th International Syposiu on High Voltage Engineering, Cape Town, South Africa, 2009 UNCERTAINTIES IN THE APPLICATION OF ATMOSPHERIC AND ALTITUDE CORRECTIONS AS RECOMMENDED IN IEC

More information

Note-A-Rific: Mechanical

Note-A-Rific: Mechanical Note-A-Rific: Mechanical Kinetic You ve probably heard of inetic energy in previous courses using the following definition and forula Any object that is oving has inetic energy. E ½ v 2 E inetic energy

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information
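One standard answer to the dependence problem raised above is the moving-block bootstrap: resample contiguous blocks of the series rather than individual observations, so short-range dependence survives inside each block. A hedged sketch with my own function names (not the excerpt's):

```python
import random

def moving_block_bootstrap(x, block_len, rng=None):
    """Return one bootstrap replicate of the series x built from
    overlapping contiguous blocks of length block_len."""
    rng = rng or random.Random()
    n = len(x)
    # All overlapping blocks x[i:i+block_len].
    blocks = [x[i:i + block_len] for i in range(n - block_len + 1)]
    sample = []
    while len(sample) < n:
        sample.extend(rng.choice(blocks))  # draw blocks with replacement
    return sample[:n]                      # trim to the original length

replicate = moving_block_bootstrap(list(range(10)), block_len=3,
                                   rng=random.Random(0))
```

The i.i.d. bootstrap is the special case block_len = 1; longer blocks preserve more of the serial dependence at the cost of fewer distinct blocks.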

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

National 5 Summary Notes

North Berwick High School, Department of Physics. National 5 Summary Notes, Unit 3: Energy. National 5 Physics: Electricity and Energy. Throughout the Course, appropriate attention should be given to units, prefixes

More information

Data-Driven Imaging in Anisotropic Media

Data-Driven Imaging in Anisotropic Media 18 th World Conference on Non destructive Testing, 16- April 1, Durban, South Africa Data-Driven Iaging in Anisotropic Media Arno VOLKER 1 and Alan HUNTER 1 TNO Stieltjesweg 1, 6 AD, Delft, The Netherlands

More information

PhysicsAndMathsTutor.com

PhysicsAndMathsTutor.com . A raindrop falls vertically under gravity through a cloud. In a odel of the otion the raindrop is assued to be spherical at all ties and the cloud is assued to consist of stationary water particles.

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

General Properties of Radiation Detectors Supplements

General Properties of Radiation Detectors Supplements Phys. 649: Nuclear Techniques Physics Departent Yarouk University Chapter 4: General Properties of Radiation Detectors Suppleents Dr. Nidal M. Ershaidat Overview Phys. 649: Nuclear Techniques Physics Departent

More information

lecture 36: Linear Multistep Methods: Zero Stability

lecture 36: Linear Multistep Methods: Zero Stability. 5.6 Linear multistep methods: zero stability. Does consistency imply convergence for linear multistep methods? This is always the case for one-step methods,

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels. Guilhem Mollon, Daniel Dias, and Abdul-Hamid Soubra, M.ASCE. LGCIE, INSA Lyon, Université de Lyon, Domaine scientifique

More information

26 Impulse and Momentum

26 Impulse and Momentum 6 Ipulse and Moentu First, a Few More Words on Work and Energy, for Coparison Purposes Iagine a gigantic air hockey table with a whole bunch of pucks of various asses, none of which experiences any friction

More information

Biostatistics Department Technical Report

Biostatistics Department Technical Report Biostatistics Departent Technical Report BST006-00 Estiation of Prevalence by Pool Screening With Equal Sized Pools and a egative Binoial Sapling Model Charles R. Katholi, Ph.D. Eeritus Professor Departent

More information

Arithmetic Unit for Complex Number Processing

Arithmetic Unit for Complex Number Processing Abstract Arithetic Unit or Coplex Nuber Processing Dr. Soloon Khelnik, Dr. Sergey Selyutin, Alexandr Viduetsky, Inna Doubson, Seion Khelnik This paper presents developent o a coplex nuber arithetic unit

More information

CS 6347 Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

BASICS OF ANALYTICAL CHEMISTRY AND CHEMICAL EQUILIBRIA

BASICS OF ANALYTICAL CHEMISTRY AND CHEMICAL EQUILIBRIA. BRIAN M. TISSUE, Virginia Tech Department of Chemistry, Blacksburg, VA

More information

The Transactional Nature of Quantum Information

The Transactional Nature of Quantum Information. Subhash Kak, Department of Computer Science, Oklahoma State University, Stillwater, OK 74078. ABSTRACT: Information, in its communications sense, is a transactional property.

More information

Algebra / Trig Final Exam Study Guide

a a a a a a a m a b a b Algebra / Trig Final Exa Study Guide (Fall Seester) Moncada/Dunphy Inforation About the Final Exa The final exa is cuulative, covering Appendix A (A.1-A.5) and Chapter 1. All probles will be ultiple choice

More information

TEACH YOURSELF THE BASICS OF ASPEN PLUS

TEACH YOURSELF THE BASICS OF ASPEN PLUS. RALPH SCHEFFLAN, Chemical Engineering and Materials Science Department, Stevens Institute of Technology. A JOHN WILEY & SONS, INC., PUBLICATION

More information

© 2013 Society for Industrial and Applied Mathematics

c 2013 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 34, No. 3, pp. 1213 123 c 213 Society or Industrial and Applied Matheatics χ 2 TESTS FOR THE CHOICE OF THE REGULARIZATION PARAMETER IN NONLINEAR INVERSE PROBLEMS J. L. MEAD

More information

Modeling the Structural Shifts in Real Exchange Rate with Cubic Spline Regression (CSR). Turkey

Modeling the Structural Shifts in Real Exchange Rate with Cubic Spline Regression (CSR). Turkey International Journal of Business and Social Science Vol. 2 No. 17 www.ijbssnet.co Modeling the Structural Shifts in Real Exchange Rate with Cubic Spline Regression (CSR). Turkey 1987-2008 Dr. Bahar BERBEROĞLU

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

ma x = -bv x + F rod.

ma x = -bv x + F rod. Notes on Dynaical Systes Dynaics is the study of change. The priary ingredients of a dynaical syste are its state and its rule of change (also soeties called the dynaic). Dynaical systes can be continuous

More information

Statistical Methods. for Forecasting

Statistical Methods. for Forecasting Statistical Methods for Forecasting Statistical Methods for Forecasting BOVAS ABRAHAM JOHANNES LEDOLTER WILEY- INTERSCI ENCE A JOHN WILEY & SONS, INC., PUBLICA'TION Copyright 0 1983.2005 by John Wiley

More information

Estimation of the Mean of the Exponential Distribution Using Maximum Ranked Set Sampling with Unequal Samples

Estimation of the Mean of the Exponential Distribution Using Maximum Ranked Set Sampling with Unequal Samples Open Journal of Statistics, 4, 4, 64-649 Published Online Septeber 4 in SciRes http//wwwscirporg/ournal/os http//ddoiorg/436/os4486 Estiation of the Mean of the Eponential Distribution Using Maiu Ranked

More information

Discriminant Analysis and Statistical Pattern Recognition

Discriminant Analysis and Statistical Pattern Recognition. GEOFFREY J. McLACHLAN, The University of Queensland. A JOHN WILEY & SONS, INC., PUBLICATION

More information

The Euler-Maclaurin Formula and Sums of Powers

The Euler-Maclaurin Formula and Sums of Powers DRAFT VOL 79, NO 1, FEBRUARY 26 1 The Euler-Maclaurin Forula and Sus of Powers Michael Z Spivey University of Puget Sound Tacoa, WA 98416 spivey@upsedu Matheaticians have long been intrigued by the su

More information

Functions: Review of Algebra and Trigonometry

Functions: Review of Algebra and Trigonometry Sec. and. Functions: Review o Algebra and Trigonoetry A. Functions and Relations DEFN Relation: A set o ordered pairs. (,y) (doain, range) DEFN Function: A correspondence ro one set (the doain) to anther

More information

When Short Runs Beat Long Runs

Sean Luke, George Mason University, http://www.cs.gmu.edu/~sean/. Abstract: What will yield the best results: doing one run n generations long or doing m runs n/m generations long

More information

TABLE FOR UPPER PERCENTAGE POINTS OF THE LARGEST ROOT OF A DETERMINANTAL EQUATION WITH FIVE ROOTS. By William W. Chen

TABLE FOR UPPER PERCENTAGE POINTS OF THE LARGEST ROOT OF A DETERMINANTAL EQUATION WITH FIVE ROOTS. By William W. Chen. The distribution of the non-null characteristic roots of a matrix derived from sample observations

More information

Physically Based Modeling CS 15-863 Notes Spring 1997 Particle Collision and Contact

Physically Based Modeling CS Notes Spring 1997 Particle Collision and Contact Physically Based Modeling CS 15-863 Notes Spring 1997 Particle Collision and Contact 1 Collisions with Springs Suppose we wanted to ipleent a particle siulator with a floor : a solid horizontal plane which

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. James L. Crowley, ENSIMAG 3 - MMIS, Fall Semester 2016, Lesson 7, 14 Dec 2016. Outline: Artificial Neural Networks. Notation. 1. Introduction. The Artificial

More information