NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
COMMON MISTAKES IN STATISTICS: SPOTTING THEM AND AVOIDING THEM

Part IV: Mistakes Involving Regression; Dividing a Continuous Variable into Categories; Suggestions
May 24-27, 2010
Instructor: Martha K. Smith

CONTENTS OF PART IV

Common Mistakes in Regression Related to Model Assumptions
  A. Over-fitting
     Avoiding over-fitting
  B. Using confidence intervals when prediction intervals are needed
  C. Over-interpreting high R^2
  D. Mistakes in interpretation of coefficients
     1. Interpreting a coefficient as a rate of change in Y instead of as a rate of change in the conditional mean of Y
     2. Not taking confidence intervals for coefficients into account
     3. Interpreting a coefficient that is not statistically significant
     4. Interpreting coefficients in multiple regression with the same language used for a slope in simple linear regression
     5. Multiple inference on coefficients
  E. Mistakes in selecting terms
     1. Assuming linearity is preserved when variables are dropped
     2. Problems with stepwise model selection procedures
Dividing a Continuous Variable into Categories
Suggestions
  For researchers
     Planning research
     Analyzing data
     Writing up research
  For reading research involving statistics
  For reviewers, referees, editors, etc.
  For teachers
References

COMMON MISTAKES IN REGRESSION RELATED TO MODEL ASSUMPTIONS

A. Over-fitting
B. Using Confidence Intervals when Prediction Intervals Are Needed
C. Over-interpreting High R^2
D. Mistakes in Interpretation of Coefficients
E. Mistakes in Selecting Terms

A. OVER-FITTING

With four parameters I can fit an elephant and with five I can make him wiggle his trunk.
John von Neumann

If we have n distinct x values and corresponding y values for each, it is possible to find a curve going exactly through all n resulting points (x, y); this can be done by setting up a system of equations and solving simultaneously. But this is not what regression methods typically are designed to do. Most regression methods (e.g., least squares) estimate conditional means of the response variable given the explanatory variables. They are not expected to go through all the data points.

For example, with one explanatory variable X (e.g., height) and response variable Y (e.g., weight), if we fix a value x of X, we have a conditional distribution of Y given X = x (e.g., the conditional distribution of weight for people with height x). This conditional distribution has an expected value (population mean), which we will denote E(Y | X = x) (e.g., the mean weight of people with height x). This is the conditional mean of Y given X = x. It depends on x -- in other words, E(Y | X = x) is a mathematical function of x.

In least squares regression (and most other kinds of regression), one of the model assumptions is that the conditional mean function has a specified form. Then we use the data to find a function of x that approximates the function E(Y | X = x). This is different from, and subtler (and harder) than, finding a curve that goes through all the data points.

Example: To illustrate, I have used simulated data: Five points were sampled from a joint distribution where the conditional mean E(Y | X = x) is known to be x^2, and where each conditional distribution Y | (X = x) is normal with standard deviation 1. I used least squares regression to estimate the conditional means by a quadratic curve y = a + bx + cx^2. That is, I used least squares regression, with

  E(Y | X = x) = α + βx + γx^2

as one of the model assumptions, to obtain estimates a, b, and c of α, β, and γ (respectively), based on the data.

There are other ways of expressing this model assumption, for example,

  y = α + βx + γx^2 + ε,  or  y_i = α + βx_i + γx_i^2 + ε_i

The graph below shows:
- The five data points in red (one at the left is mostly hidden by the green curve)
- The curve y = x^2 of conditional means (black)
- The graph of the calculated regression equation (in green).

Note that:
- The points sampled from the distribution do not lie on the curve of means (black).
- The green curve is not exactly the same as the black curve, but is close.
- In this example, the sampled points were mostly below the curve of means. Since the regression curve (green) was calculated using just the five sampled points (red), the red points are more evenly distributed above and below it (green curve) than they are in relation to the real curve of means (black).

Note: In a real world example, we would not know the conditional mean function (black curve) -- and in most problems, would not even know in advance whether it is linear, quadratic, or something else. Thus, part of the problem of finding an appropriate regression curve is figuring out what kind of function it should be.

Continuing with this example, if we (naively) try to get a "good fit" by trying a quartic (fourth degree) regression curve -- that is, using a model assumption of the form

  E(Y | X = x) = α + β_1 x + β_2 x^2 + β_3 x^3 + β_4 x^4,

we get the following picture:

You can barely see any of the red points in this picture. That's because they're all on the calculated regression curve (green). We have found a regression curve that fits all the data! But it is not a good regression curve -- because what we are really trying to estimate by regression is the black curve (curve of conditional means). We have done a rotten job of that; we have made the mistake of over-fitting. We have fit an elephant, so to speak.

If we had instead tried to fit a cubic (third degree) regression curve -- that is, using a model assumption of the form

  E(Y | X = x) = α + β_1 x + β_2 x^2 + β_3 x^3,

we would get something more wiggly than the quadratic fit and less wiggly than the quartic fit. However, it would still be over-fitting, since (by construction) the correct model assumption for these data would be a quadratic mean function.
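The over-fitting example above can be imitated in a few lines. What follows is a minimal sketch, not the simulation actually used for these notes: the design points, the random seed, and the helper names (solve, polyfit, polyval, rss) are all assumptions made for illustration. It draws five points whose conditional mean is x^2 with standard deviation 1, then fits a quadratic and a quartic by least squares.

```python
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def polyfit(xs, ys, degree):
    """Least squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] of c0 + c1*x + ...
    """
    k = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(A, b)


def polyval(coefs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** i for i, c in enumerate(coefs))


def rss(coefs, xs, ys):
    """Residual sum of squares of the fitted polynomial on the data."""
    return sum((y - polyval(coefs, x)) ** 2 for x, y in zip(xs, ys))


random.seed(1)                                   # arbitrary seed (assumption)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]                 # hypothetical design points
ys = [x ** 2 + random.gauss(0, 1) for x in xs]   # true mean x^2, sd 1

quad = polyfit(xs, ys, 2)    # matches the true form of the mean function
quart = polyfit(xs, ys, 4)   # five parameters for five points

print("RSS, quadratic fit:", rss(quad, xs, ys))
print("RSS, quartic fit:  ", rss(quart, xs, ys))  # zero up to rounding error

# How well does each curve estimate the true conditional mean at a new x?
x_new = 1.5
print("true mean:", x_new ** 2)
print("quadratic estimate:", polyval(quad, x_new))
print("quartic estimate:  ", polyval(quart, x_new))
```

The quartic has as many coefficients as there are data points, so its residual sum of squares is zero apart from rounding; that says nothing about how well it estimates the curve of conditional means at values of x that were not sampled.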

How can over-fitting be avoided?

As with most things in statistics, there are no hard and fast rules that guarantee success. However, here are some guidelines. They apply to many other types of statistical models (e.g., multilinear, mixed models, general linear models, hierarchical models) as well as least squares regression.

1. Validate your model (for the mean function, or whatever else you are modeling) if at all possible. Good and Hardin (2006, p. 188) list three general types of validation methods:
   i. Independent validation (e.g., wait till the future and see if predictions are accurate). This of course is not always possible.
   ii. Split the sample. Use one part for model building, the other for validation. (See item II(c) of Data Snooping for more discussion.)
   iii. Resampling methods. See Chapter 13 of Good and Hardin (2006), and the further references provided there, for more information.

2. Gather plenty of (ideally, well-sampled) data.
   If you are gathering data (especially through an experiment), be sure to consult the literature on optimal design to plan the data collection to get the tightest possible estimates from the least amount of data.
   For regression, the values of the explanatory variable (x values, in the above example) do not usually need to be randomly sampled; choosing them carefully can minimize variances and thus give tighter estimates.
   Unfortunately, there is not much known about sample sizes needed for good modeling.
   - Ryan (2009, p. 20) quotes Draper and Smith (1998) as suggesting that the number of observations should be at least ten times the number of terms.
   - Good and Hardin (2006, p. 183) offer the following conjecturally: "If m points are required to determine a univariate regression line with sufficient precision, then it will take at least m^n observations and perhaps n!m^n observations to appropriately characterize and evaluate a regression model with n variables."

3. Pay particular attention to transparency and avoiding over-interpretation in reporting your results. For example, be sure to state carefully what assumptions you made, what decisions you made, your basis for making these decisions, and what validation procedures you used. Provide (in supplementary online material if necessary) enough detail so that another researcher could replicate your methods.

B. USING CONFIDENCE INTERVALS WHEN PREDICTION INTERVALS ARE NEEDED

Recall from the discussion of over-fitting: The model assumptions for least squares regression assume that the conditional mean function E(Y | X = x) has a certain form. The regression estimation procedure then produces a function of the specified form that estimates the true conditional mean function.

For example, if the model assumption is that E(Y | X = x) = α + βx, then least squares regression will produce an equation of the form y = a + bx, where a is an estimate of the true value α and b is an estimate of the true value β. Thus for a particular value of x, ŷ = a + bx is the estimate of E(Y | X = x).

But now suppose we want to estimate an actual value of Y when X = x, rather than just the conditional mean E(Y | X = x). The only estimate available for an actual value of Y is ŷ = a + bx, the same thing we used to estimate E(Y | X = x). But since Y is a random variable (whereas E(Y | X = x) is a single number, not a random variable), we cannot expect to estimate Y as precisely as we can estimate the conditional mean E(Y | X = x). That is, even if ŷ is a good estimate of the conditional mean E(Y | X = x), it might be a very crude estimate of an actual value of Y.

The graph below illustrates this.
- The blue line is the actual line of conditional means.
- The yellow line is the calculated regression line.
- The brown x's show some values of Y when x = 3.
- The black square shows the value of the conditional mean of Y when x = 3.

In this example, the estimate ŷ for x = 3 is virtually indistinguishable from the conditional mean when x = 3, so ŷ is a very good estimate of the conditional mean. But if we are trying to estimate Y when x = 3, our estimate ŷ (black square) might be way off -- for example, the value of Y might turn out to be at the highest brown x or at the lowest. This illustrates how the uncertainty of ŷ as an estimate of Y is much greater than the uncertainty of ŷ as an estimate of the conditional mean of Y.

To estimate the uncertainty in our estimate of the conditional mean E(Y | X = x), we can construct a confidence interval for the conditional mean. But, as we've just seen, the uncertainty in our estimate of Y when X = x is greater than our uncertainty of E(Y | X = x). Thus, the confidence interval for the conditional mean underestimates the uncertainty in our use of ŷ as an estimate of a value of Y | (X = x). Instead, we need what is called a prediction interval, which takes into account the variability in the conditional distribution Y | (X = x) as well as the uncertainty in our estimate of the conditional mean E(Y | X = x).

Example: With the data used to create the above plot:
- The 95% confidence interval for the conditional mean when x = 3 is (6.634, 7.568) (giving a margin of error of about 0.5).
- But the 95% prediction interval for Y when x = 3 is (5.139, 9.062) (giving a margin of error of about 2).
Note that the prediction interval includes all of the y-values associated with x = 3 in the data used, except for the highest one, which it misses by a hair.

Comments:

1. For large enough sample size, the least squares estimate of the conditional mean is fairly robust to departures from the model assumption of normality of errors. This depends on the Central Limit Theorem and the fact that the formula for ŷ can be expressed as a linear combination of the y-values for the data. However, since the t-statistic used in calculating the prediction interval also involves the conditional distribution directly, prediction is less robust to departures from normality.
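The contrast between the two intervals in the example above can be sketched numerically. The data below are hypothetical (they are not the data behind the plot), the seed is arbitrary, and the names conf_int and pred_int are illustrative. The formulas are the standard ones for simple linear regression with normal errors: the variance of ŷ as an estimator of the conditional mean at x0 is s^2 (1/n + (x0 - x̄)^2 / Sxx), and the prediction variance adds the estimated conditional variance s^2 to that.

```python
import math
import random

random.seed(0)  # arbitrary seed (assumption)

# Hypothetical data: 10 observations with true conditional mean 1 + 2x and sd 1.
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx            # least squares slope
a = y_bar - b * x_bar    # least squares intercept

# Estimate of the conditional variance, with n - 2 degrees of freedom.
s2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)

x0 = 3.0
y_hat = a + b * x0

# Uncertainty in y_hat as an estimate of the conditional mean E(Y | X = x0).
var_mean = s2 * (1 / n + (x0 - x_bar) ** 2 / sxx)
# Uncertainty in y_hat as an estimate of a new value of Y at x0:
# conditional variance plus the uncertainty in the estimated mean.
var_pred = s2 + var_mean

T_CRIT = 2.306  # 97.5th percentile of the t distribution with 8 df (table value)

conf_int = (y_hat - T_CRIT * math.sqrt(var_mean), y_hat + T_CRIT * math.sqrt(var_mean))
pred_int = (y_hat - T_CRIT * math.sqrt(var_pred), y_hat + T_CRIT * math.sqrt(var_pred))

print("95% confidence interval for the conditional mean:", conf_int)
print("95% prediction interval for a new Y:             ", pred_int)
```

Both intervals are centered at the same ŷ; the prediction interval is always the wider of the two because its variance contains the extra s^2 term. The critical value is hard-coded for 8 degrees of freedom, so it would need to change with the sample size.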

2. The distinction between variability and uncertainty is useful in understanding the distinction between confidence intervals for the conditional mean and prediction intervals: The confidence interval for the conditional mean measures our degree of uncertainty in our estimate of the conditional mean. But the prediction interval must also take into account the variability in the conditional distribution.

In fact, for least squares simple linear regression,
- The width of the confidence interval depends on the variance of ŷ = a + bx as an estimator of E(Y | X = x).
- But the width of the prediction interval depends on the variance of ŷ as an estimator of Y | (X = x).
- The variance of ŷ as an estimator of Y | (X = x) is the sum of the conditional variance of Y (usually denoted σ^2) and the variance of ŷ as an estimator of E(Y | X = x).
- The first term (σ^2) is a measure of the variability in the conditional distribution.
- The second term (the variance of ŷ as an estimator of E(Y | X = x)) is a measure of the uncertainty in the estimate of the conditional mean.
- The conditional variance σ^2 typically is the larger of these two terms.

C. Over-interpreting High R^2

1. Just what is considered high R^2 varies from field to field. In many areas of the social and biological sciences, an R^2 of 0.50 or 0.60 is considered high. However, Cook and Weisberg (1999) give an example of a simulated data set with 50 predictors and 100 observations, where R^2 = 0.59, even though the response is independent of all the predictors (so all regressors have coefficient zero in the true mean function). The p-value of the F-statistic for significance of the overall regression was 0.13. Nonetheless, six of the terms were individually significant at the .05 level, providing an example of why adjusting for multiple inferences is important.

2. High R^2 can also occur as a result of over-fitting. The R^2 for the example of over-fitting by a quartic curve was 1.00, since the curve went through all the points.

3. A regression model giving apparently high R^2 may not be as good a fit as might be obtained by a transformation. For example, for the data pictured below, fitting a linear regression (red line) to the data (blue -- DC output of a windmill vs. wind speed) will give R^2 = 0.87.

For some purposes this might be a good enough fit. However, since the data indicate a clear curved trend, it is likely that a better fit can be found by a suitable transformation. Since the predictor wind speed is a rate (miles per hour), one possibility is that the reciprocal (hours per mile) might be a natural choice of transformation. Trying this gives R^2 = 0.98, and the plot below shows that indeed a linear fit for the transformed data makes more sense than for the untransformed data.

D. Mistakes in Interpretation of Coefficients

1. Interpreting a coefficient as a rate of change in Y instead of as a rate of change in the conditional mean of Y.

As pointed out in the discussion of over-fitting, the computed regression equation estimates the true conditional mean function. How well it estimates the behavior of actual values of the random variable depends on the variability of the response variable Y. Thus, interpreting the computed coefficients in terms of the response variable is often misleading.

Illustration: In the graph shown below:
- The data are marked as green x's,
- The true line of conditional means, E(Y | X = x) = 1 + 2x, is in violet, and
- The fitted (computed) regression line is in blue.

Note that the fitted regression line is close to the true line of conditional means. The equation of the fitted regression line is (with coefficients rounded to a reasonable degree) ŷ = 0.56 + 2.18x. Thus it is accurate to say, "For each change of one unit in x, the average change in the mean of Y is about 2.18 units." It is not accurate to say, "For each change of one unit in x, Y changes about 2.18 units."

For example, we can see from the graph that when x is 2, Y might be anywhere from a little below 4 to a little above 5.5; when x is 3, Y might be anywhere from a little more than 5.5 to a little more than 9. So when going from x = 2 to x = 3, the change in Y might be almost zero, or it might be as large as 5.5 units.

2. Not taking confidence intervals for coefficients into account.

Even when a regression coefficient is (correctly) interpreted as a rate of change of a conditional mean (rather than a rate of change of the response variable), it is important to take into account the uncertainty in the estimation of the regression coefficient.

To illustrate, in the example used in item 1 above, the computed regression line has equation ŷ = 0.56 + 2.18x. However, a 95% confidence interval for the slope is (1.80, 2.56). So saying, "The rate of change of the conditional mean of Y with respect to x is estimated to be between 1.80 and 2.56" is usually preferable to saying, "The rate of change of the conditional mean of Y with respect to x is about 2.18."

However, the decision needs to be made on the basis of what difference is practically important. For example, if the width of the confidence interval is less than the precision of measurement, there is no harm in neglecting the range. Another factor that is also important in deciding what level of accuracy to use is what level of accuracy your audience can handle; this, however, needs to be balanced with the possible consequences of not communicating the uncertainty in the results of the analysis.
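The point of item 1 above, that a coefficient describes the change in the conditional mean and not the change in individual values of Y, can be checked by simulation. This is only a sketch: the mean function 1 + 2x is taken from the illustration, but the normal errors, the standard deviation of 1, the seed, and the function name draw_y are assumptions.

```python
import random
import statistics

random.seed(42)  # arbitrary seed (assumption)


def draw_y(x, sd=1.0):
    """One draw of Y given X = x, where E(Y | X = x) = 1 + 2x (normal errors assumed)."""
    return 1 + 2 * x + random.gauss(0, sd)


# Change in Y between an individual observed at x = 2 and another observed at x = 3.
changes = [draw_y(3) - draw_y(2) for _ in range(10_000)]

mean_change = statistics.mean(changes)
print("average change in Y:", round(mean_change, 2))   # close to the slope, 2
print("smallest change:", round(min(changes), 2))
print("largest change: ", round(max(changes), 2))
```

The average of the simulated changes sits near the slope, while individual changes range from negative values to values several times the slope, which is why "Y changes about 2 units" is not an accurate reading of the coefficient.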

3. Interpreting a coefficient that is not statistically significant.

Interpretations of results that are not statistically significant are made surprisingly often. If the t-test for a regression coefficient is not statistically significant, it is not appropriate to interpret the coefficient. A better alternative might be to say, "No statistically significant linear dependence of the mean of Y on x was detected." (This is really just a special case of the mistake in item 2. However, it is frequent enough to deserve explicit mention.)

4. Interpreting coefficients in multiple regression with the same language used for a slope in simple linear regression.

Even when there is an exact linear dependence of one variable on two others, the interpretation of coefficients is not as simple as for a slope with one explanatory variable.

Example: If y = 1 + 2x_1 + 3x_2, it is not accurate to say, "For each change of 1 unit in x_1, y changes 2 units." What is correct is to say, "If x_2 is fixed, then for each change of 1 unit in x_1, y changes 2 units."

Similarly, if the computed regression line is ŷ = 1 + 2x_1 + 3x_2, with confidence interval (1.5, 2.5) for the coefficient of x_1, then a correct interpretation would be, "The estimated rate of change of the conditional mean of Y with respect to x_1, when x_2 is fixed, is between 1.5 and 2.5 units."

For more on interpreting coefficients in multiple regression, see Section 4.3 (pp. 161-175) of Ryan (2009).

5. Multiple inference on coefficients.

When interpreting more than one coefficient in a regression equation, it is important to use appropriate methods for multiple inference, rather than using just the individual confidence intervals that are automatically given by most software. One technique for multiple inference in regression is using confidence regions. See, for example, Weisberg (2005, Section 5.5, pp. 108-110) or Cook and Weisberg (1999, Section 10.8, pp. 250-255).

E. Mistakes in Selecting Terms

When trying to form a regression model, it is usually desirable to include as few terms as possible while still giving a good model. Various procedures have been developed to help try to decide which explanatory variables can be dropped without important loss of information. However, these procedures are often used inappropriately. Here are some common mistakes that may occur in variable selection.

1. Assuming linearity is preserved when variables are dropped

One common mistake in using "variable selection" methods is to assume that if one or more variables are dropped, then the appropriate model using the remaining variables can be obtained simply by deleting the dropped variables from the "full model" (i.e., the model with all the explanatory variables). This assumption is in general false.

Example: If the true model is E(Y | X_1, X_2) = 1 + 2X_1 + 3X_2, then a linear model fits when Y is regressed on both X_1 and X_2. But if in addition, E(X_2 | X_1) = log(X_1), then it can be calculated that

  E(Y | X_1) = 1 + 2X_1 + 3log(X_1),

which shows that a linear model does not fit when Y is regressed on X_1 alone (and, in particular, that the model E(Y | X_1) = 1 + 2X_1 is incorrect).

One method that sometimes works to get around this problem is to transform the variables to have a multivariate normal distribution, and then work with the transformed variables. This will ensure that the conditional means are a linear function of the transformed explanatory variables, no matter which subset of explanatory variables is chosen. Such a transformation is sometimes possible with some variant of a Box-Cox transformation procedure. See, e.g., Cook and Weisberg (1999, pp. 236 and 324-329) for more details.

2. Problems with Stepwise Model Selection Procedures

"... perhaps the most serious source of error lies in letting statistical procedures make decisions for you."
"Don't be too quick to turn on the computer. Bypassing the brain to compute by reflex is a sure recipe for disaster."
Good and Hardin (2006, p. 3, p. 152)

Various computer algorithms have been developed for aiding in model selection. Many of them are "automatic", in the sense that they have a "stopping rule" (which it might be possible for the researcher to set or change from a default value) based on criteria such as the value of a t-statistic or an F-statistic. Others might be better termed "semi-automatic," in the sense that they automatically list various options and values of measures that might be used to help evaluate them.

Caution: Different regression software may use the same name to designate different algorithms. "Forward Selection" and "Backward Elimination" are two examples. Be sure to read the documentation to find out just what the algorithm does in the software you are using -- in particular, whether it has a stopping rule or is of the "semi-automatic" variety.

Cook and Weisberg (1999, p. 280) comment, "We do not recommend such stopping rules for routine use since they can reject perfectly reasonable submodels from further consideration. Stepwise procedures are easy to explain, inexpensive to compute, and widely used. The comparative simplicity of the results from stepwise regression with model selection rules appeals to many analysts. But, such algorithmic model selection methods must be used with caution."

They give an example (1999, pp. 280-281) illustrating how stepwise regression algorithms will generally result in models suggesting that the remaining terms are more important than they really are, and that the R^2 values of the submodels obtained may be misleadingly large. Ryan (2009, pp. 269-273 and 284-286) elaborates on these points. One underlying problem with methods based on t or F statistics is that they ignore problems of multiple inference.

Alternatives to Stepwise Selection Methods

There are various criteria that may be considered in evaluating models. One that has intuitive appeal is Mallows' C_p statistic. It is an estimate of Mean Square Error, and can also be regarded as a measure that accounts for both bias and variance. (See, e.g., Cook and Weisberg (1999, pp. 272-280), Ryan (2009, pp. 273-277 and 279-283), R. Berk (2004, pp. 130-135), Smith (2008).) Other aids include Akaike's Information Criterion (and variations) and Added Variable Plots.

And, of course, context can be important to consider in deciding on a model. For example, the questions of interest can dictate that certain variables need to remain in the model; or quality of data can help decide which of two variables to retain. Several considerations may come into play in deciding on a model. Also, other regression methods (e.g., Ridge Regression) may be useful instead of Least Squares Regression.

For more discussion of model selection methods, see Cook and Weisberg (1999, Chapters 10, 11 and 17-20); Ryan (2009, Chapters 7, 11, 12 and references therein); Berk (2004, pp. 126-135); Good and Hardin (2006, Chapters 10, 11, Appendix A); and Harrell (2001).
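The multiple-inference problem behind selection rules based on t or F statistics can be seen in a small simulation. The sketch below is not the Cook and Weisberg example: the number of predictors, the seed, and the helper name pearson_r are assumptions, and each candidate predictor is screened one at a time rather than in a joint regression. The response is generated independently of every predictor, so every "significant" term is a false positive. The cutoff uses the fact that, in simple linear regression, the slope t-test with n - 2 degrees of freedom rejects exactly when |r| > t / sqrt(n - 2 + t^2).

```python
import math
import random

random.seed(3)  # arbitrary seed (assumption)

N_OBS = 100
N_PREDICTORS = 200   # hypothetical number of candidate terms


def pearson_r(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# The response is pure noise: independent of every candidate predictor.
y = [random.gauss(0, 1) for _ in range(N_OBS)]
predictors = [[random.gauss(0, 1) for _ in range(N_OBS)] for _ in range(N_PREDICTORS)]

T_CRIT = 1.984  # approximate 97.5th percentile of t with 98 df (table value)
R_CRIT = T_CRIT / math.sqrt(N_OBS - 2 + T_CRIT ** 2)

r_values = [pearson_r(x, y) for x in predictors]
n_significant = sum(abs(r) > R_CRIT for r in r_values)

print("predictors individually 'significant' at the .05 level:", n_significant)
print("largest |r| among pure-noise predictors:", round(max(abs(r) for r in r_values), 3))
```

About 5% of the noise predictors are expected to pass the individual test, and a procedure that keeps whichever terms pass will report them as if they mattered.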

DIVIDING A CONTINUOUS VARIABLE INTO CATEGORIES

This is not really a matter of ignoring model assumptions, but of using a method that at best can be expected to give less information than a more appropriate method, and at worst can give misleading results.

It is also known by other names such as "discretizing," "chopping data," or "binning". Specific methods sometimes used include "median split" or "extreme third tails". Whatever you call it, it's usually a bad idea.

Exceptions:
- Comparing new data with existing data where only the categories are known, not the values of the continuous variable. Here, it's a necessary evil; see item 3 below.
- Explaining an idea to an audience that lacks the sophistication for the full analysis. However, this should only be done when the full analysis has been done and justifies the result that is illustrated by the simpler technique using categorizing. See Gelman and Park (2008) for an example. See also item 5 below.

Instead of chopping the data, use a technique (such as regression) that can work with the continuous variable. The basic reason is intuitive: If you chop the data, you are tossing away information. The loss of information can occur in various ways with various consequences. Here are some:

1. When doing hypothesis tests, the loss of information when dividing continuous variables into categories typically translates into losing power. See van Belle (2008, pp. 139-140) for more discussion and references. See McLelland (2002) for an online computer demonstration.

2. The loss of information involved in choosing bins to make a histogram can result in a misleading histogram.

Example: The following three graphs are all histograms of the same data (the times between successive eruptions of the Old Faithful geyser in Yellowstone National Park). The first has five bins, the second seven bins, and the third 14 bins.

Note that the histogram with only five bins does not pick up the bimodality of the data; the histogram with seven bins hints at it; and the histogram with 14 bins shows it more clearly.

The problem with bins in a histogram is the reason why histograms are not good for checking model assumptions such as normality.

Some software has a "kernel density" feature that can give a smoothed visual estimate of the distribution of data. This is usually better than a histogram for getting a sense of the distribution. Here's an example for the data above: the histogram on the left has five bins, the histogram on the right 14; note that the smoothed estimate of the distribution is the same in both cases.

Although using such a smoother is better than a histogram for giving the shape of the data, the result is only as good as the data; a small data set may give a misleading idea of the underlying distribution.

3. Collecting continuous data by categories can also cause headaches later on. Good and Hardin (2006, pp. 28-29) give an example of a long-term study involving incomes. The data were collected in categories of ten thousand dollars. Because of inflation, purchasing power decreased noticeably from the beginning to the end of the study. The categorization of income made it virtually impossible to correct for inflation.

4. Wainer, Gessaroli, and Verdi (2006, pp. 49-52; or Wainer, 2009, Chapter 14) argue that if a large enough sample is drawn from two uncorrelated variables, it is possible to group the variables one way so that the binned means show an increasing trend, and another way so that they show a decreasing trend. They conclude that if the original data are available, one should look at the scatterplot rather than at binned data.

Moral: If there is a good justification for binning data in an analysis, it should be "before the fact" -- you could otherwise be accused of manipulating the data to get the results you want!
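The loss of information described in item 1 can be made concrete with a median split. The following is a sketch on simulated data; the sample size, the slope of 0.5, the seed, and the helper name pearson_r are assumptions. It compares the correlation between a response and a continuous predictor with the correlation obtained after the predictor is replaced by a high/low indicator.

```python
import math
import random

random.seed(7)  # arbitrary seed (assumption)


def pearson_r(u, v):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # hypothetical linear relationship

# Median split: replace the continuous predictor by a 0/1 indicator.
median_x = sorted(x)[n // 2]
x_split = [1.0 if xi > median_x else 0.0 for xi in x]

r_cont = pearson_r(x, y)
r_split = pearson_r(x_split, y)

print("correlation using the continuous predictor:", round(r_cont, 3))
print("correlation after the median split:        ", round(r_split, 3))
```

The dichotomized predictor shows a weaker association with the response than the original variable does, so a test based on it needs more data to detect the same relationship.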

5. There are times when continuous data must be dichotomized, for example in deciding a cut-off for diagnostic criteria. When this is the case, it's important to choose the cut-off carefully, and to consider the sensitivity, specificity, and positive predictive value. For definitions and further references, see http://www.ma.utexas.edu/users/mks/statmistakes/misundcond.html and references therein. For an example of how iffy making such cut-offs can be, see Ott (2008).

SUGGESTIONS

Suggestions for Researchers

The most common error in statistics is to assume that statistical procedures can take the place of sustained effort.
Good and Hardin (2006, p. 186)

Throughout: Look for, take into account, and report sources of uncertainty.

Specific suggestions for planning research:
- Decide what questions you will be studying. Trying to study too many things at once is likely to create problems with multiple testing, so it may be wise to limit your study.
- If you will be gathering data, think about how you will gather and analyze it before you start to gather the data.
- Read reports on related research, focusing on problems that were encountered and how you might get around them and/or how you might plan your research to fill in gaps in current knowledge in the area.
- If you are planning an experiment, look for all possible sources of variability and design your experiment to take these into account as much as possible.
  - The design will depend on the particular situation.
  - The literature on design of experiments is extensive; consult it.
  - Remember that the design affects what method of analysis is appropriate.
- If you are gathering observational data, think about possible confounding factors and plan your data gathering to reduce confounding.

- Be sure to record any time and spatial variables, or any other variables that might influence outcome, whether or not you initially plan to use them in your analysis.
- Also think about any factors that might make the sample biased.
  - You may need to limit your study to a smaller population than originally intended.
- Think carefully about what measures you will use.
  - If your data gathering involves asking questions, put careful thought into choosing and phrasing them. Then check them out with a test-run and revise as needed.
- Think carefully about how you will randomize (for an experiment) or sample (for an observational study).
- Think carefully about whether or not the model assumptions of your intended method of analysis are likely to be reasonable.
  - If not, revise either your plan for data gathering or your plan for analysis, or both.
- Conduct a pilot study to troubleshoot and obtain variance estimates for a power analysis.
  - Revise plans as needed.
- Do a power analysis to estimate what sample size you need to detect meaningful differences.
  - Revise plans as needed.
- Plan how to deal with multiple inferences, including "data snooping" questions that might arise later.
- If you plan to use existing data, modify the suggestions above, as in the suggestions under Data Snooping (Part III).
- For additional suggestions, see van Belle (2008, Chapter 8).

Specific suggestions for analyzing data:
- Before doing any formal analysis, ask whether or not the model assumptions of the procedure are plausible in the context of the data.
- Plot the data (or residuals, as appropriate) where possible to get additional checks on whether or not model assumptions hold.
- If model assumptions appear to be violated, consider transformations of the data or using alternate methods of analysis as appropriate.
- If more than one statistical inference is used, be sure to take that into account by using appropriate methodology for multiple inference.
- If you use hypothesis tests, be sure to calculate confidence intervals as well. But be aware that there might also be other sources of uncertainty not captured by confidence intervals.

Specific suggestions for writing up research:

"Critics may complain that we advocate interpreting reports not merely with a grain of salt but with an entire shaker; so be it.... Neither society nor we can afford to be led down false pathways."
Good and Hardin (2006, p. 119)

"Until a happier future arrives, imperfections in models require further thought, and routine disclosure of imperfections would be helpful."
David Freedman (2008, p. 61)

Aim for transparency.
Include enough detail so the reader can critique both the data gathering and the analysis. Look for and report possible sources of bias or other sources of additional uncertainty in results. For more detailed suggestions on recognizing and reporting bias, see Chapter 1 and pp. 113-115 of Good and Hardin (2006). All of Chapter 7 of that book is a good supplement to the suggestions here. Consider including a "limitations" section, but be sure to reiterate or summarize the limitations in stating conclusions -- including in the abstract.
Include enough detail so that another researcher could replicate both the data gathering and the analysis. For example, "SAS Proc Mixed was used" is not adequate detail. You also need to explain which factors were fixed, which random, which nested, etc.
If space limitations do not permit all the needed detail to be included in the actual paper, provide it on a website to accompany the article. Some journals now include websites for supplementary information; publish in these when possible.
When citing sources, give explicit page numbers, especially for books.
Include discussion of why the analyses used are appropriate, i.e., why model assumptions are well enough satisfied for the robustness criteria for the specific technique, or whether they are iffy. This might go in a supplementary information website.
If you do hypothesis testing, be sure to report p-values (rather than just phrases such as "significant at the .05 level") and also give confidence intervals. In some situations, other measures such as "number to treat" would be appropriate. See pp. 151-153 of van Belle (2008) for more discussion.
Be careful to use language (both in the abstract and in the body of the article) that expresses any uncertainty and limitations.
If you have built a model, be sure to explain the decisions that went into the selection of that model. See Good and Hardin (2006, pp. 181-182) for more suggestions.

For more suggestions and details, see:
Chapters 8 and 9 of van Belle (2008)
Chapters 7 and 9 of Good and Hardin (2006)
Harris et al (2009)
Miller (2004)
Robbins (2004)
Strasak et al (2007)
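To make concrete the suggestion above about reporting a p-value together with a confidence interval, the following is a rough sketch for a difference between two group means. It assumes samples large enough that a normal approximation is reasonable, and it works from summary statistics only. The function name and the numbers are hypothetical, and a real analysis would use whatever procedure fits the study design and its model assumptions.

```python
from math import sqrt
from statistics import NormalDist


def difference_in_means_report(mean1, sd1, n1, mean2, sd2, n2, conf_level=0.95):
    """Return (difference, two-sided p-value, confidence interval).

    Assumption: large-sample normal approximation for two independent groups.
    """
    diff = mean1 - mean2
    # Standard error of the difference between two independent sample means.
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = diff / se
    # Two-sided p-value for the null hypothesis of no difference.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Critical value for the requested confidence level (about 1.96 for 95%).
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)


# Hypothetical summary statistics for two groups of 100 subjects each.
diff, p, (low, high) = difference_in_means_report(52.0, 10.0, 100, 48.0, 10.0, 100)
print(f"difference = {diff:.2f}, p = {p:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the p-value shows the range of differences consistent with sampling variability. It does not capture other sources of uncertainty such as bias or poorly fitting model assumptions.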

Suggestions for Reading Research Involving Statistics

"Some experts think peer review validates published research. For those of us who have been editors, associate editors, reviewers, or the targets of peer review, this argument may ring hollow. Even for careful readers of journal articles, the argument may seem a little farfetched."
David Freedman, Chance 2008, v. 21 No 1, p. 61

Overarching Guideline: Look for sources of uncertainty.

Specific suggestions:

Do not just read the abstract. Abstracts sometimes focus on conclusions that are more speculative than the data warrant.
Identify the exact research question(s) the researchers are asking. Decide if this is a question that you are interested in. For example, if you are interested in the effect of a medication on hip fractures, is this the endpoint that the researchers have studied, or have they just studied a proxy such as bone density?
Determine the type of study: observational or experimental; exploratory or confirmatory. This will influence the strength of the conclusions that can be drawn.
Identify the measures the researchers are using. Decide how well they fit what you are looking for from the study. For example, if your interest is in how well a medication or lifestyle change will reduce your chances of having a hip fracture, a study with outcome based on hip fractures will be more informative than one with outcome bone density.
Pay attention to how the sample(s) were chosen. Think about any factors that might make the sample biased. Results from a biased sample are unreliable, although sometimes they might give some information about a smaller population than intended. Remember that voluntary response samples are usually biased.
Have the researchers explained why the statistical procedures they have used are appropriate for the data they are analyzing? In particular, have they given good reasons why the model assumptions fit the context? If not, their results should have less credibility than if the model has been shown to fit the context well.
If there is multiple inference on the same data, have the authors taken that into account in deciding significance or confidence levels?
If hypothesis tests are used, are confidence intervals also given? The confidence intervals can give a better idea of the range of uncertainty due to sampling variability. But be aware that there might also be other sources of uncertainty not captured by confidence intervals (e.g., bias or lack of fit of model assumptions).
Have claims been limited to the population from which the data are actually gathered?
Have the authors taken into account practical significance as well as statistical significance?
Is the power of statistical tests large enough to warrant claims of no difference?
See Good and Hardin (2006, Chapter 8) for more suggestions and detail.
See van Belle (2008, Chapter 7) for items specific to Evidence Based Medicine.

Suggestions for Reviewers, Referees, Editors, and Members of Institutional Review Boards

Base acceptance on the quality of the design, implementation, analysis, and writing of the research, not on the results of analysis.
See Suggestions for Researchers and Suggestions for Reading Research above.
Check to be sure power calculations are prospective, not retrospective.
Advocate for changing policies if necessary to promote best practices.
For more suggestions, see Coyne (2009, p. 51).

Suggestions for Teachers of Statistics

Emphasize that uncertainty is often unavoidable; we can best deal with it by seeking to know where it may occur and trying to estimate how large it is.
Be willing to say, "I don't know" when appropriate.
Point out the differences between ordinary and technical uses of words.
Be sure to include some discussion of skewed distributions.
Emphasize that every statistical technique depends on model assumptions.
Form the habit of checking if the model assumptions are reasonable before applying a procedure. Expect your students to do the same.
Give assessment questions that ask the student to decide which techniques are appropriate.
Discuss robustness.
When a test fails to reject the null hypothesis, do not accept the null hypothesis unless a power calculation has shown that the test will detect a practically significant difference, or unless there is some other carefully thought out decision criterion that has been met. Expect your students to do the same.
Remember, and emphasize, that one study does not prove anything. In particular, do not use strong language such as "We conclude that...", "This proves that...", "This shows that... is..." Instead, use more honest language such as, "These data support the claim that..." or "This experiment suggests that..." Expect your students to do the same.
For more suggestions for introductory courses, see American Statistical Association (2005).

In introductory courses, try to caution your students about the problems with multiple inference, even if you can't go into detail.
In advanced courses, be sure to discuss the problems of multiple inference.

REFERENCES

American Statistical Association (2005), Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report, download from http://www.amstat.org/education/gaise/index.cfm
R. A. Berk (2004), Regression Analysis: A Constructive Critique, Sage
R. D. Cook and S. Weisberg (1999), Applied Regression Including Computing and Graphics, Wiley
J. Coyne (2009), Are most positive findings in health psychology false or at least somewhat exaggerated?, European Health Psychologist, Vol. 11, No. 3, pp. 49-51
N. Draper and H. Smith (1998), Applied Regression Analysis, Wiley
D. Freedman (2008), Editorial: Oasis or Mirage?, Chance v. 21 No 1, pp. 59-61
Gelman and Park (2008), Splitting a predictor at the upper quarter or third, American Statistician 62, No. 4, pp. 1-8, http://www.stat.columbia.edu/%7egelman/research/published/thirds5.pdf
P. Good and J. Hardin (2006), Common Errors in Statistics (and How to Avoid Them), Wiley
F. E. Harrell (2001), Regression Modeling Strategies, Springer
Harris, A. H. S., R. Reeder and J. K. Hyun (2009), Common statistical and research design problems in manuscripts submitted to high-impact psychiatry journals: What editors and reviewers want authors to know, Journal of Psychiatric Research, vol 43 no 15, 1231-1234
G. McClelland (2002), Negative Consequences of Dichotomizing Continuous Predictor Variables, online demo at http://psych.colorado.edu/~mcclella/mediansplit/
Miller, Jane (2004), The Chicago Guide to Writing about Numbers: The Effective Presentation of Quantitative Information, University of Chicago Press
Ott, S. (2008), T and Z scores and WHO definitions, Osteoporosis and Bone Physiology Page, http://courses.washington.edu/bonephys/opbmd.html#tz
Robbins, N. (2004), Creating More Effective Graphs, Wiley
T. Ryan (2009), Modern Regression Methods, Wiley
M. Smith (2008), Lecture Notes on Selecting Terms, http://www.ma.utexas.edu/users/mks/statmistakes/selterms.pdf
Strasak, A. M. et al (2007), The Use of Statistics in Medical Research, The American Statistician, February 1, 2007, 61(1): 47-55
van Belle, G. (2008), Statistical Rules of Thumb, Wiley
Wainer, Gessaroli, and Verdi (2006), Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect, Chance Magazine, vol. 19, No. 1, pp. 49-52. Essentially the same article appears as Chapter 14 in Wainer (2009), Picturing the Uncertain World, Princeton University Press.