Regression [M&M 2.3 and 10]

Uses
- Curve fitting
- Summarization ('model')
- Description
- Prediction
- Explanation
- Adjustment for 'confounding' variables

[Table: percentiles of birth weight (g) by gestational age (weeks), males and females -- for each sex, the number of births and the 10th, 50th and 90th percentile weights at each week of gestation.]

Technical meaning

[originally] simply a line of 'best fit' to data points
[nowadays] the regression line is the LINE that connects the CENTRES of the distributions of Y's at each X value
- not necessarily a straight line; it could be curved, as with growth charts
- not necessarily the µ_{Y|X}'s used as CENTRES; one could use medians, etc.
- strictly speaking, we haven't completed the description unless we characterize the variation around the centres of the Y distributions at each X
- the inference is not restricted to the distributions of Y's for which we make some observations; it applies to the distributions of Y's at all unobserved X values in between

[Figure: birth-weight distributions and medians (50th percentiles) for males and females, plotted against gestational age (weeks), with reference lines at 1000g, 2000g, 3000g and 4000g. Live singleton births, Canada 1986. Source: Arbuckle & Sherman, CMAJ 1989.]

Examples (with appropriate caveats)
- Birth weight (Y) in relation to gestational age (X)
- Blood pressure (Y) in relation to age (X)
- Cardiovascular mortality (Y) in relation to water hardness (X)?
- Cancer incidence (Y) in relation to some exposure (X)?
- Scholastic performance (Y) vis-a-vis amount of TV watched (X)

Caveat: there is no guarantee that a simple straight-line relationship will be adequate. Also, in some instances the relationship might change with the type of X and Y variables used to measure the two phenomena being studied; the relationship may also be more artifact than real (see later, under inference).
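As a quick illustration of the 'centres of the Y distributions at each X' idea, the sketch below uses hypothetical birth-weight data (made-up values, not the Arbuckle & Sherman figures) and prints, for each gestational week, the mean, the median and the 10th percentile -- any of these could serve as the 'centre' (or percentile curve) traced out across X.

```python
# Sketch (hypothetical data): the "centre of the Y distribution at each X" idea.
import numpy as np

rng = np.random.default_rng(1)
weeks = np.repeat(np.arange(36, 43), 200)                            # gestational age, X
weight = 3300 + 150 * (weeks - 40) + rng.normal(0, 450, weeks.size)  # birth weight, Y

for wk in np.unique(weeks):
    w = weight[weeks == wk]
    # either the mean or the median could be used as the "centre" at this X
    print(wk, round(w.mean()), round(np.median(w)), round(np.percentile(w, 10)))
```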

Simple Linear Regression (one X) (straight line)

Equation:   µ_{Y|X} = α + β·X,   or   ∆µ_{Y|X} / ∆X = β = "rise"/"run"

In practice, one rarely sees an exact straight-line relationship in health-science applications:

1 - While physicists are often able to examine the relationship between Y and X in a laboratory with all other things being equal (i.e. controlled or held constant), medical investigators largely are not. The universe of (X,Y) pairs is very large, and any 'true' relationship is disturbed by countless uncontrollable (and sometimes un-measurable) factors. In any particular sample of (X,Y) pairs these distortions will surely be operating.

2 - The true relationship (even if we could measure it exactly) may not be a simple straight line.

3 - The measuring instruments may be faulty or inexact (using 'instruments' in the broadest sense).

One always tries to have the investigation sufficiently controlled that the 'real' relationship won't be 'swamped' by factors 1 and 3, and that the background "noise" will be small enough that alternative models (e.g. curvilinear relationships) can be distinguished from one another.

Linear here means linear in the parameters. The equation y = B·C^x can be made linear in the parameters by taking logs, i.e. log[y] = log[B] + x·log[C]; y = a + b·x + c·x² is already linear in the parameters a, b and c. The following model cannot be made linear in the parameters α, β, γ:

    proportion dying = α + (1 − α) / (1 + exp{β − γ·log[dose]})

Fitting a straight line to data - the Least Squares Method

The most common method is that of Least Squares. Note that least squares can be thought of as just a curve-fitting method and doesn't have to be thought of in a statistical (or random-variation or sampling-variation) context. Other, more statistically-oriented methods include the method of minimum Chi-Square (matching observed and expected counts according to a measure of discrepancy) and the Method of Maximum Likelihood (finding the parameters that made the data most likely). Each has a different criterion of "best fit". There are also several theoretical advantages to least-squares estimates over others based, for example, on least absolute deviations: they are the most precise of all the possible estimates one could get by taking linear combinations of y's.

Least Squares approach: Consider a candidate slope (b) and intercept (a) and predict that the Y value accompanying any X = x is ŷ = a + b·x. The observed y value will deviate from this "predicted" or "fitted" value by an amount d = y − ŷ. We wish to keep this deviation as small as possible, but we must try to strike a balance over all the data points. Again, just as when calculating variances, it is easier to work with squared deviations: d² = (y − ŷ)². We weight all deviations equally (whether they be the ones in the middle or at the extremes of the x range), using Σ d² = Σ (y − ŷ)² to measure the overall (or average) discrepancy of the points from the line.
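A minimal sketch of the least-squares criterion on made-up (x, y) values: the function below evaluates Σ(y − ŷ)² for any candidate intercept and slope, and numpy's polyfit is used only to locate the pair that minimizes it.

```python
# Sketch (assumed data, not from the notes): the least-squares criterion
# sum of squared deviations (y - yhat)^2 for candidate lines.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def ssq(a, b):
    """Overall discrepancy of the points from the line yhat = a + b*x."""
    yhat = a + b * x
    return np.sum((y - yhat) ** 2)

# a few candidate (intercept, slope) pairs vs the least-squares pair
print(ssq(1.0, 0.8), ssq(1.0, 1.2), ssq(0.9, 1.0))
b_ls, a_ls = np.polyfit(x, y, 1)      # polyfit returns [slope, intercept] for degree 1
print(a_ls, b_ls, ssq(a_ls, b_ls))    # the minimum of the criterion
```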

From all the possible candidates for slope (b) and intercept (a), we choose the particular values a and b which make this sum of squares (sum of squared deviations of 'fitted' from 'observed' Y's) a minimum, i.e. we search for the a and b that give us the least-squares fit. Fortunately, we don't have to use trial and error to arrive at the 'best' a and b. Instead, it can be shown by calculus, or algebraically, that the a and b which minimize Σ d² are:

    b = β̂ = Σ{x_i − x̄}{y_i − ȳ} / Σ{x_i − x̄}² = r_xy · s_y / s_x

    a = α̂ = ȳ − b·x̄

[Note that a least-squares fit of the regression line of X on Y would give a different set of values for the slope and intercept: the slope of the line of x on y is r_xy · s_x / s_y. One needs to be careful, when using a calculator or computer program, to specify which is the explanatory variable (X) and which is the predicted variable (Y).]

Meaning of the intercept parameter (a): Unlike the slope parameter (which represents the increase/decrease in µ_{Y|X} for every unit increase in x), the intercept does not always have a 'natural' interpretation. It depends on where the x-values lie in relation to x = 0, and may represent part of what is really the mean Y. For example, the regression line for fuel economy of cars (Y) in relation to their weight (x) might be

    µ_{Y|weight} = 60 mpg − 0.01 × weight in lbs   [slope: −0.01 mpg/lb]

but there are no cars weighing 0 lbs. It would be better to write the equation in relation to some 'central' value for weight, e.g. 3500 lbs; then the same equation can be cast as

    µ_{Y|weight} = 25 − 0.01 × (weight − 3500).

It is helpful, for testing whether there is evidence of a non-zero slope, to think of the simplest of all regression models, namely the one which is a horizontal straight line: µ_{Y|X} = α + 0·X = the constant α. This is a re-statement of the fact that the sum of squared deviations around a constant horizontal line at height 'a' is smallest when a = the mean.

[We don't always use the mean as the best 'centre' of a set of numbers. Imagine waiting for one of several elevators with doors in a row along one wall; you do not know which one will arrive next, and so want to stand in the 'best' place no matter which one comes next. Where to stand depends on the criterion being optimized: if you want to minimize the maximum distance, stand midway between the one on the extreme left and the one on the extreme right; if you wish to minimize the average distance, where do you stand? If, for some reason, you want to minimize the average squared distance, where do you stand? If the elevator doors are not equally spaced from each other, what then?]
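The closed-form estimates above are easy to verify numerically. The sketch below (made-up data again) computes b two ways -- from Σ{x − x̄}{y − ȳ}/Σ{x − x̄}² and from r·s_y/s_x -- and also shows that the slope of the regression of X on Y is r·s_x/s_y, not the reciprocal of b.

```python
# Sketch (assumed data): closed-form least-squares estimates, and the
# different slope obtained by regressing X on Y.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
a = ybar - b * xbar

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)
print(b, r * sy / sx)          # same number: two expressions for the slope of Y on X
print(a)

b_x_on_y = r * sx / sy         # slope of the OTHER regression (X on Y) -- not 1/b
print(b_x_on_y, 1 / b)
```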

The anatomy of a slope: some re-expressions

Consider the formula:   slope = b = Σ{x − x̄}{y − ȳ} / Σ{x − x̄}²

Without loss of generality, and for simplicity, assume ȳ = 0.

If we have 3 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3), then

    x1 − x̄ = −1;  x2 − x̄ = 0;  x3 − x̄ = +1

so   slope = b = [{−1}·y1 + {0}·y2 + {+1}·y3] / [{−1}² + {0}² + {+1}²]

i.e.  slope = (y3 − y1) / (x3 − x1)

Note that y2 contributes to ȳ, and thus to an estimate of the average y (i.e. the level), but not to the slope.

If 4 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3; x4 = 4), then

    x1 − x̄ = −1.5;  x2 − x̄ = −0.5;  x3 − x̄ = +0.5;  x4 − x̄ = +1.5

and so   slope = b = [{−1.5}·y1 + {−0.5}·y2 + {+0.5}·y3 + {+1.5}·y4] / [{−1.5}² + {−0.5}² + {+0.5}² + {+1.5}²]

i.e.  slope = [1.5·{y4 − y1} + 0.5·{y3 − y2}] / 5

i.e.  slope = (9/10)·(y4 − y1)/(x4 − x1) + (1/10)·(y3 − y2)/(x3 − x2)

i.e. a weighted average of the slope from datapoints 1 and 4 and that from datapoints 2 and 3, with weights proportional to the squares of their distances on the x axis, {x4 − x1}² and {x3 − x2}².

Another way to think of the slope: rewrite

    b = Σ{x − x̄}{y − ȳ} / Σ{x − x̄}²  =  Σ [ {x − x̄}² · {y − ȳ}/{x − x̄} ] / Σ {x − x̄}²

i.e. a weighted average of the 'slopes' {y − ȳ}/{x − x̄}, with weight {x − x̄}² for each such estimate.

Yet another way to think of the slope: b is a weighted average of all the pairwise slopes (yi − yj)/(xi − xj), with weights proportional to {xi − xj}². e.g. if the 4 x's are 1 unit apart, and we denote by b_{1&2} the slope obtained from {x1,y1} & {x2,y2}, etc., then

    b = [1·b_{1&2} + 4·b_{1&3} + 9·b_{1&4} + 1·b_{2&3} + 4·b_{2&4} + 1·b_{3&4}] / 20

jh 6/94
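The 'weighted average of all pairwise slopes' identity can be checked directly; the sketch below uses arbitrary made-up data and weights (xi − xj)².

```python
# Sketch: numerical check that the least-squares slope equals a weighted average
# of all pairwise slopes (yi - yj)/(xi - xj) with weights (xi - xj)^2.  (Assumed data.)
import numpy as np
from itertools import combinations

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.3, 2.0, 2.2, 3.9])

b_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

num = den = 0.0
for i, j in combinations(range(len(x)), 2):
    w = (x[i] - x[j]) ** 2                       # weight for this pair
    num += w * (y[i] - y[j]) / (x[i] - x[j])     # weight * pairwise slope
    den += w
print(b_ls, num / den)                           # identical
```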

Inferences regarding Simple Linear Regression

How reliable are

(i) the (estimated) slope
(ii) the (estimated) intercept
(iii) the predicted mean Y at a given X
(iv) the predicted y for a (future) individual with a given X

when they are based on data from a sample? i.e. how much would these estimated quantities change if they were based on a different random sample [with the same x values]?

We can use the concept of sampling variation to (i) describe the 'uncertainty' in our estimates via CONFIDENCE INTERVALS or (ii) carry out TESTS of significance on the parameters (slope, intercept, predicted mean).

We can describe the degree of reliability of (or, conversely, the degree of uncertainty in) an estimated quantity by the standard deviation of the possible estimates produced by different random samples of the same size from the same x's. We call this (obviously conceptual) S.D. the standard error of the estimated quantity (just like the standard error of the mean when estimating µ).

The size of the standard error will depend on
1. how 'spread apart' the x's are;
2. how good a fit the regression line really is (i.e. how small the unexplained variation about the line is);
3. how large the sample size, n, is.

Factors affecting reliability (in more detail)

1. The spread of the X's: The best way to get a reliable estimate of the slope is to take Y readings at X's that are quite a distance from each other. E.g. in estimating the "per-year increase in BP over the 30-50 yr age range", it would be better to take X = 30, 35, 40, 45, 50 than to take X = 38, 39, 40, 41, 42. Any individual fluctuations will 'throw off' the slope much less if the X's are far apart. [It is helpful to think of the slope as an average difference in means for 2 groups that are 1 x-unit apart.]

[Figure: BP vs AGE, panels (a) and (b). Thick line: the real (true) relation between average BP at age X and X; thin lines: possible apparent relationships, because of individual variation, when we study 1 individual at each of two ages (a) spaced closer together, (b) spaced further apart.]

Notes: The regression line refers to the relationship between the average Y at a given X and the X, and not to individual Y's vs X. Obviously, of course, if the individual Y's are close to the average Y, so much the better! The above argument would suggest studying individuals at the extremes of the X values of interest. We do this if we are sure that the relationship is a linear one. If we are not sure, it is wiser -- if we have a choice in the matter -- to take a 3-point distribution. There is a common misapprehension that a Gaussian distribution of X values is desirable for estimating a regression slope of Y on X. In fact, the 'inverted U' shape of the Gaussian is the least desirable!
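The effect of the spread of the X's can be seen in a small simulation (the true line, residual SD and age designs below are assumed values, not from the notes): the fitted slope varies much less across repeated samples when the ages are 30, 35, ..., 50 than when they are 38, 39, ..., 42.

```python
# Sketch: reliability factor 1 -- same n, but spread-out vs bunched x's.  (Assumed values.)
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, sigma = 100.0, 0.8, 10.0          # true BP = 100 + 0.8*age, SD about the line

def sd_of_slope(ages, n_sims=5000):
    slopes = []
    for _ in range(n_sims):
        bp = alpha + beta * ages + rng.normal(0, sigma, ages.size)
        slopes.append(np.polyfit(ages, bp, 1)[0])
    return np.std(slopes)

spread  = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
bunched = np.array([38.0, 39.0, 40.0, 41.0, 42.0])
print("SD of slope, spread-out ages:", sd_of_slope(spread))
print("SD of slope, bunched ages:   ", sd_of_slope(bunched))
```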

Factors affecting reliability (continued)

2. The (vertical) variation about the regression line: Again, consider BP and age, and suppose that indeed the average BP of all persons aged X + 1 is β units higher than the average BP of all persons aged X, and that this linear relationship

    average BP of persons aged x = α + β·X
    (average of Y's at a given x = intercept + slope × X)

holds over the age span of interest. Obviously, everybody aged x = 32 won't have the exact same BP; some will be above the average of 32-year-olds, some below. Likewise for the other ages x = 30, 31, ... In other words, at any x there will be a distribution of y's about the average for age X. Obviously, how wide this distribution is about α + β·X will have an effect on what slopes one could find in different samples (we measure the vertical spread around the line by σ).

[Figure: possible apparent relationships, because of individual variation, when we study 1 individual at each of two ages, when the within-age distributions have (a) a narrow spread, (b) a wider spread.]

NOTE: For unweighted regression, one should have roughly the same spread of Y's at each X.

3. Sample size (n): A larger n will make it more difficult for the types of extreme and misleading estimates caused by (1) poor X spread and (2) large variation in Y about µ_{Y|X} to occur. Clearly, it may be possible to spread the x's out so as to maximize their variance (and thus reduce the n required), but it may not be possible to change the magnitude of the variation about µ_{Y|X} (unless there are other known factors influencing BP). Thus the need for a reasonably stable estimated ŷ [i.e. estimate of µ_{Y|X}].

[Figure: BP vs AGE, panels (a) and (b). Thick line: the real (true) relation between average BP at age X and X; thin lines: possible apparent relationships because of individual variation.]
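Factors 2 and 3 can be simulated the same way (again with assumed values): increasing the vertical scatter σ inflates the variability of the fitted slope, and increasing n pulls it back down.

```python
# Sketch: reliability factors 2 (sigma) and 3 (n), continuing the assumed BP-vs-age setting.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 100.0, 0.8

def sd_of_slope(n, sigma, n_sims=5000):
    ages = np.linspace(30, 50, n)                      # fixed design: n ages from 30 to 50
    slopes = []
    for _ in range(n_sims):
        bp = alpha + beta * ages + rng.normal(0, sigma, n)
        slopes.append(np.polyfit(ages, bp, 1)[0])
    return np.std(slopes)

print("n=10, sigma=5 :", sd_of_slope(10, 5.0))
print("n=10, sigma=15:", sd_of_slope(10, 15.0))        # wider scatter -> less reliable slope
print("n=40, sigma=15:", sd_of_slope(40, 15.0))        # larger n recovers some reliability
```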

Standard Errors

    SE(b) = SE(β̂) = σ / √[ Σ{x_i − x̄}² ]

    SE(a) = SE(α̂) = σ · √[ 1/n + x̄² / Σ{x_i − x̄}² ]

(Note: there is a negative correlation between a and b.) We don't usually know σ, so we estimate it from the data, using the scatter of the y's from the fitted line (i.e. the SD of the residuals).

If we examine the structure of SE(b), we see that it reflects the 3 factors discussed above:

(i) a large spread of the x's makes the contribution of each observation to Σ{x_i − x̄}² large, and since this is in the denominator, it reduces the SE;

(ii) a small vertical scatter is reflected in a small σ, and since this is in the numerator, it also reduces the SE of the estimated slope;

(iii) a large sample size means that Σ{x_i − x̄}² is larger, and like (i) this reduces the SE. The formula, as written, tends to hide this last factor; note that Σ{x_i − x̄}² is what we use to compute the spread of a set of x's -- we simply divide it by n − 1 to get a variance and then take the square root to get the SD. To make the point here, simplify n − 1 to n and write Σ{x_i − x̄}² ≈ n·var(x), so that √[Σ{x_i − x̄}²] ≈ √n·sd(x), and the equation for the SE simplifies to

    SE(b) ≈ σ / [√n · sd(x)] = (SD_{y|x} / SD_x) / √n

with √n in its familiar place in the denominator of the SE (even in more complex SE's, this is where n is usually found!).

The structure of SE(a): In addition to the factors mentioned above, all of which come in again in the expected way, there is the additional factor of x̄²; since this is in the numerator, it increases the SE. This is natural in that if the data, and thus x̄, are far from x = 0, then any imprecision in the estimate of the slope will project backwards to a large imprecision in the estimated intercept. Also, if one uses 'centered' x's, so that x̄ = 0, the formula for the SE reduces to SE(a) = σ·√(1/n) = σ/√n, and we recognize this as SE(ȳ) -- not surprisingly, since ȳ is the 'intercept' for centered data.

CI's and Tests of Significance for β̂ and α̂ are based on the t distribution (or on Gaussian Z's if n is large):

    β̂ ± t_{n−2} · SE(β̂)        H0: β = 0:   t_{n−2} = β̂ / SE(β̂)

    α̂ ± t_{n−2} · SE(α̂)        H0: α = 0:   t_{n−2} = α̂ / SE(α̂)
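A sketch of these computations on made-up data, estimating σ by the residual SD with n − 2 degrees of freedom; scipy is assumed to be available for the t quantile.

```python
# Sketch (assumed data): SE(b), SE(a), t-based confidence intervals and the test of H0: beta = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 6.2, 6.4])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

resid = y - (a + b * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))     # residual SD, df = n - 2

sxx = np.sum((x - x.mean()) ** 2)
se_b = sigma_hat / np.sqrt(sxx)
se_a = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / sxx)

t_crit = stats.t.ppf(0.975, df=n - 2)
print("b:", b, "95% CI:", (b - t_crit * se_b, b + t_crit * se_b))
print("t for H0: beta = 0 :", b / se_b)
print("a:", a, "95% CI:", (a - t_crit * se_a, a + t_crit * se_a))
```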

Standard Error for the Estimated µ_{Y|X}, or 'average Y at X'

We estimate the 'average Y at X', or µ_{Y|X}, by α̂ + β̂·X. Since the estimate is based on two estimated quantities, each of which is subject to sampling variation, it contains the uncertainty of both:

    SE(estimated average Y at X) = σ · √[ 1/n + {X − x̄}² / Σ{x_i − x̄}² ]

Again, we must use an estimate σ̂ of σ.

First-time users of this formula suspect that it has something missing, or an x instead of an x̄, or some other typographical error. There is no typographical error, and indeed if one examines it closely, it makes sense. X refers to the x-value at which one is estimating the mean -- it has nothing to do with the actual x's in the study which generated the estimated coefficients, except that the closer X is to the centre of the data, the smaller the quantity {X − x̄}, and thus the quantity {X − x̄}², and thus the SE, will be. Indeed, if we estimate the average Y right at X = x̄, the estimate is simply ȳ (since the fitted line goes through [x̄, ȳ]) and its SE will be σ·√[1/n + {x̄ − x̄}²/Σ{x_i − x̄}²] = σ·√(1/n) = σ/√n = SE(ȳ).

Confidence Interval for an individual Y at X

A certain percentage P% of individuals are within t_P·σ of the mean µ_{Y|X} = α + β·X, where t_P is a multiple, depending on P, from the t (or, if n is large, the Z) table. However, we are not quite certain where exactly the mean α + β·X is -- the best we can do is estimate, with a certain P% confidence, that it is within t_P·SE(α̂ + β̂·X) of the point estimate α̂ + β̂·X. The uncertainty concerning the mean and the natural variation of individuals around the mean -- wherever it is -- combine in the expression for the estimated P% range of individual variation, which is as follows:

    α̂ + β̂·X  ±  t_P · σ̂ · √[ 1 + 1/n + {X − x̄}² / Σ{x_i − x̄}² ]

Both the CI for the estimated mean and the CI for individuals (i.e. the estimated percentiles of the distribution) are bow-shaped when drawn as a function of X. They are narrowest at X = x̄, and fan out from there. One needs to be careful not to confuse the much narrower CI for the mean with the much wider CI for individuals. If one can see the raw data, it is usually obvious which is which -- the CI for individuals is almost as wide as the raw data themselves. cf. data on sleeping through the night; alcohol levels and eye speed.
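Continuing the same made-up data as above: the sketch below evaluates the interval for the mean at a chosen X and the much wider interval for an individual, both of which are narrowest at X = x̄ and fan out from there.

```python
# Sketch (assumed data): CI for the mean at X vs the interval for an individual Y at X.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.6, 6.2, 6.4])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
sigma_hat = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

for X in (x.mean(), 6.0):                       # at the centre of the data and near the edge
    fit = a + b * X
    se_mean = sigma_hat * np.sqrt(1 / n + (X - x.mean()) ** 2 / sxx)
    se_indiv = sigma_hat * np.sqrt(1 + 1 / n + (X - x.mean()) ** 2 / sxx)
    print(f"X={X:4.1f}  mean: {fit - t_crit*se_mean:.2f}..{fit + t_crit*se_mean:.2f}"
          f"   individual: {fit - t_crit*se_indiv:.2f}..{fit + t_crit*se_indiv:.2f}")
```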
