Stat 328 Regression Summary


Simple Linear Regression Model

The basic (normal) "simple linear regression" model says that a response/output variable $y$ depends on an explanatory/input/system variable $x$ in a "noisy but linear" way. That is, one supposes that there is a linear relationship between $x$ and mean $y$,

$$\mu_{y|x} = \beta_0 + \beta_1 x$$

and that (for fixed $x$) there is around that mean a distribution of $y$ that is normal. Further, the model assumption is that the standard deviation of the response distribution is constant in $x$. In symbols it is standard to write

$$y = \beta_0 + \beta_1 x + \epsilon$$

where $\epsilon$ is normal with mean 0 and standard deviation $\sigma$. This describes one $y$. Where several observations $y_i$ with corresponding values $x_i$ are under consideration, the assumption is that the $y_i$ (the $\epsilon_i$) are independent. (The $\epsilon_i$ are conceptually equivalent to unrelated random draws from the same fixed normal continuous distribution.) The model statement in its full glory is then

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \quad \text{for } i = 1, \ldots, n$$

where the $\epsilon_i$ are independent normal random variables with mean 0 and standard deviation $\sigma$.

The model statement above is a perfectly theoretical matter. One can begin with it, and for specific choices of $\beta_0$, $\beta_1$, and $\sigma$, find probabilities for $y$ at given values of $x$. In applications, the real mode of operation is instead to take data pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and use them to make inferences about the parameters $\beta_0$, $\beta_1$, and $\sigma$, and to make predictions based on the estimates (based on the empirically fitted model).

Descriptive Analysis of Approximately Linear $(x, y)$ Data

After plotting $(x_i, y_i)$ data to determine that the "linear in $x$ mean of $y$" model makes some sense, it is reasonable to try to quantify "how linear" the data look and to find a line of "best fit" to the scatterplot of the data. The sample correlation between $y$ and $x$,

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

is a measure of strength of linear relationship between $x$ and $y$.

Calculus can be invoked to find a slope $\beta_1$ and intercept $\beta_0$ minimizing the sum of squared vertical distances from data points to a fitted line, $\sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$. These "least squares" values are

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$

It is further common to refer to the value of $y$ on the "least squares line" corresponding to $x_i$ as a fitted or predicted value

$$\hat{y}_i = b_0 + b_1 x_i$$

One might take the difference between what is observed ($y_i$) and what is "predicted" or "explained" ($\hat{y}_i$) as a kind of leftover part or "residual" corresponding to a data value,

$$e_i = y_i - \hat{y}_i$$

The sum $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is most of the sample variance of the values $y_i$. It is a measure of raw variation in the response variable. People often call it the "total sum of squares" and write

$$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The sum of squared residuals is a measure of variation in response remaining unaccounted for after fitting a line to the data. People often call it the "error sum of squares" and write

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

One is guaranteed that $SSTO \geq SSE$. So the difference $SSTO - SSE$ is a non-negative measure of variation accounted for in fitting a line to $(x, y)$ data. People often call it the "regression sum of squares" and write

$$SSR = SSTO - SSE$$
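The least squares and sum-of-squares definitions above are easy to compute directly. The following sketch (plain Python, with a small invented five-point data set used purely for illustration; these numbers are not from the course) computes the least squares coefficients, residuals, and all three sums of squares:

```python
# Least squares fit and sums of squares for a small invented (x, y) data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2), b0 = ybar - b1*xbar
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]          # fitted values on the least squares line
e = [yi - yh for yi, yh in zip(y, yhat)]   # residuals e_i = y_i - yhat_i

SSTO = sum((yi - ybar) ** 2 for yi in y)   # total sum of squares
SSE = sum(ei ** 2 for ei in e)             # error sum of squares
SSR = SSTO - SSE                           # regression sum of squares (non-negative)
```

Running this confirms the guaranteed decomposition $SSTO = SSR + SSE$ for the fitted line.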

The coefficient of determination expresses $SSR$ as a fraction of $SSTO$ and is

$$R^2 = \frac{SSR}{SSTO}$$

which is interpreted as "the fraction of raw variation in $y$ accounted for in the model fitting process."

Parameter Estimates for SLR

The descriptive statistics for $(x, y)$ data can be used to provide "single number estimates" of the (typically unknown) parameters of the simple linear regression model. That is, the slope of the least squares line can serve as an estimate of $\beta_1$,

$$\hat{\beta}_1 = b_1$$

and the intercept of the least squares line can serve as an estimate of $\beta_0$,

$$\hat{\beta}_0 = b_0$$

The variance of $y$ for a given $x$ can be estimated by a kind of average of squared residuals,

$$s^2 = \frac{SSE}{n-2} = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2$$

Of course, the square root of this "regression sample variance" is $s = \sqrt{s^2}$, and it serves as a single number estimate of $\sigma$.

Interval-Based Inference Methods for SLR

The normal simple linear regression model provides inference formulas for model parameters. Confidence limits for $\beta_1$ (the slope of the line relating mean $y$ to $x$... the rate of change of average $y$ with respect to $x$) are

$$b_1 \pm t \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

where $t$ is a quantile of the $t$ distribution with $\nu = n - 2$ degrees of freedom. Dielman calls the ratio $s \big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ the standard error of $b_1$ and uses the symbol $s_{b_1}$ for it. In these terms, the confidence limits for $\beta_1$ are

$$b_1 \pm t\, s_{b_1}$$
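A minimal sketch of these estimates and the slope interval, in plain Python on an invented five-point data set. The value 3.182 is the 0.975 quantile of the $t$ distribution with $n - 2 = 3$ degrees of freedom, taken from a $t$ table rather than computed:

```python
# R^2, the estimate s of sigma, and 95% confidence limits for beta_1
# on an invented five-point data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
SSTO = sum((yi - ybar) ** 2 for yi in y)
R2 = (SSTO - SSE) / SSTO         # coefficient of determination
s = (SSE / (n - 2)) ** 0.5       # single number estimate of sigma

se_b1 = s / sxx ** 0.5           # standard error of the slope, s_{b1}
t = 3.182                        # 0.975 quantile, t distribution with 3 df (t table)
lo, hi = b1 - t * se_b1, b1 + t * se_b1   # 95% confidence limits for beta_1
```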

Confidence limits for $\mu_{y|x} = \beta_0 + \beta_1 x$ (the mean value of $y$ at a given value $x$) are

$$(b_0 + b_1 x) \pm t\, s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

One can abbreviate $(b_0 + b_1 x)$ as $\hat{y}$, and Dielman calls $s\sqrt{\tfrac{1}{n} + \tfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$ the standard error of the mean and uses the symbol $s_m$ for it. In these terms, the confidence limits for $\mu_{y|x}$ are

$$\hat{y} \pm t\, s_m$$

(Note that by choosing $x = 0$, this formula provides confidence limits for $\beta_0$, though this parameter is rarely of independent practical interest.)

Prediction limits for an additional observation $y$ at a given value $x$ are

$$(b_0 + b_1 x) \pm t\, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Dielman calls $s\sqrt{1 + \tfrac{1}{n} + \tfrac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$ the standard error of prediction and uses the symbol $s_p$ for it. It is on occasion useful to notice that

$$s_p = \sqrt{s^2 + s_m^2}$$

Using the $s_p$ notation, prediction limits for an additional $y$ at $x$ are

$$\hat{y} \pm t\, s_p$$

Hypothesis Tests and SLR

The normal simple linear regression model supports hypothesis testing. $H_0\!: \beta_1 = \#$ can be tested using the test statistic

$$t = \frac{b_1 - \#}{s_{b_1}} = \frac{b_1 - \#}{s \big/ \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
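The two interval formulas differ only by the extra "1" under the square root, and $s_p^2 = s^2 + s_m^2$ ties them together. A sketch in plain Python, again on an invented five-point data set, with the $t$ quantile (3.182, for 3 df at 95%) hardcoded from a $t$ table and the new-observation value $x_0 = 3.5$ chosen arbitrarily:

```python
# Confidence limits for the mean response and prediction limits for a new y
# at x0 = 3.5, on an invented five-point data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s = (sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)) ** 0.5

x0 = 3.5
yhat = b0 + b1 * x0
t = 3.182                                              # t quantile, 3 df, 95%
s_m = s * (1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5      # standard error of the mean
s_p = s * (1 + 1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5  # standard error of prediction

mean_limits = (yhat - t * s_m, yhat + t * s_m)
pred_limits = (yhat - t * s_p, yhat + t * s_p)
# The prediction interval is always the wider of the two.
```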

and a $t_{n-2}$ reference distribution. $H_0\!: \mu_{y|x} = \#$ can be tested using the test statistic

$$t = \frac{\hat{y} - \#}{s_m} = \frac{(b_0 + b_1 x) - \#}{s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}}$$

and a $t_{n-2}$ reference distribution.

ANOVA and SLR

The breaking down of $SSTO$ into $SSR$ and $SSE$ can be thought of as a kind of "analysis of variance" in $y$. That enterprise is often summarized in a special kind of table. The general form is as below.

ANOVA Table (for SLR)

Source       SS      df     MS                 F
Regression   SSR     1      MSR = SSR/1        F = MSR/MSE
Error        SSE     n-2    MSE = SSE/(n-2)
Total        SSTO    n-1

In this table the ratios of sums of squares to degrees of freedom are called "mean squares." The mean square for error is, in fact, the estimate of $\sigma^2$ (i.e. $MSE = s^2$). As it turns out, the ratio in the "F" column can be used as a test statistic for the hypothesis $H_0\!: \beta_1 = 0$. The appropriate reference distribution is the $F_{1, n-2}$ distribution. As it turns out, the value $F = MSR/MSE$ is the square of the $t$ statistic for testing this hypothesis, and the $F$ test produces exactly the same $p$-values as a two-sided $t$ test.

Standardized Residuals and SLR

The theoretical variances of the residuals turn out to depend upon their corresponding $x$ values. As a means of putting these residuals all on the same footing, it is common to "standardize" them by dividing by an estimated standard deviation for each. This produces standardized residuals

$$e_i^* = \frac{e_i}{s\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}}}$$

These (if the normal simple linear regression model is a good one) "ought" to look as if they are approximately normal with mean 0 and standard deviation 1. Various kinds of plotting with these standardized residuals (or with the raw residuals) are used as means of "model checking" or "model diagnostics."
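The claim that the ANOVA $F$ statistic is the square of the slope $t$ statistic can be verified numerically. A sketch in plain Python on an invented five-point data set:

```python
# SLR ANOVA decomposition and the identity F = t^2 for H0: beta_1 = 0,
# on an invented five-point data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

SSTO = sum((yi - ybar) ** 2 for yi in y)
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
SSR = SSTO - SSE

MSR = SSR / 1           # regression mean square (1 numerator df in SLR)
MSE = SSE / (n - 2)     # error mean square; equals s^2
F = MSR / MSE           # compare to the F distribution with (1, n-2) df

s = MSE ** 0.5
t_stat = b1 / (s / sxx ** 0.5)  # t statistic for H0: beta_1 = 0
# F agrees with t_stat squared (up to floating point error).
```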

Multiple Linear Regression Model

The basic (normal) "multiple linear regression" model says that a response/output variable $y$ depends on explanatory/input/system variables $x_1, \ldots, x_k$ in a "noisy but linear" way. That is, one supposes that there is a linear relationship between $x_1, \ldots, x_k$ and mean $y$,

$$\mu_{y|x_1, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

and that (for fixed $x_1, \ldots, x_k$) there is around that mean a distribution of $y$ that is normal. Further, the standard assumption is that the standard deviation of the response distribution is constant in $x_1, \ldots, x_k$. In symbols it is standard to write

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$

where $\epsilon$ is normal with mean 0 and standard deviation $\sigma$. This describes one $y$. Where several observations $y_i$ with corresponding values $x_{1i}, \ldots, x_{ki}$ are under consideration, the assumption is that the $y_i$ (the $\epsilon_i$) are independent. (The $\epsilon_i$ are conceptually equivalent to unrelated random draws from the same fixed normal continuous distribution.) The model statement in its full glory is then

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i \quad \text{for } i = 1, \ldots, n$$

where the $\epsilon_i$ are independent normal random variables with mean 0 and standard deviation $\sigma$.

The model statement above is a perfectly theoretical matter. One can begin with it, and for specific choices of $\beta_0, \beta_1, \ldots, \beta_k$ and $\sigma$, find probabilities for $y$ at given values of $x_1, \ldots, x_k$. In applications, the real mode of operation is instead to take data vectors $(x_{11}, x_{21}, \ldots, x_{k1}, y_1), \ldots, (x_{1n}, x_{2n}, \ldots, x_{kn}, y_n)$ and use them to make inferences about the parameters $\beta_0, \beta_1, \ldots, \beta_k$ and $\sigma$, and to make predictions based on the estimates (based on the empirically fitted model).

Descriptive Analysis of Approximately Linear $(x_1, \ldots, x_k, y)$ Data

Calculus can be invoked to find coefficients $\beta_0, \beta_1, \ldots, \beta_k$ minimizing the sum of squared vertical distances from data points in $(k+1)$-dimensional space to a fitted surface, $\sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}))^2$. These "least squares" values DO NOT have simple formulas (unless one is willing to use matrix notation). And in particular, one can NOT simply somehow use the formulas from simple linear regression in this more complicated context. We will call these minimizing coefficients $b_0, b_1, \ldots, b_k$ and need to rely upon JMP to produce them for us.

It is further common to refer to the value of $y$ on the "least squares surface" corresponding to $x_1, \ldots, x_k$ as a fitted or predicted value

$$\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki}$$

Exactly as in SLR, one takes the difference between what is observed ($y_i$) and what is "predicted" or "explained" ($\hat{y}_i$) as a kind of leftover part or "residual" corresponding to a data value,

$$e_i = y_i - \hat{y}_i$$

The total sum of squares, $SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$, is (still) most of the sample variance of the values $y_i$ and measures raw variation in the response variable. Just as in SLR, the sum of squared residuals is a measure of variation in response remaining unaccounted for after fitting the equation to the data. As in SLR, people call it the error sum of squares and write

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

(The formula looks exactly like the one for SLR. It is simply the case that now $\hat{y}_i$ is computed using all $k$ inputs, not just a single $x$.) One is still guaranteed that $SSTO \geq SSE$. So the difference $SSTO - SSE$ is a non-negative measure of variation accounted for in fitting the linear equation to the data. As in SLR, people call it the regression sum of squares and write

$$SSR = SSTO - SSE$$

The coefficient of (multiple) determination expresses $SSR$ as a fraction of $SSTO$ and is

$$R^2 = \frac{SSR}{SSTO}$$

which is interpreted as "the fraction of raw variation in $y$ accounted for in the model fitting process." This quantity can also be interpreted in terms of a correlation, as it turns out to be the square of the sample linear correlation between the observations $y_i$ and the fitted or predicted values $\hat{y}_i$.

Parameter Estimates for MLR

The descriptive statistics for $(x_1, \ldots, x_k, y)$ data can be used to provide "single number estimates" of the (typically unknown) parameters of the multiple linear regression model. That is, the least squares coefficients $b_0, b_1, \ldots, b_k$ serve as estimates of the parameters $\beta_0, \beta_1, \ldots, \beta_k$. The first of these is a kind of high-dimensional "intercept"
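Although the MLR least squares coefficients have no simple scalar formulas, they are easy to obtain in matrix notation. A sketch using numpy (standing in for the JMP computations the summary relies on), with a small invented two-predictor data set, which also checks the correlation interpretation of $R^2$:

```python
# MLR least squares in matrix notation on invented two-predictor data.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.0])
n = len(y)

# Design matrix with a leading column of 1's for the intercept b0.
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # b = (b0, b1, b2)

yhat = X @ b                                # fitted values on the LS surface
e = y - yhat                                # residuals
SSTO = float(np.sum((y - y.mean()) ** 2))
SSE = float(np.sum(e ** 2))
SSR = SSTO - SSE
R2 = SSR / SSTO

# R^2 equals the squared sample correlation between y and yhat.
r = float(np.corrcoef(y, yhat)[0, 1])
```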

and in the case where the predictors are not functionally related, the others serve as rates of change of average $y$ with respect to a single $x$, provided the other $x$'s are held fixed.

The variance of $y$ for fixed values $x_1, \ldots, x_k$ can be estimated by a kind of average of squared residuals,

$$s^2 = \frac{SSE}{n - k - 1} = \frac{1}{n - k - 1} \sum_{i=1}^{n} e_i^2$$

The square root of this "regression sample variance" is $s = \sqrt{s^2}$, and it serves as a single number estimate of $\sigma$.

Interval-Based Inference Methods for MLR

The normal multiple linear regression model provides inference formulas for model parameters. Confidence limits for $\beta_j$ (the rate of change of average $y$ with respect to $x_j$) are

$$b_j \pm t\, s_{b_j}$$

where $t$ is a quantile of the $t$ distribution with $\nu = n - k - 1$ degrees of freedom and $s_{b_j}$ is a standard error of $b_j$. There is no simple formula for $s_{b_j}$, and in particular, one can NOT simply somehow use the formula from simple linear regression in this more complicated context. It IS the case that this standard error is a multiple of $s$, but we will have to rely upon JMP to provide it for us.

Confidence limits for $\mu_{y|x_1, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ (the mean value of $y$ at a particular choice of the $x$'s) are

$$\hat{y} \pm t\, s_m$$

where $s_m$ is a standard error for $\hat{y}$ and depends upon which values of $x_1, \ldots, x_k$ are under consideration. There is no simple formula for $s_m$, and in particular, one can NOT simply somehow use the formula from simple linear regression in this more complicated context. It IS the case that this standard error is a multiple of $s$, but we will have to rely upon JMP to provide it for us.

Prediction limits for an additional observation $y$ at a given vector $(x_1, \ldots, x_k)$ are

$$\hat{y} \pm t\, s_p$$

where $s_p$ is a standard error of prediction. It remains true (as in SLR) that

$$s_p = \sqrt{s^2 + s_m^2}$$

but otherwise there are no simple formulas for it, and in particular, one can NOT simply somehow use the formula from simple linear regression in this more complicated context.
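The matrix fact behind the "multiple of $s$" remark is that $s_{b_j} = s\sqrt{[(X'X)^{-1}]_{jj}}$, the standard linear-model result a package like JMP computes internally. A numpy sketch on invented two-predictor data:

```python
# Standard errors of MLR coefficients via (X'X)^{-1}, on invented data.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.0])
n, k = len(y), 2

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                       # least squares coefficients (b0, b1, b2)

SSE = float(np.sum((y - X @ b) ** 2))
s = (SSE / (n - k - 1)) ** 0.5              # estimate of sigma on n-k-1 df

# Each standard error is s times a data-dependent constant, as the summary notes.
se_b = s * np.sqrt(np.diag(XtX_inv))        # (s_{b0}, s_{b1}, s_{b2})
t_stats = b / se_b                          # t statistics for H0: beta_j = 0
```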

Hypothesis Tests and MLR

The normal multiple linear regression model supports hypothesis testing. $H_0\!: \beta_j = \#$ can be tested using the test statistic

$$t = \frac{b_j - \#}{s_{b_j}}$$

and a $t_{n-k-1}$ reference distribution. $H_0\!: \mu_{y|x_1, \ldots, x_k} = \#$ can be tested using the test statistic

$$t = \frac{\hat{y} - \#}{s_m}$$

and a $t_{n-k-1}$ reference distribution.

ANOVA and MLR

As in SLR, the breaking down of $SSTO$ into $SSR$ and $SSE$ can be thought of as a kind of "analysis of variance" in $y$, and summarized in a special kind of table. The general form for MLR is as below.

ANOVA Table (for MLR Overall F Test)

Source       SS      df       MS                   F
Regression   SSR     k        MSR = SSR/k          F = MSR/MSE
Error        SSE     n-k-1    MSE = SSE/(n-k-1)
Total        SSTO    n-1

(Note that as in SLR, the mean square for error is, in fact, the estimate of $\sigma^2$, i.e. $MSE = s^2$.) As it turns out, the ratio in the "F" column can be used as a test statistic for the hypothesis $H_0\!: \beta_1 = \beta_2 = \cdots = \beta_k = 0$. The appropriate reference distribution is the $F_{k, n-k-1}$ distribution.

"Partial F Tests" in MLR

It is possible to use ANOVA ideas to invent F tests for investigating whether some whole group of $\beta$'s (short of the entire set) are all 0. For example, one might want to test the hypothesis

$$H_0\!: \beta_{l+1} = \beta_{l+2} = \cdots = \beta_k = 0$$

(This is the hypothesis that only the first $l$ of the $k$ input variables $x$ have any impact on the mean system response... the hypothesis that the first $l$ of the $x$'s are adequate to predict $y$... the hypothesis that after accounting for the first $l$ of the $x$'s, the others do not contribute "significantly" to one's ability to explain or predict $y$.) If we call the model for $y$ in terms of all $k$ of the predictors the "full model" and the model for $y$ involving only $x_1$ through $x_l$ the "reduced model," then an $F$ test of the above hypothesis can be made using the statistic

$$F = \frac{(SSR_{\text{full}} - SSR_{\text{reduced}})/(k - l)}{MSE_{\text{full}}}$$

and an $F_{k-l,\, n-k-1}$ reference distribution. $SSR_{\text{full}} \geq SSR_{\text{reduced}}$, so the numerator here is non-negative. Finding a $p$-value for this kind of test is a means of judging whether $R^2$ for the full model is "significantly"/detectably larger than $R^2$ for the reduced model. (Caution here: statistical significance is not the same as practical importance. With a big enough data set, essentially any increase in $R^2$ will produce a small $p$-value.)

It is reasonably common to expand the basic MLR ANOVA table to organize calculations for this test statistic. This is the

(Expanded) ANOVA Table (for MLR)

Source                                 SS                    df       MS                             F
Regression                             SSR(full)             k        SSR(full)/k                    MSR(full)/MSE(full)
  x_1, ..., x_l                        SSR(red)              l
  x_{l+1}, ..., x_k | x_1, ..., x_l    SSR(full)-SSR(red)    k-l      (SSR(full)-SSR(red))/(k-l)     ((SSR(full)-SSR(red))/(k-l))/MSE(full)
Error                                  SSE(full)             n-k-1    MSE(full) = SSE(full)/(n-k-1)
Total                                  SSTO                  n-1

Standardized Residuals in MLR

As in SLR, people sometimes wish to standardize residuals before using them to do model checking/diagnostics. While it is not possible to give a simple formula for the "standard error of $e_i$" without using matrix notation, most MLR programs will compute these values. The standardized residual for data point $i$ is then (as in SLR)

$$e_i^* = \frac{e_i}{\text{standard error of } e_i}$$

If the normal multiple linear regression model is a good one, these "ought" to look as if they are approximately normal with mean 0 and standard deviation 1.
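The full-versus-reduced comparison behind the partial $F$ statistic can be sketched by fitting both models and differencing their regression sums of squares. A numpy illustration on invented data with $k = 2$ and $l = 1$ (testing whether $x_2$ adds anything beyond $x_1$):

```python
# Partial F test sketch: full model (x1, x2) vs. reduced model (x1 only),
# on invented data, so k = 2 and l = 1.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.0])
n, k, l = len(y), 2, 1

def fit_sse_ssr(X, y):
    """Least squares fit; return (SSE, SSR) for the given design matrix."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    SSTO = float(np.sum((y - y.mean()) ** 2))
    SSE = float(np.sum(e ** 2))
    return SSE, SSTO - SSE

X_full = np.column_stack([np.ones(n), x1, x2])
X_red  = np.column_stack([np.ones(n), x1])

SSE_full, SSR_full = fit_sse_ssr(X_full, y)
SSE_red,  SSR_red  = fit_sse_ssr(X_red, y)

MSE_full = SSE_full / (n - k - 1)
# Compare F to the F distribution with (k-l, n-k-1) degrees of freedom.
F = (SSR_full - SSR_red) / (k - l) / MSE_full
```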

Intervals and Tests for Linear Combinations of $\beta$'s in MLR

It is sometimes important to do inference for a linear combination of MLR model coefficients,

$$L = c_0 \beta_0 + c_1 \beta_1 + c_2 \beta_2 + \cdots + c_k \beta_k$$

(where $c_0, c_1, \ldots, c_k$ are known constants). Note, for example, that $\mu_{y|x_1, \ldots, x_k}$ is of this form for $c_0 = 1$, $c_1 = x_1$, $c_2 = x_2$, ..., and $c_k = x_k$. Note too that a difference in mean responses at two sets of predictors, say $(x_1^{(1)}, \ldots, x_k^{(1)})$ and $(x_1^{(2)}, \ldots, x_k^{(2)})$, is of this form for $c_0 = 0$, $c_1 = x_1^{(1)} - x_1^{(2)}$, $c_2 = x_2^{(1)} - x_2^{(2)}$, ..., and $c_k = x_k^{(1)} - x_k^{(2)}$.

An obvious estimate of $L$ is

$$\hat{L} = c_0 b_0 + c_1 b_1 + c_2 b_2 + \cdots + c_k b_k$$

Confidence limits for $L$ are

$$\hat{L} \pm t\, (\text{standard error of } \hat{L})$$

There is no simple formula for "standard error of $\hat{L}$." This standard error is a multiple of $s$, but we will have to rely upon JMP to provide it for us. (Computation of $\hat{L}$ and its standard error is under the "Custom Test" option in JMP.)

$H_0\!: L = \#$ can be tested using the test statistic

$$t = \frac{\hat{L} - \#}{\text{standard error of } \hat{L}}$$

and a $t_{n-k-1}$ reference distribution. Or, if one thinks about it for a while, it is possible to find a reduced model that corresponds to the restriction that the null hypothesis places on the MLR model and to use an "$l = k - 1$" partial $F$ test (with 1 and $n - k - 1$ degrees of freedom) equivalent to the $t$ test for this purpose.
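In matrix notation the standard error of $\hat{L}$ is $s\sqrt{c'(X'X)^{-1}c}$, which is the quantity a package's custom-test facility reports. A numpy sketch on invented two-predictor data, with $c$ chosen (arbitrarily, for illustration) so that $L$ is the mean response at $(x_1, x_2) = (3.0, 1.0)$:

```python
# Estimating a linear combination of MLR coefficients and its standard error,
# on invented two-predictor data.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.0])
n, k = len(y), 2

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s = (float(np.sum((y - X @ b) ** 2)) / (n - k - 1)) ** 0.5

# c0 = 1, c1 = 3, c2 = 1 makes L the mean response at (x1, x2) = (3, 1).
c = np.array([1.0, 3.0, 1.0])
L_hat = float(c @ b)                          # estimate of L
se_L = s * float(c @ XtX_inv @ c) ** 0.5      # standard error, a multiple of s

t_stat = L_hat / se_L   # test statistic for H0: L = 0, t with n-k-1 df
```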


More information

Lecture 19: Inference for SLR & Transformations

Lecture 19: Inference for SLR & Transformations Lecture 19: Inference for SLR & Transformations Statistics 101 Mine Çetinkaya-Rundel April 3, 2012 Announcements Announcements HW 7 due Thursday. Correlation guessing game - ends on April 12 at noon. Winner

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

Lecture 14 Simple Linear Regression

Lecture 14 Simple Linear Regression Lecture 4 Simple Linear Regression Ordinary Least Squares (OLS) Consider the following simple linear regression model where, for each unit i, Y i is the dependent variable (response). X i is the independent

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.

Estimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X. Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.

More information

Simple Linear Regression. Material from Devore s book (Ed 8), and Cengagebrain.com

Simple Linear Regression. Material from Devore s book (Ed 8), and Cengagebrain.com 12 Simple Linear Regression Material from Devore s book (Ed 8), and Cengagebrain.com The Simple Linear Regression Model The simplest deterministic mathematical relationship between two variables x and

More information

Multiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company

Multiple Regression. Inference for Multiple Regression and A Case Study. IPS Chapters 11.1 and W.H. Freeman and Company Multiple Regression Inference for Multiple Regression and A Case Study IPS Chapters 11.1 and 11.2 2009 W.H. Freeman and Company Objectives (IPS Chapters 11.1 and 11.2) Multiple regression Data for multiple

More information

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i B. Weaver (24-Mar-2005) Multiple Regression... 1 Chapter 5: Multiple Regression 5.1 Partial and semi-partial correlation Before starting on multiple regression per se, we need to consider the concepts

More information

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression.

Variance. Standard deviation VAR = = value. Unbiased SD = SD = 10/23/2011. Functional Connectivity Correlation and Regression. 10/3/011 Functional Connectivity Correlation and Regression Variance VAR = Standard deviation Standard deviation SD = Unbiased SD = 1 10/3/011 Standard error Confidence interval SE = CI = = t value for

More information

Simple linear regression

Simple linear regression Simple linear regression Biometry 755 Spring 2008 Simple linear regression p. 1/40 Overview of regression analysis Evaluate relationship between one or more independent variables (X 1,...,X k ) and a single

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

Section 9.4. Notation. Requirements. Definition. Inferences About Two Means (Matched Pairs) Examples

Section 9.4. Notation. Requirements. Definition. Inferences About Two Means (Matched Pairs) Examples Objective Section 9.4 Inferences About Two Means (Matched Pairs) Compare of two matched-paired means using two samples from each population. Hypothesis Tests and Confidence Intervals of two dependent means

More information

Data Set 1A: Algal Photosynthesis vs. Salinity and Temperature

Data Set 1A: Algal Photosynthesis vs. Salinity and Temperature Data Set A: Algal Photosynthesis vs. Salinity and Temperature Statistical setting These data are from a controlled experiment in which two quantitative variables were manipulated, to determine their effects

More information

Inference for the Regression Coefficient

Inference for the Regression Coefficient Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression line. We can shows that b 0 and b 1 are the unbiased estimates

More information

Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals

Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals Assumptions, Diagnostics, and Inferences for the Simple Linear Regression Model with Normal Residuals 4 December 2018 1 The Simple Linear Regression Model with Normal Residuals In previous class sessions,

More information

STAT420 Midterm Exam. University of Illinois Urbana-Champaign October 19 (Friday), :00 4:15p. SOLUTIONS (Yellow)

STAT420 Midterm Exam. University of Illinois Urbana-Champaign October 19 (Friday), :00 4:15p. SOLUTIONS (Yellow) STAT40 Midterm Exam University of Illinois Urbana-Champaign October 19 (Friday), 018 3:00 4:15p SOLUTIONS (Yellow) Question 1 (15 points) (10 points) 3 (50 points) extra ( points) Total (77 points) Points

More information

Statistics for Engineers Lecture 9 Linear Regression

Statistics for Engineers Lecture 9 Linear Regression Statistics for Engineers Lecture 9 Linear Regression Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu April 17, 2017 Chong Ma (Statistics, USC) STAT 509 Spring 2017 April

More information

Intro to Linear Regression

Intro to Linear Regression Intro to Linear Regression Introduction to Regression Regression is a statistical procedure for modeling the relationship among variables to predict the value of a dependent variable from one or more predictor

More information

Inference in Regression Analysis

Inference in Regression Analysis Inference in Regression Analysis Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 4, Slide 1 Today: Normal Error Regression Model Y i = β 0 + β 1 X i + ǫ i Y i value

More information

CHAPTER EIGHT Linear Regression

CHAPTER EIGHT Linear Regression 7 CHAPTER EIGHT Linear Regression 8. Scatter Diagram Example 8. A chemical engineer is investigating the effect of process operating temperature ( x ) on product yield ( y ). The study results in the following

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression OI CHAPTER 7 Important Concepts Correlation (r or R) and Coefficient of determination (R 2 ) Interpreting y-intercept and slope coefficients Inference (hypothesis testing and confidence

More information

Chapter 27 Summary Inferences for Regression

Chapter 27 Summary Inferences for Regression Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test

More information

Statistics II Exercises Chapter 5

Statistics II Exercises Chapter 5 Statistics II Exercises Chapter 5 1. Consider the four datasets provided in the transparencies for Chapter 5 (section 5.1) (a) Check that all four datasets generate exactly the same LS linear regression

More information

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont. TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted

More information

28. SIMPLE LINEAR REGRESSION III

28. SIMPLE LINEAR REGRESSION III 28. SIMPLE LINEAR REGRESSION III Fitted Values and Residuals To each observed x i, there corresponds a y-value on the fitted line, y = βˆ + βˆ x. The are called fitted values. ŷ i They are the values of

More information

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS 1a) The model is cw i = β 0 + β 1 el i + ɛ i, where cw i is the weight of the ith chick, el i the length of the egg from which it hatched, and ɛ i

More information

Stat 401B Exam 2 Fall 2015

Stat 401B Exam 2 Fall 2015 Stat 401B Exam Fall 015 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning

More information

Hypothesis Testing. We normally talk about two types of hypothesis: the null hypothesis and the research or alternative hypothesis.

Hypothesis Testing. We normally talk about two types of hypothesis: the null hypothesis and the research or alternative hypothesis. Hypothesis Testing Today, we are going to begin talking about the idea of hypothesis testing how we can use statistics to show that our causal models are valid or invalid. We normally talk about two types

More information

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression INTRODUCTION TO CLINICAL RESEARCH Introduction to Linear Regression Karen Bandeen-Roche, Ph.D. July 17, 2012 Acknowledgements Marie Diener-West Rick Thompson ICTR Leadership / Team JHU Intro to Clinical

More information

Correlation and Linear Regression

Correlation and Linear Regression Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression

36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression 36-309/749 Experimental Design for Behavioral and Social Sciences Sep. 22, 2015 Lecture 4: Linear Regression TCELL Simple Regression Example Male black wheatear birds carry stones to the nest as a form

More information

1.) Fit the full model, i.e., allow for separate regression lines (different slopes and intercepts) for each species

1.) Fit the full model, i.e., allow for separate regression lines (different slopes and intercepts) for each species Lecture notes 2/22/2000 Dummy variables and extra SS F-test Page 1 Crab claw size and closing force. Problem 7.25, 10.9, and 10.10 Regression for all species at once, i.e., include dummy variables for

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Dr. Bob Gee Dean Scott Bonney Professor William G. Journigan American Meridian University 1 Learning Objectives Upon successful completion of this module, the student should

More information

df=degrees of freedom = n - 1

df=degrees of freedom = n - 1 One sample t-test test of the mean Assumptions: Independent, random samples Approximately normal distribution (from intro class: σ is unknown, need to calculate and use s (sample standard deviation)) Hypotheses:

More information

III. Inferential Tools

III. Inferential Tools III. Inferential Tools A. Introduction to Bat Echolocation Data (10.1.1) 1. Q: Do echolocating bats expend more enery than non-echolocating bats and birds, after accounting for mass? 2. Strategy: (i) Explore

More information

The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January

More information

Stat 411/511 ESTIMATING THE SLOPE AND INTERCEPT. Charlotte Wickham. stat511.cwick.co.nz. Nov

Stat 411/511 ESTIMATING THE SLOPE AND INTERCEPT. Charlotte Wickham. stat511.cwick.co.nz. Nov Stat 411/511 ESTIMATING THE SLOPE AND INTERCEPT Nov 20 2015 Charlotte Wickham stat511.cwick.co.nz Quiz #4 This weekend, don t forget. Usual format Assumptions Display 7.5 p. 180 The ideal normal, simple

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College An example ANOVA situation Example (Treating Blisters) Subjects: 25 patients with blisters Treatments: Treatment A, Treatment

More information

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Structural Equation Modeling Topic 1: Correlation / Linear Regression Outline/Overview Correlations (r, pr, sr) Linear regression Multiple regression interpreting

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Explained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators Thilo Klein University of Cambridge Judge Business School Session 4: Linear regression,

More information

Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January

More information