Chapter 5 Linear least squares regression


Consider first simple linear regression. For now, the model of interest is only the model of the conditional expectations; the distribution of the residuals is ignored. The immediate objective is to determine estimators of the model coefficients, or parameters. Because there are many different ways of estimating the parameters, a decision must be made on which to use. The estimators used in conventional linear regression are optimal, or best, among a reasonably large class of estimators.

The least squares estimators minimize the sum of the squared residuals. No other set of unbiased estimators produces a smaller sum of squared residuals. Since the sum of the squared residuals is used in computing the estimate of residual error, informally at least, the least squares estimators have the least error among all (unbiased) estimators. The derivation of the least squares estimators is useful to understand, but understanding their properties is more useful.

We consider only the $n$ observation pairs $(y_1, x_1), \ldots, (y_n, x_n)$. The model of the $i$th observation $Y_i$ is
$$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i.$$
The objective is to use the $n$ observation pairs to estimate the parameters $\beta_0$ and $\beta_1$ in a manner that results in the smallest possible error, where error is defined (in this context) to be the sum of the squared residuals. Let $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ denote the $i$th fitted value (the estimators $\hat\beta_0$ and $\hat\beta_1$ have not been derived yet) and let $\hat\varepsilon_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$ denote the $i$th residual. The sum of the squared residuals is a function of the estimators $\hat\beta_0$ and $\hat\beta_1$:
$$S(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$
The calculus-based approach (the simplest) to deriving estimators of $\beta_0$ and $\beta_1$ minimizes $S(\beta_0, \beta_1)$ by the choice of $\beta_0$ and $\beta_1$. The values of $\beta_0$ and $\beta_1$ that minimize $S(\beta_0, \beta_1)$ are the least squares estimates.
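To make the objective concrete, the following R sketch evaluates the sum of squared residuals for two candidate lines. The five data pairs and the candidate line (intercept 0, slope 2) are hypothetical values chosen for illustration; they do not come from the text. `lm()` is R's standard least squares routine.

```r
# Hypothetical toy data (not from the text): five (x, y) pairs.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# S(b0, b1): sum of squared residuals for a candidate intercept b0 and slope b1.
S <- function(b0, b1) sum((y - b0 - b1 * x)^2)

# Least squares estimates from R's built-in fitting routine.
fit <- lm(y ~ x)
b_hat <- unname(coef(fit))

S_min   <- S(b_hat[1], b_hat[2])  # S at the least squares estimates
S_other <- S(0, 2)                # S at an arbitrary, plausible-looking line

# The least squares line has the smaller sum of squared residuals.
S_min < S_other
```

Any other candidate pair can be substituted for `(0, 2)`; the value of `S` should never fall below `S_min`.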

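$S(\beta_0, \beta_1)$ is recognizable as a quadratic function of the two variables $\beta_0$ and $\beta_1$, and it can be visualized in three-space as a bowl: a downward-pointing cone, though with a smooth rather than a pointed tip. The function has its minimum value at the tip. The tip lies above the solution pair $(\hat\beta_0, \hat\beta_1)$; these are the estimates that minimize the sum of the squared residuals.

To find the minimum of $S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$, the partial derivatives $\partial S(\beta_0, \beta_1)/\partial\beta_0$ and $\partial S(\beta_0, \beta_1)/\partial\beta_1$ are computed, set to zero, and the resulting equations solved for $\beta_0$ and $\beta_1$. First,
$$\begin{aligned}
\frac{\partial S(\beta_0, \beta_1)}{\partial \beta_0}
&= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
 = \sum_{i=1}^n \frac{\partial}{\partial \beta_0} (y_i - \beta_0 - \beta_1 x_i)^2 \\
&= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)
 = -2 (n\bar y - n\beta_0 - n\beta_1 \bar x),
\end{aligned}$$
since $\sum_{i=1}^n y_i = n\bar y$ and $\sum_{i=1}^n x_i = n\bar x$. Setting $\partial S(\beta_0, \beta_1)/\partial\beta_0 = 0$ and rearranging produces
$$\beta_0 + \bar x \beta_1 = \bar y.$$
Second,
$$\begin{aligned}
\frac{\partial S(\beta_0, \beta_1)}{\partial \beta_1}
&= \frac{\partial}{\partial \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
 = \sum_{i=1}^n \frac{\partial}{\partial \beta_1} (y_i - \beta_0 - \beta_1 x_i)^2 \\
&= -2 \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i)
 = -2 \left( \sum_{i=1}^n x_i y_i - n\beta_0 \bar x - \beta_1 \sum_{i=1}^n x_i^2 \right).
\end{aligned}$$
Setting $\partial S(\beta_0, \beta_1)/\partial\beta_1 = 0$ and rearranging produces
$$n\bar x \beta_0 + \sum_{i=1}^n x_i^2\, \beta_1 = \sum_{i=1}^n x_i y_i.$$
Collectively, the system of equations (known as the normal equations) is
$$\begin{aligned}
\beta_0 + \bar x \beta_1 &= \bar y \\
n\bar x \beta_0 + \sum_{i=1}^n x_i^2\, \beta_1 &= \sum_{i=1}^n x_i y_i.
\end{aligned}$$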
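The normal equations form a two-by-two linear system, so they can also be solved numerically. The sketch below sets up the coefficient matrix and right-hand side exactly as written above and solves them with base R's `solve()`; the data are hypothetical toy values, and the comparison with `lm()` is only a sanity check.

```r
# Hypothetical toy data (not from the text).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Coefficient matrix and right-hand side of the two normal equations:
#   beta0 + xbar * beta1                = ybar
#   n * xbar * beta0 + sum(x^2) * beta1 = sum(x * y)
A <- rbind(c(1,           mean(x)),
           c(n * mean(x), sum(x^2)))
b <- c(mean(y), sum(x * y))

beta_hat <- solve(A, b)               # (beta0_hat, beta1_hat)
beta_lm  <- unname(coef(lm(y ~ x)))   # the same estimates from lm()

all.equal(beta_hat, beta_lm)
```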

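The solution to the system is
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad
\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}.$$
The least squares estimate of $E(Y_i \mid x_i)$ is
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i = \bar y - \hat\beta_1 \bar x + \hat\beta_1 x_i = \bar y + (x_i - \bar x)\hat\beta_1. \qquad (1)$$
If you are able to use calculus, then you should verify the following. Without any explanatory variables, the linear model is $E(Y) = \beta_0$. The least squares estimator (derive this) is $\hat\beta_0 = n^{-1}\sum_{i=1}^n y_i = \bar y$. If there is an explanatory variable, then the least squares estimator of $E(Y \mid x_i)$ (formula 1) adjusts the estimator $\bar y$ according to how far $x_i$ is from $\bar x$. The degree of adjustment depends on the estimate of $\beta_1$.

Properties of the least squares estimators

1. The fitted value $\hat y_i$ given that $x_i = \bar x$ is $\bar y$. Therefore, the least squares line passes through the pair $(\bar x, \bar y)$.

2. The residuals are
$$\hat\varepsilon_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_i,$$
and the sum of the residuals is
$$\sum_{i=1}^n \hat\varepsilon_i = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)
= \sum_{i=1}^n (y_i - \bar y + \hat\beta_1 \bar x - \hat\beta_1 x_i)
= \sum_{i=1}^n (y_i - \bar y) - \hat\beta_1 \sum_{i=1}^n (x_i - \bar x).$$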

Since $\sum_{i=1}^n y_i = n\bar y$, it follows that $\sum_{i=1}^n (y_i - \bar y) = 0$; by the same argument $\sum_{i=1}^n (x_i - \bar x) = 0$, and so the sum of the residuals is
$$\sum_{i=1}^n \hat\varepsilon_i = 0.$$

3. Another property that will be used later is
$$\sum_{i=1}^n \hat\varepsilon_i x_i = 0.$$
The argument supporting this claim uses the second normal equation $n\bar x\beta_0 + \sum_{i=1}^n x_i^2\,\beta_1 = \sum_{i=1}^n x_i y_i$. Since the least squares estimates $\hat\beta_0$ and $\hat\beta_1$ satisfy the normal equations, it is true that
$$n\bar x \hat\beta_0 + \sum_{i=1}^n x_i^2\, \hat\beta_1 = \sum_{i=1}^n x_i y_i. \qquad (2)$$
Consider
$$\begin{aligned}
\sum_{i=1}^n \hat\varepsilon_i x_i &= \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) x_i \\
&= \sum_{i=1}^n x_i y_i - \hat\beta_0 \sum_{i=1}^n x_i - \hat\beta_1 \sum_{i=1}^n x_i^2 \\
&= \sum_{i=1}^n x_i y_i - n\hat\beta_0 \bar x - \hat\beta_1 \sum_{i=1}^n x_i^2.
\end{aligned}$$
Equation (2) implies that the last line of the above set of equations is 0 and that $\sum_{i=1}^n \hat\varepsilon_i x_i = 0$.

4. The sum of the products of the residuals and the fitted values is 0, since
$$\sum_{i=1}^n \hat y_i \hat\varepsilon_i = \sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 x_i)\hat\varepsilon_i
= \hat\beta_0 \sum_{i=1}^n \hat\varepsilon_i + \hat\beta_1 \sum_{i=1}^n \hat\varepsilon_i x_i.$$
The properties above established that $\sum_{i=1}^n \hat\varepsilon_i = 0$ and $\sum_{i=1}^n \hat\varepsilon_i x_i = 0$. Hence, $\sum_{i=1}^n \hat y_i \hat\varepsilon_i = 0$.

Assessing the adequacy of the regression model: The model must be considered a good (adequate) model before it is useful or interesting. A benchmark model against which the regression model ought to be substantially better is the null model $E(Y) = \beta_0$. A measure of adequacy is either the sum or the mean of the squared residuals about the model. The residuals associated with the null model are $y_i - \bar y$, $i = 1, \ldots, n$, since $\bar y$ is the least squares estimator of the parameter $\beta_0 = \mu$. The total sum of squares measures the fit of the null model:
$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar y)^2.$$
TSS is the sum of squared residuals about the null model. The corresponding measure of fit for the linear regression model is the sum of squared residuals
$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$
Another property of least squares regression is that $\mathrm{RSS} \le \mathrm{TSS}$, because the null model is a constrained version of the two-parameter linear model $E(Y \mid x) = \beta_0 + \beta_1 x$. The constraint imposed on the parameters that leads to the null model is $\beta_1 = 0$. In the unconstrained case of the two-parameter model, the parameter estimates are those that minimize the residual sum of squares. In the constrained case, one parameter is not chosen to minimize the residual sum of squares; instead, it is fixed ($\beta_1 = 0$), and so it is not optimal unless the least squares estimate of $\beta_1$ happens to be exactly 0. Constraining a parameter can never decrease the residual sum of squares compared with fitting that parameter to the data by least squares, and it increases the residual sum of squares unless the unconstrained estimate already satisfies the constraint. In other words, a constrained model will never fit better than an unconstrained model.

TSS will be close to RSS when the constrained and unconstrained models are alike. The constrained and unconstrained models are alike if $\hat\beta_1$ is very close to 0. Then the constrained and unconstrained model residuals are much alike, that is,
$$y_i - \hat y_i \approx y_i - \hat\beta_0 \approx y_i - \bar y,$$
and TSS will be only slightly larger than RSS. Conversely, if TSS is only slightly larger than RSS, then $y_i - \hat y_i \approx y_i - \bar y$ for $i = 1, \ldots, n$, which implies that $\bar y \approx \hat\beta_0 + \hat\beta_1 x_i$, $i = 1, \ldots, n$; thus $\hat\beta_1 \approx 0$ and the constrained and unconstrained models are very similar with respect to the residuals.
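The residual properties and the inequality between RSS and TSS can be checked numerically. The R sketch below uses simulated data; the sample size, true coefficients, and seed are arbitrary assumptions and are not taken from the text.

```r
# Simulated data (arbitrary choices, for illustration only).
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)

fit  <- lm(y ~ x)
e    <- resid(fit)    # residuals
yhat <- fitted(fit)   # fitted values

# Properties 2, 3 and 4: each sum is zero up to rounding error.
sum_e     <- sum(e)
sum_ex    <- sum(e * x)
sum_eyhat <- sum(e * yhat)

# Fit of the null model (TSS) and of the regression model (RSS).
TSS <- sum((y - mean(y))^2)
RSS <- sum(e^2)

c(sum_e = sum_e, sum_ex = sum_ex, sum_eyhat = sum_eyhat)
RSS <= TSS
```

The three sums are printed as very small floating-point numbers rather than exact zeros, which is the expected behaviour of finite-precision arithmetic.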

This logic will lead to a test of significance for a set of parameters. It will be concluded that the parameters are zero if the residual sum of squares associated with a model in which the set is constrained to be zero is not much larger than the residual sum of squares associated with a model in which the parameters are unconstrained.

Suppose that the residuals about the regression model were all zero, so that the model fit the data perfectly. Then $\mathrm{RSS} = 0$ and none of the variability in the response variable is left unexplained by the regression model. More generally, RSS is said to be the residual variation not explained by the regression model, and the difference
$$\mathrm{RegSS} = \mathrm{TSS} - \mathrm{RSS} \qquad (3)$$
is the portion of the total sum of squares that is explained by the regression model. The ratio of the regression sum of squares to the total sum of squares is the coefficient of determination
$$r^2 = \frac{\mathrm{RegSS}}{\mathrm{TSS}} \qquad (4)$$
$$\phantom{r^2} = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}}. \qquad (5)$$
Since $0 \le \mathrm{TSS} - \mathrm{RSS} \le \mathrm{TSS}$, it follows that $0 \le r^2 \le 1$. The coefficient of determination is sometimes said to be the proportion of variation in the response variable that is explained by the regression model (or the explanatory variable).

These ideas can be pursued further. Note that
$$y_i - \bar y = (y_i - \hat y_i) + (\hat y_i - \bar y). \qquad (6)$$
The total sum of squares can be expanded using formula (6):
$$\begin{aligned}
\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar y)^2 &= \sum_{i=1}^n \left[(y_i - \hat y_i) + (\hat y_i - \bar y)\right]^2 \\
&= \sum_{i=1}^n (y_i - \hat y_i)^2 + 2\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) + \sum_{i=1}^n (\hat y_i - \bar y)^2.
\end{aligned}$$

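The cross-product $2\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y)$ is zero since
$$\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) = \sum_{i=1}^n \hat\varepsilon_i (\hat y_i - \bar y)
= \sum_{i=1}^n (\hat\varepsilon_i \hat y_i - \hat\varepsilon_i \bar y)
= \sum_{i=1}^n \hat\varepsilon_i \hat y_i - \bar y \sum_{i=1}^n \hat\varepsilon_i.$$
The second term, $\bar y \sum_{i=1}^n \hat\varepsilon_i$, is 0 since $\sum_{i=1}^n \hat\varepsilon_i = 0$ (see property (2) of the least squares estimators above). The first term, $\sum_{i=1}^n \hat\varepsilon_i \hat y_i$, is 0 by property (4) above. Thus, the cross-product term is 0, and the expansion of the total sum of squares is
$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 = \mathrm{RSS} + \sum_{i=1}^n (\hat y_i - \bar y)^2,$$
so that
$$\mathrm{TSS} - \mathrm{RSS} = \sum_{i=1}^n (\hat y_i - \bar y)^2.$$
Recall the definition of the regression sum of squares (formula 3), which set $\mathrm{TSS} - \mathrm{RSS} = \mathrm{RegSS}$. Comparing this definition to $\mathrm{TSS} - \mathrm{RSS} = \sum_{i=1}^n (\hat y_i - \bar y)^2$ leads to another expression for the regression sum of squares:
$$\mathrm{RegSS} = \sum_{i=1}^n (\hat y_i - \bar y)^2.$$
This expression shows that the extent to which the unconstrained model predictions of the $y_i$ (the $\hat y_i$) differ from the constrained model predictions of the $y_i$ (namely $\bar y$) determines how well the model fits.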
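The decomposition of the total sum of squares can be verified on any fitted line. The R sketch below does so for simulated data (the data-generating values and seed are arbitrary assumptions) and compares the resulting ratio with the `r.squared` component reported by `summary()` for an `lm` fit.

```r
# Simulated data (arbitrary choices, for illustration only).
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

TSS   <- sum((y - mean(y))^2)      # fit of the null model
RSS   <- sum((y - yhat)^2)         # fit of the regression model
RegSS <- sum((yhat - mean(y))^2)   # fitted values about the sample mean

# TSS = RSS + RegSS, so the cross-product term must vanish.
cross <- sum((y - yhat) * (yhat - mean(y)))

r2 <- RegSS / TSS                  # coefficient of determination
c(TSS = TSS, RSS_plus_RegSS = RSS + RegSS, cross = cross, r2 = r2)
```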

Example

The figure to the right shows world record times for the mile run, from 1861 to 2003 for men (red) and from 1967 to 1996 for women (blue). [1] Separate regression lines are graphed for men and women, as are horizontal lines identifying the mean world record times. [2] For the male observations, I have graphed vertical lines between the fitted values and the sample mean; thus each vertical line has length $|\hat y_i - \bar y_{\text{males}}|$. For the female observations, I have graphed vertical lines from the fitted values to the observed values, so each of these lines has length $|y_i - \hat y_i|$. The lengths of the lines illustrate that $\mathrm{TSS} = \mathrm{RegSS} + \mathrm{RSS}$ is dominated by RegSS for both genders.

[Figure: world record time (sec) versus year.]

Correlation: The correlation coefficient $r$ is a measure of the strength of the linear association between two quantitative variables. The correlation coefficient is
$$r = \frac{1}{n-1} \sum_{i=1}^n \left(\frac{x_i - \bar x}{s_x}\right)\left(\frac{y_i - \bar y}{s_y}\right),$$
where $s_x$ and $s_y$ are the sample standard deviations of the explanatory and response variables, respectively. $r$ is sometimes called Pearson's correlation to distinguish it from other measures of association; however, the term correlation coefficient in statistics refers specifically to $r$.

Notice that the correlation coefficient is symmetric in the sense that the response and explanatory variables can be reversed or switched without altering $r$. The same statement does not hold for the parameter estimates
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad
\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}.$$
$\hat\beta_1$ and $r$ are related:
$$r = \frac{s_x}{s_y}\hat\beta_1 \quad\text{and}\quad \hat\beta_1 = r\,\frac{s_y}{s_x}. \qquad (7)$$
If $s_x = s_y$, then $\hat\beta_1 = r$; if the explanatory variable is twice as variable as the response ($s_x = 2s_y$), then the slope of the least squares line is reduced by a factor of two, i.e., $\hat\beta_1 = r/2$.

[1] Running a mile in four minutes (240 seconds) translates to a speed of 15 miles per hour.

[2] The risks of extrapolation are apparent since, as of February 2009, the women's world record for the mile run is 4:12.56, set by Svetlana Masterkova of Russia on August 14.
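The definition of $r$ and its relation to the slope in formula (7) can be checked directly. The following R sketch uses simulated data (arbitrary values, not the mile-run records) and base R's `cor()`, `sd()` and `lm()`.

```r
# Simulated data (arbitrary choices, for illustration only).
set.seed(10)
n <- 25
x <- rnorm(n, mean = 50, sd = 10)
y <- 300 - 0.8 * x + rnorm(n, sd = 5)

# r from its definition: average product of standardized values.
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Formula (7): the least squares slope equals r * s_y / s_x.
slope_yx <- unname(coef(lm(y ~ x))["x"])
slope_from_r <- r_manual * sd(y) / sd(x)

# Switching the roles of the variables leaves r unchanged,
# but the slope becomes r * s_x / s_y.
slope_xy <- unname(coef(lm(x ~ y))["y"])

c(r_manual = r_manual, cor = cor(x, y), cor_swapped = cor(y, x))
c(slope_yx = slope_yx, slope_from_r = slope_from_r, slope_xy = slope_xy)
```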

In the case of simple linear regression, the coefficient of determination is the square of the correlation coefficient, which explains why the symbol for the coefficient of determination is $r^2$.

Properties of correlation

[Figure: example scatterplots labeled with their correlation coefficients, one of them $r = .48$.]

1. $r$ measures only the strength of linear association between two variables.

2. The value of $r$ is always between $-1$ and $1$. If $r > 0$, then the association between the variables is positive. If $r < 0$, then the association between the variables is negative. If $r = 0$, then there is no linear association between the variables.

3. If $r = 1$ or $r = -1$, then there is perfect positive or negative linear association, respectively, between the variables.

4. The value of $r$ is unitless, and so the meaning of $r$ as a measure of correlation (as discussed above) holds for any pair of quantitative variables.

5. There is no distinction between the $x$ and $y$ variables in terms of one variable explaining the other.

More examples:

[Figure, left: Missoula monthly average temperatures (degrees Fahrenheit) versus month. Right: distance from the Sun versus planet position, for all 9 planets.]

Remarks

The plot on the left and above illustrates that the correlation coefficient is a measure of linear association: $r = .16$ does not imply that a relationship does not exist between the variables. The planet distance plot illustrates that a curved (but monotone) relationship may have a large correlation coefficient provided that the points lie close to a line. For these data, $r = .91$.

Effect of outliers on r: The correlation coefficient is not resistant to outliers; it is sensitive to them. Consider the two scatterplots below.

[Figure: two scatterplots of y against x, each containing a single outlier.]

Left panel: $r = .06$ without the outlier and $r = .90$ with the outlier. Right panel: $r = 1$ without the outlier and $r = 0$ with the outlier.

There are several measures of association which are resistant to the effects of outliers. Two of the more common such measures are Kendall's $\tau$ and Spearman's $\rho$. Both of these measure the degree of monotonic association between two quantitative variables.

Causation: Two variables have a cause-and-effect relationship if changes in one variable cause changes in the other variable. Examples: A change in the concentration of a catalyst causes changes in a reaction rate. It is argued that smoking causes cardiovascular diseases and some cancers. Many studies have shown associations between smoking level and the incidence of lung cancer, but studies that imply causation in humans have not been carried out.

Example: In 1982, SAT scores were first published on a state-by-state basis. Average scores ranged from 790 (out of 1600) (S. Carolina) to 1088 (Iowa). The data were used to support conflicting positions and claims regarding secondary education. In response, Powell and Steelman [3] attempted to use the data to understand why state-to-state variation was so great and, as they stated, to assess the extent to which the compositional/demographic and school-structural characteristics are implicated in SAT differences.

[3] Powell, B. and Steelman, L.C. Variations in state SAT performance: Meaningful or misleading? Harvard Educational Review 54(4).

These data are sometimes called ecological data. Such data are aggregate statistics, usually means. The data (means) differ from observations on individuals, and so inferences drawn about individuals from these data are potentially misleading or incorrect. In this example, the data are measurements on states, not on individual high school students, and the reader should not draw inferences about individuals from the data.

Powell and Steelman obtained measurements on a number of variables, but in particular:

1. The percentage of total eligible students that took the test

2. The average number of years that the takers took formal courses in social sciences, humanities, and natural sciences

3. The median percentile ranking of the test takers among the students in their classes

The figure below and left is a dotplot showing mean SAT scores by state, and the figure below and right plots rank against percent takers.

[Figure, left: dotplot of mean SAT score (1982) by state. Right: median class rank of takers versus takers (percentage of class), with selected states labeled.]

Midwestern states tend to have low percentages of test takers, principally because most midwestern state universities required that applicants take the ACT test. Presumably, most of the students that took the SAT in these midwestern states were interested in attending college out of state. The percentage of test takers may have a substantial role in explaining variation in state SAT means, but it offers little if any insight into the school-structural characteristics responsible for SAT differences.

It's expected that median class rank (of the takers) also explains some of the variation in mean state SAT scores. Differences between states with respect to these variables will confound a comparison of states aimed at understanding the success (or lack thereof) of states at educating students. The other potential explanatory variables (income, years, expenditure and public) are far more interesting with respect to the characteristics of states that succeed in educating students.

The plot below and left shows the relationship between SAT means and (mean) years of formal study. For these data, $r = .330$. If the confounding variables percent takers and median class rank are accounted for, or adjusted for, then the apparent relationship is stronger. The figure below and right plots the adjusted mean SAT scores against adjusted years of formal study. For these data, $r = .561$.

[Figure, left: state mean SAT versus years of formal studies. Right: state mean SAT (adjusted) versus years of formal studies (adjusted).]

Percent takers and median class rank are called confounding variables because they affect SAT means and are also associated with years of formal schooling. Differences among states with respect to percent takers and median class rank confound, or obscure, the relationship between mean SAT and mean years of formal schooling. Confounding variables are sometimes called lurking variables, particularly if they are not measured but are understood to confound the relationship between two other measured variables.

Residual error

If the null model is assumed to be true, then the estimator of the residual variance $\sigma^2$ is
$$\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2.$$

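If the linear regression model is assumed to be true, then the estimator of $\sigma^2$ is the sample variance of the residuals
$$s^2 = \frac{1}{n-2}\sum_{i=1}^n (\hat\varepsilon_i - \bar{\hat\varepsilon})^2
= \frac{1}{n-2}\sum_{i=1}^n \hat\varepsilon_i^2
= \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat y_i)^2,$$
where the second equality holds because the residuals sum to zero. The denominator $n - 2$ must be used so that the estimator is unbiased (the estimator is unbiased provided that its expected value is the same as the estimand: $E(s^2) = \sigma^2$). The term $n - 2$ is the degrees of freedom associated with $\sum (y_i - \hat y_i)^2$ (assuming there is one explanatory variable). The degrees of freedom are, in most cases, $n - (k + 1)$, where $k + 1$ is the number of parameters in the linear model (including $\beta_0$). The degrees of freedom associated with a particular sum of squares is the scaling factor that makes the scaled sum of squares an unbiased estimator of $\sigma^2$. Determining the specific degrees of freedom for a sum of squares is sometimes complex, though in most conventional problems it is, as stated above, $n - k - 1$.

Multiple regression

The linear regression model is extended from one explanatory variable to two, and then to $k > 1$. With two explanatory variables, the model is
$$E(Y_i \mid x_{i,1}, x_{i,2}) = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2}.$$
Once again, the objective is to use the $n$ observation triples $(y_i, x_{i,1}, x_{i,2})$, $i = 1, \ldots, n$, to estimate the parameters $\beta_0$, $\beta_1$, and $\beta_2$ in a manner that results in the smallest possible error, where error is defined (in this context) to be the sum of the squared residuals. Now the fitted values are $\hat y_i = \hat\beta_0 + \hat\beta_1 x_{i,1} + \hat\beta_2 x_{i,2}$ and the associated residual is $\hat\varepsilon_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \hat\beta_2 x_{i,2}$. The sum of the squared residuals is a function of the estimates $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$:
$$S(\hat\beta_0, \hat\beta_1, \hat\beta_2) = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \hat\beta_2 x_{i,2})^2.$$
The least squares parameter estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2$ are the values of $\beta_0, \beta_1, \beta_2$ that minimize $S(\beta_0, \beta_1, \beta_2)$. The calculus-based approach computes the partial derivatives $\partial S/\partial\beta_0$, $\partial S/\partial\beta_1$ and $\partial S/\partial\beta_2$, sets each of the partial derivatives equal to zero, and solves the resulting system of normal equations for $\beta_0$, $\beta_1$ and $\beta_2$. The partial derivatives are
$$\begin{aligned}
\frac{\partial S(\beta_0, \beta_1, \beta_2)}{\partial \beta_0}
&= \frac{\partial}{\partial\beta_0}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2
 = \sum_{i=1}^n \frac{\partial}{\partial\beta_0}(y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2 \\
&= -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})
 = -2\left(\sum_{i=1}^n y_i - n\beta_0 - \beta_1\sum_{i=1}^n x_{i,1} - \beta_2\sum_{i=1}^n x_{i,2}\right).
\end{aligned}$$
Next,
$$\begin{aligned}
\frac{\partial S(\beta_0, \beta_1, \beta_2)}{\partial \beta_1}
&= \frac{\partial}{\partial\beta_1}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2
 = \sum_{i=1}^n \frac{\partial}{\partial\beta_1}(y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})^2 \\
&= -2\sum_{i=1}^n x_{i,1}(y_i - \beta_0 - \beta_1 x_{i,1} - \beta_2 x_{i,2})
 = -2\left(\sum_{i=1}^n x_{i,1} y_i - \beta_0\sum_{i=1}^n x_{i,1} - \beta_1\sum_{i=1}^n x_{i,1}^2 - \beta_2\sum_{i=1}^n x_{i,1}x_{i,2}\right).
\end{aligned}$$
Similarly,
$$\frac{\partial S(\beta_0, \beta_1, \beta_2)}{\partial \beta_2}
= -2\left(\sum_{i=1}^n x_{i,2} y_i - \beta_0\sum_{i=1}^n x_{i,2} - \beta_1\sum_{i=1}^n x_{i,1}x_{i,2} - \beta_2\sum_{i=1}^n x_{i,2}^2\right).$$
The normal equations are
$$\begin{aligned}
n\beta_0 + \beta_1\sum x_{i,1} + \beta_2\sum x_{i,2} &= \sum y_i \\
\beta_0\sum x_{i,1} + \beta_1\sum x_{i,1}^2 + \beta_2\sum x_{i,1}x_{i,2} &= \sum x_{i,1} y_i \\
\beta_0\sum x_{i,2} + \beta_1\sum x_{i,1}x_{i,2} + \beta_2\sum x_{i,2}^2 &= \sum x_{i,2} y_i,
\end{aligned}$$
where every sum runs over $i = 1, \ldots, n$. Writing $\sum x_1$ for $\sum_i x_{i,1}$, and so on, the system can be written in matrix form as
$$\begin{pmatrix}
n & \sum x_1 & \sum x_2 \\
\sum x_1 & \sum x_1^2 & \sum x_1 x_2 \\
\sum x_2 & \sum x_1 x_2 & \sum x_2^2
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
=
\begin{pmatrix} \sum y \\ \sum x_1 y \\ \sum x_2 y \end{pmatrix}.$$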
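The matrix form of the normal equations translates almost directly into R. In the sketch below, the design matrix `X` (a column of ones followed by the two explanatory variables) is an assumed construction: `crossprod(X)` reproduces the matrix of sums on the left-hand side and `crossprod(X, y)` the vector on the right. The data are simulated with arbitrary coefficients, and `solve()` is used for transparency rather than because it is what `lm()` does internally. The last lines also check the variance estimator with $n - k - 1$ degrees of freedom.

```r
# Simulated data (arbitrary choices, for illustration only).
set.seed(2)
n  <- 40
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

# Design matrix: a column of ones, then x1 and x2.
X <- cbind(1, x1, x2)

XtX <- crossprod(X)      # entries: n, sum(x1), sum(x2), sum(x1^2), sum(x1*x2), ...
Xty <- crossprod(X, y)   # entries: sum(y), sum(x1*y), sum(x2*y)

# Solve the normal equations for (beta0, beta1, beta2).
beta_hat <- unname(drop(solve(XtX, Xty)))

# Compare with lm(), and estimate sigma^2 with n - k - 1 degrees of freedom.
fit <- lm(y ~ x1 + x2)
k   <- 2
s2  <- sum(resid(fit)^2) / (n - k - 1)

all.equal(beta_hat, unname(coef(fit)))
all.equal(s2, summary(fit)$sigma^2)
```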

The solution is
$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
=
\begin{pmatrix}
n & \sum x_1 & \sum x_2 \\
\sum x_1 & \sum x_1^2 & \sum x_1 x_2 \\
\sum x_2 & \sum x_1 x_2 & \sum x_2^2
\end{pmatrix}^{-1}
\begin{pmatrix} \sum y \\ \sum x_1 y \\ \sum x_2 y \end{pmatrix},$$
provided that the inverse exists. If there are $k \ge 2$ explanatory variables, the system is
$$\begin{pmatrix}
n & \sum x_1 & \cdots & \sum x_k \\
\sum x_1 & \sum x_1^2 & \cdots & \sum x_1 x_k \\
\vdots & \vdots & \ddots & \vdots \\
\sum x_k & \sum x_1 x_k & \cdots & \sum x_k^2
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
=
\begin{pmatrix} \sum y \\ \sum x_1 y \\ \vdots \\ \sum x_k y \end{pmatrix}.$$
The system consists of $k + 1$ equations in $k + 1$ unknowns ($\beta_0, \beta_1, \ldots, \beta_k$) and is usually solved by a computer implementation of Householder's or a similar algorithm. These algorithms are remarkably accurate and fast.

Example

The Ginzberg data set consists of observations made on patients hospitalized for depression. Three variables were measured on each patient: depression (Beck self-report depression scale), simplicity (a score of a subject's need to see the world in black and white), and fatalism score. [4] Together, the two explanatory variables explain a proportion $R^2$ of the variation in depression score. The figure to the right shows the fitted model as a surface. The surface of the plane consists of estimates of the expected depression score given the fatalism and simplicity scores.

[Figure: fitted regression plane; the axes are depression, simplicity score, and fatalism score.]

[4] A fatalistic outlook is one in which the person feels they lack control of events to the extent that there is little point to attempting to improve their situation.

There are very few situations in which there is a correct regression model in the sense that there is some particular combination of explanatory variables that describes the mean or expected value of $y$ better than all other choices of explanatory variables. Every model is wrong, but some models are less wrong than others (which is to say, they more accurately describe the expected value of $y$).

Some examples of multiple linear regression models are
$$\begin{aligned}
E(Y \mid x_1, x_2) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
E(Y \mid x_1, x_2) &= \beta_1 x_1 + \beta_2 x_2 \\
E(Y \mid x_1, x_2) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\
E(Y \mid x_1, x_2) &= \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2).
\end{aligned}$$
An example of a nonlinear multiple regression model is
$$E(Y \mid x_1, x_2) = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2).$$
So far, the multiple linear regression model describes only the expected value of the response variable $Y$. For inferential purposes, the model needs to describe the distribution of the response variable more completely. The full model is very similar to the simple linear regression model. It is
$$Y \sim N(E(Y \mid x_1, \ldots, x_k), \sigma), \quad\text{where}\quad E(Y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$
The model states that the variance of $Y$ is constant, provided that the expected value, or mean, is correctly specified. This condition is expressed by writing $\mathrm{var}(Y \mid x_1, \ldots, x_k) = \sigma^2$. An equivalent specification of the model is
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon, \quad\text{where}\quad \varepsilon \sim N(0, \sigma).$$
The residuals about the fitted regression model must be realizations of independent random variables $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ to carry out accurate inference (hypothesis tests and confidence intervals). A compact model specification relevant to the data is
$$Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_k x_{i,k} + \varepsilon_i, \quad\text{where}\quad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma),\ i = 1, 2, \ldots, n.$$
The notation iid is shorthand for independent and identically distributed. An additional assumption has been stated: the observations (or the residuals) are independently distributed.

Regression surfaces: Mathematically, a linear model with $k$ explanatory variables describes a flat $k$-dimensional surface within $\mathbb{R}^{k+1}$. For example, a model with two explanatory variables $x_1$ and $x_2$ ($k = 2$) defines a plane within three-space, or $\mathbb{R}^3$. There are three axes: $y$, $x_1$ and $x_2$. The regression plane intersects the $y$ axis where $x_1 = x_2 = 0$. The height of the plane at the intersection is $\beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 = \beta_0$. The coefficient $\beta_1$ is the slope of the plane if $x_2$ is fixed at any particular value, and $\beta_2$ is the slope of the plane if $x_1$ is fixed at any particular value. If more than two explanatory variables are present in the model, then it is difficult to develop a geometrical interpretation. For example, a model with three variables describes a three-dimensional object embedded in 4-space.

Interpretation of regression coefficients

A coefficient, or parameter estimate, for a multiple linear regression model is interpreted as the effect of the associated explanatory variable if all other variables are held fixed. The effect of a one unit increase of $x_1$ on $E(Y \mid x_1, \ldots, x_k)$, when all other variables are held fixed, is
$$\begin{aligned}
E(Y \mid x_1 + 1, x_2, \ldots, x_k) - E(Y \mid x_1, x_2, \ldots, x_k)
&= \beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \cdots + \beta_k x_k \\
&\quad - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k) \\
&= \beta_1.
\end{aligned}$$
The effect of a one unit increase in $x_1$ does not depend on the value of any of the variables. The effect of a simultaneous one unit increase in both $x_1$ and $x_2$ is the sum of the coefficients $\beta_1$ and $\beta_2$, since
$$\begin{aligned}
E(Y \mid x_1 + 1, x_2 + 1, x_3, \ldots, x_k) - E(Y \mid x_1, x_2, \ldots, x_k)
&= \beta_0 + \beta_1 (x_1 + 1) + \beta_2 (x_2 + 1) + \cdots + \beta_k x_k \\
&\quad - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k) \\
&= \beta_1 + \beta_2.
\end{aligned}$$
For this reason, the model $E(Y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ is called an additive model.

The parameter estimates (coefficients) from the simple linear regression of depression score on fatalism are shown in Table 1; Table 2 shows the coefficients from the multiple linear regression of depression score on fatalism and simplicity. The coefficients associated with fatalism differ between models, in part because simplicity is correlated with fatalism ($r = .631$). The interpretation of the parameter estimate associated with fatalism differs between models because the estimated change in mean depression score associated with a one unit change in fatalism is numerically different, and because the parameter estimate from the multiple regression model describes the estimated effect provided simplicity is held fixed.
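The interpretation of coefficients under "all other variables held fixed" can be illustrated with fitted values. The R sketch below uses simulated data in place of the Ginzberg measurements; the variable names `x1` and `x2`, the coefficients, and the seed are hypothetical. It also shows why the simple-regression coefficient of `x1` differs from the multiple-regression coefficient when `x1` and `x2` are correlated: the simple slope absorbs the coefficient of `x2` times the slope of `x2` regressed on `x1`.

```r
# Simulated data with positively correlated predictors (illustrative only).
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
y  <- 1 + 0.4 * x1 + 0.4 * x2 + rnorm(n, sd = 0.5)

fit_multi  <- lm(y ~ x1 + x2)
fit_simple <- lm(y ~ x1)
b <- coef(fit_multi)

# Effect of a one unit increase in x1 with x2 held fixed.
base   <- data.frame(x1 = 0.3, x2 = -1.2)
x1_up  <- data.frame(x1 = 1.3, x2 = -1.2)
effect_x1 <- unname(predict(fit_multi, x1_up) - predict(fit_multi, base))

# Effect of a simultaneous one unit increase in x1 and x2.
both_up <- data.frame(x1 = 1.3, x2 = -0.2)
effect_both <- unname(predict(fit_multi, both_up) - predict(fit_multi, base))

# Simple-regression slope = b1 + b2 * (slope of x2 regressed on x1).
delta <- unname(coef(lm(x2 ~ x1))["x1"])
simple_slope <- unname(coef(fit_simple)["x1"])

c(effect_x1 = effect_x1, b1 = unname(b["x1"]))
c(effect_both = effect_both, b1_plus_b2 = unname(b["x1"] + b["x2"]))
c(simple_slope = simple_slope, implied = unname(b["x1"] + b["x2"] * delta))
```

Changing the starting point `base` leaves both effects unchanged, which is the additivity described above.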

Table 1: Coefficients and standard errors obtained from the linear regression of depression score on fatalism. $r^2 = \ldots$

Variable     Coefficient   Std. Error   t-statistic   P(T > |t|)
Intercept    …             …            …             …
Fatalism     …             …            …             < .0001

Table 2: Coefficients and standard errors obtained from the linear regression of depression score on fatalism and simplicity. $r^2 = \ldots$

Variable     Coefficient   Std. Error   t-statistic   P(T > |t|)
Intercept    …             …            …             …
Fatalism     …             …            …             < .0001
Simplicity   …             …            …             …

The simple linear regression model estimates the effect of a one unit increase in fatalism to be an increase in depression score of .657. No control of simplicity is assumed. Because fatalism and simplicity are positively correlated, an increase in fatalism is associated with an increase in simplicity (for most patients), so a one unit increase in fatalism without fixing simplicity has a greater effect than a one unit increase in fatalism when simplicity is fixed. Analyzing the effect of fatalism with simplicity fixed is unrealistic, and in general, the interpretation of regression coefficients obtained from observational studies is not as clear as in experimental studies. In experimental studies, the values of the explanatory variables (properly called factors) are chosen by the researcher. With two factors with $l_1$ and $l_2$ levels respectively, there will usually be approximately the same number of observations at each of the $l_1 \times l_2$ treatments (combinations).

Model assessment

As with $k = 1$ explanatory variable, model assessment is based on the decomposition of the total sum of squares, TSS. TSS measures the fit of the null model (which constrains all the linear model parameters to be 0 except for the intercept $\beta_0$). The estimator of $\beta_0$ is the sample mean and the total sum of squares is
$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar y)^2.$$
As with $k = 1$, the measure of fit for the linear regression model is the sum of squared residuals
$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \cdots - \hat\beta_k x_{i,k})^2.$$

The sum of squared residuals provides an estimator of $\sigma^2 = \mathrm{var}(Y \mid x_1, \ldots, x_k)$. Fox uses $s_E^2$ to denote the estimator:
$$s_E^2 = \frac{\mathrm{RSS}}{n - k - 1} = \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n - k - 1}
= \frac{\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i,1} - \cdots - \hat\beta_k x_{i,k})^2}{n - k - 1}.$$
For the Ginzberg data, the estimated variance of depression scores was .250 (using the null model $E(Y) = \beta_0$) and $s_E^2 = .123$. The relative reduction is $(.250 - .123)/.250 = .508$, which is not (as should be expected) much different from $r^2$. The difference is attributable to the degrees of freedom associated with the two sums of squares (TSS for the null model and RSS for the linear regression model). The relative reduction in the variance estimates is sometimes called the adjusted $r^2$. Neither $r^2$ nor adjusted $r^2$ is unbiased, but the adjusted $r^2$ is less biased.

The standard error of the regression is $s_E = \sqrt{s_E^2}$, and it estimates, roughly, the average distance of an observation $y$ from its fitted value $\hat y$.

The partitioning of TSS according to $\mathrm{TSS} = \mathrm{RegSS} + \mathrm{RSS}$ holds as it did with a single variable, and the coefficient of determination
$$r^2 = \frac{\mathrm{RegSS}}{\mathrm{TSS}}$$
has the same interpretation as before: it's the proportion of the variation in the response variable explained by the regression.

Standardized regression coefficients

In some cases, it's desirable to compare the parameter estimates $\hat\beta_1, \ldots, \hat\beta_k$, or a subset of the estimates, with the intent of assessing the relative importance of the associated explanatory variables $x_1, \ldots, x_k$ toward explaining the expected value of the response variable. If the explanatory variables are measured in different units, then a direct comparison of, say, $\hat\beta_1$ (the estimated change in the mean of $Y$ given a one unit change in $x_1$) to $\hat\beta_2$ (the estimated change in the mean of $Y$ given a one unit change in $x_2$) has little meaning, because a one unit change in $x_1$ may span a large portion of the range of $x_1$ while a one unit change in $x_2$ may span a small portion of the range of $x_2$.

One solution to the problem is to standardize all variables (standardizing the response variable is not necessary, but it does lead to a simple shortcut) according to

$$Z_{i,y} = \frac{y_i - \bar{y}}{s_y}, \qquad Z_{i,1} = \frac{x_{i,1} - \bar{x}_1}{s_1}, \qquad \ldots, \qquad Z_{i,k} = \frac{x_{i,k} - \bar{x}_k}{s_k},$$

where $s_y$ is the sample standard deviation of the response and $s_j$ is the sample standard deviation of $x_j$. Afterwards, a one-unit change in any standardized explanatory variable (a change of one standard deviation in $x_j$ away from $\bar{x}_j$) has the same meaning: it spans about 68/2 = 34% of the observations closest to the center of the distribution of $x_j$, provided that the distribution of $x_j$ is roughly symmetric and without substantial outliers. Let $\hat{\beta}_1^*, \ldots, \hat{\beta}_k^*$ denote the parameter estimates obtained from a model fit using the standardized variables. A comparison of the magnitudes of $\hat{\beta}_1^*, \ldots, \hat{\beta}_k^*$ then reflects the relative importance of each variable towards explaining the conditional mean of $Y$.

All variables in the Ginzberg data set (from the R library alr3) have been scaled to have the same standard deviation, and so the relative importance of fatalism and simplicity score towards explaining expected depression score can be assessed by comparing the parameter estimates $\hat{\beta}_{\text{Fatalism}} = .418$ and $\hat{\beta}_{\text{Simplicity}} = .380$. The two variables make approximately the same contribution towards explaining the expectation of $Y$.

It is not necessary to scale or standardize the variables before the regression. The standardized regression coefficients (those coefficients obtained after standardizing all variables) can be computed using the formulas

$$\hat{\beta}_1^* = \frac{s_1}{s_y}\hat{\beta}_1, \qquad \ldots, \qquad \hat{\beta}_k^* = \frac{s_k}{s_y}\hat{\beta}_k,$$

where $\hat{\beta}_1, \ldots, \hat{\beta}_k$ are the estimates obtained without standardizing the variables. The explanation of why these formulas yield the standardized coefficients provides some insight into the properties of the coefficients. The first normal equation in the system for $k$ explanatory variables is

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^n x_{i,1} + \hat{\beta}_2 \sum_{i=1}^n x_{i,2} + \cdots + \hat{\beta}_k \sum_{i=1}^n x_{i,k} = \sum_{i=1}^n y_i,$$

and it implies that

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \cdots - \hat{\beta}_k \bar{x}_k.$$

Substituting this expression for $\hat{\beta}_0$ into the definition of a fitted value leads to

$$\begin{aligned}
\hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_k x_{i,k} \\
&= (\bar{y} - \hat{\beta}_1 \bar{x}_1 - \cdots - \hat{\beta}_k \bar{x}_k) + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_k x_{i,k} \\
&= \bar{y} + \hat{\beta}_1 (x_{i,1} - \bar{x}_1) + \cdots + \hat{\beta}_k (x_{i,k} - \bar{x}_k) \\
\hat{y}_i - \bar{y} &= \hat{\beta}_1 (x_{i,1} - \bar{x}_1) + \cdots + \hat{\beta}_k (x_{i,k} - \bar{x}_k)
\end{aligned} \tag{8}$$

From the third equation, it is apparent that the fitted model passes through the point $(\bar{y}, \bar{x}_1, \ldots, \bar{x}_k)$. (This can be verified by evaluating the model at $x_1 = \bar{x}_1, \ldots, x_k = \bar{x}_k$. The fitted value is $\hat{y} = \bar{y}$.)

A few more operations on the last equation will show how to compute the standardized coefficients. First, divide both sides by $s_y$:

$$\frac{\hat{y}_i - \bar{y}}{s_y} = \frac{\hat{\beta}_1}{s_y}(x_{i,1} - \bar{x}_1) + \cdots + \frac{\hat{\beta}_k}{s_y}(x_{i,k} - \bar{x}_k).$$

Multiply the $j$th term on the right by $1 = s_j / s_j$, $j = 1, \ldots, k$:

$$\frac{\hat{y}_i - \bar{y}}{s_y} = \frac{s_1 \hat{\beta}_1}{s_y} \cdot \frac{x_{i,1} - \bar{x}_1}{s_1} + \cdots + \frac{s_k \hat{\beta}_k}{s_y} \cdot \frac{x_{i,k} - \bar{x}_k}{s_k}.$$

Rearrange terms:

$$\frac{\hat{y}_i - \bar{y}}{s_y} = \hat{\beta}_1 \frac{s_1}{s_y} \left( \frac{x_{i,1} - \bar{x}_1}{s_1} \right) + \cdots + \hat{\beta}_k \frac{s_k}{s_y} \left( \frac{x_{i,k} - \bar{x}_k}{s_k} \right).$$

In this expression for the linear regression model, the response and explanatory variable observations have been standardized. The coefficients associated with the standardized variables are

$$\hat{\beta}_1^* = \hat{\beta}_1 \frac{s_1}{s_y}, \qquad \ldots, \qquad \hat{\beta}_k^* = \hat{\beta}_k \frac{s_k}{s_y}.$$

Returning to the last line in derivation (8),

$$\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_{i,1} - \bar{x}_1) + \cdots + \hat{\beta}_k (x_{i,k} - \bar{x}_k),$$

it is apparent that the regression sum of squares (which is the sum of squares of the left-hand terms, $i = 1, \ldots, n$) depends on the magnitudes of the regression coefficients and on the spread of the explanatory variables about their means, that is, on their standard deviations. A regression coefficient that is large in magnitude and associated with a variable with a large standard deviation will have the greatest impact upon the regression sum of squares.

Example. Returning to the data on 1984 record times of Scottish fell races, the relative importance of distance versus elevation gain can be assessed by regressing the logarithm of record time on the logarithms of distance and elevation gain, and then computing the standardized coefficients. Parameter estimates are shown in the table below.

Table 3: Coefficients and standard errors obtained from the linear regression of log-record time on log-distance and log-elevation gain using the data set hills from the MASS library.

| Variable | Parameter estimate | Std. Error | t-statistic | $P(\lvert T \rvert > \lvert t \rvert)$ |
| Constant | | | | <.0001 |
| log(distance) | | | | <.0001 |
| log(climb) | | | | |

The standard deviations of the variables (obtained from the function call sd(log(hills))) are $s_{\log(\text{time})} = .7048$, $s_{\log(\text{distance})} = .5928$ and $s_{\log(\text{climb})} = .8110$, and the standardized coefficients are

$$\hat{\beta}_{\log(\text{distance})}^* = \hat{\beta}_1 \frac{s_{\log(\text{distance})}}{s_{\log(\text{time})}} \qquad \text{and} \qquad \hat{\beta}_{\log(\text{climb})}^* = \hat{\beta}_2 \frac{s_{\log(\text{climb})}}{s_{\log(\text{time})}}.$$

Distance of the race is far more important towards determining the record times than the elevation gain. The question is muddied by log(distance) and log(climb) being moderately or strongly correlated ($r = .700$).

Another approach to assessing the influence of log(climb) is to conduct a regression of log(time) on log(distance) and examine the association between the residuals from the model and log(climb). The residuals are, in essence, the portion of log(time) that has not been accounted for by log(distance). If a significant fraction of the variation in the residuals is explained by log(climb), then it can be concluded that elevation gain is important. However, elevation gain is associated with distance, and we ought to remove this association from elevation gain if possible. This is accomplished by regressing log(climb) on log(distance) and using the residuals from that regression in place of log(climb).
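The standardized coefficients in the example can be checked numerically. The following is a minimal sketch, not the author's code: it assumes the hills data set from the MASS package, builds a hypothetical data frame loghills with illustrative column names, and uses sapply(..., sd) because calling sd() directly on a data frame is not supported in current versions of R. It computes the standardized slopes by rescaling each unstandardized slope by $s_j / s_y$, and then again by refitting the model after standardizing every variable; the two sets of values should agree.

```r
library(MASS)  # provides the hills data set

# Log-transform all three variables; the column names are illustrative.
loghills <- data.frame(
  logtime     = log(hills$time),
  logdistance = log(hills$dist),
  logclimb    = log(hills$climb)
)

# Fit on the log scale, then rescale each slope by s_j / s_y.
fit <- lm(logtime ~ logdistance + logclimb, data = loghills)
s   <- sapply(loghills, sd)
beta.std <- coef(fit)[-1] * s[c("logdistance", "logclimb")] / s["logtime"]

# Fit again after standardizing every variable (mean 0, sd 1).
z     <- as.data.frame(scale(loghills))
fit.z <- lm(logtime ~ logdistance + logclimb, data = z)

print(round(beta.std, 4))          # rescaled slopes
print(round(coef(fit.z)[-1], 4))   # slopes from the standardized fit
```

The intercept of the standardized fit is zero up to rounding error, which reflects the fact that the fitted model passes through the point of means.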

A partial residual plot plots the residuals from a regression of $Y$ on $x_j$ against the residuals from a regression of $x_i$ on $x_j$. The resulting plot reveals the extent to which additional information in $x_i$ is useful for explaining the response variable.

[Figure: partial residual plot; horizontal axis log(climb), vertical axis log(time).]

The figure is a partial residual plot showing that there is a single, highly influential race (Knock Hill, for which the recorded pace was faster than a 4-minute mile) confounding the contributory effect of log(climb) on log(time), given log(distance). For these partial residuals, $r = .309$; however, with Knock Hill removed from the residual pairs, the correlation coefficient is .665. The partial residual plot also suggests that the log transformation of elevation gain is unhelpful because of the apparent curvilinear relationship between the residuals. The conclusion is that elevation gain is important but that it is difficult to gauge its relative importance because it is moderately correlated with distance, and because of the impact of the dubious observation on the Knock Hill race.

R code used to construct the partial residual plots follows.

    # Residuals of log(time) and log(climb) after removing log(distance)
    y.r <- lm(loghills$logtime ~ loghills$logdistance)$resid
    x.r <- lm(loghills$logclimb ~ loghills$logdistance)$resid
    plot(x.r, y.r, pch = 16, xlab = "log(climb)", ylab = "log(time)")
    cor(x.r, y.r)
    # Repeat with observation 18 (Knock Hill) removed
    cor(x.r[-18], y.r[-18])
    plot(x.r[-18], y.r[-18], pch = 16, xlab = "log(climb)", ylab = "log(time)")
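The listing above relies on a data frame loghills that is not defined in this section. The following self-contained sketch assumes that loghills simply holds the logarithms of the three columns of hills from the MASS package (an assumption; only the column names are taken from the listing). It also illustrates a standard property of partial residuals: the least squares slope through the pairs of partial residuals equals the coefficient of log(climb) in the regression of log(time) on both log(distance) and log(climb). Knock Hill is located by row name here rather than by a hard-coded index.

```r
library(MASS)  # provides the hills data set

# Assumed construction of loghills; the original definition is not shown here.
loghills <- data.frame(
  logtime     = log(hills$time),
  logdistance = log(hills$dist),
  logclimb    = log(hills$climb)
)

# Partial residuals: remove the linear effect of log(distance) from both variables.
y.r <- resid(lm(logtime  ~ logdistance, data = loghills))
x.r <- resid(lm(logclimb ~ logdistance, data = loghills))

# Slope of the least squares line through the partial residuals ...
slope.partial <- unname(coef(lm(y.r ~ x.r))["x.r"])
# ... and the log(climb) coefficient from the two-variable regression.
fit.full   <- lm(logtime ~ logdistance + logclimb, data = loghills)
slope.full <- unname(coef(fit.full)["logclimb"])
print(c(partial = slope.partial, full = slope.full))

# Correlations of the partial residuals with and without Knock Hill.
knock <- which(rownames(hills) == "Knock Hill")
print(cor(x.r, y.r))
print(cor(x.r[-knock], y.r[-knock]))
```

Because the two slopes coincide, the scatter in the partial residual plot shows directly how strongly each race pulls on the log(climb) coefficient.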

Jakarta International School 6 th Grade Formative Assessment Graphing and Statistics -Black

Jakarta International School 6 th Grade Formative Assessment Graphing and Statistics -Black Jakarta International School 6 th Grade Formative Assessment Graphing and Statistics -Black Name: Date: Score : 42 Data collection, presentation and application Frequency tables. (Answer question 1 on

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 4: Factor analysis Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2017/2018 Master in Mathematical Engineering Pedro

More information

New Educators Campaign Weekly Report

New Educators Campaign Weekly Report Campaign Weekly Report Conversations and 9/24/2017 Leader Forms Emails Collected Text Opt-ins Digital Journey 14,661 5,289 4,458 7,124 317 13,699 1,871 2,124 Pro 13,924 5,175 4,345 6,726 294 13,086 1,767

More information

Chapter 11 : State SAT scores for 1982 Data Listing

Chapter 11 : State SAT scores for 1982 Data Listing EXST3201 Chapter 12a Geaghan Fall 2005: Page 1 Chapter 12 : Variable selection An example: State SAT scores In 1982 there was concern for scores of the Scholastic Aptitude Test (SAT) scores that varied

More information

Parametric Test. Multiple Linear Regression Spatial Application I: State Homicide Rates Equations taken from Zar, 1984.

Parametric Test. Multiple Linear Regression Spatial Application I: State Homicide Rates Equations taken from Zar, 1984. Multiple Linear Regression Spatial Application I: State Homicide Rates Equations taken from Zar, 984. y ˆ = a + b x + b 2 x 2K + b n x n where n is the number of variables Example: In an earlier bivariate

More information

Standard Indicator That s the Latitude! Students will use latitude and longitude to locate places in Indiana and other parts of the world.

Standard Indicator That s the Latitude! Students will use latitude and longitude to locate places in Indiana and other parts of the world. Standard Indicator 4.3.1 That s the Latitude! Purpose Students will use latitude and longitude to locate places in Indiana and other parts of the world. Materials For the teacher: graph paper, globe showing

More information

Summary of Natural Hazard Statistics for 2008 in the United States

Summary of Natural Hazard Statistics for 2008 in the United States Summary of Natural Hazard Statistics for 2008 in the United States This National Weather Service (NWS) report summarizes fatalities, injuries and damages caused by severe weather in 2008. The NWS Office

More information

Abortion Facilities Target College Students

Abortion Facilities Target College Students Target College Students By Kristan Hawkins Executive Director, Students for Life America Ashleigh Weaver Researcher Abstract In the Fall 2011, Life Dynamics released a study entitled, Racial Targeting

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 3: Principal Component Analysis Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2017/2018 Master in Mathematical

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 6: Cluster Analysis Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2017/2018 Master in Mathematical Engineering

More information

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee May 2018

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee May 2018 Cooperative Program Allocation Budget Receipts May 2018 Cooperative Program Allocation Budget Current Current $ Change % Change Month Month from from Contribution Sources 2017-2018 2016-2017 Prior Year

More information

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee October 2017

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee October 2017 Cooperative Program Allocation Budget Receipts October 2017 Cooperative Program Allocation Budget Current Current $ Change % Change Month Month from from Contribution Sources 2017-2018 2016-2017 Prior

More information

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee October 2018

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee October 2018 Cooperative Program Allocation Budget Receipts October 2018 Cooperative Program Allocation Budget Current Current $ Change % Change Month Month from from Contribution Sources 2018-2019 2017-2018 Prior

More information

Chapter. Organizing and Summarizing Data. Copyright 2013, 2010 and 2007 Pearson Education, Inc.

Chapter. Organizing and Summarizing Data. Copyright 2013, 2010 and 2007 Pearson Education, Inc. Chapter 2 Organizing and Summarizing Data Section 2.1 Organizing Qualitative Data Objectives 1. Organize Qualitative Data in Tables 2. Construct Bar Graphs 3. Construct Pie Charts When data is collected

More information

Club Convergence and Clustering of U.S. State-Level CO 2 Emissions

Club Convergence and Clustering of U.S. State-Level CO 2 Emissions Methodological Club Convergence and Clustering of U.S. State-Level CO 2 Emissions J. Wesley Burnett Division of Resource Management West Virginia University Wednesday, August 31, 2013 Outline Motivation

More information

A. Geography Students know the location of places, geographic features, and patterns of the environment.

A. Geography Students know the location of places, geographic features, and patterns of the environment. Learning Targets Elementary Social Studies Grade 5 2014-2015 A. Geography Students know the location of places, geographic features, and patterns of the environment. A.5.1. A.5.2. A.5.3. A.5.4. Label North

More information

Correction to Spatial and temporal distributions of U.S. winds and wind power at 80 m derived from measurements

Correction to Spatial and temporal distributions of U.S. winds and wind power at 80 m derived from measurements JOURNAL OF GEOPHYSICAL RESEARCH, VOL. 109,, doi:10.1029/2004jd005099, 2004 Correction to Spatial and temporal distributions of U.S. winds and wind power at 80 m derived from measurements Cristina L. Archer

More information

Challenge 1: Learning About the Physical Geography of Canada and the United States

Challenge 1: Learning About the Physical Geography of Canada and the United States 60ºN S T U D E N T H A N D O U T Challenge 1: Learning About the Physical Geography of Canada and the United States 170ºE 10ºW 180º 20ºW 60ºN 30ºW 1 40ºW 160ºW 50ºW 150ºW 60ºW 140ºW N W S E 0 500 1,000

More information

, District of Columbia

, District of Columbia State Capitals These are the State Seals of each state. Fill in the blank with the name of each states capital city. (Hint: You may find it helpful to do the word search first to refresh your memory.),

More information

Chapter 7: Simple linear regression

Chapter 7: Simple linear regression The absolute movement of the ground and buildings during an earthquake is small even in major earthquakes. The damage that a building suffers depends not upon its displacement, but upon the acceleration.

More information

What Lies Beneath: A Sub- National Look at Okun s Law for the United States.

What Lies Beneath: A Sub- National Look at Okun s Law for the United States. What Lies Beneath: A Sub- National Look at Okun s Law for the United States. Nathalie Gonzalez Prieto International Monetary Fund Global Labor Markets Workshop Paris, September 1-2, 2016 What the paper

More information

Additional VEX Worlds 2019 Spot Allocations

Additional VEX Worlds 2019 Spot Allocations Overview VEX Worlds 2019 Spot s Qualifying spots for the VEX Robotics World Championship are calculated twice per year. On the following table, the number in the column is based on the number of teams

More information

Preview: Making a Mental Map of the Region

Preview: Making a Mental Map of the Region Preview: Making a Mental Map of the Region Draw an outline map of Canada and the United States on the next page or on a separate sheet of paper. Add a compass rose to your map, showing where north, south,

More information

Intercity Bus Stop Analysis

Intercity Bus Stop Analysis by Karalyn Clouser, Research Associate and David Kack, Director of the Small Urban and Rural Livability Center Western Transportation Institute College of Engineering Montana State University Report prepared

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Chapter 5: Cluster analysis Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2015/2016 Master in Business Administration and

More information

Hourly Precipitation Data Documentation (text and csv version) February 2016

Hourly Precipitation Data Documentation (text and csv version) February 2016 I. Description Hourly Precipitation Data Documentation (text and csv version) February 2016 Hourly Precipitation Data (labeled Precipitation Hourly in Climate Data Online system) is a database that gives

More information

Online Appendix: Can Easing Concealed Carry Deter Crime?

Online Appendix: Can Easing Concealed Carry Deter Crime? Online Appendix: Can Easing Concealed Carry Deter Crime? David Fortunato University of California, Merced dfortunato@ucmerced.edu Regulations included in institutional context measure As noted in the main

More information

SUPPLEMENTAL NUTRITION ASSISTANCE PROGRAM QUALITY CONTROL ANNUAL REPORT FISCAL YEAR 2008

SUPPLEMENTAL NUTRITION ASSISTANCE PROGRAM QUALITY CONTROL ANNUAL REPORT FISCAL YEAR 2008 SUPPLEMENTAL NUTRITION ASSISTANCE PROGRAM QUALITY CONTROL ANNUAL REPORT FISCAL YEAR 2008 U.S. DEPARTMENT OF AGRICULTURE FOOD AND NUTRITION SERVICE PROGRAM ACCOUNTABILITY AND ADMINISTRATION DIVISION QUALITY

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

QF (Build 1010) Widget Publishing, Inc Page: 1 Batch: 98 Test Mode VAC Publisher's Statement 03/15/16, 10:20:02 Circulation by Issue

QF (Build 1010) Widget Publishing, Inc Page: 1 Batch: 98 Test Mode VAC Publisher's Statement 03/15/16, 10:20:02 Circulation by Issue QF 1.100 (Build 1010) Widget Publishing, Inc Page: 1 Circulation by Issue Qualified Non-Paid Circulation Qualified Paid Circulation Individual Assoc. Total Assoc. Total Total Requester Group Qualified

More information

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially

More information

North American Geography. Lesson 2: My Country tis of Thee

North American Geography. Lesson 2: My Country tis of Thee North American Geography Lesson 2: My Country tis of Thee Unit Overview: As students work through the activities in this unit they will be introduced to the United States in general, different regions

More information

RELATIONSHIPS BETWEEN THE AMERICAN BROWN BEAR POPULATION AND THE BIGFOOT PHENOMENON

RELATIONSHIPS BETWEEN THE AMERICAN BROWN BEAR POPULATION AND THE BIGFOOT PHENOMENON RELATIONSHIPS BETWEEN THE AMERICAN BROWN BEAR POPULATION AND THE BIGFOOT PHENOMENON ETHAN A. BLIGHT Blight Investigations, Gainesville, FL ABSTRACT Misidentification of the American brown bear (Ursus arctos,

More information

Osteopathic Medical Colleges

Osteopathic Medical Colleges Osteopathic Medical Colleges Matriculants by U.S. States and Territories Entering Class 0 Prepared by the Research Department American Association of Colleges of Osteopathic Medicine Copyright 0, AAM All

More information

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially

More information

Printable Activity book

Printable Activity book Printable Activity book 16 Pages of Activities Printable Activity Book Print it Take it Keep them busy Print them out Laminate them or Put them in page protectors Put them in a binder Bring along a dry

More information

2005 Mortgage Broker Regulation Matrix

2005 Mortgage Broker Regulation Matrix 2005 Mortgage Broker Regulation Matrix Notes on individual states follow the table REG EXEMPTIONS LIC-EDU LIC-EXP LIC-EXAM LIC-CONT-EDU NET WORTH BOND MAN-LIC MAN-EDU MAN-EXP MAN-EXAM Alabama 1 0 2 0 0

More information

Unit 6 - Introduction to linear regression

Unit 6 - Introduction to linear regression Unit 6 - Introduction to linear regression Suggested reading: OpenIntro Statistics, Chapter 7 Suggested exercises: Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25,

More information

Multivariate Classification Methods: The Prevalence of Sexually Transmitted Diseases

Multivariate Classification Methods: The Prevalence of Sexually Transmitted Diseases Multivariate Classification Methods: The Prevalence of Sexually Transmitted Diseases Summer Undergraduate Mathematical Sciences Research Institute (SUMSRI) Lindsay Kellam, Queens College kellaml@queens.edu

More information

Unit 6 - Simple linear regression

Unit 6 - Simple linear regression Sta 101: Data Analysis and Statistical Inference Dr. Çetinkaya-Rundel Unit 6 - Simple linear regression LO 1. Define the explanatory variable as the independent variable (predictor), and the response variable

More information

Scatterplots. STAT22000 Autumn 2013 Lecture 4. What to Look in a Scatter Plot? Form of an Association

Scatterplots. STAT22000 Autumn 2013 Lecture 4. What to Look in a Scatter Plot? Form of an Association Scatterplots STAT22000 Autumn 2013 Lecture 4 Yibi Huang October 7, 2013 21 Scatterplots 22 Correlation (x 1, y 1 ) (x 2, y 2 ) (x 3, y 3 ) (x n, y n ) A scatter plot shows the relationship between two

More information

JAN/FEB MAR/APR MAY/JUN

JAN/FEB MAR/APR MAY/JUN QF 1.100 (Build 1010) Widget Publishing, Inc Page: 1 Circulation Breakdown by Issue Qualified Non-Paid Qualified Paid Previous This Previous This Total Total issue Removals Additions issue issue Removals

More information

Meteorology 110. Lab 1. Geography and Map Skills

Meteorology 110. Lab 1. Geography and Map Skills Meteorology 110 Name Lab 1 Geography and Map Skills 1. Geography Weather involves maps. There s no getting around it. You must know where places are so when they are mentioned in the course it won t be

More information

Chapter 9. Correlation and Regression

Chapter 9. Correlation and Regression Chapter 9 Correlation and Regression Lesson 9-1/9-2, Part 1 Correlation Registered Florida Pleasure Crafts and Watercraft Related Manatee Deaths 100 80 60 40 20 0 1991 1993 1995 1997 1999 Year Boats in

More information

Alpine Funds 2017 Tax Guide

Alpine Funds 2017 Tax Guide Alpine s 2017 Guide Alpine Dynamic Dividend ADVDX 1/30/17 1/31/17 1/31/17 0.020000000 0.019248130 0.000000000 0.00000000 0.019248130 0.013842273 0.000000000 0.000000000 0.000751870 0.000000000 0.00 0.00

More information

Alpine Funds 2016 Tax Guide

Alpine Funds 2016 Tax Guide Alpine s 2016 Guide Alpine Dynamic Dividend ADVDX 01/28/2016 01/29/2016 01/29/2016 0.020000000 0.017621842 0.000000000 0.00000000 0.017621842 0.013359130 0.000000000 0.000000000 0.002378158 0.000000000

More information

Office of Special Education Projects State Contacts List - Part B and Part C

Office of Special Education Projects State Contacts List - Part B and Part C Office of Special Education Projects State Contacts List - Part B and Part C Source: http://www.ed.gov/policy/speced/guid/idea/monitor/state-contactlist.html Alabama Customer Specialist: Jill Harris 202-245-7372

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

extreme weather, climate & preparedness in the american mind

extreme weather, climate & preparedness in the american mind extreme weather, climate & preparedness in the american mind Extreme Weather, Climate & Preparedness In the American Mind Interview dates: March 12, 2012 March 30, 2012. Interviews: 1,008 Adults (18+)

More information

MINERALS THROUGH GEOGRAPHY

MINERALS THROUGH GEOGRAPHY MINERALS THROUGH GEOGRAPHY INTRODUCTION Minerals are related to rock type, not political definition of place. So, the minerals are to be found in a variety of locations that doesn t depend on population

More information

BlackRock Core Bond Trust (BHK) BlackRock Enhanced International Dividend Trust (BGY) 2 BlackRock Defined Opportunity Credit Trust (BHL) 3

BlackRock Core Bond Trust (BHK) BlackRock Enhanced International Dividend Trust (BGY) 2 BlackRock Defined Opportunity Credit Trust (BHL) 3 MUNICIPAL FUNDS Arizona (MZA) California Municipal Income Trust (BFZ) California Municipal 08 Term Trust (BJZ) California Quality (MCA) California Quality (MUC) California (MYC) Florida Municipal 00 Term

More information

Prerequisite Material

Prerequisite Material Prerequisite Material Study Populations and Random Samples A study population is a clearly defined collection of people, animals, plants, or objects. In social and behavioral research, a study population

More information

Heteroskedasticity. Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set

Heteroskedasticity. Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set Heteroskedasticity Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set Heteroskedasticity Occurs when the Gauss Markov assumption that

More information

Crop Progress. Corn Mature Selected States [These 18 States planted 92% of the 2017 corn acreage]

Crop Progress. Corn Mature Selected States [These 18 States planted 92% of the 2017 corn acreage] Crop Progress ISSN: 00 Released October, 0, by the National Agricultural Statistics Service (NASS), Agricultural Statistics Board, United s Department of Agriculture (USDA). Corn Mature Selected s [These

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

FLOOD/FLASH FLOOD. Lightning. Tornado

FLOOD/FLASH FLOOD. Lightning. Tornado 2004 Annual Summaries National Oceanic and Atmospheric Administration National Environmental Satellite Data Information Service National Climatic Data Center FLOOD/FLASH FLOOD Lightning Tornado Hurricane

More information

Important note: Transcripts are not substitutes for textbook assignments. 1

Important note: Transcripts are not substitutes for textbook assignments. 1 In this lesson we will cover correlation and regression, two really common statistical analyses for quantitative (or continuous) data. Specially we will review how to organize the data, the importance

More information

Chapter 20: Logistic regression for binary response variables

Chapter 20: Logistic regression for binary response variables Chapter 20: Logistic regression for binary response variables In 1846, the Donner and Reed families left Illinois for California by covered wagon (87 people, 20 wagons). They attempted a new and untried

More information

11 Correlation and Regression

11 Correlation and Regression Chapter 11 Correlation and Regression August 21, 2017 1 11 Correlation and Regression When comparing two variables, sometimes one variable (the explanatory variable) can be used to help predict the value

More information

An Analysis of Regional Income Variation in the United States:

An Analysis of Regional Income Variation in the United States: Modern Economy, 2017, 8, 232-248 http://www.scirp.org/journal/me ISSN Online: 2152-7261 ISSN Print: 2152-7245 An Analysis of Regional Income Variation in the United States: 1969-2013 Orley M. Amos Jr.

More information

Non-iterative, regression-based estimation of haplotype associations

Non-iterative, regression-based estimation of haplotype associations Non-iterative, regression-based estimation of haplotype associations Benjamin French, PhD Department of Biostatistics and Epidemiology University of Pennsylvania bcfrench@upenn.edu National Cancer Center

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Infant Mortality: Cross Section study of the United State, with Emphasis on Education

Infant Mortality: Cross Section study of the United State, with Emphasis on Education Illinois State University ISU ReD: Research and edata Stevenson Center for Community and Economic Development Arts and Sciences Fall 12-15-2014 Infant Mortality: Cross Section study of the United State,

More information

Mrs. Poyner/Mr. Page Chapter 3 page 1

Mrs. Poyner/Mr. Page Chapter 3 page 1 Name: Date: Period: Chapter 2: Take Home TEST Bivariate Data Part 1: Multiple Choice. (2.5 points each) Hand write the letter corresponding to the best answer in space provided on page 6. 1. In a statistics

More information

Lecture 6: Linear Regression

Lecture 6: Linear Regression Lecture 6: Linear Regression Reading: Sections 3.1-3 STATS 202: Data mining and analysis Jonathan Taylor, 10/5 Slide credits: Sergio Bacallado 1 / 30 Simple linear regression Model: y i = β 0 + β 1 x i

More information

Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining

Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining Outline Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining Dimensionality Reduction Introduction Principal Components Analysis Singular Value Decomposition Multidimensional

More information

Data Set 1A: Algal Photosynthesis vs. Salinity and Temperature

Data Set 1A: Algal Photosynthesis vs. Salinity and Temperature Data Set A: Algal Photosynthesis vs. Salinity and Temperature Statistical setting These data are from a controlled experiment in which two quantitative variables were manipulated, to determine their effects

More information

St. Luke s College Institutional Snapshot

St. Luke s College Institutional Snapshot 1. Student Headcount St. Luke s College Institutional Snapshot Enrollment # % # % # % Full Time 139 76% 165 65% 132 54% Part Time 45 24% 89 35% 112 46% Total 184 254 244 2. Average Age Average Age Average

More information

Pima Community College Students who Enrolled at Top 200 Ranked Universities

Pima Community College Students who Enrolled at Top 200 Ranked Universities Pima Community College Students who Enrolled at Top 200 Ranked Universities Institutional Research, Planning and Effectiveness Project #20170814-MH-60-CIR August 2017 Students who Attended Pima Community

More information

Chapter 5 Friday, May 21st

Chapter 5 Friday, May 21st Chapter 5 Friday, May 21 st Overview In this Chapter we will see three different methods we can use to describe a relationship between two quantitative variables. These methods are: Scatterplot Correlation

More information

Heteroskedasticity. (In practice this means the spread of observations around any given value of X will not now be constant)

Heteroskedasticity. (In practice this means the spread of observations around any given value of X will not now be constant) Heteroskedasticity Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set so that E(u 2 i /X i ) σ 2 i (In practice this means the spread

More information

LABORATORY REPORT. If you have any questions concerning this report, please do not hesitate to call us at (800) or (574)

LABORATORY REPORT. If you have any questions concerning this report, please do not hesitate to call us at (800) or (574) LABORATORY REPORT If you have any questions concerning this report, please do not hesitate to call us at (800) 332-4345 or (574) 233-4777. This report may not be reproduced, except in full, without written

More information

LABORATORY REPORT. If you have any questions concerning this report, please do not hesitate to call us at (800) or (574)

LABORATORY REPORT. If you have any questions concerning this report, please do not hesitate to call us at (800) or (574) LABORATORY REPORT If you have any questions concerning this report, please do not hesitate to call us at (800) 332-4345 or (574) 233-4777. This report may not be reproduced, except in full, without written

More information

Lecture 18: Simple Linear Regression

Lecture 18: Simple Linear Regression Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength

More information

1) A residual plot: A)

1) A residual plot: A) 1) A residual plot: A) B) C) D) E) displays residuals of the response variable versus the independent variable. displays residuals of the independent variable versus the response variable. displays residuals

More information

Statistics for Engineers Lecture 9 Linear Regression

Statistics for Engineers Lecture 9 Linear Regression Statistics for Engineers Lecture 9 Linear Regression Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu April 17, 2017 Chong Ma (Statistics, USC) STAT 509 Spring 2017 April

More information

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM 1 REGRESSION AND CORRELATION As we learned in Chapter 9 ( Bivariate Tables ), the differential access to the Internet is real and persistent. Celeste Campos-Castillo s (015) research confirmed the impact

More information

University of California, Berkeley, Statistics 131A: Statistical Inference for the Social and Life Sciences. Michael Lugo, Spring 2012

University of California, Berkeley, Statistics 131A: Statistical Inference for the Social and Life Sciences. Michael Lugo, Spring 2012 University of California, Berkeley, Statistics 3A: Statistical Inference for the Social and Life Sciences Michael Lugo, Spring 202 Solutions to Exam Friday, March 2, 202. [5: 2+2+] Consider the stemplot

More information

MINERALS THROUGH GEOGRAPHY. General Standard. Grade level K , resources, and environmen t

MINERALS THROUGH GEOGRAPHY. General Standard. Grade level K , resources, and environmen t Minerals through Geography 1 STANDARDS MINERALS THROUGH GEOGRAPHY See summary of National Science Education s. Original: http://books.nap.edu/readingroom/books/nses/ Concept General Specific General Specific

More information
