Module 2. General Linear Model


D.G. Bonett (9/2018)

The relation between one response variable (y) and q ≥ 1 predictor variables (x_1, x_2, …, x_q) for one randomly selected person can be represented by the following general linear model (GLM)

y_i = β_0 + β_1x_1i + β_2x_2i + … + β_qx_qi + e_i

where y_i is the response variable score for person i, and x_ji is the score for person i on predictor variable j. The value β_0 + β_1x_1i + β_2x_2i + … + β_qx_qi is the predicted y score for person i, and e_i = y_i − (β_0 + β_1x_1i + β_2x_2i + … + β_qx_qi) is the prediction error for person i. The coefficients β_0, β_1, …, β_q are the unknown population parameters to be estimated from sample data. The coefficient β_0 is the y-intercept and β_1, …, β_q are the slope coefficients.

Each predictor variable in a GLM can be fixed or random. Fixed predictor variables (factors) have a predetermined number of values (levels). Random predictor variables are always quantitative, but fixed predictor variables can be quantitative (e.g., 10, 20, or 30 hours of training; 0, 1, 2, or 3 siblings) or qualitative (e.g., treatment A, treatment B, or treatment C; Democrat, Republican, or Independent). The fixed predictor variables can be treatment factors with levels to which participants are randomly assigned (e.g., Treatment A, Treatment B, or Treatment C; 10, 20, or 30 hours of training) or classification factors with levels that represent existing characteristics of the study population (e.g., 0, 1, 2, or 3 siblings; Democrat, Republican, or Independent).

In Module 1, we saw how a 2-level qualitative predictor variable (e.g., male/female) could be included in the model by dummy coding the 2-level predictor variable. To model a qualitative predictor variable with k levels, the qualitative predictor variable must be converted into k − 1 indicator variables, as will be explained later. A GLM where all predictor variables are indicator variables is called an analysis of variance (ANOVA) model; a GLM that has only quantitative predictor variables is called a multiple regression model; and a GLM that has both indicator variables and quantitative predictor variables is called an analysis of covariance (ANCOVA) model.
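
A GLM of this form can be fit in R with lm(). The following is a minimal sketch; the data frame and variable names (d, y, x1, x2) are hypothetical and the data are simulated only for illustration.

set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))    # hypothetical predictors
d$y <- 2 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(100)      # hypothetical response
fit <- lm(y ~ x1 + x2, data = d)
coef(fit)          # least squares estimates of beta_0, beta_1, beta_2
head(resid(fit))   # estimated prediction errors (residuals) for the first few people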

Interpreting Slope Coefficients in Nonexperimental Designs

In experiments with a quantitative treatment factor (x_j), β_j describes the change in the population mean of y that will be caused by any 1-point increase in x_j (within the range of x_j values used in the experiment). In nonexperimental designs with fixed or random quantitative predictor variables, the slope coefficient for x_j describes the change in y associated with a 1-point increase in x_j, controlling for the other q − 1 predictor variables. The phrase "controlling for" requires explanation. Consider the following model with two predictor variables

y_i = β_0 + β_1x_1i + β_2x_2i + e_i

Now consider a model where x_1 is treated as a response variable, x_2 is treated as a predictor variable, and e_x1 is the resulting prediction error. The e_x1 prediction error reflects the aspect of x_1 that is not linearly associated with x_2. It can be shown that β_1 in the above model is equal to β_1 in the following model

y_i = β_0 + β_1e_x1i + β_2x_2i + e_i

In general, each β_j in a GLM describes the relation between y and e_xj, where e_xj represents the part of x_j that is not linearly related to any of the other predictor variables, and it is in this sense that β_j describes the association between x_j and y controlling for all other predictor variables in the model.

In applications where two or more predictor variables are measuring similar attributes, β_j may be very difficult to interpret. For example, if x_1 = depression and x_2 = psychological well-being, e_x1 represents some component of depression that is unrelated to psychological well-being, and that component could be very difficult to explain. However, in some applications, β_j will become more meaningful when certain predictor variables are added to the model. For example, if x_1 is a measure of spatial ability that involves complicated verbal instructions and x_2 is a measure of reading comprehension, β_1 describes the relation between y and the component of the spatial ability measure that is unrelated to reading comprehension (e_x1). In this application, e_x1 could represent a more pure measure of spatial ability than x_1. In applications where the relation between y and x_1 is confounded with demographic variables such as age, gender, or ethnicity, controlling for these demographic variables usually provides a more meaningful interpretation of β_1. For example, if an indicator variable for gender is added to a model where y is predicted by x_1, β_1 then describes the slope of the line relating x_1 to y within each gender category.
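
The equivalence described above can be checked numerically. In this small simulation (all data and names hypothetical), the coefficient for x1 in the full model is identical to the coefficient for the x1 residuals from a regression of x1 on x2.

set.seed(2)
n  <- 500
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)                  # x1 and x2 are correlated
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

b1_full  <- coef(lm(y ~ x1 + x2))["x1"]
e_x1     <- resid(lm(x1 ~ x2))             # part of x1 not linearly related to x2
b1_resid <- coef(lm(y ~ e_x1 + x2))["e_x1"]
c(b1_full, b1_resid)                       # the two estimates are identical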

The interpretation of β_j will also be simple if the correlations between x_j and all other predictor variables are small, because then e_xj will be essentially the same as x_j. It is a common mistake to interpret β_j as a description of the relation between y and x_j when x_j has a moderate or large correlation with the other predictor variables.

Confounding Variables

The value of the slope coefficient for a specific predictor variable in a GLM can change substantially if a confounding variable is added to the model. A confounding variable is a variable that is correlated with both y and a particular predictor variable. Consider a model with one predictor variable (x_1)

y_i = β_0* + β_1*x_1i + e_i*     (Model 1a)

and the following model that includes x_2 as an additional predictor variable

y_i = β_0 + β_1x_1i + β_2x_2i + e_i     (Model 1b)

where the asterisks in Model 1a indicate that the parameters and prediction errors are not necessarily identical to those in Model 1b. If x_1 in Model 1a is confounded with some other variable, then β_1* will contain confounding variable bias. The confounding variable bias in β_1* can be described in terms of how much its value changes if a confounding variable is added to the model. Suppose x_2 is a confounding variable and is added to Model 1a to give Model 1b. It can be shown that

β_1* = β_1 + (β_2 ρ_x1x2 σ_x2/σ_x1)

where the term in parentheses is the confounding variable bias in β_1* due to the omission of x_2 from the model. Note that the amount of confounding bias due to x_2 depends on the magnitude of the correlation between x_2 and x_1 as well as the magnitude of β_2, where the value of β_2 depends on the correlation between x_2 and y. The magnitude of the confounding variable bias due to x_2 will be small if either β_2 or ρ_x1x2 σ_x2/σ_x1 is small.

Confounding variable bias in a particular slope coefficient due to one or more confounding variables can be removed by including those confounding variables in the model. If the researcher can present a convincing argument that all potential confounding variables for predictor variable x_j that have not been included in the model are likely to have small correlations with the response variable or small correlations with all included predictor variables, this would imply that confounding variable bias is small, and β_j might cautiously and tentatively be interpreted as a description of a causal relation between x_j and y.
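
The bias expression above can be verified with simulated (hypothetical) data: the slope from the misspecified Model 1a equals the Model 1b slope plus the sample version of the bias term.

set.seed(3)
n  <- 1000
x2 <- rnorm(n, sd = 2)                        # the confounding variable
x1 <- 0.5 * x2 + rnorm(n)
y  <- 1 + 0.4 * x1 + 0.7 * x2 + rnorm(n)

b1_star <- coef(lm(y ~ x1))["x1"]             # Model 1a slope (contains bias)
full    <- lm(y ~ x1 + x2)                    # Model 1b
bias    <- coef(full)["x2"] * cor(x1, x2) * sd(x2) / sd(x1)
c(b1_star, coef(full)["x1"] + bias)           # equal (the identity is exact in-sample)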

In experimental designs where participants are randomly assigned to the levels of some treatment factor, the randomization process guarantees that all excluded predictor variables must be uncorrelated with the treatment predictor variables, and this precludes the possibility of any confounding variables. If there are no confounding variables, the slope coefficient for a particular treatment predictor variable will describe a causal relation between the treatment predictor variable and the response variable.

Analysis of Variance Table for a GLM

The variance of y (also called the total variance) can be decomposed into two sources of variability: the variance of the prediction errors (also called error variance or residual variance) and the variance of y that is predictable from the q predictor variables (also called model variance). The decomposition of the total variance can be summarized in an analysis of variance (ANOVA) table as shown below.

Source    SS      df                 MS                  F
MODEL     SS_M    df_M = q           MS_M = SS_M/df_M    MS_M/MS_E
ERROR     SS_E    df_E = n − q − 1   MS_E = SS_E/df_E
TOTAL     SS_T    df_T = n − 1       MS_T = SS_T/df_T

The estimated predicted y score for person i is

ŷ_i = β̂_0 + β̂_1x_1i + β̂_2x_2i + … + β̂_qx_qi

and the estimated prediction error (residual) for person i is ê_i = y_i − ŷ_i, where β̂_0, β̂_1, …, β̂_q are least squares estimates. The sum of squares (SS) values in the ANOVA table are

SS_T = Σᵢ₌₁ⁿ (y_i − μ̂_y)²
SS_E = Σᵢ₌₁ⁿ ê_i²
SS_M = SS_T − SS_E.

MS_E is an estimate of the variance of the prediction errors in the study population (σ²_e), and MS_T is an estimate of the variance of the y scores in the study population (σ²_y).
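
The SS decomposition can be assembled directly from lm output, as in this sketch with hypothetical data; the F value it produces matches the omnibus F statistic that summary.lm reports.

set.seed(4)
n <- 50; q <- 2
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.4 * d$x1 + 0.2 * d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

SST <- sum((d$y - mean(d$y))^2)          # total sum of squares
SSE <- sum(resid(fit)^2)                 # error sum of squares
SSM <- SST - SSE                         # model sum of squares
F   <- (SSM / q) / (SSE / (n - q - 1))   # MS_M / MS_E
c(F, summary(fit)$fstatistic["value"])   # same value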

If all predictor variables in the GLM are random, a multiple correlation between y and the q predictor variables is an interesting population parameter. A multiple correlation is equal to a Pearson correlation between y_i and the predicted y score (β_1x_1i + β_2x_2i + … + β_qx_qi). The OLS estimates of β_j define a unique linear function of the q predictor variables that maximizes the value of the Pearson correlation between y_i and ŷ_i, where ŷ_i = β̂_0 + β̂_1x_1i + β̂_2x_2i + … + β̂_qx_qi.

In the random-x case, the population multiple correlation is denoted as ρ_y.x, where x denotes the set of q predictor variables. The squared multiple correlation ρ²_y.x, referred to as the coefficient of multiple determination, is equal to 1 − σ²_e/σ²_y. The coefficient of multiple determination describes the proportion of the response variable variance that can be predicted from the q predictor variables. An estimate of ρ²_y.x (reported as R-squared in most statistical packages) is obtained from the SS values in the ANOVA table

ρ̂²_y.x = 1 − SS_E/SS_T.     (2.1)

An estimate of the multiple correlation is obtained by taking the square root of Equation 2.1.

It is instructive to express ρ²_y.x as the following function of Pearson correlations for the simple case of q = 2 predictor variables

ρ²_y.x = (ρ²_yx1 + ρ²_yx2 − 2ρ_yx1 ρ_yx2 ρ_x1x2)/(1 − ρ²_x1x2).     (2.2)

From Equation 2.2 we see that ρ²_y.x = ρ²_yx1 + ρ²_yx2 if x_1 and x_2 are uncorrelated. In general, if all q predictor variables are uncorrelated, then ρ²_y.x will equal the sum of the squared Pearson correlations between y and each of the predictor variables. If the correlations among all q predictor variables are very small, then ρ²_y.x will approximately equal the sum of the squared Pearson correlations between y and each of the predictor variables. Note that ρ²_y.x can be greater than ρ²_yx1 + ρ²_yx2 if ρ_yx1 ρ_yx2 ρ_x1x2 is a negative value.

The estimated squared multiple correlation has a positive bias, and the bias can be substantial when df_E is small. The following adjusted squared multiple correlation (reported as adjusted R-squared in most statistical packages) has less positive bias and is equal to

adj ρ̂²_y.x = 1 − MS_E/MS_T.     (2.3)
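
Equations 2.1 and 2.3 correspond to the R-squared and adjusted R-squared values that summary.lm reports, as this sketch with hypothetical data confirms.

set.seed(5)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.4 * d$x1 + 0.2 * d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

SSE <- sum(resid(fit)^2); SST <- sum((d$y - mean(d$y))^2)
r2     <- 1 - SSE / SST                                   # Equation 2.1
adj_r2 <- 1 - (SSE / fit$df.residual) / (SST / (n - 1))   # Equation 2.3
c(r2, summary(fit)$r.squared, adj_r2, summary(fit)$adj.r.squared)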

A confidence interval for ρ²_y.x does not have a simple formula, but it can be obtained in R. APA journals now expect authors who report an estimate of ρ²_y.x (preferably adj ρ̂²_y.x) to also include a confidence interval for ρ²_y.x.

In the fixed-x case, the following estimate of the coefficient of multiple determination

η̂² = 1 − SS_E/SS_T     (2.4)

is an estimate of η² and is equal to ρ̂²_y.x. Like ρ̂²_y.x, η̂² has a positive bias, and the bias can be substantial when df_E is small. Although η̂² = ρ̂²_y.x, different symbols are used for the coefficient of multiple determination in the random-x and fixed-x cases because η̂² and ρ̂²_y.x have different sampling distributions, and a confidence interval for η² in the fixed-x model will be different than a confidence interval for ρ²_y.x in the random-x model. The confidence interval for η² in the fixed-x case is complicated, but it can be obtained in R.

Note that adj ρ̂²_y.x is a function of MS_E = SS_E/df_E. Although SS_E can never increase when predictor variables are added to the model, MS_E can increase because the decrease in df_E could be relatively greater than the decrease in SS_E. Unlike η̂² and ρ̂²_y.x, which can never decrease when predictor variables are added to the model, adj ρ̂²_y.x can decrease when predictor variables are added to the model.

The F statistic from the ANOVA table is used to test the omnibus null hypothesis H0: β_1 = β_2 = … = β_q = 0 against an alternative hypothesis that at least one population slope coefficient is nonzero. H0 is rejected if the p-value for the F statistic (F = MS_M/MS_E) is less than α. These null and alternative hypotheses are equivalent to testing H0: ρ²_y.x = 0 against H1: ρ²_y.x > 0. A statistical test that allows the researcher to simply decide if H0: ρ²_y.x = 0 can or cannot be rejected does not provide useful scientific information because the researcher knows, before any data have been collected, that H0 is almost certainly false and hence H1 is almost certainly true. Although the researcher knows that ρ²_y.x (or η²) will almost never equal 0, the value of ρ²_y.x (or η²) will not be known, and therefore a confidence interval for ρ²_y.x (or η²) provides useful information.
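
The text does not specify which R function it has in mind for the ρ²_y.x interval. As one rough alternative for the random-x case, a nonparametric percentile bootstrap can be sketched with base R alone (hypothetical data; this is an approximation, not the exact method referred to above).

set.seed(6)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 0.5 * d$x1 + 0.3 * d$x2 + rnorm(n)

boot_r2 <- replicate(2000, {
  i <- sample(n, replace = TRUE)   # resample whole rows (y and x together)
  summary(lm(y ~ x1 + x2, data = d[i, ]))$r.squared
})
quantile(boot_r2, c(.025, .975))   # approximate 95% CI for the squared multiple correlation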

Confidence Interval and Test for a Slope Coefficient

A 100(1 − α)% confidence interval for β_j (the population slope coefficient corresponding to x_j) is

β̂_j ± t_{α/2; df_E} SE_β̂j     (2.5)

where β̂_j is the OLS estimate of β_j and df_E = n − q − 1. The standard error of β̂_j can be expressed as

SE_β̂j = √( MS_E/[(1 − ρ̂²_xj.x) σ̂²_xj (n − 1)] )     (2.6)

where ρ̂²_xj.x denotes the estimated squared multiple correlation between x_j and the other q − 1 predictor variables, and σ̂²_xj is the estimated variance of x_j.

A confidence interval for β_j can be used to test H0: β_j = 0 against H1: β_j > 0 and H2: β_j < 0. SPSS and R also compute the test statistic t = β̂_j/SE_β̂j and its corresponding p-value for each β̂_j. The p-value can be used to decide if H0: β_j = 0 can be rejected. If H0: β_j = 0 is rejected, the sign of β̂_j determines which alternative hypothesis to accept.

MS_E can be expressed as df_T σ̂²_y (1 − ρ̂²_y.x)/df_E. This expression is informative because it shows that a larger value of the multiple correlation between the response variable and all predictor variables (ρ̂_y.x) will give a smaller value of MS_E, which in turn reduces the value of SE_β̂j and the width of the confidence interval for β_j. Furthermore, SE_β̂j (Equation 2.6) is related to the multiple correlation between predictor variable j and the other predictor variables (ρ̂_xj.x), where larger values of ρ̂²_xj.x increase the value of SE_β̂j. Thus, correlated predictor variables will inflate the value of SE_β̂j, which in turn will increase the width of the confidence interval for β_j and reduce the power of the test of H0: β_j = 0. However, correlated predictor variables will not reduce the power of the test of H0: ρ²_y.x = 0 (H0: β_1 = β_2 = … = β_q = 0), and so it is possible to reject H0: ρ²_y.x = 0 but then fail to reject H0: β_j = 0 for any of the predictor variables in the GLM.
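
In R, the Equation 2.5 intervals and the accompanying t tests come directly from a fitted lm object (hypothetical data below).

set.seed(7)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- 2 + 0.5 * d$x1 + rnorm(60)
fit <- lm(y ~ x1 + x2, data = d)

summary(fit)$coefficients   # estimates, standard errors, t statistics, p-values
confint(fit, level = 0.95)  # 95% confidence intervals for beta_0, beta_1, beta_2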

Example 2.1. A researcher obtained a random sample of n = 236 participants from a business directory containing contact information for about 50,000 working adults. All participants in the sample were given a life satisfaction questionnaire (y), a neuroticism questionnaire (x_1), a conscientiousness questionnaire (x_2), and an industriousness questionnaire (x_3). All three predictor variables were moderately correlated. The adjusted estimate of the squared multiple correlation was .227. The 95% confidence interval for the population squared multiple correlation was [.135, .319], indicating that 13.5% to 31.9% of the variance in life satisfaction scores can be predicted from a linear function of the neuroticism, conscientiousness, and industriousness scores. The estimated slope coefficients and 95% confidence intervals for the population slope coefficients are given below.

Predictor       β̂_j      95% CI for β_j
neurotic        …        [−0.79, …]
conscientious   …        [−0.536, …]
industrious     0.302    [0.148, 0.456]

Interaction Effects in a GLM

In many applications, the strength of the relation between x_1 and y will depend on the value of a second predictor variable x_2. When this occurs, it is also the case that the relation between x_2 and y will depend on the value of x_1. In these situations, we say that x_1 and x_2 interact. In applications where x_1 and x_2 interact and the relation between x_1 and y is of primary interest, it also could be said that x_2 moderates the relation between x_1 and y.

The interaction effect for x_1 and x_2 can be included in a GLM by adding a new predictor variable that is the product of x_1 and x_2. An example of a GLM with two predictor variables and their interaction is shown below.

y_i = β_0 + β_1x_1i + β_2x_2i + β_3x_1ix_2i + e_i     (Model 2)

The product variable can be highly correlated with x_1 and x_2, which can inflate the values of SE_β̂j. This problem can be reduced by centering x_1 and x_2 before computing the product. A GLM that includes a product (interaction) term allows the strength of the relation between x_1 and y to vary across the levels of x_2 and the strength of the relation between x_2 and y to vary across the levels of x_1. Consequently, the values of β_1 and β_2 may be uninteresting. A model of this form can be fit as sketched below.
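
A sketch of fitting Model 2 with centered predictors (all data and names hypothetical). The formula x1c * x2c expands to the two main effects plus their product.

set.seed(8)
n <- 150
d <- data.frame(x1 = rnorm(n, mean = 10), x2 = rnorm(n, mean = 5))
d$y <- 1 + 0.4 * d$x1 + 0.3 * d$x2 + 0.2 * d$x1 * d$x2 + rnorm(n)

d$x1c <- d$x1 - mean(d$x1)           # center before forming the product
d$x2c <- d$x2 - mean(d$x2)
fit <- lm(y ~ x1c * x2c, data = d)   # x1c + x2c + x1c:x2c
summary(fit)$coefficients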

The nature of the interaction effect in a GLM can be understood by examining the simple slope for x_1 at low and high values of x_2, or the simple slope for x_2 at low and high values of x_1. The simple slope for x_1 can be obtained by factoring x_1i out of the β_1x_1i and β_3x_1ix_2i terms, as shown below

y_i = β_0 + β_2x_2i + (β_1 + β_3x_2i)x_1i + e_i

where the term in parentheses is the simple slope for x_1. Factoring x_2i out of the β_2x_2i and β_3x_1ix_2i terms shows that the simple slope for x_2 is β_2 + β_3x_1i.

When x_1 and x_2 are random and have been centered, their means will be zero, and it is common practice to report the simple slope for x_1 at x_2 = −σ̂_x2 and at x_2 = σ̂_x2 (one standard deviation below and above the centered mean of x_2). Likewise, the simple slope for x_2 is typically reported for x_1 = −σ̂_x1 and x_1 = σ̂_x1. If x_1 is fixed, simple slopes for x_2 could be computed at the lowest and highest values of x_1. If x_1 is a dummy variable, simple slopes for x_2 would be computed for x_1 = 0 and x_1 = 1. If x_1 is the predictor variable of primary importance, the simple slopes for x_1 are typically more interesting than the simple slopes for x_2.

Centering will change the values of β_1 and β_2 but not β_3 in Model 2. Centering does not affect the values of the simple slopes at the equivalent centered and uncentered values of x_1 or x_2. Centering also has no effect on the estimates of ρ²_y.x, η², or σ²_e.

When the interaction effect (β_3) is not small, the simple slope for x_1 could differ meaningfully across the values of x_2, and the simple slope for x_2 could differ meaningfully across the values of x_1. In some applications it could be informative to determine the value of x_2 where the simple slope for x_1 changes sign, or the value of x_1 where the simple slope for x_2 changes sign. These change points can be estimated by setting β̂_2 + β̂_3x_1i to zero and solving for x_1, or setting β̂_1 + β̂_3x_2i to zero and solving for x_2. The estimated value of x_2 where the simple slope for x_1 changes sign is −β̂_1/β̂_3, and the estimated value of x_1 where the simple slope for x_2 changes sign is −β̂_2/β̂_3. The interpretation of the simple slopes will be less complicated if the change point is outside the range of typical x_1 or x_2 values.

Confidence Interval for Simple Slopes

A 100(1 − α)% confidence interval for the simple slope for x_1 at x_2* (where x_2* is a particular value of x_2) is

β̂_1 + β̂_3x_2* ± t_{α/2; df_E} SE_{β̂1+β̂3x2*}     (2.7a)

where SE_{β̂1+β̂3x2*} = √( SE²_β̂1 + x_2*² SE²_β̂3 + 2x_2* cov(β̂_1, β̂_3) ), cov(β̂_1, β̂_3) is the estimated covariance between β̂_1 and β̂_3, and df_E = n − q − 1. SPSS and R will report the estimated covariances among the slope estimates as optional output.
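
In R, the estimated covariances needed for Equation 2.7a come from vcov(). This sketch re-creates the hypothetical interaction model from the earlier sketch and computes the interval at one standard deviation above the centered mean of x2.

set.seed(8)
n <- 150
d <- data.frame(x1 = rnorm(n, mean = 10), x2 = rnorm(n, mean = 5))
d$y <- 1 + 0.4 * d$x1 + 0.3 * d$x2 + 0.2 * d$x1 * d$x2 + rnorm(n)
d$x1c <- d$x1 - mean(d$x1); d$x2c <- d$x2 - mean(d$x2)
fit <- lm(y ~ x1c * x2c, data = d)

x2star <- sd(d$x2c)                           # chosen value of x2 (one SD above the mean)
b <- coef(fit); V <- vcov(fit)                # estimates and their covariance matrix
ss <- b["x1c"] + b["x1c:x2c"] * x2star        # estimated simple slope for x1
se <- sqrt(V["x1c", "x1c"] + x2star^2 * V["x1c:x2c", "x1c:x2c"] +
           2 * x2star * V["x1c", "x1c:x2c"])
ss + c(-1, 1) * qt(.975, fit$df.residual) * se  # 95% CI (Equation 2.7a)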

A 100(1 − α)% confidence interval for the simple slope for x_2 at x_1* is

β̂_2 + β̂_3x_1* ± t_{α/2; df_E} SE_{β̂2+β̂3x1*}     (2.7b)

where SE_{β̂2+β̂3x1*} = √( SE²_β̂2 + x_1*² SE²_β̂3 + 2x_1* cov(β̂_2, β̂_3) ).

Indicator Variables

A qualitative predictor variable with k categories (e.g., male/female, Democrat/Republican/Independent, freshman/sophomore/junior/senior, etc.) is called a factor and can serve as a predictor variable in a GLM if it is first converted into k − 1 indicator variables. Dummy coded variables and effect coded variables are two types of indicator variables.

Dummy coded variables have values of 0 and 1. For a categorical predictor variable with k categories, dummy coded indicator variable j is equal to 1 in level j and 0 otherwise. For example, two dummy coded indicator variables x_1 and x_2 will code k = 3 levels, as shown below.

level   x_1   x_2
1       1     0
2       0     1
3       0     0

Participants with level 1 of the predictor variable are assigned an x_1 score of 1 and an x_2 score of 0, participants with level 2 of the predictor variable are assigned an x_1 score of 0 and an x_2 score of 1, and participants with level 3 of the predictor variable are assigned an x_1 score of 0 and an x_2 score of 0. The level for which all the dummy codes are equal to 0 is called the reference level. In the above example, level 3 is the reference level.

The GLM for a k = 3 level dummy coded qualitative factor is

y_i = β_0 + β_1x_1i + β_2x_2i + e_i

and it can be shown that β_0 = μ_3, β_1 = μ_1 − μ_3, and β_2 = μ_2 − μ_3. Equation 2.5 can be used to obtain confidence intervals for β_1 and β_2, which are confidence intervals for μ_1 − μ_3 and μ_2 − μ_3, respectively. Note that β_1 − β_2 = (μ_1 − μ_3) − (μ_2 − μ_3) = μ_1 − μ_2, and so a confidence interval for μ_1 − μ_2 is obtained from a confidence interval for β_1 − β_2 (Equation 2.4 can be used to obtain a confidence interval for β_1 − β_2).

In general, in a GLM for one qualitative factor with k levels that have been dummy coded, the parameters are β_0 = μ_k and β_j = μ_j − μ_k.

Effect coded variables have values of 1, 0, and −1 (or just 1 and −1 if there are only two categories). As with dummy coding, k − 1 effect coded variables are needed to code a qualitative factor with k levels. The two effect coded variables for a qualitative predictor variable with k = 3 levels are shown below.

level   x_1   x_2
1        1     0
2        0     1
3       −1    −1

For a k-level qualitative factor, effect coded variable j is equal to 1 in level j, −1 in level k, and 0 otherwise. In this general case, β_0 = (μ_1 + μ_2 + … + μ_k)/k and β_j = μ_j − β_0. To obtain a confidence interval for a pairwise mean difference, say μ_1 − μ_2, it would be necessary to compute a confidence interval for β_1 − β_2, which equals (μ_1 − β_0) − (μ_2 − β_0) = μ_1 − μ_2. However, pairwise comparisons involving the last category are not obvious. For k = 3, μ_1 − μ_3 = 2β_1 + β_2 and μ_2 − μ_3 = β_1 + 2β_2.

With k = 2 categories, only one effect coded variable is required and the model is y_i = β_0 + β_1x_i + e_i, with participants at level 1 assigned an x_i score of 1 and participants at level 2 assigned an x_i score of −1. With k = 2 and effect coding, β_0 = (μ_1 + μ_2)/2 and β_1 = μ_1 − β_0 = μ_1 − (μ_1 + μ_2)/2 = (μ_1 − μ_2)/2. A comparison of the two coding schemes in R is sketched after this paragraph.
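
In R, the two coding schemes correspond to the contr.treatment and contr.sum contrast functions. A sketch with a hypothetical k = 3 factor, choosing level 3 as the dummy-coding reference to match the tables above:

set.seed(9)
g <- factor(rep(c("lev1", "lev2", "lev3"), each = 20))
y <- rnorm(60, mean = rep(c(10, 12, 11), each = 20))
tapply(y, g, mean)   # the three sample means

fit_d <- lm(y ~ C(g, contr.treatment(3, base = 3)))  # dummy codes, level 3 as reference
coef(fit_d)   # intercept estimates mu_3; slopes estimate mu_1 - mu_3 and mu_2 - mu_3

fit_e <- lm(y ~ C(g, contr.sum))                     # effect codes
coef(fit_e)   # intercept estimates the unweighted grand mean; slopes estimate mu_j - grand mean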

A GLM can have two or more qualitative factors. Consider the simplest case of two qualitative factors (factor A and factor B) that each have two levels (called a 2 × 2 factorial design). The two levels of factor A are denoted as a1 and a2. The two levels of factor B are denoted as b1 and b2. Either dummy coding or effect coding can be used to code each factor. The interaction between the two factors is coded by taking the product of the dummy or effect codes for each factor. The dummy codes and effect codes are shown below for a 2 × 2 factorial design.

            Dummy Codes    Effect Codes
A     B     x_1   x_2      x_1   x_2
a1    b1     1     1        1     1
a1    b2     1     0        1    −1
a2    b1     0     1       −1     1
a2    b2     0     0       −1    −1

The GLM for a 2 × 2 factorial design is

y_i = β_0 + β_1x_1i + β_2x_2i + β_3x_1ix_2i + e_i

and the interpretation of the model parameters depends on the type of coding used. With dummy coded indicator variables, the model parameters have the following definitions.

β_0 = μ_22   (mean at a2 and b2)
β_1 = μ_12 − μ_22   (simple main effect of A at b2)
β_2 = μ_21 − μ_22   (simple main effect of B at a2)
β_3 = μ_11 − μ_12 − μ_21 + μ_22   (AB interaction effect)

With effect coded indicator variables, the model parameters have the following definitions.

β_0 = (μ_11 + μ_12 + μ_21 + μ_22)/4   (grand mean)
β_1 = (μ_11 + μ_12)/4 − (μ_21 + μ_22)/4   (main effect of A divided by 2)
β_2 = (μ_11 + μ_21)/4 − (μ_12 + μ_22)/4   (main effect of B divided by 2)
β_3 = (μ_11 − μ_12 − μ_21 + μ_22)/4   (AB interaction effect divided by 4)

If the interaction term is not included in the model, the definitions of β_0, β_1, and β_2 are unchanged with effect coding, but with dummy coding these parameters have the following definitions.

β_0 = μ_22   (mean at a2 and b2)
β_1 = (μ_11 + μ_12)/2 − (μ_21 + μ_22)/2   (main effect of A)
β_2 = (μ_11 + μ_21)/2 − (μ_12 + μ_22)/2   (main effect of B)

Dummy coding is preferred to effect coding when the model contains only one qualitative predictor variable or when there are multiple qualitative predictor variables that are assumed not to interact. Effect coding is sometimes preferred to dummy coding if the model contains two or more 2-level qualitative predictor variables and their interactions.

Quadratic Model

If a nonlinear relation between y and x cannot be linearized using transformations, the following quadratic model will be appropriate when the relation between y and x can be characterized by a curve with a single bend.

y_i = β_0 + β_1x_i + β_2x_i² + e_i

In a quadratic model, the slope of the line relating x to y varies across the levels of x. Specifically, the slope of the line at x = x* is equal to β_1 + 2β_2x*, which may be estimated by replacing β_1 and β_2 with their estimates. It is standard practice to center x in a quadratic model. Centering x will change the estimate of β_1 and can substantially reduce its standard error. Centering x will not change the estimate of β_2 or its standard error.

A quadratic model implies that the direction of the relation between x and y changes sign at some value of x. In some applications it could be informative to estimate the value of x where the relation between x and y changes direction. The estimated change point is equal to −β̂_1/(2β̂_2). If the change point is outside the range of typical x scores, this implies that the direction of the relation is constant across typical x scores, and this could simplify the interpretation of results.

To estimate the amount of curvature in the nonlinear relation between x and y, the slope of the line can be compared at low (x_L) and high (x_H) values of x. The difference in slopes at low and high values of x is equal to (β_1 + 2β_2x_H) − (β_1 + 2β_2x_L) = 2(x_H − x_L)β_2, where x_L and x_H are values specified by the researcher. A confidence interval for 2(x_H − x_L)β_2 is obtained by multiplying the endpoints of a confidence interval for β_2 by 2(x_H − x_L).

In applications where the slope of the line at x = x* does not have a clear interpretation, which would be the case in applications where it is difficult to assign a clear psychological meaning to specific values of the response variable, the researcher might be content with simply testing the following hypotheses about the values of β_1 and β_2.

H0: β_1 = 0   H1: β_1 > 0   H2: β_1 < 0
H0: β_2 = 0   H1: β_2 > 0   H2: β_2 < 0

Confidence intervals for β_1 and β_2 may be used to select H1 or H2. Deciding β_1 > 0 implies that there is a positive relation between y and x, and deciding β_1 < 0 implies that there is a negative relation between y and x. Deciding β_2 > 0 implies that the predicted y scores follow a curve that bends up, and deciding β_2 < 0 implies that the predicted y scores follow a curve that bends down. For the fixed-x case, a graph of the sample means with confidence interval bars at each level of x provides additional information about the nature of the nonlinear relation. It is common to center x in a quadratic model because doing so reduces the standard error of β̂_1 and increases the power of the test of H0: β_1 = 0 without affecting the power of the test of H0: β_2 = 0.
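
A sketch of a quadratic fit in R with a centered predictor (hypothetical data); I() protects the squared term inside the model formula.

set.seed(10)
x <- runif(120, 0, 10)
y <- 5 + 2 * x - 0.15 * x^2 + rnorm(120)

xc  <- x - mean(x)                 # center the predictor
fit <- lm(y ~ xc + I(xc^2))
b   <- coef(fit)
b[2] + 2 * b[3] * 1                # estimated slope of the curve at xc = 1
-b[2] / (2 * b[3]) + mean(x)       # estimated change point, on the original x scale
confint(fit)                       # CIs for beta_1 and beta_2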

Example 2.2. A random sample of psychology students was obtained from a volunteer pool. The sample was randomized into four study group conditions with ten study groups per condition. The four conditions used study groups of 2, 4, 6, or 8 students. The dependent variable is performance on a research project scored on a scale from 0 to 100. The 95% confidence intervals for β_1 and β_2 are [4.8, 9.64] and [−0.79, −0.34], respectively. These confidence interval results support H1: β_1 > 0 and H2: β_2 < 0, indicating that project scores are positively related to study group size but the relation is curved with a downward bend.

[Graph: estimated population means with 95% confidence interval bars at each study group size.]

Semipartial Correlation

A semipartial correlation (or part correlation) between x_1 and y controlling for x_2, …, x_s is denoted as ρ_y(x1.x0), where x_0 represents the set of control variables x_2, …, x_s. A semipartial correlation is a Pearson correlation between y and e_x1, where e_x1 is the prediction error in a model that predicts x_1 from x_2, …, x_s. Replacing x with e_x1 in Figure 1 (in Module 1) is helpful in assessing the importance of an estimated semipartial correlation. A semipartial correlation between x_1 and y controlling for x_2, …, x_s describes the standard deviation change in y associated with a 1 standard deviation increase in e_x1.

An estimate of a semipartial correlation between x_1 and y controlling for x_2, …, x_s can be obtained by first computing the residuals in a GLM where x_1 is the response variable and x_2, …, x_s are the predictor variables. The Pearson correlation between the y scores and the x_1 residuals (ê_x1) is an estimate of the semipartial correlation. It can be shown that the squared semipartial correlation between y and x_1 is equal to the difference between ρ²_y.x (where x is the set of all control variables plus x_1) and ρ²_y.x0 (where x_0 is the set of all control variables). This difference is referred to as ΔR² in APA journals.

In a random-x model, a semipartial correlation could be computed for each predictor variable controlling for all other predictor variables in the model. These semipartial correlations are conceptually similar to slope coefficients because they describe the relation between x_j and y after the linear effects of all other predictor variables have been removed only from x_j.
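
The residualizing recipe above is two lm calls and a cor call in R. A sketch with hypothetical data, including the ΔR² check:

set.seed(11)
n <- 200
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)
y  <- 0.4 * x1 + 0.3 * x2 + rnorm(n)

e_x1 <- resid(lm(x1 ~ x2))   # part of x1 not linearly related to the control variable
cor(y, e_x1)                 # estimated semipartial correlation

# squared semipartial = R-squared(full model) - R-squared(controls only)
summary(lm(y ~ x1 + x2))$r.squared - summary(lm(y ~ x2))$r.squared
cor(y, e_x1)^2               # same value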

A semipartial correlation is a standardized measure of effect size that is easier to interpret than a slope coefficient in applications where the metrics of x_j and y are unfamiliar to the intended audience.

A confidence interval for ρ_y(x1.x0) is obtained in two steps. First, an approximate 100(1 − α)% confidence interval for a transformed semipartial correlation estimate is computed

ρ̂′_y(x1.x0) ± z_{α/2} √( g/(n − 3) )     (2.8)

where g = (ρ̂⁴_y.x − 2ρ̂²_y.x + ρ̂²_y.x0 − ρ̂⁴_y.x0 + 1)/(1 − ρ̂²_y(x1.x0))² and ρ̂′_y(x1.x0) = ln[(1 + ρ̂_y(x1.x0))/(1 − ρ̂_y(x1.x0))]/2. Let ρ̂′_L and ρ̂′_U denote the endpoints of Equation 2.8. Reverse transforming the endpoints of Equation 2.8 gives the following lower confidence limit for ρ_y(x1.x0)

[exp(2ρ̂′_L) − 1]/[exp(2ρ̂′_L) + 1]     (2.9a)

and the following upper confidence limit for ρ_y(x1.x0)

[exp(2ρ̂′_U) − 1]/[exp(2ρ̂′_U) + 1].     (2.9b)

Partial Correlation

A partial correlation between x_j and y removes the linear effects of one or more control variables from both x_j and y. In comparison, a semipartial correlation between x_j and y removes the linear effects of one or more control variables from only x_j. Let x_0 denote a set of control variables x_2, …, x_s. A partial correlation between x_1 and y controlling for x_0, denoted as ρ_yx1.x0, is a Pearson correlation between e_x1 and e_y, where e_x1 represents the prediction errors in a model that predicts x_1 from x_0 and e_y represents the prediction errors in a model that predicts y from x_0. A partial correlation between x_1 and y describes the standard deviation change in e_y associated with a 1 standard deviation increase in e_x1. Replacing y with e_y and x with e_x1 in Figure 1 is helpful in assessing the importance of an estimated partial correlation.

Like the multiple correlation and semipartial correlation, a partial correlation is appropriate only for random-x models. A partial correlation may be more interesting than a semipartial correlation if it is important to remove the effects of the control variables from both y and x_1. For example, if social skills and problem solving skills are measured in a sample of 6 to 9 year old children, the correlation between these two variables could be misleading because both variables are related to age, and it would be desirable to remove the effect of age from both variables.

A confidence interval for ρ_yx1.x0 is obtained in two steps. First, a 100(1 − α)% confidence interval for a transformed partial correlation estimate is computed

ρ̂′_yx1.x0 ± z_{α/2} √( 1/(n − s − 3) )     (2.10)

where ρ̂′_yx1.x0 = ln[(1 + ρ̂_yx1.x0)/(1 − ρ̂_yx1.x0)]/2 and s is the number of control variables. Let ρ̂′_L and ρ̂′_U denote the endpoints of Equation 2.10. Reverse transforming the endpoints of Equation 2.10 gives the following lower confidence limit for ρ_yx1.x0

[exp(2ρ̂′_L) − 1]/[exp(2ρ̂′_L) + 1]     (2.11a)

and the following upper confidence limit for ρ_yx1.x0

[exp(2ρ̂′_U) − 1]/[exp(2ρ̂′_U) + 1].     (2.11b)

Example 2.3. A validation study examined the correlation between a new 3-D spatial ability scale and a 2-D spatial ability scale. However, both scales have detailed written instructions, and it is likely that both scales are contaminated by differences in reading comprehension. The 2-D and 3-D spatial ability scales and a measure of reading comprehension were given to a random sample of 151 community college students. The sample partial correlation between the 3-D and 2-D scales, controlling for reading comprehension, was .70. The Fisher transformed partial correlation is 0.867, and Equation 2.10 with α = .05 gives 0.867 ± 1.96√(1/147) = [0.705, 1.029], which is reverse transformed to give [.61, .77].
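
The two-step interval of Equations 2.10 and 2.11ab is short to code. A sketch with hypothetical data and s = 1 control variable, mirroring the structure of Example 2.3:

set.seed(12)
n  <- 151
x0 <- rnorm(n)                  # control variable
x1 <- 0.6 * x0 + rnorm(n)
y  <- 0.6 * x0 + 0.5 * x1 + rnorm(n)

e_y  <- resid(lm(y ~ x0))       # remove the control variable from y
e_x1 <- resid(lm(x1 ~ x0))      # and from x1
r_p  <- cor(e_y, e_x1)          # estimated partial correlation

s  <- 1                                                # number of control variables
zr <- log((1 + r_p) / (1 - r_p)) / 2                   # Fisher transform
ci <- zr + c(-1, 1) * qnorm(.975) * sqrt(1 / (n - s - 3))
(exp(2 * ci) - 1) / (exp(2 * ci) + 1)                  # reverse transform (Equations 2.11ab)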

Comparing Two Correlations

Recall from Module 1 that Equations 1.9ab could be used to obtain a confidence interval for the difference between two Pearson or point-biserial correlations that were estimated from a two-group design. Equations 1.9ab also may be used to obtain confidence intervals for a difference between two squared multiple correlations, two semipartial correlations, or two partial correlations, where the two correlations have been estimated from a two-group design.

Standardized Slope Coefficients

An estimated standardized slope, denoted as β̂′_j, is computed from standardized y scores (y′_i = (y_i − μ̂_y)/σ̂_y) and standardized x_j scores (x′_ji = (x_ji − μ̂_xj)/σ̂_xj). It is not necessary to standardize indicator variables. Recall that β_j describes the relation between y and e_xj, where e_xj is the part of x_j that is not linearly related to the other predictor variables in the model. If the predictor variables are standardized and if x_j is predicted from all other predictor variables in the model, e_xj represents the part of x_j that is not linearly related to any of the other predictor variables. Then β′_j describes the change in y, in standard deviation units, associated with a 1-point increase in e_xj. However, it can be shown that the standard deviation of the e_xj scores is equal to √(1 − ρ²_xj.x), which will be less than 1 unless ρ²_xj.x = 0. For example, if ρ²_xj.x = .75, then the standard deviation of the e_xj scores is equal to √(1 − ρ²_xj.x) = .5, so that a 1-point increase in e_xj corresponds to a 2 standard deviation increase in e_xj. The fact that a 1-point increase in e_xj does not always correspond to a 1 standard deviation increase in e_xj makes the standardized slope difficult to interpret. If the model has only one predictor variable or if the predictor variables are uncorrelated, the standardized slope is equal to the Pearson correlation between y and x_j.

A standardized slope estimate can be expressed as β̂′_j = ρ̂_y(xj.x0)/√(1 − ρ̂²_xj.x). If ρ̂²_xj.x > 0, the standardized slope will be greater than the corresponding semipartial correlation. Unlike a semipartial correlation, which cannot be greater than 1 or less than −1, a standardized slope can have a value that is much greater than 1 or much less than −1 when ρ̂²_xj.x is large. Although most APA journals recommend the reporting of standardized slope estimates, semipartial correlations along with their confidence intervals are usually preferable.

An approximate 100(1 − α)% confidence interval for β′_j is

β̂′_j ± z_{α/2} SE_β̂′j     (2.12)

where SE_β̂′j is the standard error of the standardized slope. The formula for SE_β̂′j is complicated but can be obtained in R.
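
Standardized slope estimates themselves are easy to compute in R by z-scoring the variables before fitting (hypothetical data below); only the standard error requires specialized code.

set.seed(13)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 3 + 2 * d$x1 + 1.5 * d$x2 + rnorm(100, sd = 4)

dz <- data.frame(scale(d))             # z-score y and both predictors
coef(lm(y ~ x1 + x2, data = dz))[-1]   # estimated standardized slopes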

Example 2.4. In Example 2.1, life satisfaction (y), neuroticism (x_1), conscientiousness (x_2), and industriousness (x_3) were measured in a sample of n = 236 employees. The estimated standardized slopes and semipartial correlations, along with 95% confidence intervals for the population semipartial correlations and standardized slopes, are given below. Suppose industriousness (x_3) is the predictor variable of primary interest. After controlling for neuroticism and conscientiousness, we can be 95% confident that a one standard deviation increase in industriousness is associated with a .108 to .38 standard deviation increase in life satisfaction in the study population.

                 Semipartial   Standardized   95% CI for                95% CI for
Predictor        Correlation   Slope          Semipartial Correlation   Standardized Slope
neuroticism      …             …              [−.456, −.44]             [−0.554, −0.94]
conscientious    …             …              [−.53, −.031]             [−0.479, −0.17]
industrious      …             …              [.108, .38]               [0.170, 0.450]

Analysis of Covariance Model

An ANCOVA model is a GLM with one or more qualitative factors and one or more quantitative predictor variables. The quantitative predictor variables in the ANCOVA model are referred to as covariates. In experimental designs where participants are randomly assigned to the levels of a treatment factor, the covariates can be assumed to be uncorrelated with the indicator variables, and then the slope coefficients for the treatment indicator variables will not contain any confounding variable bias. In experimental designs, covariates are used primarily to reduce prediction error variance and secondarily because the researcher is interested in how the covariates relate to the dependent variable or interact with the treatment factor(s). Reducing prediction error variance has the beneficial effects of narrowing the confidence interval widths and increasing the power of the hypothesis tests.

The ANCOVA model is also used in nonexperimental designs where participants have self-selected into two or more treatment conditions. When participants self-select into treatment conditions, the slope coefficients for the treatment indicator variables will contain confounding variable bias because the participants in different treatment conditions could systematically differ in terms of attributes that are correlated with the response variable. If the most important confounding variables can be included in the model as covariates, the confounding variable bias could be substantially reduced, and then the slope coefficients for the treatment effects will be less misleading.

Consider an ANCOVA model for a 2-level factor and one covariate. Using dummy coding, the following model includes one covariate (x_1), one dummy coded variable (x_2), and their product (x_3 = x_1x_2)

y_i = β_0 + β_1x_1i + β_2x_2i + β_3x_3i + e_i.     (Model 3)

It is common to center the covariate by subtracting the overall sample covariate mean from every covariate score. The covariate is assumed to be centered in the above model. If a confidence interval for β_3 suggests that the interaction effect could be small, a confidence interval or test for β_2 would be examined. Otherwise, the simple slopes for the dummy coded variable would be examined at low and high values of the covariate.

The ANCOVA model can be represented as a multiple-group regression model. For example, the slope coefficients of Model 3 can be interpreted in terms of the parameters of a simple linear regression model for each of the two groups, as shown below

y_1i = β_10 + β_11x_11i + e_1i     (Model 4a)
y_2i = β_20 + β_21x_21i + e_2i     (Model 4b)

where the first subscript indicates group membership (1 or 2). When x_2 is a dummy coded variable in Model 3, it can be shown that β_2 = β_10 − β_20, β_1 = β_21, and β_3 = β_11 − β_21. When x_2 is an effect coded variable in Model 3, β_1 = (β_11 + β_21)/2, β_2 = (β_10 − β_20)/2, and β_3 = (β_11 − β_21)/2. In Models 4a and 4b, β_11 and β_21 are the simple slopes for x_1 at each level of the dummy variable.

Example 2.5. A researcher suspects that multiple attempts to recall recently learned information, even without performance feedback, will improve recall performance. A two-group experiment was conducted where all participants viewed a 50-minute video lecture. The first group was given a short test over the lecture material, without performance feedback, every day for five days. The second group was not tested during this 5-day period. Ten days later, all participants were given a test of the lecture material (scored 0 to 100). Using an ANOVA model y_i = β_0 + β_1x_1i + e_i, where x_1 is a dummy coded variable, the 95% confidence interval for β_1 was [1.1, 18.8], suggesting that attempts to recall information, even without feedback, will improve retention of learned material. The researcher also obtained the total SAT score for each participant, which is believed to correlate with the final test score. Using an ANCOVA model y_i = β_0 + β_1x_1i + β_2x_2i + β_3x_3i + e_i, where x_1 is the total SAT score and x_2 is the dummy coded variable, the interaction effect appeared to be small, and a 95% confidence interval for β_2 was [6.7, 13.2]. This confidence interval is considerably narrower, and hence more informative, than the confidence interval for β_1 using the ANOVA model.
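
A sketch of fitting a Model 3 style ANCOVA in R (hypothetical data, not the data from Example 2.5): center the covariate, then let the formula expand to the covariate, the dummy variable, and their product.

set.seed(14)
n  <- 80
x1 <- rnorm(n, mean = 50, sd = 10)   # covariate
x2 <- rep(0:1, each = n / 2)         # dummy coded treatment variable
y  <- 20 + 0.5 * x1 + 8 * x2 + rnorm(n, sd = 5)

x1c <- x1 - mean(x1)                 # center the covariate
fit <- lm(y ~ x1c * x2)              # x1c + x2 + x1c:x2
confint(fit)   # examine the interaction (x1c:x2) first, then the x2 coefficient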

Example 2.6. A researcher obtained a sample of retirees who began working crossword puzzles after retirement and another sample of retirees who did not take up crossword puzzles. Using an ANOVA model y_i = β_0 + β_1x_1i + e_i, where y is a measure of intelligence and x_1 is a dummy coded variable, the researcher obtained a 95% confidence interval for β_1 of [8.2, 12.5]. This result suggests that average intelligence is higher for retirees who began working crossword puzzles after retirement. However, this nonexperimental result does not imply that taking up crossword puzzles after retirement will cause an improvement in cognitive functioning. It is possible that retirees who take up crossword puzzles after retirement are more intelligent than retirees who do not take up crossword puzzles. Using an ANCOVA model y_i = β_0 + β_1x_1i + β_2x_2i + β_3x_3i + e_i, where x_1 is years of college education (an easily measured proxy of pre-retirement intelligence) and x_2 is the dummy coded variable, the interaction effect appeared to be small, and a 95% confidence interval for β_2 was [−0.4, 2.1]. This result suggests that taking up crossword puzzles after retirement may not have much of an effect on cognitive functioning.

The ANCOVA model has another important application in studies that compare two different treatments (e.g., treatment 1 and treatment 2) where certain participants are expected to benefit more from treatment 1 than from treatment 2. In these situations, it may be difficult to ethically justify an experiment that randomly assigns participants to treatment conditions. However, if a priority score can be assigned to each participant that assesses the degree to which a participant might benefit from treatment 1, participants with priority scores above some cut point would all be assigned to treatment 1 and participants with priority scores below the cut point would all be assigned to treatment 2. It is a remarkable fact that if the priority score is used as a covariate in a two-group ANCOVA, the slope coefficient for the treatment dummy variable will describe the causal effect of treatment on the response variable (assuming a linear relation between the priority scores and the response variable and no priority by treatment interaction). This type of design is called a regression discontinuity (RD) design.

Example 2.7. A new program to provide low income college-bound students with financial aid application training was evaluated using a random sample of 100 low income college-bound students from the Los Angeles school district. All students from families making less than $17,000 per year received the financial aid application training, and all other students received the usual assistance from their high school counselor. The dependent variable was the amount of financial aid obtained. The 95% confidence interval for the dummy variable slope coefficient was [$60, $918], indicating that if all low income college-bound students in the Los Angeles school district had received the financial aid application training, their mean financial aid would be $60 to $918 higher than if they had received the typical assistance provided by a high school counselor. Note that the researcher was able to make a causal claim about the effectiveness of the new program even though students were not randomly assigned to groups.

Although the RD design can reduce the "ethical costs" of a study, the RD design requires about 300% more participants than the corresponding two-group experimental design to achieve the same hypothesis testing power and confidence interval precision.

Thus, the RD design will have lower ethical costs combined with greater sampling costs (e.g., measurement costs, cost of administering treatment, payments to participants) than a two-group experimental design. The sampling costs of the RD design can be substantially reduced by specifying an "indifference range" of priority scores for which either treatment could be beneficial and then randomly assigning participants with priority scores within this range to one of the two treatment conditions. Participants with priority scores below the lower limit of the indifference range are assigned to treatment 2, and participants with priority scores above the upper limit of the indifference range are assigned to treatment 1. Increasing the width of the indifference range decreases the sample size requirements of the RD design. For example, with a wide indifference range that includes about 50% of the priority scores, the RD design requires only about 30% more participants than the corresponding two-group experimental design.

Exploratory Model Selection

In applications where the researcher has measured many potential predictor variables and wants to determine which of these variables are most useful in predicting y, there are exploratory procedures (e.g., forward selection, backward elimination, stepwise) in SPSS and R that will sift through a set of candidate predictor variables and identify the best subset. Although these exploratory procedures are popular, they can produce misleading results. In general, the p-values for the selected predictor variables will be too small, and the confidence intervals for the slope coefficients of the selected predictor variables will be too narrow.

If a large sample is available, the sample can be randomly divided into two samples. The exploratory model selection is performed in one sample (the training sample), and the selected model is then applied in the second sample (the test sample). Only the results in the test sample should be reported. Note, however, that the parameter estimates (e.g., ρ̂²_y.x, ρ̂_y(x1.x0), β̂_j, β̂′_j) obtained in the training sample tend to shrink towards 0 in the test sample, and the shrinkage can be substantial if the number of candidate predictor variables is large. The amount of shrinkage can be reduced by using a Bonferroni adjusted alpha level in the training sample variable selection process. LASSO regression (described in more advanced courses) is a newer method for selecting predictor variables in the training sample that tend to remain good predictors in the test sample.
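
A sketch of the training/test strategy in R with hypothetical data, using step() (one of the exploratory procedures mentioned above) for the selection step. Only the test-sample fit of the selected model would be reported.

set.seed(15)
d <- data.frame(matrix(rnorm(200 * 6), 200, 6))
names(d) <- paste0("x", 1:6)               # six candidate predictors
d$y <- 0.5 * d$x1 + 0.4 * d$x2 + rnorm(200)

train <- sample(nrow(d), nrow(d) / 2)                    # random half for selection
sel   <- step(lm(y ~ ., data = d[train, ]), trace = 0)   # backward elimination
summary(lm(formula(sel), data = d[-train, ]))            # refit and report in the test sample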

Assumptions

In fixed-x models, confidence intervals for slopes, simple slopes, η², and σ²_e assume: 1) random sampling, 2) independence among participants, 3) linearity between y and each predictor variable (linearity assumption), 4) constant variability of the prediction errors across the values of every predictor variable (equal prediction error variance assumption), and 5) approximate normality of the prediction errors in the study population (prediction error normality assumption). Scatterplots of y with each predictor variable are useful in assessing the linearity assumption. Scatterplots of the residuals with each predictor variable (called residual plots) are helpful in assessing the equal variance assumption. Skewness and kurtosis estimates of the residuals are useful in assessing the prediction error normality assumption. Transforming the response variable may reduce prediction error non-normality.

In random-x models, confidence intervals for slopes and simple slopes require the same assumptions given above for the fixed-x models. Confidence intervals for a squared multiple correlation, partial correlation, semipartial correlation, or standardized slope in random-x models require a stronger assumption: that y and all q predictor variables have an approximate multivariate normal distribution in the study population. The multivariate normality assumption implies the linearity and equal error variance assumptions of the fixed-x model, and also assumes that the predictor variables are linearly related to each other and each have an approximate normal distribution in the study population. To assess the multivariate normality assumption, assess the linearity and equal error variance assumptions as described above. Also examine scatterplots for all pairs of predictor variables to assess linearity of the predictor variables, and check all predictor variables for skewness and kurtosis. Transforming the predictor variables may reduce nonlinearity and nonnormality of the predictor variables.

Influential Observations

A participant with an unusually large residual (ê_i = y_i − ŷ_i) may excessively influence the least squares estimates of one or more β_j values. However, an examination of the residuals can be misleading because the least squares estimates of β_j have the property of minimizing the sum of the (y_i − ŷ_i)² values, and this tends to reduce the size of the residual for an outlier participant. A better approach is to compute ŷ_i for participant i using only the data from the other n − 1 participants and then subtract this predicted y score from y_i. These are called deleted residuals. The deleted residuals can be made easier to interpret by dividing them by their standard errors. Deleted residuals divided by their standard errors are called studentized deleted residuals.
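
In R, rstudent() returns the studentized deleted residuals without literally refitting the model n times. A sketch with hypothetical data containing one planted outlier:

set.seed(16)
d <- data.frame(x = rnorm(40))
d$y <- 1 + 0.5 * d$x + rnorm(40)
d$y[40] <- d$y[40] + 6                    # plant one outlier

fit <- lm(y ~ x, data = d)
head(sort(abs(rstudent(fit)), decreasing = TRUE))   # largest studentized deleted residuals
plot(fit)   # residual plots are also useful for the assumption checks above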


More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46 BIO5312 Biostatistics Lecture 10:Regression and Correlation Methods Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/1/2016 1/46 Outline In this lecture, we will discuss topics

More information

Chapter 14 Student Lecture Notes 14-1

Chapter 14 Student Lecture Notes 14-1 Chapter 14 Student Lecture Notes 14-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter 14 Multiple Regression Analysis and Model Building Chap 14-1 Chapter Goals After completing this

More information

Least Squares Analyses of Variance and Covariance

Least Squares Analyses of Variance and Covariance Least Squares Analyses of Variance and Covariance One-Way ANOVA Read Sections 1 and 2 in Chapter 16 of Howell. Run the program ANOVA1- LS.sas, which can be found on my SAS programs page. The data here

More information

Finding Relationships Among Variables

Finding Relationships Among Variables Finding Relationships Among Variables BUS 230: Business and Economic Research and Communication 1 Goals Specific goals: Re-familiarize ourselves with basic statistics ideas: sampling distributions, hypothesis

More information

Lecture 3: Multiple Regression. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II

Lecture 3: Multiple Regression. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II Lecture 3: Multiple Regression Prof. Sharyn O Halloran Sustainable Development Econometrics II Outline Basics of Multiple Regression Dummy Variables Interactive terms Curvilinear models Review Strategies

More information

Simple Linear Regression: One Qualitative IV

Simple Linear Regression: One Qualitative IV Simple Linear Regression: One Qualitative IV Simple linear regression with one qualitative IV variable is essentially identical to linear regression with quantitative variables. The primary difference

More information

Inference for the Regression Coefficient

Inference for the Regression Coefficient Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression line. We can shows that b 0 and b 1 are the unbiased estimates

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 1: August 22, 2012

More information

The One-Way Independent-Samples ANOVA. (For Between-Subjects Designs)

The One-Way Independent-Samples ANOVA. (For Between-Subjects Designs) The One-Way Independent-Samples ANOVA (For Between-Subjects Designs) Computations for the ANOVA In computing the terms required for the F-statistic, we won t explicitly compute any sample variances or

More information

Inference for Regression Simple Linear Regression

Inference for Regression Simple Linear Regression Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression p Statistical model for linear regression p Estimating

More information

Introduction to the Analysis of Variance (ANOVA) Computing One-Way Independent Measures (Between Subjects) ANOVAs

Introduction to the Analysis of Variance (ANOVA) Computing One-Way Independent Measures (Between Subjects) ANOVAs Introduction to the Analysis of Variance (ANOVA) Computing One-Way Independent Measures (Between Subjects) ANOVAs The Analysis of Variance (ANOVA) The analysis of variance (ANOVA) is a statistical technique

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

with the usual assumptions about the error term. The two values of X 1 X 2 0 1

with the usual assumptions about the error term. The two values of X 1 X 2 0 1 Sample questions 1. A researcher is investigating the effects of two factors, X 1 and X 2, each at 2 levels, on a response variable Y. A balanced two-factor factorial design is used with 1 replicate. The

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Chapter 16. Simple Linear Regression and dcorrelation

Chapter 16. Simple Linear Regression and dcorrelation Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

Multiple Linear Regression II. Lecture 8. Overview. Readings

Multiple Linear Regression II. Lecture 8. Overview. Readings Multiple Linear Regression II Lecture 8 Image source:http://commons.wikimedia.org/wiki/file:vidrarias_de_laboratorio.jpg Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution

More information

Multiple Linear Regression II. Lecture 8. Overview. Readings. Summary of MLR I. Summary of MLR I. Summary of MLR I

Multiple Linear Regression II. Lecture 8. Overview. Readings. Summary of MLR I. Summary of MLR I. Summary of MLR I Multiple Linear Regression II Lecture 8 Image source:http://commons.wikimedia.org/wiki/file:vidrarias_de_laboratorio.jpg Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution

More information

4:3 LEC - PLANNED COMPARISONS AND REGRESSION ANALYSES

4:3 LEC - PLANNED COMPARISONS AND REGRESSION ANALYSES 4:3 LEC - PLANNED COMPARISONS AND REGRESSION ANALYSES FOR SINGLE FACTOR BETWEEN-S DESIGNS Planned or A Priori Comparisons We previously showed various ways to test all possible pairwise comparisons for

More information

Multiple Regression. More Hypothesis Testing. More Hypothesis Testing The big question: What we really want to know: What we actually know: We know:

Multiple Regression. More Hypothesis Testing. More Hypothesis Testing The big question: What we really want to know: What we actually know: We know: Multiple Regression Ψ320 Ainsworth More Hypothesis Testing What we really want to know: Is the relationship in the population we have selected between X & Y strong enough that we can use the relationship

More information

Sociology 593 Exam 2 Answer Key March 28, 2002

Sociology 593 Exam 2 Answer Key March 28, 2002 Sociology 59 Exam Answer Key March 8, 00 I. True-False. (0 points) Indicate whether the following statements are true or false. If false, briefly explain why.. A variable is called CATHOLIC. This probably

More information

Chapter 4. Regression Models. Learning Objectives

Chapter 4. Regression Models. Learning Objectives Chapter 4 Regression Models To accompany Quantitative Analysis for Management, Eleventh Edition, by Render, Stair, and Hanna Power Point slides created by Brian Peterson Learning Objectives After completing

More information

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i

B. Weaver (24-Mar-2005) Multiple Regression Chapter 5: Multiple Regression Y ) (5.1) Deviation score = (Y i B. Weaver (24-Mar-2005) Multiple Regression... 1 Chapter 5: Multiple Regression 5.1 Partial and semi-partial correlation Before starting on multiple regression per se, we need to consider the concepts

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Review of the General Linear Model

Review of the General Linear Model Review of the General Linear Model EPSY 905: Multivariate Analysis Online Lecture #2 Learning Objectives Types of distributions: Ø Conditional distributions The General Linear Model Ø Regression Ø Analysis

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Confidence Intervals, Testing and ANOVA Summary

Confidence Intervals, Testing and ANOVA Summary Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0

More information

Inferences for Regression

Inferences for Regression Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Dr. Bob Gee Dean Scott Bonney Professor William G. Journigan American Meridian University 1 Learning Objectives Upon successful completion of this module, the student should

More information

Multiple Linear Regression II. Lecture 8. Overview. Readings

Multiple Linear Regression II. Lecture 8. Overview. Readings Multiple Linear Regression II Lecture 8 Image source:https://commons.wikimedia.org/wiki/file:autobunnskr%c3%a4iz-ro-a201.jpg Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution

More information

Multiple Linear Regression II. Lecture 8. Overview. Readings. Summary of MLR I. Summary of MLR I. Summary of MLR I

Multiple Linear Regression II. Lecture 8. Overview. Readings. Summary of MLR I. Summary of MLR I. Summary of MLR I Multiple Linear Regression II Lecture 8 Image source:https://commons.wikimedia.org/wiki/file:autobunnskr%c3%a4iz-ro-a201.jpg Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution

More information

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Structural Equation Modeling Topic 1: Correlation / Linear Regression Outline/Overview Correlations (r, pr, sr) Linear regression Multiple regression interpreting

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

Making sense of Econometrics: Basics

Making sense of Econometrics: Basics Making sense of Econometrics: Basics Lecture 4: Qualitative influences and Heteroskedasticity Egypt Scholars Economic Society November 1, 2014 Assignment & feedback enter classroom at http://b.socrative.com/login/student/

More information

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

In ANOVA the response variable is numerical and the explanatory variables are categorical.

In ANOVA the response variable is numerical and the explanatory variables are categorical. 1 ANOVA ANOVA means ANalysis Of VAriance. The ANOVA is a tool for studying the influence of one or more qualitative variables on the mean of a numerical variable in a population. In ANOVA the response

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

Business Statistics. Lecture 10: Correlation and Linear Regression

Business Statistics. Lecture 10: Correlation and Linear Regression Business Statistics Lecture 10: Correlation and Linear Regression Scatterplot A scatterplot shows the relationship between two quantitative variables measured on the same individuals. It displays the Form

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

10. Alternative case influence statistics

10. Alternative case influence statistics 10. Alternative case influence statistics a. Alternative to D i : dffits i (and others) b. Alternative to studres i : externally-studentized residual c. Suggestion: use whatever is convenient with the

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Lecture 6: Linear Regression (continued)

Lecture 6: Linear Regression (continued) Lecture 6: Linear Regression (continued) Reading: Sections 3.1-3.3 STATS 202: Data mining and analysis October 6, 2017 1 / 23 Multiple linear regression Y = β 0 + β 1 X 1 + + β p X p + ε Y ε N (0, σ) i.i.d.

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

Inferences About the Difference Between Two Means

Inferences About the Difference Between Two Means 7 Inferences About the Difference Between Two Means Chapter Outline 7.1 New Concepts 7.1.1 Independent Versus Dependent Samples 7.1. Hypotheses 7. Inferences About Two Independent Means 7..1 Independent

More information

General Linear Model (Chapter 4)

General Linear Model (Chapter 4) General Linear Model (Chapter 4) Outcome variable is considered continuous Simple linear regression Scatterplots OLS is BLUE under basic assumptions MSE estimates residual variance testing regression coefficients

More information

Sociology 593 Exam 1 Answer Key February 17, 1995

Sociology 593 Exam 1 Answer Key February 17, 1995 Sociology 593 Exam 1 Answer Key February 17, 1995 I. True-False. (5 points) Indicate whether the following statements are true or false. If false, briefly explain why. 1. A researcher regressed Y on. When

More information

Module 1. Study Populations

Module 1. Study Populations Module 1 Study Populations A study population is a clearly defined collection of people, animals, plants, or objects. In social and behavioral research, a study population usually consists of a specific

More information

CS 5014: Research Methods in Computer Science

CS 5014: Research Methods in Computer Science Computer Science Clifford A. Shaffer Department of Computer Science Virginia Tech Blacksburg, Virginia Fall 2010 Copyright c 2010 by Clifford A. Shaffer Computer Science Fall 2010 1 / 207 Correlation and

More information

This gives us an upper and lower bound that capture our population mean.

This gives us an upper and lower bound that capture our population mean. Confidence Intervals Critical Values Practice Problems 1 Estimation 1.1 Confidence Intervals Definition 1.1 Margin of error. The margin of error of a distribution is the amount of error we predict when

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

SPSS Output. ANOVA a b Residual Coefficients a Standardized Coefficients

SPSS Output. ANOVA a b Residual Coefficients a Standardized Coefficients SPSS Output Homework 1-1e ANOVA a Sum of Squares df Mean Square F Sig. 1 Regression 351.056 1 351.056 11.295.002 b Residual 932.412 30 31.080 Total 1283.469 31 a. Dependent Variable: Sexual Harassment

More information

Simple, Marginal, and Interaction Effects in General Linear Models

Simple, Marginal, and Interaction Effects in General Linear Models Simple, Marginal, and Interaction Effects in General Linear Models PRE 905: Multivariate Analysis Lecture 3 Today s Class Centering and Coding Predictors Interpreting Parameters in the Model for the Means

More information

Regression With a Categorical Independent Variable

Regression With a Categorical Independent Variable Regression With a Categorical Independent Variable Lecture 15 March 17, 2005 Applied Regression Analysis Lecture #15-3/17/2005 Slide 1 of 29 Today s Lecture» Today s Lecture» Midterm Note» Example Regression

More information

Research Design - - Topic 19 Multiple regression: Applications 2009 R.C. Gardner, Ph.D.

Research Design - - Topic 19 Multiple regression: Applications 2009 R.C. Gardner, Ph.D. Research Design - - Topic 19 Multiple regression: Applications 2009 R.C. Gardner, Ph.D. Curve Fitting Mediation analysis Moderation Analysis 1 Curve Fitting The investigation of non-linear functions using

More information

Classification: Linear Discriminant Analysis

Classification: Linear Discriminant Analysis Classification: Linear Discriminant Analysis Discriminant analysis uses sample information about individuals that are known to belong to one of several populations for the purposes of classification. Based

More information

Longitudinal Data Analysis of Health Outcomes

Longitudinal Data Analysis of Health Outcomes Longitudinal Data Analysis of Health Outcomes Longitudinal Data Analysis Workshop Running Example: Days 2 and 3 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development

More information

Correlation and Linear Regression

Correlation and Linear Regression Correlation and Linear Regression Correlation: Relationships between Variables So far, nearly all of our discussion of inferential statistics has focused on testing for differences between group means

More information

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47 ECON2228 Notes 2 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 2 2014 2015 1 / 47 Chapter 2: The simple regression model Most of this course will be concerned with

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts

A Study of Statistical Power and Type I Errors in Testing a Factor Analytic. Model for Group Differences in Regression Intercepts A Study of Statistical Power and Type I Errors in Testing a Factor Analytic Model for Group Differences in Regression Intercepts by Margarita Olivera Aguilar A Thesis Presented in Partial Fulfillment of

More information

Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons

Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons 1. Suppose we wish to assess the impact of five treatments while blocking for study participant race (Black,

More information

Black White Total Observed Expected χ 2 = (f observed f expected ) 2 f expected (83 126) 2 ( )2 126

Black White Total Observed Expected χ 2 = (f observed f expected ) 2 f expected (83 126) 2 ( )2 126 Psychology 60 Fall 2013 Practice Final Actual Exam: This Wednesday. Good luck! Name: To view the solutions, check the link at the end of the document. This practice final should supplement your studying;

More information

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables.

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables. Regression Analysis BUS 735: Business Decision Making and Research 1 Goals of this section Specific goals Learn how to detect relationships between ordinal and categorical variables. Learn how to estimate

More information

Statistics Introductory Correlation

Statistics Introductory Correlation Statistics Introductory Correlation Session 10 oscardavid.barrerarodriguez@sciencespo.fr April 9, 2018 Outline 1 Statistics are not used only to describe central tendency and variability for a single variable.

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Psychology 282 Lecture #4 Outline Inferences in SLR

Psychology 282 Lecture #4 Outline Inferences in SLR Psychology 282 Lecture #4 Outline Inferences in SLR Assumptions To this point we have not had to make any distributional assumptions. Principle of least squares requires no assumptions. Can use correlations

More information

Reducing Computation Time for the Analysis of Large Social Science Datasets

Reducing Computation Time for the Analysis of Large Social Science Datasets Reducing Computation Time for the Analysis of Large Social Science Datasets Douglas G. Bonett Center for Statistical Analysis in the Social Sciences University of California, Santa Cruz Jan 28, 2014 Overview

More information