Regression With a Categorical Independent Variable

Regression With a Categorical Independent Variable. ERSH 8320. Slide 1 of 34

Today's Lecture: Regression with a single categorical independent variable. Coding procedures for analysis: dummy coding. The relationship between regression with a categorical independent variable and other statistical methods. Slide 2 of 34

Regression with Continuous Variables. Linear regression regresses a continuous-valued dependent variable, Y, onto a set of continuous-valued independent variables, X. The regression line gives the estimate of the mean of Y conditional on the values of X, or E(Y|X). But what happens when some or all independent variables are categorical in nature? Is the point of the regression still to determine E(Y|X) across the levels of X? Can't we just put the categorical variables into SPSS and push the "Continue" button? Slide 3 of 34

Example Data Set. Neter (1996, p. 676): The Kenton Food Company wished to test four different package designs for a new breakfast cereal. Twenty stores, with approximately equal sales volumes, were selected as the experimental units. Each store was randomly assigned one of the package designs, with each package design assigned to five stores. The stores were chosen to be comparable in location and sales volume. Other relevant conditions that could affect sales, such as price, amount and location of shelf space, and special promotional efforts, were kept the same for all of the stores in the experiment. Slide 4 of 34

Cereal. (Figure: the four cereal package designs.) Slide 5 of 34

A Regular Regression? (Scatterplot of Number of Cases Sold against Package Type, with fitted line: Number of Cases Sold = 7.70 + 4.38 × Package; R Square = 0.64.) What is wrong with this picture? Slide 6 of 34

Categorical variables commonly occur in research settings. Another term sometimes used to describe categorical variables is qualitative variables. A strict definition of a qualitative or categorical variable is a variable that has a finite number of levels. Continuous (or quantitative) variables, alternatively, have infinitely many levels. Often this is assumed more than practiced: quantitative variables often have only countably many levels, and the level of precision of an instrument can limit the number of levels of a quantitative variable. Slide 7 of 34

Research Design. Categorical variables can occur in many different research designs: experimental research, quasi-experimental research, and nonexperimental/observational research. Such variables can be used with regression for prediction or for explanation. Slide 8 of 34

Analysis Specifics. Because of the nature of categorical variables, the emphasis of the regression is not on linear trends but on differences between the means of Y at each level of the category. Not all categorical variables are ordered (e.g., cereal box type, gender, etc.). When considering differences in the mean of the dependent variable, the type of analysis being conducted by a regression is commonly called an ANalysis Of VAriance (ANOVA). Combining categorical and continuous variables in the same regression is called ANalysis Of CoVAriance (ANCOVA; Chapters 14 and 15). Slide 9 of 34

Example Variable: Two Categories. From Pedhazur (1997, p. 343): Assume that the data reported below were obtained in an experiment in which E represents an experimental group and C represents a control group.

                  E    C
Y:               20   10
                 18   12
                 17   11
                 17   15
                 13   17
ΣY               85   65
Ȳ                17   13
Σ(Y − Ȳ)² = Σy²  26   34

Slide 10 of 34

Old School Statistics: The t-test. As you may recall from an earlier course on statistics, an easy way to determine whether the means of the two conditions differ significantly is to use a t-test with (n1 + n2 − 2) degrees of freedom.

H0: µ1 = µ2
HA: µ1 ≠ µ2

t = (Ȳ1 − Ȳ2) / sqrt( [(Σy1² + Σy2²)/(n1 + n2 − 2)] × (1/n1 + 1/n2) )

Slide 11 of 34

Old School Statistics: The t-test.

t = (17 − 13) / sqrt( [(26 + 34)/(5 + 5 − 2)] × (1/5 + 1/5) ) = 4/sqrt(3) = 2.31

From Excel ( =TDIST(2.31, 8, 2) ), p = 0.0496. If we used a Type-I error rate of 0.05, we would reject the null hypothesis and conclude that the means of the two groups were significantly different. But what if we had more than two groups? This type of problem can be solved equivalently from within the context of the General Linear Model. Slide 12 of 34
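The arithmetic above is easy to reproduce; here is a minimal plain-Python sketch (not part of the original slides) that computes the pooled two-sample t statistic for the Pedhazur data:

```python
import math

# Two-group data from the Pedhazur example (E = experimental, C = control)
E = [20, 18, 17, 17, 13]
C = [10, 12, 11, 15, 17]

def pooled_t(y1, y2):
    """Two-sample t statistic using a pooled variance estimate."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    ss1 = sum((y - m1) ** 2 for y in y1)  # within-group sum of squares
    ss2 = sum((y - m2) ** 2 for y in y2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

t = pooled_t(E, C)
print(round(t, 2))  # 2.31
```

The within-group sums of squares (26 and 34) and the means (17 and 13) match the table on the earlier slide.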

When using categorical variables in regression, the levels of the categories must be recoded from their original values to ensure that the regression model truly estimates the mean differences across the levels of the categories. Several types of coding strategies are common: dummy coding and effect coding. Each type will produce the same fit of the model (via R²). The estimated regression parameters differ across coding types, reflecting the different approaches employed by each type of coding. The choice of coding method does not depend on the type of research or on the purpose (explanation or prediction) of the analysis. Slide 13 of 34

Definition: a code is "a set of symbols to which meanings can be assigned" (Pedhazur, 1997, p. 342). The assignment of symbols follows a rule (or set of rules) determined by the categories of the variable used. Typically, symbols represent the respective levels of a categorical variable. All entities with the same symbol are considered alike (or homogeneous) within that category level. Category levels must be predetermined prior to the analysis. Some variables are obviously categorical, such as gender. Some variables are not so obviously categorical, such as political affiliation. Slide 14 of 34

Dummy Coding. The most straightforward method of coding categorical variables is dummy coding. In dummy coding, one creates a set of column vectors that represent the membership of an observation in a given category level. If an observation is a member of a specific category level, it is given a value of 1 in that category level's column vector. If an observation is not a member of a specific category, it is given a value of 0 in that category level's column vector. Slide 15 of 34

For each observation, no more than a single 1 will appear in the set of column vectors for that variable. The column vectors serve as the predictor variables in a regression analysis, where the dependent variable is modeled as a function of these columns. Because of linear dependence with an intercept, one category-level vector is often excluded from the analysis. Because all observations at a given category level have the same values across the set of predictors, the predicted value of the dependent variable, Y, will be identical for all observations within a category. The set of category vectors (and a vector for an intercept) are then used as input into a regression model. Slide 16 of 34
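The coding rule above is mechanical, so it is easy to sketch in a few lines of plain Python (the helper name `dummy_code` is mine, not from the slides):

```python
def dummy_code(labels):
    """Build one 0/1 column vector per category level of `labels`."""
    levels = sorted(set(labels))
    return {lvl: [1 if lab == lvl else 0 for lab in labels] for lvl in levels}

# Ten observations: five in the E group followed by five in the C group
cols = dummy_code(["E"] * 5 + ["C"] * 5)
print(cols["E"])  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(cols["C"])  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Note that each observation carries exactly one 1 across the set of columns, which is the source of the linear dependence with an intercept discussed above.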

Dummy Coded Regression Example.

       Y   X1   X2   X3   Group
      20    1    1    0   E
      18    1    1    0   E
      17    1    1    0   E
      17    1    1    0   E
      13    1    1    0   E
      10    1    0    1   C
      12    1    0    1   C
      11    1    0    1   C
      15    1    0    1   C
      17    1    0    1   C
Mean  15    1   0.5  0.5
SS   100    0   2.5  2.5

Σyx2 = 10, Σyx3 = −10

Slide 17 of 34

Dummy Coded Regression. The General Linear Model states that the estimated regression parameters are given by: b = (X′X)⁻¹X′y. From the previous slide, you can see what our entries for X could be, but notice that X1 = X2 + X3. This linear dependency means that (X′X) is a singular matrix, so no inverse exists. Using any combination of two of the columns would rid us of the linear dependency. Slide 18 of 34

Dummy Coded Regression - X2 and X3. For our first example analysis, consider the regression of Y on X2 and X3 (no intercept): Y = b2 X2 + b3 X3 + e.

b2 = 17, b3 = 13
Σy² = 100, SS_res = 60
SS_reg = 100 − 60 = 40
R² = 40/100 = 0.4

Slide 19 of 34
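Because X2 and X3 never overlap, X′X is diagonal here and each no-intercept weight reduces to a within-group mean; a plain-Python sketch (not from the slides) confirms the numbers above:

```python
# Pedhazur two-group data and their dummy columns
Y  = [20, 18, 17, 17, 13, 10, 12, 11, 15, 17]
X2 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # E membership
X3 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # C membership

# With non-overlapping dummies and no intercept, each weight is
# sum(x*y) / sum(x*x), i.e. the mean of Y within that group.
b2 = sum(x * y for x, y in zip(X2, Y)) / sum(x * x for x in X2)
b3 = sum(x * y for x, y in zip(X3, Y)) / sum(x * x for x in X3)

yhat = [b2 * x2 + b3 * x3 for x2, x3 in zip(X2, X3)]
ybar = sum(Y) / len(Y)
ss_tot = sum((y - ybar) ** 2 for y in Y)               # 100
ss_res = sum((y - yh) ** 2 for y, yh in zip(Y, yhat))  # 60
r2 = (ss_tot - ss_res) / ss_tot
print(b2, b3, r2)  # 17.0 13.0 0.4
```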

Dummy Coded Regression - X2 and X3. b2 = 17 is the mean for the E category. b3 = 13 is the mean for the C category. Without an intercept, the model is fairly easy to interpret. For more advanced models, an intercept will prove to be helpful in interpretation. Slide 20 of 34

Dummy Coded Regression - X1 and X2. For our second example analysis, consider the regression of Y on X1 and X2 (an intercept and one dummy vector): Y = a + b2 X2 + e.

a = 13, b2 = 4
Σy² = 100, SS_res = 60
SS_reg = 100 − 60 = 40
R² = 40/100 = 0.4

Slide 21 of 34

Dummy Coded Regression - X1 and X2. a = 13 is the mean for the C category. b2 = 4 is the mean difference between the E category and the C category. The C category is called the reference category. For members of the C category: Y = a + b2 X2 = 13 + 4(0) = 13. For members of the E category: Y = a + b2 X2 = 13 + 4(1) = 17. With the intercept, the model parameters are now different from those of the first example. The fit of the model, however, is the same. Slide 22 of 34
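The same reference-category interpretation falls out of the ordinary simple-regression formulas; a short Python sketch (mine, not the slides'):

```python
# Pedhazur two-group data with a single dummy; C (coded 0) is the reference
Y  = [20, 18, 17, 17, 13, 10, 12, 11, 15, 17]
X2 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

n = len(Y)
xbar, ybar = sum(X2) / n, sum(Y) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X2, Y))
sxx = sum((x - xbar) ** 2 for x in X2)
b2 = sxy / sxx        # slope = mean(E) - mean(C)
a = ybar - b2 * xbar  # intercept = mean of the reference group, C
print(a, b2)  # 13.0 4.0
```

Plugging in X2 = 0 and X2 = 1 reproduces the two group means, 13 and 17.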

Dummy Coded Regression - X1 and X3. For our third example analysis, consider the regression of Y on X1 and X3: Y = a + b3 X3 + e.

a = 17, b3 = −4
Σy² = 100, SS_res = 60
SS_reg = 100 − 60 = 40
R² = 40/100 = 0.4

Slide 23 of 34

Dummy Coded Regression - X1 and X3. a = 17 is the mean for the E category. b3 = −4 is the mean difference between the C category and the E category. The E category is called the reference category. For members of the E category: Y = a + b3 X3 = 17 − 4(0) = 17. For members of the C category: Y = a + b3 X3 = 17 − 4(1) = 13. With the intercept, the model parameters are again different from those of the first example. The fit of the model, however, is the same. Slide 24 of 34

Hypothesis Test of the Regression Coefficient. Because each model had the same value of R² and the same number of degrees of freedom for the regression (1), all hypothesis tests of the model parameters result in the same value of the test statistic:

F = (R²/k) / [(1 − R²)/(N − k − 1)] = (0.4/1) / [(1 − 0.4)/(10 − 1 − 1)] = 5.33

From Excel ( =FDIST(5.33, 1, 8) ), p = 0.0496. If we used a Type-I error rate of 0.05, we would reject the null hypothesis and conclude that the regression coefficient in each analysis is significantly different from zero. Slide 25 of 34

Hypothesis Test of the Regression Coefficient. Recall from the t-test of the mean difference that t = 2.31. For the test of the coefficient, notice that F = t². Also notice that the p-values for the two hypothesis tests were the same, p = 0.0496. The test of the regression coefficient is equivalent to running a t-test when using a single categorical variable with two categories. Slide 26 of 34
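The F = t² equivalence can be checked numerically with the values derived on the earlier slides (a plain-Python sketch, not part of the deck):

```python
import math

t = 4 / math.sqrt(3)                    # pooled t from the earlier slide (~2.31)
r2, k, n = 0.4, 1, 10                   # fit of the dummy-coded regression
F = (r2 / k) / ((1 - r2) / (n - k - 1)) # omnibus F with (1, 8) df
print(round(F, 2), round(t ** 2, 2))  # 5.33 5.33
```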

Breakfast Cereal Example. Generalizing the concept of dummy coding, we revisit our first example data set, the cereal experiment data. Recall that there were four different types of cereal boxes. A dummy coding scheme would involve the creation of four new column vectors, each representing observations from one box type. Just as was the case with two categories, a linear dependency is created if we want to use all four vectors. Therefore, we must choose which category to omit from the analysis. Slide 27 of 34

One-Way Analysis of Variance. Just as was the case for the example with two categories, a multiple-category regression model with a single categorical independent variable has a direct link to a statistical test you may be familiar with. The regression model tests for mean differences across all pairings of category levels simultaneously. Testing for a difference between multiple groups equates to a one-way ANOVA model (for a model with a single categorical independent variable). Slide 28 of 34

 Y   X1  X2  X3  X4  X5  Type
11    1   1   0   0   0   1
17    1   1   0   0   0   1
16    1   1   0   0   0   1
14    1   1   0   0   0   1
15    1   1   0   0   0   1
12    1   0   1   0   0   2
10    1   0   1   0   0   2
15    1   0   1   0   0   2
19    1   0   1   0   0   2
11    1   0   1   0   0   2
23    1   0   0   1   0   3
20    1   0   0   1   0   3
18    1   0   0   1   0   3
17    1   0   0   1   0   3
19    1   0   0   1   0   3
27    1   0   0   0   1   4
33    1   0   0   0   1   4
22    1   0   0   0   1   4
26    1   0   0   0   1   4
28    1   0   0   0   1   4

Slide 28-1

Breakfast Cereal Example. To make things interesting, let's drop X5 from our analysis: Y = a + b2 X2 + b3 X3 + b4 X4 + e. Because X5 (representing box type four) was omitted from our model, the estimated intercept parameter now represents the mean for the box-type-four group. All other parameters represent the difference between their respective category level and category level four with respect to the dependent variable.

a = 27.2, b2 = −12.6, b3 = −13.8, b4 = −7.8

Slide 29 of 34

Breakfast Cereal Example. Therefore:

Ȳ_A = Ŷ_A = a + b2(1) + b3(0) + b4(0) = 27.2 − 12.6 = 14.6
Ȳ_B = Ŷ_B = a + b2(0) + b3(1) + b4(0) = 27.2 − 13.8 = 13.4
Ȳ_C = Ŷ_C = a + b2(0) + b3(0) + b4(1) = 27.2 − 7.8 = 19.4
Ȳ_D = Ŷ_D = a + b2(0) + b3(0) + b4(0) = 27.2

Slide 30 of 34
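With dummy coding, these fitted values are just the group means, which can be verified directly from the data table (a plain-Python sketch; the grouping of the data by box type is mine):

```python
# Cereal sales by box type, from the data table; box type 4 is the reference
groups = {
    1: [11, 17, 16, 14, 15],
    2: [12, 10, 15, 19, 11],
    3: [23, 20, 18, 17, 19],
    4: [27, 33, 22, 26, 28],
}
means = {g: sum(ys) / len(ys) for g, ys in groups.items()}

a = means[4]                                     # intercept = reference mean
b = {g: means[g] - means[4] for g in (1, 2, 3)}  # b2, b3, b4 from the slide
print(a, round(b[1], 1), round(b[2], 1), round(b[3], 1))
# 27.2 -12.6 -13.8 -7.8
```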

Hypothesis Test. To test that all means are equal to each other (H0: µ1 = µ2 = ... = µk) against the hypothesis that at least one mean differs (H1: at least one µj ≠ µj′), called an omnibus test, the same hypothesis test from before can be used. For the cereal data:

Σy² = 1013.0
SS_res = 158.4
SS_reg = 1013.0 − 158.4 = 854.6
R² = 854.6/1013.0 = 0.844

Slide 31 of 34

Hypothesis Tests.

F = (R²/k) / [(1 − R²)/(N − k − 1)] = (0.844/3) / [(1 − 0.844)/(20 − 3 − 1)] = 28.77

From Excel ( =FDIST(28.77, 3, 16) ), p = 0.000001. If we used a Type-I error rate of 0.05, we would reject the null hypothesis and conclude that at least one regression coefficient in this analysis is significantly different from zero. A regression coefficient of zero means zero difference between two means (the reference category and the specific category being compared). All regression coefficients being zero means no difference between any of the means. Slide 32 of 34
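Using the sums of squares reported on the previous slide, the omnibus F takes a couple of lines of plain Python (a sketch that assumes the slide's totals as given):

```python
ss_tot, ss_res = 1013.0, 158.4  # totals reported on the slide
ss_reg = ss_tot - ss_res        # 854.6
r2 = ss_reg / ss_tot            # ~0.844
k, n = 3, 20                    # three dummy vectors, twenty stores
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(r2, 3), round(F, 2))  # 0.844 28.77
```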

Final Thought. Regression with categorical variables can be accomplished by coding schemes. Differing ways of coding (or inclusion of certain coded column vectors) may change the interpretation of the model parameters, but will not change the overall fit of the model. But what can be said about comparing pair-wise mean differences in a simultaneous regression model? I will leave you in suspense... Slide 33 of 34

Next Time. Chapter 15: ANCOVA. Slide 34 of 34