Lecture 2: Categorical Variables

A nice book about categorical variables is An Introduction to Categorical Data Analysis by Alan Agresti.
Categorical Variable

A categorical variable is a qualitative variable. One example is the dummy variable gender, which equals 1 for a male worker and 0 for a female worker. Here the numbers 1 and 0 have no numerical meaning (they do not imply, say, 1 > 0, or 1 = 1 + 0). Categorical variables can take more than two values. For example, transportation mode can be walk, drive, or use public transportation. Here, we have three string values.
Descriptive Statistics

In most cases, it is inappropriate to report the mean value of a categorical variable. After all, what is the meaning of the average transportation mode? It makes sense to report the count and proportion (percentage, frequency) for categorical variables. We want to know how many people, or what percentage of the population, use public transportation versus walking or driving.
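Counts and proportions are easy to compute by hand; here is a minimal sketch in Python (the lecture's own commands are in R), using a made-up sample of transportation modes:

```python
from collections import Counter

# Hypothetical sample of transportation modes (not the lecture's data).
modes = ["walk", "drive", "public", "drive", "public", "public"]

counts = Counter(modes)                      # absolute counts per category
n = len(modes)
proportions = {k: v / n for k, v in counts.items()}

print(counts["public"])       # 3
print(proportions["public"])  # 0.5
```

Reporting `counts` and `proportions` is the categorical analogue of reporting a mean and standard deviation for a numerical variable.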
Dummy Variable

A dummy (binary, indicator, dichotomous) variable can only take the values 1 and 0. It follows a Bernoulli distribution with probabilities $p_1 \equiv P(y = 1)$ and $p_0 \equiv P(y = 0) = 1 - p_1$. For a dummy variable, we can prove that

$$E(y) = p_1 \quad (1)$$
$$\mathrm{var}(y) = p_1(1 - p_1) \quad (2)$$

So for dummy variables, the mean value is the same as the probability (proportion) of $y = 1$. For example, if $y$ is gender, then the sample average is the percentage of male workers (if we let $y = 1$ for male workers).
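The claim that the sample mean of a dummy equals the proportion of ones can be checked numerically; a small Python sketch with a hypothetical 0/1 variable:

```python
import numpy as np

# Hypothetical dummy variable: 1 = male worker, 0 = female worker.
y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

p1_hat = y.mean()                 # sample proportion of y = 1
var_hat = p1_hat * (1 - p1_hat)   # the Bernoulli variance p1 * (1 - p1)

print(p1_hat)   # 0.7, i.e. 70% of the sample are male workers
print(var_hat)  # approximately 0.21
```

Note that `var_hat` coincides with the (population-style) sample variance `y.var()`, as formula (2) predicts.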
Pie Graph

A pie graph can be used to illustrate proportions (percentages). For example, the graph below shows that the US economy was in recession in 15% of quarters from Q1 1947 to Q4 2015.

Figure 1: Pie Chart of Economic Recession (Recession 15%, Boom 85%)
Bar Graph

Alternatively, the bar graph below reports counts: the economy is booming in more than 200 quarters, and in recession in about 50 quarters.

Figure 2: Bar Graph (counts of Boom vs. Recession quarters)
Two-Way Table: Joint Distribution

A two-way table can report either counts or percentages when there are two categorical variables. For example, the categorical variable highinf is no if the annualized quarterly inflation rate is less than 3% and is yes otherwise. The categorical variable recession is no if the annualized quarterly GDP growth rate is positive and is yes otherwise. The two-way table below reports that, for example, in 24 quarters the economy suffers from both recession and high inflation. We obtain the joint distribution if we divide each count by the total sum. For example, $P(recession = no, highinf = yes) = \frac{118}{116 + 118 + 17 + 24} = 0.429$. The R command for the two-way table is table(x,y). We get a one-way table for x if we drop y.

                highinf
    recession    no   yes
      no        116   118
      yes        17    24
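The joint distribution can be reproduced directly from the counts; a Python sketch (the lecture uses R's table(x,y)) with the table above:

```python
import numpy as np

# Counts from the lecture's two-way table:
# rows = recession (no, yes), columns = highinf (no, yes).
counts = np.array([[116, 118],
                   [ 17,  24]])

joint = counts / counts.sum()   # divide every cell by the grand total

print(round(joint[0, 1], 3))    # P(recession = no, highinf = yes) = 0.429
```

The four entries of `joint` sum to one, as any joint distribution must.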
Exercises

1. Find $P(recession = no, highinf = no)$
2. Find $P(recession = yes)$
3. Find $P(highinf = yes)$
4. Find $P(recession = yes \mid highinf = no)$
Two-Way Table: Marginal Distribution I

We obtain the marginal distribution for recession by adding the counts horizontally. That is,

$$P(recession = no) = P(recession = no, highinf = no) + P(recession = no, highinf = yes) \quad (3)$$

The R command to obtain the marginal counts is margin.table(table(x,y), 1), or, equivalently, table(x). The latter is based on the one-way table of x.

    recession
      no  yes
     234   41

Please verify that 234 = 116 + 118 and 41 = 17 + 24.
Two-Way Table: Marginal Distribution II

The marginal probability can be computed from either the one-way table or the two-way table:

$$P(recession = no) = \frac{234}{234 + 41} = \frac{116}{116 + 118 + 17 + 24} + \frac{118}{116 + 118 + 17 + 24} \quad (4)$$

The R command to obtain the marginal probabilities is prop.table(table(x))

    recession
            no       yes
     0.8509091 0.1490909
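Both routes to the marginal distribution, summing rows of the two-way table and then normalizing, can be checked in a few lines of Python (standing in for R's margin.table and prop.table):

```python
import numpy as np

counts = np.array([[116, 118],
                   [ 17,  24]])   # rows = recession, columns = highinf

row_margin = counts.sum(axis=1)          # marginal counts: 234 and 41
p_recession = row_margin / counts.sum()  # marginal distribution of recession

print(row_margin.tolist())   # [234, 41]
print(round(p_recession[0], 7))   # 0.8509091, matching prop.table(table(x))
```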
Two-Way Table: Conditional Probability

From probability theory,

$$P(x = x_i \mid y = y_j) = \frac{P(x = x_i, y = y_j)}{P(y = y_j)} \quad (5)$$

The R command to obtain the conditional probabilities of x given y is prop.table(table(x,y), 2)

                highinf
    recession        no       yes
      no      0.8721805 0.8309859
      yes     0.1278195 0.1690141
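Conditioning on highinf amounts to dividing each column of the count table by its column total, which is what prop.table(table(x,y), 2) does in R; a Python sketch:

```python
import numpy as np

counts = np.array([[116, 118],
                   [ 17,  24]])   # rows = recession, columns = highinf

col_margin = counts.sum(axis=0)   # column totals: 133 and 142
cond = counts / col_margin        # broadcasting divides each column by its total

# cond[:, 0] is P(recession | highinf = no); cond[:, 1] is P(recession | highinf = yes)
print(round(cond[0, 0], 7))       # 0.8721805
```

Each column of `cond` sums to one, since it is a conditional distribution over recession.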
Statistical Independence

From probability theory, we know that the joint probability $P(A \cap B)$ equals the product of the conditional probability $P(A \mid B)$ and the marginal probability $P(B)$:

$$P(A \cap B) = P(A \mid B)P(B) \quad (6)$$

If A and B are independent, then the conditional probability is the same as the unconditional (marginal) probability: $P(A \mid B) = P(A)$. In that case

$$P(A \cap B) = P(A)P(B) \quad \text{(if A and B are independent)}$$

In general, two random variables are independent if

$$P(x = x_i, y = y_j) = P(x = x_i)P(y = y_j), \quad \forall (x_i, y_j) \quad (7)$$
Chi-squared Test for Statistical Independence

Let $n$ be the total count, and let $n_{ij}$ be the actual count of observations satisfying $(x = x_i, y = y_j)$. Finally, let $p_{ij} = P(x = x_i, y = y_j)$. Under the null hypothesis of independence

$$H_0: x \text{ and } y \text{ are independent}$$

we have $p_{ij} = P(x = x_i)P(y = y_j)$. The main idea of the chi-squared test is to compare the actual count $n_{ij}$ to the theoretical count under the null hypothesis, $np_{ij} = nP(x = x_i)P(y = y_j)$. A big difference leads to rejection.

$$\text{Chi-squared Test} = \sum_{i,j} \frac{\left(n_{ij} - nP(x = x_i)P(y = y_j)\right)^2}{nP(x = x_i)P(y = y_j)} \quad (8)$$

This test statistic follows a $\chi^2$ distribution under the null hypothesis, with $(r-1)(c-1)$ degrees of freedom for an $r \times c$ table when the marginal probabilities are estimated from the data.
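The statistic in (8) can be computed directly from the lecture's table; a Python sketch, with the marginal probabilities estimated from the data:

```python
import numpy as np

counts = np.array([[116, 118],
                   [ 17,  24]])
n = counts.sum()

# Expected counts under independence: n * P(x = x_i) * P(y = y_j),
# using the estimated marginal probabilities.
p_row = counts.sum(axis=1) / n
p_col = counts.sum(axis=0) / n
expected = n * np.outer(p_row, p_col)

chi2 = ((counts - expected) ** 2 / expected).sum()
print(round(chi2, 2))   # about 0.92, well below the 5% critical value 3.84 (df = 1)
```

With a statistic of roughly 0.92 against a $\chi^2_1$ critical value of 3.84, we fail to reject independence of recession and highinf at the 5% level.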
Logistic Regression

Consider a binary dependent variable y and an independent variable x. The logistic regression specifies the probability as

$$P(y = 1) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \quad (9)$$

It follows that the odds $\frac{P(y=1)}{1 - P(y=1)}$ are given by

$$\frac{P(y = 1)}{1 - P(y = 1)} = e^{\beta_0 + \beta_1 x} \quad (10)$$

Therefore, the odds when x = 1 relative to the odds when x = 0, the odds ratio, is

$$\frac{\left.\frac{P(y=1)}{1-P(y=1)}\right|_{x=1}}{\left.\frac{P(y=1)}{1-P(y=1)}\right|_{x=0}} = e^{\beta_1} \quad (11)$$

How to interpret $e^{\beta_0}$?
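A quick numerical check of (10) and (11) in Python; the coefficient values below are made up purely for illustration:

```python
import numpy as np

b0, b1 = -1.0, 0.5   # hypothetical coefficients, not estimates from data

def p_y1(x):
    """Logistic probability P(y = 1 | x), equation (9)."""
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

def odds(x):
    """Odds P(y = 1) / (1 - P(y = 1)); algebraically equal to exp(b0 + b1 * x)."""
    return p_y1(x) / (1 - p_y1(x))

odds_ratio = odds(1) / odds(0)
print(np.isclose(odds_ratio, np.exp(b1)))   # True: the odds ratio is e^{beta_1}
```

Note that `odds(0)` equals $e^{\beta_0}$, which answers the slide's question: $e^{\beta_0}$ is the odds of $y = 1$ when $x = 0$.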
Maximum Likelihood Estimation

The density function for a Bernoulli distribution is $f_i = P_i^{y_i}(1 - P_i)^{1 - y_i}$. Assuming an i.i.d. sample, the joint density (likelihood function) for the whole sample is $L = \prod_{i=1}^{n} f_i$. We obtain the log-likelihood by taking the log of the joint density:

$$\log(L) = \sum_{i=1}^{n} \log(f_i) = \sum_{i=1}^{n} \left[ y_i \log(P_i) + (1 - y_i)\log(1 - P_i) \right], \quad (12)$$

where $P_i$ is given by (9). Finally, the maximum likelihood method estimates $\hat{\beta}$ by maximizing (12) via numerical methods.
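The numerical maximization of (12) can be sketched as follows. The data are simulated (not the lecture's), and scipy's general-purpose optimizer stands in for whatever routine a statistics package (such as R's glm) would use internally:

```python
import numpy as np
from scipy.optimize import minimize

# Simulate a sample from a logistic model with assumed true coefficients.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
beta_true = np.array([-0.5, 1.0])
p = 1 / (1 + np.exp(-(beta_true[0] + beta_true[1] * x)))
y = rng.binomial(1, p)

def neg_loglik(beta):
    """Negative of the log-likelihood in (12); minimized = likelihood maximized."""
    p_i = 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))
    return -np.sum(y * np.log(p_i) + (1 - y) * np.log(1 - p_i))

res = minimize(neg_loglik, x0=np.zeros(2), method="Nelder-Mead")
print(res.x)   # estimates should land near the true (-0.5, 1.0)
```

With 500 observations the estimates are close to the true coefficients, and they get closer as the sample grows, illustrating the consistency of maximum likelihood.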
Categorical Variable as Regressor

In general, a categorical variable needs to be converted into a set of dummy variables before being used as a regressor in a regression. For example, for the transportation mode we can define two dummy variables

$$D_1 = \begin{cases} 1, & \text{if walk} \\ 0, & \text{otherwise} \end{cases} \qquad D_2 = \begin{cases} 1, & \text{if drive} \\ 0, & \text{otherwise} \end{cases}$$

So $D_1 = 0, D_2 = 0$ for a person using public transportation (base group). The regression looks like

$$y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + u.$$

Here, $\beta_1$ measures the difference in y between walking and using public transportation. How about $\beta_0$?