Lecture 2: Categorical Variables. A nice book about categorical variables is An Introduction to Categorical Data Analysis by Alan Agresti.

Categorical Variable

A categorical variable is a qualitative variable. One example is the dummy variable gender, which equals 1 for a male worker and 0 for a female worker. Here the numbers 1 and 0 have no numerical meaning (they do not imply, say, 1 > 0 or 1 = 1 + 0). Categorical variables can take more than two values. For example, transportation mode can be walk, drive, or use public transportation. Here we have three string values.

Descriptive Statistics

In most cases it is inappropriate to report the mean value of a categorical variable. After all, what is the meaning of the average transportation mode? It does make sense to report the count and proportion (percentage, frequency) for a categorical variable. We want to know how many people, or what percentage of the population, use public transportation versus walking and driving.

Dummy Variable

A dummy (binary, indicator, dichotomous) variable can only take the values 1 and 0. It follows the Bernoulli distribution with probabilities p_1 = P(y = 1) and p_0 = P(y = 0) = 1 - p_1. For a dummy variable we can prove that

E(y) = p_1 (1)
var(y) = p_1(1 - p_1) (2)

So for a dummy variable the mean value is the same as the probability (proportion) of y = 1. For example, if y is gender, then the sample average is the percentage of male workers (if we let y = 1 for male workers).
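A quick numerical check of (1) and (2), sketched in Python (the gender coding and data below are made up for illustration): the sample mean of a dummy variable equals the proportion of ones, and the sample variance equals p_1(1 - p_1).

```python
# Hypothetical dummy variable: 1 = male worker, 0 = female worker
y = [1, 0, 1, 1, 0, 1, 0, 1]

n = len(y)
p1 = sum(y) / n                                  # proportion of ones
mean = sum(y) / n                                # sample mean = p1
var = sum((yi - mean) ** 2 for yi in y) / n      # sample variance

print(mean)             # same number as p1
print(var)              # same number as p1 * (1 - p1)
```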

Pie Graph

A pie graph can be used to illustrate proportions (percentages). For example, the graph below shows that the US economy is in recession in 15% of the quarters from Q1 1947 to Q4 2015.

Figure 1: Pie Chart of the Economy (Recession 15%, Boom 85%)

Bar Graph

Alternatively, the bar graph below reports counts: the economy is booming in more than 200 quarters and in recession in about 50 quarters.

Figure 2: Bar Graph of Counts (Boom vs. Recession)

Two-Way Table: Joint Distribution

A two-way table can report either the counts or the percentages when there are two categorical variables. For example, the categorical variable highinf is no if the annualized quarterly inflation rate is less than 3% and yes otherwise. The categorical variable recession is no if the annualized quarterly GDP growth rate is positive and yes otherwise. The two-way table below reports, for example, that in 24 quarters the economy suffers from both recession and high inflation. We obtain the joint distribution if we divide each count by the total sum. For example, P(recession = no, highinf = yes) = 118/(116 + 118 + 17 + 24) = 0.429. The R command for the two-way table is table(x,y). We get a one-way table for x if we drop y.

         highinf
recession  no yes
      no  116 118
      yes  17  24
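The joint distribution can be reproduced by hand; here is a small Python sketch mirroring the R call table(x,y), using the counts from the slide:

```python
# Counts keyed by (recession, highinf), taken from the two-way table
counts = {("no", "no"): 116, ("no", "yes"): 118,
          ("yes", "no"): 17, ("yes", "yes"): 24}

n = sum(counts.values())                          # total number of quarters
joint = {cell: c / n for cell, c in counts.items()}   # divide each count by n

print(n)                                # 275
print(round(joint[("no", "yes")], 3))   # P(recession = no, highinf = yes) = 0.429
```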

Exercises

1. Find P(recession = no, highinf = no)
2. Find P(recession = yes)
3. Find P(highinf = yes)
4. Find P(recession = yes | highinf = no)

Two-Way Table: Marginal Distribution I

We obtain the marginal distribution for recession by adding the counts horizontally. That is,

P(recession = no) = P(recession = no, highinf = no) + P(recession = no, highinf = yes) (3)

The R command to obtain the marginal counts is margin.table(table(x,y), 1), or table(x). The latter is based on the one-way table of x.

recession
 no yes
234  41

Please verify that 234 = 116 + 118 and 41 = 17 + 24.

Two-Way Table: Marginal Distribution II

The marginal probability can be computed either from the one-way table or from the two-way table:

P(recession = no) = 234/(234 + 41) = 116/(116 + 118 + 17 + 24) + 118/(116 + 118 + 17 + 24) (4)

The R command to obtain the marginal probabilities is prop.table(table(x))

recession
       no       yes
0.8509091 0.1490909
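The same marginals can be checked in a few lines of Python (a sketch analogous to margin.table(table(x,y), 1) and prop.table(table(x)) in R):

```python
# Counts keyed by (recession, highinf), taken from the two-way table
counts = {("no", "no"): 116, ("no", "yes"): 118,
          ("yes", "no"): 17, ("yes", "yes"): 24}

n = sum(counts.values())
marg = {}
for (rec, _), c in counts.items():
    marg[rec] = marg.get(rec, 0) + c   # add counts across highinf categories

print(marg)               # {'no': 234, 'yes': 41}
print(marg["no"] / n)     # marginal probability P(recession = no)
```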

Two-Way Table: Conditional Probability

From probability theory,

P(x = x_i | y = y_j) = P(x = x_i, y = y_j) / P(y = y_j) (5)

The R command to obtain the conditional probabilities for x given y is prop.table(table(x,y), 2)

         highinf
recession        no       yes
      no  0.8721805 0.8309859
      yes 0.1278195 0.1690141

Statistical Independence

From probability theory we know that the joint probability P(A ∩ B) equals the product of the conditional probability P(A | B) and the marginal probability P(B):

P(A ∩ B) = P(A | B)P(B) (6)

If A and B are independent, then the conditional probability is the same as the unconditional (marginal) probability: P(A | B) = P(A). In that case

P(A ∩ B) = P(A)P(B) (if A and B are independent)

In general, two random variables are independent if

P(x = x_i, y = y_j) = P(x = x_i)P(y = y_j) for all (x_i, y_j) (7)

Chi-squared Test for Statistical Independence

Let n be the total count, and let n_ij be the actual count of observations satisfying (x = x_i, y = y_j). Finally, let p_ij = P(x = x_i, y = y_j). Under the null hypothesis of independence

H_0: x and y are independent

we have p_ij = P(x = x_i)P(y = y_j). The main idea of the chi-squared test is to compare the actual count n_ij with the theoretical count under the null hypothesis, nP(x = x_i)P(y = y_j). A big difference leads to rejection:

Chi-squared Test = Σ_{i,j} [n_ij − nP(x = x_i)P(y = y_j)]² / [nP(x = x_i)P(y = y_j)] (8)

This statistic follows the χ² distribution with (r − 1)(c − 1) degrees of freedom under the null hypothesis, where r and c are the numbers of rows and columns.
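A hand-rolled Python sketch of the statistic in (8) for the recession/highinf table, with the marginal probabilities estimated from the data (in R one would simply call chisq.test):

```python
# Two-way table: rows = recession (no, yes), columns = highinf (no, yes)
table = [[116, 118],
         [17,  24]]

n = sum(sum(row) for row in table)
row_p = [sum(row) / n for row in table]                              # P(x = x_i)
col_p = [sum(table[i][j] for i in range(2)) / n for j in range(2)]   # P(y = y_j)

stat = 0.0
for i in range(2):
    for j in range(2):
        expected = n * row_p[i] * col_p[j]            # count expected under H0
        stat += (table[i][j] - expected) ** 2 / expected

print(stat)   # small value: little evidence against independence here
```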

Logistic Regression

Consider a binary dependent variable y and an independent variable x. The logistic regression specifies the probability as

P(y = 1) = e^(β_0 + β_1 x) / (1 + e^(β_0 + β_1 x)) (9)

It follows that the odds P(y = 1)/(1 − P(y = 1)) are given by

P(y = 1) / (1 − P(y = 1)) = e^(β_0 + β_1 x) (10)

Therefore the odds when x = 1 relative to the odds when x = 0, the odds ratio, is

[P(y = 1)/(1 − P(y = 1)) given x = 1] / [P(y = 1)/(1 − P(y = 1)) given x = 0] = e^(β_1) (11)

How to interpret e^(β_0)?
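A numerical check of (9)-(11) in Python, with made-up coefficient values: the odds computed from (9) equal e^(β_0 + β_1 x), so the x = 1 versus x = 0 odds ratio equals e^(β_1).

```python
import math

b0, b1 = -0.5, 1.2   # hypothetical coefficients, for illustration only

def p(x):
    """P(y = 1) from equation (9)."""
    return math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))

def odds(x):
    """The odds P(y = 1) / (1 - P(y = 1))."""
    return p(x) / (1 - p(x))

print(odds(1) / odds(0))   # the odds ratio: same number as exp(b1)
print(odds(0))             # the odds at x = 0: same number as exp(b0)
```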

Maximum Likelihood Estimation

The density function for a Bernoulli distribution is f_i = P_i^(y_i) (1 − P_i)^(1 − y_i). Assuming an i.i.d. sample, the joint density (likelihood function) for the whole sample is L = Π_{i=1}^n f_i. We obtain the log-likelihood by taking the log of the joint density:

log(L) = Σ_{i=1}^n log(f_i) = Σ_{i=1}^n [y_i log(P_i) + (1 − y_i) log(1 − P_i)], (12)

where P_i is given by (9). Finally, the maximum likelihood method estimates β̂ by maximizing (12) via numerical methods.
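A minimal sketch of the numerical maximization (in Python, with a tiny made-up sample; gradient ascent stands in for the more sophisticated optimizers that statistical packages use):

```python
import math

# Hypothetical data: binary regressor x and binary outcome y
x = [0, 0, 1, 1, 1, 0, 1, 0]
y = [0, 1, 1, 1, 0, 0, 1, 0]

b0 = b1 = 0.0
for _ in range(5000):              # gradient ascent on the log-likelihood (12)
    g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))   # P_i from (9)
        g0 += yi - p               # d log(L) / d b0
        g1 += (yi - p) * xi        # d log(L) / d b1
    b0 += 0.1 * g0
    b1 += 0.1 * g1

print(b0, b1)   # the numerical MLE (b0_hat, b1_hat)
```

With a single binary regressor the MLE reproduces the group proportions: here 1/4 of the x = 0 observations and 3/4 of the x = 1 observations have y = 1, so b0_hat converges to log(1/3) and b0_hat + b1_hat to log(3).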

Categorical Variable as Regressor

In general, a categorical variable needs to be converted to a set of dummy variables before being used as a regressor. For example, for the transportation mode we can define two dummy variables:

D_1 = 1 if walk, 0 otherwise
D_2 = 1 if drive, 0 otherwise

So D_1 = 0, D_2 = 0 for a person using public transportation (the base group). The regression looks like

y = β_0 + β_1 D_1 + β_2 D_2 + u

Here β_1 measures the difference in the mean of y between walking and using public transportation. How about β_0?
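Because the two dummies plus the intercept saturate the three categories, the OLS fit simply reproduces the group means: β_0 is the mean of the base group and β_1, β_2 are differences from it. A Python sketch with made-up commute-time data:

```python
# Hypothetical transportation modes and outcomes (e.g. commute times)
modes = ["walk", "walk", "drive", "public", "public", "drive", "walk"]
y = [20.0, 25.0, 35.0, 40.0, 50.0, 30.0, 15.0]

# Mean of y within each mode
mean = {m: sum(yi for mi, yi in zip(modes, y) if mi == m) / modes.count(m)
        for m in set(modes)}

b0 = mean["public"]        # base group mean: public transportation
b1 = mean["walk"] - b0     # walk vs public
b2 = mean["drive"] - b0    # drive vs public

print(b0, b1, b2)   # the coefficients OLS would deliver on (1, D1, D2)
```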