Data-analysis and Retrieval: Ordinal Classification


Ad Feelders, Universiteit Utrecht

Ordinal Classification

When a variable is ordinal, its categories can be ranked from low to high, but the distances between adjacent categories are unknown. In ordinal classification the class variable is ordinal.

Example: Likert-scale course evaluation items, each rated from 1 (strongly disagree) to 5 (strongly agree), with the observed response distribution per category:

  The course was relevant for my programme:  0% (0)  10.5% (2)  21.1% (4)  42.1% (8)  26.3% (5)
  I learned a lot from this course:          0% (0)  15.8% (3)  26.3% (5)  15.8% (3)  42.1% (8)
  (third item):                              0% (0)   5.3% (1)  26.3% (5)  31.6% (6)  36.8% (7)

Logistic Regression Revisited

Consider the linear regression model

  y* = β'x + ε,   E[ε] = 0,

where y* is an unobserved (latent) numeric variable. We only observe whether y* is bigger than a given threshold:

  y = 1 if y* > 0,   y = 0 if y* ≤ 0.

Note the vector notation: x = (1, x_1, ..., x_p) and β = (β_0, β_1, ..., β_p), so

  β'x = β_0 + Σ_{j=1}^p β_j x_j.

Logistic Regression Revisited

According to this model, the probability that y = 1 is

  P(y = 1) = P(y* > 0) = P(β'x + ε > 0) = P(ε > -β'x).

If the distribution of ε is symmetric (e.g. normal or logistic), then

  P(ε > -β'x) = P(ε < β'x) = F(β'x).

Here F is the cumulative distribution function (cdf) of ε. The cdf is defined as

  F(z) = P(Z ≤ z) = ∫_{-∞}^{z} f(u) du,

where f is the probability density function (pdf) of Z.

Cumulative distribution function (figure)

The Probit Model

So we have

  P(y = 1) = F(β'x).

In the probit model we assume ε ~ N(0, 1). (The assumption of unit variance is a harmless normalization.) We then have

  P(y = 1) = Φ(β'x),

where Φ(·) denotes the standard normal cumulative distribution function.

ε ~ N(0, 1) is a harmless normalization. Suppose instead that ε ~ N(0, σ²), as is common in linear regression. First of all, note that

  P(y = 1 | x) = P(ε < β'x) = P(ε/σ < β'x/σ).

Define u = ε/σ. Then u ~ N(0, 1). Furthermore, let α_j = β_j/σ. The model with coefficients α_j and error term u is observationally equivalent to the model with coefficients β_j and error term ε: they produce exactly the same probabilities for the different y values. Since y is all we observe (not y*), the two models cannot be distinguished from each other on the basis of observations.

Standard normal density and cumulative distribution function, f(x) and F(x) (figure)

The Logit Model (Logistic Regression)

For the logit (logistic regression) model,

  P(y = 1) = Λ(β'x) = exp(β'x) / (1 + exp(β'x)),

where Λ(·) denotes the logistic cumulative distribution function.
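To make the two link functions concrete, here is a minimal R sketch (with made-up values for β and x, purely for illustration) that computes P(y = 1) under the probit and the logit model using the built-in normal and logistic cdfs:

  # hypothetical coefficients and one observation (illustration only)
  beta <- c(-1, 0.8)        # (intercept, slope)
  x    <- c(1, 2.5)         # (1, x1)
  eta  <- sum(beta * x)     # linear predictor beta'x

  pnorm(eta)                # probit: Phi(beta'x)
  plogis(eta)               # logit:  Lambda(beta'x) = exp(eta) / (1 + exp(eta))

The two probabilities are close but not identical: as the figure on the next slide shows, the normal and logistic cdfs have very similar shapes, the logistic one having somewhat heavier tails.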

Normal (red) and logistic (blue) cumulative distribution functions, Φ(x) and Λ(x) (figure)

Alternative Parametrization

Instead of fixing the threshold at zero, we can also remove the intercept from the model and estimate the threshold instead. Then we have

  y* = Σ_{j=1}^p β_j x_j + ε,   E[ε] = 0,

where y* is an unobserved (latent) numeric variable. We only observe whether y* is bigger than a threshold t:

  y = 1 if y* > t,   y = 0 if y* ≤ t.

Generalization to Ordinal Classification

  y*_i = β'x_i + ε_i,   E[ε_i] = 0.

We only observe between which thresholds y* falls. Let m denote the number of classes, where the classes are labeled {1, 2, ..., m}. Then y is defined as follows:

  y = 1 if -∞ < y* ≤ t_1
  y = 2 if t_1 < y* ≤ t_2
  ...
  y = m if t_{m-1} < y* < ∞

Here t_1, ..., t_{m-1} are unknown thresholds that have to be estimated from the data (together with the coefficient vector β).

Discretization of y*

We only observe y, which indicates the interval y* falls into. (Figure: the real line for y* divided by the thresholds t_1, t_2, t_3 into four intervals, mapped to y = 1, 2, 3, 4.)

Distribution of y given x (figure)

Class Probabilities

We observe y = 1 when y* falls between t_0 = -∞ and t_1. Hence

  P(y_i = 1 | x_i) = P(t_0 ≤ y*_i < t_1 | x_i).

Substituting y*_i = β'x_i + ε_i, we get

  P(y_i = 1 | x_i) = P(t_0 ≤ β'x_i + ε_i < t_1 | x_i).

Now we subtract β'x_i from all terms in the inequality to get

  P(y_i = 1 | x_i) = P(t_0 - β'x_i ≤ ε_i < t_1 - β'x_i | x_i).

Class Probabilities

We have:

  P(y_i = 1 | x_i) = P(t_0 - β'x_i ≤ ε_i < t_1 - β'x_i | x_i)
                   = P(ε_i < t_1 - β'x_i | x_i) - P(ε_i < t_0 - β'x_i | x_i)
                   = F(t_1 - β'x_i) - F(t_0 - β'x_i).

This derivation can be generalized to compute the probability of any observed outcome y_i = j given x_i:

  P(y_i = j | x_i) = F(t_j - β'x_i) - F(t_{j-1} - β'x_i),   j = 1, ..., m,

with the conventions t_0 = -∞ (so that F(t_0 - β'x_i) = 0) and t_m = ∞ (so that F(t_m - β'x_i) = 1).

Class Probabilities

So for a model with four possible classes, the formulas for the different outcomes are:

  P(y_i = 1 | x_i) = F(t_1 - β'x_i)
  P(y_i = 2 | x_i) = F(t_2 - β'x_i) - F(t_1 - β'x_i)
  P(y_i = 3 | x_i) = F(t_3 - β'x_i) - F(t_2 - β'x_i)
  P(y_i = 4 | x_i) = 1 - F(t_3 - β'x_i)
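As a small numerical illustration (with made-up thresholds and a made-up value of β'x, not estimates from any data), these four probabilities can be computed directly from the logistic cdf and sum to one by construction:

  thr <- c(-1, 0.5, 2)          # hypothetical thresholds t1 < t2 < t3
  eta <- 0.7                    # hypothetical value of beta'x
  Fcdf <- plogis                # logistic cdf; use pnorm for the ordered probit version

  cum  <- c(Fcdf(thr - eta), 1) # P(y <= j) for j = 1, ..., 4
  prob <- diff(c(0, cum))       # P(y = j) = F(t_j - beta'x) - F(t_{j-1} - beta'x)
  prob
  sum(prob)                     # equals 1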

Cumulative Class Probabilities

Also, note that:

  P(y_i ≤ 1 | x_i) = F(t_1 - β'x_i)
  P(y_i ≤ 2 | x_i) = F(t_2 - β'x_i)
  P(y_i ≤ 3 | x_i) = F(t_3 - β'x_i)
  P(y_i ≤ 4 | x_i) = 1

In general we have P(y_i ≤ j | x_i) = F(t_j - β'x_i).

Cumulative Class Probabilities

We have seen that P(y ≤ j | x) = F(t_j - β'x). In logistic regression we choose for F the logistic cdf

  Λ(z) = exp(z) / (1 + exp(z)),

so we get

  P(y ≤ j | x) = exp(t_j - β'x) / (1 + exp(t_j - β'x)).

This is a set of parallel logistic regression models for y ≤ j against y > j:

  log [ P(y ≤ j | x) / P(y > j | x) ] = t_j - β'x.

Interpretation

We have P(y_i ≤ j | x_i) = F(t_j - β'x_i). Hence

  ∂P(y ≤ j | x)/∂x_k = ∂F(t_j - β'x)/∂x_k = -β_k F'(t_j - β'x) = -β_k f(t_j - β'x).

f(t_j - β'x) is always positive, since f is a probability density function. So if β_k is positive, an increase in x_k will lead to a decrease in P(y ≤ j) for all j = 1, ..., m-1. Or (same thing), an increase in x_k will lead to an increase in P(y ≥ j) for all j = 2, ..., m. In this specific sense, one can say that if x_k increases, higher values of y become more likely.
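A quick numerical check of this derivative, again with made-up numbers: the analytical marginal effect -β_k f(t_j - β'x) agrees with a finite-difference approximation, and its sign is opposite to that of β_k:

  t_j <- 0.5; beta_k <- 1.2; eta <- 0.7        # hypothetical values
  h   <- 1e-6                                  # step size for the finite difference

  analytic <- -beta_k * dlogis(t_j - eta)      # -beta_k * f(t_j - beta'x), logistic case
  numeric  <- (plogis(t_j - (eta + beta_k * h)) - plogis(t_j - eta)) / h
  c(analytic = analytic, numeric = numeric)    # both negative, nearly equal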

Maximum Likelihood Estimation

The likelihood function is

  L(β, t | X, y) = Π_{j=1}^m Π_{i: y_i = j} P(y_i = j | x_i, β, t)
                 = Π_{j=1}^m Π_{i: y_i = j} [ F(t_j - β'x_i) - F(t_{j-1} - β'x_i) ],

where Π_{i: y_i = j} indicates that we multiply over all cases where y is observed to have value j. Taking logs, the log-likelihood is equal to

  log L(β, t | X, y) = Σ_{j=1}^m Σ_{i: y_i = j} log [ F(t_j - β'x_i) - F(t_{j-1} - β'x_i) ].

This expression can be maximized with numerical methods to estimate the t's and β's.
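The following R sketch (on simulated data with m = 4 classes, not the wine example) writes this log-likelihood down for the ordered logit and maximizes it with optim(); the thresholds are parametrized as t_1 plus positive increments so that they stay ordered during optimization:

  set.seed(1)
  n <- 500; p <- 2; m <- 4
  X    <- matrix(rnorm(n * p), n, p)
  beta <- c(1, -0.5)                             # true coefficients
  thr  <- c(-1, 0.5, 2)                          # true thresholds
  ystar <- as.vector(X %*% beta) + rlogis(n)     # latent variable with logistic error
  y     <- cut(ystar, c(-Inf, thr, Inf), labels = FALSE)

  negloglik <- function(par) {
    b  <- par[1:p]
    # t1 is free; exp() of the remaining parameters gives positive increments
    th  <- cumsum(c(par[p + 1], exp(par[(p + 2):(p + m - 1)])))
    eta <- as.vector(X %*% b)
    tj  <- c(-Inf, th, Inf)
    pr  <- plogis(tj[y + 1] - eta) - plogis(tj[y] - eta)  # F(t_j - eta) - F(t_{j-1} - eta)
    -sum(log(pr))
  }

  fit <- optim(rep(0, p + m - 1), negloglik, method = "BFGS")
  fit$par[1:p]                                             # estimates of beta
  cumsum(c(fit$par[p + 1], exp(fit$par[(p + 2):(p + m - 1)])))  # estimated thresholds

In practice one would of course fit this model with polr() from the MASS package, as on the following slides; the sketch only makes the estimation step explicit.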

Summarizing the Differences

How exactly are the ordered and unordered (= multinomial) logistic regression models different?

- The ordinal model has a single coefficient vector β for all classes, whereas the multinomial model has a coefficient vector β_k for each class k (except one, the reference class). As a consequence, the decision boundaries are restricted to be parallel to each other in the ordinal model. This is quite a strong constraint!

- In the ordinal model the relation between a predictor and the class label is monotone, either increasing or decreasing. For example: if β_k is positive, then (all else equal) an increase in x_k makes the higher classes more likely and a decrease in x_k makes the lower classes more likely.

Fitting the Ordinal Logistic Regression Model

> library(MASS)   # provides polr()
> wine.polr2 <- polr(quality ~ density + alcohol, data = wine.dat, Hess = TRUE)
> summary(wine.polr2)
Coefficients:
          Value Std. Error t value
density 106.27     0.30531  348.08
alcohol   1.11     0.05267   21.07

Intercepts:
    Value    Std. Error t value
1|2 111.9811 0.3032     369.3438
2|3 113.8616 0.3057     372.4595
3|4 117.2274 0.3253     360.3388
4|5 119.7431 0.3667     326.5169
5|6 122.6179 0.4476     273.9202

Residual Deviance: 3351.787
AIC: 3365.787
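As a sketch of how these estimates translate into class probabilities: polr() models the cumulative logits as t_j - β'x, with the thresholds stored in wine.polr2$zeta and the coefficients returned by coef(wine.polr2), so the fitted probabilities from predict() can be reproduced by hand (shown here for the first observation):

  predict(wine.polr2, wine.dat[1, ], type = "probs")   # fitted class probabilities

  x1  <- as.numeric(wine.dat[1, c("density", "alcohol")])
  eta <- sum(coef(wine.polr2) * x1)                    # beta'x for observation 1
  cum <- c(plogis(wine.polr2$zeta - eta), 1)           # P(quality <= j)
  diff(c(0, cum))                                      # P(quality = j), matches predict()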

Prediction

> wine.pred <- predict(wine.polr2, wine.dat, type = "class")
> wine.confmat <- table(wine.dat[, 12], wine.pred)
> wine.confmat
   wine.pred
       1    2    3    4    5    6
  1    0    0    9    1    0    0
  2    0    0   36   17    0    0
  3    0    2  503  173    2    1
  4    0    0  213  392   33    0
  5    0    0    7  138   54    0
  6    0    0    0   10    8    0
> sum(diag(wine.confmat)) / 1599        # accuracy of the ordinal model
[1] 0.5934959
> summary(wine.dat[, 12])               # class distribution of quality
   1    2    3    4    5    6
  10   53  681  638  199   18
> 681 / 1599                            # accuracy of always predicting the majority class (3)
[1] 0.4258912

Decision Boundary: Ordinal LR on Wine Data (figure)

Fitting the Multinomial Logistic Regression Model

> library(nnet)   # provides multinom()
> wine.multi2 <- multinom(quality ~ density + alcohol, data = wine.dat)
> summary(wine.multi2)
Call:
multinom(formula = quality ~ density + alcohol, data = wine.dat)

Coefficients:
  (Intercept)    density    alcohol
2   43.003814 -45.879145  0.4365809
3   68.378492 -62.949698 -0.1396656
4    2.385932  -7.137190  0.8667334
5 -108.347472  93.833526  1.6793355
6  -17.906887  -4.108423  2.0746746

Std. Errors:
  (Intercept)  density   alcohol
2    2.260309 2.274173 0.4522562
3    2.137517 2.146860 0.4289871
4    2.139384 2.141155 0.4282659
5    2.166738 2.182500 0.4334476
6    2.507561 2.531655 0.4810792

Residual Deviance: 3314.079
AIC: 3344.079

Prediction with Multinomial Logit

> wine.pred.m <- predict(wine.multi2, wine.dat, type = "class")
> wine.confmat.m <- table(wine.dat[, 12], wine.pred.m)
> wine.confmat.m
   wine.pred.m
       1    2    3    4    5    6
  1    0    0    7    3    0    0
  2    0    0   30   22    1    0
  3    0    0  514  161    6    0
  4    0    0  267  347   24    0
  5    0    0   23  159   17    0
  6    0    0    2   10    6    0
> sum(diag(wine.confmat.m)) / 1599      # accuracy of the multinomial model
[1] 0.5490932

Decision Boundary: Multinomial LR on Wine Data (figure)

Comparison of Ordinal and Multinomial Model

> wine.polr.corr <- as.numeric(wine.pred == wine.dat[, 12])
> wine.multi.corr <- as.numeric(wine.pred.m == wine.dat[, 12])
> wine.comp <- table(wine.polr.corr, wine.multi.corr)
> wine.comp
              wine.multi.corr
wine.polr.corr   0   1
             0 546 104
             1 175 774

Is the difference in accuracy significant? We test

  H_0: e_polr = e_multi   against   H_a: e_polr ≠ e_multi.

If the null hypothesis is correct, then P(cell (1,0)) = P(cell (0,1)) = 1/2 (the other two cells, where both models are right or both are wrong, are ignored).

Comparison of Ordinal and Multinomial Model

Hence the p-value is P(X ≤ 104) + P(X ≥ 175), where X ~ Binom(π = 1/2, n = 279). In R we can compute this as

> 2 * pbinom(104, 175 + 104, prob = 0.5)
[1] 2.531092e-05

The p-value is far below 0.01, which is already a very strict significance level, so we conclude that the ordinal model has significantly higher accuracy than the multinomial model.
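As a cross-check, the same exact two-sided test on the 279 discordant pairs (this is the exact form of McNemar's test) can be obtained with binom.test:

  binom.test(104, 104 + 175, p = 0.5)$p.value   # gives the same p-value as above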