Data-analysis and Retrieval Ordinal Classification


1 Data-analysis and Retrieval: Ordinal Classification

Ad Feelders, Universiteit Utrecht

Data-analysis and Retrieval 1 / 30

2 Ordinal Classification

When a variable is ordinal, its categories can be ranked from low to high, but the distances between adjacent categories are unknown. In ordinal classification the class variable is ordinal.

Example: Likert scale.

[Figure: course-evaluation items such as "The course was relevant for my programme" and "I learned a lot from this course", each rated on a five-point scale from "Strongly disagree" to "Strongly agree". Response distributions shown, from lowest to highest category: 0% (0), 10.5% (2), 21.1% (4), 42.1% (8), 26.3% (5); 0% (0), 15.8% (3), 26.3% (5), 15.8% (3), 42.1% (8); 0% (0), 5.3% (1), 26.3% (5), 31.6% (6), 36.8% (7).]

3 Logistic Regression Revisited

Consider the linear regression model

$$y^* = \beta^\top x + \varepsilon, \qquad E[\varepsilon] = 0,$$

where $y^*$ is an unobserved (latent) numeric variable. We only observe whether $y^*$ is bigger than a given threshold:

$$y = \begin{cases} 1 & \text{if } y^* > 0 \\ 0 & \text{if } y^* \le 0 \end{cases}$$

Note the vector notation: $x = (1, x_1, \dots, x_p)$ and $\beta = (\beta_0, \beta_1, \dots, \beta_p)$, so

$$\beta^\top x = \beta_0 + \sum_{j=1}^{p} \beta_j x_j.$$

4 Logistic Regression Revisited

According to this model, the probability that $y = 1$ is

$$P(y = 1) = P(y^* > 0) = P(\beta^\top x + \varepsilon > 0) = P(\varepsilon > -\beta^\top x).$$

If the distribution of $\varepsilon$ is symmetric around zero (e.g. normal or logistic), then

$$P(\varepsilon > -\beta^\top x) = P(\varepsilon < \beta^\top x) = F(\beta^\top x).$$

Here $F$ is the cumulative distribution function (cdf) of $\varepsilon$. The cdf of a random variable $Z$ is defined as

$$F(z) = P(Z \le z) = \int_{-\infty}^{z} f(u)\,du,$$

where $f$ is the probability density function (pdf) of $Z$.
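The symmetry step can be checked numerically. The following is a minimal sketch in base R, assuming standard normal and standard logistic errors; the values in `z` are hypothetical stand-ins for the linear predictor.

```r
# Check that P(eps > -z) equals F(z) for two error distributions
# that are symmetric around zero.
z <- c(-2, -0.5, 0, 1.3)   # hypothetical values of the linear predictor

# Standard normal errors: upper tail at -z versus cdf at z
p_upper_norm <- pnorm(-z, lower.tail = FALSE)
p_cdf_norm   <- pnorm(z)

# Standard logistic errors: upper tail at -z versus cdf at z
p_upper_logis <- plogis(-z, lower.tail = FALSE)
p_cdf_logis   <- plogis(z)

print(cbind(z, p_upper_norm, p_cdf_norm, p_upper_logis, p_cdf_logis))
```

The paired columns agree, which is the property that lets the slide replace the upper-tail probability by the cdf.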

5 Cumulative distribution function

[Figure: plot of a cumulative distribution function.]

6 The Probit Model

So we have

$$P(y = 1) = F(\beta^\top x).$$

In the probit model we assume $\varepsilon \sim N(0, 1)$. The assumption of unit variance is a harmless normalization. We have

$$P(y = 1) = \Phi(\beta^\top x),$$

where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function.

7 $\varepsilon \sim N(0, 1)$ is a harmless normalization

Suppose instead that $\varepsilon \sim N(0, \sigma^2)$, as is common in linear regression. First of all, note that

$$P(y = 1 \mid x) = P(\varepsilon < \beta^\top x) = P\!\left(\frac{\varepsilon}{\sigma} < \frac{\beta^\top x}{\sigma}\right).$$

Define $u = \varepsilon / \sigma$. Then $u \sim N(0, 1)$. Furthermore, let $\alpha_j = \beta_j / \sigma$.

The model with coefficients $\alpha_j$ and error term $u$ is observationally equivalent to the model with coefficients $\beta_j$ and error term $\varepsilon$. They are observationally equivalent because they produce exactly the same probabilities for the different values of $y$. Since $y$ is all we observe (not $y^*$), the two models cannot be distinguished from each other on the basis of observations.
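A small numerical illustration of this equivalence, as a sketch in base R. The values of `sigma`, `beta` and `x` are hypothetical and chosen only for the example.

```r
sigma <- 2.5                 # hypothetical error standard deviation
beta  <- c(0.4, -1.2, 0.8)   # hypothetical coefficients, intercept first
x     <- c(1, 0.5, 2.0)      # one observation, leading 1 for the intercept

# Model with eps ~ N(0, sigma^2) and coefficients beta
p_beta <- pnorm(sum(beta * x), mean = 0, sd = sigma)

# Model with u ~ N(0, 1) and rescaled coefficients alpha = beta / sigma
alpha   <- beta / sigma
p_alpha <- pnorm(sum(alpha * x))

print(c(p_beta = p_beta, p_alpha = p_alpha))
```

Both models assign the same probability to $y = 1$, so data on $y$ alone cannot separate $\beta$ from $\sigma$.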

8 Standard normal density and cumulative distribution function

[Figure: f(x) and F(x) plotted against x.]

9 The Logit Model (Logistic Regression)

For the logit (logistic regression) model

$$P(y = 1) = \Lambda(\beta^\top x) = \frac{e^{\beta^\top x}}{1 + e^{\beta^\top x}},$$

where $\Lambda(\cdot)$ denotes the logistic cumulative distribution function.
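As a quick check of the formula, the sketch below writes out $\Lambda$ by hand and compares it with base R's `plogis()`, the cdf of the standard logistic distribution. The function name `lambda` and the grid `z` are illustrative choices, not part of the slides.

```r
# Logistic cdf written out as in the formula above
lambda <- function(z) exp(z) / (1 + exp(z))

z <- seq(-4, 4, by = 0.5)   # hypothetical values of the linear predictor

# Largest difference between the hand-written formula and plogis()
max_diff <- max(abs(lambda(z) - plogis(z)))

# Probit and logit probabilities side by side for the same z
comparison <- cbind(z, logit = plogis(z), probit = pnorm(z))
print(max_diff)
print(comparison)
```

For very large positive `z` the hand-written form overflows, which is one reason to prefer `plogis()` in practice.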

10 Normal (red) and logistic (blue) cumulative distribution function

[Figure: Phi(x) and Lambda(x) plotted against x.]

11 Alternative Parametrization

Instead of fixing the threshold at zero, we can also remove the intercept from the model and choose the threshold. Then we have

$$y^* = \sum_{j=1}^{p} \beta_j x_j + \varepsilon, \qquad E[\varepsilon] = 0,$$

where $y^*$ is an unobserved (latent) numeric variable. We only observe whether $y^*$ is bigger than a threshold $t$:

$$y = \begin{cases} 1 & \text{if } y^* > t \\ 0 & \text{if } y^* \le t \end{cases}$$

12 Generalization to Ordinal Classification

$$y_i^* = \beta^\top x_i + \varepsilon_i, \qquad E[\varepsilon_i] = 0.$$

We only observe between which thresholds $y^*$ falls. Let $m$ denote the number of classes, where the classes are labeled $\{1, 2, \dots, m\}$. Then $y$ is defined as follows:

$$y = \begin{cases} 1 & \text{if } -\infty < y^* \le t_1 \\ 2 & \text{if } t_1 < y^* \le t_2 \\ \;\vdots & \\ m & \text{if } t_{m-1} < y^* < \infty \end{cases}$$

Here $t_1, \dots, t_{m-1}$ are unknown thresholds that have to be estimated from the data (together with the coefficient vector $\beta$).
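The mapping from the latent variable to the observed class can be sketched with base R's `cut()`, whose default right-closed intervals match the definition above. The thresholds and latent values are hypothetical.

```r
thresholds <- c(-1, 0.5, 2)                # hypothetical t_1 < t_2 < t_3 (m = 4)
ystar <- c(-3.2, -1, 0.1, 0.5, 1.7, 4.0)   # hypothetical latent values

# cut() uses intervals (t_{j-1}, t_j] by default (right = TRUE);
# labels = FALSE returns the class index 1, ..., m
y <- cut(ystar, breaks = c(-Inf, thresholds, Inf), labels = FALSE)

print(data.frame(ystar, y))
```

Note that a latent value exactly equal to a threshold, such as `-1` or `0.5` here, is assigned to the lower class.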

13 Discretization of $y^*$

We only observe $y$, which indicates the interval $y^*$ falls into.

[Figure: the $y^*$ axis divided by the thresholds $t_1$, $t_2$, $t_3$, with the corresponding observed class $y$.]

14 Distribution of y given x

[Figure: distribution of y given x.]

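15 Class Probabilities

We observe $y = 1$ when $y^*$ falls between $t_0 = -\infty$ and $t_1$. Hence

$$P(y_i = 1 \mid x_i) = P(t_0 \le y_i^* < t_1 \mid x_i).$$

Substituting $y_i^* = \beta^\top x_i + \varepsilon_i$, we get

$$P(y_i = 1 \mid x_i) = P(t_0 \le \beta^\top x_i + \varepsilon_i < t_1 \mid x_i).$$

Now we subtract $\beta^\top x_i$ from all terms in the inequality to get

$$P(y_i = 1 \mid x_i) = P(t_0 - \beta^\top x_i \le \varepsilon_i < t_1 - \beta^\top x_i \mid x_i).$$

Because $\varepsilon_i$ is continuous, it makes no difference to the probability whether the inequalities are strict.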

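16 Class Probabilities

We have:

$$\begin{aligned} P(y_i = 1 \mid x_i) &= P(t_0 - \beta^\top x_i \le \varepsilon_i < t_1 - \beta^\top x_i \mid x_i) \\ &= P(\varepsilon_i < t_1 - \beta^\top x_i \mid x_i) - P(\varepsilon_i < t_0 - \beta^\top x_i \mid x_i) \\ &= F(t_1 - \beta^\top x_i) - F(t_0 - \beta^\top x_i). \end{aligned}$$

This derivation can be generalized to compute the probability of any observed outcome $y_i = j$ given $x_i$:

$$P(y_i = j \mid x_i) = F(t_j - \beta^\top x_i) - F(t_{j-1} - \beta^\top x_i), \qquad j = 1, \dots, m,$$

with $t_0 = -\infty$ and $t_m = +\infty$.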

17 Class Probabilities

So for a model with four possible classes, the formulas for the different outcomes are:

$$\begin{aligned} P(y_i = 1 \mid x_i) &= F(t_1 - \beta^\top x_i) \\ P(y_i = 2 \mid x_i) &= F(t_2 - \beta^\top x_i) - F(t_1 - \beta^\top x_i) \\ P(y_i = 3 \mid x_i) &= F(t_3 - \beta^\top x_i) - F(t_2 - \beta^\top x_i) \\ P(y_i = 4 \mid x_i) &= 1 - F(t_3 - \beta^\top x_i) \end{aligned}$$
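The four formulas can be computed in one step by differencing the cdf over the padded threshold vector. This is a minimal sketch in base R; the helper `class_probs`, the thresholds and the value of `eta` are hypothetical names and numbers introduced for illustration.

```r
# P(y = j | x) = F(t_j - eta) - F(t_{j-1} - eta), with t_0 = -Inf, t_m = +Inf,
# where eta stands for the linear predictor beta'x.
class_probs <- function(eta, thresholds, cdf = plogis) {
  cuts <- c(-Inf, thresholds, Inf)   # t_0, t_1, ..., t_m
  diff(cdf(cuts - eta))              # successive differences of the cdf
}

thresholds <- c(-1, 0.5, 2)   # hypothetical t_1 < t_2 < t_3 (four classes)
eta <- 0.7                    # hypothetical value of beta'x

p_logit  <- class_probs(eta, thresholds)                # logistic F
p_probit <- class_probs(eta, thresholds, cdf = pnorm)   # standard normal F

print(rbind(p_logit, p_probit))
```

Each row sums to one, and the first and last entries reduce to $F(t_1 - \beta^\top x_i)$ and $1 - F(t_3 - \beta^\top x_i)$ as in the formulas above.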

18 Cumulative Class Probabilities

Also, note that:

$$\begin{aligned} P(y_i \le 1 \mid x_i) &= F(t_1 - \beta^\top x_i) \\ P(y_i \le 2 \mid x_i) &= F(t_2 - \beta^\top x_i) \\ P(y_i \le 3 \mid x_i) &= F(t_3 - \beta^\top x_i) \\ P(y_i \le 4 \mid x_i) &= 1 \end{aligned}$$

In general we have

$$P(y_i \le j \mid x_i) = F(t_j - \beta^\top x_i).$$

19 Cumulative Class Probabilities

We have seen that:

$$P(y \le j \mid x) = F(t_j - \beta^\top x).$$

In logistic regression we choose for $F$ the logistic cdf

$$\Lambda(z) = \frac{\exp(z)}{1 + \exp(z)},$$

so we get

$$P(y \le j \mid x) = \frac{\exp(t_j - \beta^\top x)}{1 + \exp(t_j - \beta^\top x)}.$$

This is a set of parallel logistic regression models for $y \le j$ against $y > j$:

$$\log\left[\frac{P(y \le j \mid x)}{P(y > j \mid x)}\right] = t_j - \beta^\top x.$$
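The step from the cumulative probability to the log-odds can be spelled out as follows. This is a short worked derivation, writing $z = t_j - \beta^\top x$ as shorthand (the symbol $z$ in this role is introduced here only for brevity).

```latex
\begin{aligned}
P(y > j \mid x) &= 1 - \Lambda(z) = 1 - \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(z)}, \\[4pt]
\frac{P(y \le j \mid x)}{P(y > j \mid x)} &= \frac{\exp(z) / (1 + \exp(z))}{1 / (1 + \exp(z))} = \exp(z), \\[4pt]
\log\left[\frac{P(y \le j \mid x)}{P(y > j \mid x)}\right] &= z = t_j - \beta^\top x .
\end{aligned}
```

The slope $-\beta$ is the same for every $j$; only the intercept $t_j$ changes, which is why the models are called parallel.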

20 Interpretation

We have

$$P(y_i \le j \mid x_i) = F(t_j - \beta^\top x_i).$$

Hence

$$\frac{\partial P(y \le j \mid x)}{\partial x_k} = \frac{\partial F(t_j - \beta^\top x)}{\partial x_k} = -\beta_k F'(t_j - \beta^\top x) = -\beta_k f(t_j - \beta^\top x).$$

$f(t_j - \beta^\top x)$ is positive, since $f$ is a probability density function (the normal and logistic densities are strictly positive everywhere). So if $\beta_k$ is positive, an increase in $x_k$ will lead to a decrease in $P(y \le j)$ for all $j = 1, \dots, m - 1$. Or (same thing), an increase in $x_k$ will lead to an increase in $P(y \ge j)$ for all $j = 2, \dots, m$.

In this specific sense, one can say that if $x_k$ increases, higher values of $y$ become more likely.
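The sign argument can be illustrated numerically. The sketch below, in base R with a logistic $F$, uses a single predictor with a hypothetical positive coefficient `beta_k` and hypothetical thresholds; it compares the cumulative probabilities at two values of $x_k$ and checks the derivative formula by a central difference.

```r
thresholds <- c(-1, 0.5, 2)   # hypothetical thresholds t_1, t_2, t_3
beta_k <- 0.8                 # hypothetical positive coefficient
x_low  <- 0                   # two hypothetical values of the predictor
x_high <- 1

# P(y <= j | x) = F(t_j - beta_k * x_k) for j = 1, 2, 3
cum_low  <- plogis(thresholds - beta_k * x_low)
cum_high <- plogis(thresholds - beta_k * x_high)

# Analytic derivative -beta_k * f(t_j - beta_k * x_k) at x_low
deriv_analytic <- -beta_k * dlogis(thresholds - beta_k * x_low)

# Central-difference approximation of the same derivative
h <- 1e-6
deriv_numeric <- (plogis(thresholds - beta_k * (x_low + h)) -
                  plogis(thresholds - beta_k * (x_low - h))) / (2 * h)

print(rbind(cum_low, cum_high, deriv_analytic, deriv_numeric))
```

Every cumulative probability drops when $x_k$ increases, so probability mass moves toward the higher classes.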

21 Maximum Likelihood Estimation

The likelihood function is

$$L(\beta, t \mid X, y) = \prod_{j=1}^{m} \prod_{i:\, y_i = j} P(y_i = j \mid x_i, \beta, t) = \prod_{j=1}^{m} \prod_{i:\, y_i = j} \left[ F(t_j - \beta^\top x_i) - F(t_{j-1} - \beta^\top x_i) \right],$$

where $\prod_{i:\, y_i = j}$ indicates we multiply over all cases where $y$ is observed to have value $j$.

Taking logs, the log likelihood is equal to

$$\log L(\beta, t \mid X, y) = \sum_{j=1}^{m} \sum_{i:\, y_i = j} \log\left[ F(t_j - \beta^\top x_i) - F(t_{j-1} - \beta^\top x_i) \right].$$

This expression can be maximized with numerical methods to estimate the $t$'s and $\beta$'s.
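A minimal sketch of such a numerical maximization in base R, on simulated data with one predictor and three classes. Everything here is an assumption made for illustration: the simulated data, the true values `beta_true` and `t_true`, the helper `negloglik`, and the reparametrization of the second threshold as `t_1 + exp(.)` to keep the thresholds ordered. It is not the method used by any particular package.

```r
set.seed(1)

# Simulate from the latent-variable model with logistic errors
n <- 500
x <- rnorm(n)
beta_true <- 1            # hypothetical true coefficient
t_true <- c(-1, 1)        # hypothetical true thresholds
ystar <- beta_true * x + rlogis(n)
y <- cut(ystar, breaks = c(-Inf, t_true, Inf), labels = FALSE)   # classes 1..3

# Negative log likelihood; par = (beta, t_1, log(t_2 - t_1))
negloglik <- function(par) {
  beta <- par[1]
  t <- c(par[2], par[2] + exp(par[3]))   # enforces t_1 < t_2
  cuts <- c(-Inf, t, Inf)                # t_0, t_1, t_2, t_3
  eta <- beta * x
  # P(y_i = j | x_i) = F(t_j - eta_i) - F(t_{j-1} - eta_i), with j = y_i
  p <- plogis(cuts[y + 1] - eta) - plogis(cuts[y] - eta)
  -sum(log(p))
}

fit <- optim(c(0, -0.5, 0), negloglik, method = "BFGS")
beta_hat <- fit$par[1]
t_hat <- c(fit$par[2], fit$par[2] + exp(fit$par[3]))

print(c(beta_hat = beta_hat, t1_hat = t_hat[1], t2_hat = t_hat[2]))
```

The estimates should land near the simulated values. The `polr()` function from the MASS package, used on the later slides, fits the same cumulative logit model.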

22 Summarizing the Differences

How exactly do the ordered and unordered (= multinomial) logistic regression models differ?

The ordinal model has a single coefficient vector $\beta$ for all classes, whereas the multinomial model has a coefficient vector $\beta_k$ for each class $k$ (except one).

As a consequence, the decision boundaries are restricted to be parallel to each other in the ordinal model. This is quite a strong constraint!

In the ordinal model the relation between predictor and class label is monotone, either increasing or decreasing. For example: if $\beta_k$ is positive, then (all else equal) an increase in $x_k$ makes the higher classes more likely and a decrease in $x_k$ makes the lower classes more likely.

23 Fitting the Ordinal Logistic Regression Model

> library(MASS)  # provides polr()
> wine.polr2 <- polr(quality ~ density + alcohol, data = wine.dat, Hess = TRUE)
> summary(wine.polr2)
Coefficients:
        Value Std. Error t value
density
alcohol

Intercepts:
        Value Std. Error t value

Residual Deviance:
AIC:

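24 Prediction

> wine.pred <- predict(wine.polr, wine.dat, type = "class")
> wine.confmat <- table(wine.dat[, 12], wine.pred)  # rows: observed quality, columns: predicted class
> wine.confmat
   wine.pred
> sum(diag(wine.confmat)) / 1599  # proportion of the 1599 cases classified correctly
[1]
> summary(wine.dat[, 12])
> 681 / 1599  # 681 is presumably the size of the largest class, giving the majority-class baseline
[1]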

25 Decision Boundary Ordinal LR on Wine Data

[Figure: decision boundaries of the ordinal logistic regression model on the wine data.]

26 Fitting the Multinomial Logistic Regression Model

> library(nnet)  # provides multinom()
> wine.multi2 <- multinom(quality ~ density + alcohol, data = wine.dat)
> summary(wine.multi2)
Call:
multinom(formula = quality ~ density + alcohol, data = wine.dat)

Coefficients:
  (Intercept) density alcohol

Std. Errors:
  (Intercept) density alcohol

Residual Deviance:
AIC:

27 Prediction with Multinomial Logit

> wine.pred.m <- predict(wine.multi2, wine.dat, type = "class")
> wine.confmat.m <- table(wine.dat[, 12], wine.pred.m)  # rows: observed quality, columns: predicted class
> wine.confmat.m
   wine.pred.m
> sum(diag(wine.confmat.m)) / 1599  # proportion of the 1599 cases classified correctly
[1]

28 Decision Boundary Multinomial LR on Wine Data

[Figure: decision boundaries of the multinomial logistic regression model on the wine data.]

29 Comparison of Ordinal and Multinomial Model

> wine.polr.corr <- as.numeric(wine.pred == wine.dat[, 12])     # 1 if the ordinal model is correct
> wine.multi.corr <- as.numeric(wine.pred.m == wine.dat[, 12])  # 1 if the multinomial model is correct
> wine.comp <- table(wine.polr.corr, wine.multi.corr)
> wine.comp
              wine.multi.corr
wine.polr.corr

Is the difference in accuracy significant?

Null hypothesis $H_0: e_{polr} = e_{multi}$, alternative $H_a: e_{polr} \ne e_{multi}$.

If the null hypothesis is correct, then a case on which exactly one of the two models is correct is equally likely to fall in either off-diagonal cell: P(cell (1,0)) = P(cell (0,1)) = 1/2 (the other two cells are ignored).

30 Comparison of Ordinal and Multinomial Model

Hence the p-value is

$$P(X \le 104) + P(X \ge 175), \qquad \text{where } X \sim \text{Binom}(\pi = 1/2,\; n = 279).$$

Because the distribution is symmetric for $\pi = 1/2$, both tails are equal. In R we can compute this as

> 2 * pbinom(104, 279, prob = 0.5)
# the result is of the order of 1e-05
# yes, the p-value is smaller than 0.01, which is
# already a very strict significance level

The p-value is very small, so we conclude that the ordinal model has significantly higher accuracy than the multinomial model.
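The same p-value can be obtained in three equivalent ways in base R. This is a sketch that assumes the two off-diagonal cell counts are 104 and 175, as implied by the tail probabilities and $n = 279$ above; the variable names are hypothetical.

```r
n_a <- 104         # cases where only one of the two models is correct
n_b <- 175         # cases where only the other model is correct
n <- n_a + n_b     # 279 discordant cases in total

# Two tails written out: P(X <= 104) + P(X >= 175)
p_tails <- pbinom(n_a, n, 0.5) + pbinom(n_b - 1, n, 0.5, lower.tail = FALSE)

# Shortcut using the symmetry of Binom(n, 1/2)
p_double <- 2 * pbinom(n_a, size = n, prob = 0.5)

# Exact two-sided binomial test from the stats package
p_test <- binom.test(n_a, n, p = 0.5)$p.value

print(c(p_tails = p_tails, p_double = p_double, p_test = p_test))
```

Testing only the discordant cases against a fair coin in this way is the exact form of McNemar's test for paired classifications.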
