Regression modeling for categorical data. Part II : Model selection and prediction


1 Regression modeling for categorical data Part II : Model selection and prediction David Causeur Agrocampus Ouest IRMAR CNRS UMR 6625 http://math.agrocampus-ouest.fr/infogluedeliverlive/membres/david.causeur

2 Course outline
1 Interpreting model fit
2 Model comparison
3 Subset selection
4 Prediction
5 Cross-validation

3 Individual departures from the fit

Definition
The deviance residuals ε_i(β̂) are defined as follows:
ε_i(β̂) = y_i √( 2 log(1 + exp(−y_i(β̂_0 + β̂_1 x_i1 + … + β̂_p x_ip))) )
They are the individual contributions to the residual deviance:
D_x,y(β̂) = Σ_{i=1}^n ε_i²(β̂)
For large n, ε_i(β̂) ≈ N(0,1).

4 Individual departures from the fit

R script
> epsilon = residuals(maturity.logit,type="deviance") # Extracts the deviance residuals
> outlier = which(abs(epsilon)>2) # Selects the indices of the largest epsilons
> pi = predict(maturity.logit,type="response") # Calculates the estimated pi=P(Y=+1)
> cbind(scale(dta.12[,1:5]),dta.12[,6:7],pi)[outlier,] # Displays external information
      L    a    b Weight Diam Maturity Variety   pi
    ...  ...  ...    ...  ...      ...      go  ...
    ...  ...  ...    ...  ...      ...      mo  ...
    ...  ...  ...    ...  ...      ...      bl  ...
    ...  ...  ...    ...  ...      ...      bl  ...
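As a check on the definition above, the deviance residuals returned by R can be recomputed directly from the formula; a minimal sketch, assuming the maturity.logit fit and the dta.12 data of these slides, with the positive class Maturity = "2" recoded as y_i = +1:

R script (illustrative sketch)
y.pm = ifelse(dta.12$Maturity == "2", 1, -1)  # Response recoded as +1/-1
eta = predict(maturity.logit, type = "link")  # Linear predictor beta0 + beta1*x_i1 + ...
eps.manual = y.pm * sqrt(2 * log(1 + exp(-y.pm * eta)))
max(abs(eps.manual - residuals(maturity.logit, type = "deviance"))) # Should be ~0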

5 Plotting the model fit

Model for maturity from Variety and index a:
logit π_1(x) = µ + βx (for variety "37")
logit π_2(x) = µ + α_2 + (β + γ_2)x (for variety "bl")
logit π_3(x) = µ + α_3 + (β + γ_3)x (for variety "go")
logit π_4(x) = µ + α_4 + (β + γ_4)x (for variety "mo")

Correspondence with the estimated values:
Parameter    Symbol  Estimation
(Intercept)  µ       ...
Varietybl    α_2     ...
Varietygo    α_3     ...
Varietymo    α_4     ...
a            β       ...
Varietybl:a  γ_2     ...
Varietygo:a  γ_3     ...
Varietymo:a  γ_4     ...

6 Plotting the model fit

R script
> vec.a = seq(from=min(dta.12$a),to=max(dta.12$a),by=0.01)
> # vec.a is a high resolution sequence of a values
> varieties = levels(dta.12$Variety) # Vector of Variety levels
> pi = matrix(0,nrow=length(vec.a),ncol=4)
> # pi will be used to store the estimated P(Y=+1)
> # One row of pi for each value in vec.a, one column for each variety
> for (j in 1:4)
+   pi[,j] = predict(maturity.logit,type="response",
+     newdata=data.frame(Variety=varieties[j],a=vec.a)) # Estimated P(Y=+1) in matrix pi
> matplot(vec.a,pi,type="l",lwd=2,lty=1,xlab="a",ylab=expression(pi),
+   main="Maturity along index a") # Plots the 4 probability curves
> legend("bottomright",lwd=2,lty=1,col=1:4,
+   legend=c("37","bl","go","mo"),bty="n") # Adds a legend to the plot

7 Plotting the model fit

[Figure: "Maturity along index a", the four estimated probability curves π(a), one per variety (37, bl, go, mo)]

8 Confidence intervals for the regression parameters

Asymptotic distribution of the ML estimator of β
The ML estimator β̂ of β is approximately normally distributed, for large n, with mean β and variance matrix
V_β̂ = (X'VX)⁻¹,
where X is the design matrix and V is the diagonal matrix with entries π_i(1−π_i).

9 Confidence intervals for the regression parameters

R script
> X = model.matrix(~Variety*a,data=dta.12) # Extracts the design matrix of the model
> pi = predict(maturity.logit,type="response") # Fitted P(Y=+1)
> V = diag(pi*(1-pi)) # Diagonal matrix whose diagonal entries are pi*(1-pi)
> Var.beta = solve(t(X)%*%V%*%X) # Asymptotic variance of the ML estimator
> sqrt(diag(Var.beta)) # Estimated standard deviations of the regression parameters
(Intercept)   Varietybl   Varietygo   Varietymo           a Varietybl:a Varietygo:a Varietymo:a
        ...         ...         ...         ...         ...         ...         ...         ...

10 Confidence intervals for the regression parameters

Asymptotic distribution of the ML estimator of β
The ML estimator β̂ of β is approximately normally distributed, for large n, with mean β and variance matrix V_β̂ = (X'VX)⁻¹.

Confidence interval
The confidence interval CI_{1−α}(β_j) with confidence level 1−α of β_j is:
CI_{1−α}(β_j) = [ β̂_j − z_{1−α/2} σ̂_{β̂_j} ; β̂_j + z_{1−α/2} σ̂_{β̂_j} ],
where z_{1−α/2} = F⁻¹_{N(0,1)}(1−α/2) is the (1−α/2)-quantile of the standard normal distribution.
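This interval can be computed directly from the estimated standard deviations above; a small sketch, assuming the maturity.logit fit and the Var.beta matrix from the previous slides:

R script (illustrative sketch)
se = sqrt(diag(Var.beta))                  # Standard errors from the asymptotic variance
z = qnorm(0.975)                           # (1 - alpha/2)-quantile for alpha = 0.05
ci.manual = cbind(coef(maturity.logit) - z * se,
                  coef(maturity.logit) + z * se)
colnames(ci.manual) = c("2.5 %", "97.5 %")
ci.manual                                  # Matches confint.default(maturity.logit)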

11 Confidence intervals for the regression parameters

R script
> ci = cbind(Estimate=coef(maturity.logit),confint.default(maturity.logit,level=0.95))
> ci
            Estimate 2.5 % 97.5 %
(Intercept)      ...   ...    ...
Varietybl        ...   ...    ...
Varietygo        ...   ...    ...
Varietymo        ...   ...    ...
a                ...   ...    ...
Varietybl:a      ...   ...    ...
Varietygo:a      ...   ...    ...
Varietymo:a      ...   ...    ...

12 Confidence intervals for the regression parameters

R script
> exp(ci) # Confidence intervals for the odds-ratios
            Estimate 2.5 % 97.5 %
(Intercept)      ...   ...    ...
Varietybl        ...   ...    ...
Varietygo        ...   ...    ...
Varietymo        ...   ...    ...
a                ...   ...    ...
Varietybl:a      ...   ...    ...
Varietygo:a      ...   ...    ...
Varietymo:a      ...   ...    ...

13 Confidence intervals for the regression parameters

R script
> pi = lwr = upr = matrix(0,nrow=length(vec.a),ncol=4)
> # pi stores the estimated P(Y=+1): one row of pi for each value in vec.a
> # One column for each variety. Same for lwr (lower bound) and upr (upper bound)
> for (j in 1:4) {
+   predictions = predict(maturity.logit,type="response",
+     newdata=data.frame(Variety=varieties[j],a=vec.a),se.fit=TRUE)
+   pi[,j] = predictions$fit
+   lwr[,j] = predictions$fit-1.96*predictions$se.fit
+   lwr[lwr[,j]<0,j] = 0 # Lower bound of C.I. >= 0
+   upr[,j] = predictions$fit+1.96*predictions$se.fit
+   upr[upr[,j]>1,j] = 1 # Upper bound of C.I. <= 1
+ }

14 Confidence intervals for the regression parameters

R script
> par(mfrow=c(2,2)) # Splits the graphics in a 2x2 grid
> color = rgb(red=0,green=0,blue=0.9,alpha=0.5) # Code for a transparent blue
> for (j in 1:4) {
+   plot(vec.a,pi[,j],type="l",lwd=2,lty=1,xlab="a",ylim=c(0,1),
+     ylab=expression(pi),main="Maturity along index a")
+   polygon(c(vec.a,rev(vec.a)),c(lwr[,j],rev(upr[,j])),col=color)
+   # Adds a shaded confidence region around the curve
+   lines(vec.a,pi[,j],lwd=2)
+   mtext(paste("Variety",varieties[j]))
+ }
> par(mfrow=c(1,1)) # Restores the 1x1 organization for the next graphics device

15 Confidence intervals for the regression parameters

[Figure: four panels "Maturity along index a", one per variety (37, bl, go, mo), each showing the estimated probability curve with its shaded confidence band]

16 Wald tests

Based on the asymptotic normality of β̂,
Z_{β_j} = β̂_j / σ̂_{β̂_j}
is a Student-like test statistic for the test of H_0^(j): β_j = 0.
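These statistics and their two-sided p-values can be reproduced by hand; a short sketch, assuming the maturity.logit fit and the Var.beta matrix from the previous slides:

R script (illustrative sketch)
se = sqrt(diag(Var.beta))                # Asymptotic standard errors
z = coef(maturity.logit) / se            # Wald statistics Z_beta_j
cbind(z, p.value = 2 * pnorm(-abs(z)))   # Matches the "z value" and "Pr(>|z|)" columns of summary()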

17 Wald tests

R script
> maturity.logit = glm(Maturity~Variety*a,data=dta.12,family=binomial)
> summary(maturity.logit)

Call:
glm(formula = Maturity ~ Variety * a, family = binomial, data = dta.12)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...      ...
Varietybl        ...        ...     ...  ...e-05
Varietygo        ...        ...     ...      ...
Varietymo        ...        ...     ...      ...
a                ...        ...     ...  ...e-06
Varietybl:a      ...        ...     ...      ...
Varietygo:a      ...        ...     ...      ...
Varietymo:a      ...        ...     ...      ...

18 Assessment of the fit

Residual deviance D_x,y(β̂): the lowest possible deviance among all the possible fits of the model.
Null deviance: the deviance of the null model (intercept only), which is among those possible fits.

19 Assessment of the fit

Residual deviance D_x,y(β̂): the lowest possible deviance among all the possible fits of the model. Null deviance: the deviance of the null model (intercept only), which is among those possible fits.

R script
> maturity.logit = glm(Maturity~Variety*a,data=dta.12,family=binomial)
> summary(maturity.logit)

Call:
glm(formula = Maturity ~ Variety * a, family = binomial, data = dta.12)

    Null deviance: ... on 306 degrees of freedom
Residual deviance: ... on 299 degrees of freedom
AIC: ...

20 Assessment of the fit

Residual deviance D_x,y(β̂): the lowest possible deviance among all the possible fits of the model. Null deviance: the deviance of the null model (intercept only), which is among those possible fits.

Model comparison has to account for the complexity of the model.

Definition
The Akaike Information Criterion AIC_x,y(β̂) is given by:
AIC_x,y(β̂) = D_x,y(β̂) + 2(p + 1).
AIC_x,y(β̂) estimates the information loss when using the model estimated with β̂ rather than the unknown model that is supposed to generate the data.

21 Assessment of the fit

Residual deviance D_x,y(β̂): the lowest possible deviance among all the possible fits of the model. Null deviance: the deviance of the null model (intercept only), which is among those possible fits.

Bayesian Information Criterion (BIC):
BIC_x,y(β̂) = D_x,y(β̂) + ln(n)(p + 1),
which measures the information loss when using the estimated model rather than the true model, within the scope of the parametric models considered.
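Both criteria can be recomputed from the residual deviance; a quick sketch, assuming the maturity.logit fit of these slides. For a binary response fitted by glm, the saturated log-likelihood is 0, so the residual deviance equals −2 times the maximized log-likelihood, and the identities below match R's built-in AIC() and BIC():

R script (illustrative sketch)
npar = length(coef(maturity.logit))              # p + 1 parameters
c(AIC = deviance(maturity.logit) + 2*npar,
  BIC = deviance(maturity.logit) + log(nrow(dta.12))*npar)
c(AIC(maturity.logit), BIC(maturity.logit))      # Same values from the built-in functions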

22 Assessment of the fit

Residual deviance D_x,y(β̂): the lowest possible deviance among all the possible fits of the model. Null deviance: the deviance of the null model (intercept only), which is among those possible fits.

AIC or BIC?
If the goal is to build a prediction rule, AIC is recommended.
If the goal is just to fit an explanatory model, BIC shall be favored.
The goodness-of-fit of a model is more heavily penalized for its complexity when evaluated by BIC than by AIC.

23 Course outline
1 Interpreting model fit
2 Model comparison
3 Subset selection
4 Prediction
5 Cross-validation

24 Significance of an effect

Illustration with the maturity study: suppose we aim at testing a Variety × index-a interaction effect.
It consists in comparing the full model M_full:
logit π_i(x) = µ + α_i + (β + γ_i)x,
to one of its possible submodels M_sub, obtained by setting the γ_i's to zero:
logit π_i(x) = µ + α_i + βx.

25 Significance of an effect

Illustration with the maturity study: suppose we aim at testing a Variety × index-a interaction effect.
Testing for the significance of the interaction effect is stated as:
H_0: there is no interaction effect,
H_1: there is an interaction effect,
or, in an equivalent model comparison perspective:
H_0: M_full does not explain the maturity better than M_sub,
H_1: M_full does explain the maturity better than M_sub.

26 Likelihood-ratio test

The residual deviances D_sub and D_full will be used to compare M_sub and M_full.

R script
> maturity.full = glm(Maturity~Variety*a,data=dta.12,family=binomial)
> maturity.sub = glm(Maturity~Variety+a,data=dta.12,family=binomial)
> deviance(maturity.full)
[1] ...
> deviance(maturity.sub)
[1] ...

27 Likelihood-ratio test

The residual deviances D_sub and D_full will be used to compare M_sub and M_full.
The difference D_full/sub = D_sub − D_full measures the fitting gain obtained by using model M_full rather than model M_sub.
Null distribution: χ²_3, only depending on the difference between the numbers of parameters in the two models, here 3 (the γ_i's).

28 Likelihood-ratio test

The residual deviances D_sub and D_full will be used to compare M_sub and M_full.

R script
> dev.diff = deviance(maturity.sub) - deviance(maturity.full)
> pchisq(dev.diff,df=3,lower.tail=FALSE)
> # Gives the probability that a chi-square variable exceeds dev.diff
[1] ...e-19

29 Likelihood-ratio test

Now summed up in a general framework:

Definition (χ² analysis of deviance, or LRT)
Suppose M_sub ⊂ M_full are two nested models. The so-called Likelihood-Ratio Test (LRT) statistic, or analysis-of-deviance test statistic, for the hypothesis testing issue:
H_0: M_full does not explain the response better than M_sub,
H_1: M_full does explain the response better than M_sub,
is defined as
D_full/sub = D_sub − D_full = −2 log(ℓ_sub / ℓ_full).
Under H_0, D_full/sub ≈ χ² with k_full − k_sub degrees of freedom.

30 Likelihood-ratio test

R script
> anova(maturity.sub,maturity.full,test="Chisq")
Analysis of Deviance Table
Model 1: Maturity ~ Variety + a
Model 2: Maturity ~ Variety * a
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       302        ...
2       299        ...  3      ... < 2.2e-16

31 Analysis of Deviance Table

Complete analysis of deviance table:

R script
> Anova(maturity.logit) # Anova() with a capital A, from package car
Analysis of Deviance Table (Type II tests)

Response: Maturity
          LR Chisq Df Pr(>Chisq)
Variety        ...  3  < 2.2e-16
a              ...  1  < 2.2e-16
Variety:a      ...  3  < 2.2e-16

32 Analysis of Deviance Table

1st row, the main effect of Variety is tested:
H_0: logit π_i(x) = µ + βx,
H_1: logit π_i(x) = µ + α_i + βx.
2nd row, the main effect of the a index is tested similarly.
3rd row, the test for the interaction effect is handled differently:
H_0: logit π_i(x) = µ + α_i + βx,
H_1: logit π_i(x) = µ + α_i + (β + γ_i)x.
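Each row of the table can be reproduced by an explicit comparison of nested models; a sketch for the first row, assuming the dta.12 data (the helper names m.a and m.va are illustrative):

R script (illustrative sketch)
m.a  = glm(Maturity ~ a,           family=binomial, data=dta.12) # H0: no Variety effect
m.va = glm(Maturity ~ Variety + a, family=binomial, data=dta.12) # H1: additive Variety effect
anova(m.a, m.va, test="Chisq")  # LR chi-square for the main effect of Variety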

33 Analysis of Deviance Table

The Wald test for the significance of a single parameter is equivalent to the Student test in the usual linear model. Similarly, the LRT corresponds to the Fisher test for analysis of variance.
When an effect is measured by one single parameter:
in the usual linear model, the t-test statistic is just the signed square-root of the F-test statistic (their p-values are exactly the same);
in the logistic linear model, this coherence between the Wald and the LRT tests no longer holds.

34 Profile likelihood confidence intervals

Definition
The profile likelihood confidence interval for β_j, with confidence level 1−α, is the interval of values b_j such that the LRT of H_0: β_j = b_j at level α does not reject the null.
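A sketch of how such an interval can be obtained by inverting the LRT over a grid, for a hypothetical simple model glm(y ~ x, family=binomial) on a data frame dat; the names y, x, dat and prof.ci are illustrative and not from the slides (confint() does this profiling more efficiently):

R script (illustrative sketch)
prof.ci = function(dat, level = 0.95, grid = seq(-5, 5, by = 0.01)) {
  d.full = deviance(glm(y ~ x, family = binomial, data = dat))
  lrt = sapply(grid, function(b) { # LRT statistic of H0: beta = b, for each b in the grid
    # Fixing the slope at b via an offset, only the intercept is refitted
    deviance(glm(y ~ 1, family = binomial, data = dat, offset = b * dat$x)) - d.full
  })
  range(grid[lrt <= qchisq(level, df = 1)]) # Values of b that are not rejected
}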

35 Profile likelihood confidence intervals

R script
> confint(maturity.logit,level=0.95)
Waiting for profiling to be done...
            2.5 % 97.5 %
(Intercept)   ...    ...
Varietybl     ...    ...
Varietygo     ...    ...
Varietymo     ...    ...
a             ...    ...
Varietybl:a   ...    ...
Varietygo:a   ...    ...
Varietymo:a   ...    ...

36 Detailing a significant group effect

Once the effect of a factor is significant, what levels shall be pointed out as different?
Post-hoc tests: I(I−1)/2 pairwise comparisons of the effect parameters.
In the maturity study, for the interaction effect, 6 tests of H_0^(ii'): γ_i = γ_i', for 1 ≤ i < i' ≤ 4.
The probability of one or more erroneous rejections of any of the null hypotheses H_0^(ii'), when all of them are true (and assuming independent tests):
1 − P(H_0^(12) is not rejected, …, H_0^((I−1)I) is not rejected)
= 1 − P(H_0^(12) is not rejected) × … × P(H_0^((I−1)I) is not rejected)
= 1 − (1 − α)^{I(I−1)/2}.

37 Detailing a significant group effect

Once the effect of a factor is significant, what levels shall be pointed out as different?
Post-hoc tests: I(I−1)/2 pairwise comparisons of the effect parameters.

R script
> 1-(1-0.05)^6
[1] 0.2649081

If α = 0.05, then the probability of one or more erroneous declarations that two γ_i are significantly different is about 0.26!

38 Detailing a significant group effect

Once the effect of a factor is significant, what levels shall be pointed out as different?
Post-hoc tests: I(I−1)/2 pairwise comparisons of the effect parameters.

R script
> alpha = 1-(1-0.05)^(1/6)
> alpha
[1] 0.008512445
> 1-(1-alpha)^6
[1] 0.05

39 Detailing a significant group effect

R script
> # Initialization of empty matrices of confidence bounds for the pairwise differences
> estimate = lower = upper = matrix(0,nrow=4,ncol=4)
> varieties = levels(dta.12$Variety) # Extract variety names
> # Sets names to matrices upper and lower
> rownames(estimate) = rownames(upper) = rownames(lower) = varieties
> colnames(estimate) = colnames(upper) = colnames(lower) = varieties
> upper
   37 bl go mo
37  0  0  0  0
bl  0  0  0  0
go  0  0  0  0
mo  0  0  0  0

40 Detailing a significant group effect

R script
> for (j in 1:4) {
+   tmp = dta.12 # Temporary dataset similar to dta.12
+   tmp$Variety = relevel(dta.12$Variety,varieties[j]) # Makes the jth variety the reference level
+   tmp.logit = glm(formula(maturity.logit),family=binomial,data=tmp)
+   estimate[j,-j] = coef(tmp.logit)[6:8]
+   ci = confint(tmp.logit,level=1-alpha,parm=6:8)
+   upper[j,-j] = ci[,2] # Feeds the jth row of matrix upper
+   lower[j,-j] = ci[,1] # Feeds the jth row of matrix lower
+ }

41 Detailing a significant group effect

R script
> ci = data.frame(Estimate=estimate[col(estimate)>row(estimate)],
+   lower=lower[col(lower)>row(lower)],upper=upper[col(upper)>row(upper)])
> # Creates a 6 x 3 data frame with all the pairwise combinations in rows
> colnames(ci) = c("Estimate","2.5%","97.5%")
> cilabs = outer(varieties,varieties,paste,sep="-")
> # Creates names for combinations with all pairwise combinations of variety labels
> rownames(ci) = cilabs[col(cilabs)>row(cilabs)]
> ci
      Estimate 2.5% 97.5%
37-bl      ...  ...   ...
37-go      ...  ...   ...
bl-go      ...  ...   ...
37-mo      ...  ...   ...
bl-mo      ...  ...   ...
go-mo      ...  ...   ...

42 Detailing a significant group effect

Once the effect of a factor is significant, what levels shall be pointed out as different?
Post-hoc tests: I(I−1)/2 pairwise comparisons of the effect parameters.
Finally: the slopes of the probability curves are significantly different for all pairwise comparisons except for the comparison between varieties bl and mo.

43 Model for preference data

Exercise
We aim at studying the effect of age on the preference of women for a special type of perfume, denoted G2, rather than another one, denoted G3. It is suspected that the way the preference is affected by age may depend on consumer habits, especially the frequency of use of a perfume. Results of a consumer study are provided in file parfums.txt in order to address this issue.
Propose and fit an appropriate model for the above issue. Is the effect of age on preference different according to the frequency of use of a perfume?

44 Model for colouring of fat

Exercise
Experimental study to investigate the causes of yellow fat in lamb meat. The experiment focuses on two possible causes, the feeding and housing modes, in a balanced design: 20 lambs per possible combination of two feeding modes (1: two meals a day or 2: ad libitum) and two housing modes (1: individual and 2: collective).
Numbers of lambs with coloured fat and total numbers of lambs:
Feeding Housing Coloured Total
F1      H1        ...     20
F1      H2        ...     20
F2      H1        ...     20
F2      H2        ...     20
Propose and fit an appropriate model for the above issue. Is the effect of the feeding mode different according to the housing mode? Give the odds-ratio of the feeding mode.

45 Course outline
1 Interpreting model fit
2 Model comparison
3 Subset selection
4 Prediction
5 Cross-validation

46 Subset selection

Let us now model the maturity class of an apricot by L, a, b, Diam and Weight.
Subset selection issue: which subset of those 5 x's is sufficient to explain the maturity status?
Handled by comparison of the 2⁵ = 32 submodels.
The submodels are not all nested: the LRT is not appropriate.
For a number k of x's: the submodel M*_k with lowest residual deviance D*_k is the champion.
The overall champion M* is the M*_k with lowest BIC (or AIC).

47 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!
Stepwise search algorithms - forward starts from M_0:
First step: fit the p models with only one x and keep the model M_1 with lowest BIC.
kth step: 2 variants:
forward stepwise: fit the p−k+1 models obtained by adding one x to M_{k−1} and keep the model with lowest BIC;
forward/backward stepwise: fit the p models obtained by adding or removing one x to/from M_{k−1} and keep the model with lowest BIC.
Stop when adding an x increases BIC.
In forward/backward, at most about p² submodels are fitted.

48 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!

R script
> p=50 # For example, p=50 candidate variables
> p^2 # Maximum number of model fits in a stepwise search
[1] 2500
> 2^p # Number of possible submodels
[1] 1.1259e+15
> p^2/2^p # Proportion of submodels explored in a stepwise search
[1] 2.220446e-12

49 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!
forward or backward?
backward tends to keep more variables in the selection;
when p ≈ n or p > n, the estimation of the full model is not reliable (or just not possible).
forward or forward/backward?
forward/backward is less greedy than forward alone, since a variable can be removed again at a later step. Both algorithms are computationally equivalent: the forward/backward search can be viewed as a free improvement.

50 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!

R script
> maturity.logit = glm(Maturity~.,family=binomial,data=dta.12[,-7])
> # Fits the model with all explanatory variables except Variety
> stepwise(maturity.logit,direction="forward/backward",criterion="BIC")
Direction: forward/backward
Criterion: BIC

Start: AIC=...
Maturity ~ 1
         Df Deviance AIC
+ a       1      ... ...
+ Weight  1      ... ...
<none>           ... ...
+ b       1      ... ...
+ Diam    1      ... ...
+ L       1      ... ...

51 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!

R script
> maturity.logit = glm(Maturity~.,family=binomial,data=dta.12[,-7])
> # Fits the model with all explanatory variables except Variety
> stepwise(maturity.logit,direction="forward/backward",criterion="BIC")
Direction: forward/backward
Criterion: BIC

Step: AIC=...
Maturity ~ a
         Df Deviance AIC
+ L       1      ... ...
+ Weight  1      ... ...
+ Diam    1      ... ...
+ b       1      ... ...
<none>           ... ...
- a       1      ... ...

52 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!

R script
> maturity.logit = glm(Maturity~.,family=binomial,data=dta.12[,-7])
> # Fits the model with all explanatory variables except Variety
> stepwise(maturity.logit,direction="forward/backward",criterion="BIC")
Direction: forward/backward
Criterion: BIC

Step: AIC=...
Maturity ~ a + L
         Df Deviance AIC
+ b       1      ... ...
<none>           ... ...
+ Diam    1      ... ...
+ Weight  1      ... ...
- L       1      ... ...
- a       1      ... ...

53 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!

R script
Maturity ~ a + L + b
         Df Deviance AIC
<none>           ... ...
- b       1      ... ...
+ Diam    1      ... ...
+ Weight  1      ... ...
- L       1      ... ...
- a       1      ... ...

Call: glm(formula = Maturity ~ a + L + b, family = binomial, data = dta.12[,-7])

Coefficients:
(Intercept)    a    L    b
        ...  ...  ...  ...

Degrees of Freedom: 306 Total (i.e. Null); 303 Residual
Null Deviance: ...
Residual Deviance: ...  AIC: 241.3

54 Search algorithm

If p explanatory variables, then 2^p submodels... can be huge!
Since p is moderate, an exhaustive search is possible here:

R script
> maturity.select = bestglm(Xy=dta.12[,-7],family=binomial,method="exhaustive")
> maturity.select$Subsets[,8] = maturity.select$Subsets[,8]+log(nrow(dta.12))
> maturity.select$Subsets
   Intercept     L     a     b Weight  Diam BIC
0       TRUE FALSE FALSE FALSE  FALSE FALSE ...
1       TRUE FALSE  TRUE FALSE  FALSE FALSE ...
2       TRUE  TRUE  TRUE FALSE  FALSE FALSE ...
3       TRUE  TRUE  TRUE  TRUE  FALSE FALSE ...
4       TRUE  TRUE  TRUE FALSE   TRUE  TRUE ...
5*      TRUE  TRUE  TRUE  TRUE   TRUE  TRUE ...

55 Course outline
1 Interpreting model fit
2 Model comparison
3 Subset selection
4 Prediction
5 Cross-validation

56 Classification rule

In the usual linear model, predicting the unknown value of Y from x = (x_1, …, x_p) is unambiguous:
Ŷ = Ê(Y | X = x) = β̂_0 + β̂_1 x_1 + … + β̂_p x_p.

57 Classification rule

In the logistic linear model, the prediction issue is less clear:
Estimation of π(x) = P(Y = +1 | X = x)?
Assignment of a Y value, either +1 or −1, given x?

R script
> maturity.logit = glm(Maturity~.,family=binomial,data=dta.12[,-7])
> # Using the model to estimate the probability that Maturity = 2
> pi = predict(maturity.logit,type="response")
> # Plotting the predictions versus observed maturity status
> plot(pi~dta.12$Maturity,xlab="Observed maturity status",
+   ylab="Estimated probability of 'Maturity=2'",cex.lab=1.25,
+   main="Predictions versus observed maturity status",cex.axis=1.25,
+   cex.main=1.25)

58 Classification rule

[Figure: "Predictions versus observed maturity status", boxplots of the estimated probability of 'Maturity=2' against the observed maturity status]

59 Classification rule

Definition
A decision rule aiming at the prediction of the value of a categorical response variable Y from X = (X_1, …, X_p) is named a classification rule. It is an algorithm describing how to get Ŷ, starting from x.

Definition
The misclassification probability of an item with explanatory profile x_0 and unknown response value Y_0 is P(Ŷ_0 ≠ Y_0 | X = x_0).

60 Bayes classification rule

Definition
The logistic linear Bayes classification rule is derived as follows: if π̂(x_0) is the estimated probability that Y_0 = +1, then:
Ŷ_0 = +1 if π̂(x_0) ≥ 0.5,
Ŷ_0 = −1 if π̂(x_0) < 0.5.

61 Bayes classification rule

R script
> # Implementation of the Bayes logistic classification rule
> predicted = ifelse(pi>=0.5,"2","1")
> # Construction of the confusion matrix
> confusion = table(dta.12$Maturity,predicted,dnn=list("Obs.","Pred."))
> confusion
    Pred.
Obs.   1   2
  1  ... ...
  2  ... ...

62 Prediction performance

The global misclassification rate is often not suited.
Example: Y is the status, sick (Y = +1) or healthy (Y = −1), of a patient.
To be avoided: Ŷ = −1 whereas Y = +1.
Less importantly: Ŷ = +1 whereas Y = −1.

Definition
The sampling items with Ŷ = +1 are said positive and those with Ŷ = −1 are said negative.

63 Prediction performance

Probability of a true positive:
P(Ŷ = +1 | Y = +1) = P(Ŷ = +1, Y = +1) / P(Y = +1),
estimated by the sensitivity or true positive rate:
sensitivity = #{i = 1, …, n : Ŷ_i = +1, Y_i = +1} / #{i = 1, …, n : Y_i = +1}.

Probability of a true negative:
P(Ŷ = −1 | Y = −1) = P(Ŷ = −1, Y = −1) / P(Y = −1),
estimated by the specificity or true negative rate:
specificity = #{i = 1, …, n : Ŷ_i = −1, Y_i = −1} / #{i = 1, …, n : Y_i = −1}.
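These counting formulas translate directly into R; a small sketch (the helper name sens.spec is illustrative), assuming positives are coded "2" and negatives "1" as in the maturity data, and using the Bayes rule predictions from the previous slides:

R script (illustrative sketch)
sens.spec = function(obs, pred) {
  c(sensitivity = sum(pred == "2" & obs == "2") / sum(obs == "2"),
    specificity = sum(pred == "1" & obs == "1") / sum(obs == "1"))
}
sens.spec(dta.12$Maturity, predicted)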

64 Prediction performance

Exercise
Give the sensitivity and the specificity of the Bayes logistic linear classification rule of the maturity of an apricot by the 3 colorimetric and 2 biometric measurements.

65 A short case study

Biostatistical issue: predicting the status, healthy (Y = −1) or sick (Y = +1), of a patient.
The classification rule is highly sensitive (0.9) and highly specific (0.9).
The incidence p = P(Y = +1) is low, say p = ...

66 A short case study

Probability that someone predicted as sick is sick:
P(Y = +1 | Ŷ = +1) = P(Ŷ = +1 | Y = +1) P(Y = +1) / P(Ŷ = +1) = sensitivity × incidence / P(Ŷ = +1),
where
P(Ŷ = +1) = P(Ŷ = +1 | Y = +1) P(Y = +1) + P(Ŷ = +1 | Y = −1) P(Y = −1)
          = sensitivity × incidence + (1 − specificity) × (1 − incidence)
          = 0.9 × p + (1 − 0.9) × (1 − p).
Hence,
P(Y = +1 | Ŷ = +1) = 0.9 p / (0.9 p + 0.1 (1 − p)), which is small when the incidence p is small.
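The incidence value used in the original slide was lost in transcription, so here is the computation with an assumed illustrative value p = 0.01:

R script (illustrative sketch)
sens = 0.9; spec = 0.9
p = 0.01                                       # Assumed incidence, for illustration only
ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
ppv                                            # About 0.083: most positives are false alarms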

67 A short case study

Biostatistical issue: predicting the status, healthy (Y = −1) or sick (Y = +1), of a patient.
The classification rule is highly sensitive (0.9) and highly specific (0.9).
The incidence p = P(Y = +1) is low.
Probability that someone predicted as sick is sick: P(Y = +1 | Ŷ = +1) is then small.
Conclusion: high sensitivity and high specificity do not guarantee a good prediction performance!

68 Precision of a classification rule

Probability that a positive is truly +1:
P(Y = +1 | Ŷ = +1) = P(Ŷ = +1, Y = +1) / P(Ŷ = +1),
estimated by the precision or Positive Predictive Value (PPV):
PPV = #{i = 1, …, n : Ŷ_i = +1, Y_i = +1} / #{i = 1, …, n : Ŷ_i = +1}.

Probability that a negative is truly −1:
P(Y = −1 | Ŷ = −1) = P(Ŷ = −1, Y = −1) / P(Ŷ = −1),
estimated by the Negative Predictive Value (NPV):
NPV = #{i = 1, …, n : Ŷ_i = −1, Y_i = −1} / #{i = 1, …, n : Ŷ_i = −1}.

69 Precision of a classification rule

R script
> perf = rep(0,5) # Creates a 5-vector with only 0 entries
> names(perf) = c("Nb. pos","Sens.","Spec.","PPV","NPV")
> colmargins = colSums(confusion) # Column totals
> rowmargins = rowSums(confusion) # Row totals
> perf[1] = colmargins[2] # Number of positives
> perf[2] = confusion[2,2]/rowmargins[2] # Sensitivity
> perf[3] = confusion[1,1]/rowmargins[1] # Specificity
> perf[4] = confusion[2,2]/colmargins[2] # PPV
> perf[5] = confusion[1,1]/colmargins[1] # NPV
> perf
Nb. pos   Sens.   Spec.     PPV     NPV
    ...     ...     ...     ...     ...

70 Classification ability of explanatory variables

The performance of a logistic classification rule depends on:
the relevance of the x's to predict the response Y = ±1;
the choice of the threshold on π(x) above which the prediction is Ŷ = +1.
In the Bayes classification rule, this threshold is 0.5:
a lower threshold leads to a larger TPR and a larger FPR;
a larger threshold leads to a lower TPR and a lower FPR.

71 Classification ability of explanatory variables

R script
> # Create a prediction object to be used by function performance (package ROCR)
> pred = prediction(predictions=pi,labels=dta.12$Maturity)
> tpr = performance(pred,measure="tpr") # Derive the TPR
> fpr = performance(pred,measure="fpr") # and the FPR
> # Plots TPR and FPR against the threshold
> plot(tpr,lwd=2,col="blue",ylab="TPR and FPR",xlab="Threshold")
> plot(fpr,lwd=2,col="orange",add=TRUE)
> legend("topright",bty="n",lwd=2,col=c("blue","orange"),legend=c("TPR","FPR"))

72 Classification ability of explanatory variables

[Figure: TPR (blue) and FPR (orange) plotted against the classification threshold]

73 Classification ability of explanatory variables

R script
> # Finds the minimal threshold for which TPR>=0.95
> choice = min(which(tpr@"y.values"[[1]]>=0.95))
> threshold = tpr@"x.values"[[1]][choice]
> threshold
[1] ...
> tpr@"y.values"[[1]][choice] # Corresponding TPR
[1] ...
> fpr@"y.values"[[1]][choice] # Corresponding FPR
[1] ...

74 Classification ability of explanatory variables

The performance of a logistic classification rule depends on:
the relevance of the x's to predict the response Y = ±1;
the choice of the threshold on π(x) above which the prediction is Ŷ = +1.
In the Bayes classification rule, this threshold is 0.5:
a lower threshold leads to a larger TPR and a larger FPR;
a larger threshold leads to a lower TPR and a lower FPR.
ROC curve: compromises between sensitivity (True Positive Rate) and specificity (True Negative Rate = 1 − False Positive Rate).

75 Classification ability of explanatory variables

R script
> # Derive performance criteria
> perf = performance(pred,measure="tpr",x.measure="fpr")
> plot(perf,lwd=2,col="blue")

76 Classification ability of explanatory variables

[Figure: ROC curve, true positive rate against false positive rate]

77 Classification ability of explanatory variables

ROC curve:
Starts from (0,0) for threshold = 1 (all predictions are −1).
Ends at (1,1) for threshold = 0 (all predictions are +1).
Measures the prediction ability by comparison with two reference ROC curves: the ideal classifier ROC curve, reaching (0,1); the worst classifier ROC curve, going along the line y = x.
Area Under the ROC Curve (AUC): measures the prediction ability of the x's:
AUC = 1 for the ideal classifier;
AUC = 0.5 for the worst classifier.

78 Classification ability of explanatory variables

R script
> ...
[1] ...
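The command on this slide was lost in transcription; given the preceding slide, it most plausibly computes the AUC from the ROCR prediction object, e.g.:

R script (illustrative sketch)
auc = performance(pred, measure = "auc")  # AUC of the ROC curve above
auc@"y.values"[[1]]                       # The AUC as a single number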

79 Classification ability of explanatory variables

R script
> # Finds the threshold whose (FPR, TPR) point is closest to the ideal point (0,1)
> choice = which.min(fpr@"y.values"[[1]]^2+(1-tpr@"y.values"[[1]])^2)
> threshold = tpr@"x.values"[[1]][choice]
> threshold
[1] ...
> tpr@"y.values"[[1]][choice] # Corresponding TPR
[1] ...
> fpr@"y.values"[[1]][choice] # Corresponding FPR
[1] ...

80 Course outline
1 Interpreting model fit
2 Model comparison
3 Subset selection
4 Prediction
5 Cross-validation

81 Assessment of a classification rule

Remark: in the previous prediction performance criteria, Y_i is explicitly used to derive Ŷ_i... a major deviation from the real conditions.
Recommendation: the classification rule, fitted on a learning sample, has to be applied to a completely separate test sample.

Definition
An assessment procedure involving a test sample, completely separated from the learning sample, is referred to as external cross-validation.
If n is moderate, then an internal K-fold CV procedure shall be preferred.

82 Assessment of a classification rule

[Diagram: a sample of n individuals]

83 Assessment of a classification rule

[Diagram] The sample is split into K balanced subsamples.

84 Assessment of a classification rule

[Diagram] Learning sample used to estimate the model.

85 Assessment of a classification rule

[Diagram] Testing sample to calculate prediction errors; learning sample to estimate the model.

86 Assessment of a classification rule

[Diagram] Testing sample to calculate prediction errors (next fold).

87 Assessment of a classification rule

[Diagram] Testing sample to calculate prediction errors (next fold).

88 Assessment of a classification rule

[Diagram] Testing sample to calculate prediction errors (last fold).

89 Assessment of a classification rule

The choice of K can affect the result of the CV procedure:
The CV procedure involves fitting the model K times: if fitting the model is computationally time-consuming, the values K = 3 and K = 10 are often chosen.
When n is small, K = n may be recommended. The resulting CV procedure is named leave-one-out cross-validation.
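For instance, leave-one-out CV for the selected model Maturity ~ a + L + b can be sketched as follows (a minimal illustration, assuming the dta.12 data of the previous slides):

R script (illustrative sketch)
n = nrow(dta.12)
loo.pi = numeric(n)
for (i in 1:n) { # Each item is left out once and predicted from the n-1 others
  fit = glm(Maturity ~ a + L + b, family = binomial, data = dta.12[-i, ])
  loo.pi[i] = predict(fit, newdata = dta.12[i, ], type = "response")
}
loo.pred = ifelse(loo.pi >= 0.5, "2", "1") # Bayes rule on the cross-validated probabilities
mean(loo.pred != dta.12$Maturity)          # Leave-one-out misclassification rate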

90 Assessment of a classification rule

R script
> # Step 1: segmentation of the dataset in 10 subsamples
> subsamples = cvsegments(N=307,k=10) # cvsegments() from package pls
> # List whose 10 components contain the indices of items in each of the 10 segments
> unlist(lapply(subsamples,length)) # Subsamples sizes
 V1  V2  V3  V4  V5  V6  V7  V8  V9 V10
 31  31  31  31  31  31  31  30  30  30

91 Assessment of a classification rule

R script
> # Step 2: cycling over the segments
> cvpredicted = rep("0",307) # Will contain the cross-validated predictions
> nbselected = rep(0,10) # Will contain the number of selected variables
> for (k in 1:10) { # Cycling over the 10 segments
+   learn = dta.12[-subsamples[[k]],-7] # The kth segment is excluded
+   test = dta.12[subsamples[[k]],-7] # The test sample is just the kth segment
+   maturity.select = bestglm(Xy=learn,family=binomial,method="exhaustive")
+   resselect = maturity.select$Subsets
+   selected = unlist(resselect[which.min(resselect[,8]),2:6]) # selected is a vector of boolean values
+   nbselected[k] = sum(selected)
+   # Fits the selected model
+   maturity.logit = glm(Maturity~.,family=binomial,data=learn[,c(selected,TRUE)])
+   pi = predict(maturity.logit,newdata=test[,c(selected,TRUE)],type="response")
+   cvpredicted[subsamples[[k]]] = ifelse(pi>=threshold,"2","1")
+ }
> nbselected
[1] ...

92 Assessment of a classification rule

R script
> # Step 3: cross-validated performance criteria
> # The confusion matrix is first created
> confusion = table(dta.12$Maturity,cvpredicted,dnn=list("Obs.","Pred."))
> confusion
    Pred.
Obs.   1   2
  1  ... ...
  2  ... ...

93 Assessment of a classification rule

R script
> # The performance criteria are deduced
> perf = rep(0,5) # Creates a 5-vector with only 0 entries
> names(perf) = c("Nb. pos","Sens.","Spec.","PPV","NPV")
> colmargins = colSums(confusion) # Column totals
> rowmargins = rowSums(confusion) # Row totals
> perf[1] = colmargins[2] # Number of positives
> perf[2] = confusion[2,2]/rowmargins[2] # Sensitivity
> perf[3] = confusion[1,1]/rowmargins[1] # Specificity
> perf[4] = confusion[2,2]/colmargins[2] # PPV
> perf[5] = confusion[1,1]/colmargins[1] # NPV
> perf
Nb. pos   Sens.   Spec.     PPV     NPV
    ...     ...     ...     ...     ...

94 Modeling defects in cheese production

Exercise
A dairy food industry wishes to build an objective classification rule to detect major defects in cheese. For that purpose, they collect daily data on the proportions of defective cheeses and the corresponding food process conditions, characterized by 2 sanitary variables (San1 and San2) and 3 milk quality variables (From1, From2, From3). Data are provided in file cheese.txt.
Propose and fit an appropriate model for the above issue using the possible explanatory variables San1, San2, From1, From2, From3, San1², From1², From3² and From1 × From3.
Suppose we want the classification rule to be able to detect 90% of the defective cheeses. Correspondingly, which proportion of false positives should we expect?


More information

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Multiple linear regression S6

Multiple linear regression S6 Basic medical statistics for clinical and experimental research Multiple linear regression S6 Katarzyna Jóźwiak k.jozwiak@nki.nl November 15, 2017 1/42 Introduction Two main motivations for doing multiple

More information

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00 Two Hours MATH38052 Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER GENERALISED LINEAR MODELS 26 May 2016 14:00 16:00 Answer ALL TWO questions in Section

More information

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model Stat 3302 (Spring 2017) Peter F. Craigmile Simple linear logistic regression (part 1) [Dobson and Barnett, 2008, Sections 7.1 7.3] Generalized linear models for binary data Beetles dose-response example

More information

y ˆ i = ˆ " T u i ( i th fitted value or i th fit)

y ˆ i = ˆ  T u i ( i th fitted value or i th fit) 1 2 INFERENCE FOR MULTIPLE LINEAR REGRESSION Recall Terminology: p predictors x 1, x 2,, x p Some might be indicator variables for categorical variables) k-1 non-constant terms u 1, u 2,, u k-1 Each u

More information

STAC51: Categorical data Analysis

STAC51: Categorical data Analysis STAC51: Categorical data Analysis Mahinda Samarakoon April 6, 2016 Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 25 Table of contents 1 Building and applying logistic regression models (Chap

More information

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The

More information

Count data page 1. Count data. 1. Estimating, testing proportions

Count data page 1. Count data. 1. Estimating, testing proportions Count data page 1 Count data 1. Estimating, testing proportions 100 seeds, 45 germinate. We estimate probability p that a plant will germinate to be 0.45 for this population. Is a 50% germination rate

More information

Model comparison. Patrick Breheny. March 28. Introduction Measures of predictive power Model selection

Model comparison. Patrick Breheny. March 28. Introduction Measures of predictive power Model selection Model comparison Patrick Breheny March 28 Patrick Breheny BST 760: Advanced Regression 1/25 Wells in Bangladesh In this lecture and the next, we will consider a data set involving modeling the decisions

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

Single-level Models for Binary Responses

Single-level Models for Binary Responses Single-level Models for Binary Responses Distribution of Binary Data y i response for individual i (i = 1,..., n), coded 0 or 1 Denote by r the number in the sample with y = 1 Mean and variance E(y) =

More information

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.

Regression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont. TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted

More information

Generalised linear models. Response variable can take a number of different formats

Generalised linear models. Response variable can take a number of different formats Generalised linear models Response variable can take a number of different formats Structure Limitations of linear models and GLM theory GLM for count data GLM for presence \ absence data GLM for proportion

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 .0.0 5 5 1.0 7 5 X2 X2 7 1.5 1.0 0.5 3 1 2 Hierarchical clustering

More information

STAT 525 Fall Final exam. Tuesday December 14, 2010

STAT 525 Fall Final exam. Tuesday December 14, 2010 STAT 525 Fall 2010 Final exam Tuesday December 14, 2010 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information