Regression modeling for categorical data. Part II : Model selection and prediction


1 Regression modeling for categorical data. Part II: Model selection and prediction. David Causeur, Agrocampus Ouest, IRMAR CNRS UMR 6625, http://math.agrocampus-ouest.fr/infogluedeliverlive/membres/david.causeur

2 Outline 1 Interpreting model fit 2 Model comparison 3 Subset selection 4 Prediction 5 Cross-validation

3 Definition: individual departures from the fit. The deviance residuals $\varepsilon_i(\hat\beta)$ are defined as follows:
$$\varepsilon_i(\hat\beta) = y_i \sqrt{2 \log\left(1 + \exp\left(-y_i(\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_p x_{ip})\right)\right)}.$$
They are the individual contributions to the residual deviance:
$$D_{x,y}(\hat\beta) = \sum_{i=1}^n \varepsilon_i^2(\hat\beta).$$
For large $n$, $\varepsilon_i(\hat\beta) \approx \mathcal{N}(0,1)$.

4 Individual departures from the fit

R script
> epsilon = residuals(maturity.logit, type="deviance")   # Extracts the deviance residuals
> outlier = which(abs(epsilon) > 2)                      # Selects the indices of the largest epsilons
> pi = predict(maturity.logit, type="response")          # Calculates the estimated pi = P(Y=+1)
> cbind(scale(dta.12[,1:5]), dta.12[,6:7], pi)[outlier,] # Displays external information
     L   a   b   Weight   Diam   Maturity   Variety   pi
                                            go
                                            mo
                                            bl
                                            bl
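Since the deviance residuals are defined as the individual contributions to the residual deviance, their sum of squares must reproduce the deviance of the fit; a quick check of this identity, reusing epsilon and maturity.logit from the script above:

R script
> sum(epsilon^2)             # Sum of the squared deviance residuals...
> deviance(maturity.logit)   # ...equals the residual deviance of the fit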

5 Plotting the model fit

Model for maturity from variety and index a:
$\operatorname{logit} \pi_1(x) = \mu + \beta x$ (for variety 37)
$\operatorname{logit} \pi_2(x) = \mu + \alpha_2 + (\beta + \gamma_2)x$ (for variety bl)
$\operatorname{logit} \pi_3(x) = \mu + \alpha_3 + (\beta + \gamma_3)x$ (for variety go)
$\operatorname{logit} \pi_4(x) = \mu + \alpha_4 + (\beta + \gamma_4)x$ (for variety mo)

Correspondence between the estimated coefficients and the model parameters:

Coefficient   Parameter
(Intercept)   $\mu$
Varietybl     $\alpha_2$
Varietygo     $\alpha_3$
Varietymo     $\alpha_4$
a             $\beta$
Varietybl:a   $\gamma_2$
Varietygo:a   $\gamma_3$
Varietymo:a   $\gamma_4$

6 Plotting the model fit

R script
> vec.a = seq(from=min(dta.12$a), to=max(dta.12$a), by=0.01)
> # vec.a is a high resolution sequence of a values
> varieties = levels(dta.12$Variety)   # Vector of Variety levels
> pi = matrix(0, nrow=length(vec.a), ncol=4)
> # pi will be used to store the estimated P(Y=+1)
> # One row of pi for each value in vec.a, one column for each variety
> for (j in 1:4)
+   pi[,j] = predict(maturity.logit, type="response",
+     newdata=data.frame(Variety=varieties[j], a=vec.a))   # Estimated P(Y=+1) in matrix pi
> matplot(vec.a, pi, type="l", lwd=2, lty=1, xlab="a", ylab=expression(pi),
+   main="Maturity along index a")   # Plots the 4 probability curves
> legend("bottomright", lwd=2, lty=1, col=1:4,
+   legend=c("37","bl","go","mo"), bty="n")   # Adds a legend to the plot

7 Plotting the model fit [Figure: "Maturity along index a", the four estimated probability curves π along index a, one per variety (37, bl, go, mo).]

8 Confidence intervals for the regression parameters

Asymptotic distribution of the ML estimator of β: the ML estimator $\hat\beta$ of $\beta$ is approximately normally distributed, for large $n$, with mean $\beta$ and variance matrix
$$V_{\hat\beta} = (X^\top V X)^{-1},$$
where $X$ is the design matrix of the model and $V$ is the diagonal matrix with entries $\pi_i(1-\pi_i)$.

9 Confidence intervals for the regression parameters

R script
> X = model.matrix(~Variety*a, data=dta.12)   # Extracts the design matrix of the model
> pi = predict(maturity.logit, type="response")   # Fitted P(Y=+1)
> V = diag(pi*(1-pi))   # Diagonal matrix whose diagonal entries are pi*(1-pi)
> Var.beta = solve(t(X)%*%V%*%X)   # Asymptotic variance of the ML estimator
> sqrt(diag(Var.beta))   # Estimated standard deviations of the regression parameters
 (Intercept)  Varietybl  Varietygo  Varietymo  a  Varietybl:a  Varietygo:a  Varietymo:a

10 Confidence intervals for the regression parameters

Asymptotic distribution of the ML estimator of β: the ML estimator $\hat\beta$ of $\beta$ is approximately normally distributed, for large $n$, with mean $\beta$ and variance matrix $V_{\hat\beta} = (X^\top V X)^{-1}$.

Confidence interval $CI_{1-\alpha}(\beta_j)$ with confidence level $1-\alpha$ for $\beta_j$:
$$CI_{1-\alpha}(\beta_j) = \left[\hat\beta_j - z_{1-\alpha/2}\,\hat\sigma_{\hat\beta_j}\;;\;\hat\beta_j + z_{1-\alpha/2}\,\hat\sigma_{\hat\beta_j}\right],$$
where $z_{1-\alpha/2} = F^{-1}_{0,1}(1-\alpha/2)$ is the $(1-\alpha/2)$-quantile of the standard normal distribution.
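As a sketch, the Wald interval above can be computed by hand from the estimates and the standard deviations derived with Var.beta on slide 9; confint.default, used on the next slide, returns these same intervals:

R script
> se = sqrt(diag(Var.beta))            # Estimated standard deviations of the parameters
> z = qnorm(0.975)                     # z_{1-alpha/2} for alpha = 0.05
> cbind(coef(maturity.logit) - z*se,   # Lower bounds of the 95% C.I.
+       coef(maturity.logit) + z*se)   # Upper bounds of the 95% C.I.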

11 Confidence intervals for the regression parameters

R script
> ci = cbind(Estimate=coef(maturity.logit), confint.default(maturity.logit, level=0.95))
> ci
              Estimate   2.5 %   97.5 %
(Intercept)
Varietybl
Varietygo
Varietymo
a
Varietybl:a
Varietygo:a
Varietymo:a

12 Confidence intervals for the regression parameters

R script
> exp(ci)   # Confidence intervals for odds-ratios
              Estimate   2.5 %   97.5 %
(Intercept)                       e+00
Varietybl                         e+04
Varietygo                         e-01
Varietymo                         e+09
a                                 e+00
Varietybl:a                       e+01
Varietygo:a                       e+00
Varietymo:a                       e+01

13 Confidence intervals for the regression parameters

R script
> pi = lwr = upr = matrix(0, nrow=length(vec.a), ncol=4)
> # pi stores the estimated P(Y=+1): one row of pi for each value in vec.a
> # One column for each variety. Same for lwr (lower bound) and upr (upper bound)
> for (j in 1:4) {
+   predictions = predict(maturity.logit, type="response",
+     newdata=data.frame(Variety=varieties[j], a=vec.a), se.fit=TRUE)
+   pi[,j] = predictions$fit
+   lwr[,j] = predictions$fit - 1.96*predictions$se.fit
+   lwr[lwr[,j]<0, j] = 0   # Lower bound of C.I. >= 0
+   upr[,j] = predictions$fit + 1.96*predictions$se.fit
+   upr[upr[,j]>1, j] = 1   # Upper bound of C.I. <= 1
+ }

14 Confidence intervals for the regression parameters

R script
> par(mfrow=c(2,2))   # Splits the graphics in a 2x2 grid
> color = rgb(red=0, green=0, blue=0.9, alpha=0.5)   # Code for a transparent blue
> for (j in 1:4) {
+   plot(vec.a, pi[,j], type="l", lwd=2, lty=1, xlab="a", ylim=c(0,1),
+     ylab=expression(pi), main="Maturity along index a")
+   polygon(c(vec.a, rev(vec.a)), c(lwr[,j], rev(upr[,j])), col=color)
+   # Adds a shaded confidence region around the curve
+   lines(vec.a, pi[,j], lwd=2)
+   mtext(paste("Variety", varieties[j]))
+ }
> par(mfrow=c(1,1))   # Restores the 1x1 organization for the next graphics device

15 Confidence intervals for the regression parameters [Figure: four panels, "Maturity along index a" for Variety 37, bl, go and mo, each showing the estimated probability curve with its shaded confidence band.]

16 Wald tests

Based on the asymptotic normality of $\hat\beta$:
$$Z_{\beta_j} = \frac{\hat\beta_j}{\hat\sigma_{\hat\beta_j}}$$
is a Student-like test statistic for the test of $H_0^{(j)}: \beta_j = 0$.
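The z values and two-sided p-values reported by summary() on the next slide can be recovered by hand from the estimates and the standard deviations of slide 9; a minimal sketch:

R script
> Z = coef(maturity.logit)/sqrt(diag(Var.beta))   # Wald statistics, one per parameter
> 2*pnorm(abs(Z), lower.tail=FALSE)               # Two-sided p-values, as in summary()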

17 Wald tests

R script
> maturity.logit = glm(Maturity~Variety*a, data=dta.12, family=binomial)
> summary(maturity.logit)

Call:
glm(formula = Maturity ~ Variety * a, family = binomial, data = dta.12)

Coefficients:
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)
Varietybl                                        e-05
Varietygo
Varietymo
a                                                e-06
Varietybl:a
Varietygo:a
Varietymo:a

18 Assessment of the fit

Residual deviance $D_{x,y}(\hat\beta)$: the lowest possible deviance among all the possible fits of the model. The null deviance is the deviance of the most basic of those possible fits, the null model (intercept only).

19 Assessment of the fit

Residual deviance $D_{x,y}(\hat\beta)$: the lowest possible deviance among all the possible fits of the model. The null deviance is the deviance of the most basic of those possible fits, the null model (intercept only).

R script
> maturity.logit = glm(Maturity~Variety*a, data=dta.12, family=binomial)
> summary(maturity.logit)

Call:
glm(formula = Maturity ~ Variety * a, family = binomial, data = dta.12)

Null deviance:      on 306 degrees of freedom
Residual deviance:  on 299 degrees of freedom
AIC:

20 Assessment of the fit

Residual deviance $D_{x,y}(\hat\beta)$: the lowest possible deviance among all the possible fits of the model. Model comparison has to account for the complexity of the model.

Definition. The Akaike Information Criterion $AIC_{x,y}(\hat\beta)$ is given by:
$$AIC_{x,y}(\hat\beta) = D_{x,y}(\hat\beta) + 2(p+1).$$
$AIC_{x,y}(\hat\beta)$ estimates the information loss when using the model estimated with $\hat\beta$ rather than the unknown model that is supposed to generate the data.

21 Assessment of the fit

Bayesian Information Criterion (BIC):
$$BIC_{x,y}(\hat\beta) = D_{x,y}(\hat\beta) + \ln(n)(p+1),$$
to measure the information loss when using the estimated model rather than the true model, within the scope of the parametric models considered.
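Both criteria can be checked directly in R: since the response is binary, the saturated log-likelihood is zero, so the residual deviance equals minus twice the maximized log-likelihood and the definitions above match the built-in AIC() and BIC(); a minimal sketch on the maturity fit:

R script
> npar = length(coef(maturity.logit))                 # Number of parameters, p+1
> deviance(maturity.logit) + 2*npar                   # AIC by hand...
> AIC(maturity.logit)                                 # ...same value as the built-in AIC
> deviance(maturity.logit) + log(nrow(dta.12))*npar   # BIC by hand...
> BIC(maturity.logit)                                 # ...same value as the built-in BIC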

22 Assessment of the fit

AIC or BIC?
If the goal is to build a prediction rule, AIC is recommended.
If the goal is just to fit the model, BIC should be favored.
The goodness-of-fit of a model is penalized more heavily for its complexity by BIC than by AIC.

23 Outline 1 Interpreting model fit 2 Model comparison 3 Subset selection 4 Prediction 5 Cross-validation

24 Significance of an effect

Illustration by the maturity study: suppose we aim at testing a Variety × index a interaction effect. It consists in comparing the full model $M_{full}$:
$$\operatorname{logit} \pi_i(x) = \mu + \alpha_i + (\beta + \gamma_i)x,$$
to one of its possible submodels, $M_{sub}$, obtained by setting the γ's to zero:
$$\operatorname{logit} \pi_i(x) = \mu + \alpha_i + \beta x.$$

25 Significance of an effect

Illustration by the maturity study: suppose we aim at testing a Variety × index a interaction effect. Testing for the significance of the interaction effect is stated as:
H0: there is no interaction effect,
H1: there is an interaction effect,
or, in an equivalent model comparison perspective:
H0: $M_{full}$ does not explain the maturity better than $M_{sub}$,
H1: $M_{full}$ does explain the maturity better than $M_{sub}$.

26 Likelihood-ratio test

The residual deviances $D_{sub}$ and $D_{full}$ will be used to compare $M_{sub}$ and $M_{full}$.

R script
> maturity.full = glm(Maturity~Variety*a, data=dta.12, family=binomial)
> maturity.sub = glm(Maturity~Variety+a, data=dta.12, family=binomial)
> deviance(maturity.full)
[1]
> deviance(maturity.sub)
[1]

27 Likelihood-ratio test

The residual deviances $D_{sub}$ and $D_{full}$ will be used to compare $M_{sub}$ and $M_{full}$. The difference
$$D_{full/sub} = D_{sub} - D_{full}$$
measures the fitting gain obtained by using model $M_{full}$ rather than model $M_{sub}$.
Null distribution: $\chi^2_3$, only depending on the difference between the numbers of parameters in the two models, here 3 (the γ's).

28 Likelihood-ratio test

The residual deviances $D_{sub}$ and $D_{full}$ will be used to compare $M_{sub}$ and $M_{full}$.

R script
> dev.diff = deviance(maturity.sub) - deviance(maturity.full)
> pchisq(dev.diff, df=3, lower.tail=FALSE)
> # Gives the probability that a chi-square variable exceeds dev.diff
[1] e-19

29 Likelihood-ratio test

Now summed up in a general framework:

Definition (χ² analysis of deviance, or LRT test). Suppose $M_{sub} \subset M_{full}$ are two nested models. The so-called likelihood-ratio test (LRT) statistic, or analysis-of-deviance test statistic, for the hypothesis testing issue
H0: $M_{full}$ does not explain the response better than $M_{sub}$,
H1: $M_{full}$ does explain the response better than $M_{sub}$,
is defined as
$$D_{full/sub} = D_{sub} - D_{full} = -2\log\frac{l_{sub}}{l_{full}}.$$
Under H0, $D_{full/sub} \approx \chi^2_{k_{full}-k_{sub}}$.

30 Likelihood-ratio test

R script
> anova(maturity.sub, maturity.full, test="Chisq")
Analysis of Deviance Table

Model 1: Maturity ~ Variety + a
Model 2: Maturity ~ Variety * a
  Resid. Df   Resid. Dev   Df   Deviance   Pr(>Chi)
1
2                                          < 2.2e-16

31 Analysis of Deviance Table

Complete analysis of deviance table (function Anova, from the car package):

R script
> Anova(maturity.logit)
Analysis of Deviance Table

Response: Maturity
            LR Chisq   Df   Pr(>Chisq)
Variety                     < 2.2e-16
a                           < 2.2e-16
Variety:a                   < 2.2e-16

32 Analysis of Deviance Table

1st row, the main effect of Variety is tested:
H0: $\operatorname{logit} \pi_i(x) = \mu + \beta x$,
H1: $\operatorname{logit} \pi_i(x) = \mu + \alpha_i + \beta x$.
2nd row, the main effect of the a index is tested similarly.
3rd row, the test for the interaction effect is handled differently:
H0: $\operatorname{logit} \pi_i(x) = \mu + \alpha_i + \beta x$,
H1: $\operatorname{logit} \pi_i(x) = \mu + \alpha_i + (\beta + \gamma_i)x$.

33 Analysis of Deviance Table

The Wald test for the significance of a single parameter is the counterpart of the Student t-test in the usual linear model. Similarly, the LRT corresponds to the Fisher F-test for analysis of variance. When an effect is measured by one single parameter:
in the usual linear model, the t-test statistic is just the signed square root of the F-test statistic (their p-values are exactly the same);
in the logistic linear model, this coherence between the Wald test and the LRT no longer holds.
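The first point can be checked numerically in the usual linear model; a small sketch on simulated data (the variables x, y and the fit below are illustrative, not part of the maturity study):

R script
> set.seed(1)
> x = rnorm(50); y = 1 + 0.5*x + rnorm(50)       # Simulated data, for illustration only
> fit = lm(y ~ x)
> t = summary(fit)$coefficients["x","t value"]   # Student test statistic for x
> F = anova(fit)["x","F value"]                  # Fisher test statistic for x
> all.equal(t^2, F)                              # t is the signed square root of F
[1] TRUE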

34 Model for preference data

Exercise. We aim at studying the effect of age on the preference of women for a special type of perfume, denoted G2, over another one, denoted G3. It is suspected that the way the preference is affected by age may depend on the consumer's habits, especially the frequency of use of a perfume. Results of a consumer study are provided in file parfums.txt in order to address this issue.
Propose and fit an appropriate model for the above issue. Is the effect of age on preference different according to the frequency of use of a perfume?

35 Model for colouring of fat

Exercise. Experimental study to investigate the causes of yellow fat in lamb meat. The experiment focuses on two possible causes, the feeding and housing modes, in a balanced design: 20 lambs per possible combination of two feeding modes (1: two meals a day or 2: ad libitum) and two housing modes (1: individual and 2: collective). Numbers of lambs with coloured fat:

Feeding   Housing   Coloured   Total
F1        H1                   20
F1        H2                   20
F2        H1                   20
F2        H2                   20

Propose and fit an appropriate model for the above issue. Is the effect of feeding mode different according to the housing mode? Give the odds-ratio of the feeding mode.

36 Outline 1 Interpreting model fit 2 Model comparison 3 Subset selection 4 Prediction 5 Cross-validation

37 Subset selection

Let us now model the maturity class of an apricot by L, a, b, Diam and Weight.
Subset selection issue: which subset of those 5 x's is sufficient to explain the maturity status?
Handled by comparison of the $2^5 = 32$ submodels.
The submodels are not all nested: the LRT is not appropriate.
For a given number k of x's: the submodel $M^*_k$ with lowest residual deviance $D^*_k$ is the champion.
The overall champion $M^*$ is the $M^*_k$ with lowest BIC (or AIC).

38 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge!
Stepwise search algorithms: a forward search starts from $M_0$ (the null model).
First step: fit the p models with only one x and keep the model $M_1$ with lowest BIC.
kth step, 2 variants:
forward stepwise: fit the $p - k + 1$ models obtained by adding one x to $M_{k-1}$ and keep the model with lowest BIC;
forward/backward stepwise: fit the p models obtained by adding or removing one x to/from $M_{k-1}$ and keep the model with lowest BIC.
Stop when adding an x increases BIC.
In a forward/backward search, at most $p^2$ submodels are fitted.

39 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge!

R script
> p = 50    # For example, p=50 candidate variables
> p^2       # Maximum number of model fits in a stepwise search
[1] 2500
> 2^p       # Number of possible submodels
[1] 1.1259e+15
> p^2/2^p   # Proportion of submodels explored in a stepwise search
[1] 2.220446e-12

40 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge!
forward or backward?
backward tends to keep more variables in the selection;
when $p \approx n$ or $p > n$, the estimation of the full model is not reliable (or just not possible).
forward or forward/backward?
forward/backward is greedier than just forward;
both algorithms are computationally equivalent: the forward/backward search can be viewed as a free improvement.

41 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge!

R script
> maturity.logit = glm(Maturity~., family=binomial, data=dta.12[,-7])
> # Fits the model with all explanatory variables except Variety
> stepwise(maturity.logit, direction="forward/backward", criterion="BIC")

Direction:  forward/backward
Criterion:  BIC

Start:  AIC=
Maturity ~ 1

         Df   Deviance   AIC
+ a
+ Weight
<none>
+ b
+ Diam
+ L

42 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge!

R script
> maturity.logit = glm(Maturity~., family=binomial, data=dta.12[,-7])
> # Fits the model with all explanatory variables except Variety
> stepwise(maturity.logit, direction="forward/backward", criterion="BIC")

Direction:  forward/backward
Criterion:  BIC

Step:  AIC=
Maturity ~ a

         Df   Deviance   AIC
+ L
+ Weight
+ Diam
+ b
<none>
- a

43 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge!

R script
> maturity.logit = glm(Maturity~., family=binomial, data=dta.12[,-7])
> # Fits the model with all explanatory variables except Variety
> stepwise(maturity.logit, direction="forward/backward", criterion="BIC")

Direction:  forward/backward
Criterion:  BIC

Step:  AIC=
Maturity ~ a + L

         Df   Deviance   AIC
+ b
<none>
+ Diam
+ Weight
- L
- a

44 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge!

R script
         Df   Deviance   AIC
<none>
- b
+ Diam
+ Weight
- L
- a

Call:  glm(formula = Maturity ~ a + L + b, family = binomial, data = dta.12[,-7])

Coefficients:
(Intercept)            a            L            b

Degrees of Freedom: 306 Total (i.e. Null);  303 Residual
Null Deviance:
Residual Deviance:        AIC: 241.3

45 Search algorithm

If p explanatory variables, then $2^p$ submodels... can be huge! Since p is moderate, an exhaustive search is possible here (function bestglm, from the bestglm package):

R script
> maturity.select = bestglm(Xy=dta.12[,-7], family=binomial, method="exhaustive")
> maturity.select$Subsets[,8] = maturity.select$Subsets[,8] + log(nrow(dta.12))
> maturity.select$Subsets
    Intercept     L     a     b   Weight   Diam   BIC
0        TRUE FALSE FALSE FALSE    FALSE  FALSE
1        TRUE FALSE  TRUE FALSE    FALSE  FALSE
2        TRUE  TRUE  TRUE FALSE    FALSE  FALSE
3        TRUE  TRUE  TRUE  TRUE    FALSE  FALSE
4        TRUE  TRUE  TRUE FALSE     TRUE   TRUE
5*       TRUE  TRUE  TRUE  TRUE     TRUE   TRUE

46 Outline 1 Interpreting model fit 2 Model comparison 3 Subset selection 4 Prediction 5 Cross-validation

47 Classification rule

In the usual linear model, predicting the unknown value of Y from $x = (x_1, \dots, x_p)$ is unambiguous:
$$\hat Y = \hat E(Y \mid X = x) = \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_p x_p.$$

48 Classification rule

In the logistic linear model, the prediction issue is less clear:
estimation of $\pi(x) = P(Y = +1 \mid X = x)$?
assignment of a Y value, either +1 or -1, given x?

R script
> maturity.logit = glm(Maturity~., family=binomial, data=dta.12[,-7])
> # Using the model to estimate the probability that Maturity = 2
> pi = predict(maturity.logit, type="response")
> # Plotting the predictions versus observed maturity status
> plot(pi~dta.12$Maturity, xlab="Observed maturity status",
+   ylab="Estimated probability of 'Maturity=2'", cex.lab=1.25,
+   main="Predictions versus observed maturity status", cex.axis=1.25,
+   cex.main=1.25)

49 Classification rule [Figure: "Predictions versus observed maturity status", the estimated probability of 'Maturity=2' plotted against the observed maturity status.]

50 Classification rule

Definition. A decision rule aiming at the prediction of the value of a categorical response variable Y from $X = (X_1, \dots, X_p)$ is named a classification rule. It is an algorithm describing how to get $\hat Y$, starting from x.

Definition. The misclassification probability of an item with explanatory profile $x_0$ and unknown response value $Y_0$ is $P(\hat Y_0 \neq Y_0 \mid X = x_0)$.

51 Bayes classification rule

Definition. The logistic linear Bayes classification rule is derived as follows: if $\hat\pi(x_0)$ is the estimated probability that $Y_0 = +1$, then:
$\hat Y_0 = +1$ if $\hat\pi(x_0) \geq 0.5$,
$\hat Y_0 = -1$ if $\hat\pi(x_0) < 0.5$.

52 Bayes classification rule

R script
> # Implementation of the Bayes logistic classification rule
> predicted = ifelse(pi>=0.5, "2", "1")
> # Construction of the confusion matrix
> confusion = table(dta.12$Maturity, predicted, dnn=list("Obs.","Pred."))
> confusion
      Pred.
Obs.    1    2
   1
   2

53 Prediction performance

The global misclassification rate is often not suitable. Example: Y is the status, sick (Y = +1) or healthy (Y = -1), of a patient.
To be avoided: $\hat Y = -1$ whereas $Y = +1$;
less importantly: $\hat Y = +1$ whereas $Y = -1$.

Definition. The sampling items with $\hat Y = +1$ are said to be positive and those with $\hat Y = -1$ are said to be negative.

54 Prediction performance

Probability of a true positive:
$$P(\hat Y = +1 \mid Y = +1) = \frac{P(\hat Y = +1, Y = +1)}{P(Y = +1)},$$
estimated by the sensitivity, or true positive rate:
$$\text{sensitivity} = \frac{\#\{i = 1,\dots,n :\ \hat Y_i = +1, Y_i = +1\}}{\#\{i = 1,\dots,n :\ Y_i = +1\}}.$$
Probability of a true negative:
$$P(\hat Y = -1 \mid Y = -1) = \frac{P(\hat Y = -1, Y = -1)}{P(Y = -1)},$$
estimated by the specificity, or true negative rate:
$$\text{specificity} = \frac{\#\{i = 1,\dots,n :\ \hat Y_i = -1, Y_i = -1\}}{\#\{i = 1,\dots,n :\ Y_i = -1\}}.$$

55 Prediction performance

Exercise. Give the sensitivity and specificity of the Bayes logistic linear classification rule for the maturity of an apricot based on the 3 colorimetric and 2 biometric measurements.

56 A short case study

Biostatistical issue: predicting the status, healthy (Y = -1) or sick (Y = +1), of a patient.
The classification rule is highly sensitive (0.9) and highly specific (0.9).
The incidence $p = P(Y = +1)$ is low.

57 A short case study

Probability that someone predicted as sick is sick:
$$P(Y = +1 \mid \hat Y = +1) = P(\hat Y = +1 \mid Y = +1)\,\frac{P(Y = +1)}{P(\hat Y = +1)} = \frac{\text{sensitivity} \times \text{incidence}}{P(\hat Y = +1)},$$
where
$$P(\hat Y = +1) = P(\hat Y = +1 \mid Y = +1)P(Y = +1) + P(\hat Y = +1 \mid Y = -1)P(Y = -1)$$
$$= \text{sensitivity} \times \text{incidence} + (1 - \text{specificity}) \times (1 - \text{incidence}) = 0.9\,p + (1 - 0.9)(1 - p).$$
Hence,
$$P(Y = +1 \mid \hat Y = +1) = \frac{0.9\,p}{0.9\,p + 0.1\,(1 - p)}.$$

58 A short case study

Biostatistical issue: predicting the status, healthy (Y = -1) or sick (Y = +1), of a patient.
The classification rule is highly sensitive (0.9) and highly specific (0.9).
The incidence $p = P(Y = +1)$ is low.
Probability that someone predicted as sick is sick: $P(Y = +1 \mid \hat Y = +1) = 0.9p/(0.9p + 0.1(1-p))$, which is small when the incidence p is small.
Conclusion: high sensitivity and high specificity do not guarantee a good prediction performance!
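A minimal sketch of this computation in R; the function name ppv and the example incidence of 0.001 are illustrative assumptions, while the 0.9 sensitivity and specificity come from the slide:

R script
> ppv = function(sens, spec, p) sens*p/(sens*p + (1-spec)*(1-p))
> ppv(sens=0.9, spec=0.9, p=0.001)   # Less than 1% of the positives are actually sick
[1] 0.008928571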

59 Precision of a classification rule

Probability that a positive is truly +1:
$$P(Y = +1 \mid \hat Y = +1) = \frac{P(\hat Y = +1, Y = +1)}{P(\hat Y = +1)},$$
estimated by the precision, or Positive Predictive Value (PPV):
$$PPV = \frac{\#\{i = 1,\dots,n :\ \hat Y_i = +1, Y_i = +1\}}{\#\{i = 1,\dots,n :\ \hat Y_i = +1\}}.$$
Probability that a negative is truly -1:
$$P(Y = -1 \mid \hat Y = -1) = \frac{P(\hat Y = -1, Y = -1)}{P(\hat Y = -1)},$$
estimated by the Negative Predictive Value (NPV):
$$NPV = \frac{\#\{i = 1,\dots,n :\ \hat Y_i = -1, Y_i = -1\}}{\#\{i = 1,\dots,n :\ \hat Y_i = -1\}}.$$

60 Precision of a classification rule

R script
> perf = rep(0,5)   # Creates a 5-vector with only 0 entries
> names(perf) = c("Nb. pos","Sens.","Spec.","PPV","NPV")
> colmargins = colSums(confusion)   # Column totals
> rowmargins = rowSums(confusion)   # Row totals
> perf[1] = colmargins[2]   # Number of positives
> perf[2] = confusion[2,2]/rowmargins[2]   # Sensitivity
> perf[3] = confusion[1,1]/rowmargins[1]   # Specificity
> perf[4] = confusion[2,2]/colmargins[2]   # PPV
> perf[5] = confusion[1,1]/colmargins[1]   # NPV
> perf
Nb. pos   Sens.   Spec.   PPV   NPV

61 Classification ability of explanatory variables

The performance of a logistic classification rule depends on:
the relevance of the x's to predict the response Y = ±1;
the choice of the threshold on π(x), above which the prediction is Ŷ = +1.
In the Bayes classification rule, this threshold is 0.5:
a lower threshold leads to larger TPR and FPR;
a larger threshold leads to lower TPR and FPR.

62 Classification ability of explanatory variables

R script (functions prediction and performance from the ROCR package)
> # Create a prediction object to be used by function performance
> pred = prediction(predictions=pi, labels=dta.12$Maturity)
> tpr = performance(pred, measure="tpr")   # Derive the TPR
> fpr = performance(pred, measure="fpr")   # and the FPR
> # Plots TPR and FPR against the threshold
> plot(tpr, lwd=2, col="blue", ylab="TPR and FPR", xlab="Threshold")
> plot(fpr, lwd=2, col="orange", add=TRUE)
> legend("topright", bty="n", lwd=2, col=c("blue","orange"), legend=c("TPR","FPR"))

63 Classification ability of explanatory variables [Figure: TPR and FPR plotted against the classification threshold.]

64 Classification ability of explanatory variables

R script
> # Finds the minimal threshold for which TPR >= 0.95
> choice = min(which(tpr@"y.values"[[1]] >= 0.95))
> threshold = tpr@"x.values"[[1]][choice]
> threshold
> tpr@"y.values"[[1]][choice]   # Corresponding TPR
[1]
> fpr@"y.values"[[1]][choice]   # Corresponding FPR
[1]

65 Classification ability of explanatory variables

The performance of a logistic classification rule depends on:
the relevance of the x's to predict the response Y = ±1;
the choice of the threshold on π(x), above which the prediction is Ŷ = +1.
In the Bayes classification rule, this threshold is 0.5:
a lower threshold leads to larger TPR and FPR;
a larger threshold leads to lower TPR and FPR.
ROC curve: compromises between sensitivity (True Positive Rate) and specificity (True Negative Rate = 1 - False Positive Rate).

66 Classification ability of explanatory variables

R script
> # Derive performance criteria
> perf = performance(pred, measure="tpr", x.measure="fpr")
> plot(perf, lwd=2, col="blue")

67 Classification ability of explanatory variables [Figure: ROC curve, true positive rate against false positive rate.]

68 Classification ability of explanatory variables

ROC curve:
starts from (0,0) for threshold = 1 (all predictions are -1);
ends at (1,1) for threshold = 0 (all predictions are +1);
measures the prediction ability by comparison with two reference ROC curves: the ideal classifier ROC curve, reaching (0,1), and the worst classifier ROC curve, going along the line y = x.
Area Under the ROC Curve (AUC):
measures the prediction ability of the x's;
AUC = 1 for the ideal classifier;
AUC = 0.5 for the worst classifier.

69 Classification ability of explanatory variables
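A minimal sketch of how the AUC reported on this slide can be obtained from the ROCR prediction object pred defined on slide 62 (an assumption about the original code, which is not shown here):

R script
> auc = performance(pred, measure="auc")   # AUC of the fitted classification rule
> auc@"y.values"[[1]]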

70 Classification ability of explanatory variables

R script
> # Finds the threshold whose (FPR, TPR) point is closest to the ideal point (0,1)
> choice = which.min(fpr@"y.values"[[1]]^2 + (1-tpr@"y.values"[[1]])^2)
> threshold = tpr@"x.values"[[1]][choice]
> threshold
> tpr@"y.values"[[1]][choice]   # Corresponding TPR
[1]
> fpr@"y.values"[[1]][choice]   # Corresponding FPR
[1]

71 Outline 1 Interpreting model fit 2 Model comparison 3 Subset selection 4 Prediction 5 Cross-validation

72 Assessment of a classification rule

Remark: in the previous prediction performance criteria, $Y_i$ is explicitly used to derive $\hat Y_i$... a major deviation from the real prediction conditions.
Recommendation: the classification rule, fitted on a learning sample, has to be applied to a completely separate test sample.

Definition. An assessment procedure involving a test sample, completely separated from the learning sample, is referred to as external cross-validation; a sketch of a single external split is given below. If n is moderate, an internal K-fold CV procedure shall be preferred.
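A minimal sketch of such an external assessment on the maturity data, assuming (for illustration only) that 100 of the 307 apricots are held out as the test sample:

R script
> set.seed(1)
> test.index = sample(1:307, size=100)         # Illustrative hold-out of 100 items
> learn = dta.12[-test.index,-7]               # Learning sample (Variety excluded)
> test = dta.12[test.index,-7]                 # Test sample, never used for fitting
> external.logit = glm(Maturity~., family=binomial, data=learn)
> pi.test = predict(external.logit, newdata=test, type="response")
> predicted = ifelse(pi.test>=0.5, "2", "1")   # Bayes rule applied on the test sample
> table(test$Maturity, predicted, dnn=list("Obs.","Pred."))   # Test confusion matrix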

73 Assessment of a classification rule [Diagram: a sample of n individuals.]

74 Assessment of a classification rule [Diagram: the sample of n individuals is split into K balanced subsamples.]

75 Assessment of a classification rule [Diagram: one subsample is set aside; the learning sample is used to estimate the model.]

76 Assessment of a classification rule [Diagram: the testing sample is used to calculate prediction errors, the learning sample to estimate the model.]

77 Assessment of a classification rule [Diagram: the testing sample used to calculate prediction errors moves to the next subsample.]

78 Assessment of a classification rule [Diagram: the testing sample moves to yet another subsample.]

79 Assessment of a classification rule [Diagram: each of the K subsamples is used once as the testing sample.]

80 Assessment of a classification rule

The choice of K can affect the result of the CV procedure:
The CV procedure involves fitting the model K times: if fitting the model is computationally time-consuming, the values K = 3 and K = 10 are often chosen.
When n is small, K = n may be recommended. The resulting CV procedure is named leave-one-out cross-validation.

81 Assessment of a classification rule

R script (function cvsegments from the pls package)
> # Step 1: segmentation of the dataset in 10 subsamples
> subsamples = cvsegments(N=307, k=10)
> # List whose 10 components contain the indices of items in each of the 10 segments
> unlist(lapply(subsamples, length))   # Subsample sizes
 V1   V2   V3   V4   V5   V6   V7   V8   V9   V10

82 Assessment of a classification rule

R script
> # Step 2: cycling over the segments
> cvpredicted = rep("0", 307)   # Will contain the cross-validated predictions
> nbselected = rep(0, 10)       # Will contain the number of selected variables
> for (k in 1:10) {             # Cycling over the 10 segments
+   learn = dta.12[-subsamples[[k]],-7]   # The kth segment is excluded
+   test = dta.12[subsamples[[k]],-7]     # The test sample is just the kth segment
+   maturity.select = bestglm(Xy=learn, family=binomial, method="exhaustive")
+   resselect = maturity.select$Subsets
+   selected = unlist(resselect[which.min(resselect[,8]),2:6])   # selected is a vector of boolean values
+   nbselected[k] = sum(selected)
+   # Fits the selected model
+   maturity.logit = glm(Maturity~., family=binomial, data=learn[,c(selected,TRUE)])
+   pi = predict(maturity.logit, newdata=test[,c(selected,TRUE)], type="response")
+   cvpredicted[subsamples[[k]]] = ifelse(pi>=threshold, "2", "1")
+ }
> nbselected
[1]

83 Assessment of a classification rule

R script
> # Step 3: cross-validated performance criteria
> # The confusion matrix is first created
> confusion = table(dta.12$Maturity, cvpredicted, dnn=list("Obs.","Pred."))
> confusion
      Pred.
Obs.    1    2
   1
   2

84 Assessment of a classification rule

R script
> # The performance criteria are deduced
> perf = rep(0,5)   # Creates a 5-vector with only 0 entries
> names(perf) = c("Nb. pos","Sens.","Spec.","PPV","NPV")
> colmargins = colSums(confusion)   # Column totals
> rowmargins = rowSums(confusion)   # Row totals
> perf[1] = colmargins[2]   # Number of positives
> perf[2] = confusion[2,2]/rowmargins[2]   # Sensitivity
> perf[3] = confusion[1,1]/rowmargins[1]   # Specificity
> perf[4] = confusion[2,2]/colmargins[2]   # PPV
> perf[5] = confusion[1,1]/colmargins[1]   # NPV
> perf
Nb. pos   Sens.   Spec.   PPV   NPV

85 Modeling default in cheese production

Exercise. A dairy food industry wishes to build an objective classification rule to detect major defaults in cheeses. For that purpose, it collects daily data on the proportions of defective cheeses and the corresponding food process conditions, characterized by 2 sanitary variables (San1 and San2) and 3 milk quality variables (From1, From2, From3). Data are provided in file cheese.txt.
Propose and fit an appropriate model for the above issue using the possible explanatory variables San1, San2, From1, From2, From3, San1², From1², From3² and From1 × From3.
Suppose we want the classification rule to be able to detect 90% of the defective cheeses. Correspondingly, which proportion of false positives should we expect?


More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

MASM22/FMSN30: Linear and Logistic Regression, 7.5 hp FMSN40:... with Data Gathering, 9 hp

MASM22/FMSN30: Linear and Logistic Regression, 7.5 hp FMSN40:... with Data Gathering, 9 hp Selection criteria Example Methods MASM22/FMSN30: Linear and Logistic Regression, 7.5 hp FMSN40:... with Data Gathering, 9 hp Lecture 5, spring 2018 Model selection tools Mathematical Statistics / Centre

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Methods@Manchester Summer School Manchester University July 2 6, 2018 Generalized Linear Models: a generic approach to statistical modelling www.research-training.net/manchester2018

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Bootstrap, Jackknife and other resampling methods

Bootstrap, Jackknife and other resampling methods Bootstrap, Jackknife and other resampling methods Part VI: Cross-validation Rozenn Dahyot Room 128, Department of Statistics Trinity College Dublin, Ireland dahyot@mee.tcd.ie 2005 R. Dahyot (TCD) 453 Modern

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1 Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

A brief introduction to mixed models

A brief introduction to mixed models A brief introduction to mixed models University of Gothenburg Gothenburg April 6, 2017 Outline An introduction to mixed models based on a few examples: Definition of standard mixed models. Parameter estimation.

More information