Present Practice, Issues and Headaches


Slide 2: Present Practice, Issues and Headaches.
Classification is the data mining area par excellence. We will focus on binary targets of events/non-events. Research and applications include:
- Clinical data analysis (disease / no disease, with its insistence on odds ratios and logistic regression).
- Direct marketing: response / no response, attrition, etc.
- Recommender systems: interesting / uninteresting.
- Fraud, terrorism, ...
Issues and headaches (we can't cover them all in this lecture):
- Binary target training/estimation: the mixture of events and non-events.
- Obfuscating terminology.
- Model comparisons.
- Modeling methodologies confront unexpected issues: collinearity and separation in logistic regression (and neural networks?), smooth rather than step response functions in trees, etc.

Slide 3: Meaning of Probability Statements: Context Dependent.
Probability is measured on the interval (0, 1) and the methods are mechanical; the context is provided by the practitioner/analyst. E.g.:
1. A model estimates that a household has a 70% probability of responding to a credit card solicitation. The solicitation cost is minimal and the bad feeling of a non-responding customer is disregarded. Likely action: solicit.
2. A model estimates that the probability that the conference ceiling will fall on us right now is 40%. How many of you will stay until I finish reading this paragraph? Action: run for your life?
3. DNA matching asserts that the probability that male A is the father of the baby is 95%, i.e., 1 in 20 is a false positive. Action: A is the father?
The cost (profit) of implementing or not implementing the decision, even if not exactly quantifiable, is the most important element of the context.

Slide 4: Present Practice, Issues and Headaches.
There is advice on the 0/1 mixture of the target variable for estimation, but doubts persist. The terminology of ROC, precision, model choice, etc. is obfuscating. The concepts used to compare classification models are derived from different methods (trees, neural networks, etc.). Practices on separation, 0/1 balance, collinearity and variable selection are unclear. Collinearity is likely more of a bête noire than in the linear regression case, and it adds doubts about the stability of the predicted probabilities when scoring future databases. In short: all models produce predictions in probability form, and a decision has to be made. In the next pages: events = 1, non-events = 0.

Slide 5: In Short, Practical Issues.
Model comparison and selection based on some criterion. Pitfall: the null model should be an important part of the game, but it is often disregarded because we get paid to find something, not to find nothing. Cutoff selection for decision making. In both cases, the cost of a wrong decision can affect model and cutoff selection. Further, most applications focus on events that are a tiny minority of the database, yet in many cases both sides matter: public opinion, success/failure of a negotiation, etc. A further refinement (not developed here): soliciting a true responder is not necessarily profitable, because the responder may be a bad customer. Targeting can thus be refined to response / no response crossed with profitable / not profitable; actually three levels, since nobody cares about the unprofitable non-responder.

Slide 6: Not recommended. (Figure.)

Slide 7: Model Evaluation and Comparison.
- Model chi-square.
- Accuracy, percent correct predictions, ROCs, etc.
- Pseudo-R².
- Hosmer-Lemeshow.
Model chi-square (LRT): LRT = [−2 log likelihood(model 2)] − [−2 log likelihood(model 1)], typically comparing against the null model (model 2). It cannot indicate whether the model is useful, just that it is better than chance prediction.
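
As a concrete illustration, here is a minimal sketch (not from the lecture; data are simulated) of the LRT of a fitted logistic model against the intercept-only null, using statsmodels:

```python
# Minimal sketch: likelihood-ratio test of a logistic model vs. the null
# (intercept-only) model. Data and variable names are hypothetical.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                       # three fake predictors
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.25 * X[:, 1] - 1.0))))

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
lrt = -2 * fit.llnull - (-2 * fit.llf)               # [-2logL(null)] - [-2logL(model)]
p_value = chi2.sf(lrt, fit.df_model)                 # df = number of slopes
print(f"LRT = {lrt:.2f}, p = {p_value:.4g}")
```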

Slide 8: Accuracy of Predictions (a big area).
Predictions are usually classified as events ("1") whenever the posterior probability exceeds the cutoff, else as non-events ("0"). Demsar (2006) shows that most algorithms are compared based on accuracy.
Classification rate: the proportion of events predicted as events (similarly for non-events). (Also called accuracy: the overall classification rate of events and non-events.)
Precision rate: the proportion of predicted events that are true events (similarly for non-events).

Slide 9: Classification (Confusion) Table.

              Predicted 0   Predicted 1   Total
   Real 0     A (TN)        B (FP)        A + B (Neg)
   Real 1     C (FN)        D (TP)        C + D (Pos)
   Total      A + C         B + D         A + B + C + D

Classification (accuracy) rate = 100 (A + D) / grand total.
Sensitivity = event classification (recall, hit) rate: TPR = 100 D / (C + D).
Specificity = non-event classification rate: TNR = 100 A / (A + B).
1 − Specificity = non-event misclassification (false alarm) rate: FPR = 100 B / (A + B).
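
A minimal sketch computing these rates from hypothetical A/B/C/D cell counts:

```python
# Hypothetical confusion-table cells, following the slide's A/B/C/D labels.
A, B = 850, 50        # real 0: TN, FP
C, D = 40, 60         # real 1: FN, TP

total       = A + B + C + D
accuracy    = 100 * (A + D) / total     # overall classification rate
sensitivity = 100 * D / (C + D)         # TPR = event recall (hit) rate
specificity = 100 * A / (A + B)         # TNR
fpr         = 100 * B / (A + B)         # false alarm rate = 100 - specificity
print(accuracy, sensitivity, specificity, fpr)   # 91.0 60.0 94.44... 5.55...
```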

Slide 10: Graphical Appreciation: Predicted Events vs. Events. (Figure.)
Precision = |Predicted_Events ∩ Events| / |Predicted_Events|.
Recall = |Predicted_Events ∩ Events| / |Events|.
F-measure = (β² + 1) · Precision · Recall / (β² · Precision + Recall).

Slide 11: More Terminology, Just Because.
Conditional probabilities:
TPR = P(pred pos | positive, i.e., event) = recall.
TNR = P(pred neg | negative, i.e., non-event).
FPR = P(pred pos | negative) = 1 − TNR.
FNR = P(pred neg | positive).
PPR = positive precision rate = purity = P(positive | pred pos) = TP / (TP + FP).
NPR = negative precision rate = P(negative | pred neg) = TN / (TN + FN).
Unconditional probabilities:
Prevalence = risk = P(positive); used mostly in clinical studies.
The F-measure is evenly balanced when β = 1; with the formula above it favors recall when β > 1 and precision otherwise. Used in text classification, information retrieval and language processing.
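
A minimal sketch of the F-measure from slide 10, with hypothetical precision and recall values:

```python
# F_beta = (beta^2 + 1) * precision * recall / (beta^2 * precision + recall),
# as defined on slide 10. Values below are hypothetical.
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

p, r = 0.6, 0.9
print(f_measure(p, r, beta=1.0))   # 0.72, the balanced harmonic mean
print(f_measure(p, r, beta=2.0))   # ~0.82, pulled toward recall
print(f_measure(p, r, beta=0.5))   # ~0.64, pulled toward precision
```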

Slide 12: Classification Table as Goodness-of-Fit?
1. A good model may classify poorly. Hosmer and Lemeshow (2000, p. 157) show a model whose misclassification rate depends on the slope, not on model fit.
2. Classification is done by choosing a cutoff point on the posterior probability. It is well known that classification favors the majority group, which is independent of model fit. Thus, if P1 = .49 and P2 = .52 and the cutoff is .5, the observations are classified into different categories even though the probabilities are very close. This assumes a known and unchanging natural class distribution and that the cost of an FP error equals that of an FN error. It typically favors the majority class; but in most applications, the cost of misclassifying a "1" is higher.

Slide 13: Accuracy Can Mislead.
Example 1: assume "1" is the important class. Overall accuracy of the left model = 92.5%, overall accuracy of the right model = 97.5%, yet the right model misses all the 1's. (Confusion tables.)
Example 2: 80% accuracy in two models. If the test data set contains more 0's, the right model is better; if more 1's, the left model. (Confusion tables.)

Slide 14: Classification Table as Goodness-of-Fit? (cont.)
3. Models (probability distributions) estimated from different samples cannot be compared based on these tables, because the predicted probabilities are confounded by the distribution of probabilities in the original samples. These tables are useful only when classification is the main goal.
4. ROC curve: the cutoff probability is varied, and sensitivity (Y, TPR) is plotted against 1 − specificity (X, FPR). The area under the curve (AUROC) gives the percentage of event / non-event pairs in which the predicted probability of the event exceeds that of the non-event (the same as the Mann-Whitney U statistic, or the Wilcoxon rank test). Also related to the Gini coefficient: Gini = 2·AUROC − 1.
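
To make the pairwise reading of AUROC concrete, here is a minimal sketch (scores are hypothetical, not from the lecture) that counts concordant event / non-event pairs and recovers the Gini coefficient:

```python
# AUROC as the proportion of (event, non-event) pairs where the event gets
# the higher predicted probability; ties count half, as in Mann-Whitney U.
import numpy as np
from itertools import product

p_event     = np.array([0.9, 0.8, 0.6, 0.55])   # scores of events (y = 1)
p_non_event = np.array([0.7, 0.5, 0.4])          # scores of non-events (y = 0)

wins = sum((e > n) + 0.5 * (e == n) for e, n in product(p_event, p_non_event))
auroc = wins / (len(p_event) * len(p_non_event))
print(auroc)               # 10 of 12 pairs concordant -> 0.8333
print(2 * auroc - 1)       # Gini = 2 * AUROC - 1
```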

Slide 15: For a given FPR (the probability of a non-event being predicted as an event, e.g., .5), the ROC shows the corresponding TPR (the probability of predicting an event as an event, e.g., .81). The numbers along the curve: accuracy, cutoff, precision. (Figure.)

Slide 16: ROC: What Does It Mean?
For a randomly chosen responder/sick patient/attriter and another randomly chosen non-responder/healthy patient/non-attriter, AUROC measures the probability of identifying the event by way of the model alone. Thus, with no model and balanced 0/1, AUROC = 50%.
Direct marketing application: suppose a mailing database of 10,000 candidates. Expect a 10% response rate: if we mail everybody, we expect 1,000 responders. Assume a budget constraint that allows mailing just 3,500:
FPR · 9,000 + TPR · 1,000 = 3,500, or TPR = 3.5 − 9 · FPR.
From the ROC graph, locate the pair (FPR, TPR) that satisfies this equation, derive the cutoff point, and contact those above the cutoff (elaborated later on).
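
A minimal sketch of the budget computation under hypothetical simulated scores: mailing the top 3,500 of 10,000 fixes the cutoff, and the resulting (FPR, TPR) pair satisfies the budget line above:

```python
# Find the cutoff that exhausts a 3,500-piece mailing budget out of 10,000
# candidates (9,000 expected non-responders, 1,000 responders). Scores are
# simulated stand-ins for a model's posterior probabilities.
import numpy as np

rng = np.random.default_rng(1)
n0, n1 = 9_000, 1_000
scores = np.concatenate([rng.beta(2, 5, n0), rng.beta(5, 2, n1)])
y = np.concatenate([np.zeros(n0), np.ones(n1)])

cutoff = np.quantile(scores, 1 - 3_500 / len(scores))   # mail the top 3,500
mailed = scores >= cutoff
fpr, tpr = mailed[y == 0].mean(), mailed[y == 1].mean()
print(f"cutoff ~ {cutoff:.3f}, FPR = {fpr:.3f}, TPR = {tpr:.3f}")
print(f"budget check: FPR*9000 + TPR*1000 = {fpr * n0 + tpr * n1:.0f}")  # ~3,500
```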

Slide 17: ROC vs. Accuracy.
Accuracy is directly related to the error rate; the ROC to the ordering of the probability ranks. In general the ROC is a better measure of model performance than accuracy, but NOT ALWAYS. Ling and Zhang (2003) prove that AUC is statistically consistent with, and more discriminating than, accuracy. AUC and accuracy are statistically consistent to degree C if, whenever AUC indicates that model 1 is better than model 2, there is probability C that accuracy will agree. AUC is D times more discriminating than accuracy if it is D times more likely that AUC can differentiate between models 1 and 2.

Slide 18: Inference on ROCs.
Hanley and McNeil (1982) give a conservative SE of the ROC curve:
SE = sqrt{ [ AUROC·(1 − AUROC) + (n1 − 1)·(Q1 − AUROC²) + (n0 − 1)·(Q2 − AUROC²) ] / (n0 · n1) },
where Q1 = AUROC / (2 − AUROC), Q2 = 2·AUROC² / (1 + AUROC), and "1" denotes the event.
Next slide: an example of an (over-fitted) model with a seemingly grandiose ROC. Note that accuracy is above 97% initially and then declines to around 49%; note also the precision decline. Don't blame collinearity or any other bête noire for this.
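
A minimal sketch of the Hanley-McNeil standard error as given above (the AUROC and class counts are hypothetical):

```python
# Hanley-McNeil (1982) conservative SE for an estimated AUROC.
import math

def hanley_mcneil_se(auroc: float, n1: int, n0: int) -> float:
    """n1 = number of events, n0 = number of non-events."""
    q1 = auroc / (2 - auroc)
    q2 = 2 * auroc ** 2 / (1 + auroc)
    var = (auroc * (1 - auroc)
           + (n1 - 1) * (q1 - auroc ** 2)
           + (n0 - 1) * (q2 - auroc ** 2)) / (n0 * n1)
    return math.sqrt(var)

print(hanley_mcneil_se(0.80, n1=100, n0=900))   # ~0.027
```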

Slide 19: Don't blindly choose this model. (Figure.)

Slide 20: ROC as Model Selector.
The area under the ROC (AUROC) is used as the parameter for judging the better model (larger is better), although it is often more practical to estimate AUROC over a limited range of FPR. AUROC can be estimated non-parametrically via the trapezoidal rule:
AUC ≈ Σ_{i=2..m} ΔFPR_i · [ TPR(θ_i) + TPR(θ_{i−1}) ] / 2,
where ΔTPR_i = TPR(θ_i) − TPR(θ_{i−1}) and ΔFPR_i = FPR(θ_i) − FPR(θ_{i−1}).
AUROC is more easily estimated by (S − n1·(n1 + 1)/2) / (n0 · n1) (Hand & Till, 2001), where S = Σ r_{i,1} and r_{i,1} is the rank of the posterior probability of event observation i when the probabilities are sorted in increasing order; n1 is the number of events, n0 the number of non-events.
REMEMBER: AUROC is related to ranks, a non-parametric measure, and is not linked to R-squares.
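
A minimal sketch of the rank-based estimator quoted above (labels and scores are hypothetical; scipy's rankdata supplies ascending, tie-averaged ranks):

```python
# Rank-based AUROC: AUC = (S - n1*(n1 + 1)/2) / (n0 * n1), with S the rank
# sum of the events when all scores are sorted in increasing order.
import numpy as np
from scipy.stats import rankdata

y      = np.array([1, 1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.55, 0.7, 0.5, 0.4])

ranks = rankdata(scores)                 # ascending ranks, ties averaged
n1, n0 = (y == 1).sum(), (y == 0).sum()
S = ranks[y == 1].sum()
auc = (S - n1 * (n1 + 1) / 2) / (n0 * n1)
print(auc)                               # 0.8333, matching the pairwise count
```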

Slide 21: ROC as Model Selector.
0.5 ≤ AUROC ≤ 1. Model comparison via ROC: outer ROC curves indicate better models, but ROCs can cross. One can then form the convex hull of the ROC curves and, for a specific FPR, disregard models lying below the convex hull of the other models. Also, a perfect AUROC (= 1) does not imply perfect classification, only that the posterior probability of every event observation exceeds that of every non-event observation; i.e., the ROC measures ranking, and observations can still be misclassified depending on the cutoff. Note that the ROC graph can be discrete if the algorithm predicts class membership instead of a probability (e.g., trees, discrete classifiers).

Slide 22: ROC as Model Selector.
The point (10, 40) is more conservative than the liberal (90, 99): conservatives make positive classifications only with strong evidence, to avoid FPs (remember, fewer FPs are associated with higher cutoffs). Conversely, liberals use weaker evidence to catch many positives, but incur many FPs. Since there are more negatives than positives in the original data, performance on the left-hand side of the ROC graph is the more interesting. The 45° line is the random classifier, which guesses TPs as often as FPs. Points below the line perform worse than random; negating (reversing) such a rule produces a point above the 45° line. The closer to the 45° line, the worse the performance. How close is close? We need inference. Note the declining accuracy (the numbers above the curve) as FPR increases, and the steeper ROC slope on the left side compared to the right: positives are easier to find than negatives in that region.

Slide 23: ROC as Model Selector.
The ROC is unaffected by class skewness/imbalance or by the cost distribution: TPR and FPR are each computed within a single real class, so their denominators are invariant to class skewness. Accuracy, precision, lift and the F-score are affected, because all of them mix counts from both real classes of the classification matrix. The same algorithm applied to two test data sets with different class balances shows the same ROC curves, but the precision-recall curve changes. In the logistic regression case, separation may imply AUROC = 1, yet the Wald CIs are too wide and the model is unreliable.
(Reminder: sensitivity = event recall (hit) rate, TPR = 100 D / (C + D); specificity = non-event classification rate, TNR = 100 A / (A + B); 1 − specificity = false alarm rate, FPR = 100 B / (A + B).)

Slide 24: Classification Table as Goodness-of-Fit? (cont. 1)
5. To visualize the cutoff point of highest separation, plot sensitivity and specificity (Y) against the probability cutoff points (X). The intersection indicates maximum separation (the KS test). In the graph below the optimum is 12%, with no costs specified. (Figure.)
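
A minimal sketch of this cutoff rule under hypothetical beta-distributed scores: scan cutoffs and take the one maximizing the KS distance TPR − FPR:

```python
# Max-KS cutoff: where sensitivity and specificity curves cross, i.e.
# where TPR - FPR = sensitivity - (1 - specificity) is largest.
import numpy as np

rng = np.random.default_rng(2)
scores0 = rng.beta(2, 5, 900)          # non-events
scores1 = rng.beta(5, 2, 100)          # events

cutoffs = np.linspace(0, 1, 501)
tpr = np.array([(scores1 >= c).mean() for c in cutoffs])
fpr = np.array([(scores0 >= c).mean() for c in cutoffs])
ks = tpr - fpr
best = cutoffs[np.argmax(ks)]
print(f"max-KS cutoff ~ {best:.3f}, KS = {ks.max():.3f}")
```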


Slide 26: Classification Table as Goodness-of-Fit? (cont. 2)
The optimal cutoff point minimizes the expected cost of misclassification. Often the costs are not known. When they are, the items of interest are usually C(i, j), i ≠ j, the cost of predicting i when it should have been j: C(0, 1) = CFN and C(1, 0) = CFP. Let π be the original proportion of events in the population. Then the optimal threshold minimizes the cost
(1 − π) · FPR · C(1, 0) + π · (1 − TPR) · C(0, 1).
Derivation of the minimal cost and cutoff:
Cavg = Co + CTP·P(TP) + CTN·P(TN) + CFP·P(FP) + CFN·P(FN),
where Co is the fixed modeling/testing cost, P(TP) = π·TPR, TNR = 1 − FPR and FNR = 1 − TPR. Substituting around:
Cavg = [ TPR·π·(CTP − CFN) ] + [ FPR·(1 − π)·(CFP − CTN) ] + Co + [ CTN·(1 − π) + CFN·π ].

Slide 27: Derivation of the Minimal Cost and Cutoff.
Typically CTP = CTN = 0, so
Cavg = −TPR·π·CFN + FPR·(1 − π)·CFP + Co + CFN·π = Co + CFN·π·(1 − TPR) + CFP·(1 − π)·FPR.
To minimize the cost, take the first derivative and use the ROC, writing TPR = ROC(FPR):
Cavg = [ ROC(FPR)·π·(CTP − CFN) ] + [ FPR·(1 − π)·(CFP − CTN) ] + Co + [ CTN·(1 − π) + CFN·π ],
dCavg/dFPR = (dROC/dFPR)·π·(CTP − CFN) + (1 − π)·(CFP − CTN) = 0,
and rearranging we get:
dROC/dFPR = (1 − π)·(CFP − CTN) / [ π·(CFN − CTP) ].

Slide 28: Derivation of the Minimal Cost and Cutoff.
dROC/dFPR = (1 − π)·(CFP − CTN) / [ π·(CFN − CTP) ]; for unitary costs of misclassification, the point at which dCavg/dFPR = 0 is where dROC/dFPR = (1 − π)/π (the KS-test maximum distance).
NB: in the above discussion π is implicit, but it is not visible in the ROC curve. If CTN = CTP = 0, then for a given π a higher CFP relative to CFN shifts the optimal point to the left of the ROC, and a higher cutoff point is chosen (thus avoiding false positives more often). Vice versa for a higher relative CFN.
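
A minimal sketch of the resulting rule (hypothetical scores and costs, with CTP = CTN = 0 as above): evaluate the expected cost over a grid of cutoffs and take the minimizer:

```python
# Cost-minimizing cutoff: minimize (1 - pi)*FPR*C_FP + pi*(1 - TPR)*C_FN.
import numpy as np

rng = np.random.default_rng(3)
scores0 = rng.beta(2, 5, 900)           # non-events
scores1 = rng.beta(5, 2, 100)           # events
pi = len(scores1) / (len(scores0) + len(scores1))   # prior P(event)
C_FP, C_FN = 1.0, 5.0                   # missing an event costs 5x a false alarm

cutoffs = np.linspace(0, 1, 501)
fpr = np.array([(scores0 >= c).mean() for c in cutoffs])
tpr = np.array([(scores1 >= c).mean() for c in cutoffs])
cost = (1 - pi) * fpr * C_FP + pi * (1 - tpr) * C_FN
best = cutoffs[np.argmin(cost)]
print(f"optimal cutoff ~ {best:.3f}, expected cost = {cost.min():.4f}")
```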

Slide 29: Choosing the cutoff via maximum KS. (Figure.)

Slide 30: Five priors, four optimal cutoffs and minimal costs (two overlap), FN = FP. (Figure.)

Slide 31: Cost of FN / FP = … (Figure.)

Slide 32: Cost of FN / FP = … (Figure.)

Slide 33: Cost of FN / FP = … (Figure.)

Slide 35: Positive precision rates depend on the priors. (Figure.)

Slide 36: Precision-Recall Curves.
There are many situations in which priors and costs are unknown. It could also be that priors and costs actually vary, either in time or across subpopulations (what, you thought we'd make it easy on you?). E.g., the prevalence of certain diseases differs across races or ethnic groups. In these cases the ROC and its threshold cutoff are not so reliable, because the ROC does not reflect changes in the priors even when we want it to (remember, the denominators of TPR and FPR do not change when priors or costs vary).
Aside: in clinical work one standard is to study the ROC to the left of FPR = .05, the so-called partial ROC. Cai and Dodd (web) studied PSA at 3 years and 6 months before the onset of prostate cancer: for FPR ≤ 2%, TPR ≈ 30% at T = 3 years and ≈ 57% at T = 6 months, so PSA does not provide enough information.

Slide 37: Precision-recall curve. (Figure.)

Slide 38: Cutoff set by the equality between the precision curve and TPR. (Figure.)

Slide 40: Cutoff set by maximum cumulative profit. (Figure.)

Slide 41: Cutoff set by maximum cumulative profit. (Figure.)

Slide 42: Can We Link All This Stuff Together? Yes (Alvarez, 2002).
Let r = recall, p = precision, a = accuracy, π = prior probability. Then
π·r + (π + a − 1)·p = 2·π·p·r,
so that π = p·(1 − a) / (r + p − 2·p·r) (useful, e.g., when unsure of the original π).
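
A minimal sketch checking Alvarez's identity on a hypothetical confusion table, and recovering π from r, p and a:

```python
# Verify pi*r + (pi + a - 1)*p = 2*pi*p*r and pi = p*(1 - a)/(r + p - 2*p*r)
# on hypothetical TP/FP/FN/TN counts.
TP, FP, FN, TN = 60, 50, 40, 850

n  = TP + FP + FN + TN
r  = TP / (TP + FN)          # recall
p  = TP / (TP + FP)          # precision
a  = (TP + TN) / n           # accuracy
pi = (TP + FN) / n           # prior P(event)

lhs = pi * r + (pi + a - 1) * p
rhs = 2 * pi * p * r
print(abs(lhs - rhs) < 1e-12)                     # identity holds
print(p * (1 - a) / (r + p - 2 * p * r), pi)      # recovers the prior, 0.1
```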

Slide 44: Balanced / Unbalanced Target: Rare Events.
Typical situation: a binary dependent variable with far fewer 1's than 0's: fraud, extreme diseases, oil spills. Logistic regression, for instance, underestimates the probability of rare events in this case. There is also a tendency to create enormous databases just to contain the rares. With rare events in logistic regression, most observations yield small probabilities; but for Y = 1 the probabilities are larger, so π_i·(1 − π_i) is larger for observations with Y = 1, their variance (its inverse) is smaller, and additional 1's bring in more information than 0's (King and Zeng, 2001). Logistic covariance matrix:
V(β̂) = [ Σ_{i=1..n} π_i·(1 − π_i)·x_i·x_i′ ]^{−1}.

Slide 45: Balanced/Unbalanced Target: Rare Events.
In the case of a rare event, the Y = 1 density is very poorly estimated on its left tail relative to the Y = 0 density on its right tail, so the threshold on X used to classify Y sits too far to the right. (Figure.)

Slide 46: Balanced/Unbalanced Target (cont. 1).
1) Model estimation typically uses samples.
2) Estimation assumption: the training class distribution of the target matches the natural distribution.
3) But classifiers built from unbalanced samples usually perform poorly on the minority class; worse if it is costlier to misclassify minority cases.
4) Some algorithms cannot use cost information effectively.
5) Weiss and Provost (2001): replicating the natural distribution in the sample is not necessarily good practice for estimation.
Present practice ("1" = minority class, the class of interest):
1) Under-sample "0": throws out potentially useful data, with a danger of sample bias: do not select on X differently for Y = 0 and Y = 1. Example: Y = 1 are cancer patients and Y = 0 is a random sample from the U.S. population. But the Y = 1 patients sought health care (X: found a medical specialist, had the right tests, etc.), so the Y = 0 sample should be drawn from patients who sought treatment and had no cancer.

Slide 47: Balanced/Unbalanced Target (cont. 2).
2) Over-sample "1": increases training data size and estimation time, and typically makes exact copies of the 1's, so over-fitting is probable. Alternative: make imperfect copies (Auslender 2000, 2001). If the data are unbalanced and minority misclassification is more costly, minimize cost instead of error rate by factoring in these costs.
Weiss and Provost results (with classification trees, C4.5):
1) Rules predicting "1" have a higher error rate than those predicting "0", because:
2) Test-data 1's are misclassified more often than 0's, because:
a) the test data contain more 0's;
b) algorithms are sometimes strongly affected by the initial marginal priors;
c) algorithms cannot learn the boundaries of the "1" class with relatively few examples.

Slide 48: Balanced/Unbalanced Target (cont. 4): the Logistic Regression Case.
1) Correction according to the prior proportion of 1's: the βs of the predictors are consistent, but β0 needs a correction involving δ = the true proportion and α = the sample proportion of 1's:
β0 (corrected) = β̂0 − ln[ ((1 − δ)/δ) · (α/(1 − α)) ].
NOTE: not robust to model misspecification.
2) Weighting in the estimation by α (Y = 1) and 1 − α (Y = 0). Advantages: robust to misspecification. Disadvantages: 1) the usual method of computing standard errors is biased; 2) rare-event finite-sample corrections have not been developed for weighting (see King and Zeng, 2001 for the full discussion).
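
A minimal sketch of the prior correction of the intercept (all values hypothetical):

```python
# beta0_corrected = beta0_hat - ln[((1 - delta)/delta) * (alpha/(1 - alpha))],
# where delta is the true event proportion and alpha the sample proportion.
import math

def corrected_intercept(beta0_hat: float, delta: float, alpha: float) -> float:
    return beta0_hat - math.log(((1 - delta) / delta) * (alpha / (1 - alpha)))

# e.g. a model fit on a balanced 50/50 sample when the true event rate is 2%:
print(corrected_intercept(beta0_hat=-0.1, delta=0.02, alpha=0.50))  # ~ -3.99
```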

Slide 49: Balanced/Unbalanced Target (cont. 5).
Re-balancing trees (Auslender, 1998, never finished): create samples with 0/1 percentages of 45/55, 46/54, ..., 54/46, 55/45. Typically the upper set of levels is similar or identical across all samples (split values and variables). The lower layer typically contains similar variables being split, sometimes in a different hierarchical order: variable 1 splits at level 4 in the 45/55 sample and at level 5 in the 50/50 sample, while variable 2 behaves reciprocally. Conclusion: the top levels are the core of the tree, and the middle level still provides strong information; after that, the information is not reliable. A similar approach is possible for logistic regression in the context of collinearity.

Slide 50: Pseudo-R-Square.
The proportion of variation (?) explained by the model. McFadden's pseudo-R² statistic:
McFadden's R² = 1 − [ LL(model 1) / LL(model 2) ] = 1 − [ −2·LL(model 1) / −2·LL(model 2) ],
where model 2 is usually the null model (intercept only). R² is a scalar measure between 0 and (somewhat close to) 1. (Others: Nagelkerke, Efron's, McKelvey & Zavoina.)
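
A minimal sketch of McFadden's pseudo-R² on simulated data (statsmodels reports the same quantity as prsquared):

```python
# McFadden's pseudo-R^2 = 1 - LL(model)/LL(null), on hypothetical data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 1.0))))

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
mcfadden = 1 - fit.llf / fit.llnull
print(f"McFadden R^2 = {mcfadden:.3f}")
```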

Slide 51: Hosmer-Lemeshow Fit (2000, p. 148).
Assume 5 binary predictors in the model, so the maximum number of covariate patterns is 2^5 = 32; assume only J = 8 patterns exist. Pearson-based measures of fit are distributed asymptotically as chi-square with (J − p − 1) degrees of freedom. Problem: as n increases with J = n, p increases at the same rate as n, so the degrees of freedom are wrong. Proposal: create patterns by grouping on percentiles of the posterior predicted distribution. By simulation, if g = 10 percentiles (deciles) are chosen, the statistic is distributed as chi-square with (g − 2) degrees of freedom when J = n.
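
A minimal sketch of the decile-of-risk version of the test (the fitted probabilities are simulated stand-ins for a model's output):

```python
# Hosmer-Lemeshow test: group by deciles of predicted probability, compare
# observed vs. expected events, chi-square with g - 2 degrees of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
p_hat = rng.uniform(0.01, 0.99, 2000)   # stand-in for fitted probabilities
y = rng.binomial(1, p_hat)              # well-calibrated by construction

g = 10
groups = np.array_split(np.argsort(p_hat), g)   # deciles of predicted risk
hl = 0.0
for idx in groups:
    o, e, n = y[idx].sum(), p_hat[idx].sum(), len(idx)
    pbar = e / n
    hl += (o - e) ** 2 / (n * pbar * (1 - pbar))
print(f"HL = {hl:.2f}, p = {chi2.sf(hl, g - 2):.3f}")   # should not reject here
```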

Slide 53: Ponderings.
Most business applications do not provide cost/profit information, and decisions lack this vital input. It is even possible that the event profit so far exceeds the non-event cost that it is better to target the entire population, with the predicted probabilities merely indicating one ordering. Most applications do not focus on prediction but on classification, specifically TPR = recall = hit rate. But from the precision-recall curve, a high TPR can be associated with low precision: real-world prediction rates could be low even when classification rates are high. It is the real-world predictions that matter, not how well the model performed during classification. In the absence of real cost/profit information, don't wait for the ceiling to fall on you.
