Modeling with Rules. Cynthia Rudin, Assistant Professor of Statistics, Massachusetts Institute of Technology
1 Modeling with Rules. Cynthia Rudin, Assistant Professor of Statistics, Massachusetts Institute of Technology. Joint work with: David Madigan (Columbia), Allison Chang and Ben Letham (MIT PhD students), Dimitris Bertsimas (MIT), Tyler McCormick (UW), Gene Kogan (Independent)
2 Would like predictive models that are both accurate and interpretable. Accuracy = classification accuracy. Interpretability = ?
3 Would like predictive models that are both accurate and interpretable. Accuracy = classification accuracy. Interpretability = concise (the model is small) and convincing (there are reasons behind each prediction)
4 Modeling with Rules: Decision List. "Traffic jam in Boston?" A decision list with rules on fenway_park=1 (97/100 times), rush_hour, rain=0 & construction, Friday=1 (3/3 times), rain, and an "otherwise" default, each annotated with its empirical accuracy.
5 Modeling with Rules: a dichotomy in the state of the art, accuracy vs. interpretability. Decision Trees (interpretable) vs. Support Vector Machines and Boosted Decision Trees (accurate).
6 Modeling with Rules: Daydreaming. It would be nice if the whole algorithm were interpretable, OR: we want the accuracy of SVM/Boosted DT and the interpretability of Decision Trees.
7 Outline. Part 1: Humans can interpret the predictions, and understand the full algorithm. Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan), COLT 2011. Part 2: Bayesian hierarchical modeling with rules. A Hierarchical Model for Association Rule Mining of Sequential Events: An Approach to Automated Medical Symptom Prediction (McCormick, R, Madigan), Annals of Applied Statistics, forthcoming 2012. Part 3: Accurate rule classifiers using MIO. Ordered Rules for Classification: A Discrete Optimization Approach to Associative Classification (Bertsimas, Chang, R), in progress.
8 Association Rule Mining: (Agrawal, Imielinski, Swami, 1993) & (Agrawal and Srikant, 1994)
9 Example rule: Construction=1 & Rain=1 → Traffic=1
10 15 times we saw construction and rain, and 13 out of 15 of those times we also saw traffic. Supp(construction=1 & rain=1) = 15. Supp(traffic=1 & construction=1 & rain=1) = 13. Conf(construction=1 & rain=1 → traffic=1) = 13/15.
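The support and confidence computations on this slide can be sketched in a few lines of Python; the toy traffic data below is made up for illustration and is not the talk's data.

```python
# Toy transaction data (made up): each row is one observed day.
data = [
    {"construction": 1, "rain": 1, "traffic": 1},
    {"construction": 1, "rain": 1, "traffic": 1},
    {"construction": 1, "rain": 1, "traffic": 0},
    {"construction": 1, "rain": 0, "traffic": 0},
    {"construction": 0, "rain": 1, "traffic": 1},
]

def supp(data, conditions):
    """Supp: number of rows satisfying every feature=value condition."""
    return sum(all(row.get(f) == v for f, v in conditions.items()) for row in data)

def conf(data, lhs, rhs):
    """Conf(lhs -> rhs) = Supp(lhs & rhs) / Supp(lhs)."""
    return supp(data, {**lhs, **rhs}) / supp(data, lhs)

lhs = {"construction": 1, "rain": 1}
print(supp(data, lhs))                    # 3 rows have construction and rain
print(conf(data, lhs, {"traffic": 1}))    # 2 of those 3 also have traffic
```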
11 Max Confidence, Min Support Algorithm. Step 1: find all rules a → b where Supp(a) ≥ θ. Step 2: rank rules in descending order of Conf(a → b); recommend the right-hand side of the first rule that applies. Example ranking: the construction & rain rule (13/15 = .867), then rules for rush_hour=0 and Friday=1, then otherwise (.68).
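The two steps (filter by support, rank by confidence, recommend the first applicable right-hand side) can be sketched as follows; the rule set and its counts are hypothetical, not the slide's.

```python
# Each rule is (lhs, rhs): lhs is a tuple of (feature, value) conditions,
# the empty tuple being the "otherwise" rule. Counts are hypothetical.
rules = [
    ((("construction", 1), ("rain", 1)), "traffic"),
    ((("rush_hour", 0),), "no_traffic"),
    ((), "no_traffic"),  # otherwise
]
supp_ = {rules[0]: 15, rules[1]: 25, rules[2]: 50}
conf_ = {rules[0]: 13 / 15, rules[1]: 0.80, rules[2]: 0.68}

def rank_rules(rules, supp_, conf_, theta):
    """Step 1: keep rules with Supp(a) >= theta. Step 2: sort by Conf, descending."""
    return sorted((r for r in rules if supp_[r] >= theta), key=lambda r: -conf_[r])

def predict(ranked, obs):
    """Recommend the right-hand side of the first rule whose lhs holds."""
    for lhs, rhs in ranked:
        if all(obs.get(f) == v for f, v in lhs):
            return rhs

ranked = rank_rules(rules, supp_, conf_, theta=10)
print(predict(ranked, {"construction": 1, "rain": 1, "rush_hour": 1}))  # traffic
```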
12 15 times we saw construction and rain, and 13 out of 15 of those times we also saw traffic. Supp(construction=1 & rain=1) = 15. Supp(traffic=1 & construction=1 & rain=1) = 13. Conf(construction=1 & rain=1 → traffic=1) = 13/15. But which is better: Conf=.99 with Supp=10000, or Conf=1 with Supp=10?
13 AdjustedConf(a → b) := Supp(a & b) / (Supp(a) + K), a Bayesian version of the confidence.
14 Adjusted Confidence Algorithm. Step 1: find all rules a → b. Step 2: rank rules in descending order of AdjustedConf(a → b); recommend the right-hand side of the first rule that applies. Example rankings with K = 5: rush_hour rule …/(25+5); …/(15+5) = .65; otherwise …/(50+5) = .62; Friday=1 …/(17+5) = .55.
15 AdjustedConf(a → b) := Supp(a & b) / (Supp(a) + K). Rare rules can be used. Among rules with similar confidence, it prefers rules with higher support. K encourages larger support, which helps with prediction. (Recall: Conf=.99, Support=10000 vs. Conf=1, Support=10.)
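The Conf=.99/Supp=10000 vs. Conf=1/Supp=10 comparison works out as follows; this is a direct sketch of the adjusted-confidence formula, with the penalty K = 5 chosen arbitrarily for illustration.

```python
def adjusted_conf(supp_ab, supp_a, K):
    # AdjustedConf(a -> b) = Supp(a & b) / (Supp(a) + K)
    return supp_ab / (supp_a + K)

K = 5
big  = adjusted_conf(9900, 10000, K)  # Conf = .99, Supp = 10000 -> ~.9895
tiny = adjusted_conf(10, 10, K)       # Conf = 1,   Supp = 10    -> ~.6667

# Plain confidence ranks the tiny rule first (1 > .99); the adjusted
# confidence flips the order in favor of the well-supported rule.
print(big > tiny)  # True
```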
16 Humans can understand the prediction, and the algorithm. Good for sequential event problems, where a set of events happen in a particular order, e.g., for predicting what a customer will put next into an online shopping cart, or for predicting medical symptoms in a sequence. Having larger K helps with generalization: algorithmic stability (pointwise hypothesis stability) and other learning-theoretic implications. Performs better empirically than the Max-Conf, Min-Support classifiers in our experiments. A Learning Theory Framework for Association Rules and Sequential Events (R, Letham, Kogan, Madigan), SSRN 2011. Sequential Event Prediction with Association Rules (R, Letham, Aouissi, Kogan, Madigan), COLT 2011.
18 Recommender Systems for Medical Conditions. Input medical condition → prediction based on your medical history:
19 Recommender Systems for Medical Conditions: example input conditions (dyspepsia & epigastric pain, gastroesophageal reflux, high blood pressure) with predictions based on medical history (depression, heartburn, high blood pressure).
20 Medical Condition Prediction. A patient's condition sequence over time: heartburn, headache, dyspepsia, fungal infection, heartburn, epigastric pain, hypertension, dyspepsia. Recommendations at successive time points: (1. rhinitis, 2. dyspepsia, 3. low back pain), then (1. dyspepsia, 2. high blood pressure, 3. low back pain), then (1. epigastric pain, 2. heartburn, 3. high blood pressure).
21 Hierarchical Association Rule Model (HARM)
23 Hierarchical Association Rule Model (HARM). i is the patient index, r the rule index of lhs_r → rhs_r. y_ir := Supp_i(rhs_r & lhs_r); n_ir := Supp_i(lhs_r). We'll model y_ir ~ Binomial(n_ir, p_ir), with parameters shared across individuals.
25 Hierarchical Association Rule Model (HARM). i is the patient index, r the rule index of lhs_r → rhs_r. y_ir := Supp_i(rhs_r & lhs_r); n_ir := Supp_i(lhs_r). We'll model y_ir ~ Binomial(n_ir, p_ir), p_ir ~ Beta(π_ir, τ_i). Under this model, E(p_ir | y_ir, n_ir) = (y_ir + π_ir) / (n_ir + π_ir + τ_i).
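The posterior mean above is a shrinkage estimator: with no data it falls back to the prior mean π_ir / (π_ir + τ_i), and with many observations it approaches the raw confidence y_ir / n_ir. A small numeric check (the parameter values are arbitrary):

```python
def posterior_mean(y, n, pi, tau):
    # E(p_ir | y_ir, n_ir) = (y_ir + pi_ir) / (n_ir + pi_ir + tau_i)
    return (y + pi) / (n + pi + tau)

pi, tau = 2.0, 2.0
print(posterior_mean(0, 0, pi, tau))       # no data: prior mean pi/(pi+tau) = 0.5
print(posterior_mean(900, 1000, pi, tau))  # lots of data: close to y/n = 0.9
```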
26 Hierarchical Association Rule Model (HARM). y_ir ~ Binomial(n_ir, p_ir); p_ir ~ Beta(π_ir, τ_i); π_ir = exp(m'_i β_r + γ_i).
29 Hierarchical Association Rule Model (HARM). M is an I × D matrix of observable patient characteristics, and π_ir = exp(m'_i β_r + γ_i). Example: π_ir = exp(β_r,0 + β_r,1 · 1_male + γ_i) = exp(β_r,1 · 1_male) · exp(β_r,0 + γ_i).
31 Hierarchical Association Rule Model (HARM). y_ir ~ Binomial(n_ir, p_ir); p_ir ~ Beta(π_ir, τ_i); π_ir = exp(m'_i β_r + γ_i); log(τ_i) ~ Normal(0, σ_τ²); log(β_rd) ~ Normal(μ_β, σ_β²); log(γ_i) ~ Normal(μ_γ, σ_γ²); diffuse uniform priors on μ_β, σ_β², σ_τ². HARM estimates the posterior distribution (MCMC), then ranks rules by posterior mean.
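As a sanity check on the hierarchy, one can draw from these priors. The sketch below uses only the standard library, with placeholder hyperparameter values (not the paper's), and does not attempt the MCMC posterior fit.

```python
import math
import random

def sample_prior(m_i, d, sigma_tau=1.0, mu_beta=0.0, sigma_beta=0.5,
                 mu_gamma=0.0, sigma_gamma=0.5, seed=0):
    """Draw (pi_ir, tau_i) for one patient/rule pair from the HARM priors:
    log(tau_i) ~ N(0, sigma_tau^2), log(beta_rd) ~ N(mu_beta, sigma_beta^2),
    log(gamma_i) ~ N(mu_gamma, sigma_gamma^2), pi_ir = exp(m_i' beta_r + gamma_i).
    All hyperparameter defaults are placeholders, not fitted values."""
    rng = random.Random(seed)
    tau = math.exp(rng.gauss(0.0, sigma_tau))
    beta = [math.exp(rng.gauss(mu_beta, sigma_beta)) for _ in range(d)]
    gamma = math.exp(rng.gauss(mu_gamma, sigma_gamma))
    pi = math.exp(sum(m * b for m, b in zip(m_i, beta)) + gamma)
    return pi, tau

pi, tau = sample_prior(m_i=[1.0, 0.0], d=2)  # e.g. intercept + male indicator
print(pi > 0 and tau > 0)  # log-normal components keep both positive
```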
32 Hierarchical Association Rule Model (HARM). Data: 43,000 patient encounters, ~2,300 patients (age > 40); pre-existing conditions dealt with separately; used the 25 most common conditions and the 25 least common conditions.
34 For trials = 1:500, form training and test sets: sample ~200 patients and, for each patient, randomly split encounters into training and test. For each patient, iteratively make predictions on test encounters; get 1 point whenever our top 3 recommendations contain the patient's next condition.
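The per-patient scoring rule can be sketched as follows; the stand-in recommender and condition names are hypothetical, not the HARM model itself.

```python
def top3_score(recommend, test_sequence):
    """One point each time the patient's next condition is in our top 3.
    `recommend(history)` returns the current recommendation list."""
    score, history = 0, []
    for nxt in test_sequence:
        if nxt in recommend(history)[:3]:
            score += 1
        history.append(nxt)  # the observed condition joins the history
    return score

# Trivial stand-in model that ignores the history entirely.
model = lambda history: ["dyspepsia", "heartburn", "high blood pressure"]
print(top3_score(model, ["heartburn", "rhinitis", "dyspepsia"]))  # 2
```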
35 (a) All patients: proportion of correct predictions for HARM vs. confidence, adjusted confidence with K = .25, .5, 1, 2, and support-threshold baselines (results figure).
36 Myocardial infarction in patients with hypertension, in treatment (T) and placebo (P) groups, by age group up to over 70: HARM vs. confidence vs. rescaled risk (figure). Key: mean of posterior means, middle 90%, middle half.
37 Myocardial infarction in patients with high cholesterol, in treatment (T) and placebo (P) groups, by age group up to over 70: HARM vs. confidence vs. rescaled risk (figure). Key: mean of posterior means, middle 90%, middle half.
42 Mixed Integer Optimization. MIO/MIP is a style of mathematical programming. Not generally used for ML: a perception from the 1970s that MIOs are intractable. Not all valid MIO formulations are equally strong. Can use LP relaxations for very large scale problems. Association rules historically plagued by combinatorial explosion...
43 Ordered Rules for Classification. Minimize misclassification error; regularize by the height of the highest null rule. Null rules: the higher one predicts the default class and ends the list.
46 MIO Learning Algorithm. Maximize classification accuracy; maximize the rank of the highest null rule (regularization).
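Conceptually, the model the MIO searches over is an ordered rule list terminated by a null rule; evaluating such a list, once learned, is simple. The rules below are hypothetical illustrations, not output of the ORC optimization.

```python
def classify(rule_list, default_class, x):
    """Walk an ordered rule list. A rule is (conditions, label); conditions
    is a tuple of (feature, value) pairs, or None for a null rule, which
    predicts the default class and ends the list."""
    for conds, label in rule_list:
        if conds is None:
            return default_class  # null rule reached: default class, stop
        if all(x.get(f) == v for f, v in conds):
            return label
    return default_class

rule_list = [
    ((("fever", 1), ("cough", 1)), "flu"),  # hypothetical rule
    (None, None),                           # null rule ends the list
]
print(classify(rule_list, "healthy", {"fever": 1, "cough": 1}))  # flu
print(classify(rule_list, "healthy", {"fever": 0, "cough": 1}))  # healthy
```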
47 Experiments. Five algorithms: Logistic Regression (LogReg), Support Vector Machines with RBF kernel (SVM), Classification and Regression Trees (CART), Boosted Decision Trees (AdaBoost), Ordered Rules for Classification (ORC). Several publicly available datasets (UCI); accuracy averaged over 3 folds.
48 Classification Accuracy
49 CART on Tic Tac Toe (decision tree figure, branching on whether individual cells contain x or o).
50 ORC on Tic Tac Toe: eight rules, one per winning line of three x's (rows, columns, diagonals), each predicting "x wins"; rule 9: otherwise, "x does not win". ORC accuracy = 1.
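The eight "x wins" rules correspond to the eight winning lines of the board; a minimal sketch (the board encoding is my own, not the slide's):

```python
# Cells indexed 0..8, row-major. The eight winning lines:
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def x_wins(board):
    """Ordered-rule view: any line of three x's fires an 'x wins' rule;
    otherwise the ninth rule says x does not win."""
    return any(all(board[i] == "x" for i in line) for line in LINES)

print(x_wins(["x", "o", "o", "", "x", "", "o", "", "x"]))    # True (diagonal)
print(x_wins(["x", "o", "x", "o", "o", "x", "x", "x", "o"]))  # False
```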
51 MONK's Problems: 6 integer-valued features taking values 1, 2, 3, 4. In Problem 1, examples are in class 1 if either a1 = a2 or a5 = 1.
52 CART on MONK's Problem 1 (decision tree figure). Examples are in class 1 if either a1 = a2 or a5 = 1.
53 ORC on MONK's Problem 1: a1=3 & a2=3 → 1 (33/33); a1=2 & a2=2 → 1 (30/30); a5=1 → 1 (65/65); a1=1 & a2=1 → 1 (31/31); otherwise, default (152/288). Examples are in class 1 if either a1 = a2 or a5 = 1.
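The slide's decision list can be transcribed directly. The slide shows the "otherwise" rule only with its counts, so labeling the default as class 0 below is an assumption.

```python
def monks1_orc(x):
    """ORC decision list for MONK's Problem 1, as on the slide."""
    rules = [
        (lambda x: x["a1"] == 3 and x["a2"] == 3, 1),  # 33/33
        (lambda x: x["a1"] == 2 and x["a2"] == 2, 1),  # 30/30
        (lambda x: x["a5"] == 1, 1),                   # 65/65
        (lambda x: x["a1"] == 1 and x["a2"] == 1, 1),  # 31/31
    ]
    for cond, label in rules:
        if cond(x):
            return label
    return 0  # otherwise (152/288); class-0 label is an assumption

print(monks1_orc({"a1": 2, "a2": 2, "a5": 3}))  # 1 (a1 = a2)
print(monks1_orc({"a1": 1, "a2": 2, "a5": 3}))  # 0
```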
54 The bottom line: you don't need to sacrifice accuracy to get interpretability.
55 Outline, revisited; plus: current work coming up.
56 Related areas: Association Rules / Associative Classification, Bayesian Analysis, Logical Analysis of Data (LAD), ML algorithms that use rules as features, Decision Lists, Decision Trees.
57 Current Work. Machine Learning for the NYC Power Grid: cover of IEEE Computer, spotlight issue for IEEE TPAMI in February, WIRED Science, Slashdot, US News & World Report... Supervised ranking; equivalences between ranking and classification; ranking with MIO. Reverse-engineering quality rankings (in Businessweek last week). ML algorithms that understand how they will be used for a subsequent task. Several other projects.
58 Thank you!
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationAdvanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting)
Advanced Machine Learning Practical 4b Solution: Regression (BLR, GPR & Gradient Boosting) Professor: Aude Billard Assistants: Nadia Figueroa, Ilaria Lauzana and Brice Platerrier E-mails: aude.billard@epfl.ch,
More informationDecision Trees: Overfitting
Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationApproximation Theoretical Questions for SVMs
Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually
More informationOutline. What is Machine Learning? Why Machine Learning? 9/29/08. Machine Learning Approaches to Biological Research: Bioimage Informa>cs and Beyond
Outline Machine Learning Approaches to Biological Research: Bioimage Informa>cs and Beyond Robert F. Murphy External Senior Fellow, Freiburg Ins>tute for Advanced Studies Ray and Stephanie Lane Professor
More informationGenerative Model (Naïve Bayes, LDA)
Generative Model (Naïve Bayes, LDA) IST557 Data Mining: Techniques and Applications Jessie Li, Penn State University Materials from Prof. Jia Li, sta3s3cal learning book (Has3e et al.), and machine learning
More informationVBM683 Machine Learning
VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data
More informationLogis&c Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com
Logis&c Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these
More informationProbabilistic Time Series Classification
Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationBig Data Analytics. Special Topics for Computer Science CSE CSE Feb 24
Big Data Analytics Special Topics for Computer Science CSE 4095-001 CSE 5095-005 Feb 24 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Prediction III Goal
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationParameter learning in CRF s
Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.
More informationHierarchical Boosting and Filter Generation
January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationBayesian Hypotheses Testing
Bayesian Hypotheses Testing Jakub Repický Faculty of Mathematics and Physics, Charles University Institute of Computer Science, Czech Academy of Sciences Selected Parts of Data Mining Jan 19 2018, Prague
More informationBayesian Statistics. Debdeep Pati Florida State University. February 11, 2016
Bayesian Statistics Debdeep Pati Florida State University February 11, 2016 Historical Background Historical Background Historical Background Brief History of Bayesian Statistics 1764-1838: called probability
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationMODULE -4 BAYEIAN LEARNING
MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities
More informationMachine Learning 2nd Edi7on
Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edi7on CHAPTER 9: Decision Trees ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins alpaydin@boun.edu.tr h1p://www.cmpe.boun.edu.tr/~ethem/i2ml2e
More informationRelationship between Least Squares Approximation and Maximum Likelihood Hypotheses
Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Steven Bergner, Chris Demwell Lecture notes for Cmpt 882 Machine Learning February 19, 2004 Abstract In these notes, a
More information