A Bayesian Framework for Learning Rule Sets for Interpretable Classification


Journal of Machine Learning Research 18 (2017) 1-37. Submitted 1/16; Revised 2/17; Published 8/17.

A Bayesian Framework for Learning Rule Sets for Interpretable Classification

Tong Wang, TONG-WANG@UIOWA.EDU, University of Iowa
Cynthia Rudin, CYNTHIA@CS.DUKE.EDU, Duke University
Finale Doshi-Velez, FINALE@SEAS.HARVARD.EDU, Harvard University
Yimin Liu, LIUYIMIN2000@GMAIL.COM, Edward Jones
Erica Klampfl, EKLAMPFL@FORD.COM, Ford Motor Company
Perry MacNeille, PMACNEIL@FORD.COM, Ford Motor Company

Editor: Maya Gupta

Abstract

We present a machine learning algorithm for building classifiers that are comprised of a small number of short rules. These are restricted disjunctive normal form models. An example of a classifier of this form is as follows: if X satisfies (condition A AND condition B) OR (condition C) OR ..., then Y = 1. Models of this form have the advantage of being interpretable to human experts, since they produce a set of rules that concisely describe a specific class. We present two probabilistic models with prior parameters that the user can set to encourage the model to have a desired size and shape, to conform with a domain-specific definition of interpretability. We provide a scalable MAP inference approach and develop theoretical bounds to reduce computation by iteratively pruning the search space. We apply our method (Bayesian Rule Sets, BRS) to characterize and predict user behavior with respect to in-vehicle context-aware personalized recommender systems. Our method has a major advantage over classical associative classification methods and decision trees in that it does not greedily grow the model.

Keywords: disjunctive normal form, statistical learning, data mining, association rules, interpretable classifier, Bayesian modeling

1. Introduction

When applying machine learning models to domains such as medical diagnosis and customer behavior analysis, in addition to a reliable decision, one would also like to understand how this decision is generated, and more importantly, what the decision says about the data itself.
For classification tasks specifically, a few summarizing and descriptive rules will provide intuition about the data that can directly help domain experts understand the decision process. Our goal is to construct rule set models that serve this purpose: these models provide predictions and also descriptions of a class, which are reasons for a prediction. Here is an example of a rule set model for predicting whether a customer will accept a coupon for a nearby coffee house, where the coupon is presented by their car's mobile recommendation device: if a customer (goes to coffee houses once per month AND destination = no urgent place AND passenger ≠ kids)

©2017 Tong Wang, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. License: CC-BY 4.0; attribution requirements are provided at

OR (goes to coffee houses once per month AND the time until the coupon expires = one day), then predict the customer will accept the coupon for a coffee house. This model makes predictions and provides characteristics of the customers and their contexts that lead to an acceptance of coupons. Formally, a rule set model consists of a set of rules, where each rule is a conjunction of conditions. Rule set models predict that an observation is in the positive class when at least one of the rules is satisfied. Otherwise, the observation is classified to be in the negative class. Rule set models that have a small number of conditions are useful as non-black-box (interpretable) machine learning models. Observations come with predictions and also reasons (e.g., this observation is positive because it satisfies a particular rule). This form of model appears in various fields under various names. First, they are sparse disjunctive normal form (DNF) models. In operations research, there is a field called "logical analysis of data" (Boros et al., 2000; Crama et al., 1988), which constructs rule sets based on combinatorics and optimization. Similar models are called "consideration sets" in marketing (see, e.g., Hauser et al., 2010), or "disjunctions of conjunctions." Consideration sets are a way for humans to handle information overload. When confronted with a complicated decision (such as which television to purchase), the theory of consideration sets says that humans tend to narrow down the possibilities of what they are willing to consider into a small disjunction of conjunctions ("only consider TVs that are small AND inexpensive, OR large AND with a high display resolution"). In PAC learning, the goal is to learn rule set models when the data are non-noisy, meaning there exists a model within the class that perfectly classifies the data. The field of "rule learning" aims to construct rule set models.
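Read literally, the coffee-house model above is a two-rule DNF that can be executed directly. The encoding below (a customer dictionary, with "once per month" coded as a monthly visit count of at least 1) is an illustrative assumption, not part of the paper:

```python
# Each rule is a conjunction of (attribute -> condition) pairs;
# the model predicts "accept" when ANY rule is fully satisfied (a DNF).
RULES = [
    # goes to coffee houses once per month AND destination = no urgent place
    # AND passenger != kids
    {"coffee_freq_monthly": lambda v: v >= 1,   # encoding assumption
     "destination": lambda v: v == "no urgent place",
     "passenger": lambda v: v != "kids"},
    # goes to coffee houses once per month AND coupon expires in one day
    {"coffee_freq_monthly": lambda v: v >= 1,
     "expiration": lambda v: v == "one day"},
]

def predict_accept(customer):
    """Return 1 if every condition of at least one rule holds for this customer."""
    return int(any(all(cond(customer[attr]) for attr, cond in rule.items())
                   for rule in RULES))

# Hypothetical customer record: monthly coffee-house visitor, no urgent
# destination, traveling with a friend.
customer = {"coffee_freq_monthly": 2, "destination": "no urgent place",
            "passenger": "friend", "expiration": "two hours"}
```

This customer satisfies the first rule, so the model predicts acceptance; note the prediction comes with a reason, namely the rule that fired.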
We prefer the terms "rule set" models or "sparse DNF" for this work, as our goal is to create interpretable models for non-experts, which are not necessarily consideration sets, not other forms of logical data analysis, and pertain to data sets that are potentially noisy, unlike the PAC model. For many decisions, the space of good predictive models is often large enough to include very simple models (Holte, 1993). This means that there may exist very sparse but accurate rule set models; the challenge is determining how to find them efficiently. Greedy methods used in previous works (Malioutov and Varshney, 2013; Pollack et al., 1988; Friedman and Fisher, 1999; Gaines and Compton, 1995), where rules are added to the model one by one, do not generally produce high-quality sparse models. We create a Bayesian framework for constructing rule sets, which provides priors for controlling the size and shape of a model. These methods strike a nice balance between accuracy and interpretability through user-adjustable Bayesian prior parameters. To control interpretability, we introduce two types of priors on rule set models, one that uses beta-binomials to model the rule selection process, called BRS-BetaBinomial, and one that uses Poisson distributions to model rule generation, called BRS-Poisson. Both can be adjusted to suit a domain-specific notion of interpretability; it is well known that interpretability comes in different forms for different domains (Martens and Baesens, 2010; Martens et al., 2011; Huysmans et al., 2011; Allahyari and Lavesson, 2011; Rüping, 2006; Freitas, 2014). We provide an approximate inference method that uses association rule mining and a randomized search algorithm (a form of stochastic coordinate descent or simulated annealing) to find optimal BRS maximum a posteriori (MAP) models. This approach is motivated by a theoretical bound that allows us to iteratively reduce the size of the computationally hard problem of finding a MAP solution. This bound states that we need only mine rules that are sufficiently frequent in the database, and as the search continues, the bound becomes tighter and further reduces the search space. This greatly reduces computational complexity and allows the problem to be solved in practical settings. The theoretical result takes advantage of the strength of the Bayesian prior.

Our applied interest is to understand user responses to personalized advertisements that are chosen based on user characteristics, the advertisement, and the context. Such systems are called context-aware recommender systems (see the surveys of Adomavicius and Tuzhilin, 2005, 2008; Verbert et al., 2012; Baldauf et al., 2007, and references therein). One major challenge in the design of recommender systems, reviewed in Verbert et al. (2012), is the interaction challenge: users typically wish to know why certain recommendations were made. Our work addresses precisely this challenge: our models provide rules in data that describe conditions for a recommendation to be accepted. The main contributions of our paper are as follows:

- We propose a Bayesian approach for learning rule set (DNF) classifiers. This approach incorporates two important aspects of performance, accuracy and interpretability, and balances between them via user-defined parameters. Because of the foundation in Bayesian methodology, the prior parameters are meaningful; they represent the desired size and shape of the rule set.

- We derive bounds on the support of rules and the number of rules in a MAP solution. These bounds are useful in practice because they safely (and drastically) reduce the solution space, improving computational efficiency. The bound on the size of a MAP model guarantees a sparse solution if prior parameters are chosen to favor smaller sets of rules. More importantly, the bounds become tighter as the search proceeds, reducing the search space until the search finishes. The simulation studies demonstrate the efficiency and reliability of the search procedure.
Losses in accuracy usually result from a misspecified rule representation, not generally from problems with optimization.

- Separately, using publicly available data sets, we compare rule set models to interpretable and uninterpretable models from other popular classification methods. Our results suggest that BRS models can achieve competitive performance, and are particularly useful when data are generated from a set of underlying rules.

- We study in-vehicle mobile recommender systems. Specifically, we used Amazon Mechanical Turk to collect data about users interacting with a mobile recommendation system. We used rule set models to understand and analyze consumers' behavior and to predict their responses to different coupons recommended in different contexts. We were able to generate interpretable results that can help domain experts better understand their customers.

The remainder of our paper is structured as follows. In Section 2, we discuss related work. In Section 3, we present Bayesian Rule Set modeling. In Section 4, we introduce an approximate MAP inference method using associative rule mining and randomized search. We also present theoretical bounds on the support and the number of rules in an optimal MAP solution. In Section 5, we show the simulation studies that justify the inference methods. We then report experimental results using publicly available data, including the in-vehicle recommendation system data set we created, to show that BRS models are on par with black-box models while under strict size restrictions for the purpose of being interpretable.

A shorter version of this work appeared at the International Conference on Data Mining (Wang et al., 2016).

2. Related Work

The models we are studying have different names in different fields: "disjunctions of conjunctions" in marketing, "classification rules" in data mining, "disjunctive normal forms" (DNF) in artificial intelligence, and "logical analysis of data" in operations research. Learning logical models of this form has an extensive history in various settings. The LAD techniques were first applied to binary data in (Crama et al., 1988) and were later extended to non-binary cases (Boros et al., 2000). In parallel, Valiant (1984) showed that DNFs could be learned in polynomial time in the PAC (probably approximately correct, non-noisy) setting, and recent work has improved those bounds via polynomial threshold functions (Klivans and Servedio, 2001) and Fourier analysis (Feldman, 2012). Other work studies the hardness of learning DNF (Feldman, 2006). None of these theoretical approaches considered the practical aspects of building a sparse model with realistic noisy data. In the meantime, the data-mining literature has developed approaches to building logical conjunctive models. Associative classification and rule learning methods (e.g., Ma et al., 1998; Li et al., 2001; Yin and Han, 2003; Chen et al., 2006; Cheng et al., 2007; McCormick et al., 2012; Rudin et al., 2013; Dong et al., 1999; Michalski, 1969; Clark and Niblett, 1989; Frank and Witten, 1998) mine for frequent rules in the data and combine them in a heuristic way, where rules are ranked by an interestingness criterion. Some of these methods, like CBA, CPAR and CMAR (Li et al., 2001; Yin and Han, 2003; Chen et al., 2006; Cheng et al., 2007), still suffer from a huge number of rules. Inductive logic programming (Muggleton and De Raedt, 1994) is similar, in that it mines (potentially complicated) rules and takes the simple union of these rules as the rule set, rather than optimizing the rule set directly.
This is a major disadvantage relative to the type of approach we take here. Another class of approaches aims to construct rule set models by greedily adding the conjunction that explains most of the remaining data (Malioutov and Varshney, 2013; Pollack et al., 1988; Friedman and Fisher, 1999; Gaines and Compton, 1995; Cohen, 1995). Thus, again, these methods do not directly aim to produce globally optimal conjunctive models. There are a few recent techniques that do aim to fully learn rule set models. Hauser et al. (2010) and Goh and Rudin (2014) applied integer programming approaches to solving the full problems, but these would face a computational challenge for even moderately sized problems. There are several branch-and-bound algorithms that optimize different objectives than ours (e.g., Webb, 1995). Rijnbeek and Kors (2010) proposed an efficient way to exhaustively search for short and accurate decision rules. That work is different from ours in that it does not take a generative approach, and their global objective does not consider model complexity, meaning they do not have the same advantage of reduction to a smaller problem that we have. Methods that aim to globally find decision lists or rule lists can be used to find optimal DNF models, as long as it is possible to restrict all of the labels in the rule list (except the default) to the same value. A rule list where all labels except the default rule are the same value is exactly a DNF formula. We are working on extending the CORELS algorithm (Angelino et al., 2017) to handle DNF. Note that logical models are generally robust to outliers and naturally handle missing data, with no imputation needed for missing attribute values. These methods can perform comparably with traditional convex optimization-based methods such as support vector machines or lasso (though linear models are not always considered to be interpretable in many domains).

One could extend any method for DNF learning of binary classification problems to multiclass classification. A conventional way to directly extend a binary classifier is to decompose the multi-class problem into a set of two-class problems. Then for each binary classification problem, one could build a separate BRS model, e.g., by the one-against-all method. One would generate different rule sets for each class. For this approach, the issue of overlapping coverage of rule sets for different classes is handled, for instance, by error-correcting output codes (Schapire, 1997).

The main application we consider is in-vehicle context-aware recommender systems. The most similar works to ours include that of Baralis et al. (2011), who present a framework that discovers relationships between user context and services using association rules. Lee et al. (2006) create interpretable context-aware recommendations by using a decision tree model that considers location context, personal context, environmental context and user preferences. However, they did not study some of the most important factors we include, namely contextual information such as the user's destination, relative locations of services along the route, the location of the services with respect to the user's destination, passenger(s) in the vehicle, etc. Our work is related to recommendation systems for in-vehicle context-aware music recommendations (see Baltrunas et al., 2011; Wang et al., 2012), but whether a user will accept a music recommendation does not depend on anything analogous to the location of a business that the user would drive to. The setup of in-vehicle recommendation systems is also different from, for instance, mobile-tourism guides (Noguera et al., 2012; Schwinger et al., 2005; Van Setten et al., 2004; Tung and Soo, 2004), where the user is searching to accept a recommendation and interacts heavily with the system in order to find an acceptable recommendation. The closest work to ours is probably that of Park et al. (2007), who also consider Bayesian predictive models for context-aware recommender systems for restaurants. They also consider demographic and context-based attributes. They did not study advertising, however, which means they did not consider the locations of the coupon's venue, expiration times, etc.

3. Bayesian Rule Sets

We work with standard classification data. The data set S consists of {x_n, y_n}_{n=1,...,N}, where y_n ∈ {0,1}, with N_+ positive examples and N_− negative examples, and x_n ∈ X. Each x_n has J features. Let a represent a rule, with a corresponding Boolean function a(·) : X → {0,1}, where a(x) indicates whether x satisfies rule a. Let A denote a rule set, and let A(·) represent the corresponding classifier:

A(x) = 1 if ∃a ∈ A such that a(x) = 1, and A(x) = 0 otherwise.

That is, x is classified as positive if it satisfies at least one rule in A. Figure 1 shows an example of a rule set. Each rule is a yellow patch that covers a particular area, and the rule applies to the area it covers. In Figure 1, the white oval in the middle indicates the positive class. Our goal is to find a set of rules A that covers mostly the positive class, but little of the negative class, while keeping A a small set of short rules. We present a probabilistic model for selecting rules. Taking a generative approach allows us to flexibly incorporate users' expectations on the "shape" of a rule set through the prior. The user can guide the model toward more domain-interpretable solutions by specifying the desired balance between the size and lengths of rules without committing to any particular value for these parameters.
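The classifier A(·) is simply an existential check over the Boolean rule functions a(·). A minimal sketch, with two hypothetical rules over a three-feature binary vector:

```python
def make_classifier(rules):
    """A(x) = 1 if x satisfies at least one rule a in A, else 0."""
    return lambda x: int(any(a(x) for a in rules))

# Hypothetical rules over a feature vector x with J = 3 binary features.
A = [
    lambda x: x[0] == 1 and x[2] == 0,  # a conjunction of two conditions
    lambda x: x[1] == 1,                # a single-condition rule
]
A_of_x = make_classifier(A)
```

An observation is classified positive exactly when some rule "covers" it, which is what makes the rule that fired a human-readable reason for the prediction.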

Figure 1: Illustration of rule sets. Area covered by any of the rules is classified as positive; area not covered by any rule is classified as negative.

We propose two models for the prior, a Beta-Binomial prior and a Poisson prior, which characterize interpretability from different perspectives. The likelihood ensures that the model explains the data well and ensures good classification performance. We detail the priors and likelihood below.

3.1 Prior

We propose two priors. In the BRS-BetaBinomial model, the maximum length L of rules is predetermined by the user. Rules of the same length are placed into the same rule pool. The model uses L beta priors to control the probabilities of selecting rules from the different pools. In the BRS-Poisson model, the shape of a rule set, which includes the number of rules and the lengths of rules, is decided by drawing from Poisson distributions parameterized with user-defined values. Then the generative process fills the rules in with conditions by first randomly selecting attributes and then randomly selecting values corresponding to each attribute. We present the two prior models in detail.

3.1.1 BETA-BINOMIAL PRIOR

Rules are drawn randomly from a set A. Assuming the interpretability of a rule is associated with its length (the number of conditions in the rule), the rule space A is divided into L pools indexed by length, L being the maximum length the user allows:

A = ∪_{l=1}^{L} A_l.   (1)

Rules are then drawn independently from each pool; we assume each rule in A_l has probability p_l of being drawn, on which we place a beta prior. For l ∈ {1,...,L},

select a rule from A_l ~ Bernoulli(p_l),   (2)
p_l ~ Beta(α_l, β_l).   (3)

In reality, it is not practical to enumerate all possible rules within A_l when l is even moderately large. However, since we aim for intuitive models, we know that the optimal MAP model will consist of a small number of rules, each of which has sufficiently large support in the data (this

is formalized later within the proofs). Our approximate inference technique thus finds high-support rules to use within A_l. This effectively makes the beta-binomial model non-parametric. Let M_l denote the number of rules drawn from A_l, l ∈ {1,...,L}. A BRS model A consists of the rules selected from each pool. Then

p(A; {α_l, β_l}_l) = ∏_{l=1}^{L} ∫ p_l^{M_l} (1 − p_l)^{|A_l| − M_l} dp_l = ∏_{l=1}^{L} B(M_l + α_l, |A_l| − M_l + β_l) / B(α_l, β_l),   (4)

where B(·,·) represents a Beta function. The parameters {α_l, β_l}_l govern the prior probability of selecting a rule set. To favor smaller models, we would choose α_l, β_l such that α_l/(α_l + β_l) is close to 0. In our experiments, we set α_l = 1 for all l.

3.1.2 POISSON PRIOR

We introduce a data-independent prior for the BRS model. Let M denote the total number of rules in A, and let L_m denote the length of rule m, for m ∈ {1,...,M}. For each rule, the length must be at least 1 and at most the total number of attributes J. Under this model, we first draw M from a Poisson distribution. We then draw each L_m from a truncated Poisson distribution (which only allows numbers from 1 to J), to decide the shape of the rule set. Then, since we know the rule sizes, we can fill in the rules with conditions. To generate a condition, we first randomly select the attribute and then randomly select a value corresponding to that attribute. Let v_{m,k} represent the attribute index for the k-th condition in the m-th rule, v_{m,k} ∈ {1,...,J}, and let K_{v_{m,k}} be the total number of values for attribute v_{m,k}. Note that a value here refers to a category for a categorical attribute, or an interval for a discretized continuous variable.
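This generative process can be sketched in code. Here λ (lam) and η (eta) stand for the user-set Poisson parameters for the number of rules and the rule lengths, and the attribute/value table is a made-up example:

```python
import random

def sample_rule_set(lam, eta, attr_values, rng=random.Random(0)):
    """Draw a rule set: M ~ Poisson(lam); per rule, L_m ~ truncated
    Poisson(eta) on {1,...,J}; then pick L_m distinct attributes
    (without replacement) and one value for each chosen attribute."""
    J = len(attr_values)
    attrs = list(attr_values)

    def poisson(mu):
        # Knuth's multiplication method; adequate for small mu.
        L, k, p = pow(2.718281828459045, -mu), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    M = poisson(lam)
    rules = []
    for _ in range(M):
        L_m = 0
        while not 1 <= L_m <= J:          # truncate to {1,...,J} by rejection
            L_m = poisson(eta)
        chosen = rng.sample(attrs, L_m)    # attributes without replacement
        rules.append({a: rng.choice(attr_values[a]) for a in chosen})
    return rules

# Hypothetical attribute/value table (the K_{v_{m,k}} values per attribute).
attr_values = {"destination": ["home", "work", "no urgent place"],
               "passenger": ["alone", "friend", "kids"],
               "weather": ["sunny", "rainy"]}
rules = sample_rule_set(lam=2.0, eta=1.5, attr_values=attr_values)
```

Each sampled rule is a conjunction mapping attributes to required values; the prior mass of a set then follows the product form in (5) below.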
To summarize, a rule set is constructed as follows:

1: Draw the number of rules: M ~ Poisson(λ)
2: for m ∈ {1,...,M} do
3:   Draw the number of conditions in the m-th rule: L_m ~ Truncated-Poisson(η), L_m ∈ {1,...,J}
4:   Randomly select L_m attributes from the J attributes, without replacement
5:   for k ∈ {1,...,L_m} do
6:     Uniformly at random select a value from the K_{v_{m,k}} values corresponding to attribute v_{m,k}
7:   end for
8: end for

We define a normalization constant ω(λ, η), so the probability of generating a rule set A is

p(A; λ, η) = (1/ω(λ, η)) · Poisson(M; λ) · ∏_{m=1}^{M} [ Poisson(L_m; η) · (1 / (J choose L_m)) · ∏_{k=1}^{L_m} (1 / K_{v_{m,k}}) ].   (5)

3.2 Likelihood

Let A(x_n) denote the classification outcome for x_n, and let y_n denote the observed outcome. Recall that A(x_n) = 1 if x_n obeys any of the rules a ∈ A. We introduce a likelihood parameter ρ_+ to represent the prior probability that y_n = 1 when A(x_n) = 1, and ρ_− to represent the prior probability

that y_n = 0 when A(x_n) = 0. Thus:

y_n | x_n, A ~ Bernoulli(ρ_+) if A(x_n) = 1,
y_n | x_n, A ~ Bernoulli(1 − ρ_−) if A(x_n) = 0,

with ρ_+ and ρ_− drawn from beta priors: ρ_+ ~ Beta(α_+, β_+) and ρ_− ~ Beta(α_−, β_−).   (6)

Based on the classification outcomes, the training data are divided into four cases: true positives (TP = Σ_n A(x_n) y_n), false positives (FP = Σ_n A(x_n)(1 − y_n)), true negatives (TN = Σ_n (1 − A(x_n))(1 − y_n)) and false negatives (FN = Σ_n (1 − A(x_n)) y_n). Then it can be derived that

p(S | A; α_+, β_+, α_−, β_−) = ∫ ρ_+^{TP} (1 − ρ_+)^{FP} ρ_−^{TN} (1 − ρ_−)^{FN} dρ_+ dρ_−
  = [B(TP + α_+, FP + β_+) / B(α_+, β_+)] · [B(TN + α_−, FN + β_−) / B(α_−, β_−)].   (7)

According to (6), α_+, β_+, α_−, β_− govern the probability that a prediction is correct on the training data, which determines the likelihood. To ensure the model achieves the maximum likelihood when all data points are classified correctly, we need ρ_+ and ρ_− to be close to 1, so we choose α_+, β_+, α_−, β_− such that α_−/(α_− + β_−) > 0.5 and α_+/(α_+ + β_+) > 0.5. Parameter tuning for the prior parameters will be shown in the experiments section. Let H denote the set of parameters for data S, H = {α_+, β_+, α_−, β_−, θ_prior}, where θ_prior depends on the prior model. Our goal is to find an optimal rule set A* such that

A* ∈ arg max_A p(A | S; H).   (8)

Solving for a MAP model is equivalent to maximizing

F(A, S; H) = log p(A; H) + log p(S | A; H),   (9)

where either of the two priors provided above can be used for the first term, and the likelihood is given in (7). We write the objective as F(A), the prior probability of selecting A as p(A), and the likelihood as p(S | A), omitting the dependence on parameters when appropriate.

4. Approximate MAP Inference

In this section, we describe a procedure for approximately solving for the maximum a posteriori (MAP) solution of a BRS model. Inference in rule set modeling is challenging because it involves a search over exponentially many possible sets of rules: since each rule is a conjunction of conditions, the number of rules increases exponentially with the number of conditions, and the number of sets of rules increases exponentially with the number of rules.
This is the reason learning a rule set has always been a difficult problem in theory.
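For a fixed rule set, however, the objective F(A) in (9) is cheap to evaluate: both the Beta-Binomial prior (4) and the likelihood (7) reduce to ratios of Beta functions, computable in log space with `lgamma`. The counts, pool sizes, and hyperparameters below are made-up illustrations:

```python
from math import lgamma

def log_beta(a, b):
    """log B(a, b) via log-gamma, for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_prior_betabinom(M, A, alpha, beta):
    """log p(A) from eq. (4): M[l] rules chosen from a pool of A[l] rules."""
    return sum(log_beta(M[l] + alpha[l], A[l] - M[l] + beta[l])
               - log_beta(alpha[l], beta[l]) for l in range(len(M)))

def log_likelihood(tp, fp, tn, fn, ap, bp, an, bn):
    """log p(S|A) from eq. (7): a product of two Beta-function ratios."""
    return (log_beta(tp + ap, fp + bp) - log_beta(ap, bp)
            + log_beta(tn + an, fn + bn) - log_beta(an, bn))

def objective(M, A, alpha, beta, tp, fp, tn, fn, ap, bp, an, bn):
    """F(A) = log p(A) + log p(S|A), eq. (9)."""
    return (log_prior_betabinom(M, A, alpha, beta)
            + log_likelihood(tp, fp, tn, fn, ap, bp, an, bn))

# Made-up setting: two length pools, alpha_l = 1 with large beta_l to favor
# small models; likelihood hyperparameters with alpha > beta reward accuracy.
F_good = objective([1, 1], [100, 500], [1.0, 1.0], [100.0, 100.0],
                   tp=90, fp=10, tn=85, fn=15, ap=100, bp=1, an=100, bn=1)
F_bad = objective([5, 5], [100, 500], [1.0, 1.0], [100.0, 100.0],
                  tp=50, fp=50, tn=50, fn=50, ap=100, bp=1, an=100, bn=1)
```

A small, accurate rule set scores higher than a large, inaccurate one under these hyperparameters, which is exactly the trade-off the search procedure below exploits.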

Inference becomes easier, however, for our BRS model, due to the computational benefits brought by the Bayesian prior, which can effectively reduce the original problem to a much more manageable size and significantly improve computational efficiency. Below, we detail how to exploit the Bayesian prior to derive useful bounds during any search process. Here we use a stochastic coordinate descent search technique, but the bounds hold regardless of which search technique is used.

4.1 Search Procedure

Given an objective function F(A) over the discrete search space of sets of rules and a temperature schedule function over time steps, T^[t] = T_0^{1 − t/N_iter}, a simulated annealing (Hwang, 1988) procedure is a discrete-time, discrete-state Markov chain where at step t, given the current state A^[t], the next state A^[t+1] is chosen by first randomly selecting a proposal from the neighbors, and the proposal is accepted with probability min{1, exp((F(A^[t+1]) − F(A^[t])) / T^[t])}, in order to find an optimal rule set A*. Our version of simulated annealing is also similar to the ε-greedy algorithm used in multi-armed bandits (Sutton and Barto, 1998). Similar to the Gibbs sampling steps used by Letham et al. (2015) and Wang and Rudin (2015) for rule-list models, here a neighboring solution is a rule set whose edit distance to the current rule set is 1 (one of the rules is different). Our simulated annealing algorithm proposes a new solution via two steps: choosing an action from ADD, CUT and REPLACE, and then choosing a rule to perform the action on. In the first step, the selection of which rules to ADD, CUT or REPLACE is not made uniformly. Instead, the simulated annealing proposes a neighboring solution that aims to do better than the current model with respect to a randomly chosen misclassified point. At iteration t + 1, an example k is drawn from the data points misclassified by the current model A^[t].
Let R_1(x_k) represent the set of rules that x_k satisfies, and let R_0(x_k) represent the set of rules that x_k does not satisfy. If example k is positive, the current rule set fails to cover it, so we propose a new model that covers example k, by either adding a new rule from R_1(x_k) to A^[t], or replacing a rule in A^[t] with a rule from R_1(x_k). The two actions are chosen with equal probabilities. Similarly, if example k is negative, the current rule set covers the wrong data, and we then find a neighboring rule set that covers less, by removing or replacing a rule from A^[t] \ R_0(x_k) (the rules in A^[t] that x_k satisfies), each with probability 0.5. In the second step, a rule is chosen to perform the action on. To choose the best rule, we evaluate the precision of the tentative models obtained by performing the selected action on each candidate rule. This precision is:

Q(A) = Σ_i A(x_i) y_i / Σ_i A(x_i).

Then a choice is made between exploration, which means choosing a random rule (from the ones that improve on the new point), and exploitation, which means choosing the best rule (the one with the highest precision). We denote the probability of exploration as p in our search algorithm. This randomness helps to avoid local optima and helps the Markov chain converge to a global optimum. We detail the three actions below.

ADD:
1. Select a rule z according to x_k:

   - With probability p, draw z randomly from R_1(x_k). (Explore)
   - With probability 1 − p, z = arg max_{a ∈ R_1(x_k)} Q(A^[t] ∪ a). (Exploit)
2. Then A^[t+1] ← A^[t] ∪ z.

CUT:
1. Select a rule z according to x_k:
   - With probability p, draw z randomly from A^[t] \ R_0(x_k). (Explore)
   - With probability 1 − p, z = arg max_{a ∈ A^[t] \ R_0(x_k)} Q(A^[t] \ a). (Exploit)
2. Then A^[t+1] ← A^[t] \ z.

REPLACE: first CUT, then ADD.

To summarize: the proposal strategy uses an ε-greedy strategy to determine when to explore and when to exploit, and when it exploits, the algorithm uses a randomized coordinate descent approach, choosing a rule that maximizes improvement.

4.2 Iteratively reducing the rule space

Another approach to reducing computation is to directly reduce the set of rules. This dramatically reduces the search space, which is the power set of the set of rules. We take advantage of the Bayesian prior and derive a deterministic bound on MAP BRS models that excludes rules failing to satisfy the bound. Since the Bayesian prior is constructed to penalize large models, a BRS model tends to contain a small number of rules. As a small number of rules must cover the positive class, each rule in the MAP model must cover enough observations. We define the number of observations that satisfy a rule as the support of the rule:

supp(a) = Σ_{n=1}^{N} 1(a(x_n) = 1).   (10)

Removing a rule will yield a simpler model (and a larger prior) but may lower the likelihood. However, we can prove that the loss in likelihood is bounded as a function of the support. For a rule set A and a rule z ∈ A, we use A\z to represent the set that contains all rules from A except z, i.e., A\z = {a | a ∈ A, a ≠ z}. Define the constant Λ, determined by N_+, N_− and the likelihood hyperparameters α_+ and β_−. Then the following holds:

Lemma 1  p(S | A) ≤ Λ^{supp(z)} · p(S | A\z).
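In code, the support computation (10) and the pruning it licenses (discarding candidate rules whose support falls below the current bound) can be sketched as follows; the data, rules, and threshold are illustrative stand-ins for the bound C^[t] derived below:

```python
def supp(rule, X):
    """supp(a) = number of observations satisfying rule a (eq. 10)."""
    return sum(1 for x in X if rule(x))

def prune(candidates, X, C):
    """Keep only candidate rules with support >= C; rules below the bound
    cannot appear in a MAP rule set, so discarding them is safe."""
    return [a for a in candidates if supp(a, X) >= C]

# Made-up binary data and two candidate rules.
X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0)]
candidates = [lambda x: x[0] == 1,                  # support 3
              lambda x: x[0] == 1 and x[1] == 1]    # support 1
kept = prune(candidates, X, C=2)
```

As the bound tightens during the search, this filter is reapplied, so the candidate pool (and with it the power-set search space) shrinks monotonically.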

All proofs in this paper are in Appendix A. This lemma shows that if Λ ≤ 1, removing a rule with high support lowers the likelihood. We wish to derive lower bounds on the support of rules in a MAP model. This will allow us to remove (low-support) rules from the search space without any loss in the posterior. It will simultaneously yield an upper bound on the maximum number of rules in a MAP model. The bounds are updated as the search proceeds. Recall that A^[τ] denotes the rule set at time τ, and let v^[t] denote the best objective value found up to iteration t, i.e., v^[t] = max_{τ ≤ t} F(A^[τ]). We first look at BRS-BetaBinomial models. In a BRS-BetaBinomial model, rules are selected based on their lengths, so there are separate bounds for rules of different lengths; we notate the upper bounds on the number of rules at step t as {m_l^[t]}_{l=1,...,L}, where m_l^[t] represents the upper bound for the number of rules of length l. We then introduce some notation that will be used in the theorems. Let L* denote the maximum likelihood of the data S, which is achieved when all data are classified correctly (this holds when α_+ > β_+ and α_− > β_−), i.e., TP = N_+, FP = 0, TN = N_−, and FN = 0, giving:

L* := max_A p(S | A) = [B(N_+ + α_+, β_+) / B(α_+, β_+)] · [B(N_− + α_−, β_−) / B(α_−, β_−)].   (11)

For a BRS-BetaBinomial model, the following is true.

Theorem 1  Take a BRS-BetaBinomial model with parameters H = {L, α_+, β_+, α_−, β_−, {A_l, α_l, β_l}_{l=1,...,L}}, where L, α_+, β_+, α_−, β_−, {α_l, β_l}_{l=1,...,L} ∈ N_+, α_+ > β_+, α_− > β_−, and α_l < β_l. Define A* ∈ arg max_A F(A) and M* = |A*|. If Λ ≤ 1, we have:

∀a ∈ A*, supp(a) ≥ C^[t], where

C^[t] = max_l ⌈ log( (|A_l| − m_l^[t−1] + β_l) / (m_l^[t−1] + α_l) ) / log(1/Λ) ⌉,

m_l^[t] = ⌊ ( log L* + log p(∅) − v^[t] ) / log( (|A_l| − m_l^[t−1] + β_l − 1) / (m_l^[t−1] + α_l) ) ⌋ + m_l^[t−1],   (12)

with m_l^[0] = ⌊ ( log L* − log p(S | ∅) ) / log( (|A_l| + β_l − 1) / α_l ) ⌋ and v^[0] = F(∅). The size of A* is upper bounded by

M* ≤ Σ_{l=1}^{L} m_l^[t].

In the equation for m_l^[t], p(∅) is the prior of an empty set, which is also the maximum prior for a rule set model, attained when we set α_l = 1 for all l. Since L* and p(∅) upper bound the likelihood and prior, respectively, log L* + log p(∅) upper bounds the optimal objective value. The difference between this value and v^[t], the numerator, represents the room for improvement over the current solution v^[t]. The smaller the difference, the smaller m_l^[t], and thus the larger the minimum support bound. In the extreme case, when an empty set is the optimal solution, i.e., the likelihood and the prior achieve the upper bounds L* and p(∅) at the same time, and the optimal solution is discovered at time t, the numerator becomes zero and m_l^[t] = 0, which precisely corresponds to an empty set. Additionally, the parameters α_l and β_l jointly govern the probability of selecting rules of length l, according to formula (3). When α_l is set to be small and β_l is set to be large, p_l is small, since E(p_l) = α_l/(α_l + β_l). Therefore the resulting m_l^[t] is small, guaranteeing that the algorithm will choose a smaller number of rules overall. Specifically, at step 0, when there is no m_l^[t−1], we set m_l^[t−1] = |A_l|, yielding

m_l^[0] = ⌊ ( log L* − log p(S | ∅) ) / log( (|A_l| + β_l − 1) / α_l ) ⌋,   (13)

which is the upper bound we obtain before we start learning A*. This bound uses an empty set as the comparison model, which is a reasonable benchmark since the Bayesian prior favors smaller models. (It is possible, for instance, that the empty model is actually close to optimal, depending on the prior.) As log p(S | ∅) increases, m_l^[0] becomes smaller, which means the model's maximum possible size is smaller. Intuitively, if an empty set already achieves high likelihood, adding rules will often hurt the prior term while gaining little on the likelihood. Using this upper bound, we get a minimum support which can be used to reduce the search space before simulated annealing begins. As m_l^[t] decreases, the minimum support increases, since the rules need to jointly cover the positive examples.
Also, if fewer rules are needed, it means each rule needs to cover more examples. Similarly, for BRS-Poisson models, we have:

Theorem 2 Take a BRS-Poisson model with parameters H = {L, α+, β+, α−, β−, λ, η}, where L, α+, β+, α−, β− ∈ ℕ+, α+ > β+, α− > β−, and J is the number of attributes. Define A* ∈ arg max_A F(A) and M* = |A*|. If ϱ ≤ 1, we have:

    ∀a ∈ A*, supp(a) ≥ C^[t], where

    C^[t] = ⌈ log x / log ϱ ⌉,    (14)

with x = max{ e(η + 1), 2λ(J + 1) }, and

    M^[t] = ⌊ (log L* + log p(∅) − v^[t]) / log( x / (η + 1) ) ⌋.

The size of A* is upper bounded by M* ≤ M^[t]. Similar to Theorem 1, log L* + log p(∅) upper bounds the optimal objective value, and log L* + log p(∅) − v^[t] is an upper bound on the room for improvement over the current solution v^[t]. The smaller this bound becomes, the larger the minimum support, thus reducing the rule space iteratively whenever a new maximum solution is discovered. The algorithm, which applies the bounds above, is given as Algorithm 1.

Algorithm 1 Inference algorithm.
procedure SIMULATEDANNEALING(N_iter)
    A ← FP-Growth(S)
    A^[0] ← a randomly generated rule set
    A_max ← A^[0]
    for t = 0 → N_iter do
        A ← {a ∈ A : supp(a) ≥ C^[t]}    (C^[t] is from (12) or (14), depending on the model choice)
        S' ← examples misclassified by A^[t]    (find misclassified examples)
        (x_k, y_k) ← a random example drawn from S'
        if y_k = 1 then
            action ← ADD with probability 0.5, REPLACE with probability 0.5
        else
            action ← CUT with probability 0.5, REPLACE with probability 0.5
        A^[t+1] ← action(A, A^[t], x_k)
        A_max ← arg max {F(A_max), F(A^[t+1])}    (check for improved optimal solution)
        α ← min{1, exp((F(A^[t+1]) − F(A^[t])) / T^[t])}    (probability of an annealing move)
        A^[t+1] ← A^[t+1] with probability α, A^[t] with probability 1 − α
    end for
    return A_max
end procedure

Rule Mining

We mine a set of candidate rules A before running the search algorithm and only search within A, to create an approximate but more computationally efficient inference algorithm. For categorical attributes, we consider both positive associations (e.g., x_j = blue) and negative associations (x_j = not green) as conditions. (The importance of negative conditions is stressed, for instance, by Brin et al., 1997; Wu et al., 2002; Teng et al., 2002.) For numerical attributes, we create a set of binary variables for each numerical attribute, by comparing it with a set of thresholds (usually quantiles), and add their negations as separate attributes.
For example, we discretize age with three thresholds, creating six binary variables in total, i.e., age≥25, age≥50, age≥75 and the negations for each of

them, i.e., age<25, age<50 and age<75. We then mine for frequent rules within the set of positive observations S+. To do this, we use the FP-growth algorithm (Borgelt, 2005), which can in practice be replaced with any desired frequent rule-mining method. Sometimes, for large data sets with many features, even when we restrict the length of rules and the minimum support, the number of generated rules could still be too large to handle, as it can be exponential in the number of attributes. For example, for one of the advertisement data sets, a million rules are generated when the minimum support is 5% and the maximum length is 3. In that case, we use a second criterion to screen for the most potentially useful M0 rules, where M0 is user-defined and depends on the user's computational capacity. We first filter out rules on the lower right plane of ROC space, i.e., those whose false positive rate is greater than their true positive rate. Then we use information gain to screen rules, similarly to other works (Chen et al., 2006; Cheng et al., 2007). For a rule a, the information gain is

    InfoGain(S|a) = H(S) − H(S|a),    (15)

where H(S) is the entropy of the data and H(S|a) is the conditional entropy of the data split on rule a. Given a data set S, the entropy H(S) is constant; therefore our screening technique chooses the M0 rules that have the smallest H(S|a).

Figure 2: All rules and selected rules on a ROC plane

We illustrate the effect of screening on one of our mobile advertisement data sets. We mined all rules with minimum support 5% and maximum length 3. For each rule, we computed its true positive rate and false positive rate on the training data and plotted it as a dot in Figure 2. The top 5000 rules with highest information gains are colored in red, and the rest are in blue. As shown in the figure, information gain indeed selected good rules, as they are closer to the upper left corner in ROC space.
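The screening step can be sketched in Python as follows (our own illustration: each mined rule is summarized by its true/false positive counts; `screen_rules` discards rules below the ROC diagonal and then ranks by smallest conditional entropy, which is equivalent to largest information gain by Equation (15)):

```python
import math

def entropy(pos, neg):
    # Shannon entropy (in bits) of a binary label group with `pos` positives
    # and `neg` negatives
    n = pos + neg
    h = 0.0
    for c in (pos, neg):
        if 0 < c < n:
            p = c / n
            h -= p * math.log2(p)
    return h

def conditional_entropy(tp, fp, n_pos, n_neg):
    # H(S|a): entropy of the two groups induced by rule a, weighted by size
    # (covered group: tp positives, fp negatives; uncovered group: the rest)
    n = n_pos + n_neg
    cov, unc = tp + fp, n - tp - fp
    h = 0.0
    if cov:
        h += cov / n * entropy(tp, fp)
    if unc:
        h += unc / n * entropy(n_pos - tp, n_neg - fp)
    return h

def screen_rules(rule_stats, n_pos, n_neg, k):
    # Keep rules above the ROC diagonal (TPR > FPR), then the k rules with
    # smallest H(S|a), i.e., largest information gain since H(S) is constant
    kept = [(tp, fp) for tp, fp in rule_stats if tp / n_pos > fp / n_neg]
    kept.sort(key=lambda s: conditional_entropy(s[0], s[1], n_pos, n_neg))
    return kept[:k]
```

A perfectly discriminating rule (all positives covered, no negatives) drives H(S|a) to zero, so it sorts to the front, matching the intuition that screened rules cluster toward the upper left of ROC space.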
For many applications, this screening technique is not needed, and we can simply use the entire set of pre-mined rules that have support above the threshold required for the optimal solution provided in the theorems above.
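For readers less familiar with simulated annealing, the accept/reject core of a search like Algorithm 1 can be sketched generically (a simplified, self-contained sketch of our own: `f` stands in for the objective F, `propose` for moves such as ADD/CUT/REPLACE, and the logarithmic cooling schedule is a common textbook choice, not necessarily the paper's):

```python
import math
import random

def anneal(f, propose, init, n_iter=1000, t0=1.0, seed=0):
    # Generic simulated annealing maximization: always track the best state
    # seen; accept a worse candidate with probability exp(delta / T), where
    # the temperature T decays as iterations proceed.
    rnd = random.Random(seed)
    cur, best = init, init
    for t in range(1, n_iter + 1):
        cand = propose(cur, rnd)
        temp = t0 / math.log(1 + t)      # cooling schedule (textbook choice)
        delta = f(cand) - f(cur)
        if delta >= 0 or rnd.random() < math.exp(delta / temp):
            cur = cand                   # improving moves always accepted
        if f(cur) > f(best):
            best = cur                   # keep the incumbent optimum
    return best
```

For example, maximizing f(x) = −(x − 3)² over the integers with ±1 proposals quickly settles near x = 3; in BRS the state is a rule set and proposals are biased by misclassified examples, as in Algorithm 1.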

5. Simulation Studies

We present simulation studies to show the interpretability and accuracy of our model and the efficiency of the simulated annealing procedure. We first demonstrate the efficiency of the simulated annealing algorithm by searching within candidate rules for a "true rule set" that generates the data. Simulations show that our algorithm can recover the true rule set with high probability within a small number of iterations, despite the large search space. We then designed the second set of simulations to study the trade-off between accuracy and simplicity. BRS models may lose accuracy to gain model simplicity when the number of attributes is large. Combining the two simulation studies, we were able to hypothesize that possible losses in accuracy are often due to the choice of model representation as a rule set, rather than to simulated annealing. In this section, we choose the BRS-BetaBinomial model for the experiments. BRS-Poisson results would be similar; the only difference is the choice of prior.

5.1 Simulation Study 1: Efficiency of Simulated Annealing

In the first simulation study, we would like to test if simulated annealing is able to recover a true rule set given that these rules are in a candidate rule set, and we would like to know how efficiently simulated annealing finds them. Thus we omit the step of rule mining for this experiment and directly work with generated candidate rules, which we know in this case contain a true rule set. Let there be N observations, {x_n}_{n=1,…,N}, and a collection of M candidate rules, {a_m}_{m=1,…,M}. We can construct an N × M binary matrix where the entry in the n-th row and m-th column is the Boolean function a_m(x_n) that represents whether the n-th observation satisfies the m-th rule. Let the true rule set be a subset of the candidate rules, A* ⊆ {a_m}_{m=1,…,M}. The labels {y_n}_{n=1,…,N} satisfy

    y_n = 1 if ∃a ∈ A* such that a(x_n) = 1, and y_n = 0 otherwise.    (16)

We ran experiments with varying N and M.
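This kind of synthetic setup, with labels generated from a binary rule-satisfaction matrix via Equation (16), can be sketched in a few lines of Python (the function name is ours; the entry density 0.035 and the 20 true rules follow the settings described in the text):

```python
import random

rnd = random.Random(0)

def make_simulation(n, m, n_true=20, p_one=0.035):
    # n x m binary matrix: entry (i, j) = 1 iff example i satisfies rule j
    X = [[1 if rnd.random() < p_one else 0 for _ in range(m)]
         for _ in range(n)]
    # columns forming the true rule set A*
    true_idx = rnd.sample(range(m), n_true)
    # Equation (16): y_i = 1 iff example i satisfies at least one rule in A*
    y = [1 if any(row[j] for j in true_idx) else 0 for row in X]
    return X, y, true_idx
```

The text's additional checks (roughly balanced classes, no true rule contained in another) are omitted here for brevity; a full replication would re-sample until they hold.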
Each time, we first generated an N × M binary matrix by setting the entries to 1 with probability 0.035, and then selected 20 columns to form A*. (The values 0.035 and 20 are chosen so that the positive class contains roughly 50% of the observations. We also ensured that the 20 rules do not include each other.) Then {y_n}_{n=1,…,N} were generated as above. We randomly assigned lengths from 1 to 5 to {a_m}_{m=1,…,M}. Finally, we ran Algorithm 1 on the data set. The prior parameters were set as below for all simulations: α_l = 1, β_l = |A_l| for l = 1, …, 3, α+ = α− = 1000, β+ = β− = 1. For each pair of N and M, we repeated the experiment 100 times and recorded intermediate output models at different iterations. Since multiple different rule sets can cover the same points, we compared labels generated from the true rule set and the recovered rule set: a label is 1 if the instance satisfies the rule set and 0 otherwise. We recorded the training error for each intermediate output model. The mean and standard deviation of the training error rate are plotted in Figure 3, along with run time to convergence in seconds in the right panel. Comparing the three sets of experiments, we notice that different sizes of the binary matrix led to differing convergence times due to the differing computational costs of handling the matrices, yet the three error rate curves in Figure 3 almost completely overlap. Neither the size of the data set N nor

Figure 3: Convergence of mean error rate with number of iterations for data with varying N and M, and running time (in seconds), obtained by the BRS-BetaBinomial.

the size of the rule space M affected the search. This is because the strategy based on curiosity and bounds chose the best solution efficiently, regardless of the size of the data or the rule space. More importantly, this study shows that simulated annealing is able to recover a true rule set A*, or an equivalent rule set with optimal performance, with high probability, within a few iterations.

5.2 Simulation Study 2: Accuracy-Interpretability Trade-off

We observe from the first simulation study that simulated annealing does not tend to cause losses in accuracy. In the second simulation study, we would like to identify whether the rule representation causes losses in accuracy. In this study, we used the rule mining step and directly worked with data {x_n}_{n=1,…,N}. Without loss of generality, we assume x_n ∈ {0,1}^J. In each experiment, we first constructed an N × J binary matrix {x_n}_{n=1,…,N} by setting the entries to 1 with probability 0.5, then randomly generated 20 rules from {x_n}_{n=1,…,N} to form A*, and finally generated labels {y_n}_{n=1,…,N} following formula (16). We checked the balance of the data before pursuing the experiment. The data set was used only if the positive class was within [30%, 70%] of the total data, and otherwise regenerated. To constrain computational complexity, we set the maximum length of rules to be 7 and then selected the top 5000 rules with highest information gain. Then we performed Algorithm 1 on these candidate rules to generate a BRS model Ã. Ideally, Ã should be simpler than A*, with a possible loss in its accuracy as a trade-off. To study the influence of the dimensions of the data on accuracy and simplicity, we varied N and J and repeated each experiment 100 times. Figure 4 shows the accuracy and number of rules in Ã in each experiment.
The number of rues in à was amost aways ess than the number of rues in A (which was 20). On average, à contained 12 rues, which was 60% of the size of A, sighty compromising accuracy. BRS needed to compromise more when J was arge since it became harder to maintain the same eve of simpicity. 16

Figure 4: Accuracy and complexity of output BRS models for data with varying N and J, obtained from the BRS-BetaBinomial.

We show in Figure 5 the average training error at different iterations, to illustrate the convergence rate. Simulated annealing took less than 50 iterations to converge. Note that it took fewer iterations than in simulation study 1, since the error rate curve started from less than 0.5, due to rule pre-screening using information gain.

Figure 5: Convergence of mean error rate with varying N and J, obtained by the BRS-BetaBinomial.

Figure 4 and Figure 5 both indicate that increasing the number of observations did not affect the performance of the model. BRS is more sensitive to the number of attributes. The larger J is, the less informative a short rule becomes for describing and distinguishing a class in the data set, and the more BRS has to trade off in order to keep simple rules within the model. To summarize: when we model using fewer rules, it benefits interpretability but can lose information. Rules with a constraint on the maximum length tend to be less accurate when dealing with

high dimensional feature spaces. The loss increases with the number of features. This is the price paid for using an interpretable model. There is little price paid for using our optimization method.

6. Experiments

Our main application of interest is mobile advertising. We collected a large advertisement data set using Amazon Mechanical Turk that we made publicly available (Wang et al., 2015), and we tested BRS on other publicly available data sets. We show that our model can generate interpretable results that are easy for humans to understand, without losing too much (if any) predictive accuracy. In situations where the ground truth consists of deterministic rules (similarly to the simulation study), our method tends to perform better than other popular machine learning techniques.

6.1 UCI Data Sets

We test BRS on ten data sets from the UCI machine learning repository (Bache and Lichman, 2013) with different types of attributes; four of them contain categorical attributes, three of them contain numerical attributes, and three of them contain both. We performed 5-fold cross validation on each data set. Since our goal was to produce models that are accurate and interpretable, we set the maximum length of rules to be 3 for all experiments. We compared with interpretable algorithms: Lasso (without interaction terms, to preserve interpretability), decision trees (C4.5 and CART), and the inductive rule learner RIPPER (Cohen, 1995), which produces a rule list greedily and anneals locally afterward. In addition to interpretable models, we also compare BRS with uninterpretable models from random forests and SVM, to set a benchmark for expected accuracy on each data set. We used implementations of the baseline methods from R packages, and tuned model parameters via 5-fold nested cross validation.

Table 1: Accuracy of each comparing algorithm (mean ± std) on ten UCI data sets.
Data Type    Data set          BRS1     BRS2     Lasso    C4.5     CART     RIPPER   Random Forest  SVM
Categorical  connect-4         .76±.01  .75±.01  .70±.00  .83±.00  .69±—    —        —±.00          .82±.00
             mushroom          1.00±—   —±.00    .96±—    —±.00    .99±.00  .99±—    —±.00          .93±.00
             tic-tac-toe       1.00±—   —±.00    .71±.02  .92±.03  .93±.02  .98±.01  .99±.00        .99±.00
             chess             .89±.01  .89±.01  .83±.01  .97±.00  .92±.04  .91±.01  .91±.00        .99±.00
Numerical    magic             .80±.02  .79±.01  .76±.01  .79±.00  .77±.00  .78±.00  .78±.01        .78±.01
             banknote          .97±.01  .97±—    —±.00    .90±.01  .90±.02  .91±.01  .91±—          —±.00
             indian-diabetes   .72±.03  .73±.03  .67±.01  .66±.03  .67±.01  .67±.02  .75±.03        .69±.01
Mixed        adult             .83±.00  .84±.01  .82±.02  .83±.00  .82±.01  .83±.00  .84±.00        .84±.00
             bank-marketing    .91±.00  .91±.00  .90±.00  .90±.00  .90±.00  .90±.00  .90±.00        .90±.00
             heart             .85±.03  .84±.03  .85±.04  .76±.06  .77±.06  .78±.04  .81±.06        .86±.06
Rank of accuracy               2.6±—    —        —        —        —        —        —              —±2.2
(—: value not available)

Table 1 displays the means and standard deviations of the test accuracy for all algorithms (the lead author's laptop ran out of memory when applying RIPPER to the connect-4 data set),

Table 2: Runtime (in seconds) of each comparing algorithm (mean ± std) on ten UCI data sets.

                 BRS1    BRS2  Lasso  C4.5  CART  RIPPER  Random Forest  SVM
connect-4        —       —     —      —     —     —       —              —±35.1
mushroom         8.8±—   —     —      —     —     —       —              —±1.2
tic-tac-toe      1.0±—   —     —      —     —     —       —              —±0.0
chess            44.8±—  —     —      —     —     —       —              —±79.6
magic            31.3±—  —     —      —     —     —       —              —±3.2
banknote         1.0±—   —     —      —     —     —       —              —±0.0
indian-diabetes  0.6±—   —     —      —     —     —       —              —±0.0
adult            50.2±—  —     —      —     —     —       —              —±4.7
bank-marketing   55.2±—  —     —      —     —     —       —              —±28.1
heart            0.5±—   —     —      —     —     —       —              —±0.0
(—: value not available)

and the rank of their average performance. Table 2 displays the runtime. BRS1 represents BRS-BetaBinomial and BRS2 represents BRS-Poisson. While BRS models were under strict restriction for interpretability purposes (L = 3), BRS's performance surpassed that of the other interpretable methods and was on par with the uninterpretable models. This is because, firstly, BRS has Bayesian priors that favor rules with a large support (Theorems 1 and 2), which naturally avoids overfitting the data; and secondly, BRS optimizes a global objective, while decision trees and RIPPER rely on local greedy splitting and pruning methods, and interpretable versions of Lasso are linear in the base features. Decision trees and RIPPER are local optimization methods and tend not to have the same level of performance as globally optimal algorithms, such as SVM, Random Forests, BRS, etc. The class of rule set models and the class of decision tree models are the same: both create regions within input space consisting of a conjunction of conditions. The difference is the choice of optimization method: global vs. local. For the tic-tac-toe data set, the positive class can be classified using exactly eight rules. BRS has the capability to exactly learn these conditions, whereas the greedy splitting and pruning methods that are pervasive throughout the data mining literature (e.g., CART, C4.5) and convexified approximate methods (e.g., SVM) have difficulty with this.
Both linear models and tree models exist that achieve perfect accuracy, but the heuristic splitting/pruning and convexification of the methods we compared with prevented these perfect solutions from being found. BRS achieves accuracies competitive with uninterpretable models while requiring a much shorter runtime, which grows slowly with the size of the data, unlike Random Forest and SVM. This indicates that BRS can reliably produce a good model within a reasonable time for large data sets.

6.1.1 PARAMETER TUNING

A MAP solution maximizes the sum of the logs of the prior and likelihood. The scale of the prior and likelihood directly determines how the model trades off between fitting the data and achieving the desired sparsity level. Again, we choose the BRS-BetaBinomial for this demonstration. The Bayesian model uses parameters {α_l, β_l}_{l=1}^{L} to govern the prior for selecting rules and parameters α+, β+, α−, β− to govern the likelihood of the data. We study how sensitive the results are to different

parameters and how to tune the parameters to get the best performing model. We choose one data set from each category in Table 1. We fixed the prior parameters α_l = 1, β_l = |A_l| for l = 1, …, L, and only varied the likelihood parameters α+, β+, α−, β− to analyze how the performance changes as the magnitude and ratio of the parameters change. To simplify the study, we took α+ = α− = α and β+ = β− = β. We let s = α + β, and s was chosen from {100, 1000, 10000}. We let r = α/(α + β) and varied r within [0, 1]. Here, s and r uniquely define α and β. We constructed models for different s and r and plotted the average out-of-sample accuracy as s and r changed in Figure 6, for three data sets representative of categorical, numerical and mixed data types. The X axis represents r and the Y axis represents s. Accuracies increase as r increases, which is consistent with the intuition for α+ and α−. The highest accuracy is always achieved in the right half of the curve. The performance is less sensitive to the magnitude s and ratio r, especially when r > 0.5. For tic-tac-toe and diabetes, the accuracy became flat once r became greater than a certain threshold, and the performance was not sensitive to either parameter after that point; this is important as it makes tuning the algorithm easy in practice. Generally, taking r close to 1 leads to a satisfactory output. Tables 1 and 2 were generated by models with r = 0.9 and s = .

Figure 6: Parameter tuning experiments. Accuracy vs. r for all data sets. X axis represents r and Y axis represents accuracy.

6.1.2 DEMONSTRATION WITH TIC-TAC-TOE DATA

Let us illustrate the reduction in computation that comes from the bounds in the theorems in Section 4. For this demonstration, we chose the tic-tac-toe data set, since the data can be classified exactly by eight rules, which are naturally an underlying true rule set, and our method recovered the rules successfully, as shown in Table 1. We use BRS-BetaBinomial for this demonstration. First, we used FP-Growth (Borgelt, 2005) to generate rules.
At this step, we find all possible rules with length between 1 and 9 (the data has 9 attributes) that have minimum support of 1 (the rule's itemset must appear at least once in the data set). FP-growth generates a total of rules. We then set the hyper-parameters as: α_l = 1, β_l = |A_l| for l = 1, …, 9, α+ = N+ + 1, β+ = N−, α− = N− + 1, β− = N+. We ran Algorithm 1 on this data set. We kept a list of the best solutions v^[t] found until time t and updated {m_l^[t]} and the minimum support with the new v^[t] according to Theorem 1. We show in Figure 7 the minimum support at different iterations, whenever a better v^[t] is obtained. Within

Figure 7: Updated minimum support at different iterations

ten iterations, the minimum support increased from 1 to 9. If, however, we choose a stronger prior for small models by increasing β_l to 10 times |A_l| for all l, keeping all the other parameters unchanged, then the minimum support increases even faster and reaches 13 very quickly. This is consistent with the intuition that if the prior penalizes large models more heavily, the output tends to have a smaller size; therefore, each rule will need to cover more data. Placing a bound on the minimum support is critical for computation, since the number of possible rules decays exponentially with the support, as shown in Figure 8a. As the minimum support is increased, more rules are removed from the original rule space. Figure 8a shows the percentage of rules left in the solution space as the minimum threshold is increased from 1 to 9 and to 13. At a minimum support of 9, the rule space is reduced to 9.0% of its original size; at a minimum support of 13, the rule space is reduced to 4.7%. This provides intuition for the benefit of Theorem 1.

(a) Reduced rule space as the minimum support increases. (b) Number of rules of different lengths mined at different minimum supports.
Figure 8: Demonstration with the Tic-Tac-Toe data set

Although rules are filtered out based on their support, we observe that longer rules are more heavily reduced than shorter rules. This is because the support of a rule is negatively correlated with its length: the more conditions a rule contains, the fewer observations can satisfy it. Figure 8b shows the distributions of rules across different lengths when mined at different minimum support levels. Before any filtering (i.e., C = 1), more than half of the rules have lengths greater than 5. As C is increased to 9 and 13, these long rules are almost completely removed from the rule space. To summarize, the strength of the prior is important for computation and changes the nature of the problem: a stronger prior can dramatically reduce the size of the search space. Any increase in minimum support that the prior provides leads to a smaller search by the rule mining method and a smaller optimization problem for the BRS algorithm. These are multiple order-of-magnitude decreases in the size of the problem.

6.2 Application to In-vehicle Recommendation System

For this experiment, our goal was to understand customers' responses to recommendations made by an in-vehicle recommender system that provides coupons for local businesses. The coupons would be targeted to the user in his/her particular context. Our data were collected on Amazon Mechanical Turk via a survey that we will describe shortly. We used Turkers with high ratings (95% or above). Out of 752 surveys, 652 were accepted, which generated data cases (after removing rows containing missing attributes). The prediction problem is to predict whether a customer is going to accept a coupon for a particular venue, considering demographic and contextual attributes. Answers indicating that the user will drive there right away or later before the coupon expires are labeled as Y = 1, and answers of "no, I do not want the coupon" are labeled as Y = 0. We are interested in investigating five types of coupons: bars, takeaway food restaurants, coffee houses, cheap restaurants (average expense below $20 per person), and expensive restaurants (average expense between $20 and $50 per person). In the first part of the survey, we asked users to provide their demographic information and preferences.
In the second part, we described 20 different driving scenarios (see examples in Appendix B) to each user, along with additional context information and coupon information (see Appendix B for a full description of the attributes), and asked the user if s/he would use the coupon. For this problem, we wanted to generate simple BRS models that are easy to understand, so we restricted the lengths of rules and the number of rules to yield very sparse models. Before mining rules, we converted each row of data into items, each of which is an attribute-value pair. For categorical attributes, each attribute-value pair was directly coded into a condition. Using marital status as an example, "marital status is single" was converted into (MaritalStatus: Single), (MaritalStatus: Not Married partner), (MaritalStatus: Not Unmarried partner), and (MaritalStatus: Not Widowed). For discretized numerical attributes, the levels are ordered, such as: age is "20 to 25", or "26 to 30", etc.; additionally, each attribute-value pair was converted into two conditions, each using one side of the range. For example, "age is 20 to 25" was converted into (Age: >=20) and (Age: <=25). Then each condition is a half-space defined by threshold values. For the rule mining step, we set the minimum support to be 5% and set the maximum length of rules to be 4. We used the information gain in Equation (15) to select the best 5000 rules to use for BRS. We ran simulated annealing for iterations to obtain a rule set. We compared with interpretable classification algorithms that span the space of widely used methods known for interpretability and accuracy: C4.5, CART, Lasso, RIPPER, and a naïve baseline using the top K rules, referred to as TopK. For C4.5, CART, Lasso, and RIPPER, we used the RWeka package in R and tuned the hyper-parameters to generate different models on a ROC plane. For the TopK method, we varied K from 1 to 10 to produce ten models using the best K pre-mined rules ranked by the information gain in Equation (15). For BRS, we varied the hyperparameters

α+, β+, α−, β− to obtain different sets of rules. For all methods, we picked models on their ROC frontiers and reported their performance below.

Performance in accuracy. To compare their predictive performance, we measured out-of-sample AUC (the area under the ROC curve) from 5-fold testing for all methods, reported in Table 3. The BRS classifiers, while restricted to produce sparse disjunctions of conjunctions, had better performance than decision trees and RIPPER, which use greedy splitting and pruning methods and do not aim to globally optimize. TopK's performance was substantially below that of the other methods. BRS models are also comparable to Lasso, but the form of the model is different. BRS models do not require users to cognitively process coefficients.

          Bar       Takeaway Food  Coffee House  Cheap Restaurant  Expensive Restaurant
BRS1      —(0.013)  —(0.005)       —(0.010)      —(0.022)          —(0.025)
BRS2      —(0.011)  —(0.023)       —(0.007)      —(0.019)          —(0.030)
C4.5      —(0.015)  —(0.051)       —(0.018)      —(0.033)          —(0.027)
CART      —(0.019)  —(0.035)       —(0.013)      —(0.018)          —(0.010)
Lasso     —(0.014)  —(0.042)       —(0.011)      —(0.024)          —(0.017)
RIPPER    —(0.015)  —(0.048)       0.762(0.012)  0.705(0.023)      —(0.034)
TopK      —(0.015)  —(0.024)       0.502(0.012)  0.582(0.023)      —(0.011)
(—: value not available)

Table 3: AUC comparison for the mobile advertisement data set; means and standard deviations over folds are reported.

Performance in complexity. For the same experiment, we would also like to know the complexity of all methods at different accuracy levels. Since the methods we compare have different structures, there is not a straightforward way to directly compare complexity. However, decision trees can be converted into equivalent rule set models. For a decision tree, an example is classified as positive if it falls into any positive leaf. Therefore, we can generate an equivalent model in rule set form for each decision tree, by simply collecting the branches with positive leaves. Therefore, the four algorithms we compare, C4.5, CART, RIPPER, and BRS, have the same form.
To measure the complexity, we count the total number of conditions in the model, which is the sum of the lengths of all rules. This loosely represents the cognitive load needed to understand a model. For each method, we take the models that are used to compute the AUC in Table 3 and plot their accuracy and complexity in Figure 9. BRS models achieved the highest accuracy at the same level of complexity. This is not surprising, given that BRS performs substantially more optimization than the other methods. To show that the benefits of BRS did not come from rule mining or screening using heuristics, we compared the BRS models with TopK models, which rely solely on rule mining and ranking with heuristics. Figure 10 shows that there is a substantial gain in the accuracy of BRS models compared to TopK models. From Table 3 we observe that Lasso achieved consistently good AUC. For the five coupons, the average numbers of nonzero coefficients for Lasso models across folds are 93, 94.6, 90.2, 93.2 and 93.4, which is on the order of 20 times larger than the number of conditions used in BRS and other rule-based models; in this case, the Lasso models are not interpretable.

Figure 9: Test accuracy vs. complexity for BRS and other models on mobile advertisement data sets for different coupons

Figure 10: Test accuracy vs. complexity for BRS and TopK on mobile advertisement data sets for different coupons

Examples of BRS models. In practice, for this particular application, the benefits of interpretability far outweigh small improvements in accuracy. An interpretable model can be useful to a vendor

choosing whether to provide a coupon and what type of coupon to provide; it can be useful to users of the recommender system; and it can be useful to the designers of the recommender system, to understand the population of users and the correlations with successful use of the system. As discussed in the introduction, rule set classifiers are particularly natural for representing consumer behavior, particularly consideration sets, as modeled here. We show several classifiers produced by BRS in Figure 11, where the curves were produced by models from the experiments discussed above. Example rule sets are listed in each box along the curve. For instance, the classifier near the middle of the curve in Figure 11(a) has one rule, and reads "If a person visits a bar at least once per month, is not traveling with kids, and their occupation is not farming/fishing/forestry, then predict the person will use the coupon for a bar before it expires." In these examples (and generally), we see that a user's general interest in a coupon's venue (bar, coffee shop, etc.) is the most relevant attribute to the classification outcome; it appears in every rule in the two figures.

(a) Coupons for bars  (b) Coupons for coffee houses
Figure 11: ROC for data sets of coupons for bars and coffee houses.

7. Conclusion

We presented a method that produces rule set (disjunctive normal form) models, where the shape of the model can be controlled by the user through Bayesian priors. In some applications, such as those arising in customer behavior modeling, the form of these models may be more useful than traditional linear models. Since finding sparse models is computationally hard, most approaches take severe heuristic approximations (such as greedy splitting and pruning in the case of decision trees, or convexification in the case of linear models). These approximations can severely hurt performance, as is easily shown experimentally, using data sets whose ground truth formulas are not difficult to find.
We chose a different type of approximation, where we make an up-front statistical assumption in building our models out of pre-mined rules, and aim to find the globally optimal

solution in the reduced space of rules. We then find theoretical conditions under which using pre-mined rules provably does not change the set of MAP optimal solutions. These conditions relate the size of the data set to the strength of the prior. If the prior is sufficiently strong and the data set is not too large, the set of pre-mined rules is provably sufficient for finding an optimal rule set. We showed the benefits of this approach on a consumer behavior modeling application of current interest to "connected vehicle" projects. Our results, using data from an extensive survey taken by several hundred individuals, show that simple rules based on a user's context can be directly useful in predicting the user's response.

Appendix A.

Proof 1 (of Lemma 1) Let TP, FP, TN, and FN be the numbers of true positives, false positives, true negatives and false negatives in S as classified by A. We now compute the likelihood for model A\z. The most extreme case is when rule z is a 100% accurate rule that applies only to real positive data points, and those data points satisfy only z. Therefore, once it is removed, the number of true positives decreases by supp(z) and the number of false negatives increases by supp(z). That is,

    p(S|A\z) ≥ [B(TP − supp(z) + α+, FP + β+) / B(α+, β+)] · [B(TN + α−, FN + supp(z) + β−) / B(α−, β−)] = p(S|A) · g₃(supp(z)),    (17)

where

    g₃(supp(z)) = [Γ(TP + α+ − supp(z)) / Γ(TP + α+)] · [Γ(FN + β− + supp(z)) / Γ(FN + β−)] · [Γ(TP + FP + α+ + β+) / Γ(TP + FP + α+ + β+ − supp(z))]    (18)
                  · [Γ(TN + FN + α− + β−) / Γ(TN + FN + α− + β− + supp(z))].    (19)

Now we break down g₃(supp(z)) to find a lower bound for it. The first and third terms in (18) become

    [Γ(TP + α+ − supp(z)) / Γ(TP + α+)] · [Γ(TP + FP + α+ + β+) / Γ(TP + FP + α+ + β+ − supp(z))]
    = [(TP + FP + α+ + β+ − supp(z)) ⋯ (TP + FP + α+ + β+ − 1)] / [(TP + α+ − supp(z)) ⋯ (TP + α+ − 1)]
    ≥ ( (N − supp(z)) / N )^{supp(z)}.    (20)

Equality holds in (20) when TP = N₊, FP = 0. Similarly, the last two terms in (18) become

$$\frac{\Gamma(\mathrm{FN} + \beta_- + \mathrm{supp}(z))\,\Gamma(\mathrm{TN} + \mathrm{FN} + \alpha_- + \beta_-)}{\Gamma(\mathrm{FN} + \beta_-)\,\Gamma(\mathrm{TN} + \mathrm{FN} + \alpha_- + \beta_- + \mathrm{supp}(z))} = \frac{(\mathrm{FN} + \beta_-)\cdots(\mathrm{FN} + \beta_- + \mathrm{supp}(z) - 1)}{(\mathrm{TN} + \mathrm{FN} + \alpha_- + \beta_-)\cdots(\mathrm{TN} + \mathrm{FN} + \alpha_- + \beta_- + \mathrm{supp}(z) - 1)}$$

$$\ge\ \left(\frac{\mathrm{FN} + \beta_-}{\mathrm{TN} + \mathrm{FN} + \alpha_- + \beta_- + \mathrm{supp}(z)}\right)^{\mathrm{supp}(z)} \ \ge\ \left(\frac{\beta_-}{N_- + \alpha_- + \beta_- + \mathrm{supp}(z)}\right)^{\mathrm{supp}(z)}. \tag{21}$$

Equality in (21) holds when TN = N₋, FN = 0. Combining (17), (18), (20) and (21), we obtain

$$P(S \mid A_{\setminus z}) \ \ge\ \left(\frac{(N_+ + \alpha_+ + \beta_+ - \mathrm{supp}(z))\,\beta_-}{(N_+ + \alpha_+)(N_- + \alpha_- + \beta_- + \mathrm{supp}(z))}\right)^{\mathrm{supp}(z)} p(S \mid A) \ =\ \rho^{\mathrm{supp}(z)}\, p(S \mid A). \tag{22}$$

Proof 2 (of Theorem 1)

Step 1: We first prove the upper bound m_l^[t]. Since A* ∈ arg max_A F(A), F(A*) ≥ v^[t], i.e.,

$$\log p(S \mid A^*) + \log p(A^*) \ \ge\ v^{[t]}. \tag{23}$$

We then upper bound the two terms on the left-hand side. Let M_l denote the number of rules of length l in A*. The prior probability of selecting A* from A is

$$p(A^*) = p(\emptyset) \prod_{l=1}^{L} \frac{B(M_l + \alpha_l,\ |A_l| - M_l + \beta_l)}{B(\alpha_l,\ |A_l| + \beta_l)}.$$

We observe that when 0 ≤ M_l ≤ |A_l|, B(M_l + α_l, |A_l| − M_l + β_l) ≤ B(α_l, |A_l| + β_l), giving

$$p(A^*) \ \le\ p(\emptyset)\,\frac{B(M_{l'} + \alpha_{l'},\ |A_{l'}| - M_{l'} + \beta_{l'})}{B(\alpha_{l'},\ |A_{l'}| + \beta_{l'})} = p(\emptyset)\,\frac{\alpha_{l'}(\alpha_{l'} + 1)\cdots(M_{l'} + \alpha_{l'} - 1)}{(|A_{l'}| - M_{l'} + \beta_{l'})\cdots(|A_{l'}| + \beta_{l'} - 1)} \ \le\ p(\emptyset)\left(\frac{M_{l'} + \alpha_{l'}}{|A_{l'}|}\right)^{M_{l'}} \tag{24}$$

for all l′ ∈ {1, …, L}. The likelihood is upper bounded by

$$p(S \mid A^*) \ \le\ \mathcal{L}. \tag{25}$$

Substituting (24) and (25) into (23) we get

$$M_{l'} \log\frac{M_{l'} + \alpha_{l'}}{|A_{l'}|} + \log \mathcal{L} + \log p(\emptyset) \ \ge\ v^{[t]}, \tag{26}$$

which gives

$$M_{l'} \ \le\ \frac{\log \mathcal{L} + \log p(\emptyset) - v^{[t]}}{\log\frac{|A_{l'}|}{M_{l'} + \alpha_{l'}}} \ \le\ \frac{\log \mathcal{L} + \log p(\emptyset) - v^{[t]}}{\log\frac{|A_{l'}|}{m_{l'}^{[t-1]} + \alpha_{l'}}}, \tag{27}$$

where (27) follows because m_{l′}^[t−1] is an upper bound on M_{l′}. Since M_{l′} has to be an integer, we get

$$M_{l'} \ \le\ \left\lfloor \frac{\log \mathcal{L} + \log p(\emptyset) - v^{[t]}}{\log\frac{|A_{l'}|}{m_{l'}^{[t-1]} + \alpha_{l'}}} \right\rfloor = m_{l'}^{[t]} \quad \text{for all } l' \in \{1, \ldots, L\}. \tag{28}$$

Thus

$$M = \sum_{l=1}^{L} M_l \ \le\ \sum_{l=1}^{L} m_l^{[t]}. \tag{29}$$

Step 2: Now we prove the lower bound on the support. We would like to prove that a MAP model does not contain rules of support less than a threshold. To show this, we prove that if any rule z has support smaller than some constant C, then removing it yields a better objective, i.e.,

$$F(A) \ \le\ F(A_{\setminus z}). \tag{30}$$

Our goal is to find conditions on C such that this inequality holds. Assume that in rule set A, M_l rules come from the pool A_l of rules with length l, l ∈ {1, …, L}, and that rule z has length l′, so it is drawn from A_{l′}. A\z consists of the same rules as A except missing one rule from A_{l′}. We must have:

$$p(A_{\setminus z}) = p(A)\,\frac{B(M_{l'} - 1 + \alpha_{l'},\ |A_{l'}| - M_{l'} + 1 + \beta_{l'})}{B(M_{l'} + \alpha_{l'},\ |A_{l'}| - M_{l'} + \beta_{l'})} = p(A)\,\frac{|A_{l'}| - M_{l'} + \beta_{l'}}{M_{l'} + \alpha_{l'} - 1}.$$

The factor (|A_{l′}| − M_{l′} + β_{l′})/(M_{l′} + α_{l′} − 1) decreases monotonically as M_{l′} increases, so it is lower bounded at the upper bound on M_{l′}, m_{l′}^[t], obtained in Step 1. Therefore

$$\frac{|A_{l'}| - M_{l'} + \beta_{l'}}{M_{l'} + \alpha_{l'} - 1} \ \ge\ \frac{|A_{l'}| - m_{l'}^{[t]} + \beta_{l'}}{m_{l'}^{[t]} + \alpha_{l'} - 1} \quad \text{for } l' \in \{1, \ldots, L\}.$$

Thus

$$p(A_{\setminus z}) \ \ge\ \min_{l \in \{1,\ldots,L\}} \frac{|A_l| - m_l^{[t]} + \beta_l}{m_l^{[t]} + \alpha_l - 1}\; p(A). \tag{31}$$

Combining (31) and Lemma 1, the joint probability of S and A\z is bounded by

$$P(S, A_{\setminus z}) = p(A_{\setminus z})\,P(S \mid A_{\setminus z}) \ \ge\ \min_{l} \frac{|A_l| - m_l^{[t]} + \beta_l}{m_l^{[t]} + \alpha_l - 1}\;\rho^{\mathrm{supp}(z)}\, P(S, A).$$

In order to get P(S, A\z) ≥ P(S, A), and with ρ ≤ 1 from the assumption in the theorem's statement, we must have

$$\mathrm{supp}(z) \ \le\ \frac{\log \min_{l} \frac{|A_l| - m_l^{[t]} + \beta_l}{m_l^{[t]} + \alpha_l - 1}}{\log(1/\rho)}.$$

This means that if there exists z ∈ A such that supp(z) is at most this threshold, then A ∉ arg max_{A′} F(A′); equivalently, every rule z in a MAP model must have support above it. Since the support of a rule has to be an integer,

$$\mathrm{supp}(z) \ \ge\ \left\lceil \frac{\log \min_{l} \frac{|A_l| - m_l^{[t]} + \beta_l}{m_l^{[t]} + \alpha_l - 1}}{\log(1/\rho)} \right\rceil.$$

Before proving Theorem 2, we first prove the following lemma, which we will need later.

Lemma 2 Define the function

$$g(l) = \left(\frac{\lambda}{2}\right)^{l} \frac{\Gamma(J - l + 1)}{\Gamma(J + 1)}, \qquad \lambda, J \in \mathbb{N}_+.$$

If 1 ≤ l ≤ J, then

$$g(l) \ \le\ \max\left\{\frac{\lambda}{2J},\ \left(\frac{\lambda}{2}\right)^{J} \frac{1}{\Gamma(J + 1)}\right\}.$$

Proof 3 (of Lemma 2) In order to bound g(l), we will show that g(l) is convex, which means its maximum value occurs at the endpoints of the interval we are considering. The second derivative of g(l) with respect to l is

$$g''(l) = g(l)\left[\left(\log\frac{\lambda}{2} + \gamma - \sum_{k=1}^{\infty}\left(\frac{1}{k} - \frac{1}{k + J - l}\right)\right)^{2} + \sum_{k=1}^{\infty}\frac{1}{(k + J - l)^{2}}\right] > 0, \tag{32}$$

since at least one of the terms 1/(k + J − l)² is strictly positive. Thus g(l) is strictly convex. Therefore the maximum of g(l) is achieved at the boundary of the range of L_m, namely 1 or J. So we have

$$g(l) \ \le\ \max\{g(1), g(J)\} = \max\left\{\frac{\lambda}{2J},\ \left(\frac{\lambda}{2}\right)^{J} \frac{1}{\Gamma(J + 1)}\right\}. \tag{33}$$

Proof 4 (of Theorem 2) We follow similar steps to the proof of Theorem 1.

Step 1: We first derive an upper bound on M* at time t, denoted as M^[t]. The probability of selecting a rule set A* depends on the number of rules M* and the length of each rule, which we denote as L_m, m ∈ {1, …, M*}, so the prior probability of selecting A* is

$$P(A^*; \lambda, \eta) = \omega(\lambda, \eta)\,\mathrm{Poisson}(M^*; \lambda) \prod_{m=1}^{M^*} \mathrm{Poisson}(L_m; \eta)\,\frac{\Gamma(J - L_m + 1)}{\Gamma(J + 1)} \prod_{k=1}^{L_m} \frac{1}{K_{v_{m,k}}}$$

$$\le\ p(\emptyset)\,\frac{\lambda^{M^*}}{\Gamma(M^* + 1)} \prod_{m=1}^{M^*} e^{-\eta} \left(\frac{\eta}{2}\right)^{L_m} \frac{\Gamma(J - L_m + 1)}{\Gamma(J + 1)}, \tag{34}$$

where (34) follows from K_{v_{m,k}} ≥ 2, since all attributes have at least two values. Using Lemma 2 we have that

$$\left(\frac{\eta}{2}\right)^{L_m} \frac{\Gamma(J - L_m + 1)}{\Gamma(J + 1)} \ \le\ \max\left\{\frac{\eta}{2J},\ \left(\frac{\eta}{2}\right)^{J} \frac{1}{\Gamma(J + 1)}\right\}. \tag{35}$$

Combining (34) and (35), and the definition of x, we have

$$P(A^*; \lambda, \eta) \ \le\ p(\emptyset)\,\frac{x^{M^*}}{\Gamma(M^* + 1)}. \tag{36}$$

Now we apply (23), combined with (36) and (25), and get

$$\log \mathcal{L} + \log p(\emptyset) + \log\frac{x^{M^*}}{\Gamma(M^* + 1)} \ \ge\ v^{[t]}. \tag{37}$$

Now we want to upper bound x^{M*}/Γ(M* + 1). If M* ≤ λ, the statement of the theorem holds trivially. For the remainder of the proof we consider M* > λ, where x^{M*}/Γ(M* + 1) is upper bounded by

$$\frac{x^{M^*}}{M^*!} \ \le\ \frac{x^{\lambda}}{\lambda!}\left(\frac{x}{\lambda + 1}\right)^{M^* - \lambda},$$

where in the denominator we used Γ(M* + 1) ≥ Γ(λ + 1)(λ + 1)^{M* − λ}. So we have

$$\log \mathcal{L} + \log p(\emptyset) + \log\frac{x^{\lambda}}{\Gamma(\lambda + 1)} + (M^* - \lambda)\log\frac{x}{\lambda + 1} \ \ge\ v^{[t]}. \tag{38}$$

We have x/(λ + 1) ≤ 1. To see this, note that η e^{−η}/(2J) < 1 and e^{−η}(η/2)^J/Γ(J + 1) < 1 for every η and J, so x < λ + 1. Then solving for M* in (38), using x/(λ + 1) < 1 to determine the direction of the inequality, yields:

$$M^* \ \le\ M^{[t]} = \left\lfloor \lambda + \frac{\log \mathcal{L} + \log p(\emptyset) + \log\frac{x^{\lambda}}{\Gamma(\lambda + 1)} - v^{[t]}}{\log\frac{\lambda + 1}{x}} \right\rfloor \tag{39}$$

$$=\ \left\lfloor \frac{\lambda \log(\lambda + 1) - \log\Gamma(\lambda + 1) + \log \mathcal{L} + \log p(\emptyset) - v^{[t]}}{\log\frac{\lambda + 1}{x}} \right\rfloor. \tag{40}$$

Step 2: Now we prove the lower bound on the support of rules in an optimal set A*. Similar to the proof for Theorem 1, we will show that for a rule set A, if any rule z ∈ A has support supp(z) < C on data S, then A ∉ arg max_{A′} F(A′). Assume rule z has support less than C, and let A\z be A with rule z removed. Assume A consists of M rules and that rule z has length L_z. We relate P(A\z; λ, η) to P(A; λ, η) by multiplying P(A\z) by 1 in disguise:

$$P(A_{\setminus z}) = \frac{M}{\lambda}\,\Gamma(L_z + 1)\, e^{\eta}\, \frac{\Gamma(J + 1)}{\eta^{L_z}\,\Gamma(J - L_z + 1)} \prod_{k=1}^{L_z} K_{v_{z,k}}\; P(A) \ \ge\ \frac{M}{\lambda}\, e^{\eta}\, \frac{2^{L_z}\,\Gamma(J + 1)}{\eta^{L_z}\,\Gamma(J - L_z + 1)}\; P(A) \tag{41}$$

$$=\ \frac{M}{\lambda}\,\frac{e^{\eta}}{g(L_z; \eta, J)}\; P(A) \tag{42}$$

$$\ge\ \frac{M}{x}\; P(A), \tag{43}$$

where (41) follows from K_{v_{z,k}} ≥ 2, since all attributes have at least two values, (42) follows from the definition of g(·) in Lemma 2, and (43) uses the upper bound on g from Lemma 2 and the definition of x. Then, combining (22) with (43), the joint probability of S and A\z is lower bounded by

$$P(S, A_{\setminus z}) = P(A_{\setminus z})\,P(S \mid A_{\setminus z}) \ \ge\ \frac{M}{x}\,\rho^{\mathrm{supp}(z)}\; P(S, A).$$

In order to get P(S, A\z) ≥ P(S, A), we need (M/x)ρ^{supp(z)} ≥ 1. We have ρ ≤ 1 and M ≤ M^[t], which gives

$$\rho^{\mathrm{supp}(z)} \ \ge\ \frac{x}{M^{[t]}}, \qquad \text{i.e.,} \qquad \mathrm{supp}(z) \ \le\ \frac{\log\frac{x}{M^{[t]}}}{\log \rho}. \tag{44}$$

Therefore, for any rule z in a MAP model A,

$$\mathrm{supp}(z) \ \ge\ \frac{\log\frac{x}{M^{[t]}}}{\log \rho}. \tag{45}$$

Appendix B. Mobile Advertisement Data Sets

The attributes of this data set include:

1. User attributes
Gender: male, female
Age: below 21, 21 to 25, 26 to 30, etc.
Marital status: single, married partner, unmarried partner, or widowed
Number of children: 0, 1, or more than 1
Education: high school, bachelors degree, associates degree, or graduate degree
Occupation: architecture & engineering, business & financial, etc.
Annual income: less than $12500, $12500 - $24999, $25000 - $37499, etc.
Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8, or greater than 8
Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8, or greater than 8
Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8, or greater than 8
Number of times that he/she eats at a restaurant with average expense less than $20 per person: 0, less than 1, 1 to 3, 4 to 8, or greater than 8
Number of times that he/she eats at a restaurant with average expense $20 to $50 per person: 0, less than 1, 1 to 3, 4 to 8, or greater than 8

2. Contextual attributes
Driving destination: home, work, or no urgent destination
Location of user, coupon and destination: we provide a map to show the geographical location of the user, destination, and the venue, and we mark the distance between each two places with time of driving. The user can see whether the venue is in the same direction as the destination.
Weather: sunny, rainy, or snowy
Temperature: 30°F, 55°F, or 80°F
Time: 10AM, 2PM, or 6PM
Passenger: alone, partner, kid(s), or friend(s)

3. Coupon attributes
Time before it expires: 2 hours or one day

All coupons provide a 20% discount. The survey was divided into different parts, so that Turkers without children would never see a scenario where their "kids" were in the vehicle. Figures 12 and 13 show two examples of scenarios in the survey.
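For rule pre-mining, categorical attributes like the ones listed above are expanded into binary items, one per attribute-value pair; a candidate rule is then a conjunction of items, and its support is the number of records containing all of them. A minimal sketch of that encoding (attribute names abbreviated; this is not necessarily the paper's exact preprocessing):

```python
def binarize(record):
    # one binary item per (attribute, value) pair
    return {f"{attr}={val}" for attr, val in record.items()}

def support(rule_items, data):
    # number of records satisfying every condition in the rule
    return sum(rule_items <= binarize(r) for r in data)

data = [
    {"passenger": "kid(s)", "weather": "sunny", "coffee_house_visits": "1~3"},
    {"passenger": "alone", "weather": "sunny", "coffee_house_visits": "never"},
]
items = binarize(data[0])
print("passenger=kid(s)" in items)       # True
print(support({"weather=sunny"}, data))  # 2
```

Frequent itemset miners (e.g. Apriori or FP-Growth) operate on exactly this representation, which is how the candidate rule pools A_l in the theorems are produced.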

Figure 12: Example 1 of a scenario in the survey

Figure 13: Example 2 of a scenario in the survey


More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance Efficient Simiarity Search across Top-k Lists under the Kenda s Tau Distance Koninika Pa TU Kaisersautern Kaisersautern, Germany pa@cs.uni-k.de Sebastian Miche TU Kaisersautern Kaisersautern, Germany smiche@cs.uni-k.de

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

BDD-Based Analysis of Gapped q-gram Filters

BDD-Based Analysis of Gapped q-gram Filters BDD-Based Anaysis of Gapped q-gram Fiters Marc Fontaine, Stefan Burkhardt 2 and Juha Kärkkäinen 2 Max-Panck-Institut für Informatik Stuhsatzenhausweg 85, 6623 Saarbrücken, Germany e-mai: stburk@mpi-sb.mpg.de

More information

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks ower Contro and Transmission Scheduing for Network Utiity Maximization in Wireess Networks Min Cao, Vivek Raghunathan, Stephen Hany, Vinod Sharma and. R. Kumar Abstract We consider a joint power contro

More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne 17th European Signa Processing Conference (EUSIPCO 2009) Gasgow, Scotand, August 24-28, 2009 PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR Pierric Bruneau, Marc Gegon and Fabien

More information

SVM-based Supervised and Unsupervised Classification Schemes

SVM-based Supervised and Unsupervised Classification Schemes SVM-based Supervised and Unsupervised Cassification Schemes LUMINITA STATE University of Pitesti Facuty of Mathematics and Computer Science 1 Targu din Vae St., Pitesti 110040 ROMANIA state@cicknet.ro

More information

Technical Appendix for Voting, Speechmaking, and the Dimensions of Conflict in the US Senate

Technical Appendix for Voting, Speechmaking, and the Dimensions of Conflict in the US Senate Technica Appendix for Voting, Speechmaking, and the Dimensions of Confict in the US Senate In Song Kim John Londregan Marc Ratkovic Juy 6, 205 Abstract We incude here severa technica appendices. First,

More information

Effective Appearance Model and Similarity Measure for Particle Filtering and Visual Tracking

Effective Appearance Model and Similarity Measure for Particle Filtering and Visual Tracking Effective Appearance Mode and Simiarity Measure for Partice Fitering and Visua Tracking Hanzi Wang David Suter and Konrad Schinder Institute for Vision Systems Engineering Department of Eectrica and Computer

More information

Reichenbachian Common Cause Systems

Reichenbachian Common Cause Systems Reichenbachian Common Cause Systems G. Hofer-Szabó Department of Phiosophy Technica University of Budapest e-mai: gszabo@hps.ete.hu Mikós Rédei Department of History and Phiosophy of Science Eötvös University,

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model Appendix of the Paper The Roe of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Mode Caio Ameida cameida@fgv.br José Vicente jose.vaentim@bcb.gov.br June 008 1 Introduction In this

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

How many random edges make a dense hypergraph non-2-colorable?

How many random edges make a dense hypergraph non-2-colorable? How many random edges make a dense hypergraph non--coorabe? Benny Sudakov Jan Vondrák Abstract We study a mode of random uniform hypergraphs, where a random instance is obtained by adding random edges

More information

Rate-Distortion Theory of Finite Point Processes

Rate-Distortion Theory of Finite Point Processes Rate-Distortion Theory of Finite Point Processes Günther Koiander, Dominic Schuhmacher, and Franz Hawatsch, Feow, IEEE Abstract We study the compression of data in the case where the usefu information

More information

Fast Blind Recognition of Channel Codes

Fast Blind Recognition of Channel Codes Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.

More information

Combining reaction kinetics to the multi-phase Gibbs energy calculation

Combining reaction kinetics to the multi-phase Gibbs energy calculation 7 th European Symposium on Computer Aided Process Engineering ESCAPE7 V. Pesu and P.S. Agachi (Editors) 2007 Esevier B.V. A rights reserved. Combining reaction inetics to the muti-phase Gibbs energy cacuation

More information

AALBORG UNIVERSITY. The distribution of communication cost for a mobile service scenario. Jesper Møller and Man Lung Yiu. R June 2009

AALBORG UNIVERSITY. The distribution of communication cost for a mobile service scenario. Jesper Møller and Man Lung Yiu. R June 2009 AALBORG UNIVERSITY The distribution of communication cost for a mobie service scenario by Jesper Møer and Man Lung Yiu R-29-11 June 29 Department of Mathematica Sciences Aaborg University Fredrik Bajers

More information

A Better Way to Pretrain Deep Boltzmann Machines

A Better Way to Pretrain Deep Boltzmann Machines A Better Way to Pretrain Deep Botzmann Machines Rusan Saakhutdino Department of Statistics and Computer Science Uniersity of Toronto rsaakhu@cs.toronto.edu Geoffrey Hinton Department of Computer Science

More information

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees Improving the Accuracy of Booean Tomography by Expoiting Path Congestion Degrees Zhiyong Zhang, Gaoei Fei, Fucai Yu, Guangmin Hu Schoo of Communication and Information Engineering, University of Eectronic

More information