Examples are not Enough, Learn to Criticize! Criticism for Interpretability
Been Kim (Allen Institute for AI), Rajiv Khanna (UT Austin), Oluwasanmi Koyejo (UIUC)

Abstract

Example-based explanations are widely used in the effort to improve the interpretability of highly complex distributions. However, prototypes alone are rarely sufficient to represent the gist of the complexity. In order for users to construct better mental models and understand complex data distributions, we also need criticism to explain what is not captured by prototypes. Motivated by the Bayesian model criticism framework, we develop MMD-critic, which efficiently learns prototypes and criticism, designed to aid human interpretability. A human subject pilot study shows that MMD-critic selects prototypes and criticism that are useful to facilitate human understanding and reasoning. We also evaluate the prototypes selected by MMD-critic via a nearest prototype classifier, showing competitive performance compared to baselines.

1 Introduction and Related Work

As machine learning (ML) methods have become ubiquitous in human decision making, their transparency and interpretability have grown in importance (Varshney, 2016). Interpretability is particularly important in domains where decisions can have significant consequences. For example, the pneumonia risk prediction case study in Caruana et al. (2015) showed that a more interpretable model could reveal important but surprising patterns in the data that complex models overlooked. Studies of human reasoning have shown that the use of examples (prototypes) is fundamental to the development of effective strategies for tactical decision-making (Newell and Simon, 1972; Cohen et al., 1996). Example-based explanations are widely used in the effort to improve interpretability. A popular research program along these lines is case-based reasoning (CBR) (Aamodt and Plaza, 1994), which has been successfully applied to real-world problems (Bichindaritz and Marling, 2006).
More recently, the Bayesian framework has been combined with CBR-based approaches in the unsupervised-learning setting, leading to improvements in user interpretability (Kim et al., 2014). In a supervised learning setting, example-based classifiers have been shown to achieve comparable performance to non-interpretable methods, while offering a condensed view of a dataset (Bien and Tibshirani, 2011). However, examples are not enough. Relying only on examples to explain a model's behavior can lead to over-generalization and misunderstanding. Examples alone may be sufficient when the distribution of data points is clean, in the sense that there exists a set of prototypical examples which sufficiently represent the data. However, this is rarely the case in real-world data. For instance, fitting models to complex datasets often requires the use of regularization. While regularization adds bias to the model to improve generalization performance, this same bias may conflict with the distribution of the data. Thus, to maintain interpretability, it is important, along with prototypical examples, to deliver insights signifying the parts of the input space where prototypical examples do not provide good explanations. We call the data points that do not quite fit the model criticism samples. Together with prototypes, criticism can help humans build a better mental model of the complex data space.

Bayesian model criticism (BMC) is a framework for evaluating fitted Bayesian models, and was developed to aid model development and selection by helping to identify where and how a particular model may fail to explain the data. It has quickly developed into an important part of model design, and Bayesian statisticians now view model criticism as an important component in the cycle of model construction, inference and criticism (Gelman et al., 2014). Lloyd and Ghahramani (2015) recently proposed an exploratory approach for statistical model criticism using the maximum mean discrepancy (MMD) two-sample test, and explored the use of the witness function to identify the portions of the input space where the model most misrepresents the data. Instead of using the MMD to compare two models as in classic two-sample testing (Gretton et al., 2008), or to compare the model to input data as in the Bayesian model criticism of Lloyd and Ghahramani (2015), we consider a novel application of the MMD and its associated witness function as a principled approach for selecting prototype and criticism samples.

We present MMD-critic, a scalable framework for prototype and criticism selection to improve the interpretability of machine learning methods. To the best of our knowledge, ours is the first work which leverages the BMC framework to generate explanations for machine learning methods. MMD-critic uses the MMD statistic as a measure of similarity between points and potential prototypes, and efficiently selects prototypes that maximize the statistic. In addition to prototypes, MMD-critic selects criticism samples, i.e. samples that are not well-explained by the prototypes, using a regularized witness function score.

(All authors contributed equally. 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.)
The scalability follows from our analysis, where we show that under certain conditions, the MMD for prototype selection is a supermodular set function. Our supermodularity proof is general and may be of independent interest. While we are primarily concerned with prototype selection and criticism, we quantitatively evaluate the performance of MMD-critic as a nearest prototype classifier, and show that it achieves comparable performance to existing methods. We also present results from a human subject pilot study which shows that including criticism together with prototypes is helpful for an end-task that requires the data distributions to be well-explained.

2 Preliminaries

This section includes notation and a few important definitions. Vectors are denoted by lower case x and matrices by capital X. The Euclidean inner product between matrices A and B is given by ⟨A, B⟩ = Σ_{i,j} a_{i,j} b_{i,j}. Let det(X) denote the determinant of X. Sets are denoted by sans serif, e.g. S. The reals are denoted by R. [n] denotes the set of integers {1, ..., n}, and 2^V denotes the power set of V. The indicator function 1_[a] takes the value 1 if its argument a is true and 0 otherwise. We denote probability distributions by either P or Q. The notation |·| denotes cardinality when applied to sets, and absolute value when applied to real values.

2.1 Maximum Mean Discrepancy (MMD)

The maximum mean discrepancy (MMD) is a measure of the difference between distributions P and Q, given by the supremum over a function space F of differences between the expectations with respect to the two distributions. The MMD is given by:

MMD(F, P, Q) = sup_{f ∈ F} ( E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)] ).   (1)

When F is a reproducing kernel Hilbert space (RKHS) with kernel function k : X × X → R, the supremum is achieved at (Gretton et al., 2008):

f(x) = E_{X'∼P}[k(x, X')] − E_{X'∼Q}[k(x, X')].   (2)

The function (2) is also known as the witness function, as it measures the maximum discrepancy between the two expectations in F.
Observe that the witness function is positive wherever Q underfits the density of P, and negative wherever Q overfits P. We can substitute (2) into (1) and square the result, leading to:

MMD²(F, P, Q) = E_{X,X'∼P}[k(X, X')] − 2 E_{X∼P, Y∼Q}[k(X, Y)] + E_{Y,Y'∼Q}[k(Y, Y')].   (3)
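These population quantities are estimated from samples in practice. As a minimal illustration (our own sketch, not the authors' code), the following NumPy fragment computes a biased finite-sample estimate of the squared MMD and evaluates the empirical witness function for an RBF kernel; the names `rbf_kernel`, `mmd2`, and `witness` are chosen here for illustration:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dists = np.sum(X**2, axis=1)[:, None] - 2.0 * X @ Z.T + np.sum(Z**2, axis=1)[None, :]
    return np.exp(-gamma * sq_dists)

def mmd2(X, Z, gamma=1.0):
    """Biased finite-sample estimate of the squared MMD between samples X and Z."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Z, gamma).mean()
            + rbf_kernel(Z, Z, gamma).mean())

def witness(x, X, Z, gamma=1.0):
    """Empirical witness function evaluated at the rows of x: mean similarity
    to the sample from P minus mean similarity to the sample from Q."""
    return rbf_kernel(x, X, gamma).mean(axis=1) - rbf_kernel(x, Z, gamma).mean(axis=1)
```

The estimate is zero when the two samples coincide and grows as the samples separate; the witness is positive where the first sample is dense and the second is not.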
It is clear that MMD²(F, P, Q) ≥ 0, and MMD²(F, P, Q) = 0 iff P is indistinguishable from Q on the RKHS F. This population definition can be approximated using sample expectations. In particular, given n samples from P as X = {x_i ∼ P, i ∈ [n]}, and m samples from Q as Z = {z_j ∼ Q, j ∈ [m]}, the following is a finite sample approximation:

MMD²_b(F, X, Z) = (1/n²) Σ_{i,j ∈ [n]} k(x_i, x_j) − (2/(nm)) Σ_{i ∈ [n], j ∈ [m]} k(x_i, z_j) + (1/m²) Σ_{i,j ∈ [m]} k(z_i, z_j),   (4)

and the witness function is approximated as:

f(x) = (1/n) Σ_{i ∈ [n]} k(x, x_i) − (1/m) Σ_{j ∈ [m]} k(x, z_j).   (5)

3 MMD-critic for Prototype Selection and Criticism

Given n samples from a statistical model X = {x_i, i ∈ [n]}, let S ⊆ [n] represent a subset of the indices, so that X_S = {x_i | i ∈ S}. Given an RKHS with kernel function k(·, ·), we can measure the maximum mean discrepancy between the samples and any selected subset using MMD²(F, X, X_S). MMD-critic selects prototype indices S which minimize MMD²(F, X, X_S). For our purposes, it will be convenient to pose the problem as a normalized discrete maximization. To this end, consider the following cost function, given by the negation of MMD²(F, X, X_S) with an additive bias:

J_b(S) = (1/n²) Σ_{i,j=1}^n k(x_i, x_j) − MMD²(F, X, X_S)
       = (2/(n|S|)) Σ_{i ∈ [n], j ∈ S} k(x_i, y_j) − (1/|S|²) Σ_{i,j ∈ S} k(y_i, y_j).   (6)

Note that the additive bias MMD²(F, X, ∅) = (1/n²) Σ_{i,j=1}^n k(x_i, x_j) is a constant with respect to S. Further, J_b(S) is normalized, since, when evaluated on the empty set, we have:

J_b(∅) = min_{S ∈ 2^[n]} J_b(S) = (1/n²) Σ_{i,j=1}^n k(x_i, x_j) − (1/n²) Σ_{i,j=1}^n k(x_i, x_j) = 0.

MMD-critic selects prototypes as the subset of indices S ⊆ [n] which optimize:

max_{S ∈ 2^[n], |S| ≤ m*} J_b(S).   (7)

For the purposes of optimizing the cost function (6), it will prove useful to exploit its linearity with respect to the kernel entries. The following Lemma is easily shown by enumeration.

Lemma 1. Let J_b(·) be defined as in (6); then J_b(·) is a linear function of the kernel entries k(x_i, x_j).
In particular, define K ∈ R^{n×n} with k_{i,j} = k(x_i, x_j), and A(S) ∈ R^{n×n} with entries a_{i,j}(S) = (2/(n|S|)) 1_[j ∈ S] − (1/|S|²) 1_[i ∈ S] 1_[j ∈ S]; then J_b(S) = ⟨A(S), K⟩.

3.1 Submodularity and Efficient Prototype Selection

While the discrete optimization problem (6) may be quite complicated to optimize, we show that the cost function J_b(S) is monotone submodular under conditions on the kernel matrix which are often satisfied in practice, and which can be easily checked given a kernel matrix. Based on this result, we describe the greedy forward selection algorithm for efficient prototype selection. Let F : 2^[n] → R represent a set function. F is normalized if F(∅) = 0. F is monotonic if, for all subsets U ⊆ V ⊆ 2^[n], it holds that F(U) ≤ F(V). F is submodular if, for all subsets U, V ∈ 2^[n], it holds that F(U ∪ V) + F(U ∩ V) ≤ F(U) + F(V). Submodular functions have a diminishing returns property (Nemhauser et al., 1978), i.e. the marginal gain of adding elements decreases with the size of the set. When F is submodular, −F is supermodular (and vice versa).
We prove submodularity for a larger class of problems, then show submodularity of (6) as a special case. Our proof for the larger class may be of independent interest. In particular, the following Theorem considers general discrete optimization problems which are linear matrix functionals, and shows sufficient conditions on the matrix for the problem to be monotone and/or submodular.

Theorem 2 (Monotone Submodularity for Linear Forms). Let H ∈ R^{n×n} (not necessarily symmetric) be element-wise non-negative and bounded, with upper bound h* = max_{i,j ∈ [n]} h_{i,j} > 0. Further, construct the binary matrix representation of the indices that achieve the maximum as E ∈ {0, 1}^{n×n}, with e_{i,j} = 1 if h_{i,j} = h* and e_{i,j} = 0 otherwise, and its complement Ē = 1 − E with the corresponding index set Ē = {(i, j) s.t. e_{i,j} = 0}. Given the ground set 𝒮 ⊆ 2^[n], consider the linear form F(H, S) = ⟨A(S), H⟩ ∀ S ∈ 𝒮. Given m = |S|, define the functions:

α(n, m) = (a(S ∪ {u}) − a(S)) / b(S),
β(n, m) = (a(S ∪ {u}) + a(S ∪ {v}) − a(S ∪ {u, v}) − a(S)) / (b(S ∪ {u, v}) + b(S)),   (8)

where a(S) = F(E, S) and b(S) = F(Ē, S), for all u, v ∉ S (additional notation suppressed in α(·) and β(·) for clarity). Let m* = max_{S ∈ 𝒮} |S| be the maximal cardinality of any element in the ground set.

1. If 0 ≤ h_{i,j} ≤ h* α(n, m) ∀ (i, j) ∈ Ē, then F(H, S) is monotone.
2. If 0 ≤ h_{i,j} ≤ h* β(n, m) ∀ (i, j) ∈ Ē, then F(H, S) is submodular.

Finally, we consider a special case of Theorem 2 for the MMD.

Corollary 3 (Monotone Submodularity for MMD). Let the kernel matrix K ∈ R^{n×n} be element-wise non-negative, with equal diagonal terms k_{i,i} = k > 0 ∀ i ∈ [n], and be diagonally dominant. If the off-diagonal terms k_{i,j}, i, j ∈ [n], i ≠ j, satisfy 0 ≤ k_{i,j} ≤ k / (n³ + 2n² − 2n − 3), then J_b(S) given by (6) is monotone submodular.

The diagonal dominance condition expressed by Corollary 3 is easy to check given a kernel matrix. We also note that the conditions can be significantly weakened if one determines the required number of prototypes m* = max_S |S| ≤ n a priori.
This is further simplified for the MMD, since the bounds (8) are both monotonically decreasing functions of m, so the condition need only be checked for m*. Observe that diagonal dominance is not a necessary condition, as the more general approach in Theorem 2 allows arbitrarily indexed maximal entries in the kernel. Diagonal dominance is assumed to simplify the resulting expressions. Perhaps more important to practice is our observation that the diagonal dominance condition expressed by Corollary 3 is satisfied by parametrized kernels with appropriately selected parameters. We provide examples for radial basis function (RBF) kernels and powers of positive standardized kernels. Further examples and more general conditions are left for future work.

Example 4 (Radial Basis Function Kernel). Consider the radial basis function kernel K with entries k_{i,j} = k(x_i, x_j) = exp(−γ‖x_i − x_j‖²), evaluated on a sample X with non-duplicate points, i.e. x_i ≠ x_j ∀ x_i, x_j ∈ X. The off-diagonal kernel entries k_{i,j}, i ≠ j, monotonically decrease with respect to increasing γ. Thus, there exists γ* such that Corollary 3 is satisfied for all γ ≥ γ*.

Example 5 (Powers of Positive Standardized Kernels). Consider an element-wise positive kernel matrix G standardized to be element-wise bounded 0 ≤ g_{i,j} < 1, with unitary diagonal g_{i,i} = 1 ∀ i ∈ [n]. Define the kernel power K with k_{i,j} = g_{i,j}^p. The off-diagonal kernel entries k_{i,j}, i ≠ j, monotonically decrease with respect to increasing p. Thus, there exists p* such that Corollary 3 is satisfied for all p ≥ p*.

Beyond the examples outlined here, similar conditions can be enumerated for a wide range of parametrized kernel functions, and are easily checked for model-based kernels, e.g. the Fisher kernel (Jaakkola et al., 1999), useful for comparing data points based on similarity with respect to a probabilistic model. Our interpretation of these examples is that the conditions of Corollary 3 are not excessively restrictive.
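As a rough sketch of how the kernel condition of Corollary 3 and Example 4 might be checked in practice (the helper names and the doubling search over γ are our own illustration, not part of the paper), one can verify the off-diagonal bound directly and grow γ until it holds:

```python
import numpy as np

def satisfies_corollary3(K):
    """Check the sufficient condition of Corollary 3: non-negative off-diagonal
    entries bounded by k / (n^3 + 2n^2 - 2n - 3), where k is the (equal) diagonal."""
    n = K.shape[0]
    k = K[0, 0]
    off_diag = K[~np.eye(n, dtype=bool)]
    bound = k / (n**3 + 2 * n**2 - 2 * n - 3)
    return bool(np.all(off_diag >= 0) and np.all(off_diag <= bound))

def find_gamma(X, gamma0=1.0, max_doublings=60):
    """Double gamma until the RBF kernel on X satisfies Corollary 3 (cf. Example 4)."""
    D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)  # pairwise squared distances
    gamma = gamma0
    for _ in range(max_doublings):
        K = np.exp(-gamma * D2)
        if satisfies_corollary3(K):
            return gamma, K
        gamma *= 2.0
    raise RuntimeError("no suitable gamma found (duplicate points?)")
```

The search terminates for any sample with distinct points, since the off-diagonal entries decay to zero as γ grows (Example 4).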
While constrained maximization of submodular functions is generally NP-hard, the simple greedy forward selection heuristic has been shown to perform almost as well as the optimal in practice, and is known to have strong theoretical guarantees.

Theorem 6 (Nemhauser et al. (1978)). For any normalized, monotonic submodular function F, the set S* obtained by the greedy algorithm achieves at least a constant fraction (1 − 1/e) of the objective value obtained by the optimal solution, i.e. F(S*) ≥ (1 − 1/e) max_{|S| ≤ m*} F(S).
In addition, no polynomial time algorithm can provide a better approximation guarantee unless P = NP (Feige, 1998). An additional benefit of the greedy approach is that it does not require the number of prototypes to be decided at training time: assuming the kernel satisfies the appropriate conditions, training can be stopped at any m* based on computational constraints, while still returning meaningful results. The greedy algorithm is outlined in Algorithm 1.

Algorithm 1 Greedy algorithm, max F(S) s.t. |S| ≤ m*
  Input: m*, S = ∅
  while |S| < m* do
    foreach i ∈ [n]\S: f_i = F(S ∪ {i}) − F(S)
    S = S ∪ {argmax_i f_i}
  end while
  Return: S.

3.2 Model Criticism

In addition to selecting prototype samples, MMD-critic characterizes the data points not well explained by the prototypes, which we call the model criticism. These data points are selected as the largest values of the witness function (5), i.e. where the similarity between the dataset and the prototypes deviates the most. Consider the cost function:

L(C) = Σ_{l ∈ C} | (1/n) Σ_{i ∈ [n]} k(x_i, x_l) − (1/|S|) Σ_{j ∈ S} k(x_j, x_l) |.   (9)

The absolute value ensures that we measure both positive deviations f(x) > 0, where the prototypes underfit the density of the samples, and negative deviations f(x) < 0, where the prototypes overfit the density of the samples. Thus, we focus primarily on the magnitude of the deviation, rather than its sign. The following theorem shows that (9) is a linear function of C.

Theorem 7. The criticism function L(C) is a linear function of C.

We found that the addition of a regularizer which encourages a diverse selection of criticism points improved performance. Let r : 2^[n] → R represent a regularization function. We select the criticism points as the maximizers of the cost function:

max_{C ⊆ [n]\S, |C| ≤ c*} L(C) + r(K, C),   (10)

where [n]\S denotes all indices which do not include the prototypes, and c* is the number of criticism points desired. Fortunately, due to the linearity of (9), the optimization function (10) is submodular whenever the regularization function is submodular.
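A minimal sketch of the cost function (6) and the greedy forward selection of Algorithm 1 (illustrative code under our own naming, assuming a precomputed kernel matrix `K`; this naive version rescores J_b from scratch for every candidate rather than using incremental marginal-gain updates):

```python
import numpy as np

def J_b(K, S):
    """Cost function of Eq. (6) for prototype indices S, given kernel matrix K:
    average similarity of all points to S minus average within-S similarity."""
    n = K.shape[0]
    m = len(S)
    if m == 0:
        return 0.0
    S = list(S)
    return (2.0 / (n * m)) * K[:, S].sum() - K[np.ix_(S, S)].sum() / m**2

def greedy_prototypes(K, m_star):
    """Algorithm 1: greedy forward selection of up to m_star prototype indices."""
    n = K.shape[0]
    S = []
    for _ in range(m_star):
        candidates = [i for i in range(n) if i not in S]
        S.append(max(candidates, key=lambda i: J_b(K, S + [i])))
    return S
```

On a dataset with two well-separated clusters, the first two greedy picks land one in each cluster, since covering a new cluster raises the coverage term of (6) far more than adding a redundant neighbor.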
We encourage the use of regularizers which incorporate diversity into the criticism selection. We found the best qualitative performance using the log-determinant regularizer (Krause et al., 2008). Let K_{C,C} be the sub-matrix of K corresponding to the pairs of indices in C × C; then the log-determinant regularizer is given by:

r(K, C) = log det K_{C,C},   (11)

which is known to be submodular. Further, several researchers have found, both in theory and practice (Sharma et al., 2015), that greedy optimization is an effective strategy for this kind of optimization. We apply the greedy algorithm for criticism selection with the function F(C) = L(C) + r(K, C).

4 Related Work

There is a large literature on techniques for selecting prototypes that summarize a dataset, and a full literature survey is beyond the scope of this manuscript. Instead, we overview a few of the most relevant references. K-medoid clustering (Kaufman and Rousseeuw, 1987) is a classic technique for selecting a representative subset of data points, and can be solved using various iterative algorithms. K-medoid clustering is quite similar to K-means clustering, with the additional condition that the presented prototypes must be in the dataset. The ubiquity of large datasets has led to a resurgence
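Criticism selection per (9)-(11) can be sketched similarly (our own illustrative code, not the authors'): score each non-prototype point by its absolute witness value, then greedily add the point that maximizes the witness total plus the log-determinant of the criticism sub-kernel; `S` is the list of previously selected prototype indices:

```python
import numpy as np

def greedy_criticisms(K, S, c_star):
    """Greedy criticism selection for Eq. (10): absolute witness score (9)
    plus the log-determinant diversity regularizer (11)."""
    n = K.shape[0]
    # |witness| per point: mean similarity to all points minus mean similarity to prototypes
    w = np.abs(K.mean(axis=0) - K[list(S)].mean(axis=0))
    C = []
    candidates = [i for i in range(n) if i not in S]
    for _ in range(c_star):
        def objective(c):
            idx = C + [c]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            return w[idx].sum() + (logdet if sign > 0 else -np.inf)
        C.append(max((i for i in candidates if i not in C), key=objective))
    return C
```

On data with an outlier far from every prototype, the first criticism selected is the outlier, since its witness magnitude dominates.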
of interest in the data summarization problem, also known as the set cover problem. Progress has included novel cost functions and algorithms for several domains, including image summarization (Simon et al., 2007) and document summarization (Lin and Bilmes, 2011). Recent innovations also include highly scalable and distributed algorithms (Badanidiyuru et al., 2014; Mirzasoleiman et al., 2015). There is also a large literature on variations of the set cover problem tuned for classification, such as the cover digraph approach of Priebe et al. (2003) and prototype selection for interpretable classification (Bien and Tibshirani, 2011), which involves selecting prototypes that maximize the coverage within the class, but minimize the coverage across classes.

Submodular/supermodular functions are well studied in the combinatorial optimization literature, with several scalable algorithms that come with optimization-theoretic optimality guarantees (Nemhauser et al., 1978). In the Bayesian modeling literature, submodular optimization has previously been applied for approximate inference by Koyejo et al. (2014). The technical conditions required for submodularity of (6) are due to the averaging of the kernel similarity scores, as the average requires a division by the cardinality |S|. In particular, the analogue of (6) which replaces all the averages by sums (i.e. removes all division by |S|) is equivalent to the well-known submodular functions previously used for scene (Simon et al., 2007) and document (Lin and Bilmes, 2011) summarization, given by:

(2/n) Σ_{i ∈ [n], j ∈ S} k(x_i, y_j) − λ Σ_{i,j ∈ S} k(y_i, y_j),

where λ > 0 is a regularization parameter. The resulting function is known to be submodular when the kernel is element-wise positive, i.e. without the need for additional diagonal dominance conditions. On the other hand, the averaging has a desirable built-in balancing effect. When using the sum, practitioners must tune the additional regularization parameter λ to achieve a similar balance.
5 Results

We present results for the proposed MMD-critic using the USPS handwritten digits (Hull, 1994) and ImageNet (Deng et al., 2009) datasets. We quantitatively evaluate the prototypes in terms of predictive quality compared to related baselines on the USPS handwritten digits dataset. We also present preliminary results from a human subject pilot study. Our results suggest that the model criticism, which is unique to the proposed MMD-critic, is especially useful to facilitate human understanding. For all datasets, we employed the radial basis function (RBF) kernel with entries k_{i,j} = k(x_i, x_j) = exp(−γ‖x_i − x_j‖²), which satisfies the conditions of Corollary 3 for sufficiently large γ (cf. Example 4; see Example 5 and the following discussion for alternative feasible kernels).

The Nearest Prototype Classifier: While our primary interest is in interpretable prototype selection and criticism, prototypes may also be useful for speeding up memory-based machine learning techniques such as the nearest neighbor classifier, by restricting the neighbor search to the prototypes; this is sometimes known as the nearest prototype classifier (Bien and Tibshirani, 2011; Kuncheva and Bezdek, 1998). This classification provides an objective (although indirect) evaluation of the quality of the selected prototypes, and is useful for setting hyperparameters. We employ a 1-nearest-neighbor classifier using the Hilbert space distance induced by the kernels. Let y_i ∈ [k] denote the label associated with each prototype i ∈ S, for k classes. As we employ normalized kernels (where the diagonal is 1), it is sufficient to measure the pairwise kernel similarity. Thus, for a test point x̂, the nearest prototype classifier reduces to:

ŷ = y_{i*}, where i* = argmin_{i ∈ S} ‖x̂ − x_i‖²_{H_K} = argmax_{i ∈ S} k(x̂, x_i).

5.1 MMD-critic Evaluated on the USPS Digits Dataset

The USPS handwritten digits dataset (Hull, 1994) consists of n = 7291 training (and 2007 test) greyscale images of 10 handwritten digits from 0 to 9.
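The nearest prototype rule above reduces, for a normalized kernel, to an argmax of kernel similarity between the test point and the prototypes; a minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def nearest_prototype_predict(K_test_proto, proto_labels):
    """Nearest prototype classification: with a normalized kernel, the argmin
    of the RKHS distance equals the argmax of kernel similarity k(x_hat, x_i).
    K_test_proto[t, i] holds k(test point t, prototype i)."""
    return proto_labels[np.argmax(K_test_proto, axis=1)]
```

For example, with two prototypes at 0 and 10 labeled 0 and 1, test points at 1 and 9 are assigned labels 0 and 1 respectively.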
We consider two kinds of RBF kernels: (i) global, where the pairwise kernel is computed between all data points, and (ii) local, given by exp(−γ‖x_i − x_j‖²) 1_[y_i = y_j], i.e. points in different classes are assigned a similarity score of zero. The local approach has the effect of pushing points in different classes further apart. The kernel hyperparameter γ was chosen to maximize the average cross-validated classification performance, then fixed for all other experiments.

Classification: We evaluated nearest prototype classifiers using MMD-critic, and compared to baselines (and reported performance) from Bien and Tibshirani (2011) (abbreviated as PS) and their implementation of K-medoids. Figure 1 (left) compares MMD-critic with global and local kernels to the baselines for different numbers of selected prototypes m* = |S|. Our results show comparable (or improved) performance as compared to other models. In particular, we observe that the global kernels outperform the local kernels² by a small margin. We note that MMD is particularly effective at selecting the first few prototypes (i.e. the speed of error reduction as the number of prototypes increases), suggesting its utility for rapidly summarizing the dataset.

Figure 1: Classification error vs. number of prototypes m* = |S|. MMD-critic shows comparable (or improved) performance as compared to other models (left). Random subset of prototypes and criticism from the USPS dataset (right).

Selected Prototypes and Criticism: Fig. 1 (right) presents a randomly selected subset of the prototypes and criticism from MMD-critic using the local kernel. We observe that the prototypes capture many of the common ways of writing digits, while the criticism clearly capture outliers.

5.2 Qualitative Measure: Prototypes and Criticisms of Images

In this section, we learn prototypes and criticisms from the ImageNet dataset (Russakovsky et al., 2015) using image embeddings from He et al. (2015). Each image is represented by a 2048-dimensional vector embedding, and each image belongs to one of 1000 categories. We select two breeds of one category (e.g., Blenheim spaniel) and run MMD-critic to learn prototypes and criticisms. As shown in Figure 2, MMD-critic learns reasonable prototypes and criticisms for the two dog breeds. On the left, criticisms picked out the different coloring (the second criticism is a black and white picture), as well as pictures capturing movements of dogs (first and third criticisms). Similarly, on the right, criticisms capture the unusual, but potentially frequent, pictures of dogs in costumes (first and second criticisms).
5.3 Quantitative Measure: Prototypes and Criticisms Improve Interpretability

We conducted a human pilot study to collect objective and subjective measures of interpretability using MMD-critic. The experiment used the same dataset as Section 5.2. We define interpretability in this work as follows: a method is interpretable if a user can correctly and efficiently predict the method's results. Under this definition, we designed a predictive task to quantitatively evaluate interpretability. Given a randomly sampled data point, we measure how well a human can predict the group it belongs to (accuracy), and how fast they can perform the task (efficiency). We chose this dataset because the task of assigning a new image to a group requires the groups to be well-explained, but does not require specialized training.

We presented four conditions in the experiment: 1) raw images (Raw Condition); 2) prototypes only (Proto Only Condition); 3) prototypes and criticisms (Proto and Criticism Condition); 4) uniformly sampled data points per group (Uniform Condition). The Raw Condition contained 100 images per species (e.g., if a group contains 2 species, there are 200 images); the Proto Only, Proto and Criticism, and Uniform Conditions contain the same number of images.

² Note that the local kernel trivially achieves perfect accuracy. Thus, in order to measure generalization performance, we do not use class labels for local kernel test instances, i.e. we use the global kernel instead of the local kernel for test instances regardless of training.
Figure 2: Learned prototypes and criticisms from the ImageNet dataset (two types of dog breeds).

We used a within-subject design to minimize the effect of inter-participant variability, with a balanced Latin square to account for a potential learning effect. The four conditions were assigned to four participants (four males) in a balanced manner. Each subject answered 21 questions, where the first three questions are practice questions and not included in the analysis. Each question showed six groups (e.g., red fox, kit fox) of a species (e.g., fox), and a randomly sampled data point that belongs to one of the groups. Subjects were encouraged to answer the questions as quickly and accurately as possible. A break was imposed after each question to mitigate the potential effect of fatigue. We measured the accuracy of answers as well as the time taken to answer each question. Participants were also asked to respond to ten 5-point Likert scale survey questions about their subjective measures of accuracy and efficiency. Each survey question compared a pair of conditions (e.g., "Condition A was more helpful than Condition B to correctly (or efficiently) assign the image to a group").

Subjects performed best using the Proto and Criticism Condition (M=87.5%, SD=20%). Performance with the Proto Only Condition was relatively similar (M=75%, SD=41%), while that with the Uniform Condition (M=55%, SD=38%, 37% decrease) and the Raw Condition (M=56%, SD=33%, 36% decrease) was substantially lower. In terms of speed, subjects were most efficient using the Proto Only Condition (M=1.04 mins/question, SD=0.28, 44% decrease compared to the Raw Condition), followed by the Uniform Condition (M=1.31 mins/question, SD=0.59) and the Proto and Criticism Condition (M=1.37 mins/question, SD=0.8). Subjects spent the most time with the Raw Condition (M=1.86 mins/question, SD=0.67). Subjects indicated a preference for the Proto and Criticism Condition over the Raw Condition and the Uniform Condition.
In a survey question asking to compare the Proto and Criticism Condition and the Raw Condition, a subject added that [the Proto and Criticism Condition resulted in] "less confusion from trying to discover hidden patterns in a ton of images, more clues indicating what features are important". In particular, in a question asking to compare the Proto and Criticism Condition and the Proto Only Condition, a subject said that "the addition of criticisms made it easier to locate the defining features of the cluster within the prototypical images". The humans' superior performance with prototypes and criticism in this preliminary study shows that providing criticisms together with prototypes is a promising direction for improving interpretability.

6 Conclusion

We present MMD-critic, a scalable framework for prototype and criticism selection to improve the interpretability of complex data distributions. To the best of our knowledge, ours is the first work which leverages the BMC framework to generate explanations. Further, MMD-critic shows competitive performance as a nearest prototype classifier compared to existing methods. When criticism is given together with prototypes, a human pilot study suggests that humans are better able to perform a predictive task that requires the data distributions to be well-explained. This suggests that criticism and prototypes are a step towards improving the interpretability of complex data distributions. For future work, we hope to further explore the properties of MMD-critic, such as the effect of the choice of kernel, and weaker conditions on the kernel matrix for submodularity. We plan to explore applications to larger datasets, aided by recent work on distributed algorithms for submodular optimization. We also intend to complete a larger-scale user study on how criticism and prototypes presented together affect human understanding.
References

A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 1994.
A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In KDD, 2014.
I. Bichindaritz and C. Marling. Case-based reasoning in the health sciences: What's next? AI in Medicine, 2006.
J. Bien and R. Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, 2011.
R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.
M.S. Cohen, J.T. Freeman, and S. Wolf. Metarecognition in time-stressed decision making: Recognizing, critiquing, and correcting. Human Factors, 1996.
J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
U. Feige. A threshold of ln n for approximating set cover. JACM, 1998.
A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian data analysis. Taylor & Francis, 2014.
A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. JMLR, 2008.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint, 2015.
J.J. Hull. A database for handwritten text recognition research. TPAMI, 1994.
T.S. Jaakkola, D. Haussler, et al. Exploiting generative models in discriminative classifiers. In NIPS, 1999.
L. Kaufman and P. Rousseeuw. Clustering by means of medoids. North-Holland, 1987.
B. Kim, C. Rudin, and J.A. Shah. The Bayesian Case Model: A generative approach for case-based reasoning and prototype classification. In NIPS, 2014.
O.O. Koyejo, R. Khanna, J. Ghosh, and R. Poldrack. On prior distributions and approximate inference for structured variables. In NIPS, 2014.
A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008.
L.I. Kuncheva and J.C. Bezdek. Nearest prototype classification: clustering, genetic algorithms, or random search? IEEE Transactions on Systems, Man, and Cybernetics, 28(1), 1998.
H. Lin and J. Bilmes. A class of submodular functions for document summarization. In ACL, 2011.
J.R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In NIPS, 2015.
B. Mirzasoleiman, A. Karbasi, A. Badanidiyuru, and A. Krause. Distributed submodular cover: Succinctly summarizing massive data. In NIPS, 2015.
G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 1978.
A. Newell and H.A. Simon. Human problem solving. Prentice-Hall, Englewood Cliffs, 1972.
C.E. Priebe, D.J. Marchette, J.G. DeVinney, and D.A. Socolinsky. Classification using class cover catch digraphs. Journal of Classification, 2003.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
D. Sharma, A. Kapoor, and A. Deshpande. On greedy maximization of entropy. In ICML, 2015.
I. Simon, N. Snavely, and S.M. Seitz. Scene summarization for online image collections. In ICCV, 2007.
K.R. Varshney. Engineering safety in machine learning. arXiv preprint, 2016.
Proof of Theorem 2

Observe that from the element-wise bounds on $H$, the following element-wise inequality holds: $h^* E \le H \le h^* E + \nu \bar{E}$. Thus, from the linearity of $F(H, S) = \langle A(S), H \rangle$ with respect to $H$, we have that:
$$F(h^* E, S) \le F(H, S) \le F(h^* E + \nu \bar{E}, S),$$
where (by linearity) $F(h^* E + \nu \bar{E}, S) = h^* F(E, S) + \nu F(\bar{E}, S)$. Next, employing the terms
$$a(S) = F(E, S) = \langle A(S), E \rangle \quad \text{and} \quad b(S) = F(\bar{E}, S) = \langle A(S), \bar{E} \rangle,$$
we may rewrite the bounds as:
$$h^* a(S) \le F(H, S) \le h^* a(S) + \nu b(S).$$

Monotonicity: The function $F(H, S)$ is monotone with respect to $S$ if $F(H, S \cup \{u\}) - F(H, S) \ge 0$. Applying the lower and upper bounds, we have that:
$$F(H, S \cup \{u\}) - F(H, S) \ge h^* a(S \cup \{u\}) - h^* a(S) - \nu b(S) \ge 0$$
$$\impliedby \nu \le h^* \frac{a(S \cup \{u\}) - a(S)}{b(S)} = h^* \alpha(n, m).$$
Thus, when the off-diagonal terms satisfy $0 \le h_{i,j} \le h^* \alpha(n, m)$ for all $(i, j) \in \bar{E}$, we have that $F(H, S)$ is monotone.

Submodularity: The function $F(H, S)$ is submodular with respect to $S$ if $F(H, S \cup \{u\}) + F(H, S \cup \{v\}) \ge F(H, S \cup \{u, v\}) + F(H, S)$. Again, applying the lower and upper bounds, we have that:
$$F(H, S \cup \{u\}) + F(H, S \cup \{v\}) - F(H, S \cup \{u, v\}) - F(H, S) \ge h^* a(S \cup \{u\}) + h^* a(S \cup \{v\}) - h^* a(S \cup \{u, v\}) - \nu b(S \cup \{u, v\}) - h^* a(S) - \nu b(S) \ge 0$$
$$\impliedby \nu \le h^* \frac{a(S \cup \{u\}) + a(S \cup \{v\}) - a(S \cup \{u, v\}) - a(S)}{b(S \cup \{u, v\}) + b(S)} = h^* \beta(n, m).$$
Thus, when the off-diagonal terms satisfy $0 \le h_{i,j} \le h^* \beta(n, m)$ for all $(i, j) \in \bar{E}$, we have that $F(H, S)$ is submodular.

Proof of Corollary 3

Based on the diagonal dominance assumption on $K$, it is clear that $\bar{E} = \{(i, j) \in [n] \times [n] : i \ne j\}$ indexes the off-diagonal terms, and $E = \mathbf{1} - \bar{E} = I$. Given $A(S)$ with entries $a_{i,j}(S) = \frac{2}{n|S|} \mathbb{1}[j \in S] - \frac{1}{|S|^2} \mathbb{1}[i \in S] \mathbb{1}[j \in S]$, we can compute the bounds (8) simply by enumerating sums (writing $m = |S|$):
$$a(S) = \langle A(S), I \rangle = \frac{2m}{nm} - \frac{m}{m^2} = \frac{2}{n} - \frac{1}{m},$$
$$b(S) = \langle A(S), \mathbf{1} - I \rangle = \frac{2m(n-1)}{nm} - \frac{m(m-1)}{m^2} = \frac{2(n-1)}{n} - \frac{m-1}{m} = 1 - \frac{2}{n} + \frac{1}{m}.$$

Monotonicity: $J_p(\cdot)$ is monotone when the upper bound on the off-diagonal terms is given by $\alpha(n, m) = \frac{a(S \cup \{u\}) - a(S)}{b(S)}$ by Theorem 2. We have that:
$$a(S \cup \{u\}) - a(S) = \frac{1}{m(m+1)}, \qquad b(S) = 1 - \frac{2}{n} + \frac{1}{m}.$$
Thus:
$$\alpha(n, m) = \frac{n}{(m+1)\left((n-2)m + n\right)}.$$
This is a decreasing function w.r.t. $m$. Further, for the ground set $2^{[n]}$, we have that $m = n$, and $\alpha(n, n) = \frac{1}{n^2 - 1}$.

Submodularity: $J_p(\cdot)$ is submodular when the upper bound on the off-diagonal terms is given by $\beta(n, m) = \frac{a(S \cup \{u\}) + a(S \cup \{v\}) - a(S \cup \{u, v\}) - a(S)}{b(S \cup \{u, v\}) + b(S)}$ by Theorem 2. We have that:
$$a(S \cup \{u\}) + a(S \cup \{v\}) - a(S \cup \{u, v\}) - a(S) = \frac{2}{m(m+1)(m+2)},$$
$$b(S \cup \{u, v\}) + b(S) = 2 - \frac{4}{n} + \frac{1}{m} + \frac{1}{m+2}.$$
Thus:
$$\beta(n, m) = \frac{n}{(m+1)\left(n(m^2 + 3m + 1) - 2(m^2 + 2m)\right)}.$$
This is a decreasing function w.r.t. $m$. Further, for the ground set $2^{[n]}$, we have that $m = n$, and $\beta(n, n) = \frac{1}{n^3 + 2n^2 - 2n - 3}$.

Combined bound: Finally, we show that $\beta(n, n) \le \alpha(n, n)$, so that the bound $k_{i,j} \le k^* \beta(n, n)$ is sufficient to guarantee both monotonicity and submodularity:
$$\beta(n, n) \le \alpha(n, n) \iff \frac{1}{n^3 + 2n^2 - 2n - 3} \le \frac{1}{n^2 - 1} \iff n^2 - 1 \le n^3 + 2n^2 - 2n - 3$$
$$\iff 0 \le n^3 + n^2 - 2n - 2 = (n+1)(n^2 - 2),$$
which holds for all $n \ge 2$. Thus $\beta(n, n) \le \alpha(n, n)$. The proof is complete.

Proof of Theorem 7

A discrete function is linear if it can be written in the form $F(C) = \sum_{i \in [n]} w_i \mathbb{1}[i \in C]$. Consider (9) and observe that:
$$L(C) = \sum_{l \in C} \left| \frac{1}{n} \sum_{i \in [n]} k(x_i, x_l) - \frac{1}{m} \sum_{j \in S} k(x_j, x_l) \right| = \sum_{l \in [n]} \left| \frac{1}{n} \sum_{i \in [n]} k(x_i, x_l) - \frac{1}{m} \sum_{j \in S} k(x_j, x_l) \right| \mathbb{1}[l \in C] = \sum_{l \in [n]} w_l \mathbb{1}[l \in C],$$
where:
$$w_l = \left| \frac{1}{n} \sum_{i \in [n]} k(x_i, x_l) - \frac{1}{m} \sum_{j \in S} k(x_j, x_l) \right|.$$
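As an informal sanity check on Corollary 3 (not part of the paper), the combined bound can be verified numerically: build a kernel matrix with unit diagonal ($k^* = 1$) whose off-diagonal entries lie in $[0, \beta(n, n)]$, then confirm monotonicity and diminishing returns of the prototype objective by brute force over all nonempty subsets of a small ground set. The function names `beta_nn` and `J` below are our own.

```python
import itertools
import random

def beta_nn(n):
    # Combined bound from Corollary 3: beta(n, n) = 1 / (n^3 + 2n^2 - 2n - 3).
    return 1.0 / (n**3 + 2 * n**2 - 2 * n - 3)

def J(K, n, S):
    # Prototype objective: J(S) = 2/(n|S|) sum_{i in [n], j in S} k_ij
    #                             - 1/|S|^2 sum_{i, j in S} k_ij.
    if not S:
        return 0.0
    m = len(S)
    t1 = sum(K[i][j] for i in range(n) for j in S)
    t2 = sum(K[i][j] for i in S for j in S)
    return 2.0 * t1 / (n * m) - t2 / (m * m)

n = 4
random.seed(0)
nu = beta_nn(n)  # upper bound on off-diagonal entries, with k* = 1
K = [[0.0] * n for _ in range(n)]
for i in range(n):
    K[i][i] = 1.0
    for j in range(i):
        K[i][j] = K[j][i] = random.uniform(0.0, nu)  # symmetric, in [0, nu]

subsets = [frozenset(c) for r in range(1, n + 1)
           for c in itertools.combinations(range(n), r)]
eps = 1e-12

# Monotone on nonempty sets (the m >= 1 regime of the proof):
# adding an element never decreases J.
monotone = all(J(K, n, S | {u}) >= J(K, n, S) - eps
               for S in subsets for u in range(n) if u not in S)

# Submodular: diminishing returns over nested nonempty sets S <= T.
submodular = all(
    J(K, n, S | {u}) - J(K, n, S) >= J(K, n, T | {u}) - J(K, n, T) - eps
    for S in subsets for T in subsets if S <= T
    for u in range(n) if u not in T)

print(monotone, submodular)  # → True True
```

With the off-diagonal entries capped at $\beta(n, n)$, both checks pass, matching the corollary; raising the cap well above the bound can break the diminishing-returns check, which is the point of the condition.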