Examples are not Enough, Learn to Criticize! Criticism for Interpretability
Been Kim (Allen Institute for AI), Rajiv Khanna (UT Austin), Oluwasanmi Koyejo (UIUC)

Abstract

Example-based explanations are widely used in the effort to improve the interpretability of highly complex distributions. However, prototypes alone are rarely sufficient to represent the gist of the complexity. In order for users to construct better mental models and understand complex data distributions, we also need criticism to explain what is not captured by prototypes. Motivated by the Bayesian model criticism framework, we develop MMD-critic, which efficiently learns prototypes and criticism, designed to aid human interpretability. A human subject pilot study shows that MMD-critic selects prototypes and criticism that are useful to facilitate human understanding and reasoning. We also evaluate the prototypes selected by MMD-critic via a nearest prototype classifier, showing competitive performance compared to baselines.

1 Introduction and Related Work

As machine learning (ML) methods have become ubiquitous in human decision making, their transparency and interpretability have grown in importance (Varshney, 2016). Interpretability is particularly important in domains where decisions can have significant consequences. For example, the pneumonia risk prediction case study in Caruana et al. (2015) showed that a more interpretable model could reveal important but surprising patterns in the data that complex models overlooked. Studies of human reasoning have shown that the use of examples (prototypes) is fundamental to the development of effective strategies for tactical decision-making (Newell and Simon, 1972; Cohen et al., 1996). Example-based explanations are widely used in the effort to improve interpretability. A popular research program along these lines is case-based reasoning (CBR) (Aamodt and Plaza, 1994), which has been successfully applied to real-world problems (Bichindaritz and Marling, 2006).
More recently, the Bayesian framework has been combined with CBR-based approaches in the unsupervised-learning setting, leading to improvements in user interpretability (Kim et al., 2014). In a supervised learning setting, example-based classifiers have been shown to achieve comparable performance to non-interpretable methods, while offering a condensed view of a dataset (Bien and Tibshirani, 2011). However, examples are not enough. Relying only on examples to explain a model's behavior can lead to over-generalization and misunderstanding. Examples alone may be sufficient when the distribution of data points is clean, in the sense that there exists a set of prototypical examples which sufficiently represent the data. However, this is rarely the case in real-world data. For instance, fitting models to complex datasets often requires the use of regularization. While regularization adds bias to the model to improve generalization performance, this same bias may conflict with the distribution of the data. Thus, to maintain interpretability, it is important, along with prototypical examples, to deliver insights signifying the parts of the input space where prototypical examples do not provide good explanations. We call the data points that do not quite fit the model criticism samples. Together with prototypes, criticism can help humans build a better mental model of the complex data space.

Bayesian model criticism (BMC) is a framework for evaluating fitted Bayesian models, and was developed to aid model development and selection by helping to identify where and how a particular model may fail to explain the data. It has quickly developed into an important part of model design, and Bayesian statisticians now view model criticism as an important component in the cycle of model construction, inference and criticism (Gelman et al., 2014). Lloyd and Ghahramani (2015) recently proposed an exploratory approach for statistical model criticism using the maximum mean discrepancy (MMD) two-sample test, and explored the use of the witness function to identify the portions of the input space where the model most misrepresents the data. Instead of using the MMD to compare two models as in classic two-sample testing (Gretton et al., 2008), or to compare the model to input data as in the Bayesian model criticism of Lloyd and Ghahramani (2015), we consider a novel application of the MMD and its associated witness function as a principled approach for selecting prototype and criticism samples.

We present MMD-critic, a scalable framework for prototype and criticism selection to improve the interpretability of machine learning methods. To the best of our knowledge, ours is the first work which leverages the BMC framework to generate explanations for machine learning methods. MMD-critic uses the MMD statistic as a measure of similarity between points and potential prototypes, and efficiently selects prototypes that maximize the statistic. In addition to prototypes, MMD-critic selects criticism samples, i.e. samples that are not well-explained by the prototypes, using a regularized witness function score.

(All authors contributed equally. 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.)
The scalability follows from our analysis, where we show that under certain conditions, the MMD for prototype selection is a supermodular set function. Our supermodularity proof is general and may be of independent interest. While we are primarily concerned with prototype selection and criticism, we quantitatively evaluate the performance of MMD-critic as a nearest prototype classifier, and show that it achieves comparable performance to existing methods. We also present results from a human subject pilot study which shows that including criticism together with prototypes is helpful for an end-task that requires the data distributions to be well-explained.

2 Preliminaries

This section includes notation and a few important definitions. Vectors are denoted by lower case x and matrices by capital X. The Euclidean inner product between matrices A and B is given by ⟨A, B⟩ = Σ_{i,j} a_{i,j} b_{i,j}. Let det(X) denote the determinant of X. Sets are denoted by sans serif, e.g. S. The reals are denoted by R. [n] denotes the set of integers {1, ..., n}, and 2^V denotes the power set of V. The indicator function 1_[a] takes the value 1 if its argument a is true and 0 otherwise. We denote probability distributions by either P or Q. The notation |·| denotes cardinality when applied to sets, and absolute value when applied to real values.

2.1 Maximum Mean Discrepancy (MMD)

The maximum mean discrepancy (MMD) is a measure of the difference between distributions P and Q, given by the supremum over a function space F of differences between the expectations with respect to the two distributions. The MMD is given by:

MMD(F, P, Q) = sup_{f ∈ F} ( E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)] ).   (1)

When F is a reproducing kernel Hilbert space (RKHS) with kernel function k : X × X → R, the supremum is achieved at (Gretton et al., 2008):

f(x) = E_{X'∼P}[k(x, X')] − E_{X'∼Q}[k(x, X')].   (2)

The function (2) is also known as the witness function, as it measures the maximum discrepancy between the two expectations in F.
Observe that the witness function is positive wherever Q underfits the density of P, and negative wherever Q overfits P. We can substitute (2) into (1) and square the result, leading to:

MMD²(F, P, Q) = E_{X,X'∼P}[k(X, X')] − 2 E_{X∼P, Y∼Q}[k(X, Y)] + E_{Y,Y'∼Q}[k(Y, Y')].   (3)
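These population quantities are estimated from samples in practice. As a minimal illustration (our own sketch, not the authors' code), the following NumPy fragment computes a biased finite-sample estimate of the squared MMD and evaluates the empirical witness function for an RBF kernel; the names `rbf_kernel`, `mmd2`, and `witness` are chosen here for illustration:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dists = np.sum(X**2, axis=1)[:, None] - 2.0 * X @ Z.T + np.sum(Z**2, axis=1)[None, :]
    return np.exp(-gamma * sq_dists)

def mmd2(X, Z, gamma=1.0):
    """Biased finite-sample estimate of the squared MMD between samples X and Z."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Z, gamma).mean()
            + rbf_kernel(Z, Z, gamma).mean())

def witness(x, X, Z, gamma=1.0):
    """Empirical witness function evaluated at the rows of x: mean similarity
    to the sample from P minus mean similarity to the sample from Q."""
    return rbf_kernel(x, X, gamma).mean(axis=1) - rbf_kernel(x, Z, gamma).mean(axis=1)
```

The estimate is zero when the two samples coincide and grows as the samples separate; the witness is positive where the first sample is dense and the second is not.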
It is clear that MMD²(F, P, Q) ≥ 0, and MMD²(F, P, Q) = 0 iff P is indistinguishable from Q on the RKHS F. This population definition can be approximated using sample expectations. In particular, given n samples from P as X = {x_i ∼ P, i ∈ [n]}, and m samples from Q as Z = {z_j ∼ Q, j ∈ [m]}, the following is a finite sample approximation:

MMD²_b(F, X, Z) = (1/n²) Σ_{i,j ∈ [n]} k(x_i, x_j) − (2/(nm)) Σ_{i ∈ [n], j ∈ [m]} k(x_i, z_j) + (1/m²) Σ_{i,j ∈ [m]} k(z_i, z_j),   (4)

and the witness function is approximated as:

f(x) = (1/n) Σ_{i ∈ [n]} k(x, x_i) − (1/m) Σ_{j ∈ [m]} k(x, z_j).   (5)

3 MMD-critic for Prototype Selection and Criticism

Given n samples from a statistical model X = {x_i, i ∈ [n]}, let S ⊆ [n] represent a subset of the indices, so that X_S = {x_i | i ∈ S}. Given an RKHS with kernel function k(·, ·), we can measure the maximum mean discrepancy between the samples and any selected subset using MMD²(F, X, X_S). MMD-critic selects prototype indices S which minimize MMD²(F, X, X_S). For our purposes, it will be convenient to pose the problem as a normalized discrete maximization. To this end, consider the following cost function, given by the negation of MMD²(F, X, X_S) with an additive bias:

J_b(S) = (1/n²) Σ_{i,j=1}^n k(x_i, x_j) − MMD²(F, X, X_S)
       = (2/(n|S|)) Σ_{i ∈ [n], j ∈ S} k(x_i, y_j) − (1/|S|²) Σ_{i,j ∈ S} k(y_i, y_j).   (6)

Note that the additive bias MMD²(F, X, ∅) = (1/n²) Σ_{i,j=1}^n k(x_i, x_j) is a constant with respect to S. Further, J_b(S) is normalized, since, when evaluated on the empty set, we have:

J_b(∅) = min_{S ∈ 2^[n]} J_b(S) = (1/n²) Σ_{i,j=1}^n k(x_i, x_j) − (1/n²) Σ_{i,j=1}^n k(x_i, x_j) = 0.

MMD-critic selects prototypes as the subset of indices S ⊆ [n] which optimize:

max_{S ∈ 2^[n], |S| ≤ m*} J_b(S).   (7)

For the purposes of optimizing the cost function (6), it will prove useful to exploit its linearity with respect to the kernel entries. The following Lemma is easily shown by enumeration.

Lemma 1. Let J_b(·) be defined as in (6); then J_b(·) is a linear function of the kernel entries k(x_i, x_j).
In particular, define K ∈ R^{n×n} with k_{i,j} = k(x_i, x_j), and A(S) ∈ R^{n×n} with entries a_{i,j}(S) = (2/(n|S|)) 1_[j ∈ S] − (1/|S|²) 1_[i ∈ S] 1_[j ∈ S]; then J_b(S) = ⟨A(S), K⟩.

3.1 Submodularity and Efficient Prototype Selection

While the discrete optimization problem (6) may be quite complicated to optimize, we show that the cost function J_b(S) is monotone submodular under conditions on the kernel matrix which are often satisfied in practice, and which can be easily checked given a kernel matrix. Based on this result, we describe the greedy forward selection algorithm for efficient prototype selection. Let F : 2^[n] → R represent a set function. F is normalized if F(∅) = 0. F is monotonic if, for all subsets U ⊆ V ⊆ 2^[n], it holds that F(U) ≤ F(V). F is submodular if, for all subsets U, V ∈ 2^[n], it holds that F(U ∪ V) + F(U ∩ V) ≤ F(U) + F(V). Submodular functions have a diminishing returns property (Nemhauser et al., 1978), i.e. the marginal gain of adding elements decreases with the size of the set. When F is submodular, −F is supermodular (and vice versa).
We prove submodularity for a larger class of problems, then show submodularity of (6) as a special case. Our proof for the larger class may be of independent interest. In particular, the following Theorem considers general discrete optimization problems which are linear matrix functionals, and shows sufficient conditions on the matrix for the problem to be monotone and/or submodular.

Theorem 2 (Monotone Submodularity for Linear Forms). Let H ∈ R^{n×n} (not necessarily symmetric) be element-wise non-negative and bounded, with upper bound h* = max_{i,j ∈ [n]} h_{i,j} > 0. Further, construct the binary matrix representation of the indices that achieve the maximum as E ∈ {0, 1}^{n×n}, with e_{i,j} = 1 if h_{i,j} = h* and e_{i,j} = 0 otherwise, and its complement Ē = 1 − E with the corresponding index set Ē = {(i, j) s.t. e_{i,j} = 0}. Given the ground set 𝒮 ⊆ 2^[n], consider the linear form F(H, S) = ⟨A(S), H⟩ ∀ S ∈ 𝒮. Given m = |S|, define the functions:

α(n, m) = (a(S ∪ {u}) − a(S)) / b(S),
β(n, m) = (a(S ∪ {u}) + a(S ∪ {v}) − a(S ∪ {u, v}) − a(S)) / (b(S ∪ {u, v}) + b(S)),   (8)

where a(S) = F(E, S) and b(S) = F(Ē, S), for all u, v ∉ S (additional notation suppressed in α(·) and β(·) for clarity). Let m* = max_{S ∈ 𝒮} |S| be the maximal cardinality of any element in the ground set.

1. If 0 ≤ h_{i,j} ≤ h* α(n, m) ∀ (i, j) ∈ Ē, then F(H, S) is monotone.
2. If 0 ≤ h_{i,j} ≤ h* β(n, m) ∀ (i, j) ∈ Ē, then F(H, S) is submodular.

Finally, we consider a special case of Theorem 2 for the MMD.

Corollary 3 (Monotone Submodularity for MMD). Let the kernel matrix K ∈ R^{n×n} be element-wise non-negative, with equal diagonal terms k_{i,i} = k > 0 ∀ i ∈ [n], and be diagonally dominant. If the off-diagonal terms k_{i,j}, i, j ∈ [n], i ≠ j, satisfy 0 ≤ k_{i,j} ≤ k / (n³ + 2n² − 2n − 3), then J_b(S) given by (6) is monotone submodular.

The diagonal dominance condition expressed by Corollary 3 is easy to check given a kernel matrix. We also note that the conditions can be significantly weakened if one determines the required number of prototypes m* = max_S |S| ≤ n a priori.
This is further simplified for the MMD, since the bounds (8) are both monotonically decreasing functions of m, so the condition need only be checked for m*. Observe that diagonal dominance is not a necessary condition, as the more general approach in Theorem 2 allows arbitrarily indexed maximal entries in the kernel. Diagonal dominance is assumed to simplify the resulting expressions. Perhaps more important to practice is our observation that the diagonal dominance condition expressed by Corollary 3 is satisfied by parametrized kernels with appropriately selected parameters. We provide examples for radial basis function (RBF) kernels and powers of positive standardized kernels. Further examples and more general conditions are left for future work.

Example 4 (Radial Basis Function Kernel). Consider the radial basis function kernel K with entries k_{i,j} = k(x_i, x_j) = exp(−γ‖x_i − x_j‖²), evaluated on a sample X with non-duplicate points, i.e. x_i ≠ x_j ∀ x_i, x_j ∈ X. The off-diagonal kernel entries k_{i,j}, i ≠ j, monotonically decrease with respect to increasing γ. Thus, there exists γ* such that Corollary 3 is satisfied for all γ ≥ γ*.

Example 5 (Powers of Positive Standardized Kernels). Consider an element-wise positive kernel matrix G standardized to be element-wise bounded 0 ≤ g_{i,j} < 1, with unitary diagonal g_{i,i} = 1 ∀ i ∈ [n]. Define the kernel power K with k_{i,j} = g_{i,j}^p. The off-diagonal kernel entries k_{i,j}, i ≠ j, monotonically decrease with respect to increasing p. Thus, there exists p* such that Corollary 3 is satisfied for all p ≥ p*.

Beyond the examples outlined here, similar conditions can be enumerated for a wide range of parametrized kernel functions, and are easily checked for model-based kernels, e.g. the Fisher kernel (Jaakkola et al., 1999), useful for comparing data points based on similarity with respect to a probabilistic model. Our interpretation of these examples is that the conditions of Corollary 3 are not excessively restrictive.
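As a rough sketch of how the kernel condition of Corollary 3 and Example 4 might be checked in practice (the helper names and the doubling search over γ are our own illustration, not part of the paper), one can verify the off-diagonal bound directly and grow γ until it holds:

```python
import numpy as np

def satisfies_corollary3(K):
    """Check the sufficient condition of Corollary 3: non-negative off-diagonal
    entries bounded by k / (n^3 + 2n^2 - 2n - 3), where k is the (equal) diagonal."""
    n = K.shape[0]
    k = K[0, 0]
    off_diag = K[~np.eye(n, dtype=bool)]
    bound = k / (n**3 + 2 * n**2 - 2 * n - 3)
    return bool(np.all(off_diag >= 0) and np.all(off_diag <= bound))

def find_gamma(X, gamma0=1.0, max_doublings=60):
    """Double gamma until the RBF kernel on X satisfies Corollary 3 (cf. Example 4)."""
    D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)  # pairwise squared distances
    gamma = gamma0
    for _ in range(max_doublings):
        K = np.exp(-gamma * D2)
        if satisfies_corollary3(K):
            return gamma, K
        gamma *= 2.0
    raise RuntimeError("no suitable gamma found (duplicate points?)")
```

The search terminates for any sample with distinct points, since the off-diagonal entries decay to zero as γ grows (Example 4).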
While constrained maximization of submodular functions is generally NP-hard, the simple greedy forward selection heuristic has been shown to perform almost as well as the optimal in practice, and is known to have strong theoretical guarantees.

Theorem 6 (Nemhauser et al. (1978)). For any normalized, monotonic submodular function F, the set S* obtained by the greedy algorithm achieves at least a constant fraction (1 − 1/e) of the objective value obtained by the optimal solution, i.e. F(S*) ≥ (1 − 1/e) max_{|S| ≤ m*} F(S).
In addition, no polynomial time algorithm can provide a better approximation guarantee unless P = NP (Feige, 1998). An additional benefit of the greedy approach is that it does not require the number of prototypes to be decided at training time: assuming the kernel satisfies the appropriate conditions, training can be stopped at any m* based on computational constraints, while still returning meaningful results. The greedy algorithm is outlined in Algorithm 1.

Algorithm 1 Greedy algorithm, max F(S) s.t. |S| ≤ m*
  Input: m*, S = ∅
  while |S| < m* do
    foreach i ∈ [n]\S: f_i = F(S ∪ {i}) − F(S)
    S = S ∪ {argmax_i f_i}
  end while
  Return: S.

3.2 Model Criticism

In addition to selecting prototype samples, MMD-critic characterizes the data points not well explained by the prototypes, which we call the model criticism. These data points are selected as the largest values of the witness function (5), i.e. where the similarity between the dataset and the prototypes deviates the most. Consider the cost function:

L(C) = Σ_{l ∈ C} | (1/n) Σ_{i ∈ [n]} k(x_i, x_l) − (1/|S|) Σ_{j ∈ S} k(x_j, x_l) |.   (9)

The absolute value ensures that we measure both positive deviations f(x) > 0, where the prototypes underfit the density of the samples, and negative deviations f(x) < 0, where the prototypes overfit the density of the samples. Thus, we focus primarily on the magnitude of the deviation, rather than its sign. The following theorem shows that (9) is a linear function of C.

Theorem 7. The criticism function L(C) is a linear function of C.

We found that the addition of a regularizer which encourages a diverse selection of criticism points improved performance. Let r : 2^[n] → R represent a regularization function. We select the criticism points as the maximizers of the cost function:

max_{C ⊆ [n]\S, |C| ≤ c*} L(C) + r(K, C),   (10)

where [n]\S denotes all indices which do not include the prototypes, and c* is the number of criticism points desired. Fortunately, due to the linearity of (9), the optimization function (10) is submodular whenever the regularization function is submodular.
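A minimal sketch of the cost function (6) and the greedy forward selection of Algorithm 1 (illustrative code under our own naming, assuming a precomputed kernel matrix `K`; this naive version rescores J_b from scratch for every candidate rather than using incremental marginal-gain updates):

```python
import numpy as np

def J_b(K, S):
    """Cost function of Eq. (6) for prototype indices S, given kernel matrix K:
    average similarity of all points to S minus average within-S similarity."""
    n = K.shape[0]
    m = len(S)
    if m == 0:
        return 0.0
    S = list(S)
    return (2.0 / (n * m)) * K[:, S].sum() - K[np.ix_(S, S)].sum() / m**2

def greedy_prototypes(K, m_star):
    """Algorithm 1: greedy forward selection of up to m_star prototype indices."""
    n = K.shape[0]
    S = []
    for _ in range(m_star):
        candidates = [i for i in range(n) if i not in S]
        S.append(max(candidates, key=lambda i: J_b(K, S + [i])))
    return S
```

On a dataset with two well-separated clusters, the first two greedy picks land one in each cluster, since covering a new cluster raises the coverage term of (6) far more than adding a redundant neighbor.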
We encourage the use of regularizers which incorporate diversity into the criticism selection. We found the best qualitative performance using the log-determinant regularizer (Krause et al., 2008). Let K_{C,C} be the sub-matrix of K corresponding to the pairs of indices in C × C; then the log-determinant regularizer is given by:

r(K, C) = log det K_{C,C},   (11)

which is known to be submodular. Further, several researchers have found, both in theory and practice (Sharma et al., 2015), that greedy optimization is an effective strategy for this kind of optimization. We apply the greedy algorithm for criticism selection with the function F(C) = L(C) + r(K, C).

4 Related Work

There is a large literature on techniques for selecting prototypes that summarize a dataset, and a full literature survey is beyond the scope of this manuscript. Instead, we overview a few of the most relevant references. K-medoid clustering (Kaufman and Rousseeuw, 1987) is a classic technique for selecting a representative subset of data points, and can be solved using various iterative algorithms. K-medoid clustering is quite similar to K-means clustering, with the additional condition that the presented prototypes must be in the dataset. The ubiquity of large datasets has led to a resurgence
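Criticism selection per (9)-(11) can be sketched similarly (our own illustrative code, not the authors'): score each non-prototype point by its absolute witness value, then greedily add the point that maximizes the witness total plus the log-determinant of the criticism sub-kernel; `S` is the list of previously selected prototype indices:

```python
import numpy as np

def greedy_criticisms(K, S, c_star):
    """Greedy criticism selection for Eq. (10): absolute witness score (9)
    plus the log-determinant diversity regularizer (11)."""
    n = K.shape[0]
    # |witness| per point: mean similarity to all points minus mean similarity to prototypes
    w = np.abs(K.mean(axis=0) - K[list(S)].mean(axis=0))
    C = []
    candidates = [i for i in range(n) if i not in S]
    for _ in range(c_star):
        def objective(c):
            idx = C + [c]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            return w[idx].sum() + (logdet if sign > 0 else -np.inf)
        C.append(max((i for i in candidates if i not in C), key=objective))
    return C
```

On data with an outlier far from every prototype, the first criticism selected is the outlier, since its witness magnitude dominates.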
of interest in the data summarization problem, also known as the set cover problem. Progress has included novel cost functions and algorithms for several domains, including image summarization (Simon et al., 2007) and document summarization (Lin and Bilmes, 2011). Recent innovations also include highly scalable and distributed algorithms (Badanidiyuru et al., 2014; Mirzasoleiman et al., 2015). There is also a large literature on variations of the set cover problem tuned for classification, such as the cover digraph approach of Priebe et al. (2003) and prototype selection for interpretable classification (Bien and Tibshirani, 2011), which involves selecting prototypes that maximize the coverage within the class, but minimize the coverage across classes.

Submodular/supermodular functions are well studied in the combinatorial optimization literature, with several scalable algorithms that come with optimization-theoretic optimality guarantees (Nemhauser et al., 1978). In the Bayesian modeling literature, submodular optimization has previously been applied for approximate inference by Koyejo et al. (2014). The technical conditions required for submodularity of (6) are due to the averaging of the kernel similarity scores, as the average requires a division by the cardinality |S|. In particular, the analogue of (6) which replaces all the averages by sums (i.e. removes all division by |S|) is equivalent to the well-known submodular functions previously used for scene (Simon et al., 2007) and document (Lin and Bilmes, 2011) summarization, given by:

(2/n) Σ_{i ∈ [n], j ∈ S} k(x_i, y_j) − λ Σ_{i,j ∈ S} k(y_i, y_j),

where λ > 0 is a regularization parameter. The resulting function is known to be submodular when the kernel is element-wise positive, i.e. without the need for additional diagonal dominance conditions. On the other hand, the averaging has a desirable built-in balancing effect. When using the sum, practitioners must tune the additional regularization parameter λ to achieve a similar balance.
5 Results

We present results for the proposed MMD-critic using the USPS handwritten digits (Hull, 1994) and ImageNet (Deng et al., 2009) datasets. We quantitatively evaluate the prototypes in terms of predictive quality compared to related baselines on the USPS handwritten digits dataset. We also present preliminary results from a human subject pilot study. Our results suggest that the model criticism, which is unique to the proposed MMD-critic, is especially useful to facilitate human understanding. For all datasets, we employed the radial basis function (RBF) kernel with entries k_{i,j} = k(x_i, x_j) = exp(−γ‖x_i − x_j‖²), which satisfies the conditions of Corollary 3 for sufficiently large γ (cf. Example 4; see Example 5 and the following discussion for alternative feasible kernels).

The Nearest Prototype Classifier: While our primary interest is in interpretable prototype selection and criticism, prototypes may also be useful for speeding up memory-based machine learning techniques such as the nearest neighbor classifier, by restricting the neighbor search to the prototypes; this is sometimes known as the nearest prototype classifier (Bien and Tibshirani, 2011; Kuncheva and Bezdek, 1998). This classification provides an objective (although indirect) evaluation of the quality of the selected prototypes, and is useful for setting hyperparameters. We employ a 1-nearest-neighbor classifier using the Hilbert space distance induced by the kernels. Let y_i ∈ [k] denote the label associated with each prototype i ∈ S, for k classes. As we employ normalized kernels (where the diagonal is 1), it is sufficient to measure the pairwise kernel similarity. Thus, for a test point x̂, the nearest prototype classifier reduces to:

ŷ = y_{i*}, where i* = argmin_{i ∈ S} ‖x̂ − x_i‖²_{H_K} = argmax_{i ∈ S} k(x̂, x_i).

5.1 MMD-critic Evaluated on the USPS Digits Dataset

The USPS handwritten digits dataset (Hull, 1994) consists of n = 7291 training (and 2007 test) greyscale images of 10 handwritten digits from 0 to 9.
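The nearest prototype rule above reduces, for a normalized kernel, to an argmax of kernel similarity between the test point and the prototypes; a minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def nearest_prototype_predict(K_test_proto, proto_labels):
    """Nearest prototype classification: with a normalized kernel, the argmin
    of the RKHS distance equals the argmax of kernel similarity k(x_hat, x_i).
    K_test_proto[t, i] holds k(test point t, prototype i)."""
    return proto_labels[np.argmax(K_test_proto, axis=1)]
```

For example, with two prototypes at 0 and 10 labeled 0 and 1, test points at 1 and 9 are assigned labels 0 and 1 respectively.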
We consider two kinds of RBF kernels: (i) global, where the pairwise kernel is computed between all data points, and (ii) local, given by exp(−γ‖x_i − x_j‖²) 1_[y_i = y_j], i.e. points in different classes are assigned a similarity score of zero. The local approach has the effect of pushing points in different classes further apart. The kernel hyperparameter γ was chosen to maximize the average cross-validated classification performance, then fixed for all other experiments.

Classification: We evaluated nearest prototype classifiers using MMD-critic, and compared to baselines (and reported performance) from Bien and Tibshirani (2011) (abbreviated as PS) and their implementation of K-medoids. Figure 1 (left) compares MMD-critic with global and local kernels to the baselines for different numbers of selected prototypes m* = |S|. Our results show comparable (or improved) performance as compared to other models. In particular, we observe that the global kernels outperform the local kernels² by a small margin. We note that MMD is particularly effective at selecting the first few prototypes (i.e. the speed of error reduction as the number of prototypes increases), suggesting its utility for rapidly summarizing the dataset.

Figure 1: Classification error vs. number of prototypes m* = |S|. MMD-critic shows comparable (or improved) performance as compared to other models (left). Random subset of prototypes and criticism from the USPS dataset (right).

Selected Prototypes and Criticism: Fig. 1 (right) presents a randomly selected subset of the prototypes and criticism from MMD-critic using the local kernel. We observe that the prototypes capture many of the common ways of writing digits, while the criticism clearly capture outliers.

5.2 Qualitative Measure: Prototypes and Criticisms of Images

In this section, we learn prototypes and criticisms from the ImageNet dataset (Russakovsky et al., 2015) using image embeddings from He et al. (2015). Each image is represented by a 2048-dimensional vector embedding, and each image belongs to one of 1000 categories. We select two breeds of one category (e.g., Blenheim spaniel) and run MMD-critic to learn prototypes and criticisms. As shown in Figure 2, MMD-critic learns reasonable prototypes and criticisms for the two dog breeds. On the left, criticisms picked out the different coloring (the second criticism is a black and white picture), as well as pictures capturing movements of dogs (first and third criticisms). Similarly, on the right, criticisms capture the unusual, but potentially frequent, pictures of dogs in costumes (first and second criticisms).
5.3 Quantitative Measure: Prototypes and Criticisms Improve Interpretability

We conducted a human pilot study to collect objective and subjective measures of interpretability using MMD-critic. The experiment used the same dataset as Section 5.2. We define interpretability in this work as follows: a method is interpretable if a user can correctly and efficiently predict the method's results. Under this definition, we designed a predictive task to quantitatively evaluate interpretability. Given a randomly sampled data point, we measure how well a human can predict the group it belongs to (accuracy), and how fast they can perform the task (efficiency). We chose this dataset because the task of assigning a new image to a group requires the groups to be well-explained, but does not require specialized training.

We presented four conditions in the experiment: 1) raw images (Raw Condition); 2) prototypes only (Proto Only Condition); 3) prototypes and criticisms (Proto and Criticism Condition); 4) uniformly sampled data points per group (Uniform Condition). The Raw Condition contained 100 images per species (e.g., if a group contains 2 species, there are 200 images); the Proto Only, Proto and Criticism, and Uniform Conditions contain the same number of images.

² Note that the local kernel trivially achieves perfect accuracy. Thus, in order to measure generalization performance, we do not use class labels for local kernel test instances, i.e. we use the global kernel instead of the local kernel for test instances regardless of training.
Figure 2: Learned prototypes and criticisms from the ImageNet dataset (two types of dog breeds).

We used a within-subject design to minimize the effect of inter-participant variability, with a balanced Latin square to account for a potential learning effect. The four conditions were assigned to four participants (four males) in a balanced manner. Each subject answered 21 questions, where the first three questions are practice questions and not included in the analysis. Each question showed six groups (e.g., red fox, kit fox) of a species (e.g., fox), and a randomly sampled data point that belongs to one of the groups. Subjects were encouraged to answer the questions as quickly and accurately as possible. A break was imposed after each question to mitigate the potential effect of fatigue. We measured the accuracy of answers as well as the time taken to answer each question. Participants were also asked to respond to ten 5-point Likert scale survey questions about their subjective measures of accuracy and efficiency. Each survey question compared a pair of conditions (e.g., "Condition A was more helpful than Condition B to correctly (or efficiently) assign the image to a group").

Subjects performed best using the Proto and Criticism Condition (M=87.5%, SD=20%). Performance with the Proto Only Condition was relatively similar (M=75%, SD=41%), while that with the Uniform Condition (M=55%, SD=38%, 37% decrease) and the Raw Condition (M=56%, SD=33%, 36% decrease) was substantially lower. In terms of speed, subjects were most efficient using the Proto Only Condition (M=1.04 mins/question, SD=0.28, 44% decrease compared to the Raw Condition), followed by the Uniform Condition (M=1.31 mins/question, SD=0.59) and the Proto and Criticism Condition (M=1.37 mins/question, SD=0.8). Subjects spent the most time with the Raw Condition (M=1.86 mins/question, SD=0.67). Subjects indicated a preference for the Proto and Criticism Condition over the Raw Condition and the Uniform Condition.
In a survey question asking to compare the Proto and Criticism Condition and the Raw Condition, a subject added that [the Proto and Criticism Condition resulted in] "less confusion from trying to discover hidden patterns in a ton of images, more clues indicating what features are important". In particular, in a question asking to compare the Proto and Criticism Condition and the Proto Only Condition, a subject said that "the addition of criticisms made it easier to locate the defining features of the cluster within the prototypical images". The humans' superior performance with prototypes and criticism in this preliminary study shows that providing criticisms together with prototypes is a promising direction for improving interpretability.

6 Conclusion

We present MMD-critic, a scalable framework for prototype and criticism selection to improve the interpretability of complex data distributions. To the best of our knowledge, ours is the first work which leverages the BMC framework to generate explanations. Further, MMD-critic shows competitive performance as a nearest prototype classifier compared to existing methods. When criticism is given together with prototypes, a human pilot study suggests that humans are better able to perform a predictive task that requires the data distributions to be well-explained. This suggests that criticism and prototypes are a step towards improving the interpretability of complex data distributions. For future work, we hope to further explore the properties of MMD-critic, such as the effect of the choice of kernel, and weaker conditions on the kernel matrix for submodularity. We plan to explore applications to larger datasets, aided by recent work on distributed algorithms for submodular optimization. We also intend to complete a larger-scale user study on how criticism and prototypes presented together affect human understanding.
References

A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 1994.
A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In KDD, 2014.
I. Bichindaritz and C. Marling. Case-based reasoning in the health sciences: What's next? AI in Medicine, 2006.
J. Bien and R. Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, 2011.
R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.
M.S. Cohen, J.T. Freeman, and S. Wolf. Metarecognition in time-stressed decision making: Recognizing, critiquing, and correcting. Human Factors, 1996.
J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
U. Feige. A threshold of ln n for approximating set cover. JACM, 1998.
A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian data analysis. Taylor & Francis, 2014.
A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. JMLR, 2008.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint, 2015.
J.J. Hull. A database for handwritten text recognition research. TPAMI, 1994.
T.S. Jaakkola, D. Haussler, et al. Exploiting generative models in discriminative classifiers. In NIPS, 1999.
L. Kaufman and P. Rousseeuw. Clustering by means of medoids. North-Holland, 1987.
B. Kim, C. Rudin, and J.A. Shah. The Bayesian Case Model: A generative approach for case-based reasoning and prototype classification. In NIPS, 2014.
O.O. Koyejo, R. Khanna, J. Ghosh, and R. Poldrack. On prior distributions and approximate inference for structured variables. In NIPS, 2014.
A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008.
L.I. Kuncheva and J.C. Bezdek. Nearest prototype classification: clustering, genetic algorithms, or random search? IEEE Transactions on Systems, Man, and Cybernetics, 28(1), 1998.
H. Lin and J. Bilmes. A class of submodular functions for document summarization. In ACL, 2011.
J.R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In NIPS, 2015.
B. Mirzasoleiman, A. Karbasi, A. Badanidiyuru, and A. Krause. Distributed submodular cover: Succinctly summarizing massive data. In NIPS, 2015.
G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 1978.
A. Newell and H.A. Simon. Human problem solving. Prentice-Hall, Englewood Cliffs, 1972.
C.E. Priebe, D.J. Marchette, J.G. DeVinney, and D.A. Socolinsky. Classification using class cover catch digraphs. Journal of Classification, 2003.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
D. Sharma, A. Kapoor, and A. Deshpande. On greedy maximization of entropy. In ICML, 2015.
I. Simon, N. Snavely, and S.M. Seitz. Scene summarization for online image collections. In ICCV, 2007.
K.R. Varshney. Engineering safety in machine learning. arXiv preprint, 2016.
Proof of Theorem 2

Observe that from the element-wise bounds on $H$, the following element-wise inequality holds: $h^* E \le H \le h^* E + \nu \bar{E}$. Thus, from the linearity of $F(H, S) = \langle A(S), H \rangle$ with respect to $H$, we have that:
$$F(h^* E, S) \le F(H, S) \le F(h^* E + \nu \bar{E}, S),$$
where (by linearity) $F(h^* E + \nu \bar{E}, S) = h^* F(E, S) + \nu F(\bar{E}, S)$. Next, employing the terms
$$a(S) = F(E, S) = \langle A(S), E \rangle \quad \text{and} \quad b(S) = F(\bar{E}, S) = \langle A(S), \bar{E} \rangle,$$
we may rewrite the bounds as:
$$h^* a(S) \le F(H, S) \le h^* a(S) + \nu b(S).$$

Monotonicity: The function $F(H, S)$ is monotone with respect to $S$ if $F(H, S \cup \{u\}) - F(H, S) \ge 0$. Applying the lower and upper bounds, we have that:
$$F(H, S \cup \{u\}) - F(H, S) \ge h^* a(S \cup \{u\}) - h^* a(S) - \nu b(S) \ge 0$$
$$\impliedby \nu \le h^* \frac{a(S \cup \{u\}) - a(S)}{b(S)} = h^* \alpha(n, m).$$
Thus, when the off-diagonal terms satisfy $0 \le h_{i,j} \le h^* \alpha(n, m)$ for all $(i, j) \in \bar{E}$, we have that $F(H, S)$ is monotone.

Submodularity: The function $F(H, S)$ is submodular with respect to $S$ if $F(H, S \cup \{u\}) + F(H, S \cup \{v\}) \ge F(H, S \cup \{u, v\}) + F(H, S)$. Again, applying the lower and upper bounds, we have that:
$$F(H, S \cup \{u\}) + F(H, S \cup \{v\}) - F(H, S \cup \{u, v\}) - F(H, S) \ge h^* a(S \cup \{u\}) + h^* a(S \cup \{v\}) - h^* a(S \cup \{u, v\}) - \nu b(S \cup \{u, v\}) - h^* a(S) - \nu b(S) \ge 0$$
$$\impliedby \nu \le h^* \frac{a(S \cup \{u\}) + a(S \cup \{v\}) - a(S \cup \{u, v\}) - a(S)}{b(S \cup \{u, v\}) + b(S)} = h^* \beta(n, m).$$
Thus, when the off-diagonal terms satisfy $0 \le h_{i,j} \le h^* \beta(n, m)$ for all $(i, j) \in \bar{E}$, we have that $F(H, S)$ is submodular.

Proof of Corollary 3

Based on the diagonal dominance assumption on $K$, it is clear that $\bar{E} = \{(i, j) \in [n] \times [n] : i \ne j\}$ indexes the off-diagonal terms, and $E = \mathbf{1} - \bar{E} = I$. Given $A(S)$ with entries $a_{i,j}(S) = \frac{2}{n|S|} \mathbb{1}[j \in S] - \frac{1}{|S|^2} \mathbb{1}[i \in S] \mathbb{1}[j \in S]$, we can compute the bounds (8) simply by enumerating sums (writing $m = |S|$):
$$a(S) = \langle A(S), I \rangle = \frac{2m}{nm} - \frac{m}{m^2} = \frac{2}{n} - \frac{1}{m},$$
$$b(S) = \langle A(S), \mathbf{1} - I \rangle = \frac{2m(n-1)}{nm} - \frac{m(m-1)}{m^2} = \frac{2(n-1)}{n} - \frac{m-1}{m} = 1 - \frac{2}{n} + \frac{1}{m}.$$

Monotonicity: $J_p(\cdot)$ is monotone when the upper bound on the off-diagonal terms is given by $\alpha(n, m) = \frac{a(S \cup \{u\}) - a(S)}{b(S)}$ by Theorem 2. We have that:
$$a(S \cup \{u\}) - a(S) = \frac{1}{m(m+1)}, \qquad b(S) = 1 - \frac{2}{n} + \frac{1}{m}.$$
Thus:
$$\alpha(n, m) = \frac{n}{(m+1)\left((n-2)m + n\right)}.$$
This is a decreasing function w.r.t. $m$. Further, for the ground set $2^{[n]}$, we have that $m = n$, and $\alpha(n, n) = \frac{1}{n^2 - 1}$.

Submodularity: $J_p(\cdot)$ is submodular when the upper bound on the off-diagonal terms is given by $\beta(n, m) = \frac{a(S \cup \{u\}) + a(S \cup \{v\}) - a(S \cup \{u, v\}) - a(S)}{b(S \cup \{u, v\}) + b(S)}$ by Theorem 2. We have that:
$$a(S \cup \{u\}) + a(S \cup \{v\}) - a(S \cup \{u, v\}) - a(S) = \frac{2}{m(m+1)(m+2)},$$
$$b(S \cup \{u, v\}) + b(S) = 2 - \frac{4}{n} + \frac{1}{m} + \frac{1}{m+2}.$$
Thus:
$$\beta(n, m) = \frac{n}{(m+1)\left(n(m^2 + 3m + 1) - 2(m^2 + 2m)\right)}.$$
This is a decreasing function w.r.t. $m$. Further, for the ground set $2^{[n]}$, we have that $m = n$, and $\beta(n, n) = \frac{1}{n^3 + 2n^2 - 2n - 3}$.

Combined bound: Finally, we show that $\beta(n, n) \le \alpha(n, n)$, so that the bound $k_{i,j} \le k^* \beta(n, n)$ is sufficient to guarantee both monotonicity and submodularity:
$$\beta(n, n) \le \alpha(n, n) \iff \frac{1}{n^3 + 2n^2 - 2n - 3} \le \frac{1}{n^2 - 1} \iff n^2 - 1 \le n^3 + 2n^2 - 2n - 3$$
$$\iff 0 \le n^3 + n^2 - 2n - 2 = (n+1)(n^2 - 2),$$
which holds for all $n \ge 2$. Thus $\beta(n, n) \le \alpha(n, n)$. The proof is complete.

Proof of Theorem 7

A discrete function is linear if it can be written in the form $F(C) = \sum_{i \in [n]} w_i \mathbb{1}[i \in C]$. Consider (9) and observe that:
$$L(C) = \sum_{l \in C} \left| \frac{1}{n} \sum_{i \in [n]} k(x_i, x_l) - \frac{1}{m} \sum_{j \in S} k(x_j, x_l) \right| = \sum_{l \in [n]} \left| \frac{1}{n} \sum_{i \in [n]} k(x_i, x_l) - \frac{1}{m} \sum_{j \in S} k(x_j, x_l) \right| \mathbb{1}[l \in C] = \sum_{l \in [n]} w_l \mathbb{1}[l \in C],$$
where:
$$w_l = \left| \frac{1}{n} \sum_{i \in [n]} k(x_i, x_l) - \frac{1}{m} \sum_{j \in S} k(x_j, x_l) \right|.$$
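As an informal sanity check on Corollary 3 (not part of the paper), the combined bound can be verified numerically: build a kernel matrix with unit diagonal ($k^* = 1$) whose off-diagonal entries lie in $[0, \beta(n, n)]$, then confirm monotonicity and diminishing returns of the prototype objective by brute force over all nonempty subsets of a small ground set. The function names `beta_nn` and `J` below are our own.

```python
import itertools
import random

def beta_nn(n):
    # Combined bound from Corollary 3: beta(n, n) = 1 / (n^3 + 2n^2 - 2n - 3).
    return 1.0 / (n**3 + 2 * n**2 - 2 * n - 3)

def J(K, n, S):
    # Prototype objective: J(S) = 2/(n|S|) sum_{i in [n], j in S} k_ij
    #                             - 1/|S|^2 sum_{i, j in S} k_ij.
    if not S:
        return 0.0
    m = len(S)
    t1 = sum(K[i][j] for i in range(n) for j in S)
    t2 = sum(K[i][j] for i in S for j in S)
    return 2.0 * t1 / (n * m) - t2 / (m * m)

n = 4
random.seed(0)
nu = beta_nn(n)  # upper bound on off-diagonal entries, with k* = 1
K = [[0.0] * n for _ in range(n)]
for i in range(n):
    K[i][i] = 1.0
    for j in range(i):
        K[i][j] = K[j][i] = random.uniform(0.0, nu)  # symmetric, in [0, nu]

subsets = [frozenset(c) for r in range(1, n + 1)
           for c in itertools.combinations(range(n), r)]
eps = 1e-12

# Monotone on nonempty sets (the m >= 1 regime of the proof):
# adding an element never decreases J.
monotone = all(J(K, n, S | {u}) >= J(K, n, S) - eps
               for S in subsets for u in range(n) if u not in S)

# Submodular: diminishing returns over nested nonempty sets S <= T.
submodular = all(
    J(K, n, S | {u}) - J(K, n, S) >= J(K, n, T | {u}) - J(K, n, T) - eps
    for S in subsets for T in subsets if S <= T
    for u in range(n) if u not in T)

print(monotone, submodular)  # → True True
```

With the off-diagonal entries capped at $\beta(n, n)$, both checks pass, matching the corollary; raising the cap well above the bound can break the diminishing-returns check, which is the point of the condition.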