Credal Ensembles of Classifiers

G. Corani, A. Antonucci
IDSIA - Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, CH-6928 Manno (Lugano), Switzerland

Abstract

It is studied how to aggregate the probabilistic predictions generated by different SPODE (Super-Parent-One-Dependence Estimator) classifiers. It is shown that aggregating such predictions via compression-based weights achieves a slight but consistent improvement of performance over previously existing aggregation methods, including Bayesian Model Averaging and simple average (the approach adopted by the AODE algorithm). Attention is then given to the problem of choosing the prior probability distribution over the models, an important issue in any Bayesian ensemble of models. To deal robustly with this choice, the single prior over the models is substituted by a set of priors over the models (credal set), thus obtaining a credal ensemble of Bayesian classifiers. The credal ensemble recognizes the prior-dependent instances, namely the instances whose most probable class varies when different priors over the models are considered. When faced with prior-dependent instances, the credal ensemble remains reliable by returning a set of classes rather than a single class. Two credal ensembles of SPODEs are developed; the first generalizes Bayesian Model Averaging and the second the compression-based aggregation. Extensive experiments show that the novel ensembles compare favorably to traditional methods for aggregating SPODEs and also to previous credal classifiers.

Keywords: classification, Bayesian model averaging, compression-based averaging, AODE, credal classification, imprecise probability, credal ensemble
1. Introduction

Bayesian model averaging (BMA) (Hoeting et al., 1999) is a sound approach to address the uncertainty which characterizes the identification of the supposedly best model: given a set of alternative models, BMA weights the inferences produced by the models using the models' posterior probabilities as weights. BMA assumes one of the models in the ensemble to be the true one; under this assumption, BMA is expected to achieve better predictive accuracy than the choice of any single model (Hoeting et al., 1999). However, this assumption is usually violated; as a consequence, BMA generally does not achieve high predictive performance in experiments. The problem is that

Corresponding author: giorgio@idsia.ch
Preprint submitted to Computational Statistics & Data Analysis, November 9, 2012

BMA gets excessively concentrated around the single most probable model (Domingos, 2000; Minka, 2002): especially on large data sets, averaging using the posterior probabilities to weight the models is almost the same as selecting the MAP model (Boullé, 2007, page 1672), where the MAP model is the most probable model a posteriori. To address this problem, the compression-based approach (Boullé, 2007) applies a logarithmic smoothing to the models' posterior probabilities, thus computing weights that are more evenly distributed. The compression-based weights can be justified from an information-theoretic viewpoint and have been used to combine naive Bayes classifiers characterized by different feature sets, obtaining excellent results in prediction contests (Boullé, 2007). The Averaged One-Dependence Estimators (AODE) classifier (Webb et al., 2005) is an ensemble of SPODE (Super-Parent-One-Dependence Estimator) classifiers. Each SPODE adopts a certain feature as super-parent, namely it assumes all the other features to depend on this common feature (the super-parent) in addition to the class. The AODE ensemble simply averages the posterior probabilities computed by the different SPODEs; remarkably, AODE is more effective than the majority of rival schemes for aggregating SPODEs, including BMA (Yang et al., 2007). As a preliminary step we develop BMA-AODE, namely BMA over SPODEs, with some computational differences with respect to the previous frameworks of Yang et al. (2007) and Cerquides and de Mántaras (2005); we confirm, however, that BMA over SPODEs is outperformed by AODE. Then we develop the novel COMP-AODE classifier, which weights the SPODEs using the compression-based coefficients and yields a slight but consistent improvement in classification performance over AODE. Considering the high performance of AODE, this result is noteworthy. An important issue in any Bayesian ensemble of models is, however, the choice of the prior over the models.
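The contrast between BMA's posterior weights and the compression-based weights discussed above can be illustrated with a small Python sketch. The log-likelihood values are hypothetical, and the compression formula is simplified (prior terms omitted); this is an illustration, not the paper's experimental code:

```python
import math

def bma_weights(log_liks):
    """Posterior-probability weights under a uniform prior:
    a softmax of the log-likelihoods."""
    m = max(log_liks)
    liks = [math.exp(ll - m) for ll in log_liks]  # shift by max for stability
    total = sum(liks)
    return [l / total for l in liks]

def compression_weights(log_liks, ll_null):
    """Boulle-style weights, simplified: pi_j = 1 - LL_j / LL_0,
    kept if positive, then normalized."""
    pis = [max(0.0, 1.0 - ll / ll_null) for ll in log_liks]
    total = sum(pis)
    return [p / total for p in pis]

# Hypothetical conditional log-likelihoods of three models on a large data set.
lls = [-1000.0, -1010.0, -1020.0]
print(bma_weights(lls))                            # nearly all mass on model 1
print(compression_weights(lls, ll_null=-2000.0))   # much more even weights
```

Even modest log-likelihood gaps (here, 10 nats) concentrate the softmax weights almost entirely on the MAP model, while the logarithmic smoothing keeps the compression weights nearly uniform.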
Most commonly, one adopts a uniform prior, or a prior which favors simpler models over complex ones (Boullé, 2007). Although such choices are reasonable, the specification of any single prior implies some arbitrariness and entails the risk of drawing prior-dependent conclusions, especially on small data sets. In fact, the specification of the prior over the models is a serious issue for Bayesian ensembles. To overcome the problem, we take inspiration from the field of imprecise probability (IP) (Walley, 1991). IP approaches can be used to generalize Bayesian models, describing prior uncertainty by a set of prior distributions instead of a single prior; see Walley (1996) for a thorough discussion of the reasons for preferring IP approaches to the specification of a single prior. In particular, classifiers based on a set of priors are called credal classifiers; they automatically detect the prior-dependent instances, namely the instances whose most probable class varies under different priors. On the prior-dependent instances, traditional classifiers are unreliable (Corani and Zaffalon, 2008b); on the same instances, credal classifiers remain instead reliable by returning a set of classes. A survey of credal classifiers can be found in (Corani et al., 2012). In this paper, we extend the idea of credal model averaging (CMA) (Corani and Zaffalon, 2008a). Credal model averaging generalizes Bayesian model averaging: it substitutes the single prior over the models by a set of priors (credal set). CMA is therefore a credal classifier which can be seen as a credal ensemble of Bayesian classifiers. In (Corani and Zaffalon, 2008a), the application of CMA was limited to the case of naive Bayes.

In this paper we develop BMA-AODE* and COMP-AODE*, namely the credal ensembles which generalize, respectively, BMA-AODE and COMP-AODE, substituting the uniform prior over the models by a credal set. By extensive experiments we show that both credal ensembles compare favorably to their single-prior counterparts and to previously existing credal classifiers, including the CMA of (Corani and Zaffalon, 2008a).

2. Methods

We consider a classification problem with k features; we denote by C the class variable (taking values in the set 𝒞) and by A := (A_1, ..., A_k) the set of features, taking values respectively in 𝒜_1, ..., 𝒜_k. For a generic variable A, we denote by P(A) the probability mass function over its values and by P(a) the probability that A = a. We assume the data to be complete and the training set D to contain n instances. We estimate the parameters of the SPODEs adopting the usual Bayesian approach for generative models (Heckerman, 1995), namely by setting Dirichlet priors and then taking expectations from the parameter posterior distribution. Under 0-1 loss, a traditional probabilistic classifier returns, for a test instance ã := {ã_1, ..., ã_k} whose class is unknown, the most probable class c*:

c* := arg max_{c ∈ 𝒞} P(c | ã).

Credal classifiers change this paradigm by occasionally returning more classes; this happens in particular when the most probable class is prior-dependent. We discuss this point in more detail later, when presenting credal classifiers.

2.1. From Naive Bayes to AODE

The naive Bayes classifier assumes the stochastic independence of the features given the class; it therefore factorizes the joint probability as follows:

P(c, a) := P(c) ∏_{j=1}^{k} P(a_j | c),     (1)

corresponding to the topology of Fig. 1(a).
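As an illustration of the factorization (1) and of classification under 0-1 loss, here is a minimal self-contained Python sketch; the probability tables are hypothetical hand-set values, not learned from data:

```python
# Toy sketch of the naive Bayes factorization P(c, a) = P(c) * prod_j P(a_j | c).
prior = {'yes': 0.6, 'no': 0.4}
# cond[j][(a, c)] = P(A_j = a | C = c), for two binary features
cond = [
    {(0, 'yes'): 0.7, (1, 'yes'): 0.3, (0, 'no'): 0.2, (1, 'no'): 0.8},
    {(0, 'yes'): 0.9, (1, 'yes'): 0.1, (0, 'no'): 0.5, (1, 'no'): 0.5},
]

def joint(c, a):
    # P(c, a) = P(c) * prod_j P(a_j | c)
    p = prior[c]
    for j, aj in enumerate(a):
        p *= cond[j][(aj, c)]
    return p

def classify(a):
    # under 0-1 loss, return the most probable class (argmax of the joint)
    return max(prior, key=lambda c: joint(c, a))

print(classify((0, 0)))  # 'yes': 0.6*0.7*0.9 = 0.378 vs 0.4*0.2*0.5 = 0.04
```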
Despite the biased estimate of probabilities due to the above (so-called naive) assumption, naive Bayes performs well under 0-1 loss (Domingos and Pazzani, 1997); it thus constitutes a reasonable choice if the goal is simple classification, without the need for accurate probability estimates (Friedman, 1997). The naive independence assumption is relaxed, for instance, by the tree-augmented naive classifier (TAN), which allows the subgraph involving only the features to be a tree, thus allowing each feature to depend on the class and on another feature; an example is shown in Fig. 1(b). Generally, TAN outperforms naive Bayes in classification (Friedman et al., 1997). The AODE classifier (Webb et al., 2005) is an ensemble of k SPODE (Super-Parent-One-Dependence Estimator) classifiers; each SPODE is characterized by a certain

Figure 1: Naive Bayes vs TAN: (a) naive Bayes; (b) a possible TAN structure.
Figure 2: SPODE with super-parent A_1.

super-parent feature, so that the remaining features are children of both the class and the super-parent, as shown in Fig. 2. In fact, each single SPODE is a TAN. We denote the set of SPODEs as S := {s_1, ..., s_k}, where s_j indicates the SPODE with super-parent A_j. SPODE s_j factorizes the joint probability as:

P(c, a | s_j) = P(c) P(a_j | c) ∏_{l=1, l≠j}^{k} P(a_l | a_j, c).

In order to classify the test instance ã, AODE averages the posterior probabilities P(c | ã, s_j) computed by the single SPODEs:

P(c | ã) ∝ P(c, ã) := (1/k) ∑_{j=1}^{k} P(c, ã | s_j).

In this paper we focus on more sophisticated approaches for aggregating the predictions of the SPODEs.

2.2. Bayesian Model Averaging (BMA) with SPODEs

By ensembling the SPODEs via BMA, we assume one of the SPODEs to be the true model. We thus introduce a variable S taking values in S, where P(S = s_j) denotes the prior probability of SPODE s_j being the true model. Each SPODE has the same number of variables, the same number of arcs and the same in-degree (the maximum number of parents per node); we thus adopt a uniform prior, assigning prior probability 1/k to each SPODE. In fact, the uniform prior over the models is frequently adopted when

implementing BMA. To classify the test instance ã, BMA computes the following posterior mass function:

P(c | ã) := ∑_{j=1}^{k} P(c | ã, s_j) P(s_j | D) ∝ ∑_{j=1}^{k} P(c | ã, s_j) P(D | s_j) P(s_j),

where P(D | s_j) = ∫ P(D | s_j, θ_j) P(θ_j) dθ_j and θ_j denote, respectively, the marginal likelihood and the parameters of s_j. Under certain assumptions, the marginal likelihood can be analytically computed (Heckerman, 1995); this is the approach adopted by both (Cerquides and de Mántaras, 2005) and (Yang et al., 2007) to implement BMA over SPODEs. Both papers report that BMA over SPODEs is outperformed by AODE. This can be due, besides the already discussed excessive concentration of BMA on the MAP model, to the adoption of the marginal likelihood to compute BMA. The marginal likelihood measures how good the model is at representing the joint distribution; a classifier, instead, has to estimate the posterior probability of the classes conditional on the observed features. A high marginal likelihood therefore does not necessarily imply good classification performance (Cowell, 2001; Kontkanen et al., 1999). Following Boullé (2007), we thus substitute the marginal likelihood with the conditional likelihood:

L_j := ∏_{i=1}^{n} P(c^(i) | a^(i), s_j) = ∫ ∏_{i=1}^{n} P(c^(i) | a^(i), s_j, θ_j) P(θ_j | D) dθ_j.     (2)

We call BMA-AODE the classifier which estimates the posterior probabilities of the classes, given the test instance ã, as:

P(c | ã) ∝ ∑_{j=1}^{k} P(c | ã, s_j) L_j P(s_j).     (3)

Especially on large data sets, the likelihoods of the different SPODEs might differ by several orders of magnitude. We remove from the ensemble the SPODEs whose conditional likelihood is smaller than L_max / 10^4, where L_max is the maximum conditional likelihood among all SPODEs. Discarding models with very low posterior probability is in fact common when dealing with BMA, and can be seen as a belief revision (Dubois and Prade, 1997).
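A compact Python sketch of this weighting scheme, under a uniform prior (which cancels in the normalization); the conditional log-likelihoods are hypothetical, exponentiation is stabilized by shifting by the maximum, and SPODEs below the L_max/10^4 threshold are pruned:

```python
import math

def bma_aode_weights(cond_log_liks):
    """Sketch of BMA-AODE weighting: exponentiate the conditional
    log-likelihoods robustly (shift by the maximum), prune the SPODEs whose
    conditional likelihood falls below L_max / 10^4, then normalize."""
    m = max(cond_log_liks)
    liks = [math.exp(ll - m) for ll in cond_log_liks]   # ratios L_j / L_max
    liks = [l if l >= 1e-4 else 0.0 for l in liks]      # prune below L_max/10^4
    total = sum(liks)
    return [l / total for l in liks]

def bma_aode_predict(posteriors, weights):
    """Mix the per-SPODE class posteriors P(c | a, s_j) with the weights."""
    k = len(weights)
    n_classes = len(posteriors[0])
    return [sum(weights[j] * posteriors[j][c] for j in range(k))
            for c in range(n_classes)]

w = bma_aode_weights([-500.0, -502.0, -530.0])  # the third SPODE is pruned
print(w)
p = bma_aode_predict([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]], w)
print(p)
```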
Given the joint beliefs P(X, Y), the revision P̃(X, Y) induced by a marginal P̃(Y) is defined by P̃(x, y) := P(x | y) P̃(y). In other words, if P̃(y) is known to be a better model than P(y) for the marginal beliefs about y, this information can be used in the above described way to redefine the joint. Accordingly, in BMA-AODE, the marginal beliefs about S have been replaced by a better candidate, inducing a revision in the corresponding joint model.

Exponentiation of the Log-Likelihoods

Regardless of whether the marginal or the conditional likelihood is considered, it is common to compute the log-likelihood rather than the likelihood, in order to avoid numerical problems due to the multiplication of many probabilities. However, the log-likelihoods can become very negative on large data sets; in this case, their

exponentiation can incur numerical problems. This issue has forced, for instance, the usage of high numerical precision, causing a slowdown of the computation: "BMA often lead to arithmetic overflow when calculating very large exponentials or factorials. One solution is to use the Java class BigDecimal which unfortunately can be very slow." (Yang et al., 2007, page 1660). We instead adopt the procedure of Algorithm 1, communicated to us by Dr. Dash, who published several works on BMA (Dash and Cooper, 2004). The procedure robustly exponentiates the log-likelihoods using standard numerical precision.

Algorithm 1 Robust exponentiation of log-likelihoods.
Require: array log_liks of log-likelihoods, of length k.
  max_val = max(log_liks)
  for j = 1:k do
    tmp_liks(j) = exp(log_liks(j) - max_val)   {shifted exponents are ≤ 0: no overflow}
  end for
  total = sum(tmp_liks)
  for j = 1:k do
    liks(j) = tmp_liks(j) / total
  end for
  return liks {array of likelihoods, exponentiated and normalized}

2.3. BMA-AODE*: Extending BMA-AODE to Sets of Probabilities

With BMA-AODE* we extend BMA-AODE to imprecise probabilities (Walley, 1991), substituting the single prior mass function P(S) over the models by a set of priors (credal set). The credal set 𝒫(S), defined over S, lets the prior probability of each SPODE vary within a range rather than being fixed to a single number. BMA-AODE* is a credal ensemble of Bayesian classifiers, since the parameters of each SPODE are learned in the standard Bayesian way. In principle we could let the prior probability of each SPODE vary exactly between zero and one (vacuous model); yet, this would generate vacuous posterior inferences and prevent learning from data (Piatti et al., 2009). To obtain non-vacuous posterior inferences, we introduce a lower bound ε for the prior probability of each model. The credal set is defined by the following constraints:

𝒫(S) := { P(S) : P(s_j) ≥ ε, j = 1, ..., k;  ∑_{j=1}^{k} P(s_j) = 1 }.     (4)
The prior probability of each SPODE thus varies between ε and 1 − (k−1)ε.
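The credal set (4) is a polytope whose extreme points assign the maximum admissible mass 1 − (k−1)ε to one SPODE and ε to each of the others; since the dominance tests used later optimize objectives that are linear in P(S), their optima are attained at these vertices. A small Python sketch with illustrative values of k and ε:

```python
def credal_set_vertices(k, eps):
    """Extreme points of the credal set {P : P(s_j) >= eps, sum_j P(s_j) = 1}:
    each vertex puts mass 1-(k-1)*eps on one SPODE and eps on the rest."""
    top = 1.0 - (k - 1) * eps
    return [[top if i == j else eps for i in range(k)] for j in range(k)]

for v in credal_set_vertices(4, 0.05):
    print(v)
# every coordinate lies in [eps, 1-(k-1)*eps] and each vertex sums to one
```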

The credal set in (4) represents ignorance about the prior probability of each SPODE being the true model. Since 𝒫(S) is a set of prior mass functions, BMA-AODE* can be regarded as a set of BMA-AODE classifiers, each corresponding to a different prior. If the most probable class of an instance varies under different priors, the classification is prior-dependent. When dealing with prior-dependent instances, credal classifiers (Corani et al., 2012; Corani and Zaffalon, 2008b) become indeterminate, returning a set of classes instead of a single class. Before discussing how this set of classes is identified, we need to introduce the concept of credal dominance (or, for short, dominance): class c′ dominates class c″ if c′ is more probable than c″ under each prior of the credal set. If no class dominates c′, then c′ is non-dominated. Credal classifiers return all the non-dominated classes, which are identified by performing pairwise dominance tests among the classes. This criterion is called maximality (Walley, 1991, Section 3.9.2) and is described by Algorithm 2. If all classes are non-dominated, maximality requires performing |𝒞|(|𝒞|−1) tests; this is the worst case in terms of computational complexity. However, once a dominated class is identified, it can be ignored in the following tests, thus reducing the computational burden. While there exist different criteria for taking decisions under imprecise probability (Troffaes, 2007), maximality is the criterion commonly used by credal classifiers.

Algorithm 2 Identification of the non-dominated classes ND through maximality
  ND := 𝒞
  for c′ ∈ ND do
    for c″ ∈ ND (c″ ≠ c′) do
      check whether c′ dominates c″
      if c′ dominates c″ then
        ND ← ND \ {c″}
      end if
    end for
  end for
  return ND

Non-dominated classes are incomparable: there is no available information to rank them. Credal classifiers can thus be seen as dropping the dominated classes and expressing indecision about the non-dominated ones.
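Algorithm 2 and the dominance test can be sketched in Python as follows. Since the objective of each test is linear in P(S), it suffices to evaluate it at the extreme points of the credal set; the per-vertex weights and per-SPODE class posteriors below are toy values (in BMA-AODE* the weights at each vertex would be the products L_j P(s_j)):

```python
def dominates(c1, c2, post, weights_per_vertex):
    """c1 dominates c2 iff its weighted posterior exceeds that of c2 at every
    extreme point of the credal set (the objective is linear in the weights)."""
    k = len(post)
    return all(
        sum(w[j] * (post[j][c1] - post[j][c2]) for j in range(k)) > 0
        for w in weights_per_vertex)

def non_dominated(classes, post, vertices):
    """Algorithm 2 (maximality): drop every class dominated by another class."""
    nd = set(classes)
    for c1 in classes:
        for c2 in list(nd):
            if c1 != c2 and c1 in nd and c2 in nd and dominates(c1, c2, post, vertices):
                nd.discard(c2)
    return nd

# Two SPODEs, two classes; the two vertices correspond to eps = 0.1, k = 2.
post = [[0.8, 0.2], [0.3, 0.7]]
verts = [[0.9, 0.1], [0.1, 0.9]]
print(non_dominated([0, 1], post, verts))  # neither dominates: prior-dependent
```

In the example the two SPODEs disagree, so the winning class changes with the prior vertex and both classes are returned; with more concordant posteriors a single class would survive.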
Following (3), within BMA-AODE*, class c′ dominates class c″ if the solution of the following optimization problem is greater than zero:

minimize:  ∑_{j=1}^{k} P(c′ | ã, s_j) L_j P(s_j) − ∑_{j=1}^{k} P(c″ | ã, s_j) L_j P(s_j)
with respect to:  P(s_1), ..., P(s_k)     (5)

subject to:  P(s_j) ≥ ε, j = 1, ..., k;  ∑_{j=1}^{k} P(s_j) = 1.

The constraints of problem (5) are given by the definition of the credal set; the problem is a linear programming task and, as such, can be solved in polynomial time (Karmarkar, 1984). As already discussed for BMA-AODE, we include in the computation only the SPODEs whose conditional likelihood is at least L_max / 10^4. This can be regarded as a belief revision process involving the credal set: the marginal credal set 𝒫̃(Y) induces the following revision of the joint credal set 𝒫(X, Y):

𝒫̃(X, Y) := { P̃(X, Y) : P̃(x, y) := P(x | y) P̃(y), P̃(Y) ∈ 𝒫̃(Y) }.

The uniform prior belongs to the credal set of BMA-AODE*; therefore, the set of non-dominated classes identified by BMA-AODE* includes by design the most probable class returned by BMA-AODE: if BMA-AODE* returns a single non-dominated class, this coincides with the class returned by BMA-AODE.

2.4. Compression-Based Averaging

Compression-based averaging has been introduced in (Boullé, 2007) to mitigate the excessive concentration of BMA around the MAP model; it replaces the posterior probabilities P(s_j | D) of the models by smoother compression weights, which we denote by P̃(s_j | D) for model s_j. Note that adopting the compression coefficients in place of the posterior probabilities can be seen as a belief revision; for this reason, we adopt again the notation P̃. To present the method, we need some further notation. In particular, we denote by LL_j the logarithm of the conditional likelihood of model s_j. We moreover introduce the null classifier, which classifies instances by simply using the marginal probabilities of the classes, without conditioning on the features; it will be used in the computation of the compression coefficients. We denote the null classifier by s_0, and we accordingly associate a further state s_0 to S, whose domain thus becomes {s_0, s_1, ..., s_k}.
We denote by LL_0 the conditional log-likelihood of the null classifier and by

H(C) := − ∑_{c ∈ 𝒞} P(c) log P(c)

the sample estimate of the entropy of the class variable. Assuming P(c) to be estimated by maximum likelihood (unlike the parameters of the SPODEs, which are instead estimated in a Bayesian way), we have LL_0 = −n H(C) (Boullé, 2007). Since we are dealing with a traditional single-prior classifier, we set a single prior mass function over the models, assigning uniform prior probability to the SPODEs and prior probability ε to the null model; assigning a prior probability to the null model is necessary, since its posterior probability appears in the compression coefficients. Thus, we define the prior over the variable S as follows:

P(s_j) = { ε            if j = 0,
         { (1 − ε)/k    if j = 1, ..., k.     (6)
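The identity LL_0 = −nH(C) is easy to verify numerically, since under maximum-likelihood estimates the null classifier assigns each instance the marginal probability of its class. A small Python sketch with illustrative labels:

```python
import math
from collections import Counter

def class_entropy(labels):
    """Sample estimate H(C) = -sum_c P(c) log P(c), with ML estimates of P(c)."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((m / n) * math.log(m / n) for m in counts.values())

def null_log_likelihood(labels):
    """Conditional log-likelihood of the null classifier: LL_0 = -n * H(C)."""
    return -len(labels) * class_entropy(labels)

labels = ['a'] * 3 + ['b']
print(class_entropy(labels))        # H for ML estimates P = (3/4, 1/4)
print(null_log_likelihood(labels))  # equals sum_i log P(c_i) under ML estimates
```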

The compression coefficients are computed in two steps: computation of the raw compression coefficients, and normalization. The raw compression coefficient associated to SPODE s_j is:

π_j := 1 − log P̃(s_j | D) / log P̃(s_0 | D)
     = 1 − (LL_j + log P(s_j)) / (LL_0 + log P(s_0))
     = 1 − (LL_j + log((1 − ε)/k)) / (−n H(C) + log ε).     (7)

A negative π_j means that s_j is a worse predictor than the null model; a positive π_j means that s_j is a better predictor than the null model, which is the general case in practical situations. The upper limit of π_j is one: in this case s_j is a perfect predictor, with likelihood 1 and thus log-likelihood 0. Following (Boullé, 2007), we keep in the ensemble only the feasible models, namely those with π_j > 0, and we discard the models with π_j ≤ 0; therefore, the null model is not part of the resulting ensemble. The procedure corresponds to a belief revision induced by the removal from the ensemble of the models whose posterior probability falls below a certain threshold. An information-theoretic interpretation of the compression weights can be given by recalling that the principle of minimum description length (MDL) (Grünwald, 2005) prescribes to minimize the overall description length of the model and of the data given the model. The description length corresponds to the negative logarithm of the posterior probability, namely the negative logarithm of the prior (interpreted as the code length of the model) plus the negative logarithm of the likelihood (interpreted as the code length of the data given the model). Thus, −(LL_j + log P(s_j)) represents the quantity of information required to encode the model plus the class values given the model. The code length of the null model can be interpreted as the quantity of information necessary to describe the classes when no explanatory data is used to induce the model. Each model can potentially exploit the explanatory data to better compress the class conditional information.
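Equation (7) and the feasibility filter can be sketched as follows; n, H(C), ε and the conditional log-likelihoods are illustrative values, not data from the paper:

```python
import math

def raw_compression_coefficients(cond_log_liks, n, entropy, eps):
    """Raw compression coefficients of Eq. (7):
    pi_j = 1 - (LL_j + log((1-eps)/k)) / (-n*H(C) + log(eps))."""
    k = len(cond_log_liks)
    ll0 = -n * entropy + math.log(eps)  # code length term of the null model
    return [1.0 - (ll + math.log((1.0 - eps) / k)) / ll0 for ll in cond_log_liks]

def normalized_weights(pis):
    """Keep only the feasible models (pi_j > 0) and normalize, as in Eq. (8)."""
    feas = [max(p, 0.0) for p in pis]
    total = sum(feas)
    return [p / total for p in feas]

pis = raw_compression_coefficients([-300.0, -350.0, -700.0],
                                   n=1000, entropy=0.6, eps=0.05)
print(pis)                   # the third model is worse than the null model
print(normalized_weights(pis))
```

With LL_0 ≈ −600, the model with LL = −700 compresses the classes worse than the null model, gets a negative coefficient, and is discarded before normalization.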
"The ratio of the code length of a model to that of the null model stands for a relative gain in compression efficiency." (Boullé, 2007, page 1663). With no loss of generality, assume the features to be ordered such that A_1, A_2, ..., A_k′ yield a feasible model when used as super-parent; thus, SPODEs s_1, s_2, ..., s_k′ are feasible, while the SPODEs s_j with j > k′ are removed from the ensemble. The normalized compression coefficients P̃(s_j | D) are obtained by normalizing the raw compression coefficients of the feasible SPODEs:

P̃(s_j | D) = { π_j / ∑_{l=1}^{k′} π_l    if j = 1, ..., k′,
             { 0                         otherwise.     (8)

The posterior probabilities of the classes are estimated as:

P(c | ã) := ∑_{j=1}^{k′} P(c | ã, s_j) P̃(s_j | D).     (9)

We call this classifier COMP-AODE, where COMP stands for compression-based.

2.5. COMP-AODE*: Extending COMP-AODE to Sets of Probabilities

We extend COMP-AODE to imprecise probabilities by substituting the single prior distribution P(S) over the models by the credal set 𝒫_c(S), the subscript c denoting

compression. The credal set of COMP-AODE* differs from that of BMA-AODE* in that it assigns prior probability ε to the null model:

𝒫_c(S) := { P(S) : P(s_0) = ε;  P(s_j) ≥ ε, j = 1, ..., k;  ∑_{j=0}^{k} P(s_j) = 1 }.     (10)

The raw compression weight of SPODE s_j varies within the interval:

π_j ∈ [ 1 − (LL_j + log ε) / (−n H(C) + log ε),  1 − (LL_j + log(1 − kε)) / (−n H(C) + log ε) ].     (11)

Since the prior (6) used by COMP-AODE belongs to the credal set of COMP-AODE*, the point estimate (7) of the compression coefficient adopted by COMP-AODE lies within the interval (11). The weights π_j cannot vary independently of each other; they are instead linked by the normalization constraint in (10). COMP-AODE* regards SPODE s_j as non-feasible if its upper compression coefficient is non-positive: this approach thus preserves all the models which are feasible, in the sense of Section 2.4, for at least one prior in the set 𝒫_c(S). COMP-AODE* is thus more conservative than COMP-AODE, namely it discards fewer models. In practice, however, neither COMP-AODE* nor COMP-AODE generally removes any SPODE from the ensemble. Since the prior adopted by COMP-AODE is contained in the credal set of COMP-AODE*, the most probable class identified by COMP-AODE is part of the non-dominated classes identified by COMP-AODE*; exceptions to this statement are possible only if the set of feasible SPODEs differs between COMP-AODE* and COMP-AODE, which never happened in our experiments. Like BMA-AODE*, COMP-AODE* identifies the non-dominated classes through maximality (Algorithm 2). In the following, we explain how to compute the dominance test between two classes.

Testing dominance. Without loss of generality, we assume the features to have been re-ordered so that the first k′ features yield a model with positive upper compression coefficient when used as super-parent. Thus, SPODEs {s_1, ..., s_k′} are the feasible ones.
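The interval (11) can be computed directly; a minimal Python sketch with hypothetical values of LL_j, n, H(C), ε and k (the denominator is constant because the null model keeps the fixed prior probability ε):

```python
import math

def compression_coefficient_bounds(ll_j, n, entropy, eps, k):
    """Interval of Eq. (11) for the raw compression coefficient of SPODE s_j
    when its prior probability varies in the credal set of Eq. (10)."""
    denom = -n * entropy + math.log(eps)
    lower = 1.0 - (ll_j + math.log(eps)) / denom             # P(s_j) = eps
    upper = 1.0 - (ll_j + math.log(1.0 - k * eps)) / denom   # P(s_j) = 1 - k*eps
    return lower, upper

lo, hi = compression_coefficient_bounds(ll_j=-300.0, n=1000, entropy=0.6,
                                        eps=0.05, k=3)
print(lo, hi)  # the point estimate of Eq. (7) lies inside this interval
```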
In this case, the dominance test corresponds to evaluating whether or not the solution of the following optimization problem is greater than zero:

minimize:  ∑_{j=1}^{k′} P(c′ | ã, s_j) π_j − ∑_{j=1}^{k′} P(c″ | ã, s_j) π_j
with respect to:  P(s_0), P(s_1), ..., P(s_k)     (12)
subject to:  P(s_0) = ε;  P(s_j) ≥ ε, j = 1, ..., k;  ∑_{j=0}^{k} P(s_j) = 1,

where the normalization term ∑_{j=1}^{k′} π_j has already been simplified away, and the constraints represent the credal set. To express the dependence of the objective function on the optimization variables, recall that P(s_0) = ε and express π_j through (7). The objective function rewrites as:

∑_{j=1}^{k′} P(c′ | ã, s_j) (1 − (log P(s_j) + LL_j)/(log ε + LL_0)) − ∑_{j=1}^{k′} P(c″ | ã, s_j) (1 − (log P(s_j) + LL_j)/(log ε + LL_0)),

which, removing the constant terms, becomes:

∑_{j=1}^{k′} (P(c′ | ã, s_j) − P(c″ | ã, s_j)) log P(s_j).     (13)

The task is therefore a linearly constrained optimization of a non-linear objective function. However, the objective function is a sum of terms, each involving a single optimization variable; by separating the negative from the positive terms, it can be rewritten as a difference of two convex functions. The problem thus reduces to a convex optimization, which can be solved with polynomial complexity.

2.6. Computational Complexity of the Classifiers

We now analyze the computational complexity of the various classifiers. We distinguish between learning and classification complexity, the latter referring to the classification of a single instance. Both the space and the time required for the computations are evaluated. The orders of magnitude of these descriptors are reported as a function of the data set size n, the number of attributes/SPODEs k, the number of classes l := |𝒞|, and the average number of states of the attributes v := (1/k) ∑_{i=1}^{k} |𝒜_i|. A summary of this analysis is given in Table 1 and discussed below.

Table 1: Complexity of the classifiers.

  Algorithm          Space (learning/classification)   Time (learning)   Time (classification)
  AODE               O(lk²v²)                          O(nk²)            O(lk²)
  BMA[COMP]-AODE     O(lk²v²)                          O(n(l+k)k)        O(lk²)
  BMA[COMP]-AODE*    O(lk²v²)                          O(n(l+k)k)        O(l²·poly(k))

Let us first evaluate AODE.
For a single SPODE s_j, the tables P(C), P(A_j | C) and P(A_i | C, A_j), with i = 1, ..., k and i ≠ j, must be stored, implying space complexity O(lkv²) for a single SPODE and O(lk²v²) for the whole AODE ensemble. These tables must be available during both learning and classification; thus, the space requirements of the two stages are the same. The time complexity to scan the data set and learn the probabilities is O(nk) for each SPODE, and hence O(nk²) for AODE. The time required to compute the posterior probabilities is O(lk) for each SPODE, and hence O(lk²) for AODE.

Learning BMA-AODE or COMP-AODE takes the same space as AODE, but more computational time, due to the evaluation of the conditional likelihood as in (2). The additional computational time is O(nlk), giving O(n(l+k)k) time overall. In classification, time and space complexity are the same as for AODE. The credal classifiers BMA-AODE* and COMP-AODE* have the same space complexity and the same learning time complexity as their non-credal counterparts. However, credal classifiers have a higher time complexity in classification: the pairwise dominance tests of Algorithm 2 require the solution, for each test instance, of a number of optimization problems which is quadratic in the number of classes. The complexity of the linear programming problem of BMA-AODE* is polynomial in the number of variables (Borgwardt, 1987); the (convex) optimization problem required by COMP-AODE* has polynomial complexity too, as already discussed.

3. Experiments

We run experiments on 40 data sets, whose characteristics are given in the Appendix. On each data set we perform 10 runs of 5-fold cross-validation. In order to have complete data, we replace missing values with the median and the mode for numerical and categorical features respectively. We discretize numerical features by the entropy-based method of Fayyad and Irani (1993). For the implementation of all credal sets, we set ε = . For pairwise comparison of classifiers over the collection of data sets we use the Wilcoxon signed-rank test, as recommended by Demšar (2006).

3.1. Determinate Classifiers

We call determinate those classifiers which always return a single class, namely AODE, BMA-AODE and COMP-AODE.
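The Wilcoxon signed-rank test mentioned above ranks the per-data-set differences between two classifiers. A toy Python sketch of the statistic (sum of ranks of the positive differences; tie corrections and the p-value computation are omitted here, and in practice are provided by a statistics library):

```python
def signed_rank_statistic(diffs):
    """Simplified Wilcoxon signed-rank statistic: drop zero differences,
    rank the remaining ones by absolute value (no tie correction in this toy
    version), and sum the ranks of the positive differences."""
    nonzero = [d for d in diffs if d != 0]
    ranked = sorted(nonzero, key=abs)
    return sum(rank + 1 for rank, d in enumerate(ranked) if d > 0)

# Hypothetical per-data-set accuracy differences between two classifiers.
print(signed_rank_statistic([0.02, -0.01, 0.03, 0.05]))  # ranks 2+3+4 = 9
```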
For determinate classifiers we measure two indicators: the accuracy, namely the percentage of correct classifications, and the Brier loss

(1/n_te) ∑_{i=1}^{n_te} (1 − P(c^(i) | a^(i)))²,

where n_te denotes the number of instances in the test set, while P(c^(i) | a^(i)) is the probability estimated by the classifier for the true class of the i-th instance. A preliminary finding is that adopting the conditional likelihood instead of the marginal likelihood is an effective refinement, which reduces the Brier loss of BMA-AODE by 6% on average. Despite this refinement, however, BMA-AODE is outperformed (p < 0.01) by AODE on both accuracy and Brier loss. We present in Figure 3(a) the scatter plot of the accuracies and in Figure 4(a) the relative Brier losses, namely the Brier loss of BMA-AODE divided, data set by data set, by the Brier loss of AODE. On average, BMA-AODE has a 3% higher Brier loss than AODE. BMA-AODE computed with the marginal likelihood was already found to be outperformed by AODE (Yang et al., 2007; Cerquides and de Mántaras, 2005). As for the comparison of COMP-AODE with AODE: there is no significant difference in accuracy, as can be inferred from Figure 3(b), but COMP-AODE outperforms AODE on the Brier loss (p < 0.01). Figure 4(b) shows the relative Brier

losses, namely the Brier loss of COMP-AODE divided, data set by data set, by the Brier loss of AODE. Averaging over the data sets, COMP-AODE reduces the Brier loss by about 3% compared to AODE. We regard this result as noteworthy, since AODE is a high-performance classifier. These results broaden the scope of the experiments of (Boullé, 2007), in which the compression approach was applied to an ensemble of naive Bayes classifiers.

Figure 3: Scatter plots of accuracies (BMA-AODE and COMP-AODE vs AODE); the solid line shows the bisector.

Figure 4: Relative Brier losses; points lying below the horizontal line represent performance better than AODE, and vice versa. Note the different y-scales of the two graphs.

3.2. Credal Classifiers

A credal classifier returns a set of classes if the instance is prior-dependent, and a single class if the instance is instead safe, i.e., non prior-dependent. Note, however, that an instance can be judged prior-dependent by one credal classifier and safe by another. To characterize the performance of a credal classifier, the following four indicators are considered (Corani and Zaffalon, 2008b):

- determinacy: the percentage of instances recognized as safe, namely classified with a single class;

- single-accuracy: the accuracy achieved over the instances recognized as safe;
- set-accuracy: the accuracy achieved, by returning a set of classes, over the prior-dependent instances;
- indeterminate output size: the average number of classes returned on the prior-dependent instances.

Averaging over the data sets, BMA-AODE* has 94% determinacy; it is completely determinate on 7 data sets. The determinacy fluctuates among data sets, being correlated with the sample size (ρ = 0.3). The choice of the prior is less important on large data sets: bigger data sets tend to contain a lower percentage of prior-dependent instances, thus increasing determinacy. BMA-AODE* performs well when indeterminate: averaging over all data sets, it achieves 90% set-accuracy by returning 2.3 classes (the average number of classes in the collection of data sets is 3.6). It is worth analyzing the performance of BMA-AODE on the prior-dependent instances. In Figure 5(a) we compare, data set by data set, the accuracy achieved by BMA-AODE on the instances judged respectively safe and prior-dependent by BMA-AODE*; the plot shows a sharp drop of accuracy on the prior-dependent instances, which is statistically significant (p < 0.01). As a rough indication, averaging over the data sets, the accuracy of BMA-AODE is 83% on the safe instances but only 52% on the instances recognized as prior-dependent by BMA-AODE*. Thus, on the prior-dependent instances, BMA-AODE provides fragile classifications; on the same instances, BMA-AODE* returns a small-sized but highly accurate set of classes.

Figure 5: Accuracy of the determinate classifiers on the instances recognized as safe and as prior-dependent by their credal counterparts.
The accuracy of BMA-AODE [COMP-AODE] is thus separately measured on the instances judged safe and prior-dependent by BMA-AODE* [COMP-AODE*]. The solid line shows the bisector.

Let us now analyze the performance of COMP-AODE*. It has higher determinacy than BMA-AODE*: averaging over data sets, its determinacy is 99%, with only minor fluctuations across data sets, and it is completely determinate on 18 data sets. The determinacy of COMP-AODE* is thus very high and stable.

Therefore, under the compression-based approach only a small fraction of the instances is prior-dependent; this robustness to the choice of the prior is likely to contribute to the good performance of the compression-based ensemble of classifiers, and constitutes a desirable but previously unknown property of the compression-based approach. Numerical inspection shows that the logarithmic smoothing of the models' posterior probabilities indeed makes the compression weights only mildly sensitive to the choice of the prior. COMP-AODE* performs well when indeterminate: averaging over all data sets, it achieves 95% set-accuracy by returning 2 classes (note that the indeterminate output size cannot be less than two).

Again, it is worth checking the behavior of the corresponding determinate classifier, namely COMP-AODE, on the instances that are prior-dependent for COMP-AODE*. In Figure 5(b) we compare, data set by data set, the accuracy achieved by COMP-AODE on the instances judged respectively safe and prior-dependent by COMP-AODE*; there is a large drop of accuracy on the prior-dependent instances, and the drop is significant (p-value < .01). Averaging over data sets, the accuracy of COMP-AODE drops from 82% on the safe instances to only 47% on the instances judged as prior-dependent by COMP-AODE*. Thus even COMP-AODE, despite its robustness and high performance, undergoes a severe loss of accuracy on the instances recognized as prior-dependent by COMP-AODE*. On the very same instances, COMP-AODE* returns a small-sized but highly reliable set of classes, thus enhancing the overall classification reliability. These results convincingly show the importance of detecting the prior-dependent classifications.

3.3. Utility-based Measures

We have seen so far that the credal ensembles extend their determinate counterparts in a sensible way, being able to recognize prior-dependent instances and to deal with them robustly.
Yet, it is not obvious how to compare credal and determinate classifiers by means of a synthetic indicator. In our view, the most principled answer to this question is that of Zaffalon et al. (2012), which allows comparing the 0-1 loss of a traditional classifier with a utility score defined for credal classifiers. The starting point is the discounted accuracy, which rewards a prediction containing m classes with 1/m if it contains the true class, and with 0 otherwise. The discounted accuracy can be compared to the accuracy achieved by a determinate classifier. It has been shown (Zaffalon et al., 2012) that, within a betting framework based on fairly general assumptions, discounted accuracy is the only score which satisfies some fundamental properties for assessing both determinate and indeterminate classifications. Yet Zaffalon et al. (2012) also show some severe shortcomings of discounted accuracy: consider two medical doctors, doctor random and doctor vacuous, who should diagnose whether a patient is healthy or diseased. Doctor random issues uniformly random diagnoses; doctor vacuous instead always returns both categories, thus admitting its ignorance. Let us assume that the hospital profits a quantity of money proportional to the discounted accuracy achieved by its doctors at each visit. Both doctors have the same expected discounted accuracy for each visit, namely 1/2. For the hospital, both doctors provide the same expected profit on each visit, but with a substantial difference: the profit of doctor vacuous is deterministic, while the

profit of doctor random is affected by considerable variance. Any risk-averse hospital manager should thus prefer doctor vacuous over doctor random, since it yields the same expected profit with less variance. In fact, under risk aversion, the expected utility increases with the expectation of the rewards and decreases with their variance (Levy and Markowitz, 1979). To model this fact, it is necessary to apply a utility function to the discounted-accuracy score assigned to each instance. Zaffalon et al. (2012) design the utility function as follows: the utility of a correct and determinate classification (discounted accuracy 1) is 1; the utility of a wrong classification (discounted accuracy 0) is 0. Therefore, the utility of a traditional determinate classifier is its accuracy. The utility of an accurate but indeterminate classification consisting of two classes (discounted accuracy 0.5) is assumed to lie between 0.65 and 0.8. Two quadratic utility functions are then derived corresponding to these boundary values, passing respectively through {u(0) = 0, u(0.5) = 0.65, u(1) = 1} and {u(0) = 0, u(0.5) = 0.8, u(1) = 1}, and denoted as u_65 and u_80 respectively; the mathematical expressions of these utility functions are:

u_65(x) = -0.6 x^2 + 1.6 x,
u_80(x) = -1.2 x^2 + 2.2 x,

where x is the value of discounted accuracy. Since u(1) = 1, utility and accuracy coincide for determinate classifiers; therefore, the utility of credal classifiers and the accuracy of determinate classifiers can be directly compared. In Del Coz and Bahamonde (2009), classifiers which return indeterminate classifications are scored through the F_1 metric, originally designed for Information Retrieval tasks. The F_1 metric, when applied to indeterminate classifications, returns a score which always lies between u_65 and u_80, further confirming the reasonableness of both utility functions. More details on the links between F_1, u_65 and u_80 are given in Zaffalon et al. (2012).
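The discounted accuracy and the two utility functions translate directly into code; a quick check confirms the boundary values stated above (the snippet is ours, for illustration):

```python
def discounted_accuracy(pred_set, y_true):
    """Reward 1/m for a prediction of m classes containing the true one, else 0."""
    return 1.0 / len(pred_set) if y_true in pred_set else 0.0

def u65(x):
    """Quadratic utility through u(0)=0, u(0.5)=0.65, u(1)=1."""
    return -0.6 * x**2 + 1.6 * x

def u80(x):
    """Quadratic utility through u(0)=0, u(0.5)=0.8, u(1)=1."""
    return -1.2 * x**2 + 2.2 * x

# An accurate but indeterminate binary prediction:
x = discounted_accuracy({"healthy", "diseased"}, "diseased")  # 0.5
# u65(0.5) = 0.65 and u80(0.5) = 0.8; at x = 0 and x = 1 both
# utilities coincide with accuracy, so credal and determinate
# classifiers can be scored on the same scale.
```

Both functions are concave, which is what encodes risk aversion: an indeterminate but accurate answer (x = 0.5) is worth more than the expected value of a random guess.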
While in real case studies the utility function should be elicited from the decision maker, u_65 and u_80 can be suitably used for data mining experiments like those shown in the following. We now analyze the utilities generated by the various classifiers, comparing each credal classifier with its determinate counterpart; recall that, for a traditional classifier, utility and accuracy coincide.

The utility of BMA-AODE* is significantly higher (p-value < .01) than that of BMA-AODE under both u_65 and u_80. This confirms that extending the model to imprecise probability is a sensible approach. In the first row of Figure 6 we show the relative utility, namely the utility of BMA-AODE* divided, data set by data set, by the utility (i.e., accuracy) of BMA-AODE; the two plots refer respectively to u_65 and u_80. Averaging over data sets, the improvement of utility is about 1% and 2% under u_65 and u_80 respectively; although the improvement might look small, we recall that it is obtained by modifying the classifications of the prior-dependent instances only, 6% of the total on average. If we focus on the prior-dependent instances only, the increase of utility generally varies between +10% and +40%, depending on the data set and on the utility function. Clearly, the improvement is larger under u_80, which assigns higher utility than u_65 to indeterminate but accurate classifications.

The analysis is similar when comparing COMP-AODE* with COMP-AODE. In the second row of Figure 6 we show the relative utility, namely the utility of COMP-AODE* divided, data set by data set, by the utility (i.e., accuracy) of COMP-AODE. The increase of utility is in this case generally under 1%, as a consequence of the higher determinacy of COMP-AODE* (99% on average), which leaves less room for improving utility through indeterminate classifications.

Figure 6: Relative utilities of credal classifiers compared to their precise counterparts.

In fact, the robustness of COMP-AODE to the choice of the prior reduces the portion of instances where it is necessary to make the classification indeterminate. Focusing however on the (rare) indeterminate instances, the increase of utility deriving from the extension to imprecise probability lies between 39% and 60%, depending on the data set and on the utility function. Finally, COMP-AODE* has significantly (p-value < .01) higher utility than COMP-AODE under both u_65 and u_80; also in this case the credal ensemble outperforms its traditional counterpart.

The utilities of COMP-AODE* and BMA-AODE* are also compared; under u_65, COMP-AODE* yields significantly (p-value < .05) higher utility than BMA-AODE*, while under u_80 the difference between the two classifiers is not significant, although the utility generated by COMP-AODE* is generally slightly higher. The point is that BMA-AODE* is more often indeterminate than COMP-AODE*; under u_80 the indeterminate but accurate classifications are rewarded more than under u_65, thus allowing BMA-AODE* to almost close the gap with COMP-AODE*. We conclude, however, that COMP-AODE* should generally be preferred over BMA-AODE*.

Finally, we point out that COMP-AODE* generates significantly (p-value < .01) higher utility than AODE, under both u_65 and u_80. The extension to imprecise probability has thus concretely improved the overall performance of the compression-based ensemble: recall that the determinate COMP-AODE yields better probability estimates but not better accuracy than AODE.
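The relative utilities discussed above are obtained, per data set, by dividing the mean utility of the credal classifier by the accuracy of its determinate counterpart. A sketch under u_65, with toy predictions of our own invention:

```python
def u65(x):
    """Quadratic utility through u(0)=0, u(0.5)=0.65, u(1)=1."""
    return -0.6 * x**2 + 1.6 * x

def mean_utility(pred_sets, y_true, u=u65):
    """Average utility of set-valued predictions; the argument of u is
    the discounted accuracy of each prediction."""
    das = [(1.0 / len(p) if y in p else 0.0) for p, y in zip(pred_sets, y_true)]
    return sum(u(x) for x in das) / len(das)

def accuracy(preds, y_true):
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# Toy data set: the credal classifier is indeterminate (and accurate)
# exactly on the instance where the determinate one errs.
truth       = [0, 1, 0, 1]
determinate = [0, 1, 0, 0]             # one mistake
credal      = [{0}, {1}, {0}, {0, 1}]  # indeterminate on the hard instance
relative_u65 = mean_utility(credal, truth) / accuracy(determinate, truth)
# relative_u65 > 1: replacing a wrong determinate answer with an accurate
# set-valued one increases utility.
```

This mirrors the mechanism described in the text: the gain comes entirely from the (few) instances whose classification is turned indeterminate.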

3.4. Comparison with Previous Credal Classifiers

Figure 7: Comparison between credal classifiers (COMP-AODE*, CMA, CDT) by means of the Friedman test, under u_65 and u_80: the boldfaced points show the average ranks; a lower rank implies better performance. The bars display the critical distance, computed with 95% confidence: the performances of two classifiers are significantly different if their bars do not overlap.

In this section we compare COMP-AODE* with previous credal classifiers. A well-known credal classifier is the naive credal classifier (NCC) (Corani and Zaffalon, 2008b), an extension of naive Bayes to imprecise probability. The comparison of NCC and COMP-AODE* on the 40 data sets of Section 3 shows that COMP-AODE* yields significantly (p < 0.01) higher utility than NCC under both u_65 and u_80. However, over time algorithms more sophisticated than NCC have been developed, such as:

- credal model averaging (CMA) (Corani and Zaffalon, 2008a), a generalization of BMA (in the same spirit as BMA-AODE) for the naive Bayes classifier;

- credal decision trees (CDT) (Abellán and Moral, 2005), an extension of classification trees to imprecise probability.

We then compare CDT, CMA and COMP-AODE* via the Friedman test; this is the approach recommended by Demšar (2006) for comparing multiple classifiers on multiple data sets. First, the procedure ranks the classifiers on each data set according to the utility they generate; then, it tests the null hypothesis that all classifiers have the same average rank across the data sets. If the null hypothesis is rejected, a post-hoc test is adopted to identify the significant differences among classifiers. Adopting 95% confidence, no significant difference is detected among the classifiers; the result is the same under both utilities. However, under both utilities COMP-AODE* has the best average rank, as shown in Figure 7.
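The omnibus Friedman test used here is available in SciPy; the post-hoc critical-distance (Nemenyi) analysis is not part of SciPy, but the average ranks are easy to compute. The utility values below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Utilities of three credal classifiers on five data sets (illustrative numbers).
utilities = np.array([
    # COMP-AODE*, CMA,  CDT
    [0.83, 0.80, 0.81],
    [0.79, 0.78, 0.77],
    [0.91, 0.90, 0.88],
    [0.85, 0.84, 0.83],
    [0.88, 0.86, 0.87],
])

# Omnibus test: do the classifiers have the same average rank?
stat, pvalue = friedmanchisquare(*utilities.T)

# Average rank of each classifier (rank 1 = highest utility on a data set),
# as plotted in Figure 7; significance still requires the post-hoc test.
avg_ranks = np.mean([rankdata(-row) for row in utilities], axis=0)
```

Rejecting the null hypothesis only licenses the post-hoc pairwise comparisons; a good average rank alone, as the text notes, is not evidence of a significant difference.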
Lowering the confidence to 90%, two significant differences are found: a) COMP-AODE* produces significantly higher utility than CMA

under u_65 and b) COMP-AODE* produces significantly higher utility than CDT under u_80. These results, though not completely conclusive, suggest that COMP-AODE* compares favorably to previous credal classifiers.

3.5. Some Comments on Credal Classification versus Reject Option

Determinate classifiers can be equipped with a reject option (Herbei and Wegkamp, 2006), thus refusing to classify an instance if the posterior probability of the most probable class is below a threshold. For the sake of simplicity we consider a case with two classes only; to formally introduce the reject option, it is necessary to set a cost d (0 < d < 1/2), which is incurred when rejecting an instance. Costs of 0, 1 and d are therefore incurred when, respectively, correctly classifying, wrongly classifying and rejecting an instance. Under 0-1 loss, the expected cost of classifying an instance corresponds to the probability of misclassification; it is thus 1 - p*, where p* denotes the posterior probability of the most probable class. The optimal behavior is to reject the classification whenever the expected classification cost exceeds the rejection cost, namely when 1 - p* > d; this is equivalent to rejecting the instance whenever p* < 1 - d, where 1 - d constitutes the rejection threshold.

The behavior induced by the reject option is quite different from that of a credal classifier, as the following example shows. On a very large data set, the posterior probability of the classes is little sensitive to the choice of the prior, because of the large amount of data available for learning; in this condition, instances are rarely prior-dependent and therefore a credal classifier will mostly return a single class. On the other hand, the determinate classifier with reject option (RO in the following) rejects all the instances for which p* < 1 - d; if d is small, the number of rejected instances can even be high.
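The rejection rule above, and the cost-aware variant for credal classifiers discussed next, can be stated in a few lines (a sketch with our own naming; p_star and p_star_lower denote the posterior probability and the lower probability of the most probable class):

```python
def reject_option(p_star, d):
    """Determinate classifier with reject option: reject iff p* < 1 - d."""
    return p_star < 1 - d

def credal_reject(p_star_lower, prior_dependent, d):
    """Cost-aware credal classifier: reject the prior-dependent instances
    and those whose lower probability of the top class falls below 1 - d."""
    return prior_dependent or p_star_lower < 1 - d

# Since the lower probability never exceeds the posterior probability,
# every instance rejected by RO is also rejected by the credal rule:
# the credal rejections form a superset.
d = 0.1
assert not reject_option(0.95, d)     # RO classifies
assert credal_reject(0.85, False, d)  # credal rule rejects (0.85 < 0.9)
```

The example makes the asymmetry visible: with p* = 0.95 but a lower probability of 0.85, RO classifies while the cost-aware credal rule abstains.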
The difference between these behaviors is due to the credal classifier being unaware of the cost d associated with rejecting an instance, which instead drives the behavior of RO. To rigorously compare RO against a credal classifier, it is thus necessary to make the credal classifier aware of the cost d. Recalling that the credal classifier already returns both classes on the prior-dependent instances, this changes the behavior of the credal classifier only on the instances which are not prior-dependent. In particular, the credal classifier should reject all the instances for which p_* < 1 - d, where p_* is the lower probability of the most probable class; since p_* never exceeds p*, the instances rejected by this criterion are a superset of those rejected by RO. Therefore, the credal classifier will reject the instances which are prior-dependent and those for which p_* < 1 - d. Finally, the cost generated by the credal classifier should be compared with that generated by RO. In the case of more than two classes the analysis might become slightly more complicated than discussed here; we leave the analysis of credal classifiers with reject option as a topic for future research. Note also that this kind of experiment requires the computation of the upper and lower posterior probabilities of the classes, which is not always trivial with credal classifiers.

4. Conclusions

The main contribution of this paper is two novel credal ensembles of SPODE classifiers. Both ensembles compare favorably to more traditional ensembles, showing

the effectiveness of the IP approach in dealing with the problem of specifying the prior over the models. In particular, the credal ensembles automatically identify the prior-dependent instances and cope reliably with them by returning a small-sized but highly accurate set of classes. On the same instances, the traditional ensembles of classifiers instead undergo a severe drop of accuracy. In particular, the credal ensemble based on compression weights achieves very good performance; our findings thus broaden the application scope of compression weights, originally introduced by Boullé (2007).

As future work, it would be interesting to develop a credal generalization of the MAPLMG algorithm (Cerquides and de Mántaras, 2005), which is so far the highest-performing approach (Yang et al., 2007) for aggregating SPODEs. Other interesting developments might include applying the ideas of credal ensembles to domains other than classification, such as regression. A credal ensemble of different regression models would then return predictions which are intervals rather than points; the interval would highlight the sensitivity of the prediction to the prior set over the competing regression models. In terms of computational speed, it would be possible to make BMA-AODE* faster by solving its credal-dominance test without running the linear programming procedure, identifying instead the solution on the basis of the derivatives of the objective functions; a similar approach has been followed, for instance, by Corani and Zaffalon (2008a).

Acknowledgements

The research in this paper has been partially supported by the Swiss NSF grants no and by the Hasler foundation grant n We thank the anonymous reviewers for their valuable insights.

References

Abellán, J., Moral, S., Upper entropy of credal sets. Applications to credal classification. International Journal of Approximate Reasoning 39.

Borgwardt, K., The simplex algorithm: a probabilistic analysis. Springer.
Boullé, M., Compression-based averaging of selective naive Bayes classifiers. Journal of Machine Learning Research 8.

Cerquides, J., de Mántaras, R., Robust Bayesian linear classifier ensembles, in: Proc. of the 16th European Conference on Machine Learning (ECML-PKDD 2005).

Corani, G., Antonucci, A., Zaffalon, M., Bayesian networks with imprecise probabilities: theory and application to classification, in: Holmes, D.E., Jain, L.C., Kacprzyk, J. (Eds.), Data Mining: Foundations and Intelligent Paradigms, volume 23.


An analysis of data characteristics that affect naive Bayes performance An analysis of data characteristics that affect naive Bayes performance Irina Rish Joseph Hellerstein Jayram Thathachar IBM T.J. Watson Research Center 3 Saw Mill River Road, Hawthorne, NY 1532 RISH@US.IBM.CM

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

Beyond Uniform Priors in Bayesian Network Structure Learning

Beyond Uniform Priors in Bayesian Network Structure Learning Beyond Uniform Priors in Bayesian Network Structure Learning (for Discrete Bayesian Networks) scutari@stats.ox.ac.uk Department of Statistics April 5, 2017 Bayesian Network Structure Learning Learning

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Entropy-based pruning for learning Bayesian networks using BIC

Entropy-based pruning for learning Bayesian networks using BIC Entropy-based pruning for learning Bayesian networks using BIC Cassio P. de Campos a,b,, Mauro Scanagatta c, Giorgio Corani c, Marco Zaffalon c a Utrecht University, The Netherlands b Queen s University

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine CS 484 Data Mining Classification 7 Some slides are from Professor Padhraic Smyth at UC Irvine Bayesian Belief networks Conditional independence assumption of Naïve Bayes classifier is too strong. Allows

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Continuous updating rules for imprecise probabilities

Continuous updating rules for imprecise probabilities Continuous updating rules for imprecise probabilities Marco Cattaneo Department of Statistics, LMU Munich WPMSIIP 2013, Lugano, Switzerland 6 September 2013 example X {1, 2, 3} Marco Cattaneo @ LMU Munich

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Bayesian Approaches Data Mining Selected Technique

Bayesian Approaches Data Mining Selected Technique Bayesian Approaches Data Mining Selected Technique Henry Xiao xiao@cs.queensu.ca School of Computing Queen s University Henry Xiao CISC 873 Data Mining p. 1/17 Probabilistic Bases Review the fundamentals

More information

Air pollution prediction via multi-label classification

Air pollution prediction via multi-label classification Air pollution prediction via multi-label classification Giorgio Corani and Mauro Scanagatta Istituto Dalle Molle di Studi sull Intelligenza Artificiale (IDSIA) Scuola universitaria professionale della

More information

10-701/ Machine Learning: Assignment 1

10-701/ Machine Learning: Assignment 1 10-701/15-781 Machine Learning: Assignment 1 The assignment is due September 27, 2005 at the beginning of class. Write your name in the top right-hand corner of each page submitted. No paperclips, folders,

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 6 Computing Announcements Machine Learning Lecture 2 Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Bayesian decision making

Bayesian decision making Bayesian decision making Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánů 1580/3, Czech Republic http://people.ciirc.cvut.cz/hlavac,

More information

Bayesian Networks. Motivation

Bayesian Networks. Motivation Bayesian Networks Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Motivation Assume we have five Boolean variables,,,, The joint probability is,,,, How many state configurations

More information

Lecture 5: Bayesian Network

Lecture 5: Bayesian Network Lecture 5: Bayesian Network Topics of this lecture What is a Bayesian network? A simple example Formal definition of BN A slightly difficult example Learning of BN An example of learning Important topics

More information

CHAPTER 3. THE IMPERFECT CUMULATIVE SCALE

CHAPTER 3. THE IMPERFECT CUMULATIVE SCALE CHAPTER 3. THE IMPERFECT CUMULATIVE SCALE 3.1 Model Violations If a set of items does not form a perfect Guttman scale but contains a few wrong responses, we do not necessarily need to discard it. A wrong

More information

Exact model averaging with naive Bayesian classifiers

Exact model averaging with naive Bayesian classifiers Exact model averaging with naive Bayesian classifiers Denver Dash ddash@sispittedu Decision Systems Laboratory, Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15213 USA Gregory F

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

An Empirical-Bayes Score for Discrete Bayesian Networks

An Empirical-Bayes Score for Discrete Bayesian Networks An Empirical-Bayes Score for Discrete Bayesian Networks scutari@stats.ox.ac.uk Department of Statistics September 8, 2016 Bayesian Network Structure Learning Learning a BN B = (G, Θ) from a data set D

More information

Ensemble Methods and Random Forests

Ensemble Methods and Random Forests Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization

More information

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to

More information

Robust Monte Carlo Methods for Sequential Planning and Decision Making

Robust Monte Carlo Methods for Sequential Planning and Decision Making Robust Monte Carlo Methods for Sequential Planning and Decision Making Sue Zheng, Jason Pacheco, & John Fisher Sensing, Learning, & Inference Group Computer Science & Artificial Intelligence Laboratory

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

LECTURE NOTE #3 PROF. ALAN YUILLE

LECTURE NOTE #3 PROF. ALAN YUILLE LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.

More information

Bayesian Learning Features of Bayesian learning methods:

Bayesian Learning Features of Bayesian learning methods: Bayesian Learning Features of Bayesian learning methods: Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn

Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn CMU 10-701: Machine Learning (Fall 2016) https://piazza.com/class/is95mzbrvpn63d OUT: September 13th DUE: September

More information

Model Averaging With Holdout Estimation of the Posterior Distribution

Model Averaging With Holdout Estimation of the Posterior Distribution Model Averaging With Holdout stimation of the Posterior Distribution Alexandre Lacoste alexandre.lacoste.1@ulaval.ca François Laviolette francois.laviolette@ift.ulaval.ca Mario Marchand mario.marchand@ift.ulaval.ca

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

Planning With Information States: A Survey Term Project for cs397sml Spring 2002

Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Jason O Kane jokane@uiuc.edu April 18, 2003 1 Introduction Classical planning generally depends on the assumption that the

More information

A Comparative Study of Different Order Relations of Intervals

A Comparative Study of Different Order Relations of Intervals A Comparative Study of Different Order Relations of Intervals Samiran Karmakar Department of Business Mathematics and Statistics, St. Xavier s College, Kolkata, India skmath.rnc@gmail.com A. K. Bhunia

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information