Credal Classification A. Antonucci, G. Corani, D. Maua {alessandro,giorgio,denis}@idsia.ch Istituto Dalle Molle di Studi sull'Intelligenza Artificiale Lugano (Switzerland) IJCAI-13
About the speaker PhD in Information Engineering (Politecnico di Milano, 2005). Since 2006: researcher at IDSIA (www.idsia.ch), Switzerland. Member of the IDSIA Imprecise Probability Group (http://ipg.idsia.ch) Probabilistic graphical models, data mining, imprecise probability.
Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging
Estimation for a categorical variable
Class C, with sample space Ω_C = {c_1, ..., c_m}.
P(c_j) = θ_j, θ = {θ_1, ..., θ_m}.
n i.i.d. observations; n = {n_1, ..., n_m}.
Multinomial likelihood: L(θ|n) ∝ ∏_{j=1}^m θ_j^{n_j}.
Maximum likelihood estimator: θ̂_j = n_j / n.
Bayesian estimation: Dirichlet prior
The prior expresses the beliefs about θ, before analyzing the data:
π(θ) ∝ ∏_{j=1}^m θ_j^{s t_j − 1}
s > 0 is the equivalent sample size, which can be regarded as a number of hidden instances; t_j is the proportion of hidden instances in category j.
Posterior distribution
Obtained by multiplying likelihood and prior:
π(θ|n) ∝ ∏_j θ_j^{n_j + s t_j − 1}
Dirichlet posteriors are obtained from Dirichlet priors (conjugacy). Taking expectations:
P(c_j|n, t) = (n_j + s t_j) / (n + s)
Example
n = 10, n_1 = 4, n_2 = 6, s = 1. The posterior estimate is sensitive to the choice of the prior:

              Prior 1 (t_1 = 0.5, t_2 = 0.5)     Prior 2 (t_1 = 0.8, t_2 = 0.2)
Estimate      θ̂_1 = (4 + 0.5)/(10 + 1) = 0.41    θ̂_1 = (4 + 0.8)/(10 + 1) = 0.44

The uniform prior is the most common choice.
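The posterior estimates in the table can be checked with a few lines of Python; a minimal sketch of the Dirichlet posterior mean:

```python
def posterior_estimate(n_j, n, s, t_j):
    """Posterior mean of theta_j under a Dirichlet prior
    with equivalent sample size s and proportion t_j."""
    return (n_j + s * t_j) / (n + s)

# n = 10 observations, n_1 = 4, s = 1:
est_prior1 = posterior_estimate(4, 10, 1, 0.5)  # t_1 = 0.5  -> about 0.41
est_prior2 = posterior_estimate(4, 10, 1, 0.8)  # t_1 = 0.8  -> about 0.44
```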
The uniform prior looks non-informative... but its estimates depend on the sample space.

Sample space       {θ_1, θ_2}                        {θ_1, θ_2, θ_3}
Data               n_1 = 4, n_2 = 6                  n_1 = 4, n_2 = 6, n_3 = 0
Estimate of θ_1    θ̂_1 = (4 + 1/2)/(10 + 1) = .41    θ̂_1 = (4 + 1/3)/(10 + 1) = .39

"Non-informative priors are a lost cause." — L. Wasserman, normaldeviate.wordpress.com
Modelling prior ignorance: the Imprecise Dirichlet Model (IDM) (Walley, 1996)
The IDM is a convex set of Dirichlet priors. The set of admissible values for the t_j is:
0 < t_j < 1 for each j, with Σ_j t_j = 1.
This is a model of prior ignorance: a priori, nothing is known.
Learning from data with the IDM
An upper and a lower posterior expectation are derived for each chance θ_j:
E̲(θ_j|n) = inf_{0<t_j<1} (n_j + s t_j)/(n + s) = n_j/(n + s)
Ē(θ_j|n) = sup_{0<t_j<1} (n_j + s t_j)/(n + s) = (n_j + s)/(n + s)
Upper and lower expectations do not depend on the sample space.
Example
Binary variable, with n = 10, n_1 = 4, n_2 = 6, s = 1. The estimates of θ_1 are:

Bayes (t_1 = 0.5, t_2 = 0.5)    θ̂_1 = (4 + 0.5)/(10 + 1) = .409
Bayes (t_1 = 0.8, t_2 = 0.2)    θ̂_1 = (4 + 0.8)/(10 + 1) = .436
IDM                             [4/(10 + 1), (4 + 1)/(10 + 1)] = [.363, .454]

The interval estimate of the IDM comprises the estimates obtained by letting each t_j vary within (0, 1).
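The IDM interval above follows directly from the lower and upper expectations; a short sketch:

```python
def idm_interval(n_j, n, s=1.0):
    """Lower and upper posterior expectation of theta_j under the IDM:
    the infimum/supremum of (n_j + s*t_j)/(n + s) over 0 < t_j < 1."""
    lower = n_j / (n + s)
    upper = (n_j + s) / (n + s)
    return lower, upper

lo, hi = idm_interval(4, 10, s=1)  # -> (.363..., .454...)
```

Note how the interval shrinks as n grows, matching the remark that the gap narrows with more data.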
On the IDM The estimates are imprecise, being characterized by an upper and a lower probability. They do not depend on the sample space. The gap between upper and lower probability narrows as more data become available. References: Walley, 1996, J. of the Royal Statistical Society. Series B. Bernard, 2009, Int. J. Approximate Reasoning.
Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging
Classification
Determine the type of Iris (class) on the basis of length and width (features) of sepal and petal. (a) setosa (b) virginica (c) versicolor
Classification: predict the class C of a given object on the basis of the features A = {A_1, ..., A_k}.
Prediction: the most probable class a posteriori.
Naive Bayes (NBC)
(Graph: class C with arcs to features A_1, A_2, A_3.)
Naively assumes the features to be independent given the class.
NBC is highly biased, but achieves good accuracy, especially on small data sets, thanks to its low variance (Friedman, 1997).
Naive Bayes (NBC)
(Graph: class C with arcs to features A_1, A_2, A_3.)
Learns from data the joint probability of class and features, decomposed as the marginal probability of the classes and the conditional probability of each feature given the class.
Joint prior
θ_{c,a}: the unknown joint probability of class and features, which we want to estimate. Under the naive assumption and a Dirichlet prior, the joint prior is:
P(θ_{c,a}) ∝ ∏_{c∈Ω_C} [ θ_c^{s t(c) − 1} ∏_{i=1}^k ∏_{a∈Ω_{A_i}} θ_{(a|c)}^{s t(a,c) − 1} ]
where t(a, c) is the proportion of hidden instances with C = c and A_i = a. Let the vector t collect all the parameters t(c) and t(a, c). Thus, a joint prior is specified by t.
Likelihood and posterior
The likelihood has the same form as the prior, with the coefficients s t(·) replaced by the counts n(·):
L(θ|n) ∝ ∏_{c∈Ω_C} [ θ_c^{n(c)} ∏_{i=1}^k ∏_{a∈Ω_{A_i}} θ_{(a|c)}^{n(a,c)} ]
The joint posterior P(θ_{c,a}|n, t) has the same form, with the coefficients n(·) replaced by n(·) + s t(·). Once P(θ_{c,a}|n, t) is available, the classifier is trained.
Issuing a classification
The value of the features is specified as a = (a_1, ..., a_k). Then
P(c|a) ∝ P(c) ∏_{i=1}^k P(a_i|c)
where
P(c) = (n(c) + s t(c)) / (n + s)
P(a_i|c) = (n(a_i, c) + s t(a_i, c)) / (n(c) + s t(c)).
Prior-dependence: the most probable class varies with t.
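The classification rule above can be sketched in a few lines. This is a simplified illustration, not the tutorial's implementation: it fixes a uniform choice t(c) = 1/|Ω_C| and t(a, c) = t(c)/|Ω_{A_i}|, and the toy data set at the end is hypothetical.

```python
from collections import Counter

def nbc_posterior(instance, data, classes, s=1.0):
    """Posterior P(c | a) for naive Bayes with a uniform Dirichlet prior:
    t(c) = 1/|C|, t(a, c) = t(c)/|Omega_Ai|. A simplified sketch."""
    n = len(data)                               # training set size
    class_counts = Counter(c for _, c in data)  # n(c)
    scores = {}
    for c in classes:
        t_c = 1.0 / len(classes)
        score = (class_counts[c] + s * t_c) / (n + s)          # P(c)
        for i, a in enumerate(instance):
            values = {feats[i] for feats, _ in data}           # Omega_Ai
            n_ac = sum(1 for feats, cl in data
                       if cl == c and feats[i] == a)           # n(a_i, c)
            t_ac = t_c / len(values)
            score *= (n_ac + s * t_ac) / (class_counts[c] + s * t_c)  # P(a_i|c)
        scores[c] = score
    z = sum(scores.values())
    return {c: v / z for c, v in scores.items()}

# hypothetical toy data: one binary feature, two classes
data = [(('y',), 'A'), (('y',), 'A'), (('n',), 'B')]
post = nbc_posterior(('y',), data, ['A', 'B'])
```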
Credal classifiers
Induced using a set of priors (credal set). They separate safe instances from prior-dependent ones. On prior-dependent instances they return a set of classes ("indeterminate classifications"), remaining robust though less informative.
IDM over the naive topology
We consider a set of joint priors, factorized as follows.
Class: 0 < t(c) < 1 for each c ∈ Ω_C, with Σ_c t(c) = 1. A priori, 0 < P(c_j) < 1 for each j.
Features given the class: Σ_{a∈Ω_{A_i}} t(a, c) = t(c) for each c, with 0 < t(a, c) < t(c) for each a, c. A priori, 0 < P(a|c) < 1 for each a, c.
Naive Credal Classifier (NCC)
Uses the IDM to specify a credal set of joint distributions and updates it into a posterior credal set. The posterior probability of a class c ranges within an interval. Given the feature observation a, class c′ credal-dominates class c″ if, for each posterior of the credal set, P(c′|a) > P(c″|a).
NCC and prior-dependent instances
Credal-dominance is checked by solving an optimization problem. NCC eventually returns the non-dominated classes: a singleton on the safe instances, a set on the prior-dependent ones. The next application shows the gain in reliability obtained by returning indeterminate classifications on the prior-dependent instances.
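To illustrate the idea of returning non-dominated classes, here is a sketch based on interval dominance, a conservative surrogate that compares lower and upper posterior probabilities; the credal-dominance test NCC actually solves is a stronger criterion requiring the optimization mentioned above. The example intervals are hypothetical.

```python
def non_dominated(intervals):
    """Return the classes not interval-dominated by any other class.
    `intervals` maps class -> (lower, upper) posterior probability.
    Class d interval-dominates class c when lower(d) > upper(c)."""
    return [c for c, (lo_c, hi_c) in intervals.items()
            if not any(lo_d > hi_c
                       for d, (lo_d, hi_d) in intervals.items() if d != c)]

# safe instance: one class dominates all the others
safe = non_dominated({'a': (0.6, 0.8), 'b': (0.1, 0.3), 'c': (0.25, 0.55)})
# prior-dependent instance: overlapping intervals, two classes returned
unsafe = non_dominated({'a': (0.4, 0.7), 'b': (0.3, 0.6)})
```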
Next From Naive Bayes to Naive Credal Classifier: Texture recognition
Texture recognition The OUTEX data sets (Ojala, 2002): 4500 images, 24 classes (textiles, carpets, woods..).
Features: Local Binary Patterns (Ojala, 2002) The gray level of each pixel is compared with that of its neighbors, resulting in a binary judgment (more intense/ less intense). Such judgments are collected in a string for each pixel.
Local Binary Patterns (2)
Each string is then assigned to a single category. The categories group similar strings: e.g., 00001111 is in the same category as 11110000, for rotational invariance. There are 18 categories. Each image thus yields 18 features: the percentage of pixels assigned to each category.
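The rotation-invariant grouping can be sketched by mapping each circular binary string to a canonical rotation. This is a simplified 8-neighbour illustration of the idea only; the descriptor used in the slides has 18 categories, which corresponds to a different (larger) neighbourhood and grouping scheme.

```python
def rotation_invariant_category(bits):
    """Map a circular binary string to a rotation-invariant canonical
    form: its lexicographically smallest rotation. Simplified sketch."""
    rotations = [bits[i:] + bits[:i] for i in range(len(bits))]
    return min(rotations)

# 00001111 and 11110000 are rotations of each other,
# so they land in the same category:
cat1 = rotation_invariant_category('00001111')
cat2 = rotation_invariant_category('11110000')
```

The per-image features would then be the normalized histogram of these categories over all pixels.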
Results (Corani et al., BMVC 2010)
Accuracy of NBC: 92% (SVMs: 92.5%). NBC is highly accurate on the safe instances, but almost random on the prior-dependent ones.

                        Safe    Prior-dependent
Amount                  95%     5%
NBC: accuracy           94%     56%
NCC: accuracy           94%     85%
NCC: non-dom. classes   1       2.4
Different training set sizes (II)
As n grows there is a decrease of both: the % of indeterminate classifications and the number of classes returned when indeterminate.
(Figure: left panel — % of indeterminate classifications vs training set size, 0–5000; right panel — number of classes returned when indeterminate vs training set size.)
Naive Bayes + rejection option vs Naive Credal
Rejection option: reject an instance (no classification) if the probability of the most probable class is below a threshold p. Texture recognition: half of the prior-dependent instances are classified by naive Bayes with probability > 90%. The rejection rule is not designed to detect prior-dependent instances!
Tree-augmented credal classifier (TAN)
(Graphs: NBC — C with arcs to A_1, A_2, A_3; TAN — the same, plus arcs among the features forming a tree.)
TAN improves over naive Bayes by weakening the independence assumption.
Tree-augmented credal classifier (TAN)
Credal TAN detects instances unreliably classified by Bayesian TAN because of prior-dependence.
References: Zaffalon, Reliable Computing, 2003; Corani and de Campos, Int. J. Appr. Reasoning, 2010.
Next From Naive Bayes to Naive Credal Classifier: Conservative treatment of missing data
Ignorance from missing data
Usually, classifiers ignore missing data, assuming them to be MAR (missing at random). MAR: the probability of an observation being missing depends neither on its value nor on the values of other missing data. The MAR assumption cannot be tested on the incomplete data.
A non-MAR example: a political poll
The right-wing supporters (R) sometimes refuse to answer; left-wing supporters (L) always answer.

Vote:    L  L  L  R  R  R
Answer:  L  L  L  R  -  -

By ignoring missing data, P(R) = 1/4: underestimated!
Conservative treatment of missing data
Consider each possible completion of the data.

Answer   D1   D2   D3   D4
L        L    L    L    L
L        L    L    L    L
L        L    L    L    L
R        R    R    R    R
-        L    L    R    R
-        L    R    L    R
P(R)     1/6  1/3  1/3  1/2

P(R) ∈ [1/6, 1/2]; this interval includes the real value. Application to BNs: Ramoni and Sebastiani (2000).
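The interval [1/6, 1/2] can be reproduced by enumerating the completions explicitly; a small sketch (the function name is illustrative):

```python
from itertools import product

def vote_share_interval(observed, n_missing, party='R'):
    """Bounds on P(party) over every completion of the missing answers.
    `observed` is the string of recorded answers; each of the
    `n_missing` refusals is filled with either 'L' or 'R'."""
    total = len(observed) + n_missing
    shares = []
    for completion in product('LR', repeat=n_missing):
        votes = list(observed) + list(completion)
        shares.append(votes.count(party) / total)
    return min(shares), max(shares)

# observed answers: 3 L and 1 R, plus 2 refusals
lo, hi = vote_share_interval('LLLR', 2)  # -> (1/6, 1/2)
```

Enumerating completions is exponential in the number of missing values, which is why the polynomial-time algorithms mentioned next matter.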
Conservative Inference Rule (Zaffalon, 2005)
MAR missing data are ignored. Non-MAR missing data are filled in all possible ways, both in the training and in the test data. The number of replacements grows exponentially with the missing data; yet polynomial-time algorithms are available for the naive credal classifier. Reference: Corani and Zaffalon, 2008, JMLR.
The conservative treatment of missing data increases indeterminacy. Multiple classes are returned if the most probable class depends on the prior specification or on the completion of the non-MAR missing data. Declare each feature as MAR or non-MAR depending on domain knowledge, for a good trade-off between robustness and informativeness.
Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging
Discounted accuracy
d-acc = (1/N) Σ_{i=1}^N (accurate)_i / Z_i
(accurate)_i: whether the set of classes returned on instance i contains the actual class;
Z_i: the number of classes returned on the i-th instance.
For a traditional classifier, d-acc equals the 0–1 accuracy.
Some properties
Between two accurate classifications, d-acc is higher for the more informative classifier. Between an accurate and an inaccurate classification, d-acc is higher for the accurate one, even if it returned multiple classes (linear discount). Discounted accuracy satisfies some fundamental properties (Zaffalon et al., IJAR 2012).
Doctor random and doctor vacuous
Possible diseases: {A, B}. Doctor random: uniformly random diagnosis. Doctor vacuous: always returns both categories.

Disease   Random: class / d-acc   Vacuous: class / d-acc
A         A / 1                   {A,B} / 0.5
A         B / 0                   {A,B} / 0.5
B         A / 0                   {A,B} / 0.5
B         B / 1                   {A,B} / 0.5
d-acc     0.5                     0.5
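The discounted-accuracy computation behind the table can be sketched directly; the data below reproduce the table's particular realization of doctor random:

```python
def discounted_accuracy(predictions, truths):
    """d-acc = (1/N) * sum of (accurate_i / Z_i).
    `predictions` is a list of sets of classes; `truths` the true classes."""
    return sum((t in p) / len(p)
               for p, t in zip(predictions, truths)) / len(truths)

truths = ['A', 'A', 'B', 'B']
vacuous = [{'A', 'B'}] * 4                    # always both classes
random_run = [{'A'}, {'B'}, {'A'}, {'B'}]     # the realization in the table
d_vacuous = discounted_accuracy(vacuous, truths)   # 0.5, every time
d_random = discounted_accuracy(random_run, truths) # 0.5 on average only
```

Both scores equal 0.5 here, but only doctor vacuous scores 0.5 on every instance; doctor random's per-instance score fluctuates between 0 and 1, which is exactly the variance argument on the next slide.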
Doctor random vs doctor vacuous: the manager's viewpoint
Assumption: the hospital profits an amount of money proportional to the discounted accuracy. After n visits, the profits are:
doctor vacuous: n/2, with no variance;
doctor random: expected n/2, with variance n/4.
Introducing utility
Expected utility increases with the expected value of the rewards but decreases with their variance. Any risk-averse manager prefers doctor vacuous over doctor random. And you would prefer a vacuous over a random diagnosis! Idea: quantify such preference through a utility function (Zaffalon et al., IJAR 2012).
How to design the utility function

Actual   Predicted   Utility
A        A           1
A        B           0

Utility corresponds with accuracy for a traditional classification consisting of a single class.
How to design the utility function

Actual   Predicted   Utility
A        {B, C}      0
A        {A, B}      0.65–0.80

A wrong indeterminate classification has utility 0. The utility of an accurate but indeterminate classification consisting of two classes has to be larger than 0.5... otherwise doctor random and doctor vacuous would yield the same utility.
How to design the utility function
Let this utility be 0.65 (function u_65) or 0.8 (function u_80), then perform sensitivity analysis. The utility of credal classifiers and the accuracy of determinate classifiers can now be compared. Such utility functions are numerically close to the F1 and F2 metrics (Del Coz et al., JMLR, 2009).
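One natural way to build u_65 and u_80 — a quadratic transform of discounted accuracy pinned at u(0) = 0, u(1) = 1 and u(1/2) = w — can be sketched as follows. This is our reading of the construction, offered as an assumption; the exact functional form in Zaffalon et al. (IJAR 2012) may differ in detail.

```python
def utility(x, w):
    """Quadratic utility of discounted accuracy x, with u(0) = 0,
    u(1) = 1 and u(0.5) = w (w = 0.65 gives u_65, w = 0.8 gives u_80).
    A sketch under the stated interpolation assumption."""
    a = 2 - 4 * w      # coefficient of x^2 (negative for w > 0.5: concave)
    b = 4 * w - 1      # coefficient of x; a + b = 1 ensures u(1) = 1
    return a * x * x + b * x
```

For w > 0.5 the function is concave, so an accurate two-class answer (x = 0.5) is worth more than a coin flip between x = 0 and x = 1, capturing the risk-averse preference for doctor vacuous.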
Comparing naive Bayes and naive credal
Wins on 55 UCI data sets:

Utility   naive Bayes   naive Credal
u_65      18            37
u_80      13            42

How to compare credal and traditional classifiers was an open problem at AAAI-10!
Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging
Model uncertainty: an example with logistic regression
Given k features, we can design 2^k logistic regressors, each with a different feature set. Model uncertainty: different logistic regressors (with different feature sets) can fit the data similarly well. Considering only a single model leads to overconfident conclusions. Bayesian Model Averaging (BMA) combines the inferences of multiple models, using as weights the posterior probabilities of the models.
Bayesian Model Averaging (BMA)
The posterior of class c under BMA is:
P(c|D) = Σ_{m_i∈M} P(c|m_i, D) P(m_i|D) ∝ Σ_{m_i∈M} P(c|m_i, D) P(D|m_i) P(m_i)
D: available training data; m_i: the i-th model within the model space M; P(m_i): the prior over the models.
Often the BMA analysis is repeated under different priors over the models.
Independent Bernoulli (IB) prior over the models
Prior probability of model m_i, containing k_i features:
P(m_i) = θ^{k_i} (1 − θ)^{k − k_i}
Assumptions: every covariate has the same prior probability of inclusion θ, which is the only parameter; the inclusion of each covariate is independent of the others. Uniform prior over the models: θ = 0.5.
Credal model averaging (CMA)
Represents the prior over the models by a credal set. The prior probability of inclusion of each covariate varies in [θ̲, θ̄]. Ignorance a priori can be represented by setting θ̲ = ε, θ̄ = 1 − ε. CMA automates the sensitivity analysis on the value of θ.
Upper and lower probability of a class
Lower posterior probability of class c_1:
P̲(c_1|D, x) = min_{θ∈[θ̲,θ̄]} Σ_{m_i∈M} P(c_1|D, x, m_i) P(m_i|D)
            = min_{θ∈[θ̲,θ̄]} Σ_{m_i∈M} P(c_1|D, x, m_i) · [P(D|m_i) θ^{k_i}(1 − θ)^{k − k_i}] / [Σ_{m_j∈M} P(D|m_j) θ^{k_j}(1 − θ)^{k − k_j}]
The upper probability is found by maximizing the same expression. Such optimizations are solved analytically.
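The optimization above can be illustrated numerically by scanning θ over a grid and tracking the extreme posteriors. This is only a sketch: the slides state the actual solution is analytic, and the tiny two-model space below is hypothetical.

```python
def cma_bounds(models, k, theta_lo, theta_hi, grid=1001):
    """Numeric stand-in for the analytic CMA optimisation.
    `models` is a list of (k_i, P(D|m_i), P(c1|D, x, m_i)) triples;
    `k` is the total number of covariates. Scans theta over a grid and
    returns (lower, upper) posterior probability of c1."""
    lo, hi = 1.0, 0.0
    for step in range(grid):
        theta = theta_lo + (theta_hi - theta_lo) * step / (grid - 1)
        # IB prior weight of each model, times its marginal likelihood
        weights = [ml * theta**ki * (1 - theta)**(k - ki)
                   for ki, ml, _ in models]
        post = sum(w * p for w, (_, _, p) in zip(weights, models)) / sum(weights)
        lo, hi = min(lo, post), max(hi, post)
    return lo, hi

# hypothetical toy model space over k = 2 covariates:
# the empty model predicts c1 strongly, the full model weakly
models = [(0, 1.0, 0.9), (2, 1.0, 0.2)]
lo, hi = cma_bounds(models, k=2, theta_lo=0.05, theta_hi=0.95)
```

When the interval [lo, hi] straddles 0.5 the instance is prior-dependent, and CMA returns both classes.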
Credal model averaging (CMA) The prior over the models is modelled by a credal set. The single models (logistic regressors with different feature sets) are learned in a traditional Bayesian way. CMA is thus a credal ensemble of Bayesian regressors.
Indeterminate classifications
Safe instances: the most probable class remains the same regardless of the prior over the models; CMA and BMA return the same class for every θ ∈ [θ̲, θ̄]. Prior-dependent instances: the most probable class varies with θ; CMA suspends the judgment, returning a set of classes.
Predicting presence or absence of marmot burrows Study area: 9500 cells (100m 2 each) within the Stelvio National Park (Italy). Presence of burrows in about 4.5% of the cells. Features available for each cell : altitude, slope, aspect (the direction in which the slope faces), soil temperature, soil cover, etc.
Experiments: comparing BMA and CMA
BMA priors over the models: uniform (θ = 0.5) and hierarchical (θ as a random variable with a uniform prior). Training sets of size n ∈ {30, 60, 90, ..., 300}. 30 experiments for each sample size. For CMA we set θ̄ = 0.95 and θ̲ = 0.05 (ignorance a priori).
BMA accuracy
Regardless of the adopted prior over the models: BMA is highly accurate on the safe instances but almost a random guesser on the prior-dependent ones!
(Figure: BMA accuracy on safe vs prior-dependent instances as a function of training set size. Left: uniform prior. Right: hierarchical prior.)
Credal model averaging (CMA)
Allows an easy sensitivity analysis of the coefficients of the covariates (upper/lower values) by solving an optimization problem, whose solution is analytical. CMA automates sensitivity analysis.
Other ensembles using a credal set for the prior probability of the models
Credal ensemble of naive Bayes (Corani and Zaffalon, ECML-PKDD 2008). Credal ensemble with compression coefficients (Corani and Antonucci, Comp. Statistics & Data Analysis, in press).
Other imprecise-probability classifiers
Some useful pointers:
Credal KNN: Destercke, Soft Computing, 2012.
Credal regression trees: Abellan and Moral, Int. J. Appr. Reasoning, 2003; Abellan et al., Comp. Statistics & Data Analysis, 2013.
Classification of temporal sequences: Antonucci et al., Proc. ISIPTA '13 (Symp. on Imprecise Probability).
Conclusions
Credal classifiers suspend the judgment on prior-dependent instances. Prior-dependent instances are not effectively identified by the rejection option. Credal classifiers robustly deal with the specification of the prior and with non-MAR missing data. Typically, the accuracy of traditional classifiers drops on the instances indeterminately classified by credal classifiers.
Open problems
Multi-label classification. Dealing via credal sets with both the prior on the model parameters and the prior over the models.