Credal Classification


Credal Classification. A. Antonucci, G. Corani, D. Mauá {alessandro,giorgio,denis}@idsia.ch, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Lugano (Switzerland). IJCAI-13.

About the speaker. PhD in Information Engineering (Politecnico di Milano, 2005). Since 2006: researcher at IDSIA (www.idsia.ch), Switzerland. Member of the IDSIA Imprecise Probability Group (http://ipg.idsia.ch). Research topics: probabilistic graphical models, data mining, imprecise probability.

Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging

Estimation for a categorical variable. Class C, with sample space Ω_C = {c_1, ..., c_m}. P(c_j) = θ_j, θ = {θ_1, ..., θ_m}. n i.i.d. observations; n = {n_1, ..., n_m}. Multinomial likelihood: L(θ | n) ∝ ∏_{j=1}^m θ_j^{n_j}. Maximum-likelihood estimator: θ̂_j = n_j / n.

Bayesian estimation: Dirichlet prior. The prior expresses the beliefs about θ before analyzing the data: π(θ) ∝ ∏_{j=1}^m θ_j^{s t_j - 1}. s > 0 is the equivalent sample size, which can be regarded as a number of hidden instances; t_j is the proportion of hidden instances in category j.

Posterior distribution. Obtained by multiplying likelihood and prior: π(θ | n) ∝ ∏_j θ_j^{n_j + s t_j - 1}. Dirichlet posteriors are obtained from Dirichlet priors (conjugacy). Taking expectations: P(c_j | n, t) = (n_j + s t_j) / (n + s).
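
A minimal numerical sketch of the two estimators above (not part of the tutorial; it assumes NumPy and uses our own function names):

```python
import numpy as np

def mle(counts):
    """Maximum-likelihood estimate: theta_hat_j = n_j / n."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def dirichlet_posterior_mean(counts, s, t):
    """Posterior expectation under a Dirichlet prior: (n_j + s*t_j) / (n + s)."""
    counts = np.asarray(counts, dtype=float)
    t = np.asarray(t, dtype=float)
    return (counts + s * t) / (counts.sum() + s)

# Example from the next slide: n1=4, n2=6, s=1, uniform prior t=(0.5, 0.5).
print(mle([4, 6]))                                       # [0.4, 0.6]
print(dirichlet_posterior_mean([4, 6], 1, [0.5, 0.5]))   # [0.409..., 0.590...]
```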

Example: n = 10, n_1 = 4, n_2 = 6, s = 1. The posterior estimate is sensitive to the choice of the prior:
Prior 1 (t_1 = 0.5, t_2 = 0.5): θ̂_1 = (4 + 0.5) / (10 + 1) = 0.41
Prior 2 (t_1 = 0.8, t_2 = 0.2): θ̂_1 = (4 + 0.8) / (10 + 1) = 0.44
The uniform prior is the most common choice.

The uniform prior looks non-informative... but its estimates depend on the sample space.
Sample space {θ_1, θ_2}: data n_1 = 4, n_2 = 6; estimate θ̂_1 = (4 + 1/2) / (10 + 1) = .41
Sample space {θ_1, θ_2, θ_3}: data n_1 = 4, n_2 = 6, n_3 = 0; estimate θ̂_1 = (4 + 1/3) / (10 + 1) = .39
"Non-informative priors are a lost cause" (L. Wasserman, normaldeviate.wordpress.com).
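
A two-line check of the table above, illustrating how the uniform-prior estimate of θ_1 changes with the sample space (illustrative code, not from the tutorial):

```python
# Same data (n1=4, n2=6, s=1), uniform prior over two vs three categories.
two_categories   = (4 + 1 * (1 / 2)) / (10 + 1)   # t_1 = 1/2 -> 0.409...
three_categories = (4 + 1 * (1 / 3)) / (10 + 1)   # t_1 = 1/3 -> 0.393...
print(round(two_categories, 2), round(three_categories, 2))  # 0.41 0.39
```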

Modelling prior ignorance: the Imprecise Dirichlet Model (IDM) (Walley, 1996). The IDM is a convex set of Dirichlet priors. The set of admissible values for the t_j is: 0 < t_j < 1 ∀ j, ∑_j t_j = 1. This is a model of prior ignorance: a priori, nothing is known.

Learning from data with the IDM. An upper and a lower posterior expectation are derived for each chance θ_j:
E̲(θ_j | n) = inf_{0 < t_j < 1} (n_j + s t_j) / (n + s) = n_j / (n + s)
Ē(θ_j | n) = sup_{0 < t_j < 1} (n_j + s t_j) / (n + s) = (n_j + s) / (n + s)
Upper and lower expectations do not depend on the sample space.

Example: binary variable, with n = 10, n_1 = 4, n_2 = 6, s = 1. The estimates of θ_1 are:
Bayes (t_1 = 0.5, t_2 = 0.5): θ̂_1 = (4 + 0.5) / (10 + 1) = .409
Bayes (t_1 = 0.8, t_2 = 0.2): θ̂_1 = (4 + 0.8) / (10 + 1) = .436
IDM: [4 / (10 + 1), (4 + 1) / (10 + 1)] = [.363, .454]
The interval estimate of the IDM comprises the estimates obtained by letting each t_j vary within (0, 1).
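
A hedged sketch of the IDM interval estimate, reproducing the numbers of this example (the function name is ours):

```python
def idm_interval(n_j, n, s=1.0):
    """Lower and upper posterior expectation of theta_j under the IDM (Walley, 1996):
    the infimum and supremum of (n_j + s*t_j) / (n + s) over t_j in (0, 1)."""
    return n_j / (n + s), (n_j + s) / (n + s)

# Example from the slide: n=10, n_1=4, s=1  ->  (0.363..., 0.454...)
print(idm_interval(4, 10, s=1))
```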

On the IDM. The estimates are imprecise, being characterized by an upper and a lower probability. They do not depend on the sample space. The gap between upper and lower probability narrows as more data become available. References: Walley, 1996, J. of the Royal Statistical Society, Series B; Bernard, 2009, Int. J. Approximate Reasoning.

Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging

Classification. Determine the type of Iris (class) on the basis of length and width (features) of sepal and petal: (a) setosa, (b) virginica, (c) versicolor. Classification: predicting the class C of a given object on the basis of the features A = {A_1, ..., A_k}. Prediction: the most probable class a posteriori.

Naive Bayes (NBC). (Graph: class C with children A_1, A_2, A_3.) Naively assumes the features to be independent given the class. NBC is highly biased, but achieves good accuracy, especially on small data sets, thanks to low variance (Friedman, 1997).

Naive Bayes (NBC). (Graph: class C with children A_1, A_2, A_3.) Learns from data the joint probability of class and features, decomposed as the marginal probability of the classes and the conditional probability of each feature given the class.

Joint prior. θ_{c,a}: the unknown joint probability of class and features, which we want to estimate. Under the naive assumption and a Dirichlet prior, the joint prior is: P(θ_{c,a}) ∝ ∏_{c ∈ Ω_C} [ θ_c^{s t(c)} ∏_{i=1}^k ∏_{a ∈ Ω_{A_i}} θ_{(a|c)}^{s t(a,c)} ], where t(a, c) is the proportion of hidden instances with C = c and A_i = a. Let the vector t collect all the parameters t(c) and t(a, c). Thus, a joint prior is specified by t.

Likelihood and posterior. The likelihood is like the prior, with the coefficients s t(·) replaced by the n(·): L(θ | n) ∝ ∏_{c ∈ Ω_C} [ θ_c^{n(c)} ∏_{i=1}^k ∏_{a ∈ Ω_{A_i}} θ_{(a|c)}^{n(a,c)} ]. The joint posterior P(θ_{c,a} | n, t) is like the likelihood, with the coefficients n(·) replaced by n(·) + s t(·). Once P(θ_{c,a} | n, t) is available, the classifier is trained.

Issuing a classification. The value of the features is specified as a = (a_1, ..., a_k). Then P(c | a) ∝ P(c) ∏_{i=1}^k P(a_i | c), where P(c) = (n(c) + s t(c)) / (n + s) and P(a_i | c) = (n(a_i, c) + s t(a_i, c)) / (n(c) + s t(c)). Prior-dependence: the most probable class varies with t.
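
The following is a minimal sketch of a naive Bayes classifier trained with the counts-plus-s·t smoothing described above; the uniform choice of t, the helper names, and the toy data set are our own illustration, not the tutorial's code:

```python
from collections import Counter, defaultdict

def train_nbc(X, y, s=1.0):
    """X: list of feature tuples, y: list of class labels.
    Returns a predict() function using P(c|a) ∝ P(c) * prod_i P(a_i|c)."""
    n = len(y)
    class_counts = Counter(y)
    classes = sorted(class_counts)
    k = len(X[0])
    # joint counts n(a_i, c) and observed values for every feature position i
    feat_counts = [defaultdict(Counter) for _ in range(k)]
    feat_values = [set() for _ in range(k)]
    for x, c in zip(X, y):
        for i, a in enumerate(x):
            feat_counts[i][c][a] += 1
            feat_values[i].add(a)

    def predict(x):
        scores = {}
        for c in classes:
            t_c = 1.0 / len(classes)                        # uniform t(c)
            p = (class_counts[c] + s * t_c) / (n + s)       # P(c)
            for i, a in enumerate(x):
                t_ac = t_c / len(feat_values[i])            # uniform t(a, c)
                p *= (feat_counts[i][c][a] + s * t_ac) / (class_counts[c] + s * t_c)
            scores[c] = p
        return max(scores, key=scores.get)

    return predict

# Tiny illustrative data set (two binary features).
X = [(0, 1), (0, 1), (1, 0), (1, 1)]
y = ['a', 'a', 'b', 'b']
predict = train_nbc(X, y)
print(predict((0, 1)))   # -> 'a'
```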

Credal classifiers. Induced using a set of priors (credal set). They separate safe instances from prior-dependent ones. On prior-dependent instances they return a set of classes ("indeterminate classifications"), remaining robust though less informative.

IDM over the naive topology. We consider a set of joint priors, factorized as follows.
Class: 0 < t(c) < 1 ∀ c ∈ Ω_C, ∑_c t(c) = 1. A priori, 0 < P(c_j) < 1 ∀ j.
Features given the class: ∑_{a ∈ Ω_{A_i}} t(a, c) = t(c) ∀ i, c and 0 < t(a, c) < t(c) ∀ a, c. A priori, 0 < P(a | c) < 1 ∀ a, c.

Naive Credal Classifier (NCC). Uses the IDM to specify a credal set of joint distributions and updates it into a posterior credal set. The posterior probability of class c ranges within an interval. Given the feature observation a, class c' credal-dominates class c'' if P(c' | a) > P(c'' | a) for each posterior of the credal set.

NCC and prior-dependent instances. Credal-dominance is checked by solving an optimization problem. NCC eventually returns the non-dominated classes: a singleton on the safe instances, a set on the prior-dependent ones. The next application shows the gain in reliability obtained by returning indeterminate classifications on the prior-dependent instances.
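
For intuition, here is a simplified sketch of how non-dominated classes can be collected from per-class probability intervals. It uses plain interval dominance on hypothetical bounds, not the exact credal-dominance optimization mentioned above, which compares the classes jointly over the credal set rather than through separate intervals:

```python
def interval_dominates(lower_c1, upper_c2):
    """c1 interval-dominates c2 if the lower score of c1 exceeds the upper score of c2."""
    return lower_c1 > upper_c2

def non_dominated(lower, upper):
    """lower/upper: dicts class -> lower/upper posterior score.
    Returns the classes not interval-dominated by any other class."""
    return [c for c in lower
            if not any(interval_dominates(lower[c2], upper[c]) for c2 in lower if c2 != c)]

# Hypothetical posterior intervals for three classes.
lower = {'a': 0.50, 'b': 0.20, 'c': 0.05}
upper = {'a': 0.70, 'b': 0.55, 'c': 0.15}
print(non_dominated(lower, upper))   # ['a', 'b']  ('c' is dominated by 'a')
```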

Next From Naive Bayes to Naive Credal Classifier: Texture recognition

Texture recognition. The OUTEX data sets (Ojala, 2002): 4500 images, 24 classes (textiles, carpets, woods, ...).

Features: Local Binary Patterns (Ojala, 2002). The gray level of each pixel is compared with that of its neighbors, resulting in a binary judgment (more intense / less intense). Such judgments are collected in a string for each pixel.

Local Binary Patterns (2). Each string is then assigned to a single category. The categories group similar strings: e.g., 00001111 is in the same category as 11110000, for rotational invariance. There are 18 categories. For each image there are 18 features: the percentage of pixels assigned to each category.

Results (Corani et al., BMVC 2010). Accuracy of NBC: 92% (SVMs: 92.5%). NBC is highly accurate on the safe instances, but almost random on the prior-dependent ones.
                             Safe    Prior-dependent
Amount of instances          95%     5%
NBC: accuracy                94%     56%
NCC: accuracy                94%     85%
NCC: non-dominated classes   1       2.4

Different training set sizes (II). As n grows there is a decrease of both the percentage of indeterminate classifications and the number of classes returned when indeterminate. (Figure: two panels as a function of the training set size, 0-5000: left, percentage of indeterminate classifications; right, number of classes returned when indeterminate.)

Naive Bayes + rejection option vs naive credal. Rejection option: reject an instance (no classification) if the probability of the most probable class is below a threshold p. Texture recognition: half of the prior-dependent instances are classified by naive Bayes with probability > 90%. The rejection rule is not designed to detect prior-dependent instances!

Tree-augmented credal classifier (TAN). (Graphs: NBC has class C with children A_1, A_2, A_3; TAN additionally links the features through a tree.) TAN improves over naive Bayes by weakening the independence assumption.

Tree-augmented credal classifier (TAN). (Graphs: NBC vs TAN structures over class C and features A_1, A_2, A_3.) Credal TAN detects the instances that are unreliably classified by Bayesian TAN because of prior-dependence. References: Zaffalon, Reliable Computing, 2003; Corani and De Campos, Int. J. Appr. Reasoning, 2010.

Next From Naive Bayes to Naive Credal Classifier: Conservative treatment of missing data

Ignorance from missing data. Usually, classifiers ignore missing data, assuming them to be MAR (missing at random). MAR: the probability of an observation being missing depends neither on its value nor on the value of other missing data. The MAR assumption cannot be tested on the incomplete data.

A non-MAR example: a political poll. Right-wing supporters (R) sometimes refuse to answer; left-wing supporters (L) always answer.
Vote  Answer
L     L
L     L
L     L
R     R
R     -
R     -
By ignoring the missing data, P(R) = 1/4: underestimated!

Conservative treatment of missing data. Consider each possible completion of the data.
Answer  D1   D2   D3   D4
L       L    L    L    L
L       L    L    L    L
L       L    L    L    L
R       R    R    R    R
-       L    L    R    R
-       L    R    L    R
P(R)    1/6  1/3  1/3  1/2
P(R) ∈ [1/6, 1/2]; this interval includes the real value. Application to BNs: Ramoni and Sebastiani (2000).
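
A small sketch reproducing the enumeration above: every completion of the missing answers is generated and P(R) is bounded by the extremes (illustrative only; exponential in the number of missing entries):

```python
from itertools import product

answers = ['L', 'L', 'L', 'R', None, None]   # None = refused to answer
values = ['L', 'R']

estimates = []
missing_idx = [i for i, a in enumerate(answers) if a is None]
for completion in product(values, repeat=len(missing_idx)):
    filled = list(answers)
    for i, v in zip(missing_idx, completion):
        filled[i] = v
    estimates.append(filled.count('R') / len(filled))

print(min(estimates), max(estimates))   # 1/6 and 1/2, as in the table above
```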

Conservative Inference Rule (Zaffalon, 2005). MAR missing data are ignored. Non-MAR missing data are filled in all possible ways, both in the training and in the test data. The number of replacements grows exponentially with the missing data; yet polynomial-time algorithms are available for the naive credal classifier. Reference: Corani and Zaffalon, 2008, JMLR.

The conservative treatment of missing data increases indeterminacy. Multiple classes are returned if the most probable class depends on the prior specification or on the completion of the non-MAR missing data. Declare each feature as MAR or non-MAR depending on domain knowledge, for a good trade-off between robustness and informativeness.

Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging

Discounted accuracy. d-acc = (1/N) ∑_{i=1}^N accurate_i / Z_i, where accurate_i indicates whether the set of classes returned on instance i contains the actual class, and Z_i is the number of classes returned on the i-th instance. For a traditional (single-class) classification, d-acc equals the 0-1 accuracy.

Some properties. d-acc = (1/N) ∑_{i=1}^N accurate_i / Z_i. Between two accurate classifications, d-acc is higher for the more informative one. Between an accurate and an inaccurate classification, d-acc is higher for the accurate one, even if it has returned multiple classes (linear decrease). Discounted accuracy satisfies some fundamental properties (Zaffalon et al., IJAR 2012).
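
A minimal sketch of discounted accuracy for set-valued predictions (function and variable names are ours):

```python
def discounted_accuracy(predictions, truths):
    """predictions: list of sets of classes; truths: list of true classes.
    Each accurate prediction is discounted by the size of the returned set."""
    return sum((truth in pred) / len(pred)
               for pred, truth in zip(predictions, truths)) / len(truths)

# Doctor vacuous on the four visits of the next slide: always {A, B}.
print(discounted_accuracy([{'A', 'B'}] * 4, ['A', 'A', 'B', 'B']))   # 0.5
```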

Doctor random and doctor vacuous. Possible diseases: {A, B}. Doctor random: uniformly random diagnosis. Doctor vacuous: returns both categories.
Disease  Random class.  Random d-acc  Vacuous class.  Vacuous d-acc
A        A              1             {A,B}           0.5
A        B              0             {A,B}           0.5
B        A              0             {A,B}           0.5
B        B              1             {A,B}           0.5
Average d-acc:          0.5                           0.5

Doctor random vs doctor vacuous: the manager's viewpoint. (Same table as on the previous slide: both doctors score d-acc = 0.5.) Assumption: the hospital earns an amount of money proportional to the discounted accuracy. After n visits, the profits are: doctor vacuous, n/2 with no variance; doctor random, n/2 in expectation, with variance n/4.

Introducing utility. Expected utility increases with the expected value of the rewards but decreases with their variance. Any risk-averse manager prefers doctor vacuous over doctor random. And you prefer a vacuous over a random diagnosis! Idea: quantify such a preference through a utility function (Zaffalon et al., IJAR 2012).

How to design the utility function.
Actual  Predicted  Utility
A       A          1
A       B          0
Utility coincides with accuracy for a traditional classification consisting of a single class.

How to design the utility function.
Actual  Predicted  Utility
A       {B, C}     0
A       {A, B}     0.65-0.80
A wrong indeterminate classification has utility 0. The utility of an accurate but indeterminate classification consisting of two classes has to be larger than 0.5, otherwise doctor random and doctor vacuous yield the same utility.

How to design the utility function. Let this utility be either 0.65 (function u_65) or 0.8 (function u_80), then perform sensitivity analysis. The utility of credal classifiers and the accuracy of determinate classifiers can now be compared. Such utility functions are numerically close to the F1 and F2 metrics (Del Coz et al., JMLR, 2009).
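
A hedged sketch of how u_65 / u_80 reward an accurate two-class prediction more than plain discounting; only the single- and two-class cases from the slides are spelled out, and larger sets fall back to 1/|prediction| for simplicity (the published utilities are defined for any set size):

```python
def utility(pred, truth, two_class_reward=0.65):
    """Utility of a set-valued prediction: 1 for a correct singleton,
    two_class_reward (0.65 for u_65, 0.8 for u_80) for a correct pair, 0 if wrong."""
    if truth not in pred:
        return 0.0
    if len(pred) == 1:
        return 1.0
    if len(pred) == 2:
        return two_class_reward
    return 1.0 / len(pred)   # simplification: plain discounting for larger sets

print(utility({'A'}, 'A'))        # 1.0
print(utility({'A', 'B'}, 'A'))   # 0.65
print(utility({'A', 'B'}, 'C'))   # 0.0
```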

Comparing naive Bayes and naive credal. Wins on 55 UCI data sets:
Utility  naive Bayes  naive credal
u_65     18           37
u_80     13           42
How to compare credal and traditional classifiers was an open problem at AAAI-10!

Agenda 1 Estimating a multinomial 2 Naive Bayes and Naive Credal Classifier 3 Comparing credal and traditional classifiers 4 Credal Model Averaging

Model uncertainty: an example with logistic regression. Given k features, we can design 2^k logistic regressors, each with a different feature set. Model uncertainty: different logistic regressors (with different feature sets) can fit the data similarly well. Considering only a single model leads to overconfident conclusions. Bayesian Model Averaging (BMA) combines the inferences of multiple models, using as weights the posterior probabilities of the models.

Bayesian Model Averaging (BMA). The posterior of class c under BMA is: P(c | D) = ∑_{m_i ∈ M} P(c | m_i, D) P(m_i | D) ∝ ∑_{m_i ∈ M} P(c | m_i, D) P(D | m_i) P(m_i), where D is the available training data, m_i is the i-th model within the model space M, and P(m_i) is the prior over the models. Often the BMA analysis is repeated under different priors over the models.

Independent Bernoulli (IB) prior over the models. Prior probability of model m_i, containing k_i features: P(m_i) = θ^{k_i} (1 - θ)^{k - k_i}. Assumptions: every covariate has the same prior probability of inclusion θ, which is the only parameter; the probability of inclusion of each covariate is independent of the others. Uniform prior over the models: θ = 0.5.

Credal model averaging (CMA). Represents the prior over the models by a credal set. The prior probability of inclusion of each covariate varies in [θ̲, θ̄]. Ignorance a priori can be represented by setting θ̲ = ε, θ̄ = 1 - ε. CMA automates the sensitivity analysis on the value of θ.

Upper and lower probability of a class. Lower posterior probability of class c_1:
P̲(c_1 | D, x) = min_{θ ∈ [θ̲, θ̄]} ∑_{m_i ∈ M} P(c_1 | D, x, m_i) P(m_i | D)
= min_{θ ∈ [θ̲, θ̄]} ∑_{m_i ∈ M} P(c_1 | D, x, m_i) · P(D | m_i) θ^{k_i} (1 - θ)^{k - k_i} / ∑_{m_j ∈ M} P(D | m_j) θ^{k_j} (1 - θ)^{k - k_j}
The upper probability is found by maximizing the same expression. Such optimizations are solved analytically.
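
A sketch of the CMA bounds computed by sweeping θ over [θ̲, θ̄]; the slide notes that the optimum is available analytically, so the grid search and all numbers below (model evidences, per-model class probabilities, feature counts) are only an illustration:

```python
import numpy as np

def cma_bounds(p_c1_given_m, evidence, k_i, k, theta_low=0.05, theta_up=0.95, grid=1001):
    """Bracket the lower/upper BMA posterior of class c1 by sweeping theta."""
    p_c1_given_m = np.asarray(p_c1_given_m)   # P(c1 | D, x, m_i)
    evidence = np.asarray(evidence)           # P(D | m_i)
    k_i = np.asarray(k_i)                     # number of features in model m_i
    values = []
    for theta in np.linspace(theta_low, theta_up, grid):
        prior = theta ** k_i * (1 - theta) ** (k - k_i)   # IB prior P(m_i)
        weights = evidence * prior
        values.append(np.sum(p_c1_given_m * weights) / np.sum(weights))
    return min(values), max(values)

# Three hypothetical logistic regressors over k = 2 covariates.
print(cma_bounds(p_c1_given_m=[0.9, 0.6, 0.3],
                 evidence=[0.02, 0.05, 0.01],
                 k_i=[0, 1, 2], k=2))
```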

Credal model averaging (CMA) The prior over the models is modelled by a credal set. The single models (logistic regressors with different feature sets) are learned in a traditional Bayesian way. CMA is thus a credal ensemble of Bayesian regressors.

Indeterminate classifications. Safe instances: the most probable class remains the same regardless of the prior over the models; CMA and BMA return the same class for every θ ∈ [θ̲, θ̄]. Prior-dependent instances: the most probable class varies with θ; CMA suspends the judgment, returning a set of classes.

Predicting presence or absence of marmot burrows. Study area: 9500 cells (100 m² each) within the Stelvio National Park (Italy). Presence of burrows in about 4.5% of the cells. Features available for each cell: altitude, slope, aspect (the direction in which the slope faces), soil temperature, soil cover, etc.

Experiments: comparing BMA and CMA. BMA priors over the models: uniform (θ = 0.5) and hierarchical (θ as a random variable, with a uniform prior). Training sets of size n ∈ {30, 60, 90, ..., 300}; 30 experiments for each sample size. For CMA we set θ̄ = 0.95 and θ̲ = 0.05 (ignorance a priori).

BMA accuracy. Regardless of the adopted prior over the models, BMA is highly accurate on the safe instances but almost a random guesser on the prior-dependent ones! (Figure: BMA accuracy (0-1) on safe and prior-dependent instances as a function of the training set size, 30-1500. Left: uniform prior. Right: hierarchical prior.)

Credal model averaging (CMA). Allows sensitivity analysis regarding the coefficients of the covariates (upper / lower values) to be computed easily by solving an optimization problem, whose solution is analytical. CMA automates sensitivity analysis.

Other ensembles using a credal set for the prior probability of the models. Credal ensemble of naive Bayes (Corani and Zaffalon, ECML-PKDD 2008). Credal ensemble with compression coefficients (Corani and Antonucci, Comp. Statistics & Data Analysis, in press).

Other imprecise-probability classifiers. Some useful pointers: credal k-NN (Destercke, Soft Computing 2012); credal regression trees (Abellan and Moral, Int. J. Appr. Reasoning, 2003; Abellan et al., Comp. Statistics & Data Analysis, 2013); classification of temporal sequences (Antonucci et al., Proc. ISIPTA '13, Symp. on Imprecise Probability).

Conclusions. Credal classifiers suspend the judgment on prior-dependent instances. Prior-dependent instances are not effectively identified by the rejection option. Credal classifiers deal robustly with the specification of the prior and with non-MAR missing data. Typically, the accuracy of traditional classifiers drops on the instances that are indeterminately classified by credal classifiers.

Open problems. Multi-label classification. Dealing, via credal sets, with both the prior on the model parameters and the prior over the models.