Belief functions: A gentle introduction


1 Belief functions: A gentle introduction Seoul National University Professor Fabio Cuzzolin School of Engineering, Computing and Mathematics Oxford Brookes University, Oxford, UK Seoul, Korea, 30/05/18 Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 1 / 125

2 Uncertainty Outline 1 Uncertainty Second-order uncertainty Classical probability 2 Beyond probability Set-valued observations Propositional evidence Scarce data Representing ignorance Rare events Uncertain data 3 Belief theory A theory of evidence Belief functions Semantics Dempster s rule Multivariate analysis Misunderstandings 4 Reasoning with belief functions Statistical inference Combination Conditioning Belief vs Bayesian reasoning Generalised Bayes Theorem The total belief theorem Decision making 5 Theories of uncertainty Imprecise probability Monotone capacities Probability intervals Fuzzy and possibility theory Probability boxes Rough sets 6 Belief functions on reals Continuous belief functions Random sets 7 Conclusions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 2 / 125

3 Uncertainty Second-order uncertainty Orders of uncertainty the difference between predictable and unpredictable variation is one of the fundamental issues in the philosophy of probability second order uncertainty: being uncertain about our very model of uncertainty has a consequence on human behaviour: people are averse to unpredictable variations (as in Ellsberg s paradox) how good are Kolmogorov s measure-theoretic probability, or Bayesian and frequentist approaches at modelling second-order uncertainty? Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 3 / 125

4 Uncertainty Classical probability Probability measures mainstream mathematical theory of (first order) uncertainty: mathematical (measure-theoretical) probability mainly due to Russian mathematician Andrey Kolmogorov probability is an application of measure theory, the theory of assigning numbers to sets additive probability measure mathematical representation of the notion of chance assigns a probability value to every subset of a collection of possible outcomes (of a random experiment, of a decision problem, etc) collection of outcomes Ω sample space, universe subset A of the universe event Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 4 / 125

5 Uncertainty Classical probability Probability measures probability measure µ: a real-valued function on a probability space that satisfies countable additivity probability space: a triplet (Ω, F, P) formed by a universe Ω, a σ-algebra F of its subsets, and a probability measure on F not all subsets of Ω necessarily belong to F axioms of probability measures: µ(∅) = 0, µ(Ω) = 1 0 ≤ µ(A) ≤ 1 for all events A ∈ F additivity: for every countable collection of pairwise disjoint events A_i: µ(∪_i A_i) = Σ_i µ(A_i) probabilities have different interpretations: we consider frequentist and Bayesian (subjective) probability Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 5 / 125

6 Uncertainty Classical probability Frequentist inference in the frequentist interpretation, the (aleatory) probability of an event is its relative frequency over repeated trials the frequentist interpretation offers guidance in the design of practical random experiments developed by Fisher, Pearson, Neyman three main tools: statistical hypothesis testing model selection confidence interval analysis Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 6 / 125

7 Uncertainty Classical probability Statistical hypothesis testing 1 state the research hypothesis 2 state the relevant null and alternative hypotheses 3 state the statistical assumptions being made about the sample, e.g. assumptions about the statistical independence or about the form of the distributions of the observations 4 state the relevant test statistic T (a quantity derived from the sample) 5 derive the distribution of the test statistic under the null hypothesis from the assumptions 6 set a significance level (α), i.e. a probability threshold below which the null hypothesis will be rejected 7 compute from the observations the observed value t obs of the test statistic T 8 calculate the p-value, the probability (under the null hypothesis) of sampling a test statistic at least as extreme as the observed value 9 Reject the null hypothesis, in favor of the alternative hypothesis, if and only if the p-value is less than the significance level threshold Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 7 / 125
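To make steps 4-9 concrete, here is a minimal sketch (plain Python, standard library only, with illustrative function names) of a right-tailed binomial test; the 7-heads-in-10-tosses numbers anticipate the coin-toss example used later in the talk.

```python
from math import comb

def binomial_right_tail_p_value(k_obs, n, p0):
    """Probability, under the null hypothesis p = p0, of observing at least
    k_obs successes in n independent trials (a right-tailed p-value)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(k_obs, n + 1))

# Example: 7 heads in 10 tosses, null hypothesis p0 = 0.5, significance level 0.05
alpha = 0.05
p_value = binomial_right_tail_p_value(7, 10, 0.5)
print(round(p_value, 4))        # 0.1719
print(p_value < alpha)          # False: the null hypothesis is not rejected
```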

8 Uncertainty Classical probability P-values [Figure: probability density over the set of possible results, showing the observed data point and the very unlikely observations in the tail, whose total probability is the p-value] the p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false: frequentist statistics does not and cannot attach probabilities to hypotheses Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 8 / 125

9 Uncertainty Classical probability Maximum Likelihood Estimation (MLE) the term likelihood was popularized in mathematical statistics by Ronald Fisher in 1922: On the mathematical foundations of theoretical statistics Fisher argued against inverse (Bayesian) probability as a basis for statistical inferences, and instead proposed inferences based on likelihood functions likelihood principle: all of the evidence in a sample relevant to model parameters is contained in the likelihood function this is still hotly debated [Mayo, Gandenberger] maximum likelihood estimation: θ̂_mle ∈ arg max_{θ∈Θ} L(θ; x_1,..., x_n), where L(θ; x_1,..., x_n) = f(x_1, x_2,..., x_n | θ) and {f(· | θ), θ ∈ Θ} is a parametric model consistency: the sequence of MLEs converges in probability, for a sufficiently large number of observations, to the (actual) value being estimated Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 9 / 125
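A minimal sketch of maximum likelihood estimation for i.i.d. Bernoulli data, using a simple grid search for illustration; the function name and the sample are assumptions of the sketch, and the closed-form answer is just the relative frequency k/n.

```python
import math

def bernoulli_log_likelihood(theta, xs):
    """Log-likelihood of i.i.d. Bernoulli (0/1) observations xs under parameter theta."""
    k, n = sum(xs), len(xs)
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

xs = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]     # 7 successes out of 10 trials

# Maximise the (log-)likelihood over a fine grid of candidate parameters;
# the closed-form MLE is the relative frequency k/n = 0.7
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: bernoulli_log_likelihood(t, xs))
print(theta_mle)                         # 0.7
```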

10 Uncertainty Classical probability Subjective probability (epistemic) probability = degrees of belief of an individual assessing the state of the world Ramsey and de Finetti: subjective beliefs must follow the laws of probability if they are to be coherent (if this proof were watertight we would not be here in front of you!) also, evidence casts doubt on whether humans hold coherent beliefs or behave rationally Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 10 / 125

11 Uncertainty Classical probability Bayesian inference prior distribution: the distribution of the parameter(s) before any data is observed, i.e. p(θ | α), which depends on a vector of hyperparameters α likelihood: the distribution of the observed data conditional on its parameters, i.e. p(X | θ) marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalised over the parameter(s): p(X | α) = ∫_θ p(X | θ) p(θ | α) dθ posterior distribution: the distribution of the parameter(s) after taking into account the observed data, as determined by Bayes' rule: p(θ | X, α) = p(X | θ) p(θ | α) / p(X | α) ∝ p(X | θ) p(θ | α) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 11 / 125
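A minimal sketch of these quantities in the conjugate beta-binomial case, assuming a Beta prior (a choice made here only for illustration, since it gives closed-form prior, likelihood and posterior); the helper names are not from any particular library.

```python
import math

def beta_posterior(k, n, a0=1.0, b0=1.0):
    """Hyperparameters of the Beta posterior after k successes in n Bernoulli trials,
    starting from a Beta(a0, b0) prior (the conjugate prior of the binomial likelihood)."""
    return a0 + k, b0 + (n - k)

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta (log-gamma used for the normalising constant)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

# Uniform prior Beta(1, 1) and 7 successes in 10 trials give a Beta(8, 4) posterior
a, b = beta_posterior(7, 10)
print(a, b)                              # 8.0 4.0
print(round(beta_pdf(0.7, a, b), 3))     # posterior density at theta = 0.7
```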

12 Beyond probability Outline 1 Uncertainty Second-order uncertainty Classical probability 2 Beyond probability Set-valued observations Propositional evidence Scarce data Representing ignorance Rare events Uncertain data 3 Belief theory A theory of evidence Belief functions Semantics Dempster s rule Multivariate analysis Misunderstandings 4 Reasoning with belief functions Statistical inference Combination Conditioning Belief vs Bayesian reasoning Generalised Bayes Theorem The total belief theorem Decision making 5 Theories of uncertainty Imprecise probability Monotone capacities Probability intervals Fuzzy and possibility theory Probability boxes Rough sets 6 Belief functions on reals Continuous belief functions Random sets 7 Conclusions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 12 / 125

13 Beyond probability Something is wrong? measure-theoretical mathematical probability is not general enough: cannot (properly) model missing data cannot (properly) model propositional data cannot really model unusual data (second order uncertainty) the frequentist approach to probability: cannot really model pure data (without a "design") in a way, cannot even properly model continuous data models scarce data only asymptotically Bayesian reasoning has several limitations: cannot model no data (ignorance) cannot model uncertain data cannot model pure data (without a prior) again, cannot properly model scarce data (only asymptotically) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 13 / 125

14 Beyond probability Fisher has not got it all right the setting of hypothesis testing is (arguably) arguable the scope is quite narrow: rejecting or not rejecting a hypothesis (although it can provide confidence intervals) the criterion is arbitrary: who decides what an extreme realisation is (choice of α)? what is the deal with 0.05 and 0.01? the whole tail idea comes from the fact that, under measure theory, the conditional probability (p-value) of a point outcome x is zero seems trying to patch an underlying problem with the way probability is mathematically defined cannot cope with pure data, without assumptions on the process (experiment) which generated them (we will come back to this later) deals with scarce data only asymptotically Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 14 / 125

15 Beyond probability The problem(s) with Bayes pretty bad at representing ignorance Jeffrey s uninformative priors are just not adequate different results on different parameter spaces Bayes rule assumes the new evidence comes in the form of certainty: A is true in the real world, often this is not the case ( uncertain or vague evidence) beware the prior! model selection in Bayesian statistics results from a confusion between the original subjective interpretation, and the objectivist view of a rigorous objective procedure why should we pick a prior? either there is prior knowledge (beliefs) or there is not all will be fine, in the end! asymptotically, the choice of the prior does not matter (really!) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 15 / 125

16 Beyond probability Set-valued observations The die as random variable [Figure: a die whose faces face1,..., face6 are mapped by a random variable X to the numbers 1,..., 6] a die is a simple example of (discrete) random variable there is a probability space Ω = {face1, face2,..., face6} which maps to a real number: 1, 2,..., 6 (no need for measurability here) now, imagine that face1 and face2 are cloaked, and we roll the die Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 16 / 125

17 Beyond probability Set-valued observations The cloaked die: set-valued observations [Figure: the same die, with the cloaked faces 1 and 2 both mapped to the set {1, 2}] the same probability space Ω = {face1, face2,..., face6} is still there (nothing has changed in the way the die works) however, now the mapping is different: both face1 and face2 are mapped to the set of possible values {1, 2} (since we cannot observe the outcome) this is a random set [Matheron, Kendall, Nguyen, Molchanov]: a set-valued random variable whenever data are missing, observations are inherently set-valued Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 17 / 125

18 Beyond probability Propositional evidence Reliable witnesses Evidence supporting propositions suppose there is a murder, and three people are under trial for it: Peter, John and Mary our hypothesis space is therefore Θ = {Peter, John, Mary} there is a witness: he testifies that the person he saw was a man this amounts to supporting the proposition A = {Peter, John} ⊂ Θ should we take this testimony at face value? in fact, the witness was tested and the machine reported a 20% chance he was drunk when he reported the crime we should therefore partly support the (vacuous) hypothesis that any one among Peter, John and Mary could be the murderer: it is natural to assign 80% chance to proposition A, and 20% chance to proposition Θ Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 18 / 125

19 Beyond probability Propositional evidence Dealing with propositional evidence even when evidence (data) supports propositions, Kolmogorov's probability forces us to specify support for individual outcomes this is unreasonable - an artificial constraint due to a mathematical model that is not general enough we have no elements to assign this 80% probability to either Peter or John, nor to distribute it among them the cause is the additivity of probability measures: but this is not the most general type of measure for sets under a minimal requirement of monotonicity, a set measure can potentially be suitable to describe probabilities of events: these objects are called capacities in particular, random sets are capacities in which the numbers assigned to subsets are given by a probability distribution Belief functions and propositional evidence As capacities (and random sets in particular), belief functions allow us to assign mass directly to propositions. Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 19 / 125

20 Beyond probability Scarce data Machines that learn Generalising from scarce data machine learning: designing algorithms that can learn from data BUT, we train them on a ridiculously small amount of data: how can we make sure they are robust to new situations never encountered before (model adaptation)? statistical learning theory [Vapnik] is based on traditional probability theory Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 20 / 125

21 Beyond probability Scarce data Dealing with scarce data a somewhat naive objection: probability distributions assume an infinite amount of evidence, so in reality finite evidence can only provide a constraint on the true probability values unfortunately, those who believe probabilities to be limits of relative frequencies (the frequentists) never really estimate a probability from the data they only assume ("design") probability distributions for their p-values Fisher: fine, I can never compute probabilities, but I can use the data to test my hypotheses on them in opposition, those who do estimate probability distributions from the data (the Bayesians) do not think of probabilities as infinite accumulations of evidence (but as degrees of belief) Bayes: I only need to be able to model a likelihood function of the data well, actually, frequentists do estimate probabilities from scarce data when they do stochastic regression (e.g., logistic regression) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 21 / 125

22 Beyond probability Scarce data Asymptotic happiness what is true, is that both frequentists and Bayesians seem to be happy with solving their problems asymptotically limit properties of ML estimates Bernstein-von Mises theorem what about the here and now? e.g. smart cars? Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 22 / 125

23 Beyond probability Representing ignorance Modelling pure data Bayesian inference Bayesian reasoning requires modelling the data and a prior (actually, you need to pick the proper hypothesis space too!) prior is just a name for beliefs built over a long period of time, from the evidence you have observed so long a time has passed that all track record of observations is lost, and all that is left is a probability distribution why should we pick a prior? either there is prior knowledge or there is not nevertheless we are compelled to pick one, because the mathematical formalism requires it this is the result of a confusion between the original subjective interpretation (where prior beliefs always exist), and the objectivist view of a rigorous objective procedure (where in most cases we do not have any prior knowledge) Bayesians then go into damage limitation mode, and try to pick the least damaging prior (see ignorance later) all will be fine, in the end! (Bernstein-von Mises theorem) Asymptotically, the choice of the prior does not matter (really!) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 23 / 125

24 Beyond probability Representing ignorance Dangerous priors Bayesian inference the prior distribution is typically hard to determine solution: pick an uninformative probability Jeffreys' prior: proportional to the square root of the determinant of the Fisher information matrix it can be improper (unnormalised), and it violates the strong version of the likelihood principle: inferences depend not just on the data likelihood but also on the universe of all possible experimental outcomes uniform priors can lead to different results on different spaces, given the same likelihood functions the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior (Bernstein-von Mises theorem) A. W. F. Edwards: It is sometimes said, in defence of the Bayesian concept, that the choice of prior distribution is unimportant in practice, because it hardly influences the posterior distribution at all when there are moderate amounts of data. The less said about this defence the better. Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 24 / 125

25 Beyond probability Representing ignorance Modelling pure data Frequentist inference the frequentist approach is inherently unable to describe pure data, without making additional assumptions on the data-generating process in Nature one cannot design an experiment: data come your way, whether you want it or not you cannot set the stopping rules again, recalls the old image of a scientist analysing (from Greek ana + lysis, breaking up) a specific aspect of the world in their lab the same data can lead to opposite conclusions different experiments can lead to the same data, whereas the parametric model employed (family of probability distributions) is linked to a specific experiment apparently, however, frequentists are just fine with this Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 25 / 125

26 Beyond probability Representing ignorance Dealing with ignorance Shafer vs Bayes uninformative priors can be dangerous (Andrew Gelman): they violate the strong likelihood principle, may be unnormalised wrong priors can kill a Bayesian model priors in general cannot handle multiple hypothesis spaces in a coherent way (families of frames, in Shafer s terminology) Belief functions and priors Reasoning with belief functions does not require any prior. Belief functions and ignorance Belief functions naturally represent ignorance via the vacuous belief function, assigning mass 1 to the whole hypothesis space. Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 26 / 125

27 Beyond probability Rare events Extinct dinosaurs The statistics of rare events dinosaurs probably were worrying about overpopulation risks.... until it hit them! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 27 / 125

28 Beyond probability Rare events What s a rare event? what is a rare event? clearly we are interested in them because they are not so rare, after all! examples of rare events, also called tail risks or black swans, are: volcanic eruptions, meteor impacts, financial crashes.. mathematically, an event is rare when it covers a region of the hypothesis space which is seldom sampled it is an issue with the quality of the sample Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 28 / 125

29 Beyond probability Rare events Rare events and second-order uncertainty probability distributions for the system's behaviour are built in normal times (e.g. while a nuclear plant is working just fine), then used to extrapolate results at the tail of the distribution [Figure: P(Y=1 | x) versus x, with the training samples concentrated away from the region of the 'rare' event] popular statistical procedures (e.g. logistic regression) can sharply underestimate the probability of rare events Harvard's G. King [2001] has proposed corrections based on oversampling the rare events w.r.t. the normal ones the issue is really one with the reliability of the model! we need to explicitly model second-order uncertainty Belief functions and rare events Belief functions can model second-order uncertainty: rare events are a form of lack of information in certain regions of the sample space. Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 29 / 125

30 Beyond probability Uncertain data Uncertain data concepts themselves can be not well defined, e.g. dark or somewhat round object (qualitative data) fuzzy theory accounts for this via the concept of graded membership unreliable sensors can generate faulty (outlier) measurements: can we still treat these data as certain? or is it more natural to attach to them a degree of reliability, based on the past track record of the sensor (data generating process)? but then, can we still apply Bayes rule? people ("experts", e.g. doctors) tend to express themselves in terms of likelihoods directly (e.g. I think diagnosis A is most likely, otherwise either A or B) if the doctors were frequentists, and were provided with the same data, they would probably apply logistic regression and come up with the same prediction on P(disease | symptoms): unfortunately doctors are not statisticians multiple sensors can provide as output a PDF on the same space e.g., two Kalman filters, one based on color, the other on motion (optical flow), providing a normal predictive PDF on the location of the target in the image plane Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 30 / 125

31 Belief theory Outline 1 Uncertainty Second-order uncertainty Classical probability 2 Beyond probability Set-valued observations Propositional evidence Scarce data Representing ignorance Rare events Uncertain data 3 Belief theory A theory of evidence Belief functions Semantics Dempster s rule Multivariate analysis Misunderstandings 4 Reasoning with belief functions Statistical inference Combination Conditioning Belief vs Bayesian reasoning Generalised Bayes Theorem The total belief theorem Decision making 5 Theories of uncertainty Imprecise probability Monotone capacities Probability intervals Fuzzy and possibility theory Probability boxes Rough sets 6 Belief functions on reals Continuous belief functions Random sets 7 Conclusions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 31 / 125

32 Belief theory A theory of evidence A mathematical theory of evidence Shafer called his proposal A mathematical theory of evidence the mathematical objects it deals with are called belief functions where do these names come from? what interpretation of probability do they entail? [Diagram: evidence induces belief, a probabilistic representation of knowledge about the truth] it is a theory of epistemic probability: it is about probabilities as a mathematical representation of knowledge (a human's knowledge, or a machine's) it is a theory of evidential probability: such probabilities representing knowledge are induced ("elicited") by the available evidence Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 32 / 125

33 Belief theory A theory of evidence Evidence supporting hypotheses in probabilistic logic, statements such as "hypothesis H is probably true" mean that the empirical evidence E supports H to a high degree called the epistemic probability of H given E Rationale There exists evidence in the form of probabilities, which supports degrees of belief on a certain matter. the space where the evidence lives is different from the hypothesis space they are linked by a map one to many: but this is a random set! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 33 / 125

34 Belief theory Belief functions Dempster's multivalued mappings Dempster's work formalises random sets via multivalued (one-to-many) mappings Γ from a probability space (Ω, F, P) to the domain of interest Θ [Diagram: Ω = {drunk (0.2), not drunk (0.8)} mapped by Γ to subsets of Θ = {Peter, John, Mary}] the example is taken from a famous trial example [Shafer] elements of Ω are mapped to subsets of Θ: once again this is a random set in the example Γ maps {not drunk} ⊂ Ω to {Peter, John} ⊂ Θ the probability distribution P on Ω induces a mass assignment m : 2^Θ → [0, 1] on the power set 2^Θ = {A ⊆ Θ} via the multivalued mapping Γ : Ω → 2^Θ Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 34 / 125

35 Belief theory Belief functions Belief and plausibility measures the belief in A is the probability that the evidence implies A: Bel(A) = P({ω ∈ Ω : Γ(ω) ⊆ A}) the plausibility of A is the probability that the evidence does not contradict A: Pl(A) = P({ω ∈ Ω : Γ(ω) ∩ A ≠ ∅}) = 1 − Bel(Ā) originally termed by Dempster lower and upper probabilities belief and plausibility values can (but this is disputed) be interpreted as lower and upper bounds to the values of an unknown, underlying probability measure: Bel(A) ≤ P(A) ≤ Pl(A) for all A ⊆ Θ Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 35 / 125

36 Belief theory Belief functions Basic probability assignments Mass functions belief functions (BF) are functions from 2^Θ, the set of all subsets of Θ, to [0, 1], assigning values to subsets of Θ it can be proven that each belief function has the form Bel(A) = Σ_{B⊆A} m(B), where m is a mass function or basic probability assignment on Θ, defined as a function 2^Θ → [0, 1] such that: m(∅) = 0, Σ_{A⊆Θ} m(A) = 1 any subset A of Θ such that m(A) > 0 is called a focal element (FE) of m working with belief functions reduces to manipulating focal elements Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 36 / 125
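A minimal sketch of how a mass function and its induced belief and plausibility values might be represented in code, reusing the 80%/20% testimony example; the dict-of-frozensets representation and the helper names are assumptions of this sketch, not a standard API.

```python
def belief(A, mass):
    """Bel(A): total mass of the focal elements B contained in A."""
    return sum(m for B, m in mass.items() if B <= A)

def plausibility(A, mass):
    """Pl(A): total mass of the focal elements B intersecting A."""
    return sum(m for B, m in mass.items() if B & A)

# Mass function on Theta = {Peter, John, Mary} encoding the earlier testimony:
# 0.8 on {Peter, John} (the witness saw a man), 0.2 on Theta (partial ignorance)
mass = {frozenset({'Peter', 'John'}): 0.8,
        frozenset({'Peter', 'John', 'Mary'}): 0.2}

print(belief(frozenset({'Peter'}), mass), plausibility(frozenset({'Peter'}), mass))  # 0 1.0
print(belief(frozenset({'Peter', 'John'}), mass))                                    # 0.8
print(belief(frozenset({'Mary'}), mass), plausibility(frozenset({'Mary'}), mass))    # 0 0.2
```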

37 Belief theory Belief functions A generalisation of sets, fuzzy sets, probabilities belief functions generalise traditional ("crisp") sets: a logical (or "categorical") mass function has one focal set A, with m(A) = 1 belief functions generalise standard probabilities: a Bayesian mass function has as only focal sets elements (rather than subsets) of Θ complete ignorance is represented by the vacuous mass function: m(Θ) = 1 belief functions generalise fuzzy sets (see possibility theory later), which are assimilated to consonant BFs whose focal elements are nested: A_1 ⊆ ... ⊆ A_m [Figure: the focal elements of a consonant, a Bayesian and a vacuous belief function] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 37 / 125

38 Belief theory Semantics Semantics of belief functions Modelling second-order uncertainty [Figure: a belief function Bel seen as a convex set of probability distributions (a credal set) within the probability simplex] belief functions have multiple interpretations as set-valued random variables (random sets) as (completely monotone) capacities (functions from the power set to [0, 1]) as a special class of credal sets (convex sets of probability distributions) [Levi, Kyburg] as such, they are a very expressive means of modelling uncertainty on the model itself, due to lack of data quantity or quality, or both Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 38 / 125

39 Belief theory Semantics Axiomatic definition belief functions can also be defined in axiomatic terms, just like Kolmogorov's additive probability measures this is the definition proposed by Shafer in 1976 Belief function A function Bel : 2^Θ → [0, 1] from the power set 2^Θ to [0, 1] such that: Bel(∅) = 0, Bel(Θ) = 1; for every n and for every collection A_1,..., A_n ∈ 2^Θ we have that: Bel(A_1 ∪ ... ∪ A_n) ≥ Σ_i Bel(A_i) − Σ_{i<j} Bel(A_i ∩ A_j) + ... + (−1)^{n+1} Bel(A_1 ∩ ... ∩ A_n) this makes clearer that belief measures generalise standard probability measures: replace additivity with superadditivity (third axiom) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 39 / 125

40 Belief theory Dempster's rule Jeffrey's rule of conditioning belief measures include probability measures as a special case: what replaces Bayes' rule? Jeffrey's rule of conditioning: a step forward from certainty and Bayes rule an initial probability P stands corrected by a second probability P', defined only on a number of events suppose P is defined on a σ-algebra A there is a new probability measure P' on a sub-algebra B of A, and the updated probability P'' has to: 1 meet the probability values specified by P' for events in B 2 be such that, for all B ∈ B and X, Y ⊆ B with X, Y ∈ A: P''(X)/P''(Y) = P(X)/P(Y) if P(Y) > 0, and = 0 if P(Y) = 0 there is a unique solution: P''(A) = Σ_{B∈B} P(A | B) P'(B) it generalises Bayes conditioning! (obtained when P'(B) = 1 for some B) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 40 / 125
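A minimal sketch of Jeffrey's rule on a finite universe, P''(A) = Σ_{B∈B} P(A | B) P'(B); the data structures and the numbers are illustrative.

```python
def jeffrey_update(prior, partition, corrected):
    """Jeffrey's rule: P''(x) accumulates P(x | B) * P'(B) over the cells B of a partition.

    prior: dict outcome -> P(outcome)
    partition: dict cell label -> set of outcomes forming that cell
    corrected: dict cell label -> corrected probability P'(cell)
    """
    updated = {x: 0.0 for x in prior}
    for label, cell in partition.items():
        p_cell = sum(prior[x] for x in cell)
        for x in cell:
            cond = prior[x] / p_cell if p_cell > 0 else 0.0     # P(x | cell)
            updated[x] += cond * corrected[label]
    return updated

prior = {'a': 0.2, 'b': 0.3, 'c': 0.5}
partition = {'B1': {'a', 'b'}, 'B2': {'c'}}
corrected = {'B1': 0.8, 'B2': 0.2}        # P'(B1) = 0.8 replaces P(B1) = 0.5
print(jeffrey_update(prior, partition, corrected))
# approximately {'a': 0.32, 'b': 0.48, 'c': 0.2}: proportions inside each cell are preserved
# Bayes conditioning on B1 is recovered by setting corrected = {'B1': 1.0, 'B2': 0.0}
```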

41 Belief theory Dempster's rule Conditioning versus combination what if I have a new probability on the same σ-algebra A? Jeffrey's rule cannot be applied! as we saw, this happens when multiple sensors provide predictive PDFs belief functions deal with uncertain evidence by moving away from the concept of conditioning (via Bayes rule).... to that of combining pieces of evidence supporting multiple (intersecting) propositions to various degrees Belief functions and evidence Belief reasoning works by combining existing belief functions with new ones, which are able to encode uncertain evidence. Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 41 / 125

42 Belief theory Dempster's rule Dempster's combination [Diagram: two multivalued mappings into Θ = {Peter, John, Mary}, one from {drunk (0.2), not drunk (0.8)}, the other from {cleaned (0.6), not cleaned (0.4)}] new piece of evidence: a blond hair has been found; also, there is a probability 0.6 that the room has been cleaned before the crime the assumption is that pairs of outcomes in the source spaces ω_1 ∈ Ω_1 and ω_2 ∈ Ω_2 support the intersection of their images in 2^Θ: θ ∈ Γ_1(ω_1) ∩ Γ_2(ω_2) if this is done independently, then the probability that the pair (ω_1, ω_2) is selected is P_1({ω_1}) P_2({ω_2}), yielding Dempster's rule of combination: (m_1 ⊕ m_2)(A) = (1 / (1 − κ)) Σ_{B∩C=A} m_1(B) m_2(C), ∅ ≠ A ⊆ Θ, where κ is the mass assigned to the empty set (the conflict) Bayes rule is a special case of Dempster's rule Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 42 / 125

43 Belief theory Dempster's rule Dempster's combination A simple numerical example [Figure: Dempster combination of two belief functions Bel_1 and Bel_2 on a binary frame {θ_1, θ_2}; the normalised combination has masses m({θ_1}) = 0.48, m({θ_2}) = 0.31, m({θ_1, θ_2}) = 0.21] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 43 / 125
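Since the input mass values of this numerical example are not recoverable from the transcription, here is instead a minimal sketch of Dempster's rule itself, applied to two illustrative mass functions on a binary frame; the representation (dicts keyed by frozensets) is an assumption of the sketch.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule: intersect focal elements, multiply masses, discard the mass
    of the empty intersection (the conflict kappa) and renormalise by 1 - kappa."""
    combined, conflict = {}, 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        inter = B & C
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mB * mC
        else:
            conflict += mB * mC
    return {A: v / (1.0 - conflict) for A, v in combined.items()}, conflict

# Two mass functions on a binary frame {t1, t2} (illustrative values)
t1, t2 = frozenset({'t1'}), frozenset({'t2'})
theta = t1 | t2
m1 = {t1: 0.6, theta: 0.4}
m2 = {t2: 0.5, theta: 0.5}
m12, kappa = dempster_combine(m1, m2)
print(kappa)                      # 0.3: the mass 0.6 * 0.5 sent to the empty set
print({tuple(sorted(A)): round(v, 3) for A, v in m12.items()})
# {('t1',): 0.429, ('t2',): 0.286, ('t1', 't2'): 0.286}
```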

44 Belief theory Dempster's rule A generalisation of Bayesian inference belief theory generalises Bayesian probability (it contains it as a special case), in that: classical probability measures are a special class of belief functions (in the finite case) or random sets (in the infinite case) Bayes' certain evidence is a special case of Shafer's bodies of evidence (general belief functions) Bayes rule of conditioning is a special case of Dempster's rule of combination it also generalises set-theoretical intersection: if m_A and m_B are logical mass functions and A ∩ B ≠ ∅, then m_A ⊕ m_B = m_{A∩B} however, it overcomes its limitations you do not need a prior: if you are ignorant, you will use the vacuous BF m_Θ which, when combined with new BFs m encoding data, will not change the result: m_Θ ⊕ m = m however, if you do have prior knowledge you are welcome to use it! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 44 / 125

45 Belief theory Multivariate analysis Multivariate analysis Refinements and coarsenings the theory allows us to handle evidence impacting on different but related domains assume we are interested in the nature of an object in a road scene. We could describe it, e.g., in the frame Θ = {vehicle, pedestrian}, or in the finer frame Ω = {car, bicycle, motorcycle, pedestrian} other example: different image features in pose estimation a frame Ω is a refinement of a frame Θ (or, equivalently, Θ is a coarsening of Ω) if elements of Ω can be obtained by splitting some or all of the elements of Θ [Figure: a refining ρ mapping each element θ_1, θ_2, θ_3 of Θ to a set of elements of Ω] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 45 / 125

46 Belief theory Multivariate analysis Families of compatible frames Multivariate analysis when Ω is a refinement for a collection Θ_1,..., Θ_N of other frames it is called their common refinement two frames are said to be compatible if they do have a common refinement compatible frames can be associated with different variables/attributes/features: let Θ_X = {red, blue, green} and Θ_Y = {small, medium, large} be the domains of attributes X and Y describing, respectively, the color and the size of an object in such a case the common refinement is simply the Cartesian product Θ_X × Θ_Y or, they can be descriptions of the same variable at different levels of granularity (as in the road scene example) evidence can be moved from one frame to another within a family of compatible frames Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 46 / 125

47 Belief theory Multivariate analysis Families of compatible frames Pictorial illustration [Figure: pictorial illustration of a family of compatible frames Θ_1,..., Θ_n and their common refinement] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 47 / 125

48 Belief theory Multivariate analysis Marginalisation let Θ_X and Θ_Y be two compatible frames let m_XY be a mass function on Θ_X × Θ_Y it can be expressed in the coarser frame Θ_X by transferring each mass m_XY(A) to the projection of A onto Θ_X we obtain a marginal mass function on Θ_X: m_XY↓X(B) = Σ_{A ⊆ Θ_X×Θ_Y : A↓Θ_X = B} m_XY(A), for all B ⊆ Θ_X (again, it generalizes both set projection and probabilistic marginalization) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 48 / 125

49 Belief theory Multivariate analysis Vacuous extension the inverse of marginalization a mass function m_X on Θ_X can be expressed in Θ_X × Θ_Y by transferring each mass m_X(B) to the cylindrical extension of B this operation is called the vacuous extension of m_X in Θ_X × Θ_Y: m_X↑XY(A) = m_X(B) if A = B × Θ_Y, 0 otherwise a strong feature of belief theory: the vacuous belief function (our representation of ignorance) is left unchanged when moving from one space to another! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 49 / 125

50 Belief theory Misunderstandings Belief functions are not (general) credal sets [Figure: in the probability simplex, the credal set Cre induced by a belief function Bel] a belief function on Θ is in 1-1 correspondence with a convex set of probability distributions there (a credal set) however, belief functions are a special class of credal sets, those induced by a random set mapping Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 50 / 125

51 Belief theory Misunderstandings Belief functions are not parameterised families of distributions, or confidence intervals [Figure: in the probability simplex, a parameterised family of distributions Fam compared with the credal set of a belief function Bel] obviously, a parameterised family of distributions on Θ is a subset of the set of all possible distributions (just like belief functions) not all families of distributions correspond to belief functions example: the family of Gaussian PDFs with 0 mean and arbitrary variance, {N(0, σ), σ ∈ R+}, is not a belief function they are not confidence intervals either: confidence intervals are one-dimensional, and their interpretation is entirely different. Confidence intervals are interval estimates Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 51 / 125

52 Belief theory Misunderstandings Belief functions are not second-order distributions [Figure: a Dirichlet distribution compared with a belief function seen as a uniform meta-distribution over a set of PDFs] unlike hypothesis testing, general Bayesian inference leads to probability distributions over the space of parameters these are second order probabilities, i.e. probability distributions on hypotheses which are themselves probabilities belief functions can be defined on the hypothesis space Ω, or on the parameter space Θ when defined on Ω they are sets of PDFs and can then be seen as indicator second order distributions (see figure) when defined on the parameter space Θ, they amount to families of second-order distributions in the two cases they generalise MLE/MAP and general Bayesian inference, respectively Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 52 / 125

53 Reasoning with belief functions Outline 1 Uncertainty Second-order uncertainty Classical probability 2 Beyond probability Set-valued observations Propositional evidence Scarce data Representing ignorance Rare events Uncertain data 3 Belief theory A theory of evidence Belief functions Semantics Dempster s rule Multivariate analysis Misunderstandings 4 Reasoning with belief functions Statistical inference Combination Conditioning Belief vs Bayesian reasoning Generalised Bayes Theorem The total belief theorem Decision making 5 Theories of uncertainty Imprecise probability Monotone capacities Probability intervals Fuzzy and possibility theory Probability boxes Rough sets 6 Belief functions on reals Continuous belief functions Random sets 7 Conclusions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 53 / 125

54 Reasoning with belief functions Reasoning with belief functions 1 inference: building a belief function from data (either statistical or qualitative) 2 reasoning: updating belief representations when new data arrives either by combination with another belief function or by conditioning with respect to new events/observations 3 manipulating conditional belief functions via a generalisation of Bayes theorem via network propagation via a generalisation of the total probability theorem 4 using the resulting belief function(s) for: decision making regression classification etc (estimation, optimisation..) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 54 / 125

55 Reasoning with belief functions Reasoning with belief functions [Diagram: statistical data/opinions feed INFERENCE, producing belief functions; COMBINATION and CONDITIONING produce combined and conditional belief functions; MANIPULATION yields total/marginal belief functions; DECISION MAKING yields decisions; efficient computation, measuring uncertainty and a continuous formulation cut across the whole pipeline] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 55 / 125

56 Reasoning with belief functions Statistical inference Dempster's approach to statistical inference Fiducial argument consider a statistical model {f(x | θ), x ∈ X, θ ∈ Θ}, where X is the sample space and Θ the parameter space having observed x, how do we quantify the uncertainty about the parameter θ, without specifying a prior probability distribution? suppose that we know a data-generating mechanism [Fisher] X = a(θ, U), where U is an (unobserved) auxiliary variable with known probability distribution µ on U, independent of θ for instance, to generate a continuous random variable X with cumulative distribution function (CDF) F_θ, one might draw U from U([0, 1]) and set X = F_θ^{-1}(U) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 56 / 125
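A minimal sketch of such a data-generating mechanism X = a(θ, U): draw U uniformly and push it through the inverse CDF, here (an illustrative choice) for the exponential distribution with mean θ.

```python
import math
import random

def sample_via_inverse_cdf(theta, n, seed=0):
    """Data-generating mechanism X = a(theta, U): draw U ~ Uniform(0, 1) and set
    X = F_theta^{-1}(U), here for the exponential CDF F_theta(x) = 1 - exp(-x / theta)."""
    rng = random.Random(seed)
    return [-theta * math.log(1.0 - rng.random()) for _ in range(n)]

print(sample_via_inverse_cdf(theta=2.0, n=5))    # five draws with mean about 2
```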

57 Reasoning with belief functions Statistical inference Dempster's approach to statistical inference the equation X = a(θ, U) defines a multi-valued mapping Γ : U → 2^(X×Θ): Γ : u ↦ Γ(u) = {(x, θ) ∈ X × Θ : x = a(θ, u)} ⊆ X × Θ under the usual measurability conditions, the probability space (U, B(U), µ) and the multi-valued mapping Γ induce a belief function Bel_X×Θ on X × Θ conditioning it on θ yields Bel_X(· | θ) = f(· | θ) on X conditioning it on X = x gives Bel_Θ(· | x) on Θ [Diagram: the auxiliary variable U, with distribution µ : U → [0, 1], is mapped by Γ to subsets of X × Θ; conditioning on the observation x yields Bel_Θ(· | x)] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 57 / 125

58 Reasoning with belief functions Statistical inference Inference from classical likelihood [Shafer76, Denoeux] consider a statistical model {L(θ; x) = f(x | θ), x ∈ X, θ ∈ Θ}, where X is the sample space and Θ the parameter space Bel_Θ(· | x) is the consonant belief function (with nested focal elements) whose plausibility of the singletons equals the normalised likelihood: pl(θ | x) = L(θ; x) / sup_{θ'∈Θ} L(θ'; x) it takes the empirical normalised likelihood to be the upper bound to the probability density of the sought parameter! (rather than the actual PDF) the corresponding plausibility function is Pl_Θ(A | x) = sup_{θ∈A} pl(θ | x) the plausibility of a composite hypothesis A ⊆ Θ is the usual likelihood ratio statistic: Pl_Θ(A | x) = sup_{θ∈A} L(θ; x) / sup_{θ∈Θ} L(θ; x) compatible with the likelihood principle Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 58 / 125

59 Reasoning with belief functions Statistical inference Coin toss example Inference with belief functions consider a coin toss experiment we toss the coin n = 10 times, obtaining the sample X = {H, H, T, H, T, H, T, H, H, H} with k = 7 successes (heads H) and n − k = 3 fails (tails T) parameter of interest: the probability θ = p of heads in a single toss the inference problem consists then in gathering information on the value of p Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 59 / 125

60 Reasoning with belief functions Statistical inference Coin toss example General Bayesian inference trials are typically assumed to be independent (and equally distributed) the likelihood of the sample is binomial: P(X | p) = p^k (1 − p)^(n−k) apply Bayes rule to get the posterior P(p | X) = P(X | p) P(p) / P(X) as we do not have a-priori information, a uniform prior is typically assumed [Figure: the likelihood function P(X | p) = p^k (1 − p)^(n−k) as a function of p, peaking at the maximum likelihood estimate] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 60 / 125

61 Reasoning with belief functions Statistical inference Coin toss example Frequentist inference what would a frequentist do? it is reasonable to hypothesise that p be equal to p = k/n, i.e., the fraction of successes we can then test this hypothesis in the classical frequentist setting this implies assuming independent and equally distributed trials, so that the conditional distribution of the sample is the binomial we can then compute the p-value for, say, a confidence level of α = 0.05 the right-tail p-value for the hypothesis p = k/n (the integral area in pink) is equal to 1/2 >> α = 0.05. Hence, the hypothesis cannot be rejected [Figure: the likelihood function of p, with the right tail area corresponding to the p-value = 1/2] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 61 / 125

62 Reasoning with belief functions Statistical inference Coin toss example Inference with likelihood-based belief functions likelihood-based belief function inference yields the following belief measure, conditioned on the observed sample X, over Θ = [0, 1]: Pl_Θ(A | X) = sup_{p∈A} L̂(p | X); Bel_Θ(A | X) = 1 − Pl_Θ(A^c | X), A ⊆ Θ, where L̂(p | X) is the normalised version of the traditional likelihood the random set induced by the likelihood determines an entire envelope of PDFs on the parameter space Θ = [0, 1] (a belief function there) the random set associated with this belief measure is: Γ_X(ω) = {θ ∈ Θ : Pl_Θ({θ} | X) ≥ ω}, ω ∈ Ω = [0, 1], which is an interval centered around the ML estimate of p Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 62 / 125
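A minimal sketch of this construction for the coin-toss sample: the plausibility of an interval of values of p is approximated by a grid search over the normalised binomial likelihood; function names and the grid size are illustrative.

```python
def normalised_likelihood(p, k=7, n=10):
    """Binomial likelihood of p given k successes in n trials, divided by its maximum
    (which is attained at the MLE p = k/n)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0                      # for 0 < k < n the likelihood vanishes at the endpoints
    p_mle = k / n
    return (p**k * (1 - p)**(n - k)) / (p_mle**k * (1 - p_mle)**(n - k))

def interval_plausibility(a, b, steps=2001):
    """Pl([a, b] | X) = sup over p in [a, b] of the normalised likelihood (grid approximation)."""
    grid = [a + (b - a) * i / (steps - 1) for i in range(steps)]
    return max(normalised_likelihood(p) for p in grid)

print(interval_plausibility(0.0, 1.0))            # 1.0: the whole parameter space is fully plausible
print(interval_plausibility(0.6, 0.8) > 0.999)    # True: the interval contains the MLE p = 0.7
print(round(interval_plausibility(0.0, 0.4), 2))  # about 0.16: hypotheses far from the data
```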

63 Reasoning with belief functions Statistical inference Coin toss example Inference with likelihood-based belief functions the same procedure can be applied to the normalised empirical counts f̂(H) = 7/7 = 1, f̂(T) = 3/7, rather than to the normalised likelihood function imposing Pl_Ω(H) = 1, Pl_Ω(T) = 3/7 on Ω = {H, T}, and looking for the least committed belief function there with these plausibility values, we get the mass assignment: m(H) = 4/7, m(T) = 0, m(Ω) = 3/7, which corresponds to the credal set in the figure [Figure: the credal set of values of P(H), the interval between Bel(H) = 4/7 and Pl(H) = 1, containing the MLE 7/10] p = 1 needs to be excluded, as the available sample evidence reports that we had n(T) = 3 counts already, so that 1 − p > 0 this outcome (a belief function on Ω = {H, T}) robustifies classical MLE Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 63 / 125

64 Reasoning with belief functions Statistical inference Summary on inference general Bayesian inference continuous PDF on the parameter space Θ (a second-order distribution) MLE/MAP estimation a single parameter value = a single PDF on Ω generalised maximum likelihood a belief function on Ω (a convex set of PDFs on Ω) generalises MAP/MLE likelihood-based / Dempster-based belief function inference a belief function on Θ = a convex set of second-order distributions generalises general Bayesian inference Dempster s approach requires a data-generating process likelihood approach produces only consonant BFs Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 64 / 125

65 Reasoning with belief functions Combination Combining vs conditioning Reasoning with belief functions belief theory is a generalisation of Bayesian reasoning whereas in Bayesian theory evidence is of the kind A is true (e.g. a new datum is available).... in belief theory, new evidence can assume the more general form of a belief function a proposition A is a very special case of belief function with m(a) = 1 in most cases, reasoning needs then to be performed by combining belief functions, rather than by conditioning with respect to an event nevertheless, conditional belief functions are of interest, especially for statistical inference Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 65 / 125

66 Reasoning with belief functions Combination Dempster's rule under fire Zadeh's paradox question is: is Dempster's sum the only possible rule of combination? it seems to have paradoxical behaviour in certain circumstances doctors have opinions about the condition of a patient, Θ = {M, C, T}, where M stands for meningitis, C for concussion and T for tumor two doctors provide the following diagnoses: D 1 : "I am 99% sure it's meningitis, but there is a small chance of 1% that it is concussion". D 2 : "I am 99% sure it's a tumor, but there is a small chance of 1% that it is concussion". these can be encoded by the following mass functions: m_1(A) = 0.99 if A = {M}, 0.01 if A = {C}, 0 otherwise; m_2(A) = 0.99 if A = {T}, 0.01 if A = {C}, 0 otherwise (1) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 66 / 125

67 Reasoning with belief functions Combination Dempster's rule under fire Zadeh's paradox their (unnormalised) Dempster's combination is: m(A) = 0.9999 if A = ∅, 0.0001 if A = {C} as the two masses are highly conflicting, normalisation yields the belief function focussed on C: it is definitely concussion, although both experts had left it as only a fringe possibility objections: the belief functions in the example are really probabilities, so this is a problem with Bayesian representations, in case! diseases are never exclusive, so that it may be argued that Zadeh's choice of a frame of discernment is misleading open world approaches with no normalisation doctors disagree so much that any person would conclude that one of them is just wrong: reliability of sources needs to be accounted for Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 67 / 125
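A minimal numeric check of Zadeh's example, assuming the singleton-focal-element encoding above: the only non-empty intersection of focal elements is {C}, so almost all of the product mass is conflict.

```python
# Doctor 1: m1({M}) = 0.99, m1({C}) = 0.01;  Doctor 2: m2({T}) = 0.99, m2({C}) = 0.01
m1 = {'M': 0.99, 'C': 0.01}
m2 = {'T': 0.99, 'C': 0.01}

# With singleton focal elements, the only non-empty intersection is {C} with {C}
mass_on_C = m1['C'] * m2['C']               # 0.0001
conflict = 1.0 - mass_on_C                  # 0.9999: almost total conflict
normalised_mass_on_C = mass_on_C / (1.0 - conflict)
print(round(conflict, 4))                   # 0.9999
print(round(normalised_mass_on_C, 6))       # 1.0: after normalisation, concussion becomes certain
```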

68 Reasoning with belief functions Combination Dempster's rule under fire Tchamova's paradox this time, the two doctors generate the following mass assignments over Θ = {M, C, T}: m_1(A) = a if A = {M}, 1 − a if A = {M, C}, 0 otherwise; m_2(A) = b_1 if A = {M, C}, b_2 if A = Θ, 1 − b_1 − b_2 if A = {T} (2) assuming equal reliability of the two doctors, Dempster's combination yields m_1 ⊕ m_2 = m_1, i.e., Doctor 2's diagnosis is completely absorbed by that of Doctor 1! here the paradoxical behaviour is not a consequence of conflict in Dempster's combination, every source of evidence has a veto power over the hypotheses it does not believe to be possible if any of them gets it wrong, the combined belief function will never give support to the correct hypothesis Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 68 / 125

69 Reasoning with belief functions Combination Yager's and Dubois' rules first answer to Zadeh's objections, based on the view that conflict is generated by non-reliable information sources the conflicting mass m(∅) = Σ_{B∩C=∅} m_1(B) m_2(C) should be re-assigned to the whole frame Θ let m_∩(A) = Σ_{B∩C=A} m_1(B) m_2(C); then m_Y(A) = m_∩(A) for A ⊊ Θ, and m_Y(Θ) = m_∩(Θ) + m(∅) (3) Dubois and Prade's idea: similar to Yager's, BUT the conflicting mass is not transferred all the way up, but to B ∪ C (by applying the minimum specificity principle): m_D(A) = m_∩(A) + Σ_{B∪C=A, B∩C=∅} m_1(B) m_2(C) (4) the resulting BF dominates Yager's combination: m_D(A) ≥ m_Y(A) for all A Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 69 / 125

70 Reasoning with belief functions Combination Conjunctive and disjunctive rules rather than normalising (as in Dempster's rule) or re-assigning the conflicting mass m(∅) to other non-empty subsets (as in Yager's and Dubois' proposals), Smets' conjunctive rule leaves the conflicting mass with the empty set: m_∩(A) = Σ_{B∩C=A} m_1(B) m_2(C) (5) applicable to unnormalised belief functions under an open world assumption: the current frame only approximately describes the set of possible hypotheses disjunctive rule of combination: m_∪(A) = Σ_{B∪C=A} m_1(B) m_2(C) (6) consensus between two sources is expressed by the union of the supported propositions, rather than by their intersection note that (Bel_1 ∪ Bel_2)(A) = Bel_1(A) · Bel_2(A): belief values are simply multiplied! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 70 / 125
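A minimal sketch of the two rules side by side; the input masses are illustrative (they are the discounted mass functions that reappear in the object-detection example near the end of the talk), and the dict-of-frozensets representation is an assumption of the sketch.

```python
from itertools import product

def conjunctive_combine(m1, m2):
    """Smets' conjunctive rule: like Dempster's rule, but the conflicting mass is left
    on the empty set instead of being normalised away."""
    out = {}
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C                                  # may be the empty frozenset
        out[A] = out.get(A, 0.0) + mB * mC
    return out

def disjunctive_combine(m1, m2):
    """Disjunctive rule: consensus is expressed by the union of the supported propositions."""
    out = {}
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B | C
        out[A] = out.get(A, 0.0) + mB * mC
    return out

Y, N = frozenset({'Y'}), frozenset({'N'})
theta = Y | N
m1 = {Y: 0.72, N: 0.08, theta: 0.2}
m2 = {Y: 0.08, N: 0.72, theta: 0.2}

print({tuple(sorted(A)): round(v, 4) for A, v in conjunctive_combine(m1, m2).items()})
# the empty tuple () carries the conflict 0.72*0.72 + 0.08*0.08 = 0.5248
print({tuple(sorted(A)): round(v, 4) for A, v in disjunctive_combine(m1, m2).items()})
# no conflict: mass only on ('Y',), ('N',) and ('N', 'Y')
```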

71 Reasoning with belief functions Combination Combination: some conclusions Yager s rule is rather unjustified.. Dubois is kinda intermediate between conjunction and disjunction my take on this: Dempster s (conjunctive) combination and disjunctive combination are the two extrema of a spectrum of possible results Proposal: combination tubes? Meta-uncertainty on the sources generating the input belief functions (their independence and reliability) induces uncertainty on the result of the combination, represented by a bracket of combination rules, which produce a tube of BFs. fits well with belief likelihood concept, and was already hinted at by Pearl in Reasoning with belief functions: An analysis of compatibility we should probably work with intervals of belief functions then? Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 71 / 125

72 Reasoning with belief functions Conditioning Conditional belief functions Approaches in Bayesian theory conditioning is done via Bayes rule: P(A | B) = P(A ∩ B) / P(B) for belief functions, many approaches to conditioning have been proposed (just as for combination!) original Dempster's conditioning Fagin and Halpern's lower envelopes geometric conditioning [Suppes] unnormalized conditional belief functions [Smets] generalised Jeffrey's rules [Smets] sets of equivalent events under multi-valued mappings [Spies] several of them are special cases of combination rules: Dempster's, Smets'.. others are the unique solution when interpreting belief functions as convex sets of probabilities (Fagin's) once again, a duality emerges between the most and least cautious conditioning approaches Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 72 / 125

73 Reasoning with belief functions Conditioning Dempster's conditioning Dempster's rule of combination induces a conditioning operator given a new event B, the logical belief function such that m(B) = 1 is combined with the a-priori belief function Bel using Dempster's rule the resulting BF is the conditional belief function given B, Bel_⊕(A | B) in terms of belief and plausibility values, Dempster's conditioning yields Bel_⊕(A | B) = (Bel(A ∪ B̄) − Bel(B̄)) / (1 − Bel(B̄)) = (Pl(B) − Pl(B \ A)) / Pl(B), Pl_⊕(A | B) = Pl(A ∩ B) / Pl(B) obtained from Bayes rule by replacing probability with plausibility measures! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 73 / 125
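A minimal sketch of Dempster's conditioning computed directly from the plausibility formulas above; the mass function is illustrative.

```python
def plausibility(A, mass):
    """Pl(A): total mass of the focal elements intersecting A."""
    return sum(m for B, m in mass.items() if B & A)

def dempster_conditional(A, B, mass):
    """Dempster's conditioning in terms of plausibilities:
    Bel(A|B) = (Pl(B) - Pl(B minus A)) / Pl(B),  Pl(A|B) = Pl(A intersect B) / Pl(B)."""
    pl_B = plausibility(B, mass)
    bel = (pl_B - plausibility(B - A, mass)) / pl_B
    pl = plausibility(A & B, mass) / pl_B
    return bel, pl

# Illustrative mass function on Theta = {a, b, c}
mass = {frozenset('a'): 0.3, frozenset('ab'): 0.4, frozenset('abc'): 0.3}
A, B = frozenset('a'), frozenset('ab')
bel, pl = dempster_conditional(A, B, mass)
print(round(bel, 3), round(pl, 3))         # 0.3 1.0: conditional belief and plausibility of {a} given {a, b}
```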

74 Reasoning with belief functions Conditioning Lower envelopes of conditional probabilities we know that a belief function can be seen as the lower envelope of the family of probabilities consistent with it: Bel(A) = inf_{P∈P[Bel]} P(A) the conditional belief function can be defined as the lower envelope (the inf) of the family of conditional probability functions P(A | B), where P is consistent with Bel: Bel_Cr(A | B) = inf_{P∈P[Bel]} P(A | B), Pl_Cr(A | B) = sup_{P∈P[Bel]} P(A | B) quite incompatible with the random set interpretation nevertheless, whereas lower/upper envelopes of arbitrary sets of probabilities are not in general belief functions, these actually are belief functions: Bel_Cr(A | B) = Bel(A ∩ B) / (Bel(A ∩ B) + Pl(Ā ∩ B)), Pl_Cr(A | B) = Pl(A ∩ B) / (Pl(A ∩ B) + Bel(Ā ∩ B)) they provide a more conservative estimate than Dempster's conditioning: Bel_Cr(A | B) ≤ Bel_⊕(A | B) ≤ Pl_⊕(A | B) ≤ Pl_Cr(A | B) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 74 / 125

75 Reasoning with belief functions Conditioning Geometric conditioning Suppes and Zanotti proposed a geometric conditioning approach: Bel_G(A | B) = Bel(A ∩ B) / Bel(B), Pl_G(A | B) = (Bel(B) − Bel(B \ A)) / Bel(B) it retains only the masses of focal elements inside B, and normalises them: m_G(A | B) = m(A) / Bel(B) for A ⊆ B it is a consequence of the focussing approach to belief update: no new information is introduced, we merely focus on a specific subset of the original set it replaces probability with belief measures in Bayes rule, Bel_G(A | B) = Bel(A ∩ B) / Bel(B), whereas Dempster's conditioning uses plausibilities, Pl_⊕(A | B) = Pl(A ∩ B) / Pl(B) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 75 / 125

76 Reasoning with belief functions Conditioning Conjunctive rule of conditioning it is induced by the conjunctive rule of combination: the conditional mass m_∩(· | B) is obtained by conjunctively combining m with m_B, the logical BF focussed on B [Smets] its belief and plausibility values are: Bel_∩(A | B) = Bel(A ∪ B̄) if A ∩ B ≠ ∅, 0 if A ∩ B = ∅; Pl_∩(A | B) = Pl(A ∩ B) if A ∩ B ≠ ∅, 1 if A ∩ B = ∅ it is compatible with the principles of belief revision [Gilboa, Perea]: a state of belief is modified to take into account a new piece of information in probability theory, both focussing and revision are expressed by Bayes rule, but they are conceptually different operations which produce different results on BFs it is more committal than Dempster's rule! Bel_⊕(A | B) ≤ Bel_∩(A | B) ≤ Pl_∩(A | B) ≤ Pl_⊕(A | B) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 76 / 125

77 Reasoning with belief functions Conditioning Disjunctive rule of conditioning induced by the disjunctive rule of combination: m_∪(· | B) is obtained by disjunctively combining m with m_B obviously dual to conjunctive conditioning it assigns mass only to subsets containing the conditioning event B belief and plausibility values: Bel_∪(A | B) = Bel(A) if A ⊇ B, 0 if A ⊉ B; Pl_∪(A | B) = Pl(A) if A ∩ B = ∅, 1 if A ∩ B ≠ ∅ it is less committal not only than Dempster's rule, but also than credal conditioning: Bel_∪(A | B) ≤ Bel_Cr(A | B) ≤ Pl_Cr(A | B) ≤ Pl_∪(A | B) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 77 / 125

78 Reasoning with belief functions Conditioning Conditioning - an overview summary of the main operators (belief / plausibility): Dempster's: Bel_⊕(A | B) = (Pl(B) − Pl(B \ A)) / Pl(B), Pl_⊕(A | B) = Pl(A ∩ B) / Pl(B) Credal: Bel_Cr(A | B) = Bel(A ∩ B) / (Bel(A ∩ B) + Pl(Ā ∩ B)), Pl_Cr(A | B) = Pl(A ∩ B) / (Pl(A ∩ B) + Bel(Ā ∩ B)) Geometric: Bel_G(A | B) = Bel(A ∩ B) / Bel(B), Pl_G(A | B) = (Bel(B) − Bel(B \ A)) / Bel(B) Conjunctive: Bel_∩(A | B) = Bel(A ∪ B̄), Pl_∩(A | B) = Pl(A ∩ B) (for A ∩ B ≠ ∅) Disjunctive: Bel_∪(A | B) = Bel(A) (for A ⊇ B), Pl_∪(A | B) = Pl(A) (for A ∩ B = ∅) Nested conditioning operators Conditioning operators form a nested family, from the more committal to the least committal one! Bel_∪(·) ≤ Bel_Cr(·) ≤ Bel_⊕(·) ≤ Bel_∩(·) ≤ Pl_∩(·) ≤ Pl_⊕(·) ≤ Pl_Cr(·) ≤ Pl_∪(·) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 78 / 125

79 Reasoning with belief functions Belief vs Bayesian reasoning Belief vs Bayesian reasoning A toy example suppose we want to estimate the class of an object appearing in an image, based on feature measurements extracted from the image (e.g. by convolutional neural networks) we capture a training set of images, complete with annotated object labels assuming a PDF of a certain family (e.g. mixture of Gaussians) we can learn from the training data a likelihood function p(y x), where y is the object class and x the image feature vector suppose n different sensors extract n features x i from each image: x 1,..., x n let us compare how data fusion works under the Bayesian and the belief function paradigms! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 79 / 125

80 Reasoning with belief functions Belief vs Bayesian reasoning (Naive) Bayesian data fusion Belief vs Bayesian reasoning the likelihoods of the individual features are computed using the n likelihood functions learned during training: p(x_i|y), for all i = 1,..., n measurements are typically assumed to be conditionally independent, yielding the product likelihood p(x|y) = ∏_i p(x_i|y) Bayesian inference is applied, typically assuming uniform priors (for there is no reason to think otherwise), yielding p(y|x) ∝ p(x|y) = ∏_i p(x_i|y) [pipeline: each feature x_i → likelihood function p(x_i|y) → product under conditional independence → Bayes' rule with uniform prior → p(y|x) ∝ ∏_i p(x_i|y)] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 80 / 125
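A minimal sketch of the naive Bayesian fusion pipeline just described; the two classes, the Gaussian likelihood model and all numerical values are invented for illustration.

```python
import math

classes = ['cat', 'dog']
# illustrative per-feature Gaussian likelihood parameters (mean, std) per class
params = {
    'cat': [(0.0, 1.0), (2.0, 0.5)],
    'dog': [(1.5, 1.0), (0.0, 0.5)],
}

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def naive_bayes_fusion(x):
    """p(y | x_1,...,x_n) proportional to the product of p(x_i | y), uniform prior."""
    scores = {y: math.prod(gaussian(xi, mu, sg) for xi, (mu, sg) in zip(x, params[y]))
              for y in classes}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(naive_bayes_fusion([0.2, 1.8]))   # posterior over the two classes
```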

81 Reasoning with belief functions Belief vs Bayesian reasoning Dempster-Shafer data fusion Belief vs Bayesian reasoning with belief functions, for each feature type i a BF is learned from the individual likelihood p(x_i|y), e.g. via the likelihood-based approach by Shafer this yields n belief functions Bel(·|x_i), on the range of possible object classes Y a combination rule is applied to compute an overall BF (e.g. ⊕, ∩ or ∪), obtaining Bel(Y|x) = Bel(Y|x_1) ⊕ ... ⊕ Bel(Y|x_n), Y ⊆ Y [pipeline: each feature x_i → likelihood function p(x_i|y) → likelihood-based inference → Bel(·|x_i) → belief function combination → Bel(·|x)] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 81 / 125

82 Reasoning with belief functions Belief vs Bayesian reasoning Inference under partially reliable data Belief vs Bayesian reasoning in the fusion example we have assumed that the data are measured correctly what if the data-generating process is not completely reliable? problem: suppose we want to just detect an object (binary decision: yes Y or no N) two sensors produce image features x_1 and x_2, but we learned from the training data that both are reliable only 80% of the time at test time we get an image, measure x_1 and x_2, and unluckily sensor 2 got it wrong! the object is actually there we get the following normalised likelihoods: p(x_1|Y) = 0.9, p(x_1|N) = 0.1; p(x_2|Y) = 0.1, p(x_2|N) = 0.9 Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 82 / 125

83 Reasoning with belief functions Belief vs Bayesian reasoning Inference under partially reliable data Belief vs Bayesian reasoning how do the two fusion pipelines cope with this? the Bayesian scholar assumes the two sensors/processes are conditionally independent, and multiplies the likelihoods obtaining p(x_1, x_2|Y) = 0.9 × 0.1 = 0.09, p(x_1, x_2|N) = 0.1 × 0.9 = 0.09 so that p(Y|x_1, x_2) = 1/2, p(N|x_1, x_2) = 1/2 Shafer's faithful follower discounts the likelihoods by assigning mass 0.2 to the whole hypothesis space Θ = {Y, N}: m(Y|x_1) = 0.9 × 0.8 = 0.72, m(N|x_1) = 0.1 × 0.8 = 0.08, m(Θ|x_1) = 0.2; m(Y|x_2) = 0.1 × 0.8 = 0.08, m(N|x_2) = 0.9 × 0.8 = 0.72, m(Θ|x_2) = 0.2 Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 83 / 125

84 Reasoning with belief functions Belief vs Bayesian reasoning Inference under partially reliable data Belief vs Bayesian reasoning thus, when we combine them by Dempster's rule we get the BF Bel on {Y, N}: m(Y|x_1, x_2) = 0.458, m(N|x_1, x_2) = 0.458, m(Θ|x_1, x_2) = 0.084 when combined using the disjunctive rule (the least committal one) we get Bel′: m′(Y|x_1, x_2) = 0.09, m′(N|x_1, x_2) = 0.09, m′(Θ|x_1, x_2) = 0.82 [figure: the credal sets of probabilities P(Y|x_1, x_2) induced by Bel, Bel′ and the Bayesian posterior, shown as intervals] the credal interval for Bel is quite narrow: reliability is assumed to be 80%, and we got one faulty measurement out of two (50%)! the disjunctive rule is much more cautious about the correct inference Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 84 / 125
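The figures in this slide can be reproduced in a few lines; the sketch below hard-codes the binary frame {Y, N}, the 0.2 discount rate and the two likelihood vectors from the example, while the function names and structure are mine.

```python
def discount(likelihoods, alpha):
    """Turn normalised likelihoods into a simple BF discounted by rate alpha."""
    m = {k: (1 - alpha) * v for k, v in likelihoods.items()}
    m[frozenset({'Y', 'N'})] = alpha          # mass moved to the whole frame Θ
    return m

def dempster(m1, m2):
    """Dempster's rule on a small frame: conjunctive combination + normalisation."""
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                out[C] = out.get(C, 0.0) + a * b
            else:
                conflict += a * b
    return {k: v / (1 - conflict) for k, v in out.items()}

def disjunctive(m1, m2):
    out = {}
    for A, a in m1.items():
        for B, b in m2.items():
            C = A | B
            out[C] = out.get(C, 0.0) + a * b
    return out

Y, N = frozenset({'Y'}), frozenset({'N'})
lik1 = {Y: 0.9, N: 0.1}   # sensor 1 (correct)
lik2 = {Y: 0.1, N: 0.9}   # sensor 2 (faulty)

m1, m2 = discount(lik1, 0.2), discount(lik2, 0.2)
print(dempster(m1, m2))                                  # about {Y: 0.458, N: 0.458, Θ: 0.084}
print(disjunctive({Y: 0.9, N: 0.1}, {Y: 0.1, N: 0.9}))   # {Y: 0.09, N: 0.09, Θ: 0.82}
```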

85 Reasoning with belief functions Generalised Bayes Theorem Generalised Bayes Theorem Generalising full Bayesian inference in Smets' generalised Bayesian theorem setting, the input is a set of conditional belief functions on Θ, rather than likelihoods p(x|θ): Bel_X(X|θ), X ⊆ X, θ ∈ Θ, each associated with a value θ of the parameter (these are not the same conditional belief functions we saw, where a conditioning event B ⊆ Θ alters a prior belief function Bel_Θ mapping it to Bel_Θ(·|B)) they can be seen as a parameterised family of BFs on the data the desired output is another family of belief functions on Θ, parameterised by all sets of measurements X on X: Bel_Θ(A|X), X ⊆ X each piece of evidence m_X(X|θ) has an effect on our beliefs on the parameters coherent with the random set setting, as we condition on set-valued observations Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 85 / 125

86 Reasoning with belief functions Generalised Bayes Theorem Generalised Bayes Theorem Generalised Bayes Theorem Implements this inference Bel_X(X|θ) → Bel_Θ(A|X) by: 1 computing an intermediate family of BFs on X parameterised by sets of parameter values, via the disjunctive rule of combination: Bel_X(X|A) = ∏_{θ ∈ A} Bel_X(X|θ), i.e. the disjunctive combination of the Bel_X(·|θ), θ ∈ A 2 assuming that Pl_Θ(A|X) = Pl_X(X|A) for all A ⊆ Θ, X ⊆ X 3 this yields Bel_Θ(A|X) = ∏_{θ ∈ Ā} Bel_X(X̄|θ) generalises Bayes' rule (by replacing P with Pl) when priors are uniform Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 86 / 125
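One possible reading of the GBT in code, for a single observed datum x and a three-element parameter set (the plausibility values pl_X(x|θ) are invented): build the GBT mass function m(A|x) = Π_{θ∈A} pl_X(x|θ) · Π_{θ∉A} (1 - pl_X(x|θ)), which is one standard way of expressing the result above for a singleton observation, and read Bel and Pl off it. The sketch normalises à la Dempster; Smets' open-world version would keep the mass assigned to the empty set instead.

```python
from itertools import combinations

# illustrative conditional plausibilities pl_X(x | θ) of one observed datum x
pl_x = {'t1': 0.9, 't2': 0.3, 't3': 0.6}
Theta = list(pl_x)

def subsets(s):
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

# GBT mass on Θ induced by the datum x:
#   m(A | x) = Π_{θ∈A} pl_X(x|θ) · Π_{θ∉A} (1 - pl_X(x|θ))
m = {}
for A in subsets(Theta):
    p = 1.0
    for t in Theta:
        p *= pl_x[t] if t in A else (1.0 - pl_x[t])
    m[A] = p

# normalise a la Dempster; the open-world TBM would instead keep the mass of the empty set
k = m.pop(frozenset())
m = {A: v / (1.0 - k) for A, v in m.items()}

def bel(A): return sum(v for B, v in m.items() if B <= A)
def pl(A): return sum(v for B, v in m.items() if B & A)

for t in Theta:
    s = frozenset({t})
    print(t, round(bel(s), 3), round(pl(s), 3))   # Pl({θ}|x) proportional to pl_X(x|θ)
```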

87 Reasoning with belief functions The total belief theorem The total belief theorem Generalising the law of total probability conditional belief functions are crucial for our approach to inference complementary link of the chain: generalisation of the law of total probability recall that a refining is a mapping from elements of one set Ω to elements of a disjoint partition of a second set Θ [diagram: a prior Bel_0 : 2^Ω → [0, 1] on Ω and conditional belief functions Bel_i : 2^{Π_i} → [0, 1] on the partition elements Π_i of Θ] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 87 / 125

88 Reasoning with belief functions The total belief theorem The total belief theorem Statement Total belief theorem Suppose Θ and Ω are two finite sets, and ρ : 2^Ω → 2^Θ the unique refining between them. Let Bel_0 be a belief function defined over Ω = {ω_1,..., ω_|Ω|}. Suppose there exists a collection of belief functions Bel_i : 2^{Π_i} → [0, 1], where Π = {Π_1,..., Π_|Ω|}, Π_i = ρ({ω_i}), is the partition of Θ induced by Ω. Then, there exists a belief function Bel : 2^Θ → [0, 1] such that: 1 Bel_0 is the marginal of Bel to Ω (Bel_0(A) = Bel(ρ(A))); 2 Bel ⊕ Bel_{Π_i} = Bel_i for all i = 1,..., |Ω|, where Bel_{Π_i} is the logical belief function with m_{Π_i}(A) = 1 if A = Π_i, 0 otherwise several distinct solutions exist, and they likely form a graph with symmetries one such solution is easily identifiable Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 88 / 125

89 Reasoning with belief functions The total belief theorem The total belief theorem Existence of a solution [Zhou & Cuzzolin, UAI 2017] assume Θ ⊆ Θ′, and m a mass function over Θ m can be identified with a mass function m^{Θ′} over the larger frame Θ′: for any E′ ⊆ Θ′, m^{Θ′}(E′) = m(E) if E′ = E ∪ (Θ′ \ Θ), and m^{Θ′}(E′) = 0 otherwise such m^{Θ′} is called the conditional embedding of m into Θ′ let Bel′_i be the conditional embedding of Bel_i into Θ for all Bel_i : 2^{Π_i} → [0, 1], and Bel′ = Bel′_1 ⊕ ... ⊕ Bel′_|Ω| Total belief theorem: existence The belief function Bel := Bel_0^Θ ⊕ Bel′ is a valid total belief function. Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 89 / 125

90 Reasoning with belief functions Decision making Decision making with belief functions a decision problem can be formalised by defining: a set Ω of possible states of the world, a set X of consequences and a set F of acts, where an act is a function f : Ω → X mapping a world state to a consequence problem: to select an act f from an available list F (i.e., to make a decision) which optimises a certain objective function various approaches to decision making with belief functions; among those: decision making in the TBM is based on expected utility via the pignistic transform generalised expected utility [Gilboa] based on classical expected utility theory [Savage, von Neumann] also a lot of interest in multicriteria decision making (based on a number of attributes) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 90 / 125

91 Reasoning with belief functions Decision making Decision making with the pignistic probability classical expected utility theory is due to Von Neumann in Smets' Transferable Belief Model, decision making is done by maximising the expected utility of actions based on the pignistic transform this maps a belief function Bel on Ω to a probability distribution there: BetP[Bel](ω) = Σ_{A ∋ ω} m(A) / |A|, ω ∈ Ω the set of possible actions F and the set Ω of possible outcomes are distinct, and the utility function u is defined on F × Ω the optimal decision maximises E[u] = Σ_{ω ∈ Ω} u(f, ω) BetP(ω) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 91 / 125
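A toy sketch of TBM-style decision making (the frame, the mass assignment, the acts and the utilities are all made up): compute BetP from the masses, then pick the act with the highest pignistic expected utility.

```python
Theta = ['rain', 'sun']
m = {frozenset({'rain'}): 0.4,
     frozenset({'sun'}): 0.2,
     frozenset({'rain', 'sun'}): 0.4}       # mass on the whole frame = ignorance

# pignistic transform: BetP(ω) = Σ_{A∋ω} m(A) / |A|
betp = {w: sum(v / len(A) for A, v in m.items() if w in A) for w in Theta}

# illustrative utility u(f, ω) for two acts
utility = {
    'take umbrella':  {'rain': 5, 'sun': 3},
    'leave umbrella': {'rain': 0, 'sun': 6},
}

expected = {f: sum(utility[f][w] * betp[w] for w in Theta) for f in utility}
best = max(expected, key=expected.get)
print(betp)                      # {'rain': 0.6, 'sun': 0.4}
print(expected, '->', best)
```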

92 Reasoning with belief functions Decision making Savage's sure thing principle let ⪰ be a preference relation on F, such that f ⪰ g means that f is at least as desirable as g Savage (1954) showed that ⪰ verifies some rationality requirements iff there exists a probability measure P on Ω and a utility function u : X → R s.t. for all f, g ∈ F: f ⪰ g iff E_P(u ∘ f) ≥ E_P(u ∘ g) does that mean that using belief functions is irrational? given f, h ∈ F and E ⊆ Ω, let fEh denote the act defined by (fEh)(ω) = f(ω) if ω ∈ E, h(ω) if ω ∉ E then the sure thing principle states that for all E, f, g, h, h′: fEh ⪰ gEh implies fEh′ ⪰ gEh′ Ellsberg's paradox: empirically the Sure Thing Principle is violated! this is because people are averse to second-order uncertainty Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 92 / 125

93 Reasoning with belief functions Decision making Ellsberg's paradox suppose you have an urn containing 30 red balls and 60 balls, either black or yellow f_1: you receive 100 euros if you draw a red ball f_2: you receive 100 euros if you draw a black ball f_3: you receive 100 euros if you draw a red or yellow ball f_4: you receive 100 euros if you draw a black or yellow ball in this example Ω = {R, B, Y}, f_i : Ω → R and X = R empirically most people strictly prefer f_1 to f_2, but they strictly prefer f_4 to f_3 payoffs:
      R    B    Y
f_1  100    0    0
f_2    0  100    0
f_3  100    0  100
f_4    0  100  100
now pick E = {R, B}: by definition f_1{R, B}0 = f_1, f_2{R, B}0 = f_2 and f_1{R, B}100 = f_3, f_2{R, B}100 = f_4 since f_1 ≻ f_2, i.e. f_1{R, B}0 ≻ f_2{R, B}0, the Sure Thing Principle would imply f_1{R, B}100 ≻ f_2{R, B}100, i.e., f_3 ≻ f_4 empirically the Sure Thing Principle is violated! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 93 / 125

94 Reasoning with belief functions Decision making Lower and upper expected utilities Gilboa (1987) proposed a modification of Savage's axioms a preference relation ⪰ meets these weaker requirements iff there exists a (not necessarily additive) measure µ and a utility function u : X → R such that, for all f, g ∈ F: f ⪰ g iff C_µ(u ∘ f) ≥ C_µ(u ∘ g), where C_µ is the Choquet integral, defined for X : Ω → R as
C_µ(X) = ∫_0^{+∞} µ({X ≥ t}) dt + ∫_{-∞}^0 [µ({X ≥ t}) - 1] dt
given a belief function Bel on Ω and a utility function u, this theorem supports making decisions based on the Choquet integral of u with respect to Bel for finite Ω, it can be shown that
C_Bel(u ∘ f) = Σ_{B ⊆ Ω} m(B) min_{ω ∈ B} u(f(ω)),   C_Pl(u ∘ f) = Σ_{B ⊆ Ω} m(B) max_{ω ∈ B} u(f(ω))
(lower and upper expectations of u ∘ f with respect to Bel) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 94 / 125

95 Reasoning with belief functions Decision making Decision making Possible strategies let P(Bel) as usual be the set of probability measures P compatible with Bel, i.e., such that Bel ≤ P. Then, it can be shown that C_Bel(u ∘ f) = min_{P ∈ P(Bel)} E_P(u ∘ f) =: E_*(u ∘ f) and C_Pl(u ∘ f) = max_{P ∈ P(Bel)} E_P(u ∘ f) =: E^*(u ∘ f) two expected utilities, the lower E_*(f) and the upper E^*(f): how do we make a decision? possible decision criteria based on interval dominance: 1 f ⪰ g iff E_*(u ∘ f) ≥ E^*(u ∘ g) (conservative strategy) 2 f ⪰ g iff E_*(u ∘ f) ≥ E_*(u ∘ g) (pessimistic strategy) 3 f ⪰ g iff E^*(u ∘ f) ≥ E^*(u ∘ g) (optimistic strategy) 4 f ⪰ g iff α E_*(u ∘ f) + (1 - α) E^*(u ∘ f) ≥ α E_*(u ∘ g) + (1 - α) E^*(u ∘ g) for some α ∈ [0, 1] called a pessimism index (Hurwicz criterion) the conservative strategy yields only a partial preorder: f and g are not comparable if E_*(u ∘ f) < E^*(u ∘ g) and E_*(u ∘ g) < E^*(u ∘ f) Ellsberg's paradox is actually explained by the pessimistic strategy Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 95 / 125
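To see how the pessimistic (maximin) strategy accounts for the Ellsberg preferences, the sketch below computes the lower and upper expected utilities C_Bel and C_Pl of the four acts under the natural belief function for the urn, m({R}) = 1/3 and m({B, Y}) = 2/3; the code organisation and names are mine, the numbers follow the example.

```python
R, B, Y = 'R', 'B', 'Y'
m = {frozenset({R}): 1/3, frozenset({B, Y}): 2/3}   # urn: 30 red, 60 black-or-yellow

acts = {
    'f1': {R: 100, B: 0,   Y: 0},     # 100 on red
    'f2': {R: 0,   B: 100, Y: 0},     # 100 on black
    'f3': {R: 100, B: 0,   Y: 100},   # 100 on red or yellow
    'f4': {R: 0,   B: 100, Y: 100},   # 100 on black or yellow
}

def lower_upper(u):
    """C_Bel(u) = Σ_A m(A) min_{ω∈A} u(ω);  C_Pl(u) = Σ_A m(A) max_{ω∈A} u(ω)."""
    lo = sum(v * min(u[w] for w in A) for A, v in m.items())
    hi = sum(v * max(u[w] for w in A) for A, v in m.items())
    return lo, hi

for name, u in acts.items():
    lo, hi = lower_upper(u)
    print(f'{name}: lower E = {lo:6.2f}, upper E = {hi:6.2f}')
# pessimistic (maximin) ordering: f1 (33.3) > f2 (0.0) and f4 (66.7) > f3 (33.3),
# exactly the empirically observed pattern of preferences
```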

96 Theories of uncertainty Outline 1 Uncertainty Second-order uncertainty Classical probability 2 Beyond probability Set-valued observations Propositional evidence Scarce data Representing ignorance Rare events Uncertain data 3 Belief theory A theory of evidence Belief functions Semantics Dempster s rule Multivariate analysis Misunderstandings 4 Reasoning with belief functions Statistical inference Combination Conditioning Belief vs Bayesian reasoning Generalised Bayes Theorem The total belief theorem Decision making 5 Theories of uncertainty Imprecise probability Monotone capacities Probability intervals Fuzzy and possibility theory Probability boxes Rough sets 6 Belief functions on reals Continuous belief functions Random sets 7 Conclusions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 96 / 125

97 Theories of uncertainty Theories of uncertainty several different mathematical theories of uncertainty compete to be adopted by practitioners the consensus is that there is no such thing as the best mathematical description of uncertainty random sets are not the most general framework; however, we argue here, they naturally arise from set-valued observations scholars have extensively discussed and compared the various approaches to uncertainty theory [Klir, Destercke] theoretical and empirical comparisons between belief functions and other theories were conducted [Lee, Yager, Helton, Regan..] some attempts have been made to unify most approaches to uncertainty theory [Klir, Zadeh, Walley] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 97 / 125

98 Theories of uncertainty A hierarchy of uncertainty theories [figure: two diagrams relating the main uncertainty formalisms: lower/upper previsions, credal sets, monotone and 2-monotone capacities, infinitely-monotone capacities (belief functions / random sets), feasible probability intervals, normalised sum functions, generalised p-boxes, p-boxes, probabilities and possibilities] Left: relation between BFs and other uncertainty measures Right: Destercke's partial hierarchies of uncertainty theories Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 98 / 125

99 Theories of uncertainty Imprecise probability Coherent lower probabilities Walley's Imprecise Probability behavioural approach to probability a lower probability P̲ is a function from a sigma-algebra to the unit interval [0, 1] such that: P̲(A ∪ B) ≥ P̲(A) + P̲(B) whenever A ∩ B = ∅ (super-additivity) a lower probability P̲ avoids sure loss if the set P(P̲) := { P : P(A) ≥ P̲(A), for all A ⊆ Ω } is non-empty (the lower bound constraints P̲(A) can be satisfied by some probability measure) it is coherent if inf_{P ∈ P(P̲)} P(A) = P̲(A) for all A (P̲ is the lower envelope of P(P̲)) not all convex sets of probabilities can be described by merely focusing on events [Walley]: notion of gamble following de Finetti, imprecise probability equates belief with inclination to act: an agent believes in an outcome to the extent it is willing to accept a bet on it Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 99 / 125

100 Theories of uncertainty Imprecise probability Desirable gambles Gamble A gamble is a bounded real-valued function on Θ: X : Θ → R, θ ↦ X(θ). [figure: a coherent set D of desirable gambles as a convex cone in the plane of gambles (X, Y)] a lower probability can be seen as a functional defined on the class of all indicator functions of sets (the traditional events) an agent's set of desirable gambles is denoted by D ⊆ L(Ω), where L(Ω) is the set of all bounded real-valued functions on Ω since whether a gamble is desirable depends on the agent's belief about the outcome, D can be used as a model of the agent's uncertainty about the problem Coherence of desirable gambles A set D of desirable gambles is coherent iff it is a convex cone (it is closed under positive linear combinations). Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 100 / 125

101 Theories of uncertainty Imprecise probability Lower and upper previsions suppose the agent buys a gamble X for a price µ: this yields a new gamble X - µ lower prevision P̲(X) of a gamble X: P̲(X) := sup{µ : X - µ ∈ D}, the supremum acceptable price for buying X selling a gamble X for a price µ also yields a new gamble µ - X upper prevision P̄(X) of a gamble X: P̄(X) := inf{µ : µ - X ∈ D}, the infimum acceptable price for selling X when lower and upper prevision coincide, P(X) = P̲(X) = P̄(X) is called the precise prevision of X (what de Finetti called the fair price) for prices in [P̲(X), P̄(X)] we are undecided as to whether to buy or sell gamble X Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 101 / 125

102 Theories of uncertainty Imprecise probability Rules of rational behaviour Rational behaviour the agent does not specify betting rates such that they lose utility whatever the outcome (avoiding sure loss) the agent is fully aware of the consequences of its betting rates (coherence) if the first condition is not met, there exists a positive combination of gambles, each individually desirable to the agent, which is not desirable to them one consequence of avoiding sure loss is that P̲(A) ≤ P̄(A) a consequence of coherence is that lower previsions are super-additive a precise prevision P is coherent iff: (i) P(λX + µY) = λP(X) + µP(Y); (ii) if X ≥ 0 then P(X) ≥ 0; (iii) P(Ω) = 1, and coincides with de Finetti's notion of coherent prevision A powerful theory Generalises probability measures, de Finetti previsions, 2-monotone capacities, Choquet capacities, possibility/necessity measures, belief/plausibility measures, random sets, but also probability boxes, credal sets, and robust Bayesian models. Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 102 / 125

103 Theories of uncertainty Monotone capacities Monotone capacities Choquet [1953], Sugeno [1974] the theory of capacities is a generalisation of classical measure theory Monotone capacity Given a domain Θ and a non-empty family F of subsets of Θ, a monotone capacity or fuzzy measure is a function µ : F → [0, 1] such that µ(∅) = 0 and, if A ⊆ B, then µ(A) ≤ µ(B) for every A, B ∈ F (monotonicity) for any nonnegative measurable function f on (Θ, F), the Choquet integral of f on any A ∈ F is defined as: C_µ(f) := ∫_0^∞ µ(F_α ∩ A) dα, where F_α = {x ∈ Θ : f(x) ≥ α}, α ∈ [0, ∞) both the Choquet integral of monotone capacities and the natural extension of lower probabilities are generalisations of the Lebesgue integral Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 103 / 125
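For a finite domain the Choquet integral reduces to a sorted sum over upper level sets; the sketch below implements that definition directly (the capacity, built here from a belief-function-like mass assignment, and the integrand are arbitrary illustrations).

```python
def choquet(f, mu, domain):
    """Discrete Choquet integral of a nonnegative function f w.r.t. capacity mu.

    Sort the points by decreasing value and sum the layer thicknesses times the
    capacity of the corresponding upper level sets {x : f(x) >= alpha}.
    """
    pts = sorted(domain, key=lambda x: f[x], reverse=True)
    values = [f[x] for x in pts] + [0.0]
    total = 0.0
    for i in range(len(pts)):
        level_set = frozenset(pts[:i + 1])
        total += (values[i] - values[i + 1]) * mu(level_set)
    return total

# illustrative capacity on Θ = {a, b, c}: a belief-function-like set function
weights = {frozenset({'a'}): 0.2, frozenset({'b', 'c'}): 0.5,
           frozenset({'a', 'b', 'c'}): 0.3}

def mu(A):
    return sum(v for F, v in weights.items() if F <= A)

f = {'a': 3.0, 'b': 1.0, 'c': 2.0}
print(choquet(f, mu, ['a', 'b', 'c']))   # 1.4
```

Because the capacity used here is a belief function, the result (1.4) coincides with the lower expectation Σ_B m(B) min_{x∈B} f(x), as the previous section anticipated.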

104 Theories of uncertainty Monotone capacities Order of a capacity Special types of capacities Order of a capacity A capacity µ is said to be of order k if
µ(∪_{j=1}^k A_j) ≥ Σ_{∅ ≠ K ⊆ {1,...,k}} (-1)^{|K|+1} µ(∩_{j ∈ K} A_j)
for all collections of k subsets A_1,..., A_k of Θ if k′ > k, capacities of order k′ form a special case of capacities of order k, so the resulting theory is less general than a theory of capacities of order k Capacities and belief functions Belief functions are infinitely monotone capacities: just compare the definition of order with the third axiom of belief functions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 104 / 125

105 Theories of uncertainty Probability intervals Probability intervals Set of probability intervals A system of constraints on a probability distribution p : Θ → [0, 1] of the form: P(l, u) := { p : l(x) ≤ p(x) ≤ u(x), for all x ∈ Θ } probability intervals typically arise through measurement errors, or measurements inherently of interval nature a set of probability intervals also determines a credal set, a sub-class of all credal sets generated by lower and upper probabilities each belief function induces a set of probability intervals Belief functions and probability intervals The minimal probability interval containing a pair of belief/plausibility functions is the one whose lower bounds are the beliefs of the singletons and whose upper bounds are their plausibilities: l(x) = Bel({x}), u(x) = Pl({x}) for all x ∈ Θ Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 105 / 125
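A quick illustration of this statement (the mass assignment is invented): the interval bounds are read off the belief and plausibility of each singleton.

```python
m = {frozenset({'a'}): 0.5, frozenset({'a', 'b'}): 0.3, frozenset({'a', 'b', 'c'}): 0.2}

def bel(A): return sum(v for F, v in m.items() if F <= A)
def pl(A): return sum(v for F, v in m.items() if F & A)

intervals = {x: (bel(frozenset({x})), pl(frozenset({x}))) for x in ('a', 'b', 'c')}
print(intervals)   # {'a': (0.5, 1.0), 'b': (0.0, 0.5), 'c': (0.0, 0.2)}
```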

106 Theories of uncertainty Fuzzy and possibility theory Fuzzy sets and possibility theory Zadeh, Dubois and Prade the concept of a fuzzy set was introduced by Lotfi A. Zadeh [1965]: elements belong to a set with a certain degree of membership the theory was further developed by Didier Dubois and Henri Prade into a mathematical theory of partial belief, called possibility theory a possibility measure on Θ is a function Π : 2^Θ → [0, 1] such that Π(∅) = 0, Π(Θ) = 1 and Π(∪_i A_i) = sup_i Π(A_i) for every family of subsets {A_i ∈ 2^Θ} each possibility measure is uniquely characterised by a membership function π : Θ → [0, 1] s.t. π(x) := Π({x}) via the formula Π(A) = sup_{x ∈ A} π(x) the dual quantity N(A) = 1 - Π(A^c) is called a necessity measure Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 106 / 125

107 Theories of uncertainty Fuzzy and possibility theory Possibility and belief measures call plausibility assignment pl the restriction of the plausibility function to singletons, pl(x) = Pl({x}); then [Shafer]: Bel is a necessity measure iff Bel is consonant; in that case the membership function coincides with the plausibility assignment; a finite fuzzy set is thus equivalent to a consonant belief function Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 107 / 125
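The correspondence can be made concrete in a few lines (the membership grades are invented): the level cuts of a finite membership function are nested, and weighting each cut by the drop between consecutive levels yields a consonant mass function whose singleton plausibilities recover the membership function.

```python
pi = {'a': 1.0, 'b': 0.7, 'c': 0.4}   # illustrative membership / possibility values

# distinct levels in decreasing order; the cuts {x : pi(x) >= level} are nested
levels = sorted(set(pi.values()), reverse=True) + [0.0]
m = {}
for hi, lo in zip(levels, levels[1:]):
    cut = frozenset(x for x, v in pi.items() if v >= hi)
    m[cut] = hi - lo

def pl(A):
    return sum(v for F, v in m.items() if F & A)

print(m)                                                 # nested focal elements
print({x: round(pl(frozenset({x})), 3) for x in pi})     # recovers pi
```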

108 Theories of uncertainty Fuzzy and possibility theory Belief functions on fuzzy sets belief functions defined on fuzzy sets have also been proposed basic idea: belief measures are generalised to fuzzy sets as follows: Bel(X) = Σ_{A ∈ M} I(A ⊆ X) m(A) where X is a fuzzy set defined on Θ, m is a mass function defined on the collection M of fuzzy sets on Θ, and I(A ⊆ X) is a measure of how much fuzzy set A is included in fuzzy set X various measures of inclusion in [0, 1] can be proposed, based on fuzzy implications: Lukasiewicz: I(x, y) = min{1, 1 - x + y} [Ishizuka] Kleene-Dienes: I(x, y) = max{1 - x, y} [Yager] from which one can get: I(A ⊆ B) = inf_{x ∈ Θ} I(A(x), B(x)) [Wu 2009] Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 108 / 125

109 Theories of uncertainty Probability boxes Probability boxes and random sets a probability box or p-box [Ferson and Hajagos] (F_*, F^*) is a class of cumulative distribution functions (CDFs): {F CDF : F_* ≤ F ≤ F^*} every pair Bel, Pl defined on the real line R (a random set) generates a unique p-box: F_*(x) = Bel((-∞, x]), F^*(x) = Pl((-∞, x]) conversely, every p-box generates an entire equivalence class of random intervals, e.g. the one with focal elements Γ(α) = [ (F^*)^{-1}(α), (F_*)^{-1}(α) ], α ∈ [0, 1], where (F^*)^{-1}(α) := inf{x : F^*(x) ≥ α} and (F_*)^{-1}(α) := inf{x : F_*(x) ≥ α} are the quasi-inverses of the upper and lower CDFs Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 109 / 125
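A sketch of the first direction, for a finite random interval (focal intervals and masses invented): the lower CDF accumulates the mass of intervals lying entirely below x, the upper CDF the mass of intervals whose left endpoint is below x.

```python
# illustrative finite random interval: focal intervals [u, v] with masses
focal = [((0.0, 2.0), 0.3), ((1.0, 3.0), 0.5), ((2.5, 4.0), 0.2)]

def F_lower(x):
    """F_*(x) = Bel((-inf, x]): mass of intervals entirely to the left of x."""
    return sum(w for (u, v), w in focal if v <= x)

def F_upper(x):
    """F^*(x) = Pl((-inf, x]): mass of intervals that intersect (-inf, x]."""
    return sum(w for (u, v), w in focal if u <= x)

for x in (0.5, 2.0, 3.0, 4.0):
    print(x, F_lower(x), F_upper(x))
```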

110 Theories of uncertainty Rough sets Rough sets first described by Polish computer scientist Zdzislaw I. Pawlak [1991] strongly linked to the idea of a partition of the universe of hypotheses they provide a formal approximation of a traditional set in terms of a pair of lower and upper approximating sets let R ⊆ Θ × Θ be an equivalence relation which partitions Θ into a family of disjoint subsets Θ/R, called elementary sets measurable sets σ(Θ/R): the unions of one or more elementary sets, plus the empty set we can then approximate any subset A of Θ using those measurable sets X: apr(A) = ∪{ X ∈ σ(Θ/R) : X ⊆ A }, apr̄(A) = ∪{ X ∈ σ(Θ/R) : X ∩ A ≠ ∅ } Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 110 / 125

111 Theories of uncertainty Rough sets Rough sets and belief functions [figure: a universe partitioned into elementary sets, with an event A, its lower approximation and its upper approximation] any probability P on F = σ(Θ/R) can be extended to 2^Θ using inner measures: P_*(A) = sup{ P(X) : X ∈ σ(Θ/R), X ⊆ A } = P(apr(A)) these are belief functions! (as was recognised before) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 111 / 125
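A compact sketch (universe, partition, probabilities and the event are invented): the lower and upper approximations are unions of elementary sets, and the probability of the lower approximation behaves as a belief value, its dual as a plausibility.

```python
# universe partitioned into elementary sets by an equivalence relation R
elementary = [frozenset({1, 2}), frozenset({3, 4, 5}), frozenset({6})]
prob = {frozenset({1, 2}): 0.3, frozenset({3, 4, 5}): 0.5, frozenset({6}): 0.2}

A = {2, 3, 4, 5}   # an arbitrary event, not measurable w.r.t. the partition

lower = frozenset().union(*(E for E in elementary if E <= A))   # lower approximation
upper = frozenset().union(*(E for E in elementary if E & A))    # upper approximation

bel_A = sum(prob[E] for E in elementary if E <= A)   # P(lower approx) = inner measure
pl_A = sum(prob[E] for E in elementary if E & A)     # P(upper approx) = outer measure

print(sorted(lower), sorted(upper), bel_A, pl_A)
```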

112 Belief functions on reals Outline 1 Uncertainty Second-order uncertainty Classical probability 2 Beyond probability Set-valued observations Propositional evidence Scarce data Representing ignorance Rare events Uncertain data 3 Belief theory A theory of evidence Belief functions Semantics Dempster s rule Multivariate analysis Misunderstandings 4 Reasoning with belief functions Statistical inference Combination Conditioning Belief vs Bayesian reasoning Generalised Bayes Theorem The total belief theorem Decision making 5 Theories of uncertainty Imprecise probability Monotone capacities Probability intervals Fuzzy and possibility theory Probability boxes Rough sets 6 Belief functions on reals Continuous belief functions Random sets 7 Conclusions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/ / 125

113 Belief functions on reals Continuous formulations of the theory of belief functions in the original formulation by Shafer [1976], belief functions are defined on finite sets only the need for generalising this to arbitrary domains was soon recognised main approaches to continuous formulation: Shafer's allocations of probability [1982] continuous belief functions on Borel intervals of the real line [Strat90, Smets] belief functions as random sets [Nguyen78, Molchanov06] other approaches, with limited (so far) impact: generalised evidence theory, MV algebras, several others Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 113 / 125

114 Belief functions on reals Continuous belief functions Continuous belief functions [Strat, Smets] take as frame of discernment Θ the set of possible closed intervals [x, y] contained in a domain [0, N] [figure: intervals represented as points (left extremum x, right extremum y); the shaded regions show the intervals contributing to Bel([a,b]) and to Pl([a,b])]
Bel([a, b]) = ∫_a^b ∫_x^b m(x, y) dy dx,   Pl([a, b]) = ∫_0^b ∫_{max(a,x)}^N m(x, y) dy dx
Dempster's rule generalises in terms of double integrals continuous pignistic PDF: Bet(a) := lim_{ε→0} ∫_0^a ∫_{a+ε}^N [m(x, y) / (y - x)] dy dx Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 114 / 125
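A rough numerical check of these integrals, assuming a uniform interval mass density m(x, y) = 2 on the triangle 0 ≤ x ≤ y ≤ 1 (density, interval [a, b] and grid resolution are all illustrative); for this density one expects Bel([a,b]) = (b-a)² and Pl([a,b]) = 2b - a² - b².

```python
N = 400                     # grid resolution; coarser is faster, finer is more accurate
h = 1.0 / N

def m(x, y):
    """Illustrative interval mass density: uniform on the triangle 0 <= x <= y <= 1."""
    return 2.0 if x <= y else 0.0

def bel(a, b):
    # Bel([a,b]) = integral over a <= x <= y <= b of m(x, y)
    return sum(m(i * h, j * h) * h * h
               for i in range(N) for j in range(N)
               if a <= i * h and j * h <= b)

def pl(a, b):
    # Pl([a,b]) = integral over x <= b, y >= max(a, x) of m(x, y)
    return sum(m(i * h, j * h) * h * h
               for i in range(N) for j in range(N)
               if i * h <= b and j * h >= max(a, i * h))

print(round(bel(0.2, 0.6), 2), round(pl(0.2, 0.6), 2))   # approximately 0.16 and 0.8
```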

115 Belief functions on reals Continuous belief functions Special cases of random closed intervals Fuzzy sets and p-boxes [figure: left, the consonant random interval Γ(·) induced by a fuzzy membership function; right, the random interval induced by a p-box (F_*, F^*); in both cases the focal elements are intervals [U(·), V(·)]] a fuzzy set on the real line induces a mapping to a collection of nested intervals, parameterised by the level c a p-box, i.e. upper and lower bounds to a cumulative distribution function, also induces a family of intervals (as we already saw) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 115 / 125

116 Belief functions on reals Random sets Belief functions as random sets [Nguyen, 1978], [Hestir, 1991], [Shafer, 1987] given a multi-valued mapping Γ, a straightforward step is to consider the probability value P(ω) as attached to the subset Γ(ω) ⊆ Θ: this is a random set in Θ, i.e., a probability measure on a collection of subsets the degree of belief Bel(A) of an event A becomes the cumulative distribution function (CDF) of the open interval of sets {B ⊆ A} in 2^Θ the lower inverse and upper inverse of Γ are: Γ_*(A) := { ω ∈ Ω : Γ(ω) ⊆ A, Γ(ω) ≠ ∅ }, Γ^*(A) := { ω ∈ Ω : Γ(ω) ∩ A ≠ ∅ } given two σ-fields A, B on Ω, Θ respectively, Γ is said to be strongly measurable iff, for all B ∈ B, Γ^*(B) ∈ A the lower probability measure on B is defined as P_*(B) := P(Γ_*(B)) for all B ∈ B: this is nothing but a belief function! Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 116 / 125

117 Belief functions on reals Random sets Belief functions as random sets Molchanov's work recently, strong renewed interest in a theory of random sets, thanks to Molchanov [2006, 2017] and others: a theory of calculus with capacities and random sets Radon-Nikodym theorems for capacities and random sets, and derivatives of capacities (conditional) expectations of random sets limit theorems: strong law of large numbers, central limit theorem, Gaussian RSs examined set-valued random processes a powerful mathematical framework! the way forward for the theory, in my view no mention of conditioning and combination yet connections with mathematical statistics to develop special case of a random element [Frechet], a random variable with structured output Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 117 / 125

118 Belief functions on reals Random sets Random closed sets the family of all sets is too large; we typically restrict ourselves to random elements in the space of closed subsets of a certain topological space E the family of closed subsets of E is denoted by C, and K denotes the family of all compact subsets of E let (Ω, F, P) be a probability space a map X : Ω → C is called a random closed set if, for every compact set K in E: {ω : X(ω) ∩ K ≠ ∅} ∈ F this is equivalent to strong measurability, whenever the σ-field on Θ is replaced by the family K of compact subsets of Θ the consequence is that the upper probability of K exists for all K ∈ K Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 118 / 125

119 Belief functions on reals Random sets Random closed sets Some examples if ξ is a random variable, then X = (-∞, ξ] is a random closed set if ξ_1, ξ_2 and ξ_3 are three random vectors in R^d, then the triangle with vertices ξ_1, ξ_2 and ξ_3 is a random closed set Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 119 / 125

120 Belief functions on reals Random sets Capacity functionals Random closed sets a functional T_X : K → [0, 1] given by T_X(K) = P({X ∩ K ≠ ∅}), K ∈ K, is said to be the capacity functional of X in particular, if X = {ξ} is a classical random variable, then T_X(K) = P({ξ ∈ K}) is the probability distribution of the random variable ξ the name capacity functional follows from the fact that T_X is a functional on K which takes values in [0, 1], equals 0 on the empty set, is monotone and upper semicontinuous (i.e., T_X is a capacity, and also completely alternating on K) T_X(K) is the plausibility measure induced by the multivalued mapping X, restricted to compact subsets the links between random closed sets and belief/plausibility functions, upper and lower probabilities, and contaminated models in statistics are very briefly hinted at in [Molchanov 2005], Chapter 1, Section 9 Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 120 / 125
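For the earlier example X = (-∞, ξ], the capacity functional of a compact interval K = [s, t] is T_X(K) = P(ξ ≥ s); the Monte-Carlo sketch below (standard normal ξ, sample size and interval chosen arbitrarily) checks the empirical hitting frequency against the exact value.

```python
import math
import random

random.seed(0)
n = 100_000
xi = [random.gauss(0.0, 1.0) for _ in range(n)]    # ξ ~ N(0, 1)

s, t = 0.5, 2.0                                    # compact set K = [s, t]
# X(ω) = (-inf, ξ(ω)] hits K iff ξ(ω) >= s
hits = sum(1 for x in xi if x >= s) / n

# exact value: T_X(K) = P(ξ >= s) = 1 - Φ(s)
exact = 0.5 * math.erfc(s / math.sqrt(2))
print(round(hits, 3), round(exact, 3))
```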

121 Conclusions Outline 1 Uncertainty Second-order uncertainty Classical probability 2 Beyond probability Set-valued observations Propositional evidence Scarce data Representing ignorance Rare events Uncertain data 3 Belief theory A theory of evidence Belief functions Semantics Dempster s rule Multivariate analysis Misunderstandings 4 Reasoning with belief functions Statistical inference Combination Conditioning Belief vs Bayesian reasoning Generalised Bayes Theorem The total belief theorem Decision making 5 Theories of uncertainty Imprecise probability Monotone capacities Probability intervals Fuzzy and possibility theory Probability boxes Rough sets 6 Belief functions on reals Continuous belief functions Random sets 7 Conclusions Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/ / 125

122 Conclusions A summary the theory of belief functions is grounded in the beautiful mathematics of random sets has strong relationships with other theories of uncertainty (can be efficiently implemented by Monte-Carlo approximation) statistical evidence may be represented in several ways: by likelihood-based belief functions, generalizing both likelihood-based and Bayesian inference; by Dempster's idea of using auxiliary variables; in the framework of the Generalised Bayes Theorem (propagation on graphical models can be performed) decision making strategies based on intervals of expected utilities can be formulated that are more cautious than traditional ones the extension to continuous domains can be tackled via the Borel interval representation, in the more general case using the theory of random sets (a toolbox of estimation, classification, regression tools based on the theory of belief functions is available) Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 122 / 125

123 Conclusions What still needs to be resolved clarify once and for all the epistemic interpretation of belief function theory: random variables for set-valued observations the mechanism for evidence combination is still debated, as it depends on meta-information on the sources which is hardly accessible working with intervals of belief functions may be the way forward: it acknowledges the meta-uncertainty on the nature of the sources generating the evidence the same holds for conditioning (as we showed) what about computational complexity? not an issue, just apply sampling for approximate inference we do not need to assign mass to all subsets, but we need to be allowed to do so when observations are indeed sets belief functions on reals: Borel intervals are nice, but the way forward is grounding the theory in the mathematics of random sets Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 123 / 125

124 Conclusions Future of random set/belief function theory a fully developed theory of statistical inference with random sets: generalised likelihood, logistic regression limit theorems, total probability for random sets random set random variables and processes frequentist inference with random sets propose solutions to high-impact problems: rare event prediction robust foundations for machine learning robust climate change predictions further development of machine learning tools: random set random forests generalised max entropy classification robust statistical learning theory Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 124 / 125

125 Appendix For Further Reading For Further Reading I G. Shafer. A mathematical theory of evidence. Princeton University Press, 1976. I. Molchanov. Theory of Random Sets. Springer, 2005. F. Cuzzolin. Visions of a generalized probability theory. Lambert Academic Publishing. F. Cuzzolin. The geometry of uncertainty: The geometry of imprecise probabilities. Springer-Verlag (in press). Professor Fabio Cuzzolin Belief functions: A gentle introduction Seoul, Korea, 30/05/18 125 / 125


More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 143 Part IV

More information

A NEW CLASS OF FUSION RULES BASED ON T-CONORM AND T-NORM FUZZY OPERATORS

A NEW CLASS OF FUSION RULES BASED ON T-CONORM AND T-NORM FUZZY OPERATORS A NEW CLASS OF FUSION RULES BASED ON T-CONORM AND T-NORM FUZZY OPERATORS Albena TCHAMOVA, Jean DEZERT and Florentin SMARANDACHE Abstract: In this paper a particular combination rule based on specified

More information

CS 540: Machine Learning Lecture 2: Review of Probability & Statistics

CS 540: Machine Learning Lecture 2: Review of Probability & Statistics CS 540: Machine Learning Lecture 2: Review of Probability & Statistics AD January 2008 AD () January 2008 1 / 35 Outline Probability theory (PRML, Section 1.2) Statistics (PRML, Sections 2.1-2.4) AD ()

More information

A generic framework for resolving the conict in the combination of belief structures E. Lefevre PSI, Universite/INSA de Rouen Place Emile Blondel, BP

A generic framework for resolving the conict in the combination of belief structures E. Lefevre PSI, Universite/INSA de Rouen Place Emile Blondel, BP A generic framework for resolving the conict in the combination of belief structures E. Lefevre PSI, Universite/INSA de Rouen Place Emile Blondel, BP 08 76131 Mont-Saint-Aignan Cedex, France Eric.Lefevre@insa-rouen.fr

More information

Naive Bayes classification

Naive Bayes classification Naive Bayes classification Christos Dimitrakakis December 4, 2015 1 Introduction One of the most important methods in machine learning and statistics is that of Bayesian inference. This is the most fundamental

More information

Probability. Lecture Notes. Adolfo J. Rumbos

Probability. Lecture Notes. Adolfo J. Rumbos Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

2. A Basic Statistical Toolbox

2. A Basic Statistical Toolbox . A Basic Statistical Toolbo Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data. Wikipedia definition Mathematical statistics: concerned

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

Review: Probability. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler

Review: Probability. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler Review: Probability BM1: Advanced Natural Language Processing University of Potsdam Tatjana Scheffler tatjana.scheffler@uni-potsdam.de October 21, 2016 Today probability random variables Bayes rule expectation

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

E. Santovetti lesson 4 Maximum likelihood Interval estimation

E. Santovetti lesson 4 Maximum likelihood Interval estimation E. Santovetti lesson 4 Maximum likelihood Interval estimation 1 Extended Maximum Likelihood Sometimes the number of total events measurements of the experiment n is not fixed, but, for example, is a Poisson

More information

Introductory Econometrics. Review of statistics (Part II: Inference)

Introductory Econometrics. Review of statistics (Part II: Inference) Introductory Econometrics Review of statistics (Part II: Inference) Jun Ma School of Economics Renmin University of China October 1, 2018 1/16 Null and alternative hypotheses Usually, we have two competing

More information

Introduction to Bayesian Inference

Introduction to Bayesian Inference Introduction to Bayesian Inference p. 1/2 Introduction to Bayesian Inference September 15th, 2010 Reading: Hoff Chapter 1-2 Introduction to Bayesian Inference p. 2/2 Probability: Measurement of Uncertainty

More information

Deep Learning for Computer Vision

Deep Learning for Computer Vision Deep Learning for Computer Vision Lecture 3: Probability, Bayes Theorem, and Bayes Classification Peter Belhumeur Computer Science Columbia University Probability Should you play this game? Game: A fair

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

More information

Where are we? Knowledge Engineering Semester 2, Reasoning under Uncertainty. Probabilistic Reasoning

Where are we? Knowledge Engineering Semester 2, Reasoning under Uncertainty. Probabilistic Reasoning Knowledge Engineering Semester 2, 2004-05 Michael Rovatsos mrovatso@inf.ed.ac.uk Lecture 8 Dealing with Uncertainty 8th ebruary 2005 Where are we? Last time... Model-based reasoning oday... pproaches to

More information

Outline. On Premise Evaluation On Conclusion Entailment. 1 Imperfection : Why and What. 2 Imperfection : How. 3 Conclusions

Outline. On Premise Evaluation On Conclusion Entailment. 1 Imperfection : Why and What. 2 Imperfection : How. 3 Conclusions Outline 1 Imperfection : Why and What 2 Imperfection : How On Premise Evaluation On Conclusion Entailment 3 Conclusions Outline 1 Imperfection : Why and What 2 Imperfection : How On Premise Evaluation

More information

Analyzing the Combination of Conflicting Belief Functions.

Analyzing the Combination of Conflicting Belief Functions. Analyzing the Combination of Conflicting Belief Functions. Philippe Smets IRIDIA Université Libre de Bruxelles 50 av. Roosevelt, CP 194-6, 1050 Bruxelles, Belgium psmets@ulb.ac.be http://iridia.ulb.ac.be/

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

Review: Statistical Model

Review: Statistical Model Review: Statistical Model { f θ :θ Ω} A statistical model for some data is a set of distributions, one of which corresponds to the true unknown distribution that produced the data. The statistical model

More information

Combining Belief Functions Issued from Dependent Sources

Combining Belief Functions Issued from Dependent Sources Combining Belief Functions Issued from Dependent Sources MARCO E.G.V. CATTANEO ETH Zürich, Switzerland Abstract Dempster s rule for combining two belief functions assumes the independence of the sources

More information

Bayesian data analysis using JASP

Bayesian data analysis using JASP Bayesian data analysis using JASP Dani Navarro compcogscisydney.com/jasp-tute.html Part 1: Theory Philosophy of probability Introducing Bayes rule Bayesian reasoning A simple example Bayesian hypothesis

More information

MATH MW Elementary Probability Course Notes Part I: Models and Counting

MATH MW Elementary Probability Course Notes Part I: Models and Counting MATH 2030 3.00MW Elementary Probability Course Notes Part I: Models and Counting Tom Salisbury salt@yorku.ca York University Winter 2010 Introduction [Jan 5] Probability: the mathematics used for Statistics

More information