Predictive inference under exchangeability


Predictive inference under exchangeability and the Imprecise Dirichlet Multinomial Model

Gert de Cooman, Jasper De Bock, Márcio Diniz
Ghent University, SYSTeMS
gert.decooman@ugent.be
http://users.ugent.be/gdcooma
gertekoo.wordpress.com

EBEB 2014, 11 March 2014

It's probability theory, Jim, but not as we know it

LOWER PREVISIONS

Lower and upper previsions

[Figure: a credal set $\mathcal{M}$ in the simplex $\Sigma_\mathcal{X}$ over $\mathcal{X} = \{a,b,c\}$, with lower probability $\underline{P}(\mathbb{I}_{\{c\}}) = 1/4$ and upper probability $\overline{P}(\mathbb{I}_{\{c\}}) = 4/7$.]

Equivalent model. Consider the set $\mathcal{L}(\mathcal{X}) = \mathbb{R}^\mathcal{X}$ of all real-valued maps on $\mathcal{X}$. We define two real functionals on $\mathcal{L}(\mathcal{X})$: for all $f\colon \mathcal{X}\to\mathbb{R}$,

$$\underline{P}_\mathcal{M}(f) = \min\{P_p(f)\colon p\in\mathcal{M}\} \quad\text{(lower prevision/expectation)}$$
$$\overline{P}_\mathcal{M}(f) = \max\{P_p(f)\colon p\in\mathcal{M}\} \quad\text{(upper prevision/expectation)}.$$

Observe that $\overline{P}_\mathcal{M}(-f) = -\underline{P}_\mathcal{M}(f)$.
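To make the envelope definition concrete, here is a minimal Python sketch (mine, not from the talk): lower and upper previsions computed as envelopes over the extreme points of a finite credal set. The two mass functions below are invented so that the bounds for the indicator of $\{c\}$ reproduce the 1/4 and 4/7 of the slide's figure.

```python
# Minimal sketch (not from the talk): lower/upper previsions as envelopes
# of a credal set M, represented here by its finitely many extreme points.
# Since expectation is linear in p, the min/max over M is attained at
# extreme points, so scanning them suffices.
import numpy as np

def lower_prevision(M, f):
    return min(p @ f for p in M)    # min of expectations over M

def upper_prevision(M, f):
    return max(p @ f for p in M)    # conjugacy: upper(f) == -lower(-f)

# Invented extreme points on X = {a, b, c}, chosen to reproduce the
# slide's bounds for the event {c}:
M = [np.array([0.50, 0.25, 0.25]),
     np.array([1.0, 2.0, 4.0]) / 7.0]
I_c = np.array([0.0, 0.0, 1.0])     # indicator gamble of {c}
print(lower_prevision(M, I_c))      # 0.25     = 1/4
print(upper_prevision(M, I_c))      # 0.571... = 4/7
```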

Basic properties of lower previsions

Definition. We call a real functional $\underline{P}$ on $\mathcal{L}(\mathcal{X})$ a lower prevision if it satisfies the following properties: for all $f$ and $g$ in $\mathcal{L}(\mathcal{X})$ and all real $\lambda \geq 0$:
1. $\underline{P}(f) \geq \min f$ [boundedness];
2. $\underline{P}(f+g) \geq \underline{P}(f) + \underline{P}(g)$ [super-additivity];
3. $\underline{P}(\lambda f) = \lambda\,\underline{P}(f)$ [non-negative homogeneity].

Theorem. A real functional $\underline{P}$ is a lower prevision if and only if it is the lower envelope of some credal set $\mathcal{M}$.

Conditioning and lower previsions

Suppose we have two variables $X_1$ in $\mathcal{X}_1$ and $X_2$ in $\mathcal{X}_2$. Consider for instance:
- a joint lower prevision $\underline{P}_{1,2}$ for $(X_1,X_2)$ defined on $\mathcal{L}(\mathcal{X}_1\times\mathcal{X}_2)$;
- a conditional lower prevision $\underline{P}_2(\cdot\,|\,x_1)$ for $X_2$ conditional on $X_1 = x_1$, defined on $\mathcal{L}(\mathcal{X}_2)$, for all values $x_1 \in \mathcal{X}_1$.

Coherence. These lower previsions $\underline{P}_{1,2}$ and $\underline{P}_2(\cdot\,|\,X_1)$ must satisfy certain (joint) coherence criteria: compare with Bayes's Rule and de Finetti's coherence criteria for precise previsions.

Complication. A joint lower prevision $\underline{P}_{1,2}$ does not always uniquely determine a conditional $\underline{P}_2(\cdot\,|\,X_1)$; we can only impose coherence between them.
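One standard way to sidestep this complication in practice is regular extension: condition every element of the credal set that assigns positive probability to the observation, and take the lower envelope. The sketch below (mine, with an invented toy credal set of joint mass functions) illustrates the rule; as the slide notes, coherence alone does not single out this choice.

```python
# Sketch of regular extension (one coherent updating rule, not the only one):
# condition each joint pmf p(x1, x2) in a finite credal set M that satisfies
# p(x1) > 0 by Bayes' rule, then take the lower envelope.
import numpy as np

def conditional_lower_prevision(M, g, x1):
    """Lower prevision of a gamble g(X2), conditional on X1 = x1."""
    values = []
    for p in M:                            # p is a 2-D array: p[x1, x2]
        margin = p[x1, :].sum()
        if margin > 0.0:                   # Bayes' rule applies element-wise
            values.append(p[x1, :] @ g / margin)
    return min(values) if values else None # undefined if p(x1) = 0 for all p

# Invented credal set of two joint pmfs on {0,1} x {0,1}:
M = [np.array([[0.30, 0.20], [0.10, 0.40]]),
     np.array([[0.25, 0.25], [0.25, 0.25]])]
g = np.array([1.0, 0.0])                   # indicator of {X2 = 0}
print(conditional_lower_prevision(M, g, x1=0))  # min(0.6, 0.5) = 0.5
```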

Many variables: notation

Suppose we have variables $X_i$ assuming values in $\mathcal{X}_i$, for $i \in N$. For $S \subseteq N$ we denote the $S$-tuple of variables $X_s$, $s \in S$, by $X_S$, taking values in $\mathcal{X}_S := \prod_{s\in S}\mathcal{X}_s$, with generic values $X_S = x_S$.

Now consider an input-output pair $I, O \subseteq N$. Conditional lower previsions:

$$\underline{P}_O(f(X_O)\,|\,x_I) = \text{lower prevision of } f(X_O) \text{ conditional on } X_I = x_I.$$

Coherence criterion: Walley (1991)

The conditional lower previsions $\underline{P}_{O_s}(\cdot\,|\,X_{I_s})$, $s = 1,\dots,n$, are coherent if and only if: for all $f_j \in \mathcal{L}(\mathcal{X}_{O_j\cup I_j})$, all $k \in \{1,\dots,n\}$, all $x_{I_k} \in \mathcal{X}_{I_k}$ and all $g \in \mathcal{L}(\mathcal{X}_{O_k\cup I_k})$, there is some $z_N \in \{x_{I_k}\} \cup \bigcup_{j=1}^n \mathrm{supp}_{I_j}(f_j)$ such that

$$\Bigl[\sum_{s=1}^n [f_s - \underline{P}_{O_s}(f_s\,|\,X_{I_s})] - [g - \underline{P}_{O_k}(g\,|\,x_{I_k})]\Bigr](z_N) \geq 0,$$

where $\mathrm{supp}_I(f) := \{x_I \in \mathcal{X}_I\colon \mathbb{I}_{\{x_I\}} f \neq 0\}$.

This is quite complicated and cumbersome!

A few papers that try to brave the complications

@ARTICLE{miranda2007,
  author  = {Miranda, Enrique and De Cooman, Gert},
  title   = {Marginal extension in the theory of coherent lower previsions},
  journal = {International Journal of Approximate Reasoning},
  year    = 2007,
  volume  = 46,
  pages   = {188--225}
}

A few papers that try to brave the complications

@ARTICLE{cooman2008,
  author  = {De Cooman, Gert and Miranda, Enrique},
  title   = {Weak and strong laws of large numbers for coherent lower previsions},
  journal = {Journal of Statistical Planning and Inference},
  year    = 2008,
  volume  = 138,
  pages   = {2409--2432}
}

SETS OF DESIRABLE GAMBLES

Why work with sets of desirable gambles?

Working with sets of desirable gambles $D$:
- is simpler, more intuitive and more elegant
- is more general and expressive than (conditional) lower previsions
- gives a geometrical flavour to probabilistic inference
- shows that probabilistic inference and Bayes's Rule are forms of logical inference
- includes precise probability as one special case
- includes classical propositional logic as another special case
- avoids problems with conditioning on sets of probability zero

First steps: Williams (1977)

@ARTICLE{williams2007,
  author  = {Williams, Peter M.},
  title   = {Notes on conditional previsions},
  journal = {International Journal of Approximate Reasoning},
  year    = 2007,
  volume  = 44,
  pages   = {366--383}
}

First steps: Walley (2000)

@ARTICLE{walley2000,
  author  = {Walley, Peter},
  title   = {Towards a unified theory of imprecise probability},
  journal = {International Journal of Approximate Reasoning},
  year    = 2000,
  volume  = 24,
  pages   = {125--148}
}

Set of desirable gambles as a belief model

Gambles: a gamble $f\colon \mathcal{X}\to\mathbb{R}$ is an uncertain reward whose value is $f(X)$.

[Figure: a gamble on $\mathcal{X} = \{H, T\}$ represented as the point $(f(H), f(T))$ in the plane, with both axes running from $-1$ to $1$.]

Set of desirable gambles: $D \subseteq \mathcal{L}(\mathcal{X})$ is a set of gambles that a subject strictly prefers to zero.

Coherence for a set of desirable gambles

A set of desirable gambles $D$ is called coherent if:
D1. if $f \leq 0$ then $f \notin D$ [not desiring non-positivity]
D2. if $f > 0$ then $f \in D$ [desiring partial gains]
D3. if $f, g \in D$ then $f + g \in D$ [addition]
D4. if $f \in D$ then $\lambda f \in D$ for all real $\lambda > 0$ [scaling]

[Figure: a coherent $D$ drawn as a convex cone in the $(f(H), f(T))$ plane.]

Precise models correspond to the special case where the convex cones $D$ are actually halfspaces!

Connection with lower previsions

$$\underline{P}(f) = \sup\{\alpha \in \mathbb{R}\colon f - \alpha \in D\} = \text{supremum buying price for } f.$$

[Figure: the gamble $f$ and the shifted gamble $f - \underline{P}(f)$, which lies on the boundary of the cone $D$, in the $(H, T)$ plane.]

INFERENCE

Inference: natural extension

$$\mathcal{E}_A := \mathrm{posi}\bigl(A \cup \mathcal{L}(\mathcal{X})_{>0}\bigr)$$

[Figure: the natural extension $\mathcal{E}_A$ of an assessment $A$, the smallest coherent cone containing $A$ and all gambles $> 0$.]

$$\mathrm{posi}(K) := \Bigl\{\sum_{k=1}^n \lambda_k f_k\colon f_k \in K,\ \lambda_k > 0,\ n > 0\Bigr\}$$
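For finite $\mathcal{X}$ and a finite assessment $A$, natural extension is a linear program: combining the slide's formula with the earlier connection $\underline{P}(f) = \sup\{\alpha\colon f - \alpha \in D\}$ gives $\underline{P}(f) = \sup\{\alpha\colon f - \alpha \geq \sum_k \lambda_k a_k,\ \lambda_k \geq 0\}$, boundary subtleties aside. A hedged sketch with scipy (function name and example are mine):

```python
# Sketch: the lower prevision induced by the natural extension E_A of a
# finite assessment A on a finite space X, computed by linear programming.
# We maximise alpha subject to f - alpha dominating a non-negative
# combination of the assessed gambles (this computes the supremum; whether
# it is attained involves boundary subtleties we ignore here).
import numpy as np
from scipy.optimize import linprog

def natural_extension_lower_prevision(A, f):
    """A: list of assessed-desirable gambles (vectors over X); f: a gamble."""
    A = np.atleast_2d(np.asarray(A, dtype=float))  # shape (m, |X|)
    m, d = A.shape
    c = np.zeros(m + 1)
    c[0] = -1.0                                    # maximise alpha
    # constraints: alpha + sum_k lambda_k * a_k(x) <= f(x) for every x
    A_ub = np.hstack([np.ones((d, 1)), A.T])
    res = linprog(c, A_ub=A_ub, b_ub=np.asarray(f, dtype=float),
                  bounds=[(None, None)] + [(0, None)] * m)
    if not res.success:       # an unbounded LP signals an incoherent assessment
        return np.inf
    return -res.fun

# Example on X = {H, T}: judging the even-money bet (1, -1) on heads
# desirable yields a lower probability of 1/2 for heads:
print(natural_extension_lower_prevision([[1.0, -1.0]], [1.0, 0.0]))  # 0.5
```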

Inference: marginalisation and conditioning

Let $D_N$ be a coherent set of desirable gambles on $\mathcal{X}_N$. For any subset $I \subseteq N$, we have the $X_I$-marginals:

$$D_I = \mathrm{marg}_I(D_N) := D_N \cap \mathcal{L}(\mathcal{X}_I), \quad\text{so}\quad f(X_I) \in D_I \Leftrightarrow f(X_I) \in D_N.$$

How to condition a coherent set $D_N$ on the observation that $X_I = x_I$? The updated set of desirable gambles $D_N\rfloor x_I \subseteq \mathcal{L}(\mathcal{X}_{N\setminus I})$ on $\mathcal{X}_{N\setminus I}$ is given by:

$$g \in D_N\rfloor x_I \Leftrightarrow \mathbb{I}_{\{x_I\}}\, g \in D_N.$$

This works for all conditioning events: no problem with conditioning on sets of probability zero!

Conditional lower previsions

Just like in the unconditional case, we can use a coherent set of desirable gambles $D_N$ to derive conditional lower previsions. Consider disjoint subsets $I$ and $O$ of $N$:

$$\underline{P}_O(g\,|\,x_I) := \sup\{\mu \in \mathbb{R}\colon \mathbb{I}_{\{x_I\}}[g - \mu] \in D_N\} = \sup\{\mu \in \mathbb{R}\colon g - \mu \in D_N\rfloor x_I\} \quad\text{for all } g \in \mathcal{L}(\mathcal{X}_O)$$

is the lower prevision of $g$, conditional on $X_I = x_I$. $\underline{P}_O(g\,|\,X_I)$ is the gamble on $\mathcal{X}_I$ that assumes the value $\underline{P}_O(g\,|\,x_I)$ in $x_I \in \mathcal{X}_I$.

Coherent conditional lower previsions

Consider $m$ couples of disjoint subsets $I_s$ and $O_s$ of $N$, and corresponding conditional lower previsions $\underline{P}_{O_s}(\cdot\,|\,X_{I_s})$ for $s = 1,\dots,m$.

Theorem (Williams, 1977). These conditional lower previsions are (jointly) coherent if and only if there is some coherent set of desirable gambles $D_N$ that produces them, in the sense that for all $s = 1,\dots,m$:

$$\underline{P}_{O_s}(g\,|\,x_{I_s}) := \sup\{\mu \in \mathbb{R}\colon \mathbb{I}_{\{x_{I_s}\}}[g - \mu] \in D_N\} \quad\text{for all } g \in \mathcal{L}(\mathcal{X}_{O_s}) \text{ and all } x_{I_s} \in \mathcal{X}_{I_s}.$$

A few papers that avoid the complications

@ARTICLE{cooman2008ai,
  author  = {De Cooman, Gert and Hermans, Filip},
  title   = {Imprecise probability trees: Bridging two theories of imprecise probability},
  journal = {Artificial Intelligence},
  year    = 2008,
  volume  = 172,
  pages   = {1400--1427}
}

A few papers that avoid the complications

@ARTICLE{cooman2012jair,
  author  = {De Cooman, Gert and Miranda, Enrique},
  title   = {Irrelevant and independent natural extension for sets of desirable gambles},
  journal = {Journal of Artificial Intelligence Research},
  year    = 2012,
  volume  = 45,
  pages   = {601--640}
}

PERMUTATION SYMMETRY

Symmetry group

Consider a variable $X$ assuming values in a finite set $\mathcal{X}$, and a finite group $P$ of permutations $\pi$ of $\mathcal{X}$.

Modelling that there is a symmetry $P$ behind $X$: you believe that inferences about $X$ will be invariant under any permutation $\pi \in P$.

Consider any permutation $\pi \in P$ and any gamble $f$ on $\mathcal{X}$: $\pi^t f := f \circ \pi$, meaning that $(\pi^t f)(x) = f(\pi(x))$ for all $x \in \mathcal{X}$. Then $\pi^t$ is a linear transformation of the vector space $\mathcal{L}(\mathcal{X})$, and $P^t := \{\pi^t\colon \pi \in P\}$ is a finite group of linear transformations of $\mathcal{L}(\mathcal{X})$.

How to model permutation symmetry?

Permutation symmetry: you are indifferent between any gamble $f$ on $\mathcal{X}$ and its permutation $\pi^t f$:

$$f \approx \pi^t f \Leftrightarrow f - \pi^t f \approx 0.$$

This leads to a linear space $I_P$ of indifferent gambles:

$$I_P := \mathrm{span}\bigl(\{f - \pi^t f\colon f \in \mathcal{L}(\mathcal{X}) \text{ and } \pi \in P\}\bigr).$$

A set of desirable gambles $D$ is strongly $P$-invariant if $D + I_P \subseteq D$, or actually $D + I_P = D$.

Let's see what this means. Consider the linear subspace of all permutation invariant gambles:

$$\mathcal{L}_P(\mathcal{X}) := \{f \in \mathcal{L}(\mathcal{X})\colon (\forall \pi \in P)\, f = \pi^t f\}.$$

Because $P^t$ is a finite group, this linear space can be seen as the range of the following special linear transformation:

$$\mathrm{proj}_P\colon \mathcal{L}(\mathcal{X}) \to \mathcal{L}(\mathcal{X}) \quad\text{with}\quad \mathrm{proj}_P(f) := \frac{1}{|P|}\sum_{\pi\in P} \pi^t f.$$

Theorem.
1. $\pi^t \circ \mathrm{proj}_P = \mathrm{proj}_P = \mathrm{proj}_P \circ \pi^t$
2. $\mathrm{proj}_P \circ \mathrm{proj}_P = \mathrm{proj}_P$
3. $\mathrm{rng}(\mathrm{proj}_P) = \mathcal{L}_P(\mathcal{X})$
4. $\ker(\mathrm{proj}_P) = I_P$
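The theorem is easy to check numerically. Below is a small, self-contained illustration (mine, not from the talk) of $\mathrm{proj}_P$ for the full symmetric group on a three-element $\mathcal{X}$: the projection of any gamble is constant (here $\mathcal{L}_P(\mathcal{X})$ is exactly the constant gambles), and projecting twice changes nothing.

```python
# Numerical check (illustration only) of proj_P(f) = (1/|P|) * sum_pi pi^t f
# for P = all permutations of X = {0, 1, 2}.
import itertools
import numpy as np

def proj(f, perms):
    # (pi^t f)(x) = f(pi(x)): index f by pi, then average over the group
    return np.mean([f[list(pi)] for pi in perms], axis=0)

perms = list(itertools.permutations(range(3)))
f = np.array([3.0, 1.0, 5.0])
pf = proj(f, perms)
print(pf)                                # [3. 3. 3.]: constant, i.e. invariant
print(np.allclose(proj(pf, perms), pf))  # True: proj o proj = proj
```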

Symmetry representation theorem

Consider any gamble $f$ on $\mathcal{X}$:

$$f = \underbrace{f - \mathrm{proj}_P(f)}_{\in\, I_P} + \underbrace{\mathrm{proj}_P(f)}_{\in\, \mathcal{L}_P(\mathcal{X})}$$

Theorem (Symmetry Representation Theorem). Let $D$ be strongly $P$-invariant; then

$$f \in D \Leftrightarrow \mathrm{proj}_P(f) \in D.$$

So $D$ has a lower-dimensional representation $\mathrm{proj}_P(D) \subseteq \mathcal{L}_P(\mathcal{X})$.

Permutation invariant gambles

Consider the permutation invariant atoms:

$$[x] := \{\pi(x)\colon \pi \in P\},$$

which constitute a partition $\mathcal{A}_P := \{[x]\colon x \in \mathcal{X}\}$ of $\mathcal{X}$. The permutation invariant gambles $f \in \mathcal{L}_P(\mathcal{X})$ are constant on these invariant atoms $[x]$, and can therefore be seen as gambles on these atoms: $f \in \mathcal{L}(\mathcal{A}_P)$.

FINITE EXCHANGEABILITY

Exchangeability for sets of desirable gambles

@ARTICLE{cooman2012,
  author  = {De Cooman, Gert and Quaeghebeur, Erik},
  title   = {Exchangeability and sets of desirable gambles},
  journal = {International Journal of Approximate Reasoning},
  year    = 2012,
  volume  = 53,
  pages   = {363--395}
}

Permutations and count vectors

Consider any permutation $\pi$ of the set of indices $\{1,2,\dots,n\}$. For any $x = (x_1,x_2,\dots,x_n)$ in $\mathcal{X}^n$, we let

$$\pi x := (x_{\pi(1)}, x_{\pi(2)}, \dots, x_{\pi(n)}).$$

For any $x \in \mathcal{X}^n$, consider the corresponding count vector $T(x)$, where for all $z \in \mathcal{X}$:

$$T_z(x) := |\{k \in \{1,\dots,n\}\colon x_k = z\}|.$$

Example. For $\mathcal{X} = \{a,b\}$ and $x = (a,a,b,b,a,b,b,a,a,a,b,b,b)$, we have $T_a(x) = 6$ and $T_b(x) = 7$.
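Computing $T(x)$ is a one-liner; the sketch below (mine) just replays the slide's example.

```python
# Count vectors T(x), replaying the example from the slide.
from collections import Counter

def count_vector(x, categories):
    counts = Counter(x)
    return {z: counts[z] for z in categories}

x = "aabbabbaaabbb"              # the sequence (a,a,b,b,a,b,b,a,a,a,b,b,b)
print(count_vector(x, "ab"))     # {'a': 6, 'b': 7}
```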

Let $m = T(x)$ and consider the permutation invariant atom

$$[m] := \{y \in \mathcal{X}^n\colon T(y) = m\}.$$

How many elements does this atom have?

$$\binom{n}{m} = \frac{n!}{\prod_{x\in\mathcal{X}} m_x!}$$

Let $\mathrm{HypGeo}^n(\cdot\,|\,m)$ be the expectation operator associated with the uniform distribution on $[m]$:

$$\mathrm{HypGeo}^n(f\,|\,m) := \binom{n}{m}^{-1} \sum_{x\in[m]} f(x) \quad\text{for all } f\colon \mathcal{X}^n \to \mathbb{R}.$$

Interestingly,

$$\mathrm{proj}_P(f)(x) = \mathrm{HypGeo}^n(f\,|\,m) \quad\text{for all } x \in [m],$$

so $\mathrm{proj}_P(f)$ can be seen as a gamble on the composition $m$ of an urn with $n$ balls.
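Both claims on this slide are easy to verify by brute force for small $n$; the sketch below (mine) enumerates the atom $[m]$ explicitly.

```python
# Brute-force check (illustration only): the atom [m] has n!/prod_x m_x!
# elements, and HypGeo^n(f | m) is the uniform expectation over it.
from itertools import permutations
from math import factorial

def atom(m):
    """All distinct sequences whose count vector is m (dict: category -> count)."""
    base = [z for z, k in m.items() for _ in range(k)]
    return set(permutations(base))

def hypgeo(f, m):
    A = atom(m)
    return sum(f(x) for x in A) / len(A)

m = {"a": 2, "b": 1}
print(len(atom(m)), factorial(3) // (factorial(2) * factorial(1)))  # 3 3
# expectation of "the first draw is a" under the uniform distribution on [m]:
print(hypgeo(lambda x: float(x[0] == "a"), m))                      # 0.666...
```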

COUNTABLE EXCHANGEABILITY

Bruno de Finetti's exchangeability result

Infinite Representation Theorem: the sequence $X_1, \dots, X_n, \dots$ of random variables in the finite set $\mathcal{X}$ is exchangeable iff there is a (unique) coherent prevision $H$ on the linear space $V(\Sigma_\mathcal{X})$ of all polynomials on $\Sigma_\mathcal{X}$ such that for all $n \in \mathbb{N}$ and all $f\colon \mathcal{X}^n \to \mathbb{R}$:

$$E(f) = H\Bigl(\sum_{m\in\mathcal{N}^n} \mathrm{HypGeo}^n(f\,|\,m)\, B_m\Bigr).$$

Observe that

$$B_m(\theta) = \mathrm{MultiNom}^n([m]\,|\,\theta) = \binom{n}{m}\prod_{x\in\mathcal{X}} \theta_x^{m_x}$$

and

$$\sum_{m\in\mathcal{N}^n} \mathrm{HypGeo}^n(f\,|\,m)\, B_m(\theta) = \mathrm{MultiNom}^n(f\,|\,\theta).$$
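The last identity can be checked numerically for small $n$: mixing the urn expectations $\mathrm{HypGeo}^n(f\,|\,m)$ with the Bernstein basis polynomials recovers the i.i.d. (multinomial) expectation. A self-contained sketch (mine), for $n = 3$ and $\mathcal{X} = \{a, b\}$:

```python
# Numerical check of: sum_m HypGeo^n(f | m) B_m(theta) = MultiNom^n(f | theta)
# for a binary X, where B_m(theta) = C(n, m_a) * theta_a^m_a * theta_b^m_b.
from itertools import product
from math import comb, prod

n, theta = 3, {"a": 0.3, "b": 0.7}
f = lambda x: float(x.count("a") >= 2)        # an arbitrary gamble on X^n

seqs = list(product("ab", repeat=n))
direct = sum(f(x) * prod(theta[z] for z in x) for x in seqs)  # i.i.d. expectation

mix = 0.0
for k in range(n + 1):                        # count vectors m = (k, n - k)
    atom = [x for x in seqs if x.count("a") == k]
    hypgeo = sum(map(f, atom)) / len(atom)    # uniform expectation over [m]
    mix += hypgeo * comb(n, k) * theta["a"] ** k * theta["b"] ** (n - k)

print(abs(direct - mix) < 1e-12)              # True
```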

Representation theorem for sets of desirable gambles

Representation Theorem. A sequence $D_1, \dots, D_n, \dots$ of coherent sets of desirable gambles is exchangeable iff there is some (unique) Bernstein coherent $H \subseteq V(\Sigma_\mathcal{X})$ such that:

$$f \in D_n \Leftrightarrow \mathrm{MultiNom}^n(f\,|\,\cdot) \in H \quad\text{for all } n \in \mathbb{N} \text{ and } f \in \mathcal{L}(\mathcal{X}^n).$$

A set $H$ of polynomials on $\Sigma_\mathcal{X}$ is Bernstein coherent if:
B1. if $p$ has some non-positive Bernstein expansion then $p \notin H$
B2. if $p$ has some positive Bernstein expansion then $p \in H$
B3. if $p_1 \in H$ and $p_2 \in H$ then $p_1 + p_2 \in H$
B4. if $p \in H$ then $\lambda p \in H$ for all positive real numbers $\lambda$

Suppose we observe the first $n$ variables, with count vector $m = T(x)$: $(X_1,\dots,X_n) = (x_1,\dots,x_n) = x$. Then the remaining variables $X_{n+1},\dots,X_{n+k},\dots$ are still exchangeable, with representation $H\rfloor x = H\rfloor m$ given by:

$$p \in H\rfloor m \Leftrightarrow B_m\, p \in H.$$

Conclusion: a Bernstein coherent set of polynomials $H$ completely characterises all predictive inferences about an exchangeable sequence.

PREDICTIVE INFERENCE SYSTEMS

Predictive inference systems

Formal definition: a predictive inference system is a map $\Psi$ that associates with every finite set of categories $\mathcal{X}$ a Bernstein coherent set of polynomials on $\Sigma_\mathcal{X}$: $\Psi(\mathcal{X}) = H_\mathcal{X}$.

Basic idea: once the set of possible observations $\mathcal{X}$ is determined, all predictive inferences about successive observations $X_1,\dots,X_n,\dots$ in $\mathcal{X}$ are completely fixed by $\Psi(\mathcal{X}) = H_\mathcal{X}$.

Inference principles

Even if (or when) you don't like this idea, you might want to concede the following.

Using inference principles to constrain $\Psi$: we can use general inference principles to impose conditions on $\Psi$, or in other words to constrain:
- the values $H_\mathcal{X}$ can assume for different $\mathcal{X}$
- the relation between $H_\mathcal{X}$ and $H_\mathcal{Y}$ for different $\mathcal{X}$ and $\mathcal{Y}$

Taken to (what might be called) extremes (Carnap, Walley, ...): impose so many constraints (principles) that you end up with a single $\Psi$, or a parametrised family of them, e.g. the $\lambda$ system or the IDM family.

A few examples

Renaming Invariance: inferences should not be influenced by what names we give to the categories.

Pooling Invariance: for gambles that do not differentiate between pooled categories, it should not matter whether we consider predictive inferences for the set of original categories $\mathcal{X}$, or for the set of pooled categories $\mathcal{Y}$.

Specificity: if, after making a number of observations in $\mathcal{X}$, we decide that in the future we are only going to consider outcomes in a subset $\mathcal{Y}$ of $\mathcal{X}$, we can discard from the past observations those outcomes not in $\mathcal{Y}$.

Look how nice!

Renaming Invariance. For any onto and one-to-one $\pi\colon \mathcal{X} \to \mathcal{Y}$:
$$B_{C_\pi(m)}\, p \in H_\mathcal{Y} \Leftrightarrow B_m\,(p \circ C_\pi) \in H_\mathcal{X} \quad\text{for all } p \in V(\Sigma_\mathcal{Y}) \text{ and } m \in \mathcal{N}^\mathcal{X}$$

Pooling Invariance. For any onto $\rho\colon \mathcal{X} \to \mathcal{Y}$:
$$B_{C_\rho(m)}\, p \in H_\mathcal{Y} \Leftrightarrow B_m\,(p \circ C_\rho) \in H_\mathcal{X} \quad\text{for all } p \in V(\Sigma_\mathcal{Y}) \text{ and } m \in \mathcal{N}^\mathcal{X}$$

Specificity. For any $\mathcal{Y} \subseteq \mathcal{X}$:
$$B_{m_\mathcal{Y}}\, p \in H_\mathcal{Y} \Leftrightarrow B_m\,(p_\mathcal{Y}) \in H_\mathcal{X} \quad\text{for all } p \in V(\Sigma_\mathcal{Y}) \text{ and } m \in \mathcal{N}^\mathcal{X}$$

EXAMPLES

Imprecise Dirichlet Multinomial Model

Let, with $s > 0$:

$$\Delta^s_\mathcal{X} := \Bigl\{\alpha \in \mathbb{R}^\mathcal{X}_{>0}\colon \sum_{x\in\mathcal{X}} \alpha_x < s\Bigr\}$$

and

$$H^s_\mathcal{X} := \{p \in V(\Sigma_\mathcal{X})\colon (\forall \alpha \in \Delta^s_\mathcal{X})\, \mathrm{Diri}(p\,|\,\alpha) > 0\}.$$

This inference system is pooling and renaming invariant, and specific. For any observed count vector $m \in \mathcal{N}^n$:

$$\underline{P}^s(\{x\}\,|\,m) = \frac{m_x}{n+s} \quad\text{and}\quad \overline{P}^s(\{x\}\,|\,m) = \frac{m_x+s}{n+s}.$$
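These predictive bounds are trivial to compute; the sketch below (mine) evaluates them for the count vector of the earlier example. As $s$ shrinks to 0, both bounds tend to $m_x/n$, the precise predictive probability of the Haldane system on the next slide.

```python
# The IDM predictive bounds from the slide, for prior strength s > 0:
#   lower({x} | m) = m_x / (n + s),  upper({x} | m) = (m_x + s) / (n + s).
def idm_predictive(m, x, s=2.0):
    n = sum(m.values())
    return m[x] / (n + s), (m[x] + s) / (n + s)

m = {"a": 6, "b": 7}                   # count vector from the earlier example
print(idm_predictive(m, "a"))          # (0.4, 0.5333...) for s = 2
print(idm_predictive(m, "a", s=1e-9))  # both bounds approach m_a / n = 6/13
```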

Haldane system

Let:

$$H^H_\mathcal{X} := \bigcup_{s>0}\{p \in V(\Sigma_\mathcal{X})\colon (\forall \alpha \in \Delta^s_\mathcal{X})\, \mathrm{Diri}(p\,|\,\alpha) > 0\} = \bigcup_{s>0} H^s_\mathcal{X}.$$

This inference system is pooling and renaming invariant, and specific. For any observed count vector $m \in \mathcal{N}^n$:

$$\underline{P}(\{x\}\,|\,m) = \overline{P}(\{x\}\,|\,m) = \frac{m_x}{n}.$$