Probabilistic Reasoning


Probabilistic Reasoning (Probabilistisch Redeneren)

Authors: Linda van der Gaag, Silja Renooij

Fall 2013


Preface

In artificial-intelligence research, the probabilistic-network, or (Bayesian) belief-network, framework for automated reasoning with uncertainty is rapidly gaining in popularity. The framework provides a powerful formalism for representing a joint probability distribution on a set of statistical variables. In addition, it offers algorithms for efficient probabilistic inference. At present, more and more knowledge-based systems employing the framework are being developed for various domains of application, ranging from probabilistic information retrieval to medical diagnosis. This syllabus provides a tutorial introduction to the probabilistic-network framework and highlights some issues of ongoing research in applying the framework for problem solving in real-life domains. Each chapter includes a number of exercises; answers or hints to some of them (indicated by a *) are provided at the end.

This syllabus was first written in the late 1990s by L.C. van der Gaag and has been continuously under development ever since. Since 2001, adaptations and extensions have been made mostly by S. Renooij. The syllabus is by no means devoid of imperfections, and any useful comments on its contents are greatly appreciated by the authors.

For the 2006 edition, several references to relevant recent research were added to Chapters 4, 5, and 6. In addition, the material on the subject of sensitivity analysis was extended. For the 2009 edition, some small errors were corrected and relevant references were added. Similar updates were made for the 2013 edition, for which an example in Chapter 6 was also extended.

Linda van der Gaag
Silja Renooij
Utrecht University
July 2013

© All rights reserved. No part of this work may be reproduced without permission of the authors.


Contents

1 Introduction
2 Preliminaries
   Graph Theory
   Probability Theory
3 Independences and Graphical Representations
   The Concept of Independence Revisited
   Pearl's Axiomatic System for Independence
   Properties of Independence Relations
   Graphical Representations of Independence
      Undirected Graphs
      Directed Graphs
   Choosing a Graphical Representation
4 The Probabilistic Network Framework
   The Probabilistic Network Formalism
   Probabilistic Inference
      Directed Trees
      Singly Connected Digraphs
      Multiply Connected Digraphs
      Other Algorithms for Probabilistic Inference
5 Building a Probabilistic Network
   Identifying Variables and Values
   Constructing the Digraph
      Constructing the Digraph by Hand
      Learning the Digraph from Data
   Assessing Probabilities
      Sources of Probabilistic Information
      Simplifying Probability Assessment
      Eliciting Probabilities from Experts
      A Procedure for Probability Refinement
6 Bringing Probabilistic Networks into Practice
   Sensitivity Analysis
      What to Analyse?
      One-way Sensitivity Analysis
      Two-way Sensitivity Analysis
   Evaluating Probabilistic Networks
      The Percentage Correct and its Shortcomings
      The Evaluation Score
   A Problem-Solving Architecture
      Threshold Decision Making
      Selective Evidence Gathering
7 Conclusions
Solutions, Answers and Hints

Chapter 1

Introduction

This chapter gives some historical background on the use of probability theory and other uncertainty formalisms for decision support. It briefly motivates the emergence of probabilistic networks (or: (Bayesian) belief networks) and explains why probabilistic network applications have historically often concerned the medical domain.

Over the past few decades, interest in the results of artificial-intelligence research has grown considerably. Especially the area of knowledge-based systems has attracted much attention. The phrase knowledge-based system, or expert system, is generally employed to denote computer systems in which some symbolic representation of human knowledge is incorporated and applied [Lucas & Van der Gaag, 1991, Jackson, 1990]. Knowledge-based systems are typically designed to deal with real-life problems that require considerable human knowledge and expertise for their solution; examples range from medical diagnosis and technical trouble shooting to financial advice and product design. It is their ability to capture and reason with (specialised) human knowledge that allows knowledge-based systems to arrive at a performance comparable to that of human experts. Knowledge-based systems have by now found their way from academic laboratories to the industrial world and are being integrated into conventional software environments.

As more and more knowledge-based systems are being developed for a large variety of problems, it becomes apparent that the knowledge required to solve these problems often is not precisely defined but instead is of an imprecise nature. In fact, many real-life problem domains are fraught with uncertainty. Human experts in these domains typically are able to form judgements and take decisions based on uncertain, incomplete, and sometimes even contradictory information. To be of practical use, a knowledge-based system has to deal with such information at least equally well.
For this purpose, a knowledge-based system employs a formalism for representing uncertainty and an associated algorithm for manipulating uncertain information. The major research topic in artificial intelligence of reasoning with uncertainty, or plausible reasoning, addresses the design of such formalisms and algorithms [Shafer & Pearl, 1990]. As probability theory is a mathematically well-founded theory about uncertainty, with a long and outstanding tradition of research and experience, it is not surprising that this theory takes a prominent place in research on reasoning with uncertainty in knowledge-based systems. Unfortunately, applying probability theory in a knowledge-based context is not as easy as it may seem at first sight. Straightforward application

of the basic concepts from probability theory leads to insuperable problems of computational complexity: explicit representation of a joint probability distribution requires exponential space (exponential in the number of variables discerned), and even if the distribution could be represented more economically, computing probabilities of interest by the basic rules of marginalisation and conditioning would have an exponential time complexity. The rich history of applying probability theory for reasoning with uncertainty in knowledge-based systems shows various attempts to settle these problems.

In this chapter, we sketch the historical background of applying probability theory in a knowledge-based system. We would like to note that our intention is not to be complete, but merely to give an impression of the problems encountered by researchers pioneering in automated probabilistic inference. In our sketch, we focus on the task of (medical) diagnosis. For a given problem domain, we discern a set of possible hypotheses H = {h_1, ..., h_n}, n ≥ 1, and a set of pieces of evidence E = {e_1, ..., e_m}, m ≥ 1, that may be observed in relation with these hypotheses. For ease of exposition, we assume that each of the hypotheses is either true or false; equally, we assume that each of the pieces of evidence is either true or false. A diagnostic problem in this domain now is a set of pieces of evidence e ⊆ E that is actually observed and needs to be explained in terms of the hypotheses discerned. A diagnosis for a problem e under consideration is a set of hypotheses h ⊆ H that best explains e.

As early as the 1960s, several research efforts on automated reasoning with uncertainty for diagnostic applications were undertaken [Warner et al., 1961, Gorry & Barnett, 1968, De Dombal et al., 1972]. The systems constructed in this period were based to a large extent on application of Bayes' Theorem; in the sequel, we will refer to the approach taken in these early systems as the naive-Bayesian approach.
In this approach, the basic idea of computing a diagnosis for a set of actually observed pieces of evidence e ⊆ E is to compute for all sets of hypotheses h ⊆ H the conditional probability Pr(h | e) from the distribution Pr on the domain at hand, and then select a set h ⊆ H with highest probability. Since for real-life applications the conditional probabilities Pr(e | h) often are easier to come by than the conditional probabilities Pr(h | e), generally Bayes' Theorem is used for computing the required probabilities:

    Pr(h | e) = Pr(e | h) · Pr(h) / Pr(e)

It will be evident that this approach is quite expensive from a computational point of view: because a diagnosis may be composed of several different hypotheses, the number of probabilities to be computed equals 2^n − 1. To overcome this problem of time complexity, a simplifying assumption is made: it is assumed that all hypotheses are mutually exclusive and collectively exhaustive. With this assumption, only the n singleton hypothesis sets {h_i} have to be considered as possible diagnoses. So, only the probabilities Pr(h_i | e) (writing h_i instead of {h_i}) for all h_i ∈ H have to be computed. To this end, once more Bayes' Theorem is used:

    Pr(h_i | e) = Pr(e | h_i) · Pr(h_i) / Pr(e)
                = Pr(e | h_i) · Pr(h_i) / Σ_{k=1..n} Pr(e | h_k) · Pr(h_k)

For automated application of Bayes' Theorem in this form, several probabilities are required from the joint probability distribution Pr on the domain at hand. In fact, conditional probabilities Pr(e | h_k), k = 1, ..., n, for every combination of pieces of

evidence e ⊆ E, have to be available. Apart from the fact that it is hardly likely that these probabilities will be readily available in a real-life problem domain, this means storing exponentially many probabilities. To overcome this problem of space complexity, a second simplifying assumption is made: it is assumed that all pieces of evidence are conditionally independent given any of the hypotheses discerned. The two simplifying assumptions taken together allow for computing the probabilities Pr(h_i | e) for all h_i ∈ H, given observed evidence e = {e_j1, ..., e_jp}, 1 ≤ p ≤ m, from

    Pr(h_i | e_j1 ∧ ... ∧ e_jp)
      = Pr(e_j1 ∧ ... ∧ e_jp | h_i) · Pr(h_i) / Σ_{k=1..n} Pr(e_j1 ∧ ... ∧ e_jp | h_k) · Pr(h_k)
      = Pr(e_j1 | h_i) · ... · Pr(e_jp | h_i) · Pr(h_i) / Σ_{k=1..n} Pr(e_j1 | h_k) · ... · Pr(e_jp | h_k) · Pr(h_k)

It will be evident that for any diagnostic problem e now only n − 1 probabilities have to be computed, and that for this purpose only m · n conditional probabilities and n − 1 prior ones have to be stored.

The systems for automated reasoning with uncertainty constructed in the 1960s were rather small-scale: they were devised for clear-cut problem domains with only a small number of hypotheses and restricted evidence. For these small systems, all probabilities necessary for applying Bayes' Theorem could be acquired from statistical analysis of empirical data. Despite the underlying (over-)simplifying assumptions, these systems performed considerably well [De Dombal et al., 1974]. Nevertheless, interest in this naive Bayesian approach to reasoning with uncertainty faded in the late 1960s and early 1970s. One of the reasons for this decline in interest is that the approach was feasible only for highly restricted problem domains. For larger or more complex domains, the above-mentioned simplifying assumptions often were seriously violated, causing degeneration of system behaviour. In addition, for larger domains the approach inevitably became demanding, either computationally or from an assessment point of view. At this stage, the first diagnostic knowledge-based systems began to emerge from artificial-intelligence research.
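The naive-Bayesian computation described above is compact enough to code directly. The following sketch uses made-up medical numbers purely for illustration (the hypothesis names, probabilities, and function names are ours, not from the early systems discussed):

```python
from math import prod

# Naive-Bayesian diagnosis: hypotheses are assumed mutually exclusive and
# collectively exhaustive, and pieces of evidence conditionally independent
# given each hypothesis. All numbers are invented for illustration only.

priors = {"flu": 0.10, "cold": 0.30, "healthy": 0.60}        # Pr(h_k)
likelihood = {                                               # Pr(e_j | h_k)
    "fever": {"flu": 0.90, "cold": 0.20, "healthy": 0.01},
    "cough": {"flu": 0.70, "cold": 0.80, "healthy": 0.05},
}

def posterior(evidence):
    """Pr(h_i | e_j1 ^ ... ^ e_jp) via Bayes' Theorem with the two assumptions."""
    unnorm = {h: p * prod(likelihood[e][h] for e in evidence)
              for h, p in priors.items()}
    total = sum(unnorm.values())      # denominator: sum over all hypotheses h_k
    return {h: u / total for h, u in unnorm.items()}
```

Note that only m · n conditional probabilities and the priors are stored, exactly as the complexity argument above predicts; the normalising denominator is the sum that appears in the displayed formula.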
These systems mostly use production rules for representing human (experiential) knowledge in a modular form closely resembling logical implications: production rules are expressions of the form if condition then conclusion. These so-called rule-based expert systems exhibit intelligent reasoning behaviour by employing a heuristic reasoning algorithm that uses the production rules for selective gathering of evidence and for pruning the search space of possible diagnoses. It is this pruning behaviour that renders rule-based expert systems capable of dealing with larger and more complex problem domains than the early naive-Bayesian systems. The best-known rule-based expert system developed in the 1970s is the MYCIN system for assisting physicians in the diagnosis and treatment of bacterial infections [Buchanan & Shortliffe, 1984].

In the context of rule-based expert systems, the naive Bayesian approach to reasoning with uncertainty is no longer feasible due to the large number of probabilities to be computed. Since in a rule-based system during problem solving the search space of possible diagnoses is pruned by heuristic as well as probabilistic criteria, it is necessary to compute probabilities for all intermediate results derived by the production rules in

addition to the probabilities of the separate hypotheses. To allow for efficient computation of all these probabilities, a set of computation rules has been designed. These computation rules provide for computing the probability of an (intermediate) result from probabilities associated with the production rules that are used in its derivation; to this end, each production rule is assigned the conditional probability of its conclusion given its condition. Unfortunately, these computation rules do not always accord with the axioms of probability theory and cannot even be considered approximation rules for computing probabilities. In the sequel, we will use the phrase quasi-probabilistic to refer to this approach. The most well-known illustration of the quasi-probabilistic approach is the certainty-factor model, designed originally for dealing with uncertainty in the MYCIN system [Shortliffe & Buchanan, 1984]. The certainty-factor model enjoys widespread use in rule-based expert systems built after MYCIN, even though by now it is widely known that the model is mathematically flawed. The relative success of the model can, however, be accounted for by its satisfactory behaviour in most applications and by its conceptual and computational simplicity [Van der Gaag, 1994].

Although the quasi-probabilistic approach to reasoning with uncertainty in knowledge-based systems met with considerable success in the artificial-intelligence community on the one hand, it was criticised severely on the other hand because of its ad-hoc character. The incorrectness of the approach from a mathematical point of view even led to a world-wide debate concerning the appropriateness of probability theory for handling uncertainty in a knowledge-based context. The adversaries of probability theory argue that the theory is not expressive enough to cope with the different kinds of uncertainty that are encountered in real-life problem domains and therefore have to be dealt with in knowledge-based systems. As a consequence, several other (more or less) mathematical models have been proposed for reasoning with uncertainty.
A major trend in plausible reasoning has arisen from the claim that probability theory is not able to capture imprecision or vagueness, notions of uncertainty which are salient in natural-language representations. The name of L.A. Zadeh is inseparable from this trend: he was the first to propose fuzzy set theory as the point of departure for the development of methods that are able to cope with vague information. Dempster-Shafer theory lies at the basis of another major trend in plausible reasoning. The theory was developed by G. Shafer, building on earlier work by A.P. Dempster [Shafer, 1976]. It was motivated by the observation that probability theory is not able to discern between uncertainty and ignorance due to incompleteness of information. The advocates of probability theory, on the other hand, claim that it is provable that probability theory is the only correct way of dealing with uncertainty and that anything that can be done with non-probabilistic methods can be done equally well using a probability-based method. For this claim, often an argument by R.T. Cox is cited [Cox, 1979]: Cox states a simple set of intuitive properties a measure of uncertainty has to satisfy and subsequently shows that the basic axioms of probability theory follow. Here, we will not enter into the debate concerning the appropriateness of probability theory for reasoning with uncertainty in knowledge-based systems; for a wide range of diverging opinions, the reader is referred to [Cheeseman, 1988] with its ensuing discussions.

Although the above-mentioned debate was not in the least subdued, in the mid-1980s the probabilistic network framework was introduced as a novel approach to applying probability theory for reasoning with uncertainty in knowledge-based systems

[Pearl, 1988]. The probabilistic network framework is characterised by a powerful formalism for representing domain knowledge and the uncertainties that go with it; more specifically, the formalism provides for a concise representation of a joint probability distribution on a set of statistical variables. Associated with this formalism are algorithms for efficiently computing probabilities of interest and for processing evidence; these algorithms constitute the basic building blocks for reasoning with knowledge represented in the formalism. When compared to the naive-Bayesian approach on the one hand and the quasi-probabilistic approach on the other hand, the probabilistic network approach offers advantages over both. In contrast with the quasi-probabilistic approach, the probabilistic network approach has a firm mathematical foundation in probability theory. Contrasting the naive-Bayesian approach, the probabilistic network approach circumvents the need for simplifying assumptions by capturing and reasoning about actual independences among variables. Since its introduction, the probabilistic network framework has rapidly gained in popularity and by now is beginning to illustrate its worth in complex problem domains: practical applications have been and are being developed, for example, for medical diagnosis and prognosis [Andreassen et al., 1987, Heckerman et al., 1992, Blanco et al., 2005], for probabilistic information retrieval [Bruza & Van der Gaag, 1994], in computer vision [Jensen et al., 1990], in forensic science [Taroni et al., 2006], and in various other domains (see [Pourret, Naim & Marcot, 2008]). Whereas earlier applications of probabilistic networks were mostly handcrafted with the help of domain experts, the increasing availability of large data sets has made it much easier to construct applications directly from data [Neapolitan, 2003].
Even large data sets, however, usually do not contain sufficient reliable information to construct reliable networks of general topology; for this reason, network engineers either resort to using various types of classifier [Friedman et al., 1997], or again have to rely on domain expertise to complete the network [Druzdzel & Van der Gaag, 2000].

This syllabus provides a tutorial introduction to the probabilistic network framework and highlights some issues of ongoing research in applying the framework for real-life problem solving. It is organised as follows. Chapter 2 provides some preliminaries from graph theory and from probability theory. In Chapter 3, we discuss the representation of probabilistic independence in graphical models. Chapter 4 introduces the probabilistic network framework: it details the probabilistic network formalism and outlines its associated algorithms. In Chapter 5, we address building probabilistic networks for real-life problem domains. Analysis of and problem solving with probabilistic networks is the topic of Chapter 6. The syllabus is rounded off with some concluding discussion in Chapter 7.


Chapter 2

Preliminaries

This chapter reviews the necessary concepts from graph theory and probability theory that play a central role in the probabilistic network framework. These concepts can be found in any textbook on graph theory or probability theory. The chapter also introduces the notation that is used throughout the syllabus.

2.1 Graph Theory

In this section, some concepts from graph theory are reviewed. Our review is tailored to the probabilistic network framework and is not meant to be exhaustive; for further information, any introductory textbook on graph theory will suffice. Generally, two types of graph are discerned: undirected and directed ones.

Definition. An undirected graph G is a pair G = (V(G), E(G)) where V(G) is a finite set of vertices (also called nodes) and E(G) is a set of unordered pairs (V_i, V_j), V_i, V_j ∈ V(G), called edges. A directed graph, or digraph for short, is a pair G = (V(G), A(G)) where V(G) is a finite set of vertices and A(G) is a set of ordered pairs (V_i, V_j), V_i, V_j ∈ V(G), called arcs.

An arc (V_i, V_j) is often written V_i → V_j or V_j ← V_i. For a vertex in a graph, different sets of related vertices can be identified.

Definition. In a digraph G, vertex V_j is called a predecessor (or parent) of vertex V_i if (V_j, V_i) ∈ A(G); the set of all predecessors of vertex V_i in G is denoted ρ_G(V_i). Likewise, vertex V_j is called a successor (or child) of vertex V_i if (V_i, V_j) ∈ A(G); the set of all successors of vertex V_i in G is denoted σ_G(V_i). The reflexive transitive closure¹ of V_i under the predecessor relation is denoted ρ*_G(V_i); an element from ρ*_G(V_i) is called an ancestor of V_i. Similarly, σ*_G(V_i) denotes the descendants of V_i. The set of neighbours of vertex V_i is defined as

    ν_G(V_i) = σ_G(V_i) ∪ ρ_G(V_i)              if G is directed;
    ν_G(V_i) = { V_j | (V_i, V_j) ∈ E(G) }      if G is undirected.

¹ The reflexive closure of a set A under r is r^0(A) = A, the transitive closure is r^+(A) = r(A) ∪ r^+(r(A)), and both combined give r^*(A) = r^0(A) ∪ r^+(A).

The size of the neighbour set of a vertex is called its degree. In the case of a vertex in a digraph, we in addition define the in-degree to be its number of predecessors and the out-degree to be its number of successors; the incoming and outgoing arcs together are called its incident arcs. In the sequel, we will often drop the subscript G from ρ_G etc. as long as no ambiguity can occur.

The following definition introduces several types of vertex sequence for undirected graphs.

Definition. Let G = (V(G), E(G)) be an undirected graph. A path from vertex V_0 to vertex V_k in G is a sequence of vertices V_0, ..., V_k, k ≥ 0, with distinct edges (V_{i−1}, V_i) ∈ E(G), i = 1, ..., k, between them; k is called the length of the path. A path is called simple if all its vertices are distinct. A cycle is a path V_0, ..., V_k, V_0 from V_0 to V_0 of non-zero length. The graph G is called cyclic if it contains at least one cycle; otherwise, it is called acyclic.

In undirected graphs, self-loops (an edge (V_0, V_0)) are generally not allowed. The concepts of path and cycle introduced for undirected graphs directly apply to directed graphs by considering arcs rather than edges. Unless stated otherwise, we typically assume paths to be simple. We now introduce the concept of underlying graph. This concept associates an undirected graph with a directed one. We thereby assume that directed graphs do not contain self-loops either, although this is not a universal convention.

Definition. Let G = (V(G), A(G)) be a digraph. The underlying graph H of G is the undirected graph H = (V(H), E(H)) where V(H) = V(G) and E(H) is obtained from A(G) by replacing each arc (V_i, V_j) ∈ A(G) by an edge (V_i, V_j).

Related to a digraph's underlying graph, we introduce two additional types of vertex sequence for digraphs.

Definition. Let G be a digraph and let H be its underlying graph. A chain from vertex V_0 to vertex V_k in G is a sequence of vertices V_0, ..., V_k, k ≥ 0, that is a path in the underlying graph H of G; k is called the length of the chain.
A loop in G is a sequence of vertices that is a cycle in the underlying graph H of G.

Note that, in a digraph, the concept of path takes the directions of the arcs into account, while the concept of chain does not. A digraph is therefore acyclic if it contains no directed cycles; an acyclic digraph (or DAG) can contain loops, however. In a directed graph, two vertices may be connected by a chain. If this property holds for any two vertices in a digraph, we say that the graph is connected.

Definition. A digraph G is connected if there exists at least one chain between any two vertices in G; otherwise, it is called unconnected.

We have introduced the concept of connectedness to apply to directed graphs; the concept, however, is easily extended to apply to undirected graphs. We now distinguish between several types of digraph.
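As an aside, the digraph concepts defined above (predecessors, successors, and the reflexive transitive closure yielding the ancestors) can be sketched with a small adjacency representation. The class and method names below are ours, a minimal illustration rather than anything from the syllabus:

```python
# A minimal digraph sketch illustrating the definitions above; the names
# Digraph, parents, children, and ancestors are ours, chosen to mirror the
# notions rho_G, sigma_G, and rho*_G.

class Digraph:
    def __init__(self, vertices, arcs):
        self.vertices = set(vertices)
        self.arcs = set(arcs)                  # ordered pairs (V_i, V_j)

    def parents(self, v):                      # rho_G(v): predecessors of v
        return {i for (i, j) in self.arcs if j == v}

    def children(self, v):                     # sigma_G(v): successors of v
        return {j for (i, j) in self.arcs if i == v}

    def ancestors(self, v):                    # rho*_G(v): reflexive transitive closure
        closure, frontier = {v}, {v}
        while frontier:
            frontier = {p for u in frontier for p in self.parents(u)} - closure
            closure |= frontier
        return closure

# The digraph A -> B -> C with an extra arc A -> C is acyclic (no directed
# cycle), yet it contains the loop A, B, C, A: a cycle in its underlying
# graph. It is therefore multiply connected.
G = Digraph("ABC", {("A", "B"), ("B", "C"), ("A", "C")})
```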

Definition. A digraph G is called singly connected if it does not contain any loops; otherwise, it is called multiply connected. A singly connected digraph G is called a directed tree if each vertex in G has at most one predecessor.

Note that in a singly connected digraph, there is at most one chain between any two vertices; this property does not hold for multiply connected digraphs. To conclude, we introduce the concept of a subgraph. The concept is introduced for undirected graphs, but is extended straightforwardly to apply to digraphs.

Definition. Let G = (V(G), E(G)) be an undirected graph. The subgraph H induced by V' ⊆ V(G) is the undirected graph H = (V', (V' × V') ∩ E(G)).

Note that a subgraph induced by a set of vertices V' takes from the original graph all edges existing among the vertices from V'.

2.2 Probability Theory

In this section, we provide a brief review of some basic concepts from probability theory. Once more, our review is tailored to the probabilistic network framework and is not meant to be an exhaustive tutorial; for further information, any introductory textbook on probability theory will suffice.

Probability theory is often approached from a set-theoretic point of view. Probability distributions are then defined on sets of elements that represent events. All possible outcomes of an experiment (for example, the possible outcomes of rolling a die) are given by the sample space Ω, and each event A is a subset of Ω. A probability measure/function/distribution then is a function from events to the [0, 1] interval. As events are sets, combinations of events can be expressed using operations on sets such as union (∪) and intersection (∩). Outcomes of an experiment are often coded by using a random variable (also: statistical/stochastic variable), which is a function from the sample space to another space (such as the reals). By writing probability distributions on statistical variables, the notation suppresses references to the actual sample space. However, as a statement about a statistical variable defines an event, there is no actual difference.
In the probabilistic-network community, statistical variables are taken to be functions from Ω to Ω. For a variable V defined on outcomes true and false, for example, we therefore have V(true) = true and V(false) = false. We now simply say that V can have, or take on, one of the values true and false, in which case we write V = true or V = false as possible value assignments. The probabilistic-network community approaches probability theory from an algebraic point of view by associating probabilities with logical propositions instead of sets.

In this syllabus, we consider a set of statistical variables V = {V_1, ..., V_n}, n ≥ 1. For ease of exposition, we will often restrict the discussion to binary variables taking one of the truth values true and false; the generalisation to variables with more than two discrete values, however, is rather straightforward. For abbreviation, we will use v to denote the proposition that the variable V takes the value true; V = false will be denoted ¬v. The set of variables V may be looked upon as spanning a Boolean algebra of propositions 𝒱. Informally speaking, this algebra comprises all logical propositions that are built from value assignments to the variables discerned. More formally,

the Boolean algebra of propositions 𝒱 spanned by V is the set of logical propositions consisting of the constant propositions True and False², the atomic propositions v for all V ∈ V, and all compound propositions that are constructed from these by applying the binary operators ∧ (conjunction) and ∨ (disjunction), and the unary operator ¬ (negation); the elements of the algebra 𝒱 adhere to the usual axioms of propositional logic. We now define a joint probability distribution as a function on a Boolean algebra of propositions that is spanned by a set of statistical variables.

Definition. Let V be a set of statistical variables and let 𝒱 be the Boolean algebra of propositions spanned by V. Let Pr : 𝒱 → [0, 1] be a function such that

- Pr(x) ≥ 0 for all x ∈ 𝒱 and, more specifically, Pr(False) = 0;
- Pr(True) = 1;
- Pr(x ∨ y) = Pr(x) + Pr(y) for all x, y ∈ 𝒱 such that x ∧ y ≡ False.

Then Pr is called a joint probability distribution on V. For each x ∈ 𝒱, the function value Pr(x) is termed the probability of x.

A probability Pr(x) for a logical proposition x expresses the amount of certainty concerning the truth of x. Note that in the previous definition we have associated probabilities with logical propositions instead of with sets, which is the more common view taken in (introductory) literature on probability theory. It can easily be shown, however, that the probability of an event (a set of outcomes) is equivalent to the probability of the truth of the proposition asserting the occurrence of the event [Finetti, 1970].

Example. Suppose X and Y are statistical variables, each representing a coin toss. Let A be the event that X = heads and Y = tails; then the probability of this event is

- from a set-theoretic point of view: the probability of event A, written Pr(A);
- from an algebraic point of view: the probability that (X = heads ∧ Y = tails) ≡ True, written Pr(X = heads ∧ Y = tails).

If X = heads and Y = tails were considered two separate events A and B, then this would make no difference from the algebraic point of view, but in the set-theoretic approach we should now write Pr(A ∩ B).
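The definition above can be made concrete by storing one probability per full value assignment and treating a proposition as a predicate that a configuration either satisfies or not; the probability of the proposition is then the total mass of the satisfying configurations. The representation and all numbers below are our own illustrative sketch:

```python
# A joint probability distribution over two binary variables, represented by
# one probability per configuration (full value assignment); the numbers are
# invented for illustration.

variables = ("V1", "V2")
joint = {(True, True): 0.2, (True, False): 0.1,
         (False, True): 0.4, (False, False): 0.3}

def Pr(proposition):
    """Probability of a proposition, given as a predicate on configurations.

    This mirrors the additivity axiom: a proposition's probability is the sum
    of the probabilities of the mutually exclusive configurations making it
    true; Pr(True) sums everything and Pr(False) sums nothing.
    """
    return sum(p for config, p in joint.items()
               if proposition(dict(zip(variables, config))))

# With these numbers one can verify, e.g., the general addition law derived
# from the axioms: Pr(v1 v v2) = Pr(v1) + Pr(v2) - Pr(v1 ^ v2).
```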
In the sequel, we will want to single out strictly positive joint probability distributions, as these have some interesting properties. Strictly positive distributions, for example, are well known for not embedding any functional or logical relationships among their variables.

Definition. Let V be a set of statistical variables and let 𝒱 be the Boolean algebra of propositions spanned by V. Let Pr be a joint probability distribution on V. Pr is strictly positive if Pr(x) = 0 implies x ≡ False.

We now introduce the concept of conditional probability.

² Note the difference between these propositions and the aforementioned outcomes/values!

Definition. Let V be a set of statistical variables and let 𝒱 be the Boolean algebra of propositions spanned by V. Let Pr be a joint probability distribution on V. For each x, y ∈ 𝒱 with Pr(y) > 0, the conditional probability of x given y, denoted Pr(x | y), is defined as

    Pr(x | y) = Pr(x ∧ y) / Pr(y)

The conditional probability Pr(x | y) expresses the amount of certainty concerning the truth of x given that the information y is known with certainty. Note that a conditional probability Pr(x | y) = p does not mean that whenever y is known to be true, the probability of x equals p: it means that the probability of x equals p if y is known to be true and nothing else is known that may affect the certainty concerning the truth of x. In the sequel, we will assume that all conditional probabilities specified are properly defined, that is, for each conditional probability Pr(x | y), we will implicitly assume that Pr(y) > 0. We further state without proof that for a given element y ∈ 𝒱, the conditional probabilities Pr(x | y) for all x ∈ 𝒱 once more constitute a joint probability distribution on V; this probability distribution is called the conditional probability distribution given y and will sometimes be denoted Pr_y. A conditional probability distribution is sometimes referred to as a posterior probability distribution; the joint probability distribution it is obtained from is then, in contrast, referred to as the prior distribution.

The following definition introduces the concept of independence of propositions.

Definition. Let V be a set of statistical variables and let 𝒱 be the Boolean algebra of propositions spanned by V. Let Pr be a joint probability distribution on V. Two propositions x, y ∈ 𝒱 are called (mutually) independent in Pr if

    Pr(x ∧ y) = Pr(x) · Pr(y)

otherwise, x and y are called dependent in Pr. Two propositions x, y ∈ 𝒱 are called conditionally independent given the proposition z ∈ 𝒱 in Pr if

    Pr(x ∧ y | z) = Pr(x | z) · Pr(y | z)

otherwise, x and y are called conditionally dependent given z in Pr.

In the sequel, we will make extensive use of various well-known properties of joint probability distributions.
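As a brief aside, the definitions of conditional probability and independence of propositions just given can be checked numerically on a small distribution. The two-coin distribution and helper names below are ours, chosen so that the two heads-propositions come out independent:

```python
# Conditional probability and independence of propositions, checked on a
# joint distribution over two fair, independent coins X and Y; the numbers
# are ours, chosen for illustration.

joint = {("heads", "heads"): 0.25, ("heads", "tails"): 0.25,
         ("tails", "heads"): 0.25, ("tails", "tails"): 0.25}

def Pr(pred):
    # Probability of a proposition: total mass of configurations (x, y)
    # satisfying the predicate.
    return sum(p for xy, p in joint.items() if pred(*xy))

def Pr_given(pred, cond):
    # Pr(x | y) = Pr(x ^ y) / Pr(y), properly defined only when Pr(y) > 0.
    return Pr(lambda x, y: pred(x, y) and cond(x, y)) / Pr(cond)

x_heads = lambda x, y: x == "heads"
y_heads = lambda x, y: y == "heads"
both = lambda x, y: x_heads(x, y) and y_heads(x, y)

# Independence in Pr: Pr(x ^ y) = Pr(x) * Pr(y).
independent = abs(Pr(both) - Pr(x_heads) * Pr(y_heads)) < 1e-12
```

Changing any one entry of `joint` (while keeping the total at 1) generally destroys the factorisation and makes the two propositions dependent.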
Before stating these properties, we provide some additional concepts and notational conventions. Recall that so far we have built on the Boolean algebra of propositions 𝒱 spanned by some set of statistical variables V. In the sequel, we will want to focus on the variables themselves and to refer to (arbitrary) conjunctions of value assignments to all variables from the set V or from some subset of V. We will use C_W to denote the conjunction C_W = ⋀_{V ∈ W} V of all variables from W ⊆ V; for W = ∅, we take C_W = True. The conjunction of variables C_W is called the configuration template of W. A conjunction c_W of value assignments to the variables from W is called a configuration of W. A configuration c_W is nothing more than a shorthand notation indicating that you are considering a proposition that consists of the conjunction of atomic propositions representing some value assignment to each variable in the set W. Writing c_W, W = {W_1, ..., W_m}, instead of W_1 = some value ∧ W_2 = some value ∧ ... ∧ W_m = some value is very convenient, especially if you do not care about the actual

values and variables. Note that a configuration template C_W is a shorthand notation for an even more general statement, namely about all possible value assignments to the variables in W: any configuration c_W of W of interest can be obtained by filling in appropriate values for all variables involved. To avoid an abundance of braces, we will often write C_V and c_V instead of C_{V} and c_{V}, respectively, for singleton sets {V}. Please note that for a single vertex V we have that C_{V} = V, and that Pr(V) therefore has an entirely different meaning than it would from a set-theoretic point of view!

We now state the various properties that we will use in the sequel. We would like to note that in the literature on probability theory these properties are introduced in many different appearances; we have chosen the form that suits our purposes best. The property stated in the following proposition is known as the chain rule.

Proposition 2.2.6 Let V = {V_1, ..., V_n}, n ≥ 1, be a set of statistical variables and let Pr be a joint probability distribution on V. Then,

    Pr(C_V) = Pr(V_1 ∧ ... ∧ V_n) = Pr(V_n | V_1 ∧ ... ∧ V_{n−1}) · ... · Pr(V_2 | V_1) · Pr(V_1)

In the expression stated in the previous proposition, each V_i is a variable that takes either the value true or the value false, expressed as v_i and ¬v_i, respectively. The expression therefore represents 2^n equalities, one for each configuration of the set of variables V.

The property stated in the following proposition is termed the marginalisation property.

Proposition 2.2.7 Let V be a set of statistical variables and let Pr be a joint probability distribution on V. Then,

    Pr(C_X) = Σ_{c_Y} Pr(C_X ∧ c_Y)

for all sets of variables X, Y ⊆ V.

We state without proof that for any set of variables X ⊆ V, the probabilities Pr(c_X) for all configurations c_X of X once more constitute a joint probability distribution; this probability distribution is termed the marginal probability distribution on X. The conditioning property is stated in the following proposition.

Proposition 2.2.8 Let V be a set of statistical variables and let Pr be a joint probability distribution on V.
Then,

  Pr(C_X) = Σ_{c_Y} Pr(C_X | c_Y) · Pr(c_Y)

for all sets of variables X, Y ⊆ V.

The following theorem is known as Bayes' Theorem.

Theorem 2.2.9 Let V be a set of statistical variables and let Pr be a joint probability distribution on V. Then,

  Pr(C_X | C_Y) = Pr(C_Y | C_X) · Pr(C_X) / Pr(C_Y)

for all sets of variables X, Y ⊆ V.
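The four properties above, the chain rule, marginalisation, conditioning and Bayes' Theorem, can be checked numerically on a concrete distribution. The following Python sketch is our own illustration and not part of the syllabus: the representation of Pr as a dictionary over configurations and the helper names marginal and cond are assumptions chosen for the example.

```python
import itertools

# Joint distribution Pr on three binary variables V = {V1, V2, V3},
# represented as a map from configurations (tuples of 0/1) to probabilities.
weights = [3, 1, 4, 1, 5, 9, 2, 6]
pr = {cfg: w / sum(weights)
      for cfg, w in zip(itertools.product((0, 1), repeat=3), weights)}

def marginal(assignment):
    """Pr(c_X): marginalisation -- sum Pr over all configurations that
    agree with the partial assignment {variable index: value}."""
    return sum(p for cfg, p in pr.items()
               if all(cfg[i] == v for i, v in assignment.items()))

def cond(x, given):
    """Pr(c_X | c_Y) = Pr(c_X and c_Y) / Pr(c_Y)."""
    return marginal({**x, **given}) / marginal(given)

for v1, v2, v3 in itertools.product((0, 1), repeat=3):
    # Chain rule: Pr(v1 ^ v2 ^ v3) = Pr(v3 | v1 ^ v2) . Pr(v2 | v1) . Pr(v1)
    chain = (cond({2: v3}, {0: v1, 1: v2})
             * cond({1: v2}, {0: v1}) * marginal({0: v1}))
    assert abs(pr[(v1, v2, v3)] - chain) < 1e-12

for v1 in (0, 1):
    # Marginalisation: Pr(c_X) = sum over c_Y of Pr(c_X ^ c_Y), X = {V1}
    summed = sum(marginal({0: v1, 1: v2, 2: v3})
                 for v2 in (0, 1) for v3 in (0, 1))
    # Conditioning: Pr(c_X) = sum over c_Y of Pr(c_X | c_Y) . Pr(c_Y)
    conditioned = sum(cond({0: v1}, {1: v2, 2: v3}) * marginal({1: v2, 2: v3})
                      for v2 in (0, 1) for v3 in (0, 1))
    assert abs(summed - marginal({0: v1})) < 1e-12
    assert abs(conditioned - marginal({0: v1})) < 1e-12

for v1, v2 in itertools.product((0, 1), repeat=2):
    # Bayes' Theorem: Pr(c_X | c_Y) = Pr(c_Y | c_X) . Pr(c_X) / Pr(c_Y)
    bayes = cond({1: v2}, {0: v1}) * marginal({0: v1}) / marginal({1: v2})
    assert abs(cond({0: v1}, {1: v2}) - bayes) < 1e-12

print("chain rule, marginalisation, conditioning and Bayes verified")
```

Note that the distribution is strictly positive, so all conditional probabilities in the sketch are well defined.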

To conclude this section, we once more turn to the concept of (conditional) independence. Recall that so far we have taken the concept of independence to apply to propositions. We now introduce the concept of independence of variables.

Definition 2.2.10 Let V be a set of statistical variables and let X, Y, Z ⊆ V. Let Pr be a joint probability distribution on V. Then, the set of variables X is called conditionally independent of the set of variables Y given the set of variables Z in Pr if

  Pr(C_X | C_Y ∧ C_Z) = Pr(C_X | C_Z);

otherwise, X is called conditionally dependent of Y given Z in Pr.

In qualitative terms, the expression Pr(C_X | C_Y ∧ C_Z) = Pr(C_X | C_Z) indicates that, once information about Z is available, information about Y is irrelevant with respect to X. Note that for X and Y to be independent given Z, every pair of configurations of X and Y has to be independent given every configuration of Z. Independence of variables therefore implies independence of propositions. The reverse property, however, does not hold in general. Also note that the expression from Definition 2.2.10 is asymmetric in X and Y. Using Bayes' Theorem, however, it is easily shown that Pr(C_X | C_Y ∧ C_Z) = Pr(C_X | C_Z) implies Pr(C_Y | C_X ∧ C_Z) = Pr(C_Y | C_Z).

Exercises

Exercise 2.1 Prove the following properties for any joint probability distribution, using only definitions and not the properties from this exercise:
a. the chain rule (stated in Proposition 2.2.6);
b. Bayes' Theorem (stated in Theorem 2.2.9);
c. the marginalisation property (stated in Proposition 2.2.7);
* d. the conditioning property (stated in Proposition 2.2.8).

Exercise 2.2 Let V be a set of statistical variables and let Pr be a joint probability distribution on V. Show that

  Pr(C_X ∨ C_Y) = Pr(C_X) + Pr(C_Y) − Pr(C_X ∧ C_Y)

for all sets of variables X, Y ⊆ V.

Exercise 2.3 Let V be a set of statistical variables and let Pr be a joint probability distribution on V. Show that

  Pr(C_X | C_Z) = Σ_{c_Y} Pr(C_X | c_Y ∧ C_Z) · Pr(c_Y | C_Z)

for all sets of variables X, Y, Z ⊆ V.

* Exercise 2.4 Let V be a set of statistical variables and let X, Y, Z ⊆ V. Let Pr be a joint probability distribution on V. Show that the set of variables X is conditionally independent of the set of variables Y given the set of variables Z if and only if

  Pr(C_X ∧ C_Y | C_Z) = Pr(C_X | C_Z) · Pr(C_Y | C_Z).
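The equivalence in Exercise 2.4 can be seen at work in a small numeric example. The Python sketch below is our own illustration, not part of the syllabus; the factorised construction of Pr is an assumption chosen precisely so that the independence holds by design. Here X, Y and Z are singleton sets of binary variables.

```python
import itertools

# Build Pr on binary variables (X, Y, Z) so that, by construction,
# X is conditionally independent of Y given Z:
#   Pr(x ^ y ^ z) = Pr(z) . Pr(x | z) . Pr(y | z)
p_z = [0.4, 0.6]
p_x_z = [[0.2, 0.8], [0.7, 0.3]]   # p_x_z[z][x] = Pr(x | z)
p_y_z = [[0.5, 0.5], [0.1, 0.9]]   # p_y_z[z][y] = Pr(y | z)
pr = {(x, y, z): p_z[z] * p_x_z[z][x] * p_y_z[z][y]
      for x, y, z in itertools.product((0, 1), repeat=3)}

def m(assignment):
    """Marginal probability of a partial assignment {position: value}."""
    return sum(p for cfg, p in pr.items()
               if all(cfg[i] == v for i, v in assignment.items()))

for x, y, z in itertools.product((0, 1), repeat=3):
    # Conditional independence: Pr(c_X | c_Y ^ c_Z) = Pr(c_X | c_Z)
    lhs = m({0: x, 1: y, 2: z}) / m({1: y, 2: z})
    rhs = m({0: x, 2: z}) / m({2: z})
    assert abs(lhs - rhs) < 1e-12
    # Exercise 2.4: Pr(c_X ^ c_Y | c_Z) = Pr(c_X | c_Z) . Pr(c_Y | c_Z)
    joint = m({0: x, 1: y, 2: z}) / m({2: z})
    assert abs(joint - rhs * m({1: y, 2: z}) / m({2: z})) < 1e-12

print("conditional independence of X and Y given Z verified")
```

Both characterisations hold for every configuration, as the definition of independence of variables requires.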

Chapter 3

Independences and Graphical Representations

This chapter formalises two types of independence relation. The first, I_Pr, is the type of relation that can be captured by a probability distribution. These independence relations form a proper subset of a more general type of independence relation I that abstracts away from probability distributions. The chapter also discusses different representations of independence relations, most notably (un)directed graphs. An important notion introduced in this chapter is the concept of d-separation. Current research into independence relations is focussed on defining small generating sets [Waal & Van der Gaag, 2005], and on automated construction of graphical representations from them [Baioletti et al., 2011].

The historical background to the framework of probabilistic networks shows various attempts to handle the computational complexity of applying probability theory for reasoning with uncertainty in knowledge-based systems. The concept of (conditional) independence plays a key role in these attempts, as knowledge about independences allows for simplifying computations. In this chapter, we address formalisms that allow for a concise representation of an independence relation for effective use in a knowledge-based system.

3.1 The Concept of Independence Revisited

In most introductory literature on probability theory, the concept of (conditional) independence is introduced in terms of numerical quantities: the independence relation of a joint probability distribution is taken to be implicitly embedded in the probabilities involved. Recall, for example, that in the previous chapter we have defined two sets of variables X and Y to be conditionally independent given a third set of variables Z if Pr(C_X | C_Y ∧ C_Z) = Pr(C_X | C_Z).
A definition of independence in terms of numbers suggests that, in order to determine whether two sets of variables are (conditionally) independent, several conditional probabilities have to be computed and several equalities have to be tested; moreover, such a definition suggests that for determining independence a joint probability distribution has to be explicitly available for the variables discerned. In contrast, humans tend to be able to state directly, with conviction and consistency, whether or not two sets of variables are independent. Such statements

of independence typically are issued qualitatively, without any reference to numerical manipulation of exact probabilities. Based on these observations, we cannot but conclude that the concept of independence is far more basic to human reasoning than its numerical definition suggests. In fact, the definition of independence in terms of probabilities may be looked upon as a quantitative way of capturing the basic concept, which is qualitative in nature. To formalise properties of the qualitative concept of independence, J. Pearl and his co-researchers have designed an axiomatic system for independence [Pearl & Paz, 1985, Pearl & Verma, 1987, Geiger & Pearl, 1988]. In this section, we review this axiomatic system.

3.1.1 Pearl's Axiomatic System for Independence

We begin our review of Pearl's axiomatic system for independence by introducing some new terminology and notational conventions.

Definition 3.1.1 Let V be a set of statistical variables and let Pr be a joint probability distribution on V. Then, the independence relation I_Pr ⊆ P(V) × P(V) × P(V) of Pr is defined by

  (X, Z, Y) ∈ I_Pr if and only if Pr(C_X | C_Y ∧ C_Z) = Pr(C_X | C_Z)

for all sets of variables X, Y, Z ⊆ V.

In the sequel, we will write I_Pr(X, Z, Y) to denote (X, Z, Y) ∈ I_Pr and ¬I_Pr(X, Z, Y) to denote (X, Z, Y) ∉ I_Pr. A statement I_Pr(X, Z, Y) of a joint probability distribution's independence relation I_Pr is termed an independence statement. In qualitative terms, an independence statement I_Pr(X, Z, Y) expresses that in the context of information about Z, information about Y is irrelevant with respect to X. We note that the above definition allows for stating some trivial but convenient independence statements, such as I_Pr(X, X, Y), which holds iff Pr(C_X | C_Y ∧ C_X) = Pr(C_X | C_X), i.e. 1 = 1; its symmetric version I_Pr(Y, X, X) is also trivially true, since I_Pr(Y, X, X) iff Pr(C_Y | C_X ∧ C_X) = Pr(C_Y | C_X). In designing his axiomatic system for independence, Pearl builds on a set of properties that are satisfied by any joint probability distribution's independence relation; Theorem 3.1.2 reviews these properties.
Theorem 3.1.2 Let V be a set of statistical variables. Let Pr be a joint probability distribution on V and let I_Pr be its independence relation. Then, I_Pr satisfies the properties

  I_Pr(X, Z, Y) ⟹ I_Pr(Y, Z, X);  (symmetry)
  I_Pr(X, Z, Y ∪ W) ⟹ I_Pr(X, Z, Y) ∧ I_Pr(X, Z, W);  (decomposition)
  I_Pr(X, Z, Y ∪ W) ⟹ I_Pr(X, Z ∪ W, Y);  (weak union)
  I_Pr(X, Z, Y) ∧ I_Pr(X, Z ∪ Y, W) ⟹ I_Pr(X, Z, Y ∪ W);  (contraction)

for all mutually disjoint sets of variables X, Y, Z, W ⊆ V. If the distribution Pr is strictly positive, then I_Pr satisfies the additional property

  I_Pr(X, Z ∪ W, Y) ∧ I_Pr(X, Z ∪ Y, W) ⟹ I_Pr(X, Z, Y ∪ W);  (intersection)

for all mutually disjoint sets of variables X, Y, Z, W ⊆ V.
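That any distribution's independence relation satisfies these properties can be confirmed by brute force on a small example. The Python sketch below is our own illustration, not part of the syllabus; the enumeration strategy and the helper name indep are assumptions. It derives I_Pr for a strictly positive distribution on three binary variables and checks symmetry, decomposition, weak union and contraction for all mutually disjoint X, Y, Z, W.

```python
import itertools

# A strictly positive joint distribution on three binary variables.
weights = [3, 1, 4, 1, 5, 9, 2, 6]
pr = {cfg: w / sum(weights)
      for cfg, w in zip(itertools.product((0, 1), repeat=3), weights)}

def marg(assignment):
    """Marginal probability of a partial assignment {position: value}."""
    return sum(p for cfg, p in pr.items()
               if all(cfg[i] == v for i, v in assignment.items()))

def indep(X, Z, Y):
    """I_Pr(X, Z, Y): Pr(C_X | C_Y ^ C_Z) = Pr(C_X | C_Z), checked for
    every configuration of the variables involved."""
    for cfg in itertools.product((0, 1), repeat=3):
        ax = {i: cfg[i] for i in X}
        ay = {i: cfg[i] for i in Y}
        az = {i: cfg[i] for i in Z}
        lhs = marg({**ax, **ay, **az}) / marg({**ay, **az})
        rhs = marg({**ax, **az}) / marg(az)
        if abs(lhs - rhs) > 1e-9:
            return False
    return True

V = (0, 1, 2)
subsets = [frozenset(s) for n in range(4)
           for s in itertools.combinations(V, n)]
for X, Y, Z, W in itertools.product(subsets, repeat=4):
    if (X & Y) or (X & Z) or (X & W) or (Y & Z) or (Y & W) or (Z & W):
        continue  # the theorem is stated for mutually disjoint sets
    if indep(X, Z, Y):                          # symmetry
        assert indep(Y, Z, X)
    if indep(X, Z, Y | W):                      # decomposition and weak union
        assert indep(X, Z, Y) and indep(X, Z, W)
        assert indep(X, Z | W, Y)
    if indep(X, Z, Y) and indep(X, Z | Y, W):   # contraction
        assert indep(X, Z, Y | W)

print("symmetry, decomposition, weak union and contraction all hold")
```

Since the distribution is strictly positive, every conditional probability in the check is well defined; for distributions with zeroes the checker would need a guard against vanishing marginals.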

The properties stated in the previous theorem are easily verified from the basic axioms of probability theory. We would like to note that we have closely followed Pearl by stating the properties in the theorem to hold for mutually disjoint sets of variables only [Pearl, 1988]. These properties, however, also hold for overlapping sets of variables [Van der Gaag & Meyer, 1998]. Pearl now takes the properties stated in Theorem 3.1.2 as axioms for the qualitative concept of independence [Pearl, 1988]. Following the properties for I_Pr, Pearl assumed for each axiom that the sets of variables involved are mutually disjoint. Given the insight that the properties also hold for overlapping sets, we will lift the assumption of mutual disjointness in the next and all following definitions involving independence relations. The following now defines informational independence:

Definition 3.1.3 Let V be a set of statistical variables. A semi-graphoid independence relation on V is a ternary relation I ⊆ P(V) × P(V) × P(V) such that I satisfies the properties

  I(X, Z, Y) ⟹ I(Y, Z, X);
  I(X, Z, Y ∪ W) ⟹ I(X, Z, Y) ∧ I(X, Z, W);
  I(X, Z, Y ∪ W) ⟹ I(X, Z ∪ W, Y);
  I(X, Z, Y) ∧ I(X, Z ∪ Y, W) ⟹ I(X, Z, Y ∪ W);

for all sets of variables X, Y, Z, W ⊆ V. A graphoid independence relation I on V is a semi-graphoid independence relation on V such that I satisfies the additional property

  I(X, Z ∪ W, Y) ∧ I(X, Z ∪ Y, W) ⟹ I(X, Z, Y ∪ W);

for all sets of variables X, Y, Z, W ⊆ V.

The properties described in the previous definition together convey the idea that learning irrelevant information does not alter the independences among the variables discerned [Pearl, 1988]. We consider the qualitative meanings of the various properties separately. The property

  I(X, Z, Y) ⟹ I(Y, Z, X)

for all sets of variables X, Y, Z ⊆ V, states that if information about Y is deemed irrelevant with respect to X in the context of some information about Z, then information about X must be irrelevant with respect to Y in this context; this property is called the symmetry axiom.
The property

  I(X, Z, Y ∪ W) ⟹ I(X, Z, Y) ∧ I(X, Z, W)

for all sets of variables X, Y, Z, W ⊆ V, asserts that if information about both Y and W is judged irrelevant with respect to X, then both information about Y and information about W must be irrelevant with respect to X separately; this property is known as the decomposition axiom. We would like to note that the decomposition axiom may be reformulated as I(X, Z, Y ∪ W) ⟹ I(X, Z, Y); we have chosen, however, to use Pearl's original formulation because it conveys the idea of decomposition more clearly.

The property

  I(X, Z, Y ∪ W) ⟹ I(X, Z ∪ W, Y)

for all sets of variables X, Y, Z, W ⊆ V, states that learning information about W that is known to be irrelevant with respect to X cannot help irrelevant information about Y to become relevant with respect to X; this property is known as the weak union axiom. The property

  I(X, Z, Y) ∧ I(X, Z ∪ Y, W) ⟹ I(X, Z, Y ∪ W)

for all sets of variables X, Y, Z, W ⊆ V, states that if we judge information about W to be irrelevant with respect to X after learning some irrelevant information about Y, then the information about W must have been irrelevant with respect to X before we learned Y; this property is known as the contraction axiom. Note that the contraction axiom can be reformulated as I(X, Z, Y) ⟹ (I(X, Z ∪ Y, W) ⟹ I(X, Z, Y ∪ W)). From this reformulation, it is seen that the axiom can be looked upon as a conditional reverse of the weak union axiom. We now consider the property

  I(X, Z ∪ W, Y) ∧ I(X, Z ∪ Y, W) ⟹ I(X, Z, Y ∪ W)

for all sets of variables X, Y, Z, W ⊆ V, for graphoid independence relations. This property states that if, in the context of some information about Z, learning information about W renders information about Y irrelevant with respect to X, and learning Y renders W irrelevant with respect to X, then the information about both Y and W must be irrelevant with respect to X given Z; this property is known as the intersection axiom.

From Definition 3.1.3 and Theorem 3.1.2, we observe that every independence relation that is embedded in a joint probability distribution is a semi-graphoid independence relation; this property is stated more formally in the following corollary.

Corollary 3.1.4 Let V be a set of statistical variables. Let Pr be a joint probability distribution on V and let I_Pr be its independence relation. Then, I_Pr is a semi-graphoid independence relation. Furthermore, if Pr is strictly positive, then I_Pr is a graphoid independence relation.

Unfortunately, although any probability distribution's independence relation is a semi-graphoid independence relation, the reverse property does not hold.
There exist semi-graphoid independence relations for which there do not exist joint probability distributions embedding them; for details, we refer to [Van der Gaag & Meyer, 1996, Studený, 1989]. We would like to note that it has been shown that a finite axiomatisation of the concept of probabilistic independence does not exist [Studený, 1992].

3.1.2 Properties of Independence Relations

Using the definition of informational independence, we derive some convenient properties of (semi-graphoid and graphoid) independence relations. The following lemma shows that the symmetry and contraction axioms are easily generalised to bi-implications.

Lemma 3.1.5 Let V be a set of statistical variables. Furthermore, let I be a semi-graphoid independence relation on V. Then,

  I(X, Z, Y) ⟺ I(Y, Z, X);
  I(X, Z, Y) ∧ I(X, Z ∪ Y, W) ⟺ I(X, Z, Y ∪ W);

for all sets of variables X, Y, Z, W ⊆ V.

Proof. We begin our proof by observing that, since I is a semi-graphoid independence relation, it obeys the first four axioms stated in Definition 3.1.3. The first property stated in the lemma now follows directly from the symmetry axiom. For the second property, we observe that I(X, Z, Y) ∧ I(X, Z ∪ Y, W) ⟹ I(X, Z, Y ∪ W) coincides with the contraction axiom and therefore trivially holds for the relation I. We will now prove that I(X, Z, Y ∪ W) ⟹ I(X, Z, Y) ∧ I(X, Z ∪ Y, W). We have

  I(X, Z, Y ∪ W) ⟹ I(X, Z, Y) ∧ I(X, Z, W) ⟹ I(X, Z, Y)

by the decomposition axiom. In addition, we have

  I(X, Z, Y ∪ W) ⟹ I(X, Z ∪ Y, W)

by weak union. The property stated in the lemma now follows directly.

For graphoid independence relations, we have that the intersection axiom can also be generalised to a bi-implication.

Lemma 3.1.6 Let V be a set of statistical variables. Furthermore, let I be a graphoid independence relation on V. Then,

  I(X, Z ∪ W, Y) ∧ I(X, Z ∪ Y, W) ⟺ I(X, Z, Y ∪ W)

for all sets of variables X, Y, Z, W ⊆ V.

Proof. We will only prove that I(X, Z, Y ∪ W) ⟹ I(X, Z ∪ W, Y) ∧ I(X, Z ∪ Y, W); the reverse property coincides with the intersection axiom and therefore trivially holds for the independence relation I. We have

  I(X, Z, Y ∪ W) ⟹ I(X, Z ∪ W, Y)

and

  I(X, Z, Y ∪ W) ⟹ I(X, Z ∪ Y, W)

by the weak union axiom. The property stated in the lemma now follows directly.

In the sequel, we will use the phrase independence relation to denote a semi-graphoid independence relation, unless stated otherwise.

3.2 Graphical Representations of Independence

One of the problems in applying probability theory for automated reasoning with uncertainty in a knowledge-based system is the space complexity of representing a joint probability distribution. Since the concept of independence plays a key role in solving this problem, a formalism for representing joint probability distributions should allow for efficiently modelling independences. There are various ways of representing an independence relation. One way, for example, is to enumerate the separate statements of an independence relation explicitly. Such a representation clearly is impractical, as the number of tuples in an independence relation can be astronomical. Another way is to make use of the axioms from Definition 3.1.3: only the statements from an appropriate subset of the independence relation are enumerated explicitly, and all its other statements are defined implicitly by this set and the defining axioms. Although exploiting the axioms allows for a far more economical representation of an independence relation than explicit enumeration, it can still require exponential space. In this section, we consider more concise representations of independence relations, building on the idea of graphical encoding. In Section 3.2.1 we address modelling an independence relation in an undirected graph; in Section 3.2.2 we consider the representation of an independence relation in the formalism of directed graphs.

3.2.1 Undirected Graphs

Undirected graphs have no probabilistic meaning by themselves. For representing an independence relation in an undirected graph, therefore, a probabilistic meaning has to be assigned to the topological properties of such a graph; that is, we have to assign a probabilistic meaning to the vertices of the graph and to its edges. Informally speaking, we choose to encode an independence relation in an undirected graph by modelling the variables of the relation as vertices and by representing its independence statements by absence of edges.
To formally capture this meaning, we begin by defining a graphical criterion for reading from a graph sets of vertices that allow for blocking all paths between two given sets of vertices; this graphical criterion is termed the separation criterion for undirected graphs.

Definition 3.2.1 Let G = (V(G), E(G)) be an undirected graph. Let X, Y, Z ⊆ V(G) be sets of vertices in G. The set of vertices Z is said to separate the sets of vertices X and Y in G, denoted as ⟨X | Z | Y⟩_G, if for each vertex V_i ∈ X and each vertex V_j ∈ Y, every simple path from V_i to V_j in G contains at least one vertex from Z.

We look upon a separating set of variables as effectively blocking influence: if a set of variables Z separates two sets of variables X and Y, then Z is looked upon as blocking any flow of information or influence between X and Y. Two sets of variables X and Y that are thus separated by a set Z are now taken to be conditionally independent given Z.

Definition 3.2.2 Let V be a set of statistical variables and let I be an independence relation on V. Furthermore, let G = (V(G), E(G)) be an undirected graph with V(G) = V. Then, the graph G is called an undirected dependence map, or D-map for short, for I, if for all sets of variables X, Y, Z ⊆ V, we have: if I(X, Z, Y) then ⟨X | Z | Y⟩_G;
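The separation criterion is straightforward to operationalise: Z separates X and Y exactly when no vertex of Y is reachable from X once the vertices of Z are removed from the graph. The Python sketch below is our own illustration, not part of the syllabus; the adjacency-list representation and the function name separates are assumptions.

```python
from collections import deque

def separates(adj, X, Z, Y):
    """True iff Z separates X and Y in the undirected graph adj:
    every simple path from a vertex in X to a vertex in Y contains
    at least one vertex from Z."""
    # A vertex shared by X (or Y) and Z lies on every path from itself,
    # so such vertices may be dropped from X and Y.
    X, Z, Y = set(X) - set(Z), set(Z), set(Y) - set(Z)
    seen, queue = set(X), deque(X)
    while queue:                      # BFS over the graph with Z removed
        v = queue.popleft()
        if v in Y:
            return False              # found a path that avoids Z
        for w in adj.get(v, ()):
            if w not in Z and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# Example: a four-cycle A - B - C - D - A
adj = {'A': ['B', 'D'], 'B': ['A', 'C'],
       'C': ['B', 'D'], 'D': ['A', 'C']}
assert separates(adj, {'A'}, {'B', 'D'}, {'C'})  # both paths blocked
assert not separates(adj, {'A'}, {'B'}, {'C'})   # the path via D remains
print("separation checks passed")
```

Note that reachability suffices here: if any path from X to Y avoids Z, then in particular some simple path does.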


More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Department of Computer Science Artificial Intelligence Research Laboratory. Iowa State University MACHINE LEARNING

Department of Computer Science Artificial Intelligence Research Laboratory. Iowa State University MACHINE LEARNING MACHINE LEANING Vasant Honavar Bonformatcs and Computatonal Bology rogram Center for Computatonal Intellgence, Learnng, & Dscovery Iowa State Unversty honavar@cs.astate.edu www.cs.astate.edu/~honavar/

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION CAPTER- INFORMATION MEASURE OF FUZZY MATRI AN FUZZY BINARY RELATION Introducton The basc concept of the fuzz matr theor s ver smple and can be appled to socal and natural stuatons A branch of fuzz matr

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

Randomness and Computation

Randomness and Computation Randomness and Computaton or, Randomzed Algorthms Mary Cryan School of Informatcs Unversty of Ednburgh RC 208/9) Lecture 0 slde Balls n Bns m balls, n bns, and balls thrown unformly at random nto bns usually

More information

Conjugacy and the Exponential Family

Conjugacy and the Exponential Family CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Evaluation for sets of classes

Evaluation for sets of classes Evaluaton for Tet Categorzaton Classfcaton accuracy: usual n ML, the proporton of correct decsons, Not approprate f the populaton rate of the class s low Precson, Recall and F 1 Better measures 21 Evaluaton

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem. prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove

More information

EPR Paradox and the Physical Meaning of an Experiment in Quantum Mechanics. Vesselin C. Noninski

EPR Paradox and the Physical Meaning of an Experiment in Quantum Mechanics. Vesselin C. Noninski EPR Paradox and the Physcal Meanng of an Experment n Quantum Mechancs Vesseln C Nonnsk vesselnnonnsk@verzonnet Abstract It s shown that there s one purely determnstc outcome when measurement s made on

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

= z 20 z n. (k 20) + 4 z k = 4

= z 20 z n. (k 20) + 4 z k = 4 Problem Set #7 solutons 7.2.. (a Fnd the coeffcent of z k n (z + z 5 + z 6 + z 7 + 5, k 20. We use the known seres expanson ( n+l ( z l l z n below: (z + z 5 + z 6 + z 7 + 5 (z 5 ( + z + z 2 + z + 5 5

More information

Chapter 8 Indicator Variables

Chapter 8 Indicator Variables Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

MODELING TRAFFIC LIGHTS IN INTERSECTION USING PETRI NETS

MODELING TRAFFIC LIGHTS IN INTERSECTION USING PETRI NETS The 3 rd Internatonal Conference on Mathematcs and Statstcs (ICoMS-3) Insttut Pertanan Bogor, Indonesa, 5-6 August 28 MODELING TRAFFIC LIGHTS IN INTERSECTION USING PETRI NETS 1 Deky Adzkya and 2 Subono

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

Note on EM-training of IBM-model 1

Note on EM-training of IBM-model 1 Note on EM-tranng of IBM-model INF58 Language Technologcal Applcatons, Fall The sldes on ths subject (nf58 6.pdf) ncludng the example seem nsuffcent to gve a good grasp of what s gong on. Hence here are

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

Volume 18 Figure 1. Notation 1. Notation 2. Observation 1. Remark 1. Remark 2. Remark 3. Remark 4. Remark 5. Remark 6. Theorem A [2]. Theorem B [2].

Volume 18 Figure 1. Notation 1. Notation 2. Observation 1. Remark 1. Remark 2. Remark 3. Remark 4. Remark 5. Remark 6. Theorem A [2]. Theorem B [2]. Bulletn of Mathematcal Scences and Applcatons Submtted: 016-04-07 ISSN: 78-9634, Vol. 18, pp 1-10 Revsed: 016-09-08 do:10.1805/www.scpress.com/bmsa.18.1 Accepted: 016-10-13 017 ScPress Ltd., Swtzerland

More information

Thermodynamics and statistical mechanics in materials modelling II

Thermodynamics and statistical mechanics in materials modelling II Course MP3 Lecture 8/11/006 (JAE) Course MP3 Lecture 8/11/006 Thermodynamcs and statstcal mechancs n materals modellng II A bref résumé of the physcal concepts used n materals modellng Dr James Ellott.1

More information

Speech and Language Processing

Speech and Language Processing Speech and Language rocessng Lecture 3 ayesan network and ayesan nference Informaton and ommuncatons Engneerng ourse Takahro Shnozak 08//5 Lecture lan (Shnozak s part) I gves the frst 6 lectures about

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

Maximizing the number of nonnegative subsets

Maximizing the number of nonnegative subsets Maxmzng the number of nonnegatve subsets Noga Alon Hao Huang December 1, 213 Abstract Gven a set of n real numbers, f the sum of elements of every subset of sze larger than k s negatve, what s the maxmum

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Engineering Risk Benefit Analysis

Engineering Risk Benefit Analysis Engneerng Rsk Beneft Analyss.55, 2.943, 3.577, 6.938, 0.86, 3.62, 6.862, 22.82, ESD.72, ESD.72 RPRA 2. Elements of Probablty Theory George E. Apostolaks Massachusetts Insttute of Technology Sprng 2007

More information

Complete subgraphs in multipartite graphs

Complete subgraphs in multipartite graphs Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G

More information

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora prnceton unv. F 13 cos 521: Advanced Algorthm Desgn Lecture 3: Large devatons bounds and applcatons Lecturer: Sanjeev Arora Scrbe: Today s topc s devaton bounds: what s the probablty that a random varable

More information

a b a In case b 0, a being divisible by b is the same as to say that

a b a In case b 0, a being divisible by b is the same as to say that Secton 6.2 Dvsblty among the ntegers An nteger a ε s dvsble by b ε f there s an nteger c ε such that a = bc. Note that s dvsble by any nteger b, snce = b. On the other hand, a s dvsble by only f a = :

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

} Often, when learning, we deal with uncertainty:

} Often, when learning, we deal with uncertainty: Uncertanty and Learnng } Often, when learnng, we deal wth uncertanty: } Incomplete data sets, wth mssng nformaton } Nosy data sets, wth unrelable nformaton } Stochastcty: causes and effects related non-determnstcally

More information

Appendix B. The Finite Difference Scheme

Appendix B. The Finite Difference Scheme 140 APPENDIXES Appendx B. The Fnte Dfference Scheme In ths appendx we present numercal technques whch are used to approxmate solutons of system 3.1 3.3. A comprehensve treatment of theoretcal and mplementaton

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Formalisms For Fusion Belief in Design

Formalisms For Fusion Belief in Design XII ADM Internatonal Conference - Grand Hotel - Rmn Italy - Sept. 5 th -7 th, 200 Formalsms For Fuson Belef n Desgn Mchele Pappalardo DIMEC-Department of Mechancal Engneerng Unversty of Salerno, Italy

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information