Local Probability Models

Readings: K&F 3.4, 5.1~5.5
Local Probability Models
Lecture 3, Apr 4, 2011
CSE 515, Statistical Methods, Spring 2011
Instructor: Su-In Lee, University of Washington, Seattle

Outline
Last time: conditional parameterization; Bayesian networks; independencies in graphs.
Today: local independencies, d-separation, I-equivalence; from distributions to BN graphs; local probability models (CPDs): tabular, deterministic, context-specific, independence of causal influences, continuous variables.

I-equivalence between graphs
I(G) describes all conditional independencies implied by G. Different Bayesian networks can encode the same independencies. Example: the three graphs X → Z → Y, X ← Z ← Y, and X ← Z → Y all imply (X ⊥ Y | Z) and form equivalence class I, while the v-structure X → Z ← Y implies the marginal independence (X ⊥ Y) and forms equivalence class II.
Two BN graphs G1 and G2 are I-equivalent if I(G1) = I(G2).

I-equivalence between graphs (cont.)
If P factorizes over one graph in an I-equivalence class, then P factorizes over all other graphs in the same class, so P cannot distinguish one I-equivalent graph from another.
Implication for structure learning: we cannot identify the correct structure from within an equivalence class. We will revisit this later.
Test for I-equivalence: d-separation.

Test for I-equivalence
Necessary condition: same graph skeleton. Otherwise, one can find an active path in one graph but not the other. But this is not sufficient: v-structures matter.
Sufficient condition: same skeleton and same v-structures. But this is not necessary: complete graphs (no independencies; every two nodes are connected by some edge) can differ in their v-structures yet be I-equivalent.
Define a v-structure X → Z ← Y as an immorality if X and Y are not directly connected.
Necessary and sufficient: same skeleton and same set of immoralities (a code sketch of this test follows below).

Constructing graphs for P
Can we construct a graph for a distribution P? Any graph that is an I-map for P qualifies, but this is not so useful: complete graphs imply no independence assumptions, and thus are I-maps of any distribution.
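The skeleton-plus-immoralities test is mechanical enough to code directly. Below is a minimal sketch, assuming a DAG is given as a dict mapping each node to its parent set; the helper names (skeleton, immoralities, i_equivalent) are my own, not from the slides.

```python
from itertools import combinations

def skeleton(dag):
    """Undirected edge set of a DAG given as {node: set of parents}."""
    return {frozenset((x, p)) for x, ps in dag.items() for p in ps}

def immoralities(dag):
    """V-structures X -> Z <- Y whose endpoints X, Y are not adjacent."""
    skel = skeleton(dag)
    return {(frozenset((x, y)), z)
            for z, ps in dag.items()
            for x, y in combinations(sorted(ps), 2)
            if frozenset((x, y)) not in skel}

def i_equivalent(g1, g2):
    """G1, G2 are I-equivalent iff same skeleton and same immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# X -> Z -> Y vs. X <- Z <- Y: same skeleton, no immoralities -> I-equivalent
chain1 = {'X': set(), 'Z': {'X'}, 'Y': {'Z'}}
chain2 = {'Y': set(), 'Z': {'Y'}, 'X': {'Z'}}
vstruct = {'X': set(), 'Y': set(), 'Z': {'X', 'Y'}}   # X -> Z <- Y
print(i_equivalent(chain1, chain2))   # True
print(i_equivalent(chain1, vstruct))  # False: {X, Y} -> Z is an immorality
```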

Minimal I-maps
A graph G is a minimal I-map for P if:
G is an I-map for P, and
removing any edge from G renders it no longer an I-map for P.
Example: if a given graph over X, Y, Z, W is a minimal I-map for P, then none of the four graphs obtained by removing one of its edges is an I-map for P.

Bayesian network definition revisited
A Bayesian network is a pair (G, P):
P factorizes over G, and
P is specified as the set of CPDs associated with G's nodes.
Additional requirement: G is a minimal I-map for P.

Constructing minimal I-maps
Reverse factorization theorem: if P(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi)), then G is an I-map of P.
Algorithm for constructing a minimal I-map (sketched in code below)
Input: (1) a fixed ordering of nodes X1, ..., Xn; (2) the set of independencies that hold in P, denoted I.
For each i, select Pa(Xi) as a minimal subset of {X1, ..., Xi−1} such that (Xi ⊥ {X1, ..., Xi−1} − Pa(Xi) | Pa(Xi)) ∈ I.
(Outline of) proof of minimality: the result is an I-map, since the factorization above holds by construction; it is minimal, since by construction removing any edge destroys the factorization.

Non-uniqueness of minimal I-maps
Applying the same I-map construction with different orderings can lead to different structures. (Assume I(G) = I(P).)
The choice of ordering has drastic effects on the complexity of the minimal I-map graph; heuristic: use a causal order.
Different orderings make different independence assumptions (different skeletons; e.g., a marginal independence such as (A ⊥ B) may hold in one graph but not the other).
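The construction only needs an independence oracle for P. Here is a minimal sketch, assuming a hypothetical callback indep(x, rest, parents) that answers whether (x ⊥ rest | parents) holds in P; in practice the oracle could be d-separation in a known graph or a statistical test on data.

```python
from itertools import combinations

def minimal_i_map(order, indep):
    """Greedy minimal I-map for a fixed ordering.

    indep(x, rest, parents) must answer whether (x ⊥ rest | parents)
    holds in P.  For each X_i we take a smallest subset of predecessors
    that renders the remaining predecessors irrelevant; the full
    predecessor set always qualifies (rest is then empty).
    """
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        done = False
        for k in range(len(preds) + 1):       # smallest candidate sets first
            for cand in combinations(preds, k):
                rest = [p for p in preds if p not in cand]
                if indep(x, rest, set(cand)):
                    parents[x] = set(cand)
                    done = True
                    break
            if done:
                break
    return parents

# toy oracle for a distribution whose only independence is (Z ⊥ X | Y)
def indep(x, rest, parents):
    if not rest:
        return True
    return x == 'Z' and set(rest) == {'X'} and parents == {'Y'}

print(minimal_i_map(['X', 'Y', 'Z'], indep))
# {'X': set(), 'Y': {'X'}, 'Z': {'Y'}}  -- the chain X -> Y -> Z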

Perfect maps
G is a perfect map (P-map) for P if I(P) = I(G).
Does every distribution have a P-map? No:
Independencies may be encoded in the CPDs rather than the structure, e.g., a context-specific independence (X ⊥ Y | Z = z) that holds only for a particular value z.
Some sets of independencies cannot be represented by any BN structure. Example: the independencies in P are (A ⊥ C | B, D) and (B ⊥ D | A, C). In a candidate diamond graph (A → B, A → D, B → C ← D), (B ⊥ D | A, C) does not hold, while (B ⊥ D | A) holds in the graph but not in P.

Finding a perfect map
If P has a P-map, can we find it? Not uniquely, since I-equivalent graphs are indistinguishable; thus we represent the whole I-equivalence class and return it.
Recall the I-equivalence test: same skeleton and same set of immoralities. Finding a P-map:
Step I: find the skeleton.
Step II: find the immoralities.
Step III: direct the constrained edges.
Detailed algorithm: please read the textbook.

Outline
Last time: conditional parameterization; Bayesian networks; independencies in graphs; local independencies, d-separation, I-equivalence.
Today: from distributions to BN graphs; local probability models — conditional probability distributions (CPDs): table CPDs, deterministic CPDs, context-specific CPDs, independence of causal influences, continuous variables.

Table CPDs
One entry for each joint assignment of X and Pa(X); for each pa ∈ Val(Pa(X)): Σ_x P(x | pa) = 1.
Most general representation: represents every discrete CPD.
Example: P(I) with P(i0) = 0.7, P(i1) = 0.3; P(S | I) with P(s0 | i0) = 0.95, P(s1 | i0) = 0.05 and P(s0 | i1) = 0.2, P(s1 | i1) = 0.8.
Limitations:
Cannot model continuous random variables.
The number of parameters is exponential in |Pa(X)| — e.g., a table for P(S | I, D, C, H) needs a row for every joint assignment of I, D, C, H.
Cannot model large in-degree dependencies.
Ignores structure within the CPD.
How can we overcome these limitations?
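To make the exponential blow-up concrete, here is a minimal sketch of a table CPD for a binary child, assuming binary parents; table_cpd is a hypothetical helper, not from the slides.

```python
import itertools

def table_cpd(parents, probs):
    """Tabular CPD for a binary X: map each joint parent assignment
    to P(X=1 | pa).  One free parameter per row -> 2**len(parents)."""
    rows = list(itertools.product([0, 1], repeat=len(parents)))
    assert len(probs) == len(rows)
    return dict(zip(rows, probs))

# P(S | I) from the slide: P(s1 | i0) = 0.05, P(s1 | i1) = 0.8
p_s_given_i = table_cpd(['I'], [0.05, 0.8])
print(p_s_given_i[(1,)])   # 0.8
print(2 ** 10)             # a 10-parent binary table already needs 1024 rows
```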

Structured CPDs
Key idea: reduce the number of parameters by modeling P(X | Pa(X)) without explicitly determining all entries; we use constraints instead.
We lose expressive power (cannot represent every CPD). There are many variants depending on the constraints: deterministic, tree-structured, and rule-based CPDs.

Deterministic CPDs
There is a function f : Val(Pa(X)) → Val(X) such that
P(x | pa) = 1 if x = f(pa), and 0 otherwise.
Example functions: OR, XOR, AND, NAND; Z = X + Y (continuous variables).

Deterministic CPDs: example I
Deterministic CPDs induce additional conditional independencies. Example: if T is any deterministic function of T1 and T2, with S1 and S2 children of T, then (S1 ⊥ S2 | T1, T2): observing T1 and T2 determines T exactly.

Deterministic CPDs: example II
Example: if C is a deterministic OR of A and B, then (D ⊥ E | A, B): observing A and B determines C, even though C itself is unobserved.

Deterministic CPDs: example III
Example: if T is a deterministic OR of T1 and T2, then (S1 ⊥ S2 | T1 = true): once T1 is true, T is determined regardless of T2. This is a context-specific independence: it holds for one value of T1 but not the other.

Context-specific independencies
Let X, Y, Z be disjoint sets of random variables, let C be a set of variables, and let c ∈ Val(C).
X and Y are contextually independent given Z and c, denoted (X ⊥_c Y | Z, C = c), if
P(X | Y, Z, c) = P(X | Z, c) whenever P(Y, Z, c) > 0.
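The definition can be checked numerically on a small joint table. A minimal sketch (with Z empty, and contextually_independent a hypothetical helper): it tests P(x | y, c) = P(x | c) for every y with positive probability in the context c.

```python
import numpy as np

def contextually_independent(joint, c_value, tol=1e-9):
    """Check (X ⊥ Y | C = c) in a joint table joint[x, y, c] by testing
    P(x | y, c) = P(x | c) for all x, y with P(y, c) > 0."""
    slice_c = joint[:, :, c_value]                    # unnormalized P(x, y, c)
    p_x_given_c = slice_c.sum(axis=1) / slice_c.sum()
    for y in range(slice_c.shape[1]):
        p_yc = slice_c[:, y].sum()
        if p_yc > 0 and not np.allclose(slice_c[:, y] / p_yc,
                                        p_x_given_c, atol=tol):
            return False
    return True

# toy joint P(X, Y, C): X ⊥ Y when C = 0, dependent when C = 1
joint = np.empty((2, 2, 2))
joint[:, :, 0] = np.outer([0.3, 0.7], [0.5, 0.5]) * 0.5   # product slice
joint[:, :, 1] = np.array([[0.4, 0.1], [0.1, 0.4]]) * 0.5 # coupled slice
print(contextually_independent(joint, 0))  # True
print(contextually_independent(joint, 1))  # False
```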

Tree CPDs
A natural representation for capturing common elements in CPDs; uses a decision tree.
Example: job offer J (Val(J) = {yes, no}) with parents apply-in-time A (Val(A) = {yes, no}), quality of letter L (Val(L) = {good, bad}), and GPA score S (Val(S) = {high, low}).
Full table P(J | A, L, S), entries (P(j0 | ·), P(j1 | ·)):
a0 l0 s0: (0.95, 0.05); a0 l0 s1: (0.95, 0.05); a0 l1 s0: (0.95, 0.05); a0 l1 s1: (0.95, 0.05);
a1 l0 s0: (0.9, 0.1); a1 l0 s1: (0.3, 0.7); a1 l1 s0: (0.2, 0.8); a1 l1 s1: (0.2, 0.8).

Tree CPDs: example
The same CPD as a decision tree: if A = a0, leaf (0.95, 0.05); if A = a1 and L = l0, split on S — s0 gives (0.9, 0.1) and s1 gives (0.3, 0.7); if A = a1 and L = l1, leaf (0.2, 0.8).
The full table needs 8 parameters; the binary tree needs only 4.
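A tree CPD is just nested conditionals. A minimal sketch of the job-offer tree as reconstructed above (the exact tree shape is my reading of the garbled slide, so treat the structure as illustrative):

```python
def tree_cpd_job(a, l, s):
    """P(J = j1 | A=a, L=l, S=s) as a decision tree: 4 parameters
    instead of the 8 a full table needs."""
    if a == 0:                      # did not apply in time
        return 0.05
    if l == 0:                      # applied, weak letter: GPA decides
        return 0.1 if s == 0 else 0.7
    return 0.8                      # applied, good letter: letter dominates

print(tree_cpd_job(1, 0, 1))  # 0.7
```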

Rule CPDs
A rule r is a pair (c; p), where c is an assignment to a subset of variables C and p ∈ [0, 1]. Let Scope[r] = C.
A rule-based CPD P(X | Pa(X)) is a set of rules R such that:
For each rule r ∈ R: Scope[r] ⊆ {X} ∪ Pa(X).
For each assignment (x, u) to {X} ∪ Pa(X), there is exactly one rule (c; p) ∈ R such that c is compatible with (x, u); then P(X = x | Pa(X) = u) = p.

Rule CPDs: example
Let X be a binary variable with Pa(X) = {A, B, C}:
r1: (a1, b1, x0; 0.1)    r6: (a1, b1, x1; 0.9)
r2: (a0, c1, x0; 0.2)    r7: (a0, c1, x1; 0.8)
r3: (b0, c0, x0; 0.3)    r8: (b0, c0, x1; 0.7)
r4: (a1, b0, c1, x0; 0.4)    r9: (a1, b0, c1, x1; 0.6)
r5: (a0, b1, c0; 0.5)
Note: each full assignment maps to exactly one rule (r5 covers both values of X).
Rules cannot always be represented compactly within tree CPDs.
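A rule CPD evaluates by finding the unique compatible rule. A minimal sketch using the rule set above (rule_cpd is a hypothetical helper; the assertion enforces the exactly-one-rule requirement):

```python
RULES = [
    ({'A': 1, 'B': 1, 'X': 0}, 0.1),          # r1
    ({'A': 0, 'C': 1, 'X': 0}, 0.2),          # r2
    ({'B': 0, 'C': 0, 'X': 0}, 0.3),          # r3
    ({'A': 1, 'B': 0, 'C': 1, 'X': 0}, 0.4),  # r4
    ({'A': 0, 'B': 1, 'C': 0}, 0.5),          # r5 (covers both values of X)
    ({'A': 1, 'B': 1, 'X': 1}, 0.9),          # r6
    ({'A': 0, 'C': 1, 'X': 1}, 0.8),          # r7
    ({'B': 0, 'C': 0, 'X': 1}, 0.7),          # r8
    ({'A': 1, 'B': 0, 'C': 1, 'X': 1}, 0.6),  # r9
]

def rule_cpd(assignment):
    """P(X = x | A, B, C): exactly one rule must be compatible with
    each full assignment to {X} ∪ Pa(X)."""
    matches = [p for c, p in RULES
               if all(assignment[v] == val for v, val in c.items())]
    assert len(matches) == 1, "rule set must partition the assignments"
    return matches[0]

print(rule_cpd({'A': 1, 'B': 1, 'C': 0, 'X': 1}))  # 0.9 (rule r6)
```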

Tree CPDs and rule CPDs
Can represent every discrete CPD, and can easily be learned and handled in inference.
But some functions are not represented compactly. XOR in a tree CPD: a single split cannot separate (a1, b0) and (a0, b1) from the other assignments.
Alternative representations exist, e.g., complex logical rules.

Context-specific independencies: example
Consider a CPD tree for X over parents A, B, C, D that splits on A at the root, with the a0 subtree splitting on C and the a1 subtree splitting on B, and leaf distributions (0.9, 0.1), (0.7, 0.3), (0.2, 0.8), (0.4, 0.6). In both contexts A = a0 and A = a1, X does not depend on D, so the edge D → X is spurious. Reasoning by cases over the values of A implies that (X ⊥ D | A).

Independence of causal influence
Causes X1, ..., Xn; effect Y.
General case: Y has a complex dependency on X1, ..., Xn.
Common case: each Xi influences Y separately, and the influences of X1, ..., Xn are combined into an overall influence on Y.

Example 1: Noisy OR
Two independent causal mechanisms from X1 and X2 to Y. Key assumptions:
"OR": Y = y1 cannot happen unless at least one of X1, X2 occurs.
"Noisy": each causal mechanism is noisy.
Independence of the causal mechanisms:
P(Y = y0 | X1 = x1, X2 = x2) = P(Y = y0 | X1 = x1, X2 = x0) · P(Y = y0 | X1 = x0, X2 = x2).

Noisy OR: elaborate representation
Introduce the "effective value" of each cause: X1 → X1' and X2 → X2' through noisy CPDs with noise parameters λ1 = 0.9 and λ2 = 0.8, followed by a deterministic OR: Y = X1' ∨ X2'.

Noisy OR: general case
Y is a binary variable with n binary parents X1, ..., Xn. The CPD P(Y | X1, ..., Xn) is a noisy OR if there are n + 1 noise parameters λ0, λ1, ..., λn such that
P(Y = y0 | x1, ..., xn) = (1 − λ0) ∏_{i: xi = xi1} (1 − λi)
P(Y = y1 | x1, ..., xn) = 1 − (1 − λ0) ∏_{i: xi = xi1} (1 − λi)
where λ0 is the "leak" probability that Y turns on even when no cause is active.
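The general-case formula is a one-liner in code. A minimal sketch (noisy_or is a hypothetical helper), using the two-cause noise parameters from the slide and no leak:

```python
def noisy_or(lam0, lams, x):
    """Noisy-OR CPD: P(Y=1 | x) = 1 - (1 - lam0) * prod over active
    parents of (1 - lam_i).  lam0 is the leak, lams the per-cause noise."""
    p_y0 = 1.0 - lam0
    for lam_i, x_i in zip(lams, x):
        if x_i == 1:
            p_y0 *= 1.0 - lam_i
    return 1.0 - p_y0

print(noisy_or(0.0, [0.9, 0.8], [1, 0]))  # 0.9: only cause 1 active
print(noisy_or(0.0, [0.9, 0.8], [1, 1]))  # 0.98 = 1 - 0.1 * 0.2
```

Note the per-cause factors multiply independently, which is exactly the independence-of-mechanisms assumption: n + 1 parameters instead of 2^n.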

Generalized linear models (GLMs)
A soft version of a linear threshold function. Example: the logistic CPD (see the sketch below).
Binary variables X1, ..., Xn, Y. The (additive) combined effect of the Xi is
z = w0 + Σ_{i=1}^{n} wi · 1{Xi = xi1}.
Logistic CPD: P(Y = y1 | x1, ..., xn) = σ(z), where σ(z) = e^z / (1 + e^z) is the logistic (sigmoid) function, a smooth step function.

General formulation
The CPD P(Y | X1, ..., Xn) exhibits independence of causal influence (ICI) if it can be described via a network X1, ..., Xn → Z1, ..., Zn → Z → Y in which the CPD P(Z | Z1, ..., Zn) is deterministic.
Logistic: Zi = wi · 1{Xi = xi1}; Z = Σ Zi; Y = σ(Z).
Noisy OR: Zi has a noise model; Z is an OR function; Y is the identity CPD.
Key advantage: O(n) parameters.
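A minimal sketch of the logistic CPD (logistic_cpd is a hypothetical helper; the weights below are illustrative, not from the slides):

```python
import math

def logistic_cpd(w0, weights, x):
    """Logistic CPD: P(Y=1 | x) = sigma(w0 + sum_i w_i * 1{x_i = 1}),
    with sigma(z) = e^z / (1 + e^z), the smooth step function."""
    z = w0 + sum(w for w, xi in zip(weights, x) if xi == 1)
    return 1.0 / (1.0 + math.exp(-z))

print(logistic_cpd(-4.0, [3.0, 3.0], [0, 0]))  # ~0.018: no causes active
print(logistic_cpd(-4.0, [3.0, 3.0], [1, 1]))  # ~0.881: both causes active
```

As with noisy OR, this is an ICI model: n + 1 weights parameterize the CPD instead of a 2^n-row table.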

Continuous variables
One solution: discretize. But this often requires too many value states and loses domain structure.
Other solutions: use a continuous function for P(X | Pa(X)). Continuous and discrete variables can be combined, resulting in hybrid networks; inference and learning may become more difficult.

Gaussian density functions
Among the most common continuous representations. Univariate case:
P(X) ~ N(μ, σ²) if p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)).

Multivariate Gaussian density functions
A multivariate Gaussian distribution over X1, ..., Xn has an n-dimensional mean vector μ and an n×n positive definite covariance matrix Σ (positive definite: xᵀΣx > 0 for all nonzero x ∈ ℝⁿ).
Joint density function:
p(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))
μi = E[Xi]; Σii = Var[Xi]; Σij = Cov[Xi, Xj] = E[Xi Xj] − E[Xi] E[Xj] (i ≠ j).

Gaussian density functions (cont.)
Marginal distributions are easy to compute: if P(X, Y) = N((μX, μY); Σ), then P(X) = N(μX; ΣXX) — just read off the corresponding blocks.
Independencies can be determined from the parameters: if X = X1, ..., Xn has a joint normal distribution N(μ; Σ), then (Xi ⊥ Xj) iff Σij = 0 (for i ≠ j).
This does not hold in general for non-Gaussian distributions.
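Both facts — block marginalization and reading independence off Σ — are directly visible in code. A minimal sketch with illustrative numbers (the matrix below is my own example, chosen to be positive definite):

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 2.0, 0.3],
                  [0.0, 0.3, 1.5]])

# Marginalization just selects the corresponding entries/blocks:
idx = [0, 2]                      # marginal over (X1, X3)
print(mu[idx])                    # mean vector of the marginal
print(Sigma[np.ix_(idx, idx)])    # covariance block of the marginal

# Marginal independence can be read off the covariance:
print(Sigma[0, 2] == 0.0)         # True: X1 ⊥ X3 in this Gaussian
```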

Linear Gaussian CPDs
X is a continuous variable with parents Y1, ..., Yn. X has a linear Gaussian model if it can be described using parameters β0, β1, ..., βn and σ² such that
P(X | y1, ..., yn) = N(β0 + β1 y1 + ... + βn yn; σ²).
Vector notation: P(X | y) = N(β0 + βᵀy; σ²).
Pros: simple; captures many interesting dependencies.
Cons: fixed variance (the variance cannot depend on the parents' values).

Linear Gaussian Bayesian networks
A linear Gaussian Bayesian network is a Bayesian network where all variables are continuous and all CPDs are linear Gaussian.
Key result: linear Gaussian models are equivalent to multivariate Gaussian density functions. Proof? (See the two-node sketch below.)
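The key result can be checked empirically in the simplest case. A minimal sketch, assuming a two-node chain X1 → X2 (all numbers illustrative): the derived joint Gaussian should match Monte Carlo samples drawn from the network.

```python
import numpy as np

# Linear Gaussian chain X1 -> X2:
#   P(X1)      = N(mu1; s1^2)
#   P(X2 | x1) = N(b0 + b1*x1; s2^2)
mu1, s1 = 1.0, 0.5
b0, b1, s2 = 2.0, 3.0, 0.4

# Equivalent joint Gaussian, obtained by integrating out X1:
mu = np.array([mu1, b0 + b1 * mu1])
Sigma = np.array([[s1**2,      b1 * s1**2],
                  [b1 * s1**2, s2**2 + b1**2 * s1**2]])
print(mu, Sigma)

# Sanity check by ancestral sampling from the network itself:
rng = np.random.default_rng(0)
x1 = rng.normal(mu1, s1, size=200_000)
x2 = rng.normal(b0 + b1 * x1, s2)
print(np.mean(x1), np.mean(x2))            # ~1.0, ~5.0
print(np.cov(np.vstack([x1, x2])))         # ~Sigma
```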

Hybrid models
Models of continuous and discrete variables:
Continuous variables with discrete parents.
Discrete variables with continuous parents.
Conditional linear Gaussians (CLG): a continuous variable X with continuous parents Y1, ..., Yn and discrete parents U = {U1, ..., Um}; for every assignment u ∈ Val(U):
P(X | u, y1, ..., yn) = N(a_{u,0} + Σ_{i=1}^{n} a_{u,i} yi; σ²_u).
A conditional linear Gaussian Bayesian network is one where discrete variables have only discrete parents and continuous variables have only CLG CPDs.

Hybrid models: continuous parents for discrete children
Threshold models: P(Y = y1 | X = x) = 0.9 if x is below a threshold, and 0.05 otherwise.
Linear sigmoid: P(Y = y1 | x1, ..., xk) = σ(w0 + Σ_{i=1}^{k} wi xi).
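A minimal sketch of the CLG CPD defined above (clg_sample and its parameter layout are my own illustration): the discrete parent value selects which linear Gaussian in the continuous parents to use.

```python
import numpy as np

def clg_sample(rng, u, y, params):
    """Conditional linear Gaussian: params[u] = (a0, a, sigma) encodes
    P(X | u, y) = N(a0 + a.y; sigma^2); the discrete parent u selects
    the regime, the continuous parents y set the mean."""
    a0, a, sigma = params[u]
    return rng.normal(a0 + np.dot(a, y), sigma)

params = {0: (0.0, np.array([1.0, -1.0]), 0.5),
          1: (5.0, np.array([0.2, 0.2]), 2.0)}
rng = np.random.default_rng(1)
print(clg_sample(rng, 0, np.array([2.0, 1.0]), params))  # draw from N(1.0, 0.25)
print(clg_sample(rng, 1, np.array([2.0, 1.0]), params))  # draw from N(5.6, 4.0)
```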

Summary: CPD models
Deterministic functions.
Context-specific dependencies: tree CPDs, rule CPDs.
Independence of causal influence: noisy OR, logistic function.
CPDs capture additional domain structure.

Acknowledgement
These lecture notes were generated based on the slides from Prof. Eran Segal.
