Variable Elimination: Basic Ideas
Sargur Srihari, srihari@cedar.buffalo.edu
Topics
1. Types of Inference Algorithms
2. Variable Elimination: the Basic Ideas
3. Variable Elimination
   - Sum-Product VE Algorithm
   - Sum-Product VE for Conditional Probabilities
4. Variable Ordering for VE
Inference Algorithms
Types of inference algorithms:
1. Exact
   1. Variable Elimination
   2. Clique trees (Belief Propagation)
2. Approximate
   1. Optimization
      1. Propagation with approximate messages
      2. Variational (analytical approximations)
   2. Particle-based (sampling)
Variable Elimination: Basic Ideas
We begin with the principles underlying exact inference in PGMs. As we show, the same BN structure that allows compact representation of complex distributions also supports inference. In particular, we can use dynamic programming techniques to perform inference even for large and complex networks in reasonable time.
Intuition for Variable Elimination
Consider inference in a very simple BN: the chain A → B → C → D.
[Figure: chain BN with nodes A, B, C, D]
E.g., a sequence of words, where the CPDs are first-order word transition probabilities and the variables take word values such as "The", "quick", "brown", "fox".
We consider a phased computation: use the results of a previous phase in the computation of the next phase. We then reformulate this process in terms of a global computation on the joint distribution.
Exact Inference: Variable Elimination
To compute P(B), i.e., the distribution over values b of B, we have
  P(B) = Σ_a P(a, B) = Σ_a P(a) P(B | a)
The required factors P(a) and P(b | a) are available in the BN.
If A has k values and B has m values: for each b, this takes k multiplications and k − 1 additions. Since the process is repeated for each of the m values of B, the computation is O(k × m).
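The single marginalization step above can be sketched as follows. The CPD numbers are invented for illustration; only the formula P(b) = Σ_a P(a) P(b | a) comes from the slide.

```python
# Hypothetical CPDs for one edge A -> B of the chain (values indexed 0 and 1).
# The probabilities are made up; the computation pattern is what matters.
p_a = [0.6, 0.4]                      # P(A)
p_b_given_a = [[0.7, 0.3],            # P(B | A = a0)
               [0.2, 0.8]]            # P(B | A = a1)

def marginal_b(p_a, p_b_given_a):
    """P(b) = sum_a P(a) P(b|a): k multiplications and k-1 additions
    per value of b, so O(k * m) total for k values of A and m of B."""
    m = len(p_b_given_a[0])
    return [sum(p_a[a] * p_b_given_a[a][b] for a in range(len(p_a)))
            for b in range(m)]

p_b = marginal_b(p_a, p_b_given_a)
print(p_b)  # a proper distribution: the entries sum to 1
```

Note that the function returns P(b) for every value b at once, which is exactly the "sets of values at a time" behavior discussed later.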
Moving Down the BN
Assume we want to compute P(C). By the same analysis,
  P(C) = Σ_b P(b, C) = Σ_b P(b) P(C | b)
P(c | b) is given in a CPD, but P(B) is not given as a network parameter. It can be computed using
  P(B) = Σ_a P(a, B) = Σ_a P(a) P(B | a)
If the variables have k values each, the complexity is O(k²).
Computation Depends on Structure
1. The structure of the BN is critical for the computation. If A had also been a parent of C, then P(C) = Σ_b P(b) P(C | b) would not have sufficed.
2. The algorithm does not compute single values but sets of values at a time: P(b) over all possible values b of B is used to compute P(C).
Complexity of a General Chain
In general, if we have a chain X_1 → X_2 → … → X_n and each X_i has k values, the total cost is O(n k²).
Naïve evaluation, generating the entire joint and summing it out, would generate k^n probabilities for the events x_1, …, x_n.
In this example, despite the exponential size of the joint distribution, we can do inference in linear time.
Insight that Avoids Exponential Blowup
The joint probability decomposes as
  P(A, B, C, D) = P(A) P(B | A) P(C | B) P(D | C)
To compute P(D) we need to sum together all entries where D = d1, and separately all entries where D = d2. The exact computation is
  P(d) = Σ_a Σ_b Σ_c P(a) P(b | a) P(c | b) P(d | c)
which, for binary variables, is a sum of eight terms per value of d.
Examine this summation: the 3rd and 4th factors of the first two terms are the same, P(c1 | b1) P(d1 | c1). We can modify the computation to first compute P(a1) P(b1 | a1) + P(a2) P(b1 | a2) and then multiply by the common term.
First Transformation of the Sum
The same structure is repeated throughout the table. Performing the same transformation on every such pair of terms, the summation for P(D) becomes a sum of products in which subexpressions of the form Σ_a P(a) P(b | a) appear explicitly.
Observe that certain terms are repeated several times in this expression:
  P(a1) P(b1 | a1) + P(a2) P(b1 | a2)  and  P(a1) P(b2 | a1) + P(a2) P(b2 | a2)
are each repeated four times (once for each combination of values of C and D).
Second and Third Transformations of the Sum
Defining τ1 : Val(B) → R, where τ1(b1) and τ1(b2) are the two repeated expressions, i.e.
  τ1(b) = Σ_a P(a) P(b | a)
we get
  P(D) = Σ_c P(D | c) Σ_b τ1(b) P(c | b)
Here we have reversed the order of a sum and a product: sum first, product next.
Fourth Transformation of the Sum
Again, notice shared expressions that are better computed once and used multiple times. We define τ2 : Val(C) → R:
  τ2(c1) = τ1(b1) P(c1 | b1) + τ1(b2) P(c1 | b2)
  τ2(c2) = τ1(b1) P(c2 | b1) + τ1(b2) P(c2 | b2)
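The τ1, τ2 pipeline can be written out directly for the binary chain. The CPD numbers below are invented (the slides never give concrete probabilities); each line mirrors one of the transformations above and costs 4 multiplications and 2 additions.

```python
# Invented binary CPDs for the chain A -> B -> C -> D (illustrative numbers).
p_a   = [0.6, 0.4]                    # P(A)
p_b_a = [[0.7, 0.3], [0.2, 0.8]]      # P(B | A)
p_c_b = [[0.9, 0.1], [0.4, 0.6]]      # P(C | B)
p_d_c = [[0.5, 0.5], [0.1, 0.9]]      # P(D | C)

# tau1(b) = P(a1)P(b|a1) + P(a2)P(b|a2)        -- 4 mults, 2 adds
tau1 = [sum(p_a[a] * p_b_a[a][b] for a in range(2)) for b in range(2)]
# tau2(c) = tau1(b1)P(c|b1) + tau1(b2)P(c|b2)  -- 4 mults, 2 adds
tau2 = [sum(tau1[b] * p_c_b[b][c] for b in range(2)) for c in range(2)]
# P(d)  = tau2(c1)P(d|c1) + tau2(c2)P(d|c2)    -- 4 mults, 2 adds
p_d  = [sum(tau2[c] * p_d_c[c][d] for c in range(2)) for d in range(2)]
print(p_d)  # ~ [0.36, 0.64] for these invented CPDs
```

Each τ is computed once and reused, which is exactly what removes the repeated subexpressions identified in the transformations.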
Summary of the Computation
We begin by computing τ1(B), which requires 4 multiplications and 2 additions. Using it, we can compute τ2(C), which also requires 4 multiplications and 2 additions. Finally, we compute P(D) at the same cost.
The total number of operations is 18. Computing via the full joint distribution instead requires 16 × 3 = 48 multiplications and 14 additions.
Computation Summary
The transformation we have performed has the following steps. We start with
  P(D) = Σ_C Σ_B Σ_A P(A) P(B | A) P(C | B) P(D | C)
We push in the summations, resulting in
  P(D) = Σ_C P(D | C) Σ_B P(C | B) Σ_A P(A) P(B | A)
We compute the product ψ1(A, B) = P(A) P(B | A) and sum out A to obtain the function τ1: for each value b,
  τ1(b) = Σ_A ψ1(A, b) = Σ_A P(A) P(b | A)
We then continue with
  ψ2(B, C) = τ1(B) P(C | B),   τ2(C) = Σ_B ψ2(B, C)
The resulting τ2(C) is used to compute P(D).
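The ψ/τ pattern generalizes to a small sum-product elimination routine. This is a sketch under assumed conventions, not the book's code: a factor is a pair (tuple of variable names, dict from assignment tuples to values), all variables are binary for brevity, and the CPD numbers are invented.

```python
from itertools import product

def multiply(f, g):
    """Factor product: psi(U) = f * g over the union of their variables."""
    fv, ft = f
    gv, gt = g
    vs = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for asg in product((0, 1), repeat=len(vs)):
        env = dict(zip(vs, asg))
        table[asg] = ft[tuple(env[v] for v in fv)] * gt[tuple(env[v] for v in gv)]
    return (vs, table)

def sum_out(f, var):
    """Marginalize var out of factor f: tau = sum_var psi."""
    fv, ft = f
    vs = tuple(v for v in fv if v != var)
    table = {}
    for asg, val in ft.items():
        key = tuple(a for v, a in zip(fv, asg) if v != var)
        table[key] = table.get(key, 0.0) + val
    return (vs, table)

def eliminate(factors, order):
    """For each variable Z in order: multiply the factors mentioning Z
    (the psi step), sum Z out (the tau step), return tau to the pool."""
    for z in order:
        touching = [f for f in factors if z in f[0]]
        rest = [f for f in factors if z not in f[0]]
        psi = touching[0]
        for f in touching[1:]:
            psi = multiply(psi, f)
        factors = rest + [sum_out(psi, z)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Invented CPDs for the chain A -> B -> C -> D:
P_A  = (("A",), {(0,): 0.6, (1,): 0.4})
P_BA = (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
P_CB = (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6})
P_DC = (("C", "D"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9})

vars_, table = eliminate([P_A, P_BA, P_CB, P_DC], ["A", "B", "C"])
print({d: table[(d,)] for d in (0, 1)})   # P(D), summing to 1
```

Eliminating in the order A, B, C reproduces exactly the ψ1/τ1, ψ2/τ2 sequence of the slide; a different order would still give P(D) but with differently shaped intermediate factors, which is the subject of the later slides on variable ordering.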
The Computation is Dynamic Programming
The naïve way would have us compute every
  P(b) = Σ_A P(A) P(b | A)
many times, once for every value of C and D; for a chain of length n, such subexpressions would be computed exponentially many times.
Dynamic programming inverts the order of computation, performing it inside out rather than outside in: in
  P(D) = Σ_C P(D | C) Σ_B P(C | B) Σ_A P(A) P(B | A)
we first compute τ1(B) once, for all values of B, which then allows us to compute τ2(C) once and for all, etc.
Ideas that Prevented Exponential Blowup
Because of the structure of the BN, some subexpressions depend only on a small number of variables. By computing and caching these results, we can avoid generating them an exponential number of times.