EDML: A Method for Learning Parameters in Bayesian Networks


Arthur Choi, Khaled S. Refaat and Adnan Darwiche
Computer Science Department
University of California, Los Angeles
{aychoi, krefaat, darwiche}@cs.ucla.edu

Abstract

We propose a method called EDML for learning MAP parameters in binary Bayesian networks under incomplete data. The method assumes Beta priors and can be used to learn maximum likelihood parameters when the priors are uninformative. EDML exhibits interesting behaviors, especially when compared to EM. We introduce EDML, explain its origin, and study some of its properties both analytically and empirically.

1 INTRODUCTION

We consider in this paper the problem of learning Bayesian network parameters given incomplete data, while assuming that all network variables are binary. We propose a specific method, EDML,¹ which has a similar structure and complexity to the EM algorithm (Dempster, Laird, & Rubin, 1977; Lauritzen, 1995). EDML assumes Beta priors on network parameters, allowing one to compute MAP parameters. When using uninformative priors, EDML reduces to computing maximum likelihood (ML) parameters. EDML originated from applying an approximate inference algorithm (Choi & Darwiche, 2006) to a meta network in which parameters are explicated as variables, and on which data is asserted as evidence. The update equations of EDML resemble the ones for EM, yet EDML appears to have different convergence properties, which stem from its being an inference method as opposed to a local search method. For example, we will identify a class of incomplete datasets on which EDML is guaranteed to converge immediately to an optimal solution, by simply reasoning about the behavior of its underlying inference method.

¹ EDML stands for Edge-Deletion MAP-Learning or Edge-Deletion Maximum-Likelihood, as it is based on an edge-deletion approximate inference algorithm that can compute MAP or maximum likelihood parameters.

Even though EDML originates in a rather involved approximate inference scheme, its update equations can be intuitively justified independently. We therefore present EDML initially in Section 3 before delving into the details of how it was originally derived in Section 5. Intuitively, EDML can be thought of as relying on two key concepts. The first concept is that of estimating the parameters of a single random variable given soft observations, i.e., observations that provide soft evidence on the values of a random variable. The second key concept behind EDML is that of interpreting the examples of an incomplete dataset as providing soft observations on the random variables of a Bayesian network. As to the first concept, we also show that MAP and ML parameter estimates are unique in this case, therefore generalizing the fundamental result which says that these estimates are unique for hard observations. This result is interesting and fundamental enough that we treat it separately in Section 4, before we move on and discuss the origin of EDML in Section 5. We discuss some theoretical properties of EDML in Section 6, where we identify situations in which it is guaranteed to converge immediately to optimal estimates. We present some preliminary empirical results in Section 7 that corroborate some of the predicted convergence behaviors. In Section 8, we close with some concluding remarks on related and future work. We note that while we focus on binary variables here, our approach generalizes to multivalued variables as well. We will comment later on this and the reason we restricted our focus here.

2 TECHNICAL PRELIMINARIES

We use upper case letters (X) to denote variables and lower case letters (x) to denote their values. Variable

sets are denoted by bold-face upper case letters (X) and their instantiations by bold-face lower case letters (x). Since our focus is on binary variables, we use x (positive) and x̄ (negative) to denote the two values of binary variable X. Generally, we will use X to denote a variable in a Bayesian network and U to denote its parents. A network parameter will therefore have the general form θ_{x|u}, representing the probability Pr(X=x | U=u). Note that variable X can be thought of as inducing a number of conditional random variables, denoted X|u, where the values of variable X|u are drawn based on the conditional distribution Pr(X | u). In fact, parameter estimation in Bayesian networks can be thought of as a process of estimating the distributions of these conditional random variables. Since we assume binary variables, each of these distributions can be characterized by the single parameter θ_{x|u}, since θ_{x̄|u} = 1 − θ_{x|u}. We will use θ to denote the set of all network parameters.

Given a network structure G in which all variables are binary, our goal is to learn its parameters from an incomplete dataset, such as:

  example   X    Y    Z
  1         x    ȳ    ?
  2         ?    ȳ    ?
  3         x̄    ?    z

We use D to denote a dataset, and d_i to denote an example. The dataset above has three examples, with d_3 being the instantiation X=x̄, Z=z. A commonly used measure for the quality of parameter estimates θ is their likelihood, defined as:

  L(\theta \mid D) = \prod_{i=1}^{N} \mathrm{Pr}_\theta(d_i),

where Pr_θ is the distribution induced by network structure G and parameters θ. In the case of complete data (each example fixes the value of each variable), the ML parameters are unique. Learning ML parameters is harder when the data is incomplete, where EM is typically employed. EM starts with some initial parameters θ⁰, called a seed, and successively improves on them via iteration. EM uses the update equation:

  \theta^{k+1}_{x|u} = \frac{\sum_{i=1}^{N} \mathrm{Pr}_{\theta^k}(xu \mid d_i)}{\sum_{i=1}^{N} \mathrm{Pr}_{\theta^k}(u \mid d_i)},

which requires inference on a Bayesian network parameterized by θ^k, in order to compute Pr_{θ^k}(xu | d_i) and Pr_{θ^k}(u | d_i). In fact, one run of the jointree algorithm on each distinct example is sufficient to implement an iteration of EM, which is guaranteed to never decrease the likelihood of its estimates across iterations. EM can also converge to any local maximum, given that it starts with an appropriate seed. It is common to run EM with multiple seeds, keeping the best local maximum it finds. See (Darwiche, 2009; Koller & Friedman, 2009) for recent treatments of parameter learning in Bayesian networks via EM and related methods.

EM can also be used to find MAP parameters, assuming one has some priors on network parameters. The Beta distribution is commonly used as a prior on the probability of a binary random variable. In particular, the Beta for random variable X|u is specified by two exponents, α_{X|u} and β_{X|u}, leading to a density proportional to [θ_{x|u}]^{α_{X|u}−1} [1 − θ_{x|u}]^{β_{X|u}−1}. It is common to assume that the exponents are > 1 (the density is then unimodal). For MAP parameters, EM uses the update equation (see, e.g., (Darwiche, 2009)):

  \theta^{k+1}_{x|u} = \frac{\alpha_{X|u} - 1 + \sum_{i=1}^{N} \mathrm{Pr}_{\theta^k}(xu \mid d_i)}{\alpha_{X|u} + \beta_{X|u} - 2 + \sum_{i=1}^{N} \mathrm{Pr}_{\theta^k}(u \mid d_i)}.

When α_{X|u} = β_{X|u} = 1 (uninformative prior), the equation reduces to the one for computing ML parameters. When computing ML parameters, using α_{X|u} = β_{X|u} = 2 leads to what is usually known as Laplace smoothing. This is a common technique to deal with the problem of insufficient counts (i.e., instantiations that never appear in the dataset, leading to zero probabilities and division by zero). We will indeed use Laplace smoothing in our experiments.
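To make the update above concrete, the following Python sketch (our own illustration, not code from the paper) applies the MAP update for a single parameter θ_{x|u}, given the per-example marginals Pr_{θ^k}(xu | d_i) and Pr_{θ^k}(u | d_i) that inference (e.g., a jointree run) would supply; the function name and its inputs are hypothetical.

def em_map_update(pr_xu_given_d, pr_u_given_d, alpha=2.0, beta=2.0):
    """One EM update for a single parameter theta_{x|u}.

    pr_xu_given_d[i] is Pr_{theta^k}(xu | d_i) and pr_u_given_d[i] is
    Pr_{theta^k}(u | d_i), computed by inference under the current
    estimates theta^k.  With alpha = beta = 1 this is the ML update;
    alpha = beta = 2 corresponds to Laplace smoothing."""
    numerator = (alpha - 1.0) + sum(pr_xu_given_d)
    denominator = (alpha + beta - 2.0) + sum(pr_u_given_d)
    return numerator / denominator

# Three examples, with expected counts computed under some current estimate.
print(em_map_update([0.9, 0.2, 0.5], [1.0, 0.6, 0.5]))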
Our method for learning MAP and ML parameters makes heavy use of two notions: (1) the odds of an event, which is the probability of the event over the probability of its negation, and (2) the Bayes factor (Good, 1950), which is the relative change in the odds of one event, say, X=x, due to observing some other event, say, η. In this case, we have the odds O(x) and O(x | η), where the Bayes factor is κ = O(x | η)/O(x), which is viewed as quantifying the strength of soft evidence η on X=x. It is known that κ = Pr(η | x)/Pr(η | x̄) and κ ∈ [0, ∞]. When κ = 0, the soft evidence reduces to hard evidence asserting X=x̄. When κ = ∞, the soft evidence reduces to hard evidence asserting X=x. When κ = 1, the soft evidence is neutral and bears no information on X=x. A detailed discussion on the use of Bayes factors for soft evidence is given in (Chan & Darwiche, 2005).

3 AN OVERVIEW OF EDML

Consider Algorithm 1, which provides pseudocode for EM. EM typically starts with some initial parameter estimates, called a seed, and then iterates to monotonically improve on these estimates. Each iteration consists of two steps. The first step, Line 3, computes marginals over the families of a Bayesian network that is parameterized by the current estimates. The second step, Line 4, uses the computed probabilities to update the network parameters.

Algorithm 1 (EM)
input:
  G: A Bayesian network structure
  D: An incomplete dataset d_1, ..., d_N
  θ: An initial parameterization of structure G
  α_{X|u}, β_{X|u}: Beta prior for each random variable X|u
1: while not converged do
2:   Pr ← the distribution induced by θ and G
3:   Compute probabilities Pr(xu | d_i) and Pr(u | d_i), for each family instantiation xu and example d_i
4:   Update parameters:
       θ_{x|u} ← (α_{X|u} − 1 + Σ_{i=1}^N Pr(xu | d_i)) / (α_{X|u} + β_{X|u} − 2 + Σ_{i=1}^N Pr(u | d_i))
5: return parameterization θ

Algorithm 2 (EDML)
input: same as Algorithm 1
1: while not converged do
2:   Pr ← the distribution induced by θ and G
3:   Compute Bayes factors, for each family instantiation xu and example d_i:
       κ^i_{x|u} ← (Pr(xu | d_i)/Pr(x | u) − Pr(u | d_i) + 1) / (Pr(x̄u | d_i)/Pr(x̄ | u) − Pr(u | d_i) + 1)   (1)
4:   Update parameters:
       θ_{x|u} ← argmax_p [p]^{α_{X|u}−1} [1 − p]^{β_{X|u}−1} ∏_{i=1}^N [κ^i_{x|u} p − p + 1]   (2)
5: return parameterization θ

The process continues until some convergence criterion is met. The main point here is that the computation on Line 3 can be implemented by a single run of the jointree algorithm, while the update on Line 4 is immediate.

Consider now Algorithm 2, which provides pseudocode for EDML, to be contrasted with the one for EM. The two algorithms clearly have the same overall structure. That is, EDML also starts with some initial parameter estimates, called a seed, and then iterates to update these estimates. Each iteration consists of two steps. The first step, Line 3, computes Bayes factors using a Bayesian network that is parameterized by the current estimates. The second step, Line 4, uses the computed Bayes factors to update the network parameters. The process continues until some convergence criterion is met. Much like EM, the computation on Line 3 can be implemented by a single run of the jointree algorithm. Unlike EM, however, the update on Line 4 is not immediate as it involves solving an optimization problem, albeit a simple one. Aside from this optimization task, EM and EDML have the same computational complexity. We next explain the two concepts underlying EDML and how they lead to the equations of Algorithm 2.

3.1 ESTIMATION FROM SOFT OBSERVATIONS

Consider a random variable X with values x and x̄, and suppose that we have N > 0 independent observations of X, with N_x as the number of positive observations. It is well known that the ML parameter estimates for random variable X are unique in this case and characterized by θ_x = N_x/N. If one further assumes a Beta prior with exponents α and β that are ≥ 1, it is also known that the MAP parameter estimates are unique and characterized by θ_x = (N_x + α − 1)/(N + α + β − 2). Consider now a more general problem in which the observations are soft, in that they only provide soft evidence on the values of random variable X. That is, each soft observation η_i is associated with a Bayes factor κ^i_x = O(x | η_i)/O(x), which quantifies the evidence that η_i provides on having observed the value x of variable X. We will show later that the ML estimates remain unique in this more general case, if at least one of the soft observations is not trivial (i.e., has Bayes factor κ^i_x ≠ 1). Moreover, we will show that the MAP estimates are also unique assuming a Beta prior with exponents ≥ 1. In particular, we will show that the unique MAP estimates are characterized by Equation 2 of Algorithm 2. Further, we will show that the unique ML estimates are characterized by the same equation while using a Beta prior with exponents equal to 1.
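To make the Line 4 update of Algorithm 2 concrete, here is a Python sketch (our own, written under the assumption, established in Section 4, that the objective of Equation 2 is strictly log-concave) that maximizes Equation 2 over p in (0, 1) by bisection on the derivative of its logarithm; the function name is hypothetical, and an infinite Bayes factor is handled by the convention, discussed in Section 4, that its factor contributes p.

import math

def edml_update(kappas, alpha=2.0, beta=2.0, tol=1e-10):
    """Line 4 of Algorithm 2: maximize, over p in (0, 1),

        p^(alpha-1) * (1-p)^(beta-1) * prod_i [kappa_i * p - p + 1].

    The objective is strictly log-concave when some kappa_i != 1 or the
    prior is informative, so the derivative of its log crosses zero at
    most once and bisection suffices.  kappa_i = inf contributes a
    factor p, by convention."""

    def slope(p):  # derivative of the log objective at p
        g = (alpha - 1.0) / p - (beta - 1.0) / (1.0 - p)
        for k in kappas:
            g += 1.0 / p if math.isinf(k) else (k - 1.0) / ((k - 1.0) * p + 1.0)
        return g

    lo, hi = tol, 1.0 - tol
    if slope(lo) <= 0.0:   # mode at (or squeezed against) 0
        return 0.0
    if slope(hi) >= 0.0:   # mode at (or squeezed against) 1
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if slope(mid) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

# Neutral evidence leaves the estimate at the prior mode; evidence for x
# (large or infinite Bayes factors) pushes the estimate toward 1.
print(edml_update([1.0, 1.0]))            # approximately 0.5
print(edml_update([4.0, math.inf, 1.0]))  # greater than 0.5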
Estimation from soft observations is the first key concept that underlies our proposed EDML algorithm for estimating ML and MAP parameters in a binary Bayesian network.

3.2 EXAMPLES AS SOFT OBSERVATIONS

The second key concept underlying EDML is to interpret each example d_i in a dataset as providing a soft observation on each random variable X|u. As mentioned earlier, soft observations are specified by Bayes factors and, hence, one needs to specify the Bayes factor κ^i_{x|u} that example d_i induces on random variable

X|u. EDML uses Equation 1 for this purpose, which will be derived in Section 5. We next consider a few special cases of this equation to highlight its behavior.

Consider first the case in which example d_i implies parent instantiation u (i.e., the parents U of variable X are instantiated to u in example d_i). In this case, Equation 1 reduces to κ^i_{x|u} = O(x | u, d_i)/O(x | u), which is the relative change in the odds of x given u due to conditioning on example d_i. Note that for root variables X, which have no parents U, Equation 1 further reduces to κ^i_x = O(x | d_i)/O(x).

The second case we consider is when example d_i is inconsistent with parent instantiation u. In this case, Equation 1 reduces to κ^i_{x|u} = 1, which amounts to neutral evidence. Hence, example d_i is irrelevant to estimating the distribution of variable X|u in this case, and will be ignored by EDML.

The last special case of Equation 1 we shall consider is when the example d_i is complete; that is, it fixes the value of each variable. In this case, one can verify that κ^i_{x|u} ∈ {0, 1, ∞} and, hence, the example can be viewed as providing either neutral or hard evidence on each random variable X|u. Thus, an example will provide soft observations on variables only when it is incomplete (i.e., missing some values). Otherwise, it is either irrelevant to, or provides a hard observation on, each variable X|u.

In the next section, we prove Equation 2 of Algorithm 2. In Section 5, we discuss the origin of EDML, where we go on and derive Equation 1 of Algorithm 2.

4 ESTIMATION FROM SOFT OBSERVATIONS

Figure 1: Estimation given independent observations (parameter θ_x is the root of a network with children X_1, X_2, ..., X_N).

Consider a binary variable X. Figure 1 depicts a network where θ_x is a parameter representing Pr(X=x) and X_1, ..., X_N are independent observations of X. Suppose further that we have a Beta prior on parameter θ_x with exponents α ≥ 1 and β ≥ 1. A standard estimation problem is to assume that we know the values of these observations and then estimate the parameter θ_x. We now consider a variant of this problem, in which we only have soft evidence η_i about each observation, whose strength is quantified by a Bayes factor κ^i_x = O(x | η_i)/O(x). Here, κ^i_x represents the change in odds that the i-th observation is positive due to evidence η_i. We will refer to η_i as a soft observation on variable X, and our goal in this section is to compute (and optimize) the posterior density on parameter θ_x given these soft observations η_1, ..., η_N.

We first consider the likelihood:

  \mathrm{Pr}(\eta_1, \ldots, \eta_N \mid \theta_x)
    = \prod_{i=1}^{N} \mathrm{Pr}(\eta_i \mid \theta_x)
    = \prod_{i=1}^{N} [\mathrm{Pr}(\eta_i \mid x, \theta_x)\mathrm{Pr}(x \mid \theta_x) + \mathrm{Pr}(\eta_i \mid \bar{x}, \theta_x)\mathrm{Pr}(\bar{x} \mid \theta_x)]
    = \prod_{i=1}^{N} [\mathrm{Pr}(\eta_i \mid x)\,\theta_x + \mathrm{Pr}(\eta_i \mid \bar{x})(1 - \theta_x)]
    \propto \prod_{i=1}^{N} [\kappa^i_x\,\theta_x - \theta_x + 1].

The last step follows because κ^i_x = O(x | η_i)/O(x) = Pr(η_i | x)/Pr(η_i | x̄). The posterior density is then:

  \rho(\theta_x \mid \eta_1, \ldots, \eta_N) \propto \rho(\theta_x)\,\mathrm{Pr}(\eta_1, \ldots, \eta_N \mid \theta_x)
    \propto [\theta_x]^{\alpha-1} [1 - \theta_x]^{\beta-1} \prod_{i=1}^{N} [\kappa^i_x\,\theta_x - \theta_x + 1].

This is exactly Equation 2 of Algorithm 2, assuming we replace the random variable X with the conditional random variable X|u.² The second derivative of the log posterior is

  -\frac{\alpha - 1}{[\theta_x]^2} - \frac{\beta - 1}{[1 - \theta_x]^2} - \sum_i \left[ \frac{\kappa^i_x - 1}{(\kappa^i_x - 1)\theta_x + 1} \right]^2,

which is strictly negative when κ^i_x ≠ 1 for at least one i. This remains true when α = β = 1. Hence, both the likelihood function and the posterior density are strictly log-concave and therefore have unique modes. This means that both ML and MAP parameter estimates are unique in the case of soft, independent observations, which generalizes the uniqueness result for hard, independent observations on a variable X.
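As a quick numerical sanity check of this concavity claim (our own illustration, not from the paper), the following Python sketch evaluates the second derivative above on a grid of values of θ_x, for an arbitrary set of Bayes factors containing at least one non-trivial κ^i_x, and confirms it stays negative with and without an informative prior:

def log_posterior_second_derivative(p, kappas, alpha, beta):
    """Second derivative of the log of Equation 2's objective at p."""
    d2 = -(alpha - 1.0) / p**2 - (beta - 1.0) / (1.0 - p)**2
    for k in kappas:
        d2 -= ((k - 1.0) / ((k - 1.0) * p + 1.0)) ** 2
    return d2

kappas = [0.2, 1.0, 3.5]   # hypothetical soft observations, two non-trivial
grid = [i / 100.0 for i in range(1, 100)]
assert all(log_posterior_second_derivative(p, kappas, 2.0, 3.0) < 0.0 for p in grid)
assert all(log_posterior_second_derivative(p, kappas, 1.0, 1.0) < 0.0 for p in grid)
print("log posterior is strictly concave on the grid")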
5 THE ORIGIN OF EDML

This section reveals the technical origin of EDML, showing how Equation 1 of Algorithm 2 is derived, and providing the basis for the overall structure of EDML as spelled out in Algorithm 2. EDML originated from an approximation algorithm for computing MAP parameters in a meta network.

² The case of κ^i_x = ∞ needs to be handled carefully in Equation 2. First note that κ^i_x = ∞ iff Pr(η_i | x̄) = 0 in the derivation of this equation. In this case, the term Pr(η_i | x)θ_x + Pr(η_i | x̄)(1 − θ_x) equals c·θ_x for some constant c ∈ (0, 1]. Since the value of Equation 2 does not depend on constant c, we will assume c = 1. Hence, when κ^i_x = ∞, the term [κ^i_x θ_x − θ_x + 1] evaluates to θ_x by convention.

Figure 2 depicts an example meta network in which

parameters are represented explicitly as nodes (Darwiche, 2009). In particular, for each conditional random variable X|u in the original Bayesian network, called the base network, we have a node θ_{x|u} in the meta network which represents a parameter that characterizes the distribution of this random variable. Moreover, the meta network includes enough instances of the base network to allow the assertion of each example d_i as evidence on one of these instances. Assuming that θ is an instantiation of all parameter variables, and D is a dataset, MAP estimates are then:

  \theta^* = \arg\max_\theta \rho(\theta \mid D),

where ρ is the density induced by the meta network. Computing MAP estimates exactly is usually prohibitive due to the structure of the meta network. We therefore use the technique of edge deletion (Choi & Darwiche, 2006), which formulates approximate inference as exact inference on a simplified network that is obtained by deleting edges from the original network. The technique compensates for these deletions by introducing auxiliary parameters whose values must be chosen carefully (and usually iteratively) in order to improve the quality of approximations obtained from the simplified network. EDML is the result of making a few specific choices for deleting edges and for choosing values for the auxiliary parameters introduced, which we explain next.

Figure 2: A meta network induced from a base network S ← H → E. The CPTs here are based on standard semantics; see, e.g., (Darwiche, 2009, Ch. 18).

Figure 3: Introducing generators into a meta network (a) and then deleting copy edges from the resulting meta network (b), which leads to introducing clones.

5.1 INTRODUCING GENERATORS

Let X^i denote the instance of variable X in the base network corresponding to example d_i. The first choice of EDML is that for each edge θ_{x|u} → X^i in the meta network, we introduce a generator variable X^i_u, leading to the pair of edges θ_{x|u} → X^i_u → X^i. Figure 3(a) depicts a fragment of the meta network in Figure 2, in which we introduced two generator variables for the edges into E^3, leading to θ_{e|h} → E^3_h → E^3 and θ_{e|h̄} → E^3_h̄ → E^3. Variable X^i_u is meant to generate values of variable X^i according to the distribution specified by parameter θ_{x|u}. Hence, the conditional distribution of a generator X^i_u is such that Pr(x^i_u | θ_{x|u}) = θ_{x|u}. Moreover, the CPT of variable X^i is set to ensure that variable X^i copies the value of generator X^i_u if and only if the parents of X^i take on the value u. That is, the CPT of variable X^i acts as a selector that chooses a particular generator X^i_u to copy from, depending on the values of its parents U. For example, in Figure 3(a), when parent H^3 takes on its positive value, variable E^3 copies the value of generator E^3_h. When parent H^3 takes on its negative value, variable E^3 copies the value of generator E^3_h̄. Adding generator variables does not change the meta network as it continues to have the same density over the original variables. Yet, generators are essential to the derivation of EDML as they will be used for interpreting data examples as soft observations.

5.2 DELETING COPY EDGES

The second choice made by EDML is that we only delete edges of the form X^i_u → X^i from the augmented meta network, which we shall call copy edges. Figure 3(b) depicts an example in which we have deleted

the two copy edges from Figure 3(a). Note here the addition of another auxiliary variable X̂^i_u, called a clone, for each generator X^i_u. The addition of clones is mandated by the edge deletion framework. Moreover, if the CPT of clone X̂^i_u is chosen carefully, it can compensate for the parent-to-child information lost when deleting edge X^i_u → X^i. We will later see how EDML sets these CPTs. The other aspect of compensating for a deleted edge is to specify soft evidence on each generator X^i_u. This is also mandated by the edge deletion framework, and is meant to compensate for the child-to-parent information lost when deleting edge X^i_u → X^i. We will later see how EDML sets this soft evidence as well, which effectively completes the specification of the algorithm. We prelude this specification, however, by making some further observations about the structure of the meta network after edge deletion.

Figure 4: An edge-deleted network obtained from the meta network in Figure 2, found by: (1) adding generator variables, (2) deleting copy edges, and (3) adding cloned generators. The figure highlights the island for example d_2, and the island for parameter θ_{s|h}.

5.3 PARAMETER & EXAMPLE ISLANDS

Consider the network in Figure 4, which is obtained from the meta network in Figure 2 according to the edge-deletion process indicated earlier. The edge-deleted network contains a set of disconnected structures, called islands. Each island belongs to one of two classes: a parameter island for each network parameter θ_{x|u}, and an example island for each example d_i in the dataset. Figure 4 provides the full details for one example island and one parameter island. Note that each parameter island corresponds to a naive Bayes structure, with parameter θ_{x|u} as the root and generators X^i_u as children. When soft evidence is asserted on these generators, we get the estimation problem we treated in Section 4. EDML can now be fully described by specifying (1) the soft evidence on each generator X^i_u in a parameter island, and (2) the CPT of each clone X̂^i_u in an example island. These specifications are given next.

5.4 CHILD-TO-PARENT COMPENSATION

The edge deletion approach suggests the following soft evidence on generators X^i_u, specified as Bayes factors:

  \kappa^i_{x|u} = \frac{O(\hat{x}^i_u \mid d_i)}{O(\hat{x}^i_u)} = \frac{\mathrm{Pr}^i(d_i \mid \hat{x}^i_u)}{\mathrm{Pr}^i(d_i \mid \hat{\bar{x}}^i_u)},   (3)

where Pr^i is the distribution induced by the island of example d_i. We will now show that this equation simplifies to Equation 1 of Algorithm 2. Suppose that we marginalize all clones X̂^i_u from the island of example d_i, leading to a network that induces a distribution Pr. The new network has the following properties. First, it has the same structure as the base network. Second, Pr(x | u) = Pr^i(x̂^i_u), which means that the CPTs of clones in example islands correspond to parameters in the base network. Finally, if we use ū to denote the disjunction of all parent instantiations excluding u, we get:

  \kappa^i_{x|u} = \frac{\mathrm{Pr}^i(d_i \mid \hat{x}^i_u)}{\mathrm{Pr}^i(d_i \mid \hat{\bar{x}}^i_u)}
    = \frac{\mathrm{Pr}(d_i \mid xu)\mathrm{Pr}(u) + \mathrm{Pr}(d_i \mid \bar{u})\mathrm{Pr}(\bar{u})}{\mathrm{Pr}(d_i \mid \bar{x}u)\mathrm{Pr}(u) + \mathrm{Pr}(d_i \mid \bar{u})\mathrm{Pr}(\bar{u})}
    = \frac{\mathrm{Pr}(xu \mid d_i)/\mathrm{Pr}(x \mid u) - \mathrm{Pr}(u \mid d_i) + 1}{\mathrm{Pr}(\bar{x}u \mid d_i)/\mathrm{Pr}(\bar{x} \mid u) - \mathrm{Pr}(u \mid d_i) + 1}.

This is exactly Equation 1 of Algorithm 2. Hence, we can evaluate Equation 3 by evaluating Equation 1 on the base network, as long as we seed the base network with parameters that correspond to the CPTs of clones in an example island.
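For concreteness, the Python sketch below (our own; the function name and inputs are hypothetical) evaluates Equation 1 from the quantities that a jointree run on the base network would supply, namely Pr(xu | d_i), Pr(x̄u | d_i), Pr(u | d_i) and the current parameter Pr(x | u):

def edml_bayes_factor(pr_xu_d, pr_xbar_u_d, pr_u_d, theta_x_u):
    """Equation 1 of Algorithm 2: the Bayes factor kappa^i_{x|u} that
    example d_i induces on the conditional random variable X|u.

    pr_xu_d     = Pr(xu | d_i)
    pr_xbar_u_d = Pr(x̄u | d_i)
    pr_u_d      = Pr(u | d_i) = pr_xu_d + pr_xbar_u_d
    theta_x_u   = current parameter Pr(x | u), so Pr(x̄ | u) = 1 - theta_x_u"""
    numerator = pr_xu_d / theta_x_u - pr_u_d + 1.0
    denominator = pr_xbar_u_d / (1.0 - theta_x_u) - pr_u_d + 1.0
    return float('inf') if denominator == 0.0 else numerator / denominator

# An example inconsistent with u yields neutral evidence (kappa = 1), while a
# complete example containing xu yields hard evidence (kappa = infinity).
print(edml_bayes_factor(0.0, 0.0, 0.0, 0.3))   # 1.0
print(edml_bayes_factor(1.0, 0.0, 1.0, 0.3))   # inf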
5.5 PARENT-TO-CHILD COMPENSATION

We now complete the derivation of EDML by showing how it specifies the CPTs of clones in example islands, which are needed for computing soft evidence as in the previous section. In a nutshell, EDML assumes an initial value of these CPTs, typically chosen randomly. Given these CPTs, example islands will be fully specified and EDML will compute soft evidence as given by Equation 3. The

computed soft evidence is then injected on the generators of parameter islands, leading to a full specification of these islands. EDML will then estimate parameters by solving an exact optimization problem on each parameter island, as shown in Section 4. The estimated parameters are then used as the new values of the CPTs for clones in example islands. This process repeats until convergence.

We have shown in the previous section that the CPTs of clones are in one-to-one correspondence with the parameters of the base network. We have also shown that soft evidence, as given by Equation 3, can be computed by evaluating Equation 1 of Algorithm 2 (with parameters θ corresponding to the CPTs of clones in an example island). EDML takes advantage of this correspondence, leading to the simplified statement spelled out in Algorithm 2.

6 SOME PROPERTIES OF EDML

Being an approximate inference method, one can sometimes identify good behaviors of EDML by identifying situations under which the underlying inference algorithm will produce high quality approximations. We provide a result in this section that illustrates this point in the extreme, where EDML is guaranteed to return optimal estimates, and in only one iteration.

Our result relies on the following observation about parameter estimation via inference on a meta network. When the parents U of a variable X are observed to u′ in an example d_i, all edges θ_{x|u} → X^i in the meta network become superfluous and can be pruned, except for the one edge that satisfies u = u′. Moreover, edges outgoing from observed nodes can also be pruned from a meta network. Suppose now that the parents of each variable are observed in a dataset. After pruning edges as indicated earlier, each parameter variable θ_{x|u} will end up being the root of an isolated naive Bayes structure that has some variables X^i as its children (those whose parents are instantiated to u in example d_i). Figure 5 depicts the result of such pruning in the meta network of Figure 2, given a dataset in which the values of H^1, H^2 and H^3 are all observed.

Figure 5: A pruning of the meta network in Figure 2 given observed values for H^1, H^2 and H^3.

The above observation implies that when the parents of each variable are observed in a dataset, parameters can be estimated independently. This leads to the following well known result.

Proposition 1 When the dataset is complete, the ML estimate for parameter θ_{x|u} is unique and given by D#(xu)/D#(u), where D#(xu) is the number of examples containing xu and D#(u) is the number of examples containing u.

It is well known that EM returns such estimates, and in only one iteration (i.e., independently of its seed). The following more general result is also implied by our earlier observation.

Proposition 2 When only leaf variables have missing values in a dataset, the ML estimate for each parameter θ_{x|u} is unique and given by D#(xu)/D⁺#(u). Here, D⁺#(u) is the number of examples containing u and in which X is observed.

We can now prove the following property of EDML, which is not satisfied by EM, as we show next.

Theorem 1 When only leaf variables have missing values in a dataset, EDML returns the unique ML estimates given by Proposition 2, and in only one iteration.

Proof Consider an example d_i that fixes the values of the parents U of variable X, and consider Equation 1. First, κ^i_{x|u} = 1 iff example d_i is inconsistent with u or does not set the value of X. Next, κ^i_{x|u} = 0 iff example d_i contains x̄u. Finally, κ^i_{x|u} = ∞ iff example d_i contains xu. Moreover, these values are independent of the seed, so the algorithm converges in one iteration. Given these values of the Bayes factors, Equation 2 leads to the estimate of Proposition 2.
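To illustrate Proposition 2 and Theorem 1 on a small worked example (our own, not from the paper), the sketch below computes the closed-form estimate D#(xu)/D⁺#(u) for a leaf variable X with a single parent U, from a toy dataset in which only X is ever missing; this is the estimate EDML would return after a single iteration:

def leaf_ml_estimate(data, x='x', u='u'):
    """Proposition 2: ML estimate of theta_{x|u} when only the leaf X may
    be missing.  Each example is a dict with keys 'U' and 'X', where 'X'
    may be None (missing).  Returns D#(xu) / D+#(u)."""
    d_xu = sum(1 for d in data if d['U'] == u and d['X'] == x)
    d_plus_u = sum(1 for d in data if d['U'] == u and d['X'] is not None)
    return d_xu / d_plus_u

data = [
    {'U': 'u', 'X': 'x'},
    {'U': 'u', 'X': 'xbar'},
    {'U': 'u', 'X': None},     # X missing: neutral evidence, ignored here
    {'U': 'u', 'X': 'x'},
    {'U': 'ubar', 'X': 'x'},   # inconsistent with u: ignored
]
print(leaf_ml_estimate(data))  # 2/3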
We have a number of observations about this result. First, since Proposition 1 is implied by Proposition 2, EDML returns the unique ML estimates in only one iteration when the dataset is complete (just like EM). Next, when only the values of leaf variables are missing in a dataset, Proposition 2 says that there is a unique ML estimate for each network parameter. Moreover,

Theorem 1 says that EDML returns these unique estimates, and in only one iteration. Finally, Theorem 1 does not hold for EM. In particular, one can show that under the conditions of this theorem, an EM iteration will update its current parameter estimates θ and return the following estimates for θ_{x|u}:

  ( D#(xu) + D̃#(u) · Pr_θ(x | u) ) / D#(u).

Here, D̃#(u) is the number of examples that contain u and in which the value of X is missing. This EM estimate clearly depends on the current parameter estimates. As a result, the behavior of EM will depend on its initial seed, unlike EDML. When only the values of leaf variables are missing, there is a unique optimal solution as shown by Proposition 2. Since EM is known to converge to a local optimum, it will eventually return the optimal estimates as well, but possibly after some number of iterations. In this case, the difference between EM and EDML is simply in the speed of convergence. Theorem 1 clearly suggests better convergence behavior of EDML over EM in some situations. We next present initial experiments supporting this suggestion.

7 MORE ON CONVERGENCE

We highlight now a few empirical properties of EDML. In particular, we show how EDML can sometimes find higher quality estimates than EM, in fewer iterations and also in less time. We highlight different types of relative convergence behavior in Figure 6, which depicts example runs on a selection of networks: spect, win95pts, emdec6g, and tcc4e. Network spect is a naive Bayes network induced from a dataset in the UCI ML repository, with 1 class variable and 22 attributes. Network win95pts (76 variables) is an expert system for printer troubleshooting in Windows 95. Networks emdec6g (168 variables) and tcc4e (98 variables) are noisy-or networks for diagnosis (courtesy of HRL Laboratories). We simulated datasets of size 2^k, using the original CPT parameters of the respective networks, and then used EM and EDML to learn new parameters for a network with the same structure. We assumed that certain variables were hidden (latent); in Figure 6, we randomly chose 1/4 of the variables to be hidden. Hidden nodes are of particular interest to EM, because it has been observed that local extrema and convergence rates can be problematic for EM here; see, for example, (Elidan & Friedman, 2005; Salakhutdinov, Roweis, & Ghahramani, 2003).

Figure 6: Quality of parameter estimates over iterations (left column) and time (right column), for networks spect, win95pts, tcc4e, and emdec6g. Going right on the x-axis, we have increasing iterations and time. Going up on the y-axis, we have increasing quality of parameter estimates. EDML is depicted with a solid red line, and EM with a dashed black line.

In Figure 6, each plot represents a simulated data set of size 2^10, where EM and EDML have been initialized with the same random parameter seeds. Both algorithms were run for a fixed number of iterations, 1024 in this case, and we observed the quality of the parameter estimates found, with respect to the log posterior probability (which has been normalized by the maximum log probability observed). We assumed a Beta prior with exponents 2. EDML damped its parameter updates by a factor of 1/2, which is typical for (loopy) belief propagation algorithms.³

³ The simple bisection method suffices for the optimization sub-problem in EDML for binary Bayesian networks. In our current implementation, we used the conjugate gradient method, with a convergence threshold of 10^-8.
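The damping mentioned above can be sketched as follows (a minimal illustration of our own; the paper states only that EDML's updates were damped by a factor of 1/2, so the exact form used there may differ):

def damp(theta_old, theta_new, factor=0.5):
    """Damped update: keep a (1 - factor) share of the old estimate and
    move only partway toward the newly computed one."""
    return {key: (1.0 - factor) * theta_old[key] + factor * theta_new[key]
            for key in theta_old}

print(damp({'theta_x|u': 0.30}, {'theta_x|u': 0.90}))  # midway: about 0.6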

In the left column of Figure 6, we evaluated the quality of estimates over iterations of EDML and EM. In these examples, EDML (represented by a solid red line) tended to have better quality estimates from iteration to iteration (curves that are higher are better), and further managed to find them in fewer iterations (curves to the left are faster).⁴ This is most dramatic in network spect, where EDML appears to have converged almost immediately, whereas EM spent a significant number of iterations to reach estimates of comparable quality. As most nodes hidden in network spect were leaf nodes, this may be expected due to the considerations from the previous section.

In the right column of Figure 6, we evaluated the quality of estimates, now in terms of time. We remark again that procedurally, EDML and EM are very similar, and each algorithm needs only one evaluation of the jointree algorithm per distinct example in the data set (per iteration). EDML solves an optimization problem per distinct example, whereas EM has a closed-form update equation in the corresponding step (Line 4 in Algorithms 1 and 2). Although this optimization problem is a simple one, EDML does require more time per iteration than EM. The right column of Figure 6 suggests that EDML can still find better estimates faster, especially in the cases where EDML has converged in significantly fewer iterations. In network emdec6g, we find that although EDML appeared to converge in fewer iterations, EM was able to find better estimates in less time. We anticipate that in larger networks with higher treewidth, the time spent in the simple optimization sub-problem will be dominated by the time to perform jointree propagation.

We also performed experiments on networks learned from binary haplotype data (Elidan & Gould, 2008), which are networks with bounded treewidth. Here, we simulated data sets of size 2^10, where we again randomly selected 1/4 of the variables to be hidden. We further ran EDML and EM for a fixed number of iterations (512, here). For each of the 74 networks available, we ran EDML and EM with 3 random seeds, for a total of 222 cases. In Figure 7, we highlight a selection of the runs we performed, to illustrate examples of relative convergence behaviors. Again, in the first row, we see a case where EDML identifies better estimates in fewer iterations and less time. In the next two rows, we highlight two cases where EDML appears to converge to a superior fixed point than the one that EM appears to converge to. In the last row, we highlight an instance where EM instead converges to a superior estimate.

⁴ We omit the results of the first 10 iterations as initial parameter estimates are relatively poor, which makes the plots difficult to read.

Figure 7: Quality of parameter estimates over iterations (left column) and time (right column), for networks induced from binary haplotype data. Going right on the x-axis, we have increasing iterations and time. Going up the y-axis, we have increasing quality of parameter estimates. EDML is depicted with a solid red line, and EM with a dashed black line.

In Figure 8, we compare the estimates of EDML and EM at each iteration, computing the percentage of the 74 × 3 = 222 cases considered where EDML had estimates no worse than those found by EM. In this set of experiments, the estimates identified by EDML are clearly superior (or at least, no worse in most cases), when compared to EM. We remark, however, that when both algorithms are given enough iterations to converge, we have observed that the quality of the estimates found by both algorithms is often comparable. This is evident in Figure 6, for example. The analysis from the previous section indicates, however, that there are (very specialized) situations where EDML would be clearly preferred over EM. One subject of future study is the identification of situations and applications where

EDML would be preferred in practice as well.

Figure 8: Quality of estimates over 74 networks (3 cases each) induced from binary haplotype data. Going right on the x-axis, we have increasing iterations. Going up the y-axis, we have an increasing percentage of the 222 cases in which EDML's estimates were no worse than those given by EM.

8 FUTURE AND RELATED WORK

EM has played a critical role in learning probabilistic graphical models and Bayesian networks (Dempster et al., 1977; Lauritzen, 1995; Heckerman, 1998). However, learning (and Bayesian learning in particular) remains challenging in a variety of situations, particularly when there are hidden (latent) variables; see, e.g., (Elidan, Ninio, Friedman, & Schuurmans, 2002; Elidan & Friedman, 2005). Slow convergence of EM has also been recognized, particularly in the presence of hidden variables. A variety of techniques, some incorporating more traditional approaches to optimization, have been proposed in the literature; see, e.g., (Thiesson, Meek, & Heckerman, 2001). Variational approaches are an increasingly popular formalism for learning tasks as well, and for topic models in particular, where variational alternatives to EM are used to maximize a lower bound on the log likelihood (Blei, Ng, & Jordan, 2003). Expectation Propagation also provides variations of EM (Minka & Lafferty, 2002) and is closely related to (loopy) belief propagation (Minka, 2001).

Our empirical results have been restricted to a preliminary investigation of the convergence of EDML, in contrast to EM. A more comprehensive evaluation is called for, in relation to both EM and other approaches based on Bayesian inference. We have also focused this paper on binary variables. EDML, however, generalizes to multivalued variables, since edge deletion does not require a restriction to binary variables and the key result of Section 4 also generalizes to multivalued variables. The resulting formulation is less transparent though when compared to the binary case, since Bayes factors no longer apply directly and one must appeal to a more complex method for quantifying soft evidence; see (Chan & Darwiche, 2005). We expect our future work to focus on a more comprehensive empirical evaluation of EDML, in the context of an implementation that uses multivalued variables. Moreover, we seek to identify additional properties of EDML that go beyond convergence.

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. JMLR, 3.

Chan, H., & Darwiche, A. (2005). On the revision of probabilistic beliefs using uncertain evidence. Artificial Intelligence, 163.

Choi, A., & Darwiche, A. (2006). An edge deletion semantics for belief propagation and its practical impact on approximation quality. In AAAI.

Darwiche, A. (2009). Modeling and Reasoning with Bayesian Networks. Cambridge University Press.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39.

Elidan, G., & Friedman, N. (2005). Learning hidden variable networks: The information bottleneck approach. JMLR, 6.

Elidan, G., & Gould, S. (2008). Learning bounded treewidth Bayesian networks. JMLR, 9.

Elidan, G., Ninio, M., Friedman, N., & Schuurmans, D. (2002). Data perturbation for escaping local maxima in learning. In AAAI/IAAI.

Good, I. J. (1950). Probability and the Weighing of Evidence. Charles Griffin, London.

Heckerman, D. (1998). A tutorial on learning with Bayesian networks. In Jordan, M. I. (Ed.), Learning in Graphical Models. MIT Press.

Koller, D., & Friedman, N. (2009).
Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Lauritzen, S. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19.

Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In UAI.

Minka, T. P., & Lafferty, J. D. (2002). Expectation propagation for the generative aspect model. In UAI.

Salakhutdinov, R., Roweis, S. T., & Ghahramani, Z. (2003). Optimization with EM and expectation-conjugate-gradient. In ICML.

Thiesson, B., Meek, C., & Heckerman, D. (2001). Accelerating EM for large databases. Machine Learning, 45(3).


More information

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER*

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER* EO BOUNDS FO THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BADLEY J. LUCIE* Abstract. Te expected error in L ) attimet for Glimm s sceme wen applied to a scalar conservation law is bounded by + 2 ) ) /2 T

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximating a function f(x, wose values at a set of distinct points x, x, x 2,,x n are known, by a polynomial P (x

More information

Continuity and Differentiability of the Trigonometric Functions

Continuity and Differentiability of the Trigonometric Functions [Te basis for te following work will be te definition of te trigonometric functions as ratios of te sides of a triangle inscribed in a circle; in particular, te sine of an angle will be defined to be te

More information

arxiv: v1 [physics.flu-dyn] 3 Jun 2015

arxiv: v1 [physics.flu-dyn] 3 Jun 2015 A Convective-like Energy-Stable Open Boundary Condition for Simulations of Incompressible Flows arxiv:156.132v1 [pysics.flu-dyn] 3 Jun 215 S. Dong Center for Computational & Applied Matematics Department

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER /2019

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER /2019 ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS MATH00030 SEMESTER 208/209 DR. ANTHONY BROWN 6. Differential Calculus 6.. Differentiation from First Principles. In tis capter, we will introduce

More information

Differentiation in higher dimensions

Differentiation in higher dimensions Capter 2 Differentiation in iger dimensions 2.1 Te Total Derivative Recall tat if f : R R is a 1-variable function, and a R, we say tat f is differentiable at x = a if and only if te ratio f(a+) f(a) tends

More information

REVIEW LAB ANSWER KEY

REVIEW LAB ANSWER KEY REVIEW LAB ANSWER KEY. Witout using SN, find te derivative of eac of te following (you do not need to simplify your answers): a. f x 3x 3 5x x 6 f x 3 3x 5 x 0 b. g x 4 x x x notice te trick ere! x x g

More information

1 Proving the Fundamental Theorem of Statistical Learning

1 Proving the Fundamental Theorem of Statistical Learning THEORETICAL MACHINE LEARNING COS 5 LECTURE #7 APRIL 5, 6 LECTURER: ELAD HAZAN NAME: FERMI MA ANDDANIEL SUO oving te Fundaental Teore of Statistical Learning In tis section, we prove te following: Teore.

More information

Bob Brown Math 251 Calculus 1 Chapter 3, Section 1 Completed 1 CCBC Dundalk

Bob Brown Math 251 Calculus 1 Chapter 3, Section 1 Completed 1 CCBC Dundalk Bob Brown Mat 251 Calculus 1 Capter 3, Section 1 Completed 1 Te Tangent Line Problem Te idea of a tangent line first arises in geometry in te context of a circle. But before we jump into a discussion of

More information

Quantum Mechanics Chapter 1.5: An illustration using measurements of particle spin.

Quantum Mechanics Chapter 1.5: An illustration using measurements of particle spin. I Introduction. Quantum Mecanics Capter.5: An illustration using measurements of particle spin. Quantum mecanics is a teory of pysics tat as been very successful in explaining and predicting many pysical

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here!

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here! Precalculus Test 2 Practice Questions Page Note: You can expect oter types of questions on te test tan te ones presented ere! Questions Example. Find te vertex of te quadratic f(x) = 4x 2 x. Example 2.

More information

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning Long Term Time Series Prediction wit Multi-Input Multi-Output Local Learning Gianluca Bontempi Macine Learning Group, Département d Informatique Faculté des Sciences, ULB, Université Libre de Bruxelles

More information

Quantum Numbers and Rules

Quantum Numbers and Rules OpenStax-CNX module: m42614 1 Quantum Numbers and Rules OpenStax College Tis work is produced by OpenStax-CNX and licensed under te Creative Commons Attribution License 3.0 Abstract Dene quantum number.

More information

3.1 Extreme Values of a Function

3.1 Extreme Values of a Function .1 Etreme Values of a Function Section.1 Notes Page 1 One application of te derivative is finding minimum and maimum values off a grap. In precalculus we were only able to do tis wit quadratics by find

More information

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative Matematics 5 Workseet 11 Geometry, Tangency, and te Derivative Problem 1. Find te equation of a line wit slope m tat intersects te point (3, 9). Solution. Te equation for a line passing troug a point (x

More information

Average Rate of Change

Average Rate of Change Te Derivative Tis can be tougt of as an attempt to draw a parallel (pysically and metaporically) between a line and a curve, applying te concept of slope to someting tat isn't actually straigt. Te slope

More information

Impact of Lightning Strikes on National Airspace System (NAS) Outages

Impact of Lightning Strikes on National Airspace System (NAS) Outages Impact of Ligtning Strikes on National Airspace System (NAS) Outages A Statistical Approac Aurélien Vidal University of California at Berkeley NEXTOR Berkeley, CA, USA aurelien.vidal@berkeley.edu Jasenka

More information

Deep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy

Deep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy Deep Belief Network Training Improvement Using Elite Samples Minimizing Free Energy Moammad Ali Keyvanrad a, Moammad Medi Homayounpour a a Laboratory for Intelligent Multimedia Processing (LIMP), Computer

More information

IEOR 165 Lecture 10 Distribution Estimation

IEOR 165 Lecture 10 Distribution Estimation IEOR 165 Lecture 10 Distribution Estimation 1 Motivating Problem Consider a situation were we ave iid data x i from some unknown distribution. One problem of interest is estimating te distribution tat

More information

Poisson Equation in Sobolev Spaces

Poisson Equation in Sobolev Spaces Poisson Equation in Sobolev Spaces OcMountain Dayligt Time. 6, 011 Today we discuss te Poisson equation in Sobolev spaces. It s existence, uniqueness, and regularity. Weak Solution. u = f in, u = g on

More information

The Dynamic Range of Bursting in a Model Respiratory Pacemaker Network

The Dynamic Range of Bursting in a Model Respiratory Pacemaker Network SIAM J. APPLIED DYNAMICAL SYSTEMS Vol. 4, No. 4, pp. 117 1139 c 25 Society for Industrial and Applied Matematics Te Dynamic Range of Bursting in a Model Respiratory Pacemaker Network Janet Best, Alla Borisyuk,

More information

Functions of the Complex Variable z

Functions of the Complex Variable z Capter 2 Functions of te Complex Variable z Introduction We wis to examine te notion of a function of z were z is a complex variable. To be sure, a complex variable can be viewed as noting but a pair of

More information

Equilibrium and Pareto Efficiency in an exchange economy

Equilibrium and Pareto Efficiency in an exchange economy Microeconomic Teory -1- Equilibrium and efficiency Equilibrium and Pareto Efficiency in an excange economy 1. Efficient economies 2 2. Gains from excange 6 3. Edgewort-ox analysis 15 4. Properties of a

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

Discriminate Modelling of Peak and Off-Peak Motorway Capacity

Discriminate Modelling of Peak and Off-Peak Motorway Capacity International Journal of Integrated Engineering - Special Issue on ICONCEES Vol. 4 No. 3 (2012) p. 53-58 Discriminate Modelling of Peak and Off-Peak Motorway Capacity Hasim Moammed Alassan 1,*, Sundara

More information

CHAPTER 3: Derivatives

CHAPTER 3: Derivatives CHAPTER 3: Derivatives 3.1: Derivatives, Tangent Lines, and Rates of Cange 3.2: Derivative Functions and Differentiability 3.3: Tecniques of Differentiation 3.4: Derivatives of Trigonometric Functions

More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

Boosting Kernel Density Estimates: a Bias Reduction. Technique?

Boosting Kernel Density Estimates: a Bias Reduction. Technique? Boosting Kernel Density Estimates: a Bias Reduction Tecnique? Marco Di Marzio Dipartimento di Metodi Quantitativi e Teoria Economica, Università di Cieti-Pescara, Viale Pindaro 42, 65127 Pescara, Italy

More information

These errors are made from replacing an infinite process by finite one.

These errors are made from replacing an infinite process by finite one. Introduction :- Tis course examines problems tat can be solved by metods of approximation, tecniques we call numerical metods. We begin by considering some of te matematical and computational topics tat

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

Chapter 2 Limits and Continuity

Chapter 2 Limits and Continuity 4 Section. Capter Limits and Continuity Section. Rates of Cange and Limits (pp. 6) Quick Review.. f () ( ) () 4 0. f () 4( ) 4. f () sin sin 0 4. f (). 4 4 4 6. c c c 7. 8. c d d c d d c d c 9. 8 ( )(

More information

Time (hours) Morphine sulfate (mg)

Time (hours) Morphine sulfate (mg) Mat Xa Fall 2002 Review Notes Limits and Definition of Derivative Important Information: 1 According to te most recent information from te Registrar, te Xa final exam will be eld from 9:15 am to 12:15

More information

The Complexity of Computing the MCD-Estimator

The Complexity of Computing the MCD-Estimator Te Complexity of Computing te MCD-Estimator Torsten Bernolt Lerstul Informatik 2 Universität Dortmund, Germany torstenbernolt@uni-dortmundde Paul Fiscer IMM, Danisc Tecnical University Kongens Lyngby,

More information