Estimating the Variance of Query Responses in Hybrid Bayesian Nets

Yasin Abbasi-Yadkori, Russ Greiner, Bret Hoehn (Dept. of Computing Science, University of Alberta) and Peter Hooper (Dept. of Mathematical and Statistical Sciences, University of Alberta)

Abstract

A Bayesian network is a model of a distribution, encoded using a network structure S augmented with conditional distribution parameters (CDPs) Θ that specify the conditional probability of a variable, given each assignment to its parents. Given a fixed structure S and CDPs Θ, we can compute the response to a fixed query Q(Θ) = P_{S,Θ}(C = c | E = e), which is a real number. However, in many situations the CD-parameters Θ themselves can be uncertain, e.g., when they are estimated from a (random) data sample. Here, this response Q(Θ) will be a random variable. Earlier results provided a way to estimate the variance of this response when all variables are discrete. This paper extends that analysis to deal with Bayesian networks that can also include normally distributed continuous variables. (We consider essentially arbitrary Bayesian net structures, assuming only that discrete variables have no continuous parents.) In particular, we show how to compute posterior distributions of each independent CDP, and then how to use the Delta method to approximate the variance of any query. We also derive a compact form for the variance in the case of Naive Bayes structures. Finally, we provide empirical studies that demonstrate that our system works effectively, even when the parameters correspond to a small sample.

1 Introduction

In general, a Bayesian network is a model of a distribution, represented as a directed acyclic graph S, whose nodes represent variables and whose arcs represent the dependencies between them, together with conditional distribution parameters (CDPs) Θ that specify the conditional probability of each variable, given each assignment to its parents. Given a fixed structure S and parameters Θ, we can compute the response to a fixed query Q(Θ) = P_{S,Θ}(C = c | E = e), which is a real number. However, in many situations the CD-parameters Θ themselves can be uncertain, e.g., when they are estimated from a (random) data sample. Here, this response Q(Θ) will be a random variable.

For discrete variables whose parents are discrete, these CDPs correspond to CPtables (Pearl, 1988). Consider for example the variable D in Figure 1, which has only the parent B. Its parameter θ_{D|+b} = ⟨θ_{+d|+b}, θ_{−d|+b}⟩ corresponds to the distribution of D given that B is true, and θ_{D|−b} = ⟨θ_{+d|−b}, θ_{−d|−b}⟩ corresponds to the D distribution when B is false. If these values were known with certainty, we could view them as constants, e.g., θ_{D|+b} = ⟨0.3, 0.7⟩. However, if they were only based on an expert's not-necessarily-perfect assessment, or if they were learned from a data sample, they would not be known with certainty. Here, we would represent the parameter as a random variable; perhaps θ_{D|+b} ~ Dir(3, 7), to mean this parameter is drawn from a Dirichlet distribution with parameters 3 and 7. Similarly, the normally distributed variable A ~ N(θ_A, σ²_A) depends on parameters that are drawn from some distribution, here from a Normal-Inverse-χ² distribution (see below).

Now consider computing the response to a fixed query from this fixed structure, say

  P(−b | +d) = θ_{−b} θ_{+d|−b} / (θ_{−b} θ_{+d|−b} + θ_{+b} θ_{+d|+b}),

or

  P(+b | 0 < C < 1)
   = P(+b, 0 < C < 1, −∞ < A < +∞) / P(0 < C < 1, −∞ < A < +∞)
   = P(+b) P(0 < C < 1, −∞ < A < +∞ | +b) / P(0 < C < 1, −∞ < A < +∞),

whose numerator is

  θ_{+b} · (1 / (2π σ_{C,+b} σ_A)) ∫_{−∞}^{+∞} ∫_0^1 e^{−(y − θ_A)² / (2σ²_A)} e^{−(x − θ_{C|+b} − θ_{CA,+b} y)² / (2σ²_{C,+b})} dx dy.

Clearly these responses depend on the parameters, here the θ_i and σ²_i.
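To make this dependence concrete, the short sketch below (ours, not from the paper; all numeric parameter values are invented for illustration) evaluates a response of the form P(+b | 0 < C < 1) for one fixed setting of the θ's and σ²'s, marginalizing A numerically.

```python
# A minimal sketch of how a query response such as P(+b | 0 < C < 1)
# is a deterministic function of the CD-parameters.
# All numeric parameter values below are made up for illustration.
import numpy as np
from scipy import stats, integrate

theta_b = 0.6                          # P(B = +b)
theta_A, sigma2_A = 0.0, 1.0           # A ~ N(theta_A, sigma2_A)
# C | A, b ~ N(theta_C[b] + theta_CA[b] * A, sigma2_C[b])
theta_C  = {'+b': 0.5, '-b': -0.5}
theta_CA = {'+b': 1.0, '-b': 0.3}
sigma2_C = {'+b': 1.0, '-b': 2.0}

def p_joint(b):
    """P(b, 0 < C < 1): integrate A out, using the Gaussian CDF for C's range."""
    prior_b = theta_b if b == '+b' else 1.0 - theta_b
    def integrand(a):
        mu_c = theta_C[b] + theta_CA[b] * a
        sd_c = np.sqrt(sigma2_C[b])
        p_c_range = stats.norm.cdf(1.0, mu_c, sd_c) - stats.norm.cdf(0.0, mu_c, sd_c)
        return stats.norm.pdf(a, theta_A, np.sqrt(sigma2_A)) * p_c_range
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return prior_b * val

num = p_joint('+b')
den = num + p_joint('-b')
print("Q(Theta) = P(+b | 0 < C < 1) =", num / den)
```

Re-running this with different parameter values gives a different response, which is why an uncertain Θ makes Q(Θ) a random variable.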
As they are random variables, clearly the responses to these queries are random variables as well. Van Allen et al. (2008) earlier dealt with discrete variables in arbitrary graph structures, proving that this response is asymptotically normal, and providing both the expected value of the response and the asymptotic variance.

Figure 1: Simple example of a hybrid network. Here A ~ N(θ_A, σ²_A); B has parameters ⟨θ_{+b}, θ_{−b}⟩; D | B has parameters ⟨θ_{+d|+b}, θ_{−d|+b}⟩ and ⟨θ_{+d|−b}, θ_{−d|−b}⟩; C | A, +b ~ N(θ_{C|+b} + θ_{CA,+b} A, σ²_{C,+b}) and C | A, −b ~ N(θ_{C|−b} + θ_{CA,−b} A, σ²_{C,−b}); and E | A, C ~ N(θ_E + θ_{EA} A + θ_{EC} C, σ²_E).

This paper extends that analysis to deal with Bayesian networks that can also include continuous, normally-distributed variables, whose CDPs are drawn from a Normal-Inverse-χ² distribution. (We consider essentially arbitrary Bayesian net structures, requiring only that discrete variables have no continuous parents.) In particular, Section 2 provides the foundations, showing how to compute posterior distributions of each independent CDP. Section 3 then shows how to use the Delta method to approximate the variance of any query for general Bayesian network structures. Sections 4 and 5 derive compact forms for the variance in the case of Naive Bayes structures, with discrete vs continuous root nodes, and Section 6 presents empirical results showing that it works effectively. The website ualberta.ca/~greiner/research/hybrid provides additional details about this process, including proofs and detailed examples; we indicate this using the notation [Web:x] below.

2 Foundations

Following standard convention, we represent the distribution of each discrete variable by a Conditional Probability Table (CPtable), whose rows each correspond to a specific assignment to that variable's parents. (Recall we require that all parents of discrete variables be discrete.) Each of these row-parameters is drawn from a Dirichlet distribution (Heckerman, 1998; Van Allen et al., 2008). We will assume all continuous variables are normally distributed. For hybrid Bayesian nets, which include both discrete and continuous variables, we use Conditional Linear Gaussian models to represent the conditional distributions (Koller & Friedman, 2007). Their parameters are themselves random variables, with mean and variance drawn from a Normal-Inverse-χ² distribution (Gelman, Carlin, Stern, & Rubin, 2003); see below.

To be more concrete, consider again Figure 1. To simplify our description, we will assume that all discrete variables are binary (although all of our analysis applies if they range over any finite set of values). Here, B's parameters are θ_B = ⟨θ_{+b}, θ_{−b}⟩, and D has two parameters,¹ θ_{D|+b} = ⟨θ_{+d|+b}, θ_{−d|+b}⟩ associated with B being true and θ_{D|−b} = ⟨θ_{+d|−b}, θ_{−d|−b}⟩ for B being false. We assume each parameter is drawn from a Dirichlet distribution; here θ_B = ⟨θ_{+b}, θ_{−b}⟩ ~ Dir(Э_{+b}, Э_{−b}). We initialize all parameters to be "uniform"; that is, Э_i = 1.

The continuous variable A has no parents; its CDP is simply A ~ N(θ_A, σ²_A), where in general N(μ, σ²) is a Gaussian distribution with mean μ and variance σ². These parameters are also random variables:

  θ_A | σ²_A ~ N(Ж_A, σ²_A / Л_A),   σ²_A ~ Ч_A Д_A / χ²(Ч_A),   (1)

where χ²(ν) refers to a chi-squared distribution with ν degrees of freedom. Here Ж_A, Л_A, Ч_A, and Д_A are all hyperparameters,² each initialized to a (positive) real number. By convention, we will typically set Ж_A = 0, Л_A = 1, Д_A = 1, and Ч_A = 5.

Next, the continuous variable C is a child of both the continuous A and the discrete (binary) B. Its CDP is

  C | A = a, +b ~ N(θ_{C|+b} + θ_{CA,+b} a, σ²_{C,+b}),
  C | A = a, −b ~ N(θ_{C|−b} + θ_{CA,−b} a, σ²_{C,−b}).

Note we have a different set of parameters for each instantiation of the discrete parent. To specify the distributions over these parameters:

  (θ_{C|+b}, θ_{CA,+b}) | σ²_{C,+b} ~ N₂( (Ж_{C|+b}, Ж_{CA,+b}), σ²_{C,+b} Л⁻¹_{C,+b} ),   σ²_{C,+b} ~ Ч_{C,+b} Д_{C,+b} / χ²(Ч_{C,+b}).   (2)

(There is a similar set of equations associated with C, −b.)
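The following small sketch (ours, not the paper's) shows how one draw of these random parameters could be generated, using Equation 1 with the conventional values Ж_A = 0, Л_A = 1, Д_A = 1, Ч_A = 5, together with a uniform Dirichlet row.

```python
# A small sketch of sampling (theta, sigma^2) from a Normal-Inverse-chi^2 prior
# with hyperparameters (mean, kappa, scale, dof) = (Zhe, El, De, Che) = (0, 1, 1, 5),
# and one CPtable row from Dir(1, 1).  Hyperparameter ordering is as reconstructed above.
import numpy as np

rng = np.random.default_rng(0)

def sample_norm_inv_chi2(mean, kappa, scale, dof, rng):
    # sigma^2 ~ dof * scale / chi^2(dof)   (scaled inverse-chi-squared)
    sigma2 = dof * scale / rng.chisquare(dof)
    # theta | sigma^2 ~ N(mean, sigma^2 / kappa)
    theta = rng.normal(mean, np.sqrt(sigma2 / kappa))
    return theta, sigma2

# e.g. the prior used for the root variable A
theta_A, sigma2_A = sample_norm_inv_chi2(0.0, 1.0, 1.0, 5.0, rng)

# a row of a discrete CPtable, e.g. theta_{D|+b} ~ Dir(1, 1)
theta_D_given_pb = rng.dirichlet([1.0, 1.0])
print(theta_A, sigma2_A, theta_D_given_pb)
```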
The variable E has two continuous parents, with the CDP

  E | A = a, C = c ~ N(θ_E + θ_{EA} a + θ_{EC} c, σ²_E),

where

  (θ_E, θ_{EA}, θ_{EC}) | σ²_E ~ N₃( (Ж_E, Ж_{EA}, Ж_{EC}), σ²_E Л⁻¹_E ),   σ²_E ~ Ч_E Д_E / χ²(Ч_E).

¹ For binary variables X, we will let +x abbreviate X = 1 and −x abbreviate X = 0.
² We use normal Roman letters (e.g., A, a) for a base variable and its associated value, Greek letters (e.g., Θ, θ) for parameters and their values, and Cyrillic letters (e.g., Ж, Л, Ч, Д, Э, pronounced Zhe, El, Che, De, E) for hyper-parameters.

In general, consider the continuous variable U with r discrete parents {D_1, ..., D_r} and t continuous parents {C_1, ..., C_t}. Let d = ⟨d_1, ..., d_r⟩ be the values of the discrete variables, and c = ⟨c_1, ..., c_t⟩ be the values of the continuous variables. Then

  U | d, c ~ N( θ_{U|d} + Σ_i θ_{U|d,i} c_i, σ²_{U|d} ).   (3)

Notice there are 2^{|d|} such equations (assuming each D_i is binary), and for each, we need to specify t different θ_{U|d,i} parameters, as well as a constant term θ_{U|d} and a variance term σ²_{U|d}, for a total of 2^{|d|} (t + 2) parameters. Now to specify the distribution over these parameters. The parameters associated with different d assignments are independent. However, the (t + 2) parameters for a single d are interdependent (to simplify our notation, below we omit the U|d part of the subscripts):

  θ | σ² ~ N_{t+1}( Ж, σ² Л⁻¹ ),   σ² ~ Ч Д / χ²(Ч).   (4)

Notice this requires O(t²) hyperparameters: { Ж_i | i = 1..t+1 }, { Л_{i,j} | i = 1..t+1, j = 1..t+1 }, as well as Ч and Д. We will sometimes abbreviate Equation 4 as θ, σ² ~ Norm/χ²(Ж, Л, Д, Ч), where Norm/χ²(·) refers to the Normal-Inverse-χ² distribution. (When there are no continuous parents, t = 0, and both Ж and Л are scalars.) In general, we initialize the (t+1)-ary vector Ж to be all 0's, and Л to be the (t+1) × (t+1) identity matrix I_{t+1}.

2.1 Computing Posterior Distributions

We assume the parameters for one row are independent of the parameters for the others; e.g., θ_{D|+b} ⊥ θ_B, θ_{D|+b} ⊥ θ_A, and θ_{D|+b} ⊥ σ²_A. Moreover, each of these distributions is conjugate. That is, if we initialize the parameter for each row of each discrete variable as Dir(1, 1), and the parameters for each (conditional) continuous variable as Norm/χ²(0, I, 1, 5), and then observe a complete datasample S of five instances over ⟨A, B, C, D, E⟩, the posterior distribution is

  θ_B = ⟨θ_{+b}, θ_{−b}⟩ | S ~ Dir(5, 2),
  θ_{D|+b} = ⟨θ_{+d|+b}, θ_{−d|+b}⟩ | S ~ Dir(3, 3),
  θ_{D|−b} = ⟨θ_{+d|−b}, θ_{−d|−b}⟩ | S ~ Dir(2, 1).

(Here, we compute the posterior hyperparameters by simply adding to the prior the number of examples that match each condition. So, as the hyperparameters for θ_B were initially ⟨1, 1⟩, after seeing 4 B = +b and 1 B = −b instances in S, the posterior is Dir(1 + 4, 1 + 1).)

Now to compute the posterior for the continuous variables (Gelman et al., 2003): Let n be the effective sample size (here n = 5), Ā = (a_1 + ... + a_5)/5 = 0.36 be the sample mean, and (n−1)s² = Σ_i (a_i − Ā)² be the sum of squares. We use these update rules to produce the posterior distribution:

  Ж' = (Л Ж + n Ā) / (Л + n),   Л' = Л + n,   Ч' = Ч + n,   Ч' Д' = Ч Д + (n−1)s² + (Л n / (Л + n)) (Ā − Ж)².

Hence, for variable A,

  θ_A, σ²_A | S ~ Norm/χ²(0.3, 6, 0.644, 10).   (5)

(This is described in more detail in [Web:App. A]; for the general update rules, see [Web:App. A].) For variable C when B = +b,

  (θ_{C|+b}, θ_{CA,+b}) | σ²_{C,+b} ~ N₂( (0.4, 0.009), σ²_{C,+b} Л'⁻¹_{C,+b} ),   where σ²_{C,+b} ~ Ч' Д' / χ²(9),

and

  (θ_E, θ_{EA}, θ_{EC}) | σ²_E ~ N₃( (0.4, 0.065, 0.6), σ²_E Л'⁻¹_E ),   where σ²_E ~ Ч'_E Д'_E / χ²(10).
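As a sanity check on these update rules, here is a small sketch (ours) of the conjugate Norm/χ² update; the five A-values below are invented, chosen only so that their mean is 0.36.

```python
# A sketch of the conjugate Normal-Inverse-chi^2 update described above
# (standard update rules; variable names are ours, not the paper's).
import numpy as np

def update_norm_inv_chi2(mean, kappa, scale, dof, data):
    data = np.asarray(data, dtype=float)
    n = len(data)
    xbar = data.mean()
    ss = ((data - xbar) ** 2).sum()          # (n-1) s^2
    kappa_n = kappa + n
    dof_n = dof + n
    mean_n = (kappa * mean + n * xbar) / kappa_n
    scale_n = (dof * scale + ss + (kappa * n / kappa_n) * (xbar - mean) ** 2) / dof_n
    return mean_n, kappa_n, scale_n, dof_n

# e.g. updating the prior Norm/chi^2(0, 1, 1, 5) with a sample of five A-values
# whose mean is 0.36 (the specific values here are made up):
sample_A = [0.10, 0.55, 0.30, 0.60, 0.25]
print(update_norm_inv_chi2(0.0, 1.0, 1.0, 5.0, sample_A))
```

With Л = 1 and Ч = 5 this reproduces the posterior mean 0.3, Л' = 6 and Ч' = 10 above; the posterior scale depends on the (unspecified) sum of squares of the actual sample.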

3 Estimating Variance

To define our task, we assume we are given: a Bayesian net structure S; the (posterior) distribution over the parameters for each variable, Θ (these correspond to the Dirichlet parameters for each discrete variable, and the Normal-Inverse-χ² parameters for each Gaussian, and are denoted using the Cyrillic letters), so that in the above example (Figure 1), Θ = ⟨θ_B, θ_{D|+b}, θ_{D|−b}, (θ_A, σ²_A), (θ_{C|+b}, θ_{CA,+b}, σ²_{C,+b}), (θ_{C|−b}, θ_{CA,−b}, σ²_{C,−b}), (θ_E, θ_{EA}, θ_{EC}, σ²_E)⟩; and a specific query over some variables within the network, Q(Θ) = P_{S,Θ}(C = c | E = e). (This notation emphasizes its dependence on the parameters.) We will consider queries whose query variables are each either an assignment to a discrete variable (e.g., +b) or a range for a continuous variable (e.g., 1 < C < 2), and whose evidence variables are each a specific assignment to either a discrete or continuous variable, e.g., D = d or E = 3. Given this, the response will be a random variable in the interval [0, 1]; we want to return a good estimate of both its mean and its variance.

We estimate the variance using the Delta method (Oehlert, 1992; Casella & Berger, 2002). Let Θ̂ = E[Θ] be the expected value of the parameter values, and ∇_Θ Q(Θ̂) = [∂Q/∂θ_i]_i be the vector of the derivatives of Q wrt each of the parameters θ_i, evaluated at Θ̂. Using a Taylor expansion,

  Q(Θ) = Q(Θ̂) + ∇_Θ Q(Θ̂)^T (Θ − Θ̂) + R,   (6)

where R contains the terms of degree 2 and higher. Assuming this R is negligible,

  Q(Θ) − Q(Θ̂) ≈ ∇_Θ Q(Θ̂)^T (Θ − Θ̂),

which means, assuming E[Q(Θ)] ≈ Q(Θ̂),³

  E[ (Q(Θ) − E[Q(Θ)])² ] ≈ E[ (Q(Θ) − Q(Θ̂))² ] ≈ ∇_Θ Q(Θ̂)^T Cov(Θ) ∇_Θ Q(Θ̂),

where Cov(Θ) is the variance-covariance matrix. Note the left hand side is the variance of the response, V(Q(Θ)), which suggests we can approximate this variance using

  V(Q(Θ)) ≈ ∇_Θ Q(Θ̂)^T Cov(Θ) ∇_Θ Q(Θ̂).   (7)

The two challenges, therefore, are (1) computing the covariance matrix Cov(Θ), and (2) computing the derivatives ∂Q/∂θ_i. Fortunately, given standard assumptions about independence of the different parameters, the parameters associated with different variables are uncorrelated; i.e., for each pair of distinct variables X and Y, we have Θ(X) ⊥ Θ(Y), where Θ(X) are the parameters associated with the variable X. This means the covariance matrix will be a block diagonal matrix, and so

  V(Q(Θ)) ≈ Σ_X V_{Q(Θ)}(X),   (8)

where

  V_{Q(Θ)}(X) = ∇_{Θ(X)} Q(Θ̂)^T Cov(Θ(X)) ∇_{Θ(X)} Q(Θ̂).   (9)

The rest of this section describes how to compute the covariance terms for each type of node: a discrete child of discrete parents, and a continuous child of both discrete and continuous parents. The next two sections show simpler versions for Naive Bayes structures.

³ While this claim holds for discrete networks (Cooper & Herskovits, 1992), it does not apply to continuous networks; see [Web:CounterEx].

Consider the network shown in Figure 1 and the query

  Q = P(1 < C < 2 | E = 3, +b).   (10)

In general, we consider queries of the form Q = P(R ∈ I | E = e), which allows us to partition the variables into 3 sets: the ones that have some specific instantiation E (in the evidence component of the query), the ones that are in some range R (in the query component of the query), and the remaining variables T that do not appear anywhere in the query. (Here, R = {C}, I = {[1, 2]}, E = {E, B}, e = {3, +b}, and T = {A, D}.)

Let P(U = u) be the probability distribution function of the joint distribution of the variables U. For example,

  P(a, c, e | +b) = N(a; θ_A, σ²_A) N(c; θ_{C|+b} + θ_{CA,+b} a, σ²_{C,+b}) N(e; θ_E + θ_{EA} a + θ_{EC} c, σ²_E).

Now let P_{E=e}(U = u) be the value of P(U = u) when all evidence variables are substituted with their values in the query. For example, using

  P_{E=3,+b}(a, c) = θ_{+b} N(a; θ_A, σ²_A) N(c; θ_{C|+b} + θ_{CA,+b} a, σ²_{C,+b}) N(3; θ_E + θ_{EA} a + θ_{EC} c, σ²_E),

we have

  Q(Θ) = [ ∫_1^2 ∫_{−∞}^{+∞} Σ_d P_{E=3,+b,d}(a, c) da dc ] / [ ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} Σ_d P_{E=3,+b,d}(a, c) da dc ] = Σ_d f(+b, d) / Σ_d g(+b, d),

for the obvious f(·) and g(·) functions. To simplify our notation, we will let U refer to the continuous variables in T ∪ R, let ∫_n and ∫_d denote integration over the numerator and denominator bounds respectively, and let z range over the assignments to the discrete variables that do not appear in the evidence. So, in general,

  Q = Σ_z ∫_n P_{E=e,z}(U) dU / Σ_z ∫_d P_{E=e,z}(U) dU = Σ_z f(z) / Σ_z g(z).   (11)

In order to compute the variance of Q, we need to compute its derivatives wrt the parameters. Letting γ be an arbitrary parameter,

  ∂Q/∂γ = (1 / Σ_z g(z)) [ ∂(Σ_z f(z))/∂γ ] − (Q / Σ_z g(z)) [ ∂(Σ_z g(z))/∂γ ].   (12)

Now recall the Delta method, Equation 7.
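The core of Equation 7 is mechanical once the gradient and the (block-diagonal) covariance are available; the sketch below is ours, with a toy query, a made-up covariance, and a numerical gradient standing in for the closed-form derivatives developed next.

```python
# A generic sketch of the Delta-method approximation in Equation 7:
# V(Q(Theta)) ~ grad_Q(Theta_hat)^T Cov(Theta) grad_Q(Theta_hat),
# with the gradient taken numerically (the paper computes it in closed form).
import numpy as np

def delta_method_variance(Q, theta_hat, cov, eps=1e-5):
    theta_hat = np.asarray(theta_hat, dtype=float)
    grad = np.zeros_like(theta_hat)
    for i in range(len(theta_hat)):            # central finite differences
        up, dn = theta_hat.copy(), theta_hat.copy()
        up[i] += eps
        dn[i] -= eps
        grad[i] = (Q(up) - Q(dn)) / (2 * eps)
    return grad @ cov @ grad

# toy query: Q(theta) = theta_0 * theta_1 / (theta_0 * theta_1 + (1 - theta_0) * theta_2)
Q = lambda t: t[0] * t[1] / (t[0] * t[1] + (1 - t[0]) * t[2])
theta_hat = np.array([0.8, 0.5, 0.5])
cov = np.diag([0.01, 0.02, 0.02])              # block-diagonal / uncorrelated parameters
print(delta_method_variance(Q, theta_hat, cov))
```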
To estimate the variance of a query, we only need to compute the derivatives of the query wrt the parameters of the network, and then use the covariance matrix, which is described in [Web:App. A]. In [Web:App. B], we show that all those derivatives are functions of a small set of integrals, involving every single Gaussian variable u_i and every pair of (not necessarily distinct) Gaussian variables u_i, u_j, over both the numerator

bounds and also the denominator bounds:

  ∫_n u_i P_{E=e}(U) dU,   ∫_n u_i u_j P_{E=e}(U) dU,   ∫_d u_i P_{E=e}(U) dU,   ∫_d u_i u_j P_{E=e}(U) dU,

where u_i, u_j ∈ (R ∪ T) are continuous variables of the network that do not appear in the evidence set. So, in our example, we only need to compute these integrals:

  ∫_1^2 ∫_{−∞}^{+∞} χ P_{E=3,+b,d}(a, c) da dc   and   ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} χ P_{E=3,+b,d}(a, c) da dc,

where χ refers to a, c, a², a·c or c², and d iterates over the different values that D can take. (This corresponds to 20 different integrals.) In [Web:App. B], we present an algorithm, CompIntegrals(N, Θ, Q), to compute these integrals; it takes as input the Bayesian network, the parameters Θ and the query Q, and returns all the integrals of the above forms.

Let Q̂ and P̂ be the values of the corresponding functions based on the parameter values Θ = Θ̂. Given those integrals, we can compute the derivative of the query wrt each parameter χ, which could be θ_{U_i}, θ_{U_i C_j}, or σ²_{U_i} (see [Web:App. B]):

  ∂Q/∂χ |_{Θ̂} = (1 / Σ_z ∫_d P̂_z(U) dU) [ Σ_z ∫_n [∂P_z(U)/∂χ]_{Θ̂} dU − Q̂ Σ_z ∫_d [∂P_z(U)/∂χ]_{Θ̂} dU ],   (13)

where, for the Gaussian factor N(u_i; θ_{U_i} + Σ_r θ_{U_i C_r} c_r, σ²_{U_i}) inside P(U),

  ∂P(U)/∂θ_{U_i} = [ (u_i − θ̂_{U_i} − Σ_r θ̂_{U_i C_r} c_r) / σ̂²_{U_i} ] P̂(U),
  ∂P(U)/∂θ_{U_i C_j} = [ c_j (u_i − θ̂_{U_i} − Σ_r θ̂_{U_i C_r} c_r) / σ̂²_{U_i} ] P̂(U),
  ∂P(U)/∂σ²_{U_i} = [ (u_i − θ̂_{U_i} − Σ_r θ̂_{U_i C_r} c_r)² / (2 σ̂⁴_{U_i}) − 1 / (2 σ̂²_{U_i}) ] P̂(U),

all evaluated at Θ̂. Given these derivatives, as well as the covariance matrix (defined above), we can then use Equation 9 to compute V_{Q(Θ)}(X) for each variable X; these can then be added together to form our approximation to the variance (via Equation 8). See [Web:App. B] for the proof, and [Web:Ex. 3] for a specific worked-out example.

The following algorithm computes all derivatives of the form ∂(Σ_z f(z))/∂γ in a general network, using f(z) = ∫_s P_{E=e,z}(U) dU from Equation 11, where the region s stands for either the numerator or the denominator bounds. Here, to compute ∂Q/∂γ over all γ's, we would first call CompDerivativesHybrid(S, Θ, Σ_z f(z)) and then CompDerivativesHybrid(S, Θ, Σ_z g(z)), i.e., with different bounds for the integral. (If the query variable were discrete, e.g., Q = P(+b | +d, E = 3), then the integrals would be the same, but the variables z would be different on different calls.)

CompDerivativesHybrid( N: Bayesian network, Θ: parameters, S = Σ_z ∫_s P_{E=e,z}(U) dU ): returns ∂S/∂γ over all parameters γ
 1: r1 := 0; r2 := 0   % r1: derivatives wrt Dirichlet parameters; r2: derivatives wrt parameters of continuous variables
 2: t := { discrete variables that do not appear in the query }
 3: Let f(z) refer to ∫_s P_{E=e,z}(U) dU
 4: for each assignment w to t do
 5:   for θ_i : Dirichlet parameters associated with w do
 6:     r1(θ_i) += f(w)
 7:   end for
 8:   Θ_w := parameters of continuous variables when discrete variables are instantiated to w
 9:   s := CompDerivatives(N, Θ_w, f(w))   % using Equation 13
10:   r2(Θ_w) += s
11: end for
12: for each Dirichlet parameter of the network, θ_i, do
13:   r1(θ_i) := r1(θ_i) / θ̂_i
14: end for
15: return [r1, r2]

Lines 5-7 and then 12-14 of CompDerivativesHybrid are a brute-force procedure to compute ∂(Σ_z f(z))/∂γ when γ is a Dirichlet parameter associated with a discrete variable. Van Allen et al. (2008) produced a more efficient algorithm for computing ∂(Σ_z f(z))/∂γ when γ is a Dirichlet parameter. For an example, see [Web:Ex. 4].
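The Gaussian-factor derivatives that feed Equation 13 (and the CompDerivatives call on line 9) are standard facts about the normal density; here is a small self-contained sketch of ours for a single factor.

```python
# A sketch of the Gaussian-factor derivatives used in Equation 13, written for
# one factor N(u; theta0 + theta1*c, s2) of P(U).  Standard normal-density facts.
import numpy as np

def dP_dparams(u, c, theta0, theta1, s2):
    """Return dN/dtheta0, dN/dtheta1, dN/dsigma2 for N(u; theta0 + theta1*c, s2)."""
    mu = theta0 + theta1 * c
    N = np.exp(-(u - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    d_theta0 = (u - mu) / s2 * N
    d_theta1 = c * (u - mu) / s2 * N
    d_sigma2 = ((u - mu) ** 2 / (2 * s2 ** 2) - 1 / (2 * s2)) * N
    return d_theta0, d_theta1, d_sigma2

print(dP_dparams(u=1.0, c=0.5, theta0=0.2, theta1=1.0, s2=1.5))
```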
4 Naive Bayes with Discrete Class Variable

A Naive Bayes structure is a simple tree, with a single variable C serving as the only parent of the remaining variables; for notation, we let {F_i} refer to the continuous child variables and {G_j} to the discrete child variables; see Figure 2(a). The discrete variable C can take n different values, according to the Dirichlet distribution θ_C = ⟨θ_1, ..., θ_n⟩ ~ Dir(Э_1, ..., Э_n). The effective sample size is m = Σ_i Э_i, and e_j = Э_j / m is the expected value of the j-th parameter, which corresponds to P(C = j). Each G_i is a discrete child, which takes n_i possible values, according to a Dirichlet distribution. Given the value of its parent, C = j, its parameters are θ_{G_i|j} = ⟨θ_{G_i 1|j}, ..., θ_{G_i n_i|j}⟩ ~ Dir(Э_{G_i 1|j}, ..., Э_{G_i n_i|j}), and its effective sample size (for this parental assignment) is m_{G_i|j} = Σ_k Э_{G_i k|j}. Here e_{ijr} = Э_{G_i r|j} / m_{G_i|j} is the expected value of θ_{G_i r|j}, corresponding to P(G_i = r | C = j). Each F_i is a continuous random variable, distributed as F_i | C = j ~ N(θ_{F_i|j}, σ²_{F_i|j}), where θ_{F_i|j} and σ²_{F_i|j} are jointly distributed

according to a Normal-Inverse-χ² distribution:

  θ_{F_i|j} | σ²_{F_i|j} ~ N( Ж_{F_i|j}, σ²_{F_i|j} / Л_{F_i|j} ),   σ²_{F_i|j} ~ Ч_{F_i|j} Д_{F_i|j} / χ²(Ч_{F_i|j}).

Figure 2: Two examples of Naive Bayes systems (structure + parameters). (a) Discrete parent C with parameters ⟨θ_{+c}, θ_{−c}⟩, discrete children G_i with CPtable rows such as ⟨θ_{+g|+c}, θ_{−g|+c}⟩ and ⟨θ_{+g|−c}, θ_{−g|−c}⟩, and continuous children such as F_1 | ±c ~ N(θ_{F_1|±c}, σ²_{F_1|±c}) and F_3 | ±c ~ N(θ_{F_3|±c}, σ²_{F_3|±c}). (b) Continuous parent C ~ N(θ_C, σ²_C) with continuous children F_1, ..., F_4, e.g., F_1 ~ N(θ_{F_1} + θ_{F_1 C} C, σ²_{F_1}) and F_4 ~ N(θ_{F_4} + θ_{F_4 C} C, σ²_{F_4}).

We want to compute the variance of

  Q(Θ) = P(C = q | F_1 = f_1, ..., F_k = f_k, G_1 = g_1, ..., G_l = g_l, D, Θ) = P(C = q | F_F, G_G, D, Θ),

where F_F ≡ {F_1 = f_1, ..., F_k = f_k}, G_G ≡ {G_1 = g_1, ..., G_l = g_l}, and D is the dataset. We also set p_i := P̂(C = i | F_F, G_G, D, Θ). (Recall P̂ is the probability P(· | Θ̂) computed at the mean value Θ̂ of the parameter vector.)

Theorem 1 (proof in [Web:App. D]). Given the above conditions:

For the root C:
  V_Q(C) = ( p_q² / (m + 1) ) [ Σ_j p_j² / e_j − 2 p_q / e_q + 1 / e_q ].

For each discrete child G_i in the evidence set:
  V_Q(G_i) = p_q² [ (1 − e_{iqg_i}) (1 − p_q)² / ( e_{iqg_i} (1 + m_{G_i|q}) ) + Σ_{j≠q} (1 − e_{ijg_i}) p_j² / ( e_{ijg_i} (1 + m_{G_i|j}) ) ].

For each continuous child F_j in the evidence set:
  V_Q(F_j) = p_q² [ (1 − p_q)² h_jq + Σ_{k≠q} p_k² h_jk ],

where h_ij combines the sensitivity of the Gaussian density N(f_i; θ_{F_i|j}, σ²_{F_i|j}) to its two parameters with the posterior variances of those parameters:

  h_ij = ( (f_i − Ж_{F_i|j}) / σ̂²_{F_i|j} )² V(θ_{F_i|j}) + ( (f_i − Ж_{F_i|j})² / (2 σ̂⁴_{F_i|j}) − 1 / (2 σ̂²_{F_i|j}) )² V(σ²_{F_i|j}),

with σ̂²_{F_i|j} = E[σ²_{F_i|j}] and the variances V(θ_{F_i|j}), V(σ²_{F_i|j}) determined by the Norm/χ²(Ж_{F_i|j}, Л_{F_i|j}, Д_{F_i|j}, Ч_{F_i|j}) hyperparameters.

Example 1. Consider the Bayesian network in Figure 2(a), where C is a binary variable that takes the two values {+c, −c} according to a Dirichlet distribution ⟨θ_{+c}, θ_{−c}⟩ ~ Dir(Э_{+c}, Э_{−c}). G is also binary, drawn according to a Dirichlet distribution conditioned on the value of C (one row for +c and one for −c), and the distributions of the two continuous children F_1 and F_3 are given by F_1 | ±c ~ N(θ_{F_1|±c}, σ²_{F_1|±c}) and F_3 | ±c ~ N(θ_{F_3|±c}, σ²_{F_3|±c}), with each (θ, σ²) pair drawn from its own Norm/χ² distribution. We want to compute the variance of Q = P(+c | F_1 = 2.5, F_3 = 1, +g). (Notice this does not involve the other two child nodes.) Using results from [Web:App. A], we first compute the posterior means E[θ_{F_i|±c}] and E[σ²_{F_i|±c}] (e.g., E[σ²_{F_1|−c}] = 4.8). This yields p_1 = 0.819 and p_2 = 0.181, and so

  V_Q(C) = ( p_1² / (m + 1) ) [ p_1²/e_1 + p_2²/e_2 − 2 p_1/e_1 + 1/e_1 ].

After substitution, we can show that h_11 = 0.063, and similarly obtain h_12, h_31 and h_32, so

  V_Q(F_1) = p_1² [ (1 − p_1)² h_11 + p_2² h_12 ],   V_Q(F_3) = p_1² [ (1 − p_1)² h_31 + p_2² h_32 ],

and V_Q(G) follows from the discrete-child formula. Hence, using Theorem 1,

  V(Q) = V(Q(Θ)) ≈ V_Q(C) + V_Q(F_1) + V_Q(F_3) + V_Q(G).
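For instance, the root term of Theorem 1 (as reconstructed above) is a one-liner; the sketch below is ours, with made-up inputs rather than the example's values.

```python
# A sketch of the root-variable term of Theorem 1 (as reconstructed above):
# V_Q(C) = p_q^2/(m+1) * ( sum_j p_j^2/e_j - 2 p_q/e_q + 1/e_q ).
import numpy as np

def root_variance(p, e, m, q):
    """p: posterior class probabilities p_j; e: prior means e_j; m: effective sample size; q: query class index."""
    p, e = np.asarray(p, float), np.asarray(e, float)
    return p[q] ** 2 / (m + 1) * ((p ** 2 / e).sum() - 2 * p[q] / e[q] + 1 / e[q])

# toy numbers (not the paper's example): Dir(1, 1) root, so e = (0.5, 0.5), m = 2
print(root_variance(p=[0.8, 0.2], e=[0.5, 0.5], m=2, q=0))
```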

5 All Continuous Naive Bayes

Now consider a Naive Bayes structure where all nodes correspond to continuous variables, both the root C and the children F_i. The distribution of the parent is given by

  C ~ N(θ_C, σ²_C),   θ_C | σ²_C ~ N(Ж_C, σ²_C / Л_C),   σ²_C ~ Ч_C Д_C / χ²(Ч_C).

For each child F_i,

  F_i | C ~ N(θ_{F_i} + θ_{F_i C} C, σ²_{F_i}),
  (θ_{F_i}, θ_{F_i C}) | σ²_{F_i} ~ N₂( (Ж_{F_i}, Ж_{F_i C}), σ²_{F_i} Л⁻¹_{F_i} ),   σ²_{F_i} ~ Ч_{F_i} Д_{F_i} / χ²(Ч_{F_i}).

We want to compute the variance of the query

  Q(Θ) = P(c_1 < C < c_2 | F_1 = f_1, ..., F_n = f_n, Θ, D) = P(c_1 < C < c_2 | F_F, Θ, D).

Let P_{F_F}(c) be the probability density function of this conditional distribution of C, and P̂_{F_F}(c) be its value at Θ = Θ̂.

Theorem 2 (proof in [Web:App. F]). Given the above conditions, V_Q(C) is a closed-form expression involving P̂_{F_F}(c_1), P̂_{F_F}(c_2), the endpoints c_1 and c_2, and auxiliary quantities A, B, E, K and R_j that are explicit functions of the evidence values f_j and of the hyperparameters of C and of each F_j; and, for each evidence variable F_i,

  V_Q(F_i) = R_i u_i M_i u_i^T,

where i iterates over the evidence variables, M_j = Cov(θ_{F_j}, θ_{F_j C}, σ²_{F_j}), u_i = (u_{i1}, u_{i2}, u_{i3}), and the scalar R_i and the components u_{ik} are explicit functions of P̂_{F_F}(c_1), P̂_{F_F}(c_2), the evidence value f_i and the hyperparameters, all evaluated at Θ̂. See [Web:Section 5] for an example.

6 Empirical Studies

Given that the parameters for different variables are independent (e.g., Θ_A is independent of Θ_E, etc.), and the distributions for each individual variable are conjugate, the posterior distribution, given a complete datasample, is unambiguous and straightforward to compute; see Section 2.1. This is why we are focusing on the challenge of computing the variance of the response to a specific query, given this posterior distribution.

As noted earlier, our estimation technique for computing V(Q(Θ)) (Equation 7) makes several assumptions, including the assumption that the mean of the response is the response of the mean of the parameters (E[Q(Θ)] ≈ Q(Θ̂)) and that the first-order approximation will work effectively. Following Van Allen et al. (2008), we therefore ran a number of studies to explore whether our approximations are sufficiently close, at least within a factor of 2. In each study, we first identified a particular structure S (for space reasons, we consider only Naive Bayes here; see [Web:Studies]) and a specific query, which here is of the form P(C = c | F_1 = f_1, ..., F_n = f_n). We then considered various settings of the hyperparameters (i.e., the Cyrillic variables). For example, perhaps C's parameters were ⟨θ_{+c}, θ_{−c}⟩ ~ Dir(4, 6), and F_1's parameters were θ_{F_1}, θ_{F_1 C}, σ_{F_1} ~ Norm/χ²((0, 0), I, 1, 5), etc.⁴ For each set of hyperparameters, we could then use Equation 7 to produce an analytic estimate of the variance of the response, V ≈ V(Q(Θ)). We can also obtain a (presumably more accurate) empirical estimate σ̃², as follows. We first draw a number of parameter values from the posterior distribution over the parameters (as encoded by the hyperparameters).⁵ For example, given the hyperparameters shown above, on one draw we might get ⟨θ^(1)_{+c}, θ^(1)_{−c}⟩ = ⟨0.42, 0.58⟩ and ⟨θ^(1)_{F_1}, θ^(1)_{F_1 C}, σ^(1)_{F_1}⟩ = ⟨0.12, 0.04, 0.9⟩; the next draw might yield ⟨θ^(2)_{+c}, θ^(2)_{−c}⟩ = ⟨0.39, 0.61⟩ and ⟨θ^(2)_{F_1}, θ^(2)_{F_1 C}, σ^(2)_{F_1}⟩ = ⟨0.09, 0.03, 1.05⟩. For each particular assignment to the parameters, call it Θ^(i) = ⟨θ^(i)_j⟩_j, we can then compute the associated response to the query, r^(i) = Q(Θ^(i)). After m = 1,000 draws, we obtain m responses, from which we can compute the empirical variance σ̃².

⁴ To simplify the notation, we will deal with σ rather than σ². ⁵ Note each is a sampling of the parameters; n.b., not of the domain variables, i.e., this is not over values for C nor values for F_1.
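The sampling-based estimate just described can be sketched generically as follows; the sketch is ours, and draw_parameters and query_response are hypothetical stand-ins for the posterior sampler and the inference routine, not the paper's code.

```python
# A generic sketch of the empirical (sampling-based) variance estimate:
# draw parameters from their posterior, evaluate the query response for each
# draw, and take the sample variance of the responses.
import numpy as np

def empirical_query_variance(draw_parameters, query_response, m=1000, seed=0):
    """draw_parameters(rng) -> one sampled Theta; query_response(Theta) -> Q(Theta) in [0, 1]."""
    rng = np.random.default_rng(seed)
    responses = np.array([query_response(draw_parameters(rng)) for _ in range(m)])
    return responses.var(ddof=1)

# toy stand-ins: Theta is a single Dirichlet row, and the "query" is its first entry
draw = lambda rng: rng.dirichlet([4.0, 6.0])
resp = lambda theta: theta[0]
print(empirical_query_variance(draw, resp, m=1000))
```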

Figure 3 (all panels plot mean relative difference): (a) relative error vs. training-set size; (b) relative error vs. number of children (all continuous); (c) relative error vs. number of children (both continuous and discrete).

Given the V and σ̃² values computed for each network structure, query, and set of hyperparameters, we can then compute the relative error, |V − σ̃²| / σ̃². To investigate the quality of our approximation, we explore two scaling questions.

(1) How does the relative error scale with training-set size? Here, we considered a Naive Bayes network with a continuous parent and 4 continuous children (like Figure 2(b)). We then initialized the hyperparameters as shown above, computed posterior parameters by training this structure on data sets of size {10, 50, 100, 250, 1,000, 5,000, 10,000}, and computed both V and σ̃² (over m = 1,000 draws) for each of 100 different queries. Figure 3(a) shows that the average (over 100 queries) relative error decreases as we increase the training set size.

(2) How does the relative error vary with the number of children? Here, we consider a discrete parent and r ∈ {1, 2, 4, 8, 16} continuous children, trained on 1,000 instances. Figure 3(b) shows the average over 100 queries; we see that the difference between V and σ̃² grows with the number of children. We also considered both discrete and continuous children: again consider r ∈ {1, 2, 4, 8, 16} children, but now half are discrete and the other half are continuous. (For r = 1, the only child was discrete.) Figure 3(c) shows that, while the relative error again grows with the number of children, this growth is slower here than in the all-continuous case shown above.

In all cases, we see that the estimate is close: always within the desired factor of 2 of the correct answer. Moreover, it is very efficient to compute (as it is just a straight-line computation), much faster than the sampling approach, which involved 1,000 inferences. See [Web:Studies] for more extensive studies and analyses, wrt Naive Bayes and also more complicated structures.

7 Conclusion

Van Allen et al. (2008) earlier motivated the task of computing the variance of the response to a query wrt a given Bayesian network, as this can help us (1) estimate the bias + variance of each given Bayesian network, which can help us select the best discriminative model (Guo & Greiner, 2005), and (2) combine the responses of various independent belief net classifiers by weighting their respective (mean) probabilities by 1/variance (Lee, Greiner, & Wang, 2006). That earlier paper, however, considered only discrete variables. This current paper extends that earlier one by showing how to deal with continuous (Gaussian) variables. We show how to use the Delta method to obtain an approximation for arbitrary networks (insisting only that discrete variables have only discrete parents). We also provide simpler forms that apply to simple Naive Bayes models: one for a discrete root and arbitrary children, and another for a continuous parent and continuous children. We also provide empirical evidence to demonstrate that this approach works effectively.

References

Casella, G., & Berger, R. L. (2002). Statistical Inference.
Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall.
Guo, Y., & Greiner, R. (2005). Discriminative model selection for belief net structures. In AAAI.
Heckerman, D. E. (1998). A tutorial on learning with Bayesian networks. In Learning in Graphical Models.
Koller, D., & Friedman, N. (2007). Graphical Models. To appear.
Lee, C., Greiner, R., & Wang, S. (2006). Using variance estimates to combine Bayesian classifiers. In ICML.
Oehlert, G. W. (1992). A note on the delta method. The American Statistician, 46(1), 27-29.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Van Allen, T., Singh, A., Greiner, R., & Hooper, P. (2008). Quantifying the uncertainty of a belief net response: Bayesian error-bars for belief net inference. Artificial Intelligence, 172, 483-513.
