Estimating the Variance of Query Responses in Hybrid Bayesian Nets

Yasin Abbasi-Yadkori, Russ Greiner, Bret Hoehn (Dept. of Computing Science, University of Alberta) and Peter Hooper (Dept. of Mathematical and Statistical Sciences, University of Alberta)

Abstract

A Bayesian network is a model of a distribution, encoded using a network structure S augmented with conditional distribution parameters (CDPs) Θ that specify the conditional probability of a variable, given each assignment to its parents. Given a fixed structure S and CDPs Θ, we can compute the response to a fixed query Q(Θ) = P_{S,Θ}(C = c | E = e), which is a real number. However, in many situations the CD-parameters Θ themselves can be uncertain, e.g., when they are estimated from a (random) data sample. Here, this response Q(Θ) will be a random variable. Earlier results provided a way to estimate the variance of this response when all variables are discrete. This paper extends that analysis to deal with Bayesian networks that can also include normally distributed continuous variables. (We consider essentially arbitrary Bayesian net structures, assuming only that discrete variables have no continuous parents.) In particular, we show how to compute posterior distributions of each independent CDP, and then how to use the Delta method to approximate the variance of any query. We also derive a compact form for the variance in the case of Naive Bayes structures. Finally, we provide empirical studies that demonstrate that our system works effectively, even when the parameters correspond to a small sample.

1 Introduction

In general, a Bayesian network is a model of a distribution, represented as a directed acyclic graph S, whose nodes represent variables and whose arcs represent the dependencies between them, together with conditional distribution parameters (CDPs) Θ that specify the conditional probability of each variable, given each assignment to its parents. Given a fixed structure S and parameters Θ, we can compute the response to a fixed query Q(Θ) = P_{S,Θ}(C = c | E = e), which is a real number. However, in many situations the CD-parameters Θ themselves can be uncertain, e.g., when they are estimated from a (random) data sample. Here, this response Q(Θ) will be a random variable.

For discrete variables whose parents are discrete, these CDPs correspond to CPtables (Pearl, 1988). Consider for example the variable D in Figure 1, which has only the parent B. Its parameter θ_{D|+b} = ⟨θ_{+d|+b}, θ_{−d|+b}⟩ corresponds to the distribution of D given that B is true, and θ_{D|−b} = ⟨θ_{+d|−b}, θ_{−d|−b}⟩ corresponds to the D distribution when B is false. If these values were known with certainty, we could view them as constants, e.g., θ_{D|+b} = ⟨0.3, 0.7⟩. However, if they were only based on an expert's not-necessarily-perfect assessment, or if they were learned from a data sample, they would not be known with certainty. Here, we would represent the parameter as a random variable; perhaps θ_{D|+b} ~ Dir(3, 7), to mean this parameter is drawn from a Dirichlet distribution with parameters 3 and 7. Similarly, the normally distributed variable A ~ N(θ_A, σ²_A) depends on parameters that are drawn from some distribution, here from a Normal-Inverse-χ² distribution (see below).

Now consider computing the response to a fixed query from this fixed structure, say

  P(−b | +d) = θ_{−b} θ_{+d|−b} / (θ_{−b} θ_{+d|−b} + θ_{+b} θ_{+d|+b}),

or

  P(+b | 0 < C < 1)
   = P(+b, 0 < C < 1, −∞ < A < +∞) / P(0 < C < 1, −∞ < A < +∞)
   = P(+b) P(0 < C < 1, −∞ < A < +∞ | +b) / P(0 < C < 1, −∞ < A < +∞),

whose numerator is

  θ_{+b} · (1 / (2π σ_{C,+b} σ_A)) ∫_{−∞}^{+∞} ∫_0^1 e^{−(y − θ_A)² / (2σ²_A)} e^{−(x − θ_{C|+b} − θ_{CA,+b} y)² / (2σ²_{C,+b})} dx dy.

Clearly these responses depend on the parameters, here the θ_i and σ²_i.
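To make this dependence concrete, the short sketch below (ours, not from the paper; all numeric parameter values are invented for illustration) evaluates a response of the form P(+b | 0 < C < 1) for one fixed setting of the θ's and σ²'s, marginalizing A numerically.

```python
# A minimal sketch of how a query response such as P(+b | 0 < C < 1)
# is a deterministic function of the CD-parameters.
# All numeric parameter values below are made up for illustration.
import numpy as np
from scipy import stats, integrate

theta_b = 0.6                          # P(B = +b)
theta_A, sigma2_A = 0.0, 1.0           # A ~ N(theta_A, sigma2_A)
# C | A, b ~ N(theta_C[b] + theta_CA[b] * A, sigma2_C[b])
theta_C  = {'+b': 0.5, '-b': -0.5}
theta_CA = {'+b': 1.0, '-b': 0.3}
sigma2_C = {'+b': 1.0, '-b': 2.0}

def p_joint(b):
    """P(b, 0 < C < 1): integrate A out, using the Gaussian CDF for C's range."""
    prior_b = theta_b if b == '+b' else 1.0 - theta_b
    def integrand(a):
        mu_c = theta_C[b] + theta_CA[b] * a
        sd_c = np.sqrt(sigma2_C[b])
        p_c_range = stats.norm.cdf(1.0, mu_c, sd_c) - stats.norm.cdf(0.0, mu_c, sd_c)
        return stats.norm.pdf(a, theta_A, np.sqrt(sigma2_A)) * p_c_range
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return prior_b * val

num = p_joint('+b')
den = num + p_joint('-b')
print("Q(Theta) = P(+b | 0 < C < 1) =", num / den)
```

Re-running this with different parameter values gives a different response, which is why an uncertain Θ makes Q(Θ) a random variable.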
As they are random variables, clearly the responses to these queries are random variables as well. Van Allen et al. (2008) earlier dealt with discrete variables in arbitrary graph structures, proving that this response is asymptotically normal, and providing both the expected value of the response and the asymptotic variance.

Figure 1: Simple example of a hybrid network. Here A ~ N(θ_A, σ²_A); B has parameters ⟨θ_{+b}, θ_{−b}⟩; D | B has parameters ⟨θ_{+d|+b}, θ_{−d|+b}⟩ and ⟨θ_{+d|−b}, θ_{−d|−b}⟩; C | A, +b ~ N(θ_{C|+b} + θ_{CA,+b} A, σ²_{C,+b}) and C | A, −b ~ N(θ_{C|−b} + θ_{CA,−b} A, σ²_{C,−b}); and E | A, C ~ N(θ_E + θ_{EA} A + θ_{EC} C, σ²_E).

This paper extends that analysis to deal with Bayesian networks that can also include continuous, normally-distributed variables, whose CDPs are drawn from a Normal-Inverse-χ² distribution. (We consider essentially arbitrary Bayesian net structures, requiring only that discrete variables have no continuous parents.) In particular, Section 2 provides the foundations, showing how to compute posterior distributions of each independent CDP. Section 3 then shows how to use the Delta method to approximate the variance of any query for general Bayesian network structures. Sections 4 and 5 derive compact forms for the variance in the case of Naive Bayes structures, with discrete vs continuous root nodes, and Section 6 presents empirical results showing that it works effectively. The website ualberta.ca/~greiner/research/hybrid provides additional details about this process, including proofs and detailed examples; we indicate this using the notation [Web:x] below.

2 Foundations

Following standard convention, we represent the distribution of each discrete variable by a Conditional Probability Table (CPtable), whose rows each correspond to a specific assignment to that variable's parents. (Recall we require that all parents of discrete variables be discrete.) Each of these row-parameters is drawn from a Dirichlet distribution (Heckerman, 1998; Van Allen et al., 2008). We will assume all continuous variables are normally distributed. For hybrid Bayesian nets, which include both discrete and continuous variables, we use Conditional Linear Gaussian models to represent the conditional distributions (Koller & Friedman, 2007). Their parameters are themselves random variables, with mean and variance drawn from a Normal-Inverse-χ² distribution (Gelman, Carlin, Stern, & Rubin, 2003); see below.

To be more concrete, consider again Figure 1. To simplify our description, we will assume that all discrete variables are binary (although all of our analysis applies if they range over any finite set of values). Here, B's parameters are θ_B = ⟨θ_{+b}, θ_{−b}⟩, and D has two parameters,¹ θ_{D|+b} = ⟨θ_{+d|+b}, θ_{−d|+b}⟩ associated with B being true and θ_{D|−b} = ⟨θ_{+d|−b}, θ_{−d|−b}⟩ for B being false. We assume each parameter is drawn from a Dirichlet distribution; here θ_B = ⟨θ_{+b}, θ_{−b}⟩ ~ Dir(Э_{+b}, Э_{−b}). We initialize all parameters to be "uniform"; that is, Э_i = 1.

The continuous variable A has no parents; its CDP is simply A ~ N(θ_A, σ²_A), where in general N(μ, σ²) is a Gaussian distribution with mean μ and variance σ². These parameters are also random variables:

  θ_A | σ²_A ~ N(Ж_A, σ²_A / Л_A),   σ²_A ~ Ч_A Д_A / χ²(Ч_A),   (1)

where χ²(ν) refers to a chi-squared distribution with ν degrees of freedom. Here Ж_A, Л_A, Ч_A, and Д_A are all hyperparameters,² each initialized to a (positive) real number. By convention, we will typically set Ж_A = 0, Л_A = 1, Д_A = 1, and Ч_A = 5.

Next, the continuous variable C is a child of both the continuous A and the discrete (binary) B. Its CDP is

  C | A = a, +b ~ N(θ_{C|+b} + θ_{CA,+b} a, σ²_{C,+b}),
  C | A = a, −b ~ N(θ_{C|−b} + θ_{CA,−b} a, σ²_{C,−b}).

Note we have a different set of parameters for each instantiation of the discrete parent. To specify the distributions over these parameters:

  (θ_{C|+b}, θ_{CA,+b}) | σ²_{C,+b} ~ N₂( (Ж_{C|+b}, Ж_{CA,+b}), σ²_{C,+b} Л⁻¹_{C,+b} ),   σ²_{C,+b} ~ Ч_{C,+b} Д_{C,+b} / χ²(Ч_{C,+b}).   (2)

(There is a similar set of equations associated with C, −b.)
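The following small sketch (ours, not the paper's) shows how one draw of these random parameters could be generated, using Equation 1 with the conventional values Ж_A = 0, Л_A = 1, Д_A = 1, Ч_A = 5, together with a uniform Dirichlet row.

```python
# A small sketch of sampling (theta, sigma^2) from a Normal-Inverse-chi^2 prior
# with hyperparameters (mean, kappa, scale, dof) = (Zhe, El, De, Che) = (0, 1, 1, 5),
# and one CPtable row from Dir(1, 1).  Hyperparameter ordering is as reconstructed above.
import numpy as np

rng = np.random.default_rng(0)

def sample_norm_inv_chi2(mean, kappa, scale, dof, rng):
    # sigma^2 ~ dof * scale / chi^2(dof)   (scaled inverse-chi-squared)
    sigma2 = dof * scale / rng.chisquare(dof)
    # theta | sigma^2 ~ N(mean, sigma^2 / kappa)
    theta = rng.normal(mean, np.sqrt(sigma2 / kappa))
    return theta, sigma2

# e.g. the prior used for the root variable A
theta_A, sigma2_A = sample_norm_inv_chi2(0.0, 1.0, 1.0, 5.0, rng)

# a row of a discrete CPtable, e.g. theta_{D|+b} ~ Dir(1, 1)
theta_D_given_pb = rng.dirichlet([1.0, 1.0])
print(theta_A, sigma2_A, theta_D_given_pb)
```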
The variable E has two continuous parents, with the CDP

  E | A = a, C = c ~ N(θ_E + θ_{EA} a + θ_{EC} c, σ²_E),

where

  (θ_E, θ_{EA}, θ_{EC}) | σ²_E ~ N₃( (Ж_E, Ж_{EA}, Ж_{EC}), σ²_E Л⁻¹_E ),   σ²_E ~ Ч_E Д_E / χ²(Ч_E).

¹ For binary variables X, we will let +x abbreviate X = 1 and −x abbreviate X = 0.
² We use normal Roman letters (e.g., A, a) for a base variable and its associated value, Greek letters (e.g., Θ, θ) for parameters and their values, and Cyrillic letters (e.g., Ж, Л, Ч, Д, Э, pronounced Zhe, El, Che, De, E) for hyper-parameters.

In general, consider the continuous variable U with r discrete parents {D_1, ..., D_r} and t continuous parents {C_1, ..., C_t}. Let d = ⟨d_1, ..., d_r⟩ be the values of the discrete variables, and c = ⟨c_1, ..., c_t⟩ be the values of the continuous variables. Then

  U | d, c ~ N( θ_{U|d} + Σ_i θ_{U|d,i} c_i, σ²_{U|d} ).   (3)

Notice there are 2^{|d|} such equations (assuming each D_i is binary), and for each, we need to specify t different θ_{U|d,i} parameters, as well as a constant term θ_{U|d} and a variance term σ²_{U|d}, for a total of 2^{|d|} (t + 2) parameters. Now to specify the distribution over these parameters. The parameters associated with different d assignments are independent. However, the (t + 2) parameters for a single d are interdependent (to simplify our notation, below we omit the U|d part of the subscripts):

  θ | σ² ~ N_{t+1}( Ж, σ² Л⁻¹ ),   σ² ~ Ч Д / χ²(Ч).   (4)

Notice this requires O(t²) hyperparameters: { Ж_i | i = 1..t+1 }, { Л_{i,j} | i = 1..t+1, j = 1..t+1 }, as well as Ч and Д. We will sometimes abbreviate Equation 4 as θ, σ² ~ Norm/χ²(Ж, Л, Д, Ч), where Norm/χ²(·) refers to the Normal-Inverse-χ² distribution. (When there are no continuous parents, t = 0, and both Ж and Л are scalars.) In general, we initialize the (t+1)-ary vector Ж to be all 0's, and Л to be the (t+1) × (t+1) identity matrix I_{t+1}.

2.1 Computing Posterior Distributions

We assume the parameters for one row are independent of the parameters for the others; e.g., θ_{D|+b} ⊥ θ_B, θ_{D|+b} ⊥ θ_A, and θ_{D|+b} ⊥ σ²_A. Moreover, each of these distributions is conjugate. That is, if we initialize the parameter for each row of each discrete variable as Dir(1, 1), and the parameters for each (conditional) continuous variable as Norm/χ²(0, I, 1, 5), and then observe a complete datasample S of five instances over ⟨A, B, C, D, E⟩, the posterior distribution is

  θ_B = ⟨θ_{+b}, θ_{−b}⟩ | S ~ Dir(5, 2),
  θ_{D|+b} = ⟨θ_{+d|+b}, θ_{−d|+b}⟩ | S ~ Dir(3, 3),
  θ_{D|−b} = ⟨θ_{+d|−b}, θ_{−d|−b}⟩ | S ~ Dir(2, 1).

(Here, we compute the posterior hyperparameters by simply adding to the prior the number of examples that match each condition. So, as the hyperparameters for θ_B were initially ⟨1, 1⟩, after seeing 4 B = +b and 1 B = −b instances in S, the posterior is Dir(1 + 4, 1 + 1).)

Now to compute the posterior for the continuous variables (Gelman et al., 2003): Let n be the effective sample size (here n = 5), Ā = (a_1 + ... + a_5)/5 = 0.36 be the sample mean, and (n−1)s² = Σ_i (a_i − Ā)² be the sum of squares. We use these update rules to produce the posterior distribution:

  Ж' = (Л Ж + n Ā) / (Л + n),   Л' = Л + n,   Ч' = Ч + n,   Ч' Д' = Ч Д + (n−1)s² + (Л n / (Л + n)) (Ā − Ж)².

Hence, for variable A,

  θ_A, σ²_A | S ~ Norm/χ²(0.3, 6, 0.644, 10).   (5)

(This is described in more detail in [Web:App. A]; for the general update rules, see [Web:App. A].) For variable C when B = +b,

  (θ_{C|+b}, θ_{CA,+b}) | σ²_{C,+b} ~ N₂( (0.4, 0.009), σ²_{C,+b} Л'⁻¹_{C,+b} ),   where σ²_{C,+b} ~ Ч' Д' / χ²(9),

and

  (θ_E, θ_{EA}, θ_{EC}) | σ²_E ~ N₃( (0.4, 0.065, 0.6), σ²_E Л'⁻¹_E ),   where σ²_E ~ Ч'_E Д'_E / χ²(10).
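As a sanity check on these update rules, here is a small sketch (ours) of the conjugate Norm/χ² update; the five A-values below are invented, chosen only so that their mean is 0.36.

```python
# A sketch of the conjugate Normal-Inverse-chi^2 update described above
# (standard update rules; variable names are ours, not the paper's).
import numpy as np

def update_norm_inv_chi2(mean, kappa, scale, dof, data):
    data = np.asarray(data, dtype=float)
    n = len(data)
    xbar = data.mean()
    ss = ((data - xbar) ** 2).sum()          # (n-1) s^2
    kappa_n = kappa + n
    dof_n = dof + n
    mean_n = (kappa * mean + n * xbar) / kappa_n
    scale_n = (dof * scale + ss + (kappa * n / kappa_n) * (xbar - mean) ** 2) / dof_n
    return mean_n, kappa_n, scale_n, dof_n

# e.g. updating the prior Norm/chi^2(0, 1, 1, 5) with a sample of five A-values
# whose mean is 0.36 (the specific values here are made up):
sample_A = [0.10, 0.55, 0.30, 0.60, 0.25]
print(update_norm_inv_chi2(0.0, 1.0, 1.0, 5.0, sample_A))
```

With Л = 1 and Ч = 5 this reproduces the posterior mean 0.3, Л' = 6 and Ч' = 10 above; the posterior scale depends on the (unspecified) sum of squares of the actual sample.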

3 Estimating Variance

To define our task, we assume we are given: a Bayesian net structure S; the (posterior) distribution over the parameters for each variable, Θ (these correspond to the Dirichlet parameters for each discrete variable, and the Normal-Inverse-χ² parameters for each Gaussian, and are denoted using the Cyrillic letters), so that in the above example (Figure 1), Θ = ⟨θ_B, θ_{D|+b}, θ_{D|−b}, (θ_A, σ²_A), (θ_{C|+b}, θ_{CA,+b}, σ²_{C,+b}), (θ_{C|−b}, θ_{CA,−b}, σ²_{C,−b}), (θ_E, θ_{EA}, θ_{EC}, σ²_E)⟩; and a specific query over some variables within the network, Q(Θ) = P_{S,Θ}(C = c | E = e). (This notation emphasizes its dependence on the parameters.) We will consider queries whose query variables are each either an assignment to a discrete variable (e.g., +b) or a range for a continuous variable (e.g., 1 < C < 2), and whose evidence variables are each a specific assignment to either a discrete or continuous variable, e.g., D = d or E = 3. Given this, the response will be a random variable in the interval [0, 1]; we want to return a good estimate of both its mean and its variance.

We estimate the variance using the Delta method (Oehlert, 1992; Casella & Berger, 2002). Let Θ̂ = E[Θ] be the expected value of the parameter values, and ∇_Θ Q(Θ̂) = [∂Q/∂θ_i]_i be the vector of the derivatives of Q wrt each of the parameters θ_i, evaluated at Θ̂. Using a Taylor expansion,

  Q(Θ) = Q(Θ̂) + ∇_Θ Q(Θ̂)^T (Θ − Θ̂) + R,   (6)

where R contains the terms of degree 2 and higher. Assuming this R is negligible,

  Q(Θ) − Q(Θ̂) ≈ ∇_Θ Q(Θ̂)^T (Θ − Θ̂),

which means, assuming E[Q(Θ)] ≈ Q(Θ̂),³

  E[ (Q(Θ) − E[Q(Θ)])² ] ≈ E[ (Q(Θ) − Q(Θ̂))² ] ≈ ∇_Θ Q(Θ̂)^T Cov(Θ) ∇_Θ Q(Θ̂),

where Cov(Θ) is the variance-covariance matrix. Note the left hand side is the variance of the response, V(Q(Θ)), which suggests we can approximate this variance using

  V(Q(Θ)) ≈ ∇_Θ Q(Θ̂)^T Cov(Θ) ∇_Θ Q(Θ̂).   (7)

The two challenges, therefore, are (1) computing the covariance matrix Cov(Θ), and (2) computing the derivatives ∂Q/∂θ_i. Fortunately, given standard assumptions about independence of the different parameters, the parameters associated with different variables are uncorrelated; i.e., for each pair of distinct variables X and Y, we have Θ(X) ⊥ Θ(Y), where Θ(X) are the parameters associated with the variable X. This means the covariance matrix will be a block diagonal matrix, and so

  V(Q(Θ)) ≈ Σ_X V_{Q(Θ)}(X),   (8)

where

  V_{Q(Θ)}(X) = ∇_{Θ(X)} Q(Θ̂)^T Cov(Θ(X)) ∇_{Θ(X)} Q(Θ̂).   (9)

The rest of this section describes how to compute the covariance terms for each type of node: a discrete child of discrete parents, and a continuous child of both discrete and continuous parents. The next two sections show simpler versions for Naive Bayes structures.

³ While this claim holds for discrete networks (Cooper & Herskovits, 1992), it does not apply to continuous networks; see [Web:CounterEx].

Consider the network shown in Figure 1 and the query

  Q = P(1 < C < 2 | E = 3, +b).   (10)

In general, we consider queries of the form Q = P(R ∈ I | E = e), which allows us to partition the variables into 3 sets: the ones that have some specific instantiation E (in the evidence component of the query), the ones that are in some range R (in the query component of the query), and the remaining variables T that do not appear anywhere in the query. (Here, R = {C}, I = {[1, 2]}, E = {E, B}, e = {3, +b}, and T = {A, D}.)

Let P(U = u) be the probability distribution function of the joint distribution of the variables U. For example,

  P(a, c, e | +b) = N(a; θ_A, σ²_A) N(c; θ_{C|+b} + θ_{CA,+b} a, σ²_{C,+b}) N(e; θ_E + θ_{EA} a + θ_{EC} c, σ²_E).

Now let P_{E=e}(U = u) be the value of P(U = u) when all evidence variables are substituted with their values in the query. For example, using

  P_{E=3,+b}(a, c) = θ_{+b} N(a; θ_A, σ²_A) N(c; θ_{C|+b} + θ_{CA,+b} a, σ²_{C,+b}) N(3; θ_E + θ_{EA} a + θ_{EC} c, σ²_E),

we have

  Q(Θ) = [ ∫_1^2 ∫_{−∞}^{+∞} Σ_d P_{E=3,+b,d}(a, c) da dc ] / [ ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} Σ_d P_{E=3,+b,d}(a, c) da dc ] = Σ_d f(+b, d) / Σ_d g(+b, d),

for the obvious f(·) and g(·) functions. To simplify our notation, we will let U refer to the continuous variables in T ∪ R, let ∫_n and ∫_d denote integration over the numerator and denominator bounds respectively, and let z range over the assignments to the discrete variables that do not appear in the evidence. So, in general,

  Q = Σ_z ∫_n P_{E=e,z}(U) dU / Σ_z ∫_d P_{E=e,z}(U) dU = Σ_z f(z) / Σ_z g(z).   (11)

In order to compute the variance of Q, we need to compute its derivatives wrt the parameters. Letting γ be an arbitrary parameter,

  ∂Q/∂γ = (1 / Σ_z g(z)) [ ∂(Σ_z f(z))/∂γ ] − (Q / Σ_z g(z)) [ ∂(Σ_z g(z))/∂γ ].   (12)

Now recall the Delta method, Equation 7.
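The core of Equation 7 is mechanical once the gradient and the (block-diagonal) covariance are available; the sketch below is ours, with a toy query, a made-up covariance, and a numerical gradient standing in for the closed-form derivatives developed next.

```python
# A generic sketch of the Delta-method approximation in Equation 7:
# V(Q(Theta)) ~ grad_Q(Theta_hat)^T Cov(Theta) grad_Q(Theta_hat),
# with the gradient taken numerically (the paper computes it in closed form).
import numpy as np

def delta_method_variance(Q, theta_hat, cov, eps=1e-5):
    theta_hat = np.asarray(theta_hat, dtype=float)
    grad = np.zeros_like(theta_hat)
    for i in range(len(theta_hat)):            # central finite differences
        up, dn = theta_hat.copy(), theta_hat.copy()
        up[i] += eps
        dn[i] -= eps
        grad[i] = (Q(up) - Q(dn)) / (2 * eps)
    return grad @ cov @ grad

# toy query: Q(theta) = theta_0 * theta_1 / (theta_0 * theta_1 + (1 - theta_0) * theta_2)
Q = lambda t: t[0] * t[1] / (t[0] * t[1] + (1 - t[0]) * t[2])
theta_hat = np.array([0.8, 0.5, 0.5])
cov = np.diag([0.01, 0.02, 0.02])              # block-diagonal / uncorrelated parameters
print(delta_method_variance(Q, theta_hat, cov))
```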
To estimate the variance of a query, we only need to compute the derivatives of the query wrt the parameters of the network, and then use the covariance matrix, which is described in [Web:App. A]. In [Web:App. B], we show that all those derivatives are functions of a small set of integrals, involving every single Gaussian variable u_i and every pair of (not necessarily distinct) Gaussian variables u_i, u_j, over both the numerator

bounds and also the denominator bounds:

  ∫_n u_i P_{E=e}(U) dU,   ∫_n u_i u_j P_{E=e}(U) dU,   ∫_d u_i P_{E=e}(U) dU,   ∫_d u_i u_j P_{E=e}(U) dU,

where u_i, u_j ∈ (R ∪ T) are continuous variables of the network that do not appear in the evidence set. So, in our example, we only need to compute these integrals:

  ∫_1^2 ∫_{−∞}^{+∞} χ P_{E=3,+b,d}(a, c) da dc   and   ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} χ P_{E=3,+b,d}(a, c) da dc,

where χ refers to a, c, a², a·c or c², and d iterates over the different values that D can take. (This corresponds to 20 different integrals.) In [Web:App. B], we present an algorithm, CompIntegrals(N, Θ, Q), to compute these integrals; it takes as input the Bayesian network, the parameters Θ and the query Q, and returns all the integrals of the above forms.

Let Q̂ and P̂ be the values of the corresponding functions based on the parameter values Θ = Θ̂. Given those integrals, we can compute the derivative of the query wrt each parameter χ, which could be θ_{U_i}, θ_{U_i C_j}, or σ²_{U_i} (see [Web:App. B]):

  ∂Q/∂χ |_{Θ̂} = (1 / Σ_z ∫_d P̂_z(U) dU) [ Σ_z ∫_n [∂P_z(U)/∂χ]_{Θ̂} dU − Q̂ Σ_z ∫_d [∂P_z(U)/∂χ]_{Θ̂} dU ],   (13)

where, for the Gaussian factor N(u_i; θ_{U_i} + Σ_r θ_{U_i C_r} c_r, σ²_{U_i}) inside P(U),

  ∂P(U)/∂θ_{U_i} = [ (u_i − θ̂_{U_i} − Σ_r θ̂_{U_i C_r} c_r) / σ̂²_{U_i} ] P̂(U),
  ∂P(U)/∂θ_{U_i C_j} = [ c_j (u_i − θ̂_{U_i} − Σ_r θ̂_{U_i C_r} c_r) / σ̂²_{U_i} ] P̂(U),
  ∂P(U)/∂σ²_{U_i} = [ (u_i − θ̂_{U_i} − Σ_r θ̂_{U_i C_r} c_r)² / (2 σ̂⁴_{U_i}) − 1 / (2 σ̂²_{U_i}) ] P̂(U),

all evaluated at Θ̂. Given these derivatives, as well as the covariance matrix (defined above), we can then use Equation 9 to compute V_{Q(Θ)}(X) for each variable X; these can then be added together to form our approximation to the variance (via Equation 8). See [Web:App. B] for the proof, and [Web:Ex. 3] for a specific worked-out example.

The following algorithm computes all derivatives of the form ∂(Σ_z f(z))/∂γ in a general network, using f(z) = ∫_s P_{E=e,z}(U) dU from Equation 11, where the region s stands for either the numerator or the denominator bounds. Here, to compute ∂Q/∂γ over all γ's, we would first call CompDerivativesHybrid(S, Θ, Σ_z f(z)) and then CompDerivativesHybrid(S, Θ, Σ_z g(z)), i.e., with different bounds for the integral. (If the query variable were discrete, e.g., Q = P(+b | +d, E = 3), then the integrals would be the same, but the variables z would be different on different calls.)

CompDerivativesHybrid( N: Bayesian network, Θ: parameters, S = Σ_z ∫_s P_{E=e,z}(U) dU ): returns ∂S/∂γ over all parameters γ
 1: r1 := 0; r2 := 0   % r1: derivatives wrt Dirichlet parameters; r2: derivatives wrt parameters of continuous variables
 2: t := { discrete variables that do not appear in the query }
 3: Let f(z) refer to ∫_s P_{E=e,z}(U) dU
 4: for each assignment w to t do
 5:   for θ_i : Dirichlet parameters associated with w do
 6:     r1(θ_i) += f(w)
 7:   end for
 8:   Θ_w := parameters of continuous variables when discrete variables are instantiated to w
 9:   s := CompDerivatives(N, Θ_w, f(w))   % using Equation 13
10:   r2(Θ_w) += s
11: end for
12: for each Dirichlet parameter of the network, θ_i, do
13:   r1(θ_i) := r1(θ_i) / θ̂_i
14: end for
15: return [r1, r2]

Lines 5-7 and then 12-14 of CompDerivativesHybrid are a brute-force procedure to compute ∂(Σ_z f(z))/∂γ when γ is a Dirichlet parameter associated with a discrete variable. Van Allen et al. (2008) produced a more efficient algorithm for computing ∂(Σ_z f(z))/∂γ when γ is a Dirichlet parameter. For an example, see [Web:Ex. 4].
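The Gaussian-factor derivatives that feed Equation 13 (and the CompDerivatives call on line 9) are standard facts about the normal density; here is a small self-contained sketch of ours for a single factor.

```python
# A sketch of the Gaussian-factor derivatives used in Equation 13, written for
# one factor N(u; theta0 + theta1*c, s2) of P(U).  Standard normal-density facts.
import numpy as np

def dP_dparams(u, c, theta0, theta1, s2):
    """Return dN/dtheta0, dN/dtheta1, dN/dsigma2 for N(u; theta0 + theta1*c, s2)."""
    mu = theta0 + theta1 * c
    N = np.exp(-(u - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    d_theta0 = (u - mu) / s2 * N
    d_theta1 = c * (u - mu) / s2 * N
    d_sigma2 = ((u - mu) ** 2 / (2 * s2 ** 2) - 1 / (2 * s2)) * N
    return d_theta0, d_theta1, d_sigma2

print(dP_dparams(u=1.0, c=0.5, theta0=0.2, theta1=1.0, s2=1.5))
```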
4 Naive Bayes with Discrete Class Variable

A Naive Bayes structure is a simple tree, with a single variable C serving as the only parent of the remaining variables; for notation, we let {F_i} refer to the continuous child variables and {G_j} to the discrete child variables; see Figure 2(a). The discrete variable C can take n different values, according to the Dirichlet distribution θ_C = ⟨θ_1, ..., θ_n⟩ ~ Dir(Э_1, ..., Э_n). The effective sample size is m = Σ_i Э_i, and e_j = Э_j / m is the expected value of the j-th parameter, which corresponds to P(C = j). Each G_i is a discrete child, which takes n_i possible values, according to a Dirichlet distribution. Given the value of its parent, C = j, its parameters are θ_{G_i|j} = ⟨θ_{G_i 1|j}, ..., θ_{G_i n_i|j}⟩ ~ Dir(Э_{G_i 1|j}, ..., Э_{G_i n_i|j}), and its effective sample size (for this parental assignment) is m_{G_i|j} = Σ_k Э_{G_i k|j}. Here e_{ijr} = Э_{G_i r|j} / m_{G_i|j} is the expected value of θ_{G_i r|j}, corresponding to P(G_i = r | C = j). Each F_i is a continuous random variable, distributed as F_i | C = j ~ N(θ_{F_i|j}, σ²_{F_i|j}), where θ_{F_i|j} and σ²_{F_i|j} are jointly distributed

according to a Normal-Inverse-χ² distribution:

  θ_{F_i|j} | σ²_{F_i|j} ~ N( Ж_{F_i|j}, σ²_{F_i|j} / Л_{F_i|j} ),   σ²_{F_i|j} ~ Ч_{F_i|j} Д_{F_i|j} / χ²(Ч_{F_i|j}).

Figure 2: Two examples of Naive Bayes systems (structure + parameters). (a) Discrete parent C with parameters ⟨θ_{+c}, θ_{−c}⟩, discrete children G_i with CPtable rows such as ⟨θ_{+g|+c}, θ_{−g|+c}⟩ and ⟨θ_{+g|−c}, θ_{−g|−c}⟩, and continuous children such as F_1 | ±c ~ N(θ_{F_1|±c}, σ²_{F_1|±c}) and F_3 | ±c ~ N(θ_{F_3|±c}, σ²_{F_3|±c}). (b) Continuous parent C ~ N(θ_C, σ²_C) with continuous children F_1, ..., F_4, e.g., F_1 ~ N(θ_{F_1} + θ_{F_1 C} C, σ²_{F_1}) and F_4 ~ N(θ_{F_4} + θ_{F_4 C} C, σ²_{F_4}).

We want to compute the variance of

  Q(Θ) = P(C = q | F_1 = f_1, ..., F_k = f_k, G_1 = g_1, ..., G_l = g_l, D, Θ) = P(C = q | F_F, G_G, D, Θ),

where F_F ≡ {F_1 = f_1, ..., F_k = f_k}, G_G ≡ {G_1 = g_1, ..., G_l = g_l}, and D is the dataset. We also set p_i := P̂(C = i | F_F, G_G, D, Θ). (Recall P̂ is the probability P(· | Θ̂) computed at the mean value Θ̂ of the parameter vector.)

Theorem 1 (proof in [Web:App. D]). Given the above conditions:

For the root C:
  V_Q(C) = ( p_q² / (m + 1) ) [ Σ_j p_j² / e_j − 2 p_q / e_q + 1 / e_q ].

For each discrete child G_i in the evidence set:
  V_Q(G_i) = p_q² [ (1 − e_{iqg_i}) (1 − p_q)² / ( e_{iqg_i} (1 + m_{G_i|q}) ) + Σ_{j≠q} (1 − e_{ijg_i}) p_j² / ( e_{ijg_i} (1 + m_{G_i|j}) ) ].

For each continuous child F_j in the evidence set:
  V_Q(F_j) = p_q² [ (1 − p_q)² h_jq + Σ_{k≠q} p_k² h_jk ],

where h_ij combines the sensitivity of the Gaussian density N(f_i; θ_{F_i|j}, σ²_{F_i|j}) to its two parameters with the posterior variances of those parameters:

  h_ij = ( (f_i − Ж_{F_i|j}) / σ̂²_{F_i|j} )² V(θ_{F_i|j}) + ( (f_i − Ж_{F_i|j})² / (2 σ̂⁴_{F_i|j}) − 1 / (2 σ̂²_{F_i|j}) )² V(σ²_{F_i|j}),

with σ̂²_{F_i|j} = E[σ²_{F_i|j}] and the variances V(θ_{F_i|j}), V(σ²_{F_i|j}) determined by the Norm/χ²(Ж_{F_i|j}, Л_{F_i|j}, Д_{F_i|j}, Ч_{F_i|j}) hyperparameters.

Example 1. Consider the Bayesian network in Figure 2(a), where C is a binary variable that takes the two values {+c, −c} according to a Dirichlet distribution ⟨θ_{+c}, θ_{−c}⟩ ~ Dir(Э_{+c}, Э_{−c}). G is also binary, drawn according to a Dirichlet distribution conditioned on the value of C (one row for +c and one for −c), and the distributions of the two continuous children F_1 and F_3 are given by F_1 | ±c ~ N(θ_{F_1|±c}, σ²_{F_1|±c}) and F_3 | ±c ~ N(θ_{F_3|±c}, σ²_{F_3|±c}), with each (θ, σ²) pair drawn from its own Norm/χ² distribution. We want to compute the variance of Q = P(+c | F_1 = 2.5, F_3 = 1, +g). (Notice this does not involve the other two child nodes.) Using results from [Web:App. A], we first compute the posterior means E[θ_{F_i|±c}] and E[σ²_{F_i|±c}] (e.g., E[σ²_{F_1|−c}] = 4.8). This yields p_1 = 0.819 and p_2 = 0.181, and so

  V_Q(C) = ( p_1² / (m + 1) ) [ p_1²/e_1 + p_2²/e_2 − 2 p_1/e_1 + 1/e_1 ].

After substitution, we can show that h_11 = 0.063, and similarly obtain h_12, h_31 and h_32, so

  V_Q(F_1) = p_1² [ (1 − p_1)² h_11 + p_2² h_12 ],   V_Q(F_3) = p_1² [ (1 − p_1)² h_31 + p_2² h_32 ],

and V_Q(G) follows from the discrete-child formula. Hence, using Theorem 1,

  V(Q) = V(Q(Θ)) ≈ V_Q(C) + V_Q(F_1) + V_Q(F_3) + V_Q(G).
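For instance, the root term of Theorem 1 (as reconstructed above) is a one-liner; the sketch below is ours, with made-up inputs rather than the example's values.

```python
# A sketch of the root-variable term of Theorem 1 (as reconstructed above):
# V_Q(C) = p_q^2/(m+1) * ( sum_j p_j^2/e_j - 2 p_q/e_q + 1/e_q ).
import numpy as np

def root_variance(p, e, m, q):
    """p: posterior class probabilities p_j; e: prior means e_j; m: effective sample size; q: query class index."""
    p, e = np.asarray(p, float), np.asarray(e, float)
    return p[q] ** 2 / (m + 1) * ((p ** 2 / e).sum() - 2 * p[q] / e[q] + 1 / e[q])

# toy numbers (not the paper's example): Dir(1, 1) root, so e = (0.5, 0.5), m = 2
print(root_variance(p=[0.8, 0.2], e=[0.5, 0.5], m=2, q=0))
```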

5 All Continuous Naive Bayes

Now consider a Naive Bayes structure where all nodes correspond to continuous variables, both the root C and the children F_i. The distribution of the parent is given by

  C ~ N(θ_C, σ²_C),   θ_C | σ²_C ~ N(Ж_C, σ²_C / Л_C),   σ²_C ~ Ч_C Д_C / χ²(Ч_C).

For each child F_i,

  F_i | C ~ N(θ_{F_i} + θ_{F_i C} C, σ²_{F_i}),
  (θ_{F_i}, θ_{F_i C}) | σ²_{F_i} ~ N₂( (Ж_{F_i}, Ж_{F_i C}), σ²_{F_i} Л⁻¹_{F_i} ),   σ²_{F_i} ~ Ч_{F_i} Д_{F_i} / χ²(Ч_{F_i}).

We want to compute the variance of the query

  Q(Θ) = P(c_1 < C < c_2 | F_1 = f_1, ..., F_n = f_n, Θ, D) = P(c_1 < C < c_2 | F_F, Θ, D).

Let P_{F_F}(c) be the probability density function of this conditional distribution of C, and P̂_{F_F}(c) be its value at Θ = Θ̂.

Theorem 2 (proof in [Web:App. F]). Given the above conditions, V_Q(C) is a closed-form expression involving P̂_{F_F}(c_1), P̂_{F_F}(c_2), the endpoints c_1 and c_2, and auxiliary quantities A, B, E, K and R_j that are explicit functions of the evidence values f_j and of the hyperparameters of C and of each F_j; and, for each evidence variable F_i,

  V_Q(F_i) = R_i u_i M_i u_i^T,

where i iterates over the evidence variables, M_j = Cov(θ_{F_j}, θ_{F_j C}, σ²_{F_j}), u_i = (u_{i1}, u_{i2}, u_{i3}), and the scalar R_i and the components u_{ik} are explicit functions of P̂_{F_F}(c_1), P̂_{F_F}(c_2), the evidence value f_i and the hyperparameters, all evaluated at Θ̂. See [Web:Section 5] for an example.

6 Empirical Studies

Given that the parameters for different variables are independent (e.g., Θ_A is independent of Θ_E, etc.), and the distributions for each individual variable are conjugate, the posterior distribution, given a complete datasample, is unambiguous and straightforward to compute; see Section 2.1. This is why we are focusing on the challenge of computing the variance of the response to a specific query, given this posterior distribution.

As noted earlier, our estimation technique for computing V(Q(Θ)) (Equation 7) makes several assumptions, including the assumption that the mean of the response is the response of the mean of the parameters (E[Q(Θ)] ≈ Q(Θ̂)) and that the first-order approximation will work effectively. Following Van Allen et al. (2008), we therefore ran a number of studies to explore whether our approximations are sufficiently close, at least within a factor of 2. In each study, we first identified a particular structure S (for space reasons, we consider only Naive Bayes here; see [Web:Studies]) and a specific query, which here is of the form P(C = c | F_1 = f_1, ..., F_n = f_n). We then considered various settings of the hyperparameters (i.e., the Cyrillic variables). For example, perhaps C's parameters were ⟨θ_{+c}, θ_{−c}⟩ ~ Dir(4, 6), and F_1's parameters were θ_{F_1}, θ_{F_1 C}, σ_{F_1} ~ Norm/χ²((0, 0), I, 1, 5), etc.⁴ For each set of hyperparameters, we could then use Equation 7 to produce an analytic estimate of the variance of the response, V ≈ V(Q(Θ)). We can also obtain a (presumably more accurate) empirical estimate σ̃², as follows. We first draw a number of parameter values from the posterior distribution over the parameters (as encoded by the hyperparameters).⁵ For example, given the hyperparameters shown above, on one draw we might get ⟨θ^(1)_{+c}, θ^(1)_{−c}⟩ = ⟨0.42, 0.58⟩ and ⟨θ^(1)_{F_1}, θ^(1)_{F_1 C}, σ^(1)_{F_1}⟩ = ⟨0.12, 0.04, 0.9⟩; the next draw might yield ⟨θ^(2)_{+c}, θ^(2)_{−c}⟩ = ⟨0.39, 0.61⟩ and ⟨θ^(2)_{F_1}, θ^(2)_{F_1 C}, σ^(2)_{F_1}⟩ = ⟨0.09, 0.03, 1.05⟩. For each particular assignment to the parameters, call it Θ^(i) = ⟨θ^(i)_j⟩_j, we can then compute the associated response to the query, r^(i) = Q(Θ^(i)). After m = 1,000 draws, we obtain m responses, from which we can compute the empirical variance σ̃².

⁴ To simplify the notation, we will deal with σ rather than σ². ⁵ Note each is a sampling of the parameters; n.b., not of the domain variables, i.e., this is not over values for C nor values for F_1.
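The sampling-based estimate just described can be sketched generically as follows; the sketch is ours, and draw_parameters and query_response are hypothetical stand-ins for the posterior sampler and the inference routine, not the paper's code.

```python
# A generic sketch of the empirical (sampling-based) variance estimate:
# draw parameters from their posterior, evaluate the query response for each
# draw, and take the sample variance of the responses.
import numpy as np

def empirical_query_variance(draw_parameters, query_response, m=1000, seed=0):
    """draw_parameters(rng) -> one sampled Theta; query_response(Theta) -> Q(Theta) in [0, 1]."""
    rng = np.random.default_rng(seed)
    responses = np.array([query_response(draw_parameters(rng)) for _ in range(m)])
    return responses.var(ddof=1)

# toy stand-ins: Theta is a single Dirichlet row, and the "query" is its first entry
draw = lambda rng: rng.dirichlet([4.0, 6.0])
resp = lambda theta: theta[0]
print(empirical_query_variance(draw, resp, m=1000))
```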

Figure 3 (all panels plot mean relative difference): (a) relative error vs. training-set size; (b) relative error vs. number of children (all continuous); (c) relative error vs. number of children (both continuous and discrete).

Given the V and σ̃² values computed for each network structure, query, and set of hyperparameters, we can then compute the relative error, |V − σ̃²| / σ̃². To investigate the quality of our approximation, we explore two scaling questions.

(1) How does the relative error scale with training-set size? Here, we considered a Naive Bayes network with a continuous parent and 4 continuous children (like Figure 2(b)). We then initialized the hyperparameters as shown above, computed posterior parameters by training this structure on data sets of size {10, 50, 100, 250, 1,000, 5,000, 10,000}, and computed both V and σ̃² (over m = 1,000 draws) for each of 100 different queries. Figure 3(a) shows that the average (over 100 queries) relative error decreases as we increase the training set size.

(2) How does the relative error vary with the number of children? Here, we consider a discrete parent and r ∈ {1, 2, 4, 8, 16} continuous children, trained on 1,000 instances. Figure 3(b) shows the average over 100 queries; we see that the difference between V and σ̃² grows with the number of children. We also considered both discrete and continuous children: again consider r ∈ {1, 2, 4, 8, 16} children, but now half are discrete and the other half are continuous. (For r = 1, the only child was discrete.) Figure 3(c) shows that, while the relative error again grows with the number of children, this growth is slower here than in the all-continuous case shown above.

In all cases, we see that the estimate is close: always within the desired factor of 2 of the correct answer. Moreover, it is very efficient to compute (as it is just a straight-line computation), much faster than the sampling approach, which involved 1,000 inferences. See [Web:Studies] for more extensive studies and analyses, wrt Naive Bayes and also more complicated structures.

7 Conclusion

Van Allen et al. (2008) earlier motivated the task of computing the variance of the response to a query wrt a given Bayesian network, as this can help us (1) estimate the bias + variance of each given Bayesian network, which can help us select the best discriminative model (Guo & Greiner, 2005), and (2) combine the responses of various independent belief net classifiers by weighting their respective (mean) probabilities by 1/variance (Lee, Greiner, & Wang, 2006). That earlier paper, however, considered only discrete variables. This current paper extends that earlier one by showing how to deal with continuous (Gaussian) variables. We show how to use the Delta method to obtain an approximation for arbitrary networks (insisting only that discrete variables have only discrete parents). We also provide simpler forms that apply to simple Naive Bayes models: one for a discrete root and arbitrary children, and another for a continuous parent and continuous children. We also provide empirical evidence to demonstrate that this approach works effectively.

References

Casella, G., & Berger, R. L. (2002). Statistical Inference.
Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall.
Guo, Y., & Greiner, R. (2005). Discriminative model selection for belief net structures. In AAAI.
Heckerman, D. E. (1998). A tutorial on learning with Bayesian networks. In Learning in Graphical Models.
Koller, D., & Friedman, N. (2007). Graphical Models. To appear.
Lee, C., Greiner, R., & Wang, S. (2006). Using variance estimates to combine Bayesian classifiers. In ICML.
Oehlert, G. W. (1992). A note on the delta method. The American Statistician, 46(1), 27-29.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Van Allen, T., Singh, A., Greiner, R., & Hooper, P. (2008). Quantifying the uncertainty of a belief net response: Bayesian error-bars for belief net inference. Artificial Intelligence, 172, 483-513.
