Introduction to continuous and hybrid Bayesian networks


1 Introduction to continuous and hybrid Bayesian networks Joanna Ficek Supervisor: Paul Fink, M.Sc. Department of Statistics LMU January 16, 2016

2 Outline Introduction Gaussians Hybrid BNs Continuous children with discrete parents Discrete children with continuous parents Exponential family Entropy Relative entropy Projections I-projection M-projections Summary

3 Why? Where?

4 Bayesian networks, [access ]

5 Barbini et al. Bayesian Approach in Medicine and Health Management

6 Wenying Yan et al. Effects of Time Point Measurement on the Reconstruction of Gene Regulatory Networks, [access ]

7 Sebastiani et al., Nature Genetics 37:435, 2005

8 Applications: medical diagnosis, gene expression data, complex genetic models, robot localization, risk management in robotics, credit scoring, ...

9 Continuous nodes 1. Challenges:

10 Continuous nodes 1. Challenges: no representation for all possible densities (unlike CPTs in discrete BNs); inference issues; complex distributions 2. Solutions:

11 Continuous nodes 1. Challenges: no representation for all possible densities (unlike CPTs in discrete BNs); inference issues; complex distributions 2. Solutions: discretization, Linear Models, Gaussian approximation

12 Discretization 1. Idea: continuous domain into a finite set of intervals 2. Methods: Equal Interval Width, Equal Interval Frequency,...

13 Discretization 1. Idea: continuous domain into a finite set of intervals 2. Methods: Equal Interval Width, Equal Interval Frequency, ... Select y ∈ [y_1, y_2]; then P(X ∈ [x_1, x_2] | y) = ∫_{x_1}^{x_2} p(x | y) dx

14 Discretization 1. Idea: continuous domain into a finite set of intervals 2. Methods: Equal Interval Width, Equal Interval Frequency,...

15 Discretization 1. Idea: continuous domain into a finite set of intervals 2. Methods: Equal Interval Width, Equal Interval Frequency, ... 3. Limitations: loss of information; trade-off: accuracy vs. computational cost O(d^c)

16 Discretization 1. Idea: continuous domain into a finite set of intervals 2. Methods: Equal Interval Width, Equal Interval Frequency, ... 3. Limitations: loss of information; trade-off: accuracy vs. computational cost O(d^c) 4. Alternative: Linear Models
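
A minimal sketch of the discretization step above, purely for illustration: a single continuous parent value y is selected and the probability of X falling in each interval is obtained by integrating an assumed conditional density p(x | y). The bin edges, the chosen y, and the linear Gaussian density below are made-up assumptions, not values from the slides.

    import numpy as np
    from scipy import stats

    # Equal-width intervals for X (hypothetical range and bin count)
    x_edges = np.linspace(-20.0, 20.0, 9)      # 8 equal-width intervals
    y_value = 1.5                              # one selected parent value y

    # Assumed conditional density p(x | y): a linear Gaussian N(2 + 3*y, 2^2)
    cond = stats.norm(loc=2.0 + 3.0 * y_value, scale=2.0)

    # P(X in [x_k, x_{k+1}] | y): integral of p(x | y) over each interval,
    # which for a Gaussian reduces to a difference of CDF values
    probs = cond.cdf(x_edges[1:]) - cond.cdf(x_edges[:-1])
    probs /= probs.sum()                       # renormalize mass lost outside the range

    for lo, hi, p in zip(x_edges[:-1], x_edges[1:], probs):
        print(f"P(X in [{lo:5.1f}, {hi:5.1f}] | y={y_value}) = {p:.3f}")

The exponential blow-up mentioned in the limitations appears as soon as X has several discretized parents: a separate row of interval probabilities is needed for every joint assignment of the parents' bins.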

17 Linear Models Broad class of models that satisfy independence of causal influence: the influence of multiple causes can be decomposed into separate influences.

18 Linear Models Broad class of models that satisfy independence of causal influence: the influence of multiple causes can be decomposed into separate influences. The effect of parents Y_1, ..., Y_k on X can be summarized via a linear function f(Y_1, ..., Y_k) = w_0 + Σ_{i=1}^{k} w_i Y_i, where the w_i are coefficients; no interactions between the Y_i (only through f(Y_1, ..., Y_k))

19 Linear Gaussian model

20 Linear Gaussian model Definition Let X be a continuous variable with continuous parents Y_1, ..., Y_k. We say that X has a Linear Gaussian CPD if there are β_0, ..., β_k and σ^2 such that: p(x | y) = N(β_0 + β^T y, σ^2)

21 Linear Gaussian model Definition Let X be a continuous variable with continuous parents Y_1, ..., Y_k. We say that X has a Linear Gaussian CPD if there are β_0, ..., β_k and σ^2 such that: p(x | y) = N(β_0 + β^T y, σ^2), i.e. X = β_0 + β_1 Y_1 + ... + β_k Y_k + ε with ε ~ N(0, σ^2)

22 Linear Gaussian model IQ → exams: p(IQ) = N(100, 15), p(E | IQ) = N(β_0 + β_1 · IQ, 10) = N(75, 10)

23 Linear Gaussian model IQ → exams: p(IQ) = N(100, 15), p(E | IQ) = N(β_0 + β_1 · IQ, 10) = N(75, 10) Definition A Gaussian Bayesian network is a Bayesian network where all the variables are continuous and where all CPDs are linear Gaussians.

24 Gaussian BN joint Gaussian Theorem Let X be a linear Gaussian of its parents Y_1, ..., Y_k: p(x | y) = N(β_0 + β^T y; σ^2). Assume that Y_1, ..., Y_k are jointly Gaussian with distribution N(μ; Σ). Then: the distribution of X is a normal distribution p(X) = N(μ_X; σ_X^2) where μ_X = β_0 + β^T μ and σ_X^2 = σ^2 + β^T Σ β; the joint distribution over {Y, X} is a normal distribution, where Cov[Y_i, X] = Σ_{j=1}^{k} β_j Σ_{i,j}

25 Gaussian BN joint Gaussian Theorem Let X be a linear Gaussian of its parents Y_1, ..., Y_k: p(x | y) = N(β_0 + β^T y; σ^2). Assume that Y_1, ..., Y_k are jointly Gaussian with distribution N(μ; Σ). Then: the distribution of X is a normal distribution p(X) = N(μ_X; σ_X^2) where μ_X = β_0 + β^T μ and σ_X^2 = σ^2 + β^T Σ β; the joint distribution over {Y, X} is a normal distribution, where Cov[Y_i, X] = Σ_{j=1}^{k} β_j Σ_{i,j} A Gaussian Bayesian network defines a joint Gaussian distribution.
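
The theorem above can be checked numerically. The sketch below uses made-up parameters (β, μ, Σ and σ^2 are illustrative, not from the slides), computes the closed-form moments, and compares them with a Monte Carlo estimate.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters: two jointly Gaussian parents Y ~ N(mu, Sigma)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    beta0 = 0.5
    beta = np.array([1.5, -1.0])
    sigma2 = 0.3                                  # conditional variance of X given Y

    # Closed-form moments from the theorem
    mu_X = beta0 + beta @ mu
    var_X = sigma2 + beta @ Sigma @ beta
    cov_YX = Sigma @ beta                         # Cov[Y_i, X] = sum_j beta_j * Sigma_ij

    # Monte Carlo check
    Y = rng.multivariate_normal(mu, Sigma, size=200_000)
    X = beta0 + Y @ beta + rng.normal(0.0, np.sqrt(sigma2), size=len(Y))

    print("mu_X :", mu_X, " empirical:", X.mean())
    print("var_X:", var_X, " empirical:", X.var())
    print("cov  :", cov_YX, " empirical:",
          np.array([np.cov(Y[:, i], X)[0, 1] for i in range(2)]))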

26 Gaussian BN joint Gaussian Theorem Let X = {X_1, ..., X_n} and let P be a joint Gaussian distribution over X. Given any ordering X_1, ..., X_n over X, we can construct a Bayesian network graph G and a Bayesian network B such that: 1. Pa_i^G ⊆ {X_1, ..., X_{i-1}}; 2. the CPD of X_i in B is a linear Gaussian of its parents; 3. G is a minimal I-map for P.

27 Probability of being accepted to a German university (MA programme) for international students from other EU countries. (Node: accepted)

28 Probability of being accepted to a German university (MA programme) for international students from other EU countries. (Network nodes: country of birth, IQ, study, learned German, exams, accepted)

29 Hybrid BNs continuous children with discrete (and continuous) parents (nodes: IQ, study, exams)

30 Hybrid BNs continuous children with discrete (and continuous) parents (nodes: IQ, study, exams); discrete children with continuous (and discrete) parents (nodes: learned German, exams, accepted)

31 Continuous children with discrete parents

32 Continuous children with discrete parents Definition Let X be a continuous variable with discrete parents A = A_1, ..., A_m and continuous parents Y = Y_1, ..., Y_k. We define the Conditional Linear Gaussian (CLG) model as: p(x | a, y) = N(w_{a,0} + Σ_{i=1}^{k} w_{a,i} y_i ; σ_a^2), where the w are coefficients.

33 Continuous children with discrete parents Definition Let X be a continuous variable with discrete parents A = A_1, ..., A_m and continuous parents Y = Y_1, ..., Y_k. We define the Conditional Linear Gaussian (CLG) model as: p(x | a, y) = N(w_{a,0} + Σ_{i=1}^{k} w_{a,i} y_i ; σ_a^2), where the w are coefficients. Separate linear Gaussian model for each assignment to the discrete parents

34 Continuous children with discrete parents Definition Let X be a continuous variable with discrete parents A = A_1, ..., A_m and continuous parents Y = Y_1, ..., Y_k. We define the Conditional Linear Gaussian (CLG) model as: p(x | a, y) = N(w_{a,0} + Σ_{i=1}^{k} w_{a,i} y_i ; σ_a^2), where the w are coefficients. Separate linear Gaussian model for each assignment to the discrete parents; defines a conditional Gaussian joint distribution [Lerner et al., 2001]
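
A minimal sketch of the CLG CPD defined above: one linear Gaussian per assignment of the discrete parent, each with its own weights and variance. The binary parent A, the two continuous parents, and all parameter values are made-up assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative CLG CPD p(x | a, y) with one binary discrete parent A and two
    # continuous parents Y: each discrete assignment has its own weights and variance.
    clg = {
        "a0": {"w0": 1.0, "w": np.array([0.5, -0.2]), "var": 2.0},
        "a1": {"w0": 4.0, "w": np.array([1.0,  0.3]), "var": 0.5},
    }

    def sample_x(a, y):
        """Draw X ~ N(w_{a,0} + sum_i w_{a,i} y_i, sigma_a^2)."""
        p = clg[a]
        mean = p["w0"] + p["w"] @ y
        return rng.normal(mean, np.sqrt(p["var"]))

    y = np.array([2.0, -1.0])
    print("sample given a0:", sample_x("a0", y))
    print("sample given a1:", sample_x("a1", y))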

35 Continuous children with discrete parents (nodes: IQ, study, exams)

36 Continuous children with discrete parents (nodes: IQ, study, exams): p(IQ) = N(100, 15), p(E | IQ, S = s1) = N(w_{s1,0} + w_{s1,1} · IQ, 10) = N(85, 10)

37 Continuous children with discrete parents (nodes: IQ, study, exams): p(IQ) = N(100, 15), p(E | IQ, S = s1) = N(w_{s1,0} + w_{s1,1} · IQ, 10) = N(85, 10), p(E | IQ, S = s0) = N(w_{s0,0} + w_{s0,1} · IQ, 12) = N(65, 12)

38 Discrete children with continuous parents

39 Discrete children with continuous parents Threshold (τ) which determines the change in discrete values

40 Discrete children with continuous parents Threshold (τ) which determines the change in discrete values X binary {x0, x1} with parents Y_1, ..., Y_k: f(Y_1, ..., Y_k) ≥ τ ⇒ P(X = x1) likely to be 1; f(Y_1, ..., Y_k) < τ ⇒ P(X = x1) likely to be 0

41 Discrete children with continuous parents Threshold (τ) which determines the change in discrete values X binary {x0, x1} with parents Y_1, ..., Y_k: f(Y_1, ..., Y_k) ≥ τ ⇒ P(X = x1) likely to be 1; f(Y_1, ..., Y_k) < τ ⇒ P(X = x1) likely to be 0, with f(Y_1, ..., Y_k) = w_0 + Σ_{i=1}^{k} w_i Y_i

42 Discrete children with continuous parents Threshold (τ) which determines the change in discrete values X binary {x0, x1} with parents Y_1, ..., Y_k: f(Y_1, ..., Y_k) ≥ τ ⇒ P(X = x1) likely to be 1; f(Y_1, ..., Y_k) < τ ⇒ P(X = x1) likely to be 0, with f(Y_1, ..., Y_k) = w_0 + Σ_{i=1}^{k} w_i Y_i Definition The CPD P(X | Y_1, ..., Y_k) is a logistic CPD if there are k + 1 weights w_0, w_1, ..., w_k such that: P(x1 | Y_1, ..., Y_k) = sigmoid(w_0 + Σ_{i=1}^{k} w_i Y_i)

43 Discrete children with continuous parents Threshold which determines the change in discrete values Augmented CLGs [Lerner et al. (2001)] Definition Let A be a discrete variable with possible values a_1, ..., a_m and let Y = Y_1, ..., Y_k denote its continuous parents. We define the CPD in the augmented Conditional Linear Gaussian model as: p(A = a_i | y_1, ..., y_k) = exp(w_{i,0} + Σ_{l=1}^{k} w_{i,l} y_l) / Σ_{j=1}^{m} exp(w_{j,0} + Σ_{s=1}^{k} w_{j,s} y_s)
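
A minimal sketch of the logistic CPD and of the softmax form used by the augmented CLG above; the weight values are made-up assumptions.

    import numpy as np

    def logistic_cpd(y, w0, w):
        """P(X = x1 | y) = sigmoid(w0 + sum_i w_i y_i) for a binary child."""
        z = w0 + np.dot(w, y)
        return 1.0 / (1.0 + np.exp(-z))

    def softmax_cpd(y, W0, W):
        """p(A = a_i | y) = exp(w_{i,0} + w_i . y) / sum_j exp(w_{j,0} + w_j . y)."""
        z = W0 + W @ y                     # one score per discrete value a_1, ..., a_m
        z -= z.max()                       # numerical stabilization
        e = np.exp(z)
        return e / e.sum()

    y = np.array([0.5, -1.0])
    print(logistic_cpd(y, w0=0.2, w=np.array([1.0, 2.0])))
    print(softmax_cpd(y, W0=np.array([0.0, 0.5, -0.5]),
                         W=np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])))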

44 Discrete children with continuous parents Definition A Bayesian network in which all discrete variables have only discrete parents and all continuous variables have a CLG CPD is a Conditional Linear Gaussian network. (Nodes: learned German, exams, accepted)

45

46 Exponential family An exponential family is specified by: a sufficient statistics function τ; a parameter space that is a convex set Θ of legal parameters; a natural parameter function t; an auxiliary measure A over X

47 Exponential family An exponential family is specified by: a sufficient statistics function τ; a parameter space that is a convex set Θ of legal parameters; a natural parameter function t; an auxiliary measure A over X. Definition An exponential family P = {P_θ : θ ∈ Θ} over a set of variables X is defined by P_θ(ξ) = (1/Z(θ)) A(ξ) exp{⟨t(θ), τ(ξ)⟩}, with partition function Z(θ) = Σ_ξ A(ξ) exp{⟨t(θ), τ(ξ)⟩}

48 Univariate normal distribution τ(x) = ⟨x, x^2⟩ (1), t(μ, σ^2) = ⟨μ/σ^2, −1/(2σ^2)⟩ (2), Z(μ, σ^2) = √(2π) σ exp{μ^2/(2σ^2)} (3). Then P(x) = (1/Z(μ, σ^2)) exp{⟨t(θ), τ(x)⟩} = (1/(√(2π) σ)) exp{−(x − μ)^2/(2σ^2)}
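
A quick numerical check of the exponential-family form above, with arbitrary values μ = 1 and σ^2 = 4: the density built from ⟨t(θ), τ(x)⟩ and Z(μ, σ^2) should match the usual normal pdf.

    import numpy as np
    from scipy import stats

    mu, sigma2 = 1.0, 4.0
    sigma = np.sqrt(sigma2)

    def tau(x):
        """Sufficient statistics tau(x) = (x, x^2)."""
        return np.array([x, x * x])

    t = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])                 # natural parameters
    Z = np.sqrt(2.0 * np.pi) * sigma * np.exp(mu**2 / (2.0 * sigma2))  # partition function

    for x in (-2.0, 0.0, 1.0, 3.5):
        p_expfam = np.exp(t @ tau(x)) / Z
        p_scipy = stats.norm(mu, sigma).pdf(x)
        print(f"x = {x:4.1f}   exp-family: {p_expfam:.6f}   scipy: {p_scipy:.6f}")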

49 Product distribution Definition A factor in an exponential family: φ_θ(ξ) = A(ξ) exp{⟨t(θ), τ(ξ)⟩} Definition Let Φ_1, ..., Φ_k be exponential factor families. The composition of Φ_1, ..., Φ_k is the family Φ_1 × Φ_2 × ... × Φ_k parametrized by θ = θ_1 ∘ θ_2 ∘ ... ∘ θ_k ∈ Θ_1 × Θ_2 × ... × Θ_k: P_θ(ξ) ∝ Π_i φ_{θ_i}(ξ) = (Π_i A_i(ξ)) exp{Σ_i ⟨t_i(θ_i), τ_i(ξ)⟩}

50 Product distribution Definition A factor in an exponential family: φ_θ(ξ) = A(ξ) exp{⟨t(θ), τ(ξ)⟩} Definition Let Φ_1, ..., Φ_k be exponential factor families. The composition of Φ_1, ..., Φ_k is the family Φ_1 × Φ_2 × ... × Φ_k parametrized by θ = θ_1 ∘ θ_2 ∘ ... ∘ θ_k ∈ Θ_1 × Θ_2 × ... × Θ_k: P_θ(ξ) ∝ Π_i φ_{θ_i}(ξ) = (Π_i A_i(ξ)) exp{Σ_i ⟨t_i(θ_i), τ_i(ξ)⟩} A Bayesian network with locally normalized exponential CPDs defines an exponential family.

51 Product distribution Definition A factor in an exponential family: φ_θ(ξ) = A(ξ) exp{⟨t(θ), τ(ξ)⟩} Definition Let Φ_1, ..., Φ_k be exponential factor families. The composition of Φ_1, ..., Φ_k is the family Φ_1 × Φ_2 × ... × Φ_k parametrized by θ = θ_1 ∘ θ_2 ∘ ... ∘ θ_k ∈ Θ_1 × Θ_2 × ... × Θ_k: P_θ(ξ) ∝ Π_i φ_{θ_i}(ξ) = (Π_i A_i(ξ)) exp{Σ_i ⟨t_i(θ_i), τ_i(ξ)⟩} A Bayesian network with locally normalized exponential CPDs defines an exponential family.

52 Entropy Measure of the degree of disorder in a system (thermodynamics, R. Clausius, 1865) Shannon's Measure of Uncertainty - statistical entropy: Definition Let P(X) be a distribution over a random variable X. The entropy of X is H_P(X) = E_P[−log P(X)]

53 Entropy Measure of the degree of disorder in a system (thermodynamics, R. Clausius, 1865) Shannon's Measure of Uncertainty - statistical entropy: Definition Let P(X) be a distribution over a random variable X. The entropy of X is H_P(X) = E_P[−log P(X)] small entropy ⇒ probability mass on a few instances; large entropy ⇒ probability mass uniformly spread

54 Entropy Theorem Let P_θ be a distribution in an exponential family defined by the functions τ and t. Then H_{P_θ}(X) = ln Z(θ) − ⟨E_{P_θ}[τ(X)], t(θ)⟩

55 Entropy Theorem Let P_θ be a distribution in an exponential family defined by the functions τ and t. Then H_{P_θ}(X) = ln Z(θ) − ⟨E_{P_θ}[τ(X)], t(θ)⟩ Theorem Let P(X) = Π_i P(X_i | Pa_i^G) be a distribution consistent with a Bayesian network G. Then H_P(X) = Σ_i H_P(X_i | Pa_i^G) = Σ_i Σ_{pa_i^G} P(pa_i^G) H_P(X_i | pa_i^G)
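
The first theorem above can be checked for the univariate normal written in exponential-family form earlier: with E[τ(X)] = (μ, μ^2 + σ^2), the expression ln Z(θ) − ⟨E[τ(X)], t(θ)⟩ should reproduce the familiar Gaussian entropy ½ ln(2πeσ^2). The parameter values below are arbitrary.

    import numpy as np

    mu, sigma2 = 1.0, 4.0

    # Natural parameters, log-partition function and expected sufficient statistics
    t = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])
    lnZ = 0.5 * np.log(2.0 * np.pi * sigma2) + mu**2 / (2.0 * sigma2)
    E_tau = np.array([mu, mu**2 + sigma2])        # E[X], E[X^2]

    H_expfam = lnZ - E_tau @ t                    # H = ln Z(theta) - <E[tau(X)], t(theta)>
    H_closed = 0.5 * np.log(2.0 * np.pi * np.e * sigma2)

    print(H_expfam, H_closed)                     # both ~ 2.112 nats for sigma2 = 4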

56 Complex distribution approximation Relative entropy Measure of inaccuracy: relative entropy (Kullback-Leibler distance) Definition Let Q and P be two distributions over random variables X_1, ..., X_n. The relative entropy of Q and P is: D(Q(X_1, ..., X_n) ‖ P(X_1, ..., X_n)) = E_Q[log (Q(X_1, ..., X_n) / P(X_1, ..., X_n))], where we set log 0 = 0.

57 Complex distribution approximation Relative entropy Measure of inaccuracy: relative entropy (Kullback-Leibler distance) Definition Let Q and P be two distributions over random variables X_1, ..., X_n. The relative entropy of Q and P is: D(Q(X_1, ..., X_n) ‖ P(X_1, ..., X_n)) = E_Q[log (Q(X_1, ..., X_n) / P(X_1, ..., X_n))], where we set log 0 = 0. For all P, Q: D(Q ‖ P) ≥ 0; D(Q ‖ P) small ⇒ P close to Q ⇒ small loss of information; for P ≠ Q, in general D(Q ‖ P) ≠ D(P ‖ Q)

58 Relative entropy Theorem Consider a distribution Q and a distribution P_θ in an exponential family defined by τ and t. Then D(Q ‖ P_θ) = −H_Q(X) − ⟨E_Q[τ(X)], t(θ)⟩ + ln Z(θ)

59 Relative entropy Theorem Consider a distribution Q and a distribution P_θ in an exponential family defined by τ and t. Then D(Q ‖ P_θ) = −H_Q(X) − ⟨E_Q[τ(X)], t(θ)⟩ + ln Z(θ) Theorem If P and Q are distributions over X consistent with a Bayesian network G, then D(Q ‖ P) = Σ_i Σ_{pa_i^G} Q(pa_i^G) D(Q(X_i | pa_i^G) ‖ P(X_i | pa_i^G))
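
Similarly, the relative-entropy theorem for exponential families can be checked with two univariate Gaussians, Q = N(0, 1) and P_θ = N(1, 4) (arbitrary choices); the result should agree with the closed-form KL divergence between Gaussians.

    import numpy as np

    muQ, s2Q = 0.0, 1.0            # Q = N(0, 1)
    mu,  s2  = 1.0, 4.0            # P_theta = N(1, 4)

    # Ingredients of D(Q || P_theta) = -H_Q(X) - <E_Q[tau(X)], t(theta)> + ln Z(theta)
    H_Q   = 0.5 * np.log(2.0 * np.pi * np.e * s2Q)
    E_tau = np.array([muQ, muQ**2 + s2Q])            # E_Q[X], E_Q[X^2]
    t     = np.array([mu / s2, -1.0 / (2.0 * s2)])
    lnZ   = 0.5 * np.log(2.0 * np.pi * s2) + mu**2 / (2.0 * s2)

    D_expfam = -H_Q - E_tau @ t + lnZ

    # Closed-form KL between univariate Gaussians, for comparison
    D_closed = np.log(np.sqrt(s2 / s2Q)) + (s2Q + (muQ - mu)**2) / (2.0 * s2) - 0.5

    print(D_expfam, D_closed)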

60 Projections Project a distribution P onto a family of distributions Q. I-projection: Q^I = argmin_{Q ∈ Q} D(Q ‖ P). M-projection: Q^M = argmin_{Q ∈ Q} D(P ‖ Q). In general: Q^I ≠ Q^M

61 I-projections Q^I = argmin_{Q ∈ Q} D(Q ‖ P) = argmin_{Q ∈ Q} (−H_Q(X) + E_Q[−ln P(X)]); high density where P is large, low density where P is small; penalty for low entropy; if the graphical model is complex, some simplification of the computation is still possible

62 M-projections Q^M = argmin_{Q ∈ Q} D(P ‖ Q) = argmin_{Q ∈ Q} (−H_P(X) + E_P[−ln Q(X)]); corresponds to a learning problem; attempts to match the main mass of P: high density in the regions probable according to P, high penalty for low density in these regions, relatively high variance; use of the exponential form of the distribution may simplify the computation
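
A small numerical illustration of the difference between the two projections, under made-up assumptions: P is a bimodal mixture of two Gaussians, the family Q is that of single Gaussians, and a brute-force grid search over (μ, σ) on a discretized support minimizes D(Q ‖ P) and D(P ‖ Q). The I-projection locks onto one mode while the M-projection spreads over both, so Q^I ≠ Q^M as noted on slide 60.

    import numpy as np
    from scipy import stats

    x = np.linspace(-10.0, 10.0, 2001)
    dx = x[1] - x[0]

    # Target P: a bimodal mixture of Gaussians (arbitrary illustration)
    P = 0.5 * stats.norm(-3, 1).pdf(x) + 0.5 * stats.norm(3, 1).pdf(x)
    P /= P.sum() * dx

    def kl(p, q):
        """Discretized KL divergence D(p || q) on the grid."""
        q = np.maximum(q, 1e-300)                 # avoid division by zero in the tails
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

    best_I = (np.inf, None)                       # argmin_Q D(Q || P)  (I-projection)
    best_M = (np.inf, None)                       # argmin_Q D(P || Q)  (M-projection)
    for m in np.linspace(-5.0, 5.0, 101):
        for s in np.linspace(0.3, 5.0, 48):
            Q = stats.norm(m, s).pdf(x)
            Q /= Q.sum() * dx
            dI, dM = kl(Q, P), kl(P, Q)
            if dI < best_I[0]:
                best_I = (dI, (m, s))
            if dM < best_M[0]:
                best_M = (dM, (m, s))

    print("Q_I (I-projection):", best_I[1])       # concentrates on one mode
    print("Q_M (M-projection):", best_M[1])       # roughly mean 0, sd ~ sqrt(10)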

63 M-projections Let P be a distribution over X. Theorem Let Q be an exponential family defined by τ and t. Then Q^M = Q_θ if there is a set of parameters θ such that E_{Q_θ}[τ(X)] = E_P[τ(X)]

64 M-projections Let P be a distribution over X. Theorem Let Q be an exponential family defined by τ and t. Then Q^M = Q_θ if there is a set of parameters θ such that E_{Q_θ}[τ(X)] = E_P[τ(X)] Theorem Let G be a Bayesian network structure. Then Q^M(X) = Π_i P(X_i | Pa_i^G)
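
When Q is an exponential family, the moment-matching theorem above gives the M-projection in closed form. For instance, projecting a two-component Gaussian mixture onto a single Gaussian only requires matching E[X] and E[X^2]; the mixture below is an arbitrary example.

    import numpy as np

    # Arbitrary two-component Gaussian mixture P
    weights = np.array([0.3, 0.7])
    means   = np.array([-2.0, 4.0])
    vars_   = np.array([1.0, 2.0])

    # E_P[tau(X)] for tau(x) = (x, x^2)
    E_x  = np.sum(weights * means)
    E_x2 = np.sum(weights * (vars_ + means**2))

    # Moment matching: the Gaussian Q_theta with E_Q[tau] = E_P[tau]
    mu_M  = E_x
    var_M = E_x2 - E_x**2

    print("M-projection onto a single Gaussian: N(%.3f, %.3f)" % (mu_M, var_M))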

65 Summary 1. Continuous and hybrid BNs have many applications 2. Challenges: representation, inference 3. Solutions: discretization, Linear Models, approximation 4. Exponential family - useful form 5. True distribution unknown or complex: entropy, relative entropy 6. Projections - find an approximation

66 Bibliography 1. Koller D. & Friedman N., 2009, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, Massachusetts, USA 2. Koller D., 2013, on-line course Probabilistic Graphical Models 3. Lerner et al., 2001, Exact Inference in Networks with Discrete Children of Continuous Parents, UAI 2001 4. Pourret et al., 2008, Bayesian Networks: A Practical Guide to Applications, John Wiley & Sons Ltd, online: Bruce-Marcot-Bay.pdf

67 Bibliography 1. Friedman et al., 2000, Using Bayesian Networks to Analyze Expression Data, Journal of Computational Biology, Volume 7, No. 3/4 2. Lauritzen S. & Sheehan N., 2003, Graphical Models for Genetic Analyses, Statistical Science, Vol. 18, No. 4

68 Thank you for your attention!
