Semantic Foundations for Probabilistic Programming
Chris Heunen (with Ohad Kammar, Sam Staton, Frank Wood, Hongseok Yang)
Semantic foundations

Programs s1, s2, and their composition s1;s2 are interpreted as mathematical objects.

Operational semantics: remember implementation details (efficiency).
Denotational semantics: see what a program does conceptually (correctness).

Motivation:
- Ground the programmer's unspoken intuitions
- Justify, refute, or suggest program transformations
- Understand programming through mathematics, and probability through program equations
Probabilistic programming

Bayes' law:  P(A|B) = P(B|A) P(A) / P(B)
             posterior ∝ likelihood × prior

idealized Anglican = functional programming + sample + observe + normalize
Overview

Interpret types as measurable spaces, e.g. ⟦real⟧ = R.
Interpret open terms as kernels.
Interpret closed terms as measures.
Inference normalizes measures: posterior ∝ likelihood × prior.

[Kozen, Semantics of probabilistic programs, J Comp Syst Sci, 1981]

But:
- Commutativity? Fubini is not true for all kernels.
- Higher order functions? R → R is not a measurable space.
  [Aumann, Borel structures for function spaces, Ill J Math, 1961]
- Extensionality? Recursion?
Example

1. Toss a fair coin to get outcome x
2. Set up exponential decay with rate r depending on x
3. Observe immediate decay
4. What is the outcome x?

    let x = sample(bern(0.5)) in
    let r = if x then 2.0 else 1.0 in
    observe(0.0 from exp(r));
    return x

Two traces:
    x = true:  r = 2.0, score 2, return true
    x = false: r = 1.0, score 1, return false

posterior ∝ likelihood × prior:
    2 × 0.5 : true
    1 × 0.5 : false

The unnormalized weights are P(true) = 1, P(false) = 0.5; dividing by the model evidence (the total score) 1.5 gives P(true) = 66%, P(false) = 33%.

Programs may also sample continuous distributions, so we have to deal with an uncountable number of traces:

    let y = sample(gauss(7,2))
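The two-trace posterior can be checked by exhaustive enumeration. A minimal Python sketch, assuming the score of observe(0.0 from exp(r)) is the density of the exponential distribution at 0.0 (the helper names are ours):

    from math import exp

    def exp_pdf(rate, x):
        # density of the exponential distribution with the given rate
        return rate * exp(-rate * x) if x >= 0 else 0.0

    # Enumerate both traces of the coin/decay program.
    weights = {}
    for x in (True, False):
        prior = 0.5                      # sample(bern(0.5))
        r = 2.0 if x else 1.0            # let r = if x then 2.0 else 1.0
        likelihood = exp_pdf(r, 0.0)     # observe(0.0 from exp(r))
        weights[x] = prior * likelihood  # unnormalized posterior weight

    evidence = sum(weights.values())
    posterior = {x: w / evidence for x, w in weights.items()}
    print(evidence, posterior)  # 1.5 {True: 0.666..., False: 0.333...}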
Measure theory

It is impossible to sample exactly 0.5 from the standard normal distribution, but one samples in the interval (0, 1) with probability around 0.34.

A measurable space is a set X with a family Σ_X of subsets that is closed under countable unions and complements.
A (probability) measure on X is a function p: Σ_X → [0, ∞] that satisfies p(⋃ U_n) = Σ p(U_n) for disjoint U_n (and has p(X) = 1).
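The 0.34 figure is just the standard normal mass of (0, 1), which can be checked with the error function (a quick sketch):

    import math

    def std_normal_cdf(x):
        # CDF of the standard normal via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    # P(0 < X < 1) for X ~ gauss(0, 1)
    print(std_normal_cdf(1.0) - std_normal_cdf(0.0))  # ~0.3413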
First order language

Types:  A, B ::= R | P(A) | 1 | A × B | Σ_{i∈I} A_i
    R: real numbers; P(A): distributions over A; 1 and A × B: finite products; Σ_{i∈I} A_i: countable sums
    bool := 1 + 1,  nat := Σ_{i∈N} 1

Deterministic terms may not sample:
- variables x, y, z
- constructors for sums and products: case, in_i, if, false, true
- measurable functions: bern, exp, gauss, dirac, ...

    ⊢d 42.0 : R
    ⊢d gauss(2.0, 7.0) : P(R)
    x: R, y: R ⊢d x + y : R
    x: R, y: R ⊢d x < y : bool

Probabilistic terms may sample:
- sequencing: return, let
- constraints: score
- priors: sample

    Γ ⊢d t : A                         ⟹  Γ ⊢p return(t) : A
    Γ ⊢d t : R                         ⟹  Γ ⊢p score(t) : 1
    Γ ⊢p t : A  and  Γ, x: A ⊢p u : B  ⟹  Γ ⊢p let x = t in u : B
    Γ ⊢d t : P(A)                      ⟹  Γ ⊢p sample(t) : A

Inference is performed by norm (below).
First order semantics

Interpret:
- a type A as a measurable space ⟦A⟧
- a deterministic term Γ ⊢d t : A as a measurable function ⟦t⟧: ⟦Γ⟧ → ⟦A⟧
- a probabilistic term Γ ⊢p t : A as a kernel ⟦t⟧: ⟦Γ⟧ × Σ_⟦A⟧ → [0, ∞]
  (fixing the first argument gives a measure; fixing the second, a measurable function)

    ⟦score(t)⟧(γ, ⋆) = ⟦t⟧(γ)
    ⟦sample(t)⟧(γ, U) = (⟦t⟧(γ))(U)
    ⟦let x = t in u⟧(γ, U) = ∫_⟦A⟧ ⟦u⟧(γ, x, U) ⟦t⟧(γ, dx)    (in short: ⟦let x = t in u⟧ = ∫_A u dt)
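For intuition, here is a toy model of this semantics for the discrete fragment only, representing a kernel as a Python function from an environment to a dictionary of unnormalized weights (the combinator names and encoding are ours, not the talk's construction):

    def ret(t):
        # return(t): point mass at the value of t
        return lambda env: {t(env): 1.0}

    def score(t):
        # score(t): the one-point space (), weighted by t(env)
        return lambda env: {(): t(env)}

    def sample(t):
        # sample(t): t(env) is itself a distribution, given as a weight dict
        return lambda env: dict(t(env))

    def let(t, u):
        # let x = t in u: integrate u against t (here, a weighted sum)
        def kernel(env):
            out = {}
            for x, w in t(env).items():
                for y, v in u(env, x).items():
                    out[y] = out.get(y, 0.0) + w * v
            return out
        return kernel

    # The coin/decay example, with the observe desugared to a score:
    coin = sample(lambda env: {True: 0.5, False: 0.5})
    def body(env, x):
        weight = score(lambda _: 2.0 if x else 1.0)
        return let(weight, lambda e, _unit: ret(lambda _: x)(e))(env)

    print(let(coin, body)(None))  # {True: 1.0, False: 0.5}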
Example

    let x = sample(bern(0.5)) in
    let r = if x then 2.0 else 1.0 in
    observe(0.0 from exp(r));
    return x

The meaning of a program returning values in X is a measure on X:
    ∅ has measure 0.0
    {true} has measure 1.0
    {false} has measure 0.5
    {true, false} has measure 1.5
Normalization: posterior ∝ likelihood × prior

    Γ ⊢p t : A  ⟹  Γ ⊢d norm(t) : (R × P(A)) + errors
                       (model evidence, normalized posterior)

The interpretation of a probabilistic term is a kernel ⟦Γ⟧ × Σ_⟦A⟧ → [0, ∞], so fixing the first argument gives a measure ⟦t⟧(γ, −):
- ⟦t⟧(γ, −) / ⟦t⟧(γ, ⟦A⟧) is the normalized probability measure
- the normalizing constant ⟦t⟧(γ, ⟦A⟧) is the model evidence
- errors occur when the constant is 0 or ∞

For the running example (whose unnormalized measure is true: 1.0, false: 0.5):

    norm( let x = sample(bern(0.5)) in
          let r = if x then 2.0 else 1.0 in
          observe(0.0 from exp(r));
          return x )
    = (1.5, bern(0.66))
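In symbols, for a closed probabilistic term t denoting the measure μ = ⟦t⟧(⋆, −) on ⟦A⟧, the definition on the slides reads (our transcription):

    \[
    \mathrm{norm}(t) =
    \begin{cases}
      \bigl(\mu(A),\; \mu(-)/\mu(A)\bigr) & \text{if } 0 < \mu(A) < \infty,\\
      \text{error} & \text{if } \mu(A) = 0 \text{ or } \mu(A) = \infty.
    \end{cases}
    \]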
Example: sequential Monte Carlo

    norm( let x = t in u )
    = norm( let (e, d) = norm(t) in
            score(e);
            let x = sample(d) in u )
Example: importance sampling

    sample(exp(2))
    = let x = sample(gauss(0,1)) in
      score(exp-pdf(2,x) / gauss-pdf(0,1,x));
      return x
    = let x = sample(gauss(0,1)) in
      score(1 / gauss-pdf(0,1,x));
      score(exp-pdf(2,x));
      return x

Applying norm preserves the first equation:

    norm( sample(exp(2)) )
    = norm( let x = sample(gauss(0,1)) in
            score(exp-pdf(2,x) / gauss-pdf(0,1,x));
            return x )

but normalizing between the two scores of the third program,

    norm( norm( let x = sample(gauss(0,1)) in
                score(1 / gauss-pdf(0,1,x));
                score(exp-pdf(2,x)) );
          return x )

does not give the same result. Don't normalize as you go.
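A Monte Carlo reading of the first equation: sampling from gauss(0,1) and weighting by the density ratio exp-pdf(2,x) / gauss-pdf(0,1,x) reproduces expectations under exp(2). A self-contained sketch (the estimator and helper names are ours; the gaussian proposal has lighter tails than exp(2), so the estimate is noisy):

    import math
    import random

    def exp_pdf(rate, x):
        return rate * math.exp(-rate * x) if x >= 0 else 0.0

    def gauss_pdf(mu, sigma, x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    random.seed(0)
    n = 200_000
    total_w = total_wx = 0.0
    for _ in range(n):
        x = random.gauss(0.0, 1.0)                    # proposal: gauss(0,1)
        w = exp_pdf(2.0, x) / gauss_pdf(0.0, 1.0, x)  # importance weight
        total_w += w
        total_wx += w * x

    # Self-normalized estimate of E[x] under exp(2); the exact value is 0.5.
    print(total_wx / total_w)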
Commutativity

Reordering lines is a very useful program transformation:

    let x = t in          let y = u in
    let y = u in     =    let x = t in
    v                     v

It amounts to Fubini's theorem:

    ∫_A ∫_B v du dt = ∫_B ∫_A v dt du

This is not true for arbitrary kernels, only for s-finite ones:
- a kernel k: ⟦Γ⟧ × Σ_⟦A⟧ → [0, ∞] is bounded if ∃n ∀γ ∀U: k(γ, U) < n
- a kernel is s-finite when it is a countable sum of bounded ones
- a kernel k is s-finite iff it can be built from sub-probability distributions, score, and binding, where k >>= l is (γ, V) ↦ ∫ l(γ, x, V) k(γ, dx)

Measurable spaces and s-finite kernels form a distributive symmetric monoidal category.
Interpret terms as s-finite kernels.
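In the discrete fragment, commutativity is just reordering a finite sum, which can be checked by brute force; the subtlety the talk addresses is the continuous case, where s-finiteness is needed. A small enumeration sketch (the program and names are ours):

    # let x = bern(0.5) in let y = bern(0.3) in score(f(x,y)); return (x,y)
    # versus the same program with the two sampling lines swapped.
    def run(x_first):
        t = {True: 0.5, False: 0.5}               # t = bern(0.5)
        u = {True: 0.3, False: 0.7}               # u = bern(0.3)
        f = lambda x, y: 2.0 if x and y else 1.0  # score depending on both
        out = {}
        outer, inner = (t, u) if x_first else (u, t)
        for a, wa in outer.items():
            for b, wb in inner.items():
                x, y = (a, b) if x_first else (b, a)
                out[(x, y)] = out.get((x, y), 0.0) + wa * wb * f(x, y)
        return out

    print(run(True) == run(False))  # True: both orderings give the same measure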
Example: facts about distributions

    let x = sample(gauss(0.0, 1.0)) in
    return (x < 0)
    = sample(bern(0.5))
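A corresponding numeric check: the standard normal is symmetric about 0, so returning (x < 0) behaves like a fair coin (a quick simulation sketch):

    import random

    random.seed(3)
    n = 200_000
    p = sum(random.gauss(0.0, 1.0) < 0 for _ in range(n)) / n
    print(p)  # ~0.5, i.e. bern(0.5)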
Example: conjugate priors

    let x = sample(beta(1,1)) in        observe(bern(0.5), true);
    observe(bern(x), true);        =    let x = sample(beta(2,1)) in
    return x                            return x
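This equation can be sanity-checked numerically: on the left, x ~ beta(1,1) (uniform) carries likelihood weight x for observing true; on the right, a constant weight 0.5 multiplies samples from beta(2,1). Both posteriors have mean 2/3. A Monte Carlo sketch comparing the means (our check, not from the talk):

    import random

    random.seed(1)
    n = 200_000

    # Left: x ~ beta(1,1) = uniform(0,1), weight = P(true | bern(x)) = x
    total_w = total_wx = 0.0
    for _ in range(n):
        x = random.betavariate(1, 1)
        total_w += x
        total_wx += x * x
    left_mean = total_wx / total_w  # tends to (1/3)/(1/2) = 2/3

    # Right: the constant weight 0.5 cancels; beta(2,1) has mean 2/3
    right_mean = sum(random.betavariate(2, 1) for _ in range(n)) / n

    print(left_mean, right_mean)  # both ~0.667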
Higher order functions

Allow probabilistic terms as input/output for other terms:

    A, B ::= R | P(A) | 1 | A × B | Σ_{i∈I} A_i | A → B

[Roy et al, A stochastic programming perspective on nonparametric Bayes, ICML 2008]

But R → R is not a measurable space.
[Aumann, Borel structures for function spaces, Ill J Math, 1961]

Easy to handle operationally. What to do denotationally?
[Borgström et al, Measure transformer semantics for Bayesian machine learning, ESOP 2011]
Higher order semantics

Use category theory to extend measure theory:
- measurable spaces do not have enough function spaces
- sheaves on measurable spaces (presheaves that preserve countable products) preserve all the structure and have all function spaces
- the Giry monad P(A) on measurable spaces, which interprets the distribution types, extends to sheaves by left Kan extension
  [Power, Generic models for computational effects, Th Comp Sci 2006]

Consequences:
- 1 → (R → R) consists of random functions
- all definable functions R → R are measurable (measurable Ω × R → R; Church-Turing)
- denotational and operational semantics match (soundness & adequacy)
Extensionality

Not extensional: there are f, g: A → B with f ≠ g even though f ∘ p = g ∘ p for all points p: 1 → A.
Solution: restrict to a subcategory that is extensional.

A quasi-measurable space is a set X with a set M_X ⊆ [R → X] satisfying:
- if f: R → R is measurable and g ∈ M_X, then g ∘ f ∈ M_X
- if f: R → X is constant, then f ∈ M_X
- if f: R → N is measurable and g_n ∈ M_X for each n, then t ↦ g_{f(t)}(t) is in M_X

Morphisms are functions f: X → Y such that g ∈ M_X implies f ∘ g ∈ M_Y.

Example: for X a measurable space, take M_X to be the measurable functions R → X; a morphism X → Y is then exactly a measurable function.

Theorem: this gives a cartesian closed category with countable sums.
Corollary: if a term t has first order type, then ⟦t⟧ is measurable, even if t involves higher order functions.

A measure on (X, M_X) is a measure μ on R together with a function f ∈ M_X.
Proposition: measures on [X → Y] are random functions: a measurable map R × X → Y, modulo a measure on R.
Recursion

No recursion / least fixed points yet. Idea: restrict to presheaves over domains.

An ω-complete partial order (ωcpo) has suprema of increasing sequences x_1 ≤ x_2 ≤ x_3 ≤ ...; morphisms preserve suprema of increasing sequences and infima.

A quasi-measurable space is ordered when X is an ωcpo and M_X is closed under pointwise increasing suprema.

Example: any ωcpo, e.g. [0,1]; take M_X to be all measurable functions R → X, where X has the Borel σ-algebra of the Lawson topology.

Theorem: this gives a cartesian closed category with countable sums.
Example: von Neumann's trick

    let g = bern(0.66) in
    letrec f() =
      (let x = sample(g) in
       let y = sample(g) in
       if x = y then f() else return x)
    in f()
    ?= sample(bern(0.5))

(Yes: conditioned on x ≠ y, the two outcomes are equally likely, whatever the bias of g.)
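A quick empirical check of the trick, with the letrec replaced by a loop (a simulation sketch):

    import random

    random.seed(2)

    def biased():
        # g = bern(0.66)
        return random.random() < 0.66

    def von_neumann():
        # sample g twice; retry on agreement, otherwise return the first flip
        while True:
            x, y = biased(), biased()
            if x != y:
                return x

    n = 200_000
    print(sum(von_neumann() for _ in range(n)) / n)  # ~0.5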
Conclusion

Foundational semantics for probabilistic programming:
- continuous distributions
- soft constraints
- commutativity
- higher order functions
- recursion

It can verify and suggest program transformations. Approximations?