A Lambda Calculus Foundation for Universal Probabilistic Programming

1 A Lambda Calculus Foundation for Universal Probabilistic Programming
Johannes Borgström (Uppsala University)
Ugo Dal Lago (University of Bologna, INRIA)
Andrew D. Gordon (Microsoft Research, University of Edinburgh)
Marcin Szymczak (University of Edinburgh)
January 23, 2016

2 Introduction
We want to prove correct a variant of Metropolis-Hastings MCMC on program traces (sequences of random choices made during execution), along the lines of the algorithm used by Church.
Why a formal correctness proof of Trace MCMC?
- Because there is none yet (for a functional language).
- Can we really trust probabilistic languages and their inference engines?
- Machine learning is used in safety-critical applications (medicine, autonomous vehicles, etc.).
- Traces form a highly nonstandard parameter space, so the simple textbook proof for MH-MCMC does not apply.

3 Roadmap
To prove correctness of an inference algorithm for a probabilistic language, we need:
- The syntax of the language
- A semantics of the language
- A rigorous definition of the algorithm
- A formal definition of "correct"

4 Roadmap
This paper consists of two parts:
- Semantics of a probabilistic lambda-calculus with continuous distributions, defined in two ways:
  - Distributional semantics: a distribution on return values
  - Sampling-based semantics: a distribution on random traces
- A formal proof of correctness of MH-MCMC for this language, with respect to the distributional semantics.
We are still completing the proofs of two measurability lemmas.

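5 Untyped lambda-calculus with continuous distributions
Let $x$, $D$, $g$ range over countable sets of identifiers, distributions, and primitive functions, respectively.
\[
\begin{array}{rcl}
V & ::= & c \mid x \mid \lambda x.M \\
M & ::= & V \mid M\,N \mid D(V_1, \ldots, V_{|D|}) \mid g(V_1, \ldots, V_{|g|}) \\
  & \mid & \mathtt{if}\ V\ \mathtt{then}\ M\ \mathtt{else}\ L \mid \mathtt{fail} \\
G & ::= & V \mid \mathtt{fail}
\end{array}
\]
We define a metric on the space $\Lambda$ of terms:
\[
\begin{aligned}
d(c, d) &= |c - d| \\
d(x, x) &= 0 \\
d(\lambda x.M, \lambda x.N) &= d(M, N) \\
d(M\,N, L\,P) &= d(M, L) + d(N, P) \\
&\ \ldots
\end{aligned}
\]
The metric space $(\Lambda, d)$ gives rise to a topology and a Borel σ-algebra.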
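The listed cases of the metric can be sketched in Python. This is only an illustration: the tuple encoding of terms and the name `term_distance` are hypothetical, and returning infinity for every pair of terms not covered by a listed case is an assumption about the elided cases, not something stated on the slide.

```python
import math

# Hypothetical encoding of terms as tuples:
#   ("const", 1.5)   ("var", "x")   ("lam", "x", body)   ("app", fun, arg)

def term_distance(m, n):
    """Distance between two terms, following the four listed cases."""
    if m[0] != n[0]:
        return math.inf                    # assumption: different term shapes
    tag = m[0]
    if tag == "const":
        return abs(m[1] - n[1])            # d(c, d) = |c - d|
    if tag == "var":
        return 0.0 if m[1] == n[1] else math.inf   # d(x, x) = 0
    if tag == "lam":
        if m[1] != n[1]:
            return math.inf                # assumption: binders must coincide
        return term_distance(m[2], n[2])   # d(λx.M, λx.N) = d(M, N)
    if tag == "app":                       # d(M N, L P) = d(M, L) + d(N, P)
        return term_distance(m[1], n[1]) + term_distance(m[2], n[2])
    return math.inf                        # assumption: cases elided in the source

left = ("app", ("lam", "x", ("const", 1.0)), ("const", 2.0))
right = ("app", ("lam", "x", ("const", 4.0)), ("const", 2.5))
gap = term_distance(left, right)
```

Under these assumptions, two terms are at finite distance only when they have the same shape and differ in their real constants.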

6 Example program
Using the standard syntactic sugar for let:

let p = uniform() in
let flip = λ_. uniform() < p in
if (flip() = 0) and (flip() = 1) then p else fail
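To get a feel for what the example program computes, here is a minimal Python sketch that runs it repeatedly and discards the runs that end in fail. It is not an implementation of the calculus: the names `example_program` and `rejection_sample` are hypothetical, `None` stands for fail, and reading `flip() = 0` as "the flip came out false" is an assumption.

```python
import random

def example_program(rng):
    """One run of the example; returns p, or None to stand for `fail`."""
    p = rng.random()                    # let p = uniform()
    flip = lambda: rng.random() < p     # let flip = λ_. uniform() < p
    first, second = flip(), flip()
    # (flip() = 0) and (flip() = 1): first flip false, second flip true
    if (not first) and second:
        return p
    return None

def rejection_sample(n_runs, seed=0):
    """Run the program n_runs times and keep only the non-failing results."""
    rng = random.Random(seed)
    runs = (example_program(rng) for _ in range(n_runs))
    return [p for p in runs if p is not None]

samples = rejection_sample(20000)
mean_p = sum(samples) / len(samples)
```

Given p, a run survives with probability p(1 - p), so the surviving values of p concentrate around 0.5 and roughly one run in six is kept.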

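7 Distributional Semantics- Small Step
Deterministic reduction $M \to_{\mathrm{det}} N$:
\[
\begin{aligned}
E[(\lambda x.M)\,V] &\to_{\mathrm{det}} E[M\{V/x\}] \\
E[T] &\to_{\mathrm{det}} E[\mathtt{fail}] \\
E[\mathtt{fail}] &\to_{\mathrm{det}} \mathtt{fail} \quad \text{if } E \text{ is not } [\,] \\
&\ \ldots
\end{aligned}
\]
One-step evaluation $M \to \mathcal{D}$:
\[
\begin{aligned}
E[D(\vec{c})] &\to E\{\mu_{D(\vec{c})}\} \\
E[M] &\to \delta(E[N]) \quad \text{if } M \to_{\mathrm{det}} N
\end{aligned}
\]
Step-indexed approximation semantics $M \Rightarrow_n \mathcal{D}$:
\[
\frac{n > 0}{G \Rightarrow_n \delta(G)}
\qquad
\frac{}{M \Rightarrow_0 0}
\qquad
\frac{M \to \mathcal{D} \qquad \{N \Rightarrow_n \mathcal{E}_N\}_{N \in \mathrm{supp}(\mathcal{D})}}
     {M \Rightarrow_{n+1} \Big(A \mapsto \int \mathcal{E}_N(A)\, \mathcal{D}(dN)\Big)}
\]
Semantics: $\llbracket M \rrbracket_{\Rightarrow} = \sup\{\mathcal{D} \mid M \Rightarrow_n \mathcal{D}\}$
Lemma: $\to$ is a subprobability kernel.
Lemma: $\Rightarrow_n$ is a subprobability kernel for every $n \geq 0$.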

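8 Distributional Semantics- Big Step
\[
\frac{n > 0}{G \Downarrow_n \delta(G)}
\qquad
\frac{}{M \Downarrow_0 0}
\qquad
\frac{n > 0}{T \Downarrow_n \delta(\mathtt{error})}
\qquad
\frac{n > 0}{D(\vec{c}) \Downarrow_n \mu_{D(\vec{c})}}
\qquad
\frac{n > 0}{g(\vec{c}) \Downarrow_n \delta(\sigma_g(\vec{c}))}
\]
\[
\frac{M \Downarrow_n \mathcal{D}}{\mathtt{if}\ \mathtt{true}\ \mathtt{then}\ M\ \mathtt{else}\ N \Downarrow_{n+1} \mathcal{D}}
\qquad
\frac{N \Downarrow_n \mathcal{D}}{\mathtt{if}\ \mathtt{false}\ \mathtt{then}\ M\ \mathtt{else}\ N \Downarrow_{n+1} \mathcal{D}}
\]
\[
\frac{M \Downarrow_n \mathcal{D} \qquad N \Downarrow_n \mathcal{E} \qquad \{L\{V/x\} \Downarrow_n \mathcal{E}_{L,V}\}_{(\lambda x.L) \in \mathrm{supp}(\mathcal{D}),\ V \in \mathrm{supp}(\mathcal{E})}}
     {M\,N \Downarrow_{n+1} A \mapsto \mathcal{D}|_{E}(A) + \mathcal{D}(\mathbb{R})\,\delta(\mathtt{error})(A) + \mathcal{D}(\mathcal{V}_\lambda)\,\mathcal{E}|_{E}(A) + \iint \mathcal{E}_{L,V}(A)\, \mathcal{D}|_{\mathcal{V}_\lambda}(\lambda x.dL)\, \mathcal{E}|_{\mathcal{V}}(dV)}
\]
Semantics: $\llbracket M \rrbracket_{\Downarrow} = \sup\{\mathcal{D} \mid M \Downarrow_n \mathcal{D}\}$
Theorem: For every term $M$, $\llbracket M \rrbracket_{\Rightarrow} = \llbracket M \rrbracket_{\Downarrow}$.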

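9 Sampling Based Semantics - Pseudo-deterministic Evaluation
Small step: $(M, w, s) \to (M', w', s')$
\[
\frac{M \to_{\mathrm{det}} N}{(M, w, s) \to (N, w, s)}
\qquad
\frac{w' = \mathrm{pdf}_D(\vec{c}, c) \qquad w' > 0}{(E[D(\vec{c})], w, s) \to (E[c], w\,w', s @ [c])}
\]
Big step: $M \Downarrow^{s}_{w} G$
\[
\frac{G \in \mathcal{GV}}{G \Downarrow^{[]}_{1} G}
\qquad
\frac{w = \mathrm{pdf}_D(\vec{c}, c) \qquad w > 0}{D(\vec{c}) \Downarrow^{[c]}_{w} c}
\qquad
\frac{}{g(\vec{c}) \Downarrow^{[]}_{1} \sigma_g(\vec{c})}
\]
\[
\frac{M \Downarrow^{s_1}_{w_1} \lambda x.P \qquad N \Downarrow^{s_2}_{w_2} V \qquad P[V/x] \Downarrow^{s_3}_{w_3} G}
     {M\,N \Downarrow^{s_1 @ s_2 @ s_3}_{w_1 w_2 w_3} G}
\qquad \ldots
\]
Proposition: $M \Downarrow^{s}_{w} G$ if and only if $(M, 1, []) \to^{*} (G, w, s)$.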

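10 Sampling Based Semantics: inspired by (Nori, Hur, Rajamani, Samuel 2013)
Measurable space of program traces: $(\mathbb{S}, \mathcal{S})$, where
\[
\mathbb{S} = \bigcup_{n \in \mathbb{N}} \mathbb{R}^n
\qquad
\mathcal{S} = \Big\{ \bigcup_{n \in \mathbb{N}} H_n \ \Big|\ H_n \in \mathcal{R}^n \text{ for all } n \Big\}
\]
Stock measure on program traces:
\[
\mu\Big(\bigcup_{n \in \mathbb{N}} H_n\Big) = \sum_{n=1}^{\infty} \lambda_n(H_n)
\]
Density function of a program $M$ (w.r.t. the stock measure on traces):
\[
P_M(s) = \begin{cases} w & \text{if } M \Downarrow^{s}_{w} G \text{ for some } G \\ 0 & \text{otherwise} \end{cases}
\]
Outcome of evaluation of $M$ as a function of the trace $s$:
\[
O_M(s) = \begin{cases} G & \text{if } M \Downarrow^{s}_{w} G \text{ for some } w \\ \mathtt{fail} & \text{otherwise} \end{cases}
\]
A subprobability measure on program traces:
\[
\langle\!\langle M \rangle\!\rangle_{S}(A) = \int_A P_M(s)\, \mu(ds)
\]
We can obtain a measure on values by transformation:
\[
\llbracket M \rrbracket_{S} = \langle\!\langle M \rangle\!\rangle_{S} \circ O_M^{-1}
\]
Theorem: $\llbracket M \rrbracket_{S} = \llbracket M \rrbracket_{\Rightarrow} = \llbracket M \rrbracket_{\Downarrow}$
Recall:
- $\llbracket M \rrbracket_{\Rightarrow}$: small-step distributional semantics
- $\llbracket M \rrbracket_{\Downarrow}$: big-step distributional semantics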
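The density P_M(s) and outcome O_M(s) can be illustrated by replaying a fixed trace through a program. The following Python sketch is an assumption-laden illustration, not the calculus itself: the program is an ordinary Python function that receives a hypothetical `sample` callback, and the names `evaluate`, `density`, `outcome` and `Stuck` are invented for this example. A trace that is too short, too long, or contains a value of zero density yields no derivation, so the density is 0 and the outcome is fail.

```python
FAIL = "fail"

class Stuck(Exception):
    """Raised when evaluation cannot proceed under the given trace."""

def uniform_pdf(c):
    return 1.0 if 0.0 <= c <= 1.0 else 0.0

def example(sample):
    """The example program, written against a hypothetical sample callback."""
    p = sample(uniform_pdf)
    flip = lambda: sample(uniform_pdf) < p
    first, second = flip(), flip()
    return p if (not first) and second else FAIL

def evaluate(program, trace):
    """Return (w, G) if the program evaluates using exactly `trace`, else None."""
    pos, weight = 0, 1.0

    def sample(pdf):
        nonlocal pos, weight
        if pos >= len(trace):
            raise Stuck("trace too short")
        c = trace[pos]
        w = pdf(c)
        if w <= 0:
            raise Stuck("value has zero density")
        pos += 1
        weight *= w          # the weight accumulates the product of densities
        return c

    try:
        result = program(sample)
    except Stuck:
        return None
    if pos != len(trace):
        return None          # leftover entries: the trace does not match
    return weight, result

def density(program, trace):       # plays the role of P_M(s)
    r = evaluate(program, trace)
    return r[0] if r is not None else 0.0

def outcome(program, trace):       # plays the role of O_M(s)
    r = evaluate(program, trace)
    return r[1] if r is not None else FAIL
```

For instance, `density(example, [0.5, 0.7, 0.2])` is the product of three uniform densities, and `outcome(example, [0.5, 0.7, 0.2])` returns the first entry of the trace because the first flip is false and the second is true.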

11 MCMC on General State Spaces (Green 1995, Tierney 1994, 1998)
Let $(\Omega, \Sigma)$ be an arbitrary measurable space. Suppose we want to sample from some distribution $\pi$ on $\Sigma$.
Define a proposal kernel $Q(x, A) : \Omega \times \Sigma \to \mathbb{R}$ and a measurable acceptance function $\alpha(x, y) : \Omega \times \Omega \to [0, 1]$ such that the resulting Metropolis-Hastings transition kernel
\[
P(x, A) = \int_A \alpha(x, y)\, Q(x, dy) + \delta(x)(A) \int_\Omega (1 - \alpha(x, t))\, Q(x, dt)
\]
is reversible with respect to $\pi$:
\[
\int_A P(x, B)\, \pi(dx) = \int_B P(y, A)\, \pi(dy) \quad \text{for all } A, B \in \Sigma.
\]
Then $\pi$ is the stationary distribution of the Markov chain with transition kernel $P$.
If $Q(x, A) = \int_A q(x, y)\, \mu(dy)$ and $\pi(A) = \int_A \pi(x)\, \mu(dx)$, the detailed balance equation simplifies.
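The simplification in the density case can be spelled out as follows. This is the standard argument, written here as a reminder rather than taken from the slides; as in the slide, $\pi$ denotes both the measure and its density. The rejection term of $P$ contributes the same amount to both sides of the reversibility equation (it only charges $A \cap B$), so it suffices that the integrands of the acceptance term agree pointwise:

```latex
\[
\pi(x)\, q(x, y)\, \alpha(x, y) = \pi(y)\, q(y, x)\, \alpha(y, x)
\quad \text{for all } x, y \in \Omega .
\]
% The usual Metropolis-Hastings choice satisfies this condition:
\[
\alpha(x, y) = \min\left\{ 1, \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right\}
\quad \text{whenever } \pi(x)\, q(x, y) > 0 .
\]
```

With this choice, one of $\alpha(x, y)$ and $\alpha(y, x)$ equals 1 and the other equals the ratio, so both sides of the pointwise condition reduce to $\min\{\pi(x) q(x, y), \pi(y) q(y, x)\}$.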

12 MH-MCMC Inference
Idea: formalize the algorithm used by Church (or a slightly simplified version thereof):
- Given a trace $s = [s_1, \ldots, s_n]$ in program $M$, choose an index $k$ between 0 and $n$ at random.
- Partially evaluate $M$ under the trace $[s_1, \ldots, s_k]$, yielding $M'$.
- Evaluate $M'$, sampling values $[t_{k+1}, \ldots, t_m]$ from the target distributions on the way.
- Set $t = [s_1, \ldots, s_k, t_{k+1}, \ldots, t_m]$ and accept with probability $\alpha(s, t) = \min\{1, |t| / |s|\}$.
Problem: the proposal kernel corresponding to this algorithm has no density! Fixing a prefix of the trace would immediately set the integral to 0.
The lack of a density makes the proof much harder. We have decided to leave it as further work and start with a kernel which has a density.

13 MH-MCMC Inference- Take 2
Solution: update all elements of the trace, following the approach of (Hur, Nori et al, 2015).
Let $s = [s_1, \ldots, s_n]$ be the previous trace. For the $i$-th random choice:
- If $i \leq n$, draw $t_i \sim \mathrm{Gaussian}(s_i, \sigma^2)$.
- Otherwise, draw $t_i$ from the target distribution.
Repeat until we get a generalized value, and return the trace $t$.
Accept with probability
\[
\alpha(s, t) = \begin{cases}
0 & \text{if } P_M(t) = 0 \\
1 & \text{if } P_M(s)\, q(s, t) = 0 \\
\min\left\{1, \dfrac{P_M(t)\, q(t, s)}{P_M(s)\, q(s, t)}\right\} & \text{otherwise}
\end{cases}
\]
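The three-case acceptance probability translates directly into code. In this sketch the function name `acceptance` and its argument names are hypothetical; the caller is assumed to have already computed the trace densities P_M(s), P_M(t) and the proposal densities q(s, t), q(t, s).

```python
def acceptance(p_s, p_t, q_st, q_ts):
    """Acceptance probability alpha(s, t), following the three cases in order.

    p_s, p_t   : trace densities P_M(s) and P_M(t)
    q_st, q_ts : proposal densities q(s, t) and q(t, s)
    """
    if p_t == 0:
        return 0.0                       # never move to a zero-density trace
    if p_s * q_st == 0:
        return 1.0                       # avoids dividing by zero below
    return min(1.0, (p_t * q_ts) / (p_s * q_st))
```

The order of the tests matters: the first case takes priority when both P_M(t) and P_M(s) q(s, t) are zero.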

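14 Inference- Take 2
This algorithm has the following transition kernel $P$:
\[
\mathrm{peval}(M, s) = \begin{cases}
M & \text{if } s = [] \\
M' & \text{if } (M, 1, []) \to^{*} (M_k, w_k, s_k) \to (M', w', s) \text{ for some } M_k, w_k, s_k, w' \text{ such that } s_k \neq s \\
\mathtt{fail} & \text{otherwise}
\end{cases}
\]
\[
q(s, t) = \Big(\prod_{i=1}^{k} \mathrm{pdf}_{\mathrm{Gaussian}}(s_i, \sigma^2, t_i)\Big) \cdot P_N(t_{k+1..|t|}) \quad \text{if } |t| \neq 0,
\]
where $k = \min\{|s|, |t|\}$ and $N = \mathrm{peval}(M, t_{1..k})$.
\[
q(s, []) = 1 - \int_A q(s, t)\, \mu(dt) \quad \text{where } A = \{t \mid |t| \neq 0\}
\]
\[
Q(s, A) = \int_A q(s, t)\, \mu(dt)
\]
\[
P(s, A) = \int_A \alpha(s, t)\, Q(s, dt) + \delta(s)(A) \int_{\mathbb{S}} (1 - \alpha(s, t))\, Q(s, dt)
\]
Stationary distribution: $\pi(A) = \langle\!\langle M \rangle\!\rangle_{S}(A) / \langle\!\langle M \rangle\!\rangle_{S}(\mathbb{S})$ (the normalized distribution on traces).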
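As a concrete, heavily simplified instance, the example program from earlier always makes exactly three uniform() draws, so every trace of non-zero density has length three and q(s, t) reduces to the product of three Gaussian densities. The Python sketch below runs one chain under that assumption only; it does not implement peval, traces of varying length, or the empty-trace case, and all function names are hypothetical.

```python
import math
import random

def gaussian_pdf(mean, sigma, x):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def trace_density(trace):
    """P_M for the example program: three uniform() draws, each in [0, 1]."""
    ok = len(trace) == 3 and all(0.0 <= c <= 1.0 for c in trace)
    return 1.0 if ok else 0.0

def proposal_density(s, t, sigma):
    """q(s, t) when |s| = |t| = 3: only the Gaussian product remains."""
    return math.prod(gaussian_pdf(si, sigma, ti) for si, ti in zip(s, t))

def mh_step(s, sigma, rng):
    """One transition: perturb every entry of s, then accept or stay at s."""
    t = [rng.gauss(si, sigma) for si in s]
    p_s, p_t = trace_density(s), trace_density(t)
    q_st, q_ts = proposal_density(s, t, sigma), proposal_density(t, s, sigma)
    if p_t == 0:
        alpha = 0.0
    elif p_s * q_st == 0:
        alpha = 1.0
    else:
        alpha = min(1.0, (p_t * q_ts) / (p_s * q_st))
    return t if rng.random() < alpha else s

def outcome(trace):
    """O_M for the example program; None stands for `fail`."""
    p, u1, u2 = trace
    return p if (not u1 < p) and (u2 < p) else None

rng = random.Random(0)
s = [0.5, 0.7, 0.2]                 # initial trace with P_M(s) > 0
outcomes = []
for _ in range(20000):
    s = mh_step(s, 0.3, rng)
    outcomes.append(outcome(s))
values = [p for p in outcomes if p is not None]
```

Because the Gaussian proposal is symmetric, the acceptance probability here is 1 for proposals inside the unit cube and 0 otherwise, so the chain never leaves the set of traces with positive density.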

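15 Definition of correctness
Define $P^n(x, A)$ to be the probability of reaching $A$ from $x$ in $n$ steps:
\[
P^0(s, A) = \delta(s)(A)
\qquad
P^{n+1}(s, A) = \int P(t, A)\, P^n(s, dt)
\]
The variational norm is a measure of closeness of probability measures:
\[
\|\mu_1 - \mu_2\| = \sup_{A \in \Sigma} |\mu_1(A) - \mu_2(A)|
\]
Let $T^n(s, A) = P^n(s, O_M^{-1}(A))$ and $\llbracket M \rrbracket_{\mathcal{GV}}(A) = \llbracket M \rrbracket(A) / \llbracket M \rrbracket(\mathcal{GV})$.
The algorithm can be considered correct if, for every trace $s$ with $P_M(s) \neq 0$,
\[
\lim_{n \to \infty} \|T^n(s, \cdot) - \llbracket M \rrbracket_{\mathcal{GV}}\| = 0.
\]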
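On a finite state space the n-step kernel and the variational norm can be computed exactly, which gives a small sanity check of what convergence means here. The sketch below is a finite-state analogue only, not the trace space of the talk; the target `pi`, the uniform proposal `q`, and all function names are assumptions chosen for illustration.

```python
def mh_kernel(pi, q):
    """Metropolis-Hastings transition matrix on states 0..n-1."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x and q[x][y] > 0:
                alpha = min(1.0, (pi[y] * q[y][x]) / (pi[x] * q[x][y]))
                P[x][y] = q[x][y] * alpha
        P[x][x] = 1.0 - sum(P[x])       # self-proposals plus rejected moves
    return P

def step(dist, P):
    """P^{n+1}(s, a) = sum over t of P(t, a) * P^n(s, t)."""
    n = len(P)
    return [sum(dist[t] * P[t][a] for t in range(n)) for a in range(n)]

def variation_distance(mu1, mu2):
    """sup over sets A of |mu1(A) - mu2(A)|; for probability vectors this
    equals half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu1, mu2))

pi = [0.5, 0.3, 0.2]                    # assumed target distribution
q = [[1.0 / 3.0] * 3 for _ in range(3)] # assumed uniform proposal
P = mh_kernel(pi, q)

dist = [1.0, 0.0, 0.0]                  # P^0(s, .) = delta(s) with s = 0
distances = []
for _ in range(50):
    distances.append(variation_distance(dist, pi))
    dist = step(dist, P)
final_distance = variation_distance(dist, pi)
```

In this toy setting the recorded distances shrink towards zero as n grows, which is the finite-state counterpart of the limit in the definition above.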

16 Proof of correctness
Theorem (Tierney 1994): Let $P$ be a Metropolis kernel (as given earlier). If $\pi$ is the stationary distribution of $P$, and $P$ is $\pi$-irreducible and aperiodic, then
\[
\lim_{n \to \infty} \|P^n(x, \cdot) - \pi\| = 0.
\]
Lemma (Strong Irreducibility, implies $\pi$-irreducibility): If $P_M(s) > 0$ and $\langle\!\langle M \rangle\!\rangle_{S}(A) > 0$, then $P(s, A) > 0$.
Lemma (Aperiodicity): $P$ is $\pi$-aperiodic.
The theorem above then gives:
\[
\lim_{n \to \infty} \|P^n(x, \cdot) - \pi\| = 0.
\]
Theorem (Main Result): For every trace $s$ with $P_M(s) \neq 0$,
\[
\lim_{n \to \infty} \|T^n(s, \cdot) - \llbracket M \rrbracket_{\mathcal{GV}}\| = 0.
\]

17 Further work
- Finish the proofs of the two remaining technical lemmas (in the second part)
- Translation of Church to the calculus
- Trial implementation
- Understanding conditioning
- Alternative inference algorithm, similar to Church
- Program MCMC in the calculus itself
