Web-Supplement for: Accurate Logistic Variational Message Passing: Algebraic and Numerical Details
BY TUI H. NOLAN AND MATT P. WAND

School of Mathematical and Physical Sciences, University of Technology Sydney, Broadway 2007, Australia

S.1 Proof of Theorem 1

First note that, for any $a, b \in \mathbb{R}$,
$$\int_{-\infty}^{\infty} \Phi(a + b\,x)\,\phi(x)\,dx = \Phi\left(\frac{a}{\sqrt{1 + b^2}}\right)
\quad\text{and}\quad
\int_{-\infty}^{\infty} x\,\Phi(a + b\,x)\,\phi(x)\,dx = \frac{b}{\sqrt{1 + b^2}}\,\phi\left(\frac{a}{\sqrt{1 + b^2}}\right). \tag{S.1}$$
From the first result in (S.1) it follows immediately that
$$\int_{-\infty}^{\infty} \mathrm{expit}_k(\mu + \sigma\,x)\,\phi(x)\,dx = \sum_{i=1}^{k} p_{k,i}\,\Phi\left(\frac{s_{k,i}\,\mu}{\sqrt{1 + \sigma^2 s_{k,i}^2}}\right).$$
Hence, for all $\mu \in \mathbb{R}$ and $\sigma > 0$,
$$\left|\, B_0(\mu, \sigma^2) - \sum_{i=1}^{k} p_{k,i}\,\Phi\left(\frac{s_{k,i}\,\mu}{\sqrt{1 + \sigma^2 s_{k,i}^2}}\right) \right|
= \left|\, \int_{-\infty}^{\infty} \mathrm{expit}(\mu + \sigma\,x)\,\phi(x)\,dx - \int_{-\infty}^{\infty} \mathrm{expit}_k(\mu + \sigma\,x)\,\phi(x)\,dx \,\right|$$
$$\le \int_{-\infty}^{\infty} \left| \mathrm{expit}(\mu + \sigma\,x) - \mathrm{expit}_k(\mu + \sigma\,x) \right| \phi(x)\,dx
\le \sup_{u \in \mathbb{R}} \left| \mathrm{expit}(u) - \mathrm{expit}_k(u) \right| \int_{-\infty}^{\infty} \phi(x)\,dx = \varepsilon_k,$$
where the last equality follows from (7). Part (a) of Theorem 1 follows immediately.
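The closed form just established is straightforward to check numerically. The following Python sketch (NumPy and SciPy assumed) compares the probit-mixture expression with adaptive quadrature of $\int \mathrm{expit}(\mu + \sigma x)\,\phi(x)\,dx$; the weights `p` and scales `s` below are illustrative placeholders, not the $(p_{k,i}, s_{k,i})$ constants of the main article, which should be substituted for a genuine accuracy check.

```python
# Numerical check of the probit-mixture form of B_0(mu, sigma^2).
import numpy as np
from scipy.stats import norm
from scipy.special import expit
from scipy.integrate import quad

def B0_mixture(mu, sigma, p, s):
    """Sum_i p_i * Phi(s_i * mu / sqrt(1 + sigma^2 s_i^2))."""
    denom = np.sqrt(1.0 + (sigma * s) ** 2)
    return np.sum(p * norm.cdf(s * mu / denom))

def B0_quad(mu, sigma):
    """int expit(mu + sigma*x) phi(x) dx, by adaptive quadrature."""
    return quad(lambda x: expit(mu + sigma * x) * norm.pdf(x),
                -np.inf, np.inf)[0]

p = np.array([0.5, 0.5])   # placeholder mixture weights (must sum to 1)
s = np.array([0.5, 0.8])   # placeholder probit scales
print(B0_mixture(1.3, 2.0, p, s), B0_quad(1.3, 2.0))
```

With the article's $k$-term constants substituted, the two printed values agree to within the $\varepsilon_k$ bound of Theorem 1; with the placeholders they merely illustrate the calling pattern.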
The second result in (S.1) implies that
$$\int_{-\infty}^{\infty} x\,\mathrm{expit}_k(\mu + \sigma\,x)\,\phi(x)\,dx
= \sigma \sum_{i=1}^{k} \frac{p_{k,i}\,s_{k,i}}{\sqrt{1 + \sigma^2 s_{k,i}^2}}\;\phi\left(\frac{s_{k,i}\,\mu}{\sqrt{1 + \sigma^2 s_{k,i}^2}}\right).$$
For all $\mu \in \mathbb{R}$ and $\sigma > 0$ we then have
$$\left|\, B_1(\mu, \sigma^2) - \sigma \sum_{i=1}^{k} \frac{p_{k,i}\,s_{k,i}}{\sqrt{1 + \sigma^2 s_{k,i}^2}}\;\phi\left(\frac{s_{k,i}\,\mu}{\sqrt{1 + \sigma^2 s_{k,i}^2}}\right) \right|
= \left|\, \int_{-\infty}^{\infty} \mathrm{expit}(\mu + \sigma\,x)\,x\,\phi(x)\,dx - \int_{-\infty}^{\infty} \mathrm{expit}_k(\mu + \sigma\,x)\,x\,\phi(x)\,dx \,\right|$$
$$\le \int_{-\infty}^{\infty} \left| \mathrm{expit}(\mu + \sigma\,x) - \mathrm{expit}_k(\mu + \sigma\,x) \right| |x|\,\phi(x)\,dx
\le \sup_{u \in \mathbb{R}} \left| \mathrm{expit}(u) - \mathrm{expit}_k(u) \right| \int_{-\infty}^{\infty} |x|\,\phi(x)\,dx = \sqrt{2/\pi}\;\varepsilon_k,$$
since $\int_{-\infty}^{\infty} |x|\,\phi(x)\,dx = \sqrt{2/\pi}$.

S.2 Derivation of Algorithm 1

The message passed from $p(y|\theta)$ to $\theta$,
$$m_{p(y|\theta)\to\theta}(\theta) = \exp\left\{ y^T A\,\theta - \mathbf{1}^T \log(\mathbf{1} + \exp(A\,\theta)) \right\},$$
is not conjugate with the Multivariate Normal messages passed to $\theta$ from other factors. A non-conjugate VMP remedy (Knowles & Minka, 2011) involves replacement of $m_{p(y|\theta)\to\theta}(\theta)$ by
$$m_{p(y|\theta)\to\theta}(\theta) \longleftarrow \exp\left\{ \begin{bmatrix} \theta \\ \mathrm{vec}(\theta\theta^T) \end{bmatrix}^{T} \eta_{p(y|\theta)\to\theta} \right\} \tag{S.2}$$
to enforce conjugacy with Multivariate Normal messages. Under pre-specification (S.2), the current $q(\theta)$ density function satisfies
$$q(\theta) \propto m_{p(y|\theta)\to\theta}(\theta) \times (\text{product of messages passed to } \theta \text{ from its other neighbours}).$$
From (10) of Wand (2017), we then get
$$q(\theta) \propto m_{p(y|\theta)\to\theta}(\theta)\, m_{\theta\to p(y|\theta)}(\theta)
= \exp\left\{ \begin{bmatrix} \theta \\ \mathrm{vec}(\theta\theta^T) \end{bmatrix}^{T} \eta_{p(y|\theta)\leftrightarrow\theta} \right\},
\qquad \eta_{p(y|\theta)\leftrightarrow\theta} \equiv \eta_{p(y|\theta)\to\theta} + \eta_{\theta\to p(y|\theta)}.$$
Let $\mu_{p(y|\theta)\leftrightarrow\theta}$ and $\Sigma_{p(y|\theta)\leftrightarrow\theta}$ be the corresponding common parameters. The natural parameters and common parameters are the following functions of each other:
$$\eta_{p(y|\theta)\leftrightarrow\theta} = \begin{bmatrix} \Sigma_{p(y|\theta)\leftrightarrow\theta}^{-1}\,\mu_{p(y|\theta)\leftrightarrow\theta} \\[4pt] -\tfrac12\,\mathrm{vec}\left(\Sigma_{p(y|\theta)\leftrightarrow\theta}^{-1}\right) \end{bmatrix}$$
and
$$\mu_{p(y|\theta)\leftrightarrow\theta} = -\tfrac12 \left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1} (\eta_{p(y|\theta)\leftrightarrow\theta})_1,
\qquad
\Sigma_{p(y|\theta)\leftrightarrow\theta} = -\tfrac12 \left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1}, \tag{S.3}$$
where, for a vector $v$ of length $d + d^2$, $(v)_1$ denotes its first $d$ entries, $(v)_2$ denotes its last $d^2$ entries and $\mathrm{vec}^{-1}$ denotes the inverse of the $\mathrm{vec}$ operator. In the upcoming arguments we use the shorthand $\eta_{q(\theta)} \equiv \eta_{p(y|\theta)\leftrightarrow\theta}$, $\mu_{q(\theta)} \equiv \mu_{p(y|\theta)\leftrightarrow\theta}$ and $\Sigma_{q(\theta)} \equiv \Sigma_{p(y|\theta)\leftrightarrow\theta}$.
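In implementation terms, the maps in (S.3) amount to a reshape followed by a matrix inversion. The following is a minimal sketch, assuming the natural parameter vector is stored as separate length-$d$ and length-$d^2$ NumPy arrays and that $\mathrm{vec}$ stacks columns:

```python
# Conversion between natural and common Multivariate Normal parameters, (S.3).
import numpy as np

def common_from_natural(eta1, eta2, d):
    """mu = -(1/2) {vec^{-1}(eta2)}^{-1} eta1;  Sigma = -(1/2) {vec^{-1}(eta2)}^{-1}."""
    E2 = eta2.reshape(d, d, order="F")   # vec^{-1}: undo column stacking
    Sigma = -0.5 * np.linalg.inv(E2)
    return Sigma @ eta1, Sigma           # (mu, Sigma)

def natural_from_common(mu, Sigma):
    """eta1 = Sigma^{-1} mu;  eta2 = -(1/2) vec(Sigma^{-1})."""
    Sinv = np.linalg.inv(Sigma)
    return Sinv @ mu, -0.5 * Sinv.flatten(order="F")

# Round-trip check on an arbitrary proper covariance matrix:
eta1, eta2 = natural_from_common(np.array([1.0, -0.5]),
                                 np.array([[2.0, 0.3], [0.3, 1.0]]))
print(common_from_natural(eta1, eta2, d=2))   # recovers mu and Sigma
```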
Using the set-up of Section 2.2 of Rohde & Wand (2016), the $\theta$-localized approximate marginal log-likelihood is
$$\log \underline{p}(y; q, \eta_{q(\theta)})_{\theta} = \mathrm{Entropy}\{q(\theta); \eta_{q(\theta)}\} + \mathrm{NonEntropy}\{q(\theta); \eta_{q(\theta)}\},$$
where
$$\mathrm{Entropy}\{q(\theta); \eta_{q(\theta)}\} = -\tfrac12 \log\left| -2\,\mathrm{vec}^{-1}\{(\eta_{q(\theta)})_2\} \right| + d\,\{1 + \log(2\pi)\}/2,$$
with $d$ denoting the dimension of $\theta$. Also,
$$\mathrm{NonEntropy}\{q(\theta); \eta_{q(\theta)}\} \equiv E_{q(\theta;\eta_{q(\theta)})}\{\log p(y|\theta)\}
+ E_{q(\theta;\eta_{q(\theta)})}(\text{sum of other log-factors neighbouring } \theta)$$
$$= y^T A\,\mu_{q(\theta)} - E_{q(\theta;\eta_{q(\theta)})}\left[ \mathbf{1}^T \log\{\mathbf{1} + \exp(A\,\theta)\} \right]
+ \begin{bmatrix} \mu_{q(\theta)} \\ \mathrm{vec}\left(\Sigma_{q(\theta)} + \mu_{q(\theta)}\mu_{q(\theta)}^T\right) \end{bmatrix}^{T} \eta,$$
where $\eta \equiv \eta_{\theta\to p(y|\theta)}$ is the sum of the natural parameters of the messages passed to $\theta$ other than the message from $p(y|\theta)$. Ideally we would maximize $\log \underline{p}(y; q, \eta_{q(\theta)})_{\theta}$ over $\eta_{q(\theta)}$, but we are thwarted by the intractability of the $q$-density expectation of $\log\{\mathbf{1} + \exp(A\,\theta)\}$. To get around this we apply the Saul-Jordan lower bound to obtain
$$\log \underline{p}(y; q, \eta_{q(\theta)})_{\theta} \ge \mathrm{Entropy}\{q(\theta); \eta_{q(\theta)}\} + \mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\},$$
where
$$\mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\} \equiv y^T A\,\mu_{q(\theta)} - \tfrac12\,(\omega^2)^T\,\mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T)$$
$$- \mathbf{1}^T \log\left[ \mathbf{1} + \exp\left\{ A\,\mu_{q(\theta)} + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T) \right\} \right]
+ \begin{bmatrix} \mu_{q(\theta)} \\ \mathrm{vec}\left(\Sigma_{q(\theta)} + \mu_{q(\theta)}\mu_{q(\theta)}^T\right) \end{bmatrix}^{T} \eta,$$
and $\omega$ is an $n \times 1$ vector of variational parameters. We now seek to maximize
$$\log \underline{p}(y; q, \eta_{q(\theta)}, \omega)_{\theta} \equiv \mathrm{Entropy}\{q(\theta); \eta_{q(\theta)}\} + \mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}.$$
Using arguments analogous to those given in Section 4 of Rohde & Wand (2016), the function $\log \underline{p}(y; q, \eta_{q(\theta)}, \omega)_{\theta}$ has a stationary point in the $[\eta_{q(\theta)}^T\ \omega^T]^T$ space if and only if
$$\eta_{q(\theta)} = \left[ H_{\eta_{q(\theta)}} A(\eta_{q(\theta)}) \right]^{-1} D_{\eta_{q(\theta)}} \mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}^T \tag{S.4}$$
and
$$\mathbf{0} = D_{\omega}\,\mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}^T, \tag{S.5}$$
with $A$ denoting the Multivariate Normal log-partition function and $D_{\eta_{q(\theta)}}$ and $H_{\eta_{q(\theta)}}$ respectively denoting the derivative vector and Hessian matrix with respect to $\eta_{q(\theta)}$, as defined in Rohde & Wand (2016). Standard vector calculus steps (e.g., Wand, 2002) lead to
$$D_{\omega}\,\mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}^T
= \left( \mathrm{expit}\left[ A\,\mu_{q(\theta)} + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T) \right] - \omega \right) \odot \mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T).$$
Substitution of this result into (S.5), and a reworking of the arguments that lead to Result 2 of Rohde & Wand (2016), but applied to $\mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}$ rather than $\mathrm{NonEntropy}\{q(\theta); \eta_{q(\theta)}\}$, lead to the fixed-point updates
$$\begin{array}{rcl}
\omega &\longleftarrow& \mathrm{expit}\left[ A\,\mu_{q(\theta)} + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T) \right] \\[4pt]
v_{q(\theta)} &\longleftarrow& D_{\mu_{q(\theta)}} \mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}^T \\[4pt]
\Sigma_{q(\theta)} &\longleftarrow& -\left[ H_{\mu_{q(\theta)}} \mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\} \right]^{-1} \\[4pt]
\mu_{q(\theta)} &\longleftarrow& \mu_{q(\theta)} + \Sigma_{q(\theta)}\, v_{q(\theta)}.
\end{array} \tag{S.6}$$
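The stationarity condition (S.5) says that, at a fixed point, $\omega$ equals the expit expression above, which suggests the simple iteration sketched below. Variable names are illustrative and convergence diagnostics are omitted.

```python
# Fixed-point iteration for omega, following the first update in (S.6).
import numpy as np
from scipy.special import expit

def update_omega(omega, A, mu, Sigma, n_iter=50):
    """Iterate omega <- expit{A mu + 0.5 (1 - 2 omega) * diagonal(A Sigma A^T)}."""
    # diagonal(A Sigma A^T) without forming the full n x n matrix:
    d_vec = np.sum((A @ Sigma) * A, axis=1)
    for _ in range(n_iter):
        omega = expit(A @ mu + 0.5 * (1.0 - 2.0 * omega) * d_vec)
    return omega
```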
Further vector calculus leads to the explicit forms
$$D_{\mu_{q(\theta)}} \mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}^T
= A^T\left( y - \mathrm{expit}\left[ A\,\mu_{q(\theta)} + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T) \right] \right)
+ (\eta)_1 + 2\,\mathrm{vec}^{-1}\{(\eta)_2\}\,\mu_{q(\theta)}$$
and
$$H_{\mu_{q(\theta)}} \mathrm{NonEntropy}\{\eta_{q(\theta)}, \omega\}
= -A^T\,\tfrac12\,\mathrm{diag}\left[ \frac{\mathbf{1}}{\mathbf{1} + \cosh\left\{ A\,\mu_{q(\theta)} + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T) \right\}} \right] A
+ 2\,\mathrm{vec}^{-1}\{(\eta)_2\}.$$
Introduction of $\omega_0 \equiv \mathrm{logit}(\omega)$ and substitution into (S.6) then gives
$$\begin{array}{rcl}
\omega_0 &\longleftarrow& A\,\mu_{q(\theta)} + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \mathrm{diagonal}(A\,\Sigma_{q(\theta)} A^T) \\[4pt]
\omega &\longleftarrow& \mathrm{expit}(\omega_0); \qquad \omega_2 \longleftarrow \tfrac12\,\mathbf{1}/\{\mathbf{1} + \cosh(\omega_0)\} \\[4pt]
\Sigma_{q(\theta)} &\longleftarrow& \left\{ A^T \mathrm{diag}(\omega_2)\,A - 2\,\mathrm{vec}^{-1}\{(\eta)_2\} \right\}^{-1} \\[4pt]
\mu_{q(\theta)} &\longleftarrow& \mu_{q(\theta)} + \Sigma_{q(\theta)}\left[ A^T(y - \omega) + (\eta)_1 + 2\,\mathrm{vec}^{-1}\{(\eta)_2\}\,\mu_{q(\theta)} \right].
\end{array} \tag{S.7}$$
The remainder of the derivation of Algorithm 1 involves expressing (S.7) in terms of the input and output natural parameter vectors $\eta_{p(y|\theta)\to\theta}$ and $\eta_{\theta\to p(y|\theta)}$.
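Before carrying out this re-expression, one sweep of (S.7) can be sketched directly in the $(\mu_{q(\theta)}, \Sigma_{q(\theta)})$ parametrisation, assuming $(\eta)_1$ and $\mathrm{vec}^{-1}\{(\eta)_2\}$ are supplied as a length-$d$ array and a negative-definite $d \times d$ matrix (all names below are illustrative):

```python
# One sweep of the fixed-point updates (S.7) in (mu, Sigma) form.
import numpy as np
from scipy.special import expit

def s7_sweep(y, A, mu, Sigma, omega, eta1, eta2mat):
    """eta1 and eta2mat hold (eta)_1 and vec^{-1}{(eta)_2}; eta2mat must be
    negative definite for the updated Sigma to be a proper covariance matrix."""
    diag_ASA = np.sum((A @ Sigma) * A, axis=1)      # diagonal(A Sigma A^T)
    omega0 = A @ mu + 0.5 * (1.0 - 2.0 * omega) * diag_ASA
    omega = expit(omega0)
    omega2 = 0.5 / (1.0 + np.cosh(omega0))
    Sigma = np.linalg.inv(A.T @ (omega2[:, None] * A) - 2.0 * eta2mat)
    mu = mu + Sigma @ (A.T @ (y - omega) + eta1 + 2.0 * eta2mat @ mu)
    return mu, Sigma, omega
```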
Using (S.3), the $\omega_0$ update is equivalent to
$$\begin{array}{rcl}
\mu &\longleftarrow& -\tfrac12\, A \left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1} (\eta_{p(y|\theta)\leftrightarrow\theta})_1 \\[4pt]
\sigma^2 &\longleftarrow& -\tfrac12\, \mathrm{diagonal}\left( A \left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1} A^T \right) \\[4pt]
\omega_0 &\longleftarrow& \mu + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \sigma^2.
\end{array}$$
The $\Sigma_{q(\theta)}$ update can be written as
$$-2\,\mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \longleftarrow A^T \mathrm{diag}(\omega_2)\,A - 2\,\mathrm{vec}^{-1}\{(\eta)_2\},$$
which is equivalent to
$$(\eta_{p(y|\theta)\to\theta})_2 + (\eta)_2 \longleftarrow -\tfrac12\,\mathrm{vec}\left( A^T \mathrm{diag}(\omega_2)\,A \right) + (\eta)_2,$$
which, in turn, is equivalent to the second component of $\eta_{p(y|\theta)\to\theta}$ being updated according to
$$(\eta_{p(y|\theta)\to\theta})_2 \longleftarrow -\tfrac12\,\mathrm{vec}\left( A^T \mathrm{diag}(\omega_2)\,A \right). \tag{S.8}$$
For the update of the first component of $\eta_{p(y|\theta)\to\theta}$, we note that the last update of (S.7) is equivalent to
$$\Sigma_{q(\theta)}^{-1}\mu_{q(\theta)} \longleftarrow \Sigma_{q(\theta)}^{-1}\mu_{q(\theta)} + A^T(y - \omega) + (\eta)_1 + 2\,\mathrm{vec}^{-1}\{(\eta)_2\}\,\mu_{q(\theta)}, \tag{S.9}$$
where, on the right-hand side,
$$\Sigma_{q(\theta)}^{-1} = A^T \mathrm{diag}(\omega_2)\,A - 2\,\mathrm{vec}^{-1}\{(\eta)_2\} \tag{S.10}$$
according to its updated value, and
$$\mu_{q(\theta)} = -\tfrac12 \left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1} (\eta_{p(y|\theta)\leftrightarrow\theta})_1 \tag{S.11}$$
is in terms of the natural parameters from the previous iteration, before (S.8) has taken place. Substituting (S.10) and (S.11) into (S.9), we get
$$(\eta_{p(y|\theta)\to\theta})_1 + (\eta)_1 \longleftarrow
\left\{ -\tfrac12\, A^T \mathrm{diag}(\omega_2)\,A + \mathrm{vec}^{-1}\{(\eta)_2\} \right\}
\left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1} (\eta_{p(y|\theta)\leftrightarrow\theta})_1$$
$$+\, A^T(y - \omega) + (\eta)_1
- \mathrm{vec}^{-1}\{(\eta)_2\} \left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1} (\eta_{p(y|\theta)\leftrightarrow\theta})_1,$$
which is equivalent to
$$(\eta_{p(y|\theta)\to\theta})_1 \longleftarrow -\tfrac12\, A^T \mathrm{diag}(\omega_2)\,A \left[ \mathrm{vec}^{-1}\!\left\{ (\eta_{p(y|\theta)\leftrightarrow\theta})_2 \right\} \right]^{-1} (\eta_{p(y|\theta)\leftrightarrow\theta})_1 + A^T(y - \omega)
= A^T(\omega_2 \odot \mu) + A^T(y - \omega) = A^T(y - \omega + \omega_2 \odot \mu).$$
Combining this update with that given in (S.8), we get the following update for the full natural parameter vector:
$$\eta_{p(y|\theta)\to\theta} \longleftarrow \begin{bmatrix} A^T(y - \omega + \omega_2 \odot \mu) \\[4pt] -\tfrac12\,\mathrm{vec}\left( A^T \mathrm{diag}(\omega_2)\,A \right) \end{bmatrix},$$
which matches Algorithm 1.

S.3 Approximation of corr(β₀, β₁ | y)

Consider the Bayesian logistic regression model (8). Then, given the approximate noninformativity of the prior distribution of $\beta \equiv [\beta_0\ \beta_1]^T$, the posterior covariance matrix of $\beta$, $\mathrm{Cov}(\beta|y)$, is such that
$$\mathrm{Cov}(\beta|y) \approx \text{the inverse Fisher information matrix of } \beta
= \left[ X^T \mathrm{diag}\{ b''(X\beta) \}\, X \right]^{-1}
= \left( X^T \mathrm{diag}\left[ \frac{\mathbf{1}}{2\{\mathbf{1} + \cosh(X\beta)\}} \right] X \right)^{-1},
\quad\text{where}\quad X \equiv \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}.$$
Straightforward algebra then leads to the approximate posterior correlation between $\beta_0$ and $\beta_1$ being
$$\mathrm{corr}(\beta_0, \beta_1 | y) \approx
\frac{ -\,n^{-1}\sum_{i=1}^{n} x_i / \{1 + \cosh(\beta_0 + \beta_1 x_i)\} }
{ \sqrt{ \left[ n^{-1}\sum_{i=1}^{n} 1 / \{1 + \cosh(\beta_0 + \beta_1 x_i)\} \right] \left[ n^{-1}\sum_{i=1}^{n} x_i^2 / \{1 + \cosh(\beta_0 + \beta_1 x_i)\} \right] } }.$$
However, the $x_i$s are uniformly distributed on $(0, 1)$, so replacement of sample means by population means leads to the final approximation
$$\mathrm{corr}(\beta_0, \beta_1 | y) \approx
\frac{ -\int_0^1 x / \{1 + \cosh(\beta_0 + \beta_1 x)\}\,dx }
{ \sqrt{ \left[ \int_0^1 1 / \{1 + \cosh(\beta_0 + \beta_1 x)\}\,dx \right] \left[ \int_0^1 x^2 / \{1 + \cosh(\beta_0 + \beta_1 x)\}\,dx \right] } }.$$
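The three integrals in the final approximation have no convenient closed form but are inexpensive to evaluate by one-dimensional quadrature, as in the following sketch (SciPy assumed; the function name is illustrative):

```python
# Quadrature evaluation of the final approximation to corr(beta_0, beta_1 | y).
import numpy as np
from scipy.integrate import quad

def corr_approx(beta0, beta1):
    w = lambda x: 1.0 / (1.0 + np.cosh(beta0 + beta1 * x))
    I0 = quad(w, 0.0, 1.0)[0]                        # int 1/{1 + cosh(.)}
    I1 = quad(lambda x: x * w(x), 0.0, 1.0)[0]       # int x/{1 + cosh(.)}
    I2 = quad(lambda x: x ** 2 * w(x), 0.0, 1.0)[0]  # int x^2/{1 + cosh(.)}
    return -I1 / np.sqrt(I0 * I2)

print(corr_approx(0.0, 3.0))
```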
S.4 Approximate Marginal Log-Likelihood Expressions

The simulation study described in Section 4, concerning variational inference for the Bayesian logistic regression model (2), used approximate marginal log-likelihood expressions, appropriate for the particular approach, as a means of assessing convergence. Each of the expressions is given in this section. The last one uses the function
$$B(\mu, \sigma^2) \equiv \int_{-\infty}^{\infty} b(\mu + \sigma\,x)\,\phi(x)\,dx, \quad\text{where}\quad b(x) \equiv \log(1 + e^{x}),$$
as defined in Section 3.2. As with the $B_r$ notation given there, evaluations of $B(\mu, \sigma^2)$ when $\mu$ and $\sigma^2$ are equal-sized column vectors are defined in an element-wise fashion, as illustrated by
$$B\!\left( \begin{bmatrix} 9 \\ 1 \end{bmatrix}, \begin{bmatrix} 36 \\ 28 \end{bmatrix} \right) = \begin{bmatrix} B(9, 36) \\ B(1, 28) \end{bmatrix}.$$

S.4.1 Jaakkola-Jordan Updates

$$\log \underline{p}(y; q, \xi) = \tfrac12 \log\left| \Sigma_{q(\beta;\xi)} \right| + \tfrac12\, \mu_{q(\beta;\xi)}^T \Sigma_{q(\beta;\xi)}^{-1} \mu_{q(\beta;\xi)} - \tfrac12\, \mu_{\beta}^T \Sigma_{\beta}^{-1} \mu_{\beta}$$
$$+ \sum_{i=1}^{n} \left\{ \xi_i/2 - \log(1 + e^{\xi_i}) + (\xi_i/4)\tanh(\xi_i/2) \right\} - \tfrac12 \log\left| \Sigma_{\beta} \right|,$$
where
$$\Sigma_{q(\beta;\xi)} \equiv \left[ X^T \mathrm{diag}\left\{ \frac{\tanh(\xi/2)}{2\,\xi} \right\} X + \Sigma_{\beta}^{-1} \right]^{-1}
\quad\text{and}\quad
\mu_{q(\beta;\xi)} \equiv \Sigma_{q(\beta;\xi)} \left\{ X^T(y - \tfrac12\,\mathbf{1}) + \Sigma_{\beta}^{-1}\mu_{\beta} \right\},$$
and $\xi$ is the current value of the variational parameter vector that arises in the Jaakkola-Jordan device. See, for example, Section 5.1 of Wand (2017).

S.4.2 Saul-Jordan Updates

$$\log \underline{p}(y; q, \omega) = \tfrac12 \log\left| \Sigma_{q(\beta)} \right| - \tfrac12\, \mathrm{tr}\left[ \Sigma_{\beta}^{-1}\left\{ \Sigma_{q(\beta)} + (\mu_{q(\beta)} - \mu_{\beta})(\mu_{q(\beta)} - \mu_{\beta})^T \right\} \right]$$
$$+ y^T X \mu_{q(\beta)} - \tfrac12\,(\omega^2)^T\,\mathrm{diagonal}(X\,\Sigma_{q(\beta)} X^T)$$
$$- \mathbf{1}^T \log\left[ \mathbf{1} + \exp\left\{ X\mu_{q(\beta)} + \tfrac12\,(\mathbf{1} - 2\,\omega) \odot \mathrm{diagonal}(X\,\Sigma_{q(\beta)} X^T) \right\} \right] + \frac{d}{2} - \tfrac12 \log\left| \Sigma_{\beta} \right|,$$
where $\omega$ is the current value of the variational parameter vector that arises in the Saul-Jordan device.

S.4.3 Knowles-Minka-Wand Updates

$$\log \underline{p}(y; q) = \tfrac12 \log\left| \Sigma_{q(\beta)} \right| - \tfrac12\, \mathrm{tr}\left[ \Sigma_{\beta}^{-1}\left\{ \Sigma_{q(\beta)} + (\mu_{q(\beta)} - \mu_{\beta})(\mu_{q(\beta)} - \mu_{\beta})^T \right\} \right]$$
$$+ y^T X \mu_{q(\beta)} - \mathbf{1}^T B\!\left( X\mu_{q(\beta)},\, \mathrm{diagonal}(X\,\Sigma_{q(\beta)} X^T) \right) + \frac{d}{2} - \tfrac12 \log\left| \Sigma_{\beta} \right|.$$
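As an illustration of how such expressions are evaluated in practice, the following sketch computes the Jaakkola-Jordan expression of Section S.4.1 for given $y$, $X$, prior parameters $(\mu_\beta, \Sigma_\beta)$ and a strictly positive $\xi$ vector; the other two expressions are coded analogously. Variable names are illustrative.

```python
# Evaluation of the Jaakkola-Jordan expression of Section S.4.1.
import numpy as np

def jj_log_ml(y, X, mu_beta, Sigma_beta, xi):
    """Assumes xi > 0 elementwise; note tanh(xi/2)/(2 xi) -> 1/4 as xi -> 0."""
    lam = np.tanh(xi / 2.0) / (2.0 * xi)          # Jaakkola-Jordan weights
    Sb_inv = np.linalg.inv(Sigma_beta)
    Sigma_q = np.linalg.inv(X.T @ (lam[:, None] * X) + Sb_inv)
    rhs = X.T @ (y - 0.5) + Sb_inv @ mu_beta      # equals Sigma_q^{-1} mu_q
    mu_q = Sigma_q @ rhs
    ld_q = np.linalg.slogdet(Sigma_q)[1]
    ld_b = np.linalg.slogdet(Sigma_beta)[1]
    return (0.5 * ld_q + 0.5 * mu_q @ rhs
            - 0.5 * mu_beta @ Sb_inv @ mu_beta
            + np.sum(xi / 2.0 - np.log1p(np.exp(xi)) + (xi / 4.0) * np.tanh(xi / 2.0))
            - 0.5 * ld_b)
```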
Additional References

Rohde, D. and Wand, M.P. (2016). Semiparametric mean field variational Bayes: General principles and numerical issues. Journal of Machine Learning Research, 17(172), 1-47.

Wand, M.P. (2002). Vector differential calculus in statistics. The American Statistician, 56, 55-62.