Gray-box Inference for Structured Gaussian Process Models: Supplementary Material


Pietro Galliani, Amir Dezfouli, Edwin V. Bonilla and Novi Quadrianto

February 8, 2017

1 Proof of Theorem 2

Here we prove the result that we can estimate the expected log likelihood and its gradients using expectations over low-dimensional Gaussians, that is:

Theorem 2. For the structured GP model defined in our paper, the expected log likelihood over the given variational distribution and its gradients can be estimated using expectations over $T_n$-dimensional Gaussians and $V$-dimensional Gaussians, where $T_n$ is the length of each sequence and $V$ is the vocabulary size.

1.1 Estimation of $\mathcal{L}_{\text{ell}}$ in the full (non-sparse) model

For $\mathcal{L}_{\text{ell}}$ we have that:

$$\mathcal{L}_{\text{ell}} = \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(y_n \mid f_n) \right\rangle_{q(f^{\text{un}})\, q(f^{\text{bin}})} \qquad (1)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \int_{f^{\text{bin}}} \int_{f^{\text{un}}} q(f^{\text{un}})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, df^{\text{un}}\, df^{\text{bin}} \qquad (2)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \int_{f^{\text{bin}}} \int_{f^{\text{un}}_{n}} \int_{f^{\text{un}}_{\setminus n}} q(f^{\text{un}}_{\setminus n})\, q(f^{\text{un}}_{n})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, df^{\text{un}}_{\setminus n}\, df^{\text{un}}_{n}\, df^{\text{bin}} \qquad (3)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(y_n \mid f_n) \right\rangle_{q(f^{\text{un}}_{n})\, q(f^{\text{bin}})} \qquad (4)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \left\langle \log p(y_n \mid f_n) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}, \qquad (5)$$

where $q_{kn}(f^{\text{un}}_{n})$ is a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance $\Sigma_k^{(n)}$, each block of size $T_n \times T_n$. Therefore, we can estimate the above term by sampling from $T_n$-dimensional Gaussians independently. Furthermore, $q(f^{\text{bin}})$ is a $V$-dimensional Gaussian, which can also be sampled independently. In practice, we can assume that the covariance of $q(f^{\text{bin}})$ is diagonal, and we only sample from univariate Gaussians for the pairwise functions.
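The estimator in Equation (5) is straightforward to implement. The following NumPy sketch shows one way to compute the $n$-th term by sampling each $T_n$-dimensional block and the pairwise functions independently; it is illustrative only, and the callback `log_lik`, which evaluates $\log p(y_n \mid f_n)$ for a given sample, as well as all variable names, are our own assumptions rather than part of the paper's code.

```python
import numpy as np

def sample_block_diag_gaussian(block_means, block_covs, rng):
    """Draw one sample of f_n^un by sampling each T_n x T_n block
    (one block per unary latent function j) independently."""
    return np.concatenate(
        [rng.multivariate_normal(m, C) for m, C in zip(block_means, block_covs)])

def expected_log_lik_term(y_n, pi, unary_means, unary_covs, bin_mean, bin_var,
                          log_lik, num_samples=100, seed=0):
    """Monte Carlo estimate of the n-th term of Equation (5):
    sum_k pi_k E_{q_kn(f_un) q(f_bin)} [ log p(y_n | f_n) ].

    unary_means[k][j], unary_covs[k][j]: mean and covariance of the T_n-dim
    block of q_kn for mixture component k and unary function j.
    bin_mean, bin_var: mean and diagonal variance of q(f_bin).
    log_lik(y_n, f_un, f_bin): evaluates log p(y_n | f_n) for one sample."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for k, pi_k in enumerate(pi):
        acc = 0.0
        for _ in range(num_samples):
            f_un = sample_block_diag_gaussian(unary_means[k], unary_covs[k], rng)
            f_bin = bin_mean + np.sqrt(bin_var) * rng.standard_normal(bin_mean.shape)
            acc += log_lik(y_n, f_un, f_bin)
        total += pi_k * acc / num_samples
    return total
```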

1.2 Gradients

Taking the gradients of the $k$-th term for the $n$-th sequence in $\mathcal{L}_{\text{ell}}$:

$$\mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(y_n \mid f_n) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})} \qquad (6)$$

$$\nabla_{\lambda^{\text{un}}_{k}} \mathcal{L}_{\text{ell}}^{(k,n)} = \nabla_{\lambda^{\text{un}}_{k}} \int_{f^{\text{bin}}} \int_{f^{\text{un}}_{n}} q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, df^{\text{un}}_{n}\, df^{\text{bin}} \qquad (7)$$

$$= \int_{f^{\text{bin}}} \int_{f^{\text{un}}_{n}} q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n})\, df^{\text{un}}_{n}\, df^{\text{bin}} \qquad (8)$$

$$= \left\langle \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n}) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}, \qquad (9)$$

where we have used the fact that $\nabla_x f(x) = f(x)\, \nabla_x \log f(x)$ for any non-negative function $f(x)$. Similarly, the gradients with respect to the parameters of the distribution over binary functions can be estimated using:

$$\nabla_{\lambda^{\text{bin}}} \mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{bin}}} \log q(f^{\text{bin}}) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}. \qquad (10)$$

2 KL terms in the sparse model

The KL term ($\mathcal{L}_{\text{kl}}$) in the variational objective ($\mathcal{L}_{\text{elbo}}$) is composed of a KL divergence between the approximate posteriors and the priors over the inducing variables and pairwise functions:

$$\mathcal{L}_{\text{kl}} = \underbrace{-\operatorname{KL}\left(q(u)\,\|\,p(u)\right)}_{\mathcal{L}_{\text{kl}}^{\text{un}}} \;\underbrace{-\operatorname{KL}\left(q(f^{\text{bin}})\,\|\,p(f^{\text{bin}})\right)}_{\mathcal{L}_{\text{kl}}^{\text{bin}}}, \qquad (11)$$

where, as the approximate posterior and the prior over the pairwise functions are Gaussian, the KL over pairwise functions can be computed analytically:

$$\mathcal{L}_{\text{kl}}^{\text{bin}} = -\operatorname{KL}\left(q(f^{\text{bin}})\,\|\,p(f^{\text{bin}})\right) = -\operatorname{KL}\left(\mathcal{N}(f^{\text{bin}}; m^{\text{bin}}, S^{\text{bin}})\,\|\,\mathcal{N}(f^{\text{bin}}; 0, K^{\text{bin}})\right) \qquad (12)$$

$$= -\tfrac{1}{2}\left( \log |K^{\text{bin}}| - \log |S^{\text{bin}}| + (m^{\text{bin}})^T (K^{\text{bin}})^{-1} m^{\text{bin}} + \operatorname{tr}\left((K^{\text{bin}})^{-1} S^{\text{bin}}\right) - V \right). \qquad (13)$$

For the distributions over the unary functions we need to compute a KL divergence between a mixture of Gaussians and a Gaussian. For this we consider the decomposition of the KL divergence as follows:

$$\mathcal{L}_{\text{kl}}^{\text{un}} = -\operatorname{KL}\left(q(u)\,\|\,p(u)\right) = \underbrace{\mathbb{E}_q\left[-\log q(u)\right]}_{\mathcal{L}_{\text{ent}}} + \underbrace{\mathbb{E}_q\left[\log p(u)\right]}_{\mathcal{L}_{\text{cross}}}, \qquad (14)$$

where the entropy term ($\mathcal{L}_{\text{ent}}$) can be lower bounded using Jensen's inequality:

$$\mathcal{L}_{\text{ent}} \geq -\sum_{k=1}^{K} \pi_k \log \sum_{l=1}^{K} \pi_l\, \mathcal{N}(m_k; m_l, S_k + S_l) \triangleq \hat{\mathcal{L}}_{\text{ent}}, \qquad (15)$$

and the negative cross-entropy term ($\mathcal{L}_{\text{cross}}$) can be computed exactly:

$$\mathcal{L}_{\text{cross}} = -\frac{1}{2} \sum_{k=1}^{K} \pi_k \sum_{j=1}^{V} \left[ M \log 2\pi + \log |\kappa(Z_j, Z_j)| + m_{kj}^T \kappa(Z_j, Z_j)^{-1} m_{kj} + \operatorname{tr}\left(\kappa(Z_j, Z_j)^{-1} S_{kj}\right) \right]. \qquad (16)$$
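For concreteness, the two analytic quantities above can be computed as follows. This is a minimal NumPy/SciPy sketch of Equations (13) and (15), written for clarity rather than efficiency; the function and argument names are ours, not the paper's.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal

def neg_kl_gaussian_prior(m, S, K):
    """-KL( N(m, S) || N(0, K) ) as in Equation (13)."""
    d = m.shape[0]
    cK = cho_factor(K)
    logdet_K = 2.0 * np.sum(np.log(np.diag(cK[0])))
    logdet_S = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(S))))
    quad = m @ cho_solve(cK, m)          # m^T K^{-1} m
    trace = np.trace(cho_solve(cK, S))   # tr(K^{-1} S)
    return -0.5 * (logdet_K - logdet_S + quad + trace - d)

def entropy_lower_bound(pi, means, covs):
    """Jensen lower bound on the mixture entropy, Equation (15)."""
    K = len(pi)
    bound = 0.0
    for k in range(K):
        z_k = sum(pi[l] * multivariate_normal.pdf(means[k], mean=means[l],
                                                  cov=covs[k] + covs[l])
                  for l in range(K))
        bound -= pi[k] * np.log(z_k)
    return bound
```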

3 Proof of Theorem 3

To prove Theorem 3 we will express the expected log likelihood term in the same form as that given in Equation (5), showing that the resulting $q_{kn}(f^{\text{un}}_{n})$ is also a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance, having $V$ blocks, each of dimensions $T_n \times T_n$. We start by taking the given $\mathcal{L}_{\text{ell}}$, where the expectations are over the joint posterior $q(f, u \mid \lambda) = p(f^{\text{un}} \mid u)\, q(u)\, q(f^{\text{bin}})$:

$$\mathcal{L}_{\text{ell}} = \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(y_n \mid f_n) \right\rangle_{p(f^{\text{un}} \mid u)\, q(u)\, q(f^{\text{bin}})} \qquad (17)$$

$$= \int_{f} \log p(y \mid f)\, \underbrace{\int_{u} q(u)\, p(f^{\text{un}} \mid u)\, du}_{q(f^{\text{un}})}\; q(f^{\text{bin}})\, df, \qquad (18)$$

where our approximating distribution is:

$$q(f) = q(f^{\text{un}})\, q(f^{\text{bin}}), \qquad (19)$$

$$q(f^{\text{un}}) = \int_{u} q(u)\, p(f^{\text{un}} \mid u)\, du, \qquad (20)$$

which can be computed analytically:

$$q(f^{\text{un}}) = \sum_{k=1}^{K} \pi_k\, q_k(f^{\text{un}}) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{V} \mathcal{N}(f^{\text{un}}_{j}; b_{kj}, \Sigma_{kj}), \qquad (21)$$

where

$$b_{kj} = A_j m_{kj}, \qquad (22)$$

$$\Sigma_{kj} = \widetilde{K}_j + A_j S_{kj} A_j^T. \qquad (23)$$

We note in Equation (21) that $q_k(f^{\text{un}})$ has a block-diagonal structure, which implies that we have the same expression for $\mathcal{L}_{\text{ell}}$ as in Equation (5). Therefore, we obtain analogous estimates:

$$\mathcal{L}_{\text{ell}} = \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \left\langle \log p(y_n \mid f_n) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}. \qquad (24)$$

Here, as before, $q_{kn}(f^{\text{un}}_{n})$ is a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance $\Sigma_k^{(n)}$, each block of size $T_n \times T_n$. The main difference in this (sparse) case is that $b_k^{(n)}$ and $\Sigma_k^{(n)}$ are constrained by the expressions in Equations (22) and (23). Hence, the proof for the gradients follows the same derivation as in Section 1.2 above.
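The quantities in Equations (22) and (23) are the usual sparse-GP predictive equations applied per latent function and per mixture component. Below is a minimal NumPy sketch, assuming the kernel matrices $\kappa_j(X, X)$, $\kappa(X, Z_j)$ and $\kappa(Z_j, Z_j)$ have been precomputed; the jitter term is a standard numerical safeguard and is not part of the equations.

```python
import numpy as np

def sparse_marginal(Kxx, Kxz, Kzz, m_kj, S_kj, jitter=1e-6):
    """Marginal mean and covariance of q_k(f_j^un) from Equations (22)-(23):
    b = A m,   Sigma = Kxx - A Kzx + A S A^T,   with A = Kxz Kzz^{-1}."""
    M = Kzz.shape[0]
    L = np.linalg.cholesky(Kzz + jitter * np.eye(M))
    # A = Kxz Kzz^{-1}, via two solves with the Cholesky factor
    A = np.linalg.solve(L.T, np.linalg.solve(L, Kxz.T)).T
    b = A @ m_kj
    Sigma = Kxx - A @ Kxz.T + A @ S_kj @ A.T
    return b, Sigma
```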

4 Gradients of $\mathcal{L}_{\text{elbo}}$ for the sparse model

Here we give the gradients of the variational objective with respect to the parameters of the variational distributions over the inducing variables, the pairwise functions, and the hyper-parameters.

4.1 Inducing variables

KL term

As the structured likelihood does not affect the KL divergence term, the gradients corresponding to this term are similar to those in the non-structured case (Dezfouli and Bonilla, 2015). Let $K_{zz}$ be the block-diagonal covariance with $V$ blocks $\kappa(Z_j, Z_j)$, $j = 1, \ldots, V$. Additionally, let us assume the following definitions:

$$C_{kl} = S_k + S_l, \qquad (25)$$

$$N_{kl} = \mathcal{N}(m_k; m_l, C_{kl}), \qquad (26)$$

$$z_k = \sum_{l=1}^{K} \pi_l N_{kl}. \qquad (27)$$

The gradients of $\mathcal{L}_{\text{cross}}$ with respect to the posterior mean and posterior covariance for component $k$ are:

$$\nabla_{m_k} \mathcal{L}_{\text{cross}} = -\pi_k K_{zz}^{-1} m_k, \qquad (28)$$

$$\nabla_{S_k} \mathcal{L}_{\text{cross}} = -\tfrac{1}{2} \pi_k K_{zz}^{-1}, \qquad (29)$$

$$\nabla_{\pi_k} \mathcal{L}_{\text{cross}} = -\frac{1}{2} \sum_{j=1}^{V} \left[ M \log 2\pi + \log |\kappa(Z_j, Z_j)| + m_{kj}^T \kappa(Z_j, Z_j)^{-1} m_{kj} + \operatorname{tr}\left(\kappa(Z_j, Z_j)^{-1} S_{kj}\right) \right], \qquad (30)$$

where we note that we compute $K_{zz}^{-1}$ by inverting the corresponding blocks $\kappa(Z_j, Z_j)$ independently. The gradients of the entropy term with respect to the variational parameters are:

$$\nabla_{m_k} \hat{\mathcal{L}}_{\text{ent}} = \pi_k \sum_{l=1}^{K} \pi_l \left( \frac{N_{kl}}{z_k} + \frac{N_{kl}}{z_l} \right) C_{kl}^{-1} (m_k - m_l), \qquad (31)$$

$$\nabla_{S_k} \hat{\mathcal{L}}_{\text{ent}} = \frac{1}{2} \pi_k \sum_{l=1}^{K} \pi_l \left( \frac{N_{kl}}{z_k} + \frac{N_{kl}}{z_l} \right) \left[ C_{kl}^{-1} - C_{kl}^{-1} (m_k - m_l)(m_k - m_l)^T C_{kl}^{-1} \right], \qquad (32)$$

$$\nabla_{\pi_k} \hat{\mathcal{L}}_{\text{ent}} = -\log z_k - \sum_{l=1}^{K} \pi_l \frac{N_{kl}}{z_l}.$$
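The entropy-bound gradients above translate directly into code. The following NumPy sketch implements Equations (31) and (32) as written, favouring readability over efficiency; it is our illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def entropy_bound_grads(pi, means, covs):
    """Gradients of the entropy lower bound w.r.t. the component means and
    covariances, Equations (31)-(32)."""
    pi = np.asarray(pi)
    K = len(pi)
    N = np.zeros((K, K))
    Cinv = [[None] * K for _ in range(K)]
    for k in range(K):
        for l in range(K):
            C = covs[k] + covs[l]
            Cinv[k][l] = np.linalg.inv(C)
            N[k, l] = multivariate_normal.pdf(means[k], mean=means[l], cov=C)
    z = N @ pi                                    # z_k = sum_l pi_l N_kl
    grad_m, grad_S = [], []
    for k in range(K):
        gm = np.zeros_like(means[k])
        gS = np.zeros_like(covs[k])
        for l in range(K):
            w = pi[k] * pi[l] * (N[k, l] / z[k] + N[k, l] / z[l])
            diff = Cinv[k][l] @ (means[k] - means[l])
            gm += w * diff
            gS += 0.5 * w * (Cinv[k][l] - np.outer(diff, diff))
        grad_m.append(gm)
        grad_S.append(gS)
    return grad_m, grad_S
```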

Expected log likelihood term

Retaking the gradients in the full model in Equation (9), we have that:

$$\nabla_{\lambda^{\text{un}}_{k}} \mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n}) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}, \qquad (33)$$

where the variational parameters $\lambda^{\text{un}}_{k}$ are the posterior means and covariances ($\{m_{kj}\}$ and $\{S_{kj}\}$) of the inducing variables. As given in Equation (21), $q_k(f^{\text{un}})$ factorizes over the latent processes ($j = 1, \ldots, V$), and so do the marginals $q_{kn}(f^{\text{un}}_{n})$, hence:

$$\nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n}) = \nabla_{\lambda^{\text{un}}_{k}} \sum_{j=1}^{V} \log \mathcal{N}(f^{\text{un}}_{nj}; b_{kjn}, \Sigma_{kjn}), \qquad (34)$$

where each of the distributions in Equation (34) is a $T_n$-dimensional Gaussian. Let us assume the following definitions:

$$X_n: \text{all feature vectors corresponding to sequence } n, \qquad (35)$$

$$A_{jn} = \kappa(X_n, Z_j)\, \kappa(Z_j, Z_j)^{-1}, \qquad (36)$$

$$\widetilde{K}^{n}_{j} = \kappa_j(X_n, X_n) - A_{jn}\, \kappa(Z_j, X_n), \qquad (37)$$

therefore:

$$b_{kjn} = A_{jn} m_{kj}, \qquad (38)$$

$$\Sigma_{kjn} = \widetilde{K}^{n}_{j} + A_{jn} S_{kj} A_{jn}^T. \qquad (39)$$

Hence, the gradients of $\log q_{kn}(f^{\text{un}}_{n})$ with respect to the variational parameters of the unary posterior distributions over the inducing points are:

$$\nabla_{m_{kj}} \log q_{kn}(f^{\text{un}}_{n}) = A_{jn}^T \Sigma_{kjn}^{-1} \left( f^{\text{un}}_{nj} - b_{kjn} \right), \qquad (40)$$

$$\nabla_{S_{kj}} \log q_{kn}(f^{\text{un}}_{n}) = \tfrac{1}{2} A_{jn}^T \left[ \Sigma_{kjn}^{-1} (f^{\text{un}}_{nj} - b_{kjn})(f^{\text{un}}_{nj} - b_{kjn})^T \Sigma_{kjn}^{-1} - \Sigma_{kjn}^{-1} \right] A_{jn}. \qquad (41)$$

Therefore, the gradients of $\mathcal{L}_{\text{ell}}$ with respect to the parameters of the distributions over unary functions are:

$$\nabla_{m_{kj}} \mathcal{L}_{\text{ell}} = \frac{\pi_k}{S} \sum_{n=1}^{N_{\text{seq}}} \kappa(Z_j, Z_j)^{-1} \kappa(Z_j, X_n)\, \Sigma_{kjn}^{-1} \sum_{i=1}^{S} \left( f^{\text{un}}_{nij} - b_{kjn} \right) \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}), \qquad (42)$$

$$\nabla_{S_{kj}} \mathcal{L}_{\text{ell}} = \frac{\pi_k}{2S} \sum_{n=1}^{N_{\text{seq}}} A_{jn}^T \left\{ \sum_{i=1}^{S} \left[ \Sigma_{kjn}^{-1} (f^{\text{un}}_{nij} - b_{kjn})(f^{\text{un}}_{nij} - b_{kjn})^T \Sigma_{kjn}^{-1} - \Sigma_{kjn}^{-1} \right] \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}) \right\} A_{jn}, \qquad (43)$$

where $S$ is the number of samples.

4.2 Pairwise functions

The gradients of $\mathcal{L}_{\text{kl}}^{\text{bin}}$ with respect to the parameters of the posterior over pairwise functions are given by:

$$\nabla_{m^{\text{bin}}} \mathcal{L}_{\text{kl}}^{\text{bin}} = -(K^{\text{bin}})^{-1} m^{\text{bin}}, \qquad (44)$$

$$\nabla_{S^{\text{bin}}} \mathcal{L}_{\text{kl}}^{\text{bin}} = \tfrac{1}{2} \left( (S^{\text{bin}})^{-1} - (K^{\text{bin}})^{-1} \right). \qquad (45)$$

The gradients of $\mathcal{L}_{\text{ell}}$ with respect to the parameters of the posterior over pairwise functions are given by:

$$\nabla_{m^{\text{bin}}} \mathcal{L}_{\text{ell}} = \frac{1}{S} \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \sum_{i=1}^{S} (S^{\text{bin}})^{-1} \left( f^{\text{bin}}_{i} - m^{\text{bin}} \right) \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}), \qquad (46)$$

$$\nabla_{S^{\text{bin}}} \mathcal{L}_{\text{ell}} = \frac{1}{2S} \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \sum_{i=1}^{S} \left[ (S^{\text{bin}})^{-1} (f^{\text{bin}}_{i} - m^{\text{bin}})(f^{\text{bin}}_{i} - m^{\text{bin}})^T (S^{\text{bin}})^{-1} - (S^{\text{bin}})^{-1} \right] \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}). \qquad (47)$$

5 Piecewise pseudolikelihood

Piecewise pseudolikelihood (Sutton and McCallum, 2007) approximates the likelihood $p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i})$ of sequence $n$ given the latent functions $f^{\text{un}}_{ni}$ and $f^{\text{bin}}_{i}$ by computing the product, for every single factor and every variable occurring in it, of the conditional probability of the variable given its neighbours with respect to that factor.¹ In our linear model, this yields the following expression for the log pseudolikelihood:

$$\log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}) = \sum_{w=1}^{W_n} \log p\left((y_n)_w \mid f^{\text{un}}_{ni}\right) + \sum_{w=1}^{W_n - 1} \log p\left((y_n)_w \mid (y_n)_{w+1}, f^{\text{bin}}_{i}\right) + \sum_{w=1}^{W_n - 1} \log p\left((y_n)_{w+1} \mid (y_n)_w, f^{\text{bin}}_{i}\right),$$

where $W_n$ is the number of words in sentence $n$ and

$$p\left((y_n)_w \mid f^{\text{un}}_{ni}\right) \propto \exp\left( f^{\text{un}}_{ni,\, y_n(w)} \right); \qquad (48)$$

$$p\left((y_n)_w \mid (y_n)_{w+1}, f^{\text{bin}}_{i}\right) \propto \exp\left( f^{\text{bin}}_{i}\left((y_n)_w, (y_n)_{w+1}\right) \right); \qquad (49)$$

$$p\left((y_n)_{w+1} \mid (y_n)_w, f^{\text{bin}}_{i}\right) \propto \exp\left( f^{\text{bin}}_{i}\left((y_n)_w, (y_n)_{w+1}\right) \right). \qquad (50)$$

¹ Here $i$ represents the index of the specific samples $f^{\text{un}}_{ni}$ and $f^{\text{bin}}_{i}$ taken from our distributions $q_{kn}(f^{\text{un}}_{n})$ and $q(f^{\text{bin}})$.
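A small sketch of the log piecewise pseudolikelihood in Equations (48)-(50) follows, under the assumption that each conditional is normalised over the label of the conditioned variable; the array layout (one unary value per word-label pair, one pairwise value per label pair) is an illustrative choice of ours.

```python
import numpy as np
from scipy.special import logsumexp

def log_piecewise_pseudolikelihood(y, f_un, f_bin):
    """Log piecewise pseudolikelihood of a label sequence y, Equations (48)-(50).

    y:     integer labels, length W_n.
    f_un:  (W_n, V) array; f_un[w, v] is the unary value for word w and label v.
    f_bin: (V, V) array; f_bin[v, u] is the pairwise value for the label pair (v, u)."""
    W = len(y)
    # Unary terms: log p(y_w | f_un) = f_un[w, y_w] - logsumexp_v f_un[w, v]
    ll = sum(f_un[w, y[w]] - logsumexp(f_un[w]) for w in range(W))
    for w in range(W - 1):
        s = f_bin[y[w], y[w + 1]]
        # p(y_w | y_{w+1}, f_bin): normalise over the left label
        ll += s - logsumexp(f_bin[:, y[w + 1]])
        # p(y_{w+1} | y_w, f_bin): normalise over the right label
        ll += s - logsumexp(f_bin[y[w], :])
    return ll
```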

6 Experiments

6.1 Experimental set-up

Before starting all experiments, our program selected the positions of the 500 inducing points by performing K-means clustering on the training data. The initial values of the means (one mean value for every inducing point $p$ and for every possible label $l$) are set as the fraction of the points of the training set in the cluster whose centroid is $p$ that belong to label $l$.

As described in Algorithm 1, we adaptively choose the learning rates for all parameters by searching (beginning from an initial guess, and doubling or dividing it by two as needed) for the biggest step size that causes the objective to decrease over twenty stochastic optimization steps. The step size to use for the given parameter is taken as one fourth of this value, to avoid instability. The initial step sizes were 0.5 for the unary mean parameters and 1.0 for the (linear) kernel hyperparameters, with their own initial guesses for the binary mean, unary covariance and binary covariance parameters (these values were selected only to speed up, insofar as possible, the process of parameter search); the optimal step sizes were found in the order: unary mean, unary covariance, binary mean, binary covariance, kernel hyperparameter. Of course this is not an exhaustive grid search, but it is less computationally expensive and works well in practice.
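The initialisation described above can be reproduced with standard tools. Here is a short sketch using scikit-learn's KMeans as a stand-in for the clustering step; the helper name and array conventions are ours, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_inducing_points(X_train, y_train, num_labels, num_inducing=500, seed=0):
    """Place the inducing inputs at the K-means centroids of the training
    features, and set the initial mean for inducing point p and label l to the
    fraction of points in p's cluster that carry label l (Section 6.1)."""
    km = KMeans(n_clusters=num_inducing, n_init=10, random_state=seed).fit(X_train)
    Z = km.cluster_centers_                      # inducing inputs
    m0 = np.zeros((num_inducing, num_labels))    # initial unary mean parameters
    for p in range(num_inducing):
        cluster_labels = y_train[km.labels_ == p]
        if cluster_labels.size:
            m0[p] = np.bincount(cluster_labels, minlength=num_labels) / cluster_labels.size
    return Z, m0
```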
6.2 Optimization

We then optimize the three sets of parameters (unaries, binaries, hyperparameters) in a global loop, in the same order as mentioned earlier, through standard Stochastic Gradient Descent, until the time limit is reached, using 4000 new random samples each step for the estimation of the relevant gradients and averages.² For the small-scale experiments, the variational parameters for the unary nodes are optimized for 500 iterations, the variational parameters for the pairwise nodes are optimized for 100 iterations, and the hyper-parameters are updated for 20 iterations.

We keep a tight bound (10 in the unary case, 20 in the binary one) on the maximum possible absolute values that the covariance parameters can take. If an update would bring their value beyond it, we recompute the gradients with new samples: oftentimes this resolves the issue (in brief, because the faulty update was due to a bad estimation of the gradients). If the problem persists, we disregard the current sentence and move on to the next one; and if after ten attempts the problem still persists, we move back to the latest safe position that did not cause out-of-bound errors. In this way, we can use a relatively small number of samples and maintain rather aggressive step sizes while recovering neatly from out-of-bounds errors.

The setting for the large-scale experiments was similar, except that the optimization schedule was (1500, 500, 500) (that is, 1500 unary optimization steps, 500 binary ones and 500 hyperparameter ones) for the big base np and chunking experiments. All the experiments were run for four hours, not counting initial clustering or final prediction (but counting step size selection). For comparison, the (small-scale) gp-ess experiments, which were run for 50,000 elliptical slice sampling steps, took on average 1.19 hours for base np and 2.07 hours for segmentation, with the chunking and japanese ne runs likewise measured in hours.

6.3 Performance profiles

Figure 1 shows the performance of our algorithm as a function of time. We see that the test likelihood decreases very regularly in all the folds, and so, overall, does the error rate, albeit with more variability. The bulk of the optimization, both with respect to the test likelihood and with respect to the error rate, occurs during the first 10 minutes. This suggests that the kind of approach described in this paper might be particularly suited to cases in which the expected log likelihood of the prediction and speed of convergence are priorities.

² When computing the gradients of the kernel hyperparameters, 8000 samples were used instead, to ensure greater stability.

Algorithm 1 Method for computing the step size

function check_stepsize(parameters, step_size, num_to_check = 20)
    to_check <- num_to_check sentences drawn at random from the training set
    old_obj <- current value of the objective function
    for i = 0, ..., num_to_check do
        grad <- gradient of the objective with respect to parameters
        parameters <- parameters - step_size * grad
        if parameters are out of bounds then
            return False
    new_obj <- current value of the objective function
    if new_obj < old_obj then
        return True
    else
        return False

function search_stepsize_up(parameters, initial_step_size, factor)
    step_size <- initial_step_size
    old_params <- current parameter values of the model
    while True do
        is_good <- check_stepsize(parameters, step_size)
        if is_good = False then
            parameters <- old_params
            return step_size / factor
        step_size <- step_size * factor

function search_stepsize_down(parameters, initial_step_size, factor)
    step_size <- initial_step_size
    old_params <- current parameter values of the model
    while True do
        is_good <- check_stepsize(parameters, step_size)
        if is_good = True then
            parameters <- old_params
            return step_size * factor
        step_size <- step_size / factor

function choose_stepsize(parameters, initial_step_size, factor = 2)
    step_size <- search_stepsize_up(parameters, initial_step_size, factor)
    if step_size = initial_step_size / factor then
        step_size <- search_stepsize_down(parameters, initial_step_size, factor)
    return step_size / factor
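For reference, the same search can be written compactly in Python. This sketch folds the parameter bookkeeping into a single `check(step_size)` callback that runs a few stochastic steps, reports whether the objective decreased, and restores the parameters afterwards; the `max_iters` guard is an addition of ours and not part of Algorithm 1.

```python
def choose_stepsize(check, initial_step_size, factor=2.0, max_iters=30):
    """Compact sketch of the step-size search in Algorithm 1.

    check(step_size) runs a few stochastic optimization steps with the given
    step size, returns True iff the objective decreased, and restores the
    parameters afterwards.  Returns a conservative step size: the largest
    value that passed the check, divided by `factor`."""
    step = initial_step_size
    if check(step):
        # Search upwards: keep doubling while the objective still decreases.
        for _ in range(max_iters):
            if not check(step * factor):
                break
            step *= factor
    else:
        # Search downwards: keep halving until the objective decreases.
        for _ in range(max_iters):
            step /= factor
            if check(step):
                break
    return step / factor
```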

Figure 1: The test performance of gp-var-t on chunking for the large-scale experiment as a function of time.

References

Amir Dezfouli and Edwin V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In NIPS, 2015.

Charles Sutton and Andrew McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. In ICML, 2007.
