Boosting Variational Inference


Fangjian Guo, MIT
Xiangyu Wang, Duke University
Kai Fan, Duke University
Tamara Broderick, MIT
David Dunson, Duke University

Abstract

Modern Bayesian inference typically requires some form of posterior approximation, and mean-field variational inference (MFVI) is an increasingly popular choice due to its speed. But MFVI can be inaccurate in various aspects, including an inability to capture multimodality in the posterior and underestimation of the posterior covariance. These issues arise since MFVI considers approximations to the posterior only in a family of factorized distributions. We instead consider a much more flexible approximating family consisting of all possible finite mixtures of a parametric base distribution (e.g., Gaussian). In order to efficiently find a high-quality posterior approximation within this family, we borrow ideas from gradient boosting and propose boosting variational inference (BVI). BVI iteratively improves the current approximation by mixing it with a new component from the base distribution family. We develop practical algorithms for BVI and demonstrate their performance on both real and simulated data.

1 Introduction

Bayesian inference offers a flexible framework for learning with rich, hierarchical models of data and for coherently quantifying uncertainty in unknown parameters through the posterior distribution. However, for any moderately complex model, the posterior is intractable to calculate exactly and must be approximated. Mean-field variational inference (MFVI) has grown in popularity as a method for approximating the posterior since it is often fast even for large data sets. MFVI is fast in part because it formulates posterior approximation as an optimization problem, and it leads to an efficient coordinate-ascent algorithm when there is certain structure within the model, known as conditional conjugacy [3]. Such conjugacy properties typically hold only for factorized approximations, which effectively cannot capture multimodality and which underestimate the posterior covariance, sometimes drastically [1, 28, 27, 25, 17]. The linear response technique [13] and the full-rank approach within [15] provide a correction to the covariance underestimation of MFVI in the unimodal case but do not address the multimodality issue. Black-box inference, as in [23] and the mean-field approach within [15], focuses on making the MFVI optimization problem easier for practitioners by avoiding tedious calculations, but it does not change the optimization objective of MFVI and therefore still faces the problems outlined here.

An alternative and more flexible class of approximating distributions for variational inference (VI) is the family of mixture models. Indeed, even if we consider only Gaussian base distributions, one can find a mixture of Gaussians that is arbitrarily close to any continuous probability density [7, 20]. [2, 14, 12] have previously considered using approximating families with a fixed number of mixture components; these authors also employ a further approximation to the VI optimization objective. The resulting optimization algorithms have clear practical limitations, which limit the ability to find a good approximation in the mixture model family. In particular, there is large sensitivity to initial values; for good performance, algorithms may need to be rerun for many different initializations and component numbers; and the approximation to the VI objective can limit flexibility.

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

As a much more effective algorithm, which automatically adapts the number of mixture components to the complexity of the posterior and increases approximation accuracy, we propose a novel approach inspired by boosting. We call our method boosting variational inference (BVI). BVI starts with a single-component approximation and proceeds to add a new mixture component at each step. Independently of our work, we note that this idea is also considered in a concurrent paper [18].

2 Variational inference and Gaussian mixtures

Suppose we observe $N$ data points, collected in the matrix $X$ with $N$ rows. A Bayesian model is specified through a prior $\pi(\theta)$ and likelihood $p(X \mid \theta)$, yielding the posterior $p(\theta \mid X) \propto \pi(\theta) p(X \mid \theta) =: f(\theta)$ by Bayes' theorem, with $\theta \in \mathbb{R}^D$. While $f$ is easy to evaluate, the normalizing constant $p(X)$ involves an intractable integral, so an approximation is needed. The posterior $p(\theta \mid X)$ is rarely used directly; rather, we would often like to report a mean, covariance, or other posterior functional $\mathbb{E}\, g(\theta) = \int p(\theta \mid X)\, g(\theta)\, d\theta$ for some function $g$. E.g., $g(\theta) = \theta$ yields the posterior mean.

One approach to approximating the posterior is as follows. Choose a discrepancy $D$ between a distribution $q(\theta)$ and the exact posterior $p_X(\theta) := p(\theta \mid X)$. We assume $D$ is non-negative and zero only when $q = p_X$. In general, though, the optimum $D^* := \inf_{q \in \mathcal{H}} D(q, p_X)$ over some constrained family of distributions $\mathcal{H}$ may be strictly greater than zero. Roughly, we expect a larger $\mathcal{H}$ to yield a lower $D^*$ but at a higher computational cost. When $D(q, p_X) = D_{KL}(q \,\|\, p_X)$, the Kullback-Leibler (KL) divergence between $q$ and $p_X$, this optimization problem is called variational inference.

A particularly flexible choice for $\mathcal{H}$ is the family of all finite mixtures. More precisely, let $h_\phi(\theta)$ be some parametric distribution over $\theta$ with parameter $\phi \in \Phi$. E.g., for Gaussian mixtures, $h_\phi(\theta) = N_{\mu,\Sigma}(\theta)$ with mean vector $\mu$ and positive semidefinite covariance $\Sigma$. Let $\Delta_k$ denote the $(k-1)$-dimensional simplex: $\Delta_k = \{w \in \mathbb{R}^k : \sum_{j=1}^k w_j = 1 \text{ and } \forall j,\ w_j \ge 0\}$. Then $\mathcal{H}_k$ is the set of all $k$-component mixtures over these base distributions, and $\mathcal{H}$ is the set of all finite mixtures, namely

$\mathcal{H}_k = \big\{ h : h(\theta) = \sum_{j=1}^{k} w_j h_{\phi_j}(\theta),\ w \in \Delta_k,\ \phi \in \Phi^k \big\}, \qquad \mathcal{H} = \bigcup_{k=1}^{\infty} \mathcal{H}_k. \quad (1)$

Our main contribution is to propose a novel algorithm for approximately solving this discrepancy minimization problem.

3 Boosting and Gradient Boosting

In general, reaching $D^*$ may require an infinite mixture. We consider a greedy, incremental procedure, as in [29], to approach $D^*$ with a sequence of finite mixtures $q_1, q_2, \dots$, with each $q_t \in \mathcal{H}_t$. The quality of the approximation can be measured with the excess discrepancy $\Delta D(q_t) := D(q_t, p_X) - D^* \ge 0$, and we would like $\Delta D(q_t) \to 0$. Hence, given any $\epsilon > 0$, we can find a large enough $t$ such that $\Delta D(q_t) \le \epsilon$.

In particular, we start with a single base distribution $q_1 = h_{\phi_1}$ for some $\phi_1$. Iteratively, at each step $t = 2, 3, \dots$, let $q_{t-1}$ be the approximation from the previous step. Form $q_t$ by mixing a new base distribution $h_t$ with weight $\alpha_t \in [0, 1]$ together with $q_{t-1}$ with weight $(1 - \alpha_t)$. This approach is called greedy [22, Ch. 4] if we choose an (approximately) optimal base distribution $h_t$ and weight $\alpha_t$ at each step $t$:

$q_t = (1 - \alpha_t) q_{t-1} + \alpha_t h_t, \qquad D(q_t, p_X) \le \inf_{\phi \in \Phi,\ 0 \le \alpha \le 1} D\big((1-\alpha) q_{t-1} + \alpha h_\phi,\ p_X\big) + \epsilon_t, \quad (2)$

where we relax optimality to within some non-negative sequence $\epsilon_t \to 0$. At each step, $q_t$ remains normalized by construction and takes the form of a mixture of base distributions. The iterative updates are in the style of boosting or greedy error minimization [9, 8, 10, 16] (see the sketch below).
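To make the greedy construction in Eq. (2) concrete, the following is a minimal Python sketch (our own illustration; the class and method names are hypothetical and not from the paper) of storing a Gaussian mixture approximation and folding in one new component per step:

```python
import numpy as np
from scipy.stats import multivariate_normal

class MixtureApprox:
    """Finite mixture q_t(theta) = sum_j w_j N(theta | mu_j, Sigma_j),
    grown greedily via q_t = (1 - alpha_t) q_{t-1} + alpha_t h_t."""

    def __init__(self, mu, Sigma):
        # Start from a single base component q_1 = N(mu_1, Sigma_1).
        self.components = [multivariate_normal(mu, Sigma)]
        self.weights = np.array([1.0])

    def pdf(self, theta):
        # Evaluate the mixture density at theta.
        return sum(w * c.pdf(theta)
                   for w, c in zip(self.weights, self.components))

    def add_component(self, mu, Sigma, alpha):
        # Mix in the new base distribution with weight alpha; scaling the
        # existing weights by (1 - alpha) keeps the mixture normalized
        # by construction, as in Eq. (2).
        self.weights = np.append((1 - alpha) * self.weights, alpha)
        self.components.append(multivariate_normal(mu, Sigma))
```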
Under convexity and strong smoothness conditions on $D(\cdot, p_X)$, Theorem II.1 of [29] guarantees that $\Delta D(q_t)$ converges to zero at rate $O(1/t)$. We will verify that the KL divergence satisfies these conditions in Theorem 1.

Gradient Boosting Let $D(q) := D(q, p_X)$ as a shorthand notation. Rather than jointly optimizing $D((1-\alpha_t) q_{t-1} + \alpha_t h_t)$ over $(\alpha_t, h_t)$, which may be non-convex and difficult in general, we consider nearly optimal choices (cf. Eq. (2)). We choose $h_t$ first, in a gradient descent style, and then optimize the corresponding weight $\alpha_t$. For $h_t$, we follow gradient boosting [11] and consider the functional gradient $\nabla D(q)$ at the current solution $q = q_{t-1}$. In what follows, we adopt the notation $\langle g, h \rangle = \int g(\theta) h(\theta)\, d\theta$ and $\|h\|_2^2 = \int h(\theta)^2\, d\theta$. When $\|h\|_2 \to 0$, a Taylor expansion yields

$D(q + h) = D(q) + \langle g, h \rangle + o(\|h\|_2^2), \quad (3)$

where $g$ is the functional gradient $\nabla D(q) =: g(\theta)$. We choose $h$ to minimize the inner product in Eq. (3); i.e., as in gradient descent, we choose $h$ to match the direction of $-\nabla D(q)$.

4 Boosting Variational Inference

Boosting variational inference (BVI) applies the framework of the previous section with the Kullback-Leibler (KL) divergence as the discrepancy measure. We first justify the choice of KL, and then present a two-stage multivariate Gaussian mixture boosting algorithm (Algorithm 1) to stochastically decrease the excess discrepancy. In each iteration, it first identifies $h_t$ with gradient boosting, and then solves for $\alpha_t$. The KL discrepancy measure is defined as

$D(q, p_X) = D_{KL}(q \,\|\, p_X) = \int q(\theta) \log \frac{q(\theta)}{p_X(\theta)}\, d\theta = \log p(X) + \int q(\theta) \log \frac{q(\theta)}{f(\theta)}\, d\theta. \quad (4)$

By dropping the constant $\log p(X)$, an effective discrepancy (the negative of the ELBO [3]) can be defined as $\tilde{D}_{KL}(q) = \int q(\theta) \log \frac{q(\theta)}{f(\theta)}\, d\theta$. The following theorem shows that KL satisfies the greedy boosting conditions [29, 22] under some additional assumptions (that may, e.g., hold on a bounded set). See Appendix A for the proof and discussion. This trivially implies that the conditions also hold for $\tilde{D}_{KL}(q)$. See Appendix C for a similar analysis of the KL divergence in the other direction, $D_{KL}(p_X \| q)$.

Theorem 1. Given densities $q_1, q_2$ and true density $p$, the KL divergence is a convex functional, i.e., for any $\alpha \in [0, 1]$,

$D_{KL}((1-\alpha) q_1 + \alpha q_2 \,\|\, p) \le (1-\alpha) D_{KL}(q_1 \,\|\, p) + \alpha D_{KL}(q_2 \,\|\, p). \quad (5)$

If we further assume that the densities are bounded below, $q_1(\theta), q_2(\theta) \ge a > 0$, and denote the functional gradient of KL at density $q$ as $\nabla D_{KL}(q) = \log q(\theta) - \log p(\theta)$, then the KL divergence is also strongly smooth, i.e.,

$D_{KL}(q_2) - D_{KL}(q_1) \le \langle \nabla D_{KL}(q_1),\ q_2 - q_1 \rangle + \frac{1}{a} \|q_2 - q_1\|_2^2. \quad (6)$

Setting $\alpha_t$ with Stochastic Newton For fixed $h_t$, $\tilde{D}_{KL}$ is a convex function of $\alpha_t$ (see Eq. (11) in Appendix B). We can estimate both its first and second derivatives with Monte Carlo, by drawing samples from $h_t$ and $q_{t-1}$. Then we can use a stochastic second-order method, e.g., Newton's method [5], to solve for $\alpha_t$. See Appendix B for details.

Setting $h_t$ with Laplacian Gradient Boosting Following the idea of Eq. (3) for gradient boosting, we take the Taylor expansion of $\tilde{D}_{KL}(q_t)$ around $\alpha_t \to 0$:

$\tilde{D}_{KL}(q_t) = \tilde{D}_{KL}(q_{t-1}) + \alpha_t \big\langle h_t,\ \log \tfrac{q_{t-1}}{f} \big\rangle - \alpha_t \big\langle q_{t-1},\ \log \tfrac{q_{t-1}}{f} \big\rangle + o(\alpha_t^2), \quad (7)$

which suggests minimizing $\langle h_t, \log \frac{q_{t-1}}{f} \rangle$, where $\log \frac{q_{t-1}}{f}$ is the functional gradient $\nabla \tilde{D}_{KL}(q_{t-1})$. However, direct minimization of the inner product is ill-posed, since $h_t$ would degenerate to a point mass at the minimum of the functional gradient. Instead, following the least-squares procedure of [11], we match the direction in terms of the $\ell_2$ norm: $\hat{h}_t = \arg\min_{h = h_\phi,\, \lambda > 0} \|\lambda h - \log(f/q_{t-1})\|_2^2$. Plugging in the optimal value of $\lambda$, after some algebra the objective is equivalent to

$\hat{h}_t = \arg\max_{h = h_\phi}\ \mathbb{E}_{\theta \sim h} \log\big(f(\theta)/q_{t-1}(\theta)\big) - \tfrac{1}{2} \log \|h\|_2^2, \quad (8)$

where $\log(f/q_{t-1})$ is effectively the residual log-likelihood. Ideally, when $q_{t-1} \propto f$, the residual should be flat; otherwise, it has peaks where the posterior density is underestimated and basins where it is overestimated. For the Gaussian family $h_\phi(\theta) = N_{\mu,\Sigma}(\theta)$, Eq. (8) becomes

$\hat{h}_{\mu_t, \Sigma_t} = \arg\max_{\mu, \Sigma}\ \mathbb{E}_{\theta \sim N_{\mu,\Sigma}} \log\big(f(\theta)/q_{t-1}(\theta)\big) + \tfrac{1}{4} \log |\Sigma|. \quad (9)$

We note that the log-determinant term prevents degeneration. We therefore propose the following heuristic (Algorithm 1) to efficiently optimize Eq. (9): we approximately decompose the residual $\log(f/q_{t-1})$ into a constant plus a quadratic peak $-\frac{1}{2}(\theta - \eta)^\top S^{-1} (\theta - \eta)$ (i.e.,
a Laplacian approximation to $f/q_{t-1}$), and then Eq. (9) becomes convex with closed-form solution $\mu = \eta$, $\Sigma = S/2$, where $\eta$ and $S^{-1}$ can be obtained numerically with any suitable optimization routine.
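As an illustration of this component-fitting step, here is a minimal Python sketch (our own, not the authors' code) that finds the peak of the residual with scipy.optimize.minimize and forms the closed-form component $\mu = \eta$, $\Sigma = S/2$; the callables log_f and log_q_prev and the starting point theta0 are assumed given:

```python
import numpy as np
from scipy.optimize import minimize

def fit_new_component(log_f, log_q_prev, theta0):
    """Fit the new Gaussian component via a Laplacian approximation of the
    residual log(f/q_{t-1}): locate its peak eta and curvature S^{-1}, then
    set mu = eta and Sigma = S/2 (the closed-form maximizer of Eq. (9))."""
    # The peak of the residual is the minimizer of log(q_{t-1}/f).
    neg_residual = lambda th: log_q_prev(th) - log_f(th)
    eta = minimize(neg_residual, theta0).x  # any suitable routine
    # S^{-1} is the Hessian of log(q_{t-1}/f) at the peak (numerical approx.).
    S_inv = numerical_hessian(neg_residual, eta)
    Sigma = np.linalg.inv(S_inv) / 2.0
    return eta, Sigma

def numerical_hessian(fn, x, eps=1e-4):
    """Central-difference Hessian of a scalar function fn at point x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (fn(x + e_i + e_j) - fn(x + e_i - e_j)
                       - fn(x - e_i + e_j) + fn(x - e_i - e_j)) / (4 * eps**2)
    return H
```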

Algorithm 1 Laplacian Gradient Boosting for Gaussian Base Family $N_{\mu,\Sigma}$
Require: evaluable product density of prior and likelihood $f(\theta)$ for $\theta \in \mathbb{R}^D$
  Start with some initial approximation $q_1 = N_{\mu_1, \Sigma_1}$, e.g., $q_1 = N_{0, cI}$
  for $t = 2$ to $T$ do
    $\hat{\mu}_t \leftarrow \arg\min_\theta \log(q_{t-1}(\theta)/f(\theta))$ with an optimization routine initialized at $\theta_0 \sim q_{t-1}$
    $H_t \leftarrow$ Hessian of $\log(q_{t-1}(\theta)/f(\theta))$ at $\theta = \hat{\mu}_t$, with numerical approximation
    $\hat{\Sigma}_t \leftarrow H_t^{-1}/2$, and let $\hat{h}_t(\theta) = N(\theta \mid \hat{\mu}_t, \hat{\Sigma}_t)$ be the new component
    $\hat{\alpha}_t \leftarrow \arg\min_{\alpha_t} \tilde{D}_{KL}((1-\alpha_t) q_{t-1} + \alpha_t \hat{h}_t)$ with Algorithm 2 (see Appendix B)
    $q_t \leftarrow (1-\hat{\alpha}_t) q_{t-1} + \hat{\alpha}_t \hat{h}_t$  (boosting step)
  end for
  return $q_t$

5 Experiments

In this section, we compare the performance of BVI to MFVI on both toy and real-world data.

Toy Examples Figure 1 highlights the ability of BVI (unlike MFVI) to capture (a) heavy tails, (b) multimodality, and (c) multivariate distributions. Figure 1(a) is a Cauchy density $p_X(\theta) \propto \frac{1}{1 + (\theta/2)^2}$; (b) is a mixture of univariate Gaussians with different locations and scales; (c) is a mixture of five 2D Gaussians with random locations and covariances. We initialize with a Gaussian with very large (co)variance, and then run BVI for 50 iterations. The sequences $(\alpha_t)_t$ and $(\hat{D}_{KL}(q_t))_t$ are shown in subplots. Since these distributions are non-conjugate, we run automatic differentiation variational inference (ADVI) in Stan [15, 6] to obtain results for MFVI (orange).

Figure 1: Toy examples: (a) (heavy-tailed) Cauchy distribution, (b) a mixture of four univariate Gaussians, and (c) a mixture of five bivariate Gaussians with random means and covariances. Curves/contours are colored blue (true), red (BVI), green (initial $q_1$ in BVI), and orange (ADVI). The sequence $(\alpha_t)_t$ and Monte Carlo estimates of $(\tilde{D}_{KL}(q_t))_t$ are plotted against iteration in subplots.

Figure 2: Estimated posterior mean and covariance for logistic regression (dashed: estimate = true). [Panels: mean, variance, and covariance estimates; ADVI (mean-field) and BVI estimates plotted against the true (MCMC) values.]

Bayesian Logistic Regression We apply our algorithm to Bayesian logistic regression for the Nodal dataset [4], consisting of $N = 53$ observations of six predictors $x_i$ (intercept included) and a binary response $y_i \in \{-1, +1\}$. The likelihood is $\prod_{i=1}^{N} g(y_i x_i^\top \beta)$, where $g(x) = (1 + e^{-x})^{-1}$, and we use the prior $\beta \sim N(0, I)$. For reference, we show results from the Pólya-Gamma sampler (an MCMC algorithm for logistic regression) using the R package BayesLogit [21] as the ground truth. We also compare the performance of BVI to MFVI (ADVI with Stan). As shown in Figure 2, while both methods capture the correct mean, BVI provides better estimates of the variance and, unlike MFVI, does not set the covariances to zero. We expect more dramatic differences in cases where MFVI yields biased estimates of the posterior means [26] and cases where the posterior is multimodal. We plan to investigate these cases in future work.
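For concreteness, here is a minimal sketch of the unnormalized log target $\log f(\beta)$ for this experiment, i.e., log prior plus log likelihood for Bayesian logistic regression (our own illustration, not the authors' code; the design matrix X and responses y in $\{-1, +1\}$ are assumed given):

```python
import numpy as np

def make_log_f(X, y):
    """Unnormalized log posterior log f(beta) = log prior + log likelihood
    for Bayesian logistic regression with prior beta ~ N(0, I) and
    likelihood prod_i g(y_i x_i^T beta), where g(x) = 1/(1 + exp(-x))."""
    def log_f(beta):
        log_prior = -0.5 * beta @ beta      # N(0, I) prior, up to a constant
        margins = y * (X @ beta)            # y_i in {-1, +1}
        # log g(m) = -log(1 + exp(-m)), computed stably with logaddexp
        log_lik = -np.logaddexp(0.0, -margins).sum()
        return log_prior + log_lik
    return log_f
```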

References

[1] C. M. Bishop. Pattern Recognition and Machine Learning, chapter 10. Springer, New York, 2006.
[2] C. M. Bishop, N. Lawrence, T. Jaakkola, and M. I. Jordan. Approximating posterior distributions in belief networks using mixtures. In Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference, volume 10, page 416, 1998.
[3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670, 2016.
[4] B. Brown. Prediction analyses for binary data. Biostatistics Casebook, pages 3-18, 1980.
[5] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008-1031, 2016.
[6] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software.
[7] V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications, 14(1):153-158, 1969.
[8] Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
[9] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37. Springer, 1995.
[10] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407, 2000.
[11] J. H. Friedman. Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5):1189-1232, 2001.
[12] S. Gershman, M. Hoffman, and D. Blei. Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning, New York, NY, USA, 2012.
[13] R. J. Giordano, T. Broderick, and M. I. Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. In Advances in Neural Information Processing Systems, 2015.
[14] T. Jaakkola and M. I. Jordan. Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models. Springer, 1998.
[15] A. Kucukelbir, R. Ranganath, A. Gelman, and D. Blei. Automatic variational inference in Stan. In Advances in Neural Information Processing Systems, 2015.
[16] J. Q. Li and A. R. Barron. Mixture density estimation. In Advances in Neural Information Processing Systems, 2000.
[17] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms, chapter 33. Cambridge University Press, 2003.
[18] A. C. Miller, N. Foti, and R. P. Adams. Variational boosting: Iteratively refining posterior approximations. arXiv preprint arXiv:1611.06585, 2016.
[19] T. Minka et al. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[20] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
[21] N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339-1349, 2013.
[22] A. Rakhlin. Applications of empirical processes in learning theory: algorithmic stability and generalization bounds. PhD thesis, Massachusetts Institute of Technology, 2006.
[23] R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, 2014.

[24] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
[25] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319-392, 2009.
[26] R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In Workshop on Inference and Estimation in Probabilistic Time-Series Models, volume 2, 2008.
[27] R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models. Cambridge University Press, 2011.
[28] B. Wang and M. Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In Workshop on Artificial Intelligence and Statistics, 2005.
[29] T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682-691, 2003.

Appendices

A Proof of Theorem 1

Proof. For any densities $q_1, q_2$ and $\alpha \in [0, 1]$, convexity follows from

$D_{KL}((1-\alpha) q_1 + \alpha q_2 \,\|\, p)$
$= (1-\alpha) \int q_1 \log \frac{(1-\alpha) q_1 + \alpha q_2}{p}\, d\theta + \alpha \int q_2 \log \frac{(1-\alpha) q_1 + \alpha q_2}{p}\, d\theta$
$= (1-\alpha) \int q_1 \log \frac{q_1 [1 + \alpha(q_2/q_1 - 1)]}{p}\, d\theta + \alpha \int q_2 \log \frac{q_2 [1 + (1-\alpha)(q_1/q_2 - 1)]}{p}\, d\theta$
$= (1-\alpha) D_{KL}(q_1 \,\|\, p) + \alpha D_{KL}(q_2 \,\|\, p) + (1-\alpha) \int q_1 \log[1 + \alpha(q_2/q_1 - 1)]\, d\theta + \alpha \int q_2 \log[1 + (1-\alpha)(q_1/q_2 - 1)]\, d\theta$
$\le (1-\alpha) D_{KL}(q_1 \,\|\, p) + \alpha D_{KL}(q_2 \,\|\, p) + (1-\alpha) \int \alpha (q_2 - q_1)\, d\theta + \alpha \int (1-\alpha)(q_1 - q_2)\, d\theta$
$= (1-\alpha) D_{KL}(q_1 \,\|\, p) + \alpha D_{KL}(q_2 \,\|\, p),$

where we have used $\log(1 + x) \le x$ and $\int (q_2 - q_1)\, d\theta = 0$.

Let $h = q_2 - q_1$; then $\int h(\theta)\, d\theta = \int (q_2 - q_1)\, d\theta = 0$. Again using the inequality $\log(1 + x) \le x$, strong smoothness follows from

$D_{KL}(q_1 + h \,\|\, p) - D_{KL}(q_1 \,\|\, p) = \langle q_1 + h,\ \log(q_1 + h) \rangle - \langle q_1,\ \log q_1 \rangle - \langle h,\ \log p \rangle$
$\le \langle q_1 + h,\ \log q_1 + h/q_1 \rangle - \langle q_1,\ \log q_1 \rangle - \langle h,\ \log p \rangle$
$= \langle h,\ \log q_1 \rangle + \langle h,\ h/q_1 \rangle - \langle h,\ \log p \rangle$
$= \langle \log q_1 - \log p,\ h \rangle + \langle h,\ h/q_1 \rangle$
$\le \langle \log q_1 - \log p,\ h \rangle + \frac{1}{a} \|h\|_2^2,$

where we used $\langle q_1, h/q_1 \rangle = \int h\, d\theta = 0$ in the third line, and in the last step the assumption that $q_1, q_2 \ge a > 0$.

Note that here we assume that the densities are lower-bounded by $a > 0$. While this may seem unrealistic for some densities, it is a technical requirement to ensure that the discrepancy is smooth enough for greedy boosting to apply. For densities on $\mathbb{R}^D$ that are not bounded away from zero, we suggest considering approximation within $(1-\epsilon)$ of the probability mass. In particular, consider a closed set $\Theta \subset \mathbb{R}^D$ such that $\int_\Theta p_X(\theta)\, d\theta = 1 - \epsilon$ for a small $\epsilon > 0$. Then we can use the smallest density value on $\Theta$ as a lower bound.

B Stochastic Newton's Algorithm for Solving Mixing Weights

Fixing $h_t$ and taking derivatives of the effective discrepancy with respect to $\alpha_t$, we have

$\frac{\partial \tilde{D}_{KL}(q_t)}{\partial \alpha_t} = \int (h_t - q_{t-1}) \log \frac{(1-\alpha_t) q_{t-1} + \alpha_t h_t}{f}\, d\theta = \mathbb{E}_{\theta \sim h_t} \gamma_{\alpha_t}(\theta) - \mathbb{E}_{\theta \sim q_{t-1}} \gamma_{\alpha_t}(\theta), \quad (10)$

$\frac{\partial^2 \tilde{D}_{KL}(q_t)}{\partial \alpha_t^2} = \int \frac{(h_t - q_{t-1})^2}{(1-\alpha_t) q_{t-1} + \alpha_t h_t}\, d\theta = \mathbb{E}_{\theta \sim h_t} \eta_{\alpha_t}(\theta) - \mathbb{E}_{\theta \sim q_{t-1}} \eta_{\alpha_t}(\theta) \ge 0, \quad (11)$

where $\gamma_{\alpha_t}$ and $\eta_{\alpha_t}$ are evaluable functions of $\theta$ and $\alpha_t$ given below in Algorithm 2. Because the first and second derivatives are estimated stochastically rather than exactly, to ensure convergence we use a decaying sequence of step sizes $b/k$ in Algorithm 2 satisfying the Robbins-Monro conditions [5, 24].
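A minimal Python sketch of this stochastic Newton update follows (the pseudocode appears as Algorithm 2 below). The densities q_prev, h_new, f and the samplers are assumed given as vectorized callables, and we additionally clip $\alpha_t$ to $[0, 1]$, which the pseudocode leaves implicit:

```python
import numpy as np

def solve_alpha(q_prev, h_new, f, sample_q, sample_h, n=1000, b=1.0,
                tol=1e-6, max_iter=200):
    """Stochastic Newton's method for the mixing weight alpha_t under the
    exclusive KL, using the Monte Carlo estimators of Eqs. (10)-(11) and
    Robbins-Monro step sizes b/k."""
    th_h = sample_h(n)   # draws from the new component h_t
    th_q = sample_q(n)   # draws from the current approximation q_{t-1}
    alpha, k = 0.0, 0
    for _ in range(max_iter):
        mix_h = (1 - alpha) * q_prev(th_h) + alpha * h_new(th_h)
        mix_q = (1 - alpha) * q_prev(th_q) + alpha * h_new(th_q)
        # gamma_alpha(theta) = log(((1-a) q_{t-1} + a h_t) / f); Eq. (10)
        d1 = np.mean(np.log(mix_h / f(th_h))) - np.mean(np.log(mix_q / f(th_q)))
        # eta_alpha(theta) = (h_t - q_{t-1}) / ((1-a) q_{t-1} + a h_t); Eq. (11)
        d2 = (np.mean((h_new(th_h) - q_prev(th_h)) / mix_h)
              - np.mean((h_new(th_q) - q_prev(th_q)) / mix_q))
        k += 1
        step = (b / k) * d1 / d2
        alpha = np.clip(alpha - step, 0.0, 1.0)  # keep alpha in [0, 1]
        if abs(step) < tol:
            break
    return alpha
```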

Algorithm 2 Stochastic Newton's
Require: current approximation $q_{t-1}(\theta)$, new component $h_t(\theta)$, product density of prior and likelihood $f(\theta)$, Monte Carlo sample size $n$, initial step size $b > 0$
  Initialize $k \leftarrow 0$, $\alpha_t \leftarrow 0$
  Independently draw $\{\theta_i^{(h)}\} \sim h_t$ and $\{\theta_i^{(q)}\} \sim q_{t-1}$ for $i = 1, \dots, n$
  while $\alpha_t$ not convergent do
    Compute $\hat{D}'_{KL} = \frac{1}{n} \sum_i \big(\gamma_{\alpha_t}(\theta_i^{(h)}) - \gamma_{\alpha_t}(\theta_i^{(q)})\big)$ with $\gamma_{\alpha_t}(\theta) = \log \frac{(1-\alpha_t) q_{t-1}(\theta) + \alpha_t h_t(\theta)}{f(\theta)}$
    Compute $\hat{D}''_{KL} = \frac{1}{n} \sum_i \big(\eta_{\alpha_t}(\theta_i^{(h)}) - \eta_{\alpha_t}(\theta_i^{(q)})\big)$ with $\eta_{\alpha_t}(\theta) = \frac{h_t(\theta) - q_{t-1}(\theta)}{(1-\alpha_t) q_{t-1}(\theta) + \alpha_t h_t(\theta)}$
    $k \leftarrow k + 1$, $\alpha_t \leftarrow \alpha_t - (b/k)\, \hat{D}'_{KL} / \hat{D}''_{KL}$
  end while

C BVI with KL Divergence in the Alternative Direction

The KL divergence is asymmetric and can be written in two directions [19]: $D_{KL}(q_t \,\|\, p_X)$ and $D_{KL}(p_X \,\|\, q_t)$. We have discussed the BVI algorithm in the main text based on the exclusive direction $D_{KL}(q_t \,\|\, p_X)$; in this appendix, we show that a similar algorithm can be derived for the alternative, inclusive direction

$D_{KL}(p_X \,\|\, q_t) = \int p_X(\theta) \log \frac{p_X(\theta)}{q_t(\theta)}\, d\theta = \int p_X(\theta) \log p_X(\theta)\, d\theta - c \int f(\theta) \log q_t(\theta)\, d\theta. \quad (12)$

Here $c > 0$ is the normalizing constant. Hence, by dropping constants that do not involve $q_t$, we can similarly define an effective discrepancy

$\tilde{D}_{kl}(q) = -\int f(\theta) \log q(\theta)\, d\theta, \quad (13)$

where we use the lower-case italic subscript $kl$ to distinguish it from the previous KL divergence. Fixing $p_X$, we will use the notation $D_{kl}(q) := D_{KL}(p_X \,\|\, q)$.

Before proceeding to the derivation of the algorithm, we first show that $D_{KL}(p_X \,\|\, q_t)$ also satisfies the regularity conditions required by greedy boosting.

Lemma 1. $D_{KL}(p \,\|\, q)$ is a convex functional of $q$, i.e., given densities $p, q_1, q_2$, it holds for all $\alpha \in [0, 1]$ that

$D_{KL}(p \,\|\, (1-\alpha) q_1 + \alpha q_2) \le (1-\alpha) D_{KL}(p \,\|\, q_1) + \alpha D_{KL}(p \,\|\, q_2). \quad (14)$

Proof. For any $\alpha \in [0, 1]$, by Jensen's inequality (concavity of the logarithm), $\log[(1-\alpha) q_1 + \alpha q_2] \ge (1-\alpha) \log q_1 + \alpha \log q_2$. It then directly follows that $D_{KL}(p \,\|\, (1-\alpha) q_1 + \alpha q_2) \le (1-\alpha) D_{KL}(p \,\|\, q_1) + \alpha D_{KL}(p \,\|\, q_2)$.

Another condition required by the boosting framework of [29] is strong smoothness. It has been shown that for $D_{KL}(p \,\|\, q)$ to satisfy strong smoothness, it suffices to require an upper bound on the log ratio of base-family densities [16] [22, Chapter 4]:

$a = \sup_{q_1, q_2 \in \{h_\phi : \phi \in \Phi\},\ \theta \in \mathbb{R}^D} \log \frac{q_1(\theta)}{q_2(\theta)} < +\infty. \quad (15)$

This condition might be translated into constraints on the parameter space $\Phi$.

Solving $\alpha_t$ Again, we first show that with $h_t$ fixed, optimizing $\alpha_t$ is a convex problem, which can be solved with a stochastic Newton's method. Since the integrating measure in $\tilde{D}_{kl}$ does not involve $\alpha_t$, we have

$\frac{\partial \tilde{D}_{kl}}{\partial \alpha} = -\int f(\theta) \frac{\partial \log q_t(\theta)}{\partial \alpha}\, d\theta = -\int f(\theta) \frac{h_t - q_{t-1}}{(1-\alpha) q_{t-1} + \alpha h_t}\, d\theta, \quad (16)$

$\frac{\partial^2 \tilde{D}_{kl}}{\partial \alpha^2} = -\int f(\theta) \frac{\partial^2 \log q_t(\theta)}{\partial \alpha^2}\, d\theta = \int f(\theta) \frac{(h_t - q_{t-1})^2}{[(1-\alpha) q_{t-1} + \alpha h_t]^2}\, d\theta \ge 0. \quad (17)$

Notice that since we cannot sample from the true posterior $p_X(\theta) \propto f(\theta)$, we cannot directly estimate these derivatives with Monte Carlo. However, importance sampling can be applied here, by rewriting the derivatives as

$\frac{\partial \tilde{D}_{kl}}{\partial \alpha} = -\mathbb{E}_{\theta \sim q_{t-1}} \left[ \frac{f(\theta)}{q_{t-1}(\theta)} \cdot \frac{h_t(\theta) - q_{t-1}(\theta)}{(1-\alpha) q_{t-1}(\theta) + \alpha h_t(\theta)} \right], \quad (18)$

$\frac{\partial^2 \tilde{D}_{kl}}{\partial \alpha^2} = \mathbb{E}_{\theta \sim q_{t-1}} \left[ \frac{f(\theta)}{q_{t-1}(\theta)} \cdot \frac{(h_t(\theta) - q_{t-1}(\theta))^2}{[(1-\alpha) q_{t-1}(\theta) + \alpha h_t(\theta)]^2} \right]. \quad (19)$

Hence we can draw particles $\theta_i \sim q_{t-1}(\theta)$ and reweight them with $w_i = f(\theta_i)/q_{t-1}(\theta_i)$, which yields the following algorithm.

Algorithm 3 Stochastic Newton's with $\tilde{D}_{kl}(q_t) = D_{KL}(p_X \,\|\, q_t)$
Require: current approximation $q_{t-1}(\theta)$, new component $h_t(\theta)$, product density of prior and likelihood $f(\theta)$, importance sample size $n$, initial step size $b > 0$
  Initialize $k \leftarrow 0$, $\alpha_t \leftarrow 0$
  Independently draw $\{\theta_i\} \sim q_{t-1}$ for $i = 1, \dots, n$
  $w_i \leftarrow f(\theta_i)/q_{t-1}(\theta_i)$ for $i = 1, \dots, n$
  while $\alpha_t$ not convergent do
    $s_i \leftarrow \frac{h_t(\theta_i) - q_{t-1}(\theta_i)}{(1-\alpha_t) q_{t-1}(\theta_i) + \alpha_t h_t(\theta_i)}$ for $i = 1, \dots, n$
    Compute $\hat{D}'_{kl} = -\frac{1}{n} \sum_i w_i s_i$
    Compute $\hat{D}''_{kl} = \frac{1}{n} \sum_i w_i s_i^2$
    $k \leftarrow k + 1$, $\alpha_t \leftarrow \alpha_t - (b/k)\, \hat{D}'_{kl} / \hat{D}''_{kl}$
  end while

Setting $h_t$ It turns out that with $\tilde{D}_{kl}$ we can derive the same gradient boosting procedure as in Algorithm 1. Again, taking the Taylor expansion of the effective discrepancy at $\alpha_t \to 0$, we have

$\tilde{D}_{kl}(q_t) = \tilde{D}_{kl}(q_{t-1}) - \alpha_t \langle f/q_{t-1},\ h_t \rangle + \alpha_t \int f(\theta)\, d\theta + o(\alpha_t^2). \quad (20)$

Similarly, to avoid degeneracy, we consider

$\min_{h = h_\phi,\ \lambda > 0} \|\lambda h - f/q_{t-1}\|_2^2, \quad (21)$

which is equivalent to

$\max_{h = h_\phi}\ \log \mathbb{E}_{\theta \sim h} \big(f(\theta)/q_{t-1}(\theta)\big) - \tfrac{1}{2} \log \|h\|_2^2. \quad (22)$

By Jensen's inequality, we observe that the previous objective in Eq. (8) is in fact a lower bound of this objective, namely

$\log \mathbb{E}_{\theta \sim h} \big(f(\theta)/q_{t-1}(\theta)\big) - \tfrac{1}{2} \log \|h\|_2^2 \ge \mathbb{E}_{\theta \sim h} \log\big(f(\theta)/q_{t-1}(\theta)\big) - \tfrac{1}{2} \log \|h\|_2^2. \quad (23)$

We further note that this bound is iteratively tightened: as $q_{t-1}(\theta)$ better approximates $p_X(\theta)$, the ratio $f(\theta)/q_{t-1}(\theta) \propto p_X(\theta)/q_{t-1}(\theta)$ converges to a constant. By maximizing the lower bound, we arrive at the same gradient boosting objective, and hence the same Laplacian gradient boosting subroutine for setting the new component $h_t$.

To summarize, with the KL divergence in the alternative direction $D_{KL}(p_X \,\|\, q_t)$ as the discrepancy measure, we derive an algorithm almost identical to Algorithm 1, except that the step for solving $\hat{\alpha}_t$ is performed with Algorithm 3.
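Analogously to the sketch for Algorithm 2, here is a minimal Python sketch of Algorithm 3 under the same assumptions (vectorized callable densities and sampler; $\alpha_t$ clipped to $[0, 1]$, which the pseudocode leaves implicit):

```python
import numpy as np

def solve_alpha_inclusive(q_prev, h_new, f, sample_q, n=1000, b=1.0,
                          tol=1e-6, max_iter=200):
    """Stochastic Newton's method for alpha_t under the inclusive KL
    (Algorithm 3): the derivatives in Eqs. (18)-(19) are estimated by
    importance sampling with particles from q_{t-1} and weights
    w_i = f(theta_i) / q_{t-1}(theta_i)."""
    theta = sample_q(n)               # particles from q_{t-1}
    w = f(theta) / q_prev(theta)      # importance weights
    alpha, k = 0.0, 0
    for _ in range(max_iter):
        mix = (1 - alpha) * q_prev(theta) + alpha * h_new(theta)
        s = (h_new(theta) - q_prev(theta)) / mix
        d1 = -np.mean(w * s)          # Eq. (18)
        d2 = np.mean(w * s**2)        # Eq. (19), nonnegative
        k += 1
        step = (b / k) * d1 / d2
        alpha = np.clip(alpha - step, 0.0, 1.0)  # keep alpha in [0, 1]
        if abs(step) < tol:
            break
    return alpha
```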
