Online Bounds for Bayesian Algorithms
Sham M. Kakade
Computer and Information Science Department
University of Pennsylvania

Andrew Y. Ng
Computer Science Department
Stanford University

Abstract

We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms' modeling assumptions are correct, and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst-case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.

1 Introduction

The last decade has seen significant progress in online learning algorithms that perform well even in adversarial settings (e.g., the expert algorithms of Cesa-Bianchi et al., 1997). In the online learning framework, one makes minimal assumptions on the data presented to the learner, and the goal is to obtain good relative performance on arbitrary sequences. In statistics, this philosophy has been espoused by Dawid (1984) in the prequential approach. We study the performance of Bayesian algorithms in this adversarial setting, in which the process generating the data is not restricted to come from the prior; data sequences may be arbitrary. Our motivation is similar to that given in the online learning literature and the MDL literature (see Grunwald, 2005): models are often chosen to balance realism with computational tractability, and the assumptions made by the Bayesian are often not truly believed to hold (e.g., i.i.d. assumptions).
Our goal is to study the performance of Bayesian algorithms in the worst case, where all modeling assumptions may be violated. We consider the widely used class of generalized linear models (focusing on Gaussian linear regression and logistic regression) and provide relative performance bounds, comparing to the best model in our model class when the cost function is the logloss. Though the regression problem has been studied in a competitive framework and, indeed, many ingenious algorithms have been devised for it (e.g., Foster, 1991; Vovk, 2001; Azoury and Warmuth, 2001), our goal here is to study how the more widely used, and often simpler, Bayesian algorithms fare. Our bounds for linear regression are comparable to the best bounds in the literature, though we use the logloss as opposed to the square loss. The competitive approach to regression started with Foster (1991), who provided competitive bounds for a variant of the ridge regression algorithm under the square loss. Vovk (2001) presents many competitive algorithms and provides bounds for linear regression under the square loss with an algorithm that differs slightly from the Bayesian one. Azoury and Warmuth (2001) rederive Vovk's bound with a different analysis based on Bregman distances. Our work differs from these in that we consider Bayesian Gaussian
linear regression, while previous work typically used more complex, cleverly devised algorithms which are either variants of a MAP procedure (as in Vovk, 2001) or involve other steps such as clipping predictions (as in Azoury and Warmuth, 2001). These distinctions are discussed in more detail in Section 3.1. We should also note that when the loss function is the logloss, multiplicative weights algorithms are sometimes identical to Bayes' rule with particular choices of the parameters (see Freund and Schapire, 1999). Furthermore, Bayesian algorithms have been used in some online learning settings, such as the sleeping experts setting of Freund et al. (1997) and the online boolean prediction setting of Cesa-Bianchi et al. (1998). Ng and Jordan (2001) also analyzed an online Bayesian algorithm, but assumed that the data generation process was not too different from the model prior. To our knowledge, there have been no studies of Bayesian generalized linear models in an adversarial online learning setting (though many variants have been considered, as discussed above). We also examine maximum a posteriori (MAP) algorithms for both Gaussian linear regression (i.e., ridge regression) and for regularized logistic regression. These algorithms are often used in practice, particularly in logistic regression, where Bayesian model averaging is computationally expensive but the MAP algorithm requires only solving a convex problem. As expected, MAP algorithms are somewhat less competitive than full Bayesian model averaging, though not unreasonably so.

2 Bayesian Model Averaging

We now consider the Bayesian model averaging (BMA) algorithm and give a bound on its worst-case online loss. We start with some preliminaries. Let x ∈ ℝⁿ denote the inputs of a learning problem and y ∈ ℝ the outputs. Consider a model from the generalized linear model family (see McCullagh and Nelder, 1989) that can be written

  p(y | x, θ) = p(y | θᵀx),

where θ ∈ ℝⁿ are the parameters of our model (θᵀ denotes the transpose of θ).
Note that the predicted distribution of y depends only on θᵀx, which is linear in θ. For example, in the case of Gaussian linear regression, we have

  p(y | x, θ) = (1/√(2πσ²)) exp(−(θᵀx − y)²/(2σ²)),    (1)

where σ² is a fixed, known constant that is not a parameter of our model. In logistic regression, we would have

  log p(y | x, θ) = y log [1/(1 + exp(−θᵀx))] + (1 − y) log [1/(1 + exp(θᵀx))],    (2)

where we assume y ∈ {0, 1}. Let S = {(x₁, y₁), (x₂, y₂), ..., (x_T, y_T)} be an arbitrary sequence of examples, possibly chosen by an adversary. We also use S_t to denote the subsequence consisting of only the first t examples. We assume throughout this paper that ‖x_t‖ ≤ 1, where ‖·‖ denotes the L₂ norm. Assume that we are going to use a Bayesian algorithm to make our online predictions. Specifically, assume that we have a Gaussian prior on the parameters:

  p₀(θ) = N(θ; 0, ν²Iₙ),

where Iₙ is the n-by-n identity matrix, N(θ; µ, Σ) is the Gaussian density with mean µ and covariance Σ, and ν² > 0 is some fixed constant governing the prior variance. Also, let

  p_t(θ) = p(θ | S_t) = [∏ᵢ₌₁ᵗ p(yᵢ | xᵢ, θ)] p₀(θ) / ∫ [∏ᵢ₌₁ᵗ p(yᵢ | xᵢ, θ')] p₀(θ') dθ'

be the posterior distribution over θ given the first t training examples; p₀(θ) is just the prior distribution.
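To make the two model families concrete, the following small sketch (illustrative only, not from the paper) implements the per-example logloss f_y(z) = −log p(y | θᵀx = z) for both models, and uses a finite difference to check the curvature facts used later: the Gaussian logloss has constant second derivative 1/σ², while the logistic logloss has second derivative at most 1/4.

```python
import math

def gaussian_logloss(z, y, sigma2=1.0):
    # -log p(y | theta^T x = z) under the Gaussian model N(y; z, sigma^2), Eq. (1)
    return 0.5 * math.log(2 * math.pi * sigma2) + (z - y) ** 2 / (2 * sigma2)

def logistic_logloss(z, y):
    # -log p(y | theta^T x = z) under the logistic model, Eq. (2), y in {0, 1}
    return math.log(1 + math.exp(z)) - y * z

def second_deriv(f, z, h=1e-4):
    # central finite-difference approximation to f''(z)
    return (f(z + h) - 2 * f(z) + f(z - h)) / h ** 2

# curvature of the Gaussian logloss is exactly 1/sigma^2 (here 1.0);
# curvature of the logistic logloss is sigma(z)(1 - sigma(z)) <= 1/4
gauss_curv = second_deriv(lambda z: gaussian_logloss(z, 0.7), 0.3)
logit_curv = second_deriv(lambda z: logistic_logloss(z, 1), 0.0)
```

The constant c in Theorem 2.2 below is exactly such a uniform curvature bound: c = 1/σ² for Gaussian regression and c = 1/4 for logistic regression.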
On iteration t, we are given the input x_t, and our algorithm makes a prediction using the posterior distribution over the outputs:

  p(y | x_t, S_{t−1}) = ∫ p(y | x_t, θ) p(θ | S_{t−1}) dθ.

We are then given the true label y_t, and we suffer logloss −log p(y_t | x_t, S_{t−1}). We define the cumulative loss of the BMA algorithm after T rounds to be

  L_BMA(S) = Σ_{t=1}^T −log p(y_t | x_t, S_{t−1}).

Importantly, note that even though the algorithm we consider is a Bayesian one, our theoretical results do not assume that the data comes from any particular probabilistic model. In particular, the data may be chosen by an adversary. We are interested in comparing against the loss of any expert that uses some fixed parameters θ ∈ ℝⁿ. Define ℓ_t(θ) = −log p(y_t | x_t, θ), and let

  L_θ(S) = Σ_{t=1}^T ℓ_t(θ) = Σ_{t=1}^T −log p(y_t | x_t, θ).

Sometimes, we also wish to compare against distributions over experts. Given a distribution Q over θ, define

  ℓ_t^Q = ∫ Q(θ) [−log p(y_t | x_t, θ)] dθ,  and  L_Q(S) = Σ_{t=1}^T ℓ_t^Q = ∫ Q(θ) L_θ(S) dθ.

This is the expected logloss incurred by a procedure that first samples some θ ~ Q and then uses this θ for all its predictions. Here, the expectation is over the random θ, not over the sequence of examples. Note that the expectation is of the logloss, which is a different type of averaging than in BMA, which has the expectation and the log in the reverse order.

2.1 A Useful Variational Bound

The following lemma provides a worst-case bound on the loss incurred by Bayesian algorithms and will be useful for deriving our main result in the next section. A result very similar to this for finite model classes is given by Freund et al. (1997). For completeness, we prove the result here in its full generality, though our proof is similar to theirs. As usual, define KL(q‖p) = ∫ q(θ) log [q(θ)/p(θ)] dθ.

Lemma 2.1: Let Q be any distribution over θ. Then for all sequences S,

  L_BMA(S) ≤ L_Q(S) + KL(Q‖p₀).

Proof: Let Y = {y₁, ..., y_T} and X = {x₁, ..., x_T}. The chain rule of conditional probabilities implies that L_BMA(S) = −log p(Y | X) and L_θ(S) = −log p(Y | X, θ).
So

  L_BMA(S) − L_Q(S) = −log p(Y | X) + ∫ Q(θ) log p(Y | X, θ) dθ
                    = ∫ Q(θ) log [p(Y | X, θ) p₀(θ) / (p(Y | X) p₀(θ))] dθ.

By Bayes' rule, we have that p_T(θ) = p(Y | X, θ) p₀(θ) / p(Y | X). Continuing,

  = ∫ Q(θ) log [p_T(θ)/p₀(θ)] dθ
  = ∫ Q(θ) log [Q(θ)/p₀(θ)] dθ − ∫ Q(θ) log [Q(θ)/p_T(θ)] dθ
  = KL(Q‖p₀) − KL(Q‖p_T).

Together with the fact that KL(Q‖p_T) ≥ 0, this proves the lemma. ∎
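Lemma 2.1 holds verbatim when the prior is supported on a finite grid of parameters, which makes it easy to check numerically. The sketch below (an illustration on assumed toy data, not part of the paper) runs online BMA for 1-D Gaussian regression over a discretized prior and verifies L_BMA(S) ≤ L_Q(S) + KL(Q‖p₀) for an arbitrary comparator distribution Q.

```python
import math

# 1-D Gaussian regression model p(y | x, theta) = N(y; theta*x, 1) on a parameter grid
thetas = [i / 10 for i in range(-50, 51)]

def loglik(theta, x, y):
    return -0.5 * math.log(2 * math.pi) - (theta * x - y) ** 2 / 2

nu2 = 1.0
prior = [math.exp(-th * th / (2 * nu2)) for th in thetas]
Z = sum(prior); prior = [p / Z for p in prior]

data = [(1.0, 0.5), (0.8, -0.2), (-0.5, 0.3), (1.0, 1.0)]  # toy (x_t, y_t) pairs

# online BMA: predict with the current posterior, suffer logloss, then update
post = prior[:]
L_bma = 0.0
for x, y in data:
    pred = sum(w * math.exp(loglik(th, x, y)) for w, th in zip(post, thetas))
    L_bma += -math.log(pred)
    post = [w * math.exp(loglik(th, x, y)) for w, th in zip(post, thetas)]
    Z = sum(post); post = [w / Z for w in post]

# an arbitrary comparator Q: here, concentrated near theta = 0.4
Q = [math.exp(-(th - 0.4) ** 2 / (2 * 0.01)) for th in thetas]
Z = sum(Q); Q = [q / Z for q in Q]
L_Q = sum(q * sum(-loglik(th, x, y) for x, y in data) for q, th in zip(Q, thetas))
KL = sum(q * math.log(q / p) for q, p in zip(Q, prior) if q > 0)
```

After running, L_bma ≤ L_Q + KL exactly as the lemma promises, for any choice of Q on the grid.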
2.2 An Upper Bound for Generalized Linear Models

For the theorem that we shortly present, we need one new definition. Let

  f_y(z) = −log p(y | θᵀx = z).    (3)

Thus, f_{y_t}(θᵀx_t) = ℓ_t(θ). Note that for linear regression as defined in Equation 1, we have, for all y,

  f''_y(z) = 1/σ²,

and for logistic regression as defined in Equation 2, we have, for y ∈ {0, 1},

  f''_y(z) ≤ 1/4.

Theorem 2.2: Suppose f_y(z) is twice continuously differentiable. Let S be a sequence such that ‖x_t‖ ≤ 1 and such that, for some constant c, f''_{y_t}(z) ≤ c for all z. Then for all θ,

  L_BMA(S) ≤ L_θ(S) + ‖θ‖²/(2ν²) + (n/2) log(1 + Tcν²/n).    (4)

The ‖θ‖²/(2ν²) term can be interpreted as a penalty term from our prior. The log term is how fast our loss could grow in comparison to the best θ. Importantly, this extra loss is only logarithmic in T in this adversarial setting. This bound is almost identical to those provided by Vovk (2001), Azoury and Warmuth (2001), and Foster (1991) for the linear regression case under the square loss; the only difference is that in their bounds, the last term is multiplied by an upper bound on y_t². In contrast, we require no bound on y_t in the Gaussian linear regression case, due to the fact that we deal with the logloss (also recall f''_y(z) = 1/σ² for all y).

Proof: We use Lemma 2.1 with Q(θ) = N(θ; θ*, ε²Iₙ), a normal distribution with mean θ* and covariance ε²Iₙ. Here, ε² is a variational parameter that we later tune to get the tightest possible bound. Letting H(Q) = (n/2) log(2πeε²) be the entropy of Q, we have

  KL(Q‖p₀) = ∫ Q(θ) [−log ((2π)^{−n/2} |ν²Iₙ|^{−1/2} exp(−θᵀθ/(2ν²)))] dθ − H(Q)
           = (n/2) log ν² + (1/(2ν²)) ∫ Q(θ) θᵀθ dθ − n/2 − (n/2) log ε²
           = (n/2) log(ν²/ε²) + (‖θ*‖² + nε²)/(2ν²) − n/2.    (5)

To prove the result, we also need to relate the error of L_Q to that of L_{θ*}. By taking a Taylor expansion of f_y (for any y appearing in S), we have

  f_y(z) = f_y(z̄) + f'_y(z̄)(z − z̄) + ½ f''_y(ξ(z))(z − z̄)²

for some appropriate function ξ. Thus, if z is a random variable with mean z̄, we have

  E_z[f_y(z)] = f_y(z̄) + f'_y(z̄) · 0 + E_z[½ f''_y(ξ(z))(z − z̄)²]
             ≤ f_y(z̄) + (c/2) E_z[(z − z̄)²] = f_y(z̄) + (c/2) Var(z).

Consider a single example (x, y). We can apply the argument above with z = θᵀx and z̄ = θ*ᵀx, where θ ~ Q. Note that E[z] = z̄, since Q has mean θ*. Also,

  Var(θᵀx) = xᵀ(ε²Iₙ)x = ‖x‖²ε² ≤ ε²,

because we previously assumed that ‖x‖ ≤ 1. Thus, we have

  E_Q[f_y(θᵀx)] ≤ f_y(θ*ᵀx) + cε²/2.
Since ℓ_t^Q = E_Q[f_{y_t}(θᵀx_t)] and ℓ_t(θ*) = f_{y_t}(θ*ᵀx_t), we can sum both sides from t = 1 to T to obtain

  L_Q(S) ≤ L_{θ*}(S) + Tcε²/2.

Putting this together with Lemma 2.1 and Equation 5, we find that

  L_BMA(S) ≤ L_{θ*}(S) + Tcε²/2 + (n/2) log(ν²/ε²) + (‖θ*‖² + nε²)/(2ν²) − n/2.

Finally, by choosing ε² = nν²/(n + Tcν²) and simplifying, Theorem 2.2 follows. ∎

2.3 A Lower Bound for Gaussian Linear Regression

The following lower bound shows that, for linear regression, no other prediction scheme is better than Bayes in the worst case when our penalty term is ‖θ‖²/2. Here, we compare to an arbitrary predictive distribution q(y | x_t, S_{t−1}) for prediction at time t, which suffers an instantaneous loss ℓ_t^q = −log q(y_t | x_t, S_{t−1}). In the theorem, ⌊·⌋ denotes the floor function.

Theorem 2.3: Let L_θ(S) be the loss under the Gaussian linear regression model using the parameter θ, and let ν² = σ² = 1. For any set of predictive distributions q(y | x_t, S_{t−1}), there exists an S with ‖x_t‖ ≤ 1 such that

  Σ_{t=1}^T ℓ_t^q ≥ inf_θ [L_θ(S) + ‖θ‖²/2] + (n/2) log(⌊T/n⌋ + 1).

Proof (sketch): If n = 1 and if S is such that x_t = 1, one can show the equality

  L_BMA(S) = inf_θ [L_θ(S) + θ²/2] + (1/2) log(1 + T).

Let Y = {y₁, ..., y_T} and X = {1, ..., 1}. By the chain rule of conditional probabilities, L_BMA(S) = −log p(Y | X), where p is the Gaussian linear regression model, and q's loss is Σ_{t=1}^T ℓ_t^q = −log q(Y | X). For any predictive distribution q that differs from p, there must exist some sequence S such that −log q(Y | X) is greater than −log p(Y | X), since probabilities are normalized. Such a sequence proves the result for n = 1. The modification for n dimensions follows: S is broken into n subsequences of length ⌊T/n⌋, where in the k-th subsequence only dimension k has x_{t,k} = 1 and the other dimensions are set to 0. The result follows due to the additivity of the losses on these subsequences. ∎

3 MAP Estimation

We now present bounds for MAP algorithms for both Gaussian linear regression (i.e., ridge regression) and logistic regression.
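Before the MAP analysis, the last step in the proof of Theorem 2.2 can be checked numerically. The sketch below (illustrative, with arbitrary constants) evaluates the combined right-hand-side terms Tcε²/2 + KL(Q‖p₀) as a function of the variational parameter ε², and confirms that at ε² = nν²/(n + Tcν²) the expression collapses exactly to the regret terms of the bound (4) and is minimized there.

```python
import math

def regret_terms(eps2, T, c, nu2, n, theta_norm2):
    # variational slack T*c*eps^2/2 plus KL(Q || p_0) from Eq. (5)
    return (T * c * eps2 / 2
            + (n / 2) * math.log(nu2 / eps2)
            + (theta_norm2 + n * eps2) / (2 * nu2)
            - n / 2)

def theorem_bound(T, c, nu2, n, theta_norm2):
    # regret part of Theorem 2.2: ||theta||^2/(2 nu^2) + (n/2) log(1 + T c nu^2 / n)
    return theta_norm2 / (2 * nu2) + (n / 2) * math.log(1 + T * c * nu2 / n)

# arbitrary illustrative constants
T, c, nu2, n, tn2 = 100, 1.0, 1.0, 3, 2.0
eps2_star = n * nu2 / (n + T * c * nu2)   # the choice made in the proof
at_star = regret_terms(eps2_star, T, c, nu2, n, tn2)
claimed = theorem_bound(T, c, nu2, n, tn2)
```

Setting the derivative of regret_terms with respect to ε² to zero (Tc/2 − n/(2ε²) + n/(2ν²) = 0) recovers exactly eps2_star, so the tuned bound is the tightest this family of Gaussian comparators Q can give.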
These algorithms use the maximizer θ̂_{t−1} of p_{t−1}(θ) to form their predictive distribution p(y | x_t, θ̂_{t−1}) at time t, as opposed to BMA's predictive distribution p(y | x_t, S_{t−1}). As expected, these bounds are weaker than BMA's, though perhaps not unreasonably so.

3.1 The Square Loss and Ridge Regression

Before we provide the MAP bound, let us first present the form of the posteriors and the predictions for Gaussian linear regression. Define

  A_t = (1/ν²) Iₙ + (1/σ²) Σᵢ₌₁ᵗ xᵢxᵢᵀ,  and  b_t = (1/σ²) Σᵢ₌₁ᵗ xᵢyᵢ.

We now have that

  p_t(θ) = p(θ | S_t) = N(θ; θ̂_t, Σ̂_t),    (6)

where θ̂_t = A_t⁻¹ b_t and Σ̂_t = A_t⁻¹. Also, the predictions at time t + 1 are given by

  p(y_{t+1} | x_{t+1}, S_t) = N(y_{t+1}; ŷ_{t+1}, s²_{t+1}),    (7)

where ŷ_{t+1} = θ̂_tᵀ x_{t+1} and s²_{t+1} = x_{t+1}ᵀ Σ̂_t x_{t+1} + σ². In contrast, the prediction of a fixed expert using parameter θ would be

  p(y_t | x_t, θ) = N(y_t; ȳ_t, σ²),    (8)

where ȳ_t = θᵀx_t. Now the BMA loss is

  L_BMA(S) = Σ_{t=1}^T [ (y_t − θ̂_{t−1}ᵀx_t)²/(2s²_t) + ½ log(2πs²_t) ].    (9)

Importantly, note how Bayes is adaptively weighting the squared terms with the inverse variances 1/s²_t, which depend on the current observation x_t. The logloss of using a fixed expert θ is just

  L_θ(S) = Σ_{t=1}^T [ (y_t − θᵀx_t)²/(2σ²) + ½ log(2πσ²) ].    (10)

The MAP procedure (referred to as ridge regression) uses p(y | x_t, θ̂_{t−1}), which has a fixed variance. Hence, the MAP loss is essentially the square loss, and we define it as such:

  L_MAP(S) = Σ_{t=1}^T (y_t − θ̂_{t−1}ᵀx_t)²,  L_θ(S) = Σ_{t=1}^T (y_t − θᵀx_t)²,    (11)

where θ̂_{t−1} is the MAP estimate (see Equation 6).

Corollary 3.1: Let γ² = σ² + ν². For all S such that ‖x_t‖ ≤ 1 and for all θ, we have

  L_MAP(S) ≤ (γ²/σ²) L_θ(S) + γ²‖θ‖²/ν² + γ²n log(1 + Tν²/(σ²n)).

Proof: Using Equations 9 and 10 and Theorem 2.2 (with c = 1/σ²), we have

  Σ_{t=1}^T (y_t − θ̂_{t−1}ᵀx_t)²/(2s²_t) ≤ Σ_{t=1}^T (y_t − θᵀx_t)²/(2σ²) + ‖θ‖²/(2ν²) + (n/2) log(1 + Tν²/(σ²n)) + Σ_{t=1}^T ½ log(σ²/s²_t).

Equations 6 and 7 imply that σ² ≤ s²_t ≤ σ² + ν². Using this, the result follows by noting that the last term is negative and by multiplying both sides of the equation by 2(σ² + ν²). ∎

We might have hoped that MAP were more competitive, in that the leading coefficient in front of the L_θ(S) term in the bound be 1 (as in Theorem 2.2) rather than γ²/σ². Crudely, the reason that MAP is not as effective as BMA is that MAP does not take into account the uncertainty in its predictions, so the squared terms cannot be reweighted to take variance into account (compare Equations 9 and 11). Some previous non-Bayesian algorithms did in fact have bounds with this coefficient being unity. Vovk (2001) provides such an algorithm, though this algorithm differs from MAP in that its predictions at time t are a nonlinear function of x_t (it uses A_t instead of A_{t−1} at time t). Foster (1991) provides a bound with this coefficient being 1 under more restrictive assumptions. Azoury and Warmuth (2001) also provide a bound with a coefficient of 1 by using a MAP procedure with clipping.
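The closed-form updates above are easy to run online. The following 1-D sketch (toy data, illustrative only) tracks the conjugate posterior of Equation 6, forms the BMA predictive variance s²_t of Equation 7, checks that σ² ≤ s²_t ≤ σ² + ν² when |x_t| ≤ 1, and compares the square loss L_MAP(S) of the ridge predictions against the guarantee of Corollary 3.1 evaluated at the batch least-squares parameter.

```python
import math

sigma2, nu2 = 1.0, 1.0
data = [(1.0, 0.3), (0.6, -0.1), (-1.0, 0.4), (0.9, 0.8)]  # toy (x_t, y_t), |x_t| <= 1

A, b = 1.0 / nu2, 0.0        # scalar versions of A_t and b_t (n = 1)
L_bma, L_map = 0.0, 0.0
for x, y in data:
    theta_hat = b / A                      # MAP / posterior-mean estimate (Eq. 6)
    s2 = x * x / A + sigma2                # BMA predictive variance (Eq. 7)
    assert sigma2 <= s2 <= sigma2 + nu2    # holds because |x_t| <= 1
    L_bma += (y - theta_hat * x) ** 2 / (2 * s2) + 0.5 * math.log(2 * math.pi * s2)
    L_map += (y - theta_hat * x) ** 2      # square loss of ridge regression (Eq. 11)
    A += x * x / sigma2                    # conjugate posterior update
    b += x * y / sigma2

# Corollary 3.1 evaluated at the batch least-squares comparator
theta_star = sum(x * y for x, y in data) / sum(x * x for x, y in data)
L_star = sum((y - theta_star * x) ** 2 for x, y in data)
gamma2, T, n = sigma2 + nu2, len(data), 1
bound = (gamma2 / sigma2 * L_star
         + gamma2 * theta_star ** 2 / nu2
         + gamma2 * n * math.log(1 + T * nu2 / (sigma2 * n)))
```

On this toy sequence the realized L_map sits well below the (loose) corollary bound, as expected.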
Their algorithm thresholds the prediction ŷ_t = θ̂_{t−1}ᵀx_t if it is larger than some upper bound. Note that we do not assume any upper bound on y_t.
As the following lower bound shows, it is not possible for the MAP linear regression algorithm to have a coefficient of 1 on L_θ(S) together with a reasonable regret bound. A similar lower bound is in Vovk (2001), which doesn't apply to our setting, where we have the additional constraint ‖x_t‖ ≤ 1.

Theorem 3.2: Let γ² = σ² + ν². There exists a sequence S with ‖x_t‖ ≤ 1 such that

  L_MAP(S) ≥ inf_θ [L_θ(S) + ‖θ‖²] + Ω(T).

Proof (sketch): Let S be a length T + 1 sequence, with n = 1, where for the first T steps, x_t = 1/√T and y_t = 1, and at step T + 1, x_{T+1} = 1 and y_{T+1} = 0. Here, one can show that inf_θ [L_θ(S) + ‖θ‖²] = T/4 and L_MAP(S) ≥ 3T/8, and the result follows. ∎

3.2 Logistic Regression

MAP estimation is often used for regularized logistic regression, since it requires only solving a convex program, while BMA has to deal with a high-dimensional integral over θ that is intractable to compute exactly. Letting θ̂_{t−1} be the maximizer of the posterior p_{t−1}(θ), define

  L_MAP(S) = Σ_{t=1}^T −log p(y_t | x_t, θ̂_{t−1}).

As in the square loss case, the bound we present is multiplicatively worse by a factor of 4.

Theorem 3.3: In the logistic regression model with ν ≤ 0.5, we have that for all sequences S such that ‖x_t‖ ≤ 1 and y_t ∈ {0, 1}, and for all θ,

  L_MAP(S) ≤ 4 L_θ(S) + 2‖θ‖²/ν² + 2n log(1 + Tν²/(4n)).

Proof (sketch): Assume n = 1 (the general case is analogous). The proof consists of showing that

  ℓ̂_t := −log p(y_t | x_t, θ̂_{t−1}) ≤ 4 ℓ_t^BMA.

Without loss of generality, assume y_t = 1 and x_t ≥ 0; for convenience, we write θ̂ for θ̂_{t−1} and x for x_t. Now the BMA prediction is ∫ p(1 | x, θ) p_{t−1}(θ) dθ, and ℓ_t^BMA is the negative log of this. Note that θ = ∞ gives probability 1 to y_t = 1, and this setting of θ minimizes the loss at time t. Since we do not have a closed-form solution of the posterior p_{t−1}(θ), let us work with another distribution q(θ) in lieu of p_{t−1}(θ) that satisfies certain properties. Define

  p_q = ∫ p(1 | x, θ) q(θ) dθ,

which can be viewed as the prediction using q rather than the posterior. We choose q to be the rectification of the Gaussian N(θ; θ̂, ν²), such that there is positive probability only for θ ≥ θ̂ and the distribution is renormalized.
With this choice, we first show that the loss of q, namely −log p_q, is less than or equal to ℓ_t^BMA. Then we complete the proof by showing that ℓ̂_t ≤ 4(−log p_q), which suffices since −log p_q ≤ ℓ_t^BMA.

Consider the q which maximizes p_q subject to the following constraints: let q have its maximum at θ̂; let q(θ) = 0 if θ < θ̂ (intuitively, mass to the left of θ̂ only makes p_q smaller); and impose the curvature constraint (log q(θ))'' ≥ −1/ν². We now argue that for such a q, −log p_q ≤ ℓ_t^BMA. First note that, due to the Gaussian prior p₀, it is straightforward to show that (log p_{t−1}(θ))'' ≤ −1/ν² (the prior imposes some minimum curvature). Now if this posterior p_{t−1} were rectified, with support only for θ ≥ θ̂, and renormalized, then such a modified distribution clearly satisfies the aforementioned constraints, and it has loss less than the loss of p_{t−1} itself (since the rectification only increases the prediction). Hence, the maximizer q of p_q subject to the constraints has loss less than that of p_{t−1}, i.e., −log p_q ≤ ℓ_t^BMA.

We now show that such a maximal q is the renormalized rectification of the Gaussian N(θ; θ̂, ν²), with positive probability only for θ ≥ θ̂. Assume some other q' satisfied these constraints and maximized p_q. It cannot be that q'(θ̂) < q(θ̂), else one can show q' would not be normalized (since, with q'(θ̂) < q(θ̂), the curvature constraint imposes that this q' cannot cross q). It also cannot be that q'(θ̂) > q(θ̂). To see this, note that normalization and curvature imply that q' must cross q only once. Now a sufficiently slight perturbation of this crossing point to the left, by shifting more mass from the left to the right side of the crossing point, would not violate the curvature constraint and would result in a new distribution with larger p_q, contradicting the maximality of q'. Hence, we have that q'(θ̂) = q(θ̂). This, along with the curvature constraint and normalization, implies that the rectified Gaussian q is the unique solution.

To complete the proof, we show ℓ̂_t = −log p(1 | x, θ̂) ≤ 4(−log p_q). We consider two cases, θ̂ < 0 and θ̂ ≥ 0. We start with the case θ̂ < 0. Using the boundedness of the derivative, |∂ log p(1 | x, θ)/∂θ| ≤ 1, and the fact that q only has support for θ ≥ θ̂, we have

  p_q = ∫ exp(log p(1 | x, θ)) q(θ) dθ ≤ ∫ exp(log p(1 | x, θ̂) + (θ − θ̂)) q(θ) dθ ≤ 1.6 p(1 | x, θ̂),

where we have used that ∫ exp(θ − θ̂) q(θ) dθ < 1.6, which can be verified numerically using the definition of q with ν ≤ 0.5. Now observe that for θ̂ < 0, we have the lower bound −log p(1 | x, θ̂) ≥ log 2. Hence,

  −log p_q ≥ −log p(1 | x, θ̂) − log 1.6 ≥ (1 − log 1.6 / log 2)(−log p(1 | x, θ̂)) ≥ 0.3 ℓ̂_t,

which shows ℓ̂_t ≤ 4(−log p_q) in this case.

Now for the case θ̂ ≥ 0. Let σ(·) be the sigmoid function, so p(1 | x, θ) = σ(xθ) and p_q = ∫ σ(xθ) q(θ) dθ. Since the sigmoid is concave for nonnegative arguments and, for this case, q only has support on nonnegative θ, Jensen's inequality gives

  p_q ≤ σ(x(θ̂ + ν)) ≤ σ(xθ̂ + ν),

using that the mean of q is at most θ̂ + ν and that xν ≤ ν (as x ≤ 1 and θ̂ + ν > 0). Using properties of σ, one can show |(log σ(z))'| ≤ −log σ(z) for all z. Integrating this derivative condition over an interval of length ν gives, for all z,

  −log σ(z + ν) ≥ e^{−ν} (−log σ(z)).

Applying this at z = xθ̂, together with the previous bound on p_q, we have

  −log p_q ≥ −log σ(xθ̂ + ν) ≥ e^{−ν} (−log σ(xθ̂)) = e^{−ν} ℓ̂_t ≥ ℓ̂_t/4,

since ν ≤ 0.5. This proves the claim when θ̂ ≥ 0. ∎

Acknowledgments. We thank Dean Foster for numerous helpful discussions.
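The MAP estimator analyzed above is just the minimizer of a regularized convex logloss, so it can be computed by simple gradient descent. The 1-D sketch below (toy data, step size, and iteration count are assumptions, not from the paper) refits θ̂_{t−1} on each prefix of the sequence and accumulates the online MAP logloss L_MAP(S); the prior variance ν² = 0.25 corresponds to ν = 0.5, matching the condition of Theorem 3.3.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def map_estimate(data, nu2=0.25, steps=2000, lr=0.1):
    """Gradient descent on the negative log-posterior (n = 1):
       sum_t [log(1 + e^{theta x_t}) - y_t theta x_t] + theta^2 / (2 nu^2)."""
    theta = 0.0
    for _ in range(steps):
        grad = theta / nu2                     # gradient of the prior term
        for x, y in data:
            grad += (sigmoid(theta * x) - y) * x   # gradient of the logloss
        theta -= lr * grad
    return theta

data = [(1.0, 1), (0.8, 1), (-1.0, 0), (0.5, 1), (-0.7, 0)]  # toy (x_t, y_t)
theta_hat = map_estimate(data)

# online MAP loss: fit on the prefix S_{t-1}, then score example t
L_map, seen = 0.0, []
for x, y in data:
    th = map_estimate(seen)
    p = sigmoid(th * x)
    L_map += -math.log(p if y == 1 else 1 - p)
    seen.append((x, y))
```

The objective is strongly convex (curvature at least 1/ν²), so the fixed step size above converges quickly; in practice one would use Newton's method or a quasi-Newton solver.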
This work was supported by the Department of the Interior/DARPA under contract NBCHD.

References

Azoury, K. S. and Warmuth, M. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3).
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D., Schapire, R., and Warmuth, M. (1997). How to use expert advice. J. ACM, 44(3).
Cesa-Bianchi, N., Helmbold, D., and Panizza, S. (1998). On Bayes methods for on-line boolean prediction. Algorithmica, 22.
Dawid, A. (1984). Statistical theory: The prequential approach. J. Royal Statistical Society.
Foster, D. P. (1991). Prediction in the worst case. Annals of Statistics, 19.
Freund, Y. and Schapire, R. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103.
Freund, Y., Schapire, R., Singer, Y., and Warmuth, M. (1997). Using and combining predictors that specialize. In STOC.
Grunwald, P. (2005). A tutorial introduction to the minimum description length principle.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall.
Ng, A. Y. and Jordan, M. (2001). Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection. In Proceedings of the 18th Int'l Conference on Machine Learning.
Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69.
More informationy Xw 2 2 y Xw λ w 2 2
CS 189 Introduction to Machine Learning Spring 2018 Note 4 1 MLE and MAP for Regression (Part I) So far, we ve explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More information1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016
AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework
More informationThe Multi-Arm Bandit Framework
The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94
More informationBetter Algorithms for Selective Sampling
Francesco Orabona Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy francesco@orabonacom nicolocesa-bianchi@unimiit Abstract We study online algorithms for selective sampling that use regularized
More informationOnline Learning of Non-stationary Sequences
Online Learning of Non-stationary Sequences Claire Monteleoni and Tommi Jaakkola MIT Computer Science and Artificial Intelligence Laboratory 200 Technology Square Cambridge, MA 02139 {cmontel,tommi}@ai.mit.edu
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMachine Learning 1. Linear Classifiers. Marius Kloft. Humboldt University of Berlin Summer Term Machine Learning 1 Linear Classifiers 1
Machine Learning 1 Linear Classifiers Marius Kloft Humboldt University of Berlin Summer Term 2014 Machine Learning 1 Linear Classifiers 1 Recap Past lectures: Machine Learning 1 Linear Classifiers 2 Recap
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationA Generalization of Principal Component Analysis to the Exponential Family
A Generalization of Principal Component Analysis to the Exponential Family Michael Collins Sanjoy Dasgupta Robert E. Schapire AT&T Labs Research 8 Park Avenue, Florham Park, NJ 7932 mcollins, dasgupta,
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationGame Theory, On-line prediction and Boosting (Freund, Schapire)
Game heory, On-line prediction and Boosting (Freund, Schapire) Idan Attias /4/208 INRODUCION he purpose of this paper is to bring out the close connection between game theory, on-line prediction and boosting,
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationLecture 6. Regression
Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron
More informationWE focus on the following linear model of adaptive filtering:
1782 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 5, MAY 2006 The p-norm Generalization of the LMS Algorithm for Adaptive Filtering Jyrki Kivinen, Manfred K. Warmuth, and Babak Hassibi Abstract
More informationNew bounds on the price of bandit feedback for mistake-bounded online multiclass learning
Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre
More informationPerceptron Mistake Bounds
Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce
More informationSelf Supervised Boosting
Self Supervised Boosting Max Welling, Richard S. Zemel, and Geoffrey E. Hinton Department of omputer Science University of Toronto 1 King s ollege Road Toronto, M5S 3G5 anada Abstract Boosting algorithms
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationToday. Calculus. Linear Regression. Lagrange Multipliers
Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationMachine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang
Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationMidterm Exam, Spring 2005
10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name
More informationOnline prediction with expert advise
Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between
More informationTutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning
Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret
More information1. Implement AdaBoost with boosting stumps and apply the algorithm to the. Solution:
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 October 31, 2016 Due: A. November 11, 2016; B. November 22, 2016 A. Boosting 1. Implement
More informationOnline Kernel PCA with Entropic Matrix Updates
Dima Kuzmin Manfred K. Warmuth Computer Science Department, University of California - Santa Cruz dima@cse.ucsc.edu manfred@cse.ucsc.edu Abstract A number of updates for density matrices have been developed
More informationMachine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /
Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationMachine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart
Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural
More informationLearning Gaussian Process Models from Uncertain Data
Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
More informationLinear and logistic regression
Linear and logistic regression Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Linear and logistic regression 1/22 Outline 1 Linear regression 2 Logistic regression 3 Fisher discriminant analysis
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationRobust Selective Sampling from Single and Multiple Teachers
Robust Selective Sampling from Single and Multiple Teachers Ofer Dekel Microsoft Research oferd@microsoftcom Claudio Gentile DICOM, Università dell Insubria claudiogentile@uninsubriait Karthik Sridharan
More informationA Drifting-Games Analysis for Online Learning and Applications to Boosting
A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire
More informationDiscrete Mathematics and Probability Theory Fall 2015 Lecture 21
CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about
More informationThe Moment Method; Convex Duality; and Large/Medium/Small Deviations
Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential
More information0.1 Motivating example: weighted majority algorithm
princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes
More informationClassification Based on Probability
Logistic Regression These slides were assembled by Byron Boots, with only minor modifications from Eric Eaton s slides and grateful acknowledgement to the many others who made their course materials freely
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationOnline Advertising is Big Business
Online Advertising Online Advertising is Big Business Multiple billion dollar industry $43B in 2013 in USA, 17% increase over 2012 [PWC, Internet Advertising Bureau, April 2013] Higher revenue in USA
More information