Online Bounds for Bayesian Algorithms
Sham M. Kakade
Computer and Information Science Department
University of Pennsylvania

Andrew Y. Ng
Computer Science Department
Stanford University

Abstract

We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms' modeling assumptions are correct, and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst-case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.

1 Introduction

The last decade has seen significant progress in online learning algorithms that perform well even in adversarial settings (e.g., the expert algorithms of Cesa-Bianchi et al., 1997). In the online learning framework, one makes minimal assumptions on the data presented to the learner, and the goal is to obtain good relative performance on arbitrary sequences. In statistics, this philosophy has been espoused by Dawid (1984) in the prequential approach. We study the performance of Bayesian algorithms in this adversarial setting, in which the process generating the data is not restricted to come from the prior; data sequences may be arbitrary. Our motivation is similar to that given in the online learning literature and the MDL literature (see Grunwald, 2005): models are often chosen to balance realism with computational tractability, and the assumptions made by the Bayesian are often not truly believed to hold (e.g., i.i.d. assumptions).
Our goal is to study the performance of Bayesian algorithms in the worst case, where all modeling assumptions may be violated. We consider the widely used class of generalized linear models (focusing on Gaussian linear regression and logistic regression) and provide relative performance bounds, comparing to the best model in our model class when the cost function is the logloss. Though the regression problem has been studied in a competitive framework and, indeed, many ingenious algorithms have been devised for it (e.g., Foster, 1991; Vovk, 2001; Azoury and Warmuth, 2001), our goal here is to study how the more widely used, and often simpler, Bayesian algorithms fare. Our bounds for linear regression are comparable to the best bounds in the literature, though we use the logloss as opposed to the square loss. The competitive approach to regression started with Foster (1991), who provided competitive bounds for a variant of the ridge regression algorithm under the square loss. Vovk (2001) presents many competitive algorithms and provides bounds for linear regression under the square loss with an algorithm that differs slightly from the Bayesian one. Azoury and Warmuth (2001) rederive Vovk's bound with a different analysis based on Bregman distances. Our work differs from these in that we consider Bayesian Gaussian
linear regression, while previous work typically used more complex, cleverly devised algorithms which are either variants of a MAP procedure (as in Vovk, 2001) or involve other steps such as clipping predictions (as in Azoury and Warmuth, 2001). These distinctions are discussed in more detail in Section 3.1. We should also note that when the loss function is the logloss, multiplicative weights algorithms are sometimes identical to Bayes' rule with particular choices of the parameters (see Freund and Schapire, 1999). Furthermore, Bayesian algorithms have been used in some online learning settings, such as the sleeping experts setting of Freund et al. (1997) and the online boolean prediction setting of Cesa-Bianchi et al. (1998). Ng and Jordan (2001) also analyzed an online Bayesian algorithm, but assumed that the data generation process was not too different from the model prior. To our knowledge, there have been no studies of Bayesian generalized linear models in an adversarial online learning setting (though many variants have been considered, as discussed above). We also examine maximum a posteriori (MAP) algorithms for both Gaussian linear regression (i.e., ridge regression) and for regularized logistic regression. These algorithms are often used in practice, particularly in logistic regression, where Bayesian model averaging is computationally expensive but the MAP algorithm requires only solving a convex problem. As expected, MAP algorithms are somewhat less competitive than full Bayesian model averaging, though not unreasonably so.

2 Bayesian Model Averaging

We now consider the Bayesian model averaging (BMA) algorithm and give a bound on its worst-case online loss. We start with some preliminaries. Let x ∈ ℝⁿ denote the inputs of a learning problem and y ∈ ℝ the outputs. Consider a model from the generalized linear model family (see McCullagh and Nelder, 1989) that can be written

  p(y | x, θ) = p(y | θᵀx),

where θ ∈ ℝⁿ are the parameters of our model (θᵀ denotes the transpose of θ).
Note that the predicted distribution of y depends only on θᵀx, which is linear in θ. For example, in the case of Gaussian linear regression, we have

  p(y | x, θ) = (1/√(2πσ²)) exp(−(θᵀx − y)²/(2σ²)),    (1)

where σ² is a fixed, known constant that is not a parameter of our model. In logistic regression, we would have

  log p(y | x, θ) = y log [1/(1 + exp(−θᵀx))] + (1 − y) log [1/(1 + exp(θᵀx))],    (2)

where we assume y ∈ {0, 1}. Let S = {(x₁, y₁), (x₂, y₂), ..., (x_T, y_T)} be an arbitrary sequence of examples, possibly chosen by an adversary. We also use S_t to denote the subsequence consisting of only the first t examples. We assume throughout this paper that ‖x_t‖ ≤ 1, where ‖·‖ denotes the L₂ norm. Assume that we are going to use a Bayesian algorithm to make our online predictions. Specifically, assume that we have a Gaussian prior on the parameters:

  p₀(θ) = N(θ; 0, ν²Iₙ),

where Iₙ is the n-by-n identity matrix, N(θ; µ, Σ) is the Gaussian density with mean µ and covariance Σ, and ν² > 0 is some fixed constant governing the prior variance. Also, let

  p_t(θ) = p(θ | S_t) = [∏ᵢ₌₁ᵗ p(yᵢ | xᵢ, θ)] p₀(θ) / ∫ [∏ᵢ₌₁ᵗ p(yᵢ | xᵢ, θ')] p₀(θ') dθ'

be the posterior distribution over θ given the first t training examples; p₀(θ) is just the prior distribution.
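To make the two model families concrete, the following small sketch (illustrative only, not from the paper) implements the per-example logloss f_y(z) = −log p(y | θᵀx = z) for both models, and uses a finite difference to check the curvature facts used later: the Gaussian logloss has constant second derivative 1/σ², while the logistic logloss has second derivative at most 1/4.

```python
import math

def gaussian_logloss(z, y, sigma2=1.0):
    # -log p(y | theta^T x = z) under the Gaussian model N(y; z, sigma^2), Eq. (1)
    return 0.5 * math.log(2 * math.pi * sigma2) + (z - y) ** 2 / (2 * sigma2)

def logistic_logloss(z, y):
    # -log p(y | theta^T x = z) under the logistic model, Eq. (2), y in {0, 1}
    return math.log(1 + math.exp(z)) - y * z

def second_deriv(f, z, h=1e-4):
    # central finite-difference approximation to f''(z)
    return (f(z + h) - 2 * f(z) + f(z - h)) / h ** 2

# curvature of the Gaussian logloss is exactly 1/sigma^2 (here 1.0);
# curvature of the logistic logloss is sigma(z)(1 - sigma(z)) <= 1/4
gauss_curv = second_deriv(lambda z: gaussian_logloss(z, 0.7), 0.3)
logit_curv = second_deriv(lambda z: logistic_logloss(z, 1), 0.0)
```

The constant c in Theorem 2.2 below is exactly such a uniform curvature bound: c = 1/σ² for Gaussian regression and c = 1/4 for logistic regression.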
On iteration t, we are given the input x_t, and our algorithm makes a prediction using the posterior distribution over the outputs:

  p(y | x_t, S_{t−1}) = ∫ p(y | x_t, θ) p(θ | S_{t−1}) dθ.

We are then given the true label y_t, and we suffer logloss −log p(y_t | x_t, S_{t−1}). We define the cumulative loss of the BMA algorithm after T rounds to be

  L_BMA(S) = Σ_{t=1}^T −log p(y_t | x_t, S_{t−1}).

Importantly, note that even though the algorithm we consider is a Bayesian one, our theoretical results do not assume that the data comes from any particular probabilistic model. In particular, the data may be chosen by an adversary. We are interested in comparing against the loss of any expert that uses some fixed parameters θ ∈ ℝⁿ. Define ℓ_t(θ) = −log p(y_t | x_t, θ), and let

  L_θ(S) = Σ_{t=1}^T ℓ_t(θ) = Σ_{t=1}^T −log p(y_t | x_t, θ).

Sometimes, we also wish to compare against distributions over experts. Given a distribution Q over θ, define

  ℓ_t^Q = ∫ Q(θ) [−log p(y_t | x_t, θ)] dθ,  and  L_Q(S) = Σ_{t=1}^T ℓ_t^Q = ∫ Q(θ) L_θ(S) dθ.

This is the expected logloss incurred by a procedure that first samples some θ ~ Q and then uses this θ for all its predictions. Here, the expectation is over the random θ, not over the sequence of examples. Note that the expectation is of the logloss, which is a different type of averaging than in BMA, which has the expectation and the log in the reverse order.

2.1 A Useful Variational Bound

The following lemma provides a worst-case bound on the loss incurred by Bayesian algorithms and will be useful for deriving our main result in the next section. A result very similar to this for finite model classes is given by Freund et al. (1997). For completeness, we prove the result here in its full generality, though our proof is similar to theirs. As usual, define KL(q‖p) = ∫ q(θ) log [q(θ)/p(θ)] dθ.

Lemma 2.1: Let Q be any distribution over θ. Then for all sequences S,

  L_BMA(S) ≤ L_Q(S) + KL(Q‖p₀).

Proof: Let Y = {y₁, ..., y_T} and X = {x₁, ..., x_T}. The chain rule of conditional probabilities implies that L_BMA(S) = −log p(Y | X) and L_θ(S) = −log p(Y | X, θ).
So

  L_BMA(S) − L_Q(S) = −log p(Y | X) + ∫ Q(θ) log p(Y | X, θ) dθ
                    = ∫ Q(θ) log [p(Y | X, θ) p₀(θ) / (p(Y | X) p₀(θ))] dθ.

By Bayes' rule, we have that p_T(θ) = p(Y | X, θ) p₀(θ) / p(Y | X). Continuing,

  = ∫ Q(θ) log [p_T(θ)/p₀(θ)] dθ
  = ∫ Q(θ) log [Q(θ)/p₀(θ)] dθ − ∫ Q(θ) log [Q(θ)/p_T(θ)] dθ
  = KL(Q‖p₀) − KL(Q‖p_T).

Together with the fact that KL(Q‖p_T) ≥ 0, this proves the lemma. ∎
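Lemma 2.1 holds verbatim when the prior is supported on a finite grid of parameters, which makes it easy to check numerically. The sketch below (an illustration on assumed toy data, not part of the paper) runs online BMA for 1-D Gaussian regression over a discretized prior and verifies L_BMA(S) ≤ L_Q(S) + KL(Q‖p₀) for an arbitrary comparator distribution Q.

```python
import math

# 1-D Gaussian regression model p(y | x, theta) = N(y; theta*x, 1) on a parameter grid
thetas = [i / 10 for i in range(-50, 51)]

def loglik(theta, x, y):
    return -0.5 * math.log(2 * math.pi) - (theta * x - y) ** 2 / 2

nu2 = 1.0
prior = [math.exp(-th * th / (2 * nu2)) for th in thetas]
Z = sum(prior); prior = [p / Z for p in prior]

data = [(1.0, 0.5), (0.8, -0.2), (-0.5, 0.3), (1.0, 1.0)]  # toy (x_t, y_t) pairs

# online BMA: predict with the current posterior, suffer logloss, then update
post = prior[:]
L_bma = 0.0
for x, y in data:
    pred = sum(w * math.exp(loglik(th, x, y)) for w, th in zip(post, thetas))
    L_bma += -math.log(pred)
    post = [w * math.exp(loglik(th, x, y)) for w, th in zip(post, thetas)]
    Z = sum(post); post = [w / Z for w in post]

# an arbitrary comparator Q: here, concentrated near theta = 0.4
Q = [math.exp(-(th - 0.4) ** 2 / (2 * 0.01)) for th in thetas]
Z = sum(Q); Q = [q / Z for q in Q]
L_Q = sum(q * sum(-loglik(th, x, y) for x, y in data) for q, th in zip(Q, thetas))
KL = sum(q * math.log(q / p) for q, p in zip(Q, prior) if q > 0)
```

After running, L_bma ≤ L_Q + KL exactly as the lemma promises, for any choice of Q on the grid.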
2.2 An Upper Bound for Generalized Linear Models

For the theorem that we shortly present, we need one new definition. Let

  f_y(z) = −log p(y | θᵀx = z).    (3)

Thus, f_{y_t}(θᵀx_t) = ℓ_t(θ). Note that for linear regression as defined in Equation 1, we have, for all y,

  f''_y(z) = 1/σ²,

and for logistic regression as defined in Equation 2, we have, for y ∈ {0, 1},

  f''_y(z) ≤ 1/4.

Theorem 2.2: Suppose f_y(z) is twice continuously differentiable. Let S be a sequence such that ‖x_t‖ ≤ 1 and such that, for some constant c, f''_{y_t}(z) ≤ c for all z. Then for all θ,

  L_BMA(S) ≤ L_θ(S) + ‖θ‖²/(2ν²) + (n/2) log(1 + Tcν²/n).    (4)

The ‖θ‖²/(2ν²) term can be interpreted as a penalty term from our prior. The log term is how fast our loss could grow in comparison to the best θ. Importantly, this extra loss is only logarithmic in T in this adversarial setting. This bound is almost identical to those provided by Vovk (2001), Azoury and Warmuth (2001), and Foster (1991) for the linear regression case under the square loss; the only difference is that in their bounds, the last term is multiplied by an upper bound on y_t². In contrast, we require no bound on y_t in the Gaussian linear regression case, due to the fact that we deal with the logloss (also recall f''_y(z) = 1/σ² for all y).

Proof: We use Lemma 2.1 with Q(θ) = N(θ; θ*, ε²Iₙ), a normal distribution with mean θ* and covariance ε²Iₙ. Here, ε² is a variational parameter that we later tune to get the tightest possible bound. Letting H(Q) = (n/2) log(2πeε²) be the entropy of Q, we have

  KL(Q‖p₀) = ∫ Q(θ) [−log ((2π)^{−n/2} |ν²Iₙ|^{−1/2} exp(−θᵀθ/(2ν²)))] dθ − H(Q)
           = (n/2) log ν² + (1/(2ν²)) ∫ Q(θ) θᵀθ dθ − n/2 − (n/2) log ε²
           = (n/2) log(ν²/ε²) + (‖θ*‖² + nε²)/(2ν²) − n/2.    (5)

To prove the result, we also need to relate the error of L_Q to that of L_{θ*}. By taking a Taylor expansion of f_y (for any y appearing in S), we have

  f_y(z) = f_y(z̄) + f'_y(z̄)(z − z̄) + ½ f''_y(ξ(z))(z − z̄)²

for some appropriate function ξ. Thus, if z is a random variable with mean z̄, we have

  E_z[f_y(z)] = f_y(z̄) + f'_y(z̄) · 0 + E_z[½ f''_y(ξ(z))(z − z̄)²]
             ≤ f_y(z̄) + (c/2) E_z[(z − z̄)²] = f_y(z̄) + (c/2) Var(z).

Consider a single example (x, y). We can apply the argument above with z = θᵀx and z̄ = θ*ᵀx, where θ ~ Q. Note that E[z] = z̄, since Q has mean θ*. Also,

  Var(θᵀx) = xᵀ(ε²Iₙ)x = ‖x‖²ε² ≤ ε²,

because we previously assumed that ‖x‖ ≤ 1. Thus, we have

  E_Q[f_y(θᵀx)] ≤ f_y(θ*ᵀx) + cε²/2.
Since ℓ_t^Q = E_Q[f_{y_t}(θᵀx_t)] and ℓ_t(θ*) = f_{y_t}(θ*ᵀx_t), we can sum both sides from t = 1 to T to obtain

  L_Q(S) ≤ L_{θ*}(S) + Tcε²/2.

Putting this together with Lemma 2.1 and Equation 5, we find that

  L_BMA(S) ≤ L_{θ*}(S) + Tcε²/2 + (n/2) log(ν²/ε²) + (‖θ*‖² + nε²)/(2ν²) − n/2.

Finally, by choosing ε² = nν²/(n + Tcν²) and simplifying, Theorem 2.2 follows. ∎

2.3 A Lower Bound for Gaussian Linear Regression

The following lower bound shows that, for linear regression, no other prediction scheme is better than Bayes in the worst case when our penalty term is ‖θ‖²/2. Here, we compare to an arbitrary predictive distribution q(y | x_t, S_{t−1}) for prediction at time t, which suffers an instantaneous loss ℓ_t^q = −log q(y_t | x_t, S_{t−1}). In the theorem, ⌊·⌋ denotes the floor function.

Theorem 2.3: Let L_θ(S) be the loss under the Gaussian linear regression model using the parameter θ, and let ν² = σ² = 1. For any set of predictive distributions q(y | x_t, S_{t−1}), there exists an S with ‖x_t‖ ≤ 1 such that

  Σ_{t=1}^T ℓ_t^q ≥ inf_θ [L_θ(S) + ‖θ‖²/2] + (n/2) log(⌊T/n⌋ + 1).

Proof (sketch): If n = 1 and if S is such that x_t = 1, one can show the equality

  L_BMA(S) = inf_θ [L_θ(S) + θ²/2] + (1/2) log(1 + T).

Let Y = {y₁, ..., y_T} and X = {1, ..., 1}. By the chain rule of conditional probabilities, L_BMA(S) = −log p(Y | X), where p is the Gaussian linear regression model, and q's loss is Σ_{t=1}^T ℓ_t^q = −log q(Y | X). For any predictive distribution q that differs from p, there must exist some sequence S such that −log q(Y | X) is greater than −log p(Y | X), since probabilities are normalized. Such a sequence proves the result for n = 1. The modification for n dimensions follows: S is broken into n subsequences of length ⌊T/n⌋, where in the k-th subsequence only dimension k has x_{t,k} = 1 and the other dimensions are set to 0. The result follows due to the additivity of the losses on these subsequences. ∎

3 MAP Estimation

We now present bounds for MAP algorithms for both Gaussian linear regression (i.e., ridge regression) and logistic regression.
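Before the MAP analysis, the last step in the proof of Theorem 2.2 can be checked numerically. The sketch below (illustrative, with arbitrary constants) evaluates the combined right-hand-side terms Tcε²/2 + KL(Q‖p₀) as a function of the variational parameter ε², and confirms that at ε² = nν²/(n + Tcν²) the expression collapses exactly to the regret terms of the bound (4) and is minimized there.

```python
import math

def regret_terms(eps2, T, c, nu2, n, theta_norm2):
    # variational slack T*c*eps^2/2 plus KL(Q || p_0) from Eq. (5)
    return (T * c * eps2 / 2
            + (n / 2) * math.log(nu2 / eps2)
            + (theta_norm2 + n * eps2) / (2 * nu2)
            - n / 2)

def theorem_bound(T, c, nu2, n, theta_norm2):
    # regret part of Theorem 2.2: ||theta||^2/(2 nu^2) + (n/2) log(1 + T c nu^2 / n)
    return theta_norm2 / (2 * nu2) + (n / 2) * math.log(1 + T * c * nu2 / n)

# arbitrary illustrative constants
T, c, nu2, n, tn2 = 100, 1.0, 1.0, 3, 2.0
eps2_star = n * nu2 / (n + T * c * nu2)   # the choice made in the proof
at_star = regret_terms(eps2_star, T, c, nu2, n, tn2)
claimed = theorem_bound(T, c, nu2, n, tn2)
```

Setting the derivative of regret_terms with respect to ε² to zero (Tc/2 − n/(2ε²) + n/(2ν²) = 0) recovers exactly eps2_star, so the tuned bound is the tightest this family of Gaussian comparators Q can give.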
These algorithms use the maximizer θ̂_{t−1} of p_{t−1}(θ) to form their predictive distribution p(y | x_t, θ̂_{t−1}) at time t, as opposed to BMA's predictive distribution p(y | x_t, S_{t−1}). As expected, these bounds are weaker than BMA's, though perhaps not unreasonably so.

3.1 The Square Loss and Ridge Regression

Before we provide the MAP bound, let us first present the form of the posteriors and the predictions for Gaussian linear regression. Define

  A_t = (1/ν²) Iₙ + (1/σ²) Σᵢ₌₁ᵗ xᵢxᵢᵀ,  and  b_t = (1/σ²) Σᵢ₌₁ᵗ xᵢyᵢ.

We now have that

  p_t(θ) = p(θ | S_t) = N(θ; θ̂_t, Σ̂_t),    (6)

where θ̂_t = A_t⁻¹ b_t and Σ̂_t = A_t⁻¹. Also, the predictions at time t + 1 are given by

  p(y_{t+1} | x_{t+1}, S_t) = N(y_{t+1}; ŷ_{t+1}, s²_{t+1}),    (7)

where ŷ_{t+1} = θ̂_tᵀ x_{t+1} and s²_{t+1} = x_{t+1}ᵀ Σ̂_t x_{t+1} + σ². In contrast, the prediction of a fixed expert using parameter θ would be

  p(y_t | x_t, θ) = N(y_t; ȳ_t, σ²),    (8)

where ȳ_t = θᵀx_t. Now the BMA loss is

  L_BMA(S) = Σ_{t=1}^T [ (y_t − θ̂_{t−1}ᵀx_t)²/(2s²_t) + ½ log(2πs²_t) ].    (9)

Importantly, note how Bayes is adaptively weighting the squared terms with the inverse variances 1/s²_t, which depend on the current observation x_t. The logloss of using a fixed expert θ is just

  L_θ(S) = Σ_{t=1}^T [ (y_t − θᵀx_t)²/(2σ²) + ½ log(2πσ²) ].    (10)

The MAP procedure (referred to as ridge regression) uses p(y | x_t, θ̂_{t−1}), which has a fixed variance. Hence, the MAP loss is essentially the square loss, and we define it as such:

  L_MAP(S) = Σ_{t=1}^T (y_t − θ̂_{t−1}ᵀx_t)²,  L_θ(S) = Σ_{t=1}^T (y_t − θᵀx_t)²,    (11)

where θ̂_{t−1} is the MAP estimate (see Equation 6).

Corollary 3.1: Let γ² = σ² + ν². For all S such that ‖x_t‖ ≤ 1 and for all θ, we have

  L_MAP(S) ≤ (γ²/σ²) L_θ(S) + γ²‖θ‖²/ν² + γ²n log(1 + Tν²/(σ²n)).

Proof: Using Equations 9 and 10 and Theorem 2.2 (with c = 1/σ²), we have

  Σ_{t=1}^T (y_t − θ̂_{t−1}ᵀx_t)²/(2s²_t) ≤ Σ_{t=1}^T (y_t − θᵀx_t)²/(2σ²) + ‖θ‖²/(2ν²) + (n/2) log(1 + Tν²/(σ²n)) + Σ_{t=1}^T ½ log(σ²/s²_t).

Equations 6 and 7 imply that σ² ≤ s²_t ≤ σ² + ν². Using this, the result follows by noting that the last term is negative and by multiplying both sides of the equation by 2(σ² + ν²). ∎

We might have hoped that MAP were more competitive, in that the leading coefficient in front of the L_θ(S) term in the bound be 1 (as in Theorem 2.2) rather than γ²/σ². Crudely, the reason that MAP is not as effective as BMA is that MAP does not take into account the uncertainty in its predictions, so the squared terms cannot be reweighted to take variance into account (compare Equations 9 and 11). Some previous non-Bayesian algorithms did in fact have bounds with this coefficient being unity. Vovk (2001) provides such an algorithm, though this algorithm differs from MAP in that its predictions at time t are a nonlinear function of x_t (it uses A_t instead of A_{t−1} at time t). Foster (1991) provides a bound with this coefficient being 1 under more restrictive assumptions. Azoury and Warmuth (2001) also provide a bound with a coefficient of 1 by using a MAP procedure with clipping.
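The closed-form updates above are easy to run online. The following 1-D sketch (toy data, illustrative only) tracks the conjugate posterior of Equation 6, forms the BMA predictive variance s²_t of Equation 7, checks that σ² ≤ s²_t ≤ σ² + ν² when |x_t| ≤ 1, and compares the square loss L_MAP(S) of the ridge predictions against the guarantee of Corollary 3.1 evaluated at the batch least-squares parameter.

```python
import math

sigma2, nu2 = 1.0, 1.0
data = [(1.0, 0.3), (0.6, -0.1), (-1.0, 0.4), (0.9, 0.8)]  # toy (x_t, y_t), |x_t| <= 1

A, b = 1.0 / nu2, 0.0        # scalar versions of A_t and b_t (n = 1)
L_bma, L_map = 0.0, 0.0
for x, y in data:
    theta_hat = b / A                      # MAP / posterior-mean estimate (Eq. 6)
    s2 = x * x / A + sigma2                # BMA predictive variance (Eq. 7)
    assert sigma2 <= s2 <= sigma2 + nu2    # holds because |x_t| <= 1
    L_bma += (y - theta_hat * x) ** 2 / (2 * s2) + 0.5 * math.log(2 * math.pi * s2)
    L_map += (y - theta_hat * x) ** 2      # square loss of ridge regression (Eq. 11)
    A += x * x / sigma2                    # conjugate posterior update
    b += x * y / sigma2

# Corollary 3.1 evaluated at the batch least-squares comparator
theta_star = sum(x * y for x, y in data) / sum(x * x for x, y in data)
L_star = sum((y - theta_star * x) ** 2 for x, y in data)
gamma2, T, n = sigma2 + nu2, len(data), 1
bound = (gamma2 / sigma2 * L_star
         + gamma2 * theta_star ** 2 / nu2
         + gamma2 * n * math.log(1 + T * nu2 / (sigma2 * n)))
```

On this toy sequence the realized L_map sits well below the (loose) corollary bound, as expected.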
Their algorithm thresholds the prediction ŷ_t = θ̂_{t−1}ᵀx_t if it is larger than some upper bound. Note that we do not assume any upper bound on y_t.
As the following lower bound shows, it is not possible for the MAP linear regression algorithm to have a coefficient of 1 on L_θ(S) together with a reasonable regret bound. A similar lower bound is in Vovk (2001), which doesn't apply to our setting, where we have the additional constraint ‖x_t‖ ≤ 1.

Theorem 3.2: Let γ² = σ² + ν². There exists a sequence S with ‖x_t‖ ≤ 1 such that

  L_MAP(S) ≥ inf_θ [L_θ(S) + ‖θ‖²] + Ω(T).

Proof (sketch): Let S be a length T + 1 sequence, with n = 1, where for the first T steps, x_t = 1/√T and y_t = 1, and at step T + 1, x_{T+1} = 1 and y_{T+1} = 0. Here, one can show that inf_θ [L_θ(S) + ‖θ‖²] = T/4 and L_MAP(S) ≥ 3T/8, and the result follows. ∎

3.2 Logistic Regression

MAP estimation is often used for regularized logistic regression, since it requires only solving a convex program, while BMA has to deal with a high-dimensional integral over θ that is intractable to compute exactly. Letting θ̂_{t−1} be the maximizer of the posterior p_{t−1}(θ), define

  L_MAP(S) = Σ_{t=1}^T −log p(y_t | x_t, θ̂_{t−1}).

As in the square loss case, the bound we present is multiplicatively worse by a factor of 4.

Theorem 3.3: In the logistic regression model with ν ≤ 0.5, we have that for all sequences S such that ‖x_t‖ ≤ 1 and y_t ∈ {0, 1}, and for all θ,

  L_MAP(S) ≤ 4 L_θ(S) + 2‖θ‖²/ν² + 2n log(1 + Tν²/(4n)).

Proof (sketch): Assume n = 1 (the general case is analogous). The proof consists of showing that

  ℓ̂_t := −log p(y_t | x_t, θ̂_{t−1}) ≤ 4 ℓ_t^BMA.

Without loss of generality, assume y_t = 1 and x_t ≥ 0; for convenience, we write θ̂ for θ̂_{t−1} and x for x_t. Now the BMA prediction is ∫ p(1 | x, θ) p_{t−1}(θ) dθ, and ℓ_t^BMA is the negative log of this. Note that θ = ∞ gives probability 1 to y_t = 1, and this setting of θ minimizes the loss at time t. Since we do not have a closed-form solution of the posterior p_{t−1}(θ), let us work with another distribution q(θ) in lieu of p_{t−1}(θ) that satisfies certain properties. Define

  p_q = ∫ p(1 | x, θ) q(θ) dθ,

which can be viewed as the prediction using q rather than the posterior. We choose q to be the rectification of the Gaussian N(θ; θ̂, ν²), such that there is positive probability only for θ ≥ θ̂ and the distribution is renormalized.
With this choice, we first show that the loss of q, namely −log p_q, is less than or equal to ℓ_t^BMA. Then we complete the proof by showing that ℓ̂_t ≤ 4(−log p_q), which suffices since −log p_q ≤ ℓ_t^BMA.

Consider the q which maximizes p_q subject to the following constraints: let q have its maximum at θ̂; let q(θ) = 0 if θ < θ̂ (intuitively, mass to the left of θ̂ only makes p_q smaller); and impose the curvature constraint (log q(θ))'' ≥ −1/ν². We now argue that for such a q, −log p_q ≤ ℓ_t^BMA. First note that, due to the Gaussian prior p₀, it is straightforward to show that (log p_{t−1}(θ))'' ≤ −1/ν² (the prior imposes some minimum curvature). Now if this posterior p_{t−1} were rectified, with support only for θ ≥ θ̂, and renormalized, then such a modified distribution clearly satisfies the aforementioned constraints, and it has loss less than the loss of p_{t−1} itself (since the rectification only increases the prediction). Hence, the maximizer q of p_q subject to the constraints has loss less than that of p_{t−1}, i.e., −log p_q ≤ ℓ_t^BMA.

We now show that such a maximal q is the renormalized rectification of the Gaussian N(θ; θ̂, ν²), with positive probability only for θ ≥ θ̂. Assume some other q' satisfied these constraints and maximized p_q. It cannot be that q'(θ̂) < q(θ̂), else one can show q' would not be normalized (since, with q'(θ̂) < q(θ̂), the curvature constraint imposes that this q' cannot cross q). It also cannot be that q'(θ̂) > q(θ̂). To see this, note that normalization and curvature imply that q' must cross q only once. Now a sufficiently slight perturbation of this crossing point to the left, by shifting more mass from the left to the right side of the crossing point, would not violate the curvature constraint and would result in a new distribution with larger p_q, contradicting the maximality of q'. Hence, we have that q'(θ̂) = q(θ̂). This, along with the curvature constraint and normalization, implies that the rectified Gaussian q is the unique solution.

To complete the proof, we show ℓ̂_t = −log p(1 | x, θ̂) ≤ 4(−log p_q). We consider two cases, θ̂ < 0 and θ̂ ≥ 0. We start with the case θ̂ < 0. Using the boundedness of the derivative, |∂ log p(1 | x, θ)/∂θ| ≤ 1, and the fact that q only has support for θ ≥ θ̂, we have

  p_q = ∫ exp(log p(1 | x, θ)) q(θ) dθ ≤ ∫ exp(log p(1 | x, θ̂) + (θ − θ̂)) q(θ) dθ ≤ 1.6 p(1 | x, θ̂),

where we have used that ∫ exp(θ − θ̂) q(θ) dθ < 1.6, which can be verified numerically using the definition of q with ν ≤ 0.5. Now observe that for θ̂ < 0, we have the lower bound −log p(1 | x, θ̂) ≥ log 2. Hence,

  −log p_q ≥ −log p(1 | x, θ̂) − log 1.6 ≥ (1 − log 1.6 / log 2)(−log p(1 | x, θ̂)) ≥ 0.3 ℓ̂_t,

which shows ℓ̂_t ≤ 4(−log p_q) in this case.

Now for the case θ̂ ≥ 0. Let σ(·) be the sigmoid function, so p(1 | x, θ) = σ(xθ) and p_q = ∫ σ(xθ) q(θ) dθ. Since the sigmoid is concave for nonnegative arguments and, for this case, q only has support on nonnegative θ, Jensen's inequality gives

  p_q ≤ σ(x(θ̂ + ν)) ≤ σ(xθ̂ + ν),

using that the mean of q is at most θ̂ + ν and that xν ≤ ν (as x ≤ 1 and θ̂ + ν > 0). Using properties of σ, one can show |(log σ(z))'| ≤ −log σ(z) for all z. Integrating this derivative condition over an interval of length ν gives, for all z,

  −log σ(z + ν) ≥ e^{−ν} (−log σ(z)).

Applying this at z = xθ̂, together with the previous bound on p_q, we have

  −log p_q ≥ −log σ(xθ̂ + ν) ≥ e^{−ν} (−log σ(xθ̂)) = e^{−ν} ℓ̂_t ≥ ℓ̂_t/4,

since ν ≤ 0.5. This proves the claim when θ̂ ≥ 0. ∎

Acknowledgments. We thank Dean Foster for numerous helpful discussions.
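The MAP estimator analyzed above is just the minimizer of a regularized convex logloss, so it can be computed by simple gradient descent. The 1-D sketch below (toy data, step size, and iteration count are assumptions, not from the paper) refits θ̂_{t−1} on each prefix of the sequence and accumulates the online MAP logloss L_MAP(S); the prior variance ν² = 0.25 corresponds to ν = 0.5, matching the condition of Theorem 3.3.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def map_estimate(data, nu2=0.25, steps=2000, lr=0.1):
    """Gradient descent on the negative log-posterior (n = 1):
       sum_t [log(1 + e^{theta x_t}) - y_t theta x_t] + theta^2 / (2 nu^2)."""
    theta = 0.0
    for _ in range(steps):
        grad = theta / nu2                     # gradient of the prior term
        for x, y in data:
            grad += (sigmoid(theta * x) - y) * x   # gradient of the logloss
        theta -= lr * grad
    return theta

data = [(1.0, 1), (0.8, 1), (-1.0, 0), (0.5, 1), (-0.7, 0)]  # toy (x_t, y_t)
theta_hat = map_estimate(data)

# online MAP loss: fit on the prefix S_{t-1}, then score example t
L_map, seen = 0.0, []
for x, y in data:
    th = map_estimate(seen)
    p = sigmoid(th * x)
    L_map += -math.log(p if y == 1 else 1 - p)
    seen.append((x, y))
```

The objective is strongly convex (curvature at least 1/ν²), so the fixed step size above converges quickly; in practice one would use Newton's method or a quasi-Newton solver.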
This work was supported by the Department of the Interior/DARPA under contract NBCHD.

References

Azoury, K. S. and Warmuth, M. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3).
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D., Schapire, R., and Warmuth, M. (1997). How to use expert advice. J. ACM, 44(3).
Cesa-Bianchi, N., Helmbold, D., and Panizza, S. (1998). On Bayes methods for on-line boolean prediction. Algorithmica, 22.
Dawid, A. (1984). Statistical theory: The prequential approach. J. Royal Statistical Society.
Foster, D. P. (1991). Prediction in the worst case. Annals of Statistics, 19.
Freund, Y. and Schapire, R. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103.
Freund, Y., Schapire, R., Singer, Y., and Warmuth, M. (1997). Using and combining predictors that specialize. In STOC.
Grunwald, P. (2005). A tutorial introduction to the minimum description length principle.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall.
Ng, A. Y. and Jordan, M. (2001). Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection. In Proceedings of the 18th Int'l Conference on Machine Learning.
Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69.
More informationy Xw 2 2 y Xw λ w 2 2
CS 189 Introduction to Machine Learning Spring 2018 Note 4 1 MLE and MAP for Regression (Part I) So far, we ve explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More information1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016
AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework
More informationThe Multi-Arm Bandit Framework
The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94
More informationBetter Algorithms for Selective Sampling
Francesco Orabona Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy francesco@orabonacom nicolocesa-bianchi@unimiit Abstract We study online algorithms for selective sampling that use regularized
More informationOnline Learning of Non-stationary Sequences
Online Learning of Non-stationary Sequences Claire Monteleoni and Tommi Jaakkola MIT Computer Science and Artificial Intelligence Laboratory 200 Technology Square Cambridge, MA 02139 {cmontel,tommi}@ai.mit.edu
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013
COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMachine Learning 1. Linear Classifiers. Marius Kloft. Humboldt University of Berlin Summer Term Machine Learning 1 Linear Classifiers 1
Machine Learning 1 Linear Classifiers Marius Kloft Humboldt University of Berlin Summer Term 2014 Machine Learning 1 Linear Classifiers 1 Recap Past lectures: Machine Learning 1 Linear Classifiers 2 Recap
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationA Generalization of Principal Component Analysis to the Exponential Family
A Generalization of Principal Component Analysis to the Exponential Family Michael Collins Sanjoy Dasgupta Robert E. Schapire AT&T Labs Research 8 Park Avenue, Florham Park, NJ 7932 mcollins, dasgupta,
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationGame Theory, On-line prediction and Boosting (Freund, Schapire)
Game heory, On-line prediction and Boosting (Freund, Schapire) Idan Attias /4/208 INRODUCION he purpose of this paper is to bring out the close connection between game theory, on-line prediction and boosting,
More informationPosterior Regularization
Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationLecture 6. Regression
Lecture 6. Regression Prof. Alan Yuille Summer 2014 Outline 1. Introduction to Regression 2. Binary Regression 3. Linear Regression; Polynomial Regression 4. Non-linear Regression; Multilayer Perceptron
More informationWE focus on the following linear model of adaptive filtering:
1782 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 5, MAY 2006 The p-norm Generalization of the LMS Algorithm for Adaptive Filtering Jyrki Kivinen, Manfred K. Warmuth, and Babak Hassibi Abstract
More informationNew bounds on the price of bandit feedback for mistake-bounded online multiclass learning
Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre
More informationPerceptron Mistake Bounds
Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce
More informationSelf Supervised Boosting
Self Supervised Boosting Max Welling, Richard S. Zemel, and Geoffrey E. Hinton Department of omputer Science University of Toronto 1 King s ollege Road Toronto, M5S 3G5 anada Abstract Boosting algorithms
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationToday. Calculus. Linear Regression. Lagrange Multipliers
Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationMachine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang
Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationMidterm Exam, Spring 2005
10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name
More informationOnline prediction with expert advise
Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with
More informationOnline Learning with Feedback Graphs
Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between
More informationTutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning
Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret
More information1. Implement AdaBoost with boosting stumps and apply the algorithm to the. Solution:
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 October 31, 2016 Due: A. November 11, 2016; B. November 22, 2016 A. Boosting 1. Implement
More informationOnline Kernel PCA with Entropic Matrix Updates
Dima Kuzmin Manfred K. Warmuth Computer Science Department, University of California - Santa Cruz dima@cse.ucsc.edu manfred@cse.ucsc.edu Abstract A number of updates for density matrices have been developed
More informationMachine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /
Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationMachine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart
Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural
More informationLearning Gaussian Process Models from Uncertain Data
Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
More informationLinear and logistic regression
Linear and logistic regression Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Linear and logistic regression 1/22 Outline 1 Linear regression 2 Logistic regression 3 Fisher discriminant analysis
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationRobust Selective Sampling from Single and Multiple Teachers
Robust Selective Sampling from Single and Multiple Teachers Ofer Dekel Microsoft Research oferd@microsoftcom Claudio Gentile DICOM, Università dell Insubria claudiogentile@uninsubriait Karthik Sridharan
More informationA Drifting-Games Analysis for Online Learning and Applications to Boosting
A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire
More informationDiscrete Mathematics and Probability Theory Fall 2015 Lecture 21
CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about
More informationThe Moment Method; Convex Duality; and Large/Medium/Small Deviations
Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential
More information0.1 Motivating example: weighted majority algorithm
princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes
More informationClassification Based on Probability
Logistic Regression These slides were assembled by Byron Boots, with only minor modifications from Eric Eaton s slides and grateful acknowledgement to the many others who made their course materials freely
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationOnline Advertising is Big Business
Online Advertising Online Advertising is Big Business Multiple billion dollar industry $43B in 2013 in USA, 17% increase over 2012 [PWC, Internet Advertising Bureau, April 2013] Higher revenue in USA
More information