Online Bounds for Bayesian Algorithms


Sham M. Kakade
Computer and Information Science Department
University of Pennsylvania

Andrew Y. Ng
Computer Science Department
Stanford University

Abstract

We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms' modeling assumptions are correct, and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst-case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.

1 Introduction

The last decade has seen significant progress in online learning algorithms that perform well even in adversarial settings (e.g., the expert algorithms of Cesa-Bianchi et al., 1997). In the online learning framework, one makes minimal assumptions on the data presented to the learner, and the goal is to obtain good relative performance on arbitrary sequences. In statistics, this philosophy has been espoused by Dawid (1984) in the prequential approach.

We study the performance of Bayesian algorithms in this adversarial setting, in which the process generating the data is not restricted to come from the prior (data sequences may be arbitrary). Our motivation is similar to that given in the online learning literature and the MDL literature (see Grunwald, 2005): namely, that models are often chosen to balance realism with computational tractability, and often the assumptions made by the Bayesian are not truly believed to hold (e.g., i.i.d. assumptions). Our goal is to study the performance of Bayesian algorithms in the worst case, where all modeling assumptions may be violated.

We consider the widely used class of generalized linear models (focusing on Gaussian linear regression and logistic regression) and provide relative performance bounds comparing to the best model in our model class when the cost function is the logloss. Though the regression problem has been studied in a competitive framework and, indeed, many ingenious algorithms have been devised for it (e.g., Foster, 1991; Vovk, 2001; Azoury and Warmuth, 2001), our goal here is to study how the more widely used, and often simpler, Bayesian algorithms fare. Our bounds for linear regression are comparable to the best bounds in the literature, though we use the logloss as opposed to the square loss.

The competitive approach to regression started with Foster (1991), who provided competitive bounds for a variant of the ridge regression algorithm under the square loss. Vovk (2001) presents many competitive algorithms and provides bounds for linear regression under the square loss with an algorithm that differs slightly from the Bayesian one. Azoury and Warmuth (2001) rederive Vovk's bound with a different analysis based on Bregman distances.

Our work differs from these in that we consider Bayesian Gaussian linear regression, while previous work typically used more complex, cleverly devised algorithms which are either variants of a MAP procedure (as in Vovk, 2001) or involve other steps such as clipping predictions (as in Azoury and Warmuth, 2001). These distinctions are discussed in more detail in Section 3.1. We should also note that when the loss function is the logloss, multiplicative weights algorithms are sometimes identical to Bayes' rule with particular choices of the parameters (see Freund and Schapire, 1999). Furthermore, Bayesian algorithms have been used in some online learning settings, such as the sleeping experts setting of Freund et al. (1997) and the online boolean prediction setting of Cesa-Bianchi et al. (1998). Ng and Jordan (2001) also analyzed an online Bayesian algorithm, but assumed that the data generation process was not too different from the model prior. To our knowledge, there have been no studies of Bayesian generalized linear models in an adversarial online learning setting (though many variants have been considered, as discussed above).

We also examine maximum a posteriori (MAP) algorithms for both Gaussian linear regression (i.e., ridge regression) and regularized logistic regression. These algorithms are often used in practice, particularly in logistic regression, where Bayesian model averaging is computationally expensive but the MAP algorithm requires only solving a convex problem. As expected, MAP algorithms are somewhat less competitive than full Bayesian model averaging, though not unreasonably so.

2 Bayesian Model Averaging

We now consider the Bayesian model averaging (BMA) algorithm and give a bound on its worst-case online loss. We start with some preliminaries. Let $x \in \mathbb{R}^n$ denote the inputs of a learning problem and $y \in \mathbb{R}$ the outputs. Consider a model from the generalized linear model family (see McCullagh and Nelder, 1989) that can be written

$$p(y \mid x, \theta) = p(y \mid \theta^T x),$$

where $\theta \in \mathbb{R}^n$ are the parameters of our model ($\theta^T$ denotes the transpose of $\theta$). Note that the predicted distribution of $y$ depends only on $\theta^T x$, which is linear in $\theta$. For example, in the case of Gaussian linear regression, we have

$$p(y \mid x, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\theta^T x - y)^2}{2\sigma^2} \right), \qquad (1)$$

where $\sigma^2$ is a fixed, known constant that is not a parameter of our model. In logistic regression, we would have

$$\log p(y \mid x, \theta) = y \log \frac{1}{1 + \exp(-\theta^T x)} + (1 - y) \log \frac{\exp(-\theta^T x)}{1 + \exp(-\theta^T x)}, \qquad (2)$$

where we assume $y \in \{0, 1\}$.

Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)\}$ be an arbitrary sequence of examples, possibly chosen by an adversary. We also use $S_t$ to denote the subsequence consisting of only the first $t$ examples. We assume throughout this paper that $\|x_t\| \le 1$, where $\|\cdot\|$ denotes the $L_2$ norm.

Assume that we are going to use a Bayesian algorithm to make our online predictions. Specifically, assume that we have a Gaussian prior on the parameters:

$$p(\theta) = N(\theta; \vec{0}, \nu^2 I_n),$$

where $I_n$ is the $n$-by-$n$ identity matrix, $N(\theta; \mu, \Sigma)$ is the Gaussian density with mean $\mu$ and covariance $\Sigma$, and $\nu^2 > 0$ is some fixed constant governing the prior variance. Also, let

$$p_t(\theta) = p(\theta \mid S_t) = \frac{\prod_{i=1}^t p(y_i \mid x_i, \theta)\, p(\theta)}{\int \prod_{i=1}^t p(y_i \mid x_i, \tilde\theta)\, p(\tilde\theta)\, d\tilde\theta}$$

be the posterior distribution over $\theta$ given the first $t$ training examples. We have that $p_0 = p$ is just the prior distribution.
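For concreteness, the two per-example loglosses above can be written as functions of $z = \theta^T x$. The following is a minimal Python/numpy sketch (ours, not code from the paper); the function names are our own.

```python
import numpy as np

def gaussian_logloss(y, z, sigma=1.0):
    """-log N(y; z, sigma^2): the Gaussian linear regression logloss of
    Equation (1), viewed as a function of z = theta^T x."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (z - y)**2 / (2 * sigma**2)

def logistic_logloss(y, z):
    """-log p(y | theta^T x = z) for logistic regression with y in {0, 1}
    (Equation (2)), written in a numerically stable form."""
    # log(1 + exp(-z)) when y = 1, and log(1 + exp(z)) when y = 0
    return np.logaddexp(0.0, -z) if y == 1 else np.logaddexp(0.0, z)
```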

On iteration $t$, we are given the input $x_t$, and our algorithm makes a prediction using the posterior distribution over the outputs:

$$p(y \mid x_t, S_{t-1}) = \int p(y \mid x_t, \theta)\, p(\theta \mid S_{t-1})\, d\theta.$$

We are then given the true label $y_t$, and we suffer logloss $-\log p(y_t \mid x_t, S_{t-1})$. We define the cumulative loss of the BMA algorithm after $T$ rounds to be

$$L_{\mathrm{BMA}}(S) = -\sum_{t=1}^T \log p(y_t \mid x_t, S_{t-1}).$$

Importantly, note that even though the algorithm we consider is a Bayesian one, our theoretical results do not assume that the data comes from any particular probabilistic model. In particular, the data may be chosen by an adversary.

We are interested in comparing against the loss of any "expert" that uses some fixed parameter $\theta \in \mathbb{R}^n$. Define $\ell_t^\theta = -\log p(y_t \mid x_t, \theta)$, and let

$$L_\theta(S) = \sum_{t=1}^T \ell_t^\theta = -\sum_{t=1}^T \log p(y_t \mid x_t, \theta).$$

Sometimes, we also wish to compare against distributions over experts. Given a distribution $Q$ over $\theta$, define $\ell_t^Q = -\int Q(\theta) \log p(y_t \mid x_t, \theta)\, d\theta$, and

$$L_Q(S) = \sum_{t=1}^T \ell_t^Q = \int Q(\theta)\, L_\theta(S)\, d\theta.$$

This is the expected logloss incurred by a procedure that first samples some $\theta \sim Q$ and then uses this $\theta$ for all its predictions. Here, the expectation is over the random $\theta$, not over the sequence of examples. Note that the expectation is of the logloss, which is a different type of averaging than in BMA, where the expectation and the log appear in the reverse order.

2.1 A Useful Variational Bound

The following lemma provides a worst-case bound on the loss incurred by Bayesian algorithms and will be useful for deriving our main result in the next section. A result very similar to this for finite model classes is given by Freund et al. (1997). For completeness, we prove the result here in its full generality, though our proof is similar to theirs. As usual, define $\mathrm{KL}(q \,\|\, p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta)}\, d\theta$.

Lemma 2.1: Let $Q$ be any distribution over $\theta$. Then for all sequences $S$,

$$L_{\mathrm{BMA}}(S) \le L_Q(S) + \mathrm{KL}(Q \,\|\, p_0).$$

Proof: Let $Y = \{y_1, \ldots, y_T\}$ and $X = \{x_1, \ldots, x_T\}$. The chain rule of conditional probabilities implies that $L_{\mathrm{BMA}}(S) = -\log p(Y \mid X)$ and $L_\theta(S) = -\log p(Y \mid X, \theta)$. So

$$L_{\mathrm{BMA}}(S) - L_Q(S) = -\log p(Y \mid X) + \int Q(\theta) \log p(Y \mid X, \theta)\, d\theta = \int Q(\theta) \log \frac{p(Y \mid X, \theta)}{p(Y \mid X)}\, d\theta.$$

By Bayes' rule, we have that $p_T(\theta) = \frac{p(Y \mid X, \theta)\, p_0(\theta)}{p(Y \mid X)}$. Continuing,

$$= \int Q(\theta) \log \frac{p_T(\theta)}{p_0(\theta)}\, d\theta = \int Q(\theta) \log \frac{Q(\theta)}{p_0(\theta)}\, d\theta - \int Q(\theta) \log \frac{Q(\theta)}{p_T(\theta)}\, d\theta = \mathrm{KL}(Q \,\|\, p_0) - \mathrm{KL}(Q \,\|\, p_T).$$

Together with the fact that $\mathrm{KL}(Q \,\|\, p_T) \ge 0$, this proves the lemma.
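As an illustration of the online BMA protocol (our own sketch, not an algorithm from the paper), the following code runs BMA for a one-dimensional parameter by approximating the posterior on a grid, and accumulates the cumulative logloss $L_{\mathrm{BMA}}(S)$. The helper `loglik(y, x, thetas)` is assumed to return a vector of log-likelihood values, one per grid point.

```python
import numpy as np

def online_bma_logloss(xs, ys, loglik, nu=1.0, grid=None):
    """Grid-approximated Bayesian model averaging over a scalar theta.
    At each round: predict with the posterior-averaged probability of y_t,
    suffer the logloss, then apply Bayes' rule to update the posterior."""
    thetas = np.linspace(-10.0, 10.0, 2001) if grid is None else grid
    log_post = -0.5 * thetas**2 / nu**2        # Gaussian prior, unnormalized
    log_post -= np.logaddexp.reduce(log_post)  # normalize on the grid
    total = 0.0
    for x, y in zip(xs, ys):
        # predictive log-probability: log sum_theta p(y|x,theta) p(theta|S_{t-1})
        log_pred = np.logaddexp.reduce(log_post + loglik(y, x, thetas))
        total += -log_pred
        log_post += loglik(y, x, thetas)       # Bayes update with the new example
        log_post -= np.logaddexp.reduce(log_post)
    return total

# Example usage with the logistic model of Equation (2):
# loglik = lambda y, x, th: -(np.logaddexp(0.0, -x * th) if y == 1
#                             else np.logaddexp(0.0, x * th))
```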

2.2 An Upper Bound for Generalized Linear Models

For the theorem that we shortly present, we need one new definition. Let $f_y(z) = -\log p(y \mid \theta^T x = z)$. Thus, $f_{y_t}(\theta^T x_t) = \ell_t^\theta$. Note that for linear regression as defined in Equation (1), we have that for all $y$,

$$f_y''(z) = \frac{1}{\sigma^2}, \qquad (3)$$

and for logistic regression as defined in Equation (2), we have that for $y \in \{0, 1\}$, $f_y''(z) \le 1/4$.

Theorem 2.2: Suppose $f_y(z)$ is twice continuously differentiable. Let $S$ be a sequence such that $\|x_t\| \le 1$ and such that, for some constant $c$, $f_{y_t}''(z) \le c$ for all $z$ and $t$. Then for all $\theta$,

$$L_{\mathrm{BMA}}(S) \le L_\theta(S) + \frac{\|\theta\|^2}{2\nu^2} + \frac{n}{2}\log\left(1 + \frac{T c \nu^2}{n}\right). \qquad (4)$$

The $\|\theta\|^2/(2\nu^2)$ term can be interpreted as a penalty term from our prior. The log term is how fast our loss could grow in comparison to the best $\theta$. Importantly, this extra loss is only logarithmic in $T$ in this adversarial setting. This bound is almost identical to those provided by Vovk (2001), Azoury and Warmuth (2001), and Foster (1991) for the linear regression case under the square loss; the only difference is that in their bounds, the last term is multiplied by an upper bound on $y_t^2$. In contrast, we require no bound on $y_t$ in the Gaussian linear regression case, due to the fact that we deal with the logloss (also recall $f_y''(z) = 1/\sigma^2$ for all $y$).

Proof: We use Lemma 2.1 with $Q = N(\tilde\theta; \theta, \epsilon^2 I_n)$, a normal distribution with mean $\theta$ and covariance $\epsilon^2 I_n$. Here, $\epsilon^2$ is a variational parameter that we later tune to get the tightest possible bound. Letting $H(Q) = \frac{n}{2}\log(2\pi e \epsilon^2)$ be the entropy of $Q$, we have

$$\mathrm{KL}(Q \,\|\, p_0) = \int Q(\tilde\theta)\, \log \left[ \frac{1}{(2\pi)^{n/2} |\nu^2 I_n|^{1/2}} \exp\left(-\frac{\tilde\theta^T\tilde\theta}{2\nu^2}\right) \right]^{-1} d\tilde\theta - H(Q)$$
$$= \frac{n}{2}\log\nu^2 + \frac{1}{2\nu^2}\int Q(\tilde\theta)\, \tilde\theta^T\tilde\theta\, d\tilde\theta - \frac{n}{2} - \frac{n}{2}\log\epsilon^2 = \frac{n}{2}\log\nu^2 + \frac{\|\theta\|^2 + n\epsilon^2}{2\nu^2} - \frac{n}{2} - \frac{n}{2}\log\epsilon^2. \qquad (5)$$

To prove the result, we also need to relate the loss $L_Q$ to that of $L_\theta$. By taking a Taylor expansion of $f_y$ (for $y$ appearing in $S$), we have that

$$f_y(z) = f_y(\bar z) + f_y'(\bar z)(z - \bar z) + \tfrac{1}{2} f_y''(\xi(z))\,(z - \bar z)^2$$

for some appropriate function $\xi$. Thus, if $z$ is a random variable with mean $\bar z$, we have

$$E_z[f_y(z)] = f_y(\bar z) + f_y'(\bar z)\cdot 0 + \tfrac{1}{2} E_z\!\left[ f_y''(\xi(z))\,(z - \bar z)^2 \right] \le f_y(\bar z) + \tfrac{c}{2} E_z\!\left[ (z - \bar z)^2 \right] = f_y(\bar z) + \tfrac{c}{2}\,\mathrm{Var}(z).$$

Consider a single example $(x, y)$. We can apply the argument above with $z = \tilde\theta^T x$ and $\bar z = \theta^T x$, where $\tilde\theta \sim Q$. Note that $E[z] = \bar z$ since $Q$ has mean $\theta$. Also, $\mathrm{Var}(\tilde\theta^T x) = x^T (\epsilon^2 I_n)\, x = \epsilon^2 \|x\|^2 \le \epsilon^2$, because we previously assumed that $\|x\| \le 1$. Thus, we have

$$E_{\tilde\theta \sim Q}\left[ f_y(\tilde\theta^T x) \right] \le f_y(\theta^T x) + \frac{c\epsilon^2}{2}.$$
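As a quick numeric illustration of Theorem 2.2 (our own, with hypothetical parameter values), the extra loss relative to the best fixed $\theta$ grows only logarithmically in $T$:

```python
import numpy as np

def bma_regret_penalty(theta_sq_norm, T, n, c, nu=1.0):
    """The additive penalty on the right-hand side of Theorem 2.2:
    ||theta||^2 / (2 nu^2) + (n/2) log(1 + T c nu^2 / n)."""
    return theta_sq_norm / (2 * nu**2) + 0.5 * n * np.log(1 + T * c * nu**2 / n)

# Gaussian linear regression with sigma^2 = 1 (so c = 1), n = 10, ||theta||^2 = 1:
for T in (100, 1000, 10000):
    print(T, round(bma_regret_penalty(1.0, T, n=10, c=1.0), 2))
```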

Since $\ell_t^Q = E_{\tilde\theta\sim Q}[f_{y_t}(\tilde\theta^T x_t)]$ and $\ell_t^\theta = f_{y_t}(\theta^T x_t)$, we can sum both sides from $t = 1$ to $T$ to obtain

$$L_Q(S) \le L_\theta(S) + \frac{T c \epsilon^2}{2}.$$

Putting this together with Lemma 2.1 and Equation (5), we find that

$$L_{\mathrm{BMA}}(S) \le L_\theta(S) + \frac{T c \epsilon^2}{2} + \frac{n}{2}\log\nu^2 + \frac{\|\theta\|^2 + n\epsilon^2}{2\nu^2} - \frac{n}{2} - \frac{n}{2}\log\epsilon^2.$$

Finally, by choosing $\epsilon^2 = \frac{n\nu^2}{n + T c \nu^2}$ and simplifying, Theorem 2.2 follows.

2.3 A Lower Bound for Gaussian Linear Regression

The following lower bound shows that, for linear regression, no other prediction scheme is better than Bayes in the worst case when the penalty term is $\|\theta\|^2/2$. Here, we compare to an arbitrary predictive distribution $q(y \mid x_t, S_{t-1})$ used for prediction at time $t$, which suffers an instantaneous loss $\ell_t^q = -\log q(y_t \mid x_t, S_{t-1})$. In the theorem, $\lfloor \cdot \rfloor$ denotes the floor function.

Theorem 2.3: Let $L_\theta(S)$ be the loss under the Gaussian linear regression model using the parameter $\theta$, and let $\nu^2 = \sigma^2 = 1$. For any set of predictive distributions $q(y \mid x_t, S_{t-1})$, there exists an $S$ with $\|x_t\| \le 1$ such that

$$\sum_{t=1}^T \ell_t^q \ge \inf_\theta \left\{ L_\theta(S) + \frac{\|\theta\|^2}{2} \right\} + \frac{n}{2}\log\left(1 + \left\lfloor \frac{T}{n} \right\rfloor\right).$$

Proof: (sketch) If $n = 1$ and if $S$ is such that $x_t = 1$, one can show the equality

$$L_{\mathrm{BMA}}(S) = \inf_\theta \left\{ L_\theta(S) + \frac{\theta^2}{2} \right\} + \frac{1}{2}\log(1 + T).$$

Let $Y = \{y_1, \ldots, y_T\}$ and $X = \{1, \ldots, 1\}$. By the chain rule of conditional probabilities, $L_{\mathrm{BMA}}(S) = -\log p(Y \mid X)$, where $p$ is the Gaussian linear regression model (with the prior marginalized out), and $q$'s loss is $\sum_{t=1}^T \ell_t^q = -\log q(Y \mid X)$. For any predictive distribution $q$ that differs from $p$, there must exist some sequence $S$ such that $-\log q(Y \mid X)$ is greater than $-\log p(Y \mid X)$, since probabilities are normalized. Such a sequence proves the result for $n = 1$. The modification for $n$ dimensions is as follows: $S$ is broken into $n$ subsequences of length $\lfloor T/n \rfloor$, where in each subsequence only one dimension $k$ has $x_{t,k} = 1$ and the other dimensions are set to $0$. The result then follows from the additivity of the losses on these subsequences.

3 MAP Estimation

We now present bounds for MAP algorithms for both Gaussian linear regression (i.e., ridge regression) and logistic regression. These algorithms use the maximizer $\hat\theta_{t-1}$ of the posterior $p_{t-1}$ to form their predictive distribution $p(y \mid x_t, \hat\theta_{t-1})$ at time $t$, as opposed to BMA's predictive distribution $p(y \mid x_t, S_{t-1})$. As expected, these bounds are weaker than those for BMA, though perhaps not unreasonably so.

3.1 The Square Loss and Ridge Regression

Before we provide the MAP bound, let us first present the form of the posteriors and the predictions for Gaussian linear regression. Define

$$A_t = \frac{1}{\nu^2} I_n + \frac{1}{\sigma^2} \sum_{i=1}^t x_i x_i^T, \qquad b_t = \frac{1}{\sigma^2} \sum_{i=1}^t x_i y_i.$$

We now have that

$$p_t(\theta) = p(\theta \mid S_t) = N(\theta; \hat\theta_t, \hat\Sigma_t), \qquad (6)$$

where $\hat\theta_t = A_t^{-1} b_t$ and $\hat\Sigma_t = A_t^{-1}$. Also, the prediction at time $t+1$ is given by

$$p(y_{t+1} \mid x_{t+1}, S_t) = N(y_{t+1}; \hat y_{t+1}, s_{t+1}^2), \qquad (7)$$

where $\hat y_{t+1} = \hat\theta_t^T x_{t+1}$ and $s_{t+1}^2 = x_{t+1}^T \hat\Sigma_t x_{t+1} + \sigma^2$. In contrast, the prediction of a fixed expert using parameter $\theta$ would be

$$p(y_t \mid x_t, \theta) = N(y_t; \bar y_t, \sigma^2), \qquad (8)$$

where $\bar y_t = \theta^T x_t$. Now the BMA loss is

$$L_{\mathrm{BMA}}(S) = \sum_{t=1}^T \left[ \frac{1}{2 s_t^2}\left(y_t - \hat\theta_{t-1}^T x_t\right)^2 + \frac{1}{2}\log(2\pi s_t^2) \right]. \qquad (9)$$

Importantly, note how Bayes adaptively weights the squared terms with the inverse variances $1/s_t^2$, which depend on the current observation $x_t$. The logloss of using a fixed expert $\theta$ is just

$$L_\theta(S) = \sum_{t=1}^T \left[ \frac{1}{2\sigma^2}\left(y_t - \theta^T x_t\right)^2 + \frac{1}{2}\log(2\pi\sigma^2) \right]. \qquad (10)$$

The MAP procedure (referred to as ridge regression) uses $p(y \mid x_t, \hat\theta_{t-1})$, which has a fixed variance. Hence, the MAP loss is essentially the square loss, and we define it as such:

$$L_{\mathrm{MAP}}(S) = \sum_{t=1}^T \left(y_t - \hat\theta_{t-1}^T x_t\right)^2, \qquad \tilde L_\theta(S) = \sum_{t=1}^T \left(y_t - \theta^T x_t\right)^2, \qquad (11)$$

where $\hat\theta_{t-1}$ is the MAP estimate (see Equation (6)) and $\tilde L_\theta(S)$ is the corresponding square loss of the fixed expert $\theta$.

Corollary 3.1: Let $\gamma = \sigma^2 + \nu^2$. For all $S$ such that $\|x_t\| \le 1$ and for all $\theta$, we have

$$L_{\mathrm{MAP}}(S) \le \frac{\gamma}{\sigma^2}\, \tilde L_\theta(S) + \frac{\gamma}{\nu^2}\, \|\theta\|^2 + \gamma\, n \log\left(1 + \frac{T\nu^2}{\sigma^2 n}\right).$$

Proof: Using Equations (9), (10) and Theorem 2.2 (with $c = 1/\sigma^2$), we have

$$\sum_{t=1}^T \frac{1}{2 s_t^2}\left(y_t - \hat\theta_{t-1}^T x_t\right)^2 \le \sum_{t=1}^T \frac{1}{2\sigma^2}\left(y_t - \theta^T x_t\right)^2 + \frac{\|\theta\|^2}{2\nu^2} + \frac{n}{2}\log\left(1 + \frac{T\nu^2}{\sigma^2 n}\right) + \sum_{t=1}^T \frac{1}{2}\log(2\pi\sigma^2) - \sum_{t=1}^T \frac{1}{2}\log(2\pi s_t^2).$$

Equations (6) and (7) imply that $\sigma^2 \le s_t^2 \le \sigma^2 + \nu^2$. Using this, the result follows by noting that the last two terms together are nonpositive and by multiplying both sides of the inequality by $2(\sigma^2 + \nu^2)$.

We might have hoped that MAP were more competitive, in the sense that the leading coefficient in front of the $\tilde L_\theta(S)$ term in the bound would be $1$, as in Theorem 2.2, rather than $\gamma/\sigma^2$. Crudely, the reason that MAP is not as effective as BMA is that MAP does not take into account the uncertainty in its predictions, so the squared terms cannot be reweighted to take the variance into account (compare Equations (9) and (11)). Some previous non-Bayesian algorithms do in fact have bounds with this coefficient being unity. Vovk (2001) provides such an algorithm, though it differs from MAP in that its predictions at time $t$ are a nonlinear function of $x_t$ (it uses $A_t$ instead of $A_{t-1}$ at time $t$). Foster (1991) provides a bound with this coefficient being $1$ under more restrictive assumptions. Azoury and Warmuth (2001) also provide a bound with a coefficient of $1$ by using a MAP procedure with clipping: their algorithm thresholds the prediction $\hat y_t = \hat\theta_{t-1}^T x_t$ if it is larger than some upper bound. Note that we do not assume any upper bound on $y_t$.
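To make Equations (6)-(11) concrete, here is a minimal numpy sketch (ours, not code from the paper) that maintains the exact Gaussian posterior online and accumulates both the BMA logloss of Equation (9) and the MAP (ridge regression) square loss of Equation (11).

```python
import numpy as np

def online_gaussian_regression(X, Y, sigma2=1.0, nu2=1.0):
    """Online Bayesian Gaussian linear regression.

    Maintains A_t = I/nu2 + (1/sigma2) sum_i x_i x_i^T and
    b_t = (1/sigma2) sum_i x_i y_i, so the posterior after t examples is
    N(theta; A_t^{-1} b_t, A_t^{-1}) as in Equation (6).
    Returns (L_BMA, L_MAP) as in Equations (9) and (11)."""
    n = X.shape[1]
    A = np.eye(n) / nu2
    b = np.zeros(n)
    L_bma, L_map = 0.0, 0.0
    for x, y in zip(X, Y):
        Sigma = np.linalg.inv(A)        # posterior covariance after t-1 examples
        theta_hat = Sigma @ b           # posterior mean = MAP estimate
        y_hat = theta_hat @ x           # predictive mean
        s2 = x @ Sigma @ x + sigma2     # predictive variance (Equation (7))
        L_bma += 0.5 * (y - y_hat) ** 2 / s2 + 0.5 * np.log(2 * np.pi * s2)
        L_map += (y - y_hat) ** 2       # ridge-regression square loss
        # fold the new example into the posterior
        A += np.outer(x, x) / sigma2
        b += x * y / sigma2
    return L_bma, L_map
```

Comparing the two returned losses against $L_\theta(S)$ and $\tilde L_\theta(S)$ for a few fixed $\theta$ gives a direct numerical check of Theorem 2.2 and Corollary 3.1 on arbitrary sequences.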

As the following lower bound shows, it is not possible for the MAP linear regression algorithm to have a coefficient of $1$ for $\tilde L_\theta(S)$ together with a reasonable regret bound. A similar lower bound appears in Vovk (2001), though it does not apply to our setting, where we have the additional constraint $\|x_t\| \le 1$.

Theorem 3.2: Let $\nu^2 = \sigma^2 = 1$. There exists a sequence $S$ with $\|x_t\| \le 1$ such that

$$L_{\mathrm{MAP}}(S) \ge \inf_\theta \left\{ \tilde L_\theta(S) + \|\theta\|^2 \right\} + \Omega(T).$$

Proof: (sketch) Let $S$ be a sequence of length $T+1$, with $n = 1$, where for the first $T$ steps $x_t = 1/\sqrt{T}$ and $y_t = 1$, and at step $T+1$, $x_{T+1} = 1$ and $y_{T+1} = 0$. Here, one can show that $\inf_\theta \{ \tilde L_\theta(S) + \theta^2 \} = T/4$ and $L_{\mathrm{MAP}}(S) \ge 3T/8$, and the result follows.

3.2 Logistic Regression

MAP estimation is often used for regularized logistic regression, since it requires only solving a convex program, while BMA has to deal with a high-dimensional integral over $\theta$ that is intractable to compute exactly. Letting $\hat\theta_{t-1}$ be the maximizer of the posterior $p_{t-1}$, define

$$L_{\mathrm{MAP}}(S) = -\sum_{t=1}^T \log p(y_t \mid x_t, \hat\theta_{t-1}).$$
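Computing each $\hat\theta_{t-1}$ amounts to fitting a regularized logistic regression to the first $t-1$ examples. The following is a minimal sketch of that convex problem solved by Newton's method, together with the resulting online MAP logloss (our own illustration with hypothetical parameter values; the paper prescribes no particular solver).

```python
import numpy as np

def map_logistic(X, Y, nu2=0.25, steps=25):
    """MAP estimate for logistic regression with prior N(0, nu2 * I):
    minimizes the (convex) negative log posterior
    sum_i [log(1 + exp(theta^T x_i)) - y_i theta^T x_i] + ||theta||^2 / (2 * nu2)."""
    n = X.shape[1]
    theta = np.zeros(n)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # P(y = 1 | x) under current theta
        grad = X.T @ (p - Y) + theta / nu2
        H = (X * (p * (1 - p))[:, None]).T @ X + np.eye(n) / nu2
        theta -= np.linalg.solve(H, grad)      # Newton step
    return theta

def online_map_logistic_loss(X, Y, nu2=0.25):
    """L_MAP(S): at round t, predict with the MAP estimate fit to the
    first t-1 examples and suffer the logloss on (x_t, y_t)."""
    total = 0.0
    for t in range(len(Y)):
        theta = map_logistic(X[:t], Y[:t], nu2) if t > 0 else np.zeros(X.shape[1])
        z = X[t] @ theta
        total += np.logaddexp(0.0, -z) if Y[t] == 1 else np.logaddexp(0.0, z)
    return total
```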

As with the square loss case, the bound we present is multiplicatively worse by a factor of 4.

Theorem 3.3: In the logistic regression model with $\nu \le 0.5$, we have that for all sequences $S$ such that $\|x_t\| \le 1$ and $y_t \in \{0, 1\}$, and for all $\theta$,

$$L_{\mathrm{MAP}}(S) \le 4 L_\theta(S) + \frac{2}{\nu^2}\|\theta\|^2 + 2 n \log\left(1 + \frac{T\nu^2}{4n}\right).$$

Proof: (sketch) Assume $n = 1$ (the general case is analogous). The proof consists of showing that $\ell_t^{\hat\theta} = -\log p(y_t \mid x_t, \hat\theta_{t-1}) \le 4\, \ell_t^{\mathrm{BMA}}$; the theorem then follows from Theorem 2.2 with $c = 1/4$. Without loss of generality, assume $y_t = 1$ and $x_t \ge 0$, and for convenience we simply write $x$ instead of $x_t$. Now the BMA prediction is $\int p(1 \mid \theta, x)\, p_{t-1}(\theta)\, d\theta$, and $\ell_t^{\mathrm{BMA}}$ is the negative log of this. Note that $\theta = +\infty$ gives probability $1$ to $y_t = 1$, and this setting of $\theta$ minimizes the loss at time $t$.

Since we do not have a closed-form expression for the posterior $p_{t-1}$, let us work with another distribution $q$ in lieu of $p_{t-1}$ that satisfies certain properties. Define $p_q = \int p(1 \mid \theta, x)\, q(\theta)\, d\theta$, which can be viewed as the prediction made using $q$ rather than the posterior. We choose $q$ to be the rectification of the Gaussian $N(\theta; \hat\theta_{t-1}, \nu^2)$, placing positive probability only on $\theta \ge \hat\theta_{t-1}$ and renormalized. With this choice, we first show that the loss of $q$, namely $-\log p_q$, is less than or equal to $\ell_t^{\mathrm{BMA}}$. We then complete the proof by showing that $\ell_t^{\hat\theta} \le 4(-\log p_q) \le 4\,\ell_t^{\mathrm{BMA}}$.

Consider the $q$ that maximizes $p_q$ subject to the following constraints: $q$ has its maximum at $\hat\theta_{t-1}$; $q(\theta) = 0$ for $\theta < \hat\theta_{t-1}$ (intuitively, mass to the left of $\hat\theta_{t-1}$ only makes $p_q$ smaller); and $\frac{d^2}{d\theta^2}\log q(\theta) \le -1/\nu^2$. We now argue that for such a $q$, $-\log p_q \le \ell_t^{\mathrm{BMA}}$. First note that, due to the Gaussian prior $p_0$, it is straightforward to show that $\frac{d^2}{d\theta^2}\log p_{t-1}(\theta) \le -1/\nu^2$ (the prior imposes some minimum curvature). Now if this posterior $p_{t-1}$ were rectified to have support only on $\theta \ge \hat\theta_{t-1}$ and renormalized, then the modified distribution clearly satisfies the aforementioned constraints, and its loss is no greater than the loss of $p_{t-1}$ itself (since the rectification only increases the prediction). Hence the maximizer $q$ of $p_q$ subject to the constraints has loss no greater than that of $p_{t-1}$, i.e. $-\log p_q \le \ell_t^{\mathrm{BMA}}$.

We now show that this maximal $q$ is the renormalized rectification of the Gaussian $N(\theta; \hat\theta_{t-1}, \nu^2)$, with positive probability only for $\theta \ge \hat\theta_{t-1}$. Assume some other $q'$ satisfied these constraints and maximized $p_{q'}$. It cannot be that $q'(\hat\theta_{t-1}) < q(\hat\theta_{t-1})$, else one can show $q'$ would not be normalized, since with $q'(\hat\theta_{t-1}) < q(\hat\theta_{t-1})$ the curvature constraint implies that $q'$ cannot cross $q$. It also cannot be that $q'(\hat\theta_{t-1}) > q(\hat\theta_{t-1})$: to see this, note that normalization and curvature imply that $q'$ must cross $q$ exactly once, and a sufficiently slight perturbation of this crossing point to the left, shifting mass from the left to the right side of the crossing point, would not violate the curvature constraint and would result in a new distribution with larger $p_{q'}$, contradicting maximality. Hence we have that $q'(\hat\theta_{t-1}) = q(\hat\theta_{t-1})$. This, along with the curvature constraint and normalization, implies that the rectified Gaussian $q$ is the unique solution.

To complete the proof, we show that $\ell_t^{\hat\theta} = -\log p(1 \mid x, \hat\theta_{t-1}) \le 4(-\log p_q)$. We consider two cases, $\hat\theta_{t-1} < 0$ and $\hat\theta_{t-1} \ge 0$.

We start with the case $\hat\theta_{t-1} < 0$. Using the boundedness of the derivative, $|\partial \log p(1 \mid x, \theta)/\partial\theta| \le 1$, and the fact that $q$ has support only on $\theta \ge \hat\theta_{t-1}$, we have

$$p_q = \int \exp\big(\log p(1 \mid x, \theta)\big)\, q(\theta)\, d\theta \le \int \exp\big(\log p(1 \mid x, \hat\theta_{t-1}) + (\theta - \hat\theta_{t-1})\big)\, q(\theta)\, d\theta \le 1.6\, p(1 \mid x, \hat\theta_{t-1}),$$

where we have used that $\int \exp(\theta - \hat\theta_{t-1})\, q(\theta)\, d\theta < 1.6$, which can be verified numerically from the definition of $q$ (with $\nu \le 0.5$). Now observe that for $\hat\theta_{t-1} \le 0$ we have the lower bound $-\log p(1 \mid x, \hat\theta_{t-1}) \ge \log 2$. Hence

$$-\log p_q \ge -\log p(1 \mid x, \hat\theta_{t-1}) - \log 1.6 \ge \left(1 - \frac{\log 1.6}{\log 2}\right) \ell_t^{\hat\theta} \ge 0.3\, \ell_t^{\hat\theta},$$

which shows that $\ell_t^{\hat\theta} \le 4(-\log p_q)$.

Now for the case $\hat\theta_{t-1} \ge 0$. Let $\sigma(\cdot)$ be the sigmoid function, so $p(1 \mid x, \theta) = \sigma(x\theta)$ and $p_q = \int \sigma(x\theta)\, q(\theta)\, d\theta$. Since the sigmoid is concave on nonnegative arguments and, in this case, $q$ has support only on $\theta \ge \hat\theta_{t-1} \ge 0$ (recall $x \ge 0$), Jensen's inequality gives $p_q \le \sigma\!\left(x \int \theta\, q(\theta)\, d\theta\right)$. Using the definition of $q$, we then have $p_q \le \sigma(x(\hat\theta_{t-1} + \nu)) \le \sigma(x\hat\theta_{t-1} + \nu)$, where the last inequality follows from $x \le 1$ and $\nu > 0$. Using properties of $\sigma$, one can show that $-\log \sigma(z) \le e^{\nu}\,(-\log \sigma(z + \nu))$ for all $z \ge 0$. Combining this with the previous bound on $p_q$, we have

$$-\log p_q \ge -\log \sigma(x\hat\theta_{t-1} + \nu) \ge e^{-\nu}\,\big(-\log \sigma(x\hat\theta_{t-1})\big) = e^{-\nu}\, \ell_t^{\hat\theta},$$

which shows that $\ell_t^{\hat\theta} \le 4(-\log p_q)$ since $\nu \le 0.5$ implies $e^{\nu} \le 4$. This proves the claim when $\hat\theta_{t-1} \ge 0$.

Acknowledgments

We thank Dean Foster for numerous helpful discussions. This work was supported by the Department of the Interior/DARPA under contract NBCHD.

References

Azoury, K. S. and Warmuth, M. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3).

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D., Schapire, R., and Warmuth, M. (1997). How to use expert advice. Journal of the ACM, 44(3).

Cesa-Bianchi, N., Helmbold, D., and Panizza, S. (1998). On Bayes methods for on-line boolean prediction. Algorithmica, 22.

Dawid, A. P. (1984). Statistical theory: The prequential approach. Journal of the Royal Statistical Society, Series A.

Foster, D. P. (1991). Prediction in the worst case. Annals of Statistics, 19.

Freund, Y. and Schapire, R. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103.

Freund, Y., Schapire, R., Singer, Y., and Warmuth, M. (1997). Using and combining predictors that specialize. In STOC.

Grunwald, P. (2005). A tutorial introduction to the minimum description length principle.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall.

Ng, A. Y. and Jordan, M. (2001). Convergence rates of the voting Gibbs classifier, with application to Bayesian feature selection. In Proceedings of the 18th International Conference on Machine Learning.

Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69.
