Lecture 4. Instructor: Haipeng Luo


In the following lectures, we focus on the expert problem and study more adaptive algorithms. Although Hedge is provably worst-case optimal, one may wonder how well it would actually perform when dealing with a practical problem that is probably not the worst case, or is even relatively easy. Indeed, the regret bound we proved for Hedge only says that for all problem instances, Hedge's regret is uniformly bounded by O(√(T ln N)). Ideally, however, we want an algorithm that enjoys a much smaller regret in many easy situations, but in the worst case still guarantees the minimax regret O(√(T ln N)). Deriving adaptive algorithms and adaptive regret bounds is exactly one way to achieve this goal.

1 Small-loss Bounds

We start with the arguably simplest adaptive bound, sometimes called a small-loss bound or first-order bound. Recall that we proved the following intermediate bound for Hedge with a fixed learning rate η:

    R_T = L̂_T − L_{T,i*} ≤ (ln N)/η + η Σ_{t=1}^T ⟨p_t, ℓ_t²⟩,    (1)

where L_T is the cumulative loss vector, i* is the best expert, and we define L̂_T = Σ_{t=1}^T ⟨p_t, ℓ_t⟩ to be the cumulative loss of the algorithm. By boundedness of the losses (ℓ_t ∈ [0,1]^N, so ℓ_t² ≤ ℓ_t entrywise), the last term above can be bounded by η L̂_T. If η < 1, then rearranging gives

    R_T ≤ (1/(1−η)) ((ln N)/η + η L_{T,i*}).

Therefore, if for a moment we assume we knew the quantity L_{T,i*} ahead of time and were able to set η = min{1/2, √(ln N / L_{T,i*})}, then we arrive at

    R_T ≤ max{8 ln N, 4√(L_{T,i*} ln N)} = O(√(L_{T,i*} ln N) + ln N).

The final bound above is the so-called small-loss bound, which essentially replaces the dependence on T in the minimax bound √(T ln N) by the loss of the best expert L_{T,i*}. Note that L_{T,i*} is at most T, so the small-loss bound is never worse than the minimax bound. More importantly, it can be much smaller than √(T ln N) when the best expert indeed suffers very small loss. In particular, if the best expert makes no mistakes at all and has L_{T,i*} = 0, then the small-loss bound is only O(ln N), independent of T. This is one typical example of the adaptive bounds we are aiming for.

Of course, one obvious issue in the above derivation is that the learning rate η has to be set in terms of the unknown quantity L_{T,i*}.
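To make the setup concrete, here is a minimal simulation sketch of Hedge with a fixed learning rate on an easy instance where one expert never errs. The instance and all names are illustrative, not from the notes; η = 1/2 is the oracle tuning above for the case L_{T,i*} = 0, and the observed regret indeed stays at the O(ln N) level regardless of T.

```python
import math
import random

def hedge(losses, eta):
    """Hedge with a fixed learning rate eta on a T x N sequence of loss vectors."""
    n = len(losses[0])
    L = [0.0] * n                # cumulative loss of each expert
    alg = 0.0                    # cumulative loss of the algorithm
    for loss in losses:
        w = [math.exp(-eta * Li) for Li in L]   # weights w_i = exp(-eta * L_i)
        z = sum(w)
        p = [wi / z for wi in w]                # the distribution p_t
        alg += sum(pi * li for pi, li in zip(p, loss))
        L = [Li + li for Li, li in zip(L, loss)]
    return alg, L

# an easy (hypothetical) instance: expert 0 never errs, the others are mediocre
random.seed(0)
T, N = 1000, 10
losses = [[0.0 if i == 0 else random.random() for i in range(N)] for _ in range(T)]

# oracle tuning from the text: eta = min(1/2, sqrt(ln N / L_star)); here L_star = 0
alg, L = hedge(losses, eta=0.5)
regret = alg - min(L)
print(regret)   # stays at the O(ln N) level no matter how large T is
```

By the bound above with η = 1/2 and L_{T,i*} = 0, the regret here is guaranteed to be at most 4 ln N ≈ 9.2, and in practice it is even smaller.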
In fact, this becomes an even more severe problem in a non-oblivious environment, since L_{T,i*} can then depend on the algorithm's actions, making the definition of η circular. Fortunately, there are many different ways to address this issue, and we explore one of them here.

The idea is to use a more adaptive, time-varying learning rate schedule. Specifically, the algorithm predicts p_t(i) ∝ exp(−η_t L_{t−1,i}) with

    η_t = √(ln N / L̂_{t−1}).

Note that L̂_{t−1} = Σ_{τ=1}^{t−1} ⟨p_τ, ℓ_τ⟩ is the cumulative loss of the algorithm up to round t−1 and is thus available at the beginning of round t (since L_0 = 0, the first prediction p_1 is uniform for any η_1). This is sometimes called a self-confident learning rate: the algorithm is confident that its loss is close to the loss of the best expert, and thus uses it as a proxy for the latter when tuning the learning rate. We next prove that this algorithm indeed provides a small-loss bound.

Theorem 1. Hedge with the adaptive learning rate schedule η_t = √(ln N / L̂_{t−1}) ensures

    R_T ≤ 3√(L_{T,i*} ln N) + 9 ln N.

Proof. Let Φ_t(η) = (1/η) ln((1/N) Σ_{i=1}^N exp(−η L_{t,i})). In an earlier lecture we already proved

    Φ_t(η_t) − Φ_{t−1}(η_t) ≤ −⟨p_t, ℓ_t⟩ + η_t ⟨p_t, ℓ_t²⟩ ≤ −⟨p_t, ℓ_t⟩ + η_t ⟨p_t, ℓ_t⟩.

Summing over t and rearranging give

    L̂_T = Σ_{t=1}^T ⟨p_t, ℓ_t⟩
        ≤ Φ_0(η_1) − Φ_T(η_T) + Σ_{t=1}^{T−1} (Φ_t(η_{t+1}) − Φ_t(η_t)) + Σ_{t=1}^T η_t ⟨p_t, ℓ_t⟩
        ≤ L_{T,i*} + √(L̂_T ln N) + Σ_{t=1}^{T−1} (Φ_t(η_{t+1}) − Φ_t(η_t)) + Σ_{t=1}^T η_t ⟨p_t, ℓ_t⟩,

where the last step uses Φ_0(η_1) = 0 together with

    Φ_T(η_T) ≥ (1/η_T) ln((1/N) exp(−η_T L_{T,i*})) = −L_{T,i*} − (ln N)/η_T

and (ln N)/η_T = √(L̂_{T−1} ln N) ≤ √(L̂_T ln N).

To bound the term Σ_{t=1}^T η_t ⟨p_t, ℓ_t⟩, note that ⟨p_t, ℓ_t⟩ = L̂_t − L̂_{t−1} and

    (L̂_t − L̂_{t−1})/√L̂_t ≤ ∫_{L̂_{t−1}}^{L̂_t} dx/√x = 2(√L̂_t − √L̂_{t−1}),

so summing over t telescopes to 2(√L̂_T − √L̂_0) = 2√L̂_T. Up to the difference between L̂_{t−1} and L̂_t (which is at most 1 by boundedness of the losses), this yields

    Σ_{t=1}^T η_t ⟨p_t, ℓ_t⟩ ≤ 2√(L̂_T ln N).

To bound Φ_t(η_{t+1}) − Φ_t(η_t), we prove that Φ_t(η) is increasing in η; since L̂_t is non-decreasing, the learning rates satisfy η_{t+1} ≤ η_t, and monotonicity then gives Φ_t(η_{t+1}) − Φ_t(η_t) ≤ 0. It suffices to prove that the derivative is non-negative. Indeed, a direct calculation shows that with p(i) = exp(−η L_{t,i}) / Σ_{j=1}^N exp(−η L_{t,j}),

    (d/dη) Φ_t(η) = −(1/η²) ln((1/N) Σ_{i=1}^N exp(−η L_{t,i})) − (1/η) Σ_{i=1}^N p(i) L_{t,i}
                  = (1/η²) (ln N + Σ_{i=1}^N p(i) ln p(i))
                  ≥ 0,

where the second equality uses ln p(i) = −η L_{t,i} − ln Σ_{j=1}^N exp(−η L_{t,j}), and the last step is by the fact that entropy is maximized by the uniform distribution, so that −Σ_i p(i) ln p(i) ≤ ln N.

To sum up, we have proven

    R_T = L̂_T − L_{T,i*} ≤ 3√(L̂_T ln N).

Solving this quadratic inequality for √L̂_T leads to

    √L̂_T ≤ (3/2)√(ln N) + √(L_{T,i*} + (9/4) ln N).

Finally, squaring both sides and using √(a + b) ≤ √a + √b give

    L̂_T ≤ L_{T,i*} + 9 ln N + 3√(L_{T,i*} ln N),

which completes the proof.
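The self-confident schedule above can be sketched in a few lines. This is a minimal simulation, not the notes' pseudocode: in particular, the cap of η_t at 1/2 in the early rounds is my addition, since η_t = √(ln N / L̂_{t−1}) is undefined while L̂_{t−1} = 0, and the notes leave the initial rounds implicit.

```python
import math
import random

def self_confident_hedge(losses):
    """Hedge with the self-confident rate eta_t = sqrt(ln N / Lhat_{t-1}).

    Lhat is the algorithm's own cumulative loss so far; capping the rate at
    1/2 while Lhat is tiny is an assumed convention, not from the notes.
    """
    n = len(losses[0])
    L = [0.0] * n          # cumulative expert losses
    Lhat = 0.0             # algorithm's cumulative loss
    for loss in losses:
        eta = min(0.5, math.sqrt(math.log(n) / Lhat)) if Lhat > 0 else 0.5
        w = [math.exp(-eta * Li) for Li in L]
        z = sum(w)
        p = [wi / z for wi in w]
        Lhat += sum(pi * li for pi, li in zip(p, loss))
        L = [Li + li for Li, li in zip(L, loss)]
    return Lhat, L

# a hypothetical instance: the best expert suffers small but nonzero loss
random.seed(1)
T, N = 2000, 20
losses = [[0.05 * random.random() if i == 0 else random.random() for i in range(N)]
          for _ in range(T)]
Lhat, L = self_confident_hedge(losses)
regret = Lhat - min(L)
print(regret)  # roughly O(sqrt(L_star ln N)), far below sqrt(T ln N)
```

No oracle knowledge of L_{T,i*} is needed here: the learning rate shrinks automatically as the algorithm's own loss grows.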

Besides enjoying a better theoretical regret bound, this algorithm is also intuitively more reasonable, since it tunes the learning rate adaptively based on observed data. In general, learning rate tuning is an important topic in machine learning and can be of great practical importance.

2 Quantile Bounds

Small-loss bounds improve the dependence on T in the minimax regret bound to L_{T,i*}. Is it possible to also improve the other term, ln N, to something better? To answer this question, consider again Hedge with a fixed learning rate η for simplicity, and note that we proved in an earlier lecture

    L̂_T ≤ (ln N)/η − (1/η) ln(Σ_{i=1}^N exp(−η L_{T,i})) + ηT.    (2)

Without loss of generality, assume L_{T,1} ≤ L_{T,2} ≤ ⋯ ≤ L_{T,N}, so that expert i is the i-th best expert. Previously we obtained the final regret bound by lower bounding Σ_{i=1}^N exp(−η L_{T,i}) by max_i exp(−η L_{T,i}) = exp(−η L_{T,1}). In general, however, for each i we have

    Σ_{j=1}^N exp(−η L_{T,j}) ≥ Σ_{j=1}^i exp(−η L_{T,j}) ≥ i exp(−η L_{T,i}),

and we therefore have the following regret bound against the i-th best expert:

    L̂_T − L_{T,i} ≤ (1/η) ln(N/i) + ηT.

With η optimally tuned to √(ln(N/i)/T), the bound becomes 2√(T ln(N/i)). This is called a quantile bound, and it states that the algorithm suffers at most this amount of regret compared to all but an i/N fraction of the experts. Of course, at the end of the day what we care about is the actual loss of the algorithm. So assuming we had knowledge of L_T for a moment, we could pick the optimal i to achieve

    L̂_T ≤ min_i (L_{T,i} + 2√(T ln(N/i))),    (3)

which is a strictly better bound compared to L̂_T ≤ L_{T,1} + 2√(T ln N). To understand the improvement, consider the case when N is huge but there are many similar experts, so that, for example, the top 1% of them all have the same cumulative loss. Then bound (3) is at most

    L_{T,N/100} + 2√(T ln(N/(N/100))) = L_{T,N/100} + 2√(T ln 100),

which is independent of N.

Just as in the previous discussion, one obvious issue in the derivation of bound (3) above is again that the learning rate needs to be tuned based on unknown knowledge. To address this issue, here we explore a quite different approach. The idea is to run different instances of Hedge with different learning rates, and to use a master Hedge to combine the predictions of these instances, viewed as meta-experts. To this end, we use Hedge(η) to denote an instance of Hedge running with learning rate η. The algorithm is shown below.

Algorithm 1: Hedge with Quantile Bounds
  Input: master learning rate ε > 0, base learning rates η_1, …, η_M
  Initialize: M Hedge algorithms Hedge(η_1), …, Hedge(η_M); C_0(j) = 0 for all j ∈ [M]
  for t = 1, …, T do
    let p_t^j be the prediction of Hedge(η_j) on round t
    compute p_t = Σ_{j=1}^M q_t(j) p_t^j, where q_t(j) ∝ exp(−ε C_{t−1}(j))
    play p_t and observe loss vector ℓ_t ∈ [0,1]^N
    update C_t(j) = C_{t−1}(j) + ⟨p_t^j, ℓ_t⟩ for all j ∈ [M]
    pass ℓ_t to Hedge(η_1), …, Hedge(η_M)

By Eq. (2), we have for each Hedge(η_j) and each expert i

    Σ_{t=1}^T ⟨p_t^j, ℓ_t⟩ − L_{T,i} ≤ (1/η_j) ln(N/i) + η_j T.

On the other hand, for the master Hedge we have, for each meta-expert j,

    Σ_{t=1}^T Σ_{j'=1}^M q_t(j') ⟨p_t^{j'}, ℓ_t⟩ − C_T(j) ≤ (ln M)/ε + εT.

Note that by construction Σ_{j'=1}^M q_t(j') ⟨p_t^{j'}, ℓ_t⟩ = ⟨p_t, ℓ_t⟩ and C_T(j) = Σ_{t=1}^T ⟨p_t^j, ℓ_t⟩. Therefore, summing up the above two inequalities leads to

    Σ_{t=1}^T ⟨p_t, ℓ_t⟩ − L_{T,i} ≤ (1/η_j) ln(N/i) + η_j T + (ln M)/ε + εT
                                   = (1/η_j) ln(N/i) + η_j T + 2√(T ln M),

where the last step is by picking the optimal ε = √(ln M / T). Note that the above holds for all j and all i. Therefore, suppose we have (a) for each i, there is a j such that (1/η_j) ln(N/i) + η_j T = O(√(T ln(N/i))), and (b) M much smaller than N; then we obtain bound (3) up to constants.
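To put numbers on the 1%-tie example above, here is a quick back-of-the-envelope computation with illustrative values of T and N (these values are hypothetical, not from the notes): the quantile term 2√(T ln 100) is a fixed constant, while the standard term 2√(T ln N) keeps growing with N.

```python
import math

T = 10_000
for N in (10**3, 10**6, 10**9):
    standard = 2 * math.sqrt(T * math.log(N))    # 2*sqrt(T ln N): grows with N
    quantile = 2 * math.sqrt(T * math.log(100))  # top 1% tied: 2*sqrt(T ln 100)
    print(f"N={N:>10}  standard={standard:7.1f}  quantile={quantile:7.1f}")
```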
Settng M = and j = ln j /T would clearly satsfy a, but not b. Fortunately, t turns out that one only needs to create M ln meta-experts and stll satsfy a. Specfcally, let j = ln ln j and M = log T ln. 4
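Algorithm 1 with this geometric grid of learning rates can be sketched as follows. This is a minimal, self-contained simulation under my reading of the construction above (class names and the demo instance are illustrative); each base Hedge(η_j) is an ordinary Hedge instance, and the master runs Hedge over their mixed losses.

```python
import math

class Hedge:
    """A base Hedge(eta) instance over n experts."""
    def __init__(self, n, eta):
        self.eta, self.L = eta, [0.0] * n
    def predict(self):
        w = [math.exp(-self.eta * Li) for Li in self.L]
        z = sum(w)
        return [wi / z for wi in w]
    def update(self, loss):
        self.L = [Li + li for Li, li in zip(self.L, loss)]

def master_hedge(losses, n, T):
    """Algorithm 1: a master Hedge over M base Hedge(eta_j) meta-experts."""
    # geometric grid eta_j = sqrt(2^{j-1} ln2 / T), M = ceil(log2(ln n / ln 2)) + 1
    M = math.ceil(math.log2(math.log(n) / math.log(2))) + 1
    base = [Hedge(n, math.sqrt(2 ** (j - 1) * math.log(2) / T))
            for j in range(1, M + 1)]
    eps = math.sqrt(math.log(M) / T)          # master learning rate
    C = [0.0] * M                             # cumulative losses of meta-experts
    total = 0.0                               # cumulative loss of the master
    for t in range(T):
        preds = [b.predict() for b in base]
        w = [math.exp(-eps * Cj) for Cj in C]
        z = sum(w)
        q = [wi / z for wi in w]              # master distribution over meta-experts
        p = [sum(q[j] * preds[j][i] for j in range(M)) for i in range(n)]
        loss = losses[t]
        total += sum(pi * li for pi, li in zip(p, loss))
        for j in range(M):                    # meta-expert j suffers <p_t^j, l_t>
            C[j] += sum(pi * li for pi, li in zip(preds[j], loss))
        for b in base:
            b.update(loss)
    return total

# demo: the top 10% of experts are tied for best
n, T = 50, 500
losses = [[0.1 if i < 5 else 0.9 for i in range(n)] for _ in range(T)]
regret = master_hedge(losses, n, T) - 0.1 * T
print(regret)   # regret relative to any of the tied best experts
```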

Now clearly, for each i ≤ N/2 there exists a j ∈ [M] such that 2^{j−1} ln 2 ≤ ln(N/i) ≤ 2^j ln 2, and therefore

    Σ_{t=1}^T ⟨p_t, ℓ_t⟩ − L_{T,i} ≤ (1/η_j) ln(N/i) + η_j T + 2√(T ln M)
                                   ≤ 2√(2^{j−1} ln 2 · T) + √(2^{j−1} ln 2 · T) + 2√(T ln M)
                                   ≤ 3√(T ln(N/i)) + 2√(T ln M).

(For i > N/2, taking j = 1 gives the bound 2√(T ln 2) + 2√(T ln M) = O(√T) + 2√(T ln M).) It remains to show that M is small enough. Indeed, M = ⌈log₂(ln N / ln 2)⌉ + 1 = O(ln ln N), so the additional term 2√(T ln M) is of order √(T ln ln ln N). At least for the case when ln(N/i) is larger than O(ln ln ln N), this term is dominated by √(T ln(N/i)) in the regret bound. We summarize the result in the following theorem.

Theorem 2. Algorithm 1 with ε = √(ln M / T), η_j = √(2^{j−1} ln 2 / T), and M = ⌈log₂(ln N / ln 2)⌉ + 1 ensures

    L̂_T ≤ min_i (L_{T,i} + 3√(T ln(N/i))) + O(√(T ln ln ln N)).

This idea of combining algorithms using Hedge is useful for many other problems. It is usually a quick and easy way to verify whether some regret bound is achievable in theory. However, the resulting algorithm might not be so elegant or practical. In the next lecture, we will study a different algorithm that not only guarantees a quantile bound (in fact, an even better one than the bound proven here), but also enjoys several other useful properties.