On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

Sasha Rakhlin
Department of Statistics, The Wharton School, University of Pennsylvania
Dec 16, 2015
Joint work with K. Sridharan. arXiv:1510.03925

Outline
- Introduction
- Beyond Banach spaces
- Extras

If $Z_1,\dots,Z_n$ are independent with $\mathbb{E} Z_t = 0$, then

$$\mathbb{E}\Big(\sum_{t=1}^n Z_t\Big)^2 = \sum_{t=1}^n \mathbb{E} Z_t^2 .$$

This extends to Hilbert space:

$$\mathbb{E}\Big\|\sum_{i=1}^n Z_i\Big\|^2 = \sum_{i=1}^n \mathbb{E}\|Z_i\|^2 .$$
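The identity is obtained by expanding the square and noting that the cross terms vanish. A short derivation, assuming only independence and zero means as stated on the slide:

```latex
\begin{align*}
\mathbb{E}\Big(\sum_{t=1}^n Z_t\Big)^2
  &= \sum_{t=1}^n \mathbb{E} Z_t^2 + 2\sum_{1 \le s < t \le n} \mathbb{E}[Z_s Z_t] \\
  &= \sum_{t=1}^n \mathbb{E} Z_t^2 + 2\sum_{1 \le s < t \le n} \mathbb{E}[Z_s]\,\mathbb{E}[Z_t]
   = \sum_{t=1}^n \mathbb{E} Z_t^2 .
\end{align*}
```

The same computation goes through with inner products $\langle Z_s, Z_t\rangle$ in a Hilbert space. It also works for martingale differences, since $\mathbb{E}[Z_s Z_t] = \mathbb{E}[Z_s\,\mathbb{E}_{t-1} Z_t] = 0$ for $s < t$.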

(Pinelis '94): Let $Z_1,\dots,Z_n$ be a martingale difference sequence in a separable 2-smooth Banach space $(\mathcal{B}, \|\cdot\|)$ with smoothness constant $D$. For any $u > 0$,

$$P\Big(\sup_{n \ge 1}\Big\|\sum_{t=1}^n Z_t\Big\| \ge \sigma u\Big) \le 2\exp\Big\{-\frac{u^2}{2D^2}\Big\},$$

where $\sigma^2 \ge \sum_{t \ge 1} \|Z_t\|_\infty^2$.

Questions:
- replace $\sigma$ with a sequence-dependent version?
- is it always possible?
- extend beyond the linear structure of Banach spaces?

Contributions:
- address these questions
- the actual technique: equivalence of tail bounds and deterministic pathwise regret inequalities

Baby version

Unit Euclidean ball $B$ in $\mathbb{R}^d$. Let $z_1,\dots,z_n \in B$ be arbitrary. Define

$$\hat{y}_{t+1} = \hat{y}_{t+1}(z_1,\dots,z_t) = \mathrm{Proj}_B\big(\hat{y}_t - n^{-1/2} z_t\big), \qquad \hat{y}_1 = 0 .$$

Then, $\forall y, z_1,\dots,z_n \in B$,

$$\sum_{t=1}^n \langle \hat{y}_t - y, z_t \rangle \le \sqrt{n} .$$

Rewrite as: $\exists (\hat{y}_t)$ such that $\forall z_1,\dots,z_n \in B$,

$$\Big\|\sum_{t=1}^n z_t\Big\| - \sqrt{n} \le \sum_{t=1}^n \langle \hat{y}_t, -z_t \rangle .$$
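The baby version can be checked numerically. The sketch below assumes the step size is $n^{-1/2}$ and the inequality reads $\|\sum_t z_t\| - \sqrt{n} \le \sum_t \langle \hat{y}_t, -z_t\rangle$. The function names and the random test sequence are illustrative and not part of the talk.

```python
import math
import random


def project_to_unit_ball(v):
    """Euclidean projection onto the unit ball: rescale only if the norm exceeds 1."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]


def gradient_descent_gap(zs):
    """Run y_{t+1} = Proj_B(y_t - z_t / sqrt(n)) starting from y_1 = 0.

    Returns (lhs, rhs) with lhs = ||sum_t z_t|| - sqrt(n) and
    rhs = sum_t <y_t, -z_t>. The deterministic inequality says lhs <= rhs.
    """
    n, d = len(zs), len(zs[0])
    step = 1.0 / math.sqrt(n)
    y = [0.0] * d
    total = [0.0] * d
    rhs = 0.0
    for z in zs:
        # y_t is fixed before z_t is revealed.
        rhs += -sum(yi * zi for yi, zi in zip(y, z))
        total = [a + b for a, b in zip(total, z)]
        y = project_to_unit_ball([yi - step * zi for yi, zi in zip(y, z)])
    lhs = math.sqrt(sum(c * c for c in total)) - math.sqrt(n)
    return lhs, rhs


def random_point_in_ball(d, rng):
    """Hypothetical helper: some point of norm at most 1 (not uniformly distributed)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    radius = rng.random()
    return [radius * c / norm for c in v]


if __name__ == "__main__":
    rng = random.Random(0)
    zs = [random_point_in_ball(3, rng) for _ in range(200)]
    lhs, rhs = gradient_descent_gap(zs)
    print(lhs, rhs, lhs <= rhs + 1e-9)
```

The inequality is pathwise, so it should hold for every sequence in the ball, including adversarial ones such as a constant sequence. The random draw is only a convenience.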

Deterministic inequality: $\exists (\hat{y}_t)$ such that $\forall z_1,\dots,z_n \in B$,

$$\Big\|\sum_{t=1}^n z_t\Big\| - \sqrt{n} \le \sum_{t=1}^n \langle \hat{y}_t, -z_t \rangle . \tag{1}$$

Apply to an MDS $Z_1,\dots,Z_n$ with values in $B$:

$$P\Big(\Big\|\sum_{t=1}^n Z_t\Big\| - \sqrt{n} > u\sqrt{n}\Big) \le P\Big(\sum_{t=1}^n \langle \hat{y}_t, -Z_t \rangle > u\sqrt{n}\Big) \le \exp\{-u^2/2\} \tag{2}$$

by Azuma-Hoeffding.

Integrate tails:

$$\mathbb{E}\Big\|\sum_{t=1}^n Z_t\Big\| \le c\sqrt{n} . \tag{3}$$

Using the von Neumann minimax theorem, it is possible to show

$$\exists (\hat{y}_t)\ \forall y, z_1,\dots,z_n \in B, \qquad \sum_{t=1}^n \langle \hat{y}_t - y, z_t \rangle \le \sup_{\text{mds}} \mathbb{E}\Big\|\sum_{t=1}^n W_t\Big\| \le c\sqrt{n} .$$

Conclusion: (1) $\Rightarrow$ (2) $\Rightarrow$ (3) $\Rightarrow$ (1) (up to constants).
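Steps (2) and (3) can be made concrete in the simplest case. The sketch below takes $d = 1$ and $Z_t = \epsilon_t$ (i.i.d. fair signs, an MDS with values in $B = [-1, 1]$), and assumes (2) is normalized as $P(|\sum_t Z_t| - \sqrt{n} > u\sqrt{n}) \le \exp\{-u^2/2\}$. Integrating that tail over $u \ge 0$ gives $\mathbb{E}|\sum_t Z_t| \le (1 + \sqrt{\pi/2})\sqrt{n}$, which is one admissible value of $c$ in (3). All names are illustrative.

```python
import math


def rademacher_sum_pmf(n):
    """Exact law of S_n = eps_1 + ... + eps_n for i.i.d. fair signs: {value: probability}."""
    return {2 * k - n: math.comb(n, k) / 2 ** n for k in range(n + 1)}


def tail_probability(n, u):
    """Exact P(|S_n| - sqrt(n) > u * sqrt(n))."""
    threshold = (1.0 + u) * math.sqrt(n)
    return sum(p for s, p in rademacher_sum_pmf(n).items() if abs(s) > threshold)


def expected_abs_sum(n):
    """Exact E|S_n|."""
    return sum(abs(s) * p for s, p in rademacher_sum_pmf(n).items())


# Constant from integrating the tail: c = 1 + integral_0^inf exp(-u^2 / 2) du.
C_FROM_TAIL = 1.0 + math.sqrt(math.pi / 2.0)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        for u in (0.5, 1.0, 2.0):
            print(n, u, tail_probability(n, u), math.exp(-u * u / 2.0))
        print(n, expected_abs_sum(n), C_FROM_TAIL * math.sqrt(n))
```

The constant obtained this way is not tight for sign sequences, since $\mathbb{E}|S_n| \le \sqrt{n}$ already follows from Jensen's inequality. The point is only that (3) follows from (2) mechanically.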

$$\exists (\hat{y}_t)\ \forall y, z_1,\dots,z_n \in B, \quad \sum_{t=1}^n \langle \hat{y}_t - y, z_t \rangle \le \sqrt{n}$$

$$\Longleftrightarrow \quad P\Big(\Big\|\sum_{t=1}^n Z_t\Big\| - \sqrt{n} > u\sqrt{n}\Big) \le \exp\{-u^2/2\}$$

$$\Longleftrightarrow \quad \mathbb{E}\Big\|\sum_{t=1}^n Z_t\Big\| \le c\sqrt{n}$$

Curiosities:
- in particular, (3) $\Rightarrow$ (2) amplifies in-expectation to high probability
- improve tail bounds by taking a better gradient descent
- improve gradient descent by finding better tail bounds
- move beyond the linear structure of Banach space

< / end of baby version >

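Warmup: mirror descent with adaptive step size

$(\mathcal{B}, \|\cdot\|)$ is 2-smooth, and $(\mathcal{B}^*, \|\cdot\|_*)$ denotes the dual. $D_R : \mathcal{B}^* \times \mathcal{B}^* \to \mathbb{R}$ is the Bregman divergence w.r.t. $R$, which is 1-strongly convex on the unit ball $B^* \subset \mathcal{B}^*$. Denote $R_{\max}^2 \triangleq \sup_{f,g \in B^*} D_R(f, g)$. Here the $z_t$'s need not be in the unit ball.

Lemma. Let $F \subseteq B^*$ be convex. Define

$$\hat{y}_{t+1} = \hat{y}_{t+1}(z_1,\dots,z_t) = \operatorname*{argmin}_{y \in F} \big\{ \eta_t \langle y, z_t \rangle + D_R(y, \hat{y}_t) \big\}$$

and

$$\eta_t \triangleq R_{\max} \min\Big\{1, \Big(\sqrt{\textstyle\sum_{s=1}^{t} \|z_s\|^2} + \sqrt{\textstyle\sum_{s=1}^{t-1} \|z_s\|^2}\Big)^{-1}\Big\} .$$

Then for any $y \in F$ and any $z_1,\dots,z_n \in \mathcal{B}$,

$$\sum_{t=1}^n \langle \hat{y}_t - y, z_t \rangle \le 2.5\, R_{\max} \Big(\sqrt{\textstyle\sum_{t=1}^n \|z_t\|^2} + 1\Big) .$$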
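A minimal sketch of the lemma in the Euclidean special case, which is an assumption made here for illustration: $R(y) = \tfrac12\|y\|_2^2$, so $D_R(f, g) = \tfrac12\|f - g\|_2^2$, $F$ is the unit ball, $R_{\max} = \sqrt{2}$, and the argmin step reduces to a projected gradient step. The step size is taken to be $\eta_t = R_{\max}\min\{1, (\sqrt{\sum_{s \le t}\|z_s\|^2} + \sqrt{\sum_{s < t}\|z_s\|^2})^{-1}\}$. Function names are hypothetical.

```python
import math
import random

R_MAX = math.sqrt(2.0)  # sup over the unit ball of D_R(f, g) = ||f - g||^2 / 2 equals 2


def project_to_unit_ball(v):
    """Euclidean projection onto the unit ball."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]


def adaptive_step(sq_sum_now, sq_sum_prev):
    """eta_t = R_max * min{1, 1 / (sqrt(sum_{s<=t} |z_s|^2) + sqrt(sum_{s<t} |z_s|^2))}."""
    denom = math.sqrt(sq_sum_now) + math.sqrt(sq_sum_prev)
    return R_MAX if denom <= 1.0 else R_MAX / denom


def adaptive_regret(zs):
    """Return (regret against the best y in the unit ball, bound from the lemma)."""
    d = len(zs[0])
    y = [0.0] * d
    total = [0.0] * d
    played = 0.0   # sum_t <y_t, z_t>
    sq_prev = 0.0  # sum_{s<t} |z_s|^2
    for z in zs:
        played += sum(yi * zi for yi, zi in zip(y, z))
        total = [a + b for a, b in zip(total, z)]
        sq_now = sq_prev + sum(c * c for c in z)
        eta = adaptive_step(sq_now, sq_prev)
        # With R = |.|^2 / 2 the mirror step is a projected gradient step.
        y = project_to_unit_ball([yi - eta * zi for yi, zi in zip(y, z)])
        sq_prev = sq_now
    # sup over |y| <= 1 of -<y, sum_t z_t> equals |sum_t z_t|.
    regret = played + math.sqrt(sum(c * c for c in total))
    bound = 2.5 * R_MAX * (math.sqrt(sq_prev) + 1.0)
    return regret, bound


if __name__ == "__main__":
    rng = random.Random(0)
    zs = [[rng.uniform(-3.0, 3.0) for _ in range(3)] for _ in range(200)]
    print(adaptive_regret(zs))
```

The step size needs no knowledge of $n$ or of a bound on $\|z_t\|$. This is what makes the resulting tail bound self-normalized.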

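Let $\mathbb{E}_t$ denote conditional expectation.

Theorem. Let $Z_1,\dots,Z_n$ be a $\mathcal{B}$-valued MDS. For any $u > 0$,

$$P\left( \frac{\big\|\sum_{t=1}^n Z_t\big\| - 2.5\, R_{\max}\big(\sqrt{V_n} + 1\big)}{\sqrt{V_n + W_n + \big(\mathbb{E}\sqrt{V_n + W_n}\big)^2}} > u \right) \le \sqrt{2} \exp\{-u^2/16\},$$

where $V_n = \sum_{t=1}^n \|Z_t\|^2$ and $W_n = \sum_{t=1}^n \mathbb{E}_{t-1} \|Z_t\|^2$.

- Holds with $W_n \equiv 0$ if the MDS is conditionally symmetric.
- $n$-independent, self-normalized, can be extended to $p$-smooth.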

Summary so far: connection between first-order convex optimization methods and one-sided probabilistic tail bounds.

Beyond Banach spaces

Interpret as the supremum of a stochastic process:

$$\Big\|\sum_{t=1}^n Z_t\Big\| = \sup_{\|y\|_* \le 1} \sum_{t=1}^n \langle y, Z_t \rangle .$$

Generalization (after centering): take any stochastic process $Z_t$ and

$$\sup_{g \in \mathcal{G}} \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] \big) .$$

Enough to consider $\mathcal{D}_t = \sigma(\epsilon_1,\dots,\epsilon_t)$ generated by i.i.d. Rademacher variables:

$$\sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t),$$

where $x_t$ is $\mathcal{D}_{t-1}$-measurable (extend Panchenko's symmetrization technique to martingales):

$$f(x_t(\epsilon_{1:t-1})) = g(z_t(\epsilon_{1:t-1}, +1)) - g(z_t(\epsilon_{1:t-1}, -1)) .$$

Deterministic regret inequalities

Let $y_1,\dots,y_n \in \{\pm 1\}$, $x_1,\dots,x_n \in \mathcal{X}$, $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$. For a given function $B : \mathcal{F} \times \mathcal{X}^n \to \mathbb{R}$, we want a prediction strategy $\hat{y}_t = \hat{y}_t(x_1,\dots,x_t, y_1,\dots,y_{t-1})$ such that $\forall (x_t, y_t)_{t=1}^n$,

$$\sum_{t=1}^n -\hat{y}_t y_t \le \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n -y_t f(x_t) + 2B(f; x_1,\dots,x_n) \Big\} .$$

If existence of $(\hat{y}_t)$ is certified, apply to $y_t = \epsilon_t$ and $x_t = x_t(\epsilon)$:

$$P\Big( \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n) \Big\} > u \Big) \le P\Big( \sum_{t=1}^n \epsilon_t \hat{y}_t > u \Big) \le \exp\{\dots\} .$$
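The passage from the regret inequality to the tail bound is a one-line rearrangement. A sketch, assuming the regret inequality has the form $\sum_t -\hat{y}_t y_t \le \inf_f \{\sum_t -y_t f(x_t) + 2B(f; x_1,\dots,x_n)\}$, with $x_{1:n}$ as shorthand for $x_1,\dots,x_n$:

```latex
\begin{align*}
\sum_{t=1}^n -\hat{y}_t y_t
  \le \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n -y_t f(x_t) + 2B(f; x_{1:n}) \Big\}
\quad\Longleftrightarrow\quad
\sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n y_t f(x_t) - 2B(f; x_{1:n}) \Big\}
  \le \sum_{t=1}^n y_t \hat{y}_t .
\end{align*}
```

Substituting $y_t = \epsilon_t$ turns the right-hand side into $\sum_t \epsilon_t \hat{y}_t$. Since $\hat{y}_t$ depends only on $x_1,\dots,x_t$ and $\epsilon_1,\dots,\epsilon_{t-1}$, and $x_t$ is predictable, each term satisfies $\mathbb{E}[\epsilon_t \hat{y}_t \mid \epsilon_{1:t-1}] = 0$. The right-hand side is therefore a sum of martingale differences, to which a one-sided martingale tail bound applies.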

Lemma. If for any predictable process $x = (x_1,\dots,x_n)$

$$\mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - 2B(f; x_1,\dots,x_n) \Big\} \Big] \le 0,$$

then there exists a strategy $(\hat{y}_t)$ with values $|\hat{y}_t| \le \sup_{f \in \mathcal{F}} |f(x_t)|$ such that the deterministic inequality holds for all sequences.

- automatic amplification to high probability
- existential: no explicit prediction strategy $(\hat{y}_t)$
- an offset version of the sequential Rademacher complexity $\mathcal{R}_n(\mathcal{F}; x) = \mathbb{E}\big[\sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t)\big]$
- $(\epsilon_1,\dots,\epsilon_n) \mapsto \sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t)$ is not Lipschitz; concentration methods fail
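For a finite class and small $n$, the sequential Rademacher complexity $\mathcal{R}_n(\mathcal{F}; x) = \mathbb{E}[\sup_f \sum_t \epsilon_t f(x_t(\epsilon_{1:t-1}))]$ can be computed exactly by enumerating sign sequences. The sketch below represents the predictable process $x$ as a function of the past signs. The name `sequential_rademacher` and the example class are illustrative assumptions.

```python
import itertools


def sequential_rademacher(functions, tree, n):
    """Exact E[ sup_f sum_t eps_t * f(x_t(eps_1..eps_{t-1})) ] over all 2^n sign sequences.

    functions: finite list of callables f(x) -> float (the class F).
    tree: callable mapping the tuple of past signs (eps_1, ..., eps_{t-1}) to x_t,
          which encodes a predictable process.
    """
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        xs = [tree(eps[:t]) for t in range(n)]  # x_t sees only eps_1..eps_{t-1}
        total += max(sum(e * f(x) for e, x in zip(eps, xs)) for f in functions)
    return total / 2 ** n


if __name__ == "__main__":
    # Two constant functions: sup_f sum_t eps_t f(x_t) = |sum_t eps_t|.
    constants = [lambda x: 1.0, lambda x: -1.0]
    print(sequential_rademacher(constants, lambda past: 0, 4))
    # A single function on a sign-dependent tree: the expectation vanishes.
    print(sequential_rademacher([lambda x: float(x)], lambda past: sum(past), 4))
```

The second call illustrates why predictability matters. With a single function there is no supremum, and each term $\epsilon_t f(x_t)$ has zero conditional mean regardless of how the tree depends on past signs.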

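Definition. Let $r \in (1, 2]$. We say that the sequential Rademacher complexity of $\mathcal{F}$ exhibits $n^{1/r}$ growth if

$$\forall n \ge 1,\ \forall x, \qquad \mathcal{R}_n(\mathcal{F}; x) \le C\, n^{1/r} \cdot \sup_{f \in \mathcal{F},\ \epsilon \in \{\pm 1\}^n,\ t \le n} |f(x_t(\epsilon))| .$$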

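Using amplification and reverse Hölder (due to Burkholder/Pisier):

Lemma. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$. Suppose the sequential Rademacher complexity exhibits $n^{1/r}$ growth, $r \in (1, 2]$. For any $p < r$,

$$\mathbb{E} \sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \le C_{r,p}\, \mathbb{E}\Big( \sum_{t=1}^n \sup_{f \in \mathcal{F}} |f(x_t)|^p \Big)^{1/p} .$$

Further, if $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$, then

$$\mathbb{E} \sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t) \le C \log(n)\, \mathbb{E}\Big( \sum_{t=1}^n \sup_{f \in \mathcal{F}} |f(x_t)|^r \Big)^{1/r} .$$

In the spirit of: if one can prove $\mathbb{E}\big\|\sum_t Z_t\big\| \lesssim \sqrt{n}$, then $\mathbb{E}\big\|\sum_t Z_t\big\| \lesssim \mathbb{E}\sqrt{\sum_t \|Z_t\|^2}$.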

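Definition. We say $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ has martingale type $p$ if $\exists C$ such that

$$\mathbb{E}\Big[ \sup_{g \in \mathcal{G}} \sum_{t=1}^n \big( g(Z_t) - \mathbb{E}_{t-1}[g(Z_t)] \big) \Big] \le C\, \mathbb{E}\Big( \sum_{t=1}^n \mathbb{E}_{t-1} \sup_{g \in \mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} .$$

Theorem. For any $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$,
1. If sequential Rademacher exhibits $n^{1/r}$ growth, $r \in (1, 2]$, then $\mathcal{G}$ has martingale type $p$ for every $p < r$.
2. If $\mathcal{G}$ has martingale type $p$, then sequential Rademacher exhibits $n^{1/p}$ growth.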

Finer analysis for type 2

Define

$$\mathrm{Var} = \sum_{t=1}^n \sup_{f \in \mathcal{F}} f(x_t)^2, \qquad \mathrm{Var}(f) = \sum_{t=1}^n f(x_t)^2 .$$

Whenever $\log N_{\mathrm{seq}}(\alpha) \le \alpha^{-q}$, $q \in [0, 2]$,

$$\mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^n \epsilon_t f(x_t) - C \big(\mathrm{Var}^{1/2}\big)^{q/4} \big(\mathrm{Var}^{1/2}(f)\big)^{(2-q)/4} \Big\} \Big] \le 0 .$$

High probability via amplification.

Compare to (Massart, Rossignol '13): weak variance improvement of the Nemirovski inequality. For i.i.d. zero mean $Z_1,\dots,Z_n \in \mathbb{R}^d$:

$$\mathbb{E}\Big[ \max_{j \le d} \Big| \sum_{t=1}^n \epsilon_t Z_{t,j} \Big| \Big] \le \sqrt{2 \ln(2d)}\ \mathbb{E}\sqrt{\max_{j \le d} \sum_{t=1}^n Z_{t,j}^2} .$$

We match the index $j$ on both sides, and extend to martingales beyond the finite case.
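The finite-class comparison can be checked exactly for small $n$. The sketch below fixes the vectors $z_1,\dots,z_n \in \mathbb{R}^d$ (that is, it conditions on the data) and verifies $\mathbb{E}_\epsilon \max_j |\sum_t \epsilon_t z_{t,j}| \le \sqrt{2\ln(2d)\,\max_j \sum_t z_{t,j}^2}$, which is the standard finite-class (Massart) lemma applied to the $2d$ vectors $\pm z_{\cdot,j}$. Taking expectation over the data is assumed to give the displayed weak-variance form. Function names are illustrative.

```python
import itertools
import math


def expected_max_abs(zs):
    """Exact E_eps max_j |sum_t eps_t * zs[t][j]| by enumerating all sign vectors."""
    n, d = len(zs), len(zs[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(abs(sum(e * z[j] for e, z in zip(eps, zs))) for j in range(d))
    return total / 2 ** n


def weak_variance_bound(zs):
    """sqrt(2 ln(2d) * max_j sum_t zs[t][j]^2): the maximum sits outside the sum over t."""
    d = len(zs[0])
    weak_var = max(sum(z[j] ** 2 for z in zs) for j in range(d))
    return math.sqrt(2.0 * math.log(2 * d) * weak_var)


if __name__ == "__main__":
    zs = [[1.0, 0.0, 0.5], [0.0, 2.0, -0.5], [1.0, 0.0, 0.5], [0.0, -1.0, 1.0]]
    print(expected_max_abs(zs), weak_variance_bound(zs))
```

The "weak" variance takes the maximum over coordinates after summing over $t$, which can be much smaller than $\sum_t \max_j z_{t,j}^2$ when different coordinates are large at different times.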

Conclusions

- Equivalence of deterministic regret inequalities and martingale tail bounds
- gives a way of proving tail bounds (for martingales or i.i.d.) by exhibiting a method or certifying its existence
- amplification to high probability
- Use it to extend the notion of martingale type to general classes
- Not in this talk: data-dependent bounds for online learning

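Open questions

- What is behind the equivalence?
- Replace

$$\mathbb{E}\Big( \sum_{t=1}^n \mathbb{E}_{t-1} \sup_{g \in \mathcal{G}} |g(Z_t) - g(Z'_t)|^p \Big)^{1/p} \quad \text{with} \quad \mathbb{E}\Big( \sum_{t=1}^n \sup_{g \in \mathcal{G}} |g(Z_t) - \mathbb{E}_{t-1} g(Z_t)|^p \Big)^{1/p} ?$$

- If sequential Rademacher complexity exhibits an $n^{1/r}$ growth rate, then does $\mathcal{G}$ have martingale type $r$? We only prove martingale type $p$ for any $p < r$.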

Extras

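Reverse Hölder principle

For $p \in (0, \infty)$, define

$$\|Z\|_{p,\infty} = \Big( \sup_{t > 0} t^p\, P(|Z| > t) \Big)^{1/p} .$$

Lemma (Pisier). For any $\delta \in (0, 1)$ and any $R > 0$ there exists $C_p(\delta, R)$ such that the following holds. For i.i.d. copies $(Z_i)_{i \ge 0}$ of $Z$, if

$$\sup_{N \ge 1} P\Big( \sup_{i \le N} N^{-1/p} |Z_i| > R \Big) \le \delta,$$

then $\|Z\|_{p,\infty} \le C_p(\delta, R)$.

Corollary: For any $0 < q < p < \infty$ there exists $C_{p,q}$ such that

$$\|Z\|_{p,\infty} \le C_{p,q} \sup_{N \ge 1} N^{-1/p} \Big\| \sup_{i \le N} |Z_i| \Big\|_q .$$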
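To make the weak norm concrete, the sketch below computes $\|Z\|_{p,\infty} = (\sup_{t>0} t^p P(|Z| > t))^{1/p}$ for the uniform distribution on a finite sample. This is an illustrative choice and not something from the talk. For such a distribution the supremum is approached as $t$ increases to one of the sample values, so it reduces to a maximum over order statistics.

```python
def weak_lp_norm(sample, p):
    """||Z||_{p,inf} for Z uniform on `sample`: (sup_{t>0} t^p * P(|Z| > t))^(1/p)."""
    values = sorted((abs(z) for z in sample), reverse=True)
    m = len(values)
    # As t increases to the (k+1)-th largest value, at least k+1 points lie strictly above t.
    best = max((k + 1) * v ** p / m for k, v in enumerate(values))
    return best ** (1.0 / p)


def lp_norm(sample, p):
    """Ordinary L_p norm (E|Z|^p)^(1/p) for Z uniform on `sample`."""
    return (sum(abs(z) ** p for z in sample) / len(sample)) ** (1.0 / p)


if __name__ == "__main__":
    sample = [3.0, -4.0, 0.0]
    print(weak_lp_norm(sample, 2), lp_norm(sample, 2))
```

By Markov's inequality the weak norm never exceeds the ordinary $L_p$ norm. The reverse Hölder principle concerns the opposite direction, controlling $\|Z\|_{p,\infty}$ from information about maxima of i.i.d. copies.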