Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Mingrui Liu (1), Xiaoxuan Zhang (1), Zaiyi Chen (2), Xiaoyu Wang (3), Tianbao Yang (1)

(1) Department of Computer Science, The University of Iowa, IA 52242, USA. (2) University of Science and Technology of China. (3) Intellifusion. Correspondence to: Mingrui Liu <mingruiliu@uiowa.edu>, Tianbao Yang <tianbao-yang@uiowa.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the authors.

1. Technical Lemmas

We introduce two concentration inequalities in Lemma 4, which are used frequently in the proofs.

Lemma 4. (Randomized version of Hoeffding's inequality.) Suppose n is a random variable taking values in N_+, and let X_1, ..., X_n be independent random variables. Define X = X_1 + ... + X_n. If every X_i is strictly bounded by the interval [a_i, b_i], then we have with probability at least 1 − δ,

  X/n − E[X/n] ≤ √( ln(1/δ) Σ_{i=1}^n (b_i − a_i)^2 / (2n^2) ).   (7)

Similarly, with probability at least 1 − δ,

  E[X/n] − X/n ≤ √( ln(1/δ) Σ_{i=1}^n (b_i − a_i)^2 / (2n^2) ).   (8)

(Randomized version of the vector concentration inequality.) Suppose n is a random variable taking values in N_+, and let X_1, ..., X_n ∈ R^d be i.i.d. random variables. Let φ : R^d → H, where H is a Hilbert space endowed with a norm ‖·‖ (in particular, we can take H to be R^d with its inner-product norm). Suppose B = sup_{x ∈ R^d} ‖φ(x)‖ < ∞. Then we have with probability at least 1 − δ,

  ‖ (1/n) Σ_{i=1}^n φ(X_i) − E[φ(X)] ‖ ≤ (B/√n) ( 2 + √(2 log(1/δ)) ).   (9)

Proof. The proof is quite straightforward. For the randomized version of Hoeffding's inequality, note that

  Pr( X/n − E[X/n] ≥ √( ln(1/δ) Σ_i (b_i − a_i)^2 / (2n^2) ) )
  = Σ_{t=1}^∞ Pr(n = t) Pr( X/t − E[X/t] ≥ √( ln(1/δ) Σ_{i=1}^t (b_i − a_i)^2 / (2t^2) ) | n = t )
  ≤ Σ_{t=1}^∞ Pr(n = t) δ = δ,

where the inequality follows from the deterministic version of Hoeffding's inequality. It is easy to show the correctness of the randomized version of the vector concentration inequality by employing the same technique. The deterministic version can be derived via McDiarmid's inequality (McDiarmid, 1989); a standard proof can be found in Section 4 of (Shawe-Taylor & Cristianini, 2004). For completeness, we include the proof here. To derive the deterministic version, let S = (X_1, ..., X_n) and S' = (X'_1, ..., X'_n) be two collections of independent samples, let S^i = (X_1, ..., X_{i−1}, X'_i, X_{i+1}, ..., X_n), and define f(S) = ‖ (1/n) Σ_{i=1}^n φ(X_i) − E[φ(X)] ‖. Then |f(S) − f(S^i)| ≤ 2B/n. By McDiarmid's inequality, we have

  Pr( f(S) − E[f(S)] > ε ) ≤ exp( − 2nε^2 / (4B^2) ).   (10)

Define σ = (σ_1, ..., σ_n) to be i.i.d. Rademacher variables, i.e., Pr(σ_i = 1) = Pr(σ_i = −1) = 1/2.
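The conditioning argument behind the randomized Hoeffding inequality can be sanity-checked with a quick Monte-Carlo simulation. The sketch below is illustrative only — the uniform law for the random sample size n, the Bernoulli(1/2) summands, and δ = 0.1 are assumptions made for the demo, not part of the lemma:

```python
import math
import random

def violation_rate(delta=0.1, trials=20000, seed=0):
    """Estimate how often the one-sided bound (7), i.e.
    X/n - E[X/n] <= sqrt(ln(1/delta) / (2n)) for [0, 1]-valued summands,
    fails when the sample size n is itself random (uniform on 1..50)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        n = rng.randint(1, 50)                         # random sample size (illustrative law)
        x = sum(rng.random() < 0.5 for _ in range(n))  # Bernoulli(1/2) sum, so E[X/n] = 1/2
        bound = math.sqrt(math.log(1 / delta) / (2 * n))
        if x / n - 0.5 > bound:
            violations += 1
    return violations / trials

rate = violation_rate()
assert rate <= 0.1  # failure frequency should not exceed delta
```

For [0, 1]-valued summands the term Σ_i (b_i − a_i)^2 in (7) equals n, which gives the simplified radicand ln(1/δ)/(2n) used in the code.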

Define φ̄(S) = (1/n) Σ_{i=1}^n φ(X_i). Then

  E[f(S)] = E ‖ φ̄(S) − E[φ̄(S')] ‖ ≤ E_{S,S'} ‖ φ̄(S) − φ̄(S') ‖ = E ‖ (1/n) Σ_{i=1}^n σ_i ( φ(X_i) − φ(X'_i) ) ‖ ≤ 2 E ‖ (1/n) Σ_{i=1}^n σ_i φ(X_i) ‖,

and

  E ‖ (1/n) Σ_i σ_i φ(X_i) ‖ ≤ √( E ‖ (1/n) Σ_i σ_i φ(X_i) ‖^2 ) = √( (1/n^2) E[ Σ_i σ_i^2 ‖φ(X_i)‖^2 + Σ_{i ≠ j} σ_i σ_j ⟨φ(X_i), φ(X_j)⟩ ] ) = √( (1/n^2) Σ_i E ‖φ(X_i)‖^2 ) ≤ B/√n,

where the first inequality is Jensen's inequality and the second equality holds because E[σ_i σ_j] = 0 for i ≠ j. Combining this result with (10) and taking ε = B √( 2 log(1/δ) / n ) suffices to get the result.  ∎

2. Proof of Lemma 2

Proof. According to equation (6) in (Ying et al., 2016), we have

  f(v, α) = f(w, a, b, α)
  = (1−p)p { ∫ [ (w^⊤x − a)^2 − 2(1+α) w^⊤x ] P(x | y = 1) dx + ∫ [ (w^⊤x − b)^2 + 2(1+α) w^⊤x ] P(x | y = −1) dx − α^2 }
  = (1−p)p { w^⊤( E[xx^⊤ | y=1] + E[xx^⊤ | y=−1] )w − 2w^⊤( a E[x | y=1] + b E[x | y=−1] ) + a^2 + b^2 + 2(1+α) w^⊤( E[x | y=−1] − E[x | y=1] ) − α^2 }.

When α = w^⊤( E[x | y=−1] − E[x | y=1] ), f(v, α) achieves its maximum with respect to α, and this α is feasible by the Cauchy–Schwarz inequality, i.e., |α| ≤ ‖w‖ ‖E[x | y=−1] − E[x | y=1]‖ ≤ 2Rκ. So we get

  f_1(v) := f( w, a, b, w^⊤(E[x|y=−1] − E[x|y=1]) )
  = (1−p)p [ w^⊤E[xx^⊤|y=1]w − 2a w^⊤E[x|y=1] + a^2 + w^⊤E[xx^⊤|y=−1]w − 2b w^⊤E[x|y=−1] + b^2 + 2w^⊤( E[x|y=−1] − E[x|y=1] ) + ( w^⊤( E[x|y=−1] − E[x|y=1] ) )^2 ]
  = (1−p)p [ (w, a, b) M (w, a, b)^⊤ + affine function of v ],

where M = M_1 + M_2 + M_3 with

  M_1 = [ E[xx^⊤|y=1]    −E[x|y=1]   0
          −E[x|y=1]^⊤     1          0
           0^⊤            0          0 ],

  M_2 = [ E[xx^⊤|y=−1]    0   −E[x|y=−1]
           0^⊤            0    0
          −E[x|y=−1]^⊤    0    1 ],

  M_3 = [ qq^⊤   0   0
           0^⊤   0   0
           0^⊤   0   0 ],   q = E[x|y=−1] − E[x|y=1].

Note that M_1, M_2, M_3 are positive semidefinite matrices, and hence M is positive semidefinite. So f_1(v) is convex and piecewise quadratic. Since Ω is a polyhedron, according to Corollary 3 of (Li, 2013), f_1(v) restricted on Ω satisfies the quadratic growth condition.  ∎

3. Proof of Lemma 3

Proof. By applying inequality (9) in Lemma 4 (once to each class, each at confidence level δ/6), the triangle inequality, and the union bound, we have with probability at least 1 − δ/3,

  ‖Â − A‖ ≤ κ( 2 + √(2 log(6/δ)) )/√n_+ + κ( 2 + √(2 log(6/δ)) )/√n_-.   (11)

Note that both n_+ and n_- are sums of i.i.d. Bernoulli random variables; denote p = Pr(y = 1). By applying the deterministic version of Hoeffding's inequality in Lemma 4 (i.e., inequality (8)) to the indicator functions I_{[y_i = 1]} and I_{[y_i = −1]} respectively, together with the union bound, we have with probability at least 1 − δ/3 that the following two inequalities hold simultaneously:

  n_+ ≥ np − √( n ln(6/δ)/2 ),   n_- ≥ n(1−p) − √( n ln(6/δ)/2 ).   (12)
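The Hoeffding-style lower bound on the positive-class count n_+ used above can be simulated directly; the sample size n = 500, class prior p = 0.3, and δ = 0.1 below are illustrative choices, not values from the paper:

```python
import math
import random

def lower_tail_rate(n=500, p=0.3, delta=0.1, trials=5000, seed=1):
    """Fraction of runs in which the positive count n_+ falls below
    n*p - sqrt(n * ln(6/delta) / 2); by Hoeffding this is at most delta/6."""
    rng = random.Random(seed)
    slack = math.sqrt(n * math.log(6 / delta) / 2)
    bad = 0
    for _ in range(trials):
        n_pos = sum(rng.random() < p for _ in range(n))  # labels y_i = 1 w.p. p
        if n_pos < n * p - slack:
            bad += 1
    return bad / trials

rate = lower_tail_rate()
assert rate <= 0.1 / 6  # empirical failure rate within the guarantee
```

The observed failure rate is typically far below δ/6, since Hoeffding's inequality is conservative for binomial tails.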

According to (11) and (12), by the union bound, we know that with probability at least 1 − 2δ/3,

  ‖Â − A‖ ≤ κ( 2 + √(2 log(6/δ)) )/√( np − √(n ln(6/δ)/2) ) + κ( 2 + √(2 log(6/δ)) )/√( n(1−p) − √(n ln(6/δ)/2) ).   (13)

Note that n ≥ 2 ln(6/δ)/min{p, 1−p}^2 is equivalent to

  √( n ln(6/δ)/2 ) ≤ np/2,   √( n ln(6/δ)/2 ) ≤ n(1−p)/2.   (14)

Utilizing (14) and plugging it into (13), we know that with probability at least 1 − 2δ/3,

  ‖Â − A‖ ≤ κ( 2 + √(2 log(6/δ)) )/√(np/2) + κ( 2 + √(2 log(6/δ)) )/√(n(1−p)/2) ≤ 4κ( 2 + √(2 log(6/δ)) )/√(nξ),

where ξ = min{p, 1−p} and the last step uses (14), i.e., n ≥ 2 ln(6/δ)/ξ^2.  ∎

4. Proof of Theorem 1

Proof. Define δ' = δ/log_2(n), and

  a(n_0, δ') = G( 3γ_1 + γ_2 √(2 ln(6/δ')) )/√n_0,   µ_0 = 2a(n_0, δ')/R_0,   µ_k = 2^k µ_0,   R_k = R_0/2^k,

where k = 1, ..., m. Then we have µ_k R_k^2 = 2^{−k} µ_0 R_0^2. By the definition of m, when n ≥ 100,

  0 < (1/2) log_2( n/log_2 n ) − 1 ≤ m ≤ (1/2) log_2( n/log_2 n ),   (15)

so we have

  4^m ≤ n/log_2 n.   (16)

To employ the result of Lemma 1, at the i-th stage we need n_0 and R_i to satisfy R_0^2/R_i^2 = 4^i ≤ n_0, which should hold for any i ≤ m; so n_0 ≥ 4^m suffices to achieve this requirement. Now we argue that this condition is implied by n ≥ 100. Note that n_0 = ⌊n/m⌋ ≥ n/(2m). To show the implication from n ≥ 100 to n_0 ≥ 4^m, it suffices to prove 4^m ≤ n/(2m); by (16) it then suffices to prove 2m ≤ log_2 n, which holds since m ≤ (1/2) log_2( n/log_2 n ) ≤ (1/2) log_2 n by (15). (When n ≥ 100 we have n/log_2 n > 4, so the lower bound in (15) is indeed positive.)

According to Lemma 2, we know that P(v) = f_1(v) satisfies the quadratic growth condition, which implies that there exists some c > 0 such that

  c ‖v − v*‖^2 ≤ P(v) − P(v*),

where v* is the closest optimal solution to v in Ω. We can assume c ≤ 2G/R_0; otherwise we can set c to be 2G/R_0, and the quadratic growth property in Lemma 2 still holds. When n ≥ 100, we have

  µ_m = 2^m µ_0 = 2^{m+1} a(n_0, δ')/R_0
      ≥ √( n/log_2 n ) a(n_0, δ')/R_0
      = (G/R_0) √( n/log_2 n ) ( 3γ_1 + γ_2 √(2 ln(6 log_2(n)/δ)) )/√n_0
      ≥ (G/R_0) √( m/log_2 n ) ( 3γ_1 + γ_2 √(2 ln(6 log_2 n)) )
      ≥ (G/R_0) √( m/log_2 n ) · 2√( 3γ_1 γ_2 √(2 ln(6 log_2 n)) )
      ≥ (2G/R_0) √( 3 √(2 ln(6 log_2 n)) · m/log_2 n )
      ≥ 2G/R_0,

where the first inequality holds because of the lower bound on m in (15), the second inequality stems from the fact that 0 < δ < 1 (with the definition of δ') and n_0 ≤ n/m, the third inequality holds by employing a + b ≥ 2√(ab), the fourth inequality holds because γ_1, γ_2 ≥ 1, and the last inequality holds since n ≥ 100 (one can check that the quantity under the square root is then at least 1). So 2G/R_0 ≤ µ_m. Recall that c ≤ 2G/R_0, and thus c ≤ µ_m.

Given v_k, denote by v_k* the closest optimal solution to v_k. We consider two cases.

Case 1. If c ≥ µ_0, then µ_0 ≤ c ≤ µ_m. So there exists a k_1 such that µ_{k_1} ≤ c < µ_{k_1+1}, where 0 ≤ k_1 < m. To utilize this fact, we have the following lemma.

Lemma 5. Let k_1 satisfy µ_{k_1} ≤ c < µ_{k_1+1}. Then for any k ≤ k_1, there exists a Borel set A_k of probability at least 1 − kδ' such that on A_k the points {v_k}_{k=1}^m generated by the algorithm satisfy

  ‖v_k − v_k*‖ ≤ R_k = 2^{−k} R_0,   (17)
  P(v_k) − P(v*) ≤ µ_k R_k^2 = 2^{−k} µ_0 R_0^2.   (18)

Moreover, for k > k_1 there is a Borel set C_k of probability at least 1 − (k − k_1)δ' such that on C_k we have

  P(v_k) − P(v_{k_1}) ≤ µ_{k_1} R_{k_1}^2.   (19)

Proof. We prove (17) and (18) by induction. Note that (17) holds for k = 0 by the choice of R_0. Assume it is true for some k ≥ 0 on A_k. According to Lemma 1, there exists a Borel set B_{k+1} with Pr(B_{k+1}^c) ≤ δ' on which

  P(v_{k+1}) − P(v*) ≤ R_k a(n_0, δ') = 2^{−(k+1)} µ_0 R_0^2 = µ_{k+1} R_{k+1}^2,

which is (18) for k+1. (By the inductive hypothesis, ‖v_k − v_k*‖ ≤ R_k on the set A_k, so Lemma 1 is applicable at the (k+1)-th stage.) Define A_{k+1} = A_k ∩ B_{k+1}. Note that Pr(A_{k+1}^c) ≤ Pr(A_k^c) + Pr(B_{k+1}^c) ≤ (k+1)δ', and on A_{k+1}, by the quadratic growth condition and the definition of k_1 (so that c ≥ µ_{k_1} ≥ µ_{k+1} when k+1 ≤ k_1), we have

  ‖v_{k+1} − v_{k+1}*‖^2 ≤ (1/c)( P(v_{k+1}) − P(v*) ) ≤ µ_{k+1} R_{k+1}^2 / µ_{k_1} ≤ R_{k+1}^2,

which is (17) for k+1.

Now we prove (19). For k > k_1, it is easy to show a conclusion similar to Lemma 1. Remark: at the k-th stage with k > k_1, one can repeat the proof of Lemma 1 with v* replaced by v_{k−1}; the first term I_1 on the right-hand side then becomes zero, and hence we can get a tighter bound on P(v_k) − P(v_{k−1}); here we relax the bound to R_{k−1} a(n_0, δ'). That is, there exists a Borel set B_k with Pr(B_k^c) ≤ δ' on which

  P(v_k) − P(v_{k−1}) ≤ R_{k−1} a(n_0, δ') = 2^{k_1−k+1} R_{k_1} a(n_0, δ') = 2^{k_1−k} µ_{k_1} R_{k_1}^2,

which implies that on C_k = ∩_{j=k_1+1}^k B_j we have

  P(v_k) − P(v_{k_1}) = Σ_{j=k_1+1}^k [ P(v_j) − P(v_{j−1}) ] ≤ Σ_{j=k_1+1}^k 2^{k_1−j} µ_{k_1} R_{k_1}^2 ≤ µ_{k_1} R_{k_1}^2.

By the union bound, we have Pr(C_k^c) = Pr( ∪_{j=k_1+1}^k B_j^c ) ≤ (k − k_1)δ'. This completes the proof.  ∎

Now we proceed with the proof as follows. Note that µ_0 ≤ c ≤ µ_m. At the end of the k_1-th stage, on the Borel set A_{k_1} of probability at least 1 − k_1 δ', we have P(v_{k_1}) − P(v*) ≤ µ_{k_1} R_{k_1}^2. Then on the Borel set D_m = C_m ∩ A_{k_1} = ( ∩_{j=k_1+1}^m B_j ) ∩ A_{k_1} with Pr(D_m^c) ≤ mδ', we have

  P(v_m) − P(v*) = [ P(v_m) − P(v_{k_1}) ] + [ P(v_{k_1}) − P(v*) ] ≤ 2µ_{k_1} R_{k_1}^2 ≤ (4/c)( µ_{k_1} R_{k_1} )^2 = (16/c) a(n_0, δ')^2,

where the last two steps use 1/µ_{k_1} = 2/µ_{k_1+1} ≤ 2/c and µ_{k_1} R_{k_1} = µ_0 R_0 = 2a(n_0, δ'). By the definition of m and δ', and the fact that m ≤ log_2 n, we have mδ' ≤ δ. So Pr(D_m) ≥ 1 − δ.

Case 2. If c < µ_0, then on A_1 = B_1,

  P(v_1) − P(v*) ≤ R_0 a(n_0, δ') = (2/µ_0) a(n_0, δ')^2 ≤ (2/c) a(n_0, δ')^2.

Hence on A_1 ∩ C_m, by using Lemma 5 (with k_1 replaced by 1) and a similar argument as in Case 1, we have

  P(v_m) − P(v*) = [ P(v_m) − P(v_1) ] + [ P(v_1) − P(v*) ] ≤ 2R_0 a(n_0, δ') ≤ (4/c) a(n_0, δ')^2,

where Pr( (A_1 ∩ C_m)^c ) ≤ mδ' ≤ δ. Combining the two cases, we have with probability at least 1 − δ,

  P(v_m) − P(v*) ≤ (16/c) a(n_0, δ')^2 ≤ (16/c) G^2 ( 3γ_1 + γ_2 √(2 ln(6 log_2(n)/δ)) )^2 (log_2 n)/n = Õ(1/n),

since 1/n_0 ≤ (log_2 n)/n and 6/δ' = 6 log_2(n)/δ.  ∎

References

Li, G. Global error bounds for piecewise convex polynomials. Mathematical Programming, pp. 1–28, 2013.

McDiarmid, C. On the method of bounded differences. Surveys in Combinatorics, 141:148–188, 1989.

Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Ying, Y., Wen, L., and Lyu, S. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, pp. 451–459, 2016.