Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Mingrui Liu (1), Xiaoxuan Zhang (1), Zaiyi Chen (2), Xiaoyu Wang (3), Tianbao Yang (1)

(1) Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA. (2) University of Science and Technology of China. (3) Intellifusion. Correspondence to: Mingrui Liu <mingrui-liu@uiowa.edu>, Xiaoxuan Zhang <xiaoxuanzhang@uiowa.edu>, Zaiyi Chen <czy656@mail.ustc.edu.cn>, Xiaoyu Wang <fanghuaxue@gmail.com>, Tianbao Yang <tianbao-yang@uiowa.edu>.

1. Technical Lemmas

We introduce two concentration inequalities in Lemma 4, which are used frequently in the proofs.

Lemma 4. (i) (Randomized version of Hoeffding's inequality) Suppose $n$ is a random variable taking values in $\mathbb{N}_+$, and let $X_1,\ldots,X_n$ be independent random variables. Define $X = X_1 + \cdots + X_n$. If every $X_i$ is strictly bounded by the interval $[a_i, b_i]$, then with probability at least $1-\delta$,

$$X - \mathbb{E}[X] \le \sqrt{\frac{\ln(1/\delta)}{2}\sum_{i=1}^{n}(b_i-a_i)^2}. \qquad (7)$$

Similarly, with probability at least $1-\delta$,

$$\mathbb{E}[X] - X \le \sqrt{\frac{\ln(1/\delta)}{2}\sum_{i=1}^{n}(b_i-a_i)^2}. \qquad (8)$$

(ii) (Randomized version of vector concentration inequality) Suppose $n$ is a random variable taking values in $\mathbb{N}_+$, and let $X_1,\ldots,X_n \in \mathbb{R}^d$ be i.i.d. random variables. Let $\phi:\mathbb{R}^d \to H$, where $H$ is a Hilbert space endowed with a norm $\|\cdot\|$ (in fact we can take $H$ to be $\mathbb{R}^d$ endowed with the infinity norm), and suppose $B = \sup_{x\in\mathbb{R}^d}\|\phi(x)\| < \infty$. Then with probability at least $1-\delta$,

$$\Big\|\frac{1}{n}\sum_{i=1}^{n}\phi(X_i) - \mathbb{E}\,\phi(X)\Big\| \le \frac{B}{\sqrt{n}}\Big[2 + \sqrt{2\log(1/\delta)}\Big]. \qquad (9)$$

Proof. The proof is quite straightforward. For the randomized version of Hoeffding's inequality, note that

$$\Pr\bigg(X_1+\cdots+X_n - \mathbb{E}[X] \ge \sqrt{\frac{\ln(1/\delta)}{2}\sum_{i=1}^{n}(b_i-a_i)^2}\bigg) = \sum_{t=1}^{\infty}\Pr(n=t)\,\Pr\bigg(X - \mathbb{E}[X] \ge \sqrt{\frac{\ln(1/\delta)}{2}\sum_{i=1}^{t}(b_i-a_i)^2}\ \bigg|\ n=t\bigg) \le \sum_{t=1}^{\infty}\Pr(n=t)\,\delta = \delta,$$

where the inequality follows from the deterministic version of Hoeffding's inequality. It is easy to show the correctness of the randomized version of the vector concentration inequality by employing the same technique. The deterministic version can be derived via McDiarmid's inequality (McDiarmid, 1989); a standard proof can be found in Section 4 of (Shawe-Taylor & Cristianini, 2004). For completeness, we include the proof here. To derive the deterministic version, define $S = (X_1,\ldots,X_n)$ and $\tilde S = (\tilde X_1,\ldots,\tilde X_n)$ to be two collections of independent samples, let $S_i = (X_1,\ldots,X_{i-1},\tilde X_i,X_{i+1},\ldots,X_n)$, and set $f(S) = \big\|\frac{1}{n}\sum_{i=1}^{n}\phi(X_i) - \mathbb{E}\,\phi(X)\big\|$. Then $|f(S) - f(S_i)| \le 2B/n$, so by McDiarmid's inequality,

$$\Pr\big(f(S) - \mathbb{E}f(S) > \epsilon\big) \le \exp\Big(-\frac{n\epsilon^2}{2B^2}\Big). \qquad (10)$$

Define $\sigma = (\sigma_1,\ldots,\sigma_n)$ to be Rademacher variables, i.e., $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 1/2$ with the $\sigma_i$'s i.i.d., and define $\hat\phi(S) = \frac{1}{n}\sum_{i=1}^{n}\phi(X_i)$. Then

$$\mathbb{E}f(S) = \mathbb{E}\big\|\hat\phi(S) - \mathbb{E}\,\hat\phi(\tilde S)\big\| \le \mathbb{E}\big\|\hat\phi(S) - \hat\phi(\tilde S)\big\| = \mathbb{E}\Big\|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(\phi(X_i) - \phi(\tilde X_i)\big)\Big\| \le \frac{2}{n}\,\mathbb{E}\Big\|\sum_{i=1}^{n}\sigma_i\phi(X_i)\Big\|$$

and

$$\mathbb{E}\Big\|\sum_{i=1}^{n}\sigma_i\phi(X_i)\Big\| \le \sqrt{\mathbb{E}\Big[\sum_{i=1}^{n}\|\phi(X_i)\|^2 + \sum_{i\ne j}\sigma_i\sigma_j\big\langle\phi(X_i),\phi(X_j)\big\rangle\Big]} = \sqrt{\sum_{i=1}^{n}\mathbb{E}\|\phi(X_i)\|^2} \le B\sqrt{n},$$

so that $\mathbb{E}f(S) \le 2B/\sqrt{n}$. Combining this result with (10) and taking $\epsilon = B\sqrt{2\log(1/\delta)/n}$ suffices to get the result.
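As a quick illustration of inequality (7) (an addition for this writeup, not part of the original argument), the following Monte Carlo sketch draws a random sample size $n$ and checks how often the deviation exceeds the stated bound; the geometric law for $n$, the uniform $X_i$, and the constants are assumptions made only for the demo.

```python
# Monte Carlo check of the randomized Hoeffding inequality (7).
# Assumptions for illustration: n ~ Geometric(0.01), X_i ~ Uniform[0, 1],
# so [a_i, b_i] = [0, 1] and sum_i (b_i - a_i)^2 = n.
import numpy as np

rng = np.random.default_rng(0)
delta, trials = 0.05, 20_000
violations = 0
for _ in range(trials):
    n = rng.geometric(p=0.01)            # random sample size on N_+
    x = rng.uniform(0.0, 1.0, size=n)    # independent bounded X_i
    dev = x.sum() - 0.5 * n              # X - E[X], since E[X_i] = 1/2
    bound = np.sqrt(np.log(1.0 / delta) / 2.0 * n)
    violations += dev > bound
print(f"violation rate: {violations / trials:.4f} (Lemma 4 guarantees <= {delta})")
```

Since Hoeffding's inequality is conservative, the observed violation rate is typically far below $\delta$.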

2. Proof of Lemma 2

Proof. According to equation (6) in (Ying et al., 2016), we have

$$f(v,\alpha) = f(w,a,b,\alpha) = p(1-p)\bigg\{\int_x\Big[(w^\top x - a)^2 - 2(1+\alpha)\,w^\top x\Big]P(x|y=1)\,dx + \int_x\Big[(w^\top x - b)^2 + 2(1+\alpha)\,w^\top x\Big]P(x|y=-1)\,dx - \alpha^2\bigg\}$$

$$= p(1-p)\Big\{w^\top\big(\mathbb{E}[xx^\top|y=1] + \mathbb{E}[xx^\top|y=-1]\big)w - 2a\,w^\top\mathbb{E}[x|y=1] - 2b\,w^\top\mathbb{E}[x|y=-1] + a^2 + b^2 + 2(1+\alpha)\,w^\top\big(\mathbb{E}[x|y=-1] - \mathbb{E}[x|y=1]\big) - \alpha^2\Big\}.$$

When $\alpha = w^\top\big(\mathbb{E}[x|y=-1] - \mathbb{E}[x|y=1]\big)$, it is easy to see that $\alpha \in \Omega_2$ by employing the Cauchy-Schwarz inequality, i.e., $|\alpha| \le \|w\|\,\big\|\mathbb{E}[x|y=-1] - \mathbb{E}[x|y=1]\big\| \le 2R\kappa$, and $f(v,\alpha)$ achieves its maximum with respect to $\alpha$, so we get

$$f(v) = f\big(w,a,b,\,w^\top(\mathbb{E}[x|y=-1] - \mathbb{E}[x|y=1])\big) = p(1-p)\Big[w^\top\mathbb{E}[xx^\top|y=1]\,w - 2a\,w^\top\mathbb{E}[x|y=1] + a^2 + w^\top\mathbb{E}[xx^\top|y=-1]\,w - 2b\,w^\top\mathbb{E}[x|y=-1] + b^2 + 2\,w^\top\big(\mathbb{E}[x|y=-1] - \mathbb{E}[x|y=1]\big) + \big(w^\top(\mathbb{E}[x|y=-1] - \mathbb{E}[x|y=1])\big)^2\Big]$$

$$= p(1-p)\Big[(w^\top, a, b)\,M\,(w^\top, a, b)^\top + \text{affine function of } v\Big],$$

where $M = M_1 + M_2 + M_3$ with

$$M_1 = \begin{pmatrix}\mathbb{E}[xx^\top|y=1] & -\mathbb{E}[x|y=1] & 0\\ -\mathbb{E}[x|y=1]^\top & 1 & 0\\ 0 & 0 & 0\end{pmatrix}, \qquad M_2 = \begin{pmatrix}\mathbb{E}[xx^\top|y=-1] & 0 & -\mathbb{E}[x|y=-1]\\ 0 & 0 & 0\\ -\mathbb{E}[x|y=-1]^\top & 0 & 1\end{pmatrix},$$

$$M_3 = \begin{pmatrix}qq^\top & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 0\end{pmatrix}, \qquad q = \mathbb{E}[x|y=-1] - \mathbb{E}[x|y=1].$$

Note that $M_1$, $M_2$, $M_3$ are positive semidefinite matrices; for instance, $(w^\top,a,b)\,M_1\,(w^\top,a,b)^\top = \mathbb{E}\big[(w^\top x - a)^2\,\big|\,y=1\big] \ge 0$. Hence $M$ is positive semidefinite, so $f(v)$ is convex and piecewise quadratic. Since $\Omega_1$ is a polyhedron, according to Corollary 3 of (Li, 2013), we know that $f(v)$ restricted on $\Omega_1$ satisfies the quadratic growth condition.
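To make the block structure of $M_1$, $M_2$, $M_3$ concrete, here is a small numerical sketch (not from the paper; the Gaussian class-conditional moments and $d = 3$ are arbitrary choices) that assembles the three matrices and confirms each has a nonnegative spectrum, so that $M = M_1 + M_2 + M_3$ is positive semidefinite:

```python
# Numerical sanity check that M1, M2, M3 from the proof of Lemma 2 are PSD.
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu_pos, mu_neg = rng.normal(size=d), rng.normal(size=d)
# Synthetic second moments E[xx^T | y] = Cov + mu mu^T (Gaussian assumption).
S_pos = np.eye(d) + np.outer(mu_pos, mu_pos)
S_neg = 2.0 * np.eye(d) + np.outer(mu_neg, mu_neg)

def embed(S, mu, slot):
    # (d+2)x(d+2) block matrix: S in the top-left corner, -mu in row/column
    # `slot` (index d for the variable a, index d+1 for b), 1 on that diagonal.
    M = np.zeros((d + 2, d + 2))
    M[:d, :d] = S
    M[:d, slot] = -mu
    M[slot, :d] = -mu
    M[slot, slot] = 1.0
    return M

M1 = embed(S_pos, mu_pos, d)      # quadratic form E[(w^T x - a)^2 | y = 1]
M2 = embed(S_neg, mu_neg, d + 1)  # quadratic form E[(w^T x - b)^2 | y = -1]
q = mu_neg - mu_pos
M3 = np.zeros((d + 2, d + 2))
M3[:d, :d] = np.outer(q, q)       # rank-one term (w^T q)^2

for name, M in [("M1", M1), ("M2", M2), ("M3", M3), ("M", M1 + M2 + M3)]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(M).min())
```

Each reported minimum eigenvalue is nonnegative up to floating-point error, matching the argument that every block encodes a conditional expectation of a square.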

3. Proof of Lemma 3

Proof. By applying inequality (9) in Lemma 4, the triangle inequality, and the union bound, we have with probability at least $1-\delta/3$,

$$\|\hat A - A\| \le \frac{2\kappa^2\big(2+\sqrt{2\ln(6/\delta)}\big)}{\sqrt{n_+}} + \frac{2\kappa^2\big(2+\sqrt{2\ln(6/\delta)}\big)}{\sqrt{n_-}}. \qquad (11)$$

Note that both $n_+$ and $n_-$ are sums of i.i.d. Bernoulli random variables, and denote $p = \Pr(y=1)$. By applying the deterministic version of Hoeffding's inequality in Lemma 4 (i.e., inequality (8)) to the indicator functions of the random variables $\mathbb{I}_{[y_i=1]}$ and $\mathbb{I}_{[y_i=-1]}$ respectively, and the union bound, with probability at least $1-\delta/3$ the following two inequalities hold simultaneously:

$$n_+ \ge np - \sqrt{\frac{n\ln(6/\delta)}{2}}, \qquad n_- \ge n(1-p) - \sqrt{\frac{n\ln(6/\delta)}{2}}. \qquad (12)$$

According to (11) and (12), by the union bound, we know that with probability at least $1-2\delta/3$,

$$\|\hat A - A\| \le \frac{2\kappa^2\big(2+\sqrt{2\ln(6/\delta)}\big)}{\sqrt{np - \sqrt{n\ln(6/\delta)/2}}} + \frac{2\kappa^2\big(2+\sqrt{2\ln(6/\delta)}\big)}{\sqrt{n(1-p) - \sqrt{n\ln(6/\delta)/2}}}. \qquad (13)$$

Note that

$$n \ge \frac{2\ln(6/\delta)}{\min(p,1-p)^2} \quad\text{is equivalent to}\quad \sqrt{\frac{n\ln(6/\delta)}{2}} \le \frac{n\min(p,1-p)}{2}. \qquad (14)$$

Utilizing (14) and plugging it into (13), we know that with probability at least $1-2\delta/3$,

$$\|\hat A - A\| \le \frac{2\kappa^2\big(2+\sqrt{2\ln(6/\delta)}\big)}{\sqrt{np/2}} + \frac{2\kappa^2\big(2+\sqrt{2\ln(6/\delta)}\big)}{\sqrt{n(1-p)/2}} \le \frac{4\kappa^2\big(2+\sqrt{2\ln(6/\delta)}\big)}{\sqrt{n\xi/2}},$$

where $\xi = \min(p, 1-p)$.

4. Proof of Theorem 2

Proof. Define $m = \big\lfloor\frac{1}{2}\log_2\frac{n}{\log_2 n}\big\rfloor$, $n_0 = \lfloor n/m \rfloor$, $\tilde\delta = \frac{\delta}{10\log_2 n}$, and

$$a_{n,\delta} = \frac{G\big(3\gamma_1+\gamma_2\sqrt{2\ln(6/\delta)}\big)}{\sqrt{n}}, \qquad \mu_0 = \frac{2\,a_{n_0,\tilde\delta}}{R_0}, \qquad \mu_k = 2^k\mu_0, \qquad R_k = R_0/2^k,$$

where $k = 1,\ldots,m$. Then we have $\mu_kR_k^2 = 2^{-k}\mu_0R_0^2$, while $\mu_kR_k = \mu_0R_0$ for every $k$.
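Before continuing, a small numeric sketch of this stage schedule (the values of $R_0$, $G$, $\gamma_1$, $\gamma_2$, $\delta$ below are placeholders, not the paper's): $\mu_k$ doubles while $R_k$ halves, so the per-stage gap target $\mu_kR_k^2$ is cut in half at every stage while $\mu_kR_k$ stays constant.

```python
# Illustrative computation of the stage schedule used in the proof of Theorem 2.
import math

n = 10_000
m = math.floor(0.5 * math.log2(n / math.log2(n)))  # number of stages, per the proof
n0 = n // m                                        # samples per stage
R0, G, gamma1, gamma2, delta = 1.0, 1.0, 1.0, 1.0, 0.01  # placeholder constants
delta_t = delta / (10 * math.log2(n))
a = G * (3 * gamma1 + gamma2 * math.sqrt(2 * math.log(6 / delta_t))) / math.sqrt(n0)
mu0 = 2 * a / R0

for k in range(m + 1):
    mu_k, R_k = 2**k * mu0, R0 / 2**k
    # mu_k * R_k**2 halves each stage; mu_k * R_k is invariant (= 2 * a).
    print(f"stage {k}: mu_k*R_k^2 = {mu_k * R_k**2:.3e}, mu_k*R_k = {mu_k * R_k:.3e}")
```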

By the definition of $m$, when $n \ge 100$,

$$0 < \frac{1}{2}\log_2\frac{n}{\log_2 n} - 1 \le m \le \frac{1}{2}\log_2 n, \qquad (15)$$

so we have

$$4^m \ge \frac{n}{4\log_2 n}. \qquad (16)$$

To employ the result of Lemma 1, at the $i$-th stage we need $n_0$ to satisfy

$$n_0 \ge \frac{R_{i-1}^2 + 4\kappa^2}{R_{i-1}^2} = 1 + \frac{4^i\kappa^2}{R_0^2},$$

which should hold for any $i \le m$ (recall $R_i^2 = 4^{-i}R_0^2$). So $n_0 \ge 4^m\big(1 + \kappa^2/R_0^2\big)$ suffices to achieve this requirement. Now we argue that this condition can be implied by $n \ge 100$: note that $n_0 = \lfloor n/m\rfloor \ge \frac{n}{m} - 1 \ge \frac{2n}{\log_2 n} - 1$ by (15), while $4^m \le \frac{n}{\log_2 n}$ by the definition of $m$, and an elementary comparison of logarithms (treating $\kappa^2/R_0^2$ as an absolute constant) closes the gap for $n \ge 100$.

According to Lemma 2, we know that $P(v)$ satisfies the quadratic growth condition, which implies that there exists some $c > 0$ such that $\|v - v^*\|^2 \le c\,\big(P(v) - P_*\big)$, where $v^*$ is the closest point to $v$ in the optimal set. We can assume $c \ge R_0/G$, i.e., $\frac{1}{c} \le \frac{G}{R_0}$; otherwise we can set $c$ to be $R_0/G$ such that the quadratic growth property in Lemma 2 still holds. When $n \ge 100$, we have

$$\mu_m = 2^m\mu_0 = \frac{2^{m+1}a_{n_0,\tilde\delta}}{R_0} = \frac{2^{m+1}G\big(3\gamma_1+\gamma_2\sqrt{2\ln(6/\tilde\delta)}\big)}{R_0\sqrt{n_0}} \ge \frac{3G}{R_0}\cdot\frac{2^{m+1}}{\sqrt{n/m}} \ge \frac{3G}{R_0}\sqrt{\frac{m}{\log_2 n}} \ge \frac{3G}{R_0}\sqrt{\frac{1}{2\log_2 n}\log_2\frac{n}{\log_2 n} - \frac{1}{\log_2 n}} \ge \frac{G}{R_0},$$

where the first inequality holds because $\gamma_1, \gamma_2 \ge 1$, $0 < \tilde\delta < 1$, and $n_0 \le n/m$; the second inequality holds because of (16); the third holds because of the lower bound of $m$ in (15); and the last holds since $n \ge 100$ and the function under the square root is monotonically increasing with respect to $n$. So $\frac{G}{R_0} \le \mu_m$. Recall that $\frac{1}{c} \le \frac{G}{R_0}$, and thus $\frac{1}{c} \le \mu_m$.

Given $v_k$, denote by $v_k^*$ the closest optimal solution to $v_k$. We consider two cases.

Case 1. If $\frac{1}{c} \ge \mu_0$, then $\mu_0 \le \frac{1}{c} \le \mu_m$, so there exists a $k^\dagger$ such that $\mu_{k^\dagger} \le \frac{1}{c} \le \mu_{k^\dagger+1}$, where $0 \le k^\dagger < m$. To utilize this fact, we have the following lemma.

Lemma 5. Let $k^\dagger$ satisfy $\mu_{k^\dagger} \le \frac{1}{c} \le \mu_{k^\dagger+1}$. Then for any $k \le k^\dagger$, there exists a Borel set $A_k$ of probability at least $1 - k\tilde\delta$ such that on $A_k$ the points $\{v_k\}_{k=1}^{m}$ generated by Algorithm 1 satisfy

$$\|v_{k-1} - v_{k-1}^*\| \le R_{k-1} = 2^{-k+1}R_0, \qquad (17)$$

$$P(v_k) - P_* \le \mu_kR_k^2 = 2^{-k}\mu_0R_0^2. \qquad (18)$$

Moreover, for $k > k^\dagger$ there is a Borel set $C_k$ of probability at least $1 - (k-k^\dagger)\tilde\delta$ such that on $C_k$ we have

$$P(v_k) - P(v_{k^\dagger}) \le \mu_{k^\dagger}R_{k^\dagger}^2. \qquad (19)$$

Proof. We prove (17) and (18) by induction. Note that (17) holds for $k = 1$. Assume it is true for some $k \ge 1$ on $A_{k-1}$. According to Lemma 1, there exists a Borel set $B_k$ with $\Pr(B_k) \ge 1 - \tilde\delta$ such that

$$P(v_k) - P_* \le R_{k-1}\,a_{n_0,\tilde\delta} = \mu_k\,2^{-k}R_0\cdot\frac{R_{k-1}}{2} = \mu_kR_k^2,$$

which is (18). By the inductive hypothesis, $\|v_{k-1} - v_{k-1}^*\| \le R_{k-1}$ on the set $A_{k-1}$. Define $A_k = A_{k-1}\cap B_k$. Note that $\Pr(A_k) \ge \Pr(A_{k-1}) + \Pr(B_k) - 1 \ge 1 - k\tilde\delta$, and on $A_k$, by the quadratic growth condition and the definition of $k^\dagger$ (so that $c\,\mu_k \le 1$ for $k \le k^\dagger$), we have

$$\|v_k - v_k^*\|^2 \le c\,\big(P(v_k) - P_*\big) \le c\,\mu_kR_k^2 \le R_k^2,$$

which is (17) for $k+1$. Now we prove (19). For $k > k^\dagger$, it is easy to show a conclusion similar to Lemma 1. (Remark: at the $k$-th stage with $k > k^\dagger$, one can use the same proof as for Lemma 1 by substituting $v_*$ with $v_{k^\dagger}$ throughout; the first term $I_1$ on the right-hand side then becomes zero, and hence one can get a tighter bound on $P(v_k) - P(v_{k-1})$, but we here relax the bound to $R_{k-1}\,a_{n_0,\tilde\delta}$.) That is, there exists a Borel set $B_k$ with $\Pr(B_k) \ge 1 - \tilde\delta$ such that

$$P(v_k) - P(v_{k-1}) \le R_{k-1}\,a_{n_0,\tilde\delta} = 2^{k^\dagger-k+1}R_{k^\dagger}\,a_{n_0,\tilde\delta} = 2^{k^\dagger-k}\mu_{k^\dagger}R_{k^\dagger}^2,$$

which implies that on $C_k = \bigcap_{j=k^\dagger+1}^{k}B_j$ we have

$$P(v_k) - P(v_{k^\dagger}) = \sum_{j=k^\dagger+1}^{k}\big(P(v_j) - P(v_{j-1})\big) \le \sum_{j=k^\dagger+1}^{k}2^{k^\dagger-j}\mu_{k^\dagger}R_{k^\dagger}^2 \le \mu_{k^\dagger}R_{k^\dagger}^2.$$

By the union bound, we have $\Pr(C_k) = \Pr\big(\bigcap_{j=k^\dagger+1}^{k}B_j\big) \ge 1 - (k-k^\dagger)\tilde\delta$. This completes the proof.

Now we proceed with the proof of the theorem. Note that $\mu_0 \le \frac{1}{c} \le \mu_m$. At the end of the $k^\dagger$-th stage, on the Borel set $A_{k^\dagger}$ of probability at least $1 - k^\dagger\tilde\delta$, we have $P(v_{k^\dagger}) - P_* \le \mu_{k^\dagger}R_{k^\dagger}^2$. Then on the Borel set $D_m = C_m\cap A_{k^\dagger} = \big(\bigcap_{j=k^\dagger+1}^{m}B_j\big)\cap A_{k^\dagger}$ with $\Pr(D_m) \ge 1 - m\tilde\delta$, we have

$$P(v_m) - P_* = \big(P(v_m) - P(v_{k^\dagger})\big) + \big(P(v_{k^\dagger}) - P_*\big) \le 2\mu_{k^\dagger}R_{k^\dagger}^2 \le 4c\,\mu_{k^\dagger}^2R_{k^\dagger}^2 = 4c\,(\mu_0R_0)^2 = 4c\,\big(2a_{n_0,\tilde\delta}\big)^2,$$

where the second inequality uses $1 \le 2c\mu_{k^\dagger}$ (which follows from $\frac{1}{c} \le \mu_{k^\dagger+1} = 2\mu_{k^\dagger}$) and the last two equalities use $\mu_kR_k = \mu_0R_0 = 2a_{n_0,\tilde\delta}$. By the definition of $m$ and $\tilde\delta$, and the fact that $m \le \log_2 n$, we have $m\tilde\delta \le \delta$. So $\Pr(D_m) \ge 1 - \delta$.
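The restarting mechanism analyzed in Lemma 5 is easy to see in a simulation. The toy run below is entirely illustrative (it uses averaged projected SGD on $f(x) = x^2$ rather than the paper's Algorithm 1): each stage starts from the previous stage's averaged iterate and searches inside a ball whose radius is halved between stages, so the optimality gap drops geometrically per stage until the statistical floor is reached.

```python
# Toy demonstration of the restarted, radius-halving scheme behind Lemma 5.
import numpy as np

rng = np.random.default_rng(2)
R0, n0, m = 4.0, 2_000, 6
center, R = 3.0, R0                       # start R0-close to the optimum x* = 0
for k in range(1, m + 1):
    x, avg = center, 0.0
    eta = R / np.sqrt(n0)                 # stage-wise constant step size
    for _ in range(n0):
        g = 2.0 * x + rng.normal()        # stochastic gradient of f(x) = x^2
        x = np.clip(x - eta * g, center - R, center + R)  # project onto the ball
        avg += x / n0
    center, R = avg, R / 2.0              # restart around the average, halve radius
    print(f"stage {k}: gap P(v_k) - P_* = {center**2:.2e}")
```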

Case 2. If $\frac{1}{c} < \mu_0$, then on $A_1 = B_1$,

$$P(v_1) - P_* \le R_0\,a_{n_0,\tilde\delta} = \frac{2a_{n_0,\tilde\delta}^2}{\mu_0} < 2c\,a_{n_0,\tilde\delta}^2.$$

Hence on $A_1\cap C_m$, by using Lemma 5 and a similar argument as in Case 1, we have

$$P(v_m) - P_* = \big(P(v_m) - P(v_1)\big) + \big(P(v_1) - P_*\big) \le 2R_0\,a_{n_0,\tilde\delta} < 4c\,a_{n_0,\tilde\delta}^2,$$

where $\Pr(A_1\cap C_m) \ge 1 - \delta$.

Combining the two cases, we have with probability at least $1 - \delta$,

$$P(v_m) - P_* \le 4c\,\big(2a_{n_0,\tilde\delta}\big)^2 = \frac{16c\,G^2\big(3\gamma_1 + \gamma_2\sqrt{2\ln(60\log_2 n/\delta)}\big)^2}{n_0} = \tilde O\Big(\frac{1}{n_0}\Big) = \tilde O\Big(\frac{1}{n}\Big),$$

since $n_0 = \lfloor n/m\rfloor \ge \frac{n}{\log_2 n}$.

References

Li, Guoyin. Global error bounds for piecewise convex polynomials. Mathematical Programming, 2013.

McDiarmid, Colin. On the method of bounded differences. Surveys in Combinatorics, 141:148-188, 1989.

Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Ying, Yiming, Wen, Longyin, and Lyu, Siwei. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, pp. 451-459, 2016.