Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Size: px

Start display at page:

Download "Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate"

Ella Willis
5 years ago
Views:

1 Fast Stochastic Maximization with O1/n)-Convergence Rate Mingrui Liu 1 Xiaoxuan Zhang 1 Zaiyi Chen Xiaoyu Wang 3 ianbao Yang 1 Abstract In this paper we consider statistical learning with area under ROC curve) maximization in the classical stochastic setting where one random data drawn from an unknown distribution is revealed at each for updating the model Although consistent convex surrogate losses for maximization have been proposed to make the problem tractable it remains an challenging problem to design fast optimization algorithms in the classical stochastic setting due to that the convex surrogate loss depends on random pairs of examples from positive and negative classes Building on a saddle point formulation for a consistent square loss this paper proposes a novel stochastic algorithm to improve the standard O1/ n) convergence rate to Õ1/n) convergence rate without strong convexity assumption or any favorable statistical assumptions eg low noise) where n is the number of random samples o the best of our knowledge this is the first stochastic algorithm for maximization with a statistical convergence rate as fast as O1/n) up to a logarithmic factor Extensive experiments on eight large-scale benchmark data sets demonstrate the superior performance of the proposed algorithm comparing with existing stochastic or online algorithms for maximization 1 Introduction Area under ROC curve ) Metz 1978; Hanley & Mc- Neil 198; 1983) is a commonly used metric for evaluating the performance of a classifier ROC receiver operating characteristic) curve is the "true positive rate - false positive rate" curve and represents the probability that examples in positive class are scored 1 Department of Computer Science he University of Iowa IA 54 USA University of Science and echnology of China 3 Intellifusion Correspondence to: Mingrui Liu <mingruiliu@uiowaedu> ianbao Yang <tianbao-yang@uiowaedu> Proceedings of the 35 th International Conference on Machine Learning Stockholm Sweden PMLR Copyright 018 by the authors) higher than those in negative class Hanley & McNeil 198) Compared with misclassification rate is more favorable in the applications with imbalanced datasets and in which the real-valued classifier is used for ranking However the algorithms designed to minimize the misclassification error rate may not lead to maximization of Cortes & Mohri 004) In a batch learning setting where all training data is given beforehand one may formulate maximization as a convex empirical optimization problem based on the training data using a convex surrogate loss he resulting optimization problem can be solved by employing existing algorithms Herschtal & Raskutti 004; Cortes & Mohri 004; Ferri et al 011) However such an approach has two limitations: i) it is not applicable to many applications in which a random data is received in a sequential manner eg online advertisement); ii) little is known about the statistical convergence rate of the empirical maximizer to the true maximizer due to the iid assumption of pairwise data does not hold herefore a stochastic algorithm with a provable convergence rate in the classical stochastic setting for minimizing an expected convex surrogate loss for maximization is desirable for addressing these two limitations It is also useful for tackling large-scale data by one pass of training data Nevertheless the pairwise nature in the definition of makes it challenging to design algorithms suitable for the classical stochastic setting o address this challenge several online and stochastic algorithms have been developed based on convex surrogate loss Zhao et al 011; Gao et al 013; Ying et al 016) An interesting observation in Ying et al 016) is that the maximization using a consistent square loss is equivalent to a stochastic minmax saddle point problem for which a primal-dual style of stochastic gradient algorithm can be employed to yield an Õ1/ n) convergence rate for minimizing an expected square loss where n is the number of samples How to improve the convergence rate of stochastic optimization of remains an open problem though Fast rate such as O1/n) has been established for stochastic algorithms for expected convex risk minimization in literature Hazan & Kale 011a; Srebro et al 010; Bach & Moulines 013) However these studies either impose

2 Fast Stochastic Maximization strong assumptions about the problem eg strong convexity assumption low-noise assumption etc) and/or their analysis is limited to certain settings that is not applicable to maximization herefore a stochastic algorithm with a provable fast rate as O1/n) without imposing strong assumptions should be considered as a significant contribution for maximization he proposed algorithm referred to as is a Fast Stochastic algorithm for true maximization It is based on the min-max saddle point formulation as observed in Ying et al 016) However different from Ying et al 016) we develop a novel multi-stage scheme for running primal-dual stochastic gradient method with adaptively changing parameters and initial solutions for both primal and dual variables he adaptive multi-stage scheme not only leverages the primal solution from previous stages but also utilizes the empirical estimation of feature vectors of examples from both positive and negative classes for initializing the dual variables he convergence analysis of the proposed algorithm hinges on a quadratic growth property of a reformulated objective and a novel synthesis of the adaptive scheme and the convergence result of primal-dual stochastic gradient method o summarize the major contributions of this work are: We propose a novel fast stochastic algorithm that can be run in the classical stochastic setting for maximizing with a known number of total samples n We establish an Õ1/n) 1 convergence rate for the proposed algorithm where n is the total number of samples his is the first Õ1/n) convergence result for stochastic optimization We evaluate the proposed algorithm on eight largescale benchmark datasets he results show that our algorithm significantly outperforms two state-of-theart stochastic/online methods namely SOLAM algorithm Ying et al 016) and algorithm Gao et al 013) Related Work Zhao et al 011) is probably the first work on online maximization of In order not to store all received positive and negative examples they proposed to use reservoir sampling to maintain a buffer of positive and negative examples At each they run gradient update based on an online empirical version of defined on the buffered data to update the models A regret bound was established with the cost function at each t defined based on the example received at the t-th and all data in the buffer Even with the optimal buffer size their regret bound is worse than n Gao et al 013) proposed another on- 1 he Õ ) notation hides logarithmic factors line algorithm without resorting to the reservoir sampling heir algorithm hinges on the observation that if a consistent square loss is used the gradient of an online empirical version of can be computed based on first and second order statistics of received data Hence their algorithm needs to maintain a covariance matrix of received examples which render it not practical for high-dimensional data Although a randomized version is proposed for maintaining low rank version of the covariance matrix it still has considerable computational overhead In terms of guarantee they established a regret bound in the order of for general case and a possible O1) regret bound under the low-noise condition However these regret bounds do not directly imply a convergence rate for statistical learning for maximization he reason is that the data that define each cost function are not independent he stochastic algorithm proposed in a recent work Ying et al 016) is the first with a convergence guarantee for stochastic optimization and also the first one without storing any historical examples or their covariance matrix As aforementioned their algorithm is based on a novel min-max saddle point formulation of maximization and has a convergence rate of Õ1/ n) Fast rate of stochastic optimization such as O1/n) has been studied for standard classification and regression problems under some conditions For example Hazan & Kale 011a) proposed a method with an O1/n) convergence rate under a weak) strong convexity assumption heir algorithm requires knowing the strong convexity parameter For statistical learning problems Srebro et al 010) established an O1/n) convergence rate for smooth loss functions under low-noise assumption eg the optimal risk value is close to zero) However their algorithm requires knowing a good estimation of the optimal risk value Bach & Moulines 013) established the first O1/n) convergence rate of stochastic algorithms for minimizing expected square loss for regression and expected logistic loss for classification without the strong convexity assumption However all these algorithms and analysis are not applicable to the maximization in the classical stochastic setting Finally we note that the multi-stage scheme of the proposed algorithm is similar to that developed in Hazan & Kale 011b; Ghadimi & Lan 013; Juditsky et al 014; Xu et al 017) for minimizing strongly or uniformly convex functions or problems with error bound conditions However the differences between the proposed work and these works are that i) our development is based on primal dual stochastic method for solving a stochastic saddle point problem of maximization with particular treatment of dual variables; while their algorithms are for solving stochastic convex minimization problems; ii) we do not require strong convexity assumption with known parameters

3 Fast Stochastic Maximization as in Hazan & Kale 011b; Ghadimi & Lan 013); iii) we do not assume uniform strong convexity assumption as in Juditsky et al 014) Instead we prove a quadratic growth condition for the studied problem that is a special case of error bound condition studied in Xu et al 017) However the stochastic algorithms proposed in Xu et al 017) require a target accuracy level and cannot be applied to one pass learning setting for large-scale data 3 Algorithm and Main Result 31 Preliminaries and Notations Let z = x y) P denote a random data following an unknown distribution P where x X R d represents the feature vector and y {1 1} represents the label Denote by Z = X {1 1} and by p = Pry = 1) = E y [I [y=1] ] where I is an indicator function We assume that sup x X x κ Let x R d+ denote an augmented feature vector with the last two components being 0 and let Bx 0 r) = {x : x x 0 r} be an l -ball centered at x 0 with a radius r Given a score function h : R d R the at the population level referred to as the true in this paper) is defined as: h) = Prhx) hx ) y = 1 y = 1) where z = x y) and z = x y) are a pair of random data he maximization problem is to find an optimal score function in a hypothesis class such that is maximized Since the problem is non-convex it is usually solved by using consistent convex surrogate loss A common choice used by previous studies Ying et al 016; Gao et al 013) is the square loss In this paper we consider learning a linear function hx) = w x to maximize the using a square loss ie min Lw) E zz [1 w x x )) y = 1 y = 1] w R d 1) Since the loss function depends on a random pair of data making it difficult to handle in the classical stochastic setting a solution is to cast the above problem into an equivalent saddle point problem Ying et al 016): min max {fw a b α) := E z[f w a b α; z)]} w R d α R ab) R where F w a b α; z) = 1 p)w x a) I [y=1] + pw x b) I [y= 1] p1 p)α α)pw xi [y= 1] 1 p)w xi [y=1] ) Assuming the optimal solution w sits in a bounded domain such that w 1 R we can restrict the primal and dual variables to constrained domains Ω 1 = {w a b) : w 1 R a Rκ b Rκ} Ω = {α R : α Rκ} ie min max{fw a b α) := E z [F w a b α; z)]} wab) Ω 1 α Ω ) Please note that adding the l 1 -ball constraint on w does not restrict the performance of the learned model due to that i) if R is large enough such that the optimal solution w to 1) satisfies w 1 R then w is also the optimal solution to ) ; ii) for a finite number of samples the constraint can serve as regularization for improving the generalization performance We use l 1 -ball constraint because it allows us to develop a fast stochastic optimization algorithm with an convergence rate of Õ1/n) Next we will present the proposed algorithm and its convergence results We will also provide a convergence analysis on a core component of the proposed algorithm 3 Algorithm and its Convergence he proposed algorithm is presented in Algorithm 1 which is referred to as he parameteres R 0 G D 0 β 0 will be explained shortly he algorithm divides the updates into m stages and each stage has n/m updates Each stage of runs a primal dual style stochastic gradient PDSG) method outlined in Algorithm which is similar to that in Ying et al 016) except for three differences: i) the step size is given as a constant for each call of PDSG; ii) each update of the primal variable v = w a b) and the dual variable α is projected into an intersection of their constrained domain and an l ball centered at the initial solution v 1 α 1 ; iii) upon receiving a random data z t the variables Â± ± p are updated according to Algorithm 3 where ± represent the number of positive/negative examples received so far; Â ± represents the cumulative augmented) feature vector for the positive class and negative class; and p represents the estimated positive ratio based on the received examples he parameters for each call of PDSG is adaptively changing including the initial solutions v 1 α 1 the radius of l balls for the primal variables and the dual variables and the step size η he initial parameter R 0 is a bound of Euclidean distance from v 0 to the optimal set of ) in terms of v he initial parameter D 0 is for setting a constrained domain of the dual variable he initial parameter β 0 is for setting the initial step size he parameter G is an upper bound of Euclidean norms of Gu z) Ĝtu z t ) gu) for any u Ω 1 Ω and z z t Z which are defined in 3) he function ˆF t at the line 5 and line 6 of the Algorithm given his can be shown that if w 1 1 then the optimal solutions to a b α satisfy the prescribed bounds in light of their closed-form solutions depending on w

4 Fast Stochastic Maximization Algorithm 1 1: Set m = 1 log n log n 1 n 0 = n/m R 0 = 1 + κ R G > max1 + 4κ)κR + 1) κr Rκ) κ4κr + 11R + 1)) β 0 = 1 + 8κ ) D 0 = κr 0 : Initialize v 0 = 0 R d+ α 0 = 0 3: for k = 1 m do βk 1 4: Set η k = R 3n0G k 1 5: Call PDSG to obtain v k α k β k R k D k ) = PDSG v k 1 α k 1 R k 1 D k 1 n 0 η k ) 6: end for 7: return v m by F t v α z t ) = 1 p)w x t a) I [y] + pw x t b) I [yt= 1] p1 p)α α) pw x t I [yt= 1] 1 p)w xi [y]) is an estimation of F v α z t ) using the current estimation p of p It is worth mentioning that the line 5 and line 6 of the Algorithm need to execute a projection onto the intersection of two convex sets We can use the alternating projection algorithm to generate a sequence which can linearly converges to the desired point Actually it is easy to show that the two sets in our case are boundedly linearly regular Definition 311 of Bauschke & Borwein 1993)) and the linear convergence is guaranteed by the Corollary 314 of Bauschke & Borwein 1993) Finally the convergence rate of is presented in the following theorem heorem 1 Given 0 1) assume n is sufficiently 3 ln 1 ) large such that n > max100 m ) hen with minp1 p)) probability at least 1 ln 1 max f v m α) min max fv α) Õ ) ) α Ω v Ω 1 α Ω n where Õ ) suppresses logarithmic factor of logn) and some constants of the problem independent of n Remark: In practice we can maintain global variables Â + Â p and use them for updating parameters and initializing dual variables he local versions used in Algorithm are just for simplicity of analysis he above theorem implies the convergence of maximization in terms of the square loss Corollary 1 Under the same condition as in heorem 1 Algorithm PDSGv 1 α 1 r D η) 1: Initialize variables Â + R d+ Â R d+ + p R as zeros : for t = 1 do 3: Receive a sample z t = x t y t ) 4: Update Â± ± p using the data z t 5: v t+1 = Π Ω1 Bv 1r)v t η v Ft v t α t z t )) 6: α t+1 = Π Ω Bα 1D)α t + η α Ft v t α t z t )) 7: end for vt 8: Compute v = 9: Let r = r/ 10: Update β D according to Lemma 1 11: return v α β r D) and α = Â Â+ + ) v Algorithm 3 Update Â± ± p given a data x t y t ) 1: Â + = Â+ + I [y] x t : Â = Â + I [yt= 1] x t 3: p = p + I [y] 4: + = + + I [y] 5: = + I [yt= 1] 6: p = + / + + ) with probability at least 1 ln 1 Lŵ m ) min Lw) Õ ) ) w 1 R n 4 Convergence Analysis his section is devoted to convergence analysis Due to limitation of space we present detailed analysis for a key lemma More details about the proof of main heorem can be found in the supplement Before analysis we first give some notations that will be frequently used in this section v = w a b) R d+ u = v α) R d+3 Gu z) = v F v α; z) α F v α; z)) Ĝ t u z t ) = v Ft v α; z t ) α Ft v α; z t )) gu) = v fv α) α fv α)) Lemma 1 For each call of Algorithm we update D β according to D = 4 κ + κr + ln 1 min p 1 p) ) ) 1 + κ)r ln 1 )) 3)

5 Fast Stochastic Maximization and β = 1 + 8κ ) + 3κ 1 + κ) + min p 1 p) ln 1 ) ) ln 1 )/ ) where p is an estimation of p Suppose v 1 v r where v Ω 1 is the optimal ) solution closest to v 1 and R > max 3 ln r 1 ) then minp1 p)) max f v α) min max fv α) α Ω v Ω 1 α Ω ) 3γ 1 + ln 6 )γ rg ) where γ 1 = 1 + 8κ ) + 3κ 1+κ) + ln 1 ) minp1 p)/ γ = κ + 8κ + ln 1 )1+κ) ) ) minp1 p)/ Remark: When n is sufficiently large as stated in the main theorem then log/)/ < min p 1 p) by Hoeffding s inequality due to that p is estimated using n 0 examples Based on the above lemma Algorithm 1 uses a geometrically decreasing radius r in order to achieve a faster convergence By further utilizing the following property we can prove the main theorem Lemma f 1 v) = max α Ω fv α) restricted on the set Ω 1 satisfies a quadratic growth condition ie for any v Ω 1 there exists c > 0 such that v v cf 1 v) min v Ω 1 f 1 v)) 1/ where v is the optimal solution to min v Ω1 f 1 v) that is closest to v Remark: It is notable that the proposed algorithm does not require the value of c that is difficult to compute 41 Proof of Lemma 1 o prove this lemma we need the following lemma whose proof is included in the supplement Lemma 3 Let Â = Â+/ + Â / and and A = Ex y = 1) Ex y = 1)) 0 0) After the k-th call of Algorithm with k 1 with probability 1 3 we have Â A 4κ where ξ min p 1 p) ) + ln 1 ) ξ ln 1 ) 4) Proof of Lemma 1 By the setting of G we have ) max gu t ) Gu t z t ) Ĝtu t z t ) G tu tz t Define α = arg max α f v α) = w [Ex y = 1) Ex y = 1)] Following standard analysis of primal-dual update eg see inequality 17 in Ying et al 016)) we have max f v α) min max fv α) α Ω v Ω 1 α Ω f v α ) fv ᾱ ) v 1 v + α 1 α + ηg η η + u t u ) gu t ) Gu t z t )) + u t u ) Gu t z t ) Ĝtu t z t )) = I + II + III + IV + V where v is the optimal solution closest to v 1 of ) and the first inequality follows from the fact that min v Ω1 max α Ω fv α) fv ᾱ ) Now we bound the five terms respectively It is obvious that I = v1 v η r 0 η Let Âk 1 p k 1 ξ k 1 be counterparts of that in Lemma 3 estimated from data in k 1)-stage According to the setting of α in Algorithm 1 for the first call of Algorithm we have α 1 = Av 1 due to v 0 = 0 and for k-th call of Algorithm with k > 1 we have α 1 = Âk 1v 1 hen for the first call of Algorithm we have II = Av1 A v η and for other calls we have II = Âk 1v 1 A v η Âk 1v 1 v ) + Âk 1 A) v η By Cauchy-Schwartz inequality we have max{ Av 1 v ) Âk 1v 1 v ) } 4κ v 1 v Âk 1 A) v Âk 1 A v By combining the result in Lemma 3 we have with probability 1 /3 ) II 8κ r 3κ + ln 1 η + I ) 1 + κ) R k>1 ξ k 1 η

6 Fast Stochastic Maximization Similarly we can bound α 1 α by κr for the first call and 4 ) κ + ln 1 ) 1 + κ)r α 1 α min p k 1 1 p k 1 )n 0 n 0 ln 1 )) + κr for k-th call with k > 1 According to initial value of D and r for each call we have α 1 α D Next we can bound the last two terms similarly to Ying et al 016) Define ũ 1 = u 1 Ω 1 Bv 1 r)) Ω Bα 1 D)) ũ t+1 = Π Ω1 Bv 1r)) ũ t ηgu t ) Gu t z t ))) Ω Bα 1D)) then we have with probability at least 1 3 ηũ t u ) gu t ) Gu t z t )) ũ 1 u + 1 η gu t ) Gu t z t ) v 1 v + α 1 α r + D + η G + 1 η 4G Note that both u t and ũ t are measurable with respect to F t = {z 1 z t 1 } and {S t : ηu t ũ t ) gu t ) Gu t )) : t = 1 } is a martingale difference sequence and for any t we have ηu t ũ t ) gu t ) Gu t z t )) ηg u t ũ t ηgr + D) hen by Azuma- Hoeffding s inequality we have with probability at least 1 3 ηu t ũ t ) gu t ) Gu t z t )) ηgr + D) Hence with probability 1 3 we have ln 3 ) 5) IV ) 1 + 8κ )r 16κ + ln 1 ) 1 + κ) R + η ηξ k 1 4Gr + D) ln 3 + ηg + ) Next we bound V According to Lemma 3 of Ying et al 016) V sup u t u 1 + u 1 u ) t ) sup Ĝtu z) Gu z) / u Ωz Z 4r + D) κ4κr + 11R + 1) 8r + D)G ln 6 ) ln 6 ) t where the last inequality holds since 1 t When R r by union bound with probability at least 1 we have max f v α) min max fv α) α Ω v Ω 1 α Ω ζ 1r η + 3ηG + ζ rg ln 6 ) where ζ 1 = 1 + 8κ 3κ 1+κ) ) + I [k>1] ξ k 1 ζ = κ + ln 1 )1+κ) κ + I ) [k>1] ) ξk 1 ξ k 1 = min ˆp k 1 1 ˆp k 1 ) ln 1 ) + ln 1 ) ) Moreover by choosing η = ζ 1r 3G we have with probability at least 1 we have max f v α) min max fv α) α Ω v Ω 1 α Ω ) 3ζ 1 + ln 6 )ζ r 0 G By Hoeffding inequality ln 1 p p k 1 ) ln 1 1 p) 1 p k 1 ) ) 3 ln 1 ) When we can bound ζ minp1 p)) 1 γ 1 and ζ 1 3 ln γ due to ) then the conclusion follows minp1 p)) 5 Experiments In this section we conduct experiments to compare our algorithm with two state-of-the-art stochastic/online optimization methods namely SOLAM

7 Fast Stochastic Maximization 0 a9a dataset 3 ijcnn1 dataset 086 covtype_binary dataset 0684 HIGGS dataset a) a9a b) ijcnn c) covtype d) HIGGS 7 w8a dataset 95 real-sim dataset 96 rcv1_binary dataset 8 news0binary dataset e) w8a f) real-sim g) rcv1 binary Figure 1 -Iteration curves of and the baselines h) news0binary Ying et al 016) and Gao et al 013) o make the comparison fair we implement two variations of SO- LAM using two different norm constraints on w: SOLAM with l norm constraint the original one) referred to as SO- LAM and SOLAM with l 1 norm constraint referred to as SOLAM-l1 We use eight large-scale benchmark datasets from libsvm website 3 ranging from high-dimensional to lowdimensional from balanced class distribution to imbalanced class distribution he statistics of these datasets are summarized in able 1 We randomly divide each dataset into three sets respectively training validation and testing For a9a and w8a datasets we randomly split the given testing set into half validation and half testing For the datasets that do not explicitly provide a testing set we randomly split the entire data into 4:1:1 for training validation and testing For the dataset ijcnn1 and rcv1 binary despite that the test set is given the size of the training set is relatively small hus we first combine the training and the test sets and then follow the above procedure to split it he involved parameters of each algorithm are tuned based on the validation data has two parameters R and G R is decided within the range 10 [ 1:1:5] G only affects the initial stepsize of each epoch ie η 1 = 3n0G β 0 R 0 and η k+1 = β k β k 1 η k Algorithm 1 line 4-5) so we equivalently tune η 1 [ 10:1:10] As for SOLAM following 3 cjlin/libsvmtools/datasets/ able 1 Statistics of Datasets Datasets #raining #esting feat % of Pos a9a ijcnn covtype binary HIGGS w8a real-sim rcv1 binary news0binary the same strategy in the original paper Ying et al 016) we tune R in 10 [ 1:1:5] and the learning rate in [ 10:1:10] has two versions corresponding to different ways of maintaining the covariance matrix the full version and an approximated version designed to deal with high dimensional data On the five low-dimensional datasets a9a ijcnn1 covtype binary HIGGS w8a) we use the full version of since the dimension is low enough and thus it is unnecessary to use the low rank approximation method which is less accurate and more complicated For the three high-dimensional datasets real-sim rcv1 binary news0binary) we set the low rank τ = 50 the same as the original paper Gao et al 013) here are two parameters shared by both versions of step size η and regularized parameter λ Following the suggestion of the original paper Gao et al 013) we tune η [ 1:1: 4] and λ [ 10:1:0] Each algorithm updates the model by one pass of train-

8 Fast Stochastic Maximization 0 a9a dataset 3 ijcnn1 dataset 086 covtype_binary dataset 0684 HIGGS dataset a) a9a b) ijcnn c) covtype d) HIGGS 7 w8a dataset 95 real-sim dataset 96 rcv1_binary dataset 8 news0binary dataset e) w8a f) real-sim g) rcv1 binary Figure -ime curves of and the baselines h) news0binary able Averaged final on testing data Datasets SOLAM SOLAM l1 a9a ± ± ± ± ijcnn ± ± ± ± 0096 covtype binary ± ± ± ± HIGGS ± ± ± ± w8a ± ± ± ± 0071 real-sim ± ± ± ± rcv1 binary ± ± ± ± news0binary ± ± ± ± ing data and the models at different s are evaluated by computed on the testing data to demonstrate the testing) convergence speed of different algorithms We report the results on testing sets averaged over 5 random runs over the training data he convergence curves of each algorithm are plotted in Figure 1 and Figure in terms of both the number of s and training time he final with the standard deviation over 5 runs is summarized in able From Figure 1 we can see that the proposed converges faster than the two state-of-the-art methods which is consistent with our theory For the convergence in terms of running time shown in Figure 1 the proposed also has the best performance 6 Conclusion In this paper we have proposed a novel stochastic algorithm for maximization in the classical stochastic setting where one random data is received at each We theoretically analyze the proposed algorithm and derive a fast convergence rate of Õ1/n) largely improved from the best result that the current state-of-the-art method can achieve - Õ1/ n) We have also verified the efficiency of our algorithm by experiments on multiple benchmark datasets and the results show that our algorithm converges faster than two strong baseline algorithms in terms of both the number of s and running time For future work we will consider fast stochastic algorithms for maximization based on different surrogate losses One may also consider extending the proposed algorithm to learning to rank from pairwise data in which the label of each data have more than two possible values in the binary classification Acknowledgments We thank the anonymous reviewers for their helpful comments M Liu X Zhang Yang are partially supported by National Science Foundation IIS )

9 Fast Stochastic Maximization References Bach F R and Moulines E Non-strongly-convex smooth stochastic approximation with convergence rate O1/n) In Advances in Neural Information Processing Systems 6 NIPS) pp Bauschke H H and Borwein J M On the convergence of von neumann s alternating projection algorithm for two sets Set-Valued Analysis 1): Cortes C and Mohri M Auc optimization vs error rate minimization In Advances in neural information processing systems pp Ferri C Hernández-Orallo J and Flach P A A coherent interpretation of auc as a measure of aggregated classification performance In Proceedings of the 8th International Conference on Machine Learning ICML-11) pp Gao W Jin R Zhu S and Zhou Z-H One-pass auc optimization In International Conference on Machine Learning pp Metz C E Basic principles of roc analysis In Seminars in nuclear medicine volume 8 pp Elsevier 1978 Srebro N Sridharan K and ewari A Smoothness low noise and fast rates In Advances in Neural Information Processing Systems NIPS) pp Xu Y Lin Q and Yang Stochastic convex optimization: Faster local growth implies faster global convergence In Proceedings of the 34th International Conference on Machine Learning ICML) pp Ying Y Wen L and Lyu S Stochastic online auc maximization In Advances in Neural Information Processing Systems pp Zhao P Jin R Yang and Hoi S C Online auc maximization In Proceedings of the 8th international conference on machine learning ICML-11) pp Ghadimi S and Lan G Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization ii: Shrinking procedures and optimal algorithms SIAM Journal on Optimization 34): Hanley J A and McNeil B J he meaning and use of the area under a receiver operating characteristic roc) curve Radiology 1431): Hanley J A and McNeil B J A method of comparing the areas under receiver operating characteristic curves derived from the same cases Radiology 1483): Hazan E and Kale S Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization Journal of Machine Learning Research - Proceedings rack 19: a Hazan E and Kale S Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization In Proceedings of the 4th Annual Conference on Learning heory COL) pp b Herschtal A and Raskutti B Optimising area under the roc curve using gradient descent In Proceedings of the twenty-first international conference on Machine learning pp 49 ACM 004 Juditsky A Nesterov Y et al Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization Stochastic Systems 41):

Stochastic Online AUC Maximization

Stochastic Online AUC Maximization Yiming Ying, Longyin Wen, Siwei Lyu Department of Mathematics and Statistics SUNY at Albany, Albany, NY,, USA Department of Computer Science SUNY at Albany, Albany, NY,,