Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Size: px
Start display at page:

Download "Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate"

Transcription

1 Fast Stochastic Maximization with O1/n)-Convergence Rate Mingrui Liu 1 Xiaoxuan Zhang 1 Zaiyi Chen Xiaoyu Wang 3 ianbao Yang 1 Abstract In this paper we consider statistical learning with area under ROC curve) maximization in the classical stochastic setting where one random data drawn from an unknown distribution is revealed at each for updating the model Although consistent convex surrogate losses for maximization have been proposed to make the problem tractable it remains an challenging problem to design fast optimization algorithms in the classical stochastic setting due to that the convex surrogate loss depends on random pairs of examples from positive and negative classes Building on a saddle point formulation for a consistent square loss this paper proposes a novel stochastic algorithm to improve the standard O1/ n) convergence rate to Õ1/n) convergence rate without strong convexity assumption or any favorable statistical assumptions eg low noise) where n is the number of random samples o the best of our knowledge this is the first stochastic algorithm for maximization with a statistical convergence rate as fast as O1/n) up to a logarithmic factor Extensive experiments on eight large-scale benchmark data sets demonstrate the superior performance of the proposed algorithm comparing with existing stochastic or online algorithms for maximization 1 Introduction Area under ROC curve ) Metz 1978; Hanley & Mc- Neil 198; 1983) is a commonly used metric for evaluating the performance of a classifier ROC receiver operating characteristic) curve is the "true positive rate - false positive rate" curve and represents the probability that examples in positive class are scored 1 Department of Computer Science he University of Iowa IA 54 USA University of Science and echnology of China 3 Intellifusion Correspondence to: Mingrui Liu <mingruiliu@uiowaedu> ianbao Yang <tianbao-yang@uiowaedu> Proceedings of the 35 th International Conference on Machine Learning Stockholm Sweden PMLR Copyright 018 by the authors) higher than those in negative class Hanley & McNeil 198) Compared with misclassification rate is more favorable in the applications with imbalanced datasets and in which the real-valued classifier is used for ranking However the algorithms designed to minimize the misclassification error rate may not lead to maximization of Cortes & Mohri 004) In a batch learning setting where all training data is given beforehand one may formulate maximization as a convex empirical optimization problem based on the training data using a convex surrogate loss he resulting optimization problem can be solved by employing existing algorithms Herschtal & Raskutti 004; Cortes & Mohri 004; Ferri et al 011) However such an approach has two limitations: i) it is not applicable to many applications in which a random data is received in a sequential manner eg online advertisement); ii) little is known about the statistical convergence rate of the empirical maximizer to the true maximizer due to the iid assumption of pairwise data does not hold herefore a stochastic algorithm with a provable convergence rate in the classical stochastic setting for minimizing an expected convex surrogate loss for maximization is desirable for addressing these two limitations It is also useful for tackling large-scale data by one pass of training data Nevertheless the pairwise nature in the definition of makes it challenging to design algorithms suitable for the classical stochastic setting o address this challenge several online and stochastic algorithms have been developed based on convex surrogate loss Zhao et al 011; Gao et al 013; Ying et al 016) An interesting observation in Ying et al 016) is that the maximization using a consistent square loss is equivalent to a stochastic minmax saddle point problem for which a primal-dual style of stochastic gradient algorithm can be employed to yield an Õ1/ n) convergence rate for minimizing an expected square loss where n is the number of samples How to improve the convergence rate of stochastic optimization of remains an open problem though Fast rate such as O1/n) has been established for stochastic algorithms for expected convex risk minimization in literature Hazan & Kale 011a; Srebro et al 010; Bach & Moulines 013) However these studies either impose

2 Fast Stochastic Maximization strong assumptions about the problem eg strong convexity assumption low-noise assumption etc) and/or their analysis is limited to certain settings that is not applicable to maximization herefore a stochastic algorithm with a provable fast rate as O1/n) without imposing strong assumptions should be considered as a significant contribution for maximization he proposed algorithm referred to as is a Fast Stochastic algorithm for true maximization It is based on the min-max saddle point formulation as observed in Ying et al 016) However different from Ying et al 016) we develop a novel multi-stage scheme for running primal-dual stochastic gradient method with adaptively changing parameters and initial solutions for both primal and dual variables he adaptive multi-stage scheme not only leverages the primal solution from previous stages but also utilizes the empirical estimation of feature vectors of examples from both positive and negative classes for initializing the dual variables he convergence analysis of the proposed algorithm hinges on a quadratic growth property of a reformulated objective and a novel synthesis of the adaptive scheme and the convergence result of primal-dual stochastic gradient method o summarize the major contributions of this work are: We propose a novel fast stochastic algorithm that can be run in the classical stochastic setting for maximizing with a known number of total samples n We establish an Õ1/n) 1 convergence rate for the proposed algorithm where n is the total number of samples his is the first Õ1/n) convergence result for stochastic optimization We evaluate the proposed algorithm on eight largescale benchmark datasets he results show that our algorithm significantly outperforms two state-of-theart stochastic/online methods namely SOLAM algorithm Ying et al 016) and algorithm Gao et al 013) Related Work Zhao et al 011) is probably the first work on online maximization of In order not to store all received positive and negative examples they proposed to use reservoir sampling to maintain a buffer of positive and negative examples At each they run gradient update based on an online empirical version of defined on the buffered data to update the models A regret bound was established with the cost function at each t defined based on the example received at the t-th and all data in the buffer Even with the optimal buffer size their regret bound is worse than n Gao et al 013) proposed another on- 1 he Õ ) notation hides logarithmic factors line algorithm without resorting to the reservoir sampling heir algorithm hinges on the observation that if a consistent square loss is used the gradient of an online empirical version of can be computed based on first and second order statistics of received data Hence their algorithm needs to maintain a covariance matrix of received examples which render it not practical for high-dimensional data Although a randomized version is proposed for maintaining low rank version of the covariance matrix it still has considerable computational overhead In terms of guarantee they established a regret bound in the order of for general case and a possible O1) regret bound under the low-noise condition However these regret bounds do not directly imply a convergence rate for statistical learning for maximization he reason is that the data that define each cost function are not independent he stochastic algorithm proposed in a recent work Ying et al 016) is the first with a convergence guarantee for stochastic optimization and also the first one without storing any historical examples or their covariance matrix As aforementioned their algorithm is based on a novel min-max saddle point formulation of maximization and has a convergence rate of Õ1/ n) Fast rate of stochastic optimization such as O1/n) has been studied for standard classification and regression problems under some conditions For example Hazan & Kale 011a) proposed a method with an O1/n) convergence rate under a weak) strong convexity assumption heir algorithm requires knowing the strong convexity parameter For statistical learning problems Srebro et al 010) established an O1/n) convergence rate for smooth loss functions under low-noise assumption eg the optimal risk value is close to zero) However their algorithm requires knowing a good estimation of the optimal risk value Bach & Moulines 013) established the first O1/n) convergence rate of stochastic algorithms for minimizing expected square loss for regression and expected logistic loss for classification without the strong convexity assumption However all these algorithms and analysis are not applicable to the maximization in the classical stochastic setting Finally we note that the multi-stage scheme of the proposed algorithm is similar to that developed in Hazan & Kale 011b; Ghadimi & Lan 013; Juditsky et al 014; Xu et al 017) for minimizing strongly or uniformly convex functions or problems with error bound conditions However the differences between the proposed work and these works are that i) our development is based on primal dual stochastic method for solving a stochastic saddle point problem of maximization with particular treatment of dual variables; while their algorithms are for solving stochastic convex minimization problems; ii) we do not require strong convexity assumption with known parameters

3 Fast Stochastic Maximization as in Hazan & Kale 011b; Ghadimi & Lan 013); iii) we do not assume uniform strong convexity assumption as in Juditsky et al 014) Instead we prove a quadratic growth condition for the studied problem that is a special case of error bound condition studied in Xu et al 017) However the stochastic algorithms proposed in Xu et al 017) require a target accuracy level and cannot be applied to one pass learning setting for large-scale data 3 Algorithm and Main Result 31 Preliminaries and Notations Let z = x y) P denote a random data following an unknown distribution P where x X R d represents the feature vector and y {1 1} represents the label Denote by Z = X {1 1} and by p = Pry = 1) = E y [I [y=1] ] where I is an indicator function We assume that sup x X x κ Let x R d+ denote an augmented feature vector with the last two components being 0 and let Bx 0 r) = {x : x x 0 r} be an l -ball centered at x 0 with a radius r Given a score function h : R d R the at the population level referred to as the true in this paper) is defined as: h) = Prhx) hx ) y = 1 y = 1) where z = x y) and z = x y) are a pair of random data he maximization problem is to find an optimal score function in a hypothesis class such that is maximized Since the problem is non-convex it is usually solved by using consistent convex surrogate loss A common choice used by previous studies Ying et al 016; Gao et al 013) is the square loss In this paper we consider learning a linear function hx) = w x to maximize the using a square loss ie min Lw) E zz [1 w x x )) y = 1 y = 1] w R d 1) Since the loss function depends on a random pair of data making it difficult to handle in the classical stochastic setting a solution is to cast the above problem into an equivalent saddle point problem Ying et al 016): min max {fw a b α) := E z[f w a b α; z)]} w R d α R ab) R where F w a b α; z) = 1 p)w x a) I [y=1] + pw x b) I [y= 1] p1 p)α α)pw xi [y= 1] 1 p)w xi [y=1] ) Assuming the optimal solution w sits in a bounded domain such that w 1 R we can restrict the primal and dual variables to constrained domains Ω 1 = {w a b) : w 1 R a Rκ b Rκ} Ω = {α R : α Rκ} ie min max{fw a b α) := E z [F w a b α; z)]} wab) Ω 1 α Ω ) Please note that adding the l 1 -ball constraint on w does not restrict the performance of the learned model due to that i) if R is large enough such that the optimal solution w to 1) satisfies w 1 R then w is also the optimal solution to ) ; ii) for a finite number of samples the constraint can serve as regularization for improving the generalization performance We use l 1 -ball constraint because it allows us to develop a fast stochastic optimization algorithm with an convergence rate of Õ1/n) Next we will present the proposed algorithm and its convergence results We will also provide a convergence analysis on a core component of the proposed algorithm 3 Algorithm and its Convergence he proposed algorithm is presented in Algorithm 1 which is referred to as he parameteres R 0 G D 0 β 0 will be explained shortly he algorithm divides the updates into m stages and each stage has n/m updates Each stage of runs a primal dual style stochastic gradient PDSG) method outlined in Algorithm which is similar to that in Ying et al 016) except for three differences: i) the step size is given as a constant for each call of PDSG; ii) each update of the primal variable v = w a b) and the dual variable α is projected into an intersection of their constrained domain and an l ball centered at the initial solution v 1 α 1 ; iii) upon receiving a random data z t the variables ± ± p are updated according to Algorithm 3 where ± represent the number of positive/negative examples received so far;  ± represents the cumulative augmented) feature vector for the positive class and negative class; and p represents the estimated positive ratio based on the received examples he parameters for each call of PDSG is adaptively changing including the initial solutions v 1 α 1 the radius of l balls for the primal variables and the dual variables and the step size η he initial parameter R 0 is a bound of Euclidean distance from v 0 to the optimal set of ) in terms of v he initial parameter D 0 is for setting a constrained domain of the dual variable he initial parameter β 0 is for setting the initial step size he parameter G is an upper bound of Euclidean norms of Gu z) Ĝtu z t ) gu) for any u Ω 1 Ω and z z t Z which are defined in 3) he function ˆF t at the line 5 and line 6 of the Algorithm given his can be shown that if w 1 1 then the optimal solutions to a b α satisfy the prescribed bounds in light of their closed-form solutions depending on w

4 Fast Stochastic Maximization Algorithm 1 1: Set m = 1 log n log n 1 n 0 = n/m R 0 = 1 + κ R G > max1 + 4κ)κR + 1) κr Rκ) κ4κr + 11R + 1)) β 0 = 1 + 8κ ) D 0 = κr 0 : Initialize v 0 = 0 R d+ α 0 = 0 3: for k = 1 m do βk 1 4: Set η k = R 3n0G k 1 5: Call PDSG to obtain v k α k β k R k D k ) = PDSG v k 1 α k 1 R k 1 D k 1 n 0 η k ) 6: end for 7: return v m by F t v α z t ) = 1 p)w x t a) I [y] + pw x t b) I [yt= 1] p1 p)α α) pw x t I [yt= 1] 1 p)w xi [y]) is an estimation of F v α z t ) using the current estimation p of p It is worth mentioning that the line 5 and line 6 of the Algorithm need to execute a projection onto the intersection of two convex sets We can use the alternating projection algorithm to generate a sequence which can linearly converges to the desired point Actually it is easy to show that the two sets in our case are boundedly linearly regular Definition 311 of Bauschke & Borwein 1993)) and the linear convergence is guaranteed by the Corollary 314 of Bauschke & Borwein 1993) Finally the convergence rate of is presented in the following theorem heorem 1 Given 0 1) assume n is sufficiently 3 ln 1 ) large such that n > max100 m ) hen with minp1 p)) probability at least 1 ln 1 max f v m α) min max fv α) Õ ) ) α Ω v Ω 1 α Ω n where Õ ) suppresses logarithmic factor of logn) and some constants of the problem independent of n Remark: In practice we can maintain global variables  +  p and use them for updating parameters and initializing dual variables he local versions used in Algorithm are just for simplicity of analysis he above theorem implies the convergence of maximization in terms of the square loss Corollary 1 Under the same condition as in heorem 1 Algorithm PDSGv 1 α 1 r D η) 1: Initialize variables  + R d+  R d+ + p R as zeros : for t = 1 do 3: Receive a sample z t = x t y t ) 4: Update ± ± p using the data z t 5: v t+1 = Π Ω1 Bv 1r)v t η v Ft v t α t z t )) 6: α t+1 = Π Ω Bα 1D)α t + η α Ft v t α t z t )) 7: end for vt 8: Compute v = 9: Let r = r/ 10: Update β D according to Lemma 1 11: return v α β r D) and α =  Â+ + ) v Algorithm 3 Update ± ± p given a data x t y t ) 1:  + = Â+ + I [y] x t :  =  + I [yt= 1] x t 3: p = p + I [y] 4: + = + + I [y] 5: = + I [yt= 1] 6: p = + / + + ) with probability at least 1 ln 1 Lŵ m ) min Lw) Õ ) ) w 1 R n 4 Convergence Analysis his section is devoted to convergence analysis Due to limitation of space we present detailed analysis for a key lemma More details about the proof of main heorem can be found in the supplement Before analysis we first give some notations that will be frequently used in this section v = w a b) R d+ u = v α) R d+3 Gu z) = v F v α; z) α F v α; z)) Ĝ t u z t ) = v Ft v α; z t ) α Ft v α; z t )) gu) = v fv α) α fv α)) Lemma 1 For each call of Algorithm we update D β according to D = 4 κ + κr + ln 1 min p 1 p) ) ) 1 + κ)r ln 1 )) 3)

5 Fast Stochastic Maximization and β = 1 + 8κ ) + 3κ 1 + κ) + min p 1 p) ln 1 ) ) ln 1 )/ ) where p is an estimation of p Suppose v 1 v r where v Ω 1 is the optimal ) solution closest to v 1 and R > max 3 ln r 1 ) then minp1 p)) max f v α) min max fv α) α Ω v Ω 1 α Ω ) 3γ 1 + ln 6 )γ rg ) where γ 1 = 1 + 8κ ) + 3κ 1+κ) + ln 1 ) minp1 p)/ γ = κ + 8κ + ln 1 )1+κ) ) ) minp1 p)/ Remark: When n is sufficiently large as stated in the main theorem then log/)/ < min p 1 p) by Hoeffding s inequality due to that p is estimated using n 0 examples Based on the above lemma Algorithm 1 uses a geometrically decreasing radius r in order to achieve a faster convergence By further utilizing the following property we can prove the main theorem Lemma f 1 v) = max α Ω fv α) restricted on the set Ω 1 satisfies a quadratic growth condition ie for any v Ω 1 there exists c > 0 such that v v cf 1 v) min v Ω 1 f 1 v)) 1/ where v is the optimal solution to min v Ω1 f 1 v) that is closest to v Remark: It is notable that the proposed algorithm does not require the value of c that is difficult to compute 41 Proof of Lemma 1 o prove this lemma we need the following lemma whose proof is included in the supplement Lemma 3 Let  = Â+/ +  / and and A = Ex y = 1) Ex y = 1)) 0 0) After the k-th call of Algorithm with k 1 with probability 1 3 we have  A 4κ where ξ min p 1 p) ) + ln 1 ) ξ ln 1 ) 4) Proof of Lemma 1 By the setting of G we have ) max gu t ) Gu t z t ) Ĝtu t z t ) G tu tz t Define α = arg max α f v α) = w [Ex y = 1) Ex y = 1)] Following standard analysis of primal-dual update eg see inequality 17 in Ying et al 016)) we have max f v α) min max fv α) α Ω v Ω 1 α Ω f v α ) fv ᾱ ) v 1 v + α 1 α + ηg η η + u t u ) gu t ) Gu t z t )) + u t u ) Gu t z t ) Ĝtu t z t )) = I + II + III + IV + V where v is the optimal solution closest to v 1 of ) and the first inequality follows from the fact that min v Ω1 max α Ω fv α) fv ᾱ ) Now we bound the five terms respectively It is obvious that I = v1 v η r 0 η Let Âk 1 p k 1 ξ k 1 be counterparts of that in Lemma 3 estimated from data in k 1)-stage According to the setting of α in Algorithm 1 for the first call of Algorithm we have α 1 = Av 1 due to v 0 = 0 and for k-th call of Algorithm with k > 1 we have α 1 = Âk 1v 1 hen for the first call of Algorithm we have II = Av1 A v η and for other calls we have II = Âk 1v 1 A v η Âk 1v 1 v ) + Âk 1 A) v η By Cauchy-Schwartz inequality we have max{ Av 1 v ) Âk 1v 1 v ) } 4κ v 1 v Âk 1 A) v Âk 1 A v By combining the result in Lemma 3 we have with probability 1 /3 ) II 8κ r 3κ + ln 1 η + I ) 1 + κ) R k>1 ξ k 1 η

6 Fast Stochastic Maximization Similarly we can bound α 1 α by κr for the first call and 4 ) κ + ln 1 ) 1 + κ)r α 1 α min p k 1 1 p k 1 )n 0 n 0 ln 1 )) + κr for k-th call with k > 1 According to initial value of D and r for each call we have α 1 α D Next we can bound the last two terms similarly to Ying et al 016) Define ũ 1 = u 1 Ω 1 Bv 1 r)) Ω Bα 1 D)) ũ t+1 = Π Ω1 Bv 1r)) ũ t ηgu t ) Gu t z t ))) Ω Bα 1D)) then we have with probability at least 1 3 ηũ t u ) gu t ) Gu t z t )) ũ 1 u + 1 η gu t ) Gu t z t ) v 1 v + α 1 α r + D + η G + 1 η 4G Note that both u t and ũ t are measurable with respect to F t = {z 1 z t 1 } and {S t : ηu t ũ t ) gu t ) Gu t )) : t = 1 } is a martingale difference sequence and for any t we have ηu t ũ t ) gu t ) Gu t z t )) ηg u t ũ t ηgr + D) hen by Azuma- Hoeffding s inequality we have with probability at least 1 3 ηu t ũ t ) gu t ) Gu t z t )) ηgr + D) Hence with probability 1 3 we have ln 3 ) 5) IV ) 1 + 8κ )r 16κ + ln 1 ) 1 + κ) R + η ηξ k 1 4Gr + D) ln 3 + ηg + ) Next we bound V According to Lemma 3 of Ying et al 016) V sup u t u 1 + u 1 u ) t ) sup Ĝtu z) Gu z) / u Ωz Z 4r + D) κ4κr + 11R + 1) 8r + D)G ln 6 ) ln 6 ) t where the last inequality holds since 1 t When R r by union bound with probability at least 1 we have max f v α) min max fv α) α Ω v Ω 1 α Ω ζ 1r η + 3ηG + ζ rg ln 6 ) where ζ 1 = 1 + 8κ 3κ 1+κ) ) + I [k>1] ξ k 1 ζ = κ + ln 1 )1+κ) κ + I ) [k>1] ) ξk 1 ξ k 1 = min ˆp k 1 1 ˆp k 1 ) ln 1 ) + ln 1 ) ) Moreover by choosing η = ζ 1r 3G we have with probability at least 1 we have max f v α) min max fv α) α Ω v Ω 1 α Ω ) 3ζ 1 + ln 6 )ζ r 0 G By Hoeffding inequality ln 1 p p k 1 ) ln 1 1 p) 1 p k 1 ) ) 3 ln 1 ) When we can bound ζ minp1 p)) 1 γ 1 and ζ 1 3 ln γ due to ) then the conclusion follows minp1 p)) 5 Experiments In this section we conduct experiments to compare our algorithm with two state-of-the-art stochastic/online optimization methods namely SOLAM

7 Fast Stochastic Maximization 0 a9a dataset 3 ijcnn1 dataset 086 covtype_binary dataset 0684 HIGGS dataset a) a9a b) ijcnn c) covtype d) HIGGS 7 w8a dataset 95 real-sim dataset 96 rcv1_binary dataset 8 news0binary dataset e) w8a f) real-sim g) rcv1 binary Figure 1 -Iteration curves of and the baselines h) news0binary Ying et al 016) and Gao et al 013) o make the comparison fair we implement two variations of SO- LAM using two different norm constraints on w: SOLAM with l norm constraint the original one) referred to as SO- LAM and SOLAM with l 1 norm constraint referred to as SOLAM-l1 We use eight large-scale benchmark datasets from libsvm website 3 ranging from high-dimensional to lowdimensional from balanced class distribution to imbalanced class distribution he statistics of these datasets are summarized in able 1 We randomly divide each dataset into three sets respectively training validation and testing For a9a and w8a datasets we randomly split the given testing set into half validation and half testing For the datasets that do not explicitly provide a testing set we randomly split the entire data into 4:1:1 for training validation and testing For the dataset ijcnn1 and rcv1 binary despite that the test set is given the size of the training set is relatively small hus we first combine the training and the test sets and then follow the above procedure to split it he involved parameters of each algorithm are tuned based on the validation data has two parameters R and G R is decided within the range 10 [ 1:1:5] G only affects the initial stepsize of each epoch ie η 1 = 3n0G β 0 R 0 and η k+1 = β k β k 1 η k Algorithm 1 line 4-5) so we equivalently tune η 1 [ 10:1:10] As for SOLAM following 3 cjlin/libsvmtools/datasets/ able 1 Statistics of Datasets Datasets #raining #esting feat % of Pos a9a ijcnn covtype binary HIGGS w8a real-sim rcv1 binary news0binary the same strategy in the original paper Ying et al 016) we tune R in 10 [ 1:1:5] and the learning rate in [ 10:1:10] has two versions corresponding to different ways of maintaining the covariance matrix the full version and an approximated version designed to deal with high dimensional data On the five low-dimensional datasets a9a ijcnn1 covtype binary HIGGS w8a) we use the full version of since the dimension is low enough and thus it is unnecessary to use the low rank approximation method which is less accurate and more complicated For the three high-dimensional datasets real-sim rcv1 binary news0binary) we set the low rank τ = 50 the same as the original paper Gao et al 013) here are two parameters shared by both versions of step size η and regularized parameter λ Following the suggestion of the original paper Gao et al 013) we tune η [ 1:1: 4] and λ [ 10:1:0] Each algorithm updates the model by one pass of train-

8 Fast Stochastic Maximization 0 a9a dataset 3 ijcnn1 dataset 086 covtype_binary dataset 0684 HIGGS dataset a) a9a b) ijcnn c) covtype d) HIGGS 7 w8a dataset 95 real-sim dataset 96 rcv1_binary dataset 8 news0binary dataset e) w8a f) real-sim g) rcv1 binary Figure -ime curves of and the baselines h) news0binary able Averaged final on testing data Datasets SOLAM SOLAM l1 a9a ± ± ± ± ijcnn ± ± ± ± 0096 covtype binary ± ± ± ± HIGGS ± ± ± ± w8a ± ± ± ± 0071 real-sim ± ± ± ± rcv1 binary ± ± ± ± news0binary ± ± ± ± ing data and the models at different s are evaluated by computed on the testing data to demonstrate the testing) convergence speed of different algorithms We report the results on testing sets averaged over 5 random runs over the training data he convergence curves of each algorithm are plotted in Figure 1 and Figure in terms of both the number of s and training time he final with the standard deviation over 5 runs is summarized in able From Figure 1 we can see that the proposed converges faster than the two state-of-the-art methods which is consistent with our theory For the convergence in terms of running time shown in Figure 1 the proposed also has the best performance 6 Conclusion In this paper we have proposed a novel stochastic algorithm for maximization in the classical stochastic setting where one random data is received at each We theoretically analyze the proposed algorithm and derive a fast convergence rate of Õ1/n) largely improved from the best result that the current state-of-the-art method can achieve - Õ1/ n) We have also verified the efficiency of our algorithm by experiments on multiple benchmark datasets and the results show that our algorithm converges faster than two strong baseline algorithms in terms of both the number of s and running time For future work we will consider fast stochastic algorithms for maximization based on different surrogate losses One may also consider extending the proposed algorithm to learning to rank from pairwise data in which the label of each data have more than two possible values in the binary classification Acknowledgments We thank the anonymous reviewers for their helpful comments M Liu X Zhang Yang are partially supported by National Science Foundation IIS )

9 Fast Stochastic Maximization References Bach F R and Moulines E Non-strongly-convex smooth stochastic approximation with convergence rate O1/n) In Advances in Neural Information Processing Systems 6 NIPS) pp Bauschke H H and Borwein J M On the convergence of von neumann s alternating projection algorithm for two sets Set-Valued Analysis 1): Cortes C and Mohri M Auc optimization vs error rate minimization In Advances in neural information processing systems pp Ferri C Hernández-Orallo J and Flach P A A coherent interpretation of auc as a measure of aggregated classification performance In Proceedings of the 8th International Conference on Machine Learning ICML-11) pp Gao W Jin R Zhu S and Zhou Z-H One-pass auc optimization In International Conference on Machine Learning pp Metz C E Basic principles of roc analysis In Seminars in nuclear medicine volume 8 pp Elsevier 1978 Srebro N Sridharan K and ewari A Smoothness low noise and fast rates In Advances in Neural Information Processing Systems NIPS) pp Xu Y Lin Q and Yang Stochastic convex optimization: Faster local growth implies faster global convergence In Proceedings of the 34th International Conference on Machine Learning ICML) pp Ying Y Wen L and Lyu S Stochastic online auc maximization In Advances in Neural Information Processing Systems pp Zhao P Jin R Yang and Hoi S C Online auc maximization In Proceedings of the 8th international conference on machine learning ICML-11) pp Ghadimi S and Lan G Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization ii: Shrinking procedures and optimal algorithms SIAM Journal on Optimization 34): Hanley J A and McNeil B J he meaning and use of the area under a receiver operating characteristic roc) curve Radiology 1431): Hanley J A and McNeil B J A method of comparing the areas under receiver operating characteristic curves derived from the same cases Radiology 1483): Hazan E and Kale S Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization Journal of Machine Learning Research - Proceedings rack 19: a Hazan E and Kale S Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization In Proceedings of the 4th Annual Conference on Learning heory COL) pp b Herschtal A and Raskutti B Optimising area under the roc curve using gradient descent In Proceedings of the twenty-first international conference on Machine learning pp 49 ACM 004 Juditsky A Nesterov Y et al Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization Stochastic Systems 41):

Stochastic Online AUC Maximization

Stochastic Online AUC Maximization Stochastic Online AUC Maximization Yiming Ying, Longyin Wen, Siwei Lyu Department of Mathematics and Statistics SUNY at Albany, Albany, NY,, USA Department of Computer Science SUNY at Albany, Albany, NY,,

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-7) A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis Zhe Li, Tianbao Yang, Lijun Zhang, Rong

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

On the Consistency of AUC Pairwise Optimization

On the Consistency of AUC Pairwise Optimization On the Consistency of AUC Pairwise Optimization Wei Gao and Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University Collaborative Innovation Center of Novel Software Technology

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem

Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem ABSRAC Yang Mu University of Massachusetts Boston 100 Morrissey Boulevard Boston, MA, US 015 yangmu@cs.umb.edu ianyi Zhou University

More information

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Logistic Regression. COMP 527 Danushka Bollegala

Logistic Regression. COMP 527 Danushka Bollegala Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will

More information

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7. Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

Appendix: Kernelized Online Imbalanced Learning with Fixed Budgets

Appendix: Kernelized Online Imbalanced Learning with Fixed Budgets Appendix: Kernelized Online Imbalanced Learning with Fixed Budgets Junjie u 1,, aiqin Yang 1,, Irwin King 1,, Michael R. Lyu 1,, and Anthony Man-Cho So 3 1 Shenzhen Key Laboratory of Rich Media Big Data

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

Regret bounded by gradual variation for online convex optimization

Regret bounded by gradual variation for online convex optimization Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Generalized Boosting Algorithms for Convex Optimization

Generalized Boosting Algorithms for Convex Optimization Alexander Grubb agrubb@cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 523 USA Abstract Boosting is a popular way to derive powerful

More information

A Two-Stage Approach for Learning a Sparse Model with Sharp

A Two-Stage Approach for Learning a Sparse Model with Sharp A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis The University of Iowa, Nanjing University, Alibaba Group February 3, 2017 1 Problem and Chanllenges 2 The Two-stage Approach

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Stochastic Semi-Proximal Mirror-Prox

Stochastic Semi-Proximal Mirror-Prox Stochastic Semi-Proximal Mirror-Prox Niao He Georgia Institute of echnology nhe6@gatech.edu Zaid Harchaoui NYU, Inria firstname.lastname@nyu.edu Abstract We present a direct extension of the Semi-Proximal

More information

Maximization of AUC and Buffered AUC in Binary Classification

Maximization of AUC and Buffered AUC in Binary Classification Maximization of AUC and Buffered AUC in Binary Classification Matthew Norton,Stan Uryasev March 2016 RESEARCH REPORT 2015-2 Risk Management and Financial Engineering Lab Department of Industrial and Systems

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

AUC Maximizing Support Vector Learning

AUC Maximizing Support Vector Learning Maximizing Support Vector Learning Ulf Brefeld brefeld@informatik.hu-berlin.de Tobias Scheffer scheffer@informatik.hu-berlin.de Humboldt-Universität zu Berlin, Department of Computer Science, Unter den

More information

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization Linbo Qiao, and Tianyi Lin 3 and Yu-Gang Jiang and Fan Yang 5 and Wei Liu 6 and Xicheng Lu, Abstract We consider

More information

The Perceptron Algorithm 1

The Perceptron Algorithm 1 CS 64: Machine Learning Spring 5 College of Computer and Information Science Northeastern University Lecture 5 March, 6 Instructor: Bilal Ahmed Scribe: Bilal Ahmed & Virgil Pavlu Introduction The Perceptron

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Supplementary Material of Dynamic Regret of Strongly Adaptive Methods

Supplementary Material of Dynamic Regret of Strongly Adaptive Methods Supplementary Material of A Proof of Lemma 1 Lijun Zhang 1 ianbao Yang Rong Jin 3 Zhi-Hua Zhou 1 We first prove the first part of Lemma 1 Let k = log K t hen, integer t can be represented in the base-k

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions

Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions JMLR: Workshop and Conference Proceedings vol 3 0 3. 3. 5th Annual Conference on Learning Theory Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions Yuyang Wang Roni Khardon

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. 1. Prediction with expert advice. 2. With perfect

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Surrogate regret bounds for generalized classification performance metrics

Surrogate regret bounds for generalized classification performance metrics Surrogate regret bounds for generalized classification performance metrics Wojciech Kotłowski Krzysztof Dembczyński Poznań University of Technology PL-SIGML, Częstochowa, 14.04.2016 1 / 36 Motivation 2

More information

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning. Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research

More information

Subgradient Methods for Maximum Margin Structured Learning

Subgradient Methods for Maximum Margin Structured Learning Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting

More information

Computational Learning Theory - Hilary Term : Learning Real-valued Functions

Computational Learning Theory - Hilary Term : Learning Real-valued Functions Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling

More information

Dreem Challenge report (team Bussanati)

Dreem Challenge report (team Bussanati) Wavelet course, MVA 04-05 Simon Bussy, simon.bussy@gmail.com Antoine Recanati, arecanat@ens-cachan.fr Dreem Challenge report (team Bussanati) Description and specifics of the challenge We worked on the

More information

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections Jianhui Chen 1, Tianbao Yang 2, Qihang Lin 2, Lijun Zhang 3, and Yi Chang 4 July 18, 2016 Yahoo Research 1, The

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Better Algorithms for Selective Sampling

Better Algorithms for Selective Sampling Francesco Orabona Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy francesco@orabonacom nicolocesa-bianchi@unimiit Abstract We study online algorithms for selective sampling that use regularized

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Importance Sampling for Minibatches

Importance Sampling for Minibatches Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Dynamic Regret of Strongly Adaptive Methods

Dynamic Regret of Strongly Adaptive Methods Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret

More information

arxiv: v8 [cs.lg] 24 Nov 2016

arxiv: v8 [cs.lg] 24 Nov 2016 Confidence-Weighted Bipartite Ranking Majdi Khalid, Indrakshi Ray, and Hamidreza Chitsaz Colorado State University, Fort Collins, USA majdi.khaled@gmail.com,indrakshi.ray@colostate.edu, chitsaz@chitsazlab.org

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information