Constrained Stochastic Gradient Descent for Large-scale Least Squares Problem


Yang Mu, University of Massachusetts Boston, 100 Morrissey Boulevard, Boston, MA, USA 02125, yangmu@cs.umb.edu
Tianyi Zhou, University of Technology Sydney, 235 Jones Street, Ultimo, NSW 2007, Australia
Wei Ding, University of Massachusetts Boston, 100 Morrissey Boulevard, Boston, MA, USA 02125, ding@cs.umb.edu
Dacheng Tao, University of Technology Sydney, 235 Jones Street, Ultimo, NSW 2007, Australia, dacheng.tao@uts.edu.au

ABSTRACT

The least squares problem is one of the most important regression problems in statistics, machine learning and data mining. In this paper, we present the Constrained Stochastic Gradient Descent (CSGD) algorithm to solve the large-scale least squares problem. CSGD improves Stochastic Gradient Descent (SGD) by imposing a provable constraint that the linear regression line passes through the mean point of all the data points. It results in the best regret bound O(log T) and the fastest convergence speed among all first-order approaches. Empirical studies justify the effectiveness of CSGD by comparing it with SGD and other state-of-the-art approaches. An example is also given to show how to use CSGD to optimize SGD-based least squares problems to achieve better performance.

Categories and Subject Descriptors: G.1.6 [Optimization]: Least squares methods, Stochastic programming; I.2.6 [Learning]: Parameter learning

General Terms: Algorithms, Theory

Keywords: Stochastic optimization, Large-scale least squares, online learning

* Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. KDD'13, August 11-14, 2013, Chicago, Illinois, USA. Copyright 2013 ACM .../13/08 ...$15.00.

1. INTRODUCTION

The stochastic least squares problem aims to find the coefficient w* in R^d that minimizes the objective function at step t,

    w_{t+1} = argmin_w (1/t) Σ_{i=1}^{t} (1/2) (y_i - w^T x_i)^2,    (1)

where (x_i, y_i) ∈ X × Y is an input-output pair randomly drawn from the data set (X, Y) endowed with a distribution D, with x_i ∈ R^d and y_i ∈ R; w_{t+1} ∈ R^d is the parameter that minimizes the empirical least squares loss at step t, and l(w, x, y) = (1/2)(y - w^T x)^2 is the empirical risk.

For large-scale problems, classical optimization methods, such as the interior point method and conjugate gradient descent, have to scan all data points several times in order to evaluate the objective function and find the optimal w*. Recently, Stochastic Gradient Descent (SGD) methods [7, 9, 29, 3, 20, 13] have shown promising efficiency in solving large-scale problems, and some of them have been widely applied to the least squares problem. The Least Mean Squares (LMS) algorithm [25] is the standard first-order SGD, which takes a scalar as the learning rate. The Recursive Least Squares (RLS) approach [25, 15] is an instantiation of the stochastic Newton method, replacing the scalar learning rate with an approximation of the inverse Hessian matrix. The Averaged Stochastic Gradient Descent (ASGD) [21] averages the SGD iterates to estimate w*. ASGD converges more stably than SGD, and its convergence rate even approaches that of a second-order method once the estimator is sufficiently close to w*, which happens only after processing a huge number of data points. Since the least squares loss l(w, X, Y) is usually strongly convex [5], first-order approaches can converge at the rate of O(1/T).
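To make the notation concrete, here is a minimal sketch (ours, not part of the paper) of the empirical risk l(w, x, y) = (1/2)(y - w^T x)^2, its gradient, and the plain SGD/LMS update w_{t+1} = w_t - η_t g_t that the approaches discussed below build on; the function names and the toy data point are illustrative assumptions.

```python
import numpy as np

def least_squares_grad(w, x, y):
    """Gradient of l(w, x, y) = 0.5 * (y - w^T x)^2 with respect to w."""
    return -(y - w @ x) * x

def sgd_step(w, x, y, eta):
    """One plain SGD/LMS step: w_{t+1} = w_t - eta_t * g_t."""
    return w - eta * least_squares_grad(w, x, y)

# Toy usage: one step on a random data point (d = 3, leading 1 as the bias feature).
rng = np.random.default_rng(0)
w = np.zeros(3)
x = np.concatenate(([1.0], rng.uniform(size=2)))  # x^(1) = 1
y = 0.7
w = sgd_step(w, x, y, eta=0.1)
```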
Given the smallest eigenvalue λ_0 of the Hessian matrix [8, 7, 18], many algorithms can achieve fast convergence rates and good regret bounds [7, 2, 10, 3, 11]. However, if the Hessian matrix is unknown in advance, SGD may perform poorly [18]. For high-dimensional large-scale problems, strong convexity is not always guaranteed, because the smallest eigenvalue of the Hessian matrix might be close to 0. Without the assumption of strong convexity, the convergence rate of first-order approaches reduces to O(1/√T) [30], while the computation complexity remains O(d) per iteration. Second-order approaches, which use a Hessian approximation, converge at the rate O(1/T). Although they are appealing due to their fast convergence rate and stability, the expensive O(d^2) time complexity of each iteration limits the use of second-order approaches in practical large-scale problems.

In this paper, we prove that the linear regression line defined by the optimal coefficient w* passes through the mean point (x̄, ȳ) of all the data points drawn from the distribution D. Given this property, we can significantly improve SGD for optimizing the large-scale least squares problem by adding an equality constraint w_{t+1}^T x̄_t - ȳ_t = 0, where (x̄_t, ȳ_t) is the batch mean of the data points collected up to step t during the optimization iterations. The batch mean (x̄_t, ȳ_t) is an unbiased estimate of (x̄, ȳ) (cf. the Law of Large Numbers). We term the proposed approach the constrained SGD (CSGD). CSGD shrinks the optimal solution space of the least squares problem from the entire R^d to a hyper-plane in R^d, and thus significantly improves the convergence rate and the regret bound. In particular:

- Without the strong convexity assumption, CSGD converges at the rate of O(log T / T), which is close to that of a full second-order approach, while retaining a time complexity of O(d) per iteration.
- CSGD achieves the O(log T) regret bound without requiring strong convexity, which is the best regret bound among existing SGD methods.

Note that when the data points are centralized (the mean is 0), the constraint becomes trivial and CSGD reduces to SGD, which is the worst case for CSGD. In practical online learning, however, the collected data points are often not centralized, and thus CSGD is preferred. In this paper, we only discuss the properties of CSGD when the data points are not centralized.

Notations. We denote the input data point x_t = [1, x_t^(2), ..., x_t^(d)]^T and w_t = [w_t^(1), ..., w_t^(d)]^T, where x_t^(1) = 1 is the first element of x_t and w_t^(1) is the bias parameter [6]. ||·||_p is the L_p norm, ||·||_p^2 is the squared L_p norm, |·| is the absolute value for scalars, and l(w_t) is an abbreviation for l(w_t, x_t, y_t). Norms without a subscript denote the L_2 norm.

2. CONSTRAINED STOCHASTIC GRADIENT DESCENT

We present the Constrained Stochastic Gradient Descent (CSGD) algorithm for the large-scale least squares problem by combining SGD with the fact that the linear regression line passes through the mean point of all the data points.

2.1 CSGD algorithm

The standard Stochastic Gradient Descent (SGD) algorithm takes the form

    w_{t+1} = Π_τ(w_t - η_t g_t),    (2)

where η_t is an appropriate learning rate, g_t is the gradient of the loss function l(w_t, x_t, y_t) = (1/2)(y_t - w_t^T x_t)^2, and Π_τ(·) is the Euclidean projection onto the predefined convex set τ,

    Π_τ(w) = argmin_{v ∈ τ} ||v - w||^2.    (3)

In least squares, w is defined over the entire R^d, and Π_τ(·) can be dropped; thus the search space in which SGD looks for the optimal solution is the entire R^d.

According to Theorem 2.1 (cf. Section 2.2), we add the constraint w^T x̄_t - ȳ_t = 0 at step t to SGD, where (x̄_t, ȳ_t) is an unbiased estimate of (x̄, ȳ) after t iterations, and obtain CSGD:

    w_{t+1} = argmin_w (1/t) Σ_{i=1}^{t} (1/2) (y_i - w^T x_i)^2,   s.t.  w^T x̄_t - ȳ_t = 0.    (4)

The constraint in Eq.(4) determines the hyper-plane τ_t = {w | w^T x̄_t = ȳ_t} residing in R^d. By replacing τ with τ_t in Eq.(2), we have

    w_{t+1} = Π_{τ_t}(w_t - η_t g_t).    (5)

The projection function Π_{τ_t}(·) projects a point onto the hyper-plane τ_t. By solving Eq.(3), Π_{τ_t}(·) is uniquely defined at each step t by

    Π_{τ_t}(v) = P_t v + r_t,    (6)

where P_t is the projection matrix at step t and takes the form

    P_t = I - x̄_t x̄_t^T / ||x̄_t||^2,    (7)

where P_t ∈ R^{d×d} is idempotent and projects a vector onto the subspace orthogonal to x̄_t, and r_t = ȳ_t x̄_t / ||x̄_t||^2. By combining Eqs.(5) and (6), the iterative procedure for CSGD is

    w_{t+1} = P_t (w_t - η_t g_t) + r_t.    (8)

We can obtain both the time and space complexities of the above procedure as O(d) after plugging Eq.(7) into Eq.(8) and expanding the update of w_{t+1}:

    w_{t+1} = (I - x̄_t x̄_t^T / ||x̄_t||^2)(w_t - η_t g_t) + r_t
            = w_t - η_t g_t - x̄_t (x̄_t^T (w_t - η_t g_t)) / ||x̄_t||^2 + r_t.    (9)

Algorithm 1 describes how to compute CSGD for the least squares problem. The algorithm has time and space complexities both of O(d).

Algorithm 1: Constrained Stochastic Gradient Descent (CSGD)
  Initialize w_1 = x̄_0 = 0 and ȳ_0 = 0.
  for t = 1, 2, 3, ... do
    Compute the gradient g_t ∈ ∂l(w_t, x_t, y_t).
    Compute (x̄_t, ȳ_t) with x̄_t = ((t-1)/t) x̄_{t-1} + (1/t) x_t and ȳ_t = ((t-1)/t) ȳ_{t-1} + (1/t) y_t.
    Compute w_{t+1} = w_t - η_t g_t - x̄_t (x̄_t^T (w_t - η_t g_t)) / ||x̄_t||^2 + r_t.
  end for
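The O(d) update in Eq.(9) and Algorithm 1 can be sketched as follows (a sketch under our own naming, not the authors' code): the running means x̄_t, ȳ_t are maintained incrementally, a plain SGD step is taken, and the result is projected onto the hyper-plane {w : w^T x̄_t = ȳ_t}. The η_0/√t schedule here is only a placeholder; the two-phase choice is discussed in Section 4.2.

```python
import numpy as np

def csgd(stream, d, eta0=0.1):
    """Constrained SGD (sketch of Algorithm 1). `stream` yields (x, y) pairs with x[0] == 1."""
    w = np.zeros(d)
    x_bar = np.zeros(d)
    y_bar = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        g = -(y - w @ x) * x                 # gradient of 0.5 * (y - w^T x)^2
        x_bar += (x - x_bar) / t             # running mean of inputs up to step t
        y_bar += (y - y_bar) / t             # running mean of targets up to step t
        eta = eta0 / np.sqrt(t)              # placeholder learning rate
        v = w - eta * g                      # plain SGD step
        nx = x_bar @ x_bar
        # Project v onto the hyper-plane {w : w^T x_bar = y_bar}, as in Eq.(9).
        w = v - x_bar * (x_bar @ v) / nx + y_bar * x_bar / nx
    return w
```

After every update, w satisfies w^T x̄_t = ȳ_t exactly, which is the regression-line constraint of Theorem 2.1 applied to the batch mean.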

2.2 Regression line constraint

Algorithm 1 relies on the fact that the optimal solution lies on a hyper-plane determined by the mean point, which leads to a significant improvement in the convergence rate and the regret bound.

Theorem 2.1 (Regression line constraint). The optimal solution w* lies on the hyper-plane w^T x̄ - ȳ = 0, which is defined by the mean point (x̄, ȳ) of the data points drawn from X × Y endowed with the distribution D.

Proof. The loss function is explicitly defined as

    l(w, X, Y) = Σ_{(x,y) ∈ D} (1/2) (y - w^T x)^2,    (10)

where the first element of x is the constant 1 and w^(1) is the corresponding bias parameter. Setting the derivative of the loss function w.r.t. w^(1) to zero, we obtain

    Σ_{x ∈ X} w^T x - Σ_{y ∈ Y} y = 0.

Thus the optimal solution w* satisfies w*^T x̄ - ȳ = 0.

Theorem 2.1 is the core theorem of our method. Bishop [6] applied the derivative w.r.t. the bias w^(1) to study the property of the bias. However, although the theorem itself has a simple form, to the best of our knowledge it has never been stated and applied in any approach for least squares optimization.

The mean point (x̄, ȳ) over a distribution D is usually not given. In a stochastic approach, we can use the batch mean (x̄_t, ȳ_t) to approximate (x̄, ȳ). The approximation has an estimation error; however, it does not lower the performance, because the batch optimum always satisfies this constraint when optimizing the empirical loss. Therefore, we give the constrained estimation error bound for completeness.

Proposition 2.2 (Constrained estimation error bound). By the Law of Large Numbers, assume there is a step m such that, for any t ≥ m, ||w*|| ||x̄_t - x̄|| + |ȳ_t - ȳ| ≤ ε ||x̄_t|| for a tolerably small value ε. Then the estimation error bound ||Π_{τ_t}(w*) - w*|| ≤ ε holds for any step t ≥ m.

Proof. Since ||Π_{τ_t}(w*) - w*|| is the distance between w* and the hyper-plane τ_t, we have

    ||Π_{τ_t}(w*) - w*|| = |ȳ_t - w*^T x̄_t| / ||x̄_t||.

Along with w*^T x̄ - ȳ = 0, we have

    |ȳ_t - w*^T x̄_t| = |ȳ_t - ȳ - (w*^T x̄_t - w*^T x̄)|    (11)
                     ≤ ||w*|| ||x̄_t - x̄|| + |ȳ_t - ȳ|
                     ≤ ε ||x̄_t||,

where the step m exists since (x̄_t, ȳ_t) converges to (x̄, ȳ) according to the Law of Large Numbers. Therefore, ||Π_{τ_t}(w*) - w*|| ≤ ε.

Proposition 2.2 states that, if at step m the batch mean (x̄_m, ȳ_m) is close to (x̄, ȳ), then a sufficiently good solution (within an ε-ball centered at the optimal solution w*) lies on the hyper-plane τ_m. In addition, the estimation error decays, and its value is upper bounded by a weighted combination of ||x̄_t - x̄|| and |ȳ_t - ȳ|.

Notice that CSGD optimizes the empirical solution, which is always located on the hyper-plane τ_t. According to Theorem 2.1 and Proposition 2.2, under the regression constraint assumption, CSGD explicitly minimizes the empirical loss as well as the second-order SGD does. According to Proposition 2.2, ||Π_{τ_t}(w*) - w*|| converges at the same rate as ||x̄_t - x̄|| and |ȳ_t - ȳ|, whose exponential convergence rate [4] is supported by the Law of Large Numbers. Thus, we ignore the difference between Π_{τ_t}(w*) and w* in our theoretical analysis for simplicity.

3. A ONE-STEP DIFFERENCE INEQUALITY

To study the theoretical properties of CSGD, we start from the one-step difference bound, which is crucial for analyzing the regret and convergence behavior.

[Figure 1: An illustrating example after step t. ŵ_{t+1} is the SGD result; w_{t+1} is the projection of ŵ_{t+1} onto the hyper-plane τ_t; w* is the optimal solution; tan θ_t = ||ŵ_{t+1} - w_{t+1}|| / ||w_{t+1} - w*||.]

After iteration step t, CSGD projects the SGD result ŵ_{t+1} onto the hyper-plane τ_t to get a new result w_{t+1} with direction and step size correction. An illustration is given in Figure 1. Note that w* is assumed to lie on τ_t according to Proposition 2.2. In addition, the definition of a subgradient implies, for any g ∈ ∂l(w_t),

    l(w) ≥ l(w_t) + g^T (w - w_t),  i.e.,  g^T (w_t - w) ≥ l(w_t) - l(w).    (12)

With Eq.(12) we have the following theorems for the step difference bound. First, we state the step difference bound proved by Nemirovski for SGD.

Theorem 3.1 (Step difference bound of SGD). For any optimal solution w*, SGD satisfies the following inequality between steps t and t + 1:

    ||ŵ_{t+1} - w*||^2 ≤ ||ŵ_t - w*||^2 + η_t^2 ||g_t||^2 - 2 η_t (l(ŵ_t) - l(w*)),

where η_t is the learning rate at step t.

A detailed proof is given in [19]. Second, we prove the step difference bound of CSGD as follows.

Theorem 3.2 (Step difference bound of CSGD). For any optimal solution w*, the following inequality holds for CSGD between steps t and t + 1:

    ||w_{t+1} - w*||^2 ≤ ||w_t - w*||^2 + η_t^2 ||g_t||^2 - 2 η_t (l(w_t) - l(w*)) - ||ḡ_{t+1}||^2 / ||x̄_t||^4,    (13)

where w_t = Π_{τ_{t-1}}(ŵ_t) and ḡ_{t+1} = ∇l(ŵ_{t+1}, x̄_t, ȳ_t).

Proof. Since ŵ_{t+1} = w_t - η_t g_t, the pair ŵ_{t+1} and w_t follows Theorem 3.1 between the two steps. Therefore, we have

    ||ŵ_{t+1} - w*||^2 ≤ ||w_t - w*||^2 + η_t^2 ||g_t||^2 - 2 η_t (l(w_t) - l(w*)).    (14)

The Euclidean projection w_{t+1} = Π_{τ_t}(ŵ_{t+1}) given in Eq.(3), which is also shown in Figure 1, has the property (using that w* lies on τ_t)

    ||ŵ_{t+1} - w*||^2 = ||w_{t+1} - w*||^2 + ||w_{t+1} - ŵ_{t+1}||^2.    (15)

Then, substituting ||ŵ_{t+1} - w*||^2 given by Eq.(15) into Eq.(14) yields

    ||w_{t+1} - w*||^2 ≤ ||w_t - w*||^2 + η_t^2 ||g_t||^2 - 2 η_t (l(w_t) - l(w*)) - ||w_{t+1} - ŵ_{t+1}||^2.    (16)

By using the projection function defined in Eq.(6), we have

    ||w_{t+1} - ŵ_{t+1}||^2 = ||P_t ŵ_{t+1} + r_t - ŵ_{t+1}||^2
                            = ||-(x̄_t x̄_t^T / ||x̄_t||^2) ŵ_{t+1} + (ȳ_t / ||x̄_t||^2) x̄_t||^2
                            = (ȳ_t - x̄_t^T ŵ_{t+1})^2 / ||x̄_t||^2.    (17)

Since ḡ_{t+1} = ∇l(ŵ_{t+1}, x̄_t, ȳ_t) = -x̄_t (ȳ_t - x̄_t^T ŵ_{t+1}), we have ||w_{t+1} - ŵ_{t+1}||^2 = ||ḡ_{t+1}||^2 / ||x̄_t||^4.

A direct consequence of the step difference bound is the following theorem, which derives the convergence result of CSGD.

Theorem 3.3 (Loss bound). Assume (1) the norm of any gradient from ∂l is bounded by G, (2) the norm of w* is less than or equal to D, and (3) ||ḡ_{t+1}||^2 / ||x̄_t||^4 ≥ η_t^2 Ḡ^2 for a constant Ḡ ≥ 0. Then

    Σ_{t=1}^{T} η_t (l(w_t) - l(w*)) ≤ (D^2 + (G^2 - Ḡ^2) Σ_{t=1}^{T} η_t^2) / 2.

Proof. Rearranging the bound in Theorem 3.2 and summing the loss terms over t from 1 through T gives

    2 Σ_{t=1}^{T} η_t (l(w_t) - l(w*)) ≤ ||w_1 - w*||^2 - ||w_{T+1} - w*||^2 + Σ_{t=1}^{T} (η_t^2 ||g_t||^2 - ||ḡ_{t+1}||^2 / ||x̄_t||^4)    (18)
                                        ≤ D^2 + (G^2 - Ḡ^2) Σ_{t=1}^{T} η_t^2.

The final step uses the fact that ||w_1 - w*||^2 ≤ D^2, where w_1 is initialized to 0, along with ||g_t|| ≤ G for any t and the assumption ||ḡ_{t+1}||^2 / ||x̄_t||^4 ≥ η_t^2 Ḡ^2.

A corollary of this theorem is presented in the following. Although the convergence of CSGD follows immediately from Nemirovski's three-line subgradient descent convergence proof [17], we present our first corollary to underscore the rate of convergence when η is fixed; in general, T is approximately 1/ε^2, or equivalently, the error decreases as 1/√T.

Corollary 3.4 (Fixed step convergence rate). Assume Theorem 3.3 holds and, for a predetermined number of iterations T, set η_t = 1/√T. Then

    min_{t ≤ T} l(w_t) ≤ (1 / (2√T)) (D^2 + G^2 - Ḡ^2) + l(w*).

Proof. Let η_t = η = 1/√T for every step t; the bound in Theorem 3.3 becomes

    Σ_{t=1}^{T} (l(w_t) - l(w*)) ≤ (1 / (2η)) (D^2 + (G^2 - Ḡ^2) η^2 T).

The desired bound is achieved after plugging in the specific value of η and dividing both sides by T.

It is clear that the fixed-step convergence rate of CSGD is upper bounded by that of SGD, whose bound is recovered by taking out the Ḡ^2 term.

4. REGRET ANALYSIS

Regret is the difference between the total loss and the optimal loss; it has been analyzed in most online algorithms for evaluating correctness and convergence.

4.1 Regret

Let G be the upper bound of ||g_t|| for any t from 1, ..., T. We have the following theorem.

Theorem 4.1 (Regret bound for CSGD). The regret of CSGD satisfies

    R_G(T) ≤ (G^2 / (2H)) (1 + log T),

where H is a constant. Therefore, lim sup_{T→∞} R_G(T) / T ≤ 0.

Proof. In Theorem 3.2, Eq.(16) shows that

    2 η_t (l(w_t) - l(w*)) ≤ ||w_t - w*||^2 - ||w_{t+1} - w*||^2 + η_t^2 ||g_t||^2 - ||w_{t+1} - ŵ_{t+1}||^2,    (19)

where ||w_{t+1} - ŵ_{t+1}|| = ||w_{t+1} - w*|| tan θ_t and θ_t ∈ [0, π/2), as shown in Figure 1. Therefore, dividing by η_t and summing Eq.(19) over t from 1 to T, we have

    2 Σ_{t=1}^{T} (l(w_t) - l(w*)) ≤ Σ_{t=2}^{T} ||w_t - w*||^2 (1/η_t - (1 + tan^2 θ_{t-1}) / η_{t-1}) + (1/η_1) ||w_1 - w*||^2 + G^2 Σ_{t=1}^{T} η_t.    (20)

By adding the dummy term (1/η_0)(1 + tan^2 θ_0) ||w_1 - w*||^2 = 0 on the right-hand side, we have

    2 Σ_{t=1}^{T} (l(w_t) - l(w*)) ≤ Σ_{t=1}^{T} ||w_t - w*||^2 (1/η_t - (1 + tan^2 θ_{t-1}) / η_{t-1}) + G^2 Σ_{t=1}^{T} η_t.    (21)

Note that tan θ_t does not decrease with the step t, as shown in Lemma 4.2, and η_t does not increase with the step t. Since tan^2 θ_1 / η_1 > 0, we assume the lower bound tan^2 θ_{t-1} / η_{t-1} ≥ H holds for all t ≥ 1 (see the footnote below). Then Eq.(21) can be rewritten as

    2 Σ_{t=1}^{T} (l(w_t) - l(w*)) ≤ Σ_{t=1}^{T} ||w_t - w*||^2 (1/η_t - 1/η_{t-1} - H) + G^2 Σ_{t=1}^{T} η_t.    (22)

Setting η_t = 1/(Ht) for all t ≥ 1 makes the first sum vanish, and we have

    2 Σ_{t=1}^{T} (l(w_t) - l(w*)) ≤ (G^2 / H)(1 + log T),    (23)

so R_G(T) = Σ_{t=1}^{T} (l(w_t) - l(w*)) ≤ (G^2 / (2H))(1 + log T).

Footnote: This inequality is based on the assumption that H is positive. Although this assumption could be slightly violated when tan θ_t = 0, i.e., if w_t lies on τ_t and (x_t, y_t) = (x̄_t, ȳ_t), this event rarely happens in real cases. Even if it happens a finite number of times, the validity of our analysis is still provable, so we simply rule out this rare event in the theoretical analysis.

When tan θ_t becomes small, the improvement from the constraint is not significant, as shown in Figure 1. Lemma 4.2 shows that tan θ_t does not decrease as t increases if ŵ_{t+1} and w_{t+1} are close to w*. This indicates that the improvement from ŵ_{t+1} to w_{t+1} is stable under the regression constraint, which further establishes the stability of the regret bound.

Lemma 4.2. tan θ_t = ||w_{t+1} - ŵ_{t+1}|| / ||w_{t+1} - w*|| does not decrease with the step t.

Proof. It is known that ŵ_{t+1} and w_{t+1} converge to w*. If w_{t+1} converges faster than ŵ_{t+1}, tan θ_t diverges and the lemma holds trivially. SGD has the convergence rate O(1/√T), the worst convergence rate among the stochastic optimization approaches considered here. Before we prove that CSGD has a faster convergence rate than SGD, we temporarily make the conservative assumption that CSGD and SGD both converge at the rate O(t^{-α}), where α is a positive number. Let ||ŵ_{t+1} - w*||^2 be a t^{-α} and ||w_{t+1} - w*||^2 be b t^{-α}. Along with Eq.(15), we have

    tan^2 θ_t = ||w_{t+1} - ŵ_{t+1}||^2 / ||w_{t+1} - w*||^2 = (||ŵ_{t+1} - w*||^2 - ||w_{t+1} - w*||^2) / ||w_{t+1} - w*||^2 = (a - b) / b = O(1).

In our approach, the O(log T) regret bound achieved by CSGD requires neither strong convexity nor regularization, whereas Hazan et al. achieve the same O(log T) regret bound under the assumption of strong convexity [10], and Bartlett et al. [3] use regularization to obtain the same O(log T) regret bound for strongly convex functions and O(√T) for arbitrary convex functions. Furthermore, the O(log T) regret bound of CSGD is better than the general regret bound O(√T) discussed in [30].

The regret bound suggests a decreasing step size, which yields the convergence rate stated in the following corollary.

Corollary 4.3 (Decreasing step convergence rate). Assume Theorem 4.1 holds and η_t = 1/(Ht) for any step t. Then

    min_{t ≤ T} l(w_t) ≤ (G^2 / (2HT)) (1 + log T) + l(w*).

This corollary is a direct result of Theorem 4.1. It shows that the O(log T / T) convergence rate of CSGD is much better than the O(1/√T) obtained by SGD.

4.2 Learning rate strategy

Theorem 4.1 and Corollary 4.3 suggest an optimal learning rate decreasing at the rate O(1/t), without assuming strong convexity of the objective function. However, a decay proportional to the inverse of the number of iterations is not robust to a wrong setting of the proportionality constant: a wrong constant typically leads to divergence in the first several iterations, or to convergence to a point far away from the optimum. Motivated by this problem, we propose a two-phase learning rate strategy, defined as

    η_t = η_0 / √t        for t < m,
    η_t = η_0 √m / t      for t ≥ m.

The step m is reached when the desired error tolerance ε of Proposition 2.2 is obtained, m = O(1/ε^2). The maximum value of m is the total size of the dataset, since the global mean is obtained after one pass over the data.
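A minimal sketch of the two-phase schedule above (the helper name and the continuity check are ours): the rate decays as η_0/√t until the batch mean is trusted at step m, then switches to the O(1/t) decay suggested by the regret analysis, with the √m factor making the two phases meet at t = m.

```python
import math

def two_phase_rate(t, m, eta0):
    """Two-phase learning rate: eta0 / sqrt(t) for t < m, then eta0 * sqrt(m) / t for t >= m."""
    if t < m:
        return eta0 / math.sqrt(t)
    return eta0 * math.sqrt(m) / t

# The schedule is continuous at the switch point t = m.
m, eta0 = 10000, 0.5
assert abs(two_phase_rate(m, m, eta0) - eta0 / math.sqrt(m)) < 1e-12
```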

5. EXPERIMENTS

5.1 Optimization study

In this section, we perform numerical simulations to systematically analyze the proposed algorithms and to empirically verify our theoretical results. Our optimization study covers 5 algorithms, the proposed CSGD and NCSGD plus SGD, ASGD and 2SGD, for comparison:

1) Stochastic Gradient Descent (SGD) (a.k.a. the Robbins-Monro algorithm [12]): SGD, which is also known as the Least Mean Squares approach for solving stochastic least squares, is chosen as the baseline algorithm.

2) Averaged Stochastic Gradient Descent (ASGD) (a.k.a. Polyak-Ruppert averaging [21]): ASGD performs the SGD update and returns the average point at each iteration. ASGD, as a state-of-the-art approach in first-order stochastic optimization, has achieved good performance [9, 2].

3) Constrained Stochastic Gradient Descent (CSGD): CSGD uses the two-phase learning rate strategy in order to achieve the proposed regret bound.

4) Naive Constrained Stochastic Gradient Descent (NCSGD): NCSGD is a naive version of CSGD that updates with the same learning rate as SGD, to illustrate that the optimization error of NCSGD is upper bounded by that of SGD at each step.

5) Second-Order Stochastic Gradient Descent (2SGD) [7]: 2SGD replaces the learning rate η_t with the inverse of the Hessian matrix, which yields Recursive Least Squares, a second-order stochastic least squares solution. Compared to the first-order approaches, 2SGD is considered the best possible approach using a full Hessian matrix.

The experiments for least squares optimization are conducted in two different settings: the strongly convex and non-strongly convex cases. The difference between strongly convex and non-strongly convex objectives has been extensively studied in convex optimization and machine learning with respect to the selection of the learning rate strategy [2, 24]. A learning rate decaying as the inverse of the number of samples has been theoretically suggested to achieve the optimal rate of convergence in the strongly convex case [10]; in practice, however, such a learning rate may decrease too fast and the iterate can get stuck too far away from the optimal solution [21]. To solve this problem, a slower decay η_t = η_0 t^{-α} with α ∈ (1/2, 1) has been proposed for the learning rate [2]. A (λ/2)||w||^2 regularization term can also be added to obtain a λ-strongly convex and uniformly Lipschitz objective [9, 1]. To ensure convergence, we safely set α = 1/2 [30] as a robust learning rate for all algorithms in our empirical study, which guarantees that all algorithms converge to a close vicinity of the optimal solution. In order to study the real convergence performance of the different approaches, we use prior knowledge of the spectrum of the Hessian matrix to obtain the best η_0. To achieve the regret bound in Theorem 4.1, the two-phase learning rate is utilized for CSGD, and m is set to half of the dataset size, n/2.

We generate n input samples with d dimensions i.i.d. from a uniform distribution between 0 and 1. The optimal coefficient w* is randomly drawn from a standard Gaussian distribution. The n data samples for our experiments are then constructed from the n input samples and the coefficient, with zero-mean Gaussian noise of variance 0.2. Two settings for least squares optimization are designed: 1) a low-dimension case with strong convexity, where n = 10,000 and d = 100; 2) a high-dimension case, where n = 5,000 and d = 5,000, with the smallest eigenvalue of the Hessian matrix close to 0, which yields a non-strongly convex case. In each iteration round, one sample is drawn from the corresponding dataset uniformly at random.

In the strongly convex case, as shown in the top row of Figure 2, CSGD behaves similarly to 2SGD and outperforms the other first-order approaches.
As we know, 2SGD, as a second-order approach, gives the best possible per-iteration progress among the compared approaches. However, 2SGD requires O(d^2) computation and space complexity in each iteration. CSGD performs like 2SGD but only requires O(d) computation and space complexity, and CSGD has performance comparable to 2SGD when optimizing within a given amount of CPU time, as shown in the top right slot of Figure 2.

In the non-strongly convex case, as shown in the bottom row of Figure 2, CSGD performs the best among all first-order approaches. 2SGD becomes impractical in this high-dimensional case due to its high computation complexity. CSGD has a slow start at the beginning because it adopts a larger initial learning rate η_0, which yields better convergence in the second phase; this is also consistent with the comparisons using wrong initial learning rates discussed in [2]. In addition, the non-strong convexity corresponds to an infinite ratio of the eigenvalues of the Hessian matrix, which significantly slows down SGD. NCSGD is not affected by this and still performs consistently better than SGD.

In our empirical study, we have observed: 1) NCSGD consistently performs better than SGD, and the experimental results verify Corollary 4.3; 2) CSGD performs very similarly to the second-order approach, which supports the regret bound discussed in Theorem 4.1; 3) CSGD performs the best among all the other state-of-the-art first-order approaches; 4) CSGD is the most competent algorithm for dealing with high-dimensional data among all the algorithms in our experiments.

[Figure 2: Comparison of the methods on the low-dimension case (top) and the high-dimension case (bottom, d = 5,000). Each row plots l(w) - l(w*) versus the number of samples and versus CPU time (seconds) for SGD, 2SGD, NCSGD, CSGD and ASGD.]

5.2 Classification extension

In the classification task, the least squares loss function plays an important role, and the optimization of least squares is the cornerstone of least squares based algorithms such as the Least Squares Support Vector Machine (LS-SVM) [26], Regularized Least-Squares Classification [22, 6], etc. In these algorithms, SGD is utilized by default during the optimization. It is well acknowledged that the optimization speed for least squares directly affects the performance of least squares based algorithms: a faster optimization procedure corresponds to fewer training iterations and less training time. Therefore, by replacing SGD with CSGD for the least squares optimization in many algorithms, the performance can be greatly improved. In this subsection, we show an example of how to adopt CSGD in the optimization of existing classification approaches.

One direct classification approach using the least squares loss is the ADAptive LINear Element (Adaline) [16], a well-known method in neural networks. Adaline adopts a simple perceptron-like system that accomplishes classification by modifying the coefficients to minimize the least squares error at every iteration. Note that, although a linear classifier may not achieve perfect classification, the direct optimization of least squares is commonly used as a subroutine in more complex algorithms, such as Multiple Adaline (Madaline) [16], which achieves non-linear separability by using multiple layers of Adalines. Since the fundamental optimization procedure of these least squares algorithms is the same, we only show the basic Adaline case to demonstrate that CSGD can improve the performance.

In Adaline, each neuron separates two classes using a coefficient vector w, and the equation of the separating hyper-plane can be derived from the coefficient vector. Specifically, to classify an input sample x_t, let net_t be the net input of this neuron, where net_t = w^T x_t. The output o_t of Adaline is 1 when net_t > 0, and o_t is -1 otherwise. The crucial part of training the Adaline is to obtain the best coefficient vector w, which is updated per iteration by minimizing the squared error. At iteration t, the squared error is (1/2)(y_t - net_t)^2, where y_t is 1 or -1, representing the positive or negative class respectively. Adaline adopts SGD for optimization, whose learning rule is given by

    w_{t+1} = w_t - η g_t,    (24)

where η is a constant learning rate and the gradient g_t = -(y_t - w_t^T x_t) x_t. When replacing SGD with CSGD for Adaline, Eq.(24) is replaced with Eq.(9).

In the multiclass classification case, suppose there are c classes. Adaline needs c neurons to perform the classification, and each neuron still performs binary class discrimination. The CSGD version of Adaline (C-Adaline) is depicted in Algorithm 2, which is straightforward and easy to implement. One thing to point out is that the class label y_t has to be rebuilt in order to fit the c neurons. Therefore, the class label y_t of sample x_t for neuron c_i is defined as y_{(t,c_i)} = 1 when y_t = i and y_{(t,c_i)} = 0 otherwise. The output o_t = k means that the k-th neuron c_k has the highest net value among the c neurons.

Algorithm 2: CSGD version of Adaline (C-Adaline)
  Initialize w_0 = x̄_0 = 0 and ȳ_0 = 0.
  for t = 1, 2, 3, ... do
    for i = 1, 2, 3, ..., c do
      Compute the gradient g_t ∈ ∂l(w_t, x_t, y_{(t,c_i)}).
      Compute (x̄_t, ȳ_{(t,c_i)}) with x̄_t = ((t-1)/t) x̄_{t-1} + (1/t) x_t and ȳ_{(t,c_i)} = ((t-1)/t) ȳ_{(t-1,c_i)} + (1/t) y_{(t,c_i)}.
      Compute w_{t+1} = w_t - η_t g_t - x̄_t (x̄_t^T (w_t - η_t g_t)) / ||x̄_t||^2 + r_t.
    end for
  end for

To evaluate the improvement of C-Adaline, we provide computational experiments of both Adaline and C-Adaline on the MNIST dataset [14], which is widely used as a stochastic optimization classification benchmark for handwritten digit recognition [9, 8].
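A compact sketch of Algorithm 2 follows (ours, with one weight vector per neuron, the one-vs-rest labels y_{(t,c_i)} built as described above, and prediction by the largest net value). The constant learning rate is a simplifying assumption; the experiments below use the NCSGD-style decaying rate.

```python
import numpy as np

def c_adaline_train(stream, d, num_classes, eta=0.0625):
    """C-Adaline sketch: CSGD updates for c one-vs-rest least squares neurons.
    `stream` yields (x, label) pairs; x is assumed to include a leading 1 as bias."""
    W = np.zeros((num_classes, d))     # one coefficient vector per neuron
    x_bar = np.zeros(d)
    y_bar = np.zeros(num_classes)      # per-neuron running mean of rebuilt labels
    for t, (x, label) in enumerate(stream, start=1):
        x_bar += (x - x_bar) / t
        nx = x_bar @ x_bar
        for i in range(num_classes):
            y_i = 1.0 if label == i else 0.0          # rebuilt label y_(t, c_i)
            y_bar[i] += (y_i - y_bar[i]) / t
            g = -(y_i - W[i] @ x) * x                 # least squares gradient
            v = W[i] - eta * g
            W[i] = v - x_bar * (x_bar @ v) / nx + y_bar[i] * x_bar / nx
    return W

def c_adaline_predict(W, x):
    """Output o_t: index of the neuron with the highest net value w_i^T x."""
    return int(np.argmax(W @ x))
```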

[Figure 3: Left: test error rate versus iteration on the MNIST classification. Right: test error rate versus CPU time on the MNIST classification. C-Adaline is the proposed CSGD version of Adaline.]

The MNIST dataset consists of 60,000 training samples and 10,000 test samples with 10 classes. Each digit is represented by a 28x28 gray-scale image, for a total of 784 features. All the data is rescaled to [0, 1] by dividing by the maximum pixel intensity of 255. For the setting of the learning rate, Adaline adopts a constant, while C-Adaline simply takes the updating rule of Naive Constrained Stochastic Gradient Descent (NCSGD). η for Adaline is set to 2^{-17}, and C-Adaline has η_0 = 2^{-4}; both are the optimal values selected from {2^{-20}, 2^{-19}, ..., 2^{-1}}. Since the optimal solution is unique, this experiment examines how fast Adaline and C-Adaline converge to this optimum.

The test set error rate as a function of the number of processed samples is shown in Figure 3 (left). It is clear that both Adaline and C-Adaline converge to the same test error, because they both optimize the least squares error. C-Adaline achieves a 0.14 test error after processing 2^{14} samples (about 1.6 x 10^4), while Adaline achieves a 0.14 test error after processing 2^{20} samples (about 10^6). This indicates that C-Adaline converges 64 times as fast as Adaline. Considering that the size of the training set is 60,000, C-Adaline uses about 1/4 of the total training samples to achieve the nearly optimal test error rate, while Adaline needs to visit each training sample more than 16 times to achieve the same test error rate. Figure 3 (right) shows the test error rate versus CPU time. To achieve the 0.14 test error, C-Adaline takes only 3.38 seconds, far less than Adaline. Note that, in this experiment, the 10 neurons are trained in parallel. It is another achievement to reach the nearly optimal test error of a least squares classifier in about 3 seconds on a 10^6-scale dataset.

To better understand the classification results, in Figure 4 we visualize the data samples in a two-dimensional space by t-SNE [27], a nonlinear mapping commonly used for exploring the inherent structure of high-dimensional data. Since a linear classifier does not perform well on this problem and both algorithms ultimately reach the same classification error, we suppress the samples that are still misclassified in Figure 4 for clarity's sake. When both algorithms have processed 2^{14} training samples, their classification results on the 10,000 test samples are depicted in Figure 4: Adaline misclassifies about 6 times as many samples as C-Adaline.

This experiment compares an SGD-based classifier (Adaline) and the proposed CSGD-improved version (C-Adaline) on the MNIST dataset. In summary, C-Adaline consistently has a better test error rate than the original Adaline during the optimization, and for a given test error rate, C-Adaline takes much less CPU time than Adaline.

6. CONCLUSION

In this paper, we analyze a new constraint-based stochastic gradient descent solution for the large-scale least squares problem. We provide theoretical justifications for the proposed methods, CSGD and NCSGD, which use the averaging hyper-plane as the projected hyper-plane. Specifically, we describe the convergence rates as well as the regret bounds for the proposed method. CSGD performs like a full second-order approach but with simpler computation than 2SGD. The optimal regret O(log T) is achieved by CSGD when adopting the corresponding learning rate strategy. The theoretical superiority is justified by the experimental results. In addition, it is easy to extend SGD-based least squares algorithms to CSGD, and the CSGD version can yield better performance.
An example of upgrading Adaline from SGD to CSGD is used to demonstrate the straightforward but efficient implementation of CSGD.

[Figure 4: Two-dimensional t-SNE visualization of the classification results for C-Adaline (left) and Adaline (right) on the MNIST dataset when 2^{14} samples have been processed.]

References

[1] A. Agarwal, S. Negahban, and M. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. Advances in Neural Information Processing Systems, 2012.
[2] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 2011.
[3] P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. Advances in Neural Information Processing Systems, 2007.
[4] L. E. Baum and M. Katz. Exponential convergence rates for the law of large numbers. Transactions of the American Mathematical Society, 1965.
[5] D. P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
[6] C. M. Bishop. Pattern recognition and machine learning. Springer-Verlag New York, 2006.
[7] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20, 2008.
[8] L. Bottou and Y. LeCun. Large scale online learning. Advances in Neural Information Processing Systems, 2003.
[9] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
[10] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.
[11] C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. Advances in Neural Information Processing Systems, 2009.
[12] H. J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications. Springer-Verlag, 2003.
[13] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[15] L. Ljung. Analysis of stochastic gradient algorithms for linear regression problems. IEEE Transactions on Information Theory, 30(2):151-160, 1984.
[16] K. Mehrotra, C. K. Mohan, and S. Ranka. Elements of artificial neural networks. MIT Press, 1997.
[17] A. Nemirovski. Efficient methods in convex programming. Lecture notes.
[18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[19] A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley and Sons, 1983.
[20] Y. Nesterov. Introductory lectures on convex optimization: A basic course (Applied Optimization). 2004.
[21] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
[22] R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. Nato Science Series Sub Series III: Computer and Systems Sciences, 190, 2003.
[23] S. Shalev-Shwartz and S. M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. Advances in Neural Information Processing Systems, 2008.
[24] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. International Conference on Machine Learning, 2008.
[25] J. C. Spall. Introduction to stochastic search and optimization. John Wiley and Sons, 2003.
[26] J. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 1999.
[27] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[28] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11, 2010.
[29] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. International Conference on Machine Learning, 2004.
[30] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. International Conference on Machine Learning, 2003.


More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales

The Asymptotic Behavior of Nonoscillatory Solutions of Some Nonlinear Dynamic Equations on Time Scales Advances in Dynamical Sysems and Applicaions. ISSN 0973-5321 Volume 1 Number 1 (2006, pp. 103 112 c Research India Publicaions hp://www.ripublicaion.com/adsa.hm The Asympoic Behavior of Nonoscillaory Soluions

More information

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information

Boosting with Online Binary Learners for the Multiclass Bandit Problem

Boosting with Online Binary Learners for the Multiclass Bandit Problem Shang-Tse Chen School of Compuer Science, Georgia Insiue of Technology, Alana, GA Hsuan-Tien Lin Deparmen of Compuer Science and Informaion Engineering Naional Taiwan Universiy, Taipei, Taiwan Chi-Jen

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Network Newton Distributed Optimization Methods

Network Newton Distributed Optimization Methods Nework Newon Disribued Opimizaion Mehods Aryan Mokhari, Qing Ling, and Alejandro Ribeiro Absrac We sudy he problem of minimizing a sum of convex objecive funcions where he componens of he objecive are

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems.

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems. di ernardo, M. (995). A purely adapive conroller o synchronize and conrol chaoic sysems. hps://doi.org/.6/375-96(96)8-x Early version, also known as pre-prin Link o published version (if available):.6/375-96(96)8-x

More information

The Optimal Stopping Time for Selling an Asset When It Is Uncertain Whether the Price Process Is Increasing or Decreasing When the Horizon Is Infinite

The Optimal Stopping Time for Selling an Asset When It Is Uncertain Whether the Price Process Is Increasing or Decreasing When the Horizon Is Infinite American Journal of Operaions Research, 08, 8, 8-9 hp://wwwscirporg/journal/ajor ISSN Online: 60-8849 ISSN Prin: 60-8830 The Opimal Sopping Time for Selling an Asse When I Is Uncerain Wheher he Price Process

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

A variational radial basis function approximation for diffusion processes.

A variational radial basis function approximation for diffusion processes. A variaional radial basis funcion approximaion for diffusion processes. Michail D. Vreas, Dan Cornford and Yuan Shen {vreasm, d.cornford, y.shen}@ason.ac.uk Ason Universiy, Birmingham, UK hp://www.ncrg.ason.ac.uk

More information

Morning Time: 1 hour 30 minutes Additional materials (enclosed):

Morning Time: 1 hour 30 minutes Additional materials (enclosed): ADVANCED GCE 78/0 MATHEMATICS (MEI) Differenial Equaions THURSDAY JANUARY 008 Morning Time: hour 30 minues Addiional maerials (enclosed): None Addiional maerials (required): Answer Bookle (8 pages) Graph

More information

Ordinary Differential Equations

Ordinary Differential Equations Ordinary Differenial Equaions 5. Examples of linear differenial equaions and heir applicaions We consider some examples of sysems of linear differenial equaions wih consan coefficiens y = a y +... + a

More information

Appendix to Creating Work Breaks From Available Idleness

Appendix to Creating Work Breaks From Available Idleness Appendix o Creaing Work Breaks From Available Idleness Xu Sun and Ward Whi Deparmen of Indusrial Engineering and Operaions Research, Columbia Universiy, New York, NY, 127; {xs2235,ww24}@columbia.edu Sepember

More information

ODEs II, Lecture 1: Homogeneous Linear Systems - I. Mike Raugh 1. March 8, 2004

ODEs II, Lecture 1: Homogeneous Linear Systems - I. Mike Raugh 1. March 8, 2004 ODEs II, Lecure : Homogeneous Linear Sysems - I Mike Raugh March 8, 4 Inroducion. In he firs lecure we discussed a sysem of linear ODEs for modeling he excreion of lead from he human body, saw how o ransform

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time. Supplemenary Figure 1 Spike-coun auocorrelaions in ime. Normalized auocorrelaion marices are shown for each area in a daase. The marix shows he mean correlaion of he spike coun in each ime bin wih he spike

More information

Experiments on logistic regression

Experiments on logistic regression Experimens on logisic regression Ning Bao March, 8 Absrac In his repor, several experimens have been conduced on a spam daa se wih Logisic Regression based on Gradien Descen approach. Firs, he overfiing

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

4.1 Other Interpretations of Ridge Regression

4.1 Other Interpretations of Ridge Regression CHAPTER 4 FURTHER RIDGE THEORY 4. Oher Inerpreaions of Ridge Regression In his secion we will presen hree inerpreaions for he use of ridge regression. The firs one is analogous o Hoerl and Kennard reasoning

More information

The L p -Version of the Generalized Bohl Perron Principle for Vector Equations with Infinite Delay

The L p -Version of the Generalized Bohl Perron Principle for Vector Equations with Infinite Delay Advances in Dynamical Sysems and Applicaions ISSN 973-5321, Volume 6, Number 2, pp. 177 184 (211) hp://campus.ms.edu/adsa The L p -Version of he Generalized Bohl Perron Principle for Vecor Equaions wih

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Inventory Control of Perishable Items in a Two-Echelon Supply Chain

Inventory Control of Perishable Items in a Two-Echelon Supply Chain Journal of Indusrial Engineering, Universiy of ehran, Special Issue,, PP. 69-77 69 Invenory Conrol of Perishable Iems in a wo-echelon Supply Chain Fariborz Jolai *, Elmira Gheisariha and Farnaz Nojavan

More information

DEPARTMENT OF STATISTICS

DEPARTMENT OF STATISTICS A Tes for Mulivariae ARCH Effecs R. Sco Hacker and Abdulnasser Haemi-J 004: DEPARTMENT OF STATISTICS S-0 07 LUND SWEDEN A Tes for Mulivariae ARCH Effecs R. Sco Hacker Jönköping Inernaional Business School

More information

FITTING OF A PARTIALLY REPARAMETERIZED GOMPERTZ MODEL TO BROILER DATA

FITTING OF A PARTIALLY REPARAMETERIZED GOMPERTZ MODEL TO BROILER DATA FITTING OF A PARTIALLY REPARAMETERIZED GOMPERTZ MODEL TO BROILER DATA N. Okendro Singh Associae Professor (Ag. Sa.), College of Agriculure, Cenral Agriculural Universiy, Iroisemba 795 004, Imphal, Manipur

More information

Comparing Means: t-tests for One Sample & Two Related Samples

Comparing Means: t-tests for One Sample & Two Related Samples Comparing Means: -Tess for One Sample & Two Relaed Samples Using he z-tes: Assumpions -Tess for One Sample & Two Relaed Samples The z-es (of a sample mean agains a populaion mean) is based on he assumpion

More information

Particle Swarm Optimization Combining Diversification and Intensification for Nonlinear Integer Programming Problems

Particle Swarm Optimization Combining Diversification and Intensification for Nonlinear Integer Programming Problems Paricle Swarm Opimizaion Combining Diversificaion and Inensificaion for Nonlinear Ineger Programming Problems Takeshi Masui, Masaoshi Sakawa, Kosuke Kao and Koichi Masumoo Hiroshima Universiy 1-4-1, Kagamiyama,

More information