Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms

Size: px
Start display at page:

Download "Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms"

Transcription

1 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Jialei Wag Li Xiao Abstract We cosider empirical risk miimizatio of liear predictors with covex loss fuctios. Such problems ca be reformulated as covex-cocave saddle poit problems ad thus are well suitable for primal-dual first-order algorithms. However primal-dual algorithms ofte require explicit strogly covex regularizatio i order to obtai fast liear covergece ad the required dual proximal mappig may ot admit closedform or efficiet solutio. I this paper we develop both batch ad radomized primal-dual algorithms that ca exploit strog covexity from data adaptively ad are capable of achievig liear covergece eve without regularizatio. We also preset dual-free variats of the adaptive primal-dual algorithms that do ot require computig the dual proximal mappig which are especially suitable for logistic regressio.. Itroductio We cosider the problem of regularized empirical risk miimizatio ERM of liear predictors. Leta...a R d be the feature vectors of data samples φ i : R R be a covex loss fuctio associated with the liear predictio a T i x for i =... ad g : Rd R be a covex regularizatio fuctio for the predictorx R d. ERM amouts to solvig the followig covex optimizatio problem: { mi Px def = } x R d i= φ ia T i xgx. Examples of the above formulatio iclude may wellkow classificatio ad regressio problems. For biary classificatio each feature vectora i is associated with a label b i {±}. I particular logistic regressio is obtaied by settig φ i z = logexp b i z. For liear regressio problems each feature vector a i is associated with a Departmet of Computer Sciece The Uiversity of Chicago Chicago Illiois USA. Microsoft Research Redmod Washigto 9805 USA. Correspodece to: Jialei Wag <jialei@uchicago.edu> Li Xiao <li.xiao@microsoft.com>. depedet variable b i R ad φ i z = /z b i. The we get ridge regressio with gx = λ/ x ad elastic et withgx = λ x λ / x. LetA = [a...a ] T be the data matrix. Throughout this paper we make the followig assumptios: Assumptio. The fuctios φ i g ad matrixasatisfy: Each φ i is δ-strogly covex ad /-smooth where > 0 ad δ 0 ad δ ; g isλ-strogly covex where λ 0; λδµ > 0 where µ = λ mi A T A. The strog covexity ad smoothess metioed above are with respect to the stadard Euclidea orm deoted as x = x T x. See e.g. Nesterov 004 Sectios.. ad..3 for the exact defiitios. Let R = max i { a i } ad assumig λ > 0 the R /λ is a popular defiitio of coditio umber for aalyzig complexities of differet algorithms. The last coditio above meas that the primal objective fuctio Px is strogly covex eve ifλ = 0. There have bee extesive research activities i recet years o developig efficietly algorithms for solvig problem. A broad class of radomized algorithms that exploit the fiite sum structure i the ERM problem have emerged as very competitive both i terms of theoretical complexity ad practical performace. They ca be put ito three categories: primal dual ad primal-dual. Primal radomized algorithms work with the ERM problem directly. They are moder versios of radomized icremetal gradiet methods e.g. Bertsekas 0; Nedic & Bertsekas 00 equipped with variace reductio techiques. Each iteratio of such algorithms oly process oe data poit a i with complexity Od. They icludes SAG Roux et al. 0 SAGA Defazio et al. 04 ad SVRG Johso & Zhag 03; Xiao & Zhag 04 which all achieve the iteratio complexity O R /λlog/ǫ to fid a ǫ- optimal solutio. I fact they are capable of exploitig the strog covexity from data meaig that the coditio umber R /λ i the complexity ca be replaced by the more favorable oer /λδµ /. This improvemet ca be achieved without explicit kowledge of µ from data.

2 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Dual algorithms solve Fechel dual of by maximizig Dy def = i= φ i y i g i= y ia i usig radomized coordiate ascet algorithms. Here φ i ad g deotes the cojugate fuctios of φ i ad g. They iclude SDCA Shalev-Shwartz & Zhag 03 Nesterov 0 ad Richtárik & Takáč 04. They have the same complexity O R /λlog/ǫ but are hard to exploit strog covexity from data. Primal-dual algorithms solve the covex-cocave saddle poit problem mi x max y Lxy where Lxy def = i= yi a i x φ i y i gx. 3 I particular SPDC Zhag & Xiao 05 achieves a accelerated liear covergece rate with iteratio complexity O R/ λlog/ǫ which is better tha the aforemetioed o-accelerated complexity whe R /λ >. La & Zhou 05 developed dual-free variats of accelerated primal-dual algorithms but without cosiderig the liear predictor structure i ERM. Balamuruga & Bach 06 exteded SVRG ad SAGA to solvig saddle poit problems. Accelerated primal ad dual radomized algorithms have also bee developed. Nesterov 0 Fercoq & Richtárik 05 ad Li et al. 05b developed accelerated coordiate gradiet algorithms which ca be applied to solve the dual problem. Alle-Zhu 06 developed a accelerated variat of SVRG. Acceleratio ca also be obtaied usig the Catalyst framework Li et al. 05a. They all achieve the same O R/ λlog/ǫ complexity. A commo feature of accelerated algorithms is that they require good estimate of the strog covexity parameter. This makes hard for them to exploit strog covexity from data because the miimum sigular valueµof the data matrixais very hard to estimate i geeral. I this paper we show that primal-dual algorithms are capable of exploitig strog covexity from data if the algorithm parameters such as step sizes are set appropriately. While these optimal settig depeds o the kowledge of the covexity parameterµfrom the data we develop adaptive variats of primal-dual algorithms that ca tue the parameter automatically. Such adaptive schemes rely critically o the capability of evaluatig the primal-dual optimality gaps by primal-dual algorithms. A major disadvatage of primal-dual algorithms is that the required dual proximal mappig may ot admit closedform or efficiet solutio. We follow the approach of La & Zhou 05 to derive dual-free variats of the primal-dual algorithms customized for ERM problems with the liear predictor structure ad show that they ca also exploit strog covexity from data with correct choices of parameters or usig a adaptatio scheme. Algorithm Batch Primal-Dual BPD Algorithm iput: parameters τ θ iitial poit x 0 = x 0 y 0 for t = 0... do y t = prox f y t A x t x t = prox τg x t τa T y t x t = x t θx t x t ed for. Batch primal-dual algorithms Before divig ito radomized primal-dual algorithms we first cosider batch primal-dual algorithms which exhibit similar properties as their radomized variats. To this ed we cosider a batch versio of the ERM problem mi x R d { Px def = faxgx }. 4 where A R d ad make the followig assumptio: Assumptio. The fuctios f g ad matrixasatisfy: f is δ-strogly covex ad /-smooth where > 0 ad δ 0 ad δ ; g isλ-strogly covex where λ 0; λδµ > 0 where µ = λ mi A T A. For exact correspodece with problem we have fz = i= φ iz i with z i = a T i x. Uder Assumptio the fuctio fz is δ/-strogly covex ad /-smooth ad fax is δµ /-strogly covex ad R /-smooth. However such correspodeces aloe are ot sufficiet to exploit the structure of i.e. substitutig them ito the batch algorithms of this sectio will ot produce the efficiet algorithms for solvig problem that we will preset i Sectios 3 ad 4.. So we do ot make such correspodeces explicit i this sectio. Rather treat them as idepedet assumptios with the same otatio. Usig cojugate fuctios we ca derive the dual of 4 as max y R { Dy def = f y g A T y } 5 ad the covex-cocave saddle poit formulatio is { def mi max Lxy = gxy T Ax f y }. 6 x R d y R We cosider the primal-dual first-order algorithm proposed by Chambolle & Pock 0; 06 for solvig the saddle poit problem 6 which is give as Algorithm. Here we call it the batch primal-dual BPD algorithm. Assumig that f is smooth ad g is strogly covex Chambolle & Pock 0; 06 showed that Algorithm achieves accelerated liear covergece rate if λ > 0. However they did ot cosider the case where additioal or the sole source of strog covexity comes from fax.

3 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms I the followig theorem we show how to set the parameters τ ad θ to exploit both sources of strog covexity to achieve fast liear covergece. Theorem. Suppose Assumptio holds ad x y is the uique saddle poit ofldefied i 6. LetL = A = λmax A T A. If we set the parameters i Algorithm as = L λδµ τ = L λδµ 7 ad θ = max{θ x θ y } where θ x = δ µ δ L τλ θ y = / 8 the we have τ λ x t x 4 yt y θ t C Lx t y Lx y t θ t C wherec = τ λ x 0 x 4 y 0 y. The proof of Theorem is give i Appedices B ad C. Here we give a detailed aalysis of the covergece rate. Substitutig ad τ i 7 ito the expressios for θ y ad θ x i 8 ad assumig λδµ L we have θ x δµ L λδµ L δ λ θ y = λδµ /L λδµ L. L λδµ Sice the overall coditio umber of the problem is L λδµ it is clear that θ y is a accelerated covergece rate. Next we examie θ x i two special cases. The case of δµ = 0but λ > 0. I this case we have τ = L λ ad = λ L ad thus θ x = λ/l λ L θ y= λ/l λ L. Therefore we have θ = max{θ x θ y } λ L. This ideed is a accelerated covergece rate recoverig the result of Chambolle & Pock 0; 06. The case of λ = 0 butδµ > 0. τ = Lµ δ ad = µ δ L ad I this case we have θ x = δµ L δµ/lδ θ y δµ L. L Notice that δ µ is the coditio umber of fax. Next we assume µ L ad examie how θ x varies withδ. Ifδ µ L meaig f is badly coditioed the θ x δµ L 3 δµ/l = δµ 3L. Because the overall coditio umber is L δ µ this is a accelerated liear rate ad so isθ = max{θ x θ y }. Algorithm Adaptive Batch Primal-Dual Ada-BPD iput: problem costats λ δlad ˆµ > 0 iitial poit x 0 y 0 ad adaptatio period T. Compute τ ad θ as i 7 ad 8 usigµ = ˆµ for t = 0... do y t = prox f y t A x t x t = prox τg x t τa T y t x t = x t θx t x t if modtt == 0 the τθ = BPD-Adapt {P s D s } t s=t T ed if ed for Ifδ µ L meaig f is mildly coditioed the θ x µ3 µ L 3 µ/l 3/ µ/l L. This represets a half-accelerated rate because the overall coditio umber is L δ µ L3 µ. 3 Ifδ = i.e.f is a simple quadratic fuctio the θ x µ µ L µ/l L. This rate does ot have acceleratio because the overall coditio umber is L δ µ L µ. I summary the extet of acceleratio i the domiatig factorθ x which determiesθ depeds o the relative size of δ ad µ /L i.e. the relative coditioig betwee the fuctio f ad the matrix A. I geeral we have full acceleratio if δ µ /L. The theory predicts that the acceleratio degrades as the fuctio f gets better coditioed. However i our umerical experimets we ofte observe acceleratio eve ifδ gets closer to. As explaied i Chambolle & Pock 0 Algorithm is equivalet to a precoditioed ADMM. Deg & Yi 06 characterized coditios for ADMM to obtai liear covergece without assumig both parts of the objective fuctio beig strogly covex but they did ot derive covergece rate for this case... Adaptive batch primal-dual algorithms I practice it is ofte very hard to obtai good estimate of the problem-depedet costats especially µ = λmi A T A i order to apply the algorithmic parameters specified i Theorem. Here we explore heuristics that ca eable adaptive tuig of such parameters which ofte lead to much improved performace i practice. A key observatio is that the covergece rate of the BPD algorithm chages mootoically with the overall strog covexity parameter λ δµ regardless of the extet of 3

4 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Algorithm 3 BPD-Adapt simple heuristic iput: previous estimate ˆµ adaptio periodt primal ad dual objective values {P s D s } t s=t T ifp t D t < θ T P t T D t T the ˆµ := ˆµ else ˆµ := ˆµ/ ed if Compute τ ad θ as i 7 ad 8 usigµ = ˆµ output: ew parameters τθ acceleratio. I other words the larger λ δµ is the faster the covergece. Therefore if we ca moitor the progress of the covergece ad compare it with the predicted covergece rate i Theorem the we ca adjust the algorithmic parameters to exploit the fastest possible covergece. More specifically if the observed covergece is slower tha the predicted covergece rate the we should reduce the estimate ofµ; if the observed covergece is better tha the predicted rate the we ca try to icrease µ for eve faster covergece. We formalize the above reasoig i a Adaptive BPD Ada-BPD algorithm described i Algorithm. This algorithm maitais a estimate ˆµ of the true costatµ ad adjust it every T iteratios. We use P t ad D t to represet the primal ad dual objective values at Px t ad Dy t respectively. We give two implemetatios of the tuig procedure BPD-Adapt: Algorithm 3 is a simple heuristic for tuig the estimate ˆµ where the icreasig ad decreasig factor ca be chaged to other values larger tha ; Algorithm 4 is a more robust heuristic. It does ot rely o the specific covergece rate θ established i Theorem. Istead it simply compares the curret estimate of objective reductio rate ˆρ with the previous estimate ρ θ T. It also specifies a o-tuig rage of chages iρ specified by the iterval [cc]. Oe ca also devise more sophisticated schemes; e.g. if we estimate that δµ < λ the o more tuig is ecessary. The capability of accessig both the primal ad dual objective values allows primal-dual algorithms to have good estimate of the covergece rate which eables effective tuig heuristics. Automatic tuig of primal-dual algorithms have also bee studied by e.g. Malitsky & Pock 06 ad Goldstei et al. 03 but with differet goals. Fially we ote that Theorem oly establishes covergece rate for the distace to the optimal poit ad the quatity Lx t y Lx y t which is ot quite the duality gappx t Dy t. Nevertheless same covergece rate ca also be established for the duality gap see Algorithm 4 BPD-Adapt robust heuristic iput: previous rate estimateρ > 0 = δˆµ period T costats c < ad c > ad {P s D s } t s=t T Compute ew rate estimate ˆρ = Pt D t P t T D t T if ˆρ cρ the := ρ := ˆρ else if ˆρ cρ the := / else := ed if λ ρ := ˆρ λ = L τ = L Compute θ usig 8 or setθ = output: ew parameters τθ Zhag & Xiao 05 Sectio. which ca be used to better justify the adaptio procedure. 3. Radomized primal-dual algorithm I this sectio we come back to the ERM problem which have a fiite sum structure that allows the developmet of radomized primal-dual algorithms. I particular we exted the stochastic primal-dual coordiate SPDC algorithm Zhag & Xiao 05 to exploit the strog covexity from data i order to achieve faster covergece rate. First we show that by settig algorithmic parameters appropriately the origial SPDC algorithm may directly beefit from strog covexity from the loss fuctio. We ote that the SPDC algorithm is a special case of the Adaptive SPDC Ada-SPDC algorithm preseted i Algorithm 5 by settig the adaptio period T = ot performig ay adaptio. The followig theorem is proved i Appedix E. Theorem. Suppose Assumptio holds. Let x y be the saddle poit of the fuctio L defied i 3 ad R = max{ a... a }. If we set T = i Algorithm 5 o adaptio ad let τ = 4R λδµ = 4R ad θ = max{θ x θ y } where θ x = τδµ 4δ λδµ 9 τλ θ y = // / 0 the we have τ [ λ E x t x ] 4 E[ y t y ] θ t C E [ Lx t y Lx y t ] θ t C wherec = τ λ x 0 x 4 y 0 y. The expectatio E[ ] is take with respect to the history of radom idices draw at each iteratio. 4

5 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Algorithm 5 Adaptive SPDC Ada-SPDC iput: parameters τ θ > 0 iitial poit x 0 y 0 ad adaptatio period T. Set x 0 = x 0 for t = 0... do pick k {...} uiformly at radom for i {...} do ifi == k the y t k = prox φ k y t k at k xt else y t i = y t i ed if ed for x t = prox τg x t τ u t y t u t = u t yt k y t k a k x t = x t θx t x t k y t k a k if modtt = 0 the τθ = SPDC-Adapt {P t s D t s } T s=0 ed if ed for Below we give a detailed discussio o the expected covergece rate established i Theorem. The cases of µ = 0 but λ > 0. τ = 4R λ ad = λ 4R ad θ x = τλ = 4R /λ I this case we have θ y = // / = 8R /λ. Hece θ = θ y. These recover the parameters ad covergece rate of the stadard SPDC Zhag & Xiao 05. The cases of µ > 0 but λ = 0. τ = 4Rµ δ ad = µ δ 4R ad θ x = τδµ δµ 4δ = θ y = 8R/µ δµ δ 8R I this case we have 3R δµ/4r4δ. δµ. 8R Sice the objective is R /-smooth ad δµ /-strogly covex θ y is a accelerated rate if δµ 8R otherwise θ y. For θ x we cosider differet situatios: If µ R the we have θ x δµ R which is a accelerated rate. So isθ = max{θ x θ y }. If µ < R ad δ µ R the θ x δµ R which represets accelerated rate. The iteratio complexity of SPDC is which is better tha that of Õ R µ δ SVRG i this case which isõ R δµ. Ifµ < R adδ µ R the we getθ x µ R. This is a half-accelerated rate because i this case SVRG would requireõr3 µ iteratios while iteratio complexity here isõr µ 3. If µ < R ad δ meaig the φ i s are well coditioed the we get θ x δµ R µ R which is a o-accelerated rate. The correspodig iteratio complexity is the same as SVRG. 3.. Parameter adaptatio for SPDC The SPDC-Adapt procedure called i Algorithm 5 follows the same logics as the batch adaptio schemes i Algorithms 3 ad 4 ad we omit the details here. Oe thig we emphasize here is that the adaptatio period T is i terms of epochs or umber of passes over the data. I additio we oly compute the primal ad dual objective values after each pass or every few passes because computig them exactly usually eed to take a full pass of the data. Aother importat issue is that ulike the batch case where the duality gap usually decreases mootoically the duality gap for radomized algorithms ca fluctuate wildly. So istead of usig oly the two ed valuesp t T D t T ad P t D t we ca use more poits to estimate the covergece rate through a liear regressio. Suppose the primal-dual values at the ed of each pastt passes are {P0D0}{PD}...{PTDT} ad we eed to estimateρrate per pass such that Pt Dt ρ t P0 D0 t =...T. We ca tur it ito a liear regressio problem after takig logarithm ad obtai the estimate ˆρ through T Pt Dt logˆρ = T t=tlog P0 D0. The rest of the adaptio procedure ca follow the robust scheme i Algorithm 4. I practice we ca compute the primal-dual values more sporadically say every few passes ad modify the regressio accordigly. 4. Dual-free Primal-dual algorithms Compared with primal algorithms oe major disadvatage of primal-dual algorithms is the requiremet of computig the proximal mappig of the dual fuctiof orφ i which may ot admit closed-formed solutio or efficiet computatio. This is especially the case for logistic regressio oe of the most popular loss fuctios used i classificatio. La & Zhou 05 developed dual-free variats of primal-dual algorithms that avoid computig the dual proximal mappig. Their mai techique is to replace the Euclidea distace i the dual proximal mappig with a Bregma divergece defied over the dual loss fuctio itself. 5

6 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Algorithm 6 Dual-Free BPD Algorithm iput: parameters τ θ > 0 iitial poit x 0 y 0 Set x 0 = x 0 ad v 0 = f y 0 for t = 0... do v t = vt A x t y t = f v t x t = prox τg x t τa T y t x t = x t θx t x t ed for We show how to apply this approach to solve the structured ERM problems cosidered i this paper. They ca also exploit strog covexity from data if the algorithmic parameters are set appropriately or adapted automatically. 4.. Dual-free BPD algorithm First we cosider the batch settig. We replace the dual proximal mappig computigy t i Algorithm with y t { =argmi f y y T A x t Dyyt } y where D is the Bregma divergece of a strictly covex kerel fuctio h defied as D h yy t = hy hy t hy t y y t. Algorithm is obtaied i the Euclidea settig with hy = y ad Dyy t = y yt. While our covergece results would apply for arbitrary Bregma divergece we oly focus o the case of usig f itself as the kerel because this allows us to computey t i very efficietly. The followig lemma explais the details Cf. La & Zhou 05 Lemma. Lemma. Let the kerel h f i the Bregma divergeced. If we costruct a sequece of vectors{v t } such that v 0 = f y 0 ad for allt 0 v t = vt A x t the the solutio to problem isy t = f v t. Proof. Suppose v t = f y t true fort = 0 the Dyy t = f y f y t v tt y y t. The solutio to ca be writte as { y t = argmi f y y T A x t f y v tt y } y { = argmi f y } A x t vt T y y = argmax y = argmax y { T v t A x t y f y} } { v tt y f y = f v t where i the last equality we used the property of cojugate fuctio whe f is strogly covex ad smooth. Moreover v t = f y t = f y t which completes the proof. Accordig to Lemma we oly eed to provide iitial poits such that v 0 = f y 0 is easy to compute. We do ot eed to compute f y t directly for ay t > 0 because it is ca be updated as v t i. Cosequetly we ca updatey t i the BPD algorithm usig the gradiet f v t without the eed of dual proximal mappig. The resultig dual-free algorithm is give i Algorithm 6. La & Zhou 05 cosidered a geeral settig which does ot possess the liear predictor structure we focus o i this paper ad assumed that oly the regularizatio g is strogly covex. Our followig result shows that dualfree primal-dual algorithms ca also exploit strog covexity from data with appropriate algorithmic parameters. Theorem 3. Suppose Assumptio holds ad let x y be the uique saddle poit of L defied i 6. If we set the parameters i Algorithm 6 as τ = L λδµ = L λδµ 3 ad θ = max{θ x θ y } where θ x = τδµ 4 τλ θ y = / 4 the we have τ λ x t x Dy y t θ t C Lx t y Lx y t θ t C where C = τ λ x 0 x Dy y 0. Theorem 3 is proved i Appedices B ad D. Assumig λδµ L we have θ x δµ 6L λ λδµ L λδµ θ y 4L. Agai we gai isights by cosider the special cases: If δµ = 0 ad λ > 0 the θ y λ 4L ad θ x λ L. So θ = max{θ xθ y } is a accelerated rate. If δµ > 0 ad λ = 0 the θ y δµ 4L ad θ x δµ 6L. Thus θ = max{θ x θ y } δµ 6L is ot accelerated. Notice that this coclusio does ot depeds o the relative size ofδ adµ /L ad this is the major differece from the Euclidea case discussed i Sectio. If both δµ > 0 ad λ > 0 the the extet of acceleratio depeds o their relative size. If λ is o the same order as δµ or larger the accelerated rate is obtaied. Ifλis much smaller tha δµ the the theory predicts o acceleratio. 6

7 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Algorithm 7 Adaptive Dual-Free SPDC ADF-SPDC iput: parameters τ θ > 0 iitial poit x 0 y 0 ad adaptatio period T. Set x 0 = x 0 ad v 0 i = φ i y 0 i fori =... for t = 0... do pick k {...} uiformly at radom for i {...} do ifi == k the else v t k v t i = vt k at k xt y t k = φ k vt k = v t i ed if ed for x t = prox τg y t i = y t i x t τ u t y t u t = u t yt k y t k a k x t = x t θx t x t k y t k a k if modtt = 0 the τθ = SPDC-Adapt {P t s D t s } T s=0 ed if ed for 4.. Dual-free SPDC algorithm The same approach ca be applied to derive a Dualfree SPDC algorithm which is described i Algorithm 7. It also icludes a parameter adaptio procedure so we call it the adaptive dual-free SPDC ADF-SPDC algorithm. O related work Shalev-Shwartz & Zhag 06 ad Shalev-Shwartz 06 itroduced dual-free SDCA. The followig theorem characterizes the choice of algorithmic parameters that ca exploit strog covexity from data to achieve liear covergece proof give i Appedix F. Theorem 4. Suppose Assumptio holds. Let x y be the saddle poit of L defied i 3 ad R = max{ a... a }. If we set T = i Algorithm 7 o adaptio ad let = 4R λδµ τ = 4R ad θ = max{θ x θ y } where θ x = τδµ 4 λδµ 5 τλ θ y = // / 6 the we have τ λ E [ x t x ] 4 E[ Dy y t ] θ t C E [ Lx t y Lx y t ] θ t C where C = τ λ x 0 x Dy y 0. Below we discuss the expected covergece rate established i Theorem i two special cases. The cases of µ = 0 but λ > 0. τ = 4R λ ad = 4R λ ad θ x = τλ = 4R /λ I this case we have θ y = // / = 8R /λ. These recover the covergece rate of the stadard SPDC algorithm Zhag & Xiao 05. The cases of µ > 0 but λ = 0. I this case we have τ = 4Rµ δ = 4R µ δ ad θ x = τδµ δµ 4 = 3R δµ/4r4 θ y = // / = 8R/µ δ. We ote that the primal fuctio ow is R /-smooth ad δµ /-strogly covex. We discuss the followig cases: If δµ > R the we have θ x δµ 8R ad θ y. Therefore θ = max{θ xθ y }. Otherwise we have θ x δµ 64R ad θ y is of the same order. This is ot a accelerated rate ad we have the same iteratio complexity as SVRG. Fially we give cocrete examples of how to compute the iitial poits y 0 ad v 0 such that v 0 i = φ i y 0 i. For squared loss φ i α = α b i ad φ i β = β b i β. So v 0 i = φ i y 0 i = y 0 i b i. For logistic regressio we have b i { } ad φ i α = log e biα. The cojugate fuctio is φ i β = b iβlog b i βb i βlogb i β if b i β [ 0] ad otherwise. We ca choose y 0 i = b i ad v 0 i =0 such that v 0 i =φ i y 0 i. For logistic regressio we have δ = 0 over the full domai of φ i. However each φ i is locally strogly covex i bouded domai Bach 04: if z [ B B] the we kow δ = mi z φ i z exp B/4. Therefore it is well suitable for a adaptatio scheme similar to Algorithm 4 that do ot require kowledge of either δ or µ. 5. Prelimiary experimets We preset prelimiary experimets to demostrate the effectiveess of our proposed algorithms. First we cosider batch primal-dual algorithms for ridge regressio over a sythetic dataset. The data matrix A has sizes = 5000 ad d = 3000 ad its etries are sampled from multivariate ormal distributio with mea zero ad covariace matrix Σ ij = i j /. We ormalize all datasets 7

8 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Primal optimality gap Primal AG BPD Opt-BPD Ada-BPD sythetic λ = / sythetic λ = 0 / sythetic λ = 0 4 / Figure. Compariso of batch primal-dual algorithms for a ridge regressio problem with = 5000 add = such that a i = a i /max j a j to esure the maximum orm of the data poits is. We use l -regularizatio gx = λ/ x with three choices of parameterλ: / 0 / ad 0 4 / which represet the strog medium ad weak levels of regularizatio respectively. Figure shows the performace of four differet algorithms: the accelerated gradiet algorithm for solvig the primal miimizatio problem Primal AG Nesterov 004 usig λ as strog covexity parameter the BPD algorithm Algorithm that usesλas the strog covexity parameter settig µ = 0 the optimal BPD algorithm Opt- BPD that uses µ = λ mi A T A explicitly computed from data ad the Ada-BPD algorithm Algorithm with the robust adaptatio heuristic Algorithm 4 with T = 0 c = 0.95 ad c =.5. As expected the performace of Primal-AG is very similar to BPD with the same strog covexity parameter. The Opt-BPD fully exploits strog covexity from data thus has the fastest covergece. The Ada-BPD algorithm ca partially exploit strog covexity from data without kowledge ofµ. Next we compare the DF-SPDC Algorithm 5 without adaptio ad ADF-SPDC Algorithm 7 with adaptio agaist several state-of-the-art radomized algorithms for ERM: SVRG Johso & Zhag 03 SAGA Defazio et al. 04 Katyusha Alle-Zhu 06 ad the stadard SPDC method Zhag & Xiao 05. For SVRG ad Katyusha a accelerated variat of SVRG we choose the variace reductio period asm =. The step sizes of all algorithms are set as their origial paper suggested. For Ada-SPDC ad ADF-SPDC we use the robust adaptatio scheme witht = 0 c = 0.95 ad c =.5. We first compare these radomized algorithms for ridge regressio over the same sythetic data described above ad thecpuact data from the LibSVM website. The results are show i Figure. With relatively strog regularizatio λ = / all methods perform similarly as predicted by theory. For the sythetic dataset With λ = 0 / the regularizatio is weaker but still stroger tha the hidde strog covexity from data so the accelerated algorithms all variats of SPDC ad Katyusha perform better tha SVRG ad SAGA. With λ = 0 4 / it looks that the strog covexity from data domiates the regularizatio. Sice the o-accelerated algorithms SVRG ad SAGA may automatically exploit strog covexity from data they become faster tha the o-adaptive accelerated methods Katyusha SPDC ad DF-SPDC. The adaptive accelerated method ADF-SPDC has the fastest covergece. This shows that our theoretical results which predict o acceleratio i this case ca be further improved. Fially we compare these radomized algorithm for logistic regressio o the rcv dataset from LibSVM website ad aother sythetic dataset with = 5000 ad d = 500 geerated similarly as before but with covariace matrix Σ ij = i j /00. For the stadard SPDC we solve the dual proximal mappig usig a few steps of Newto s method to high precisio. The dual-free SPDC algorithms oly use gradiets of the logistic fuctio. The results are preseted i Figure 3. for both datasets the strog covexity from data is very weak or oe so the accelerated algorithms performs better. 6. Coclusios We have show that primal-dual first-order algorithms are capable of exploitig strog covexity from data if the algorithmic parameters are chose appropriately. While they may depeds o problem depedet costats that are ukow we developed heuristics for adaptig the parameters o the fly ad obtaied improved performace i experimets. It looks that our theoretical characterizatio of the covergece rates ca be further improved as our experimets ofte demostrate sigificat acceleratio i cases where our theory does ot predict acceleratio. cjli/libsvm/ 8

9 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Primal optimality gap SVRG SAGA Katyusha SPDC DF-SPDC ADF-SPDC sythetic λ = / sythetic λ = 0 / sythetic λ = 0 4 / Primal optimality gap SVRG SAGA Katyusha SPDC DF-SPDC ADF-SPDC cpuact λ = / cpuact λ = 0 / cpuact λ = 0 4 / Figure. Compariso of radomized algorithms for ridge regressio problems. Primal optimality gap SVRG SAGA Katyusha SPDC DF-SPDC ADF-SPDC sythetic λ = / sythetic λ = 0 / sythetic λ = 0 4 / Primal optimality gap SVRG SAGA Katyusha SPDC DF-SPDC ADF-SPDC rcv λ = / rcv λ = 0 / rcv λ = 0 4 / Figure 3. Compariso of radomized algorithms for logistic regressio problems. 9

10 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Refereces Alle-Zhu Zeyua. Katyusha: Accelerated variace reductio for faster sgd. ArXiv e-prit Bach Fracis. Adaptivity of averaged stochastic gradiet descet to local strog covexity for logistic regressio. Joural of Machie Learig Research 5: Balamuruga Palaiappa ad Bach Fracis. Stochastic variace reductio methods for saddle-poit problems. I Advaces i Neural Iformatio Processig Systems NIPS 9 pp Bertsekas Dimitri P. Icremetal gradiet subgradiet ad proximal methods for covex optimizatio: A survey. I Sra Suvrit Nowozi Sebastia ad Wright Stephe J. eds. Optimizatio for Machie Learig chapter 4 pp MIT Press 0. Chambolle Atoi ad Pock Thomas. A first-order primal-dual algorithm for covex problems with applicatios to imagig. Joural of Mathematical Imagig ad Visio 40: Chambolle Atoi ad Pock Thomas. O the ergodic covergece rates of a first-order primal dual algorithm. Mathematical Programmig Series A 59: Defazio Aaro Bach Fracis ad Lacoste-Julie Simo. Saga: A fast icremetal gradiet method with support for o-strogly covex composite objectives. I Advaces i Neural Iformatio Processig Systems pp Deg Wei ad Yi Wotao. O the global ad liear covergece of the geeralized alteratig directio method of multipliers. Joural of Scietific Computig 663: Fercoq Oliver ad Richtárik Peter. Accelerated parallel ad proximal coordiate descet. SIAM Joural o Optimizatio 54: Goldstei Tom Li Mi Yua Xiaomig Esser Erie ad Baraiuk Richard. Adaptive primal-dual hybrid gradiet methods for saddle-poit problems. arxiv preprit arxiv: Johso Rie ad Zhag Tog. Acceleratig stochastic gradiet descet usig predictive variace reductio. I Advaces i Neural Iformatio Processig Systems pp La Guaghui ad Zhou Yi. A optimal radomized icremetal gradiet method. arxiv preprit arxiv: Li Hogzhou Mairal Julie ad Harchaoui Zaid. A uiversal catalyst for first-order optimizatio. I Advaces i Neural Iformatio Processig Systems pp a. Li Qihag Lu Zhaosog ad Xiao Li. A accelerated radomized proximal coordiate gradiet method ad its applicatio to regularized empirical risk miimizatio. SIAM Joural o Optimizatio 54: b. Malitsky Yura ad Pock Thomas. A first-order primal-dual algorithm with liesearch. arxiv preprit arxiv: Nedic Agelia ad Bertsekas Dimitri P. Icremetal subgradiet methods for odifferetiable optimizatio. SIAM Joural o Optimizatio : Nesterov Y. Itroductory Lectures o Covex Optimizatio: A Basic Course. Kluwer Bosto 004. Nesterov Yu. Efficiecy of coordiate descet methods o huge-scale optimizatio problems. SIAM Joural o Optimizatio : Richtárik Peter ad Takáč Marti. Iteratio complexity of radomized block-coordiate descet methods for miimizig a composite fuctio. Mathematical Programmig 44-: Roux Nicolas L Schmidt Mark ad Bach Fracis. A stochastic gradiet method with a expoetial covergece rate for fiite traiig sets. I Advaces i Neural Iformatio Processig Systems pp Shalev-Shwartz Shai. Sdca without duality regularizatio ad idividual covexity. I Proceedigs of The 33rd Iteratioal Coferece o Machie Learig pp Shalev-Shwartz Shai ad Zhag Tog. Stochastic dual coordiate ascet methods for regularized loss miimizatio. Joural of Machie Learig Research 4Feb: Shalev-Shwartz Shai ad Zhag Tog. Accelerated proximal stochastic dual coordiate ascet for regularized loss miimizatio. Mathematical Programmig 55-: Xiao Li ad Zhag Tog. A proximal stochastic gradiet method with progressive variace reductio. SIAM Joural o Optimizatio 44: Zhag Yuche ad Xiao Li. Stochastic primal-dual coordiate method for regularized empirical risk miimizatio. I Proceedigs of The 3d Iteratioal Coferece o Machie Learig pp

11 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms I the followig appedices we provide detailed proofs of theorems stated i the mai paper. I Sectio A we first prove a basic iequality which is useful throughout the rest of the covergece aalysis. Sectio B cotais geeral aalysis of the batch primal-dual algorithm that are commo for provig both Theorem ad Theorem 3. Sectios C D E ad F give proofs for Theorem Theorem 3 Theorem ad Theorem 4 respectively. A. A basic lemma Lemma. Let h be a strictly covex fuctio ad D h be its Bregma divergece. Suppose ψ is ν-strogly covex with respect tod h ad /δ-smooth with respect to the Euclidea orm ad ŷ = argmi y C { ψyηdh yȳ } where C is a compact covex set that lies withi the relative iterior of the domais of h ad ψ i.e. both h ad ψ are differetiable over C. The for ay y C ad ρ [0] we have ψyηd h y x ψŷηd h ŷȳ η ρν D h yŷ ρδ ψy ψŷ. Proof. The miimizer ŷ satisfies the followig first-order optimality coditio: ψŷη D h ŷȳ y ŷ 0 y C. Here D deotes partial gradiet of the Bregma divergece with respect to its first argumet i.e. Dŷȳ = hŷ hȳ. So the above optimality coditio is the same as ψŷη hŷ hȳ y ŷ 0 y C. 7 Sice ψ isν-strogly covex with respect tod h ad /δ-smooth we have ψy ψŷ ψŷy ˆx νd h yŷ ψy ψŷ ψŷy ŷ δ ψy ψŷ. For the secod iequality see e.g. Theorem..5 i Nesterov 004. Multiplyig the two iequalities above by ρ ad ρ respectively ad addig them together we have ψy ψŷ ψŷy ŷ ρνd h yŷ ρδ ψy ψŷ. The Bregma divergece D h satisfies the followig equality: D h yȳ = D h yŷd h ŷȳ hŷ hȳ y ŷ. We multiply this equality byη ad add it to the last iequality to obtai ψyηd h yȳ ψŷηd h yŷ η ρν D h ŷȳ ρδ ψy ψŷ ψŷη hŷ hȳ y ŷ. Usig the optimality coditio i 7 the last term of ier product is oegative ad thus ca be dropped which gives the desired iequality. B. Commo Aalysis of Batch Primal-Dual Algorithms We cosider the geeral primal-dual update rule as:

12 Iteratio: ˆxŷ = PD τ xȳ xỹ Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms ˆx = argmi x R d ŷ = arg mi y R { gxỹ T Ax τ x x Each iteratio of Algorithm is equivalet to the followig specificatio of PD τ : } 8 {f y y T A x Dyȳ }. 9 ˆx = x t x = x t x = x t θx t x t ŷ = y t ȳ = y t ỹ = y t. 0 Besides Assumptio we also assume that f isν-strogly covex with respect to a kerel fuctio h i.e. where D h is the Bregma divergece defied as f y f y f yy y νd h y y D h y y = hy hy hyy y. We assume thathis -strogly covex ad/δ -smooth. Depedig o the kerel fuctioh this assumptio of may impose additioal restrictios o f. I this paper we are mostly iterested i two special cases: hy = / y ad hy = f y for the latter we always have ν =. From ow o we will omit the subscript h ad use D deote the Bregma divergece. Uder the above assumptios ay solutio x y to the saddle-poit problem 6 satisfies the optimality coditio: The optimality coditios for the updates described i equatios 8 ad 9 are A T y gx Ax = f y. A T ỹ x ˆx gˆx 3 τ A x hŷ hȳ = f ŷ. 4 Applyig Lemma to the dual miimizatio step i 9 with ψy = f y y T A x η = / y = y ad ρ = / we obtai f y y T A x Dy ȳ f ŷ ŷ T A x Dŷȳ ν Dy ŷ δ f y f ŷ. 5 4 Similarly for the primal miimizatio step i 8 we have settigρ = 0 gx ỹ T Ax τ x x gˆxỹ T Aˆx τ ˆx x τ λ x ˆx. 6 Combiig the two iequalities above with the defiitio Lxy = gxy T Ax f y we get Lˆxy Lx ŷ = gˆxy T Aˆx f y gx ŷ T Ax f ŷ τ x x Dy ȳ τ λ x ˆx ν Dy ŷ τ ˆx x Dŷȳ δ f y f ŷ 4 y T Aˆx ŷ T Ax ỹ T Ax ỹ T Aˆx y T A xŷ T A x.

13 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms We ca simplify the ier product terms as y T Aˆx ŷ T Ax ỹ T Ax ỹ T Aˆx y T A xŷ T A x = ŷ ỹ T Aˆx x ŷ y T Aˆx x. Rearragig terms o the two sides of the iequality we have τ x x Dy ȳ Lˆxy Lx ŷ τ λ x ˆx ν Dy ŷ τ ˆx x Dŷȳ δ f y f ŷ 4 ŷ y T Aˆx x ŷ ỹ T Aˆx x. Applyig the substitutios i 0 yields τ x x t Dy y t Lx t y Lx y t τ λ x x t ν Dy y t τ xt x t Dyt y t δ f y f y t 4 y t y T A x t x t θx t x t. 7 We ca rearrage the ier product term i 7 as y t y T A x t x t θx t x t = y t y T Ax t x t θy t y T Ax t x t θy t y t T Ax t x t. Usig the optimality coditios i ad 4 we ca also boud f y f y t : = f y f y t Ax A x t θx t x t hy t hy t α Ax x t α θax t x t hy t hy t whereα >. With the defiitioµ = λ mi A T A we also have Ax x t µ x x t. Combiig them with the iequality 7 leads to τ x x t Dy y t θy t y T Ax t x t Lx t y Lx y t τ λ x x t ν Dy y t y t y T Ax t x t τ xt x t Dyt y t θy t y t T Ax t x t δµ α 4 x x t α δ θax t x t hy 4 t hy t. 8 3

14 C. Proof of Theorem Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Let the kerel fuctio behy = / y. I this case we havedy y = / y y ad hy = y. Moreover = δ = ad ν =. Therefore the iequality 8 becomes τ δµ x x t α y y t θy t y T Ax t x t Lx t y Lx y t τ λ x x t y y t y t y T Ax t x t τ xt x t yt y t θy t y t T Ax t x t α δ θax t x t 4 yt y t. 9 Next we derive aother form of the uderlied items above: yt y t θy t y t T Ax t x t = yt y t θ yt y t T Ax t x t = θax t x t yt y t θ Ax t x t θax t x t yt y t θ L x t x t where i the last iequality we used A L ad hece Ax t x t L x t x t. Combiig with iequality 9 we have τ δµ x t x α yt y θy t y T Ax t x t θ L x t x t Lx t y Lx y t τ λ x t x y t y y t y T Ax t x t τ xt x t θax α δ t x t 4 yt y t. 30 We ca remove the last term i the above iequality as log as its coefficiet is oegative i.e. α δ 4 0. I order to maximize /α we take the equality ad solve for the largest value of α allowed which results i α = δ α = δ. Applyig these values i 30 gives τ δµ x t x δ yt y θy t y T Ax t x t θ L x t x t Lx t y Lx y t τ λ x t x y t y y t y T Ax t x t τ xt x t. 3 4

15 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms We use t to deote the last row i 3. Equivaletly we defie t = = τ λ x x t y y t y t y T Ax t x t τ λ x x t 4 y y t [ x t x t y y t ] T [ τ I AT A τ xt x t ][ ] x t x t. y y t The quadratic form i the last term is oegative if the matrix M = [ τ I AT A ] is positive semidefiite for which a sufficiet coditio isτ /L. Uder this coditio t τ λ x x t 4 y y t 0. 3 If we ca to choose τ ad so that τ δµ δ θ τ λ θ θ L θ τ 33 the accordig to 3 we have t Lx t y Lx y t θ t. Because t 0 ad Lx t y Lx y t 0 for ay t 0 we have t θ t which implies ad t θ t 0 Lx t y Lx y t θ t 0. Let θ x ad θ y be two cotractio factors determied by the first two iequalities i 33 i.e. / θ x = τ δµ δ τ λ = θ y = / = /. τδµ δ τλ The we ca let θ = max{θ x θ y }. We ote that ay θ < would satisfy the last coditio i 33 provided that τ = L which also makes the matrixm positive semidefiite ad thus esures the iequality 3. Amog all possible pairsτ that satisfyτ = /L we choose which give the desired results of Theorem. τ = L λδµ = λδµ 34 L 5

16 D. Proof of Theorem 3 If we choose h = f the Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms h is -strogly covex ad /δ-smooth i.e. = ad δ = δ; f is-strogly covex with respect toh i.e.ν =. For coveiece we repeat iequality 8 here: τ x x t Dy y t θy t y T Ax t x t Lx t y Lx y t τ λ x x t ν Dy y t y t y T Ax t x t τ xt x t Dyt y t θy t y t T Ax t x t δµ α 4 x x t α δ θax t x t hy 4 t hy t. 35 We first boud the Bregma divergece Dy t y t usig the assumptio that the kerel h is -strogly covex ad /δ-smooth. Usig similar argumets as i the proof of Lemma we have for ay ρ [0] Dy t y t = hy t hy t hy t y t y t ρ yt y t ρ δ hy t hy t. 36 For ay β > 0 we ca lower boud the ier product term I additio we have θy t y t T Ax t x t β yt y t θ L β xt x t. θax t x t hy t hy t θ L x t x t hy t hy t. Combiig these bouds with 35 ad 36 withρ = / we arrive at τ δµ α θ L L β α δθ x x t Dy y t θy t y T Ax t x t x t x t Lx t y Lx y t τ λ x x t Dy y t y t y T Ax t x t 4 β δ y t y t 4 α δ hy t hy t τ xt x t. 37 We choose α ad β i 37 to zero out the coefficiets of y t y t ad hy t hy t : α = β =. 6

17 The the iequality 37 becomes τ δµ 4 θ L Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms x x t Dy y t θy t y T Ax t x t δθ L 4 x t x t Lx t y Lx y t τ λ x x t Dy y t y t y T Ax t x t τ xt x t. The coefficiet of x t x t ca be bouded as θ L δθ L 4 = 4 δ θ L = 4δ 4 θ L < θ L where i the iequality we used δ. Therefore we have x τ δµ x t 4 Dy y t θy t y T Ax t x t θ L x t x t Lx t y Lx y t τ λ x x t Dy y t y t y T Ax t x t τ xt x t. We use t to deote the last row of the above iequality. Equivaletly we defie t = τ λ x x t Dy y t y t y T Ax t x t τ xt x t. Sice h is-strogly covex we have Dy y t y y t ad thus t = τ λ x x t Dy y t τ λ x x t Dy y t The quadratic form i the last term is oegative ifτ /L. Uder this coditio t yt y y t y T Ax t x t τ xt x t [ ] x t x t T [ y y t τ I ][ ] AT x t x t A y y t. τ λ x x t Dy y t If we ca to choose τ ad so that τ δµ 4 θ τ λ θ θ L θ τ 39 the we have t Lx t y Lx y t θ t. Because t 0 ad Lx t y Lx y t 0 for ay t 0 we have t θ t which implies t θ t 0 7

18 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms ad Lx t y Lx y t θ t 0. To satisfy the last coditio i 39 ad also esure the iequality 38 it suffices to have τ 4L. We choose τ = L λδµ = λδµ. L With the above choice ad assumig λδµ L we have θ y = For the cotractio factor over the primal variables we have = / = λδµ /4L λδµ. 4L θ x = τ δµ 4 τδµ 4 δµ 44L τ λ = τλ = τλ δµ 6L λ L λδµ. This fiishes the proof of Theorem 3. E. Proof of Theorem We cosider the SPDC algorithm i the Euclidea case with hx = / x. The correspodig batch case aalysis is give i Sectio C. For each i =... let ỹ i be ỹ i = argmi y Based o the first-order optimality coditio we have Also sicey i miimizes φ i y y a ix we have By Lemma withρ = / we have ad re-arragig terms we get { φ iy } y yt i y a i x t. a i x t ỹ i y t i φ i ỹ i. a i x φ i y i. yi a i x t φ iyi yt i yi ỹi yi φ iỹ i ỹ i a i x t y t i yi ỹi yi ỹ i y t i δ 4 φ i ỹ i φ i yi ỹi y t i ỹ i y i a i x t φ iỹ i φ iyi δ 4 φ i ỹ i φ i y i. 40 8

19 Notice that Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms E[y t i ] = ỹ i y t i E[y t i yi ] = ỹ i yi y t i ] = ỹ i y t E[y t i i yt i y i E[φ iy t i ] = φ iỹ i φ iy t i. Plug the above relatios ito 40 ad divide both sides by we have y t i y 4 i 4 ad summig over i =... we get 4 where u t = y t y i= y t i a i u t = E[y t i y i ] E[y t i y t i ] yt i yi E[yt i a i x t y t i ] E[φ iy t i ] φ iy t i φ iy t i φ iyi δ a i x t x ỹ i y t i 4 E[ y t y ] E[ yt y t ] 4 φ ky t k φ ky t k i= u t u t u t u x t δ 4 Ax x t ỹ yt i= y t i a i ad u = O the other had sicex t miimizes the τ λ-strogly covex objective gx u t u t u t x x xt τ we ca apply Lemma withρ = 0 to obtai gx u t u t u t x xt x gx t u t u t u t x t xt x t ad re-arragig terms we get x t x τ τ λ τ τ φ iy t i φ iy i yia i. i= τ λ x t x E[ x t x ] E[ xt x t ] E[gx t gx ] τ E[ u t u t u t x t x ]. 9

20 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Also otice that Lx t y Lx y Lx y Lx y t Lx y Lx y t = φ iy t i φ iy φ ky t k φ ky t k gxt gx i= u x t u t x u t u t x. Combiig everythig together we have x t x τ 4 τ λ E[ x t x ] y t y Lx y Lx y t E[ y t y ] E[ xt x t ] E[ yt y t ] 4 τ E[Lx t y Lx y Lx y Lx y t ] E[ u t u u t u t x t x t ] δ 4 Ax x t ỹ yt. Next we otice that δ 4 Ax x t E[yt ] y t for some α > ad Ax x t µ x x t ad θaxt x t ỹ yt = δ 4 Ax x t θax t x t ỹ yt δ Ax x t α 4 α δ 4 θaxt x t ỹ yt θ Ax t x t ỹ yt θ L x t x t E[ yt y t ]. We follow the same reasoig as i the stadard SPDC aalysis u t u u t u t x t x t = yt y T Ax t x t y t y t T Ax t x t θy t y t T Ax t x t ad usig Cauchy-Schwartz iequality we have ad y t y t T Ax t x t yt y t T A /τ yt y t /τr y t y t T Ax t x t yt y t T A /τ yt y t /τr. θyt y T Ax t x t xt x t 8τ xt x t 8τ 0

21 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Thus we get u t u u t u t x t x t yt y T Ax t x t yt y t /4τR xt x t 8τ Puttig everythig together we have τ /αδµ x t x 4 Lx y Lx y t θ τ λ E[ x t x ] 4 θ xt x t. 8τ 4 8τ α θδl E[Lx t y Lx y Lx y Lx y t ] τ E[ x t x t ] 8τ 4R τ α δ E[ y t y t ]. θyt y T Ax t x t y t y θlx t y Lx y x t x t θyt y T Ax t x t E[ y t y ] E[yt y T Ax t x t ] If we choose the parameters as α = τ = 4δ 6R the we kow 4R τ α δ = 4 8 > 0 ad α θδl L 8 R 8 56τ thus 8τ α θδl 3 8τ. I additio we have α = 4δ. Fially we obtai τ δµ x t x 4 4δ y t y θlx t y Lx y 4 Lx y Lx y t 3 θ 8τ xt x t θyt y T Ax t x t τ λ E[ x t x ] E[ y t y ] E[yt y T Ax t x t ] 4 E[Lx t y Lx y Lx y Lx y t ] 3 8τ E[ xt x t ]. Now we ca defie θ x ad θ y as the ratios betwee the coefficiets i the x-distace ad y-distace terms ad let θ = max{θ x θ y } as before. Choosig the step-size parameters as λδµ gives the desired result. τ = 4R λδµ = 4R

22 F. Proof of Theorem 4 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms I this settig fori-th coordiate of the dual variables y we choose h = φ i let ad defie For i =... let ỹ i be D i y i y i = φ iy i φ iy i φ i y iy i y i ỹ i = argmi y Dyy = Based o the first-order optimality coditio we have Also sicey i miimizes φ i y y a ix we have D i y i y i. i= { } φ iy D iyy t i y a i x t. a i x t φ i ỹ i φ i y t i φ i ỹ i. a i x φ i y i. Usig Lemma withρ = / we obtai yi a i x t φ iyi D iyi yt i D i y iỹ i φ iỹ i ỹ i a i x t ad rearragig terms we get D i yi yt i D iỹ i y t i δ 4 φ i ỹ i φ i yi D i y iỹ i D iỹ i y t i ỹ i y i a i x t φ iỹ i φ iyi δ 4 φ i ỹ i φ i y i. 4 With i.i.d. radom samplig at each iteratio we have the followig relatios: E[y t i ] = ỹ i y t i E[D i y t i yi] = D iỹ i yi Diy t i yi E[D i y t i y t i ] = D iỹ i y t i E[φ iy t i ] = φ iỹ i φ iy t i. Pluggig the above relatios ito 4 ad dividig both sides by we have D i y t i y i D i y t i yi E[D iy t i E[y t i y t i ] yt i yi y t i ] a i x t E[φ iy t i ] φ iy t i φ iy t i φ iyi δ a i x t x φ i ỹ i φ i y t i 4

23 ad summig over i =... we get Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Dy t y E[Dy t y ] E[Dyt y t ] φ ky t k φ ky t k i= u t u t u t u x t δ 4 Ax x t φ where φ y t is a-dimesioal vector such that thei-th coordiate is φ iy t i φ iy i ỹ φ y t ad u t = i= y t i a i u t = [φ y t ] i = φ i y t i i= y t i a i ad u = yia i. i= O the other had sicex t miimizes a τ λ-strogly covex objective gx u t u t u t x x xt τ we ca apply Lemma withρ = 0 to obtai gx u t u t u t x xt x gx t u t u t u t x t xt x t ad rearragig terms we get Notice that x t x τ τ λ τ τ τ λ x t x E[ x t x ] E[ xt x t ] E[gx t gx ] τ E[ u t u t u t x t x ]. Lx t y Lx y Lx y Lx y t Lx y Lx y t = φ iy t i φ iy φ ky t k φ ky t k gxt gx i= u x t u t x u t u t x so x t x τ τ λ E[ x t x ] Dy t y Lx y Lx y t E[Dy t y ] E[ xt x t ] E[Dyt y t ] τ E[Lx t y Lx y Lx y Lx y t ] E[ u t u u t u t x t x t ] δ 4 Ax x t φ ỹ φ y t. 3

24 Next we have δ 4 Ax x t φ for ay α > ad ad θaxt x t φ Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms ỹ φ y t ỹ φ y t = δ 4 Ax x t θax t x t φ δ Ax x t α 4 α δ 4 θaxt x t φ Ax x t µ x x t Followig the same reasoig as i the stadard SPDC aalysis we have ỹ φ y t ỹ φ y t θ Ax t x t φ ỹ φ y t ] u t u u t u t x t x t = yt y T Ax t x t θ L x t x t E[ φ y t φ y t ]. y t y t T Ax t x t θy t y t T Ax t x t ad usig Cauchy-Schwartz iequality we have ad Thus we get y t y t T Ax t x t yt y t T A /τ yt y t /τr y t y t T Ax t x t yt y t T A /τ yt y t /τr. u t u u t u t x t x t yt y T Ax t x t yt y t /4τR xt x t 8τ θ xt x t. 8τ Also we ca lower boud the termdy t y t usig Lemma withρ = /: Dy t y t = φ iy t i φ iy t i φ i y t i= i= θyt y T Ax t x t xt x t 8τ xt x t 8τ i y t i θyt y T Ax t x t y t i yt i y t i δ φ i y t i φ i y t i = yt y t δ φ y t φ y t. 4

25 Exploitig Strog Covexity from Data with Primal-Dual First-Order Algorithms Combiig everythig above together we have τ /αδµ x t x 4 Lx y Lx y t θ 8τ α θδl τ λ E[ x t x ] Dy t y θlx t y Lx y x t x t θyt y T Ax t x t E[Dy t y ] E[yt y T Ax t x t ] E[Lx t y Lx y Lx y Lx y t ] τ E[ x t x t ] 8τ 4R τ E[ y t y t ] δ α δ E[ φ y t φ y t ]. If we choose the parameters as the we kow ad ad thus I additio we have α θδl α = 4 τ = 6R 4R τ = 4 > 0 δ α δ = δ δ 8 > 0 δl 8 δr δ 8 56τ 56τ 8τ α θδl 3 8τ. α = 4. Fially we obtai τ δµ x t x 4 4 Dy t y θlx t y Lx y Lx y Lx y t 3 θ 8τ xt x t θyt y T Ax t x t τ λ E[ x t x ] E[ y t y ] E[yt y T Ax t x t ] E[Lx t y Lx y Lx y Lx y t ] 3 8τ E[ xt x t ]. As before we ca defie θ x ad θ y as the ratios betwee the coefficiets i the x-distace ad y-distace terms ad let θ = max{θ x θ y }. The choosig the step-size parameters as gives the desired result. τ = 4R λδµ = λδµ 4R 5

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data

Doubly Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization with Factorized Data Doubly Stochastic Primal-Dual Coordiate Method for Regularized Empirical Risk Miimizatio with Factorized Data Adams Wei Yu, Qihag Li, Tiabao Yag Caregie Mello Uiversity The Uiversity of Iowa weiyu@cs.cmu.edu,

More information

Optimally Sparse SVMs

Optimally Sparse SVMs A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

Introduction to Optimization Techniques. How to Solve Equations

Introduction to Optimization Techniques. How to Solve Equations Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

REGRESSION WITH QUADRATIC LOSS

REGRESSION WITH QUADRATIC LOSS REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d

More information

Regression with quadratic loss

Regression with quadratic loss Regressio with quadratic loss Maxim Ragisky October 13, 2015 Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X,Y, where, as before,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS MASSACHUSTTS INSTITUT OF TCHNOLOGY 6.436J/5.085J Fall 2008 Lecture 9 /7/2008 LAWS OF LARG NUMBRS II Cotets. The strog law of large umbers 2. The Cheroff boud TH STRONG LAW OF LARG NUMBRS While the weak

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machie Learig Lecturer: Philippe Rigollet Lecture 0 Scribe: Ade Forrow Oct. 3, 05 Recall the followig defiitios from last time: Defiitio: A fuctio K : X X R is called a positive symmetric

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Rademacher Complexity

Rademacher Complexity EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machie Learig Theory (CS 6783) Lecture 2 : Learig Frameworks, Examples Settig up learig problems. X : istace space or iput space Examples: Computer Visio: Raw M N image vectorized X = 0, 255 M N, SIFT

More information

Intro to Learning Theory

Intro to Learning Theory Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified

More information

Information-based Feature Selection

Information-based Feature Selection Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with

More information

Supplemental Material: Proofs

Supplemental Material: Proofs Proof to Theorem Supplemetal Material: Proofs Proof. Let be the miimal umber of traiig items to esure a uique solutio θ. First cosider the case. It happes if ad oly if θ ad Rak(A) d, which is a special

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Supplementary Material for Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Supplemetary Material for Fast Stochastic AUC Maximizatio with O/-Covergece Rate Migrui Liu Xiaoxua Zhag Zaiyi Che Xiaoyu Wag 3 iabao Yag echical Lemmas ized versio of Hoeffdig s iequality, ote that We

More information

Support vector machine revisited

Support vector machine revisited 6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where

More information

6.867 Machine learning, lecture 7 (Jaakkola) 1

6.867 Machine learning, lecture 7 (Jaakkola) 1 6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Algorithms for Clustering

Algorithms for Clustering CR2: Statistical Learig & Applicatios Algorithms for Clusterig Lecturer: J. Salmo Scribe: A. Alcolei Settig: give a data set X R p where is the umber of observatio ad p is the umber of features, we wat

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

Maximum Likelihood Estimation and Complexity Regularization

Maximum Likelihood Estimation and Complexity Regularization ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis Recursive Algorithms Recurreces Computer Sciece & Egieerig 35: Discrete Mathematics Christopher M Bourke cbourke@cseuledu A recursive algorithm is oe i which objects are defied i terms of other objects

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

b i u x i U a i j u x i u x j

b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)

NYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018) NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we

More information

Lecture 9: Boosting. Akshay Krishnamurthy October 3, 2017

Lecture 9: Boosting. Akshay Krishnamurthy October 3, 2017 Lecture 9: Boostig Akshay Krishamurthy akshay@csumassedu October 3, 07 Recap Last week we discussed some algorithmic aspects of machie learig We saw oe very powerful family of learig algorithms, amely

More information

The Method of Least Squares. To understand least squares fitting of data.

The Method of Least Squares. To understand least squares fitting of data. The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve

More information

Empirical Process Theory and Oracle Inequalities

Empirical Process Theory and Oracle Inequalities Stat 928: Statistical Learig Theory Lecture: 10 Empirical Process Theory ad Oracle Iequalities Istructor: Sham Kakade 1 Risk vs Risk See Lecture 0 for a discussio o termiology. 2 The Uio Boud / Boferoi

More information

On the Linear Convergence of a Cyclic Incremental Aggregated Gradient Method

On the Linear Convergence of a Cyclic Incremental Aggregated Gradient Method O the Liear Covergece of a Cyclic Icremetal Aggregated Gradiet Method Arya Mokhtari Departmet of Electrical ad Systems Egieerig Uiversity of Pesylvaia Philadelphia, PA 19104, USA Mert Gürbüzbalaba Departmet

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

1 Review and Overview

1 Review and Overview CS9T/STATS3: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #6 Scribe: Jay Whag ad Patrick Cho October 0, 08 Review ad Overview Recall i the last lecture that for ay family of scalar fuctios F, we

More information

Linear Classifiers III

Linear Classifiers III Uiversität Potsdam Istitut für Iformatik Lehrstuhl Maschielles Lere Liear Classifiers III Blaie Nelso, Tobias Scheffer Cotets Classificatio Problem Bayesia Classifier Decisio Liear Classifiers, MAP Models

More information

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = =

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = = Review Problems ICME ad MS&E Refresher Course September 9, 0 Warm-up problems. For the followig matrices A = 0 B = C = AB = 0 fid all powers A,A 3,(which is A times A),... ad B,B 3,... ad C,C 3,... Solutio:

More information

A New Solution Method for the Finite-Horizon Discrete-Time EOQ Problem

A New Solution Method for the Finite-Horizon Discrete-Time EOQ Problem This is the Pre-Published Versio. A New Solutio Method for the Fiite-Horizo Discrete-Time EOQ Problem Chug-Lu Li Departmet of Logistics The Hog Kog Polytechic Uiversity Hug Hom, Kowloo, Hog Kog Phoe: +852-2766-7410

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices A Hadamard-type lower boud for symmetric diagoally domiat positive matrices Christopher J. Hillar, Adre Wibisoo Uiversity of Califoria, Berkeley Jauary 7, 205 Abstract We prove a ew lower-boud form of

More information

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.

This exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam. Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

ECE 901 Lecture 13: Maximum Likelihood Estimation

ECE 901 Lecture 13: Maximum Likelihood Estimation ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

Naïve Bayes. Naïve Bayes

Naïve Bayes. Naïve Bayes Statistical Data Miig ad Machie Learig Hilary Term 206 Dio Sejdiovic Departmet of Statistics Oxford Slides ad other materials available at: http://www.stats.ox.ac.uk/~sejdiov/sdmml : aother plug-i classifier

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

Accelerated Method for Stochastic Composition Optimization with Nonsmooth Regularization

Accelerated Method for Stochastic Composition Optimization with Nonsmooth Regularization he hirty-secod AAAI Coferece o Artificial Itelligece AAAI-8 Accelerated Method for Stochastic Compositio Optimizatio with Nosmooth Regularizatio Zhouyua Huo, Bi Gu, Ji Liu, 2 Heg Huag Departmet of Electrical

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression

Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques

More information

Differentiable Convex Functions

Differentiable Convex Functions Differetiable Covex Fuctios The followig picture motivates Theorem 11. f ( x) f ( x) f '( x)( x x) ˆx x 1 Theorem 11 : Let f : R R be differetiable. The, f is covex o the covex set C R if, ad oly if for

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

A survey on penalized empirical risk minimization Sara A. van de Geer

A survey on penalized empirical risk minimization Sara A. van de Geer A survey o pealized empirical risk miimizatio Sara A. va de Geer We address the questio how to choose the pealty i empirical risk miimizatio. Roughly speakig, this pealty should be a good boud for the

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Sequences and Series of Functions

Sequences and Series of Functions Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

Boosting. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 1, / 32

Boosting. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 1, / 32 Boostig Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machie Learig Algorithms March 1, 2017 1 / 32 Outlie 1 Admiistratio 2 Review of last lecture 3 Boostig Professor Ameet Talwalkar CS260

More information

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

6.883: Online Methods in Machine Learning Alexander Rakhlin

6.883: Online Methods in Machine Learning Alexander Rakhlin 6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

Feedback in Iterative Algorithms

Feedback in Iterative Algorithms Feedback i Iterative Algorithms Charles Byre (Charles Byre@uml.edu), Departmet of Mathematical Scieces, Uiversity of Massachusetts Lowell, Lowell, MA 01854 October 17, 2005 Abstract Whe the oegative system

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

ON POINTWISE BINOMIAL APPROXIMATION

ON POINTWISE BINOMIAL APPROXIMATION Iteratioal Joural of Pure ad Applied Mathematics Volume 71 No. 1 2011, 57-66 ON POINTWISE BINOMIAL APPROXIMATION BY w-functions K. Teerapabolar 1, P. Wogkasem 2 Departmet of Mathematics Faculty of Sciece

More information

On Random Line Segments in the Unit Square

On Random Line Segments in the Unit Square O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,

More information

Topics Machine learning: lecture 2. Review: the learning problem. Hypotheses and estimation. Estimation criterion cont d. Estimation criterion

Topics Machine learning: lecture 2. Review: the learning problem. Hypotheses and estimation. Estimation criterion cont d. Estimation criterion .87 Machie learig: lecture Tommi S. Jaakkola MIT CSAIL tommi@csail.mit.edu Topics The learig problem hypothesis class, estimatio algorithm loss ad estimatio criterio samplig, empirical ad epected losses

More information

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector Dimesio-free PAC-Bayesia bouds for the estimatio of the mea of a radom vector Olivier Catoi CREST CNRS UMR 9194 Uiversité Paris Saclay olivier.catoi@esae.fr Ilaria Giulii Laboratoire de Probabilités et

More information

ECE-S352 Introduction to Digital Signal Processing Lecture 3A Direct Solution of Difference Equations

ECE-S352 Introduction to Digital Signal Processing Lecture 3A Direct Solution of Difference Equations ECE-S352 Itroductio to Digital Sigal Processig Lecture 3A Direct Solutio of Differece Equatios Discrete Time Systems Described by Differece Equatios Uit impulse (sample) respose h() of a DT system allows

More information

Recurrence Relations

Recurrence Relations Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial(-)); } Let t be the umber of multiplicatios eeded to calculate factorial(). The

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

Chapter 6 Infinite Series

Chapter 6 Infinite Series Chapter 6 Ifiite Series I the previous chapter we cosidered itegrals which were improper i the sese that the iterval of itegratio was ubouded. I this chapter we are goig to discuss a topic which is somewhat

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam 4 will cover.-., 0. ad 0.. Note that eve though. was tested i exam, questios from that sectios may also be o this exam. For practice problems o., refer to the last review. This

More information

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca

More information

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

The log-behavior of n p(n) and n p(n)/n

The log-behavior of n p(n) and n p(n)/n Ramauja J. 44 017, 81-99 The log-behavior of p ad p/ William Y.C. Che 1 ad Ke Y. Zheg 1 Ceter for Applied Mathematics Tiaji Uiversity Tiaji 0007, P. R. Chia Ceter for Combiatorics, LPMC Nakai Uivercity

More information

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f. Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,

More information

Preponderantly increasing/decreasing data in regression analysis

Preponderantly increasing/decreasing data in regression analysis Croatia Operatioal Research Review 269 CRORR 7(2016), 269 276 Prepoderatly icreasig/decreasig data i regressio aalysis Darija Marković 1, 1 Departmet of Mathematics, J. J. Strossmayer Uiversity of Osijek,

More information

1 Review and Overview

1 Review and Overview DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,

More information

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression Harder, Better, Faster, Stroger Covergece Rates for Least-Squares Regressio Aoymous Author(s) Affiliatio Address email Abstract 1 2 3 4 5 6 We cosider the optimizatio of a quadratic objective fuctio whose

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information