Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
Anonymous Author(s)
Affiliation
Address

Abstract

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions and in terms of dependence on the noise and dimension of the problem, and prove dimensionless and tighter rates for a regularized version of this algorithm.

1 Introduction

Many supervised machine learning problems are naturally cast as the minimization of a smooth function defined on a Euclidean space. This includes least-squares regression, logistic regression (see, e.g., Hastie et al., 2009) or generalized linear models (McCullagh and Nelder, 1989). While small problems with few or low-dimensional input features may be solved precisely by many potential optimization algorithms (e.g., Newton's method), large-scale problems with many high-dimensional features are typically solved with simple gradient-based iterative techniques whose per-iteration cost is small.

In this paper, we consider a quadratic objective function f whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite-variance random error. In this stochastic approximation framework (Robbins and Monro, 1951), it is known that two quantities dictate the behavior of various algorithms, namely the covariance matrix V of the noise in the gradients, and the deviation θ0 − θ* between the initial point θ0 of the algorithm and any global minimizer θ* of f.
This leads to a bias/variance decomposition (Bach and Moulines, 2013; Hsu et al., 2014) of the performance of most algorithms as the sum of two terms: (a) the bias term characterizes how fast initial conditions are forgotten and is thus increasing in a well-chosen norm of θ0 − θ*; while (b) the variance term characterizes the effect of the noise in the gradients, independently of the starting point, with a term that is increasing in the covariance of the noise.

For quadratic functions with (a) a noise covariance matrix V which is proportional (with constant σ²) to the Hessian of f (a situation which corresponds to least-squares regression) and (b) an initial point characterized by the norm ‖θ0 − θ*‖², the optimal bias and variance terms are known separately. On the one hand, the optimal bias term after n iterations is proportional to L‖θ0 − θ*‖²/n², where L is the largest eigenvalue of the Hessian of f. This rate is achieved by accelerated gradient descent (Nesterov, 1983, 2004), and is known to be optimal if the number of iterations n is less than the dimension d of the underlying predictors, but the algorithm is not robust to random or deterministic noise in the gradients (d'Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014). On the other hand, the optimal variance term is proportional to σ²d/n (Tsybakov, 2003); it is known to be achieved by averaged gradient descent (Bach and Moulines, 2013), for which the bias term only achieves L‖θ0 − θ*‖²/n instead of L‖θ0 − θ*‖²/n².

Submitted to 30th Conference on Neural Information Processing Systems (NIPS 2016). Do not distribute.
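To make the two regimes concrete, here is a small back-of-the-envelope sketch (our own illustration, with made-up values of L, d, σ² and ‖θ0 − θ*‖²): the bias term L‖θ0 − θ*‖²/n² dominates the variance term σ²d/n up to roughly n ≈ L‖θ0 − θ*‖²/(σ²d), after which the variance dominates.

```python
# Crossover between the optimal bias and variance terms (illustrative values only).
L, dist2 = 10.0, 100.0    # largest Hessian eigenvalue L, ||theta_0 - theta_*||^2
sigma2, d = 1.0, 25       # noise level sigma^2 and dimension d

def bias(n):      # optimal bias term, achieved by accelerated gradient descent
    return L * dist2 / n**2

def variance(n):  # optimal variance term, achieved by averaged gradient descent
    return sigma2 * d / n

# Setting bias(n) = variance(n) gives n = L * dist2 / (sigma2 * d).
n_cross = L * dist2 / (sigma2 * d)
print(n_cross)    # beyond this n, the variance term dominates
```

With these (arbitrary) values the crossover happens at n = 40; before that point an algorithm with a suboptimal bias term pays the larger price.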
Our first contribution in this paper is to analyze, in Section 3, averaged accelerated gradient descent, showing that it attains optimal rates for both the variance and the bias terms. This shows that averaging is beneficial for accelerated techniques and provides a provable robustness to noise. While optimal when measuring performance in terms of the dimension d and the initial distance to optimum ‖θ0 − θ*‖², these rates are not adapted to many situations where either d is larger than the number of iterations n (i.e., the number of observations for regular stochastic gradient descent) or L‖θ0 − θ*‖² is much larger than n². Our second contribution is to provide, in Section 4, an analysis of a new algorithm (based on some additional regularization) that can adapt our bounds to finer assumptions on θ0 − θ* and the Hessian of the problem, leading in particular to dimension-free quantities that can thus be extended to the Hilbert-space setting (in particular for non-parametric estimation).

2 Least-Squares Regression

Statistical Assumptions. We consider the following general setting: H is a d-dimensional Euclidean space with d ≥ 1; the observations (x_n, y_n) ∈ H × R, n ≥ 1, are independent and identically distributed (i.i.d.), and such that E‖x_n‖² and E y_n² are finite. We consider the least-squares regression problem, which is the minimization of the quadratic function f(θ) = (1/2) E(⟨x_n, θ⟩ − y_n)².

Covariance matrix: We denote by Σ = E(x_n ⊗ x_n) ∈ R^(d×d) the population covariance matrix, which is the Hessian of f at all points. Without loss of generality, we can assume Σ invertible; this implies that all eigenvalues of Σ are strictly positive (but they may be arbitrarily small). We assume there exists R > 0 such that E[‖x_n‖² x_n ⊗ x_n] ⪯ R²Σ, where A ⪯ B means that B − A is positive semi-definite. This assumption is satisfied, for example, for least-squares regression with almost surely bounded data.

Eigenvalue decay: Most convergence bounds depend on the dimension d of H. However, it is possible to derive dimension-free and often tighter convergence rates by considering bounds depending on the value tr Σ^b for b ∈ [0, 1].
Given b, if we consider the eigenvalues of Σ ordered in decreasing order, denoted by s_i, then tr Σ^b = Σ_i s_i^b, a quantity which is small when the eigenvalues decay fast. For b going to 0, tr Σ^b tends to d and we are back in the classical low-dimensional case. When b = 1, we simply get tr Σ = E‖x_n‖², which will correspond to the weakest assumption in our context.

Optimal predictor: The function f(θ) = (1/2) E(⟨x_n, θ⟩ − y_n)² always admits a global minimizer θ* = Σ^(−1) E(y_n x_n). When initializing algorithms at θ0 = 0, or regularizing by the squared norm, rates of convergence generally depend on ‖θ*‖, a quantity which could be arbitrarily large. However, there is a systematic upper bound ‖Σ^(1/2) θ*‖² ≤ E y_n². This leads naturally to the consideration of convergence bounds depending on ‖Σ^(r/2) θ*‖ for r ≤ 1.

Noise: We denote by ε_n = y_n − ⟨θ*, x_n⟩ the residual, for which we have E[ε_n x_n] = 0. Although we do not have E[ε_n | x_n] = 0 in general, unless the model is well-specified, we assume the noise to be a structured process such that there exists σ > 0 with E[ε_n² x_n ⊗ x_n] ⪯ σ²Σ. This assumption is satisfied, for example, for almost surely bounded data or when the model is well-specified.

Averaged Gradient Methods and Acceleration. We focus in this paper on stochastic gradient methods with acceleration for a quadratic function regularized by (λ/2)‖θ − θ0‖². The regularization will be useful when deriving tighter convergence rates in Section 4, and it has the additional benefit of making the problem λ-strongly convex. Accelerated stochastic gradient descent is defined by an iterative system with two parameters (θ_n, ν_n), starting from θ0 = ν0 ∈ H and satisfying, for n ≥ 1,

    θ_n = ν_(n−1) − γ f'_n(ν_(n−1)) − γλ(ν_(n−1) − θ0),
    ν_n = θ_n + δ(θ_n − θ_(n−1)),    (1)

with (γ, δ) ∈ R² and f'_n(ν_(n−1)) an unbiased estimate of the gradient f'(ν_(n−1)). The momentum coefficient δ ∈ R is chosen to accelerate the convergence rate (Nesterov, 1983; Beck and Teboulle, 2009) and has its roots in the heavy-ball algorithm of Polyak (1964). Following Polyak and Juditsky (1992), we especially concentrate here on the average of the sequence, θ̄_n = (1/(n+1)) Σ_(i=0)^n θ_i.
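The recursion (1) together with Polyak–Ruppert averaging can be sketched as follows. This is a minimal illustration, not the authors' code: the gradient oracle is passed in as a callable, and all parameter names are our own.

```python
import numpy as np

def av_acc_sgd(grad, theta0, n_iter, gamma, delta=1.0, lam=0.0):
    """Averaged accelerated (stochastic) gradient descent, Eq. (1).

    grad(nu, t) should return an unbiased estimate of f'(nu) at step t.
        theta_t = nu_{t-1} - gamma * grad(nu_{t-1}, t) - gamma * lam * (nu_{t-1} - theta0)
        nu_t    = theta_t + delta * (theta_t - theta_{t-1})
    Returns the Polyak-Ruppert average (1/(n+1)) * sum_{i=0}^{n} theta_i.
    """
    theta_prev = theta0.copy()
    nu = theta0.copy()
    theta_bar = theta0.copy()          # running average, initialized at theta_0
    for t in range(1, n_iter + 1):
        theta = nu - gamma * (grad(nu, t) + lam * (nu - theta0))
        nu = theta + delta * (theta - theta_prev)
        theta_prev = theta
        theta_bar += (theta - theta_bar) / (t + 1)   # average of theta_0 .. theta_t
    return theta_bar
```

With λ = 0 and δ = 1 this is the averaged accelerated scheme studied in Section 3; an additive-noise oracle in the sense of the paper would be, e.g., `grad = lambda nu, t: Sigma @ (nu - theta_star) - xi[t]` for an i.i.d. zero-mean sequence `xi` (names `Sigma`, `theta_star`, `xi` are hypothetical).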
Stochastic Oracles on the Gradient. Let (F_n)_(n≥0) be the increasing family of σ-fields generated by all variables (x_i, y_i) for i ≤ n. The oracle we consider is the sum of the true gradient f'(θ) and an independent zero-mean noise that does not depend on θ.¹ Consequently, it is of the form f'_n(θ) = f'(θ) − ξ_n, where the noise process ξ_n is F_n-measurable with E[ξ_n | F_(n−1)] = 0 and E[‖ξ_n‖²] finite. Furthermore, we also assume that there exists τ ∈ R such that E[ξ_n ⊗ ξ_n] ⪯ τ²Σ, that is, the noise has a particular structure adapted to least-squares regression.

3 Accelerated Stochastic Averaged Gradient Descent

We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for λ = 0 and δ = 1. It can be rewritten for the quadratic function f as a second-order iterative system with constant coefficients: θ_n = [I − γΣ](2θ_(n−1) − θ_(n−2)) + γ y_n x_n.

Theorem 1. For any constant step-size γ such that γΣ ⪯ I,

    Ef(θ̄_n) − f(θ*) ≤ 36 ‖θ0 − θ*‖² / (γ(n+1)²) + 8 τ²d / (n+1).    (2)

We can make the following observations:

The first bound, ‖θ0 − θ*‖²/(γn²), in Eq. (2) corresponds to the usual accelerated rate. It has been shown by Nesterov (2004) to be the optimal rate of convergence for optimizing a quadratic function with a first-order method that can only access sequences of gradients, when n ≤ d. By averaging an algorithm dedicated to strongly convex functions, we recover the traditional convergence rate for non-strongly-convex functions.

The second bound in Eq. (2) matches the optimal statistical performance τ²d/n over all estimators in H (Tsybakov, 2008), even without computational limits, in the sense that no estimator that uses the same information can improve upon this rate. Accordingly, this algorithm achieves joint bias/variance optimality (when measured in terms of τ² and ‖θ0 − θ*‖²).

We have the same rate of convergence for the bias when compared to the regular Nesterov acceleration without averaging studied by Flammarion and Bach (2015), which corresponds to choosing δ_n = 1 − 2/n for all n.
However, if the problem is µ-strongly convex, the latter was shown to also converge at the linear rate O((1 − √(γµ))^n), and is thus adaptive to hidden strong convexity (since the algorithm does not need to know µ to run); it therefore ends up converging faster than the rate 1/n². This is confirmed in our experiments in Section 5.

Overall, the bias term is improved whereas the variance term is not degraded, and acceleration is thus robust to noise in the gradients. Thereby, while second-order iterative methods for optimizing quadratic functions in the singular case, such as conjugate gradient (Polyak, 1987, Section 6.1), are notoriously highly sensitive to noise, we are able to propose a version which is robust to stochastic noise.

4 Tighter Convergence Rates

We have seen in Corollary 1 above that the averaged accelerated gradient algorithm matches the lower bounds τ²d/n and L‖θ0 − θ*‖²/n² for the prediction error. However, the algorithm performs better in almost all cases except the worst-case scenarios corresponding to the lower bounds. For example, the algorithm may still predict well when the dimension d is much bigger than n. Similarly, the norm of the optimal predictor ‖θ*‖² may be huge and the prediction still good, as gradient algorithms happen to be adaptive to the difficulty of the problem: indeed, if the problem is simpler, the convergence rate of the gradient algorithm will be improved. In this section, we provide such a theoretical guarantee. We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for λ = (γ(n+1)²)^(−1) and δ ∈ [ , 1]. We have the following theorem:

¹ This is different from the oracle usually considered in stochastic approximation (see Bach and Moulines (2013); Dieuleveut and Bach (2015)).
Theorem 2. For any constant step-size γ such that γ(Σ + λI) ⪯ I,

    Ef(θ̄_n) − f(θ*) ≤ min_(r ∈ [0,1], b ∈ [0,1]) [ 74 ‖Σ^(r/2)(θ0 − θ*)‖² / (γ^(1−r) (n+1)^(2(1−r))) + 8 τ² γ^b tr(Σ^b) / (n+1)^(1−2b) ].

We can make the following observations:

The algorithm is independent of r and b; thus all the bounds for the different values of (r, b) are valid simultaneously. This is a strong property of the algorithm, which is indeed adaptive to the regularity and the effective dimension of the problem (once γ is chosen).

In situations in which either d is larger than n or L‖θ0 − θ*‖² is larger than n², the algorithm can still enjoy good convergence properties, by adapting to the best values of b and r.

For b = 0 we recover the variance term of Corollary 1, but for b > 0 and fast decays of the eigenvalues of Σ, the bound may be much smaller; note that we lose in the dependency on n, but typically, for large d, this can be advantageous. With r and b well chosen, we recover the optimal rate for non-parametric regression (Caponnetto and De Vito, 2007).

5 Experiments

We now illustrate our theoretical results on synthetic examples. For d = 25 we consider normally distributed inputs x_n with a random covariance matrix Σ which has eigenvalues 1/i³, for i = 1, ..., d, and random optimum θ* and starting point θ0 such that ‖θ0 − θ*‖ = 1. The outputs y_n are generated from a linear function with homoscedastic noise with unit signal-to-noise ratio (σ² = 1). We take R² = tr Σ, the average radius of the data, a step-size γ = 1/R², and λ = 0. The additive-noise oracle is used. We show results averaged over 10 replications. We compare the performance of averaged SGD (AvSGD), the usual Nesterov acceleration for convex functions (AccSGD), and our novel averaged accelerated SGD (AvAccSGD)², on two different problems: one deterministic (‖θ0 − θ*‖ = 1, σ² = 0), which illustrates how the bias term behaves, and one purely stochastic (‖θ0 − θ*‖ = 0, σ² = 1), which illustrates how the variance term behaves. For the bias (left plot of Figure 1), AvSGD converges at speed O(1/n), while AvAccSGD
and AccSGD both converge at speed O(1/n²). However, as mentioned in the observations following Theorem 1, AccSGD takes advantage of the hidden strong convexity of the quadratic function and starts converging linearly at the end. For the variance (right plot of Figure 1), AccSGD does not converge to the optimum and keeps oscillating, whereas AvSGD and AvAccSGD both converge to the optimum at a speed O(1/n); however, AvSGD remains slightly faster in the beginning.

Figure 1: Synthetic problem (d = 25), γ = 1/R²; log10[f(θ) − f(θ*)] as a function of log10(n) for AvAccSGD, AccSGD and AvSGD. Left: bias. Right: variance.

Note that for small n, or when the bias L‖θ0 − θ*‖²/n² is much bigger than the variance σ²d/n, the bias may have a stronger effect, although asymptotically the variance always dominates. It is thus essential to have an algorithm which is optimal in both regimes; this is achieved by AvAccSGD.

² This is not the averaging of AccSGD, because the momentum term is proportional to 1 − 3/n for AccSGD instead of being equal to 1 for AvAccSGD.
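The synthetic setup of this section can be sketched end-to-end for AvAccSGD. This is our own reconstruction under simplifying assumptions (diagonal Σ with eigenvalues 1/i³, additive-noise oracle with covariance σ²Σ, a single run instead of 10 replications), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 25, 20_000
s = 1.0 / np.arange(1, d + 1) ** 3             # eigenvalues of Sigma: 1/i^3
Sigma = np.diag(s)
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)       # so ||theta_0 - theta_*|| = 1 with theta_0 = 0
R2 = np.trace(Sigma)                           # R^2 = tr Sigma, the "average radius"
gamma, sigma = 1.0 / R2, 1.0                   # step-size 1/R^2, unit signal-to-noise ratio

# Averaged accelerated SGD (lambda = 0, delta = 1) with the additive-noise oracle:
# the oracle returns f'(nu) minus a zero-mean noise xi_t with E[xi xi^T] = sigma^2 * Sigma.
theta_prev = np.zeros(d)
nu = np.zeros(d)
theta_bar = np.zeros(d)
for t in range(1, n + 1):
    xi = sigma * np.sqrt(s) * rng.standard_normal(d)   # noise with covariance sigma^2 * Sigma
    grad = Sigma @ (nu - theta_star) - xi              # stochastic gradient at nu
    theta = nu - gamma * grad
    nu = theta + (theta - theta_prev)                  # momentum with delta = 1
    theta_prev = theta
    theta_bar = theta_bar + (theta - theta_bar) / (t + 1)   # Polyak-Ruppert average

excess = 0.5 * (theta_bar - theta_star) @ Sigma @ (theta_bar - theta_star)
print(excess)   # excess risk f(theta_bar) - f(theta_*); small, as predicted by Theorem 1
```

With these choices, Theorem 1 predicts an expected excess risk of at most 36‖θ0 − θ*‖²/(γ(n+1)²) + 8τ²d/(n+1), which for n = 20,000 is of order 10⁻².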
References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems.
Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1).
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3).
d'Aspremont, A. (2008). Smooth optimization with approximate gradient. SIAM J. Optim., 19(3).
Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Math. Program., 146(1-2, Ser. A).
Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics. To appear.
Flammarion, N. and Bach, F. (2015). From averaging to acceleration, there is only a step-size. In Proceedings of the International Conference on Learning Theory (COLT).
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer, second edition.
Hsu, D., Kakade, S. M., and Zhang, T. (2014). Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3).
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Monographs on Statistics and Applied Probability. Chapman & Hall, London, second edition.
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2).
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17.
Polyak, B. T. (1987). Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York.
Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4).
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3).
Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems.
Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.
Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer.
More informationSequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence
Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet
More informationUnbiased Estimation. February 7-12, 2008
Ubiased Estimatio February 7-2, 2008 We begi with a sample X = (X,..., X ) of radom variables chose accordig to oe of a family of probabilities P θ where θ is elemet from the parameter space Θ. For radom
More informationNYU Center for Data Science: DS-GA 1003 Machine Learning and Computational Statistics (Spring 2018)
NYU Ceter for Data Sciece: DS-GA 003 Machie Learig ad Computatioal Statistics (Sprig 208) Brett Berstei, David Roseberg, Be Jakubowski Jauary 20, 208 Istructios: Followig most lab ad lecture sectios, we
More informationBasics of Probability Theory (for Theory of Computation courses)
Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.
More informationMachine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring
Machie Learig Regressio I Hamid R. Rabiee [Slides are based o Bishop Book] Sprig 015 http://ce.sharif.edu/courses/93-94//ce717-1 Liear Regressio Liear regressio: ivolves a respose variable ad a sigle predictor
More informationSelf-normalized deviation inequalities with application to t-statistic
Self-ormalized deviatio iequalities with applicatio to t-statistic Xiequa Fa Ceter for Applied Mathematics, Tiaji Uiversity, 30007 Tiaji, Chia Abstract Let ξ i i 1 be a sequece of idepedet ad symmetric
More informationLecture 15: Learning Theory: Concentration Inequalities
STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that
More informationMaximum Likelihood Estimation and Complexity Regularization
ECE90 Sprig 004 Statistical Regularizatio ad Learig Theory Lecture: 4 Maximum Likelihood Estimatio ad Complexity Regularizatio Lecturer: Rob Nowak Scribe: Pam Limpiti Review : Maximum Likelihood Estimatio
More information6.867 Machine learning, lecture 7 (Jaakkola) 1
6.867 Machie learig, lecture 7 (Jaakkola) 1 Lecture topics: Kerel form of liear regressio Kerels, examples, costructio, properties Liear regressio ad kerels Cosider a slightly simpler model where we omit
More informationMulti parameter proximal point algorithms
Multi parameter proximal poit algorithms Ogaeditse A. Boikayo a,b,, Gheorghe Moroşau a a Departmet of Mathematics ad its Applicatios Cetral Europea Uiversity Nador u. 9, H-1051 Budapest, Hugary b Departmet
More informationLecture 7: October 18, 2017
Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem
More informationOn Random Line Segments in the Unit Square
O Radom Lie Segmets i the Uit Square Thomas A. Courtade Departmet of Electrical Egieerig Uiversity of Califoria Los Ageles, Califoria 90095 Email: tacourta@ee.ucla.edu I. INTRODUCTION Let Q = [0, 1] [0,
More informationConvergence of random variables. (telegram style notes) P.J.C. Spreij
Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space
More information6.883: Online Methods in Machine Learning Alexander Rakhlin
6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform
More informationTR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT
TR/46 OCTOBER 974 THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION by A. TALBOT .. Itroductio. A problem i approximatio theory o which I have recetly worked [] required for its solutio a proof that the
More informationAn Introduction to Randomized Algorithms
A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis
More information10-701/ Machine Learning Mid-term Exam Solution
0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it
More information1 Review and Overview
DRAFT a fial versio will be posted shortly CS229T/STATS231: Statistical Learig Theory Lecturer: Tegyu Ma Lecture #3 Scribe: Migda Qiao October 1, 2013 1 Review ad Overview I the first half of this course,
More informationEconomics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters
More informationThe Method of Least Squares. To understand least squares fitting of data.
The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve
More informationSimulation. Two Rule For Inverting A Distribution Function
Simulatio Two Rule For Ivertig A Distributio Fuctio Rule 1. If F(x) = u is costat o a iterval [x 1, x 2 ), the the uiform value u is mapped oto x 2 through the iversio process. Rule 2. If there is a jump
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationAnalysis of Algorithms. Introduction. Contents
Itroductio The focus of this module is mathematical aspects of algorithms. Our mai focus is aalysis of algorithms, which meas evaluatig efficiecy of algorithms by aalytical ad mathematical methods. We
More informationAre adaptive Mann iterations really adaptive?
MATHEMATICAL COMMUNICATIONS 399 Math. Commu., Vol. 4, No. 2, pp. 399-42 (2009) Are adaptive Ma iteratios really adaptive? Kamil S. Kazimierski, Departmet of Mathematics ad Computer Sciece, Uiversity of
More informationA unified framework for high-dimensional analysis of M-estimators with decomposable regularizers
A uified framework for high-dimesioal aalysis of M-estimators with decomposable regularizers Sahad Negahba, UC Berkeley Pradeep Ravikumar, UT Austi Marti Waiwright, UC Berkeley Bi Yu, UC Berkeley NIPS
More informationCMSE 820: Math. Foundations of Data Sci.
Lecture 17 8.4 Weighted path graphs Take from [10, Lecture 3] As alluded to at the ed of the previous sectio, we ow aalyze weighted path graphs. To that ed, we prove the followig: Theorem 6 (Fiedler).
More informationAlgebra of Least Squares
October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal
More informationMinimizing Finite Sums with the Stochastic Average Gradient
Miimizig Fiite Sums with the Stochastic Average Gradiet Mark Schmidt, Nicolas Le Roux, Fracis Bach To cite this versio: Mark Schmidt, Nicolas Le Roux, Fracis Bach. Miimizig Fiite Sums with the Stochastic
More informationJacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3
No-Parametric Techiques Jacob Hays Amit Pillay James DeFelice 4.1, 4.2, 4.3 Parametric vs. No-Parametric Parametric Based o Fuctios (e.g Normal Distributio) Uimodal Oly oe peak Ulikely real data cofies
More information5.1 A mutual information bound based on metric entropy
Chapter 5 Global Fao Method I this chapter, we exted the techiques of Chapter 2.4 o Fao s method the local Fao method) to a more global costructio. I particular, we show that, rather tha costructig a local
More informationGeometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT
OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca
More informationCov(aX, cy ) Var(X) Var(Y ) It is completely invariant to affine transformations: for any a, b, c, d R, ρ(ax + b, cy + d) = a.s. X i. as n.
CS 189 Itroductio to Machie Learig Sprig 218 Note 11 1 Caoical Correlatio Aalysis The Pearso Correlatio Coefficiet ρ(x, Y ) is a way to measure how liearly related (i other words, how well a liear model
More informationAn almost sure invariance principle for trimmed sums of random vectors
Proc. Idia Acad. Sci. Math. Sci. Vol. 20, No. 5, November 200, pp. 6 68. Idia Academy of Scieces A almost sure ivariace priciple for trimmed sums of radom vectors KE-ANG FU School of Statistics ad Mathematics,
More informationLecture 10: Universal coding and prediction
0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved
More informationLecture 9: Boosting. Akshay Krishnamurthy October 3, 2017
Lecture 9: Boostig Akshay Krishamurthy akshay@csumassedu October 3, 07 Recap Last week we discussed some algorithmic aspects of machie learig We saw oe very powerful family of learig algorithms, amely
More information