Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression


Anonymous Author(s)
Affiliation
Address

Abstract

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions and in terms of dependence on the noise and dimension of the problem, and prove dimensionless and tighter rates for a regularized version of this algorithm.

1 Introduction

Many supervised machine learning problems are naturally cast as the minimization of a smooth function defined on a Euclidean space. This includes least-squares regression, logistic regression (see, e.g., Hastie et al., 2009) or generalized linear models (McCullagh and Nelder, 1989). While small problems with few or low-dimensional input features may be solved precisely by many potential optimization algorithms (e.g., Newton's method), large-scale problems with many high-dimensional features are typically solved with simple gradient-based iterative techniques whose per-iteration cost is small.

In this paper, we consider a quadratic objective function f whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite-variance random error. In this stochastic approximation framework (Robbins and Monro, 1951), it is known that two quantities dictate the behavior of various algorithms, namely the covariance matrix V of the noise in the gradients, and the deviation θ_0 − θ_* between the initial point θ_0 of the algorithm and any global minimizer θ_* of f. This leads to a bias/variance decomposition (Bach and Moulines, 2013; Hsu et al., 2014) of the performance of most algorithms as the sum of two terms: (a) the bias term characterizes how fast initial conditions are forgotten and is thus increasing in a well-chosen norm of θ_0 − θ_*; while (b) the variance term characterizes the effect of the noise in the gradients, independently of the starting point, and is increasing in the covariance of the noise.

For quadratic functions with (a) a noise covariance matrix V which is proportional (with constant σ²) to the Hessian of f (a situation which corresponds to least-squares regression) and (b) an initial point characterized by the norm ‖θ_0 − θ_*‖², the optimal bias and variance terms are known separately. On the one hand, the optimal bias term after n iterations is proportional to L‖θ_0 − θ_*‖²/n², where L is the largest eigenvalue of the Hessian of f. This rate is achieved by accelerated gradient descent (Nesterov, 1983, 2004), and is known to be optimal if the number of iterations n is less than the dimension d of the underlying predictors, but the algorithm is not robust to random or deterministic noise in the gradients (d'Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014). On the other hand, the optimal variance term is proportional to σ²d/n (Tsybakov, 2003); it is known to be achieved by averaged gradient descent (Bach and Moulines, 2013), for which the bias term only achieves L‖θ_0 − θ_*‖²/n instead of L‖θ_0 − θ_*‖²/n².
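To restate the two targets discussed above in display form (constants omitted), the optimal bias and variance terms for this problem are

\[
\frac{L\,\|\theta_0-\theta_*\|^2}{n^2}
\qquad\text{and}\qquad
\frac{\sigma^2 d}{n},
\]

and the contribution of this paper is an algorithm attaining both simultaneously.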

Our first contribution in this paper is to analyze in Section 3 averaged accelerated gradient descent, showing that it attains optimal rates for both the variance and the bias terms. It shows that averaging is beneficial for accelerated techniques and provides a provable robustness to noise. While optimal when measuring performance in terms of the dimension d and the initial distance to optimum ‖θ_0 − θ_*‖², these rates are not adapted to many situations where either d is larger than the number of iterations n (i.e., the number of observations for regular stochastic gradient descent) or L‖θ_0 − θ_*‖² is much larger than n². Our second contribution is to provide in Section 4 an analysis of a new algorithm (based on some additional regularization) that can adapt our bounds to finer assumptions on θ_0 − θ_* and the Hessian of the problem, leading in particular to dimension-free quantities that can thus be extended to the Hilbert space setting (in particular for non-parametric estimation).

2 Least-Squares Regression

Statistical Assumptions. We consider the following general setting: H is a d-dimensional Euclidean space with d ≥ 1; the observations (x_n, y_n) ∈ H × R, n ≥ 1, are independent and identically distributed (i.i.d.), and such that E‖x_n‖² and E y_n² are finite. We consider the least-squares regression problem, which is the minimization of the quadratic function f(θ) = (1/2) E(⟨x_n, θ⟩ − y_n)².

Covariance matrix: We denote by Σ = E(x_n ⊗ x_n) ∈ R^{d×d} the population covariance matrix, which is the Hessian of f at all points. Without loss of generality, we can assume Σ invertible. This implies that all eigenvalues of Σ are strictly positive (but they may be arbitrarily small). We assume there exists R > 0 such that E[‖x_n‖² x_n ⊗ x_n] ≼ R² Σ, where A ≼ B means that B − A is positive semi-definite. This assumption is satisfied, for example, for least-squares regression with almost surely bounded data.

Eigenvalue decay: Most convergence bounds depend on the dimension d of H. However it is possible to derive dimension-free and often tighter convergence rates by considering bounds depending on the value tr Σ^b for b ∈ [0, 1]. Given b, if we consider the eigenvalues of Σ ordered in decreasing order, which we denote by s_i, then tr Σ^b = Σ_i s_i^b, which may be much smaller than d when the eigenvalues decay rapidly. For b going to 0, tr Σ^b tends to d and we are back in the classical low-dimensional case. When b = 1, we simply get tr Σ = E‖x_n‖², which will correspond to the weakest assumption in our context.

Optimal predictor: The regression function f(θ) = (1/2) E(⟨x_n, θ⟩ − y_n)² always admits a global minimizer θ_* = Σ^{-1} E(y_n x_n). When initializing algorithms at θ_0 = 0 or regularizing by the squared norm, rates of convergence generally depend on ‖θ_*‖, a quantity which could be arbitrarily large. However there exists the systematic upper bound ‖Σ^{1/2} θ_*‖² ≤ E y_n². This leads naturally to the consideration of convergence bounds depending on ‖Σ^{r/2} θ_*‖ for r ≤ 1.

Noise: We denote by ε_n = y_n − ⟨θ_*, x_n⟩ the residual, for which we have E[ε_n x_n] = 0. Although we do not have E[ε_n | x_n] = 0 in general unless the model is well-specified, we assume the noise to be a structured process such that there exists σ > 0 with E[ε_n² x_n ⊗ x_n] ≼ σ² Σ. This assumption is satisfied, for example, for almost surely bounded data or when the model is well-specified.

Averaged Gradient Methods and Acceleration. We focus in this paper on stochastic gradient methods with acceleration for a quadratic function regularized by (λ/2)‖θ − θ_0‖². The regularization will be useful when deriving tighter convergence rates in Section 4, and it has the additional benefit of making the problem λ-strongly convex.
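To make this setting concrete, the following minimal sketch (illustrative only; the dimension, spectrum, sample size and θ_* below are arbitrary choices, not the paper's) generates i.i.d. observations from a well-specified linear model with a covariance Σ whose eigenvalues decay, and evaluates the quantities tr Σ^b and R² = tr Σ that appear in the assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, n = 25, 10000                                  # dimension and number of observations (arbitrary)
eigvals = 1.0 / np.arange(1, d + 1) ** 3          # decaying spectrum, so tr(Sigma^b) << d for b > 0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal basis
Sigma = Q @ np.diag(eigvals) @ Q.T                # population covariance = Hessian of f

theta_star = rng.standard_normal(d)               # optimal predictor (arbitrary)
sigma = 1.0                                       # noise level

# i.i.d. observations: x_n ~ N(0, Sigma), y_n = <theta_*, x_n> + eps_n (well-specified model,
# so the structured-noise assumption E[eps^2 x x^T] <= sigma^2 Sigma holds with equality)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
y = X @ theta_star + sigma * rng.standard_normal(n)

for b in [0.0, 0.5, 1.0]:                         # b = 0 gives d, b = 1 gives E||x||^2
    print(f"tr(Sigma^{b}) = {np.sum(eigvals ** b):.3f}")
R2 = np.trace(Sigma)                              # "average radius" of the data
print("R^2 = tr(Sigma) =", R2)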
Accelerated stochastic gradient descent is defined by an iterative system with two parameters (θ_n, ν_n) starting from θ_0 = ν_0 ∈ H, and satisfying for n ≥ 1,

    θ_n = ν_{n−1} − γ f'_n(ν_{n−1}) − γλ(ν_{n−1} − θ_0),
    ν_n = θ_n + δ_n (θ_n − θ_{n−1}),                                  (1)

with (γ, δ_n) ∈ R² and f'_n(θ) an unbiased estimate of the gradient f'(θ). The momentum coefficient δ_n ∈ R is chosen to accelerate the convergence rate (Nesterov, 1983; Beck and Teboulle, 2009) and has its roots in the heavy-ball algorithm from Polyak (1964). We especially concentrate here, following Polyak and Juditsky (1992), on the average of the sequence, θ̄_n = (1/(n+1)) Σ_{i=0}^{n} θ_i.
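As an illustration, here is a minimal Python sketch of recursion (1) with averaging (not the authors' implementation; the function name av_acc_sgd and the use of the single-pass least-squares gradient f'_n(θ) = (⟨x_n, θ⟩ − y_n) x_n are our own choices, and a constant momentum delta stands in for a general sequence δ_n). Setting lam = 0 and delta = 1 gives the averaged accelerated method studied in Section 3.

import numpy as np

def av_acc_sgd(X, y, gamma, lam=0.0, delta=1.0, theta0=None):
    """Single pass of accelerated stochastic gradient descent, Eq. (1), with averaging.

    Stochastic gradient: f'_n(theta) = (<x_n, theta> - y_n) x_n (least-squares oracle).
    lam = 0, delta = 1 -> averaged accelerated method of Section 3;
    lam > 0 adds the regularization (lam/2) ||theta - theta_0||^2 used in Section 4.
    """
    n, d = X.shape
    theta0 = np.zeros(d) if theta0 is None else theta0
    theta = theta0.copy()          # theta_{n-1}
    nu = theta0.copy()             # nu_{n-1}
    theta_bar = theta0.copy()      # running average of theta_0, ..., theta_n
    for k in range(n):
        x, yk = X[k], y[k]
        grad = (x @ nu - yk) * x                                       # unbiased estimate of f'(nu_{n-1})
        theta_new = nu - gamma * grad - gamma * lam * (nu - theta0)    # theta_n
        nu = theta_new + delta * (theta_new - theta)                   # nu_n, momentum step
        theta = theta_new
        theta_bar += (theta - theta_bar) / (k + 2)                     # average over the k+2 iterates so far
    return theta_bar

For the regularized variant of Section 4 one would take lam = 1/(γ(n+1)²); the additive-noise oracle described below can be substituted for the least-squares gradient without changing the structure of the loop.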

Stochastic Oracles on the Gradient. Let (F_n)_{n≥0} be the increasing family of σ-fields that are generated by all variables (x_i, y_i) for i ≤ n. The oracle we consider is the sum of the true gradient f'(θ) and an independent zero-mean noise that does not depend on θ.¹ Consequently it is of the form f'_n(θ) = f'(θ) − ξ_n, where the noise process ξ_n is F_n-measurable with E[ξ_n | F_{n−1}] = 0 and E[‖ξ_n‖²] finite. Furthermore we also assume that there exists τ ∈ R such that E[ξ_n ⊗ ξ_n] ≼ τ² Σ, that is, the noise has a particular structure adapted to least-squares regression.

¹ This is different from the oracle usually considered in stochastic approximation (see Bach and Moulines (2013); Dieuleveut and Bach (2015)).

3 Accelerated Stochastic Averaged Gradient Descent

We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for λ = 0 and δ_n = 1. It can be rewritten for the quadratic function f as a second-order iterative system with constant coefficients: θ_n = [I − γΣ](2θ_{n−1} − θ_{n−2}) + γ y_n x_n.

Theorem 1. For any constant step-size γ such that γΣ ≼ I,

    E f(θ̄_n) − f(θ_*) ≤ 36 ‖θ_0 − θ_*‖² / (γ(n+1)²) + 8 τ² d / (n+1).    (2)

We can make the following observations:

– The first bound, 36 ‖θ_0 − θ_*‖² / (γ(n+1)²), in Eq. (2) corresponds to the usual accelerated rate. It has been shown by Nesterov (2004) to be the optimal rate of convergence for optimizing a quadratic function with a first-order method that can access only sequences of gradients, when n ≤ d. By averaging an algorithm dedicated to strongly convex functions, we recover the traditional convergence rate for non-strongly-convex functions.

– The second bound in Eq. (2) matches the optimal statistical performance τ²d/n over all estimators in H (Tsybakov, 2008), even without computational limits, in the sense that no estimator that uses the same information can improve upon this rate. Accordingly this algorithm achieves joint bias/variance optimality (when measured in terms of τ² and ‖θ_0 − θ_*‖²).

– We have the same rate of convergence for the bias when compared to the regular Nesterov acceleration without averaging studied by Flammarion and Bach (2015), which corresponds to choosing δ_n = 1 − 2/n for all n. However if the problem is µ-strongly convex, the latter was shown to also converge at the linear rate O((1 − γµ)^n) and thus is adaptive to hidden strong convexity (since the algorithm does not need to know µ to run), and thus ends up converging faster than the rate 1/n². This is confirmed in our experiments in Section 5.

Overall, the bias term is improved whereas the variance term is not degraded, and acceleration is thus robust to noise in the gradients. Thereby, while second-order iterative methods for optimizing quadratic functions in the singular case, such as conjugate gradient (Polyak, 1987, Section 6.1), are notoriously highly sensitive to noise, we are able to propose a version which is robust to stochastic noise.

4 Tighter Convergence Rates

We have seen in Theorem 1 above that the averaged accelerated gradient algorithm matches the lower bounds τ²d/n and L‖θ_0 − θ_*‖²/n² for the prediction error. However the algorithm performs better in almost all cases except the worst-case scenarios corresponding to the lower bounds. For example the algorithm may still predict well when the dimension d is much bigger than n. Similarly the norm of the optimal predictor ‖θ_*‖² may be huge and the prediction still good, as gradient algorithms happen to be adaptive to the difficulty of the problem: indeed, if the problem is simpler, the convergence rate of the gradient algorithm will be improved. In this section, we provide such a theoretical guarantee. We study the convergence of averaged accelerated stochastic gradient descent defined by Eq. (1) for λ = (γ(n+1)²)^{-1} and δ_n ∈ [ · , 1]. We have the following theorem:

Theorem 2. For any constant step-size γ such that γ(Σ + λI) ≼ I,

    E f(θ̄_n) − f(θ_*) ≤ min_{r∈[0,1], b∈[0,1]} [ 74 ‖Σ^{r/2}(θ_0 − θ_*)‖² / (γ^{1−r}(n+1)^{2(1−r)}) + 8 τ² γ^b tr(Σ^b) / (n+1)^{1−2b} ].

We can make the following observations:

– The algorithm is independent of r and b, thus all the bounds for different values of (r, b) are valid. This is a strong property of the algorithm, which is indeed adaptive to the regularity and the effective dimension of the problem (once γ is chosen).

– In situations in which either d is larger than n or L‖θ_0 − θ_*‖² is larger than n², the algorithm can still enjoy good convergence properties, by adapting to the best values of b and r.

– For b = 0 we recover the variance term of Theorem 1, but for b > 0 and fast decays of the eigenvalues of Σ, the bound may be much smaller; note that we lose in the dependency on n, but typically, for large d, this can be advantageous. With r and b well chosen, we recover the optimal rate for non-parametric regression (Caponnetto and De Vito, 2007).

5 Experiments

We now illustrate our theoretical results on synthetic examples. For d = 25 we consider normally distributed inputs x_n with a random covariance matrix Σ which has eigenvalues 1/i³, for i = 1, ..., d, and a random optimum θ_* and starting point θ_0 such that ‖θ_0 − θ_*‖ = 1. The outputs y_n are generated from a linear function with homoscedastic noise with unit signal-to-noise ratio (σ² = 1); we take R² = tr Σ, the average radius of the data, a step-size γ = 1/R², and λ = 0. The additive noise oracle is used. We show results averaged over 10 replications. We compare the performance of averaged SGD (AvSGD), usual Nesterov acceleration for convex functions (AccSGD) and our novel averaged accelerated SGD (AvAccSGD)², on two different problems: one deterministic (‖θ_0 − θ_*‖ = 1, σ² = 0) which illustrates how the bias term behaves, and one purely stochastic (‖θ_0 − θ_*‖ = 0, σ² = 1) which illustrates how the variance term behaves.

For the bias (left plot of Figure 1), AvSGD converges at speed O(1/n), while AvAccSGD and AccSGD both converge at speed O(1/n²). However, as mentioned in the observations following Theorem 1, AccSGD takes advantage of the hidden strong convexity of the quadratic function and starts converging linearly at the end. For the variance (right plot of Figure 1), AccSGD does not converge to the optimum and keeps oscillating, whereas AvSGD and AvAccSGD both converge to the optimum at a speed O(1/n). However AvSGD remains slightly faster in the beginning. Note that for small n, or when the bias L‖θ_0 − θ_*‖²/n² is much bigger than the variance σ²d/n, the bias may have a stronger effect, although asymptotically, the variance always dominates. It is thus essential to have an algorithm which is optimal in both regimes; this is achieved by AvAccSGD.

[Figure 1: Synthetic problem (d = 25) and γ = 1/R². Left: bias. Right: variance. Both panels plot log₁₀[f(θ) − f(θ_*)] against log₁₀(n) for AvAccSGD, AccSGD and AvSGD.]

² AvAccSGD is not the averaging of AccSGD, because the momentum term is proportional to 1 − 3/n for AccSGD instead of being equal to 1 for AvAccSGD.
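A minimal sketch of this comparison follows (not the authors' code: for brevity it runs a single noisy well-specified problem with the usual single-pass least-squares gradient, rather than the additive noise oracle and the separate bias-only/variance-only runs of Figure 1, and the AccSGD momentum 1 − 3/(k+3) is an illustrative choice consistent with footnote 2).

import numpy as np

def run(method, X, y, gamma):
    """One pass over (X, y) with least-squares gradients; returns the estimate used for 'method'."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_prev = np.zeros(d)
    theta_bar = np.zeros(d)
    for k in range(n):
        x, yk = X[k], y[k]
        if method == "AvSGD":                                  # plain stochastic gradient step
            theta = theta - gamma * (x @ theta - yk) * x
        else:                                                  # momentum (Nesterov-type) step
            delta = 1.0 if method == "AvAccSGD" else 1.0 - 3.0 / (k + 3)
            nu = theta + delta * (theta - theta_prev)
            theta_prev = theta
            theta = nu - gamma * (x @ nu - yk) * x
        theta_bar += (theta - theta_bar) / (k + 2)             # running average of the iterates
    return theta if method == "AccSGD" else theta_bar          # AccSGD is reported without averaging

rng = np.random.default_rng(0)
d, n = 25, 100_000
eigvals = 1.0 / np.arange(1, d + 1) ** 3
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(eigvals) @ Q.T
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)                       # ||theta_0 - theta_*|| = 1 with theta_0 = 0
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
y = X @ theta_star + rng.standard_normal(n)                    # sigma^2 = 1
gamma = 1.0 / np.trace(Sigma)                                  # gamma = 1 / R^2

for method in ["AvSGD", "AccSGD", "AvAccSGD"]:
    theta_hat = run(method, X, y, gamma)
    excess_risk = 0.5 * (theta_hat - theta_star) @ Sigma @ (theta_hat - theta_star)   # f(theta) - f(theta_*)
    print(f"{method}: excess risk = {excess_risk:.2e}")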

References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1).

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3).

d'Aspremont, A. (2008). Smooth optimization with approximate gradient. SIAM J. Optim., 19(3).

Devolder, O., Glineur, F., and Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Math. Program., 146(1-2, Ser. A).

Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics. To appear.

Flammarion, N. and Bach, F. (2015). From averaging to acceleration, there is only a step-size. In Proceedings of the International Conference on Learning Theory (COLT).

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer, second edition.

Hsu, D., Kakade, S. M., and Zhang, T. (2014). Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3).

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Monographs on Statistics and Applied Probability. Chapman & Hall, London, second edition.

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2).

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17.

Polyak, B. T. (1987). Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4).

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3).

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.

Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer.
