A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Journal of Machine Learning Research 14 (2013) 1505-1511. Submitted 5/12; Revised 3/13; Published 6/13.

Paramveer S. Dhillon (DHILLON@CIS.UPENN.EDU)
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Dean P. Foster (FOSTER@WHARTON.UPENN.EDU)
Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA

Sham M. Kakade (SKAKADE@MICROSOFT.COM)
Microsoft Research, One Memorial Drive, Cambridge, MA 02142, USA

Lyle H. Ungar (UNGAR@CIS.UPENN.EDU)
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Gabor Lugosi

Abstract

We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR).

Keywords: risk inflation, ridge regression, PCA

1. Introduction

Consider the fixed design setting where we have a set of n vectors X = {X_i}, and let X denote the matrix whose i-th row is X_i. The observed label vector is Y \in \mathbb{R}^n. Suppose that

    Y = X\beta + \varepsilon,

where \varepsilon is independent noise in each coordinate, with the variance of \varepsilon_i being \sigma^2. The objective is to learn E[Y] = X\beta. The expected loss of a vector estimator \beta is

    L(\beta) = \frac{1}{n} E_Y\big[\|Y - X\beta\|^2\big].

Let \hat\beta be an estimator of \beta (constructed from a sample Y).

Denoting

    \Sigma := \frac{1}{n} X^T X,

we have that the risk (i.e., expected excess loss) is

    \mathrm{Risk}(\hat\beta) := E_{\hat\beta}\big[L(\hat\beta) - L(\beta)\big] = E_{\hat\beta}\,\|\hat\beta - \beta\|_\Sigma^2,

where \|x\|_\Sigma^2 = x^T \Sigma x and where the expectation is with respect to the randomness in Y.

We show that a simple variant of ordinary (un-regularized) least squares always compares favorably to ridge regression (as measured by the risk). This observation is based on the following bias-variance decomposition:

    \mathrm{Risk}(\hat\beta) = \underbrace{E\,\|\hat\beta - \bar\beta\|_\Sigma^2}_{\text{Variance}} + \underbrace{\|\bar\beta - \beta\|_\Sigma^2}_{\text{Prediction Bias}},    (1)

where \bar\beta = E[\hat\beta].

1.1 The Risk of Ridge Regression (RR)

Ridge regression, or Tikhonov regularization (Tikhonov, 1963), penalizes the l_2 norm of the parameter vector \beta and shrinks it towards zero, penalizing large values more. The estimator is

    \hat\beta_\lambda = \arg\min_\beta \Big\{ \frac{1}{n}\|Y - X\beta\|^2 + \lambda \|\beta\|^2 \Big\}.

The closed-form estimate is then

    \hat\beta_\lambda = (\Sigma + \lambda I)^{-1}\Big(\frac{1}{n} X^T Y\Big).

Note that

    \hat\beta_0 = \hat\beta_{\lambda=0} = \arg\min_\beta \Big\{ \frac{1}{n}\|Y - X\beta\|^2 \Big\}

is the ordinary least squares estimator. Without loss of generality, rotate X such that

    \Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p),

where the \lambda_j's are ordered in decreasing order. To see the nature of this shrinkage, observe that

    [\hat\beta_\lambda]_j = \frac{\lambda_j}{\lambda_j + \lambda}\,[\hat\beta_0]_j,

where \hat\beta_0 is the ordinary least squares estimator. Using the bias-variance decomposition (Equation 1), we have:

Lemma 1

    \mathrm{Risk}(\hat\beta_\lambda) = \sum_j \Bigg[ \frac{\sigma^2}{n}\Big(\frac{\lambda_j}{\lambda_j + \lambda}\Big)^2 + \frac{\beta_j^2 \lambda_j}{(1 + \lambda_j/\lambda)^2} \Bigg].

The proof is straightforward and is provided in the appendix.
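To make the closed form and the shrinkage identity above concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper; the design, noise level, and seed are arbitrary choices). It forms \hat\beta_\lambda = (\Sigma + \lambda I)^{-1}(1/n)X^T Y, rotates into the PCA coordinate system, and checks that each rotated coordinate of the ridge estimate is the corresponding OLS coordinate shrunk by \lambda_j/(\lambda_j + \lambda).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma = 200, 5, 0.5, 1.0

X = rng.normal(size=(n, p))            # fixed design (one draw, then held fixed)
beta = rng.normal(size=p)              # true parameter
Y = X @ beta + sigma * rng.normal(size=n)

Sigma = X.T @ X / n                    # Sigma := (1/n) X^T X

# Closed-form ridge and OLS estimates in the original coordinates.
beta_ridge = np.linalg.solve(Sigma + lam * np.eye(p), X.T @ Y / n)
beta_ols = np.linalg.solve(Sigma, X.T @ Y / n)

# Rotate into the PCA coordinate system, where Sigma = diag(lambda_1, ..., lambda_p).
eigvals, V = np.linalg.eigh(Sigma)     # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

ridge_rot = V.T @ beta_ridge
ols_rot = V.T @ beta_ols

# Per-coordinate shrinkage: [beta_ridge]_j = lambda_j / (lambda_j + lam) * [beta_ols]_j.
assert np.allclose(ridge_rot, eigvals / (eigvals + lam) * ols_rot)
print("shrinkage factors:", eigvals / (eigvals + lam))
```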

2. Ordinary Least Squares with PCA (PCA-OLS)

Now let us construct a simple estimator based on \lambda. Note that our rotated coordinate system, in which \Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), corresponds to the PCA coordinate system. Consider the following ordinary least squares estimator on the "top" PCA subspace; it uses the least squares estimate on coordinate j if \lambda_j \ge \lambda and 0 otherwise:

    [\hat\beta_{\mathrm{PCA},\lambda}]_j = \begin{cases} [\hat\beta_0]_j & \text{if } \lambda_j \ge \lambda, \\ 0 & \text{otherwise.} \end{cases}

The following claim shows that this estimator compares favorably to the ridge estimator (for every \lambda), no matter how the \lambda is chosen, for example, using cross validation or any other strategy. Our main theorem (Theorem 2) bounds the risk ratio / risk inflation (footnote 1) of the PCA-OLS and RR estimators.

Theorem 2 (Bounded Risk Inflation) For all \lambda \ge 0, we have that

    0 \le \frac{\mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda})}{\mathrm{Risk}(\hat\beta_\lambda)} \le 4,

and the left hand inequality is tight.

Proof  Using the bias-variance decomposition of the risk, we can write the risk as

    \mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda}) = \sum_j \frac{\sigma^2}{n}\,\mathbb{1}[\lambda_j \ge \lambda] + \sum_{j:\,\lambda_j < \lambda} \lambda_j \beta_j^2.

The first term represents the variance and the second the bias. The ridge regression risk is given by Lemma 1. We now show that the j-th term in the expression for the PCA risk is within a factor 4 of the j-th term of the ridge regression risk.

First, consider the case \lambda_j \ge \lambda; then the ratio of the j-th terms is

    \frac{\sigma^2/n}{\frac{\sigma^2}{n}\big(\frac{\lambda_j}{\lambda_j+\lambda}\big)^2 + \frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2}} \le \frac{\sigma^2/n}{\frac{\sigma^2}{n}\big(\frac{\lambda_j}{\lambda_j+\lambda}\big)^2} = \Big(1 + \frac{\lambda}{\lambda_j}\Big)^2 \le 4.

Similarly, if \lambda_j < \lambda, the ratio of the j-th terms is

    \frac{\lambda_j\beta_j^2}{\frac{\sigma^2}{n}\big(\frac{\lambda_j}{\lambda_j+\lambda}\big)^2 + \frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2}} \le \frac{\lambda_j\beta_j^2}{\frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2}} = \Big(1 + \frac{\lambda_j}{\lambda}\Big)^2 \le 4.

Since each term is within a factor of 4, the proof is complete.

It is worth noting that the converse is not true: the ridge regression estimator (RR) can be arbitrarily worse than the PCA-OLS estimator. An example which shows that the left hand inequality is tight is given in the Appendix.

Footnote 1: Risk inflation has also been used as a criterion for evaluating feature selection procedures (Foster and George, 1994).
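For intuition about Theorem 2, the following sketch (again our own, not from the paper; the spectrum, \beta, and \sigma^2 are arbitrary) evaluates the two closed-form risk expressions, Lemma 1 for ridge and the variance-plus-bias expression above for PCA-OLS, on a synthetic diagonal problem and confirms that the ratio stays below 4 across a grid of \lambda values.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, sigma2 = 50, 100, 1.0

eigs = np.sort(rng.uniform(0.01, 5.0, size=p))[::-1]   # lambda_1 >= ... >= lambda_p
beta = rng.normal(size=p)                               # true parameter in PCA coordinates

def risk_ridge(lam):
    # Lemma 1: sum_j [ (sigma^2/n)(lambda_j/(lambda_j+lam))^2 + beta_j^2 lambda_j / (1+lambda_j/lam)^2 ]
    var = (sigma2 / n) * (eigs / (eigs + lam)) ** 2
    bias = beta ** 2 * eigs / (1.0 + eigs / lam) ** 2
    return np.sum(var + bias)

def risk_pca_ols(lam):
    # sum_j (sigma^2/n) 1[lambda_j >= lam]  +  sum_{j: lambda_j < lam} lambda_j beta_j^2
    keep = eigs >= lam
    return (sigma2 / n) * keep.sum() + np.sum(eigs[~keep] * beta[~keep] ** 2)

for lam in np.logspace(-3, 2, 11):
    ratio = risk_pca_ols(lam) / risk_ridge(lam)
    assert ratio <= 4.0 + 1e-12                         # Theorem 2 bound
    print(f"lambda = {lam:8.3f}   risk ratio = {ratio:.3f}")
```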

3. Experiments

First, we generated synthetic data with p = 100 and varying values of n = {20, 50, 80, 110}. The data was generated in a fixed design setting as Y = X\beta + \varepsilon, where \varepsilon_i \sim N(0,1) for all i = 1, \ldots, n. Furthermore, X_{n \times p} \sim MVN(0, I_{p \times p}), where MVN(\mu, \Sigma) is the multivariate normal distribution with mean vector \mu and variance-covariance matrix \Sigma, and \beta_j \sim N(0,1) for all j = 1, \ldots, p. The results are shown in Figure 1. As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is never worse than 4, and is often better than 1, as dictated by Theorem 2.

Next, we chose two real world data sets, namely USPS (n = 1500, p = 241) and BCI (n = 400, p = 117) (footnote 2). Since we do not know the true model for these data sets, we used all the n observations to fit an OLS regression and used it as an estimate of the true parameter \beta. This is a reasonable approximation to the true parameter, as we estimate the ridge regression (RR) and PCA-OLS models on a small subset of these n observations. Next, we chose a random subset of the observations, namely 0.2p, 0.5p and 0.8p, to fit the ridge regression (RR) and PCA-OLS models. The results are shown in Figure 2. As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is again within a factor of 4, and often PCA-OLS is better, that is, the ratio is < 1.

Footnote 2: The details about the data sets can be found here: http://olivier.chapelle.cc/ssl-book/benchmarks.html.

[Figure 1: Plots showing the risk ratio as a function of \lambda, the regularization parameter, and n, for the synthetic data set (panels labeled n = 0.2p, 0.5p, 0.8p, 1.1p). p = 100 in all the cases. The error bars correspond to one standard deviation for 100 such random trials.]

[Figure 2: Plots showing the risk ratio as a function of \lambda, the regularization parameter, and n, for two real world data sets (BCI and USPS, top to bottom; panels labeled n = 0.2p, 0.5p, 0.8p, 1.1p).]
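A rough reconstruction of the synthetic experiment is sketched below (assumptions of ours, not from the paper: a single fixed design draw, 100 noise replications, and a single \lambda, whereas the paper sweeps \lambda and n). The risk is approximated by averaging the excess loss \|\hat\beta - \beta\|_\Sigma^2 over the noise draws.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, lam, trials = 100, 50, 1.0, 100

X = rng.normal(size=(n, p))                    # fixed design, held fixed across trials
beta = rng.normal(size=p)                      # beta_j ~ N(0, 1)
Sigma = X.T @ X / n
eigs, V = np.linalg.eigh(Sigma)

def excess_loss(b_hat):
    d = b_hat - beta
    return d @ Sigma @ d                       # ||b_hat - beta||_Sigma^2

ridge_risk, pca_risk = 0.0, 0.0
for _ in range(trials):
    Y = X @ beta + rng.normal(size=n)          # epsilon_i ~ N(0, 1)
    XtY_n = X.T @ Y / n
    b_ridge = np.linalg.solve(Sigma + lam * np.eye(p), XtY_n)
    # PCA-OLS: OLS restricted to eigen-directions with eigenvalue >= lam.
    keep = eigs >= lam
    b_pca = V[:, keep] @ (np.diag(1.0 / eigs[keep]) @ (V[:, keep].T @ XtY_n))
    ridge_risk += excess_loss(b_ridge) / trials
    pca_risk += excess_loss(b_pca) / trials

print(f"estimated risk ratio (PCA-OLS / RR) at lambda={lam}: {pca_risk / ridge_risk:.2f}")
```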

4. Conclusion

We showed that the risk inflation of a particular ordinary least squares estimator (on the "top" PCA subspace) is within a factor 4 of the ridge estimator. It turns out the converse is not true: this PCA estimator may be arbitrarily better than the ridge one.

Appendix A.

Proof of Lemma 1  We analyze the bias-variance decomposition in Equation 1. For the variance,

    E_Y\|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2
      = \sum_j \lambda_j\, E_Y\big([\hat\beta_\lambda]_j - [\bar\beta_\lambda]_j\big)^2
      = \sum_j \frac{\lambda_j}{(\lambda_j+\lambda)^2}\, E\Bigg[\Big(\frac{1}{n}\sum_{i=1}^n (Y_i - E[Y_i])[X_i]_j\Big)\Big(\frac{1}{n}\sum_{i'=1}^n (Y_{i'} - E[Y_{i'}])[X_{i'}]_j\Big)\Bigg]
      = \sum_j \frac{\lambda_j}{(\lambda_j+\lambda)^2}\,\frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(Y_i)\,[X_i]_j^2
      = \sum_j \frac{\lambda_j}{(\lambda_j+\lambda)^2}\,\frac{\sigma^2}{n^2}\sum_{i=1}^n [X_i]_j^2
      = \frac{\sigma^2}{n}\sum_j \frac{\lambda_j^2}{(\lambda_j+\lambda)^2},

where the last step uses \sum_{i=1}^n [X_i]_j^2 = n\lambda_j (since \Sigma = \frac{1}{n}X^T X = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)).

Similarly, for the bias,

    \|\bar\beta_\lambda - \beta\|_\Sigma^2 = \sum_j \lambda_j\big([\bar\beta_\lambda]_j - [\beta]_j\big)^2 = \sum_j \lambda_j \beta_j^2\Big(\frac{\lambda_j}{\lambda_j+\lambda} - 1\Big)^2 = \sum_j \frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2},

which completes the proof.

The risk of RR can be arbitrarily worse than that of the PCA-OLS estimator. Consider the standard OLS setting described in Section 1, in which X is an n \times p matrix and Y is an n \times 1 vector. Let X = \sqrt{n}\,\mathrm{diag}(\sqrt{1+\alpha}, 1, \ldots, 1), so that \Sigma = \frac{1}{n}X^T X = \mathrm{diag}(1+\alpha, 1, \ldots, 1) for some \alpha > 0, and choose \beta = [2+\alpha, 0, \ldots, 0]^T. For convenience, also choose \sigma^2/n = 1. Then, using Lemma 1, the risk of the RR estimator is

    \mathrm{Risk}(\hat\beta_\lambda) = \underbrace{\Big(\frac{1+\alpha}{1+\alpha+\lambda}\Big)^2}_{I} + \underbrace{\frac{p-1}{(1+\lambda)^2}}_{II} + \underbrace{\frac{(2+\alpha)^2(1+\alpha)}{\big(1+\frac{1+\alpha}{\lambda}\big)^2}}_{III}.

Let us consider two cases.

Case 1: \lambda < (p-1)^{1/3} - 1; then II > (p-1)^{1/3}.

Case 2: \lambda > 1; then 1 + \frac{1+\alpha}{\lambda} < 2+\alpha, hence III > 1+\alpha.

Combining these two cases (which together cover all \lambda \ge 0 once \alpha > 1), we get, for all \lambda, \mathrm{Risk}(\hat\beta_\lambda) > \min\big((p-1)^{1/3},\, 1+\alpha\big). If we choose p such that p - 1 = (1+\alpha)^3, then \mathrm{Risk}(\hat\beta_\lambda) > 1+\alpha.

The PCA-OLS risk (from Theorem 2) is

    \mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda}) = \sum_j \frac{\sigma^2}{n}\,\mathbb{1}[\lambda_j \ge \lambda] + \sum_{j:\,\lambda_j < \lambda} \lambda_j \beta_j^2.

Considering \lambda \in (1, 1+\alpha), the first term contributes 1 to the risk and everything else is 0. So the risk of PCA-OLS is 1 and the risk ratio is

    \frac{\mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda})}{\mathrm{Risk}(\hat\beta_\lambda)} \le \frac{1}{1+\alpha}.

Now, for large \alpha, the risk ratio \to 0.
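As a numerical sanity check on this construction (our own illustration, not from the paper; the value of \alpha and the \lambda grid are arbitrary), the sketch below evaluates the two closed-form risks for the diagonal design above, with \sigma^2/n = 1, and shows the ratio dropping to roughly 1/(1+\alpha) for \lambda between 1 and 1+\alpha, while never exceeding 4.

```python
import numpy as np

alpha = 50.0
p = int((1 + alpha) ** 3) + 1                  # p - 1 = (1 + alpha)^3
eigs = np.concatenate(([1 + alpha], np.ones(p - 1)))
beta = np.concatenate(([2 + alpha], np.zeros(p - 1)))
# sigma^2 / n = 1 by the convenience choice above.

def risk_ridge(lam):
    # Lemma 1 with sigma^2/n = 1.
    return np.sum((eigs / (eigs + lam)) ** 2 + beta ** 2 * eigs / (1 + eigs / lam) ** 2)

def risk_pca(lam):
    keep = eigs >= lam
    return keep.sum() + np.sum(eigs[~keep] * beta[~keep] ** 2)

for lam in [0.5, 2.0, alpha / 2, alpha, 10 * alpha]:
    print(f"lambda = {lam:8.1f}   PCA-OLS/RR risk ratio = {risk_pca(lam) / risk_ridge(lam):.4f}")
```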

References

D. P. Foster and E. I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947-1975, 1994.

A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 4, pages 501-504, 1963.