A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Journal of Machine Learning Research 14 (2013) 1505-1511. Submitted 5/12; Revised 3/13; Published 6/13.

Paramveer S. Dhillon (DHILLON@CIS.UPENN.EDU)
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Dean P. Foster (FOSTER@WHARTON.UPENN.EDU)
Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA

Sham M. Kakade (SKAKADE@MICROSOFT.COM)
Microsoft Research, One Memorial Drive, Cambridge, MA 02142, USA

Lyle H. Ungar (UNGAR@CIS.UPENN.EDU)
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Gabor Lugosi

Abstract

We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR).

Keywords: risk inflation, ridge regression, PCA

1. Introduction

Consider the fixed design setting where we have a set of n vectors X = {X_i}, and let X denote the matrix whose i-th row is X_i. The observed label vector is Y \in \mathbb{R}^n. Suppose that

    Y = X\beta + \varepsilon,

where \varepsilon is independent noise in each coordinate, with the variance of \varepsilon_i being \sigma^2. The objective is to learn E[Y] = X\beta. The expected loss of a vector estimator \beta is

    L(\beta) = \frac{1}{n} E_Y\big[\|Y - X\beta\|^2\big].

Let \hat\beta be an estimator of \beta (constructed from a sample Y).

Denoting

    \Sigma := \frac{1}{n} X^T X,

we have that the risk (i.e., expected excess loss) is

    \mathrm{Risk}(\hat\beta) := E_{\hat\beta}\big[L(\hat\beta) - L(\beta)\big] = E_{\hat\beta}\,\|\hat\beta - \beta\|_\Sigma^2,

where \|x\|_\Sigma^2 = x^T \Sigma x and where the expectation is with respect to the randomness in Y.

We show that a simple variant of ordinary (un-regularized) least squares always compares favorably to ridge regression (as measured by the risk). This observation is based on the following bias-variance decomposition:

    \mathrm{Risk}(\hat\beta) = \underbrace{E\,\|\hat\beta - \bar\beta\|_\Sigma^2}_{\text{Variance}} + \underbrace{\|\bar\beta - \beta\|_\Sigma^2}_{\text{Prediction Bias}},    (1)

where \bar\beta = E[\hat\beta].

1.1 The Risk of Ridge Regression (RR)

Ridge regression, or Tikhonov regularization (Tikhonov, 1963), penalizes the l_2 norm of the parameter vector \beta and shrinks it towards zero, penalizing large values more. The estimator is

    \hat\beta_\lambda = \arg\min_\beta \Big\{ \frac{1}{n}\|Y - X\beta\|^2 + \lambda \|\beta\|^2 \Big\}.

The closed-form estimate is then

    \hat\beta_\lambda = (\Sigma + \lambda I)^{-1}\Big(\frac{1}{n} X^T Y\Big).

Note that

    \hat\beta_0 = \hat\beta_{\lambda=0} = \arg\min_\beta \Big\{ \frac{1}{n}\|Y - X\beta\|^2 \Big\}

is the ordinary least squares estimator. Without loss of generality, rotate X such that

    \Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p),

where the \lambda_j's are ordered in decreasing order. To see the nature of this shrinkage, observe that

    [\hat\beta_\lambda]_j = \frac{\lambda_j}{\lambda_j + \lambda}\,[\hat\beta_0]_j,

where \hat\beta_0 is the ordinary least squares estimator. Using the bias-variance decomposition (Equation 1), we have:

Lemma 1

    \mathrm{Risk}(\hat\beta_\lambda) = \sum_j \Bigg[ \frac{\sigma^2}{n}\Big(\frac{\lambda_j}{\lambda_j + \lambda}\Big)^2 + \frac{\beta_j^2 \lambda_j}{(1 + \lambda_j/\lambda)^2} \Bigg].

The proof is straightforward and is provided in the appendix.
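To make the closed form and the shrinkage identity above concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper; the design, noise level, and seed are arbitrary choices). It forms \hat\beta_\lambda = (\Sigma + \lambda I)^{-1}(1/n)X^T Y, rotates into the PCA coordinate system, and checks that each rotated coordinate of the ridge estimate is the corresponding OLS coordinate shrunk by \lambda_j/(\lambda_j + \lambda).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma = 200, 5, 0.5, 1.0

X = rng.normal(size=(n, p))            # fixed design (one draw, then held fixed)
beta = rng.normal(size=p)              # true parameter
Y = X @ beta + sigma * rng.normal(size=n)

Sigma = X.T @ X / n                    # Sigma := (1/n) X^T X

# Closed-form ridge and OLS estimates in the original coordinates.
beta_ridge = np.linalg.solve(Sigma + lam * np.eye(p), X.T @ Y / n)
beta_ols = np.linalg.solve(Sigma, X.T @ Y / n)

# Rotate into the PCA coordinate system, where Sigma = diag(lambda_1, ..., lambda_p).
eigvals, V = np.linalg.eigh(Sigma)     # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

ridge_rot = V.T @ beta_ridge
ols_rot = V.T @ beta_ols

# Per-coordinate shrinkage: [beta_ridge]_j = lambda_j / (lambda_j + lam) * [beta_ols]_j.
assert np.allclose(ridge_rot, eigvals / (eigvals + lam) * ols_rot)
print("shrinkage factors:", eigvals / (eigvals + lam))
```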

2. Ordinary Least Squares with PCA (PCA-OLS)

Now let us construct a simple estimator based on \lambda. Note that our rotated coordinate system, in which \Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), corresponds to the PCA coordinate system. Consider the following ordinary least squares estimator on the "top" PCA subspace; it uses the least squares estimate on coordinate j if \lambda_j \ge \lambda and 0 otherwise:

    [\hat\beta_{\mathrm{PCA},\lambda}]_j = \begin{cases} [\hat\beta_0]_j & \text{if } \lambda_j \ge \lambda, \\ 0 & \text{otherwise.} \end{cases}

The following claim shows that this estimator compares favorably to the ridge estimator (for every \lambda), no matter how the \lambda is chosen, for example, using cross validation or any other strategy. Our main theorem (Theorem 2) bounds the risk ratio / risk inflation (footnote 1) of the PCA-OLS and RR estimators.

Theorem 2 (Bounded Risk Inflation) For all \lambda \ge 0, we have that

    0 \le \frac{\mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda})}{\mathrm{Risk}(\hat\beta_\lambda)} \le 4,

and the left hand inequality is tight.

Proof  Using the bias-variance decomposition of the risk, we can write the risk as

    \mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda}) = \sum_j \frac{\sigma^2}{n}\,\mathbb{1}[\lambda_j \ge \lambda] + \sum_{j:\,\lambda_j < \lambda} \lambda_j \beta_j^2.

The first term represents the variance and the second the bias. The ridge regression risk is given by Lemma 1. We now show that the j-th term in the expression for the PCA risk is within a factor 4 of the j-th term of the ridge regression risk.

First, consider the case \lambda_j \ge \lambda; then the ratio of the j-th terms is

    \frac{\sigma^2/n}{\frac{\sigma^2}{n}\big(\frac{\lambda_j}{\lambda_j+\lambda}\big)^2 + \frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2}} \le \frac{\sigma^2/n}{\frac{\sigma^2}{n}\big(\frac{\lambda_j}{\lambda_j+\lambda}\big)^2} = \Big(1 + \frac{\lambda}{\lambda_j}\Big)^2 \le 4.

Similarly, if \lambda_j < \lambda, the ratio of the j-th terms is

    \frac{\lambda_j\beta_j^2}{\frac{\sigma^2}{n}\big(\frac{\lambda_j}{\lambda_j+\lambda}\big)^2 + \frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2}} \le \frac{\lambda_j\beta_j^2}{\frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2}} = \Big(1 + \frac{\lambda_j}{\lambda}\Big)^2 \le 4.

Since each term is within a factor of 4, the proof is complete.

It is worth noting that the converse is not true: the ridge regression estimator (RR) can be arbitrarily worse than the PCA-OLS estimator. An example which shows that the left hand inequality is tight is given in the Appendix.

Footnote 1: Risk inflation has also been used as a criterion for evaluating feature selection procedures (Foster and George, 1994).
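For intuition about Theorem 2, the following sketch (again our own, not from the paper; the spectrum, \beta, and \sigma^2 are arbitrary) evaluates the two closed-form risk expressions, Lemma 1 for ridge and the variance-plus-bias expression above for PCA-OLS, on a synthetic diagonal problem and confirms that the ratio stays below 4 across a grid of \lambda values.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, sigma2 = 50, 100, 1.0

eigs = np.sort(rng.uniform(0.01, 5.0, size=p))[::-1]   # lambda_1 >= ... >= lambda_p
beta = rng.normal(size=p)                               # true parameter in PCA coordinates

def risk_ridge(lam):
    # Lemma 1: sum_j [ (sigma^2/n)(lambda_j/(lambda_j+lam))^2 + beta_j^2 lambda_j / (1+lambda_j/lam)^2 ]
    var = (sigma2 / n) * (eigs / (eigs + lam)) ** 2
    bias = beta ** 2 * eigs / (1.0 + eigs / lam) ** 2
    return np.sum(var + bias)

def risk_pca_ols(lam):
    # sum_j (sigma^2/n) 1[lambda_j >= lam]  +  sum_{j: lambda_j < lam} lambda_j beta_j^2
    keep = eigs >= lam
    return (sigma2 / n) * keep.sum() + np.sum(eigs[~keep] * beta[~keep] ** 2)

for lam in np.logspace(-3, 2, 11):
    ratio = risk_pca_ols(lam) / risk_ridge(lam)
    assert ratio <= 4.0 + 1e-12                         # Theorem 2 bound
    print(f"lambda = {lam:8.3f}   risk ratio = {ratio:.3f}")
```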

3. Experiments

First, we generated synthetic data with p = 100 and varying values of n = {20, 50, 80, 110}. The data was generated in a fixed design setting as Y = X\beta + \varepsilon, where \varepsilon_i \sim N(0,1) for all i = 1, \ldots, n. Furthermore, X_{n \times p} \sim MVN(0, I_{p \times p}), where MVN(\mu, \Sigma) is the multivariate normal distribution with mean vector \mu and variance-covariance matrix \Sigma, and \beta_j \sim N(0,1) for all j = 1, \ldots, p. The results are shown in Figure 1. As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is never worse than 4, and is often better than 1, as dictated by Theorem 2.

Next, we chose two real world data sets, namely USPS (n = 1500, p = 241) and BCI (n = 400, p = 117) (footnote 2). Since we do not know the true model for these data sets, we used all the n observations to fit an OLS regression and used it as an estimate of the true parameter \beta. This is a reasonable approximation to the true parameter, as we estimate the ridge regression (RR) and PCA-OLS models on a small subset of these n observations. Next, we chose a random subset of the observations, namely 0.2p, 0.5p and 0.8p, to fit the ridge regression (RR) and PCA-OLS models. The results are shown in Figure 2. As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is again within a factor of 4, and often PCA-OLS is better, that is, the ratio is < 1.

Footnote 2: The details about the data sets can be found here: http://olivier.chapelle.cc/ssl-book/benchmarks.html.

[Figure 1: Plots showing the risk ratio as a function of \lambda, the regularization parameter, and n, for the synthetic data set (panels labeled n = 0.2p, 0.5p, 0.8p, 1.1p). p = 100 in all the cases. The error bars correspond to one standard deviation for 100 such random trials.]

[Figure 2: Plots showing the risk ratio as a function of \lambda, the regularization parameter, and n, for two real world data sets (BCI and USPS, top to bottom; panels labeled n = 0.2p, 0.5p, 0.8p, 1.1p).]
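A rough reconstruction of the synthetic experiment is sketched below (assumptions of ours, not from the paper: a single fixed design draw, 100 noise replications, and a single \lambda, whereas the paper sweeps \lambda and n). The risk is approximated by averaging the excess loss \|\hat\beta - \beta\|_\Sigma^2 over the noise draws.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, lam, trials = 100, 50, 1.0, 100

X = rng.normal(size=(n, p))                    # fixed design, held fixed across trials
beta = rng.normal(size=p)                      # beta_j ~ N(0, 1)
Sigma = X.T @ X / n
eigs, V = np.linalg.eigh(Sigma)

def excess_loss(b_hat):
    d = b_hat - beta
    return d @ Sigma @ d                       # ||b_hat - beta||_Sigma^2

ridge_risk, pca_risk = 0.0, 0.0
for _ in range(trials):
    Y = X @ beta + rng.normal(size=n)          # epsilon_i ~ N(0, 1)
    XtY_n = X.T @ Y / n
    b_ridge = np.linalg.solve(Sigma + lam * np.eye(p), XtY_n)
    # PCA-OLS: OLS restricted to eigen-directions with eigenvalue >= lam.
    keep = eigs >= lam
    b_pca = V[:, keep] @ (np.diag(1.0 / eigs[keep]) @ (V[:, keep].T @ XtY_n))
    ridge_risk += excess_loss(b_ridge) / trials
    pca_risk += excess_loss(b_pca) / trials

print(f"estimated risk ratio (PCA-OLS / RR) at lambda={lam}: {pca_risk / ridge_risk:.2f}")
```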

4. Conclusion

We showed that the risk inflation of a particular ordinary least squares estimator (on the "top" PCA subspace) is within a factor 4 of the ridge estimator. It turns out the converse is not true: this PCA estimator may be arbitrarily better than the ridge one.

Appendix A.

Proof of Lemma 1  We analyze the bias-variance decomposition in Equation 1. For the variance,

    E_Y\|\hat\beta_\lambda - \bar\beta_\lambda\|_\Sigma^2
      = \sum_j \lambda_j\, E_Y\big([\hat\beta_\lambda]_j - [\bar\beta_\lambda]_j\big)^2
      = \sum_j \frac{\lambda_j}{(\lambda_j+\lambda)^2}\, E\Bigg[\Big(\frac{1}{n}\sum_{i=1}^n (Y_i - E[Y_i])[X_i]_j\Big)\Big(\frac{1}{n}\sum_{i'=1}^n (Y_{i'} - E[Y_{i'}])[X_{i'}]_j\Big)\Bigg]
      = \sum_j \frac{\lambda_j}{(\lambda_j+\lambda)^2}\,\frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(Y_i)\,[X_i]_j^2
      = \sum_j \frac{\lambda_j}{(\lambda_j+\lambda)^2}\,\frac{\sigma^2}{n^2}\sum_{i=1}^n [X_i]_j^2
      = \frac{\sigma^2}{n}\sum_j \frac{\lambda_j^2}{(\lambda_j+\lambda)^2},

where the last step uses \sum_{i=1}^n [X_i]_j^2 = n\lambda_j (since \Sigma = \frac{1}{n}X^T X = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)).

Similarly, for the bias,

    \|\bar\beta_\lambda - \beta\|_\Sigma^2 = \sum_j \lambda_j\big([\bar\beta_\lambda]_j - [\beta]_j\big)^2 = \sum_j \lambda_j \beta_j^2\Big(\frac{\lambda_j}{\lambda_j+\lambda} - 1\Big)^2 = \sum_j \frac{\beta_j^2\lambda_j}{(1+\lambda_j/\lambda)^2},

which completes the proof.

The risk of RR can be arbitrarily worse than that of the PCA-OLS estimator. Consider the standard OLS setting described in Section 1, in which X is an n \times p matrix and Y is an n \times 1 vector. Let X = \sqrt{n}\,\mathrm{diag}(\sqrt{1+\alpha}, 1, \ldots, 1), so that \Sigma = \frac{1}{n}X^T X = \mathrm{diag}(1+\alpha, 1, \ldots, 1) for some \alpha > 0, and choose \beta = [2+\alpha, 0, \ldots, 0]^T. For convenience, also choose \sigma^2/n = 1. Then, using Lemma 1, the risk of the RR estimator is

    \mathrm{Risk}(\hat\beta_\lambda) = \underbrace{\Big(\frac{1+\alpha}{1+\alpha+\lambda}\Big)^2}_{I} + \underbrace{\frac{p-1}{(1+\lambda)^2}}_{II} + \underbrace{\frac{(2+\alpha)^2(1+\alpha)}{\big(1+\frac{1+\alpha}{\lambda}\big)^2}}_{III}.

Let us consider two cases.

Case 1: \lambda < (p-1)^{1/3} - 1; then II > (p-1)^{1/3}.

Case 2: \lambda > 1; then 1 + \frac{1+\alpha}{\lambda} < 2+\alpha, hence III > 1+\alpha.

Combining these two cases (which together cover all \lambda \ge 0 once \alpha > 1), we get, for all \lambda, \mathrm{Risk}(\hat\beta_\lambda) > \min\big((p-1)^{1/3},\, 1+\alpha\big). If we choose p such that p - 1 = (1+\alpha)^3, then \mathrm{Risk}(\hat\beta_\lambda) > 1+\alpha.

The PCA-OLS risk (from Theorem 2) is

    \mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda}) = \sum_j \frac{\sigma^2}{n}\,\mathbb{1}[\lambda_j \ge \lambda] + \sum_{j:\,\lambda_j < \lambda} \lambda_j \beta_j^2.

Considering \lambda \in (1, 1+\alpha), the first term contributes 1 to the risk and everything else is 0. So the risk of PCA-OLS is 1 and the risk ratio is

    \frac{\mathrm{Risk}(\hat\beta_{\mathrm{PCA},\lambda})}{\mathrm{Risk}(\hat\beta_\lambda)} \le \frac{1}{1+\alpha}.

Now, for large \alpha, the risk ratio \to 0.
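As a numerical sanity check on this construction (our own illustration, not from the paper; the value of \alpha and the \lambda grid are arbitrary), the sketch below evaluates the two closed-form risks for the diagonal design above, with \sigma^2/n = 1, and shows the ratio dropping to roughly 1/(1+\alpha) for \lambda between 1 and 1+\alpha, while never exceeding 4.

```python
import numpy as np

alpha = 50.0
p = int((1 + alpha) ** 3) + 1                  # p - 1 = (1 + alpha)^3
eigs = np.concatenate(([1 + alpha], np.ones(p - 1)))
beta = np.concatenate(([2 + alpha], np.zeros(p - 1)))
# sigma^2 / n = 1 by the convenience choice above.

def risk_ridge(lam):
    # Lemma 1 with sigma^2/n = 1.
    return np.sum((eigs / (eigs + lam)) ** 2 + beta ** 2 * eigs / (1 + eigs / lam) ** 2)

def risk_pca(lam):
    keep = eigs >= lam
    return keep.sum() + np.sum(eigs[~keep] * beta[~keep] ** 2)

for lam in [0.5, 2.0, alpha / 2, alpha, 10 * alpha]:
    print(f"lambda = {lam:8.1f}   PCA-OLS/RR risk ratio = {risk_pca(lam) / risk_ridge(lam):.4f}")
```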

References

D. P. Foster and E. I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947-1975, 1994.

A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 4, pages 501-504, 1963.