
REGRESSION WITH QUADRATIC LOSS

MAXIM RAGINSKY

Date: March 28, 2011.

Regression with quadratic loss is another basic problem studied in statistical learning theory. We have a random couple $Z = (X, Y)$, where, as before, $X$ is an $\mathbb{R}^d$-valued feature vector (or input vector) and $Y$ is the real-valued response (or output). We assume that the unknown joint distribution $P = P_Z = P_{XY}$ of $(X, Y)$ belongs to some class $\mathcal{P}$ of probability distributions over $\mathbb{R}^d \times \mathbb{R}$. The learning problem, then, is to produce a predictor of $Y$ given $X$ on the basis of an i.i.d. training sample $Z^n = (Z_1, \ldots, Z_n) = ((X_1, Y_1), \ldots, (X_n, Y_n))$ from $P$. A predictor is just a (measurable) function $f : \mathbb{R}^d \to \mathbb{R}$, and we evaluate its performance by the expected quadratic loss
$$L(f) \triangleq \mathbb{E}\left[(Y - f(X))^2\right].$$
As we have seen before, the smallest expected loss is achieved by the regression function $f^*(x) = \mathbb{E}[Y \mid X = x]$, i.e.,
$$L^* \triangleq \inf_f L(f) = L(f^*) = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])^2\right].$$
Moreover, for any other $f$ we have
$$L(f) = L^* + \|f - f^*\|^2_{L^2(P_X)}, \qquad \text{where} \quad \|f - f^*\|^2_{L^2(P_X)} = \int_{\mathbb{R}^d} |f(x) - f^*(x)|^2\, P_X(dx).$$
Since we do not know $P$, in general we cannot hope to learn $f^*$, so, as before, we instead aim at finding a good approximation to the best predictor in some class $\mathcal{F}$ of functions $f : \mathbb{R}^d \to \mathbb{R}$, i.e., to use the training data $Z^n$ to construct a predictor $\widehat{f}_n \in \mathcal{F}$ such that
$$L(\widehat{f}_n) \approx L^*(\mathcal{F}) \triangleq \inf_{f \in \mathcal{F}} L(f)$$
with high probability.

We will assume that the marginal distribution $P_X$ of the feature vector is supported on a closed subset $\mathsf{X} \subseteq \mathbb{R}^d$, and that the joint distribution $P$ of $(X, Y)$ is such that, with probability one,
$$(1) \qquad |Y| \le M \quad \text{and} \quad |f^*(X)| \le M$$
for some constant $0 < M < \infty$. Thus we can assume that the training samples belong to the set $\mathsf{Z} = \mathsf{X} \times [-M, M]$. We will also assume that the class $\mathcal{F}$ is a subset of a suitable reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ induced by some Mercer kernel $K : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$. It will be useful to define
$$(2) \qquad C_K \triangleq \sup_{x \in \mathsf{X}} \sqrt{K(x, x)};$$
we will assume that $C_K$ is finite.
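To make the setup concrete, here is a minimal numerical sketch, assuming a toy one-dimensional distribution in which $X$ is uniform on $[-1, 1]$, $f^*(x) = \sin(\pi x)$, and $Y = f^*(X) + \text{bounded noise}$, so that (1) holds with $M = 2$. Monte Carlo estimates then illustrate the decomposition $L(f) = L^* + \|f - f^*\|^2_{L^2(P_X)}$ for a deliberately crude candidate predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # regression function f*(x) = E[Y | X = x] in this toy model
    return np.sin(np.pi * x)

def sample(n):
    # X ~ Uniform[-1, 1]; Y = f*(X) + bounded, zero-mean noise, so |Y| <= 2 =: M
    x = rng.uniform(-1.0, 1.0, size=n)
    noise = rng.uniform(-1.0, 1.0, size=n)
    return x, f_star(x) + noise

def quadratic_loss(f, x, y):
    # empirical quadratic loss L_n(f) = (1/n) sum_i (Y_i - f(X_i))^2
    return np.mean((y - f(x)) ** 2)

f = lambda x: 0.5 * x            # a (deliberately biased) candidate predictor

x, y = sample(200_000)
L_f    = quadratic_loss(f, x, y)                # Monte Carlo estimate of L(f)
L_star = quadratic_loss(f_star, x, y)           # Monte Carlo estimate of L* = L(f*)
approx = np.mean((f(x) - f_star(x)) ** 2)       # estimate of ||f - f*||^2_{L^2(P_X)}

print(f"L(f)                ~ {L_f:.4f}")
print(f"L* + ||f - f*||^2   ~ {L_star + approx:.4f}")   # approximately equals L(f)
```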

The following simple bound will come in handy.

Lemma 1. For any function $f : \mathsf{X} \to \mathbb{R}$, define the sup norm $\|f\|_\infty \triangleq \sup_{x \in \mathsf{X}} |f(x)|$. Then for any $f \in \mathcal{H}_K$ we have
$$(3) \qquad \|f\|_\infty \le C_K \|f\|_K.$$

Proof. For any $f \in \mathcal{H}_K$ and $x \in \mathsf{X}$,
$$|f(x)| = |\langle f, K_x \rangle_K| \le \|f\|_K \|K_x\|_K = \|f\|_K \sqrt{K(x, x)},$$
where the first step is by the reproducing kernel property, while the second step is by Cauchy-Schwarz. Taking the supremum of both sides over $\mathsf{X}$, we get (3). $\square$

1. ERM over a ball in RKHS

First, we will look at the simplest case: ERM over a ball in $\mathcal{H}_K$. Thus, we pick the radius $\lambda > 0$ and take $\mathcal{F} = \mathcal{F}_\lambda \triangleq \{ f \in \mathcal{H}_K : \|f\|_K \le \lambda \}$. The ERM algorithm outputs the predictor
$$\widehat{f}_n = \arg\min_{f \in \mathcal{F}_\lambda} L_n(f) \triangleq \arg\min_{f \in \mathcal{F}_\lambda} \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2,$$
where $L_n(f)$ denotes, as usual, the empirical loss (in this case, the empirical quadratic loss) of $f$.

Theorem 1. With probability at least $1 - \delta$,
$$(4) \qquad L(\widehat{f}_n) \le L^*(\mathcal{F}_\lambda) + \frac{8(M + 2C_K\lambda)^2}{\sqrt{n}} + (M^2 + C_K^2\lambda^2)\sqrt{\frac{32\log(1/\delta)}{n}}.$$
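Before turning to the proof, here is a small sketch that simply evaluates the excess-risk terms on the right-hand side of (4) for illustrative values of $M$, $C_K$, $\lambda$, $\delta$ (these particular numbers are arbitrary choices, not anything prescribed by the theorem); it makes the $O(1/\sqrt{n})$ decay visible but says nothing about the tightness of the constants.

```python
import numpy as np

def erm_ball_excess_bound(n, M=1.0, C_K=1.0, lam=1.0, delta=0.01):
    """Excess-risk part of the right-hand side of (4):
       8 (M + 2 C_K lam)^2 / sqrt(n) + (M^2 + C_K^2 lam^2) sqrt(32 log(1/delta) / n)."""
    term1 = 8.0 * (M + 2.0 * C_K * lam) ** 2 / np.sqrt(n)
    term2 = (M ** 2 + C_K ** 2 * lam ** 2) * np.sqrt(32.0 * np.log(1.0 / delta) / n)
    return term1 + term2

for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, round(erm_ball_excess_bound(n), 4))
```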

Proof. First let us introduce some notation. Let us denote the quadratic loss function $(y, u) \mapsto (y - u)^2$ by $\ell(y, u)$, and for any $f : \mathbb{R}^d \to \mathbb{R}$ let
$$\ell \circ f(x, y) \triangleq \ell(y, f(x)) = (y - f(x))^2.$$
Let $\ell \circ \mathcal{F}_\lambda$ denote the function class $\{\ell \circ f : f \in \mathcal{F}_\lambda\}$. Let $f^*_\lambda$ denote any minimizer of $L(f)$ over $\mathcal{F}_\lambda$, i.e., $L(f^*_\lambda) = L^*(\mathcal{F}_\lambda)$. As usual, we write
$$(5) \qquad L(\widehat{f}_n) - L^*(\mathcal{F}_\lambda) = L(\widehat{f}_n) - L(f^*_\lambda) = L(\widehat{f}_n) - L_n(\widehat{f}_n) + L_n(\widehat{f}_n) - L_n(f^*_\lambda) + L_n(f^*_\lambda) - L(f^*_\lambda)$$
$$\le 2 \sup_{f \in \mathcal{F}_\lambda} |L_n(f) - L(f)| = 2 \sup_{f \in \mathcal{F}_\lambda} |P_n(\ell \circ f) - P(\ell \circ f)| = 2\Delta_n(\ell \circ \mathcal{F}_\lambda),$$
where we have defined the uniform deviation
$$\Delta_n(\ell \circ \mathcal{F}) \triangleq \sup_{f \in \mathcal{F}} |P_n(\ell \circ f) - P(\ell \circ f)|.$$
Next we show that, as a function of the training sample $Z^n$, $g(Z^n) = \Delta_n(\ell \circ \mathcal{F}_\lambda)$ has bounded differences. Indeed, for any $1 \le i \le n$, any $z^n \in \mathsf{Z}^n$, and any $z'_i \in \mathsf{Z}$, let $z^{n(i)}$ denote $z^n$ with the $i$th coordinate replaced by $z'_i$. Then
$$|g(z^n) - g(z^{n(i)})| \le \frac{1}{n} \sup_{f \in \mathcal{F}_\lambda} \left| (y_i - f(x_i))^2 - (y'_i - f(x'_i))^2 \right| \le \frac{2}{n} \sup_{f \in \mathcal{F}_\lambda} \sup_{x \in \mathsf{X},\, |y| \le M} (y - f(x))^2 \le \frac{4}{n}\left( M^2 + \sup_{f \in \mathcal{F}_\lambda} \|f\|_\infty^2 \right) \le \frac{4}{n}\left( M^2 + C_K^2\lambda^2 \right),$$
where the last line is by Lemma 1. Thus, $\Delta_n(\ell \circ \mathcal{F}_\lambda)$ has the bounded difference property with $c_1 = \ldots = c_n = 4(M^2 + C_K^2\lambda^2)/n$, so McDiarmid's inequality says that, for any $t > 0$,
$$\mathbb{P}\left( \Delta_n(\ell \circ \mathcal{F}_\lambda) \ge \mathbb{E}\,\Delta_n(\ell \circ \mathcal{F}_\lambda) + t \right) \le \exp\left( -\frac{n t^2}{8(M^2 + C_K^2\lambda^2)^2} \right).$$
Therefore, letting
$$t = 2(M^2 + C_K^2\lambda^2)\sqrt{\frac{2\log(1/\delta)}{n}},$$
we see that
$$\Delta_n(\ell \circ \mathcal{F}_\lambda) \le \mathbb{E}\,\Delta_n(\ell \circ \mathcal{F}_\lambda) + 2(M^2 + C_K^2\lambda^2)\sqrt{\frac{2\log(1/\delta)}{n}}$$
with probability at least $1 - \delta$. Moreover, by symmetrization we have
$$(6) \qquad \mathbb{E}\,\Delta_n(\ell \circ \mathcal{F}_\lambda) \le 2\,\mathbb{E}\,R_n(\ell \circ \mathcal{F}_\lambda(Z^n)),$$
where
$$(7) \qquad R_n(\ell \circ \mathcal{F}_\lambda(Z^n)) = \frac{1}{n}\,\mathbb{E}_\sigma\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \sum_{i=1}^n \sigma_i\, \ell \circ f(Z_i) \right| \right]$$
is the Rademacher average of the (random) set
$$\ell \circ \mathcal{F}_\lambda(Z^n) = \{ (\ell \circ f(Z_1), \ldots, \ell \circ f(Z_n)) : f \in \mathcal{F}_\lambda \} = \left\{ \left((Y_1 - f(X_1))^2, \ldots, (Y_n - f(X_n))^2\right) : f \in \mathcal{F}_\lambda \right\}.$$
To bound the Rademacher average in (7), we will need to use the contraction principle. To that end, let us fix any $y \in [-M, M]$ and any $u, v \in [-C_K\lambda, C_K\lambda]$. Then
$$|\ell(y, u) - \ell(y, v)| = \left|(y^2 - 2yu + u^2) - (y^2 - 2yv + v^2)\right| = \left|2y(v - u) - (v^2 - u^2)\right| \le 2|y|\,|u - v| + |u + v|\,|u - v| \le 2(M + 2C_K\lambda)\,|u - v|.$$
Hence, by the contraction principle we can write
$$(8) \qquad R_n(\ell \circ \mathcal{F}_\lambda(Z^n)) \le \frac{2(M + 2C_K\lambda)}{n}\,\mathbb{E}_\sigma\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \sum_{i=1}^n \sigma_i (Y_i - f(X_i)) \right| \right].$$

Moreover,
$$(9) \qquad \mathbb{E}_\sigma\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \sum_{i=1}^n \sigma_i (Y_i - f(X_i)) \right| \right] \le \mathbb{E}_\sigma\left| \sum_{i=1}^n \sigma_i Y_i \right| + \mathbb{E}_\sigma\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \sum_{i=1}^n \sigma_i f(X_i) \right| \right] \le \sqrt{\sum_{i=1}^n Y_i^2} + n R_n(\mathcal{F}_\lambda(X^n)) \le \sqrt{n}\,(M + C_K\lambda),$$
where the first step uses the triangle inequality, the second step uses the result from the previous lecture on the expected absolute value of Rademacher sums, and the third step uses (1) and the bound on the Rademacher average over a ball in an RKHS. Combining (6) through (9) (and overbounding (9) slightly), we conclude that
$$(10) \qquad \Delta_n(\ell \circ \mathcal{F}_\lambda) \le \frac{4(M + 2C_K\lambda)^2}{\sqrt{n}} + 2(M^2 + C_K^2\lambda^2)\sqrt{\frac{2\log(1/\delta)}{n}}$$
with probability at least $1 - \delta$. Finally, combining this with (5), we get (4). $\square$
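As a quick sanity check on the RKHS-ball Rademacher bound used in the last step of (9), here is a minimal Monte Carlo sketch, assuming the Gaussian kernel $K(x, x') = \exp(-(x - x')^2/2)$ (for which $C_K = 1$) and using the identity $\sup_{\|f\|_K \le \lambda} \left|\sum_i \sigma_i f(X_i)\right| = \lambda\sqrt{\sigma^\top \mathbf{K}\sigma}$, where $\mathbf{K}$ is the kernel Gram matrix of the sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel_matrix(x, bandwidth=1.0):
    # K[i, j] = exp(-(x_i - x_j)^2 / (2 bandwidth^2)); note K(x, x) = 1, so C_K = 1
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2.0 * bandwidth ** 2))

n, lam, n_mc = 500, 2.0, 1000
x = rng.uniform(-1.0, 1.0, size=n)
K = gaussian_kernel_matrix(x)

# For each Rademacher draw sigma: sup_{||f||_K <= lam} |sum_i sigma_i f(X_i)| = lam * sqrt(sigma^T K sigma)
sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
quad = np.einsum("mi,mi->m", sigma @ K, sigma)   # sigma^T K sigma for each draw (K is PSD)
sup_values = lam * np.sqrt(quad)

print("Monte Carlo estimate of E_sigma sup :", sup_values.mean())
print("Upper bound  C_K * lam * sqrt(n)    :", 1.0 * lam * np.sqrt(n))
```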

2. Regularized least squares in a RKHS

The observation we have made many times by now is that when the joint distribution of the input-output pair $(X, Y) \in \mathsf{X} \times \mathbb{R}$ is unknown, there is no hope in general to learn the optimal predictor $f^*$ from a finite training sample. Thus, restricting our attention to some hypothesis space $\mathcal{F}$, which is a proper subset of the class of all measurable functions $f : \mathsf{X} \to \mathbb{R}$, is a form of insurance: if we do not do this, then we can always find some function $f$ that attains zero empirical loss, yet performs spectacularly badly on inputs outside the training set. When this happens, we say that our learned predictor overfits. On the other hand, if our hypothesis space $\mathcal{F}$ consists of well-behaved functions, then it is possible to learn a predictor that achieves a graceful balance between in-sample data fit and out-of-sample generalization. The price we pay is the approximation error
$$L^*(\mathcal{F}) - L^* \triangleq \inf_{f \in \mathcal{F}} L(f) - \inf_{f : \mathsf{X} \to \mathbb{R}} L(f) \ge 0.$$
In the regression setting, the approximation error can be expressed as
$$L^*(\mathcal{F}) - L^* = \inf_{f \in \mathcal{F}} \|f - f^*\|^2_{L^2(P_X)},$$
where $f^*(x) = \mathbb{E}[Y \mid X = x]$ is the regression function (the MMSE predictor of $Y$ given $X$). When seen from this perspective, the use of a restricted hypothesis space $\mathcal{F}$ is a form of regularization, a way of guaranteeing that the learned predictor performs well outside the training sample.

However, this is not the only way to achieve regularization. In this section, we will analyze another way: complexity regularization. In a nutshell, complexity regularization is a modification of the ERM scheme that allows us to search over a fairly rich hypothesis space by adding a penalty term. Complexity regularization is a very general technique with wide applicability. We will look at a particular example of complexity regularization over an RKHS and derive a simple bound on its generalization performance.

To set things up, let $\gamma > 0$ be a regularization parameter. Introduce the regularized quadratic loss
$$J_\gamma(f) \triangleq L(f) + \gamma\|f\|^2_K$$
and its empirical counterpart
$$J_{\gamma,n}(f) \triangleq L_n(f) + \gamma\|f\|^2_K.$$
Define the functions
$$(11) \qquad f_\gamma \triangleq \arg\min_{f \in \mathcal{H}_K} J_\gamma(f)$$
and
$$(12) \qquad \widehat{f}_{n,\gamma} \triangleq \arg\min_{f \in \mathcal{H}_K} J_{\gamma,n}(f).$$
We will refer to (12) as the regularized kernel least squares (RKLS) algorithm. Note that the minimization in (11) and (12) takes place in the entire RKHS $\mathcal{H}_K$, rather than a subset, say, a ball. However, the addition of the regularization term $\gamma\|f\|^2_K$ ensures that the RKLS algorithm does not just select any function $f \in \mathcal{H}_K$ that happens to fit the training data well; instead, it weighs the goodness-of-fit term $L_n(f)$ against the complexity term $\gamma\|f\|^2_K$, since a very large value of $\|f\|^2_K$ would indicate that $f$ might wiggle around a lot and, therefore, overfit the training sample. The regularization parameter $\gamma > 0$ controls the relative importance of the goodness-of-fit and the complexity terms.
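The notes do not discuss computation, but it is worth recording that (12) admits a closed form: by the standard representer-theorem argument (not proved here), the minimizer can be written as $\widehat{f}_{n,\gamma}(x) = \sum_{j=1}^n \alpha_j K(X_j, x)$ with $\alpha = (\mathbf{K} + n\gamma I)^{-1} Y$, where $\mathbf{K}$ is the $n \times n$ kernel Gram matrix. The following is a minimal sketch under that assumption, with a Gaussian kernel and toy data chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_kernel(a, b, bandwidth=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))

def rkls_fit(x_train, y_train, gamma, bandwidth=0.5):
    """Regularized kernel least squares, assuming the representer form:
       minimize (1/n)||y - K alpha||^2 + gamma alpha^T K alpha  =>  alpha = (K + n gamma I)^{-1} y."""
    n = len(x_train)
    K = gaussian_kernel(x_train, x_train, bandwidth)
    return np.linalg.solve(K + n * gamma * np.eye(n), y_train)

def rkls_predict(alpha, x_train, x_new, bandwidth=0.5):
    return gaussian_kernel(x_new, x_train, bandwidth) @ alpha

# toy data: Y = sin(pi X) + bounded noise
n, gamma = 200, 1e-2
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x) + rng.uniform(-0.5, 0.5, size=n)

alpha = rkls_fit(x, y, gamma)
x_test = rng.uniform(-1.0, 1.0, size=10_000)
gap = np.mean((np.sin(np.pi * x_test) - rkls_predict(alpha, x, x_test)) ** 2)
print("estimated ||f_hat - f*||^2_{L^2(P_X)}:", round(gap, 4))
```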

We have the following basic bound on the generalization performance of RKLS:

Theorem 2. With probability at least $1 - \delta$,
$$(13) \qquad L(\widehat{f}_{n,\gamma}) - L^* \le A(\gamma) + \frac{4M^2}{\sqrt{n}}\left(1 + \frac{2C_K}{\sqrt{\gamma}}\right)^2 + 2\left(2M^2 + \frac{C_K^2}{\gamma}\left(M^2 + A(\gamma)\right)\right)\sqrt{\frac{2\log(2/\delta)}{n}},$$
where
$$A(\gamma) \triangleq \inf_{f \in \mathcal{H}_K}\left[ L(f) + \gamma\|f\|^2_K \right] - L^*$$
is the regularized approximation error.

Proof. We start with the following lemma:

Lemma 2.
$$(14) \qquad L(\widehat{f}_{n,\gamma}) - L^* \le \delta_n(\widehat{f}_{n,\gamma}) - \delta_n(f_\gamma) + A(\gamma),$$
where $\delta_n(f) \triangleq L(f) - L_n(f)$ for all $f$.

Proof. First, an obvious overbounding gives $L(\widehat{f}_{n,\gamma}) - L^* \le J_\gamma(\widehat{f}_{n,\gamma}) - L^*$. Then
$$J_\gamma(\widehat{f}_{n,\gamma}) = L(\widehat{f}_{n,\gamma}) + \gamma\|\widehat{f}_{n,\gamma}\|^2_K = L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + \underbrace{L_n(\widehat{f}_{n,\gamma}) + \gamma\|\widehat{f}_{n,\gamma}\|^2_K}_{= J_{\gamma,n}(\widehat{f}_{n,\gamma})}$$
$$= L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + J_{\gamma,n}(\widehat{f}_{n,\gamma}) - J_{\gamma,n}(f_\gamma) + J_{\gamma,n}(f_\gamma)$$
$$\le L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + J_{\gamma,n}(f_\gamma)$$
$$= L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + L_n(f_\gamma) + \gamma\|f_\gamma\|^2_K$$
$$= L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + L(f_\gamma) + \gamma\|f_\gamma\|^2_K$$
$$= L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + J_\gamma(f_\gamma).$$
This gives
$$L(\widehat{f}_{n,\gamma}) - L^* \le L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + J_\gamma(f_\gamma) - L^*$$
$$= L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + \inf_{f \in \mathcal{H}_K}\left[ L(f) + \gamma\|f\|^2_K \right] - L^*$$
$$= L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + A(\gamma),$$
and we are done. $\square$

Lemma 2 shows that the excess loss of the regularized empirical loss minimizer $\widehat{f}_{n,\gamma}$ is bounded from above by the sum of three terms: the deviation $\delta_n(\widehat{f}_{n,\gamma}) \triangleq L(\widehat{f}_{n,\gamma}) - L_n(\widehat{f}_{n,\gamma})$ of $\widehat{f}_{n,\gamma}$ itself, the (negative) deviation $-\delta_n(f_\gamma) \triangleq L_n(f_\gamma) - L(f_\gamma)$ of the best regularized predictor $f_\gamma$, and the approximation error $A(\gamma)$. To prove Theorem 2, we will need to obtain high-probability bounds on the two deviation terms. To that end, we need a lemma:

Lemma 3. The functions $f_\gamma$ and $\widehat{f}_{n,\gamma}$ satisfy the bounds
$$(15) \qquad \|f_\gamma\|_\infty \le C_K\sqrt{\frac{A(\gamma)}{\gamma}}$$
and
$$(16) \qquad \|\widehat{f}_{n,\gamma}\|_K \le \frac{M}{\sqrt{\gamma}} \quad \text{with probability one,}$$
respectively.

Proof. To prove (15), we use the fact that
$$A(\gamma) = L(f_\gamma) - L^* + \gamma\|f_\gamma\|^2_K \ge \gamma\|f_\gamma\|^2_K,$$
which gives $\|f_\gamma\|_K \le \sqrt{A(\gamma)/\gamma}$. From this and from (3) we obtain (15). For (16), we use the fact that $\widehat{f}_{n,\gamma}$ minimizes $J_{\gamma,n}(f)$ over all $f$. In particular,
$$J_{\gamma,n}(\widehat{f}_{n,\gamma}) = L_n(\widehat{f}_{n,\gamma}) + \gamma\|\widehat{f}_{n,\gamma}\|^2_K \le J_{\gamma,n}(0) = \frac{1}{n}\sum_{i=1}^n Y_i^2 \le M^2 \quad \text{w.p. } 1,$$
where the last step follows from (1). Rearranging and using the fact that $L_n(f) \ge 0$ for all $f$, we get (16). $\square$
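The norm bound (16) is easy to check numerically for the closed-form RKLS solution sketched earlier (again assuming the representer form $\alpha = (\mathbf{K} + n\gamma I)^{-1}Y$), since $\|\widehat{f}_{n,\gamma}\|_K^2 = \alpha^\top \mathbf{K}\alpha$; the computed norm should never exceed $M/\sqrt{\gamma}$ when $|Y_i| \le M$.

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_kernel(a, b, bandwidth=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))

M, n = 2.0, 300
for gamma in [1e-1, 1e-2, 1e-3]:
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.clip(np.sin(np.pi * x) + rng.uniform(-1.0, 1.0, size=n), -M, M)  # enforce |Y_i| <= M
    K = gaussian_kernel(x, x)
    alpha = np.linalg.solve(K + n * gamma * np.eye(n), y)
    rkhs_norm = np.sqrt(alpha @ K @ alpha)          # ||f_hat_{n,gamma}||_K
    print(f"gamma={gamma:g}: ||f_hat||_K = {rkhs_norm:.3f} <= M/sqrt(gamma) = {M/np.sqrt(gamma):.3f}")
```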

Now we are ready to bound $\delta_n(\widehat{f}_{n,\gamma})$. For any $R \ge 0$, let $\mathcal{F}_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$ denote the zero-centered ball of radius $R$ in the RKHS $\mathcal{H}_K$. Then Lemma 3 says that $\widehat{f}_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}}$ with probability one. Therefore, with probability one we have
$$\delta_n(\widehat{f}_{n,\gamma}) = \delta_n(\widehat{f}_{n,\gamma})\,\mathbf{1}\{\widehat{f}_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}}\} \le \sup_{f \in \mathcal{F}_{M/\sqrt{\gamma}}} |\delta_n(f)|\,\mathbf{1}\{\widehat{f}_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}}\} \le \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}}).$$
Consequently, we can carry out the same analysis as in the proof of Theorem 1. First of all, the function $g(Z^n) = \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}})$ has bounded differences with
$$c_1 = \ldots = c_n \le \frac{4}{n}\left( M^2 + \sup_{f \in \mathcal{F}_{M/\sqrt{\gamma}}} \|f\|_\infty^2 \right) \le \frac{4M^2}{n}\left( 1 + \frac{C_K^2}{\gamma} \right),$$
where the last step uses (16) and Lemma 1. Therefore, with probability at least $1 - \delta/2$,
$$(17) \qquad \delta_n(\widehat{f}_{n,\gamma}) \le \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}}) \le \frac{4M^2}{\sqrt{n}}\left(1 + \frac{2C_K}{\sqrt{\gamma}}\right)^2 + 2M^2\left(1 + \frac{C_K^2}{\gamma}\right)\sqrt{\frac{2\log(2/\delta)}{n}},$$
where the second step follows from (10) with $\delta$ replaced by $\delta/2$ and with $\lambda = M/\sqrt{\gamma}$.

It remains to bound $-\delta_n(f_\gamma)$. This is, actually, much easier, since we are dealing with a single data-independent function. In particular, note that we can write
$$-\delta_n(f_\gamma) = \frac{1}{n}\sum_{i=1}^n (Y_i - f_\gamma(X_i))^2 - \mathbb{E}\left[ (Y - f_\gamma(X))^2 \right] = \frac{1}{n}\sum_{i=1}^n U_i,$$
where $U_i \triangleq (Y_i - f_\gamma(X_i))^2 - \mathbb{E}[(Y - f_\gamma(X))^2]$, $1 \le i \le n$, are i.i.d. random variables with $\mathbb{E}U_i = 0$ and
$$|U_i| \le \sup_{y \in [-M, M]} \sup_{x \in \mathsf{X}} (y - f_\gamma(x))^2 \le 2\left( M^2 + \|f_\gamma\|_\infty^2 \right) \le 2\left( M^2 + \frac{C_K^2 A(\gamma)}{\gamma} \right)$$
with probability one, where we have used (1) and (15). We can therefore use Hoeffding's inequality to write, for any $t \ge 0$,
$$\mathbb{P}\left( -\delta_n(f_\gamma) \ge t \right) = \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n U_i \ge t \right) \le \exp\left( -\frac{n t^2}{8\left( M^2 + C_K^2 A(\gamma)/\gamma \right)^2} \right).$$
This implies that
$$(18) \qquad -\delta_n(f_\gamma) \le 2\left( M^2 + \frac{C_K^2 A(\gamma)}{\gamma} \right)\sqrt{\frac{2\log(2/\delta)}{n}}$$
with probability at least $1 - \delta/2$. Combining (17) and (18) with (14), we get (13). $\square$
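Finally, a short sketch that evaluates the right-hand side of (13) for illustrative values of $M$, $C_K$, $\delta$, treating the regularized approximation error $A(\gamma)$ as a user-supplied number; purely for illustration it takes $A(\gamma) = \gamma$, which is the bound $A(\gamma) \le \gamma\|f^*\|^2_K$ one gets when $f^* \in \mathcal{H}_K$ with $\|f^*\|_K \le 1$. It makes the trade-off in $\gamma$ visible: the deviation terms blow up as $\gamma \to 0$, while $A(\gamma)$ typically grows with $\gamma$.

```python
import numpy as np

def rkls_excess_bound(n, gamma, A_gamma, M=1.0, C_K=1.0, delta=0.01):
    """Right-hand side of (13):
       A(gamma) + (4 M^2 / sqrt(n)) (1 + 2 C_K / sqrt(gamma))^2
                + 2 (2 M^2 + (C_K^2 / gamma) (M^2 + A(gamma))) sqrt(2 log(2/delta) / n)."""
    dev1 = 4.0 * M ** 2 / np.sqrt(n) * (1.0 + 2.0 * C_K / np.sqrt(gamma)) ** 2
    dev2 = 2.0 * (2.0 * M ** 2 + (C_K ** 2 / gamma) * (M ** 2 + A_gamma)) * np.sqrt(
        2.0 * np.log(2.0 / delta) / n)
    return A_gamma + dev1 + dev2

n = 10**6
for gamma in [1.0, 0.1, 0.01, 0.001]:
    # A(gamma) = gamma is an illustrative stand-in, not something the theorem provides
    print(f"gamma={gamma:g}: bound = {rkls_excess_bound(n, gamma, A_gamma=gamma):.4f}")
```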