Regression with quadratic loss

Maxim Raginsky
October 13, 2015

Regression with quadratic loss is another basic problem studied in statistical learning theory. We have a random couple $Z = (X, Y)$, where, as before, $X$ is an $\mathbb{R}^d$-valued feature vector (or input vector) and $Y$ is the real-valued response (or output). We assume that the unknown joint distribution $P = P_Z = P_{XY}$ of $(X, Y)$ belongs to some class $\mathcal{P}$ of probability distributions over $\mathbb{R}^d \times \mathbb{R}$. The learning problem, then, is to produce a predictor of $Y$ given $X$ on the basis of an i.i.d. training sample $Z^n = (Z_1, \dots, Z_n) = ((X_1, Y_1), \dots, (X_n, Y_n))$ from $P$. A predictor is just a measurable function $f : \mathbb{R}^d \to \mathbb{R}$, and we evaluate its performance by the expected quadratic loss
\[
L(f) \triangleq \mathbb{E}\big[(Y - f(X))^2\big].
\]
As we have seen before, the smallest expected loss is achieved by the regression function $f^*(x) \triangleq \mathbb{E}[Y \mid X = x]$, i.e.,
\[
L^* \triangleq \inf_f L(f) = L(f^*) = \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big].
\]
Moreover, for any other $f$ we have
\[
L(f) = L^* + \|f - f^*\|^2_{L^2(P_X)},
\]
where
\[
\|f - f^*\|^2_{L^2(P_X)} = \int_{\mathbb{R}^d} |f(x) - f^*(x)|^2 \, P_X(dx).
\]
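For completeness, here is the standard conditioning argument behind this decomposition (a reminder, not new material): writing $Y - f(X) = (Y - f^*(X)) + (f^*(X) - f(X))$ and expanding the square,
\[
L(f) = \mathbb{E}\big[(Y - f^*(X))^2\big] + 2\,\mathbb{E}\big[(Y - f^*(X))(f^*(X) - f(X))\big] + \mathbb{E}\big[(f^*(X) - f(X))^2\big].
\]
The cross term vanishes, since conditioning on $X$ gives
\[
\mathbb{E}\big[(Y - f^*(X))(f^*(X) - f(X))\big] = \mathbb{E}\big[(f^*(X) - f(X))\,\mathbb{E}[Y - f^*(X) \mid X]\big] = 0,
\]
and the two remaining terms are exactly $L^*$ and $\|f - f^*\|^2_{L^2(P_X)}$.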

Since we do not know $P$, in general we cannot hope to learn $f^*$, so, as before, we instead aim at finding a good approximation to the best predictor in some class $\mathcal{F}$ of functions $f : \mathbb{R}^d \to \mathbb{R}$, i.e., to use the training data $Z^n$ to construct a predictor $\widehat{f}_n \in \mathcal{F}$ such that
\[
L(\widehat{f}_n) \approx L^*(\mathcal{F}) \triangleq \inf_{f \in \mathcal{F}} L(f)
\]
with high probability. We will assume that the marginal distribution $P_X$ of the feature vector is supported on a closed subset $\mathsf{X} \subseteq \mathbb{R}^d$, and that the joint distribution $P$ of $(X, Y)$ is such that, with probability one,
\[
|Y| \le M \quad \text{and} \quad |f^*(X)| \le M \tag{1}
\]
for some constant $0 < M < \infty$. Thus we can assume that the training samples belong to the set $\mathsf{Z} = \mathsf{X} \times [-M, M]$. We will also assume that the class $\mathcal{F}$ is a subset of a suitable reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ induced by some Mercer kernel $K : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$. It will be useful to define
\[
C_K \triangleq \sup_{x \in \mathsf{X}} \sqrt{K(x, x)}; \tag{2}
\]
we will assume that $C_K$ is finite. The following simple bound will come in handy:

Lemma 1. For any function $f : \mathsf{X} \to \mathbb{R}$, define the sup norm $\|f\|_\infty \triangleq \sup_{x \in \mathsf{X}} |f(x)|$. Then for any $f \in \mathcal{H}_K$ we have
\[
\|f\|_\infty \le C_K \|f\|_K. \tag{3}
\]

Proof. For any $f \in \mathcal{H}_K$ and $x \in \mathsf{X}$,
\[
|f(x)| = |\langle f, K_x \rangle_K| \le \|f\|_K \|K_x\|_K = \|f\|_K \sqrt{K(x, x)},
\]
where the first step is by the reproducing kernel property, while the second step is by Cauchy-Schwarz. Taking the supremum of both sides over $\mathsf{X}$, we get (3).
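As a quick numerical illustration of Lemma 1 (not part of the notes; the Gaussian kernel and all parameter choices below are assumptions made only for the example), take a function of the form $f = \sum_i \alpha_i K(\cdot, x_i)$, for which $\|f\|_K^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$, and compare a grid estimate of $\|f\|_\infty$ with $C_K \|f\|_K$. For the Gaussian kernel, $K(x, x) = 1$ for every $x$, so $C_K = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, xp, sigma=0.5):
    # Gaussian (Mercer) kernel on R; K(x, x) = 1 for every x, so C_K = 1.
    return np.exp(-(x - xp) ** 2 / (2 * sigma ** 2))

# An element of the RKHS: f = sum_i alpha_i K(., x_i).
centers = rng.uniform(-1.0, 1.0, size=8)
alpha = rng.standard_normal(8)

G = k(centers[:, None], centers[None, :])   # Gram matrix K(x_i, x_j)
rkhs_norm = np.sqrt(alpha @ G @ alpha)      # ||f||_K

grid = np.linspace(-2.0, 2.0, 2001)
f_vals = k(grid[:, None], centers[None, :]) @ alpha
sup_norm = np.abs(f_vals).max()             # grid estimate of ||f||_inf

print(f"||f||_inf ~ {sup_norm:.4f}  <=  C_K * ||f||_K = {rkhs_norm:.4f}")
```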

1 ERM over a ball in RKHS

First, we will look at the simplest case: ERM over a ball in $\mathcal{H}_K$. Thus, we pick the radius $\lambda > 0$ and take
\[
\mathcal{F} = \mathcal{F}_\lambda \triangleq \big\{ f \in \mathcal{H}_K : \|f\|_K \le \lambda \big\}.
\]
The ERM algorithm outputs the predictor
\[
\widehat{f}_n \triangleq \operatorname*{arg\,min}_{f \in \mathcal{F}_\lambda} L_n(f) = \operatorname*{arg\,min}_{f \in \mathcal{F}_\lambda} \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2,
\]
where $L_n(f)$ denotes, as usual, the empirical loss (in this case, the empirical quadratic loss) of $f$.
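Computation is not the topic of these notes, but for concreteness, here is a rough sketch of how one might approximate this constrained ERM numerically. It assumes (an assumption of the sketch, not something proved here) that the minimizer may be taken in the span of the $K(\cdot, X_i)$, as in the representer theorem, and it handles the norm constraint by sweeping the Lagrange multiplier of the corresponding penalized problem; the Gaussian kernel, the multiplier grid, and the synthetic data are all arbitrary choices made for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    # Pairwise Gaussian kernel matrix; K(x, x) = 1, so C_K = 1.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def erm_over_ball(X, y, lam, sigma=0.5, mu_grid=np.logspace(-8, 4, 200)):
    """Approximate argmin of L_n(f) over {||f||_K <= lam}: parametrize
    f = sum_i alpha_i K(., X_i), solve the penalized problem for a grid of
    multipliers mu, and keep the feasible solution with the smallest L_n."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    best = None
    for mu in mu_grid:
        alpha = np.linalg.solve(K + n * mu * np.eye(n), y)
        norm_K = np.sqrt(alpha @ K @ alpha)        # ||f||_K
        emp_loss = np.mean((y - K @ alpha) ** 2)   # L_n(f)
        if norm_K <= lam and (best is None or emp_loss < best[0]):
            best = (emp_loss, alpha)
    return best

# Tiny usage example on synthetic data with |Y| <= M = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.clip(np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(50), -1.0, 1.0)
emp_loss, alpha = erm_over_ball(X, y, lam=2.0)
print(f"L_n(f_hat) = {emp_loss:.4f}")
```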

Theorem 1. With probability at least $1 - \delta$,
\[
L(\widehat{f}_n) \le L^*(\mathcal{F}_\lambda) + \frac{16(M + C_K \lambda)^2}{\sqrt{n}} + (M^2 + C_K^2 \lambda^2) \sqrt{\frac{32 \log(1/\delta)}{n}}. \tag{4}
\]

Proof. First let us introduce some notation. Let us denote the quadratic loss function $(y, u) \mapsto (y - u)^2$ by $\ell(y, u)$, and for any $f : \mathbb{R}^d \to \mathbb{R}$ let
\[
\ell_f(x, y) \triangleq \ell(y, f(x)) = (y - f(x))^2.
\]
Let $\ell \circ \mathcal{F}_\lambda$ denote the function class $\{\ell_f : f \in \mathcal{F}_\lambda\}$. Let $f_\lambda$ denote any minimizer of $L(f)$ over $\mathcal{F}_\lambda$, i.e., $L(f_\lambda) = L^*(\mathcal{F}_\lambda)$. As usual, we write
\[
\begin{aligned}
L(\widehat{f}_n) - L^*(\mathcal{F}_\lambda) &= L(\widehat{f}_n) - L(f_\lambda) \\
&= L(\widehat{f}_n) - L_n(\widehat{f}_n) + L_n(\widehat{f}_n) - L_n(f_\lambda) + L_n(f_\lambda) - L(f_\lambda) \\
&\le 2 \sup_{f \in \mathcal{F}_\lambda} |L_n(f) - L(f)| = 2 \sup_{f \in \mathcal{F}_\lambda} |P_n \ell_f - P \ell_f| = 2\, \Delta_n(\ell \circ \mathcal{F}_\lambda),
\end{aligned} \tag{5}
\]
where the inequality holds because $L_n(\widehat{f}_n) - L_n(f_\lambda) \le 0$ by the definition of $\widehat{f}_n$, while each of the other two differences is at most $\sup_{f \in \mathcal{F}_\lambda} |L_n(f) - L(f)|$, and where we have defined the uniform deviation
\[
\Delta_n(\ell \circ \mathcal{F}) \triangleq \sup_{f \in \mathcal{F}} |P_n \ell_f - P \ell_f|.
\]
Next we show that, as a function of the training sample $Z^n$, $g(Z^n) = \Delta_n(\ell \circ \mathcal{F}_\lambda)$ has bounded differences. Indeed, for any $1 \le i \le n$, any $z^n \in \mathsf{Z}^n$, and any $z'_i \in \mathsf{Z}$, let $z^{n,i}$ denote $z^n$ with the $i$th coordinate replaced by $z'_i$. Then
\[
|g(z^n) - g(z^{n,i})| \le \frac{1}{n} \sup_{f \in \mathcal{F}_\lambda} \big| (y_i - f(x_i))^2 - (y'_i - f(x'_i))^2 \big| \le \frac{2}{n} \sup_{f \in \mathcal{F}_\lambda} \sup_{x \in \mathsf{X}} \sup_{|y| \le M} (y - f(x))^2 \le \frac{4}{n} \Big( M^2 + \sup_{f \in \mathcal{F}_\lambda} \|f\|_\infty^2 \Big) \le \frac{4(M^2 + C_K^2 \lambda^2)}{n},
\]
where the last step is by Lemma 1. Thus, $\Delta_n(\ell \circ \mathcal{F}_\lambda)$ has the bounded difference property with $c_1 = \dots = c_n = 4(M^2 + C_K^2 \lambda^2)/n$, so McDiarmid's inequality says that, for any $t > 0$,
\[
\mathbb{P}\Big( \Delta_n(\ell \circ \mathcal{F}_\lambda) \ge \mathbb{E}\, \Delta_n(\ell \circ \mathcal{F}_\lambda) + t \Big) \le \exp\left( -\frac{n t^2}{8 (M^2 + C_K^2 \lambda^2)^2} \right).
\]
Therefore, letting
\[
t = 2 (M^2 + C_K^2 \lambda^2) \sqrt{\frac{2 \log(1/\delta)}{n}},
\]
we see that
\[
\Delta_n(\ell \circ \mathcal{F}_\lambda) \le \mathbb{E}\, \Delta_n(\ell \circ \mathcal{F}_\lambda) + 2 (M^2 + C_K^2 \lambda^2) \sqrt{\frac{2 \log(1/\delta)}{n}}
\]
with probability at least $1 - \delta$. Moreover, by symmetrization we have
\[
\mathbb{E}\, \Delta_n(\ell \circ \mathcal{F}_\lambda) \le 2\, \mathbb{E}\, R_n\big( \ell \circ \mathcal{F}_\lambda(Z^n) \big), \tag{6}
\]
where
\[
R_n\big( \ell \circ \mathcal{F}_\lambda(Z^n) \big) = \frac{1}{n}\, \mathbb{E}_{\sigma^n}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \sum_{i=1}^n \sigma_i \ell_f(Z_i) \right| \right] \tag{7}
\]
is the Rademacher average of the random set
\[
\ell \circ \mathcal{F}_\lambda(Z^n) = \big\{ (\ell_f(Z_1), \dots, \ell_f(Z_n)) : f \in \mathcal{F}_\lambda \big\} = \big\{ ((Y_1 - f(X_1))^2, \dots, (Y_n - f(X_n))^2) : f \in \mathcal{F}_\lambda \big\},
\]
with $\sigma_1, \dots, \sigma_n$ i.i.d. Rademacher random variables independent of $Z^n$, as in previous lectures. To bound the Rademacher average in (7), we will need to use the contraction principle. To that end, consider the function $\varphi(t) = t^2$. On the interval $[-A, A]$ for some $A > 0$, this function is Lipschitz with constant $2A$, i.e.,
\[
|s^2 - t^2| \le 2A |s - t|, \qquad -A \le s, t \le A.
\]
Thus, since $|Y_i| \le M$ and $|f(X_i)| \le C_K \lambda$ for all $1 \le i \le n$, by the contraction principle we can write
\[
R_n\big( \ell \circ \mathcal{F}_\lambda(Z^n) \big) \le 4 (M + C_K \lambda)\, \mathbb{E}_{\sigma^n}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i (Y_i - f(X_i)) \right| \right]. \tag{8}
\]
Moreover,
\[
\mathbb{E}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i (Y_i - f(X_i)) \right| \right] \le \mathbb{E}\left| \frac{1}{n} \sum_{i=1}^n \sigma_i Y_i \right| + \mathbb{E}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) \right| \right] \le \sqrt{\frac{\mathbb{E}[Y^2]}{n}} + \mathbb{E}\, R_n\big( \mathcal{F}_\lambda(X^n) \big) \le \frac{M + C_K \lambda}{\sqrt{n}}, \tag{9}
\]
where the first step uses the triangle inequality, the second step uses the result from previous lectures on the expected absolute value of Rademacher sums, and the third step uses (1) and the bound on the Rademacher average over a ball in an RKHS.
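For reference, the RKHS-ball bound invoked in the last step of (9) is presumably the one derived earlier in the course; the computation is standard and short. For any points $x_1, \dots, x_n \in \mathsf{X}$,
\[
\mathbb{E}_{\sigma^n}\left[ \sup_{\|f\|_K \le \lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i) \right| \right] = \frac{\lambda}{n}\, \mathbb{E}_{\sigma^n}\left\| \sum_{i=1}^n \sigma_i K_{x_i} \right\|_K \le \frac{\lambda}{n} \sqrt{\sum_{i=1}^n K(x_i, x_i)} \le \frac{C_K \lambda}{\sqrt{n}},
\]
where the equality uses the reproducing property $f(x_i) = \langle f, K_{x_i} \rangle_K$ together with $\sup_{\|f\|_K \le \lambda} |\langle f, g \rangle_K| = \lambda \|g\|_K$, the first inequality follows from Jensen's inequality and $\mathbb{E}[\sigma_i \sigma_j] = \mathbf{1}\{i = j\}$, and the last inequality is (2).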

Combining (6) through (9) and overbounding (9) slightly, we conclude that
\[
\Delta_n(\ell \circ \mathcal{F}_\lambda) \le \frac{8 (M + C_K \lambda)^2}{\sqrt{n}} + 2 (M^2 + C_K^2 \lambda^2) \sqrt{\frac{2 \log(1/\delta)}{n}} \tag{10}
\]
with probability at least $1 - \delta$. Finally, combining this with (5), we get (4).

2 Regularized least squares in an RKHS

The observation we have made many times by now is that when the joint distribution of the input-output pair $(X, Y) \in \mathsf{X} \times \mathbb{R}$ is unknown, there is no hope in general to learn the optimal predictor $f^*$ from a finite training sample. Thus, restricting our attention to some hypothesis space $\mathcal{F}$, which is a proper subset of the class of all measurable functions $f : \mathsf{X} \to \mathbb{R}$, is a form of insurance: if we do not do this, then we can always find some function $f$ that attains zero empirical loss, yet performs spectacularly badly on the inputs outside the training set. When this happens, we say that our learned predictor overfits. On the other hand, if our hypothesis space $\mathcal{F}$ consists of well-behaved functions, then it is possible to learn a predictor that achieves a graceful balance between in-sample data fit and out-of-sample generalization. The price we pay is the approximation error
\[
L^*(\mathcal{F}) - L^* \triangleq \inf_{f \in \mathcal{F}} L(f) - \inf_{f : \mathsf{X} \to \mathbb{R}} L(f) \ge 0.
\]
In the regression setting, the approximation error can be expressed as
\[
L^*(\mathcal{F}) - L^* = \inf_{f \in \mathcal{F}} \|f - f^*\|^2_{L^2(P_X)},
\]
where $f^*(x) = \mathbb{E}[Y \mid X = x]$ is the regression function (the MMSE predictor of $Y$ given $X$). When seen from this perspective, the use of a restricted hypothesis space $\mathcal{F}$ is a form of regularization, a way of guaranteeing that the learned predictor performs well outside the training sample. However, this is not the only way to achieve regularization. In this section, we will analyze another way: complexity regularization. In a nutshell, complexity regularization is a modification of the ERM scheme that allows us to search over a fairly rich hypothesis space by adding a penalty term. Complexity regularization is a very general technique with wide applicability. We will look at a particular example of complexity regularization over an RKHS and derive a simple bound on its generalization performance.

To set things up, let $\gamma > 0$ be a regularization parameter. Introduce the regularized quadratic loss
\[
J_\gamma(f) \triangleq L(f) + \gamma \|f\|_K^2
\]
and its empirical counterpart
\[
J_{n,\gamma}(f) \triangleq L_n(f) + \gamma \|f\|_K^2.
\]
Define the functions
\[
f_\gamma \triangleq \operatorname*{arg\,min}_{f \in \mathcal{H}_K} J_\gamma(f) \tag{11}
\]
and
\[
f_{n,\gamma} \triangleq \operatorname*{arg\,min}_{f \in \mathcal{H}_K} J_{n,\gamma}(f). \tag{12}
\]
We will refer to (12) as the regularized kernel least squares (RKLS) algorithm. Note that the minimization in (11) and (12) takes place in the entire RKHS $\mathcal{H}_K$, rather than a subset (say, a ball). However, the addition of the regularization term $\gamma \|f\|_K^2$ ensures that the RKLS algorithm does not just select any function $f \in \mathcal{H}_K$ that happens to fit the training data well; instead, it weighs the goodness-of-fit term $L_n(f)$ against the complexity term $\gamma \|f\|_K^2$, since a very large value of $\|f\|_K^2$ would indicate that $f$ might wiggle around a lot and, therefore, overfit the training sample. The regularization parameter $\gamma > 0$ controls the relative importance of the goodness-of-fit and the complexity terms.
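As an aside on computation (not discussed in these notes): by the representer theorem, a standard fact about RKHS problems of this form, the minimizer in (12) can be written as $f_{n,\gamma} = \sum_{i=1}^n \alpha_i K(\cdot, X_i)$ with $\alpha = (\mathbf{K} + n\gamma I)^{-1} Y^n$, where $\mathbf{K}$ is the kernel Gram matrix of the training inputs. A minimal sketch follows; the Gaussian kernel and the synthetic data are assumptions of the example, not part of the notes.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    # Pairwise Gaussian kernel matrix K(a_i, b_j).
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def rkls_fit(X, y, gamma, sigma=0.5):
    """Minimize J_{n,gamma}(f) = L_n(f) + gamma * ||f||_K^2 over H_K.
    Via the representer theorem, f_{n,gamma}(x) = sum_i alpha_i K(x, X_i)
    with alpha = (K + n*gamma*I)^{-1} y."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * gamma * np.eye(n), y)

def rkls_predict(alpha, X_train, X_new, sigma=0.5):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Usage on synthetic data with |Y| <= M = 1 (bounded, as in assumption (1)).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = np.clip(np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(100), -1.0, 1.0)
alpha = rkls_fit(X, y, gamma=0.05)
print(rkls_predict(alpha, X, np.array([[0.0], [0.5]])))
```

Larger $\gamma$ shrinks the learned coefficients and hence $\|f_{n,\gamma}\|_K$, in line with the bound $\|f_{n,\gamma}\|_K \le M/\sqrt{\gamma}$ established in Lemma 3 below.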

We have the following basic bound on the generalization performance of RKLS:

Theorem 2. With probability at least $1 - \delta$,
\[
L(f_{n,\gamma}) - L^* \le A(\gamma) + \frac{16 M^2 (1 + C_K^2/\gamma)}{\sqrt{n}} + 2 \left( 2 M^2 + \frac{C_K^2 (M^2 + A(\gamma))}{\gamma} \right) \sqrt{\frac{2 \log(2/\delta)}{n}}, \tag{13}
\]
where
\[
A(\gamma) \triangleq \inf_{f \in \mathcal{H}_K} \big[ L(f) + \gamma \|f\|_K^2 \big] - L^*
\]
is the regularized approximation error.

Proof. We start with the following lemma:

Lemma 2.
\[
L(f_{n,\gamma}) - L^* \le \delta_n(f_{n,\gamma}) - \delta_n(f_\gamma) + A(\gamma), \tag{14}
\]
where $\delta_n(f) \triangleq L(f) - L_n(f)$ for all $f$.

Proof. First, an obvious overbounding gives $L(f_{n,\gamma}) - L^* \le J_\gamma(f_{n,\gamma}) - L^*$. Then
\[
\begin{aligned}
J_\gamma(f_{n,\gamma}) &= L(f_{n,\gamma}) + \gamma \|f_{n,\gamma}\|_K^2 \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + \underbrace{L_n(f_{n,\gamma}) + \gamma \|f_{n,\gamma}\|_K^2}_{= J_{n,\gamma}(f_{n,\gamma})} \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + J_{n,\gamma}(f_{n,\gamma}) - J_{n,\gamma}(f_\gamma) + J_{n,\gamma}(f_\gamma) \\
&\le L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + J_{n,\gamma}(f_\gamma) \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) + \gamma \|f_\gamma\|_K^2 \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + L(f_\gamma) + \gamma \|f_\gamma\|_K^2 \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + J_\gamma(f_\gamma),
\end{aligned}
\]
where the inequality holds because $f_{n,\gamma}$ minimizes $J_{n,\gamma}$, so that $J_{n,\gamma}(f_{n,\gamma}) \le J_{n,\gamma}(f_\gamma)$. This gives
\[
\begin{aligned}
L(f_{n,\gamma}) - L^* &\le L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + J_\gamma(f_\gamma) - L^* \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + \inf_{f \in \mathcal{H}_K} \big[ L(f) + \gamma \|f\|_K^2 \big] - L^* \\
&= \delta_n(f_{n,\gamma}) - \delta_n(f_\gamma) + A(\gamma),
\end{aligned}
\]
and we are done.

Lemma 2 shows that the excess loss of the regularized empirical loss minimizer $f_{n,\gamma}$ is bounded from above by the sum of three terms: the deviation $\delta_n(f_{n,\gamma}) = L(f_{n,\gamma}) - L_n(f_{n,\gamma})$ of $f_{n,\gamma}$ itself, the negative deviation $-\delta_n(f_\gamma) = L_n(f_\gamma) - L(f_\gamma)$ of the best regularized predictor $f_\gamma$, and the approximation error $A(\gamma)$. To prove Theorem 2, we will need to obtain high-probability bounds on the two deviation terms. To that end, we need a lemma:

Lemma 3. The functions $f_\gamma$ and $f_{n,\gamma}$ satisfy the bounds
\[
\|f_\gamma\|_\infty \le C_K \sqrt{A(\gamma)/\gamma} \tag{15}
\]
and
\[
\|f_{n,\gamma}\|_K \le M/\sqrt{\gamma} \quad \text{with probability one}, \tag{16}
\]
respectively.

Proof. To prove (15), we use the fact that
\[
A(\gamma) = L(f_\gamma) - L^* + \gamma \|f_\gamma\|_K^2 \ge \gamma \|f_\gamma\|_K^2,
\]
which gives $\|f_\gamma\|_K \le \sqrt{A(\gamma)/\gamma}$. From this and from (3) we obtain (15). For (16), we use the fact that $f_{n,\gamma}$ minimizes $J_{n,\gamma}(f)$ over all $f$. In particular,
\[
J_{n,\gamma}(f_{n,\gamma}) = L_n(f_{n,\gamma}) + \gamma \|f_{n,\gamma}\|_K^2 \le J_{n,\gamma}(0) = \frac{1}{n} \sum_{i=1}^n Y_i^2 \le M^2 \quad \text{w.p. } 1,
\]
where the last step follows from (1). Rearranging and using the fact that $L_n(f) \ge 0$ for all $f$, we get (16).

Now we are ready to bound $\delta_n(f_{n,\gamma})$. For any $R \ge 0$, let $\mathcal{F}_R = \{ f \in \mathcal{H}_K : \|f\|_K \le R \}$ denote the zero-centered ball of radius $R$ in the RKHS $\mathcal{H}_K$. Then Lemma 3 says that $f_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}}$ with probability one. Therefore, with probability one we have
\[
\delta_n(f_{n,\gamma}) = \delta_n(f_{n,\gamma})\, \mathbf{1}\{ f_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}} \} \le |\delta_n(f_{n,\gamma})|\, \mathbf{1}\{ f_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}} \} \le \underbrace{\sup_{f \in \mathcal{F}_{M/\sqrt{\gamma}}} |\delta_n(f)|}_{\Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}})} = \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}}).
\]
Consequently, we can carry out the same analysis as in the proof of Theorem 1. First of all, the function $g(Z^n) = \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}})$ has bounded differences with
\[
c_1 = \dots = c_n \le \frac{4}{n} \Big( M^2 + \sup_{f \in \mathcal{F}_{M/\sqrt{\gamma}}} \|f\|_\infty^2 \Big) \le \frac{4 M^2 (1 + C_K^2/\gamma)}{n},
\]
where the last step uses (16) and Lemma 1. Therefore, with probability at least $1 - \delta/2$,
\[
\delta_n(f_{n,\gamma}) \le \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}}) \le \frac{8 M^2 (1 + C_K/\sqrt{\gamma})^2}{\sqrt{n}} + 2 M^2 (1 + C_K^2/\gamma) \sqrt{\frac{2 \log(2/\delta)}{n}}, \tag{17}
\]
where the second step follows from (10) with $\delta$ replaced by $\delta/2$ and with $\lambda = M/\sqrt{\gamma}$.
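The remaining deviation term will be handled by Hoeffding's inequality. For convenience, here is the form used below (a standard statement, recalled rather than proved): if $U_1, \dots, U_n$ are i.i.d. random variables with $\mathbb{E} U_i = 0$ and $|U_i| \le B$ with probability one, then for every $t \ge 0$,
\[
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n U_i \ge t \right) \le \exp\left( -\frac{n t^2}{2 B^2} \right).
\]
With $B = 2\big( M^2 + C_K^2 A(\gamma)/\gamma \big)$, this is exactly the bound applied in the next display.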

It remains to bound $-\delta_n(f_\gamma)$. This is, actually, much easier, since we are dealing with a single data-independent function. In particular, note that we can write
\[
-\delta_n(f_\gamma) = \frac{1}{n} \sum_{i=1}^n (Y_i - f_\gamma(X_i))^2 - \mathbb{E}\big[ (Y - f_\gamma(X))^2 \big] = \frac{1}{n} \sum_{i=1}^n U_i,
\]
where
\[
U_i \triangleq (Y_i - f_\gamma(X_i))^2 - \mathbb{E}\big[ (Y - f_\gamma(X))^2 \big], \qquad 1 \le i \le n,
\]
are i.i.d. random variables with $\mathbb{E} U_i = 0$ and
\[
|U_i| \le \sup_{y \in [-M, M]} \sup_{x \in \mathsf{X}} (y - f_\gamma(x))^2 \le 2 \big( M^2 + \|f_\gamma\|_\infty^2 \big) \le 2 \left( M^2 + \frac{C_K^2 A(\gamma)}{\gamma} \right)
\]
with probability one, where we have used (1) and (15). We can therefore use Hoeffding's inequality to write, for any $t \ge 0$,
\[
\mathbb{P}\big( -\delta_n(f_\gamma) \ge t \big) = \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n U_i \ge t \right) \le \exp\left( -\frac{n t^2}{8 \big( M^2 + C_K^2 A(\gamma)/\gamma \big)^2} \right).
\]
This implies that
\[
-\delta_n(f_\gamma) \le 2 \left( M^2 + \frac{C_K^2 A(\gamma)}{\gamma} \right) \sqrt{\frac{2 \log(2/\delta)}{n}} \tag{18}
\]
with probability at least $1 - \delta/2$. Combining (17) and (18) with (14), and overbounding $(1 + C_K/\sqrt{\gamma})^2 \le 2(1 + C_K^2/\gamma)$ in the first term of (17), we get (13).
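To close with an illustration of the tradeoff that (13) quantifies (this experiment is not part of the notes; the regression function, the kernel, and all parameter values below are arbitrary choices), one can generate synthetic data with a known $f^*$, run RKLS for several values of $\gamma$, and estimate the excess risk $L(f_{n,\gamma}) - L^* = \|f_{n,\gamma} - f^*\|^2_{L^2(P_X)}$ by Monte Carlo: small $\gamma$ keeps the approximation error $A(\gamma)$ small but inflates the $1/\gamma$ factors in the deviation terms, while large $\gamma$ does the opposite.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel(A, B, sigma=0.3):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

f_star = lambda x: np.sin(4.0 * x[:, 0])            # known regression function

def sample(n):
    X = rng.uniform(-1.0, 1.0, size=(n, 1))
    Y = f_star(X) + rng.uniform(-0.3, 0.3, size=n)   # bounded noise, so |Y| <= M
    return X, Y

n = 200
X, Y = sample(n)
X_test, _ = sample(5000)                             # fresh draws from P_X for the risk integral
K = gaussian_kernel(X, X)

for gamma in [1e-4, 1e-2, 1e0]:
    alpha = np.linalg.solve(K + n * gamma * np.eye(n), Y)      # RKLS coefficients
    preds = gaussian_kernel(X_test, X) @ alpha
    excess = np.mean((preds - f_star(X_test)) ** 2)  # ~ ||f_{n,gamma} - f*||^2_{L^2(P_X)}
    print(f"gamma = {gamma:g}: estimated excess risk ~ {excess:.4f}")
```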