$\ell_1$-Regularized Linear Regression: Persistence and Oracle Inequalities


$\ell_1$-Regularized Linear Regression: Persistence and Oracle Inequalities. Peter Bartlett, EECS and Statistics, UC Berkeley. Slides at http://www.stat.berkeley.edu/~bartlett. Joint work with Shahar Mendelson and Joe Neeman.

$\ell_1$-regularized linear regression

Random pair: $(X, Y) \sim P$, in $\mathbb{R}^d \times \mathbb{R}$. $n$ independent samples drawn from $P$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Find $\beta$ so that the linear function $\langle X, \beta\rangle$ has small risk,
$$P\ell_\beta = P\left(\langle X, \beta\rangle - Y\right)^2.$$
Here, $\ell_\beta(X, Y) = (\langle X, \beta\rangle - Y)^2$ is the quadratic loss of the linear prediction.
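
A minimal numerical rendering of these objects (my own illustration, not from the talk): the quadratic loss, the empirical risk $P_n\ell_\beta$, and a Monte Carlo estimate of the population risk $P\ell_\beta$ under a synthetic choice of $P$.

```python
# Quadratic loss, empirical risk P_n l_beta, and a Monte Carlo estimate of P l_beta.
import numpy as np

rng = np.random.default_rng(0)
d = 10
beta_true = rng.standard_normal(d)

def sample(n):                        # a stand-in for P: X ~ N(0, I_d), Y = <X, beta_true> + noise
    X = rng.standard_normal((n, d))
    return X, X @ beta_true + 0.3 * rng.standard_normal(n)

def empirical_risk(beta, X, y):       # P_n l_beta = (1/n) sum_i (<X_i, beta> - Y_i)^2
    return np.mean((X @ beta - y) ** 2)

beta = np.zeros(d)
X, y = sample(200)
print(empirical_risk(beta, X, y))             # P_n l_beta on n = 200 samples
print(empirical_risk(beta, *sample(10**6)))   # Monte Carlo estimate of the population risk P l_beta
```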

$\ell_1$-regularized linear regression

Random pair: $(X, Y) \sim P$, in $\mathbb{R}^d \times \mathbb{R}$. $n$ independent samples drawn from $P$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Find $\beta$ so that the linear function $\langle X, \beta\rangle$ has small risk, $P\ell_\beta = P(\langle X, \beta\rangle - Y)^2$.

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right),$$
where $P_n \ell_\beta = \frac{1}{n}\sum_{i=1}^n (\langle X_i, \beta\rangle - Y_i)^2$ and $\|\beta\|_{\ell_1^d} = \sum_{j=1}^d |\beta_j|$.

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right),$$
where $P_n \ell_\beta = \frac{1}{n}\sum_{i=1}^n (\langle X_i, \beta\rangle - Y_i)^2$ and $\|\beta\|_{\ell_1^d} = \sum_{j=1}^d |\beta_j|$.

Tends to select sparse solutions (few non-zero components $\beta_j$). Useful, for example, if $d \gg n$.

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right).$$

Example. $\ell_1$-constrained least squares:
$$\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta.$$

[Recall: $\ell_\beta(X, Y) = (\langle X, \beta\rangle - Y)^2$.]
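
Both estimators are easy to try numerically. Below is a minimal sketch (my own illustration, not from the talk): the penalized form via scikit-learn's Lasso, whose objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$ is a rescaling of $P_n\ell_\beta + \rho_n\|\beta\|_{\ell_1^d}$, and the constrained form via projected gradient descent onto the $\ell_1$ ball. The data, the radius $b$, and the penalty level are arbitrary choices for illustration.

```python
# Minimal sketch of the two estimators on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

def project_l1_ball(v, b):
    # Euclidean projection of v onto {x : ||x||_1 <= b} (sort-based algorithm)
    if np.abs(v).sum() <= b:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - b)[0][-1]
    theta = (css[k] - b) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def l1_constrained_ls(X, y, b, n_iter=2000):
    # Projected gradient descent for min_{||beta||_1 <= b} P_n l_beta
    n, d = X.shape
    beta = np.zeros(d)
    step = 0.5 * n / np.linalg.norm(X, 2) ** 2       # 1/L, where L is the gradient's Lipschitz constant
    for _ in range(n_iter):
        grad = (2.0 / n) * (X.T @ (X @ beta - y))
        beta = project_l1_ball(beta - step * grad, b)
    return beta

rng = np.random.default_rng(0)
n, d = 200, 500                                      # d > n: the regime where the l1 constraint helps
beta_true = np.zeros(d); beta_true[:5] = 1.0
X = rng.standard_normal((n, d))
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_pen = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_   # l1-regularized least squares
beta_con = l1_constrained_ls(X, y, b=5.0)                          # l1-constrained least squares
print(int((beta_pen != 0).sum()), float(np.abs(beta_con).sum()))   # sparsity / l1 norm of the solutions
```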

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right).$$

Example. $\ell_1$-constrained least squares:
$$\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta.$$

Some questions:

Prediction: Does $\hat\beta$ give accurate forecasts? e.g., How does $P\ell_{\hat\beta}$ compare with $P\ell_{\beta^*}$? Here, $\beta^* = \arg\min\left\{P\ell_\beta : \|\beta\|_{\ell_1^d} \le b_n\right\}$.

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right).$$

Example. $\ell_1$-constrained least squares:
$$\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta.$$

Some questions:

Prediction: Does $\hat\beta$ give accurate forecasts? e.g., $P\ell_{\hat\beta}$ versus $P\ell_{\beta^*} = \min\left\{P\ell_\beta : \|\beta\|_{\ell_1^d} \le b_n\right\}$?

Estimation: Under assumptions on $P$, is $\hat\beta$ correct?

Sparsity Pattern Estimation: Under assumptions on $P$, are the non-zeros of $\hat\beta$ correct?

Outline of Talk

1. For $\ell_1$-constrained least squares, bounds on $P\ell_{\hat\beta} - P\ell_{\beta^*}$.
   Persistence (Greenshtein and Ritov, 2004): For what $d_n$, $b_n$ does $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$?
   Convex Aggregation (Tsybakov, 2003): For $b = 1$ (convex combinations of dictionary functions), what is the rate of $P\ell_{\hat\beta} - P\ell_{\beta^*}$?
2. For $\ell_1$-regularized least squares, oracle inequalities.
3. Proof ideas.

$\ell_1$-regularized linear regression

Key Issue: $\ell_\beta$ is unbounded, so some key tools (e.g., concentration inequalities) cannot immediately be applied.

For $(X, Y)$ bounded, $\ell_\beta$ can be bounded using $\|\beta\|_{\ell_1^d}$, but this gives loose prediction bounds.

We use chaining to show that the metric structures of $\ell_1$-constrained linear functions under $P_n$ and $P$ are similar.

Main Results: Excess Risk

For $\ell_1$-constrained least squares, $\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b} P_n \ell_\beta$, if $X$ and $Y$ have suitable tail behaviour then, with probability $1 - \delta$,
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$

Small $d$ regime: $d/n$. Large $d$ regime: $b/\sqrt n$.

Main Results: Excess Risk

For $\ell_1$-constrained least squares, with probability $1 - \delta$,
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$

Conditions:
1. $PY^2$ is bounded by a constant.
2. Either $X$ is bounded a.s., or $X$ is log-concave with $\max_j \|\langle X, e_j\rangle\|_{L_2} \le c$, or $X$ is log-concave and isotropic.

Application: Persistence

Consider $\ell_1$-constrained least squares, $\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta$.

Suppose that $PY^2$ is bounded by a constant and the tails of $X$ decay nicely (e.g., $X$ bounded a.s., or $X$ log-concave and isotropic). Then for increasing $d_n$ and
$$b_n = o\!\left(\frac{\sqrt n}{\log^{3/2} n\,\log^{3/2}(n d_n)}\right),$$
$\ell_1$-constrained least squares is persistent (i.e., $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$).
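
A hedged simulation sketch (my own illustration, not from the talk): with $X \sim N(0, I_d)$ and $Y = \langle X, \beta_0\rangle + \text{noise}$, the population risk is $P\ell_\beta = \|\beta - \beta_0\|_2^2 + \sigma^2$ and $\beta^*$ is the Euclidean projection of $\beta_0$ onto the $\ell_1$ ball, so the excess risk $P\ell_{\hat\beta} - P\ell_{\beta^*}$ can be computed exactly. The solver, the simplified radius schedule $b_n \asymp \sqrt n/\log(n d_n)$ (which is of the allowed $\tilde o(\sqrt n)$ order), and all constants are illustrative assumptions.

```python
# Simulation sketch of persistence for l1-constrained least squares.
# With X ~ N(0, I_d) and Y = <X, beta0> + sigma*Z, the population risk is
#   P l_beta = ||beta - beta0||_2^2 + sigma^2,
# and beta* (the best beta in the l1 ball) is the Euclidean projection of beta0 onto the ball.
import numpy as np

def project_l1_ball(v, b):
    # Euclidean projection onto {x : ||x||_1 <= b} (sort-based algorithm)
    if np.abs(v).sum() <= b:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - b)[0][-1]
    theta = (css[k] - b) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def l1_constrained_ls(X, y, b, n_iter=1000):
    # Projected gradient descent for min_{||beta||_1 <= b} (1/n)||X beta - y||^2
    n, d = X.shape
    beta = np.zeros(d)
    step = 0.5 * n / np.linalg.norm(X, 2) ** 2        # 1/L for the empirical-risk gradient
    for _ in range(n_iter):
        grad = (2.0 / n) * (X.T @ (X @ beta - y))
        beta = project_l1_ball(beta - step * grad, b)
    return beta

rng = np.random.default_rng(1)
sigma = 0.5
for n in [100, 400, 1600]:
    d = 2 * n                                          # dimension d_n growing with n
    b = np.sqrt(n) / np.log(n * d)                     # simplified b_n schedule, of order o-tilde(sqrt(n))
    beta0 = np.zeros(d); beta0[:5] = 1.0               # fixed target, not necessarily inside the ball
    X = rng.standard_normal((n, d))
    y = X @ beta0 + sigma * rng.standard_normal(n)
    beta_hat = l1_constrained_ls(X, y, b)
    beta_star = project_l1_ball(beta0, b)              # minimizer of P l_beta over the l1 ball
    excess = np.sum((beta_hat - beta0) ** 2) - np.sum((beta_star - beta0) ** 2)
    print(n, d, round(float(b), 2), round(float(excess), 4))
```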

Application: Persistence

If $PY^2$ is bounded and the tails of $X$ decay nicely, then $\ell_1$-constrained least squares is persistent provided that $d_n$ is increasing and
$$b_n = o\!\left(\frac{\sqrt n}{\log^{3/2} n\,\log^{3/2}(n d_n)}\right).$$

Previous Results (Greenshtein and Ritov, 2004):
1. $b_n = \omega(n^{1/2}/\log^{1/2} n)$ implies empirical minimization is not persistent for Gaussian $(X, Y)$.
2. $b_n = o(n^{1/2}/\log^{1/2} n)$ implies empirical minimization is persistent for Gaussian $(X, Y)$.
3. $b_n = o(n^{1/4}/\log^{1/4} n)$ implies empirical minimization is persistent under tail conditions on $(X, Y)$.

Application: Convex Aggregation

Consider $b = 1$, so that the $\ell_1$-ball of radius $b$ is the convex hull of a dictionary of $d$ functions (the components of $X$). Tsybakov (2003) showed that, for any aggregation scheme $\hat\beta$, the rate of convex aggregation satisfies
$$P\ell_{\hat\beta} - P\ell_{\beta^*} = \Omega\!\left(\min\left\{\frac{d}{n},\; \sqrt{\frac{\log d}{n}}\right\}\right).$$

For bounded, isotropic distributions, our result implies that this rate can be achieved, up to log factors, by least squares over the convex hull of the dictionary. Previous positive results (Tsybakov, 2003; Bunea, Tsybakov and Wegkamp, 2006) involved complicated estimators.

Main Results: Oracle Inequality

For $\ell_1$-regularized least squares,
$$\hat\beta = \arg\min_\beta\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^{d_n}} \right),$$
if $d_n$ increases but $\log d_n = o(n)$, and $Y$ and $X$ are bounded a.s., then with probability at least $1 - o(1)$,
$$P\ell_{\hat\beta} \;\le\; \inf_\beta\left( P\ell_\beta + c\,\rho_n\left(1 + \|\beta\|_{\ell_1^{d_n}}\right) \right),
\qquad \text{where } \rho_n = c\,\frac{\log^{3/2} n\,\log^{1/2}(d_n n)}{\sqrt n}.$$
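
As a quick worked instance (my own arithmetic, using the expression for $\rho_n$ above), the penalty level vanishes for polynomially and even sub-exponentially growing dimension:
$$d_n = n^\gamma \;\Longrightarrow\; \rho_n = c\,\frac{\log^{3/2} n\,\sqrt{(\gamma+1)\log n}}{\sqrt n} = c\sqrt{\gamma+1}\;\frac{\log^{2} n}{\sqrt n}\;\to 0,
\qquad
d_n = e^{n^{a}},\ a<1 \;\Longrightarrow\; \rho_n \asymp \frac{\log^{3/2} n}{n^{(1-a)/2}} \to 0.$$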

Outline of Talk

1. For $\ell_1$-constrained least squares, bounds on $P\ell_{\hat\beta} - P\ell_{\beta^*}$.
   Persistence: For what $d_n$, $b_n$ does $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$?
   Convex Aggregation: For $b = 1$ (convex combinations of dictionary functions), what is the rate of $P\ell_{\hat\beta} - P\ell_{\beta^*}$?
2. For $\ell_1$-regularized least squares, oracle inequalities.
3. Proof ideas.

Proof Ideas: 1. $\epsilon$-equivalence of $P$ and $P_n$ structures

Define
$$G_\lambda = \left\{ \frac{\lambda}{P(\ell_\beta - \ell_{\beta^*})}\,(\ell_\beta - \ell_{\beta^*}) \;:\; P(\ell_\beta - \ell_{\beta^*}) \ge \lambda \right\}.$$

Then: if $\mathbb{E}\sup_{g \in G_\lambda} |P_n g - P g|$ is small, then with high probability, for all $\beta$ with $P(\ell_\beta - \ell_{\beta^*}) \ge \lambda$,
$$(1 - \epsilon)\,P(\ell_\beta - \ell_{\beta^*}) \;\le\; P_n(\ell_\beta - \ell_{\beta^*}) \;\le\; (1 + \epsilon)\,P(\ell_\beta - \ell_{\beta^*}),$$
and hence $P(\ell_{\hat\beta} - \ell_{\beta^*}) \le \lambda$, where $\hat\beta = \arg\min_\beta P_n \ell_\beta$.
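
The final implication is a one-line argument (my own rendering of the standard step): $\hat\beta$ minimizes the empirical risk, so $P_n(\ell_{\hat\beta} - \ell_{\beta^*}) \le 0$; if we also had $P(\ell_{\hat\beta} - \ell_{\beta^*}) \ge \lambda$, the left-hand inequality would give
$$0 \;\ge\; P_n(\ell_{\hat\beta} - \ell_{\beta^*}) \;\ge\; (1-\epsilon)\,P(\ell_{\hat\beta} - \ell_{\beta^*}) \;\ge\; (1-\epsilon)\,\lambda \;>\; 0,$$
a contradiction (for $\epsilon < 1$ and $\lambda > 0$). Hence $P(\ell_{\hat\beta} - \ell_{\beta^*}) < \lambda$.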

Proof Ideas: 2. Symmetrization, subgaussian tails

For a class of functions $F$,
$$\mathbb{E}\sup_{f \in F} |P_n f - P f| \;\le\; \frac{2}{n}\,\mathbb{E}\sup_{f \in F} \left|\sum_i \epsilon_i f(X_i)\right|,$$
where the $\epsilon_i$ are independent Rademacher ($\pm 1$) random variables. (Giné and Zinn, 1984)

The Rademacher process
$$Z_\beta = \sum_i \epsilon_i\left(\ell_\beta(X_i, Y_i) - \ell_{\beta^*}(X_i, Y_i)\right)$$
indexed by $\beta$ in
$$T_\lambda = \left\{\beta \in \mathbb{R}^d : \|\beta\|_{\ell_1^d} \le b,\; P\ell_\beta - P\ell_{\beta^*} \le \lambda\right\}$$
is subgaussian w.r.t. a certain metric $d$. That is, for every $\beta_1, \beta_2 \in T_\lambda$ and $t \ge 1$,
$$\Pr\left(|Z_{\beta_1} - Z_{\beta_2}| \ge t\,d(\beta_1, \beta_2)\right) \le 2\exp(-t^2/2).$$
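
To make the symmetrization step concrete, here is a small Monte Carlo sketch (my own illustration, and for the plain linear class $\{x \mapsto \langle x, \beta\rangle : \|\beta\|_{\ell_1^d} \le b\}$ rather than the loss-difference class above). By $\ell_1$/$\ell_\infty$ duality, the supremum of the Rademacher sum over the $\ell_1$ ball is $b\,\max_j |\sum_i \epsilon_i X_{ij}|$, so the Rademacher average can be estimated directly.

```python
# Monte Carlo estimate of the (conditional) Rademacher average of the linear class
# F = { x -> <x, beta> : ||beta||_1 <= b }.  By l1/l_infty duality,
#   sup_{||beta||_1 <= b} | (1/n) sum_i eps_i <X_i, beta> | = (b/n) * max_j | sum_i eps_i X_ij |.
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 500, 2000, 1.0
X = rng.standard_normal((n, d))      # design held fixed; the expectation is over the signs eps_i

n_mc = 200
sups = np.empty(n_mc)
for m in range(n_mc):
    eps = rng.choice([-1.0, 1.0], size=n)          # independent Rademacher signs
    sups[m] = b / n * np.max(np.abs(eps @ X))      # sup over the l1 ball, via duality
rademacher_avg = sups.mean()

# Symmetrization: E sup_{f in F} |P_n f - P f| <= 2 * E[ this quantity ].
# For comparison, the finite-maximum bound gives roughly b * sqrt(2 log(2d) / n) here,
# since each column of X has Euclidean norm about sqrt(n).
print(rademacher_avg, b * np.sqrt(2 * np.log(2 * d) / n))
```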

Proof Ideas: 2. Symmetrization, subgaussian tails

The Rademacher process $Z_\beta = \sum_i \epsilon_i\left(\ell_\beta(X_i, Y_i) - \ell_{\beta^*}(X_i, Y_i)\right)$ indexed by $\beta$ in $T_\lambda = \{\beta \in \mathbb{R}^d : \|\beta\|_{\ell_1^d} \le b,\; P\ell_\beta - P\ell_{\beta^*} \le \lambda\}$ is subgaussian w.r.t. the metric
$$d(\beta_1, \beta_2) = 4 \max_i |\langle X_i, \beta_1 - \beta_2\rangle| \left( \sup_{v \in \sqrt\lambda D \,\cap\, 2T} \sum_i \langle X_i, v\rangle^2 + \sum_i \ell_{\beta^*}(X_i, Y_i) \right)^{1/2}.$$

This is a random $\ell_\infty$ distance, scaled by the random $\ell_2$ diameter of $\sqrt\lambda D \cap 2T$. Here, $D = \{x \in \mathbb{R}^d : P\langle X, x\rangle^2 \le 1\}$.

Proof Ideas: 3. Chaining

For a subgaussian process $\{Z_t\}$ indexed by a metric space $(T, d)$, and for $t_0 \in T$,
$$\mathbb{E}\sup_{t \in T} |Z_t - Z_{t_0}| \;\le\; c\,D(T, d) = c \int_0^{\mathrm{diam}(T,d)} \sqrt{\log N(\epsilon, T, d)}\, d\epsilon,$$
where $N(\epsilon, T, d)$ is the $\epsilon$-covering number of $T$.
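
As a sanity check of the entropy-integral bound (a standard special case, not from the talk): for a finite index set with $|T| = N$, the covering numbers satisfy $N(\epsilon, T, d) \le N$ for every $\epsilon$, so
$$\mathbb{E}\sup_{t \in T} |Z_t - Z_{t_0}| \;\le\; c\int_0^{\mathrm{diam}(T,d)} \sqrt{\log N(\epsilon, T, d)}\,d\epsilon \;\le\; c\,\mathrm{diam}(T,d)\,\sqrt{\log N},$$
recovering the usual maximal inequality for finitely many subgaussian increments.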

Proof Ideas: 4. Bounding the Entropy Integral

It suffices to calculate the entropy integral $D(\sqrt\lambda D \cap 2b B_1^d,\, d)$ (here $B_1^d$ denotes the unit ball of $\ell_1^d$). We can approximate this by
$$D(\sqrt\lambda D \cap 2b B_1^d,\, d) \;\le\; \min\left\{ D(\sqrt\lambda D,\, d),\; D(2b B_1^d,\, d) \right\}.$$

This leads to:
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$

Proof Ideas: 5. Oracle Inequalities

We get an isomorphic condition on $\{\ell_\beta - \ell_{\beta^*}\}$:
$$\tfrac{1}{2}\, P_n(\ell_\beta - \ell_{\beta^*}) - \epsilon_n \;\le\; P(\ell_\beta - \ell_{\beta^*}) \;\le\; 2\, P_n(\ell_\beta - \ell_{\beta^*}) + \epsilon_n,$$
and this implies that $\hat\beta = \arg\min_\beta \left(P_n \ell_\beta + c\,\epsilon_n\right)$ has
$$P\ell_{\hat\beta} \;\le\; \inf_\beta \left( P\ell_\beta + c\,\epsilon_n \right).$$

This leads to the oracle inequality: for $\ell_1$-regularized least squares,
$$\hat\beta = \arg\min_\beta\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^{d_n}} \right),$$
with probability at least $1 - o(1)$,
$$P\ell_{\hat\beta} \;\le\; \inf_\beta\left( P\ell_\beta + c\,\rho_n\left(1 + \|\beta\|_{\ell_1^{d_n}}\right) \right).$$

Outline of Talk

1. For $\ell_1$-constrained least squares,
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$
   Persistence: If $b_n = \tilde o(\sqrt n)$, then $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$.
   Convex Aggregation: Empirical risk minimization gives the optimal rate (up to log factors): $\tilde O\!\left(\min\left\{d/n,\, \sqrt{\log d / n}\right\}\right)$.
2. For $\ell_1$-regularized least squares, oracle inequalities.
3. Proof ideas: subgaussian Rademacher process.