Convergence rates of spectral methods for statistical inverse learning problems


Convergence rates of spectral methods for statistical inverse learning problems. G. Blanchard (Universität Potsdam), UCL/Gatsby unit, 04/11/2015. Joint work with N. Mücke (U. Potsdam) and N. Krämer (U. München).

1. The inverse learning setting
2. Rates for linear spectral regularization methods
3. Rates for conjugate gradient regularization


DETERMINISTIC AND STATISTICAL INVERSE PROBLEMS
Let A be a bounded operator between Hilbert spaces H_1 → H_2 (assumed known).
Classical (deterministic) inverse problem: observe
    y_σ = A f + σ η,   (IP)
under the assumption ‖η‖ ≤ 1.
Note: the H_2-norm measures the observation error; the H_1-norm measures the reconstruction error.
Classical deterministic theory: see Engl, Hanke and Neubauer (2000).

DETERMINISTIC AND STATISTICAL INVERSE PROBLEMS
Inverse problem: y_σ = A f + σ η.   (IP)
What if the noise is random? Classical statistical inverse problem model: η is a Gaussian white noise process on H_2.
Note: in this case (IP) is not an equation between elements of H_2, but is to be interpreted as a process on H_2.
Under a Hölder source condition of order r and polynomial ill-posedness (eigenvalue decay) of order 1/s, sharp minimax rates are known in this setting:
    ‖(A*A)^θ (f̂ − f)‖_{H_1} = O( σ^{2(r+θ)/(2r+1+s)} ) = O( σ^{2(ν+bθ)/(2ν+b+1)} ),   for θ ∈ [0, 1/2]
(θ = 0: inverse problem; θ = 1/2: direct problem).
(Alternate parametrization: b := 1/s, ν := rb the intrinsic regularity.)

LINEAR SPECTRAL REGULARIZATION METHODS
Inverse problem (deterministic or statistical) where A is known. First consider the so-called "normal equation":
    A* y_σ = (A*A) f + σ (A* η).
Linear spectral methods: let ζ_λ(x): R_+ → R_+ be a real function of one real variable which is an approximation of 1/x, and λ > 0 a tuning parameter. Define
    f̂_λ = ζ_λ(A*A) A* y_σ.
Examples: Tikhonov ζ_λ(x) = (x + λ)^{-1}, spectral cut-off ζ_λ(x) = x^{-1} 1{x ≥ λ}, Landweber iteration polynomials, ν-methods...
Under general conditions on ζ_λ, optimal/minimax rates can be attained by such methods (deterministic: Engl et al., 2000; stochastic noise: Bissantz et al., 2007).
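To make the filter idea concrete, here is a minimal numpy sketch (my own illustration, not part of the talk) applying the Tikhonov and spectral cut-off filters to a toy discretized operator A through its SVD; the operator, data and regularization parameter are all made up for the example.

import numpy as np

def spectral_filter_estimate(A, y, lam, method="tikhonov"):
    # f_hat = zeta_lam(A* A) A* y, computed through the SVD of A
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    t = s ** 2                                  # eigenvalues of A* A
    if method == "tikhonov":
        zeta = 1.0 / (t + lam)                  # zeta_lam(t) = (t + lam)^{-1}
    elif method == "cutoff":
        zeta = np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)   # t^{-1} 1{t >= lam}
    else:
        raise ValueError(method)
    # zeta(A* A) A* y = V diag(zeta(t) * s) U^T y
    return Vt.T @ (zeta * s * (U.T @ y))

# made-up mildly ill-posed toy problem
rng = np.random.default_rng(0)
n = d = 50
A = np.exp(-np.abs(np.subtract.outer(np.arange(n), np.arange(d))) / 5.0)
f_true = np.sin(np.linspace(0, 3 * np.pi, d))
y = A @ f_true + 0.01 * rng.standard_normal(n)
print(np.linalg.norm(spectral_filter_estimate(A, y, lam=1e-3) - f_true))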

STATISTICAL LEARNING
"Learning" usually refers to the following setting:
    (X_i, Y_i)_{i=1,...,n} i.i.d. ∼ P_{XY} on X × Y, where Y ⊆ R.
Goal: estimate some functional related to the dependency between X and Y, for instance (nonparametric) least squares regression: estimate f*(x) := E[Y | X = x], and measure the quality of an estimator f̂ via
    ‖f̂ − f*‖²_{L²(P_X)} = E_{X∼P_X}[ (f̂(X) − f*(X))² ].

SETTING: INVERSE LEARNING PROBLEM
We refer to "inverse learning" for an inverse problem where we have noisy observations at random design points:
    (X_i, Y_i)_{i=1,...,n} i.i.d.:   Y_i = (A f)(X_i) + ε_i.   (ILP)
The goal is to recover f ∈ H_1.
Early works on closely related subjects: from the splines literature in the 80's (e.g. O'Sullivan '90).

MAIN ASSUMPTION FOR INVERSE LEARNING
Model: Y_i = (A f)(X_i) + ε_i, i = 1, ..., n, where A: H_1 → H_2.   (ILP)
Observe:
- H_2 should be a space of real-valued functions on X;
- the geometrical structure of the measurement errors will be dictated by the statistical properties of the sampling scheme; we do not need to assume or consider any a priori Hilbert structure on H_2;
- the crucial structural assumption we make is the following:
Assumption. The family of evaluation functionals (S_x), x ∈ X, defined by
    S_x: H_1 → R,   f ↦ S_x(f) := (A f)(x),
is uniformly bounded, i.e., there exists κ < ∞ such that for any x ∈ X,
    |S_x(f)| ≤ κ ‖f‖_{H_1}.

GEOMETRY OF INVERSE LEARNING
The inverse learning setting was essentially introduced by Caponnetto et al. (2006).
- Riesz's theorem implies the existence, for any x ∈ X, of F_x ∈ H_1 such that ∀ f ∈ H_1: (A f)(x) = ⟨f, F_x⟩.
- K(x, y) := ⟨F_x, F_y⟩ defines a positive semidefinite kernel on X with associated reproducing kernel Hilbert space (RKHS) denoted H_K.
- As a pure function space, H_K coincides with Im(A).
- Assuming A injective, A is in fact an isometric isomorphism between H_1 and H_K.

GEOMETRY OF INVERSE LEARNING
The main assumption implies that, as a function space, Im(A) is endowed with a natural RKHS structure with a kernel K bounded by κ. Furthermore this RKHS H_K is isometric to H_1 (through A^{-1}).
Therefore, the inverse learning problem is formally equivalent to the kernel learning problem
    Y_i = h*(X_i) + ε_i,   i = 1, ..., n,
where h* ∈ H_K, and we measure the quality of an estimator ĥ ∈ H_K via the RKHS norm ‖ĥ − h*‖_{H_K}.
Indeed, if we put f̂ := A^{-1} ĥ, then
    ‖f̂ − f‖_{H_1} = ‖A(f̂ − f)‖_{H_K} = ‖ĥ − h*‖_{H_K}.

SETTING, REFORMULATED
We are actually back to the familiar regression setting with random design,
    Y_i = h*(X_i) + ε_i,
where (X_i, Y_i)_{1≤i≤n} is an i.i.d. sample from P_{XY} on the space X × R, with E[ε_i | X_i] = 0.
Noise assumption:
    (BernsteinNoise)   E[ |ε_i|^p | X_i ] ≤ ½ p! M^p,   for all p ≥ 2.
h* is assumed to lie in a (known) RKHS H_K with bounded kernel K.
The criterion for measuring the quality of an estimator ĥ is the RKHS norm ‖ĥ − h*‖_{H_K}.

2. Rates for linear spectral regularization methods

EMPIRICAL AND POPULATION OPERATORS
Define:
- the (random) empirical evaluation operator T_n: h ∈ H ↦ (h(X_1), ..., h(X_n)) ∈ R^n, and its population counterpart the inclusion operator T: h ∈ H ↦ h ∈ L²(X, P_X);
- the (random) empirical kernel integral operator T_n*: (v_1, ..., v_n) ∈ R^n ↦ (1/n) Σ_{i=1}^n K(X_i, ·) v_i ∈ H, and its population counterpart, the kernel integral operator T*: f ∈ L²(X, P_X) ↦ T*(f) = ∫ f(x) K(x, ·) dP_X(x) ∈ H;
- finally, the empirical covariance operator S_n = T_n* T_n and its population counterpart S = T* T.
Observe that S_n, S are both operators H_K → H_K; the intuition is that S_n is a (random) approximation of S.
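As a sanity check on these definitions, the following small numpy sketch (illustration only; the Gaussian kernel and toy design are my assumptions) implements T_n, T_n* and S_n = T_n* T_n for a kernel on the real line, and verifies that evaluating S_n h on the sample reduces to a Gram-matrix product.

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=20)                 # design points X_1, ..., X_n
n = len(X)
K = lambda x, y: np.exp(-(x - y) ** 2)          # a bounded kernel (assumed for illustration)

# empirical evaluation operator T_n: h -> (h(X_1), ..., h(X_n))
def T_n(h):
    return np.array([h(x) for x in X])

# empirical kernel integral operator T_n*: v -> (1/n) sum_i v_i K(X_i, .)
def T_n_star(v):
    return lambda x: np.mean(v * K(X, x))

# empirical covariance operator S_n = T_n* T_n acting on a function h
def S_n(h):
    return T_n_star(T_n(h))

# sanity check: ((S_n h)(X_1), ..., (S_n h)(X_n)) = (1/n) [K(X_i, X_j)] (h(X_1), ..., h(X_n))
h = lambda x: np.sin(3 * x)
K_mat = K(X[:, None], X[None, :])               # unnormalized Gram matrix K(X_i, X_j)
lhs = T_n(S_n(h))
rhs = K_mat @ T_n(h) / n
print(np.allclose(lhs, rhs))                    # True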

Recall the model with h* ∈ H_K:
    Y_i = h*(X_i) + ε_i,   i.e.   Y = T_n h* + ε,   where Y := (Y_1, ..., Y_n).
Associated "normal equation":
    Z = T_n* Y = T_n* T_n h* + T_n* ε = S_n h* + T_n* ε.
Idea (Rosasco, Caponnetto, De Vito, Odone): use methods from the inverse problems literature.
- Observe that there is also an error on the operator.
- Use concentration principles to bound ‖T_n* ε‖ and ‖S_n − S‖.

LINEAR SPECTRAL REGULARIZATION METHODS
Linear spectral methods:
    ĥ_ζ = ζ(S_n) Z
for some well-chosen function ζ: R → R acting on the spectrum and approximating the function x ↦ x^{-1}.
Examples: Tikhonov ζ_λ(t) = (t + λ)^{-1}, spectral cut-off ζ_λ(t) = t^{-1} 1{t ≥ λ}, Landweber iteration polynomials, ν-methods...

SPECTRAL REGULARIZATION IN KERNEL SPACE
Linear spectral regularization in kernel space is written
    ĥ_ζ = ζ(S_n) T_n* Y.
Notice that ζ(S_n) T_n* = ζ(T_n* T_n) T_n* = T_n* ζ(T_n T_n*) = T_n* ζ(K_n), where K_n = T_n T_n*: R^n → R^n is the normalized kernel Gram matrix, K_n(i, j) = (1/n) K(X_i, X_j).
Equivalently:
    ĥ_ζ = Σ_{i=1}^n α_{ζ,i} K(X_i, ·)   with   α_ζ = (1/n) ζ(K_n) Y = (1/n) ζ( (1/n) [K(X_i, X_j)]_{i,j} ) Y.
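A short numpy sketch of this finite-dimensional representation (my illustration, not code from the talk): it computes α_ζ through the eigendecomposition of the normalized Gram matrix, here with the Tikhonov filter ζ_λ(t) = (t + λ)^{-1}, i.e. kernel ridge regression; the kernel, data and λ are arbitrary choices for the example.

import numpy as np

def kernel_spectral_estimate(X, Y, K, zeta):
    # alpha = (1/n) zeta((1/n) [K(X_i, X_j)]) Y, then h_hat = sum_i alpha_i K(X_i, .)
    n = len(X)
    G = K(X[:, None], X[None, :])                   # unnormalized Gram matrix
    evals, evecs = np.linalg.eigh(G / n)            # spectral decomposition of the normalized Gram matrix
    alpha = evecs @ (zeta(evals) * (evecs.T @ Y)) / n
    return lambda x: np.sum(alpha * K(X, x))        # h_hat(x)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 100))
Y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(100)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)
lam = 1e-2
h_hat = kernel_spectral_estimate(X, Y, K, zeta=lambda t: 1.0 / (t + lam))   # Tikhonov filter
print(h_hat(0.3), np.sin(np.pi * 0.3))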

STRUCTURAL ASSUMPTIONS
Two parameters determine the attainable convergence rates:
- (Hölder) source condition for the signal: for r > 0, define
    SC(r, R):   h* = S^r h_0 with ‖h_0‖ ≤ R
  (can be generalized to extended source conditions, see e.g. Mathé and Pereverzev 2003);
- ill-posedness: if (λ_i)_{i≥1} is the sequence of positive eigenvalues of S in nonincreasing order, then define
    IP_+(s, β):   λ_i ≤ β i^{-1/s}     and     IP_-(s, β'):   λ_i ≥ β' i^{-1/s}.

ERROR/RISK MEASURE
We are measuring the error (risk) of an estimator ĥ in the family of norms
    ‖S^θ(ĥ − h*)‖_{H_K},   θ ∈ [0, 1/2].
Note θ = 0: inverse problem; θ = 1/2: direct problem, since
    ‖S^{1/2}(ĥ − h*)‖_{H_K} = ‖ĥ − h*‖_{L²(P_X)}.

PREVIOUS RESULTS
[1] Smale and Zhou (2007); [2] Bauer, Pereverzev, Rosasco (2007); [3] Caponnetto, De Vito (2007); [4] Caponnetto (2006).
- [1] (Tikhonov): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+2)}; ‖ĥ − h*‖_{H_K} rate (1/n)^{r/(r+1)}; assumes r ≤ 1/2.
- [2] (general spectral methods): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+2)}; ‖ĥ − h*‖_{H_K} rate (1/n)^{r/(r+1)}; assumes r ≤ q − 1/2 (q: qualification).
- [3] (Tikhonov): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+1+s)}; assumes r ≤ 1/2.
- [4] (general spectral methods): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+1+s)}; assumes 0 ≤ r ≤ q − 1/2, plus unlabeled data if 2r + s < 1.
Matching lower bound: only for ‖ĥ − h*‖_{L²(P_X)} [2].
Compare to results known for regularization methods under the Gaussian white noise model: Mair and Ruymgaart (1996), Nussbaum and Pereverzev (1999), Bissantz, Hohage, Munk and Ruymgaart (2007).

ASSUMPTIONS ON REGULARIZATION FUNCTION
From now on we assume κ = 1 for simplicity. Standard assumptions on the regularization family ζ_λ: [0, 1] → R are:
(i) there exists a constant D < ∞ such that sup_{0<λ≤1} sup_{0<t≤1} |t ζ_λ(t)| ≤ D;
(ii) there exists a constant M < ∞ such that sup_{0<λ≤1} sup_{0<t≤1} λ |ζ_λ(t)| ≤ M;
(iii) qualification: for all λ ≤ 1,
    sup_{0<t≤1} |1 − t ζ_λ(t)| t^ν ≤ γ_ν λ^ν
holds for ν = 0 and ν = q > 0.

UPPER BOUND ON RATES
Theorem (Mücke, Blanchard). Assume r, R, s, β are fixed positive constants and let P(r, R, s, β) denote the set of distributions on X × Y satisfying IP_+(s, β), SC(r, R) and (BernsteinNoise). Define
    ĥ^{(n)}_{λ_n} = ζ_{λ_n}(S_n) Z^{(n)}
using a regularization family (ζ_λ) satisfying the standard assumptions with qualification q ≥ r + θ, and the parameter choice rule
    λ_n = ( σ² / (R² n) )^{1/(2r+1+s)}.
Then it holds for any θ ∈ [0, 1/2], η ∈ (0, 1):
    sup_{P ∈ P(r,R,s,β)} P^n( ‖S^θ(h* − ĥ^{(n)}_{λ_n})‖_{H_K} > C (log η^{-1/2}) R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)} ) ≤ η.
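For orientation, a hypothetical helper (not from the talk) that just evaluates the parameter choice rule and the resulting rate exponent for given values of r, s, θ:

# evaluates lambda_n = (sigma^2 / (R^2 n))^{1/(2r+1+s)} and the rate exponent
# (r + theta)/(2r + 1 + s) from the theorem, for illustrative parameter values
def rate(r, s, theta, n, R=1.0, sigma=1.0):
    lam_n = (sigma ** 2 / (R ** 2 * n)) ** (1.0 / (2 * r + 1 + s))
    exponent = (r + theta) / (2 * r + 1 + s)
    bound_order = R * (sigma ** 2 / (R ** 2 * n)) ** exponent
    return lam_n, exponent, bound_order

# e.g. r = 1/2 (source condition), s = 1/2 (eigenvalue decay), theta = 0 (inverse problem norm):
print(rate(r=0.5, s=0.5, theta=0.0, n=10_000))   # rate exponent 0.2, i.e. O(n^{-1/5})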

COMMENTS
- It follows that the convergence rate obtained is of order
    C · R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)}.
- The constant C depends on the various parameters entering the assumptions, but not on n, R, σ!
- The result applies to all linear spectral regularization methods, but assumes a precise tuning of the regularization constant λ as a function of the assumed regularity parameters of the target: not adaptive.

WEAK LOWER BOUND ON RATES
Theorem (Mücke, Blanchard). Assume r, R, s, β are fixed positive constants and let P_-(r, R, s, β) denote the set of distributions on X × Y satisfying IP_-(s, β), SC(r, R) and (BernsteinNoise). (We assume this set to be non-empty!) Then
    lim sup_{n→∞} inf_{ĥ} sup_{P ∈ P_-(r,R,s,β)} P^n( ‖S^θ(h* − ĥ)‖_{H_K} > C R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)} ) > 0.
Proof: Fano's lemma technique.

STRONG LOWER BOUND ON RATES
Assume additionally "no big jumps in eigenvalues":
    inf_{k≥1} λ_{2k} / λ_k > 0.
Theorem (Mücke, Blanchard). Assume r, R, s, β are fixed positive constants and let P_-(r, R, s, β) denote the set of distributions on X × Y satisfying IP_-(s, β), SC(r, R) and (BernsteinNoise). (We assume this set to be non-empty!) Then
    lim inf_{n→∞} inf_{ĥ} sup_{P ∈ P_-(r,R,s,β)} P^n( ‖S^θ(h* − ĥ)‖_{H_K} > C R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)} ) > 0.
Proof: Fano's lemma technique.

COMMENTS
- The obtained rates are minimax (but not adaptive) in the parameters R, n, σ...
- ... provided the class determined by IP_-(s, β) ∩ IP_+(s, α) is not empty.

STATISTICAL ERROR CONTROL
Error controls were introduced and used by Caponnetto and De Vito (2007), Caponnetto (2007), using Bernstein's inequality for Hilbert space-valued variables (see Pinelis and Sakhanenko; Yurinskii).
Theorem (Caponnetto, De Vito). Define N(λ) = Tr( (S + λ)^{-1} S ); then under assumption (BernsteinNoise) we have the following:
    P( ‖(S + λ)^{-1/2}(T_n* Y − S_n h*)‖ ≤ 2 ( M/(λ n) + √(N(λ)/n) ) log(6/δ) ) ≥ 1 − δ.
Also, the following holds:
    P( ‖(S + λ)^{-1/2}(S_n − S)‖_{HS} ≤ 2 ( 1/(λ n) + √(N(λ)/n) ) log(6/δ) ) ≥ 1 − δ.

3. Rates for conjugate gradient regularization

PARTIAL LEAST SQUARES REGULARIZATION
Consider first the classical linear regression setting
    Y = X ω + ε,
where Y := (Y_1, ..., Y_n); X := (X_1, ..., X_n)^t; ε = (ε_1, ..., ε_n).
Algorithmic description of Partial Least Squares:
- find the direction v_1 s.t.
    v_1 = Arg Max_{v ∈ R^d} Ĉov(⟨X, v⟩, Y)/‖v‖ = Arg Max_{v ∈ R^d} (Y^t X v)/‖v‖   (∝ X^t Y);
- project Y orthogonally on X v_1, yielding Ŷ_1;
- iterate the procedure on the residual Y − Ŷ_1.
The fit at step m is Σ_{i=1}^m Ŷ_i. Regularization is obtained by early stopping.

PLS AND CONJUGATE GRADIENT
An equivalent definition of PLS:
    ω_m = Arg Min_{ω ∈ K_m(X^t X, X^t Y)} ‖Y − X ω‖²,
where
    K_m(A, z) = span{ z, A z, ..., A^{m−1} z }
is a Krylov space of order m.

This definition is equivalent to m steps of the conjugate gradient algorithm applied to iteratively solve the linear equation
    A ω = X^t X ω = X^t Y = z.

For any fixed m, the fit Ŷ_m = X ω_m is a nonlinear function of Y.

PROPERTIES OF CONJUGATE GRADIENT
- By definition, ω_m has the form ω_m = p_m(A) z = X^t p_m(X X^t) Y, where p_m is a polynomial of degree ≤ m − 1.
- Of particular interest are the residual polynomials r_m(t) = 1 − t p_m(t); then Y − Ŷ_m = r_m(X X^t) Y.
- The polynomials r_m form a family of orthogonal polynomials for the inner product ⟨p, q⟩ = ⟨p(X X^t) Y, X X^t q(X X^t) Y⟩ and with the normalization r_m(0) = 1.
- The polynomials r_m follow an order-2 recurrence relation of the type (⇒ simple implementation)
    r_{m+1}(t) = a_m t r_m(t) + b_m r_m(t) + c_m r_{m−1}(t).

ALGORITHM FOR CG/PLS
Initialize: ω_0 = 0; r_0 = X^t Y; g_0 = r_0
for m = 0, ..., (m_max − 1) do
    α_m = ‖X r_m‖² / ‖X^t X g_m‖²
    ω_{m+1} = ω_m + α_m g_m          (update)
    r_{m+1} = r_m − α_m X^t X g_m    (residuals)
    β_m = ‖X r_{m+1}‖² / ‖X r_m‖²
    g_{m+1} = r_{m+1} + β_m g_m      (next direction)
end for
Return: approximate solution ω_{m_max}
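A literal numpy transcription of this pseudocode (a sketch of mine; the toy data, the variable names and stopping at a fixed m_max are my choices, and no claim is made beyond reproducing the recursion as stated on the slide):

import numpy as np

def cg_pls(X, Y, m_max):
    d = X.shape[1]
    w = np.zeros(d)                  # omega_0 = 0
    r = X.T @ Y                      # r_0 = X^t Y
    g = r.copy()                     # g_0 = r_0
    for _ in range(m_max):
        Ag = X.T @ (X @ g)           # X^t X g_m
        alpha = np.sum((X @ r) ** 2) / np.sum(Ag ** 2)     # ||X r_m||^2 / ||X^t X g_m||^2
        w = w + alpha * g            # update
        r_new = r - alpha * Ag       # residuals
        beta = np.sum((X @ r_new) ** 2) / np.sum((X @ r) ** 2)
        g = r_new + beta * g         # next direction
        r = r_new
    return w

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
Y = X @ w_true + 0.1 * rng.standard_normal(200)
print(np.linalg.norm(cg_pls(X, Y, m_max=5) - w_true))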

KERNEL-CG REGULARIZATION ("KERNEL PARTIAL LEAST SQUARES")
Define the m-th iterate of CG as
    ĥ_{CG(m)} = Arg Min_{h ∈ H, h ∈ K_m(S_n, T_n* Y)} ‖T_n*(Y − T_n h)‖,
where K_m denotes the Krylov space
    K_m(A, z) = span{ z, A z, ..., A^{m−1} z }.
Equivalently:
    α_{CG(m)} = Arg Min_{α ∈ K_m(K_n, Y)} ‖K_n^{1/2}(Y − K_n α)‖²   and   ĥ_{CG(m)} = Σ_{i=1}^n α_{CG(m),i} K(X_i, ·).

RATES FOR CG
Consider the following stopping rule, for some fixed τ:
    m̂ := min{ m ≥ 0 : ‖T_n*(T_n ĥ_{CG(m)} − Y)‖ ≤ τ (1/n)^{(r+1)/(2r+1+s)} log²(6/δ) }.   (1)
Theorem (Blanchard, Krämer). Assume (BernsteinNoise), SC(r, R), IP(s, β) hold; let θ ∈ [0, 1/2). Then for τ large enough, with probability larger than 1 − δ:
    ‖S^θ(ĥ_{CG(m̂)} − h*)‖_{H_K} ≤ c(r, R, s, β, τ) ( (1/n) log²(6/δ) )^{(r+θ)/(2r+1+s)}.
Technical tools: again, concentration of the error in an appropriate norm, and a suitable reworking of the arguments of Nemirovskii (1980) for deterministic CG.
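The following numpy sketch (illustration only) computes the CG(m) iterates directly from the Krylov-subspace least-squares characterization of the previous slide and evaluates the stopping statistic ‖T_n*(T_n ĥ_{CG(m)} − Y)‖ along the path; the kernel, the data and the numerical threshold standing in for the right-hand side of (1) are assumptions of mine.

import numpy as np

def kernel_cg_path(K_mat, Y, m_max):
    # CG(m) iterates via the Krylov least-squares characterization (illustration only;
    # an actual implementation would rather use the order-2 recurrence of the previous slides)
    n = len(Y)
    out = []
    krylov = [Y]
    for m in range(1, m_max + 1):
        V, _ = np.linalg.qr(np.column_stack(krylov))     # orthonormal basis of the Krylov space
        B = K_mat @ V
        c = np.linalg.solve(B.T @ K_mat @ B, B.T @ K_mat @ Y)
        alpha = V @ c                                    # minimizer over the Krylov space
        resid = K_mat @ alpha - Y
        stat = np.sqrt(resid @ K_mat @ resid) / n        # = ||T_n*(T_n h_CG(m) - Y)||_{H_K}
        out.append((alpha, stat))
        krylov.append(K_mat @ krylov[-1])
    return out

# toy usage with an assumed Gaussian kernel; the threshold is hand-picked and merely stands in
# for tau (1/n)^{(r+1)/(2r+1+s)} log^2(6/delta), which would require r, s, delta, tau
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 80))
Y = np.sign(X) * X ** 2 + 0.1 * rng.standard_normal(80)
K_mat = np.exp(-np.subtract.outer(X, X) ** 2 / 0.1)
stats = [s for _, s in kernel_cg_path(K_mat, Y, m_max=8)]
threshold = 1e-3
m_hat = next((m + 1 for m, s in enumerate(stats) if s <= threshold), None)
print([round(s, 5) for s in stats], m_hat)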

OUTER RATES
It is natural (for the prediction problem) to assume an extension of the source condition for h*, now only assuming h* ∈ L²(P_X):
    SC_outer(r, R):   ‖B^{-(r+1/2)} h*‖_{L²} ≤ R   (for B := T T*),
to include the possible range r ∈ (−1/2, 0].
For such outer source conditions, even for kernel ridge regression and for the direct (= prediction) problem, there are no known results without additional assumptions to reach the optimal rate O(n^{-(r+1/2)/(2r+1+s)}):
- Mendelson and Neeman (2009) make assumptions on the sup norm of the eigenfunctions of the integral operator;
- Caponnetto (2006) assumes additional unlabeled examples X_{n+1}, ..., X_ñ are available, with ñ ≥ n · O( n^{(1−2r−s)_+/(2r+1+s)} ).

CONSTRUCTION WITH UNLABELED DATA
- Assume ñ i.i.d. X-examples are available, out of which n are labeled.
- Extend the n-vector Y to an ñ-vector Ỹ = (ñ/n)(Y_1, ..., Y_n, 0, ..., 0).
- Perform the same algorithm as before on the full design (X_1, ..., X_ñ) and Ỹ.
- Notice in particular that T̃_ñ* Ỹ = T_n* Y.
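A small numpy check of this construction (my illustration; the Gaussian kernel and toy data are assumed): extending Y by zeros and rescaling by ñ/n indeed leaves the function T_n* Y unchanged.

import numpy as np

rng = np.random.default_rng(0)
n, n_tilde = 30, 100                      # n labeled points among n_tilde design points
X_all = rng.uniform(-1, 1, n_tilde)       # X_1, ..., X_{n_tilde} (the first n are labeled)
Y = np.sin(2 * X_all[:n]) + 0.1 * rng.standard_normal(n)
K = lambda x, y: np.exp(-(x - y) ** 2)

# extended response vector Y_tilde = (n_tilde / n) * (Y_1, ..., Y_n, 0, ..., 0)
Y_tilde = np.concatenate([Y, np.zeros(n_tilde - n)]) * (n_tilde / n)

# check that T~*_{n_tilde} Y_tilde = T*_n Y as functions, by evaluating both on a grid
grid = np.linspace(-1, 1, 7)
lhs = np.array([np.mean(Y_tilde * K(X_all, x)) for x in grid])    # (1/n_tilde) sum_i Y~_i K(X_i, x)
rhs = np.array([np.mean(Y * K(X_all[:n], x)) for x in grid])      # (1/n) sum_{i<=n} Y_i K(X_i, x)
print(np.allclose(lhs, rhs))              # True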

CONSTRUCTION WITH UNLABELED DATA
Recall:
    ĥ_{CG1(m)} = Arg Min_{h ∈ H, h ∈ K_m(S̃_ñ, T̃_ñ* Ỹ)} ‖T̃_ñ*(Ỹ − T̃_ñ h)‖,
equivalently:
    α = Arg Min_{α ∈ K_m(K̃_ñ, Ỹ)} ‖K̃_ñ^{1/2}(Ỹ − K̃_ñ α)‖².

OUTER RATES FOR CG REGULARIZATION
Consider the following stopping rule, for some fixed τ > 3/2:
    m̂ := min{ m ≥ 0 : ‖T_n*(T_n ĥ_{CG(m)} − Y)‖ ≤ τ m (4β/n)^{(r+1)/(2r+1+s)} log²(6/δ) }.   (2)
Furthermore assume
    (BoundedY):   |Y| ≤ M a.s.
Theorem. Assume (BoundedY), SC_outer(r, R), IP_+(s, β), and r ∈ (−min(s, 1/2), 0). Assume unlabeled data is available with
    ñ ≥ n ( (16β²/n) log²(6/δ) )^{-2r/(2r+1+s)}.
Then for θ ∈ [0, r + 1/2), with probability larger than 1 − δ:
    ‖B^θ(T ĥ_{m̂} − h*)‖_{L²} ≤ c(r, τ)(M + R) ( (16β²/n) log²(6/δ) )^{(r+1/2−θ)/(2r+1+s)}.

OVERVIEW
- Inverse problem setting under a random i.i.d. design scheme ("learning setting");
- source condition: Hölder of order r; ill-posedness: polynomial decay of the eigenvalues, of order s;
- rates of the form (for θ ∈ [0, 1/2]):
    ‖S^θ(h* − ĥ)‖_{H_K} = O( n^{-(r+θ)/(2r+1+s)} );
- rates established for general linear spectral methods, as well as for CG;
- matching lower bound;
- matches the classical rates in the white noise model (= sequence model) with σ² = 1/n;
- extension to outer rates (r ∈ (−1/2, 0)) if additional unlabeled data is available.

CONCLUSION/PERSPECTIVES
We filled gaps in the existing picture for inverse learning methods...
Adaptivity? Ideally, attain optimal rates without a priori knowledge of r nor of s!
- Lepski's method / balancing principle: in progress. Need a good estimator for N(λ)! (Prior work on this: Caponnetto; need some sharper bound.)
- Hold-out principle: only valid for the direct problem? But the optimal parameter does not depend on the risk norm: hope for validity in the inverse case.