Convergence rates of spectral methods for statistical inverse learning problems


Convergence rates of spectral methods for statistical inverse learning problems. G. Blanchard (Universität Potsdam), UCL/Gatsby unit, 04/11/2015. Joint work with N. Mücke (U. Potsdam) and N. Krämer (U. München).

1. The inverse learning setting
2. Rates for linear spectral regularization methods
3. Rates for conjugate gradient regularization


DETERMINISTIC AND STATISTICAL INVERSE PROBLEMS
Let A be a bounded operator between Hilbert spaces H_1 → H_2 (assumed known).
Classical (deterministic) inverse problem: observe
    y_σ = A f + σ η,   (IP)
under the assumption ‖η‖ ≤ 1.
Note: the H_2-norm measures the observation error; the H_1-norm measures the reconstruction error.
Classical deterministic theory: see Engl, Hanke and Neubauer (2000).

DETERMINISTIC AND STATISTICAL INVERSE PROBLEMS
Inverse problem: y_σ = A f + σ η.   (IP)
What if the noise is random? Classical statistical inverse problem model: η is a Gaussian white noise process on H_2.
Note: in this case (IP) is not an equation between elements of H_2, but is to be interpreted as a process on H_2.
Under a Hölder source condition of order r and polynomial ill-posedness (eigenvalue decay) of order 1/s, sharp minimax rates are known in this setting:
    ‖(A*A)^θ (f̂ − f)‖_{H_1} = O( σ^{2(r+θ)/(2r+1+s)} ) = O( σ^{2(ν+bθ)/(2ν+b+1)} ),   for θ ∈ [0, 1/2]
(θ = 0: inverse problem; θ = 1/2: direct problem).
(Alternate parametrization: b := 1/s, ν := rb the intrinsic regularity.)

LINEAR SPECTRAL REGULARIZATION METHODS
Inverse problem (deterministic or statistical) where A is known. First consider the so-called "normal equation":
    A* y_σ = (A*A) f + σ (A* η).
Linear spectral methods: let ζ_λ(x): R_+ → R_+ be a real function of one real variable which is an approximation of 1/x, and λ > 0 a tuning parameter. Define
    f̂_λ = ζ_λ(A*A) A* y_σ.
Examples: Tikhonov ζ_λ(x) = (x + λ)^{-1}, spectral cut-off ζ_λ(x) = x^{-1} 1{x ≥ λ}, Landweber iteration polynomials, ν-methods...
Under general conditions on ζ_λ, optimal/minimax rates can be attained by such methods (deterministic: Engl et al., 2000; stochastic noise: Bissantz et al., 2007).
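To make the filter idea concrete, here is a minimal numpy sketch (my own illustration, not part of the talk) applying the Tikhonov and spectral cut-off filters to a toy discretized operator A through its SVD; the operator, data and regularization parameter are all made up for the example.

import numpy as np

def spectral_filter_estimate(A, y, lam, method="tikhonov"):
    # f_hat = zeta_lam(A* A) A* y, computed through the SVD of A
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    t = s ** 2                                  # eigenvalues of A* A
    if method == "tikhonov":
        zeta = 1.0 / (t + lam)                  # zeta_lam(t) = (t + lam)^{-1}
    elif method == "cutoff":
        zeta = np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)   # t^{-1} 1{t >= lam}
    else:
        raise ValueError(method)
    # zeta(A* A) A* y = V diag(zeta(t) * s) U^T y
    return Vt.T @ (zeta * s * (U.T @ y))

# made-up mildly ill-posed toy problem
rng = np.random.default_rng(0)
n = d = 50
A = np.exp(-np.abs(np.subtract.outer(np.arange(n), np.arange(d))) / 5.0)
f_true = np.sin(np.linspace(0, 3 * np.pi, d))
y = A @ f_true + 0.01 * rng.standard_normal(n)
print(np.linalg.norm(spectral_filter_estimate(A, y, lam=1e-3) - f_true))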

STATISTICAL LEARNING
"Learning" usually refers to the following setting:
    (X_i, Y_i)_{i=1,...,n} i.i.d. ∼ P_{XY} on X × Y, where Y ⊆ R.
Goal: estimate some functional related to the dependency between X and Y, for instance (nonparametric) least squares regression: estimate f*(x) := E[Y | X = x], and measure the quality of an estimator f̂ via
    ‖f̂ − f*‖²_{L²(P_X)} = E_{X∼P_X}[ (f̂(X) − f*(X))² ].

SETTING: INVERSE LEARNING PROBLEM
We refer to "inverse learning" for an inverse problem where we have noisy observations at random design points:
    (X_i, Y_i)_{i=1,...,n} i.i.d.:   Y_i = (A f)(X_i) + ε_i.   (ILP)
The goal is to recover f ∈ H_1.
Early works on closely related subjects: from the splines literature in the 80's (e.g. O'Sullivan '90).

MAIN ASSUMPTION FOR INVERSE LEARNING
Model: Y_i = (A f)(X_i) + ε_i, i = 1, ..., n, where A: H_1 → H_2.   (ILP)
Observe:
- H_2 should be a space of real-valued functions on X;
- the geometrical structure of the measurement errors will be dictated by the statistical properties of the sampling scheme; we do not need to assume or consider any a priori Hilbert structure on H_2;
- the crucial structural assumption we make is the following:
Assumption. The family of evaluation functionals (S_x), x ∈ X, defined by
    S_x: H_1 → R,   f ↦ S_x(f) := (A f)(x),
is uniformly bounded, i.e., there exists κ < ∞ such that for any x ∈ X,
    |S_x(f)| ≤ κ ‖f‖_{H_1}.

GEOMETRY OF INVERSE LEARNING
The inverse learning setting was essentially introduced by Caponnetto et al. (2006).
- Riesz's theorem implies the existence, for any x ∈ X, of F_x ∈ H_1 such that ∀ f ∈ H_1: (A f)(x) = ⟨f, F_x⟩.
- K(x, y) := ⟨F_x, F_y⟩ defines a positive semidefinite kernel on X with associated reproducing kernel Hilbert space (RKHS) denoted H_K.
- As a pure function space, H_K coincides with Im(A).
- Assuming A injective, A is in fact an isometric isomorphism between H_1 and H_K.

GEOMETRY OF INVERSE LEARNING
The main assumption implies that, as a function space, Im(A) is endowed with a natural RKHS structure with a kernel K bounded by κ. Furthermore this RKHS H_K is isometric to H_1 (through A^{-1}).
Therefore, the inverse learning problem is formally equivalent to the kernel learning problem
    Y_i = h*(X_i) + ε_i,   i = 1, ..., n,
where h* ∈ H_K, and we measure the quality of an estimator ĥ ∈ H_K via the RKHS norm ‖ĥ − h*‖_{H_K}.
Indeed, if we put f̂ := A^{-1} ĥ, then
    ‖f̂ − f‖_{H_1} = ‖A(f̂ − f)‖_{H_K} = ‖ĥ − h*‖_{H_K}.

SETTING, REFORMULATED
We are actually back to the familiar regression setting with random design,
    Y_i = h*(X_i) + ε_i,
where (X_i, Y_i)_{1≤i≤n} is an i.i.d. sample from P_{XY} on the space X × R, with E[ε_i | X_i] = 0.
Noise assumption:
    (BernsteinNoise)   E[ |ε_i|^p | X_i ] ≤ ½ p! M^p,   for all p ≥ 2.
h* is assumed to lie in a (known) RKHS H_K with bounded kernel K.
The criterion for measuring the quality of an estimator ĥ is the RKHS norm ‖ĥ − h*‖_{H_K}.

2. Rates for linear spectral regularization methods

EMPIRICAL AND POPULATION OPERATORS
Define:
- the (random) empirical evaluation operator T_n: h ∈ H ↦ (h(X_1), ..., h(X_n)) ∈ R^n, and its population counterpart the inclusion operator T: h ∈ H ↦ h ∈ L²(X, P_X);
- the (random) empirical kernel integral operator T_n*: (v_1, ..., v_n) ∈ R^n ↦ (1/n) Σ_{i=1}^n K(X_i, ·) v_i ∈ H, and its population counterpart, the kernel integral operator T*: f ∈ L²(X, P_X) ↦ T*(f) = ∫ f(x) K(x, ·) dP_X(x) ∈ H;
- finally, the empirical covariance operator S_n = T_n* T_n and its population counterpart S = T* T.
Observe that S_n, S are both operators H_K → H_K; the intuition is that S_n is a (random) approximation of S.
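As a sanity check on these definitions, the following small numpy sketch (illustration only; the Gaussian kernel and toy design are my assumptions) implements T_n, T_n* and S_n = T_n* T_n for a kernel on the real line, and verifies that evaluating S_n h on the sample reduces to a Gram-matrix product.

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=20)                 # design points X_1, ..., X_n
n = len(X)
K = lambda x, y: np.exp(-(x - y) ** 2)          # a bounded kernel (assumed for illustration)

# empirical evaluation operator T_n: h -> (h(X_1), ..., h(X_n))
def T_n(h):
    return np.array([h(x) for x in X])

# empirical kernel integral operator T_n*: v -> (1/n) sum_i v_i K(X_i, .)
def T_n_star(v):
    return lambda x: np.mean(v * K(X, x))

# empirical covariance operator S_n = T_n* T_n acting on a function h
def S_n(h):
    return T_n_star(T_n(h))

# sanity check: ((S_n h)(X_1), ..., (S_n h)(X_n)) = (1/n) [K(X_i, X_j)] (h(X_1), ..., h(X_n))
h = lambda x: np.sin(3 * x)
K_mat = K(X[:, None], X[None, :])               # unnormalized Gram matrix K(X_i, X_j)
lhs = T_n(S_n(h))
rhs = K_mat @ T_n(h) / n
print(np.allclose(lhs, rhs))                    # True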

Recall the model with h* ∈ H_K:
    Y_i = h*(X_i) + ε_i,   i.e.   Y = T_n h* + ε,   where Y := (Y_1, ..., Y_n).
Associated "normal equation":
    Z = T_n* Y = T_n* T_n h* + T_n* ε = S_n h* + T_n* ε.
Idea (Rosasco, Caponnetto, De Vito, Odone): use methods from the inverse problems literature.
- Observe that there is also an error on the operator.
- Use concentration principles to bound ‖T_n* ε‖ and ‖S_n − S‖.

LINEAR SPECTRAL REGULARIZATION METHODS
Linear spectral methods:
    ĥ_ζ = ζ(S_n) Z
for some well-chosen function ζ: R → R acting on the spectrum and approximating the function x ↦ x^{-1}.
Examples: Tikhonov ζ_λ(t) = (t + λ)^{-1}, spectral cut-off ζ_λ(t) = t^{-1} 1{t ≥ λ}, Landweber iteration polynomials, ν-methods...

SPECTRAL REGULARIZATION IN KERNEL SPACE
Linear spectral regularization in kernel space is written
    ĥ_ζ = ζ(S_n) T_n* Y.
Notice that ζ(S_n) T_n* = ζ(T_n* T_n) T_n* = T_n* ζ(T_n T_n*) = T_n* ζ(K_n), where K_n = T_n T_n*: R^n → R^n is the normalized kernel Gram matrix, K_n(i, j) = (1/n) K(X_i, X_j).
Equivalently:
    ĥ_ζ = Σ_{i=1}^n α_{ζ,i} K(X_i, ·)   with   α_ζ = (1/n) ζ(K_n) Y = (1/n) ζ( (1/n) [K(X_i, X_j)]_{i,j} ) Y.
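A short numpy sketch of this finite-dimensional representation (my illustration, not code from the talk): it computes α_ζ through the eigendecomposition of the normalized Gram matrix, here with the Tikhonov filter ζ_λ(t) = (t + λ)^{-1}, i.e. kernel ridge regression; the kernel, data and λ are arbitrary choices for the example.

import numpy as np

def kernel_spectral_estimate(X, Y, K, zeta):
    # alpha = (1/n) zeta((1/n) [K(X_i, X_j)]) Y, then h_hat = sum_i alpha_i K(X_i, .)
    n = len(X)
    G = K(X[:, None], X[None, :])                   # unnormalized Gram matrix
    evals, evecs = np.linalg.eigh(G / n)            # spectral decomposition of the normalized Gram matrix
    alpha = evecs @ (zeta(evals) * (evecs.T @ Y)) / n
    return lambda x: np.sum(alpha * K(X, x))        # h_hat(x)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 100))
Y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(100)
K = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)
lam = 1e-2
h_hat = kernel_spectral_estimate(X, Y, K, zeta=lambda t: 1.0 / (t + lam))   # Tikhonov filter
print(h_hat(0.3), np.sin(np.pi * 0.3))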

STRUCTURAL ASSUMPTIONS
Two parameters determine the attainable convergence rates:
- (Hölder) source condition for the signal: for r > 0, define
    SC(r, R):   h* = S^r h_0 with ‖h_0‖ ≤ R
  (can be generalized to extended source conditions, see e.g. Mathé and Pereverzev 2003);
- ill-posedness: if (λ_i)_{i≥1} is the sequence of positive eigenvalues of S in nonincreasing order, then define
    IP_+(s, β):   λ_i ≤ β i^{-1/s}     and     IP_-(s, β'):   λ_i ≥ β' i^{-1/s}.

ERROR/RISK MEASURE
We are measuring the error (risk) of an estimator ĥ in the family of norms
    ‖S^θ(ĥ − h*)‖_{H_K},   θ ∈ [0, 1/2].
Note θ = 0: inverse problem; θ = 1/2: direct problem, since
    ‖S^{1/2}(ĥ − h*)‖_{H_K} = ‖ĥ − h*‖_{L²(P_X)}.

PREVIOUS RESULTS
[1] Smale and Zhou (2007); [2] Bauer, Pereverzev, Rosasco (2007); [3] Caponnetto, De Vito (2007); [4] Caponnetto (2006).
- [1] (Tikhonov): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+2)}; ‖ĥ − h*‖_{H_K} rate (1/n)^{r/(r+1)}; assumes r ≤ 1/2.
- [2] (general spectral methods): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+2)}; ‖ĥ − h*‖_{H_K} rate (1/n)^{r/(r+1)}; assumes r ≤ q − 1/2 (q: qualification).
- [3] (Tikhonov): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+1+s)}; assumes r ≤ 1/2.
- [4] (general spectral methods): ‖ĥ − h*‖_{L²(P_X)} rate (1/n)^{(2r+1)/(2r+1+s)}; assumes 0 ≤ r ≤ q − 1/2, plus unlabeled data if 2r + s < 1.
Matching lower bound: only for ‖ĥ − h*‖_{L²(P_X)} [2].
Compare to results known for regularization methods under the Gaussian white noise model: Mair and Ruymgaart (1996), Nussbaum and Pereverzev (1999), Bissantz, Hohage, Munk and Ruymgaart (2007).

ASSUMPTIONS ON REGULARIZATION FUNCTION
From now on we assume κ = 1 for simplicity. Standard assumptions on the regularization family ζ_λ: [0, 1] → R are:
(i) there exists a constant D < ∞ such that sup_{0<λ≤1} sup_{0<t≤1} |t ζ_λ(t)| ≤ D;
(ii) there exists a constant M < ∞ such that sup_{0<λ≤1} sup_{0<t≤1} λ |ζ_λ(t)| ≤ M;
(iii) qualification: for all λ ≤ 1,
    sup_{0<t≤1} |1 − t ζ_λ(t)| t^ν ≤ γ_ν λ^ν
holds for ν = 0 and ν = q > 0.

UPPER BOUND ON RATES
Theorem (Mücke, Blanchard). Assume r, R, s, β are fixed positive constants and let P(r, R, s, β) denote the set of distributions on X × Y satisfying IP_+(s, β), SC(r, R) and (BernsteinNoise). Define
    ĥ^{(n)}_{λ_n} = ζ_{λ_n}(S_n) Z^{(n)}
using a regularization family (ζ_λ) satisfying the standard assumptions with qualification q ≥ r + θ, and the parameter choice rule
    λ_n = ( σ² / (R² n) )^{1/(2r+1+s)}.
Then it holds for any θ ∈ [0, 1/2], η ∈ (0, 1):
    sup_{P ∈ P(r,R,s,β)} P^n( ‖S^θ(h* − ĥ^{(n)}_{λ_n})‖_{H_K} > C (log η^{-1/2}) R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)} ) ≤ η.
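For orientation, a hypothetical helper (not from the talk) that just evaluates the parameter choice rule and the resulting rate exponent for given values of r, s, θ:

# evaluates lambda_n = (sigma^2 / (R^2 n))^{1/(2r+1+s)} and the rate exponent
# (r + theta)/(2r + 1 + s) from the theorem, for illustrative parameter values
def rate(r, s, theta, n, R=1.0, sigma=1.0):
    lam_n = (sigma ** 2 / (R ** 2 * n)) ** (1.0 / (2 * r + 1 + s))
    exponent = (r + theta) / (2 * r + 1 + s)
    bound_order = R * (sigma ** 2 / (R ** 2 * n)) ** exponent
    return lam_n, exponent, bound_order

# e.g. r = 1/2 (source condition), s = 1/2 (eigenvalue decay), theta = 0 (inverse problem norm):
print(rate(r=0.5, s=0.5, theta=0.0, n=10_000))   # rate exponent 0.2, i.e. O(n^{-1/5})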

COMMENTS
- It follows that the convergence rate obtained is of order
    C · R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)}.
- The constant C depends on the various parameters entering the assumptions, but not on n, R, σ!
- The result applies to all linear spectral regularization methods, but assumes a precise tuning of the regularization constant λ as a function of the assumed regularity parameters of the target: not adaptive.

WEAK LOWER BOUND ON RATES
Theorem (Mücke, Blanchard). Assume r, R, s, β are fixed positive constants and let P_-(r, R, s, β) denote the set of distributions on X × Y satisfying IP_-(s, β), SC(r, R) and (BernsteinNoise). (We assume this set to be non-empty!) Then
    lim sup_{n→∞} inf_{ĥ} sup_{P ∈ P_-(r,R,s,β)} P^n( ‖S^θ(h* − ĥ)‖_{H_K} > C R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)} ) > 0.
Proof: Fano's lemma technique.

STRONG LOWER BOUND ON RATES
Assume additionally "no big jumps in eigenvalues":
    inf_{k≥1} λ_{2k} / λ_k > 0.
Theorem (Mücke, Blanchard). Assume r, R, s, β are fixed positive constants and let P_-(r, R, s, β) denote the set of distributions on X × Y satisfying IP_-(s, β), SC(r, R) and (BernsteinNoise). (We assume this set to be non-empty!) Then
    lim inf_{n→∞} inf_{ĥ} sup_{P ∈ P_-(r,R,s,β)} P^n( ‖S^θ(h* − ĥ)‖_{H_K} > C R ( σ² / (R² n) )^{(r+θ)/(2r+1+s)} ) > 0.
Proof: Fano's lemma technique.

COMMENTS
- The obtained rates are minimax (but not adaptive) in the parameters R, n, σ...
- ... provided the class determined by IP_-(s, β) ∩ IP_+(s, α) is not empty.

STATISTICAL ERROR CONTROL
Error controls were introduced and used by Caponnetto and De Vito (2007), Caponnetto (2007), using Bernstein's inequality for Hilbert space-valued variables (see Pinelis and Sakhanenko; Yurinskii).
Theorem (Caponnetto, De Vito). Define N(λ) = Tr( (S + λ)^{-1} S ); then under assumption (BernsteinNoise) we have the following:
    P( ‖(S + λ)^{-1/2}(T_n* Y − S_n h*)‖ ≤ 2 ( M/(λ n) + √(N(λ)/n) ) log(6/δ) ) ≥ 1 − δ.
Also, the following holds:
    P( ‖(S + λ)^{-1/2}(S_n − S)‖_{HS} ≤ 2 ( 1/(λ n) + √(N(λ)/n) ) log(6/δ) ) ≥ 1 − δ.

3. Rates for conjugate gradient regularization

PARTIAL LEAST SQUARES REGULARIZATION
Consider first the classical linear regression setting
    Y = X ω + ε,
where Y := (Y_1, ..., Y_n); X := (X_1, ..., X_n)^t; ε = (ε_1, ..., ε_n).
Algorithmic description of Partial Least Squares:
- find the direction v_1 s.t.
    v_1 = Arg Max_{v ∈ R^d} Ĉov(⟨X, v⟩, Y)/‖v‖ = Arg Max_{v ∈ R^d} (Y^t X v)/‖v‖   (∝ X^t Y);
- project Y orthogonally on X v_1, yielding Ŷ_1;
- iterate the procedure on the residual Y − Ŷ_1.
The fit at step m is Σ_{i=1}^m Ŷ_i. Regularization is obtained by early stopping.

PLS AND CONJUGATE GRADIENT
An equivalent definition of PLS:
    ω_m = Arg Min_{ω ∈ K_m(X^t X, X^t Y)} ‖Y − X ω‖²,
where
    K_m(A, z) = span{ z, A z, ..., A^{m−1} z }
is a Krylov space of order m.

This definition is equivalent to m steps of the conjugate gradient algorithm applied to iteratively solve the linear equation
    A ω = X^t X ω = X^t Y = z.

For any fixed m, the fit Ŷ_m = X ω_m is a nonlinear function of Y.

PROPERTIES OF CONJUGATE GRADIENT
- By definition, ω_m has the form ω_m = p_m(A) z = X^t p_m(X X^t) Y, where p_m is a polynomial of degree ≤ m − 1.
- Of particular interest are the residual polynomials r_m(t) = 1 − t p_m(t); then Y − Ŷ_m = r_m(X X^t) Y.
- The polynomials r_m form a family of orthogonal polynomials for the inner product ⟨p, q⟩ = ⟨p(X X^t) Y, X X^t q(X X^t) Y⟩ and with the normalization r_m(0) = 1.
- The polynomials r_m follow an order-2 recurrence relation of the type (⇒ simple implementation)
    r_{m+1}(t) = a_m t r_m(t) + b_m r_m(t) + c_m r_{m−1}(t).

ALGORITHM FOR CG/PLS
Initialize: ω_0 = 0; r_0 = X^t Y; g_0 = r_0
for m = 0, ..., (m_max − 1) do
    α_m = ‖X r_m‖² / ‖X^t X g_m‖²
    ω_{m+1} = ω_m + α_m g_m          (update)
    r_{m+1} = r_m − α_m X^t X g_m    (residuals)
    β_m = ‖X r_{m+1}‖² / ‖X r_m‖²
    g_{m+1} = r_{m+1} + β_m g_m      (next direction)
end for
Return: approximate solution ω_{m_max}
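A literal numpy transcription of this pseudocode (a sketch of mine; the toy data, the variable names and stopping at a fixed m_max are my choices, and no claim is made beyond reproducing the recursion as stated on the slide):

import numpy as np

def cg_pls(X, Y, m_max):
    d = X.shape[1]
    w = np.zeros(d)                  # omega_0 = 0
    r = X.T @ Y                      # r_0 = X^t Y
    g = r.copy()                     # g_0 = r_0
    for _ in range(m_max):
        Ag = X.T @ (X @ g)           # X^t X g_m
        alpha = np.sum((X @ r) ** 2) / np.sum(Ag ** 2)     # ||X r_m||^2 / ||X^t X g_m||^2
        w = w + alpha * g            # update
        r_new = r - alpha * Ag       # residuals
        beta = np.sum((X @ r_new) ** 2) / np.sum((X @ r) ** 2)
        g = r_new + beta * g         # next direction
        r = r_new
    return w

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
Y = X @ w_true + 0.1 * rng.standard_normal(200)
print(np.linalg.norm(cg_pls(X, Y, m_max=5) - w_true))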

KERNEL-CG REGULARIZATION ("KERNEL PARTIAL LEAST SQUARES")
Define the m-th iterate of CG as
    ĥ_{CG(m)} = Arg Min_{h ∈ H, h ∈ K_m(S_n, T_n* Y)} ‖T_n*(Y − T_n h)‖,
where K_m denotes the Krylov space
    K_m(A, z) = span{ z, A z, ..., A^{m−1} z }.
Equivalently:
    α_{CG(m)} = Arg Min_{α ∈ K_m(K_n, Y)} ‖K_n^{1/2}(Y − K_n α)‖²   and   ĥ_{CG(m)} = Σ_{i=1}^n α_{CG(m),i} K(X_i, ·).

RATES FOR CG
Consider the following stopping rule, for some fixed τ:
    m̂ := min{ m ≥ 0 : ‖T_n*(T_n ĥ_{CG(m)} − Y)‖ ≤ τ (1/n)^{(r+1)/(2r+1+s)} log²(6/δ) }.   (1)
Theorem (Blanchard, Krämer). Assume (BernsteinNoise), SC(r, R), IP(s, β) hold; let θ ∈ [0, 1/2). Then for τ large enough, with probability larger than 1 − δ:
    ‖S^θ(ĥ_{CG(m̂)} − h*)‖_{H_K} ≤ c(r, R, s, β, τ) ( (1/n) log²(6/δ) )^{(r+θ)/(2r+1+s)}.
Technical tools: again, concentration of the error in an appropriate norm, and a suitable reworking of the arguments of Nemirovskii (1980) for deterministic CG.
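The following numpy sketch (illustration only) computes the CG(m) iterates directly from the Krylov-subspace least-squares characterization of the previous slide and evaluates the stopping statistic ‖T_n*(T_n ĥ_{CG(m)} − Y)‖ along the path; the kernel, the data and the numerical threshold standing in for the right-hand side of (1) are assumptions of mine.

import numpy as np

def kernel_cg_path(K_mat, Y, m_max):
    # CG(m) iterates via the Krylov least-squares characterization (illustration only;
    # an actual implementation would rather use the order-2 recurrence of the previous slides)
    n = len(Y)
    out = []
    krylov = [Y]
    for m in range(1, m_max + 1):
        V, _ = np.linalg.qr(np.column_stack(krylov))     # orthonormal basis of the Krylov space
        B = K_mat @ V
        c = np.linalg.solve(B.T @ K_mat @ B, B.T @ K_mat @ Y)
        alpha = V @ c                                    # minimizer over the Krylov space
        resid = K_mat @ alpha - Y
        stat = np.sqrt(resid @ K_mat @ resid) / n        # = ||T_n*(T_n h_CG(m) - Y)||_{H_K}
        out.append((alpha, stat))
        krylov.append(K_mat @ krylov[-1])
    return out

# toy usage with an assumed Gaussian kernel; the threshold is hand-picked and merely stands in
# for tau (1/n)^{(r+1)/(2r+1+s)} log^2(6/delta), which would require r, s, delta, tau
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, 80))
Y = np.sign(X) * X ** 2 + 0.1 * rng.standard_normal(80)
K_mat = np.exp(-np.subtract.outer(X, X) ** 2 / 0.1)
stats = [s for _, s in kernel_cg_path(K_mat, Y, m_max=8)]
threshold = 1e-3
m_hat = next((m + 1 for m, s in enumerate(stats) if s <= threshold), None)
print([round(s, 5) for s in stats], m_hat)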

OUTER RATES
It is natural (for the prediction problem) to assume an extension of the source condition for h*, now only assuming h* ∈ L²(P_X):
    SC_outer(r, R):   ‖B^{-(r+1/2)} h*‖_{L²} ≤ R   (for B := T T*),
to include the possible range r ∈ (−1/2, 0].
For such outer source conditions, even for kernel ridge regression and for the direct (= prediction) problem, there are no known results without additional assumptions to reach the optimal rate O(n^{-(r+1/2)/(2r+1+s)}):
- Mendelson and Neeman (2009) make assumptions on the sup norm of the eigenfunctions of the integral operator;
- Caponnetto (2006) assumes additional unlabeled examples X_{n+1}, ..., X_ñ are available, with ñ ≥ n · O( n^{(1−2r−s)_+/(2r+1+s)} ).

CONSTRUCTION WITH UNLABELED DATA
- Assume ñ i.i.d. X-examples are available, out of which n are labeled.
- Extend the n-vector Y to an ñ-vector Ỹ = (ñ/n)(Y_1, ..., Y_n, 0, ..., 0).
- Perform the same algorithm as before on the full design (X_1, ..., X_ñ) and Ỹ.
- Notice in particular that T̃_ñ* Ỹ = T_n* Y.
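A small numpy check of this construction (my illustration; the Gaussian kernel and toy data are assumed): extending Y by zeros and rescaling by ñ/n indeed leaves the function T_n* Y unchanged.

import numpy as np

rng = np.random.default_rng(0)
n, n_tilde = 30, 100                      # n labeled points among n_tilde design points
X_all = rng.uniform(-1, 1, n_tilde)       # X_1, ..., X_{n_tilde} (the first n are labeled)
Y = np.sin(2 * X_all[:n]) + 0.1 * rng.standard_normal(n)
K = lambda x, y: np.exp(-(x - y) ** 2)

# extended response vector Y_tilde = (n_tilde / n) * (Y_1, ..., Y_n, 0, ..., 0)
Y_tilde = np.concatenate([Y, np.zeros(n_tilde - n)]) * (n_tilde / n)

# check that T~*_{n_tilde} Y_tilde = T*_n Y as functions, by evaluating both on a grid
grid = np.linspace(-1, 1, 7)
lhs = np.array([np.mean(Y_tilde * K(X_all, x)) for x in grid])    # (1/n_tilde) sum_i Y~_i K(X_i, x)
rhs = np.array([np.mean(Y * K(X_all[:n], x)) for x in grid])      # (1/n) sum_{i<=n} Y_i K(X_i, x)
print(np.allclose(lhs, rhs))              # True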

CONSTRUCTION WITH UNLABELED DATA
Recall:
    ĥ_{CG1(m)} = Arg Min_{h ∈ H, h ∈ K_m(S̃_ñ, T̃_ñ* Ỹ)} ‖T̃_ñ*(Ỹ − T̃_ñ h)‖,
equivalently:
    α = Arg Min_{α ∈ K_m(K̃_ñ, Ỹ)} ‖K̃_ñ^{1/2}(Ỹ − K̃_ñ α)‖².

OUTER RATES FOR CG REGULARIZATION
Consider the following stopping rule, for some fixed τ > 3/2:
    m̂ := min{ m ≥ 0 : ‖T_n*(T_n ĥ_{CG(m)} − Y)‖ ≤ τ m (4β/n)^{(r+1)/(2r+1+s)} log²(6/δ) }.   (2)
Furthermore assume
    (BoundedY):   |Y| ≤ M a.s.
Theorem. Assume (BoundedY), SC_outer(r, R), IP_+(s, β), and r ∈ (−min(s, 1/2), 0). Assume unlabeled data is available with
    ñ ≥ n ( (16β²/n) log²(6/δ) )^{-2r/(2r+1+s)}.
Then for θ ∈ [0, r + 1/2), with probability larger than 1 − δ:
    ‖B^θ(T ĥ_{m̂} − h*)‖_{L²} ≤ c(r, τ)(M + R) ( (16β²/n) log²(6/δ) )^{(r+1/2−θ)/(2r+1+s)}.

OVERVIEW
- Inverse problem setting under a random i.i.d. design scheme ("learning setting");
- source condition: Hölder of order r; ill-posedness: polynomial decay of the eigenvalues, of order s;
- rates of the form (for θ ∈ [0, 1/2]):
    ‖S^θ(h* − ĥ)‖_{H_K} = O( n^{-(r+θ)/(2r+1+s)} );
- rates established for general linear spectral methods, as well as for CG;
- matching lower bound;
- matches the classical rates in the white noise model (= sequence model) with σ² = 1/n;
- extension to outer rates (r ∈ (−1/2, 0)) if additional unlabeled data is available.

CONCLUSION/PERSPECTIVES
We filled gaps in the existing picture for inverse learning methods...
Adaptivity? Ideally, attain optimal rates without a priori knowledge of r nor of s!
- Lepski's method / balancing principle: in progress. Need a good estimator for N(λ)! (Prior work on this: Caponnetto; need some sharper bound.)
- Hold-out principle: only valid for the direct problem? But the optimal parameter does not depend on the risk norm: hope for validity in the inverse case.