Convergence rates of spectral methods for statistical inverse learning problems

1 Convergence rates of spectral methods for statistical inverse learning problems
G. Blanchard, Universität Potsdam
UCL/Gatsby unit, 04/11/2015
Joint work with N. Mücke (U. Potsdam) and N. Krämer (U. München)

2 Outline: 1. The inverse learning setting. 2. Rates for linear spectral regularization methods. 3. Rates for conjugate gradient regularization.

4 DETERMINISTIC AND STATISTICAL INVERSE PROBLEMS
Let $A$ be a bounded operator between Hilbert spaces $H_1 \to H_2$ (assumed known).
Classical (deterministic) inverse problem: observe $y_\sigma = A f + \sigma\eta$, (IP) under the assumption $\|\eta\|_{H_2} \le 1$.
Note: the $H_2$-norm measures the observation error; the $H_1$-norm measures the reconstruction error.
Classical deterministic theory: see Engl, Hanke and Neubauer (2000).

5 DETERMINISTIC AND STATISTICAL INVERSE PROBLEMS
Inverse problem $y_\sigma = A f + \sigma\eta$. (IP)
What if the noise is random? Classical statistical inverse problem model: $\eta$ is a Gaussian white noise process on $H_2$. Note: in this case (IP) is not an equation between elements of $H_2$, but is to be interpreted as a process on $H_2$.
Under a Hölder source condition of order $r$ and polynomial ill-posedness (eigenvalue decay) of order $1/s$, sharp minimax rates are known in this setting:
$\|(A^*A)^{\theta}(\hat f - f)\|_{H_1} = O\big(\sigma^{\frac{2(r+\theta)}{2r+1+s}}\big) = O\big(\sigma^{\frac{2(\nu+b\theta)}{2\nu+b+1}}\big)$, for $\theta \in [0, \frac12]$ ($\theta = 0$: inverse problem; $\theta = \frac12$: direct problem).
(Alternate parametrization: $b := 1/s$, $\nu := rb$ the intrinsic regularity.)
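To make the equivalence of the two parametrizations concrete, here is a quick numerical sanity check (a sketch; the grid of values for $r$, $s$, $\theta$ is arbitrary):

```python
import numpy as np

# Check that 2(r + theta)/(2r + 1 + s) equals 2(nu + b*theta)/(2nu + b + 1)
# under the reparametrization b = 1/s, nu = r*b, on an arbitrary grid.
for r in (0.25, 0.5, 1.0, 2.0):
    for s in (0.2, 0.5, 1.0, 2.0):
        for theta in (0.0, 0.25, 0.5):
            b, nu = 1.0 / s, r / s
            e1 = 2 * (r + theta) / (2 * r + 1 + s)
            e2 = 2 * (nu + b * theta) / (2 * nu + b + 1)
            assert np.isclose(e1, e2), (r, s, theta)
print("exponents agree on the whole grid")
```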

6 LINEAR SPECTRAL REGULARIZATION METHODS
Inverse problem (deterministic or statistical) where $A$ is known. First consider the so-called normal equation: $A^* y_\sigma = (A^*A) f + \sigma (A^*\eta)$.
Linear spectral methods: let $\zeta_\lambda(x): \mathbb{R}_+ \to \mathbb{R}_+$ be a real function of one real variable which approximates $1/x$, and $\lambda > 0$ a tuning parameter. Define $\hat f_\lambda = \zeta_\lambda(A^*A) A^* y_\sigma$.
Examples: Tikhonov $\zeta_\lambda(x) = (x+\lambda)^{-1}$, spectral cut-off $\zeta_\lambda(x) = x^{-1}\, 1\{x \ge \lambda\}$, Landweber iteration polynomials, $\nu$-methods...
Under general conditions on $\zeta_\lambda$, optimal/minimax rates can be attained by such methods (deterministic: Engl et al., 2000; stochastic noise: Bissantz et al., 2007).
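A minimal NumPy sketch of this recipe, assuming a finite-dimensional $A$ so that $\zeta_\lambda(A^*A)$ can be formed from an SVD; the random matrix, noise level and $\lambda$ values are placeholders, not the talk's.

```python
import numpy as np

def spectral_estimate(A, y, zeta):
    """f_lambda = zeta(A*A) A* y, computed via the SVD of A."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)
    # A*A has eigenvalues sv**2 in the basis given by the rows of Vt;
    # A* y has coordinates sv * (U^T y) in that basis.
    return Vt.T @ (zeta(sv ** 2) * (sv * (U.T @ y)))

def tikhonov(lam):
    return lambda t: 1.0 / (t + lam)

def spectral_cutoff(lam):
    return lambda t: np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50)) / np.sqrt(200)     # stand-in for a bounded operator
f_true = rng.standard_normal(50)
y = A @ f_true + 0.01 * rng.standard_normal(200)      # y_sigma = A f + sigma * eta

for name, zeta in [("Tikhonov", tikhonov(1e-2)), ("cut-off", spectral_cutoff(1e-2))]:
    f_hat = spectral_estimate(A, y, zeta)
    print(name, np.linalg.norm(f_hat - f_true))
```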

7 STATISTICAL LEARNING
Learning usually refers to the following setting: $(X_i, Y_i)_{i=1,\dots,n}$ i.i.d. $\sim P_{XY}$ on $\mathcal X \times \mathcal Y$, where $\mathcal Y \subset \mathbb{R}$.
Goal: estimate some functional related to the dependency between $X$ and $Y$, for instance (nonparametric) least squares regression: estimate $f^*(x) := E[Y \mid X = x]$, and measure the quality of an estimator $\hat f$ via
$\|\hat f - f^*\|^2_{L^2(P_X)} = E_{X \sim P_X}\big[(\hat f(X) - f^*(X))^2\big]$.

8 SETTING: INVERSE LEARNING PROBLEM
We refer to inverse learning for an inverse problem where we have noisy observations at random design points:
$(X_i, Y_i)_{i=1,\dots,n}$ i.i.d.: $Y_i = (Af)(X_i) + \varepsilon_i$. (ILP)
The goal is to recover $f \in H_1$.
Early works on closely related subjects come from the splines literature in the '80s (e.g. O'Sullivan '90).

9 MAIN ASSUMPTION FOR INVERSE LEARNING
Model: $Y_i = (Af)(X_i) + \varepsilon_i$, $i = 1,\dots,n$, where $A: H_1 \to H_2$. (ILP)
Observe:
$H_2$ should be a space of real-valued functions on $\mathcal X$;
the geometrical structure of the measurement errors will be dictated by the statistical properties of the sampling scheme;
we do not need to assume or consider any a priori Hilbert structure on $H_2$.
The crucial structural assumption we make is the following:
Assumption. The family of evaluation functionals $(S_x)$, $x \in \mathcal X$, defined by $S_x: H_1 \to \mathbb{R}$, $f \mapsto S_x(f) := (Af)(x)$, is uniformly bounded, i.e., there exists $\kappa < \infty$ such that for any $x \in \mathcal X$, $|S_x(f)| \le \kappa \|f\|_{H_1}$.

10 GEOMETRY OF INVERSE LEARNING
The inverse learning setting was essentially introduced by Caponnetto et al. (2006).
Riesz's theorem implies the existence, for any $x \in \mathcal X$, of $F_x \in H_1$ such that for all $f \in H_1$: $(Af)(x) = \langle f, F_x\rangle$.
$K(x, y) := \langle F_x, F_y\rangle$ defines a positive semidefinite kernel on $\mathcal X$ with associated reproducing kernel Hilbert space (RKHS) denoted $H_K$.
As a pure function space, $H_K$ coincides with $\mathrm{Im}(A)$.
Assuming $A$ injective, $A$ is in fact an isometric isomorphism between $H_1$ and $H_K$.
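A toy illustration of this construction, assuming $H_1 = \mathbb{R}^d$ and $(Af)(x) = \langle a(x), f\rangle$ for some feature map $a$ (the map below is an arbitrary stand-in, not from the talk): the Riesz representer of evaluation at $x$ is then $F_x = a(x)$, and $K(x,y) = \langle a(x), a(y)\rangle$.

```python
import numpy as np

d = 4
def a(x):
    # arbitrary bounded feature map standing in for x -> F_x;
    # sup_x ||a(x)|| plays the role of the constant kappa
    return np.cos(np.arange(1, d + 1) * x) / np.arange(1, d + 1)

def K(x, y):                      # K(x, y) = <F_x, F_y>
    return a(x) @ a(y)

f = np.ones(d)
x, y = 0.3, -0.7
print("(Af)(x) = <f, F_x> =", a(x) @ f)
print("K(x, y) =", K(x, y), " K(x, x) <= kappa^2 =", np.sum(1.0 / np.arange(1, d + 1) ** 2))
```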

11 GEOMETRY OF INVERSE LEARNING
The main assumption implies that, as a function space, $\mathrm{Im}(A)$ is endowed with a natural RKHS structure with a kernel $K$ bounded by $\kappa$. Furthermore this RKHS $H_K$ is isometric to $H_1$ (through $A^{-1}$).
Therefore, the inverse learning problem is formally equivalent to the kernel learning problem $Y_i = h(X_i) + \varepsilon_i$, $i = 1,\dots,n$, where $h \in H_K$, and we measure the quality of an estimator $\hat h \in H_K$ via the RKHS norm $\|\hat h - h\|_{H_K}$.
Indeed, if we put $\hat f := A^{-1}\hat h$, then $\|\hat f - f\|_{H_1} = \|A(\hat f - f)\|_{H_K} = \|\hat h - h\|_{H_K}$.

12 SETTING, REFORMULATED
We are actually back to the familiar regression setting with random design: $Y_i = h(X_i) + \varepsilon_i$, where $(X_i, Y_i)_{1 \le i \le n}$ is an i.i.d. sample from $P_{XY}$ on the space $\mathcal X \times \mathbb{R}$, with $E[\varepsilon_i \mid X_i] = 0$.
Noise assumption (BernsteinNoise): $E[|\varepsilon_i|^p \mid X_i] \le \frac{1}{2}\, p!\, M^p$ for all $p \ge 2$.
$h$ is assumed to lie in a (known) RKHS $H_K$ with bounded kernel $K$.
The criterion for measuring the quality of an estimator $\hat h$ is the RKHS norm $\|\hat h - h\|_{H_K}$.

13 Outline: 1. The inverse learning setting. 2. Rates for linear spectral regularization methods. 3. Rates for conjugate gradient regularization.

14 EMPIRICAL AND POPULATION OPERATORS
Define the (random) empirical evaluation operator $T_n: h \in H \mapsto (h(X_1), \dots, h(X_n)) \in \mathbb{R}^n$ and its population counterpart, the inclusion operator $T: h \in H \mapsto h \in L^2(\mathcal X, P_X)$;
the (random) empirical kernel integral operator $T_n^*: (v_1, \dots, v_n) \in \mathbb{R}^n \mapsto \frac{1}{n}\sum_{i=1}^n K(X_i, \cdot)\, v_i \in H$ and its population counterpart, the kernel integral operator $T^*: f \in L^2(\mathcal X, P_X) \mapsto T^*(f) = \int f(x) K(x, \cdot)\, dP_X(x) \in H$.
Finally, define the empirical covariance operator $S_n = T_n^* T_n$ and its population counterpart $S = T^* T$.
Observe that $S_n$, $S$ are both operators $H_K \to H_K$; the intuition is that $S_n$ is a (random) approximation of $S$.

15 Recall the model with $h \in H_K$: $Y_i = h(X_i) + \varepsilon_i$, i.e. $Y = T_n h + \varepsilon$, where $Y := (Y_1, \dots, Y_n)$.
Associated normal equation: $Z = T_n^* Y = T_n^* T_n h + T_n^*\varepsilon = S_n h + T_n^*\varepsilon$.
Idea (Rosasco, Caponnetto, De Vito, Odone): use methods from the inverse problems literature.
Observe that there is also an error on the operator.
Use concentration principles to bound $\|T_n^*\varepsilon\|$ and $\|S_n - S\|$.

16 LINEAR SPECTRAL REGULARIZATION METHODS
Linear spectral methods: $\hat h_\zeta = \zeta(S_n) Z$ for some well-chosen function $\zeta: \mathbb{R} \to \mathbb{R}$ acting on the spectrum and approximating the function $x \mapsto x^{-1}$.
Examples: Tikhonov $\zeta_\lambda(t) = (t+\lambda)^{-1}$, spectral cut-off $\zeta_\lambda(t) = t^{-1}\, 1\{t \ge \lambda\}$, Landweber iteration polynomials, $\nu$-methods...

17 SPECTRAL REGULARIZATION IN KERNEL SPACE
Linear spectral regularization in kernel space is written $\hat h_\zeta = \zeta(S_n) T_n^* Y$.
Notice $\zeta(S_n) T_n^* = \zeta(T_n^* T_n) T_n^* = T_n^* \zeta(T_n T_n^*) = T_n^* \zeta(K_n)$, where $K_n = T_n T_n^*: \mathbb{R}^n \to \mathbb{R}^n$ is the (normalized) kernel Gram matrix, $K_n(i,j) = \frac{1}{n} K(X_i, X_j)$.
Equivalently: $\hat h_\zeta = \sum_{i=1}^n \alpha_{\zeta,i}\, K(X_i, \cdot)$ with $\alpha_\zeta = \frac{1}{n}\,\zeta(K_n)\, Y$.
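A sketch of this coefficient formula in NumPy via an eigendecomposition of the normalized Gram matrix; the Gaussian kernel, bandwidth, data and $\lambda$ values below are placeholders. For the Tikhonov filter, the formula reduces to the familiar kernel ridge solution $\alpha = (G + n\lambda I)^{-1} Y$ with $G$ the unnormalized Gram matrix.

```python
import numpy as np

def fit_spectral(X, Y, kern, zeta):
    """Coefficients alpha = (1/n) zeta(K_n) Y, so that h_hat = sum_i alpha_i K(X_i, .)."""
    n = len(X)
    K_n = kern(X[:, None], X[None, :]) / n           # normalized Gram: K_n(i,j) = K(X_i,X_j)/n
    evals, evecs = np.linalg.eigh(K_n)
    return evecs @ (zeta(evals) * (evecs.T @ Y)) / n

tikhonov = lambda lam: (lambda t: 1.0 / (t + lam))
cutoff = lambda lam: (lambda t: np.where(t >= lam, 1.0 / np.maximum(t, lam), 0.0))

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1, 1, n)
Y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(n)
kern = lambda x, y: np.exp(-(x - y) ** 2 / 0.1)      # placeholder bounded kernel
grid = np.linspace(-1, 1, 400)

for name, zeta in [("Tikhonov", tikhonov(1e-3)), ("spectral cut-off", cutoff(1e-2))]:
    alpha = fit_spectral(X, Y, kern, zeta)
    h_hat = kern(grid[:, None], X[None, :]) @ alpha  # h_hat(x) = sum_i alpha_i K(x, X_i)
    print(name, np.sqrt(np.mean((h_hat - np.sin(np.pi * grid)) ** 2)))
```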

18 STRUCTURAL ASSUMPTIONS
Two parameters determine the attainable convergence rates:
(Hölder) source condition for the signal: for $r > 0$, define $\mathrm{SC}(r, R)$: $h = S^r h_0$ with $\|h_0\| \le R$ (can be generalized to extended source conditions, see e.g. Mathé and Pereverzev 2003).
Ill-posedness: if $(\lambda_i)_{i \ge 1}$ is the sequence of positive eigenvalues of $S$ in nonincreasing order, then define $\mathrm{IP}_+(s, \beta)$: $\lambda_i \le \beta\, i^{-1/s}$, and $\mathrm{IP}_-(s, \beta')$: $\lambda_i \ge \beta'\, i^{-1/s}$.

19 ERROR/RISK MEASURE
We measure the error (risk) of an estimator $\hat h$ in the family of norms $\|S^\theta(\hat h - h)\|_{H_K}$, $\theta \in [0, \frac12]$.
Note: $\theta = 0$: inverse problem; $\theta = 1/2$: direct problem, since $\|S^{\frac12}(\hat h - h)\|_{H_K} = \|\hat h - h\|_{L^2(P_X)}$.

20 PREVIOUS RESULTS
[1]: Smale and Zhou (2007); [2]: Bauer, Pereverzev, Rosasco (2007); [3]: Caponnetto, De Vito (2007); [4]: Caponnetto (2006).
[1] Tikhonov, assuming $r \le \frac12$: $\|\hat h - h\|_{L^2(P_X)} \lesssim (1/n)^{\frac{2r+1}{2r+2}}$ and $\|\hat h - h\|_{H_K} \lesssim (1/n)^{\frac{r}{r+1}}$.
[2] General methods, assuming $r \le q - \frac12$ ($q$: qualification): $\|\hat h - h\|_{L^2(P_X)} \lesssim (1/n)^{\frac{2r+1}{2r+2}}$ and $\|\hat h - h\|_{H_K} \lesssim (1/n)^{\frac{r}{r+1}}$.
[3] Tikhonov, assuming $r \le \frac12$: $\|\hat h - h\|_{L^2(P_X)} \lesssim (1/n)^{\frac{2r+1}{2r+1+s}}$.
[4] General methods, assuming $r \le q - \frac12$, plus unlabeled data if $2r + s < 1$: $\|\hat h - h\|_{L^2(P_X)} \lesssim (1/n)^{\frac{2r+1}{2r+1+s}}$.
Matching lower bound: only for $\|\hat h - h\|_{L^2(P_X)}$ [2].
Compare to results known for regularization methods under the Gaussian white noise model: Mair and Ruymgaart (1996), Nussbaum and Pereverzev (1999), Bissantz, Hohage, Munk and Ruymgaart (2007).

21 ASSUMPTIONS ON REGULARIZATION FUNCTION
From now on we assume $\kappa = 1$ for simplicity. Standard assumptions on the regularization family $\zeta_\lambda: [0,1] \to \mathbb{R}$ are:
(i) There exists a constant $D < \infty$ such that $\sup_{0 < \lambda \le 1}\ \sup_{0 < t \le 1} |t\, \zeta_\lambda(t)| \le D$.
(ii) There exists a constant $M < \infty$ such that $\sup_{0 < \lambda \le 1}\ \sup_{0 < t \le 1} |\lambda\, \zeta_\lambda(t)| \le M$.
(iii) Qualification: for all $\lambda \le 1$: $\sup_{0 < t \le 1} |1 - t\, \zeta_\lambda(t)|\, t^\nu \le \gamma_\nu \lambda^\nu$, holding for $\nu = 0$ and $\nu = q > 0$.
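As a quick, non-rigorous numerical probe of these conditions for the Tikhonov filter $\zeta_\lambda(t) = (t+\lambda)^{-1}$ (whose qualification is $q = 1$), one can scan a grid of $(\lambda, t)$ values; the grid below is arbitrary.

```python
import numpy as np

# Probe conditions (i)-(iii) for Tikhonov on a grid of (lambda, t) in (0, 1]^2.
t = np.linspace(1e-6, 1.0, 2000)
lams = np.logspace(-6, 0, 50)
D = max(np.max(t / (t + lam)) for lam in lams)                    # (i)   sup |t zeta_lam(t)|
M = max(np.max(lam / (t + lam)) for lam in lams)                  # (ii)  sup |lam zeta_lam(t)|
g1 = max(np.max((1 - t / (t + lam)) * t / lam) for lam in lams)   # (iii) with nu = q = 1
print(f"D <= {D:.3f}, M <= {M:.3f}, gamma_1 <= {g1:.3f}")
```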

22 UPPER BOUND ON RATES
Theorem (Mücke, Blanchard). Assume $r, R, s, \beta$ are fixed positive constants and let $\mathcal P(r, R, s, \beta)$ denote the set of distributions on $\mathcal X \times \mathcal Y$ satisfying $\mathrm{IP}_+(s, \beta)$, $\mathrm{SC}(r, R)$ and (BernsteinNoise). Define $\hat h^{(n)}_{\lambda_n} = \zeta_{\lambda_n}(S_n) Z^{(n)}$ using a regularization family $(\zeta_\lambda)$ satisfying the standard assumptions with qualification $q \ge r + \theta$, and the parameter choice rule
$\lambda_n = \left(\frac{\sigma^2}{R^2 n}\right)^{\frac{1}{2r+1+s}}$.
Then for any $\theta \in [0, \frac12]$ and $\eta \in (0, 1)$:
$\sup_{P \in \mathcal P(r,R,s,\beta)} P^{\otimes n}\left( \big\|S^\theta(\hat h^{(n)}_{\lambda_n} - h)\big\|_{H_K} > C \log(2\eta^{-1})\, R\left(\frac{\sigma^2}{R^2 n}\right)^{\frac{r+\theta}{2r+1+s}} \right) \le \eta$.

23 COMMENTS
It follows that the convergence rate obtained is of order $C \cdot R\left(\frac{\sigma^2}{R^2 n}\right)^{\frac{r+\theta}{2r+1+s}}$.
The constant $C$ depends on the various parameters entering the assumptions, but not on $n$, $R$, $\sigma$!
The result applies to all linear spectral regularization methods, but it assumes a precise tuning of the regularization constant $\lambda$ as a function of the assumed regularity parameters of the target: it is not adaptive.
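For concreteness, a tiny sketch of this (non-adaptive) parameter choice rule and of the resulting rate; the values of $r$, $s$, $R$, $\sigma$ are placeholders.

```python
import numpy as np

def lambda_n(n, r, s, R=1.0, sigma=0.1):
    """Theorem's parameter choice: lambda_n = (sigma^2 / (R^2 n))^(1/(2r+1+s))."""
    return (sigma ** 2 / (R ** 2 * n)) ** (1.0 / (2 * r + 1 + s))

def rate(n, r, s, theta=0.0, R=1.0, sigma=0.1):
    """Resulting bound R * (sigma^2 / (R^2 n))^((r+theta)/(2r+1+s)), up to constants."""
    return R * (sigma ** 2 / (R ** 2 * n)) ** ((r + theta) / (2 * r + 1 + s))

for n in (10 ** 3, 10 ** 4, 10 ** 5):
    print(n, lambda_n(n, r=1.0, s=0.5), rate(n, r=1.0, s=0.5, theta=0.0))
```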

24 WEAK LOWER BOUND ON RATES
Theorem (Mücke, Blanchard). Assume $r, R, s, \beta$ are fixed positive constants and let $\mathcal P_-(r, R, s, \beta)$ denote the set of distributions on $\mathcal X \times \mathcal Y$ satisfying $\mathrm{IP}_-(s, \beta)$, $\mathrm{SC}(r, R)$ and (BernsteinNoise). (We assume this set to be non-empty!) Then
$\limsup_{n \to \infty}\ \inf_{\hat h}\ \sup_{P \in \mathcal P_-(r,R,s,\beta)} P^{\otimes n}\left( \big\|S^\theta(\hat h - h)\big\|_{H_K} > C R\left(\frac{\sigma^2}{R^2 n}\right)^{\frac{r+\theta}{2r+1+s}} \right) > 0.$
Proof: Fano's lemma technique.

25 STRONG LOWER BOUND ON RATES
Assume additionally "no big jumps in the eigenvalues": $\inf_{k \ge 1} \frac{\lambda_{2k}}{\lambda_k} > 0$.
Theorem (Mücke, Blanchard). Assume $r, R, s, \beta$ are fixed positive constants and let $\mathcal P_-(r, R, s, \beta)$ denote the set of distributions on $\mathcal X \times \mathcal Y$ satisfying $\mathrm{IP}_-(s, \beta)$, $\mathrm{SC}(r, R)$ and (BernsteinNoise). (We assume this set to be non-empty!) Then
$\liminf_{n \to \infty}\ \inf_{\hat h}\ \sup_{P \in \mathcal P_-(r,R,s,\beta)} P^{\otimes n}\left( \big\|S^\theta(\hat h - h)\big\|_{H_K} > C R\left(\frac{\sigma^2}{R^2 n}\right)^{\frac{r+\theta}{2r+1+s}} \right) > 0.$
Proof: Fano's lemma technique.

26 COMMENTS
The obtained rates are minimax (but not adaptive) in the parameters $R$, $n$, $\sigma$, provided the class of distributions satisfying both $\mathrm{IP}_-(s, \beta')$ and $\mathrm{IP}_+(s, \beta)$ is not empty.

27 STATISTICAL ERROR CONTROL
Error controls were introduced and used by Caponnetto and De Vito (2007), Caponnetto (2007), using Bernstein's inequality for Hilbert space-valued variables (see Pinelis and Sakhanenko; Yurinski).
Theorem (Caponnetto, De Vito). Define $\mathcal N(\lambda) = \mathrm{Tr}\big((S+\lambda)^{-1} S\big)$; then under assumption (BernsteinNoise) we have the following:
$P\left[ \big\|(S+\lambda)^{-\frac12}(T_n^* Y - S_n h)\big\| \le 2M\left(\sqrt{\frac{\mathcal N(\lambda)}{n}} + \frac{2}{\lambda n}\right)\log\frac{6}{\delta} \right] \ge 1 - \delta.$
Also, the following holds:
$P\left[ \big\|(S+\lambda)^{-\frac12}(S_n - S)\big\|_{HS} \le 2\left(\sqrt{\frac{\mathcal N(\lambda)}{n}} + \frac{2}{\lambda n}\right)\log\frac{6}{\delta} \right] \ge 1 - \delta.$
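The effective dimension $\mathcal N(\lambda)$ can be approximated from data by the naive plug-in $\mathcal N_n(\lambda) = \mathrm{Tr}\big((K_n + \lambda I)^{-1} K_n\big)$, sketched below with a placeholder kernel and design; the concluding slide points out that a sharper estimator of $\mathcal N(\lambda)$ is what is actually needed for adaptivity.

```python
import numpy as np

def effective_dimension(K_n, lam):
    """Plug-in N_n(lambda) = Tr((K_n + lam I)^{-1} K_n), with K_n the normalized Gram matrix."""
    evals = np.linalg.eigvalsh(K_n)
    return float(np.sum(evals / (evals + lam)))

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1, 1, n)
K_n = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1) / n
for lam in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"lambda = {lam:.0e}   N_n(lambda) ~ {effective_dimension(K_n, lam):6.1f}")
```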

28 Outline: 1. The inverse learning setting. 2. Rates for linear spectral regularization methods. 3. Rates for conjugate gradient regularization.

29 PARTIAL LEAST SQUARES REGULARIZATION
Consider first the classical linear regression setting $Y = X\omega + \varepsilon$, where $Y := (Y_1, \dots, Y_n)$; $X := (X_1, \dots, X_n)^t$; $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$.
Algorithmic description of Partial Least Squares:
find the direction $v_1$ s.t. $v_1 = \mathrm{Arg\,Max}_{v \in \mathbb{R}^d}\ \frac{\widehat{\mathrm{Cov}}(\langle X, v\rangle, Y)}{\|v\|} = \mathrm{Arg\,Max}_{v \in \mathbb{R}^d}\ \frac{Y^t X v}{\|v\|} = \mathrm{Arg\,Max}_{v \in \mathbb{R}^d}\ \frac{\langle v, X^t Y\rangle}{\|v\|}$;
project $Y$ orthogonally on $X v_1$, yielding $\hat Y_1$;
iterate the procedure on the residual $Y - \hat Y_1$.
The fit at step $m$ is $\sum_{i=1}^m \hat Y_i$. Regularization is obtained by early stopping.

32 PLS AND CONJUGATE GRADIENT
An equivalent definition of PLS: $\omega_m = \mathrm{Arg\,Min}_{\omega \in \mathcal K_m(X^tX,\, X^tY)} \|Y - X\omega\|^2$, where $\mathcal K_m(A, z) = \mathrm{span}\{z, Az, \dots, A^{m-1}z\}$ is a Krylov space of order $m$.
This definition is equivalent to $m$ steps of the conjugate gradient algorithm applied to iteratively solve the linear equation $A\omega = X^tX\omega = X^tY = z$.
For any fixed $m$, the fit $\hat Y_m = X\omega_m$ is a nonlinear function of $Y$.

33 PROPERTIES OF CONJUGATE GRADIENT
By definition $\omega_m$ has the form $\omega_m = p_m(A)\, z = X^t p_m(XX^t)\, Y$, where $p_m$ is a polynomial of degree $m-1$.
Of particular interest are the residual polynomials $r_m(t) = 1 - t\, p_m(t)$; $Y - \hat Y_m = r_m(XX^t)\, Y$.
The polynomials $r_m$ form a family of orthogonal polynomials for the inner product $\langle p, q\rangle = \langle p(XX^t)Y,\ XX^t q(XX^t)Y\rangle$, with the normalization $r_m(0) = 1$.
The polynomials $r_m$ follow an order-2 recurrence relation of the type (hence a simple implementation): $r_{m+1}(t) = a_m t\, r_m(t) + b_m r_m(t) + c_m r_{m-1}(t)$.

34 ALGORITHM FOR CG/PLS
Initialize: $\omega_0 = 0$; $r_0 = X^tY$; $g_0 = r_0$
for $m = 0, \dots, (m_{\max} - 1)$ do
  $\alpha_m = \|X r_m\|^2 / \|X^tX g_m\|^2$
  $\omega_{m+1} = \omega_m + \alpha_m g_m$ (update)
  $r_{m+1} = r_m - \alpha_m X^tX g_m$ (residuals)
  $\beta_m = \|X r_{m+1}\|^2 / \|X r_m\|^2$
  $g_{m+1} = r_{m+1} + \beta_m g_m$ (next direction)
end for
Return: approximate solution $\omega_{m_{\max}}$
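A direct NumPy transcription of this pseudocode, as a sketch: the toy data, noise level and number of iterations are placeholders, and the step sizes follow the pseudocode's weighted normalization (a standard CGLS implementation uses slightly different formulas). The printed quantity $\|X^t(Y - X\omega_m)\|$ is the normal-equation residual that these iterations drive down.

```python
import numpy as np

def cg_pls(X, Y, m_max):
    """CG/PLS iterates for X^t X w = X^t Y, following the pseudocode above
    (step sizes taken in the X^t X-weighted inner product)."""
    w = np.zeros(X.shape[1])
    r = X.T @ Y                                   # r_0 = X^t Y
    g = r.copy()                                  # g_0 = r_0
    iterates = [w.copy()]
    for _ in range(m_max):
        Xr, XtXg = X @ r, X.T @ (X @ g)
        alpha = (Xr @ Xr) / (XtXg @ XtXg)         # alpha_m = ||X r_m||^2 / ||X^t X g_m||^2
        w = w + alpha * g                         # update
        r = r - alpha * XtXg                      # normal-equation residual
        Xr_new = X @ r
        beta = (Xr_new @ Xr_new) / (Xr @ Xr)      # beta_m = ||X r_{m+1}||^2 / ||X r_m||^2
        g = r + beta * g                          # next direction
        iterates.append(w.copy())
    return iterates

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
for m, w in enumerate(cg_pls(X, Y, m_max=8)):
    print(m, np.linalg.norm(X.T @ (Y - X @ w)))
```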

35 KERNEL-CG REGULARIZATION (= KERNEL PARTIAL LEAST SQUARES)
Define the $m$-th iterate of CG as $\hat h_{CG(m)} = \mathrm{Arg\,Min}_{h \in H,\ h \in \mathcal K_m(S_n, T_n^*Y)} \big\|T_n^*Y - S_n h\big\|_{H_K}$, where $\mathcal K_m$ denotes the Krylov space $\mathcal K_m(A, z) = \mathrm{span}\{z, Az, \dots, A^{m-1}z\}$.
Equivalently: $\alpha_{CG(m)} = \mathrm{Arg\,Min}_{\alpha \in \mathcal K_m(K_n, Y)} \big\|K_n^{\frac12}(Y - K_n\alpha)\big\|^2$ and $\hat h_{CG(m)} = \sum_{i=1}^n \alpha_{CG(m),i}\, K(X_i, \cdot)$.
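A sketch of these iterates computed from the normalized Gram matrix $K_n$ (kernel, bandwidth and data are placeholders, not the talk's). Normalization conventions for the coefficients vary; here the fitted values at the design points are read off as $K_n\alpha$, which corresponds to taking $\hat h = \frac{1}{n}\sum_i \alpha_i K(X_i, \cdot)$.

```python
import numpy as np

def kernel_cg(K_n, Y, m_max):
    """CG iterates alpha_m for K_n alpha = Y, minimizing ||K_n^{1/2}(Y - K_n alpha)||^2
    over the Krylov spaces K_m(K_n, Y)."""
    alpha = np.zeros(len(Y))
    r = Y.copy()                                  # residual Y - K_n alpha
    g = r.copy()
    iterates = [alpha.copy()]
    for _ in range(m_max):
        Kr, Kg = K_n @ r, K_n @ g
        step = (Kr @ Kr) / (Kg @ (K_n @ Kg))      # step size in the K_n^2-weighted inner product
        alpha = alpha + step * g
        r = r - step * Kg
        Kr_new = K_n @ r
        beta = (Kr_new @ Kr_new) / (Kr @ Kr)
        g = r + beta * g
        iterates.append(alpha.copy())
    return iterates

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1, 1, n)
h_star = np.sin(np.pi * X)
Y = h_star + 0.1 * rng.standard_normal(n)
K_n = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1) / n    # normalized Gram matrix

for m, alpha in enumerate(kernel_cg(K_n, Y, m_max=8)):
    fit = K_n @ alpha          # in-sample values of h_hat = (1/n) sum_i alpha_i K(X_i, .)
    print(m, np.sqrt(np.mean((fit - h_star) ** 2)))
```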

36 RATES FOR CG
Consider the following stopping rule, for some fixed $\tau$:
$\hat m := \min\left\{ m \ge 0 : \big\|T_n^*(T_n \hat h_{CG(m)} - Y)\big\| \le \tau \left(\tfrac{1}{n}\right)^{\frac{r+1}{2r+1+s}} \log^2\tfrac{6}{\delta} \right\}$. (1)
Theorem (Blanchard, Krämer). Assume (BernsteinNoise), $\mathrm{SC}(r, R)$, $\mathrm{IP}(s, \beta)$ hold; let $\theta \in [0, \frac12)$. Then for $\tau$ large enough, with probability larger than $1 - \delta$:
$\big\|S^\theta(\hat h_{CG(\hat m)} - h)\big\|_{H_K} \le c(r, R, s, \beta, \tau)\left(\tfrac{1}{n}\right)^{\frac{r+\theta}{2r+1+s}} \log^2\tfrac{6}{\delta}.$
Technical tools: again, concentration of the error in an appropriate norm, and a suitable reworking of the arguments of Nemirovskii (1980) for deterministic CG.
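A sketch of this discrepancy-based stopping rule, reusing the kernel_cg function from the previous sketch; it requires $r$, $s$ (and $\tau$, $\delta$) as inputs, which is exactly the non-adaptivity discussed earlier. Under the same normalization as above, the discrepancy $\|T_n^*(T_n\hat h - Y)\|_{H_K}$ equals $\sqrt{\tfrac{1}{n}(K_n\alpha - Y)^t K_n (K_n\alpha - Y)}$.

```python
import numpy as np

def discrepancy(K_n, Y, alpha):
    """||T_n^*(T_n h_hat - Y)||_{H_K} for h_hat = (1/n) sum_i alpha_i K(X_i, .)."""
    v = K_n @ alpha - Y
    return np.sqrt((v @ (K_n @ v)) / len(Y))

def stopping_index(K_n, Y, r, s, tau=1.0, delta=0.1, m_max=30):
    """First m whose discrepancy drops below tau * n^{-(r+1)/(2r+1+s)} * log^2(6/delta).
    Requires r, s (and tau, delta) as inputs: the rule is not adaptive."""
    n = len(Y)
    threshold = tau * n ** (-(r + 1) / (2 * r + 1 + s)) * np.log(6 / delta) ** 2
    for m, alpha in enumerate(kernel_cg(K_n, Y, m_max)):   # kernel_cg from the sketch above
        if discrepancy(K_n, Y, alpha) <= threshold:
            return m, alpha
    return m_max, alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
Y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(200)
K_n = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1) / 200
m_hat, alpha_hat = stopping_index(K_n, Y, r=1.0, s=0.5, tau=0.05)
print("stopping rule fired at m =", m_hat)
```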

37 OUTER RATES
It is natural (for the prediction problem) to assume an extension of the source condition beyond $h \in H$ (now assuming only $h \in L^2(P_X)$):
$\mathrm{SC}_{\rm outer}(r, R)$: $\big\|B^{-(r+\frac12)} h\big\|_{L^2} \le R$ (for $B := TT^*$),
to include the possible range $r \in (-\frac12, 0]$.
For such outer source conditions, even for kernel ridge regression and for the direct (= prediction) problem, there are no known results without additional assumptions to reach the optimal rate $O\big(n^{-\frac{r+1/2}{2r+1+s}}\big)$.
Mendelson and Neeman (2009) make assumptions on the sup norm of the eigenfunctions of the integral operator.
Caponnetto (2006) assumes additional unlabeled examples $X_{n+1}, \dots, X_{\tilde n}$ are available, with $\tilde n \ge n\, O\big(n^{\frac{(1-2r-s)_+}{2r+1+s}}\big)$.

39 CONSTRUCTION WITH UNLABELED DATA
Assume $\tilde n$ i.i.d. $X$-examples are available, out of which $n$ are labeled.
Extend the $n$-vector $Y$ to an $\tilde n$-vector $\tilde Y = \frac{\tilde n}{n}(Y_1, \dots, Y_n, 0, \dots, 0)$.
Perform the same algorithm as before on $\tilde X, \tilde Y$. Notice in particular that $\tilde T_{\tilde n}^* \tilde Y = T_n^* Y$.
Recall: $\hat h_{CG1(m)} = \mathrm{Arg\,Min}_{h \in H,\ h \in \mathcal K_m(\tilde S_{\tilde n},\, T_n^*Y)} \big\|T_n^*Y - \tilde S_{\tilde n} h\big\|_{H_K}$; equivalently $\tilde\alpha = \mathrm{Arg\,Min}_{\alpha \in \mathcal K_m(\tilde K_{\tilde n}, \tilde Y)} \big\|\tilde K_{\tilde n}^{\frac12}(\tilde Y - \tilde K_{\tilde n}\alpha)\big\|$.
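A small numerical check of the identity $\tilde T_{\tilde n}^*\tilde Y = T_n^* Y$ (kernel, design and sample sizes are placeholders), comparing the two functions on a few evaluation points:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_tilde = 50, 200                                  # n labeled among n_tilde design points
X = rng.uniform(-1, 1, n_tilde)
Y = np.sin(np.pi * X[:n]) + 0.1 * rng.standard_normal(n)
Y_tilde = np.concatenate([(n_tilde / n) * Y, np.zeros(n_tilde - n)])   # extended vector

kern = lambda a, b: np.exp(-(a - b) ** 2 / 0.1)
grid = np.linspace(-1, 1, 7)                          # a few evaluation points

TnY = kern(grid[:, None], X[None, :n]) @ Y / n               # T_n^* Y evaluated on the grid
TtY = kern(grid[:, None], X[None, :]) @ Y_tilde / n_tilde    # T_tilde^* Y_tilde on the grid
assert np.allclose(TnY, TtY)
print("T_tilde^* Y_tilde = T_n^* Y  (checked pointwise on the grid)")
```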

40 OUTER RATES FOR CG REGULARIZATION
Consider the following stopping rule, for some fixed $\tau > \frac32$:
$\hat m := \min\left\{ m \ge 0 : \big\|T_n^*(T_n \hat h_{CG(m)} - Y)\big\| \le \tau m \left(\tfrac{4\beta}{n}\right)^{\frac{r+1}{2r+1+s}} \log^2\tfrac{6}{\delta} \right\}$. (2)
Furthermore assume (BoundedY): $|Y| \le M$ a.s.
Theorem. Assume (BoundedY), $\mathrm{SC}_{\rm outer}(r, R)$, $\mathrm{IP}_+(s, \beta)$, and $r \in (-\min(s, \frac12), 0)$. Assume unlabeled data is available with $\tilde n \ge n\left(16\beta^2 + n\log^2\tfrac{6}{\delta}\right)^{\frac{-2r}{2r+1+s}}$. Then for $\theta \in [0, r + \frac12)$, with probability larger than $1 - \delta$:
$\big\|B^{-\theta}(T\hat h_{\hat m} - h)\big\|_{L^2} \le c(r, \tau)(M + R)\left(\tfrac{16\beta^2}{n}\right)^{\frac{r+\frac12-\theta}{2r+1+s}} \log^2\tfrac{6}{\delta}.$

41 OVERVIEW
Inverse problem setting under a random i.i.d. design scheme ("learning setting"); source condition: Hölder of order $r$; ill-posedness: polynomial decay of eigenvalues of order $s$.
Rates of the form (for $\theta \in [0, \frac12]$): $\|S^\theta(\hat h - h)\|_{H_K} = O\big(n^{-\frac{r+\theta}{2r+1+s}}\big)$.
Rates established for general linear spectral methods, as well as CG.
Matching lower bound.
Matches classical rates in the white noise model (= sequence model) with $\sigma^2 \asymp \frac{1}{n}$.
Extension to outer rates ($r \in (-\frac12, 0)$) if additional unlabeled data is available.

42 CONCLUSION/PERSPECTIVES
We filled gaps in the existing picture for inverse learning methods...
Adaptivity? Ideally, attain optimal rates without a priori knowledge of $r$ or $s$!
Lepski's method/balancing principle: in progress. Need a good estimator for $\mathcal N(\lambda)$! (Prior work on this: Caponnetto; a sharper bound is needed.)
Hold-out principle: only valid for the direct problem? But the optimal parameter does not depend on the risk norm: hope for validity in the inverse case.
