consistent learning by composite proximal thresholding


1 consistent learning by composite proximal thresholding
Saverio Salzo, Università degli Studi di Genova
Optimization in Machine Learning, Vision and Image Processing, Université Paul Sabatier, Toulouse, 6-7 October

2 outline
1. Introduction: the learning problem, constructing the estimators, our contribution
2. Statistical analysis: proving consistency (a sketch), the main result
3. Algorithm: description, inexact proximity operators, the main result
4. Further developments

3 introduction

4 nonparametric regression with random design
[Figure (ISL, Figure 2.3): income as a function of years of education and seniority; the blue surface is the true underlying relationship, known since the data are simulated, and the red dots are the observed values for 30 individuals.]
Input space: $\mathcal{X} = \mathrm{range}(\text{years of education}) \times \mathrm{range}(\text{seniority}) \subset \mathbb{R}^2$. Output space: $\mathcal{Y} = \mathrm{range}(\text{income}) \subset \mathbb{R}$.
The training set $z_n = (x_i, y_i)_{1 \le i \le n} \in (\mathcal{X} \times \mathcal{Y})^n$ consists of i.i.d. realizations of a random variable $(X, Y)$ distributed according to $P$.

5 nonparametric regression with random design
The distribution $P$ describes a "fuzzy" function: to each input $x \in \mathcal{X}$ it associates outputs $y \in \mathcal{Y}$ according to the conditional distribution $P(y \mid x)$.
The goal is to estimate a function $f\colon \mathcal{X} \to \mathcal{Y}$ such that, for every $x \in \mathcal{X}$, $f(x)$ is chosen according to $P(y \mid x)$. For instance one could take $f = \mathrm{E}(Y \mid X)$, the conditional mean.

6 the formal setting
We are given:
- an input space $(\mathcal{X}, \mathfrak{A})$ and an output space $\mathcal{Y} \subset \mathbb{R}$, a bounded interval, together with a random variable $(X, Y)$ with values in $\mathcal{X} \times \mathcal{Y}$ and distribution $P$;
- a sequence $(X_i, Y_i)_{i \in \mathbb{N}}$ of i.i.d. random variables taking values in $\mathcal{X} \times \mathcal{Y}$ with common distribution $P$;
- $n$ observations $z_n = (x_i, y_i)_{1 \le i \le n}$ of the random variables $Z_n = (X_i, Y_i)_{1 \le i \le n}$, which constitute the training set;
- a (large) closed convex constraint set $C \subset L^2(P_X)$ encoding prior information.

7 the learning problem
The goal is to approach $f_C$, the regression function on $C$, which is the minimizer over the constraint set $C$ of the $L^2$-risk
$R\colon L^2(P_X) \to \mathbb{R}_+$, $\quad R(f) = \int_{\mathcal{X} \times \mathcal{Y}} |y - f(x)|^2 \, dP(x,y) = \mathrm{E}|Y - f(X)|^2$,
without any knowledge of $P$, but using only training sets $z_n$.

8 the learning problem
The goal is to approach $f_C$, the regression function on $C$, which is the minimizer over the constraint set $C$ of the $L^2$-risk
$R\colon L^2(P_X) \to \mathbb{R}_+$, $\quad R(f) = \int_{\mathcal{X} \times \mathcal{Y}} |y - f(x)|^2 \, dP(x,y) = \mathrm{E}|Y - f(X)|^2$,
without any knowledge of $P$, but using only training sets $z_n$.
A learning algorithm is a map $\Lambda\colon \bigcup_{n \in \mathbb{N}} (\mathcal{X} \times \mathcal{Y})^n \to C$ such that the estimators $\Lambda(Z_n)$ asymptotically converge to $f_C$, meaning that $\|\Lambda(Z_n) - f_C\|_{L^2(P_X)} \to 0$ (in probability or a.s.) as $n \to +\infty$.

9 the learning algorithm: how we build estimators
We adopt a linear model: $f_u = \sum_{k \in K} \langle u \mid e_k \rangle\, \varphi_k$ (pointwise), $u \in \ell^2(K)$, where:
- $(e_k)_{k \in K}$ is the canonical basis of $\ell^2(K)$ and $(\varphi_k)_{k \in K}$ is a countable dictionary of bounded functions in $\mathcal{M}(\mathcal{X}, \mathbb{R})$ such that $\sum_{k \in K} |\varphi_k(x)|^2 \le \kappa^2$ for every $x \in \mathcal{X}$;
- the coefficients $\langle u \mid e_k \rangle$ are constrained in given intervals $C_k \subset \mathbb{R}$, thus defining the constraint set $C = \{ f_u \mid u \in \ell^2(K),\ (\forall k \in K)\ \langle u \mid e_k \rangle \in C_k \}$.
Our estimators will be in $C$! Moreover, $H_K = \{ f_u \mid u \in \ell^2(K) \}$ is a RKHS with feature map $\Phi\colon \mathcal{X} \to \ell^2(K)\colon x \mapsto (\varphi_k(x))_{k \in K}$ and kernel $K(x, x') = \langle \Phi(x) \mid \Phi(x') \rangle$.
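To make the linear model concrete, here is a minimal Python sketch with a hypothetical finite dictionary of scaled cosine features on $\mathcal{X} = [0,1]$ (the names Phi, f_u, kernel and the truncation K = 50 are my own choices, not from the slides); the geometric scaling makes the pointwise bound $\sum_k |\varphi_k(x)|^2 \le \kappa^2$ immediate.

    import numpy as np

    K = 50                                   # truncation of the countable index set (illustrative)
    kappa2 = 2.0                             # sum_k 2**(-k) <= 2, so the pointwise bound holds with kappa^2 = 2

    def Phi(x):
        """Feature map Phi(x) = (phi_k(x))_k with phi_k(x) = 2**(-k/2) * cos(pi*k*x)."""
        k = np.arange(K)
        return 2.0 ** (-k / 2) * np.cos(np.pi * k * x)

    def f_u(u, x):
        """Linear model f_u(x) = sum_k <u | e_k> phi_k(x) = <u, Phi(x)>."""
        return Phi(x) @ u

    def kernel(x, xp):
        """Kernel of the RKHS spanned by the dictionary: K(x, x') = <Phi(x) | Phi(x')>."""
        return Phi(x) @ Phi(xp)

    u = np.zeros(K); u[:3] = [1.0, -0.5, 0.25]      # a sparse coefficient sequence
    assert np.sum(Phi(0.3) ** 2) <= kappa2          # pointwise bound on the dictionary
    print(f_u(u, 0.3), kernel(0.3, 0.7))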

10 the learning algorithm: how we build estimators
We solve a composite regularized least squares regression problem
$\hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle) \Big\}$
$g_k\colon \mathbb{R} \to \mathbb{R}$, $\quad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r \quad (k \in K)$,
where $0 \in C_k$, $C_k, D_k \subset \mathbb{R}$ are closed intervals, $\eta_k \ge \eta > 0$, and $r \in \left]1, 2\right]$.
The learning algorithm: $z_n \mapsto f_{\hat u_{n,\lambda}(z_n)}$.
The indicator function of $C_k$ and the support function of $D_k$ are
$\iota_{C_k}(\xi) = \begin{cases} 0 & \text{if } \xi \in C_k \\ +\infty & \text{if } \xi \notin C_k \end{cases} \qquad \sigma_{D_k}(\xi) = \begin{cases} \xi \sup D_k & \text{if } \xi \ge 0 \\ \xi \inf D_k & \text{if } \xi < 0 \end{cases}$
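For concreteness, a minimal Python sketch (function names and the interval encoding as endpoint pairs are my own) of the per-coordinate penalty $g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k|\cdot|^r$:

    import numpy as np

    def indicator(xi, C):
        """iota_C(xi): 0 on the closed interval C = (c_min, c_max), +inf outside."""
        c_min, c_max = C
        return 0.0 if c_min <= xi <= c_max else np.inf

    def support(xi, D):
        """sigma_D(xi): xi*sup(D) if xi >= 0, xi*inf(D) if xi < 0, for D = (d_min, d_max)."""
        d_min, d_max = D
        return xi * d_max if xi >= 0 else xi * d_min

    def g_k(xi, C=(-np.inf, np.inf), D=(-1.0, 1.0), eta=0.9, r=4/3):
        """Composite penalty g_k = iota_C + sigma_D + eta*|.|^r."""
        return indicator(xi, C) + support(xi, D) + eta * abs(xi) ** r

    print(g_k(0.5))                      # finite value: 0.5 + 0.9*0.5**(4/3)
    print(g_k(-3.0, C=(-2.0, 2.0)))      # inf: the constraint interval C_k is violated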

11 the learning algorithm: how we build estimators
We solve a composite regularized least squares regression problem
$\hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle) \Big\}$
$g_k\colon \mathbb{R} \to \mathbb{R}$, $g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r$ ($k \in K$), where $0 \in C_k$, $C_k, D_k \subset \mathbb{R}$ are closed intervals, $\eta_k \ge \eta > 0$, and $r \in \left]1, 2\right]$.
The learning algorithm: $z_n \mapsto f_{\hat u_{n,\lambda}(z_n)}$.
The empirical risk and the regularizer are
$\hat R_n(f_u) = \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2, \qquad G(u) = \sum_{k \in K} g_k(\langle u \mid e_k \rangle).$
This is a regularized empirical risk minimization learning algorithm.

12 the learning algorithm: how we build estimators
We solve a composite regularized least squares regression problem
$\hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle) \Big\}$
$g_k\colon \mathbb{R} \to \mathbb{R}$, $g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r$ ($k \in K$), where $0 \in C_k$, $C_k, D_k \subset \mathbb{R}$ are closed intervals, $\eta_k \ge \eta > 0$, and $r \in \left]1, 2\right]$.
The learning algorithm: $z_n \mapsto f_{\hat u_{n,\lambda}(z_n)}$.
This model encompasses:
- ridge regression (Hoerl and Kennard, 70): $g_k = |\cdot|^2$
- elastic net (Zou, Hastie, 05; De Mol et al., 09): $g_k = \omega_k |\cdot| + \eta |\cdot|^2$
- bridge regression (Frank and Friedman 93): $g_k = |\cdot|^r$, $1 < r < 2$.
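A possible reading of these special cases in terms of the parameters $(C_k, D_k, \eta_k, r)$ above (my own bookkeeping, not spelled out on the slide): ridge corresponds to $C_k = \mathbb{R}$, $D_k = \{0\}$ (so $\sigma_{D_k} = 0$) and $r = 2$; the elastic net to $C_k = \mathbb{R}$, $D_k = [-\omega_k, \omega_k]$ (so $\sigma_{D_k} = \omega_k|\cdot|$), $\eta_k = \eta$ and $r = 2$; bridge regression to $C_k = \mathbb{R}$, $D_k = \{0\}$ and $1 < r < 2$.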

13 the learning algorithm: how we build estimators
We minimize the regularized empirical risk by proximal gradient algorithms. Let $0 < \gamma < \lambda/(2\kappa^2)$ and define $(u_m)_{m \in \mathbb{N}}$ by, for all $m \in \mathbb{N}$ and all $k \in K$,
$\langle u_{m+1} \mid e_k \rangle = \operatorname{prox}_{\gamma g_k}\Big( \langle u_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{u_m}(x_i) - y_i \big) \varphi_k(x_i) \Big)$,
where $\operatorname{prox}_{\gamma g_k}(\xi) = \operatorname{argmin}_{t \in \mathbb{R}} \big\{ \gamma g_k(t) + \tfrac{1}{2}(t - \xi)^2 \big\}$.

14 the learning algorithm: how we build estimators
We minimize the regularized empirical risk by proximal gradient algorithms. Let $0 < \gamma < \lambda/(2\kappa^2)$ and define $(u_m)_{m \in \mathbb{N}}$ by, for all $m \in \mathbb{N}$ and all $k \in K$,
$\langle u_{m+1} \mid e_k \rangle = \operatorname{prox}_{\gamma g_k}\Big( \langle u_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{u_m}(x_i) - y_i \big) \varphi_k(x_i) \Big)$,
where $\operatorname{prox}_{\gamma g_k}(\xi) = \operatorname{argmin}_{t \in \mathbb{R}} \big\{ \gamma g_k(t) + \tfrac{1}{2}(t - \xi)^2 \big\}$.
Example. If $g_k = \iota_{C_k}$, then $\operatorname{prox}_{g_k} = \pi_{C_k}$, the projection onto $C_k$. If $g_k = \sigma_{D_k}$, then
$\operatorname{prox}_{g_k}(\xi) = \operatorname{soft}_{D_k}(\xi) = \begin{cases} \xi - \inf D_k & \text{if } \xi \le \inf D_k \\ 0 & \text{if } \xi \in D_k \\ \xi - \sup D_k & \text{if } \xi \ge \sup D_k. \end{cases}$
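These two building blocks are easy to implement; a minimal Python sketch (my own code, with intervals as endpoint pairs) of the projection $\pi_C$ and of $\operatorname{soft}_D = \operatorname{prox}_{\sigma_D} = \operatorname{Id} - \pi_D$:

    def proj(xi, C):
        """pi_C(xi): projection of xi onto the closed interval C = (c_min, c_max)."""
        c_min, c_max = C
        return min(max(xi, c_min), c_max)

    def soft(xi, D):
        """soft_D(xi) = prox_{sigma_D}(xi) = xi - pi_D(xi): shrink xi towards the interval D."""
        return xi - proj(xi, D)

    # With D = [-1, 1] this is the classical soft-thresholding at level 1.
    for xi in (-2.5, -0.3, 0.0, 1.7):
        print(xi, soft(xi, (-1.0, 1.0)))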

15 the learning algorithm: how we build estimators
Example
[Figure: the support function $\sigma_D$ and the soft-thresholding operator $\operatorname{soft}_D$ for $D = [-1, 1]$ (black), $D = [-0.5, 1.5]$ (red), $D = [0.5, 2]$ (green), $D = [-2, -0.5]$ (blue); and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 2$ (red), $r = 3/2$ (orange), $r = 4/3$ (blue).]
The soft-thresholding operator with respect to a bounded interval $D_k = [\underline\omega_k, \overline\omega_k] \subset \mathbb{R}$ is
$\operatorname{soft}_{D_k}(\mu) = \begin{cases} \mu - \overline\omega_k & \text{if } \mu > \overline\omega_k \\ 0 & \text{if } \mu \in D_k \\ \mu - \underline\omega_k & \text{if } \mu < \underline\omega_k. \end{cases}$

16 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).

17 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).
- positivity and boundedness of the coefficients can be enforced.

18 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).
- positivity and boundedness of the coefficients can be enforced.
- estimators retain the grouping selection properties of the elastic-net estimators, but need not suffer from the double shrinkage phenomenon (Zou and Hastie 05).

19 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).
- positivity and boundedness of the coefficients can be enforced.
- estimators retain the grouping selection properties of the elastic-net estimators, but need not suffer from the double shrinkage phenomenon (Zou and Hastie 05).
- the regularizer gives high flexibility to control the shape of the thresholding operation, depending on the available prior information.

20 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [-1, 1]$.]

21 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [-0.5, 1.5]$.]

22 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [0.5, 2]$.]

23 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [-2, -0.5]$.]

24 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = \mathbb{R}$, $D_k = [-1, 1]$, $\eta_k = 0$.]

25 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = \mathbb{R}$, $D_k = [-1, 1]$, $\eta_k = 0.9$, $r = 4/3$.]

26 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = [-2, 2]$, $D_k = [-1, 1]$, $\eta_k = 0$.]

27 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = \left]-\infty, 6/5\right]$, $D_k = [0, 2]$, $\eta_k = 0.9$, $r = 4/3$.]

28 our contribution
We will address two problems:
- consistency of the estimators $f_{\hat u_{n,\lambda}(Z_n)}$: we look for conditions on a vanishing sequence $(\lambda_n)_{n \in \mathbb{N}}$ such that $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2(P_X)} \to 0$ (in probability or a.s.);
- computation of the estimators $\hat u_{n,\lambda}(z_n)$ (for a fixed training set $z_n$ and parameter $\lambda$) by means of an inexact and accelerated forward-backward splitting algorithm.

29 statistical analysis

30 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.

31 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.

32 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.
3. $(\forall f \in C)\quad \|f - f_C\|_{L^2}^2 \le R(f) - \inf_C R$.

33 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.
3. $(\forall f \in C)\quad \|f - f_C\|_{L^2}^2 \le R(f) - \inf_C R$.
4. $(\forall f \in C)\quad R(f) - \inf_C R \le \|f - f_C\|_{L^2} \Big[ 2 \big( \inf_C R - \inf_{L^2(P_X)} R \big)^{1/2} + \|f - f_C\|_{L^2} \Big]$.

34 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.
3. $(\forall f \in C)\quad \|f - f_C\|_{L^2}^2 \le R(f) - \inf_C R$.
4. $(\forall f \in C)\quad R(f) - \inf_C R \le \|f - f_C\|_{L^2} \Big[ 2 \big( \inf_C R - \inf_{L^2(P_X)} R \big)^{1/2} + \|f - f_C\|_{L^2} \Big]$.
It follows that $R(\hat f_n) - \inf_C R \to 0 \iff \|\hat f_n - f_C\|_{L^2(P_X)} \to 0$.
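A short sketch (my own reconstruction of the standard argument) of how Fact 4 follows from Facts 1 and 2: write $a = \|f - f^*\|_{L^2}$ and $b = \|f_C - f^*\|_{L^2}$, so that $b^2 = \inf_C R - \inf_{L^2(P_X)} R$ by Facts 1 and 2. Then, for every $f \in C$,
\begin{align*}
R(f) - \inf_C R &= a^2 - b^2 = (a - b)(a + b) \\
&\le \|f - f_C\|_{L^2}\,\big(\|f - f_C\|_{L^2} + 2b\big) \\
&= \|f - f_C\|_{L^2}\Big[ 2\big(\inf_C R - \inf_{L^2(P_X)} R\big)^{1/2} + \|f - f_C\|_{L^2} \Big],
\end{align*}
using $|a - b| \le \|f - f_C\|_{L^2}$ and $a + b \le \|f - f_C\|_{L^2} + 2b$ (triangle inequality; when $a < b$ the left-hand side is negative and the bound is trivial).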

35 splitting the error
Two minimization problems:
$u_\lambda = \operatorname{argmin}_{u \in \ell^2(K)} \big( R(f_u) + \lambda G(u) \big), \qquad \hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \big( \hat R_n(f_u) + \lambda G(u) \big).$
Then
$\|f_{\hat u_{n,\lambda}(Z_n)} - f_C\|_{L^2} \le \|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} + \|f_{u_\lambda} - f_C\|_{L^2} \le \kappa\, \|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r + \big( R(f_{u_\lambda}) - \inf_C R \big)^{1/2}.$
We have to control:
- the stochastic term $\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$
- the deterministic term $R(f_{u_\lambda}) - \inf_C R$.

36 splitting the error
Two minimization problems:
$u_\lambda = \operatorname{argmin}_{u \in \ell^2(K)} \big( R(f_u) + \lambda G(u) \big), \qquad \hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \big( \hat R_n(f_u) + \lambda G(u) \big).$
Then
$\|f_{\hat u_{n,\lambda}(Z_n)} - f_C\|_{L^2} \le \|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} + \|f_{u_\lambda} - f_C\|_{L^2} \le \kappa\, \|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r + \big( R(f_{u_\lambda}) - \inf_C R \big)^{1/2}.$
We have to control:
- the stochastic term $\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$
- the deterministic term $R(f_{u_\lambda}) - \inf_C R = R(f_{u_\lambda}) - \inf_{u \in \operatorname{dom} G} R(f_u)$, which tends to zero as $\lambda \to 0$ because of a general variational principle (Attouch 96).

37 splitting the error
Two minimization problems:
$u_\lambda = \operatorname{argmin}_{u \in \ell^2(K)} \big( R(f_u) + \lambda G(u) \big), \qquad \hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \big( \hat R_n(f_u) + \lambda G(u) \big).$
Then
$\|f_{\hat u_{n,\lambda}(Z_n)} - f_C\|_{L^2} \le \|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} + \|f_{u_\lambda} - f_C\|_{L^2} \le \kappa\, \|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r + \big( R(f_{u_\lambda}) - \inf_C R \big)^{1/2}.$
We have to control:
- the stochastic term $\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$ ?
- the deterministic term $R(f_{u_\lambda}) - \inf_C R = R(f_{u_\lambda}) - \inf_{u \in \operatorname{dom} G} R(f_u)$, which tends to zero as $\lambda \to 0$ because of a general variational principle (Attouch 96).
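The factor $\kappa$ in the bound $\|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} \le \kappa\,\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$ comes from the pointwise bound on the dictionary; a one-line reconstruction of the estimate (using $\|\cdot\|_2 \le \|\cdot\|_r$ for $r \in \left]1,2\right]$):
\[
|f_u(x) - f_v(x)| = \Big|\sum_{k \in K} \langle u - v \mid e_k\rangle\,\varphi_k(x)\Big| \le \|u - v\|_2 \Big(\sum_{k \in K} |\varphi_k(x)|^2\Big)^{1/2} \le \kappa\,\|u - v\|_2 \le \kappa\,\|u - v\|_r .
\]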

38 a representation and sensitivity theorem
Lemma ((Xu and Roach 91) Total convexity on bounded sets). Let $u_0 \in \ell^r(K)$ and let $u_0^* \in \partial G(u_0)$. Then there exists a universal constant $M > 0$ (depending only on $r$) such that
$(\forall u \in \ell^r(K)) \quad G(u) \ge G(u_0) + \langle u - u_0 \mid u_0^* \rangle + \eta M\, \frac{\|u - u_0\|_r^2}{\big(\|u_0\|_r + \|u - u_0\|_r\big)^{2-r}}.$
Theorem (Combettes, Salzo and Villa 15). There exists a measurable function $\Psi_\lambda\colon \mathcal{X} \times \mathcal{Y} \to \ell^2(K)$, of the form $\Psi_\lambda(x, y) = h_\lambda(x, y)\,\Phi(x)$, such that $\mathrm{E}_P[\Psi_\lambda] \in \lambda\,\partial G(u_\lambda)$, $\|\Psi_\lambda\|_\infty \le C\,\|u_\lambda\|_2$, $\|\Psi_\lambda\|_2 \le 2\kappa\,(R(f_{u_\lambda}))^{1/2}$, and, for every $z_n \in (\mathcal{X} \times \mathcal{Y})^n$,
$\eta M\, \frac{\|\hat u_{n,\lambda}(z_n) - u_\lambda\|_r^2}{\big(\|u_\lambda\|_r + \|\hat u_{n,\lambda}(z_n) - u_\lambda\|_r\big)^{2-r}} \le \frac{1}{\lambda}\,\Big\| \frac{1}{n} \sum_{i=1}^n \Psi_\lambda(x_i, y_i) - \mathrm{E}_P[\Psi_\lambda] \Big\|_2.$

39 the consistency theorem
$P\Big[ \Big\| \frac{1}{n} \sum_{i=1}^n \Psi_\lambda(X_i, Y_i) - \mathrm{E}[\Psi_\lambda(X, Y)] \Big\|_2 \le \delta(n, \lambda, \tau) \Big] \ge 1 - e^{-\tau}$
Theorem (Combettes, Salzo and Villa 15). Let $(\lambda_n)_{n \in \mathbb{N}}$ be a sequence in $\left]0, +\infty\right[$ converging to $0$. Then the following holds:
1. If $1/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ in probability.
2. If $\log n/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ $P$-a.s.

40 the consistency theorem
$P\Big[ \Big\| \frac{1}{n} \sum_{i=1}^n \Psi_\lambda(X_i, Y_i) - \mathrm{E}[\Psi_\lambda(X, Y)] \Big\|_2 \le \delta(n, \lambda, \tau) \Big] \ge 1 - e^{-\tau}$
Theorem (Combettes, Salzo and Villa 15). Let $(\lambda_n)_{n \in \mathbb{N}}$ be a sequence in $\left]0, +\infty\right[$ converging to $0$. Then the following holds:
1. If $1/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ in probability.
2. If $\log n/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ $P$-a.s.
3. Suppose $f_C \in \{ f_u \mid u \in \operatorname{dom} G \}$ and set $S = \operatorname{argmin}_{u \in \operatorname{dom} G} R(f_u)$. Then there exists a unique $u^\dagger \in S$ which minimizes $G$ over $S$, and $f_{u^\dagger} = f_C$. Moreover:
- if $1/(\lambda_n n^{1/2}) \to 0$, then $\|\hat u_{n,\lambda_n}(Z_n) - u^\dagger\|_r \to 0$ in probability;
- if $\log n/(\lambda_n n^{1/2}) \to 0$, then $\|\hat u_{n,\lambda_n}(Z_n) - u^\dagger\|_r \to 0$ $P$-a.s.
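As a concrete check of these conditions (my own example, not on the slide): the choice $\lambda_n = n^{-r/8}$ gives
\[
\lambda_n^{2/r}\, n^{1/2} = n^{1/4} \to +\infty, \qquad \frac{\log n}{\lambda_n^{2/r}\, n^{1/2}} = \frac{\log n}{n^{1/4}} \to 0,
\]
so items 1 and 2 both apply and consistency holds in probability and $P$-a.s.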

41 algorithm

42 the minimization problem
We aim at minimizing the functional
$u \in \ell^2(K) \mapsto \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle), \qquad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r \quad (k \in K),$
which is the sum of:
- a differentiable term $u \mapsto \hat R_n(f_u)$ with Lipschitz continuous gradient;
- a lower semicontinuous, convex and separable term $\lambda G(u) = \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle)$.
Note that we need convergence in objective values.

43 the minimization problem
We aim at minimizing the functional
$u \in \ell^2(K) \mapsto \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle), \qquad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r \quad (k \in K),$
which is the sum of:
- a differentiable term $u \mapsto \hat R_n(f_u)$ with Lipschitz continuous gradient;
- a lower semicontinuous, convex and separable term $\lambda G(u) = \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle)$.
Note that we need convergence in objective values.
This problem can be solved by forward-backward splitting methods. These algorithms have been studied for convex separable regularizers in (Combettes and Pesquet 07; Bredies and Lorenz 08).
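Behind the step-size condition $0 < \gamma < \lambda/(2\kappa^2)$ is the Lipschitz constant of the smooth part; a reconstruction (not taken verbatim from the slides) of the gradient and of the bound:
\[
\nabla \hat R_n(f_u) = \frac{2}{n} \sum_{i=1}^n \big( f_u(x_i) - y_i \big)\, \Phi(x_i),
\qquad
\big\| \nabla \hat R_n(f_u) - \nabla \hat R_n(f_v) \big\| \le \frac{2}{n} \sum_{i=1}^n \|\Phi(x_i)\|^2\, \|u - v\| \le 2\kappa^2\, \|u - v\|,
\]
so that the effective step $\gamma/\lambda$ in the forward step stays below $1/(2\kappa^2)$, the inverse of this Lipschitz constant.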

44 an inexact accelerated fbs algorithm
Fix $u_0 = v_0 \in \ell^2(K)$, $\tau_0 = 1$, and define, for every $m \in \mathbb{N}$:
for all $k \in K$:
$\quad \langle w_m \mid e_k \rangle = \langle v_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{v_m}(x_i) - y_i \big) \varphi_k(x_i)$
$\quad \langle u_{m+1} \mid e_k \rangle = \operatorname{prox}_{\gamma g_k}\big( \langle w_m \mid e_k \rangle \big)$
$\tau_{m+1} = \dfrac{1 + \sqrt{1 + 4\tau_m^2}}{2}, \qquad v_{m+1} = u_{m+1} + \dfrac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m)$
(Nesterov 83; Güler 92; Beck and Teboulle 09; Villa, Salzo et al. 13)

45 an inexact accelerated fbs algorithm
Fix $u_0 = v_0 \in \ell^2(K)$, $\tau_0 = 1$, and define, for every $m \in \mathbb{N}$:
for all $k \in K$:
$\quad \langle w_m \mid e_k \rangle = \langle v_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{v_m}(x_i) - y_i \big) \varphi_k(x_i)$
$\quad \langle u_{m+1} \mid e_k \rangle = \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle) \big) \Big)$
$\tau_{m+1} = \dfrac{1 + \sqrt{1 + 4\tau_m^2}}{2}, \qquad v_{m+1} = u_{m+1} + \dfrac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m)$
(Nesterov 83; Güler 92; Beck and Teboulle 09; Villa, Salzo et al. 13)

46 an inexact accelerated fbs algorithm
Fix $u_0 = v_0 \in \ell^2(K)$, $\tau_0 = 1$, and define, for every $m \in \mathbb{N}$:
for all $k \in K$:
$\quad \langle w_m \mid e_k \rangle = \langle v_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{v_m}(x_i) - y_i \big) \varphi_k(x_i)$
$\quad \langle u_{m+1} \mid e_k \rangle = \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle) \big) + \alpha_{m,k} \Big)$
$\tau_{m+1} = \dfrac{1 + \sqrt{1 + 4\tau_m^2}}{2}, \qquad v_{m+1} = u_{m+1} + \dfrac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m)$
(Nesterov 83; Güler 92; Beck and Teboulle 09; Villa, Salzo et al. 13)

47 an inexact accelerated fbs algorithm
where:
- $0 < \gamma < \lambda/(2\kappa^2)$;
- $(\alpha_{m,k})_{(m,k) \in \mathbb{N} \times K}$ is a double sequence in $\mathbb{R}$ that accounts for errors in the computation of the proximity operator $\operatorname{prox}_{\gamma \eta_k |\cdot|^r}$ and, for some $c > 0$, $p \in \left]3/2, +\infty\right[$ and $(\nu_k)_{k \in K} \in \ell^1(K)$, satisfies
$|\alpha_{m,k}| \le \min\left\{ \dfrac{c\, m^{-2p}\, \nu_k}{2\gamma \eta_k r\, (|\langle w_m \mid e_k \rangle| + 1)^{r-1}},\ \dfrac{|\operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle)|}{1 + r \gamma \eta_k},\ \left( \dfrac{|\operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle)|}{1 + r \gamma \eta_k} \right)^{1/(r-1)} \right\}.$
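A minimal Python sketch of this accelerated forward-backward loop on a finite dictionary (the data, the dictionary, and the choice $r = 2$, for which the inner prox has a closed form so that the errors $\alpha_{m,k}$ are zero, are my own illustrative assumptions, not the authors' code; for general $r$ the inner prox is computed by solving a scalar equation, see the proximity operator of powers below).

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 200, 30
    Phi = rng.standard_normal((n, K)) / np.sqrt(K)    # row i plays the role of Phi(x_i)
    y = Phi[:, :5] @ np.array([2.0, -1.5, 1.0, 0.5, -2.0]) + 0.1 * rng.standard_normal(n)

    lam, eta = 0.1, 0.9
    kappa2 = np.max(np.sum(Phi ** 2, axis=1))
    gamma = 0.9 * lam / (2 * kappa2)                  # step size: 0 < gamma < lambda / (2 kappa^2)
    C, D = (-5.0, 5.0), (-1.0, 1.0)                   # the same C_k and D_k for every k

    soft = lambda w, lo, hi: w - np.clip(w, lo, hi)   # soft_{[lo,hi]} = Id - projection
    prox_pow2 = lambda w, tau: w / (1 + 2 * tau)      # exact prox of tau*|.|^2 (so alpha_{m,k} = 0)

    u = v = np.zeros(K)
    t = 1.0
    for m in range(500):
        grad = (2.0 / n) * Phi.T @ (Phi @ v - y)      # gradient of the empirical risk at v
        w = v - (gamma / lam) * grad                  # forward step <w_m | e_k>
        u_new = np.clip(prox_pow2(soft(w, gamma * D[0], gamma * D[1]), gamma * eta), C[0], C[1])
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2     # Nesterov/FISTA extrapolation parameter
        v = u_new + (t - 1) / t_new * (u_new - u)
        u, t = u_new, t_new

    print(np.round(u, 2))                             # most coefficients are driven exactly to zero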

48 proximal thresholding: the role of the exponent r
[Figure: soft thresholding (olive) and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 2$ (red).]

49 proximal thresholding: the role of the exponent r
[Figure: soft thresholding (olive) and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 3/2$ (purple).]

50 proximal thresholding: the role of the exponent r
[Figure: soft thresholding (olive) and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 4/3$ (blue).]

51 the proximity operator of powers
Let $r \in \left]1, 2\right]$, let $\tau > 0$, let $\mu \in \mathbb{R}$, and consider $\operatorname{prox}_{\tau |\cdot|^r}\colon \mathbb{R} \to \mathbb{R}$. Then
$\operatorname{prox}_{\tau |\cdot|^r}(\mu) = \operatorname{sign}(\mu)\, \xi, \quad \text{where } \xi \ge 0 \text{ and } \xi + r\tau \xi^{r-1} = |\mu|.$
There are several exponents $r$ for which the equation can be explicitly solved: $r \in \{3/2, 4/3, 5/4\}$. However, in general this is not possible and one needs to rely on iterative methods, e.g. bisection.
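A minimal bisection sketch for this scalar equation (my own illustrative code, not the authors'):

    def prox_power(mu, tau, r, tol=1e-12):
        """prox of tau*|.|^r at mu: returns sign(mu)*xi with xi >= 0 solving xi + r*tau*xi**(r-1) = |mu|."""
        a, b = 0.0, abs(mu)                  # the left-hand side is increasing in xi and exceeds |mu| at xi = |mu|
        while b - a > tol:
            xi = 0.5 * (a + b)
            if xi + r * tau * xi ** (r - 1) < abs(mu):
                a = xi
            else:
                b = xi
        return (1.0 if mu >= 0 else -1.0) * 0.5 * (a + b)

    # Sanity check against the r = 2 closed form mu / (1 + 2*tau):
    print(prox_power(3.0, 0.5, 2.0), 3.0 / (1 + 2 * 0.5))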

52 the proximity operator of powers
Let $r \in \left]1, 2\right]$, let $\tau > 0$, let $\mu \in \mathbb{R}$, and consider $\operatorname{prox}_{\tau |\cdot|^r}\colon \mathbb{R} \to \mathbb{R}$. Then
$\operatorname{prox}_{\tau |\cdot|^r}(\mu) = \operatorname{sign}(\mu)\, \xi, \quad \text{where } \xi \ge 0 \text{ and } \xi + r\tau \xi^{r-1} = |\mu|.$
There are several exponents $r$ for which the equation can be explicitly solved: $r \in \{3/2, 4/3, 5/4\}$. However, in general this is not possible and one needs to rely on iterative methods, e.g. bisection.
The following decomposition rule holds (Combettes and Pesquet 07):
$g_k = \iota_{C_k} + \eta_k |\cdot|^r + \sigma_{D_k} \;\Longrightarrow\; \operatorname{prox}_{\gamma g_k}(\mu) = \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\mu) \big) \Big).$
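Putting the pieces together, a sketch of $\operatorname{prox}_{\gamma g_k}$ via this decomposition (again illustrative; the power prox is given in closed form for $r = 2$ so that the snippet stays self-contained, but any scalar solver such as the bisection above can be plugged in):

    import numpy as np

    def prox_g(mu, gamma, C, D, eta, prox_power_r):
        """prox_{gamma*g}(mu) for g = iota_C + sigma_D + eta*|.|^r, via pi_C o prox_{gamma*eta*|.|^r} o soft_{gamma*D}."""
        s = mu - np.clip(mu, gamma * D[0], gamma * D[1])   # soft-thresholding w.r.t. gamma*D
        t = prox_power_r(s, gamma * eta)                   # proximity operator of the power term
        return float(np.clip(t, C[0], C[1]))               # projection onto the constraint interval C

    prox_square = lambda w, tau: w / (1 + 2 * tau)         # exact prox of tau*|.|^2, i.e. the case r = 2
    print(prox_g(2.0, 0.5, C=(-1.2, 1.2), D=(-1.0, 1.0), eta=0.9, prox_power_r=prox_square))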

53 the proximity operator of powers
Let $r \in \left]1, 2\right]$, let $\tau > 0$, let $\mu \in \mathbb{R}$, and consider $\operatorname{prox}_{\tau |\cdot|^r}\colon \mathbb{R} \to \mathbb{R}$. Then
$\operatorname{prox}_{\tau |\cdot|^r}(\mu) = \operatorname{sign}(\mu)\, \xi, \quad \text{where } \xi \ge 0 \text{ and } \xi + r\tau \xi^{r-1} = |\mu|.$
There are several exponents $r$ for which the equation can be explicitly solved: $r \in \{3/2, 4/3, 5/4\}$. However, in general this is not possible and one needs to rely on iterative methods, e.g. bisection.
What becomes of the decomposition rule (Combettes and Pesquet 07) when the inner proximity operator is computed with an error $\alpha_{m,k}$?
$g_k = \iota_{C_k} + \eta_k |\cdot|^r + \sigma_{D_k}\colon \qquad \operatorname{prox}_{\gamma g_k}(\mu) \overset{?}{\approx} \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\mu) \big) + \alpha_{m,k} \Big)$
The question is how to adjust the decomposition rule in the presence of errors.

54 a notion of inexact proximity operator
$u \simeq_\delta \operatorname{prox}_{\eta \varphi}(w) \;\stackrel{\text{def}}{\Longleftrightarrow}\; u \in \frac{\delta^2}{2\eta}\text{-}\operatorname{argmin}_{\xi \in \mathcal{H}} \Big\{ \varphi(\xi) + \frac{1}{2\eta} \|\xi - w\|_{\mathcal{H}}^2 \Big\}.$

55 a notion of inexact proximity operator
$u \simeq_\delta \operatorname{prox}_{\eta \varphi}(w) \;\stackrel{\text{def}}{\Longleftrightarrow}\; u \in \frac{\delta^2}{2\eta}\text{-}\operatorname{argmin}_{\xi \in \mathcal{H}} \Big\{ \varphi(\xi) + \frac{1}{2\eta} \|\xi - w\|_{\mathcal{H}}^2 \Big\}.$
Lemma. Let $\psi, \phi, \sigma \in \Gamma_0^+(\mathbb{R})$ with $\psi(0) = 0$ and $\sigma$ positively homogeneous. Then
- $s = \operatorname{prox}_{\vartheta |\cdot|^r}(\mu) + \alpha \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\vartheta |\cdot|^r}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\psi\big(\operatorname{prox}_\sigma(\mu)\big) \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\psi + \sigma}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\phi(\mu)$ and $p = \pi_C(s) \;\Longrightarrow\; p \simeq_\delta \operatorname{prox}_{\iota_C + \phi}(\mu)$.

56 a notion of inexact proximity operator
$u \simeq_\delta \operatorname{prox}_{\eta \varphi}(w) \;\stackrel{\text{def}}{\Longleftrightarrow}\; u \in \frac{\delta^2}{2\eta}\text{-}\operatorname{argmin}_{\xi \in \mathcal{H}} \Big\{ \varphi(\xi) + \frac{1}{2\eta} \|\xi - w\|_{\mathcal{H}}^2 \Big\}.$
Lemma. Let $\psi, \phi, \sigma \in \Gamma_0^+(\mathbb{R})$ with $\psi(0) = 0$ and $\sigma$ positively homogeneous. Then
- $s = \operatorname{prox}_{\vartheta |\cdot|^r}(\mu) + \alpha \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\vartheta |\cdot|^r}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\psi\big(\operatorname{prox}_\sigma(\mu)\big) \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\psi + \sigma}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\phi(\mu)$ and $p = \pi_C(s) \;\Longrightarrow\; p \simeq_\delta \operatorname{prox}_{\iota_C + \phi}(\mu)$.
Applied to $g_k = \iota_{C_k} + \eta_k |\cdot|^r + \sigma_{D_k}$:
$s = \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\mu) \big) + \alpha_{m,k}$ and $p = \pi_{C_k}(s) \;\Longrightarrow\; p \simeq_{\delta_{m,k}} \operatorname{prox}_{\gamma g_k}(\mu).$

57 the theorem of convergence
Theorem. Let $\hat F_n(u) = \frac{1}{n} \sum_{i=1}^n (f_u(x_i) - y_i)^2$ and $G(u) = \sum_{k \in K} g_k(\langle u \mid e_k \rangle)$. Then there exists a vanishing sequence $(\delta_m)_{m \in \mathbb{N}}$, with $\delta_m \le C m^{-p}$, such that
$u_{m+1} \simeq_{\delta_m} \operatorname{prox}_{\gamma G}\Big( v_m - \frac{\gamma}{\lambda} \nabla \hat F_n(v_m) \Big), \qquad v_{m+1} = u_{m+1} + \frac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m).$
Hence it follows from (Villa, Salzo et al. 13) that the sequence $(u_m)_{m \in \mathbb{N}}$ is minimizing for $\hat F_n + \lambda G$ and $\|u_m - \hat u_{n,\lambda}(z_n)\|_r \to 0$ as $m \to +\infty$.

58 further developments

59 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.

60 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.
2. The constraint set is $C = \{ f \in \mathcal{M}(\mathcal{X}, \mathcal{Y}) \mid (\forall x \in \mathcal{X})\ f(x) \in C(x) \}$, where, for every $x \in \mathcal{X}$, $C(x) \subset \mathcal{Y}$ is closed and convex.

61 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.
2. The constraint set is $C = \{ f \in \mathcal{M}(\mathcal{X}, \mathcal{Y}) \mid (\forall x \in \mathcal{X})\ f(x) \in C(x) \}$, where, for every $x \in \mathcal{X}$, $C(x) \subset \mathcal{Y}$ is closed and convex.
3. The risk is based on a general loss function: $R(f) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(x, y, f(x))\, dP(x, y)$.

62 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.
2. The constraint set is $C = \{ f \in \mathcal{M}(\mathcal{X}, \mathcal{Y}) \mid (\forall x \in \mathcal{X})\ f(x) \in C(x) \}$, where, for every $x \in \mathcal{X}$, $C(x) \subset \mathcal{Y}$ is closed and convex.
3. The risk is based on a general loss function: $R(f) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(x, y, f(x))\, dP(x, y)$.
4. $G\colon F \to \left]-\infty, +\infty\right]$ is totally convex, meaning that, for every $t > 0$ and every $u \in \operatorname{dom} G$, $0 < \psi_G(u; t) = \inf\{ G(v) - G(u) - G'(u; v - u) \mid v \in \operatorname{dom} G,\ \|u - v\| = t \}$.

63 a more general framework
We characterized the property of universality w.r.t. $C$, i.e., when $\inf_C R = \inf_{C \cap \operatorname{ran}(A)} R$.
We proved a representer and sensitivity theorem: there exists a measurable function $\Psi_\lambda\colon \mathcal{X} \times \mathcal{Y} \to F$ such that $\mathrm{E}_P[\Psi_\lambda] \in \lambda\, \partial G(u_\lambda)$ and, for every $z_n \in (\mathcal{X} \times \mathcal{Y})^n$,
$\dfrac{\psi_G\big(u_\lambda, \|\hat u_{n,\lambda}(z_n) - u_\lambda\|\big)}{\|\hat u_{n,\lambda}(z_n) - u_\lambda\|} \le \dfrac{1}{\lambda} \Big\| \dfrac{1}{n} \sum_{i=1}^n \Psi_\lambda(x_i, y_i) - \mathrm{E}_P[\Psi_\lambda] \Big\|.$
We found conditions on the sequence $(\lambda_n)_{n \in \mathbb{N}}$ ensuring consistency, which depend on the modulus of total convexity of $G$, the Rademacher type of $F$, and the local Lipschitz constants of the loss function.

64 an alternative approach
When restricted to real-valued functions, an alternative approach is to use Rademacher complexities.
1. In this case the statistical analysis achieves sharper results when dealing with general loss functions.
2. Open problem: Rademacher complexity for (reproducing kernel) classes of functions taking values in infinite-dimensional spaces. A contraction principle is missing in that scenario.

65 Thank you

66 some references
P. L. Combettes and J.-C. Pesquet, Proximal thresholding algorithm for minimization over orthonormal bases, SIAM J. Optim., vol. 18, 2007.
P. L. Combettes, S. Salzo, and S. Villa, Consistency of regularized learning schemes in Banach spaces, arXiv, 2014.
P. L. Combettes, S. Salzo, and S. Villa, Consistent learning by composite proximal thresholding, arXiv, 2015.
S. Villa, S. Salzo, L. Baldassarre, and A. Verri, Accelerated and inexact forward-backward algorithms, SIAM J. Optim., 2013.
B. Xu and G. F. Roach, Characteristic inequalities of uniformly convex and uniformly smooth Banach spaces, Journal of Mathematical Analysis and Applications, vol. 157, 1991.
