consistent learning by composite proximal thresholding


1 consistent learning by composite proximal thresholding
Saverio Salzo, Università degli Studi di Genova
Optimization in Machine Learning, Vision and Image Processing, Université Paul Sabatier, Toulouse, 6-7 October

2 outline
1. Introduction: the learning problem, constructing the estimators, our contribution
2. Statistical analysis: proving consistency (a sketch), the main result
3. Algorithm: description, inexact proximity operators, the main result
4. Further developments

3 introduction

4 nonparametric regression with random design
[Figure (ISL, Figure 2.3): income as a function of years of education and seniority; the blue surface is the true underlying relationship, known since the data are simulated, and the red dots are the observed values for 30 individuals.]
Input space: $\mathcal{X} = \mathrm{range}(\text{years of education}) \times \mathrm{range}(\text{seniority}) \subset \mathbb{R}^2$. Output space: $\mathcal{Y} = \mathrm{range}(\text{income}) \subset \mathbb{R}$.
The training set $z_n = (x_i, y_i)_{1 \le i \le n} \in (\mathcal{X} \times \mathcal{Y})^n$ consists of i.i.d. realizations of a random variable $(X, Y)$ distributed according to $P$.

5 nonparametric regression with random design
The distribution $P$ describes a "fuzzy" function: to each input $x \in \mathcal{X}$ it associates outputs $y \in \mathcal{Y}$ according to the conditional distribution $P(y \mid x)$.
The goal is to estimate a function $f\colon \mathcal{X} \to \mathcal{Y}$ such that, for every $x \in \mathcal{X}$, $f(x)$ is chosen according to $P(y \mid x)$. For instance one could take $f = \mathrm{E}(Y \mid X)$, the conditional mean.

6 the formal setting
We are given:
- an input space $(\mathcal{X}, \mathfrak{A})$ and an output space $\mathcal{Y} \subset \mathbb{R}$, a bounded interval, together with a random variable $(X, Y)$ with values in $\mathcal{X} \times \mathcal{Y}$ and distribution $P$;
- a sequence $(X_i, Y_i)_{i \in \mathbb{N}}$ of i.i.d. random variables taking values in $\mathcal{X} \times \mathcal{Y}$ with common distribution $P$;
- $n$ observations $z_n = (x_i, y_i)_{1 \le i \le n}$ of the random variables $Z_n = (X_i, Y_i)_{1 \le i \le n}$, which constitute the training set;
- a (large) closed convex constraint set $C \subset L^2(P_X)$ encoding prior information.

7 the learning problem
The goal is to approach $f_C$, the regression function on $C$, which is the minimizer over the constraint set $C$ of the $L^2$-risk
$R\colon L^2(P_X) \to \mathbb{R}_+$, $\quad R(f) = \int_{\mathcal{X} \times \mathcal{Y}} |y - f(x)|^2 \, dP(x,y) = \mathrm{E}|Y - f(X)|^2$,
without any knowledge of $P$, but using only training sets $z_n$.

8 the learning problem
The goal is to approach $f_C$, the regression function on $C$, which is the minimizer over the constraint set $C$ of the $L^2$-risk
$R\colon L^2(P_X) \to \mathbb{R}_+$, $\quad R(f) = \int_{\mathcal{X} \times \mathcal{Y}} |y - f(x)|^2 \, dP(x,y) = \mathrm{E}|Y - f(X)|^2$,
without any knowledge of $P$, but using only training sets $z_n$.
A learning algorithm is a map $\Lambda\colon \bigcup_{n \in \mathbb{N}} (\mathcal{X} \times \mathcal{Y})^n \to C$ such that the estimators $\Lambda(Z_n)$ asymptotically converge to $f_C$, meaning that $\|\Lambda(Z_n) - f_C\|_{L^2(P_X)} \to 0$ (in probability or a.s.) as $n \to +\infty$.

9 the learning algorithm: how we build estimators
We adopt a linear model: $f_u = \sum_{k \in K} \langle u \mid e_k \rangle\, \varphi_k$ (pointwise), $u \in \ell^2(K)$, where:
- $(e_k)_{k \in K}$ is the canonical basis of $\ell^2(K)$ and $(\varphi_k)_{k \in K}$ is a countable dictionary of bounded functions in $\mathcal{M}(\mathcal{X}, \mathbb{R})$ such that $\sum_{k \in K} |\varphi_k(x)|^2 \le \kappa^2$ for every $x \in \mathcal{X}$;
- the coefficients $\langle u \mid e_k \rangle$ are constrained in given intervals $C_k \subset \mathbb{R}$, thus defining the constraint set $C = \{ f_u \mid u \in \ell^2(K),\ (\forall k \in K)\ \langle u \mid e_k \rangle \in C_k \}$.
Our estimators will be in $C$! Moreover, $H_K = \{ f_u \mid u \in \ell^2(K) \}$ is a RKHS with feature map $\Phi\colon \mathcal{X} \to \ell^2(K)\colon x \mapsto (\varphi_k(x))_{k \in K}$ and kernel $K(x, x') = \langle \Phi(x) \mid \Phi(x') \rangle$.
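To make the linear model concrete, here is a minimal Python sketch with a hypothetical finite dictionary of scaled cosine features on $\mathcal{X} = [0,1]$ (the names Phi, f_u, kernel and the truncation K = 50 are my own choices, not from the slides); the geometric scaling makes the pointwise bound $\sum_k |\varphi_k(x)|^2 \le \kappa^2$ immediate.

    import numpy as np

    K = 50                                   # truncation of the countable index set (illustrative)
    kappa2 = 2.0                             # sum_k 2**(-k) <= 2, so the pointwise bound holds with kappa^2 = 2

    def Phi(x):
        """Feature map Phi(x) = (phi_k(x))_k with phi_k(x) = 2**(-k/2) * cos(pi*k*x)."""
        k = np.arange(K)
        return 2.0 ** (-k / 2) * np.cos(np.pi * k * x)

    def f_u(u, x):
        """Linear model f_u(x) = sum_k <u | e_k> phi_k(x) = <u, Phi(x)>."""
        return Phi(x) @ u

    def kernel(x, xp):
        """Kernel of the RKHS spanned by the dictionary: K(x, x') = <Phi(x) | Phi(x')>."""
        return Phi(x) @ Phi(xp)

    u = np.zeros(K); u[:3] = [1.0, -0.5, 0.25]      # a sparse coefficient sequence
    assert np.sum(Phi(0.3) ** 2) <= kappa2          # pointwise bound on the dictionary
    print(f_u(u, 0.3), kernel(0.3, 0.7))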

10 the learning algorithm: how we build estimators
We solve a composite regularized least squares regression problem
$\hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle) \Big\}$
$g_k\colon \mathbb{R} \to \mathbb{R}$, $\quad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r \quad (k \in K)$,
where $0 \in C_k$, $C_k, D_k \subset \mathbb{R}$ are closed intervals, $\eta_k \ge \eta > 0$, and $r \in \left]1, 2\right]$.
The learning algorithm: $z_n \mapsto f_{\hat u_{n,\lambda}(z_n)}$.
The indicator function of $C_k$ and the support function of $D_k$ are
$\iota_{C_k}(\xi) = \begin{cases} 0 & \text{if } \xi \in C_k \\ +\infty & \text{if } \xi \notin C_k \end{cases} \qquad \sigma_{D_k}(\xi) = \begin{cases} \xi \sup D_k & \text{if } \xi \ge 0 \\ \xi \inf D_k & \text{if } \xi < 0 \end{cases}$
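For concreteness, a minimal Python sketch (function names and the interval encoding as endpoint pairs are my own) of the per-coordinate penalty $g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k|\cdot|^r$:

    import numpy as np

    def indicator(xi, C):
        """iota_C(xi): 0 on the closed interval C = (c_min, c_max), +inf outside."""
        c_min, c_max = C
        return 0.0 if c_min <= xi <= c_max else np.inf

    def support(xi, D):
        """sigma_D(xi): xi*sup(D) if xi >= 0, xi*inf(D) if xi < 0, for D = (d_min, d_max)."""
        d_min, d_max = D
        return xi * d_max if xi >= 0 else xi * d_min

    def g_k(xi, C=(-np.inf, np.inf), D=(-1.0, 1.0), eta=0.9, r=4/3):
        """Composite penalty g_k = iota_C + sigma_D + eta*|.|^r."""
        return indicator(xi, C) + support(xi, D) + eta * abs(xi) ** r

    print(g_k(0.5))                      # finite value: 0.5 + 0.9*0.5**(4/3)
    print(g_k(-3.0, C=(-2.0, 2.0)))      # inf: the constraint interval C_k is violated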

11 the learning algorithm: how we build estimators
We solve a composite regularized least squares regression problem
$\hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle) \Big\}$
$g_k\colon \mathbb{R} \to \mathbb{R}$, $g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r$ ($k \in K$), where $0 \in C_k$, $C_k, D_k \subset \mathbb{R}$ are closed intervals, $\eta_k \ge \eta > 0$, and $r \in \left]1, 2\right]$.
The learning algorithm: $z_n \mapsto f_{\hat u_{n,\lambda}(z_n)}$.
The empirical risk and the regularizer are
$\hat R_n(f_u) = \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2, \qquad G(u) = \sum_{k \in K} g_k(\langle u \mid e_k \rangle).$
This is a regularized empirical risk minimization learning algorithm.

12 the learning algorithm: how we build estimators
We solve a composite regularized least squares regression problem
$\hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle) \Big\}$
$g_k\colon \mathbb{R} \to \mathbb{R}$, $g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r$ ($k \in K$), where $0 \in C_k$, $C_k, D_k \subset \mathbb{R}$ are closed intervals, $\eta_k \ge \eta > 0$, and $r \in \left]1, 2\right]$.
The learning algorithm: $z_n \mapsto f_{\hat u_{n,\lambda}(z_n)}$.
This model encompasses:
- ridge regression (Hoerl and Kennard, 70): $g_k = |\cdot|^2$
- elastic net (Zou, Hastie, 05; De Mol et al., 09): $g_k = \omega_k |\cdot| + \eta |\cdot|^2$
- bridge regression (Frank and Friedman 93): $g_k = |\cdot|^r$, $1 < r < 2$.
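A possible reading of these special cases in terms of the parameters $(C_k, D_k, \eta_k, r)$ above (my own bookkeeping, not spelled out on the slide): ridge corresponds to $C_k = \mathbb{R}$, $D_k = \{0\}$ (so $\sigma_{D_k} = 0$) and $r = 2$; the elastic net to $C_k = \mathbb{R}$, $D_k = [-\omega_k, \omega_k]$ (so $\sigma_{D_k} = \omega_k|\cdot|$), $\eta_k = \eta$ and $r = 2$; bridge regression to $C_k = \mathbb{R}$, $D_k = \{0\}$ and $1 < r < 2$.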

13 the learning algorithm: how we build estimators
We minimize the regularized empirical risk by proximal gradient algorithms. Let $0 < \gamma < \lambda/(2\kappa^2)$ and define $(u_m)_{m \in \mathbb{N}}$ by, for all $m \in \mathbb{N}$ and all $k \in K$,
$\langle u_{m+1} \mid e_k \rangle = \operatorname{prox}_{\gamma g_k}\Big( \langle u_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{u_m}(x_i) - y_i \big) \varphi_k(x_i) \Big)$,
where $\operatorname{prox}_{\gamma g_k}(\xi) = \operatorname{argmin}_{t \in \mathbb{R}} \big\{ \gamma g_k(t) + \tfrac{1}{2}(t - \xi)^2 \big\}$.

14 the learning algorithm: how we build estimators
We minimize the regularized empirical risk by proximal gradient algorithms. Let $0 < \gamma < \lambda/(2\kappa^2)$ and define $(u_m)_{m \in \mathbb{N}}$ by, for all $m \in \mathbb{N}$ and all $k \in K$,
$\langle u_{m+1} \mid e_k \rangle = \operatorname{prox}_{\gamma g_k}\Big( \langle u_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{u_m}(x_i) - y_i \big) \varphi_k(x_i) \Big)$,
where $\operatorname{prox}_{\gamma g_k}(\xi) = \operatorname{argmin}_{t \in \mathbb{R}} \big\{ \gamma g_k(t) + \tfrac{1}{2}(t - \xi)^2 \big\}$.
Example. If $g_k = \iota_{C_k}$, then $\operatorname{prox}_{g_k} = \pi_{C_k}$, the projection onto $C_k$. If $g_k = \sigma_{D_k}$, then
$\operatorname{prox}_{g_k}(\xi) = \operatorname{soft}_{D_k}(\xi) = \begin{cases} \xi - \inf D_k & \text{if } \xi \le \inf D_k \\ 0 & \text{if } \xi \in D_k \\ \xi - \sup D_k & \text{if } \xi \ge \sup D_k. \end{cases}$
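These two building blocks are easy to implement; a minimal Python sketch (my own code, with intervals as endpoint pairs) of the projection $\pi_C$ and of $\operatorname{soft}_D = \operatorname{prox}_{\sigma_D} = \operatorname{Id} - \pi_D$:

    def proj(xi, C):
        """pi_C(xi): projection of xi onto the closed interval C = (c_min, c_max)."""
        c_min, c_max = C
        return min(max(xi, c_min), c_max)

    def soft(xi, D):
        """soft_D(xi) = prox_{sigma_D}(xi) = xi - pi_D(xi): shrink xi towards the interval D."""
        return xi - proj(xi, D)

    # With D = [-1, 1] this is the classical soft-thresholding at level 1.
    for xi in (-2.5, -0.3, 0.0, 1.7):
        print(xi, soft(xi, (-1.0, 1.0)))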

15 the learning algorithm: how we build estimators
Example
[Figure: the support function $\sigma_D$ and the soft-thresholding operator $\operatorname{soft}_D$ for $D = [-1, 1]$ (black), $D = [-0.5, 1.5]$ (red), $D = [0.5, 2]$ (green), $D = [-2, -0.5]$ (blue); and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 2$ (red), $r = 3/2$ (orange), $r = 4/3$ (blue).]
The soft-thresholding operator with respect to a bounded interval $D_k = [\underline\omega_k, \overline\omega_k] \subset \mathbb{R}$ is
$\operatorname{soft}_{D_k}(\mu) = \begin{cases} \mu - \overline\omega_k & \text{if } \mu > \overline\omega_k \\ 0 & \text{if } \mu \in D_k \\ \mu - \underline\omega_k & \text{if } \mu < \underline\omega_k. \end{cases}$

16 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).

17 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).
- positivity and boundedness of the coefficients can be enforced.

18 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).
- positivity and boundedness of the coefficients can be enforced.
- estimators retain the grouping selection properties of the elastic-net estimators, but need not suffer from the double shrinkage phenomenon (Zou and Hastie 05).

19 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- estimators have finite support (sparsity is promoted).
- positivity and boundedness of the coefficients can be enforced.
- estimators retain the grouping selection properties of the elastic-net estimators, but need not suffer from the double shrinkage phenomenon (Zou and Hastie 05).
- the regularizer gives high flexibility to control the shape of the thresholding operation, depending on the available prior information.

20 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [-1, 1]$.]

21 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [-0.5, 1.5]$.]

22 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [0.5, 2]$.]

23 the learning algorithm: motivation
[Figure: the support function $\sigma_D(\xi)$ for $D = [-2, -0.5]$.]

24 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = \mathbb{R}$, $D_k = [-1, 1]$, $\eta_k = 0$.]

25 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = \mathbb{R}$, $D_k = [-1, 1]$, $\eta_k = 0.9$, $r = 4/3$.]

26 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = [-2, 2]$, $D_k = [-1, 1]$, $\eta_k = 0$.]

27 the learning algorithm: motivation
[Figure: the proximal thresholder $\operatorname{prox}_{g_k}(\xi)$ for $C_k = \left]-\infty, 6/5\right]$, $D_k = [0, 2]$, $\eta_k = 0.9$, $r = 4/3$.]

28 our contribution
We will address two problems:
- consistency of the estimators $f_{\hat u_{n,\lambda}(Z_n)}$: we look for conditions on a vanishing sequence $(\lambda_n)_{n \in \mathbb{N}}$ such that $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2(P_X)} \to 0$ (in probability or a.s.);
- computation of the estimators $\hat u_{n,\lambda}(z_n)$ (for a fixed training set $z_n$ and parameter $\lambda$) by means of an inexact and accelerated forward-backward splitting algorithm.

29 statistical analysis

30 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.

31 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.

32 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.
3. $(\forall f \in C)\quad \|f - f_C\|_{L^2}^2 \le R(f) - \inf_C R$.

33 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.
3. $(\forall f \in C)\quad \|f - f_C\|_{L^2}^2 \le R(f) - \inf_C R$.
4. $(\forall f \in C)\quad R(f) - \inf_C R \le \|f - f_C\|_{L^2} \Big[ 2 \big( \inf_C R - \inf_{L^2(P_X)} R \big)^{1/2} + \|f - f_C\|_{L^2} \Big]$.

34 the regression function w.r.t. the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$.
Facts
1. Setting $f^* = \mathrm{E}(Y \mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.
3. $(\forall f \in C)\quad \|f - f_C\|_{L^2}^2 \le R(f) - \inf_C R$.
4. $(\forall f \in C)\quad R(f) - \inf_C R \le \|f - f_C\|_{L^2} \Big[ 2 \big( \inf_C R - \inf_{L^2(P_X)} R \big)^{1/2} + \|f - f_C\|_{L^2} \Big]$.
It follows that $R(\hat f_n) - \inf_C R \to 0 \iff \|\hat f_n - f_C\|_{L^2(P_X)} \to 0$.
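A short sketch (my own reconstruction of the standard argument) of how Fact 4 follows from Facts 1 and 2: write $a = \|f - f^*\|_{L^2}$ and $b = \|f_C - f^*\|_{L^2}$, so that $b^2 = \inf_C R - \inf_{L^2(P_X)} R$ by Facts 1 and 2. Then, for every $f \in C$,
\begin{align*}
R(f) - \inf_C R &= a^2 - b^2 = (a - b)(a + b) \\
&\le \|f - f_C\|_{L^2}\,\big(\|f - f_C\|_{L^2} + 2b\big) \\
&= \|f - f_C\|_{L^2}\Big[ 2\big(\inf_C R - \inf_{L^2(P_X)} R\big)^{1/2} + \|f - f_C\|_{L^2} \Big],
\end{align*}
using $|a - b| \le \|f - f_C\|_{L^2}$ and $a + b \le \|f - f_C\|_{L^2} + 2b$ (triangle inequality; when $a < b$ the left-hand side is negative and the bound is trivial).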

35 splitting the error
Two minimization problems:
$u_\lambda = \operatorname{argmin}_{u \in \ell^2(K)} \big( R(f_u) + \lambda G(u) \big), \qquad \hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \big( \hat R_n(f_u) + \lambda G(u) \big).$
Then
$\|f_{\hat u_{n,\lambda}(Z_n)} - f_C\|_{L^2} \le \|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} + \|f_{u_\lambda} - f_C\|_{L^2} \le \kappa\, \|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r + \big( R(f_{u_\lambda}) - \inf_C R \big)^{1/2}.$
We have to control:
- the stochastic term $\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$
- the deterministic term $R(f_{u_\lambda}) - \inf_C R$.

36 splitting the error
Two minimization problems:
$u_\lambda = \operatorname{argmin}_{u \in \ell^2(K)} \big( R(f_u) + \lambda G(u) \big), \qquad \hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \big( \hat R_n(f_u) + \lambda G(u) \big).$
Then
$\|f_{\hat u_{n,\lambda}(Z_n)} - f_C\|_{L^2} \le \|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} + \|f_{u_\lambda} - f_C\|_{L^2} \le \kappa\, \|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r + \big( R(f_{u_\lambda}) - \inf_C R \big)^{1/2}.$
We have to control:
- the stochastic term $\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$
- the deterministic term $R(f_{u_\lambda}) - \inf_C R = R(f_{u_\lambda}) - \inf_{u \in \operatorname{dom} G} R(f_u)$, which tends to zero as $\lambda \to 0$ because of a general variational principle (Attouch 96).

37 splitting the error
Two minimization problems:
$u_\lambda = \operatorname{argmin}_{u \in \ell^2(K)} \big( R(f_u) + \lambda G(u) \big), \qquad \hat u_{n,\lambda}(z_n) = \operatorname{argmin}_{u \in \ell^2(K)} \big( \hat R_n(f_u) + \lambda G(u) \big).$
Then
$\|f_{\hat u_{n,\lambda}(Z_n)} - f_C\|_{L^2} \le \|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} + \|f_{u_\lambda} - f_C\|_{L^2} \le \kappa\, \|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r + \big( R(f_{u_\lambda}) - \inf_C R \big)^{1/2}.$
We have to control:
- the stochastic term $\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$ ?
- the deterministic term $R(f_{u_\lambda}) - \inf_C R = R(f_{u_\lambda}) - \inf_{u \in \operatorname{dom} G} R(f_u)$, which tends to zero as $\lambda \to 0$ because of a general variational principle (Attouch 96).
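The factor $\kappa$ in the bound $\|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} \le \kappa\,\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$ comes from the pointwise bound on the dictionary; a one-line reconstruction of the estimate (using $\|\cdot\|_2 \le \|\cdot\|_r$ for $r \in \left]1,2\right]$):
\[
|f_u(x) - f_v(x)| = \Big|\sum_{k \in K} \langle u - v \mid e_k\rangle\,\varphi_k(x)\Big| \le \|u - v\|_2 \Big(\sum_{k \in K} |\varphi_k(x)|^2\Big)^{1/2} \le \kappa\,\|u - v\|_2 \le \kappa\,\|u - v\|_r .
\]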

38 a representation and sensitivity theorem
Lemma ((Xu and Roach 91) Total convexity on bounded sets). Let $u_0 \in \ell^r(K)$ and let $u_0^* \in \partial G(u_0)$. Then there exists a universal constant $M > 0$ (depending only on $r$) such that
$(\forall u \in \ell^r(K)) \quad G(u) \ge G(u_0) + \langle u - u_0 \mid u_0^* \rangle + \eta M\, \frac{\|u - u_0\|_r^2}{\big(\|u_0\|_r + \|u - u_0\|_r\big)^{2-r}}.$
Theorem (Combettes, Salzo and Villa 15). There exists a measurable function $\Psi_\lambda\colon \mathcal{X} \times \mathcal{Y} \to \ell^2(K)$, of the form $\Psi_\lambda(x, y) = h_\lambda(x, y)\,\Phi(x)$, such that $\mathrm{E}_P[\Psi_\lambda] \in \lambda\,\partial G(u_\lambda)$, $\|\Psi_\lambda\|_\infty \le C\,\|u_\lambda\|_2$, $\|\Psi_\lambda\|_2 \le 2\kappa\,(R(f_{u_\lambda}))^{1/2}$, and, for every $z_n \in (\mathcal{X} \times \mathcal{Y})^n$,
$\eta M\, \frac{\|\hat u_{n,\lambda}(z_n) - u_\lambda\|_r^2}{\big(\|u_\lambda\|_r + \|\hat u_{n,\lambda}(z_n) - u_\lambda\|_r\big)^{2-r}} \le \frac{1}{\lambda}\,\Big\| \frac{1}{n} \sum_{i=1}^n \Psi_\lambda(x_i, y_i) - \mathrm{E}_P[\Psi_\lambda] \Big\|_2.$

39 the consistency theorem
$P\Big[ \Big\| \frac{1}{n} \sum_{i=1}^n \Psi_\lambda(X_i, Y_i) - \mathrm{E}[\Psi_\lambda(X, Y)] \Big\|_2 \le \delta(n, \lambda, \tau) \Big] \ge 1 - e^{-\tau}$
Theorem (Combettes, Salzo and Villa 15). Let $(\lambda_n)_{n \in \mathbb{N}}$ be a sequence in $\left]0, +\infty\right[$ converging to $0$. Then the following holds:
1. If $1/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ in probability.
2. If $\log n/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ $P$-a.s.

40 the consistency theorem
$P\Big[ \Big\| \frac{1}{n} \sum_{i=1}^n \Psi_\lambda(X_i, Y_i) - \mathrm{E}[\Psi_\lambda(X, Y)] \Big\|_2 \le \delta(n, \lambda, \tau) \Big] \ge 1 - e^{-\tau}$
Theorem (Combettes, Salzo and Villa 15). Let $(\lambda_n)_{n \in \mathbb{N}}$ be a sequence in $\left]0, +\infty\right[$ converging to $0$. Then the following holds:
1. If $1/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ in probability.
2. If $\log n/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ $P$-a.s.
3. Suppose $f_C \in \{ f_u \mid u \in \operatorname{dom} G \}$ and set $S = \operatorname{argmin}_{u \in \operatorname{dom} G} R(f_u)$. Then there exists a unique $u^\dagger \in S$ which minimizes $G$ over $S$, and $f_{u^\dagger} = f_C$. Moreover:
- if $1/(\lambda_n n^{1/2}) \to 0$, then $\|\hat u_{n,\lambda_n}(Z_n) - u^\dagger\|_r \to 0$ in probability;
- if $\log n/(\lambda_n n^{1/2}) \to 0$, then $\|\hat u_{n,\lambda_n}(Z_n) - u^\dagger\|_r \to 0$ $P$-a.s.
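As a concrete check of these conditions (my own example, not on the slide): the choice $\lambda_n = n^{-r/8}$ gives
\[
\lambda_n^{2/r}\, n^{1/2} = n^{1/4} \to +\infty, \qquad \frac{\log n}{\lambda_n^{2/r}\, n^{1/2}} = \frac{\log n}{n^{1/4}} \to 0,
\]
so items 1 and 2 both apply and consistency holds in probability and $P$-a.s.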

41 algorithm

42 the minimization problem
We aim at minimizing the functional
$u \in \ell^2(K) \mapsto \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle), \qquad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r \quad (k \in K),$
which is the sum of:
- a differentiable term $u \mapsto \hat R_n(f_u)$ with Lipschitz continuous gradient;
- a lower semicontinuous, convex and separable term $\lambda G(u) = \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle)$.
Note that we need convergence in objective values.

43 the minimization problem
We aim at minimizing the functional
$u \in \ell^2(K) \mapsto \frac{1}{n} \sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle), \qquad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k |\cdot|^r \quad (k \in K),$
which is the sum of:
- a differentiable term $u \mapsto \hat R_n(f_u)$ with Lipschitz continuous gradient;
- a lower semicontinuous, convex and separable term $\lambda G(u) = \lambda \sum_{k \in K} g_k(\langle u \mid e_k \rangle)$.
Note that we need convergence in objective values.
This problem can be solved by forward-backward splitting methods. These algorithms have been studied for convex separable regularizers in (Combettes and Pesquet 07; Bredies and Lorenz 08).
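Behind the step-size condition $0 < \gamma < \lambda/(2\kappa^2)$ is the Lipschitz constant of the smooth part; a reconstruction (not taken verbatim from the slides) of the gradient and of the bound:
\[
\nabla \hat R_n(f_u) = \frac{2}{n} \sum_{i=1}^n \big( f_u(x_i) - y_i \big)\, \Phi(x_i),
\qquad
\big\| \nabla \hat R_n(f_u) - \nabla \hat R_n(f_v) \big\| \le \frac{2}{n} \sum_{i=1}^n \|\Phi(x_i)\|^2\, \|u - v\| \le 2\kappa^2\, \|u - v\|,
\]
so that the effective step $\gamma/\lambda$ in the forward step stays below $1/(2\kappa^2)$, the inverse of this Lipschitz constant.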

44 an inexact accelerated fbs algorithm
Fix $u_0 = v_0 \in \ell^2(K)$, $\tau_0 = 1$, and define, for every $m \in \mathbb{N}$:
for all $k \in K$:
$\quad \langle w_m \mid e_k \rangle = \langle v_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{v_m}(x_i) - y_i \big) \varphi_k(x_i)$
$\quad \langle u_{m+1} \mid e_k \rangle = \operatorname{prox}_{\gamma g_k}\big( \langle w_m \mid e_k \rangle \big)$
$\tau_{m+1} = \dfrac{1 + \sqrt{1 + 4\tau_m^2}}{2}, \qquad v_{m+1} = u_{m+1} + \dfrac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m)$
(Nesterov 83; Güler 92; Beck and Teboulle 09; Villa, Salzo et al. 13)

45 an inexact accelerated fbs algorithm
Fix $u_0 = v_0 \in \ell^2(K)$, $\tau_0 = 1$, and define, for every $m \in \mathbb{N}$:
for all $k \in K$:
$\quad \langle w_m \mid e_k \rangle = \langle v_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{v_m}(x_i) - y_i \big) \varphi_k(x_i)$
$\quad \langle u_{m+1} \mid e_k \rangle = \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle) \big) \Big)$
$\tau_{m+1} = \dfrac{1 + \sqrt{1 + 4\tau_m^2}}{2}, \qquad v_{m+1} = u_{m+1} + \dfrac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m)$
(Nesterov 83; Güler 92; Beck and Teboulle 09; Villa, Salzo et al. 13)

46 an inexact accelerated fbs algorithm
Fix $u_0 = v_0 \in \ell^2(K)$, $\tau_0 = 1$, and define, for every $m \in \mathbb{N}$:
for all $k \in K$:
$\quad \langle w_m \mid e_k \rangle = \langle v_m \mid e_k \rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \big( f_{v_m}(x_i) - y_i \big) \varphi_k(x_i)$
$\quad \langle u_{m+1} \mid e_k \rangle = \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle) \big) + \alpha_{m,k} \Big)$
$\tau_{m+1} = \dfrac{1 + \sqrt{1 + 4\tau_m^2}}{2}, \qquad v_{m+1} = u_{m+1} + \dfrac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m)$
(Nesterov 83; Güler 92; Beck and Teboulle 09; Villa, Salzo et al. 13)

47 an inexact accelerated fbs algorithm
where:
- $0 < \gamma < \lambda/(2\kappa^2)$;
- $(\alpha_{m,k})_{(m,k) \in \mathbb{N} \times K}$ is a double sequence in $\mathbb{R}$ that accounts for errors in the computation of the proximity operator $\operatorname{prox}_{\gamma \eta_k |\cdot|^r}$ and, for some $c > 0$, $p \in \left]3/2, +\infty\right[$ and $(\nu_k)_{k \in K} \in \ell^1(K)$, satisfies
$|\alpha_{m,k}| \le \min\left\{ \dfrac{c\, m^{-2p}\, \nu_k}{2\gamma \eta_k r\, (|\langle w_m \mid e_k \rangle| + 1)^{r-1}},\ \dfrac{|\operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle)|}{1 + r \gamma \eta_k},\ \left( \dfrac{|\operatorname{soft}_{\gamma D_k}(\langle w_m \mid e_k \rangle)|}{1 + r \gamma \eta_k} \right)^{1/(r-1)} \right\}.$
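A minimal Python sketch of this accelerated forward-backward loop on a finite dictionary (the data, the dictionary, and the choice $r = 2$, for which the inner prox has a closed form so that the errors $\alpha_{m,k}$ are zero, are my own illustrative assumptions, not the authors' code; for general $r$ the inner prox is computed by solving a scalar equation, see the proximity operator of powers below).

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 200, 30
    Phi = rng.standard_normal((n, K)) / np.sqrt(K)    # row i plays the role of Phi(x_i)
    y = Phi[:, :5] @ np.array([2.0, -1.5, 1.0, 0.5, -2.0]) + 0.1 * rng.standard_normal(n)

    lam, eta = 0.1, 0.9
    kappa2 = np.max(np.sum(Phi ** 2, axis=1))
    gamma = 0.9 * lam / (2 * kappa2)                  # step size: 0 < gamma < lambda / (2 kappa^2)
    C, D = (-5.0, 5.0), (-1.0, 1.0)                   # the same C_k and D_k for every k

    soft = lambda w, lo, hi: w - np.clip(w, lo, hi)   # soft_{[lo,hi]} = Id - projection
    prox_pow2 = lambda w, tau: w / (1 + 2 * tau)      # exact prox of tau*|.|^2 (so alpha_{m,k} = 0)

    u = v = np.zeros(K)
    t = 1.0
    for m in range(500):
        grad = (2.0 / n) * Phi.T @ (Phi @ v - y)      # gradient of the empirical risk at v
        w = v - (gamma / lam) * grad                  # forward step <w_m | e_k>
        u_new = np.clip(prox_pow2(soft(w, gamma * D[0], gamma * D[1]), gamma * eta), C[0], C[1])
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2     # Nesterov/FISTA extrapolation parameter
        v = u_new + (t - 1) / t_new * (u_new - u)
        u, t = u_new, t_new

    print(np.round(u, 2))                             # most coefficients are driven exactly to zero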

48 proximal thresholding: the role of the exponent r
[Figure: soft thresholding (olive) and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 2$ (red).]

49 proximal thresholding: the role of the exponent r
[Figure: soft thresholding (olive) and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 3/2$ (purple).]

50 proximal thresholding: the role of the exponent r
[Figure: soft thresholding (olive) and $\operatorname{prox}_\varphi$ for $\varphi = |\cdot| + 0.9|\cdot|^r$ with $r = 4/3$ (blue).]

51 the proximity operator of powers
Let $r \in \left]1, 2\right]$, let $\tau > 0$, let $\mu \in \mathbb{R}$, and consider $\operatorname{prox}_{\tau |\cdot|^r}\colon \mathbb{R} \to \mathbb{R}$. Then
$\operatorname{prox}_{\tau |\cdot|^r}(\mu) = \operatorname{sign}(\mu)\, \xi, \quad \text{where } \xi \ge 0 \text{ and } \xi + r\tau \xi^{r-1} = |\mu|.$
There are several exponents $r$ for which the equation can be explicitly solved: $r \in \{3/2, 4/3, 5/4\}$. However, in general this is not possible and one needs to rely on iterative methods, e.g. bisection.
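A minimal bisection sketch for this scalar equation (my own illustrative code, not the authors'):

    def prox_power(mu, tau, r, tol=1e-12):
        """prox of tau*|.|^r at mu: returns sign(mu)*xi with xi >= 0 solving xi + r*tau*xi**(r-1) = |mu|."""
        a, b = 0.0, abs(mu)                  # the left-hand side is increasing in xi and exceeds |mu| at xi = |mu|
        while b - a > tol:
            xi = 0.5 * (a + b)
            if xi + r * tau * xi ** (r - 1) < abs(mu):
                a = xi
            else:
                b = xi
        return (1.0 if mu >= 0 else -1.0) * 0.5 * (a + b)

    # Sanity check against the r = 2 closed form mu / (1 + 2*tau):
    print(prox_power(3.0, 0.5, 2.0), 3.0 / (1 + 2 * 0.5))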

52 the proximity operator of powers
Let $r \in \left]1, 2\right]$, let $\tau > 0$, let $\mu \in \mathbb{R}$, and consider $\operatorname{prox}_{\tau |\cdot|^r}\colon \mathbb{R} \to \mathbb{R}$. Then
$\operatorname{prox}_{\tau |\cdot|^r}(\mu) = \operatorname{sign}(\mu)\, \xi, \quad \text{where } \xi \ge 0 \text{ and } \xi + r\tau \xi^{r-1} = |\mu|.$
There are several exponents $r$ for which the equation can be explicitly solved: $r \in \{3/2, 4/3, 5/4\}$. However, in general this is not possible and one needs to rely on iterative methods, e.g. bisection.
The following decomposition rule holds (Combettes and Pesquet 07):
$g_k = \iota_{C_k} + \eta_k |\cdot|^r + \sigma_{D_k} \;\Longrightarrow\; \operatorname{prox}_{\gamma g_k}(\mu) = \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\mu) \big) \Big).$
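Putting the pieces together, a sketch of $\operatorname{prox}_{\gamma g_k}$ via this decomposition (again illustrative; the power prox is given in closed form for $r = 2$ so that the snippet stays self-contained, but any scalar solver such as the bisection above can be plugged in):

    import numpy as np

    def prox_g(mu, gamma, C, D, eta, prox_power_r):
        """prox_{gamma*g}(mu) for g = iota_C + sigma_D + eta*|.|^r, via pi_C o prox_{gamma*eta*|.|^r} o soft_{gamma*D}."""
        s = mu - np.clip(mu, gamma * D[0], gamma * D[1])   # soft-thresholding w.r.t. gamma*D
        t = prox_power_r(s, gamma * eta)                   # proximity operator of the power term
        return float(np.clip(t, C[0], C[1]))               # projection onto the constraint interval C

    prox_square = lambda w, tau: w / (1 + 2 * tau)         # exact prox of tau*|.|^2, i.e. the case r = 2
    print(prox_g(2.0, 0.5, C=(-1.2, 1.2), D=(-1.0, 1.0), eta=0.9, prox_power_r=prox_square))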

53 the proximity operator of powers
Let $r \in \left]1, 2\right]$, let $\tau > 0$, let $\mu \in \mathbb{R}$, and consider $\operatorname{prox}_{\tau |\cdot|^r}\colon \mathbb{R} \to \mathbb{R}$. Then
$\operatorname{prox}_{\tau |\cdot|^r}(\mu) = \operatorname{sign}(\mu)\, \xi, \quad \text{where } \xi \ge 0 \text{ and } \xi + r\tau \xi^{r-1} = |\mu|.$
There are several exponents $r$ for which the equation can be explicitly solved: $r \in \{3/2, 4/3, 5/4\}$. However, in general this is not possible and one needs to rely on iterative methods, e.g. bisection.
What becomes of the decomposition rule (Combettes and Pesquet 07) when the inner proximity operator is computed with an error $\alpha_{m,k}$?
$g_k = \iota_{C_k} + \eta_k |\cdot|^r + \sigma_{D_k}\colon \qquad \operatorname{prox}_{\gamma g_k}(\mu) \overset{?}{\approx} \pi_{C_k}\Big( \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\mu) \big) + \alpha_{m,k} \Big)$
The question is how to adjust the decomposition rule in the presence of errors.

54 a notion of inexact proximity operator
$u \simeq_\delta \operatorname{prox}_{\eta \varphi}(w) \;\stackrel{\text{def}}{\Longleftrightarrow}\; u \in \frac{\delta^2}{2\eta}\text{-}\operatorname{argmin}_{\xi \in \mathcal{H}} \Big\{ \varphi(\xi) + \frac{1}{2\eta} \|\xi - w\|_{\mathcal{H}}^2 \Big\}.$

55 a notion of inexact proximity operator
$u \simeq_\delta \operatorname{prox}_{\eta \varphi}(w) \;\stackrel{\text{def}}{\Longleftrightarrow}\; u \in \frac{\delta^2}{2\eta}\text{-}\operatorname{argmin}_{\xi \in \mathcal{H}} \Big\{ \varphi(\xi) + \frac{1}{2\eta} \|\xi - w\|_{\mathcal{H}}^2 \Big\}.$
Lemma. Let $\psi, \phi, \sigma \in \Gamma_0^+(\mathbb{R})$ with $\psi(0) = 0$ and $\sigma$ positively homogeneous. Then
- $s = \operatorname{prox}_{\vartheta |\cdot|^r}(\mu) + \alpha \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\vartheta |\cdot|^r}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\psi\big(\operatorname{prox}_\sigma(\mu)\big) \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\psi + \sigma}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\phi(\mu)$ and $p = \pi_C(s) \;\Longrightarrow\; p \simeq_\delta \operatorname{prox}_{\iota_C + \phi}(\mu)$.

56 a notion of inexact proximity operator
$u \simeq_\delta \operatorname{prox}_{\eta \varphi}(w) \;\stackrel{\text{def}}{\Longleftrightarrow}\; u \in \frac{\delta^2}{2\eta}\text{-}\operatorname{argmin}_{\xi \in \mathcal{H}} \Big\{ \varphi(\xi) + \frac{1}{2\eta} \|\xi - w\|_{\mathcal{H}}^2 \Big\}.$
Lemma. Let $\psi, \phi, \sigma \in \Gamma_0^+(\mathbb{R})$ with $\psi(0) = 0$ and $\sigma$ positively homogeneous. Then
- $s = \operatorname{prox}_{\vartheta |\cdot|^r}(\mu) + \alpha \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\vartheta |\cdot|^r}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\psi\big(\operatorname{prox}_\sigma(\mu)\big) \;\Longrightarrow\; s \simeq_\delta \operatorname{prox}_{\psi + \sigma}(\mu)$;
- $s \simeq_\delta \operatorname{prox}_\phi(\mu)$ and $p = \pi_C(s) \;\Longrightarrow\; p \simeq_\delta \operatorname{prox}_{\iota_C + \phi}(\mu)$.
Applied to $g_k = \iota_{C_k} + \eta_k |\cdot|^r + \sigma_{D_k}$:
$s = \operatorname{prox}_{\gamma \eta_k |\cdot|^r}\big( \operatorname{soft}_{\gamma D_k}(\mu) \big) + \alpha_{m,k}$ and $p = \pi_{C_k}(s) \;\Longrightarrow\; p \simeq_{\delta_{m,k}} \operatorname{prox}_{\gamma g_k}(\mu).$

57 the theorem of convergence
Theorem. Let $\hat F_n(u) = \frac{1}{n} \sum_{i=1}^n (f_u(x_i) - y_i)^2$ and $G(u) = \sum_{k \in K} g_k(\langle u \mid e_k \rangle)$. Then there exists a vanishing sequence $(\delta_m)_{m \in \mathbb{N}}$, with $\delta_m \le C m^{-p}$, such that
$u_{m+1} \simeq_{\delta_m} \operatorname{prox}_{\gamma G}\Big( v_m - \frac{\gamma}{\lambda} \nabla \hat F_n(v_m) \Big), \qquad v_{m+1} = u_{m+1} + \frac{\tau_m - 1}{\tau_{m+1}} (u_{m+1} - u_m).$
Hence it follows from (Villa, Salzo et al. 13) that the sequence $(u_m)_{m \in \mathbb{N}}$ is minimizing for $\hat F_n + \lambda G$ and $\|u_m - \hat u_{n,\lambda}(z_n)\|_r \to 0$ as $m \to +\infty$.

58 further developments

59 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.

60 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.
2. The constraint set is $C = \{ f \in \mathcal{M}(\mathcal{X}, \mathcal{Y}) \mid (\forall x \in \mathcal{X})\ f(x) \in C(x) \}$, where, for every $x \in \mathcal{X}$, $C(x) \subset \mathcal{Y}$ is closed and convex.

61 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.
2. The constraint set is $C = \{ f \in \mathcal{M}(\mathcal{X}, \mathcal{Y}) \mid (\forall x \in \mathcal{X})\ f(x) \in C(x) \}$, where, for every $x \in \mathcal{X}$, $C(x) \subset \mathcal{Y}$ is closed and convex.
3. The risk is based on a general loss function: $R(f) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(x, y, f(x))\, dP(x, y)$.

62 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa 14):
1. The functional model is given by means of a linear operator $A\colon F \to \mathcal{M}(\mathcal{X}, \mathcal{Y})\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $F$ (the feature space) and $\mathcal{Y}$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.
2. The constraint set is $C = \{ f \in \mathcal{M}(\mathcal{X}, \mathcal{Y}) \mid (\forall x \in \mathcal{X})\ f(x) \in C(x) \}$, where, for every $x \in \mathcal{X}$, $C(x) \subset \mathcal{Y}$ is closed and convex.
3. The risk is based on a general loss function: $R(f) = \int_{\mathcal{X} \times \mathcal{Y}} \ell(x, y, f(x))\, dP(x, y)$.
4. $G\colon F \to \left]-\infty, +\infty\right]$ is totally convex, meaning that, for every $t > 0$ and every $u \in \operatorname{dom} G$, $0 < \psi_G(u; t) = \inf\{ G(v) - G(u) - G'(u; v - u) \mid v \in \operatorname{dom} G,\ \|u - v\| = t \}$.

63 a more general framework
We characterized the property of universality w.r.t. $C$, i.e., when $\inf_C R = \inf_{C \cap \operatorname{ran}(A)} R$.
We proved a representer and sensitivity theorem: there exists a measurable function $\Psi_\lambda\colon \mathcal{X} \times \mathcal{Y} \to F$ such that $\mathrm{E}_P[\Psi_\lambda] \in \lambda\, \partial G(u_\lambda)$ and, for every $z_n \in (\mathcal{X} \times \mathcal{Y})^n$,
$\dfrac{\psi_G\big(u_\lambda, \|\hat u_{n,\lambda}(z_n) - u_\lambda\|\big)}{\|\hat u_{n,\lambda}(z_n) - u_\lambda\|} \le \dfrac{1}{\lambda} \Big\| \dfrac{1}{n} \sum_{i=1}^n \Psi_\lambda(x_i, y_i) - \mathrm{E}_P[\Psi_\lambda] \Big\|.$
We found conditions on the sequence $(\lambda_n)_{n \in \mathbb{N}}$ ensuring consistency, which depend on the modulus of total convexity of $G$, the Rademacher type of $F$, and the local Lipschitz constants of the loss function.

64 an alternative approach
When restricted to real-valued functions, an alternative approach is to use Rademacher complexities.
1. In this case the statistical analysis achieves sharper results when dealing with general loss functions.
2. Open problem: Rademacher complexity for (reproducing kernel) classes of functions taking values in infinite-dimensional spaces. A contraction principle is missing in that scenario.

65 Thank you

66 some references
P. L. Combettes and J.-C. Pesquet, Proximal thresholding algorithm for minimization over orthonormal bases, SIAM J. Optim., vol. 18, 2007.
P. L. Combettes, S. Salzo, and S. Villa, Consistency of regularized learning schemes in Banach spaces, arXiv, 2014.
P. L. Combettes, S. Salzo, and S. Villa, Consistent learning by composite proximal thresholding, arXiv, 2015.
S. Villa, S. Salzo, L. Baldassarre, and A. Verri, Accelerated and inexact forward-backward algorithms, SIAM J. Optim., 2013.
B. Xu and G. F. Roach, Characteristic inequalities of uniformly convex and uniformly smooth Banach spaces, Journal of Mathematical Analysis and Applications, vol. 157, 1991.
