consistent learning by composite proximal thresholding
1 consistent learning by composite proximal thresholding. Saverio Salzo, Università degli Studi di Genova. Optimization in Machine Learning, Vision and Image Processing, Université Paul Sabatier, Toulouse, 6-7 October.
2 outline
1. Introduction: the learning problem; constructing the estimators; our contribution.
2. Statistical analysis: proving consistency (a sketch); the main result.
3. Algorithm: description; inexact proximity operators; the main result.
4. Further developments.
3 introduction
4 nonparametric regression with random design
[Figure: the Income data set, income plotted as a function of years of education and seniority; the blue surface is the true underlying relationship (known since the data are simulated) and the red dots are the observed values for 30 individuals.]
Input space $X = \mathrm{range}(\text{years of education}) \times \mathrm{range}(\text{seniority}) \subset \mathbb{R}^2$. Output space $Y = \mathrm{range}(\text{income}) \subset \mathbb{R}$.
$z_n = (x_i, y_i)_{1\le i\le n} \in (X \times Y)^n$ are i.i.d. realizations of a random variable $(X, Y)$ distributed according to $P$.
5 nonparametric regression with random design
[Diagram: an input $x \in X$ is mapped to an output $y \in Y$ drawn from the conditional distribution $P(y \mid x)$.]
The distribution $P$ describes a fuzzy function. The goal is to estimate a function $f\colon X \to Y$ such that, for every $x \in X$, $f(x)$ is chosen according to $P(y \mid x)$. For instance, one could take $f = \mathbb{E}(Y \mid X)$, the conditional mean.
6 the formal setting
We are given:
- an input space $(X, \mathfrak{A})$, an output space $Y \subset \mathbb{R}$ (a bounded interval), and a random variable $(X, Y)$ with values in $X \times Y$ and distribution $P$;
- a sequence $(X_i, Y_i)_{i\in\mathbb{N}}$ of i.i.d. random variables taking values in $X \times Y$ with common distribution $P$;
- $n$ observations $z_n = (x_i, y_i)_{1\le i\le n}$ of the random variables $Z_n = (X_i, Y_i)_{1\le i\le n}$, which constitute the training set;
- a (large) closed convex constraint set $C \subset L^2(P_X)$ encoding prior information.
7-8 the learning problem
The goal is to approach $f_C$, the regression function on $C$, which is the minimizer over the constraint set $C$ of the $L^2$-risk
$$R\colon L^2(P_X) \to \mathbb{R}_+, \qquad R(f) = \int_{X\times Y} |y - f(x)|^2\, dP(x,y) = \mathbb{E}\,|Y - f(X)|^2,$$
without any knowledge of $P$, but using only training sets $z_n$.
A learning algorithm is a map $\Lambda\colon \bigcup_{n\in\mathbb{N}} (X\times Y)^n \to C$ such that the estimators $\Lambda(Z_n)$ asymptotically converge to $f_C$, meaning that $\|\Lambda(Z_n) - f_C\|_{L^2(P_X)} \to 0$ (in probability or a.s.) as $n \to +\infty$.
9 the learning algorithm: how we build estimators
We adopt a linear model:
$$f_u = \sum_{k\in K} \langle u, e_k\rangle\, \varphi_k \quad\text{(pointwise)}, \qquad u \in \ell^2(K),$$
where:
- $(e_k)_{k\in K}$ is the canonical basis of $\ell^2(K)$ and $(\varphi_k)_{k\in K}$ is a countable dictionary of bounded functions in $\mathcal{M}(X, \mathbb{R})$ such that $\sum_{k\in K} |\varphi_k(x)|^2 \le \kappa^2$ for every $x \in X$;
- the coefficients $\langle u, e_k\rangle$ are constrained in given intervals $C_k \subset \mathbb{R}$, thus defining the constraint $C = \{ f_u : u \in \ell^2(K),\ (\forall k\in K)\ \langle u, e_k\rangle \in C_k \}$.
Our estimators will be in $C$! Moreover, $H_K = \{ f_u : u \in \ell^2(K) \}$ is a RKHS with feature map $\Phi\colon X \to \ell^2(K)\colon x \mapsto (\varphi_k(x))_{k\in K}$ and kernel $K(x, x') = \langle \Phi(x), \Phi(x')\rangle$.
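A minimal numerical illustration of this linear model, under assumed choices: a finite cosine dictionary with summable weights stands in for $(\varphi_k)_{k\in K}$ (any bounded dictionary with $\sum_k |\varphi_k(x)|^2 \le \kappa^2$ would do); none of the names below come from the slides.

```python
import numpy as np

K = 10
weights = 1.0 / np.arange(1, K + 1)        # summable weights keep the dictionary bounded

def feature_map(x):
    """Phi(x) = (phi_k(x))_k with phi_k(x) = w_k * cos(k * x) (illustrative choice)."""
    k = np.arange(1, K + 1)
    return weights * np.cos(k * x)

def f_u(u, x):
    """f_u(x) = sum_k <u, e_k> phi_k(x) = <u, Phi(x)>."""
    return float(np.dot(u, feature_map(x)))

def kernel(x, xp):
    """K(x, x') = <Phi(x), Phi(x')>."""
    return float(np.dot(feature_map(x), feature_map(xp)))

kappa2 = float(np.sum(weights ** 2))       # here sum_k |phi_k(x)|^2 <= sum_k w_k^2 = kappa^2
```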
10-12 the learning algorithm: how we build estimators
We solve a composite regularized least squares regression problem
$$\hat u_{n,\lambda}(z_n) = \operatorname*{argmin}_{u\in\ell^2(K)} \Bigl\{ \frac{1}{n}\sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k\in K} g_k(\langle u, e_k\rangle) \Bigr\},$$
$$g_k\colon \mathbb{R} \to \left]-\infty,+\infty\right], \qquad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k\,|\cdot|^r \quad (\forall k\in K),$$
where $0 \in C_k$, $D_k \subset \mathbb{R}$ are closed intervals, $\eta_k \ge \eta > 0$, and $r \in \left]1, 2\right]$. The learning algorithm is $z_n \mapsto f_{\hat u_{n,\lambda}(z_n)}$.
The indicator function of $C_k$ and the support function of $D_k$ are
$$\iota_{C_k}(\xi) = \begin{cases} 0 & \text{if } \xi \in C_k \\ +\infty & \text{if } \xi \notin C_k, \end{cases}
\qquad
\sigma_{D_k}(\xi) = \begin{cases} \xi \sup D_k & \text{if } \xi \ge 0 \\ \xi \inf D_k & \text{if } \xi < 0. \end{cases}$$
The empirical risk and the regularizer are
$$\hat R_n(f_u) = \frac{1}{n}\sum_{i=1}^n (y_i - f_u(x_i))^2, \qquad G(u) = \sum_{k\in K} g_k(\langle u, e_k\rangle),$$
so this is a regularized empirical risk minimization learning algorithm. This model encompasses:
- ridge regression (Hoerl and Kennard, '70): $g_k = |\cdot|^2$;
- the elastic net (Zou and Hastie, '05; De Mol et al., '09): $g_k = \omega_k|\cdot| + \eta|\cdot|^2$;
- bridge regression (Frank and Friedman, '93): $g_k = |\cdot|^r$, $1 < r < 2$.
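To make the three building blocks of $g_k$ concrete, here is a small sketch of the indicator, the support function of an interval, and one regularizer term; intervals are passed as endpoint pairs and all names are illustrative, not from the slides.

```python
from math import inf

def iota(xi, C):
    """Indicator of the closed interval C = [C[0], C[1]] (endpoints may be +/- inf)."""
    return 0.0 if C[0] <= xi <= C[1] else inf

def sigma(xi, D):
    """Support function of the closed bounded interval D = [D[0], D[1]]:
    sigma_D(xi) = xi * sup D if xi >= 0, and xi * inf D if xi < 0."""
    return xi * D[1] if xi >= 0 else xi * D[0]

def g_k(xi, C, D, eta, r):
    """One regularizer term g_k = iota_C + sigma_D + eta * |.|^r."""
    return iota(xi, C) + sigma(xi, D) + eta * abs(xi) ** r

# for D = [-1, 1] the support function is the absolute value, sigma_D(xi) = |xi|
assert sigma(2.0, (-1.0, 1.0)) == 2.0 and sigma(-2.0, (-1.0, 1.0)) == 2.0
```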
13-14 the learning algorithm: how we build estimators
We minimize the regularized empirical risk by proximal gradient algorithms. Let $0 < \gamma < \lambda/(2\kappa^2)$ and define $(u_m)_{m\in\mathbb{N}}$ by: for all $m \in \mathbb{N}$ and all $k \in K$,
$$\langle u_{m+1}, e_k\rangle = \operatorname{prox}_{\gamma g_k}\Bigl( \langle u_m, e_k\rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \bigl(f_{u_m}(x_i) - y_i\bigr)\, \varphi_k(x_i) \Bigr),$$
where $\operatorname{prox}_{\gamma g_k}(\xi) = \operatorname*{argmin}_{t\in\mathbb{R}} \bigl\{ \gamma g_k(t) + \tfrac{1}{2}(t - \xi)^2 \bigr\}$.
Example. If $g_k = \iota_{C_k}$, then $\operatorname{prox}_{g_k} = \pi_{C_k}$, the projection onto $C_k$. If $g_k = \sigma_{D_k}$, then
$$\operatorname{prox}_{g_k}(\xi) = \operatorname{soft}_{D_k}(\xi) = \begin{cases} \xi - \inf D_k & \text{if } \xi \le \inf D_k \\ 0 & \text{if } \xi \in D_k \\ \xi - \sup D_k & \text{if } \xi \ge \sup D_k. \end{cases}$$
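A short sketch of these two explicit proximity operators: projection onto an interval, and soft thresholding with respect to an interval, i.e. the identity minus the projection onto $D_k$. Parameter names are illustrative.

```python
def proj(xi, C):
    """prox of iota_C: projection onto the closed interval C = [C[0], C[1]]."""
    return min(max(xi, C[0]), C[1])

def soft(xi, D):
    """prox of sigma_D: soft thresholding w.r.t. D = [D[0], D[1]], i.e. xi - P_D(xi)."""
    if xi > D[1]:
        return xi - D[1]
    if xi < D[0]:
        return xi - D[0]
    return 0.0

# classical soft thresholding at level 1 corresponds to D = [-1, 1]
assert soft(1.7, (-1.0, 1.0)) == 0.7 and soft(-0.3, (-1.0, 1.0)) == 0.0
```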
15 the learning algorithm: how we build estimators
[Plots: the support function $\sigma_D$ and the soft-thresholding operator $\operatorname{soft}_D$, shown for $D = [-1, 1]$ and other intervals.]
In particular, the soft-thresholding operator with respect to a bounded symmetric interval $D_k = [-\omega_k, \omega_k] \subset \mathbb{R}$ is
$$\operatorname{soft}_{D_k}(\mu) = \begin{cases} \mu - \omega_k & \text{if } \mu > \omega_k \\ 0 & \text{if } \mu \in D_k \\ \mu + \omega_k & \text{if } \mu < -\omega_k. \end{cases}$$
16-19 the learning algorithm: motivation
When the regression function has a sparse representation, a further goal besides prediction is to identify the corresponding set of relevant features. In this respect, in our learning algorithm:
- the estimators have finite support (sparsity is promoted);
- positivity and boundedness of the coefficients can be enforced;
- the estimators keep the same grouping selection properties as the elastic-net estimators, but they are possibly not affected by the double shrinkage phenomenon (Zou and Hastie, '05);
- the regularizer gives high flexibility to control the shape of the thresholding operation depending on the available prior information.
20-23 the learning algorithm: motivation
[Plots of the support function $\sigma_D(\xi)$ for $D = [-1, 1]$, $D = [-0.5, 1.5]$, $D = [0.5, 2]$ and $D = [-2, -0.5]$.]
24-27 the learning algorithm: motivation
[Plots of $\operatorname{prox}_{g_k}(\xi)$ for: $C_k = \mathbb{R}$, $D_k = [-1, 1]$, $\eta_k = 0$; $C_k = \mathbb{R}$, $D_k = [-1, 1]$, $\eta_k = 0.9$, $r = 4/3$; $C_k = [-2, 2]$, $D_k = [-1, 1]$, $\eta_k = 0$; $C_k = \left]-\infty, 6/5\right]$, $D_k = [0, 2]$, $\eta_k = 0.9$, $r = 4/3$.]
28 our contribution
We will address two problems:
- consistency of the estimators $f_{\hat u_{n,\lambda}(Z_n)}$, meaning that we can find conditions on a vanishing sequence $(\lambda_n)_{n\in\mathbb{N}}$ such that $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2(P_X)} \to 0$ (in probability or a.s.);
- computation of the estimators $\hat u_{n,\lambda}(z_n)$ (for a fixed training set $z_n$ and parameter $\lambda$) by means of an inexact and accelerated forward-backward splitting algorithm.
29 statistical analysis
30-34 the regression function wrt the constraint set
$f_C$ is the minimizer of the risk over the constraint set $C$. Facts:
1. Setting $f^* = \mathbb{E}(Y\mid X)$, we have $R(f) = \|f - f^*\|_{L^2(P_X)}^2 + \inf_{L^2(P_X)} R$.
2. $f_C$ is the (orthogonal) projection of $f^*$ onto $C$ in $L^2(P_X)$.
3. $(\forall f \in C)\quad \|f - f_C\|_{L^2}^2 \le R(f) - \inf_C R$.
4. $(\forall f \in C)\quad R(f) - \inf_C R \le \bigl[\, \|f - f_C\|_{L^2(P_X)} + 2\bigl(\inf_C R - \inf_{L^2(P_X)} R\bigr)^{1/2} \bigr]\, \|f - f_C\|_{L^2(P_X)}$.
It follows that $R(\hat f_n) - \inf_C R \to 0$ if and only if $\|\hat f_n - f_C\|_{L^2(P_X)} \to 0$.
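For completeness, a short check of Fact 1 (a standard orthogonality argument; this derivation is added here and is not on the slides). Writing $f^* = \mathbb{E}(Y\mid X)$,

```latex
\begin{align*}
R(f) = \mathbb{E}\,|Y - f(X)|^2
     &= \mathbb{E}\,|Y - f^*(X)|^2
       + 2\,\mathbb{E}\bigl[(Y - f^*(X))\,(f^*(X) - f(X))\bigr]
       + \mathbb{E}\,|f^*(X) - f(X)|^2 \\
     &= \mathbb{E}\,|Y - f^*(X)|^2 + \|f - f^*\|_{L^2(P_X)}^2 ,
\end{align*}
```

since conditioning on $X$ kills the cross term: $\mathbb{E}[Y - f^*(X)\mid X] = 0$. The same identity shows $R(f) \ge R(f^*)$ for every $f$, so $\mathbb{E}\,|Y - f^*(X)|^2 = \inf_{L^2(P_X)} R$, which is Fact 1.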
35-37 splitting the error
Two minimization problems:
$$u_\lambda = \operatorname*{argmin}_{u\in\ell^2(K)} \bigl( R(f_u) + \lambda G(u) \bigr), \qquad \hat u_{n,\lambda}(z_n) = \operatorname*{argmin}_{u\in\ell^2(K)} \bigl( \hat R_n(f_u) + \lambda G(u) \bigr).$$
Then
$$\|f_{\hat u_{n,\lambda}(Z_n)} - f_C\|_{L^2} \le \|f_{\hat u_{n,\lambda}(Z_n)} - f_{u_\lambda}\|_{L^2} + \|f_{u_\lambda} - f_C\|_{L^2} \le \kappa\, \|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r + \bigl( R(f_{u_\lambda}) - \inf_C R \bigr)^{1/2}.$$
We have to control:
- the stochastic term $\|\hat u_{n,\lambda}(Z_n) - u_\lambda\|_r$ (how?);
- the deterministic term $R(f_{u_\lambda}) - \inf_C R = R(f_{u_\lambda}) - \inf_{u\in\operatorname{dom} G} R(f_u)$, which tends to zero as $\lambda \to 0$ because of a general variational principle (Attouch '96).
38 a representation and sensitivity theorem
Lemma (Xu and Roach '91; total convexity on bounded sets). Let $u_0 \in \ell^r(K)$ and $u_0^* \in \partial G(u_0)$. Then there exists a universal constant $M > 0$ (depending only on $r$) such that
$$(\forall u \in \ell^r(K))\quad G(u) - G(u_0) \ge \langle u - u_0, u_0^*\rangle + \eta M\, \frac{\|u - u_0\|_r^2}{\bigl( \|u_0\|_r + \|u - u_0\|_r \bigr)^{2-r}}.$$
Theorem (Combettes, Salzo and Villa '15). There exists a measurable function $\Psi_\lambda\colon X\times Y \to \ell^2(K)$, of the form $\Psi_\lambda(x, y) = h_\lambda(x, y)\,\Phi(x)$, such that $\mathbb{E}_P[\Psi_\lambda] \in \lambda\, \partial G(u_\lambda)$, $\|\Psi_\lambda\|_\infty \le C\,\|u_\lambda\|_2$, $\|\Psi_\lambda\|_2 \le 2\kappa\, (R(f_{u_\lambda}))^{1/2}$, and, for every $z_n \in (X\times Y)^n$,
$$\eta M\, \frac{\|\hat u_{n,\lambda}(z_n) - u_\lambda\|_r^2}{\bigl( \|u_\lambda\|_r + \|\hat u_{n,\lambda}(z_n) - u_\lambda\|_r \bigr)^{2-r}} \le \frac{1}{\lambda}\, \Bigl\| \frac{1}{n}\sum_{i=1}^n \Psi_\lambda(x_i, y_i) - \mathbb{E}_P[\Psi_\lambda] \Bigr\|_2.$$
39-40 the consistency theorem
The concentration step:
$$P\Bigl[\, \Bigl\| \frac{1}{n}\sum_{i=1}^n \Psi_\lambda(X_i, Y_i) - \mathbb{E}[\Psi_\lambda(X, Y)] \Bigr\|_2 \le \delta(n, \lambda, \tau) \Bigr] \ge 1 - e^{-\tau}.$$
Theorem (Combettes, Salzo and Villa '15). Let $(\lambda_n)_{n\in\mathbb{N}}$ be a sequence in $\left]0, +\infty\right[$ converging to 0. Then the following holds:
1. If $1/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ in probability.
2. If $\log n/(\lambda_n^{2/r} n^{1/2}) \to 0$, then $\|f_{\hat u_{n,\lambda_n}(Z_n)} - f_C\|_{L^2} \to 0$ $P$-a.s.
3. Suppose $f_C \in \{ f_u : u \in \operatorname{dom} G \}$ and set $S = \operatorname*{argmin}_{u\in\operatorname{dom} G} R(f_u)$. Then there exists a unique $u^\dagger \in S$ which minimizes $G$ over $S$, and $f_{u^\dagger} = f_C$. Moreover:
   - if $1/(\lambda_n n^{1/2}) \to 0$, then $\|\hat u_{n,\lambda_n}(Z_n) - u^\dagger\|_r \to 0$ in probability;
   - if $\log n/(\lambda_n n^{1/2}) \to 0$, then $\|\hat u_{n,\lambda_n}(Z_n) - u^\dagger\|_r \to 0$ $P$-a.s.
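As a concrete illustration (added here, not on the slides), polynomial schedules satisfy the conditions of items 1 and 2: taking $\lambda_n = n^{-\alpha}$ with $0 < \alpha < r/4$ gives

```latex
\[
\frac{1}{\lambda_n^{2/r}\, n^{1/2}} = n^{2\alpha/r - 1/2} \longrightarrow 0
\qquad\text{and}\qquad
\frac{\log n}{\lambda_n^{2/r}\, n^{1/2}} = n^{2\alpha/r - 1/2}\,\log n \longrightarrow 0 ,
\]
```

since the exponent $2\alpha/r - 1/2$ is negative; the regularization must therefore vanish, but not faster than roughly $n^{-r/4}$.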
41 algorithm
42-43 the minimization problem
We aim at minimizing the functional
$$u \in \ell^2(K) \mapsto \frac{1}{n}\sum_{i=1}^n (y_i - f_u(x_i))^2 + \lambda \sum_{k\in K} g_k(\langle u, e_k\rangle), \qquad g_k = \iota_{C_k} + \sigma_{D_k} + \eta_k|\cdot|^r \quad (\forall k\in K),$$
which is the sum of:
- a differentiable term $u \mapsto \hat R_n(f_u)$ with Lipschitz continuous gradient;
- a lower semicontinuous, convex and separable term $\lambda G(u) = \lambda \sum_{k\in K} g_k(\langle u, e_k\rangle)$.
Note that we need convergence in objective values. This problem can be solved by forward-backward splitting methods; these algorithms have been studied for convex separable regularizers in (Combettes and Pesquet '07; Bredies and Lorenz '08).
44-46 an inexact accelerated fbs algorithm
Fix $u_0 = v_0 \in \ell^2(K)$, $\tau_0 = 1$, and define, for every $m \in \mathbb{N}$: for all $k \in K$,
$$\langle w_m, e_k\rangle = \langle v_m, e_k\rangle - \frac{2\gamma}{n\lambda} \sum_{i=1}^n \bigl( f_{v_m}(x_i) - y_i \bigr)\, \varphi_k(x_i),$$
$$\langle u_{m+1}, e_k\rangle = \operatorname{prox}_{\gamma g_k}\bigl( \langle w_m, e_k\rangle \bigr) = \pi_{C_k}\Bigl( \operatorname{prox}_{\gamma\eta_k|\cdot|^r}\bigl( \operatorname{soft}_{\gamma D_k}(\langle w_m, e_k\rangle) \bigr) \Bigr),$$
and, in the inexact version, with an error term accounting for the inexact computation of $\operatorname{prox}_{\gamma\eta_k|\cdot|^r}$,
$$\langle u_{m+1}, e_k\rangle = \pi_{C_k}\Bigl( \operatorname{prox}_{\gamma\eta_k|\cdot|^r}\bigl( \operatorname{soft}_{\gamma D_k}(\langle w_m, e_k\rangle) \bigr) + \alpha_{m,k} \Bigr);$$
then
$$\tau_{m+1} = \frac{1 + \sqrt{1 + 4\tau_m^2}}{2}, \qquad v_{m+1} = u_{m+1} + \frac{\tau_m - 1}{\tau_{m+1}}\,(u_{m+1} - u_m).$$
(Nesterov '83; Güler '92; Beck and Teboulle '09; Villa, Salzo et al. '13)
47 an inexact accelerated fbs algorithm
where:
- $0 < \gamma < \lambda/(2\kappa^2)$;
- $(\alpha_{m,k})_{(m,k)\in\mathbb{N}\times K}$ is a double sequence in $\mathbb{R}$ that accounts for errors in the computation of the proximity operator $\operatorname{prox}_{\gamma\eta_k|\cdot|^r}$ and, for some $c > 0$, $p \in \left]3/2, +\infty\right[$ and $(\nu_k)_{k\in K} \in \ell^1(K)$, satisfies the condition
$$|\alpha_{m,k}| \le \min\Biggl\{ \frac{c\, m^{-2p}\, \nu_k}{2\gamma\eta_k r\, \bigl(|\langle w_m, e_k\rangle| + 1\bigr)^{r-1}},\ \frac{\bigl|\operatorname{soft}_{\gamma D_k}(\langle w_m, e_k\rangle)\bigr|}{1 + r\gamma\eta_k},\ \Bigl( \frac{\bigl|\operatorname{soft}_{\gamma D_k}(\langle w_m, e_k\rangle)\bigr|}{1 + r\gamma\eta_k} \Bigr)^{1/(r-1)} \Biggr\}.$$
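A compact numerical sketch of this iteration for a finite dictionary stored as a design matrix. All names and data are illustrative; the proximity operator is passed in as a callable and computed exactly here, i.e. the error sequence $(\alpha_{m,k})$ is simply set to zero rather than handled as in the condition above.

```python
import numpy as np

def accelerated_fbs(Phi, y, lam, prox_gk, gamma, n_iter=200):
    """Accelerated forward-backward sketch.  Phi is the (n, K) matrix with
    entries phi_k(x_i), y the targets, and prox_gk(xi, gamma) the scalar
    proximity operator of gamma * g_k, applied coordinatewise."""
    n, K = Phi.shape
    u = np.zeros(K)
    v = u.copy()
    tau = 1.0
    for _ in range(n_iter):
        # forward step: gradient of (1/lambda) * empirical risk at v_m
        w = v - (2.0 * gamma / (n * lam)) * (Phi.T @ (Phi @ v - y))
        # backward step: coordinatewise proximity operator (exact in this sketch)
        u_next = np.array([prox_gk(wk, gamma) for wk in w])
        # Nesterov extrapolation
        tau_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * tau * tau))
        v = u_next + ((tau - 1.0) / tau_next) * (u_next - u)
        u, tau = u_next, tau_next
    return u

# toy run with g_k = sigma_D for D = [-1, 1], i.e. plain soft thresholding
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 8)) / np.sqrt(8.0)
y = Phi[:, 0] + 0.1 * rng.standard_normal(50)
soft = lambda xi, g: np.sign(xi) * max(abs(xi) - g, 0.0)
u_hat = accelerated_fbs(Phi, y, lam=0.1, prox_gk=soft, gamma=0.01)
```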
48-50 proximal thresholding: the role of the exponent r
[Plots of $\operatorname{prox}_\varphi(\xi)$: soft thresholding (olive) compared with the prox of $\varphi = |\cdot| + 0.9\,|\cdot|^r$ for $r = 2$ (red), $r = 3/2$ (purple) and $r = 4/3$ (blue).]
51 the proximity operator of powers
Let $r \in \left]1, 2\right]$, let $\tau > 0$, let $\mu \in \mathbb{R}$, and consider $\operatorname{prox}_{\tau|\cdot|^r}\colon \mathbb{R} \to \mathbb{R}$. Then
$$\operatorname{prox}_{\tau|\cdot|^r}(\mu) = \operatorname{sign}(\mu)\,\xi, \quad\text{where } \xi \ge 0 \text{ and } \xi + r\tau\,\xi^{r-1} = |\mu|.$$
There are several exponents $r$ for which the equation can be solved explicitly, e.g. $r \in \{3/2, 4/3, 5/4\}$. However, in general this is not possible and one needs to rely on iterative methods, e.g. bisection.
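A sketch of the bisection route for a general exponent $r \in \left]1, 2\right]$ (the left-hand side of the scalar equation is increasing in $\xi$ on $[0, |\mu|]$, so bisection applies); tolerance and iteration cap are arbitrary choices.

```python
from math import copysign

def prox_power(mu, tau, r, tol=1e-12, max_iter=200):
    """prox_{tau*|.|^r}(mu): returns sign(mu)*xi, where xi >= 0 solves
    xi + r*tau*xi**(r-1) = |mu|.  The left-hand side equals 0 at xi = 0 and
    is >= |mu| at xi = |mu|, so bisection on [0, |mu|] works."""
    a, b = 0.0, abs(mu)
    xi = 0.0
    for _ in range(max_iter):
        xi = 0.5 * (a + b)
        h = xi + r * tau * xi ** (r - 1) - abs(mu)
        if abs(h) <= tol:
            break
        a, b = (a, xi) if h > 0 else (xi, b)
    return copysign(xi, mu)

print(prox_power(1.0, 1.0, 1.5))   # ~0.25, matching the explicit formula for r = 3/2
```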
52 the proximity operator of powers
Moreover, the following decomposition rule holds (Combettes and Pesquet '07):
$$g_k = \iota_{C_k} + \eta_k|\cdot|^r + \sigma_{D_k} \;\Longrightarrow\; \operatorname{prox}_{\gamma g_k}(\mu) = \pi_{C_k}\Bigl( \operatorname{prox}_{\gamma\eta_k|\cdot|^r}\bigl( \operatorname{soft}_{\gamma D_k}(\mu) \bigr) \Bigr).$$
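A self-contained sketch of this decomposition for the explicitly solvable exponent $r = 3/2$, where the scalar equation reduces to a quadratic in $\sqrt{\xi}$; the interval endpoints below are illustrative placeholders, not values from the slides.

```python
from math import sqrt, copysign

def prox_power_3_2(mu, tau):
    """Closed-form prox of tau*|.|^(3/2): with s = sqrt(xi), the equation
    xi + (3*tau/2)*sqrt(xi) = |mu| becomes s**2 + (3*tau/2)*s - |mu| = 0."""
    s = (-1.5 * tau + sqrt((1.5 * tau) ** 2 + 4.0 * abs(mu))) / 2.0
    return copysign(s * s, mu)

def prox_gk(mu, gamma, C=(-2.0, 2.0), D=(-1.0, 1.0), eta=0.9):
    """Decomposition rule: prox_{gamma*g_k} = proj_C o prox_{gamma*eta*|.|^(3/2)} o soft_{gamma*D}."""
    lo, hi = gamma * D[0], gamma * D[1]
    s = mu - hi if mu > hi else (mu - lo if mu < lo else 0.0)   # soft_{gamma*D}(mu)
    s = prox_power_3_2(s, gamma * eta)                          # prox of gamma*eta*|.|^(3/2)
    return min(max(s, C[0]), C[1])                              # projection onto C
```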
53 the proximity operator of powers
In the presence of errors, one actually computes
$$\pi_{C_k}\Bigl( \operatorname{prox}_{\gamma\eta_k|\cdot|^r}\bigl( \operatorname{soft}_{\gamma D_k}(\mu) \bigr) + \alpha_{m,k} \Bigr) \approx \operatorname{prox}_{\gamma g_k}(\mu).$$
The question is how to adjust the decomposition rule in the presence of errors.
54-56 a notion of inexact proximity operator
$$u \approx_\delta \operatorname{prox}_{\eta\varphi}(w) \;\overset{\text{def}}{\iff}\; u \in \frac{\delta^2}{2\eta}\text{-}\operatorname*{argmin}_{\xi\in\mathcal H} \Bigl\{ \varphi(\xi) + \frac{1}{2\eta}\,\|\xi - w\|_{\mathcal H}^2 \Bigr\}.$$
Lemma. Let $\psi, \phi, \sigma \in \Gamma_0^+(\mathbb{R})$ with $\psi(0) = 0$ and $\sigma$ positively homogeneous. Then
- $s = \operatorname{prox}_{\vartheta|\cdot|^r}(\mu) + \alpha \;\Longrightarrow\; s \approx_\delta \operatorname{prox}_{\vartheta|\cdot|^r}(\mu)$;
- $s \approx_\delta \operatorname{prox}_\psi(\operatorname{prox}_\sigma(\mu)) \;\Longrightarrow\; s \approx_\delta \operatorname{prox}_{\psi+\sigma}(\mu)$;
- $s \approx_\delta \operatorname{prox}_\phi(\mu)$ and $p = \pi_C(s) \;\Longrightarrow\; p \approx_\delta \operatorname{prox}_{\iota_C+\phi}(\mu)$.
Applying this to $g_k = \iota_{C_k} + \eta_k|\cdot|^r + \sigma_{D_k}$:
$$\left.\begin{aligned} s &= \operatorname{prox}_{\gamma\eta_k|\cdot|^r}\bigl( \operatorname{soft}_{\gamma D_k}(\mu) \bigr) + \alpha_{m,k} \\ p &= \pi_{C_k}(s) \end{aligned}\right\} \;\Longrightarrow\; p \approx_{\delta_{m,k}} \operatorname{prox}_{\gamma g_k}(\mu).$$
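A small numerical illustration of the definition above (added here as a sanity check, not from the slides): for a perturbed output $u = \operatorname{prox}_{\eta\varphi}(w) + \alpha$ one can read off the corresponding $\delta$ from the gap in objective value, $\delta = \sqrt{2\eta\,(F(u) - F(u^\star))}$, taking $\varphi = |\cdot|$ so that the exact prox is plain soft thresholding.

```python
from math import sqrt

def objective(xi, w, eta):
    """F(xi) = phi(xi) + (1/(2*eta)) * (xi - w)**2 with phi = |.|."""
    return abs(xi) + (xi - w) ** 2 / (2.0 * eta)

def delta_of_error(w, eta, alpha):
    """Smallest delta such that prox(w) + alpha is a delta-approximation of prox_{eta*phi}(w)."""
    exact = max(abs(w) - eta, 0.0) * (1.0 if w >= 0 else -1.0)   # soft thresholding
    gap = objective(exact + alpha, w, eta) - objective(exact, w, eta)
    return sqrt(2.0 * eta * gap)

print(delta_of_error(w=2.0, eta=1.0, alpha=0.1))   # ~0.1: small alpha gives small delta
```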
57 the theorem of convergence
Theorem. Let $\hat F_n(u) = \frac{1}{n}\sum_{i=1}^n (f_u(x_i) - y_i)^2$ and $G(u) = \sum_{k\in K} g_k(\langle u, e_k\rangle)$. Then there exists a vanishing sequence $(\delta_m)_{m\in\mathbb{N}}$, with $\delta_m \le C m^{-p}$, such that
$$u_{m+1} \approx_{\delta_m} \operatorname{prox}_{\gamma G}\Bigl( v_m - \frac{\gamma}{\lambda}\,\nabla \hat F_n(v_m) \Bigr), \qquad v_{m+1} = u_{m+1} + \frac{\tau_m - 1}{\tau_{m+1}}\,(u_{m+1} - u_m).$$
Hence it follows from (Villa, Salzo et al. '13) that the sequence $(u_m)_{m\in\mathbb{N}}$ is minimizing for $\hat F_n + \lambda G$ and $\|u_m - \hat u_{n,\lambda}(z_n)\|_r \to 0$ as $m \to +\infty$.
58 further developments
59-62 a more general framework
The statistical analysis can be pursued in more general terms (Combettes, Salzo and Villa '14):
1. The functional model is given by means of a linear operator $A\colon \mathcal F \to \mathcal M(X, Y)\colon u \mapsto f_u$, continuous w.r.t. the topology of pointwise convergence, where $\mathcal F$ (the feature space) and $Y$ are Banach spaces. This way $\operatorname{ran}(A)$ is a pre-RKBS of vector-valued functions.
2. The constraint set is $C = \{ f \in \mathcal M(X, Y) : (\forall x \in X)\ f(x) \in C(x) \}$, where, for every $x \in X$, $C(x) \subset Y$ is closed and convex.
3. The risk is based on a general loss function: $R(f) = \int_{X\times Y} \ell(x, y, f(x))\, dP(x, y)$.
4. $G\colon \mathcal F \to \left]-\infty, +\infty\right]$ is totally convex, meaning that, for every $t > 0$ and every $u \in \operatorname{dom} G$,
$$0 < \psi_G(u; t) = \inf\bigl\{ G(v) - G(u) - G'(u; v - u) : v \in \operatorname{dom} G,\ \|u - v\| = t \bigr\}.$$
63 a more general framework
- We characterized the property of universality w.r.t. $C$, i.e., when $\inf_C R = \inf_{C\,\cap\,\operatorname{ran}(A)} R$.
- We proved a representer and sensitivity theorem: there exists a measurable function $\Psi_\lambda\colon X\times Y \to \mathcal F$ such that $\mathbb{E}_P[\Psi_\lambda] \in \lambda\,\partial G(u_\lambda)$ and, for every $z_n \in (X\times Y)^n$,
$$\frac{\psi_G\bigl(u_\lambda;\ \|\hat u_{n,\lambda}(z_n) - u_\lambda\|\bigr)}{\|\hat u_{n,\lambda}(z_n) - u_\lambda\|} \le \frac{1}{\lambda}\,\Bigl\| \frac{1}{n}\sum_{i=1}^n \Psi_\lambda(x_i, y_i) - \mathbb{E}_P[\Psi_\lambda] \Bigr\|.$$
- We found conditions on the sequence $(\lambda_n)_{n\in\mathbb{N}}$ ensuring consistency, which depend on the modulus of total convexity of $G$, the Rademacher type of $\mathcal F$, and the local Lipschitz constants of the loss function.
64 an alternative approach
When restricted to real-valued functions, an alternative approach is to use Rademacher complexities.
1. In this case the statistical analysis achieves sharper results when dealing with general loss functions.
2. Open problem: Rademacher complexity for (reproducing kernel) classes of functions taking values in infinite-dimensional spaces. A contraction principle is missing in that scenario.
65 Thank you
66 some references
P. L. Combettes and J.-C. Pesquet, Proximal thresholding algorithm for minimization over orthonormal bases, SIAM J. Optim., vol. 18.
P. L. Combettes, S. Salzo and S. Villa, Consistency of regularized learning schemes in Banach spaces, arXiv.
P. L. Combettes, S. Salzo and S. Villa, Consistent learning by composite proximal thresholding, arXiv.
S. Villa, S. Salzo, L. Baldassarre and A. Verri, Accelerated and inexact forward-backward algorithms, SIAM J. Optim.
Z.-B. Xu and G. F. Roach, Characteristic inequalities of uniformly convex and uniformly smooth Banach spaces, J. Math. Anal. Appl., vol. 157.