Inverse Statistical Learning


1 Inverse Statistical Learning: Minimax theory, adaptation and algorithm
With (in order of appearance): C. Marteau, M. Chichignoud, C. Brunet and S. Souchet.
Dijon, January 15, 2014.

2 The problem of Inverse Statistical Learning
Given $(X, Y) \sim P$ on $\mathcal{X} \times \mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell : \mathcal{G} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$, we aim at:
$$g^* \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g, (X, Y)).$$

3 The problem of Inverse Statistical Learning
Given $(X, Y) \sim P$ on $\mathcal{X} \times \mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell : \mathcal{G} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$, we aim at:
$$g^* \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g, (X, Y)),$$
from an indirect sequence of i.i.d. observations $(Z_1, Y_1), \ldots, (Z_n, Y_n)$, where $Z_i \sim Af$, $A$ is a linear compact operator (and $X \sim f$).

4 Statistical Learning with errors in variables
Given $(X, Y) \sim P$ on $\mathcal{X} \times \mathcal{Y}$, a class $\mathcal{G}$ and a loss function $\ell : \mathcal{G} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}_+$, we aim at:
$$g^* \in \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g, (X, Y)),$$
from a noisy sequence of i.i.d. observations $(X_1 + \epsilon_1, Y_1), \ldots, (X_n + \epsilon_n, Y_n)$, where $Z_i = X_i + \epsilon_i \sim f * \eta$ and $\eta$ is the density of the i.i.d. sequence $(\epsilon_i)_{i=1}^n$.

5 Statistical Learning with errors in variables (cont.)
Same setting as above, with:
- $\mathcal{Y} = \mathbb{R}$: regression with errors in variables,
- $\mathcal{Y} = \{1, \ldots, M\}$: classification with errors in variables,
- $\mathcal{Y} = \emptyset$: unsupervised learning with errors in variables.
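As a concrete illustration of this observation scheme (my own toy simulation, not from the slides), the learner only sees $(Z_i, Y_i)$ with $Z_i = X_i + \epsilon_i$; all distributions, the noise level and the sample size below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Direct (unobservable) data: labels Y and a one-dimensional design X.
Y = rng.integers(0, 2, size=n)                      # Y in {0, 1}
X = rng.normal(loc=3.0 * Y, scale=1.0, size=n)      # X | Y = y ~ N(3y, 1)

# Measurement error with density eta (Gaussian here, an arbitrary choice).
eps = rng.normal(loc=0.0, scale=0.8, size=n)

# What is actually observed in inverse statistical learning:
Z = X + eps                                          # noisy inputs, law f * eta given Y
sample = list(zip(Z, Y))                             # (Z_1, Y_1), ..., (Z_n, Y_n)
print(sample[:3])
```

Any procedure trained naively on $(Z_i, Y_i)$ targets the convolved design density $f * \eta$ rather than $f$, which is exactly the failure quantified on the ERM slides below.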

6 Toy example (I)
[Figure: direct dataset (unobservable) vs. observations (available).]

7 Toy example (II)
[Figure: direct dataset (unobservable) vs. observations (available).]

8 Real-world example in oncology (I)
Fig. 1: the same tumor observed by two radiologists,
$$Z_{ij} = X_i + \epsilon_{ij}, \quad j \in \{1, 2\}.$$

9 Real-world example in oncology (II)
Fig. 1: batch effect in a micro-array dataset (J. A. Gagnon-Bartsch, L. Jacob and T. P. Speed, 2013).

10 Contents
1. Minimax rates in discriminant analysis
2. Excess risk bound
3. The algorithm of noisy k-means
(4.) Adaptation

11 Origin: a minimax motivation (with C. Marteau)
Direct case: density estimation $n^{-\frac{2\gamma}{2\gamma+1}}$; classification $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$.
Noisy case: density estimation $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$; classification: ???
Assumptions: $f \in \Sigma(\gamma, L)$, resp. $x \mapsto \mathbb{P}(Y = 1 \mid X = x) \in \Sigma(\gamma, L)$; margin parameter $\alpha \geq 0$; $|\mathcal{F}[\eta](t)| \approx |t|^{-\beta}$, resp. $|\mathcal{F}[\eta_j](t_j)| \approx |t_j|^{-\beta_j}$, $j = 1, \ldots, d$.

12 Mammen and Tsybakov (1999)
Given two densities $f$ and $g$, for any $G \subset K$, the Bayes risk is defined as:
$$R_K(G) = \frac{1}{2}\left[\int_{K \setminus G} f\, dQ + \int_{G} g\, dQ\right].$$
Given $X_1^1, \ldots, X_n^1 \sim f$ and $X_1^2, \ldots, X_n^2 \sim g$, we aim at:
$$G^* = \arg\min_{G \in \mathcal{G}} R_K(G).$$
Goal: to obtain minimax fast rates
$$r_n(\mathcal{F}) \approx \inf_{\hat G} \sup_{(f,g) \in \mathcal{F}} \mathbb{E}\, d(\hat G, G^*), \quad d \in \{d_{f,g}, d_{\Delta}\}.$$

13 Mammen and Tsybakov (1999) with errors in variables
We observe $Z_1^1, \ldots, Z_n^1$ and $Z_1^2, \ldots, Z_n^2$ such that:
$$Z_i^1 = X_i^1 + \epsilon_i^1 \quad \text{and} \quad Z_i^2 = X_i^2 + \epsilon_i^2, \quad i = 1, \ldots, n,$$
where $X_i^1 \sim f$, $X_i^2 \sim g$, and the $\epsilon_i^j$ are i.i.d. with density $\eta$.
Goal: to obtain minimax fast rates
$$r_n(\mathcal{F}, \beta) \approx \inf_{\hat G} \sup_{(f,g) \in \mathcal{F}} \mathbb{E}\, d(\hat G, G^*), \quad d \in \{d_{f,g}, d_{\Delta}\}.$$

14 ERM approach
ERM principle in the direct case:
$$\frac{1}{2n}\sum_{i=1}^n \mathbb{1}_{X_i^1 \in G^C} + \frac{1}{2n}\sum_{i=1}^n \mathbb{1}_{X_i^2 \in G} \longrightarrow R_K(G).$$
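To make the direct-case ERM concrete, here is a toy example of my own (not from the talk): $f = \mathcal{N}(0,1)$, $g = \mathcal{N}(2,1)$ and the class of half-lines $G_t = (-\infty, t]$, so the empirical risk above becomes a one-dimensional function of the threshold $t$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Two direct samples: X^1 ~ f = N(0, 1), X^2 ~ g = N(2, 1) (arbitrary choices).
X1 = rng.normal(0.0, 1.0, n)
X2 = rng.normal(2.0, 1.0, n)

# Candidate sets G_t = (-inf, t]; the empirical risk of the slide is
#   R_n(G_t) = (1/2n) #{X^1_i not in G_t} + (1/2n) #{X^2_i in G_t}.
ts = np.linspace(-3.0, 5.0, 801)
risk = np.array([np.mean(X1 > t) / 2 + np.mean(X2 <= t) / 2 for t in ts])

t_hat = ts[np.argmin(risk)]
print(f"empirical threshold {t_hat:.2f}, Bayes boundary t* = 1 (where f = g)")
```

The minimizer concentrates around the Bayes boundary $t^* = 1$, i.e. the frontier of $G^* = \{f \geq g\}$.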

15 ERM approach
ERM principle in this model fails:
$$\frac{1}{2n}\sum_{i=1}^n \mathbb{1}_{Z_i^1 \in G^C} + \frac{1}{2n}\sum_{i=1}^n \mathbb{1}_{Z_i^2 \in G} \longrightarrow \frac{1}{2}\left[\int_{G^C} f * \eta + \int_{G} g * \eta\right] \neq R_K(G).$$

16 ERM approach
ERM principle in this model fails (as above). Solution: define
$$R_n^\lambda(G) = \frac{1}{2}\left[\int_{G^C} \hat f_n^\lambda(x)\, dx + \int_{G} \hat g_n^\lambda(x)\, dx\right],$$
where $(\hat f_n^\lambda, \hat g_n^\lambda)$ are estimators of $(f, g)$ of the form:
$$\hat f_n^\lambda(x) = \frac{1}{n\lambda}\sum_{i=1}^n \mathcal{K}\left(\frac{Z_i^1 - x}{\lambda}\right).$$

17 Details
$Z_1^1, \ldots, Z_n^1$ i.i.d. $\sim f * \eta$ and $Z_1^2, \ldots, Z_n^2$ i.i.d. $\sim g * \eta$. We consider:
$$R_n^\lambda(G) = \frac{1}{2}\left[\int_{G^C} \hat f_n^\lambda(x)\, dx + \int_{G} \hat g_n^\lambda(x)\, dx\right],$$
where $\hat f_n^\lambda$ and $\hat g_n^\lambda$ are deconvolution kernel estimators. Then:
$$R_n^\lambda(G) = \frac{1}{2n}\left[\sum_{i=1}^n h_{G^C}^\lambda(Z_i^1) + \sum_{i=1}^n h_{G}^\lambda(Z_i^2)\right],$$
where:
$$h_G^\lambda(z) = \int_G \frac{1}{\lambda}\, \mathcal{K}\left(\frac{z - x}{\lambda}\right) dx = \mathbb{1}_G * \mathcal{K}_\lambda(z).$$
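Below is a minimal numerical sketch (my own, one-dimensional) of the deconvolution kernel estimator $\hat f_n^\lambda$ used in $R_n^\lambda$, written directly in the Fourier domain with a sinc kernel (so $\mathcal{F}[\mathcal{K}] = \mathbb{1}_{[-1,1]}$) and Gaussian measurement error; the kernel, bandwidth and quadrature are illustrative assumptions, not the talk's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, lam = 2000, 0.3, 0.2            # sample size, noise std, bandwidth: arbitrary choices

# Direct data from a two-component Gaussian mixture, observed with N(0, s^2) errors.
X = np.where(rng.random(n) < 0.5, rng.normal(0.0, 1.0, n), rng.normal(5.0, 1.0, n))
Z = X + rng.normal(0.0, s, n)

def deconv_density(x_grid, Z, lam, s, m=512):
    """Deconvolution kernel estimate with a sinc kernel (F[K] = 1 on [-1, 1]):
    f_hat(x) = (1/2pi) * int_{|t| <= 1/lam} exp(-itx) * phi_n(t) / phi_eps(t) dt."""
    t = np.linspace(-1.0 / lam, 1.0 / lam, m)
    phi_n = np.exp(1j * np.outer(t, Z)).mean(axis=1)        # empirical char. function of Z
    phi_eps = np.exp(-0.5 * (s * t) ** 2)                   # char. function of N(0, s^2)
    integrand = np.exp(-1j * np.outer(x_grid, t)) * (phi_n / phi_eps)[None, :]
    return np.real(np.trapz(integrand, t, axis=1)) / (2.0 * np.pi)

grid = np.linspace(-4.0, 9.0, 200)
est = deconv_density(grid, Z, lam, s)
print(round(grid[np.argmax(est)], 2))                       # should lie near a mixture mode (0 or 5)
```

Integrating such estimates over $G$ and $G^C$ (for the two samples) gives the deconvoluted empirical risk $R_n^\lambda(G)$ of this slide.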

18 Vapnik's bound ($\epsilon = 0$)
The use of empirical processes comes from VC theory:
$$R_K(\hat G_n) - R_K(G^*) \leq R_K(\hat G_n) - R_n(\hat G_n) + R_n(G^*) - R_K(G^*) \leq 2 \sup_{G \in \mathcal{G}} |(R_n - R_K)(G)|.$$
Goal: to control uniformly the empirical process indexed by $\mathcal{G}$.

19 Vapnik's bound ($\epsilon = 0$) (cont.)
In ISL, the empirical process is indexed by the class $\{\mathbb{1}_G * \mathcal{K}_\lambda \,:\, G \in \mathcal{G}\}$.

20 Theorem 1: Upper bound (j.w. with C. Marteau)
Suppose $(f, g) \in \mathcal{G}(\alpha, \gamma)$ and $|\mathcal{F}[\eta](t)| \approx \prod_{i=1}^d |t_i|^{-\beta_i}$, $\beta_i > 1/2$, $i = 1, \ldots, d$. Consider a kernel $\mathcal{K}$ of order $\gamma$ which satisfies some properties. Then:
$$\limsup_{n \to +\infty} \sup_{(f,g) \in \mathcal{G}(\alpha,\gamma)} n^{\tau_d(\alpha,\beta,\gamma)}\, \mathbb{E}_{f,g}\, d(\hat G_n, G^*) < +\infty,$$
where
$$\tau_d(\alpha, \beta, \gamma) = \begin{cases} \dfrac{\gamma\alpha}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_\Delta, \\[2mm] \dfrac{\gamma(\alpha+1)}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_{f,g}, \end{cases}$$
and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:
$$\lambda_j = n^{-\frac{1}{\gamma(2+\alpha) + 2\sum_{i=1}^d \beta_i + d}}, \quad j \in \{1, \ldots, d\}.$$

21 Theorem 2: Lower bound (j.w. with C. Marteau)
Suppose $|\mathcal{F}[\eta](t)| \approx \prod_{i=1}^d |t_i|^{-\beta_i}$, $\beta_i > 1/2$, $i = 1, \ldots, d$. Then, for $\alpha \leq 1$,
$$\liminf_{n \to +\infty}\; \inf_{\hat G_n} \sup_{(f,g) \in \mathcal{G}(\alpha,\gamma)} n^{\tau_d(\alpha,\beta,\gamma)}\, \mathbb{E}_{f,g}\, d(\hat G_n, G^*) > 0,$$
where the infimum is taken over all possible estimators of the set $G^*$ and
$$\tau_d(\alpha, \beta, \gamma) = \begin{cases} \dfrac{\gamma\alpha}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_\Delta, \\[2mm] \dfrac{\gamma(\alpha+1)}{\gamma(2+\alpha) + d + 2\sum_{i=1}^d \beta_i} & \text{for } d = d_{f,g}. \end{cases}$$

22 Conclusion (minimax)
Direct case: density estimation $n^{-\frac{2\gamma}{2\gamma+1}}$; classification $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+d}}$.
Noisy case: density estimation $n^{-\frac{2\gamma}{2\gamma+2\beta+1}}$; classification $n^{-\frac{\gamma(\alpha+1)}{\gamma(\alpha+2)+2\bar\beta+d}}$, with $\bar\beta = \sum_{i=1}^d \beta_i$.
Assumptions: $f \in \Sigma(\gamma, L)$, resp. $x \mapsto \mathbb{P}(Y = 1 \mid X = x) \in \Sigma(\gamma, L)$; margin parameter $\alpha \geq 0$; $|\mathcal{F}[\eta](t)| \approx |t|^{-\beta}$, resp. $|\mathcal{F}[\eta_j](t_j)| \approx |t_j|^{-\beta_j}$, $j = 1, \ldots, d$.
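A quick consistency check (my own remark, it only uses the exponents displayed above): letting the ill-posedness indices vanish recovers the direct-case rates.

```latex
% Setting beta = 0 (resp. beta_i = 0 for all i) in the noisy-case exponents:
\[
\frac{2\gamma}{2\gamma + 2\beta + 1}\Big|_{\beta=0} = \frac{2\gamma}{2\gamma+1},
\qquad
\frac{\gamma(\alpha+1)}{\gamma(\alpha+2) + 2\bar\beta + d}\Big|_{\bar\beta=0}
  = \frac{\gamma(\alpha+1)}{\gamma(\alpha+2) + d},
\]
% so the noisy rates interpolate continuously to the direct minimax rates, and each
% unit of ill-posedness beta_i slows the rate only through the denominator.
```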

23 Sketch of the proofs, heuristic
1. Noisy quantization (for simplicity)
2. Excess risk decomposition
3. Bias control (easy and minimax)
4. Variance control: key lemma

24 Other results (I)
(Un)supervised classification with errors-in-variables:
$$R_\ell(\hat g_n^\lambda) - R_\ell(g^*) \leq C\, n^{-\frac{\kappa\gamma}{\gamma(2\kappa+\rho-1) + (2\kappa-1)\sum_{i=1}^d \beta_i}},$$
where $g^* = \arg\min_{g \in \mathcal{G}} \mathbb{E}_P\, \ell(g, (X, Y))$.

25 Other results (I) (cont.)
(Un)supervised classification with $Z_i \sim Af$, using
$$\hat f_n^N(x) = \sum_{k=1}^N \hat\theta_k\, \phi_k(x),$$
where $\hat\theta_k = b_k^{-1}\,\frac{1}{n}\sum_{i=1}^n \psi_k(Z_i)$, $A^*A\,\phi_k = b_k^2\,\phi_k$, and
$$f \in \Theta(\gamma, L) := \Big\{ f = \sum_{k \geq 1} \theta_k \phi_k \,:\, \sum_{k \geq 1} \theta_k^2\, k^{2\gamma+1} \leq L \Big\}.$$
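A minimal sketch of such a projection estimator in the particular case where $A$ is a periodic convolution on $[0,1)$, so the complex Fourier basis plays the role of $(\phi_k, \psi_k)$ and the $b_k$ are the known Fourier coefficients of the blur; the filter, truncation level $N$ and target density below are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, h = 5000, 8, 0.05                  # sample size, truncation level, blur half-width

# f: density on [0, 1) (a wrapped Gaussian mixture); observations Z ~ A f = blur * f.
X = np.mod(np.where(rng.random(n) < 0.5,
                    rng.normal(0.25, 0.05, n),
                    rng.normal(0.70, 0.08, n)), 1.0)
Z = np.mod(X + rng.uniform(-h, h, n), 1.0)

def b(k):
    """Fourier coefficient of the uniform blur on [-h, h]: E[exp(-2i pi k U)] = sinc(2 k h)."""
    return np.sinc(2 * k * h)            # np.sinc(x) = sin(pi x) / (pi x)

ks = np.arange(-N, N + 1)
# theta_hat_k = b_k^{-1} * (1/n) * sum_i psi_k(Z_i), with psi_k(z) = exp(-2i pi k z).
theta_hat = np.array([np.mean(np.exp(-2j * np.pi * k * Z)) / b(k) for k in ks])

x = np.linspace(0.0, 1.0, 400)
f_hat = np.real(theta_hat @ np.exp(2j * np.pi * np.outer(ks, x)))   # sum_k theta_hat_k phi_k(x)
print(round(x[np.argmax(f_hat)], 2))     # should sit near one of the modes, 0.25 or 0.70
```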

26 Other results (II)
If $f \in \Sigma(\vec\gamma, L)$, the anisotropic Hölder class:
$$R_\ell(\hat g_n^\lambda) - R_\ell(g^*) \leq C\, n^{-\frac{\kappa}{2\kappa+\rho-1+\epsilon(\kappa,\beta,\vec\gamma)}},$$
where:
$$\epsilon(\kappa, \beta, \vec\gamma) = (2\kappa-1)\sum_{j=1}^d \frac{\beta_j}{\gamma_j},$$
and $\lambda = (\lambda_1, \ldots, \lambda_d)$ is chosen as:
$$\lambda_j \approx n^{-\frac{2\kappa-1}{2\gamma_j\,(2\kappa+\rho-1+\epsilon(\kappa,\beta,\vec\gamma))}}, \quad j = 1, \ldots, d.$$

27 Other results (II) (cont.)
Same anisotropic bound as above, plus non-exact oracle inequalities:
$$R_\ell(\hat g) \leq (1+\epsilon)\, \inf_{g \in \mathcal{G}} R_\ell(g) + C(\epsilon)\, n^{-\frac{\gamma}{\gamma(1+\rho)+\sum_{i=1}^d \beta_i}},$$
without margin assumption.

28 Finite dimensional clustering
Given $k$, we aim at:
$$c^* \in \arg\min_{c=(c_1,\ldots,c_k) \in \mathbb{R}^{dk}} \mathbb{E} \min_{j=1,\ldots,k} \|X - c_j\|^2.$$
The empirical counterpart:
$$\hat c_n \in \arg\min_{c=(c_1,\ldots,c_k) \in \mathbb{R}^{dk}} \frac{1}{n}\sum_{i=1}^n \min_{j=1,\ldots,k} \|X_i - c_j\|^2,$$
gives rise to the popular k-means studied in (Pollard, 1982).

29 Finite dimensional noisy clustering (j.w. with C. Brunet)
We want to approximate a solution of the stochastic minimization:
$$\min_{c=(c_1,\ldots,c_k) \in \mathbb{R}^{dk}} \frac{1}{n}\sum_{i=1}^n \gamma_\lambda(c, Z_i),$$
where
$$\gamma_\lambda(c, z) = \int_K \min_{j=1,\ldots,k} \|x - c_j\|^2\, \mathcal{K}_\lambda(z - x)\, dx.$$

30 First order conditions (I)
Suppose $\|X\| \leq M$ and Pollard's regularity assumptions are satisfied. Then, for all $u \in \{1,\ldots,d\}$ and $j \in \{1,\ldots,k\}$:
$$c_{uj} = \frac{\sum_{i=1}^n \int_{V_j} x_u\, \mathcal{K}_\lambda(Z_i - x)\, dx}{\sum_{i=1}^n \int_{V_j} \mathcal{K}_\lambda(Z_i - x)\, dx} \;\Longleftrightarrow\; \frac{\partial J_n^\lambda(c)}{\partial c_{uj}} = 0,$$
where $J_n^\lambda(c) = \sum_{i=1}^n \gamma_\lambda(c, Z_i)$ and $V_j$ is the Voronoi cell of $c_j$.

31 First order conditions (II)
The standard k-means:
$$c_{u,j} = \frac{\sum_{i=1}^n X_{i,u}\, \mathbb{1}_{X_i \in V_j}}{\sum_{i=1}^n \mathbb{1}_{X_i \in V_j}} = \frac{\sum_{i=1}^n \int_{V_j} x_u\, \delta_{X_i}\, dx}{\sum_{i=1}^n \int_{V_j} \delta_{X_i}\, dx}, \quad \forall u, j,$$
where $\delta_{X_i}$ is the Dirac function at point $X_i$. Another look:
$$c_{u,j} = \frac{\int_{V_j} x_u\, \hat f_n(x)\, dx}{\int_{V_j} \hat f_n(x)\, dx}, \quad u \in \{1,\ldots,d\},\ j \in \{1,\ldots,k\},$$
where $\hat f_n(x) = \frac{1}{n}\sum_{i=1}^n \mathcal{K}_\lambda(Z_i - x)$ is the kernel deconvolution estimator of the density $f$.

32 The algorithm of Noisy k-means (j.w. with C. Brunet)
[Algorithm box displayed on the slide.]
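Here is a minimal sketch (my own reading of the first-order conditions above, not the talk's exact pseudo-code) of the resulting iteration: estimate the density on a grid and alternate Voronoi assignment of the grid points with density-weighted centroid updates $c_{u,j} = \int_{V_j} x_u \hat f / \int_{V_j} \hat f$. In noisy k-means the weights come from the deconvolution estimator $\hat f_n$; below the density estimate is simply passed in as an argument, and the grid, initialization and number of iterations are arbitrary choices.

```python
import numpy as np

def weighted_lloyd(grid, f_hat, k, n_iter=50, seed=0):
    """Lloyd-type iteration on a grid: each center is the f_hat-weighted mean of its
    Voronoi cell, i.e. c_uj = int_{V_j} x_u f_hat / int_{V_j} f_hat."""
    rng = np.random.default_rng(seed)
    w = np.clip(f_hat, 0.0, None)                     # deconvolution estimates can be negative
    centers = grid[rng.choice(len(grid), size=k, replace=False, p=w / w.sum())]
    for _ in range(n_iter):
        d2 = ((grid[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        cell = d2.argmin(axis=1)                      # Voronoi assignment of the grid points
        for j in range(k):
            mask = cell == j
            if w[mask].sum() > 0:
                centers[j] = (w[mask, None] * grid[mask]).sum(axis=0) / w[mask].sum()
    return centers

# Toy usage on a 2-d grid; in the noisy setting, replace `dens` by the deconvolution
# kernel estimate of f evaluated on the grid.
xs = np.linspace(-3.0, 8.0, 60)
gx, gy = np.meshgrid(xs, xs)
grid = np.column_stack([gx.ravel(), gy.ravel()])
dens = (np.exp(-((grid - [0.0, 0.0]) ** 2).sum(1) / 2) +
        np.exp(-((grid - [5.0, 0.0]) ** 2).sum(1) / 2))
print(np.round(weighted_lloyd(grid, dens, k=2), 1))   # centers near (0, 0) and (5, 0)
```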

33 Experimental setting: simulation study
1. We draw i.i.d. sequences $(X_i)_{i=1,\ldots,n}$ (Gaussian mixtures) and $(\epsilon_i)_{i=1}^n$ (symmetric noise), for $n \in \{100, 500\}$.
2. We draw $m$ repetitions $(\epsilon^j)_{j=1,\ldots,m}$ of the noise.
3. We compute the noisy k-means clusters $\hat c$, with an estimation step of $f * \eta$.
4. We calculate the clustering risk:
$$r_n(\hat c) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{X_i^j \notin V_j(\hat c)}.$$

34 Experimental setting, Model 1
For $u \in \{1,\ldots,10\}$, we call Mod1($u$) the model:
$$Z_i = X_i + \epsilon_i(u), \quad i = 1,\ldots,n,$$
where $(X_i)_{i=1}^n$ are i.i.d. with density $f = \frac12 f_{\mathcal{N}(0_2, I_2)} + \frac12 f_{\mathcal{N}((5,0)^T, I_2)}$ and $(\epsilon_i(u))_{i=1}^n$ are i.i.d. with law $\mathcal{N}(0_2, \Sigma(u))$, where $\Sigma(u)$ is a diagonal matrix with diagonal vector $(0, u)^T$, for $u \in \{1,\ldots,10\}$.
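A direct transcription of Mod1($u$) as a simulation (sample size, seed and the value of $u$ below are arbitrary):

```python
import numpy as np

def mod1(n, u, seed=0):
    """Draw (Z_i, X_i) from Mod1(u): X ~ 1/2 N(0, I_2) + 1/2 N((5, 0)^T, I_2),
    eps(u) ~ N(0, diag(0, u)), and Z = X + eps(u)."""
    rng = np.random.default_rng(seed)
    comp = rng.integers(0, 2, size=n)                        # mixture component
    X = rng.normal(size=(n, 2)) + np.column_stack([5.0 * comp, np.zeros(n)])
    eps = np.column_stack([np.zeros(n), rng.normal(scale=np.sqrt(u), size=n)])
    return X + eps, X

Z, X = mod1(n=500, u=9)
print(Z[:2].round(2))
```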

35 Illustrations Mod1
[Figure.]

36 Experimental setting, Model 2
For $u \in \{1,\ldots,10\}$, we call Mod2($u$) the model:
$$Z_i = X_i(u) + \epsilon_i, \quad i = 1,\ldots,n,$$
where $(X_i(u))_{i=1}^n$ are i.i.d. with density
$$f = \tfrac13 f_{\mathcal{N}(0_2, I_2)} + \tfrac13 f_{\mathcal{N}((a,b)^T, I_2)} + \tfrac13 f_{\mathcal{N}((b,a)^T, I_2)},$$
where $(a, b) = (15 - (u-1)/2,\ 5 + (u-1)/2)$ for $u \in \{1,\ldots,10\}$, and $(\epsilon_i)_{i=1}^n$ are i.i.d. with law $\mathcal{N}(0_2, \Sigma)$, where $\Sigma$ is a diagonal matrix with diagonal vector $(5, 5)^T$.

37 Illustrations Mod2
[Figure.]

38 Results Mod1 for n = 100
[Figure.]

39 Results Mod1 for n = 500
[Figure.]

40 Results Mod2
[Figure.]

41 Adaptation!
To get the optimal rates, we act as follows:
$$R(\hat c_{\lambda^*}, c^*) \leq \inf_{\lambda}\left\{ C_1\left(\frac{c(\lambda)}{n}\right)^{2/(1+\rho)} + C_2\, \lambda^{2\gamma} \right\} \leq C\, n^{-\frac{2\gamma}{2\gamma(1+\rho)+2\bar\beta}},$$
where $\lambda^* = O\big(n^{-\frac{1}{2\gamma(1+\rho)+2\bar\beta}}\big)$ and $\bar\beta = \sum_{i=1}^d \beta_i$.
Goal: to choose the bandwidth based on Lepski's principle.

42 Empirical Risk Comparison (j.w. with M. Chichignoud)
We choose $\lambda$ as follows:
$$\hat\lambda = \max\big\{\lambda \in \Lambda \,:\, R_n^{\lambda'}(\hat c_\lambda) - R_n^{\lambda'}(\hat c_{\lambda'}) \leq 3\,\delta_{\lambda'} \ \ \forall \lambda' \leq \lambda \big\},$$
where $\delta_\lambda$ is defined as:
$$\delta_\lambda = C_{\mathrm{adapt}}\, \lambda^{-2\bar\beta}\, \frac{\log n}{n},$$
where $C_{\mathrm{adapt}} > 0$ is an explicit constant.
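A minimal sketch of this selection rule over a finite grid $\Lambda$, assuming routines `erm(lam)` returning $\hat c_\lambda$ and `risk(lam, c)` returning $R_n^\lambda(c)$ are available (they stand for the noisy k-means objects of the previous slides and are not implemented here); the constant is illustrative.

```python
import numpy as np

def erc_select(lambdas, erm, risk, beta_bar, n, c_adapt=1.0):
    """Empirical Risk Comparison rule:
    lambda_hat = max{ lam in Lambda :
        R_n^{lam'}(c_hat_lam) - R_n^{lam'}(c_hat_lam') <= 3 * delta_lam' for all lam' <= lam },
    with delta_lam = c_adapt * lam**(-2*beta_bar) * log(n) / n."""
    lambdas = np.sort(np.asarray(lambdas, dtype=float))
    c_hats = [erm(lam) for lam in lambdas]
    delta = c_adapt * lambdas ** (-2 * beta_bar) * np.log(n) / n

    lam_hat, c_hat = lambdas[0], c_hats[0]           # the smallest bandwidth is always admissible
    for k in range(1, len(lambdas)):
        ok = all(risk(lambdas[j], c_hats[k]) - risk(lambdas[j], c_hats[j]) <= 3 * delta[j]
                 for j in range(k))                  # compare against every smaller bandwidth
        if ok:
            lam_hat, c_hat = lambdas[k], c_hats[k]
    return lam_hat, c_hat
```

With $\Lambda = \{\lambda_1, \lambda_2\}$ this reduces to the two-bandwidth rule $\hat\lambda = \lambda_1 \mathbb{1}_\Omega + \lambda_2 \mathbb{1}_{\Omega^C}$ used in the proof below.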

43 Adaptation: data-driven choices of $\lambda$
[Figure.]

44 Uniform law for $\epsilon$
[Figure.]

45 Adaptation: stability of the ICI method
[Figure.]

46 Real dataset: Iris
[Figure.]

47 Adaptation using Empirical Risk Comparison (ERC)
To get the optimal rates, we act as follows:
$$R(\hat c_{\lambda^*}, c^*) \leq \inf_{\lambda}\left\{ C_1\left(\frac{c(\lambda)}{n}\right)^{2/(1+\rho)} + C_2\, \lambda^{2\gamma} \right\} \leq C\, n^{-\frac{2\gamma}{2\gamma(1+\rho)+2\bar\beta}},$$
where $\lambda^* = O\big(n^{-\frac{1}{2\gamma(1+\rho)+2\bar\beta}}\big)$.
Goal: to choose the bandwidth based on Lepski's principle.

48 Lepski's method
$\{\hat f_h,\ h \in \mathcal{H}\}$: a family of (kernel) estimators, with associated (bandwidth) $h \in \mathcal{H} \subset \mathbb{R}_+$.
BV decomposition:
$$\|\hat f_h - f\| \leq C\{B(h) + V(h)\},$$
where (usually) $V(\cdot)$ is known.
Related to minimax theory:
$$f \in \Sigma(\gamma, L) \ \Rightarrow\ \|\hat f_{h(\gamma)} - f\| \leq C \inf_h \{B(h) + V(h)\} = C\,\psi_n(\gamma).$$
Goal: a data-driven method to reach the bias-variance trade-off (minimax adaptive method).

49 Lepski's method: the rule
The rule:
$$\hat h = \max\big\{h > 0 \,:\, \forall h' \leq h,\ \|\hat f_h - \hat f_{h'}\| \leq c\,V(h')\big\}.$$
Indeed,
$$\|\hat f_h - \hat f_{h'}\| \leq \|\hat f_h - f\| + \|f - \hat f_{h'}\| \leq B(h) + V(h) + B(h') + V(h') \lesssim B(h) + V(h') \quad (h' \leq h).$$
The rule selects the biggest $h > 0$ such that:
$$\sup_{h' \leq h} \frac{B(h) + V(h')}{V(h')} \leq c, \quad \text{i.e. } B(h) \leq (c-1)\,V(h') \ \ \forall h' \leq h.$$
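A minimal sketch of this rule for kernel density estimation at a point, with a Gaussian kernel and the usual variance proxy $V(h) \propto \sqrt{\log n/(nh)}$; the kernel, the constant $c$ and the bandwidth grid are illustrative assumptions.

```python
import numpy as np

def kde_at(x0, X, h):
    """Gaussian kernel density estimate at the point x0 with bandwidth h."""
    return np.mean(np.exp(-0.5 * ((x0 - X) / h) ** 2)) / (h * np.sqrt(2 * np.pi))

def lepski_bandwidth(x0, X, hs, c=2.0):
    """hat_h = max{ h : |f_hat_h(x0) - f_hat_h'(x0)| <= c * V(h') for all h' <= h },
    with the variance proxy V(h) = sqrt(log n / (n h))."""
    n = len(X)
    hs = np.sort(np.asarray(hs))
    est = np.array([kde_at(x0, X, h) for h in hs])
    V = np.sqrt(np.log(n) / (n * hs))
    h_hat = hs[0]
    for k in range(1, len(hs)):
        if all(abs(est[k] - est[j]) <= c * V[j] for j in range(k)):
            h_hat = hs[k]
    return h_hat

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, 2000)
print(round(lepski_bandwidth(0.0, X, hs=np.geomspace(0.02, 1.0, 25)), 3))
```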

50 Empirical Risk Comparison (j.w. with M. Chichignoud)
We choose $\lambda$ as follows:
$$\hat\lambda = \max\big\{\lambda \in \Lambda \,:\, R_n^{\lambda'}(\hat c_\lambda) - R_n^{\lambda'}(\hat c_{\lambda'}) \leq 3\,\delta_{\lambda'} \ \ \forall \lambda' \leq \lambda \big\},$$
where $\delta_\lambda$ is defined as:
$$\delta_\lambda = C_{\mathrm{adapt}}\, \lambda^{-2\bar\beta}\, \frac{\log n}{n},$$
where $C_{\mathrm{adapt}} > 0$ is an explicit constant.

51 Theorem 3: Adaptive upper bound (j.w. with M. Chichignoud)
Suppose $f \in \Sigma(\gamma, L)$, the noise assumption and Pollard's regularity assumptions are satisfied. Consider a kernel $\mathcal{K}$ of order $\gamma$ which satisfies the kernel assumption. Then:
$$\limsup_{n \to +\infty}\ \left(\frac{n}{\log n}\right)^{\frac{\gamma}{\gamma + \sum_{i=1}^d \beta_i}} \sup_{f \in \Sigma(\gamma, L)} \mathbb{E}\big[R(\hat c_{\hat\lambda}) - R(c^*)\big] < +\infty,$$
where
$$\hat c_\lambda = \arg\min_{c \in \mathcal{C}} \sum_{i=1}^n \ell_\lambda(c, Z_i),$$
and $\hat\lambda$ is chosen with the ERC rule.

52 Proof for $\lambda \in \{\lambda_1, \lambda_2\}$, $\lambda_1 < \lambda_2$

53 Proof for $\lambda \in \{\lambda_1, \lambda_2\}$, $\lambda_1 < \lambda_2$ (cont.)
The rule becomes:
$$\hat\lambda = \lambda_1\, \mathbb{1}_\Omega + \lambda_2\, \mathbb{1}_{\Omega^C}, \quad \text{where } \Omega = \big\{R_n^{\lambda_1}(\hat c_{\lambda_2}, \hat c_{\lambda_1}) > C\,\delta_{\lambda_1}\big\}.$$
(Here $R_n^{\lambda}(c, c')$ stands for the difference $R_n^{\lambda}(c) - R_n^{\lambda}(c')$, and likewise for $R^\lambda$ and $R$.)

54 Proof (cont.)
Two cases: Case 1: $\lambda^* = \lambda_1 < \lambda_2$. Case 2: $\lambda^* = \lambda_2 > \lambda_1$.

55 Proof, Case 1: $\lambda^* = \lambda_1 < \lambda_2$
$$\mathbb{E}\, R(\hat c_{\hat\lambda}, c^*) = \mathbb{E}\big[R(\hat c_{\hat\lambda}, c^*)\,(\mathbb{1}_\Omega + \mathbb{1}_{\Omega^C})\big] \leq \psi_n(\lambda^*) + \mathbb{E}\big[R(\hat c_{\hat\lambda}, c^*)\,\mathbb{1}_{\Omega^C}\big].$$

56 Proof, Case 1 (cont.)
On $\Omega^C$, we have with high probability:
$$R(\hat c_{\hat\lambda}, c^*) = (R - R^{\lambda^*})(\hat c_{\hat\lambda}, c^*) + (R^{\lambda^*} - R_n^{\lambda^*})(\hat c_{\hat\lambda}, c^*) + R_n^{\lambda^*}(\hat c_{\hat\lambda}, c^*) \leq B(\lambda^*) + (R^{\lambda^*} - R_n^{\lambda^*})(\hat c_{\hat\lambda}, c^*) + 3\,\delta_{\lambda^*}.$$

57 Proof, Case 1 (cont.)
Continuing, on $\Omega^C$ with high probability:
$$R(\hat c_{\hat\lambda}, c^*) \leq B(\lambda^*) + r_{\lambda^*}(2\log n) + 3\,\delta_{\lambda^*} \leq C\,\psi_n(\lambda^*),$$
where $r_\lambda(t)$ is defined by:
$$\mathbb{P}\Big(\sup_{c}\, (R_n^\lambda - R^\lambda)(c, c^*) \geq r_\lambda(t)\Big) \leq e^{-t}.$$

58 Proof, Case 2: $\lambda^* = \lambda_2 > \lambda_1$
$$\mathbb{E}\, R(\hat c_{\hat\lambda}, c^*) \leq \psi_n(\lambda^*) + \mathbb{E}\big[R(\hat c_{\hat\lambda}, c^*)\,\mathbb{1}_\Omega\big] \lesssim \psi_n(\lambda^*) + \mathbb{P}(\Omega),$$
where $\Omega = \big\{R_n^{\lambda_1}(\hat c_{\lambda_2}, \hat c_{\lambda_1}) > C\,\delta_{\lambda_1}\big\}$.

59 Proof, Case 2 (cont.)
$$R_n^{\lambda_1}(\hat c_{\lambda_2}, \hat c_{\lambda_1}) = (R_n^{\lambda_1} - R^{\lambda_1})(\hat c_{\lambda_2}, \hat c_{\lambda_1}) + (R^{\lambda_1} - R)(\hat c_{\lambda_2}, \hat c_{\lambda_1}) + R(\hat c_{\lambda_2}, \hat c_{\lambda_1}) \leq (R_n^{\lambda_1} - R^{\lambda_1})(\hat c_{\lambda_2}, \hat c_{\lambda_1}) + 2B(\lambda_1) + R(\hat c_{\lambda_2}, c^*).$$

60 Proof, Case 2 (cont.)
Since $B(\lambda_1) < B(\lambda_2) = B(\lambda^*) \leq \delta_{\lambda^*} = \delta_{\lambda_2} < \delta_{\lambda_1}$ and using Bousquet's inequality twice, we have with probability $1 - 2n^{-2}$:
$$R_n^{\lambda_1}(\hat c_{\lambda_2}, \hat c_{\lambda_1}) \leq 2\,r_{\lambda_1}(2\log n) + 2\,\delta_{\lambda_1} + B(\lambda_2) + \delta_{\lambda_2} \leq C\,\delta_{\lambda_1}.$$

61 ERC's extension
Consider a family of $\lambda$-ERM $\{\hat g_\lambda,\ \lambda > 0\}$. Assume:
1. There exists an increasing function $\mathrm{Bias}(\cdot)$ such that, for all $g \in \mathcal{G}$:
$$(R^\lambda - R)(g, g^*) \leq \mathrm{Bias}(\lambda) + \tfrac14\, R(g, g^*).$$
2. There exists a decreasing function $\mathrm{Var}_t(\cdot)$ ($t \geq 0$) such that, for all $\lambda, t > 0$:
$$\mathbb{P}\Big(\sup_{g \in \mathcal{G}}\Big\{(R_n^\lambda - R^\lambda)(g, g^*) - \tfrac14\, R(g, g^*)\Big\} > \mathrm{Var}_t(\lambda)\Big) \leq e^{-t}.$$
Then, there exists a universal constant $C_3$ such that, for all $t \geq 0$:
$$\mathbb{E}\, R(\hat g_{\hat\lambda}, g^*) \leq C_3\Big(\inf_\lambda\big\{\mathrm{Bias}(\lambda) + \mathrm{Var}_t(\lambda)\big\} + e^{-t}\Big).$$
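To connect this statement with Theorem 3, here is the balancing step written out under the scalings suggested by the clustering slides above, namely $\mathrm{Bias}(\lambda) \asymp \lambda^{2\gamma}$ and, for $t = 2\log n$, $\mathrm{Var}_t(\lambda) \asymp \lambda^{-2\bar\beta}\log n/n$; these scalings are my reading of the earlier slides, not stated on this one.

```latex
% Balancing Bias(lambda) and Var_{2 log n}(lambda):
\[
\lambda^{2\gamma} \asymp \lambda^{-2\bar\beta}\,\frac{\log n}{n}
\;\Longleftrightarrow\;
\lambda^* \asymp \Big(\frac{\log n}{n}\Big)^{\frac{1}{2\gamma+2\bar\beta}},
\qquad\text{so that}\qquad
\mathbb{E}\,R(\hat g_{\hat\lambda}, g^*)
\;\lesssim\; (\lambda^*)^{2\gamma} + n^{-2}
\;\asymp\; \Big(\frac{\log n}{n}\Big)^{\frac{\gamma}{\gamma+\bar\beta}},
\]
% which is the adaptive rate of Theorem 3 with bar(beta) = sum_i beta_i.
```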

62 Examples
Nonparametric estimation:
- Image denoising: $R_n^\lambda(f_t) = \sum_i (Y_i - f_t)^2\, K_\lambda(X_i - x_0)$.
- Local robust regression: $R_n^\lambda(t) = \sum_i \rho(Y_i - t)\, K_\lambda(X_i - x_0)$.
- Fitted local likelihood: $R_n^\lambda(\theta) = -\sum_i \log p(Y_i, \theta)\, K_\lambda(X_i - x_0)$.
Inverse Statistical Learning:
- Quantile estimation: $R_n^\lambda(q) = \sum_i \int (x - q)\,(\tau - \mathbb{1}_{x \leq q})\, \mathcal{K}_\lambda(Z_i - x)\, dx$.
- Learning principal curves: $R_n^\lambda(f) = \sum_i \int \inf_t \|x - f(t)\|^2\, \mathcal{K}_\lambda(Z_i - x)\, dx$.
- Binary classification: $R_n^\lambda(G) = \sum_i \int \mathbb{1}_{Y_i \neq \mathbb{1}(x \in G)}\, \mathcal{K}_\lambda(Z_i - x)\, dx$.
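As a tiny worked instance of the second example in the direct (nonparametric estimation) column, here is local robust regression at a point $x_0$ with the Huber loss, minimizing $R_n^\lambda(t)$ over a grid of values of $t$; the data, loss parameter and bandwidth are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, x0, lam = 400, 0.5, 0.1

# Regression data with heavy-tailed noise: Y = sin(2 pi X) + noise.
X = rng.uniform(0.0, 1.0, n)
Y = np.sin(2 * np.pi * X) + 0.2 * rng.standard_t(df=2, size=n)

def huber(r, delta=0.5):
    """Huber loss rho(r)."""
    return np.where(np.abs(r) <= delta, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

K = np.exp(-0.5 * ((X - x0) / lam) ** 2)                  # kernel weights K_lam(X_i - x0)
ts = np.linspace(Y.min(), Y.max(), 400)
risk = np.array([(huber(Y - t) * K).sum() for t in ts])   # R_n^lam(t) = sum_i rho(Y_i - t) K_lam(X_i - x0)

t_hat = ts[np.argmin(risk)]
print(round(t_hat, 2), "target m(x0) =", round(np.sin(2 * np.pi * x0), 2))
```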

63 Open problems
- Anisotropic case
- Margin adaptation
- Model selection

64 Conclusion
Thanks for your attention!
