Rémi Gribonval Inria Rennes - Bretagne Atlantique.


1 Rémi Gribonval Inria Rennes - Bretagne Atlantique remi.gribonval@inria.fr

2 Contributors & Collaborators Anthony Bourrier Nicolas Keriven Yann Traonmilin Gilles Puy Gilles Blanchard Mike Davies Tomer Peleg Patrick Perez 2

3 Agenda From Compressive Sensing to Compressive Learning? Compressive Clustering / Compressive GMM Information-preserving projections & sketches Conclusion 3

4 Machine Learning Available data: training collection of feature vectors = point cloud Goals: infer parameters to achieve a certain task; generalization to future samples with the same probability distribution Examples: PCA → principal subspace; Clustering → centroids; Dictionary learning → dictionary; Classification → classifier parameters (e.g. support vectors) 4

5 Challenging dimensions Point cloud = large matrix of feature vectors 5

6 Challenging dimensions Point cloud = large matrix of feature vectors x 1 5

7 Challenging dimensions Point cloud = large matrix of feature vectors x 1 x 2 5

8 Challenging dimensions Point cloud = large matrix of feature vectors x 1 x 2 x N 5

9 Challenging dimensions Point cloud = large matrix of feature vectors x 1 x 2 x N High feature dimension n Large collection size N 5

10 Challenging dimensions Point cloud = large matrix of feature vectors x 1 x 2 x N High feature dimension n Large collection size N Challenge: compress before learning? 5

11 Compressive Machine Learning? Point cloud = large matrix of feature vectors x 1 x 2 x N M Y = M y 1 y 2 y N 6

12 Compressive Machine Learning? Point cloud = large matrix of feature vectors x 1 x 2 x N M Y = M y 1 y 2 y N 6

13 Compressive Machine Learning? Point cloud = large matrix of feature vectors x 1 x 2 x N Reduce feature dimension [Calderbank & al 2009, Reboredo & al 2013] (Random) feature projection Exploits / needs a low-dimensional feature model Y = M y 1 y 2 y N 6
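
For intuition, the feature-projection route on this slide can be sketched in a few lines of numpy. This is only an illustration, assuming a Gaussian random matrix M and arbitrary sizes; it is not the specific operators of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, N = 128, 16, 1000          # feature dim, reduced dim, collection size (illustrative)
X = rng.standard_normal((n, N))  # point cloud: one column per feature vector

# Random Gaussian projection M (one common choice; the slide does not fix it)
M = rng.standard_normal((m, n)) / np.sqrt(m)

Y = M @ X                        # reduced features y_i = M x_i, collection size unchanged
print(X.shape, "->", Y.shape)    # (128, 1000) -> (16, 1000)
```

Note that the number of columns N is untouched by this operation, which is exactly the limitation raised on the next slides.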

14 Challenges of large collections Feature projection: limited impact Y = M 7

15 Challenges of large collections Feature projection: limited impact Y = M Big Data Challenge: compress collection size 7

16 Compressive Machine Learning? Point cloud = empirical probability distribution 8

17 Compressive Machine Learning? Point cloud = empirical probability distribution Reduce collection dimension: coresets, see e.g. [Agarwal & al 2003, Feldman 2010]; sketching & hashing, see e.g. [Thaper & al 2002, Cormode & al 2005] 8

18 Compressive Machine Learning? Point cloud = empirical probability distribution M Sketching operator: nonlinear in the feature vectors, linear in their probability distribution Reduce collection dimension: coresets, see e.g. [Agarwal & al 2003, Feldman 2010]; sketching & hashing, see e.g. [Thaper & al 2002, Cormode & al 2005] z ∈ R^m 8

20 Example: Compressive Clustering N = 1; n = 2 Sketching operator M, sketch z ∈ R^m, m = 6 Recovery algorithm → estimated centroids vs. ground truth [Figure: left, example of data and its sketch for n = 2; right, reconstruction quality (Hellinger distance) vs. sketch size m for n = 1] 9

21 Computational impact of sketching [Figures: computation time (s) and memory (bytes) vs. collection size N, for several values of K; compressive methods (sketching without distributed computing + CLOMP / CLOMPR / BS CGMM / CHS) store only the sketch and frequencies, while EM stores the full data] Ph.D. A. Bourrier & N. Keriven

22 The Sketch Trick Data distribution p(x) 11

23 The Sketch Trick Data distribution p(x) Machine Learning: method of moments, compressive learning (Probability space p → linear projection M → Sketch space z) Signal Processing: inverse problems, compressive sensing (Signal space x → linear projection M → Observation space y) 11

24 The Sketch Trick Data distribution p(x) Sketch: z_ℓ = ∫ h_ℓ(x) p(x) dx = E[h_ℓ(X)] ≈ (1/N) Σ_{i=1}^N h_ℓ(x_i) Machine Learning: method of moments, compressive learning (Probability space p → linear projection M → Sketch space z) Signal Processing: inverse problems, compressive sensing (Signal space x → linear projection M → Observation space y) 11

25 The Sketch Trick Data distribution p(x) Sketch: z_ℓ = ∫ h_ℓ(x) p(x) dx = E[h_ℓ(X)] ≈ (1/N) Σ_{i=1}^N h_ℓ(x_i) nonlinear in the feature vectors, linear in the distribution p(x), finite-dimensional Mean Map Embedding, cf Smola & al 2007, Sriperumbudur & al 2010 Machine Learning: method of moments, compressive learning (Probability space p → linear projection M → Sketch space z) Signal Processing: inverse problems, compressive sensing (Signal space x → linear projection M → Observation space y) 11
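
To make the sketching formula concrete, here is a minimal numpy implementation of the empirical sketch for a generic bank of feature functions h_ℓ, together with a check of the property highlighted on this slide: the sketch is nonlinear in the samples but linear in the empirical distribution. The particular feature functions (low-order moments) and names are illustrative assumptions, not the choices made in the talk.

```python
import numpy as np

def sketch(X, feature_maps):
    """Empirical sketch z_l = (1/N) sum_i h_l(x_i) for a list of feature maps h_l.

    X: (N, n) array, one feature vector per row.
    feature_maps: list of callables, each mapping an (N, n) array to an (N,) array.
    """
    return np.array([h(X).mean(axis=0) for h in feature_maps])

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

# Illustrative choice of moments as feature functions (not the talk's choice):
hs = [lambda X: X[:, 0], lambda X: X[:, 1], lambda X: (X ** 2).sum(axis=1)]
z = sketch(X, hs)

# Linearity in the empirical distribution: the sketch of a 50/50 mixture of two
# sub-collections equals the average of their sketches.
X1, X2 = X[:500], X[500:]
assert np.allclose(z, 0.5 * (sketch(X1, hs) + sketch(X2, hs)))
print(z)
```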

26 The Sketch Trick Information preservation? Data distribution p(x) Sketch: z_ℓ = ∫ h_ℓ(x) p(x) dx = E[h_ℓ(X)] ≈ (1/N) Σ_{i=1}^N h_ℓ(x_i) nonlinear in the feature vectors, linear in the distribution p(x), finite-dimensional Mean Map Embedding, cf Smola & al 2007, Sriperumbudur & al 2010 Machine Learning: method of moments, compressive learning (Probability space p → linear projection M → Sketch space z) Signal Processing: inverse problems, compressive sensing (Signal space x → linear projection M → Observation space y) 11

27 The Sketch Trick Dimension reduction? Data distribution p(x) Sketch: z_ℓ = ∫ h_ℓ(x) p(x) dx = E[h_ℓ(X)] ≈ (1/N) Σ_{i=1}^N h_ℓ(x_i) nonlinear in the feature vectors, linear in the distribution p(x), finite-dimensional Mean Map Embedding, cf Smola & al 2007, Sriperumbudur & al 2010 Machine Learning: method of moments, compressive learning (Probability space p → linear projection M → Sketch space z) Signal Processing: inverse problems, compressive sensing (Signal space x → linear projection M → Observation space y) 12

28 Compressive Learning (Heuristic) Examples

29 Compressive Machine Learning Point cloud = empirical probability distribution M Sketching operator, z ∈ R^m Reduce collection dimension ~ sketching: z_ℓ = (1/N) Σ_{i=1}^N h_ℓ(x_i), 1 ≤ ℓ ≤ m Choosing an information preserving sketch? 14

30 Example: Compressive Clustering Goal: find k centroids Standard approach = K-means Sketching approach: p(x) is spatially localized → need incoherent sampling → choose Fourier sampling: sample the characteristic function, z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i} ~ pooled Random Fourier Features, cf Rahimi & Recht 2007 Choose sampling frequencies ω_ℓ ∈ R^n Keriven & al, Sketching for Large-Scale Learning of Mixture Models (ICASSP 2016) 15
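
A small numpy illustration of the sampled empirical characteristic function (pooled random Fourier features) on synthetic clusters. Drawing the frequencies i.i.d. Gaussian with a fixed scale is just a simple heuristic for the example; the cited papers choose the frequency distribution more carefully, and all sizes below are illustrative.

```python
import numpy as np

def characteristic_sketch(X, W):
    """Sampled empirical characteristic function:
    z_l = (1/N) sum_i exp(j w_l^T x_i), one frequency w_l per row of W."""
    return np.exp(1j * X @ W.T).mean(axis=0)   # shape (m,)

rng = np.random.default_rng(0)

# Toy data: N points in R^n drawn around k centroids
n, N, k, m = 2, 10_000, 3, 60
centroids = rng.uniform(-5, 5, size=(k, n))
labels = rng.integers(k, size=N)
X = centroids[labels] + 0.3 * rng.standard_normal((N, n))

# Frequencies drawn i.i.d. Gaussian (a simple heuristic; the scale matters in practice)
W = rng.standard_normal((m, n))

z = characteristic_sketch(X, W)   # m complex numbers summarize the whole point cloud
print(z.shape, z[:3])
```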

34 Example: Compressive Clustering N = 1; n = 2 M = sampled characteristic function, sketch z ∈ R^m, m = 6 16

35 Example: Compressive Clustering Goal: find k centroids Density model = GMM with variance = identity: p ≈ Σ_{k=1}^K α_k p_k (ground truth) N = 1; n = 2 M = sampled characteristic function, sketch z ∈ R^m, m = 6 z = Mp ≈ Σ_{k=1}^K α_k M p_k 16

36 Example: Compressive Clustering Goal: find k centroids Density model = GMM with variance = identity: p ≈ Σ_{k=1}^K α_k p_k (ground truth) Recovery algorithm = decoder → estimated centroids N = 1; n = 2 M = sampled characteristic function, sketch z ∈ R^m, m = 6 z = Mp ≈ Σ_{k=1}^K α_k M p_k 16
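
The linear model z = Mp ≈ Σ_k α_k M p_k can be checked numerically, because for a Gaussian component with identity covariance the sketch has the closed form (M p_k)_ℓ = exp(j ω_ℓ^T μ_k - ‖ω_ℓ‖²/2), i.e. the characteristic function of N(μ_k, I). The snippet below compares the empirical sketch of GMM samples with this analytic sketch; the dimensions and frequency scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, N, m = 2, 3, 50_000, 60

mu = rng.uniform(-5, 5, size=(k, n))          # centroids
alpha = np.full(k, 1.0 / k)                    # mixture weights
W = 0.5 * rng.standard_normal((m, n))          # sampling frequencies (arbitrary scale)

# Empirical sketch of data drawn from the GMM with identity covariance
labels = rng.choice(k, size=N, p=alpha)
X = mu[labels] + rng.standard_normal((N, n))
z_emp = np.exp(1j * X @ W.T).mean(axis=0)

# Analytic sketch: the characteristic function of N(mu_k, I) at frequency w is
# exp(j w^T mu_k - ||w||^2 / 2), so z = sum_k alpha_k * (M p_k)
phases = np.exp(1j * mu @ W.T)                              # (k, m)
gauss = np.exp(-0.5 * np.sum(W ** 2, axis=1))               # (m,)
z_model = (alpha[:, None] * phases).sum(axis=0) * gauss

print(np.max(np.abs(z_emp - z_model)))   # small for large N
```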

37 Example: Compressive Clustering Goal: find k centroids Density model = GMM with variance = identity: p ≈ Σ_{k=1}^K α_k p_k (ground truth) Recovery algorithm = decoder: Compressive Learning OMP, similar to OMP with Replacement, Subspace Pursuit & CoSaMP → estimated centroids N = 1; n = 2 M = sampled characteristic function, sketch z ∈ R^m, m = 6 z = Mp ≈ Σ_{k=1}^K α_k M p_k 16
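
For intuition only, here is a toy OMP-flavoured decoder that greedily adds the candidate centroid whose normalised atom M p_θ is most correlated with the residual sketch, then re-fits the weights by least squares. It is deliberately not the Compressive Learning OMP / CLOMPR of the references, which search atoms continuously with gradient steps and include replacement and global refinement; the finite candidate set and all sizes are assumptions made for this sketch.

```python
import numpy as np

def atom(theta, W):
    """Normalised sketch of a unit-variance Gaussian centred at theta (atom M p_theta)."""
    a = np.exp(1j * W @ theta) * np.exp(-0.5 * np.sum(W ** 2, axis=1))
    return a / np.linalg.norm(a)

def toy_greedy_decoder(z, W, candidates, k):
    """OMP-flavoured decoder over a finite candidate set of centroids.

    Not the CLOMP/CLOMPR algorithms of the talk; this only illustrates the greedy
    'pick the atom most correlated with the residual' structure."""
    support, residual, coef = [], z.copy(), None
    for _ in range(k):
        scores = [np.abs(np.vdot(atom(c, W), residual)) for c in candidates]
        support.append(candidates[int(np.argmax(scores))])
        A = np.stack([atom(c, W) for c in support], axis=1)   # m x |support|
        coef, *_ = np.linalg.lstsq(A, z, rcond=None)          # least-squares weights
        residual = z - A @ coef
    return np.array(support), coef

# Tiny usage example on synthetic 2-D clusters
rng = np.random.default_rng(2)
k, n, N, m = 3, 2, 20_000, 50
mu = rng.uniform(-5, 5, size=(k, n))
X = mu[rng.integers(k, size=N)] + 0.3 * rng.standard_normal((N, n))
W = 0.7 * rng.standard_normal((m, n))
z = np.exp(1j * X @ W.T).mean(axis=0)

# Candidates = a subsample of the data points themselves (a simplification)
centroids, weights = toy_greedy_decoder(z, W, candidates=list(X[:500]), k=k)
print(np.round(centroids, 2))   # selected candidates, expected near the true centroids
```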

38 Example: Compressive GMM Goal: find k centroids Density model = GMM with variance = diagonal: p ≈ Σ_{k=1}^K α_k p_k (ground truth) Recovery algorithm = decoder: Compressive Hierarchical Splitting (CHS) = extension to general GMM → estimated centroids N = 1; n = 2 M = sampled characteristic function, sketch z ∈ R^m, m = 6 z = Mp ≈ Σ_{k=1}^K α_k M p_k 17

39 Application: Speaker Verification Claimed speaker identity Alice Bob speech utterance 18

40 Application: Speaker Verification Alice ~ 5 Gbytes ~ 1 hours of speech NIST SRE 2005 database EM or CHS UBM = Universal Background Model speech utterance 19

41 Application: Speaker Verification Alice EM or CHS UBM = Universal Background Model Speaker specific model speech utterance 2

42 Application: Speaker Verification Alice EM or CHS speech utterance 21

43 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech MFCC coefficients x_i ∈ R^12 N = 3 22

44 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech MFCC coefficients x_i ∈ R^12 N = 3 After silence detection N = 6 22

45 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech MFCC coefficients x_i ∈ R^12 N = 3 After silence detection N = 6 Maximum size manageable by EM N = 3 22

46 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech CHS MFCC coefficients x_i ∈ R^12 N = 3 After silence detection N = 6 Maximum size manageable by EM N = 3 for EM for CHS 23

47 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech CHS MFCC coefficients x_i ∈ R^12 N = 3 After silence detection N = 6 for CHS Maximum size manageable by EM N = 3 for EM 24

48 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech CHS m= fold compression 25

49 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech CHS m= fold compression m= fold compression 25

50 Application: Speaker Verification Results (DET-curves) ~ 5 Gbytes ~ 1 hours of speech CHS m= fold compression m= fold compression m= fold compression exploit whole collection improved performance 25

51 Computational Efficiency

52 Computational Aspects Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i} 27

54 Computational Aspects Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i} Compute h(W x_i) entrywise, with h(·) = e^{j(·)} and W the m × n matrix of frequencies 27

55 Computational Aspects Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i} Compute h(W x_i) entrywise, with h(·) = e^{j(·)}, then average over i to obtain z 27

56 Computational Aspects Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i} = average of h(W x_i), with h(·) = e^{j(·)} ~ One-layer random neural net; DNN ~ hierarchical sketching? see also [Bruna & al 2013, Giryes & al 2015] 27

57 Computational Aspects Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i} = average of h(W x_i), with h(·) = e^{j(·)} ~ One-layer random neural net; DNN ~ hierarchical sketching? see also [Bruna & al 2013, Giryes & al 2015] Privacy-preserving: sketch and forget 28

58 Computational Aspects Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i}, computed as the average of h(W x_i) Streaming algorithms: one pass; online update 29
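
The streaming property follows directly from the fact that the sketch is an average: it can be updated one sample at a time in O(m) memory without ever storing the data. A minimal sketch of such an online update (the class name and interface are hypothetical, not from the talk):

```python
import numpy as np

class StreamingSketch:
    """Running empirical characteristic-function sketch, updated one sample at a
    time; the data itself is never stored by the sketch."""

    def __init__(self, W):
        self.W = W                                    # (m, n) frequencies
        self.z = np.zeros(W.shape[0], dtype=complex)  # running sum
        self.count = 0

    def update(self, x):
        self.z += np.exp(1j * self.W @ x)
        self.count += 1

    @property
    def sketch(self):
        return self.z / max(self.count, 1)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 2))
X = rng.standard_normal((10_000, 2))

s = StreamingSketch(W)
for x in X:                       # one pass over the stream, O(m) memory
    s.update(x)

# Same result as the batch computation
z_batch = np.exp(1j * X @ W.T).mean(axis=0)
assert np.allclose(s.sketch, z_batch)
```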

60 Computational Aspects Sketching = empirical characteristic function: z_ℓ = (1/N) Σ_{i=1}^N e^{j ω_ℓ^T x_i}, computed as the average of h(W x_i) Distributed computing: decentralized (HADOOP) / parallel (GPU) 30
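
Distributed computation works the same way: each machine sketches its own shard and only (sum, count) pairs need to be exchanged and merged. Below is a minimal single-process sketch of this map/reduce pattern, not an actual HADOOP or GPU implementation; function names and sizes are assumptions.

```python
import numpy as np

def shard_sketch(X_shard, W):
    """Sketch of one data shard, returned as (sum, count) so shards merge exactly."""
    return np.exp(1j * X_shard @ W.T).sum(axis=0), len(X_shard)

def merge(partials):
    """Combine per-shard (sum, count) pairs into the global sketch."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 2))
X = rng.standard_normal((12_000, 2))

# Pretend the collection lives on 4 machines; each only sees its own shard
partials = [shard_sketch(shard, W) for shard in np.array_split(X, 4)]
z = merge(partials)

assert np.allclose(z, np.exp(1j * X @ W.T).mean(axis=0))
```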

61 Summary: Compressive GMM Dimension reduction Efficiency [Figures: computation time (s) and memory (bytes) vs. collection size N for several values of K; compressive methods (sketching + CLOMP / CLOMPR / BS CGMM / CHS) use only the sketch and frequencies, EM uses the full data] Information preservation?

62 Information preserving projections (for compressive learning)

63 Stable recovery Signal space R^n Model set Σ = signals of interest Ex: set of k-sparse vectors Σ_k = {x ∈ R^n : ‖x‖_0 ≤ k} x → Linear projection M → y Observation space R^m, m ≪ n 33

64 Stable recovery Signal space R^n Model set Σ = signals of interest Ex: set of k-sparse vectors Σ_k = {x ∈ R^n : ‖x‖_0 ≤ k} Ideal goal: build a decoder Δ (recovery algorithm) with the guarantee that ‖x - Δ(Mx + e)‖ ≤ C‖e‖, ∀x ∈ Σ (instance optimality [Cohen & al 2009]) x → Linear projection M → y Observation space R^m, m ≪ n 33

65 Stable recovery Signal space R^n Model set Σ = signals of interest Ex: set of k-sparse vectors Σ_k = {x ∈ R^n : ‖x‖_0 ≤ k} Ideal goal: build a decoder Δ (recovery algorithm) with the guarantee that ‖x - Δ(Mx + e)‖ ≤ C‖e‖, ∀x ∈ Σ (instance optimality [Cohen & al 2009]) x → Linear projection M → y Observation space R^m, m ≪ n Are there such decoders? 33

66 Stable recovery of k-sparse vectors Typical decoders: L1 minimization, Δ(y) := arg min_{x: Mx=y} ‖x‖_1 (LASSO [Tibshirani 1994], Basis Pursuit [Chen & al 1999]); Greedy algorithms: (Orthogonal) Matching Pursuit [Mallat & Zhang 1993], Iterative Hard Thresholding (IHT) [Blumensath & Davies 2009] Guarantees: assume the Restricted Isometry Property (RIP) [Candès & al 2004]: 1 - δ ≤ ‖Mz‖²₂ / ‖z‖²₂ ≤ 1 + δ whenever ‖z‖_0 ≤ 2k Then: exact recovery, stability to noise, robustness to model error 34
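
As a concrete instance of the L1-minimization decoder Δ(y) = arg min_{Mx=y} ‖x‖_1, here is a minimal Basis Pursuit solver using the standard linear-programming reformulation x = u - v with u, v ≥ 0 (it relies on scipy's linprog). It is a didactic sketch with illustrative sizes, not the solvers used in the cited works.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(M, y):
    """Decoder Delta(y) = argmin ||x||_1 s.t. Mx = y, via the LP reformulation
    x = u - v with u, v >= 0 (a sketch, not a production solver)."""
    m, n = M.shape
    c = np.ones(2 * n)                       # minimize sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([M, -M])                # M(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v

rng = np.random.default_rng(0)
n, m, k = 64, 32, 4
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
M = rng.standard_normal((m, n)) / np.sqrt(m)   # random matrix: RIP w.h.p. for small k
y = M @ x_true

x_hat = basis_pursuit(M, y)
print(np.linalg.norm(x_hat - x_true))          # close to 0 when recovery succeeds
```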

67 Stable recovery Signal space R^n Model set Σ = signals of interest Low-dimensional model: sparse x → Linear projection M → y Observation space R^m, m ≪ n 35

68 Stable recovery Signal space R^n Model set Σ = signals of interest Low-dimensional model: sparse; sparse in a dictionary D x → Linear projection M → y Observation space R^m, m ≪ n 36

69 Stable recovery Signal space R^n Model set Σ = signals of interest Low-dimensional model: sparse; sparse in a dictionary D; co-sparse in an analysis operator A (total variation, physics-driven sparse models) x → Linear projection M → y Observation space R^m, m ≪ n 37

70 Stable recovery Signal space R^n Model set Σ = signals of interest Low-dimensional model: sparse; sparse in a dictionary D; co-sparse in an analysis operator A (total variation, physics-driven sparse models); low-rank matrix or tensor (matrix completion, phase retrieval, blind sensor calibration) x → Linear projection M → y Observation space R^m, m ≪ n 38

71 Stable recovery Signal space R^n Model set Σ = signals of interest Low-dimensional model: sparse; sparse in a dictionary D; co-sparse in an analysis operator A (total variation, physics-driven sparse models); low-rank matrix or tensor (matrix completion, phase retrieval, blind sensor calibration); manifold / union of manifolds (detection, estimation, localization, mapping); matrix with sparse inverse (Gaussian graphical models); given point cloud (database indexing) x → Linear projection M → y Observation space R^m, m ≪ n 39

72 Stable recovery Vector space H Signal space R^n Model set Σ = signals of interest Low-dimensional model: sparse; sparse in a dictionary D; co-sparse in an analysis operator A (total variation, physics-driven sparse models); low-rank matrix or tensor (matrix completion, phase retrieval, blind sensor calibration); manifold / union of manifolds (detection, estimation, localization, mapping); matrix with sparse inverse (Gaussian graphical models); given point cloud (database indexing); Gaussian Mixture Model x → Linear projection M → y Observation space R^m, m ≪ n 40

73 General stable recovery Vector space H Signal space R^n Model set Σ = signals of interest Low-dimensional model: arbitrary set Σ ⊂ H Ideal goal: build a decoder Δ (recovery algorithm) with the guarantee that ‖x - Δ(Mx + e)‖ ≤ C‖e‖, ∀x ∈ Σ (instance optimality [Cohen & al 2009]) x → Linear projection M → y Observation space R^m, m ≪ n Are there such decoders? Literature on embeddings, cf monograph Robinson 2011; Hurewicz & Wallman 1941, Falconer 1985, Hunt & Kaloshin 1999 41

74 Stable recovery from arbitrary model sets Theorem 1: the RIP is necessary. Definition: (general) Restricted Isometry Property (RIP) on the secant set Σ - Σ := {x - x', x, x' ∈ Σ}: α ≤ ‖Mz‖ / ‖z‖ ≤ β when z ∈ Σ - Σ; up to renormalization, α = √(1-δ), β = √(1+δ). M satisfies the RIP as soon as there exists an instance optimal decoder. Theorem 2: the RIP is sufficient. The RIP implies the existence of a decoder Δ with performance guarantees: ‖x - Δ(Mx + e)‖ ≤ C(δ)‖e‖ + C'(δ) d(x, Σ), ∀x, where d(x, Σ) is the distance to the model set ([Cohen & al 2009] for Σ_k, [Bourrier & al 2014] for an arbitrary model set). Exact recovery, stable to noise, bonus: robust to model error. 42
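
The general RIP above can at least be probed numerically: the snippet below samples random elements of the secant set Σ_k - Σ_k (i.e. 2k-sparse differences) and records the observed range of ‖Mz‖/‖z‖. This only gives an optimistic Monte Carlo estimate, since the true RIP constants are worst-case quantities (and NP-hard to compute in general); all sizes are illustrative.

```python
import numpy as np

def secant_rip_range(M, k, trials=5_000, rng=None):
    """Monte Carlo probe of ||Mz|| / ||z|| over random 2k-sparse secant directions.
    A random probe only lower/upper bounds the worst-case RIP constants optimistically."""
    if rng is None:
        rng = np.random.default_rng(0)
    m, n = M.shape
    lo, hi = np.inf, 0.0
    for _ in range(trials):
        z = np.zeros(n)
        supp = rng.choice(n, size=2 * k, replace=False)
        z[supp] = rng.standard_normal(2 * k)
        r = np.linalg.norm(M @ z) / np.linalg.norm(z)
        lo, hi = min(lo, r), max(hi, r)
    return lo, hi

rng = np.random.default_rng(0)
n, m, k = 256, 80, 5
M = rng.standard_normal((m, n)) / np.sqrt(m)
alpha, beta = secant_rip_range(M, k)
print(alpha, beta)   # both close to 1 when m is large enough relative to k log n
```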

77 Information recovery: decoders

78 Ideal decoder Instance optimality guarantee under the RIP: ‖x - Δ(Mx + e)‖ ≤ C(δ)‖e‖ + C'(δ) d(x, Σ) «Ideal & Universal» instance optimal decoder: Δ(y) := arg min_{x ∈ H} C(δ)‖Mx - y‖ + C'(δ) d(x, Σ) Adaptations when the minimum may not be achieved (e.g., infinite dimension) + Noise-blind: no knowledge of the noise level needed - Not very practical 44

79 Greedy decoders Projected Landweber ~ Iterative Hard Thresholding Gradient descent: x^{t+1/2} = x^t - μ M^T(M x^t - y) Projection: x^{t+1} = prox_Σ(x^{t+1/2}) Definition: projection onto the model (~ denoising), prox_Σ(y) = arg min_{x ∈ Σ} ‖y - x‖_2 NB: can be NP-hard! Theorem [Blumensath 2011]: PLA is instance optimal assuming the RIP on Σ - Σ with constants β² ≤ 1/μ < 1.5 α² 45
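
A minimal numpy sketch of the Projected Landweber / IHT scheme above for the k-sparse model, where the projection prox_Σ is simply hard thresholding (keep the k largest entries in magnitude). The step-size rule and problem sizes are illustrative assumptions.

```python
import numpy as np

def hard_threshold(x, k):
    """prox onto the k-sparse model: keep the k largest entries in magnitude."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def iht(y, M, k, mu=None, n_iter=300):
    """Projected Landweber / IHT: gradient step on ||Mx - y||^2, then projection
    onto the k-sparse model set (a minimal sketch of the scheme on the slide)."""
    m, n = M.shape
    if mu is None:
        mu = 1.0 / np.linalg.norm(M, 2) ** 2     # conservative step size <= 1/||M||^2
    x = np.zeros(n)
    for _ in range(n_iter):
        x = hard_threshold(x - mu * M.T @ (M @ x - y), k)
    return x

rng = np.random.default_rng(0)
n, m, k = 128, 64, 4
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
M = rng.standard_normal((m, n)) / np.sqrt(m)
y = M @ x_true

print(np.linalg.norm(iht(y, M, k) - x_true))     # typically small when the RIP holds
```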

80 Decoding by (convex?) optimization RIP guarantees for generalized Basis Pursuit? General optimization framework: Δ(y) := arg min_{x ∈ H} f(x) s.t. ‖Mx - y‖ ≤ ε Examples: sparsity, f(x) = ‖x‖_1, and low-rank, f(x) = ‖x‖_*: sharp RIP bound δ < 1/√2 [Cai & Zhang 2014]; group sparsity, f(x) = ‖x‖_{1,2} [Ayaz, Dirksen, Rauhut 2014]; sparsity in J levels, f(x) = ‖x‖_1, with a bound depending on J and K_max/K_min [Bastounis & Hansen 2015]

81 Decoding by (convex?) optimization RIP guarantees for generalized Basis Pursuit? General optimization framework: Δ(y) := arg min_{x ∈ H} f(x) s.t. ‖Mx - y‖ ≤ ε One RIP to rule them all: given a cone (or Union of Subspaces) model set Σ and a regularizer f(·), define the RIP constant δ_Σ(f) = inf_{z ∈ T_f(Σ)} δ_Σ(z), where δ_Σ(z) = sup_{x ∈ Σ} δ_Σ(x, z), with closed-form expressions of δ_Σ^UoS(x, z) and δ_Σ^cone(x, z) in terms of Re⟨x, z⟩, ‖x‖_H and ‖x + z‖_Σ. Theorem: M satisfying the RIP with constant < δ_Σ(f) on vectors from Σ - Σ guarantees robust and stable uniform recovery. [Traonmilin & G. 2015] 47

82 Ideal decoder (ter): (non)convex optimization RIP guarantees for generalized Basis Pursuit? General optimization framework: Δ(y) := arg min_{x ∈ H} f(x) s.t. ‖Mx - y‖ ≤ ε One RIP to rule them all: given a cone (or Union of Subspaces) model set Σ and regularizer f(·), bounds on the RIP constants δ_Σ(f) [Traonmilin & G. 2015]: sparsity, f(x) = ‖x‖_1: 1/√2; low-rank, f(x) = ‖x‖_*: 1/√2; group sparsity, f(x) = ‖x‖_{1,2}: 1/√2; group sparsity in J levels, f(x) = Σ_{j=1}^J ‖x_j‖_{1,2} / √K_j: a bound involving √J and K_max/K_min; permutation matrices, Birkhoff polytope norm: 2/3 48

83 From RIP analysis to regularizer design? Model set: partition the indices into J blocks I_j, with (group) sparsity K_j per block: x = Σ_j x_j, supp(x_j) ⊂ I_j, ‖x_j‖_0 ≤ K_j Compared regularizers: ‖x‖_1 = Σ_j ‖x_j‖_1 and the weighted norm f_w(x) = Σ_{j=1}^J ‖x_j‖_1 / √K_j, with κ := √(K_max/K_min) [Figure: admissible RIP constant δ vs. κ, comparing our result with Bastounis et al. and Ayaz et al., for both ‖·‖_1 and f_w] 49

84 Conclusion

85 Projections & Learning Signal Processing, compressive sensing: Signal space x → (linear projection M) → Observation space y; reduce dimension of data items; compressive sensing = random projections of data items Machine Learning, compressive learning: Probability space p → (linear projection M) → Sketch space z; reduce size of collection; compressive learning with sketches = random projections of collections, nonlinear in the feature vectors, linear in their probability distribution 51

86 Summary Challenge: compress before learning? Compressive clustering & Compressive GMM: Bourrier, G., Perez, Compressive Gaussian Mixture Estimation, ICASSP 2013; Keriven & G., Compressive Gaussian Mixture Estimation by Orthogonal Matching Pursuit with Replacement, SPARS 2015, Cambridge, United Kingdom; Keriven & al, Sketching for Large-Scale Learning of Mixture Models, ICASSP 2016, upcoming long version with proofs Information preservation? Unified framework covering projections & sketches; notion of Instance Optimal Decoders = uniform guarantees; fundamental role of the general Restricted Isometry Property: Bourrier & al, Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems, IEEE Transactions on Information Theory, 2014

87 Recent / ongoing work / challenges Decoders? RIP-based guarantees for general (convex & nonconvex) regularizers: Traonmilin & G., Stable recovery of low-dimensional cones in Hilbert spaces - One RIP to rule them all, arXiv preprint Sharpness? Guide the design of the regularizer (cf atomic norms): maximize δ_Σ(f) s.t. f ∈ F Dimension reduction? Goal: given a model set Σ ⊂ H, exhibit M : H → R^m satisfying the RIP Sufficient target dimension: m ≳ covering dimension of the normalized secant set, using random sub-Gaussian linear forms [Dirksen 2014]: Puy, Davies & G., Recipes for stable linear embeddings from Hilbert spaces to R^m, HAL preprint, see also [Puy, Davies & G., EUSIPCO 2015]. This dimension is not necessary: better measure of the dimension of Σ? 53

88 Recent / ongoing work / challenges Compressive Spectral Clustering: random projections + graph signal processing = faster spectral clustering, from O(k² N) down to O(k² log² k + N(log N + k)): Tremblay & al, Accelerated Spectral Clustering using Graph Filtering of Random Signals, ICASSP 2016; proofs: Tremblay, Puy, G. & Vandergheynst, Compressive Spectral Clustering, arXiv preprint. Example with the Amazon graph (10^6 edges): 5 times speedup (3 hours instead of 15 hours for k = 5 classes) General Compressive Learning? RIP for sketches in an RKHS applied to compressive GMM (upcoming, Keriven, Bourrier, Perez & G.); compressive statistical learning: intrinsic dimension of PCA, k-means, Dictionary Learning, and related learning tasks (work in progress, Blanchard & G.) 54

89 THANKS
