Sketching for Large-Scale Learning of Mixture Models


1 Sketching for Large-Scale Learning of Mixture Models. Nicolas Keriven, Université Rennes 1, Inria Rennes Bretagne-Atlantique. Advisor: Rémi Gribonval

2 Outline Introduction Practical Approach Results Theoretical analysis Conclusion and outlooks

7 Paths to Compressive Learning
Goal : compute parameters from a large database (PCA, classification, regression, density estimation...).
Idea : compress the database beforehand.
Large (tall) database : see e.g. [Calderbank 2009]. Large (fat) «big data» database : see e.g. [Cormode 2011]. 1/28

11 Inspiration
Sketch : contains particular information about the database, maintained online [Cormode 2011].
Learning : knowledge about the underlying probability distribution.
By Compressive Sensing : recover a «low-dimensional» object from few linear measurements (ex : sparse vector, low-rank matrix...).
Sketch = measurements of the underlying probability distribution. 2/28
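In formulas (a minimal restatement; the notation below is ours, not fixed by the slides): the sketch is a vector of generalized moments of the data distribution, i.e. a linear measurement of the distribution itself, estimated by a single averaging pass over the database.

```latex
% Sketching operator applied to a probability distribution p on R^n,
% with feature map Phi(x) = (exp(i w_j^T x))_{j=1..m}:
\begin{gather*}
\mathcal{A}(p) = \mathbb{E}_{x \sim p}\,\Phi(x)
  = \Big(\mathbb{E}_{x \sim p}\, e^{\mathrm{i}\,\omega_j^\top x}\Big)_{j=1,\dots,m},
\\
\hat{z} = \frac{1}{N}\sum_{i=1}^{N} \Phi(x_i) \;\approx\; \mathcal{A}(p).
\end{gather*}
```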

16 Generalized Moments / (Generalized) Compressive Sensing / Practical Compressive Learning
Compressive Sensing : (random) projections. Online, distributed. Robustness of the learning algorithm? 3/28

19 Mixture Model Estimation
Usual Compressive Sensing : sparsity.
Generalized Compressive Sensing (or...) : infinite, continuous dictionary! 4/28
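Stated side by side (our notation): a K-sparse vector combines K atoms from a finite dictionary, whereas a K-component mixture combines K atoms p_theta from a continuously parameterized family, which is what «infinite, continuous dictionary» refers to.

```latex
% Usual Compressive Sensing: x is a combination of K atoms of a finite dictionary.
% Generalized version: p is a combination of K atoms p_theta from a continuous family.
\begin{gather*}
y = \Phi x, \qquad x = \sum_{k=1}^{K} \alpha_k\, e_{i_k},
\\
\hat{z} \approx \mathcal{A}(p), \qquad
p = \sum_{k=1}^{K} \alpha_k\, p_{\theta_k}, \quad \theta_k \in \Theta .
\end{gather*}
```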

21 Outline
Practical Approach (Sections 2 & 3) : greedy algorithm inspired by Compressive Sensing; application to K-means and GMM with diagonal covariance.
Theoretical Analysis (Section 4) : information-preservation guarantee; infinite-dimensional Compressive Sensing. 5/28

22 Outline Introduction Practical Approach Results Theoretical analysis Conclusion and outlooks

31 Cost function / Algorithm
Usual Compressive Sensing : «ideal» decoding scheme, NP-complete. Two approaches: convex relaxation or greedy. See [Foucart 2013].
Sketched mixture estimation : ideal decoding scheme (Section 4), highly non-convex. Two approaches: convex relaxation [Bunea 2010] or greedy (proposed). 6/28
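A sketch of the decoding problem targeted here (our notation, consistent with the formulas above; the exact constraints on the weights depend on the model): fit a K-component mixture whose sketch matches the empirical sketch.

```latex
% Non-convex decoding cost minimized (approximately) by the proposed greedy algorithm.
\min_{\substack{\alpha_1,\dots,\alpha_K \ge 0 \\ \theta_1,\dots,\theta_K \in \Theta}}
\Big\| \hat{z} - \sum_{k=1}^{K} \alpha_k\, \mathcal{A}\big(p_{\theta_k}\big) \Big\|_2
```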

34 Proposed algorithm
Orthogonal Matching Pursuit (OMP) [Mallat 1993, Pati 1993] :
1. Add the atom most correlated with the residual.
2. Perform Least-Squares.
3. Repeat until the desired sparsity.
OMP with Replacement (OMPR) [Jain 2011], similar to CoSaMP [Needell 2008] or Subspace Pursuit [Dai 2009] :
1. Add the atom most correlated with the residual.
2. Perform Hard-Thresholding (if necessary).
3. Perform Least-Squares.
4. Repeat twice the desired sparsity.
Compressive Learning OMPR (CLOMPR, proposed); we cannot just add a component :
1. Add the atom most correlated with the residual, found by gradient descent.
2. Perform Hard-Thresholding.
3. Perform Non-Negative Least-Squares.
4. Perform gradient descent on all parameters, initialized with the current ones.
5. Repeat twice the desired sparsity. 7/28
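To make the greedy template concrete, here is a minimal, self-contained OMP for standard sparse recovery (NumPy). It is plain OMP, not CLOMPR itself: CLOMPR replaces the finite dictionary by a gradient-based search over continuous atoms, hard-thresholding, and non-negative least squares.

```python
import numpy as np

def omp(Phi, y, K):
    """Orthogonal Matching Pursuit: recover a K-sparse x with y ~= Phi @ x."""
    _, d = Phi.shape
    support, residual = [], y.copy()
    x = np.zeros(d)
    for _ in range(K):
        # 1. Add the atom most correlated with the current residual.
        correlations = np.abs(Phi.T @ residual)
        correlations[support] = -np.inf          # do not pick an atom twice
        support.append(int(np.argmax(correlations)))
        # 2. Least-squares fit on the current support.
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        x = np.zeros(d)
        x[support] = coeffs
        # 3. Update the residual and repeat until the desired sparsity.
        residual = y - Phi @ x
    return x

# Small usage example on a synthetic 3-sparse signal.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[[3, 17, 60]] = [1.0, -2.0, 0.5]
x_hat = omp(Phi, Phi @ x_true, K=3)
print(np.allclose(x_hat, x_true, atol=1e-6))
```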

40 CLOMPR : illustration (schematic). Goal : a 3-GMM.
Starting from an intermediary support : 1 : add an atom with gradient descent; 2 : Hard Thresholding; 3 : Non-Negative Least-Squares; 4 : gradient descent on all parameters; this yields the final support.
How to define the sketching operator? 8/28

46 Choice of sketch
To implement CLOMPR, the sketch of a component and its gradient must have a closed-form expression w.r.t. the parameters.
The distributions considered are spatially localized : need incoherent sampling -> Fourier sampling. Closed-form for many models! (including alpha-stable...)
Random Fourier sampling [Candes 2006]; Random Fourier features [Rahimi 2007].
How to define the frequency distribution? 9/28
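A minimal sketch of both ingredients (NumPy; the function names and the Gaussian frequency scale below are illustrative assumptions, not the toolbox's API or the heuristic of the next slide): the empirical sketch of a dataset via random Fourier features, and the closed-form sketch of a Gaussian component via its characteristic function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 200                                # data dimension, sketch size
Omega = rng.standard_normal((m, n)) / 2.0    # frequencies (illustrative scale)

def empirical_sketch(X, Omega):
    """Empirical sketch: average of complex exponentials over the database."""
    return np.exp(1j * X @ Omega.T).mean(axis=0)          # shape (m,)

def gaussian_component_sketch(mu, diag_cov, Omega):
    """Closed-form sketch of N(mu, diag(diag_cov)): its characteristic function
    evaluated at the frequencies Omega."""
    quad = np.sum(Omega**2 * diag_cov, axis=1)            # w^T Sigma w for each row
    return np.exp(1j * Omega @ mu - 0.5 * quad)

# Check on data drawn from a single Gaussian: both sketches should be close.
mu, var = np.zeros(n), 0.3 * np.ones(n)
X = mu + np.sqrt(var) * rng.standard_normal((100_000, n))
print(np.linalg.norm(empirical_sketch(X, Omega) - gaussian_component_sketch(mu, var, Omega)))
```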

50 Designing the frequency distribution
Must adjust the «scale» of the distribution :
Adjust by hand : not that difficult, the method is quite robust.
Cross-validation : can be very long! Used in practice [Sutherland 2015].
Automatic (proposed) : partial pre-processing, heuristic based on GMM-like distributions. 10/28

51 Summary. Given a database :
1. Design the sketching operator : partial pre-processing to choose the scale, then draw the frequencies.
2. Compute the sketch : online, distributed, GPU.
3. Derive the mixture model with CLOMPR. 11/28
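Since the empirical sketch is a plain average, step 2 streams and parallelizes trivially; a minimal illustration (NumPy, hypothetical helper names):

```python
import numpy as np

def partial_sketch(X_chunk, Omega):
    """Sum (not mean) of features over one chunk, plus the chunk size."""
    return np.exp(1j * X_chunk @ Omega.T).sum(axis=0), len(X_chunk)

def merge(partials):
    """Merge partial sketches computed on different chunks / machines / stream passes."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

rng = np.random.default_rng(1)
Omega = rng.standard_normal((100, 5))
X = rng.standard_normal((10_000, 5))
chunks = np.array_split(X, 8)                 # e.g. 8 machines, or 8 batches of a stream
z = merge([partial_sketch(c, Omega) for c in chunks])
print(np.allclose(z, np.exp(1j * X @ Omega.T).mean(axis=0)))
```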

52 Outline Introduction Practical Approach Results Theoretical analysis Conclusion and outlooks

56 Application : K-means, GMM
K-means : classic approach : Lloyd-Max algorithm [Lloyd 1982] (Matlab's kmeans); compressive approach : model the clustered distribution as a noisy mixture of Diracs.
GMM with diagonal covariance : classic approach : EM algorithm [Dempster 1977] (VLFeat's gmm); compressive approach : GMM model. 12/28
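For the K-means case, the mixture-of-Diracs model and its Fourier sketch have a simple closed form (our notation): each centroid contributes one complex exponential per frequency.

```latex
% Mixture of Diracs at the centroids, and its closed-form Fourier sketch.
\begin{gather*}
p_\theta = \sum_{k=1}^{K} \alpha_k\, \delta_{c_k},
\qquad
\big[\mathcal{A}(p_\theta)\big]_j = \sum_{k=1}^{K} \alpha_k\, e^{\mathrm{i}\,\omega_j^\top c_k},
\quad j = 1,\dots,m.
\end{gather*}
```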

58 Large-scale results
K-means (n=5, K=10) : comparison with Matlab's kmeans. GMMs with diagonal covariance (n=5, K=5) : comparison with VLFeat's gmm.
Faster and more memory-efficient on large databases. The number of measurements does not depend on N. 13/28

61 Application : spectral clustering
K-means (n=10, K=10, m=1000); mean and variance over 50 experiments. Spectral clustering for classification [Uw 2001], augmented MNIST database [Loosli 2007].
CLOMPR performs better and is more stable with a large database. 14/28

62 Application : speaker recognition
GMM (n=12, K=64), with a variant of CLOMPR that is faster at large K. Classical method for speaker recognition [Reynolds 2000] (for proof of concept), NIST 2005 database, MFCCs.
Also performs better on a large database. 15/28

63 Outline Introduction Practical Approach Results Theoretical analysis Conclusion and outlooks

66 Information-preservation guarantees
A guarantee for CLOMPR itself? Difficult! (non-convex, random...). Instead, study the ideal decoder that solves the cost function.
Robustness to using the empirical sketch instead of the true one? Robustness to the data distribution not being exactly a mixture model? Guarantees in terms of usual learning cost functions?
K-means : sum of distances to the closest centroid. GMMs : negative log-likelihood. 16/28
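The two learning risks in formulas (standard definitions, with squared distances for K-means, matching the SSE metric used in the experiments):

```latex
% K-means: expected squared distance to the closest centroid (SSE in expectation).
% GMM: negative log-likelihood of the fitted mixture under the data distribution p*.
\begin{gather*}
\mathcal{R}_{\mathrm{KM}}(c_1,\dots,c_K)
  = \mathbb{E}_{x \sim p^*} \,\min_{1 \le k \le K} \| x - c_k \|_2^2,
\\
\mathcal{R}_{\mathrm{GMM}}(\theta)
  = -\,\mathbb{E}_{x \sim p^*} \log p_\theta(x),
\qquad
p_\theta = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(\mu_k, \Sigma_k).
\end{gather*}
```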

69 K-means : result
Goal : minimize the expected risk.
Hypotheses : separation of the centroids, bounded domain. Reweighted Fourier features (needed for the theory, no effect in practice).
If the sketch is large enough : with high probability, the excess risk is controlled. 17/28

72 GMMs with known covariance : result
Goal : minimize the expected risk.
Hypotheses : large enough separation of the means, bounded domain. Fourier features.
If the sketch is large enough : with high probability, the excess risk is controlled, up to the L1 distance from p* to the set of (separated) GMMs. 18/28

73 GMM trade-off : trade-off between the separation of the means and the size of the sketch. 19/28

75 GMM with unknown diagonal covariance
Can the guarantee be related to the learning cost function? Efficiency of the «compressive» approach?
Nevertheless : a guarantee holds when the data distribution is exactly a GMM. 20/28

79 Number of measurements
[Plots : relative SSE for K-means; relative log-likelihood for GMMs with known covariance and with diagonal covariance, versus the number of measurements.]
In theory, at least... Empirically? 21/28

82 Sketch of proof : principle
Goal : existence of an instance-optimal decoder, equivalent to the Lower Restricted Isometry Property (LRIP) [Bourrier 2014]. Ex : quantization error. 22/28
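Schematically (our notation; the constants and the model-set norm are those of [Bourrier 2014] and are left implicit here): the LRIP lower-bounds the sketch distance by a distance between distributions in the model set, and implies that an ideal decoder is robust to noise and to model mismatch.

```latex
% LRIP over the model set S (e.g. K-component mixtures), and the resulting
% instance-optimality of an ideal decoder Delta (constants and norms left implicit).
\begin{gather*}
\| p - q \| \le C\, \big\| \mathcal{A}(p) - \mathcal{A}(q) \big\|_2
  \qquad \text{for all } p, q \in \mathfrak{S},
\\
\big\| p^* - \Delta\big(\mathcal{A}(p^*) + e\big) \big\|
  \le C_1\, d\big(p^*, \mathfrak{S}\big) + C_2\, \| e \|_2 .
\end{gather*}
```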

87 Proving the LRIP
1 : Prove a non-uniform LRIP : kernel mean embedding [Smola 2007], Random (Fourier) Features [Rahimi 2007], concentration (Hoeffding, Bernstein, chaining).
2 : Use ε-coverings to extend it to a uniform LRIP : covering the basic set is finite-dimensional, easy! (Ex : quantization error [Boufounos 2016]); covering the normalized secant set is infinite-dimensional, difficult! 23/28
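The connection behind step 1, stated informally (our notation): as the number of frequencies grows, the normalized sketch distance concentrates around a Maximum Mean Discrepancy for the shift-invariant kernel induced by the frequency distribution [Smola 2007, Rahimi 2007].

```latex
% Frequency distribution Lambda induces a shift-invariant kernel; the (normalized)
% sketch distance concentrates around the corresponding MMD between distributions.
\begin{gather*}
\kappa(x, y) = \mathbb{E}_{\omega \sim \Lambda}\, e^{\mathrm{i}\,\omega^\top (x - y)},
\qquad
\psi_p(\omega) := \mathbb{E}_{x \sim p}\, e^{\mathrm{i}\,\omega^\top x},
\\
\tfrac{1}{m}\,\big\| \mathcal{A}(p) - \mathcal{A}(q) \big\|_2^2
  \;\xrightarrow[m \to \infty]{}\;
  \mathbb{E}_{\omega \sim \Lambda}\big| \psi_p(\omega) - \psi_q(\omega) \big|^2
  = \mathrm{MMD}_\kappa(p, q)^2 .
\end{gather*}
```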

90 Results : sufficient conditions
The model has finite covering numbers : bad! Ex : GMMs with unknown covariance.
Mixtures of sufficiently separated distributions with smooth features : + guarantees w.r.t. a smooth risk. Ex : mixture of Diracs (K-means) with «smooth» Random Features.
«Smoother» Random Features : + guarantees w.r.t. the risk. Ex : mixtures of Diracs (K-means), GMMs with known covariance. 24/28

91 Outline Introduction Practical Approach Results Theoretical analysis Conclusion and outlooks

96 Conclusions. Contributions :
Greedy algorithm for large-scale mixture learning from random moments.
Efficient heuristic to design the sketching operator as Fourier sampling.
Application to mixtures of Diracs and GMMs.
Evaluation on synthetic and real data.
Information-preservation guarantees using infinite-dimensional Compressive Sensing. 25/28

97 The SketchMLbox (sketchml.gforge.inria.fr)
Mixture of Diracs («K-means»), GMMs with known covariance, GMMs with unknown diagonal covariance.
Soon : alpha-stable, Gaussian Locally Linear Mapping [Deleforge 2014].
Optimized for user-defined... 26/28

103 Outlooks : algorithmic guarantees
Algorithmic guarantees? Non-convex cost function, randomized algorithm.
Locally convex? Basin of attraction? [Jacques 2016, Candes 2016...] Reached by CLOMPR under reasonable hypotheses? Stopping condition?
Recent result : locally block convex.
[Figure : cost function for a 3-GMM in dimension 1 at positions {x,y,z} = {-3,0,3}.] 27/28

114 Outlooks : extension of the methods
1. Bridge the observed gap between theory and practice? It does not come from the ε-coverings; improve the concentration inequalities?
2. Extend the framework to other tasks? Recent paper submitted to AISTATS : PCA. Other existing uses of Fourier sketches, e.g. classification [Sutherland 2015]. Other kernel methods (algorithmic? theoretical?).
3. Extension to multi-layer sketches? (Neural networks...) May be adapted to e.g. GMMs with unknown covariance. The equivalence between LRIP and instance optimality is still valid for non-linear operators! However, CLOMPR and the current sufficient conditions are no longer valid. 28/28

115 Thank you!
Keriven, Bourrier, Gribonval, Pérez. Sketching for Large-Scale Learning of Mixture Models. ICASSP 2016.
Keriven, Bourrier, Gribonval, Pérez. Sketching for Large-Scale Learning of Mixture Models (extended version). Submitted to Information and Inference, arXiv:
Keriven, Tremblay, Gribonval, Traonmilin. Compressive K-means. ICASSP 2017.
Gribonval, Blanchard, Keriven, Traonmilin. Random Moments for Sketched Statistical Learning. Submitted to AISTATS 2017, extended version soon.

116 Appendix : CLOMPR
