Multiple kernel learning for multiple sources


1 Multiple kernel learning for multiple sources Francis Bach INRIA - Ecole Normale Supérieure NIPS Workshop - December 2008

2 Talk outline Multiple sources in computer vision Multiple kernel learning (MKL) Equivalent formulations Theoretical analysis and open problems Covariance operators

3 Machine learning for computer vision Learning tasks on images Proliferation of digital media Many different tasks to be solved, associated with different machine learning problems

4-5 Image retrieval Classification, ranking, outlier detection

6 Personal photos Classification, clustering, visualisation

7 Machine learning for computer vision Learning tasks on images Proliferation of digital media Many different tasks to be solved, associated with different machine learning problems Application: retrieval/indexing of images Common issues: complex tasks; heterogeneous data; links with other media (text and sound); massive data

8 Machine learning for computer vision Learning tasks on images Proliferation of digital media Many different tasks to be solved, associated with different machine learning problems Application: retrieval/indexing of images Common issues: complex tasks; heterogeneous data; links with other media (text and sound); massive data Kernel methods

9 Multiple sources in computer vision Many different cues: shape, color, texture, segments, interest points Kernel design is easier for one source at a time Links with bioinformatics

10 Kernels for interest points SIFT + pyramid match (Grauman and Darrell, 2007)

11 Kernels for texture Histograms of filter responses
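As a concrete illustration of such a texture kernel, the sketch below compares two images through normalized histograms of filter responses using the histogram intersection kernel (a valid positive-definite kernel on histograms). It is a minimal toy example, not the exact kernel of the talk; the random "filter responses", bin count, and value range are illustrative assumptions.

```python
import numpy as np

def filter_histograms(responses, n_bins=16, lim=(-1.0, 1.0)):
    """Summarize a bank of filter responses as one normalized histogram per filter."""
    hists = [np.histogram(r, bins=n_bins, range=lim)[0] for r in responses]
    hists = np.concatenate(hists).astype(float)
    return hists / hists.sum()

def intersection_kernel(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return np.minimum(h1, h2).sum()

# toy usage: two "images", each seen through three random filters
rng = np.random.default_rng(0)
img_a = [rng.normal(size=1000) * s for s in (0.1, 0.3, 0.5)]
img_b = [rng.normal(size=1000) * s for s in (0.1, 0.3, 0.6)]
print(intersection_kernel(filter_histograms(img_a), filter_histograms(img_b)))
```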

12 Kernels from segmentation graphs (Harchaoui and Bach, 2007) Goal of segmentation: extract objects of interest Many methods available, but they rarely find the object of interest entirely Segmentation graphs allow one to work on a more reliable over-segmentation, going from a large square grid (millions of pixels) to a small graph (dozens or hundreds of regions)

13-14 Segmentation by watershed transform (Meyer, 2001) [figure: original image, its gradient, and watershed segmentations at 287, 64, and 10 segments]
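A watershed segmentation along these lines can be reproduced with scikit-image; the minimal sketch below runs on a stock image and uses an illustrative marker-selection heuristic (well-separated local minima of the gradient) rather than Meyer's hierarchical construction. Fewer markers yield the coarser segmentations shown on the slide.

```python
import numpy as np
from skimage import data, feature, filters, segmentation

image = data.coins()                      # stock grayscale image
gradient = filters.sobel(image)           # gradient magnitude drives the flooding

# illustrative markers: well-separated local minima of the gradient
coords = feature.peak_local_max(-gradient, min_distance=20)
markers = np.zeros(image.shape, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

labels = segmentation.watershed(gradient, markers)
print(labels.max(), "segments")           # fewer markers -> coarser segmentation
```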

15 Image as a segmentation graph Labelled undirected graph Vertices: connected segmented regions Edges: between spatially neighboring regions Labels: region pixels Graph kernels (Gärtner et al., 2003; Kashima et al., 2004; Harchaoui and Bach, 2007) provide an elegant and efficient solution
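A minimal sketch of how such a segmentation graph can be built from a label image, assuming 4-connectivity and using networkx as the graph container; in practice each vertex would also carry region statistics (e.g., its pixel colors) as labels.

```python
import numpy as np
import networkx as nx

def segmentation_graph(labels):
    """Build a labelled undirected graph: one vertex per region,
    one edge per pair of 4-adjacent regions."""
    g = nx.Graph()
    g.add_nodes_from(np.unique(labels))
    # horizontally and vertically adjacent pixel pairs with different labels
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        edges = np.stack([a[a != b], b[a != b]], axis=1)
        g.add_edges_from(map(tuple, edges))
    return g

labels = np.array([[1, 1, 2],
                   [1, 3, 2],
                   [3, 3, 2]])
print(sorted(segmentation_graph(labels).edges()))  # [(1, 2), (1, 3), (2, 3)]
```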

16 Talk outline Multiple sources in computer vision Multiple kernel learning (MKL) Equivalent formulations Theoretical analysis and open problems Covariance operators

17 Multiple sources by combining kernels Learning combinations of kernels: $K(\eta) = \sum_{j=1}^m \eta_j K_j$, $\eta \geq 0$ Summing kernels $\Leftrightarrow$ concatenating feature spaces: assume $k_1(x,y) = \langle \Phi_1(x), \Phi_1(y) \rangle$ and $k_2(x,y) = \langle \Phi_2(x), \Phi_2(y) \rangle$; then $k_1(x,y) + k_2(x,y) = \left\langle \begin{pmatrix} \Phi_1(x) \\ \Phi_2(x) \end{pmatrix}, \begin{pmatrix} \Phi_1(y) \\ \Phi_2(y) \end{pmatrix} \right\rangle$

18 Multiple sources by combining kernels Learning combinations of kernels: $K(\eta) = \sum_{j=1}^m \eta_j K_j$, $\eta \geq 0$ Summing kernels $\Leftrightarrow$ concatenating feature spaces Two natural (equivalent) settings: 1. Single input space, multiple feature spaces: $x \in \mathcal{X}$, $m$ different kernels on $\mathcal{X}$; example: learning hyperparameters of kernels 2. Multiple pairs of input/feature spaces: $x_j \in \mathcal{X}_j$, $j = 1, \dots, m$, with one kernel on each input space; multiple sources; generalized additive models (Hastie and Tibshirani, 1990)
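The identity "summing kernels = concatenating feature spaces" is easy to check numerically; the sketch below uses two explicit, illustrative feature maps and verifies that the Gram matrix of the concatenated features equals the sum of the two Gram matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

# two explicit feature maps (illustrative: raw features and their squares)
phi1 = lambda x: x
phi2 = lambda x: x ** 2

K1 = phi1(X) @ phi1(X).T
K2 = phi2(X) @ phi2(X).T

# the concatenated feature map gives exactly the sum of the two Gram matrices
phi = np.concatenate([phi1(X), phi2(X)], axis=1)
assert np.allclose(phi @ phi.T, K1 + K2)
```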

19 Multiple kernel learning (Lanckriet et al., 2004; Bach et al., 2004a) Learning kernels $K = \sum_{j=1}^m \eta_j K_j$, $\eta \geq 0$ Summing kernels is equivalent to concatenating feature spaces: $m$ feature maps $\Phi_j : \mathcal{X} \to \mathcal{F}_j$, $j = 1, \dots, m$ Minimization with respect to $f_1 \in \mathcal{F}_1, \dots, f_m \in \mathcal{F}_m$ Predictor: $f(x) = f_1^\top \Phi_1(x) + \cdots + f_m^\top \Phi_m(x)$ [diagram: $x$ fans out to $\Phi_1(x), \dots, \Phi_m(x)$, each paired with $f_j$ and summed] Which regularization for $f_1^\top \Phi_1(x) + \cdots + f_m^\top \Phi_m(x)$?

20 Regularization for multiple kernels Summing kernels is equivalent to concatenating feature spaces: $m$ feature maps $\Phi_j : \mathcal{X} \to \mathcal{F}_j$, $j = 1, \dots, m$ Minimization with respect to $f_1 \in \mathcal{F}_1, \dots, f_m \in \mathcal{F}_m$ Predictor: $f(x) = f_1^\top \Phi_1(x) + \cdots + f_m^\top \Phi_m(x)$ Regularization by $\sum_{j=1}^m \|f_j\|^2$ is equivalent to using $K = \sum_{j=1}^m K_j$

21 Regularization for multiple kernels Summing kernels is equivalent to concatenating feature spaces: $m$ feature maps $\Phi_j : \mathcal{X} \to \mathcal{F}_j$, $j = 1, \dots, m$ Minimization with respect to $f_1 \in \mathcal{F}_1, \dots, f_m \in \mathcal{F}_m$ Predictor: $f(x) = f_1^\top \Phi_1(x) + \cdots + f_m^\top \Phi_m(x)$ Regularization by $\sum_{j=1}^m \|f_j\|^2$ is equivalent to using $K = \sum_{j=1}^m K_j$ Regularization by $\sum_{j=1}^m \|f_j\|$ should impose sparsity at the group level Main questions when regularizing by the block $\ell_1$-norm: 1. Equivalence with other kernel learning formulations 2. Algorithms 3. Analysis of sparsity-inducing properties (Bach, 2008)
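The group-level sparsity induced by $\sum_j \|f_j\|$ can be seen in its proximal operator, which rescales or zeroes entire blocks at once. A minimal finite-dimensional sketch, where the threshold $\tau$ plays the role of the regularization strength:

```python
import numpy as np

def block_soft_threshold(f_blocks, tau):
    """Proximal operator of tau * sum_j ||f_j||_2: shrinks each block's norm,
    zeroing whole blocks at once -- sparsity at the group level."""
    out = []
    for f in f_blocks:
        norm = np.linalg.norm(f)
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append(scale * f)
    return out

blocks = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
print(block_soft_threshold(blocks, tau=1.0))
# first block is shrunk (norm 5 -> 4), second block (norm 0.5) is zeroed out
```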

22 General kernel learning Proposition (Lanckriet et al., 2004; Bach et al., 2005; Micchelli and Pontil, 2005): $G(K) = \min_{f \in \mathcal{F}} \sum_{i=1}^n \varphi_i(f^\top \Phi(x_i)) + \frac{\lambda}{2} \|f\|^2 = \max_{\alpha \in \mathbb{R}^n} -\sum_{i=1}^n \psi_i(\lambda \alpha_i) - \frac{\lambda}{2} \alpha^\top K \alpha$ is a convex function of the Gram matrix $K$ Theoretical learning bounds (Lanckriet et al., 2004; Srebro and Ben-David, 2006)

23 MKL - equivalence with other kernel learning formulations (Bach et al., 2004a) Block $\ell_1$-norm problem: $\min \sum_{i=1}^n \varphi_i(f_1^\top \Phi_1(x_i) + \cdots + f_m^\top \Phi_m(x_i)) + \frac{\lambda}{2} (\|f_1\| + \cdots + \|f_m\|)^2$ Kernel learning formulation: minimize with respect to $\eta$ in the simplex: $G(K(\eta)) = \max_{\alpha \in \mathbb{R}^n} -\sum_{i=1}^n \psi_i(\lambda \alpha_i) - \frac{\lambda}{2} \alpha^\top \big( \sum_{j=1}^m \eta_j K_j \big) \alpha$ Proposition: block $\ell_1$-norm regularization is equivalent to minimizing the optimal value $G(K(\eta))$ with respect to $\eta$ Weights $\eta$ obtained from optimality conditions Single optimization problem for learning both $\eta$ and $\alpha$

24 Algorithms for MKL (Very) costly optimization with SDP, QCQP or SOCP: for $n \approx 1{,}000$-$10{,}000$ and $m \approx 100$, not possible; loosen the required precision and use first-order methods (see, e.g., Bottou and Bousquet (2008)) Dual coordinate ascent (SMO) with smoothing (Bach et al., 2004a) Optimization of $G(K)$ by cutting planes (Sonnenburg et al., 2006) Optimization of $G(K)$ by steepest descent with smoothing (Rakotomamonjy et al., 2008) Regularization path (Bach et al., 2004b) etc.
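As an illustration of the first-order approach, here is a minimal SimpleMKL-style sketch for the squared loss, where the conventions above give the closed form $G(K) = \frac{\lambda}{2} y^\top (K + \lambda I)^{-1} y$ and $\partial G / \partial \eta_j = -\frac{\lambda}{2} \alpha^\top K_j \alpha$ with $\alpha = (K + \lambda I)^{-1} y$. Projected gradient descent on the simplex stands in for the reduced-gradient method of Rakotomamonjy et al. (2008); the step size and toy data are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def mkl_ridge(Ks, y, lam=0.1, lr=0.01, iters=500):
    """Minimize eta -> G(K(eta)) over the simplex for the squared loss,
    where G(K) = (lam/2) y^T (K + lam I)^{-1} y and
    dG/deta_j = -(lam/2) alpha^T K_j alpha with alpha = (K + lam I)^{-1} y."""
    m, n = len(Ks), len(y)
    eta = np.ones(m) / m
    for _ in range(iters):
        K = sum(e * Kj for e, Kj in zip(eta, Ks))
        alpha = np.linalg.solve(K + lam * np.eye(n), y)
        grad = np.array([-(lam / 2.0) * alpha @ Kj @ alpha for Kj in Ks])
        eta = project_simplex(eta - lr * grad)  # small constant step for this toy problem
    return eta

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sin(X[:, 0])                                # only source 0 is informative
Ks = [np.exp(-(X[:, [j]] - X[:, [j]].T) ** 2) for j in range(3)]
print(mkl_ridge(Ks, y).round(3))                   # weight should concentrate on kernel 0
```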

25 Summing kernels vs. optimizing weights Different regularizations: Regularization by $\sum_{j=1}^m \|f_j\|^2$ is equivalent to using $K = \sum_{j=1}^m K_j$ Regularization by $\sum_{j=1}^m \|f_j\|$ should impose sparsity at the group level and learn sparse weights $\eta$, with $K = \sum_{j=1}^m \eta_j K_j$ If sparsity is not expected, $\ell_1$ has no reason to be better

26 Performance on Corel14 (Harchaoui and Bach, 2007) Corel14: 1400 natural images with 14 classes

27 Performance on Corel14 (Harchaoui and Bach, 2007) [bar plot: test error rates on Corel14 for histogram kernels (H), walk kernels (W), tree-walk kernels (TW), weighted tree-walks (wTW), and MKL (M)]

28 Caltech101 database (Fei-Fei et al., 2006)

29 Kernel combination for Caltech101 (Varma and Ray, 2007) [table: classification accuracies (mean ± std) for 1-NN, SVM (1 vs. 1), and SVM (1 vs. rest), for individual kernels (Shape GB, two variants; Self Similarity; PHOG, two variants; PHOWColour; PHOWGray) and for combinations: MKL with the block $\ell_1$-norm and the method of Varma and Ray (2007); numerical values lost in extraction] See also Bosch et al. (2008)

30 Talk outline Multiple sources in computer vision Multiple kernel learning (MKL) Equivalent formulations Theoretical analysis and open problems Covariance operators

31 Analysis of MKL as nonparametric group Lasso Assume $m$ Hilbert spaces $\mathcal{F}_j$, $j = 1, \dots, m$, on $m$ different input spaces: $\min_{f_j \in \mathcal{F}_j,\, j=1,\dots,m} \frac{1}{2n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^m f_j(x_{ji}) \Big)^2 + \frac{\mu_n}{2} \sum_{j=1}^m \|f_j\|$ NB: $f_j(x_{ji}) = f_j^\top \Phi_j(x_{ji})$ Sparse generalized additive models (Hastie and Tibshirani, 1990; Ravikumar et al., 2007) Algorithms: use the parametrization with $\alpha$ Analysis: do not use $\alpha$; use covariance operators (i.e., stay in the primal/input space)

32 (Non-centered) covariance operators Single random variable $X$: $\Sigma_{XX}$ is a bounded linear operator from $\mathcal{F}$ to $\mathcal{F}$ such that for all $(f,g) \in \mathcal{F} \times \mathcal{F}$, $\langle f, \Sigma_{XX} g \rangle = \mathbb{E}[f(X) g(X)]$ Under minor assumptions, the operator $\Sigma_{XX}$ is self-adjoint, nonnegative and Hilbert-Schmidt Tool of choice for the analysis of least-squares nonparametric methods (Blanchard, 2006; Fukumizu et al., 2005, 2006; Gretton et al., 2006; Harchaoui et al., 2007, 2008) Natural empirical estimate $\langle f, \hat{\Sigma}_{XX} g \rangle = \frac{1}{n} \sum_{i=1}^n f(x_i) g(x_i)$, which converges in probability to $\Sigma_{XX}$ in Hilbert-Schmidt norm

33 Cross-covariance operators Several random variables: cross-covariance operators $\Sigma_{X_i X_j}$ from $\mathcal{F}_j$ to $\mathcal{F}_i$ such that for all $(f_i, f_j) \in \mathcal{F}_i \times \mathcal{F}_j$, $\langle f_i, \Sigma_{X_i X_j} f_j \rangle = \mathbb{E}[f_i(X_i) f_j(X_j)]$ Similar convergence properties for the empirical estimates Joint covariance operator $\Sigma_{XX}$ defined by blocks We can define the bounded correlation operators through $\Sigma_{X_i X_j} = \Sigma_{X_i X_i}^{1/2} C_{X_i X_j} \Sigma_{X_j X_j}^{1/2}$ NB: the joint covariance operator is never invertible, but the correlation operator may be
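The empirical pairing $\langle f, \hat\Sigma_{XY} g \rangle = \frac{1}{n} \sum_{i=1}^n f(x_i) g(y_i)$ reduces to plain Gram-matrix algebra when $f$ and $g$ are kernel expansions: with $f = \sum_p a_p k_X(x_p, \cdot)$ and $g = \sum_q b_q k_Y(y_q, \cdot)$ it equals $\frac{1}{n} a^\top K_X K_Y b$. A minimal sketch checking this identity (the Gaussian kernels, data, and random coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = x + 0.1 * rng.normal(size=n)               # two dependent sources

k = lambda u, v: np.exp(-0.5 * (u[:, None] - v[None, :]) ** 2)  # Gaussian kernel
Kx, Ky = k(x, x), k(y, y)
a, b = rng.normal(size=n), rng.normal(size=n)  # expansion coefficients of f and g

# f, g as kernel expansions: f(.) = sum_p a_p k(x_p, .), g(.) = sum_q b_q k(y_q, .)
f = lambda z: a @ k(x, z)
g = lambda z: b @ k(y, z)

# empirical pairing <f, Sigma_hat_{XY} g> = (1/n) sum_i f(x_i) g(y_i)
pointwise = np.mean(f(x) * g(y))
# the same quantity through Gram matrices: (1/n) a^T Kx Ky b
gram_form = a @ Kx @ Ky @ b / n
assert np.isclose(pointwise, gram_form)
```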

34 Covariance operators for multiple sources Simple tool for characterizing relationships between sources Formally equivalent to the finite-dimensional feature space setting Allows proper asymptotic and non-asymptotic analysis, e.g., limit distributions of nonparametric test statistics (Gretton et al., 2006; Harchaoui, Bach and Moulines, 2007, 2008)

35 Analysis of MKL as nonparametric group Lasso Assumptions: 1. Generalized additive model: there exist functions $f = (f_1, \dots, f_m) \in \mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_m$ such that $Y = \sum_{j=1}^m f_j(X_j) + \varepsilon$ 2. Compactness and invertibility: all cross-correlation operators are compact and the joint correlation operator is invertible 3. Additional technical assumptions

36 Compactness and invertibility of the joint correlation operator Sufficient condition for compactness when distributions have densities: $\mathbb{E}\Big[ \frac{p_{X_i X_j}(X_i, X_j)}{p_{X_i}(X_i)\, p_{X_j}(X_j)} \Big] - 1 < \infty$, i.e., the dependence between variables is not too strong Sufficient condition for invertibility: no exact correlation using functions in the RKHSs, i.e., the empty concurvity space assumption (Hastie and Tibshirani, 1990)

37 Group Lasso - consistency conditions Strict condition: $\max_{i \in J^c} \big\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_J} C_{X_J X_J}^{-1} \mathrm{Diag}(1/\|f_j\|)\, g_J \big\| < 1$ Weak condition: $\max_{i \in J^c} \big\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_J} C_{X_J X_J}^{-1} \mathrm{Diag}(1/\|f_j\|)\, g_J \big\| \leq 1$ Theorem 1: the strict condition is sufficient for joint regular and sparsity consistency of the group Lasso. Theorem 2: the weak condition is necessary for joint regular and sparsity consistency of the group Lasso.

38 Conclusion - Interesting problems/issues Multiple kernel learning for supervised learning with multiple sources Vision (and bioinformatics?) Equivalent formulations Learning from exponentially many sources Theory: good estimation as long as log p = o(n) Structure is needed! (Bach, NIPS 2008) Choosing well-behaved sources Different sources or similar sources? Characterizing when using multiple sources helps

39 References F. R. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008. F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the International Conference on Machine Learning (ICML), 2004a. F. R. Bach, R. Thibaux, and M. I. Jordan. Computing regularization paths for learning multiple kernels. In Advances in Neural Information Processing Systems 17, 2004b. A. Bosch, A. Zisserman, and X. Munoz. Image classification using ROIs and multiple kernel learning. International Journal of Computer Vision, submitted. L. Bottou and O. Bousquet. Learning using large datasets. In Mining Massive DataSets for Security, NATO ASI Workshop Series. IOS Press, Amsterdam, 2008. To appear. L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models for 101 object categories. Computer Vision and Image Understanding, 2006. T. Gärtner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In COLT, 2003. K. Grauman and T. Darrell. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8:725-760, 2007. Z. Harchaoui and F. R. Bach. Image classification with segmentation graph kernels. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

40 T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990. H. Kashima, K. Tsuda, and A. Inokuchi. Kernels for graphs. In Kernel Methods in Computational Biology. MIT Press, 2004. G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004. F. Meyer. Hierarchies of partitions and morphological segmentation. In Scale-Space and Morphology in Computer Vision. Springer-Verlag, 2001. A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, to appear, 2008. S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531-1565, 2006. M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In Proc. ICCV, 2007.
