Multiple Kernel Self-Organizing Maps


Madalina Olteanu (2), Nathalie Villa-Vialaneix (1,2), Christine Cierco-Ayrolles (1)
http://www.nathalievilla.org
{madalina.olteanu,nathalie.villa}@univ-paris1.fr, christine.cierco@toulouse.inra.fr
ESANN 2013, Bruges, April 24th, 2013

Outline
1 Introduction
2 MK-SOM
3 Applications

Introduction: Data, notations and objectives

Data: $D$ datasets $(x_i^d)_{i=1,\dots,n,\ d=1,\dots,D}$, all measured on the same individuals or on the same objects $\{1,\dots,n\}$, each taking values in an arbitrary space $\mathcal{G}_d$.

Examples: $(x_i^d)_i$ are $n$ observations of $p$ numeric variables; $(x_i^d)_i$ are $n$ nodes of a graph; $(x_i^d)_i$ are $n$ observations of $p$ factors; $(x_i^d)_i$ are $n$ texts/labels...

Purpose: combine all the datasets to obtain a map of the individuals/objects: self-organizing maps for clustering $\{1,\dots,n\}$ using all the datasets.

Introduction: Example [Villa-Vialaneix et al., 2013]

Data: a network with labeled nodes. Examples: gender in a social network, functional group of a gene in a gene interaction network...

Purpose: find communities, i.e., groups of nodes that are close in the graph (close meaning densely connected within the group and sharing comparatively few links with the other groups, hence "communities"), but that also have similar labels.


MK-SOM: Kernels / multiple kernels

What is a kernel? Given $(x_i) \in \mathcal{G}$, a matrix $(K(x_i, x_j))_{ij}$ such that $K(x_i, x_j) = K(x_j, x_i)$ and, for all $(\alpha_i)_i$, $\sum_{ij} \alpha_i \alpha_j K(x_i, x_j) \geq 0$. In this case [Aronszajn, 1950], there exist a Hilbert space $(\mathcal{H}, \langle\cdot,\cdot\rangle)$ and a map $\Phi : \mathcal{G} \to \mathcal{H}$ such that $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$.

Examples:
- nodes of a graph: heat kernel $K = e^{-\beta L}$ or $K = L^+$, where $L$ is the graph Laplacian [Kondor and Lafferty, 2002, Smola and Kondor, 2003, Fouss et al., 2007];
- numerical variables: Gaussian kernel $K(x_i, x_j) = e^{-\beta \|x_i - x_j\|^2}$;
- texts: [Watkins, 2000]...
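As an illustration, here is a minimal sketch of these two example kernels in Python/NumPy (the function names and the default β = 1 are our own choices, not from the slides):

```python
import numpy as np

def gaussian_kernel(X, beta=1.0):
    """Gaussian kernel K(x_i, x_j) = exp(-beta ||x_i - x_j||^2) on the rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-beta * sq_dists)

def laplacian_pinv_kernel(A):
    """K = L^+, the pseudo-inverse of the Laplacian L = D - A of a graph
    given by its (symmetric) adjacency matrix A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)
```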

MK-SOM: Kernel SOM [Mac Donald and Fyfe, 2000, Andras, 2002, Villa and Rossi, 2007]

$(x_i)_i \in \mathcal{G}$ are described by a kernel $(K(x_i, x_{i'}))_{ii'}$. Prototypes are defined in $\mathcal{H}$, $p_j = \sum_i \gamma_{ji} \Phi(x_i)$, and the energy is computed in $\mathcal{H}$:

1: Prototype initialization: randomly set $p_j^0 = \sum_{i=1}^n \gamma_{ji}^0 \Phi(x_i)$ with $\gamma_{ji}^0 \geq 0$ and $\sum_i \gamma_{ji}^0 = 1$
2: for $t = 1, \dots, T$ do
3:   Randomly choose $i \in \{1, \dots, n\}$
4:   Assignment: $f^t(x_i) \leftarrow \arg\min_{j=1,\dots,M} \|\Phi(x_i) - p_j^{t-1}\|_{\mathcal{H}}$, which expands, with the kernel, to $\arg\min_{j=1,\dots,M} \sum_{l,l'} \gamma_{jl}^{t-1} \gamma_{jl'}^{t-1} K(x_l, x_{l'}) - 2 \sum_l \gamma_{jl}^{t-1} K(x_l, x_i)$
5:   for all $j = 1, \dots, M$ do (Representation)
6:     $\gamma_j^t \leftarrow \gamma_j^{t-1} + \mu^t h^t(\delta(f^t(x_i), j)) \left(\mathbf{1}_i - \gamma_j^{t-1}\right)$
7:   end for
8: end for

$\delta$: distance between neurons; $h$: decreasing function ($(h^t)_t$ diminishes with $t$); $\mu^t \sim 1/t$.
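Below is a minimal sketch of this online kernel SOM, assuming a precomputed kernel matrix K and a square grid of M units; the grid distance δ, the neighborhood h^t and the step μ^t = 1/t are simple choices of ours, not prescribed by the slides:

```python
import numpy as np

def kernel_som(K, grid_side=3, T=5000, seed=0):
    """Online kernel SOM; prototypes are p_j = sum_i gamma[j, i] Phi(x_i)."""
    rng = np.random.default_rng(seed)
    n, M = K.shape[0], grid_side ** 2
    coords = np.array([(a, b) for a in range(grid_side) for b in range(grid_side)])
    gamma = rng.random((M, n))
    gamma /= gamma.sum(axis=1, keepdims=True)     # gamma_ji >= 0, rows sum to 1
    for t in range(1, T + 1):
        i = rng.integers(n)
        # assignment: ||Phi(x_i) - p_j||^2 expanded with the kernel trick
        # (the constant K(x_i, x_i) is dropped, it does not change the argmin)
        dists = np.einsum('jl,jm,lm->j', gamma, gamma, K) - 2 * gamma @ K[:, i]
        f = np.argmin(dists)
        # representation: move gamma_j toward the indicator vector of x_i
        delta = np.abs(coords - coords[f]).sum(axis=1)   # neuron grid distance
        h = np.exp(-delta ** 2 * t / T)                  # shrinking neighborhood
        one_i = np.zeros(n)
        one_i[i] = 1.0
        gamma += (1.0 / t) * h[:, None] * (one_i - gamma)
    return gamma
```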

MK-SOM: Combining kernels

Suppose that each $(x_i^d)_i$ is described by a kernel $K_d$. The kernels can be combined (e.g., [Rakotomamonjy et al., 2008] for SVM):

$K = \sum_{d=1}^D \alpha_d K_d$, with $\alpha_d \geq 0$ and $\sum_d \alpha_d = 1$.

Remark: this is also useful to integrate different types of information coming from different kernels on the same dataset.
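A short sketch of this convex combination, assuming the D kernel matrices are stacked in an array Ks of shape (D, n, n):

```python
import numpy as np

def combine_kernels(Ks, alpha):
    """K = sum_d alpha[d] Ks[d], with alpha[d] >= 0 and sum(alpha) = 1."""
    alpha = np.asarray(alpha, dtype=float)
    assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)
    return np.tensordot(alpha, Ks, axes=1)   # weighted sum over the first axis
```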

MK-SOM: multiple kernel SOM

1: Prototype initialization: randomly set $p_j^0 = \sum_{i=1}^n \gamma_{ji}^0 \Phi(x_i)$ with $\gamma_{ji}^0 \geq 0$ and $\sum_i \gamma_{ji}^0 = 1$
2: Kernel initialization: set $(\alpha_d^0)_d$ with $\alpha_d^0 \geq 0$ and $\sum_d \alpha_d^0 = 1$ (e.g., $\alpha_d^0 = 1/D$); $K^0 \leftarrow \sum_d \alpha_d^0 K_d$
3: for $t = 1, \dots, T$ do
4:   Randomly choose $i \in \{1, \dots, n\}$
5:   Assignment: $f^t(x_i) \leftarrow \arg\min_{j=1,\dots,M} \|\Phi(x_i) - p_j^{t-1}\|^2_{\mathcal{H}(K^t)} = \arg\min_{j=1,\dots,M} \sum_{l,l'} \gamma_{jl}^{t-1} \gamma_{jl'}^{t-1} K^t(x_l, x_{l'}) - 2 \sum_l \gamma_{jl}^{t-1} K^t(x_l, x_i)$
6:   for all $j = 1, \dots, M$ do (Representation)
7:     $\gamma_j^t \leftarrow \gamma_j^{t-1} + \mu^t h^t(\delta(f^t(x_i), j)) \left(\mathbf{1}_i - \gamma_j^{t-1}\right)$
8:   end for
9:   empty state ;-)
10: end for

MK-SOM: Tuning $(\alpha_d)_d$ on-line

Purpose: minimize over $(\gamma_{ji})_{ji}$ and $(\alpha_d)_d$ the energy

$E\left((\gamma_{ji})_{ji}, (\alpha_d)_d\right) = \sum_{i=1}^n \sum_{j=1}^M h\left(\delta(f(x_i), j)\right) \left\|\phi^\alpha(x_i) - p_j^\alpha(\gamma_j)\right\|_\alpha^2.$

Kernel SOM picks an observation $x_i$ whose contribution to the energy is

$E_{x_i} = \sum_{j=1}^M h\left(\delta(f(x_i), j)\right) \left\|\phi^\alpha(x_i) - p_j^\alpha(\gamma_j)\right\|_\alpha^2. \qquad (1)$

Idea: add a gradient descent step based on the derivative of (1):

$\frac{\partial E_{x_i}}{\partial \alpha_d} = \sum_{j=1}^M h\left(\delta(f(x_i), j)\right) \left[ K_d(x_i^d, x_i^d) - 2 \sum_{l=1}^n \gamma_{jl} K_d(x_i^d, x_l^d) + \sum_{l,l'=1}^n \gamma_{jl} \gamma_{jl'} K_d(x_l^d, x_{l'}^d) \right] = \sum_{j=1}^M h\left(\delta(f(x_i), j)\right) \left\|\phi^d(x_i) - p_j^d(\gamma_j)\right\|^2_{\mathcal{H}(K_d)}.$
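A sketch of this derivative for a picked observation x_i, assuming the per-source kernel matrices are stacked in Ks (shape (D, n, n)), gamma holds the prototype coefficients (shape (M, n)), and h_vals contains h(δ(f(x_i), j)) for j = 1, ..., M:

```python
import numpy as np

def energy_gradient_alpha(Ks, gamma, h_vals, i):
    """dE_{x_i}/dalpha_d, one component per source kernel K_d."""
    grad = np.empty(Ks.shape[0])
    for d, Kd in enumerate(Ks):
        # ||phi^d(x_i) - p_j^d(gamma_j)||^2 in H(K_d), for every prototype j
        sq = (Kd[i, i]
              - 2 * gamma @ Kd[:, i]
              + np.einsum('jl,jm,lm->j', gamma, gamma, Kd))
        grad[d] = h_vals @ sq
    return grad
```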

MK-SOM: adaptive multiple kernel SOM

1: Prototype initialization: randomly set $p_j^0 = \sum_{i=1}^n \gamma_{ji}^0 \Phi(x_i)$ with $\gamma_{ji}^0 \geq 0$ and $\sum_i \gamma_{ji}^0 = 1$
2: Kernel initialization: set $(\alpha_d^0)_d$ with $\alpha_d^0 \geq 0$ and $\sum_d \alpha_d^0 = 1$ (e.g., $\alpha_d^0 = 1/D$); $K^0 \leftarrow \sum_d \alpha_d^0 K_d$
3: for $t = 1, \dots, T$ do
4:   Randomly choose $i \in \{1, \dots, n\}$
5:   Assignment: $f^t(x_i) \leftarrow \arg\min_{j=1,\dots,M} \|\Phi(x_i) - p_j^{t-1}\|^2_{\mathcal{H}(K^t)}$
6:   for all $j = 1, \dots, M$ do (Representation)
7:     $\gamma_j^t \leftarrow \gamma_j^{t-1} + \mu^t h^t(\delta(f^t(x_i), j)) \left(\mathbf{1}_i - \gamma_j^{t-1}\right)$
8:   end for
9:   Kernel update: $\alpha_d^t \leftarrow \alpha_d^{t-1} - \nu^t \frac{\partial E_{x_i}}{\partial \alpha_d}$ and $K^t \leftarrow \sum_d \alpha_d^t K_d$
10: end for
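Putting the pieces together, a minimal sketch of the adaptive loop, reusing the hypothetical combine_kernels and energy_gradient_alpha helpers sketched above; the clipping and renormalization of alpha after the gradient step are our own addition to keep α_d ≥ 0 and ∑_d α_d = 1, since the slides do not spell out how the constraints are maintained:

```python
import numpy as np

def adaptive_mk_som(Ks, grid_side=3, T=5000, seed=0):
    """Adaptive MK-SOM: kernel SOM steps plus an online update of alpha."""
    rng = np.random.default_rng(seed)
    D, n, _ = Ks.shape
    M = grid_side ** 2
    coords = np.array([(a, b) for a in range(grid_side) for b in range(grid_side)])
    gamma = rng.random((M, n))
    gamma /= gamma.sum(axis=1, keepdims=True)
    alpha = np.full(D, 1.0 / D)                      # alpha_d^0 = 1/D
    K = combine_kernels(Ks, alpha)                   # K^0
    for t in range(1, T + 1):
        i = rng.integers(n)
        # assignment and representation, as in kernel SOM, with the current K
        dists = np.einsum('jl,jm,lm->j', gamma, gamma, K) - 2 * gamma @ K[:, i]
        f = np.argmin(dists)
        delta = np.abs(coords - coords[f]).sum(axis=1)
        h = np.exp(-delta ** 2 * t / T)
        one_i = np.zeros(n)
        one_i[i] = 1.0
        gamma += (1.0 / t) * h[:, None] * (one_i - gamma)
        # kernel update: gradient step on alpha, then back to the simplex
        alpha = alpha - (0.05 / t) * energy_gradient_alpha(Ks, gamma, h, i)
        alpha = np.clip(alpha, 0.0, None)
        alpha /= alpha.sum()
        K = combine_kernels(Ks, alpha)               # K^t
    return gamma, alpha
```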


Applications: Example 1: simulated data

Graph with 200 nodes classified into 8 groups:
- graph: Erdős-Rényi models for groups 1 to 4 and groups 5 to 8, with intra-group edge probability 0.3 and inter-group edge probability 0.01;
- numerical data: nodes labelled with 2-dimensional Gaussian vectors, odd groups $\mathcal{N}\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}0.3 & 0\\0 & 0.3\end{pmatrix}\right)$ and even groups $\mathcal{N}\left(\begin{pmatrix}1\\1\end{pmatrix}, \begin{pmatrix}0.3 & 0\\0 & 0.3\end{pmatrix}\right)$;
- factor with two levels: groups 1, 2, 5 and 7 have the first level; the other groups have the second level.

Only the knowledge of all three datasets together can discriminate the 8 groups.
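A sketch of how such data could be simulated (our reading of the graph description is two Erdős-Rényi blocks, groups 1-4 versus groups 5-8; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
groups = np.repeat(np.arange(1, 9), n // 8)          # 8 groups of 25 nodes

# graph: dense within {1..4} and within {5..8}, sparse across the two blocks
same_block = (groups[:, None] <= 4) == (groups[None, :] <= 4)
probs = np.where(same_block, 0.3, 0.01)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(int)                    # symmetric adjacency matrix

# numerical data: N((0,0), 0.3 I) for odd groups, N((1,1), 0.3 I) for even groups
means = np.where((groups % 2 == 1)[:, None], 0.0, 1.0)
X = means + np.sqrt(0.3) * rng.standard_normal((n, 2))

# factor: first level for groups 1, 2, 5 and 7; second level otherwise
factor = np.where(np.isin(groups, [1, 2, 5, 7]), 0, 1)
```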

Applications: Experiment

Kernels: graph: $L^+$; numerical data: Gaussian kernel; factor: another Gaussian kernel on the disjunctive recoding.

Comparison, on 100 randomly generated datasets as described previously:
- multiple kernel SOM with all three datasets;
- kernel SOM with a single dataset;
- kernel SOM with two datasets or all three datasets combined in a single (Gaussian) kernel.

Applications: An example

[Figure: maps obtained with MK-SOM, with all datasets in one kernel, with the graph only, and with the numerical variables only]

Applications: Numerical comparison I (over 100 simulations), with the mutual information

$\sum_{i,j} \frac{|C_i \cap C'_j|}{200} \log \frac{200\, |C_i \cap C'_j|}{|C_i|\, |C'_j|}$

(adjusted version, equal to 1 if the partitions are identical [Danon et al., 2005]).
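A sketch of this criterion for two partitions given as label vectors (n = 200 in the slides; the adjusted/normalized rescaling of [Danon et al., 2005] is not reproduced here):

```python
import numpy as np

def mutual_information(labels_a, labels_b):
    """sum_{ij} |C_i ∩ C'_j|/n * log(n |C_i ∩ C'_j| / (|C_i| |C'_j|))."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = labels_a.size
    mi = 0.0
    for a in np.unique(labels_a):
        in_a = labels_a == a
        for b in np.unique(labels_b):
            inter = np.sum(in_a & (labels_b == b))
            if inter > 0:   # empty intersections contribute 0
                mi += inter / n * np.log(n * inter / (in_a.sum() * np.sum(labels_b == b)))
    return mi
```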

Applications: Numerical comparison II (over 100 simulations), with node purity (average percentage of nodes in a neuron that come from the majority (original) class of that neuron).

Applications: Conclusion

Summary:
- integrates multiple sources of information in a SOM (e.g., finding communities in a graph while taking the labels into account);
- uses multiple kernels and automatically tunes their combination;
- the method gives relevant communities according to all sources of information, and a well-organized map;
- a similar approach was also tested for relational SOM ([Massoni et al., 2013], for analyzing school-to-work transitions).

Possible developments: the main issue is computational time; a sparse version is currently under study.

Thank you for your attention...

References

Andras, P. (2002). Kernel-Kohonen networks. International Journal of Neural Systems, 12:117-135.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.
Danon, L., Diaz-Guilera, A., Duch, J., and Arenas, A. (2005). Comparing community structure identification. Journal of Statistical Mechanics, page P09008.
Fouss, F., Pirotte, A., Renders, J., and Saerens, M. (2007). Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355-369.
Kondor, R. and Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning, pages 315-322.
Mac Donald, D. and Fyfe, C. (2000). The kernel self organising map. In Proceedings of the 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Applied Technologies, pages 317-320.
Massoni, S., Olteanu, M., and Villa-Vialaneix, N. (2013). Which distance use when extracting typologies in sequence analysis? An application to school to work transitions. In International Work Conference on Artificial Neural Networks (IWANN 2013), Puerto de la Cruz, Tenerife. Forthcoming.
Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9:2491-2521.
Smola, A. and Kondor, R. (2003). Kernels and regularization on graphs. In Warmuth, M. and Schölkopf, B., editors, Proceedings of the Conference on Learning Theory (COLT) and Kernel Workshop, Lecture Notes in Computer Science, pages 144-158.
Villa, N. and Rossi, F. (2007). A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph. In 6th International Workshop on Self-Organizing Maps (WSOM), Bielefeld, Germany. Neuroinformatics Group, Bielefeld University.
Villa-Vialaneix, N., Olteanu, M., and Cierco-Ayrolles, C. (2013). Carte auto-organisatrice pour graphes étiquetés [Self-organizing map for labeled graphs]. In Actes des Ateliers FGG (Fouille de Grands Graphes), colloque EGC (Extraction et Gestion de Connaissances), Toulouse, France.
Watkins, C. (2000). Dynamic alignment kernels. In Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D., editors, Advances in Large Margin Classifiers, pages 39-50, Cambridge, MA, USA. MIT Press.