Carlo Vittorio Cannistraci. Minimum Curvilinear Embedding unveils nonlinear patterns in 16S metagenomic data

Carlo Vittorio Cannistraci, Biomedical Cybernetics Group, Biotechnology Center (BIOTEC), Technische Universität Dresden (TUD)

A. Palladini, S. Ciucci, F. Paroni Sterbini, L. Masucci, G. Cammarota, G. Ianiro, B. Posteraro, M. Sanguinetti, G. Gasbarrini, A. Gasbarrini & C.V. Cannistraci

Background
Metagenomic data are typically multidimensional: a table of counts of bacterial taxa across samples.
Multivariate analyses usually applied to metagenomic data: Principal Component Analysis (PCA) and Principal Coordinates Analysis (PCoA).

Goal: analysis of patterns
- Visualization and discrimination: dimensionality reduction of the dataset (n samples, m features such as genes or proteins, with m >> n) to a 2D space (D1, D2)
- Unsupervised classification: grouping the samples present in the dataset into homogeneous classes by dimension reduction and clustering

When?
- Analysis of preliminary data (few samples) in a project
- Proposal of new, unexpected biomedical hypotheses (determining new labels: molecular disease reclassification)
- No prior knowledge is available (absence of labels)
- Prior knowledge is not trusted (risk of false or missing labels)
In general: the aim is the unsupervised discovery of new knowledge.

Issue 1: Small datasets (curse of dimensionality)
Small datasets (features m >> samples n) expose the pitfalls of supervised methods (Smialowski et al., Bioinformatics 2009).
Frequent solution: unsupervised hybrid two-phase procedure (H2P), that is, dimension reduction coupled with clustering (Martella, Bioinformatics 2006).
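The two-phase idea, dimension reduction followed by clustering, can be sketched as below. This is only an illustration under stated assumptions: the slide does not say which methods fill the two phases, so PCA (via SVD) and a basic k-means are used as stand-ins, and the names `pca_project`, `kmeans` and the toy data are hypothetical.

```python
import numpy as np

def pca_project(X, k=2):
    """Phase 1 (assumed): project samples (rows of X) onto the first k principal components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]                       # sample scores on the top-k components

def kmeans(Y, n_clusters, n_iter=100):
    """Phase 2 (assumed): Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [Y[0]]
    while len(centers) < n_clusters:
        # squared distance of every sample to its nearest already-chosen center
        d = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy data (hypothetical): n = 6 samples, m = 50 features (m >> n), two separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (3, 50)), rng.normal(10.0, 1.0, (3, 50))])
labels = kmeans(pca_project(X, k=2), n_clusters=2)
```

Clustering in the reduced space rather than in the original m-dimensional space is what makes the procedure usable when m >> n. No class labels are needed at any point.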

Principal component analysis (PCA)

Nonlinear Dimension Reduction
- Kernel based (example: Gaussian-PCA)
- Manifold based (example: Isomap; Tenenbaum et al., Science 2000)
Issue 3: Hypothesis of local continuity of the manifold
Issue 4: Presence of free parameters to tune!

The inspiration

How MC works
Navigating between the points with a greedy routing process: the minimum spanning tree (MST)!
[Figure: points A and B plotted on axes V1 and V2, comparing their Euclidean distance with their Minimum Curvilinearity (MC) distance, next to the MC distance matrix]
The greedy routing navigability is a way to map the hidden nonlinear topology.
For MC, the global mapping and the local fitting are reciprocally dependent: minimize globally and fit locally!
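A minimal sketch of an MST-based distance follows. It assumes Euclidean distances as edge weights and takes the MC distance between two samples to be the sum of the edge weights along the unique path joining them on the MST. The function names are hypothetical, and Prim's algorithm is one of several ways to build the tree.

```python
import numpy as np

def euclidean_distances(X):
    """Pairwise Euclidean distances between the rows (samples) of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def minimum_spanning_tree(D):
    """Prim's algorithm on a dense distance matrix; returns (i, j, weight) edges."""
    n = D.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()                 # cheapest known link from each node to the tree
    parent = np.zeros(n, dtype=int)    # tree node that provides that cheapest link
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(D[parent[j], j])))
        in_tree[j] = True
        closer = (D[j] < best) & ~in_tree
        best[closer] = D[j][closer]
        parent[closer] = j
    return edges

def mc_distance_matrix(X):
    """Distances measured along the MST: sum of edge weights on the tree path."""
    D = euclidean_distances(X)
    n = D.shape[0]
    adj = [[] for _ in range(n)]
    for i, j, w in minimum_spanning_tree(D):
        adj[i].append((j, w))
        adj[j].append((i, w))
    MC = np.zeros((n, n))
    for s in range(n):                 # walk the tree from every sample
        stack = [(s, -1, 0.0)]
        while stack:
            node, prev, dist = stack.pop()
            MC[s, node] = dist
            for nxt, w in adj[node]:
                if nxt != prev:
                    stack.append((nxt, node, dist + w))
    return MC

# Toy points (hypothetical): three on a line, then a hook upwards.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.5]])
MC = mc_distance_matrix(X)
```

For these four toy points the straight-line distance between the first and last point is 2.5, while the distance along the tree is 1 + 1 + 1.5 = 3.5. The tree path follows the shape of the point cloud instead of cutting across it, and no parameter has to be tuned.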

Theory
Minimum Curvilinear theory:
CV Cannistraci, T Ravasi, FM Montevecchi, T Ideker, M Alessio. Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes. Bioinformatics 2010, 26 (18), i531-i539.
SVD-based version of Minimum Curvilinear embedding:
CV Cannistraci, G Alanis-Lobato, T Ravasi. Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding. Bioinformatics 2013, 29 (13), i199-i209.

Examples of applications
Comparison of PCA versus MCE:
G Alanis-Lobato, CV Cannistraci, A Eriksson, A Manica, T Ravasi. Highlighting nonlinear patterns in population genetics datasets. Scientific Reports 2015, 5.
S Sales, J Graessler, S Ciucci, R Al-Atrib, T Vihervaara, K Schuhmann, ..., Carlo V. Cannistraci and Andrej Shevchenko. Gender, Contraceptives and Individual Metabolic Predisposition Shape a Healthy Plasma Lipidome. Scientific Reports 6.

Background
(Metagen)omic data are complex and often characterized by nonlinearity: can it be detected by linear dimensionality reduction techniques? Hardly.
[Figure: illustration of nonlinearity, http://da.cira.colostate.edu/sites/default/files/theme_img/nonlinearity.png]

Our case-study
An instance where PCA and PCoA failed to detect data structure:
- Human gastric biopsies from dyspeptic patients, either subjected to therapy with Proton Pump Inhibitors (PPI) or untreated
- 24 samples: 12 treated with PPI, 12 untreated
- Positive to Helicobacter pylori: 4 and 5 samples (counts as shown in the sample diagram)

Biological question
Does PPI treatment affect the gastric microbiota?
Computational question
Are linear multivariate techniques sufficient to detect patterns in complex data?

Methods
Application of linear multivariate techniques:
- PCA
- PCoA, that is, classical multidimensional scaling (cmds), on Bray-Curtis distance (also on UniFrac distance, with the same results)
Application of nonlinear machine learning:
- Non-centered MCE (ncMCE)
- Isometric Feature Mapping (Isomap)
- Laplacian Eigenmaps (LE)
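As an illustration of the PCoA step listed above, here is a minimal sketch of Bray-Curtis dissimilarity followed by classical multidimensional scaling. The count table is a made-up toy example and the function names are hypothetical. The analysis behind the slides may have used dedicated software and preprocessing that are not described here.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples) of a non-negative count table."""
    counts = np.asarray(counts, dtype=float)
    num = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=-1)
    den = (counts[:, None, :] + counts[None, :, :]).sum(axis=-1)
    # pairs of all-zero samples get dissimilarity 0 instead of 0/0
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def classical_mds(D, k=2):
    """Classical MDS: double-centre the squared distances, then eigendecompose."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # Gram matrix of the centered configuration
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:k]           # k largest eigenvalues
    evals, evecs = evals[order], evecs[:, order]
    return evecs * np.sqrt(np.clip(evals, 0.0, None))   # negative eigenvalues are set to 0

# Toy count table (hypothetical): 3 samples x 3 taxa.
counts = np.array([[10, 0, 5],
                   [8, 2, 5],
                   [0, 12, 1]])
BC = bray_curtis(counts)
coords = classical_mds(BC, k=2)
```

Bray-Curtis is not a Euclidean distance, so the centered matrix can have negative eigenvalues. The sketch simply clips them to zero, which is one common but not the only way to handle them.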

Methods
Focus on MCE (Cannistraci et al., Bioinformatics 26.18 (2010): i531-i539). MCE:
- is a parameter-free nonlinear machine learning method for dimension reduction
- estimates nonlinear sample distances by Minimum Spanning Tree
- was designed for small sample-size datasets
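Once tree-based distances between samples are available, an embedding step turns them into coordinates. The sketch below is an assumption-laden illustration, not the published algorithm: it factorises a given distance matrix by SVD, either after double-centering the squared distances (classical-MDS style) or without any centering, which is taken here as one reading of "non-centered". The function name, the toy matrix and the scaling of the coordinates are hypothetical, and details of the published MCE and ncMCE (for example which dimensions are plotted) are not given on the slides.

```python
import numpy as np

def svd_embedding(D, k=2, center=True):
    """Embed samples in k dimensions from a square, symmetric distance matrix D.

    center=True : double-centre the squared distances before the SVD.
    center=False: factorise D as it is (assumed meaning of 'non-centered').
    """
    D = np.asarray(D, dtype=float)
    if center:
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        M = -0.5 * J @ (D ** 2) @ J
    else:
        M = D
    U, S, Vt = np.linalg.svd(M)
    # scale the top-k right singular vectors by the square roots of their singular values
    return (np.sqrt(S[:k])[:, None] * Vt[:k]).T   # rows = samples, columns = dimensions

# Toy tree distances (hypothetical): four samples on a chain with edge weights 1, 1, 1.5.
MC = np.array([[0.0, 1.0, 2.0, 3.5],
               [1.0, 0.0, 1.0, 2.5],
               [2.0, 1.0, 0.0, 1.5],
               [3.5, 2.5, 1.5, 0.0]])
Y_centered = svd_embedding(MC, k=2, center=True)
Y_noncentered = svd_embedding(MC, k=2, center=False)
```

In the centered case the chain is laid out along a single axis and its tree distances are reproduced exactly. The non-centered case gives a different configuration of the same samples, because the origin is not moved to the centroid before the factorisation.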

Methods
MCE has already proved successful in detecting patterns in the bacterial metagenomes of sponges (Bayer et al., FEMS Microbiology Ecology 90.3 (2014): 832-843).
[Figure panels: MCE, PCoA]

Results
[Figure panels: PCA, cmds]

Results
[Figure panels: Isomap, Laplacian Eigenmaps]

Results
Non-centered MCE detects 3 groups:
- Untreated H. pylori negative samples
- Treated samples
- Untreated H. pylori positive samples

Question
Why does ncMCE find a more complex data structure than PCA, MDS and the other nonlinear machine learning techniques?
We look for an answer by applying these techniques to a known nonlinear structure, the Swiss roll.
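For readers who want to reproduce such a benchmark, a Swiss roll can be generated in a few lines. The slides do not state how their Swiss roll was built, so the parametrisation below (a spiral in the x-z plane extruded along y), the sample size and the function name are assumptions.

```python
import numpy as np

def make_swiss_roll(n_samples=500, height=21.0, seed=0):
    """Sample points on a Swiss roll; returns the 3D points and the position t along the spiral."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n_samples))   # angle, between 1.5*pi and 4.5*pi
    y = height * rng.random(n_samples)                       # position across the width of the roll
    X = np.column_stack([t * np.cos(t), y, t * np.sin(t)])   # radius grows with the angle
    return X, t

X, t = make_swiss_roll()
```

Two points whose angles differ by a full turn sit on adjacent layers of the roll. They are close in 3D space but far apart along the surface, which is the kind of structure a linear projection cannot unroll.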

Results
[Figure panels: PCA, cmds (Bray-Curtis distance)]

Results
[Figure panels: MCE, Isomap, LE]

ncMCE versus PCA and MDS
The problem is nonlinearity: linear techniques such as PCA and classical MDS cannot detect the complex structure hidden in our dataset. They can separate only two groups at a time and cannot resolve the differences among the 3 groups at once, because of the confounding effect of the presence/absence of treatment and of Helicobacter pylori.

ncMCE versus Isomap and LE
The problem is sparsity: Isomap and LE perform well on the Swiss roll, as expected, since they were designed for the dimension reduction of nonlinear structures. Nevertheless, they do not perform well on the real metagenomic dataset because of its sparsity: the data points are not dense enough, and these omic data are typically zero-inflated.

Biological question
Does PPI treatment affect the gastric microbiota? YES
Computational question
Are linear multivariate techniques sufficient to detect patterns in complex data? NO

Conclusions
- PPI treatment modifies the gastric microbiota
- Linear multivariate techniques such as PCA and MDS are not sufficient to discover nonlinear structure; ncMCE and other nonlinear techniques should therefore complement them in data exploration

A further step: ncMCE-derived discriminative network

Carlo Vittorio Cannistraci
Biomedical Cybernetics Group
WEB: https://sites.google.com/site/carlovittoriocannistraci/home
http://www.biotec.tu-dresden.de/research/cannistraci/
EMAIL: kalokagathos.agon@gmail.com