Carlo Vittorio Cannistraci. Minimum Curvilinear Embedding unveils nonlinear patterns in 16S metagenomic data


1 Carlo Vittorio Cannistraci Minimum Curvilinear Embedding unveils nonlinear patterns in 16S metagenomic data Biomedical Cybernetics Group Biotechnology Center (BIOTEC) Technische Universität Dresden (TUD)

2 A. Palladini, S. Ciucci, F. Paroni Sterbini, L. Masucci, G. Cammarota, G. Ianiro, B. Posteraro, M. Sanguinetti, G. Gasbarrini, A. Gasbarrini & C.V. Cannistraci

3 Background Metagenomic data are typically multidimensional. Multivariate analyses usually applied to metagenomic data: principal component analysis (PCA) and principal coordinates analysis (PCoA). [Figure: table of counts per bacterial taxon and sample]

4 Goal: analysis of patterns. Visualization & discrimination: dimensionality reduction of the dataset in a 2D space. Unsupervised classification: grouping the samples present in the dataset in homogeneous classes (dimension reduction and clustering). [Figure: data matrix of samples 1...n by genes or proteins 1...m, with m >> n, reduced to two dimensions D1 and D2]

5 When?
- Analysis of preliminary data (few samples) in a project
- Proposal of new, unexpected biomedical hypotheses (determining new labels: molecular disease reclassification)
- I do not have prior knowledge (absence of labels)
- I do not trust prior knowledge (risk of false labels or missing labels)
In general: I try to discover new knowledge in an unsupervised way

6 Issue 1: Small dataset (curse of dimensionality). Small datasets (features m >> samples n) and pitfalls of supervised methods (Smialowski et al., Bioinformatics 2009). Frequent solution: unsupervised hybrid-two-phase procedure (H2P), dimension reduction coupled with clustering (Martella, Bioinformatics 2006).

7 Principal component analysis (PCA)

8 Nonlinear Dimension Reduction. Kernel based (example: Gaussian-PCA). Manifold based (example: Isomap; Tenenbaum et al. Science, 2000). Issue 3: hypothesis of local continuity of the manifold. Issue 4: presence of free parameters to tune! [Figure: 3D data (axes x, y, z) mapped to 2D (axes x, y)]

9 The inspiration

10 How MC works: navigating between the points with a greedy routing process: the minimum spanning tree (MST)! [Figure: Euclidean distance versus Minimum Curvilinearity (MC) distance between points A and B; MC distance matrix] The greedy routing navigability is a way to map the hidden nonlinear topology. For MC, the global mapping and the local fitting are reciprocally dependent. MC: minimize globally and fit locally!
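The MST-based distance described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mc_distance`, the Euclidean base distance and the four toy points are assumptions, and duplicate samples (zero distances) are not handled.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def mc_distance(X):
    """Pairwise path lengths over the minimum spanning tree of the samples.

    X: (n_samples, n_features) array. Euclidean distance is an assumption;
    duplicate samples (zero distance) are not handled in this sketch.
    """
    D = squareform(pdist(X, metric="euclidean"))  # dense Euclidean distances
    mst = minimum_spanning_tree(D)                # sparse matrix with n-1 edges
    # Distance between two samples = sum of MST edge weights on their path.
    return shortest_path(mst, method="D", directed=False)


# Four toy points arranged along a bend: A, B, C, D (hypothetical data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.5]])
mc = mc_distance(X)
euclid = squareform(pdist(X))
```

In this toy case the path from A to D runs through B and C, so the MST-based distance between the two end points is larger than their straight-line Euclidean distance.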

11 Theory. Minimum Curvilinear theory: "Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes", CV Cannistraci, T Ravasi, FM Montevecchi, T Ideker, M Alessio, Bioinformatics 2010, 26 (18), i531-i539. SVD-based version of Minimum Curvilinear embedding: "Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding", CV Cannistraci, G Alanis-Lobato, T Ravasi, Bioinformatics 2013, 29 (13), i199-i209.

12 Examples of applications: comparison of PCA versus MCE. "Highlighting nonlinear patterns in population genetics datasets", G Alanis-Lobato, CV Cannistraci, A Eriksson, A Manica, T Ravasi, Scientific Reports 2015, 5. "Gender, Contraceptives and Individual Metabolic Predisposition Shape a Healthy Plasma Lipidome", S Sales, J Graessler, S Ciucci, R Al-Atrib, T Vihervaara, K Schuhmann,..., Carlo V. Cannistraci and Andrej Shevchenko, Scientific Reports 6.

14 Background. (Metagen)omic data are complex and often characterized by nonlinearity: can it be detected by linear dimensionality reduction techniques? Hardly.

15 Our case study: an instance where PCA and PCoA failed to detect data structure. Human gastric biopsies from dyspeptic patients either subjected to therapy with Proton Pump Inhibitors (PPI) or untreated. 24 samples: 12 treated with PPI, 12 untreated; 4 5 positive to Helicobacter pylori

16 Biological question: Does PPI treatment affect the gastric microbiota? Computational question: Are linear multivariate techniques sufficient to detect patterns in complex data?

17 Methods. Application of linear multivariate techniques: PCA; PCoA, that is classical multidimensional scaling (cMDS), on Bray-Curtis distance (also on UniFrac distance, but results are the same). Application of nonlinear machine learning: non-centered MCE; Isometric Feature Mapping (Isomap); Laplacian Eigenmaps (LE).
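As a rough illustration of the linear baseline named above, PCoA (classical MDS) on Bray-Curtis dissimilarities can be written with NumPy alone. The helper names `bray_curtis` and `pcoa` and the toy count table are assumptions made for this sketch; they are not taken from the study.

```python
import numpy as np


def bray_curtis(counts):
    """Bray-Curtis dissimilarity between rows (samples) of a count table."""
    counts = np.asarray(counts, dtype=float)
    diff = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2)
    total = (counts[:, None, :] + counts[None, :, :]).sum(axis=2)
    return diff / total  # assumes no pair of samples is entirely zero


def pcoa(D, n_components=2):
    """Classical MDS: double-centre squared distances, then eigendecompose."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram-like matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Bray-Curtis is not Euclidean, so negative eigenvalues can occur; clip them.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy count table (made up): 4 samples x 3 taxa
counts = np.array([[10, 0, 5], [8, 1, 6], [0, 12, 3], [1, 10, 2]])
D = bray_curtis(counts)
coords = pcoa(D)
```

Each row of `coords` gives the 2D position of one sample; swapping in another dissimilarity matrix (for example UniFrac, computed elsewhere) only changes the input to `pcoa`.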

18 Methods. Focus on MCE (Cannistraci et al., Bioinformatics (2010): i531-i539): it is a parameter-free nonlinear machine learning method for dimension reduction; it estimates nonlinear sample distances by minimum spanning tree; it was designed for small sample-size datasets.
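Once MST-based distances are available, the embedding step can be sketched as an SVD of the distance matrix, with or without MDS-style centring. This is a hedged reading of the SVD-based variant cited in these slides, not a verified reimplementation: the function `svd_embedding`, its `centered` flag and the toy chain distances are assumptions, and which components to plot is a choice left open here.

```python
import numpy as np


def svd_embedding(D, n_components=2, centered=False):
    """Embed samples from a distance matrix D via SVD.

    centered=True applies MDS-style double centring first;
    centered=False factorises D as it is (the non-centred variant).
    """
    D = np.asarray(D, dtype=float)
    if centered:
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        D = -0.5 * J @ (D ** 2) @ J
    U, s, Vt = np.linalg.svd(D)
    # Coordinates: right singular vectors scaled by sqrt of singular values.
    coords = (np.sqrt(s)[:, None] * Vt).T
    return coords[:, :n_components]


# Toy input: path distances along a 4-node chain with unit edges,
# i.e. what an MST-based distance would give for evenly spaced points.
idx = np.arange(4)
D_chain = np.abs(idx[:, None] - idx[None, :]).astype(float)

nc_coords = svd_embedding(D_chain, n_components=2, centered=False)
c_coords = svd_embedding(D_chain, n_components=1, centered=True)
```

With centring, the single returned coordinate places the four chain nodes at unit spacing, which is a quick sanity check that the factorisation behaves like classical MDS in that mode.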

19 Methods. MCE has already proved successful in detecting patterns in the bacterial metagenomes of sponges. [Figure panels: MCE, PCoA] Bayer et al. FEMS Microbiology Ecology 90.3 (2014):

20 Results [Figure panels: PCA, cMDS]

21 Results [Figure panels: Isomap, Laplacian Eigenmaps]

22 Results. Non-centered MCE detects 3 groups: untreated H. pylori negative samples; treated samples; untreated H. pylori positive samples.

23 Question. Why does ncMCE find a more complex data structure than PCA, MDS and the other nonlinear machine learning techniques? We look for an answer by applying these techniques to a nonlinear structure, the Swiss roll.
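To make the Swiss roll concrete, the sketch below generates one with NumPy and compares the straight-line distance between two points one turn apart with the distance measured along the roll. The parametrisation (radius equal to the angle `t`), the sample size and the helper `arc_length` are assumptions chosen for illustration, not the settings used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = np.sort(rng.uniform(1.5 * np.pi, 4.5 * np.pi, n))   # angle along the roll
h = rng.uniform(0.0, 10.0, n)                            # position across the roll
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])   # 3D Swiss roll samples


def arc_length(t0, t1, steps=10_000):
    """Length of the spiral r = t between angles t0 and t1 (trapezoid rule)."""
    tt = np.linspace(t0, t1, steps)
    speed = np.sqrt(1.0 + tt ** 2)
    return float(np.sum(0.5 * (speed[1:] + speed[:-1]) * np.diff(tt)))


# Compare two points one full turn apart, ignoring the height coordinate.
t0, t1 = 2.0 * np.pi, 4.0 * np.pi
p0 = np.array([t0 * np.cos(t0), t0 * np.sin(t0)])
p1 = np.array([t1 * np.cos(t1), t1 * np.sin(t1)])
straight = float(np.linalg.norm(p1 - p0))   # Euclidean shortcut across the roll
along = arc_length(t0, t1)                  # distance following the roll
```

The value of `along` is much larger than `straight`: points on adjacent layers look close to a linear method that works with Euclidean geometry, although they are far apart on the manifold.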

24 Results [Figure panels: PCA, cMDS (Bray-Curtis distance)]

25 Results [Figure panels: MCE, Isomap, LE]

26 ncMCE versus PCA and MDS. The problem is nonlinearity: linear techniques such as PCA and classical MDS cannot detect the complex structure hidden in our dataset. They can detect only two groups at a time, but cannot resolve at once the differences among 3 groups, due to the confounding effect of presence/absence of treatment and of Helicobacter pylori.

27 ncMCE versus Isomap and LE. The problem is sparsity: Isomap and LE perform well on the Swiss roll; as a matter of fact, they were designed for the dimension reduction of nonlinear structures. Nevertheless, they do not perform well on the real metagenomic dataset because of its sparsity: the data points are not dense enough, and there is typically an inflation of zeros in these omic data.

28 Biological question: Does PPI treatment affect the gastric microbiota? YES. Computational question: Are linear multivariate techniques sufficient to detect patterns in complex data? NO.

29 Conclusions. PPI treatment modifies the gastric microbiota. Linear multivariate techniques such as PCA and MDS are not sufficient to discover nonlinear structure; therefore ncMCE and other nonlinear techniques should complement data exploration.

31 A further step: ncMCE-derived discriminative network

32 Carlo Vittorio Cannistraci Biomedical Cybernetics Group WEB:
