Carlo Vittorio Cannistraci. Minimum Curvilinear Embedding unveils nonlinear patterns in 16S metagenomic data

Carlo Vittorio Cannistraci. Minimum Curvilinear Embedding unveils nonlinear patterns in 16S metagenomic data. Biomedical Cybernetics Group, Biotechnology Center (BIOTEC), Technische Universität Dresden (TUD)

A. Palladini, S. Ciucci, F. Paroni Sterbini, L. Masucci, G. Cammarota, G. Ianiro, B. Posteraro, M. Sanguinetti, G. Gasbarrini, A. Gasbarrini & C.V. Cannistraci

Background: metagenomic data are typically multidimensional (a bacterial taxa x samples count matrix). The multivariate analyses usually applied to metagenomic data are Principal Component Analysis and Principal Coordinates Analysis (PCA and PCoA).

Goal: analysis of patterns.
- Visualization & discrimination: dimensionality reduction of the dataset into a 2D space
- Unsupervised classification: grouping the samples present in the dataset into homogeneous classes by dimension reduction and clustering
[Figure: a dataset of n samples and m >> n genes or proteins, reduced to two dimensions D1 and D2.]

When?
- Analysis of preliminary data (few samples) in a project
- Proposal of unexpected new biomedical hypotheses (determining new labels: molecular disease reclassification)
- I do not have prior knowledge (absence of labels)
- I do not trust prior knowledge (risk of false or missing labels)
In general: I try to unsupervisedly discover new knowledge.

Issue 1: small datasets (curse of dimensionality: features m >> samples n) and the pitfalls of supervised methods (Smialowski et al., Bioinformatics 2009). Frequent solution: an unsupervised hybrid two-phase procedure (H2P), dimension reduction coupled with clustering (Martella, Bioinformatics 2006), as sketched below.
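As a concrete illustration, a minimal Python sketch of such an H2P pipeline; the specific choices (PCA for reduction, k-means for clustering, the toy data shape) are assumptions for illustration, not taken from the cited papers.

```python
# Minimal H2P sketch: unsupervised dimension reduction followed by
# clustering. PCA and k-means are illustrative choices; the H2P idea
# is method-agnostic (any reducer coupled with any clusterer).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((24, 200))                    # toy data: n=24 samples, m=200 features (m >> n)

Z = PCA(n_components=2).fit_transform(X)     # phase 1: reduce to 2D
labels = KMeans(n_clusters=3, n_init=10).fit_predict(Z)  # phase 2: cluster in the reduced space
```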

Principal component analysis (PCA)

Nonlinear dimension reduction comes in two flavors: kernel based (example: Gaussian kernel PCA) and manifold based (example: Isomap; Tenenbaum et al., Science, 2000). Issue 3: the hypothesis of local continuity of the manifold. Issue 4: the presence of free parameters to tune (see the sketch below)!
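To make Issue 4 concrete, a hedged scikit-learn sketch: both the kernel-based and the manifold-based methods expose free parameters (the kernel width gamma, the neighborhood size n_neighbors) whose values below are illustrative guesses, not recommendations.

```python
# Issue 4 in practice: both families of nonlinear methods carry free
# parameters that must be tuned; the values here are arbitrary.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.random((100, 10))                    # toy samples x features matrix

# Kernel based: Gaussian (RBF) kernel PCA, free parameter gamma
kpca_2d = KernelPCA(n_components=2, kernel='rbf', gamma=0.1).fit_transform(X)

# Manifold based: Isomap, free parameter n_neighbors
iso_2d = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
```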

The inspiration

How MC works: navigating between the points with a greedy routing process over the minimum spanning tree (MST)! [Figure: Euclidean distance versus minimum curvilinear (MC) distance between two points A and B; the pairwise MC distances are collected in the MC distance matrix.] The greedy routing navigability is a way to map the hidden nonlinear topology. For MC, the global mapping and the local fitting are reciprocally dependent: MC minimizes globally and fits locally! A sketch of the idea follows.
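The idea can be sketched in a few lines of Python: MC distances are path lengths over the MST, and a non-centered SVD of the resulting distance matrix yields the embedding. This is a minimal sketch under stated assumptions (Euclidean input distances; the first singular dimension is discarded as an offset term); for the exact published procedure see Cannistraci et al. (Bioinformatics 2010, 2013).

```python
# Minimal sketch of Minimum Curvilinear (MC) distances and a
# non-centered embedding in the spirit of ncMCE. Assumptions: Euclidean
# input distances; the first singular dimension is skipped as an
# offset. The published algorithm may differ in details.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

def mc_distances(X):
    """MC distance between two samples = path length over the MST
    built on the pairwise Euclidean distances."""
    D = squareform(pdist(X))                  # n x n Euclidean distances
    mst = minimum_spanning_tree(D)            # sparse MST with n-1 edges
    return shortest_path(mst, directed=False) # greedy-routing distances over the tree

def ncmce(X, dim=2):
    """Non-centered embedding: SVD of the MC distance matrix itself
    (no double centering, unlike classical MDS)."""
    U, s, Vt = np.linalg.svd(mc_distances(X))
    coords = np.sqrt(s)[:, None] * Vt         # singular dimensions as coordinates
    return coords[1:dim + 1].T                # skip dimension 1 (non-centered offset)
```

Applied to a samples x features matrix X, ncmce(X) returns n x 2 coordinates analogous to the first two MCE dimensions.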

Theory: Minimum Curvilinear theory.
- Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes. C.V. Cannistraci, T. Ravasi, F.M. Montevecchi, T. Ideker, M. Alessio. Bioinformatics 2010, 26(18), i531-i539.
- SVD-based version of Minimum Curvilinear Embedding: Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding. C.V. Cannistraci, G. Alanis-Lobato, T. Ravasi. Bioinformatics 2013, 29(13), i199-i209.

Examples of applications: comparison of PCA versus MCE.
- Highlighting nonlinear patterns in population genetics datasets. G. Alanis-Lobato, C.V. Cannistraci, A. Eriksson, A. Manica, T. Ravasi. Scientific Reports 2015, 5.
- Gender, Contraceptives and Individual Metabolic Predisposition Shape a Healthy Plasma Lipidome. S. Sales, J. Graessler, S. Ciucci, R. Al-Atrib, T. Vihervaara, K. Schuhmann, ..., Carlo V. Cannistraci and Andrej Shevchenko. Scientific Reports 2016, 6.

Background: (metagen)omic data are complex and often characterized by nonlinearity. Can nonlinearity be detected by linear dimensionality reduction techniques? Hardly. [Image: http://da.cira.colostate.edu/sites/default/files/theme_img/nonlinearity.png]

Our case study: an instance where PCA and PCoA failed to detect data structure.
- Human gastric biopsies from dyspeptic patients, either subjected to therapy with Proton Pump Inhibitors or untreated
- 24 samples: 12 treated with Proton Pump Inhibitors (PPI), 12 untreated
- 5 positive to Helicobacter pylori

Biological question: does PPI treatment affect the gastric microbiota?
Computational question: are linear multivariate techniques sufficient to detect patterns in complex data?

Methods. Application of linear multivariate techniques:
- PCA
- PCoA, that is, classical multidimensional scaling (cmds), on the Bray-Curtis distance (also on the UniFrac distance, but the results are the same)
Application of nonlinear machine learning:
- Non-centered MCE (ncMCE)
- Isometric Feature Mapping (Isomap)
- Laplacian Eigenmaps (LE)
A sketch of the linear pipeline follows.
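A hedged sketch of the linear methods above: PCA on a toy abundance matrix, and PCoA (classical MDS) on Bray-Curtis distances. The toy data, the relative-abundance normalization, and the variable names are illustrative assumptions; the study's actual preprocessing is not specified here.

```python
# Sketch of the linear methods: PCA, and PCoA (classical MDS) on
# Bray-Curtis distances. Toy count data; real preprocessing may differ.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(24, 200)).astype(float)  # 24 samples x 200 taxa (toy)
rel = counts / counts.sum(axis=1, keepdims=True)       # relative abundances (assumption)

pca_2d = PCA(n_components=2).fit_transform(rel)        # linear PCA

def pcoa(D, dim=2):
    """Classical MDS (cmds): double-center the squared distance
    matrix and take the top eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                        # Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]                    # largest eigenvalues first
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

D_bc = squareform(pdist(rel, metric='braycurtis'))     # Bray-Curtis distance matrix
pcoa_2d = pcoa(D_bc, dim=2)
```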

Methods: focus on MCE (Cannistraci et al., Bioinformatics 2010, 26(18): i531-i539). MCE:
- is a parameter-free nonlinear machine-learning method for dimension reduction
- estimates nonlinear sample distances by the Minimum Spanning Tree
- was designed for small sample-size datasets

Methods: MCE has already proved successful in detecting patterns in the bacterial metagenomes of sponges. [Figure: MCE versus PCoA embeddings; Bayer et al., FEMS Microbiology Ecology 2014, 90(3): 832-843.]

Results. [Figure: 2D embeddings of the gastric samples by PCA and cmds.]

Results. [Figure: 2D embeddings of the gastric samples by Isomap and Laplacian Eigenmaps.]

Results: non-centered MCE detects 3 groups:
- untreated H. pylori negative samples
- treated samples
- untreated H. pylori positive samples

Question: why does ncMCE find a more complex data structure than PCA, MDS and the other nonlinear machine-learning methods? We look for an answer by applying these techniques to a known nonlinear structure, the Swiss roll (see the sketch below).
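A hedged sketch of this sanity check: generate a Swiss roll and embed it with the methods compared in the slides (ncMCE via the earlier mc_distances/ncmce sketch). The sample size and neighborhood parameters are illustrative guesses.

```python
# Swiss-roll sanity check: embed a known nonlinear manifold with the
# compared methods. Parameters are illustrative, not from the slides.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, SpectralEmbedding

X, t = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

pca_2d = PCA(n_components=2).fit_transform(X)
iso_2d = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
le_2d = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)  # Laplacian Eigenmaps
# ncmce(X) from the earlier sketch gives the MC-based embedding for comparison.
```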

Results. [Figure: Swiss roll embedded by PCA and by cmds on the Bray-Curtis distance.]

Results. [Figure: Swiss roll embedded by MCE, Isomap and LE.]

ncMCE versus PCA and MDS. The problem is nonlinearity: linear techniques such as PCA and classical MDS cannot detect the complex structure hidden in our dataset. They can detect only two groups at a time, but cannot resolve the differences among the three groups at once, because of the confounding effect of the presence/absence of treatment and of Helicobacter pylori.

ncMCE versus Isomap and LE. The problem is sparsity: Isomap and LE perform well on the Swiss roll; as a matter of fact, they were designed for the dimension reduction of nonlinear structures. Nevertheless, they do not perform well on the real metagenomic dataset because of its sparsity: the data points are not dense enough, since there is typically an inflation of zeros in these omic data.

Biological question: does PPI treatment affect the gastric microbiota? YES.
Computational question: are linear multivariate techniques sufficient to detect patterns in complex data? NO.

Conclusions: PPI treatment modifies the gastric microbiota. Linear multivariate techniques such as PCA and MDS are not sufficient to discover nonlinear structure; therefore, ncMCE and other nonlinear techniques should also be used to complement data exploration.

A further step: the ncMCE-derived discriminative network.

Carlo Vittorio Cannistraci, Biomedical Cybernetics Group. WEB: https://sites.google.com/site/carlovittoriocannistraci/home and http://www.biotec.tu-dresden.de/research/cannistraci/ EMAIL: kalokagathos.agon@gmail.com