Compositional data methods for microbiome studies

1 Compositional data methods for microbiome studies
M. Luz Calle, Dept. of Biosciences, UVic-UCC

2 Important role of the microbiome in human health

3 Microbiome and HIV
How the gut microbiome affects immune reconstitution, HIV-1 replication and chronic inflammation in HIV-1 infected individuals.
Dynamics of the microbiome and the inflammatory response after HIV infection.
How the human microbiome can influence the AIDS vaccine response.

4 Outline
1. Why is a new algorithm for microbiome analysis needed?
2. Presentation of "SelBal: Selection of Balances", a new algorithm for microbiome differential abundance testing

5 Microbiome study

6 OTU: Operational Taxonomic Unit and taxonomy assignment
Sequences that are highly similar (e.g. 97%) are clustered together into OTUs, which are used in place of microbial species.
[Figure: sequences clustered into OTU1, OTU2, OTU3 and OTU4]

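7 OTU table or abundance table
Rows are samples and columns are OTUs, which carry the taxonomic labels Taxon1, Taxon2, ..., TaxonM. X_ij is the abundance of OTU j in sample i and N_i is the total count of sample i.

          OTU1   OTU2   OTU3   ...   OTUk   TOTAL
Sample1   X_11   X_12   X_13   ...   X_1k   N_1
Sample2   X_21   X_22   X_23   ...   X_2k   N_2
...
Samplep   X_p1   X_p2   X_p3   ...   X_pk   N_p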
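A minimal sketch of how a count table with this layout might be converted to relative abundances by dividing each row by its total N_i. The counts and the helper name `to_relative_abundance` are hypothetical and only serve as an illustration; they are not taken from the slides.

```python
def to_relative_abundance(counts):
    """Divide every sample's counts by that sample's total N_i."""
    proportions = []
    for row in counts:
        total = sum(row)  # N_i: total number of counts in the sample
        if total == 0:
            raise ValueError("sample with zero total count")
        proportions.append([x / total for x in row])
    return proportions


# Hypothetical OTU table: 2 samples (rows) x 4 OTUs (columns).
otu_table = [
    [10, 30, 40, 20],      # Sample1, N_1 = 100
    [100, 300, 400, 200],  # Sample2, N_2 = 1000 (ten times deeper)
]
rel = to_relative_abundance(otu_table)

for name, row in zip(("Sample1", "Sample2"), rel):
    print(name, [round(p, 3) for p in row])
```

In this made-up table the two samples differ tenfold in total count but share the same proportions, so only the relative information is comparable between them.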

8 Microbiome differential abundance testing
Multivariate analysis: are there global differences in microbial composition between sample groups? Adonis (PERMANOVA).
Univariate testing: which taxa are differentially abundant between sample groups? Wilcoxon, DESeq2, edgeR, ...

9-12 Microbiome compositional data
Microbiome data are compositional: a change in the abundance of one taxon induces changes in the observed abundances of the other taxa.
Significant results:
Wilcoxon: "TAXA_1" "TAXA_2" "TAXA_3" "TAXA_4" "TAXA_5"
DESeq2: "TAXA_1" "TAXA_2" "TAXA_3" "TAXA_4" "TAXA_5"
edgeR: "TAXA_1" "TAXA_2" "TAXA_3" "TAXA_4" "TAXA_5"
When univariate tests are applied to compositional data, many of the significant findings are false positives.

13-17 Microbiome compositional data
If the relative abundance of taxon 1 changes from $\pi_1$ to $\pi_1'$, the observed relative abundances of the other taxa change by a constant factor $F$:
$$\pi_j' = \pi_j F, \qquad F = \frac{1 - \pi_1'}{1 - \pi_1}, \qquad j \neq 1,$$
or, equivalently, by a constant shift $S$ in log-relative abundances:
$$\log(\pi_j') = \log(\pi_j) + S, \qquad S = \log\left(\frac{1 - \pi_1'}{1 - \pi_1}\right), \qquad j \neq 1.$$
In the toy example: $\pi_1 = 0.2 \to \pi_1' = 0.6$, so $F = (1 - \pi_1')/(1 - \pi_1) = 1/2$.
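The constant factor and the constant log shift can be checked numerically. The sketch below uses the toy values quoted on the log-ratio slide (taxon 1 going from 0.2 to 0.6, the other taxa from 0.2 to 0.1). The choice of five taxa and the helper name `perturb_first_taxon` are assumptions made for illustration.

```python
import math


def perturb_first_taxon(pi, new_pi1):
    """Set taxon 1 to new_pi1 and rescale the other taxa so the total stays 1."""
    factor = (1.0 - new_pi1) / (1.0 - pi[0])  # F = (1 - pi_1') / (1 - pi_1)
    new_pi = [new_pi1] + [p * factor for p in pi[1:]]
    return new_pi, factor


pi_before = [0.2, 0.2, 0.2, 0.2, 0.2]  # assumed: five equally abundant taxa
pi_after, F = perturb_first_taxon(pi_before, 0.6)
S = math.log(F)  # constant shift in log-relative abundance

print("F =", F)
print("S =", round(S, 4))
print("after:", [round(p, 3) for p in pi_after])
```

Only taxon 1 changed in this construction, yet every other taxon is observed at half its previous relative abundance, which is how univariate tests can end up flagging all of them.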

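18 Microbiome compositional data
H. pylori example: relative abundance before ($\pi_1$) and after ($\pi_1'$) [values missing in the transcription].
Shift in log-relative abundance: $S = \log(1 - \pi_1') - \log(1 - \pi_1)$ [reported value: 4]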

19 Compositional data: log-ratio analysis
Let $X = (X_1, X_2, \ldots, X_k)$ be a composition of microbiome abundances.
CODA: analyze log-ratios between taxa, since $\log(X_i/X_j) = \log(\pi_i/\pi_j)$.
Toy example (groups A and B): only the log-ratios that involve taxon 1 differ.
$$\log(X_1^A/X_2^A) \neq \log(X_1^B/X_2^B): \quad \log(0.2/0.2) \neq \log(0.6/0.1), \ \ldots$$
$$\log(X_2^A/X_3^A) = \log(X_2^B/X_3^B): \quad \log(0.2/0.2) = \log(0.1/0.1), \ \ldots$$
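As a rough check of the toy example, the sketch below computes every pairwise log-ratio in both groups and lists the pairs whose value differs. The compositions use the toy values 0.2, 0.6 and 0.1 from the slide; the number of taxa (five) and the function name `pairwise_log_ratios` are assumptions.

```python
import math
from itertools import combinations

group_a = [0.2, 0.2, 0.2, 0.2, 0.2]  # composition in group A
group_b = [0.6, 0.1, 0.1, 0.1, 0.1]  # composition in group B (only taxon 1 changed)


def pairwise_log_ratios(x):
    """Return {(i, j): log(x_i / x_j)} for all i < j, with 1-based taxon labels."""
    return {
        (i + 1, j + 1): math.log(x[i] / x[j])
        for i, j in combinations(range(len(x)), 2)
    }


lr_a = pairwise_log_ratios(group_a)
lr_b = pairwise_log_ratios(group_b)

# Pairs whose log-ratio differs between the two groups.
changed = sorted(
    pair for pair in lr_a
    if not math.isclose(lr_a[pair], lr_b[pair], abs_tol=1e-12)
)
print(changed)
```

Every pair in `changed` contains taxon 1, while the log-ratios among taxa 2 to 5 are identical in both groups.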

21 Compositional balances: a new perspective for microbiome analysis
Javier Rivera, PhD thesis
Let $X = (X_1, X_2, \ldots, X_k)$ be a composition of microbiome abundances.
Instead of individual abundances, we analyze relative abundances between groups of taxa: compositional balances. They extend the concept of the log-ratio between two taxa, $\log(X_i/X_j) = \log(\pi_i/\pi_j)$.
Let $X_+$ and $X_-$ be two disjoint subsets of the components of $X$, indexed by $I_+$ and $I_-$ and containing $k_+$ and $k_-$ components, respectively. The balance between $X_+$ and $X_-$ is defined as:
$$B = \sqrt{\frac{k_+ k_-}{k_+ + k_-}} \, \log\left(\frac{\left(\prod_{i \in I_+} X_i\right)^{1/k_+}}{\left(\prod_{j \in I_-} X_j\right)^{1/k_-}}\right) \propto \frac{1}{k_+} \sum_{i \in I_+} \log X_i - \frac{1}{k_-} \sum_{j \in I_-} \log X_j$$
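A small sketch of the balance as a function, written as the difference between the mean log-abundances of the two groups, scaled by the normalising constant sqrt(k+ k- / (k+ + k-)). The function name `balance`, the 0-based index lists and the example composition are illustrative assumptions.

```python
import math


def balance(x, idx_plus, idx_minus):
    """Balance between the parts of x indexed by idx_plus and idx_minus (0-based)."""
    if not idx_plus or not idx_minus:
        raise ValueError("both groups must contain at least one part")
    if set(idx_plus) & set(idx_minus):
        raise ValueError("the two groups must be disjoint")
    k_plus, k_minus = len(idx_plus), len(idx_minus)
    mean_log_plus = sum(math.log(x[i]) for i in idx_plus) / k_plus
    mean_log_minus = sum(math.log(x[j]) for j in idx_minus) / k_minus
    scale = math.sqrt(k_plus * k_minus / (k_plus + k_minus))
    return scale * (mean_log_plus - mean_log_minus)


x = [0.6, 0.1, 0.1, 0.1, 0.1]
b_pair = balance(x, [0], [1])         # two parts: log(x_1 / x_2) / sqrt(2)
b_group = balance(x, [0, 1], [2, 3])  # two parts against two parts

# Multiplying the whole sample by a constant (e.g. a deeper sequencing run)
# leaves the balance unchanged.
b_scaled = balance([1000 * v for v in x], [0, 1], [2, 3])
print(round(b_pair, 4), round(b_group, 4), round(b_scaled, 4))
```

Because the balance depends only on ratios, it takes the same value on counts and on proportions. All parts must be strictly positive, so zeros need to be replaced beforehand.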

22 Selbal: an algorithm for selection of balances
$Y$: response variable, numeric or dichotomous
$X = (X_1, X_2, \ldots, X_k)$: composition
$Z = (Z_1, Z_2, \ldots, Z_r)$: covariates
Goal: to determine the sub-compositions $X_+$ and $X_-$ so that the balance $B$ between $X_+$ and $X_-$ is highly associated with $Y$ after adjustment for $Z$.
For a continuous variable $Y$: $Y = \beta_0 + \beta_1 B + \gamma Z$
For a dichotomous variable $Y$: $\mathrm{logit}(P(Y = 1)) = \beta_0 + \beta_1 B + \gamma Z$

23 Selbal: an algorithm for selection of balances
STEP 0: Zero replacement.
STEP 1: Optimal balance between two components, $B^{(1)}$. The algorithm exhaustively evaluates all possible balances between two components:
$$B = \frac{1}{\sqrt{2}} \left(\log(X_i) - \log(X_j)\right), \quad i, j \in \{1, \ldots, k\},\ i \neq j.$$
STEP s: Optimal balance after adding a new component. For $s > 1$ and given $B^{(s-1)}$, the algorithm evaluates the optimization criterion of the balance that is obtained by adding $\log(X_p)$ to $B^{(s-1)}$, for each variable $X_p$ that has not been included previously.

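24
$$B^{(s-1)} \propto M_+^{(s-1)} - M_-^{(s-1)} = \frac{1}{k_+^{(s-1)}} \sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)}} \sum_{j \in I_-^{(s-1)}} \log(X_j)$$
$$B^{(s+)} = \sqrt{\frac{(k_+^{(s-1)} + 1)\, k_-^{(s-1)}}{k_+^{(s-1)} + k_-^{(s-1)} + 1}} \left( \frac{k_+^{(s-1)} M_+^{(s-1)} + \log(X_p)}{k_+^{(s-1)} + 1} - M_-^{(s-1)} \right),$$
$$B^{(s-)} = \sqrt{\frac{k_+^{(s-1)} (k_-^{(s-1)} + 1)}{k_+^{(s-1)} + k_-^{(s-1)} + 1}} \left( M_+^{(s-1)} - \frac{k_-^{(s-1)} M_-^{(s-1)} + \log(X_p)}{k_-^{(s-1)} + 1} \right),$$
The algorithm selects the $B^{(s)}$ that maximizes the optimization criterion ($R^2$, AUC).
STOP criterion: cross-validation.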
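The forward-selection idea of STEP 1 and STEP s can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the selbal implementation. It handles a continuous response only, uses the squared Pearson correlation between balance and response as the criterion (the R^2 of a simple linear fit without covariates Z), assumes zeros have already been replaced, and stops when no candidate improves the criterion or when `max_parts` is reached, instead of using cross-validation. All names and the data are hypothetical.

```python
import math


def r_squared(u, v):
    """Squared Pearson correlation, i.e. R^2 of a simple linear fit of v on u."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return 0.0 if s_uu == 0 or s_vv == 0 else s_uv ** 2 / (s_uu * s_vv)


def balance_values(X, plus, minus):
    """Balance of every sample (row of X) for the index groups plus and minus."""
    k_p, k_m = len(plus), len(minus)
    scale = math.sqrt(k_p * k_m / (k_p + k_m))
    return [
        scale * (sum(math.log(row[i]) for i in plus) / k_p
                 - sum(math.log(row[j]) for j in minus) / k_m)
        for row in X
    ]


def forward_balance(X, y, max_parts=4):
    """Greedy search for a balance associated with y (simplified sketch)."""
    k = len(X[0])
    # STEP 1: exhaustive search over pairs. R^2 ignores the sign of the
    # balance, so evaluating i < j is enough for this criterion.
    pairs = [([i], [j]) for i in range(k) for j in range(i + 1, k)]
    plus, minus = max(pairs, key=lambda pm: r_squared(balance_values(X, *pm), y))
    best = r_squared(balance_values(X, plus, minus), y)
    # STEP s: try each unused part in the numerator and in the denominator.
    while len(plus) + len(minus) < min(max_parts, k):
        used = set(plus) | set(minus)
        candidates = []
        for p in range(k):
            if p in used:
                continue
            for cand in ((plus + [p], minus), (plus, minus + [p])):
                candidates.append((r_squared(balance_values(X, *cand), y), cand))
        score, (new_plus, new_minus) = max(candidates, key=lambda c: c[0])
        if score <= best:  # assumed stopping rule; the slides use cross-validation
            break
        plus, minus, best = new_plus, new_minus, score
    return plus, minus, best


# Hypothetical data: 6 samples x 4 parts, response driven by log(x_1 / x_2).
X = [
    [1, 2, 3, 5],
    [2, 1, 4, 1],
    [4, 1, 1, 3],
    [1, 8, 2, 2],
    [3, 3, 5, 4],
    [8, 2, 2, 1],
]
y = [math.log(row[0] / row[1]) for row in X]
plus, minus, score = forward_balance(X, y)
print(plus, minus, round(score, 6))
```

For clarity the sketch recomputes each candidate balance from scratch, whereas the update formulas above allow the candidates to be obtained from the previous group means. For a dichotomous response the criterion would be replaced by the AUC.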

25 Cross-validation: selbal.cv
Goals:
(1) to identify the optimal number of components to be included in the balance;
(2) to explore the robustness of the global balance identified with the whole dataset.

26 Crohn's disease
Ren et al. 2015: 662 patients with Crohn's disease and 313 controls. Abundance data at genus level (48 genera).

28 AUC = and cv-auc =

29 Comparison with other methods

METHOD    Median number of taxa    Mean cv-AUC
selbal
DESeq
edgeR
ANCOM
ALDEx

30 Conclusions
The compositional nature of microbiome data should not be ignored. This applies not only to microbiome abundance but also to gene counts in microbiome functional analysis.
Working with relative abundances among groups of taxa (compositional balances) overcomes the problem of differences in sample size.
The algorithm performs forward selection, which is suboptimal. We are working to develop a new algorithm that finds the optimal balance through penalized regression (LASSO) for compositional data.

31 Javier Rivera, Marc Noguera, Roger Paredes and the MetaHIV group
Vera Pawlowsky-Glahn, Juan José Egozcue, CODA group

32 Effects of "closing" compositional data
"Closing" compositional data (proportions or rarefaction) induces spurious correlation (Pearson 1896): two or more variables will be negatively correlated simply because the data are transformed to have a constant sum.
[Example on slide: a data matrix $x$, with $\mathrm{cor}(x)$ compared to the correlation of the closed data, $\mathrm{cor}(\pi_x)$; the numeric entries are missing in the transcription]
Closing also induces subcompositional incoherence in both correlations and distances.
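The spurious negative correlation induced by closure can be illustrated with simulated data. In the sketch below three parts are drawn independently, so their raw correlation is close to zero, and each sample is then divided by its total. The sample size, the uniform distribution and the seed are arbitrary assumptions and do not reproduce the slide's example.

```python
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / (s_uu * s_vv) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 2000

# Three independent positive parts per sample (no true association).
raw = [[rng.uniform(1.0, 10.0) for _ in range(3)] for _ in range(n_samples)]

# Closure: divide each sample by its total so the parts sum to 1.
closed = [[v / sum(row) for v in row] for row in raw]

r_raw = pearson([r[0] for r in raw], [r[1] for r in raw])
r_closed = pearson([r[0] for r in closed], [r[1] for r in closed])
print("raw:", round(r_raw, 3), "closed:", round(r_closed, 3))
```

With three exchangeable parts the correlation between any two closed parts is expected to be about -1/2, even though the underlying variables were generated independently.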

33 Statistical challenges of microbiome analysis
Sparsity: large proportion of zeros in the OTU table
Multivariate, with complex phylogenetic structure
High dimensional
Compositional data

34 Compositional data
Consider a vector of $K$ positive components or parts, $x = (x_1, x_2, \ldots, x_K)$.
Closed compositional data describe a data set in which the parts in each sample have a constant sum: $\sum_{i=1}^{K} x_i = 1$.
Compositional data describe a data set in which the parts in each sample have an arbitrary or noninformative sum.

35 Microbiome compositional data
Microbiome data are compositional:
o Raw abundances (counts) are not informative: the total number of counts per sample varies widely and is related to the instrument (sampling depth), not to microbiome abundance in the environment.
o Relative abundances (proportions) and rarefaction are used to obtain a closed microbiome composition; this may induce strong incoherences in correlations and distances.


More information

Feature Engineering, Model Evaluations

Feature Engineering, Model Evaluations Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Amplicon Sequencing. Dr. Orla O Sullivan SIRG Research Fellow Teagasc

Amplicon Sequencing. Dr. Orla O Sullivan SIRG Research Fellow Teagasc Amplicon Sequencing Dr. Orla O Sullivan SIRG Research Fellow Teagasc What is Amplicon Sequencing? Sequencing of target genes (are regions of ) obtained by PCR using gene specific primers. Why do we do

More information

Species richness estimation with high diversity but spurious singletons

Species richness estimation with high diversity but spurious singletons Species richness estimation with high diversity but spurious singletons Amy Willis arxiv:604.02598v [stat.me] 9 Apr 206 Informal note from the author The method described in this paper has been available

More information

Characterizing and predicting cyanobacterial blooms in an 8-year

Characterizing and predicting cyanobacterial blooms in an 8-year 1 2 3 4 5 Characterizing and predicting cyanobacterial blooms in an 8-year amplicon sequencing time-course Authors Nicolas Tromas 1*, Nathalie Fortin 2, Larbi Bedrani 1, Yves Terrat 1, Pedro Cardoso 4,

More information

ABTEKNILLINEN KORKEAKOULU

ABTEKNILLINEN KORKEAKOULU Two-way analysis of high-dimensional collinear data 1 Tommi Suvitaival 1 Janne Nikkilä 1,2 Matej Orešič 3 Samuel Kaski 1 1 Department of Information and Computer Science, Helsinki University of Technology,

More information

Regression with Compositional Response. Eva Fišerová

Regression with Compositional Response. Eva Fišerová Regression with Compositional Response Eva Fišerová Palacký University Olomouc Czech Republic LinStat2014, August 24-28, 2014, Linköping joint work with Karel Hron and Sandra Donevska Objectives of the

More information

Backward Genotype-Trait Association. in Case-Control Designs

Backward Genotype-Trait Association. in Case-Control Designs Backward Genotype-Trait Association (BGTA)-Based Dissection of Complex Traits in Case-Control Designs Tian Zheng, Hui Wang and Shaw-Hwa Lo Department of Statistics, Columbia University, New York, New York,

More information

Guarding against Spurious Discoveries in High Dimension. Jianqing Fan

Guarding against Spurious Discoveries in High Dimension. Jianqing Fan in High Dimension Jianqing Fan Princeton University with Wen-Xin Zhou September 30, 2016 Outline 1 Introduction 2 Spurious correlation and random geometry 3 Goodness Of Spurious Fit (GOSF) 4 Asymptotic

More information

A Poisson-multivariate normal hierarchical model for measuring microbial conditional independence networks from metagenomic count data

A Poisson-multivariate normal hierarchical model for measuring microbial conditional independence networks from metagenomic count data A Poisson-multivariate normal hierarchical model for measuring microbial conditional independence networks from metagenomic count data Surojit Biswas 1, Derek S. Lundberg 2, Jeffery L. Dangl 2,3,4,5, Vladimir

More information

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Cinzia Viroli 1 joint with E. Bonafede 1, S. Robin 2 & F. Picard 3 1 Department of Statistical Sciences, University

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Differential Expression Analysis Techniques for Single-Cell RNA-seq Experiments

Differential Expression Analysis Techniques for Single-Cell RNA-seq Experiments Differential Expression Analysis Techniques for Single-Cell RNA-seq Experiments for the Computational Biology Doctoral Seminar (CMPBIO 293), organized by N. Yosef & T. Ashuach, Spring 2018, UC Berkeley

More information

Intelligent Systems Statistical Machine Learning

Intelligent Systems Statistical Machine Learning Intelligent Systems Statistical Machine Learning Carsten Rother, Dmitrij Schlesinger WS2014/2015, Our tasks (recap) The model: two variables are usually present: - the first one is typically discrete k

More information

STATISTICAL LEARNING OF INTEGRATIVE ANALYSIS. Meilei Jiang

STATISTICAL LEARNING OF INTEGRATIVE ANALYSIS. Meilei Jiang STATISTICAL LEARNING OF INTEGRATIVE ANALYSIS Meilei Jiang A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

Phylofactorization - theory and challenges

Phylofactorization - theory and challenges Phylofactorization - theory and challenges Alex D. Washburne 1 1 Montana State University; alex.d.washburne@gmail.com Abstract Data from biological communities are composed of species connected by the

More information

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

Testing for group differences in brain functional connectivity

Testing for group differences in brain functional connectivity Testing for group differences in brain functional connectivity Junghi Kim, Wei Pan, for ADNI Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Banff Feb

More information

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation Patrick J. Heagerty PhD Department of Biostatistics University of Washington 166 ISCB 2010 Session Four Outline Examples

More information

Biologists use a system of classification to organize information about the diversity of living things.

Biologists use a system of classification to organize information about the diversity of living things. Section 1: Biologists use a system of classification to organize information about the diversity of living things. K What I Know W What I Want to Find Out L What I Learned Essential Questions What are

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 21 Model selection Choosing the best model among a collection of models {M 1, M 2..., M N }. What is a good model? 1. fits the data well (model

More information

Introductory compositional data (CoDa)analysis for soil

Introductory compositional data (CoDa)analysis for soil Introductory compositional data (CoDa)analysis for soil 1 scientists Léon E. Parent, department of Soils and Agrifood Engineering Université Laval, Québec 2 Definition (Aitchison, 1986) Compositional data

More information

Sparse and Robust Optimization and Applications

Sparse and Robust Optimization and Applications Sparse and and Statistical Learning Workshop Les Houches, 2013 Robust Laurent El Ghaoui with Mert Pilanci, Anh Pham EECS Dept., UC Berkeley January 7, 2013 1 / 36 Outline Sparse Sparse Sparse Probability

More information

Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures

Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures Isabella Zwiener 1,2 *, Barbara Frisch 2, Harald Binder 2 1 Center for Thrombosis and Hemostasis (CTH), University Medical

More information

Variable Selection for Multivariate Models

Variable Selection for Multivariate Models Variable Selection for Multivariate Models Myth and Reality Kurt VARMUZA Vienna University of Technology Department of Statistics and Probability Theory Laboratory for ChemoMetrics www.lcm.tuwien.ac.at/vk/

More information