Statistical methods for the analysis of microbiome compositional data in HIV studies
1 1/ 56 Statistical methods for the analysis of microbiome compositional data in HIV studies. Javier Rivera Pinto. November 30, 2018
2 Outline: 1 Introduction; 2 Compositional data and microbiome analysis; 3 Kernel machine regression; 4 Identification of microbial signatures; 5 Conclusions
3 Thesis structure (applied and theoretical parts): Introduction; Characteristics of microbiome abundance matrices; Compositional data; Kernel machine regression; Identification of microbial signatures
4 Introduction
5 Human microbiome. Human microbiome: the collection of all the microorganisms living in association with the human body, including archaea, bacteria, and viruses. It accounts for about 1 to 3 percent of total body mass. The human microbiome is involved in a large number of essential functions: food digestion, immune system maintenance,... Alterations in the microbiota have been associated with high-impact diseases: asthma, cardiovascular disease, cancer,...
6 Microbiome and HIV. The gut houses most of the immune cells. Damage to the gut epithelium and bacterial translocation trigger inflammation processes, leading to systemic and chronic inflammation. IrsiCaixa is investigating how we can act on the microbiome to help people living with HIV recover immunity, and to strengthen the immune response of a therapeutic or preventive vaccine.
7 Data extraction. Microbiome studies are based on microbial DNA sequencing through two main approaches: amplicon sequencing and whole-metagenome shotgun DNA sequencing. Amplicon sequencing: sequences a phylogenetic marker gene after polymerase chain reaction (PCR) amplification. Shotgun metagenomic sequencing: sequences the total microbial DNA of a sample.
8 Data extraction (II)
9 Microbiome abundance matrix. The microbiome abundance table is usually expressed as a matrix of counts, denoted by X, with k columns (taxa) and n rows (samples). Each entry x_{ij} of X is the number of sequences (reads) corresponding to taxon j in sample i: X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} \\ x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}, with x_{ij} \in \mathbb{N} \cup \{0\}, i \in \{1,\dots,n\}, j \in \{1,\dots,k\}.
10 Standard microbiome statistical analysis (I). Normalization: the large variability of the total counts per sample is addressed with different types of normalization methods: working with proportions, rarefying counts or using percentile transformations. [Figure: amplification and sequencing errors accumulating from the 1st to the c-th PCR cycle]
11 Standard microbiome statistical analysis (II). Diversity analysis: α- and β-diversity are measured with indices of richness and evenness and with distances between the compositions of samples.
12 Standard microbiome statistical analysis (III). Ordination plots: graphical representations that project the multidimensional data onto two or three orthogonal axes, preserving the main trends of the data.
13 Standard microbiome statistical analysis (IV). Differential abundance testing: evaluation of differences in composition between groups of samples, globally or for particular taxa. It can be analyzed from a multivariate perspective or using univariate tests. Multivariate analysis: tests for global differences: PERMANOVA, ANOSIM, kernel machine regression or the Dirichlet-multinomial distribution. Univariate analysis: tests for differences for a particular taxon: edgeR or DESeq2.
14 Contributions to microbiome-HIV studies
15 Compositional data and microbiome analysis
16 Characteristics of X. Some characteristics of X may pose a problem for the analysis: the high variability of the total number of counts across individuals (normalization); the constraint induced by the maximum number of sequence reads of the DNA sequencer (compositional nature); the presence of a high amount of zeros (zero replacement).
17 Compositional data. A composition is a vector of k strictly positive components or parts, x = (x_1, \dots, x_k), x_i > 0, i \in \{1,\dots,k\}, with a constrained or non-informative total sum \sum_{i=1}^{k} x_i. Properties: each component is not informative by itself; the relevant information is contained in the ratios; two proportional vectors are equally informative.
18 Equivalence class. The simplex is the sample space of compositional data. Two vectors x_1 = (x_{11},\dots,x_{1k}) and x_2 = (x_{21},\dots,x_{2k}) are compositionally equivalent (denoted by =_a) if they are proportional, that is, x_1 =_a x_2 \iff \exists p > 0 : x_1 = p x_2. Each equivalence class has a representative in the unit simplex, defined as S^k = \{ x = (x_1,\dots,x_k) : x_i > 0, \sum_{i=1}^{k} x_i = 1 \}.
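As an illustration (not code from the thesis), the closure operation that sends every compositionally equivalent vector to its unit-simplex representative can be sketched in NumPy; the function name `closure` is ours:

```python
import numpy as np

def closure(x):
    """Project a vector of strictly positive parts onto the unit simplex."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

# Two proportional vectors are compositionally equivalent:
# they map to the same representative in the simplex.
a = np.array([10.0, 20.0, 70.0])
b = 3.5 * a
assert np.allclose(closure(a), closure(b))
```

Because only the ratios carry information, rescaling a composition never changes its simplex representative.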
19 Conditions for a proper analysis of compositions. Permutation invariance: a change in the order of the parts in the composition should not affect the results. Scale invariance: any function f used for the analysis of compositional data must be invariant for any element of the same compositional equivalence class. Subcompositional coherence: results obtained when a subset of components is analyzed should not contradict those obtained when analyzing the whole composition.
20 Issues when compositionality is ignored. Ignoring the compositional nature in microbiome studies may induce: Spurious correlations: the total-sum constraint characterizing compositional data forces some of the correlations between components to be negative. Subcompositional incoherences: results obtained for a subcomposition (a subset of the components) may disagree with those obtained for the whole composition. Increase of type I error: differential abundance testing is highly affected when the compositional nature of microbiome datasets is not acknowledged, presenting an increase of false-positive findings.
21 Aitchison geometry: log-ratio approach. Any meaningful (scale-invariant) function of a composition can be expressed in terms of ratios of its components (Aitchison, 1986). The simplest invariant function is the log-ratio between two components, f(x) = \log(x_i / x_j), i, j \in \{1,\dots,k\}. The generalization of the log-ratio is called a log-contrast, defined as f(x) = \sum_{i=1}^{k} a_i \log(x_i), with \sum_{i=1}^{k} a_i = 0.
22 Aitchison geometry: vector space structure. The perturbation and powering operations give the k-dimensional simplex a vector space structure. Given two compositions x, y \in S^k, the perturbation of x by y is x \oplus y = C(x_1 y_1, x_2 y_2, \dots, x_k y_k), and the power transformation or powering of x by a constant \alpha \in \mathbb{R} is \alpha \odot x = C(x_1^\alpha, x_2^\alpha, \dots, x_k^\alpha). The vector space is Euclidean since a norm and a distance (the Aitchison distance) are defined. The Aitchison distance between compositions x and y is d_a(x, y) = \| x \ominus y \|_a = \sqrt{ \frac{1}{2k} \sum_{i=1}^{k} \sum_{j=1}^{k} \left( \log \frac{x_i}{x_j} - \log \frac{y_i}{y_j} \right)^2 }.
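These operations are short enough to sketch in NumPy (an illustration, not the thesis code; the helper `closure` is ours). A useful sanity check is that the Aitchison distance above coincides with the ordinary Euclidean distance between clr-transformed vectors:

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: componentwise product, re-closed to the simplex."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(alpha, x):
    """Powering: componentwise power, re-closed to the simplex."""
    return closure(np.asarray(x, dtype=float) ** alpha)

def aitchison_dist(x, y):
    """d_a(x, y) = sqrt( (1/(2k)) sum_{i,j} (log(x_i/x_j) - log(y_i/y_j))^2 )."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    k = len(x)
    lx, ly = np.log(x), np.log(y)
    dif = (lx[:, None] - lx[None, :]) - (ly[:, None] - ly[None, :])
    return np.sqrt((dif ** 2).sum() / (2 * k))
```

Note that perturbing x by its powering with −1 returns the neutral element (the uniform composition), exactly as subtraction does in Euclidean space.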
23 Aitchison geometry: coordinate representations. Several data transformations have been proposed in order to work in real space instead of in the simplex: the alr, clr and ilr transformations. Through these transformations, a common procedure for the statistical analysis consists of: formulate the compositional problem in terms of its components; apply the corresponding data transformation; run the appropriate statistical analysis; translate the results back in terms of the initial compositions.
24 Aitchison geometry: alr and clr transformations. Additive log-ratio transformation (alr): x = (x_1,\dots,x_k) \mapsto \left( \log(x_1/x_k), \dots, \log(x_{k-1}/x_k) \right); a reference component has to be selected. Centred log-ratio transformation (clr): x = (x_1,\dots,x_k) \mapsto \left( \log(x_1/g(x)), \dots, \log(x_k/g(x)) \right), where g(x) denotes the geometric mean of x; the new coordinates have a constant sum of 0, which implies a singular covariance matrix.
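A minimal NumPy sketch of both transformations (illustrative; the thesis itself uses the R {compositions} package, and the function names here are ours):

```python
import numpy as np

def alr(x, ref=-1):
    """Additive log-ratio: log of each part over a chosen reference part.

    Returns k-1 coordinates; `ref` selects the reference component
    (by default the last one)."""
    x = np.asarray(x, dtype=float)
    return np.log(np.delete(x, ref) / x[ref])

def clr(x):
    """Centred log-ratio: log of each part over the geometric mean,
    i.e. the centred log-abundances. Coordinates sum to zero."""
    x = np.asarray(x, dtype=float)
    logx = np.log(x)
    return logx - logx.mean()
```

The zero-sum property of the clr coordinates is exactly the source of the singular covariance matrix mentioned above.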
25 Aitchison geometry: ilr transformations. Isometric log-ratio transformation (ilr): representation of a composition in a particular orthonormal basis of S^k. It overcomes the singular-covariance problem of the clr transformation. There are multiple ways of defining an orthonormal basis (sequential binary partition). Each new coordinate is known as a balance. Given two sets of disjoint indices I_+ and I_- with k_+ and k_- components respectively, the associated balance is B = \sqrt{\frac{k_+ k_-}{k_+ + k_-}} \log \frac{ \left( \prod_{i \in I_+} x_i \right)^{1/k_+} }{ \left( \prod_{j \in I_-} x_j \right)^{1/k_-} }.
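A balance between two groups of parts can be computed directly from the formula above; the sketch below (ours, not thesis code) works with log-abundances so the geometric means become arithmetic means:

```python
import numpy as np

def balance(x, i_plus, i_minus):
    """Balance between two disjoint groups of parts:
    sqrt(k+ k- / (k+ + k-)) * log( gmean(x[I+]) / gmean(x[I-]) )."""
    x = np.asarray(x, dtype=float)
    kp, km = len(i_plus), len(i_minus)
    log_gp = np.log(x[i_plus]).mean()   # log geometric mean of the + group
    log_gm = np.log(x[i_minus]).mean()  # log geometric mean of the - group
    return np.sqrt(kp * km / (kp + km)) * (log_gp - log_gm)
```

Being a log-contrast, a balance is scale invariant: multiplying the composition by any constant leaves it unchanged.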
26 Zeros in microbiome data analysis. Depending on the particular study and the method used for measuring the information, we can distinguish three different types of zeros: Count zeros: related to processes that may be compared with a multinomial. Rounded zeros: values below a detection limit (continuous variables). Essential zeros: represent the total absence of the taxon in the sample's environment.
27 Dealing with count zeros. There are two extended ways of replacing zeros in compositional data analysis: Substitution by a pseudocount: replace zeros by 0.65, or add 1 count to the whole matrix. Geometric Bayesian multiplicative replacement: using prior information, all values are replaced so that there are no zeros and the ratios between non-zero components are preserved.
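To illustrate the ratio-preservation idea (this is a simplified multiplicative pseudocount replacement, not the geometric Bayesian multiplicative procedure itself): zeros get a small value delta, and the non-zero parts of the same sample are shrunk by a common factor so the row total, and hence every ratio between non-zero parts, is preserved.

```python
import numpy as np

def multiplicative_replacement(counts, delta=0.65):
    """Replace zeros by delta and rescale the non-zero parts of each row
    multiplicatively, so row totals and ratios between non-zero
    components are preserved. Simplified illustration only."""
    out = np.asarray(counts, dtype=float).copy()
    for row in out:
        zero = row == 0
        if zero.any():
            total = row.sum()
            row[zero] = delta
            row[~zero] *= (total - delta * zero.sum()) / total
    return out
```

Because all non-zero entries in a row are multiplied by the same factor, any ratio x_i/x_j between them is unchanged, which is the property that matters for log-ratio analysis.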
28 Kernel machine regression
29 Overall question. Question: is there any relationship between the microbiome and a response variable of interest? Model-based multivariate methods assume a Dirichlet-multinomial distribution for the abundance matrix: {HMP} package (LaRosa, 2012); Dirichlet-multinomial regression (Chen, 2013). Distance-based multivariate methods are based on a distance matrix D: analysis of similarity (ANOSIM); permutational analysis of variance (PERMANOVA); kernel machine regression.
30 Kernel machine regression: formulation. Kernel machine regression is a semi-parametric regression model that includes a non-parametric component to associate a set of covariates X, for instance microbiome abundances, with a response variable of interest Y: Y_i = \beta_0 + \beta' Z_i + h(X_i) + \epsilon_i for continuous outcomes, and \mathrm{logit}(Y_i) = \beta_0 + \beta' Z_i + h(X_i) for dichotomous outcomes. The non-parametric part measures the relationship between the microbiome composition and the outcome.
31 Kernel machine regression: association test. The association is evaluated with the hypotheses H_0: h(X) = 0 vs. H_1: h(X) \neq 0. Kernel machine regression is a special kind of mixed model where h(X) is a subject-specific random effect, h(X) \sim N(0, \tau K), where K is the kernel matrix defined from the distance matrix D as K = -\frac{1}{2} \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right) D^2 \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right). Thus, the association test can be rewritten as H_0: \tau = 0 vs. H_1: \tau \neq 0.
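The distance-to-kernel step is just double (Gower) centering of the elementwise-squared distance matrix. A NumPy sketch follows; note this shows only the centering, while the actual D2K() function in the R {MiRKAT} package may additionally correct the result to be positive semi-definite:

```python
import numpy as np

def dist_to_kernel(D):
    """Gower-centre a distance matrix into a kernel:
    K = -1/2 (I - 1 1^T / n) D^2 (I - 1 1^T / n),
    with D^2 taken elementwise."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return -0.5 * J @ (D ** 2) @ J
```

For a Euclidean distance matrix this recovers the centred Gram (inner-product) matrix of the underlying points, which is why K can play the role of a covariance in the mixed-model formulation.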
32 Kernel machine regression: compositional data (I). Analogously to standard kernel machine regression, it is defined as Y_i = \beta_0 + \beta' Z_i + h(\mathrm{clr}(X_i)) + \epsilon_i for continuous outcomes and \mathrm{logit}(Y_i) = \beta_0 + \beta' Z_i + h(\mathrm{clr}(X_i)) for dichotomous outcomes, where the association test is H_0: h(\mathrm{clr}(X)) = 0 vs. H_1: h(\mathrm{clr}(X)) \neq 0, and the kernel matrix is defined using the Aitchison distance matrix D_A as K = -\frac{1}{2} \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right) D_A^2 \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right).
33 Kernel machine regression: compositional data (II). The procedure, named MiRKAT-CoDA, is defined by the following steps: 1 Zero replacement. 2 Compute the Aitchison distance matrix: once the count matrix does not contain zeros, the clr() function of the {compositions} package is used to compute the centred log-ratio transformed values; then the Euclidean distance is calculated, yielding the Aitchison distance matrix D_A. 3 Obtain the kernel matrix: the D2K() function in the {MiRKAT} package transforms the distance matrix D_A into the kernel matrix K_A. 4 Implement kernel machine regression: the MiRKAT() function, in the package of the same name, performs the kernel machine regression given the response variable, the kernel matrix and the covariate adjustment.
34 Kernel machine regression: weighted version (I). Weighted Aitchison distance: given two compositions x_1 = (x_{11},\dots,x_{1k}) and x_2 = (x_{21},\dots,x_{2k}) and a vector of weights w = (w_1,\dots,w_k), d_w(x_1, x_2) = \sqrt{ \sum_{i=1}^{k} w_i \left( \log \frac{y_{1i}}{g_w(y_1)} - \log \frac{y_{2i}}{g_w(y_2)} \right)^2 }, where y_1 and y_2 are the initial compositions x_1 and x_2 divided componentwise by w, y_i = x_i / w = (x_{i1}/w_1, \dots, x_{ik}/w_k), and g_w(\cdot) denotes the weighted geometric mean, g_w(y) = \exp\left( \frac{1}{s_w} \sum_{i=1}^{k} w_i \log(y_i) \right), with s_w = \sum_{i=1}^{k} w_i the total sum of the weights.
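The weighted distance translates directly into code. The NumPy sketch below is ours, written from the definitions above; with all weights equal to 1 it reduces to the Euclidean distance between clr vectors, i.e. the unweighted Aitchison distance:

```python
import numpy as np

def weighted_gmean_log(y, w):
    """Logarithm of the weighted geometric mean g_w(y)."""
    return (w * np.log(y)).sum() / w.sum()

def weighted_aitchison_dist(x1, x2, w):
    """d_w(x1, x2) = sqrt( sum_i w_i (log(y1_i/g_w(y1)) - log(y2_i/g_w(y2)))^2 ),
    where y_j = x_j / w componentwise."""
    x1, x2, w = (np.asarray(a, dtype=float) for a in (x1, x2, w))
    y1, y2 = x1 / w, x2 / w
    d1 = np.log(y1) - weighted_gmean_log(y1, w)
    d2 = np.log(y2) - weighted_gmean_log(y2, w)
    return np.sqrt((w * (d1 - d2) ** 2).sum())
```

Shrinking w_i towards 0 downweights taxon i's contribution to the distance, which is exactly the lever used in the weighted MiRKAT-CoDA procedure.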
35 Kernel machine regression: weighted version (II). The procedure to measure the contribution of each part to the association is called weighted MiRKAT-CoDA and is given by: 1 Zero replacement. 2 MiRKAT-CoDA with weights: considering a sequence of weights S = \{s_1,\dots,s_q\}, for each component X_i, i \in \{1,\dots,k\}, and each particular value s_r \in S, the components of w = (w_1,\dots,w_k) are defined as w_j = 1 for j \neq i and w_i = s_r. For each pair of weight and variable, a p-value is obtained after running the MiRKAT() function, yielding a table like the following one:
36 Kernel machine regression: weighted version (III).
           Taxon 1   Taxon 2   ...   Taxon k
w = s_1    p_11      p_12      ...   p_1k
w = s_2    p_21      p_22      ...   p_2k
...
w = s_q    p_q1      p_q2      ...   p_qk
3 Linear regression: the contribution of each taxon is summarized by the slope of the linear regression between the weights S and minus the logarithm of the p-values obtained for each weight. The larger the slope, the larger the contribution of the variable to the global association. 4 Slope ranking: once the contribution of each taxon has been estimated, the taxa can be ranked in decreasing order so that the most important features appear at the top of the list.
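Steps 3 and 4 can be sketched as a small NumPy routine (ours, for illustration): given the q x k table of p-values, fit a least-squares line of -log10(p) against the weight sequence for each taxon, then sort taxa by slope.

```python
import numpy as np

def rank_taxa_by_slope(weights, pvals, taxa):
    """For each taxon (column of pvals), fit a least-squares line of
    -log10(p-value) against the weight sequence and rank taxa by slope
    in decreasing order."""
    s = np.asarray(weights, dtype=float)
    y = -np.log10(np.asarray(pvals, dtype=float))  # shape (len(s), n_taxa)
    s_c = s - s.mean()
    # Per-column least-squares slope: sum((s - s_bar)(y - y_bar)) / sum((s - s_bar)^2)
    slopes = (s_c @ (y - y.mean(axis=0))) / (s_c ** 2).sum()
    order = np.argsort(-slopes)
    return [(taxa[i], slopes[i]) for i in order]
```

A taxon whose p-value drops sharply as its weight grows gets a large slope and lands at the top of the ranking; a taxon whose p-value is insensitive to its weight gets a slope near zero.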
37 Application to the microbiome-HIV association (I). Data information: 156 subjects (127 HIV-positive and 29 HIV-negative individuals); microbial information for 60 different genera; the MSM variable (men who have sex with men) as a possible confounding variable. MiRKAT-CoDA: running this algorithm on the HIV study yields a p-value of … if we do not adjust by MSM, and a p-value of … after adjusting for sexual practice.
38 Application to the microbiome-HIV association (II). Weighted MiRKAT-CoDA. [Figure: taxa contribution, −log10(p-value) against weights with slope size per genus; taxa shown include g_RC9_gut_group, g_Bacteroides, f_Erysipelotrichaceae_g_unclassified, f_Ruminococcaceae_g_Incertae_Sedis, g_Succinivibrio, g_Oribacterium, f_vadinBB60_g_unclassified, g_Phascolarctobacterium, g_Roseburia, g_Anaerostipes]
39 Discussion. Contributions: MiRKAT-CoDA, kernel machine regression using the Aitchison distance, avoids the possible incoherences that result from distance measures that are not subcompositionally dominant. The weighted version of MiRKAT-CoDA allows ranking the different taxa according to their importance in the global association with the outcome. Limitations: the default set of weights S may be uninformative. If the reference global p-value (when no weights are considered) is very small or zero, the weighting method is not very informative, since changing the weight of just one taxon can hardly modify the global p-value, which remains equal to zero in most cases.
40 Identification of microbial signatures
41 Specific question. Question: which specific taxa are associated with the outcome? When the response variable is dichotomous, the question is known as differential abundance analysis and can be addressed in different ways: Based on RNA-seq analysis: {edgeR} package (Robinson, 2010); {DESeq2} package (Anders, 2014). Based on compositional data analysis: {ALDEx2} package (Fernandes, 2013); {ANCOM} package (Mandal, 2015).
42 Microbial signature. Microbial signature: groups of microbial taxa that are predictive of a phenotype of interest. Microbial signatures are useful for diagnosis, prognosis or prediction of therapeutic responses.
43 selbal. Model: selbal is a model selection procedure that searches for a sparse model that adequately explains the response variable of interest. Goal: given a numeric or dichotomous response variable Y, a composition X = (X_1,\dots,X_k) and additional covariates Z = (Z_1,\dots,Z_r), it determines two disjoint subcompositions of X, X_+ and X_-, whose balance B(X_+, X_-) is highly associated with Y after adjustment for the covariates Z: B(X_+, X_-) = \sqrt{\frac{k_+ k_-}{k_+ + k_-}} \log \frac{ \left( \prod_{i \in I_+} X_i \right)^{1/k_+} }{ \left( \prod_{j \in I_-} X_j \right)^{1/k_-} } \propto \frac{1}{k_+} \sum_{i \in I_+} \log X_i - \frac{1}{k_-} \sum_{j \in I_-} \log X_j.
44 selbal: algorithm (I). selbal looks for the best balance following these steps: 1 Zero replacement. 2 Optimal balance between two components: exhaustive evaluation of all the possible balances composed of only two components, that is, all balances of the form B_{ij} = B(X_i, X_j) = \frac{1}{\sqrt{2}} \left( \log(X_i) - \log(X_j) \right) for i, j \in \{1,\dots,k\}, i \neq j. Depending on the class of the response variable, each balance B_{ij} is tested for association with Y with Y = \beta_0 + \beta_1 B_{ij} + \gamma' Z for continuous responses or \mathrm{logit}(Y) = \beta_0 + \beta_1 B_{ij} + \gamma' Z for dichotomous responses. The balance that maximizes the optimization criterion is selected and denoted by B^{(1)}.
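The exhaustive pairwise step can be sketched for the continuous-response, no-covariates case (an illustration of the idea, not the selbal R implementation; selbal scores continuous responses by the MSE of the fit, which is what the sketch minimizes):

```python
import numpy as np
from itertools import combinations

def best_pairwise_balance(logX, y):
    """Evaluate every two-part balance B_ij = (log X_i - log X_j)/sqrt(2)
    in a simple linear model y ~ B_ij and return the pair (i, j) whose
    fit has the smallest mean squared error."""
    n, k = logX.shape
    best = None
    for i, j in combinations(range(k), 2):
        b = (logX[:, i] - logX[:, j]) / np.sqrt(2)
        A = np.column_stack([np.ones(n), b])          # intercept + balance
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
        mse = ((y - A @ coef) ** 2).mean()
        if best is None or mse < best[0]:
            best = (mse, (i, j))
    return best[1]
```

This is an O(k^2) scan; it is the only exhaustive step of the algorithm, since later steps grow the balance greedily.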
45 selbal: algorithm (II). 3 Optimal balance adding a new component: for s > 1 and until the stop criterion is fulfilled, let B^{(s-1)} be the balance defined in the previous step (s-1), given by B^{(s-1)} \propto \frac{1}{k_+^{(s-1)}} \sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)}} \sum_{j \in I_-^{(s-1)}} \log(X_j), where I_+^{(s-1)} and I_-^{(s-1)} are two disjoint subsets of indices in \{1,\dots,k\} with k_+^{(s-1)} and k_-^{(s-1)} elements, respectively. For each of the remaining variables X_p not yet included in the balance, p \notin I_+^{(s-1)} \cup I_-^{(s-1)}, the algorithm considers the balance obtained by adding \log(X_p) to the positive part of B^{(s-1)} or to its negative part:
46 selbal: algorithm (III). B_p^{(s+)} \propto \frac{1}{k_+^{(s-1)} + 1} \left( \sum_{i \in I_+^{(s-1)}} \log(X_i) + \log(X_p) \right) - \frac{1}{k_-^{(s-1)}} \sum_{j \in I_-^{(s-1)}} \log(X_j), \quad B_p^{(s-)} \propto \frac{1}{k_+^{(s-1)}} \sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)} + 1} \left( \sum_{j \in I_-^{(s-1)}} \log(X_j) + \log(X_p) \right). Each of these pairs of balances B_p^{(s+)} and B_p^{(s-)}, for each of the remaining variables X_p, is tested for association with the response variable through the corresponding regression model. Finally, the balance that maximizes the optimization criterion defines the new balance B^{(s)} for the s-th step.
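One forward step of this greedy growth can be sketched as follows (continuous response, no covariates, scored by MSE; ours, not the selbal R code; the sqrt(k+ k-/(k+ + k-)) normalisation is dropped because the regression coefficient absorbs it):

```python
import numpy as np

def extended_balance(logX, i_plus, i_minus):
    """Unnormalised balance: mean of logs in I+ minus mean of logs in I-."""
    return logX[:, i_plus].mean(axis=1) - logX[:, i_minus].mean(axis=1)

def forward_step(logX, y, i_plus, i_minus):
    """Try adding each remaining variable to the + side or the - side of
    the current balance and keep the extension whose simple linear fit
    y ~ balance has the smallest mean squared error."""
    n, k = logX.shape
    used = set(i_plus) | set(i_minus)
    best = None
    for p in range(k):
        if p in used:
            continue
        for plus, minus in ((i_plus + [p], i_minus), (i_plus, i_minus + [p])):
            b = extended_balance(logX, plus, minus)
            A = np.column_stack([np.ones(n), b])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            mse = ((y - A @ coef) ** 2).mean()
            if best is None or mse < best[0]:
                best = (mse, plus, minus)
    return best[1], best[2]
```

Iterating this step until the stop criterion (next slide) is met yields the greedy search: at each stage only 2(k − k_+ − k_-) candidate balances are scored, which is why the procedure is fast but not guaranteed to find the globally optimal balance.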
47 selbal: association measure and stop criterion. Association measure: for continuous responses, it is the mean squared error (MSE) of the linear regression model; for dichotomous outcomes, it is the area under the ROC curve. Stop criterion: there are two possible stopping rules: the algorithm stops when the improvement of the optimization criterion is lower than a specified threshold (default equal to 0), or when the specified maximum number of components has been included in the balance (default equal to 20).
48 selbal: cross-validation. A cross-validation procedure is implemented for: defining the optimal number of variables in the balance; measuring the robustness of the result. [Figure: data split into a training dataset, divided into 5 folds, and a testing dataset]
49 Application to the microbiome-Crohn's disease association (I). Data information: 975 subjects (662 patients with Crohn's disease and 313 without any symptom); microbial information for 49 different taxa. [Figure: optimal number of variables, accuracy (AUC) against number of variables]
50 Application to the microbiome-Crohn's disease association (II). Global balance. Numerator: g_Roseburia, o_Clostridiales_g, g_Bacteroides, f_Peptostreptococcaceae_g. Denominator: g_Streptococcus, g_Dialister, g_Adlercreutzia, g_Dorea, g_Oscillospira, o_Lactobacillales_g, g_Aggregatibacter, g_Eggerthella. [Figure: ROC curve (TPR vs. FPR) with its AUC, and boxplot of the balance by Crohn's disease status]
51 Application to the microbiome-Crohn's disease association (III). Cross-validation: selection frequency (%) of each taxon across the cross-validation balances: g_Dialister 100; g_Roseburia 100; o_Clostridiales_g 98; g_Bacteroides 98; g_Dorea 96; o_Lactobacillales_g 94; g_Eggerthella 92; g_Aggregatibacter 92; g_Adlercreutzia 90; f_Peptostreptococcaceae_g 86; g_Streptococcus 76; g_Oscillospira 72; g_Actinomyces 26; g_Blautia 24.
52 selbal against other methods. Results on the Crohn's disease dataset with selbal and other methods, according to the variable selection and model building procedure. [Figure: method comparison, AUC for DESeq, edgeR, ANCOM, ALDEx2 and selbal]
53 Discussion. Contributions: selbal is an alternative to the differential abundance tests available in the literature. The resulting balance has a biological meaning and avoids the type I error problems usually associated with classical differential abundance tests. selbal offers better results than the most widely used differential abundance tests. Limitations: the algorithm does not cover all the possible balances defined from a set of k taxa, so the result can be suboptimal.
54 Conclusions
55 What we have learned about microbiome and HIV infection. The chronic inflammation derived from HIV infection is responsible for an increased risk of non-AIDS-related diseases and premature aging. Previous results indicating a clear shift from Bacteroides to Prevotella in HIV-1 infection should be revised, accounting for possible confounders such as HIV risk factors, exercise or diet. Patients who spontaneously maintain sustained control of HIV, elite controllers (EC), have a microbiota different from that of individuals with progressive infection and more similar to that of HIV-negative individuals. Though diet is known to have an important effect on gut microbiome composition in healthy individuals, measuring its effects in HIV infection is difficult because of the lack of extensive and reliable information at this level.
56 New approaches for the analysis of microbiome compositional data. Microbiome abundance data are compositional: the constraint on the total number of reads induces strong dependencies among the abundances of different taxa. The use of standard statistical methods ignoring the compositionality can have important adverse implications. Kernel machine regression combined with the Aitchison distance provides a powerful framework for testing global associations between the microbiome and a response variable of interest. Kernel machine regression combined with the weighted Aitchison distance provides a measure of the contribution of each taxon to the joint microbiome association with the outcome. The search for microbial signatures with selbal is a powerful approach for defining biomarkers to differentiate groups of samples or to identify associations.
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationBacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria
Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria Seminar presentation Pierre Barbera Supervised by:
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationSample Size Estimation for Studies of High-Dimensional Data
Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,
More informationTwo-sample tests of high-dimensional means for compositional data
Biometrika (208, 05,,pp. 5 32 doi: 0.093/biomet/asx060 Printed in Great Britain Advance Access publication 3 November 207 Two-sample tests of high-dimensional means for compositional data BY YUANPEI CAO
More information5. Discriminant analysis
5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density
More informationExperimental Design and Data Analysis for Biologists
Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationOutline Classes of diversity measures. Species Divergence and the Measurement of Microbial Diversity. How do we describe and compare diversity?
Species Divergence and the Measurement of Microbial Diversity Cathy Lozupone University of Colorado, Boulder. Washington University, St Louis. Outline Classes of diversity measures α vs β diversity Quantitative
More informationMultivariate Statistical Analysis
Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions
More informationStatistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies
University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations --206 Statistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies Yuanpei
More informationA Program for Data Transformations and Kernel Density Estimation
A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationCS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang
CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations
More information[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements
[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements Aasthaa Bansal PhD Pharmaceutical Outcomes Research & Policy Program University of Washington 69 Biomarkers
More informationMachine Learning - MT Clustering
Machine Learning - MT 2016 15. Clustering Varun Kanade University of Oxford November 28, 2016 Announcements No new practical this week All practicals must be signed off in sessions this week Firm Deadline:
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationPackage milineage. October 20, 2017
Type Package Package milineage October 20, 2017 Title Association Tests for Microbial Lineages on a Taxonomic Tree Version 2.0 Date 2017-10-18 Author Zheng-Zheng Tang Maintainer Zheng-Zheng Tang
More informationResampling Methods CAPT David Ruth, USN
Resampling Methods CAPT David Ruth, USN Mathematics Department, United States Naval Academy Science of Test Workshop 05 April 2017 Outline Overview of resampling methods Bootstrapping Cross-validation
More informationClassifier performance evaluation
Classifier performance evaluation Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánu 1580/3, Czech Republic
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationA Least Squares Formulation for Canonical Correlation Analysis
A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationLinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationGenetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig
Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationTABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1
TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 1.1 The Probability Model...1 1.2 Finite Discrete Models with Equally Likely Outcomes...5 1.2.1 Tree Diagrams...6 1.2.2 The Multiplication Principle...8
More informationESL Chap3. Some extensions of lasso
ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied
More informationIntroductory compositional data (CoDa)analysis for soil
Introductory compositional data (CoDa)analysis for soil 1 scientists Léon E. Parent, department of Soils and Agrifood Engineering Université Laval, Québec 2 Definition (Aitchison, 1986) Compositional data
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationLinear Classifiers as Pattern Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More informationPubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH
PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH The First Step: SAMPLE SIZE DETERMINATION THE ULTIMATE GOAL The most important, ultimate step of any of clinical research is to do draw inferences;
More informationSparse Proteomics Analysis (SPA)
Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universität Berlin Winter School on Compressed Sensing December 5, 2015
More informationRobustness of Principal Components
PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.
More informationLECTURE NOTE #3 PROF. ALAN YUILLE
LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.
More information18 Bivariate normal distribution I
8 Bivariate normal distribution I 8 Example Imagine firing arrows at a target Hopefully they will fall close to the target centre As we fire more arrows we find a high density near the centre and fewer
More informationUnconstrained Ordination
Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)
More informationEvaluation. Andrea Passerini Machine Learning. Evaluation
Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationStatistical tests for differential expression in count data (1)
Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image
More informationPrincipal Components Analysis. Sargur Srihari University at Buffalo
Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2
More informationBIG IDEAS. Area of Learning: SCIENCE Life Sciences Grade 11. Learning Standards. Curricular Competencies
Area of Learning: SCIENCE Life Sciences Grade 11 BIG IDEAS Life is a result of interactions at the molecular and cellular levels. Evolution occurs at the population level. Learning Standards Organisms
More informationMulti-state Models: An Overview
Multi-state Models: An Overview Andrew Titman Lancaster University 14 April 2016 Overview Introduction to multi-state modelling Examples of applications Continuously observed processes Intermittently observed
More informationMODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES
MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by
More informationEvaluation requires to define performance measures to be optimized
Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation
More informationFocus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.
Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,
More informationMicrobiota: Its Evolution and Essence. Hsin-Jung Joyce Wu "Microbiota and man: the story about us
Microbiota: Its Evolution and Essence Overview q Define microbiota q Learn the tool q Ecological and evolutionary forces in shaping gut microbiota q Gut microbiota versus free-living microbe communities
More information