Statistical methods for the analysis of microbiome compositional data in HIV studies


1 Statistical methods for the analysis of microbiome compositional data in HIV studies. Javier Rivera Pinto. November 30, 2018. 1/ 56

2 Outline
1. Introduction
2. Compositional data and microbiome analysis
3. Kernel machine regression
4. Identification of microbial signatures
5. Conclusions

3 Thesis structure. Applied part and theoretical part:
- Introduction
- Characteristics of microbiome abundance matrices
- Compositional data
- Kernel machine regression
- Identification of microbial signatures

4 Introduction

5 Human microbiome: the collection of all the microorganisms (archaea, bacteria, and viruses) living in association with the human body. It accounts for about 1 to 3 percent of total body mass. The human microbiome is involved in a large number of essential functions, such as food digestion and immune system maintenance. Alterations in the microbiota have been associated with high-impact diseases: asthma, cardiovascular disease, cancer, and others.

6 Microbiome and HIV. The gut houses most of the body's immune cells; damage to the gut epithelium allows bacterial translocation, which triggers inflammatory processes and leads to systemic and chronic inflammation. IrsiCaixa is investigating how we can act on the microbiome to help people living with HIV recover immunity, and to strengthen the immune response to a therapeutic or preventive vaccine.

7 Data extraction. Microbiome studies are based on microbial DNA sequencing through two main approaches: amplicon sequencing and whole-metagenome shotgun sequencing. Amplicon sequencing: sequences a phylogenetic marker gene after polymerase chain reaction (PCR) amplification. Shotgun metagenomic sequencing: sequences the total microbial DNA of a sample.

8 Data extraction (II)

9 Microbiome abundance matrix. The microbiome abundance table is usually expressed as a matrix of counts, denoted by X, with k columns (taxa) and n rows (samples). Each entry $x_{ij}$ of X is the number of sequences (reads) corresponding to taxon j in sample i:
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \qquad x_{ij} \in \mathbb{N} \cup \{0\},\; i \in \{1,\dots,n\},\; j \in \{1,\dots,k\}$$

10 Standard microbiome statistical analysis (I). Normalization: the large variability of the total counts per sample is addressed with different types of normalization methods: working with proportions, rarefying counts, or using percentile transformations. [Figure: amplification errors and sequencing errors accumulating over the PCR cycles (1st cycle, 2nd cycle, ..., c-th cycle).]

11 Standard microbiome statistical analysis (II). Diversity analysis: α and β diversity are measured with indices of richness and evenness and with distances between the compositions of samples.

12 Standard microbiome statistical analysis (III). Ordination plots: graphical representations that project the multidimensional data onto two or three orthogonal axes, preserving the main trends of the data.

13 Standard microbiome statistical analysis (IV). Differential abundance testing: evaluation of differences in composition between groups of samples, globally or for particular taxa. It can be approached from a multivariate perspective or using univariate tests. Multivariate analysis, testing for global differences: PERMANOVA, ANOSIM, kernel machine regression, or the Dirichlet-multinomial distribution. Univariate analysis, testing for differences in a particular taxon: edgeR or DESeq2.

14 Contributions to microbiome-HIV studies

15 Compositional data and microbiome analysis

16 Characteristics of X. Some characteristics of X may pose a problem for the analysis:
- The high variability of the total number of counts across individuals → normalization
- The constraint induced by the maximum number of sequence reads of the DNA sequencer → compositional nature
- The presence of a high proportion of zeros → zero replacement

17 Compositional data. A composition is a vector of k strictly positive components or parts, $x = (x_1, \dots, x_k)$, $x_i > 0$, $i \in \{1,\dots,k\}$, with a constrained or noninformative total sum $\sum_{i=1}^{k} x_i$. Properties: each component is not informative by itself; the relevant information is contained in the ratios; two proportional vectors are equally informative.

18 Equivalence class. The simplex is the sample space of compositional data. Two vectors $x_1 = (x_{11},\dots,x_{1k})$ and $x_2 = (x_{21},\dots,x_{2k})$ are compositionally equivalent (denoted by $=_a$) if they are proportional, that is, $x_1 =_a x_2 \iff \exists\, p > 0 : x_1 = p\,x_2$. Each equivalence class has a representative in the unit simplex, defined as $S^k = \{x = (x_1,\dots,x_k) : x_i > 0,\ \sum_{i=1}^{k} x_i = 1\}$.
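The closure operation C, which maps every vector of an equivalence class to its representative in the unit simplex, can be sketched as follows (a minimal NumPy illustration; the function name `closure` is ours):

```python
import numpy as np

def closure(x):
    """Closure operation C: rescale a strictly positive vector so that its
    parts sum to 1, giving its representative in the unit simplex S^k."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

# two proportional (compositionally equivalent) vectors share one representative
a = closure([2.0, 4.0, 4.0])
b = closure([1.0, 2.0, 2.0])
```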

19 Conditions for a proper analysis of compositions.
- Permutation invariance: a change in the order of the parts of the composition should not affect the results.
- Scale invariance: any function f used for the analysis of compositional data must take the same value on all elements of the same compositional equivalence class.
- Subcompositional coherence: results obtained when a subset of components is analyzed should not contradict those obtained when analyzing the whole composition.

20 Issues when compositionality is ignored. Ignoring the compositional nature of microbiome data may induce:
- Spurious correlations: the total-sum constraint characterizing compositional data forces some of the correlations between components to be negative.
- Subcompositional incoherences: results obtained for a subcomposition (a subset of the components) may disagree with those obtained for the whole composition.
- Inflation of type I error: differential abundance testing is strongly affected when the compositional nature of microbiome datasets is not acknowledged, presenting an increase in false positive findings.

21 Aitchison geometry: log-ratio approach. Any meaningful (scale-invariant) function of a composition can be expressed in terms of ratios of its components (Aitchison, 1986). The simplest invariant function is the log-ratio between two components, $f(x) = \log(x_i / x_j)$, $i, j \in \{1,\dots,k\}$. The generalization of the log-ratio is called a log-contrast, defined as $f(x) = \sum_{i=1}^{k} a_i \log(x_i)$ with $\sum_{i=1}^{k} a_i = 0$.
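A quick numeric check of the scale invariance of a log-contrast (an illustrative NumPy sketch; the function name is ours):

```python
import numpy as np

def log_contrast(x, a):
    """Log-contrast sum_i a_i*log(x_i); scale-invariant when sum(a) = 0."""
    x, a = np.asarray(x, float), np.asarray(a, float)
    return float(a @ np.log(x))

x = np.array([1.0, 2.0, 4.0])
a = np.array([1.0, -0.5, -0.5])   # coefficients sum to zero
v1 = log_contrast(x, a)
v2 = log_contrast(5.0 * x, a)     # same value for the rescaled composition
```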

22 Aitchison geometry: vector space structure. The perturbation and powering operations give the k-dimensional simplex a vector space structure. Given two compositions $x, y \in S^k$, the perturbation of x by y is $x \oplus y = C(x_1 y_1, x_2 y_2, \dots, x_k y_k)$, and the power transformation (powering) of x by a constant $\alpha \in \mathbb{R}$ is $\alpha \odot x = C(x_1^\alpha, x_2^\alpha, \dots, x_k^\alpha)$. The vector space is Euclidean, since a norm and a distance (the Aitchison distance) are defined. The Aitchison distance between compositions x and y is $d_a(x, y) = \|x \ominus y\|_a = \sqrt{\frac{1}{2k} \sum_{i=1}^{k} \sum_{j=1}^{k} \left( \log\frac{x_i}{x_j} - \log\frac{y_i}{y_j} \right)^2}$.
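The Aitchison distance can be computed directly from the pairwise log-ratios (a NumPy sketch under our own naming; it coincides with the Euclidean distance between the clr-transformed vectors):

```python
import numpy as np

def aitchison_distance(x, y):
    """Aitchison distance:
    sqrt( (1/(2k)) * sum_{i,j} (log(x_i/x_j) - log(y_i/y_j))^2 )."""
    lx, ly = np.log(np.asarray(x, float)), np.log(np.asarray(y, float))
    k = lx.size
    dx = lx[:, None] - lx[None, :]   # all pairwise log-ratios log(x_i/x_j)
    dy = ly[:, None] - ly[None, :]
    return float(np.sqrt(((dx - dy) ** 2).sum() / (2 * k)))

x = [1.0, 2.0, 3.0]
y = [2.0, 2.0, 2.0]
d = aitchison_distance(x, y)
```

Scale invariance follows directly: rescaling x or y leaves every log-ratio, and hence the distance, unchanged.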

23 Aitchison geometry: coordinate representations. Several data transformations have been proposed in order to work in real space instead of in the simplex: the alr, clr and ilr transformations. Through these transformations, a common procedure for the statistical analysis consists of:
1. Formulate the compositional problem in terms of its components
2. Apply the corresponding data transformation
3. Apply the appropriate statistical analysis
4. Translate the results back in terms of the initial compositions

24 Aitchison geometry: alr and clr transformations.
Additive log-ratio transformation (alr): $x = (x_1,\dots,x_k) \mapsto \left( \log\frac{x_1}{x_k}, \dots, \log\frac{x_{k-1}}{x_k} \right)$. A reference component has to be selected.
Centred log-ratio transformation (clr): $x = (x_1,\dots,x_k) \mapsto \left( \log\frac{x_1}{g(x)}, \dots, \log\frac{x_k}{g(x)} \right)$, where $g(x)$ denotes the geometric mean of x. The new coordinates have a constant sum of 0, which implies a singular covariance matrix.
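The alr and clr maps can be written in a few lines (an illustrative NumPy sketch, not the {compositions} package implementation):

```python
import numpy as np

def alr(x, ref=-1):
    """Additive log-ratio: log of the remaining parts over a reference part."""
    x = np.asarray(x, float)
    return np.log(np.delete(x, ref) / x[ref])

def clr(x):
    """Centred log-ratio: log of each part over the geometric mean g(x)."""
    x = np.asarray(x, float)
    return np.log(x / np.exp(np.log(x).mean()))

z = clr([1.0, 2.0, 4.0])   # clr coordinates sum to zero
w = alr([1.0, 2.0, 4.0])   # alr gives k-1 coordinates
```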

25 Aitchison geometry: ilr transformations. Isometric log-ratio transformation (ilr): representation of a composition in a particular orthonormal basis of $S^k$. It overcomes the singular-covariance problem of the clr transformation. There are multiple ways of defining an orthonormal basis (e.g., through a sequential binary partition). Each new coordinate is known as a balance: given two disjoint sets of indices $I_+$ and $I_-$ with $k_+$ and $k_-$ components respectively, the associated balance is $B = \sqrt{\frac{k_+ k_-}{k_+ + k_-}} \log \frac{\left(\prod_{i \in I_+} X_i\right)^{1/k_+}}{\left(\prod_{j \in I_-} X_j\right)^{1/k_-}}$.
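A balance for a given pair of index sets can be computed as follows (our own NumPy sketch of the formula above):

```python
import numpy as np

def balance(x, i_plus, i_minus):
    """ilr balance: sqrt(k+*k-/(k+ + k-)) * log(gmean(x[I+]) / gmean(x[I-]))."""
    x = np.asarray(x, float)
    kp, km = len(i_plus), len(i_minus)
    g_plus = np.exp(np.log(x[i_plus]).mean())    # geometric mean of each group
    g_minus = np.exp(np.log(x[i_minus]).mean())
    return float(np.sqrt(kp * km / (kp + km)) * np.log(g_plus / g_minus))

b = balance([2.0, 2.0, 8.0, 8.0], [0, 1], [2, 3])
```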

26 Zeros in microbiome data analysis. Depending on the particular study and the method used for measuring the information, we can distinguish three different types of zeros:
- Count zeros: related to processes that may be compared with multinomial sampling.
- Rounded zeros: values below a detection limit (continuous variables).
- Essential zeros: represent the total absence of the taxon in the sample's environment.

27 Dealing with count zeros. There are two common approaches for replacing zeros in compositional data analysis:
- Substitution by a pseudocount: replace zeros by 0.65, or add 1 count to the whole matrix.
- Geometric Bayesian multiplicative replacement: using prior information, all zero values are replaced so that the ratios between the non-zero components are preserved.
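The pseudocount substitution described above can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def replace_zeros_pseudocount(counts, value=0.65):
    """Replace every zero count by a small pseudocount (0.65 by default),
    leaving the non-zero counts untouched."""
    out = np.array(counts, dtype=float)   # copy, so the input is not modified
    out[out == 0] = value
    return out

X = np.array([[0, 3], [2, 0]])
R = replace_zeros_pseudocount(X)
```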

28 Kernel machine regression

29 Overall question: is there any relationship between the microbiome and a response variable of interest?
- Model-based multivariate methods assume a Dirichlet-multinomial distribution for the abundance matrix: {HMP} package (LaRosa, 2012); Dirichlet-multinomial regression (Chen, 2013).
- Distance-based multivariate methods are based on a distance matrix D: analysis of similarity (ANOSIM); permutational analysis of variance (PERMANOVA); kernel machine regression.

30 Kernel machine regression: formulation. Kernel machine regression is a semi-parametric regression model that includes a non-parametric component to associate a set of covariates X, for instance microbiome abundances, with a response variable of interest Y:
$Y_i = \beta_0 + \beta' Z_i + h(X_i) + \epsilon_i$ for continuous outcomes;
$\mathrm{logit}(Y_i) = \beta_0 + \beta' Z_i + h(X_i)$ for dichotomous outcomes.
The non-parametric part $h(\cdot)$ measures the relationship between the microbiome composition and the outcome.

31 Kernel machine regression: association test. The association is evaluated through the hypotheses $H_0: h(X) = 0$ versus $H_1: h(X) \neq 0$. Kernel machine regression is a special kind of mixed model where $h(X)$ is a subject-specific random effect, $h(X) \sim N(0, \tau K)$, with the kernel matrix K defined from the distance matrix D as $K = -\frac{1}{2} \left(I - \frac{\mathbf{1}\mathbf{1}^T}{n}\right) D^2 \left(I - \frac{\mathbf{1}\mathbf{1}^T}{n}\right)$, where $D^2$ is the matrix of squared distances. Thus, the association test can be rewritten as $H_0: \tau = 0$ versus $H_1: \tau > 0$.
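The distance-to-kernel step can be sketched in NumPy (a minimal illustration of the same Gower centering that MiRKAT's D2K() performs; the function name `distance_to_kernel` is ours):

```python
import numpy as np

def distance_to_kernel(D):
    """Gower centering of a distance matrix into a kernel:
    K = -1/2 * (I - 11^T/n) D^2 (I - 11^T/n), with D squared elementwise."""
    D = np.asarray(D, float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix I - 11^T/n
    return -0.5 * J @ (D ** 2) @ J

# Euclidean distances between points on a line: K is the centered Gram matrix
p = np.array([0.0, 1.0, 3.0])
D = np.abs(p[:, None] - p[None, :])
K = distance_to_kernel(D)
```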

32 Kernel machine regression: compositional data (I). Similarly to standard kernel machine regression, it is defined as:
$Y_i = \beta_0 + \beta' Z_i + h(\mathrm{clr}(X_i)) + \epsilon_i$ for continuous outcomes;
$\mathrm{logit}(Y_i) = \beta_0 + \beta' Z_i + h(\mathrm{clr}(X_i))$ for dichotomous outcomes.
The association test is $H_0: h(\mathrm{clr}(X)) = 0$ versus $H_1: h(\mathrm{clr}(X)) \neq 0$, with the kernel matrix defined from the Aitchison distance matrix $D_A$ as $K = -\frac{1}{2} \left(I - \frac{\mathbf{1}\mathbf{1}^T}{n}\right) D_A^2 \left(I - \frac{\mathbf{1}\mathbf{1}^T}{n}\right)$.

33 Kernel machine regression: compositional data (II). The procedure, named MiRKAT-CoDA, consists of the following steps:
1. Zero replacement.
2. Compute the Aitchison distance matrix: once the count matrix does not contain zeros, the clr() function of the {compositions} package is used to compute the centred log-ratio transformed values; the Euclidean distance between them yields the Aitchison distance matrix $D_A$.
3. Obtain the kernel matrix: the D2K() function in the {MiRKAT} package transforms the distance matrix $D_A$ into the kernel matrix $K_A$.
4. Run kernel machine regression: the MiRKAT() function, in the package of the same name, performs the kernel machine regression given the response variable, the kernel matrix and the covariate adjustment.
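Steps 1 to 3 of the procedure can be sketched end-to-end in NumPy (an illustration under our own naming; step 4, the kernel machine test itself, is what the R function MiRKAT() performs):

```python
import numpy as np

def mirkat_coda_kernel(counts, pseudocount=0.65):
    """MiRKAT-CoDA steps 1-3: zero replacement, clr transform,
    Aitchison distances, and Gower centering into a kernel matrix."""
    X = np.array(counts, dtype=float)
    X[X == 0] = pseudocount                        # 1. zero replacement
    logX = np.log(X)
    C = logX - logX.mean(axis=1, keepdims=True)    # 2a. clr by sample (row)
    diff = C[:, None, :] - C[None, :, :]           # 2b. Euclidean distance of
    D = np.sqrt((diff ** 2).sum(axis=2))           #     clr rows = Aitchison D_A
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ (D ** 2) @ J                 # 3. D_A -> kernel K_A

K = mirkat_coda_kernel([[1, 2, 0], [4, 4, 4], [1, 0, 8]])
```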

34 Kernel machine regression: weighted version (I). Weighted Aitchison distance: given two compositions $x_1 = (x_{11},\dots,x_{1k})$ and $x_2 = (x_{21},\dots,x_{2k})$, and a vector of weights $w = (w_1,\dots,w_k)$,
$d_w(x_1, x_2) = \sqrt{\sum_{i=1}^{k} w_i \left( \log\frac{y_{1i}}{g_w(y_1)} - \log\frac{y_{2i}}{g_w(y_2)} \right)^2}$,
where $y_1$ and $y_2$ are the initial compositions $x_1$ and $x_2$ divided componentwise by w, $y_i = x_i / w = (x_{i1}/w_1, \dots, x_{ik}/w_k)$, and $g_w(\cdot)$ denotes the weighted geometric mean, $g_w(y) = \exp\left( \frac{1}{s_w} \sum_{i=1}^{k} w_i \log(y_i) \right)$, with $s_w = \sum_{i=1}^{k} w_i$ the total sum of the weights.
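The weighted Aitchison distance above can be coded directly (an illustrative NumPy sketch; with all weights equal to 1 it reduces to the ordinary Aitchison distance):

```python
import numpy as np

def weighted_aitchison_distance(x1, x2, w):
    """Weighted Aitchison distance: divide the compositions by the weights,
    center by the weighted geometric mean, then take a weighted Euclidean
    distance between the resulting log coordinates."""
    x1, x2, w = (np.asarray(v, float) for v in (x1, x2, w))
    y1, y2 = x1 / w, x2 / w
    s_w = w.sum()
    def w_coords(y):
        g_w = np.exp((w * np.log(y)).sum() / s_w)  # weighted geometric mean
        return np.log(y / g_w)
    return float(np.sqrt((w * (w_coords(y1) - w_coords(y2)) ** 2).sum()))

x1 = [1.0, 2.0, 3.0]
x2 = [3.0, 1.0, 2.0]
d = weighted_aitchison_distance(x1, x2, [1.0, 1.0, 1.0])
```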

35 Kernel machine regression: weighted version (II). The procedure to measure the contribution of each part to the association, called weighted MiRKAT-CoDA, is given by:
1. Zero replacement.
2. MiRKAT-CoDA with weights: considering a sequence of weights $S = \{s_1,\dots,s_q\}$, for each component $X_i$, $i \in \{1,\dots,k\}$, and for each particular value $s_r \in S$, the components of $w = (w_1,\dots,w_k)$ are defined as $w_j = 1$ for $j \neq i$, and $w_i = s_r$. For each pair of weight and variable, a p-value is obtained after running the MiRKAT() function, yielding a table like the following one:

36 Kernel machine regression: weighted version (III).
            Taxon 1    Taxon 2    ...   Taxon k
$w = s_1$:  $p_{11}$   $p_{12}$   ...   $p_{1k}$
$w = s_2$:  $p_{21}$   $p_{22}$   ...   $p_{2k}$
...
$w = s_q$:  $p_{q1}$   $p_{q2}$   ...   $p_{qk}$
3. Linear regression: the contribution of each taxon is summarized by the slope of the linear regression between the weights S and minus the logarithm of the p-values obtained for the different weights. The larger the slope, the larger the contribution of the variable to the global association.
4. Slope ranking: once the contribution of each taxon has been estimated, the taxa can be ranked in decreasing order, so that the most important features appear at the top of the list.
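Steps 3 and 4 above (slope estimation and ranking) can be sketched as follows (illustrative NumPy code with hypothetical p-values; the function name is ours):

```python
import numpy as np

def rank_taxa_by_slope(weights, pvals, taxa):
    """Fit -log10(p) against the weight sequence for each taxon (column)
    and rank the taxa by decreasing slope of that regression line."""
    weights = np.asarray(weights, float)
    pvals = np.asarray(pvals, float)
    slopes = [np.polyfit(weights, -np.log10(pvals[:, j]), 1)[0]
              for j in range(pvals.shape[1])]
    order = np.argsort(slopes)[::-1]
    return [(taxa[j], slopes[j]) for j in order]

# hypothetical p-value table: taxon A reacts to the weights, taxon B does not
weights = [1.0, 2.0, 3.0]
pvals = [[0.1, 0.1], [0.01, 0.1], [0.001, 0.1]]
ranked = rank_taxa_by_slope(weights, pvals, ["A", "B"])
```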

37 Application to microbiome-HIV association (I). Data information: 156 subjects (127 HIV-positive and 29 HIV-negative individuals); microbial information for 60 different genera; MSM (men who have sex with men) as a possible confounding variable. MiRKAT-CoDA: running this algorithm on the HIV study yields a p-value of ... if we do not adjust for MSM, and a p-value of ... after adjusting for sexual practice.

38 Application to microbiome-HIV association (II). Weighted MiRKAT-CoDA. [Figure: taxa contribution plot of -log10(p-value) against the weights; the slope for each taxon summarizes its contribution. Taxa shown include g_rc9_gut_group, g_bacteroides, f_erysipelotrichaceae_g_unclassified, f_ruminococcaceae_g_incertae_sedis, g_succinivibrio, g_oribacterium, f_vadinbb60_g_unclassified, g_phascolarctobacterium, g_roseburia and g_anaerostipes.]

39 Discussion. Contributions: MiRKAT-CoDA, kernel machine regression using the Aitchison distance, avoids the possible incoherences produced by distance measures that are not subcompositionally dominant. The weighted version of MiRKAT-CoDA allows ranking the taxa according to their importance in the global association with the outcome. Limitations: the default set of weights S may be uninformative. If the reference global p-value (when no weights are considered) is very small or zero, the weighting method is not very informative, since changing the weight of just one taxon can hardly modify the global p-value, which remains equal to zero in most cases.

40 Identification of microbial signatures

41 Specific question: which specific taxa are associated with the outcome? When the response variable is dichotomous, this question is known as differential abundance analysis and can be addressed in different ways:
- Based on RNA-seq analysis: {edgeR} package (Robinson, 2010); {DESeq2} package (Anders, 2014).
- Based on compositional data analysis: {ALDEx2} package (Fernandes, 2013); {ANCOM} package (Mandal, 2015).

42 Microbial signature: a group of microbial taxa that is predictive of a phenotype of interest. Microbial signatures are useful for diagnosis, prognosis or prediction of therapeutic responses.

43 selbal. Model: selbal is a model selection procedure that searches for a sparse model that adequately explains the response variable of interest. Goal: given a numeric or dichotomous response variable Y, a composition $X = (X_1,\dots,X_k)$ and additional covariates $Z = (Z_1,\dots,Z_r)$, it determines two disjoint subcompositions of X, $X_+$ and $X_-$, whose balance $B(X_+, X_-)$ is highly associated with Y after adjustment for the covariates Z:
$B(X_+, X_-) = \sqrt{\frac{k_+ k_-}{k_+ + k_-}} \log \frac{\left(\prod_{i \in I_+} X_i\right)^{1/k_+}}{\left(\prod_{j \in I_-} X_j\right)^{1/k_-}} \propto \frac{1}{k_+} \sum_{i \in I_+} \log X_i - \frac{1}{k_-} \sum_{j \in I_-} \log X_j$

44 selbal: algorithm (I). selbal looks for the best balance through the following steps:
1. Zero replacement.
2. Optimal balance between two components: exhaustive evaluation of all the possible balances composed of only two components, that is, all balances of the form $B_{ij} = B(X_i, X_j) = \frac{1}{\sqrt{2}} \left( \log(X_i) - \log(X_j) \right)$ for $i, j \in \{1,\dots,k\}$, $i \neq j$. Depending on the class of the response variable, each balance $B_{ij}$ is tested for association with Y with:
$Y = \beta_0 + \beta_1 B_{ij} + \gamma' Z$ for continuous responses;
$\mathrm{logit}(Y) = \beta_0 + \beta_1 B_{ij} + \gamma' Z$ for dichotomous responses.
The balance that maximizes the optimization criterion is selected and denoted by $B^{(1)}$.

45 selbal: algorithm (II).
3. Optimal balance adding a new component: for s > 1, and until the stop criterion is fulfilled, let $B^{(s-1)}$ be the balance defined in the previous step, $B^{(s-1)} \propto \frac{1}{k_+^{(s-1)}} \sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)}} \sum_{j \in I_-^{(s-1)}} \log(X_j)$, where $I_+^{(s-1)}$ and $I_-^{(s-1)}$ are two disjoint subsets of indices in $\{1,\dots,k\}$ with $k_+^{(s-1)}$ and $k_-^{(s-1)}$ elements, respectively. For each of the remaining variables $X_p$ not yet included in the balance, $p \notin I_+^{(s-1)} \cup I_-^{(s-1)}$, the algorithm considers the balances obtained by adding $\log(X_p)$ either to the positive part of $B^{(s-1)}$ or to its negative part:

46 selbal: algorithm (III).
$B_p^{(s+)} \propto \frac{1}{k_+^{(s-1)} + 1} \left( \sum_{i \in I_+^{(s-1)}} \log(X_i) + \log(X_p) \right) - \frac{1}{k_-^{(s-1)}} \sum_{j \in I_-^{(s-1)}} \log(X_j)$
$B_p^{(s-)} \propto \frac{1}{k_+^{(s-1)}} \sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)} + 1} \left( \sum_{j \in I_-^{(s-1)}} \log(X_j) + \log(X_p) \right)$
Each of these pairs of balances $B_p^{(s+)}$ and $B_p^{(s-)}$, for each of the remaining variables $X_p$, is tested for association with the response variable through the corresponding regression model. Finally, the balance that maximizes the optimization criterion defines the new balance $B^{(s)}$ for the s-th step.

47 selbal: association measure and stop criterion. Association measure: for continuous responses, the mean squared error (MSE) of the linear regression model; for dichotomous outcomes, the area under the ROC curve. Stop criterion: there are two possible stopping rules. The algorithm stops when the improvement of the optimization criterion is lower than a specified threshold (default 0), or when the specified maximum number of components has been included in the balance (default 20).
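The greedy search of steps 2-3, together with the improvement-based stop rule, can be sketched for a continuous response with the MSE criterion (a simplified illustration under our own naming; the real selbal also handles dichotomous responses, covariates and cross-validation):

```python
import numpy as np
from itertools import combinations

def fit_mse(b, y):
    """MSE of the simple linear regression of y on the balance b."""
    A = np.column_stack([np.ones_like(b), b])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(((y - A @ coef) ** 2).mean())

def balance(logX, Ip, Im):
    """Normalized balance between the parts indexed by Ip and Im."""
    kp, km = len(Ip), len(Im)
    return (np.sqrt(kp * km / (kp + km)) *
            (logX[:, Ip].mean(axis=1) - logX[:, Im].mean(axis=1)))

def selbal_sketch(X, y, max_vars=4):
    """Greedy selbal-style search: best two-part balance first, then keep
    adding the remaining part (to either side) that most reduces the MSE."""
    logX = np.log(X)
    k = X.shape[1]
    # step 2: exhaustive search over all two-part balances
    mse, Ip, Im = min(((fit_mse(balance(logX, [i], [j]), y), [i], [j])
                       for i, j in combinations(range(k), 2)),
                      key=lambda t: t[0])
    # step 3: add components until no improvement or max_vars is reached
    while len(Ip) + len(Im) < max_vars:
        rest = set(range(k)) - set(Ip) - set(Im)
        cands = [(fit_mse(balance(logX, Ip + [p], Im), y), Ip + [p], Im)
                 for p in rest]
        cands += [(fit_mse(balance(logX, Ip, Im + [p]), y), Ip, Im + [p])
                  for p in rest]
        new_mse, new_Ip, new_Im = min(cands, key=lambda t: t[0])
        if new_mse >= mse:        # stop: no improvement over the threshold 0
            break
        mse, Ip, Im = new_mse, new_Ip, new_Im
    return Ip, Im, mse

# synthetic demo: the response is driven by the balance of taxa 0 and 1
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(80, 5))
y = np.log(X[:, 0]) - np.log(X[:, 1]) + 0.01 * rng.normal(size=80)
Ip, Im, mse = selbal_sketch(X, y)
```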

48 selbal: cross-validation. A cross-validation procedure is implemented for: defining the optimal number of variables in the balance; measuring the robustness of the result. [Figure: training dataset split into Fold 1 to Fold 5, plus a testing dataset.]

49 Application to microbiome-Crohn's disease association (I). Data information: 975 subjects (662 patients with Crohn's disease and 313 without any symptoms); microbial information for 49 different taxa. [Figure: accuracy (AUC) against the number of variables, used to choose the optimal number of variables.]

50 Application to microbiome-Crohn's disease association (II). Global balance. Denominator: g_Streptococcus, g_Dialister, g_Adlercreutzia, g_Dorea, g_Oscillospira, o_Lactobacillales_g, g_Aggregatibacter, g_Eggerthella. Numerator: g_Roseburia, o_Clostridiales_g, g_Bacteroides, f_Peptostreptococcaceae_g. [Figure: ROC curve (TPR against FPR) with its AUC, and the distribution of the balance by Crohn's disease status.]

51 Application to microbiome-Crohn's disease association (III). Cross-validation: frequency (%) with which each taxon is selected across the cross-validation balances:
g_Dialister 100; g_Roseburia 100; o_Clostridiales_g 98; g_Bacteroides 98; g_Dorea 96; o_Lactobacillales_g 94; g_Eggerthella 92; g_Aggregatibacter 92; g_Adlercreutzia 90; f_Peptostreptococcaceae_g 86; g_Streptococcus 76; g_Oscillospira 72; g_Actinomyces 26; g_Blautia 24.

52 selbal against other methods. Results on the Crohn's disease dataset with selbal and other methods, according to the variable selection and model building procedure. [Figure: AUC comparison of DESeq, edgeR, ANCOM, ALDEx2 and selbal.]

53 Discussion. Contributions: selbal is an alternative to the differential abundance tests available in the literature. The resulting balance has a biological interpretation and avoids the type I error problems usually associated with classical differential abundance tests. selbal offers better results than the most widely used differential abundance tests. Limitations: the algorithm does not explore all the possible balances defined from a set of k taxa, so the result can be suboptimal.

54 Conclusions

55 What we have learned about microbiome and HIV infection.
- The chronic inflammation that follows HIV infection is responsible for an increased risk of non-AIDS-related diseases and premature aging.
- Previous results indicating a clear shift from Bacteroides to Prevotella in HIV-1 infection should be revised accounting for possible confounders such as HIV risk factors, exercise or diet.
- Patients who spontaneously maintain sustained control of HIV, elite controllers (EC), have a microbiota different from that of individuals with progressive infection and more similar to that of HIV-negative individuals.
- Although diet is known to have an important effect on gut microbiome composition in healthy individuals, measuring its effects in HIV infection is difficult because of the lack of extensive and reliable information at this level.

56 New approaches for the analysis of microbiome compositional data.
- Microbiome abundance data are compositional: the constraint on the total number of reads induces strong dependencies among the abundances of the different taxa. Using standard statistical methods while ignoring the compositionality can have important adverse implications.
- Kernel machine regression combined with the Aitchison distance provides a powerful framework for testing global associations between the microbiome and a response variable of interest.
- Kernel machine regression combined with the weighted Aitchison distance provides a measure of the contribution of each taxon to the joint microbiome association with the outcome.
- The search for microbial signatures with selbal is a powerful approach for defining biomarkers to differentiate groups of samples or to identify associations.


More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

CSC314 / CSC763 Introduction to Machine Learning

CSC314 / CSC763 Introduction to Machine Learning CSC314 / CSC763 Introduction to Machine Learning COMSATS Institute of Information Technology Dr. Adeel Nawab More on Evaluating Hypotheses/Learning Algorithms Lecture Outline: Review of Confidence Intervals

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Intelligent Systems Statistical Machine Learning

Intelligent Systems Statistical Machine Learning Intelligent Systems Statistical Machine Learning Carsten Rother, Dmitrij Schlesinger WS2015/2016, Our model and tasks The model: two variables are usually present: - the first one is typically discrete

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Methodological Concepts for Source Apportionment

Methodological Concepts for Source Apportionment Methodological Concepts for Source Apportionment Peter Filzmoser Institute of Statistics and Mathematical Methods in Economics Vienna University of Technology UBA Berlin, Germany November 18, 2016 in collaboration

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria

Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria Seminar presentation Pierre Barbera Supervised by:

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Two-sample tests of high-dimensional means for compositional data

Two-sample tests of high-dimensional means for compositional data Biometrika (208, 05,,pp. 5 32 doi: 0.093/biomet/asx060 Printed in Great Britain Advance Access publication 3 November 207 Two-sample tests of high-dimensional means for compositional data BY YUANPEI CAO

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Outline Classes of diversity measures. Species Divergence and the Measurement of Microbial Diversity. How do we describe and compare diversity?

Outline Classes of diversity measures. Species Divergence and the Measurement of Microbial Diversity. How do we describe and compare diversity? Species Divergence and the Measurement of Microbial Diversity Cathy Lozupone University of Colorado, Boulder. Washington University, St Louis. Outline Classes of diversity measures α vs β diversity Quantitative

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions

More information

Statistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies

Statistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations --206 Statistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies Yuanpei

More information

A Program for Data Transformations and Kernel Density Estimation

A Program for Data Transformations and Kernel Density Estimation A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations

More information

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements [Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements Aasthaa Bansal PhD Pharmaceutical Outcomes Research & Policy Program University of Washington 69 Biomarkers

More information

Machine Learning - MT Clustering

Machine Learning - MT Clustering Machine Learning - MT 2016 15. Clustering Varun Kanade University of Oxford November 28, 2016 Announcements No new practical this week All practicals must be signed off in sessions this week Firm Deadline:

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Package milineage. October 20, 2017

Package milineage. October 20, 2017 Type Package Package milineage October 20, 2017 Title Association Tests for Microbial Lineages on a Taxonomic Tree Version 2.0 Date 2017-10-18 Author Zheng-Zheng Tang Maintainer Zheng-Zheng Tang

More information

Resampling Methods CAPT David Ruth, USN

Resampling Methods CAPT David Ruth, USN Resampling Methods CAPT David Ruth, USN Mathematics Department, United States Naval Academy Science of Test Workshop 05 April 2017 Outline Overview of resampling methods Bootstrapping Cross-validation

More information

Classifier performance evaluation

Classifier performance evaluation Classifier performance evaluation Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánu 1580/3, Czech Republic

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

A Least Squares Formulation for Canonical Correlation Analysis

A Least Squares Formulation for Canonical Correlation Analysis A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Linear Dimensionality Reduction

Linear Dimensionality Reduction Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 1.1 The Probability Model...1 1.2 Finite Discrete Models with Equally Likely Outcomes...5 1.2.1 Tree Diagrams...6 1.2.2 The Multiplication Principle...8

More information

ESL Chap3. Some extensions of lasso

ESL Chap3. Some extensions of lasso ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied

More information

Introductory compositional data (CoDa)analysis for soil

Introductory compositional data (CoDa)analysis for soil Introductory compositional data (CoDa)analysis for soil 1 scientists Léon E. Parent, department of Soils and Agrifood Engineering Université Laval, Québec 2 Definition (Aitchison, 1986) Compositional data

More information

Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

More information

Linear Classifiers as Pattern Detectors

Linear Classifiers as Pattern Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation

More information

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH The First Step: SAMPLE SIZE DETERMINATION THE ULTIMATE GOAL The most important, ultimate step of any of clinical research is to do draw inferences;

More information

Sparse Proteomics Analysis (SPA)

Sparse Proteomics Analysis (SPA) Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universität Berlin Winter School on Compressed Sensing December 5, 2015

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

LECTURE NOTE #3 PROF. ALAN YUILLE

LECTURE NOTE #3 PROF. ALAN YUILLE LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.

More information

18 Bivariate normal distribution I

18 Bivariate normal distribution I 8 Bivariate normal distribution I 8 Example Imagine firing arrows at a target Hopefully they will fall close to the target centre As we fire more arrows we find a high density near the centre and fewer

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

Statistical tests for differential expression in count data (1)

Statistical tests for differential expression in count data (1) Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image

More information

Principal Components Analysis. Sargur Srihari University at Buffalo

Principal Components Analysis. Sargur Srihari University at Buffalo Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2

More information

BIG IDEAS. Area of Learning: SCIENCE Life Sciences Grade 11. Learning Standards. Curricular Competencies

BIG IDEAS. Area of Learning: SCIENCE Life Sciences Grade 11. Learning Standards. Curricular Competencies Area of Learning: SCIENCE Life Sciences Grade 11 BIG IDEAS Life is a result of interactions at the molecular and cellular levels. Evolution occurs at the population level. Learning Standards Organisms

More information

Multi-state Models: An Overview

Multi-state Models: An Overview Multi-state Models: An Overview Andrew Titman Lancaster University 14 April 2016 Overview Introduction to multi-state modelling Examples of applications Continuously observed processes Intermittently observed

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Microbiota: Its Evolution and Essence. Hsin-Jung Joyce Wu "Microbiota and man: the story about us

Microbiota: Its Evolution and Essence. Hsin-Jung Joyce Wu Microbiota and man: the story about us Microbiota: Its Evolution and Essence Overview q Define microbiota q Learn the tool q Ecological and evolutionary forces in shaping gut microbiota q Gut microbiota versus free-living microbe communities

More information