Niche Modeling. STAMPS - MBL Course Woods Hole, MA

Niche Modeling
Katie Pollard & Josh Ladau
Gladstone Institutes; UCSF Division of Biostatistics, Institute for Human Genetics and Institute for Computational Health Science
STAMPS - MBL Course, Woods Hole, MA - August 9, 2016

"All models are wrong, but some are useful." - George E. P. Box

Environmental Niche Modeling

Marine Bacterial Niche Modeling

INPUT 1: Taxon or gene abundances
- Extract DNA
- Sequence: 50 million 100-bp sequences, e.g.
  TGGCTAGTACGATCGATCGCTGAT
  GATCGACACTTCGATCGATCTACTC
  GCATATCGATCGACACATTCTCGAT
  TGCATGACTGCATGCATGCACACA
  ACGTGTCGTTTCGTATCGATCAGC
- Data sources: shotgun metagenomes, MICROBIS 16S data
- Taxonomic profile: Who is there?
- Functional profile: What are they doing?


INPUT 2: Values of Environmental Variables at Sampling Locations

SampleID              Longitude  Latitude  Day Length   Dust Flux  Salinity
ABR_0001_2005_01_02   -170       -64.5     20.46540359  2.53E-12   33.78893858
ABR_0005_2005_01_07   -170       -50       16.01746378  7.07E-12   34.44804678
ABR_0009_2005_01_30   -170       -20       12.98445342  4.20E-12   35.25957071
ABR_0013_2005_02_26   -170       0.08      12.10659698  2.18E-12   35.2218286


Linear Model at Sampling Locations
$y = a + bx$
- a is the intercept
- b is the slope
- Seek the line that minimizes the sum of squared residuals.
- The solution to the least squares problem is:
  $a = \bar{y} - b\bar{x}$
  $b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$
- Substituting the estimates of (a, b) provides predictions.
- The residual is the observed minus the predicted value for each x.
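The least squares solution can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single covariate; the function name `fit_line` and the toy data are hypothetical and not from the course material.

```python
def fit_line(x, y):
    """Return (a, b) minimizing the sum of squared residuals for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical toy data (not real measurements)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]                   # substitute the estimates
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted
```

With an intercept in the model, the residuals sum to zero up to rounding, which is a quick sanity check on any implementation.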

Data Transformations
If Y increases a non-constant amount per unit increase in X, a transformation may produce a linear relationship:
- Log or exponentiate
- Root or raise to a power
- Reciprocal
- Z-scores (subtract mean, divide by standard deviation)
For non-continuous data (e.g., counts), other models are typically needed. Generalized linear models will be covered next week.
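As a small illustration of two of these transformations, the sketch below uses made-up data in which Y grows multiplicatively with X, so that log(Y) is linear in X, and then computes z-scores. The data and the helper name `z_scores` are assumptions for illustration only.

```python
import math
import statistics

# Hypothetical data where y grows multiplicatively with x: y = 2 * exp(0.5 * x)
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

# After a log transform, y increases by a constant amount per unit x.
log_y = [math.log(yi) for yi in y]
steps = [log_y[i + 1] - log_y[i] for i in range(len(log_y) - 1)]

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

z = z_scores(x)
```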

Multiple Regression
- One outcome variable, 2+ environmental covariates
  - Covariates can be continuous or categorical
  - May include powers or other transformations of covariates
  - May include interactions between covariates
- Coefficients represent the expected change in Y per unit increase in that covariate, while holding the other covariates constant (i.e., adjusted for them).
- Coefficient estimates and their standard errors can be used to test for association with Y.
- Predicted values can be computed for different covariate combinations.
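A minimal sketch of multiple regression with NumPy, assuming two continuous covariates and simulated data. The covariate names (`temperature`, `salinity`), the true coefficients, and the new site are all hypothetical; the standard errors use the usual ordinary least squares formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(15.0, 5.0, n)   # hypothetical covariate
salinity = rng.normal(35.0, 1.0, n)      # hypothetical covariate
# Simulated outcome with known coefficients plus noise
y = 2.0 + 0.3 * temperature - 0.5 * salinity + rng.normal(0.0, 0.1, n)

# Design matrix: intercept column plus one column per covariate
X = np.column_stack([np.ones(n), temperature, salinity])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance and (X'X)^{-1}
resid = y - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = coef / se   # large |t| suggests association with y

# Prediction for a new covariate combination (intercept, temperature, salinity)
new_site = np.array([1.0, 20.0, 34.0])
prediction = new_site @ coef
```

Each coefficient here is adjusted for the other covariate: `coef[1]` estimates the change in y per unit of temperature with salinity held fixed.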

Fitted Model
Expected(LogRichness) ~ Daylength^2 + log(abs(elevation-thermoclinedepthmonthlymean)+0.01) + PhosphateConcentrationAnnualMean^2
- Searched all possible models with up to eight variables, including squared terms.
  - Evaluated performance on held-out data using leave-one-out cross-validation
  - Picked the model with the best CV R-squared
- Also evaluated performance on independent marine 16S datasets (e.g., GOS)
Ladau et al. (2013) ISMEJ
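The search strategy described here (enumerate candidate models, score each by leave-one-out cross-validated R-squared, keep the best) can be sketched as follows. This is an illustration on simulated data, not the code used by Ladau et al.; the variable names, the two-variable limit, and the helpers `loo_cv_r2` and `best_subset` are assumptions.

```python
from itertools import combinations
import numpy as np

def loo_cv_r2(X, y):
    """Leave-one-out cross-validated R^2 for an OLS model with intercept."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # hold out sample i
        A = np.column_stack([np.ones(n - 1), X[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        preds[i] = np.concatenate([[1.0], X[i]]) @ coef
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

def best_subset(X, y, names, max_vars):
    """Score every subset of up to max_vars columns; return (best score, names)."""
    best = (-np.inf, ())
    for k in range(1, max_vars + 1):
        for cols in combinations(range(X.shape[1]), k):
            score = loo_cv_r2(X[:, list(cols)], y)
            if score > best[0]:
                best = (score, tuple(names[c] for c in cols))
    return best

# Simulated data: richness depends on daylength^2 and phosphate only
rng = np.random.default_rng(0)
n = 40
daylength, phosphate, dust = rng.normal(size=(3, n))
log_richness = 1.0 + 2.0 * daylength**2 - 1.5 * phosphate + rng.normal(0.0, 0.3, n)

base = np.column_stack([daylength, phosphate, dust])
X = np.column_stack([base, base**2])   # candidates include squared terms
names = ["daylength", "phosphate", "dust", "daylength^2", "phosphate^2", "dust^2"]
score, chosen = best_subset(X, log_richness, names, max_vars=2)
```

The number of candidate models grows combinatorially with the number of variables, which is why exhaustive search is only practical for modest limits such as the eight variables mentioned above.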

Correlated variables
- Environmental variables are often highly correlated.
- Consequently, only one (or a few) of a set of correlated variables will typically be selected for the model.
- What determines which variable is selected?
- Is the selected variable more important biologically?


Geographic Projection: plug global variables into model to predict

Global Diversity Predictions

Avoid extrapolating too far beyond the range of the observed variables.
The MESS (multivariate environmental similarity surface) statistic measures the amount of extrapolation at each location. Elith & Kearney (2010)
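The sketch below shows one way a MESS-style score could be computed for a single prediction location. It follows my reading of the published definition (per-variable similarity based on the percentage of reference values below the point, then the minimum across variables) and should be checked against the cited paper before use. The function names and the reference values are hypothetical.

```python
def similarity_one_variable(p, reference):
    """Similarity of value p to the reference (sampled) values of one variable."""
    lo, hi = min(reference), max(reference)
    # Percentage of reference values strictly below p
    f = 100.0 * sum(r < p for r in reference) / len(reference)
    if f == 0:
        return 100.0 * (p - lo) / (hi - lo)   # at or below the observed minimum
    if f <= 50:
        return 2.0 * f
    if f < 100:
        return 2.0 * (100.0 - f)
    return 100.0 * (hi - p) / (hi - lo)       # above the observed maximum

def mess(point, reference_columns):
    """Minimum similarity over all variables; negative values mean extrapolation."""
    return min(similarity_one_variable(p, ref)
               for p, ref in zip(point, reference_columns))

# Hypothetical reference values at sampling locations: salinity, temperature
reference = [[33.0, 34.0, 35.0, 36.0], [5.0, 10.0, 15.0, 20.0]]

inside = mess((34.5, 12.0), reference)    # within the sampled range
outside = mess((38.0, 12.0), reference)   # salinity beyond the sampled maximum
```

Under this definition, locations with negative scores have at least one variable outside the sampled range, so predictions there are extrapolations.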

Relating Different Data Types

                               Covariate (independent variable)
Outcome (dependent variable)   Continuous                                            Categorical
Continuous                     Linear Regression                                     ANOVA
Categorical                    Generalized Linear Model Regression (e.g., Logistic)  Contingency Tables / Log-linear Model Regression

Global Taxon Distribution Predictions
Logistic regression for taxon presence/absence as a function of environmental variables.
Ladau et al. (2013) ISMEJ

ADDITIONAL DETAILS

Evaluating Model Fit
- Plot of residuals vs. X: no trends if the relationship is linear.
- Influential points: big effect on estimates, usually small residual.
- Model diagnostics quantify fit:
  $r^2 = \frac{SS_{\text{total}} - SS_{\text{resid}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{resid}}}{SS_{\text{total}}}$
  $s_e = \sqrt{\frac{SS_{\text{resid}}}{n - 2}}$
  - Coefficient of determination ($r^2$) is the proportion of variation in Y that can be explained by the linear relationship (i.e., model) between X and Y.
  - Pearson's correlation coefficient (r) is the square root of the coefficient of determination, with the sign of the slope.
  - Standard deviation about the least squares line ($s_e$) is roughly the typical distance of points from the line.
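These diagnostics are straightforward to compute once a line has been fitted. The sketch below is a minimal plain-Python version for one covariate; the function name `fit_diagnostics` and the toy data are hypothetical.

```python
import math

def fit_diagnostics(x, y):
    """Return (r2, s_e, r) for the least squares line y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - ss_resid / ss_total          # proportion of variation explained
    s_e = math.sqrt(ss_resid / (n - 2))     # spread of points about the line
    r = math.copysign(math.sqrt(r2), b)     # correlation carries the slope's sign
    return r2, s_e, r

# Hypothetical toy data (not real measurements)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
r2, s_e, r = fit_diagnostics(x, y)
```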

Model Selection Criteria
- The goal of model selection is to pick the best model given the data.
- There are different criteria for evaluating "best":
  - Likelihood ratio statistic (compare to chi-square for test)
  - Change in residual sum of squares
  - R^2 or adjusted R^2
  - Akaike Information Criterion (AIC)
  - Bayesian Information Criterion (BIC)
- Adjusted R^2, AIC, and BIC account for the number of parameters, with penalties to avoid over-fitting or overly complex models.
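For a linear model with Gaussian errors, the penalized criteria can be written in terms of the residual sum of squares. The sketch below uses the common forms that drop additive constants, so only differences between models fitted to the same data are meaningful. The example values of RSS, n, and k are made up, and conventions differ on whether k counts the error variance.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC up to an additive constant: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def gaussian_bic(rss, n, k):
    """BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p covariates plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: the larger model fits only slightly better
n = 50
aic_small, aic_large = gaussian_aic(12.0, n, 3), gaussian_aic(11.5, n, 8)
bic_small, bic_large = gaussian_bic(12.0, n, 3), gaussian_bic(11.5, n, 8)
```

Lower values are better for AIC and BIC. In this made-up comparison both criteria prefer the smaller model, and BIC penalizes the extra parameters more heavily because ln(50) exceeds 2.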

Additional Validation
Even with penalties for large models, observed data can be overfit, reducing the generalizability and repeatability of results. Some solutions:
- Cross-validation involves holding out a random subset of the data and assessing model fit on the held-out data, repeatedly.
- External validation involves assessing model fit on a totally independent data set (e.g., from a replication study or another population).
- For both, prediction accuracy can also be used as the criterion.
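A minimal sketch of k-fold cross-validation for a one-covariate linear model, using only the standard library. The simulated data, the choice of five folds, and the helper names are assumptions for illustration; mean squared prediction error on held-out points is used as the criterion.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

def k_fold_mse(x, y, k, seed=0):
    """Mean squared prediction error on held-out points over k random folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint held-out sets
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        sq_errors.extend((y[i] - (a + b * x[i])) ** 2 for i in held_out)
    return sum(sq_errors) / len(sq_errors)

# Simulated data: a straight line plus Gaussian noise
noise = random.Random(1)
x = [float(i) for i in range(20)]
y = [3.0 + 0.5 * xi + noise.gauss(0.0, 0.2) for xi in x]
cv_mse = k_fold_mse(x, y, k=5)
```

External validation follows the same pattern, except that the model is fitted once on all of the original data and the errors are computed on the independent data set.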

Algorithms for searching a large space of possible models
All subsets selection involves enumerating all possible models and picking the best one. Sometimes this is computationally infeasible. Alternatives include:
- Forward selection: Start with a small model and build up
- Backward selection: Start with the full model and remove terms
- Forward-backward selection: After building up, try removing terms to see if fit improves
- Deletion-substitution-addition: Algorithm for searching in a less linear fashion
Additional issue: Include interactions without main effects?
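Forward selection can be sketched as a greedy loop that adds whichever variable improves the chosen criterion most, and stops when no addition helps. The version below uses BIC (up to a constant) as the criterion, which is one reasonable choice among those listed earlier; the simulated data and the function names are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for OLS on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def bic(X, y, cols):
    n = len(y)
    return n * np.log(rss(X, y, cols) / n) + (len(cols) + 1) * np.log(n)

def forward_selection(X, y):
    """Start from the intercept-only model and add one variable at a time."""
    selected = []
    remaining = list(range(X.shape[1]))
    current = bic(X, y, selected)
    while remaining:
        best_score, best_col = min((bic(X, y, selected + [c]), c) for c in remaining)
        if best_score >= current:   # no candidate improves the criterion
            break
        selected.append(best_col)
        remaining.remove(best_col)
        current = best_score
    return selected

# Simulated data: only columns 1 and 4 affect the outcome
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 5))
y = 1.0 + 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(0.0, 0.5, n)
selected = forward_selection(X, y)
```

Because the search is greedy, it fits far fewer models than all subsets selection but is not guaranteed to find the same model.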

Variable importance
The importance of each covariate towards model fit can be measured with various statistics, e.g.:
- Estimated coefficient divided by its standard error (linear models - this fails in many other models)
- Sign of coefficient
- Fit of model with and without the variable included
- Average error minus error after permuting the covariate values, divided by its standard error (in cross-validation)
- Decrease in node impurity after adding variable (random forests)
For classification, assess importance for each class.
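The permutation idea can be illustrated with a simple train/test split: fit the model once, then shuffle one covariate at a time in the held-out data and record how much the prediction error changes. This sketch reports the increase in error (error after permuting minus baseline error) and omits the division by a standard error; the simulated data and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 3))
# Column 0 matters a lot, column 1 a little, column 2 not at all
y = 1.0 + 2.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0.0, 0.5, n)

train, test = np.arange(0, 200), np.arange(200, n)

def design(M):
    """Add an intercept column."""
    return np.column_stack([np.ones(len(M)), M])

coef, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)

def mse(M, target):
    return float(np.mean((target - design(M) @ coef) ** 2))

baseline = mse(X[test], y[test])
importance = []
for j in range(X.shape[1]):
    increases = []
    for _ in range(20):                      # average over repeated shuffles
        shuffled = X[test].copy()
        shuffled[:, j] = rng.permutation(shuffled[:, j])
        increases.append(mse(shuffled, y[test]) - baseline)
    importance.append(float(np.mean(increases)))
```

A covariate whose shuffling barely changes the error contributes little to prediction in this model, although correlated covariates can mask each other's importance.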

Generalized linear model (GLM)
If the outcome is not continuous (e.g., binary or counts), the linear model framework can be extended via transformations of the expected outcome, called link functions.
- Binary: logit (alternatives: probit, log-log)
- Counts: log (also known as log-linear model)
The covariates still enter as a linear combination, but the error has a different distribution.
For GLMs, the parameters are estimated by numerical methods (e.g., Newton-Raphson).
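To make the numerical estimation concrete, here is a minimal Newton-Raphson sketch for logistic regression with one covariate, written in plain Python. It is an illustration under simple assumptions (two parameters, fixed number of iterations, no convergence checks), not a replacement for a statistics package; the presence/absence data are made up.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson for logit(pi) = b0 + b1 * x with binary outcome y."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood and entries of the information matrix
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information matrix) * step = gradient
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical presence/absence data for a binary covariate:
# x = 0: taxon present in 10 of 40 samples; x = 1: present in 30 of 40
x = [0] * 40 + [1] * 40
y = [1] * 10 + [0] * 30 + [1] * 30 + [0] * 10
b0, b1 = fit_logistic(x, y)
```

For this data set the maximum likelihood estimates have a closed form (b0 is the log odds at x = 0 and b1 is the log odds ratio), which gives a check on the iteration.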

Logistic regression parameters
- Model the probability $\pi$ of observing a taxon: $\mathrm{logit}(\pi) = \beta_0 + \beta_1 X$
- Interpretation of $\beta_1$ is the expected change in the logit (log odds) for a unit increase in X. What is this?
- If X is binary (e.g., 0 = near land vs. 1 = open ocean): $\mathrm{odds}(x=0) = \exp\{\beta_0\}$, $\mathrm{odds}(x=1) = \exp\{\beta_0\}\exp\{\beta_1\}$
- Odds increase multiplicatively by $\exp\{\beta_1\}$ per unit X.
- Odds ratio $= \mathrm{odds}(x=1)/\mathrm{odds}(x=0) = \exp\{\beta_1\}$
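A short worked example may help with the interpretation. The counts below are invented for illustration; with a single binary covariate, the logistic regression coefficients can be read directly from the observed odds.

```python
import math

# Hypothetical counts of samples where the taxon was present or absent
present_near, absent_near = 10, 30   # X = 0 (near land)
present_open, absent_open = 30, 10   # X = 1 (open ocean)

odds_near = present_near / absent_near   # odds(x=0) = exp(beta0)
odds_open = present_open / absent_open   # odds(x=1) = exp(beta0) * exp(beta1)

beta0 = math.log(odds_near)
beta1 = math.log(odds_open / odds_near)  # log odds ratio
odds_ratio = math.exp(beta1)

def prob(x):
    """Probability of observing the taxon, from the inverse logit."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
```

Here the odds of observing the taxon are nine times higher in the open ocean than near land, while the probabilities are 0.75 and 0.25: the odds ratio is not a ratio of probabilities.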
