Niche Modeling. STAMPS - MBL Course Woods Hole, MA - August 9, 2016

Similar documents
Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Chapter 1. Modeling Basics

Model Selection. Frank Wood. December 10, 2009

Chapter 1 Statistical Inference

Statistics 203: Introduction to Regression and Analysis of Variance Course review

LOGISTIC REGRESSION Joseph M. Hilbe

Linear Regression Models P8111

MS&E 226: Small Data

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

Statistics 262: Intermediate Biostatistics Model selection

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

Model Estimation Example

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Introduction to Statistical modeling: handout for Math 489/583

Sections 4.1, 4.2, 4.3

MLR Model Selection. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Multiple Regression and Regression Model Adequacy

36-463/663: Multilevel & Hierarchical Models

Tuning Parameter Selection in L1 Regularized Logistic Regression

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

Correlation and regression

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Proteomics and Variable Selection

MASM22/FMSN30: Linear and Logistic Regression, 7.5 hp FMSN40:... with Data Gathering, 9 hp

Chapter 4: Generalized Linear Models-I

Generalized Linear Models for Non-Normal Data

2/26/2017. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

Testing and Model Selection

Regression Model Building

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

MS-C1620 Statistical inference

Generalized Models: Part 1

Linear Models (continued)

Regression Modelling. Dr. Michael Schulzer. Centre for Clinical Epidemiology and Evaluation

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Categorical and Zero Inflated Growth Models

Linear model selection and regularization

Week 8 Hour 1: More on polynomial fits. The AIC

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j

Linear Regression In God we trust, all others bring data. William Edwards Deming

Longitudinal Modeling with Logistic Regression

Models for Count and Binary Data. Poisson and Logistic GWR Models. 24/07/2008 GWR Workshop 1

STAT5044: Regression and Anova

MS&E 226: Small Data

How the mean changes depends on the other variable. Plots can show what s happening...

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models:

Prediction of Bike Rental using Model Reuse Strategy

Experimental Design and Statistical Methods. Workshop LOGISTIC REGRESSION. Jesús Piedrafita Arilla.

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

An Adaptive Association Test for Microbiome Data

More Accurately Analyze Complex Relationships

COMPLEMENTARY LOG-LOG MODEL

Regression so far... Lecture 21 - Logistic Regression. Odds. Recap of what you should know how to do... At this point we have covered: Sta102 / BME102

Introduction to Generalized Models

Introduction To Logistic Regression

MS&E 226: Small Data

Analysing data: regression and correlation S6 and S7

STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours

Machine Learning for OR & FE

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Supplemental Resource: Brain and Cognitive Sciences Statistics & Visualization for Data Analysis & Inference January (IAP) 2009

Applied Machine Learning Annalisa Marsico

Week 7: Binary Outcomes (Scott Long Chapter 3 Part 2)

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Logistic Regression 21/05

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Model comparison. Patrick Breheny. March 28. Introduction Measures of predictive power Model selection

Introduction to Within-Person Analysis and RM ANOVA

STAT 100C: Linear models

Regression, Ridge Regression, Lasso

Chapter 19: Logistic regression

10. Alternative case influence statistics

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Generating Half-normal Plot for Zero-inflated Binomial Regression

R Hints for Chapter 10

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

Correlation and Regression Notes. Categorical / Categorical Relationship (Chi-Squared Independence Test)

Generalized linear models

Generalized Linear Models

STAC51: Categorical data Analysis

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Introduction to Regression

Sparse Linear Models (10/7/13)

A Practitioner s Guide to Generalized Linear Models

Logistic Regression: Regression with a Binary Dependent Variable

Stat 5101 Lecture Notes

Machine Learning Linear Classification. Prof. Matteo Matteucci

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Regression and Generalized Linear Models. Dr. Wolfgang Rolke Expo in Statistics C3TEC, Caguas, October 9, 2015

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

Generalized Linear Models I

Lecture 2: Poisson and logistic regression

ECE521 week 3: 23/26 January 2017

Transcription:

Niche Modeling Katie Pollard & Josh Ladau Gladstone Institutes UCSF Division of Biostatistics, Institute for Human Genetics and Institute for Computational Health Science STAMPS - MBL Course Woods Hole, MA - August 9, 2016

"All models are wrong, but some are useful. - George E. Box

Environmental Niche Modeling

Marine Bacterial Niche Modeling

Marine Bacterial Niche Modeling

INPUT 1: Taxon or gene abundances Extract DNA Sequence 50 Million 100-bp sequences! TGGCTAGTACGATCGATCGCTGAT GATCGACACTTCGATCGATCTACTC GCATATCGATCGACACATTCTCGAT TGCATGACTGCATGCATGCACACA ACGTGTCGTTTCGTATCGATCAGC. Shotgun Metagenomes MICROBIS 16S Data Taxonomic profile Functional profile Who is there? What they are doing?

Marine Bacterial Niche Modeling

INPUT 2: Values of Environmental Variables at Sampling Locations SampleID Longitude Latitude Day Length Dust Flux Salinity ABR_0001_20 05_01_02-170 -64.5 20.46540359 2.53E-12 33.78893858 ABR_0005_20 05_01_07 ABR_0009_20 05_01_30-170 -50 16.01746378 7.07E-12 34.44804678-170 -20 12.98445342 4.20E-12 35.25957071. ABR_0013_20 05_02_26.!. -170 0.08 12.10659698 2.18E-12 35.2218286

Marine Bacterial Niche Modeling

Linear Model at Sampling Locations y y = a + bx x a is the intercept! b is the slope!! Seek the line that minimizes sum of squared residuals. The solution to the least squares problem is:!!! a = y bx b = i i (x i x )(y i y ) (x i x ) 2 Substituting estimates of (a,b) provides predictions.! Residual is observed minus predicted value for each x.

Data Transformations If Y increases a non-constant amount per unit increase in X, transformation may produce a linear relationship:! Log or exponentiate! Root or raise to a power! Reciprocal! Z-scores (subtract mean, divide by standard error)! For non-continuous data (e.g., counts), other models are typically needed. Generalized linear models will be covered next week.

Multiple Regression One outcome variable, 2+ environmental covariates! - Covariates can be continuous or categorical! - May include powers or other transformations of covariates! - May include interactions between covariates! Coefficients represent expected change in Y per unit increase in that covariate, while holding the other covariates constant (i.e., adjusted for them).! Coefficient estimates and their standard errors can be used to test for association with Y.! Predicted values can be computed for different covariate combinations.

Fitted Model Expected(LogRichness) ~ Daylength^2 + log(abs(elevation-thermoclinedepthmonthlymean)+0.01) + PhosphateConcentrationAnnualMean^2 Searched all possible models with up to eight variables, including squared terms.! - Evaluated performance on held out data using leave-one-out cross-validation! - Picked model with best CV R-squared! Also evaluated performance on independent marine 16S datasets (e.g., GOS) Ladau et al. (2013) ISMEJ

Correlated variables Environmental variables are often highly correlated.! Consequently, only one (or a few) correlated variables will typically be selected for the model.! What determines which variable is selected?! Is the selected variable more important biologically?

Marine Bacterial Niche Modeling

Geographic Projection: plug global variables into model to predict

Global Diversity Predictions

Avoid extrapolating too far beyond range of observed variables MESS statistic measures amount of extrapolation at each location Elith & Kearney (2010)

Relating Different Data Types Covariate (dependent variable) Continuous Categorical Outcome! (independent! variable) Continuous Linear Regression ANOVA Categorical Generalized Linear Model Regression (e.g., Logistic) Contingency Tables / Log-linear Model Regression

Global Taxon Distribution Predictions Ladau et al. (2013) ISMEJ Logistic regression for taxon presence/absence as a function of environmental variables.

ADDITIONAL DETAILS

Plot of residuals vs. X: no trends if linear relationship! Influential points: big effect on estimates, usually small residual! Model diagnostics quantify fit:!!!! Evaluating Model Fit r 2 = SS total SS resid SS total =1 SS resid SS total s e = SS resid n 2 - Coefficient of determination (r 2 ) is the amount of variation in Y that cannot be explained by the linear relationship (i.e., model) between X and Y.! - Pearson s correlation coefficient (r) is the square root of the coefficient of determination.! - Standard deviation about the least squares line (se) is average distance points are from the line.

Model Selection Criteria The goal of model selection is to pick the best model given the data.! Different criteria for evaluating best! - Likelihood ratio statistic (compare to chi-square for test)! - Change in residual sum of squares! - R 2 or adjusted R 2! - Akaike Information Criterion (AIC)! - Bayesian Information Criterion (BIC)! The last three account for the number of parameters with penalties to avoid over-fitting or too complex models.

Additional Validation Even with penalties for large models, observed data can be overfit, reducing generalizability and repeatability of results.! Some solutions:! Cross-validation involves holding out a random subset of the data and assessing model fit on the held out data, repeatedly.! External validation involves assessing model fit on a totally independent data set (e.g., from a replication study or another population)! For both, can also use prediction accuracy as criterion.

Algorithms for searching large space of possible models All subsets selection involves enumerating all possible models and picking the best one. Some times this is computationally infeasible. Alternatives include! Forward selection: Start with a small model and build up! Backward selection: Start with the full model and remove terms! Forward-backward selection: After building up, try removing terms to see if fit improves! Deletion-substitution-addition: Algorithm for searching in a less linear fashion! Additional issue: Include interactions without main effects?

Variable importance The importance of each covariate towards model fit can be measured with various statistics, e.g.:! Estimated coefficient divided by its standard error (linear models - this fails in many other models)! Sign of coefficient! Fit of model with and without the variable included! Average error minus error after permuting the covariate values, divided by its standard error (in cross-validation)! Decrease in node impurity after adding variable (random forests)! For classification, assess importance for each class.

Generalized linear model (GLM) If outcome is not quantitative, the linear model framework can be extended via data transformations, called link functions.! Binary: logit (alternatives: probit, log-log)! Counts: log (also known as log-linear model)! The covariates are still a linear combination.! But the error has a different distribution.! For GLMs, the parameters are estimated by numerical methods (e.g., Newton-Raphson).

Logistic regression parameters Model the probability π of observing a taxon:! logit(π) = ß 0 + ß 1 X! Interpretation of ß 1 is the expected change in logit for a unit increase in X. What is this?! If X is binary (e.g., 0=near land vs. 1=open ocean):! odds(x=0) = exp{ß 0 }, odds(x=1) = exp{ß 0 }exp{ß 1 }! Odds increase multiplicatively by exp{ß 1 } per unit X.! Odds ratio = odds(x=1)/odds(x=0) = exp{ß 1 }