Maximum Entropy and Species Distribution Modeling


Maximum Entropy and Species Distribution Modeling
Rob Schapire, Steven Phillips, Miro Dudík
Also including work by or with: Rob Anderson, Jane Elith, Catherine Graham, Chris Raxworthy, NCEAS Working Group, ...

The Problem: Species Habitat Modeling
goal: model the distribution of a plant or animal species
given: presence records
given: environmental variables (precipitation, wet days, avg. temperature, ...)
desired output: map of the species' range

Biological Importance
fundamental question: what are the survival requirements (niche) of a given species?
core problem for conservation of species
first step for many applications: reserve design, impact of climate change, discovery of new species, clarification of taxonomic boundaries

A Challenge for Machine Learning
no negative examples
very limited data: often only 20-100 presence records
usually not systematically collected; may be museum records years or decades old
sample bias (toward most accessible locations)

Our Approach
assume presence records come from a probability distribution π
try to estimate π
apply the maximum entropy approach

Maxent Approach
presence records for species X:

              altitude    July temp.
  record #1   1410m       16°C
  record #2   1217m       22°C
  record #3   1351m       17°C
  ...
  Average     1327m       17.2°C
  Stand. Dev. 117m        3.2°C

Maxent Approach (cont.)
the data allow us to infer many facts, e.g.:
  average altitude of species X's habitat ≈ 1327m
  average July temperature of species X's habitat ≈ 17.2°C
  stand. dev. of altitude of species X's habitat ≈ 117m
  probability species X lives above 1100m ≈ 0.78
  probability species X lives above 1200m ≈ 0.62
  ...
each fact tells us something about the true distribution
idea: find a distribution satisfying all constraints; among these, choose the distribution closest to uniform (i.e., of highest entropy)
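To make the idea concrete, here is a minimal sketch (not from the talk) of maxent with a single constraint: over a hypothetical grid of altitudes, find the distribution whose mean altitude matches the empirical average of 1327m and which is otherwise as close to uniform as possible. The grid, the scaling, and the use of scipy are illustrative assumptions.

```python
# Minimal sketch of the maxent idea with one constraint (illustrative values only).
import numpy as np
from scipy.optimize import minimize_scalar

altitudes = np.arange(500.0, 3001.0, 50.0)   # hypothetical discretized localities
f = (altitudes - 500.0) / 2500.0             # altitude feature scaled into [0, 1]
target = (1327.0 - 500.0) / 2500.0           # empirical average altitude, same scaling

# Maxent solutions have the Gibbs form p(x) ∝ exp(lambda * f(x)); the convex dual
# log Z(lambda) - lambda * target is minimized exactly when E_p[f] = target.
def dual(lam):
    return np.log(np.exp(lam * f).sum()) - lam * target

lam = minimize_scalar(dual).x
p = np.exp(lam * f)
p /= p.sum()

print((p * altitudes).sum())   # ≈ 1327m: constraint satisfied, max entropy among such p
```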

This Talk
theory:
  maxent with relaxed constraints
  new performance guarantees for maxent, useful even with a very large number of features (or constraints)
algorithm and convergence
experiments and applications

The Abstract Framework [Della Pietra, Della Pietra & Lafferty]
π = (unknown) true distribution
given: samples x_1, ..., x_m ∈ X, with x_i ~ π
       features f_1, ..., f_n, with f_j : X → [0, 1]
goal: find π̂ = estimate of π

Maxent and Habitat Modeling
x_i = presence record
X = all localities (discretized)
π = distribution of localities inhabited by the species
(ignores sample bias and dependence between samples)
features: use raw environmental variables v_k, or derived functions:
  linear:    f_j = v_k
  quadratic: f_j = v_k^2
  product:   f_j = v_k · v_l
  threshold: f_j = 1 if v_k ≥ a, 0 otherwise
# of features can become very large (even infinite)
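A small sketch of how the derived feature types listed above could be assembled; the function name, the `env` array of raw environmental variables (assumed already scaled into [0, 1]), and the threshold values are hypothetical, not part of the Maxent software.

```python
import numpy as np

def maxent_features(env, thresholds=(0.25, 0.5, 0.75)):
    """env: (n_localities, n_vars) array of raw variables scaled into [0, 1]."""
    cols = [env, env ** 2]                         # linear f_j = v_k, quadratic f_j = v_k^2
    n_vars = env.shape[1]
    prods = [env[:, k] * env[:, l]                 # product f_j = v_k * v_l
             for k in range(n_vars) for l in range(k + 1, n_vars)]
    if prods:
        cols.append(np.column_stack(prods))
    for a in thresholds:                           # threshold f_j = 1{v_k >= a}
        cols.append((env >= a).astype(float))
    return np.hstack(cols)                         # every feature stays in [0, 1]
```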

A Bit More Notation
π̃ = empirical distribution, i.e., π̃(x) = #{i : x_i = x}/m
π[f] = expectation of f with respect to π (so π̃[f] = empirical average of f)

Maxent
π̃ typically a very poor estimate of π
but: π̃[f_j] likely to be a reasonable estimate of π[f_j]
so: choose a distribution π̂ such that π̂[f_j] = π̃[f_j] for all features f_j
among these, choose the one closest to uniform, i.e., of maximum entropy [Jaynes]
problem: can badly overfit, especially with a large number of features

A More Relaxed Version
generally, only expect π̃[f_j] ≈ π[f_j]
usually, can estimate an upper bound on |π̃[f_j] − π[f_j]|
so: compute π̂ to maximize H(π̂) (= entropy)
    subject to ∀j : |π̃[f_j] − π̂[f_j]| ≤ β_j
    where β_j = known upper bound [Kazama & Tsujii]

Duality
can show the solution must be a Gibbs distribution:
    π̂(x) = q_λ(x) ∝ exp( Σ_j λ_j f_j(x) )
in the unrelaxed case, the solution is the Gibbs distribution that maximizes likelihood, i.e., minimizes the negative log likelihood
    (1/m) Σ_i −ln q_λ(x_i)
in the relaxed case, the solution is the Gibbs distribution that minimizes
    (1/m) Σ_i −ln q_λ(x_i)  +  Σ_j β_j |λ_j|
    [negative log likelihood]   [regularization]

Equivalent Motivations
maxent with relaxed constraints
log loss with regularization
MAP estimate with Laplace prior on weights λ
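These three views share one objective. Below is a rough sketch of that objective over a finite set of localities: the Gibbs form q_λ(x) ∝ exp(Σ_j λ_j f_j(x)) plus the penalty Σ_j β_j |λ_j|. Function names are hypothetical; `features` could come from the earlier feature sketch, and `sample_idx` indexes the presence records among the localities.

```python
import numpy as np

def gibbs(features, lam):
    """q_lambda(x) ∝ exp(sum_j lambda_j * f_j(x)) over all localities."""
    scores = features @ lam
    q = np.exp(scores - scores.max())     # subtract max for numerical stability
    return q / q.sum()

def objective(lam, features, sample_idx, beta):
    """Relaxed-maxent dual: negative log likelihood + sum_j beta_j * |lambda_j|."""
    q = gibbs(features, lam)
    nll = -np.log(q[sample_idx]).mean()   # (1/m) * sum_i -ln q_lambda(x_i)
    return nll + np.sum(beta * np.abs(lam))
```

This is the same L(λ) minimized in the algorithm section below.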

How Good Is Maxent Estimate?
want to bound the distance between π̂ and π (measured with relative entropy):
    RE(π ‖ π̂) ≤ RE(π ‖ π*) + additional term
can never beat the best Gibbs distribution π*
the additional term → 0 as m → ∞; it depends on the number or complexity of the features and the smoothness of π*

Bounds for Finite Feature Classes
with high probability, for all λ*:
    RE(π ‖ π̂) ≤ RE(π ‖ q_λ*) + O( ‖λ*‖_1 √(ln n / m) )
(for a choice of β_j based only on n and m)
π* = q_λ* = best Gibbs distribution
‖λ*‖_1 measures smoothness of π*
very moderate (logarithmic) dependence on the number of features

Bounds for Infinite Binary Feature Classes
assume binary features with VC-dimension d
then with high probability, for all λ*:
    RE(π ‖ π̂) ≤ RE(π ‖ q_λ*) + Õ( ‖λ*‖_1 √(d / m) )
(for a choice of β_j based only on d and m)
e.g., infinitely many threshold features, but very low VC-dimension

Main Theorem
both bounds follow from the main theorem:
    assume ∀j : |π[f_j] − π̃[f_j]| ≤ β_j
    then RE(π ‖ π̂) ≤ RE(π ‖ π*) + 2 Σ_j β_j |λ*_j|
the preceding results are simple corollaries, using standard uniform convergence results
in practice, the theorem tells us how to set the β_j parameters: use the tightest bound available on |π[f_j] − π̃[f_j]|
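As one concrete (assumed, not necessarily the talk's) choice for a common β when the n features are bounded in [0, 1]: Hoeffding's inequality with a union bound gives, with probability at least 1 − δ, |π[f_j] − π̃[f_j]| ≤ √(ln(2n/δ)/(2m)) simultaneously for all j.

```python
import numpy as np

def beta_hoeffding(m, n, delta=0.05):
    # Hoeffding + union bound for n features in [0, 1] and m samples:
    # with prob. >= 1 - delta, |pi[f_j] - pi~[f_j]| <= this value for all j.
    return np.sqrt(np.log(2.0 * n / delta) / (2.0 * m))
```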

Finding an Algorithm
want to minimize
    L(λ) = (1/m) Σ_i −ln q_λ(x_i) + Σ_j β_j |λ_j|
no analytical solution
instead, iteratively compute λ^(1), λ^(2), ... so that L(λ^(t)) converges to the minimum
most algorithms for maxent update all weights λ_j simultaneously; this is less practical with a very large number of features

Sequential-update Algorithm
instead, update just one weight at a time
leads to a sparser solution
sometimes can search for the best weight to update very efficiently
analogous to boosting: a weak learner acts as an oracle for choosing a function (weak classifier) from a large space
can prove convergence to the minimum of L
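A toy sketch in the spirit of the sequential update: each round, re-optimize a single weight with the others held fixed (here via a generic one-dimensional search, not the closed-form update the talk refers to), and greedily keep the best such single-weight change. It reuses `objective` from the sketch above; the number of rounds is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sequential_maxent(features, sample_idx, beta, n_rounds=100):
    lam = np.zeros(features.shape[1])
    for _ in range(n_rounds):
        best_val, best_j, best_x = np.inf, None, None
        for j in range(features.shape[1]):          # try re-optimizing each weight alone
            def along_j(v, j=j):
                trial = lam.copy()
                trial[j] = v
                return objective(trial, features, sample_idx, beta)
            res = minimize_scalar(along_j)           # 1-D search, other weights held fixed
            if res.fun < best_val:
                best_val, best_j, best_x = res.fun, j, res.x
        lam[best_j] = best_x                         # greedily apply the single best update
    return lam
```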

Experiments and Applications
broad comparison of algorithms
improvements by handling sample bias
case study
discovering new species
clarification of taxonomic boundaries

NCEAS Experiments [Elith, Graham, et al.]
species distribution modeling "bake-off" comparing 16 methods
226 plant and animal species from 6 world regions
mostly 10s to 100s of presence records per species (min = 2, max = 5822, average = 241.1, median = 58.5)
design:
  training data: incidental, non-systematic, presence-only; mainly from museums, herbaria, etc.
  test data: presence and absence data collected in systematic surveys

Results: mean AUC (all species)
  boosted regression trees     0.725
  MAXENT                       0.722
  gen'd dissimilarity models   0.716
  gen'd additive models        0.699
  garp                         0.699
  bioclim                      0.656
newer statistical/machine learning methods (including maxent) performed better than more established methods
reasonable to use presence-only incidental data

Maxent versus Boosted Regression Trees
very similar, both mathematically and algorithmically, as methods for combining simpler features
differences:
  maxent is generative; boosting is discriminative
  as implemented, boosting uses complex features; maxent uses simple features
open: which is more important?

The Problem with Canada
results for Canada are by far the weakest: mean AUC (all regions vs. Canada)
                               all     Canada
  boosted regression trees     0.725   0.601
  MAXENT                       0.722   0.582
  gen'd dissimilarity models   0.716   0.560
  gen'd additive models        0.699   0.549
  garp                         0.699   0.551
  bioclim                      0.656   0.632
apparent problem: very bad sample bias; sampling is much heavier in the (warm) south than in the (cold) north

Sample Bias
can modify maxent to handle sample bias:
  use the sampling distribution (assumed known) as the default distribution (instead of uniform)
  then factor the bias out from the final model
problem: where to get the sampling distribution?
typically, modeling many species at once
so, for the sampling distribution, use all locations where any species was observed
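A rough sketch of the "use all locations where any species was observed" idea, sometimes called a target-group background: estimate the sampling distribution from the pooled presence records of all species and use it as the default distribution in the Gibbs model. The function names, the smoothing constant, and the exact way the default enters the model are assumptions for illustration, not the talk's implementation.

```python
import numpy as np

def target_group_default(n_localities, all_species_idx, smoothing=1.0):
    """Estimate the sampling distribution from localities where *any* species
    was recorded; add-constant smoothing keeps every locality possible."""
    counts = np.bincount(all_species_idx, minlength=n_localities).astype(float)
    counts += smoothing
    return counts / counts.sum()

def biased_gibbs(features, lam, default):
    """Gibbs model relative to a non-uniform default: q(x) ∝ default(x) * exp(lambda · f(x))."""
    scores = features @ lam
    q = default * np.exp(scores - scores.max())
    return q / q.sum()
```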

Results of Sample Debiasing: mean AUC (Canada only)
  MAXENT (w/ debiasing)        0.723
  boosted regression trees     0.601
  MAXENT                       0.582
  gen'd dissimilarity models   0.560
  gen'd additive models        0.549
  garp                         0.551
  bioclim                      0.632

Results of Sample Debiasing: mean AUC (all species)
  MAXENT (w/ debiasing)        0.752
  boosted regression trees     0.725
  MAXENT                       0.722
  gen'd dissimilarity models   0.716
  gen'd additive models        0.699
  garp                         0.699
  bioclim                      0.656

huge improvements possible using debiasing

Interpreting Maxent Models
recall q_λ(x) ∝ exp( Σ_j λ_j f_j(x) )
for each environmental variable, can plot its total (additive) contribution to the exponent
[Figure: additive contribution to the exponent vs. value of each environmental variable (aspect, elevation, slope, precipitation, avg. temperature, no. wet days), for threshold features with β=1.0 and β=0.01 and linear+quadratic features with β=0.1]

Case Study: Microryzomys minutus [with Phillips & Anderson]

Case Study (cont.)
accurately captures the realized range
did not predict other wet montane forest areas where the species could live, but doesn't
examined maxent's predictions at six of these sites (chosen by a biologist)
found all had characteristics well outside the typical range of the actual presence records
e.g., four sites had July precipitation 5 standard deviations above the mean

Finding New Species [Raxworthy et al.]
build models for several geckos and chameleons of Madagascar (~10-20 presence records each)
identify isolated regions where the species is predicted but has not been found

Finding New Species (cont.)
combine all identified regions into a single map
many regions are already well-known areas of local endemism
survey the regions not previously studied

New Species
result: discovery of many new species (possibly 15-30)
Thanks to Chris Raxworthy for all maps and photos!

Clarifying Taxonomic Boundaries [Raxworthy et al.]
"cryptic species": classified as a single species, but suspected to be a mixture of 2 species
sometimes, the maxent model is wildly wrong if trained on all records, but much more reasonable if trained on each sub-population separately
this gives strong evidence that we are actually dealing with multiple species
can then follow up with morphological or genetic study

[Slide courtesy of Chris Raxworthy: Phelsuma madagascariensis subspecies, comparing a one-species model with three separate models for Phelsuma mad. grandis, Phelsuma mad. kochi, and Phelsuma mad. madagascariensis]

Summary
maxent provides a clean and effective fit to the habitat modeling problem
works with positive-only examples
seems to perform well with a limited number of examples
theoretical guarantees indicate it can be used even with a very large number of features
other nice properties: easy to interpret by a human expert; can be extended to handle sample bias
many biological applications