A Maximum Entropy Approach to Species Distribution Modeling. Rob Schapire Steven Phillips Miro Dudík Rob Anderson.

Size: px

Start display at page:

Download "A Maximum Entropy Approach to Species Distribution Modeling. Rob Schapire Steven Phillips Miro Dudík Rob Anderson."

Raymond Wood
5 years ago
Views:

1 A Maximum Entropy Approach to Species Distribution Modeling Rob Schapire Steven Phillips Miro Dudík Rob Anderson schapire

2 The Problem: Species Habitat Modeling goal: model distribution of plant or animal species given: occurrence records given: environmental variables precipitation wet days average temperature desired output: map of range

3 Issues no negative examples very limited data sample bias data often collected over a range of years or decades geographic uncertainty realized versus potential range hard versus soft predictions interpretability of constructed model

4 Our Approach assume recorded localities come from probability distribution π try to estimate π apply maximum entropy approach

5 The Maxent Set-up given: samples x i X, x i π features (base functions) f j : X R output: ˆπ = estimate of π, in terms of features

6 Maxent and Habitat Modeling x i = recorded locality X = all localities (discretized) π = distribution of localities inhabited by species assumes locality chosen at random from entire range ignores sample bias and dependence between samples features: use raw environmental variables v k or derived functions linear: f j = v k quadratic: f j = vk 2 product: f j = v k v l 1 if v threshold: f j = k a 0 else # features can become very large (even infinite)

7 This Talk maxent with relaxed constraints new performance guarantees for maxent useful even with very large number of features algorithm and convergence lots of experiments with habitat modeling data

8 The Abstract Framework given: x 1,...,x m X x i π features f 1,...,f n f j : X [0, 1] goal: find ˆπ = estimate of π let π = empirical distribution i.e., π(x) =#{i : x i = x}/m notation: π[f] =expectation of f with respect to π (so π[f] =empirical average of f)

9 Maxent π typically very poor estimate of π but: π[f j ] likely to be reasonable estimate of π[f j ] so: choose distribution ˆπ such that ˆπ[f j ]= π[f j ] for all features f j among these, choose one closest to uniform, i.e., of maximum entropy [Jaynes]

10 Maxent Can Badly Overfit say X = {1,...,n}, n large 1 if x j features: f j (x) = 0 else can show only distribution satisfying constraints is π clear overfitting

11 A More Relaxed Version generally, only expect π[f j ] π[f j ] usually, can estimate upper bound on π[f j ] π[f j ] so: compute ˆπ to maximize H(ˆπ) (= entropy) subject to j : π[f j ] ˆπ[f j ] β j where β j = known upper bound [Kazama & Tsujii]

12 Duality in unrelaxed case, solution is maximum likelihood Gibbs distribution Gibbs distribution: q λ (x) = exp j λ jf j (x) can show solution in relaxed case is Gibbs distribution q λ that minimizes: 1 m ln q λ(x i ) + β j λ j i j }{{} negative log likelihood Z λ }{{} regularization

13 Equivalent Motivations maxent with relaxed constraints log loss with regularization MAP estimate with Laplace prior on weights λ

14 How Good Is Maxent Estimate? would like to bound distance between ˆπ and π bound should be in terms of: # of examples m distance between π and best Gibbs distribution π = q λ complexity of features smoothness of π measure distance using relative entropy RE(p q) =p[ln(p/q)]

15 Bounds for Finite Feature Classes with high probability, for all λ RE(π ˆπ) RE(π π )+O (for choice of β j based only on n and m) λ 1 ln n m e.g., if π best Gibbs distribution with λ 1 K then RE(π ˆπ) RE(π π )+O very moderate in number of features K ln n m

16 Bounds for Infinite Binary Feature Classes assume binary features with VC-dimension d then with high probability, for all λ : RE(π ˆπ) RE(π π )+Õ (for choice of β j based only on d and m) λ 1 d m

17 Main Theorem assume j : π[f j ] π[f j ] β j then RE(π ˆπ) RE(π π )+2 j β j λ j previous results are simple corollaries using standard uniform convergence results

18 Proof of Main Theorem let LL(p q) =p[ ln q] =RE(p q)+h(p) want: LL(π ˆπ) LL(π π )+2 β j λ j j claim: for all λ LL(π q λ ) LL( π q λ ) β j λ j j proof: LL(π q λ ) LL( π q λ ) = π [ λ j f j +lnz λ ] π [ λ j f j +lnz λ ] = λ j ( π[f j ] π[f j ]) β j λ j

19 Proof (cont.) let ˆπ = qˆλ then LL(π ˆπ) = LL(π qˆλ ) LL( π qˆλ )+ β j ˆλ j LL( π q λ )+ β j λ j LL(π q λ )+2 β j λ j = LL(π π )+2 β j λ j

20 Finding an Algorithm want to minimize L(λ) = 1 m i ln q λ(x i )+ j β j λ j no analytical solution instead, iteratively compute λ 1, λ 2,...so that L(λ t ) converges to minimum most algorithms for maxent update all weights λ j simultaneously less practical when very large number of features

21 Sequential-update Algorithm instead update just one weight at a time leads to sparser solution sometimes can search for best weight to update very efficiently analogous to boosting weak learner acts as oracle for choosing function (weak classifier) from large space

22 Idea for Algorithm say add δ to λ j new weight vector λ derive bound F such that L(λ ) L(λ) F (λ,j,δ) choose j, δ to minimize F repeat in our case, can choose F = δ π[f j ]+ln(1+(e δ 1)q λ [f j ]) + β j ( λ j + δ λ j ) possible values for minimizing δ can be found analytically in practice, combine with approximate line search to speed convergence

23 Convergence have that L(λ t+1 ) L(λ t ) min j,δ F (λ t,j,δ)=g(λ t ) enough to show: G(λ) 0 G continuous G(λ) =0 λ satisfies KKT conditions for optimization problem

24 Goals of Experiments does maxent work for this problem? what features work best? how many examples are enough? how to choose β j s? how interpretable are results? what is a biologist s qualitative assessment of results?

25 Breeding Bird Experiments used data from North American Breeding Bird Survey huge dataset with many species we chose 12 ranged from 78 to 2,773 occurence records used seven publicly available environmental variables basic approach: train with half sample points (randomly selected) test on rest repeat 10 times

26 Using Maxent features: linear, quadratic, product, threshold used β j = β σ[f j] m

27 GARP [Stockwell & Noble; Stockwell & Peters] compared to Genetic Algorithm for Ruleset Prediction (GARP) widely used by biologists just about only method that uses positive-only data outputs binary predictions commonly combined with ensemble-like technique ( best subsets ) integer-valued predictions in {0, 1,...,10}

28 How to Compare? maxent outputs probabilities treat each algorithm as providing ranking of localities from most to least habitable compare using ROC curve and area under curve (AUC) AUC = probability randomly chosen positive and negative examples are ranked correctly

29 ROC Curves true positive rate true positive rate Yellow throated Vireo Loggerhead Shrike false positive rate LQP T GARP L LQ false positive rate LQP T GARP L LQ

30 AUC Comparisons maxent LQP maxent T GARP GARP

31 Classification Accuracy maxent threshold chosen to predict same size habitat area as GARP then compute omission rate (fraction of test localities omitted) maxent LQP maxent T GARP GARP

32 How Many Examples Are Enough? (AUC) area under ROC curve (AUC) Gray Vireo Hutton s Vireo Plumbeous V. number of training examples (m) Blue h. Vireo Yellow th. V. Loggerh. Shrike LQP, β=0.1 T, β=1.0 LQ, β=0.1 L, β=0.1 GARP, all available training data

33 How Many Examples Are Enough? (log loss) log loss on test examples Gray Vireo Hutton s Vireo Plumbeous V. number of training examples (m) Blue h. Vireo Yellow th. V. Loggerh. Shrike LQP, β=0.1 LQ, β=0.1 T, β=1.0 L, β=0.1

34 Sensitivity to Regularization Parameter (AUC) Threshold features area under ROC curve (AUC) Hutton s Vireo.75 Blue h. Vireo.75 Yellow th. V Linear, quadratic and product features Hutton s Vireo Blue h. Vireo Yellow th. V. regularization value (β) m=10 m=32 m=100 m=316

35 Sensitivity to Regularization Parameter (log loss) log loss on test examples Threshold features Hutton s Vireo Blue h. Vireo Yellow th. V Linear, quadratic and product features regularization value (β) Hutton s Vireo Blue h. Vireo Yellow th. V m=10 m=32 m=100 m=316

36 Interpreting Maxent Models recall q λ (x) exp( j λ jf j (x)) for each environmental variable, can plot total contribution to exponent aspect elevation slope additive contribution to exponent precipitation avg. temperature no. wet days value of environmental variable threshold, β=1.0 linear+quadratic, β=0.1 threshold, β=0.01

37 Case Study: Microryzomys minutus

38 Case Study (cont.) accurately captures realized range did not predict other wet montane forest areas where could live, but doesn t examined predictions of maxent on six of these (chosen by biologist) found all had characteristics well outside typical range for actual occurence localities e.g., four sites had July precipitation 5 standard deviations above mean

39 Summary maxent provides clean and effective fit to habitat modeling problem works with positive-only examples seems to perform well with limited number of examples provides smooth gradation from most to least habitable theoretical guarantees indicate can be used even with a very large number of features easy to interpret by human expert

40 Some of the Challenges sample bias geographic uncertainty historical data collected with a changing landscape how to check accuracy of model how to produce hard (presence or absence) predictions convergence rates for iterative algorithms analogous performance bounds for conditional maxent

Maximum Entropy and Species Distribution Modeling

Maximum Entropy and Species Distribution Modeling Rob Schapire Steven Phillips Miro Dudík Also including work by or with: Rob Anderson, Jane Elith, Catherine Graham, Chris Raxworthy, NCEAS Working Group,...