Species Distribution Modeling


Species Distribution Modeling
Julie Lapidus, Scripps College '11
Eli Moss, Brown University '11

Objective

To characterize the performance of both multiple-response and single-response machine learning algorithms on multiple datasets.

Single Response vs. Multiple Response Algorithms

1. Single-response models
   - Input: covariates
   - Output: prediction for a single species
2. Multiple-response models
   - Input: covariates
   - Output: simultaneous predictions for all species

Single vs. Multiple Response

- Single: the model only needs to predict one response at a time, so predictions are informed solely by patterns in the covariates.
- Multiple: the fit attempts to account for all responses simultaneously, so predictions for rare moths might be helpfully influenced by patterns in other species (but only if they covary).
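The distinction is easiest to see in terms of array shapes. A minimal sketch on toy data (shapes and values are illustrative stand-ins for the real site-by-species matrices, not the project's data):

```python
import numpy as np

# Toy stand-in for a site-by-species dataset.
rng = np.random.default_rng(0)
X = rng.random((256, 4))                # covariates: one row per trap site
Y = rng.integers(0, 2, size=(256, 10))  # presence/absence, one column per species

# Single-response: one independent model per species.
#     for j in range(Y.shape[1]):
#         model_j.fit(X, Y[:, j])
# Each model sees only how the covariates relate to its own species.

# Multiple-response: one model fit on the whole response matrix.
#     model.fit(X, Y)
# A single fit accounts for every species at once, so shared structure can
# help rare species, provided their occurrences covary with common ones.
```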

Single Response

- Elastic net logistic regression: fit the data by weighting each covariate within a linear equation. Control overfitting by penalizing the weights.
- Decision trees: make successive splits on covariates to arrive at an output. Control overfitting by reducing the size of the tree.
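A rough scikit-learn sketch of these two single-response models (toy data; the hyperparameter values are illustrative, not the ones used in the project):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))            # covariates for 200 sites
y = rng.integers(0, 2, size=200)    # presence/absence of one species

# Elastic net logistic regression: a linear model over the covariates whose
# weights are shrunk by a mixed L1/L2 penalty to control overfitting.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)

# Decision tree: successive splits on covariates; capping the tree's size
# (max depth, minimum leaf size) is the overfitting control.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X, y)
```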

Single Response (figure)

Multiple Response

- Multivariate decision trees: splits must predict all species and are chosen according to the best overall correlation. Control overfitting by reducing the size of the tree.
- Single-hidden-layer neural network: a non-linear statistical data modeling tool inspired by biological neural networks. Control overfitting by selecting the best number of training iterations.
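Both multiple-response models can be sketched the same way (a hedged approximation, not the project's exact implementation): scikit-learn trees accept a 2-D target, which yields a multivariate tree, and a one-hidden-layer MLP with early stopping stands in for selecting the best number of training iterations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))
Y = rng.integers(0, 2, size=(200, 5))   # all species at once

# Multivariate decision tree: with a 2-D target, each split is scored
# against every species column simultaneously.
multi_tree = DecisionTreeClassifier(max_depth=4).fit(X, Y)
print(multi_tree.predict(X).shape)      # (200, 5): one prediction per species

# Single-hidden-layer network; early stopping on a held-out validation
# fraction plays the role of choosing the number of training iterations.
net = MLPClassifier(hidden_layer_sizes=(20,), early_stopping=True,
                    max_iter=1000).fit(X, Y)
```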

Moth Data

- Collected by Jeff Miller, 1986-2008
- H.J. Andrews Experimental Forest
- 606 species
- 4 covariates
- 256 traps

Environmental Covariates

- Slope: percent grade, 1-105
- Aspect: degrees, 2-358
- Elevation: feet, 1437-5007
- Vegetation type: closed forest, meadow, cut 72-77, open forest, shrub/very open forest (represented as numbers when modeled)

Issues with Moth Dataset

- Not uniformly sampled (traps follow roads)
- More than half of the included species occurred fewer than 18 times over the entire 23-year trapping period
- 1/6 of the species account for over half of the recorded moth occurrences
- A Euclidean transformation of slope and aspect is necessary (see the sketch below)
- Different moth species are more prevalent each year
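The slides do not show the transformation itself; a standard way to make the circular aspect covariate usable (raw degrees are discontinuous, so 359° and 1° look maximally different to a model) is to decompose it into sine/cosine components, often scaled by slope. A sketch, assuming aspect in degrees and slope in percent grade:

```python
import numpy as np

def aspect_to_components(aspect_deg, slope_pct):
    """Decompose circular aspect into continuous north/east components.

    Sine/cosine remove the wraparound at 0/360 degrees; scaling by slope is
    a common variant, since aspect carries little information on flat ground.
    """
    theta = np.deg2rad(aspect_deg)
    return np.cos(theta) * slope_pct, np.sin(theta) * slope_pct

northness, eastness = aspect_to_components(np.array([2.0, 180.0, 358.0]),
                                           np.array([10.0, 50.0, 10.0]))
```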

Australian Plants Dataset

- Victoria, Australia
- 5000 plant species
- Arthur Rylah Institute (Melbourne, Australia)
- Subset of the 100 most abundant plant species
- 15,328 sites
- 81 covariates

Data Format

SiteID | Covariate N (e.g. temperature) | Species 1 | ... | Species n
A      | x                              | 0         | ... | 1
B      | y                              | 1         | ... | 1
C      | z                              | 1         | ... | 0
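For concreteness, the same layout as a small pandas frame (the covariate values here are hypothetical placeholders for the slide's x, y, z):

```python
import pandas as pd

# One row per site: covariate columns alongside 0/1 presence per species.
df = pd.DataFrame({
    "SiteID": ["A", "B", "C"],
    "temperature": [14.2, 9.8, 11.5],   # hypothetical "Covariate N" values
    "Species 1": [0, 1, 1],
    "Species n": [1, 1, 0],
})
X = df[["temperature"]].to_numpy()             # covariate matrix
Y = df[["Species 1", "Species n"]].to_numpy()  # response matrix
```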

Subsetting the Data

Cohen's Kappa

Statistical performance measurement used:

kappa = (Pr(a) - Pr(e)) / (1 - Pr(e))

- Pr(a) is the relative observed agreement among raters
- Pr(e) is the hypothetical probability of chance agreement

Value | Interpretation
1     | Complete agreement
0     | No agreement beyond chance
<0    | Worse than random guessing
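A minimal implementation of the statistic for binary presence/absence predictions (scikit-learn's cohen_kappa_score computes the same quantity):

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """kappa = (Pr(a) - Pr(e)) / (1 - Pr(e)) for binary 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pr_a = np.mean(y_true == y_pred)            # observed agreement
    p1, p2 = y_true.mean(), y_pred.mean()       # marginal "present" rates
    pr_e = p1 * p2 + (1 - p1) * (1 - p2)        # agreement expected by chance
    return (pr_a - pr_e) / (1 - pr_e)

print(cohens_kappa([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # ~0.615
```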

Kappa Values for All Species (sorted by decreasing abundance)

[Figure: kappa for every species under Single Response Logistic Regression, Single Response Decision Tree, Multiple Response Decision Tree, and Neural Network. X-axis: all moths sorted by decreasing abundance, repeated four times.]

Kappa Values for Australian Plants

[Figure: kappa for every species under Single Response Logistic Regression, Single Response Decision Tree, and Multiple Response Decision Tree. X-axis: all plants sorted by decreasing abundance, repeated three times.]

Improving Predictions

- A difficult prediction problem: few samples for many of the moths, few covariates, a coarse vegetation-type covariate, and a non-continuous aspect covariate
- Creative use of our tools can yield better predictions: the algorithms perform best on different ranges of abundance and covariate values

Kappa Values for All Species, with the Best Positive Kappa Alternatives Highlighted

[Figure: Single Response Decision Tree, Multiple Response Decision Tree, Single Response Logistic Regression, and Neural Network. X-axis: all moths sorted by decreasing abundance, repeated four times. Highlighted species: Perizoma.curvilinea.]

Kappa Values for All Species, with the Worst Negative Kappa Alternatives Highlighted

[Figure: Single Response Logistic Regression, Single Response Decision Tree, Multiple Response Decision Tree, and Neural Network. X-axis: all moths sorted by decreasing abundance, repeated four times. Highlighted species: Polia.nimbosa, Neoalcis.californiaria, Anavitrinella.pampinaria, Dysstroma.formosa, Protitame.matilda, Euxoa.infausta, Apamea.atriclava, Autographa.californica, Lacanobia.radix, Mesoleuca.ruficillata, Orthosia.mys.]

Most Accurate Model Selected for Each Species

[Figure: the best kappa achieved for each species. X-axis: all moths sorted by decreasing abundance, repeated four times. All but one of the selected values are 0 or above.]
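The selection itself is just an argmax over a species-by-algorithm kappa table. A sketch with hypothetical numbers (the real table has one row per moth species):

```python
import pandas as pd

# Hypothetical per-species kappa values: rows = species, columns = algorithms.
kappas = pd.DataFrame(
    {"logistic": [0.41, -0.02, 0.10],
     "single_tree": [0.35, 0.08, 0.02],
     "multi_tree": [0.30, 0.12, -0.05],
     "neural_net": [0.28, 0.05, 0.00]},
    index=["species_a", "species_b", "species_c"],
)
best = pd.concat([kappas.idxmax(axis=1), kappas.max(axis=1)],
                 axis=1, keys=["model", "kappa"])
print(best)  # the winning algorithm and its kappa for each species
```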

Kappa Values for All Species, with the Algorithm with the Best Overall Mean Kappa Highlighted

[Figure: Single Response Decision Tree, Multiple Response Decision Tree, Single Response Logistic Regression, and Neural Network. X-axis: all moths sorted by decreasing abundance, repeated four times.]

Summary of Results

- Tested four algorithms on two datasets
- Established a comparative baseline for predictive performance on the Australian plant dataset
- The moth dataset poses a considerable modeling problem: noisy occurrence patterns, with thinly sampled occurrences for 5/6 of the species
- Integration of models may improve prediction accuracy: algorithm selection based on abundance and covariate proportions for each species

Further Research

- Data preprocessing: group moths by habitat preferences; perform the Euclidean transformation on slope/aspect; use a continuous vegetation-type covariate
- Predict moth species occurrence across the HJA using an environment grid
- More advanced algorithms: our basic methods will serve as a baseline for comparison

Acknowledgements

We would like to thank:
- Dr. Dietterich, CS Professor
- Dr. Wong, CS Professor
- Steven Highland, Geosciences PhD candidate
- Julia Jones, Geosciences Professor
- Arwen Lettkeman, CS PhD candidate
- Paul Wilkins, CS PhD candidate
- Rebecca Hutchinson, CS Post-doc