Integrating ARCGIS with Datamining Software to Predict Habitat for Red Sea Urchins on the Coast of British Columbia
Wayne Hajas
Pacific Biological Station, Nanaimo, BC
Acknowledgements
  Allison Smeaton, GIS-student intern
  Dan Leus, Field Biologist
  Pacific Underwater Harvesters Association, the industry association for the red-urchin harvesters
Not a GIS Person Myself
  I may let some jargon slip out
  I might need more explanation on GIS matters
  I might have different expectations about how things should work
Statement of Problem
  Which parts of the British Columbia coast are habitat for red sea urchins?
Problem in General Terms
  Can geographic datasets be used as predictive tools?
  On an opportunistic basis?
  Can ARCGIS be extended through the arcpy/Python interface?
Outline
  Describe the problem
  Describe the data
  Mathematical method (very brief)
  Software
  GIS operations
  Results and why I think they are valid
  Conclusions
  Questions
The Problem (Red Sea Urchins)
Data
Required Data Structure for Datamining

                  Test Data (Predictive Variables)   Objective Variable
  Training Data    8   FALSE   blue                  habitat
                  10   FALSE   red                   habitat
                  11   FALSE   blue                  nonhabitat
                   6   FALSE   red                   habitat
  Prediction      14   FALSE   red                   ?
                  14   FALSE   blue                  ?
                   8   TRUE    red                   ?
                   6   FALSE   blue                  ?
                  14   FALSE   blue                  ?
                  12   FALSE   blue                  ?
                   7   FALSE   blue                  ?
                  11   TRUE    blue                  ?
                   6   FALSE   red                   ?
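The example rows on this slide can be written down as plain Python records. This is only a sketch of the layout: the slide does not name its columns, so the predictive variables are treated as anonymous positions in a tuple, and `None` is an assumed stand-in for the "?" entries.

```python
# Rows copied from the slide's example: three predictive variables
# followed by the objective variable. None stands in for "?" (unknown).
records = [
    (8, False, "blue", "habitat"),
    (10, False, "red", "habitat"),
    (11, False, "blue", "nonhabitat"),
    (6, False, "red", "habitat"),
    (14, False, "red", None),
    (14, False, "blue", None),
    (8, True, "red", None),
    (6, False, "blue", None),
    (14, False, "blue", None),
    (12, False, "blue", None),
    (7, False, "blue", None),
    (11, True, "blue", None),
    (6, False, "red", None),
]

# Training data: rows where the objective variable is known.
training = [r for r in records if r[3] is not None]
# Rows that need a prediction: objective variable unknown.
to_predict = [r for r in records if r[3] is None]

# Split each group into predictive variables (X) and objective variable (y).
X_train = [r[:3] for r in training]
y_train = [r[3] for r in training]
X_new = [r[:3] for r in to_predict]
```

Both groups share the same predictive variables; only the training rows carry a known objective value.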
Test (Predictive) Data: Shorezone
  "Its objective is to produce an integrated, searchable inventory of geomorphic and biological features which can be used as a tool for science, education, management, and environmental hazard mitigation."
  (Coastal and Ocean Resources Inc., Winter 2010/2011 issue of ArcNews)
Test (Predictive) Data: Shorezone
  GIS database
  BC coast is divided into ~90,000 polylines
  Vegetation, geology, wave exposure; ~50 fields
  Data collected from aerial surveys in the 1980s. Probably marine charts also
  Access controlled by BC government
  Parallel projects in Alaska and Washington State
Training Data
  Known habitat
    Harvest events
    Scientific surveys
    Expert opinion
  Known nonhabitat
    Expert opinion
  Points, lines and polygons
  Collected and managed independently of test data (Shorezone)
Mathematical Methods
Datamining (machine learning, artificial intelligence)
  "The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records, unusual records and dependencies." (Wikipedia)
Datamining
  Used in marketing, text processing, science, etc.
  Many methods, many implementations
  Black box
  Large datasets, computationally expensive
  Assumes that the training data represents the test data
How Datamining is Applied
  Training Data -> Datamining Software -> Model
  Test Data -> Model -> Prediction
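The flow on this slide (training data in, model out, model applied to test data) can be sketched in a few lines of Python. The classifier here is a deliberately trivial stand-in, not the method used in the talk: it predicts the majority label seen for each value of the first predictive variable. The variable values ("exposed", "rock", and so on) are invented for illustration.

```python
from collections import defaultdict


def train(X, y):
    """Build a toy model: the majority label for each value of X[i][0]."""
    counts = defaultdict(lambda: defaultdict(int))
    overall = defaultdict(int)
    for features, label in zip(X, y):
        counts[features[0]][label] += 1
        overall[label] += 1
    table = {}
    for key, tally in counts.items():
        table[key] = max(sorted(tally), key=tally.get)
    # Fall back to the most common label for values never seen in training.
    default = max(sorted(overall), key=overall.get)
    return {"table": table, "default": default}


def predict(model, X):
    """Apply the toy model to rows of predictive variables."""
    return [model["table"].get(f[0], model["default"]) for f in X]


# Hypothetical training data: predictive variables plus known objective.
X_train = [("exposed", "rock"), ("exposed", "rock"), ("exposed", "sand"),
           ("protected", "mud"), ("protected", "rock")]
y_train = ["habitat", "habitat", "habitat", "nonhabitat", "nonhabitat"]

# Hypothetical test data: predictive variables only.
X_test = [("protected", "sand"), ("exposed", "mud"), ("semi-exposed", "rock")]

model = train(X_train, y_train)       # Training Data -> Software -> Model
predictions = predict(model, X_test)  # Test Data -> Model -> Prediction
```

Any real datamining package follows the same two-step shape; only the contents of `train` and `predict` change.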
Software
Goals
  Scripts and not GUI
    Repeatability
    Record of what was done
    Creates need to integrate disparate software
  Work from managed datasets
    Don't duplicate effort
  Minimal number of intermediate datasets
    Less to manage
Software - Components
  Python 2.6
  Custom interface
  arcpy / ARCGIS 10.0
  Datamining software
arcpy
  ARCGIS becomes a callable software library
  Alternative to point-and-click
  Can also be used for automation
  arcpy.Union_analysis(["well_buff50", "stream_buff200"], "water_buffers")
Custom Interface to arcpy
  Found myself writing my own interface around arcpy (in Python)
  Examples
    Garbage collection
    Extracting data
    Creating and populating new fields
  Just to make things more pythonic
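As an illustration of what such a wrapper might look like, here is a minimal sketch of the "extracting data" and "garbage collection" examples. It is not the author's interface. It assumes ARCGIS 10.0-style cursors, whose rows expose `getValue(field_name)`; the helper name `extract_records` and the field names are hypothetical, and a small stand-in cursor is used so the sketch runs without arcpy.

```python
def extract_records(open_cursor, dataset, field_names):
    """Return one dict per row, keyed by field name.

    open_cursor is any callable that returns an iterable of rows with a
    getValue(field_name) method, e.g. arcpy.SearchCursor in ARCGIS 10.0.
    """
    cursor = open_cursor(dataset)
    records = []
    try:
        for row in cursor:
            records.append(dict((name, row.getValue(name))
                                for name in field_names))
    finally:
        # Drop our reference explicitly; arcpy cursors hold locks on the
        # dataset until their references are gone.
        del cursor
    return records


class FakeRow(object):
    """Stand-in for an arcpy row, used only to make the sketch runnable."""

    def __init__(self, values):
        self._values = values

    def getValue(self, name):
        return self._values[name]


def fake_search_cursor(dataset):
    """Stand-in for arcpy.SearchCursor: iterates over a list of dicts."""
    return iter([FakeRow(values) for values in dataset])


# Hypothetical field names and values.
fake_dataset = [{"SEGMENT_ID": 1, "EXPOSURE": "high"},
                {"SEGMENT_ID": 2, "EXPOSURE": "low"}]
records = extract_records(fake_search_cursor, fake_dataset,
                          ["SEGMENT_ID", "EXPOSURE"])
```

With arcpy installed, the same helper would presumably be called as `extract_records(arcpy.SearchCursor, "shorezone", [...])`, where the dataset name is again a placeholder.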
Python
  Computer language like C, Ruby, Java or BASIC
  Many applications outside of GIS
  Rapid and structured development
  Open source
  Two roles:
    1. Controller (which databases to use, etc.)
    2. Integration of arcpy and datamining software
Datamining Software
  Some (most?) is open source
  For integration: compatibility with Python 2.6 on Windows
  Installation can be an issue
Datamining Software
  Scripting (integrated-software approach): Scikit-learn
    Easy-to-use and general-purpose machine learning in Python
  Point and click: Weka
    Collection of machine learning algorithms for solving data mining problems
GIS Operations
Defining Training Data
  Which Shorezone segments are known to be habitat or nonhabitat?
  Known habitat/nonhabitat is not expressed as Shorezone segments
  Need a set of rules that work most of the time
  Can be automated
Defining Training Data (the rules)
  Must be within 150 m of shore
  Closest Shorezone segment
  Ties broken by random selection
  Each known habitat/nonhabitat instance contributes at most one record to training data
  (Spatial join with closest option)
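The rules above can be sketched in plain Python. This only illustrates the logic; the talk itself uses a spatial join with the closest option in ARCGIS. It assumes projected coordinates in metres, treats each known instance as a single point and each Shorezone segment as a straight two-point line, and the function names and segment IDs are hypothetical.

```python
import math
import random


def point_segment_distance(p, a, b):
    """Distance from point p to the straight segment a-b, all (x, y) in metres."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Project p onto the line through a and b, clamped to the segment.
        t = ((px - ax) * dx + (py - ay) * dy) / float(length_sq)
        t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy)


def assign_segment(point, segments, max_distance=150.0, rng=random):
    """Return the ID of the closest segment within max_distance, or None.

    Ties are broken by random selection, so each known instance
    contributes at most one training record.
    """
    best, best_ids = None, []
    for seg_id, (a, b) in segments.items():
        d = point_segment_distance(point, a, b)
        if d > max_distance:
            continue
        if best is None or d < best:
            best, best_ids = d, [seg_id]
        elif d == best:
            best_ids.append(seg_id)
    if not best_ids:
        return None
    return rng.choice(sorted(best_ids))


# Two hypothetical parallel segments, 200 m apart.
segments = {"A": ((0.0, 0.0), (100.0, 0.0)),
            "B": ((0.0, 200.0), (100.0, 200.0))}

near_a = assign_segment((50.0, 40.0), segments)     # 40 m from A, 160 m from B
too_far = assign_segment((500.0, 500.0), segments)  # beyond 150 m of both
```

A point exactly midway between the two segments is 100 m from each, so it is assigned to one of them at random.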
GIS Operations (defining training data)
  ~8 hours to assemble training data
  Did not delete the final set
GIS Operations (train and apply the model)
  SearchCursor and UpdateCursor to retrieve data and record results
  Predictions put into GIS database
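A minimal sketch of the "record results" step, assuming ARCGIS 10.0-style cursors in which rows expose `getValue`/`setValue` and the cursor exposes `updateRow`. The helper name `write_predictions` and the field names `SEGMENT_ID` and `HABITAT_PRED` are hypothetical, and stand-in classes replace arcpy so the sketch can run anywhere.

```python
def write_predictions(cursor, key_field, out_field, predictions):
    """Store predictions[key] in out_field for each matching row.

    cursor is an UpdateCursor-style object: iterable over rows, with an
    updateRow(row) method. Returns the number of rows updated.
    """
    updated = 0
    for row in cursor:
        key = row.getValue(key_field)
        if key in predictions:
            row.setValue(out_field, predictions[key])
            cursor.updateRow(row)
            updated += 1
    return updated


class FakeRow(object):
    """Stand-in for an arcpy row."""

    def __init__(self, values):
        self.values = dict(values)

    def getValue(self, name):
        return self.values[name]

    def setValue(self, name, value):
        self.values[name] = value


class FakeUpdateCursor(object):
    """Stand-in for arcpy.UpdateCursor that remembers which rows were saved."""

    def __init__(self, rows):
        self.rows = rows
        self.saved = []

    def __iter__(self):
        return iter(self.rows)

    def updateRow(self, row):
        self.saved.append(row)


rows = [FakeRow({"SEGMENT_ID": i, "HABITAT_PRED": None}) for i in (1, 2, 3)]
cursor = FakeUpdateCursor(rows)
predictions = {1: "habitat", 3: "nonhabitat"}  # keyed by segment ID

n_updated = write_predictions(cursor, "SEGMENT_ID", "HABITAT_PRED", predictions)
```

Rows without a prediction are left untouched, so the output field stays empty for segments the model was not applied to.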
Results
Presenting the Results
Checking the Results
  Training data is a useful benchmark
  Will the model work beyond the training data?
  Overparameterization?
  Want to impose independence between the training data and the validation process
Checking the Results
  Cross validation
    Use 9/10 of the training data at a time
    Make predictions for the other 1/10
    Compare each prediction to the actual value
    Repeat ten times
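The procedure above is ten-fold cross validation. A minimal sketch in plain Python follows; the majority-class "model" is only a placeholder so that the example is self-contained, and the function names and example labels are illustrative assumptions rather than anything taken from the talk.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Shuffle record indices and deal them into k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, train, predict, k=10, seed=0):
    """Return the fraction of held-out records predicted correctly."""
    folds = k_fold_indices(len(X), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        # Train on the other (k-1)/k of the data.
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        # Predict the held-out 1/k and compare with the actual values.
        predicted = predict(model, [X[i] for i in held_out])
        correct += sum(1 for i, p in zip(held_out, predicted) if p == y[i])
    return correct / float(len(X))


def train_majority(X, y):
    """Placeholder model: remember the most common label."""
    return max(sorted(set(y)), key=y.count)


def predict_majority(model, X):
    """Placeholder prediction: always return the remembered label."""
    return [model for _ in X]


# Made-up example: 20 records, 15 habitat and 5 nonhabitat.
X = list(range(20))
y = ["habitat"] * 15 + ["nonhabitat"] * 5

folds = k_fold_indices(len(X))
accuracy = cross_validate(X, y, train_majority, predict_majority)
```

Because every record is held out exactly once, each prediction is made by a model that never saw that record, which is the independence the previous slide asks for.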
Checking the Results (success)
Conclusions
  ARCGIS is extensible through the arcpy/Python interface
    What else would be useful?
  Large amounts of GIS data can be used as predictive tools
    Can be opportunistic!
    Other applications?
Conclusions (continued)
  The ARCGIS-Python interface could be further developed
    Might be some common need
    Open source?
Questions and Comments
Checking the Results (failure)