MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS

MATTHEW J. CRACKNELL BSc (Hons)

ARC Centre of Excellence in Ore Deposits (CODES)
School of Physical Sciences (Earth Sciences)

Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

University of Tasmania
May, 2014
Did you ever fly a kite in bed? Did you ever walk with ten cats on your head? Did you ever milk this kind of cow? Well, we can do it. We know how. If you never did you should. These things are fun and fun is good.
Dr. Seuss
DECLARATION OF ORIGINALITY

This thesis contains no material which has been accepted for a degree or diploma by the University or any other institution, except by way of background information and duly acknowledged in the thesis, and to the best of my knowledge and belief no material previously published or written by another person except where due acknowledgement is made in the text of the thesis, nor does the thesis contain any material that infringes copyright.

AUTHORITY OF ACCESS

This non-published content of the thesis (see below) may be made available for loan and limited copying and communication in accordance with the Copyright Act 1968.

STATEMENT REGARDING PUBLISHED WORK CONTAINED IN THESIS

Chapter 4 of this thesis is published under a Creative Commons Attribution (CC BY) licence. You are free to copy, communicate and adapt the work, so long as you attribute the authors. To view a copy of this licence, visit http://creativecommons.org/licenses/.

The publishers of the papers comprising Chapters 5 to 6 hold the copyright for that content, and access to the material should be sought from the respective journals.

Matthew J. Cracknell
May 2014
STATEMENT OF CO-AUTHORSHIP

The following people and institutions contributed to the publication of work undertaken as part of this thesis:

Matthew James Cracknell, ARC Centre of Excellence in Ore Deposits (CODES), School of Earth Sciences, University of Tasmania = Candidate
Anya Marie Reading, ARC Centre of Excellence in Ore Deposits (CODES), School of Earth Sciences, University of Tasmania = Author 1
Andrew William McNeill, Mineral Resources Tasmania, Department of Infrastructure Energy & Resources (DIER) = Author 2

Author details and their roles:

Paper 1, Geological mapping using remote sensing data: A comparison of five machine learning algorithms, their response to variations in the spatial distribution of training data and the use of explicit spatial information: Located in Chapter 4
The candidate was the primary author, with Author 1 contributing to its development, refinement and presentation.

Paper 2, The upside of uncertainty: Identification of lithology contact zones from airborne geophysics and satellite data using Random Forests and Support Vector Machines: Located in Chapter 5
The candidate was the primary author, with Author 1 contributing to its development, refinement and presentation.

Paper 3, Mapping geology and volcanic-hosted massive sulfide alteration in the Hellyer–Mt Charter region, Tasmania, using Random Forests and Self-Organising Maps: Located in Chapter 6
The candidate was the primary author, with Author 1 contributing to its refinement and presentation and Author 2 contributing to its formalisation and development.

We the undersigned agree with the above stated proportion of work undertaken for each of the above published (or submitted) peer-reviewed manuscripts contributing to this thesis:

Signed:
Anya M. Reading, Supervisor, School of Earth Sciences, University of Tasmania
Jocelyn McPhie, Head of School, School of Earth Sciences, University of Tasmania
Date:
ABSTRACT

Machine learning algorithms are designed to identify efficiently and to predict accurately patterns within multivariate data. They provide analysts with computational tools to aid predictive modelling and the interpretation of interactions between data and the phenomena under investigation. The analysis of large volumes of disparate multivariate geospatial data using machine learning algorithms therefore offers great promise to industry and research in the geosciences.

Geoscience data are frequently characterised by a restriction in the number and distribution of direct observations, irreducible noise in these data and a high degree of intraclass variability and interclass similarity. The choice of machine learning algorithm or algorithms, and the details of how they are applied, must therefore be appropriate to the context of geoscience data.

With this knowledge, I aim to employ machine learning as a means of understanding the spatial distribution of complex geological phenomena. I conduct a rigorous and comprehensive comparison of machine learning algorithms, representing the five general machine learning strategies, for supervised lithology classification applications. I also develop and test a novel method for obtaining robust estimates of the uncertainty associated with machine learning algorithm categorical predictions. The insights gained from these experiments lead to the further development and comparison of new methods for the incorporation of spatial-contextual information into machine learning supervised classifiers.

In using machine learning algorithms for geoscience applications, I have developed best-practice methodologies that address the challenges facing geoscientists for geospatial supervised classification. Guidelines are established that detail the preparation and integration of disparate spatial data, the optimisation of trained classifiers for a given application and the robust statistical and spatial evaluation of outputs.
I demonstrate, through a case study in a region that is prospective for economic mineralisation, the combination of supervised and unsupervised machine learning algorithms for the critical appraisal of pre-existing geological maps and formulation of meaningful interpretations of geological phenomena.
The experiments conducted as part of my research confirm the efficacy of machine learning algorithms in generating accurate geological maps representing a variety of terranes. I identify and explore key aspects of the spatial and statistical distributions of geoscience data that affect machine learning algorithm performance. My research clearly identifies Random Forests as a good first-choice algorithm for the prediction of classes representing lithologies using commonly available multivariate geological and geophysical data. Furthermore, Random Forests prediction uncertainty is shown to be closely related to ambiguous and/or erroneous classifications and thus provides a practical means of indicating variable levels of confidence. Spatial-contextual information is best incorporated into machine learning supervised classifiers via the pre-processing of input variables and/or the post-regularisation of classifications. My findings indicate that a trade-off exists between optimal predictive models and interpretable explanatory models, whereby intuitively interpretable models are not necessarily the most accurate.

The practical application of machine learning algorithms requires the implementation of three key stages: (1) data pre-processing; (2) algorithm training; and (3) prediction evaluation. This methodology provides the foundation for generating accurate and geologically meaningful predictions with minimal user intervention and assists in the formulation of robust interpretations of complex geological phenomena. For example, classifications obtained by Random Forests are useful for critically appraising interpreted geological maps. Clusters produced by Self-Organising Maps indicate the presence of discrete, spatially contiguous and geologically significant sub-classes within individual lithological units, which represent regions of contrasting primary composition and alteration styles.
My results may be widely applied to a broad range of practical geoscience challenges such as ore deposit targeting, geo-hazard risk assessment, engineering and construction projects, hydrological and environmental modelling and ecological studies. The applications of machine learning algorithms detailed in this thesis align well with state-of-the-art Big Data online infrastructure and virtual laboratories currently emerging in Australia.
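As an illustration of the abstract's point that Random Forests prediction uncertainty can indicate variable levels of confidence, the following minimal sketch scores a single prediction by the normalised Shannon entropy of the ensemble's class vote proportions. It is a sketch under stated assumptions rather than the method used in this thesis: the contents list names variance and entropy as multiclass uncertainty measures (Appendix B), but their definitions are not given in this front matter, so the normalisation by the logarithm of the number of classes, the function names and the lithology labels are all hypothetical. The thesis work refers to R; Python with the standard library only is used here for brevity.

```python
import math


def vote_proportions(votes, classes):
    """Return the fraction of ensemble votes received by each class."""
    total = len(votes)
    return [votes.count(c) / total for c in classes]


def normalised_entropy(probs):
    """Shannon entropy of class proportions, scaled to the range [0, 1].

    0 means every vote agrees (confident); 1 means the votes are spread
    evenly over all classes (maximally ambiguous).
    """
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)


# Hypothetical lithology classes and per-tree votes for two map pixels.
classes = ["basalt", "granite", "sandstone"]
confident = vote_proportions(["basalt"] * 9 + ["granite"], classes)
ambiguous = vote_proportions(
    ["basalt"] * 4 + ["granite"] * 3 + ["sandstone"] * 3, classes
)

print("confident pixel:", round(normalised_entropy(confident), 3))
print("ambiguous pixel:", round(normalised_entropy(ambiguous), 3))
```

Mapped over every pixel, a score of this kind would be expected to rise where the votes are split between classes, which is one plausible way to flag ambiguous predictions for closer inspection.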
CONTENTS

DECLARATION OF ORIGINALITY
AUTHORITY OF ACCESS
STATEMENT REGARDING PUBLISHED WORK CONTAINED IN THESIS
STATEMENT OF CO-AUTHORSHIP
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGEMENTS

CHAPTER 1 INTRODUCTION
  1.1. Machine learning
  1.2. Geological maps
  1.3. Research scope and hypothesis
    1.3.1. Major research questions to be addressed
  1.4. Thesis structure

CHAPTER 2 MACHINE LEARNING THEORY AND IMPLEMENTATION
  2.1. Machine learning
    2.1.1. Supervised versus unsupervised learning
  2.2. Supervised classification
    2.2.1. Classification strategies
      2.2.1.1. Statistical learning algorithms
      2.2.1.2. Instance-based learners
      2.2.1.3. Logic-based learners
      2.2.1.4. Support Vector Machines
      2.2.1.5. Perceptrons
    2.2.2. Supervised classifier implementation
      2.2.2.1. Data pre-processing
      2.2.2.2. Classifier training
      2.2.2.3. Prediction evaluation
  2.3. Unsupervised clustering
    2.3.1. Clustering strategies
      2.3.1.1. Partitioning algorithms
      2.3.1.2. Hierarchical algorithms
      2.3.1.3. Self-Organising Maps
    2.3.2. Unsupervised clustering implementation
  2.4. Conclusions

CHAPTER 3 A REVIEW OF MACHINE LEARNING FOR GEOSCIENCE CLASSIFICATION APPLICATIONS
  3.1. Machine learning non-geoscience applications
  3.2. Machine learning geoscience applications
    3.2.1. Classification of 0D data
    3.2.1. Classification of 1D data
      3.2.1.1. One temporal dimension
      3.2.1.2. One spatial dimension
    3.2.1. Classification of 2D data
      3.2.1.3. Land cover/vegetation mapping
      3.2.1.4. Geological mapping
        Supervised classification
        Unsupervised clustering
        Combined supervised and unsupervised methods
  3.3. Practical machine learning implementation
    3.3.1. Data
    3.3.2. Data pre-processing
    3.3.3. Prediction evaluation
    3.3.4. Integrated workflow
  3.4. Conclusions

CHAPTER 4 GEOLOGICAL MAPPING USING REMOTE SENSING DATA: A COMPARISON OF FIVE MACHINE LEARNING ALGORITHMS, THEIR RESPONSE TO VARIATIONS IN THE SPATIAL DISTRIBUTION OF TRAINING DATA AND THE USE OF EXPLICIT SPATIAL INFORMATION
  4.0. Abstract
  4.1. Introduction
    4.1.1. Machine learning for supervised classification
    4.1.2. Machine learning algorithm theory
      4.1.2.1. Naïve Bayes
      4.1.2.2. k-nearest Neighbours
      4.1.2.3. Random Forests
      4.1.2.4. Support Vector Machines
      4.1.2.5. Artificial Neural Networks
    4.1.3. Geology and tectonic setting
  4.2. Data
  4.3. Methods
    4.3.1. Pre-processing
    4.3.2. Classification model training
    4.3.3. Prediction evaluation
  4.4. Results
  4.5. Discussion
    4.5.1. Machine learning algorithms compared
    4.5.2. Influence of training data spatial distribution
    4.5.3. Using spatially constrained data
  4.6. Conclusions
  4.7. Acknowledgements
  4.8. Description of supplementary information

CHAPTER 5 THE UPSIDE OF UNCERTAINTY: IDENTIFICATION OF LITHOLOGY CONTACT ZONES FROM AIRBORNE GEOPHYSICS AND SATELLITE DATA USING RANDOM FORESTS AND SUPPORT VECTOR MACHINES
  5.0. Abstract
  5.1. Introduction
    5.1.1. The lithology prediction problem
    5.1.2. Random Forests
    5.1.3. Support Vector Machines
  5.2. Data
    5.2.1. Tectonic setting and history
    5.2.2. Data sources
    5.2.3. Data pre-processing
  5.3. Methods
    5.3.1. Training and evaluating algorithms
    5.3.2. Variance
  5.4. Results
  5.5. Discussion
  5.6. Conclusions
  5.7. Acknowledgements

CHAPTER 6 MAPPING GEOLOGY AND VOLCANIC-HOSTED MASSIVE SULFIDE ALTERATION IN THE HELLYER–MT CHARTER REGION, TASMANIA, USING RANDOM FORESTS AND SELF-ORGANISING MAPS
  6.0. Abstract
  6.1. Introduction
    6.1.1. Geological setting
    6.1.2. Random Forests
    6.1.3. Self-Organising Maps
  6.2. Data and Methods
    6.2.1. Source data
    6.2.2. Data sampling
    6.2.3. Training Random Forests and variable selection
    6.2.4. Implementing Self-Organising Maps
  6.3. Results
    6.3.1. Geological classification using Random Forests
    6.3.2. Discrimination of geological sub-classes using Self-Organising Maps
  6.4. Discussion
  6.5. Conclusions
  6.6. Acknowledgements

CHAPTER 7 SPATIAL-CONTEXTUAL MACHINE LEARNING SUPERVISED CLASSIFIERS: LITHOSTRATIGRAPHY CLASSIFICATION EXAMPLE
  7.0. Abstract
  7.1. Introduction
    7.1.1. Pre-processing methods
      7.1.1.1. Focal operators
      7.1.1.2. Image segmentation
    7.1.2. Training data selection
    7.1.3. Post-processing methods
    7.1.4. Combination methods
    7.1.5. Study aims
  7.2. Data
    7.2.1. Lithostratigraphy classification target
    7.2.2. Geophysical data input variables
      7.2.2.1. Pre-processing
  7.3. Methods
    7.3.1. Data sampling
    7.3.2. Global pixel-based classifiers
    7.3.3. Spatial-contextual classifiers
      7.3.3.1. Pre-processing
      7.3.3.2. Algorithm training
      7.3.3.3. Post-processing
    7.3.4. Prediction evaluation
  7.4. Results
  7.5. Discussion
    7.5.1. Spatial-contextual classifiers compared
    7.5.2. Issues of spatial scale
    7.5.3. Geological interpretations
  7.6. Conclusions

CHAPTER 8 SYNTHESIS AND DISCUSSION
  8.1. Algorithms
    8.1.1. Supervised classification
      8.1.1.1. Implementation
      8.1.1.2. Decision structures
      8.1.1.3. Accuracy comparison
      8.1.1.4. Spatial-contextual classifiers
      8.1.1.5. Prediction uncertainty
    8.1.2. Unsupervised clustering
  8.2. Applications
    8.2.1. Data pre-processing
      8.2.1.1. Data preparation
      8.2.1.2. Variable extraction
      8.2.1.3. Variable selection
    8.2.2. Classifier training
      8.2.2.1. Training and test data
      8.2.2.2. Classifier induction
      8.2.2.3. Classification post-processing
    8.2.3. Evaluation and interpretation
      8.2.3.1. Statistical evaluation
      8.2.3.2. Interrogating decision structures
      8.2.3.3. Complementary interpretation
  8.3. Extended research implications
    8.3.1. Integrated workflow using R
    8.3.2. Wider geoscience applications
    8.3.3. Big Data

CHAPTER 9 CONCLUSIONS

REFERENCES

APPENDIX A MACHINE LEARNING ALGORITHM SENSITIVITY TO IMBALANCED CLASS DISTRIBUTIONS
  A.1. Introduction
  A.2. Methods
  A.3. Results
  A.4. Discussion and Conclusions

APPENDIX B VARIANCE AND ENTROPY FOR MULTICLASS CLASSIFICATION UNCERTAINTY

APPENDIX C SUPPLEMENTARY INFORMATION
  C.1. Data
  C.2. MLA software and parameters

APPENDIX D R PACKAGES

APPENDIX E DATA SOURCES AND PRE-PROCESSING

APPENDIX F R CODE AND SCRIPTS
  README.txt