Mining Climate Data
Michael Steinbach, Vipin Kumar
University of Minnesota / AHPCRC

Collaborators:
- G. Karypis, S. Shekhar (University of Minnesota / AHPCRC)
- V. Chadola, S. Iyer, G. Simon, P. Zhang (UM / AHPCRC)
- P. N. Tan (Michigan State University)
- C. Potter (NASA Ames Research Center), S. Klooster (California State University, Monterey Bay)

NASA funded project: Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining.
Additional support from the Army High Performance Computing Research Center.
Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.

M. Steinbach Mining Climate Data

Overview
- Background
- Data Mining Tasks
- Detection of Disturbances and Associations
- Discovery of Climate Indices
- Distributed Issues
- Conclusion

Research Goal
- Research goal: find global climate patterns of interest to Earth scientists.
- A key interest is finding connections between the ocean/atmosphere and the land.
- The data are global snapshots of values for a number of variables on land surfaces or water.
- The snapshots span a range of 10 to 50 years.
[Figure: gridded data. Stacks of global grids over time for NPP, pressure, precipitation, and SST; axes: latitude, longitude, time; labels: grid cell, zone. Example map: average monthly temperature.]

The El Nino Climate Phenomenon
- El Nino is the anomalous warming of the eastern tropical region of the Pacific.
- Normal year: trade winds push warm ocean water west, and cool water rises in its place.
- El Nino year: trade winds ease or switch direction, and the warmest water moves east.
http://www.usatoday.com/weather/tg/wetnino/wetnino.htm

Detection of Ecosystem Disturbances
- Ecosystem disturbances can be detected as sudden changes in greenness in satellite data.
- FPAR: the fraction of photosynthetically active radiation absorbed by the green part of vegetation.

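The idea of flagging a sudden change in greenness can be sketched as follows. This is a minimal illustration, not the detection procedure used in the project: the windowed before/after comparison, the function name `find_sudden_drops`, the 12-month window, the 0.2 threshold, and the sample FPAR values are all assumptions made for the example.

```python
from statistics import mean


def find_sudden_drops(series, window=12, min_drop=0.2):
    """Return indices t where the mean of the `window` values starting at t
    is lower than the mean of the `window` values just before t by at
    least `min_drop` (same units as the series)."""
    hits = []
    for t in range(window, len(series) - window + 1):
        before = mean(series[t - window:t])
        after = mean(series[t:t + window])
        if before - after >= min_drop:
            hits.append(t)
    return hits


# Hypothetical monthly FPAR values for one grid cell:
# two stable years followed by a sustained drop in greenness.
fpar = [0.6] * 24 + [0.3] * 24
print(find_sudden_drops(fpar))
```

Comparing window means rather than single months is one way to avoid flagging ordinary month-to-month noise; a real analysis would also need to account for the seasonal cycle.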
Detection of Ecosystem Disturbances
Major ecosystem disturbances detected in North America.
[Figure: NASA image of patterns in the 18-year record (1982-1999) of global satellite observations of vegetation greenness from the Advanced Very High Resolution Radiometer (AVHRR). Different colored areas identify the major ecosystem disturbance events detected and the year they occurred. The majority of potential disturbance events pictured occurred in boreal forest ecosystems of Canada or shrublands and rangelands of the southern United States.]
NASA release 03-51AR: NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS. NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years.
http://amesnews.arc.nasa.gov/releases/2003/03_51ar.html
[Figure: smoke over Borneo, Indonesia.]

Mining Associations in Earth Science Data: Challenges

Transaction Id   Items
1                Bread, Milk
2                Beer, Diaper, Bread, Eggs
3                Beer, Coke, Diaper, Milk
4                Beer, Bread, Diaper, Milk
5                Coke, Bread, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

- How to transform Earth Science data into transactions?
  - What are the baskets?
  - What are the items?
- How to define support?

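Support and confidence for the two market-basket rules on this slide, read as {Milk} --> {Coke} and {Diaper, Milk} --> {Beer}, can be checked with a few lines of code. The sketch below uses the standard definitions (support is the fraction of baskets containing every item in the rule; confidence is that count divided by the number of baskets containing the antecedent); the function names are chosen for the example.

```python
# The five transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def rule_stats(antecedent, consequent, baskets):
    """Return (support, confidence) for the rule antecedent --> consequent."""
    both = support_count(antecedent | consequent, baskets)
    ante = support_count(antecedent, baskets)
    support = both / len(baskets)
    confidence = both / ante if ante else 0.0
    return support, confidence


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    sup, conf = rule_stats(lhs, rhs, transactions)
    print(sorted(lhs), "-->", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

For Earth Science data the open question on the slide remains: the baskets and items must be defined before counts like these mean anything.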
Mining Association Patterns in Earth Science Data: Challenges

(Lat, Long, time)   Events
(10N, 10E, 1)       {Temp-Hi, Prec-Lo}
(10N, 10E, 2)       {Temp-Hi, Prec-Lo, NPP-Lo}
(10N, 11E, 2)       {Temp-Hi, NPP-Lo}
(10N, 11E, 5)       {Solar-Hi, NPP-Lo}
(10N, 11E, 10)      {Prec-Hi, PET-Lo}

Example rules:
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2. FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%)
3. FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%)

- How to efficiently discover spatio-temporal associations?
  - Use existing algorithms.
  - Develop new algorithms.
- How to identify interesting patterns?
  - Use objective interest measures.
  - Use domain knowledge.

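One possible way to turn gridded observations into the (Lat, Long, time) transactions shown above is to label each variable as Hi or Lo when it crosses a threshold. The sketch below is only an illustration of that idea: the function name `to_transactions`, the fixed per-variable thresholds, and all data values are hypothetical, and the slides do not say how the Hi/Lo events were actually defined.

```python
def to_transactions(observations, thresholds):
    """Map {(lat, lon, t): {variable: value}} to {(lat, lon, t): set of events}.

    A value at or above the variable's high threshold yields "<var>-Hi",
    a value at or below the low threshold yields "<var>-Lo", and values
    in between produce no event. Cells with no events are omitted.
    """
    transactions = {}
    for key, values in observations.items():
        events = set()
        for var, value in values.items():
            low, high = thresholds[var]
            if value >= high:
                events.add(f"{var}-Hi")
            elif value <= low:
                events.add(f"{var}-Lo")
        if events:
            transactions[key] = events
    return transactions


# Hypothetical thresholds (low, high) and observations, for illustration only.
thresholds = {"Temp": (10.0, 30.0), "Prec": (20.0, 200.0), "NPP": (0.2, 0.8)}
observations = {
    (10, 10, 1): {"Temp": 31.0, "Prec": 5.0},
    (10, 10, 2): {"Temp": 32.0, "Prec": 8.0, "NPP": 0.1},
    (10, 11, 3): {"Temp": 20.0, "Prec": 100.0},
}
transactions = to_transactions(observations, thresholds)
for key in sorted(transactions):
    print(key, sorted(transactions[key]))
```

With this encoding each (location, time) pair is a basket and each Hi/Lo event is an item, so an off-the-shelf association rule algorithm can be run on the result.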
Example of Interesting Association Patterns
FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%)
- The rule has high support in shrubland areas.
[Figure: map highlighting shrubland areas.]

Climate Indices: Connecting the Ocean/Atmosphere and the Land
- A climate index is a time series of temperature or pressure.
  - Similar to business or economic indices.
  - Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP).
- Climate indices are important because:
  - They distill climate variability at a regional or global scale into a single time series.
  - They are well accepted by Earth scientists.
  - They are related to well-known climate phenomena such as El Niño.
[Figure: Dow Jones Index (from Yahoo).]

A Temperature Based Climate Index: NINO1+2
[Figure: time series of the Nino 1+2 index, with El Nino events marked.]
[Figure: map of the correlation between Nino 1+2 (ANOM 1+2) and land temperature (>0.2); axes: longitude (-180 to 180) and latitude (-90 to 90).]

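A map like the one described on this slide can be produced by correlating the index time series with the temperature time series of every land grid cell and keeping only the cells above a cutoff. The sketch below uses the Pearson correlation and applies the 0.2 cutoff to the absolute value; both choices, the function names, and the tiny data set are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_cells(index, land_series, min_abs_corr=0.2):
    """Return {cell: r} for land cells whose |correlation| with the index exceeds the cutoff."""
    result = {}
    for cell, series in land_series.items():
        r = pearson(index, series)
        if abs(r) > min_abs_corr:
            result[cell] = r
    return result


# Hypothetical index and land temperature series keyed by (lat, lon).
index = [1.0, 2.0, 3.0, 4.0, 5.0]
land = {
    (0, -80): [2.0, 4.0, 6.0, 8.0, 10.0],   # moves with the index
    (10, 20): [5.0, 4.0, 3.0, 2.0, 1.0],    # moves against the index
    (40, 100): [1.0, -1.0, 0.0, -1.0, 1.0], # unrelated to the index
}
print(correlated_cells(index, land))
```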
SST Clusters
[Figure: world map of 107 SST clusters; axes: longitude (-180 to 180) and latitude (-90 to 90).]

SST Clusters That Reproduce El Nino Indices

El Nino regions defined by Earth scientists (matching cluster in parentheses):

Niño Region   Longitude Range   Latitude Range
1+2 (94)      90 W-80 W         10 S-0
3 (67)        150 W-90 W        5 S-5 N
3.4 (78)      170 W-120 W       5 S-5 N
4 (75)        160 E-150 W       5 S-5 N

Cluster   Nino Index   Correlation
94        NINO 1+2     0.9225
67        NINO 3       0.9462
78        NINO 3.4     0.9196
75        NINO 4       0.9165

[Figure: map showing clusters 75, 78, 67, and 94 in the tropical Pacific; axes: longitude (-180 to 180) and latitude (-90 to 90).]

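To compare a cluster with the regions in the table, it helps to test whether a grid cell lies inside a region's latitude/longitude box. The sketch below encodes the four boxes from the table with longitudes in degrees east (west is negative), which is a convention chosen for the example; the function names are hypothetical. Niño 4 runs from 160 E to 150 W and so crosses the date line, which needs a special case.

```python
# (west, east, south, north) in degrees; west longitudes and south latitudes are negative.
NINO_REGIONS = {
    "1+2": (-90.0, -80.0, -10.0, 0.0),
    "3": (-150.0, -90.0, -5.0, 5.0),
    "3.4": (-170.0, -120.0, -5.0, 5.0),
    "4": (160.0, -150.0, -5.0, 5.0),  # crosses the date line
}


def in_region(lat, lon, box):
    """True if (lat, lon) lies inside the box, handling boxes that wrap past 180 degrees."""
    west, east, south, north = box
    if not (south <= lat <= north):
        return False
    if west <= east:
        return west <= lon <= east
    # The box wraps around the date line: accept either side of it.
    return lon >= west or lon <= east


def regions_containing(lat, lon):
    """Names of all Niño regions that contain the given point."""
    return [name for name, box in NINO_REGIONS.items() if in_region(lat, lon, box)]


print(regions_containing(0.0, -140.0))  # point in the overlap of Niño 3 and Niño 3.4
print(regions_containing(0.0, 175.0))   # point west of the date line
```

Note that the regions overlap (Niño 3.4 straddles Niño 3 and Niño 4), so a single cell can belong to more than one of them.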
An SST Cluster Moderately Correlated to Known Indices
Cluster 29 vs. known El Nino climate indices: Nino 1+2, Nino 3, Nino 3.4, Nino 4, and SOI
[Figure: map for cluster 29 vs. SOI, ANOM 12, ANOM3, ANOM34, ANOM4 (mincorr = 0.2); color scale: Corr Diff, from -0.6 to 0.6; axes: longitude (-180 to 180) and latitude (-90 to 90).]

Need for High Performance Computing
- SNN clustering analysis requires O(n^2) comparisons.
- Association rule algorithms can also be very compute intensive.
  - Their cost is potentially much greater than O(n^2).
- The memory required by clustering and association rule algorithms can exceed the 4 GB of traditional sequential servers.
- Computing the pairwise correlation between every land pixel and every ocean pixel is very time consuming on a sequential computer.

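A back-of-the-envelope calculation shows how quickly O(n^2) storage hits the 4 GB limit. The sketch below assumes one 4-byte value per unordered pair of points (the upper triangle of a similarity matrix); that storage layout and the function names are assumptions for illustration, not a description of the actual implementation.

```python
from math import isqrt


def pair_count(n):
    """Number of unordered pairs among n points: n(n-1)/2."""
    return n * (n - 1) // 2


def similarity_matrix_bytes(n, bytes_per_value=4):
    """Bytes needed to store one value per unordered pair."""
    return pair_count(n) * bytes_per_value


def max_points_for_budget(budget_bytes, bytes_per_value=4):
    """Largest n whose pairwise values fit in the given memory budget."""
    max_pairs = budget_bytes // bytes_per_value
    # Solve n(n-1)/2 <= max_pairs for the largest integer n.
    n = (1 + isqrt(1 + 8 * max_pairs)) // 2
    while pair_count(n) > max_pairs:
        n -= 1
    return n


four_gb = 4 * 2**30
print("points that fit in 4 GB:", max_points_for_budget(four_gb))
print("GB needed for 100,000 points:", similarity_matrix_bytes(100_000) / 2**30)
```

Under these assumptions only a few tens of thousands of points fit in 4 GB, far fewer than the 10,848,672 points of the 8 km grid mentioned in this talk.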
High Resolution EOS Data
- EOS satellites provide high resolution measurements.
  - Finer spatial grids:
    - An 8 km x 8 km grid produces 10,848,672 data points.
    - A 1 km x 1 km grid produces 694,315,008 data points.
  - More frequent measurements.
  - Multiple instruments.
  - Generates terabytes of data per day.
- High resolution data allows us to answer more detailed questions:
  - Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties.
  - Finding relationships between leaf area index (LAI) and the topography of a river drainage basin.
  - Finding relationships between fire frequency and elevation as well as topographic position.
[Figure: Earth Observing System (e.g., Terra and Aqua satellites).]
http://www.crh.noaa.gov/lmk/soo/docu/basicwx.htm

Distributed System For Analyzing Earth Science Data
[Diagram: system architecture.
 Data sources: distributed data sources (Climate Data, Model Data, DAAC, ESIP, ...) and Local Data.
 Modules: Data Acquisition, Fusion and Transformation; Data Exploration and Pre-Processing; Data Mining; Output and Presentation (ESIP).
 Tools shown: Subsetting, Event Detection, Clustering, Statistical Analysis, Visualization, Trajectory Analysis, Association Analysis, Deviation Detection, Classification and Regression.]

Data Acquisition, Fusion, and Transformation
[Diagram: the system architecture diagram, repeated.]
This module will provide the ability to acquire the data necessary for the analysis and, since the data comes from heterogeneous sources, the ability to fuse and transform it.

Data Acquisition, Transformation, and Fusion Challenges
- Locate and download data when it becomes available.
  - Web services such as directory services and peer-to-peer networking capabilities for file sharing.
- Data fusion
  - Conversion between different formats: HDF, HDF-EOS, netCDF, binary, ASCII, Earth Science Markup Language (ESML), Geography Markup Language (GML).
- Data transformation
  - Scaling, radiometric conversion, sampling in time, aggregation, and mathematical or geometrical map transformations to convert the incoming data to the same coordinate system.

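Of the transformations listed, spatial aggregation is the easiest to illustrate: a fine grid is reduced to a coarser one by averaging blocks of cells, for example 8 x 8 blocks to go from a 1 km grid to an 8 km grid. The sketch below is a minimal version under simplifying assumptions (a rectangular grid stored as a list of rows, dimensions divisible by the factor, no missing values, no map projection handling); the function name is hypothetical.

```python
def block_average(grid, factor):
    """Aggregate a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by the factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [
                grid[i][j]
                for i in range(r, r + factor)
                for j in range(c, c + factor)
            ]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Hypothetical 4 x 4 fine grid reduced to 2 x 2.
fine = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
print(block_average(fine, 2))
```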
Data Exploration and Pre-processing
[Diagram: the system architecture diagram, repeated.]
This module consists of several visualization, statistical, and time series preprocessing tools for supporting the exploratory analysis of large-scale Earth Science datasets. Such tools can be used to aid scientists in gaining an initial insight into the distribution, regularity, and quality of the input data.

Data Mining
[Diagram: the system architecture diagram, repeated.]
Data mining technology offers a suite of advanced decision support tools to facilitate the automatic generation of scientific hypotheses from data.

Data Mining and Exploration Challenges
- Complex data distribution
  - Data may be split according to time periods, region, attribute, etc.
  - Example: to obtain different types of data for a given point on the Earth, it is often necessary to pull data from many sources.
- Distributed computation
  - For resource or other reasons, data mining tasks may be better executed by distributing the computation across resources in multiple organizations.
  - Example: finding specific events of interest; summarizing data.
- Time vs. accuracy tradeoff
  - The user should be able to make choices.
  - Example: time series similarity using correlation vs. time series similarity computed using dynamic time warping.

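The correlation vs. dynamic time warping (DTW) example can be made concrete. Correlation takes time linear in the series length but compares values only at matching time steps; the textbook DTW recurrence takes time proportional to the product of the two lengths but tolerates shifts in time. The sketch below implements both in their basic forms (absolute difference as the local cost, no warping window), which are choices made for the example; the two short series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences: O(n) time."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference cost: O(len(a) * len(b)) time."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same pulse, shifted by one time step.
early = [0, 1, 2, 1, 0, 0, 0]
late = [0, 0, 1, 2, 1, 0, 0]
print("correlation:", round(pearson(early, late), 3))
print("DTW distance:", dtw_distance(early, late))
```

Here DTW treats the two series as the same shape while correlation penalizes the one-step shift, which is the kind of accuracy gain a user might be willing to pay the extra computation for.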
Output and Presentation
[Diagram: the system architecture diagram, repeated.]
Output and presentation tools will convert results to common representations (e.g., MIME, ESML/GML/XML, JPEG, binary). Results can be posted to the Web and discussed in a collaborative fashion or easily incorporated into more traditional publications.

Output and Presentation Challenges
- Allow others to locate and download results when they become available.
  - Web services such as directory services and peer-to-peer networking capabilities for file sharing.
  - Describing the data via XML, Earth Science Markup Language, etc.
  - Data transformation.
- Privacy and policy constraints
  - Different levels of access are needed.

Usage Scenario
- Download software from a web site and install it.
- Launch the application and collect data from well-known sources on the Web or from local sources.
- The system selects the appropriate format transformation and data fusion steps to convert all the data into a single coregistered format.
- Use the metadata associated with the retrieved data to understand the resolution, spatio-temporal framework, and attributes.
  - User input.
- Use analysis tools and subsets of the data to perform preprocessing, data exploration, data mining, and post-processing.
- Select results to be published on the Web, allowing collaboration and access via the Internet.

Conclusions
- Disturbance and association analysis can uncover interesting patterns for Earth scientists to investigate.
- By using clustering we have made some progress towards automatically finding climate patterns that display interesting connections between the ocean and the land.
- There are many more opportunities for data mining and data analysis in Earth Science data.
- There are many opportunities for distributed computing to play a useful or critical role.

Questions? More information can be found at http://www.ahpcrc.umn.edu/nasa-umn/index.html