Data science and engineering for local weather forecasts
Nikhil R Podduturi, Data {Scientist, Engineer}
November 2016
Agenda
- About MeteoGroup
- Introduction to weather data
- Problem description
- Data science and weather forecasting
- Engineering
- Verification
- Results
- Questions
How many of you check weather forecasts frequently?
Weather data
1.5 TB/day
Types of data
- Observations
  - WMO weather stations (e.g. surface, upper-air, ships, drifting buoys, aircraft)
  - MeteoGroup measurement network
- Satellite data
- Radar data
- User data
- Numerical weather prediction model data
Numerical weather prediction models
- Complex and multidimensional data
- 5 NWP models from different providers
- Data size per day: 0.5 TB
Data science and weather forecasting
Outcome
- Took 24 hours to produce a 24-hour forecast
- Grid interval: 736 km
- Poor results
MeteoGroup forecasting system
MeteoGroup forecasting system
- Training: 3 years of NWP data + 3 years of observation data -> machine learning model -> trained model
- Forecasting: daily NWP data -> trained model -> forecasts
MeteoGroup forecasting system
- Written in Pascal
- Runs on an in-house high-performance computing cluster
- Limitations
  - Hard to maintain
  - Not very transparent
  - Limited scalability
Problem description
Next-generation forecasting system
- Cloud-based solution
- Transparent
- Scalable
- Improved forecasting accuracy
Baseline model
- Pipeline: NWP data -> downscale to location -> interpolate missing values -> linear model
- Outcome
  - Very fast
  - Poor accuracy
  - Multicollinearity
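The baseline steps can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the slides do not say how downscaling or interpolation were done, so bilinear interpolation on a regular grid, linear gap filling in time, and ordinary least squares are stand-ins, and all function names are hypothetical.

```python
import numpy as np

def bilinear_downscale(grid, lats, lons, lat, lon):
    """Interpolate a 2-D gridded field to one location (bilinear).
    Assumes lats and lons are sorted ascending and the point lies inside the grid."""
    i = int(np.clip(np.searchsorted(lats, lat) - 1, 0, len(lats) - 2))
    j = int(np.clip(np.searchsorted(lons, lon) - 1, 0, len(lons) - 2))
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    low = (1 - tx) * grid[i, j] + tx * grid[i, j + 1]
    high = (1 - tx) * grid[i + 1, j] + tx * grid[i + 1, j + 1]
    return (1 - ty) * low + ty * high

def fill_missing(series):
    """Fill NaN gaps in a 1-D time series by linear interpolation.
    Gaps at either end take the nearest known value."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(len(series))
    known = ~np.isnan(series)
    return np.interp(idx, idx[known], series[known])

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, weights...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients to new feature rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef
```

With strongly correlated NWP features, the least-squares coefficients become unstable, which is the multicollinearity issue listed in the outcome.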
Iteration 1
- Address multicollinearity using feature selection
- Scale the features
- Pipeline: NWP data -> downscale to location -> interpolate missing values -> scale features -> feature selection -> linear model
- Outcome
  - Improved accuracy
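One way to picture the two new steps is standardization followed by a correlation filter. The slides do not name the feature selection method that was used, so the greedy filter below is an assumed stand-in, and the function names and the 0.95 threshold are hypothetical.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns are centered but not rescaled
    return (X - X.mean(axis=0)) / std

def drop_correlated(X, threshold=0.95):
    """Greedy filter: keep a column only if its absolute correlation with
    every column kept so far is below the threshold.
    Assumes no column is constant (correlation is undefined for those)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept
```

The returned indices can be used to slice the feature matrix before fitting the linear model, for example `X[:, drop_correlated(X)]`.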
Iteration 2
- Model selection between linear and non-linear models
- Advanced feature selection
- Pipeline: NWP data -> downscale to location -> interpolate missing values -> scale features -> advanced feature selection -> model selection (linear and non-linear models)
- Outcome
  - On par with existing forecasting system
  - Slow training
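Model selection can be illustrated by fitting several candidates and keeping the one with the lowest error on held-out data. The sketch below is an assumption-laden toy: polynomial fits of a single feature stand in for the linear (degree 1) and non-linear (degree 2 and up) candidates, which the slides do not specify, and `select_model` is a hypothetical name.

```python
import numpy as np

def select_model(x_train, y_train, x_val, y_val, degrees=(1, 2)):
    """Fit one polynomial per candidate degree and return
    (validation RMSE, degree, coefficients) for the best candidate."""
    best = None
    for degree in degrees:
        coef = np.polyfit(x_train, y_train, degree)
        residual = np.polyval(coef, x_val) - y_val
        rmse = float(np.sqrt(np.mean(residual ** 2)))
        # Strict comparison keeps the earlier candidate on exact ties.
        if best is None or rmse < best[0]:
            best = (rmse, degree, coef)
    return best
```

Every candidate is trained for every location, attribute, and hour, which is consistent with the slow training noted in the outcome.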
Engineering to scale the product
Baseline model engineering (scikit-learn, NumPy, Keras with TensorFlow)
- Good
  - Python ML ecosystem
  - Familiarity among the team
  - Test-driven and agile development
  - Fail fast
- Bad
  - Not scalable
47000 * 15 * 360 model runs
- 47000 locations
- 15 weather attributes (e.g. temperature, wind)
- 360 hours
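The product works out as follows. The one-second cost per run is purely an assumption for illustration; the slides give no per-run timing.

```python
locations = 47_000
attributes = 15   # e.g. temperature, wind
hours = 360

model_runs = locations * attributes * hours
print(f"{model_runs:,} model runs")  # 253,800,000 model runs

# Hypothetical: if a single run took 1 second on one core,
# running them one after another would take this many days.
seconds_per_run = 1.0
days_sequential = model_runs * seconds_per_run / 86_400
print(f"{days_sequential:,.1f} days sequentially")  # 2,937.5 days sequentially
```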
Scaling with Apache Airflow
Apache Airflow
- By Airbnb
- Apache product since early 2016
- Directed acyclic graph (DAG)
- Components
  - UI
  - Scheduler
  - Executor(s)
Apache Airflow DAG
- Hooks (connections)
- Operators (tasks)
- Schedule
- Dependencies
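The idea of tasks plus dependencies can be shown without Airflow itself. The sketch below uses Python's standard `graphlib` module (Python 3.9+) rather than the Airflow API, and the task names are hypothetical; it only demonstrates how a DAG determines a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "fetch_nwp": set(),
    "fetch_observations": set(),
    "train_model": {"fetch_nwp", "fetch_observations"},
    "forecast": {"train_model"},
    "persist_to_s3": {"forecast"},
}

def run_order(deps):
    """Return one valid execution order for the task graph.
    Raises graphlib.CycleError if the dependencies contain a cycle."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
print(order)
```

A scheduler such as Airflow's does more than this (retries, scheduling intervals, distributing tasks to executors), but the ordering constraint is the same: a task runs only after everything it depends on has finished.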
Airflow and Mesos
(Diagram labels: continuous integration, deploy, persist, AWS S3, Airflow scheduler, Mesos cluster)
Verification
Model improvement cycle
Deploy DAG -> verify model -> improve DAG -> deploy DAG
Forecast verification
(Diagram labels: forecast engine, AWS S3 with models, JSON-LD)
Verification metrics
- Mean absolute error
- Root mean squared error
- Mean error
- Heidke skill score
- Equitable threat score
- Probability density functions
- Error percentiles
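Several of these metrics have short standard definitions, sketched below in plain Python. The function names are hypothetical, and the event definition for the categorical scores (value at or above a threshold) is an assumption; the slides do not show how the verification service computes them.

```python
import math

def mean_error(forecast, observed):
    """Average signed error (bias)."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)

def mean_absolute_error(forecast, observed):
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)

def root_mean_squared_error(forecast, observed):
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))

def contingency(forecast, observed, threshold):
    """Count (hits, false alarms, misses, correct negatives)
    for the event 'value >= threshold'."""
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, observed):
        forecast_event, observed_event = f >= threshold, o >= threshold
        if forecast_event and observed_event:
            hits += 1
        elif forecast_event:
            false_alarms += 1
        elif observed_event:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives

def heidke_skill_score(a, b, c, d):
    """HSS for a 2x2 table: a=hits, b=false alarms, c=misses, d=correct negatives.
    Undefined (division by zero) if the event always or never occurs."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

def equitable_threat_score(a, b, c, d):
    """ETS: threat score corrected for hits expected by chance."""
    n = a + b + c + d
    hits_random = (a + b) * (a + c) / n
    return (a - hits_random) / (a + b + c - hits_random)
```

Both skill scores equal 1 for a perfect forecast and are near 0 for a forecast no better than chance.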
Mean absolute error for different models (Temperature)
Probability distribution function for multiple models (Temperature)
Percentile graphs for each model (Temperature)
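The values behind such a percentile graph can be computed per model from the absolute forecast errors. This is a minimal sketch assuming NumPy and paired forecast/observation arrays; the function name and the chosen percentiles are hypothetical.

```python
import numpy as np

def error_percentiles(forecast, observed, percentiles=(50, 90, 95)):
    """Return {percentile: absolute error value} for one model."""
    forecast = np.asarray(forecast, dtype=float)
    observed = np.asarray(observed, dtype=float)
    abs_err = np.abs(forecast - observed)
    values = np.percentile(abs_err, percentiles)
    return dict(zip(percentiles, values.tolist()))
```

Calling it once per model on the same set of observations gives comparable curves, for example the error that 90% of a model's forecasts stay below.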
For a demo, please stop by the MG booth
Results
- Cloud-based solution
  - AWS S3, EC2, ElastiCache
- Transparent
  - Verification microservice
- Scalable
  - Mesos cluster
  - Training time reduced from a month to approximately 5 hours
- Improved forecasting accuracy
  - On par or better
Improvements
- Hyperlocal
- AWS Lambda integration
- Iterate for more accuracy
Questions?
We are hiring!