Severe Weather Events in the United States
Chris Rodgers
28 November 2017

Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This project consists of two parts. Part one imports and pre-processes the data. The data are messy and require several transformations before they are in a format that can be used for analysis. Part two analyses which weather events are the most destructive and presents the results.

Part one: Data Processing

First we download the source data set.

Download

    # Download and read the data only if it is not already loaded
    if (!exists("noaa")) {
      temp <- tempfile()
      download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", temp)
      noaa <- read.csv(bzfile(temp))
      unlink(temp)
    }
    # Print numbers in full rather than in scientific notation
    options(scipen = 999)

The data include 902,297 rows with 37 variables.

    dim(noaa)
    ## [1] 902297 37

Post-2000 only

The data set is large, so we first filter it to include only observations on or after 01/01/2000. This date was chosen arbitrarily in order to reduce the number of observations.

    # Parse BGN_DATE (month/day/year hour:minute:second) into a date-time
    noaa <- dplyr::mutate(noaa, BGN_DATE = lubridate::mdy_hms(BGN_DATE))
    # Compare against a real date-time; a bare 01/01/2000 is numeric division, not a date
    noaa <- dplyr::filter(noaa, BGN_DATE >= as.POSIXct("2000-01-01", tz = "UTC"))

The run recorded in this report left 866,041 observations after this filter.
    dim(noaa)
    ## [1] 866041 37

Tidy up damage exponents

Crop and property damage are both recorded in this data set. Property damage is recorded as a value (PROPDMG) and an exponent (PROPDMGEXP) that defines the unit of measure for that value. For example, an observation with PROPDMG equal to 25.0 and PROPDMGEXP equal to K represents $25,000 worth of property damage. Crop damage is recorded the same way.

The exponents for property and crop damage are not recorded consistently (for example, thousands is recorded as k for some observations and K for others). The code below tidies up the most commonly used exponents.

    # Standardise lower-case exponent codes to upper case
    noaa <- dplyr::mutate(noaa, PROPDMGEXP = gsub("k", "K", PROPDMGEXP))
    noaa <- dplyr::mutate(noaa, PROPDMGEXP = gsub("m", "M", PROPDMGEXP))
    noaa <- dplyr::mutate(noaa, PROPDMGEXP = gsub("h", "H", PROPDMGEXP))
    noaa <- dplyr::mutate(noaa, CROPDMGEXP = gsub("k", "K", CROPDMGEXP))
    noaa <- dplyr::mutate(noaa, CROPDMGEXP = gsub("m", "M", CROPDMGEXP))

Tidy up event types

Event type (EVTYPE) records the type of weather event that an observation represents. There are 47 official event types, but the source data has several hundred unique event types recorded. The code below updates some event type names to match one from the official list of 47. The event types to update were chosen by counting and ordering the number of observations for each event type. Event types that had a high count of observations but were variations of an official event type are covered by this update.
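The counting and ordering step described above is not shown in the report. A minimal sketch of how it might look, assuming dplyr is installed and using a small made-up data frame in place of the storm data (the name toy, the column n_obs and the values are hypothetical):

```r
# Toy stand-in for the storm data (hypothetical values, not from the NOAA file)
toy <- data.frame(
  EVTYPE = c("TSTM WIND", "TSTM WIND", "THUNDERSTORM WIND", "HAIL",
             "TSTM WIND", "FLASH FLOODING", "HAIL"),
  stringsAsFactors = FALSE
)

# Count observations per event type, most frequent first
evtype_counts <- dplyr::count(toy, EVTYPE, name = "n_obs", sort = TRUE)

print(evtype_counts)
```

Scanning the top of such a table is one way to spot frequent variants (here TSTM WIND and THUNDERSTORM WIND) that should map to a single official event type.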
    # EVTYPE values are upper case, so the patterns are upper case too
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub("AVALANCE", "AVALANCHE", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*FLASH FLOOD.*", "FLASH FLOOD", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*THUNDERSTORM.*", "THUNDER STORM WIND", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*TSTM.*", "THUNDER STORM WIND", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*RIP CURRENT.*", "RIP CURRENT", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*HURRICANE.*", "HURRICANE/TYPHOON", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*HIGH WIND.*", "HIGH WIND", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*WILD.*", "WILD FIRE", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*WINTER WEATHER.*", "WINTER WEATHER", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*EXTREME COLD.*", "EXTREME COLD/WIND CHILL", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*EXTREME HEAT.*", "EXCESSIVE HEAT", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*HEAT WAVE.*", "HEAT", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*UNSEASONABLY WARM AND DRY.*", "HEAT", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*COLD.*", "COLD/WIND CHILL", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*GLAZE.*", "FREEZING FOG", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*HEAVY SURF.*", "HIGH SURF", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*MIXED PRECIP.*", "WINTER WEATHER", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*DROUGHT.*", "DROUGHT", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*SNOW AND ICE.*", "WINTER WEATHER", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*FOG.*", "FREEZING FOG", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*SNOW SQUALL.*", "HEAVY SNOW", EVTYPE))
    noaa <- dplyr::mutate(noaa, EVTYPE = gsub(".*FREEZING DRIZZLE.*", "WINTER WEATHER", EVTYPE))
    # Remove icy roads because they aren't a weather event
    noaa <- dplyr::filter(noaa, EVTYPE != "ICY ROADS")

Make damage common

Damage was recorded with different exponents (e.g. millions, thousands, hundreds). The code below updates the recorded damage values so that they all represent millions. For example, 250K will be updated to 0.25 (a quarter of a million).

    noaa <- dplyr::mutate(noaa, PROPDMG = dplyr::case_when(PROPDMGEXP == "M" ~ PROPDMG, PROPDMGEXP == "K" ~
    # Set NA for PROPDMG to 0
    noaa <- dplyr::mutate(noaa, PROPDMG = dplyr::if_else(is.na(PROPDMG), 0, PROPDMG))

Summarise data

With some basic data tidying complete, we group the data by event type. This helps with further filtering so that only event types that resulted in damage are included in the analysis. The summary will include total and average death, injury, property damage and crop damage values.
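The rescaling and summary steps described above can be sketched end to end on a toy data frame. This is an illustration under stated assumptions, not the report's exact code: it assumes dplyr is installed, that H, K and M mean hundreds, thousands and millions, and that any other exponent is treated as missing and then set to 0. The data frame toy, its values, and the result name toysum are hypothetical; the summary column names follow the ones used later in the report.

```r
# Toy data (hypothetical values, not taken from the NOAA file)
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "HAIL"),
  PROPDMG    = c(250, 2, 5, 10),
  PROPDMGEXP = c("K", "M", "H", ""),
  FATALITIES = c(1, 0, 0, 2),
  stringsAsFactors = FALSE
)

# Rescale every damage value to millions of dollars.
# Assumption: exponents other than H, K and M fall through to NA.
toy <- dplyr::mutate(
  toy,
  PROPDMG = dplyr::case_when(
    PROPDMGEXP == "M" ~ PROPDMG,        # already in millions
    PROPDMGEXP == "K" ~ PROPDMG / 1e3,  # thousands to millions
    PROPDMGEXP == "H" ~ PROPDMG / 1e4,  # hundreds to millions
    TRUE ~ NA_real_
  )
)

# Treat damage with an unrecognised exponent as 0
toy <- dplyr::mutate(toy, PROPDMG = dplyr::if_else(is.na(PROPDMG), 0, PROPDMG))

# Totals, means and observation counts per event type
toysum <- dplyr::summarise(
  dplyr::group_by(toy, EVTYPE),
  total.propdmg    = sum(PROPDMG),
  mean.propdmg     = mean(PROPDMG),
  total.fatalities = sum(FATALITIES),
  mean.fatalities  = mean(FATALITIES),
  count            = dplyr::n()
)

print(toysum)
```

In this sketch the 250K flood record becomes 0.25, matching the example in the text, and the count column is what a later filter on rare events would rely on.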
Group data

    noaa <- dplyr::group_by(noaa, EVTYPE)

Create summary variables
    noaasum <- dplyr::summarise(noaa, total.propdmg = sum(PROPDMG), mean.propdmg = mean(PROPDMG), total.fata

Filter out rare events that occur fewer than 10 times.

    noaasum <- dplyr::filter(noaasum, count >= 10)

Filter to only event types that caused damage

Create a separate data set that consists only of event types that had a death or injury. Event types that have never had a recorded death or injury are excluded.

    # Keep observations with at least one fatality or injury
    events <- dplyr::filter(noaa, FATALITIES != 0 | INJURIES != 0)
    events <- unique(dplyr::select(events, EVTYPE))
    deathsetsum <- dplyr::filter(noaasum, EVTYPE %in% events$EVTYPE)

Create a separate data set that consists only of event types that had economic damage. Any event type with 0 property and 0 crop damage is not included in this data set.

    events <- dplyr::filter(noaa, PROPDMG != 0)
    events <- unique(dplyr::select(events, EVTYPE))
    propsetsum <- dplyr::filter(noaasum, EVTYPE %in% events$EVTYPE)

Part two: Results

Now that the data are somewhat tidy and grouped into sets appropriate for analysis, we look at the two key questions.

Across the United States, which types of events are most harmful with respect to population health?

Population health is measured here in terms of death and injury caused by an event. We have already created a data set of only those event types that included death or injury. It is used to answer this question.

    dim(deathsetsum)
    ## [1] 68 8

There are 68 event types that have caused injury or death. This is too many to look at, so we look at the top 10.

Create top 10 data sets.

    top10mean <- dplyr::arrange(deathsetsum, dplyr::desc(mean.fatalities))
    top10mean <- dplyr::slice(top10mean, 1:10)
    top10total <- dplyr::arrange(deathsetsum, dplyr::desc(total.fatalities))
    top10total <- dplyr::slice(top10total, 1:10)

The top events by mean fatalities are shown in the bar plot below.

    ggplot2::ggplot(top10mean, aes(x = reorder(EVTYPE, mean.fatalities), y = mean.fatalities)) + geom_bar(st
[Figure: bar plot of the top 10 event types by mean fatalities. Axes: Event Type; Mean Fatalities (0.0 to 1.5). Event types shown: TSUNAMI, HEAT, EXCESSIVE HEAT, RIP CURRENT, AVALANCHE, HURRICANE/TYPHOON, MARINE STRONG WIND, COLD/WIND CHILL, HIGH SURF, ICE.]

Tsunami is the top weather event in terms of mean fatalities. Taken together, heat events (heat and excessive heat) have the greatest impact on average in terms of human death.

The top events by total deaths are shown in the bar plot below.

    ggplot2::ggplot(top10total, aes(x = reorder(EVTYPE, total.fatalities), y = total.fatalities)) + geom_bar
[Figure: bar plot of the top 10 event types by total fatalities. Axes: Event Type; Total Fatalities (0 to 3000). Event types shown: TORNADO, EXCESSIVE HEAT, HEAT, FLASH FLOOD, LIGHTNING, THUNDER STORM WIND, RIP CURRENT, FLOOD, COLD/WIND CHILL, HIGH WIND.]

Tornadoes have caused the most fatalities from the year 2000 to the present. Heat events again feature prominently, with excessive heat and heat combined causing the largest number of deaths. Flooding, represented by flash flood and flood, is also prominent.

In terms of volume of harm to the population, tornadoes and heat are the most damaging. Flooding also features prominently as a cause of loss of life.

Across the United States, which types of events have the greatest economic consequences?

    top10mean <- dplyr::arrange(propsetsum, dplyr::desc(mean.propdmg))
    top10mean <- dplyr::slice(top10mean, 1:10)
    # Use the economic-damage set and the property damage totals here
    top10total <- dplyr::arrange(propsetsum, dplyr::desc(total.propdmg))
    top10total <- dplyr::slice(top10total, 1:10)

The bar plot below shows the top events by total property damage.

    ggplot2::ggplot(top10total, aes(x = reorder(EVTYPE, total.propdmg), y = total.propdmg)) + geom_bar
[Figure: bar plot of the top 10 event types by total property damage. Axes: Event Type; Total Property Damage (millions) (0 to 3000). Event types shown: TORNADO, FLASH FLOOD, LIGHTNING, THUNDER STORM WIND, FLOOD, HIGH WIND, HURRICANE/TYPHOON, WILD FIRE, ICE STORM, HAIL.]

Tornadoes are by far the event type that has caused the most property damage since the year 2000. Flooding (represented by flood and flash flood) also features prominently in total damage caused.
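The ggplot2 calls in this report are cut off at the page margin, so the full plotting code is not recoverable. A minimal sketch of a bar plot in the same style, assuming ggplot2 is installed; the data frame toysum and its values are hypothetical, and stat = "identity", coord_flip() and the axis labels are assumptions based on the visible fragments and figure labels:

```r
# Toy summary table (hypothetical values, not results from the report)
toysum <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
  total.propdmg = c(30, 20, 10),
  stringsAsFactors = FALSE
)

# One bar per event type, ordered by total property damage
p <- ggplot2::ggplot(
  toysum,
  ggplot2::aes(x = reorder(EVTYPE, total.propdmg), y = total.propdmg)
) +
  ggplot2::geom_bar(stat = "identity") +  # bar heights come from the data
  ggplot2::coord_flip() +                 # event types on the vertical axis
  ggplot2::labs(x = "Event Type", y = "Total Property Damage (millions)")

print(p)
```

Reordering EVTYPE by the plotted value keeps the bars sorted, which makes the ranking readable at a glance.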