Jaclyn Ashley Shafer. Bachelor of Science Meteorology Florida Institute of Technology 2006


A Verification of the National Hurricane Center's Tropical Cyclone Wind Speed Probability Forecast Product, by Jaclyn Ashley Shafer, Bachelor of Science in Meteorology, Florida Institute of Technology, 2006. A thesis submitted to the Department of Marine and Environmental Systems, Florida Institute of Technology, in partial fulfillment of the requirements for the degree of Master of Science in Meteorology. Melbourne, Florida, August 2008.

Copyright 2008 Jaclyn Ashley Shafer. All Rights Reserved. The author grants permission to make single copies.

ABSTRACT

A Verification of the National Hurricane Center's Tropical Cyclone Wind Speed Probability Forecast Product, by Jaclyn Ashley Shafer. Committee Chair: Michael Splitt, M.S.

In 2004, Colorado State University created a tropical cyclone (TC) probability forecast product that was adopted as an experimental product by the National Hurricane Center (NHC) in 2005 and transitioned to operations in 2006. With the exception of a recent evaluation of the cumulative probability products for the Atlantic basin, little verification has been performed for the interval probabilities, especially for land-threatening or land-falling events. An objective evaluation of the product's performance and an interpretation of the product for operational application (e.g., discerning which forecast probabilities represent low/moderate/high risk for the various wind speed and forecast interval categories) are presented. This project focuses on the four hurricane seasons from 2004 to 2007 and includes all storms approaching areas of interest centered on Cocoa Beach, FL; Charleston, SC; New Orleans, LA; and Corpus Christi, TX. Verification statistics are computed for each of the 21 forecast categories of the NHC probability product: three wind speed criteria (34 kt, 50 kt, and 64 kt) and seven forecast time intervals (12, 24, 36, 48, 72, 96, and 120 hours). Verification includes the use of reliability and sharpness diagrams, as well as additional statistics designed to quantitatively measure the product's performance.

Table of Contents

Abstract
List of Figures
List of Tables
List of Abbreviations
Acknowledgements
1. Introduction
2. Methodology
  2.1 Product Forecasts
  2.2 Verification
    2.2.1 Verification Data
    2.2.2 Verification Tools
      2.2.2.1 Constructing a Relative Operating Characteristics (ROC) Diagram
      2.2.2.2 Optimal Threshold Selection
        2.2.2.2.1 Bootstrap Technique
      2.2.2.3 Reliability and Frequency Distribution Diagrams
3. Results
  3.1 Relative Operating Characteristics (ROC) Diagrams
  3.2 Optimal Thresholds
  3.3 Product Evaluation
    3.3.1 Optimal Threshold Based Statistics
    3.3.2 Frequency Distribution and Reliability Diagrams
      3.3.2.1 Frequency Distribution Diagrams
      3.3.2.2 Reliability Diagrams
4. Summary
References
Appendix A
Appendix B
Appendix C

List of Figures

Figure 1: Google image of the four areas of interest
Figure 2: Wind speed probability product for Cocoa Beach, Florida
Figure 3: Successive forecast time intervals
Figure 4: HURREVAC generated image for hurricane Frances
Figure 5: Example ROC-diagram
Figure 6: ROC-diagram for visual optimal threshold selection method
Figure 7: ROC-diagram for TSS optimal threshold selection method
Figure 8: Hypothetical reliability diagrams
Figure 9.A: ROC-diagram for Cocoa Beach, FL, 34 kt, with X forecasts
Figure 9.B: ROC-diagram for Cocoa Beach, FL, 50 kt, with X forecasts
Figure 9.C: ROC-diagram for Cocoa Beach, FL, 64 kt, with X forecasts
Figure 9.D: ROC-diagram for Cocoa Beach, FL, 34 kt, no X forecasts
Figure 9.E: ROC-diagram for Cocoa Beach, FL, 50 kt, no X forecasts
Figure 9.F: ROC-diagram for Cocoa Beach, FL, 64 kt, no X forecasts
Figure 10.A: ROC-diagram with confidence intervals, with X forecasts
Figure 10.B: ROC-diagram with confidence intervals, no X forecasts
Figure 11: ROC curves with confidence intervals, 34 kt, 12 h
Figure 12: Illustration of the X forecast effect on ROC curves
Figure 13.A: Optimal thresholds for Cocoa Beach, FL, 34 kt
Figure 13.B: Optimal thresholds for Cocoa Beach, FL, 50 kt
Figure 13.C: Optimal thresholds for Cocoa Beach, FL, 64 kt
Figure 14.A: Score plots for Cocoa Beach, FL, 34 kt, 120 h, with X forecasts
Figure 14.B: Score plots for Cocoa Beach, FL, 34 kt, 120 h, no X forecasts
Figure 15: HSS for Cocoa Beach, FL, 34 kt, 12 h
Figure 16.A: Accuracy, TSS optimal threshold selection method
Figure 16.B: Accuracy, HSS optimal threshold selection method
Figure 16.C: Accuracy, 34 kt
Figure 17.A: Bias Score, TSS optimal threshold selection method
Figure 17.B: Bias Score, HSS optimal threshold selection method
Figure 18.A: PoD, TSS optimal threshold selection method
Figure 18.B: PoD, HSS optimal threshold selection method
Figure 19.A: PoFD, TSS optimal threshold selection method
Figure 19.B: PoFD, HSS optimal threshold selection method
Figure 20.A: Threat Score, TSS optimal threshold selection method
Figure 20.B: Threat Score, HSS optimal threshold selection method
Figure 21.A: False Alarm Ratio, TSS optimal threshold selection method
Figure 21.B: False Alarm Ratio, HSS optimal threshold selection method
Figure 22.A: True Skill Statistic, TSS optimal threshold selection method
Figure 22.B: True Skill Statistic, HSS optimal threshold selection method
Figure 23.A: Heidke Skill Score, TSS optimal threshold selection method
Figure 23.B: Heidke Skill Score, HSS optimal threshold selection method
Figure 24.A: Frequency Distribution: 34 kt, 12 h, Cocoa Beach, FL AOI
Figure 24.B: Reliability Diagram: 34 kt, 12 h, Cocoa Beach, FL AOI
Figure 25.A: Frequency Distribution: 34 kt, 24 h, Cocoa Beach, FL AOI
Figure 25.B: Reliability Diagram: 34 kt, 24 h, Cocoa Beach, FL AOI
Figure 26.A: Frequency Distribution: 34 kt, 36 h, Cocoa Beach, FL AOI
Figure 26.B: Reliability Diagram: 34 kt, 36 h, Cocoa Beach, FL AOI
Figure 27.A: Frequency Distribution: 34 kt, 48 h, Cocoa Beach, FL AOI
Figure 27.B: Reliability Diagram: 34 kt, 48 h, Cocoa Beach, FL AOI
Figure 28.A: Frequency Distribution: 34 kt, 72 h, Cocoa Beach, FL AOI
Figure 28.B: Reliability Diagram: 34 kt, 72 h, Cocoa Beach, FL AOI
Figure 29.A: Frequency Distribution: 34 kt, 96 h, Cocoa Beach, FL AOI
Figure 29.B: Reliability Diagram: 34 kt, 96 h, Cocoa Beach, FL AOI
Figure 30.A: Frequency Distribution: 34 kt, 120 h, Cocoa Beach, FL AOI
Figure 30.B: Reliability Diagram: 34 kt, 120 h, Cocoa Beach, FL AOI
Figure A.1.A: ROC-diagram for Charleston, SC, 34 kt, with X forecasts
Figure A.1.B: ROC-diagram for Charleston, SC, 50 kt, with X forecasts
Figure A.1.C: ROC-diagram for Charleston, SC, 64 kt, with X forecasts
Figure A.1.D: ROC-diagram for Charleston, SC, 34 kt, no X forecasts
Figure A.1.E: ROC-diagram for Charleston, SC, 50 kt, no X forecasts
Figure A.1.F: ROC-diagram for Charleston, SC, 64 kt, no X forecasts
Figure B.1.A: ROC-diagram for New Orleans, LA, 34 kt, with X forecasts
Figure B.1.B: ROC-diagram for New Orleans, LA, 50 kt, with X forecasts
Figure B.1.C: ROC-diagram for New Orleans, LA, 64 kt, with X forecasts
Figure B.1.D: ROC-diagram for New Orleans, LA, 34 kt, no X forecasts
Figure B.1.E: ROC-diagram for New Orleans, LA, 50 kt, no X forecasts
Figure B.1.F: ROC-diagram for New Orleans, LA, 64 kt, no X forecasts
Figure C.1.A: ROC-diagram for Corpus Christi, TX, 34 kt, with X forecasts
Figure C.1.B: ROC-diagram for Corpus Christi, TX, 50 kt, with X forecasts
Figure C.1.C: ROC-diagram for Corpus Christi, TX, 64 kt, with X forecasts
Figure C.1.D: ROC-diagram for Corpus Christi, TX, 34 kt, no X forecasts
Figure C.1.E: ROC-diagram for Corpus Christi, TX, 50 kt, no X forecasts
Figure C.1.F: ROC-diagram for Corpus Christi, TX, 64 kt, no X forecasts

List of Tables

Table 1: Example of data collected for Cocoa Beach, FL AOI
Table 2: Classification definitions
Table 3: Contingency table
Table 4: Verification statistics
Table 5: Maximum probability forecast for Cocoa Beach, FL AOI
Table 6.A: Optimal thresholds for Cocoa Beach, FL AOI, with X forecasts
Table 6.B: Optimal thresholds for Cocoa Beach, FL AOI, no X forecasts
Table 7: Distribution of FA and CN for Cocoa Beach, FL AOI, 64 kt
Table 8: Distribution of Hit, Miss, FA and CN for Cocoa Beach, FL AOI, 34 kt, TSS method
Table 9: Distribution of Hit, Miss, FA and CN for Cocoa Beach, FL AOI, 34 kt, HSS method
Table 10: Optimal thresholds selected optimizing TSS, HSS and Bias
Table 11.A: PoD, TSS and HSS metrics based on optimal thresholds from the TSS method
Table 11.B: PoD, TSS and HSS metrics based on optimal thresholds from the HSS method
Table 11.C: PoD, TSS and HSS metrics based on optimal thresholds from the Bias Score method
Table A.1.A: Optimal thresholds for Charleston, SC AOI, with X forecasts
Table A.1.B: Optimal thresholds for Charleston, SC AOI, no X forecasts
Table B.1.A: Optimal thresholds for New Orleans, LA AOI, with X forecasts
Table B.1.B: Optimal thresholds for New Orleans, LA AOI, no X forecasts
Table C.1.A: Optimal thresholds for Corpus Christi, TX AOI, with X forecasts
Table C.1.B: Optimal thresholds for Corpus Christi, TX AOI, no X forecasts

List of Abbreviations

45 WS: 45th Weather Squadron
AOI: Area of Interest
CCAFS: Cape Canaveral Air Force Station
CN: Correct Negative
FA: False Alarm
FD: Frequency Distribution
HSS: Heidke Skill Score
HURREVAC: Hurricane Evacuation
IP: Interval Probability
KSC: Kennedy Space Center
NHC: National Hurricane Center
NSS: Normalized Skill Score
PoD: Probability of Detection
PoFD: Probability of False Detection
ROC: Relative Operating Characteristics
TC: Tropical Cyclone
TSS: True Skill Statistic
WPFP: Tropical Cyclone Wind Speed Probability Forecast Product

Acknowledgements

I would first like to thank my family and friends for all of their support over the years. Mom, even through all the trials and tribulations you were there every step of the way. I greatly appreciate everything you have done for me and would not be where I am without your love, guidance and support. Leslie, just remember! I would also like to thank my advisors, Dr. Steven Lazarus and Mr. Michael Splitt, for all their dedication, guidance and encouragement throughout both my undergraduate and graduate careers. I sincerely thank Mr. David Froiseth and fellow Scitorians of Scitor Corporation for the internship opportunity that started it all! My professional career would not be where it is without you, and I thank you for everything! I also thank Mike McAleenan, Bill Roeder, Kathy Winters and all other members of the 45th Weather Squadron for their warm welcome and assistance with this project.

1. INTRODUCTION

The 45th Weather Squadron (45 WS) provides comprehensive weather services to America's space program at Cape Canaveral Air Force Station (CCAFS), Patrick Air Force Base and NASA's Kennedy Space Center (KSC). Given the location of these facilities on the Florida central east coast, tropical cyclones (TCs) pose a potentially significant threat to these agencies. When a TC threatens the area, the 45 WS provides detailed information to launch agencies concerning the storm threat, including track, timing, intensity and size (Winters et al. 2006). One of the support tools used by the 45 WS is the Tropical Cyclone Wind Speed Probability Forecast Product (WPFP), which is issued by the National Hurricane Center (NHC). The 45 WS relays the WPFP and other TC information to senior managers, who then decide if and when to begin actions necessary to protect resources. As a recent product, the WPFP has yet to be fully evaluated, especially for land-falling or land-threatening TC events. By land-threatening we refer to storms that might make landfall or for which a possibility exists that coastal cities in our area of interest will be affected by tropical storm (34-63 kt, 17-32 m/s) or hurricane (64+ kt, 33+ m/s) force winds. Herein, we address two key questions: 1) How well does the product perform? 2) What probability values are significant for making a yes/no decision? In a related study, Knaff and DeMaria (2006) evaluated the performance of the WPFP. Their study focuses on the entire Atlantic basin and includes all storms

in the 2006 hurricane season. Verification includes the bias score (see Table 4) and the Brier skill score, which measures the improvement of the probabilistic forecast relative to climatology (Stefanova et al. 2002), as well as the use of reliability diagrams (discussed in detail in the methodology section). Results of their work show the product has biases ranging from 0.70 to 0.95 (a perfect score is 1), and that the probability product is more skillful than the deterministic forecasts (abbreviated OFCL, the official NHC forecast) in declaring whether an event will happen (Brier skill scores range from 0.1 to 0.25; a perfect score is 1, and 0 indicates no skill). Reliability diagrams indicated that the product's predicted probabilities of an event corresponded closely with the observed frequencies (61st IHC 2006). The purpose of this project is to determine the performance of the WPFP issued by the NHC for land-threatening and land-falling storms. This study presents results composited from the four hurricane seasons from 2004 to 2007 and includes all TCs affecting areas of interest (AOI) centered on Cocoa Beach, FL; Charleston, SC; Corpus Christi, TX; and New Orleans, LA (see Figure 1). The methodology, data, and results follow.
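The Brier score and Brier skill score cited above can be sketched in a few lines. This is a minimal illustration using made-up forecast/outcome pairs and the sample base rate as the climatological reference, not the Knaff and DeMaria data:

```python
# Brier score: mean squared error of probability forecasts (0 = perfect).
# Brier skill score: improvement over a climatological reference forecast
# (1 = perfect, 0 = no skill). All numbers below are hypothetical.

def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    climo = sum(outcomes) / len(outcomes)             # base rate as reference
    bs_ref = brier_score([climo] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / bs_ref

forecasts = [0.9, 0.7, 0.2, 0.1, 0.6, 0.05]   # hypothetical probabilities
observed  = [1, 1, 0, 0, 1, 0]                # 1 = event occurred
print(round(brier_skill_score(forecasts, observed), 3))
```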

Figure 1: Google Earth image of the southeast United States outlining the four areas of interest: Charleston, SC (green), Cocoa Beach, FL (purple), New Orleans, LA (pink), Corpus Christi, TX (orange). Gold stars indicate the locations of the listed cities.

2. METHODOLOGY

2.1: Product Forecasts

The NHC issues tropical cyclone wind speed probabilities when a storm is a potential threat to coastal regions of the United States and other countries in the Atlantic and eastern Pacific basins. Every six hours, the WPFP provides probabilities for surface wind speeds of at least 34 kt (17 m/s), 50 kt (26 m/s) and 64 kt (33 m/s) for different, but fixed, forecast time intervals: 12, 24, 36, 48, 72, 96, and 120 h (Department of Commerce 2008).

Figure 2: Wind speed probability product for Cocoa Beach, Florida, issued during hurricane Wilma on 22 October 2005 at 0300Z. X forecasts indicate probabilities less than 0.4%. See text for more detail.

An example of the probability forecast product is shown in Figure 2. Probabilities are given for each wind speed criterion and forecast time interval at

Cocoa Beach, Florida during hurricane Wilma in October 2005. The product lists two columns of data per forecast time interval: the left column lists the interval forecasts and the right (parenthetical) column lists the cumulative probabilities. The interval probability (IP) is defined as the probability of first occurrence of a wind speed threshold within a given forecast time interval and is thus related to predicting the onset of the event in question. The cumulative probability is an integrated quantity defined as the sum of the preceding IPs for a given forecast hour within the 120 h forecast period (NHC Product Description 2006). For this study, the IPs for each storm from the 2004-2007 hurricane seasons that affected our AOIs were used. These probabilities can be converted into binary (i.e., yes/no) forecasts based on a user-determined threshold (see Section 2.3.1 for detailed discussion). When a probability greater than 0% is forecast during at least one of the seven forecast intervals for at least one city within the AOI, X forecasts (representing < 0.4%) are issued in the remaining forecast interval positions for each city within the AOI (e.g., see Table 1). These < 0.4% forecasts are included in the original data set in order to identify missed forecasts (i.e., binary no events that were incorrect forecasts). Unfortunately, this can saturate the data set with correct negative forecasts (i.e., binary no events that were correct forecasts). This is an important issue in the context of the verification statistics presented and is discussed later in more detail. As a result, a second data compositing method was

applied and is also presented here. This alternative data set includes only those probabilities greater than 0.4% (red numbers, Table 1). This approach, while eliminating the spurious correct negative forecasts, also eliminates the missed forecasts that are included in the original data set. Verification statistics are calculated using both data sets to compare and test the impact of the correct negatives and missed forecasts on the performance of the WPFP.

Table 1: Hypothetical probabilities (%) for cities within the Cocoa Beach, FL AOI. The X forecasts represent probabilities less than 0.4%.

City              12 h  24 h  36 h  48 h  72 h  96 h  120 h
Jacksonville      X     X     X     X     X     X     1
Daytona           X     X     X     X     X     X     2
Cocoa Beach       X     X     X     X     X     X     X
Fort Pierce       X     X     X     X     X     X     X
West Palm Beach   X     X     X     X     X     X     X
Miami             X     X     X     X     X     X     X

2.2: Verification

2.2.1: Verification Data

A database from an emergency management tool referred to as the Hurricane Evacuation (HURREVAC) software package was mined for ground truth TC observations including position (i.e., observed track), intensity, size and movement. An example of the graphical output from HURREVAC, for hurricane
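The conversion of interval probabilities into binary yes/no forecasts can be sketched as follows. The threshold value and the list representation (with "X" marking a < 0.4% forecast) are illustrative assumptions of ours, not NHC conventions:

```python
# Hedged sketch: turn one city's row of interval probabilities into
# yes/no forecasts. "X" stands for a probability below 0.4%, which can
# never reach a decision threshold.

THRESHOLD = 21.0  # hypothetical optimal threshold (%), for illustration

def to_binary(prob, threshold=THRESHOLD):
    """Return True when the event is forecast to occur."""
    if prob == "X":                    # X forecast: < 0.4%
        return False
    return float(prob) >= threshold

row = ["X", "X", 1, 25, 40, "X", 3]    # hypothetical 7-interval forecast
print([to_binary(p) for p in row])     # only 25% and 40% reach 21%
```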

Frances, is shown in Figure 4. The actual 34 kt (17 m/s), 50 kt (26 m/s) and 64 kt (33 m/s) wind radii are displayed in the figure. When the NHC issues probability forecasts for land-threatening or land-falling storms potentially affecting our AOIs, ground truth observations provided by HURREVAC are used to separate the probabilities into occurred and did not occur events. For example, consider the probability forecasts issued 06 UTC on 4 September 2004 for hurricane Frances (not shown). For this cycle, the 12 h forecast interval corresponds to the 06-18 UTC time window. HURREVAC output (Figure 4) indicates that winds greater than or equal to 34 kt (17 m/s) occurred (green arrows, Figure 4) in Cocoa Beach, Fort Pierce, West Palm Beach, and Miami, Florida but did not occur (red arrows, Figure 4) in Jacksonville and Daytona Beach within the 12 h forecast time interval. This procedure was repeated for each wind speed criterion and forecast time interval for all storms in the study. Unfortunately, it is not possible to unambiguously determine first occurrence for the initial forecast interval (i.e., 0-12 h) because the onset of a given wind speed criterion may have occurred at an earlier time. The WPFP does not recognize if, at the time of the forecast, onset occurred previously. Because the WPFP is issued every six hours (with forecasts out to 120 h), there is a six hour overlap between successive forecasts for the 0 to 12 h window (see Figure 3). If the wind speed criterion onset occurs within this overlap period (yellow shaded area, Figure 3), we opt to treat these probabilities as a yes (i.e., it occurred) for both 0 to 12 h forecast intervals. It is also possible that, for a slow moving storm,

the wind speed criterion may extend into subsequent 0-12 h forecasts as well. Regardless, these instances are considered a yes.

Figure 3: Two successive 0-12 h forecast time intervals (green and purple brackets) with overlapping times shaded in yellow. Labels are hypothetical UTC times (i.e., 06 = 06Z). See text.

Figure 4: HURREVAC generated image for hurricane Frances valid 1800 UTC 4 September 2004. The red arrows indicate did not occur events; green arrows indicate occurred events. Arrows and wind radii labels (34 kt, 50 kt, 64 kt) added for clarification.
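The overlap rule for successive 0-12 h windows can be illustrated with a small sketch. The hour-based clock and the names below are our own illustration, not part of the WPFP:

```python
# Hedged sketch of the overlap rule: WPFP cycles are issued every 6 h and
# the first interval spans 0-12 h, so consecutive cycles share a 6 h
# overlap. Onset inside that overlap verifies "yes" for both cycles.

CYCLE_STEP = 6     # hours between successive WPFP issuances
FIRST_WINDOW = 12  # length of the 0-12 h forecast interval (h)

def verifies_yes(issue_hour, onset_hour):
    """True if onset falls inside this cycle's 0-12 h window."""
    return issue_hour <= onset_hour <= issue_hour + FIRST_WINDOW

onset = 16                 # hypothetical onset time (h)
cycles = [6, 12]           # two successive issuances, 6 h apart
print([verifies_yes(c, onset) for c in cycles])
```

Onset at hour 16 lies in the 12-18 h overlap shared by the windows issued at hours 6 and 12, so both verify as a yes.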

2.2.2: Verification Tools

Many verification statistics exist for evaluating the performance of a forecast system based on binary (yes/no) forecasts, as well as the use of reliability diagrams (discussed in later sections). In order to take advantage of these tools, the probabilities were converted accordingly. The forecasts were separated into occurred and did not occur events through the verification process using HURREVAC, and then two additional categories were defined: event forecast to occur and event forecast not to occur. In order to determine into which of these two categories a forecast probability falls, we identify an optimal threshold (through Relative Operating Characteristics (ROC) diagrams, discussed later), defined as the probability where, if the product forecasts a probability greater than or equal to the threshold value, the event is forecast to occur and, if the product forecasts a probability less than the optimal threshold, the event is forecast not to occur. At this point, the probabilities have been converted to binary (yes/no) forecasts and can be classified as a Hit, Miss, False Alarm (FA) or Correct Negative (CN) (see Table 2). These classifications are necessary for the verification statistics used to quantitatively measure the performance of the WPFP.

Table 2: Definitions of each classification given to the probability forecasts.

Classification          Definition
Hit                     Event Forecast to Occur, Event Occurred
Miss                    Event Forecast Not to Occur, Event Occurred
False Alarm (FA)        Event Forecast to Occur, Event Did Not Occur
Correct Negative (CN)   Event Forecast Not to Occur, Event Did Not Occur

2.2.2.1: Constructing a Relative Operating Characteristics (ROC) Diagram

For a probabilistic forecast system, a Relative Operating Characteristics (ROC) diagram illustrates the varying quality of the forecast system at different forecast probabilities. Depending on specific user needs, the ROC curve can be used to help identify an optimum strategy in any specific application (Mason 1999). For this study, ROC diagrams were constructed for each wind speed criterion and forecast time interval to help determine the corresponding optimal thresholds. A ROC-diagram is defined, on the x-axis, by the Probability of False Detection (or false alarm rate), i.e., PoFD = FA / (CN + FA), and, on the y-axis, by the Probability of Detection, i.e., PoD = Hits / (Hits + Misses), where Hit, Miss, FA, and CN are defined in Table 2. The ROC-diagram is created by calculating a unique PoD and PoFD for each threshold probability (using increments of 1% for threshold values ranging

from 0 to 100%). The number of hits, misses, FAs and CNs is calculated for each individual threshold value, where the sample space consists of all data within a particular forecast interval and wind speed criterion. Hence, each data point on the ROC-diagram represents an estimate of a (PoFD, PoD) pair for a given threshold value (see Figure 5).

Figure 5: An example ROC-diagram constructed for a given wind speed criterion and forecast time interval. The PoD and PoFD (false alarm rate) are plotted for threshold values ranging from 0% to 100% in 1% increments. See text for details.
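The construction just described (classify every forecast at each candidate threshold, count hits, misses, false alarms and correct negatives, and compute one (PoFD, PoD) point per threshold) can be sketched as follows, with hypothetical data:

```python
# Sketch of ROC construction: one (PoFD, PoD) point per threshold.

def roc_points(forecasts, step=1):
    """forecasts: list of (probability %, event_occurred) pairs."""
    points = []
    for t in range(0, 101, step):
        hit = miss = fa = cn = 0
        for prob, occurred in forecasts:
            forecast_yes = prob >= t          # binary forecast at threshold t
            if forecast_yes and occurred:
                hit += 1
            elif forecast_yes:
                fa += 1
            elif occurred:
                miss += 1
            else:
                cn += 1
        pod = hit / (hit + miss) if hit + miss else 0.0
        pofd = fa / (fa + cn) if fa + cn else 0.0
        points.append((t, pofd, pod))
    return points

# Six hypothetical forecasts: (probability %, did the event occur?)
data = [(80, True), (60, True), (55, False), (30, True), (20, False), (5, False)]
for t, pofd, pod in roc_points(data, step=50):
    print(t, pofd, pod)
```

At threshold 0% every forecast is a yes (PoD = PoFD = 1), and at 100% every forecast is a no (PoD = PoFD = 0), which anchors the two ends of the ROC curve.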

2.2.2.2: Optimal Threshold Selection

Various approaches can be taken to determine an optimal threshold from the ROC-diagram. For example, if an optimal forecast is defined as one in which there are minimal false alarms and a maximum number of hits, then the highest skill (i.e., optimal threshold) is represented by the ROC data point that is closest to the upper left-hand corner (PoFD = 0, PoD = 1) of the diagram (e.g., see Figure 6). As an alternative, one can calculate what is referred to as the True Skill Statistic (TSS, see Table 4). Like the PoD and PoFD, the TSS can be calculated for each probability threshold increment. On a ROC-diagram, the TSS is the vertical distance between a (PoFD, PoD) pair and the diagonal line defined by PoD = PoFD (Manzato 2005). The TSS-based optimal threshold is identified as the data point with the greatest vertical distance to the line PoD = PoFD (e.g., see Figure 7). We also use a non-ROC-diagram approach for determining optimal thresholds, based on the maximum Heidke Skill Score (HSS, see Table 4). Unlike the TSS, this statistic, one of several accepted standards of forecasting in the scientific community (NOAA 2004), is difficult to determine visually on the ROC-diagram. As a result, the HSS is calculated for each probability threshold increment, and the probability threshold value corresponding to the maximum HSS is considered to be optimal.
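The two score-based selection rules can be sketched as a maximization over 1%-step thresholds. This is a hedged illustration with toy data; the HSS formula used is the standard 2x2 contingency-table form, and the sketch assumes the sample contains both occurred and did-not-occur events:

```python
# Hedged sketch: pick the threshold that maximizes TSS (vertical distance
# above the PoD = PoFD diagonal) or HSS. Toy data, not the thesis data.

def counts_at(forecasts, t):
    """Contingency counts when forecasts >= t are treated as 'yes'."""
    hit  = sum(1 for p, o in forecasts if p >= t and o)
    miss = sum(1 for p, o in forecasts if p < t and o)
    fa   = sum(1 for p, o in forecasts if p >= t and not o)
    cn   = sum(1 for p, o in forecasts if p < t and not o)
    return hit, miss, fa, cn

def tss(hit, miss, fa, cn):
    """True Skill Statistic = PoD - PoFD."""
    return hit / (hit + miss) - fa / (fa + cn)

def hss(hit, miss, fa, cn):
    """Heidke Skill Score: accuracy relative to random chance."""
    num = 2.0 * (hit * cn - miss * fa)
    den = (hit + miss) * (miss + cn) + (hit + fa) * (fa + cn)
    return num / den

def optimal_threshold(forecasts, score):
    """Smallest 1%-step threshold attaining the maximum score."""
    return max(range(1, 101), key=lambda t: score(*counts_at(forecasts, t)))

data = [(80, True), (60, True), (55, False), (30, True), (20, False), (5, False)]
print(optimal_threshold(data, tss), optimal_threshold(data, hss))
```

Because `max` returns the first maximizer, ties between thresholds resolve to the lowest probability; other tie-breaking conventions are equally defensible.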

Figure 6: ROC diagram constructed for a given wind speed criteria and forecast time interval. The PoD and PoFD (false alarm rate) are plotted for threshold values ranging from 0% to 100% in 1% increments. The yellow star (21%) identifies the optimal threshold probability based on maximum PoD and minimal PoFD (see text for details).

Figure 7: Same as Figure 5. The yellow star (23%) identifies the optimal threshold probability based on the maximum TSS.

Using any of the aforementioned approaches to identify an optimal threshold, the probabilities are reclassified, using Table 3 (replacing "threshold" with "optimal threshold"), as a Hit, Miss, FA, or CN. This process is repeated for each wind speed criteria and forecast time interval.

Table 3: Contingency table showing the classification of each probability forecast. Note: False Alarm (FA), Correct Negative (CN).

                         Occurred   Did Not Occur
Forecast % ≥ Threshold   Hit        FA
Forecast % < Threshold   Miss       CN

2.2.2.2.1: Bootstrap Technique

Here we use what is referred to as the bootstrap method to assess the robustness of the PoD and PoFD values computed to construct the ROC diagram. By selecting multiple subsamples from the original data set, with replacement, and then recalculating the PoD and PoFD values for each bootstrap subsample, confidence intervals are calculated for each data point to show the uncertainty in the ROC curves (Cheng 1995). By "with replacement" we mean that each time a subsample is selected, the corresponding PoD and PoFD are calculated and recorded, and the subsample is then replaced in the original data set before the next subsample is drawn. Confidence intervals were calculated for both the PoD and PoFD, determining the uncertainty in the x and y directions. The R verification program, version 2.6.2, developed by the R Foundation for Statistical Computing, was used to calculate the confidence intervals with an alpha of 0.05 for 1000 iterations. Examples of uncertainty in the ROC diagrams are shown and discussed in more detail later (see section 3.1). These confidence intervals prove valuable in the selection of optimal thresholds and the interpretation of the verification statistics calculated to gauge the performance of the WPFP (see Table 4).

Table 4: Description of verification statistics

PoD = Hits / (Hits + Misses)
  Range 0 to 1; perfect is 1. Fraction of observed events that were correctly forecast.

PoFD = FA / (CN + FA)
  Range 0 to 1; perfect is 0. A measure of the product's ability to forecast non-events.

False Alarm Ratio = FA / (Hits + FA)
  Range 0 to 1; perfect is 0. A measure of the product's ability to forecast events.

Accuracy = (Hits + CN) / Total
  Range 0 to 1; perfect is 1. Fraction of events that were correctly forecast.

Threat Score = Hits / (Hits + Misses + FA)
  Range 0 to 1; perfect is 1. A measure of how well the forecast yes events correspond to the observed events.

Bias Score = (Hits + FA) / (Hits + Misses)
  Range 0 to infinity; perfect is 1. Indicates if the product under forecasts (Bias < 1) or over forecasts (Bias > 1).

True Skill Statistic = PoD − PoFD
  Range −1 to 1; perfect is 1, 0 indicates no skill. A measure of how well the product distinguishes observed events from non-observed events.

Heidke Skill Score = 2·C_ratio·(PoD − PoFD) / ((C_ratio·PoFD + PoD)(C_ratio − 1) + C_ratio + 1), where C_ratio = Total Observed Yes / Total Observed No
  Range −1 to 1; perfect is 1, 0 indicates no skill. Presents the skill as a percentage improvement over the skill expected due to random chance.
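The Table 4 metrics can be computed from a single contingency table of Hits, Misses, FAs, and CNs; a minimal Python sketch (the function name is illustrative, and the HSS is written in the standard 2x2 contingency form, which is algebraically equivalent to the PoD/PoFD form quoted in Table 4):

```python
def verification_stats(hits, misses, fa, cn):
    """Table 4 verification statistics for one 2x2 contingency table."""
    total = hits + misses + fa + cn
    pod = hits / (hits + misses)          # fraction of observed events forecast
    pofd = fa / (cn + fa)                 # false alarm rate
    return {
        "PoD": pod,
        "PoFD": pofd,
        "FAR": fa / (hits + fa),          # false alarm ratio
        "Accuracy": (hits + cn) / total,
        "ThreatScore": hits / (hits + misses + fa),
        "Bias": (hits + fa) / (hits + misses),
        "TSS": pod - pofd,
        # Standard 2x2 HSS, algebraically equivalent to the Table 4 form:
        "HSS": 2.0 * (hits * cn - fa * misses)
               / ((hits + misses) * (misses + cn) + (hits + fa) * (fa + cn)),
    }
```

The function returns a dictionary keyed by statistic name, so a single contingency table yields every Table 4 metric at once.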

2.2.2.3: Reliability and Frequency Distribution Diagrams

Reliability diagrams are frequently used for assessing probability forecasts of binary events (Wilks 1995). To construct one of these diagrams, forecast probabilities ranging from 0 to 100%, generally placed in 10% interval bins, are plotted on the x-axis, and the observed relative frequency of event occurrence is calculated, per bin, and plotted on the y-axis. Figures 8 A-F show a series of hypothetical reliability diagrams demonstrating observed frequency as a function of forecast probability for various outcomes one may expect (Hamill 1997). Where the reliability plot falls in relation to the diagonal line (along which the forecast probability is equivalent to the observed frequency) indicates whether the product is under-forecasting (above the diagonal) or over-forecasting (below the diagonal). In theory, a perfect forecast system will produce forecasts whose probabilities are consistent with the observed frequencies (Hartmann 2002). Frequency distributions (FD), also known as sharpness diagrams, are shown in the corner of each diagram. FD plots display the forecast probability from 0 to 100% on the x-axis and the number of times each probability was forecast on the y-axis (Jolliffe et al. 2003).
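The reliability curve and its companion frequency distribution can be sketched as follows (an illustrative sketch; the half-open bin convention is an assumption, since the text specifies only "10% interval bins"):

```python
import numpy as np

def reliability_curve(probs, observed, bin_width=10):
    """Observed relative frequency and forecast counts per probability bin.

    probs    : forecast probabilities in percent (0-100)
    observed : 1 if the event occurred, else 0
    Bins are [lo, hi); the last bin also includes 100%.
    """
    probs = np.asarray(probs, dtype=float)
    observed = np.asarray(observed, dtype=float)
    edges = np.arange(0, 100 + bin_width, bin_width)
    centers, obs_freq, counts = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & ((probs < hi) | (hi >= 100))
        n = int(np.sum(in_bin))
        centers.append((lo + hi) / 2.0)
        counts.append(n)                  # sharpness / frequency distribution
        obs_freq.append(observed[in_bin].mean() if n else float("nan"))
    return centers, obs_freq, counts
```

Plotting `obs_freq` against `centers` gives the reliability curve, and `counts` supplies the inset frequency distribution.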

Figure 8: Hypothetical reliability diagrams showing observed relative frequency as a function of forecast probability for (a) climatological forecasts, (b) forecasts exhibiting minimal resolution, (c) forecasts showing an under-forecasting bias, (d) forecasts showing good resolution at the expense of reliability, (e) reliable forecasts of a rare event, and (f) a verification dataset limited by small sample size. Inset boxes are frequency distributions (Wilks 1995).

In addition to the ROC curves and the statistics listed in Table 4, reliability and frequency distribution diagrams were constructed for all wind speed criteria and forecast time intervals to help evaluate the performance of the WPFP (see section 3.3.2 for more detail). These diagrams do not require the use of a defined threshold value for product evaluation but instead provide a direct comparison of the forecast probabilities to the observed frequencies.

3. RESULTS

The work presented herein focuses on the performance of the National Hurricane Center's Tropical Cyclone Wind Speed Probability Forecast Product (WPFP) for land-threatening and land-falling storms. Recall, by land-threatening we are referring to storms that might make landfall, or for which a possibility exists that coastal cities in our area of interest will be affected by tropical storm (34-63 kt, 17-32 m/s) or hurricane (64+ kt, 33+ m/s) force winds. Assessment includes direct evaluation of the product's probabilities as well as the use of skill scores developed for binary forecasts. Relative Operating Characteristics (ROC) diagrams are also constructed to assess the quality of the forecast system based on the position of the curve (see section 3.1 for more detail). Optimal thresholds are then selected to transform the probabilities into binary (yes/no) forecasts.

3.1: Relative Operating Characteristics (ROC) Diagrams


Figure 9: Relative Operating Characteristics (ROC) diagrams for the Cocoa Beach, FL AOI for the 34 kt, 50 kt, and 64 kt wind radii with X forecasts (A, B, C respectively) and for the 34 kt, 50 kt, and 64 kt wind radii without X forecasts (D, E, F respectively). See text for details.

As previously mentioned, ROC diagrams were constructed for each wind speed criteria and forecast time interval for a given area of interest (AOI). Figures 9.A-F show these diagrams for the Cocoa Beach, FL AOI. To evaluate the skill of the forecast system based on the ROC diagram, we are interested in where the ROC curve lies in relation to the PoD = PoFD line (indicated by the bold black line on the diagram). Because there is skill only when the hit rate (PoD) exceeds the false alarm rate (PoFD), the ROC curve will lie above the 45 degree line (PoD = PoFD) if the forecast system is skillful, and the area under the curve will exceed 0.5. If the curve lies below the line, negative skill is indicated (Mason et al. 1999).

The ROC diagram in Figure 9.A shows a ROC curve for the 34 kt (17 m/s) wind speed criteria at each of the seven forecast time intervals (12, 24, 36, 48, 72, 96, and 120 h). As the forecast time interval increases, the ROC curve approaches the PoD = PoFD (no skill) line. As expected, the area under the curve decreases from 0.97 to 0.76 between the 12 h and 120 h forecast time intervals. This trend is evident in all ROC curves within this AOI, both when X forecasts are included and when they are not, suggesting that as the forecast time interval increases, the skill of the forecast system decreases for all wind speed criteria. In order to evaluate the robustness of the ROC curves, and to distinguish one curve as significantly different from another, we incorporate an estimate of the uncertainty via the bootstrap technique (see section 2.2.2.2.1). If the confidence intervals do not overlap, the curves are significantly different from one another. Figures 10.A and 10.B show the uncertainty in the ROC curves for the 34 kt wind speed criteria for the Cocoa Beach, FL AOI with and without X forecasts, respectively. In both scenarios, although not as distinct in the earlier forecast time intervals (12-48 h), the ROC curves are significantly different; thus there is a clear deterioration in skill as the forecast time interval increases. Similar results exist for all wind speed criteria and additional AOIs.
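The resampling behind these confidence intervals can be sketched as follows (an illustrative percentile-bootstrap sketch; the study itself used the R verification program, so the function and variable names here are assumptions):

```python
import numpy as np

def bootstrap_roc_ci(probs, observed, threshold, n_iter=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence intervals for (PoD, PoFD) at one threshold."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    n = len(probs)
    pods, pofds = [], []
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=n)          # resample pairs with replacement
        p, o = probs[idx], observed[idx]
        yes = p >= threshold
        hits, misses = np.sum(yes & o), np.sum(~yes & o)
        fa, cn = np.sum(yes & ~o), np.sum(~yes & ~o)
        if hits + misses:
            pods.append(hits / (hits + misses))
        if cn + fa:
            pofds.append(fa / (cn + fa))
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]  # e.g. 2.5th and 97.5th percentiles
    return np.percentile(pods, q), np.percentile(pofds, q)
```

With alpha = 0.05 the two returned pairs bound the PoD (vertical) and PoFD (horizontal) uncertainty of one ROC data point, as plotted in Figures 10 and 11.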

Figure 10: 34 kt ROC diagrams with confidence intervals plotted using the bootstrap technique for the Cocoa Beach, FL AOI: A) with X forecasts, B) without X forecasts. 12 h: dark blue, 24 h: red, 36 h: green, 48 h: purple, 72 h: light blue, 96 h: orange, 120 h: magenta.

Figures 9.A and 9.D show the ROC diagrams for the 34 kt (17 m/s) wind speed criteria with X forecasts included and without, respectively. Comparing the two diagrams for the same forecast time intervals, it appears that the ROC curves in Figure 9.D have less skill than the curves for the same forecast time interval in Figure 9.A. To determine whether or not this is the case, we consider the confidence intervals of the ROC curves. As an example, consider the 34 kt, 12 h forecast time interval ROC curves with and without X forecasts (Figure 11). It appears the two curves are different in what might be considered a critical part of the diagram, i.e., towards the upper left hand corner (zoomed in), but are the same elsewhere.

Figure 11: ROC curves with confidence intervals plotted for the 34 kt, 12 h forecast time interval. The blue line is with X forecasts included, the red is without.

When the X forecasts are included, the PoD is greater for this data set due to the greater number of CN events. In some cases this will affect the selection of the optimal threshold value as a result of the change in the distribution of the points that constitute the ROC curve. The data points shift towards the bottom left corner, with the large number of CNs skewing the data set (see Figure 12).

Figure 12: Illustration of the effect of X forecasts on ROC curves and optimal threshold selection for the visual optimal selection method. Blue: with X, red: no X, for the Cocoa Beach, FL AOI 34 kt, 24 h forecast time interval.

Consider another impact of the CN forecasts on the ROC curves. When comparing the 64 kt (33 m/s), 120 h forecast time interval curves in Figures 9.C and 9.F, the curve in 9.C is skillful: the curve lies above the PoD = PoFD line and the area under the curve is greater than 0.5 (the area is 0.63). However, when a majority of the CN forecasts are removed, the curve has negative skill (see Figure 9.F). The CN forecasts affect the quality of the forecast system (based on the interpretation of the ROC diagram discussed in section 2.2.2.1) and affect the optimal thresholds selected, depending on which data set, X or No X, is evaluated. All wind speed criteria and AOIs exhibited similar trends. As expected, the skill of the forecast system decreases as the forecast time interval increases, and the overall skill of the system is lower when the X forecasts are removed from the data set (Figures 9.D, E, F). The comparison of the two data sets may suggest that excluding the X forecasts gives a more accurate representation of the skill of the forecast system by eliminating the effects of the large number of CN forecasts. Note, however, that by excluding the X forecast events, there are occasional instances when missed forecasts (i.e., events forecast not to occur that did occur), which should be included in the evaluation of the product's performance, are eliminated. ROC diagrams for all remaining AOIs are shown in the appendices.

3.2: Optimal Thresholds

As previously discussed, one method that can be used to identify an optimal threshold value is to choose a value that minimizes false alarms and maximizes the number of hits. In this approach, the highest skill (i.e., optimal threshold) is represented by the ROC data point that is closest to the upper left corner of the diagram. Note, however, that this is not the only method for selecting the optimal threshold. Many skill scores exist from which one can select an optimal threshold. In addition to the various skill scores, an optimal threshold can be based on cost/benefit/risk analyses (not conducted here) that are tailored to the specific needs of the customer (e.g., NASA's interest in protecting the space shuttle). For this study, we include threshold value estimates based on the optimization of the TSS and HSS. Many discussions exist regarding which metric is best. The TSS is widely used in verification statistics and is recommended as a proper formulation of a skill score (Murphy and Daan 1985). However, Doswell et al. (1990) argue that the TSS has difficulty dealing with cases involving the forecast of rare events and thus recommend the HSS as a better metric. In rare event situations, the number of CN forecasts may be much larger than the numbers for the other quantities in the contingency table (e.g., Table 3) (Jolliffe 2003). This results in potentially misrepresentative values of PoD and PoFD, thereby affecting the resulting ROC curves and optimal threshold selection. Recall, the HSS is designed to account for the proportion of forecasts that would have been correct by

chance in the absence of skill. As a result, the HSS may be a better metric from which to estimate the PoD and PoFD and thus an improved optimal threshold. A measure of the sensitivity of the optimal threshold to the threshold selection method and data set was determined by evaluating the consistency of the optimal threshold values as a function of time, wind speed, and AOI. The three selection methods are as follows: Visual -- selecting the data point closest to the upper left corner of the diagram, maximizing PoD and minimizing PoFD; TSS -- selecting the data point with the maximum vertical distance to the PoD = PoFD line; and HSS -- selecting the probability threshold corresponding to the maximum HSS value.

Table 5: Maximum probability forecast as a function of wind speed criteria and forecast time interval issued for the Cocoa Beach, FL AOI. All hurricane seasons 2004-2007 included.

Max Probability Forecast
Forecast Interval   34 kt   50 kt   64 kt
12 h                 100     100     100
24 h                  99      94      71
36 h                  92      71      47
48 h                  76      46      28
72 h                  48      25      13
96 h                  40      22      11
120 h                 20      10       6

Table 6: Optimal thresholds selected for the Cocoa Beach, FL AOI A) with X forecasts and B) without X forecasts. Results from each method of selection are listed.

A
                 34 kt               50 kt               64 kt
Forecast Int.    Visual  TSS  HSS   Visual  TSS  HSS   Visual  TSS  HSS
12 h              6      10   50     1       1   21    11      11   14
24 h              8       8   51     5       5   25     1       5   32
36 h              8       8   35     8       8   20     4       4   21
48 h             15      15   33     9       7   20     9      10   18
72 h              9      10   28     5       5   10     2       2    6
96 h              8       9   17     3       3    7     1       1    2
120 h             1       1   12     1       2    4     1       1    2

B
                 34 kt               50 kt               64 kt
Forecast Int.    Visual  TSS  HSS   Visual  TSS  HSS   Visual  TSS  HSS
12 h             16      23   50    12      15   21    14      14   14
24 h             17      10   51     9       6   25     8      19   32
36 h             14      14   33    16      16   20     9       9   21
48 h             16      16   33    14       9   20    10       9   14
72 h             12      12   28     5       5   10     2       2    6
96 h              9       9   17     4       3    7     2       2    2
120 h            12      12   12     2       2    4     1       6    6

Figure 13.A: Optimal thresholds as a function of forecast time interval selected for Cocoa Beach, FL AOI for the 34 kt wind speed criteria.

Figure 13.B: Same as 13.A for the 50 kt wind speed criteria.

Figure 13.C: Same as 13.A for the 64 kt wind speed criteria.

Table 5 lists the maximum probabilities forecast for each wind speed criteria and forecast time interval for the Cocoa Beach, FL AOI. Notice that as the forecast time interval increases, the maximum forecast probability decreases monotonically. This is an expected trend given the nature of the forecast system; there is less skill at increased lead times. Similarly, one might expect that as the forecast time interval increases, the value of the optimal threshold will decrease monotonically. However, Tables 6.A and 6.B both indicate the expected trend is not evident, regardless of data set and optimal threshold selection method. When the X forecasts are included in the data set, the visual and TSS methods indicate the values increase as the forecast approaches the 48 h interval and then decrease as expected. The HSS shows a similar trend; however, its values begin to decrease at the 24 h forecast time interval. Regardless of data set, the optimal thresholds behave similarly, although the monotonic trend is weak when the X forecasts are removed. This is particularly true for the visual and TSS selection methods for the 34 kt wind speed criteria, where the optimal threshold values change very little regardless of forecast time interval. Similarly, optimal thresholds selected using the HSS method show little variation for both data sets as well as for all wind speeds and forecast time intervals. The behavior of the optimal thresholds as a function of time is consistent for all wind speed criteria (see Figures 13.A, B, and C for a graphical representation of the optimal thresholds). Ultimately, the visual and TSS optimal threshold selection methods are sensitive to the inclusion/exclusion of X forecasts in the data set. The

HSS selection method is more consistent regardless of data set and instills more confidence when making yes/no based operational decisions. For the early forecast time intervals (12-36 h), the visual and TSS methods returned higher optimal threshold values when X forecasts were excluded from the data set. For the later forecast time intervals (48-120 h), the values were similar. Note, however, there are some exceptions. Consider, for example, the 34 kt, 120 h forecast time interval optimal thresholds (see Tables 6.A and 6.B). When X forecasts are included in the data set, the visual and TSS methods have an optimal threshold of 1% while the HSS is at 12%. If the X forecasts are removed from the data set, all three methods exhibit the same 12% optimal threshold. From a decision maker's perspective, 1% as an optimal threshold is not as valuable as 12%. A value of 1% is at the extreme end of all possible threshold values and does not instill much confidence in decision making situations.

Figure 14.A: PoD, PoFD, TSS, and HSS as a function of probability forecast for the 34 kt wind speed criteria at the 120 h forecast time interval with X forecasts included in the data set.

Figure 14.B: Same as Figure 14.A without X forecasts in the data set.

The choice of data set, with X forecasts or without, has a strong influence on the optimal threshold selection using the visual and TSS methods, especially in the earlier forecast intervals. The HSS accounts for the proportion of forecasts that would have been correct by chance in the absence of skill. Figure 14 shows the HSS values calculated for each threshold, with and without X forecasts respectively, for the 34 kt wind speed criteria and 12 h forecast time interval. Comparing the two curves, the HSS does not change much regardless of whether or not the X forecasts are

included in the data set. Notice that, as a function of probability threshold, the HSS remains relatively constant from 15% to 80%; no distinct peak in the HSS exists to unambiguously determine the optimal threshold. For this wind speed and forecast time interval, the optimal threshold based on the HSS was found to be 50%. This threshold sits in the middle of the flat area of HSS values and has some uncertainty associated with it, as do all threshold values. To account for this uncertainty, the HSS was recalculated with the PoD and PoFD associated with the lower vertical confidence interval at the 50% threshold (see Figure 11). The values of PoD and PoFD at this confidence interval are 0.71 and 0.01, resulting in an HSS of 0.70. This suggests any probability threshold with an HSS value greater than 0.70 could just as easily have been labeled the optimal threshold when considering the uncertainty in the skill score (see Figure 15).

Figure 15: HSS (y-axis) as a function of probability forecast for the 34 kt wind speed criteria at the 12 h forecast time interval with X forecasts (purple) and without (green). The black line is HSS = 0.70, calculated from the PoD and PoFD based on the lower vertical confidence interval. See text for details.

Optimal thresholds were selected for each wind speed criteria and forecast time interval using three different selection methods (visual, TSS, and HSS). The sensitivity of the optimal thresholds to threshold selection method and data set (with X or no X) was examined by evaluating the consistency of the thresholds obtained for each. As stated previously, the thresholds were expected to decrease monotonically as the forecast time interval increased. Ultimately, the TSS proved sensitive to the X forecasts while the HSS remained consistent.

In some cases there are distinct peaks in the skill (TSS and HSS) as a function of probability forecast, thereby indicating a clear optimal threshold selection (see Figures 14.A and 14.B). However, there are also flat areas in the various skill scores, implying the exact optimal threshold is not as clear. Although the TSS and HSS are both frequently used as forecast skill metrics, there is quite a difference in their characteristics and resulting optimal threshold values. Haklander et al. (2003) suggest a new skill score combining both the TSS and HSS. This score, the Normalized Skill Score (NSS), is defined as follows: NSS(%) = 0.5 * ( TSS(%) / TSS_max + HSS(%) / HSS_max ), where TSS_max and HSS_max are the maximum TSS and HSS values, respectively, for any probability threshold (%). The maximum values of TSS and HSS do not necessarily occur at the same probability threshold (Haklander 2003). The NSS may return optimal thresholds where the peak in the TSS and the flat areas associated with the HSS overlap, indicating a more accurate optimal threshold. It is recommended, for future work, that the NSS be applied as an additional metric for determining optimal thresholds.
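The NSS formula above is a simple average of the two skill curves after each is scaled by its own maximum; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def normalized_skill_score(tss, hss):
    """Haklander et al. (2003) NSS: average of max-normalized TSS and HSS.

    tss, hss : skill values, one entry per probability threshold.
    """
    tss, hss = np.asarray(tss, dtype=float), np.asarray(hss, dtype=float)
    return 0.5 * (tss / tss.max() + hss / hss.max())
```

The threshold at which the returned curve peaks would serve as the NSS-based optimal threshold.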

3.3: Product Evaluation

3.3.1: Optimal Threshold Based Statistics

Heretofore we have presented an evaluation of the WPFP based on two data sets (one with X forecasts, one without) and a variety of verification statistics. Given the behavior and sensitivity of the optimal thresholds discussed above, the product performance is now gauged solely using thresholds derived from the no X forecast data set. The resulting performance metrics (see Table 4) are based on the optimal thresholds selected by optimizing two commonly used skill scores, the TSS and HSS.

Figure 16.A: Accuracy vs. forecast time interval based on the optimal thresholds selected using the TSS method per wind speed criteria (34 kt blue, 50 kt red, 64 kt green) and forecast time interval for the Cocoa Beach, FL AOI.

Figure 16.B: Same as 16.A using the HSS optimal threshold selection method.

Figure 16.C: Accuracy vs. forecast time interval based on the optimal thresholds selected using the TSS (blue) and HSS (red) methods for the 34 kt wind speed criteria per forecast time interval for the Cocoa Beach, FL AOI. Black line depicts hypothetical accuracy if the forecasts were always no.

Figure 17.A: Bias Score (Table 4; Bias = (Hits + FA) / (Hits + Misses)). Same criteria as 16.A

Figure 17.B: Bias Score (Table 4; Bias = (Hits + FA) / (Hits + Misses)). Same criteria as 16.B

Figure 18.A: Probability of Detection (Table 4; PoD = Hits / (Hits + Misses)). Same criteria as 16.A

Figure 18.B: Probability of Detection (Table 4; PoD = Hits / (Hits + Misses)). Same criteria as 16.B

Figure 19.A: Probability of False Detection (Table 4; PoFD = FA / (CN + FA)). Same criteria as 16.A

Figure 19.B: Probability of False Detection (Table 4; PoFD = FA / (CN + FA)). Same criteria as 16.B

Figure 20.A: Threat Score (Table 4; Threat Score = Hits / (Hits + Misses + FA)). Same criteria as 16.A

Figure 20.B: Threat Score (Table 4; Threat Score = Hits / (Hits + Misses + FA)). Same criteria as 16.B

Figure 21.A: False Alarm Ratio (Table 4; FAR = FA / (Hits + FA)). Same criteria as 16.A

Figure 21.B: False Alarm Ratio (Table 4; FAR = FA / (Hits + FA)). Same criteria as 16.B

Figure 22.A: True Skill Statistic (Table 4; TSS = PoD − PoFD). Same criteria as 16.A

Figure 22.B: True Skill Statistic (Table 4; TSS = PoD − PoFD). Same criteria as 16.B

Figure 23.A: Heidke Skill Score (Table 4; HSS = 2·C_ratio·(PoD − PoFD) / ((C_ratio·PoFD + PoD)(C_ratio − 1) + C_ratio + 1), where C_ratio = Total Observed Yes / Total Observed No). Same criteria as 16.A

Figure 23.B: Heidke Skill Score (Table 4; HSS = 2·C_ratio·(PoD − PoFD) / ((C_ratio·PoFD + PoD)(C_ratio − 1) + C_ratio + 1), where C_ratio = Total Observed Yes / Total Observed No). Same criteria as 16.B

Accuracy, also known as percent correct, is one of several commonly used forecast verification statistics (see Table 4). This statistic, by definition, is greatly influenced by the number of CN forecasts and can become inflated when the data set is dominated by them. Accuracy values as a function of forecast time interval are depicted for the TSS and HSS optimal threshold selection methods in Figures 16.A and 16.B, respectively, for the Cocoa Beach, FL AOI. Comparing the two figures, the HSS based accuracy is higher, as a function of forecast interval and wind speed criteria, than the TSS based results. Regardless of forecast time interval and wind speed criteria, the HSS based accuracy remains around 90% while the TSS based accuracy fluctuates between 70% and 90%. However, both selection methods return values of accuracy less than what would result if the simplistic "always no" forecast were issued every time. For example, consider the 34 kt wind speed criteria at each forecast time interval. The accuracy is recalculated for an "always no" forecast and is plotted against the original accuracy calculated using the TSS and HSS selection methods (see Figure 16.C). Both the TSS and HSS methods show accuracy below the skill line of the simplistic forecast. The problem is that accuracy does not penalize the simplistic forecast for missed forecasts. Overall, accuracy can give a false sense of skill in the forecast system and is not a reliable single performance metric. Although both optimal threshold selection methods returned values of accuracy lower than if the forecast was always "no", other performance metrics show some skill in the forecast system

(discussed later). The remaining metrics perform poorly if the forecast is always "no". Figures 17.A and 17.B show the results of the bias score for the WPFP based on the TSS and HSS optimal threshold selection methods, respectively. Recall, the bias score (perfect score is 1.00) is an indication of how much the forecast system over or under forecasts events (see Table 4). Comparing the bias scores between the two optimal threshold methods, larger values are found when the TSS is optimized. Bias scores ranged from 0.00 to 8.00, generally increasing as the forecast time interval increased, indicating that the forecast system is over forecasting events (i.e., bias is greater than 1.00). Using the HSS optimal threshold selection method, bias values ranged between 1.00 and 2.00 and increased with forecast time interval. The bias score indicates the forecast system over forecasts regardless of threshold selection method. Based on the bias metric, the HSS is a better score with which to select the optimal thresholds. However, it is clear that it is not sensible to base performance on a single metric. Consider the PoD and PoFD per wind speed criteria and forecast time interval for each optimal threshold selection method (see Figures 18.A, B and 19.A, B). Based on the optimization of the TSS, the WPFP over forecasts more (greater bias scores), producing higher values of PoD and PoFD. Conversely, the PoD and PoFD are lower using the HSS method. There are, however, some outliers worth discussing. For example, Figures 18.A and 18.B show a significant decrease in PoD at the 120 h forecast time interval. This

behavior is due to the distribution of Hits, Misses, FAs, and CNs within the data set. If we examine the 64 kt, 120 h forecast time interval, no Hits and only one Miss exist. The number of occurred events is low for this wind speed and forecast time interval, indicating the sample size is too small. Recall from Table 4 that PoD = Hits / (Hits + Misses), so PoD = 0.00 at this wind speed and forecast time interval. Figure 19.B shows a distinct peak for the 64 kt wind speed criteria at the 96 h forecast time interval. Again, this is due to the number of Hits, Misses, FAs, and CNs in the data set. Recall from Table 4 that PoFD = FA / (CN + FA). At the 96 h forecast time interval, the number of FAs is considerably higher than at the remaining forecast time intervals (see Table 7).
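The PoFD values follow directly from the FA and CN counts; as a quick worked check (counts taken from the first two rows of Table 7):

```python
# Check PoFD = FA / (CN + FA) against two rows of Table 7 (64 kt criteria).
def pofd(fa, cn):
    return fa / (cn + fa)

# 12 h row: 6 FAs, 35 CNs -> 0.15 (rounded to two decimals)
# 24 h row: 8 FAs, 98 CNs -> 0.08
print(round(pofd(6, 35), 2), round(pofd(8, 98), 2))
```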

Table 7: The distribution of FAs and CNs for the 64 kt wind speed criteria according to the optimal threshold (%) selected per forecast interval for the Cocoa Beach, FL AOI. Optimal thresholds (%) according to the HSS selection method are in parentheses next to the forecast time interval.

Forecast Time Interval   False Alarms (FA)   Correct Negatives (CN)   PoFD
12 h (14)                  6                   35                     0.15
24 h (32)                  8                   98                     0.08
36 h (21)                  9                  195                     0.04
48 h (14)                 16                  253                     0.06
72 h (6)                  38                  404                     0.09
96 h (2)                 155                  396                     0.30
120 h (6)                  0                  527                     0.00

Table 8: The distribution of Hits, Misses, FAs, and CNs for the 34 kt wind speed criteria according to the optimal threshold (%) selected per forecast interval for the Cocoa Beach, FL AOI. Optimal thresholds (%) according to the TSS selection method are in parentheses next to the forecast time interval.

Forecast Time Interval   Hit   Miss   FA    CN     Threat Score
12 h (23)                 96    11     24    380   0.73
24 h (10)                 34     1    165    501   0.17
36 h (14)                 32     3    139    698   0.18
48 h (16)                 23     3    112    901   0.17
72 h (12)                 65    15    209   1102   0.22
96 h (9)                  61    21    310   1031   0.16
120 h (12)                26    57    105   1146   0.14

Table 9: Same as Table 8 according to the HSS optimal threshold selection method.

Forecast Time Interval   Hit   Miss   FA    CN     Threat Score
12 h (50)                 83    24      5    399   0.74
24 h (51)                 18    17     38    628   0.25
36 h (33)                 24    11     58    779   0.26
48 h (33)                 16    10     53    960   0.20
72 h (28)                 38    42     39   1272   0.32
96 h (17)                 39    43    143   1198   0.17
120 h (12)                26    57    105   1146   0.14

Figures 20.A and 20.B show the threat score, commonly referred to as the Critical Success Index, for each wind speed criteria and forecast time interval for the Cocoa Beach, FL AOI. This statistic is a measure of how well the forecast-to-occur events and the observed events correspond to one another (see Table 4). Unlike the TSS, HSS, and accuracy, the threat score is not influenced by CN forecasts and focuses only on the forecast-to-occur events (Schaefer 1990). The threat score is expected to decrease as the forecast time interval increases. Recall from the bias score that the forecast system over forecasts; this is especially true at longer forecast time intervals. This suggests the number of FA forecasts will increase as the forecast time interval increases. Consider, for example, the 34 kt wind speed criteria; Tables 8 and 9 show the distribution of Hit, Miss, FA, and CN forecasts at all forecast time intervals for this wind speed. Generally, as the forecast time interval increases, there is an increase in the number of FA forecasts. In addition, the false alarm ratio (see Table 4) was calculated for all forecast time intervals and reflects the increasing trend in FA

forecasts (see Figures 21.A and 21.B). Similar results exist for all wind speed criteria. Notice there is a significant decrease in threat score, and a similar increase in false alarm ratio, between the 12 h and the remaining forecast time intervals. These differences are driven by the lower number of FA forecasts at the 12 h lead time and the comparatively high number of FA forecasts at all additional forecast time intervals. Recall, the threat score ranges from 0 (all forecasts are incorrect) to 1 (all forecasts are correct). Values of the threat score for the Cocoa Beach, FL AOI were found to be lower than 0.50, with the exception of the 12 h forecast time interval. These scores are seemingly low; however, they are similar to results from studies performed by Kuligowski et al. (2001) and McDonald et al. (2000) and are considered good performance values of threat score. The TSS and HSS have been utilized throughout this study to select optimal threshold values. Based on the optimal thresholds, these scores can also be used as measures of skill to gauge the performance of the WPFP. Recall, the TSS is a measure of how well the product distinguishes observed events from non-observed events, while the HSS presents the skill of the forecast system as a percentage improvement over the skill expected due to random chance (Table 4). Figures 22.A, B and 23.A, B show the TSS and HSS, respectively, as a function of forecast time interval for the Cocoa Beach, FL AOI. Regardless of optimal threshold selection method, the TSS shows a gradual decrease in skill as the forecast lead time increases. This trend is expected due to the nature of a forecast system. Regardless of optimal threshold selection

method, values of TSS remain between 0.50 and 0.85 in the early forecast time intervals (12 h to 48 h). These values are considered desirable based on results from Beven (2008), in which TSS also ranged between 0.50 and 0.85. The skill of the forecast system is still within the desirable range for the 34 kt and 50 kt wind speeds at the 72 h interval; however, all wind speed criteria drop below 0.50 in the 96 h and 120 h forecast time intervals. Values of HSS also decrease as forecast time increases, regardless of optimal threshold selection method. Notice there is a significant decrease in skill from the 12 h to the 24 h forecast time interval (see Figures 23.A and 23.B). This is driven by the comparatively low number of CN and FA forecasts and the high number of Hit forecasts produced by the optimal threshold selected at the 12 h forecast interval. Overall, the resulting HSS values convey skill greater than that expected due to random chance for the 12 h to 72 h forecast time intervals. Values of HSS remain above 0.30, with the exception of the 64 kt wind speed based on the TSS optimal threshold selection, and the forecast system is considered skillful (Chu et al. 2007). The remaining forecast time intervals show little to no improvement in skill compared to random chance. Arguments have been made supporting both TSS and HSS as the best metric for selecting optimal thresholds. However, the choice is left to the user depending on their individual needs. Some, for example, may prefer optimizing based on TSS due to the higher values of PoD; FA forecasts may not be a

concern. Conversely, some users may prefer the HSS method due to its lower PoFD and bias scores, interpreting these as indicating better performance. The objective of this project is not to decide for the user which method is best but, instead, to gauge the performance of the WPFP with each commonly used metric.

3.3.2: Frequency Distribution and Reliability Diagrams

In addition to the optimal threshold based statistics used to gauge the performance of the WPFP, frequency distribution and reliability diagrams were constructed to assess the probability forecasts. These diagrams do not require a defined threshold value for product evaluation; instead, they directly compare the forecast probabilities to the observed frequency.
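The construction behind both diagram types (grouping the forecast probabilities into 10% bins, counting forecasts per bin for the frequency distribution, and computing the observed event frequency per bin for the reliability curve) can be sketched as follows. This is an illustrative sketch, not the study's code, and the sample data are invented.

```python
# Illustrative sketch (not the study's implementation) of how the
# frequency-distribution and reliability diagrams are built: forecast
# probabilities are grouped into 10% bins, the count per bin gives the
# frequency distribution, and the mean observed outcome per bin gives
# the reliability curve plotted against the bin-mean probability.
import numpy as np

def reliability_table(probs, observed, bin_width=0.10):
    probs = np.asarray(probs, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    edges = np.arange(0.0, 1.0 + bin_width, bin_width)
    # Assign each forecast to a bin; the top edge folds into the last bin.
    idx = np.clip(np.digitize(probs, edges) - 1, 0, len(edges) - 2)
    rows = []
    for b in range(len(edges) - 1):
        in_bin = idx == b
        n = int(in_bin.sum())  # frequency-distribution count for this bin
        if n > 0:
            rows.append((float(probs[in_bin].mean()),     # bin-mean forecast
                         float(observed[in_bin].mean()),  # observed frequency
                         n))
    return rows

# Invented sample: sharp forecasts that verify perfectly.
rows = reliability_table([0.05] * 10 + [0.95] * 10, [0] * 10 + [1] * 10)
```

Plotting the observed frequency against the bin-mean forecast probability, points above the one-to-one line indicate under forecasting and points below it indicate over forecasting, as described in Section 3.3.2.2.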

Figure 24.A: Frequency distribution diagram constructed for the 34 kt, 12 h forecast time interval for the Cocoa Beach, FL AOI

Figure 24.B: Reliability diagram constructed for the 34 kt, 12 h forecast time interval for the Cocoa Beach, FL AOI

Figure 25.A: Same as Figure 24.A for the 24 h forecast time interval

Figure 25.B: Same as Figure 24.B for the 24 h forecast time interval

Figure 26.A: Same as Figure 24.A for the 36 h forecast time interval

Figure 26.B: Same as Figure 24.B for the 36 h forecast time interval

Figure 27.A: Same as Figure 24.A for the 48 h forecast time interval

Figure 27.B: Same as Figure 24.B for the 48 h forecast time interval

Figure 28.A: Same as Figure 24.A for the 72 h forecast time interval

Figure 28.B: Same as Figure 24.B for the 72 h forecast time interval

Figure 29.A: Same as Figure 24.A for the 96 h forecast time interval

Figure 29.B: Same as Figure 24.B for the 96 h forecast time interval

Figure 30.A: Same as Figure 24.A for the 120 h forecast time interval

Figure 30.B: Same as Figure 24.B for the 120 h forecast time interval

3.3.2.1: Frequency Distribution Diagrams

Figures 24.A through 30.A show the frequency distribution diagrams for the Cocoa Beach, FL AOI for the 34 kt wind speed criterion at each forecast time interval. These diagrams indicate the number of times a given probability was forecast. Preferably, at the earlier forecast time intervals, the distribution of the forecasts should be sharp (i.e., the majority of the forecasts should fall at the extremes of the probability range). Notice that at the 12 h forecast time interval the distribution shows a majority of the forecasts are less than 10% or greater than 95%. As the forecast time interval increases, the distribution becomes less sharp and the number

of lower probabilities increases; the 120 h forecast time interval shows all forecasts were of probabilities less than 20%. The frequency distributions show the breakdown of the probabilities within the data set, which aids in the interpretation of the reliability diagrams constructed to evaluate the forecast probability product.

3.3.2.2: Reliability Diagrams

Recall that reliability diagrams are frequently used for assessing probability forecasts of binary events (Wilks 1995). Figures 24.B through 30.B show the reliability diagrams for the Cocoa Beach, FL AOI for the 34 kt wind speed criterion at each forecast time interval. Since the reliability diagrams are constructed with the probabilities in 10% bins, as opposed to plotting the reliability for each individual probability, the desired result (black line, e.g. Figure 24.B) is plotted using the bin mean. The position of the reliability plot relative to the desired result indicates whether the product is under forecasting (above the desired result) or over forecasting (below the desired result). In theory, a perfect forecast system produces forecasts whose probabilities are consistent with the observed frequency (Hartmann et al. 2002). Notice that in the early forecast time intervals, with the exception of the 12 h forecast, the reliability diagrams show the product is over forecasting. At the 12 h forecast time interval, the reliability plot, where the sample size is large, closely

follows the desired result (see Figure 24.B). However, Figures 25.B through 27.B show the reliability plot is well below the desired result line. The later forecast time intervals (72 h to 120 h) show the reliability plot closely follows the desired result, indicating improved reliability and less over forecasting (see Figures 28.B through 30.B). Recall from Section 3.2 that the optimal thresholds selected using the TSS and HSS methods indicated over forecasting (bias greater than 1.00). Given the information derived from the reliability diagrams, this result is not surprising: the product over forecasts in general. If the user chose to remove the bias from the forecast system, the new optimal thresholds would be extremely high and the performance metrics would indicate poor skill. For example, consider the hypothetical optimal thresholds selected by removing the bias for the 34 kt wind speed criterion for the Cocoa Beach, FL AOI (see Table 10).

Table 10: Optimal thresholds selected by optimizing TSS, HSS and Bias for the 34 kt wind speed criterion for the Cocoa Beach, FL AOI.

Forecast Int.   TSS   HSS   Bias
12 h             23    50    31
24 h             10    51    72
36 h             14    33    56
48 h             16    33    55
72 h             12    28    27
96 h              9    17    24
120 h            12    12    14
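A bias-removing selection like the one behind Table 10 can be sketched by sweeping candidate probability thresholds and keeping the one whose frequency bias, (Hit + FA) / (Hit + Miss), is closest to 1.00. This is a sketch under that assumption, with invented data; it is not the study's implementation.

```python
# Sketch: choose the probability threshold (in percent) whose frequency
# bias, (Hit + FA) / (Hit + Miss), is closest to 1.00, i.e. events are
# forecast about as often as they are observed. Illustrative data only.
def bias_optimal_threshold(probs, observed, candidates=range(1, 101)):
    best_t, best_dist = None, float("inf")
    for t in candidates:
        hit = sum(1 for p, o in zip(probs, observed) if p >= t and o)
        fa = sum(1 for p, o in zip(probs, observed) if p >= t and not o)
        miss = sum(1 for p, o in zip(probs, observed) if p < t and o)
        if hit + miss == 0:
            continue  # bias undefined when no events were observed/missed
        dist = abs((hit + fa) / (hit + miss) - 1.0)
        if dist < best_dist:
            best_t, best_dist = t, dist
    return best_t

# Hypothetical forecast probabilities (percent) and observations:
probs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 95]
observed = [False] * 5 + [True] * 5
print(bias_optimal_threshold(probs, observed))  # -> 51
```

In this toy case five events were observed, so the first threshold that yields exactly five "yes" forecasts (51%) removes the bias; the high thresholds in Table 10 arise the same way when the product over forecasts.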

Notice that, with the exception of the 12 h forecast time interval, the optimal thresholds necessary to remove the bias in the forecast system are unreasonably high. This is particularly true for the earlier forecast time intervals (24 h to 48 h). The skill of the forecast system is expected to decrease when the bias is removed because of these high threshold values. With optimal thresholds so high, the product begins to act as though it always forecasts "no." Tables 11.A, 11.B and 11.C show the PoD, TSS and HSS at each forecast time interval for the 34 kt wind speed criterion for the Cocoa Beach, FL AOI. With the exception of the 12 h forecast time interval, the PoD decreases significantly when the bias score is optimized, and the TSS and HSS values decrease accordingly. Note that the PoFD, false alarm ratio and threat score did not show significant differences between the three optimal threshold selection methods. As a result of the significant drop in PoD and the resulting TSS and HSS values, it is not recommended that the threshold values be optimized based on the bias score. The HSS selection method appears to be a sensible middle road between the other two methods. Nevertheless, the choice is left for the user to decide based on individual needs.

Table 11.A: Performance metrics based on the optimal thresholds selected using the TSS method for the 34 kt wind speed criterion for the Cocoa Beach, FL AOI

TSS Method    PoD    TSS    HSS
12 h          0.90   0.84   0.80
24 h          0.97   0.72   0.22
36 h          0.91   0.75   0.26
48 h          0.88   0.77   0.25
72 h          0.81   0.65   0.31
96 h          0.74   0.51   0.20
120 h         0.31   0.23   0.18

Table 11.B: Performance metrics based on the optimal thresholds selected using the HSS method for the 34 kt wind speed criterion for the Cocoa Beach, FL AOI

HSS Method    PoD    TSS    HSS
12 h          0.78   0.76   0.82
24 h          0.51   0.46   0.36
36 h          0.69   0.62   0.38
48 h          0.62   0.56   0.31
72 h          0.48   0.45   0.45
96 h          0.48   0.37   0.24
120 h         0.31   0.23   0.18

Table 11.C: Performance metrics based on the optimal thresholds selected using the Bias method for the 34 kt wind speed criterion for the Cocoa Beach, FL AOI

Bias Method   PoD    TSS    HSS
12 h          0.84   0.80   0.80
24 h          0.26   0.22   0.21
36 h          0.31   0.28   0.28
48 h          0.31   0.29   0.28
72 h          0.48   0.44   0.44
96 h          0.27   0.22   0.23
120 h         0.20   0.15   0.15

4. Summary

The purpose of this project is to evaluate the performance of the National Hurricane Center's Tropical Cyclone Wind Speed Probability Forecast Product (WPFP) and to determine how best to utilize the product for operational applications. Results are composited from the 2004 to 2007 hurricane seasons and include all tropical cyclones affecting areas of interest centered on Cocoa Beach, FL, Charleston, SC, Corpus Christi, TX and New Orleans, LA. Through the construction of ROC diagrams and the optimization of the True Skill Statistic (TSS) and the Heidke Skill Score (HSS), optimal thresholds were

determined for three wind speed criteria (34 kt, 50 kt, and 64 kt) at seven forecast time intervals (12, 24, 36, 48, 72, 96, and 120 h) to convert the probabilities into binary (yes/no) forecasts. Many verification statistics based on binary forecasts exist for gauging the performance of a forecast system. The optimal thresholds were selected based on two different data sets, one a subset of the other. The original data set included X forecasts (representing forecasts less than 0.4%), which incorporated any missed forecasts (i.e. binary no forecasts that were incorrect) but also saturated the data set with correct negative forecasts (i.e. binary no forecasts that were correct). In order to test the impact of these events on the optimal threshold selection, a second data set was composited including only the probabilities greater than 0.4%. Ultimately, the optimal thresholds selected based on maximum TSS were sensitive to the data set: when X forecasts were included, the optimal threshold values were lower than when X forecasts were excluded. The HSS optimal threshold selection method proved more robust, returning consistent optimal threshold values regardless of the data set. Overall, when comparing the optimal thresholds selected from each data set and selection method, the no X forecast data set instilled more confidence in the threshold values for decision-making purposes. Verification statistics were therefore based on the optimal thresholds selected from the no X forecast data set.
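The threshold-selection procedure summarized above (binarize the probability forecasts at each candidate threshold, score the resulting contingency table, and keep the threshold maximizing the chosen skill score) can be sketched as follows. TSS is shown; the HSS variant only swaps the scoring function. The data are invented for illustration and are not the study's.

```python
# Sketch of optimal-threshold selection: convert probability forecasts to
# yes/no at each candidate threshold and keep the threshold maximizing the
# chosen skill score (TSS here; the HSS method is analogous).
def tss(hit, fa, miss, cn):
    # True Skill Statistic = PoD - PoFD
    return hit / (hit + miss) - fa / (fa + cn)

def optimal_threshold(probs, observed, score=tss, candidates=range(1, 101)):
    best_t, best_s = None, float("-inf")
    for t in candidates:
        hit = fa = miss = cn = 0
        for p, o in zip(probs, observed):
            if p >= t and o:
                hit += 1
            elif p >= t:
                fa += 1
            elif o:
                miss += 1
            else:
                cn += 1
        if hit + miss == 0 or fa + cn == 0:
            continue  # score undefined at this threshold
        s = score(hit, fa, miss, cn)
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s

# Hypothetical well-separated forecasts: the sweep finds a perfect split.
probs = [5, 15, 25, 35, 80, 85, 90, 95]
observed = [False] * 4 + [True] * 4
print(optimal_threshold(probs, observed))  # -> (36, 1.0)
```

Sweeping the same candidates while tracking the (PoFD, PoD) pairs traces out the ROC diagrams used throughout the study; maximizing TSS picks the ROC point farthest above the no-skill diagonal.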

As previously mentioned, many verification statistics based on binary forecasts exist for evaluating the performance of a forecast system. Overall, based on the optimal thresholds selected by optimizing TSS and HSS, the WPFP performed well. As expected, the skill of the forecast system decreased as the forecast time interval increased; nevertheless, the performance metrics (TSS, HSS, threat score, etc.) remained within the acceptable range indicating skill, regardless of wind speed. This is particularly true during the earlier forecast times (12 h to 72 h). In addition to the optimal threshold based verification statistics, reliability diagrams were constructed to assess the performance of the forecast system for binary events. Recall that reliability diagrams present observed frequency as a function of forecast probability and do not depend on optimal threshold selection. In theory, a perfect forecast system produces forecasts whose probabilities are consistent with the observed frequency (Hartmann et al. 2002). As previously discussed, the position of the reliability plot relative to the desired result (the one-to-one line where the forecast probability is equivalent to the observed frequency) indicates whether the product over or under forecasts. Ultimately, based on the reliability diagrams, the product was found to over forecast, particularly in the earlier forecast time intervals. As the forecast time interval increased, with the exception of the 12 h forecast, less over forecasting was observed in the forecast system.
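For reference, the binary-forecast metrics used throughout the evaluation can be sketched from a 2x2 contingency table (Hit, Miss, FA, CN). The definitions below are the standard ones (e.g. Schaefer 1990; Jolliffe and Stephenson 2004) and are assumed to match the study's Table 4, which is not reproduced here; the check at the end uses the 12 h Cocoa Beach counts from the contingency table in Section 3.3 (Hit = 83, FA = 5, Miss = 24, CN = 399).

```python
# Standard 2x2 contingency-table metrics (assumed to match the study's
# Table 4 definitions): a = Hit, b = FA, c = Miss, d = CN.
def binary_metrics(a, b, c, d):
    n = a + b + c + d
    pod = a / (a + c)            # probability of detection
    pofd = b / (b + d)           # probability of false detection
    far = b / (a + b)            # false alarm ratio
    csi = a / (a + b + c)        # threat score / Critical Success Index
    bias = (a + b) / (a + c)     # frequency bias
    tss = pod - pofd             # True Skill Statistic
    # HSS: fraction correct relative to that expected by random chance
    chance = ((a + c) * (a + b) + (b + d) * (c + d)) / n
    hss = (a + d - chance) / (n - chance)
    return {"PoD": pod, "PoFD": pofd, "FAR": far,
            "CSI": csi, "Bias": bias, "TSS": tss, "HSS": hss}

# 12 h Cocoa Beach, FL counts for the 34 kt criterion (Section 3.3):
m = binary_metrics(83, 5, 24, 399)
print(round(m["CSI"], 2))  # -> 0.74, matching the threat score table
```

Note how CSI ignores d (CN) entirely, which is why the threat score is unaffected by the many correct negatives that saturate the longer lead times.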

The sensitivity of the optimal thresholds and the skill of the forecast system were determined by evaluating each as a function of forecast time, wind speed and AOI. Ultimately, the optimal thresholds did not behave as expected. Given the nature of a forecast system, the maximum probability forecast per time interval decreased monotonically as the forecast time increased, so the optimal thresholds were also expected to decrease monotonically. Instead, with the exception of the 12 h forecast, the optimal thresholds increased until the 48 h forecast interval and then decreased as expected. This trend existed regardless of selection method and was consistent for all wind speed criteria. Based on the selected threshold values, various verification statistics were calculated to help gauge the performance of the WPFP. Given the nature of a forecast system, the skill is expected to decrease as the forecast time interval increases. As evident from the bias score, threat score, TSS and HSS, this trend exists for all wind speed criteria. Once again, the 12 h forecast time interval is an outlier: values of the previously mentioned skill scores were significantly higher at this time period. This difference was also reflected in the reliability diagrams. At the 12 h forecast time interval, the forecast system closely followed the desired trend, while all successive time periods showed significant over forecasting. These findings suggest the forecast system treats the 12 h interval differently. It is currently unclear why this is the case; it will be investigated in greater detail as part of the future work of the project.

The evaluation of the performance of the WPFP is intended to aid operational yes/no decisions at Cape Canaveral Air Force Station and Kennedy Space Center. The optimal thresholds selected are tailored to fit the needs of a particular user in determining the optimum strategy for any specific application (e.g., protecting the space shuttle and other resources). Based on the results of this study, it is recommended that the 45th Weather Squadron rely on the optimal thresholds selected using the Heidke Skill Score (HSS) optimal threshold selection method; however, the ultimate choice is left to the user.

References

Beven, J., 2008: Verification of National Hurricane Center Forecasts of Extratropical Transition. 28th Conference on Hurricanes and Tropical Meteorology, 28 April to 2 May 2008, Paper 10C.2.

Cheng, R. C., 1995: Bootstrap methods in computer simulation experiments. 27th Conference on Winter Simulation, 3-6 December 1995, 171-177.

Chu, P., C. Lee, M. Lu, and X. Zhao, 2007: Climate Prediction of Tropical Cyclone Activity in the Vicinity of Taiwan Using the Multivariate Least Absolute Deviation Regression Method. Terr. Atmos. Ocean. Sci., 18, 805-825.

Department of Commerce, National Oceanic and Atmospheric Administration, National Weather Service, 2008: National Weather Service Instruction 10-601, Operations and Services, Tropical Cyclone Weather Services Program, NWSPD 10-6, Tropical Cyclone Products, 16-17 June 2008.

Haklander, A. J., and A. Van Delden, 2003: Thunderstorm predictors and their forecast skill for the Netherlands. Atmos. Res., 5, 3584.

Hartmann, H. C., T. C. Pagano, S. Sorooshian, and R. Bales, 2002: Confidence builder: Evaluating seasonal climate forecasts from user perspectives. Bull. Amer. Meteor. Soc., 84, 683-698.

Harvey, L. O., K. R. Hammond, C. M. Lusk, and E. F. Mross, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863-883.

Jolliffe, I. T., and D. B. Stephenson, 2004: Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley & Sons Ltd, 48-51, 143-144.

Knabb, R., 2008: Tropical Cyclone Surface Wind Speed Probabilities (TCSWSP) in the National Digital Forecast Database (NDFD). Available from the National Hurricane Center at http://products.weather.gov/pdd/tcwindspeedprobndfd032007.pdf

Kuligowski, R. J., and A. P. Barros, 2001: Blending Multiresolution Satellite Data with Application to the Initialization of an Orographic Precipitation Model. J. Appl. Meteor., 40, 1592-1606.

Manzato, A., 2005: An Odds Ratio Parameterization for ROC Diagram and Skill Score Indices. Wea. Forecasting, 20, 918-930.

Mason, S. J., and N. E. Graham, 1999: Conditional Probabilities, Relative Operating Characteristics, and Relative Operating Levels. Wea. Forecasting, 14, 713-725.

McDonald, B. E., T. M. Graziano, and C. K. Kluepfel, 2000: The NWS National QPF Verification Program. 15th Conference on Hydrology, 9-14 January 2000, American Meteorological Society, 247-250.

Moskaitis, J., 2007: Verification of deterministic tropical cyclone intensity forecasts: Moving beyond mean absolute error. 28th Conference on Hurricanes and Tropical Meteorology, June 2007, Paper 14A.7.

Murphy, A. H., and H. Daan, 1985: Forecast evaluation. Probability, Statistics, and Decision Making in the Atmospheric Sciences, Westview Press, 379-437.

National Hurricane Center, November 2006: Tropical Cyclone Wind Speed Probability Product Description. Available from the National Hurricane Center at http://www.nws.noaa.gov/directives/sym/pd01006001curr.pdf

National Oceanic and Atmospheric Administration, 2004: FY Annual Performance Plan. Available from NOAA at http://www.osec.doc.gov/bmi/budget/06app/noaa06app.pdf

Schaefer, J. T., 1990: The Critical Success Index as an Indicator of Warning Skill. Wea. Forecasting, 5, 570-575.

Stefanova, L., and T. N. Krishnamurti, 2002: Interpretation of Seasonal Climate Forecast Using Brier Skill Score, The Florida State University Superensemble, and the AMIP-I Dataset. J. Climate, 15, 537-544.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press, 467 pp.

Winters, K. A., J. W. Weems, F. C. Flinn, G. B. Kubat, S. B. Cocks, and J. T. Madura, 2006: Providing Tropical Cyclone Weather Support to Space Launch Operations. 27th Conference on Hurricanes and Tropical Meteorology, 24-28 April 2006, Paper 13A.5.

Appendix A: Charleston, SC

A B

C D

E F

Figure A.1: Relative Operating Characteristic (ROC) diagrams for the Charleston, SC AOI for the 34 kt, 50 kt and 64 kt wind radii with X forecasts (A, B, C, respectively) and the 34 kt, 50 kt, and 64 kt wind radii without X forecasts (D, E, F, respectively).

Table A.1: Optimal thresholds selected for the Charleston, SC AOI A) with X forecasts and B) without X forecasts. Results from the TSS and HSS selection methods are listed.

A               34 kt        50 kt        64 kt
Forecast Int.   TSS   HSS    TSS   HSS    TSS   HSS
12 hr             5    51     14    44     15    15
24 hr            16    37      2    18     11    11
36 hr             4    31      2     9      1     2
48 hr             5    33      1    12      2     2
72 hr             5    30      1     7      2     3
96 hr             1    18      1     2      1     1
120 hr            5    14      1     1      1     1

B               34 kt        50 kt        64 kt
Forecast Int.   TSS   HSS    TSS   HSS    TSS   HSS
12 hr            16    51     15    44     15    15
24 hr            16    56     18    18     11    11
36 hr            10    31      3     9      1     2
48 hr             6    33      3    12      2     2
72 hr             9    10      3     7      2     3
96 hr             4     5      1     4      1     1
120 hr            6     6      1     1      1     1

Appendix B: New Orleans, LA

A B

C D

E F

Figure B.1: Relative Operating Characteristic (ROC) diagrams for the New Orleans, LA AOI for the 34 kt, 50 kt and 64 kt wind radii with X forecasts (A, B, C, respectively) and the 34 kt, 50 kt, and 64 kt wind radii without X forecasts (D, E, F, respectively).

Table B.1: Optimal thresholds selected for the New Orleans, LA AOI A) with X forecasts and B) without X forecasts. Results from the TSS and HSS selection methods are listed.

A               34 kt        50 kt        64 kt
Forecast Int.   TSS   HSS    TSS   HSS    TSS   HSS
12 hr            11    37      6    34     30    30
24 hr            22    40      3    18     22    37
36 hr            16    25      2    17      9    27
48 hr            11    27      3    15      6    20
72 hr             3    17      3    12      4     6
96 hr             5    14      2     5      1     1
120 hr            2     1      1     5      0     0

B               34 kt        50 kt        64 kt
Forecast Int.   TSS   HSS    TSS   HSS    TSS   HSS
12 hr            37    37     34    34     30    30
24 hr            22    40     14    29     22    27
36 hr            21    29     15    17     15    27
48 hr            16    27     10    15      6    20
72 hr            14    17     12    14      4     6
96 hr             5    14      3     6      1     1
120 hr            3     6      1     5      0     0

Appendix C: Corpus Christi, TX

A B

C D

E F

Figure C.1: Relative Operating Characteristic (ROC) diagrams for the Corpus Christi, TX AOI for the 34 kt, 50 kt and 64 kt wind radii with X forecasts (A, B, C, respectively) and the 34 kt, 50 kt, and 64 kt wind radii without X forecasts (D, E, F, respectively).