Statistical Analysis of Google Flu Trends


A PROJECT SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Melissa Sandahl

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

Yang Li and Kang James

July, 2016

Acknowledgements

Firstly, I would like to thank my advisors Yang Li and Kang James for all of their time, guidance and expert support while completing this project. I would also like to thank Xuan Li for serving on my committee and for the positive support. Lastly, I am very thankful for everything that the University's Mathematics & Statistics department has provided me with and helped me accomplish over the last two years.

Abstract

Predicting the behavior of influenza is crucial to helping health officials prepare for and reduce possible outbreaks of the infectious disease. This project discusses methods for testing Google Flu Trends count data for spatial autocorrelation, seasonality and temporal effects. We will generate an appropriate seasonal ARIMA model to fit the data for the overall nation, and use the statistical program R to develop multiple state models. Lastly, the Ljung-Box test will be applied to test for goodness of fit and model adequacy. The goal of this project is to be able to forecast future influenza outbreaks from Google flu trends across the United States in hopes of increasing preparation standards.

Contents

Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
2 Spatial Data
  2.1 Spatial Weights Matrix
  2.2 Spatial Dependence
    2.2.1 Global Test for Spatial Autocorrelation
    2.2.2 Local Test for Spatial Autocorrelation
3 Time Series Analysis
4 Seasonal ARIMA Model
  4.1 The General Model
    4.1.1 Simple ARIMA Example
  4.2 ACF and PACF
  4.3 R ARIMA Model
  4.4 State Models
  4.5 US Model Accuracy
  4.6 State Model Accuracy
5 Model Forecasting
6 Conclusion and Discussion
References
Appendix A. Spatial Matrix
Appendix B. Moran I Proof

List of Tables

2.1 Global Moran Is: January 2008 to June 2011
2.2 Global Moran Is: July 2011 to December 2014
4.1 ARIMA model generated from ACF and PACF plots after differencing
4.2 Results from ARIMA model generated from R
4.3 Seasonal ARIMA models selected by R: Alabama - Montana
4.4 Seasonal ARIMA models selected by R: Nebraska - Wyoming
4.5 Seasonal ARIMA model statistics: Alabama - Montana
4.6 Seasonal ARIMA model statistics: Nebraska - Wyoming

List of Figures

2.1 Average Monthly Flu Counts
2.2 Average Monthly Flu Counts Per Person
2.3 Time Series of observed Global Morans
2.4 Local Moran I values for Alabama
2.5 Local Moran I values for Montana
2.6 Local Moran I values for North Dakota
2.7 Local Moran I values for South Dakota
2.8 Time series of Local Moran I values for Alabama, Montana and North Dakota
3.1 Times Series of original data collected
3.2 Times series of data after it was logged
4.1 ACF and PACF for US monthly data
4.2 Times Series of the data after 12 month differencing
4.3 ACF and PACF for US monthly data after differenced for 12 months
4.4 Time Series, ACF, and PACF of the model's residuals
5.1 Forecast for ARIMA(1, 0, 1) × (2, 0, 0)_12 model
A.1 Map of the continental states with corresponding numbers [13]
A.2 List of Neighbors for the 48 States [13]

Chapter 1 Introduction

Accurate models that help predict influenza outbreaks across the United States can help public health officials track and prepare for future events and, hopefully, decrease the number of deaths. Nowadays, large data sets for diseases and epidemics can be collected quickly and easily through internet-based programs. However, generating useful models to help predict epidemics is largely dependent on the availability and accuracy of this data [14].

The data for this project came from Google Flu Trends [5]. Google keeps track of the number of times that the word "flu" or flu-like symptoms are searched on its website and maintains a database with this information on a weekly basis. The data is collected both nationwide and for each individual state. Studies concerning Google-search-based tracking models have been done, such as the AutoRegression with Google data (ARGO) model developed by Yang, Santillana and Kou [14] and the Google Flu Trends state-space SEIR model by Dukic, Lopes and Polson [3]. Both models were created in the hope of tracking disease behavior with different temporal and spatial inputs.

We will start the project by using the data collected for each state to investigate whether any spatial autocorrelation is present. When testing for spatial autocorrelation, we must build a spatial weights matrix that describes the locational similarities between the areas in the study. We will test for both global and local spatial autocorrelation in the data in order to get more accurate results.

Next, we will analyze the data as a seasonal time series. Using the ACF and PACF plots, we will determine which lags in the data are significant and use them to develop an appropriate seasonal ARIMA model that fits the data. For this project we will assess the model's adequacy using the Ljung-Box test, which tests whether the model's residuals are independently distributed.

Our goal for this project is to create a model that can forecast the future behavior of Google flu trends at the nationwide and/or state level. In doing so, we would be able to predict whether one state will have a rise in flu cases based on the frequency of Google flu counts in neighboring states and in previous time periods.

Chapter 2 Spatial Data

Spatial data is the term used to describe data that has a spatial or geographical component associated with it. This data can consist of characteristics or, more commonly, numerical observations. There are two common spatial structures that are used when modeling regional spatial data. The first determines the proximity of areas based on the distances between the centroids of the areal units. It is assumed that each observation is made at the centroid of its region, and a spatial covariance structure is then developed from the distances between centroids. This strategy does not take into account the fact that the centroid does not accurately describe the behavior across the entire region [13].

The second method, and the one used in this project, uses a neighborhood structure to create the spatial covariance matrix. Here the proximity of the areal units is determined by the borders shared in the regional lattice structure, so regions are considered neighbors when they share a common border. Unfortunately, this method uses irregular lattices and has not been studied as in-depth as the first method.

In this project, we used area data of Google flu counts for the 48 continental U.S. states, collected from Google [5]. Area data is defined as the type of spatial data where the observations are associated with a fixed set of areal units. As stated before, we used a neighborhood structure that contained areas/zones with irregular boundaries [4].

2.1 Spatial Weights Matrix

We utilized what is known as a spatial weights matrix to determine spatial relativity within our data. Since we wanted to investigate whether a rise in Google flu counts in one state would affect the states surrounding it, we decided to use the neighbors of each state to generate our spatial weights matrix. Thus, when two states, areas $i$ and $j$, share a common border, the entries $W_{i,j}$ and $W_{j,i}$ are given a value of 1. All diagonal entries $W_{i,i}$ are assigned a value of 0 [4]. The spatial weights matrix $W$ of the 48 continental U.S. states is:

\[
W = \begin{pmatrix}
0 & W_{1,2} & \cdots & W_{1,48} \\
W_{2,1} & 0 & \cdots & W_{2,48} \\
\vdots & \vdots & \ddots & \vdots \\
W_{48,1} & W_{48,2} & \cdots & 0
\end{pmatrix},
\qquad
W_{i,j} = \begin{cases}
1 & \text{if area } j \text{ is a neighbor of area } i, \\
0 & \text{otherwise.}
\end{cases}
\]

See Appendix A for a map of the 48 areal units, labeled in their corresponding alphabetical order, and a list of all the neighbors for each state. Before we moved on with the data, we decided to row standardize $W$, so that each row sums to 1 and each entry can be interpreted as the portion of spatial influence that area $j$ has on area $i$.
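The construction described above can be sketched in a few lines. The project itself worked in R; the Python below is only an illustrative sketch, and the four-state neighbor list is a hypothetical toy subset rather than the full 48-state list from Appendix A. The function names `build_weights` and `row_standardize` are likewise made up for this example.

```python
def build_weights(neighbors):
    """Build a symmetric 0/1 contiguity matrix from {area: [neighboring areas]}."""
    areas = sorted(neighbors)                      # alphabetical order, as in Appendix A
    index = {area: k for k, area in enumerate(areas)}
    n = len(areas)
    w = [[0.0] * n for _ in range(n)]
    for area, nbrs in neighbors.items():
        for other in nbrs:
            if other != area:                      # diagonal entries stay 0
                w[index[area]][index[other]] = 1.0
                w[index[other]][index[area]] = 1.0
    return areas, w


def row_standardize(w):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    result = []
    for row in w:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result


# Hypothetical toy subset of states (not the full neighbor list).
toy_neighbors = {
    "MN": ["ND", "SD"],
    "MT": ["ND", "SD"],
    "ND": ["MN", "MT", "SD"],
    "SD": ["MN", "MT", "ND"],
}
areas, W = build_weights(toy_neighbors)
W_std = row_standardize(W)
```

After row standardization, the row for `"MN"` is `[0, 0, 0.5, 0.5]`: each of its two neighbors in the toy subset carries half of the spatial influence.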

2.2 Spatial Dependence

In order to create a useful model with this data, we must first check whether there is spatial dependence among the observations. We will investigate whether there is spatial autocorrelation in the number of Google flu counts based on the proximity of the 48 continental states. We will run two tests to check for spatial dependence: one on a global scale, and a second that looks at local spatial dependence for each location.

Before we ran any tests, we wanted to look at what was happening with the flu counts per person in each state at multiple time periods. Figure 2.2 below shows what happened across the United States in every other month of the year. We observed that a larger percentage of people seems to search for flu trends in the upper Midwest all year round than in other areas across the country.

When looking for spatial dependence, we are comparing the similarity between the observations for each of the 48 locations. We will use the following variables to help us test for spatial autocorrelation:

$n$: number of areas in the sample,
$i, j$: two different areas in the sample,
$z_i$: observation collected from area $i$,
$\bar{z}$: average of all $n$ observations,
$W_{ij}$: similarity in location of areas $i$ and $j$,
$M_{ij}$: similarity in observations of areas $i$ and $j$.

To see visually what the counts we collected looked like, we took the average monthly data and generated density plots of the geographical region. Figure 2.1 displays the density counts for every other month. We noticed that the states with the largest density of counts were the ones with the largest populations, e.g. Texas, Arizona and California. For that reason, Figure 2.1 is not useful for making any conjectures, since it would not be spatial effects that cause those states to have higher Google flu counts. To account for this, we divided the monthly counts by each state's population so that we could use per capita data. We then generated the same density plots with the monthly average per capita data; see Figure 2.2. With the transformed data, clustering no longer happens primarily in the southern region of the US. Instead, we see clustering mainly in the upper Midwest. One possible explanation is that this region generally has harsher winters and colder temperatures year round, which would increase the chances of contracting influenza.

Figure 2.1: Average Monthly Flu Counts (density maps, one panel each for the average January, March, May, July, September and November counts)

Figure 2.2: Average Monthly Flu Counts Per Person (density maps, one panel each for the average January, March, May, July, September and November logged counts)

2.2.1 Global Test for Spatial Autocorrelation

Global spatial autocorrelation measures and tests use the entire spatial weights matrix $W$ to determine whether there is spatial autocorrelation over the total area in the study. Local measures, in contrast, calculate a statistic for each area in the study and use a smaller, restricted set of areal units [4]. When measuring global spatial autocorrelation, we compared the similarities in the observations $M_{ij}$ with the similarities in the locations $W_{ij}$ by using the cross-product:

\[ \sum_{i=1}^{n} \sum_{j=1}^{n} M_{ij} W_{ij} \tag{2.1} \]

The two most commonly used methods for finding spatial autocorrelation for areal units are the Moran's I and Geary's c statistics. Both statistics determine the overall degree of spatial correlation in the data set as a whole. For this project, we used Moran's I statistic to determine whether global spatial autocorrelation is present in our data. Moran's I uses cross-products to measure value similarity, $M_{ij} = (z_i - \bar{z})(z_j - \bar{z})$, whereas Geary's c uses squared differences such as $(z_i - z_j)^2$. The global Moran I statistic is:

\[ I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\left( \sum_{i=1}^{n} \sum_{j \neq i} W_{ij} \right) \sum_{i=1}^{n} (z_i - \bar{z})^2} \tag{2.2} \]

\[ E[I] = -\frac{1}{n-1} \tag{2.3} \]

\[ \operatorname{var}(I) = \frac{n^2 (n-1) W_1 - n(n-1) W_2 - 2 W_0^2}{(n+1)(n-1)^2 W_0^2} \tag{2.4} \]

where

\[ W_0 = \sum_{i=1}^{n} \sum_{j \neq i} W_{ij} \tag{2.5} \]

\[ W_1 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j \neq i} (W_{ij} + W_{ji})^2 \tag{2.6} \]

\[ W_2 = \sum_{k=1}^{n} \left( \sum_{j=1}^{n} W_{kj} + \sum_{i=1}^{n} W_{ik} \right)^2 \tag{2.7} \]

Please see Appendix B for the proof of the expected value of Moran's I.

The null hypothesis associated with the Moran I statistic is that the observed values are randomly placed across the areal units, so that there is no spatial autocorrelation. To test for the significance of spatial autocorrelation, R randomly assigns the observations to the areal units and calculates Moran's I for a large number of these random assignments. The observed Moran I is then compared to this random set of Is: if the observed I falls below the 5th percentile or above the 95th percentile, then spatial autocorrelation is present at the α = 0.05 level in the corresponding one-sided test [6]. Therefore, we looked for p-values that were significant (< 0.05), which would imply that we could reject the null hypothesis and conclude that spatial autocorrelation is present [4].

When spatial autocorrelation is present, for large data sets, the observed Moran I statistic will be large relative to its expected value under the null hypothesis of no spatial relation. To see this, consider two neighboring areas $i$ and $j$ that both have high observation values. Both will be larger than the average $\bar{z}$, and the cross product $(z_i - \bar{z})(z_j - \bar{z})$ will be a large positive value.

Using the Moran.I() command in the package {ape} in R, we found the Global Moran I statistics for all 84 time periods in the study. These results are shown in Tables 2.1 and 2.2 below.
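To make Equations (2.2) and (2.3) and the random-reassignment idea concrete, here is a minimal Python sketch. It is not the Moran.I() routine from {ape}; the helper names, the six-area chain of neighbors, and the observation values are all hypothetical, and the p-value shown is a simple one-sided (upper-tail) permutation estimate.

```python
import random


def morans_i(z, w):
    """Global Moran's I, Equation (2.2), for observations z and weights w."""
    n = len(z)
    zbar = sum(z) / n
    dev = [v - zbar for v in z]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    w0 = sum(w[i][j] for i, j in pairs)
    cross = sum(w[i][j] * dev[i] * dev[j] for i, j in pairs)
    return n * cross / (w0 * sum(d * d for d in dev))


def expected_morans_i(n):
    """Expected value under the null hypothesis, Equation (2.3)."""
    return -1.0 / (n - 1)


def permutation_p_value(z, w, n_perm=999, seed=0):
    """Upper-tail pseudo p-value from random reassignment of observations."""
    rng = random.Random(seed)
    observed = morans_i(z, w)
    shuffled = list(z)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, w) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Hypothetical example: six areas in a chain, each bordering the next one.
n_areas = 6
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n_areas)] for i in range(n_areas)]
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # smoothly increasing values, so neighbors are alike

observed, p_value = permutation_p_value(z, W)
```

For this toy input the observed statistic is 0.6, well above the expected value of -0.2, which is the pattern one would expect when neighboring areas have similar values.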

Table 2.1: Global Moran Is: January 2008 to June 2011 (one row per month; columns: Date, Observed I, Expected I, St. Dev, P val)

Table 2.2: Global Moran Is: July 2011 to December 2014 (one row per month; columns: Date, Observed I, Expected I, St. Dev, P val)

A graph of the observed Global Moran I values for all 84 time periods is shown below in Figure 2.3. The reference line for the expected Global Moran I is also included in the plot, and one standard deviation above and below the observed value is indicated. Notice that none of the observed Global Moran values are at least one standard deviation away from the expected value, although we did find five months that displayed spatial autocorrelation when testing the I values from Tables 2.1 and 2.2. This result is not very strong, and we could not conclude that there is global spatial autocorrelation in the data. Thus, we decided to test each state for local spatial autocorrelation.

Figure 2.3: Time Series of observed Global Morans (observed Global Moran Is plotted against time)

2.2.2 Local Test for Spatial Autocorrelation

Since we did not find any significant global spatial autocorrelation, our next step was to test for signs of local spatial autocorrelation. Local statistics are used to determine whether each areal unit has a large amount of spatial clustering, or whether there are notable similarities in the observations of surrounding areas. Local indicators of spatial association (LISA) were created by Luc Anselin to determine the influence of each individual observation, rather than looking at the entire sample area [2].

The test for local spatial autocorrelation utilizes the cross product:

\[ \sum_{j=1}^{n} M_{ij} W_{ij} \tag{2.8} \]

which compares spatial autocorrelations for a specific observation or areal unit. As in the global test, $M_{ij} = (z_i - \bar{z})(z_j - \bar{z})$. We now also have neighborhood sets $J_i$, the collection of neighbors for area $i$, and $\bar{z}$ now represents the average observation value for just area $i$'s neighboring states. The local Moran $I_i$ statistic for area $i$ is:

\[ I_i = (z_i - \bar{z}) \sum_{j \in J_i} W_{ij} (z_j - \bar{z}) \tag{2.9} \]

\[ E[I_i] = -\frac{1}{n-1} \sum_{j=1}^{n} W_{ij} \tag{2.10} \]

\[ \operatorname{var}[I_i] = \frac{1}{n-1} W_{i(2)} (n - b_2) + \frac{2}{(n-1)(n-2)} W_{i(kh)} (2 b_2 - n) - \frac{1}{(n-1)^2} W_i^2 \tag{2.11} \]

where $b_2 = m_4 / m_2^2$ with $m_k = \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^k$, as defined by Anselin [2], and

\[ W_{i(2)} = \sum_{j \neq i}^{n} W_{ij}^2 \tag{2.12} \]

\[ 2 W_{i(kh)} = \sum_{k \neq i}^{n} \sum_{h \neq i}^{n} W_{ik} W_{ih} \tag{2.13} \]

\[ W_i = \sum_{j=1}^{n} W_{ij} \tag{2.14} \]

One thing to notice is that the sum of $I_i$ over all $i$ is proportional to the global Moran I statistic we used in Equation 2.2:

\[ \sum_{i=1}^{n} I_i = \sum_{i=1}^{n} (z_i - \bar{z}) \sum_{j \in J_i} W_{ij} (z_j - \bar{z}) \tag{2.15} \]

Using the localmoran() command in the package {spdep} in R, we again calculated Moran I statistics for each of the 84 time periods. However, when testing for local spatial autocorrelation we generated a test statistic for each of the 48 areas in each time period. Thus, we observed 4,032 local Moran I statistics. From these results, we generated a graph for each state that includes the calculated test statistic, one standard deviation above and below it, and the expected $I_i$ value for reference. From these, we found that only six states were consistently further than one standard deviation away from the expected value: Delaware, Montana, North Dakota, South Dakota, Texas and Wyoming. The corresponding graphs for Montana, North Dakota and South Dakota are displayed below, along with the plot for Alabama, which did not display significant spatial autocorrelation.
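The relationship between the local statistics and the global statistic can be checked numerically. The Python sketch below is illustrative only (the project used {spdep} in R). It assumes that $\bar{z}$ is the overall mean of all observations, as in the global statistic, and it reuses a hypothetical six-area chain of neighbors with made-up values.

```python
def local_morans(z, w):
    """Local Moran statistics I_i as in Equation (2.9), using the overall mean."""
    n = len(z)
    zbar = sum(z) / n
    dev = [v - zbar for v in z]
    return [dev[i] * sum(w[i][j] * dev[j] for j in range(n) if j != i)
            for i in range(n)]


def expected_local_morans(w):
    """Expected values under the null hypothesis, Equation (2.10)."""
    n = len(w)
    return [-sum(row) / (n - 1) for row in w]


# Hypothetical example: six areas in a chain, each bordering the next one.
n_areas = 6
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n_areas)] for i in range(n_areas)]
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

local_values = local_morans(z, W)       # one statistic per area
expected_values = expected_local_morans(W)

# Rescaling the sum of the local statistics recovers the global Moran's I.
zbar = sum(z) / n_areas
w0 = sum(sum(row) for row in W)
sum_sq = sum((v - zbar) ** 2 for v in z)
global_from_local = n_areas * sum(local_values) / (w0 * sum_sq)
```

Here the local statistics sum to 17.5, and rescaling by $n / (W_0 \sum_i (z_i - \bar{z})^2)$ gives 0.6, the global Moran's I for the same toy data.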

Figure 2.4: Local Moran I values for Alabama (observed Local Moran Is plotted against time)

Figure 2.5: Local Moran I values for Montana (observed Local Moran Is plotted against time)

Figure 2.6: Local Moran I values for North Dakota (observed Local Moran Is plotted against time)

Figure 2.7: Local Moran I values for South Dakota (observed Local Moran Is plotted against time)

Figure 2.8: Time series of Local Moran I values for Alabama, Montana and North Dakota

Figure 2.8 displays the trends in the observed local Moran Is for three of the states. Alabama is used as a reference for a state that does not display any type of local spatial autocorrelation, whereas both Montana's and North Dakota's Moran Is indicated significant local spatial autocorrelation. However, despite having six states with significant local spatial autocorrelation, our results are similar to those for global spatial autocorrelation. We concluded that our data did not show enough significant local spatial autocorrelation to continue to include it in our model building.

Chapter 3 Time Series Analysis

For this project, we are using average monthly data observed over 7 years, for a total of 84 data points for each location. Since the data we collected was weekly, we manually converted it to monthly averages. Weekly data was listed by the first day of the week in which the counts started. Thus, if a week started at the end of January but also contained days in February, it was only taken into account for January's average.

A time series plot of the raw data is shown in Figure 3.1 below. Notice that there is a large amount of variance between the peaks. Looking at Figure 3.2, we can see that after we take the log of the data, there is much less variance in the output: the logged data ranges roughly from 6 to 10, a much narrower range than the raw data. Also, since our data is in the form of counts, taking the log helps normalize it. We will continue our project by using the log-transformed data to generate useful models.

Another option we could have tried would have been to difference the data. Differencing is commonly used with a seasonal time series. Seasonality typically causes a time series to be nonstationary, since it is the main factor affecting the output at certain time periods. Differencing, that is, computing the difference between consecutive observations, can help make a time series stationary and remove patterns in the data. However, we decided not to use differencing and to keep the seasonal effects in the data for modeling purposes that we will discuss later in the project [7].
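The weekly-to-monthly conversion and log transform described above can be sketched as follows. This Python fragment is illustrative only; the function name `monthly_log_means` and the three weekly counts are hypothetical, not values from the Google Flu Trends data.

```python
import math
from collections import defaultdict
from datetime import date


def monthly_log_means(weekly):
    """Average weekly counts by the month of each week's first day, then take logs.

    `weekly` is a list of (week_start_date, count) pairs. A week that starts in
    January but runs into February is assigned to January only.
    """
    buckets = defaultdict(list)
    for week_start, count in weekly:
        buckets[(week_start.year, week_start.month)].append(count)
    return {month: math.log(sum(counts) / len(counts))
            for month, counts in sorted(buckets.items())}


# Hypothetical weekly counts; the first week starts in January and spills into February.
weekly_counts = [
    (date(2014, 1, 26), 1000),
    (date(2014, 2, 2), 3000),
    (date(2014, 2, 9), 5000),
]
logged = monthly_log_means(weekly_counts)
```

In this toy input the week beginning 26 January contributes only to January, so February's value is the log of the mean of 3000 and 5000.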

Next, notice that there is an obvious spike in the data that occurs once every year. These spikes represent flu season, which typically peaks between December and February, with the majority of years peaking in February [12]. Because of this, we incorporated a 12-month seasonal effect into our models. Another notable feature of the data is an atypical spike in one of the flu seasons. According to the Centers for Disease Control and Prevention (CDC), the flu vaccine was 52% effective at preventing acute respiratory illness that required medical attention that year. However, the vaccine's effectiveness was lower for those aged 65 years and older, which could have been one factor in that flu season's severity [12].

Figure 3.1: Times Series of original data collected (raw monthly average US flu counts plotted against date)

Figure 3.2: Times series of data after it was logged (logged monthly average US flu counts plotted against date)

Chapter 4 Seasonal ARIMA Model

Since we are using monthly data to help us determine a model for predicting Google flu counts, we would expect our model to be seasonal. Seasonality in a time series means that there is a pattern that repeats every $m$ time periods. Since flu season generally peaks around the same time each year, we would expect our data to have a recurring pattern every 12 months.

4.1 The General Model

The seasonal ARIMA model includes both seasonal and non-seasonal autoregressive and moving average components. The model also incorporates differencing, which was mentioned earlier. The backshift operator, $B$, is defined by:

\[ B^m x_t = x_{t-m} \tag{4.1} \]

The notation for the general model is:

ARIMA(p, d, q) × (P, D, Q)_m

where

p: non-seasonal AR order,
d: non-seasonal differencing,
q: non-seasonal MA order,
P: seasonal AR order,
D: seasonal differencing,
Q: seasonal MA order,
m: number of months per season.

The model can be written as:

\[ \Phi(B^m)\,\phi(B)\,(1 - B)^d (1 - B^m)^D (x_t - \mu) = \Theta(B^m)\,\theta(B)\,w_t \tag{4.2} \]

The non-seasonal AR component is written as:

\[ \phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p \tag{4.3} \]

The non-seasonal MA component is written as:

\[ \theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q \tag{4.4} \]

The seasonal AR component is written as:

\[ \Phi(B^m) = 1 - \Phi_1 B^m - \dots - \Phi_P B^{mP} \tag{4.5} \]

The seasonal MA component is written as:

\[ \Theta(B^m) = 1 + \Theta_1 B^m + \dots + \Theta_Q B^{mQ} \tag{4.6} \]

4.1.1 Simple ARIMA Example

To illustrate a simple example of a seasonal ARIMA model, let's look at an ARIMA(0, 1, 1) × (1, 0, 1)_12 model.

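Our model would start as:

\[ \Phi(B^{12})(1 - B)(x_t - \mu) = \Theta(B^{12})\,\theta(B)\,w_t \tag{4.7} \]

Now, replace $(x_t - \mu)$ with $z_t$ and substitute in the appropriate polynomials to get:

\[ (1 - \Phi B^{12})(1 - B) z_t = (1 + \Theta B^{12})(1 + \theta B) w_t \tag{4.8} \]

\[ (1 - B - \Phi B^{12} + \Phi B^{13}) z_t = (1 + \theta B + \Theta B^{12} + \Theta\theta B^{13}) w_t \tag{4.9} \]

Thus, we get the resulting equation:

\[ z_t = z_{t-1} + \Phi z_{t-12} - \Phi z_{t-13} + w_t + \theta w_{t-1} + \Theta w_{t-12} + \Theta\theta w_{t-13} \tag{4.10} \]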

4.2 ACF and PACF

Next, we created ACF and PACF plots for the nationwide data, which are shown in Figure 4.1.

Figure 4.1: ACF and PACF for US monthly data

We observed that the ACF plot in Figure 4.1 has cyclic spikes that do not seem to dampen in magnitude at any lag value. Thus, it would be difficult to determine a model for this data based on these plots. With seasonal data, it is sometimes helpful to analyze the ACF and PACF plots after differencing. Seasonality commonly brings about nonstationarity, since it is typical for seasonal data to fluctuate and peak during certain time periods. Thus, we applied a 12-month difference to our data; the time series plot after differencing can be found in Figure 4.2. Notice that the new plot no longer has the consistent peak-and-valley pattern that was present in Figure 3.2. We then generated ACF and PACF plots for the data after applying the 12-month difference, which can be found in Figure 4.3.
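As a rough illustration of why a 12-month difference removes the repeating pattern, the Python sketch below applies a seasonal difference to a synthetic monthly series and computes its sample autocorrelations. The series (a yearly sine wave plus a small linear trend) and the function names are hypothetical and are not the flu data; the project produced its plots in R.

```python
import math


def seasonal_difference(x, m=12):
    """Return x_t - x_{t-m}; the first m observations are lost."""
    return [x[t] - x[t - m] for t in range(m, len(x))]


def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


# Synthetic stand-in for 84 months of data: yearly cycle plus a slow upward trend.
series = [math.sin(2 * math.pi * t / 12) + 0.01 * t for t in range(84)]

acf_raw = sample_acf(series, 24)            # strong correlation at the seasonal lags
differenced = seasonal_difference(series)   # 72 values with the yearly cycle removed
```

In this synthetic case the raw series is strongly correlated with itself at lag 12, while the differenced series is essentially constant (about 0.12, which is twelve months of trend), so the seasonal pattern is gone.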

Figure 4.2: Times Series of the data after 12 month differencing (12-month differenced flu counts plotted against date)

Figure 4.3: ACF and PACF for US monthly data after differenced for 12 months

From analyzing Figure 4.3, we were able to make an educated guess about which parameters to include in a seasonal ARIMA model. We used the ACF plot to help us determine the seasonal and non-seasonal MA orders, and the PACF plot for the number of seasonal and non-seasonal AR parameters. Looking at just the first couple of lags in the ACF, we determined that a non-seasonal MA order of 1 would fit, since there is only a spike at lag 1. Next, we analyzed lags 12, 24 and 36 and decided to use a seasonal MA order of 1 as well, since lag 12 seemed to be the only significant seasonal lag. We used the same methodology to pick the AR components of the model from the PACF plot, and decided to use a non-seasonal AR order of 2 and a seasonal AR order of 0. Thus, we created the following model:

ARIMA(2, 0, 1) × (0, 0, 1)_12

Applying these parameters to equation (4.2) we get:

\[ \phi(B)(x_t - \mu) = \Theta(B^{12})\,\theta(B)\,w_t \tag{4.11} \]

\[ (1 - \phi_1 B - \phi_2 B^2)(x_t - \mu) = (1 + \Theta B^{12})(1 + \theta B) w_t \tag{4.12} \]

For simplicity, let $z_t = (x_t - \mu)$ and multiply the two polynomials on the right side:

\[ (1 - \phi_1 B - \phi_2 B^2) z_t = (1 + \Theta B^{12} + \theta B + \Theta\theta B^{13}) w_t \tag{4.13} \]

\[ z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + w_t + \Theta w_{t-12} + \theta w_{t-1} + \Theta\theta w_{t-13} \tag{4.14} \]
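Expansions such as (4.13) are just products of polynomials in the backshift operator, so they can be checked mechanically. The Python sketch below represents a lag polynomial as a dictionary from powers of $B$ to coefficients; the helper name and all coefficient values are hypothetical and chosen only for illustration.

```python
def multiply_lag_polys(p, q):
    """Multiply two polynomials in B given as {power: coefficient} dictionaries."""
    product = {}
    for power_p, coef_p in p.items():
        for power_q, coef_q in q.items():
            power = power_p + power_q
            product[power] = product.get(power, 0.0) + coef_p * coef_q
    return product


# Hypothetical coefficient values, for illustration only.
theta, Theta = 0.4, 0.3
ma_side = multiply_lag_polys({0: 1.0, 1: theta}, {0: 1.0, 12: Theta})
# (1 + theta B)(1 + Theta B^12) has terms at lags 0, 1, 12 and 13.

phi, Phi1, Phi2 = 0.5, 0.2, 0.1
ar_side = multiply_lag_polys({0: 1.0, 1: -phi}, {0: 1.0, 12: -Phi1, 24: -Phi2})
# (1 - phi B)(1 - Phi1 B^12 - Phi2 B^24) has terms at lags 0, 1, 12, 13, 24 and 25.
```

The cross terms land at lags 13 and 25 with coefficients equal to the products of the seasonal and non-seasonal parameters, which is where the $\Theta\theta w_{t-13}$ term in (4.14) comes from.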

Table 4.1: ARIMA model generated from ACF and PACF plots after differencing (rows: AR 1 ($\phi_1$), AR 2 ($\phi_2$), MA 1 ($\theta$), SMA 1 ($\Theta$), Constant; columns: Coef, S.E.; AIC = 26.95)

Our final model, with the coefficient estimates from Table 4.1 substituted into (4.14), is:

\[ z_t = \hat{\phi}_1 z_{t-1} + \hat{\phi}_2 z_{t-2} + w_t + \hat{\theta} w_{t-1} + \hat{\Theta} w_{t-12} + \hat{\Theta}\hat{\theta} w_{t-13} \tag{4.15} \]

4.3 R ARIMA Model

Now that we have analyzed the ACF and PACF plots for the data and generated a model from them, we decided to see what type of model R would choose to fit the data. We used the auto.arima() command in the package {forecast} in R to help us develop a potentially different model. This function selects the best model based on the Akaike information criterion (AIC). The AIC is used to estimate the quality of a model relative to other models. Thus, it determines which model is the most useful out of a set of candidate models; it does not determine the significance of an individual model. Our results were as follows:

ARIMA(1, 0, 1) × (2, 0, 0)_12

Applying these parameters to equation (4.2) we get:

\[ \Phi(B^{12})\,\phi(B)(x_t - \mu) = \theta(B)\,w_t \tag{4.16} \]

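\[ (1 - \Phi_1 B^{12} - \Phi_2 B^{24})(1 - \phi B)(x_t - \mu) = (1 + \theta B) w_t \tag{4.17} \]

For simplicity, let $z_t = (x_t - \mu)$ and multiply the first two polynomials on the left side to get:

\[ (1 - \Phi_1 B^{12} - \Phi_2 B^{24} - \phi B + \Phi_1 \phi B^{13} + \Phi_2 \phi B^{25}) z_t = (1 + \theta B) w_t \tag{4.18} \]

\[ z_t = \Phi_1 z_{t-12} + \Phi_2 z_{t-24} + \phi z_{t-1} - \Phi_1 \phi z_{t-13} - \Phi_2 \phi z_{t-25} + w_t + \theta w_{t-1} \tag{4.19} \]

Table 4.2: Results from ARIMA model generated from R (rows: AR 1 ($\phi$), MA 1 ($\theta$), SAR 1 ($\Phi_1$), SAR 2 ($\Phi_2$), constant; columns: Coef, S.E.; AIC = 15.57)

Our final model, with the coefficient estimates from Table 4.2 substituted into (4.19), is:

\[ z_t = \hat{\Phi}_1 z_{t-12} + \hat{\Phi}_2 z_{t-24} + \hat{\phi} z_{t-1} - \hat{\Phi}_1 \hat{\phi} z_{t-13} - \hat{\Phi}_2 \hat{\phi} z_{t-25} + w_t + \hat{\theta} w_{t-1} \tag{4.20} \]

Our first model, which we developed by analyzing the ACF and PACF plots, had an AIC value of 26.95, whereas the model chosen by R had an AIC value of only 15.57.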

Thus, we concluded that the model generated by R, ARIMA(1, 0, 1) × (2, 0, 0)_12, is a better model for the data than the one we specified ourselves.

4.4 State Models

Next, we ran auto.arima() on each of the 48 individual states in order to gather state-specific models. The models that R chose for each state are listed below in Tables 4.3 and 4.4.

Table 4.3: Seasonal ARIMA models selected by R: Alabama - Montana (one row per state; columns: State, AR, MA, SAR, SMA, Period, NonSDiff, SDiff, AIC)

Table 4.4: Seasonal ARIMA models selected by R: Nebraska - Wyoming (one row per state; columns: State, AR, MA, SAR, SMA, Period, NonSDiff, SDiff, AIC)

Notice that the states display fairly different model specifications, and that there is a wide variety of models associated with the 48 states. It does appear that ARIMA(1, 0, 1) × (2, 0, 0)_12 is the most common model in Tables 4.3 and 4.4. This makes sense, since it is the model that R also chose to describe the behavior of the overall country's flu counts. However, there are 13 states that include a seasonal difference in their model, which is something we did not include in our model.

4.5 US Model Accuracy

We then made a time series plot of the residuals for this model, as well as ACF and PACF plots, which are shown in Figure 4.4 below.

Figure 4.4: Time Series, ACF, and PACF of the model's residuals

There does not appear to be a trend in the time series of the residuals, which indicates a decent model. We also notice that there are no significant spikes in either the ACF or the PACF plot, which means that there is no remaining correlation at the lags of the model's residuals. Based on these plots, we concluded that this model sufficiently describes the data.

Next, we ran a Ljung-Box test on the model to help us further determine the model's usefulness [8]. The hypotheses of the Ljung-Box test are:

$H_0$: The data are independently distributed
$H_a$: The data are not independently distributed (correlation is present)

with the test statistic:

\[ Q = T(T+2) \sum_{g=1}^{h} \frac{\hat{\rho}_g^2}{T - g} \]

where

$T$: length of the time series,
$\hat{\rho}_g$: sample autocorrelation at lag $g$,
$h$: number of lags being tested,
$K$: number of model parameters.

Under the null hypothesis, the test statistic $Q$ typically follows a chi-squared distribution with $h$ degrees of freedom, $\chi^2(h)$. However, since we are applying this test to an ARIMA model, we are testing whether the residuals resemble independence and must adjust the degrees of freedom accordingly. The degrees of freedom for a seasonal ARIMA model are the number of lags being tested less the number of model parameters. Thus, with a significance level of $\alpha$, the rejection region for the hypothesis of randomness in the residuals is:

\[ Q > \chi^2_{1-\alpha,\,h-K} \]

According to Rob Hyndman and George Athanasopoulos [9], there is no standard value one should use for $h$, but as a rule of thumb for seasonal data you should let $h = \min(2m, T/5)$, where $T$ is the length of the time series.

Applying this test to the model generated by R at the 5% significance level (95% confidence), we get $h = \min(2 \cdot 12, 84/5) \approx 17$ and, with $K = 4$ model parameters, a critical value of $\chi^2_{0.95,13} = 22.36$. We would reject the hypothesis of randomness in the residuals if $Q > 22.36$. Since the test statistic for our model is below this critical value, we do not reject the hypothesis and conclude that the residuals are consistent with being independently distributed.

4.6 State Model Accuracy

Next, we ran the Ljung-Box test on the models that R generated for each individual state. The test statistics, degrees of freedom, p-values and AIC values are listed in Tables 4.5 and 4.6 below. Using the same test that we used for the overall US model, we reject the hypothesis of independence only for the state of Arkansas at the α = 0.05 level. We can conclude that the residuals of the models for the remaining 47 states look to be independently distributed, and that those models are useful for our data.
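The Ljung-Box computation used in Sections 4.5 and 4.6 can be sketched in plain Python. This is an illustrative sketch rather than the R routine used in the project: the function names are made up, the residual series is a hypothetical strongly alternating sequence (chosen so that the test clearly rejects), and the critical value 22.36 is simply the $\chi^2_{0.95,13}$ value quoted above rather than something computed here.

```python
def sample_autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return ck / c0


def ljung_box_q(residuals, h):
    """Ljung-Box statistic Q = T(T+2) * sum_{g=1..h} rho_g^2 / (T - g)."""
    t_len = len(residuals)
    total = sum(sample_autocorrelation(residuals, g) ** 2 / (t_len - g)
                for g in range(1, h + 1))
    return t_len * (t_len + 2) * total


def rule_of_thumb_lags(m, t_len):
    """Number of lags h = min(2m, T/5), rounded to the nearest integer."""
    return round(min(2 * m, t_len / 5))


h = rule_of_thumb_lags(12, 84)   # min(24, 16.8), rounded to 17
df = h - 4                       # K = 4 parameters: AR 1, MA 1, SAR 1, SAR 2
CRITICAL_VALUE = 22.36           # chi-squared 0.95 quantile with 13 df, from the text

# Hypothetical residuals that alternate in sign, i.e. clearly not independent.
fake_residuals = [1.0 if t % 2 == 0 else -1.0 for t in range(84)]
q_stat = ljung_box_q(fake_residuals, h)
reject_independence = q_stat > CRITICAL_VALUE
```

For residuals from an adequate model, `q_stat` would be expected to fall below the critical value, which is the outcome reported for the US model.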

Table 4.5: Seasonal ARIMA model statistics: Alabama - Montana (one row per state; columns: State, Q-statistic, DF, P val, AIC)

Columns: State, Q-statistic, DF, P-value, AIC
States: Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming

Table 4.6: Seasonal ARIMA model statistics: Nebraska - Wyoming

Chapter 5 Model Forecasting

Our final task is to forecast future events from the model that we generated. To do this, we used R's forecast() command and applied it to our $\mathrm{ARIMA}(1, 0, 1) \times (2, 0, 0)_{12}$ model. The resulting forecast is shown below in Figure 5.1.

Figure 5.1: Forecast for $\mathrm{ARIMA}(1, 0, 1) \times (2, 0, 0)_{12}$ model (axes: Month, Forecast)

The black line on the left represents the actual data that we collected, whereas the blue line shows the model's forecast values for 24 months into the future. The shaded areas are prediction intervals for the forecast: the light grey region is the 95% prediction interval and the purple area is the 80% prediction interval for the forecasted data. The dotted line in Figure 5.1 is data that we collected from Google for January - July 2015. From this graph, the forecast looks to be a good fit: the observed values stay inside both the 80% and 95% prediction intervals.
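The interval check described above can be illustrated with a simplified stand-in. The sketch below is not the seasonal ARIMA model fitted in R: it uses a plain AR(1) process, for which the $h$-step forecast variance has the closed form $\sigma^2 \sum_{j=0}^{h-1} \phi^{2j}$, and every number in it (starting value, mean, $\phi$, $\sigma$, held-out observations) is hypothetical. It only shows how 80% and 95% prediction intervals are formed and how held-out observations are checked against them.

```python
from statistics import NormalDist


def ar1_forecast(last_value, mean, phi, sigma, steps, level):
    """Point forecasts and prediction intervals for an AR(1) process.

    Returns a list of (point, lower, upper) tuples, one per step ahead.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    results = []
    point = last_value
    variance = 0.0
    for h in range(1, steps + 1):
        point = mean + phi * (point - mean)         # forecast reverts toward the mean
        variance += sigma ** 2 * phi ** (2 * (h - 1))  # accumulated forecast variance
        half_width = z * variance ** 0.5
        results.append((point, point - half_width, point + half_width))
    return results


def all_inside(intervals, observed):
    """True if every observed value lies within its prediction interval."""
    return all(lo <= y <= hi for (_, lo, hi), y in zip(intervals, observed))


# Hypothetical parameters and held-out observations, for illustration only.
forecast_80 = ar1_forecast(2.0, 1.0, 0.5, 1.0, steps=3, level=0.80)
forecast_95 = ar1_forecast(2.0, 1.0, 0.5, 1.0, steps=3, level=0.95)
held_out = [1.0, 1.0, 1.0]
inside_both = all_inside(forecast_80, held_out) and all_inside(forecast_95, held_out)
```

The 95% interval is always wider than the 80% interval, so observations that stay inside the 80% band necessarily stay inside the 95% band as well.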

Chapter 6 Conclusion and Discussion

Starting with weekly flu count data for [...], we converted it to average monthly data and log-transformed it to normalize the data and reduce its variance. We were then left with a seasonal time series ready to be analyzed for spatial and temporal effects.

Our first step was to test for both global and local spatial autocorrelation. We used Moran's I statistic to test for both types of spatial correlation. Unfortunately, we did not find enough evidence to say there was significant global spatial autocorrelation. Although a handful of states displayed local spatial autocorrelation, we decided this was not enough evidence to continue using spatial analysis in our model. Thus, we continued without spatial effects and considered only temporal effects.

We created two potential seasonal ARIMA models: one from analyzing the ACF and PACF plots of the seasonally differenced data and the other generated by R. We then used the AIC values to help us choose between the two models. The lower AIC value led us to choose the model that R developed for the data.

Once we had our final model, we applied the Ljung-Box test to determine the overall quality of the seasonal ARIMA model. At the 5% significance level, the test did not reject the null hypothesis of independence in favor of the alternative hypothesis that correlation was present, so the residuals appear to be independently distributed. Thus, our model was an adequate fit for the data.

Lastly, we were able to create a 24 month forecast of the model. We generated the 80% and 95% prediction intervals based on the 24 month forecast and then plotted 7 months of data from 2015 to see how well the model could forecast. The actual data from January - July 2015 stayed within both prediction intervals, so we accepted this forecast and would suggest using it for predicting future Google flu trends across the United States.

References

[1] Anselin, Luc. "Spatial Econometrics." A Companion to Theoretical Econometrics, Chapter 14 (2003).
[2] Anselin, Luc. "Local Indicators of Spatial Association - LISA." Geographical Analysis 27 (2).
[3] Dukic, Vanja, Hedibert F. Lopes and Nicholas G. Polson. "Tracking Epidemics With Google Flu Trends Data and a State-Space SEIR Model." Journal of the American Statistical Association 107:500. Web.
[4] Fischer, Manfred M., and Jinfeng Wang. Spatial Data Analysis: Models, Methods and Techniques. Heidelberg: Springer. Springer Link. Springer International Publishing. Web.
[5] "Flu Trends." Google. N.p. Web. 22 July 2015, available at google.org/flutrends/about/data/flu/us/data.txt
[6] Franklin, Meredith. "Spatial Statistics." USC Keck School of Medicine. Web. 6 July.
[7] Hyndman, Rob J., and George Athanasopoulos. "8.1 Stationarity and Differencing." OTexts. Web.
[8] Hyndman, Rob J. "Thoughts on the Ljung-Box test." WordPress. Web. 02 July.
[9] Hyndman, Rob J., and George Athanasopoulos. "8.9 Seasonal ARIMA Models." OTexts. Web.
[10] LeSage, James P. "Lecture 1: Maximum likelihood estimation of spatial regression models" (2004), available at www4.fe.uc.pt/spatial/doc/lecture1.pdf. Web.
[11] Pace, R. Kelley, Ronald Barry, Otis W. Gilley and C.F. Sirmans. "A method for spatial-temporal forecasting with an application to real estate prices." International Journal of Forecasting 16 (2000): n. pag. Web.
[12] "The Flu Season." Centers for Disease Control and Prevention. Centers for Disease Control and Prevention, 22 Oct. Web.
[13] Wall, Melanie M. "A close look at the spatial structure implied by the CAR and SAR models." Journal of Statistical Planning and Inference 121 (2004). Web.
[14] Yang, Shihao, Mauricio Santillana and S. C. Kou. "Accurate estimation of influenza epidemics using Google search data via ARGO." Proceedings of the National Academy of Sciences (Proc Natl Acad Sci USA) (2015). Web.

Appendix A Spatial Matrix

The spatial weights matrix $W$, which is used for finding spatial autocorrelation, is determined by the collection of neighboring states for each state. The geographical map of the states, numbered in alphabetical order, is displayed below in Figure A.1. The corresponding list of neighbors associated with the map is found in Figure A.2 [13].

Figure A.1: Map of the continental states with corresponding numbers [13]

Figure A.2: List of Neighbors for the 48 States [13]
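To make the construction of $W$ from a neighbor list concrete, here is a minimal sketch. It is an illustration only: the function name is hypothetical, the example uses a four-state subset (with borders restricted to that subset) rather than the full 48-state list in Figure A.2, and both the binary contiguity coding and the optional row-standardization are assumptions about the weighting scheme rather than a restatement of the one used in the project.

```python
def weights_matrix(neighbors, row_standardize=False):
    """Build a spatial weights matrix from a dict of neighbor lists.

    States are indexed in alphabetical order, mirroring the numbering
    convention described for Figure A.1.
    """
    names = sorted(neighbors)
    index = {name: k for k, name in enumerate(names)}
    n = len(names)
    W = [[0.0] * n for _ in range(n)]
    for name, adjacent in neighbors.items():
        for other in adjacent:
            W[index[name]][index[other]] = 1.0  # 1 if the states share a border
    if row_standardize:
        for row in W:
            total = sum(row)
            if total > 0:
                row[:] = [w / total for w in row]  # each row sums to 1
    return names, W


# Illustrative subset only; the full neighbor list is the one in Figure A.2.
subset = {
    "Washington": ["Oregon", "Idaho"],
    "Oregon": ["Washington", "Idaho"],
    "Idaho": ["Washington", "Oregon", "Montana"],
    "Montana": ["Idaho"],
}
names, W = weights_matrix(subset)
names_std, W_std = weights_matrix(subset, row_standardize=True)
```

The diagonal of `W` is zero because no state is listed as its own neighbor, which is the property $W_{kk} = 0$ used in Appendix B.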

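Appendix B Moran I Proof

$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} (z_i - \bar{z})(z_j - \bar{z})}{\sum_{i=1}^{n} \sum_{j \neq i} W_{ij} \, \sum_{i=1}^{n} (z_i - \bar{z})^2}, \qquad E[I] = -\frac{1}{n-1}$$

Proof. Start by noting:

1. $z_1, \ldots, z_n$ are given and $\sum_{i=1}^{n} (z_i - \bar{z})^2$ is a constant.

2. For $k \neq l$, $E[(z_k - \bar{z})(z_l - \bar{z})]$ is the same $\forall\, k, l$.

3. For $k \neq l$,

$$E[(z_k - \bar{z})(z_l - \bar{z})] = \frac{\sum_{k} \sum_{l \neq k} (z_k - \bar{z})(z_l - \bar{z})}{n(n-1)} = \sum_{k} \left( \frac{z_k - \bar{z}}{n(n-1)} \right) \left( -(z_k - \bar{z}) \right) = -\frac{\sum_{k} (z_k - \bar{z})^2}{n(n-1)}$$

where the second equality uses $\sum_{l \neq k} (z_l - \bar{z}) = -(z_k - \bar{z})$, which holds because $\sum_{l} (z_l - \bar{z}) = 0$.

4. $\sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} = \sum_{i=1}^{n} \sum_{j \neq i} W_{ij}$ since $W_{kk} = 0 \;\forall\, k$.

Now we can algebraically prove:

$$E[I] = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} \, E[(z_i - \bar{z})(z_j - \bar{z})]}{\sum_{i=1}^{n} \sum_{j \neq i} W_{ij} \, \sum_{i=1}^{n} (z_i - \bar{z})^2} = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{1}{n(n-1)} W_{ij} \left( -\sum_{k} (z_k - \bar{z})^2 \right)}{\sum_{i=1}^{n} \sum_{j \neq i} W_{ij} \, \sum_{i=1}^{n} (z_i - \bar{z})^2} = -\frac{1}{n-1}$$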
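The result $E[I] = -\frac{1}{n-1}$ can be checked numerically on a small example. The sketch below is a sanity check under stated assumptions, not part of the proof: it takes the expectation to be the average of $I$ over all $n!$ assignments of the given values to locations, and it uses a hypothetical five-location chain of neighbors with made-up values.

```python
from itertools import permutations


def morans_i(z, W):
    """Moran's I for values z and spatial weights matrix W (zero diagonal)."""
    n = len(z)
    z_bar = sum(z) / n
    dev = [v - z_bar for v in z]
    numerator = sum(W[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    weight_sum = sum(W[i][j] for i in range(n) for j in range(n))
    return n * numerator / (weight_sum * sum(d * d for d in dev))


# Hypothetical weights: five locations in a chain, each adjacent to the next.
W = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
z = [2.0, 3.5, 1.0, 4.0, 7.5]  # hypothetical observations

# Average Moran's I over every assignment of the values to the locations.
values = [morans_i(list(p), W) for p in permutations(z)]
mean_i = sum(values) / len(values)
expected = -1.0 / (len(z) - 1)  # -1/(n-1) = -0.25 for n = 5
```

Up to floating-point error, `mean_i` agrees with `expected`, and the agreement does not depend on the particular values in `z` or on the structure of `W`, as the proof implies.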
