CLEAR EVIDENCE OF VOTING ANOMALIES IN BLADEN AND ROBESON COUNTIES

Richard L. Smith
February 11, 2019
This is a revision of an earlier commentary submitted on January 18.

I am a professor of statistics at the University of North Carolina, Chapel Hill. As an exercise for students in one of my courses, I downloaded and asked the students to analyze data on absentee ballots in this election. The results showed clear evidence of missing absentee ballots in Bladen and Robeson Counties, with an excess, compared with normal random variation, of at least 1,500 absentee ballots having not been returned in these two counties. The following commentary is based on my own analysis, supplemented by those of the students.

Major Conclusion

This commentary addresses the allegation of absentee ballots being tampered with in Bladen and Robeson Counties. Without making any assumptions about what happened or who might have been responsible, I show that there was a large number of misplaced votes among the absentee ballots in these two counties: absentee ballots that were requested but never returned, over and above what could have been expected from normal statistical variation. Based on the other 98 counties of North Carolina, I estimate that at least 1,500 votes were misplaced. If those votes had indeed been intended for the Democratic candidate, they would overturn the current majority of 905 votes in favor of the Republican candidate.
Methods

The website of the North Carolina Board of Elections (NCBOE) includes a data file consisting of an anonymized record of all 2,111,797 absentee ballots requested for the November 6 election, together with the ultimate disposition of those ballots (whether they were accepted as valid votes or fell into one of several possible alternate outcomes). From this data file I was able to calculate the percentage not returned, which I will abbreviate as PNR, for each of the 100 counties of North Carolina. For each county, PNR is defined as

    PNR = 100 x (number of absentee ballots requested but not returned) / (number of absentee ballots requested).

The density plot of PNR (produced by the code in the Appendix) shows the distribution of PNR over all 100 counties. In 98 counties (the left side of the picture), the PNR is between 0 and 4%, and the distribution closely follows a classic normal or bell-shaped curve. Two counties, however, stand out as being very far from the other 98: Bladen and Robeson Counties, with PNRs of 11.31% and 11.0% respectively.

The variability of PNR over counties may be explained in various ways. Some of it can be explained by demographic factors such as race and educational level. The majority (probably at least 75% of the total variability) is just random. However, we can combine both the demographic and random components of variability to come up with predictions of the PNR for Bladen and Robeson Counties, based on the data in the other 98 counties. These predictions show what could be expected in the absence of vote tampering or any other mechanism that might explain why these two counties were anomalous.

It is in the nature of any statistical analysis that predictions cannot be certain. We can allow for uncertainty by quoting a prediction interval, which is a range of values that includes the desired prediction with a specified probability. In this exercise, I have used 99% prediction intervals.
This may be interpreted as meaning that, taking account of both demographic and random variability, the quoted interval will contain the quantity being predicted with 99% probability, assuming that the same statistical model applies to Bladen and Robeson as to the other 98 counties.

The demographic variables selected were the mean travel time to work, the percentage of high school graduates, the percentage of Blacks in a given county, and the percentage of Hispanics in a given county. Based on these demographic predictors, the predicted PNR for Bladen County is 1.6%, with a 99% prediction interval from 0.59% to 4.3%. The actual PNR in Bladen was 11.31%, an excess of 7.01% over the upper end of the prediction interval. The number of absentee ballots requested in Bladen was 8,110, so I estimate there were at least 569 (7.01% of 8,110) misplaced votes in Bladen County. A corresponding calculation for Robeson County (with 16,069 requested absentee ballots) gives at least 1,197 misplaced votes there. Combining the two counties, there appear to have been at least 1,766 misplaced votes, well in excess of the 905 votes by which Mark Harris led the actual count.

There are numerous possible variants of this analysis based on different demographic variables or different ways of handling the random error. The students in my class submitted numerous alternative analyses. However, in every analysis that I was able to verify, the number of misplaced votes was in excess of 1,500.
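The PNR computation described in this section can be sketched in R. The following is a toy illustration on an invented data frame; the real analysis runs over the NCBOE absentee file, whose layout and column names differ.

```r
# Toy sketch of the PNR calculation; the data frame and its column
# names are invented (the real input is the NCBOE absentee file).
ballots <- data.frame(
  county   = c("A", "A", "A", "A", "B", "B"),
  returned = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
)
requested <- table(ballots$county)                  # ballots requested, per county
not_ret <- table(factor(ballots$county[!ballots$returned],
                        levels = names(requested))) # requested but never returned
pnr <- 100 * as.numeric(not_ret) / as.numeric(requested)
# County A: 1 of 4 not returned, PNR = 25; County B: 0 of 2, PNR = 0
```

On the real file, the same two tables would be formed from the county column and the ballot-disposition column.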
Details

The file on Absentee Data was downloaded from the North Carolina Board of Elections website. Data on a number of demographic variables, including population size (as of 2017), the proportion of individuals in each county who have graduated at either the high-school or college level, the proportions of Hispanics and non-Hispanic Blacks in each county, and the median income level for each county, were downloaded from various online sources. This information was combined into a single downloadable file. The file lists the PNR for each county as well as the following demographic variables: population (as of 2017), percentage rural, median age, mean travel time to work, percentage of high school graduates, percentage of college graduates, percentage of Blacks and percentage of Hispanics. The file also lists the total number of absentee ballot requests in each county for the November 6, 2018, election.

The idea of a regression analysis is to take a variable of interest (in this case, PNR) and express it in terms of the other variables that could affect it. It may be necessary to make various transformations, such as taking logarithms. There is no unique best way to do this, but using two well-established methods of variable selection (AIC and backward selection), the following model was identified:

    log(PNR) = b0 + b1 x Travel + b2 x HSGrad + b3 x Black + b4 x Hisp + error.

In words, the logarithm of PNR in a given county is expressed as a combination of the mean travel time to work (Travel), the percentage of high school graduates (HSGrad), the percentage of Blacks (Black) and the percentage of Hispanics (Hisp), plus a random error that accounts for unexplained variability among the counties. The model just described was fitted to the 98 counties other than Bladen and Robeson, and then used to predict the PNR for Bladen and Robeson.
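The AIC-based selection can be sketched with R's step() function on synthetic data. The predictor names below mirror the text, but the data and coefficients are invented; this is not the author's fitted model, only an illustration of the selection mechanism.

```r
# Sketch of backward deletion by AIC with step(); synthetic data only.
set.seed(42)
n <- 98
d <- data.frame(Travel = rnorm(n), HSGrad = rnorm(n),
                Black  = rnorm(n), Noise  = rnorm(n))
# By construction, only Travel and Black affect the response
d$logPNR <- 0.5 + 0.3 * d$Travel + 0.4 * d$Black + rnorm(n, sd = 0.2)
full <- lm(logPNR ~ Travel + HSGrad + Black + Noise, data = d)
sel  <- step(full, trace = 0)  # drops terms while doing so improves AIC
names(coef(sel))               # the informative predictors survive
```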
To be precise, I computed a 99% prediction interval for each of Bladen and Robeson Counties, which means a range of values (separately for each county) that will, if the assumptions of the statistical analysis are correct, include the true value with probability 0.99. Here, saying the assumptions of the statistical analysis are correct allows for normal random variation between counties, but not for any unusual circumstances such as tampering with the ballots. The results are:

For Bladen County, the predicted PNR is 1.6%, with 99%-probability prediction limits of 0.59% and 4.3%;

For Robeson County, the predicted PNR is 1.31%, with 99%-probability prediction limits of 0.48% and 3.55%.
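The prediction intervals come straight from R's predict(), as in the Appendix code. The snippet below uses invented data and only illustrates the behaviour: a 99% prediction interval brackets the fitted value and is wider than the corresponding confidence interval for the mean, because it also carries the random error of a single new observation.

```r
# 99% prediction vs confidence interval from lm(); invented data.
set.seed(1)
x <- runif(98)
y <- 1 + 2 * x + rnorm(98, sd = 0.3)
fit <- lm(y ~ x)
new <- data.frame(x = 0.5)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.99)
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.99)
# pred_int and conf_int are 1 x 3 matrices with columns fit, lwr, upr;
# the prediction interval is the wider of the two
```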
Alternatively, one can ignore the demographic variables and simply make the predictions assuming random variability among counties. In this case, the predicted PNR for both Bladen and Robeson Counties (based on the other 98) is 1.49%, with 99%-probability prediction limits of 0.48% and 4.57%. The observed PNRs for Bladen and Robeson Counties are 11.31% and 11.0% respectively.

We can now translate these assessments into actual counts of potentially misplaced votes. Under the first analysis above, for Bladen County, the upper limit of the prediction interval is 4.3%. The observed PNR was 11.31%, and the number of absentee ballots requested was 8,110. For Robeson County, the corresponding numbers are 3.55%, 11.0% and 16,069. Therefore, our estimated lower limit on the number of misplaced votes is

    8,110 x (11.31 - 4.3)/100 + 16,069 x (11.0 - 3.55)/100 = 1,766

(to the nearest whole number). Alternatively, based on the second analysis above, the corresponding calculation is

    8,110 x (11.31 - 4.57)/100 + 16,069 x (11.0 - 4.57)/100 = 1,580.

Students in my class submitted a number of alternative versions of the analysis. Some of them framed the analysis to predict PNR directly, rather than the logarithm of PNR as in the above analysis. In addition, the students used different methods to decide which demographic variables to include, resulting in different model equations. In every analysis that I was able to verify, however, the estimated number of misplaced votes was at least 1,500.

Recall that, in the initial count of votes, the Republican Mark Harris was leading by 905 votes. The statistical analysis given here, while supporting that there were misplaced absentee ballots, does not provide any evidence concerning which candidate would have received those votes. However, if allegations of ballot tampering are correct, there is at least a plausible argument that the misplaced votes were all intended for the Democratic candidate.

I am available to answer questions as needed.
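The lower-bound arithmetic can be checked in a few lines of R, using only the observed PNRs and the prediction-limit percentages quoted in the text:

```r
# Misplaced-vote lower bounds, from the figures quoted in the text.
# Regression-based upper limits: 4.3% (Bladen), 3.55% (Robeson).
bladen  <- 8110  * (11.31 - 4.30) / 100
robeson <- 16069 * (11.00 - 3.55) / 100
round(bladen + robeson)  # 1766
# No-covariate analysis: common upper limit 4.57% for both counties.
alt <- 8110 * (11.31 - 4.57) / 100 + 16069 * (11.00 - 4.57) / 100
round(alt)               # 1580
```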
Appendix: Statistical Code in R

R version ( ) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)

# R code for NC Voter Analysis
# Richard L. Smith, February 2019

# Load data
Y=read.csv('C:/Users/rsmith/jan16/UNC/STOR556/Data/ProportionNotReturned.csv',header=T)

# Density plot of PNR across counties
par(cex=1.5)
plot(density(100*Y$PNR),main='Percentage of Absentee Ballots Not Returned')
rug(100*Y$PNR)
lines(density(100*Y$PNR),lwd=3)
text(11,0.4,'Bladen')
lines(c(11.3,11),c(0,0.38))
text(9.2,0.3,'Robeson')
lines(c(11,9.2),c(0,0.28))
text(2,0.05,'All OTHER COUNTIES',cex=0.7)

# Regression analysis using log(100*PNR) as response
# Variable selection via AIC
y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)   # zero weight for Bladen and Robeson (PNR above 10%)
lm1=lm(y1~Y$Pop+Y$Rural+Y$MedAge+Y$Travel+Y$Hsgrad+Y$Collgrad+Y$MedInc+Y$Black+Y$Hisp,weights=wts)
lm2=step(lm1)

Start:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad + Y$Collgrad +
    Y$MedInc + Y$Black + Y$Hisp
- Y$MedInc
- Y$Collgrad
- Y$Travel
- Y$MedAge
- Y$Rural
<none>
- Y$Hisp
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad + Y$Collgrad +
    Y$Black + Y$Hisp
- Y$Collgrad
- Y$MedAge
- Y$Travel
- Y$Rural
<none>
- Y$Hisp
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad + Y$Black + Y$Hisp
- Y$MedAge
- Y$Travel
- Y$Rural
- Y$Hisp
<none>
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$Travel + Y$Hsgrad + Y$Black + Y$Hisp
- Y$Rural
- Y$Travel
<none>
- Y$Pop
- Y$Hisp
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Travel + Y$Hsgrad + Y$Black + Y$Hisp
- Y$Pop
<none>
- Y$Hisp
- Y$Travel
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Travel + Y$Hsgrad + Y$Black + Y$Hisp
<none>
- Y$Travel
- Y$Hisp
- Y$Hsgrad
- Y$Black

lm1=lm2

# Write out the final model and compute prediction limits
summary(lm1)

lm(formula = y1 ~ Y$Travel + Y$Hsgrad + Y$Black + Y$Hisp, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                        **
Y$Travel
Y$Hsgrad                                           **
Y$Black                                   e-06     ***
Y$Hisp                                             *

Residual standard error:  on 93 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 4 and 93 DF,  p-value: 7.393e-06
summary(lm1)$adj.r.squared
[1]

pr1=predict(lm1,se.fit=T,interval='prediction',level=0.99,weights=1)
Warning message:
In predict.lm(lm1, se.fit = T, interval = "prediction", level = 0.99, :
  predictions on current data refer to _future_ responses

n1=(Y$PNR[9]-exp(pr1$fit[9,3])/100)*Y$AbsBal[9]    # county 9 is Bladen
n2=(Y$PNR[78]-exp(pr1$fit[78,3])/100)*Y$AbsBal[78] # county 78 is Robeson
print(exp(pr1$fit[c(9,78),]))
    fit  lwr  upr
print(c(n1,n2,n1+n2))
[1]
print(Y$AbsBal[c(9,78)])
[1]

# Alternative solution by backward selection
y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~Y$Pop+Y$Rural+Y$MedAge+Y$Travel+Y$Hsgrad+Y$Collgrad+Y$MedInc+Y$Black+Y$Hisp,weights=wts)
summary(lm1)

lm(formula = y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad +
    Y$Collgrad + Y$MedInc + Y$Black + Y$Hisp, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)        e          e                    *
Y$Pop       5.082e          e
Y$Rural     2.638e          e
Y$MedAge           e          e
Y$Travel    1.213e          e
Y$Hsgrad    3.192e          e                      *
Y$Collgrad         e          e
Y$MedInc    5.502e          e
Y$Black     1.195e          e             e-05     ***
Y$Hisp      1.683e          e

Residual standard error:  on 88 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 9 and 88 DF,  p-value:

y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~Y$Pop+Y$Rural+Y$MedAge+Y$Travel+Y$Hsgrad+Y$Collgrad+Y$Black+Y$Hisp,weights=wts)
summary(lm1)

lm(formula = y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad +
    Y$Collgrad + Y$Black + Y$Hisp, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)        e          e                    *
Y$Pop       5.083e          e
Y$Rural     2.601e          e
Y$MedAge           e          e
Y$Travel    1.278e          e
Y$Hsgrad    3.207e          e                      *
Y$Collgrad         e          e
Y$Black     1.192e          e             e-05     ***
Y$Hisp      1.693e          e

Residual standard error:  on 89 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 8 and 89 DF,  p-value:

y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~Y$Pop+Y$Rural+Y$MedAge+Y$Travel+Y$Hsgrad+Y$Black+Y$Hisp,weights=wts)
summary(lm1)

lm(formula = y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad +
    Y$Black + Y$Hisp, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)        e          e
Y$Pop       4.285e          e
Y$Rural     2.633e          e
Y$MedAge           e          e
Y$Travel    1.362e          e
Y$Hsgrad    2.562e          e                      *
Y$Black     1.212e          e             e-05     ***
Y$Hisp      1.564e          e

Residual standard error:  on 90 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 7 and 90 DF,  p-value: 5.246e-05

y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~Y$Pop+Y$Rural+Y$Travel+Y$Hsgrad+Y$Black+Y$Hisp,weights=wts)
summary(lm1)

lm(formula = y1 ~ Y$Pop + Y$Rural + Y$Travel + Y$Hsgrad + Y$Black +
    Y$Hisp, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)        e          e                    *
Y$Pop       4.387e          e
Y$Rural     1.940e          e
Y$Travel    1.385e          e
Y$Hsgrad    2.582e          e                      *
Y$Black     1.263e          e             e-06     ***
Y$Hisp      1.876e          e

Residual standard error:  on 91 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 6 and 91 DF,  p-value: 2.737e-05

y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~Y$Pop+Y$Travel+Y$Hsgrad+Y$Black+Y$Hisp,weights=wts)
summary(lm1)

lm(formula = y1 ~ Y$Pop + Y$Travel + Y$Hsgrad + Y$Black + Y$Hisp,
    weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)        e          e                    *
Y$Pop       3.230e          e
Y$Travel    1.919e          e
Y$Hsgrad    2.119e          e                      *
Y$Black     1.200e          e             e-06     ***
Y$Hisp      1.596e          e

Residual standard error:  on 92 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 5 and 92 DF,  p-value: 1.308e-05

y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~Y$Travel+Y$Hsgrad+Y$Black+Y$Hisp,weights=wts)
summary(lm1)

lm(formula = y1 ~ Y$Travel + Y$Hsgrad + Y$Black + Y$Hisp, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                        **
Y$Travel
Y$Hsgrad                                           **
Y$Black                                   e-06     ***
Y$Hisp                                             *

Residual standard error:  on 93 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 4 and 93 DF,  p-value: 7.393e-06

summary(lm1)$adj.r.squared
[1]

pr1=predict(lm1,se.fit=T,interval='prediction',level=0.99,weights=1)
Warning message:
In predict.lm(lm1, se.fit = T, interval = "prediction", level = 0.99, :
  predictions on current data refer to _future_ responses

n1=(Y$PNR[9]-exp(pr1$fit[9,3])/100)*Y$AbsBal[9]
n2=(Y$PNR[78]-exp(pr1$fit[78,3])/100)*Y$AbsBal[78]
print(exp(pr1$fit[c(9,78),]))
    fit  lwr  upr
print(c(n1,n2,n1+n2))
[1]

# Alternative with no demographic covariates
y1=log(100*Y$PNR)
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~1,weights=wts)
summary(lm1)

lm(formula = y1 ~ 1, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               e-15     ***

Residual standard error:  on 97 degrees of freedom

summary(lm1)$adj.r.squared
[1] 0

pr1=predict(lm1,se.fit=T,interval='prediction',level=0.99,weights=1)
Warning message:
In predict.lm(lm1, se.fit = T, interval = "prediction", level = 0.99, :
  predictions on current data refer to _future_ responses

n1=(Y$PNR[9]-exp(pr1$fit[9,3])/100)*Y$AbsBal[9]
n2=(Y$PNR[78]-exp(pr1$fit[78,3])/100)*Y$AbsBal[78]
print(exp(pr1$fit[c(9,78),]))
    fit  lwr  upr
print(c(n1,n2,n1+n2))
[1]

# Same exercise using PNR directly, with AIC for model selection
y1=100*Y$PNR
wts=as.numeric(Y$PNR<0.1)
lm1=lm(y1~Y$Pop+Y$Rural+Y$MedAge+Y$Travel+Y$Hsgrad+Y$Collgrad+Y$MedInc+Y$Black+Y$Hisp,weights=wts)
lm2=step(lm1)

Start:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad + Y$Collgrad +
    Y$MedInc + Y$Black + Y$Hisp
- Y$MedInc
- Y$Collgrad
- Y$Travel
- Y$Rural
- Y$MedAge
- Y$Hisp
<none>
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Travel + Y$Hsgrad + Y$Collgrad +
    Y$Black + Y$Hisp
- Y$Travel
- Y$MedAge
- Y$Hisp
- Y$Collgrad
- Y$Rural
<none>
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$MedAge + Y$Hsgrad + Y$Collgrad + Y$Black + Y$Hisp
- Y$MedAge
- Y$Hisp
- Y$Collgrad
<none>
- Y$Rural
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$Hsgrad + Y$Collgrad + Y$Black + Y$Hisp
- Y$Collgrad
<none>
- Y$Rural
- Y$Hisp
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$Hsgrad + Y$Black + Y$Hisp
- Y$Hisp
<none>
- Y$Rural
- Y$Pop
- Y$Hsgrad
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Rural + Y$Hsgrad + Y$Black
- Y$Rural
<none>
- Y$Hsgrad
- Y$Pop
- Y$Black

Step:  AIC=
y1 ~ Y$Pop + Y$Hsgrad + Y$Black
<none>
- Y$Hsgrad
- Y$Pop
- Y$Black

lm1=lm2

# Write out the final model and compute prediction limits
summary(lm1)

lm(formula = y1 ~ Y$Pop + Y$Hsgrad + Y$Black, weights = wts)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)        e          e
Y$Pop       7.561e          e
Y$Hsgrad    2.277e          e
Y$Black     1.833e          e             e-05     ***

Residual standard error:  on 94 degrees of freedom
Multiple R-squared:  ,  Adjusted R-squared:
F-statistic:  on 3 and 94 DF,  p-value: 2.498e-05

summary(lm1)$adj.r.squared
[1]
### The R-squared is not as large as for the log PNR model

pr1=predict(lm1,se.fit=T,interval='prediction',level=0.99,weights=1)
Warning message:
In predict.lm(lm1, se.fit = T, interval = "prediction", level = 0.99, :
  predictions on current data refer to _future_ responses

n1=(Y$PNR[9]-(pr1$fit[9,3])/100)*Y$AbsBal[9]
n2=(Y$PNR[78]-(pr1$fit[78,3])/100)*Y$AbsBal[78]
print(pr1$fit[c(9,78),])   # no exp() here: this model predicts PNR directly
    fit  lwr  upr
print(c(n1,n2,n1+n2))
[1]
### final count 1888, higher than the counts quoted for the earlier models
More informationIntroduction to Linear Regression
Introduction to Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Introduction to Linear Regression 1 / 46
More informationExam Applied Statistical Regression. Good Luck!
Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.
More informationSTA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).
STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population
More information1.) Fit the full model, i.e., allow for separate regression lines (different slopes and intercepts) for each species
Lecture notes 2/22/2000 Dummy variables and extra SS F-test Page 1 Crab claw size and closing force. Problem 7.25, 10.9, and 10.10 Regression for all species at once, i.e., include dummy variables for
More informationChapter 5 Exercises 1. Data Analysis & Graphics Using R Solutions to Exercises (April 24, 2004)
Chapter 5 Exercises 1 Data Analysis & Graphics Using R Solutions to Exercises (April 24, 2004) Preliminaries > library(daag) Exercise 2 The final three sentences have been reworded For each of the data
More informationMath 10 - Compilation of Sample Exam Questions + Answers
Math 10 - Compilation of Sample Exam Questions + Sample Exam Question 1 We have a population of size N. Let p be the independent probability of a person in the population developing a disease. Answer the
More informationBiostatistics 380 Multiple Regression 1. Multiple Regression
Biostatistics 0 Multiple Regression ORIGIN 0 Multiple Regression Multiple Regression is an extension of the technique of linear regression to describe the relationship between a single dependent (response)
More informationIntroduction and Background to Multilevel Analysis
Introduction and Background to Multilevel Analysis Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Background and
More informationVariables and Variable De nitions
APPENDIX A Variables and Variable De nitions All demographic county-level variables have been drawn directly from the 1970, 1980, and 1990 U.S. Censuses of Population, published by the U.S. Department
More informationRegression. Marc H. Mehlman University of New Haven
Regression Marc H. Mehlman marcmehlman@yahoo.com University of New Haven the statistician knows that in nature there never was a normal distribution, there never was a straight line, yet with normal and
More informationSCHOOL OF MATHEMATICS AND STATISTICS
RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall) Data provided: Statistics Tables by H.R. Neave MAS5052 SCHOOL OF MATHEMATICS AND STATISTICS Basic Statistics Spring Semester
More informationExplanatory Variables Must be Linear Independent...
Explanatory Variables Must be Linear Independent... Recall the multiple linear regression model Y j = β 0 + β 1 X 1j + β 2 X 2j + + β p X pj + ε j, i = 1,, n. is a shorthand for n linear relationships
More informationMultiple Linear Regression
Multiple Linear Regression ST 430/514 Recall: a regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates).
More informationCommunity Health Needs Assessment through Spatial Regression Modeling
Community Health Needs Assessment through Spatial Regression Modeling Glen D. Johnson, PhD CUNY School of Public Health glen.johnson@lehman.cuny.edu Objectives: Assess community needs with respect to particular
More information14 Multiple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in
More informationChapter 1. The Mathematics of Voting
Introduction to Contemporary Mathematics Math 112 1.1. Preference Ballots and Preference Schedules Example (The Math Club Election) The math club is electing a new president. The candidates are Alisha
More information13 Simple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity
More informationAn Introduction to Mplus and Path Analysis
An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression
More informationL21: Chapter 12: Linear regression
L21: Chapter 12: Linear regression Department of Statistics, University of South Carolina Stat 205: Elementary Statistics for the Biological and Life Sciences 1 / 37 So far... 12.1 Introduction One sample
More informationLecture 8: Fitting Data Statistical Computing, Wednesday October 7, 2015
Lecture 8: Fitting Data Statistical Computing, 36-350 Wednesday October 7, 2015 In previous episodes Loading and saving data sets in R format Loading and saving data sets in other structured formats Intro
More informationGeneralized linear models
Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models
More informationChapter 8 Conclusion
1 Chapter 8 Conclusion Three questions about test scores (score) and student-teacher ratio (str): a) After controlling for differences in economic characteristics of different districts, does the effect
More informationMath 141. Lecture 27: More Issues in Model Selection and Interpretation. Albyn Jones 1. 1 Library 304
Math 141 Lecture 27: More Issues in Model Selection and Interpretation Albyn Jones 1 1 Library 304 jones@reed.edu www.people.reed.edu/ jones/courses/141 Confounding Confounding: a term from experimental
More informationOutline. 1 Preliminaries. 2 Introduction. 3 Multivariate Linear Regression. 4 Online Resources for R. 5 References. 6 Upcoming Mini-Courses
UCLA Department of Statistics Statistical Consulting Center Introduction to Regression in R Part II: Multivariate Linear Regression Denise Ferrari denise@stat.ucla.edu Outline 1 Preliminaries 2 Introduction
More informationUNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018
UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics January, 2018 Work all problems. 60 points needed to pass at the Masters level, 75 to pass at the PhD
More informationQUEEN S UNIVERSITY FINAL EXAMINATION FACULTY OF ARTS AND SCIENCE DEPARTMENT OF ECONOMICS APRIL 2018
Page 1 of 4 QUEEN S UNIVERSITY FINAL EXAMINATION FACULTY OF ARTS AND SCIENCE DEPARTMENT OF ECONOMICS APRIL 2018 ECONOMICS 250 Introduction to Statistics Instructor: Gregor Smith Instructions: The exam
More informationSupplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs"
Supplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs" Yingying Dong University of California Irvine February 2018 Abstract This document provides
More informationStat 401B Exam 3 Fall 2016 (Corrected Version)
Stat 401B Exam 3 Fall 2016 (Corrected Version) I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied
More informationConsider fitting a model using ordinary least squares (OLS) regression:
Example 1: Mating Success of African Elephants In this study, 41 male African elephants were followed over a period of 8 years. The age of the elephant at the beginning of the study and the number of successful
More informationEconometrics Review questions for exam
Econometrics Review questions for exam Nathaniel Higgins nhiggins@jhu.edu, 1. Suppose you have a model: y = β 0 x 1 + u You propose the model above and then estimate the model using OLS to obtain: ŷ =
More informationLogistic Regression 21/05
Logistic Regression 21/05 Recall that we are trying to solve a classification problem in which features x i can be continuous or discrete (coded as 0/1) and the response y is discrete (0/1). Logistic regression
More informationLinear Regression is a very popular method in science and engineering. It lets you establish relationships between two or more numerical variables.
Lab 13. Linear Regression www.nmt.edu/~olegm/382labs/lab13r.pdf Note: the things you will read or type on the computer are in the Typewriter Font. All the files mentioned can be found at www.nmt.edu/~olegm/382labs/
More informationStat 401B Final Exam Fall 2015
Stat 401B Final Exam Fall 015 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning
More information1 The Classic Bivariate Least Squares Model
Review of Bivariate Linear Regression Contents 1 The Classic Bivariate Least Squares Model 1 1.1 The Setup............................... 1 1.2 An Example Predicting Kids IQ................. 1 2 Evaluating
More informationRegression. Bret Hanlon and Bret Larget. December 8 15, Department of Statistics University of Wisconsin Madison.
Regression Bret Hanlon and Bret Larget Department of Statistics University of Wisconsin Madison December 8 15, 2011 Regression 1 / 55 Example Case Study The proportion of blackness in a male lion s nose
More informationIn the previous chapter, we learned how to use the method of least-squares
03-Kahane-45364.qxd 11/9/2007 4:40 PM Page 37 3 Model Performance and Evaluation In the previous chapter, we learned how to use the method of least-squares to find a line that best fits a scatter of points.
More informationLecture 18: Simple Linear Regression
Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength
More informationSCHOOL OF MATHEMATICS AND STATISTICS
SHOOL OF MATHEMATIS AND STATISTIS Linear Models Autumn Semester 2015 16 2 hours Marks will be awarded for your best three answers. RESTRITED OPEN BOOK EXAMINATION andidates may bring to the examination
More informationFoundations of Correlation and Regression
BWH - Biostatistics Intermediate Biostatistics for Medical Researchers Robert Goldman Professor of Statistics Simmons College Foundations of Correlation and Regression Tuesday, March 7, 2017 March 7 Foundations
More informationInteractions. Interactions. Lectures 1 & 2. Linear Relationships. y = a + bx. Slope. Intercept
Interactions Lectures 1 & Regression Sometimes two variables appear related: > smoking and lung cancers > height and weight > years of education and income > engine size and gas mileage > GMAT scores and
More informationPart II { Oneway Anova, Simple Linear Regression and ANCOVA with R
Part II { Oneway Anova, Simple Linear Regression and ANCOVA with R Gilles Lamothe February 21, 2017 Contents 1 Anova with one factor 2 1.1 The data.......................................... 2 1.2 A visual
More informationChapter 5 Exercises 1
Chapter 5 Exercises 1 Data Analysis & Graphics Using R, 2 nd edn Solutions to Exercises (December 13, 2006) Preliminaries > library(daag) Exercise 2 For each of the data sets elastic1 and elastic2, determine
More information