How to work correctly statistically about sex ratio


Marc Girondot
Version of 12th April 2014

Contents

1 Load packages
2 Introduction
3 Confidence interval of a proportion
    3.1 Percentage
    3.2 For distribution with more than 2 states
4 Linear regression upon sex ratio
    4.1 What you should never do...
5 Angular transformed proportion
6 Weight of the different measures
7 Using a GLM to analyze sex ratio
    7.1 Why a GLM?
        Functions logit and probit
        Likelihood of an observation
    7.2 Analysis by a generalized linear model (GLM)
    7.3 To go further...
8 Predict function after GLM
9 Application for TSD pattern

1 Load packages

    # install.packages() needs a character vector to install several packages
    install.packages(c("coda", "deSolve", "devtools", "entropy", "numDeriv",
        "optimx", "parallel", "phenology", "shiny", "Hmisc"))
    # library() loads one package per call; devtools provides install_url()
    library("devtools")
    # install_url(" ")  # the URL is truncated in the source
    suppressMessages(library("phenology"))
    suppressMessages(library("Hmisc"))
    suppressMessages(library("MultinomialCI"))
    suppressMessages(library("embryogrowth"))

2 Introduction

A proportion is defined as the ratio between the number of occurrences of a specific event and the total number of possible occurrences. Consider an event whose outcome can be in one of two states, A or B, with $n_A$ and $n_B$ the numbers of occurrences of A and B and $N = n_A + n_B$. The proportions of A and B are then:

$$p_A = n_A / N \qquad p_B = n_B / N$$

Proportions can also be defined when there are more than 2 states, e.g. 3 states A, B and C. This is generally not necessary for sex ratio, except if an "unknown" category is included.

$$p_A = n_A / N \qquad p_B = n_B / N \qquad p_C = n_C / N \qquad \text{with } N = n_A + n_B + n_C$$

The distribution of events with two outcomes is based on the binomial distribution, and with n outcomes on the multinomial distribution.

These rules for calculating proportions can be used to calculate frequencies from observations or to establish probabilities that are measures of uncertainty about an event.

Let us see how to write proportions in different ways with R, to learn the language. We can define the various events as a vector of 0 and 1 with c():

    obs1 <- c(1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0)
    N1 <- length(obs1)
    obs1 == 1

    ##  [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
    ##  [8]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
    ## [15]  TRUE FALSE

    na1 <- sum(obs1 == 1)
    nb1 <- N1 - na1
    pa1 <- na1/N1
    pb1 <- nb1/N1
    print(paste("There are", na1, "'1' for", N1,
        "in total then the frequency is", pa1))
    ## [1] "There are 10 '1' for 16 in total then the frequency is 0.625"
    print(paste("There are", nb1, "'0' for", N1,
        "in total then the frequency is", pb1))
    ## [1] "There are 6 '0' for 16 in total then the frequency is 0.375"

Or simply using the table() function:

    (tobs1 <- table(obs1))
    ## obs1
    ##  0  1
    ##  6 10
    print(paste("There are", tobs1[2], "'1' for", tobs1[2] + tobs1[1],
        "in total then the frequency is", tobs1[2]/(tobs1[2] + tobs1[1])))
    ## [1] "There are 10 '1' for 16 in total then the frequency is 0.625"
    print(paste("There are", tobs1[1], "'0' for", tobs1[2] + tobs1[1],
        "in total then the frequency is", tobs1[1]/(tobs1[2] + tobs1[1])))
    ## [1] "There are 6 '0' for 16 in total then the frequency is 0.375"

One can also define the various events as a vector of A, B and C with c():

    obs2 <- c("A", "C", "B", "C", "A", "A", "C", "C", "A", "A", "C", "B",
        "C", "A")
    # count each state and store the counts in a named vector
    n2 <- c(A = sum(obs2 == "A"))
    n2 <- c(n2, B = sum(obs2 == "B"))
    n2 <- c(n2, C = sum(obs2 == "C"))
    n2
    ## A B C
    ## 6 2 6

    N2 <- sum(n2)
    (p2 <- n2/N2)
    ##         A         B         C
    ## 0.4285714 0.1428571 0.4285714
    (table(obs2))
    ## obs2
    ## A B C
    ## 6 2 6

It is of course also possible to define the data in aggregated form:

    n3 <- c(A = 20, B = 21, C = 3)
    N3 <- sum(n3)
    (p3 <- n3/N3)
    ##          A          B          C
    ## 0.45454545 0.47727273 0.06818182

3 Confidence interval of a proportion

The Hmisc package has a very useful function, binconf(), to calculate the confidence interval of a proportion, but it only works when there are two possible states because the method is based on the binomial distribution.

    b1 <- binconf(na1, N1)

To draw a dot plot with error bars, I find it easier to use the package phenology because the syntax of plot_errbar() is the same as that of the plot() function of base R.

    plot_errbar(1, b1[, 1], y.plus = b1[, 3], y.minus = b1[, 2],
        ylab = "Proportion", xlab = "Category", las = 1, ylim = c(0, 1),
        bty = "n")

It is easy to represent a series of frequencies with their confidence intervals. By default alpha = 0.05, i.e. there is a 95 % chance that the true proportion lies within the calculated confidence interval:

    n4 <- c(10, 2, 5, 7, 9, 12, 5)
    N4 <- c(45, 10, 5, 19, 12, 24, 6)
    b4 <- binconf(n4, N4)
    plot_errbar(1:7, b4[, 1], y.plus = b4[, 3], y.minus = b4[, 2],
        ylab = "Proportion", xlab = "Category", las = 1, ylim = c(0, 1),
        bty = "n")

Figure 1: Proportion and confidence interval of a proportion (x axis: Category; y axis: Proportion)

Figure 2: Proportion and confidence interval of a series of proportions (x axis: Category; y axis: Proportion)

Note an essential feature of proportions, and hence of sex ratios: the confidence interval is not symmetrical. As a direct consequence, the 95 % confidence interval is not mean +/- 2 SD.

3.1 Percentage

To work with percentages, simply multiply everything by 100:

    plot_errbar(1:7, b4[, 1] * 100, y.plus = b4[, 3] * 100,
        y.minus = b4[, 2] * 100, ylab = "Percentage", xlab = "Category",
        las = 1, ylim = c(0, 100), bty = "n")

3.2 For distribution with more than 2 states

To establish the confidence interval of proportions in a multinomial distribution (more than 2 states), it is necessary to use the method of Sison and Glaz (1999), available in the MultinomialCI package.

Glaz, J. and C.P. Sison (1999). Simultaneous confidence intervals for multinomial proportions. Journal of Statistical Planning and Inference 82.

    # each row of m holds the lower and upper bound for one class
    m <- multinomialCI(x = c(23, 12, 44), alpha = 0.05)
    print(paste("First class: [", m[1, 1], m[1, 2], "]"))
    print(paste("Second class: [", m[2, 1], m[2, 2], "]"))
    print(paste("Third class: [", m[3, 1], m[3, 2], "]"))
    print(paste("Sum of upper bounds:", sum(m[, 2])))

Figure 3: Percentage and confidence interval of a series of percentages (x axis: Category; y axis: Percentage)

Note that the sum of the upper limits of all classes exceeds 1; this is logical since the data are not independent.
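As a cross-check of the asymmetry noted in this section, the interval for 10 events out of 16 can be recomputed in base R. The sketch below assumes the Wilson score formula, which is the documented default method of binconf(); the names z, centre, half and arms are only illustrative.

```r
# Wilson score interval for 10 events out of 16 (the counts of obs1)
na1 <- 10
N1 <- 16
z <- qnorm(0.975)                      # two-sided 95 % normal quantile
p_hat <- na1 / N1
centre <- (p_hat + z^2 / (2 * N1)) / (1 + z^2 / N1)
half <- z * sqrt(p_hat * (1 - p_hat) / N1 + z^2 / (4 * N1^2)) / (1 + z^2 / N1)
wilson <- c(lower = centre - half, upper = centre + half)
print(wilson)

# Distance from the point estimate to each bound: the two arms differ,
# so the interval cannot be written as "estimate +/- constant"
arms <- c(below = p_hat - wilson[["lower"]], above = wilson[["upper"]] - p_hat)
print(arms)

# The same interval from base R (score test without continuity correction)
print(prop.test(na1, N1, correct = FALSE)$conf.int)
```

Because the estimate (0.625) is above 0.5, the interval is pulled towards 0.5 and the lower arm is the longer one.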

4 Linear regression upon sex ratio

4.1 What you should never do...

Imagine that you have a time series of sex ratios, for example:

    timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
    obs <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
    Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)
    obs/Nobs
    ##  [1] 0.0000000 0.3333333 0.6666667 0.4166667 0.5000000 0.8000000
    ##  [7] 0.8181818 1.0000000 0.9047619 0.8846154 0.8333333 0.8000000
    ## [13] 1.0000000

It is extremely common to see a linear regression performed on such data. See for example: Bowden RM, Ewert MA, Nelson CE. Environmental sex determination in a reptile varies seasonally and with yolk hormones. Proceedings of the Royal Society B-Biological Sciences 267.

    x <- timeobs
    y <- obs/Nobs
    plot(x, y, ylim = c(0, 1), bty = "n", xlim = c(0, 25), las = 1,
        ylab = "Sex ratio", xlab = "Time")
    abline(lm(y ~ x))
    text(22, 0.6, paste("r=", sprintf("%.3f", cor(x, y, method = "pearson")),
        sep = ""))
    (testcor <- cor.test(x, y, method = "pearson"))
    ## Pearson's product-moment correlation
    ## data: x and y
    ## t = 5.01, df = 11
    ## alternative hypothesis: true correlation is not equal to 0
    text(22, 0.5, paste("p=", sprintf("%.3f", testcor$p.value), sep = ""))

Figure 4: Linear regression over sex ratio, r=0.834 (x axis: Time; y axis: Sex ratio)

Linear regression yields values that are not restricted to the range [0; 1].

    par(xpd = TRUE)
    plot(x, y, ylim = c(0, 1), bty = "n", xlim = c(-10, 25), las = 1,
        ylab = "Sex ratio", xlab = "Time")
    abline(lm(y ~ x))
    # dotted lines mark the limits 0 and 1 of a proportion
    lines(x = c(-10, 25), y = c(0, 0), lty = 3)
    lines(x = c(-10, 25), y = c(1, 1), lty = 3)
    text(22, 0.6, paste("r=", sprintf("%.3f", cor(x, y, method = "pearson")),
        sep = ""))
    text(22, 0.5, paste("p=", sprintf("%.3f", testcor$p.value), sep = ""))
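The same point can be made numerically. The following sketch refits the linear regression on the time series of this section and predicts the sex ratio at two time points outside the observed range; the time points -20 and 40 and the names d, fit_lm, new_x and pred are hypothetical choices made for the illustration.

```r
# Data of the section, repeated so that the example is self-contained
timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
obs <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)
d <- data.frame(x = timeobs, y = obs / Nobs)

# Ordinary least-squares regression of the proportion on time
fit_lm <- lm(y ~ x, data = d)
print(coef(fit_lm))

# Predictions at two hypothetical time points outside the data
new_x <- data.frame(x = c(-20, 40))
pred <- predict(fit_lm, newdata = new_x)
print(pred)

# TRUE where the predicted "proportion" is impossible
print(pred < 0 | pred > 1)
```

Nothing in the linear model prevents the fitted line from leaving the interval [0; 1], so such predictions cannot be interpreted as sex ratios.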

Figure 5: Linear regression over sex ratio, r=0.834 (x axis: Time; y axis: Sex ratio)

5 Angular transformed proportion

One of the many problems of linear regression on sex ratio is that proportions are not normally distributed (normally means following a Gauss-Laplace distribution). For proof, simply note that a proportion lies between 0 and 1, whereas a variable drawn from a normal distribution lies in ]-infinity, +infinity[. The regression line is fitted by the least squares method, which assumes that the dependent variable (y) has a normal marginal distribution. This is clearly false for a proportion.

To get around this problem, it was once customary to apply an angular transformation:

$$\text{Transformed proportion} = 2 \arcsin\left(\sqrt{\text{proportion}}\right)$$

The inverse transformation is:

$$\text{proportion} = \sin^2\left(\frac{\text{Transformed proportion}}{2}\right)$$

This formula may seem magical at first, but it originates directly from the mathematical expression of the Gaussian distribution.

    # proportions from 0 to 1 in steps of 0.01
    vx <- seq(from = 0, to = 1, by = 0.01)
    vy <- 2 * asin(sqrt(vx))
    plot(vx, vy, type = "l", bty = "n", xlab = "Sex ratio",
        ylab = "Angular transformed", las = 1)

Figure 6: Angular transformed sex ratio vs sex ratio (x axis: Sex ratio; y axis: Angular transformed)

    # inverse transformation: back to the proportion scale
    vx_inverse <- sin(vy/2)^2
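A quick way to convince oneself that the two formulas are inverses of each other is to apply one after the other. This is only a sketch in base R; the helper names angular() and angular_inv() are not from any package and are introduced here for the illustration.

```r
# Angular transformation and its inverse, written as small helper functions
angular <- function(p) 2 * asin(sqrt(p))
angular_inv <- function(t) sin(t / 2)^2

p <- seq(from = 0, to = 1, by = 0.01)
t_p <- angular(p)

# The transformed values span the interval from 0 to pi
print(range(t_p))

# Round trip: transforming and back-transforming recovers the proportions
print(isTRUE(all.equal(angular_inv(t_p), p)))

# Whatever value is back-transformed, the result stays within [0, 1]
back <- angular_inv(c(-1, 0.5, 2, 4))
print(back)
```

Note that the back-transformation is bounded but not monotonic outside the interval from 0 to pi, so a regression line extrapolated beyond that interval on the transformed scale does not back-transform into a sensible sex ratio.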

    par(xpd = FALSE)
    y_transform <- 2 * asin(sqrt(y))
    # regression on the angular transformed scale
    plot(x, y_transform, bty = "n", xlim = c(0, 25), xlab = "Time",
        ylab = "Angular transformed sex ratio", las = 1)
    abline(lm(y_transform ~ x))
    text(22, 1.6, paste("r=", sprintf("%.3f", cor(x, y_transform,
        method = "pearson")), sep = ""))
    (testcor_transform <- cor.test(x, y_transform, method = "pearson"))
    ## Pearson's product-moment correlation
    ## data: x and y_transform
    ## t = 4.516, df = 11
    ## alternative hypothesis: true correlation is not equal to 0
    text(22, 1.4, paste("p=", sprintf("%.3f", testcor_transform$p.value),
        sep = ""))

    par(xpd = FALSE)
    y_transform <- 2 * asin(sqrt(y))
    # same regression, drawn after conversion back to the sex ratio scale
    plot(x, y, bty = "n", xlim = c(0, 25), xlab = "Time", ylab = "Sex ratio",
        las = 1)
    ab <- lm(y_transform ~ x)
    y_predict <- predict(ab)
    lines(x, sin(y_predict/2)^2)
    text(22, 0.6, paste("r=", sprintf("%.3f", cor(x, y_transform,
        method = "pearson")), sep = ""))
    (testcor_transform <- cor.test(x, y_transform, method = "pearson"))
    text(22, 0.5, paste("p=", sprintf("%.3f", testcor_transform$p.value),
        sep = ""))

Figure 7: Linear regression over angular transformed sex ratio, r=0.806 (x axis: Time; y axis: Angular transformed sex ratio)

6 Weight of the different measures

But another problem remains unresolved. Linear regression on proportions does not take into account the sample sizes: the larger the number of measurements, the better the proportion is known, and therefore the more weight it should carry.

    # the same sex ratio (1 A for 2 B) observed with increasing sample sizes
    na <- 1
    nb <- 2
    na_cumul <- NULL
    nb_cumul <- NULL
    for (i in 1:10) {
        na_cumul <- c(na_cumul, na * i)
        nb_cumul <- c(nb_cumul, nb * i)
    }

Figure 8: Linear regression over angular transformed sex ratio, converted as sex ratio, r=0.806 (x axis: Time; y axis: Sex ratio)

    N_cumul <- na_cumul + nb_cumul
    b_cumul <- binconf(na_cumul, N_cumul)
    plot_errbar(N_cumul, b_cumul[, 1], y.plus = b_cumul[, 3],
        y.minus = b_cumul[, 2], ylab = "Sex ratio", xlab = "Total", las = 1,
        ylim = c(0, 1), bty = "n")

Figure 9: Confidence interval of sex ratio (x axis: Total; y axis: Sex ratio)

In conclusion, you should not work on proportions but on the numbers of observations. See also Warton, D.I. & F.K.C. Hui. The arcsine is asinine: the analysis of proportions in ecology. Ecology 92:3-10.

    par(xpd = FALSE)
    bobs <- binconf(obs, Nobs)
    # las was given twice in the original call, which R rejects
    plot_errbar(x, bobs[, 1], y.plus = bobs[, 3], y.minus = bobs[, 2],
        las = 1, ylab = "Proportion", xlab = "Time", ylim = c(0, 1),
        bty = "n", xlim = c(0, 25))
    abline(lm(y ~ x))
    text(22, 0.5, paste("r=", sprintf("%.3f", cor(x, y, method = "pearson")),
        sep = ""))
    text(22, 0.4, paste("p=", sprintf("%.3f", testcor$p.value), sep = ""))

Figure 10: Linear regression over sex ratio, r=0.834 (x axis: Time; y axis: Proportion)

7 Using a GLM to analyze sex ratio

7.1 Why a GLM?

Solving these problems requires two conceptual changes. On the one hand, linear regression is clearly not suited to data in bounded intervals. On the other hand, the method of least squares used to fit the parameters of the regression does not take into account the precision of the measurements (i.e. the number of observations used to estimate each sex ratio).

Functions logit and probit

Instead of a linear equation, a probit or logit function must be used to limit the output to the interval [0; 1]. The logit function is:

$$y = \frac{1}{1 + e^{ax+b}}$$
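To see why this function solves the problem of bounded data, the sketch below evaluates it over a wide range of x. The function name logistic() and the parameter values a = 0.05 and b = 0 are illustrative assumptions; plogis() is the logistic distribution function of base R.

```r
# Logit curve with the sign convention used in the text: y = 1 / (1 + exp(a * x + b))
logistic <- function(x, a, b) 1 / (1 + exp(a * x + b))

x <- seq(-100, 100, by = 1)
y <- logistic(x, a = 0.05, b = 0)

# The output never leaves the interval ]0, 1[
print(range(y))

# With this sign convention a positive a gives a decreasing curve
print(all(diff(y) < 0))

# Same curve written with the base R logistic function
print(isTRUE(all.equal(y, plogis(-(0.05 * x + 0)))))
```

The value at x = 0 is exactly 0.5 when b = 0, whatever the value of a.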

The probit function is based on the distribution function of a Gaussian distribution:

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{1}{2} Z^2} \, dZ$$

The choice between logit and probit is more a matter of habit than of fundamental difference between the two models. In ecology the logit is often preferred, whereas in economics the probit is commonly used. In practice, there is very little difference between using a logit and a probit. If we assume that the results are influenced by a variable with a Gaussian distribution, the probit model would be more appropriate (but I have never seen a clear demonstration of that point). But keep in mind that the logit and probit models are only models: "Essentially, all models are wrong, but some are useful" (George E. P. Box, in Box, G. E. P., and Draper, N. R., 1987, Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY)!

    # three logit curves with different parameters a and b
    xl <- -100:100
    b <- 0
    a <- 0.05
    yl <- 1/(1 + exp(a * xl + b))
    plot(xl, yl, bty = "n", type = "l", las = 1, col = "black",
        ylim = c(0, 1))
    b <- 1
    a <- -0.07
    yl <- 1/(1 + exp(a * xl + b))
    plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "red")
    b <- -1
    a <- 0.1
    yl <- 1/(1 + exp(a * xl + b))
    plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "blue")
    # legend labels match the values of a and b used above
    legend(x = 40, y = 0.8, legend = c("a=0.05; b=0", "a=-0.07; b=1",
        "a=0.1; b=-1"), col = c("black", "red", "blue"), lty = 1, bty = "n")

Figure 11: Logit curves (x axis: xl; y axis: yl)

    # three probit curves; here a is the mean and b the standard deviation
    xl <- -100:100
    b <- 5
    a <- 10
    yl <- pnorm(xl, mean = a, sd = b)
    plot(xl, yl, bty = "n", type = "l", las = 1, col = "black",
        ylim = c(0, 1))
    b <- 20
    a <- -10
    yl <- pnorm(xl, mean = a, sd = b)
    plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "red")
    b <- 15
    a <- 7
    yl <- pnorm(xl, mean = a, sd = b)
    plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "blue")
    legend(x = "bottomright", legend = c("a=10; b=5", "a=-10; b=20",
        "a=7; b=15"), col = c("black", "red", "blue"), lty = 1, bty = "n")

Figure 12: Probit curves (x axis: xl; y axis: yl)

Likelihood of an observation

In statistics, a likelihood function is a function of the parameters of a statistical model, defined as follows: the likelihood of a set of parameter values given some observed outcomes is equal to the probability of those observed outcomes given those parameter values (Wikipedia).

The likelihood of observing x males in a sample of size `size`, with a probability `prob` of being male, is:

    dbinom(x, size, prob, log = FALSE)

The binomial distribution with size = n and prob = p has density:

$$d = \binom{n}{x} p^x (1-p)^{n-x}$$

    d <- choose(size, x) * prob^x * (1 - prob)^(size - x)
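As a small worked example of the two expressions above, the sketch below uses hypothetical values (8 males in a sample of 10, and a grid of candidate probabilities) to check that the explicit formula and dbinom() agree, and to show where the likelihood is highest. The last lines use the counts of the time series of section 4 with an arbitrary common probability of 0.7.

```r
# Hypothetical observation: 8 males out of 10 individuals
x <- 8
size <- 10
prob <- 0.7

# Explicit binomial formula and built-in density give the same likelihood
d_manual <- choose(size, x) * prob^x * (1 - prob)^(size - x)
d_builtin <- dbinom(x, size, prob)
print(c(manual = d_manual, builtin = d_builtin))

# Likelihood of the same observation for a grid of candidate probabilities
p_grid <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(x, size, p_grid)
p_best <- p_grid[which.max(lik)]   # grid value with the highest likelihood
print(p_best)

# For several independent observations the log-likelihoods simply add up
obs <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)
loglik <- sum(dbinom(obs, Nobs, prob = 0.7, log = TRUE))
print(loglik)
```

The likelihood is maximal at the observed proportion x/size, which is why the observed frequency is the maximum likelihood estimate of a single proportion.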

7.2 Analysis by a generalized linear model (GLM)

A two-column matrix with the numbers of males and females is created. We try to model the observations as a function of the variable timeobs.

    ydf <- cbind(na = obs, nb = Nobs - obs)
    model <- glm(ydf ~ timeobs, family = binomial(link = "logit"))
    anova(model, test = "Chisq")
    ## Analysis of Deviance Table
    ## Model: binomial, link: logit
    ## Response: ydf
    ## Terms added sequentially (first to last)
    ## timeobs ***
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(model)
    ## Call:
    ## glm(formula = ydf ~ timeobs, family = binomial(link = "logit"))
    ## Coefficients:
    ## timeobs ***
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## (Dispersion parameter for binomial family taken to be 1)
    ## Null deviance: on 12 degrees of freedom
    ## Residual deviance: on 11 degrees of freedom
    ## Number of Fisher Scoring iterations: 4

    bobs <- binconf(obs, Nobs)
    plot_errbar(x, bobs[, 1], y.plus = bobs[, 3], y.minus = bobs[, 2],
        las = 1, ylab = "Proportion", xlab = "Time", ylim = c(0, 1),
        bty = "n", xlim = c(0, 25))
    newd <- data.frame(timeobs = seq(from = 1, to = 25, by = 0.1))
    p <- predict(model, newdata = newd, type = "link", se.fit = TRUE)
    # fitted curve, converted from the logit scale to a proportion
    plot_add(newd[, 1], with(p, exp(fit)/(1 + exp(fit))), type = "l",
        bty = "n")
    # lower and upper 95 % limits, computed on the logit scale and converted
    plot_add(newd[, 1], with(p, plogis(fit - 1.96 * se.fit)), type = "l",
        bty = "n", lty = 2)
    plot_add(newd[, 1], with(p, plogis(fit + 1.96 * se.fit)), type = "l",
        bty = "n", lty = 2)
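The binomial GLM is what gives each sex ratio a weight that depends on its number of observations. One way to see this, sketched here in base R, is that the same model can be written with the proportion as response and the number of observations as weights; both ways of writing it should give the same coefficients. The object names m_counts, m_prop and y_prop are illustrative.

```r
# Data of the time series, repeated so that the example is self-contained
timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
obs <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)

# Response given as a two-column matrix of counts (as in the text)
m_counts <- glm(cbind(obs, Nobs - obs) ~ timeobs,
                family = binomial(link = "logit"))

# Response given as a proportion, weighted by the number of observations
y_prop <- obs / Nobs
m_prop <- glm(y_prop ~ timeobs, weights = Nobs,
              family = binomial(link = "logit"))

print(coef(m_counts))
print(coef(m_prop))

# Residual degrees of freedom: 13 observations minus 2 parameters
print(df.residual(m_counts))
```

A proportion based on 26 individuals therefore counts 26 times more than a proportion based on a single individual, which an ordinary regression on obs/Nobs ignores.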

Figure 13: Logistic regression using maximum likelihood (x axis: Time; y axis: Proportion)

7.3 To go further...

Imagine that each observation is associated, in addition to the temporal dimension, with another covariate. We can then ask whether the observed sex ratio also depends on this cofactor or on the interaction between time and the cofactor.

    timeobs
    ##  [1]  1  3  6  8 10 12 13 14 17 19 20 22 25
    ydf
    ##       na nb
    ##  [1,]  0  1
    ##  [2,]  1  2
    ##  [3,]  2  1
    ##  [4,]  5  7
    ##  [5,]  2  2
    ##  [6,]  8  2
    ##  [7,]  9  2
    ##  [8,]  2  0
    ##  [9,] 19  2
    ## [10,] 23  3
    ## [11,]  5  1
    ## [12,] 12  3
    ## [13,] 15  0

    cofacteur <- c(9.2, 8.1, 2, 7.3, 9.5, 1.2, 10.8, 20.9, 11.9, 2.5, 2.3,
        9.7, 10)
    # cofactor alone, additive model, and model with interaction
    model_c <- glm(ydf ~ cofacteur, family = binomial(link = "logit"))
    model_t_c <- glm(ydf ~ timeobs + cofacteur,
        family = binomial(link = "logit"))
    model_t_c_i <- glm(ydf ~ timeobs * cofacteur,
        family = binomial(link = "logit"))
    compare_AIC(list(time = model, cofacteur = model_c,
        time_cofacteur = model_t_c,
        time_cofacteur_interaction = model_t_c_i))
    ## [1] "The lowest AIC (35.922) is for series time with Akaike weight=0.640"
    ##                            Akaike_weight
    ## time                           6.400e-01
    ## cofacteur                      6.836e-05
    ## time_cofacteur                 2.400e-01
    ## time_cofacteur_interaction     1.199e-01

8 Predict function after GLM

The predict() function is used to make predictions from a fitted model. However, a common mistake is to use the normal approximation for the confidence interval of the predicted values on the response scale. This can lead to gross errors. Start again from the previously fitted model:

    summary(model)
    ## Call:
    ## glm(formula = ydf ~ timeobs, family = binomial(link = "logit"))

    ## Coefficients:
    ## timeobs ***
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## (Dispersion parameter for binomial family taken to be 1)
    ## Null deviance: on 12 degrees of freedom
    ## Residual deviance: on 11 degrees of freedom
    ## Number of Fisher Scoring iterations: 4

We create a data frame with the timeobs values to predict.

    df <- data.frame(timeobs = 1:30)

The results are first visualized with the option type = "response".

    # make prediction and estimate CI using 'response' option
    preddf2 <- predict(model, type = "response", newdata = df,
        se.fit = TRUE)
    plot_errbar(1:30, with(preddf2, fit), bty = "n",
        errbar.y.minus = with(preddf2, 1.96 * se.fit),
        errbar.y.plus = with(preddf2, 1.96 * se.fit),
        xlab = "timeobs covariable", ylab = "Sex ratio", type = "l",
        ylim = c(0, 1.1), las = 1)
    segments(0, 1, 30, 1, col = "red")

Figure 14: Note that error bars are higher than 1! (x axis: timeobs covariable; y axis: Sex ratio)

Now the results are visualized with the option type = "link".

    # good practice: use the option 'link', compute the limits on the
    # logit scale (fit -/+ 1.96 * se.fit), then convert to proportions
    preddf <- predict(model, type = "link", newdata = df, se.fit = TRUE)
    plot_errbar(1:30, with(preddf, plogis(fit)), bty = "n",
        errbar.y.minus = with(preddf, plogis(fit) -
            plogis(fit - 1.96 * se.fit)),
        errbar.y.plus = with(preddf, plogis(fit + 1.96 * se.fit) -
            plogis(fit)),
        xlab = "timeobs covariable", ylab = "Sex ratio", type = "l",
        ylim = c(0, 1), las = 1)
    segments(0, 1, 30, 1, col = "red")
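The difference between the two options can also be checked numerically without any plotting package. The sketch below refits the model in base R and compares the limits obtained on the response scale with those obtained on the link scale and converted with plogis(); the 1.96 multiplier is replaced by qnorm(0.975), and the object names are illustrative.

```r
# Data and model of the text, repeated so that the example is self-contained
timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
obs <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)
model <- glm(cbind(obs, Nobs - obs) ~ timeobs,
             family = binomial(link = "logit"))
newd <- data.frame(timeobs = 1:30)
z <- qnorm(0.975)

# Normal approximation applied directly to the predicted proportions
pr <- predict(model, newdata = newd, type = "response", se.fit = TRUE)
lower_resp <- pr$fit - z * pr$se.fit
upper_resp <- pr$fit + z * pr$se.fit

# Normal approximation applied on the logit scale, then converted
pl <- predict(model, newdata = newd, type = "link", se.fit = TRUE)
lower_link <- plogis(pl$fit - z * pl$se.fit)
upper_link <- plogis(pl$fit + z * pl$se.fit)

# Range of the limits obtained with each approach
print(range(lower_resp, upper_resp))
print(range(lower_link, upper_link))
```

The limits computed on the link scale are guaranteed to stay within [0; 1] because plogis() only returns values in that interval; no such guarantee exists for the limits computed on the response scale.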

Figure 15: Note that error bars are not higher than 1 (x axis: timeobs covariable; y axis: Sex ratio)

9 Application for TSD pattern

    # fit a TSD pattern for Caretta caretta (Cc) in the South Pacific RMU
    outtsd <- with(subset(STSRE_TSD, Sp == "Cc" & RMU == "Pacific, S"),
        tsd(males = Males, females = Females, temperatures = Temp))
    ## [1] "The goodness of fit test is 14"
    ## [1] "The Pivotal temperature is SE 17"
    ## [1] "The Transitional range of temperatures l=5% is SE 51"
    ## [1] "The lower limit of Transitional range of temperatures l=5% is SE 30"
    ## [1] "The higher limit of Transitional range of temperatures l=5% is SE 35"

Figure 16: Pattern of TSD for Caretta caretta in RMU Pacific South (x axis: Temperatures in °C; y axis: male frequency; the pivotal temperature and the transitional range of temperatures l=5% are marked)
