How to work correctly statistically about sex ratio

Marc Girondot

Version of 12th April 2014

Contents

1 Load packages
2 Introduction
3 Confidence interval of a proportion
  3.1 Percentage
  3.2 For distributions with more than 2 states
4 Linear regression upon sex ratio
  4.1 What you should never do...
5 Angular transformed proportion
6 Weight of the different measures
7 Using a GLM to analyze sex ratio
  7.1 Why a GLM?
    7.1.1 Functions logit and probit
    7.1.2 Likelihood of an observation
  7.2 Analysis by a generalized linear model (GLM)
  7.3 To go further...
8 Predict function after GLM
9 Application for TSD pattern

1 Load packages

install.packages(c("coda", "deSolve", "devtools", "entropy", "numDeriv",
    "optimx", "parallel", "phenology", "shiny", "Hmisc"))
# install_url() comes from the devtools package
library("devtools")
install_url("http://www.ese.u-psud.fr/epc/conservation/embryogrowth/embryogrowth_4.03.tar.
suppressMessages(library("phenology"))
suppressMessages(library("Hmisc"))
suppressMessages(library("MultinomialCI"))
suppressMessages(library("embryogrowth"))

2 Introduction

A proportion is defined as the ratio between the number of occurrences of a specific event and the total number of possibilities. If an event can take two states, A or B, and nA and nB are the numbers of occurrences of A and B, with N = nA + nB, then the proportions of A and B are:

pA = nA/N
pB = nB/N

We can also define proportions when there are more than 2 states, e.g. 3 states A, B and C. Generally this is not necessary for sex ratio, except if an "unknown" category is included.

pA = nA/N
pB = nB/N
pC = nC/N

with N = nA + nB + nC.

The distribution of events with two outcomes is based on the binomial distribution, and with n outcomes on the multinomial distribution. These rules for calculating proportions can be used to calculate frequencies from observations or to establish probabilities that are measures of uncertainty about an event.

Let us see how to write proportions in different ways with R, to learn the language. We can define the various events as a vector with c() and 0 and 1:

obs1 <- c(1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0)
N1 <- length(obs1)
obs1 == 1

[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
[8]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[15] TRUE FALSE

na1 <- sum(obs1 == 1)
nb1 <- N1 - na1
pa1 <- na1/N1
pb1 <- nb1/N1
print(paste("There are", na1, "'1' for", N1, "in total then the frequency is", pa1))

[1] "There are 10 '1' for 16 in total then the frequency is 0.625"

print(paste("There are", nb1, "'0' for", N1, "in total then the frequency is", pb1))

[1] "There are 6 '0' for 16 in total then the frequency is 0.375"

Or simply using the table() function:

(tobs1 <- table(obs1))

obs1
 0  1
 6 10

print(paste("There are", tobs1[2], "'1' for", tobs1[2] + tobs1[1],
    "in total then the frequency is", tobs1[2]/(tobs1[2] + tobs1[1])))

[1] "There are 10 '1' for 16 in total then the frequency is 0.625"

print(paste("There are", tobs1[1], "'0' for", tobs1[2] + tobs1[1],
    "in total then the frequency is", tobs1[1]/(tobs1[2] + tobs1[1])))

[1] "There are 6 '0' for 16 in total then the frequency is 0.375"

One can also define the various events in the form of a vector with c() of A, B and C:

obs2 <- c("A", "C", "B", "C", "A", "A", "C", "C", "A", "A", "C", "B", "C", "A")
n2 <- c(A = sum(obs2 == "A"))
n2 <- c(n2, B = sum(obs2 == "B"))
n2 <- c(n2, C = sum(obs2 == "C"))
n2

A B C
6 2 6

N2 <- sum(n2)
(p2 <- n2/N2)

     A      B      C
0.4286 0.1429 0.4286

(table(obs2))

obs2
A B C
6 2 6

It is also possible, of course, to define the counts directly in aggregated form:

n3 <- c(A = 20, B = 21, C = 3)
N3 <- sum(n3)
(p3 <- n3/N3)

      A       B       C
0.45455 0.47727 0.06818

3 Confidence interval of a proportion

The Hmisc package has a very interesting function to calculate the confidence interval of a proportion, but it only works when there are two possible states because the method is based on the binomial distribution.

b1 <- binconf(na1, N1)

To represent a dot diagram with error bars, I find it easier to use the package phenology because the syntax is exactly that of the plot() function of base R.

plot_errbar(1, b1[, 1], y.plus = b1[, 3], y.minus = b1[, 2],
    ylab = "Proportion", xlab = "Category", las = 1,
    ylim = c(0, 1), bty = "n")

It is easy to represent a series of frequencies with confidence intervals. By default, alpha = 0.05, i.e. there is a 95 % chance that the true proportion is within the calculated confidence interval:

n4 <- c(10, 2, 5, 7, 9, 12, 5)
N4 <- c(45, 10, 5, 19, 12, 24, 6)
b4 <- binconf(n4, N4)
plot_errbar(1:7, b4[, 1], y.plus = b4[, 3], y.minus = b4[, 2],
    ylab = "Proportion", xlab = "Category", las = 1,
    ylim = c(0, 1), bty = "n")
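A similar interval can be obtained with base R alone. The sketch below uses binom.test(), which returns an exact (Clopper-Pearson) interval rather than the Wilson interval that binconf() computes by default; the point of the example is that the two arms of the interval around the point estimate are unequal.

```r
# Exact (Clopper-Pearson) confidence interval for 10 "1" out of 16,
# using only base R.
ci <- binom.test(10, 16)$conf.int
p_hat <- 10 / 16

# The interval is not symmetrical around the point estimate:
lower_arm <- p_hat - ci[1]
upper_arm <- ci[2] - p_hat
print(round(c(lower = ci[1], estimate = p_hat, upper = ci[2]), 3))
print(round(c(lower_arm = lower_arm, upper_arm = upper_arm), 3))
```

The exact method gives a slightly wider interval than Wilson's, but both are asymmetrical as soon as the estimate is away from 0.5.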

Figure 1: Proportion and confidence interval of a proportion

Note an essential feature of proportions, and thus of sex ratio: the confidence interval is not symmetrical. As a direct consequence, the 95 % confidence interval is not mean +/- 2 SD.

3.1 Percentage

To work as a percentage, simply multiply everything by 100:

plot_errbar(1:7, b4[, 1] * 100, y.plus = b4[, 3] * 100,
    y.minus = b4[, 2] * 100, ylab = "Percentage", xlab = "Category",
    las = 1, ylim = c(0, 100), bty = "n")

3.2 For distributions with more than 2 states

To establish the confidence interval of proportions in a multinomial distribution (more than 2 states), it is necessary to use the method of Sison and Glaz (1999),

Figure 2: Proportion and confidence interval of a series of proportions

available in the MultinomialCI package. Glaz, J. and C.P. Sison. Simultaneous confidence intervals for multinomial proportions. Journal of Statistical Planning and Inference 82:251-262 (1999).

m <- multinomialCI(x = c(23, 12, 44), alpha = 0.05)
print(paste("First class: [", m[1, 1], m[1, 2], "]"))

[1] "First class: [ 0.189873417721519 0.410418258547599 ]"

print(paste("Second class: [", m[2, 1], m[2, 2], "]"))

[1] "Second class: [ 0.0506329113924051 0.271177752218485 ]"

print(paste("Third class: [", m[3, 1], m[3, 2], "]"))

[1] "Third class: [ 0.455696202531646 0.676241043357725 ]"

print(paste("Sum of the upper bounds:", sum(m[, 2])))

[1] "Sum of the upper bounds: 1.35783705412381"

Figure 3: Percentage and confidence interval of a series of percentages

It may be noted that the sum of the upper limits exceeds 1; this is logical since the data are not independent.

4 Linear regression upon sex ratio

4.1 What you should never do...

Imagine that you have a time series of sex ratios, for example:

timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
obs <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)
obs/Nobs

[1]  0.0000 0.3333 0.6667 0.4167 0.5000 0.8000
[7]  0.8182 1.0000 0.9048 0.8846 0.8333 0.8000
[13] 1.0000

It is extremely common to see a linear regression performed on such data. See for example: Bowden RM, Ewert MA, Nelson CE. 2000. Environmental sex determination in a reptile varies seasonally and with yolk hormones. Proceedings of the Royal Society B-Biological Sciences 267:1745-1749.

x <- timeobs
y <- obs/Nobs
plot(x, y, ylim = c(0, 1), bty = "n", xlim = c(0, 25), las = 1,
    ylab = "Sex ratio", xlab = "Time")
abline(lm(y ~ x))
text(22, 0.6, paste("r=", sprintf("%.3f", cor(x, y, method = "pearson")), sep = ""))
(testcor <- cor.test(x, y, method = "pearson"))

Pearson's product-moment correlation

data:  x and y
t = 5.01, df = 11, p-value = 0.000396
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5233 0.9489
sample estimates:
   cor
0.8339

text(22, 0.5, paste("p=", sprintf("%.3f", testcor$p.value), sep = ""))

Figure 4: Linear regression over sex ratio

Linear regression yields values which are not limited to the range [0; 1].

par(xpd = TRUE)
plot(x, y, ylim = c(0, 1), bty = "n", xlim = c(-10, 25), las = 1,
    ylab = "Sex ratio", xlab = "Time")
abline(lm(y ~ x))
lines(x = c(-10, 25), y = c(0, 0), lty = 3)
lines(x = c(-10, 25), y = c(1, 1), lty = 3)
text(22, 0.6, paste("r=", sprintf("%.3f", cor(x, y, method = "pearson")), sep = ""))
text(22, 0.5, paste("p=", sprintf("%.3f", testcor$p.value), sep = ""))
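The same problem can be shown numerically with base R alone (same data as above; the prediction times are chosen arbitrarily for illustration): the fitted line produces impossible "sex ratios" as soon as we extrapolate a little.

```r
timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
obs  <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)

# Ordinary least squares on the raw proportions, as criticized in the text
fit <- lm(I(obs / Nobs) ~ timeobs)

# Predicted "sex ratios" outside the observed time range
pred <- predict(fit, newdata = data.frame(timeobs = c(-10, 0, 30)))
print(round(pred, 3))   # below 0 and above 1: impossible proportions
```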

Figure 5: Linear regression over sex ratio

5 Angular transformed proportion

One of the many problems of linear regression on sex ratio is that proportions are not normally distributed (normally meaning a Gauss-Laplace distribution). For proof, just note that a proportion lies between 0 and 1 whereas a variable drawn from a normal distribution lies in ]-infinity, +infinity[. The fit of the regression line by the least squares method assumes that the dependent variable (y) has a normal marginal distribution. This is clearly false for a proportion. To get around this problem, an angular transformation used to be applied:

transformed proportion = 2 * arcsin(sqrt(proportion))

The inverse transformation is:

proportion = sin(transformed proportion / 2)^2

This formula may seem magical at first, but it originates directly in the mathematical expression of the Gaussian distribution.

vx <- seq(from = 0, to = 1, by = 0.01)
vy <- 2 * asin(sqrt(vx))
plot(vx, vy, type = "l", bty = "n", xlab = "Sex ratio",
    ylab = "Angular transformed", las = 1)

Figure 6: Angular transformed sex ratio vs sex ratio

vx_inverse <- sin(vy/2)^2
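A quick numerical check, using base R only and a few arbitrary proportions, confirms that the two formulas above are indeed inverses of each other:

```r
p <- c(0, 0.1, 0.25, 0.5, 0.625, 0.9, 1)
transformed <- 2 * asin(sqrt(p))   # angular (arcsine square root) transform
back <- sin(transformed / 2)^2     # inverse transform

print(max(abs(back - p)))          # numerically zero
print(range(transformed))          # the transform maps [0, 1] onto [0, pi]
```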

par(xpd = FALSE)
y_transform <- 2 * asin(sqrt(y))
plot(x, y_transform, bty = "n", xlim = c(0, 25), xlab = "Time",
    ylab = "Angular transformed sex ratio", las = 1)
abline(lm(y_transform ~ x))
text(22, 1.6, paste("r=", sprintf("%.3f", cor(x, y_transform, method = "pearson")), sep = ""))
(testcor_transform <- cor.test(x, y_transform, method = "pearson"))

Pearson's product-moment correlation

data:  x and y_transform
t = 4.516, df = 11, p-value = 0.0008779
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4587 0.9397
sample estimates:
  cor
0.806

text(22, 1.4, paste("p=", sprintf("%.3f", testcor_transform$p.value), sep = ""))

par(xpd = FALSE)
y_transform <- 2 * asin(sqrt(y))
plot(x, y, bty = "n", xlim = c(0, 25), xlab = "Time",
    ylab = "Sex ratio", las = 1)
ab <- lm(y_transform ~ x)
y_predict <- predict(ab)
lines(x, sin(y_predict/2)^2)
text(22, 0.6, paste("r=", sprintf("%.3f", cor(x, y_transform, method = "pearson")), sep = ""))
(testcor_transform <- cor.test(x, y_transform, method = "pearson"))

Pearson's product-moment correlation

data:  x and y_transform
t = 4.516, df = 11, p-value = 0.0008779
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4587 0.9397
sample estimates:
  cor
0.806

text(22, 0.5, paste("p=", sprintf("%.3f", testcor_transform$p.value), sep = ""))

Figure 7: Linear regression over angular transformed sex ratio

6 Weight of the different measures

But another problem remains: linear regression based on proportions does not take the sample size into account, whereas the more measurements there are, the better the proportion is known and the greater its weight should be.

na <- 1
nb <- 2
na_cumul <- NULL
nb_cumul <- NULL
for (i in 1:10) {
    na_cumul <- c(na_cumul, na * i)
    nb_cumul <- c(nb_cumul, nb * i)
}

Figure 8: Linear regression over angular transformed sex ratio, converted back to sex ratio

N_cumul <- na_cumul + nb_cumul
b_cumul <- binconf(na_cumul, N_cumul)
plot_errbar(N_cumul, b_cumul[, 1], y.plus = b_cumul[, 3],
    y.minus = b_cumul[, 2], ylab = "Sex ratio", xlab = "Total",
    las = 1, ylim = c(0, 1), bty = "n")

Figure 9: Confidence interval of sex ratio

In conclusion, you should not work on proportions but on the numbers of observations. See also Warton, D.I. & F.K.C. Hui 2011. The arcsine is asinine: the analysis of proportions in ecology. Ecology 92:3-10.

par(xpd = FALSE)
bobs <- binconf(obs, Nobs)
plot_errbar(x, bobs[, 1], y.plus = bobs[, 3], y.minus = bobs[, 2],
    las = 1, ylab = "Proportion", xlab = "Time",
    ylim = c(0, 1), bty = "n", xlim = c(0, 25))
abline(lm(y ~ x))
text(22, 0.5, paste("r=", sprintf("%.3f", cor(x, y, method = "pearson")), sep = ""))
text(22, 0.4, paste("p=", sprintf("%.3f", testcor$p.value), sep = ""))

Figure 10: Linear regression over sex ratio

7 Using a GLM to analyze sex ratio

7.1 Why a GLM?

The solution to these problems requires two conceptual changes. On the one hand, linear regression is clearly not suited to data in bounded intervals. On the other hand, the least squares method used to adjust the parameters of the regression does not take into account the accuracy of the measurements (i.e., the number of observations used to estimate the sex ratio).

7.1.1 Functions logit and probit

Instead of a linear equation, a probit or logit function must be used to limit the output to the interval [0; 1]. The logit function is:

y = 1 / (1 + e^(ax+b))
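As a minimal base R sketch (slope and intercept values chosen arbitrarily), the output of the logit function stays inside ]0, 1[ whatever the input; note also that base R's plogis() is the standard logistic function, so 1/(1 + exp(a*x + b)) can equivalently be written plogis(-(a*x + b)):

```r
a <- 0.05   # illustrative slope
b <- 1      # illustrative intercept
x <- c(-100, -10, 0, 10, 100)
y <- 1 / (1 + exp(a * x + b))

print(round(y, 4))
print(all(y > 0 & y < 1))   # bounded output

# plogis(q) = 1/(1 + exp(-q)), hence the equivalence:
print(max(abs(y - plogis(-(a * x + b)))))
```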

The probit function is based on the distribution function of a Gaussian distribution:

Phi(z) = 1/sqrt(2*pi) * Integral[-infinity, z] exp(-t^2/2) dt

The choice between logit and probit is more a matter of habit than of fundamental difference between the two models. In ecology, the logit is often preferred whereas in economics the probit is commonly used. In practice, there is very little difference between using a logit and a probit. If we assume that the results are influenced by a variable with a Gaussian distribution, the probit model would be more appropriate (but I have never seen a clear demonstration of that point). But keep in mind that the logit and probit models are only models: "Essentially, all models are wrong, but some are useful", George E. P. Box (in Box, G. E. P., and Draper, N. R., 1987, Empirical Model Building and Response Surfaces, John Wiley & Sons, New York, NY)!

xl <- -100:100
b <- 1
a <- 0.05
yl <- 1/(1 + exp(a * xl + b))
plot(xl, yl, bty = "n", type = "l", las = 1, col = "black", ylim = c(0, 1))
b <- -1
a <- -0.07
yl <- 1/(1 + exp(a * xl + b))
plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "red")
b <- -1
a <- 0.1
yl <- 1/(1 + exp(a * xl + b))
plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "blue")
legend(x = 40, y = 0.8, legend = c("a=0.05; b=1", "a=-0.07; b=-1", "a=0.1; b=-1"),
    col = c("black", "red", "blue"), lty = 1, bty = "n")

Figure 11: Logit curves

xl <- -100:100
b <- 5
a <- 10
yl <- pnorm(xl, mean = a, sd = b)
plot(xl, yl, bty = "n", type = "l", las = 1, col = "black", ylim = c(0, 1))
b <- 20
a <- -10
yl <- pnorm(xl, mean = a, sd = b)
plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "red")
b <- 15
a <- 7
yl <- pnorm(xl, mean = a, sd = b)
plot_add(xl, yl, bty = "n", type = "l", las = 1, col = "blue")
legend(x = "bottomright", legend = c("a=10; b=5", "a=-10; b=20", "a=7; b=15"),
    col = c("black", "red", "blue"), lty = 1, bty = "n")

7.1.2 Likelihood of an observation

In statistics, a likelihood function is a function of the parameters of a statistical model, defined as follows: the likelihood of a set of parameter values given some observed outcomes is equal to the probability of those observed outcomes given those parameter values (Wikipedia, http://en.wikipedia.org/wiki/likelihood). The likelihood of observing x males in a set of size size with a probability of being male prob is:

Figure 12: Probit curves

dbinom(x, size, prob, log = FALSE)

The binomial distribution with size = n and prob = p has density:

d = choose(n, x) * p^x * (1-p)^(n-x)

d <- choose(size, x)*prob^x*(1-prob)^(size-x)
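The density formula can be checked against dbinom() directly; as an illustration, take the 10 males out of 16 from the introduction, with prob set to the observed frequency:

```r
x <- 10; size <- 16; prob <- 0.625

# Manual binomial density vs. the built-in dbinom()
d_manual <- choose(size, x) * prob^x * (1 - prob)^(size - x)
d_dbinom <- dbinom(x, size, prob)
print(c(manual = d_manual, dbinom = d_dbinom))  # identical values

# For likelihood work the log scale is usually preferred:
print(dbinom(x, size, prob, log = TRUE))
```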

7.2 Analysis by a generalized linear model (GLM)

A dataframe with two columns, the numbers of males and of females, is created. We then model the observations as a function of the variable timeobs.

ydf <- cbind(na = obs, nb = Nobs - obs)
model <- glm(ydf ~ timeobs, family = binomial(link = "logit"))
anova(model, test = "Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: ydf

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                       12      27.51
timeobs  1     18.5        11       8.97  1.7e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model)

Call:
glm(formula = ydf ~ timeobs, family = binomial(link = "logit"))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6171 -0.5268 -0.0238  0.7149  1.1616

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.174      0.637   -1.84    0.065 .
timeobs        0.170      0.043    3.96  7.5e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.5092 on 12 degrees of freedom
Residual deviance:  8.9745 on 11 degrees of freedom
AIC: 35.92

Number of Fisher Scoring iterations: 4

bobs <- binconf(obs, Nobs)
plot_errbar(x, bobs[, 1], y.plus = bobs[, 3], y.minus = bobs[, 2],
    las = 1, ylab = "Proportion", xlab = "Time",
    ylim = c(0, 1), bty = "n", xlim = c(0, 25))
newd <- data.frame(timeobs = seq(from = 1, to = 25, by = 0.1))
p <- predict(model, newdata = newd, type = "link", se = TRUE)
plot_add(newd[, 1], with(p, exp(fit)/(1 + exp(fit))), type = "l", bty = "n")
plot_add(newd[, 1], with(p, exp(fit + 1.96 * se.fit)/(1 + exp(fit + 1.96 * se.fit))),
    type = "l", bty = "n", lty = 2)
plot_add(newd[, 1], with(p, exp(fit - 1.96 * se.fit)/(1 + exp(fit - 1.96 * se.fit))),
    type = "l", bty = "n", lty = 2)
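The whole fit can be rerun self-contained in base R (data copied from section 4), as a check of the coefficients and AIC reported above; note that plogis() is base R's inverse logit, identical to the exp(fit)/(1 + exp(fit)) used in the plotting code:

```r
timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
obs  <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)

# Two-column response: successes (males) and failures (females)
ydf <- cbind(na = obs, nb = Nobs - obs)
model <- glm(ydf ~ timeobs, family = binomial(link = "logit"))

print(round(coef(model), 3))   # approx. (Intercept) -1.174, timeobs 0.170
print(round(AIC(model), 2))    # approx. 35.92

# Fitted sex ratio at time 13, via the inverse logit
print(plogis(sum(coef(model) * c(1, 13))))
```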

Figure 13: Logistic regression using maximum likelihood

7.3 To go further...

Imagine that each observation is associated, in addition to the temporal dimension, with another co-variable. We can then ask whether the observed sex ratio also depends on this co-factor or on an interaction between time and co-factor.

timeobs

[1]  1  3  6  8 10 12 13 14 17 19 20 22 25

ydf

      na nb
 [1,]  0  1
 [2,]  1  2
 [3,]  2  1
 [4,]  5  7
 [5,]  2  2

 [6,]  8  2
 [7,]  9  2
 [8,]  2  0
 [9,] 19  2
[10,] 23  3
[11,]  5  1
[12,] 12  3
[13,] 15  0

cofacteur <- c(9.2, 8.1, 2, 7.3, 9.5, 1.2, 10.8, 20.9, 11.9, 2.5, 2.3, 9.7, 10)
model_c <- glm(ydf ~ cofacteur, family = binomial(link = "logit"))
model_t_c <- glm(ydf ~ timeobs + cofacteur, family = binomial(link = "logit"))
model_t_c_i <- glm(ydf ~ timeobs * cofacteur, family = binomial(link = "logit"))
compare_aic(list(time = model, cofacteur = model_c,
    time_cofacteur = model_t_c, time_cofacteur_interaction = model_t_c_i))

[1] "The lowest AIC (35.922) is for series time with Akaike weight=0.640"

                             AIC DeltaAIC Akaike_weight
time                       35.92    0.000     6.400e-01
cofacteur                  54.21   18.289     6.836e-05
time_cofacteur             37.88    1.962     2.400e-01
time_cofacteur_interaction 39.27    3.350     1.199e-01

8 Predict function after GLM

The predict function is used to make predictions from a fitted model. However, a common mistake is to use the normal approximation for the confidence interval of the predicted results. This can lead to gross errors. Start again from the previously adjusted model:

summary(model)

Call:
glm(formula = ydf ~ timeobs, family = binomial(link = "logit"))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6171 -0.5268 -0.0238  0.7149  1.1616

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.174      0.637   -1.84    0.065 .
timeobs        0.170      0.043    3.96  7.5e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.5092 on 12 degrees of freedom
Residual deviance:  8.9745 on 11 degrees of freedom
AIC: 35.92

Number of Fisher Scoring iterations: 4

We create a dataframe with the timeobs values to predict:

df <- data.frame(timeobs = 1:30)

The results are first visualized with the option response:

# make prediction and estimate CI using the 'response' option
preddf2 <- predict(model, type = "response", newdata = df, se.fit = TRUE)
plot_errbar(1:30, with(preddf2, fit), bty = "n",
    errbar.y.minus = with(preddf2, 1.96 * se.fit),
    errbar.y.plus = with(preddf2, 1.96 * se.fit),
    xlab = "timeobs covariable", ylab = "Sex ratio", type = "l",
    ylim = c(0, 1.1), las = 1)
segments(0, 1, 30, 1, col = "red")

Now the results are visualized with the option link:

# good practice: use the option 'link'
preddf <- predict(model, type = "link", newdata = df, se.fit = TRUE)
plot_errbar(1:30, with(preddf, exp(fit)/(1 + exp(fit))), bty = "n",
    errbar.y.minus = with(preddf, exp(fit)/(1 + exp(fit))) -
        with(preddf, exp(fit - 1.96 * se.fit)/(1 + exp(fit - 1.96 * se.fit))),
    errbar.y.plus = with(preddf, exp(fit + 1.96 * se.fit)/(1 + exp(fit + 1.96 * se.fit))) -
        with(preddf, exp(fit)/(1 + exp(fit))),
    xlab = "timeobs covariable", ylab = "Sex ratio", type = "l",
    ylim = c(0, 1), las = 1)

Figure 14: Note that error bars are higher than 1!

segments(0, 1, 30, 1, col = "red")
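The difference between the two figures can be reduced to a one-line check (base R only, model refitted from the data of section 4): the upper bound computed on the response scale exceeds 1, while the bound computed on the link scale and then back-transformed with the inverse logit (plogis()) cannot leave [0, 1].

```r
timeobs <- c(1, 3, 6, 8, 10, 12, 13, 14, 17, 19, 20, 22, 25)
obs  <- c(0, 1, 2, 5, 2, 8, 9, 2, 19, 23, 5, 12, 15)
Nobs <- c(1, 3, 3, 12, 4, 10, 11, 2, 21, 26, 6, 15, 15)
model <- glm(cbind(obs, Nobs - obs) ~ timeobs,
    family = binomial(link = "logit"))

df <- data.frame(timeobs = 1:30)
pr_resp <- predict(model, newdata = df, type = "response", se.fit = TRUE)
pr_link <- predict(model, newdata = df, type = "link", se.fit = TRUE)

upper_resp <- pr_resp$fit + 1.96 * pr_resp$se.fit          # naive bound
upper_link <- plogis(pr_link$fit + 1.96 * pr_link$se.fit)  # back-transformed

print(any(upper_resp > 1))   # TRUE: impossible sex ratios
print(any(upper_link > 1))   # FALSE: bounded by construction
```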

Figure 15: Note that error bars are not higher than 1

9 Application for TSD pattern

outtsd <- with(subset(stsre_tsd, Sp == "Cc" & RMU == "Pacific, S"),
    tsd(males = Males, females = Females, temperatures = Temp))

[1] "The goodness of fit test is 0.014"
[1] "The Pivotal temperature is 28.553 SE 0.017"
[1] "The Transitional range of temperatures l=5% is 5.731 SE 0.051"
[1] "The lower limit of Transitional range of temperatures l=5% is 25.686 SE 0.030"
[1] "The higher limit of Transitional range of temperatures l=5% is 31.417 SE 0.035"

Figure 16: Pattern of TSD for Caretta caretta in RMU Pacific South (male frequency vs. incubation temperature, showing the pivotal temperature and the transitional range of temperatures l=5%)