Jian WANG, PhD j_wang@shou.edu.cn Room A115 College of Fishery and Life Science Shanghai Ocean University

Useful Links
Slides: http://sihua.us/biostatistics.htm
Datasets: http://users.monash.edu.au/~murray/bdar/index.html
RStudio: https://www.rstudio.com

RStudio: a friendly IDE for R
Panes shown: new script; input scripts; Environment & History; plots & help; results

Contents
1. Introduction to R
2. Data sets
3. Introductory Statistical Principles
4. Sampling and experimental design with R
5. Graphical data presentation
6. Simple hypothesis testing
7. Introduction to Linear models
8. Correlation and simple linear regression
9. Single factor classification (ANOVA)
10. Nested ANOVA
11. Factorial ANOVA
12. Simple Frequency Analysis

ANOVA (Analysis of variance)
One-way ANOVA, also known as single factor classification, is used to investigate the effect of a single factor comprising two or more groups (levels) in a completely randomized design.
E.g. temperature, or concentration of drug A (a factor with four levels).

Example A: zinc contamination on the diversity of diatom species

Medley and Clements (1998) investigated the impact of zinc contamination (and other heavy metals) on the diversity of diatom species in the USA Rocky Mountains.
Response: the diversity of diatoms (number of species).
Factor: degree of zinc contamination (high, medium, low or natural background level).
Data were recorded from between four and six sampling stations within each of six streams known to be polluted. These data were used to test the null hypothesis that there were no differences in the diversity of diatoms between different zinc levels.

F-ratios
F-ratios and corresponding R syntax for single factor ANOVA designs (MS: mean square, a measure of variation).

F-distribution
Comparing plots of the probability density function (pdf) of the F distribution for various degrees of freedom: the solid line is the pdf of F(1, 1), the dashed line is the pdf of F(2, 5), and the dotted line is the pdf of F(10, 20).
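The three curves described above can be redrawn with base R. This is a minimal sketch: the plotting range, the axis limits and the variable names are choices made for this illustration, not taken from the slide.

```r
# Grid of F values (starting just above 0, because the F(1, 1) density
# is unbounded at 0)
x <- seq(0.01, 5, length.out = 500)

# Densities for the three (df1, df2) pairs named on the slide
d_1_1   <- df(x, df1 = 1,  df2 = 1)
d_2_5   <- df(x, df1 = 2,  df2 = 5)
d_10_20 <- df(x, df1 = 10, df2 = 20)

# Solid, dashed and dotted lines, as in the slide description
plot(x, d_1_1, type = "l", lty = 1, ylim = c(0, 1.2),
     xlab = "F", ylab = "Probability density")
lines(x, d_2_5, lty = 2)
lines(x, d_10_20, lty = 3)
legend("topright", lty = 1:3,
       legend = c("F(1, 1)", "F(2, 5)", "F(10, 20)"))
```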

F-distribution
E.g. the density plot of the F(3, 23) distribution, i.e. the distribution of the F statistic assuming that the null hypothesis is true. The observed value of the test statistic is f = 3.2, and the corresponding p-value is shown as the shaded area above 3.2.
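The shaded area above 3.2 can be computed directly with `pf()`. A short sketch using the values quoted above; the variable names are illustrative, and the 0.05 significance level used for the critical value is an assumption.

```r
f_obs <- 3.2   # observed F statistic from the slide
df1   <- 3     # numerator degrees of freedom
df2   <- 23    # denominator degrees of freedom

# Upper-tail area: P(F >= f_obs) when the null hypothesis is true
p_value <- pf(f_obs, df1, df2, lower.tail = FALSE)
p_value

# Critical value at alpha = 0.05 (assumed level), for comparison with f_obs
f_crit <- qf(0.95, df1, df2)
f_crit
```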

Fixed factor & Random factor
Fixed factor: the levels can be controlled, e.g. three specific temperatures.
Random factor: the levels cannot be controlled, e.g. three operators.

Fixed factor
The population group means are all equal: $H_0: \mu_1 = \mu_2 = \dots = \mu_i = \mu$
or, equivalently, the effect of each group equals zero: $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_i = 0$

Random factor
The variance between all possible groups equals zero, i.e. the added variance due to this factor equals zero: $H_0: \sigma_\alpha^2 = 0$

Linear model
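For a single factor design, the linear effects model is usually written in the form below. This is the standard textbook form, given here as an assumption about what the heading refers to.

```latex
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}
```

Here $y_{ij}$ is the $j$-th observation in group $i$, $\mu$ is the overall mean, $\alpha_i$ is the effect of group $i$, and $\varepsilon_{ij}$ is the residual.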

Assumptions of ANOVA Hypothesis testing for a single factor ANOVA model assumes that the residuals (and therefore the response variable for each of the treatment levels) are all: (i) normally distributed (ii) equally varied (iii) independent of one another

Tests of trends and means comparisons
When H0 is rejected, researchers often wish to examine patterns of differences among groups. However, this requires multiple comparisons of group means, and multiple comparisons need correction.
Post-hoc unplanned pairwise comparisons: e.g. Bonferroni, LSR (Duncan, Newman-Keuls), Tukey HSD
Planned comparisons

ANOVA in R
Model construction: lm(), aov()
View ANOVA table: summary(), anova()
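The two construction routes listed above give the same ANOVA table. A minimal sketch on simulated data; the data frame `sim` and its columns are hypothetical and are not one of the course datasets.

```r
set.seed(1)  # simulated data, not the datasets used in the examples

# Hypothetical data frame: one response, one factor with three levels
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 8)),
  y     = c(rnorm(8, mean = 10), rnorm(8, mean = 12), rnorm(8, mean = 11))
)

# Model construction, two equivalent ways
fit_aov <- aov(y ~ group, data = sim)
fit_lm  <- lm(y ~ group, data = sim)

# View the ANOVA table
summary(fit_aov)
anova(fit_lm)

# Both routes give the same F statistic for the factor
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
f_lm  <- anova(fit_lm)[["F value"]][1]
all.equal(f_aov, f_lm)
```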

Example A: zinc contamination on the diversity of diatom species

## 1 - Import dataset (notice the directory)
> setwd("path/to/data")  # placeholder: the folder that contains medley.csv
> medley <- read.table('medley.csv', header=TRUE, sep=',')
> medley  # check data
> boxplot(diversity~zinc, medley)
Note: in this first boxplot the zinc levels are not in a logical order.

## 2 - Reorganize the levels of the categorical factor into a more logical order
> medley$zinc  # 1st *
> medley$zinc <- factor(medley$zinc, levels=c('HIGH', 'MED', 'LOW', 'BACK'), ordered=FALSE)
> medley$zinc  # 2nd *
* Find the difference between the 1st and 2nd output.

## 3 - Assess normality/homogeneity of variance using a boxplot of species diversity against zinc group
> boxplot(diversity~zinc, medley)
Conclusions - no obvious violations of normality or homogeneity of variance; the boxplots are basically symmetrical.

## 4 - Assess the homogeneity of variance assumption with a plot of mean vs variance
> plot(tapply(medley$diversity, medley$zinc, mean), tapply(medley$diversity, medley$zinc, var))
Conclusions - no obvious relationship between group mean and variance.

## 3 - Assess normality using the Shapiro-Wilk test of species diversity within each zinc group
> library("plyr")
> # one Shapiro-Wilk p-value per zinc level
> ddply(medley, .(zinc), function(x) {data.frame(pvalue = shapiro.test(x$diversity)$p.value)})

## 3 - Assess homogeneity of variance using the Bartlett test of species diversity against zinc group
> bartlett.test(diversity ~ zinc, data = medley)
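The same two assumption checks can also be run with base R only. This is a sketch on simulated data, so the data frame `sim` and its columns are hypothetical; `tapply()` is used here in place of plyr.

```r
set.seed(5)  # simulated data for illustration only
sim <- data.frame(
  group = factor(rep(c("a", "b", "c", "d"), each = 8)),
  y     = rnorm(32, mean = 2, sd = 0.5)
)

# Shapiro-Wilk p-value within each group (base R, no extra packages)
shapiro_p <- tapply(sim$y, sim$group, function(v) shapiro.test(v)$p.value)
shapiro_p

# Bartlett test of equal variances across groups
bart <- bartlett.test(y ~ group, data = sim)
bart$p.value
```

Large p-values in both tests give no evidence against normality or equal variance; they do not prove that the assumptions hold.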

## 5 - Test H0 that population group means are all equal
> medley.aov <- aov(diversity ~ zinc, medley)
> medley.aov

## 5 - Test H0 that population group means are all equal: diagnostic plots
> par(mfrow = c(2, 2))  # 2 x 2 grid for the four diagnostic plots
> plot(medley.aov)
Conclusions - no obvious violations of normality or homogeneity of variance (one of the diagnostic panels is marked "meaningless" on the slide).

## 6 - Examine the ANOVA table.
> anova(medley.aov)
> summary(medley.aov)
Between groups: $MS_B = \dfrac{SS_B}{k-1}$, with $k-1$ degrees of freedom
Within groups: $MS_W = \dfrac{SS_W}{N-k}$, with $N-k$ degrees of freedom (N: total number of observations)
F ratio: $F_{(k-1,\,N-k)} = \dfrac{MS_B}{MS_W}$
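The entries of the ANOVA table can be reproduced by hand from the sums of squares. A sketch on simulated, unbalanced data; the data frame `sim` and all variable names are hypothetical.

```r
set.seed(2)  # simulated data for illustration only
sizes <- c(5, 6, 4, 6)                       # unequal group sizes
sim <- data.frame(
  group = factor(rep(c("g1", "g2", "g3", "g4"), times = sizes)),
  y     = rnorm(21, mean = rep(c(1.0, 1.5, 2.0, 1.2), times = sizes), sd = 0.4)
)

N <- nrow(sim)              # total number of observations
k <- nlevels(sim$group)     # number of groups

grand_mean  <- mean(sim$y)
group_means <- tapply(sim$y, sim$group, mean)
group_n     <- tapply(sim$y, sim$group, length)

# Between-group and within-group sums of squares
ss_b <- sum(group_n * (group_means - grand_mean)^2)
ss_w <- sum((sim$y - group_means[as.character(sim$group)])^2)

ms_b <- ss_b / (k - 1)      # between-group mean square
ms_w <- ss_w / (N - k)      # within-group mean square
f    <- ms_b / ms_w         # F ratio
p    <- pf(f, k - 1, N - k, lower.tail = FALSE)

# Compare with the table produced by anova()
tab <- anova(lm(y ~ group, data = sim))
c(manual_F = f, anova_F = tab[["F value"]][1])
```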

## 7 - Option: use a linear model to do the ANOVA
> anova(lm(diversity ~ zinc, medley))

## 6 - Examine the ANOVA table.
> anova(medley.aov)
> summary(medley.aov)
Conclusions - at least one of the population group means differs from the others.

Post-hoc unplanned pairwise comparison
One-way ANOVA results:
- Rejecting the H0 that all population group means are equal only indicates that at least one of the population group means differs from the others.
- However, it does not indicate which groups differ from which other groups.
- Multiple comparisons of group means, with correction, are required.

Post-hoc unplanned pairwise comparison
Problems of multiple comparisons:
1 - Multiple significance tests increase the probability of Type I errors (α, the probability of falsely rejecting H0). E.g. 5 groups give 10 pairwise comparisons; with α = 0.05 for each, the probability of at least one Type I error is 1 - 0.95^10 ≈ 0.40.
2 - The outcome of each test might not be independent (orthogonal). E.g. if A > B and B > C, we already know that A > C, so the comparison of A and C is not independent of the other two.
Corrections for multiple comparisons are therefore needed.
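The arithmetic in the first problem can be checked in R. A brief sketch; the variable names are illustrative, and the Bonferroni line is included only as one familiar example of a correction.

```r
groups <- 5
alpha  <- 0.05

# Number of pairwise comparisons among 5 groups
m <- choose(groups, 2)
m                      # 10

# Family-wise Type I error rate if the m tests were independent
fwer <- 1 - (1 - alpha)^m
round(fwer, 2)         # 0.4

# A Bonferroni correction tests each comparison at alpha / m,
# which keeps the family-wise rate at or below alpha
1 - (1 - alpha / m)^m
```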

## 7 - Post-hoc test to investigate pairwise mean differences between all groups
Tukey's Honestly Significant Difference (HSD) test for multiple comparisons
# option 1
> TukeyHSD(medley.aov)
# option 2
> library(multcomp)
> summary(glht(medley.aov, linfct = mcp(zinc = "Tukey")))
# option 3
> library(DescTools)
> PostHocTest(medley.aov, method = "hsd")

## 8 - Summarize result with a bargraph
Note: the biology package previously used for this step is not available now.

## 8 - Summarize result with a bargraph
Significance symbols (*) can be added manually with graphics software such as Adobe Illustrator.
> # calculate mean & sd separately
> mean1 <- tapply(medley$diversity, medley$zinc, mean)
> sd1 <- tapply(medley$diversity, medley$zinc, sd)
> dd1 <- data.frame(mean1, sd1)
> ylim <- c(0, max(dd1$mean1) + 2 * max(dd1$sd1))
> mp <- barplot(dd1$mean1, ylab="diversity", xlab="Zinc Concentration", names.arg=row.names(dd1), ylim=ylim)
> # vertical error bars (mean +/- sd), then their lower and upper caps
> segments(mp, dd1$mean1 - dd1$sd1, mp, dd1$mean1 + dd1$sd1)
> segments(mp - 0.1, dd1$mean1 - dd1$sd1, mp + 0.1, dd1$mean1 - dd1$sd1)
> segments(mp - 0.1, dd1$mean1 + dd1$sd1, mp + 0.1, dd1$mean1 + dd1$sd1)

## 8 - Summarize result with a bargraph using ggplot2
> library(ggplot2)
> mdata.m <- tapply(medley$diversity, medley$zinc, mean)
> mdata.sd <- tapply(medley$diversity, medley$zinc, sd)
> data.r <- data.frame(mdata.m, mdata.sd)
> data.r$zinc <- row.names(data.r)
> ggplot(data.r, aes(zinc, mdata.m, fill=zinc)) +
    geom_bar(stat="identity", width=0.5) +
    geom_errorbar(aes(ymin=mdata.m-mdata.sd, ymax=mdata.m+mdata.sd), width=0.2) +
    scale_y_continuous(expand=c(0,0), limits=c(0,3)) +  ## limits should be adjusted accordingly
    scale_x_discrete(limits=data.r$zinc) + ylab("diversity") +
    theme_bw() +
    theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

ggplot2: Elegant Graphics

Example B: Single factor ANOVA with planned comparisons: four biofilm types on the recruitment of serpulid larvae

Keough and Raimondi (1995) examined the effects of four biofilm types on the recruitment of serpulid larvae:
SL: sterile unfilmed substrate
NL: netted laboratory biofilms
UL: unnetted laboratory biofilms
F: netted field biofilms
Substrates treated with one of the four biofilm types were left in shallow marine waters for one week, after which the number of newly recruited serpulid worms was counted.

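The linear effect model would be: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, where $y_{ij}$ is the number of serpulids on substrate $j$ of biofilm type $i$, $\mu$ is the overall mean, $\alpha_i$ is the effect of biofilm type $i$, and $\varepsilon_{ij}$ is the residual.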

## 1&2 - Check the assumptions and scale data if appropriate
> # stringsAsFactors = TRUE reads biofilm as a factor (not the default in R >= 4.0)
> keough <- read.table("keough.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
> dev.off()  ## if necessary, to reset an earlier plot layout
> boxplot(serp ~ biofilm, data = keough)
> boxplot(log10(serp) ~ biofilm, data = keough)

## 1&2 - Check the assumptions and scale data if appropriate
> # group mean vs group variance, untransformed and then log10 transformed
> with(keough, plot(tapply(serp, biofilm, mean), tapply(serp, biofilm, var)))
> with(keough, plot(tapply(log10(serp), biofilm, mean), tapply(log10(serp), biofilm, var)))

## 1&2 - Check the assumptions and scale data if appropriate
[Plots shown for the untransformed data and for the log10 scale]
Conclusions - there is some evidence of a relationship between population mean and population variance in the untransformed data. The log10 transformed data meet the assumptions better, so the transformation is appropriate.

SL: sterile unfilmed substrate
NL: netted laboratory biofilms
UL: unnetted laboratory biofilms
F: netted field biofilms
Comparisons: NL vs UL; F vs (NL & UL); SL vs (F & NL & UL)

## 3&4 - Define a list of contrasts for the planned comparisons
> keough$biofilm  # 1st *
> contrasts(keough$biofilm) <- cbind(c(0, 1, 0, -1), c(2, -1, 0, -1), c(-1, -1, 3, -1))
> round(crossprod(contrasts(keough$biofilm)), 2)
> keough$biofilm  # 2nd *
Conclusions - all defined planned contrasts are orthogonal (the values above and below the diagonal of the cross-product matrix are all zero).
* Notice the difference between the 1st and 2nd output.
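The orthogonality check can be run on the contrast matrix alone, without the dataset. A sketch using the coefficients from the step above; the row order F, NL, SL, UL assumes that R orders the factor levels alphabetically, and the column labels are added for readability.

```r
# Contrast coefficients, rows in the assumed level order F, NL, SL, UL
cmat <- cbind(
  "NL vs UL"        = c(0,  1, 0, -1),
  "F vs (NL&UL)"    = c(2, -1, 0, -1),
  "SL vs (F&NL&UL)" = c(-1, -1, 3, -1)
)
rownames(cmat) <- c("F", "NL", "SL", "UL")

# Each contrast should sum to zero
colSums(cmat)

# Off-diagonal cross-products of zero mean the contrasts are orthogonal
cp <- crossprod(cmat)
cp
all(cp[upper.tri(cp)] == 0)   # TRUE
```

With four groups there are three degrees of freedom, so at most three mutually orthogonal contrasts can be defined.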

## 5 - Define contrast labels and model construction
> # labels for the three contrasts; the list name must match the factor name in the model
> keough.list <- list(biofilm = list('NL vs UL' = 1, 'F vs (NL&UL)' = 2, 'SL vs (F&NL&UL)' = 3))
> keough.aov <- aov(log10(serp) ~ biofilm, data = keough)
> par(mfrow = c(2, 2))
> plot(keough.aov)
> summary(keough.aov, split = keough.list)

## 5 - Define contrast labels and model construction
Conclusions:
- Biofilm treatments were found to have a significant effect on the mean log10 number of serpulid recruits (F3,24 = 6.0058, P = 0.003).
- The presence of a net (NL) over the substrate was not found to alter the mean log10 serpulid recruits compared to a surface without a net (UL) (F1,24 = 0.6352, P = 0.4332).
- Field biofilms (F) were not found to have different mean log10 serpulid recruits than the laboratory (NL, UL) biofilms (F1,24 = 0.6635, P = 0.4233).
- Unfilmed treatments were found to have significantly lower mean log10 serpulid recruits than treatments with biofilms (F1,24 = 16.719, P < 0.001).

## 5 - Define contrast labels and model construction
Significant effects were found for:
- Overall biofilm treatments (F3,24 = 6.0058, P = 0.003)
- Unfilmed treatments vs treatments with biofilms (F1,24 = 16.719, P < 0.001)

## 6 - Summarize findings with a bargraph
> means <- with(keough, tapply(serp, biofilm, mean, na.rm = TRUE))
> sds <- with(keough, tapply(serp, biofilm, sd, na.rm = TRUE))
> n <- with(keough, tapply(serp, biofilm, length))
> ses <- sds/sqrt(n)  # standard errors
> ys <- pretty(c(means - ses, means + (2 * ses)))
> xs <- barplot(means, beside = TRUE, axes = FALSE, ann = FALSE, ylim = c(min(ys), max(ys)), xpd = FALSE)
> arrows(xs, means + ses, xs, means - ses, angle = 90, length = 0.1, code = 3)  # error bars
> axis(2, las = 1)
> mtext(2, text = "Mean number of serpulids", line = 3, cex = 1.5)
> mtext(1, text = "Biofilm treatment", line = 3, cex = 1.5)
> box(bty = "l")

## 6 - Summarize findings with a bargraph
[Figure: bar graph of mean number of serpulids (y-axis) by biofilm treatment (x-axis: F, NL, SL, UL)]

Robust classification: alternatives to ANOVA
For use when there is either non-normality or unequal variance.
Welch's test: adjusts the degrees of freedom to maintain test reliability in situations where populations are normally distributed but unequally varied.
Kruskal-Wallis test: a non-parametric (rank-based) test for non-normal data.
Randomization tests: do not assume observations were collected via random sampling; however, they do assume that populations are equally varied.
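Of the three alternatives above, the randomization test has no single built-in base R function, so a hand-rolled sketch may help. It uses simulated data and one common scheme (shuffling group labels and recomputing F); the data frame `sim`, the helper `f_stat` and the choice of 999 shuffles are all assumptions made for this illustration.

```r
set.seed(3)  # simulated data for illustration only
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 10)),
  y     = c(rexp(10, rate = 1), rexp(10, rate = 1), rexp(10, rate = 0.4))
)

# Hypothetical helper: F statistic for response y and grouping g
f_stat <- function(y, g) anova(lm(y ~ g))[["F value"]][1]
f_obs  <- f_stat(sim$y, sim$group)

# Recompute F after shuffling the group labels many times
n_perm <- 999
f_perm <- replicate(n_perm, f_stat(sim$y, sample(sim$group)))

# Proportion of arrangements with F at least as large as observed
# (the observed arrangement counts as one of them)
p_perm <- (sum(f_perm >= f_obs) + 1) / (n_perm + 1)
p_perm
```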

Example E: Kruskal-Wallis test
The effect of different sugar treatments on pea length was investigated:
- Control
- 2% glucose added
- 2% fructose added
- 1% glucose and 1% fructose added
- 2% sucrose added

## 1 - Import data and check normality and equal variance
> purves <- read.table('purves.csv', header=TRUE, sep=',')
> dev.off()  # if necessary, to reset an earlier plot layout
> boxplot(length~treat, data=purves)
Conclusions - unequal variance.
Note: this dataset would also be suited to a Welch's test. For the purpose of providing worked examples that are consistent with popular biometry texts, a Kruskal-Wallis test is demonstrated.

## 2 - Perform the non-parametric Kruskal-Wallis test
> kruskal.test(length~treat, data=purves)

## 2 - Perform post-hoc test
> # pairwise t-tests without pooled SD, p-values adjusted by false discovery rate
> pairwise.t.test(purves$length, purves$treat, pool.sd=FALSE, p.adjust.method="fdr")
fdr: False discovery rate
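The post-hoc step above uses pairwise t-tests. If a rank-based follow-up is preferred, so that it matches the Kruskal-Wallis test, base R also provides `pairwise.wilcox.test()`. This is a swapped-in alternative rather than the method used on the slide, sketched on simulated data with hypothetical treatment names.

```r
set.seed(6)  # simulated data for illustration only
sim <- data.frame(
  treat  = factor(rep(c("control", "sugarA", "sugarB"), each = 10)),
  length = c(rnorm(10, 70, 2), rnorm(10, 60, 2), rnorm(10, 64, 2))
)

# Overall rank-based test
kw <- kruskal.test(length ~ treat, data = sim)
kw$p.value

# Pairwise rank-sum tests with false discovery rate adjustment
pw <- pairwise.wilcox.test(sim$length, sim$treat, p.adjust.method = "fdr")
pw$p.value
```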

## Summarize findings with a bargraph
> means <- with(purves, tapply(length, treat, mean, na.rm=TRUE))
> sds <- with(purves, tapply(length, treat, sd, na.rm=TRUE))
> n <- with(purves, tapply(length, treat, function(x) length(x)))  # group sizes
> ses <- sds/sqrt(n)  # standard errors
> ys <- pretty(c(means - ses, means + (2 * ses)))
> xs <- barplot(means, beside=TRUE, axes=FALSE, ann=FALSE, ylim = c(min(ys), max(ys)), xpd=FALSE)
> arrows(xs, means+ses, xs, means-ses, angle=90, length=0.05, code=3)  # error bars
> axis(2, las = 1)
> mtext(2, text = "Mean pea length", line = 3, cex = 1.5)
> mtext(1, text = "Sugar treatment", line = 3, cex = 1.5)
> text(xs, means + ses, labels = c('a','b','b','b','c'), pos = 3)  # letters mark groups that do not differ
> box(bty="l")

Example F: Welch's test
The effect of the type of bird colony on beetle density
The study examined the effects of sea birds on tenebrionid beetles on islands in the Gulf of California. Hypothesis: sea birds leaving guano and carrion would increase beetle productivity. The researchers had a sample of 25 islands and recorded the beetle density, the type of bird colony (roosting, breeding, no birds), % cover of guano and % plant cover of annuals and perennials.

## 1 - Import data and check normality and equal variance
> sanchez <- read.table('sanchez.csv', header=TRUE, sep=',')
> boxplot(guano~coltype, data=sanchez)
> boxplot(sqrt(guano)~coltype, data=sanchez)  # square-root transformed

## 1 - Import data and check normality and equal variance
Conclusions - clear evidence of non-normality and non-homogeneity of variance. The square-root transform improved things a little, but the variances are still unequal.

## Perform the Welch's test
> oneway.test(sqrt(guano)~coltype, data=sanchez)
Conclusions - significant difference; reject the null hypothesis.
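`oneway.test()` performs Welch's test by default and the classical equal-variance F test when `var.equal = TRUE`. The sketch below contrasts the two on simulated data with deliberately unequal spreads; the data frame `sim`, the group labels and the group sizes are hypothetical.

```r
set.seed(4)  # simulated data for illustration only
sim <- data.frame(
  group = factor(rep(c("B", "N", "R"), times = c(8, 8, 9))),
  y     = c(rnorm(8, mean = 5, sd = 0.5),
            rnorm(8, mean = 5, sd = 3),
            rnorm(9, mean = 7, sd = 1))
)

# Welch's test (the default): does not assume equal variances
welch <- oneway.test(y ~ group, data = sim)

# var.equal = TRUE gives the classical one-way ANOVA F test
classic <- oneway.test(y ~ group, data = sim, var.equal = TRUE)

# Welch's denominator degrees of freedom are adjusted and are
# generally not an integer; the classical test uses N - k
welch$parameter
classic$parameter
```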

## Perform post-hoc test (Holm-adjusted p-values)
> pairwise.t.test(sqrt(sanchez$guano), sanchez$coltype, pool.sd=FALSE, p.adjust.method="holm")

## Perform post-hoc test (unadjusted p-values, for comparison)
> pairwise.t.test(sqrt(sanchez$guano), sanchez$coltype, pool.sd=FALSE, p.adjust.method="none")

Single Factor Classification Methods
ANOVA: all three assumptions satisfied
Welch's test: normally distributed but NOT equally varied
Kruskal-Wallis test (non-parametric, tests medians): non-normal data
Randomization tests: random sampling can NOT be assumed, but populations are equally varied