Multivariate analysis of genetic data: exploring group diversity

A practical course using the R software

Thibaut Jombart

Abstract

This practical course tackles the question of group diversity in the analysis of genetic data using R [8]. It consists of two main parts: first, how to infer groups when these are unknown, and second, how to use group information when describing the genetic diversity. The practical mostly uses the packages adegenet [5], ape [7] and ade4 [1, 3, 2], but others such as genetics [9] and hierfstat [4] are also required.

Contents

1 Defining genetic clusters
  1.1 Hierarchical clustering
  1.2 K-means

2 Describing genetic clusters
  2.1 Analysis of group data
  2.2 Between-class analyses
  2.3 Discriminant Analysis of Principal Components

> library(adegenet)

1 Defining genetic clusters

Group information is not always known when analysing genetic data. Even when some prior clustering can be defined, it is not always obvious that it corresponds to the best genetic clusters that can be defined. In this section, we illustrate two simple approaches for defining genetic clusters.

1.1 Hierarchical clustering

Hierarchical clustering can be used to represent genetic distances as trees, and indirectly to define genetic clusters. This is achieved by cutting the tree at a certain height, and pooling the tips descending from the retained branches into the same clusters (cutree). Load the data microbov, replace the missing data, and compute the Euclidean distances between individuals (functions na.replace and dist). Then, use hclust to obtain a hierarchical clustering of the individuals which forms strong groups (choose the right method(s)).

> data(microbov)
> X <- na.replace(microbov, method = "mean")$tab

Replaced 6325 missing values

> D <- dist(X)
> h1 <- hclust(D, method = "complete")
> plot(h1, sub = "secret clustering method")

[Figure: cluster dendrogram ("secret clustering method") of the microbov individuals; heights range from about 2 to 6.]

Cut the tree into two groups. Do these groups match any prior clustering of the data (see the content of microbov$other)? Remember that table can be used to build contingency tables. Accordingly, what is the main component of the genetic variability in these cattle breeds?

Repeat this analysis by cutting the tree into as many clusters as there are breeds in the dataset (the breeds can be extracted by pop), and name the result grp. Build a contingency table to see the match between inferred groups and breeds. Use table.value to represent the result, then interpret it:

> table.value(table(pop(microbov), grp), col.lab = paste("grp", 1:15))
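For reference, the grp object used in the call above could be obtained along the following lines (a minimal sketch; it assumes that complete linkage is the "secret" method, and that microbov$other contains a spe component coding the species):

> grp2 <- cutree(h1, k = 2)           # cut the tree into two groups
> table(microbov$other$spe, grp2)     # compare with a prior clustering (species)
> grp <- cutree(h1, k = 15)           # one cluster per breed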

[Figure: table.value plot of breeds (rows) versus the 15 inferred groups (columns); square sizes are proportional to counts (5 to 45).]

Can some groups be identified with species? With breeds? Are some species more admixed than others?

1.2 K-means

K-means is another, non-hierarchical approach for defining genetic clusters. It relies on the usual ANOVA model for multivariate data X ∈ R^{n×p}:

VAR(X) = B(X) + W(X)

where VAR, B and W correspond to, respectively, the total variance, the variance between groups, and the variance within groups. K-means then consists in finding the groups that minimize W(X).

Using the function kmeans, find 15 genetic clusters in the microbov data; how do the inferred clusters compare to the breeds? Reproduce the same figure as before for these results; the figure should resemble:

> grp <- kmeans(X, 15)
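Using the grp object just created, the comparison figure (shown below) can be reproduced and the sums of squares of the variance partition inspected; a minimal sketch, noting that betweenss and tot.withinss are components of the kmeans output in recent versions of R:

> table.value(table(pop(microbov), grp$cluster), col.lab = paste("grp", 1:15))
> grp$betweenss      # corresponds to B(X), the variance between groups
> grp$tot.withinss   # corresponds to W(X), the variance within groups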

[Figure: table.value plot of breeds (rows) versus the 15 K-means clusters (columns).]

What differences can you see? Which species are easily identified genetically using K-means?

Look at the output of the K-means clustering, and try to identify the sums of squares corresponding to the variance partition model. How do these behave when the number of clusters increases? Using sapply, retrieve the results of several K-means runs where K is increased gradually. Plot the results; one example of an obtained result would be:

> totalss <- sum(X * X)
> temp <- sapply(seq(5, 50, by = 5), function(k) totalss - sum(kmeans(X, k)$withinss))
> plot(seq(5, 50, by = 5), temp, ylab = "Between-groups sum of squares",
+     type = "b", xlab = "K")

[Figure: between-groups sum of squares (roughly 6800 to 8200) plotted against K (10 to 50).]

How can you interpret this observation? Can B(X) or W(X) be used for selecting the best K?

Deciding on the optimal number of clusters (K) to retain is often tricky. This can be achieved by computing the K-means solution for a range of values of K, and then selecting the K giving the best Bayesian Information Criterion (BIC). This is implemented by the function find.clusters. In addition, this function orthogonalizes and reduces the data using PCA as a prior step to K-means. Use this function to search for the optimal number of clusters in microbov. How many clusters would you retain? Compare them to the breed information: when 8 clusters are retained, what are the most admixed breeds?

Repeat the same analyses for the nancycats data. What can you say about the likely profile of admixture between these cat colonies?
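A minimal sketch of such a run for microbov (find.clusters is interactive by default: it displays the BIC values and asks for the number of PCs and clusters to retain; max.n.clust = 20 is an arbitrary choice). The same recipe applies to nancycats, whose comparison with the 17 colonies is shown below:

> fc <- find.clusters(microbov, max.n.clust = 20)   # pick K from the BIC curve
> table.value(table(pop(microbov), fc$grp))         # compare clusters with breeds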

[Figure: table.value plot of the 17 nancycats colonies (rows) versus 5 inferred clusters (columns); square sizes range from 2.5 to 12.5.]

2 Describing genetic clusters

There are different ways of using genetic cluster information in the multivariate analysis of genetic data. The methods vary according to the purpose of the study: describing the differences between groups, describing how individuals are structured within the different clusters, etc.

2.1 Analysis of group data

Sometimes we are only interested in the diversity between groups, and not in the variation occurring within clusters. In such cases, we can analyse the data directly at the population level. The first step in doing so is computing allele counts by population (using genind2genpop). These data can then be:
- directly analysed using Correspondence Analysis (CA, dudi.coa), which is appropriate for contingency tables
- translated into scaled allele frequencies (makefreq or scaleGen), and analysed by Principal Component Analysis (PCA, dudi.pca)
- used to compute genetic distances between populations (dist.genpop), which are in turn summarised by Principal Coordinates Analysis (PCoA, dudi.pco)

Try applying these approaches to the dataset microbov, and compare the results obtained.

> x <- genind2genpop(microbov)

Converting data from a genind to a genpop object...
...done.

> ca1 <- dudi.coa(truenames(x), scannf = FALSE, nf = 3)
> s.label(ca1$li)
> add.scatter.eig(ca1$eig, 3, 2, 1)
> title("microbov - Correspondence Analysis")

[Figure: CA of the microbov populations; the African breeds (Zebu, Borgou, Lagunaire, NDama, Somba) separate from the French breeds.]

> pca1 <- dudi.pca(scaleGen(x, scale = FALSE), scale = FALSE, scannf = FALSE, nf = 3)
> s.label(pca1$li)
> add.scatter.eig(pca1$eig, 3, 2, 1, posi = "bottomright")
> title("microbov - Principal Component Analysis")

[Figure: PCA of the microbov populations; the arrangement of the breeds is very close to that of the CA.]

> pco1 <- dudi.pco(dist.genpop(x, meth = 2), nf = 3, scannf = FALSE)
> s.label(pco1$li)
> add.scatter.eig(pco1$eig, 3, 2, 1, posi = "bottomright")
> title("microbov - Principal Coordinates Analysis\nEdwards distance")

[Figure: PCoA of the microbov populations based on Edwards' distance; African and French breeds again separate on the first axes.]

For the PCoA, try different Euclidean distances (see dist.genpop), and compare the results. Does the genetic distance employed seem to matter?

We will leave the cattle at rest for now, and switch to human seasonal influenza, H3N2. The dataset H3N2 contains SNPs derived from 1903 hemagglutinin RNA sequences. Obtain allele counts per year of sampling (see the content of H3N2$other) using genind2genpop, and compute i) Edwards' distance, ii) Reynolds' distance and iii) Rogers' distance between years. Perform the PCoA of these distances and plot the results; here is the result for one of the requested distances:

[Figure: PCoA of the H3N2 yearly samples ("secret distance"); the first axis orders the years from 2001 to 2006.]

What is the meaning of the first principal component? Are the results coherent between the three distances?

To get yet another point of view, perform the PCA of the individual data, and represent group information when plotting the principal components using s.class.

> f <- H3N2$other$epid
> pca1 <- dudi.pca(scaleGen(H3N2, miss = "mean", scale = FALSE),
+     scale = FALSE, scannf = FALSE, nf = 2)
> par(bg = "grey")
> s.class(pca1$li, factor(f), col = rainbow(6))
> add.scatter.eig(pca1$eig, 2, 1, 2)
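Before looking at the PCA figure below, here is a minimal sketch for the distance exercise above (it assumes H3N2$other$epid holds the year of sampling, that methods 2, 3 and 4 of dist.genpop correspond to Edwards', Reynolds' and Rogers' distances, and `annual` is an arbitrary name):

> annual <- genind2genpop(H3N2, pop = factor(H3N2$other$epid))
> D.edw <- dist.genpop(annual, method = 2)   # Edwards' angular distance
> D.rey <- dist.genpop(annual, method = 3)   # Reynolds' (coancestrality) distance
> D.rog <- dist.genpop(annual, method = 4)   # Rogers' distance
> pco.edw <- dudi.pco(D.edw, scannf = FALSE, nf = 2)
> s.label(pco.edw$li)                        # repeat for the other two distances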

[Figure: s.class plot of the PCA of the individual H3N2 data, with isolates grouped by year (2001-2006).]

What can we say about the genetic evolution of influenza between the years 2001 and 2006? Can we safely discard information about individual isolates, and just work with data pooled by year of epidemic?

2.2 Between-class analyses

In many situations, discarding information about the individual observations leads to a considerable loss of information. However, basic analyses working at the individual level (mainly PCA) are not optimal in terms of group separation. This is because PCA focuses on the total variability, while only the variability between groups should be optimized. Let us recall the basic multivariate ANOVA model, P being the projector onto the dummy vectors of group membership H:

P = H (H^T D H)^{-1} H^T D

X = PX + (I - P)X

which leads to the variance partition seen before with the K-means:

VAR(X) = B(X) + W(X)

with:

VAR(X) = trace(X^T D X)
B(X) = trace(X^T P^T D P X)

W(X) = trace(X^T (I - P)^T D (I - P) X)

where D is a metric (in PCA, D = (1/n) I) and I is the identity matrix. We can verify that ordinary PCA decomposes the total variance:

> X <- scaleGen(H3N2, miss = "mean", scale = FALSE)
> pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 2)
> sum(pca1$eig)

[1] 15.05856

> D <- diag(1/nrow(X), nrow(X))
> sum(diag(t(X) %*% D %*% X))

[1] 15.05856

Interestingly, it is possible to modify a multivariate analysis so as to decompose B(X) or W(X) instead of VAR(X). The corresponding methods are called between-class (or inter-class) and within-class (or intra-class) analyses, respectively. If the basic method is a PCA, then we will be performing a between-class PCA or a within-class PCA. These methods are implemented by the functions between and within. Here is an example of how to perform the between-class PCA:

> f <- H3N2$other$epid
> bpca1 <- between(pca1, fac = factor(f), scannf = FALSE, nf = 3)

Verify that the sum of the eigenvalues of this analysis equals the variance between groups (B(X)); for this, you will need to compute H, and then P. This requires a few matrix operations:
- A %*% B: multiplies matrix A by matrix B
- diag: either creates a diagonal matrix, or extracts the diagonal of an existing square matrix
- t: transposes a matrix
- ginv: computes the (generalized) inverse of a matrix (package MASS)

To obtain H, the matrix of dummy vectors of group memberships, do:

> H <- as.matrix(acm.disjonctif(data.frame(f)))

You should find that B(X) is exactly equal to the sum of the eigenvalues of the between-class analysis. Repeat the same approach for the within-class analysis, and confirm that the variance within groups is fully decomposed by the within-class PCA.

Now that you have checked that a multivariate method can be associated with each component of the variance partition, investigate the results of the between-class PCA further.
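For completeness, here is a minimal sketch of the requested verification, directly applying the formulas above (ginv comes from the MASS package; the two quantities should match up to numerical precision):

> library(MASS)
> P <- H %*% ginv(t(H) %*% D %*% H) %*% t(H) %*% D   # projector onto group dummies
> sum(diag(t(X) %*% t(P) %*% D %*% P %*% X))         # B(X) = trace(X^T P^T D P X)
> sum(bpca1$eig)                                     # should equal B(X)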

> par(bg = "grey")
> s.class(bpca1$ls, factor(f), col = rainbow(6))
> add.scatter.eig(bpca1$eig, 2, 1, 2)
> title("H3N2 - Between-class PCA")

[Figure: between-class PCA of the H3N2 data, with isolates grouped by year (2001-2006).]

On this scatterplot, the centres of the groups (the year labels) are as scattered as possible. Why is it slightly disappointing? How does it compare to the original PCA? The following figure represents the PCA axes on the basis of the between-class analysis.

> s.corcircle(bpca1$as)

[Figure: correlation circle of the original PCA axes (Axis1, Axis2) on the basis of the between-class analysis.]

What can we say about the PCA and the between-class PCA?

One obvious problem with the previous scatterplot is the dispersion of the isolates within 2001 and 2002. To best assess the separation of the isolates by year, we need to maximize the dispersion between groups while minimizing the dispersion within groups. This is precisely what Discriminant Analysis does.

2.3 Discriminant Analysis of Principal Components

Discriminant Analysis aims at displaying the best discrimination of individuals into pre-defined groups. However, some technical requirements often make it impractical for genetic data. Discriminant Analysis of Principal Components (DAPC, [6]) has been developed to circumvent this issue. Apply DAPC (function dapc) to the H3N2 data, retaining 50 PCs in the preliminary PCA step, and plot the results using scatter. The result should resemble this:

> dapc1 <- dapc(H3N2, pop = f, n.pca = 50, n.da = 2, all.contr = TRUE)
> scatter(dapc1, posi = "top")

[Figure: DAPC scatterplot of the H3N2 data; the years 2001-2006 form well-separated clusters.]

How does this compare to the previous results? What new feature appeared that was not visible before?

To better assess the structure on each axis, you can plot one dimension at a time, specifying the same axis for xax and yax in scatter.dapc; for instance:

> scatter(dapc1, xax = 1, yax = 1)

[Figure: densities of the isolates along the first discriminant function (values roughly -5 to 5), plotted by year.]

To interpret the results further, you will need the loadings of the alleles, which are not provided by default in dapc. Re-run the analysis if needed, and specify that you want allele contributions to be returned. Use loadingplot to display allele contributions to the first and to the second structure. This is the result you should obtain for the first axis:

> loadingplot(dapc1$var.contr)

[Figure: loading plot of the allele contributions to the first discriminant function; the largest contributions come from alleles of SNPs such as 435, 476, 676 and 685.]

How can you interpret this result? Note that loadingplot invisibly returns some information about the most contributing alleles (those indicated on the plot). Save this information, and then examine it. Interpret the following commands and their result:

> which(H3N2$loc.names == "435")

L051
  51

> snp435 <- truenames(H3N2[loc = "L051"])
> head(snp435)

         435.a 435.c 435.g
AB434107     1     0     0
AB434108     1     0     0
AB438242     1     0     0
AB438243     1     0     0
AB438244     1     0     0
AB438245     1     0     0

> temp <- apply(snp435, 2, function(e) tapply(e, f, mean, na.rm = TRUE))
> temp

           435.a       435.c       435.g
2001 1.000000000 0.000000000 0.000000000
2002 0.995535714 0.004464286 0.000000000
2003 0.942396313 0.055299539 0.002304147
2004 0.046210721 0.953789279 0.000000000
2005 0.008316008 0.991683992 0.000000000
2006 0.000000000 1.000000000 0.000000000

> matplot(temp, type = "b", pch = c("a", "c", "g"), xlab = "year",
+     ylab = "allele frequency", xaxt = "n", cex = 1.5)
> axis(side = 1, at = 1:6, lab = 2001:2006)

[Figure: frequencies of the three alleles of SNP 435 across the years 2001-2006; allele 'a' is replaced by allele 'c' between 2003 and 2004.]

Do the same for the second axis. How would you interpret this result? Should we expect a vaccine based on the 2005 influenza to have worked against the 2006 virus?
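For the second axis, one possible starting point is the following sketch (the axis argument of loadingplot selects the column of dapc1$var.contr, and the returned object is assumed to contain the names of the most contributing alleles; the threshold value is an arbitrary choice):

> info2 <- loadingplot(dapc1$var.contr, axis = 2, threshold = 0.05)
> info2$var.names   # alleles contributing most to the second discriminant function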

References

[1] D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package - I: One-table methods. R News, 4:5-10, 2004.

[2] S. Dray and A.-B. Dufour. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1-20, 2007.

[3] S. Dray, A.-B. Dufour, and D. Chessel. The ade4 package - II: Two-table and K-table methods. R News, 7:47-54, 2007.

[4] J. Goudet. HIERFSTAT, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5:184-186, 2005.

[5] T. Jombart. adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24:1403-1405, 2008.

[6] T. Jombart, S. Devillard, and F. Balloux. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. Genetics, submitted.

[7] E. Paradis, J. Claude, and K. Strimmer. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20:289-290, 2004.

[8] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. ISBN 3-900051-07-0.

[9] G. R. Warnes. The genetics package. R News, 3(1):9-13, 2003.