Multivariate analysis of genetic data: exploring group diversity
Practical course using the software R
Thibaut Jombart

Abstract

This practical course tackles the question of group diversity in genetic data analysis using R [8]. It consists of two main parts: first, how to infer groups when these are unknown, and second, how to use group information when describing the genetic diversity. The practical mostly uses the packages adegenet [5], ape [7] and ade4 [1, 3, 2], but others such as genetics [9] and hierfstat [4] are also required.
Contents

1 Defining genetic clusters
  1.1 Hierarchical clustering
  1.2 K-means
2 Describing genetic clusters
  2.1 Analysis of group data
  2.2 Between-class analyses
  2.3 Discriminant Analysis of Principal Components
> library(adegenet)

1 Defining genetic clusters

Group information is not always known when analysing genetic data. Even when some prior clustering can be defined, it is not always obvious that it corresponds to the best genetic clusters that can be defined. In this section, we illustrate two simple approaches to defining genetic clusters.

1.1 Hierarchical clustering

Hierarchical clustering can be used to represent genetic distances as trees, and indirectly to define genetic clusters. This is achieved by cutting the tree at a certain height, and pooling the tips descending from the retained branches into the same clusters (cutree). Load the data microbov, replace the missing data, and compute the Euclidean distances between individuals (functions na.replace and dist). Then, use hclust to obtain a hierarchical clustering of the individuals which forms strong groups (choose the right method(s)).

> data(microbov)
> X <- na.replace(microbov, method = "mean")$tab
Replaced 6325 missing values
> D <- dist(X)
> h1 <- hclust(D, method = "complete")
> plot(h1, sub = "secret clustering method")
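Under the hood, hclust agglomerates the two closest clusters at each step, and cutree recovers groups by stopping the merging at a chosen level. Since the course code is interactive R, here is a self-contained Python sketch of the same logic (the function names are illustrative, not from any package):

```python
from math import dist  # Python >= 3.8: Euclidean distance between two points

def complete_linkage(c1, c2):
    # complete linkage: cluster distance = largest pairwise point distance
    return max(dist(a, b) for a in c1 for b in c2)

def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain.

    Equivalent in spirit to hclust(method = "complete") followed by cutree."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest complete-linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# two well-separated groups on the line y = 0
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0), (5.1, 0.0)]
groups = agglomerate(pts, 2)
print(sorted(len(g) for g in groups))  # [2, 3]: the two tight groups are recovered
```

Complete linkage tends to produce compact, balanced groups, whereas single linkage (min instead of max) often chains clusters together; this is why the choice of method matters in the exercise above.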
[Figure: cluster dendrogram of the microbov individuals (height 2-6), sub-title "secret clustering method"; tip labels omitted]

Cut the tree into two groups. Do these groups match any prior clustering of the data (see the content of microbov$other)? Remember that table can be used to build contingency tables. Accordingly, what is the main component of the genetic variability in these cattle breeds? Repeat this analysis by cutting the tree into as many clusters as there are breeds in the dataset (these can be extracted by pop), and name the result grp. Build a contingency table to see the match between inferred groups and breeds. Use table.value to represent the result, then interpret it:

> table.value(table(pop(microbov), grp), col.lab = paste("grp", 1:15))
[Figure: table.value plot of the contingency table of breeds (rows) versus inferred groups grp 1-15 (columns)]

Can some groups be identified with species? With breeds? Are some species more admixed than others?

1.2 K-means

K-means is another, non-hierarchical approach to defining genetic clusters. It relies on the usual ANOVA model for multivariate data X ∈ R^(n×p):

VAR(X) = B(X) + W(X)

where VAR, B and W correspond, respectively, to the total variance, the variance between groups, and the variance within groups. K-means then consists in finding the groups that minimize W(X). Using the function kmeans, find 15 genetic clusters in the microbov data; how do the inferred clusters compare to the breeds? Reproduce the same figure as before for these results; the figure should resemble:

> grp <- kmeans(X, 15)
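The partition VAR(X) = B(X) + W(X) holds exactly for any grouping, once B and W are computed around the group means. A short self-contained Python sketch (illustrative names, 1-D toy data) makes this easy to verify numerically:

```python
def variance_partition(points, labels):
    """Return (total, between, within) sums of squares for a given grouping."""
    n = len(points)
    grand = sum(points) / n
    groups = sorted(set(labels))
    means = {g: sum(p for p, l in zip(points, labels) if l == g)
                / labels.count(g) for g in groups}
    total = sum((p - grand) ** 2 for p in points)           # ~ VAR(X)
    between = sum((means[l] - grand) ** 2 for l in labels)  # ~ B(X)
    within = sum((p - means[l]) ** 2
                 for p, l in zip(points, labels))           # ~ W(X)
    return total, between, within

pts = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
labels = ["a", "a", "a", "b", "b", "b"]
tot, bet, wit = variance_partition(pts, labels)
print(round(tot, 10) == round(bet + wit, 10))  # True: VAR(X) = B(X) + W(X)
```

K-means searches, among all possible labellings, for the one minimizing the within term; since the total is fixed, this is the same as maximizing the between term.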
[Figure: table.value plot of breeds versus the 15 clusters inferred by K-means]

What differences can you see? Which species are easily identified genetically using K-means? Look at the output of the K-means clustering, and try to identify the sums of squares corresponding to the variance partition model. How do these behave when the number of clusters increases? Using sapply, retrieve the results of several K-means runs where K is increased gradually. Plot the results; one example of the result would be:

> totalss <- sum(X * X)
> temp <- sapply(seq(5, 50, by = 5), function(k) totalss - sum(kmeans(X,
+     k)$withinss))
> plot(seq(5, 50, by = 5), temp, ylab = "Between-groups sum of squares",
+     type = "b", xlab = "K")
[Figure: between-groups sum of squares (y axis, roughly 6800-8200) plotted against K (x axis, 10-50)]

How can you interpret this observation? Can B(X) or W(X) be used for selecting the best K? Deciding on the optimal number of clusters (K) to retain is often tricky. This can be achieved by computing the K-means solution for a range of values of K, and then selecting the K giving the best Bayesian Information Criterion (BIC). This is achieved by the function find.clusters. In addition, this function orthogonalizes and reduces the data using PCA as a prior step to K-means. Use this function to search for the optimal number of clusters in microbov. How many clusters would you retain? Compare them to the breed information: when 8 clusters are retained, what are the most admixed breeds? Repeat the same analyses for the nancycats data. What can you say about the likely profile of admixture between these cat colonies?
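A key point behind the question above is that W(X) can only decrease (and B(X) only increase) as K grows, so neither can select K on its own; a penalized criterion such as BIC is needed. The following self-contained Python sketch (a toy 1-D K-means with random restarts, not adegenet's implementation) exhibits the monotone decrease of W(X):

```python
import random

def kmeans_wss(points, k, iters=50, seed=0):
    """1-D Lloyd's algorithm; returns the within-group sum of squares W(X)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centre, then move centres to group means
        labels = [min(range(k), key=lambda j: (p - centres[j]) ** 2) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = sum(members) / len(members)
    return sum((p - centres[l]) ** 2 for p, l in zip(points, labels))

def best_wss(points, k, restarts=20):
    # keep the best of several random starts, as kmeans(..., nstart = ...) does in R
    return min(kmeans_wss(points, k, seed=s) for s in range(restarts))

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
wss = [best_wss(pts, k) for k in range(1, 5)]
print(all(b <= a + 1e-9 for a, b in zip(wss, wss[1:])))  # True: W(X) never increases with K
```

Because W(X) decreases mechanically with K, BIC adds a penalty term growing with the number of clusters, and the retained K is the one minimizing the penalized score.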
[Figure: table.value plot of the 17 nancycats colonies versus the clusters inferred by find.clusters]

2 Describing genetic clusters

There are different ways of using genetic cluster information in the multivariate analysis of genetic data. The methods vary according to the purpose of the study: describing the differences between groups, describing how individuals are structured within the different clusters, etc.

2.1 Analysis of group data

Sometimes we are only interested in the diversity between groups, and not in the variation occurring within clusters. In such cases, we can analyse the data directly at the population level. The first step in doing so is computing allele counts by population (using genind2genpop). These data can then be:
- directly analysed using Correspondence Analysis (CA, dudi.coa), which is appropriate for contingency tables
- translated into scaled allele frequencies (makefreq or scaleGen), and analysed by Principal Component Analysis (PCA, dudi.pca)
- used to compute genetic distances between populations (dist.genpop), which are in turn summarised by Principal Coordinates Analysis (PCoA, dudi.pco)
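Computing allele counts by population, which is the job of genind2genpop, is conceptually just a tally of the individual allele counts pooled by group. A toy Python stand-in (hypothetical names, simplified one-locus diploid data):

```python
from collections import Counter, defaultdict

def counts_by_population(genotypes, pops):
    """Pool individual allele counts into per-population totals.

    genotypes: one list of allele names per individual (a diploid individual
    contributes two alleles at the locus); pops: population label per individual.
    A toy stand-in for what adegenet's genind2genpop does per locus."""
    table = defaultdict(Counter)
    for alleles, pop in zip(genotypes, pops):
        table[pop].update(alleles)
    return {pop: dict(c) for pop, c in table.items()}

# three individuals typed at one microsatellite locus
genos = [["150", "152"], ["150", "150"], ["152", "154"]]
pops = ["north", "north", "south"]
print(counts_by_population(genos, pops))
# {'north': {'150': 3, '152': 1}, 'south': {'152': 1, '154': 1}}
```

The resulting population-by-allele table of counts is exactly the kind of contingency table that CA expects; dividing each row by its total gives the allele frequencies used by PCA.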
Try applying these approaches to the dataset microbov, and compare the obtained results.

> x <- genind2genpop(microbov)
Converting data from a genind to a genpop object...
...done.
> ca1 <- dudi.coa(truenames(x), scannf = FALSE, nf = 3)
> s.label(ca1$li)
> add.scatter.eig(ca1$eig, 3, 2, 1)
> title("microbov - Correspondence Analysis")

[Figure: s.label plot of the population scores of the Correspondence Analysis (d = 0.5), with eigenvalues inset; title "microbov - Correspondence Analysis"]

> pca1 <- dudi.pca(scaleGen(x, scale = FALSE), scale = FALSE, scannf = FALSE,
+     nf = 3)
> s.label(pca1$li)
> add.scatter.eig(pca1$eig, 3, 2, 1, posi = "bottomright")
> title("microbov - Principal Component Analysis")
[Figure: s.label plot of the population scores of the PCA (d = 0.5), with eigenvalues inset; title "microbov - Principal Component Analysis"]

> pco1 <- dudi.pco(dist.genpop(x, meth = 2), nf = 3, scannf = FALSE)
> s.label(pco1$li)
> add.scatter.eig(pco1$eig, 3, 2, 1, posi = "bottomright")
> title("microbov - Principal Coordinates Analysis\nEdwards distance")

[Figure: s.label plot of the population scores of the PCoA of Edwards' distances (d = 0.1), with eigenvalues inset]
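PCoA itself is classical metric scaling: square the distances, double-centre them (Gower centring), and extract the leading eigenvectors of the resulting matrix. A minimal self-contained Python sketch, using power iteration for the leading eigenvector (illustrative code, not dudi.pco):

```python
import math

def double_centre(d2):
    """Gower centring: B = -0.5 * J * D2 * J with J = I - 11'/n."""
    n = len(d2)
    row = [sum(r) / n for r in d2]
    tot = sum(row) / n
    return [[-0.5 * (d2[i][j] - row[i] - row[j] + tot) for j in range(n)]
            for i in range(n)]

def top_eigen(B, iters=100):
    # power iteration for the dominant eigenpair of a symmetric matrix
    n = len(B)
    v = [1.0] + [0.0] * (n - 1)
    lam = 0.0
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

# three "populations" at positions 0, 1 and 3 on a line: squared distances
d2 = [[0.0, 1.0, 9.0],
      [1.0, 0.0, 4.0],
      [9.0, 4.0, 0.0]]
lam, v = top_eigen(double_centre(d2))
coords = [math.sqrt(lam) * x for x in v]  # first principal coordinate

# the pairwise gaps between the recovered 1-D coordinates match the distances
print(sorted(round(abs(coords[i] - coords[j]), 6)
             for i, j in [(0, 1), (0, 2), (1, 2)]))  # [1.0, 2.0, 3.0]
```

For a Euclidean distance matrix, the principal coordinates reproduce the input distances exactly; this is why dist.genpop emphasizes which of its distances are Euclidean.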
For the PCoA, try different Euclidean distances (see dist.genpop), and compare the results. Does the genetic distance employed seem to matter? We will leave the cattle at rest for now, and switch to human seasonal influenza, H3N2. The dataset H3N2 contains SNPs derived from 1903 hemagglutinin RNA sequences. Obtain allele counts per year of sampling (see the content of H3N2$other) using genind2genpop, and compute i) Edwards' distance, ii) Reynolds' distance and iii) Rogers' distance between years. Perform the PCoA of these distances and plot the results; here is the result for one of the requested distances:

[Figure: PCoA of the years 2001-2006 using a secret distance (d = 0.1), with eigenvalues inset]

What is the meaning of the first principal components? Are the results coherent between the three distances? To get yet another point of view, perform the PCA of the individual data, and represent group information when plotting the principal components using s.class.

> f <- H3N2$other$epid
> pca1 <- dudi.pca(scaleGen(H3N2, miss = "mean", scale = FALSE),
+     scale = FALSE, scannf = FALSE, nf = 2)
> par(bg = "grey")
> s.class(pca1$li, factor(f), col = rainbow(6))
> add.scatter.eig(pca1$eig, 2, 1, 2)
[Figure: s.class plot of the first two principal components of the H3N2 PCA (d = 2), isolates grouped by year 2001-2006, with eigenvalues inset]

What can we say about the genetic evolution of influenza between the years 2001-2006? Can we safely discard the information about individual isolates, and just work with data pooled by year of epidemic?

2.2 Between-class analyses

In many situations, discarding information about the individual observations leads to a considerable loss of information. However, basic analyses working at the individual level (mainly PCA) are not optimal in terms of group separation. This is due to the fact that PCA focuses on the total variability, while only the variability between groups should be optimized. Let us recall the basic multivariate ANOVA model, P being the projector onto the dummy vectors of group membership H, i.e. P = H(H^T D H)^{-1} H^T D:

X = PX + (I - P)X

which leads to the variance partition seen before with K-means:

VAR(X) = B(X) + W(X)

with:

VAR(X) = trace(X^T D X)
B(X) = trace(X^T P^T D P X)
W(X) = trace(X^T (I - P)^T D (I - P) X)

where D is a metric (in PCA, it can be replaced by (1/n)I) and I is the identity matrix. We can verify that the ordinary PCA decomposes the total variance:

> X <- scaleGen(H3N2, miss = "mean", scale = FALSE)
> pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 2)
> sum(pca1$eig)
[1] 15.05856
> D <- diag(1/nrow(X), nrow(X))
> sum(diag(t(X) %*% D %*% X))
[1] 15.05856

Interestingly, it is possible to modify a multivariate analysis so as to decompose B(X) or W(X) instead of VAR(X). The corresponding methods are respectively called between-class (or inter-class) and within-class (or intra-class) analyses. If the basic method is a PCA, then we will be performing a between-class PCA and a within-class PCA. These methods are implemented by the functions between and within. Here is an example of how to perform the between-class PCA:

> f <- H3N2$other$epid
> bpca1 <- between(pca1, fac = factor(f), scannf = FALSE, nf = 3)

Verify that the sum of the eigenvalues of this analysis equals the variance between groups (B(X)); for this, you will need to compute H, and then P. This requires a few matrix operations:
- A %*% B: multiplies matrix A by matrix B
- diag: either creates a diagonal matrix, or extracts the diagonal of an existing square matrix
- t: transposes a matrix
- ginv: inverts a matrix (generalized inverse, from the package MASS)

To obtain H, the matrix of dummy vectors of group memberships, do:

> H <- as.matrix(acm.disjonctif(data.frame(f)))

You should find that B(X) is exactly equal to the sum of the eigenvalues of the between-class analysis. Repeat the same approach for the within-class analysis, and confirm that the variance within groups is fully decomposed by the within-class PCA. Now that you have checked that we can associate a multivariate method to each component of the variance partition into groups, investigate further the results of the between-class PCA.
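Before doing this in R, it may help to see the whole chain H → P → variance partition on a toy example. The following self-contained Python sketch mirrors the matrix operations listed above; here H^T D H happens to be diagonal, so its inverse is taken element-wise (where R's ginv would be used in general):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# centred data X (n = 4 individuals, p = 2 variables) and dummy matrix H (2 groups)
X = [[1.0, 2.0], [3.0, 0.0], [-1.0, -1.0], [-3.0, -1.0]]
H = [[1, 0], [1, 0], [0, 1], [0, 1]]
n = len(X)
D = [[1.0 / n if i == j else 0.0 for j in range(n)] for i in range(n)]

# H^T D H is diagonal (group weights), so its inverse is element-wise
HtDH = matmul(matmul(transpose(H), D), H)
inv = [[1.0 / HtDH[i][i] if i == j else 0.0 for j in range(2)] for i in range(2)]
P = matmul(matmul(matmul(H, inv), transpose(H)), D)  # projector onto group means

PX = matmul(P, X)  # each row of PX is its group's mean
I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
IP = [[I[i][j] - P[i][j] for j in range(n)] for i in range(n)]
RX = matmul(IP, X)  # residuals (I - P)X

var = trace(matmul(matmul(transpose(X), D), X))    # VAR(X)
b = trace(matmul(matmul(transpose(PX), D), PX))    # B(X)
w = trace(matmul(matmul(transpose(RX), D), RX))    # W(X)
print(round(var, 10) == round(b + w, 10))  # True: VAR(X) = B(X) + W(X)
```

The same computation, with ginv in place of the element-wise inverse, is exactly what the exercise asks you to do in R with H from acm.disjonctif.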
> par(bg = "grey")
> s.class(bpca1$ls, factor(f), col = rainbow(6))
> add.scatter.eig(bpca1$eig, 2, 1, 2)
> title("H3N2 - Between-class PCA")

[Figure: between-class PCA of H3N2 (d = 2), group centres for the years 2001-2006, with eigenvalues inset]

On this scatterplot, the centres of the groups (the year labels) are as scattered as possible. Why is it slightly disappointing? How does it compare to the original PCA? The following figure represents the PCA axes in the basis of the between-class analysis.

> s.corcircle(bpca1$as)
[Figure: correlation circle of the original PCA axes (Axis1, Axis2) in the between-class analysis]

What can we say about the PCA and the between-class PCA? One obvious problem of the previous scatterplot is the dispersion of the isolates within 2001 and 2002. To best assess the separation of the isolates by year, we need to maximize the dispersion between groups and minimize the dispersion within groups. This is precisely what Discriminant Analysis does.

2.3 Discriminant Analysis of Principal Components

Discriminant Analysis aims at displaying the best discrimination of individuals into pre-defined groups. However, some technical requirements often make it impractical for genetic data. Discriminant Analysis of Principal Components (DAPC, [6]) has been developed to circumvent this issue. Apply DAPC (function dapc) to the H3N2 data, retaining 50 PCs in the preliminary PCA step, and plot the results using scatter. The result should resemble this:

> dapc1 <- dapc(H3N2, pop = f, n.pca = 50, n.da = 2, all.contr = TRUE)
> scatter(dapc1, posi = "top")
[Figure: DAPC scatterplot of H3N2 (d = 5), groups 2001-2006, with eigenvalues inset]

How does this compare to the previous results? What new feature appeared that was not visible before? To get a better assessment of the structure on each axis, you can plot one dimension at a time, specifying the same axis for xax and yax in scatter.dapc; for instance:

> scatter(dapc1, xax = 1, yax = 1)

[Figure: densities of the isolates along the first discriminant function (x axis -5 to 5, density 0.0-1.5), by year]
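The criterion DAPC optimizes on its PCA-reduced data is that of classical discriminant analysis: find directions that maximize the between-group variance relative to the within-group variance. For two groups this reduces to Fisher's discriminant, w ∝ Sw^{-1}(m1 - m2). The following self-contained Python sketch (illustrative code, not dapc's implementation) shows how such a direction ignores a noisy, non-discriminating axis:

```python
def mean(vs):
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(2)]

def scatter_within(vs, m):
    # 2x2 within-group scatter matrix around the group mean m
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vs:
        d = [v[0] - m[0], v[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(g1, g2):
    """w proportional to Sw^{-1} (m1 - m2), using the explicit 2x2 inverse."""
    m1, m2 = mean(g1), mean(g2)
    s1, s2 = scatter_within(g1, m1), scatter_within(g2, m2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

# groups separated along x, with a large non-discriminating spread along y
g1 = [(0.0, -3.0), (0.0, 3.0), (0.2, -3.0), (0.2, 3.0)]
g2 = [(1.0, -3.0), (1.0, 3.0), (1.2, -3.0), (1.2, 3.0)]
w = fisher_direction(g1, g2)
# projections onto w separate the two groups completely
p1 = [w[0] * x + w[1] * y for x, y in g1]
p2 = [w[0] * x + w[1] * y for x, y in g2]
print(max(p2) < min(p1) or max(p1) < min(p2))  # True
```

PCA applied to the same data would pick the high-variance y axis first, even though it carries no group signal; this is exactly the contrast between the PCA and DAPC scatterplots above. The preliminary PCA step in DAPC exists because Sw cannot be inverted reliably when there are more alleles than individuals.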
To interpret the results further, you will need the loadings of the alleles, which are not provided by default in dapc. Re-run the analysis if needed, and specify that you want allele contributions to be returned. Use loadingplot to display the allele contributions to the first and to the second structure. This is the result you should obtain for the first axis:

> loadingplot(dapc1$var.contr)

[Figure: loading plot of the allele contributions to the first discriminant function (loadings 0.00-0.15), with the most contributing alleles labelled]

How can you interpret this result? Note that loadingplot invisibly returns some information about the most contributing alleles (those indicated on the plot). Save this information, and then examine it. Interpret the following commands and their result:

> which(H3N2$loc.names == "435")
L051 
  51 
> snp435 <- truenames(H3N2[loc = "L051"])
> head(snp435)
         435.a 435.c 435.g
AB434107     1     0     0
AB434108     1     0     0
AB438242     1     0     0
AB438243     1     0     0
AB438244     1     0     0
AB438245     1     0     0
> temp <- apply(snp435, 2, function(e) tapply(e, f, mean, na.rm = TRUE))
> temp
           435.a       435.c       435.g
2001 1.000000000 0.000000000 0.000000000
2002 0.995535714 0.004464286 0.000000000
2003 0.942396313 0.055299539 0.002304147
2004 0.046210721 0.953789279 0.000000000
2005 0.008316008 0.991683992 0.000000000
2006 0.000000000 1.000000000 0.000000000
> matplot(temp, type = "b", pch = c("a", "c", "g"), xlab = "year",
+     ylab = "allele frequency", xaxt = "n", cex = 1.5)
> axis(side = 1, at = 1:6, lab = 2001:2006)

[Figure: frequencies of alleles a, c and g at position 435 across the years 2001-2006; allele a is replaced by allele c between 2003 and 2004]

Do the same for the second axis. How would you interpret this result? Should we expect a vaccine based on the 2005 influenza to have worked against the 2006 virus?

References

[1] D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package - I: One-table methods. R News, 4:5-10, 2004.
[2] S. Dray and A.-B. Dufour. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1-20, 2007.
[3] S. Dray, A.-B. Dufour, and D. Chessel. The ade4 package - II: Two-table and K-table methods. R News, 7:47-54, 2007.
[4] J. Goudet. HIERFSTAT, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5:184-186, 2005.
[5] T. Jombart. adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24:1403-1405, 2008.
[6] T. Jombart, S. Devillard, and F. Balloux. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. Genetics, submitted.
[7] E. Paradis, J. Claude, and K. Strimmer. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20:289-290, 2004.
[8] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. ISBN 3-900051-07-0.
[9] G. R. Warnes. The genetics package. R News, 3(1):9-13, 2003.