Computations with Markers
Paulino Pérez (1), José Crossa (1)
(1) ColPos-México, (2) CIMMyT-México
June, 2015.
CIMMYT, México-SAGPDB
Contents
1 Genomic relationship matrix
2
3 Big Data!
Genomic relationship matrix
The genomic relationship matrix (G) appears naturally in several models used routinely in genomic selection. VanRaden (2008) studied efficient methods to compute genomic predictions using this matrix. There are several ways of computing the G matrix:
Genomic relationship matrix
1. $G = XX'$, where $X$ is the $n \times p$ matrix of marker genotypes. For SNPs, $x_{ij} \in \{0, 1, 2\}$.
2. $G = \dfrac{(X - E)(X - E)'}{2\sum_{j=1}^{p} p_j(1 - p_j)}$, where $p_j$ is the minor allele frequency of SNP $j = 1, \ldots, p$, and $E$ is the matrix of expected values of $x_{ij}$ under Hardy-Weinberg equilibrium, computed from estimates of the allelic frequencies.
3. $G = ZZ'/p$, where $Z$ is the matrix of centered and standardized SNP codes and $p$ is the number of SNPs, that is, $z_{ij} = (x_{ij} - 2p_j)/\sqrt{2p_j(1 - p_j)}$.
Continue...
$G = XX'$ appears naturally when we assume that the phenotypes can be predicted with the linear model $y = 1\mu + X\beta + e$, where $e \sim N(0, \sigma^2_e I)$ and $\beta \sim N(0, \sigma^2_\beta I)$. Let $u = X\beta$. From the properties of the multivariate normal distribution, $u \sim N(0, \sigma^2_\beta XX')$, and the model is equivalent to $y = 1\mu + u + e$, which is usually known as G-BLUP. We will talk about this model later on.
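The step from $\beta$ to $u$ can be spelled out as follows. This is the standard mean and variance calculation for a linear transformation of a normal vector, written with the symbols of the model above:

```latex
\begin{aligned}
E(u) &= X\,E(\beta) = 0,\\
\mathrm{Var}(u) &= X\,\mathrm{Var}(\beta)\,X' = X(\sigma^2_\beta I)X' = \sigma^2_\beta\, XX',
\end{aligned}
\qquad\text{so}\qquad u \sim N\!\left(0,\ \sigma^2_\beta\, XX'\right).
```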
Figure 1: Toy example for markers.
SNP coding
1. Additive effects:
   x = -1 if the SNP is homozygous for the major allele,
   x = 0 if the SNP is heterozygous,
   x = 1 if the SNP is homozygous for the other allele.
2. Dominant effects:
   x = 1 if the SNP is heterozygous,
   x = 0 if the SNP is homozygous.
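A minimal sketch of both codings in R, assuming genotypes are stored as minor-allele counts in {0, 1, 2}. The toy matrix `X012` and the names `X_add` and `X_dom` are made up for illustration, and the sign convention of the additive code (major-allele homozygote mapped to -1) is an assumption:

```r
# Toy genotypes: 3 individuals x 4 SNPs, coded as minor-allele counts (assumed)
X012 <- matrix(c(0, 1, 2, 1,
                 2, 0, 1, 1,
                 1, 1, 0, 2),
               nrow = 3, byrow = TRUE)

# Additive code: shift the counts to -1 / 0 / 1
X_add <- X012 - 1

# Dominance code: 1 for heterozygotes, 0 for either homozygote
X_dom <- (X012 == 1) * 1

X_add
X_dom
```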
Continue...
#Clear workspace
rm(list=ls())
#Set working directory
setwd("c:/users/p.p.rodriguez/desktop/slides Paulino/2. Gmatrix/examples/")
#Helper functions recode() and impute()
source("recode.r")
source("impute.r")
#Read genotypes; "?_?" marks a missing call
Genotype_info=read.csv(file="TC-10-Genotypes-ACGT.csv", header=TRUE, na.strings="?_?", stringsAsFactors=FALSE)
entry_genotype_info=Genotype_info$entry
#Drop the first two (non-marker) columns
Genotype_info=Genotype_info[,-c(1,2)]
X=recode(Genotype_info)$X
#Impute missing genotypes
set.seed(123)
out=impute(X)
Continue...
#Note that markers 167 and 179 are
#monomorphic and should be excluded from analysis
out$monomorphic
#Remove monomorphic markers,
#At this point no more missing values are present
X=out$X[,-out$monomorphic]
#Compute p
phat=colMeans(X)/2
MAF=ifelse(phat<0.5,phat,1-phat)
phat=MAF
hist(MAF,main="")
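The helper scripts `recode.r` and `impute.r` are not reproduced here. As a rough sketch of what the frequency step does, the toy example below (made-up data; `X_toy`, `is_mono` and the other names are hypothetical) flags monomorphic markers as columns with a single observed genotype and computes the minor allele frequency for the rest:

```r
# Toy genotypes coded 0/1/2 (made-up data); column 3 is monomorphic
X_toy <- matrix(c(0, 2, 2, 1,
                  1, 2, 2, 0,
                  0, 1, 2, 2,
                  1, 2, 2, 1),
                nrow = 4, byrow = TRUE)

# A marker is monomorphic when every individual has the same genotype
is_mono <- apply(X_toy, 2, function(x) length(unique(x)) == 1)
X_poly  <- X_toy[, !is_mono, drop = FALSE]

# Frequency of the counted allele, then fold to the minor allele frequency
phat_toy <- colMeans(X_poly) / 2
MAF_toy  <- pmin(phat_toy, 1 - phat_toy)
MAF_toy
```

Indexing with a logical vector also behaves sensibly when no marker is monomorphic, whereas a negative index built from an empty vector, as in `X[, -integer(0)]`, would drop every column.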
Continue...
Figure 2: Distribution of allele frequencies (histogram; x-axis: MAF, y-axis: Frequency).
Computations: three ways
#Computing the genomic relationship matrix
#1) Cross-product of the raw genotype codes
G1=tcrossprod(X)
#2) Centered genotypes, scaled by 2*sum(p*(1-p))
X2=scale(X,center=TRUE,scale=FALSE)
k=2*sum(phat*(1-phat))
G2=tcrossprod(X2)/k
#3) Centered and standardized genotypes, divided by the number of markers
X3=scale(X,center=TRUE,scale=TRUE)
G3=tcrossprod(X3)/ncol(X3)
heatmap(G3)
hist(diag(G3),main="")
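One detail worth checking: `scale()` with `scale=TRUE` divides each column by its sample standard deviation, whereas the third definition of G divides by sqrt(2 p_j (1 - p_j)). The sketch below, on simulated genotypes (all names and data are illustrative, not the course data), builds G both ways so the difference can be inspected. The two versions are expected to agree only approximately, and more closely when genotypes are near Hardy-Weinberg proportions.

```r
set.seed(1)
n <- 50; p <- 200
# Simulated genotypes under Hardy-Weinberg proportions (illustrative data only)
freq  <- runif(p, 0.1, 0.9)
X_sim <- sapply(freq, function(f) rbinom(n, size = 2, prob = f))

# Drop any marker that happens to be monomorphic in this sample
X_sim <- X_sim[, apply(X_sim, 2, var) > 0, drop = FALSE]
p_hat <- colMeans(X_sim) / 2

# (a) Standardize with the sample standard deviation, as scale() does
Z_a <- scale(X_sim, center = TRUE, scale = TRUE)
G_a <- tcrossprod(Z_a) / ncol(Z_a)

# (b) Standardize with the Hardy-Weinberg standard deviation sqrt(2p(1-p))
Z_b <- sweep(X_sim, 2, 2 * p_hat, "-")
Z_b <- sweep(Z_b, 2, sqrt(2 * p_hat * (1 - p_hat)), "/")
G_b <- tcrossprod(Z_b) / ncol(Z_b)

# Agreement between the two versions
cor(as.vector(G_a), as.vector(G_b))
mean(diag(G_a)); mean(diag(G_b))
```

With `scale()`, every column has sum of squares n - 1, so the diagonal of `G_a` averages exactly (n - 1)/n; the diagonal of `G_b` averages close to, but generally not exactly, 1.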
Exercise
1. Load the wheat dataset that we were using yesterday.
2. Compute the genomic relationship matrix using equation 1.
Continue...
Figure 3: Heatmap of G matrix.
Continue...
Figure 4: Histogram of the diagonal elements of the G matrix (x-axis: diag(G3), y-axis: Frequency).
Distance matrix
The distance matrix also appears naturally in RKHS models, which we will review in the coming days: $d_{ij} = \lVert x_i - x_j \rVert^2 = \sum_{k} (x_{ik} - x_{jk})^2$.
Example:
#dist() returns Euclidean distances; square them to match d_ij
D=as.matrix(dist(X))^2
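The squared Euclidean distance is tied to the first definition of G: with G = XX', d_ij = G_ii + G_jj - 2 G_ij. A small check on made-up data (names are illustrative). Note that `dist()` returns unsquared distances, so they are squared here to match the formula:

```r
set.seed(2)
# Made-up genotypes coded 0/1/2: 6 individuals x 10 markers
X_toy <- matrix(sample(0:2, 60, replace = TRUE), nrow = 6)

# Squared Euclidean distances from dist()
D2 <- as.matrix(dist(X_toy))^2

# Same quantity from the cross-product matrix G = XX'
G_toy     <- tcrossprod(X_toy)
D2_from_G <- outer(diag(G_toy), diag(G_toy), "+") - 2 * G_toy

all.equal(unname(D2), D2_from_G)
```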
Big Data!
The computation of the genomic relationship matrix is straightforward if the matrix X is small. There are applications where the number of markers can be very large.
Ober's prediction problem
Ober et al. (2012) predict starvation stress resistance and startle response in Drosophila using p = 2.5 million SNPs and n = 192 D. melanogaster inbred lines derived by 20 generations of full-sib mating from wild-caught females from the Raleigh, North Carolina population.
Continue...
Genomic relationship matrix for Ober's data.
Figure 2. Heatmap of the genomic relationship matrix G. The genomic relationship matrix G was calculated according to [8] using 157 lines and 2.5 million SNPs. The S after the line-id indicates that the line belongs to the set of lines for which phenotypic records for startle response were also available (in addition to the phenotypic records of starvation resistance). doi:10.1371/journal.pgen.1002685.g002
Solution
Fortunately, the computation of the G matrix can be fully parallelized on modern multi-core CPUs: $G_{ij} = \sum_{k} (x_{ik} - 2p_k)(x_{jk} - 2p_k)/c$, where $c$ is a scaling constant such as $2\sum_{k} p_k(1 - p_k)$. When computing $G_{ij}$, only the genotypes of individuals $i$ and $j$ are needed.
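A minimal sketch of this idea in R, assuming a made-up genotype matrix and the scaling constant c = 2 * sum(p_k (1 - p_k)). The individuals are split into row blocks, each pair of blocks is computed independently, and the pieces are assembled into G. `parallel::mclapply` is called with `mc.cores = 1` so the sketch runs on any platform; raising it on Unix-like systems distributes the block pairs across cores. The block size of 4 and all names are illustrative choices, not the authors' implementation.

```r
library(parallel)

set.seed(3)
n <- 12; p <- 500
X_sim <- matrix(sample(0:2, n * p, replace = TRUE), nrow = n)  # made-up genotypes

p_k <- colMeans(X_sim) / 2
W   <- sweep(X_sim, 2, 2 * p_k, "-")        # x_ik - 2 p_k
c_k <- 2 * sum(p_k * (1 - p_k))             # scaling constant c (assumed form)

# Split individuals into row blocks and list every pair of blocks (I <= J)
blocks <- split(seq_len(n), ceiling(seq_len(n) / 4))
pairs  <- subset(expand.grid(I = seq_along(blocks), J = seq_along(blocks)), I <= J)

# Each block of G needs only the genotypes of the individuals in the two blocks
pieces <- mclapply(seq_len(nrow(pairs)), function(r) {
  tcrossprod(W[blocks[[pairs$I[r]]], , drop = FALSE],
             W[blocks[[pairs$J[r]]], , drop = FALSE]) / c_k
}, mc.cores = 1)

# Assemble the symmetric matrix from its blocks
G_blk <- matrix(0, n, n)
for (r in seq_len(nrow(pairs))) {
  i <- blocks[[pairs$I[r]]]; j <- blocks[[pairs$J[r]]]
  G_blk[i, j] <- pieces[[r]]
  G_blk[j, i] <- t(pieces[[r]])
}

all.equal(G_blk, tcrossprod(W) / c_k)
```

Because each block pair touches only two slices of rows, the same pattern extends to genotype files too large to hold in memory at once: each worker would read just the rows it needs.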