Renormalizing Illumina SNP Cell Line Data

Kevin R. Coombes

17 March 2011

01-sanitizer

Contents

1 Executive Summary
1.1 Introduction
1.1.1 Aims/Objectives
1.2 Methods
1.2.1 Description of Data
1.2.2 Statistical Methods
1.3 Results
1.4 Conclusions
2 Details
2.1 Load Sample Names
3 Characteristic Modes or Intensity Levels of Copy Number
3.1 Modeling the Copy Number Levels
Appendix

1 Executive Summary

1.1 Introduction

This report describes the analysis of a data set from Lynn Barron, a member of the laboratory of Lynne V. Abruzzo. The data set was acquired using Illumina 610K SNP chips. The main goal of the study is to identify genetic abnormalities that are associated with clinical outcome (including overall survival and time-to-treatment). This is the first in a series of related reports.

1.1.1 Aims/Objectives

While analyzing lung cancer cell lines, we noticed in plots of the log R ratios (LRR) and B allele frequencies (BAF) that some of the data appeared inconsistent with our understanding of how to interpret them. We hypothesized that, for many if not most cell lines, the number of chromosomes present in a cell is far in excess of the typical value of 46. If so, this excess would likely cause the Illumina normalization procedure to scale all of the LRR data to values that are too small (since the normalization implicitly assumes that the intensities come from cells with about 46 chromosomes). The objective of this report is to test this hypothesis and to develop methods that correct for any distortions introduced by normalization.

1.2 Methods

1.2.1 Description of Data

The data set contains measurements on 176 previously untreated patients with CLL. Extensive clinical follow-up is available.

1.2.2 Statistical Methods

Raw data were processed in BeadStudio to yield genotype calls, log R ratios (LRR), and B allele frequencies (BAF) for each SNP in each sample. Since the study does not include matched normal DNA, the BeadStudio computations were performed relative to the pool of 120 HapMap samples run by Illumina.

In this report, we introduce a novel method for renormalizing the SNP log ratios for each cell line. Write $L_{ij}$ for the log R ratio of the $j$th SNP in segment $i$, and write $\nu_i$ for the true (unknown) copy number of that segment. We assume that there is a sample-specific renormalization constant $\alpha$ such that

$$L_{ij} = \alpha + \log_{10}(\nu_i/2) + E_{ij},$$

where the $E_{ij} \sim N(0, \sigma)$ are independent and identically distributed Gaussian random variables. If $\alpha$ is known, then each $L_{ij}$ can be assigned to the nearest log-half-integer, that is, the nearest value of the form $\log_{10}(n/2)$ for a positive integer $n$. We denote this value by $I_{ij}(\alpha)$ and define the sum of square errors to be

$$\mathrm{SSEI}(\alpha) = \sum_{i,j} \bigl(L_{ij} - I_{ij}(\alpha)\bigr)^2.$$

We impose a prior or penalty on $\alpha$. Specifically, we consider the effect of converting from the log R ratio scale back to the linear copy number scale, as a function of $\alpha$. In other words, we look at the values

$$N_{ij}(\alpha) = 2 \times 10^{L_{ij} - \alpha}.$$

The range of these values (if rounded to the nearest integer) represents the number of distinct integer copy number levels that are represented in the data, as a function of $\alpha$. Thus, we can think of this range as the number of (copy number) parameters in the model. We then minimize the penalized sum of square errors to find the optimal value of $\alpha$.
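The penalized fit summarized above can be illustrated with a minimal sketch on toy data. It follows the convention used in the R code later in this report, where a candidate shift is added to the observed values before they are compared with the levels $\log_{10}(n/2)$. The toy segment values, the name `prior_wt`, and its value are assumptions made for illustration; the report's own weight (`.PRIOR.WT`) is not shown in the text.

```r
# Candidate levels log10(n/2) for copy numbers n = 1..20, as in the report's code
lev <- log10((1:20) / 2)

# Sum of squared distances from the shifted values to their nearest level
ssei <- function(a, x) {
  sum(sapply(a + x, function(z) min(abs(z - lev))^2))
}

# Toy segment summaries (hypothetical): copy numbers 3, 4, 6, 4, 3
# observed 0.2 too low on the log10 scale, with small fixed perturbations
noise <- c(0.004, -0.003, 0.002, -0.004, 0.001)
x <- log10(c(3, 4, 6, 4, 3) / 2) - 0.2 + noise

prior_wt <- 0.001                              # hypothetical penalty weight
alpha <- seq(-0.05, 0.8, length.out = 1701)    # same grid as the report
penal <- 10^alpha * diff(range(2 * 10^x))      # range on the copy number scale
obj <- sapply(alpha, ssei, x = x) + prior_wt * penal

best <- alpha[which.min(obj)]
best   # close to the shift of 0.2 built into the toy data
```

Without the penalty, shifts near 0.5 (which map the same segments to copy numbers 6, 8, 12) fit the toy data about equally well; the penalty term is what selects the smaller shift.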

1.3 Results

We fit the statistical model described above in order to find the optimal renormalization constants. The values are stored in the file renorm.csv. Figures illustrating the process (Figure 2 and Figure 3) are stored in the subdirectories FitFuns and ApresFitFuns.

1.4 Conclusions

The model-based approach to estimating the renormalization constant appears to work well.

2 Details

2.1 Load Sample Names

> load("allsamplenames.rda")

3 Characteristic Modes or Intensity Levels of Copy Number

We will illustrate the plots and the method for the following sample:

> cid <- .EGSAMPLE
> cid
[1] "CL001"
> source("00rnw/snp-utils.r")
> dat <- loadsnponesample(cid)

Our next step is based on the idea that we should be able to identify characteristic modes of the LRR values for each chromosome, and that each of these modes should correspond to a particular integer number of copies. We start by looking at chromosome 13.

> chrn <- .CHRN
> chrn
[1] 13

We use the next block of code to arbitrarily divide the LRR data into segments of fixed length and to compute the median (and MAD) on each such segment.

> targlength <- 500
> svals <- vals <- NULL
> for (chrn in c(1:22, "X")) {
+     vec <- dat[dat$chr == chrn, "Log.R.Ratio"]
+     vec <- vec[!is.na(vec)]
+     n <- length(vec)%/%targlength
+     i0 <- targlength * ((1:n) - 1)
+     i1 <- targlength * (1:n)
+     i1[n] <- length(vec)
+     temp <- sapply(1:n, function(i) {
+         median(vec[i0[i]:i1[i]])
+     })
+     vals <- c(vals, temp)
+     temp <- sapply(1:n, function(i) {
+         mad(vec[i0[i]:i1[i]])
+     })
+     svals <- c(svals, temp)
+ }
> dd <- density(vals)
> modal <- dd$x[which(dd$y == max(dd$y))]

3.1 Modeling the Copy Number Levels

We now fit a statistical model that relates the log R ratios to the true copy number of segments. To describe the model, we let $L_{ij}$ represent the log R ratio of the $j$th SNP in segment $i$ for a fixed cell line sample. We assume that the true (unknown) copy number for this segment is given by $\nu_i$. We also assume that there is a sample-specific global renormalization constant $\alpha$ such that

$$L_{ij} = \alpha + \log_{10}(\nu_i/2) + E_{ij}.$$

Here the error model is that the $E_{ij} \sim N(0, \sigma)$ are independent and identically distributed Gaussian random variables. In the normal case, where the typical copy number equals two across the genome, most of the $\nu_i = 2$ and we expect that $\alpha = 0$. By contrast, if the ploidy is such that this cell line has $K > 46$ chromosomes, then we expect that $\alpha \approx \log_{10}(K/46)$ in order to get the observed (processed) log R ratios about right.

Fitting this model takes a few steps. The basic idea is to start the computation conditional on knowing $\alpha$. In that case, each $L_{ij}$ can be assigned to the nearest log-half-integer (which we denote by $I_{ij}(\alpha)$) as a maximum likelihood estimate of its true value. Then we can define the sum of square errors to be

$$\mathrm{SSEI}(\alpha) = \sum_{i,j} \bigl(L_{ij} - I_{ij}(\alpha)\bigr)^2.$$

A simple method would then be to minimize $\mathrm{SSEI}(\alpha)$ as a function of $\alpha$. The difficulty with this simple approach is that consecutive log-half-integers get closer together as the copy number grows, so increasing $\alpha$ moves the data into a region where some level is always nearby. This creates a bias in favor of larger positive shifts. In order to overcome this difficulty, we can impose a prior on the values of $\alpha$, which has the basic effect of imposing a penalty based on the size of $\alpha$.
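The crowding of the levels that motivates the penalty can be checked directly. The short sketch below lists the gaps between consecutive values of $\log_{10}(n/2)$ and evaluates $\log_{10}(K/46)$ for a few chromosome counts; the particular values of `K` are hypothetical examples, not counts measured in this study.

```r
# Candidate levels log10(n/2) for copy numbers n = 1..20
lev <- log10((1:20) / 2)

# Gaps between consecutive levels shrink as the copy number grows
gaps <- diff(lev)
round(gaps[1:5], 3)        # 0.301 0.176 0.125 0.097 0.079

# Shift expected for a cell line carrying K chromosomes instead of 46
# (K values are hypothetical examples)
K <- c(46, 69, 92)
shift <- log10(K / 46)
round(shift, 3)            # 0.000 0.176 0.301
```

Because the gap between copy numbers 5 and 6 is about a quarter of the gap between copy numbers 1 and 2, any value shifted far enough upward lies close to some level, which is why an unpenalized fit drifts toward large shifts.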

Figure 1: Histogram of the median log R ratio on the 500-SNP long segments (sample CL001; x-axis: vals, y-axis: Density).

To make this penalty specific, we consider the effect of converting from the log R ratio scale back to the linear copy number scale, as a function of $\alpha$. In other words, we look at the values

$$N_{ij}(\alpha) = 2 \times 10^{L_{ij} - \alpha}.$$

The range of these values (if rounded to the nearest integer) represents the number of distinct integer copy number levels that are represented in the data, as a function of $\alpha$. Thus, we can think of this range as the number of (copy number) parameters in the model, and, as with the Akaike Information Criterion in other contexts, this range provides a reasonable penalty term. We use the actual range rather than an integer range in order to maintain continuity in the resulting function that needs to be minimized.

We add two further wrinkles to our attempts to fit these data. First, when we know the number $N_B$ of components of the B allele frequency plot, we know something more about the possible copy numbers. Namely,

• $N_B = 4$ if and only if $\nu \ge 3$ (unbalanced heterozygous). OOPS! We can also get four bands if a fraction of the cells has lost one copy of a chromosome or if a fraction of the cells has undergone LOH. For example, if 50% of cells lose one copy, then at a heterozygous SNP where the retained copy is an A allele, we have (at the genotype level) 50 A and 50 AB genotypes. At the allelic level, we then have 100 A and 50 B, so we expect the BAF plot to have bands at 1/3 and 2/3. By contrast, if 50% of cells have undergone LOH at this locus, then our mixture contains genotypes of 50 AA and 50 AB, or an allelotype of 150 A and 50 B, so we expect BAF bands at 1/4 and 3/4. Thus, we can actually have $\nu \ge 1$ and still see four BAF components.
• $N_B = 3$ if and only if $\nu \ge 2$ is even (balanced heterozygous)
• $N_B = 2$ if and only if $\nu \ge 1$ (homozygous)
• $N_B = 1$ if and only if $\nu = 0$ (complete loss)

The second wrinkle is that we do not use all of the raw segment data, but instead use the summaries provided by identifying the characteristic modes on each chromosome by fitting density functions. We do, however, weight these modes proportionally to the number of SNP markers that they represent.

The following functions implement and fit this statistical model.

> idist <- function(x, XS) {
+     A <- x[1]
+     B <- x[2]
+     y <- A + B * XS
+     base <- sum(unlist(lapply(y, function(z0, s) {
+         min(abs(z0 - s))^2
+     }, s = log10((1:20)/2))))
+ }
> baz <- function(a, XS) {
+     idist(c(a, 1), XS)
+ }
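The mixture arithmetic in the discussion of four-band BAF patterns can be reproduced with a small helper. This is only an illustrative sketch: the function `baf` and its arguments are hypothetical names introduced here, and it assumes that the B allele frequency of a mixture is the total number of B alleles divided by the total number of alleles.

```r
# B allele frequency of a mixture of cell populations.
# frac: fraction of cells in each population
# nA, nB: number of A and B alleles per cell in each population
baf <- function(frac, nA, nB) {
  sum(frac * nB) / sum(frac * (nA + nB))
}

# 50% of cells have lost one copy (genotype A), 50% remain AB
b_loss <- baf(c(0.5, 0.5), nA = c(1, 1), nB = c(0, 1))
b_loss        # 1/3; the mirror-image band (retained B allele) is 1 - 1/3 = 2/3

# 50% of cells have undergone LOH (genotype AA), 50% remain AB
b_loh <- baf(c(0.5, 0.5), nA = c(2, 1), nB = c(0, 1))
b_loh         # 1/4; the mirror-image band is 3/4
```

Changing `frac` shows how the band positions move continuously with the fraction of affected cells, which is why the number of BAF bands alone does not pin down the copy number.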

> colset <- c(general = "black", ThreeBand = "blue", FourBand = "green",
+     TwoBand = "red")
> foo2 <- function(a, xset) {
+     w <- 1 + round(xset$w)
+     w < * (w - 1)/max(w - 1)
+     shiftx <- a + xset$x
+     top <- 1 + trunc(max(2 * 10^shiftx))
+     plot(shiftx, col = colset[as.character(xset$flag)], pch = 16, cex = w,
+         main = cid, ylab = "Shifted Log Ratio", xlab = "Summarized Segment Index")
+     abline(h = log10((1:top)/2), col = "gray")
+     mtext(1:top, side = 4, at = log10((1:top)/2), line = 0.5, las = 1)
+     legend("bottomright", names(colset), col = colset, pch = 16)
+ }

Here is the analysis for CL001.

> alpha <- seq(-0.05, 0.8, length = 1701)
> penal <- 10^alpha * diff(range(2 * 10^vals))
> wiggle <- sapply(alpha, baz, XS = vals)
> v1 <- wiggle + .PRIOR.WT * penal
> ick <- which(v1 == min(v1))
> a <- alpha[ick]
> a
[1]
> modal
[1]
> mean(vals)
[1]
> median(vals)
[1]

Figure 2: Optimizing the SSEI + Penalty function for CL001 (x-axis: A, y-axis: SSEI + Penalty).

Figure 3: Location of segment summaries (size proportional to number of SNP markers) for CL001.

Now we can fit these statistical models for all of the samples.

> if (!file.exists("fitfuns")) dir.create("fitfuns")
> madqc <- alist <- rep(NA, length(shortnames))
> names(madqc) <- names(alist) <- shortnames
> for (cid in sort(shortnames)) {
+     cat(paste("working on", cid, "\n"), file = stderr())
+     dat <- loadsnponesample(cid)
+     targlength <- 500
+     svals <- vals <- NULL
+     for (chrn in c(1:22, "X")) {
+         vec <- dat[dat$chr == chrn, "Log.R.Ratio"]
+         vec <- vec[!is.na(vec)]
+         n <- length(vec)%/%targlength
+         i0 <- targlength * ((1:n) - 1)
+         i1 <- targlength * (1:n)
+         i1[n] <- length(vec)
+         temp <- sapply(1:n, function(i) {
+             median(vec[i0[i]:i1[i]])
+         })
+         vals <- c(vals, temp)
+         temp <- sapply(1:n, function(i) {
+             mad(vec[i0[i]:i1[i]])
+         })
+         svals <- c(svals, temp)
+     }
+     madqc[cid] <- median(svals)
+     alpha <- seq(-0.05, 0.8, length = 1701)
+     penal <- 10^alpha * diff(range(2 * 10^vals))
+     wiggle <- sapply(alpha, baz, XS = vals)
+     v1 <- wiggle + .PRIOR.WT * penal
+     ick <- which(v1 == min(v1))
+     a <- alpha[ick]
+     a
+     modal
+     mean(vals)
+     median(vals)
+     alist[cid] <- alpha[ick]
+     par(bg = "white", cex = 1.5, mai = c(1.4, 1.2, 1, 0.2))
+     plot(alpha, v1, type = "l", main = cid, xlab = "A", ylab = "SSEI + Penalty",
+         lwd = 3)
+     points(alpha[ick], v1[ick], pch = 16, col = "#00aa00")
+     dev.copy(png, file = file.path("fitfuns", paste(cid, "png", sep = ".")),
+         width = 600, height = 500)
+     dev.off()
+ }

Finally, we save the intermediate results.

> write.csv(data.frame(shift = alist, madqc = madqc), file = "renorm.csv")
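Once a shift has been fitted for a sample, it can be used to read segment summaries on the copy number scale. The sketch below mirrors the expression `2 * 10^shiftx` used in the plotting function of this report, where the shift is added to the log ratios. The shift `a` and the segment medians `vals` here are hypothetical values chosen for illustration, not results from the study.

```r
# Hypothetical fitted shift and segment medians (not taken from the report's data)
a <- 0.2
vals <- c(-0.20, -0.024, 0.10)

# Shift the log ratios, then convert back to the linear copy number scale
copies <- 2 * 10^(a + vals)
round(copies, 2)

# Nearest integer copy number for each segment
nearest <- round(copies)
nearest    # 2 3 4
```

With no shift applied (`a <- 0`), the same segments would be read as roughly 1.3, 1.9, and 2.5 copies, which illustrates how an unrenormalized sample can appear to sit between integer levels.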

Figure 4: Histogram of the renormalization constants (x-axis: alist, y-axis: Frequency).

> if (sum(alist > 0.2)) {
+     print(which(alist > 0.2))
+ }
[1] 7 31
> median(alist[alist < 0.2])
[1]

Appendix

This analysis was run in the following directory:

> getwd()
[1] "o:/private/abruzzo/snp-cll/aa"

Note that \\mdadqsfs02 is the standard institutional location for storing data and analyses; N: is the name given to that location on this machine.

This analysis was run in the following software environment:

> sessionInfo()
R version ( )
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

> while (!is.null(dev.list())) dev.off()
