Renormalizing Illumina SNP Cell Line Data
Kevin R. Coombes
17 March 2011

Contents
1 Executive Summary
1.1 Introduction
1.1.1 Aims/Objectives
1.2 Methods
1.2.1 Description of Data
1.2.2 Statistical Methods
1.3 Results
1.4 Conclusions
2 Details
2.1 Load Sample Names
3 Characteristic Modes or Intensity Levels of Copy Number
3.1 Modeling the Copy Number Levels
4 Appendix

1 Executive Summary

1.1 Introduction

This report describes the analysis of a data set from Lynn Barron, a member of the laboratory of Lynne V. Abruzzo. This dataset was acquired using Illumina 610K SNP chips. The main goal of the study is to identify genetic abnormalities that are associated with clinical outcome (including overall survival and time-to-treatment). This is the first part of a series of related reports.

1.1.1 Aims/Objectives

We noticed (from an analysis using lung cancer cell lines) in plots of the log R ratios (LRR) and B allele frequencies (BAF) that some of the data appeared inconsistent with our understanding of how to interpret them. We hypothesized that, for many if not most cell lines, the number of chromosomes present in a cell is far in excess of the typical value of 46. If so, this excess would likely cause the Illumina normalization procedure to scale all of the LRR data to a value that is too small (since the normalization implicitly assumes that the intensities come from cells with about 46 chromosomes). The objective of this report is to test this hypothesis and to develop methods to correct for any distortions introduced by normalization.

1.2 Methods

1.2.1 Description of Data

The dataset contains measurements on 176 previously untreated patients with CLL. Extensive clinical followup is available.

1.2.2 Statistical Methods

Raw data were processed in BeadStudio to yield genotype calls, log R ratios (LRR), and B allele frequencies (BAF) for each SNP in each sample. Since the study does not include matched normal DNA, the BeadStudio computations were performed relative to the pool of 120 HapMap samples run by Illumina.

In this report, we introduce a novel method for re-normalizing the SNP log ratios for each cell line. The basic idea is to write $L_{ij}$ for the log R ratio of the $j$th SNP in segment $i$, and write $\nu_i$ for the true (unknown) copy number. We assume that there is a (sample-specific) renormalization constant $\alpha$ with the property that

$$L_{ij} = \alpha + \log_{10}(\nu_i/2) + E_{ij},$$

where the $E_{ij} \sim N(0, \sigma)$ are independent and identically distributed Gaussian random variables. If α is known, then each $L_{ij}$ can be assigned to the nearest log-half-integer, that is, the nearest value of the form $\alpha + \log_{10}(k/2)$ for an integer $k$ (which we denote by $I_{ij}(\alpha)$). Then define the sum of square errors to be

$$SSEI(\alpha) = \sum_{i,j} \left(L_{ij} - I_{ij}(\alpha)\right)^2.$$

We impose a prior or penalty on α. Specifically, we consider the effect of converting from the log R ratio scale back to the linear copy number scale, as a function of α. In other words, we look at the values

$$N_{ij}(\alpha) = 2 \cdot 10^{L_{ij} - \alpha}.$$

The range of these values (if rounded to the nearest integer) represents the number of distinct integer copy number levels that are represented in the data, as a function of α. Thus, we can think of this range as the number of (copy number) parameters in the model. We then minimize the penalized sum of square errors to find the optimal value of α.
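The penalized fit described above can be sketched on simulated data. This is only an illustration under stated assumptions: the segment medians are simulated, the names `ssei`, `penalty`, `prior_wt`, and `true_shift` are hypothetical, and the penalty weight of 0.01 is an arbitrary choice. The shift `a` is added to the observed values before matching, so in the notation above it plays the role of $-\alpha$.

```r
# Simulated segment medians: true copy numbers 2, 3 and 4, observed after a
# (hypothetical) normalization has pulled every log ratio down by 0.15.
set.seed(1)
true_shift <- 0.15
nu   <- rep(c(2, 3, 4), times = c(40, 30, 30))
vals <- log10(nu / 2) - true_shift + rnorm(length(nu), sd = 0.01)

# Log-half-integer levels for copy numbers 1..20.
levels <- log10((1:20) / 2)

# Sum of squared distances from each shifted value to its nearest level.
ssei <- function(a, x) {
  sum(sapply(x + a, function(z) min(abs(z - levels))^2))
}

# Penalty: range of the implied copy numbers on the linear scale.
penalty <- function(a, x) 10^a * diff(range(2 * 10^x))

prior_wt <- 0.01   # assumed penalty weight, chosen only for this example
grid  <- seq(-0.05, 0.8, length.out = 1701)
score <- sapply(grid, function(a) ssei(a, vals) + prior_wt * penalty(a, vals))
a_hat <- grid[which.min(score)]
a_hat
```

Without the penalty term, shifts that map the same clusters to copy numbers 4, 6, 8 (or 6, 9, 12) fit the levels equally well; the penalty is what selects the smallest consistent shift in this example.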
1.3 Results

We fit the statistical model described above in order to find the optimal renormalization constants. The values are stored in the file renorm.csv. Figures illustrating the process (Figure 2 and Figure 3) are stored in the subdirectories FitFuns and ApresFitFuns.

1.4 Conclusions

The model-based approach to estimating the renormalization constant appears to work well.

2 Details

2.1 Load Sample Names

> load("allsamplenames.rda")

3 Characteristic Modes or Intensity Levels of Copy Number

We will illustrate the plots and the method for the following sample:

> cid <- .EGSAMPLE
> cid
[1] "CL001"
> source("00rnw/snp-utils.r")
> dat <- loadsnponesample(cid)

Our next step is based on the idea that we should be able to identify characteristic modes of the LRR values for each chromosome, and that these modes should each correspond to a particular integer number of copies. We start by looking at chromosome 13.

> chrn <- .CHRN
> chrn
[1] 13

We use the next block of code to arbitrarily divide the LRR data into segments of fixed length and to compute the median (and MAD) on each such segment.

> targlength <- 500
> svals <- vals <- NULL
> for (chrn in c(1:22, "X")) {
+     vec <- dat[dat$chr == chrn, "Log.R.Ratio"]
+     vec <- vec[!is.na(vec)]
+     n <- length(vec)%/%targlength
+     # first and last index of each segment; the final segment absorbs
+     # any leftover SNPs at the end of the chromosome
+     i0 <- targlength * ((1:n) - 1) + 1
+     i1 <- targlength * (1:n)
+     i1[n] <- length(vec)
+     # median of each segment
+     temp <- sapply(1:n, function(i) {
+         median(vec[i0[i]:i1[i]])
+     })
+     vals <- c(vals, temp)
+     # MAD of each segment
+     temp <- sapply(1:n, function(i) {
+         mad(vec[i0[i]:i1[i]])
+     })
+     svals <- c(svals, temp)
+ }
> dd <- density(vals)
> modal <- dd$x[which(dd$y == max(dd$y))]

3.1 Modeling the Copy Number Levels

We now plan to fit a statistical model that relates the log R ratios to the true copy number of segments. To describe the model, we let $L_{ij}$ represent the log R ratio of the $j$th SNP in segment $i$ for a fixed cell line sample. We assume that the true (unknown) copy number for this segment is given by $\nu_i$. We also assume that there is a (sample-specific) global renormalization constant $\alpha$ with the property that

$$L_{ij} = \alpha + \log_{10}(\nu_i/2) + E_{ij}.$$

Here the error model is that the $E_{ij} \sim N(0, \sigma)$ are independent and identically distributed Gaussian random variables. In the normal case, where the typical copy number equals two across the genome, most of the $\nu_i = 2$ and we expect that $\alpha = 0$. By contrast, if the ploidy is such that this cell line has $K > 46$ chromosomes, then we expect that $\alpha \approx -\log_{10}(K/46)$ in order to get the observed (processed) log R ratios about right.

In order to fit this model, we have to do a couple of things. The basic idea is to start the computation conditional on knowing α. In that case, each $L_{ij}$ can be assigned to the nearest log-half-integer (which we denote by $I_{ij}(\alpha)$) as a maximum likelihood estimate of its true value. Then we can define the sum of square errors to be

$$SSEI(\alpha) = \sum_{i,j} \left(L_{ij} - I_{ij}(\alpha)\right)^2.$$

A simple method would then be to minimize $SSEI(\alpha)$ as a function of α. The difficulty with this simple approach, however, is that the log-half-integers get closer together as the size of the shift increases, which creates a bias in favor of larger positive shifts. In order to overcome this difficulty, we can impose a prior on the values of α, which has the basic effect of imposing a penalty based on the size of α.
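The bias toward larger shifts can be seen directly from the spacing of the levels. The following is a minimal sketch; the variable names are illustrative, and the levels are taken to be $\log_{10}(k/2)$ for $k = 1, \dots, 20$, as in the fitting code later in this report.

```r
# Log-half-integer levels for copy numbers 1..20.
levels  <- log10((1:20) / 2)

# Gap between consecutive levels.
spacing <- diff(levels)
round(spacing[c(1, 2, 5, 10, 19)], 3)
# gaps shrink from about 0.301 (1 to 2 copies) to about 0.022 (19 to 20 copies)

# The worst-case distance to the nearest level is half the local gap, so the
# largest possible squared error per point shrinks at higher copy numbers.
max_sq_err <- (spacing / 2)^2
```

Because any value lies close to some level once it is shifted far enough upward, the unpenalized sum of squares alone cannot distinguish a good fit from an overly large shift.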
Figure 1: Histogram of the median log R ratio on the 500-SNP long segments. (Panel title: CL001; x-axis: vals; y-axis: Density.)
To make this penalty specific, we consider the effect of converting from the log R ratio scale back to the linear copy number scale, as a function of α. In other words, we look at the values

$$N_{ij}(\alpha) = 2 \cdot 10^{L_{ij} - \alpha}.$$

The range of these values (if rounded to the nearest integer) represents the number of distinct integer copy number levels that are represented in the data, as a function of α. Thus, we can think of this range as the number of (copy number) parameters in the model. As with the Akaike Information Criterion in other contexts, this range therefore provides a reasonable penalty term. We use the actual range rather than an integer range in order to maintain continuity in the resulting function that needs to be minimized.

We add two further wrinkles to our attempts to fit this data. First, when we know the number $N_B$ of components of the B allele frequency plot, we know something more about the possible copy numbers. Namely,

• $N_B = 4$ if and only if $\nu \ge 3$ (unbalanced heterozygous).

  OOPS! We can also get four bands if a fraction of the cells has lost one copy of a chromosome or if a fraction of the cells has undergone LOH. For example, if 50% of cells lose one copy, then at a heterozygous SNP where the retained copy is an A allele, we have (at the genotype level) 50 A and 50 AB genotypes. At an allelic level, we then have 100 A and 50 B, so we expect the BAF plot to have bands at 1/3 and 2/3. By contrast, if 50% of cells have undergone LOH at this locus, then our mixture contains genotypes of 50 AA and 50 AB, or an allelotype of 150 A and 50 B, so we expect BAF bands at 1/4 and 3/4. Thus, we can actually have $\nu \ge 1$ and still see four BAF components.

• $N_B = 3$ if and only if $\nu \ge 2$ is even (balanced heterozygous)

• $N_B = 2$ if and only if $\nu \ge 1$ (homozygous)

• $N_B = 1$ if and only if $\nu = 0$ (complete loss)

The second wrinkle is that we do not use all of the raw segment data, but instead use the summaries provided by identifying the characteristic modes on each chromosome by fitting density functions. We do, however, weight these modes proportionally to the number of SNP markers that they represent.

The following functions implement and fit this statistical model.

> idist <- function(x, XS) {
+     A <- x[1]
+     B <- x[2]
+     y <- A + B * XS
+     # squared distance from each transformed value to the nearest
+     # log-half-integer level, summed over all values
+     base <- sum(unlist(lapply(y, function(z0, s) {
+         min(abs(z0 - s))^2
+     }, s = log10((1:20)/2))))
+     base
+ }
> baz <- function(a, XS) {
+     idist(c(a, 1), XS)
+ }
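The mixture arithmetic in the discussion of four-band BAF patterns above can be checked with a short calculation. This is a sketch only; the helper `mixture_baf` is a hypothetical name, and it assumes that the observed BAF is simply the fraction of B alleles among all alleles pooled across the cell populations.

```r
# Expected B allele frequency at one SNP for a mixture of cell populations.
# frac: fraction of cells in each population
# nA, nB: number of A and B alleles carried by each population at this SNP
mixture_baf <- function(frac, nA, nB) {
  sum(frac * nB) / sum(frac * (nA + nB))
}

# 50% of cells have lost one copy (retaining A); 50% remain AB.
baf_loss <- mixture_baf(frac = c(0.5, 0.5), nA = c(1, 1), nB = c(0, 1))

# 50% of cells have undergone LOH to AA; 50% remain AB.
baf_loh <- mixture_baf(frac = c(0.5, 0.5), nA = c(2, 1), nB = c(0, 1))

c(loss = baf_loss, loh = baf_loh)   # 1/3 and 1/4
```

SNPs where the retained allele is B give the mirror-image bands at 2/3 and 3/4, which together with the homozygous bands near 0 and 1 produce the four-band pattern.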
> colset <- c(general = "black", ThreeBand = "blue", FourBand = "green",
+     TwoBand = "red")
> foo2 <- function(a, xset) {
+     w <- 1 + round(xset$w)
+     # NOTE: the rescaling constants on the next line were lost in
+     # transcription, so the line is left commented out
+     # w < * (w - 1)/max(w - 1)
+     shiftx <- a + xset$x   # the source has "A" here; the argument is "a"
+     top <- 1 + trunc(max(2 * 10^shiftx))
+     plot(shiftx, col = colset[as.character(xset$flag)], pch = 16, cex = w,
+         main = cid, ylab = "Shifted Log Ratio", xlab = "Summarized Segment Index")
+     abline(h = log10((1:top)/2), col = "gray")
+     mtext(1:top, side = 4, at = log10((1:top)/2), line = 0.5, las = 1)
+     legend("bottomright", names(colset), col = colset, pch = 16)
+ }

Note that the code adds the shift to the log R ratios before matching them to the levels $\log_{10}(k/2)$, so the variable alpha below plays the role of $-\alpha$ in the model equation.

Here is the analysis for CL001.

> alpha <- seq(-0.05, 0.8, length = 1701)
> penal <- 10^alpha * diff(range(2 * 10^vals))
> wiggle <- sapply(alpha, baz, XS = vals)
> v1 <- wiggle + .PRIOR.WT * penal
> ick <- which(v1 == min(v1))
> a <- alpha[ick]
> a
[1]
> modal
[1]
> mean(vals)
[1]
> median(vals)
[1]

Now we can fit these statistical models for all of the samples.

> if (!file.exists("fitfuns")) dir.create("fitfuns")
> madqc <- alist <- rep(NA, length(shortnames))
> names(madqc) <- names(alist) <- shortnames
> for (cid in sort(shortnames)) {
+     cat(paste("working on", cid, "\n"), file = stderr())
+     dat <- loadsnponesample(cid)
+     targlength <- 500
+     svals <- vals <- NULL
+     for (chrn in c(1:22, "X")) {
+         vec <- dat[dat$chr == chrn, "Log.R.Ratio"]
+         vec <- vec[!is.na(vec)]
+         n <- length(vec)%/%targlength
+         i0 <- targlength * ((1:n) - 1) + 1
+         i1 <- targlength * (1:n)
+         i1[n] <- length(vec)
+         temp <- sapply(1:n, function(i) {
+             median(vec[i0[i]:i1[i]])
+         })
+         vals <- c(vals, temp)
+         temp <- sapply(1:n, function(i) {
+             mad(vec[i0[i]:i1[i]])
+         })
+         svals <- c(svals, temp)
+     }
+     madqc[cid] <- median(svals)
+     alpha <- seq(-0.05, 0.8, length = 1701)
+     penal <- 10^alpha * diff(range(2 * 10^vals))
+     wiggle <- sapply(alpha, baz, XS = vals)
+     v1 <- wiggle + .PRIOR.WT * penal
+     ick <- which(v1 == min(v1))
+     a <- alpha[ick]
+     a
+     modal
+     mean(vals)
+     median(vals)
+     alist[cid] <- alpha[ick]
+     par(bg = "white", cex = 1.5, mai = c(1.4, 1.2, 1, 0.2))
+     plot(alpha, v1, type = "l", main = cid, xlab = "A", ylab = "SSEI + Penalty",
+         lwd = 3)
+     points(alpha[ick], v1[ick], pch = 16, col = "#00aa00")
+     dev.copy(png, file = file.path("fitfuns", paste(cid, "png", sep = ".")),
+         width = 600, height = 500)
+     dev.off()
+ }

Figure 2: Optimizing the SSEI + Penalty function for CL001. (Panel title: CL001; x-axis: A; y-axis: SSEI + Penalty.)

Figure 3: Location of segment summaries (size proportional to number of SNP markers) for CL001.

Finally, we save the intermediate results.

> write.csv(data.frame(shift = alist, madqc = madqc), file = "renorm.csv")
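Once a shift has been estimated, it can be applied to convert log R ratios into copy-number estimates. The sketch below assumes, as in the plotting code of this report, that the shift is added to the log ratio and that the copy number is recovered as 2 * 10^(shifted log ratio). The data frame, the sample names CL_A and CL_B, and the helper `lrr_to_copies` are made up for illustration; only the column names shift and madqc mirror what is written to renorm.csv.

```r
# Hypothetical fitted shifts, laid out like the columns saved in renorm.csv.
renorm <- data.frame(shift = c(0, 0.15), madqc = c(0.05, 0.06),
                     row.names = c("CL_A", "CL_B"))

# Convert log R ratios to copy-number estimates after adding the shift.
lrr_to_copies <- function(lrr, shift) 2 * 10^(lrr + shift)

# Made-up segment medians from a sample whose log ratios sit 0.15 too low.
lrr <- c(-0.15, 0.0261, 0.151)

copies_raw     <- lrr_to_copies(lrr, renorm["CL_A", "shift"])  # no correction
copies_shifted <- lrr_to_copies(lrr, renorm["CL_B", "shift"])  # with correction
round(rbind(copies_raw, copies_shifted), 2)
```

With no correction the implied copy numbers fall between integers, while the shifted values land close to 2, 3 and 4, which is the behavior the renormalization is meant to produce.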
Figure 4: Histogram of the renormalization constants. (Panel title: Histogram of alist; x-axis: alist; y-axis: Frequency.)
> if (sum(alist > 0.2)) {
+     print(which(alist > 0.2))
+ }
[1]  7 31
> median(alist[alist < 0.2])
[1]

4 Appendix

This analysis was run in the following directory:

> getwd()
[1] "o:/private/abruzzo/snp-cll/aa"

Note that \\mdadqsfs02 is the standard institutional location for storing data and analyses; N: is the name given to that location on this machine.

This analysis was run in the following software environment:

> sessionInfo()
R version ( )
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

> while (!is.null(dev.list())) dev.off()
More informationBIOS 312: Precision of Statistical Inference
and Power/Sample Size and Standard Errors BIOS 312: of Statistical Inference Chris Slaughter Department of Biostatistics, Vanderbilt University School of Medicine January 3, 2013 Outline Overview and Power/Sample
More informationApplication of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data
Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Juan Torres 1, Ashraf Saad 2, Elliot Moore 1 1 School of Electrical and Computer
More informationObjective 3.01 (DNA, RNA and Protein Synthesis)
Objective 3.01 (DNA, RNA and Protein Synthesis) DNA Structure o Discovered by Watson and Crick o Double-stranded o Shape is a double helix (twisted ladder) o Made of chains of nucleotides: o Has four types
More informationI. GREGOR MENDEL - father of heredity
GENETICS: Mendel Background: Students know that Meiosis produces 4 haploid sex cells that are not identical, allowing for genetic variation. Essential Question: What are two characteristics about Mendel's
More informationUnderstanding p Values
Understanding p Values James H. Steiger Vanderbilt University James H. Steiger Vanderbilt University Understanding p Values 1 / 29 Introduction Introduction In this module, we introduce the notion of a
More informationR-companion to: Estimation of the Thurstonian model for the 2-AC protocol
R-companion to: Estimation of the Thurstonian model for the 2-AC protocol Rune Haubo Bojesen Christensen, Hye-Seong Lee & Per Bruun Brockhoff August 24, 2017 This document describes how the examples in
More informationFamily resemblance can be striking!
Family resemblance can be striking! 1 Chapter 14. Mendel & Genetics 2 Gregor Mendel! Modern genetics began in mid-1800s in an abbey garden, where a monk named Gregor Mendel documented inheritance in peas
More information(Write your name on every page. One point will be deducted for every page without your name!)
POPULATION GENETICS AND MICROEVOLUTIONARY THEORY FINAL EXAMINATION (Write your name on every page. One point will be deducted for every page without your name!) 1. Briefly define (5 points each): a) Average
More informationChapter 9. Non-Parametric Density Function Estimation
9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least
More informationIntroduction to Crossover Trials
Introduction to Crossover Trials Stat 6500 Tutorial Project Isaac Blackhurst A crossover trial is a type of randomized control trial. It has advantages over other designed experiments because, under certain
More informationMultiple Change-Point Detection and Analysis of Chromosome Copy Number Variations
Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem
More informationMACAU 2.0 User Manual
MACAU 2.0 User Manual Shiquan Sun, Jiaqiang Zhu, and Xiang Zhou Department of Biostatistics, University of Michigan shiquans@umich.edu and xzhousph@umich.edu April 9, 2017 Copyright 2016 by Xiang Zhou
More information2. Map genetic distance between markers
Chapter 5. Linkage Analysis Linkage is an important tool for the mapping of genetic loci and a method for mapping disease loci. With the availability of numerous DNA markers throughout the human genome,
More informationGenetic proof of chromatin diminution under mitotic agamospermy
Genetic proof of chromatin diminution under mitotic agamospermy Evgenii V. Levites Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia Email: levites@bionet.nsc.ru
More informationDetecting selection from differentiation between populations: the FLK and hapflk approach.
Detecting selection from differentiation between populations: the FLK and hapflk approach. Bertrand Servin bservin@toulouse.inra.fr Maria-Ines Fariello, Simon Boitard, Claude Chevalet, Magali SanCristobal,
More informationHeterozygous BMN lines
Optical density at 80 hours 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 a YPD b YPD + 1µM nystatin c YPD + 2µM nystatin d YPD + 4µM nystatin 1 3 5 6 9 13 16 20 21 22 23 25 28 29 30
More informationUNIT 8 BIOLOGY: Meiosis and Heredity Page 148
UNIT 8 BIOLOGY: Meiosis and Heredity Page 148 CP: CHAPTER 6, Sections 1-6; CHAPTER 7, Sections 1-4; HN: CHAPTER 11, Section 1-5 Standard B-4: The student will demonstrate an understanding of the molecular
More informationSNP Association Studies with Case-Parent Trios
SNP Association Studies with Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health September 3, 2009 Population-based Association Studies Balding (2006). Nature
More informationChapter 8: Introduction to Evolutionary Computation
Computational Intelligence: Second Edition Contents Some Theories about Evolution Evolution is an optimization process: the aim is to improve the ability of an organism to survive in dynamically changing
More informationUser s Guide for interflex
User s Guide for interflex A STATA Package for Producing Flexible Marginal Effect Estimates Yiqing Xu (Maintainer) Jens Hainmueller Jonathan Mummolo Licheng Liu Description: interflex performs diagnostics
More informationCausal Model Selection Hypothesis Tests in Systems Genetics
1 Causal Model Selection Hypothesis Tests in Systems Genetics Elias Chaibub Neto and Brian S Yandell SISG 2012 July 13, 2012 2 Correlation and Causation The old view of cause and effect... could only fail;
More informationMetric Predicted Variable on One Group
Metric Predicted Variable on One Group Tim Frasier Copyright Tim Frasier This work is licensed under the Creative Commons Attribution 4.0 International license. Click here for more information. Prior Homework
More information