Renormalizing Illumina SNP Cell Line Data


Kevin R. Coombes
17 March 2011

Contents

1 Executive Summary
  1.1 Introduction
    1.1.1 Aims/Objectives
  1.2 Methods
    1.2.1 Description of Data
    1.2.2 Statistical Methods
  1.3 Results
  1.4 Conclusions
2 Details
  2.1 Load Sample Names
3 Characteristic Modes or Intensity Levels of Copy Number
  3.1 Modeling the Copy Number Levels
4 Appendix

1 Executive Summary

1.1 Introduction

This report describes the analysis of a data set from Lynn Barron, a member of the laboratory of Lynne V. Abruzzo. This dataset was acquired using Illumina 610K SNP chips. The main goal of the study is to identify genetic abnormalities that are associated with clinical outcome (including overall survival and time-to-treatment). This is the first part of a series of related reports.

1.1.1 Aims/Objectives

We noticed (from an analysis using lung cancer cell lines) in plots of the log R ratios (LRR) and B allele frequencies (BAF) that some of the data appeared inconsistent with our understanding of how to interpret them. We hypothesized that, for many if not most cell lines, the number of chromosomes present in a cell is far in excess of the typical value of 46. If so, this excess would likely cause the Illumina normalization procedure to scale all of the LRR data to a value that is too small (since the normalization implicitly assumes that the intensities come from cells with about 46 chromosomes). The objective of this report is to test this hypothesis and to try to develop methods to correct for any distortions introduced by normalization.

1.2 Methods

1.2.1 Description of Data

The dataset contains measurements on 176 previously untreated patients with CLL. Extensive clinical followup is available.

1.2.2 Statistical Methods

Raw data were processed in BeadStudio to yield genotype calls, log R ratios (LRR), and B allele frequencies (BAF) for each SNP in each sample. Since the study does not include matched normal DNA, the BeadStudio computations were performed relative to the pool of 120 HapMap samples run by Illumina.

In this report, we introduce a novel method for re-normalizing the SNP log ratios for each cell line. The basic idea is to write $L_{ij}$ for the log R ratio of the $j$th SNP in segment $i$, and write $\nu_i$ for the true (unknown) copy number. We assume that there is a (sample-specific) renormalization constant $\alpha$ with the property that

$$L_{ij} = \alpha + \log_{10}(\nu_i/2) + E_{ij},$$

where the $E_{ij} \sim N(0, \sigma)$ are independent and identically distributed Gaussian random variables. If $\alpha$ is known, then each $L_{ij}$ can be assigned to the nearest log-half-integer offset by $\alpha$, that is, to the nearest value of the form $\alpha + \log_{10}(k/2)$ for a positive integer $k$ (which we denote by $I_{ij}(\alpha)$). Then define the sum of square errors to be

$$\mathrm{SSEI}(\alpha) = \sum_{i,j} \left(L_{ij} - I_{ij}(\alpha)\right)^2.$$

We impose a prior or penalty on $\alpha$. Specifically, we consider the effect of converting from the log R ratio scale back to the linear copy number scale, as a function of $\alpha$. In other words, we look at the values

$$N_{ij}(\alpha) = 2 \cdot 10^{L_{ij} - \alpha}.$$

The range of these values (if rounded to the nearest integer) represents the number of distinct integer copy number levels that are represented in the data, as a function of $\alpha$. Thus, we can think of this range as the number of (copy number) parameters in the model. We then minimize the penalized sum of square errors to find the optimal value of $\alpha$.
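As a minimal sketch of the conditional step (the case where the renormalization constant is treated as known), the following R code assigns values to the nearest log-half-integer and evaluates the sum of squared errors. The function names and the toy data are illustrative assumptions and are not part of the analysis; the cap of 20 copy number levels mirrors the grid used by the functions later in this report. The shift is written as an additive constant, so `shift` corresponds to $-\alpha$ in the model equation.

```r
# Log-half-integers: log10(k/2) for copy numbers k = 1, ..., kmax.
log_half_integers <- function(kmax = 20) log10(seq_len(kmax) / 2)

# For each value in x, return the copy number k whose log-half-integer is nearest.
nearest_copy_number <- function(x, kmax = 20) {
  grid <- log_half_integers(kmax)
  vapply(x, function(z) which.min(abs(z - grid)), integer(1))
}

# Sum of squared distances to the nearest log-half-integer after adding `shift`
# (assumption: shift = -alpha in the model L = alpha + log10(nu/2) + E).
ssei <- function(shift, x, kmax = 20) {
  y <- x + shift
  k <- nearest_copy_number(y, kmax)
  sum((y - log_half_integers(kmax)[k])^2)
}

# Hypothetical toy data: copy numbers 2, 3, 4 observed with alpha = -0.2, no noise.
toy <- log10(c(2, 3, 4) / 2) - 0.2
copies <- nearest_copy_number(toy + 0.2)  # copy numbers recovered at the true shift
sse_true <- ssei(0.2, toy)                # essentially zero at the true shift
sse_none <- ssei(0, toy)                  # clearly larger with no shift applied
```

At the true shift every value lands on a log-half-integer, so the squared error vanishes; without the shift, each value sits between grid points and the error is visibly larger.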

1.3 Results

We fit the statistical model described above in order to find the optimal renormalization constants. The values are stored in the file renorm.csv. Figures illustrating the process (Figure 2 and Figure 3) are stored in the subdirectories FitFuns and ApresFitFuns.

1.4 Conclusions

The model-based approach to estimating the renormalization constant appears to work well.

2 Details

2.1 Load Sample Names

> load("allsamplenames.rda")

3 Characteristic Modes or Intensity Levels of Copy Number

We will illustrate the plots and the method for the following sample:

> cid <- .EGSAMPLE
> cid
[1] "CL001"
> source("00rnw/snp-utils.r")
> dat <- loadsnponesample(cid)

Our next step is based on the idea that we should be able to identify characteristic modes of the LRR values for each chromosome, and that these modes should each correspond to a particular integer number of copies. We start by looking at chromosome 13.

> chrn <- .CHRN
> chrn
[1] 13

We use the next block of code to arbitrarily divide the LRR data into segments of fixed length and to compute the median (and MAD) on each such segment.

> targlength <- 500
> svals <- vals <- NULL
> for (chrn in c(1:22, "X")) {
+     vec <- dat[dat$chr == chrn, "Log.R.Ratio"]
+     vec <- vec[!is.na(vec)]
+     # number of complete segments on this chromosome
+     n <- length(vec)%/%targlength
+     # first and last index of each segment; the final segment
+     # absorbs any leftover SNPs
+     i0 <- targlength * ((1:n) - 1) + 1
+     i1 <- targlength * (1:n)
+     i1[n] <- length(vec)
+     # segment medians
+     temp <- sapply(1:n, function(i) {
+         median(vec[i0[i]:i1[i]])
+     })
+     vals <- c(vals, temp)
+     # segment MADs
+     temp <- sapply(1:n, function(i) {
+         mad(vec[i0[i]:i1[i]])
+     })
+     svals <- c(svals, temp)
+ }
> dd <- density(vals)
> modal <- dd$x[which(dd$y == max(dd$y))]

3.1 Modeling the Copy Number Levels

We now plan to fit a statistical model that relates the log R ratios to the true copy number of segments. To describe the model, we let $L_{ij}$ represent the log R ratio of the $j$th SNP in segment $i$ for a fixed cell line sample. We assume that the true (unknown) copy number for this segment is given by $\nu_i$. We also assume that there is a (sample-specific) global renormalization constant $\alpha$ with the property that

$$L_{ij} = \alpha + \log_{10}(\nu_i/2) + E_{ij}.$$

Here the error model is that the $E_{ij} \sim N(0, \sigma)$ are independent and identically distributed Gaussian random variables. In the normal case, where the typical copy number equals two across the genome, most of the $\nu_i = 2$ and we expect that $\alpha = 0$. By contrast, if the ploidy is such that this cell line has $K > 46$ chromosomes, then we expect that $|\alpha| \approx \log_{10}(K/46)$, since a shift of that size is needed in order to get the observed (processed) log R ratios about right.

In order to fit this model, we have to do a couple of things. The basic idea is to start the computation conditional on knowing $\alpha$. In that case, each $L_{ij}$ can be assigned to the nearest log-half-integer offset by $\alpha$, that is, to the nearest value of the form $\alpha + \log_{10}(k/2)$ for a positive integer $k$ (which we denote by $I_{ij}(\alpha)$), as a maximum likelihood estimate of its true value. Then we can define the sum of square errors to be

$$\mathrm{SSEI}(\alpha) = \sum_{i,j} \left(L_{ij} - I_{ij}(\alpha)\right)^2.$$

A simple method would then be to minimize $\mathrm{SSEI}(\alpha)$ as a function of $\alpha$. The difficulty with this simple approach, however, is that the log-half-integers $\log_{10}(k/2)$ get closer together as $k$ increases, so shifting the data toward larger values always makes it easier to land near one of them. This creates a bias in favor of larger positive shifts. (The code later in this section applies the shift by adding a constant to the observed values, so the fitted constant reported there corresponds to $-\alpha$ in the equation above.)

In order to overcome this difficulty, we can impose a prior on the values of $\alpha$, which has the basic effect of imposing a penalty based on the size of $\alpha$.
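The bias and its remedy can be illustrated with a short sketch in R. Everything here is hypothetical: the toy segment medians are noise-free, the helper name `sse_to_grid` is invented for the illustration, and the prior weight of 0.001 is an arbitrary assumption (the report's own weight is not shown). The shift is additive, so it corresponds to $-\alpha$ in the model equation, and the penalty has the same form as the range-based penalty described in the text.

```r
# Squared distance from each shifted value to the nearest log-half-integer.
sse_to_grid <- function(shift, x, kmax = 20) {
  grid <- log10(seq_len(kmax) / 2)
  sum(vapply(x + shift, function(z) min(abs(z - grid))^2, numeric(1)))
}

# The gaps between consecutive log-half-integers shrink as k grows,
# which is the source of the bias toward large positive shifts.
spacing <- diff(log10((1:20) / 2))

# Hypothetical toy data: copy numbers 2, 3, 4 with a true additive shift of 0.2.
x <- log10(c(2, 3, 4) / 2) - 0.2
shifts <- seq(-0.05, 0.8, length.out = 1701)
sse <- vapply(shifts, sse_to_grid, numeric(1), x = x)

# Without a penalty, several distinct shifts fit almost perfectly: the same
# data can be read as copies 2, 3, 4 or as copies 4, 6, 8, and so on.
near_perfect <- shifts[sse < 1e-6]

# Penalty: the range of implied copy numbers, which grows with the shift.
prior_wt <- 0.001                      # assumed weight, for illustration only
penalty <- 10^shifts * diff(range(2 * 10^x))
best <- shifts[which.min(sse + prior_wt * penalty)]
```

With the penalty added, the smallest of the near-perfect shifts wins, because it explains the data with the fewest distinct copy number levels. The penalty also pulls the optimum very slightly below the true shift, which is the usual price of a penalized fit.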

Figure 1: Histogram of the median log R ratio on the 500-SNP-long segments for CL001 (x axis: vals; y axis: Density).

To make this penalty specific, we consider the effect of converting from the log R ratio scale back to the linear copy number scale, as a function of $\alpha$. In other words, we look at the values

$$N_{ij}(\alpha) = 2 \cdot 10^{L_{ij} - \alpha}.$$

The range of these values (if rounded to the nearest integer) represents the number of distinct integer copy number levels that are represented in the data, as a function of $\alpha$. Thus, we can think of this range as the number of (copy number) parameters in the model. As with the Akaike Information Criterion in other contexts, this range provides a reasonable penalty term. We use the actual range rather than an integer range in order to maintain continuity in the resulting function that needs to be minimized.

We add two further wrinkles to our attempts to fit these data. First, when we know the number $N_B$ of components of the B allele frequency plot, we know something more about the possible copy numbers. Namely,

- $N_B = 4$ if and only if $\nu \ge 3$ (unbalanced heterozygous). OOPS! We can also get four bands if a fraction of the cells has lost one copy of a chromosome or if a fraction of the cells has undergone LOH. For example, if 50% of cells lose one copy, then at a heterozygous SNP where the retained copy is an A allele, we have (at the genotype level) 50 A and 50 AB genotypes. At the allelic level, we then have 100 A and 50 B, so we expect the BAF plot to have bands at 1/3 and 2/3. By contrast, if 50% of cells have undergone LOH at this locus, then our mixture contains genotypes of 50 AA and 50 AB, or an allelotype of 150 A and 50 B, so we expect BAF bands at 1/4 and 3/4. Thus, four BAF components can appear even when $\nu < 3$.
- $N_B = 3$ if and only if $\nu \ge 2$ is even (balanced heterozygous)
- $N_B = 2$ if and only if $\nu \ge 1$ (homozygous)
- $N_B = 1$ if and only if $\nu = 0$ (complete loss)

The second wrinkle is that we do not use all of the raw segment data, but instead use the summaries provided by identifying the characteristic modes on each chromosome by fitting density functions. We do, however, weight these modes proportionally to the number of SNP markers that they represent.

The following functions implement and fit this statistical model.

> idist <- function(x, XS) {
+     A <- x[1]    # additive shift
+     B <- x[2]    # multiplicative scale
+     y <- A + B * XS
+     # sum of squared distances from each transformed value to the
+     # nearest log-half-integer log10(k/2), k = 1, ..., 20
+     base <- sum(unlist(lapply(y, function(z0, s) {
+         min(abs(z0 - s))^2
+     }, s = log10((1:20)/2))))
+     base
+ }
> baz <- function(a, XS) {
+     # shift only; the scale is fixed at 1
+     idist(c(a, 1), XS)
+ }
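Returning to the mixture argument in the first wrinkle above, the expected band positions can be checked with a short calculation. The following R sketch is only an illustration: the function `mixture_baf` and its arguments are hypothetical names, and it assumes that the B allele frequency of a mixture is simply the pooled fraction of B alleles.

```r
# Expected B allele frequency at one SNP for a mixture of cell populations.
# frac: fraction of cells in each population (must sum to 1)
# nA, nB: number of A and B alleles carried by a cell in each population
mixture_baf <- function(frac, nA, nB) {
  stopifnot(length(frac) == length(nA), length(nA) == length(nB),
            isTRUE(all.equal(sum(frac), 1)))
  sum(frac * nB) / sum(frac * (nA + nB))
}

# 50% of cells lose one copy and retain A; 50% remain AB.
loss_band <- mixture_baf(c(0.5, 0.5), nA = c(1, 1), nB = c(0, 1))

# 50% of cells undergo copy-neutral LOH to AA; 50% remain AB.
loh_band <- mixture_baf(c(0.5, 0.5), nA = c(2, 1), nB = c(0, 1))

# The mirror-image bands arise at SNPs where the retained allele is B.
bands_loss <- c(loss_band, 1 - loss_band)
bands_loh <- c(loh_band, 1 - loh_band)
```

The two scenarios give inner bands at 1/3 and 2/3 for partial loss and at 1/4 and 3/4 for partial LOH, matching the arithmetic in the text.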

> colset <- c(general = "black", ThreeBand = "blue", FourBand = "green",
+     TwoBand = "red")
> foo2 <- function(a, xset) {
+     # point sizes reflect the number of SNP markers in each summary;
+     # the numeric constants in the rescaling are missing from the source
+     w <- 1 + round(xset$w)
+     w <- ... * (w - 1)/max(w - 1)
+     # shifted log ratios and the largest copy number they imply
+     shiftx <- a + xset$x
+     top <- 1 + trunc(max(2 * 10^shiftx))
+     plot(shiftx, col = colset[as.character(xset$flag)], pch = 16, cex = w,
+         main = cid, ylab = "Shifted Log Ratio", xlab = "Summarized Segment Index")
+     # reference lines at the log-half-integers, labeled by copy number
+     abline(h = log10((1:top)/2), col = "gray")
+     mtext(1:top, side = 4, at = log10((1:top)/2), line = 0.5, las = 1)
+     legend("bottomright", names(colset), col = colset, pch = 16)
+ }

Here is the analysis for CL001.

> alpha <- seq(-0.05, 0.8, length = 1701)
> penal <- 10^alpha * diff(range(2 * 10^vals))
> wiggle <- sapply(alpha, baz, XS = vals)
> v1 <- wiggle + .PRIOR.WT * penal
> ick <- which(v1 == min(v1))
> a <- alpha[ick]
> a
[1]
> modal
[1]
> mean(vals)
[1]
> median(vals)
[1]

Figure 2: Optimizing the SSEI + Penalty function for CL001 (x axis: A; y axis: SSEI + Penalty).

Figure 3: Location of segment summaries (size proportional to number of SNP markers) for CL001.

Now we can fit these statistical models for all of the samples.

> if (!file.exists("fitfuns")) dir.create("fitfuns")
> madqc <- alist <- rep(NA, length(shortnames))
> names(madqc) <- names(alist) <- shortnames
> for (cid in sort(shortnames)) {
+     cat(paste("working on", cid, "\n"), file = stderr())
+     dat <- loadsnponesample(cid)
+     # segment medians and MADs, as for the example sample
+     targlength <- 500
+     svals <- vals <- NULL
+     for (chrn in c(1:22, "X")) {
+         vec <- dat[dat$chr == chrn, "Log.R.Ratio"]
+         vec <- vec[!is.na(vec)]
+         n <- length(vec)%/%targlength
+         i0 <- targlength * ((1:n) - 1) + 1
+         i1 <- targlength * (1:n)
+         i1[n] <- length(vec)
+         temp <- sapply(1:n, function(i) {
+             median(vec[i0[i]:i1[i]])
+         })
+         vals <- c(vals, temp)
+         temp <- sapply(1:n, function(i) {
+             mad(vec[i0[i]:i1[i]])
+         })
+         svals <- c(svals, temp)
+     }
+     madqc[cid] <- median(svals)
+     # grid search over the shift, minimizing SSEI plus the weighted penalty
+     alpha <- seq(-0.05, 0.8, length = 1701)
+     penal <- 10^alpha * diff(range(2 * 10^vals))
+     wiggle <- sapply(alpha, baz, XS = vals)
+     v1 <- wiggle + .PRIOR.WT * penal
+     ick <- which(v1 == min(v1))
+     a <- alpha[ick]
+     a
+     modal
+     mean(vals)
+     median(vals)
+     alist[cid] <- alpha[ick]
+     # save a plot of the penalized criterion for this sample
+     par(bg = "white", cex = 1.5, mai = c(1.4, 1.2, 1, 0.2))
+     plot(alpha, v1, type = "l", main = cid, xlab = "A", ylab = "SSEI + Penalty",
+         lwd = 3)
+     points(alpha[ick], v1[ick], pch = 16, col = "#00aa00")
+     dev.copy(png, file = file.path("fitfuns", paste(cid, "png", sep = ".")),
+         width = 600, height = 500)
+     dev.off()
+ }

Finally, we save the intermediate results.

> write.csv(data.frame(shift = alist, madqc = madqc), file = "renorm.csv")
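One possible use of a saved shift, sketched here under stated assumptions, is to map log R ratios back to the copy number scale with the same transformation that the plotting function above uses (2 * 10^(shifted value)). The function name `lrr_to_copy_number`, the shift of 0.3, and the segment medians are all hypothetical values chosen for illustration; they are not results from this analysis.

```r
# Convert log R ratios to the linear copy number scale after adding a shift.
# The shift is additive, matching the fitted constants saved in renorm.csv.
lrr_to_copy_number <- function(lrr, shift) {
  2 * 10^(lrr + shift)
}

shift <- 0.3                      # hypothetical fitted constant
lrr <- c(-0.3, -0.124, 0, 0.1)    # hypothetical segment medians
cn <- lrr_to_copy_number(lrr, shift)
cn_int <- round(cn)               # nearest integer copy number per segment
```

In this toy case a segment with a log R ratio of zero, which would naively be read as two copies, is reassigned to four copies once the shift is taken into account.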

Figure 4: Histogram of the renormalization constants (x axis: alist; y axis: Frequency).

> if (sum(alist > 0.2)) {
+     print(which(alist > 0.2))
+ }
[1] 7 31
> median(alist[alist < 0.2])
[1]

4 Appendix

This analysis was run in the following directory:

> getwd()
[1] "o:/private/abruzzo/snp-cll/aa"

Note that \\mdadqsfs02 is the standard institutional location for storing data and analyses; N: is the name given to that location on this machine.

This analysis was run in the following software environment:

> sessionInfo()
R version ( )
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

> while (!is.null(dev.list())) dev.off()


SUPPLEMENTARY MATERIAL TECHNICAL APPENDIX Article: Comparison of control charts for monitoring clinical performance using binary data

SUPPLEMENTARY MATERIAL TECHNICAL APPENDIX Article: Comparison of control charts for monitoring clinical performance using binary data SUPPLEMENTARY MATERIAL TECHNICAL APPENDIX Article: Comparison of control charts for monitoring clinical performance using binary data APPENDIX TABLES Table A1 Comparison of out-of-control ARLs, in number

More information

Mapping Chang To GEO

Mapping Chang To GEO Mapping Chang To GEO Keith A. Baggerly and Kevin R. Coombes November 13, 2007 1 Introduction We want to match the Chang array data from GEO with the clinical information supplied in Chang et al and the

More information

Section 11 1 The Work of Gregor Mendel

Section 11 1 The Work of Gregor Mendel Chapter 11 Introduction to Genetics Section 11 1 The Work of Gregor Mendel (pages 263 266) What is the principle of dominance? What happens during segregation? Gregor Mendel s Peas (pages 263 264) 1. The

More information

The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data

The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data Sarah Reese July 31, 2013 Batch effects are commonly observed systematic non-biological variation between groups of samples

More information

Reinforcement Unit 3 Resource Book. Meiosis and Mendel KEY CONCEPT Gametes have half the number of chromosomes that body cells have.

Reinforcement Unit 3 Resource Book. Meiosis and Mendel KEY CONCEPT Gametes have half the number of chromosomes that body cells have. 6.1 CHROMOSOMES AND MEIOSIS KEY CONCEPT Gametes have half the number of chromosomes that body cells have. Your body is made of two basic cell types. One basic type are somatic cells, also called body cells,

More information

SELESTIM: Detec-ng and Measuring Selec-on from Gene Frequency Data

SELESTIM: Detec-ng and Measuring Selec-on from Gene Frequency Data «Environmental Gene4cs» doctoral course ABIES-GAIA SELESTIM: Detec-ng and Measuring Selec-on from Gene Frequency Data Renaud Vitalis Centre de Biologie pour la Ges-on des Popula-ons INRA ; Montpellier

More information

Tests for Two Coefficient Alphas

Tests for Two Coefficient Alphas Chapter 80 Tests for Two Coefficient Alphas Introduction Coefficient alpha, or Cronbach s alpha, is a popular measure of the reliability of a scale consisting of k parts. The k parts often represent k

More information

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs

More information

Unstable Laser Emission Vignette for the Data Set laser of the R package hyperspec

Unstable Laser Emission Vignette for the Data Set laser of the R package hyperspec Unstable Laser Emission Vignette for the Data Set laser of the R package hyperspec Claudia Beleites DIA Raman Spectroscopy Group, University of Trieste/Italy (2005 2008) Spectroscopy

More information

Package CEC. R topics documented: August 29, Title Cross-Entropy Clustering Version Date

Package CEC. R topics documented: August 29, Title Cross-Entropy Clustering Version Date Title Cross-Entropy Clustering Version 0.9.4 Date 2016-04-23 Package CEC August 29, 2016 Author Konrad Kamieniecki [aut, cre], Przemyslaw Spurek [ctb] Maintainer Konrad Kamieniecki

More information

Hypothesis Testing: Chi-Square Test 1

Hypothesis Testing: Chi-Square Test 1 Hypothesis Testing: Chi-Square Test 1 November 9, 2017 1 HMS, 2017, v1.0 Chapter References Diez: Chapter 6.3 Navidi, Chapter 6.10 Chapter References 2 Chi-square Distributions Let X 1, X 2,... X n be

More information

Processing microarray data with Bioconductor

Processing microarray data with Bioconductor Processing microarray data with Bioconductor Statistical analysis of gene expression data with R and Bioconductor University of Copenhagen Copenhagen Biocenter Laurent Gautier 1, 2 August 17-21 2009 Contents

More information

Survey on Population Mean

Survey on Population Mean MATH 203 Survey on Population Mean Dr. Neal, Spring 2009 The first part of this project is on the analysis of a population mean. You will obtain data on a specific measurement X by performing a random

More information

BIOS 312: Precision of Statistical Inference

BIOS 312: Precision of Statistical Inference and Power/Sample Size and Standard Errors BIOS 312: of Statistical Inference Chris Slaughter Department of Biostatistics, Vanderbilt University School of Medicine January 3, 2013 Outline Overview and Power/Sample

More information

Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data

Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Juan Torres 1, Ashraf Saad 2, Elliot Moore 1 1 School of Electrical and Computer

More information

Objective 3.01 (DNA, RNA and Protein Synthesis)

Objective 3.01 (DNA, RNA and Protein Synthesis) Objective 3.01 (DNA, RNA and Protein Synthesis) DNA Structure o Discovered by Watson and Crick o Double-stranded o Shape is a double helix (twisted ladder) o Made of chains of nucleotides: o Has four types

More information

I. GREGOR MENDEL - father of heredity

I. GREGOR MENDEL - father of heredity GENETICS: Mendel Background: Students know that Meiosis produces 4 haploid sex cells that are not identical, allowing for genetic variation. Essential Question: What are two characteristics about Mendel's

More information

Understanding p Values

Understanding p Values Understanding p Values James H. Steiger Vanderbilt University James H. Steiger Vanderbilt University Understanding p Values 1 / 29 Introduction Introduction In this module, we introduce the notion of a

More information

R-companion to: Estimation of the Thurstonian model for the 2-AC protocol

R-companion to: Estimation of the Thurstonian model for the 2-AC protocol R-companion to: Estimation of the Thurstonian model for the 2-AC protocol Rune Haubo Bojesen Christensen, Hye-Seong Lee & Per Bruun Brockhoff August 24, 2017 This document describes how the examples in

More information

Family resemblance can be striking!

Family resemblance can be striking! Family resemblance can be striking! 1 Chapter 14. Mendel & Genetics 2 Gregor Mendel! Modern genetics began in mid-1800s in an abbey garden, where a monk named Gregor Mendel documented inheritance in peas

More information

(Write your name on every page. One point will be deducted for every page without your name!)

(Write your name on every page. One point will be deducted for every page without your name!) POPULATION GENETICS AND MICROEVOLUTIONARY THEORY FINAL EXAMINATION (Write your name on every page. One point will be deducted for every page without your name!) 1. Briefly define (5 points each): a) Average

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Introduction to Crossover Trials

Introduction to Crossover Trials Introduction to Crossover Trials Stat 6500 Tutorial Project Isaac Blackhurst A crossover trial is a type of randomized control trial. It has advantages over other designed experiments because, under certain

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

MACAU 2.0 User Manual

MACAU 2.0 User Manual MACAU 2.0 User Manual Shiquan Sun, Jiaqiang Zhu, and Xiang Zhou Department of Biostatistics, University of Michigan shiquans@umich.edu and xzhousph@umich.edu April 9, 2017 Copyright 2016 by Xiang Zhou

More information

2. Map genetic distance between markers

2. Map genetic distance between markers Chapter 5. Linkage Analysis Linkage is an important tool for the mapping of genetic loci and a method for mapping disease loci. With the availability of numerous DNA markers throughout the human genome,

More information

Genetic proof of chromatin diminution under mitotic agamospermy

Genetic proof of chromatin diminution under mitotic agamospermy Genetic proof of chromatin diminution under mitotic agamospermy Evgenii V. Levites Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia Email: levites@bionet.nsc.ru

More information

Detecting selection from differentiation between populations: the FLK and hapflk approach.

Detecting selection from differentiation between populations: the FLK and hapflk approach. Detecting selection from differentiation between populations: the FLK and hapflk approach. Bertrand Servin bservin@toulouse.inra.fr Maria-Ines Fariello, Simon Boitard, Claude Chevalet, Magali SanCristobal,

More information

Heterozygous BMN lines

Heterozygous BMN lines Optical density at 80 hours 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 0.8 0.6 0.4 0.2 a YPD b YPD + 1µM nystatin c YPD + 2µM nystatin d YPD + 4µM nystatin 1 3 5 6 9 13 16 20 21 22 23 25 28 29 30

More information

UNIT 8 BIOLOGY: Meiosis and Heredity Page 148

UNIT 8 BIOLOGY: Meiosis and Heredity Page 148 UNIT 8 BIOLOGY: Meiosis and Heredity Page 148 CP: CHAPTER 6, Sections 1-6; CHAPTER 7, Sections 1-4; HN: CHAPTER 11, Section 1-5 Standard B-4: The student will demonstrate an understanding of the molecular

More information

SNP Association Studies with Case-Parent Trios

SNP Association Studies with Case-Parent Trios SNP Association Studies with Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health September 3, 2009 Population-based Association Studies Balding (2006). Nature

More information

Chapter 8: Introduction to Evolutionary Computation

Chapter 8: Introduction to Evolutionary Computation Computational Intelligence: Second Edition Contents Some Theories about Evolution Evolution is an optimization process: the aim is to improve the ability of an organism to survive in dynamically changing

More information

User s Guide for interflex

User s Guide for interflex User s Guide for interflex A STATA Package for Producing Flexible Marginal Effect Estimates Yiqing Xu (Maintainer) Jens Hainmueller Jonathan Mummolo Licheng Liu Description: interflex performs diagnostics

More information

Causal Model Selection Hypothesis Tests in Systems Genetics

Causal Model Selection Hypothesis Tests in Systems Genetics 1 Causal Model Selection Hypothesis Tests in Systems Genetics Elias Chaibub Neto and Brian S Yandell SISG 2012 July 13, 2012 2 Correlation and Causation The old view of cause and effect... could only fail;

More information

Metric Predicted Variable on One Group

Metric Predicted Variable on One Group Metric Predicted Variable on One Group Tim Frasier Copyright Tim Frasier This work is licensed under the Creative Commons Attribution 4.0 International license. Click here for more information. Prior Homework

More information