Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis

Size: px
Start display at page:

Download "Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis"

Transcription

1 Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis Hongzhe Li University of Pennsylvania Perelman School of Medicine Joint work with Jessie Jeng and Tony Cai Robust Segment Identification - 1 of 34-1 p. 1

2 Genetic variations and complex diseases Commonly observed genetic variations: Single nucleotide variants (SNVs). Small insertions/deletions (InDels). Structure variations, including the copy number variations (CNVs). All are associated with risk of complex diseases. Robust Segment Identification - 2 of 34-1 p. 2

3 Copy number variants (CNVs) Copy Number Variation Many CNVs have functional consequences: alter gene dosage, disrupt genes, or uncover deleterious alleles. NEJM 2007, 356: Robust Segment Identification - 3 of 34-1 p. 3

4 CNV Associations CNVs and Human Diseases Science 315: (2007) Science 316: (2007) Am J Med Genet 144: (2007) Science: in press (2008) Circulation 115: (2007) 4 Robust Segment Identification - 4 of 34-1 p. 4

5 Data available for CNV Analysis, literature Two types of data can be used for CNV analysis for germline DNA. SNP chip data for GWAS - high dimensional continuous data. Sebat et al. 2004; McCarrol and Altshuler 2007; Wellcome trust (Nature 2010). Next generation sequencing data (NGS) - ultra high dimensional discrete data; Medvedev et al., 2009 (Nat. Meth). Methods available: GWAS SNP data: circular binary segmentation (CBS) (Olshen et al, 2004); HMM based method (PennCNV, Wang et al 2007); scanning-based methods (Zhang and Siegmund 2010); Likelihood ratio selection (Jeng, Cai and Li, 2010). NGS data: Most methods are computational. Robust Segment Identification - 5 of 34-1 p. 5

6 CNV analysis based on SNP Chip data Visualization of CNVs 3 copies BBB ABB AAB AAA 4 copies BBBB ABBB AABB Normal BB AB AA AAAB AAAA 7 Robust Segment Identification - 6 of 34-1 p. 6

7 CNV Analysis and Next Generation Sequence Data Sequence Read Depth Analysis Individual sequence Reads Mapping Reference genome Read depth signal Counting mapped reads Zero level Robust Segment Identification - 7 of 34-1 p. 7

8 CNV Analysis and Next Generation Sequence Data Robust Segment Identification - 8 of 34-1 p. 8

9 CNV Analysis and Next Generation Sequence Data Robust Segment Identification - 9 of 34-1 p. 9

10 Statistical challenges Y i : # of read counts covering location i, i = 1,,n; or Y i : # of read counts in 100bp intervals. n is ultra-high, computational challenge. Y i usually does not follow a normal distribution, outliers. Existing methods do not work well when noise distribution is non-gaussian and hard to be estimated. Cauchy distribution: data, LRS, RSI Location index Location index Location index Robust Segment Identification - 10 of 34-1 p. 10

11 Statistical model for read depth data - one sample For a given individual, observe read counts {Y i,i = 1,...,n} with Y i = µ 1 1 {i I1 } +...+µ q 1 {i Iq } +ξ i, 1 i n. (1) n: length of genome (billions); q = q n : unknown number of the signal segments; I = {I 1,...I q }: disjoint intervals representing signal segments with unknown locations; µ 1,...µ q are unknown means ξ i is symmetric at 0 (median(ξ i )=0) and density function h s.t. h(0) > 0, h(y) h(0) Cy 2 in an open nbhd of 0. We want to (a) (detection) test H 0 : I = against H 1 : I, (b) (identification) if the alternative is true, identify each I j I. Robust Segment Identification - 11 of 34-1 p. 11

12 Methods Assuming Gaussian Noise (Jeng et al., 2010) Methods for detecting the presence of segments assuming Gaussian noise: Arias-Castro, Donoho and Huo (2005) Identification methods assuming Gaussian noise: likelihood ratio selector (LRS) (Jeng, Cai and Li 2010 JASA). Key of the LRS: (1) For any given interval as Ĩ {1,2,...,n}, define its likelihood ratio statistic Y(Ĩ) = i Ĩ Y i / Ĩ. (2) Scan the genome with intervals of length L, threshold σ 2log(nL). (3) Identify local maximums by removing the overlapping intervals. Detection boundaries and optimality results are established. Robust Segment Identification - 12 of 34-1 p. 12

13 Non-normal Data - Local median transformation Equally divide the n observations (e.g., counts at each bp) into T = T n groups with m = m n observations in each group. Define kth interval J k = {i : (k 1)m+1 i km} and take median: X k = median(y i : i J k ), η k = median{ξ i : i J k }, 1 k T. We have θ k X k = θ k +η k, 1 k T, = µ j, J k I j for some I j, [0,µ j ], J k I j for some I j and J k I j, = 0, otherwise. Key point: mη k = 1 2h(0) Z k +ζ k, Z k N(0,1),ζ k D 0 fast η k N(0,1/(4h 2 (0)m)). (Brown, Cai and Zhou: AoS 08). Robust Segment Identification - 13 of 34-1 p. 13

14 Robust Segment Detection (RSD) Segment detection: test H 0 : I = vs H 1 : I. For any interval Ĩ, define X(Ĩ) = k / k ĨX Ĩ, and threshold λ n = 2logn/(2h(0) m). The RSD rejects H 0 when maxĩ JT X(Ĩ) > λ n, where J T is the collection of all possible intervals in {1,...,T}. Note the effect of h(0). Robust Segment Identification - 14 of 34-1 p. 14

15 Robust Segment Detection - Type 1 error and power Under the assumed model and median transformation with m = log 1+b n for some b > 0. Type 1 error: For the collection J T of all the possible intervals in {1,...,T}, P H0 (max Ĩ J T X(Ĩ) > λ n ) C logt 0, T. Power: If there exists some segment I j I that satisfies I j /m and µ j I j 2(1+ǫ)logn/(2h(0)) for some ǫ > 0, then RSD has the sum of the probabilities of type I and type II errors going to 0. Robust Segment Identification - 15 of 34-1 p. 15

16 Robust Segment Identifier (RSI) Perform local median transformation with bin size m, get X k = θ k +η k, 1 k T. Set data-driven threshold at λ n = ˆσ 2logn, ˆσ 2 : estimate of Var(η k )(e.g.,mad) Apply LRS (Jeng, Cai and Li, JASA 2010) on X k : select intervals with their likelihood ratio statistics > λ n and achieve local maximums (removing overlapping intervals). only consider short intervals with length L/m, L: max CNV size. Conditions on m and L: m = log 1+b n, s L < d, b > 0, s =length of the longest segment, d =shortest distance between two adjacent segments. Robust Segment Identification - 16 of 34-1 p. 16

17 Theory - consistency Assume the general condition on the background noise ξ i and some sparsity conditions on the signal segments. Define s = min Ij I I j, s = max Ij I I j, and d = min Ij I{distance between I j and I j+1 }, assume s log 2 n and log s = o(logn) and logq = o(logn). If all I j I satisfies I j /m and µ j I j 2(1+ǫ)logn/(2h(0)) for some ǫ > 0, then the RSI with m = log 1+b n for b > 0 and s L < d, is consistent for I, i.e., for some δ n = o(1), P ( Î > 0)+P H0 H 1 (max min D(Î j,i j ) > δ n ) 0 I j I Î j Î Dissimilarity: D(Î,I) = 1 Î I / Î I Robust Segment Identification - 17 of 34-1 p. 17

18 Theory - Optimality If for all I j I, µ j I j 2(1 ǫ)logn/(2h(0)), then no method constructed on X k with m is consistent. Robust Segment Identification - 18 of 34-1 p. 18

19 Comparison with Gaussian noises Compare to the case with Gaussian noise: Assume ξ i N(0,1), then the original GLRT based on Y i is optimal. Further, if I j I s.t. µ j I j 2(1+ǫ n )logn, then the original GLRT is consistent. Possible price for robustness: 2(1+ǫn )logn/(2h(0)) (1+ǫ n )logn Robust Segment Identification - 19 of 34-1 p. 19

20 Simulation studies n = , I = 3, I 1 = 100, I 2 = 40 and I 3 = 20, the signal mean for all segments at µ = 1.0,1.5, and 2.0. Noise is generated from Cauchy(t(1)), t(3), t(30), median transformation m = 20. Frequency Frequency Y X Robust Segment Identification - 20 of 34-1 p. 20

21 Simulation results - robustness n = , I = 3. Noise is generated from t(1),t(3),t(30). Estimation error for I j : D j = minîk Î{ 1 Ij Îk / I j Îk } [0,1]. Number of over-selections:#o = #{Î Î : Î I j =, j = 1,...,q}.. Medians of D j and #O for RSI with m = 20,L = 120, 100 replications. D 1( I1 =100) D 2( I2 =40) D 3( I3 =20) #O t(1) µ = (0.015) 1.000(0.026) 1.000(0.000) 2(0.33) µ = (0.003) 0.184(0.017) 1.000(0.000) 2(0.26) µ = (0.009) 0.150(0.020) 0.423(0.220) 2(0.14) t(3) µ = (0.005) 1.000(0.270) 1.000(0.000) 0(0.00) µ = (0.009) 0.175(0.029) 1.000(0.000) 0(0.00) µ = (0.008) 0.150(0.016) 0.293(0.019) 0(0.00) t(30) µ = (0.014) 1.000(0.320) 1.000(0.000) 0(0.00) µ = (0.012) 0.175(0.021) 1.000(0.245) 0(0.00) µ = (0.010) (0.019) 0.250(0.028) 0(0.00) Robust Segment Identification - 21 of 34-1 p. 21

22 Simulation results - comparison with LRS and CBS Table 1: Both homogeneous and heterogeneous noises are considered. Homogenous noise is generated from the t-distribution with degrees of freedom 1, 3, and 30. Heterogeneous noise is generated from a mixture of N(0,1) and N(0,σ 2 ), where σ Gamma(2,τ). µ is fixed at 2.0. RSI LRS CBS D 2( I2 =40) #O D 2( I2 =40) #O D 2( I2 =40) #O t(1) 0.163(0.024) 2(0.2) 0.340(0.054) 3882(7) 1.000(0.000) 0(0.0) t(3) 0.125(0.028) 0(0.0) 0.025(0.006) 467(4) 1.000(0.000) 0(0.0) t(30) 0.125(0.018) 0(0.0) 0.000(0.001) 2(0) 0.006(0.006) 0(0.0) τ = (0.015) 2(0.4) 0.013(0.005) 37(3) 0.180(0.006) 4(0.6) τ = (0.022) 12(0.6) 0.000(0.006) 227(6) 1.000(0.010) 10(1.1) τ = (0.016) 26(0.8) 0.000(0.006) 461(11) 1.000(0.000) 8(1.1) Robust Segment Identification - 22 of 34-1 p. 22

23 Simulation results - effect of m Table 2: Effect of bin size m on the performance of RSI. µ is fixed at 2. D 1( I1 =100) D 2( I2 =40) D 3( I3 =20) #O t(1) m = (0.009) 0.10(0.018) 0.184(0.033) 19(0.85) m = (0.009) 0.15(0.020) 0.423(0.220) 2(0.14) m = (0.006) 0.25(0.056) 1.000(0.024) 0(0.00) t(3) m = (0.004) 0.088(0.015) 0.150(0.033) 1(0.22) m = (0.008) 0.150(0.016) 0.293(0.019) 0(0.00) m = (0.006) 0.293(0.041) 1.000(0.250) 0(0.00) t(30) m = (0.007) 0.075(0.008) 0.150(0.018) 0(0.00) m = (0.010) 0.175(0.019) 0.250(0.028) 0(0.00) m = (0.008) 0.293(0.035) 1.000(0.094) 0(0.00) Robust Segment Identification - 23 of 34-1 p. 23

24 1000 Genomes Project - NA19240, Chr 19 NA19240: an International HapMap project Yoruban daughter sample and parents, lymphoblastoid cell lines (Nature 2010). Ave 42x, SOLiD, map to the human reference genome (BAM file), ave 2.36 Gb accessed. n = 54,361,060 read counts for Chr 19. Apply RSI with m = 400, L = 60,000. Identified 106 deletions and 63 duplications. Take less than 3 mins. Compare with the CNV map from 1000 Genomes Project based on 185 samples (Mills et al. 2011), 76 overlap with the reported deletions based on 185 low-coverage samples and three methods ( 438, 332 and 615 CNVs). Robust Segment Identification - 24 of 34-1 p. 24

25 Concordant of the Yoruba trio - top ranked CNVs CNVs are inheritable - concordant rates, ranked by µ Î, apply to 3 samples separately. Concordance % Median NB top20 top50 top100 top200 top300 Robust Segment Identification - 25 of 34-1 p. 25

26 Chr 19 sequence data - NA19240 Depth of overage 0e+00 2e+04 4e+04 6e+04 8e+04 1e+05 Depth of overage Genomic location index Genomic location index Possible error in reference genome: multicopy sequences which have been incorrectly assembled and collapsed into a single copy. Robust Segment Identification - 26 of 34-1 p. 26

27 Chr 19 sequence data - NA19240 Depth of overage Depth of overage Genomic location index Genomic location index Robust Segment Identification - 27 of 34-1 p. 27

28 Chr 19 sequence data - NA19240 Depth of overage Depth of overage Genomic location index Genomic location index Robust Segment Identification - 28 of 34-1 p. 28

29 Alternative Approach -Negative Binomial Counts NB model: Y i Negative Binomial(r,p i ), p i = p 0 + q d j 1 (i Ij ). j=1 µ i = rp i 1 p i,σ 2 i = µ2 i r +µ i. Data transformation - mean-matching variance stabilizing transformation (VST) to turn into Gaussian noise problem: divide the n obs into T = T n groups of m = m n obs, for the kth interval, define X k = 2 ˆrln Y i Jk i +1/4 mˆr 1/ i J k Y i +1/4 mˆr 1/2, 1 k T, Robust Segment Identification - 29 of 34-1 p. 29

30 Transformed Data - normally distributed Model for transformed data: X k = 2ln( θ k + r +θ k )+ǫ k +m 1/2 Z k +ξ k, where = r(p 0 +d j )/(1 p 0 d j ), J k I j for some I j, θ k [rp 0 /(1 p 0 ), r(p 0 +d j )/(1 p 0 d j )], J k I j for some I j and J k I j, = rp 0 /(1 p 0 ), otherwise, ǫ k and ξ k are stochastically small, Z k N(0,1). Apply LRS to X k assuming the baseline distribution as ( ) rp0 r N 2ln( + ),1/m. 1 p 0 1 p 0 Robust Segment Identification - 30 of 34-1 p. 30

31 Concordant of the Yoruba trio - top ranked CNVs CNVs are inheritable - concordant rates, ranked by µ Î, apply to 3 samples separately. Concordance % Median NB top20 top50 top100 top200 top300 Robust Segment Identification - 31 of 34-1 p. 31

32 Comments and extensions Cai, Jeng and Li (2012): JRSS(B), in press. Many other complicated factors: repeated regions, complex rearrangments, highly repetitive elements. There is a relationship between GC content and coverage, but this effect is small for false CNVs. Read depths data: difficulty in finding high repetitive CNVs (LINE, SINE), uncertain in CNV location, but can ba applied to paired-end, single-end and mixed data; Paired-end whole genome sequencing data: statistical modeling of anomalous read pairs, can detect highly repetitive CNVs (LINE and SINE), precise location of CNVs; but span distances have effects on resolution. Robust Segment Identification - 32 of 34-1 p. 32

33 Software -RobustCNV C++ software available, with some post-processing steps: CG content adjustment; refining boundaries; filtering CNVs based on read mapping qualities; plots of the CNVs. Robust Segment Identification - 33 of 34-1 p. 33

34 THANKS! Collaborators: Jessie Jeng - Postdoc Yinghua Wu - Postdoc Tony Cai - Statistics Dept John Maris - CHOP. NIH grants support. Robust Segment Identification - 34 of 34-1 p. 34

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

Rate-Optimal Detection of Very Short Signal Segments

Rate-Optimal Detection of Very Short Signal Segments Rate-Optimal Detection of Very Short Signal Segments T. Tony Cai and Ming Yuan University of Pennsylvania and University of Wisconsin-Madison July 6, 2014) Abstract Motivated by a range of applications

More information

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies. Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen mathieu.emily@agrocampus-ouest.fr - Agrocampus

More information

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

Decision Theoretic Classification of Copy-Number-Variation in Cancer Genomes

Decision Theoretic Classification of Copy-Number-Variation in Cancer Genomes Decision Theoretic Classification of Copy-Number-Variation in Cancer Genomes Christopher Holmes (joint work with Chris Yau) Department of Statistics, & Wellcome Trust Centre for Human Genetics, University

More information

Detecting Simultaneous Variant Intervals in Aligned Sequences

Detecting Simultaneous Variant Intervals in Aligned Sequences University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2011 Detecting Simultaneous Variant Intervals in Aligned Sequences David Siegmund Benjamin Yakir Nancy R. Zhang University

More information

Supplementary Information for Discovery and characterization of indel and point mutations

Supplementary Information for Discovery and characterization of indel and point mutations Supplementary Information for Discovery and characterization of indel and point mutations using DeNovoGear Avinash Ramu 1 Michiel J. Noordam 1 Rachel S. Schwartz 2 Arthur Wuster 3 Matthew E. Hurles 3 Reed

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays

Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays Department of Biostatistics Johns Hopkins Bloomberg School of Public Health November 18, 2008 Acknowledgments

More information

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study Rui Wang, Yong Li, XiaoFeng Wang, Haixu Tang and Xiaoyong Zhou Indiana University at Bloomington

More information

Change-Point Detection and Copy Number Variation

Change-Point Detection and Copy Number Variation Change-Point Detection and Copy Number Variation Department of Statistics The Hebrew University IMS, Singapore, 2009 Outline 1 2 3 (CNV) Classical change-point detection At each monitoring period we observe

More information

ADMM Fused Lasso for Copy Number Variation Detection in Human 3 March Genomes / 1

ADMM Fused Lasso for Copy Number Variation Detection in Human 3 March Genomes / 1 ADMM Fused Lasso for Copy Number Variation Detection in Human Genomes Yifei Chen and Jacob Biesinger 3 March 2011 ADMM Fused Lasso for Copy Number Variation Detection in Human 3 March Genomes 2011 1 /

More information

An Integrated Approach for the Assessment of Chromosomal Abnormalities

An Integrated Approach for the Assessment of Chromosomal Abnormalities An Integrated Approach for the Assessment of Chromosomal Abnormalities Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 26, 2007 Karyotypes Karyotypes General Cytogenetics

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures

More information

Optimal Detection of Heterogeneous and Heteroscedastic Mixtures

Optimal Detection of Heterogeneous and Heteroscedastic Mixtures Optimal Detection of Heterogeneous and Heteroscedastic Mixtures T. Tony Cai Department of Statistics, University of Pennsylvania X. Jessie Jeng Department of Biostatistics and Epidemiology, University

More information

Expected complete data log-likelihood and EM

Expected complete data log-likelihood and EM Expected complete data log-likelihood and EM In our EM algorithm, the expected complete data log-likelihood Q is a function of a set of model parameters τ, ie M Qτ = log fb m, r m, g m z m, l m, τ p mz

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

Plug-in Measure-Transformed Quasi Likelihood Ratio Test for Random Signal Detection

Plug-in Measure-Transformed Quasi Likelihood Ratio Test for Random Signal Detection Plug-in Measure-Transformed Quasi Likelihood Ratio Test for Random Signal Detection Nir Halay and Koby Todros Dept. of ECE, Ben-Gurion University of the Negev, Beer-Sheva, Israel February 13, 2017 1 /

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

An Integrated Approach for the Assessment of Chromosomal Abnormalities

An Integrated Approach for the Assessment of Chromosomal Abnormalities An Integrated Approach for the Assessment of Chromosomal Abnormalities Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 6, 2007 Karyotypes Mitosis and Meiosis Meiosis Meiosis

More information

CNV Methods File format v2.0 Software v2.0.0 September, 2011

CNV Methods File format v2.0 Software v2.0.0 September, 2011 File format v2.0 Software v2.0.0 September, 2011 Copyright 2011 Complete Genomics Incorporated. All rights reserved. cpal and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries.

More information

Statistical analysis of genomic binding sites using high-throughput ChIP-seq data

Statistical analysis of genomic binding sites using high-throughput ChIP-seq data Statistical analysis of genomic binding sites using high-throughput ChIP-seq data Ibrahim Ali H Nafisah Department of Statistics University of Leeds Submitted in accordance with the requirments for the

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

Research Statement on Statistics Jun Zhang

Research Statement on Statistics Jun Zhang Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation

More information

DEXSeq paper discussion

DEXSeq paper discussion DEXSeq paper discussion L Collado-Torres December 10th, 2012 1 / 23 1 Background 2 DEXSeq paper 3 Results 2 / 23 Gene Expression 1 Background 1 Source: http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/applexpression.shtml

More information

Miscellanea False discovery rate for scanning statistics

Miscellanea False discovery rate for scanning statistics Biometrika (2011), 98,4,pp. 979 985 C 2011 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asr057 Miscellanea False discovery rate for scanning statistics BY D. O. SIEGMUND, N. R. ZHANG Department

More information

Incorporation of Sparsity Information in Large-scale Multiple Two-sample t Tests

Incorporation of Sparsity Information in Large-scale Multiple Two-sample t Tests Incorporation of Sparsity Information in Large-scale Multiple Two-sample t Tests Weidong Liu October 19, 2014 Abstract Large-scale multiple two-sample Student s t testing problems often arise from the

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

Optimal detection of weak positive dependence between two mixture distributions

Optimal detection of weak positive dependence between two mixture distributions Optimal detection of weak positive dependence between two mixture distributions Sihai Dave Zhao Department of Statistics, University of Illinois at Urbana-Champaign and T. Tony Cai Department of Statistics,

More information

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014 Going Beyond SNPs with Next Genera5on Sequencing Technology 02-223 Personalized Medicine: Understanding Your Own Genome Fall 2014 Next Genera5on Sequencing Technology (NGS) NGS technology Discover more

More information

Post-selection Inference for Changepoint Detection

Post-selection Inference for Changepoint Detection Post-selection Inference for Changepoint Detection Sangwon Hyun (Justin) Dept. of Statistics Advisors: Max G Sell, Ryan Tibshirani Committee: Will Fithian (UC Berkeley), Alessandro Rinaldo, Kathryn Roeder,

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

Supplementary Methods

Supplementary Methods Supplementary Methods Canvas: versatile and scalable detection of copy number variants Eric Roller 1,*,, Sergii Ivakhno 2,, Steve Lee 1,, Thomas Royce 3, and Stephen Tanner 1, 1 Illumina Inc., 5200 Illumina

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

Guarding against Spurious Discoveries in High Dimension. Jianqing Fan

Guarding against Spurious Discoveries in High Dimension. Jianqing Fan in High Dimension Jianqing Fan Princeton University with Wen-Xin Zhou September 30, 2016 Outline 1 Introduction 2 Spurious correlation and random geometry 3 Goodness Of Spurious Fit (GOSF) 4 Asymptotic

More information

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs

More information

Gene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji

Gene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji Gene Regula*on, ChIP- X and DNA Mo*fs Statistics in Genomics Hongkai Ji (hji@jhsph.edu) Genetic information is stored in DNA TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTC

More information

Nature Genetics: doi:0.1038/ng.2768

Nature Genetics: doi:0.1038/ng.2768 Supplementary Figure 1: Graphic representation of the duplicated region at Xq28 in each one of the 31 samples as revealed by acgh. Duplications are represented in red and triplications in blue. Top: Genomic

More information

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30 Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain

More information

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works EM algorithm The example in the book for doing the EM algorithm is rather difficult, and was not available in software at the time that the authors wrote the book, but they implemented a SAS macro to implement

More information

(Genome-wide) association analysis

(Genome-wide) association analysis (Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Network Biology-part II

Network Biology-part II Network Biology-part II Jun Zhu, Ph. D. Professor of Genomics and Genetic Sciences Icahn Institute of Genomics and Multi-scale Biology The Tisch Cancer Institute Icahn Medical School at Mount Sinai New

More information

Methods for Cryptic Structure. Methods for Cryptic Structure

Methods for Cryptic Structure. Methods for Cryptic Structure Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases

More information

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago Submitted to the Annals of Applied Statistics USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA By Xiaoquan Wen and Matthew Stephens University of Chicago Recently-developed

More information

Variant visualisation and quality control

Variant visualisation and quality control Variant visualisation and quality control You really should be making plots! 25/06/14 Paul Theodor Pyl 1 Classical Sequencing Example DNA.BAM.VCF Aligner Variant Caller A single sample sequencing run 25/06/14

More information

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo Friday Harbor 2017 From Genetics to GWAS (Genome-wide Association Study) Sept 7 2017 David Fardo Purpose: prepare for tomorrow s tutorial Genetic Variants Quality Control Imputation Association Visualization

More information

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology

More information

Bayesian Clustering of Multi-Omics

Bayesian Clustering of Multi-Omics Bayesian Clustering of Multi-Omics for Cardiovascular Diseases Nils Strelow 22./23.01.2019 Final Presentation Trends in Bioinformatics WS18/19 Recap Intermediate presentation Precision Medicine Multi-Omics

More information

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014 Overview - 1 Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014 Elizabeth Thompson University of Washington Seattle, WA, USA MWF 8:30-9:20; THO 211 Web page: www.stat.washington.edu/ thompson/stat550/

More information

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017 Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 46 Lecture Overview 1. Variance Component

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Efficient Bayesian mixed model analysis increases association power in large cohorts

Efficient Bayesian mixed model analysis increases association power in large cohorts Linear regression Existing mixed model methods New method: BOLT-LMM Time O(MM) O(MN 2 ) O MN 1.5 Corrects for confounding? Power Efficient Bayesian mixed model analysis increases association power in large

More information

arxiv: v3 [math.st] 15 Oct 2018

arxiv: v3 [math.st] 15 Oct 2018 SEGMENTATION AND ESTIMATION OF CHANGE-POINT MODELS: FALSE POSITIVE CONTROL AND CONFIDENCE REGIONS arxiv:1608.03032v3 [math.st] 15 Oct 2018 Xiao Fang, Jian Li and David Siegmund The Chinese University of

More information

Major Genes, Polygenes, and

Major Genes, Polygenes, and Major Genes, Polygenes, and QTLs Major genes --- genes that have a significant effect on the phenotype Polygenes --- a general term of the genes of small effect that influence a trait QTL, quantitative

More information

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees: MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2 Elizabeth Thompson University of Washington Genetic mapping and linkage lod scores Monte Carlo likelihood and likelihood ratio estimation

More information

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1

27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1 10-708: Probabilistic Graphical Models, Spring 2015 27: Case study with popular GM III Lecturer: Eric P. Xing Scribes: Hyun Ah Song & Elizabeth Silver 1 Introduction: Gene association mapping for complex

More information

Introduction to PLINK H3ABionet Course Covenant University, Nigeria

Introduction to PLINK H3ABionet Course Covenant University, Nigeria UNIVERSITY OF THE WITWATERSRAND, JOHANNESBURG Introduction to PLINK H3ABionet Course Covenant University, Nigeria Scott Hazelhurst H3ABioNet funded by NHGRI grant number U41HG006941 Wits Bioinformatics

More information

A canonical semi-deterministic transducer

A canonical semi-deterministic transducer A canonical semi-deterministic transducer Achilles A. Beros Joint work with Colin de la Higuera Laboratoire d Informatique de Nantes Atlantique, Université de Nantes September 18, 2014 The Results There

More information

arxiv: v1 [stat.ap] 16 Aug 2011

arxiv: v1 [stat.ap] 16 Aug 2011 The Annals of Applied Statistics 2011, Vol. 5, No. 2A, 645 668 DOI: 10.1214/10-AOAS400 c Institute of Mathematical Statistics, 2011 DETECTING SIMULTANEOUS VARIANT INTERVALS IN ALIGNED SEQUENCES arxiv:1108.3177v1

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Genotype Imputation. Class Discussion for January 19, 2016

Genotype Imputation. Class Discussion for January 19, 2016 Genotype Imputation Class Discussion for January 19, 2016 Intuition Patterns of genetic variation in one individual guide our interpretation of the genomes of other individuals Imputation uses previously

More information

GLOBAL TESTING UNDER SPARSE ALTERNATIVES: ANOVA, MULTIPLE COMPARISONS AND THE HIGHER CRITICISM

GLOBAL TESTING UNDER SPARSE ALTERNATIVES: ANOVA, MULTIPLE COMPARISONS AND THE HIGHER CRITICISM Submitted to the Annals of Statistics arxiv: 7.434 GLOBAL TESTING UNDER SPARSE ALTERNATIVES: ANOVA, MULTIPLE COMPARISONS AND THE HIGHER CRITICISM By Ery Arias-Castro Department of Mathematics, University

More information

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics. Bioinformatics Jason H. Moore, Ph.D. Frank Lane Research Scholar in Computational Genetics Associate Professor of Genetics Adjunct Associate Professor of Biological Sciences Adjunct Associate Professor

More information

De novo assembly and genotyping of variants using colored de Bruijn graphs

De novo assembly and genotyping of variants using colored de Bruijn graphs De novo assembly and genotyping of variants using colored de Bruijn graphs Iqbal et al. 2012 Kolmogorov Mikhail 2013 Challenges Detecting genetic variants that are highly divergent from a reference Detecting

More information

Research Article Sample Size Calculation for Controlling False Discovery Proportion

Research Article Sample Size Calculation for Controlling False Discovery Proportion Probability and Statistics Volume 2012, Article ID 817948, 13 pages doi:10.1155/2012/817948 Research Article Sample Size Calculation for Controlling False Discovery Proportion Shulian Shang, 1 Qianhe Zhou,

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Multiscale Systems Engineering Research Group

Multiscale Systems Engineering Research Group Hidden Markov Model Prof. Yan Wang Woodruff School of Mechanical Engineering Georgia Institute of echnology Atlanta, GA 30332, U.S.A. yan.wang@me.gatech.edu Learning Objectives o familiarize the hidden

More information

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel

More information

Optimal detection of heterogeneous and heteroscedastic mixtures

Optimal detection of heterogeneous and heteroscedastic mixtures J. R. Statist. Soc. B (0) 73, Part 5, pp. 69 66 Optimal detection of heterogeneous and heteroscedastic mixtures T. Tony Cai and X. Jessie Jeng University of Pennsylvania, Philadelphia, USA and Jiashun

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

In English, there are at least three different types of entities: letters, words, sentences.

In English, there are at least three different types of entities: letters, words, sentences. Chapter 2 Languages 2.1 Introduction In English, there are at least three different types of entities: letters, words, sentences. letters are from a finite alphabet { a, b, c,..., z } words are made up

More information

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VF Files On-line MatBio 18 Solon P. Pissis and Ahmad Retha King s ollege London 02-Aug-2018 Solon P. Pissis and Ahmad Retha

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

Unit 14: Nonparametric Statistical Methods

Unit 14: Nonparametric Statistical Methods Unit 14: Nonparametric Statistical Methods Statistics 571: Statistical Methods Ramón V. León 8/8/2003 Unit 14 - Stat 571 - Ramón V. León 1 Introductory Remarks Most methods studied so far have been based

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

Approaches for Multiple Disease Mapping: MCAR and SANOVA

Approaches for Multiple Disease Mapping: MCAR and SANOVA Approaches for Multiple Disease Mapping: MCAR and SANOVA Dipankar Bandyopadhyay Division of Biostatistics, University of Minnesota SPH April 22, 2015 1 Adapted from Sudipto Banerjee s notes SANOVA vs MCAR

More information

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors The Multiple Testing Problem Multiple Testing Methods for the Analysis of Microarray Data 3/9/2009 Copyright 2009 Dan Nettleton Suppose one test of interest has been conducted for each of m genes in a

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different

More information

Automata Theory CS F-08 Context-Free Grammars

Automata Theory CS F-08 Context-Free Grammars Automata Theory CS411-2015F-08 Context-Free Grammars David Galles Department of Computer Science University of San Francisco 08-0: Context-Free Grammars Set of Terminals (Σ) Set of Non-Terminals Set of

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Hypothesis testing (cont d)

Hypothesis testing (cont d) Hypothesis testing (cont d) Ulrich Heintz Brown University 4/12/2016 Ulrich Heintz - PHYS 1560 Lecture 11 1 Hypothesis testing Is our hypothesis about the fundamental physics correct? We will not be able

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing So Yeun Kwon, Hwan Young Lee, and Kyoung-Jin Shin Department of Forensic Medicine, Yonsei University College of Medicine, Seoul,

More information

Algebra of Random Variables: Optimal Average and Optimal Scaling Minimising

Algebra of Random Variables: Optimal Average and Optimal Scaling Minimising Review: Optimal Average/Scaling is equivalent to Minimise χ Two 1-parameter models: Estimating < > : Scaling a pattern: Two equivalent methods: Algebra of Random Variables: Optimal Average and Optimal

More information

Allen Holder - Trinity University

Allen Holder - Trinity University Haplotyping - Trinity University Population Problems - joint with Courtney Davis, University of Utah Single Individuals - joint with John Louie, Carrol College, and Lena Sherbakov, Williams University

More information

Frequency Spectra and Inference in Population Genetics

Frequency Spectra and Inference in Population Genetics Frequency Spectra and Inference in Population Genetics Although coalescent models have come to play a central role in population genetics, there are some situations where genealogies may not lead to efficient

More information

Minimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data

Minimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 9-2016 Minimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data T. Tony Cai University

More information

ChIP seq peak calling. Statistical integration between ChIP seq and RNA seq

ChIP seq peak calling. Statistical integration between ChIP seq and RNA seq Institute for Computational Biomedicine ChIP seq peak calling Statistical integration between ChIP seq and RNA seq Olivier Elemento, PhD ChIP-seq to map where transcription factors bind DNA Transcription

More information

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526 Spring Midterm 1. Wednesday February 2, 2011 STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

More information

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre PhD defense Chromosomal rearrangements in mammalian genomes : characterising the breakpoints Claire Lemaitre Laboratoire de Biométrie et Biologie Évolutive Université Claude Bernard Lyon 1 6 novembre 2008

More information

DNA, Chromosomes, and Genes

DNA, Chromosomes, and Genes N, hromosomes, and Genes 1 You have most likely already learned about deoxyribonucleic acid (N), chromosomes, and genes. You have learned that all three of these substances have something to do with heredity

More information

The Admixture Model in Linkage Analysis

The Admixture Model in Linkage Analysis The Admixture Model in Linkage Analysis Jie Peng D. Siegmund Department of Statistics, Stanford University, Stanford, CA 94305 SUMMARY We study an appropriate version of the score statistic to test the

More information