Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis
|
|
- Chloe Melanie Harvey
- 5 years ago
- Views:
Transcription
1 Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis Hongzhe Li University of Pennsylvania Perelman School of Medicine Joint work with Jessie Jeng and Tony Cai Robust Segment Identification - 1 of 34-1 p. 1
2 Genetic variations and complex diseases Commonly observed genetic variations: Single nucleotide variants (SNVs). Small insertions/deletions (InDels). Structure variations, including the copy number variations (CNVs). All are associated with risk of complex diseases. Robust Segment Identification - 2 of 34-1 p. 2
3 Copy number variants (CNVs) Copy Number Variation Many CNVs have functional consequences: alter gene dosage, disrupt genes, or uncover deleterious alleles. NEJM 2007, 356: Robust Segment Identification - 3 of 34-1 p. 3
4 CNV Associations CNVs and Human Diseases Science 315: (2007) Science 316: (2007) Am J Med Genet 144: (2007) Science: in press (2008) Circulation 115: (2007) 4 Robust Segment Identification - 4 of 34-1 p. 4
5 Data available for CNV Analysis, literature Two types of data can be used for CNV analysis for germline DNA. SNP chip data for GWAS - high dimensional continuous data. Sebat et al. 2004; McCarrol and Altshuler 2007; Wellcome trust (Nature 2010). Next generation sequencing data (NGS) - ultra high dimensional discrete data; Medvedev et al., 2009 (Nat. Meth). Methods available: GWAS SNP data: circular binary segmentation (CBS) (Olshen et al, 2004); HMM based method (PennCNV, Wang et al 2007); scanning-based methods (Zhang and Siegmund 2010); Likelihood ratio selection (Jeng, Cai and Li, 2010). NGS data: Most methods are computational. Robust Segment Identification - 5 of 34-1 p. 5
6 CNV analysis based on SNP Chip data Visualization of CNVs 3 copies BBB ABB AAB AAA 4 copies BBBB ABBB AABB Normal BB AB AA AAAB AAAA 7 Robust Segment Identification - 6 of 34-1 p. 6
7 CNV Analysis and Next Generation Sequence Data Sequence Read Depth Analysis Individual sequence Reads Mapping Reference genome Read depth signal Counting mapped reads Zero level Robust Segment Identification - 7 of 34-1 p. 7
8 CNV Analysis and Next Generation Sequence Data Robust Segment Identification - 8 of 34-1 p. 8
9 CNV Analysis and Next Generation Sequence Data Robust Segment Identification - 9 of 34-1 p. 9
10 Statistical challenges Y i : # of read counts covering location i, i = 1,,n; or Y i : # of read counts in 100bp intervals. n is ultra-high, computational challenge. Y i usually does not follow a normal distribution, outliers. Existing methods do not work well when noise distribution is non-gaussian and hard to be estimated. Cauchy distribution: data, LRS, RSI Location index Location index Location index Robust Segment Identification - 10 of 34-1 p. 10
11 Statistical model for read depth data - one sample For a given individual, observe read counts {Y i,i = 1,...,n} with Y i = µ 1 1 {i I1 } +...+µ q 1 {i Iq } +ξ i, 1 i n. (1) n: length of genome (billions); q = q n : unknown number of the signal segments; I = {I 1,...I q }: disjoint intervals representing signal segments with unknown locations; µ 1,...µ q are unknown means ξ i is symmetric at 0 (median(ξ i )=0) and density function h s.t. h(0) > 0, h(y) h(0) Cy 2 in an open nbhd of 0. We want to (a) (detection) test H 0 : I = against H 1 : I, (b) (identification) if the alternative is true, identify each I j I. Robust Segment Identification - 11 of 34-1 p. 11
12 Methods Assuming Gaussian Noise (Jeng et al., 2010) Methods for detecting the presence of segments assuming Gaussian noise: Arias-Castro, Donoho and Huo (2005) Identification methods assuming Gaussian noise: likelihood ratio selector (LRS) (Jeng, Cai and Li 2010 JASA). Key of the LRS: (1) For any given interval as Ĩ {1,2,...,n}, define its likelihood ratio statistic Y(Ĩ) = i Ĩ Y i / Ĩ. (2) Scan the genome with intervals of length L, threshold σ 2log(nL). (3) Identify local maximums by removing the overlapping intervals. Detection boundaries and optimality results are established. Robust Segment Identification - 12 of 34-1 p. 12
13 Non-normal Data - Local median transformation Equally divide the n observations (e.g., counts at each bp) into T = T n groups with m = m n observations in each group. Define kth interval J k = {i : (k 1)m+1 i km} and take median: X k = median(y i : i J k ), η k = median{ξ i : i J k }, 1 k T. We have θ k X k = θ k +η k, 1 k T, = µ j, J k I j for some I j, [0,µ j ], J k I j for some I j and J k I j, = 0, otherwise. Key point: mη k = 1 2h(0) Z k +ζ k, Z k N(0,1),ζ k D 0 fast η k N(0,1/(4h 2 (0)m)). (Brown, Cai and Zhou: AoS 08). Robust Segment Identification - 13 of 34-1 p. 13
14 Robust Segment Detection (RSD) Segment detection: test H 0 : I = vs H 1 : I. For any interval Ĩ, define X(Ĩ) = k / k ĨX Ĩ, and threshold λ n = 2logn/(2h(0) m). The RSD rejects H 0 when maxĩ JT X(Ĩ) > λ n, where J T is the collection of all possible intervals in {1,...,T}. Note the effect of h(0). Robust Segment Identification - 14 of 34-1 p. 14
15 Robust Segment Detection - Type 1 error and power Under the assumed model and median transformation with m = log 1+b n for some b > 0. Type 1 error: For the collection J T of all the possible intervals in {1,...,T}, P H0 (max Ĩ J T X(Ĩ) > λ n ) C logt 0, T. Power: If there exists some segment I j I that satisfies I j /m and µ j I j 2(1+ǫ)logn/(2h(0)) for some ǫ > 0, then RSD has the sum of the probabilities of type I and type II errors going to 0. Robust Segment Identification - 15 of 34-1 p. 15
16 Robust Segment Identifier (RSI) Perform local median transformation with bin size m, get X k = θ k +η k, 1 k T. Set data-driven threshold at λ n = ˆσ 2logn, ˆσ 2 : estimate of Var(η k )(e.g.,mad) Apply LRS (Jeng, Cai and Li, JASA 2010) on X k : select intervals with their likelihood ratio statistics > λ n and achieve local maximums (removing overlapping intervals). only consider short intervals with length L/m, L: max CNV size. Conditions on m and L: m = log 1+b n, s L < d, b > 0, s =length of the longest segment, d =shortest distance between two adjacent segments. Robust Segment Identification - 16 of 34-1 p. 16
17 Theory - consistency Assume the general condition on the background noise ξ i and some sparsity conditions on the signal segments. Define s = min Ij I I j, s = max Ij I I j, and d = min Ij I{distance between I j and I j+1 }, assume s log 2 n and log s = o(logn) and logq = o(logn). If all I j I satisfies I j /m and µ j I j 2(1+ǫ)logn/(2h(0)) for some ǫ > 0, then the RSI with m = log 1+b n for b > 0 and s L < d, is consistent for I, i.e., for some δ n = o(1), P ( Î > 0)+P H0 H 1 (max min D(Î j,i j ) > δ n ) 0 I j I Î j Î Dissimilarity: D(Î,I) = 1 Î I / Î I Robust Segment Identification - 17 of 34-1 p. 17
18 Theory - Optimality If for all I j I, µ j I j 2(1 ǫ)logn/(2h(0)), then no method constructed on X k with m is consistent. Robust Segment Identification - 18 of 34-1 p. 18
19 Comparison with Gaussian noises Compare to the case with Gaussian noise: Assume ξ i N(0,1), then the original GLRT based on Y i is optimal. Further, if I j I s.t. µ j I j 2(1+ǫ n )logn, then the original GLRT is consistent. Possible price for robustness: 2(1+ǫn )logn/(2h(0)) (1+ǫ n )logn Robust Segment Identification - 19 of 34-1 p. 19
20 Simulation studies n = , I = 3, I 1 = 100, I 2 = 40 and I 3 = 20, the signal mean for all segments at µ = 1.0,1.5, and 2.0. Noise is generated from Cauchy(t(1)), t(3), t(30), median transformation m = 20. Frequency Frequency Y X Robust Segment Identification - 20 of 34-1 p. 20
21 Simulation results - robustness n = , I = 3. Noise is generated from t(1),t(3),t(30). Estimation error for I j : D j = minîk Î{ 1 Ij Îk / I j Îk } [0,1]. Number of over-selections:#o = #{Î Î : Î I j =, j = 1,...,q}.. Medians of D j and #O for RSI with m = 20,L = 120, 100 replications. D 1( I1 =100) D 2( I2 =40) D 3( I3 =20) #O t(1) µ = (0.015) 1.000(0.026) 1.000(0.000) 2(0.33) µ = (0.003) 0.184(0.017) 1.000(0.000) 2(0.26) µ = (0.009) 0.150(0.020) 0.423(0.220) 2(0.14) t(3) µ = (0.005) 1.000(0.270) 1.000(0.000) 0(0.00) µ = (0.009) 0.175(0.029) 1.000(0.000) 0(0.00) µ = (0.008) 0.150(0.016) 0.293(0.019) 0(0.00) t(30) µ = (0.014) 1.000(0.320) 1.000(0.000) 0(0.00) µ = (0.012) 0.175(0.021) 1.000(0.245) 0(0.00) µ = (0.010) (0.019) 0.250(0.028) 0(0.00) Robust Segment Identification - 21 of 34-1 p. 21
22 Simulation results - comparison with LRS and CBS Table 1: Both homogeneous and heterogeneous noises are considered. Homogenous noise is generated from the t-distribution with degrees of freedom 1, 3, and 30. Heterogeneous noise is generated from a mixture of N(0,1) and N(0,σ 2 ), where σ Gamma(2,τ). µ is fixed at 2.0. RSI LRS CBS D 2( I2 =40) #O D 2( I2 =40) #O D 2( I2 =40) #O t(1) 0.163(0.024) 2(0.2) 0.340(0.054) 3882(7) 1.000(0.000) 0(0.0) t(3) 0.125(0.028) 0(0.0) 0.025(0.006) 467(4) 1.000(0.000) 0(0.0) t(30) 0.125(0.018) 0(0.0) 0.000(0.001) 2(0) 0.006(0.006) 0(0.0) τ = (0.015) 2(0.4) 0.013(0.005) 37(3) 0.180(0.006) 4(0.6) τ = (0.022) 12(0.6) 0.000(0.006) 227(6) 1.000(0.010) 10(1.1) τ = (0.016) 26(0.8) 0.000(0.006) 461(11) 1.000(0.000) 8(1.1) Robust Segment Identification - 22 of 34-1 p. 22
23 Simulation results - effect of m Table 2: Effect of bin size m on the performance of RSI. µ is fixed at 2. D 1( I1 =100) D 2( I2 =40) D 3( I3 =20) #O t(1) m = (0.009) 0.10(0.018) 0.184(0.033) 19(0.85) m = (0.009) 0.15(0.020) 0.423(0.220) 2(0.14) m = (0.006) 0.25(0.056) 1.000(0.024) 0(0.00) t(3) m = (0.004) 0.088(0.015) 0.150(0.033) 1(0.22) m = (0.008) 0.150(0.016) 0.293(0.019) 0(0.00) m = (0.006) 0.293(0.041) 1.000(0.250) 0(0.00) t(30) m = (0.007) 0.075(0.008) 0.150(0.018) 0(0.00) m = (0.010) 0.175(0.019) 0.250(0.028) 0(0.00) m = (0.008) 0.293(0.035) 1.000(0.094) 0(0.00) Robust Segment Identification - 23 of 34-1 p. 23
24 1000 Genomes Project - NA19240, Chr 19 NA19240: an International HapMap project Yoruban daughter sample and parents, lymphoblastoid cell lines (Nature 2010). Ave 42x, SOLiD, map to the human reference genome (BAM file), ave 2.36 Gb accessed. n = 54,361,060 read counts for Chr 19. Apply RSI with m = 400, L = 60,000. Identified 106 deletions and 63 duplications. Take less than 3 mins. Compare with the CNV map from 1000 Genomes Project based on 185 samples (Mills et al. 2011), 76 overlap with the reported deletions based on 185 low-coverage samples and three methods ( 438, 332 and 615 CNVs). Robust Segment Identification - 24 of 34-1 p. 24
25 Concordant of the Yoruba trio - top ranked CNVs CNVs are inheritable - concordant rates, ranked by µ Î, apply to 3 samples separately. Concordance % Median NB top20 top50 top100 top200 top300 Robust Segment Identification - 25 of 34-1 p. 25
26 Chr 19 sequence data - NA19240 Depth of overage 0e+00 2e+04 4e+04 6e+04 8e+04 1e+05 Depth of overage Genomic location index Genomic location index Possible error in reference genome: multicopy sequences which have been incorrectly assembled and collapsed into a single copy. Robust Segment Identification - 26 of 34-1 p. 26
27 Chr 19 sequence data - NA19240 Depth of overage Depth of overage Genomic location index Genomic location index Robust Segment Identification - 27 of 34-1 p. 27
28 Chr 19 sequence data - NA19240 Depth of overage Depth of overage Genomic location index Genomic location index Robust Segment Identification - 28 of 34-1 p. 28
29 Alternative Approach -Negative Binomial Counts NB model: Y i Negative Binomial(r,p i ), p i = p 0 + q d j 1 (i Ij ). j=1 µ i = rp i 1 p i,σ 2 i = µ2 i r +µ i. Data transformation - mean-matching variance stabilizing transformation (VST) to turn into Gaussian noise problem: divide the n obs into T = T n groups of m = m n obs, for the kth interval, define X k = 2 ˆrln Y i Jk i +1/4 mˆr 1/ i J k Y i +1/4 mˆr 1/2, 1 k T, Robust Segment Identification - 29 of 34-1 p. 29
30 Transformed Data - normally distributed Model for transformed data: X k = 2ln( θ k + r +θ k )+ǫ k +m 1/2 Z k +ξ k, where = r(p 0 +d j )/(1 p 0 d j ), J k I j for some I j, θ k [rp 0 /(1 p 0 ), r(p 0 +d j )/(1 p 0 d j )], J k I j for some I j and J k I j, = rp 0 /(1 p 0 ), otherwise, ǫ k and ξ k are stochastically small, Z k N(0,1). Apply LRS to X k assuming the baseline distribution as ( ) rp0 r N 2ln( + ),1/m. 1 p 0 1 p 0 Robust Segment Identification - 30 of 34-1 p. 30
31 Concordant of the Yoruba trio - top ranked CNVs CNVs are inheritable - concordant rates, ranked by µ Î, apply to 3 samples separately. Concordance % Median NB top20 top50 top100 top200 top300 Robust Segment Identification - 31 of 34-1 p. 31
32 Comments and extensions Cai, Jeng and Li (2012): JRSS(B), in press. Many other complicated factors: repeated regions, complex rearrangments, highly repetitive elements. There is a relationship between GC content and coverage, but this effect is small for false CNVs. Read depths data: difficulty in finding high repetitive CNVs (LINE, SINE), uncertain in CNV location, but can ba applied to paired-end, single-end and mixed data; Paired-end whole genome sequencing data: statistical modeling of anomalous read pairs, can detect highly repetitive CNVs (LINE and SINE), precise location of CNVs; but span distances have effects on resolution. Robust Segment Identification - 32 of 34-1 p. 32
33 Software -RobustCNV C++ software available, with some post-processing steps: CG content adjustment; refining boundaries; filtering CNVs based on read mapping qualities; plots of the CNVs. Robust Segment Identification - 33 of 34-1 p. 33
34 THANKS! Collaborators: Jessie Jeng - Postdoc Yinghua Wu - Postdoc Tony Cai - Statistics Dept John Maris - CHOP. NIH grants support. Robust Segment Identification - 34 of 34-1 p. 34
Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations
Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem
More informationRate-Optimal Detection of Very Short Signal Segments
Rate-Optimal Detection of Very Short Signal Segments T. Tony Cai and Ming Yuan University of Pennsylvania and University of Wisconsin-Madison July 6, 2014) Abstract Motivated by a range of applications
More informationTheoretical and computational aspects of association tests: application in case-control genome-wide association studies.
Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen mathieu.emily@agrocampus-ouest.fr - Agrocampus
More informationAssociation Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5
Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative
More informationDecision Theoretic Classification of Copy-Number-Variation in Cancer Genomes
Decision Theoretic Classification of Copy-Number-Variation in Cancer Genomes Christopher Holmes (joint work with Chris Yau) Department of Statistics, & Wellcome Trust Centre for Human Genetics, University
More informationDetecting Simultaneous Variant Intervals in Aligned Sequences
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2011 Detecting Simultaneous Variant Intervals in Aligned Sequences David Siegmund Benjamin Yakir Nancy R. Zhang University
More informationSupplementary Information for Discovery and characterization of indel and point mutations
Supplementary Information for Discovery and characterization of indel and point mutations using DeNovoGear Avinash Ramu 1 Michiel J. Noordam 1 Rachel S. Schwartz 2 Arthur Wuster 3 Matthew E. Hurles 3 Reed
More informationGenotype Imputation. Biostatistics 666
Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives
More informationHidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays
Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays Department of Biostatistics Johns Hopkins Bloomberg School of Public Health November 18, 2008 Acknowledgments
More informationLearning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study
Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study Rui Wang, Yong Li, XiaoFeng Wang, Haixu Tang and Xiaoyong Zhou Indiana University at Bloomington
More informationChange-Point Detection and Copy Number Variation
Change-Point Detection and Copy Number Variation Department of Statistics The Hebrew University IMS, Singapore, 2009 Outline 1 2 3 (CNV) Classical change-point detection At each monitoring period we observe
More informationADMM Fused Lasso for Copy Number Variation Detection in Human 3 March Genomes / 1
ADMM Fused Lasso for Copy Number Variation Detection in Human Genomes Yifei Chen and Jacob Biesinger 3 March 2011 ADMM Fused Lasso for Copy Number Variation Detection in Human 3 March Genomes 2011 1 /
More informationAn Integrated Approach for the Assessment of Chromosomal Abnormalities
An Integrated Approach for the Assessment of Chromosomal Abnormalities Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 26, 2007 Karyotypes Karyotypes General Cytogenetics
More informationAlgorithmisches Lernen/Machine Learning
Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines
More informationHybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS
Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures
More informationOptimal Detection of Heterogeneous and Heteroscedastic Mixtures
Optimal Detection of Heterogeneous and Heteroscedastic Mixtures T. Tony Cai Department of Statistics, University of Pennsylvania X. Jessie Jeng Department of Biostatistics and Epidemiology, University
More informationExpected complete data log-likelihood and EM
Expected complete data log-likelihood and EM In our EM algorithm, the expected complete data log-likelihood Q is a function of a set of model parameters τ, ie M Qτ = log fb m, r m, g m z m, l m, τ p mz
More informationProbability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies
Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National
More informationPlug-in Measure-Transformed Quasi Likelihood Ratio Test for Random Signal Detection
Plug-in Measure-Transformed Quasi Likelihood Ratio Test for Random Signal Detection Nir Halay and Koby Todros Dept. of ECE, Ben-Gurion University of the Negev, Beer-Sheva, Israel February 13, 2017 1 /
More informationProportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power
Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion
More informationAn Integrated Approach for the Assessment of Chromosomal Abnormalities
An Integrated Approach for the Assessment of Chromosomal Abnormalities Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 6, 2007 Karyotypes Mitosis and Meiosis Meiosis Meiosis
More informationCNV Methods File format v2.0 Software v2.0.0 September, 2011
File format v2.0 Software v2.0.0 September, 2011 Copyright 2011 Complete Genomics Incorporated. All rights reserved. cpal and DNB are trademarks of Complete Genomics, Inc. in the US and certain other countries.
More informationStatistical analysis of genomic binding sites using high-throughput ChIP-seq data
Statistical analysis of genomic binding sites using high-throughput ChIP-seq data Ibrahim Ali H Nafisah Department of Statistics University of Leeds Submitted in accordance with the requirments for the
More informationLinear Regression (1/1/17)
STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression
More informationResearch Statement on Statistics Jun Zhang
Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation
More informationDEXSeq paper discussion
DEXSeq paper discussion L Collado-Torres December 10th, 2012 1 / 23 1 Background 2 DEXSeq paper 3 Results 2 / 23 Gene Expression 1 Background 1 Source: http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/applexpression.shtml
More informationMiscellanea False discovery rate for scanning statistics
Biometrika (2011), 98,4,pp. 979 985 C 2011 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asr057 Miscellanea False discovery rate for scanning statistics BY D. O. SIEGMUND, N. R. ZHANG Department
More informationIncorporation of Sparsity Information in Large-scale Multiple Two-sample t Tests
Incorporation of Sparsity Information in Large-scale Multiple Two-sample t Tests Weidong Liu October 19, 2014 Abstract Large-scale multiple two-sample Student s t testing problems often arise from the
More informationLecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017
Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping
More informationOptimal detection of weak positive dependence between two mixture distributions
Optimal detection of weak positive dependence between two mixture distributions Sihai Dave Zhao Department of Statistics, University of Illinois at Urbana-Champaign and T. Tony Cai Department of Statistics,
More informationGoing Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014
Going Beyond SNPs with Next Genera5on Sequencing Technology 02-223 Personalized Medicine: Understanding Your Own Genome Fall 2014 Next Genera5on Sequencing Technology (NGS) NGS technology Discover more
More informationPost-selection Inference for Changepoint Detection
Post-selection Inference for Changepoint Detection Sangwon Hyun (Justin) Dept. of Statistics Advisors: Max G Sell, Ryan Tibshirani Committee: Will Fithian (UC Berkeley), Alessandro Rinaldo, Kathryn Roeder,
More informationLecture 9. QTL Mapping 2: Outbred Populations
Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred
More informationSupplementary Methods
Supplementary Methods Canvas: versatile and scalable detection of copy number variants Eric Roller 1,*,, Sergii Ivakhno 2,, Steve Lee 1,, Thomas Royce 3, and Stephen Tanner 1, 1 Illumina Inc., 5200 Illumina
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More informationComputational Systems Biology: Biology X
Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,
More informationIntroduction to Probability and Statistics (Continued)
Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:
More informationGuarding against Spurious Discoveries in High Dimension. Jianqing Fan
in High Dimension Jianqing Fan Princeton University with Wen-Xin Zhou September 30, 2016 Outline 1 Introduction 2 Spurious correlation and random geometry 3 Goodness Of Spurious Fit (GOSF) 4 Asymptotic
More informationHumans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase
Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs
More informationGene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji
Gene Regula*on, ChIP- X and DNA Mo*fs Statistics in Genomics Hongkai Ji (hji@jhsph.edu) Genetic information is stored in DNA TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTC
More informationNature Genetics: doi:0.1038/ng.2768
Supplementary Figure 1: Graphic representation of the duplicated region at Xq28 in each one of the 31 samples as revealed by acgh. Duplications are represented in red and triplications in blue. Top: Genomic
More informationProblem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30
Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain
More informationEM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works
EM algorithm The example in the book for doing the EM algorithm is rather difficult, and was not available in software at the time that the authors wrote the book, but they implemented a SAS macro to implement
More information(Genome-wide) association analysis
(Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by
More informationStochastic processes and
Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University
More informationNetwork Biology-part II
Network Biology-part II Jun Zhu, Ph. D. Professor of Genomics and Genetic Sciences Icahn Institute of Genomics and Multi-scale Biology The Tisch Cancer Institute Icahn Medical School at Mount Sinai New
More informationMethods for Cryptic Structure. Methods for Cryptic Structure
Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases
More informationUSING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago
Submitted to the Annals of Applied Statistics USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA By Xiaoquan Wen and Matthew Stephens University of Chicago Recently-developed
More informationVariant visualisation and quality control
Variant visualisation and quality control You really should be making plots! 25/06/14 Paul Theodor Pyl 1 Classical Sequencing Example DNA.BAM.VCF Aligner Variant Caller A single sample sequencing run 25/06/14
More informationFriday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo
Friday Harbor 2017 From Genetics to GWAS (Genome-wide Association Study) Sept 7 2017 David Fardo Purpose: prepare for tomorrow s tutorial Genetic Variants Quality Control Imputation Association Visualization
More informationA PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS
A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology
More informationBayesian Clustering of Multi-Omics
Bayesian Clustering of Multi-Omics for Cardiovascular Diseases Nils Strelow 22./23.01.2019 Final Presentation Trends in Bioinformatics WS18/19 Recap Intermediate presentation Precision Medicine Multi-Omics
More informationStatistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014
Overview - 1 Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014 Elizabeth Thompson University of Washington Seattle, WA, USA MWF 8:30-9:20; THO 211 Web page: www.stat.washington.edu/ thompson/stat550/
More informationLecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017
Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 46 Lecture Overview 1. Variance Component
More informationSTA 216, GLM, Lecture 16. October 29, 2007
STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural
More informationEfficient Bayesian mixed model analysis increases association power in large cohorts
Linear regression Existing mixed model methods New method: BOLT-LMM Time O(MM) O(MN 2 ) O MN 1.5 Corrects for confounding? Power Efficient Bayesian mixed model analysis increases association power in large
More informationarxiv: v3 [math.st] 15 Oct 2018
SEGMENTATION AND ESTIMATION OF CHANGE-POINT MODELS: FALSE POSITIVE CONTROL AND CONFIDENCE REGIONS arxiv:1608.03032v3 [math.st] 15 Oct 2018 Xiao Fang, Jian Li and David Siegmund The Chinese University of
More informationMajor Genes, Polygenes, and
Major Genes, Polygenes, and QTLs Major genes --- genes that have a significant effect on the phenotype Polygenes --- a general term of the genes of small effect that influence a trait QTL, quantitative
More informationTutorial Session 2. MCMC for the analysis of genetic data on pedigrees:
MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2 Elizabeth Thompson University of Washington Genetic mapping and linkage lod scores Monte Carlo likelihood and likelihood ratio estimation
More information27: Case study with popular GM III. 1 Introduction: Gene association mapping for complex diseases 1
10-708: Probabilistic Graphical Models, Spring 2015 27: Case study with popular GM III Lecturer: Eric P. Xing Scribes: Hyun Ah Song & Elizabeth Silver 1 Introduction: Gene association mapping for complex
More informationIntroduction to PLINK H3ABionet Course Covenant University, Nigeria
UNIVERSITY OF THE WITWATERSRAND, JOHANNESBURG Introduction to PLINK H3ABionet Course Covenant University, Nigeria Scott Hazelhurst H3ABioNet funded by NHGRI grant number U41HG006941 Wits Bioinformatics
More informationA canonical semi-deterministic transducer
A canonical semi-deterministic transducer Achilles A. Beros Joint work with Colin de la Higuera Laboratoire d Informatique de Nantes Atlantique, Université de Nantes September 18, 2014 The Results There
More informationarxiv: v1 [stat.ap] 16 Aug 2011
The Annals of Applied Statistics 2011, Vol. 5, No. 2A, 645 668 DOI: 10.1214/10-AOAS400 c Institute of Mathematical Statistics, 2011 DETECTING SIMULTANEOUS VARIANT INTERVALS IN ALIGNED SEQUENCES arxiv:1108.3177v1
More informationData Mining Techniques
Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!
More informationGenotype Imputation. Class Discussion for January 19, 2016
Genotype Imputation Class Discussion for January 19, 2016 Intuition Patterns of genetic variation in one individual guide our interpretation of the genomes of other individuals Imputation uses previously
More informationGLOBAL TESTING UNDER SPARSE ALTERNATIVES: ANOVA, MULTIPLE COMPARISONS AND THE HIGHER CRITICISM
Submitted to the Annals of Statistics arxiv: 7.434 GLOBAL TESTING UNDER SPARSE ALTERNATIVES: ANOVA, MULTIPLE COMPARISONS AND THE HIGHER CRITICISM By Ery Arias-Castro Department of Mathematics, University
More informationBioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.
Bioinformatics Jason H. Moore, Ph.D. Frank Lane Research Scholar in Computational Genetics Associate Professor of Genetics Adjunct Associate Professor of Biological Sciences Adjunct Associate Professor
More informationDe novo assembly and genotyping of variants using colored de Bruijn graphs
De novo assembly and genotyping of variants using colored de Bruijn graphs Iqbal et al. 2012 Kolmogorov Mikhail 2013 Challenges Detecting genetic variants that are highly divergent from a reference Detecting
More informationResearch Article Sample Size Calculation for Controlling False Discovery Proportion
Probability and Statistics Volume 2012, Article ID 817948, 13 pages doi:10.1155/2012/817948 Research Article Sample Size Calculation for Controlling False Discovery Proportion Shulian Shang, 1 Qianhe Zhou,
More informationCalculation of IBD probabilities
Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationMultiscale Systems Engineering Research Group
Hidden Markov Model Prof. Yan Wang Woodruff School of Mechanical Engineering Georgia Institute of echnology Atlanta, GA 30332, U.S.A. yan.wang@me.gatech.edu Learning Objectives o familiarize the hidden
More informationModel-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate
Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel
More informationOptimal detection of heterogeneous and heteroscedastic mixtures
J. R. Statist. Soc. B (0) 73, Part 5, pp. 69 66 Optimal detection of heterogeneous and heteroscedastic mixtures T. Tony Cai and X. Jessie Jeng University of Pennsylvania, Philadelphia, USA and Jiashun
More informationAdditive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.
Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then
More information1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics
1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
More informationBi-level feature selection with applications to genetic association
Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may
More informationIn English, there are at least three different types of entities: letters, words, sentences.
Chapter 2 Languages 2.1 Introduction In English, there are at least three different types of entities: letters, words, sentences. letters are from a finite alphabet { a, b, c,..., z } words are made up
More informationDictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line
Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VF Files On-line MatBio 18 Solon P. Pissis and Ahmad Retha King s ollege London 02-Aug-2018 Solon P. Pissis and Ahmad Retha
More informationParametric Empirical Bayes Methods for Microarrays
Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions
More informationUnit 14: Nonparametric Statistical Methods
Unit 14: Nonparametric Statistical Methods Statistics 571: Statistical Methods Ramón V. León 8/8/2003 Unit 14 - Stat 571 - Ramón V. León 1 Introductory Remarks Most methods studied so far have been based
More informationMODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES
MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by
More informationApproaches for Multiple Disease Mapping: MCAR and SANOVA
Approaches for Multiple Disease Mapping: MCAR and SANOVA Dipankar Bandyopadhyay Division of Biostatistics, University of Minnesota SPH April 22, 2015 1 Adapted from Sudipto Banerjee s notes SANOVA vs MCAR
More informationTable of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors
The Multiple Testing Problem Multiple Testing Methods for the Analysis of Microarray Data 3/9/2009 Copyright 2009 Dan Nettleton Suppose one test of interest has been conducted for each of m genes in a
More informationData Preprocessing. Data Preprocessing
Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different
More informationAutomata Theory CS F-08 Context-Free Grammars
Automata Theory CS411-2015F-08 Context-Free Grammars David Galles Department of Computer Science University of San Francisco 08-0: Context-Free Grammars Set of Terminals (Σ) Set of Non-Terminals Set of
More informationComputational Biology
Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,
More informationHypothesis testing (cont d)
Hypothesis testing (cont d) Ulrich Heintz Brown University 4/12/2016 Ulrich Heintz - PHYS 1560 Lecture 11 1 Hypothesis testing Is our hypothesis about the fundamental physics correct? We will not be able
More information10708 Graphical Models: Homework 2
10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves
More informationAnalysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing
Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing So Yeun Kwon, Hwan Young Lee, and Kyoung-Jin Shin Department of Forensic Medicine, Yonsei University College of Medicine, Seoul,
More informationAlgebra of Random Variables: Optimal Average and Optimal Scaling Minimising
Review: Optimal Average/Scaling is equivalent to Minimise χ Two 1-parameter models: Estimating < > : Scaling a pattern: Two equivalent methods: Algebra of Random Variables: Optimal Average and Optimal
More informationAllen Holder - Trinity University
Haplotyping - Trinity University Population Problems - joint with Courtney Davis, University of Utah Single Individuals - joint with John Louie, Carrol College, and Lena Sherbakov, Williams University
More informationFrequency Spectra and Inference in Population Genetics
Frequency Spectra and Inference in Population Genetics Although coalescent models have come to play a central role in population genetics, there are some situations where genealogies may not lead to efficient
More informationMinimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 9-2016 Minimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data T. Tony Cai University
More informationChIP seq peak calling. Statistical integration between ChIP seq and RNA seq
Institute for Computational Biomedicine ChIP seq peak calling Statistical integration between ChIP seq and RNA seq Olivier Elemento, PhD ChIP-seq to map where transcription factors bind DNA Transcription
More informationSTAT 526 Spring Midterm 1. Wednesday February 2, 2011
STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points
More informationChromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre
PhD defense Chromosomal rearrangements in mammalian genomes : characterising the breakpoints Claire Lemaitre Laboratoire de Biométrie et Biologie Évolutive Université Claude Bernard Lyon 1 6 novembre 2008
More informationDNA, Chromosomes, and Genes
N, hromosomes, and Genes 1 You have most likely already learned about deoxyribonucleic acid (N), chromosomes, and genes. You have learned that all three of these substances have something to do with heredity
More informationThe Admixture Model in Linkage Analysis
The Admixture Model in Linkage Analysis Jie Peng D. Siegmund Department of Statistics, Stanford University, Stanford, CA 94305 SUMMARY We study an appropriate version of the score statistic to test the
More information