Supplementary Information for Discovery and characterization of indel and point mutations
Supplementary Information for
Discovery and characterization of indel and point mutations using DeNovoGear

Avinash Ramu (1), Michiel J. Noordam (1), Rachel S. Schwartz (2), Arthur Wuster (3), Matthew E. Hurles (3), Reed A. Cartwright (2,4), Donald F. Conrad (1,5)

(1) Department of Genetics, (5) Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO 63110, USA
(2) Center for Evolutionary Medicine and Informatics, The Biodesign Institute, (4) School of Life Sciences, Arizona State University, Tempe, AZ, USA
(3) Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Supplementary Figures
Supplementary Figure 1
[Figure: panels A and B plot Sensitivity against FDR, with a legend for Prior (1e-12, 1e-10, 1e-08, 1e-06, 1e[...]).]
Effect of different mutation rate prior values on de novo SNP calling on the Whole Genome Sequencing dataset. Sensitivity and False Discovery Rates for each mutation prior were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call set as reference. Germline sensitivity was defined as the proportion of true positive (validated) germline mutations that were called by DeNovoGear, whereas somatic sensitivity was defined as the proportion of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive out of the total number of calls that were validated.
Supplementary Figure 2
[Figure: panels A and B plot Sensitivity against FDR, with a legend for Prior (1e-12, 1e-10, 1e-08, 1e-06, 1e[...]).]
Effect of different mutation rate prior values on de novo SNP calling on the Whole Exome Sequencing dataset. Sensitivity and False Discovery Rates for each mutation prior were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call set as reference. Germline sensitivity was defined as the proportion of true positive (validated) germline mutations that were called by DeNovoGear, whereas somatic sensitivity was defined as the proportion of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive out of the total number of calls that were validated.
Supplementary Figure 3
[Figure: panels A and B plot Sensitivity against FDR, with a legend for Caller (Denovogear, GATK, NaiveCaller, Polymutt, Samtools).]
Comparison of the results from several de novo SNP callers on the Whole Genome Sequencing dataset. Sensitivity and False Discovery Rates for each caller were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call set as reference. Germline sensitivity was defined as the proportion of true positive (validated) germline mutations that were called by each caller, whereas somatic sensitivity was defined as the proportion of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive out of the total number of calls that were validated.
Supplementary Figure 4
[Figure: panels A and B plot Sensitivity against FDR, with a legend for Caller (Denovogear, Denovogear BB, GATK, NaiveCaller, Polymutt, Samtools).]
Comparison of the results from several de novo SNP callers on the Whole Exome Sequencing dataset. Sensitivity and False Discovery Rates for each caller were calculated for (a) germline and (b) somatic SNPs using the validated 1000 Genomes call set as reference. Germline sensitivity was defined as the proportion of true positive (validated) germline mutations that were called by each caller, whereas somatic sensitivity was defined as the proportion of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive out of the total number of calls that were validated.
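The sensitivity and False Discovery Rate definitions used in Supplementary Figures 1 to 4 can be illustrated with a short sketch. The function names and all counts below are hypothetical and chosen only for illustration; they are not taken from DeNovoGear or from the supplementary tables.

```python
def sensitivity(true_sites_called, true_sites_total):
    """Proportion of validated true mutations that the caller recovered."""
    if true_sites_total <= 0:
        raise ValueError("need at least one validated true mutation")
    return true_sites_called / true_sites_total


def false_discovery_rate(false_positive_calls, validated_calls):
    """Proportion of validated calls that turned out to be false positives."""
    if validated_calls <= 0:
        raise ValueError("need at least one validated call")
    return false_positive_calls / validated_calls


# Hypothetical counts for one caller at one posterior probability cutoff.
germline_called, germline_total = 45, 50
somatic_called, somatic_total = 600, 1000
false_positive_calls = 355

# Calls with a validation outcome: germline, somatic, or false positive.
validated_calls = germline_called + somatic_called + false_positive_calls

germline_sensitivity = sensitivity(germline_called, germline_total)
somatic_sensitivity = sensitivity(somatic_called, somatic_total)
fdr = false_discovery_rate(false_positive_calls, validated_calls)
print(germline_sensitivity, somatic_sensitivity, fdr)
```

Note that the FDR denominator only includes calls that have a validation outcome, matching the caption's "total number of calls that were validated".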
Supplementary Figure 5
[Figure: panel A plots Count against Length of Indel; panels B and C plot Count against Length of Insertion and Length of Deletion.]
The frequency distribution of indel lengths in the 1000 Genomes phase 1 dataset. The counts of (a) all indels of different lengths are plotted, as well as the separate counts for (b) insertions and (c) deletions.
Supplementary Figure 6
[Figure: panels A and B plot log(µ) against Length of Insertion and Length of Deletion, showing the Empirical values and the Log linear fit.]
A log-linear model is used to estimate the mutation rate given the length of an indel. Separate models were estimated for (a) insertions and (b) deletions. The fitted values of the model are plotted against the logarithm of the mutation rate. Smaller indels have a higher mutation rate than larger ones.
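As a minimal sketch of the log-linear model in Supplementary Figure 6, the following fits log(µ) = intercept + slope × length by ordinary least squares. The fitting method, the use of the natural logarithm, the function names and the synthetic rates are all assumptions made for illustration; the fitting procedure actually used is not described in this section.

```python
import math


def fit_log_linear(lengths, rates):
    """Least squares fit of log(rate) = intercept + slope * length."""
    log_rates = [math.log(rate) for rate in rates]
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(log_rates) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, log_rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_rate(intercept, slope, length):
    """Mutation rate predicted by the fitted model for a given indel length."""
    return math.exp(intercept + slope * length)


# Synthetic rates generated from a known line (hypothetical values).
lengths = list(range(1, 11))
rates = [math.exp(-20.0 - 0.3 * length) for length in lengths]

intercept, slope = fit_log_linear(lengths, rates)
print(intercept, slope)
print(predict_rate(intercept, slope, 1) > predict_rate(intercept, slope, 5))
```

A negative slope corresponds to the pattern described in the caption, where shorter indels have a higher rate than longer ones.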
Supplementary Figure 7
De novo indel artifact 1. In this class of artifact, an indel is clearly present in one of the parents but has been placed with a slightly different alignment than in the child. This class can be avoided by filtering out candidate DNMs that overlap an indel call in the parent.
Supplementary Figure 8
De novo indel artifact 2. In this class of artifact, an indel is clearly present in both parents with the same breakpoints, yet has not been called in either. This undercalling in the parents could be addressed either by using an alternative likelihood function for indel genotypes, as described in the main text, or by filtering sites where some indel reads are observed in the parents.
Supplementary Figure 9
De novo indel artifact 3. In this class of artifact, a high frequency of reads with a non-reference allele is seen in both parents. Many of these alternate base calls occupy the first or last position in the read, suggesting that both the insertion in the child and the SNPs in the parents are alignment artifacts, possibly caused by the presence of a large structural variant with a breakpoint in this region. This class of variant could be avoided by filtering sites with a high frequency of non-reference reads in one or both parents. This could be done strictly spanning the indel call or within a small window of the indel (say plus or minus 10 bp).
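The filters suggested for artifact classes 1 and 3 can be sketched as follows. This is an illustration under stated assumptions, not DeNovoGear code: the function names, the tuple layouts and the 0.1 non-reference fraction threshold are hypothetical, and the parental read counts are assumed to have been tallied beforehand over whichever window around the indel is chosen (for example plus or minus 10 bp).

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """True when two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def overlaps_parental_indel(candidate, parental_indels):
    """Artifact 1: the candidate overlaps an indel called in a parent."""
    chrom, start, end = candidate
    return any(
        chrom == p_chrom and intervals_overlap(start, end, p_start, p_end)
        for p_chrom, p_start, p_end in parental_indels
    )


def high_parental_nonref(parent_read_counts, max_fraction):
    """Artifact 3: a parent carries many non-reference reads near the site."""
    for ref_reads, nonref_reads in parent_read_counts:
        total = ref_reads + nonref_reads
        if total > 0 and nonref_reads / total > max_fraction:
            return True
    return False


def keep_candidate(candidate, parental_indels, parent_read_counts,
                   max_fraction=0.1):
    """Keep a candidate DNM only if it passes both filters."""
    if overlaps_parental_indel(candidate, parental_indels):
        return False
    return not high_parental_nonref(parent_read_counts, max_fraction)


# Hypothetical candidate: (chromosome, start, end).
candidate = ("3", 100, 103)
print(keep_candidate(candidate, [("3", 102, 105)], [(30, 0), (30, 0)]))
print(keep_candidate(candidate, [], [(30, 0), (29, 1)]))
```

Artifact class 2 would need either a different genotype likelihood or a filter on indel-supporting reads in the parents, which this sketch does not cover.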
Supplementary Figure 10
[Figure: panel A shows an ideogram of chromosome 3; panel B shows the forward strand sequence and reverse strand sequence traces, with coordinate ticks 102,153,[...] / [...],153,[...] / [...],153,406.]
Example of a de novo indel called by DNG and confirmed by Sanger sequencing. A) Location of the indel is indicated by a red box. B) At this specific location, [...] has a 3-base-pair deletion, namely a deletion of the bases T, T and A (see left box), which results in a double sequence read on the forward strand (middle box) and the reverse strand (right box) that both start at the site of this indel. C) [...] and [...] do not have an indel at this position and consequently have single reads on both strands. We confirmed 53/56 (95%) de novo indel predictions in this family by Sanger sequencing (summary statistics of the predicted de novo indels are in Supplementary Table 7, a full list of de novo indels in Supplementary Table 9, and validation plots for all calls in Supplementary Figure 11).
Supplementary Figure 11: Validation Plots
The following pages depict (a) IGV screenshots of next-generation sequencing data and (b) Sanger sequencing traces for all candidate de novo indels for which we attempted validation. See the legend of Supplementary Figure 10 and the main text for additional details. Within the True Positive class, the figures are ordered by chromosome and position of the de novo indel. All coordinates are with respect to NCBI37.
[Validation plots, one per page; the chromosome and coordinate tick labels of each plot are listed below.]
chromosome 17; 28,388,993 / 28,389,003 / 28,389,012. E7: A site for which primers could not be designed
chromosome 15; 74,185,293 / 74,185,303 / 74,185,312. A7: A false positive site
chromosome 17; 2,120,331 / 2,120,341 / 2,120,350. B7: A false positive site
chromosome 2; 69,058,980 / 69,058,990 / 69,058,999. G1: A false positive site
chromosome 1; 36,060,969 / 36,060,979 / 36,060,988
chromosome 1; 62,752,481 / 62,752,491 / 62,752,500
chromosome 1; 79,985,409 / 79,985,409 / 79,985,418
chromosome 1; 85,896,648 / 85,896,658 / 85,896,667
chromosome 1; 215,772,[...] / [...],772,[...] / [...],772,765
chromosome 2; 56,479,693 / 56,479,703 / 56,479,712
chromosome 2; 69,058,980 / 69,058,990 / 69,058,999
chromosome 2; 83,119,261 / 83,119,271 / 83,119,280
chromosome 2; 124,110,[...] / [...],110,[...] / [...],110,179
chromosome 2; 128,809,[...] / [...],809,[...] / [...],809,744
chromosome 2; 140,541,[...] / [...],541,[...] / [...],541,050
chromosome 2; 197,724,[...] / [...],724,[...] / [...],724,636
chromosome 2; 206,261,[...] / [...],261,[...] / [...],261,601
chromosome 3; 102,153,[...] / [...],153,[...] / [...],153,406
chromosome 3; 119,913,[...] / [...],913,[...] / [...],913,685
chromosome 3; 119,913,[...] / [...],913,[...] / [...],913,685
chromosome 3; 146,628,[...] / [...],628,[...] / [...],628,651
chromosome 4; 28,194,765 / 28,194,775 / 28,194,784
chromosome 4; 115,663,[...] / [...],663,[...] / [...],663,969
chromosome 4; 134,749,[...] / [...],749,[...] / [...],745,002
chromosome 4; 176,373,[...] / [...],373,[...] / [...],373,080
chromosome 5; 8,675,861 / 8,675,871 / 8,675,880
chromosome 5; 27,924,741 / 27,924,751 / 27,924,760
chromosome 5; 32,884,887 / 32,884,897 / 32,884,906
chromosome 5; 53,623,826 / 53,623,836 / 53,623,845
chromosome 5; 64,122,061 / 64,122,071 / 64,122,080
chromosome 5; 74,866,858 / 74,866,868 / 74,866,877
chromosome 5; 105,980,[...] / [...],980,[...] / [...],980,810
chromosome 5; 152,139,[...] / [...],139,[...] / [...],139,425
chromosome 6; 48,157,882 / 48,157,892 / 48,157,901
chromosome 7; 78,999,365 / 78,999,375 / 78,999,384
chromosome 7; 80,119,047 / 80,119,057 / 80,119,066
chromosome 8; 34,242,517 / 34,242,527 / 34,242,536
chromosome 8; 27,870,072 / 27,870,082 / 27,870,091
chromosome 8; 34,242,517 / 34,242,527 / 34,242,536
chromosome 8; 51,457,368 / 51,457,378 / 51,457,387
chromosome 8; 57,613,105 / 57,613,115 / 57,613,124
chromosome 10; 78,698,792 / 78,698,802 / 78,698,811
chromosome 10; 82,162,004 / 82,162,104 / 82,162,113
chromosome 11; 39,326,082 / 39,326,092 / 39,326,101
chromosome 11; 44,762,096 / 44,762,106 / 44,762,115
chromosome 11; 44,762,096 / 44,762,106 / 44,762,115
chromosome 12; 23,363,908 / 23,363,918 / 23,363,927
chromosome 12; 96,920,559 / 96,920,569 / 96,920,578
chromosome 12; 110,398,[...] / [...],398,[...] / [...],398,757
chromosome 14; 47,965,766 / 47,965,776 / 47,965,785
chromosome 14; 53,307,154 / 53,307,164 / 53,307,173
chromosome 14; 86,537,553 / 86,537,563 / 86,537,572
chromosome 17; 3,936,611 / 3,936,621 / 3,936,630
chromosome 17; 28,388,993 / 28,389,003 / 28,389,012
chromosome 18; 18,681,824 / 18,681,834 / 18,681,843
chromosome 18; 54,759,473 / 54,759,483 / 54,759,492
chromosome 22; 31,174,501 / 31,174,511 / 41,174,520
Supplementary Tables
Supplementary Table 1
Columns: Prior; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR.

Assessing the impact of mutation rate prior on the sensitivity and specificity of DNM discovery with DeNovoGear in the WES dataset (see Methods of main text for description of the WES dataset). In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID [...]). For each combination of mutation rate prior and posterior probability cutoff used in the current analysis, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. In total, Conrad et al. validated 2 germline de novo mutations and 19 somatic mutations, and identified 39 false positive calls, in the regions with coverage in the WES dataset. A higher mutation rate prior leads to an increase in the total number of calls made, which brings about increased sensitivity at the cost of an increase in the False Discovery Rate. The sensitivity and FDR calculations are similar to Table ??.
Supplementary Table 2
Columns: Prior; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR.

Assessing the impact of mutation rate prior on the sensitivity and specificity of DNM discovery with DeNovoGear in the WGS dataset (see Methods of main text for description of the WGS dataset). In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID [...]). For each combination of mutation rate prior and posterior probability cutoff used in the current analysis, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. In total, Conrad et al. validated 49 germline de novo mutations and 952 somatic mutations, and identified 2235 false positive calls, in whole genome sequencing data from these samples. A higher mutation rate prior leads to an increase in the total number of calls made, which brings about increased sensitivity at the cost of an increase in the False Discovery Rate.
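The trade-off described for Supplementary Tables 1 and 2 (a higher prior yields more calls) can be illustrated with a deliberately simplified two-hypothesis Bayes calculation. This is a sketch only: DeNovoGear's actual model over trio genotypes is not reproduced here, and the function names, the likelihood values, the prior grid and the 0.5 cutoff are hypothetical.

```python
def posterior_dnm(prior, lik_dnm, lik_no_dnm):
    """Posterior probability of a de novo mutation under two hypotheses."""
    numerator = prior * lik_dnm
    return numerator / (numerator + (1.0 - prior) * lik_no_dnm)


def count_calls(prior, sites, cutoff):
    """Number of sites whose posterior exceeds the cutoff at this prior."""
    return sum(posterior_dnm(prior, a, b) > cutoff for a, b in sites)


# Illustrative grid of mutation rate priors: 1e-12, 1e-10, ..., 1e-04.
priors = [10.0 ** -exponent for exponent in range(12, 2, -2)]

# Hypothetical (likelihood with DNM, likelihood without DNM) per site.
sites = [(1e-2, 1e-13), (1e-2, 1e-9), (1e-2, 1e-5)]
cutoff = 0.5

calls_per_prior = [count_calls(prior, sites, cutoff) for prior in priors]
for prior, n_calls in zip(priors, calls_per_prior):
    print(prior, n_calls)
```

In this toy setting the number of calls never decreases as the prior grows, and sites with weak evidence only pass the cutoff at the largest priors, which is how a larger prior can add false positives.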
Supplementary Table 3
Columns: Dataset; Class; Alpha; Beta.
Rows (dataset and class): [...] RA; [...] RA; [...] RA; CEU-trio AA; CEU-trio RR; WUSTL exome RA.

The alpha and beta values estimated using Maximum Likelihood Estimation for various exome datasets. A different model is fitted to each genotype class (homozygous reference, "RR"; heterozygous, "RA"; and homozygous alternate, "AA"). The values of α and β estimated for the RA class of genotypes are significantly different between any of the CEU exomes and an internal exome dataset generated at Washington University (p < [...], likelihood ratio test).
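As a rough illustration of how alpha and beta might be estimated by maximum likelihood, the sketch below assumes they parameterize a beta-binomial model of alternate-allele read counts. That assumption is suggested by the "Denovogear-BB" label but is not stated in this section, and the coarse grid search, the function names and the read counts are all hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log probability of k alternate reads out of n under a beta-binomial."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def log_likelihood(counts, alpha, beta):
    """Sum of log probabilities over (alt_reads, depth) pairs."""
    return sum(beta_binomial_logpmf(k, n, alpha, beta) for k, n in counts)


def grid_mle(counts, grid):
    """Coarse maximum likelihood estimate of (alpha, beta) over a grid."""
    candidates = [(a, b) for a in grid for b in grid]
    return max(candidates, key=lambda ab: log_likelihood(counts, ab[0], ab[1]))


# Hypothetical heterozygous (RA) sites: (alternate reads, total depth).
ra_counts = [(10, 20), (9, 20), (11, 20), (10, 20)]
alpha_hat, beta_hat = grid_mle(ra_counts, [1.0, 5.0, 25.0, 125.0])
print(alpha_hat, beta_hat)
```

A real analysis would use a numerical optimizer rather than a grid, and would fit each genotype class (RR, RA, AA) separately, as the table caption describes.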
Supplementary Table 4
Columns: Caller; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR.
Rows: six posterior probability cutoffs each for Denovogear, Denovogear-BB, GATK, Polymutt and Samtools; one row for NaiveCaller (cutoff NA).

Comparison between different de novo mutation callers on the WES dataset. Likelihood ratios from some packages were converted to posterior probabilities to enable comparison. The NaiveCaller does not have a score associated with the calls. In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID [...]). Here, for each combination of calling method and posterior probability cutoff, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. The same validation dataset used in Table ?? was used for this analysis. All the tools have similar germline sensitivity. Denovogear has the lowest false discovery rates. SamTools and the NaiveCaller show slightly increased somatic sensitivity but have a very high False Discovery Rate due to the high number of total calls made.
Supplementary Table 5
Columns: Caller; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR.
Rows: six posterior probability cutoffs each for Denovogear, Polymutt, GATK and Samtools; one row for NaiveCaller (cutoff NA).

Comparison between the different de novo mutation callers on the WGS dataset. Here, for each combination of calling method and posterior probability cutoff, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. The sensitivity and false discovery rate calculations were done as in Table ??. Polymutt overcalls on the X chromosome, so all calls made by Polymutt on the X were ignored, as recommended on the developers' website; this leads to a drop in its somatic sensitivity. Denovogear shows high germline and somatic sensitivity at a comparatively low False Discovery Rate.
Supplementary Table 6
Program             Dataset   Total Calls   Validated Calls
DeNovoGear          WEx       98            0
SamTools            WEx
DeNovoGear          WGS
SamTools            WGS
DeNovoGear DINDEL   WGS

The total number of de novo indel calls (posterior probability > [...]) and the number of validated true positive calls made by DeNovoGear, SamTools, and DeNovoGear using DINDEL genotype likelihoods on the WEx and WGS datasets.
Supplementary Table 7
Columns: Validated (True Positives; False Positives); Not Validated.
Insertions: Total called; Avg posterior probability; Samtools; DINDEL NA; Average length; Coding context: Intergenic; Exonic; Intronic; Upstream; Downstream; ncRNA exonic; ncRNA intronic; UTR.
Deletions: Total called; Avg posterior probability; Samtools; DINDEL; Average length; Coding context: Intergenic; Exonic 0-0; Intronic; Upstream 2-1; Downstream 1-4; ncRNA exonic 0-1; ncRNA intronic 0-7; UTR3 0-4.

Summary statistics for de novo indels called by DNG. Sites classified as "Validated" are sites for which we attempted validation by Sanger sequencing. Sites classified as "Not Validated" are sites which we did not attempt to validate, due to manual or algorithmic filtering.
Supplementary Table 8
57 different PCR assays designed to confirm indel calls made by DNG
Columns: Assay #; Chromosome; Position; mutation; Forward primer; Reverse primer; Amplicon size; Annealing temperature; Elongation time (s).

AGTTGTTGTGTTGTT TAAAAAG GTGTTTAGAAAAT GAGAGGAATAAGGAGTG GAGTTATGTAATAAAATATGTAGG GTTTTAATTTTAGTTT GGATTTGTTTG AATTAAATTATG GGAGAAGGTTAAAAATTG GAAAGTGATTAATAATTT GAGTGTGAAAATGATAGTA GTAGATGAGGGTTGATGGG GAGGGATTTGGAAAGAA GAAAGGGATGGTGATGTG TTTGATTATTGA TAAAATGTATTTTAGAAGTTGT TGTTAATTAGGTATTTT GAAATTGGAAATGT AGATGGGGAGTTTTT AATAAAAGAAAAG TTTGTGTTGGTTTTTTATGTAG ATGTATAGTATAAATGG AATAGATGTGAGAAAGATTTG TTTATTTAGATTTGGAGGAAAAG ATAATGGGAGGTGGAATG TTTGTAGAGTTTGTGG TTGAATATTTGAAGAAATA GGTTGAAGATATGTGTG GGAATATTTTTTTTTG AGAGGTTGTGGTGAGAAG TGATTTTGTAAATTTA GGTTTAAAATAAAAAG TATTTAAAATTTGTTTA ATAAAGAAATTATTGGGAAA TTGGGGGAAATTATAA TGGGGAGTTATTGGTATG AATTAATAAAGATGTGATG ATAATGGGTGATA AAAGGAGAGGGAGGAAAAG TATAATATGTTAAGTAAAAAG TGAAGTTGGGGTGGATG TGTAAGGGTGGATTTGAG TTTGTGTAATTTAAGAAGTA AATAATATGAAAAAATT AAAATGTAAATTAATTTTTG GTAAAAGAAAATTAGAGG AAATAAGGGGTAGATGG GAGGATAAGGTTTGATAGGATTAG GAGAGTTGTAAAGTG ATGGGGGTGTGTAAT TTGGTTGAATGGATGA GAAATATAAGAATATGTTTGA AAAGGAAATTTTTA TTTGTAGATTTAAAAGAGTTGG ATGAATATTAAAAATGTAAAA TTTTTTAAAAAAAAGTAAA TTGATATTGATTAATATTTG TATGGAGATGAGTAGGG GTGAAAAATGGAGTAAAATTG GGAATTGTTGTTGTTTAGG GGATGATGAATATAAATAT GGGATTTTTAGAAAATAAATAAAAG TGGTATGTTTGTTTGTG TGTATGTTTATTAAGTTG TTATATGTAGGTTAGTTTTGT TGAAGAGAAGATATT TAAAAAGTTAATG ATGGAGGATTGTGTTAG AATTGTGTGTTTGG TTTATGATGGAAGTG TAATAAATGAAGAATAAG ATTATATGAATTATTATAAATA TGATAGGGAAGAAAATGTA TGAGTAAATGAGTTTTG TTTGATGATAAAT AGAAGAGGTGGATTGTGG GGAGAAATGAAGAGA GAGGGATGTTTAAA AATAAATTAAA TTTAAATTTG ATGTGTGTTGTGAG ATTTAGGTGTTGT GTTGGAAGAAAAATG TGGATAAGATTTG GTTGTGTAGTTT TATAGGGATGGATGATG ATTTTAATGTGTTGG TGTAAAAAGAGAGAGTTGG TTTTGAGGTTTGAAATGG TGGTTAAAATGTGAAGAAATG AATGGGGTAGGTATGTTG TTTAATTATTTTTA GTTTGGATGGTTTGAGG ATTAGAGGAGAAGAG GTATTGTTTGGATG ATTTATGGAAAG GGTGATGGTGATGGTGAT GGAGAATGAAAAGTAGA TAAAATAAGTTGTT AGTATTGTGGATGGAATA GAGAAAAAAATAAGATTGTTA ATTAAAGTGGA TATTTTTTTATTTG GGATAGAAGAATGTTAATAG TTGGAATTTTAATG AGTTGTATAAAT TGGGAATTAATGATAAAAG TTTGAGTTTTAAAATGTAT TAGTGAATAATG TTTGAAATTTTA TTTGTTTGTTGTGGAG AATGGGTAGAGGAAGG X GTGAGATTAAGAG ATGGTGTAAGGATTG

The 57 different PCR assays that were designed and used to confirm indel calls made by DNG. The chromosomes on which these assays were designed, as well as the locations of the indels to be captured, are indicated in addition to the primers, annealing temperatures and elongation times used for each PCR assay.
Supplementary Table 9 is provided as a stand-alone document available through the publisher's website.
Supplementary Note

Analysis of Priors

The DeNovoGear framework allows the user to specify the prior probability of observing a DNM, which in principle can be used as a lever to increase or decrease calling sensitivity. We performed simulations to show that increasing the mutation rate prior increases detection sensitivity (Supplementary Figures 1 and 2, Supplementary Tables 1 and 2). Specifically, we ran DeNovoGear with the mutation rate prior set from $10^{-4}$ to [...] mutations/bp in geometric increments of [...]. Our results show that varying the mutation rate prior does have a dramatic effect on the sensitivity and specificity of DNM calling when using a standard whole-genome sequencing study design such as the one that generated the WGS dataset (Supplementary Tables 1 and 2, Supplementary Figs. 1 and 2). The total number of false positive calls increases over 5-fold when moving from [...] to $10^{-4}$, while 879/939 (94%) of validated DNMs are detected at the smallest rate prior, and 100% sensitivity for germline DNMs is achieved at [...]. These results indicate that use of biologically realistic values for the mutation prior will give near 100% sensitivity to non-mosaic DNMs, while increasing the prior beyond this threshold will massively inflate the number of false positive calls with marginal or no increase in sensitivity.

Next, we investigated whether our use of a prior on mutation rate helps control Type I error at low sequencing depth. Low-coverage data had high specificity but low sensitivity. With a cutoff of posterior probability of being de novo of > 0.001, specificity ranged from 1 (1x coverage) to 0.87 (20x coverage); sensitivity ranged from 0 (1x coverage) to 0.97 (20x coverage). Greater than 95% of de novo mutations were identified at 16x coverage and above.

DINDEL genotype likelihoods

After the fact, we wanted to assess the performance of alternative indel genotype likelihood functions for de novo indel calling. We selected one such alternative modeling framework, a computationally intensive, haplotype-based realignment method that is implemented in the package DINDEL (PubMed ID: [...]). In order to evaluate the feasibility of running DINDEL on the whole genome, we first ran DINDEL in the default mode and in the heuristic mode on chromosome 21 of the WGS dataset. DINDEL identified a total of 631,686 distinct candidate indels on chromosome 21 across all three samples. These calls are spread across 288,658 windows of 120 bp each. Run time varied by only 5% across different samples. The more complex modeling used by DINDEL comes at a great computational cost: the average run time for default mode was 142 hours per sample, and for heuristic mode, 80 hours. Extrapolating from these numbers produces an estimated run time of 344 or 144 days per whole genome sequence, respectively.

Using the heuristic likelihoods, DINDEL calls were first made on each member of the trio separately. These calls were then merged to create a list of candidate sites, and we directed DINDEL to calculate genotype likelihoods for all three samples at these candidate sites. The resultant genotype likelihoods were then fed to DeNovoGear to call de novo indels. DeNovoGear produced 136 indel calls from the WGS dataset with posterior probability > 0.9 and 463 calls with posterior probability > $1 \times 10^{-4}$ (Table S6). Forty-four (79%) of the 56 candidate DNMs from our Samtools analysis were also called as DNMs with DINDEL likelihoods when considering this larger set of 463; in contrast, 2/3 false positives were no longer supported as DNMs. Our results suggest that DINDEL genotype likelihoods are conservative (i.e. they underestimate the evidence in support of an indel when a true indel is present), but this is balanced by a major increase in specificity.

Robustness of Indel Mutation Rate Estimate

Previous estimates of the ratio of deletion to insertion variants from human polymorphism data range from [...]; thus, our observation of a nearly 8-fold enrichment of validated deletions may reflect a lack of power to detect short insertions with next-generation sequencing data. Alternatively, our finding may be an indication that purifying selection is much stronger on new deletions than on new insertions. If we were to adjust our mutation rate estimate to account for a theoretical under-ascertainment of indels, the resulting values would be only slightly higher than what we presented here: assuming a true 4:1 ratio of deletions:insertions produces an estimate of $1.18 \times 10^{-9}$, while assuming a 2:1 ratio leads to an estimate of $1.42 \times 10^{-9}$.

It seems likely that discovery power for indels may be lower than that for SNVs. If our power to discover indel DNMs were much less than the value of 0.95 that we assume here, say as low as 0.2, then, using the other parameter values described in the main text, our rate estimate would be revised upwards to [...].

Due to our filtering strategy, the indel rate estimate we provide in the main text applies to the non-repeat portion of the genome. The indel mutation rate is predicted to be much higher in the repetitive portion. One way to approach the true genomic indel mutation rate would be to simply include the DNM calls from these repeat regions in a rate estimate; even under the assumption that these are all true positives (which is unlikely), the resulting callset may also suffer from a lack of power to identify indels in repeat regions. Sensitivity and specificity of indel calling in repeats are still very poorly characterized. We manually removed 55% of the post-filtered calls in our original analysis, due to visual identification of artifacts. Assuming we would remove the same proportion from the unfiltered callset, we would have 203 de novo indels, and we would expect to validate 192 of these based on our observed validation rate. Then, using the equation that we define in the methods section of the main text, with the parameter values a = 1, p = 0.95, s = 49/1001, d = 193, b = [...], our rate estimate would be [...].
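As a rough illustration of why the mutation rate prior discussed under Analysis of Priors acts as a lever on calling sensitivity, the sketch below applies Bayes' rule to a single site under a deliberately simplified two-hypothesis model (de novo mutation versus no mutation). This is not DeNovoGear's actual model; the function name `posterior_de_novo`, the likelihood values, and the grid of priors are all hypothetical and chosen only for illustration.

```python
def posterior_de_novo(prior: float, lik_mutation: float, lik_no_mutation: float) -> float:
    """Posterior probability of a de novo event under a toy two-hypothesis model.

    prior           -- prior probability of a de novo mutation at the site
    lik_mutation    -- likelihood of the read data given a de novo mutation
    lik_no_mutation -- likelihood of the read data given no mutation
    """
    numerator = prior * lik_mutation
    denominator = numerator + (1.0 - prior) * lik_no_mutation
    return numerator / denominator


# Hypothetical likelihoods: the data are a million times more likely
# under the mutation hypothesis than under the null.
LIK_MUT, LIK_NULL = 1e-3, 1e-9

# Illustrative grid of priors spanning several orders of magnitude.
for prior in (1e-12, 1e-10, 1e-8, 1e-6, 1e-4):
    post = posterior_de_novo(prior, LIK_MUT, LIK_NULL)
    print(f"prior={prior:.0e}  posterior={post:.6f}")
```

With the evidence held fixed, the posterior rises monotonically with the prior, so a larger prior pushes more sites (true and false alike) over any fixed posterior cutoff.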
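The percentages quoted in the note can be checked with two small helpers. The definitions follow the legends of Supplementary Figures 1 and 2 (sensitivity as the fraction of validated true mutations that were called; False Discovery Rate as the fraction of validated calls that turned out to be false positives). The function names are our own, and the counts in the FDR example are hypothetical.

```python
def sensitivity(true_positives_called: int, true_positives_total: int) -> float:
    """Fraction of validated true mutations recovered by the caller."""
    return true_positives_called / true_positives_total


def false_discovery_rate(false_positive_calls: int, validated_calls: int) -> float:
    """Fraction of validated calls that were false positives."""
    return false_positive_calls / validated_calls


# Counts quoted in the note.
print(f"{sensitivity(879, 939):.1%}")  # validated DNMs detected at the smallest prior
print(f"{sensitivity(44, 56):.1%}")    # Samtools candidates also called with DINDEL likelihoods

# Hypothetical counts, for illustration only.
print(f"{false_discovery_rate(5, 100):.1%}")
```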
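The deletion:insertion adjustment in Robustness of Indel Mutation Rate Estimate can be sanity-checked with a short calculation. The sketch assumes, as our reading rather than the authors' stated formula, that the deletion count is held fixed, that missing insertions are added to reach the assumed ratio, and that the rate scales linearly with the total indel count, so that the total is proportional to 1 + 1/ratio. The function name `rescale_rate` is hypothetical.

```python
def rescale_rate(rate: float, ratio_from: float, ratio_to: float) -> float:
    """Rescale an indel rate from one assumed deletion:insertion ratio to another.

    Assumes a fixed deletion count, so total indels are proportional to
    (1 + 1/ratio), and a rate that is linear in the total indel count.
    """
    return rate * (1.0 + 1.0 / ratio_to) / (1.0 + 1.0 / ratio_from)


rate_4_to_1 = 1.18e-9  # estimate quoted in the note for a 4:1 ratio
rate_2_to_1 = rescale_rate(rate_4_to_1, ratio_from=4, ratio_to=2)
print(f"{rate_2_to_1:.2e}")
```

Under these assumptions the 4:1 estimate rescales to about 1.42 x 10^-9 for a 2:1 ratio, which agrees with the second value quoted in the note (taking its exponent to be 10^-9).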
More informationUnit 3 - Molecular Biology & Genetics - Review Packet
Name Date Hour Unit 3 - Molecular Biology & Genetics - Review Packet True / False Questions - Indicate True or False for the following statements. 1. Eye color, hair color and the shape of your ears can
More information