Supplementary Information for Discovery and characterization of indel and point mutations using DeNovoGear

Avinash Ramu 1, Michiel J. Noordam 1, Rachel S. Schwartz 2, Arthur Wuster 3, Matthew E. Hurles 3, Reed A. Cartwright 2,4, Donald F. Conrad 1,5

1 Department of Genetics, 5 Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO 63110, USA
2 Center for Evolutionary Medicine and Informatics, The Biodesign Institute, 4 School of Life Sciences, Arizona State University, Tempe, AZ, USA
3 Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK

Supplementary Figures

Supplementary Figure 1

[Figure: panels A and B; axes: Sensitivity, FDR; legend: Prior.]

Effect of different mutation rate prior values on de novo SNP calling on the Whole Genome Sequencing dataset. Sensitivity and False Discovery Rates for each mutation prior were calculated for (a) germline and (b) somatic SNPs using the validated 1000 genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by DeNovoGear, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.
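The two metrics defined in this legend can be written out as a minimal sketch. It assumes sensitivity is reported as a fraction of the validated true mutations rather than as a raw count; the function names and the example counts are hypothetical and are not taken from the validation data.

```python
def sensitivity(true_positives_called: int, true_positives_total: int) -> float:
    """Fraction of validated true mutations that the caller recovered."""
    if true_positives_total <= 0:
        raise ValueError("need at least one validated true mutation")
    return true_positives_called / true_positives_total


def false_discovery_rate(false_positive_calls: int, validated_calls: int) -> float:
    """Proportion of validated calls that were validated as false positives."""
    if validated_calls <= 0:
        raise ValueError("need at least one validated call")
    return false_positive_calls / validated_calls


# Hypothetical counts, for illustration only.
example_sensitivity = sensitivity(45, 50)
example_fdr = false_discovery_rate(10, 40)
```

Germline and somatic sensitivity would use the same function, applied to the germline and somatic validated sets separately.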

Supplementary Figure 2

[Figure: panels A and B; axes: Sensitivity, FDR; legend: Prior.]

Effect of different mutation rate prior values on de novo SNP calling on the Whole Exome Sequencing dataset. Sensitivity and False Discovery Rates for each mutation prior were calculated for (a) germline and (b) somatic SNPs using the validated 1000 genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by DeNovoGear, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.

Supplementary Figure 3

[Figure: panels A and B; axes: Sensitivity, FDR; legend: Caller (Denovogear, GATK, NaiveCaller, Polymutt, Samtools).]

Comparison of the results from several de novo SNP callers on the Whole Genome Sequencing dataset. Sensitivity and False Discovery Rates for each caller were calculated for (a) germline and (b) somatic SNPs using the validated 1000 genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by each caller, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.

Supplementary Figure 4

[Figure: panels A and B; axes: Sensitivity, FDR; legend: Caller (Denovogear, Denovogear BB, GATK, NaiveCaller, Polymutt, Samtools).]

Comparison of the results from several de novo SNP callers on the Whole Exome Sequencing dataset. Sensitivity and False Discovery Rates for each caller were calculated for (a) germline and (b) somatic SNPs using the validated 1000 genomes call-set as reference. Germline sensitivity was defined as the number of true positive germline mutations that were called by each caller, whereas somatic sensitivity was defined as the number of true positive somatic mutations that were called. The False Discovery Rate was defined as the proportion of calls validated as false positive to the total number of calls that were validated.

Supplementary Figure 5

[Figure: three panels plotting Count against Length of Indel, Length of Insertion and Length of Deletion.]

The frequency distribution of indel lengths in the 1000 genomes phase 1 dataset. The counts of (a) all indels of different lengths are plotted, as well as the separate counts for (b) insertions and (c) deletions.

Supplementary Figure 6

[Figure: panels A and B plotting log(µ) against Length of Insertion and Length of Deletion; series: Empirical, Log linear fit.]

A log-linear model is used to estimate the mutation rate given the length of an indel. Separate models were estimated for (a) insertions and (b) deletions. The fitted values of the model are plotted against the logarithm of the mutation rate. Smaller indels have a higher rate than larger ones.
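A log-linear model of this kind has the form log(µ) = intercept + slope × length. The sketch below fits such a line by ordinary least squares, which is an assumption: the fitting procedure is not specified in this legend. The function names and the input rates are hypothetical, chosen only so that the rate falls with length.

```python
import math


def fit_log_linear(lengths, rates):
    """Least-squares fit of log(rate) = intercept + slope * length."""
    logs = [math.log(r) for r in rates]
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predicted_rate(length, intercept, slope):
    """Mutation rate predicted by the fitted model for an indel of this length."""
    return math.exp(intercept + slope * length)


# Hypothetical per-length rates that decay exponentially with indel length.
indel_lengths = [1, 2, 3, 4, 5]
indel_rates = [1e-9 * math.exp(-0.5 * (n - 1)) for n in indel_lengths]
intercept, slope = fit_log_linear(indel_lengths, indel_rates)
```

Separate calls with insertion and deletion data would give the two models shown in panels (a) and (b).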

Supplementary Figure 7

De novo indel artifact 1. In this class of artifact, an indel is clearly present in one of the parents but has been placed with a slightly different alignment than in the child. This class can be avoided by filtering out candidate DNMs that overlap an indel call in the parent.
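The overlap filter suggested in this legend could be sketched as follows. The tuple layout `(chrom, start, end)` with half-open intervals, the function names, and the optional `pad` parameter (to tolerate slightly shifted alignments) are all assumptions made for illustration; they do not describe DeNovoGear's own interface.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def filter_parental_overlap(candidates, parental_indels, pad=0):
    """Drop candidate de novo indels that overlap any indel called in a parent.

    candidates and parental_indels are lists of (chrom, start, end) tuples.
    pad widens each candidate on both sides before testing for overlap.
    """
    kept = []
    for chrom, start, end in candidates:
        hit = any(
            p_chrom == chrom
            and intervals_overlap(start - pad, end + pad, p_start, p_end)
            for p_chrom, p_start, p_end in parental_indels
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept


# Hypothetical coordinates, for illustration only.
example_candidates = [("1", 100, 103), ("2", 500, 502)]
example_parental = [("1", 102, 105)]
example_kept = filter_parental_overlap(example_candidates, example_parental)
```

A linear scan is enough for a handful of candidates; an interval index would be the natural replacement for genome-wide call sets.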

Supplementary Figure 8

De novo indel artifact 2. In this class of artifact, an indel is clearly present in both parents with the same breakpoints, yet has not been called in either. This undercalling in the parents could be addressed by either using an alternative likelihood function for indel genotypes, as described in the main text, or by filtering sites where some indel reads are observed in the parents.

Supplementary Figure 9

De novo indel artifact 3. In this class of artifact, a high frequency of reads with a non-reference allele is seen in both parents. Many of these alternate base calls occupy the first or last position in the read, suggesting that both the insertion in the child and the SNPs in the parents may be alignment artifacts, possibly caused by the presence of a large structural variant with a breakpoint in this region. This class of variant could be avoided by filtering sites with a high frequency of non-reference reads in one or both parents. The filter could be applied strictly to the positions spanning the indel call, or within a small window around the indel (say plus or minus 10 bp).
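One way to express the window filter described in this legend is sketched below. Each parent is represented by a hypothetical dictionary mapping a position to `(reference_reads, non_reference_reads)`; this data layout, the function names, and the 0.2 threshold are assumptions for illustration only. Setting `window=0` restricts the check to the positions spanning the indel call.

```python
def max_nonref_fraction(pileup, start, end, window=10):
    """Largest per-position fraction of non-reference reads in [start - window, end + window]."""
    worst = 0.0
    for pos in range(start - window, end + window + 1):
        ref_reads, alt_reads = pileup.get(pos, (0, 0))
        depth = ref_reads + alt_reads
        if depth:
            worst = max(worst, alt_reads / depth)
    return worst


def passes_parent_filter(parent_pileups, start, end, window=10, max_fraction=0.2):
    """Keep a candidate only if no parent exceeds max_fraction near the indel."""
    return all(
        max_nonref_fraction(pileup, start, end, window) <= max_fraction
        for pileup in parent_pileups
    )


# Hypothetical read counts: the mother shows many non-reference reads 3 bp downstream.
mother = {100: (20, 0), 105: (10, 10)}
father = {100: (30, 0)}
keep_with_window = passes_parent_filter([mother, father], 100, 102, window=10)
keep_strict_span = passes_parent_filter([mother, father], 100, 102, window=0)
```

In this toy case the candidate survives the strict-span check but is removed once the 10 bp window is applied, which is the behaviour the legend motivates.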

Supplementary Figure 10

[Figure: (A) ideogram of chromosome 3; (B) forward strand and reverse strand sequence around axis tick 102,153,406.]

Example of a de novo indel called by DNG and confirmed by Sanger sequencing. A) The location of the indel is indicated by a red box. B) At this location, the individual carrying the indel has a 3-base-pair deletion of the bases T, T and A (see left box), which results in a double sequence read on the forward strand (middle box) and the reverse strand (right box), both starting at the site of the indel. C) The two individuals without an indel at this position consequently have single reads on both strands. We confirmed 53/56 (95%) de novo indel predictions in this family by Sanger sequencing (summary statistics of the predicted de novo indels are in Supplementary Table 7, a full list of de novo indels in Supplementary Table 9, and validation plots for all calls in Supplementary Figure 11).

Supplementary Figure 11: Validation Plots

The following pages depict (a) IGV screenshots of next-generation sequencing data and (b) Sanger sequencing traces for all candidate de novo indels for which we attempted validation. See the legend of Supplementary Figure 10 and the main text for additional details. Within the True Positive class, the figures are ordered by chromosome and position of the de novo indel. All coordinates are with respect to NCBI37.

[Validation plots, one per page: chromosome ideogram, IGV read alignment and Sanger trace. Chromosome, axis ticks and any panel label for each page:]

Chromosome 17; axis ticks 28,388,993 / 28,389,003 / 28,389,012. E7: A site for which primers could not be designed.
Chromosome 15; axis ticks 74,185,293 / 74,185,303 / 74,185,312. A7: A false positive site.
Chromosome 17; axis ticks 2,120,331 / 2,120,341 / 2,120,350. B7: A false positive site.
Chromosome 2; axis ticks 69,058,980 / 69,058,990 / 69,058,999. G1: A false positive site.
Chromosome 1; axis ticks 36,060,969 / 36,060,979 / 36,060,988.
Chromosome 1; axis ticks 62,752,481 / 62,752,491 / 62,752,500.
Chromosome 1; axis ticks 79,985,409 / 79,985,409 / 79,985,418.
Chromosome 1; axis ticks 85,896,648 / 85,896,658 / 85,896,667.
Chromosome 1; axis tick 215,772,765.
Chromosome 2; axis ticks 56,479,693 / 56,479,703 / 56,479,712.
Chromosome 2; axis ticks 69,058,980 / 69,058,990 / 69,058,999.
Chromosome 2; axis ticks 83,119,261 / 83,119,271 / 83,119,280.
Chromosome 2; axis tick 124,110,179.
Chromosome 2; axis tick 128,809,744.
Chromosome 2; axis tick 140,541,050.
Chromosome 2; axis tick 197,724,636.
Chromosome 2; axis tick 206,261,601.
Chromosome 3; axis tick 102,153,406.
Chromosome 3; axis tick 119,913,685.
Chromosome 3; axis tick 119,913,685.
Chromosome 3; axis tick 146,628,651.
Chromosome 4; axis ticks 28,194,765 / 28,194,775 / 28,194,784.
Chromosome 4; axis tick 115,663,969.
Chromosome 4; axis tick 134,745,002.
Chromosome 4; axis tick 176,373,080.
Chromosome 5; axis ticks 8,675,861 / 8,675,871 / 8,675,880.
Chromosome 5; axis ticks 27,924,741 / 27,924,751 / 27,924,760.
Chromosome 5; axis ticks 32,884,887 / 32,884,897 / 32,884,906.
Chromosome 5; axis ticks 53,623,826 / 53,623,836 / 53,623,845.
Chromosome 5; axis ticks 64,122,061 / 64,122,071 / 64,122,080.
Chromosome 5; axis ticks 74,866,858 / 74,866,868 / 74,866,877.
Chromosome 5; axis tick 105,980,810.
Chromosome 5; axis tick 152,139,425.
Chromosome 6; axis ticks 48,157,882 / 48,157,892 / 48,157,901.
Chromosome 7; axis ticks 78,999,365 / 78,999,375 / 78,999,384.
Chromosome 7; axis ticks 80,119,047 / 80,119,057 / 80,119,066.
Chromosome 8; axis ticks 34,242,517 / 34,242,527 / 34,242,536.
Chromosome 8; axis ticks 27,870,072 / 27,870,082 / 27,870,091.
Chromosome 8; axis ticks 34,242,517 / 34,242,527 / 34,242,536.
Chromosome 8; axis ticks 51,457,368 / 51,457,378 / 51,457,387.
Chromosome 8; axis ticks 57,613,105 / 57,613,115 / 57,613,124.
Chromosome 10; axis ticks 78,698,792 / 78,698,802 / 78,698,811.
Chromosome 10; axis ticks 82,162,004 / 82,162,104 / 82,162,113.
Chromosome 11; axis ticks 39,326,082 / 39,326,092 / 39,326,101.
Chromosome 11; axis ticks 44,762,096 / 44,762,106 / 44,762,115.
Chromosome 11; axis ticks 44,762,096 / 44,762,106 / 44,762,115.
Chromosome 12; axis ticks 23,363,908 / 23,363,918 / 23,363,927.
Chromosome 12; axis ticks 96,920,559 / 96,920,569 / 96,920,578.
Chromosome 12; axis tick 110,398,757.
Chromosome 14; axis ticks 47,965,766 / 47,965,776 / 47,965,785.
Chromosome 14; axis ticks 53,307,154 / 53,307,164 / 53,307,173.
Chromosome 14; axis ticks 86,537,553 / 86,537,563 / 86,537,572.
Chromosome 17; axis ticks 3,936,611 / 3,936,621 / 3,936,630.
Chromosome 17; axis ticks 28,388,993 / 28,389,003 / 28,389,012.
Chromosome 18; axis ticks 18,681,824 / 18,681,834 / 18,681,843.
Chromosome 18; axis ticks 54,759,473 / 54,759,483 / 54,759,492.
Chromosome 22; axis ticks 31,174,501 / 31,174,511 / 41,174,520.

Supplementary Tables

Supplementary Table 1

[Table columns: Prior; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR.]

Assessing the impact of mutation rate prior on the sensitivity and specificity of DNM discovery with DeNovoGear in the WES dataset (see Methods of main text for description of the WES dataset). In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID ). For each combination of mutation rate prior and posterior probability cutoff used in the current analysis, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. In total, Conrad et al. validated 2 germline de novo mutations and 19 somatic mutations, and identified 39 false positive calls, in the regions with coverage in the WES dataset. A higher mutation rate prior leads to an increase in the total number of calls made, which increases sensitivity at the cost of an increase in the False Discovery Rate. The sensitivity and FDR calculations are similar to Table ??.

Supplementary Table 2

[Table columns: Prior; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR.]

Assessing the impact of mutation rate prior on the sensitivity and specificity of DNM discovery with DeNovoGear in the WGS dataset (see Methods of main text for description of the WGS dataset). In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID ). For each combination of mutation rate prior and posterior probability cutoff used in the current analysis, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. In total, Conrad et al. validated 49 germline de novo mutations and 952 somatic mutations, and identified 2235 false positive calls, in whole genome sequencing data from these samples. A higher mutation rate prior leads to an increase in the total number of calls made, which increases sensitivity at the cost of an increase in the False Discovery Rate.
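The trade-off described in these two tables follows from Bayes' rule. The sketch below uses a deliberately simplified two-hypothesis model ("de novo" versus "not de novo") with made-up likelihoods; it is not DeNovoGear's actual likelihood calculation, and the function name, the likelihood values and the grid of priors are assumptions for illustration.

```python
def posterior_de_novo(prior, lik_dnm, lik_null):
    """Posterior probability of a de novo mutation under a two-hypothesis model.

    prior    : prior probability of a de novo mutation at the site
    lik_dnm  : likelihood of the read data if the site is a de novo mutation
    lik_null : likelihood of the read data if it is not
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    weighted_dnm = prior * lik_dnm
    return weighted_dnm / (weighted_dnm + (1.0 - prior) * lik_null)


# Illustrative priors and likelihoods: the same evidence scored under
# increasingly permissive priors.
priors = [1e-12, 1e-10, 1e-8, 1e-6, 1e-4]
posteriors = [posterior_de_novo(p, lik_dnm=1.0, lik_null=1e-6) for p in priors]
```

With the evidence held fixed, the posterior rises with the prior, so a fixed posterior cutoff admits more sites, both true and false, as the prior grows.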

Supplementary Table 3

[Table columns: Dataset; Class; Alpha; Beta. Row labels: RA; RA; RA; CEU-trio AA; CEU-trio RR; WUSTL exome RA.]

The alpha and beta values estimated using Maximum Likelihood Estimation for various exome datasets. A different model is fitted to each genotype class (homozygous reference, RR; heterozygous, RA; and homozygous alternate, AA). The values of α and β estimated for the RA class of genotypes are significantly different between any of the CEU exomes and an internal exome dataset generated at Washington University (p < , LRT).
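As an illustration of how such parameters enter a likelihood, the sketch below assumes that α and β parametrize a beta-binomial distribution over the number of alternate-allele reads at a site; this table does not state the distribution explicitly, so that choice is an assumption, as are the function names. The likelihood ratio test statistic is written in its usual form, twice the difference in maximized log-likelihood between the separate-parameter and shared-parameter fits.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log probability of k alternate reads out of n under a beta-binomial model."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def log_likelihood(sites, alpha, beta):
    """Sum of log probabilities over (alt_reads, total_reads) pairs."""
    return sum(beta_binomial_logpmf(k, n, alpha, beta) for k, n in sites)


def lrt_statistic(loglik_separate, loglik_shared):
    """Likelihood ratio test statistic for separate versus shared parameters."""
    return 2.0 * (loglik_separate - loglik_shared)
```

Maximizing `log_likelihood` over α and β for each genotype class, for example with a numerical optimizer, would give estimates of the kind tabulated here.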

Supplementary Table 4

[Table columns: Caller; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR. Callers: Denovogear, Denovogear-BB, GATK, Polymutt and Samtools (six cutoffs each) and NaiveCaller (cutoff NA).]

Comparison between different de novo mutation callers on the WES dataset. Likelihood ratios from some packages were converted to posterior probabilities to enable comparison. The NaiveCaller does not have a score associated with the calls. In a previous study of the same samples, we experimentally validated DNM predictions and assigned each prediction a value of Germline DNM, Somatic DNM or False Positive (Conrad et al. 2011, PMID ). Here, for each combination of calling method and posterior probability cutoff, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. The same validation dataset used in Table ?? was used for this analysis. All the tools have similar germline sensitivity. Denovogear has the lowest false discovery rates. SamTools and the NaiveCaller show slightly increased somatic sensitivity but have a very high False Discovery Rate due to the high number of total calls made.

Supplementary Table 5

[Table columns: Caller; Posterior Probability cutoff; Known Germline DNMs; Known Somatic DNMs; Known False Positives; Total DNM Calls; Sensitivity (Germline); Sensitivity (Somatic); FDR. Callers: Denovogear, Polymutt, GATK and Samtools (six cutoffs each) and NaiveCaller (cutoff NA).]

Comparison between the different de novo mutation callers on the WGS dataset. Here, for each combination of calling method and posterior probability cutoff, we list "Total DNM Calls": the total number of unfiltered DNM calls made by DNG; "Known Germline DNMs": the total number of calls made at sites of validated germline mutation reported by Conrad et al.; "Known Somatic DNMs": the total number of calls made at sites of validated somatic mutation reported by Conrad et al.; "Known False Positives": the total number of calls made at sites of false positives reported by Conrad et al. The sensitivity and false discovery rate were calculated as in Table ??. Polymutt overcalls on the X chromosome, so all calls made by Polymutt on the X were ignored, as recommended on the developers' website; this leads to a drop in its somatic sensitivity. Denovogear shows high germline and somatic sensitivity at a comparatively low False Discovery Rate.

Supplementary Table 6

[Table columns: Program; Dataset; Total Calls; Validated Calls. Rows: DeNovoGear, WEx (98, 0); SamTools, WEx; DeNovoGear, WGS; SamTools, WGS; DeNovoGear DINDEL, WGS.]

The total number of de novo indel calls (posterior probability > ) and the number of validated true positive calls made by DeNovoGear, SamTools, and DeNovoGear using DINDEL genotype likelihoods on the WEx and WGS datasets.

Supplementary Table 7

[Table columns: Validated (True Positives, False Positives); Not Validated. Rows, given separately for Insertions and for Deletions: Total called; Avg posterior probability (Samtools, DINDEL); Average length; Coding context (Intergenic, Exonic, Intronic, Upstream, Downstream, ncRNA exonic, ncRNA intronic, UTR3). Entries for Deletions, coding context: Exonic 0-0; Upstream 2-1; Downstream 1-4; ncRNA exonic 0-1; ncRNA intronic 0-7; UTR3 0-4.]

Summary statistics for de novo indels called by DNG. Sites classified as Validated are sites for which we attempted validation by Sanger sequencing. Sites classified as Not Validated are sites which we did not attempt to validate, due to manual or algorithmic filtering.

Supplementary Table 8

57 different PCR assays designed to confirm indel calls made by DNG

[Table columns: Assay #; Chromosome; Position; Mutation; Forward primer; Reverse primer; Amplicon size; Annealing temperature; Elongation time (s).]

AGTTGTTGTGTTGTT TAAAAAG GTGTTTAGAAAAT GAGAGGAATAAGGAGTG GAGTTATGTAATAAAATATGTAGG GTTTTAATTTTAGTTT GGATTTGTTTG AATTAAATTATG GGAGAAGGTTAAAAATTG GAAAGTGATTAATAATTT GAGTGTGAAAATGATAGTA GTAGATGAGGGTTGATGGG GAGGGATTTGGAAAGAA GAAAGGGATGGTGATGTG TTTGATTATTGA TAAAATGTATTTTAGAAGTTGT TGTTAATTAGGTATTTT GAAATTGGAAATGT AGATGGGGAGTTTTT AATAAAAGAAAAG TTTGTGTTGGTTTTTTATGTAG ATGTATAGTATAAATGG AATAGATGTGAGAAAGATTTG TTTATTTAGATTTGGAGGAAAAG ATAATGGGAGGTGGAATG TTTGTAGAGTTTGTGG TTGAATATTTGAAGAAATA GGTTGAAGATATGTGTG GGAATATTTTTTTTTG AGAGGTTGTGGTGAGAAG TGATTTTGTAAATTTA GGTTTAAAATAAAAAG TATTTAAAATTTGTTTA ATAAAGAAATTATTGGGAAA TTGGGGGAAATTATAA TGGGGAGTTATTGGTATG AATTAATAAAGATGTGATG ATAATGGGTGATA AAAGGAGAGGGAGGAAAAG TATAATATGTTAAGTAAAAAG TGAAGTTGGGGTGGATG TGTAAGGGTGGATTTGAG TTTGTGTAATTTAAGAAGTA AATAATATGAAAAAATT AAAATGTAAATTAATTTTTG GTAAAAGAAAATTAGAGG AAATAAGGGGTAGATGG GAGGATAAGGTTTGATAGGATTAG GAGAGTTGTAAAGTG ATGGGGGTGTGTAAT TTGGTTGAATGGATGA GAAATATAAGAATATGTTTGA AAAGGAAATTTTTA TTTGTAGATTTAAAAGAGTTGG ATGAATATTAAAAATGTAAAA TTTTTTAAAAAAAAGTAAA TTGATATTGATTAATATTTG TATGGAGATGAGTAGGG GTGAAAAATGGAGTAAAATTG GGAATTGTTGTTGTTTAGG GGATGATGAATATAAATAT GGGATTTTTAGAAAATAAATAAAAG

TGGTATGTTTGTTTGTG TGTATGTTTATTAAGTTG TTATATGTAGGTTAGTTTTGT TGAAGAGAAGATATT TAAAAAGTTAATG ATGGAGGATTGTGTTAG AATTGTGTGTTTGG TTTATGATGGAAGTG TAATAAATGAAGAATAAG ATTATATGAATTATTATAAATA TGATAGGGAAGAAAATGTA TGAGTAAATGAGTTTTG TTTGATGATAAAT AGAAGAGGTGGATTGTGG GGAGAAATGAAGAGA GAGGGATGTTTAAA AATAAATTAAA TTTAAATTTG ATGTGTGTTGTGAG ATTTAGGTGTTGT GTTGGAAGAAAAATG TGGATAAGATTTG GTTGTGTAGTTT TATAGGGATGGATGATG ATTTTAATGTGTTGG TGTAAAAAGAGAGAGTTGG TTTTGAGGTTTGAAATGG TGGTTAAAATGTGAAGAAATG AATGGGGTAGGTATGTTG TTTAATTATTTTTA GTTTGGATGGTTTGAGG ATTAGAGGAGAAGAG GTATTGTTTGGATG ATTTATGGAAAG GGTGATGGTGATGGTGAT GGAGAATGAAAAGTAGA TAAAATAAGTTGTT AGTATTGTGGATGGAATA GAGAAAAAAATAAGATTGTTA ATTAAAGTGGA TATTTTTTTATTTG GGATAGAAGAATGTTAATAG TTGGAATTTTAATG AGTTGTATAAAT TGGGAATTAATGATAAAAG TTTGAGTTTTAAAATGTAT TAGTGAATAATG TTTGAAATTTTA TTTGTTTGTTGTGGAG AATGGGTAGAGGAAGG X GTGAGATTAAGAG ATGGTGTAAGGATTG

The 57 different PCR assays that were designed and used to confirm indel calls made by DNG. The chromosomes on which these assays were designed and the locations of the indels to be captured are indicated, in addition to the primers, annealing temperatures and elongation times used for each PCR assay.

Supplementary Table 9 is provided as a stand-alone document available through the publisher's website.

Supplementary Note

Analysis of Priors

The DeNovoGear framework allows the user to specify the prior probability of observing a DNM, which in principle can be used as a lever to increase or decrease calling sensitivity. We performed simulations to show that increasing the mutation rate prior increases detection sensitivity (Supplementary Figures 1 and 2, Supplementary Tables 1 and 2). Specifically, we ran DeNovoGear with the mutation rate prior set from 10^-4 to 10^-12 mutations/bp in 100-fold geometric increments, matching the prior values plotted in Supplementary Figures 1 and 2. Our results show that varying the mutation rate prior does have a dramatic effect on the sensitivity and specificity of DNM calling when using a standard whole-genome sequencing study design such as the one generating the WGS dataset (Supplementary Tables 1 and 2, Supplementary Figs. 1 and 2). The total number of false positive calls increases over 5-fold when moving from […] to 10^-4, while 879/939 (94%) of validated DNMs are detected at the smallest rate prior, and 100% sensitivity for germline DNMs is achieved at […]. These results indicate that use of biologically realistic values for the mutation prior will give near 100% sensitivity to non-mosaic DNMs, while increasing the prior beyond this threshold will massively inflate the number of false positive calls with marginal or no increase in sensitivity.

Next, we investigated whether our use of a prior on mutation rate helps control Type I error at low sequencing depth. Low coverage data had high specificity, but low sensitivity. With a posterior probability cutoff of > 0.001 for a de novo call, specificity ranged from 1 (1x coverage) to 0.87 (20x coverage); sensitivity ranged from 0 (1x coverage) to 0.97 (20x coverage). Greater than 95% of de novo mutations were identified at 16x coverage and above.

DINDEL genotype likelihoods

After the fact, we wanted to assess the performance of alternative indel genotype likelihood functions for de novo indel calling. We selected one such alternative modeling framework, a computationally intensive, haplotype-based realignment method that is implemented in the package DINDEL (PubMed ID: […]). In order to evaluate the feasibility of running DINDEL on the whole genome, we first ran DINDEL in the default mode and in the heuristic mode on Chromosome 21 of the WGS dataset. DINDEL identified a total of 631,686 distinct candidate indels on Chromosome 21 across all three samples. These calls are spread across 288,658 windows of 120 bp each. Run time varied by only 5% across different samples. The more complex modeling used by DINDEL comes at a great computational cost: the average run time for default mode was 142 hours per sample, and for heuristic mode, 80 hours. Using these numbers produces an estimated run time of 344 or 144 days per whole genome sequence.

Using the heuristic likelihoods, DINDEL calls were first made on each member of the trio separately. These calls were then merged to create a list of candidate sites, and we directed DINDEL to calculate genotype likelihoods for all three samples at these candidate sites. The resultant genotype likelihoods were then fed to DeNovoGear to call de novo indels. DeNovoGear produced 136 indel calls from the WGS dataset with posterior probability > 0.9 and 463 calls with posterior probability > 1x10^-4 (Table S6). Forty-four (79%) of the 56 candidate DNMs from our Samtools analysis were also called as DNMs with DINDEL likelihoods when considering this larger set of 463; in contrast, 2/3 false positives were no longer supported as DNMs. Our results suggest that DINDEL genotype likelihoods are conservative (i.e. they underestimate the evidence in support of an indel when a true indel is present) but this is balanced by a major increase in specificity.

Robustness of Indel Mutation Rate Estimate

Previous estimates of the ratio of deletion to insertion variants from human polymorphism data range from […]; thus our observation of a nearly 8-fold enrichment of validated deletions may reflect a lack of power to detect short insertions with next-generation sequencing data. Alternatively, our finding may be an indication that purifying selection is much stronger on new deletions than on new insertions. If we were to adjust our mutation rate estimate to account for a theoretical under-ascertainment of indels, the resulting values would be only slightly higher than what we presented here; assuming a true 4:1 ratio of deletions:insertions produces an estimate of 1.18 x 10^-9 while assuming a 2:1 ratio leads to an estimate of 1.42 x 10^-9.

It seems likely that discovery power for indels may be lower than that for SNVs. If our power to discover indel DNMs was much less than the value of 0.95 that we assume here, say as low as 0.2, and using the other parameter values described in the main text, our rate estimate would be revised upwards to […].

Due to our filtering strategy, the indel rate estimate we provide in the main text applies to the non-repeat portion of the genome. The indel mutation rate is predicted to be much higher in the repetitive portion. One way to approach the true genomic indel mutation rate would be to simply include the DNM calls from these repeat regions in a rate estimate; even under the assumption that these are all true positives (which is unlikely), the resulting callset may also suffer from a lack of power to identify indels in repeat regions. Sensitivity and specificity of indel calling in repeats are still very poorly characterized. We manually removed 55% of the post-filtered calls in our original analysis, due to visual identification of artifacts. Assuming we would remove the same proportion from the unfiltered callset, we would have 203 de novo indels, and we would expect to validate 192 of these based on our observed validation rate. Then, using the equation that we define in the methods section of the main text, with the parameter values a = 1, p = 0.95, s = 49/1001, d = 193, b = […], our rate estimate would be […].


More information

The phenotype of this worm is wild type. When both genes are mutant: The phenotype of this worm is double mutant Dpy and Unc phenotype.

The phenotype of this worm is wild type. When both genes are mutant: The phenotype of this worm is double mutant Dpy and Unc phenotype. Series 2: Cross Diagrams - Complementation There are two alleles for each trait in a diploid organism In C. elegans gene symbols are ALWAYS italicized. To represent two different genes on the same chromosome:

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2015 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

Learning ancestral genetic processes using nonparametric Bayesian models

Learning ancestral genetic processes using nonparametric Bayesian models Learning ancestral genetic processes using nonparametric Bayesian models Kyung-Ah Sohn October 31, 2011 Committee Members: Eric P. Xing, Chair Zoubin Ghahramani Russell Schwartz Kathryn Roeder Matthew

More information

Our typical RNA quantification pipeline

Our typical RNA quantification pipeline RNA-Seq primer Our typical RNA quantification pipeline Upload your sequence data (fastq) Align to the ribosome (Bow>e) Align remaining reads to genome (TopHat) or transcriptome (RSEM) Make report of quality

More information

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery Genome 541! Intro to Computational! Molecular Biology RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure

More information

High-throughput sequencing: Alignment and related topic

High-throughput sequencing: Alignment and related topic High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg HTS Platforms E s ta b lis h e d p la tfo rm s Illu m in a H is e q, A B I S O L id, R o c h e 4 5 4 N e w c o m e rs

More information

Notes for MCTP Week 2, 2014

Notes for MCTP Week 2, 2014 Notes for MCTP Week 2, 2014 Lecture 1: Biological background Evolutionary biology and population genetics are highly interdisciplinary areas of research, with many contributions being made from mathematics,

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

GBS Bioinformatics Pipeline(s) Overview

GBS Bioinformatics Pipeline(s) Overview GBS Bioinformatics Pipeline(s) Overview Getting from sequence files to genotypes. Pipeline Coding: Ed Buckler Jeff Glaubitz James Harriman Presentation: Terry Casstevens With supporting information from

More information

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white Outline - segregation of alleles in single trait crosses - independent assortment of alleles - using probability to predict outcomes - statistical analysis of hypotheses - conditional probability in multi-generation

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

Supporting information for Demographic history and rare allele sharing among human populations.

Supporting information for Demographic history and rare allele sharing among human populations. Supporting information for Demographic history and rare allele sharing among human populations. Simon Gravel, Brenna M. Henn, Ryan N. Gutenkunst, mit R. Indap, Gabor T. Marth, ndrew G. Clark, The 1 Genomes

More information

EST1 Homology Domain. 100 aa. hest1a / SMG6 PIN TPR TPR. Est1-like DBD? hest1b / SMG5. TPR-like TPR. a helical. hest1c / SMG7.

EST1 Homology Domain. 100 aa. hest1a / SMG6 PIN TPR TPR. Est1-like DBD? hest1b / SMG5. TPR-like TPR. a helical. hest1c / SMG7. hest1a / SMG6 EST1 Homology Domain 100 aa 853 695 761 780 1206 hest1 / SMG5 -like? -like 109 145 214 237 497 165 239 1016 114 207 212 381 583 hest1c / SMG7 a helical 1091 Sc 57 185 267 284 699 Figure S1:

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

Predictive Genome Analysis Using Partial DNA Sequencing Data

Predictive Genome Analysis Using Partial DNA Sequencing Data Predictive Genome Analysis Using Partial DNA Sequencing Data Nauman Ahmed, Koen Bertels and Zaid Al-Ars Computer Engineering Lab, Delft University of Technology, Delft, The Netherlands {n.ahmed, k.l.m.bertels,

More information

Population Genetics I. Bio

Population Genetics I. Bio Population Genetics I. Bio5488-2018 Don Conrad dconrad@genetics.wustl.edu Why study population genetics? Functional Inference Demographic inference: History of mankind is written in our DNA. We can learn

More information

Tandem repeat 16,225 20,284. 0kb 5kb 10kb 15kb 20kb 25kb 30kb 35kb

Tandem repeat 16,225 20,284. 0kb 5kb 10kb 15kb 20kb 25kb 30kb 35kb Overview Fosmid XAAA112 consists of 34,783 nucleotides. Blat results indicate that this fosmid has significant identity to the 2R chromosome of D.melanogaster. Evidence suggests that fosmid XAAA112 contains

More information

Unit 3 - Molecular Biology & Genetics - Review Packet

Unit 3 - Molecular Biology & Genetics - Review Packet Name Date Hour Unit 3 - Molecular Biology & Genetics - Review Packet True / False Questions - Indicate True or False for the following statements. 1. Eye color, hair color and the shape of your ears can

More information