A DNA Sequence ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgg gtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagc ggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttc gcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgcta gaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgt agatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgat cgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagta gctagatgcagggataaacacacggaggcgagtgatcggtaccgggctg aggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgag gctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagat gatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccgaga ggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgta gctgatagtgatgatcgtag 2017/12/6 1
Possible Questions
What organism is this DNA sequence from? Is it from a bacterium, fly or any other organism? Is it really from one organism? Is it a real DNA sequence?
Is any part of the DNA transferred from another organism?
Identification of Foreign Genetic Material
Is it possible to identify foreign genetic material in a given DNA sequence?
K-mer Frequencies
Combined K-mer frequency: the frequency of each K-mer together with its reverse complement. Example combined 4-mers: GGTA/TACC, CGAA/TTCG, GGTC/GACC.
[Figure: combined K-mer frequency plotted along the genome sequence]
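A minimal sketch of the combined K-mer definition: each K-mer and its reverse complement are folded into one key before counting. The function names, and the choice of the alphabetically smaller string as the shared key, are illustrative assumptions and are not taken from the slides.

```python
from collections import Counter

# Complement table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def combined_kmer_counts(seq: str, k: int = 4) -> Counter:
    """Count every K-mer together with its reverse complement.

    Assumption: the pair is keyed by whichever of the two strings sorts
    first; K-mers containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def combined_kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Normalize the combined counts so that they sum to 1."""
    counts = combined_kmer_counts(seq, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

For example, in `combined_kmer_counts("GGTACC")` the 4-mers GGTA and TACC land under the single key GGTA, while GTAC, which is its own reverse complement, is counted once.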
K-mer Frequencies
Genomes have highly stable combined K-mer frequencies when measured in small windows of size M along the sequence (e.g., M = 1000 bp, K = 4). This is true for all genomes: eukaryotic, prokaryotic, chromosomal, and organelle.
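The windowed measurement can be sketched as follows: the sequence is cut into windows of M bases and one combined K-mer frequency vector is computed per window. Non-overlapping windows, dropping a trailing partial window, and the function names are assumptions made for illustration; the slides do not say how the windows are laid out.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Shared key for a K-mer and its reverse complement (assumed: the smaller string)."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def combined_kmers(k: int = 4) -> list:
    """All distinct combined K-mers in sorted order (136 of them for k = 4)."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def window_frequencies(seq: str, window: int = 1000, k: int = 4) -> list:
    """One combined K-mer frequency vector per non-overlapping window.

    Assumptions: windows do not overlap, a trailing partial window is
    dropped, and K-mers containing non-ACGT symbols are ignored.
    """
    seq = seq.upper()
    columns = {kmer: j for j, kmer in enumerate(combined_kmers(k))}
    rows = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        row = [0.0] * len(columns)
        total = 0
        for i in range(window - k + 1):
            kmer = chunk[i:i + k]
            if set(kmer) <= set("ACGT"):
                row[columns[canonical(kmer)]] += 1
                total += 1
        rows.append([c / total for c in row] if total else row)
    return rows
```

Each returned row is one window; comparing rows against each other is one way to check how stable the frequencies are along a given sequence.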
Genome Visualization
When the frequencies are mapped to grey levels, each genome can be visualized as a grey-level image.
x-axis: combined K-mers (for 4-mers, the 136 combined 4-mers, e.g., AAAA/TTTT, ACAG/CTGT, CGAT/ATCG)
y-axis: genome axis (position along the genome sequence)
grey level: frequency
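A minimal sketch of the grey-level mapping, assuming a matrix with one row per genome window and one column per combined K-mer. The linear scaling, the "higher frequency is brighter" convention, and the choice of the plain-text PGM image format are assumptions for illustration; the slides do not specify how frequencies are converted to grey levels.

```python
def to_grey_levels(rows: list, max_level: int = 255) -> list:
    """Linearly rescale a frequency matrix to integer grey levels 0..max_level.

    Assumption: scaling is relative to the largest frequency in the whole
    matrix, and a higher frequency maps to a brighter pixel.
    """
    peak = max((value for row in rows for value in row), default=0.0)
    if peak == 0:
        return [[0 for _ in row] for row in rows]
    return [[round(max_level * value / peak) for value in row] for row in rows]


def to_pgm(levels: list, max_level: int = 255) -> str:
    """Serialize grey levels as a plain-text PGM (P2) image.

    Each image row is one genome window (y-axis) and each column is one
    combined K-mer (x-axis).
    """
    height = len(levels)
    width = len(levels[0]) if levels else 0
    lines = ["P2", f"{width} {height}", str(max_level)]
    lines += [" ".join(str(v) for v in row) for row in levels]
    return "\n".join(lines) + "\n"
```

The column count of 136 for 4-mers follows from counting: of the 4^4 = 256 possible 4-mers, 16 equal their own reverse complement, and the remaining 240 pair up into 120, giving 16 + 120 = 136.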
Genome Barcodes
Barcodes of various genomes: P. furiosus, B. pseudomallei, E. coli O157, E. coli K-12
Genome Barcodes
How about the barcode of a random sequence over {A, C, G, T}? No, you cannot fake a genome.
[Figure: barcode of a random sequence]
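One way to see what a random sequence looks like in this representation is to generate one and compare its combined 4-mer frequencies with the values expected by chance. The sketch below assumes a uniform, independent model for the random sequence (the slide does not say how its random sequence was generated), and all function names are illustrative.

```python
import random
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_dna(length: int, seed: int = 0) -> str:
    """Uniform, independent draws from {A, C, G, T} (seeded for repeatability)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


def combined_frequencies(seq: str, k: int = 4) -> dict:
    """Frequency of each K-mer pooled with its reverse complement."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, kmer.translate(_COMPLEMENT)[::-1])] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def expected_uniform_frequency(kmer: str) -> float:
    """Expected combined frequency under the uniform independent model.

    A K-mer that equals its own reverse complement is counted once, so its
    expectation is 4**-k; every other pair pools two K-mers, giving 2 * 4**-k.
    """
    is_palindrome = kmer == kmer.translate(_COMPLEMENT)[::-1]
    return (1 if is_palindrome else 2) / 4 ** len(kmer)
```

Under this model every column of the barcode is expected to take one of only two values, so the image carries none of the K-mer-specific structure that distinguishes one genome from another.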
Properties of Genome Barcodes
The majority of a prokaryotic genome's short fragments have highly similar barcodes.
[Figure panels: E. coli K-12, E. coli O157, B. pseudomallei, P. furiosus]
Abnormal Barcodes
On average, 12-13% of the genomic fragments in a bacterial genome have barcodes that differ substantially from those of the rest of the genome.
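A hedged sketch of how fragments with abnormal barcodes might be flagged: compute each window's combined K-mer profile, measure its distance to the genome-wide average profile, and report the windows that lie unusually far away. The Euclidean distance, the cutoff of mean plus `n_sd` standard deviations, and the function names are assumptions chosen for illustration; the slides do not state which distance or threshold was used.

```python
import math
import statistics
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def window_profile(chunk: str, k: int = 4) -> Counter:
    """Combined K-mer frequencies of one fragment, as a sparse Counter."""
    counts = Counter()
    for i in range(len(chunk) - k + 1):
        kmer = chunk[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[min(kmer, kmer.translate(_COMPLEMENT)[::-1])] += 1
    total = sum(counts.values())
    return Counter({kmer: n / total for kmer, n in counts.items()}) if total else counts


def abnormal_windows(seq: str, window: int = 1000, k: int = 4, n_sd: float = 2.0) -> list:
    """Indices of windows whose barcode is far from the genome-wide average.

    Assumptions (not from the source): Euclidean distance to the mean
    profile, and a cutoff of mean distance + n_sd standard deviations.
    """
    seq = seq.upper()
    profiles = [window_profile(seq[s:s + window], k)
                for s in range(0, len(seq) - window + 1, window)]
    if not profiles:
        return []
    keys = sorted(set().union(*profiles))
    mean = {key: sum(p[key] for p in profiles) / len(profiles) for key in keys}
    dists = [math.sqrt(sum((p[key] - mean[key]) ** 2 for key in keys))
             for p in profiles]
    cutoff = statistics.mean(dists) + n_sd * statistics.pstdev(dists)
    return [i for i, d in enumerate(dists) if d > cutoff]
```

As a toy usage example, a sequence made of ten identical "host" windows with one window of very different composition spliced into the middle should have only that middle window reported.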
Abnormal Barcodes
This distance distribution suggests that we may be able to estimate how long the transferred genes have been in the host, rather than just identify which genes were transferred.
Barcode Properties of Genomes
Different types of genomic regions tend to have both common and unique characteristics.
[Figure panels: coding regions, intergenic regions, interoperonic regions]
Barcode Properties of Genomes
Different classes of genomes (eukaryotic, prokaryotic, mitochondrial, plasmid, and plastid) have their own unique and identifiable characteristics.
[Figure legend: red = prokaryotes, blue = eukaryotes, green = plastids, orange = plasmids, black = mitochondria]
Why Barcode Properties
[Figure panels: 0th-order Markov chain, 1st-order Markov chain, 3rd-order Markov chain, 5th-order Markov chain]
We believe that it is the Markov chain properties of prokaryotic DNA that give rise to the barcode property.
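To experiment with this idea, one could fit a Markov chain of a chosen order to a genome and generate a synthetic sequence from it, then compare the barcodes of the two. The sketch below is only an illustration under stated assumptions: the function names, the `start` argument that seeds the first bases, and the uniform fallback for unseen contexts are hypothetical choices, not the authors' procedure.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seq: str, order: int) -> dict:
    """Count which base follows each context of `order` preceding bases."""
    model = defaultdict(Counter)
    for i in range(order, len(seq)):
        model[seq[i - order:i]][seq[i]] += 1
    return dict(model)


def generate(model: dict, order: int, length: int, start: str, seed: int = 0) -> str:
    """Sample a sequence of `length` bases from a fitted model.

    Assumptions: `start` supplies the first `order` bases, and a context
    never seen in training falls back to a uniform draw from A, C, G, T.
    """
    rng = random.Random(seed)
    out = list(start[:order])
    while len(out) < length:
        context = "".join(out[len(out) - order:]) if order else ""
        followers = model.get(context)
        if followers:
            bases, weights = zip(*followers.items())
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            out.append(rng.choice("ACGT"))
    return "".join(out[:length])
```

An order of 0 reproduces only the base composition, while higher orders also preserve the frequencies of longer words; a model of order K - 1 or higher is what would be needed to capture K-mer frequencies directly.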
Barcode Properties
But why do eukaryotic genomes also have (seemingly more complex) barcode properties? Do the different (major) regions of a eukaryotic genome also follow Markov chain models, or more complex stochastic models?
protein-coding genes (hidden Markov model)
RNA-coding genes
regulatory regions
repetitive elements
This is something that we hope to answer someday!
What Questions Can We Answer
Which organism is this DNA sequence from? Is it from a bacterium, fly or any other organism? Is it really from one organism? Is it a real DNA sequence?
YES to all these and many other questions!
Take-Home Message
Genome visualization could be a key to discovering new genomic elements. Barcodes represent only an initial effort in this direction.