Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Similar documents
Practical Bioinformatics

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

Crick s early Hypothesis Revisited

SUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

Number-controlled spatial arrangement of gold nanoparticles with

Advanced topics in bioinformatics

Supplementary Information for

Characterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

NSCI Basic Properties of Life and The Biochemistry of Life on Earth

Supporting Information for. Initial Biochemical and Functional Evaluation of Murine Calprotectin Reveals Ca(II)-

SSR ( ) Vol. 48 No ( Microsatellite marker) ( Simple sequence repeat,ssr),

SUPPLEMENTARY DATA - 1 -

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Supplementary Information

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R

Clay Carter. Department of Biology. QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture.

Supplemental Figure 1.

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies

Supporting Information

Electronic supplementary material

Codon Distribution in Error-Detecting Circular Codes

Building a Multifunctional Aptamer-Based DNA Nanoassembly for Targeted Cancer Therapy

TM1 TM2 TM3 TM4 TM5 TM6 TM bp

Supplemental Table 1. Primers used for cloning and PCR amplification in this study

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval

SUPPLEMENTARY INFORMATION

The 3 Genomic Numbers Discovery: How Our Genome Single-Stranded DNA Sequence Is Self-Designed as a Numerical Whole

The Trigram and other Fundamental Philosophies

The role of the FliD C-terminal domain in pentamer formation and

Protein Threading. Combinatorial optimization approach. Stefan Balev.

part 3: analysis of natural selection pressure

SUPPLEMENTARY INFORMATION

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Why do more divergent sequences produce smaller nonsynonymous/synonymous

Chapter 2 Class Notes Words and Probability

Biology 644: Bioinformatics

Evolutionary dynamics of abundant stop codon readthrough in Anopheles and Drosophila

evoglow - express N kit distributed by Cat.#: FP product information broad host range vectors - gram negative bacteria

Re- engineering cellular physiology by rewiring high- level global regulatory genes

Sex-Linked Inheritance in Macaque Monkeys: Implications for Effective Population Size and Dispersal to Sulawesi

evoglow - express N kit Cat. No.: product information broad host range vectors - gram negative bacteria

Introduction to Molecular Phylogeny

Supplementary Information

AtTIL-P91V. AtTIL-P92V. AtTIL-P95V. AtTIL-P98V YFP-HPR

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton.

ChemiScreen CaS Calcium Sensor Receptor Stable Cell Line

From DNA to protein, i.e. the central dogma

Aoife McLysaght Dept. of Genetics Trinity College Dublin

THE MATHEMATICAL STRUCTURE OF THE GENETIC CODE: A TOOL FOR INQUIRING ON THE ORIGIN OF LIFE

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the

Lecture 15: Realities of Genome Assembly Protein Sequencing

Edinburgh Research Explorer

Near-instant surface-selective fluorogenic protein quantification using sulfonated

Lecture IV A. Shannon s theory of noisy channels and molecular codes

Timing molecular motion and production with a synthetic transcriptional clock

ydci GTC TGT TTG AAC GCG GGC GAC TGG GCG CGC AAT TAA CGG TGT GTA GGC TGG AGC TGC TTC

It is the author's version of the article accepted for publication in the journal "Biosystems" on 03/10/2015.

Objective: You will be able to justify the claim that organisms share many conserved core processes and features.

Using algebraic geometry for phylogenetic reconstruction

Evolutionary Analysis of Viral Genomes

Supporting Information. An Electric Single-Molecule Hybridisation Detector for short DNA Fragments

Chain-like assembly of gold nanoparticles on artificial DNA templates via Click Chemistry

An Analytical Model of Gene Evolution with 9 Mutation Parameters: An Application to the Amino Acids Coded by the Common Circular Code

MicroGenomics. Universal replication biases in bacteria

Identification of a Locus Involved in the Utilization of Iron by Haemophilus influenzae

FliZ Is a Posttranslational Activator of FlhD 4 C 2 -Dependent Flagellar Gene Expression

160, and 220 bases, respectively, shorter than pbr322/hag93. (data not shown). The DNA sequence of approximately 100 bases of each

Diversity of Chlamydia trachomatis Major Outer Membrane

Using an Artificial Regulatory Network to Investigate Neural Computation

Capacity of DNA Data Embedding Under. Substitution Mutations

Motif Finding Algorithms. Sudarsan Padhy IIIT Bhubaneswar

Insects act as vectors for a number of important diseases of

Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation.

Supporting Information

Codon-model based inference of selection pressure. (a very brief review prior to the PAML lab)

codon substitution models and the analysis of natural selection pressure

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Pathways and Controls of N 2 O Production in Nitritation Anammox Biomass

Lecture 15. Saad Mneimneh

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

Supplemental Figure 1. Differences in amino acid composition between the paralogous copies Os MADS17 and Os MADS6.

Table S1. DNA oligonucleo3des used for 3D pol mutagenesis. Fingers Domain Entry Channel. Fidelity

Symmetry Studies. Marlos A. G. Viana

The Physical Language of Molecules

Glucosylglycerate phosphorylase, a novel enzyme specificity involved in compatible solute metabolism

HADAMARD MATRICES AND QUINT MATRICES IN MATRIX PRESENTATIONS OF MOLECULAR GENETIC SYSTEMS

DNA sequence analysis of the imp UV protection and mutation operon of the plasmid TP110: identification of a third gene

Lecture 3: Markov chains.

The Mathematics of Phylogenomics

Genetic Code, Attributive Mappings and Stochastic Matrices

NEW DNA CYCLIC CODES OVER RINGS

CCHS 2015_2016 Biology Fall Semester Exam Review

How DNA barcoding can be more effective in microalgae. identification: a case of cryptic diversity revelation in Scenedesmus

Phylogenetic invariants versus classical phylogenetics

Genome Sequencing & DNA Sequence Analysis

Analysis of Codon Usage Bias of Delta 6 Fatty Acid Elongase Gene in Pyramimonas cordata isolate CS-140

Transcription:

582746 Modelling and Analysis in Bioinformatics Lecture 1: Genomic k-mer Statistics Juha Kärkkäinen 06.09.2016

Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k

Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k

Course Topics Sequence analysis (Juha Kärkkäinen) Network models (Leena Salmela) Network inference (Antti Honkela)

Schedule First six weeks Tuesday 12 14: Lecture to introduce the topic Thursday 10 12: Study group to deepen the knowledge Thursday 12 14: Exercise session to get started on exercises Exercise solutions must be returned about week later Week 7: Guest lectures Tuesday and Thursday 12 14

How to Pass the Course Attending study groups on Thursday mornings is mandatory Attending invited lectures on Tuesday 18.10. and Thursday 20.10. is mandatory Submit the exercises At least 6 points for each three exercise set (sequence analysis, network models, network inference) At least 30 points total No exam Not possible to pass with separate exam If you miss a study group or a visiting lecture, contact the lecturers for an alternative assignment

Grading Grading is based on submitted exercises 60 points available 30 points is required for passing 50 points gives highest grade 5

Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k

Genomic k-mers DNA sequence is a sequence of nucleotides or bases A, C, G, and T k-mer is sequence of k bases 1-mers: A, C, G, T 2-mers: AA, AC, AG, AT, CA, CC,... 3-mers (codons): AAA, AAC,... 4-mers and beyond We are interested in the frequencies of k-mers in a genome fr(a), fr(ca), fr(aac),...

Double Stranded DNA DNA is typically double stranded In the complementary strand: A T C G The strands have opposite directions marked by the 5 - and 3 -prime endings 5... GGATCGAAGCTAAGGGCT... 3 3... CCTAGCTTCGATTCCCGT... 5

Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k

1-Mer Frequencies 5 GGATCGAAGCTAAGGGCT 3 3 CCTAGCTTCGATTCCCGT 5 Top strand: fr(g) = 7/18, fr(c) = 3/18, fr(a) = 5/18, fr(t) = 3/18 Bottom strand: fr(g) = 3/18, fr(c) = 7/18, fr(a) = 3/18, fr(t) = 5/18 Both strands: fr(g) = 10/36, fr(c) = 10/36, fr(a) = 8/36, fr(t) = 8/36

G+C or GC Content 5 GGATCGAAGCTAAGGGCT 3 3 CCTAGCTTCGATTCCCGT 5 1-mer frequencies can be summarized as one number, the G+C or GC content fr(g+c) = fr(g) + fr(c) = 10/36 + 10/36 = 20/36 For double stranded case, other frequencies can be computed from the G+C content fr(g) = fr(c) = fr(g+c)/2 = 10/36 fr(a+t) = 1 fr(g+c) = 16/36 fr(a) = fr(t) = fr(a+t)/2 = 8/36 G+C content can be computed from a single strand too fr(g+c) = 10/18 = 20/36

G+C content and genome size Species G+C content genome size (Mb) Mycoplasma genitalium 31.6% 0.585 Thermoplasma volcanium 39.9% 1.585 Pyrococcus abyssi 44.6% 1.765 Escherichia coli K-12 50.7% 4.693 Pseudomonas aeruginosa PAO1 66.4% 6.264 Caenorhabditis elegans 36% 97 Arabidopsis thaliana 35% 125 Homo sapiens 41% 3080

Bernoulli or i.i.d. Model for DNA i.i.d. = independent and identically distributed Generate a sequence so that the base at each position is chosen randomly independently of other positions using identical distribution for all positions In the simplest case, we can use a uniform distribution: P(A) = P(C) = P(G) = P(T) = 0.25 If we know the desired G+C content, we can use the base frequencies derived from the G+C content as the probabilities Bernoulli is a reasonable model for some organisms but not for others

GC Skew GC skew is defined by (#G #C)/(#G + #C) It is calculated for windows (intervals) of specific width... A GGATCGAA TCTAAG... (3-1)/(3+1) = 2/4... AG GATCGAAT CTAAG... (2-1)/(2+1) = 1/3... AGG ATCGAATC TAAG... (1-2)/(1+2) = 1/4 GC skew can be different in different regions of a genome

GC Skew in A Genome Circular genome of Blochmannia vafer G+C content is black GC skew is purple/green The GC skew switch point is origin of replication More complex organisms have multiple origins of replication and less striking GC skew statistics

DNA Replication Asymmetry in replication (continuous vs. piecewise) Leading strand tends to have a different GC skew than lagging strand

Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k

First Order Markov Model In Bernoulli model, each base is independent of others In a first order Markov model, each base depends on the preceding base The probabilities can be obtained from 2-mer frequencies Often much more realistic than Bernoulli model

Markov Chain Probabilities Let X be a sequence and X t the base at position t X and Xt are random variables Probability that X t = b given that X t 1 = a: p ab = P(X t = b X t 1 = a) p ab is that same at each position t Transition matrix A C G T A p AA p AC p AG p AT C p CA p CC p CG p CT G p GA p GC p GG p GT T p TA p TC p TG p TT We also need the probabilities of the first base: p A, p C, p G, p T

Generating a Sequence 1. Choose the starting base by the distribution p A, p C, p G, p T 2. Then choose each base using a distribution defined by a row of the transition matrix For example if Xt 1 = C then X t is chosen by the distribution p CA, p CC, p CG, p CT A C G T A p AA p AC p AG p AT C p CA p CC p CG p CT G p GA p GC p GG p GT T p TA p TC p TG p TT

Estimating the Probabilities Use observed frequencies: fr(a), fr(c),..., fr(aa), fr(ac),... Base probabilities: p A = fr(a), p C = fr(c),... Note that fr(a) fr(aa) + fr(ac) + fr(ag) + fr(at) Transition probabilities p ab = P(X t = b X t 1 = a) = P(X t = b, X t 1 = a) P(X t 1 = a) = fr(ab) fr(a)

Estimating the Probabilities fr(ab) = P(X t = b, X t 1 = a) A C G T A 0.146 0.052 0.058 0.089 C 0.063 0.029 0.010 0.056 G 0.050 0.030 0.030 0.051 T 0.086 0.047 0.063 0.140 p ab = P(X t = b X t 1 = a) A C G T A 0.423 0.151 0.168 0.258 C 0.399 0.184 0.063 0.354 G 0.314 0.189 0.176 0.321 T 0.258 0.138 0.187 0.415 Example: computing p AC p A = fr(a) = fr(aa) + fr(ac) + fr(ag) + fr(at) = 0.146 + 0.052 + 0.058 + 0.089 = 0.345 p AC = fr(ac) / fr(a) = 0.052 / 0.345 0.151

CpG Suppression Low frequency of CG is common (in vertebrates) Human genome: fr(cg) 0.01 Bernoulli model: fr(c+g) = 0.41 fr(cg) 0.04 This is called CG or CpG suppression (CpG = C-phosphate-G) Caused by the tendency of CG to mutate into TG Some regions, called CpG islands, have a higher CG frequency

Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k

2nd Order Markov Model Markov model generalizes to higher orders In 2nd order Markov model, position t depends on positions t 1 and t 2 A C G T AA p AAA p AAC p AAG p AAT AC p ACA p ACC p ACG p ACT Transition matrix AG p AGA p AGC p AGG p AGT AT p ATA p ATC p ATG p ATT CA p CAA p CAC p CAG p CAT... Can be estimated from 3-mer frequencies

Codons In genes, amino acids are encoded by 3-mers called codons Amino acid Codons Ala/A GCT, GCC, GCA, GCG Arg/R CGT, CGC, CGA, CGG, AGA, AGG Asn/N AAT, AAC Asp/D GAT, GAC Cys/C TGT, TGC...... DNA sequence can be divided into codons in three different ways called reading frames AGG TGA CAC CGC AAG CCT TAT ATT AGC A A GGT GAC ACC GCA AGC CTT ATA TTA GCA AG GTG ACA CCG CAA GCC TTA TAT TAG CA 3-mer models can help in identifying reading frames

Codon Usage Bias Highly expressed genes prefer certain codons among synonymous ones Codon frequencies for some genes in E. coli Gene expression Amino acid Codon Predicted Moderate High Phe TTT 0.493 0.551 0.291 TTC 0.507 0.449 0.709 Ala GCT 0.246 0.145 0.275 GCC 0.254 0.276 0.164 GCA 0.246 0.196 0.240 GCG 0.254 0.382 0.323 Asn AAT 0.493 0.409 0.172 AAC 0.507 0.591 0.828

Codon Adaptation Index (CAI) Let X = x 1 x 2... x n be a sequence of codons in a gene p(x i ) is the probability of x i among synonymous codons in a highly expressed gene q(x i ) is the highest probability among synonymous codons Example: Alanine in E. coli p(gct)=0.275, p(gcc)=0.164, p(gca)=0.240, p(gcg)=0.323 q(gct) = q(gcc) = q(gca) = q(gcg) = 0.323 CAI is defined as ( n CAI = p(x i )/q(x i ) i=1 ) 1/n

CAI: Example Amino acid Ala Tyr Met Ser CAI Codon GCT GAA ATG TCA p 0.47 0.19 1.00 0.03 q 0.47 0.81 1.00 0.43 p/q 1.00 0.23 1.00 0.07 0.016 0.016 1/4 CAI = 0.016 1/4 = 0.36

CAI: Properties CAI = 1.0: each codon is the most probable one In a sample E. coli genes CAI ranged from 0.2 to 0.85 As the product can be a very small number, numerical problems could be avoided by using a log-odds form logcai = (1/n) n log(p(x i )/q(x i )) i=1 CAI can be used for predicting expression levels

Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k

k-mers for Larger k (k 1)th order Markov models For large k, summary statistics can be more useful Surprising (over- or under-abundant) k-mers k-mer spectra For large enough k, most frequencies are zero The set of k-mers is widely used in genome assembly in the form of De Bruijn graphs

De Bruijn Graph Directed graph Vertices are (k 1)-mers Edges are k-mers Connects prefix to suffix Example: k-mers = ATG, TGC, GTG, TGG, GGC, GCA, GCG, CGT GT CG AT TG GC CA GG

Surprising k-mers fr = observed frequency of a k-mer p = expected frequency using a lower order model odds-ratio OR = +0.5 prevents division by zero (1 fr + 0.5)/(fr + 0.5) (1 p + 0.5)/(p + 0.5) if OR 1, the k-mer is over-abundant if OR 1, the k-mer is under-abundant Over- and under-abundant k-mers are more likely to interesting Why are they over- or under-abundant? Are some under-abundant k-mers poisonous?

Most Over-Abundant 11-mers in Human Genome k-mer Count Expected OR cgcgcgcgcgc 2448 0.061 40098 gcgcgcgcgcg 2437 0.060 39949 cgccgccgccg 4642 0.481 9650 cggcggcggcg 4601 0.480 9566 ccgcgcccggc 19701 2.541 7753 gccgggcgcgg 19556 2.539 7703 caccgcgcccg 19907 2.895 6876 cgggcgcggtg 19848 2.888 6873 accgcgcccgg 19655 3.002 6547 ccgggcgcggt 19539 2.996 6522

Most Under-Abundant 11-mers in Human Genome k-mer Count Expected OR tcgaaattcgc 0 46.0 0.011 cccccccctat 9 817.2 0.012 attgcgaacga 0 41.9 0.012 tcgcgagttaa 0 34.2 0.015 atcttcgcgag 0 33.9 0.015 tcagggggggg 15 1003.5 0.015 atcgcaacgga 0 32.3 0.015 tatgtttcgcg 0 31.9 0.016 tgcaacgatcg 0 31.1 0.016 agtccgcgcaa 0 30.3 0.016

k-mer Spectra For example, there are over 700 000 distinct k-mers that occur exactly 20 times More in the study groups

What Next? Thu 10-12: study groups Assignments on the course home page Read the assigned material in advance Mandatory attendance Thu 12-14: exercise session Exercise problems will be added to the course home page No advance preparation required Solutions must be returned by 15.9. No mandatory attendance