Increasingly complex (accurate) searches. DNA2: diversity & continuity. Alignments & Scores. DNA2: Aligning ancient diversity.
|
|
- Lucy Barnett
- 5 years ago
- Views:
Transcription
1 DNA: (Last week) Types of mutants Mutation, drift, selection Binomial for each Association studies χ 2 statistic Linked & causative alleles Alleles, Haplotypes, genotypes Computing the first genome, the second... New technologies Random and systematic errors Multi-sequence alignment pace-time-accuracy tradeoffs Finding genes -- motif profiles 2 DNA2: diversity & continuity DNA: the last 5000 human generations Applications of Dynamic Programming To sequence analysis hotgun sequence assembly Multiple alignments Dispersed & tandem repeats Bird song alignments Gene Expression time-warping 3D-structure alignment Integrate&2: ancient genetic code 3 Through HMMs RNA gene search & structure prediction Distant protein homologies peech recognition Alignments & cores Increasingly complex (accurate) searches Global (e.g. haplotype) ACCACACA ::xx::x: ACACCA core= 5(+) + 3(-) = 2 Local (motif) ACCACACA :::: ACACCA core= (+) = Exact (tringearch) Regular expression (Prositeearch) ubstitution matrix (BlastN) Profile matrix (PI-blast) CGCG CGN{0-9}CG = CGAACG CGCG ~= CACG CGc(g/a) ~ = CACG uffix (shotgun assembly) ACCACACA ::: ACACCA core= 3(+) =3 Gaps (Gap-Blast) Dynamic Programming (NW, M) Hidden Markov Models (HMMER) CGCG ~= CGAACG CGCG ~= CAGACG WU 5 6
2 "Hardness" of (multi-) sequence alignment Align 2 sequences of length N allowing gaps. ACCAC-ACA ACCACACA ::x::x:x: :xxxxxx: AC-ACCA, A-----CACCA, etc. Testing search & classification algorithms eparate Training set and Testing sets Need databases of non-redundant sets. Need evaluation criteria (programs) ensistivity and pecificity (false negatives & positives) 2N gap positions, gap lengths of 0 to N each: A naïve algorithm might scale by O(N 2N ). For N= 3x0 9 this is rather large. Now, what about k>2 sequences? sensitivity (true_predicted/true) specificity (true_predicted/all_predicted) Where do training sets come from? More expensive experiments: crystallography, genetics, biochemistry 7 8 or rearrangements other than gaps? Comparisons of homology scores Pearson WR Protein ci 995 Jun;(6):5-60 Comparison of methods for searching protein sequence databases. Methods Enzymol 996;266: Effective protein sequence comparison. Algorithm: FA, Blastp, Blitz ubstitution matrix:pam20, PAM250, BLOUM50, BLOUM62 Database: PIR, WI-PROT, GenPept witch to protein searches when possible x= u c a g Uxu F Y C uxc uxa - - TER uxg - W Cxu H L cxc P R cxa Q cxg axu N axc I T C- axa K R axg M NH+ gxu D gxc V A G O- gxa E gxg H:D/A 9 A Multiple Alignment of Immunoglobulins VTICTGNIGAG-NHVKWYQQLPG VTICTGTNIG--ITVNWYQQLPG LRLCGFIF--YAMYWVRQAPG LLTCTVGTFDD--YYTWVRQPPG PEVTCVVVDVHEDPQVKFNWYVDG-- LVCLIDFYPGA--VTVAWKAD-- AALGCLVKDYFPEP--VTVWNG--- VLTCLVKGFYPD--IAVEWENG-- 0 coring matrix based on large set of distantly related blocks: Blosum62 coring Functions and Alignments % A C D E F G H I K L M N P Q R T V W Y A C D E F G H I K L M N P Q R T V 22 W Y coring function: ω(match) = +; substitution matrix ω(mismatch) = -; ω(indel) = -2; ω(other) = 0. Alignment score: sum of columns. Optimal alignment: maximum score. } 2
3 Calculating Alignment cores GA A-TGA () :xx: (2) : : : ACTA ACT-A A T G A () core = ω = + = 0. A C T A A T G A (2) core = ω = =. A C T A if ω ( indel ) =, core = + + = +. Multi-sequence alignment pace-time-accuracy tradeoffs Finding genes -- motif profiles 3 What is dynamic programming? A dynamic programming algorithm solves every subsubproblem just once and then saves its answer in a table, avoiding the work of recomputing the answer every time the subsubproblem is encountered. -- Cormen et al. "Introduction to Algorithms", The MIT Press. Recursion of Optimal Global Alignments u : optimal global alignment score of u and v. v GA ACT A GA G A = max ACTA ACT A G A. ACTA 5 6 Recursion of Optimal Local Alignments Computing Row-by-Row u : optimal local alignment score of u and v. v uu2... ui vv 2... vj vj uu2... ui ui uu2... ui = max vv 2... vj vj vv 2... vj uu2... ui ui vv 2... vj 0. 7 T min G min A min T min G min A min min = T min G min A min T min G min A min
4 Traceback Optimal Global Alignment Local and Global Alignments T min G min A min A G : A T T A : C A 9 A C C A C A C A A C A C C A T A A C C A C A C A 0 m m m m m m m m A m C m A m C m C m A m T m A m Time and pace Complexity of Computing Alignments Time and pace Problems For two sequences u=u u 2 u n and v=v v 2 v m, finding the optimal alignment takes O(mn) time and O(mn) space. An O()-time operation: one comparison, three multiplication steps, computing an entry in the alignment table An O()-space memory: one byte, a data structure of two floating points, an entry in the alignment table Comparing two one-megabase genomes. pace: An entry: bytes; Table: * 0^6 * 0^6 = G bytes memory. Time: 000 MHz CPU: M entries/second; 0^2 entries: M seconds = 0 days Time & pace Improvement for w-band Global Alignments ummary Two sequences differ by at most w bps (w<<n). w-band algorithm: O(wn) time and space. Example: w=3. A C C A C A C A 0 m m m A m C m A m C C A T A tatistical interpretation of alignments Computing optimal global alignment Computing optimal local alignment Time and space complexity Improvement of time and space coring functions 23 2
5 A Multiple Alignment of Immunoglobulins Multi-sequence alignment pace-time-accuracy tradeoffs Finding genes -- motif profiles VTICTGNIGAG-NHVKWYQQLPG VTICTGTNIG--ITVNWYQQLPG LRLCGFIF--YAMYWVRQAPG LLTCTVGTFDD--YYTWVRQPPG PEVTCVVVDVHEDPQVKFNWYVDG-- LVCLIDFYPGA--VTVAWKAD-- AALGCLVKDYFPEP--VTVWNG--- VLTCLVKGFYPD--IAVEWENG A multiple alignment <=> Dynamic programming on a hyperlattice Multiple Alignment vs Pairwise Alignment From G. Fullen, 996. A- -T Optimal Multiple Alignment A- -T Non-Optimal Pairwise Alignment Computing a Node on Hyperlattice N N - N- N N N N N - NA A V NV N- N- - + δ A NV NA N NV N- NA N V N + δ N A NV - + N N - - δ A V N N - + δ - N - A NV N - V = N + NA N max s δ NĀ - N - V N + δ - NA - NV - s N + δ NĀ - NV - N + δ - N - A 29 k=3 2 k =7 Challenges of Optimal Multiple Alignments pace complexity (hyperlattice size): O(n k ) for k sequences each n long. Computing a hyperlattice node: O(2 k ). Time complexity: O(2 k n k ). Find the optimal solution is exponential in k (non-polynomial, NP-hard). 30
6 Methods and Heuristics for Optimal Multiple Alignments Optimal: dynamic programming Pruning the hyperlattice (MA) Heuristics: tree alignments(clustalw) star alignments sampling (Gibbs) (discussed in RNA2) local profiling with iteration (PI-Blast,...) ClustalW: Progressive Multiple Alignment 2 All Pairwise Alignments imilarity Matrix Cluster Analysis From Higgins(99) and Thompson(99). Multiple Alignment tep:. Aligning and 3 2. Aligning 2 and 3. Aligning (, 3 ) with ( 2, ). Dendrogram 3 Distance 2 32 tar Alignments s = TGCCT s = GGCCT 2 s = CCATTT 3 s = CTTCTT s5 = ACTGACC Pairwise Alignment imilarity Matrix s s2 s3 s s2 7 s3 2 2 s s Find the Central equence s Multiple Alignment TGCCT-- GGCCT-- C-CATTT CTTC-TT-- ACTGACC---- *GCCTTT Combine into Multiple Alignment Pairwise Alignment TGCCT GGCCT TGCCT-- C-CATTT TGCCT CTTC-TT TGCCT ACTGACC 33 Multi-sequence alignment pace-time-accuracy tradeoffs Finding genes -- motif profiles 3 Accurately finding genes & their edges % Annotated "Protein" izes Yeast & Mycoplasma What is distinctive? Failure to find edges? 0. Promoters & CGs islands Variety & combinations. Preferred codons Tiny proteins (& RNAs) 2. RNA splice signals Alternatives & weak motifs 3. Frame across splices Alternatives. Inter-species conservation Gene too close or distant 5. cdna for splice edges Rare transcript 2% 0% Largest proteins: C 8% Mycoplasma Gli Yeast Midasin MG 90 6% Human Titin 3,350 % Yeast 2% 0% % of proteins at length x 35 smallest proteins: Mycoplasma L36 37 x = "Protein" size in #aa Yeast yjr5wa 6 2:20[equence Length] AND cerevisiae 36
7 Predicting small proteins (ORFs) #codons giving random ORF expectation= per kb (vs %GC) min max mall coding regions Mutations in domain II of 23 rrna facilitate translation of a 23 rrna-encoded pentapeptide conferring erythromycin resistance. Dam et al. 996 J Mol Biol 259:-6 Trp (W) leader peptide, codons: MKAIFVLKGWWRT TOP Yeast ORF length giving random expectation= per kb (vs %GC) Phe (F) leader peptide, 5 codons: MKHIPFFFAFFFTFP TOP His (H) leader peptide, 6 codons: TOP MTRVQFKHHHHHHHPD 0% 50% 00% 37 Other examples in proteomics lectures 38 Motif Matrices Protein starts GeneMark a a t g c a t g g a t g t g t g a c g 0 t 0 0 Align and calculate frequencies. Note: Higher order correlations lost Motif Matrices a a t g +3++ = 2 c a t g +3++ = 2 g a t g +3++ = 2 t g t g +++ = 0 a c g 0 t 0 0 Align and calculate frequencies. Note: Higher order correlations lost. core test sets: Multi- sequence alignment pace- time- accuracy tradeoffs Finding genes - - motif profiles a c c c = 2
8 Why probabilistic models in sequence analysis? A Basic idea Recognition - Is this sequence a protein start? Discrimination - Is this protein more like a hemoglobin or a myoglobin? Database search - What are all of sequences in wissprot that look like a serine protease? Assign a number to every possible sequence such that s P(s M) = P(s M) is a probability of sequence s given a model M. 3 equence recognition Database search Recognition question - What is the probability that the sequence s is from the start site model M? P(M s) = P(M)* P(s M) / P(s) (Bayes' theorem) P(M) and P(s) are prior probabilities and P(M s) is posterior probability. N = null model (random bases or AAs) Report all sequences with logp(s M) - logp(s N) > logp(n) - logp(m) Example, say α/β hydrolase fold is rare in the database, about 0 in 0,000,000. The threshold is 20 bits. If considering 0.05 as a significant level, then the threshold is 20+. = 2. bits. 5 6 Plausible sources of mono, di, tri, & tetra- nucleotide biases C rare due to lack of uracil glycosylase (cytidine deamination) TT rare due to lack of UV repair enzymes. CG rare due to 5methylCG to TG transitions (cytidine deamination) AGG rare due to low abundance of the corresponding Arg-tRNA. CTAG rare in bacteria due to error-prone "repair" of CTAGG to C*CAGG. AAAA excess due to polya pseudogenes and/or polymerase slippage. CpG Island + in a ocean of - First order Hidden Markov Model MM=6, HMM= 6 transition probabilities (adjacent bp) P(A+ A+) A+ T+ A- T- AmAcid Codon Number /000 Fraction Arg AGG Arg AGA Arg CGG Arg CGA Arg CGT Arg CGC ftp://sanger.otago.ac.nz/pub/transterm/data/codons/bct/esccol.cod 7 C+ P(G+ C+) > G+ P(C- A+)> C- G- 8
9 Estimate transistion probabilities -- an example Estimated transistion probabilities from 8 "known" islands Training set P(G C) = 3/7 = #(CG) / N #(CN) = aaacagcctgacatggttccgaacagcctcgacggcgtt Isle A+ C+ G+ T+ A- C- G- T- A C G T ums Ocean A- C- G- T- A+ C+ G+ T+ A C G T ums Training set P(G C) = #(CG) / N #(CN) (+) A C G T A C G T (-) A C G T A C G T Laplace pseudocount: Add + count to each observed. (p.9,08,32 Dirichlet) ums Viterbi: dynamic programming for HMM Most probable path l,k=2 states Recursion: v l (i+) = e l (x i +) max(v k (i)a kl ) /8*.27 v s i = C G C G begin A C G T A C G T a= table in slide 50 e= emit s i in state l (Durbin p.56) 5 Multi- sequence alignment pace- time- accuracy tradeoffs Finding genes - - motif profiles 52
DNA1: Last week's take-home lessons
Harvard-MIT Division of Health Sciences and Technology HST.508: Genomics and Computational Biology DNA1: Last week's take-home lessons Types of mutants Mutation, drift, selection Binomial for each Association
More informationCAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan
CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns
More informationComputational Biology: Basics & Interesting Problems
Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information
More informationHidden Markov Models
Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm
More informationSequence analysis and comparison
The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationLecture 7 Sequence analysis. Hidden Markov Models
Lecture 7 Sequence analysis. Hidden Markov Models Nicolas Lartillot may 2012 Nicolas Lartillot (Universite de Montréal) BIN6009 may 2012 1 / 60 1 Motivation 2 Examples of Hidden Markov models 3 Hidden
More informationAn Introduction to Bioinformatics Algorithms Hidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationBioinformatics Chapter 1. Introduction
Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!
More informationMarkov Models & DNA Sequence Evolution
7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under
More informationStatistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic
More informationHidden Markov Models. based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes
Hidden Markov Models based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis via Shamir s lecture notes music recognition deal with variations in - actual sound -
More informationGiri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748
CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/14/07 CAP5510 1 CpG Islands Regions in DNA sequences with increased
More informationHidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang
More informationStephen Scott.
1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for
More informationBioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs
Bioinformatics 1--lectures 15, 16 Markov chains Hidden Markov models Profile HMMs target sequence database input to database search results are sequence family pseudocounts or background-weighted pseudocounts
More informationBiology 644: Bioinformatics
A stochastic (probabilistic) model that assumes the Markov property Markov property is satisfied when the conditional probability distribution of future states of the process (conditional on both past
More informationMarkov Chains and Hidden Markov Models. = stochastic, generative models
Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,
More informationChapter 5. Proteomics and the analysis of protein sequence Ⅱ
Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and
More informationHMMs and biological sequence analysis
HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the
More informationBioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value
More informationIntroduction to Hidden Markov Models for Gene Prediction ECE-S690
Introduction to Hidden Markov Models for Gene Prediction ECE-S690 Outline Markov Models The Hidden Part How can we use this for gene prediction? Learning Models Want to recognize patterns (e.g. sequence
More informationHidden Markov Models for biological sequence analysis
Hidden Markov Models for biological sequence analysis Master in Bioinformatics UPF 2017-2018 http://comprna.upf.edu/courses/master_agb/ Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA
More informationSequencing alignment Ameer Effat M. Elfarash
Sequencing alignment Ameer Effat M. Elfarash Dept. of Genetics Fac. of Agriculture, Assiut Univ. amir_effat@yahoo.com Why perform a multiple sequence alignment? MSAs are at the heart of comparative genomics
More informationAlignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)
Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in
More informationHidden Markov Models for biological sequence analysis I
Hidden Markov Models for biological sequence analysis I Master in Bioinformatics UPF 2014-2015 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Example: CpG Islands
More informationHidden Markov Models. music recognition. deal with variations in - pitch - timing - timbre 2
Hidden Markov Models based on chapters from the book Durbin, Eddy, Krogh and Mitchison Biological Sequence Analysis Shamir s lecture notes and Rabiner s tutorial on HMM 1 music recognition deal with variations
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationStatistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department
More informationMultiple Sequence Alignment using Profile HMM
Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,
More information3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT
3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode
More informationSequence Alignment Techniques and Their Uses
Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this
More informationAlgorithms in Bioinformatics
Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology
More informationSequencing alignment Ameer Effat M. Elfarash
Sequencing alignment Ameer Effat M. Elfarash Dept. of Genetics Fac. of Agriculture, Assiut Univ. aelfarash@aun.edu.eg Why perform a multiple sequence alignment? MSAs are at the heart of comparative genomics
More informationCISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)
CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models
More informationBioinformatics 2 - Lecture 4
Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what
More informationNewly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:
m Eukaryotic mrna processing Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: Cap structure a modified guanine base is added to the 5 end. Poly-A tail
More informationCISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)
CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST
More informationWelcome to Class 21!
Welcome to Class 21! Introductory Biochemistry! Lecture 21: Outline and Objectives l Regulation of Gene Expression in Prokaryotes! l transcriptional regulation! l principles! l lac operon! l trp attenuation!
More informationBasic Local Alignment Search Tool
Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses
More informationHidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)
Hidden Markov Models Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98) 1 The occasionally dishonest casino A P A (1) = P A (2) = = 1/6 P A->B = P B->A = 1/10 B P B (1)=0.1... P
More informationTools and Algorithms in Bioinformatics
Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology
More informationCONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018
CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of
More informationMarkov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University
Markov Chains and Hidden Markov Models COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models Modeling the statistical properties of biological sequences and distinguishing regions
More informationDNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids
Database searches 1 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids 2 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids (cntd) 3 DNA and protein databases SWISS-PROT
More informationBio 1B Lecture Outline (please print and bring along) Fall, 2007
Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution
More informationHidden Markov models in population genetics and evolutionary biology
Hidden Markov models in population genetics and evolutionary biology Gerton Lunter Wellcome Trust Centre for Human Genetics Oxford, UK April 29, 2013 Topics for today Markov chains Hidden Markov models
More informationLecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)
Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from
More informationRegulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)
Regulatory Sequence Analysis Sequence models (Bernoulli and Markov models) 1 Why do we need random models? Any pattern discovery relies on an underlying model to estimate the random expectation. This model
More informationSequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos
Sequence Analysis BBSI 2006: Lecture #(χ+1) Takis Benos (2006) Molecular Genetics 101 What is a gene? We cannot define it (but we know it when we see it ) A loose definition: Gene is a DNA/RNA information
More informationSequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University
Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of
More informationGetting statistical significance and Bayesian confidence limits for your hidden Markov model or score-maximizing dynamic programming algorithm,
Getting statistical significance and Bayesian confidence limits for your hidden Markov model or score-maximizing dynamic programming algorithm, with pairwise alignment of sequences as an example 1,2 1
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems
More informationSUPPLEMENTARY INFORMATION
Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)
More informationMotivating the need for optimal sequence alignments...
1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use
More informationFirst generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences
First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a
More informationHidden Markov Models
Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas Forward Algorithm For Markov chains we calculate the probability of a sequence, P(x) How
More informationSequence Analysis and Databases 2: Sequences and Multiple Alignments
1 Sequence Analysis and Databases 2: Sequences and Multiple Alignments Jose María González-Izarzugaza Martínez CNIO Spanish National Cancer Research Centre (jmgonzalez@cnio.es) 2 Sequence Comparisons:
More informationExample: The Dishonest Casino. Hidden Markov Models. Question # 1 Evaluation. The dishonest casino model. Question # 3 Learning. Question # 2 Decoding
Example: The Dishonest Casino Hidden Markov Models Durbin and Eddy, chapter 3 Game:. You bet $. You roll 3. Casino player rolls 4. Highest number wins $ The casino has two dice: Fair die P() = P() = P(3)
More informationInterpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey
Interpolated Markov Models for Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following the
More informationSequence analysis and Genomics
Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute
More informationTools and Algorithms in Bioinformatics
Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and
More informationMATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME
MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:
More informationROBI POLIKAR. ECE 402/504 Lecture Hidden Markov Models IGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY
BIOINFORMATICS Lecture 11-12 Hidden Markov Models ROBI POLIKAR 2011, All Rights Reserved, Robi Polikar. IGNAL PROCESSING & PATTERN RECOGNITION LABORATORY @ ROWAN UNIVERSITY These lecture notes are prepared
More informationTHEORY. Based on sequence Length According to the length of sequence being compared it is of following two types
Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between
More information(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.
1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the
More informationSequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos. Molecular Genetics 101. Cell s internal world
Sequence Analysis BBSI 2006: ecture #χ+ Takis Benos 2006 Molecular Genetics 0 Cell s internal world DNA - Chromosomes - Genes We cannot define it but we know it when we see it A loose definition: What
More informationReducing storage requirements for biological sequence comparison
Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,
More informationO 3 O 4 O 5. q 3. q 4. Transition
Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in
More informationComparative Bioinformatics Midterm II Fall 2004
Comparative Bioinformatics Midterm II Fall 2004 Objective Answer, part I: For each of the following, select the single best answer or completion of the phrase. (3 points each) 1. Deinococcus radiodurans
More informationSimilarity or Identity? When are molecules similar?
Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are
More information6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution. Lecture 05. Hidden Markov Models Part II
6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution Lecture 05 Hidden Markov Models Part II 1 2 Module 1: Aligning and modeling genomes Module 1: Computational foundations Dynamic programming:
More informationAdvanced topics in bioinformatics
Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site: http://bioinformatics.weizmann.ac.il/courses/atib
More informationBio 119 Bacterial Genomics 6/26/10
BACTERIAL GENOMICS Reading in BOM-12: Sec. 11.1 Genetic Map of the E. coli Chromosome p. 279 Sec. 13.2 Prokaryotic Genomes: Sizes and ORF Contents p. 344 Sec. 13.3 Prokaryotic Genomes: Bioinformatic Analysis
More informationSequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.
Sequence Analysis, '18 -- lecture 9 Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene. How can I represent thousands of homolog sequences in a compact
More information1/22/13. Example: CpG Island. Question 2: Finding CpG Islands
I529: Machine Learning in Bioinformatics (Spring 203 Hidden Markov Models Yuzhen Ye School of Informatics and Computing Indiana Univerty, Bloomington Spring 203 Outline Review of Markov chain & CpG island
More informationPattern Matching (Exact Matching) Overview
CSI/BINF 5330 Pattern Matching (Exact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Pattern Matching Exhaustive Search DFA Algorithm KMP Algorithm
More informationInDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9
Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationSimilarity searching summary (2)
Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity
More informationLecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins
Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) 2 19 2015 Scribe: John Ekins Multiple Sequence Alignment Given N sequences x 1, x 2,, x N : Insert gaps in each of the sequences
More informationHidden Markov Models. x 1 x 2 x 3 x K
Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization: f 0 (0) = 1 f k (0)
More informationGenome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.
Genome Annotation Bioinformatics and Computational Biology Genome Annotation Frank Oliver Glöckner 1 Genome Analysis Roadmap Genome sequencing Assembly Gene prediction Protein targeting trna prediction
More informationBME 5742 Biosystems Modeling and Control
BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various
More informationGibbs Sampling Methods for Multiple Sequence Alignment
Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical
More informationStatistical Sequence Recognition and Training: An Introduction to HMMs
Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with
More informationList of Code Challenges. About the Textbook Meet the Authors... xix Meet the Development Team... xx Acknowledgments... xxi
Contents List of Code Challenges xvii About the Textbook xix Meet the Authors................................... xix Meet the Development Team............................ xx Acknowledgments..................................
More informationProtein function prediction based on sequence analysis
Performing sequence searches Post-Blast analysis, Using profiles and pattern-matching Protein function prediction based on sequence analysis Slides from a lecture on MOL204 - Applied Bioinformatics 18-Oct-2005
More informationUnderstanding relationship between homologous sequences
Molecular Evolution Molecular Evolution How and when were genes and proteins created? How old is a gene? How can we calculate the age of a gene? How did the gene evolve to the present form? What selective
More informationSyllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)
Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationAlgorithms in Computational Biology (236522) spring 2008 Lecture #1
Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: 15:30-16:30/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office hours:??
More informationRGP finder: prediction of Genomic Islands
Training courses on MicroScope platform RGP finder: prediction of Genomic Islands Dynamics of bacterial genomes Gene gain Horizontal gene transfer Gene loss Deletion of one or several genes Duplication
More informationIntroduction to Bioinformatics
Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression
More informationExercise 5. Sequence Profiles & BLAST
Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)
More informationIntroduction to Bioinformatics
CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More information5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT
5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:
More informationComparative genomics: Overview & Tools + MUMmer algorithm
Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first
More information