Incorporating dependence into models for DNA motifs
1 Incorporating dependence into models for DNA motifs. Terry Speed & Xiaoyue Zhao, University of California at Berkeley and Department of Human Genetics, UCLA. May 17,
2 The objects of our study DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units. DNA: units are the nucleotide residues A, C, G and T RNA: units are the nucleotide residues A, C, G and U Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure. 2
4 Motifs - Sites - Signals - Domains For this talk, I'll use these terms interchangeably to describe recurring elements of interest to us. In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants,... In DNA / RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres,... 3
4 Why (probability) models for biomolecular motifs? to characterize them to help identify them for incorporation into larger models, e.g. for an entire gene 4
5 Motifs and models Motifs typically represent regions of structural significance with specific biological function. Are generalisations from known examples. The models can be highly specific. Multiple models can be used to give higher sensitivity & specificity in their detection. Can sometimes be generated automatically from examples or multiple alignments. 5
6 The use of models for motifs Can be descriptive, predictive, or anything in between... almost business as usual. However, stochastic mechanisms should never be taken literally, but nevertheless they can be amazingly useful. Care is always needed: a model or method can break down at any time without notice. Biological confirmation of predictions is almost always necessary. 6
7 Transcription initiation in E. coli In E. coli transcription is initiated at the promoter, and the sequence of the promoter is recognised by the Sigma factor of RNA polymerase. 7
8 Determinism 1: consensus sequences

  σ factor   Promoter consensus sequence
  σ70        TTGACA  TATAAT
  σ28        CTAAA   CCGATAT

Similarly for σ32, σ38 and σ54. Consensus sequences have the obvious limitation: there is usually some deviation from them. 8
9 The human transcription factor Sp1 has 3 Cys-Cys-His-His zinc finger DNA binding domains 9
10 Determinism 2: regular expressions The characteristic motif of a Cys-Cys-His-His zinc finger DNA binding domain has regular expression C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H Here, as in algebra, X is unknown. The 29 a.a. sequence of our example domain 1SP1 is as follows, clearly fitting the model. 1SP1: KKFACPECPKRFMRSDHLSKHIKTHQNKK 10
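The pattern translates directly into standard regular-expression syntax (as on the next slide, X(2,4) becomes .{2,4}, and the bracketed residue class carries over unchanged). A quick self-contained Python check against the 1SP1 sequence quoted above:

```python
import re

# The Cys-Cys-His-His zinc finger pattern C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H,
# rewritten in regex syntax ("X" = any amino acid = ".").
ZF_PATTERN = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

seq_1sp1 = "KKFACPECPKRFMRSDHLSKHIKTHQNKK"

m = ZF_PATTERN.search(seq_1sp1)
if m:
    print(f"match at {m.start()}-{m.end()}: {m.group(0)}")
```

The match starts at the first Cys of the domain; the flexible gaps (.{2,4}, .{3,5}) are what a plain consensus string cannot express.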
11 Searching with regular expressions c.{2,4}c...[livmfywc]...h.{3,5}h PatternFind output [ISREC-Server] Date: Wed Aug 22 13:00:41 MET gp AF AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens] Occurrences: 2 Position : 514 CYICKASCSSQQEFQDHMSEPQH Position : 606 CTVCNRYFKTPRKFVEHVKSQGH... 11
12 Regular expressions can be limiting The regular expression syntax is still too rigid to represent many highly divergent protein motifs. Also, short patterns are sometimes insufficient with today's large databases. Even requiring perfect matches you might find many false positives. On the other hand, some real sites might not be perfect matches. We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position. 12
13 Cys-Cys-His-His profile: sequence logo form A sequence logo is a scaled position-specific a.a. distribution. Scaling is by a measure of a position's information content. (Note that we've lost the option of variable spacing.) 13
14 Weight matrix model (WMM) = Stochastic consensus sequence. From counts of A, C, G and T at each position in 242 known σ70 sites we get relative frequencies f_bl; the weight matrix entries are log2(f_bl / p_b). Informativeness of a position: 2 + Σ_b p_bl log2 p_bl. (Table of counts and relative frequencies omitted.) 14
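The arithmetic behind one column of the matrix can be sketched as follows. The counts are hypothetical (the slide's table for the 242 σ70 sites is not reproduced here); the informativeness is the slide's 2 + Σ_b p_bl log2 p_bl:

```python
from math import log2

# Hypothetical counts of A, C, G, T at one motif position
# (the actual counts from the 242 sigma-70 sites are not recoverable here).
counts = {"A": 9, "C": 3, "G": 5, "T": 225}

total = sum(counts.values())
freqs = {b: n / total for b, n in counts.items()}  # relative frequencies f_bl

# Informativeness (information content) of the position, in bits:
#   2 + sum_b p_bl * log2(p_bl)
info = 2 + sum(p * log2(p) for p in freqs.values() if p > 0)
print(f"information content: {info:.3f} bits")
```

A position where one base dominates (here T) carries close to the maximum 2 bits; a uniform position carries 0.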
15 Interpretation of weight matrix entries. Candidate sequence CTATAATC..., aligned at a position. Hypotheses: S = site (and independence), R = random (equiprobable, independence). Then

log2 [pr(CTATAA | S) / pr(CTATAA | R)]
  = log2 [(.09 × .03 × .26 × .13 × .51 × .01) / (.25 × .25 × .25 × .25 × .25 × .25)]
  = (2 + log2 .09) + ... + (2 + log2 .01)

Generally, score s_bl = log2(f_bl / p_b), where l = position, b = base, p_b = background frequency. 15
16 Use of the matrix to find sites. Move the matrix along the sequence and score each window. Peaks should occur at the true sites. Of course, in general any threshold will have some false positive and false negative rate. (Figure: window scores along a sequence containing CTATAATC; matrix values omitted.)
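A minimal sketch of this window scan, assuming a hypothetical 6-position frequency matrix (chosen so the -10 consensus TATAAT stands out) and the equiprobable background:

```python
from math import log2

BG = 0.25  # equiprobable background frequency for each base

# Hypothetical position-specific frequencies for a 6 bp motif resembling the
# -10 box TATAAT; the real 242-site matrix from the slides is not reproduced.
freqs = [
    {"A": .05, "C": .05, "G": .05, "T": .85},
    {"A": .85, "C": .05, "G": .05, "T": .05},
    {"A": .05, "C": .05, "G": .05, "T": .85},
    {"A": .85, "C": .05, "G": .05, "T": .05},
    {"A": .55, "C": .15, "G": .15, "T": .15},
    {"A": .05, "C": .05, "G": .05, "T": .85},
]

def window_score(window):
    """Log-odds score of one window: sum over positions of log2(f_bl / p_b)."""
    return sum(log2(freqs[l][b] / BG) for l, b in enumerate(window))

def scan(seq):
    """Score every window of motif length along seq."""
    L = len(freqs)
    return [(i, window_score(seq[i:i + L])) for i in range(len(seq) - L + 1)]

seq = "GGCGCTATAATGCCGC"  # TATAAT planted at position 5
best_pos, best_score = max(scan(seq), key=lambda t: t[1])
```

The peak lands on the planted site; in practice one keeps all windows above a threshold c, trading false positives against false negatives.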
17 Modelling motifs: the next steps Missing from the weight matrix models of motifs are good ways of dealing with: Length distributions for insertions/deletions Local and non-local association of amino acids Hidden Markov Models (HMM) help with the first. Dealing with the second remains a hard unsolved problem, but we'll describe a start. 17
18 Hidden Markov Models. Processes {(S_t, O_t), t = 1, ...}, where S_t is the hidden state and O_t the observation at time t, such that

pr(S_t | O_{t-1}, S_{t-1}, O_{t-2}, S_{t-2}, ...) = pr(S_t | S_{t-1})
pr(O_t | S_t, O_{t-1}, S_{t-1}, O_{t-2}, S_{t-2}, ...) = pr(O_t | S_t, S_{t-1})

The basics of HMM were laid bare in a series of beautiful papers by L. E. Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day. 18
19 The algorithms. As the name suggests, with an HMM the series O = (O_1, O_2, O_3, ..., O_T) is observed, while the states S = (S_1, S_2, S_3, ..., S_T) are not. There are elegant algorithms for calculating pr(O | θ), argmax_θ pr(O | θ) in certain special cases, and argmax_S pr(S | O, θ). Here θ are the parameters of the model, e.g. transition and observation probabilities. 19
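As a sketch of the first of these computations, the forward algorithm below evaluates pr(O | θ) for a toy two-state HMM. Note it uses the more common emission assumption pr(O_t | S_t) rather than the slides' more general pr(O_t | S_t, S_{t-1}), and all parameter values are made up; a brute-force sum over all state paths confirms the recursion:

```python
from itertools import product

# Toy 2-state HMM over binary observations (all numbers hypothetical).
states = (0, 1)
init = [0.6, 0.4]                    # pr(S_1 = s)
trans = [[0.7, 0.3], [0.4, 0.6]]     # pr(S_t = s | S_{t-1} = r)
emit = [[0.5, 0.5], [0.9, 0.1]]      # pr(O_t = o | S_t = s)

def forward(obs):
    """pr(O | theta) via the forward recursion: O(T * |S|^2) work."""
    alpha = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
                 for s in states]
    return sum(alpha)

def brute_force(obs):
    """pr(O | theta) by summing over every state path: O(|S|^T) work."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

obs = (0, 1, 1, 0, 1)
lik = forward(obs)
```

The same dynamic-programming idea, with max replacing sum, gives the Viterbi computation of argmax_S pr(S | O, θ).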
20 Profile HMM = stochastic regular expressions M = Match state, I = Insert state, D = Delete state. To operate, go from left to right. I and M states output 20 amino acids; B, D and E states are silent.
21 How profile HMM are used. Instances of the motif are identified by calculating log{pr(sequence | M) / pr(sequence | B)}, where M and B are the motif and background HMM. Alignments of instances of the motif to the HMM are found by calculating argmax_states pr(states | instance, M). Estimation of HMM parameters is by calculating argmax_parameters pr(sequences | M, parameters). In all cases, we use the efficient HMM algorithms. 21
22 Pfam domain-hmm Pfam is a library of models of recurrent protein domains. They are constructed semi-automatically using profile hidden Markov models. Pfam families have permanent accession numbers and contain functional annotation and cross-references to other databases, while Pfam-B families are re-generated at each release and are unannotated. See the Pfam web site. 22
23 Beyond independence Weight array matrices, Zhang & Marr (1993) consider dependencies between adjacent positions, i.e. (nonstationary) first-order Markov models. The number of parameters increases exponentially if we restrict to full higher-order Markovian models. Variable length Markov models, Rissanen (1986), Bühlmann & Wyner (1999), help us get over this problem. In the last few years, many variants have appeared: all make use of trees. [The interpolated Markov models of Salzberg et al (1998) address the same problem.] 23
24 Our aim and some notation. L: length of the sequence motif. X_i: discrete random variable at position i, taking values from a finite set χ. Given a number of instances of a sequence motif x = (x_1, ..., x_L) of length L, we want a model for the probability P(x) of x. We denote by x_i^j (i < j) the sequence (x_j, x_{j-1}, ..., x_i) in reverse time order. 24
25 Variable Length Markov Models. Factorize P(x) in the usual telescopic way:

P(X_1 = x_1) Π_{l=2}^{L} P(X_l = x_l | X_1^{l-1} = x_1^{l-1}),

then simplify this using context functions c_l, l = 2, ..., L, to

P(X_1 = x_1) Π_{l=2}^{L} P(X_l = x_l | c_l(X_1^{l-1}) = c_l(x_1^{l-1})),

where c_l : x_1^{l-1} → x_{l-m}^{l-1} is suitably defined on (l-1)-tuples. 25
26 VLMM, cont. Here c_l : χ^{l-1} → ∪_{i=0}^{l-1} χ^i, and m = m_l is given by

m_l(x_1^{l-1}) = min {r : P(X_l = x | X_1^{l-1} = x_1^{l-1}) = P(X_l = x | X_{l-r}^{l-1} = x_{l-r}^{l-1}) for all x ∈ χ}.

The function c_l defines the sequence-specific context, and m_l defines the sequence-specific memory or order of the Markov property for position l. 26
27 VLMM: an illustrative example. A full set of 16 contexts of order 2, versus a pruned set of 12 contexts in which P(X_3 | X_2 = C, X_1 = G) = P(X_3 | X_2 = C), etc. (Figure: the two context trees.)
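The pruning step in this example can be sketched as a simple check on estimated conditional distributions: collapse the longer context when it adds essentially nothing. The counts and tolerance below are invented for illustration:

```python
# Sketch of the pruning test above: the context (X2=C, X1=G) can be collapsed
# to (X2=C) when P(X3 | X2=C, X1=G) is (approximately) P(X3 | X2=C).
# All counts are hypothetical.

def conditional(counts):
    """Normalize a dict of counts into a conditional distribution."""
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}

# Counts of X3 following each context.
counts_CG = {"A": 30, "C": 10, "G": 40, "T": 20}   # context X2=C, X1=G
counts_C  = {"A": 90, "C": 30, "G": 120, "T": 60}  # context X2=C (any X1)

def prunable(long_ctx, short_ctx, tol=0.02):
    """True if the longer context adds (almost) nothing, so it can be pruned."""
    p_long, p_short = conditional(long_ctx), conditional(short_ctx)
    return all(abs(p_long[b] - p_short[b]) <= tol for b in p_long)

result = prunable(counts_CG, counts_C)
```

In a real VLMM fit the comparison is done with a proper model-selection criterion rather than a fixed tolerance, but the structure of the decision is the same.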
28 VLMM cont. A VLMM for a biomolecular motif of length L is specified by a distribution for X_1, and, for l = 2, ..., L, a constrained distribution for X_l given X_{l-1}, ..., X_1. That is, we need L-1 context functions, or trees. But, there is a difficulty here. 28
29 Sequence dependencies (interactions) are not always local 3-dimensional folding; DNA, RNA & protein interactions The methods outlined so far all fail to incorporate long-range ( 4 bp or a.a.) interactions. New model types are needed. 29
30 Modeling long-range dependency The principal work in this area is Burge & Karlin's (1997) maximal dependence decomposition (MDD). More recently, Cai et al (2000) and Barash et al (2003) used Bayes networks (BN). Ellrott et al (2002) optimized the sequence order in which a stationary Markov chain models the motif. We have adapted this last idea, to give permuted variable length Markov models (PVLMM). Potamianos & Jelinek (1998) have related work on decision trees (PVLMM(D)). 30
31 Maximal Dependence Decomposition MDD starts with a mutual independence model as with WMMs. The data are then iteratively subdivided, at each stage splitting on the most dependent position, suitably defined. At the tips of the tree so defined, a mutual independence model across all remaining positions is used. The details can vary according to the splitting criterion (Burge & Karlin used χ²), the actual splits (binary, etc.), and the stopping rule. However, the result is always a single tree. 31
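The dependence measure itself is just a Pearson chi-square on a contingency table between two positions. A minimal sketch, with toy counts and a binary consensus/non-consensus split in the spirit of Burge & Karlin:

```python
# Pearson chi-square between two motif positions, each dichotomized into
# consensus vs non-consensus base. The counts are invented for illustration.

def chi_square(table):
    """Pearson chi-square statistic for a 2D contingency table (list of rows)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Perfectly dependent positions: the statistic equals n.
dependent = [[10, 0], [0, 10]]
# Perfectly independent positions: the statistic is 0.
independent = [[5, 5], [5, 5]]

chi_dep = chi_square(dependent)    # 20.0
chi_ind = chi_square(independent)  # 0.0
```

MDD splits on the position whose summed chi-square against all other positions is largest, then recurses within each branch.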
32 Issues in modeling short motifs In any study of this kind, essential items are: the model class (e.g. VLMM) the way we search through the model class (e.g. by forward selection) the way we compare models when searching (e.g. by χ²), and finally, the way we assess the final model in relation to our aims (e.g. by cross-validation). We always need interesting, high-quality datasets. 32
33 Model classes and model search For illustrative purposes, we will compare WMM, WAM, MDD and PVLMM(decision) for TFBS and splice donor recognition. Here we search using a simple procedure: recursively choosing the best extension of a current model, or forward selection. Our slower alternative moves through the models using reversible jump Markov chain Monte Carlo (RJMCMC). 33
34 Model comparison We fit the models using maximum likelihood, and compare fitted models using both AIC and BIC, standard penalties for model complexity. Better than either of these two is approximate normalized maximum likelihood (NML), Barron et al (1998). We use mixture models for the data with Jeffreys (Dirichlet) priors. 34
35 A simple illustration: transcription factor binding sites These are of great interest, their signals are very weak, and we typically have only a few instances. We have studied 43 TFBS with effective length 9 and 20 instances. In 17/43 cases we are able to improve upon WMM, the current standard; in 26/43, we cannot. 35
36 20 instances of P$DOF3_01 GTCTAAAGCGT aattaaagtaa GACGAAAGCAA aattaaagtgc GTCTAAAGCga GCGAAAAGCGA GCGTAAAGCAG TAGAAAAGGCG aattaaagtac CACAAAAGCCC GCCCAAAgatc tgacaaagcgt GCGGAAAgatc aattaaagcaa CAAAAAAGGCG taaaaaaggct CAGCAAAGACg GGAAAAAGCAA AGCAAAAGTGC GCAGAAAGTCA 36
37 Modelling P$DOF3_01

  Model      Sn = Sp
  WMM        .15
  MDD        .60
  PVLMM(D)   .55

37
38 Now we touch on finding protein-coding genes 38
39 12 examples of 5′ splice (donor) sites exon TCGGTGAGT intron TGGGTGTGT CCGGTCCGT ATG GTAAGA TCT GTAAGT CAGGTAGGA CAGGTAGGG AAGGTAAGG AGGGTATGG TGGGTAAGG GAGGTTAGT CATGTGAGT 39
40 Sequence logo for human splice donor sites. (Figure: sequence logo showing the per-position distribution of A, C, G, T.)
41 Splice site dataset Human splice donor sequences from SpliceDB, Burset et al (2001):
- 15,155 canonical donor sites of length 9, with GT conserved at positions 0 and 1
- 47,495 false donor sites from the set of all sequences which lie within 40 bp on both sides of the characteristic donor dinucleotide GT. 41
42 Part of a context (decision) tree for position -2 of a splice donor PVLMM Node #s:counts; Edge #s:split variables. Sequence order: +2(A/G) +5(T) -1(G) +4(G) -2(A) +3(A) -3(A) 42
43 Parts of MDD trees for splice donors In each case, splits are into the most frequent nt vs the others. 43
44 Model assessment: Stand-alone splice site recognition M: motif model B: background model Given a sequence x = (x_1, ..., x_L), we predict x to be a motif (here splice donor) if log {P(x | M) / P(x | B)} > c, for a suitably chosen threshold value c. 44
45 Model assessment: terms TP: true positives, TN: true negatives, FP: false positives, FN: false negatives. Sensitivity (sn) and specificity (sp) are given by sn = TP / [TP + FN], sp = TP / [TP + FP]. A 5-fold cross-validation is used in assessing performance. 45
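The two measures can be sketched directly from the counts; the numbers below are illustrative (the FP count is invented). Note that the "specificity" used here, TP/(TP+FP), is what is elsewhere called precision or positive predictive value:

```python
# Sensitivity and specificity as defined on the slide, from prediction counts.
# The counts are illustrative only.

def sn_sp(tp, fn, fp):
    """sn = TP/(TP+FN); sp = TP/(TP+FP), the slide's 'specificity'
    (often called precision / positive predictive value)."""
    return tp / (tp + fn), tp / (tp + fp)

sn, sp = sn_sp(tp=362, fn=98, fp=40)
print(f"sn = {sn:.3f}, sp = {sp:.3f}")
```

Sweeping the threshold c of the log-likelihood-ratio rule traces out the sn-vs-sp curves compared on the next slide.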
46 Optimal permutation: Sp vs Sn Comparison of PVLMM decision tree (NML, ord = 5), MDD (chi-square), WAM and WMM. 46
47 Model assessment: Integrated recognition For this assessment, we integrate the splice donor models into SLAM, Pachter et al (2002), a eukaryotic cross-species gene finder. The training data consists of 3,735 aligned human and mouse gene sequences. The resulting SLAM model is then tested on the Rosetta set of 117 single human gene sequences. 47
48 Results at the nucleotide level (VLMM(D), MDD, PVLMM(D); the sensitivity and specificity values did not survive transcription).

Results at the exon level:

            VLMM(D)   MDD      PVLMM(D)
  Correct   362/...   .../...  .../464
  Partial   84/460    83/465   78/464
  Wrong     14/460    17/465   15/464
  Missing   25/465    23/465   22/465

(Exon-level sensitivity and specificity fractions also only partially survive.) 48
49 Interpretation of PVLMM model selected We use sequence logos to provide simple interpretations of our selected PVLMM, including the optimal permutation. 49
50 Beginning of the splicing process splice donor splice acceptor 50
51 Long-range dependence in the chosen model. (Figure: U1 snRNA, G U C C A U U C A, aligned to the donor site, with the optimal permutation shown.)
52 The context tree for ... (figure)
53 Some future work Joint modelling of human and mouse sites Joint modelling of multiple motifs in one species Including indels and dependence 53
54 Acknowledgements Xiaoyue Zhao, UCB Mauro Delorenzi (ISREC) Sourav Chatterji, UCB The SLAM team: Simon Cawley, Affymetrix Lior Pachter, UCB Marina Alexandersson, FCC 54
56 References
- Biological Sequence Analysis, R Durbin, S Eddy, A Krogh and G Mitchison, Cambridge University Press, 1998.
- Bioinformatics: The machine learning approach, P Baldi and S Brunak, The MIT Press, 1998.
- Post-Genome Informatics, M Kanehisa, Oxford University Press.
57 Bayes networks (BN): an example Here each node corresponds to a sequence position, and the tree defines conditional independence constraints on the distribution. 57
58 To find genes, we need to model splice sites 58
59 Weight matrix models, Staden (1984). (Figure: a weight matrix for donor sites, rows A, C, G, T; values omitted.) Essentially a mutual independence model. An improvement over the consensus CAGGTAAGT. 59
60 Transcription factor binding site (TFBS) recognition We extracted all known TFBS from the TRANSFAC database with a) length 9, and b) 20 known sites. In all, this gave 1,419 sites corresponding to 43 TF. Next we randomly inserted each site into a background sequence of length 1,000 simulated from a stationary 3rd order MM, with parameters estimated from a large collection of human sequence upstream of genes. Finally, we used the PVLMM, MDD and WMM to scan these sequences within a 10-fold cross-validation framework, to select a number of top-scoring sequences as putative binding sites. We always made this number equal to the true number in the sequences, and so sn = sp. 60
61 Human Transcription Factor Binding Sites 61
62 TFBS: three results

  TF            wmm   pvlmm   mdd
  P$DOF3.01     ...   ...     ...
  V$CIZ.01      ...   ...     ...
  P$EMBP1.Q2    ...   ...     ...

Entry: sensitivity/specificity (the numeric entries did not survive transcription). Of the 43 TF, our dependence methods led to no improvement in 27 cases. 62
63 TFBS: more results 63
I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationApplications of Hidden Markov Models
18.417 Introduction to Computational Molecular Biology Lecture 18: November 9, 2004 Scribe: Chris Peikert Lecturer: Ross Lippert Editor: Chris Peikert Applications of Hidden Markov Models Review of Notation
More informationSOME CHALLENGES IN COMPUTATIONAL BIOLOGY
SOME CHALLENGES IN COMPUTATIONAL BIOLOGY M. Vidyasagar Advanced Technology Centre Tata Consultancy Services 6th Floor, Khan Lateefkhan Building Hyderabad 500 001, INDIA sagar@atc.tcs.co.in Keywords: Computational
More informationBME 5742 Biosystems Modeling and Control
BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various
More informationModel Accuracy Measures
Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses
More informationGiri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748
CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/8/07 CAP5510 1 Pattern Discovery 2/8/07 CAP5510 2 Patterns Nature
More informationMotivating the need for optimal sequence alignments...
1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use
More informationQB LECTURE #4: Motif Finding
QB LECTURE #4: Motif Finding Adam Siepel Nov. 20, 2015 2 Plan for Today Probability models for binding sites Scoring and detecting binding sites De novo motif finding 3 Transcription Initiation Chromatin
More informationHMM : Viterbi algorithm - a toy example
MM : Viterbi algorithm - a toy example 0.6 et's consider the following simple MM. This model is composed of 2 states, (high GC content) and (low GC content). We can for example consider that state characterizes
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationHYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH
HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi
More informationINTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA
INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology
More informationComputational Biology: Basics & Interesting Problems
Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information
More informationProtein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche
Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its
More informationCISC 636 Computational Biology & Bioinformatics (Fall 2016)
CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein
More informationSequence Analysis. BBSI 2006: Lecture #(χ+1) Takis Benos (2006) BBSI MAY P. Benos
Sequence Analysis BBSI 2006: Lecture #(χ+1) Takis Benos (2006) Molecular Genetics 101 What is a gene? We cannot define it (but we know it when we see it ) A loose definition: Gene is a DNA/RNA information
More informationEvolutionary Models. Evolutionary Models
Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment
More informationCSEP 590A Summer Lecture 4 MLE, EM, RE, Expression
CSEP 590A Summer 2006 Lecture 4 MLE, EM, RE, Expression 1 FYI, re HW #2: Hemoglobin History Alberts et al., 3rd ed.,pg389 2 Tonight MLE: Maximum Likelihood Estimators EM: the Expectation Maximization Algorithm
More informationCSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators
CSEP 59A Summer 26 Lecture 4 MLE, EM, RE, Expression FYI, re HW #2: Hemoglobin History 1 Alberts et al., 3rd ed.,pg389 2 Tonight MLE: Maximum Likelihood Estimators EM: the Expectation Maximization Algorithm
More informationBioinformatics 2 - Lecture 4
Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what
More informationCSCE555 Bioinformatics. Protein Function Annotation
CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The
More informationMitochondrial Genome Annotation
Protein Genes 1,2 1 Institute of Bioinformatics University of Leipzig 2 Department of Bioinformatics Lebanese University TBI Bled 2015 Outline Introduction Mitochondrial DNA Problem Tools Training Annotation
More informationHidden Markov models in population genetics and evolutionary biology
Hidden Markov models in population genetics and evolutionary biology Gerton Lunter Wellcome Trust Centre for Human Genetics Oxford, UK April 29, 2013 Topics for today Markov chains Hidden Markov models
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationGenomics and bioinformatics summary. Finding genes -- computer searches
Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence
More informationHidden Markov Models. Three classic HMM problems
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems
More informationStatistical Sequence Recognition and Training: An Introduction to HMMs
Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with
More informationChapter 4: Hidden Markov Models
Chapter 4: Hidden Markov Models 4.1 Introduction to HMM Prof. Yechiam Yemini (YY) Computer Science Department Columbia University Overview Markov models of sequence structures Introduction to Hidden Markov
More informationPrediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines
Article Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines Yun-Fei Wang, Huan Chen, and Yan-Hong Zhou* Hubei Bioinformatics and Molecular Imaging Key Laboratory,
More informationCONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018
CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of
More informationLecture 7: Simple genetic circuits I
Lecture 7: Simple genetic circuits I Paul C Bressloff (Fall 2018) 7.1 Transcription and translation In Fig. 20 we show the two main stages in the expression of a single gene according to the central dogma.
More informationLecture 18 June 2 nd, Gene Expression Regulation Mutations
Lecture 18 June 2 nd, 2016 Gene Expression Regulation Mutations From Gene to Protein Central Dogma Replication DNA RNA PROTEIN Transcription Translation RNA Viruses: genome is RNA Reverse Transcriptase
More informationLecture 3: Markov chains.
1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.
More informationROBI POLIKAR. ECE 402/504 Lecture Hidden Markov Models IGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY
BIOINFORMATICS Lecture 11-12 Hidden Markov Models ROBI POLIKAR 2011, All Rights Reserved, Robi Polikar. IGNAL PROCESSING & PATTERN RECOGNITION LABORATORY @ ROWAN UNIVERSITY These lecture notes are prepared
More informationAnomaly Detection for the CERN Large Hadron Collider injection magnets
Anomaly Detection for the CERN Large Hadron Collider injection magnets Armin Halilovic KU Leuven - Department of Computer Science In cooperation with CERN 2018-07-27 0 Outline 1 Context 2 Data 3 Preprocessing
More informationLecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010
Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition
More information