
CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html

Describing & Modeling Patterns

Patterns in DNA Sequences
- Signals in DNA sequence control events:
  - Start and end of genes
  - Start and end of introns
  - Transcription factor binding sites (regulatory elements)
  - Ribosome binding sites
- Detection of these patterns is useful for:
  - Understanding gene structure
  - Understanding gene regulation
2/9/15 CAP5510 / CGS 5166 3

Motifs in DNA Sequences
- Given a collection of DNA sequences of promoter regions, describe the transcription factor binding sites (also called regulatory elements).

[Figures from http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html: Motifs in DNA Sequences; More Motifs in E. coli DNA Sequences; Other Motifs in DNA Sequences: Human Splice Junctions]

[Figure from http://weblogo.berkeley.edu/examples.html: Motifs]

Pattern Representations
- Alignments:
    GAGGTAAAC
    TCCGTAAGT
    CAGGTTGGA
    ACAGTCAGT
    TAGGTCATT
    TAGGTACTG
    ATGGTAACT
    CAGGTATAC
    TGTGTGAGT
    AAGGTAAGT
- Consensus sequences: TAGGTAAGT
- Logo formats
- ...

Profiles
Built from the ten aligned sequences GAGGTAAAC, TCCGTAAGT, CAGGTTGGA, ACAGTCAGT, TAGGTCATT, TAGGTACTG, ATGGTAACT, CAGGTATAC, TGTGTGAGT, AAGGTAAGT.

Frequency matrix (counts per position):
      1    2    3    4    5    6    7    8    9
A     3    6    1    0    0    6    7    2    1
C     2    2    1    0    0    2    1    1    2
G     1    1    7   10    0    1    1    5    1
T     4    1    1    0   10    1    1    2    6

Relative frequencies:
      1    2    3    4    5    6    7    8    9
A    .3   .6   .1    0    0   .6   .7   .2   .1
C    .2   .2   .1    0    0   .2   .1   .1   .2
G    .1   .1   .7  1.0    0   .1   .1   .5   .1
T    .4   .1   .1    0  1.0   .1   .1   .2   .6
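The frequency matrix and the consensus can be recomputed directly from the alignment. The following is a minimal sketch; the variable names (`ALIGNMENT`, `counts`, `freqs`, `consensus`) are illustrative and not part of the lecture.

```python
# The ten aligned sequences copied from the slide.
ALIGNMENT = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
BASES = "ACGT"

n = len(ALIGNMENT)          # number of sequences
length = len(ALIGNMENT[0])  # number of alignment columns

# counts[b][j] = how many sequences have base b at column j (0-based).
counts = {b: [sum(seq[j] == b for seq in ALIGNMENT) for j in range(length)]
          for b in BASES}

# Relative frequencies: counts divided by the number of sequences.
freqs = {b: [c / n for c in counts[b]] for b in BASES}

# Consensus: the most frequent base in each column.
consensus = "".join(max(BASES, key=lambda b: counts[b][j]) for j in range(length))

for b in BASES:
    print(b, counts[b])
print("consensus:", consensus)
```

Taking the most frequent base per column is only one way to define a consensus; columns with near-ties (such as column 1 here) lose information, which is the motivation for keeping the full profile.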

Profiles
Profile (log-odds) matrix derived from the relative frequencies:
       1      2      3      4      5      6      7      8      9
A   0.14   0.72  -0.61  -1.43  -1.43   0.72   0.86  -0.16  -0.61
C  -0.16  -0.16  -0.61  -1.43  -1.43  -0.16  -0.61  -0.61  -0.16
G  -0.61  -0.61   0.86   1.19  -1.43  -0.61  -0.61   0.57  -0.61
T   0.38  -0.61  -0.61  -1.43   1.19  -0.61  -0.61  -0.16   0.72

Profiles
- Profile entries: $P_{ij} = \ln (f_{ij} / b_i)$
- Zero counts: $f_{ij} = (c_{ij} + \alpha b_i) / (n + \alpha)$
Here $c_{ij}$ is the count of residue $i$ at position $j$, $n$ is the number of aligned sequences, $b_i$ is the background frequency of residue $i$, and $\alpha$ is the pseudocount weight.
http://coding.plantpath.ksu.edu/profile/
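The two formulas can be checked numerically against the profile matrix. The sketch below assumes a uniform background b_i = 0.25 and a pseudocount weight alpha = sqrt(n); the slide states neither value, but this choice reproduces the tabulated entries (for example 1.19 for a count of 10 and -1.43 for a count of 0). The function names are illustrative.

```python
import math

# Count matrix from the slide: COUNTS[base][j] for columns 1..9.
COUNTS = {
    "A": [3, 6, 1, 0, 0, 6, 7, 2, 1],
    "C": [2, 2, 1, 0, 0, 2, 1, 1, 2],
    "G": [1, 1, 7, 10, 0, 1, 1, 5, 1],
    "T": [4, 1, 1, 0, 10, 1, 1, 2, 6],
}

def log_odds_profile(counts, n, background=0.25, alpha=None):
    """Return P[base][j] = ln(f / b) with f = (c + alpha*b) / (n + alpha)."""
    if alpha is None:
        alpha = math.sqrt(n)  # assumption: not stated on the slide
    profile = {}
    for base, row in counts.items():
        profile[base] = [
            math.log(((c + alpha * background) / (n + alpha)) / background)
            for c in row
        ]
    return profile

PROFILE = log_odds_profile(COUNTS, n=10)

def score(seq, profile=PROFILE):
    """Sum the profile entries along a sequence of the profile's length."""
    return sum(profile[base][j] for j, base in enumerate(seq))

for base in "ACGT":
    print(base, " ".join(f"{v:6.2f}" for v in PROFILE[base]))
print("score(TAGGTAAGT) =", round(score("TAGGTAAGT"), 2))
```

The pseudocount keeps every entry finite: without it, the zero counts in columns 4 and 5 would give ln(0).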

CpG Islands
- Regions in DNA sequences with increased occurrences of the substring CG
- Elsewhere CG is rare: typically the C gets methylated and then mutates into a T
- Often found around promoter or start regions of genes
- A few hundred to a few thousand bases long

Problem 1:
- Input: Small sequence S
- Output: Is S from a CpG island?
- Build Markov models M+ and M-, then compare

Markov Models
Transition probabilities (row: current base, column: next base).

M+ (CpG island):
        A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

M- (non-island):
        A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

How to distinguish?
- Compute the log-odds score
  $S(x) = \log \frac{P(x \mid M^+)}{P(x \mid M^-)} = \sum_{i=1}^{L} \log \frac{p_{x_{i-1} x_i}}{m_{x_{i-1} x_i}} = \sum_{i=1}^{L} r_{x_{i-1} x_i}$
  where $p$ and $m$ are the transition probabilities of M+ and M-, and $r = \log_2 (p/m)$.

Log-odds table r (row: current base, column: next base):
         A       C       G       T
A   -0.740   0.419   0.580  -0.803
C   -0.913   0.302   1.812  -0.685
G   -0.624   0.461   0.331  -0.730
T   -1.169   0.573   0.393  -0.679

- Score(GCAC) = 0.461 - 0.913 + 0.419 < 0, so GCAC is not from a CpG island.
- Score(GCTC) = 0.461 - 0.685 + 0.573 > 0, so GCTC is from a CpG island.
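The scoring rule can be reproduced from the two transition tables. This is a minimal sketch: the names `PLUS`, `MINUS`, `log_odds` and `score` are illustrative, and the score is summed over adjacent pairs only (no start term), which is how the two worked examples are computed. Because the table entries are rounded, recomputed values can differ from the tabulated r in the third decimal.

```python
import math

BASES = "ACGT"

# Transition probabilities copied from the M+ and M- tables
# (key: current base, list: probability of next base in order A, C, G, T).
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

def log_odds(prev, cur):
    """r = log2(p / m) for the transition prev -> cur."""
    j = BASES.index(cur)
    return math.log2(PLUS[prev][j] / MINUS[prev][j])

def score(seq):
    """Sum of log-odds over all adjacent base pairs of seq."""
    return sum(log_odds(a, b) for a, b in zip(seq, seq[1:]))

for s in ("GCAC", "GCTC"):
    verdict = "CpG island" if score(s) > 0 else "not a CpG island"
    print(s, round(score(s), 3), verdict)
```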

Problem 1:
- Input: Small sequence S
- Output: Is S from a CpG island?
- Build Markov models M+ and M-, then compare

Problem 2:
- Input: Long sequence S
- Output: Identify the CpG islands in S
- Markov models are inadequate. Need Hidden Markov Models.

Markov Model M+ as a State Diagram
[Figure: states A+, C+, G+, T+, fully connected; each edge carries a transition probability from the M+ table, for example P(A+ | A+) and P(G+ | C+)]

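CpG Island (+) in an Ocean of Non-Island Sequence (-): First-Order Hidden Markov Model
- States: A+, C+, G+, T+, A-, C-, G-, T-
- Example transition probabilities: P(A+ | A+), P(G+ | C+)
- Transition probabilities between adjacent bases: 16 for a single Markov model (4 x 4), 64 for the HMM (8 x 8)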

Hidden Markov Model (HMM)
- States
- Transitions
- Transition probabilities
- Emissions
- Emission probabilities
What is hidden about HMMs? Answer: the path through the model is hidden, since there are many valid paths.

How to Solve Problem 2?
- Solve the following problem:
  - Input: Hidden Markov Model M, parameters Θ, emitted sequence S
  - Output: Most Probable Path (MPP) Π
- How: Viterbi's algorithm (dynamic programming)
  - Define Π[i,j] = MPP for the first j characters of S ending in state i
  - Define P[i,j] = probability of Π[i,j]
  - Compute the state i with the largest P[i,j] at the last position j = |S|
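The dynamic program can be sketched as follows. This is not the lecture's 8-state CpG model: it assumes a hypothetical two-state HMM ("+" for island, "-" for ocean) whose start, transition and emission probabilities are invented purely for illustration. Log-probabilities are used to avoid underflow on long sequences.

```python
import math

# Hypothetical toy model (all numbers are illustrative assumptions).
STATES = "+-"
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # island: C/G rich
    "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # ocean: A/T rich
}

def viterbi(seq, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (most probable state path, its log-probability)."""
    # P[s] = log-probability of the best path for the prefix read so far
    # that ends in state s (the column P[., j] of the slide).
    P = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []  # back[j][s] = best predecessor of state s at position j + 1
    for x in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: P[r] + math.log(trans[r][s]))
            col[s] = P[best] + math.log(trans[best][s]) + math.log(emit[s][x])
            ptr[s] = best
        P = col
        back.append(ptr)
    # Pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: P[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), P[last]

path, logp = viterbi("AAAACGCGCGAAAA")
print(path)
```

With these toy parameters each state switch costs a factor of 9 and each mislabelled base a factor of 4, so the six-base CG run is worth the two switches needed to label it "+".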

Profiles
Profiles are also called Position Weight Matrices (PWMs) or Position-Specific Scoring Matrices (PSSMs).
http://coding.plantpath.ksu.edu/profile/

Profile HMMs
[Figure: linear chain START, STATE 1, STATE 2, STATE 3, STATE 4, STATE 5, STATE 6, END]

Profile HMMs with InDels
- Insertions
- Deletions
- Insertions & Deletions
[Figure: the chain START, STATE 1 to STATE 6, END, with DELETE 1, DELETE 2, DELETE 3 above it and INSERT 3, INSERT 4 below it]

Profile HMMs with InDels
[Figure: the chain START, STATE 1 to STATE 6, END, with DELETE 1 to DELETE 6 above it and a row of INSERT states below it]
Missing from the figure: transitions from DELETE j to INSERT j and from INSERT j to DELETE j+1.

HMM for Sequence Alignment
[Figure]

Profile HMM Software
- HMMER: http://hmmer.wustl.edu/
- SAM: http://www.cse.ucsc.edu/research/compbio/sam.html
- PFTOOLS: http://www.isrec.isb-sib.ch/ftp-server/pftools/
- HMMpro: http://www.netid.com/html/hmmpro.html
- GENEWISE: http://www.ebi.ac.uk/wise2/
- PROBE: ftp://ftp.ncbi.nih.gov/pub/neuwald/probe1.0/
- META-MEME: http://metameme.sdsc.edu/
- BLOCKS: http://www.blocks.fhcrc.org/
- PSI-BLAST: http://www.ncbi.nlm.nih.gov/blast/newblast.html
- Read more about Profile HMMs at http://www.csb.yale.edu/userguides/seq/hmmer/docs/node9.html

How to Model Pairwise Sequence Alignment
- Example pair: LEAPVE and LAPVIE
- Pair HMMs emit pairs of symbols
- Emission probabilities? Related to substitution matrices
- How to deal with InDels? Global alignment? Local?
[Figure: pair HMM with states START, MATCH, INSERT, DELETE, END]

How to Model Pairwise Local Alignments?
[Figure: START, Skip Module, Align Module, Skip Module, END]

How to Model Pairwise Local Alignments with Gaps?
[Figure: START, Skip Module, Align Module, Skip Module, END]

Standard HMM Architectures
[Three figure slides]

Problem 3: LIKELIHOOD QUESTION
- Input: Sequence S, model M, state i
- Output: Compute the probability of reaching state i with sequence S using model M
- Backward Algorithm (DP)

Problem 4: LIKELIHOOD QUESTION
- Input: Sequence S, model M
- Output: Compute the probability that S was emitted by model M
- Forward Algorithm (DP)
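For Problem 4, the forward algorithm replaces the maximisation of Viterbi with a sum over predecessor states. The sketch below uses a hypothetical two-state model ("+" and "-") with invented probabilities and no explicit END state; none of these numbers come from the lecture.

```python
# Hypothetical toy model (all numbers are illustrative assumptions).
STATES = "+-"
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def forward(seq, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return P(seq | model), summed over all state paths."""
    # f[s] = probability of emitting the prefix read so far and being in s.
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for x in seq[1:]:
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())

print(forward("CGCG"), forward("ATAT"), forward("ACGT"))
```

For long sequences the products underflow, so practical implementations rescale each column or work with log-sums; that refinement is omitted here to keep the recurrence visible.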

Problem 5: LEARNING QUESTION
- Input: Model structure M, training sequence S
- Output: Compute the parameters Θ
- Criterion: maximum likelihood (ML), i.e. maximize P(S | M, Θ)
- HOW???

Problem 6: DESIGN QUESTION
- Input: Training sequence S
- Output: Choose model structure M, and compute the parameters Θ
- No reasonable solution
- Standard models to pick from

Iterative Solution to the LEARNING QUESTION (Problem 5)
- Pick initial values for parameters Θ0
- Repeat
  - Run training set S on model M
  - Count # of times transition i → j is made
  - Count # of times letter x is emitted from state i
  - Update parameters Θ
- Until (some stopping condition)
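The counting and update steps of one iteration can be sketched as below. The sketch assumes each training sequence already comes with a single state path (for instance a decoded most probable path); Baum-Welch would use expected counts over all paths instead. The function name `reestimate`, the pseudocount of 1, and the two-state labels are illustrative assumptions.

```python
from collections import Counter

def reestimate(pairs, states, alphabet, pseudo=1.0):
    """One parameter update from (sequence, state path) pairs.

    Counts transitions i -> j and emissions of letter x from state i,
    then normalises each row into probabilities. `pseudo` is added to
    every count so that unseen events keep a nonzero probability.
    """
    t_counts, e_counts = Counter(), Counter()
    for seq, path in pairs:
        for x, s in zip(seq, path):
            e_counts[(s, x)] += 1      # letter x emitted from state s
        for r, s in zip(path, path[1:]):
            t_counts[(r, s)] += 1      # transition r -> s taken
    trans, emit = {}, {}
    for r in states:
        t_total = sum(t_counts[(r, s)] + pseudo for s in states)
        trans[r] = {s: (t_counts[(r, s)] + pseudo) / t_total for s in states}
        e_total = sum(e_counts[(r, x)] + pseudo for x in alphabet)
        emit[r] = {x: (e_counts[(r, x)] + pseudo) / e_total for x in alphabet}
    return trans, emit

trans, emit = reestimate([("ACGT", "--++")], states="+-", alphabet="ACGT")
print(trans)
print(emit)
```

In the full loop this update alternates with re-running the training set on the updated model until the likelihood (or the parameters) stops changing.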

Entropy
- Entropy measures the variability observed in given data: $E = -\sum_c p_c \log p_c$
- Entropy is useful in multiple alignments & profiles.
- Entropy is max when uncertainty is max.
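As a small worked example, the entropy formula can be applied to columns of the relative-frequency matrix from the Profiles slides. The base of the logarithm is not given on the slide; base 2 (bits) is assumed here, and terms with p = 0 are taken to contribute 0.

```python
import math

def entropy(probs):
    """E = -sum_c p_c * log2(p_c), skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Columns of the relative-frequency matrix, in the order A, C, G, T.
column_1 = [0.3, 0.2, 0.1, 0.4]   # variable position
column_4 = [0.0, 0.0, 1.0, 0.0]   # fully conserved G
uniform = [0.25, 0.25, 0.25, 0.25]

for name, col in [("column 1", column_1), ("column 4", column_4), ("uniform", uniform)]:
    print(name, round(entropy(col), 3))
```

A fully conserved column has entropy 0 and a uniform column reaches the maximum of 2 bits for a four-letter alphabet, which matches the statement that entropy is maximal when uncertainty is maximal.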

G-Protein Coupled Receptors
- Transmembrane proteins with 7 α-helices and 6 loops; many subfamilies
- Highly variable: 200-1200 aa in length; some have only 20% identity
- [Baldi & Chauvin, 94] HMM for GPCRs
- HMM constructed with 430 match states (average length of the sequences); training with 142 sequences, 12 iterations

GPCR Analysis
- Compute main state entropy values: $H_i = -\sum_a e_{ia} \log e_{ia}$
- For every sequence from the test set (142), the random set (1600), and all SWISS-PROT proteins:
  - Compute the negative log of the probability of the most probable path π: $\mathrm{Score}(S) = -\log P(\pi \mid S, M)$

[Figure slides: GPCR Analysis; Entropy; GPCR Analysis (Cont'd)]

Applications of HMM for GPCR
- Bacteriorhodopsin
  - Transmembrane protein with 7 domains
  - But it is not a GPCR
  - Compute its score and discover that it is close to the regression line; hence not a GPCR.
- Thyrotropin receptor precursors
  - All have a long initial loop on INSERT STATE 20.
  - Clustering is also possible based on distance to the regression line.

HMMs: Advantages
- Sound statistical foundations
- Efficient learning algorithms
- Consistent treatment of insert/delete penalties for alignments, in the form of locally learnable probabilities
- Capable of handling inputs of variable length
- Can be built in a modular & hierarchical fashion; can be combined into libraries
- Wide variety of applications: multiple alignment, data mining & classification, structural analysis, pattern discovery, gene prediction

HMMs: Disadvantages
- Large # of parameters
- Cannot express dependencies & correlations between hidden states
