Alignment Algorithms


Midterm Results Big improvement over scores from the previous two years. Since this class's grade is based on the previous years' curve, that means this class will get higher grades than the previous years did. Your reward for the extra pain of the new in-class exercises: you've learned more than previous classes did, and you're getting better grades than previous classes did.

A Few Lessons From the Midterm The main problem I see is students trying to remember the right answer rather than thinking about what the question is asking. Example: Σ_Y p(x | Y) = ? If you believe this sums to 1 (or to p(x)), why not at least check an example, e.g. p(rain | day-of-week)? People who think things through for themselves get in the habit of looking for easy ways to check their answer, because they've learned this is easy and valuable. You could easily have checked those possible answers against the equations you were provided on the equation sheet.
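To make that check concrete, here is a minimal Python sketch with made-up numbers (the p(rain | day) values and variable names are purely illustrative, not from the course):

p_rain_given_day = {"Mon": 0.2, "Tue": 0.3, "Wed": 0.1}      # hypothetical values of p(rain | day)
p_day = {d: 1.0 / 3 for d in p_rain_given_day}               # uniform p(day) over these three days
sum_over_days = sum(p_rain_given_day.values())               # Σ_Y p(x | Y) = 0.6: neither 1 nor p(x)
p_rain = sum(p_rain_given_day[d] * p_day[d] for d in p_day)  # the actual marginal p(x) = Σ_Y p(x | Y) p(Y) = 0.2
print(sum_over_days, p_rain)

The point of the check: summing a conditional probability over the conditioning variable does not give 1 or p(x); you need to weight by p(Y) to get the marginal.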

Example: Two Sisters Model Many of you proposed the following information graph for the model that assumes the two women are sisters: But this assumes they are the same person! Even before taking this class, everyone here knew that sisters are not genetically identical. The only way I can interpret this answer is that you were trying to remember the match model from the forensic test problem you saw before, rather than thinking about what this question was asking.

Forward-Backward Probabilities State the probabilistic meaning, for an HMM consisting of n hidden variables, of the forward probability of state θ_1 = s_j.

Forward Probability Lessons Hopefully, homework 5 helped you learn the forward-backward algorithm by implementing it. Please review the following to distinguish them clearly in your minds: p(θ_t | X_1..X_n) (posterior): this is what we want to know. The forward-backward algorithm computes this in two halves: p(θ_t, X_1..X_t) (forward probability): the first half; p(X_{t+1}..X_n | θ_t) (backward probability): the second half. Many of you mixed up forward vs. posterior, or left out the observable completely. p(θ_t, X_1..X_n) = p(θ_t, X_1..X_t) p(X_{t+1}..X_n | θ_t)
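For concreteness, here is a minimal Python sketch of this decomposition (toy code with illustrative names, not the course's implementation; prior, trans, and emit are assumed inputs):

import numpy as np

def forward_backward(prior, trans, emit):
    # prior[i] = p(θ_1 = s_i); trans[i, j] = τ_{i,j};
    # emit[i, t] = emission likelihood of the (t+1)-th observation given state s_i
    # (arrays are 0-indexed in code; the slides index time from 1).
    K, n = emit.shape
    f = np.zeros((n, K))             # f[t, i]: forward probability p(θ = s_i, observations so far)
    b = np.zeros((n, K))             # b[t, i]: backward probability of the remaining observations
    f[0] = prior * emit[:, 0]        # t = 1 base case: prior times emission
    for t in range(1, n):
        f[t] = (f[t - 1] @ trans) * emit[:, t]
    b[n - 1] = 1.0                   # base case: empty set of remaining observations
    for t in range(n - 2, -1, -1):   # computed in reverse order, last edge first
        b[t] = trans @ (emit[:, t + 1] * b[t + 1])
    joint = f * b                    # p(θ_t = s_i, X_1..X_n) = forward times backward
    posterior = joint / joint.sum(axis=1, keepdims=True)   # p(θ_t = s_i | X_1..X_n)
    return f, b, posterior

Each row of joint sums to p(X_1..X_n), so dividing by the row sum gives the posterior for that time step, which is exactly the decomposition on this slide.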

Forward Probability for t=1? Some of you ignored the specific question about the t=1 case. Sounds like regurgitating what you read rather than thinking about the question you were asked. Some of you referred to the stationary distribution rather than the prior. No: p(θ_1) is the prior. It only equals the stationary distribution π if you specifically set that to be true. Some of you referred to p(θ_1 | X_1) as the likelihood that observation X_1 was emitted by state θ_1, or defined the forward probability as a posterior p(θ_1 | X_1). You're reversing the conditional probability; watch out! One of you defined the forward probability as p(θ_1, θ_2, ..., θ_t). Forward-backward is easy to remember (and understand) if you just bear in mind what the goal is: p(θ_t | X_1..X_n): just one hidden variable, and all the observables.
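Written out explicitly using the definitions above, the t=1 forward probability is f_{1,i} = p(θ_1 = s_i, X_1) = p(θ_1 = s_i) p(X_1 | θ_1 = s_i): the prior times the emission likelihood, not the stationary distribution and not the posterior p(θ_1 | X_1).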

Forward-Backward Probabilities State the probabilistic meaning, for an HMM consisting of n hidden variables, of the backward probability of the START state.

Backward Probability Lessons Two things confused you. First, the START state: think of this as before time t=1, i.e. p(θ_0 = START) = 1. It lets us handle priors just like a regular transition, i.e. τ_{START,i} = p(θ_1 = s_i | θ_0 = START) = p(θ_1 = s_i). Some of you gave p(X_{n+1}..X_n | θ_n) = 1 (an empty set of remaining observations) as your answer; that sounds like the end rather than START. Second, the backward probability itself: some of you seem to associate the backward probability with θ_n, but the question specifically asked about the backward probability of START. And as before, some of you left the observables out of the backward probability. Remember the goal!
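Written out using the transition notation above, the backward probability of START is b_{0,START} = p(X_1..X_n | θ_0 = START) = Σ_i τ_{START,i} p(X_1 | θ_1 = s_i) b_{1,i}: all n observables conditioned on the START state, which, since START is certain, is just the total probability of the observed sequence, p(X_1..X_n).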

Backwards to START? One of you wondered how it could even be possible to go backward to the START state, because I said the HMM never goes back to the START state. Does b_{0,START} > 0 mean there are directed edges that point to the START state? No. The word backward is confusing you here. The backward probability does not mean that there are directed edges pointing backward (e.g. to START), but rather that p(X_{t+1}..X_n | θ_t) must be computed in reverse order (last edge first) in order to be computationally efficient. (You could compute it in forward order, but the computational complexity would be horrible. Try it yourself to see why.) Think of this as analogous to the Viterbi backtracking stage: i.e. we are not following edges that point backwards, but rather we are traversing in reverse order edges that point forwards.
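To see the efficiency point concretely (standard recursion stated here for illustration, not quoted from the slides): with K states, computing p(X_{t+1}..X_n | θ_t) by summing directly over all possible state paths touches on the order of K^(n-t) terms, whereas the reverse-order recursion b_{t,i} = Σ_j τ_{i,j} p(X_{t+1} | θ_{t+1} = s_j) b_{t+1,j} reuses each b_{t+1,j} and needs only about (n-t)·K² operations in total.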

Good Nomenclature can save your @#$! How are you going to keep the concepts clear in your mind, if you don't have clear, consistent ways of naming them? To specify a variable: time index t, e.g. θ_t. To specify a state: state index i, e.g. s_i. To specify both: e.g. f_{t,i} = p(θ_t = s_i, X_1..X_t). Emission (likelihood): p(X_t | θ_t). Transition: p(θ_t = s_j | θ_{t-1} = s_i) = τ_{i,j}. Posterior: p(θ_t = s_j | X_1..X_n).

Sequence Alignment Matrix Pairwise alignment of two sequences X_1..X_m and Y_1..Y_n. Index letter positions in the two sequences with t, u, e.g. X_t, Y_u. Hypothesis: these two sequences are related to each other by some evolutionary process, e.g. mutation from a common ancestor A: A → X and A → Y. We want to be able to assess this hypothesis and gain insights from it, e.g. calculate posterior odds vs. the null hypothesis that they are unrelated, and predict which parts of the two sequences are the same, i.e. aligned, and likely to perform the same function. This is easy when the two sequences are very similar, and much harder when the similarity is weak.

Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations is represented on a sequence alignment matrix.

Alignment Paths Using the three edit operator moves (horizontal, vertical, and diagonal), write the list of all possible paths representing global alignments of X_1 X_2 to Y_1. (For simplicity, you can write the moves as -, |, and /.) If any of these paths correspond to the same alignment (i.e. the same letter(s) of X_1..X_m, Y_1..Y_n are aligned to each other), group those paths together as one group.

Viterbi Alignment Given the two sequences X = AGC... and Y = ACC..., find the optimal global alignment of X_1 X_2 to Y_1 by filling out the corresponding Viterbi alignment matrix. Assume the following scores: exact match: +5; substitution: -2; gap: -3.
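As a reference point for the matrix-filling exercise, here is a minimal global-alignment (Needleman-Wunsch style) scoring sketch in Python using those scores; the function name and structure are illustrative, not the course's code:

def global_align_score(x, y, match=5, mismatch=-2, gap=-3):
    # S[t][u] = best score for aligning the first t letters of x to the first u letters of y
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for t in range(1, m + 1):
        S[t][0] = t * gap                      # vertical moves: letters of x against gaps
    for u in range(1, n + 1):
        S[0][u] = u * gap                      # horizontal moves: letters of y against gaps
    for t in range(1, m + 1):
        for u in range(1, n + 1):
            diag = S[t - 1][u - 1] + (match if x[t - 1] == y[u - 1] else mismatch)
            S[t][u] = max(diag,                # diagonal move: match or substitution
                          S[t - 1][u] + gap,   # vertical move: gap in y
                          S[t][u - 1] + gap)   # horizontal move: gap in x
    return S[m][n]

print(global_align_score("AG", "A"))   # 2: align A with A (+5), gap the G (-3)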

Alignment Scoring Say you are given the likelihood p(X_t, Y_u | aligned) derived from true alignments of sequences that are known to be evolutionarily related to each other. Based on the hypothesis testing principles we have learned, propose a scoring function for assessing whether X_t, Y_u should be aligned to each other or not. How does your scoring function indicate they should be aligned vs. should not be aligned?
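For reference, one standard construction consistent with this likelihood-ratio framing (offered as an illustration, not as the expected exam answer) is a log-odds score: s(X_t, Y_u) = log [ p(X_t, Y_u | aligned) / (p(X_t) p(Y_u)) ], where the denominator is the null-hypothesis likelihood that the two letters are unrelated, i.e. drawn independently from their background frequencies. A positive score then says the aligned hypothesis is more likely (the letters should be aligned), and a negative score says the null is more likely (they should not).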

Complex Gene Structure

Searching for regulatory sequences in introns You are studying introns in mammalian genes, looking for regulatory elements in orthologous introns. One important indicator of regulatory element function is selection pressure against mutation, i.e. finding regions of aligned introns from different mammals that are mutated much less than usual. This requires good intron alignments. You know that introns are poorly conserved and consequently hard to align accurately.

Intron Alignment Idea You have a database of matching exons in human and mouse (i.e. pairs of coding region segments in the human genome and mouse genome that align to each other optimally). A colleague suggests you use this matching exon information to find optimal alignments of matching introns as follows: take two consecutive human exons A, B (separated by an intron I) that are known to align to two consecutive mouse exons A', B' (separated by an intron I'). Now find the top-scoring alignment of the two introns I, I' subject to the constraint that you already know the correct alignment of the exons on either side of the intron.

Intron Alignment Assume that you know exactly the positions that are aligned at the end of exons A, A' and at the beginning of exons B, B'. What would be the easiest way to obtain the optimal intron alignment subject to the proposed constraint? State the proposed constraint in precise terms of the allowed paths on the alignment matrix.
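One natural reading of the constraint, offered here as an illustration rather than as the official answer: the alignment path is forced to pass through the matrix cell where the last aligned positions of A and A' meet and through the cell where the first aligned positions of B and B' meet, so only paths between those two anchor cells are allowed. Under that reading, the easiest way to get the constrained optimum is simply to run a standard global alignment (e.g. the hypothetical global_align_score sketch above) on just the two intron substrings I and I', since the fixed anchor cells reduce the problem to an ordinary global alignment of the letters between them.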