VL Bioinformatik für Nebenfächler SS2018 Woche 9

Similar documents
VL Bioinformatik für Nebenfächler SS2018 Woche 8

Algorithms for Bioinformatics

Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25,

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Finding Regulatory Motifs in DNA Sequences

Probabilistic models of biological sequence motifs

Lecture 24: Randomized Algorithms

Lecture 23: Randomized Algorithms

Greedy Algorithms. CS 498 SS Saurabh Sinha

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Algorithms for Bioinformatics

Partial restriction digest

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

9/8/09 Comp /Comp Fall

Algorithms for Bioinformatics

Quantitative Bioinformatics

How can one gene have such drastic effects?

Finding Regulatory Motifs in DNA Sequences

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

With Question/Answer Animations. Chapter 7

Jianlin Cheng, PhD. Department of Computer Science University of Missouri, Columbia. Fall, 2014

Discrete Structures for Computer Science

Sequence Alignment. Johannes Starlinger

Markov Models & DNA Sequence Evolution

Hidden Markov Models. Three classic HMM problems

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Hidden Markov Models

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5,

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Dynamic Programming Lecture #4

Introduction to Probability and Sample Spaces

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Stat 516, Homework 1

Stephen Scott.

Inferring Models of cis-regulatory Modules using Information Theory

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Probably About Probability p <.05. Probability. What Is Probability?

Introduction to Hidden Markov Models (HMMs)

BMI/CS 576 Fall 2016 Final Exam

Bioinformatics 2 - Lecture 4

Simulation of Gene Regulatory Networks

MCMC: Markov Chain Monte Carlo

CSE 312, 2017 Winter, W.L.Ruzzo. 5. independence [ ]

Motivating the need for optimal sequence alignments...

Inferring Models of cis-regulatory Modules using Information Theory

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

Introduction to Artificial Intelligence (AI)

Grundlagen der Bioinformatik, SS 08, D. Huson, June 16, S. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence

Hidden Markov Models

De novo identification of motifs in one species. Modified from Serafim Batzoglou s lecture notes

1 Probabilities. 1.1 Basics 1 PROBABILITIES

CSCE 471/871 Lecture 3: Markov Chains and

Name: SBI 4U. Gene Expression Quiz. Overall Expectation:

Matrix-based pattern discovery algorithms

BIOINFORMATICS. Neighbourhood Thresholding for Projection-Based Motif Discovery. James King, Warren Cheung and Holger H. Hoos

MAT Mathematics in Today's World

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns.

In a radioactive source containing a very large number of radioactive nuclei, it is not

The Computational Problem. We are given a sequence of DNA and we wish to know which subsequence or concatenation of subsequences constitutes a gene.

Chapter 7: Regulatory Networks

Hidden Markov Models

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Stephen Scott.

Grundlagen der Bioinformatik, SS 09, D. Huson, June 16, S. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

1 Recap: Interactive Proofs

Exhaustive search. CS 466 Saurabh Sinha

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Gibbs Sampling Methods for Multiple Sequence Alignment

Biology 644: Bioinformatics

Probability and Inference. POLI 205 Doing Research in Politics. Populations and Samples. Probability. Fall 2015

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

EM-algorithm for motif discovery

Correcting Localized Deletions Using Guess & Check Codes

Probability: Terminology and Examples Class 2, Jeremy Orloff and Jonathan Bloom

1. In most cases, genes code for and it is that

HMMs and biological sequence analysis

Introduction to spectral alignment

CS 343: Artificial Intelligence

Hidden Markov Models. music recognition. deal with variations in - pitch - timing - timbre 2

QB LECTURE #4: Motif Finding

Hidden Markov Models 1

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019

Probability deals with modeling of random phenomena (phenomena or experiments whose outcomes may vary)

Lecture #5. Dependencies along the genome

4. Conditional Probability P( ) CSE 312 Autumn 2012 W.L. Ruzzo

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Example questions. Z:\summer_10_teaching\bioinfo\Beispiel_frage_bioinformatik.doc [1 / 5]

Der Satz von Immerman-Szelepcsényi

Skriptsprachen. Numpy und Scipy. Kai Dührkop. Lehrstuhl fuer Bioinformatik Friedrich-Schiller-Universitaet Jena

1 Probabilities. 1.1 Basics 1 PROBABILITIES

Alignment. Peak Detection

Review. A Bernoulli Trial is a very simple experiment:

Statistical testing. Samantha Kleinberg. October 20, 2009

Transcription:

VL Bioinformatik für Nebenfächler SS2018 Woche 9 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, FU Berlin Based on slides by P. Compeau and P. Pevzner, authors of

Teil 1 (Woche 1-3) : Einführung in Python & Algorithmen Teil 2 (Woche 4-6) : Finden von unbekannten Mustern (ohne Fehler) in großen Texten am Beispiel der oric Region Teil 3 (Woche 7 - ) : Finden von unbekannten Mustern (mit Fehlern) in großen Texten am Beispiel von Transkriptions-Faktor Bindungs-Stellen

Tim Conrad, VL Einführung in die Bioinformatik, SS2018 Motif finding using randomized algorithms

4

Zusammenfassung aktueller Block (grob!) Gene haben gewisse Merkmale auf Sequenzebene Änderungen in der Zusammensetzung der Sequenz Bestimmte Motive, die z.b. Start oder Ende angeben Problem: kein Anhaltspunkt, wie diese Motive aussehen könnten Mögliches Vorgehen Randomisierte Algorithmen

Basic Gene Structures Eukaryotic genes Exons,introns, translation starts and stops, splice (donor/acceptor) junctions, Tim Conrad, VL Einführung in die Bioinformatik, SS2018 6

How Does CCA1 Know Where to Bind? CCA1 CCA1 CCA1 CCA1 Gene1 Gene2 Gene3 Gene4 There must be some hidden messages in these regions that tells CCA1 WHERE to bind. Tim Conrad, VL Einführung in die Bioinformatik, SS2018

Transcription Factor Binding Sites cagtataaagtctactgatgcaacctgactcatgacgaggaa Gene1 agtcgactgacttaacaaatctcggatcgattcgtccgagga cgtcagctctgtcgggattcgccccgtattcaaaaaagctac accgtctccacaaaacctgctcgtgccactgatgcaacctga Gene2 Gene3 Gene4 The hidden messages (motif AAAAAATCT): transcription factor binding sites of CCA1 Tim Conrad, VL Einführung in die Bioinformatik, SS2018

From Motif to Consensus String Motifs T C G G G G G T T T T T C C G G T G A C T T A C A C G G G G A T T T T C T T G G G G A C T T T T A A G G G G A C T T C C T T G G G G A C T T C C T C G G G G A T T C A T T C G G G G A T T C C T T A G G G G A A C T A C T C G G G T A T A A C C Visualizing using: motif logo Tim Conrad, VL Einführung in die Bioinformatik, SS2018

Median String Problem Median String Problem. Finding a median string. Input: A set of sequences Dna and an integer k. Output: A k-mer minimizing distance d(k-mer, Dna) among all k-mers. MedianString(Dna, k) best-k-mer AAA AA for each k-mer from AAA AA to TTT TT if d(k-mer, Dna) < d (best-k-mer, Dna) best-k-mer k-mer return(best-k-mer) Runtime: 4 k n t k (for Dna with t sequences of length n). d(k-mer, Dna) requires n-k+1 comparisons of strings; each string comparison has k character comparisons; since k<<n, it s n k steps for calculating each d(k-mer, sequence); since we have t sequences in Dna, it s totally n t k steps per k-mer Motif Finding Problem versus Median String Problem Runtime: n t k t Runtime: 4 k n t k

11

Outline From Implanted Patterns to Regulatory Motifs Implanting Patterns into Strings Recall: Implanted Motif Problem Motif Finding Problem Median String Problem Greedy Motif Search Pseudocounts How Rolling Dice Helps Us Find Regulatory Motifs Randomized Motif Search Application: How do Bacteria Hibernate? Gibbs Sampling 12

So far: (1) Given some DNA strings (2) Extract Motifs (best similar k-mers) And now? How to use this information? How to find a k-mer given the motifs in a new string?

From Motifs to Profile Motifs A:.22.22 0 0 0 0.99.11.11.11.33 0 C:.11.66 0 0 0 0 0.44.11.22.44.66 Profile(Motifs) Count(Motifs) G: 0 0 10 1 10 1.99.99.11 0 0 0 0 0 T:.77.22 0 0.11.11 0.55.88.77.33.44 frequency count of nucleotide of i in i in column j j T C G G G G A C T T C C Each column of a profile represents a biased 4-sided dice with A, C, G, and T on its sides. Thus, a profile corresponds to k dice. What is the probability that k rolls of such dice generate a given string? 14

Scoring k-mers with a Profile Given the following Profile: A 1/2 7/8 3/8 0 1/8 0 C 1/8 0 1/2 5/8 3/8 0 T 1/8 1/8 0 0 1/4 7/8 G 1/4 0 1/8 3/8 1/4 1/8 The probability of the consensus string: Pr(AAACCT Profile) =??? 15

Scoring k-mers with a Profile Given the following Profile: A 1/2 7/8 3/8 0 1/8 0 C 1/8 0 1/2 5/8 3/8 0 T 1/8 1/8 0 0 1/4 7/8 G 1/4 0 1/8 3/8 1/4 1/8 The probability of the consensus string: Pr(AAACCT Profile) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = 0.033646 16

Scoring k-mers with a Profile Given the following Profile: A 1/2 7/8 3/8 0 1/8 0 C 1/8 0 1/2 5/8 3/8 0 T 1/8 1/8 0 0 1/4 7/8 G 1/4 0 1/8 3/8 1/4 1/8 The probability of another string: Pr(ATACAG Profile) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = 0.001602 The closer k-mer is to the consensus string, the larger Pr(k-mer Profile) is. 17

18

What is the Profile-most probable 6-mer in CTATAAACCTTACAT? A 1/2 7/8 3/8 0 1/8 0 C 1/8 0 1/2 5/8 3/8 0 T 1/8 1/8 0 0 1/4 7/8 G 1/4 0 1/8 3/8 1/4 1/8 Profile-most probable k-mer in a sequence: the k-mer with the highest Pr(k-mer Profile) among all k-mers in this sequence. 6-mer Prob(6-mer Profile) CTATAAACCTTACAT 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0 CTATAAACCTTACAT 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0 CTATAAACCTTACAT 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0 CTATAAACCTTACAT 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0 CTATAAACCTTACAT 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8.0336 CTATAAACCTTACAT 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8.0299 CTATAAACCTTACAT 1/2 x 0 x 1/2 x 0 1/4 x 0 0 CTATAAACCTTACAT 1/8 x 0 x 0 x 0 x 0 x 1/8 x 0 0 CTATAAACCTTACAT 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0 CTATAAACCTTACAT 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8.0004 19

How can we use this idea for a from scratch Motif search?

21

22

23

Outline From Implanted Patterns to Regulatory Motifs Implanting Patterns into Strings Recall: Implanted Motif Problem Motif Finding Problem Median String Problem Greedy Motif Search Pseudocounts How Rolling Dice Helps Us Find Regulatory Motifs Randomized Motif Search Application: How do Bacteria Hibernate? Gibbs Sampling 24

Laplace s Rule of Succession If we repeat an experiment that can result in a success or failure n times and get s successes, what is the probability that the next repetition will be a success? 25

Pseudocounts If X 1,..., X n+1 are conditionally independent random boolean variables (failure 0 or success 1) then: Pr(X n+1 =1 X 1 + +X n =s )=s/n Pr(X n+1 =1 X 1 + +X n =s )=(s+1)/(n+2) Since we have the prior knowledge that both success and failure are possible, we essentially made n+2 rather than n observations since we (implicitly) observed one success and one failure before we even started the experiments. Thus, we have made n+2 observations (known as pseudocounts) with s+1 successes. 26

Laplace s Rule of Succession Laplace calculated the probability that the sun will not rise tomorrow, given that it has risen every day for the past 5000 years (1 in 1826251) In small datasets, there is always a chance that a possible event does not occur (e.g., zeroes in Count matrix). Randomized algorithms do not like zeroes and introduce pseudocounts that inflate the probabilities of rare events and eliminate empirical zero-frequencies. 27

28

Replace by: 29

Running GREEDYMOTIFSEARCH with pseudocounts to solve the subtle motif problem returns the consensus string AAAAAtAgaGGGGtt with score 41. Running GREEDYMOTIFSEARCH without pseudocounts to solve the subtle motif problem returns the consensus string gttaaatagagatgtg with score 58. 30

Outline From Implanted Patterns to Regulatory Motifs Implanting Patterns into Strings Recall: Implanted Motif Problem Motif Finding Problem Median String Problem Greedy Motif Search Pseudocounts How Rolling Dice Helps Us Find Regulatory Motifs Randomized Motif Search Application: How do Bacteria Hibernate? Gibbs Sampling 31

From Motifs to Profile to Motifs to Profile to Motifs Given Motifs, we can construct Profile(Motifs) Given an arbitrary Profile and a set of sequences Dna, we can construct Motifs(Profile, Dna) as a set of Profile-most probable k-mers in each sequence from Dna. Iterate! Motifs(Profile(Motifs(Profile(Motifs),Dna)),Dna)),Dna) 32

RandomizedMotifSearch RandomizedMotifSearch(Dna, k, t) randomly select k-mers Motifs = (Motif 1,,Motif t ) in each string from DNA bestmotifs Motifs while forever Profile Profile(Motifs) Motifs Motifs(Profile, Dna) if Score(Motifs) < Score(bestMotifs) else bestmotifs Motifs return(bestmotifs) How in the World Can This Algorithm Find Anything? When we form k-mers randomly, it results in a uniform expected Profile with every entry 1/4. Such a uniform profile is useless because it does not provide any clues about the implanted motif. 33

RandomizedMotifSearch in Action Dna with implanted (4,1)-motif ACGT Randomly select Motifs (in bold) Motifs Profile(Motifs) ttaccttaac gatgtctgtc ccggcgttag cactaacgag cgtcagaggt ttaccttaac gatgtctgtc ccggcgttag cactaacgag cgtcagaggt t a a c G T c t c c g G a c t a A G G T A: 2/5 1/5 1/5 1/5 C: 1/5 2/5 1/5 1/5 G: 1/5 1/5 2/5 1/5 T: 1/5 1/5 1/5 2/5.0016/ttAC.0016/tACC.0128/ACCT.0064/CCTt.0016/Ctta.0016/Ttaa.0016/gATG.0128/ATGT.0016/TGTc.0032/GTct.0032/Tctg.0032/ctgt.0064/ccgG.0036/cgGC.0016/gGCG.0128/GCGT.0032/CGTt.0016/Gtta.0032/cact.0064/acta.0016/ctaA.0016/taAC.0032/aACG.0128/ACGA.0016/taac.0016/tgtc.0016/Ttag.0016/CGAg.0016/cgtc.0016/gtca.0016/tcag.0032/cagA.0032/agAG.0032/gAGG ttaccttaac gatgtctgtc ccggcgttag cactaacgag cgtcagaggt Motifs (Profile (Motifs), Dna).0128/AGGT 34

35

How in the World Can This Algorithm Find Anything? When we select k-mers randomly, it results in a uniform expected Profile where every entry is 1/4. Such a uniform profile is useless because it does not provide any clues about the implanted motif. But strings in Dna are not truly random since they include implanted motifs! These implanted motifs result in a biased expected Profile. Where does this statistical bias point to? 36

RandomizedMotifSearch in Action DNA with implanted (4,1)-motif ACGT Randomly select Motifs (in bold) Motifs Profile(Motifs) ttaccttaac gatgtctgtc ccggcgttag cactaacgag cgtcagaggt ttaccttaac gatgtctgtc ccggcgttag cactaacgag cgtcagaggt t a a c G T c t c c g G a c t a A G G T A: 2/5 1/5 1/5 1/5 C: 1/5 2/5 1/5 1/5 G: 1/5 1/5 2/5 1/5 T: 1/5 1/5 1/5 2/5 The statistical bias points in the direction of the implanted (4,1)-motif ACGT because one of the randomly selected 4- mers happened to be the implanted pattern AGGT simply by chance. 37

Assignment 4 1. Create sub-directory assignment 4 within your GIThub account 2. Solve Rosalind exercises BA2F (http://rosalind.info/problems/ba2f/) 3. Upload code of all of the solutions and a screenshot of each program run using the Rosalind examples (into the assignment4 subdirectory). Use the Sample Dataset and the Extra Dataset. 4. Send an email to your tutor, containing the url to your GIThub repository sub-directory