Exhaustive search. CS 466 Saurabh Sinha

Agenda: Two different problems: restriction mapping and motif finding. Common theme: exhaustive search of the solution space. Reading: Chapter 4.

Restriction Mapping

Restriction enzymes: A restriction enzyme is a protein that cuts DNA at very specific sites (occurrences of a particular word). Foreign (viral) DNA entering a bacterium is usually unable to do anything. Reason: restriction enzymes shred the DNA. They do not cleave methylated DNA; host DNA is suitably methylated, hence protected. 1978 Nobel Prize in Physiology or Medicine: discovery of restriction enzymes.

Molecular Scissors (figure from Molecular Cell Biology, 4th edition)

Recognition Sites of Restriction Enzymes (figure from Molecular Cell Biology, 4th edition)

Restriction Maps: A restriction map shows the positions of restriction sites in a DNA sequence. If the DNA sequence is known, then constructing the restriction map is a trivial exercise. In the early days of molecular biology, DNA sequences were often unknown. Biologists had to solve the problem of constructing restriction maps without knowing DNA sequences. (Figure: What is this? A plasmid.)

Measuring Length of Restriction Fragments: Restriction enzymes break DNA into restriction fragments. Gel electrophoresis is a process for separating DNA by size and measuring the sizes of restriction fragments. It can separate DNA fragments that differ in length by only 1 nucleotide, for fragments up to 500 nucleotides long.

Partial Restriction Digest: The sample of DNA is exposed to the restriction enzyme for only a limited amount of time, to prevent it from being cut at all restriction sites. This experiment generates the set of all possible restriction fragments between every two (not necessarily consecutive) cuts. This set of fragment sizes is used to determine the positions of the restriction sites in the DNA sequence.

Partial Restriction Digest Multiset of fragment lengths: {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}

Partial Digest Problem (PDP): Let $X = \{x_1, x_2, x_3, \ldots, x_n\}$. We are given the pairwise distance between each pair $\{x_i, x_j\}$, that is, $\Delta X = \{x_j - x_i : 1 \le i < j \le n\}$. Reconstruct $X$. Does a unique solution exist?

Partial Digest Problem (PDP): Let $X = \{x_1 = 0, x_2, x_3, \ldots, x_n\}$. We are given the pairwise distance between each pair $\{x_i, x_j\}$, that is, $\Delta X = \{x_j - x_i : 1 \le i < j \le n\}$. Reconstruct $X$.
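The forward direction of the problem, going from a set of points to its multiset of pairwise distances, can be sketched in a few lines. The function name `delta` and the point set `[0, 5, 14, 19, 22]` are assumptions made for illustration; that point set was picked because it reproduces the multiset of fragment lengths shown on the earlier slide.

```python
from itertools import combinations

def delta(points):
    """Return the sorted multiset of pairwise distances between points."""
    return sorted(abs(b - a) for a, b in combinations(points, 2))

# Hypothetical restriction-site positions (an assumption, not from the slides).
print(delta([0, 5, 14, 19, 22]))

# The mirror image of a point set has the same pairwise distances.
print(delta([0, 3, 8, 17, 22]))
```

Both calls print the same multiset, which suggests why fixing x_1 = 0 alone does not make the reconstruction unique: a solution and its mirror image are indistinguishable from the distances.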

Brute force algorithms: Also called enumerative algorithms. They are used in some problems in bioinformatics, provided the program runs in reasonable time. If the quality of a solution is measured by a specific objective function, enumerative search is guaranteed to find the optimal solution.

Brute Force PDP: Given L = the set of all pairwise distances, we need to find X such that $\Delta X = L$. We know that $x_1 = 0$ and $x_n = M$ (where M is the largest number in L). $x_2, x_3, \ldots, x_{n-1}$ must all be integers between 1 and M-1. Try all possible solutions: approximately $O(M^{n-2})$.
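A minimal sketch of this brute force search, assuming L is given as a list of integers. The function names `delta` and `brute_force_pdp` are labels chosen here for illustration. The number of points n is recovered from the size of L, since n points produce n(n-1)/2 distances.

```python
from itertools import combinations
from math import isqrt

def delta(points):
    """Sorted multiset of pairwise distances between points."""
    return sorted(abs(b - a) for a, b in combinations(points, 2))

def brute_force_pdp(L):
    """Try every choice of n-2 interior integers between 1 and M-1."""
    L = sorted(L)
    M = L[-1]
    # Solve n(n-1)/2 = len(L) for n.
    n = (1 + isqrt(1 + 8 * len(L))) // 2
    if n * (n - 1) // 2 != len(L):
        return []  # L cannot be a full set of pairwise distances
    solutions = []
    for inner in combinations(range(1, M), n - 2):
        X = [0, *inner, M]
        if delta(X) == L:
            solutions.append(X)
    return solutions

print(brute_force_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

The loop visits every (n-2)-element subset of {1, ..., M-1}, which is the source of the roughly M^(n-2) cost.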

Brute Force PDP 2: Do we need to try every integer between 0 and M? Since $x_1 = 0$, for every $x_i$ in X, the number $(x_i - x_1) = x_i$ must be in $\Delta X$. We need to find X such that $\Delta X = L$. Therefore, only consider $x_i$ that are in L. Therefore, there are only $|L|$ possibilities from which to choose n-2 numbers. Try all possible solutions: approximately $O(|L|^{n-2})$, i.e., $O(n^{2n-4})$.
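The restricted search can be sketched the same way, drawing interior points only from the distinct values in L. The name `another_brute_force_pdp` and the `tried` counter are assumptions added here so the reduction in work is visible.

```python
from itertools import combinations
from math import isqrt

def delta(points):
    """Sorted multiset of pairwise distances between points."""
    return sorted(abs(b - a) for a, b in combinations(points, 2))

def another_brute_force_pdp(L):
    """Choose the n-2 interior points only from values that occur in L."""
    L = sorted(L)
    M = L[-1]
    n = (1 + isqrt(1 + 8 * len(L))) // 2
    candidates = sorted(set(L) - {M})  # every x_i must itself be a distance
    solutions, tried = [], 0
    for inner in combinations(candidates, n - 2):
        tried += 1
        X = [0, *inner, M]
        if delta(X) == L:
            solutions.append(X)
    return solutions, tried

solutions, tried = another_brute_force_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10])
print(solutions, tried)
```

On this small input the saving is modest, but the candidate pool no longer grows with M, only with the number of distinct distances.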

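A practical solution: key idea. X lies on the line from 0 to M. Pick the largest number (other than M) from L. Let this be y. This distance must be measured from one of the two endpoints.

A practical solution: key idea. Case i: there is a point at position y (at distance y from 0).

A practical solution: key idea. Case ii: there is a point at position M - y (at distance y from M).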

Notation: $D(y, X) = \{|y - x_1|, |y - x_2|, \ldots, |y - x_n|\}$ for $X = \{x_1, x_2, \ldots, x_n\}$.
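As a small illustration of this notation, D(y, X) and the "is it contained in L" test can be written as follows. The helper name `is_submultiset` is an assumption introduced here; the test has to respect multiplicities because L is a multiset.

```python
from collections import Counter

def D(y, X):
    """Distances from a candidate point y to every point already in X."""
    return [abs(y - x) for x in X]

def is_submultiset(distances, L):
    """True if every distance occurs in L at least as often as in distances."""
    need, have = Counter(distances), Counter(L)
    return all(have[d] >= count for d, count in need.items())

print(D(2, [0, 10]))                                    # distances to 0 and 10
print(is_submultiset(D(6, [0, 2, 7, 10]), [2, 3, 4, 6]))
print(is_submultiset(D(4, [0, 2, 7, 10]), [2, 3, 4, 6]))
```

A plain set comparison would not be enough here: D(6, X) needs the distance 4 twice, and a set would hide that.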

An Example: L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 }, X = { 0 }

An Example: L = { 2, 2, 3, 3, 4, 5, 6, 7, 8, 10 }, X = { 0 }. Remove 10 from L and insert it into X. We know this must be the length of the DNA sequence because it is the largest fragment.

An Example: L = { 2, 2, 3, 3, 4, 5, 6, 7, 8 }, X = { 0, 10 }

An Example: L = { 2, 2, 3, 3, 4, 5, 6, 7, 8 }, X = { 0, 10 }. Take 8 from L and make y = 2 or y = 8. Let us go with y = 2.

An Example: L = { 2, 2, 3, 3, 4, 5, 6, 7, 8 }, X = { 0, 10 }. We find that the distances from y = 2 to the other elements in X are D(y, X) = {2, 8}, so we remove {2, 8} from L and add 2 to X.

An Example: L = { 2, 3, 3, 4, 5, 6, 7 }, X = { 0, 2, 10 }

An Example: L = { 2, 3, 3, 4, 5, 6, 7 }, X = { 0, 2, 10 }. Take 7 from L and make y = 7 or y = 10 - 7 = 3. We will explore y = 7 first, so D(y, X) = {7, 5, 3}.

An Example: L = { 2, 3, 3, 4, 5, 6, 7 }, X = { 0, 2, 10 }. For y = 7, D(y, X) = {7, 5, 3}. Therefore we remove {7, 5, 3} from L and add 7 to X.

An Example: L = { 2, 3, 4, 6 }, X = { 0, 2, 7, 10 }

An Example: L = { 2, 3, 4, 6 }, X = { 0, 2, 7, 10 }. Take 6 from L. We can have y = 4 or y = 6. Let's make y = 6. Unfortunately D(y, X) = {6, 4, 1, 4}, which is not a subset of L (1 is not in L, and 4 occurs only once). Therefore we won't explore this branch.

An Example: L = { 2, 3, 4, 6 }, X = { 0, 2, 7, 10 }. This time make y = 4. D(y, X) = {4, 2, 3, 6}, which is a subset of L, so we will explore this branch. We remove {4, 2, 3, 6} from L and add 4 to X.

An Example: L = { }, X = { 0, 2, 4, 7, 10 }

An Example: L = { }, X = { 0, 2, 4, 7, 10 }. L is now empty, so we have a solution, which is X.

An Example: L = { 2, 3, 4, 6 }, X = { 0, 2, 7, 10 }. To find other solutions, we backtrack.

An Example: L = { 2, 3, 3, 4, 5, 6, 7 }, X = { 0, 2, 10 }. We backtrack further.

An Example: L = { 2, 3, 3, 4, 5, 6, 7 }, X = { 0, 2, 10 }. This time we will explore y = 3. D(y, X) = {3, 1, 7}, which is not a subset of L (1 is not in L), so we won't explore this branch.

Algorithm: Given L, build X incrementally, starting from X = {0, M}. At each step, extract y = the maximum element in L. Consider the two possibilities: y is in X, or M - y is in X. Check if either possibility is consistent with L, and if so, include it in X, remove the induced pairwise distances from L, and proceed. Backtrack when neither possibility is consistent, or to look for other solutions. Pseudocode of the algorithm is in Section 4.3. If you are new to algorithms, please read this.
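The steps above can be sketched as a short recursive search. This is an illustrative reading of the slide, not the pseudocode from Section 4.3; the function name `partial_digest` and the use of `collections.Counter` as the multiset are assumptions made here.

```python
from collections import Counter

def partial_digest(L):
    """Return every point set X (containing 0) whose pairwise distances equal L."""
    remaining = Counter(L)          # distances not yet explained by X
    M = max(remaining)
    remaining[M] -= 1               # M is explained by the pair (0, M)
    X = [0, M]
    solutions = []

    def search():
        if sum(remaining.values()) == 0:
            solutions.append(sorted(X))
            return
        y = max(d for d, count in remaining.items() if count > 0)
        # The largest remaining distance is measured from 0 or from M.
        for candidate in dict.fromkeys((y, M - y)):
            dists = Counter(abs(candidate - x) for x in X)
            if all(remaining[d] >= c for d, c in dists.items()):
                remaining.subtract(dists)   # remove induced distances
                X.append(candidate)
                search()
                X.pop()                     # backtrack
                remaining.update(dists)

    search()
    return solutions

print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

On the example multiset the search finds the solution from the worked example and its mirror image. Trying y before M - y is an arbitrary choice; either order visits the same branches.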

Time complexity: At each step, there are two possibilities to pursue. Checking each possibility takes O(n) time. $T(n) = 2T(n-1) + O(n)$, so $T(n) = O(n 2^n)$. What is n here? This is an exponential time algorithm. Actually, a polynomial time algorithm exists (Maurice Nivat and colleagues, 2002).
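Unrolling the recurrence shows where the exponential term comes from. The sketch below assumes the check costs exactly $cn$ for some constant $c$ and that $T(0)$ is a constant; both are simplifying assumptions, not statements from the slides.

```latex
\begin{align*}
T(n) &= 2\,T(n-1) + cn \\
     &= 4\,T(n-2) + 2c(n-1) + cn \\
     &= 2^{n}\,T(0) + c\sum_{k=0}^{n-1} 2^{k}(n-k) \\
     &= 2^{n}\,T(0) + c\left(2^{n+1} - n - 2\right)
\end{align*}
```

Under these assumptions the total is dominated by the $2^n$ term, so it stays within the $O(n 2^n)$ bound stated above.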

Second example of exhaustive search: Motif finding

My fruitfly has a bacterial infection: When attacked by bacteria, the fruitfly's immune system kicks in. Many genes that were lying dormant are now producing their proteins to fight the infection. (Some otherwise active genes may now become inactive.) Which genes are these?

Looking for differentially expressed genes: Measure the activity level of all genes in a normal fly and in an infected fly. Find genes whose activity levels are significantly different between the two conditions. How do we measure gene activity level?

An Introduction to Bioinformatics Algorithms (www.bioalgorithms.info). DNA Arrays: Technical Foundations. An array works by exploiting the ability of a given mRNA molecule to hybridize to the DNA template. Using an array containing many DNA samples in a single experiment, the expression levels of hundreds or thousands of genes within a cell can be determined by measuring the amount of mRNA bound to each site on the array. With the aid of a computer, the amount of mRNA bound to the spots on the microarray is precisely measured, generating a profile of gene expression in the cell. Source: http://www.ncbi.nih.gov/about/primer/microarrays.html (May 11, 2004)

DNA Microarray: Millions of DNA strands build up on each location. Tagged probes become hybridized to the DNA chip's microarray. Source: http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx

An experiment on a microarray: In this schematic, GREEN represents Control DNA; RED represents Sample DNA; YELLOW represents a combination of Control and Sample DNA; BLACK represents areas where neither the Control nor Sample DNA hybridized. Each color in an array represents either healthy (control) or diseased (sample) tissue. The location and intensity of a color tell us whether the gene is present in the control and/or sample DNA. Source: http://www.ncbi.nih.gov/about/primer/microarrays.html

Differentially expressed genes: Find a set of genes differentially expressed in the infected fly. These are perhaps the ones orchestrating the immune response. Look at the promoters of these genes. Find that the substring TCGGGGATTTCC occurs often (modulo minor spelling mistakes) in these promoters.

Regulatory motif: TCGGGGATTTCC is the canonical binding site recognized by the NFkB transcription factor. Infer that NFkB is turning on the immunity! What if we did not know that NFkB binds TCGGGGATTTCC? Could we have just gazed at the promoter sequences and discovered this binding site?

Finding motifs ab initio: Enumerate all possible strings of some fixed (small) length. For each such string ("motif"), count its occurrences in the promoters. Report the most frequently occurring motif. Does the true motif pop out?
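A minimal sketch of this enumeration, counting exact (possibly overlapping) occurrences only; the "minor spelling mistakes" mentioned earlier are not modeled. The function names and the three toy promoter strings are hypothetical, made up for illustration.

```python
from itertools import product

def count_occurrences(motif, sequence):
    """Number of (possibly overlapping) exact occurrences of motif in sequence."""
    k = len(motif)
    return sum(sequence[i:i + k] == motif for i in range(len(sequence) - k + 1))

def most_frequent_motif(promoters, k):
    """Enumerate all 4**k DNA strings of length k and return the most frequent."""
    best, best_count = None, -1
    for letters in product("ACGT", repeat=k):
        motif = "".join(letters)
        total = sum(count_occurrences(motif, p) for p in promoters)
        if total > best_count:          # ties keep the alphabetically first motif
            best, best_count = motif, total
    return best, best_count

# Toy promoter sequences (hypothetical, not real data).
promoters = ["TTACGTAA", "GGACGTCC", "ACGTACGT"]
print(most_frequent_motif(promoters, 4))
```

The outer loop runs over 4^k candidate strings, which is why the motif length has to stay small for this exhaustive approach to be practical.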

Today's summary: Restriction enzymes and restriction site maps. Partial Digest Problem: an enumerative algorithm. DNA microarrays and differentially expressed genes. Prelude to the motif finding problem.