Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Similar documents
GCD3033:Cell Biology. Transcription

1. In most cases, genes code for and it is that

Biology. Biology. Slide 1 of 26. End Show. Copyright Pearson Prentice Hall

GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

BME 5742 Biosystems Modeling and Control

12-5 Gene Regulation

Regulation of Gene Expression

Translation and Operons

Name: SBI 4U. Gene Expression Quiz. Overall Expectation:

Bioinformatics Chapter 1. Introduction

Eukaryotic vs. Prokaryotic genes

Searching Sear ( Sub- (Sub )Strings Ulf Leser

UNIT 6 PART 3 *REGULATION USING OPERONS* Hillis Textbook, CH 11

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

RNA Processing: Eukaryotic mrnas

Controlling Gene Expression

From gene to protein. Premedical biology

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Chapter 17. From Gene to Protein. Biology Kevin Dees

Multiple Choice Review- Eukaryotic Gene Expression

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis

Quantitative Bioinformatics

Big Idea 3: Living systems store, retrieve, transmit and respond to information essential to life processes. Tuesday, December 27, 16

Regulation of Gene Expression

Gene Control Mechanisms at Transcription and Translation Levels

CSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators

CSEP 590A Summer Lecture 4 MLE, EM, RE, Expression

Section 7. Junaid Malek, M.D.

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Chapter 12. Genes: Expression and Regulation

Chapter 18: Control of Gene Expression

Translation Part 2 of Protein Synthesis

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

13.4 Gene Regulation and Expression

Hidden Markov Models

Regulation of Transcription in Eukaryotes

Old FINAL EXAM BIO409/509 NAME. Please number your answers and write them on the attached, lined paper.

From Gene to Protein

DNA Feature Sensors. B. Majoros

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Flow of Genetic Information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Data Mining in Bioinformatics HMM

GENE REGULATION AND PROBLEMS OF DEVELOPMENT

Algorithms in Bioinformatics

Name Period The Control of Gene Expression in Prokaryotes Notes

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Cellular Neuroanatomy I The Prototypical Neuron: Soma. Reading: BCP Chapter 2

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Eukaryotic Gene Expression: Basics and Benefits Prof. P N RANGARAJAN Department of Biochemistry Indian Institute of Science Bangalore

3.B.1 Gene Regulation. Gene regulation results in differential gene expression, leading to cell specialization.

Chapter 15 Active Reading Guide Regulation of Gene Expression

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Principles of Genetics

Computational Biology: Basics & Interesting Problems

UE Praktikum Bioinformatik

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Co-ordination occurs in multiple layers Intracellular regulation: self-regulation Intercellular regulation: coordinated cell signalling e.g.

Comparative Bioinformatics Midterm II Fall 2004

Honors Biology Reading Guide Chapter 11

Motivating the need for optimal sequence alignments...

Alignment. Peak Detection

Предсказание и анализ промотерных последовательностей. Татьяна Татаринова

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

O 3 O 4 O 5. q 3. q 4. Transition

AP Bio Module 16: Bacterial Genetics and Operons, Student Learning Guide

Molecular Biology of the Cell

Eukaryotic Gene Expression

Bio 119 Bacterial Genomics 6/26/10

Molecular Biology of the Cell

Control of Gene Expression in Prokaryotes

UNIT 5. Protein Synthesis 11/22/16

L3.1: Circuits: Introduction to Transcription Networks. Cellular Design Principles Prof. Jenna Rickus

Chapter 10, 11, 14: Gene Expression, Regulation, and Development Exam

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

The Gene The gene; Genes Genes Allele;

In Genomes, Two Types of Genes

Quiz answers. Allele. BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA)

Supplemental Materials

32 Gene regulation, continued Lecture Outline 11/21/05

2. What was the Avery-MacLeod-McCarty experiment and why was it significant? 3. What was the Hershey-Chase experiment and why was it significant?

Jianlin Cheng, PhD. Department of Computer Science University of Missouri, Columbia. Fall, 2014

How much non-coding DNA do eukaryotes require?

Computational Cell Biology Lecture 4

Introduction to Bioinformatics

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

CHAPTER 3. Cell Structure and Genetic Control. Chapter 3 Outline

Genomics and bioinformatics summary. Finding genes -- computer searches

Chapter 20. Initiation of transcription. Eukaryotic transcription initiation

Ch 10, 11 &14 Preview

PROTEIN SYNTHESIS INTRO

Computational Genomics and Molecular Biology, Fall

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

Peter Pristas. Gene regulation in eukaryotes

BCH 4054 Spring 2001 Chapter 33 Lecture Notes

ومن أحياها Translation 1. Translation 1. DONE BY :Maen Faoury

Transcription:

Motifs and Logos Six Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer Chapter 2 Genome Sequence Acquisition and Analysis Sami Khuri Department of Computer Science San José State University Chapter 2: Math Minute 2.2 Copyright 2006 A. Malcolm Campbell ISBN 0-8053-8219-4 Importance and Abundance of Motifs DNA motifs are nucleotide sequence patterns of functional significance. Examples: The TATA box is a motif that helps RNA polymerase find the transcription start site (TSS) in many eukaryotic genes. The CAT box is another highly conserved region used for the initiation of transcription. TSS Transcription Start Site CAP start exon internal exons stop exon non coding coding coding non coding 5 UTR Getting the CDS ATG Translation initiation ATG internal introns CDS Coding SEQUENCE stop 3 UTR AAAAAAA stop From DNA to Protein Ungapped sequence alignment of eleven E. coli sequences defining a start codon. www.clcbio.com 6.1

E.Coli Promoter Sequences Anatomy of an Intron logo logo logo Conserved Sequences in Introns Sequence Motifs The conserved nucleotides in the transcript are recognized by small nuclear ribonucleoprotein particles (snrnps), which are complexes of protein and small nuclear RNA. A functional splicing unit is composed of a team of snrnps called a spliceosome. Detecting Motifs A motif is a sequence pattern of functional significance. Example: The TATA box is a motif that helps the polymerase find the transcription start site. Creating Tables of Frequencies The probability of having an A in the first position is: 61/389 = 0.1568 The probability of a T in the second position is: 309/389 = 0.7943 Similarly for all 4 bases at all 15 positions. We can thus create a table of frequencies. 6.2

Creating Log-Odds Tables Instead of creating a table of frequencies, we create a table of log-odds. Suppose that the genome-wide average G and C content is 44%. Then the probability of an A is 0.56/2 = 0.28. The Log-Odds Tables log 2 (0.1568/0.28) = log 2 (0.56) = - 0.84. Note that the base of the logarithm here is 2. Similarly, log 2 (0.7943/0.28) = 1.5. log b Taking Log-Odds > 1 P(observed) P(expected) is = 1 < 1 > 0 P(observed) P(expected) is = 0 < 0 What is the Significance of Log-Odds If the nucleotide is more likely to occur at a given position than it is to occur overall, the ratio will be bigger than 1.0 and the log odds is positive. If the nucleotide is less likely to occur at a certain position than it is to occur overall, then the ratio will be smaller than 1.0 and the log odds is negative. Using Log-Odds Tables (I) Using Log-Odds Tables (II) Table MM2.2 was constructed as explained in the previous slides; in other words, by taking the log of the ratio of the observed frequency over the expected frequency. To see if a sequence of length 15 is a TATA box, we simply add the corresponding values from the PWM and see if we get a value above some threshhold. In the example above, we add the 15 highlighted numbers to get 6.78. 6.3

Designing Logos A logo is a visual representation of a set of aligned sequences that indicates the positional preferences as given by information theory. A logo gives a visual representation of the motif. The size of the character in the stack of characters is proportional to the character s frequency in that position. The total height of each column is proportional to its information content. Information theory quantifies the amount of information Entropy and Logos The entropy of a random variable is a measure of the uncertainty of the random variable. The entropy (uncertainty) in position j is defined as: H j = - f x,j log 2 (f x,j ) where f x,j is the frequency of character x in position j, the summation is over all the characters x, and the entropy units are bits of information. Logos with Proteins Recall: entropy in position j is defined as: H j = - f x,j log 2 (f x,j ) If only one residue is found at position j, all terms are zero and H j = 0. Note, by convention: (0)log 2 (0) = 0. In other words, there is no uncertainty at this position. The maximum value of H j occurs if all residues are present with equal frequency. In this case: H j = - (1/20)log 2 (1/20) = log 2 (20). [amino acids] Logos with Proteins: An Example The information present in the pattern at position j is denoted by I j and is given by: I j = log 2 (20) - H j = log 2 (20) + f x,j log 2 (f x,j ) In other words, the information content I j at position j is defined as the "opposite" of its uncertainty. Note that a position with a perfectly conserved residue will have the maximum amount of information. Logos with Proteins: An Example Recall: I j = log 2 (20) - H j = log 2 (20) + f x,j log 2 (f x,j ) The information content is a number between 0 and log 2 (20) bits and measures the conservation of a position in a profile. Since conserved positions in sequence families are considered to be functionally or structurally important, they should stand out when the profile is visualized. Recall: Logos with Proteins: An Example I j = log 2 (20) - H j = log 2 (20) + f x,j log 2 (f x,j ) At every position of the logo, the residues are represented by their one-character letter having a height proportional to their contribution which is equal to the product: (f x,j )(I j ). 6.4

Logos with Bases Define: I j = log 2 (4) - H j = 2 + f x,j log 2 (f x,j ) where f x.j is the frequency of character x at position j. 11 sites Consensus Sequence and PWM All current methods for representing DNA motifs involve either consensus sequences or probabilistic models (such as PWM) of the motif. Consensus sequences do not adequately represent the variability seen in promoters or transcription factor binding sites. Both consensus sequences and PWM models assume positional independence. Neither method can accommodate correlations between positions. Probabilities calculated from PWM models can be highly misleading. Classification Based Statistics Quantitative method to evaluate: how well one can distinguish between cases and controls. how well a diagnostic test performs in testing for some disease. 6.5

With this test, how many people that are actually ill will I catch? OR The likelihood of spotting a positive case when presented with one. TN Specificity = TN + FP TN Specificity = TN + FP With this test, will I tell too many people they might be ill? OR The likelihood of spotting a negative case when presented with one. 6.6

Medical Test Evaluation s = Test states you have the disease when you do have the disease s = Test states you do not have the disease when you do not have the disease s = Test states you have the disease when you do not have the disease s = Test states you do not have the disease when you do Evaluating Medical Tests The probability of having a positive test result among those with a positive diagnosis for the disease Sensitivity = s / s + s Specificity = The probability of having a negative test result among those with a negative diagnosis for the disease Specificity = s / s + s 6.7