Bioinformatics Chapter 1. Introduction

Similar documents
Computational Biology: Basics & Interesting Problems

BME 5742 Biosystems Modeling and Control

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

GCD3033:Cell Biology. Transcription

From gene to protein. Premedical biology

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

1. In most cases, genes code for and it is that

From Gene to Protein

Chapter 17. From Gene to Protein. Biology Kevin Dees

Translation Part 2 of Protein Synthesis

Multiple Choice Review- Eukaryotic Gene Expression

Lesson Overview. Ribosomes and Protein Synthesis 13.2

GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis

UNIT 5. Protein Synthesis 11/22/16

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Introduction to Bioinformatics

Bio 119 Bacterial Genomics 6/26/10

Controlling Gene Expression

Laith AL-Mustafa. Protein synthesis. Nabil Bashir 10\28\ First

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Molecular Biology - Translation of RNA to make Protein *

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Translation. Genetic code

9/11/18. Molecular and Cellular Biology. 3. The Cell From Genes to Proteins. key processes

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

Chapter

Name: Class: Date: ID: A

9/2/17. Molecular and Cellular Biology. 3. The Cell From Genes to Proteins. key processes

BIOLOGY STANDARDS BASED RUBRIC

Graph Alignment and Biological Networks

Sequence analysis and comparison

Honors Biology Reading Guide Chapter 11

Biology 112 Practice Midterm Questions

3.B.1 Gene Regulation. Gene regulation results in differential gene expression, leading to cell specialization.

Lab 2A--Life on Earth

Microbiome: 16S rrna Sequencing 3/30/2018

Types of RNA. 1. Messenger RNA(mRNA): 1. Represents only 5% of the total RNA in the cell.

From DNA to protein, i.e. the central dogma

UNIT 6 PART 3 *REGULATION USING OPERONS* Hillis Textbook, CH 11

PROTEIN SYNTHESIS: TRANSLATION AND THE GENETIC CODE

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Genomics and bioinformatics summary. Finding genes -- computer searches

Name: SBI 4U. Gene Expression Quiz. Overall Expectation:

Supplemental Materials

Biology I Level - 2nd Semester Final Review

Flow of Genetic Information

Chapter 15 Active Reading Guide Regulation of Gene Expression

Prokaryotic Regulation

15.2 Prokaryotic Transcription *

Sequences, Structures, and Gene Regulatory Networks

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

Section 7. Junaid Malek, M.D.

Introduction to molecular biology. Mitesh Shrestha

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON

Chapter 17. Table of Contents. Objectives. Taxonomy. Classifying Organisms. Section 1 Biodiversity. Section 2 Systematics

PROTEIN SYNTHESIS INTRO

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

Kingdoms in Eukarya: Protista, Fungi, Plantae, & Animalia Each Eukarya kingdom has distinguishing characteristics:

More Protein Synthesis and a Model for Protein Transcription Error Rates

GENE REGULATION AND PROBLEMS OF DEVELOPMENT

AQA Biology A-level. relationships between organisms. Notes.

A. Correct! Taxonomy is the science of classification. B. Incorrect! Taxonomy is the science of classification.

In Genomes, Two Types of Genes

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

F. Piazza Center for Molecular Biophysics and University of Orléans, France. Selected topic in Physical Biology. Lecture 1

Regulation of Gene Expression

GENETICS - CLUTCH CH.11 TRANSLATION.

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

REVIEW SESSION. Wednesday, September 15 5:30 PM SHANTZ 242 E

What is the central dogma of biology?

Introduction. Gene expression is the combined process of :

Lecture 7: Simple genetic circuits I

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Complete all warm up questions Focus on operon functioning we will be creating operon models on Monday

ومن أحياها Translation 2. Translation 2. DONE BY :Nisreen Obeidat

Name Period The Control of Gene Expression in Prokaryotes Notes

Translation and Operons

Bacillus anthracis. Last Lecture: 1. Introduction 2. History 3. Koch s Postulates. 1. Prokaryote vs. Eukaryote 2. Classifying prokaryotes

SPRINGFIELD TECHNICAL COMMUNITY COLLEGE ACADEMIC AFFAIRS

BLAST. Varieties of BLAST

13.1 Biological Classification - Kingdoms and Domains Modern species are divided into three large groups, or domains. Bacteria Archaea Eukarya

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Genomes and Their Evolution

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan

A. Incorrect! In the binomial naming convention the Kingdom is not part of the name.

DNA. Announcements. Invertebrates DNA. DNA Code. DNA Molecule of inheritance. & Protein Synthesis. Midterm II is Friday

Unit 5: Taxonomy. KEY CONCEPT Organisms can be classified based on physical similarities.

Old FINAL EXAM BIO409/509 NAME. Please number your answers and write them on the attached, lined paper.

Cells and the Stuff They re Made of. Indiana University P575 1

Quiz answers. Allele. BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA)

Chapter 17B. Table of Contents. Section 1 Introduction to Kingdoms and Domains. Section 2 Advent of Multicellularity

UE Praktikum Bioinformatik

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Control of Gene Expression

Transcription:

Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences! Prediction of Molecular Function and Structure (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2

! A liger is the result of breeding a male lion with a female tiger. It has stripes and spots.! All ligers are presumed to be born sterile like mule. This is not unusual for hybrids.! Extravagant combination of genomes: this organism is unable to contribute to further to the evolution of the gene pool. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 3 1. Biological Data in Digital Symbol Sequence! Digital nature of genetic data/biological sequence! DNA sequence: 4 nucleotides (A, T, G, C)! Protein sequence: 20 amino acids Structure of DNA (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 4

! Central Dogma 4 Information flow from DNA to protein 4 Proteins are synthesized based on the information of DNA 4 DNA: information storage 4 RNA: information intermediate 4 Protein: various cellular functions (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 5! RNA transcription 4 Newly synthesized RNAs have complementary base sequences to template DNAs. (A U, C G) 4 Three major RNA classes: 4 mrna: protein coding 4 trna:attranslationprocess, brings amino acids to ribosomes 4 rrna: ribosome component 4 This transcription process is performed by RNA polymerase. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 6

! Translation: synthesis of proteins 4 Ribosome, trna and other components carry out this process. 4 Codon (mrna): anticodon (trna) recognition 4 Codon triplelets give more diverse compositions (4 nucleotide compositions = give different 64 codons) (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 7! Post translational protein folding 4 Proteins exist not as linear form but as complex 3D form. 4 Proteins are folded by many molecular interactions (hydrophobic interactions, disulfied bond, ionic bond ) 4 Various molecules like chaperone help protein folding. 4 This protein conformation construction is determined by information on its sequence. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 8

Why Probabilistic Models?! This course focuses on the probabilistic models of sequences. Why?! While the goal is to study a particular sequence and its molecular structure and function, the analysis typically proceeds through the study of an ensemble of sequences consisting of its different versions in different species or different versions in the same species.! Thus, comparison of sequence patterns across species must take into account that biological sequences are inherently noisy, the variability resulting in part from random events amplified by evolution. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 9 1.1 Database Annotation Quality! High error rate after handling of information 4 Handling by a diverse group of people 4 Incorrect assignments by experimentalists: simple consistency check 4 Features are indicated by listing the relevant positions in numeric form. 4 Fuzzy statements in annotation: comments like potential or probable! Bioinformaticians 4 Should consider potential sources of errors when creating machinelearning approaches for prediction and classification. 4 Prepare the data carefully 4 Discard data from unclear sources (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 10

1.1 Database Annotation Quality! Machine learning techniques 4 can handle noise if large corpora of sequences are available.! Learning from DNA sequences as language acquisition 4 Infants can detect linguistic regularities and learn simple statistics for the recognition of word boundaries 4 Learning techniques are useful for revealing similar regularities ( grammar or rules) in genomic data ( sentences or examples). (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 11 1.2 Database Redundancy! Redundancy: sources of error 4 Data for large families of closely related sequences " biased and over-represent features 4 Apparent correlations between different positions in the sequences may be artifact of biased sampling of the data. 4 Predictive performance may be overestimated if training/test data are closed related.! Over-representation problem 4 It is necessary to avoid too closely related sequences in a data set. 4 Keep all sequences in a data set and assign weights according to their novelty 4 Redundancy reduction arbitrary similarity threshold representative data set (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 12

2. Genomes Diversity, Size, and Structure! Genomes: total genetic material within a cell or organism DNA (e.g. cellular organism) or RNA (e.g. HIV virus)! Genome size: 5,386bp (bacteriophage ϕx174) ~3Gbp (homo sapiens) Gene numbers in several organisms (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 13 2. Genomes Diversity, Size, and Structure! Genome size does not directly relate to cell complexity 4 Many plant and amphibian genomes are larger than the human genome. 4 Cell complexity is related to gene numbers or gene-to-gene interactions. 4 Most regions of genome are not expressive units. Gene: expressive segment - coding proteins or functional RNA molecules (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 14

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 15 (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 16

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 17 (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 18

3. Proteins and Proteomes 3.1 From Genome to Proteome! Gene : genome = protein : proteome! Proteome: total protein expression of a set of chromosomes! Protein: dynamic structure with various modification! Proteome analysis: 4 Not only dealing with sequences and functions but also concerning biochemical states of proteins in its posttranslational form (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 19 3.2 Protein Length Distributions! Amino acid sequences in protein direct its structure which in turn influences its functions.! To make a suitable structure, regularity of amino acid sequences does exist.! Statistical analysis has played a major role in protein sequence analysis. Sequence of hemoglobin and its structure (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 20

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 21 Some Observations! It is observed that the peaks for the prokaryote H. influenzae and the eukaryote S. cerevisiae are positioned in different intervals: at 140-160 and 100-120, respectively.! Further studies show that a eukaryotic distribution from a wide range of species peaks at 125 amino acids and that the distribution displays a periodicity based on this size unit.! Interestingly, the distribution for the achaeon M. jannaschii is sort of inbetween the H. influenzae (eukaryote) and the S. cerevisiae (prokaryote) distributions. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 22

Tree of life: Three great kingdoms 4 Drawn by sequence analysis of rrna 4 Three kingdoms: bacteria, archaea, eukarya * Former five kingdoms: Animalia, Plantae, Fungi, Protista, Monea (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 23 3.3 Protein Function! Protein function 4 Does not depend on whole structure 4 Determined mainly by local sequences (e.g. active site of enzyme)! End of human genome project: finding genes, measuring its function and activity and many other analysis are required. 4 Prediction by experiment: time costing and hard job & non-real environment (under laboratory environment) 4 Many computational approaches can be helpful for protein analysis. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 24

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 25 (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 26

4. On the Information Content of Biological Sequences! Concept of information and its quantification is essential for the basic principles of machine-learning approaches in molecular biology! Data-driven prediction methods can extract essential features from individual examples and discard unwanted information.! Machine-learning techniques are excellent for the task of discarding of and compacting redundant sequence information (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 27! Information reduction 4 Key feature in the understanding of almost any kind of system 4 Create simple representation of a sequence space " much more powerful and useful than the original data that contain all details! Compression rate 4 Ratio between size of the an encoded corpus of sequences and the original corpus of sequences 4 Global quantifier of the degree of regularity in the data R C = S S E O (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 28

4.2 Alignment vs. Prediction: When are Alignment Reliable?! New sequences are aligned against all sequences in DB 4 Hints toward structural/functional relationships + functional insights! Fundamental question 4 When is the sequence similarity high enough that one may infer a structural/functional similarity from the pairwise alignment of two sequences? 4 Given the detected overlap in a sequence segment, can a similarity threshold be defined that sifts out cases where the inference will be reliable?! Answer 4 It depends on the structural/functional aspect one wants to investigate 4 Different for each task! Prediction 4 When alignment alone is not enough to lead a reliable inference (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 29 BLAST search result Query: Soy bean (plant) leghemoglobin Database: homo sapiens Alignment result shows two merely matched sequences, but their functions and structures are surprisingly coincided. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 30

4.3 Prediction of Functional Features! Two protein sequences share sequence similarity 4 These proteins share common function?! New sequence identity threshold is required for prediction of function 4 Threshold used for structural problems can not be used.! Solution 4 Split each sequence into a number of subsequences 4 The fraction of aligned site per alignment (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 31 4.4 Global and Local Alignments and Substitution Matrix Entropies! Pairwise alignment 4 There is no optimal answer (actually we don t know optimal alignment): changed by various options 4 Dynamic programming: computational demanding job 4 Heuristic approaches: BLAST, FASTA! Substitution matrix 4 Set of scores s ij of replacing amino acid i by amino acid j 4 Generated from simplified protein evolution model involving amino acid frequencies, p i s ij 1 = (ln λ q ij pipj ) p i = 1 20 q ij q for i = j = q for i j (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 32

PAM250 Matrix Matches involving rare amino acids count more than common amino acids Mismatches between similar amino acids have higher score than between distinct amino acids (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 33 4.5 Consensus Sequences and Sequence Logos! Creating consensus sequences 4 From alignments 4 Choose most common nucleotides or amino acids 4 Can be highly misleading! Sequence logo 4 Based on Shannon information content at each position 4 Emphasize the deviation from the uniform or flat distribution A D( i) = log2 A + p k = 1 A = 4 or 20 ( i)log 4 Displays a significant degree of deviation from flat distribution k 2 pk( i) (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 34

Initiation codon (ATG) is very well conserved. 5 to the initiation codons (-12~-7) are conserved also. (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 35! Slight variation of the logo formula H ( i) = H ( P( i), Q( i)) = p ( i) A k pk( i)log k = 1 qk( i) 4 Based on relative entropy 4 Quantification of the contrast between the observed probabilities P(i) and a reference probability distribution Q(i) 4 Q: depends on the position i in the alignment (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 36

5. Prediction of Molecular Function and Structure! DNA, RNA and protein sequence analysis based on machine-learning approaches can be helpful for many biological problems. 4 Finding intron splice sites in mrna processing 4 Gene finding 4 Recognition of promoter and other regulatory regions 4 Prediction of gene expression levels 4 Prediction of DNA bending and bendability 4 Sequence clustering 4 RNA secondary structure prediction 4 Protein function/structure prediction 4 Protein family classification (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 37