Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary

1. Gene finding: computer searches, cDNAs, ESTs
2. Microarrays
3. Use BLAST to find homologous sequences
4. Multiple sequence alignments (MSAs)
5. Trees quantify sequence and evolutionary relationships
6. Protein sequences are evolutionary clocks
7. Some public databases and protein sequence analysis tools

Finding genes -- computer searches

Computer searches locate most genes in prokaryotes, Archaea, and yeast, but only ~1/3 of human genes are identified correctly.

Criteria:
  Protein start, stop signals, splicing signals...
  Codon bias
  Comparisons to other genomes (mouse, rat, fish, fly, mosquito, worm, yeast...)

Some hard problems: small genes, post-translational modifications, unique genes, spliced genes, alternative splicing, gene rearrangements (e.g. IgGs)...
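The "start and stop signals" criterion can be illustrated with a minimal open reading frame (ORF) scan. This is only a sketch under simplifying assumptions that are not in the slides: forward strand only, ATG as the sole start codon, no splicing, and a hypothetical `min_codons` cutoff. Real gene finders combine this with codon bias and genome comparisons.

```python
# Minimal ORF scan: an illustrative sketch, not a production gene finder.
# Assumptions (not from the slides): forward strand only, ATG start, no introns.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts coding codons (start included, stop excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy sequence: ATG AAA TTT GGG TAA sits in reading frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A scan like this hints at why small genes and spliced genes are listed as hard problems: short ORFs arise by chance, and introns break the simple start-to-stop pattern.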

Finding genes -- cDNA synthesis

Synthesizing cDNA (complementary DNA):
1. Extract RNA.
2. Hybridize poly-T primer.
3. Synthesize DNA strand 1 using reverse transcriptase.
4. Fragment RNA strand using RNase H.
5. Synthesize DNA strand 2 using DNA pol.

Sequences of random cDNAs provide ESTs (Expressed Sequence Tags).

Microarrays quantify expressed genes by hybridization

1. Label cDNAs with red fluorophore in one condition and green fluorophore in another, reference condition.
2. Mix red and green DNA and hybridize to a microarray.

Spot colors:
  Red = genes enriched in reference
  Yellow (green + red) = genes expressed about equally in both conditions
  Green = genes enriched in experiment

Each spot is a different synthetic oligonucleotide complementary to a specific gene.
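A two-color microarray comparison is usually summarized per spot as a log ratio of the two channel intensities. The sketch below assumes hypothetical gene names and intensity values, a hypothetical `floor` to avoid dividing by zero, and a two-fold cutoff; none of these come from the slides.

```python
import math


def log2_ratios(experiment, reference, floor=1.0):
    """Return {gene: log2(experiment / reference)} for spots present in both.

    floor is an assumed minimum signal that keeps the ratio finite.
    """
    ratios = {}
    for gene, exp_signal in experiment.items():
        if gene not in reference:
            continue
        e = max(exp_signal, floor)
        r = max(reference[gene], floor)
        ratios[gene] = math.log2(e / r)
    return ratios


def classify(ratio, threshold=1.0):
    """Label a spot using an assumed two-fold (log2 = 1) cutoff."""
    if ratio >= threshold:
        return "enriched in experiment"
    if ratio <= -threshold:
        return "enriched in reference"
    return "similar in both"


# Hypothetical channel intensities for three spots.
experiment = {"geneA": 800.0, "geneB": 100.0, "geneC": 210.0}
reference = {"geneA": 100.0, "geneB": 400.0, "geneC": 200.0}

for gene, ratio in log2_ratios(experiment, reference).items():
    print(gene, round(ratio, 2), classify(ratio))
```

Working in log2 units makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero, which is convenient for the clustering step that typically follows.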

Cluster analysis identifies patterns of gene expression

(Figure axes: Conditions, Genes.)

1. Genes with similar patterns of expression are placed next to each other. Groups of genes with similar patterns form a hierarchical tree. For example, the two major branches of the tree comprise activated (left, green) or repressed genes (right, red).
2. Genes with similar expression patterns (e.g. A-E) often function together.

Tiling microarrays can find transcribed sequences

Microarray coding capacity: ~16 M bases.
Each spot has a different synthetic oligonucleotide complementary to a different segment of the genome (e.g., every 100 bp). Spots that hybridize reveal transcribed regions.
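The cluster analysis described above can be sketched as agglomerative clustering of expression profiles. The slides do not say which distance or linkage was used; this example assumes a correlation distance (1 - Pearson r) with average linkage, and the gene names and profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined if one is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster(profiles):
    """Average-linkage clustering; returns the tree as nested tuples of gene names."""
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]

    def dist(a, b):
        # Mean correlation distance over all cross-cluster gene pairs.
        pairs = [(m, n) for m in a[1] for n in b[1]]
        total = sum(1.0 - pearson(profiles[m], profiles[n]) for m, n in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find and merge the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical log-ratio profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
print(cluster(profiles))
```

Genes whose profiles rise together (geneA, geneB) end up on one branch and genes that fall together (geneC, geneD) on the other, mirroring the two major branches mentioned in the slide.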

Find similar sequences (homologs) with BLAST

The most related human protein identified by a BLAST search of the human genome using the sequence of M. tuberculosis PknB Ser/Thr protein kinase is... ELKL motif kinase 1.

Query = the part of the PknB sequence that matches ELKL-1.
Subject = ELKL-1.
Expect = expectation value = the number of hits of this quality expected by chance in a database of this size (5e-24 = 5 x 10^-24; is this a big number or small?)
Identities = # of exact amino acid matches in the alignment.
Positives = # of exact matches plus conservative changes (marked +), as defined by the residues that tend to replace each other in homologous proteins.
NP_004945.2 = sequence ID for ELKL-1.

>ref NP_004945.2 ELKL motif kinase 1 [Homo sapiens]
Length = 691
Score = 108 bits (270), Expect = 5e-24
Identities = 87/296 (29%), Positives = 135/296 (45%), Gaps = 21/296 (7%)

Query: 11  YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA 70
           Y L + +G G  ++V LAR +   ++VAVK++        S    FR E +    LNHP
Sbjct: 20  YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN 78

Query: 71  IVAVYDTGEAETPAGPLPYIVMEYVDGVTLRDIVHTEGPMTPKRAIEVIADACQALNFSH 130
           IV +++  E E       Y+VMEY  G  + D +   G M  K A         A+ + H
Sbjct: 79  IVKLFEVIETEKTL----YLVMEYASGGEVFDYLVAHGRMKEKEARAKFRQIVSAVQYCH 134

Query: 131 QNGIIHRDVKPANIMISATNAVKVMDFGIARAIADSGNSVTQTAAVIGTAQYLSPEQARG 190
           Q  I+HRD+K  N+++ A   +K+ DFG +      GN +       G+  Y +PE  +G
Sbjct: 135 QKFIVHRDLKAENLLLDADMNIKIADFGFSNEFT-FGNKLD---TFCGSPPYAAPELFQG 190

Query: 191 DSVDA-RSDVYSLGCVLYEVLTGEPPFTGDSPVSVAYQHVREDPIPPSARHE-GLSADLD 248
              D    DV+SLG +LY +++G  PF G +      + +RE  +    R    +S D +
Sbjct: 191 KKYDGPEVDVWSLGVILYTLVSGSLPFDGQN-----LKELRERVLRGKYRIPFYMSTDCE 245

Query: 249 AVVLKALAKNPENRYQTAAEMRADLVRVHNGEPPEAPKV-----LTDAERTSLLSS 299
            ++ K L  NP  R      M+   + V + +    P V       D  RT L+ S
Sbjct: 246 NLLKKFLILNPSKRGTLEQIMKDRWMNVGHEDDELKPYVEPLPDYKDPRRTELMVS 301

Ser/Thr protein kinases diverge rapidly

Multiple Sequence Alignment (MSA) of the N-terminal ~90 residues of M. tuberculosis PknB (bottom) and Ser/Thr protein kinases of known structure. The histogram at the bottom shows % identity at each position. Only a few residues are absolutely conserved (functional sites!). The MSA defines the beginning of the kinase domain. Insertions often occur in loops.
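The summary statistics in the BLAST report above can be reproduced with a few lines of arithmetic. The percentages use integer truncation, which matches the three values shown in the report. The expectation value uses the standard bit-score relation E = m * n * 2^(-S'), where the search-space sizes `query_len` and `db_len` below are hypothetical placeholders, since the slides do not give the effective lengths BLAST used.

```python
def percent(part, total):
    """Integer percentage, truncated (this reproduces the report's 29%, 45%, 7%)."""
    return 100 * part // total


def expected_hits(bit_score, query_len, db_len):
    """Expected chance hits: E = m * n * 2**(-S') for bit score S'.

    query_len (m) and db_len (n) stand in for BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


alignment_length = 296
print("Identities:", percent(87, alignment_length), "%")
print("Positives: ", percent(135, alignment_length), "%")
print("Gaps:      ", percent(21, alignment_length), "%")

# Hypothetical search space, for illustration only.
e_value = expected_hits(bit_score=108, query_len=300, db_len=5_000_000)
print("E-value for a 108-bit hit: %.1e" % e_value)
```

Because E falls by half for every additional bit of score, a 108-bit alignment is expected by chance far less than once even in a large database, which is why an Expect value like 5e-24 is a very small number.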

Histones evolve slowly

(Figure panels: Tree; MSA = Multiple Sequence Alignment.)

Core H3 proteins (that have the same function) are nearly identical in eukaryotes (left). Archaeal H3s and specialized H3 proteins that bind at centromeres show much more divergence (bottom sequences and tree branches, right).

Protein sequences are evolutionary clocks

Assuming that organisms diverged from a common ancestor and sequence changes accumulate at constant rates, the number of changes in homologous proteins gives information about the time that each sequence has been evolving independently.

(Figure: average rate of change of proteins of different function, from slow to fast.)
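The clock assumption above can be turned into a rough time estimate. This sketch assumes a Poisson correction for multiple substitutions at the same site and a constant rate per site per year; the rate and the observed differences used here are hypothetical numbers, not values from the slides.

```python
import math


def poisson_distance(fraction_different):
    """Estimated substitutions per site, correcting for repeated changes at a site."""
    return -math.log(1.0 - fraction_different)


def divergence_time(fraction_different, rate_per_site_per_year):
    """Years since the common ancestor.

    Both lineages accumulate changes independently, so distance = 2 * rate * time.
    """
    return poisson_distance(fraction_different) / (2.0 * rate_per_site_per_year)


# Hypothetical example: 18 differences over 100 aligned residues,
# with an assumed rate of 1e-9 substitutions per site per year.
p = 18 / 100
print("distance per site: %.3f" % poisson_distance(p))
print("divergence time:   %.2e years" % divergence_time(p, 1e-9))
```

The same observed difference implies a much longer divergence time for a slowly changing protein such as a core histone than for a rapidly changing one, so the rate has to be calibrated per protein family.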

Tree of life (Sequences = biological clocks)

A tree derived by clustering sequences of a typical protein family (pterin-4a-hydroxylase) recapitulates the tree of life. Evolutionary relationships are seen at the molecular level in virtually every shared protein and RNA!

Some web sites for bioinformatics

Nucleic acid sequences
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=nucleotide
Protein sequences
  http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=protein
Structure coordinates: Protein Data Bank
  http://www.rcsb.org/pdb/

Programs

BLAST sequence similarity calculation
  http://www.ncbi.nlm.nih.gov/blast/
BLAST bacterial genomes
  http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi
PHD secondary structure predictor and motif search
  http://www.embl-heidelberg.de/predictprotein/predictprotein.html
PHYRE fold predictor
  http://www.sbg.bio.ic.ac.uk/~phyre/
Multicoil: coiled coil prediction
  http://multicoil.lcs.mit.edu/cgi-bin/multicoil/
Many nucleic acid and protein sequence-analysis tools
  http://au.expasy.org/
Predict transmembrane helices
  http://www.cbs.dtu.dk/services/thmm-2.0/
Predict signal sequences
  http://www.cbs.dtu.dk/services/signalp/

Genomics and bioinformatics summary

1. Gene finding: computer searches, cDNAs, ESTs
2. Microarrays
3. Use BLAST to find homologous sequences
4. Multiple sequence alignments (MSAs)
5. Trees quantify sequence and evolutionary relationships
6. Protein sequences are evolutionary clocks
7. Lots of public databases and protein sequence analysis tools