Background. Bioinformatics Exercises for the Study of Evolution with Heme Proteins as a Model System

Similar documents
Bioinformatics Exercises

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Using Bioinformatics to Study Evolutionary Relationships Instructions

Hands-On Nine The PAX6 Gene and Protein

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

BIOINFORMATICS LAB AP BIOLOGY

Chapter 7: Covalent Structure of Proteins. Voet & Voet: Pages ,

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Comparing whole genomes

Supplemental Materials

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

GENERAL BIOLOGY LABORATORY EXERCISE Amino Acid Sequence Analysis of Cytochrome C in Bacteria and Eukarya Using Bioinformatics

Introduction to Bioinformatics Online Course: IBT

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Genomes and Their Evolution

Open a Word document to record answers to any italicized questions. You will the final document to me at

Investigating Evolutionary Questions Using Online Molecular Databases *

Protein Architecture V: Evolution, Function & Classification. Lecture 9: Amino acid use units. Caveat: collagen is a. Margaret A. Daugherty.

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Introduction to Bioinformatics

BIOINFORMATICS: An Introduction

Comparative genomics: Overview & Tools + MUMmer algorithm

Example Items. Biology

CS612 - Algorithms in Bioinformatics

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi

COMPARING DNA SEQUENCES TO UNDERSTAND EVOLUTIONARY RELATIONSHIPS WITH BLAST

Journal of Proteomics & Bioinformatics - Open Access

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Piecing It Together. 1) The envelope contains puzzle pieces for 5 vertebrate embryos in 3 different stages of

Warm Up. What are some examples of living things? Describe the characteristics of living things

Studying Life. Lesson Overview. Lesson Overview. 1.3 Studying Life

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Protein Structure Prediction and Display

BIOLOGY STANDARDS BASED RUBRIC

BIG IDEA 4: BIOLOGICAL SYSTEMS INTERACT, AND THESE SYSTEMS AND THEIR INTERACTIONS POSSESS COMPLEX PROPERTIES.

Protein Bioinformatics Computer lab #1 Friday, April 11, 2008 Sean Prigge and Ingo Ruczinski

Sequences, Structures, and Gene Regulatory Networks

Synteny Portal Documentation

Chapter 15: Darwin and Evolution

The Complete Set Of Genetic Instructions In An Organism's Chromosomes Is Called The

Sequencing alignment Ameer Effat M. Elfarash

EBI web resources II: Ensembl and InterPro

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Lab 1 Uniform Motion - Graphing and Analyzing Motion

Exploring Evolution & Bioinformatics

Cladistics and Bioinformatics Questions 2013

The Contribution of Bioinformatics to Evolutionary Thought

Full file at CHAPTER 2 Genetics

Emily Blanton Phylogeny Lab Report May 2009

Genomics and bioinformatics summary. Finding genes -- computer searches

Chapter 18 Active Reading Guide Genomes and Their Evolution

Biology New Jersey 1. NATURE OF LIFE 2. THE CHEMISTRY OF LIFE. Tutorial Outline

Name Block Date Final Exam Study Guide

Student Handout Fruit Fly Ethomics & Genomics

Proteins: Historians of Life on Earth Garry A. Duncan, Eric Martz, and Sam Donovan

Variation of Traits. genetic variation: the measure of the differences among individuals within a population

Topics. Antibiotic resistance, changing environment LITERACY MATHEMATICS. Traits, variation, population MATHEMATICS

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Chem Lecture 3 Hemoglobin & Myoglobin

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

A DISEASE ECOLOGIST S GUIDE TO EVOLUTION: EVIDENCE FROM HOST- PARASITE RELATIONSHIPS

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

How to Use This Presentation

STAAR Biology Assessment

AP Biology Summer Assignment

Homology and Information Gathering and Domain Annotation for Proteins

PROTEIN SYNTHESIS INTRO

Translation Part 2 of Protein Synthesis

Biology Semester 2 Final Review

Sequence analysis and comparison

Ap Biology Chapter 17 Packet Answers

Readings Lecture Topics Class Activities Labs Projects Chapter 1: Biology 6 th ed. Campbell and Reese Student Selected Magazine Article

Sequencing alignment Ameer Effat M. Elfarash

Big Idea 1: Does the process of evolution drive the diversity and unit of life?

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

GENETICS - CLUTCH CH.1 INTRODUCTION TO GENETICS.

Campbell Biology AP Edition 11 th Edition, 2018

Large-Scale Genomic Surveys

Introduction to Molecular and Cell Biology

SCOTCAT Credits: 20 SCQF Level 7 Semester 1 Academic year: 2018/ am, Practical classes one per week pm Mon, Tue, or Wed

Tutorial 4 Substitution matrices and PSI-BLAST

What can sequences tell us?

Biology Assessment. Eligible Texas Essential Knowledge and Skills

1. The basic structural and physiological unit of all living organisms is the A) aggregate. B) organelle. C) organism. D) membrane. E) cell.

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Introduction to Biology with Lab

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Local Alignment Statistics

Transcription:

AP Biology Academy www.bioc.rice.edu/~olson/courses.html Bioinformatics Exercises for the Study of Evolution with Heme Proteins as a Model System Evolution of O2 transport and storage by hemoglobins: From Bacteria to Plants and Animals 2008 AP Biology Teachers Workshop Background Evolution of O2 transport and storage by hemoglobins: From Bacteria to Plants and Animals truncated globins sensor globins M. tuberculosis B. subtillis HemAT CD01040 globins: Globins are heme proteins, which bind and transport oxygen. This family summarizes a diverse set of homologous protein domains, including: (1) tetrameric vertebrate hemoglobins, which are the major protein component of erythrocytes and transport oxygen in the bloodstream, (2) microorganismal flavohemoglobins, which are linked to C-terminal FADdependend reductase domains, (3) homodimeric bacterial hemoglobins, such as from Vitreoscilla, (4) plant leghemoglobins (symbiotic hemoglobins, involved in nitrogen metabolism in plant rhizomes), (5) plant non-symbiotic hexacoordinate globins and hexacoordinate globins from bacteria and animals, such as neuroglobin, (6) invertebrate hemoglobins, which may occur in tandem-repeat arrangements, and (7) monomeric myoglobins found in animal muscle tissue. mammalian hemoglobin globins 1

CD01068 sensor globins: Globin domain present in Globin-Coupled-Sensors (GCS). These domains detect changes in intracellular concentrations of oxygen, monoxide, or nitrous oxide, which result in aerotaxis and/or gene regulation. One subgroup, the HemATs, are aerotactic heme sensors combining a globin with an MCP signaling domain, others function as gene regulators, by direct combination with DNA-binding domains, with domains modulating 2nd messengers, or with domains interacting with transcription factors or regulators. CD00454 truncated hemoglobins: Truncated hemoglobins (trhbs) are a family of oxygen-binding heme proteins found in cyanobacteria, eubacteria, unicellular eukaryotes, and plants. The truncated hemoglobins have a characteristic two-over-two alpha helical folding pattern that is distinct from the three-over-three pattern found in other globins. A subset of these have been demonstrated to form homodimers. Molecular Evolution (ME): relatedness of biological molecules (does not necessarily reflect relatedness of species from which the gene and protein sequences have been taken) Molecular Evolution FlavoHbs Mb I. Genes ME determined primarily by nucleic acid sequence II. Proteins ME determined primarily by amino acid sequence Plant 2-on-2 Hb Bacterial 2-on-2 Hb Plant Hb Hb Bioinformatics Tools Sequence alignment Gene identification Genome assembly Protein structure prediction/alignment Gene expression predictions Info about biomolecules Protein function Protein structure Evolution (molecular and whole organism level) Gene Expression Legend: 1ASH - Hemoglobin (Hb) domain 1 Ascaris suum (pig roundworm) 1DLY - Hb Green Unicellular Alga Chlamydomonas Eugametos 1KR7 - Nerve Tissue Mini-Hb Cerebratulus Lacteus (Nemertean Worm) 1OR4 - HemAT Sensor Domain From B. subtilis 2

Protein phylogenetic Tree Visual of relationships between sequences based on molecular evolution Bioinformatics Exercises Advantages: 1. Most schools have computer labs or library computer access. 2. Exercises only require a browser. 3. Students interested in all areas of bioscience must learn to mine computational databases - unavoidable for future scientists. Pitfalls to avoid: 1. Databases are HUGE, must give specific, step-by-step instructions or the student will get hopelessly lost. 2. Some sites go down temporarily, design a 2- or 3-part exercise that uses more than one site, or make the due date flexible. 3. Practice the exercise each year, databases and interfaces change rapidly - what worked last semester may not work now. Assignment: Background Databases: How are they created, maintained, and used? Assignment: Bioinformatics Tools Vocabulary Sequence alignment Multiple sequence alignment and phylogenetic trees Databases: background for students 1. Convey the necessity of computers in storing and analyzing the large amount of biological data available. Link to Human Genome Project web sites 2. Choose exercises or background reading that stress the size and growth rate of biological databases. Explore NCBI 3. Discuss search engines and search design: specificity vs. sensitivity Examples in Entrez 4. Local research in genomics Baylor College of Medicine 1. Convey the necessity of computers in storing and analyzing such a large amount of data. Emphasize the factors that contribute to the dependence of biological studies on computers How many bases in a genome? Human ~ 3 billion Rat ~ 3 billion Chicken ~ 1 billion Fish ~ 400 million Bacteria - M. tuberculosis ~ 4 million 1. Convey the necessity of computers in storing and analyzing such a large amount of data. Human Genome Project 3

AP Biology Academy 2. Choose exercises or background reading that stress the size and growth rate of biological databases. NCBI Genbank = nucleotide database About the Human Genome Project Question 1: Question 2: Question 3: 80 billion nucleotide bases (Jan 2008) Doubles ~ every 18 months 260,000 species +1700 per month Part of international collection of 100 gigabases (Aug 2005) How many genes are found in the human genome? How many DNA base pairs make up the human genome? Name 2 project goals that will require the help of computers. 2. Choose exercises or background reading that stress the size and growth rate of biological databases. NCBI How many new sequences were added in 2007? What are the top 5 species with sequences in GenBank? What does a GenBank entry include? Who typically submits data to GenBank? Open Access Jan 2008 Database Issue What website would you use to retrieve GenBank information? 3. Discuss search engines and search design: specificity vs. sensitivity Entrez PubMed 3. Discuss search engines and search design: specificity vs. sensitivity Entrez PubMed 4

4. Is the Houston Medical Center involved in genomics? http://www.hgsc.bcm.tmc.edu/ Sample questions to ask about the BCM HGSC site How many genome projects are being sequenced for different organisms at the Human Genome Sequencing Center, Baylor College of Medicine? How many primate genome projects are listed? Why do you think so many primate genomes are being sequenced? Why is it important to humans to learn about bovine genomes? Why is it important to humans to learn about microbial genomes? URLs for bioinformatics students: The Human Genome Project: http://www.ornl.gov/sci/techresources/human_genome/project/about.shtml The Human Genome Sequencing Center at Baylor College of Medicine http://www.hgsc.bcm.tmc.edu/ Cells Alive: www.cellsalive.com The Biology Workbench: http://workbench.sdsc.edu/ National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/ Expasy (Swiss Institute of Bioinformatics) http://us.expasy.org/tools/ European Bioinformatics Institute http://www.ebi.ac.uk/tools/ 1. Introduce vocabulary necessary to describe features of sequence alignments. 2. Sequence alignment Sequence Alignments NCBI Blast 3. Multiple sequence alignments and phylogenetic trees Biology Workbench To compare, use protein sequences rather than DNA when possible Why? Less noise in protein sequences - what are the causes? I. Mathematical Probability: From a strictly mathematical point of view, assuming that there is an equal likelihood of any nucleotide appearing at any point in a sequence (which is generally NOT true biologically), what are the chances that a G in a nucleotide sequence will be randomly matched by a G in the same position in a different sequence? 1/4 From the same point of view, what are the chances that a G in a protein sequence will be randomly matched by a G in the same position in a different sequence? 1/20 5

Less noise in protein sequences - what are the causes? II. Degeneracy of the genetic code: 18 of the 20 amino acids are coded for by > one codon - therefore, a single mutation in the DNA code does not necessarily translate into a change in the amino acid code (particularly true of mutations in the 3rd codon) UUC to UUU mutation: UUC encodes PHE (F) UUU encodes PHE (F) a single change within a triplet codon is often not sufficient to cause a codon to code for an amino acid in a different category (nonpolar, polar, positively charged, negatively charged) AAG to AGG mutation: AAG encodes LYS (K) AGG encodes ARG (R) Less noise in protein sequences - what are the causes? III. Similarity signals contribute more information in protein sequences than in nucleotide sequences. Many categories of amino acids, some can be weighted more heavily than others (nonpolar, polar, positively charged, negatively charged, aromatic, structural similarity) Nucleotides - transitions purine to purine, pyrimidine to pyrimidine transversions purine to pyrimidine, pyrimidine to purine Studying VERY CLOSELY related sequences (identity, homology, paralogy): Nucleotide sequence might be preferred (can see subtle changes that might be invisible in protein sequences). Vocabulary for Sequence Alignments If the same letter occurs in two aligned sequences then this position has been conserved in evolution. If the letters differ it is assumed that the two derive from an ancestral letter (which could be one of the two or neither). Evolutionary processes in biology can introduce insertions or deletions in sequences. In a sequence alignment, a letter or a stretch of letters may be paired up with dashes in the other sequence, called gaps, to signify an insertion or deletion. If a biologist makes the statement that two sequences are related, he means that they are believed to have a common evolutionary origin. Identity indicates exact match in two (or more) sequences Similarity indicates chemical or structural similarity (based on categories of amino acids) between unidentical aligned residues in two (or more) sequences Homology the source of the similarity between unidentical aligned residues in two (or more) sequences is biological, such as evolutionarily related sequences in different species (same origin and function) or relationship between members of a chromosome pair in diploid organisms (homologous sequences are similar, but similar sequences are not always homologous) 12 GAPS (yellow) = an insertion in one sequence or a deletion in the other sequence? The residues in aligned positions of different sequences are implied to have a common evolutionary origin 29 identities (green) 20 similarities (cyan) 6

Sequence comparison Healthy vs. diseased Identify genes involved in diseases One organism vs. another How closely related are two organisms Unknown function vs. known Lots of genes are not understood Proteins involved in genetic diseases Lactase - Lactose Intolerance Insulin receptor - P53 protein - Cancer Diabetes Disease Sickling cell Sickle Cell Anemia Irreversibly sickled cells Normal red blood cells with no deformations Exercise in Sequence Alignment: Our example is HbB vs. HbS Type the following web site into your browser: http://www.ncbi.nlm.nih.gov/ Next to the Search box, select Protein, to search the NCBI database containing protein sequences. Red blood cells containing HbS β subunits http://www.sicklecelldisease.org/research/index.phtml The record for hemoglobin S should be returned. Hemoglobin is the protein in our blood cells that carries oxygen. Click on the link entitled 1HBSB. Next to the word Display in the grey region at the top of the file, change GenPept to FASTA. 7

This will display the amino acid sequence for hemoglobin S in FASTA format. Hold down the left mouse button while you move the mouse over the sequence. This should highlight the amino acid sequence in blue. Now choose Edit:Copy from the browser window, or hit the buttons Ctrl and C to copy. Now, click on the NCBI logo in the upper left corner of the web page to return to the main page. In the dark blue menu bar at the top of the page, click on the word BLAST. In the box of Basic BLAST options, click on the link entitled protein BLAST. Click in the Search box and choose Edit: paste from the browser menu or hit the Ctrl and P keys to paste in the sequence. 8

Change the nr database to swissprot, then click the BLAST! button. Wait for alignment to Format Time depends on server use. Under the graph indicating the length of the top alignments, there will be a list of aligning sequences in order of decreasing alignment scores. Click on the score of the first item in the list, which is the highest scoring alignment. This will take you to the section of the file where you can view the alignment. Identify the differences in the sequence of Query 1and Subject 2 A dissimilar substitution occurs at amino acid number 6. The sickle cell mutation in Hemoglobin: Sickle cell anemia is a blood condition seen most commonly in people of African ancestry and in the tribal peoples of India. The individual must have two copies of the mutant hemoglobin gene to exhibit the sickle-shaped cells indicative of the condition. The hemoglobin S beta subunit has the amino acid valine at position 6 instead of the glutamic acid that is normally present. This alteration is the basis of all the problems that occur in people with sickle cell disease. Glu6 to Val The β E6V mutation dramatically decreases the solubility of hemoglobin and causes long fibers to form that sickle the red cells. This was the first genetic disease that was identified at the amino acid level. Hb AS is benign (only shortens red cell life time to ~80 days) Hb SS leads to severe anemia and early death 9

Disease Disease AS + AS Sickle Cell anemia Anopheles gambiae, Infection with Plasmodium falciparum HbAS cells only last ~80 days and release the immature parasite which is destroyed by the immune system. Sickle Cell Anemia The sickle cell trait is maintained by strong selective pressure because AS individuals have resistance to malaria. It is a classic case of natural selection. AA + AS + SA + SS 50% are resistant and live malaria sickle cell disease Trait has evolved spontaneously and independently in several different locations in Africa because of strong selective pressure The Plasmodium parasite must remain in the red blood cells for more than 90 days to be able to change its cell coat and avoid antibody responses Exercise in Multiple Sequence Alignment The Biology Workbench http://workbench.sdsc.edu/ The Biology Workbench is one of my favorite teaching tools, because the student can do a complete Bioinformatics project with the Workbench, from retrieving the sequences to performing multiple alignments and creating phylogenetic tree diagrams. 1. Retrieve sequences from database 2. Manually input sequences 3. Align selection of sequences 4. Align multiple alignments of sequences 5. Create a phylogenetic tree 1. Retrieve sequences Be careful, there are many databases - too much information -too many results from a query confuses the student a. GenPept - Genbank gene products - full release b. SwissProt - manually curated European database Select Ndjinn Multiple Database Search and click Run. Ndjinn Multiple Database Search 10

Genpept search of fetal hemoglobin : 7 results in over 1 minute SwissProt search of fetal hemoglobin : 68 results in ~ 20 seconds Select two human Hb gamma protein sequences Click Import Sequence(s) Sequences now listed under Protein Tools 2. Manually input sequences Obtain sequence in FASTA format from NCBI as in previous exercise Select Add New Protein Sequence and click Run Add New Protein Sequence Type in a name for the sequence. Use paste to input sequence. Click Save. Sequence will be listed below Protein Tools. 3. Align selected sequences Check boxes next to sequences Scroll down Protein Tools list to CLUSTALW - Multiple Sequence Alignment Click Run then click Submit on the next page CLUSTALW 11

Color coding represents consensus. HbG1 and HbG2 have an A:G substitution at position 136 HbE has the A at 136 HbB has the G Alignment can be copied an pasted into text document. Set page to landscape for rows of alignment to print on one line. Analysis: Highest scoring pairwise alignments: 1. HbG1 and HbG2 2. HbS and HbB Scroll back to alignment and import the alignment. Lowest scoring pairwise alignments: 1. HbS and HbE 2. HbS and HbG2 3. HbB and HbG1 Alignment is listed including protein names under Alignment Tools. Complete 5 alignments of proteins in the globin-like superfamily. 1. Human Hemoglobins (globin family) HbB, HbS, HbE, HbG1, HbG2 2. Globin family Non-symbiotic hemoglobins from rice and corn, symbiotic hemoglobins from soybean and pea 3. Sensor family HemAT from B. subtillis and H. salinarum, methyl-accepting chemotaxis protein from A. tumefaciens 4. Truncated family 2-on-2 hemoglobin from Arabidopsis, protein from S. aureus, and M. tuberculosis Identifiers to retrieve sequences can be found in syllabus. 12

4. Aligning aligned sequences Check boxes next to alignments Scroll down Alignment Tools list to CLUSTALWPROF - Align Two Existing Alignments (Profiles) Click Run then click Submit on the next page CLUSTALWPROF Import the alignment. To create an alignment with all 4 groups of sequences, repeat CLUSTALWPROF two more times, using the new alignment plus one other. WARNING: This creates a LONG alignment, but also leads to an interesting phylogenetic tree. 5. Creating a phylogenetic tree Select an alignment Scroll down Alignment Tools list to DRAWGRAM - Draw Rooted Phylogenetic Tree from Alignment Click Run then click Submit on the next page DRAWGRAM Use this link to get a version with white background suitable for reports. Phylogenetic Tree of Proteins in Globin-Like Superfamily Globin family Truncated family Sensor family Additional Ideas Have students align all 15 sequences using CLUSTALW and compare to the alignment of alignments from CLUSTALWPROF Assign students an unknown sequence and have them identify to which family it belongs using alignments and paired alignment scores Using an alignment of all 15 sequences, hypothesize why sensor globin family members are longer proteins? Predict which portion of the B. subtillis HemAT is responsible for signaling. 13