Background. Bioinformatics Exercises for the Study of Evolution with Heme Proteins as a Model System

AP Biology Academy www.bioc.rice.edu/~olson/courses.html Bioinformatics Exercises for the Study of Evolution with Heme Proteins as a Model System Evolution of O2 transport and storage by hemoglobins: From Bacteria to Plants and Animals 2008 AP Biology Teachers Workshop Background Evolution of O2 transport and storage by hemoglobins: From Bacteria to Plants and Animals truncated globins sensor globins M. tuberculosis B. subtillis HemAT CD01040 globins: Globins are heme proteins, which bind and transport oxygen. This family summarizes a diverse set of homologous protein domains, including: (1) tetrameric vertebrate hemoglobins, which are the major protein component of erythrocytes and transport oxygen in the bloodstream, (2) microorganismal flavohemoglobins, which are linked to C-terminal FADdependend reductase domains, (3) homodimeric bacterial hemoglobins, such as from Vitreoscilla, (4) plant leghemoglobins (symbiotic hemoglobins, involved in nitrogen metabolism in plant rhizomes), (5) plant non-symbiotic hexacoordinate globins and hexacoordinate globins from bacteria and animals, such as neuroglobin, (6) invertebrate hemoglobins, which may occur in tandem-repeat arrangements, and (7) monomeric myoglobins found in animal muscle tissue. mammalian hemoglobin globins 1

CD01068 sensor globins: Globin domain present in Globin-Coupled-Sensors (GCS). These domains detect changes in intracellular concentrations of oxygen, monoxide, or nitrous oxide, which result in aerotaxis and/or gene regulation. One subgroup, the HemATs, are aerotactic heme sensors combining a globin with an MCP signaling domain, others function as gene regulators, by direct combination with DNA-binding domains, with domains modulating 2nd messengers, or with domains interacting with transcription factors or regulators. CD00454 truncated hemoglobins: Truncated hemoglobins (trhbs) are a family of oxygen-binding heme proteins found in cyanobacteria, eubacteria, unicellular eukaryotes, and plants. The truncated hemoglobins have a characteristic two-over-two alpha helical folding pattern that is distinct from the three-over-three pattern found in other globins. A subset of these have been demonstrated to form homodimers. Molecular Evolution (ME): relatedness of biological molecules (does not necessarily reflect relatedness of species from which the gene and protein sequences have been taken) Molecular Evolution FlavoHbs Mb I. Genes ME determined primarily by nucleic acid sequence II. Proteins ME determined primarily by amino acid sequence Plant 2-on-2 Hb Bacterial 2-on-2 Hb Plant Hb Hb Bioinformatics Tools Sequence alignment Gene identification Genome assembly Protein structure prediction/alignment Gene expression predictions Info about biomolecules Protein function Protein structure Evolution (molecular and whole organism level) Gene Expression Legend: 1ASH - Hemoglobin (Hb) domain 1 Ascaris suum (pig roundworm) 1DLY - Hb Green Unicellular Alga Chlamydomonas Eugametos 1KR7 - Nerve Tissue Mini-Hb Cerebratulus Lacteus (Nemertean Worm) 1OR4 - HemAT Sensor Domain From B. subtilis 2

Protein phylogenetic Tree Visual of relationships between sequences based on molecular evolution Bioinformatics Exercises Advantages: 1. Most schools have computer labs or library computer access. 2. Exercises only require a browser. 3. Students interested in all areas of bioscience must learn to mine computational databases - unavoidable for future scientists. Pitfalls to avoid: 1. Databases are HUGE, must give specific, step-by-step instructions or the student will get hopelessly lost. 2. Some sites go down temporarily, design a 2- or 3-part exercise that uses more than one site, or make the due date flexible. 3. Practice the exercise each year, databases and interfaces change rapidly - what worked last semester may not work now. Assignment: Background Databases: How are they created, maintained, and used? Assignment: Bioinformatics Tools Vocabulary Sequence alignment Multiple sequence alignment and phylogenetic trees Databases: background for students 1. Convey the necessity of computers in storing and analyzing the large amount of biological data available. Link to Human Genome Project web sites 2. Choose exercises or background reading that stress the size and growth rate of biological databases. Explore NCBI 3. Discuss search engines and search design: specificity vs. sensitivity Examples in Entrez 4. Local research in genomics Baylor College of Medicine 1. Convey the necessity of computers in storing and analyzing such a large amount of data. Emphasize the factors that contribute to the dependence of biological studies on computers How many bases in a genome? Human ~ 3 billion Rat ~ 3 billion Chicken ~ 1 billion Fish ~ 400 million Bacteria - M. tuberculosis ~ 4 million 1. Convey the necessity of computers in storing and analyzing such a large amount of data. Human Genome Project 3

AP Biology Academy 2. Choose exercises or background reading that stress the size and growth rate of biological databases. NCBI Genbank = nucleotide database About the Human Genome Project Question 1: Question 2: Question 3: 80 billion nucleotide bases (Jan 2008) Doubles ~ every 18 months 260,000 species +1700 per month Part of international collection of 100 gigabases (Aug 2005) How many genes are found in the human genome? How many DNA base pairs make up the human genome? Name 2 project goals that will require the help of computers. 2. Choose exercises or background reading that stress the size and growth rate of biological databases. NCBI How many new sequences were added in 2007? What are the top 5 species with sequences in GenBank? What does a GenBank entry include? Who typically submits data to GenBank? Open Access Jan 2008 Database Issue What website would you use to retrieve GenBank information? 3. Discuss search engines and search design: specificity vs. sensitivity Entrez PubMed 3. Discuss search engines and search design: specificity vs. sensitivity Entrez PubMed 4

4. Is the Houston Medical Center involved in genomics? http://www.hgsc.bcm.tmc.edu/ Sample questions to ask about the BCM HGSC site How many genome projects are being sequenced for different organisms at the Human Genome Sequencing Center, Baylor College of Medicine? How many primate genome projects are listed? Why do you think so many primate genomes are being sequenced? Why is it important to humans to learn about bovine genomes? Why is it important to humans to learn about microbial genomes? URLs for bioinformatics students: The Human Genome Project: http://www.ornl.gov/sci/techresources/human_genome/project/about.shtml The Human Genome Sequencing Center at Baylor College of Medicine http://www.hgsc.bcm.tmc.edu/ Cells Alive: www.cellsalive.com The Biology Workbench: http://workbench.sdsc.edu/ National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/ Expasy (Swiss Institute of Bioinformatics) http://us.expasy.org/tools/ European Bioinformatics Institute http://www.ebi.ac.uk/tools/ 1. Introduce vocabulary necessary to describe features of sequence alignments. 2. Sequence alignment Sequence Alignments NCBI Blast 3. Multiple sequence alignments and phylogenetic trees Biology Workbench To compare, use protein sequences rather than DNA when possible Why? Less noise in protein sequences - what are the causes? I. Mathematical Probability: From a strictly mathematical point of view, assuming that there is an equal likelihood of any nucleotide appearing at any point in a sequence (which is generally NOT true biologically), what are the chances that a G in a nucleotide sequence will be randomly matched by a G in the same position in a different sequence? 1/4 From the same point of view, what are the chances that a G in a protein sequence will be randomly matched by a G in the same position in a different sequence? 1/20 5

Less noise in protein sequences - what are the causes? II. Degeneracy of the genetic code: 18 of the 20 amino acids are coded for by > one codon - therefore, a single mutation in the DNA code does not necessarily translate into a change in the amino acid code (particularly true of mutations in the 3rd codon) UUC to UUU mutation: UUC encodes PHE (F) UUU encodes PHE (F) a single change within a triplet codon is often not sufficient to cause a codon to code for an amino acid in a different category (nonpolar, polar, positively charged, negatively charged) AAG to AGG mutation: AAG encodes LYS (K) AGG encodes ARG (R) Less noise in protein sequences - what are the causes? III. Similarity signals contribute more information in protein sequences than in nucleotide sequences. Many categories of amino acids, some can be weighted more heavily than others (nonpolar, polar, positively charged, negatively charged, aromatic, structural similarity) Nucleotides - transitions purine to purine, pyrimidine to pyrimidine transversions purine to pyrimidine, pyrimidine to purine Studying VERY CLOSELY related sequences (identity, homology, paralogy): Nucleotide sequence might be preferred (can see subtle changes that might be invisible in protein sequences). Vocabulary for Sequence Alignments If the same letter occurs in two aligned sequences then this position has been conserved in evolution. If the letters differ it is assumed that the two derive from an ancestral letter (which could be one of the two or neither). Evolutionary processes in biology can introduce insertions or deletions in sequences. In a sequence alignment, a letter or a stretch of letters may be paired up with dashes in the other sequence, called gaps, to signify an insertion or deletion. If a biologist makes the statement that two sequences are related, he means that they are believed to have a common evolutionary origin. Identity indicates exact match in two (or more) sequences Similarity indicates chemical or structural similarity (based on categories of amino acids) between unidentical aligned residues in two (or more) sequences Homology the source of the similarity between unidentical aligned residues in two (or more) sequences is biological, such as evolutionarily related sequences in different species (same origin and function) or relationship between members of a chromosome pair in diploid organisms (homologous sequences are similar, but similar sequences are not always homologous) 12 GAPS (yellow) = an insertion in one sequence or a deletion in the other sequence? The residues in aligned positions of different sequences are implied to have a common evolutionary origin 29 identities (green) 20 similarities (cyan) 6

Sequence comparison Healthy vs. diseased Identify genes involved in diseases One organism vs. another How closely related are two organisms Unknown function vs. known Lots of genes are not understood Proteins involved in genetic diseases Lactase - Lactose Intolerance Insulin receptor - P53 protein - Cancer Diabetes Disease Sickling cell Sickle Cell Anemia Irreversibly sickled cells Normal red blood cells with no deformations Exercise in Sequence Alignment: Our example is HbB vs. HbS Type the following web site into your browser: http://www.ncbi.nlm.nih.gov/ Next to the Search box, select Protein, to search the NCBI database containing protein sequences. Red blood cells containing HbS β subunits http://www.sicklecelldisease.org/research/index.phtml The record for hemoglobin S should be returned. Hemoglobin is the protein in our blood cells that carries oxygen. Click on the link entitled 1HBSB. Next to the word Display in the grey region at the top of the file, change GenPept to FASTA. 7

This will display the amino acid sequence for hemoglobin S in FASTA format. Hold down the left mouse button while you move the mouse over the sequence. This should highlight the amino acid sequence in blue. Now choose Edit:Copy from the browser window, or hit the buttons Ctrl and C to copy. Now, click on the NCBI logo in the upper left corner of the web page to return to the main page. In the dark blue menu bar at the top of the page, click on the word BLAST. In the box of Basic BLAST options, click on the link entitled protein BLAST. Click in the Search box and choose Edit: paste from the browser menu or hit the Ctrl and P keys to paste in the sequence. 8

Change the nr database to swissprot, then click the BLAST! button. Wait for alignment to Format Time depends on server use. Under the graph indicating the length of the top alignments, there will be a list of aligning sequences in order of decreasing alignment scores. Click on the score of the first item in the list, which is the highest scoring alignment. This will take you to the section of the file where you can view the alignment. Identify the differences in the sequence of Query 1and Subject 2 A dissimilar substitution occurs at amino acid number 6. The sickle cell mutation in Hemoglobin: Sickle cell anemia is a blood condition seen most commonly in people of African ancestry and in the tribal peoples of India. The individual must have two copies of the mutant hemoglobin gene to exhibit the sickle-shaped cells indicative of the condition. The hemoglobin S beta subunit has the amino acid valine at position 6 instead of the glutamic acid that is normally present. This alteration is the basis of all the problems that occur in people with sickle cell disease. Glu6 to Val The β E6V mutation dramatically decreases the solubility of hemoglobin and causes long fibers to form that sickle the red cells. This was the first genetic disease that was identified at the amino acid level. Hb AS is benign (only shortens red cell life time to ~80 days) Hb SS leads to severe anemia and early death 9

Disease Disease AS + AS Sickle Cell anemia Anopheles gambiae, Infection with Plasmodium falciparum HbAS cells only last ~80 days and release the immature parasite which is destroyed by the immune system. Sickle Cell Anemia The sickle cell trait is maintained by strong selective pressure because AS individuals have resistance to malaria. It is a classic case of natural selection. AA + AS + SA + SS 50% are resistant and live malaria sickle cell disease Trait has evolved spontaneously and independently in several different locations in Africa because of strong selective pressure The Plasmodium parasite must remain in the red blood cells for more than 90 days to be able to change its cell coat and avoid antibody responses Exercise in Multiple Sequence Alignment The Biology Workbench http://workbench.sdsc.edu/ The Biology Workbench is one of my favorite teaching tools, because the student can do a complete Bioinformatics project with the Workbench, from retrieving the sequences to performing multiple alignments and creating phylogenetic tree diagrams. 1. Retrieve sequences from database 2. Manually input sequences 3. Align selection of sequences 4. Align multiple alignments of sequences 5. Create a phylogenetic tree 1. Retrieve sequences Be careful, there are many databases - too much information -too many results from a query confuses the student a. GenPept - Genbank gene products - full release b. SwissProt - manually curated European database Select Ndjinn Multiple Database Search and click Run. Ndjinn Multiple Database Search 10

Genpept search of fetal hemoglobin : 7 results in over 1 minute SwissProt search of fetal hemoglobin : 68 results in ~ 20 seconds Select two human Hb gamma protein sequences Click Import Sequence(s) Sequences now listed under Protein Tools 2. Manually input sequences Obtain sequence in FASTA format from NCBI as in previous exercise Select Add New Protein Sequence and click Run Add New Protein Sequence Type in a name for the sequence. Use paste to input sequence. Click Save. Sequence will be listed below Protein Tools. 3. Align selected sequences Check boxes next to sequences Scroll down Protein Tools list to CLUSTALW - Multiple Sequence Alignment Click Run then click Submit on the next page CLUSTALW 11

Color coding represents consensus. HbG1 and HbG2 have an A:G substitution at position 136 HbE has the A at 136 HbB has the G Alignment can be copied an pasted into text document. Set page to landscape for rows of alignment to print on one line. Analysis: Highest scoring pairwise alignments: 1. HbG1 and HbG2 2. HbS and HbB Scroll back to alignment and import the alignment. Lowest scoring pairwise alignments: 1. HbS and HbE 2. HbS and HbG2 3. HbB and HbG1 Alignment is listed including protein names under Alignment Tools. Complete 5 alignments of proteins in the globin-like superfamily. 1. Human Hemoglobins (globin family) HbB, HbS, HbE, HbG1, HbG2 2. Globin family Non-symbiotic hemoglobins from rice and corn, symbiotic hemoglobins from soybean and pea 3. Sensor family HemAT from B. subtillis and H. salinarum, methyl-accepting chemotaxis protein from A. tumefaciens 4. Truncated family 2-on-2 hemoglobin from Arabidopsis, protein from S. aureus, and M. tuberculosis Identifiers to retrieve sequences can be found in syllabus. 12

4. Aligning aligned sequences Check boxes next to alignments Scroll down Alignment Tools list to CLUSTALWPROF - Align Two Existing Alignments (Profiles) Click Run then click Submit on the next page CLUSTALWPROF Import the alignment. To create an alignment with all 4 groups of sequences, repeat CLUSTALWPROF two more times, using the new alignment plus one other. WARNING: This creates a LONG alignment, but also leads to an interesting phylogenetic tree. 5. Creating a phylogenetic tree Select an alignment Scroll down Alignment Tools list to DRAWGRAM - Draw Rooted Phylogenetic Tree from Alignment Click Run then click Submit on the next page DRAWGRAM Use this link to get a version with white background suitable for reports. Phylogenetic Tree of Proteins in Globin-Like Superfamily Globin family Truncated family Sensor family Additional Ideas Have students align all 15 sequences using CLUSTALW and compare to the alignment of alignments from CLUSTALWPROF Assign students an unknown sequence and have them identify to which family it belongs using alignments and paired alignment scores Using an alignment of all 15 sequences, hypothesize why sensor globin family members are longer proteins? Predict which portion of the B. subtillis HemAT is responsible for signaling. 13