Bioinformatics Exercises

Similar documents
Background. Bioinformatics Exercises for the Study of Evolution with Heme Proteins as a Model System

Using Bioinformatics to Study Evolutionary Relationships Instructions

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Hands-On Nine The PAX6 Gene and Protein

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

BIOINFORMATICS LAB AP BIOLOGY

Comparing whole genomes

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution

BIOINFORMATICS: An Introduction

Supplemental Materials

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Introduction to Bioinformatics

Piecing It Together. 1) The envelope contains puzzle pieces for 5 vertebrate embryos in 3 different stages of

Journal of Proteomics & Bioinformatics - Open Access

Introduction to Bioinformatics Online Course: IBT

Open a Word document to record answers to any italicized questions. You will the final document to me at

CS612 - Algorithms in Bioinformatics

Sequencing alignment Ameer Effat M. Elfarash

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

The Complete Set Of Genetic Instructions In An Organism's Chromosomes Is Called The

Bioinformatics. Dept. of Computational Biology & Bioinformatics

GENERAL BIOLOGY LABORATORY EXERCISE Amino Acid Sequence Analysis of Cytochrome C in Bacteria and Eukarya Using Bioinformatics

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Quantifying sequence similarity

Investigating Evolutionary Questions Using Online Molecular Databases *

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequencing alignment Ameer Effat M. Elfarash

COMPARING DNA SEQUENCES TO UNDERSTAND EVOLUTIONARY RELATIONSHIPS WITH BLAST

Comparative genomics: Overview & Tools + MUMmer algorithm

Genomes and Their Evolution

Cladistics and Bioinformatics Questions 2013

Tools and Algorithms in Bioinformatics

BLAST. Varieties of BLAST

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Introduction to Bioinformatics

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

Chapter 7: Covalent Structure of Proteins. Voet & Voet: Pages ,

Bioinformatics. Part 8. Sequence Analysis An introduction. Mahdi Vasighi

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

EBI web resources II: Ensembl and InterPro

Introduction to protein alignments

Algorithms in Bioinformatics

Gene Families part 2. Review: Gene Families /727 Lecture 8. Protein family. (Multi)gene family

Genomics and bioinformatics summary. Finding genes -- computer searches

7. Tests for selection

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Sequences, Structures, and Gene Regulatory Networks

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

Week 10: Homology Modelling (II) - HHpred

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

BIOLOGY STANDARDS BASED RUBRIC

Protein Architecture V: Evolution, Function & Classification. Lecture 9: Amino acid use units. Caveat: collagen is a. Margaret A. Daugherty.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence analysis and Genomics

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Sequence Alignment (chapter 6)

Draft document version 0.6; ClustalX version 2.1(PC), (Mac); NJplot version 2.3; 3/26/2012

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Sequence analysis and comparison

Practical considerations of working with sequencing data

Homology and Information Gathering and Domain Annotation for Proteins

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Computational Biology

Chapter 15: Darwin and Evolution

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Orthology Part I: concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Bioinformatics and BLAST

B I O I N F O R M A T I C S

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Performance Indicators: Students who demonstrate this understanding can:

An Introduction to Sequence Similarity ( Homology ) Searching

Homology. and. Information Gathering and Domain Annotation for Proteins

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Browsing Genomic Information with Ensembl Plants

Local Alignment Statistics

Exploring Evolution & Bioinformatics

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

AP Biology Summer Assignment

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Emily Blanton Phylogeny Lab Report May 2009

Dr. Amira A. AL-Hosary

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Synteny Portal Documentation

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Understanding relationship between homologous sequences

Transcription:

Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1

Rooted vs. Unrooted Trees Rooted Unrooted Molecular Evolution FlavoHbs Mb Plant 2-on-2 Hb Hb Plant Hb Bacterial 2-on-2 Hb 2

Sequence comparison Healthy vs. diseased Identify genes involved in diseases One organism vs. another How closely related are two organisms Unknown function vs. known Lots of genes are not understood Sequence comparison One protein within a family vs. another Identify mechanisms of disease, identify favorable characteristics (stability, specificity of substrate, affinity for substrate, etc.) 3

Vocabulary If the same letter occurs in two aligned sequences then this position has been conserved in evolution. If the letters differ it is assumed that the two derive from an ancestral letter (which could be one of the two or neither). Evolutionary processes in biology can introduce insertions or deletions in sequences. In a sequence alignment, a letter or a stretch of letters may be paired up with dashes in the other sequence, called gaps, to signify an insertion or deletion. If a biologist makes the statement that two sequences are related, he means that they are believed to have a common evolutionary origin. 12 GAPS (yellow) The residues in aligned positions of different sequences are implied to have a common evolutionary origin (an insertion in one sequence or a deletion in the other sequence?) 29 identities (green) 20 similarities (cyan) 4

Identity indicates exact match in two (or more) sequences Similarity indicates chemical or structural similarity between unidentical aligned residues in two (or more) sequences Homology the source of the similarity between unidentical aligned residues in two (or more) sequences is biological, such as evolutionarily related sequences in different species (same origin and function) or relationship between members of a chromosome pair in diploid organisms (homologous sequences are similar, but similar sequences are not always homologous) Specificity The ability to reject false relationships, measured by the ratio of the number of true negatives to the sum of false positives and true negatives. true negatives (true negatives + false positives) Sensitivity The ability detect all true relationships, measured by the ratio of the number of true positives to the sum of true positives and false negatives. true positives (true positives + false negatives) 5

Studying distantly related sequences: 1. Use protein sequence. Studying closely related sequences (identity, homology, paralogy): 1. Nucleotide sequence might be preferred (can see subtle changes that might be invisible in protein sequences) use protein sequences rather than DNA when possible (why?) Higher signal to noise ratio in protein sequences - what are the causes? I. Mathematical Probability: From a strictly mathematical point of view, assuming that there is an equal likelihood of any nucleotide appearing at any point in a sequence (which is generally NOT true biologically), what are the chances that a G in a nucleotide sequence will be randomly matched by a G in the same position in a different sequence? 1/4 From the same point of view, what are the chances that a G in a protein sequence will be randomly matched by a G in the same position in a different sequence? 1/20 6

Higher signal to noise ratio in protein sequences - what are the causes? II. Degeneracy of the genetic code: a. 18 of the 20 amino acids are coded for by > one codon - therefore, a single mutation in the DNA code does not necessarily translate into a change in the amino acid code (particularly true of mutations in the 3rd codon) UUC to UUU mutation: UUC encodes PHE (F) UUU encodes PHE (F) b. a single change within a triplet codon is often not sufficient to cause a codon to code for an amino acid in a different category (nonpolar, polar, positively charged, negatively charged) AAG to AGG mutation: AAG encodes LYS (K) AGG encodes ARG (R) Higher signal to noise ratio in protein sequences - what are the causes? Similarity signals contribute more information in protein sequences than in nucleotide sequences a. Many categories, some can be weighted more heavily than others (nonpolar, polar, positively charged, negatively charged, aromatic, structural similarity) b. Nucleotides - transitions purine to purine, pyrimidine to pyrimidine transversions purine to pyrimidine, pyrimidine to purine 7

URLs or google searches for bioinformatics students: The Human Genome Project: http://www.ornl.gov/sci/techresources/human_genome/project/about.shtml The Human Genome Sequencing Center at Baylor College of Medicine http://www.hgsc.bcm.tmc.edu/ Cells Alive: www.cellsalive.com The Biology Workbench: http://workbench.sdsc.edu/ National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/ Expasy (Swiss Institute of Bioinformatics) http://us.expasy.org/tools/ European Bioinformatics Institute http://www.ebi.ac.uk/tools/ Emphasize the factors that contribute to the dependence of biological studies on computers How many bases in a genome? Human Rat Chicken Fish Tuberculosis (bacteria) ~ 3 billion ~ 3 billion ~ 1 billion ~ 400 million ~ 4 million 8

About the Human Genome Project Question 1: How many genes are found in the human genome? Question 2: How many DNA base pairs make up the human genome? Question 3: Name 2 project goals that will require the help of computers. Question 2: Question 1: How many genes are found in the human genome? ~ 20,000-25,000 How many DNA base pairs make up the human genome? ~ 3 billion Question 3: Name 2 project goals that will require the help of computers. 1. store this information in databases 2. tools for data analysis 9

Is the Houston Medical Center involved in genomics? http://www.hgsc.bcm.tmc.edu/ 10

How many genome projects are being sequenced for different organisms at the Human Genome Sequencing Center, Baylor College of Medicine? How many primate genome projects are listed? Why do you think so many primate genomes are being sequenced? Why is it important to humans to learn about bovine genomes? Why is it important to humans to learn about microbial genomes? www.cellsalive.com 11

The student can look up the answers to related cell biology questions at cellsalive.com, example: Where are genes located? NUCLEUS (with DNA inside) Animal Cell Molecular chains of Deoxyribonucleic Acid (DNA) inside each cell encode the organism s genes. They are the hereditary information that determines what characteristics each cell, and, in a bigger sense, each organism will have. Sequence comparison Healthy vs. diseased Identify genes involved in diseases One organism vs. another How closely related are two organisms Unknown function vs. known Lots of genes are not understood 12

Proteins involved in genetic diseases Lactase - digests milk sugar. Insulin receptor - mediates the proper response to glucose. P53 protein - tumor suppressor. Exercise in Sequence Alignment: Our example is HbB vs. HbS Type the following web site into your browser: http://www.ncbi.nlm.nih.gov/ Next to the Search box, select Protein, to search the NCBI database containing protein sequences. 13

The record for hemoglobin S should be returned. Hemoglobin is the protein in our blood cells that carries oxygen. Click on the link entitled 1HBSB. Next to the word Display in the grey region at the top of the file, change GenPept to FASTA. 14

This will display the amino acid sequence for hemoglobin S in FASTA format. Hold down the left mouse button while you move the mouse over the sequence. This should highlight the amino acid sequence in blue. Now choose Edit:Copy from the browser window, or hit the buttons Ctrl and C to copy. 15

Now, click on the NCBI logo in the upper left corner of the web page to return to the main page. In the dark blue menu bar at the top of the page, click on the word BLAST. 16

In the box of Protein options, click on the link entitled Protein-protein BLAST (blastp). Click in the Search box and choose Edit: paste from the browser menu or hit the Ctrl and P keys to paste the sequence into the search box. 17

Change the nr database to swissprot, then click the BLAST! button. Click the Format! button. A new window will open containing our sequence alignments. 18

Under the graph indicating the length of the top alignments, there will be a list of aligning sequences in order of decreasing alignment scores. Click on the score of the first item in the list, which is the highest scoring alignment. This will take you to the section of the file where you can view the alignment. Identify the differences in the sequence of Query 1and Subject 2 A dissimilar substitution occurs at amino acid number 6. 19

The sickle cell mutation in Hemoglobin. Sickle cell anemia is a blood condition seen most commonly in people of African ancestry and in the tribal peoples of India. The individual must have two copies of the mutant hemoglobin gene to exhibit the sickle-shaped cells indicative of the condition. The hemoglobin S beta subunit has the amino acid valine at position 6 instead of the glutamic acid that is normally present. This alteration is the basis of all the problems that occur in people with sickle cell disease. Exercises in Multiple Sequence Alignment. ClustalW is a multiple sequence alignment routine available online at the EBI website: http://www.ebi.ac.uk/clustalw/ 20

Exercise in Multiple Sequence Alignment: Our example is non-alpha versus alpha Hb The Biology Workbench http://workbench.sdsc.edu/ The Biology Workbench is one of my favorite teaching tools, because the student can do a complete Bioinformatics project with the Workbench, from retrieving the sequences to performing multiple alignments and creating phylogenetic tree diagrams. 1. Retrieve sequences a. Be careful, there are many databases - too much information -too many results from a query confuses the student b. GenPept - Genbank gene products - full release c. SwissProt - manually curated European database 21

Genpept search of fetal hemoglobin : 7 results in over 1 minute 22

AP Biology Workshop June 2006 SwissProt search of fetal hemoglobin : 59 results in ~ 20 seconds JSO s slide on globin expression during human development (low O2 affinity) α2ε2 Hb embryonic (High O2 affinity) α2 γ2 HbF (moderate O2 affinity) ε γ subunit subunit (placenta) α2 β2 HbA β subunit (lungs) 23

Hb non-alpha subunit alignment Analysis: HbG1 and HbG2 have an A:G substitution at position 136 HbE has the A at 136, HbB has the G Why does the HbS sequence have an N-terminal methionine? Hb alpha and non-alpha subunit alignment 24

Analysis: Highest scoring pairwise alignments: 1. HbG1 and HbG2 2. HbS and HbB 3. HbE with HbG1, HbG2 Lowest scoring pairwise alignments: 1. HbA and HbE 2. HbA and HbG1, HbG2 3. HbA and HbB, HbS 25