2 Comparative Analysis of Proteins
Experimental evidence from one or more proteins can be used to infer the function of related protein(s).
[Diagram labels: Gene A, Gene X; Protein A, compare, Protein X; experiment?; FUNCTION]
3 Comparative Analysis of Proteins
The more experimental evidence you have, the more likely your inference is correct.
[Diagram labels: Genes A, B, C, D, Gene X; Proteins A, B, C, D, compare, Protein X; experiments?; FUNCTION]
4 What do we compare?
Primary structure (1°): the sequence of amino acids.
Sequence alignments.
Each amino acid has unique chemical properties. Look at "The 20 Amino Acids Found in Proteins".
5 What do we compare?
Secondary structure (2°): 3D structure of localized regions in a protein.
Alpha-helix and beta-sheet.
6 What do we compare?
Tertiary structure (3°): 3D structure of the entire protein.
Determined by the amino acid sequence.
7 What do we compare?
[Figure labels: N-terminal; central α3-helix]
Superimposition of PTP1B (magenta), RPTPα (gray), RPTPµ (red), LAR (blue), SHP1 (green) and SHP2 (yellow).
http://ptp.cshl.edu & Andersen et al., Mol. Cell. Biol. 2001
8 Some proteins are made of modular domains
[Figure: domain diagrams labeled Ig, FN III, FN III, FN III, MAM, P, MAM, Ig, FN III]
9 Model organisms
Species                        Domain      Kingdom     Phylum
E. coli                        Bacteria
Yeast                          Eukaryota   Fungi
Dictyostelium                  Eukaryota   Amoebozoa
Arabidopsis                    Eukaryota   Plantae
C. elegans                     Eukaryota   Animalia    Nematoda
D. melanogaster (fruit fly)    Eukaryota   Animalia    Arthropoda
D. rerio (zebrafish)           Eukaryota   Animalia    Chordata
M. musculus (mouse)            Eukaryota   Animalia    Chordata
H. sapiens                     Eukaryota   Animalia    Chordata
10 Evolution of a Sequence (review)
Red aa's: essential to protein function
atg gcg gtg cgc att gaa acc ggc tat gaa ctg atg
M   A   V   R   I   E   T   G   Y   E   L   M
atg gcg gtg cgc att gaa Gcc ggc tat gaa ctg atg
M   A   V   R   I   E   A   G   Y   E   L   M
atg gcg gtg cgc att gaa Gcc ggc ttt gaa ctg agg
M   A   V   R   I   E   A   G   F   E   L   R
atg gcg gtg ctc att gaa Gcc ggc ttt gaa ctg agg
M   A   V   L   I   E   A   G   F   E   L   R
atg gcg gtg cgc att gca Gcc ggc ttt gaa ctg agg
M   A   V   R   I   A   A   G   F   E   L   R
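The translations on this slide can be checked with a short script. The sketch below is an illustration added to the slide text, not part of the original lecture. It assumes the standard genetic code, and the names `CODON_TABLE` and `translate` are made up for the example.

```python
from itertools import product

# Standard genetic code; codons are ordered T, C, A, G at each position,
# with the third base changing fastest (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string (spaces and case are ignored) codon by codon."""
    letters = dna.replace(" ", "").upper()
    codons = [letters[i:i + 3] for i in range(0, len(letters) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


ancestor = "atg gcg gtg cgc att gaa acc ggc tat gaa ctg atg"
mutant = "atg gcg gtg cgc att gaa Gcc ggc tat gaa ctg atg"
print(translate(ancestor))  # MAVRIETGYELM
print(translate(mutant))    # MAVRIEAGYELM
```

A single base change (acc to Gcc) is enough to replace T with A in the protein, which is the kind of step the slide traces.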
11 Evolution of a Sequence
MAVLIESGFELR
MAVRIEAGYELM
12 Homology
Homology: features in different individuals that are descended from the same feature in a common ancestor. (From Baxevanis, A. D. and Ouellette, B. F. F., Bioinformatics, 3rd Ed., 2005.)
For example: the human arm and the bird wing are homologues.
Analogous features: similar features in different individuals that arose independently through natural selection rather than by descent from a common ancestor.
For example: the body shape of dolphins and tuna.
13 Homology
Homologous sequences: sequences that are similar because they are descended from a common ancestor.
Homologous structures: structures that are similar because they are descended from a common ancestor. They may not have a high degree of sequence similarity.
14 Homologs
Orthologs: the result of speciation, the split of one species into two. Similar function in different species.
Paralogs: produced by gene duplication within an organism. One species can have multiple paralogs. Related, but not identical, function, e.g. different proteases: the many proteases in one species all break peptide bonds, but each has different target amino acids or proteins.
15 Origins of Homologs
[Diagram: gene A; gene duplication in one species gives A and B; speciation (creation of new species) gives A1, A2 and B1, B2]
A, A1, and A2 are orthologs
A and B are paralogs
A1 and B1 are paralogs
A and B2 are paralogs
16 Sequence alignments: Overview
What positions in two sequences are equivalent?
Pairwise alignments: two sequences aligned
Multiple sequence alignment (MSA)
Local and global alignments
17 What information can we get from an alignment?
Are genes or proteins similar? If so, infer similar function.
Look for the presence of functional residues, e.g. active sites in enzymes. Is your protein functional?
18 Local and Global Alignments
Local: finds the most similar regions of two sequences.
Global: compares two sequences along their total length.
Local alignments help align modular proteins or genes.
Choose the method that gives the best alignment score.
19 Local and Global Alignments
[Figure: sequences A and B; local alignment, high scores; global alignment, low score]
20 How is Alignment Made?
VRETERI
VRATERI
VRETERI
VRATERI
VRETERI
VEITGEIST ?
21 How is Alignment Made?
VRETERI
VEITGEIST
1. Two sequences are aligned.
2. A score is given to the alignment.
3. All possible alignments are made and scores assigned.
4. The alignment with the best possible score is selected.
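In practice, programs do not list every possible alignment; dynamic programming finds the best score directly. The sketch below is an added illustration and not necessarily the method used in this course. It follows the Needleman-Wunsch idea for global alignment and the Smith-Waterman idea for local alignment. The function name `best_alignment_score` and the default scores (match 1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def best_alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the best alignment score of a and b by dynamic programming.

    local=False: global alignment, end gaps are penalized.
    local=True:  local alignment, scores never drop below zero and the
                 best cell anywhere in the table is reported.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Aligning a prefix against nothing costs one gap per residue.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# A shared "ABC" core inside unrelated flanks: local scoring rewards the
# core, global scoring also pays for the mismatched flanks.
print(best_alignment_score("XXABCXX", "YYABCYY", local=True))   # 3
print(best_alignment_score("XXABCXX", "YYABCYY", local=False))  # -1
```

The two printed values show why local alignment suits modular proteins: the similar region scores well even when the rest of the sequences do not match.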
22 Scoring an Alignment
Simple scoring system:
Match = 1
Mismatch = 0 or -1
Gap = -1
Align and score these two sequences:
VRETERI
VEITGEIST
23 Scoring an Alignment
Simple scoring system: Match = 1, Mismatch = 0 or -1, Gap = -1
VRETERI
VEITGEIST   = 3 (with Mismatch = 0)
Another alignment has more identities, but many more gaps, so the score is low:
VRE-T-ERI
V-EITGE-IST = 1
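The two scores on this slide can be reproduced column by column. The sketch below is an added illustration; `score_alignment` is a made-up name. It assumes Mismatch = 0 and ignores the unpaired residues at the end of the longer sequence, since those are the assumptions under which the slide's scores of 3 and 1 come out.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score an alignment written as two rows, with '-' marking a gap.

    Columns are paired left to right; zip() stops at the shorter row, so
    trailing unpaired residues are not penalized (an assumption made to
    match the slide).
    """
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("VRETERI", "VEITGEIST"))      # 3
print(score_alignment("VRE-T-ERI", "V-EITGE-IST"))  # 1
```

The gapped alignment has five identities instead of three, but its four gaps cost more than the extra matches gain.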
24 Advanced Scoring Systems
Scoring matrices: PAM and BLOSUM
Some substitutions in a sequence can conserve the function of the sequence position.
Assign a different score to each substitution, not just 0 or 1.
Assign a score for opening a gap and for extending a gap.
Example (BLOSUM62):
E aligned with E = 5
E aligned with D = 2
E aligned with I = -3
25 Advanced Scoring Systems
Scoring matrices: PAM and BLOSUM
Factors taken into account with scoring matrices:
Conservation of residue function
What residues can substitute for another
Charge, size, hydrophobicity (see the amino acid properties chart)
Frequency of residue in all proteins
Evolutionary patterns
26 Choosing a scoring matrix
Matrix     Best use                                              Similarity (%)
BLOSUM30   Long alignments of divergent sequences                <30
BLOSUM62   Most effective at finding potential similarities      30-40
BLOSUM80   Detecting members of a conserved protein family       50-60
BLOSUM90   Short alignments of highly similar sequences          70-90
PAM250     Long alignments of divergent sequences                <30
PAM160     Detecting members of a conserved protein family       50-60
PAM40      Short alignments of highly similar sequences          70-90
Taken from Bioinformatics, A Practical Guide... 3rd Ed., Baxevanis, A. D. and Ouellette, B. F. F., 2005, pg. 303
27 FASTA format (file ending .fa or .fasta)
Why use FASTA? Simple, portable, widely used, easy to edit.
The first line starts with > and holds the sequence information.
The sequence starts on the following line (a carriage return is needed after the first line).
Be careful of the format; make sure you have it all.
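To make the layout concrete, here is a minimal reader for FASTA text. It is an added sketch, not a replacement for an established parser. The function name `parse_fasta` and the header text `example_protein hypothetical description` are invented for the example; the sequence is the query used in the BLAST slides.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text.

    A line starting with '>' opens a new record; every following
    non-empty line is appended to that record's sequence.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


example = """>example_protein hypothetical description
KRPFIETAERLRDQ
HKKDYPEYKYQPRRR
"""
records = parse_fasta(example)
print(records["example_protein hypothetical description"])
# KRPFIETAERLRDQHKKDYPEYKYQPRRR
```

If the line break after the header is missing, the header and sequence run together and the record cannot be read, which is why the slide warns about the carriage return.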
28 BLAST - Proteins
Use BLASTP to find homologs:
1. Determine function of a novel protein.
2. Find members of a gene or protein family.
3. Find a candidate protein structure.
29 How BLAST works
Given the sequence: KRPFIETAERLRDQHKKDYPEYKYQPRRR
BLAST searches the database for the three-letter query words starting at each letter of the sequence. The word size can be changed in the parameters section.
Examples of 3-letter words:
KRPFIETAERLRDQHKKDYPEYKYQPRRR
KRP   IET ... → search database
RPF   ETA ... → search database
PFI   TAE ... → search database
FIE   AER ... → search database
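Splitting the query into overlapping words is a one-line operation. The sketch below is an added illustration; `query_words` is a made-up helper name and is not part of BLAST itself.

```python
def query_words(sequence, size=3):
    """Return every overlapping word of the given size, left to right."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
words = query_words(query)
print(words[:8])   # ['KRP', 'RPF', 'PFI', 'FIE', 'IET', 'ETA', 'TAE', 'AER']
print(len(words))  # 27
```

A query of length L yields L - size + 1 words, so this 29-residue query gives 27 three-letter words.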
30 How BLAST works
The program searches for the exact match as well as three-letter words containing conservative substitutions. Scores for each three-letter word are determined by a scoring matrix (e.g. BLOSUM62).
For example, when the query word is RDQ:
KRPFIETAERLRDQHKKDYPEYKYQPRRR
The score for this match is 16 (R/R = 5, D/D = 6, Q/Q = 5; look at your BLOSUM62 matrix).
Note: word size can be set in BLAST.
31 How BLAST works
Some possible three-letter words that match RDQ, with their scores:
RDQ 16   QDQ 12   NDQ 11   RDN 11   EDQ 11   SDQ 10
KDQ 13   REQ 12   HDQ 11   RDD 11   MDQ 10   RDP 10
RDE 13   RDR 12   RNQ 11   RDH 11   ADQ 10   RDT 10
Similar tables are generated for each three-letter word in the query. Each of these possible words is used to extend the alignment or neighborhood around the first alignment.
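The word scores in this table can be recomputed from BLOSUM62. The sketch below is an added illustration: only the three matrix rows needed for the query word RDQ are included, transcribed by hand from the standard BLOSUM62 matrix (check them against your own copy), and the names `word_score` and `neighborhood` are made up for the example.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for R, D and Q only; columns follow AMINO_ACIDS order.
# Hand-transcribed for this example, so verify before reuse.
BLOSUM62_ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
}


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighborhood(query_word, threshold):
    """Return {word: score} for every word scoring at least the threshold."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


print(word_score("RDQ", "RDQ"))  # 16
print(word_score("RDQ", "REQ"))  # 12
print(word_score("RDQ", "SDQ"))  # 10
```

With a cutoff of 11, `neighborhood("RDQ", 11)` keeps REQ (score 12) and drops SDQ (score 10).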
32 How BLAST works
A three-letter word that aligns with RDQ is found in a sequence in the database.
query     KRPFIETAERLRDQHKKDYPEYKYQPRRR
                     R+Q
database  KRPFVEGAERLREQHKKDHPEYKYQPRRR
Extension of the neighborhood:
query     KRPFIETAERLRDQHKKDYPEYKYQPRRR
                     R+Q
                       QHK
database  KRPFVEGAERLREQHKKDHPEYKYQPRRR
Neighborhood cutoff set to 11 (not adjustable in web version?)
33 How BLAST works
The alignment is then extended from this high-scoring segment pair (RDQ/REQ) in both directions and a score is calculated. The length of the extension is determined by the score of the resulting alignment.
query    KRPFIETAERLRDQHKKDYPEYKYQPRRR
         KRPF+E AERLR+QHKKD+PEYKYQPRRR
subject  KRPFVEGAERLREQHKKDHPEYKYQPRRR
The alignment is called a high-scoring segment pair (HSP). These are possible alignments. More than one HSP can be generated for each query-subject pair.
34 How BLAST works
The alignment is then extended from this high-scoring segment pair (RDQ/REQ) in both directions and a score is calculated. The length of the extension is determined by the score of the resulting alignment. Extension stops when the score can't be improved by including more sequence. The resulting alignment is called a high-scoring segment pair (HSP).
query    KRPFIETAERLRDQHKKDYPEYKYQPRRR
         KRPF+E AERLR+QHKKD+PEYKYQPRRR
subject  KRPFVEGAERLREQHKKDHPEYKYQPRRR
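The extension step can be sketched as follows. This is a simplified illustration added to the slide text, not BLAST's actual implementation. It assumes +1 for identical residues and -1 otherwise (instead of BLOSUM62), no gaps, and a drop-off rule: stop extending in a direction once the running score falls more than `x_drop` below the best score seen. The name `extend_hit` and all parameter values are assumptions.

```python
def extend_hit(query, subject, q_start, s_start, word_size=3,
               x_drop=2, match=1, mismatch=-1):
    """Extend a word hit without gaps, first to the right, then to the left.

    Returns (start, end, score) where query[start:end] is the extended
    segment and score is its total under the simple match/mismatch scheme.
    """
    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    best = sum(pair(q_start + k, s_start + k) for k in range(word_size))

    # Extend to the right of the word.
    running, right, k = best, 0, 0
    while (q_start + word_size + k < len(query)
           and s_start + word_size + k < len(subject)):
        running += pair(q_start + word_size + k, s_start + word_size + k)
        k += 1
        if running > best:
            best, right = running, k   # keep the extension up to here
        elif best - running > x_drop:
            break                      # score dropped too far, give up

    # Extend to the left of the word, starting from the best score so far.
    running, left, k = best, 0, 0
    while q_start - 1 - k >= 0 and s_start - 1 - k >= 0:
        running += pair(q_start - 1 - k, s_start - 1 - k)
        k += 1
        if running > best:
            best, left = running, k
        elif best - running > x_drop:
            break

    return q_start - left, q_start + word_size + right, best


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
subject = "KRPFVEGAERLREQHKKDHPEYKYQPRRR"
hsp = extend_hit(query, subject, query.index("RDQ"), subject.index("REQ"))
print(hsp)  # (0, 29, 21)
```

Here the extension runs across the whole sequence, because isolated mismatches never pull the running score more than `x_drop` below the best: 25 identities and 4 mismatches give 21.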
35 How BLAST works
The score of the HSP is used to calculate the E-value.
The HSP is reported as a hit if the E-value is below the specified cutoff.
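As a rough guide to how the E-value depends on its inputs, the usual relation for a normalized (bit) score S' is E = m × n × 2^(-S'), where m is the query length and n is the database length. The sketch below is an added illustration with a made-up function name; real BLAST uses length-corrected "effective" values of m and n, so its reported E-values will differ somewhat.

```python
def e_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S') with raw lengths; the effective-length
    correction applied by BLAST is deliberately left out.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# The same bit score is less significant in a larger database.
print(e_value(10, 100, 1024))       # 100.0
print(e_value(10, 100, 1024 * 10))  # 1000.0
# Each extra bit of score halves the E-value.
print(e_value(11, 100, 1024))       # 50.0
```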
36 BLAST results
What are you looking for? What did you get?
Homologs: orthologs or paralogs
Conserved domains in a new protein architecture: paralog?
37 BLAST results
If you have a lot of sequence in your database, you might expect to find any sequence pattern if you look hard enough. The likelihood of this happening is greater if the database is large and your query is small.
Think about searching the nr database with a five or ten amino acid segment. ELVIS in the genome!
38 BLAST results
1. E-value
Number of instances where the match would be expected to occur by chance.
Calculated from the length of the database, the length of the query and the score of the HSP.
No hard and fast rules for E-value:
Lower E-values indicate more significant hits.
An E-value close to 0 indicates a highly significant match, such as an identical match.
E-value should be below
39 BLAST results
2. Bit score: measure of the significance of the alignment.
3. Percent identity
4. Length of alignment: the longer the better.
5. Beyond BLAST results: is the hit a homolog?
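Percent identity and alignment length can be read straight off an aligned pair. The sketch below is an added illustration using the query and subject from the "How BLAST works" slides; `percent_identity` is a made-up name. It assumes identity is counted over all alignment columns, and other tools may use a different denominator.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (identical columns, alignment length, percent identity).

    Both rows must have the same length; '-' marks a gap and never
    counts as an identity.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    columns = len(aligned_a)
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return identical, columns, 100.0 * identical / columns


query = "KRPFIETAERLRDQHKKDYPEYKYQPRRR"
subject = "KRPFVEGAERLREQHKKDHPEYKYQPRRR"
identical, length, percent = percent_identity(query, subject)
print(identical, length, round(percent, 1))  # 25 29 86.2
```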